run_benchmark

Measure vLLM server performance by running benchmarks with configurable request rates, duration, and datasets to evaluate model efficiency.

Instructions

Run a performance benchmark against the vLLM server using GuideLLM

Input Schema

| Name | Required | Description | Default |
| --- | --- | --- | --- |
| target | No | Target URL (default: from settings) | |
| model | No | Model to benchmark | |
| rate | No | Request rate (requests/sec) or 'sweep' | sweep |
| max_requests | No | Maximum number of requests | |
| max_seconds | No | Maximum duration in seconds | 120 |
| data | No | Dataset ('emulated' or path) | emulated |
| output_path | No | Path to save results | |
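
For orientation, a hypothetical invocation of the handler with these parameters might look as follows. The argument names come from the schema above; the URL and model name are placeholders, and the await usage follows the handler signature shown under Implementation Reference:

    arguments = {
        "target": "http://localhost:8000/v1",   # placeholder; defaults to the configured base URL + /v1
        "model": "my-model",                    # placeholder model name
        "rate": "10",                           # requests/sec as a string, or "sweep"
        "max_seconds": 60,
        "data": "emulated",
    }
    results = await run_benchmark(arguments)    # returns a list[TextContent] with the report
    print(results[0].text)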

Implementation Reference

  • Main handler function that executes the benchmark. It checks for GuideLLM availability, builds the command with arguments (target, model, rate, max_requests, max_seconds, data, output_path), runs the benchmark as a subprocess, captures stdout/stderr, and formats the results as TextContent.
    # Imports used by the handler (get_settings comes from the server's own
    # configuration module; its exact path is not shown in this excerpt):
    import asyncio
    import shutil
    from typing import Any

    from mcp.types import TextContent


    async def run_benchmark(arguments: dict[str, Any]) -> list[TextContent]:
        """
        Run a benchmark against the vLLM server using GuideLLM.
    
        Args:
            arguments: Dictionary containing:
                - target: Target URL (default: from settings)
                - model: Model to benchmark (optional)
                - rate: Request rate (requests/sec) or "sweep" for rate sweep
                - max_requests: Maximum number of requests
                - max_seconds: Maximum duration in seconds
                - data: Dataset to use ("emulated" or path to dataset)
                - output_path: Path to save results (optional)
    
        Returns:
            List of TextContent with benchmark results.
        """
        # Check if guidellm is available
        if not shutil.which("guidellm"):
            return [
                TextContent(
                    type="text",
                    text="Error: GuideLLM is not installed. Install it with:\n"
                         "```bash\n"
                         "pip install guidellm\n"
                         "```",
                )
            ]
    
        settings = get_settings()
    
        target = arguments.get("target", f"{settings.base_url}/v1")
        model = arguments.get("model", settings.model)
        rate = arguments.get("rate", "sweep")
        max_requests = arguments.get("max_requests")
        max_seconds = arguments.get("max_seconds", 120)
        data = arguments.get("data", "emulated")
        output_path = arguments.get("output_path")
    
        # Build command
        cmd = [
            "guidellm",
            "--target", target,
            "--rate", str(rate),
            "--max-seconds", str(max_seconds),
            "--data", data,
        ]
    
        if model:
            cmd.extend(["--model", model])
    
        if max_requests:
            cmd.extend(["--max-requests", str(max_requests)])
    
        if output_path:
            cmd.extend(["--output-path", output_path])
    
        # Run benchmark
        try:
            process = await asyncio.create_subprocess_exec(
                *cmd,
                stdout=asyncio.subprocess.PIPE,
                stderr=asyncio.subprocess.PIPE,
            )
    
        # Streaming output in real time would be ideal (see the sketch after this listing);
        # for now we wait for the process to complete.
            stdout, stderr = await process.communicate()
    
            stdout_str = stdout.decode("utf-8")
            stderr_str = stderr.decode("utf-8")
    
            if process.returncode != 0:
                return [
                    TextContent(
                        type="text",
                        text=f"Benchmark failed with exit code {process.returncode}:\n\n"
                             f"**stderr:**\n```\n{stderr_str}\n```\n\n"
                             f"**stdout:**\n```\n{stdout_str}\n```",
                    )
                ]
    
            # Format results
            result = "## Benchmark Results\n\n"
            result += f"**Command:** `{' '.join(cmd)}`\n\n"
            result += "**Output:**\n```\n"
            result += stdout_str
            result += "\n```"
    
            if output_path:
                result += f"\n\nResults saved to: `{output_path}`"
    
            return [TextContent(type="text", text=result)]
    
        except Exception as e:
            return [
                TextContent(
                    type="text",
                    text=f"Error running benchmark: {str(e)}",
                )
            ]
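  • Streaming alternative (not part of the implementation): the comment in the handler notes that streaming output in real time would be preferable to waiting on process.communicate(). A minimal sketch under that assumption, reading stdout line by line with the same asyncio subprocess API; the helper name and the decision to merge stderr into stdout are choices made here for illustration, not taken from the server's source:
    async def _run_streaming(cmd: list[str]) -> tuple[int, str]:
        """Hypothetical helper: run the benchmark and echo output as it arrives."""
        process = await asyncio.create_subprocess_exec(
            *cmd,
            stdout=asyncio.subprocess.PIPE,
            stderr=asyncio.subprocess.STDOUT,  # merge streams so one reader sees everything
        )
        captured: list[str] = []
        assert process.stdout is not None
        while True:
            raw = await process.stdout.readline()
            if not raw:  # EOF: the benchmark process has closed its output
                break
            line = raw.decode("utf-8", errors="replace")
            captured.append(line)
            print(line, end="")  # or forward to a logging/progress channel
        returncode = await process.wait()
        return returncode, "".join(captured)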
  • Tool definition with the input schema for run_benchmark. It declares the target URL, model, rate, max_requests, max_seconds, data, and output_path parameters with their types, descriptions, and defaults.
    Tool(
        name="run_benchmark",
        description="Run a performance benchmark against the vLLM server using GuideLLM",
        inputSchema={
            "type": "object",
            "properties": {
                "target": {
                    "type": "string",
                    "description": "Target URL (default: from settings)",
                },
                "model": {
                    "type": "string",
                    "description": "Model to benchmark",
                },
                "rate": {
                    "type": "string",
                    "description": "Request rate (requests/sec) or 'sweep'",
                    "default": "sweep",
                },
                "max_requests": {
                    "type": "integer",
                    "description": "Maximum number of requests",
                },
                "max_seconds": {
                    "type": "integer",
                    "description": "Maximum duration in seconds",
                    "default": 120,
                },
                "data": {
                    "type": "string",
                    "description": "Dataset ('emulated' or path)",
                    "default": "emulated",
                },
                "output_path": {
                    "type": "string",
                    "description": "Path to save results",
                },
            },
        },
    ),
  • Tool registration in the call_tool handler that routes 'run_benchmark' calls to the handler function; a sketch of the surrounding dispatcher follows the snippet.
    elif name == "run_benchmark":
        return await run_benchmark(arguments)
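  • For context, a minimal sketch of the call_tool dispatcher this branch could sit in. The decorator, the sibling tool name, and the error handling are assumptions for illustration, not taken from the server's source:
    @server.call_tool()
    async def call_tool(name: str, arguments: dict[str, Any]) -> list[TextContent]:
        if name == "vllm_status":          # hypothetical sibling tool (named in the review below)
            return await vllm_status(arguments)
        elif name == "run_benchmark":      # the branch shown above
            return await run_benchmark(arguments)
        raise ValueError(f"Unknown tool: {name}")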
  • Import statement that makes run_benchmark available from the tools module.
    from vllm_mcp_server.tools.benchmark import run_benchmark
  • Export of run_benchmark in __all__ list for module visibility.
    "run_benchmark",
Behavior: 2/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

With no annotations provided, the description carries the full burden but offers minimal behavioral insight. It mentions 'performance benchmark' but does not disclose what metrics are measured, whether the tool has disruptive effects (it places sustained load on the server), what authentication is required, any rate limits, or the output format. The agent lacks crucial context for safe and effective use.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is a single, efficient sentence that front-loads the core purpose without unnecessary elaboration. Every word earns its place by specifying the action, target, and tool used, making it easy to parse quickly.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 2/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the tool's complexity (performance benchmarking with 7 parameters) and the lack of annotations or an output schema, the description is insufficient. It doesn't explain what the benchmark measures, how results are returned, possible error conditions, or external dependencies (GuideLLM must be installed). For a tool that likely generates significant load, more context is needed for safe operation.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema description coverage is 100%, so parameters are well-documented in the schema itself. The description adds no additional parameter semantics beyond implying benchmarking context. This meets the baseline score of 3, as the schema adequately covers parameter details without needing description reinforcement.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the specific action ('Run a performance benchmark'), target ('against the vLLM server'), and method ('using GuideLLM'). It distinguishes this tool from siblings like 'get_model_info' or 'vllm_status' by focusing on performance testing rather than status retrieval or chat/completion functions.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 2/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description provides no guidance on when to use this tool versus alternatives. It doesn't mention prerequisites (e.g., server must be running), compare it to similar tools (e.g., 'vllm_complete' for single requests), or specify scenarios where benchmarking is appropriate versus unnecessary.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
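
One way the description could be tightened in response to the Behavior, Completeness, and Usage Guidelines findings above. This is a sketch, not the project's actual text: the load and metrics claims are inferred from the implementation, the constant name is hypothetical, and vllm_complete is the sibling tool the review itself suggests as an alternative:

    # Hypothetical constant; this string would replace the `description` field
    # in the Tool definition shown under Implementation Reference.
    RUN_BENCHMARK_DESCRIPTION = (
        "Run a performance benchmark against the vLLM server using GuideLLM. "
        "Sends sustained generation traffic at the configured rate (or a rate sweep), "
        "so it places real load on the server and may slow other clients while it runs. "
        "Requires the vLLM server to be reachable and the guidellm CLI to be installed "
        "(pip install guidellm). Returns GuideLLM's text report as TextContent and can "
        "optionally save results to output_path. Prefer a single completion request "
        "(e.g., vllm_complete) for functional checks; use this tool only when load, "
        "latency, or throughput measurements are needed."
    )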
