Click on "Install Server".
Wait a few minutes for the server to deploy. Once ready, it will show a "Started" state.
In the chat, type @ followed by the MCP server name and your instructions, e.g., "@vLLM MCP Server start a vLLM server with Llama 3.1 8B".
That's it! The server will respond to your query, and you can continue using it as needed.
Here is a step-by-step guide with screenshots.
vLLM MCP Server
A Model Context Protocol (MCP) server that exposes vLLM capabilities to AI assistants like Claude, Cursor, and other MCP-compatible clients.
Features
- Chat & Completion: Send chat messages and text completions to vLLM
- Model Management: List and inspect available models
- Server Monitoring: Check server health and performance metrics
- Platform-Aware Container Control: Supports both Podman and Docker. Automatically detects your platform (Linux/macOS/Windows) and GPU availability, selecting the appropriate container image and optimal settings (e.g., `max_model_len`)
- Benchmarking: Run GuideLLM benchmarks (optional)
- Pre-defined Prompts: Use curated system prompts for common tasks
Demo
Start vLLM Server
Use the start_vllm tool to launch a vLLM container with automatic platform detection:

Chat with vLLM
Send chat messages using the vllm_chat tool:

Stop vLLM Server
Clean up with the stop_vllm tool:

Installation
Using uvx (Recommended)
```bash
uvx vllm-mcp-server
```

Using pip

```bash
pip install vllm-mcp-server
```

From Source

```bash
git clone https://github.com/micytao/vllm-mcp-server.git
cd vllm-mcp-server
pip install -e .
```

Quick Start
1. Start a vLLM Server
You can either start a vLLM server manually or let the MCP server manage it via Docker.
Option A: Let MCP Server Manage Docker (Recommended)
The MCP server can automatically start/stop vLLM containers with platform detection. Just configure your MCP client (step 2) and use the start_vllm tool.
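Whichever option you choose, you can verify from the host that vLLM came up by polling its OpenAI-compatible `/v1/models` endpoint. A minimal sketch, assuming the default base URL from this README; the helper name is illustrative and not part of this package:

```python
import json
import urllib.request


def vllm_is_ready(base_url="http://localhost:8000", timeout=5):
    """Return the list of model IDs if the vLLM server answers on its
    OpenAI-compatible /v1/models endpoint, or None if it is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
            data = json.load(resp)
        return [m["id"] for m in data.get("data", [])]
    except OSError:  # connection refused, timeout, reset, ...
        return None
```

If this returns `None`, check the container logs (see the `get_vllm_logs` tool below) before configuring your MCP client.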
Option B: Manual Container Setup (Podman or Docker)
Replace podman with docker if using Docker.
Linux/Windows with NVIDIA GPU:
```bash
podman run --device nvidia.com/gpu=all -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model TinyLlama/TinyLlama-1.1B-Chat-v1.0
```

macOS (Apple Silicon / Intel):

```bash
podman run -p 8000:8000 \
  quay.io/rh_ee_micyang/vllm-mac:v0.11.0 \
  --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
  --device cpu --dtype bfloat16
```

Linux/Windows CPU-only:

```bash
podman run -p 8000:8000 \
  quay.io/rh_ee_micyang/vllm-cpu:v0.11.0 \
  --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
  --device cpu --dtype bfloat16
```

Option C: Native vLLM Installation

```bash
vllm serve TinyLlama/TinyLlama-1.1B-Chat-v1.0
```

2. Configure Your MCP Client
Cursor
Add to ~/.cursor/mcp.json:
```json
{
  "mcpServers": {
    "vllm": {
      "command": "uvx",
      "args": ["vllm-mcp-server"],
      "env": {
        "VLLM_BASE_URL": "http://localhost:8000",
        "VLLM_MODEL": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
        "VLLM_HF_TOKEN": "hf_your_token_here"
      }
    }
  }
}
```

Note: `VLLM_HF_TOKEN` is required for gated models like Llama. Get your token from HuggingFace Settings.
Claude Desktop
Add to your Claude Desktop configuration:
```json
{
  "mcpServers": {
    "vllm": {
      "command": "uvx",
      "args": ["vllm-mcp-server"],
      "env": {
        "VLLM_BASE_URL": "http://localhost:8000",
        "VLLM_HF_TOKEN": "hf_your_token_here"
      }
    }
  }
}
```

3. Use the Tools
Once configured, you can use these tools in your AI assistant:
Server Management:
- `start_vllm` - Start a vLLM container (auto-detects platform & GPU)
- `stop_vllm` - Stop a running container
- `get_platform_status` - Check platform, Docker, and GPU status
- `vllm_status` - Check vLLM server health
Inference:
- `vllm_chat` - Send chat messages
- `vllm_complete` - Generate text completions
Model Management:
- `list_models` - List available models
- `get_model_info` - Get model details
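The container that `start_vllm` launches corresponds to the manual podman/docker commands shown in Option B above. As a rough illustration only (the actual implementation lives in `tools/server_control.py` and may differ), the GPU-vs-CPU command assembly could look like:

```python
import shlex


def build_run_command(model, port=8000, cpu_only=False, runtime="podman"):
    """Sketch of the container invocation behind start_vllm.
    Image tags are taken from the manual setup section of this README;
    the function itself is hypothetical, not the package's API."""
    if cpu_only:
        cmd = (f"{runtime} run -p {port}:8000 "
               "quay.io/rh_ee_micyang/vllm-cpu:v0.11.0 "
               f"--model {model} --device cpu --dtype bfloat16")
    else:
        cmd = (f"{runtime} run --device nvidia.com/gpu=all -p {port}:8000 "
               f"vllm/vllm-openai:latest --model {model}")
    return shlex.split(cmd)
```

Swapping `runtime="docker"` mirrors the "replace podman with docker" note from the quick start.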
Configuration
Configure the server using environment variables:
| Variable | Description | Default |
|---|---|---|
| `VLLM_BASE_URL` | vLLM server URL | `http://localhost:8000` |
| | API key (if required) | |
| `VLLM_MODEL` | Default model to use | |
| `VLLM_HF_TOKEN` | HuggingFace token for gated models (e.g., Llama) | |
| | Default temperature | |
| | Default max tokens | |
| | Request timeout (seconds) | |
| `VLLM_CONTAINER_RUNTIME` | Container runtime (`podman` or `docker`) | |
| | Container image (GPU mode) | `vllm/vllm-openai:latest` |
| | Container image (macOS) | `quay.io/rh_ee_micyang/vllm-mac:v0.11.0` |
| | Container image (CPU mode) | `quay.io/rh_ee_micyang/vllm-cpu:v0.11.0` |
| | Container name | `vllm-server` |
| | GPU memory fraction | |
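In code, reading these variables could look like the sketch below. The variable names and the base-URL default come from the configuration examples in this README; the model and runtime fallbacks are assumptions, and the real logic lives in `utils/config.py`:

```python
import os


def load_settings(env=None):
    """Illustrative environment-driven configuration loader.
    Defaults for model and runtime are assumptions based on the
    examples elsewhere in this README, not guaranteed package behavior."""
    env = os.environ if env is None else env
    return {
        "base_url": env.get("VLLM_BASE_URL", "http://localhost:8000"),
        "model": env.get("VLLM_MODEL", "TinyLlama/TinyLlama-1.1B-Chat-v1.0"),
        "hf_token": env.get("VLLM_HF_TOKEN"),  # only needed for gated models
        "runtime": env.get("VLLM_CONTAINER_RUNTIME", "podman"),
    }
```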
Available Tools
P0 (Core)
vllm_chat
Send chat messages to vLLM with multi-turn conversation support.
```json
{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"}
  ],
  "temperature": 0.7,
  "max_tokens": 1024
}
```

vllm_complete
Generate text completions.
```json
{
  "prompt": "def fibonacci(n):",
  "max_tokens": 200,
  "stop": ["\n\n"]
}
```

P1 (Model Management)
list_models
List all available models on the vLLM server.
get_model_info
Get detailed information about a specific model.
```json
{
  "model_id": "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
}
```

P2 (Status)
vllm_status
Check the health and status of the vLLM server.
P3 (Server Control - Platform Aware)
The server control tools support both Podman (preferred) and Docker, automatically detecting your platform and GPU availability:
| Platform | GPU Support | Container Image | Default `max_model_len` |
|---|---|---|---|
| Linux (GPU) | ✅ NVIDIA | `vllm/vllm-openai:latest` | 8096 |
| Linux (CPU) | ❌ | `quay.io/rh_ee_micyang/vllm-cpu:v0.11.0` | 2048 |
| macOS (Apple Silicon) | ❌ | `quay.io/rh_ee_micyang/vllm-mac:v0.11.0` | 2048 |
| macOS (Intel) | ❌ | `quay.io/rh_ee_micyang/vllm-mac:v0.11.0` | 2048 |
| Windows (GPU) | ✅ NVIDIA | `vllm/vllm-openai:latest` | 8096 |
| Windows (CPU) | ❌ | `quay.io/rh_ee_micyang/vllm-cpu:v0.11.0` | 2048 |
Note: The `max_model_len` is automatically set based on the detected mode (CPU vs GPU). CPU mode uses 2048 to match vLLM's `max_num_batched_tokens` limit, while GPU mode uses 8096 for larger context. You can override this by explicitly passing `max_model_len` to `start_vllm`.
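The note above boils down to a small decision table. A hedged sketch of how that selection could be expressed (the function name is hypothetical; image tags come from the manual setup section, and the real detection is exposed via `get_platform_status`):

```python
def pick_image_and_len(system, has_gpu):
    """Map platform + GPU availability to a container image and the
    default max_model_len, mirroring the table above."""
    if system == "Darwin":  # macOS is CPU-only on both Apple Silicon and Intel
        return "quay.io/rh_ee_micyang/vllm-mac:v0.11.0", 2048
    if has_gpu:  # Linux/Windows with an NVIDIA GPU
        return "vllm/vllm-openai:latest", 8096
    return "quay.io/rh_ee_micyang/vllm-cpu:v0.11.0", 2048  # Linux/Windows CPU
```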
start_vllm
Start a vLLM server in a Docker container with automatic platform detection.
```json
{
  "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
  "port": 8000,
  "gpu_memory_utilization": 0.9,
  "cpu_only": false,
  "tensor_parallel_size": 1,
  "max_model_len": null,
  "dtype": "auto"
}
```

Note: If `max_model_len` is not specified (or `null`), it defaults to 2048 for CPU mode or 8096 for GPU mode.
stop_vllm
Stop a running vLLM Docker container.
```json
{
  "container_name": "vllm-server",
  "remove": true,
  "timeout": 10
}
```

restart_vllm
Restart a vLLM container.
list_vllm_containers
List all vLLM Docker containers.
```json
{
  "all": true
}
```

get_vllm_logs
Get container logs to monitor loading progress.
```json
{
  "container_name": "vllm-server",
  "tail": 100
}
```

get_platform_status
Get detailed platform, Docker, and GPU status information.
run_benchmark
Run a GuideLLM benchmark against the server.
```json
{
  "rate": "sweep",
  "max_seconds": 120,
  "data": "emulated"
}
```

Resources
The server exposes these MCP resources:
- `vllm://status` - Current server status
- `vllm://metrics` - Performance metrics
- `vllm://config` - Current configuration
- `vllm://platform` - Platform, Docker, and GPU information
Prompts
Pre-defined prompts for common tasks:
- `coding_assistant` - Expert coding help
- `code_reviewer` - Code review feedback
- `technical_writer` - Documentation writing
- `debugger` - Debugging assistance
- `architect` - System design help
- `data_analyst` - Data analysis
- `ml_engineer` - ML/AI development
Development
Setup
```bash
# Clone the repository
git clone https://github.com/micytao/vllm-mcp-server.git
cd vllm-mcp-server

# Install uv if you haven't already
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create virtual environment and install dependencies
uv venv
source .venv/bin/activate  # or `.venv\Scripts\activate` on Windows

# Install with dev dependencies
uv pip install -e ".[dev]"
```

Local Development with Cursor
For debugging and local development, configure Cursor to run from source using uv run instead of uvx:
Add to ~/.cursor/mcp.json:
```json
{
  "mcpServers": {
    "vllm": {
      "command": "uv",
      "args": [
        "--directory",
        "/path/to/vllm-mcp-server",
        "run",
        "vllm-mcp-server"
      ],
      "env": {
        "VLLM_BASE_URL": "http://localhost:8000",
        "VLLM_HF_TOKEN": "hf_your_token_here",
        "VLLM_CONTAINER_RUNTIME": "podman"
      }
    }
  }
}
```

This runs the MCP server directly from your local source code, so any changes you make will be reflected immediately after restarting Cursor.
Running Tests
```bash
uv run pytest
```

Code Formatting

```bash
uv run ruff check --fix .
uv run ruff format .
```

Architecture
```
vllm-mcp-server/
├── src/vllm_mcp_server/
│   ├── server.py              # Main MCP server entry point
│   ├── tools/                 # MCP tool implementations
│   │   ├── chat.py            # Chat/completion tools
│   │   ├── models.py          # Model management tools
│   │   ├── server_control.py  # Docker container control
│   │   └── benchmark.py       # GuideLLM integration
│   ├── resources/             # MCP resource implementations
│   │   ├── server_status.py   # Server health resource
│   │   └── metrics.py         # Prometheus metrics resource
│   ├── prompts/               # Pre-defined prompts
│   │   └── system_prompts.py  # Curated system prompts
│   └── utils/                 # Utilities
│       ├── config.py          # Configuration management
│       └── vllm_client.py     # vLLM API client
├── tests/                     # Test suite
├── examples/                  # Configuration examples
├── pyproject.toml             # Project configuration
└── README.md                  # This file
```

License
Apache License 2.0 - see LICENSE for details.
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
Acknowledgments