start_vllm

Launches a vLLM server in a Docker container to serve Hugging Face models, automatically detecting the platform and GPU availability for optimized deployment.

Instructions

Starts a vLLM server in a Docker container, automatically detecting the host platform (Linux/macOS/Windows) and GPU availability.
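As a rough illustration of the detection step described above, the sketch below checks the OS name and uses the presence of `nvidia-smi` on the `PATH` as a proxy for GPU availability. This is an assumption about how such detection could work, not the tool's actual implementation.

```python
import platform
import shutil

def detect_environment():
    """Hypothetical sketch of platform/GPU detection.

    Returns the OS name ('Linux', 'Darwin' for macOS, or 'Windows') and
    whether the NVIDIA driver CLI (nvidia-smi) is available, which is a
    common heuristic for CUDA GPU availability.
    """
    system = platform.system()
    gpu_available = shutil.which("nvidia-smi") is not None
    return system, gpu_available
```

A caller could then fall back to CPU mode whenever `gpu_available` is false, matching the behavior implied by the `cpu_only` parameter below.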

Input Schema

| Name | Required | Description | Default |
|------|----------|-------------|---------|
| model | Yes | HuggingFace model ID to serve (e.g., 'TinyLlama/TinyLlama-1.1B-Chat-v1.0') | |
| port | No | Port to expose | |
| gpu_memory_utilization | No | GPU memory fraction (0-1); used only when a GPU is available | |
| cpu_only | No | Force CPU mode even if a GPU is available | |
| tensor_parallel_size | No | Number of GPUs for tensor parallelism | |
| max_model_len | No | Maximum model context length (optional; uses the model default) | |
| dtype | No | Data type: auto, float16, bfloat16, float32 | auto |
| container_name | No | Name for the Docker container | |
| extra_args | No | Additional vLLM command-line arguments | |
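To show how these parameters could map onto a container launch, here is a hedged sketch that assembles a `docker run` command for the vLLM OpenAI-compatible server image. The helper name, the image tag `vllm/vllm-openai:latest`, and the default values shown are illustrative assumptions, not the tool's actual internals or defaults.

```python
import shlex

def build_vllm_command(model, port=8000, cpu_only=False,
                       gpu_memory_utilization=0.9, tensor_parallel_size=1,
                       max_model_len=None, dtype="auto",
                       container_name="vllm-server", extra_args=""):
    """Hypothetical translation of the input schema into a docker command.

    GPU-specific flags are skipped in CPU mode, mirroring the note that
    gpu_memory_utilization applies only when a GPU is available.
    """
    cmd = ["docker", "run", "-d", "--name", container_name,
           "-p", f"{port}:8000"]
    if not cpu_only:
        cmd += ["--gpus", "all"]  # expose NVIDIA GPUs to the container
    cmd += ["vllm/vllm-openai:latest", "--model", model, "--dtype", dtype]
    if not cpu_only:
        cmd += ["--gpu-memory-utilization", str(gpu_memory_utilization),
                "--tensor-parallel-size", str(tensor_parallel_size)]
    if max_model_len is not None:
        cmd += ["--max-model-len", str(max_model_len)]
    cmd += shlex.split(extra_args)  # pass through any extra vLLM arguments
    return cmd
```

For example, `build_vllm_command("TinyLlama/TinyLlama-1.1B-Chat-v1.0", port=8001)` would publish the server on host port 8001, while `cpu_only=True` would omit the `--gpus` flag entirely.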

MCP directory API

We provide all information about MCP servers via our MCP directory API.

```shell
curl -X GET 'https://glama.ai/api/mcp/v1/servers/micytao/vllm-mcp-server'
```

If you have feedback or need assistance with the MCP directory API, please join our Discord server.