# start_vllm
Launch a vLLM server in a Docker container to serve HuggingFace models. Automatically detects platform and GPU availability for optimized deployment.
## Instructions
Start a vLLM server in a Docker container. Automatically detects platform (Linux/macOS/Windows) and GPU availability.
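The exact launch command the tool issues is not specified here, but in GPU mode it would resemble the following `docker run` invocation of the official `vllm/vllm-openai` image (a hedged sketch; the container name, host port, and memory fraction shown are illustrative, not the tool's defaults):

```shell
# Illustrative only: approximate GPU-mode launch the tool might perform.
# Host port, container name, and 0.9 memory fraction are assumed values.
docker run --rm -d \
  --name vllm-server \
  --gpus all \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
  --dtype auto \
  --gpu-memory-utilization 0.9
```

With `cpu_only` set, the `--gpus` flag and GPU-specific arguments would be dropped, and the server falls back to CPU execution.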
## Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| model | Yes | HuggingFace model ID to serve (e.g., 'TinyLlama/TinyLlama-1.1B-Chat-v1.0') | |
| port | No | Host port to expose the server on | |
| gpu_memory_utilization | No | GPU memory fraction (0-1), only used when GPU is available | |
| cpu_only | No | Force CPU mode even if GPU is available | |
| tensor_parallel_size | No | Number of GPUs for tensor parallelism | |
| max_model_len | No | Maximum model context length (optional, uses model default) | |
| dtype | No | Data type: auto, float16, bfloat16, float32 | auto |
| container_name | No | Name for the Docker container | |
| extra_args | No | Additional vLLM command-line arguments | |
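To make the parameter-to-flag mapping concrete, here is a minimal Python sketch of how these inputs could be assembled into a `docker run` + vLLM command line. The helper name, the `vllm-server` container name, and the defaults of port 8000 and memory fraction 0.9 are assumptions for illustration, not the tool's actual implementation; the vLLM flags themselves (`--model`, `--dtype`, `--gpu-memory-utilization`, `--tensor-parallel-size`, `--max-model-len`) are real.

```python
def build_vllm_command(model, port=8000, gpu_memory_utilization=0.9,
                       cpu_only=False, tensor_parallel_size=1,
                       max_model_len=None, dtype="auto",
                       container_name="vllm-server", extra_args=None):
    """Hypothetical helper: map the tool's inputs to a docker/vLLM argv list."""
    cmd = ["docker", "run", "--rm", "-d",
           "--name", container_name,
           "-p", f"{port}:8000"]          # vLLM serves on 8000 inside the container
    if not cpu_only:
        cmd += ["--gpus", "all"]           # only request GPUs when GPU mode is in effect
    cmd += ["vllm/vllm-openai:latest",
            "--model", model,
            "--dtype", dtype]
    if not cpu_only:
        cmd += ["--gpu-memory-utilization", str(gpu_memory_utilization),
                "--tensor-parallel-size", str(tensor_parallel_size)]
    if max_model_len is not None:
        cmd += ["--max-model-len", str(max_model_len)]
    cmd += extra_args or []
    return cmd

print(" ".join(build_vllm_command("TinyLlama/TinyLlama-1.1B-Chat-v1.0",
                                  cpu_only=True)))
```

Note how `gpu_memory_utilization` and `tensor_parallel_size` are only emitted when `cpu_only` is false, matching the table's note that they apply only when a GPU is available.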