vllm_chat

Send chat messages to a vLLM server for multi-turn conversations with configurable model parameters and token limits.

Instructions

Send a chat message to the vLLM server. Supports multi-turn conversations.

Input Schema

| Name | Required | Description | Default |
| --- | --- | --- | --- |
| messages | Yes | List of messages in the conversation | |
| model | No | Model to use (optional, uses the server default if not specified) | |
| temperature | No | Sampling temperature (0-2) | 0.7 |
| max_tokens | No | Maximum tokens to generate | 1024 |
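
A concrete call carrying a multi-turn conversation might pass arguments shaped like the sketch below. This is an illustrative payload, not taken from the repo; the model name is a placeholder, and only the field shapes follow the schema above.

    arguments = {
        "messages": [
            {"role": "system", "content": "You are a concise assistant."},
            {"role": "user", "content": "What is vLLM?"},
            {"role": "assistant", "content": "vLLM is a high-throughput serving engine for LLMs."},
            {"role": "user", "content": "And how does this tool use it?"},
        ],
        "model": "meta-llama/Llama-3.1-8B-Instruct",  # placeholder; omit to use the server default
        "temperature": 0.7,  # optional, 0-2
        "max_tokens": 1024,  # optional
    }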

Implementation Reference

  • Tool registration for 'vllm_chat' defining its schema with messages, model, temperature, and max_tokens parameters
    Tool(
        name="vllm_chat",
        description="Send a chat message to the vLLM server. Supports multi-turn conversations.",
        inputSchema={
            "type": "object",
            "properties": {
                "messages": {
                    "type": "array",
                    "description": "List of messages in the conversation",
                    "items": {
                        "type": "object",
                        "properties": {
                            "role": {
                                "type": "string",
                                "enum": ["system", "user", "assistant"],
                                "description": "The role of the message sender",
                            },
                            "content": {
                                "type": "string",
                                "description": "The content of the message",
                            },
                        },
                        "required": ["role", "content"],
                    },
                },
                "model": {
                    "type": "string",
                    "description": "Model to use (optional, uses default if not specified)",
                },
                "temperature": {
                    "type": "number",
                    "description": "Sampling temperature (0-2)",
                    "default": 0.7,
                },
                "max_tokens": {
                    "type": "integer",
                    "description": "Maximum tokens to generate",
                    "default": 1024,
                },
            },
            "required": ["messages"],
        },
    ),
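  • A hedged sketch (not from this repo): pre-validating a tool call against the schema above with the third-party jsonschema package; the server's own handle_chat below does manual checks instead, so this is purely illustrative
    # Illustrative pre-dispatch validation using `jsonschema` (an assumption;
    # this repo validates manually in handle_chat below).
    from jsonschema import ValidationError, validate

    input_schema = {
        "type": "object",
        "properties": {
            "messages": {"type": "array"},
            "model": {"type": "string"},
            "temperature": {"type": "number"},
            "max_tokens": {"type": "integer"},
        },
        "required": ["messages"],
    }

    try:
        validate(
            instance={"messages": [{"role": "user", "content": "hi"}]},
            schema=input_schema,
        )
    except ValidationError as err:
        print(f"Invalid arguments: {err.message}")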
  • The handle_chat function that executes the vllm_chat tool logic: validates messages, extracts parameters, calls VLLMClient.chat_completion, and formats the response with usage info
    async def handle_chat(arguments: dict[str, Any]) -> list[TextContent]:
        """
        Handle chat completion request.
    
        Args:
            arguments: Dictionary containing:
                - messages: List of message objects with 'role' and 'content'
                - model: Optional model name to use
                - temperature: Optional temperature (0-2)
                - max_tokens: Optional maximum tokens to generate
    
        Returns:
            List of TextContent with the assistant's response.
        """
        messages = arguments.get("messages", [])
        if not messages:
            return [TextContent(type="text", text="Error: No messages provided")]
    
        # Validate message format
        for msg in messages:
            if not isinstance(msg, dict) or "role" not in msg or "content" not in msg:
                return [
                    TextContent(
                        type="text",
                        text="Error: Each message must have 'role' and 'content' fields",
                    )
                ]
    
        model = arguments.get("model")
        temperature = arguments.get("temperature")
        max_tokens = arguments.get("max_tokens")
    
        try:
            async with VLLMClient() as client:
                response = await client.chat_completion(
                    messages=messages,
                    model=model,
                    temperature=temperature,
                    max_tokens=max_tokens,
                    stream=False,
                )
    
                # Extract the assistant's message
                choices = response.get("choices", [])
                if not choices:
                    return [TextContent(type="text", text="Error: No response from model")]
    
                assistant_message = choices[0].get("message", {}).get("content", "")
    
                # Include usage info
                usage = response.get("usage", {})
                usage_info = ""
                if usage:
                    usage_info = (
                        f"\n\n---\n"
                        f"Tokens: {usage.get('prompt_tokens', 0)} prompt + "
                        f"{usage.get('completion_tokens', 0)} completion = "
                        f"{usage.get('total_tokens', 0)} total"
                    )
    
                return [TextContent(type="text", text=assistant_message + usage_info)]
    
        except VLLMClientError as e:
            return [TextContent(type="text", text=f"Error: {str(e)}")]
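  • A hedged sketch for exercising handle_chat manually; the import path is hypothetical, and a vLLM server reachable via VLLMClient's configured base URL is assumed
    import asyncio

    from vllm_mcp_server.server import handle_chat  # hypothetical path; adjust to the repo layout

    async def main() -> None:
        result = await handle_chat({
            "messages": [{"role": "user", "content": "Say hello in one word."}],
            "temperature": 0.0,
            "max_tokens": 8,
        })
        for item in result:
            print(item.text)  # assistant reply, plus a token-usage footer on success

    asyncio.run(main())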
  • VLLMClient.chat_completion method that makes the actual OpenAI-compatible API call to the vLLM server's /chat/completions endpoint
    async def chat_completion(
        self,
        messages: list[dict[str, str]],
        model: Optional[str] = None,
        temperature: Optional[float] = None,
        max_tokens: Optional[int] = None,
        stream: bool = False,
        **kwargs: Any,
    ) -> dict[str, Any] | AsyncIterator[dict[str, Any]]:
        """Send a chat completion request."""
        model = model or self.settings.model
        if not model:
            # Try to get the first available model
            models = await self.list_models()
            if models:
                model = models[0].get("id")
            else:
                raise VLLMAPIError("No model specified and no models available", 400)
    
        payload = {
            "model": model,
            "messages": messages,
            "temperature": temperature or self.settings.default_temperature,
            "max_tokens": max_tokens or self.settings.default_max_tokens,
            "stream": stream,
            **kwargs,
        }
    
        session = await self._get_session()
        try:
            if stream:
                return self._stream_chat_completion(session, payload)
            else:
                return await self._send_chat_completion(session, payload)
        except aiohttp.ClientConnectorError as e:
            raise VLLMConnectionError(f"Cannot connect to vLLM server: {e}") from e
    
    async def _send_chat_completion(
        self, session: aiohttp.ClientSession, payload: dict
    ) -> dict[str, Any]:
        """Send non-streaming chat completion request."""
        async with session.post(
            f"{self.base_url}/chat/completions",
            headers=self.headers,
            json=payload,
        ) as response:
            if response.status != 200:
                body = await response.text()
                raise VLLMAPIError(
                    f"Chat completion failed: {response.status}",
                    response.status,
                    body,
                )
            return await response.json()
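  • A hedged sketch of the same call stripped down to a bare OpenAI-compatible POST, assuming a vLLM server at http://localhost:8000/v1 (the conventional default base URL) and a placeholder model name
    import asyncio

    import aiohttp

    async def main() -> None:
        payload = {
            "model": "meta-llama/Llama-3.1-8B-Instruct",  # placeholder model name
            "messages": [{"role": "user", "content": "Hello!"}],
            "temperature": 0.7,
            "max_tokens": 64,
            "stream": False,
        }
        async with aiohttp.ClientSession() as session:
            async with session.post(
                "http://localhost:8000/v1/chat/completions", json=payload
            ) as resp:
                resp.raise_for_status()
                data = await resp.json()
        print(data["choices"][0]["message"]["content"])

    asyncio.run(main())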
  • Tool call handler registration that routes 'vllm_chat' tool calls to the handle_chat function
    if name == "vllm_chat":
        return await handle_chat(arguments)
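  • A hedged sketch of how this dispatch typically sits inside an MCP server's call_tool handler (the standard mcp Python SDK pattern; the exact surrounding code in this repo may differ)
    from typing import Any

    from mcp.server import Server
    from mcp.types import TextContent

    server = Server("vllm-mcp-server")

    @server.call_tool()
    async def call_tool(name: str, arguments: dict[str, Any]) -> list[TextContent]:
        if name == "vllm_chat":
            return await handle_chat(arguments)  # handler shown above
        raise ValueError(f"Unknown tool: {name}")
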
Behavior: 2/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

With no annotations provided, the description carries the full burden of behavioral disclosure. It mentions multi-turn conversation support but fails to describe important behavioral aspects such as authentication requirements, rate limits, error conditions, response format, or whether the operation reads or writes. That is insufficient for a tool with multiple parameters and no output schema.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is extremely concise with just two sentences that directly state the tool's purpose and key capability. Every word earns its place, and the information is front-loaded without unnecessary elaboration or redundancy.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 2/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

For a chat tool with four parameters, no annotations, and no output schema, the description is inadequate. It doesn't explain what the tool returns, how errors are handled, what authentication is required, or what behavioral constraints apply. The description fails to compensate for the lack of structured metadata about this potentially complex interaction.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

With 100% schema description coverage, the baseline is 3. The description doesn't add any parameter-specific information beyond what's already documented in the schema. It mentions 'multi-turn conversations', which relates to the messages parameter, but provides no additional context about message structure or conversation flow.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 4/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the action ('Send a chat message') and target ('to the vLLM server'), and 'Supports multi-turn conversations' distinguishes it from simpler completion tools. However, it doesn't explicitly differentiate this tool from siblings like 'vllm_complete', which might have overlapping functionality.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 2/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description provides no guidance on when to use this tool versus alternatives like 'vllm_complete' or other chat-related tools. It mentions multi-turn conversations but doesn't specify prerequisites, limitations, or appropriate contexts for choosing this tool over others.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

