VLLM MCP Server

generate_multimodal_response

Generate AI responses using text prompts combined with images or files from multiple providers to create multimodal content and analysis.

Instructions

Generate response from multimodal model.

        Args:
            model: Model name to use
            prompt: Text prompt
            image_urls: Optional list of image URLs
            file_paths: Optional list of file paths
            system_prompt: Optional system prompt
            max_tokens: Maximum tokens to generate
            temperature: Generation temperature
            provider: Optional provider name (openai, dashscope)

        Returns:
            Generated response text

Input Schema

TableJSON Schema

Name	Required	Description	Default
`model`	Yes
`prompt`	Yes
`image_urls`	No
`file_paths`	No
`system_prompt`	No
`max_tokens`	No
`temperature`	No
`provider`	No

Output Schema

TableJSON Schema

Name	Required	Description	Default
`result`	Yes

Implementation Reference

src/vllm_mcp/server.py:131-252 (handler)

The main handler function for the 'generate_multimodal_response' tool. It is decorated with @self.server.tool(), processes input parameters, constructs a MultimodalRequest, delegates to the appropriate provider for generation, and formats the response.

@self.server.tool()
def generate_multimodal_response(
    model: str,
    prompt: str,
    image_urls: Optional[List[str]] = None,
    file_paths: Optional[List[str]] = None,
    system_prompt: Optional[str] = None,
    max_tokens: Optional[int] = 1000,
    temperature: Optional[float] = 0.7,
    provider: Optional[str] = None
) -> str:
    """Generate response from multimodal model.

    Args:
        model: Model name to use
        prompt: Text prompt
        image_urls: Optional list of image URLs
        file_paths: Optional list of file paths
        system_prompt: Optional system prompt
        max_tokens: Maximum tokens to generate
        temperature: Generation temperature
        provider: Optional provider name (openai, dashscope)

    Returns:
        Generated response text
    """
    try:
        # Auto-detect provider if not specified
        if not provider:
            if model.startswith("gpt"):
                provider = "openai"
            elif model.startswith("qwen"):
                provider = "dashscope"
            else:
                provider = list(self.providers.keys())[0] if self.providers else None

        if not provider or provider not in self.providers:
            return f"Error: Provider '{provider}' not available"

        # Build multimodal request
        text_contents = [TextContent(text=prompt)]
        image_contents = []
        file_contents = []

        # Add image content
        if image_urls:
            for url in image_urls:
                image_contents.append(ImageContent(
                    url=url,
                    mime_type="image/jpeg"  # Default, will be updated if needed
                ))

        # Add file content
        if file_paths:
            for file_path in file_paths:
                path = Path(file_path)
                if path.exists():
                    import mimetypes
                    mime_type, _ = mimetypes.guess_type(file_path)

                    if mime_type and mime_type.startswith("image/"):
                        image_contents.append(ImageContent(
                            image_path=file_path,
                            mime_type=mime_type
                        ))
                    elif mime_type and mime_type.startswith("text/"):
                        with open(path, 'r', encoding='utf-8') as f:
                            content = f.read()
                        file_contents.append(FileContent(
                            filename=path.name,
                            text=content,
                            mime_type=mime_type
                        ))

        request = MultimodalRequest(
            model=model,
            text_contents=text_contents,
            image_contents=image_contents,
            file_contents=file_contents,
            system_prompt=system_prompt,
            max_tokens=max_tokens,
            temperature=temperature
        )

        # Generate response
        try:
            # Check if we're already in an event loop
            try:
                loop = asyncio.get_running_loop()
                # We're already in a loop, create a task
                task = asyncio.create_task(
                    self.providers[provider].generate_response(request)
                )
                # Wait for the task to complete
                while not task.done():
                    asyncio.sleep(0.01)
                response = task.result()
            except RuntimeError:
                # No running loop, create a new one
                loop = asyncio.new_event_loop()
                asyncio.set_event_loop(loop)
                response = loop.run_until_complete(
                    self.providers[provider].generate_response(request)
                )
                loop.close()

            if response.error:
                return f"Error: {response.error}"

            result = response.text
            if response.usage:
                result += f"\n\n[Token usage: {response.usage}]"

            return result

        finally:
            loop.close()

    except Exception as e:
        logger.error(f"Error generating response: {e}")
        return f"Error: {str(e)}"

src/vllm_mcp/server.py:128-129 (registration)
The _setup_tools method where the generate_multimodal_response tool is registered using the @self.server.tool() decorator.
```
def _setup_tools(self):
    """Setup MCP tools."""
```

src/vllm_mcp/models.py:50-63 (schema)

Pydantic model used internally for structuring the multimodal request passed to providers. Supports input validation and typing.

class MultimodalRequest(BaseModel):
    """Multimodal request model."""
    model: str = Field(..., description="Model name")
    text_contents: List[TextContent] = Field(default_factory=list, description="Text content list")
    image_contents: List[ImageContent] = Field(default_factory=list, description="Image content list")
    file_contents: List[FileContent] = Field(default_factory=list, description="File content list")
    system_prompt: Optional[str] = Field(None, description="System prompt")
    max_tokens: Optional[int] = Field(1000, description="Maximum tokens to generate")
    temperature: float = Field(0.7, description="Generation temperature")
    top_p: Optional[float] = Field(None, description="Top-p sampling")
    top_k: Optional[int] = Field(None, description="Top-k sampling")
    stream: bool = Field(False, description="Whether to stream response")
    extra_params: Dict[str, Any] = Field(default_factory=dict, description="Extra model parameters")

src/vllm_mcp/providers/openai_provider.py:40-68 (helper)

Helper method in OpenAIProvider that performs the actual API call to generate multimodal responses, used by the main handler.

async def generate_response(
    self, request: MultimodalRequest
) -> MultimodalResponse:
    """Generate response from OpenAI multimodal model.

    Args:
        request: Multimodal request containing text, images, and files

    Returns:
        Multimodal response
    """
    try:
        messages = self._build_messages(request)

        response = self.client.chat.completions.create(
            model=request.model,
            messages=messages,
            max_tokens=request.max_tokens,
            temperature=request.temperature,
            stream=False,
        )

        return self._parse_response(response)

    except openai.APIError as e:
        raise Exception(f"OpenAI API error: {e}")
    except Exception as e:
        raise Exception(f"Error generating response: {e}")

Tool Definition Quality

C2.9/5.0

Behavior2/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

No annotations are provided, so the description carries the full burden of behavioral disclosure. It only states the basic action ('generate response') without detailing behavioral traits like rate limits, authentication needs, error handling, or what happens with invalid inputs. For a complex tool with 8 parameters, this lack of context is a significant gap, though it doesn't contradict any annotations.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness4/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is appropriately sized and front-loaded, starting with the core purpose followed by parameter details in a structured format. Every sentence serves a purpose, with no wasted words, though the parameter explanations could be more concise. It efficiently conveys essential information without unnecessary elaboration.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness3/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the tool's complexity (8 parameters, no annotations, but with an output schema), the description is moderately complete. It covers the purpose and parameters but lacks behavioral context and usage guidelines. The output schema handles return values, so the description doesn't need to explain those, but it should provide more operational details to fully guide an AI agent.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

The description lists all 8 parameters with brief explanations (e.g., 'Model name to use', 'Text prompt'), adding meaning beyond the input schema, which has 0% description coverage. However, the explanations are minimal and don't cover details like format constraints or examples. With low schema coverage, this partially compensates but doesn't fully address the complexity, warranting a baseline score.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose4/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool's purpose: 'Generate response from multimodal model.' This specifies the verb ('generate response') and resource ('multimodal model'), making it easy to understand what the tool does. However, it doesn't explicitly differentiate from sibling tools like 'list_available_providers' or 'validate_multimodal_request', which serve different purposes (listing vs. validation vs. generation).

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines2/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description provides no guidance on when to use this tool versus alternatives. It doesn't mention sibling tools or any context for choosing this tool over others, such as for generating outputs versus validating requests. Without such guidance, an AI agent might struggle to select the appropriate tool in a given scenario.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

Install Server

Other Tools

Latest Blog Posts

Lightport: Open-Sourcing Glama's AI Gateway
By punkpeye on April 27, 2026.
open source
OpenAI
Tool Definition Quality Score (TDQS)
By punkpeye on April 3, 2026.
mcp
The Hackers Who Tracked My Sleep Cycle
By punkpeye on March 26, 2026.
security

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/StanleyChanH/vllm-mcp'

If you have feedback or need assistance with the MCP directory API, please join our Discord server