multimodal-mcp
Server Configuration
Describes the environment variables required to run the server.
| Name | Required | Description | Default |
|---|---|---|---|
| VISION_MODEL | Yes | Model name, e.g., qwen3.7-plus | |
| VISION_API_KEY | Yes | API key for the vision model | |
| VISION_BASE_URL | Yes | Base URL for the vision model API, e.g., https://dashscope.aliyuncs.com/compatible-mode/v1 |
Capabilities
Features and capabilities supported by this server
| Capability | Details |
|---|---|
| tools | {
"listChanged": false
} |
| prompts | {
"listChanged": false
} |
| resources | {
"subscribe": false,
"listChanged": false
} |
| experimental | {} |
Tools
Functions exposed to the LLM to take actions
| Name | Description |
|---|---|
| describe_imageA | Convert an image into structured text so a text-only model can "see" it. Call this tool whenever the current main model cannot view images directly
(e.g. glm-5.2, deepseek-v4-pro, qwen-text) but the user wants you to look
at an image. The image source is auto-detected from the
The clipboard path is what makes screenshots work without pasting: the
user takes a screenshot (Cmd+Shift+4 / Win+Shift+S / scrot) so the image
lives in the OS clipboard, then says something like "看下我的截图" or
"look at my screenshot" in the chat. You call this tool with no The image is sent to the configured vision model (any OpenAI-compatible multimodal endpoint: qwen3.7-plus, qwen-vl-max, gpt-4o, llava, etc.) and returned as structured Chinese text covering:
This tool does NOT answer questions about the image. It only converts the image to text. After it returns, YOU (the main model) do the reasoning and answer the user yourself, as if you had read the description. Args: image (Optional[str]): URL / data URI / file path / base64 / empty. Empty reads from the system clipboard. instruction (Optional[str]): custom vision instruction; if omitted, a comprehensive default prompt is used. detail (DetailLevel): 'high' (default) for OCR/dense content, 'low' for a quick rough summary. Returns: str: Markdown text describing the image on success. On failure: '[describe_image failed] : '. When to call:
- User pasted an image attachment / gave a URL / gave a file path.
- User says "看下我的截图" / "look at my screenshot" / "我刚截了张图"
(leave When NOT to call: - The user only sent text with no mention of any image. - You already have a textual description and no new image arrived. Examples: - "describe this: https://x.com/a.png" -> image= - "看下我的截图" -> image omitted - "识别 /tmp/chart.png 里的表格" -> image="/tmp/chart.png" - "把这张流程图(base64)转成 Mermaid" -> instruction="...", image= |
| multimodal_config_statusA | Report whether the required vision env vars are set (never the values). Call once after first wiring the server into a client, to confirm VISION_BASE_URL, VISION_API_KEY and VISION_MODEL are all configured. The API key itself is never exposed; only a boolean. Returns: str: JSON with vision_base_url_set, vision_api_key_set, vision_model. |
Prompts
Interactive templates invoked by user choice
| Name | Description |
|---|---|
No prompts | |
Resources
Contextual data attached and managed by the client
| Name | Description |
|---|---|
No resources | |
Latest Blog Posts
MCP directory API
We provide all the information about MCP servers via our MCP API.
curl -X GET 'https://glama.ai/api/mcp/v1/servers/believe3344/multimodal-mcp'
If you have feedback or need assistance with the MCP directory API, please join our Discord server