Schema | multimodal-mcp

multimodal-mcp

Overview Schema Related Servers Score Discussions

Server Configuration

Describes the environment variables required to run the server.

Name	Required	Description
`VISION_MODEL`	Yes	Model name, e.g., qwen3.7-plus
`VISION_API_KEY`	Yes	API key for the vision model
`VISION_BASE_URL`	Yes	Base URL for the vision model API, e.g., https://dashscope.aliyuncs.com/compatible-mode/v1

Capabilities

Features and capabilities supported by this server

Capability	Details
`tools`	{ "listChanged": false }
`prompts`	{ "listChanged": false }
`resources`	{ "subscribe": false, "listChanged": false }
`experimental`	{}

Tools

Functions exposed to the LLM to take actions

Name Description

Name	Description
describe_imageA	Convert an image into structured text so a text-only model can "see" it. Call this tool whenever the current main model cannot view images directly (e.g. glm-5.2, deepseek-v4-pro, qwen-text) but the user wants you to look at an image. The image source is auto-detected from the `image` argument: http(s) URL -> downloaded data: URI -> base64 extracted local file path -> read from disk raw base64 string -> used as-is empty / None -> read from the system clipboard The clipboard path is what makes screenshots work without pasting: the user takes a screenshot (Cmd+Shift+4 / Win+Shift+S / scrot) so the image lives in the OS clipboard, then says something like "看下我的截图" or "look at my screenshot" in the chat. You call this tool with no `image` argument; it reads the clipboard and returns the description. The image is sent to the configured vision model (any OpenAI-compatible multimodal endpoint: qwen3.7-plus, qwen-vl-max, gpt-4o, llava, etc.) and returned as structured Chinese text covering: overall content and scene all visible text transcribed verbatim (preserving layout / tables) numbers, data, axes, chart values (as structured text, not omitted) key objects, colors, layout, UI elements any other detail useful for downstream reasoning This tool does NOT answer questions about the image. It only converts the image to text. After it returns, YOU (the main model) do the reasoning and answer the user yourself, as if you had read the description. Args: image (Optional[str]): URL / data URI / file path / base64 / empty. Empty reads from the system clipboard. instruction (Optional[str]): custom vision instruction; if omitted, a comprehensive default prompt is used. detail (DetailLevel): 'high' (default) for OCR/dense content, 'low' for a quick rough summary. Returns: str: Markdown text describing the image on success. On failure: '[describe_image failed] : '. When to call: - User pasted an image attachment / gave a URL / gave a file path. - User says "看下我的截图" / "look at my screenshot" / "我刚截了张图" (leave `image` empty - the tool reads the clipboard). - You need OCR, table extraction, or chart values from a picture. - IMPORTANT: the message contains an image placeholder like `[Image 1]`, `[Image N]`, `[图片]`, or `[Image attachment]` - this means the user pasted an image but the client/gateway replaced the real image data with a placeholder because the main model has no vision. The image still lives in the OS clipboard. Call this tool with `image` empty to read it from the clipboard, even if the user sent no text at all. When NOT to call: - The user only sent text with no mention of any image. - You already have a textual description and no new image arrived. Examples: - "describe this: https://x.com/a.png" -> image= - "看下我的截图" -> image omitted - "识别 /tmp/chart.png 里的表格" -> image="/tmp/chart.png" - "把这张流程图(base64)转成 Mermaid" -> instruction="...", image=
multimodal_config_statusA	Report whether the required vision env vars are set (never the values). Call once after first wiring the server into a client, to confirm VISION_BASE_URL, VISION_API_KEY and VISION_MODEL are all configured. The API key itself is never exposed; only a boolean. Returns: str: JSON with vision_base_url_set, vision_api_key_set, vision_model.

describe_imageA

Convert an image into structured text so a text-only model can "see" it.

Call this tool whenever the current main model cannot view images directly (e.g. glm-5.2, deepseek-v4-pro, qwen-text) but the user wants you to look at an image. The image source is auto-detected from the image argument:

http(s) URL -> downloaded
data: URI -> base64 extracted
local file path -> read from disk
raw base64 string -> used as-is
empty / None -> read from the system clipboard

The clipboard path is what makes screenshots work without pasting: the user takes a screenshot (Cmd+Shift+4 / Win+Shift+S / scrot) so the image lives in the OS clipboard, then says something like "看下我的截图" or "look at my screenshot" in the chat. You call this tool with no image argument; it reads the clipboard and returns the description.

The image is sent to the configured vision model (any OpenAI-compatible multimodal endpoint: qwen3.7-plus, qwen-vl-max, gpt-4o, llava, etc.) and returned as structured Chinese text covering:

overall content and scene
all visible text transcribed verbatim (preserving layout / tables)
numbers, data, axes, chart values (as structured text, not omitted)
key objects, colors, layout, UI elements
any other detail useful for downstream reasoning

This tool does NOT answer questions about the image. It only converts the image to text. After it returns, YOU (the main model) do the reasoning and answer the user yourself, as if you had read the description.

Args: image (Optional[str]): URL / data URI / file path / base64 / empty. Empty reads from the system clipboard. instruction (Optional[str]): custom vision instruction; if omitted, a comprehensive default prompt is used. detail (DetailLevel): 'high' (default) for OCR/dense content, 'low' for a quick rough summary.

Returns: str: Markdown text describing the image on success. On failure: '[describe_image failed] : '.

When to call: - User pasted an image attachment / gave a URL / gave a file path. - User says "看下我的截图" / "look at my screenshot" / "我刚截了张图" (leave image empty - the tool reads the clipboard). - You need OCR, table extraction, or chart values from a picture. - IMPORTANT: the message contains an image placeholder like [Image 1], [Image N], [图片], or [Image attachment] - this means the user pasted an image but the client/gateway replaced the real image data with a placeholder because the main model has no vision. The image still lives in the OS clipboard. Call this tool with image empty to read it from the clipboard, even if the user sent no text at all.

When NOT to call: - The user only sent text with no mention of any image. - You already have a textual description and no new image arrived.

Examples: - "describe this: https://x.com/a.png" -> image= - "看下我的截图" -> image omitted - "识别 /tmp/chart.png 里的表格" -> image="/tmp/chart.png" - "把这张流程图(base64)转成 Mermaid" -> instruction="...", image=

multimodal_config_statusA

Report whether the required vision env vars are set (never the values).

Call once after first wiring the server into a client, to confirm VISION_BASE_URL, VISION_API_KEY and VISION_MODEL are all configured. The API key itself is never exposed; only a boolean.

Returns: str: JSON with vision_base_url_set, vision_api_key_set, vision_model.

Prompts

Interactive templates invoked by user choice

Name	Description
No prompts

Resources

Contextual data attached and managed by the client

Name	Description
No resources

Server Configuration
Capabilities
Tools
Prompts
Resources

Latest Blog Posts

Your AI Chatbot Just Exposed Your CEO's Salary to an Intern
By Om-Shree-0709 on July 2, 2026.
Agent Identity
MCP Security
OAuth Delegation
Why MCP Servers Need Execution Sandboxing (And Why Your Current Stack Isn't Enough)
By Om-Shree-0709 on June 30, 2026.
Agentic Ai
Prompt Injection
WebAssembly
Lightport: Open-Sourcing Glama's AI Gateway
By punkpeye on April 27, 2026.
OpenAI
open source

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/believe3344/multimodal-mcp'

If you have feedback or need assistance with the MCP directory API, please join our Discord server