describe_image

Convert an image into structured text so a text-only model can "see" it.

Call this tool whenever the current main model cannot view images directly (e.g. glm-5.2, deepseek-v4-pro, qwen-text) but the user wants you to look at an image. The image source is auto-detected from the image argument:

http(s) URL -> downloaded
data: URI -> base64 extracted
local file path -> read from disk
raw base64 string -> used as-is
empty / None -> read from the system clipboard

The clipboard path is what makes screenshots work without pasting: the user takes a screenshot (Cmd+Shift+4 / Win+Shift+S / scrot) so the image lives in the OS clipboard, then says something like "看下我的截图" or "look at my screenshot" in the chat. You call this tool with no image argument; it reads the clipboard and returns the description.

The image is sent to the configured vision model (any OpenAI-compatible multimodal endpoint: qwen3.7-plus, qwen-vl-max, gpt-4o, llava, etc.) and returned as structured Chinese text covering:

overall content and scene
all visible text transcribed verbatim (preserving layout / tables)
numbers, data, axes, chart values (as structured text, not omitted)
key objects, colors, layout, UI elements
any other detail useful for downstream reasoning

This tool does NOT answer questions about the image. It only converts the image to text. After it returns, YOU (the main model) do the reasoning and answer the user yourself, as if you had read the description.

Args: image (Optional[str]): URL / data URI / file path / base64 / empty. Empty reads from the system clipboard. instruction (Optional[str]): custom vision instruction; if omitted, a comprehensive default prompt is used. detail (DetailLevel): 'high' (default) for OCR/dense content, 'low' for a quick rough summary.

Returns: str: Markdown text describing the image on success. On failure: '[describe_image failed] : '.

When to call: - User pasted an image attachment / gave a URL / gave a file path. - User says "看下我的截图" / "look at my screenshot" / "我刚截了张图" (leave image empty - the tool reads the clipboard). - You need OCR, table extraction, or chart values from a picture. - IMPORTANT: the message contains an image placeholder like [Image 1], [Image N], [图片], or [Image attachment] - this means the user pasted an image but the client/gateway replaced the real image data with a placeholder because the main model has no vision. The image still lives in the OS clipboard. Call this tool with image empty to read it from the clipboard, even if the user sent no text at all.

When NOT to call: - The user only sent text with no mention of any image. - You already have a textual description and no new image arrived.

Examples: - "describe this: https://x.com/a.png" -> image= - "看下我的截图" -> image omitted - "识别 /tmp/chart.png 里的表格" -> image="/tmp/chart.png" - "把这张流程图(base64)转成 Mermaid" -> instruction="...", image=

Name	Required	Description	Default
`image`	No	Image source - auto-detected by content. Accepts: (1) http(s) URL - downloaded; (2) data URI 'data:image/png;base64,...' - extracted; (3) local file path - read from disk; (4) raw base64 string - used as-is; (5) empty/omitted - read from the SYSTEM CLIPBOARD (use this when the user took a screenshot and says 'look at my screenshot' but did NOT paste the image into the chat).
`instruction`	No	Optional instruction overriding the default vision prompt. Use this to focus the description on what you actually need, e.g. '只提取表格中的数字', '识别这张截图里的所有 UI 组件', '把流程图转成 Mermaid 代码'.
`detail`	No	Image processing detail. 'high' for OCR / chart / dense text; 'low' for a fast rough summary. Some backends ignore this field.	high

Name

Required

Description

Default

image

Image source - auto-detected by content. Accepts: (1) http(s) URL - downloaded; (2) data URI 'data:image/png;base64,...' - extracted; (3) local file path - read from disk; (4) raw base64 string - used as-is; (5) empty/omitted - read from the SYSTEM CLIPBOARD (use this when the user took a screenshot and says 'look at my screenshot' but did NOT paste the image into the chat).

instruction

Optional instruction overriding the default vision prompt. Use this to focus the description on what you actually need, e.g. '只提取表格中的数字', '识别这张截图里的所有 UI 组件', '把流程图转成 Mermaid 代码'.

detail

Image processing detail. 'high' for OCR / chart / dense text; 'low' for a fast rough summary. Some backends ignore this field.

high

Name	Required	Description	Default
`result`	Yes

Name

Required

Description

Default

result

Yes

multimodal-mcp

Instructions

Input Schema

Output Schema

Tool Definition Quality

Other Tools

Latest Blog Posts

MCP directory API