describe_image
Converts images to structured text for text-only models. Supports URLs, file paths, clipboard, and base64. Returns detailed description including OCR, chart data, and scene content.
Instructions
Convert an image into structured text so a text-only model can "see" it.
Call this tool whenever the current main model cannot view images directly
(e.g. glm-5.2, deepseek-v4-pro, qwen-text) but the user wants you to look
at an image. The image source is auto-detected from the image argument:
http(s) URL -> downloaded
data: URI -> base64 extracted
local file path -> read from disk
raw base64 string -> used as-is
empty / None -> read from the system clipboard
The clipboard path is what makes screenshots work without pasting: the
user takes a screenshot (Cmd+Shift+4 / Win+Shift+S / scrot) so the image
lives in the OS clipboard, then says something like "看下我的截图" or
"look at my screenshot" in the chat. You call this tool with no image
argument; it reads the clipboard and returns the description.
The image is sent to the configured vision model (any OpenAI-compatible multimodal endpoint: qwen3.7-plus, qwen-vl-max, gpt-4o, llava, etc.) and returned as structured Chinese text covering:
overall content and scene
all visible text transcribed verbatim (preserving layout / tables)
numbers, data, axes, chart values (as structured text, not omitted)
key objects, colors, layout, UI elements
any other detail useful for downstream reasoning
This tool does NOT answer questions about the image. It only converts the image to text. After it returns, YOU (the main model) do the reasoning and answer the user yourself, as if you had read the description.
Args: image (Optional[str]): URL / data URI / file path / base64 / empty. Empty reads from the system clipboard. instruction (Optional[str]): custom vision instruction; if omitted, a comprehensive default prompt is used. detail (DetailLevel): 'high' (default) for OCR/dense content, 'low' for a quick rough summary.
Returns: str: Markdown text describing the image on success. On failure: '[describe_image failed] : '.
When to call:
- User pasted an image attachment / gave a URL / gave a file path.
- User says "看下我的截图" / "look at my screenshot" / "我刚截了张图"
(leave image empty - the tool reads the clipboard).
- You need OCR, table extraction, or chart values from a picture.
- IMPORTANT: the message contains an image placeholder like
[Image 1], [Image N], [图片], or [Image attachment] - this
means the user pasted an image but the client/gateway replaced the
real image data with a placeholder because the main model has no
vision. The image still lives in the OS clipboard. Call this tool
with image empty to read it from the clipboard, even if the user
sent no text at all.
When NOT to call: - The user only sent text with no mention of any image. - You already have a textual description and no new image arrived.
Examples: - "describe this: https://x.com/a.png" -> image= - "看下我的截图" -> image omitted - "识别 /tmp/chart.png 里的表格" -> image="/tmp/chart.png" - "把这张流程图(base64)转成 Mermaid" -> instruction="...", image=
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| image | No | Image source - auto-detected by content. Accepts: (1) http(s) URL - downloaded; (2) data URI 'data:image/png;base64,...' - extracted; (3) local file path - read from disk; (4) raw base64 string - used as-is; (5) empty/omitted - read from the SYSTEM CLIPBOARD (use this when the user took a screenshot and says 'look at my screenshot' but did NOT paste the image into the chat). | |
| instruction | No | Optional instruction overriding the default vision prompt. Use this to focus the description on what you actually need, e.g. '只提取表格中的数字', '识别这张截图里的所有 UI 组件', '把流程图转成 Mermaid 代码'. | |
| detail | No | Image processing detail. 'high' for OCR / chart / dense text; 'low' for a fast rough summary. Some backends ignore this field. | high |
Output Schema
| Name | Required | Description | Default |
|---|---|---|---|
| result | Yes |