Skip to main content
Glama

describe_image

Read-onlyIdempotent

Converts images to structured text for text-only models. Supports URLs, file paths, clipboard, and base64. Returns detailed description including OCR, chart data, and scene content.

Instructions

Convert an image into structured text so a text-only model can "see" it.

Call this tool whenever the current main model cannot view images directly (e.g. glm-5.2, deepseek-v4-pro, qwen-text) but the user wants you to look at an image. The image source is auto-detected from the image argument:

  • http(s) URL -> downloaded

  • data: URI -> base64 extracted

  • local file path -> read from disk

  • raw base64 string -> used as-is

  • empty / None -> read from the system clipboard

The clipboard path is what makes screenshots work without pasting: the user takes a screenshot (Cmd+Shift+4 / Win+Shift+S / scrot) so the image lives in the OS clipboard, then says something like "看下我的截图" or "look at my screenshot" in the chat. You call this tool with no image argument; it reads the clipboard and returns the description.

The image is sent to the configured vision model (any OpenAI-compatible multimodal endpoint: qwen3.7-plus, qwen-vl-max, gpt-4o, llava, etc.) and returned as structured Chinese text covering:

  • overall content and scene

  • all visible text transcribed verbatim (preserving layout / tables)

  • numbers, data, axes, chart values (as structured text, not omitted)

  • key objects, colors, layout, UI elements

  • any other detail useful for downstream reasoning

This tool does NOT answer questions about the image. It only converts the image to text. After it returns, YOU (the main model) do the reasoning and answer the user yourself, as if you had read the description.

Args: image (Optional[str]): URL / data URI / file path / base64 / empty. Empty reads from the system clipboard. instruction (Optional[str]): custom vision instruction; if omitted, a comprehensive default prompt is used. detail (DetailLevel): 'high' (default) for OCR/dense content, 'low' for a quick rough summary.

Returns: str: Markdown text describing the image on success. On failure: '[describe_image failed] : '.

When to call: - User pasted an image attachment / gave a URL / gave a file path. - User says "看下我的截图" / "look at my screenshot" / "我刚截了张图" (leave image empty - the tool reads the clipboard). - You need OCR, table extraction, or chart values from a picture. - IMPORTANT: the message contains an image placeholder like [Image 1], [Image N], [图片], or [Image attachment] - this means the user pasted an image but the client/gateway replaced the real image data with a placeholder because the main model has no vision. The image still lives in the OS clipboard. Call this tool with image empty to read it from the clipboard, even if the user sent no text at all.

When NOT to call: - The user only sent text with no mention of any image. - You already have a textual description and no new image arrived.

Examples: - "describe this: https://x.com/a.png" -> image= - "看下我的截图" -> image omitted - "识别 /tmp/chart.png 里的表格" -> image="/tmp/chart.png" - "把这张流程图(base64)转成 Mermaid" -> instruction="...", image=

Input Schema

TableJSON Schema
NameRequiredDescriptionDefault
imageNoImage source - auto-detected by content. Accepts: (1) http(s) URL - downloaded; (2) data URI 'data:image/png;base64,...' - extracted; (3) local file path - read from disk; (4) raw base64 string - used as-is; (5) empty/omitted - read from the SYSTEM CLIPBOARD (use this when the user took a screenshot and says 'look at my screenshot' but did NOT paste the image into the chat).
instructionNoOptional instruction overriding the default vision prompt. Use this to focus the description on what you actually need, e.g. '只提取表格中的数字', '识别这张截图里的所有 UI 组件', '把流程图转成 Mermaid 代码'.
detailNoImage processing detail. 'high' for OCR / chart / dense text; 'low' for a fast rough summary. Some backends ignore this field.high

Output Schema

TableJSON Schema
NameRequiredDescriptionDefault
resultYes
Behavior4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Annotations already declare readOnlyHint=true and destructiveHint=false, so safety is clear. The description adds context about clipboard reading, failure behavior, and the tool's limited role (no reasoning). No contradictions.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness4/5

Is the description appropriately sized, front-loaded, and free of redundancy?

Structured with sections, bullet points, and examples. Front-loaded with core purpose. Could be slightly more concise (e.g., Args section somewhat repeats schema), but overall well-organized and informative.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Covers all necessary aspects: when to use, parameter handling, clipboard trick, failure mode, output format (Markdown). Output schema exists, so return value explanation is sufficient. No gaps given complexity and sibling context.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters5/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema coverage is 100%, but description enriches each parameter with auto-detection rules, clipboard fallback for 'image', custom instruction purpose, and detail level differences. Provides examples for usage.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose5/5

Does the description clearly state what the tool does and how it differs from similar tools?

Clearly states the tool converts an image into structured text for text-only models. Distinguishes from sibling 'multimodal_config_status' by focusing on image description.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines5/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

Provides explicit when-to-call scenarios (image attachments, screenshots, placeholders, OCR needs) and when-not-to-call (no image, already have description). Also explains the tool does not answer questions, leaving reasoning to the main model.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

Install Server

Other Tools

Latest Blog Posts

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/believe3344/multimodal-mcp'

If you have feedback or need assistance with the MCP directory API, please join our Discord server