Server Configuration

Describes the environment variables required to run the server.

| Name | Required | Description | Default |
| --- | --- | --- | --- |
| VISION_MODEL | Yes | Model name, e.g., qwen3.7-plus | |
| VISION_API_KEY | Yes | API key for the vision model | |
| VISION_BASE_URL | Yes | Base URL for the vision model API, e.g., https://dashscope.aliyuncs.com/compatible-mode/v1 | |
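
A minimal sketch of how a launcher script might check these three variables before starting the server. The helper name `missing_vision_vars` is hypothetical and not part of the server; treating a blank value as unset is also an assumption.

```python
import os

# The three variable names listed in the table above.
REQUIRED_VARS = ("VISION_MODEL", "VISION_API_KEY", "VISION_BASE_URL")


def missing_vision_vars(env=None):
    """Return the required variable names that are unset or blank.

    `env` defaults to the process environment; passing a dict makes the
    check easy to exercise without touching real settings.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_vision_vars()
    if missing:
        print("Missing required variables: " + ", ".join(missing))
    else:
        print("All required vision variables are set.")
```

Only the variable names are reported, so the API key value never appears in the output.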

Capabilities

Features and capabilities supported by this server

| Capability | Details |
| --- | --- |
| tools | { "listChanged": false } |
| prompts | { "listChanged": false } |
| resources | { "subscribe": false, "listChanged": false } |
| experimental | {} |

Tools

Functions exposed to the LLM to take actions

describe_image

Convert an image into structured text so a text-only model can "see" it.

Call this tool whenever the current main model cannot view images directly (e.g. glm-5.2, deepseek-v4-pro, qwen-text) but the user wants you to look at an image. The image source is auto-detected from the image argument:

  • http(s) URL -> downloaded

  • data: URI -> base64 extracted

  • local file path -> read from disk

  • raw base64 string -> used as-is

  • empty / None -> read from the system clipboard
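
The auto-detection rules listed above could be sketched roughly as follows. This is an illustration only, not the server's actual code: the function name `classify_image_source`, the order of the checks, and the `"unknown"` fallback are all assumptions.

```python
import base64
import binascii
import os
from typing import Optional


def classify_image_source(image: Optional[str]) -> str:
    """Guess which kind of image source a string represents."""
    # Empty or None: the tool falls back to the system clipboard.
    if image is None or not image.strip():
        return "clipboard"

    value = image.strip()
    lowered = value.lower()

    if lowered.startswith(("http://", "https://")):
        return "url"
    if lowered.startswith("data:"):
        return "data_uri"
    if os.path.isfile(value):
        return "file"

    # Last resort: accept the value only if it is strictly valid base64.
    try:
        base64.b64decode(value, validate=True)
        return "base64"
    except (binascii.Error, ValueError):
        return "unknown"
```

Checking for an existing file before attempting base64 decoding avoids misreading a short file name as encoded data; a real implementation may order or validate these cases differently.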

The clipboard path is what makes screenshots work without pasting: the user takes a screenshot (Cmd+Shift+4 / Win+Shift+S / scrot) so the image lives in the OS clipboard, then says something like "看下我的截图" or "look at my screenshot" in the chat. You call this tool with no image argument; it reads the clipboard and returns the description.

The image is sent to the configured vision model (any OpenAI-compatible multimodal endpoint: qwen3.7-plus, qwen-vl-max, gpt-4o, llava, etc.) and returned as structured Chinese text covering:

  • overall content and scene

  • all visible text transcribed verbatim (preserving layout / tables)

  • numbers, data, axes, chart values (as structured text, not omitted)

  • key objects, colors, layout, UI elements

  • any other detail useful for downstream reasoning

This tool does NOT answer questions about the image. It only converts the image to text. After it returns, YOU (the main model) do the reasoning and answer the user yourself, as if you had read the description.

Args:

  • image (Optional[str]): URL / data URI / file path / base64 / empty. Empty reads from the system clipboard.

  • instruction (Optional[str]): custom vision instruction; if omitted, a comprehensive default prompt is used.

  • detail (DetailLevel): 'high' (default) for OCR/dense content, 'low' for a quick rough summary.

Returns: str: Markdown text describing the image on success. On failure: '[describe_image failed] : '.
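
Because failures come back as ordinary strings rather than exceptions, a caller may want to check for the failure prefix before using the result. The sketch below assumes only the `[describe_image failed]` prefix shown above; the helper name `split_tool_result` is hypothetical, and the exact layout of the text after the prefix is not specified here.

```python
from typing import Tuple

# Prefix quoted in the Returns description above.
FAILURE_PREFIX = "[describe_image failed]"


def split_tool_result(result: str) -> Tuple[bool, str]:
    """Return (ok, text) for a describe_image result string.

    On success, text is the description unchanged. On failure, text is
    whatever follows the failure prefix, with surrounding whitespace removed.
    """
    if result.startswith(FAILURE_PREFIX):
        return False, result[len(FAILURE_PREFIX):].strip()
    return True, result
```

A client could surface the failure text to the user instead of passing it to the main model as if it were an image description.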

When to call:

  • User pasted an image attachment / gave a URL / gave a file path.

  • User says "看下我的截图" ("take a look at my screenshot") / "look at my screenshot" / "我刚截了张图" ("I just took a screenshot"). Leave image empty; the tool reads the clipboard.

  • You need OCR, table extraction, or chart values from a picture.

  • IMPORTANT: the message contains an image placeholder like [Image 1], [Image N], [图片], or [Image attachment]. This means the user pasted an image but the client/gateway replaced the real image data with a placeholder because the main model has no vision. The image still lives in the OS clipboard. Call this tool with image empty to read it from the clipboard, even if the user sent no text at all.

When NOT to call:

  • The user only sent text with no mention of any image.

  • You already have a textual description and no new image arrived.

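Examples:

  • "describe this: https://x.com/a.png" -> image=

  • "看下我的截图" ("take a look at my screenshot") -> image omitted

  • "识别 /tmp/chart.png 里的表格" ("extract the table in /tmp/chart.png") -> image="/tmp/chart.png"

  • "把这张流程图(base64)转成 Mermaid" ("convert this flowchart (base64) to Mermaid") -> instruction="...", image=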

multimodal_config_status

Report whether the required vision env vars are set (never the values).

Call once after first wiring the server into a client to confirm that VISION_BASE_URL, VISION_API_KEY and VISION_MODEL are all configured. The API key itself is never exposed; only a boolean indicating whether it is set is returned.

Returns: str: JSON with vision_base_url_set, vision_api_key_set, vision_model.
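
A client-side check of this status string might look like the sketch below. The three field names come from the description above; treating the two `*_set` fields as booleans and `vision_model` as a non-empty model name is an assumption, and `config_is_ready` is a hypothetical helper.

```python
import json


def config_is_ready(status_json: str) -> bool:
    """Return True when the status JSON reports all three settings present."""
    status = json.loads(status_json)
    return bool(
        status.get("vision_base_url_set")
        and status.get("vision_api_key_set")
        and status.get("vision_model")
    )


if __name__ == "__main__":
    # Illustrative payload only; the real tool output may differ in shape.
    sample = json.dumps({
        "vision_base_url_set": True,
        "vision_api_key_set": False,
        "vision_model": "qwen3.7-plus",
    })
    print(config_is_ready(sample))
```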

Prompts

Interactive templates invoked by user choice

No prompts

Resources

Contextual data attached and managed by the client

No resources

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/believe3344/multimodal-mcp'

If you have feedback or need assistance with the MCP directory API, please join our Discord server.