What can you do with this server?

The vision-mcp server gives text-only coding agents the ability to "see" and analyze images and videos by routing visual content to a compatible vision model backend and returning structured text results.

Core Tools:

* ui_to_artifact: Convert UI screenshots or design mockups into runnable code (React, Vue, HTML) or structured specifications.
* extract_text_from_screenshot: Verbatim OCR extraction from screenshots, code, terminal output, error messages, and documents.
* diagnose_error_screenshot: Analyze error/crash screenshots to identify root cause, verbatim error text, file/line location, and fix steps.
* understand_technical_diagram: Interpret architecture diagrams, flowcharts, UML, ER, and sequence diagrams.
* analyze_data_visualization: Read charts and dashboards to extract values, trends, anomalies, and insights.
* ui_diff_check: Compare two UI screenshots and enumerate visual and implementation differences.
* image_analysis: General-purpose fallback for understanding any image and answering arbitrary questions.
* video_analysis: Analyze videos (screen recordings, clips); falls back to ffmpeg frame-sampling if the backend lacks native video support.

Key Features:

* Agentic Auto-Zoom: Automatically zooms coarse→fine from the full-resolution original to read tiny text or details that downsampling would lose.
* Flexible detail_level: overview, normal, fine, or auto (zooms only when needed).
* Multiple image sources: Local file paths, file://, http(s):// URLs, data: URIs, 'clipboard' (OS clipboard), or 'latest' (newest file in a drop directory).
* Structured output: Returns both human-readable markdown and machine-readable metadata (confidence score, zoom regions, processing rounds, model/provider used).
* Any OpenAI-compatible backend: Works with GLM, Kimi, OpenAI GPT-4o, local vLLM/Ollama, etc.
* Security: Path allowlisting, SSRF checks on URLs, magic-byte file validation, and size caps.

Which integrations are available for this server?

* Supports local Ollama vision models via an OpenAI-compatible endpoint, enabling private image and video analysis without sending data to external services.
* Integrates with OpenAI's vision API to provide image and video analysis capabilities including OCR, UI-to-code generation, error diagnosis, diagram understanding, chart reading, UI diff, and generic image understanding.
* Integrates with Xiaomi's MiMo vision models (e.g., mimo-v2.5) to perform image and video analysis tasks such as OCR, UI understanding, error diagnosis, and diagram interpretation.

How do I use vision-mcp?

1. Click "Install Server".
2. Wait a few minutes for the server to deploy. Once ready, it will show a "Started" state.
3. In the chat, type @ followed by the MCP server name and your instructions, e.g., "@vision-mcp diagnose this error screenshot".

That's it! The server will respond to your query, and you can continue using it as needed.

vision-mcp

by Pelican0126

TypeScript

Local

vision-mcp

Give eyes to any "blind" coding agent — a local MCP server that lets text-only models see screenshots, error images, design mockups and video.

It hands images to a vision model "out of band" and returns text + machine-readable metadata to the host, so a host that can't see images (GLM coding model, DeepSeek, Qwen-coder, local models…) effectively gains sight. Works with any OpenAI-compatible vision backend (GLM / Kimi / Xiaomi MiMo / OpenAI / local vLLM), and adds server-led agentic zoom to read details that downsampling would otherwise lose.

English

Features

8 task-specific tools: UI→code, OCR, error diagnosis, diagram understanding, chart reading, UI diff, generic image, video.
Any OpenAI-compatible backend via one provider + per-backend profile. Switch with 3 env vars. GLM by default.
Agentic auto-zoom: the server crops & upscales coarse→fine (grid → grounding → precise crop) from the full-resolution original, so tiny text becomes legible. Verified to fix hallucination on small detail (see Verified live).
Universal video: backends without native video automatically get ffmpeg frame-sampling → multi-image.
Dual output: content (structured markdown) + structuredContent (confidence / regions / rounds / warnings / provider / model).
Safe by default: local-path allowlist, URL download + SSRF check (no blind passthrough), magic-byte validation, size caps.
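As an illustration of the magic-byte validation mentioned above, here is a minimal sketch that identifies an image by its leading bytes rather than its file extension. The function names and the set of accepted formats are assumptions made for this example; the source does not state which formats the server accepts or how it implements the check.

```typescript
// Hypothetical helper names; the signatures below are the standard file signatures
// for PNG, JPEG, GIF and WebP, not values taken from the vision-mcp source.
type SniffedType = "png" | "jpeg" | "gif" | "webp" | "unknown";

// True when `bytes` contains `signature` starting at `offset`.
function startsWithBytes(bytes: Uint8Array, signature: number[], offset = 0): boolean {
  if (bytes.length < offset + signature.length) return false;
  return signature.every((b, i) => bytes[offset + i] === b);
}

function sniffImageType(bytes: Uint8Array): SniffedType {
  if (startsWithBytes(bytes, [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a])) return "png";
  if (startsWithBytes(bytes, [0xff, 0xd8, 0xff])) return "jpeg";
  if (startsWithBytes(bytes, [0x47, 0x49, 0x46, 0x38])) return "gif"; // ASCII "GIF8"
  // WebP: ASCII "RIFF", four size bytes, then ASCII "WEBP".
  if (
    startsWithBytes(bytes, [0x52, 0x49, 0x46, 0x46]) &&
    startsWithBytes(bytes, [0x57, 0x45, 0x42, 0x50], 8)
  ) {
    return "webp";
  }
  return "unknown";
}

// Size cap expressed in megabytes, in the spirit of VISION_MAX_IMAGE_MB.
function withinSizeCap(byteLength: number, maxMb: number): boolean {
  return byteLength <= maxMb * 1024 * 1024;
}
```

A file named error.png that actually contains HTML would come back as "unknown" here, which is the kind of mismatch a magic-byte check is meant to catch.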

How it works (30-second mental model)

host tool(image, detail_level)
  → validate + load media (full-res original + downsampled overview; path allowlist; URL SSRF)
  → overview ? single pass : zoomLoop (deterministic grid → model votes/grounding → crop original → early-exit)
  → content(markdown) + structuredContent(metadata)

The vision model is a "consultant" hired for one question; only its text answer comes back to the host. Crops always come from the full-resolution original (never the downsampled overview), so zoom actually recovers detail.
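The control flow above can be sketched roughly as follows. This is not the code in core/zoomLoop.ts: the type names, the 3×3 grid, the 0.8 confidence threshold, and the synchronous `ask` callback (a real backend call would be asynchronous) are all assumptions chosen to keep the example small. What it does show is the property described above: every region is expressed in the coordinates of the full-resolution original, so each crop is taken from the original rather than from a downsampled copy.

```typescript
// Hypothetical types for illustration only.
interface Rect { x: number; y: number; width: number; height: number }
interface Vote { cell: number; confidence: number; answer: string }
// Stand-in for "send this region to the vision model and get its vote".
type AskModel = (region: Rect, round: number) => Vote;

// Deterministic n×n grid over a region, in original-image pixel coordinates.
function gridCells(region: Rect, n: number): Rect[] {
  const cells: Rect[] = [];
  const w = region.width / n;
  const h = region.height / n;
  for (let row = 0; row < n; row++) {
    for (let col = 0; col < n; col++) {
      cells.push({ x: region.x + col * w, y: region.y + row * h, width: w, height: h });
    }
  }
  return cells;
}

function zoomLoop(original: Rect, ask: AskModel, maxRounds = 3, gridSize = 3, threshold = 0.8) {
  let region = original;
  const regions: Rect[] = [];
  let answer = "";
  let confidence = 0;
  let rounds = 0;
  while (rounds < maxRounds) {
    const vote = ask(region, rounds);
    rounds++;
    answer = vote.answer;
    confidence = vote.confidence;
    if (confidence >= threshold) break; // early exit once the answer is clear
    const next = gridCells(region, gridSize)[vote.cell];
    if (!next) break; // out-of-bounds vote: stop rather than guess
    region = next; // the next pass looks at this crop of the original
    regions.push(region);
  }
  return { answer, confidence, rounds, regions };
}
```

With a 900×900 image and a model that first votes for the bottom-right cell with low confidence, the second pass reads the 300×300 region at (600, 600) and the loop exits after two rounds.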

Quick start

1. Install & build (Node ≥ 20)

npm install
npm run build

2. Pick a backend. Default is GLM (z.ai). Any OpenAI-compatible /chat/completions endpoint with image_url support works — set 3 env vars:

export VISION_API_KEY=your_key
export VISION_BASE_URL=https://api.z.ai/api/paas/v4
export VISION_MODEL=glm-4.6v

3. Connect your MCP client (zcode / Cline / Cursor / Claude Desktop …). Add to its mcpServers config:

{
  "mcpServers": {
    "vision": {
      "command": "node",
      "args": ["<ABSOLUTE_PATH>/dist/index.js"],
      "env": {
        "VISION_API_KEY": "your_key",
        "VISION_BASE_URL": "https://api.z.ai/api/paas/v4",
        "VISION_MODEL": "glm-4.6v",
        "VISION_ALLOWED_DIRS": "<ABSOLUTE_PATH_TO_YOUR_PROJECT>"
      }
    }
  }
}

Timeout: reasoning vision models are slow (code/spec generation can take 30–60s+). Set your host's MCP tool timeout generously (≥120s). During deep zoom the server emits notifications/progress, so clients that honor resetTimeoutOnProgress stay alive.
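To make the interaction between the timeout and progress notifications concrete, here is a small model of the behavior, written as a pure function so it can be reasoned about without a real client. The function name and parameters are hypothetical; only the idea that a progress notification can reset the timeout comes from the text above.

```typescript
// Hypothetical model: decides whether a tool call that finishes at `endMs` would time out,
// given the times at which progress notifications arrived.
function wouldTimeOut(
  startMs: number,
  progressAtMs: number[],
  endMs: number,
  timeoutMs: number,
  resetOnProgress: boolean,
): boolean {
  let deadline = startMs + timeoutMs;
  const sorted = [...progressAtMs].sort((a, b) => a - b);
  for (const t of sorted) {
    if (t > endMs) break; // progress after completion is irrelevant
    if (t > deadline) return true; // the deadline passed before this notification arrived
    if (resetOnProgress) deadline = t + timeoutMs; // each notification restarts the clock
  }
  return endMs > deadline;
}
```

For a call that takes 150 s with progress at 50 s and 100 s, a 60 s timeout survives when it resets on progress and fails when it does not, which is why a generous fixed timeout is the safer setting for clients that ignore progress.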

Usage tutorial

The host is a text-only agent — it never sees the image. You tell the agent where the image is (a path under VISION_ALLOWED_DIRS, or a URL / data URI) and which tool to use; the agent calls the tool, the server returns text, the agent continues.
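A path allowlist of the kind VISION_ALLOWED_DIRS implies can be sketched as below. This is an illustrative check, not the server's code: the function name is hypothetical, and it only normalizes paths lexically, so a production check would presumably also resolve symlinks before comparing.

```typescript
import * as path from "node:path";

// Hypothetical helper: true when `candidate` resolves to a location inside one of `allowedDirs`.
// The optional `p` parameter lets callers pick path.posix or path.win32 explicitly.
function isPathAllowed(candidate: string, allowedDirs: string[], p: path.PlatformPath = path): boolean {
  const resolved = p.resolve(candidate);
  return allowedDirs.some((dir) => {
    const rel = p.relative(p.resolve(dir), resolved);
    if (rel === "") return true; // the allowed directory itself
    const escapes = rel === ".." || rel.startsWith(".." + p.sep);
    return !escapes && !p.isAbsolute(rel);
  });
}
```

Resolving before comparing is what defeats traversal such as /proj/../etc/passwd, and comparing via a relative path avoids the prefix trap where /project2 would otherwise look like it sits under /proj.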

Example 1 — diagnose an error screenshot

Save your screenshot at e.g. ./screenshots/error.png (inside an allowed dir).
Ask your agent: "Use the vision MCP diagnose_error_screenshot on ./screenshots/error.png."
The tool returns markdown with the sections ## Root cause / ## Verbatim error / ## Location / ## Fix steps, and the agent uses them to write the fix.
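If a script rather than an agent consumes that output, the sections can be split on their level-two headings. The sketch below assumes the heading names listed above and nothing else about the format; `parseSections` is a hypothetical helper, not part of vision-mcp.

```typescript
// Hypothetical helper: splits markdown into a map of "## Heading" → body text.
function parseSections(markdown: string): Map<string, string> {
  const collected = new Map<string, string[]>();
  let current: string[] | null = null;
  for (const line of markdown.split(/\r?\n/)) {
    const match = /^##\s+(.+?)\s*$/.exec(line); // level-two headings only
    if (match) {
      current = [];
      collected.set(match[1], current);
    } else if (current) {
      current.push(line);
    }
  }
  const sections = new Map<string, string>();
  collected.forEach((lines, heading) => sections.set(heading, lines.join("\n").trim()));
  return sections;
}

// Made-up sample report, used only to exercise the parser.
const sampleReport = [
  "## Root cause",
  "A value is null when the handler runs.",
  "",
  "## Verbatim error",
  "ERR-4096: NPE app.ts:42",
  "## Location",
  "app.ts:42",
  "## Fix steps",
  "1. Guard the value before use.",
].join("\n");

const sampleSections = parseSections(sampleReport);
```

Text before the first heading is ignored, and deeper headings such as ### stay inside the body of the section they belong to.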

Example 2 — read a tiny detail (agentic zoom)

For small text the model can't read at a glance, pass detail_level: "fine":

// the call your agent makes
{ "name": "extract_text_from_screenshot",
  "arguments": { "image": "./screenshots/big.png", "detail_level": "fine",
                 "question": "read the small key code in the bottom-right corner" } }

The server runs the zoom loop: it grids the image, the model votes on which region matters, and the server crops that region from the original and re-reads it, recovering text that a single overview pass would misread. In the result, structuredContent.regions shows where it looked and structuredContent.rounds shows how many passes it took.

detail_level values: overview (single fast pass) · normal · fine (deep zoom) · auto (default — zooms only when needed, early-exits when clear).
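The four values and the default can be captured in a small type, sketched here under the assumption that an omitted detail_level means auto and that only overview skips the zoom loop, as the descriptions above suggest. The names are hypothetical and the server's real schema may be stricter or looser.

```typescript
const DETAIL_LEVELS = ["overview", "normal", "fine", "auto"] as const;
type DetailLevel = (typeof DETAIL_LEVELS)[number];

// Hypothetical validator: undefined falls back to the documented default, "auto".
function parseDetailLevel(value: unknown): DetailLevel {
  if (value === undefined) return "auto";
  if (typeof value === "string" && (DETAIL_LEVELS as readonly string[]).indexOf(value) !== -1) {
    return value as DetailLevel;
  }
  throw new Error(`detail_level must be one of: ${DETAIL_LEVELS.join(", ")}`);
}

// "overview" is the single fast pass; every other level may enter the zoom loop.
function mayZoom(level: DetailLevel): boolean {
  return level !== "overview";
}
```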

Example 3 — analyze a video

{ "name": "video_analysis",
  "arguments": { "video": "./clips/repro.mp4", "question": "what bug is shown?" } }

If the backend has no native video, the server samples frames with ffmpeg and analyzes them as images — so video works on any vision backend.
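One plausible way to do the frame sampling is to pick evenly spaced timestamps and extract one frame at each. The sketch below assumes midpoint sampling and a per-frame ffmpeg invocation; the source does not say which strategy the server uses, and both function names are hypothetical. The ffmpeg options used (-ss, -i, -frames:v, -y) are standard ones.

```typescript
// Hypothetical helper: midpoints of `frames` equal segments of the video, in seconds.
function sampleTimestamps(durationSec: number, frames: number): number[] {
  if (!(durationSec > 0) || !Number.isInteger(frames) || frames <= 0) return [];
  const step = durationSec / frames;
  return Array.from({ length: frames }, (_, i) => Number(((i + 0.5) * step).toFixed(3)));
}

// Argument list for extracting a single frame; pass it to a process spawner of your choice.
function ffmpegFrameArgs(videoPath: string, timestampSec: number, outPath: string): string[] {
  return ["-ss", String(timestampSec), "-i", videoPath, "-frames:v", "1", "-y", outPath];
}
```

Sampling midpoints rather than segment starts avoids always grabbing the very first frame, which in screen recordings is often a blank or loading screen.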

Quick local test (no client needed) — the MCP Inspector:

npx @modelcontextprotocol/inspector node dist/index.js
# then set the env vars in the Inspector UI and call any tool

Or run the bundled scripts: node scripts/smoke.mjs (lists tools), KEY=... node scripts/livetest.mjs (drives all 8 tools end-to-end).

Text-only hosts (e.g. ZCode + GLM-5.2): give a path, don't paste

If your host coding model is text-only (GLM-5.2, DeepSeek, …), do not paste or drag an image into the chat. The host attaches the image to the model's turn, and a strict provider such as the GLM-5.2 endpoint rejects the request (400 Model only support text input). DeepSeek's endpoint silently drops the image instead: the request looks fine, but the model never saw anything. vision-mcp runs downstream of the host, so it cannot intercept a pasted image.

Instead, let the agent pull the image itself so it never reaches the text model — three ways:

File path (zero setup): save the image and reference its path — "use diagnose_error_screenshot on D:\shots\err.png". The host sees only the text path; the MCP reads the file.
image: "clipboard": screenshot to the clipboard (Win+Shift+S) or copy an image, then say "read the error image on my clipboard". The MCP reads the OS clipboard server-side (built-in PowerShell on Windows — no dependency).
image: "latest": set VISION_DROP_DIR to a screenshots folder; the MCP grabs the newest image there.
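The different forms the image argument can take suggest a dispatch step along these lines. The type and function names are hypothetical, and the order of the checks is an assumption; the sketch only classifies the string and leaves loading, allowlisting, and SSRF checks to later stages.

```typescript
// Hypothetical classification of the `image` argument.
type ImageSource =
  | { kind: "clipboard" }
  | { kind: "latest" }
  | { kind: "data-uri"; uri: string }
  | { kind: "url"; url: string }
  | { kind: "file-url"; url: string }
  | { kind: "path"; path: string };

function classifyImageSource(image: string): ImageSource {
  if (image === "clipboard") return { kind: "clipboard" }; // read the OS clipboard server-side
  if (image === "latest") return { kind: "latest" }; // newest file in VISION_DROP_DIR
  if (image.startsWith("data:")) return { kind: "data-uri", uri: image };
  if (/^https?:\/\//i.test(image)) return { kind: "url", url: image };
  if (/^file:\/\//i.test(image)) return { kind: "file-url", url: image };
  return { kind: "path", path: image }; // anything else is treated as a local path
}
```

Because the reserved words are matched exactly, a file that is literally named clipboard would need to be referenced with a directory prefix such as ./clipboard.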

Also: the vision backend (VISION_MODEL) must itself be a vision model — a text-only model like glm-5.2 can't be the backend either (use Doubao-vision / GLM-4.6V / a *-vision model).

Why DeepSeek "just works" but GLM-5.2 400s (both are blind): it's the upstream endpoint, not the model. Volcano's GLM-5.2 endpoint strictly rejects any request that contains an image; DeepSeek's endpoint silently ignores it and lets the model proceed to call this MCP. ZCode embeds the pasted image for both and has no per-model "vision" toggle — so there is no ZCode setting that fixes it; you simply avoid pasting and pull the image via the tool instead.

Verified end-to-end (ZCode + GLM-5.2 host + mimo-v2.5 backend): screenshot to clipboard → ask "look at the clipboard image and diagnose" → GLM-5.2 called image_analysis(image="clipboard") → mimo read ERR-4096: NPE app.ts:42 verbatim and counted the chart bars. No 400.

ZCode tips: turn off 计划模式 (Plan mode) when you want it to actually run the tool (in Plan mode it only plans); on the tool-approval prompt choose "始终允许本项目" (Always allow this project) so it stops asking each time.

Tools

| Tool | Purpose |
| --- | --- |
| `ui_to_artifact` | UI screenshot → code / spec |
| `extract_text_from_screenshot` | verbatim OCR |
| `diagnose_error_screenshot` | error diagnosis (root cause / verbatim / location / fix) |
| `understand_technical_diagram` | architecture / flow / UML / ER / sequence diagrams |
| `analyze_data_visualization` | read charts / dashboards |
| `ui_diff_check` | compare two UI screenshots |
| `image_analysis` | generic image understanding (fallback) |
| `video_analysis` | video understanding (native or frame-sampled) |

Common params: detail_level, question, region, thinking.

Backends

| Backend | `VISION_PROFILE` | `VISION_BASE_URL` | `VISION_MODEL` |
| --- | --- | --- | --- |
| GLM (z.ai) | `glm` | `https://api.z.ai/api/paas/v4` | `glm-4.6v` |
| Kimi (Moonshot) | `kimi` | `https://api.moonshot.ai/v1` | `<kimi vision model>` |
| Xiaomi MiMo | `mimo` | `<your endpoint>/v1` | `mimo-v2.5` (vision build) |
| OpenAI | `openai` | `https://api.openai.com/v1` | `gpt-4o` |
| local vLLM/Ollama | `generic` | `http://localhost:8000/v1` | `<local VLM>` |

Any OpenAI-compatible endpoint with image_url support works with generic.
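The profile mechanism can be pictured as a table of defaults that explicit env vars override. The sketch below copies its values from the table above and leaves unset anything the table gives only as a placeholder; the `PROFILES` constant and `resolveBackend` function are hypothetical, and the real profiles in provider/ may carry more than a URL and a model.

```typescript
interface BackendProfile { baseUrl?: string; model?: string }

// Defaults taken from the Backends table; placeholders there are left undefined here.
const PROFILES: Record<string, BackendProfile> = {
  glm: { baseUrl: "https://api.z.ai/api/paas/v4", model: "glm-4.6v" },
  kimi: { baseUrl: "https://api.moonshot.ai/v1" }, // model must be supplied by the user
  mimo: { model: "mimo-v2.5" }, // endpoint is deployment-specific
  openai: { baseUrl: "https://api.openai.com/v1", model: "gpt-4o" },
  generic: { baseUrl: "http://localhost:8000/v1" }, // model must be supplied by the user
};

function resolveBackend(env: Record<string, string | undefined>) {
  const profile = env.VISION_PROFILE ?? "glm";
  const defaults = PROFILES[profile];
  if (!defaults) throw new Error(`unknown VISION_PROFILE: ${profile}`);
  const baseUrl = env.VISION_BASE_URL ?? defaults.baseUrl;
  const model = env.VISION_MODEL ?? defaults.model;
  if (!baseUrl) throw new Error(`VISION_BASE_URL is required for profile "${profile}"`);
  if (!model) throw new Error(`VISION_MODEL is required for profile "${profile}"`);
  return { profile, baseUrl, model };
}
```

Calling resolveBackend(process.env) with nothing set yields the GLM defaults, while a profile such as generic fails fast until VISION_MODEL is provided.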

Configuration (env)

| Var | Default | Notes |
| --- | --- | --- |
| `VISION_API_KEY` (alias `Z_AI_API_KEY`) | none (required) | backend key |
| `VISION_PROFILE` | `glm` | `glm` / `kimi` / `mimo` / `openai` / `generic` |
| `VISION_BASE_URL` | per profile | OpenAI-compatible endpoint |
| `VISION_MODEL` | per profile | vision model id |
| `VISION_ALLOWED_DIRS` | cwd | dirs allowed for local image paths (`;`/`:` separated) |
| `VISION_ALLOW_URL_PASSTHROUGH` | `false` | forward image URLs to the backend (default: download + SSRF-check → data URI) |
| `VISION_MAX_ZOOM_ROUNDS` | `3` | max agentic zoom rounds |
| `VISION_MAX_EDGE_PX` | `1568` | overview downsample longest edge |
| `VISION_VIDEO_FRAMES` | `8` | frames sampled per video |
| `VISION_MAX_IMAGE_MB` / `_VIDEO_MB` | `10` / `50` | size caps |
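The numeric settings and their defaults can be read with a few lines of parsing, and VISION_MAX_EDGE_PX translates into a simple scale computation. The sketch below is illustrative: `loadLimits` and `overviewSize` are hypothetical names, the full variable name VISION_MAX_VIDEO_MB is inferred from the `_VIDEO_MB` shorthand in the table, and rejecting non-positive or non-integer values is a choice made for this example.

```typescript
type Env = Record<string, string | undefined>;

function intFromEnv(value: string | undefined, fallback: number): number {
  if (value === undefined || value.trim() === "") return fallback;
  const n = Number(value);
  if (!Number.isInteger(n) || n <= 0) throw new Error(`expected a positive integer, got "${value}"`);
  return n;
}

// Defaults mirror the Configuration table.
function loadLimits(env: Env) {
  return {
    allowUrlPassthrough: env.VISION_ALLOW_URL_PASSTHROUGH === "true",
    maxZoomRounds: intFromEnv(env.VISION_MAX_ZOOM_ROUNDS, 3),
    maxEdgePx: intFromEnv(env.VISION_MAX_EDGE_PX, 1568),
    videoFrames: intFromEnv(env.VISION_VIDEO_FRAMES, 8),
    maxImageMb: intFromEnv(env.VISION_MAX_IMAGE_MB, 10),
    maxVideoMb: intFromEnv(env.VISION_MAX_VIDEO_MB, 50), // variable name is an assumption
  };
}

// Size of the downsampled overview: the longest edge is capped, the aspect ratio is kept,
// and images already within the cap are left at their original size.
function overviewSize(width: number, height: number, maxEdgePx: number) {
  const scale = Math.min(1, maxEdgePx / Math.max(width, height));
  return { width: Math.round(width * scale), height: Math.round(height * scale) };
}
```

A 3136×1568 screenshot would be overviewed at 1568×784 under the default cap, which is the loss of detail the zoom loop compensates for by cropping from the original.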

Verified live

Tested against Xiaomi MiMo (mimo-v2.5, the vision build, and mimo-v2.5-pro, which has no image input):

All 8 tools work end-to-end.
Agentic zoom proven: a tiny key code in a corner — overview hallucinated it (REV: 2A-74A1-0); detail_level=fine navigated to the bottom-right and read the correct KEY: ZX-7741-Q consistently across re-runs.
Video frame-sampling: mimo has no native video → auto frame-sampling succeeded.
Blind model degrades gracefully: mimo-v2.5-pro returns 404 No endpoints found that support image input → tool returns isError + a clear message, no crash.
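On the calling side, the dual output and the isError flag could be consumed along these lines. The `ToolResult` interface is a simplified, hypothetical shape: the content, isError, and structuredContent fields follow the MCP tool-result convention and the metadata names listed under Features, but the exact types of those metadata fields are assumptions.

```typescript
// Simplified, assumed shape of a tool result as seen by a client.
interface ToolResult {
  isError?: boolean;
  content: Array<{ type: string; text?: string }>;
  structuredContent?: {
    confidence?: number;
    rounds?: number;
    warnings?: string[];
    provider?: string;
    model?: string;
  };
}

// Hypothetical helper: one-line summary suitable for logging.
function summarizeResult(result: ToolResult): string {
  const text = result.content
    .filter((part) => part.type === "text" && typeof part.text === "string")
    .map((part) => part.text)
    .join("\n");
  if (result.isError) return `vision tool failed: ${text || "no message"}`;
  const meta = result.structuredContent;
  const rounds = meta?.rounds !== undefined ? ` (rounds: ${meta.rounds})` : "";
  return `${text}${rounds}`;
}
```

Checking isError first matters because an error result still carries text content, and treating that text as an answer would hide the failure from the agent.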

Reproduce: KEY=... PROFILE=mimo MODEL=mimo-v2.5 BASE=<endpoint>/v1 node scripts/livetest.mjs.

Privacy

Your images/video are sent to the configured backend API. Don't use untrusted backends for sensitive content; a local deployment (generic profile → local vLLM) keeps data on your machine.

Development

npm run dev    # run source via tsx
npm run build  # tsc → dist/
npm test       # vitest, fully offline (no API key) — 33 tests

Tests cover the zoom state machine (early-exit / budget / parse-fail / out-of-bounds / grounding / tool-calling), media security (magic-bytes / path traversal / SSRF / downscale), tool schemas, and video frame sampling.
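Of the media-security cases listed above, the SSRF check is the easiest to get subtly wrong, so a rough sketch may help. This is not the project's implementation: the function names are hypothetical, IPv6 literals are simply rejected, and the check looks only at the URL's literal host. A production check would presumably also resolve the hostname and validate the resolved addresses.

```typescript
// True for loopback, private, link-local and "this network" IPv4 literals.
function isPrivateIPv4(host: string): boolean {
  const m = /^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/.exec(host);
  if (!m) return false;
  const a = Number(m[1]);
  const b = Number(m[2]);
  return (
    a === 0 ||
    a === 10 ||
    a === 127 ||
    (a === 169 && b === 254) ||
    (a === 172 && b >= 16 && b <= 31) ||
    (a === 192 && b === 168)
  );
}

// Hypothetical pre-download gate for http(s) image URLs.
function isUrlSafeToFetch(raw: string): boolean {
  let url: URL;
  try {
    url = new URL(raw);
  } catch {
    return false; // not a parseable absolute URL
  }
  if (url.protocol !== "http:" && url.protocol !== "https:") return false;
  const host = url.hostname.toLowerCase();
  if (host === "localhost" || host.endsWith(".localhost")) return false;
  if (host.startsWith("[")) return false; // IPv6 literal: rejected outright in this sketch
  return !isPrivateIPv4(host);
}
```

Parsing with the URL class before inspecting the host is deliberate: it normalizes the hostname, so the comparison runs against the host the fetch would actually use rather than against the raw string.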

Architecture

src/: provider/ (OpenAI-compatible client + profiles), core/zoomLoop.ts, media/ (load / transform / security / video), tools/, prompts.ts.

