vision-mcp
Supports local Ollama vision models via an OpenAI-compatible endpoint, enabling private image and video analysis without sending data to external services.
Integrates with OpenAI's vision API to provide image and video analysis capabilities including OCR, UI-to-code generation, error diagnosis, diagram understanding, chart reading, UI diff, and generic image understanding.
Integrates with Xiaomi's MiMo vision models (e.g., mimo-v2.5) to perform image and video analysis tasks such as OCR, UI understanding, error diagnosis, and diagram interpretation.
Click on "Install Server".
Wait a few minutes for the server to deploy. Once ready, it will show a "Started" state.
In the chat, type
@followed by the MCP server name and your instructions, e.g., "@vision-mcpdiagnose this error screenshot"
That's it! The server will respond to your query, and you can continue using it as needed.
Here is a step-by-step guide with screenshots.
vision-mcp
Give eyes to any "blind" coding agent — a local MCP server that lets text-only models see screenshots, error images, design mockups and video.
给所有看不见图的编码 agent 装上眼睛 —— 一个本地 MCP 服务,让纯文本模型也能看懂截图、报错图、设计稿和视频。
English · 中文
It hands images to a vision model "out of band" and returns text + machine-readable metadata to the host, so a host that can't see images (GLM coding model, DeepSeek, Qwen-coder, local models…) effectively gains sight. Works with any OpenAI-compatible vision backend (GLM / Kimi / Xiaomi MiMo / OpenAI / local vLLM), and adds server-led agentic zoom to read details that downsampling would otherwise lose.
English
Features
8 task-specific tools: UI→code, OCR, error diagnosis, diagram understanding, chart reading, UI diff, generic image, video.
Any OpenAI-compatible backend via one provider + per-backend profile. Switch with 3 env vars. GLM by default.
Agentic auto-zoom: server crops & upscales coarse→fine (grid → grounding → precise crop) from the full-resolution original, so tiny text becomes legible. Verified to fix hallucination on small detail (see Verified).
Universal video: backends without native video automatically get ffmpeg frame-sampling → multi-image.
Dual output:
content(structured markdown) +structuredContent(confidence / regions / rounds / warnings / provider / model).Safe by default: local-path allowlist, URL download + SSRF check (no blind passthrough), magic-byte validation, size caps.
How it works (30-second mental model)
host tool(image, detail_level)
→ validate + load media (full-res original + downsampled overview; path allowlist; URL SSRF)
→ overview ? single pass : zoomLoop (deterministic grid → model votes/grounding → crop original → early-exit)
→ content(markdown) + structuredContent(metadata)The vision model is a "consultant" hired for one question; only its text answer comes back to the host. Crops always come from the full-resolution original (never the downsampled overview), so zoom actually recovers detail.
Quick start
1. Install & build (Node ≥ 20)
npm install
npm run build2. Pick a backend. Default is GLM (z.ai). Any OpenAI-compatible /chat/completions endpoint with image_url support works — set 3 env vars:
export VISION_API_KEY=your_key
export VISION_BASE_URL=https://api.z.ai/api/paas/v4
export VISION_MODEL=glm-4.6v3. Connect your MCP client (zcode / Cline / Cursor / Claude Desktop …). Add to its mcpServers config:
{
"mcpServers": {
"vision": {
"command": "node",
"args": ["<ABSOLUTE_PATH>/dist/index.js"],
"env": {
"VISION_API_KEY": "your_key",
"VISION_BASE_URL": "https://api.z.ai/api/paas/v4",
"VISION_MODEL": "glm-4.6v",
"VISION_ALLOWED_DIRS": "<ABSOLUTE_PATH_TO_YOUR_PROJECT>"
}
}
}
}Timeout: reasoning vision models are slow (code/spec generation can take 30–60s+). Set your host's MCP tool timeout generously (≥120s). During deep zoom the server emits
notifications/progress, so clients that honorresetTimeoutOnProgressstay alive.
Usage tutorial
The host is a text-only agent — it never sees the image. You tell the agent where the image is (a path under VISION_ALLOWED_DIRS, or a URL / data URI) and which tool to use; the agent calls the tool, the server returns text, the agent continues.
Example 1 — diagnose an error screenshot
Save your screenshot at e.g.
./screenshots/error.png(inside an allowed dir).Ask your agent: "Use the vision MCP
diagnose_error_screenshoton./screenshots/error.png."The tool returns markdown sections
## Root cause / ## Verbatim error / ## Location / ## Fix steps, and the agent uses it to write the fix.
Example 2 — read a tiny detail (agentic zoom)
For small text the model can't read at a glance, pass detail_level: "fine":
// the call your agent makes
{ "name": "extract_text_from_screenshot",
"arguments": { "image": "./screenshots/big.png", "detail_level": "fine",
"question": "read the small key code in the bottom-right corner" } }The server runs the zoom loop: it grids the image, the model votes which region matters, the server crops that region from the original and re-reads it — recovering text that a single overview pass would misread. structuredContent.regions shows where it looked, rounds how many passes it took.
detail_level values: overview (single fast pass) · normal · fine (deep zoom) · auto (default — zooms only when needed, early-exits when clear).
Example 3 — analyze a video
{ "name": "video_analysis",
"arguments": { "video": "./clips/repro.mp4", "question": "what bug is shown?" } }If the backend has no native video, the server samples frames with ffmpeg and analyzes them as images — so video works on any vision backend.
Quick local test (no client needed) — the MCP Inspector:
npx @modelcontextprotocol/inspector node dist/index.js
# then set the env vars in the Inspector UI and call any toolOr run the bundled scripts: node scripts/smoke.mjs (lists tools), KEY=... node scripts/livetest.mjs (drives all 8 tools end-to-end).
Tools
Tool | Purpose |
| UI screenshot → code / spec |
| verbatim OCR |
| error diagnosis (root cause / verbatim / location / fix) |
| architecture / flow / UML / ER / sequence diagrams |
| read charts / dashboards |
| compare two UI screenshots |
| generic image understanding (fallback) |
| video understanding (native or frame-sampled) |
Common params: detail_level, question, region, thinking.
Backends
Backend |
|
|
|
GLM (z.ai) |
|
|
|
Kimi (Moonshot) |
|
|
|
Xiaomi MiMo |
|
|
|
OpenAI |
|
|
|
local vLLM/Ollama |
|
|
|
Any OpenAI-compatible endpoint with
image_urlsupport works withgeneric.
Configuration (env)
Var | Default | Notes |
| — required | backend key |
|
|
|
| per profile | OpenAI-compatible endpoint |
| per profile | vision model id |
| cwd | dirs allowed for local image paths ( |
|
| forward image URLs to the backend (default: download + SSRF-check → data URI) |
|
| max agentic zoom rounds |
|
| overview downsample longest edge |
|
| frames sampled per video |
|
| size caps |
Verified live
Tested against Xiaomi MiMo (mimo-v2.5 vision / mimo-v2.5-pro blind):
All 8 tools work end-to-end.
Agentic zoom proven: a tiny key code in a corner —
overviewhallucinated it (REV: 2A-74A1-0);detail_level=finenavigated to the bottom-right and read the correctKEY: ZX-7741-Qconsistently across re-runs.Video frame-sampling: mimo has no native video → auto frame-sampling succeeded.
Blind model degrades gracefully:
mimo-v2.5-proreturns404 No endpoints found that support image input→ tool returnsisError+ a clear message, no crash.
Reproduce: KEY=... PROFILE=mimo MODEL=mimo-v2.5 BASE=<endpoint>/v1 node scripts/livetest.mjs.
Privacy
Your images/video are sent to the configured backend API. Don't use untrusted backends for sensitive content; a local deployment (generic profile → local vLLM) keeps data on your machine.
Development
npm run dev # run source via tsx
npm run build # tsc → dist/
npm test # vitest, fully offline (no API key) — 33 testsTests cover the zoom state machine (early-exit / budget / parse-fail / out-of-bounds / grounding / tool-calling), media security (magic-bytes / path traversal / SSRF / downscale), tool schemas, and video frame sampling.
Architecture
src/: provider/ (OpenAI-compatible client + profiles), core/zoomLoop.ts, media/ (load / transform / security / video), tools/, prompts.ts.
Related MCP server: vision-mcp
中文
让纯文本模型也能「看见」:把图交给视觉模型「带外」分析,只把文字结论 + 机器可读元数据回传宿主。支持任何 OpenAI 兼容的视觉后端(GLM / Kimi / 小米 MiMo / OpenAI / 本地 vLLM),并用 server 主导的 agentic 缩放读出降采样会丢失的细节。
特性
8 个任务专用工具:UI 转代码、OCR、报错诊断、技术图理解、图表读数、UI 对比、通用图像、视频理解。
多后端:一个 provider + per-backend profile,改 3 个 env 即切换,GLM 默认。
Agentic 自动缩放:从全分辨率原图由粗到细裁切放大(九宫格 → grounding → 精确裁切),让小字可读;已实测能修正小细节上的幻觉(见已联机验证)。
视频通用:无原生视频能力的后端自动走 ffmpeg 帧采样 → 多图分析。
双输出:
content(结构化 markdown)+structuredContent(confidence / regions / rounds / warnings / provider / model)。默认安全:本地路径白名单、URL 默认下载 + SSRF 校验(不透传)、magic-bytes 校验、大小上限。
工作原理(30 秒心智模型)
宿主 tool(image, detail_level)
→ 校验 + 载媒体(全分辨率原图 + 降采样概览图;路径白名单;URL SSRF)
→ overview ? 单次 : zoomLoop(确定性网格 → 模型投票/grounding → 裁原图 → 早退)
→ content(markdown) + structuredContent(元数据)视觉模型是为「一个问题」临时请来的顾问,只有它的文字答案回到宿主。裁切始终从全分辨率原图取(不是降采样图),所以缩放才能真正找回细节。
快速上手
1. 安装构建(Node ≥ 20)
npm install
npm run build2. 选后端。默认 GLM(z.ai)。任何支持 image_url 的 OpenAI 兼容端点都行,设 3 个 env:
export VISION_API_KEY=你的key
export VISION_BASE_URL=https://api.z.ai/api/paas/v4
export VISION_MODEL=glm-4.6v3. 接入 MCP 客户端(zcode / Cline / Cursor / Claude Desktop……),在其 mcpServers 配置里加:
{
"mcpServers": {
"vision": {
"command": "node",
"args": ["<绝对路径>/dist/index.js"],
"env": {
"VISION_API_KEY": "你的key",
"VISION_BASE_URL": "https://api.z.ai/api/paas/v4",
"VISION_MODEL": "glm-4.6v",
"VISION_ALLOWED_DIRS": "<你的项目绝对路径>"
}
}
}
}超时:推理型视觉模型较慢(生成代码/规格可能 30–60s+)。把宿主的 MCP 工具超时设宽松些(≥120s)。深度缩放期间 server 会发
notifications/progress,支持resetTimeoutOnProgress的客户端可借此保活。
使用教程
宿主是纯文本 agent,永远看不到图。你告诉它图在哪(VISION_ALLOWED_DIRS 下的路径,或 URL / data URI)、用哪个工具;agent 调用工具,server 回文字,agent 继续干活。
例 1 — 诊断报错截图
把截图存到
./screenshots/error.png(在允许目录内)。对 agent 说:「用 vision MCP 的
diagnose_error_screenshot分析./screenshots/error.png」。工具返回
## 根因 / ## 错误原文 / ## 位置 / ## 修复步骤,agent 据此写修复。
例 2 — 读小细节(agentic 缩放)
模型一眼读不清的小字,传 detail_level: "fine":
{ "name": "extract_text_from_screenshot",
"arguments": { "image": "./screenshots/big.png", "detail_level": "fine",
"question": "读出右下角的小密钥码" } }server 跑缩放循环:把图划网格 → 模型投票哪块相关 → 从原图裁该块重读,找回单次概览会读错的文字。structuredContent.regions 显示它看了哪、rounds 显示用了几轮。
detail_level 取值:overview(单次快速)·normal·fine(深度缩放)·auto(默认,需要才缩放、清晰则早退)。
例 3 — 分析视频
{ "name": "video_analysis",
"arguments": { "video": "./clips/repro.mp4", "question": "视频里是什么 bug?" } }后端无原生视频时,server 用 ffmpeg 抽帧当多图分析 —— 任意视觉后端都能处理视频。
本地快测(无需客户端) —— MCP Inspector:
npx @modelcontextprotocol/inspector node dist/index.js
# 在 Inspector 界面里填好 env,点任意工具或用自带脚本:node scripts/smoke.mjs(列工具)、KEY=... node scripts/livetest.mjs(全 8 工具端到端)。
工具
工具 | 用途 |
| UI 截图 → 代码 / 规格 |
| 逐字 OCR |
| 报错诊断(根因/原文/位置/修复) |
| 架构/流程/UML/ER/时序图 |
| 图表/仪表盘读数 |
| 两张 UI 截图对比 |
| 通用图像理解(兜底) |
| 视频理解(原生或帧采样) |
公共参数:detail_level、question、region、thinking。
后端
后端 |
|
|
|
GLM (z.ai) |
|
|
|
Kimi (Moonshot) |
|
|
|
小米 MiMo |
|
|
|
OpenAI |
|
|
|
本地 vLLM/Ollama |
|
|
|
任何 OpenAI 兼容、支持
image_url的端点都能用generic直接接入。
配置(env)
变量 | 默认 | 说明 |
| — 必填 | 后端 key |
|
|
|
| 随 profile | OpenAI 兼容端点 |
| 随 profile | 视觉模型 |
| 当前目录 | 允许读取本地图片的目录( |
|
| 是否把图片 URL 直接透传给后端(默认下载+SSRF 校验后转 data URI) |
|
| agentic 缩放最大轮数 |
|
| 概览图降采样最大边长 |
|
| 每个视频抽帧数 |
|
| 大小上限 |
已联机验证
针对 小米 MiMo(mimo-v2.5 视觉版 / mimo-v2.5-pro 盲版)实测:
8 个工具全部端到端跑通。
Agentic 缩放验证有效:角落小密钥码,
overview单次会编造(REV: 2A-74A1-0);detail_level=fine导航到右下、复跑稳定读出正确的KEY: ZX-7741-Q。视频帧采样:mimo 无原生视频 → 自动抽帧成功。
盲模型优雅降级:
mimo-v2.5-pro返回404 No endpoints found that support image input→ 工具以isError+ 清晰提示返回,不崩溃。
复现:KEY=... PROFILE=mimo MODEL=mimo-v2.5 BASE=<端点>/v1 node scripts/livetest.mjs。
隐私
调用时图片/视频会发送到所配置的视觉后端 API。敏感内容请勿用不可信后端;本地部署(generic → 本地 vLLM)可避免数据外发。
开发
npm run dev # tsx 直接跑源码
npm run build # tsc → dist/
npm test # vitest 离线测试(无需 key)—— 33 个用例测试覆盖:缩放状态机(早退/预算/解析失败/越界/grounding/tool-calling)、媒体安全(magic-bytes/路径越界/SSRF/降采样)、工具 schema、视频帧采样。
架构
src/:provider/(OpenAI 兼容 + profile)、core/zoomLoop.ts、media/(load/transform/security/video)、tools/、prompts.ts。
Maintenance
Resources
Unclaimed servers have limited discoverability.
Looking for Admin?
If you are the server author, to access and configure the admin panel.
Latest Blog Posts
MCP directory API
We provide all the information about MCP servers via our MCP API.
curl -X GET 'https://glama.ai/api/mcp/v1/servers/Pelican0126/vision-mcp'
If you have feedback or need assistance with the MCP directory API, please join our Discord server