Give Kiro Eyes: Reading Diagrams — Even the Ones Buried in Documents
Kiro is great at reading code, configs, and docs. But hand it a PNG architecture diagram and the file tools shrug:
Caught error reading: ... File seems to be binary and cannot be opened as text
The file-reading tools treat everything as text, so a binary image just bounces
off. It gets worse: a huge amount of architecture knowledge doesn't even live in
loose .png files — it's embedded inside documents. Word and Confluence
"Export to Word" produce a single MHTML file (often with a .doc extension),
and the diagrams are buried inside that envelope. There's no image on disk to
point at, so neither the file tools nor a vision tool can see them.
This post builds a small two-part toolkit that closes both gaps:
An extractor that pulls embedded images out of .doc/MHTML documents into real image files, organized by the document's section structure.
A tiny Model Context Protocol (MCP) server that hands those images straight to Kiro so it can read them with its own vision: no external vision API, no API key, no per-image cost.
By the end, you can take a folder of exported design docs and ask Kiro to "summarize the architecture section by section," and it will actually see every diagram.
The key insight
There are two ways to make an agent "read" an image:
Call an external vision API (OpenAI, Anthropic, Google) inside the MCP server, get back a text description, and hand that text to Kiro. This works, but it needs an API key, costs money per image, and Kiro only ever sees someone else's description — not the image itself.
Hand the raw image to Kiro directly. Kiro is already a multimodal model. MCP has a first-class ImageContent type for exactly this. If the server reads the file, base64-encodes it, and returns ImageContent, Kiro looks at the actual pixels with its own vision.
Option 2 is simpler, free, and higher fidelity. That's what we'll build — and we'll feed it from an extractor that frees diagrams trapped inside documents.
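To make option 2 concrete, here is a minimal sketch of the payload involved. It uses a plain dict shaped like MCP's ImageContent (type, data, mimeType) so it runs without the SDK; the helper name to_image_payload is hypothetical, and the real server further down returns the SDK's ImageContent object instead.

```python
import base64


def to_image_payload(raw: bytes, mime_type: str) -> dict:
    """Build a dict shaped like MCP ImageContent (illustrative helper, not SDK API)."""
    return {
        "type": "image",
        "data": base64.b64encode(raw).decode("ascii"),  # base64 text, not raw bytes
        "mimeType": mime_type,
    }


# The 8-byte PNG signature stands in for the bytes of a real image file.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
payload = to_image_payload(PNG_SIGNATURE, "image/png")
print(payload["type"], payload["mimeType"], len(payload["data"]))
```

The host model receives the base64 text plus the media type and decodes the pixels itself; no description is generated on the server side.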
The full pipeline
                 step 1: extract                    step 2: read
┌──────────────┐   (stdlib only)   ┌──────────────┐   tool call   ┌──────────┐
│ *.doc /      │ ────────────────► │ image files  │ ────────────► │   Kiro   │
│ MHTML docs   │   extract_doc_    │ on disk      │ read_image /  │ (model)  │
│ (diagrams    │   images.py       │ (organized   │ read_all_     │          │
│  embedded)   │                   │  by section) │ images        │          │
└──────────────┘                   └──────────────┘               └────┬─────┘
                                         ▲                             │
                                         │    ImageContent (base64)    │
                                         └─────────────────────────────┘
                                                        ▼
                                          Kiro "sees" each diagram with
                                          its own vision and explains it
Two cooperating pieces:
extract_doc_images.py: turns "diagrams locked inside a document" into "image files on disk," mirroring the document's heading hierarchy so each diagram keeps its section context.
vision_server.py: an MCP server with read_image and read_all_images tools that return ImageContent. Kiro does the actual "looking."
If your diagrams are already loose .png/.jpg files, you can skip step 1 and
go straight to the MCP server. But for design docs exported from a wiki, step 1
is what makes them readable at all.
Step 1 — Extract images from documents
Many documentation systems export a page as a single MHTML file with a .doc
extension. Inside that envelope the diagrams are real binary images (PNG, JPG,
etc.), but they're attached as MIME parts, not saved as files. extract_doc_images.py
parses the envelope (using Python's built-in email module — no third-party
deps), pulls every embedded image out, and writes it to disk.
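The extractor script itself is not reproduced in this post, so the following is only a sketch of the MIME-walking idea described above, using the stdlib email package. The function name iter_embedded_images and the tiny sample document are assumptions made for illustration.

```python
import email


def iter_embedded_images(mhtml: bytes):
    """Yield (name, content_type, payload_bytes) for every image/* MIME part."""
    msg = email.message_from_bytes(mhtml)
    for part in msg.walk():
        if part.get_content_maintype() != "image":
            continue
        # MHTML parts are usually named by Content-Location; fall back to a filename.
        name = part.get("Content-Location") or part.get_filename() or "unnamed"
        # decode=True undoes the Content-Transfer-Encoding (typically base64).
        yield name, part.get_content_type(), part.get_payload(decode=True)


# A hand-written, minimal MHTML envelope: one HTML part and one PNG part.
SAMPLE = (
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/related; boundary="XX"\r\n'
    b"\r\n"
    b"--XX\r\n"
    b"Content-Type: text/html\r\n"
    b"\r\n"
    b"<html><body><h1>Overview</h1><img src='img1.png'></body></html>\r\n"
    b"--XX\r\n"
    b"Content-Type: image/png\r\n"
    b"Content-Transfer-Encoding: base64\r\n"
    b"Content-Location: img1.png\r\n"
    b"\r\n"
    b"iVBORw0KGgo=\r\n"
    b"--XX--\r\n"
)

for name, ctype, data in iter_embedded_images(SAMPLE):
    print(name, ctype, len(data), "bytes")
```

Real exports carry many more headers and much larger payloads, but the shape is the same: walk the parts, keep the image/* ones, decode, and write the bytes out.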
Crucially, it walks the document's headings (h1 > h2 > h3 ...) as it goes and
drops each image into the folder of the deepest section that owns it. So an
image under "2. Solution > 2.1 Network" lands in
.../2. Solution/2.1 Network/. That folder structure is gold later: the names
tell you — and the model — exactly which section each diagram belongs to.
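One way to picture the heading walk is a small HTML parser that keeps a stack of open headings and tags each image with the current stack. This is a sketch under assumptions: the class name SectionImageMapper is hypothetical, and the real script may track sections differently.

```python
from html.parser import HTMLParser


class SectionImageMapper(HTMLParser):
    """Record, for each <img>, the h1..h6 path that is open when it appears."""

    def __init__(self):
        super().__init__()
        self.stack = []     # (level, title) pairs for the current heading path
        self.images = []    # (section_path_tuple, src) pairs
        self._level = None  # level of the heading being read, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            self._level, self._text = int(tag[1]), []
        elif tag == "img":
            src = dict(attrs).get("src", "")
            self.images.append((tuple(title for _, title in self.stack), src))

    def handle_data(self, data):
        if self._level is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._level is not None and tag == f"h{self._level}":
            title = " ".join("".join(self._text).split())
            # A new heading closes every open section at the same or a deeper level.
            while self.stack and self.stack[-1][0] >= self._level:
                self.stack.pop()
            self.stack.append((self._level, title))
            self._level = None


mapper = SectionImageMapper()
mapper.feed(
    "<h1>2. Solution</h1><img src='a.png'>"
    "<h2>2.1 Network</h2><img src='b.png'>"
    "<h1>3. Operations</h1><img src='c.png'>"
)
for path, src in mapper.images:
    print(" > ".join(path), "->", src)
```

Joining each path tuple with the platform's path separator gives the nested folder an image would be written to.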
python extract_doc_images.py ./docs
# -> writes images to ./docs/extracted_images/<doc-name>/<section>/...
You'll get a short report like:
[OK] design-overview.doc: extracted 12 embedded image(s), 0 external reference(s) skipped
[OK] network-flows.doc: extracted 8 embedded image(s), 1 external reference(s) skipped
Images written to: ./docs/extracted_images
The script auto-detects PNG/JPG/GIF/BMP/WEBP/SVG by magic bytes, sanitizes section names into valid folder names, and skips external (non-embedded) image references.
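Magic-byte detection and name sanitizing can look roughly like the sketch below. The signature table uses the well-known file signatures for each format; the function names, the set of replaced characters, and the 80-character cap are illustrative assumptions rather than the script's actual rules.

```python
import re
from typing import Optional

# (leading bytes, extension) pairs; WEBP and SVG need the extra checks below.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", ".png"),
    (b"\xff\xd8\xff", ".jpg"),
    (b"GIF87a", ".gif"),
    (b"GIF89a", ".gif"),
    (b"BM", ".bmp"),
]


def sniff_extension(data: bytes) -> Optional[str]:
    """Guess an image extension from the first bytes, or None if unrecognized."""
    for magic, ext in SIGNATURES:
        if data.startswith(magic):
            return ext
    # WEBP is a RIFF container: "RIFF", four size bytes, then "WEBP".
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return ".webp"
    # SVG is text, so look for an <svg tag near the start instead.
    if b"<svg" in data[:1024].lower():
        return ".svg"
    return None


def sanitize_folder_name(title: str, max_len: int = 80) -> str:
    """Replace characters that are invalid in folder names and tidy whitespace."""
    cleaned = re.sub(r'[<>:"/\\|?*\x00-\x1f]', "_", title)
    cleaned = " ".join(cleaned.split()).strip(". ")
    return cleaned[:max_len].rstrip(". ") or "untitled"


print(sniff_extension(b"\x89PNG\r\n\x1a\n" + b"\x00" * 8))
print(sanitize_folder_name("2.1 Network: Ingress/Egress"))
```

Sniffing the bytes instead of trusting the MIME header or file name helps when an export labels parts inconsistently.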
Keep the extracted folder out of version control if the documents are internal: the diagrams and their folder names can reveal sensitive detail. The included .gitignore already ignores extracted_images/.
Step 2 — Install the MCP server's dependencies
The vision server needs only the MCP SDK and Pillow (for resizing / format conversion):
pip install "mcp>=1.0.0" "Pillow>=10.0.0"
No ANTHROPIC_API_KEY, no OPENAI_API_KEY. There is no external API call.
Step 3 — The vision MCP server
Save this as vision_server.py. It reads an image, downscales it if needed, and
returns ImageContent so Kiro sees the pixels directly:
"""
MCP Server: Vision Reader (native model vision)
Reads an image file, base64-encodes it, and returns ImageContent so the
host model (Kiro) can look at it directly. No external API key required.
"""
import base64
import io
from pathlib import Path

from mcp.server import Server
from mcp.server.stdio import stdio_server
from mcp.types import TextContent, ImageContent, Tool

app = Server("vision-reader")

SUPPORTED = {".png", ".jpg", ".jpeg", ".gif", ".webp", ".bmp"}
MEDIA_TYPE = {
    ".png": "image/png", ".jpg": "image/jpeg", ".jpeg": "image/jpeg",
    ".gif": "image/gif", ".webp": "image/webp", ".bmp": "image/png",
}
MAX_DIMENSION = 1568          # longest edge (px) recommended for vision
MAX_BASE64_BYTES = 4_500_000  # ~4.5 MB after base64 encoding


def resolve_path(file_path: str) -> Path:
    p = Path(file_path)
    return p if p.is_absolute() else Path.cwd() / p


def image_to_base64(path: Path) -> tuple[str, str]:
    """Read an image, normalize/shrink it, return (base64, media_type)."""
    ext = path.suffix.lower()
    try:
        from PIL import Image
    except ImportError:
        # Pillow is missing: send the file bytes unchanged.
        with open(path, "rb") as f:
            return base64.standard_b64encode(f.read()).decode(), MEDIA_TYPE.get(ext, "image/png")

    img = Image.open(path)
    if img.mode in ("RGBA", "LA", "P"):
        img = img.convert("RGBA") if "A" in img.mode else img.convert("RGB")

    # Downscale if the longest edge is too large.
    longest = max(img.size)
    if longest > MAX_DIMENSION:
        scale = MAX_DIMENSION / longest
        img = img.resize((max(1, int(img.size[0] * scale)),
                          max(1, int(img.size[1] * scale))), Image.LANCZOS)

    # Prefer PNG (keeps diagram text crisp).
    buf = io.BytesIO()
    (img.convert("RGB") if img.mode == "RGBA" else img).save(buf, format="PNG", optimize=True)
    data = base64.standard_b64encode(buf.getvalue()).decode()
    if len(data) <= MAX_BASE64_BYTES:
        return data, "image/png"

    # Too big -> fall back to JPEG with decreasing quality.
    rgb = img.convert("RGB")
    for quality in (90, 80, 70, 60, 50):
        buf = io.BytesIO()
        rgb.save(buf, format="JPEG", quality=quality, optimize=True)
        data = base64.standard_b64encode(buf.getvalue()).decode()
        if len(data) <= MAX_BASE64_BYTES:
            return data, "image/jpeg"
    return data, "image/jpeg"


@app.list_tools()
async def list_tools() -> list[Tool]:
    return [
        Tool(
            name="read_image",
            description="Read an image file (PNG/JPG/JPEG/WEBP/GIF/BMP) and return "
                        "it for the model to analyze with vision. Great for "
                        "architecture diagrams, flowcharts, and screenshots.",
            inputSchema={
                "type": "object",
                "properties": {
                    "file_path": {"type": "string",
                                  "description": "Relative or absolute path to the image."},
                    "question": {"type": "string", "default": "",
                                 "description": "Optional question to guide analysis."},
                },
                "required": ["file_path"],
            },
        ),
        Tool(
            name="read_all_images",
            description="Read every image in a folder (optionally recursive) and "
                        "return them for the model to analyze. Pair this with the "
                        "doc extractor to read whole design docs at once.",
            inputSchema={
                "type": "object",
                "properties": {
                    "folder_path": {"type": "string", "default": "."},
                    "question": {"type": "string", "default": ""},
                    "recursive": {"type": "boolean", "default": False},
                    "max_images": {"type": "integer", "default": 20},
                },
                "required": [],
            },
        ),
    ]


def _image_content(path: Path) -> ImageContent:
    data, media_type = image_to_base64(path)
    return ImageContent(type="image", data=data, mimeType=media_type)


@app.call_tool()
async def call_tool(name: str, arguments: dict) -> list:
    if name == "read_image":
        path = resolve_path(arguments.get("file_path", ""))
        if not path.is_file() or path.suffix.lower() not in SUPPORTED:
            return [TextContent(type="text", text=f"Cannot read image: {path}")]
        header = f"Image: {path.name}"
        if arguments.get("question"):
            header += f"\nQuestion: {arguments['question']}"
        return [TextContent(type="text", text=header), _image_content(path)]

    if name == "read_all_images":
        folder = resolve_path(arguments.get("folder_path", "."))
        if not folder.is_dir():
            return [TextContent(type="text", text=f"Not a folder: {folder}")]
        pattern = "**/*" if arguments.get("recursive") else "*"
        images = sorted(f for f in folder.glob(pattern)
                        if f.is_file() and f.suffix.lower() in SUPPORTED)
        images = images[: int(arguments.get("max_images", 20))]
        if not images:
            return [TextContent(type="text", text=f"No images in: {folder}")]
        out: list = [TextContent(type="text", text=f"Found {len(images)} image(s).")]
        for img in images:
            out.append(TextContent(type="text", text=f"--- {img.name} ---"))
            out.append(_image_content(img))
        return out

    return [TextContent(type="text", text=f"Unknown tool: {name}")]


async def main():
    async with stdio_server() as (read_stream, write_stream):
        await app.run(read_stream, write_stream, app.create_initialization_options())


if __name__ == "__main__":
    import asyncio
    asyncio.run(main())
The full version in this repo also includes per-file error handling and a
max_images cap; the snippet above is the heart of it.
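A note on the MAX_BASE64_BYTES budget: base64 turns every 3 raw bytes into 4 characters, so a limit on the encoded size implies a smaller limit on the image bytes. The sketch below works that out for the 4,500,000 value used in the server; the helper names are hypothetical.

```python
import base64

MAX_BASE64_BYTES = 4_500_000  # same value as in vision_server.py


def base64_len(raw_len: int) -> int:
    """Length of standard padded base64 output for raw_len input bytes."""
    return 4 * ((raw_len + 2) // 3)


def max_raw_bytes(budget: int = MAX_BASE64_BYTES) -> int:
    """Largest raw size whose base64 encoding still fits in the budget."""
    return (budget // 4) * 3


# Cross-check the formula against the real encoder on a small input.
assert base64_len(10) == len(base64.b64encode(b"x" * 10))

print(max_raw_bytes())  # 3375000
```

So the encoded PNG or JPEG has to come in at about 3.4 MB before encoding, which is what the downscale-then-recompress loop in image_to_base64 is working toward.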
Step 4 — Register the server in Kiro
Kiro reads MCP config from .kiro/settings/mcp.json (workspace-level) or
~/.kiro/settings/mcp.json (user-level). Add the server:
{
  "mcpServers": {
    "vision-reader": {
      "command": "python",
      "args": ["/absolute/path/to/vision_server.py"],
      "disabled": false,
      "autoApprove": ["read_image", "read_all_images"]
    }
  }
}
Use the absolute path to your vision_server.py. On Windows, escape the
backslashes (C:\\path\\to\\vision_server.py) or use forward slashes.
Kiro reconnects to MCP servers automatically when the config changes, or you can reconnect from the MCP Server view in the Kiro feature panel.
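If hand-escaping paths feels error-prone, one option is to generate the entry with a few lines of Python and paste the output into mcp.json. This is a convenience sketch, not part of the toolkit; build_mcp_config is a hypothetical helper, and it assumes vision_server.py sits in the current directory.

```python
import json
from pathlib import Path, PureWindowsPath


def build_mcp_config(script_path: str) -> dict:
    """Return an mcp.json-shaped dict for the vision-reader server."""
    return {
        "mcpServers": {
            "vision-reader": {
                "command": "python",
                "args": [script_path],
                "disabled": False,
                "autoApprove": ["read_image", "read_all_images"],
            }
        }
    }


# Resolve to an absolute path before writing it into the config.
script = str(Path("vision_server.py").resolve())
config = build_mcp_config(script)
print(json.dumps(config, indent=2))

# json.dumps doubles backslashes, so a Windows-style path stays valid JSON.
win_path = str(PureWindowsPath(r"C:\path\to\vision_server.py"))
print(json.dumps(win_path))
```

Letting json.dumps do the escaping avoids the most common mistake here: a single backslash in a JSON string.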
Step 5 — Put it together
With both pieces in place, the end-to-end workflow is two commands and a prompt.
Extract once:
python extract_doc_images.py ./docs
Then ask Kiro in natural language:
Read all images in ./docs/extracted_images, recursively, and summarize the
architecture section by section.
read_all_images walks the extracted tree (its folder names carry the section
titles), returns each diagram as ImageContent, and Kiro describes what it
actually sees: boxes, arrows, labels, IP ranges, the lot. For a single loose
diagram you don't even need step 1:
Read docs/diagrams/system-overview.png and explain the data flow.
Why this approach is nice
No API key, no per-image cost. Nothing leaves your machine except the image bytes handed to the host model you're already using.
Higher fidelity. Kiro sees the real image instead of a second-hand text description.
Unlocks documents, not just files. The extractor reaches diagrams that were previously invisible inside exported design docs.
Section-aware. The folder hierarchy preserves which diagram belongs to which part of the document, so summaries stay organized.
Tiny and dependency-light. The extractor is stdlib-only; the server needs just mcp and Pillow.
Gotchas
Path scope. Kiro's built-in file tools are sandboxed to the workspace, but an MCP server runs as its own process and can read paths you give it. Point it only at directories you trust.
Sensitive diagrams. Extracted images (and their section-named folders) can contain internal detail. Keep extracted_images/ out of version control; the included .gitignore does this for you.
Untrusted images. Treat image contents as untrusted input. A diagram could contain text crafted to look like instructions, so don't act on text inside an image as if it were a command.
Payload limits. Very large or very dense images may need a lower MAX_DIMENSION. Tune it for your diagrams.
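For the path-scope gotcha, one possible guard is to resolve every requested path and refuse anything outside a list of trusted roots. The snippet below is a sketch of that check; is_allowed and the idea of an allowed_roots list are assumptions, and the server shown earlier does not include it.

```python
import tempfile
from pathlib import Path


def is_allowed(candidate: str, allowed_roots: list) -> bool:
    """True if candidate, after resolving symlinks and '..', sits under a root."""
    target = Path(candidate).resolve()
    for root in allowed_roots:
        try:
            # relative_to raises ValueError when target is not inside root.
            target.relative_to(Path(root).resolve())
            return True
        except ValueError:
            continue
    return False


with tempfile.TemporaryDirectory() as tmp:
    inside = str(Path(tmp) / "diagrams" / "overview.png")
    escaped = str(Path(tmp) / ".." / "elsewhere.png")
    print(is_allowed(inside, [tmp]))
    print(is_allowed(escaped, [tmp]))
```

Resolving before comparing is the important step: a plain string-prefix check would accept paths that climb out with "..".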
Extending it
A few easy additions:
More document formats in extract_doc_images.py (e.g. .docx, .pptx), which are ZIP archives with images under word/media/ or ppt/media/.
A read_pdf_page tool that rasterizes a PDF page to an image.
A whitelist of allowed root directories for safety.
Caching by file hash so repeated reads are instant.
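As a starting point for the first extension, .docx and .pptx media can be read with the stdlib zipfile module. This is a sketch only: iter_ooxml_media is a hypothetical name, and the in-memory archive below is a stand-in for a real document.

```python
import io
import zipfile

MEDIA_PREFIXES = ("word/media/", "ppt/media/")


def iter_ooxml_media(data: bytes):
    """Yield (archive_name, payload_bytes) for files under the media folders."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            if info.filename.startswith(MEDIA_PREFIXES):
                yield info.filename, zf.read(info)


# Build a tiny stand-in archive with one XML part and one embedded image.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("word/document.xml", "<w:document/>")
    zf.writestr("word/media/image1.png", b"\x89PNG\r\n\x1a\n")

for name, payload in iter_ooxml_media(buf.getvalue()):
    print(name, len(payload), "bytes")
```

Unlike the MHTML case, this gives you the images but not their section context; recovering that would mean parsing the document XML and its relationship files.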
That's the whole toolkit: an MHTML extractor to free diagrams from documents,
plus MCP's ImageContent and a model that can already see. Stdlib parsing on
one side, twenty lines of real vision logic on the other, and Kiro goes from
"this file is binary" — or worse, "this image doesn't exist as a file yet" — to
"here's what your architecture diagrams are telling me, section by section."