understand_image
Analyze image content with quick summaries or structured 7-dimension analysis. Ask specific questions to extract text, objects, charts, and evidence from images.
Instructions
Analyze image content (design inspired by the OpenHanako Vision Bridge).
Two modes:
- quick: concise description (~300 words), suited to getting a fast overview of an image
- detailed: structured 7-dimension analysis (image_overview / visible_text / objects_and_layout / charts_or_data / answer_to_request / evidence / uncertainty), modeled on OpenHanako's vision-bridge.js
Caching is supported: the same image with the same prompt does not trigger a repeat API call (LRU in memory plus disk persistence).
Args:
- image_url: Image URL (HTTP/HTTPS) or local file path; supports JPEG/PNG/GIF/WebP (≤20MB)
- prompt: A specific question about the image, e.g. "What error message appears in this image?"
- mode: "quick" or "detailed"; default detailed
- use_cache: Whether to use the cache; default true
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| image_url | Yes | Image URL (HTTP/HTTPS) or local file path; JPEG/PNG/GIF/WebP, ≤20MB | |
| prompt | No | A specific question about the image | "" |
| mode | No | "quick" or "detailed" | detailed |
| use_cache | No | Whether to use the response cache | true |
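As a sketch of how the defaults in the table above combine with a caller's arguments (the helper name and the example URL are hypothetical, not part of the tool):

```python
def apply_defaults(args: dict) -> dict:
    """Fill in the documented defaults for an understand_image call (sketch)."""
    if "image_url" not in args:
        raise ValueError("image_url is required")
    return {
        "image_url": args["image_url"],
        "prompt": args.get("prompt", ""),
        "mode": args.get("mode", "detailed"),
        "use_cache": args.get("use_cache", True),
    }


# A caller may supply only image_url; the remaining parameters take
# their documented defaults (hypothetical placeholder URL below).
call = apply_defaults({"image_url": "https://example.com/chart.png"})
```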
Implementation Reference
- `src/minimax_mcp/server.py:80-104` (registration): MCP tool registration via the `@mcp.tool()` decorator; registers `understand_image` as a FastMCP tool with parameters `image_url`, `prompt`, `mode`, and `use_cache`.
```python
@mcp.tool()
def understand_image(
    image_url: str,
    prompt: str = "",
    mode: str = "detailed",
    use_cache: bool = True,
) -> dict:
    """Analyze image content (design inspired by OpenHanako Vision Bridge).

    Two modes:
    - quick: concise description (~300 words) for a fast overview of the image
    - detailed: structured 7-dimension analysis (image_overview / visible_text /
      objects_and_layout / charts_or_data / answer_to_request / evidence /
      uncertainty), modeled on OpenHanako's vision-bridge.js

    Caching: the same image with the same prompt does not trigger a repeat
    API call (LRU + disk persistence).

    Args:
        image_url: Image URL (HTTP/HTTPS) or local file path; supports
            JPEG/PNG/GIF/WebP (<=20MB)
        prompt: A specific question about the image, e.g. "What error
            message appears in this image?"
        mode: "quick" or "detailed"; default detailed
        use_cache: Whether to use the cache; default true
    """
    from minimax_mcp.tools.image_understand import understand_image as _run

    return _run(get_client(), image_url, prompt, mode, use_cache)
```
- Core handler function: normalizes the image URL (local file → base64 data URL), validates the mode, delegates to `analyze_image()`, and attaches cache stats.
```python
def understand_image(
    client: MiniMaxClient,
    image_url: str,
    prompt: str = "",
    mode: str = "",
    use_cache: bool = True,
) -> dict:
    """Analyze image content; local file paths are supported.

    Inspired by OpenHanako's design, two analysis modes are provided:
    - quick: concise description (~300 words), maps to _analyzeImageAsNote()
    - detailed: structured 7-dimension analysis, maps to
      _analyzeImageWithPrimitives()

    Args:
        client: MiniMax API client
        image_url: Image URL (HTTP/HTTPS) or local file path (auto-converted
            to base64)
        prompt: The user's specific question about the image (optional),
            e.g. "What errors are in this screenshot?"
        mode: "quick" or "detailed"; default detailed
        use_cache: Whether to use the cache; default true

    Returns:
        {
            success: bool,
            mode: str,
            analysis: str,
            cached: bool,
            image_url: str,
            cache_stats: {...},
        }
    """
    if not image_url:
        return {"success": False, "error": "image_url must not be empty"}
    if mode not in ("quick", "detailed"):
        mode = VISION_DEFAULT_MODE

    # Local files are auto-converted to base64 data URLs
    normalized_url = _normalize_image_url(image_url)
    print(f"[Vision] Source: {image_url[:80]}", file=sys.stderr)

    result = analyze_image(
        client=client,
        image_url=normalized_url,
        prompt=prompt.strip(),
        mode=mode,
        use_cache=use_cache,
    )
    result["cache_stats"] = get_cache_stats()
    return result
```
- Helper that converts local file paths to base64 data URLs for API consumption (supports jpg, png, gif, webp).
```python
def _normalize_image_url(raw: str) -> str:
    """Convert a local file path to a base64 data URL; HTTP/HTTPS URLs pass through."""
    if raw.startswith(("http://", "https://", "data:")):
        return raw
    path = Path(raw).expanduser()
    if path.is_file():
        ext = path.suffix.lower()
        mime_map = {".jpg": "jpeg", ".jpeg": "jpeg", ".png": "png",
                    ".gif": "gif", ".webp": "webp"}
        mime = mime_map.get(ext, "jpeg")
        b64 = base64.b64encode(path.read_bytes()).decode()
        return f"data:image/{mime};base64,{b64}"
    return raw
```
- Core analysis orchestrator: checks the VisionCache, builds the prompt (quick/detailed), calls the MiniMax API, caches the result, and returns a formatted response.
```python
def analyze_image(
    client: MiniMaxClient,
    image_url: str,
    prompt: str = "",
    mode: str = "detailed",
    use_cache: bool = True,
) -> dict:
    """Analyze image content.

    Flow borrowed from OpenHanako VisionBridge.prepare() + _analyzeImage():
    1. Check the cache
    2. Build the prompt (quick/detailed)
    3. Call the MiniMax understand_image API
    4. Cache the result
    5. Format the return value

    Args:
        client: MiniMax API client
        image_url: Image URL or local path
        prompt: The user's specific question about the image (optional)
        mode: "quick" or "detailed"
        use_cache: Whether to use the cache

    Returns:
        {
            success: bool,
            mode: str,
            analysis: str,
            cached: bool,
            image_url: str,
        }
    """
    cache = _get_cache()

    # Step 1: check the cache
    if use_cache:
        cached = cache.get(image_url, prompt, mode)
        if cached:
            print(f"[Vision] Cache hit for {image_url[:60]}", file=sys.stderr)
            return {
                "success": True,
                "mode": mode,
                "analysis": cached,
                "cached": True,
                "image_url": image_url,
            }

    # Step 2: build the prompt
    if mode == "detailed":
        api_prompt = build_detailed_prompt(prompt)
    else:
        api_prompt = build_quick_prompt(prompt)

    print(f"[Vision] Analyzing image (mode={mode}): {image_url[:80]}", file=sys.stderr)

    # Step 3: call the API
    result = client.understand_image(prompt=api_prompt, image_url=image_url)
    if not result.get("success"):
        return {
            "success": False,
            "mode": mode,
            "analysis": "",
            "error": result.get("error", "Unknown error"),
            "detail": result.get("detail", ""),
            "cached": False,
            "image_url": image_url,
        }

    # Step 4: extract the analysis text
    # Actual API response: {"content": "...", "base_resp": {...}, "success": true}
    raw_analysis = (
        result.get("content")
        or result.get("analysis")
        or result.get("data", {}).get("reply")
        or str(result)
    )

    # Step 5: cache the result
    if use_cache and raw_analysis:
        cache.put(image_url, prompt, mode, raw_analysis)

    return {
        "success": True,
        "mode": mode,
        "analysis": raw_analysis,
        "cached": False,
        "image_url": image_url,
    }
```
- Detailed-mode prompt template defining the 7-dimension structured analysis schema (image_overview, visible_text, objects_and_layout, charts_or_data, answer_to_request, evidence, uncertainty).
```python
# ── Detailed Mode Prompt ───────────────────────────────────
# Analysis dimensions borrowed from OpenHanako Vision Bridge:
#   image_overview / visible_text / objects_and_layout /
#   charts_or_data / user_request_answer / evidence / uncertainty

DETAILED_PROMPT_TEMPLATE = """Analyze this image thoroughly and return a structured response in the following format.

## image_overview
A concise description of what this image shows overall. Include the type of image (screenshot, photo, chart, document, UI, etc.), the context/setting, and the main subject.

## visible_text
List all readable text visible in the image. Include labels, titles, buttons, menu items, error messages, code snippets, document text, etc. Be as complete as possible with exact wording when legible.

## objects_and_layout
Describe the spatial layout: what objects/elements appear where. Note their relative positions (top-left, center, bottom-right, etc.), approximate sizes, and relationships between elements. For UI screenshots, describe the window structure, panels, toolbars, content areas.

## charts_or_data
If the image contains charts, graphs, tables, or structured data, extract and describe the data. Include axis labels, data series, numerical values, trends, and table headers/rows where visible.

## answer_to_request
{request_section}

## evidence
Cite specific visual evidence from the image that supports your analysis. Reference exact positions, colors, text, or patterns that back up your conclusions.

## uncertainty
Note anything that is unclear, ambiguous, partially hidden, cropped, or that you are uncertain about. Be honest about the limits of what you can see.

{user_request}"""


def build_detailed_prompt(user_request: str = "") -> str:
    """Build the detailed-mode prompt.

    If the user supplied a specific question, an answer_to_request section
    is generated; otherwise a generic description section is used instead.

    Borrowed from OpenHanako: the user_request field drives targeted analysis.
    """
    if user_request.strip():
        request_section = f"Answer the following request using the image:\n{user_request}"
        user_request_line = f"User's request about this image: {user_request}"
    else:
        request_section = "Describe the main purpose or takeaway of this image."
        user_request_line = ""
    return DETAILED_PROMPT_TEMPLATE.format(
        request_section=request_section,
        user_request=user_request_line,
    )


def build_quick_prompt(user_request: str = "") -> str:
    """Build the quick-mode prompt."""
    if user_request.strip():
        return f"{QUICK_PROMPT}\n\nSpecifically, the user wants to know: {user_request}"
    return QUICK_PROMPT
```
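The cache-then-call flow of `analyze_image` can be exercised end to end with a stubbed client. The sketch below is a simplified stand-in, assuming only the documented response shape (`{"content": ..., "success": true}`); `StubClient` and `analyze_with_cache` are illustrative names, not part of the source.

```python
class StubClient:
    """Test double for the MiniMax client: counts calls, returns a canned response."""

    def __init__(self):
        self.calls = 0

    def understand_image(self, prompt: str, image_url: str) -> dict:
        self.calls += 1
        return {"success": True, "content": f"analysis of {image_url}"}


def analyze_with_cache(client, image_url: str, prompt: str, cache: dict) -> dict:
    """Simplified analyze_image flow: cache check, API call, cache fill."""
    key = (image_url, prompt)
    if key in cache:  # cache hit: no API call
        return {"success": True, "analysis": cache[key], "cached": True}
    result = client.understand_image(
        prompt=prompt or "Describe this image.", image_url=image_url
    )
    if not result.get("success"):
        return {"success": False, "error": result.get("error", "Unknown error"),
                "cached": False}
    # Same extraction fallback chain as the orchestrator above
    analysis = result.get("content") or result.get("analysis") or str(result)
    cache[key] = analysis
    return {"success": True, "analysis": analysis, "cached": False}
```

A second identical call returns `cached: True` without touching the client, which is the behavior the VisionCache provides in the real tool.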