
MiniMax MCP Server

by longhz

understand_image

Analyze image content with quick summaries or structured 7-dimension analysis. Ask specific questions to extract text, objects, charts, and evidence from images.

Instructions

Analyze image content, following the OpenHanako Vision Bridge design.

Two modes:

  • quick: a concise description (~300 words), suitable for getting a quick overview of the image

  • detailed: a structured 7-dimension analysis (image_overview / visible_text / objects_and_layout / charts_or_data / answer_to_request / evidence / uncertainty), modeled on OpenHanako vision-bridge.js

Caching supported: the same image with the same prompt does not trigger a repeat API call (LRU + disk persistence).

Args:

  • image_url: image URL (HTTP/HTTPS) or local file path; supports JPEG/PNG/GIF/WebP (≤20MB)

  • prompt: a specific question about the image, e.g. "What error message is shown in this image?"

  • mode: "quick" or "detailed"; defaults to detailed

  • use_cache: whether to use the cache; defaults to true

Input Schema

Name        Required    Description    Default
image_url   Yes
prompt      No
mode        No                         detailed
use_cache   No
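A typical invocation from an MCP client passes arguments matching this schema. The payload below is illustrative (the URL and question are invented); only the field names come from the schema above:

```python
# Hypothetical argument payload for the understand_image tool.
# Field names match the input schema above; the URL and question are invented.
arguments = {
    "image_url": "https://example.com/screenshot.png",
    "prompt": "What error message is shown in this screenshot?",
    "mode": "detailed",   # "quick" or "detailed"; defaults to detailed
    "use_cache": True,    # reuse a cached analysis for a repeated image+prompt
}
```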

Implementation Reference

  • MCP tool registration via @mcp.tool() decorator — registers understand_image as a FastMCP tool with parameters image_url, prompt, mode, use_cache.
    @mcp.tool()
    def understand_image(
        image_url: str,
        prompt: str = "",
        mode: str = "detailed",
        use_cache: bool = True,
    ) -> dict:
        """Analyze image content, following the OpenHanako Vision Bridge design.
    
        Two modes:
        - quick:    concise description (~300 words), good for a fast overview
        - detailed: structured 7-dimension analysis (image_overview / visible_text /
                    objects_and_layout / charts_or_data / answer_to_request /
                    evidence / uncertainty), modeled on OpenHanako vision-bridge.js
    
        Caching supported: the same image + prompt does not trigger a repeat
        API call (LRU + disk persistence).
    
        Args:
            image_url: image URL (HTTP/HTTPS) or local file path; JPEG/PNG/GIF/WebP (≤20MB)
            prompt:    specific question about the image, e.g. "What error message does this image show?"
            mode:      "quick" or "detailed"; defaults to detailed
            use_cache: whether to use the cache; defaults to true
        """
        from minimax_mcp.tools.image_understand import understand_image as _run
        return _run(get_client(), image_url, prompt, mode, use_cache)
  • Core handler function — normalizes image URL (local file → base64 data URL), validates mode, delegates to analyze_image(), and attaches cache stats.
    def understand_image(
        client: MiniMaxClient,
        image_url: str,
        prompt: str = "",
        mode: str = "",
        use_cache: bool = True,
    ) -> dict:
        """Analyze image content; supports local file paths.
    
        Follows the OpenHanako design, offering two analysis modes:
        - quick:  concise description (~300 words), mirrors _analyzeImageAsNote()
        - detailed: structured 7-dimension analysis, mirrors _analyzeImageWithPrimitives()
    
        Args:
            client: MiniMax API client
            image_url: image URL (HTTP/HTTPS) or local file path (auto-converted to base64)
            prompt: the user's specific question about the image (optional), e.g. "What error does this screenshot show?"
            mode: "quick" or "detailed"; defaults to detailed
            use_cache: whether to use the cache; defaults to true
    
        Returns:
            {
                success: bool,
                mode: str,
                analysis: str,
                cached: bool,
                image_url: str,
                cache_stats: {...},
            }
        """
        if not image_url:
            return {"success": False, "error": "image_url must not be empty"}
    
        if mode not in ("quick", "detailed"):
            mode = VISION_DEFAULT_MODE
    
        # Local files are converted to base64 data URLs automatically
        normalized_url = _normalize_image_url(image_url)
        print(f"[Vision] Source: {image_url[:80]}", file=sys.stderr)
    
        result = analyze_image(
            client=client,
            image_url=normalized_url,
            prompt=prompt.strip(),
            mode=mode,
            use_cache=use_cache,
        )
    
        result["cache_stats"] = get_cache_stats()
        return result
  • Helper that converts local file paths to base64 data URLs for API consumption (supports jpg, png, gif, webp).
    def _normalize_image_url(raw: str) -> str:
        """Convert a local file path to a base64 data URL; HTTP/HTTPS URLs are returned unchanged."""
        if raw.startswith("http://") or raw.startswith("https://") or raw.startswith("data:"):
            return raw
        path = Path(raw).expanduser()
        if path.is_file():
            ext = path.suffix.lower()
            mime_map = {".jpg": "jpeg", ".jpeg": "jpeg", ".png": "png", ".gif": "gif", ".webp": "webp"}
            mime = mime_map.get(ext, "jpeg")
            b64 = base64.b64encode(path.read_bytes()).decode()
            return f"data:image/{mime};base64,{b64}"
        return raw
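The conversion can be exercised end-to-end with a throwaway file. The sketch below mirrors `_normalize_image_url` as a standalone function (same MIME mapping and pass-through rules) rather than importing the real module:

```python
import base64
import tempfile
from pathlib import Path

def to_data_url(raw: str) -> str:
    """Standalone mirror of _normalize_image_url: local file -> data URL."""
    if raw.startswith(("http://", "https://", "data:")):
        return raw  # remote URLs and existing data URLs pass through
    path = Path(raw).expanduser()
    if path.is_file():
        mime = {".jpg": "jpeg", ".jpeg": "jpeg", ".png": "png",
                ".gif": "gif", ".webp": "webp"}.get(path.suffix.lower(), "jpeg")
        b64 = base64.b64encode(path.read_bytes()).decode()
        return f"data:image/{mime};base64,{b64}"
    return raw  # unknown input passes through unchanged

with tempfile.NamedTemporaryFile(suffix=".png", delete=False) as f:
    f.write(b"\x89PNG")  # not a real image; enough to show the encoding
    tmp_path = f.name

url = to_data_url(tmp_path)
print(url[:22])  # data:image/png;base64,
```

Note the fall-through at the end: a path that does not exist is returned as-is, so a bad path surfaces as an API-side error rather than a local exception.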
  • Core analysis orchestrator — checks VisionCache, builds prompts (quick/detailed), calls MiniMax API, caches results, and returns formatted response.
    def analyze_image(
        client: MiniMaxClient,
        image_url: str,
        prompt: str = "",
        mode: str = "detailed",
        use_cache: bool = True,
    ) -> dict:
        """Analyze image content.
    
        Mirrors the OpenHanako VisionBridge.prepare() + _analyzeImage() flow:
        1. Check the cache
        2. Build the prompt (quick/detailed)
        3. Call the MiniMax understand_image API
        4. Cache the result
        5. Format the response
    
        Args:
            client: MiniMax API client
            image_url: image URL or local path
            prompt: the user's specific question about the image (optional)
            mode: "quick" or "detailed"
            use_cache: whether to use the cache
    
        Returns:
            {
                success: bool,
                mode: str,
                analysis: str,
                cached: bool,
                image_url: str,
            }
        """
        cache = _get_cache()
    
        # Step 1: check the cache
        if use_cache:
            cached = cache.get(image_url, prompt, mode)
            if cached:
                print(f"[Vision] Cache hit for {image_url[:60]}", file=sys.stderr)
                return {
                    "success": True,
                    "mode": mode,
                    "analysis": cached,
                    "cached": True,
                    "image_url": image_url,
                }
    
        # Step 2: build the prompt
        if mode == "detailed":
            api_prompt = build_detailed_prompt(prompt)
        else:
            api_prompt = build_quick_prompt(prompt)
    
        print(f"[Vision] Analyzing image (mode={mode}): {image_url[:80]}", file=sys.stderr)
    
        # Step 3: call the API
        result = client.understand_image(prompt=api_prompt, image_url=image_url)
    
        if not result.get("success"):
            return {
                "success": False,
                "mode": mode,
                "analysis": "",
                "error": result.get("error", "Unknown error"),
                "detail": result.get("detail", ""),
                "cached": False,
                "image_url": image_url,
            }
    
        # Step 4: extract the analysis text
        # Actual API response: {"content": "...", "base_resp": {...}, "success": true}
        raw_analysis = (
            result.get("content")
            or result.get("analysis")
            or result.get("data", {}).get("reply")
            or str(result)
        )
    
        # Step 5: cache the result
        if use_cache and raw_analysis:
            cache.put(image_url, prompt, mode, raw_analysis)
    
        return {
            "success": True,
            "mode": mode,
            "analysis": raw_analysis,
            "cached": False,
            "image_url": image_url,
        }
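The VisionCache internals are not shown on this page. Under the stated behavior (in-memory LRU keyed by image + prompt + mode, persisted to disk), it could be sketched roughly as below; the class name, hashing scheme, and file layout are assumptions for illustration, not the project's actual code:

```python
import hashlib
import json
import tempfile
from collections import OrderedDict
from pathlib import Path

class VisionCacheSketch:
    """Illustrative LRU + disk-persistent cache keyed by (image_url, prompt, mode)."""

    def __init__(self, path: Path, max_entries: int = 128):
        self.path = path
        self.max_entries = max_entries
        self.mem: OrderedDict = OrderedDict()
        if path.is_file():  # reload persisted entries on startup
            self.mem.update(json.loads(path.read_text()))

    @staticmethod
    def _key(image_url: str, prompt: str, mode: str) -> str:
        # Hash the tuple so arbitrarily long data URLs stay a fixed-size key
        raw = json.dumps([image_url, prompt, mode])
        return hashlib.sha256(raw.encode()).hexdigest()

    def get(self, image_url, prompt, mode):
        key = self._key(image_url, prompt, mode)
        if key in self.mem:
            self.mem.move_to_end(key)  # mark as recently used
            return self.mem[key]
        return None

    def put(self, image_url, prompt, mode, analysis):
        key = self._key(image_url, prompt, mode)
        self.mem[key] = analysis
        self.mem.move_to_end(key)
        while len(self.mem) > self.max_entries:
            self.mem.popitem(last=False)       # evict least recently used
        self.path.write_text(json.dumps(self.mem))  # persist to disk

cache = VisionCacheSketch(Path(tempfile.mkdtemp()) / "cache.json", max_entries=2)
cache.put("img1", "q", "quick", "first")
cache.put("img2", "q", "quick", "second")
cache.get("img1", "q", "quick")            # touch img1 so img2 becomes LRU
cache.put("img3", "q", "quick", "third")   # evicts img2
```

Note that the cache is keyed on the *user* prompt, not the expanded API prompt, which is why the same question in the same mode hits the cache regardless of template wording.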
  • Detailed mode prompt template defining the 7-dimension structured analysis schema (image_overview, visible_text, objects_and_layout, charts_or_data, answer_to_request, evidence, uncertainty).
    # ── Detailed Mode Prompt ───────────────────────────────────
    # Analysis dimensions modeled on the OpenHanako Vision Bridge:
    # image_overview / visible_text / objects_and_layout /
    # charts_or_data / user_request_answer / evidence / uncertainty
    
    DETAILED_PROMPT_TEMPLATE = """Analyze this image thoroughly and return a structured response in the following format.
    
    ## image_overview
    A concise description of what this image shows overall. Include the type of image (screenshot, photo, chart, document, UI, etc.), the context/setting, and the main subject.
    
    ## visible_text
    List all readable text visible in the image. Include labels, titles, buttons, menu items, error messages, code snippets, document text, etc. Be as complete as possible with exact wording when legible.
    
    ## objects_and_layout
    Describe the spatial layout: what objects/elements appear where. Note their relative positions (top-left, center, bottom-right, etc.), approximate sizes, and relationships between elements. For UI screenshots, describe the window structure, panels, toolbars, content areas.
    
    ## charts_or_data
    If the image contains charts, graphs, tables, or structured data, extract and describe the data. Include axis labels, data series, numerical values, trends, and table headers/rows where visible.
    
    ## answer_to_request
    {request_section}
    
    ## evidence
    Cite specific visual evidence from the image that supports your analysis. Reference exact positions, colors, text, or patterns that back up your conclusions.
    
    ## uncertainty
    Note anything that is unclear, ambiguous, partially hidden, cropped, or that you are uncertain about. Be honest about the limits of what you can see.
    
    {user_request}"""
    
    
    def build_detailed_prompt(user_request: str = "") -> str:
        """Build the detailed-mode prompt.
    
        If the user has a specific question, an answer_to_request section is
        generated; otherwise a generic description section is used instead.
    
        Follows OpenHanako: the user_request field drives targeted analysis.
        """
        if user_request.strip():
            request_section = f"Answer the following request using the image:\n{user_request}"
            user_request_line = f"User's request about this image: {user_request}"
        else:
            request_section = "Describe the main purpose or takeaway of this image."
            user_request_line = ""
    
        return DETAILED_PROMPT_TEMPLATE.format(
            request_section=request_section,
            user_request=user_request_line,
        )
    
    
    def build_quick_prompt(user_request: str = "") -> str:
        """Build the quick-mode prompt."""
        if user_request.strip():
            return f"{QUICK_PROMPT}\n\nSpecifically, the user wants to know: {user_request}"
        return QUICK_PROMPT
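QUICK_PROMPT itself does not appear on this page. Assuming it is a short summary instruction, quick-mode assembly can be reproduced standalone like this; the constant's wording below is a placeholder, not the project's actual text:

```python
# Placeholder wording; the real QUICK_PROMPT constant is not shown on this page.
QUICK_PROMPT = "Describe this image concisely in about 300 words."

def build_quick_prompt(user_request: str = "") -> str:
    """Standalone mirror of build_quick_prompt, with a stand-in QUICK_PROMPT."""
    if user_request.strip():
        return f"{QUICK_PROMPT}\n\nSpecifically, the user wants to know: {user_request}"
    return QUICK_PROMPT

print(build_quick_prompt())                      # bare summary instruction
print(build_quick_prompt("Is there an error?"))  # summary + targeted question
```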
Behavior: 5/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Discloses caching (LRU + disk persistence), mode-specific outputs (quick ~300 words; detailed 7 dimensions), image constraints (formats, size ≤20MB), and references OpenHanako design.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 4/5

Is the description appropriately sized, front-loaded, and free of redundancy?

Well-structured with sections (modes, cache, args) but some redundancy (mode details in prose and bullet). Slightly verbose but still clear.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Comprehensive: covers input, behavior, caching, output (via mode descriptions). No output schema but mode details provide sufficient completeness.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 5/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

All 4 parameters explained in detail: image_url (URL/local path, formats, limit), prompt (example question), mode (values), use_cache (boolean). Schema has 0% coverage, description fully compensates.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

Description explicitly states '分析图片内容' (analyze image content) with two modes (quick/detailed). Clearly distinguishes from siblings like generate_image (creation) and web_search (search).

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

Describes when to use quick mode ('适合快速了解图片内容', i.e. suitable for a quick overview) and detailed mode (structured analysis). Does not explicitly exclude alternatives, but context clarifies the differentiation.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
