chat_vision

Analyze images through two-turn conversations: first answer questions about image content, then respond to follow-up questions about visual details.

Instructions

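Two-turn conversational image Q&A

Supports a two-turn, image-based conversation:

  • Turn 1: answers based on the image and the local AI's question

  • Turn 2: answers a follow-up question if the local AI asks for more detail about the image


Use cases

  • In-depth image analysis

  • Iterative question exploration

  • Complex image understanding

Parameters

  • image: image input (file path or Base64)

  • question: the question

  • session_id: session ID (used for the second turn; may be omitted on the first call)

  • is_new_conversation: whether to start a new conversation (setting it to true creates a new session)

Two-turn conversation flow

  1. Turn 1: call without session_id; the AI analyzes the image, replies, and returns a session ID

  2. Turn 2: pass session_id to ask a follow-up about image details; the conversation ends after the AI answers

  3. The conversation cannot continue beyond two turns; start a new conversation instead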

Example

# Turn 1
result1 = chat_vision(
    image="C:/chart.png",
    question="What data does this chart show?"
)
session_id = result1["session_id"]

# Turn 2 (follow-up on details; the conversation cannot continue after this turn)
if result1["remaining_turns"] > 0:
    result2 = chat_vision(
        image="C:/chart.png",
        question="What trends does the data show?",
        session_id=session_id
    )

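Returns

  • status: execution status

  • answer: the answer

  • session_id: session ID

  • conversation_turn: current conversation turn (1 or 2)

  • remaining_turns: number of turns remaining

  • can_continue: whether the conversation can continue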
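As a rough illustration of how a caller might act on the returned fields, the sketch below maps a response dictionary to a next step. The helper name `next_step` and its return labels are hypothetical; only the field names and the `status` values ("success", "completed", "error") come from the tool's source.

```python
from typing import Any


def next_step(result: dict[str, Any]) -> str:
    """Suggest what a caller could do after a chat_vision response (hypothetical helper)."""
    if result.get("status") == "error":
        return "handle_error"
    # Checking both documented fields together is a defensive choice, not a requirement.
    if result.get("can_continue") and result.get("remaining_turns", 0) > 0:
        return "ask_follow_up"
    # Covers a finished second turn as well as the "completed" status.
    return "start_new_conversation"


first = {"status": "success", "session_id": "abc", "conversation_turn": 1,
         "remaining_turns": 1, "can_continue": True}
second = {**first, "conversation_turn": 2, "remaining_turns": 0, "can_continue": False}

print(next_step(first))   # ask_follow_up
print(next_step(second))  # start_new_conversation
```

Error responses carry `error` and `error_type` instead of `answer`, so reading `status` first avoids a missing-key lookup.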

Input Schema

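| Name | Required | Description | Default |
| --- | --- | --- | --- |
| image | Yes | Image input: local file path or Base64-encoded data | |
| question | Yes | Question about the image | |
| session_id | No | Session ID (for multi-turn conversations) | |
| is_new_conversation | No | Whether to start a new conversation | |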

Output Schema

No arguments

Implementation Reference

  • Main chat_vision tool handler - Implements a two-turn conversational image Q&A tool that manages sessions, processes images, and calls the vision API with conversation history.
    @mcp.tool()
    async def chat_vision(
        image: str = Field(description="图像输入:本地文件路径或Base64编码"),
        question: str = Field(description="关于图像的问题"),
        session_id: str | None = Field(default=None, description="会话ID(多轮对话用)"),
        is_new_conversation: bool = Field(default=False, description="是否开始新对话"),
    ) -> dict[str, Any]:
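        """
        Two-turn conversational image Q&A

        Supports a two-turn, image-based conversation:
        - Turn 1: answers based on the image and the local AI's question
        - Turn 2: answers a follow-up question if the local AI asks for more detail about the image

        ---
        **Use cases**:
        - In-depth image analysis
        - Iterative question exploration
        - Complex image understanding

        **Parameters**:
        - `image`: image input (file path or Base64)
        - `question`: the question
        - `session_id`: session ID (used for the second turn; may be omitted on the first call)
        - `is_new_conversation`: whether to start a new conversation (setting it to true creates a new session)

        **Two-turn conversation flow**:
        1. Turn 1: call without session_id; the AI analyzes the image, replies, and returns a session ID
        2. Turn 2: pass session_id to ask a follow-up about image details; the conversation ends after the AI answers
        3. The conversation cannot continue beyond two turns; start a new conversation instead

        **Example**:
        ```python
        # Turn 1
        result1 = chat_vision(
            image="C:/chart.png",
            question="What data does this chart show?"
        )
        session_id = result1["session_id"]

        # Turn 2 (follow-up on details; the conversation cannot continue after this turn)
        if result1["remaining_turns"] > 0:
            result2 = chat_vision(
                image="C:/chart.png",
                question="What trends does the data show?",
                session_id=session_id
            )
        ```

        **Returns**:
        - `status`: execution status
        - `answer`: the answer
        - `session_id`: session ID
        - `conversation_turn`: current conversation turn (1 or 2)
        - `remaining_turns`: number of turns remaining
        - `can_continue`: whether the conversation can continue
        """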
        logger.info(f"收到chat_vision请求,问题: {question[:50]}...")
    
        try:
            # Get the session manager, image processor, and vision client
            manager = get_chat_manager()
            processor = get_image_processor()
            client = get_vision_client()

            # Resolve the session: create a new one or look up the existing one
            if is_new_conversation or session_id is None:
                session = manager.create_new_session()
                logger.info(f"创建新会话: {session.session_id[:8]}")
            else:
                session = manager.get_or_create_session(session_id)

                # Check whether the conversation can continue
                if not session.can_continue():
                    logger.info(f"会话 {session.session_id[:8]} 已达到最大轮次限制")
                    return {
                        "status": "completed",
                        "message": "该会话已完成两轮对话,已结束。如需继续分析图像,请上传新的图片并设置 is_new_conversation=true 开始新会话。",
                        "hint": "下次调用时需要提供新的 image 参数和 is_new_conversation=true",
                        "session_id": session.session_id,
                        "conversation_turn": session.current_turn,
                        "remaining_turns": 0,
                        "can_continue": False,
                    }
    
            # Process the image input
            image_info = processor.process_image_input(image)

            # Set the image context
            session.set_image_context(image_info["url"], image_info)

            # Add the user's question to the history
            session.add_message("user", question)

            # Get the conversation history
            history = session.get_openai_history()

            # Call the vision API (multi-turn mode)
            answer = await client.chat_with_image(
                image_url=image_info["url"],
                question=question,
                conversation_history=history[:-1] if len(history) > 1 else None,  # Exclude the question just added
            )

            # Add the assistant's answer to the history (this automatically increments the turn count)
            session.add_message("assistant", answer)

            # Save the session
            manager.save_session(session)
    
            logger.info(f"chat_vision完成,会话: {session.session_id[:8]},轮次: {session.current_turn}")
    
            return {
                "status": "success",
                "answer": answer,
                "session_id": session.session_id,
                "conversation_turn": session.current_turn,
                "remaining_turns": session.get_remaining_turns(),
                "can_continue": session.can_continue(),
                "image_info": {
                    "source_type": image_info["source_type"],
                    "mime_type": image_info["mime_type"],
                    "size": image_info["size"],
                }
            }
    
        except FileNotFoundError as e:
            logger.error(f"文件未找到: {e}")
            return {
                "status": "error",
                "error": f"文件未找到: {str(e)}",
                "error_type": "file_not_found",
            }
    
        except ValueError as e:
            logger.error(f"参数错误: {e}")
            return {
                "status": "error",
                "error": str(e),
                "error_type": "invalid_input",
            }
    
        except Exception as e:
            logger.error(f"对话失败: {e}")
            return {
                "status": "error",
                "error": f"对话失败: {str(e)}",
                "error_type": "chat_failed",
            }
  • Tool registration with @mcp.tool() decorator and schema definition using Pydantic Fields for parameter validation.
    @mcp.tool()
    async def chat_vision(
        image: str = Field(description="图像输入:本地文件路径或Base64编码"),
        question: str = Field(description="关于图像的问题"),
        session_id: str | None = Field(default=None, description="会话ID(多轮对话用)"),
        is_new_conversation: bool = Field(default=False, description="是否开始新对话"),
    ) -> dict[str, Any]:
  • Parameter schema definition using Pydantic Fields - defines input validation for image, question, session_id, and is_new_conversation parameters.
    @mcp.tool()
    async def chat_vision(
        image: str = Field(description="图像输入:本地文件路径或Base64编码"),
        question: str = Field(description="关于图像的问题"),
        session_id: str | None = Field(default=None, description="会话ID(多轮对话用)"),
        is_new_conversation: bool = Field(default=False, description="是否开始新对话"),
    ) -> dict[str, Any]:
  • ChatManager class - Manages conversation sessions with support for creating, retrieving, and saving sessions. Enforces 2-turn conversation limit.
    class ChatManager:
        """Conversation manager: manages multiple sessions."""

        def __init__(self):
            """Initialize the conversation manager."""
            self._sessions: dict[str, ChatSession] = {}
            self._persistence_enabled = False
            self._history_file: Path | None = None

            # Initialize persistence
            self._init_persistence()
    
        def _init_persistence(self):
            """Initialize persistent storage."""
            server_config = get_server_config()

            if server_config.enable_persistence:
                self._persistence_enabled = True
                self._history_file = Path(server_config.history_path).expanduser()

                # Make sure the directory exists
                self._history_file.parent.mkdir(parents=True, exist_ok=True)

                # Load existing sessions
                self._load_from_file()
    
                logger.info(f"持久化已启用,历史文件: {self._history_file}")
            else:
                logger.info("持久化未启用,使用内存模式")
    
        def _load_from_file(self):
            """Load session history from the history file."""
            if not self._history_file or not self._history_file.exists():
                logger.info("历史文件不存在,将创建新文件")
                return
    
            try:
                content = self._history_file.read_text(encoding="utf-8")
                data = json.loads(content)
    
                if not isinstance(data, dict):
                    logger.warning("历史文件格式错误,忽略")
                    return
    
                # Load every session
                for session_id, session_data in data.items():
                    if isinstance(session_data, dict):
                        self._sessions[session_id] = ChatSession.from_dict(session_data)
    
                logger.info(f"从文件加载了 {len(self._sessions)} 个会话")
    
            except json.JSONDecodeError as e:
                logger.error(f"历史文件JSON解析失败: {e}")
            except Exception as e:
                logger.error(f"加载历史文件失败: {e}")
    
        def _save_to_file(self):
            """Save session history to the history file."""
            if not self._persistence_enabled or not self._history_file:
                return

            try:
                # Make sure the directory exists
                self._history_file.parent.mkdir(parents=True, exist_ok=True)

                # Convert every session to a dictionary
                data = {
                    session_id: session.to_dict()
                    for session_id, session in self._sessions.items()
                }

                # Write as pretty-printed JSON
                self._history_file.write_text(
                    json.dumps(data, ensure_ascii=False, indent=2),
                    encoding="utf-8"
                )
    
                logger.debug(f"已保存 {len(self._sessions)} 个会话到文件")
    
            except Exception as e:
                logger.warning(f"保存历史文件失败: {e}")
    
        def get_or_create_session(self, session_id: str | None = None) -> ChatSession:
            """
            Get an existing session or create a new one.

            Args:
                session_id: Session ID (optional)

            Returns:
                ChatSession: The session instance
            """
            if session_id and session_id in self._sessions:
                logger.debug(f"使用现有会话: {session_id[:8]}")
                return self._sessions[session_id]

            # Create a new session
            session = ChatSession(session_id)
            self._sessions[session.session_id] = session

            logger.info(f"创建新会话: {session.session_id[:8]}")

            # Save to file
            self._save_to_file()
    
            return session
    
        def create_new_session(self) -> ChatSession:
            """
            Create a new session.

            Returns:
                ChatSession: The new session instance
            """
            session = ChatSession()
            self._sessions[session.session_id] = session

            logger.info(f"创建新会话: {session.session_id[:8]}")

            # Save to file
            self._save_to_file()
    
            return session
  • chat_with_image method - Handles multi-turn conversational image Q&A by calling OpenAI-compatible vision API with conversation history.
    async def chat_with_image(
        self,
        image_url: str,
        question: str,
        conversation_history: list[dict[str, Any]] | None = None,
        system_prompt: str | None = None,
    ) -> str:
        """
        Multi-turn conversational image Q&A.

        Args:
            image_url: Image URL
            question: The question
            conversation_history: Conversation history
            system_prompt: Custom system prompt

        Returns:
            str: The answer
        """
        system_prompt = system_prompt or self.SYSTEM_PROMPT

        # Build the message list
        messages = [{"role": "system", "content": system_prompt}]

        # Append the conversation history
        if conversation_history:
            for turn in conversation_history:
                role = turn.get("role", "user")
                content = turn.get("content", "")
                if role in ("user", "assistant"):
                    messages.append({"role": role, "content": content})

        # Append the current question (with the image attached)
        messages.append({
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {
                    "type": "image_url",
                    "image_url": {"url": image_url}
                }
            ]
        })
    
        logger.info(f"发送多轮对话请求,历史轮数: {len(conversation_history or [])}")
    
        try:
            response = self._client.chat.completions.create(
                model=self.config.model,
                messages=messages,
                temperature=self.config.temperature,
                max_tokens=self.config.max_tokens,
            )
    
            content = response.choices[0].message.content
            logger.info(f"收到多轮对话响应,长度: {len(content)}")
            return content
    
        except Exception as e:
            logger.error(f"多轮对话请求失败: {e}")
            raise
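The listing above calls `can_continue()`, `get_remaining_turns()`, and `add_message()` on a session object, but the `ChatSession` class itself is not shown. The sketch below is one way such a two-turn limit could be tracked; the class name `SessionSketch`, the `MAX_TURNS` constant, and the field layout are assumptions for illustration, not the server's actual implementation. The only behavior taken from the source is that adding an assistant message advances the turn count.

```python
import uuid
from dataclasses import dataclass, field

MAX_TURNS = 2  # assumption: mirrors the documented two-turn limit


@dataclass
class SessionSketch:
    """Hypothetical stand-in for the ChatSession class that the listing does not show."""

    session_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    current_turn: int = 0
    messages: list[dict[str, str]] = field(default_factory=list)

    def add_message(self, role: str, content: str) -> None:
        self.messages.append({"role": role, "content": content})
        # A turn is counted as complete once the assistant has answered.
        if role == "assistant":
            self.current_turn += 1

    def get_remaining_turns(self) -> int:
        return max(0, MAX_TURNS - self.current_turn)

    def can_continue(self) -> bool:
        return self.get_remaining_turns() > 0


session = SessionSketch()
session.add_message("user", "What data does this chart show?")
session.add_message("assistant", "Monthly revenue by region.")
print(session.get_remaining_turns())  # 1
session.add_message("user", "What trends does the data show?")
session.add_message("assistant", "Revenue rises each quarter.")
print(session.can_continue())  # False
```

Counting turns on the assistant message means a failed API call, which never reaches the second `add_message`, does not consume a turn.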
Behavior: 4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

With no annotations provided, the description carries the full burden and does so well by disclosing key behavioral traits: the two-round conversation limit ('超过两轮将无法继续', "the conversation cannot continue beyond two rounds"), session management requirements, and workflow constraints. It also describes the return structure and state tracking (e.g., 'remaining_turns', 'can_continue'), though it could mention potential error conditions or performance characteristics.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 3/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is well-structured with clear sections (使用场景 "use cases", 参数说明 "parameters", 两轮对话流程 "two-round flow", 示例 "example", 返回内容 "returns"), but it is verbose with repetitive information (e.g., the workflow is explained in multiple places). Some sentences could be condensed without losing clarity, making it less front-loaded than ideal.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the tool's complexity (multi-round conversation with state management), no annotations, and the presence of an output schema, the description is highly complete. It covers purpose, usage, parameters, workflow, examples, and return values, providing sufficient context for an agent to use the tool correctly without relying on structured fields.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

The schema description coverage is 100%, so the baseline is 3. The description adds some value by explaining the conversational flow implications of parameters (e.g., 'session_id' for second-round dialogue, 'is_new_conversation' to create new sessions) and providing usage examples, but doesn't significantly enhance semantic understanding beyond what the schema already documents.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 4/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool performs 'two-round conversational image Q&A' with specific verbs ('analyze', 'answer') and resources ('image', 'question'), distinguishing it from sibling tools like 'analyze_image' by emphasizing the conversational aspect. However, it doesn't explicitly contrast with 'get_status', leaving some ambiguity in sibling differentiation.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 5/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

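The description provides explicit usage scenarios ('深度图像分析' "in-depth image analysis", '迭代式问题探索' "iterative question exploration", '复杂图像理解' "complex image understanding"), detailed two-round workflow instructions, and clear when-to-use guidance (e.g., '首次调用可不提供' "may be omitted on the first call" for session_id, '超过两轮将无法继续' "cannot continue beyond two rounds"). It effectively guides the agent on proper invocation timing and limitations.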

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/LZMW/mcp-vision-server'

If you have feedback or need assistance with the MCP directory API, please join our Discord server.