vision_query
Processes images via the GLM-4.5V MCP Server: extract text with OCR, answer questions about an image, detect objects, or describe an image's content. The mode parameter (describe, ocr, qa, or detect) selects the behavior.
Instructions
Calls GLM-4.5V to run OCR, question answering, or object detection on an image.
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| mode | No | Query mode (describe, ocr, qa, or detect) | describe |
| path | Yes | Image path or URL | |
| prompt | Yes | Query prompt | |
| returnJson | No | Whether to return the result in JSON format | false |
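For illustration, here is a minimal sketch of invoking the tool from an MCP client using the MCP TypeScript SDK. The server command (node dist/index.js), the image path, and the prompt are assumptions for the example, not part of the documented schema:

```typescript
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

// Connect to the GLM-4.5V MCP server over stdio.
// The server command here is an assumption for illustration.
const transport = new StdioClientTransport({
  command: "node",
  args: ["dist/index.js"],
});
const client = new Client({ name: "example-client", version: "1.0.0" });
await client.connect(transport);

// Run OCR on a (hypothetical) local image, asking for a JSON-formatted result.
const result = await client.callTool({
  name: "vision_query",
  arguments: {
    path: "./receipt.jpg",
    prompt: "Extract all text from this receipt",
    mode: "ocr",
    returnJson: true,
  },
});
console.log(result.content);
```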
Implementation Reference
- src/index.ts:81-110 (registration): MCP tool registration for 'vision_query', including the Zod input schema and a thin wrapper handler that invokes the core visionQuery function and handles errors.

```typescript
mcpServer.registerTool(
  "vision_query",
  {
    description: "调用 GLM-4.5V 对图片进行 OCR/问答/检测",
    inputSchema: {
      path: z.string().describe("图片路径或URL"),
      prompt: z.string().describe("查询提示词"),
      mode: z.enum(["describe", "ocr", "qa", "detect"]).default("describe").describe("查询模式"),
      returnJson: z.boolean().default(false).describe("是否返回JSON格式结果"),
    },
  },
  async ({ path: imagePath, prompt, mode, returnJson }) => {
    try {
      const result = await visionQuery(imagePath, prompt, mode, returnJson);
      return { content: [{ type: "text" as const, text: JSON.stringify(result, null, 2) }] };
    } catch (error) {
      return {
        content: [{
          type: "text" as const,
          text: JSON.stringify(
            { ok: false, error: error instanceof Error ? error.message : "Unknown error" },
            null,
            2
          ),
        }],
        isError: true,
      };
    }
  }
);
```
- src/index.ts:229-304 (handler): Primary handler for the vision_query tool. Loads the image from a path, URL, or data URL, compresses it, builds the GLM-4V API payload for the requested mode, sends the request, normalizes the response, and returns a VisionResult. The compressImage and normalizeGlmResult helpers are not excerpted here; see the sketches after this list.

```typescript
async function visionQuery(
  imagePath: string,
  prompt: string,
  mode: string,
  returnJson: boolean
): Promise<VisionResult> {
  try {
    let imageBase64: string;
    let buffer: Buffer;

    if (imagePath.startsWith("data:")) {
      // Data URL
      const base64Data = imagePath.split(',')[1];
      if (!base64Data) {
        throw new Error("Invalid data URL format");
      }
      buffer = Buffer.from(base64Data, 'base64');
    } else if (imagePath.startsWith("http://") || imagePath.startsWith("https://")) {
      // HTTP/HTTPS image URL
      const response = await fetch(imagePath);
      if (!response.ok) {
        throw new Error(`Failed to fetch image: ${response.statusText}`);
      }
      buffer = Buffer.from(await response.arrayBuffer());
    } else {
      // Local file
      const resolvedPath = path.resolve(imagePath);
      buffer = await fs.readFile(resolvedPath);
    }

    // Compress the image to reduce token usage (smaller dimensions and quality)
    const compressedBuffer = await compressImage(buffer, 800, 75);
    imageBase64 = compressedBuffer.toString("base64");

    const payload = buildGlmPayload({ prompt, imageBase64, mode, returnJson });

    const glmBaseUrl = process.env.GLM_BASE_URL || "https://open.bigmodel.cn/api/paas/v4/chat/completions";
    const glmApiKey = process.env.GLM_API_KEY;
    if (!glmApiKey) {
      throw new Error("GLM_API_KEY environment variable is required");
    }

    const response = await fetch(glmBaseUrl, {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        "Authorization": `Bearer ${glmApiKey}`
      },
      body: JSON.stringify(payload)
    });

    if (!response.ok) {
      throw new Error(`GLM API request failed: ${response.statusText}`);
    }

    const data = await response.json();
    const result = normalizeGlmResult(data, { mode, returnJson });

    return {
      ok: true,
      result,
      metadata: { mode, returnJson, timestamp: Date.now() }
    };
  } catch (error) {
    return {
      ok: false,
      error: error instanceof Error ? error.message : "Unknown error"
    };
  }
}
```
- src/index.ts:83-88 (schema): Zod input schema defining the parameters for vision_query: image path/URL, prompt, mode (describe/ocr/qa/detect), and the returnJson flag. Shown inline in the registration snippet above.
- src/index.ts:25-34 (schema): TypeScript interface for the return type of the visionQuery function.

```typescript
interface VisionResult {
  ok: boolean;
  result?: string | object;
  error?: string;
  metadata?: {
    mode: string;
    returnJson: boolean;
    timestamp: number;
  };
}
```
- src/index.ts:374-430 (helper): Helper that builds the GLM-4V API payload for the vision_query modes (ocr, qa, detect, describe), constructing the system prompt and embedding the image. The truncatePrompt helper is not excerpted here; see the sketch after this list.

```typescript
function buildGlmPayload(opts: {
  prompt: string;
  imageBase64: string;
  mode: string;
  returnJson: boolean;
}) {
  const { prompt, imageBase64, mode, returnJson } = opts;

  // Truncate overly long prompts
  const truncatedPrompt = truncatePrompt(prompt, 300);

  let systemPrompt = "";
  switch (mode) {
    case "ocr":
      systemPrompt = "识别图片中的文字。"; // "Recognize the text in the image."
      break;
    case "qa":
      systemPrompt = "根据图片回答问题。"; // "Answer the question based on the image."
      break;
    case "detect":
      systemPrompt = "识别图片中的物体。"; // "Identify the objects in the image."
      break;
    default:
      systemPrompt = "描述图片内容。"; // "Describe the image content."
  }
  if (returnJson) {
    systemPrompt += "用JSON格式回答。"; // "Answer in JSON format."
  }

  return {
    model: "glm-4v-plus",
    messages: [
      { role: "system", content: systemPrompt },
      {
        role: "user",
        content: [
          { type: "text", text: truncatedPrompt },
          { type: "image_url", image_url: { url: `data:image/jpeg;base64,${imageBase64}` } }
        ]
      }
    ],
    temperature: 0.1,
    max_tokens: 1000
  };
}
```
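The excerpts above call compressImage, whose source is not included in this reference. A plausible sketch using the sharp library, inferred from the call site compressImage(buffer, 800, 75) (treating the arguments as max width and JPEG quality; the actual implementation may differ):

```typescript
import sharp from "sharp";

// Hypothetical reconstruction: downscale to a maximum width and re-encode
// as JPEG to shrink the base64 payload (and thus token usage) sent to GLM.
async function compressImage(buffer: Buffer, maxWidth: number, quality: number): Promise<Buffer> {
  return sharp(buffer)
    .resize({ width: maxWidth, withoutEnlargement: true }) // never upscale small images
    .jpeg({ quality })
    .toBuffer();
}
```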
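Likewise, truncatePrompt and normalizeGlmResult are referenced but not excerpted. Minimal sketches of what they plausibly do, based only on how they are called (the repository's actual behavior may differ):

```typescript
// Hypothetical: cap the prompt at maxLength characters to bound request size.
function truncatePrompt(prompt: string, maxLength: number): string {
  return prompt.length <= maxLength ? prompt : prompt.slice(0, maxLength);
}

// Hypothetical: pull the assistant text out of the chat-completion response
// and, when returnJson was requested, try to parse it into an object.
function normalizeGlmResult(
  data: any,
  opts: { mode: string; returnJson: boolean }
): string | object {
  const text: string = data?.choices?.[0]?.message?.content ?? "";
  if (opts.returnJson) {
    try {
      return JSON.parse(text);
    } catch {
      return text; // model did not return valid JSON; fall back to raw text
    }
  }
  return text;
}
```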