detect
Identify objects in images and return precise bounding box coordinates using Gemini's vision AI. Specify detection targets like UI elements or buttons for targeted analysis.
Instructions
Detect objects in an image and return bounding boxes. Uses Gemini for native bounding box support. Coordinates are normalized 0-1000 as [ymin, xmin, ymax, xmax].
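Because the coordinates are normalized rather than given in pixels, a caller usually has to scale them by the image dimensions before drawing or cropping. The following is a minimal sketch of that conversion; `toPixelBox` and the `PixelBox` shape are hypothetical names introduced for illustration and are not part of the tool.

```typescript
// Box as returned by the tool: [ymin, xmin, ymax, xmax], each in the range 0-1000.
type NormalizedBox = [number, number, number, number];

// Hypothetical pixel-space rectangle used only in this example.
interface PixelBox {
  left: number;
  top: number;
  width: number;
  height: number;
}

// Hypothetical helper: scale a normalized box to pixel units for an image
// of the given width and height. Note the y-before-x ordering of the input.
function toPixelBox(
  bbox: NormalizedBox,
  imageWidth: number,
  imageHeight: number
): PixelBox {
  const [ymin, xmin, ymax, xmax] = bbox;
  const left = Math.round((xmin / 1000) * imageWidth);
  const top = Math.round((ymin / 1000) * imageHeight);
  const right = Math.round((xmax / 1000) * imageWidth);
  const bottom = Math.round((ymax / 1000) * imageHeight);
  return { left, top, width: right - left, height: bottom - top };
}

// Example: a box on a 1920x1080 screenshot.
const box = toPixelBox([100, 250, 500, 750], 1920, 1080);
console.log(box); // { left: 480, top: 108, width: 960, height: 432 }
```

Rounding the edges first and deriving width and height from them keeps adjacent boxes from overlapping by a pixel after conversion.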
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| image | Yes | Path to the image file or URL (http/https) | |
| prompt | No | Optional: what to detect (e.g., 'find all buttons', 'detect UI elements') | |
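The arguments described by the table can be modeled as a small TypeScript type. The sketch below adds a runtime check for that shape; `DetectArgs` and `isDetectArgs` are hypothetical names, and the file path in the example is made up for illustration.

```typescript
// Hypothetical type mirroring the input schema: `image` is required,
// `prompt` is optional.
type DetectArgs = {
  image: string;
  prompt?: string;
};

// Hypothetical type guard: checks an unknown value against the schema
// before it is passed to the tool.
function isDetectArgs(value: unknown): value is DetectArgs {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  const image = v["image"];
  const prompt = v["prompt"];
  const imageOk = typeof image === "string" && image.length > 0;
  const promptOk = prompt === undefined || typeof prompt === "string";
  return imageOk && promptOk;
}

// Example arguments for a 'detect' call (the path is illustrative only).
const exampleArgs = {
  image: "./screenshots/login.png",
  prompt: "find all buttons",
};

console.log(isDetectArgs(exampleArgs)); // true
console.log(isDetectArgs({ prompt: "find all buttons" })); // false
```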
Implementation Reference
- src/tools/detect.ts:30-75 (handler): The main handler function for the 'detect' tool. It processes the input arguments, converts the image to base64, calls the geminiDetect helper, handles empty detections, and returns a structured JSON response with object count, list of objects with labels and bounding boxes.

  ```typescript
  export async function handleDetect(args: Record<string, unknown>) {
    const image = args.image as string;
    const prompt = args.prompt as string | undefined;

    const { base64, mimeType } = await imageToBase64(image);
    const detections = await geminiDetect(base64, mimeType, prompt);

    if (detections.length === 0) {
      return {
        content: [
          {
            type: "text",
            text: JSON.stringify(
              {
                count: 0,
                objects: [],
                note: "No objects detected. Try a more specific prompt.",
              },
              null,
              2
            ),
          },
        ],
      };
    }

    return {
      content: [
        {
          type: "text",
          text: JSON.stringify(
            {
              count: detections.length,
              objects: detections.map((d, i) => ({
                id: i + 1,
                label: d.label,
                bbox: d.bbox,
              })),
            },
            null,
            2
          ),
        },
      ],
    };
  }
  ```
- src/tools/detect.ts:9-28 (schema): Tool definition including name, description, and input schema for the 'detect' tool, specifying required 'image' parameter and optional 'prompt'.

  ```typescript
  export const detectTool: Tool = {
    name: "detect",
    description:
      "Detect objects in an image and return bounding boxes. Uses Gemini for native bounding box support. Coordinates are normalized 0-1000 as [ymin, xmin, ymax, xmax].",
    inputSchema: {
      type: "object",
      properties: {
        image: {
          type: "string",
          description: "Path to the image file or URL (http/https)",
        },
        prompt: {
          type: "string",
          description:
            "Optional: what to detect (e.g., 'find all buttons', 'detect UI elements')",
        },
      },
      required: ["image"],
    },
  };
  ```
- src/index.ts:37-46 (registration): Registers the 'detect' tool schema by including detectTool in the MCP server's list of available tools returned for ListToolsRequest.

  ```typescript
  server.setRequestHandler(ListToolsRequestSchema, async () => {
    return {
      tools: [
        describeTool,
        detectTool,
        describeRegionTool,
        analyzeColorsTool,
      ],
    };
  });
  ```
- src/index.ts:49-67 (registration): Registers the handler for 'detect' tool calls by dispatching to handleDetect in the switch statement for CallToolRequest.

  ```typescript
  server.setRequestHandler(CallToolRequestSchema, async (request) => {
    const { name, arguments: args = {} } = request.params;

    try {
      switch (name) {
        case "describe":
          return await handleDescribe(args);
        case "detect":
          return await handleDetect(args);
        case "describe_region":
          return await handleDescribeRegion(args);
        case "analyze_colors":
          return await handleAnalyzeColors(args);
        default:
          return {
            content: [{ type: "text", text: `Unknown tool: ${name}` }],
            isError: true,
          };
      }
  ```
- src/providers/gemini.ts:186-203 (helper): Helper function that performs the actual object detection using Gemini API: constructs specific detection prompt, invokes geminiRequest, and extracts parsed bounding boxes.

  ```typescript
  export async function geminiDetect(
    imageBase64: string,
    mimeType: string,
    prompt?: string
  ): Promise<Array<{ label: string; bbox: [number, number, number, number] }>> {
    const detectionPrompt = prompt
      ? `Detect and locate: ${prompt}. For each object found, provide its label and bounding box coordinates in the format: label [ymin, xmin, ymax, xmax] where coordinates are normalized 0-1000.`
      : `Detect all notable objects in this image. For each object, provide its label and bounding box coordinates in the format: label [ymin, xmin, ymax, xmax] where coordinates are normalized 0-1000.`;

    const response = await geminiRequest(imageBase64, mimeType, detectionPrompt);

    if (!response.boundingBoxes || response.boundingBoxes.length === 0) {
      // Return empty array if no boxes detected
      return [];
    }

    return response.boundingBoxes;
  }
  ```
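On the consuming side, the handler's result is a single text content item whose `text` field holds pretty-printed JSON with `count`, `objects`, and (for empty results) `note`. The following is a minimal sketch of how a client might decode it; `parseDetectResult` and the interface names are hypothetical, and the sample values are illustrative, not real tool output.

```typescript
// Shapes inferred from the handler shown in the list above; the names are
// assumptions made for this example.
interface DetectedObject {
  id: number;
  label: string;
  bbox: [number, number, number, number]; // [ymin, xmin, ymax, xmax], 0-1000
}

interface DetectPayload {
  count: number;
  objects: DetectedObject[];
  note?: string; // present only when nothing was detected
}

interface ToolResult {
  content: Array<{ type: string; text: string }>;
}

// Hypothetical helper: find the text item and parse its JSON body.
function parseDetectResult(result: ToolResult): DetectPayload {
  const textItem = result.content.find((item) => item.type === "text");
  if (!textItem) {
    throw new Error("detect result contains no text content");
  }
  return JSON.parse(textItem.text) as DetectPayload;
}

// Illustrative result in the same shape the handler produces.
const sample: ToolResult = {
  content: [
    {
      type: "text",
      text: JSON.stringify(
        {
          count: 1,
          objects: [{ id: 1, label: "Submit button", bbox: [820, 400, 880, 600] }],
        },
        null,
        2
      ),
    },
  ],
};

const payload = parseDetectResult(sample);
for (const obj of payload.objects) {
  console.log(`${obj.id}: ${obj.label} at [${obj.bbox.join(", ")}]`);
}
// prints "1: Submit button at [820, 400, 880, 600]"
```

Checking `payload.count === 0` (or the presence of `note`) is a simple way to tell an empty detection apart from a successful one, since both cases come back as ordinary, non-error results.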