Glama

detect

Identify objects in images and return precise bounding box coordinates using Gemini's vision AI. Specify detection targets like UI elements or buttons for targeted analysis.

Instructions

Detect objects in an image and return bounding boxes. Uses Gemini for native bounding box support. Coordinates are normalized 0-1000 as [ymin, xmin, ymax, xmax].
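
Because the coordinates are normalized to a 0-1000 range rather than given in pixels, a caller usually has to scale them by the image dimensions before drawing or cropping. The following is a minimal sketch of that conversion; `toPixelBox` and the `PixelBox` shape are hypothetical names used for illustration and are not part of the server.

```typescript
// Bounding box as returned by the tool: [ymin, xmin, ymax, xmax], each in 0-1000.
type NormalizedBox = [number, number, number, number];

interface PixelBox {
  left: number;
  top: number;
  width: number;
  height: number;
}

// Hypothetical helper (not part of the server): scale a normalized box
// to pixel units for an image of the given size.
function toPixelBox(
  bbox: NormalizedBox,
  imageWidth: number,
  imageHeight: number
): PixelBox {
  // Note the order: y comes before x in the tool's output.
  const [ymin, xmin, ymax, xmax] = bbox;
  const left = (xmin * imageWidth) / 1000;
  const top = (ymin * imageHeight) / 1000;
  const right = (xmax * imageWidth) / 1000;
  const bottom = (ymax * imageHeight) / 1000;
  return { left, top, width: right - left, height: bottom - top };
}

// Example: a box on a 1920x1080 screenshot.
const box = toPixelBox([250, 100, 750, 900], 1920, 1080);
console.log(box); // { left: 192, top: 270, width: 1536, height: 540 }
```

The y-before-x ordering is the detail most likely to cause swapped axes, which is why the sketch destructures the array by name before doing any arithmetic.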

Input Schema

Name    Required  Description                                                                Default
image   Yes       Path to the image file or URL (http/https)                                 -
prompt  No        Optional: what to detect (e.g., 'find all buttons', 'detect UI elements')  -
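
The schema above can be mirrored on the calling side with a small runtime check before a request is sent. This is only a sketch: `DetectArgs` and `parseDetectArgs` are hypothetical names, and the server's own handler is not known to perform this validation.

```typescript
// Shape of the 'detect' tool arguments, following the input schema.
interface DetectArgs {
  image: string; // required: path to the image file or an http/https URL
  prompt?: string; // optional: what to detect
}

// Hypothetical validator (not part of the server's source): returns typed
// arguments or throws with a message describing the first problem found.
function parseDetectArgs(args: Record<string, unknown>): DetectArgs {
  const { image, prompt } = args;
  if (typeof image !== "string" || image.length === 0) {
    throw new Error("'image' is required and must be a non-empty string");
  }
  const result: DetectArgs = { image };
  if (prompt !== undefined) {
    if (typeof prompt !== "string") {
      throw new Error("'prompt' must be a string when provided");
    }
    result.prompt = prompt;
  }
  return result;
}

// Example arguments; the file path is a placeholder.
const parsed = parseDetectArgs({
  image: "./screenshot.png",
  prompt: "find all buttons",
});
console.log(parsed.image, parsed.prompt);
```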

Implementation Reference

  • The main handler function for the 'detect' tool. It reads the input arguments, converts the image to base64, calls the geminiDetect helper, handles the empty-detection case, and returns a structured JSON response containing the object count and a list of objects with their labels and bounding boxes.
    export async function handleDetect(args: Record<string, unknown>) {
      const image = args.image as string;
      const prompt = args.prompt as string | undefined;
    
      const { base64, mimeType } = await imageToBase64(image);
      const detections = await geminiDetect(base64, mimeType, prompt);
    
      if (detections.length === 0) {
        return {
          content: [
            {
              type: "text",
              text: JSON.stringify(
                {
                  count: 0,
                  objects: [],
                  note: "No objects detected. Try a more specific prompt.",
                },
                null,
                2
              ),
            },
          ],
        };
      }
    
      return {
        content: [
          {
            type: "text",
            text: JSON.stringify(
              {
                count: detections.length,
                objects: detections.map((d, i) => ({
                  id: i + 1,
                  label: d.label,
                  bbox: d.bbox,
                })),
              },
              null,
              2
            ),
          },
        ],
      };
    }
  • Tool definition for the 'detect' tool, including its name, description, and input schema. The schema specifies the required 'image' parameter and the optional 'prompt' parameter.
    export const detectTool: Tool = {
      name: "detect",
      description:
        "Detect objects in an image and return bounding boxes. Uses Gemini for native bounding box support. Coordinates are normalized 0-1000 as [ymin, xmin, ymax, xmax].",
      inputSchema: {
        type: "object",
        properties: {
          image: {
            type: "string",
            description: "Path to the image file or URL (http/https)",
          },
          prompt: {
            type: "string",
            description:
              "Optional: what to detect (e.g., 'find all buttons', 'detect UI elements')",
          },
        },
        required: ["image"],
      },
    };
  • src/index.ts:37-46 (registration)
    Registers the 'detect' tool schema by including detectTool in the MCP server's list of available tools returned for ListToolsRequest.
    server.setRequestHandler(ListToolsRequestSchema, async () => {
      return {
        tools: [
          describeTool,
          detectTool,
          describeRegionTool,
          analyzeColorsTool,
        ],
      };
    });
  • src/index.ts:49-67 (registration)
    Registers the handler for 'detect' tool calls by dispatching to handleDetect in the switch statement for CallToolRequest.
    server.setRequestHandler(CallToolRequestSchema, async (request) => {
      const { name, arguments: args = {} } = request.params;
    
      try {
        switch (name) {
          case "describe":
            return await handleDescribe(args);
          case "detect":
            return await handleDetect(args);
          case "describe_region":
            return await handleDescribeRegion(args);
          case "analyze_colors":
            return await handleAnalyzeColors(args);
          default:
            return {
              content: [{ type: "text", text: `Unknown tool: ${name}` }],
              isError: true,
            };
        }
  • Helper function that performs the actual object detection using the Gemini API: it constructs a detection prompt, invokes geminiRequest, and returns the parsed bounding boxes.
    export async function geminiDetect(
      imageBase64: string,
      mimeType: string,
      prompt?: string
    ): Promise<Array<{ label: string; bbox: [number, number, number, number] }>> {
      const detectionPrompt = prompt
        ? `Detect and locate: ${prompt}. For each object found, provide its label and bounding box coordinates in the format: label [ymin, xmin, ymax, xmax] where coordinates are normalized 0-1000.`
        : `Detect all notable objects in this image. For each object, provide its label and bounding box coordinates in the format: label [ymin, xmin, ymax, xmax] where coordinates are normalized 0-1000.`;
    
      const response = await geminiRequest(imageBase64, mimeType, detectionPrompt);
    
      if (!response.boundingBoxes || response.boundingBoxes.length === 0) {
        // Return empty array if no boxes detected
        return [];
      }
    
      return response.boundingBoxes;
    }
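
On the client side, the handler's result arrives as a single text content item whose body is a JSON string, so it has to be parsed before the detections can be used. The sketch below assumes the response shape shown in handleDetect above; `parseDetectResult` and the interface names are hypothetical, and the sample result is built by hand rather than taken from real model output.

```typescript
// Response shapes, inferred from the handleDetect source.
interface DetectedObject {
  id: number;
  label: string;
  bbox: [number, number, number, number]; // [ymin, xmin, ymax, xmax], 0-1000
}

interface DetectPayload {
  count: number;
  objects: DetectedObject[];
  note?: string; // only present when nothing was detected
}

interface ToolResult {
  content: Array<{ type: string; text: string }>;
}

// Hypothetical client-side helper: extract and parse the JSON payload
// from the first text content item of a 'detect' result.
function parseDetectResult(result: ToolResult): DetectPayload {
  const first = result.content.find((item) => item.type === "text");
  if (!first) {
    throw new Error("detect result contained no text content");
  }
  return JSON.parse(first.text) as DetectPayload;
}

// Hand-built sample mirroring the handler's output format.
const sample: ToolResult = {
  content: [
    {
      type: "text",
      text: JSON.stringify(
        { count: 1, objects: [{ id: 1, label: "button", bbox: [100, 200, 150, 400] }] },
        null,
        2
      ),
    },
  ],
};

const payload = parseDetectResult(sample);
for (const obj of payload.objects) {
  console.log(`${obj.id}: ${obj.label} at [${obj.bbox.join(", ")}]`);
}
// prints "1: button at [100, 200, 150, 400]"
```

The type assertion on the parsed JSON is a convenience, not a guarantee; a stricter client would validate each field before trusting it.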
Behavior: 4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

With no annotations provided, the description carries the full burden. It discloses key behavioral traits: the tool returns bounding boxes, relies on Gemini's native bounding box support, and uses a specific coordinate format (normalized 0-1000 as [ymin, xmin, ymax, xmax]). However, it does not cover rate limits, error handling, or performance characteristics.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is front-loaded with the core purpose and efficiently adds necessary details in two sentences. Every sentence earns its place by clarifying the tool's function, technology used, and output format without redundancy.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 4/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the tool's complexity (object detection with two parameters) and no annotations or output schema, the description is fairly complete. It covers the purpose, technology, and output format, but could benefit from more details on behavioral aspects like error cases or limitations to be fully comprehensive.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema description coverage is 100%, so the schema already documents both parameters. The description adds some context by implying the 'prompt' parameter is for specifying what to detect, but it does not provide additional syntax or format details beyond what the schema provides, aligning with the baseline for high coverage.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the specific action ('detect objects in an image') and the resource ('image'), and it distinguishes from sibling tools like 'analyze_colors' or 'describe' by focusing on object detection with bounding boxes rather than color analysis or general description.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 3/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description implies usage for object detection in images, but it does not explicitly state when to use this tool versus alternatives like 'describe' or 'describe_region'. It mentions 'Uses Gemini for native bounding box support', which hints at a specific context, but lacks clear exclusions or named alternatives.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/simen/mcp-see'

If you have feedback or need assistance with the MCP directory API, please join our Discord server.