
DINO-X Image Detection MCP Server

detect-objects-by-text

Identify and locate specific objects in images using text prompts. Analyze images to detect, count, and describe objects with 2D coordinates based on user-defined categories.

Instructions

Analyze an image based on a text prompt to identify and count specific objects, and return detailed descriptions of the objects and their 2D coordinates.

Input Schema

Name | Required | Description | Default
imageFileUri | Yes | URI of the input image. Preferred for remote or local files. Must start with 'https://' or 'file://'. | —
textPrompt | Yes | Nouns of target objects (English only; avoid adjectives). Use periods to separate multiple categories (e.g., 'person.car.traffic light'). | —
includeDescription | Yes | Whether to return a description of each detected object; enabling this increases processing time. | —
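
For illustration, a minimal sketch of how an MCP client could invoke this tool over stdio using the official TypeScript SDK. The server entry point path and the image URL are placeholders, not values taken from this repository:

    import { Client } from "@modelcontextprotocol/sdk/client/index.js";
    import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

    // Spawn the MCP server as a child process (entry point path is assumed).
    const transport = new StdioClientTransport({
      command: "node",
      args: ["dist/index.js"],
    });

    const client = new Client({ name: "example-client", version: "1.0.0" });
    await client.connect(transport);

    // Call the tool with a placeholder image and three target categories.
    const result = await client.callTool({
      name: "detect-objects-by-text",
      arguments: {
        imageFileUri: "https://example.com/street.jpg",
        textPrompt: "person.car.traffic light",
        includeDescription: false,
      },
    });
    console.log(result.content);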

Implementation Reference

  • Tool configuration defining the name and description for 'detect-objects-by-text' used in MCP server registration.
    [Tool.DETECT_BY_TEXT]: {
      name: Tool.DETECT_BY_TEXT,
      description: "Analyze an image based on a text prompt to identify and count specific objects, and return detailed descriptions of the objects and their 2D coordinates.",
  • Core implementation of object detection by text prompt in DinoXApiClient, which calls the DINO-X API and handles polling.
    async detectObjectsByText(
      imageFileUri: string,
      textPrompt: string,
      includeDescription: boolean
    ): Promise<DetectionResult> {
      return this.performDetection(imageFileUri, includeDescription, {
        model: "DINO-X-1.0",
        prompt: {
          type: "text",
          text: textPrompt
        },
        targets: ["bbox"],
        bbox_threshold: 0.25,
        iou_threshold: 0.8
      });
    }
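    The performDetection helper is not shown on this page. As a rough, hypothetical sketch of the create-then-poll pattern described above — the endpoint layout, header name, and response fields below are assumptions for illustration, not the actual DINO-X API surface:

    // Hypothetical create-then-poll helper (all URLs, headers, and fields assumed).
    async function pollTask<T>(
      createUrl: string,
      body: unknown,
      apiToken: string
    ): Promise<T> {
      const createRes = await fetch(createUrl, {
        method: "POST",
        headers: { "Content-Type": "application/json", Token: apiToken },
        body: JSON.stringify(body),
      });
      const { task_uuid } = (await createRes.json()) as { task_uuid: string };

      // Poll until the task reaches a terminal state.
      for (;;) {
        await new Promise((resolve) => setTimeout(resolve, 1000));
        const statusRes = await fetch(`${createUrl}/${task_uuid}`, {
          headers: { Token: apiToken },
        });
        const status = (await statusRes.json()) as {
          state: "waiting" | "running" | "success" | "failed";
          result?: T;
          error?: string;
        };
        if (status.state === "success") return status.result as T;
        if (status.state === "failed") {
          throw new Error(status.error ?? "Detection task failed");
        }
      }
    }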
  • MCP tool registration in stdio server: defines Zod input schema, registers 'detect-objects-by-text' with description from constants, and provides handler that calls DinoXApiClient and formats results.
    this.server.tool(
      name,
      description,
      {
        imageFileUri: z.string().describe("URI of the input image. Preferred for remote or local files. Must start with 'https://' or 'file://'."),
        textPrompt: z.string().describe("Nouns of target objects (English only, avoid adjectives). Use periods to separate multiple categories (e.g., 'person.car.traffic light')."),
        includeDescription: z.boolean().describe("Whether to return a description of the objects detected in the image, but will take longer to process."),
      },
      async (args) => {
        try {
          const { imageFileUri, textPrompt, includeDescription } = args;
    
          if (!imageFileUri || !textPrompt) {
            return {
              content: [
                {
                  type: 'text',
                  text: 'Image file URI and text prompt are required',
                },
              ],
            }
          }
    
          const { objects } = await this.api.detectObjectsByText(imageFileUri, textPrompt, includeDescription);
    
          const categories: ResultCategory = {};
          for (const object of objects) {
            if (!categories[object.category]) {
              categories[object.category] = [];
            }
            categories[object.category].push(object);
          }
    
          const objectsInfo = objects.map(obj => {
            const bbox = parseBbox(obj.bbox);
            return {
              name: obj.category,
              bbox,
              ...(includeDescription ? {
                description: obj.caption,
              } : {}),
            }
          });
    
          return {
            content: [
              {
                type: "text",
                text: `Objects detected in image: ${Object.keys(categories).map(cat =>
                  `${cat} (${categories[cat].length})`
                )?.join(', ')}.`
              },
              {
                type: "text",
                text: `Detailed object detection results: ${JSON.stringify((objectsInfo), null, 2)}`
              },
              {
                type: "text",
                text: `Note: The bbox coordinates are in {xmin, ymin, xmax, ymax} format, where the origin (0,0) is at the top-left corner of the image. These coordinates help determine the exact position and spatial relationships of objects in the image.`
              },
            ]
          };
        } catch (error) {
          return {
            content: [
              {
                type: 'text',
                text: `Failed to detect objects from image: ${error instanceof Error ? error.message : String(error)}`,
              },
            ],
          };
        }
      }
    )
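    For reference, each entry in objectsInfo above has the following shape, derived from the handler code (the type name itself is ours, not from the repository):

    type DetectedObjectInfo = {
      name: string;              // obj.category
      bbox: {                    // parsed by parseBbox, rounded to one decimal
        xmin: number;
        ymin: number;
        xmax: number;
        ymax: number;
      };
      description?: string;      // obj.caption; present only when includeDescription is true
    };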
  • MCP tool registration in HTTP server: mirrors the stdio registration with the same Zod schema and handler wrapping the API client call, except the imageFileUri description permits only 'https://' URIs and omits 'file://'.
    this.server.tool(
      name,
      description,
      {
        imageFileUri: z.string().describe("URI of the input image. Preferred for remote or local files. Must start with 'https://'."),
        textPrompt: z.string().describe("Nouns of target objects (English only, avoid adjectives). Use periods to separate multiple categories (e.g., 'person.car.traffic light')."),
        includeDescription: z.boolean().describe("Whether to return a description of the objects detected in the image, but will take longer to process."),
      },
      async (args) => {
        try {
          const { imageFileUri, textPrompt, includeDescription } = args;
    
          if (!imageFileUri || !textPrompt) {
            return {
              content: [
                {
                  type: 'text',
                  text: 'Image file URI and text prompt are required',
                },
              ],
            }
          }
    
          const { objects } = await this.api.detectObjectsByText(imageFileUri, textPrompt, includeDescription);
    
          const categories: ResultCategory = {};
          for (const object of objects) {
            if (!categories[object.category]) {
              categories[object.category] = [];
            }
            categories[object.category].push(object);
          }
    
          const objectsInfo = objects.map(obj => {
            const bbox = parseBbox(obj.bbox);
            return {
              name: obj.category,
              bbox,
              ...(includeDescription ? {
                description: obj.caption,
              } : {}),
            }
          });
    
          return {
            content: [
              {
                type: "text",
                text: `Objects detected in image: ${Object.keys(categories).map(cat =>
                  `${cat} (${categories[cat].length})`
                )?.join(', ')}.`
              },
              {
                type: "text",
                text: `Detailed object detection results: ${JSON.stringify((objectsInfo), null, 2)}`
              },
              {
                type: "text",
                text: `Note: The bbox coordinates are in {xmin, ymin, xmax, ymax} format, where the origin (0,0) is at the top-left corner of the image. These coordinates help determine the exact position and spatial relationships of objects in the image.`
              },
            ]
          };
        } catch (error) {
          return {
            content: [
              {
                type: 'text',
                text: `Failed to detect objects from image: ${error instanceof Error ? error.message : String(error)}`,
              },
            ],
          };
        }
      }
    )
  • Utility function to parse bounding box array from API response into structured object, used in all tool handlers.
    export const parseBbox = (bbox: number[]) => {
      return {
        xmin: parseFloat(bbox[0].toFixed(1)),
        ymin: parseFloat(bbox[1].toFixed(1)),
        xmax: parseFloat(bbox[2].toFixed(1)),
        ymax: parseFloat(bbox[3].toFixed(1))
      };
    };
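    A quick usage example (input values are arbitrary):

    // toFixed(1) rounds to one decimal place; parseFloat converts the string back to a number.
    parseBbox([10.234, 20.567, 300.891, 400.123]);
    // => { xmin: 10.2, ymin: 20.6, xmax: 300.9, ymax: 400.1 }
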
Behavior: 2/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

No annotations are provided, so the description carries the full burden. It mentions that including descriptions will 'take longer to process', which is useful behavioral context. However, it doesn't disclose other critical traits like potential rate limits, error conditions, authentication needs, or what happens with invalid inputs. For a tool with no annotation coverage, this leaves significant gaps in understanding its behavior.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is a single, well-structured sentence that efficiently conveys the core functionality without unnecessary words. It's front-loaded with the main action and outcome, making it easy to understand at a glance. Every part of the sentence contributes directly to explaining what the tool does.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 3/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given that there's no output schema and no annotations, the description should ideally provide more context about return values and behavioral constraints. While it mentions what will be returned (descriptions and coordinates), it doesn't specify the format or structure of the output. For a tool with 3 required parameters and no structured output documentation, the description is adequate but leaves room for improvement in completeness.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema description coverage is 100%, so the schema already documents all three parameters thoroughly. The description doesn't add any meaningful semantic information beyond what's in the schema descriptions (e.g., it doesn't explain the format of returned coordinates or how object counts are calculated). Baseline score of 3 is appropriate when the schema does the heavy lifting.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the specific action ('analyze an image based on a text prompt'), the resource ('image'), and the outcome ('identify and count specific objects, return detailed descriptions and 2D coordinates'). It distinguishes from sibling tools like 'detect-all-objects' by specifying text-based filtering and from 'detect-human-pose-keypoints' by focusing on object detection rather than pose analysis.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 3/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description implies usage through its purpose statement but doesn't explicitly state when to use this tool versus alternatives like 'detect-all-objects' or 'visualize-detection-result'. It mentions the text prompt requirement, which suggests this tool is for targeted object detection, but lacks clear guidance on scenarios where other tools might be more appropriate.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

