detect-objects-by-text
Identify and locate specific objects in images using text prompts. Analyze images to detect, count, and describe objects with 2D coordinates based on user-defined categories.
Instructions
Analyze an image based on a text prompt to identify and count specific objects, and return detailed descriptions of the objects and their 2D coordinates.
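For context, the tool responds with an MCP content array of plain-text items: a per-category count summary, a JSON dump of per-object results, and a note about the bbox coordinate system (see the handler snippets under Implementation Reference). A sketch of that shape, with placeholder values:

```ts
// Illustrative response shape only; the counts and categories are placeholders.
const exampleResponse = {
  content: [
    { type: "text", text: "Objects detected in image: person (2), car (1)." },
    // Full per-object results: [{ name, bbox: { xmin, ymin, xmax, ymax }, description? }, ...]
    { type: "text", text: "Detailed object detection results: [ ... ]" },
    { type: "text", text: "Note: The bbox coordinates are in {xmin, ymin, xmax, ymax} format, ..." },
  ],
};
```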
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| imageFileUri | Yes | URI of the input image. Preferred for remote or local files. Must start with 'https://' or 'file://'. | |
| textPrompt | Yes | Nouns of target objects (English only, avoid adjectives). Use periods to separate multiple categories (e.g., 'person.car.traffic light'). | |
| includeDescription | Yes | Whether to return a description of each detected object; enabling this takes longer to process. | |
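
For reference, a typical argument object for this tool might look like the sketch below; the image URL and categories are placeholders.

```ts
// Hypothetical arguments for a detect-objects-by-text call.
const args = {
  // Remote image over HTTPS, or a local file such as "file:///home/user/photo.jpg".
  imageFileUri: "https://example.com/street-scene.jpg",
  // Noun categories only, separated by periods.
  textPrompt: "person.car.traffic light",
  // Set to true for per-object captions at the cost of longer processing time.
  includeDescription: false,
};
```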
Implementation Reference
- src/constants/tool.ts:20-22 (registration): Tool configuration defining the name and description for 'detect-objects-by-text' used in MCP server registration.

  ```ts
  [Tool.DETECT_BY_TEXT]: {
    name: Tool.DETECT_BY_TEXT,
    description:
      "Analyze an image based on a text prompt to identify and count specific objects, and return detailed descriptions of the objects and their 2D coordinates.",
  ```
- src/dino-x/client.ts:186-201 (handler): Core implementation of object detection by text prompt in DinoXApiClient, which calls the DINO-X API and handles polling.

  ```ts
  async detectObjectsByText(
    imageFileUri: string,
    textPrompt: string,
    includeDescription: boolean
  ): Promise<DetectionResult> {
    return this.performDetection(imageFileUri, includeDescription, {
      model: "DINO-X-1.0",
      prompt: { type: "text", text: textPrompt },
      targets: ["bbox"],
      bbox_threshold: 0.25,
      iou_threshold: 0.8
    });
  }
  ```
- src/servers/stdio-server.ts:44-117 (registration): MCP tool registration in the stdio server: defines the Zod input schema, registers 'detect-objects-by-text' with the description from constants, and provides a handler that calls DinoXApiClient and formats results.

  ```ts
  this.server.tool(
    name,
    description,
    {
      imageFileUri: z.string().describe("URI of the input image. Preferred for remote or local files. Must start with 'https://' or 'file://'."),
      textPrompt: z.string().describe("Nouns of target objects (English only, avoid adjectives). Use periods to separate multiple categories (e.g., 'person.car.traffic light')."),
      includeDescription: z.boolean().describe("Whether to return a description of the objects detected in the image, but will take longer to process."),
    },
    async (args) => {
      try {
        const { imageFileUri, textPrompt, includeDescription } = args;
        if (!imageFileUri || !textPrompt) {
          return {
            content: [
              {
                type: 'text',
                text: 'Image file URI and text prompt are required',
              },
            ],
          }
        }
        const { objects } = await this.api.detectObjectsByText(imageFileUri, textPrompt, includeDescription);
        const categories: ResultCategory = {};
        for (const object of objects) {
          if (!categories[object.category]) {
            categories[object.category] = [];
          }
          categories[object.category].push(object);
        }
        const objectsInfo = objects.map(obj => {
          const bbox = parseBbox(obj.bbox);
          return {
            name: obj.category,
            bbox,
            ...(includeDescription ? {
              description: obj.caption,
            } : {}),
          }
        });
        return {
          content: [
            {
              type: "text",
              text: `Objects detected in image: ${Object.keys(categories).map(cat =>
                `${cat} (${categories[cat].length})`
              )?.join(', ')}.`
            },
            {
              type: "text",
              text: `Detailed object detection results: ${JSON.stringify((objectsInfo), null, 2)}`
            },
            {
              type: "text",
              text: `Note: The bbox coordinates are in {xmin, ymin, xmax, ymax} format, where the origin (0,0) is at the top-left corner of the image. These coordinates help determine the exact position and spatial relationships of objects in the image.`
            },
          ]
        };
      } catch (error) {
        return {
          content: [
            {
              type: 'text',
              text: `Failed to detect objects from image: ${error instanceof Error ? error.message : String(error)}`,
            },
          ],
        };
      }
    }
  )
  ```
- src/servers/http-server.ts:77-150 (registration): MCP tool registration in the HTTP server: similar to stdio, with Zod schema and a handler wrapping the API client call.

  ```ts
  this.server.tool(
    name,
    description,
    {
      imageFileUri: z.string().describe("URI of the input image. Preferred for remote or local files. Must start with 'https://.'"),
      textPrompt: z.string().describe("Nouns of target objects (English only, avoid adjectives). Use periods to separate multiple categories (e.g., 'person.car.traffic light')."),
      includeDescription: z.boolean().describe("Whether to return a description of the objects detected in the image, but will take longer to process."),
    },
    async (args) => {
      try {
        const { imageFileUri, textPrompt, includeDescription } = args;
        if (!imageFileUri || !textPrompt) {
          return {
            content: [
              {
                type: 'text',
                text: 'Image file URI and text prompt are required',
              },
            ],
          }
        }
        const { objects } = await this.api.detectObjectsByText(imageFileUri, textPrompt, includeDescription);
        const categories: ResultCategory = {};
        for (const object of objects) {
          if (!categories[object.category]) {
            categories[object.category] = [];
          }
          categories[object.category].push(object);
        }
        const objectsInfo = objects.map(obj => {
          const bbox = parseBbox(obj.bbox);
          return {
            name: obj.category,
            bbox,
            ...(includeDescription ? {
              description: obj.caption,
            } : {}),
          }
        });
        return {
          content: [
            {
              type: "text",
              text: `Objects detected in image: ${Object.keys(categories).map(cat =>
                `${cat} (${categories[cat].length})`
              )?.join(', ')}.`
            },
            {
              type: "text",
              text: `Detailed object detection results: ${JSON.stringify((objectsInfo), null, 2)}`
            },
            {
              type: "text",
              text: `Note: The bbox coordinates are in {xmin, ymin, xmax, ymax} format, where the origin (0,0) is at the top-left corner of the image. These coordinates help determine the exact position and spatial relationships of objects in the image.`
            },
          ]
        };
      } catch (error) {
        return {
          content: [
            {
              type: 'text',
              text: `Failed to detect objects from image: ${error instanceof Error ? error.message : String(error)}`,
            },
          ],
        };
      }
    }
  )
  ```
- src/utils/index.ts:82-89 (helper): Utility function that parses the bounding box array from the API response into a structured object, used in all tool handlers (a combined usage sketch follows this list).

  ```ts
  export const parseBbox = (bbox: number[]) => {
    return {
      xmin: parseFloat(bbox[0].toFixed(1)),
      ymin: parseFloat(bbox[1].toFixed(1)),
      xmax: parseFloat(bbox[2].toFixed(1)),
      ymax: parseFloat(bbox[3].toFixed(1))
    };
  };
  ```
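
The sketch below shows how these pieces fit together outside the servers: it calls detectObjectsByText, counts results per category, and converts raw bounding boxes with parseBbox, mirroring the handlers above. The import paths and the DetectedObject shape are assumptions for illustration, not exact repository code.

```ts
import { DinoXApiClient } from "../dino-x/client"; // assumed import path
import { parseBbox } from "../utils";              // assumed import path

// Object shape inferred from the handler snippets above; the real type may differ.
interface DetectedObject { category: string; bbox: number[]; caption?: string; }

async function summarizeDetections(api: DinoXApiClient, imageUri: string) {
  // detectObjectsByText(imageFileUri, textPrompt, includeDescription), as shown in client.ts.
  const { objects } = await api.detectObjectsByText(imageUri, "person.car.traffic light", false);
  const detected = objects as DetectedObject[];

  // Count detections per category, as the stdio/http handlers do.
  const counts: Record<string, number> = {};
  for (const obj of detected) {
    counts[obj.category] = (counts[obj.category] ?? 0) + 1;
  }

  // Convert each raw [xmin, ymin, xmax, ymax] array into the structured form returned to clients.
  const boxes = detected.map(obj => ({
    name: obj.category,
    bbox: parseBbox(obj.bbox), // origin (0,0) is the top-left corner of the image
  }));

  return { counts, boxes };
}
```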