The DINO-X MCP Server enables comprehensive image analysis and detection for multimodal applications.
- Full-Scene Object Detection: Identify all objects in an image with categories, counts, coordinates, and optional descriptions
- Text-Prompted Object Detection: Locate specific objects using English noun prompts (e.g., `person.car`) with bounding boxes and descriptions
- Human Pose Estimation: Detect 17 keypoints per person for body posture and movement analysis
- Detection Visualization: Generate annotated images with bounding boxes and labels, saved locally (STDIO mode only)
- Structured Outputs: Provides detailed object data suitable for VQA and multi-step reasoning tasks
- Flexible Input Sources: Supports `https://` URLs and `file://` URIs with common image formats (JPG, JPEG, WebP, PNG)
- Multiple Runtime Modes: Available in local (STDIO) and remote (Streamable HTTP) modes for deployment flexibility
DINO-X MCP
DINO-X Official MCP Server — powered by the DINO-X and Grounding DINO models — brings fine-grained object detection and image understanding to your multimodal applications.
Why DINO-X MCP?
With DINO-X MCP, you can:
- Fine-Grained Understanding: Full image detection, object detection, and region-level descriptions.
- Structured Outputs: Get object categories, counts, locations, and attributes for VQA and multi-step reasoning tasks.
- Composable: Works seamlessly with other MCP servers to build end-to-end visual agents or automation pipelines.
Transport Modes
DINO-X MCP supports two transport modes:
Feature | STDIO (default) | Streamable HTTP |
---|---|---|
Runtime | Local | Local or Cloud |
Transport | Standard I/O | HTTP (streaming responses) |
Input source | file:// and https:// | https:// only |
Visualization | Supported (saves annotated images locally) | Not supported (for now) |
Quick Start
1. Prepare an MCP client
Any MCP-compatible client works.
2. Get your API key
Apply on the DINO-X platform: Request API Key (new users get free quota).
3. Configure MCP
Option A: Official Hosted Streamable HTTP (Recommended)
Add to your MCP client config and replace with your API key:
Option B: Use the NPM package locally (STDIO)
Install Node.js first
- Download the installer from nodejs.org
- Or use command:
Configure your MCP client:
Note: Replace `your-api-key-here` with your real key.
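For reference, a STDIO client entry along these lines can be generated with a short TypeScript sketch. Only the `DINOX_API_KEY` and `IMAGE_STORAGE_DIRECTORY` variable names come from this README; the `mcpServers` layout, the use of `npx -y`, the `dinox-mcp` entry name, and the `<dinox-mcp-package>` placeholder are assumptions, so check your client's documentation and the package's npm page for the exact values.

```typescript
// Sketch only: builds a STDIO entry in the "mcpServers" layout used by many MCP clients.
// ASSUMPTIONS: the layout, the "dinox-mcp" entry name, and launching via `npx -y`.
interface StdioServerEntry {
  command: string;
  args: string[];
  env: Record<string, string>;
}

function buildStdioConfig(
  packageName: string, // placeholder: the published npm package name
  apiKey: string,
  imageDir?: string, // optional directory for annotated images (STDIO only)
): { mcpServers: Record<string, StdioServerEntry> } {
  const env: Record<string, string> = { DINOX_API_KEY: apiKey };
  if (imageDir !== undefined) {
    env["IMAGE_STORAGE_DIRECTORY"] = imageDir;
  }
  return {
    mcpServers: {
      "dinox-mcp": { command: "npx", args: ["-y", packageName], env },
    },
  };
}

// Print a config fragment to paste into the client's configuration file.
const stdioConfig = buildStdioConfig(
  "<dinox-mcp-package>",
  "your-api-key-here",
  "/path/to/your/image/directory",
);
console.log(JSON.stringify(stdioConfig, null, 2));
```

Generating the fragment from code keeps the key out of hand-edited files when it is read from a secret store instead of being typed in.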
Option C: Run from source locally
Make sure Node.js is installed (see Option B), then:
Configure your MCP client:
CLI Flags & Environment Variables
- Common flags
  - `--http`: start in Streamable HTTP mode (otherwise STDIO by default)
  - `--stdio`: force STDIO mode
  - `--dinox-api-key=...`: set API key
  - `--enable-client-key`: allow API key via URL `?key=` (Streamable HTTP only)
  - `--port=8080`: HTTP port (default 3020)
- Environment variables
  - `DINOX_API_KEY` (required/conditionally required): DINO-X platform API key
  - `IMAGE_STORAGE_DIRECTORY` (optional, STDIO): directory to save annotated images
  - `AUTH_TOKEN` (optional, HTTP): if set, client must send `Authorization: Bearer <token>`
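From the client side, the `?key=` option and the `AUTH_TOKEN` header described above might be prepared as in the following sketch. The `/mcp` endpoint path and the helper names are hypothetical; only the `key` query parameter, the `Authorization: Bearer <token>` header, and the default port 3020 come from this README.

```typescript
// Sketch only: prepares the URL and headers a client could use in Streamable HTTP mode.
// ASSUMPTION: the "/mcp" endpoint path is illustrative, not taken from this README.

// Append the API key as ?key=... (the server must be started with --enable-client-key).
function buildServerUrl(baseUrl: string, clientKey?: string): string {
  const url = new URL(baseUrl);
  if (clientKey !== undefined) {
    url.searchParams.set("key", clientKey);
  }
  return url.toString();
}

// Build the Authorization header expected when AUTH_TOKEN is set on the server.
function buildAuthHeaders(authToken?: string): Record<string, string> {
  const headers: Record<string, string> = {};
  if (authToken !== undefined) {
    headers["Authorization"] = `Bearer ${authToken}`;
  }
  return headers;
}

console.log(buildServerUrl("http://localhost:3020/mcp", "your-api-key-here"));
console.log(buildAuthHeaders("your-auth-token"));
```

Passing the key in the URL is convenient for clients that cannot set headers, but URLs tend to end up in logs, so the header-based token is usually the safer choice when a gateway is available.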
Examples:
Client config when using `?key=`:
Using `AUTH_TOKEN` with a gateway that injects `Authorization: Bearer <token>`:
Client example with `supergateway`:
Tools
Capability | Tool ID | Transport | Input | Output |
---|---|---|---|---|
Full-scene object detection | detect-all-objects | STDIO / HTTP | Image URL | Category + bbox + (optional) captions |
Text-prompted object detection | detect-objects-by-text | STDIO / HTTP | Image URL + English nouns (dot-separated for multiple, e.g., `person.car`) | Target object bbox + (optional) captions |
Human pose estimation | detect-human-pose-keypoints | STDIO / HTTP | Image URL | 17 keypoints + bbox + (optional) captions |
Visualization | visualize-detection-result | STDIO only | Image URL + detection results array | Local path to annotated image |
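The dot-separated prompt format used by `detect-objects-by-text` can be produced with a small helper. This is a minimal sketch: `buildTextPrompt` is a hypothetical name, and rejecting nouns that contain a dot is an assumption made to keep the separator unambiguous.

```typescript
// Sketch only: joins English nouns into the dot-separated prompt format (e.g. "person.car").
function buildTextPrompt(nouns: string[]): string {
  const cleaned = nouns.map((n) => n.trim()).filter((n) => n.length > 0);
  if (cleaned.length === 0) {
    throw new Error("at least one noun is required");
  }
  for (const noun of cleaned) {
    // ASSUMPTION: a dot inside a noun would be read as a separator, so reject it.
    if (noun.includes(".")) {
      throw new Error(`noun must not contain '.': ${noun}`);
    }
  }
  return cleaned.join(".");
}

console.log(buildTextPrompt(["person", "car"]));
```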
🎬 Use Cases
🎯 Scenario | 📝 Input | ✨ Output |
---|---|---|
Detection & Localization | 💬 Prompt: Detect and visualize the fire areas in the forest | |
Object Counting | 💬 Prompt: Please analyze this warehouse image, detect all the cardboard boxes, count the total number | |
Feature Detection | 💬 Prompt: Find all red cars in the image | |
Attribute Reasoning | 💬 Prompt: Find the tallest person in the image, describe their clothing | |
Full Scene Detection | 💬 Prompt: Find the fruit with the highest vitamin C content in the image | |
Pose Analysis | 💬 Prompt: Please analyze what yoga pose this is | |
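Scenarios such as object counting and "find the tallest person" reduce to simple post-processing of the structured detections. The sketch below assumes a hypothetical `Detection` shape with a `category` string and a `[x1, y1, x2, y2]` bounding box; the actual field names and box format returned by the tools may differ.

```typescript
// Sketch only. ASSUMPTION: this Detection shape is illustrative, not the server's schema.
interface Detection {
  category: string;
  bbox: [number, number, number, number]; // assumed [x1, y1, x2, y2] in pixels
  caption?: string;
}

// Count how many detections fall into each category.
function countByCategory(detections: Detection[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const d of detections) {
    counts.set(d.category, (counts.get(d.category) ?? 0) + 1);
  }
  return counts;
}

// Return the detection of the given category with the largest box height, if any.
function tallestOf(detections: Detection[], category: string): Detection | undefined {
  let best: Detection | undefined;
  for (const d of detections) {
    if (d.category !== category) continue;
    const height = d.bbox[3] - d.bbox[1];
    if (best === undefined || height > best.bbox[3] - best.bbox[1]) {
      best = d;
    }
  }
  return best;
}

const sampleDetections: Detection[] = [
  { category: "person", bbox: [10, 20, 60, 220] },
  { category: "person", bbox: [80, 10, 140, 260] },
  { category: "cardboard box", bbox: [200, 150, 260, 210] },
];
console.log(countByCategory(sampleDetections).get("person"));
console.log(tallestOf(sampleDetections, "person"));
```

Box height in pixels is only a proxy for real-world height, since people farther from the camera appear smaller.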
FAQ
- Supported image sources?
  - STDIO: `file://` and `https://`
  - Streamable HTTP: `https://` only
- Supported image formats?
  - jpg, jpeg, webp, png
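A client can pre-check an image reference against these rules before calling a tool. The following is a minimal sketch with a hypothetical helper name; judging the format by file extension is an assumption, since the server may inspect the image content instead and some valid URLs carry no extension.

```typescript
// Sketch only: client-side pre-check of an image URI against the rules listed above.
type TransportMode = "stdio" | "http";

const SUPPORTED_EXTENSIONS = ["jpg", "jpeg", "webp", "png"];

function isSupportedImage(uri: string, mode: TransportMode): boolean {
  let url: URL;
  try {
    url = new URL(uri);
  } catch {
    return false; // not a parseable absolute URI
  }
  // https:// works in both modes; file:// only in STDIO mode.
  const schemeOk =
    url.protocol === "https:" || (mode === "stdio" && url.protocol === "file:");
  // ASSUMPTION: the format is inferred from the path's file extension.
  const ext = url.pathname.split(".").pop()?.toLowerCase() ?? "";
  return schemeOk && SUPPORTED_EXTENSIONS.includes(ext);
}

console.log(isSupportedImage("file:///tmp/photo.jpg", "stdio"));
console.log(isSupportedImage("file:///tmp/photo.jpg", "http"));
```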
Development & Debugging
Use watch mode to auto-rebuild during development:
Use MCP Inspector for debugging:
License
Apache License 2.0
Empower LLMs with fine-grained visual understanding — detect, localize, and describe anything in images with natural language prompts.