
DINO-X Image Detection MCP Server

detect-human-pose-keypoints

Detect 17 keypoints per person in images to analyze body posture and movement for applications like fitness tracking or motion analysis.

Instructions

Detects 17 keypoints for each person in an image, supporting body posture and movement analysis.

Input Schema

| Name | Required | Description | Default |
|------|----------|-------------|---------|
| imageFileUri | Yes | URI of the input image. Preferred for remote or local files. Must start with 'https://' or 'file://'. | |
| includeDescription | Yes | Whether to return a description of the detected objects; enabling it increases processing time. | |
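
A minimal sketch of invoking this tool from an MCP client using the official TypeScript SDK follows. The transport command, server entry point, and image URL are illustrative assumptions, not values documented for this server.

    // Hedged sketch: the stdio command and entry point are assumptions.
    import { Client } from "@modelcontextprotocol/sdk/client/index.js";
    import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

    const transport = new StdioClientTransport({
      command: "node",
      args: ["dist/index.js"], // hypothetical build entry point
    });

    const client = new Client({ name: "example-client", version: "1.0.0" });
    await client.connect(transport);

    const result = await client.callTool({
      name: "detect-human-pose-keypoints",
      arguments: {
        imageFileUri: "https://example.com/runner.jpg", // placeholder image
        includeDescription: false,
      },
    });
    console.log(result.content); // text parts, per the handlers below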

Implementation Reference

  • MCP tool handler implementation for 'detect-human-pose-keypoints' in HTTP server. Includes inline input schema (Zod), API client call, bbox and pose keypoints parsing, categorization, and formatted text response.
    private registerDetectHumanPoseKeypointsTool(): void {
      const { name, description } = ToolConfigs[Tool.DETECT_HUMAN_POSE_KEYPOINTS];
      this.server.tool(
        name,
        description,
        {
          imageFileUri: z.string().describe("URI of the input image. Preferred for remote or local files. Must start with 'https://'."),
          includeDescription: z.boolean().describe("Whether to return a description of the objects detected in the image, but will take longer to process."),
        },
        async (args) => {
          try {
            const { imageFileUri, includeDescription } = args;
    
            if (!imageFileUri) {
              return {
                content: [
                  {
                    type: 'text',
                    text: 'Image file URI is required',
                  },
                ],
              }
            }
    
            const { objects } = await this.api.detectHumanPoseKeypoints(imageFileUri, includeDescription);
    
            const categories: ResultCategory = {};
            for (const object of objects) {
              if (!categories[object.category]) {
                categories[object.category] = [];
              }
              categories[object.category].push(object);
            }
    
            const objectsInfo = objects.map(obj => {
              const bbox = parseBbox(obj.bbox);
              const pose = obj.pose ? parsePoseKeypoints(obj.pose) : undefined;
    
              return {
                name: obj.category,
                bbox,
                pose,
                ...(includeDescription ? {
                  description: obj.caption,
                } : {}),
              }
            });
    
            return {
              content: [
                {
                  type: "text",
                  text: `${objectsInfo.length} human(s) detected in image.`
                },
                {
                  type: "text",
                  text: `Detailed human pose keypoints detection results: ${JSON.stringify(objectsInfo, null, 2)}`
                },
                {
                  type: "text",
                  text: `Note: The bbox coordinates are in {xmin, ymin, xmax, ymax} format, where the origin (0,0) is at the top-left corner of the image. The pose keypoints follow the same coordinate system, with visibility states (not visible, visible).`
                },
              ]
            };
          } catch (error) {
            return {
              content: [
                {
                  type: 'text',
                  text: `Failed to detect human pose keypoints from image: ${error instanceof Error ? error.message : String(error)}`,
                },
              ],
            };
          }
        }
      )
    }
  • MCP tool handler implementation for 'detect-human-pose-keypoints' in STDIO server. Includes inline input schema (Zod), API client call, bbox and pose keypoints parsing, categorization, and formatted text response.
    private registerDetectHumanPoseKeypointsTool(): void {
      const { name, description } = ToolConfigs[Tool.DETECT_HUMAN_POSE_KEYPOINTS];
      this.server.tool(
        name,
        description,
        {
          imageFileUri: z.string().describe("URI of the input image. Preferred for remote or local files. Must start with 'https://' or 'file://'."),
          includeDescription: z.boolean().describe("Whether to return a description of the objects detected in the image, but will take longer to process."),
        },
        async (args) => {
          try {
            const { imageFileUri, includeDescription } = args;
    
            if (!imageFileUri) {
              return {
                content: [
                  {
                    type: 'text',
                    text: 'Image file URI is required',
                  },
                ],
              }
            }
    
            const { objects } = await this.api.detectHumanPoseKeypoints(imageFileUri, includeDescription);
    
            const categories: ResultCategory = {};
            for (const object of objects) {
              if (!categories[object.category]) {
                categories[object.category] = [];
              }
              categories[object.category].push(object);
            }
    
            const objectsInfo = objects.map(obj => {
              const bbox = parseBbox(obj.bbox);
              const pose = obj.pose ? parsePoseKeypoints(obj.pose) : undefined;
    
              return {
                name: obj.category,
                bbox,
                pose,
                ...(includeDescription ? {
                  description: obj.caption,
                } : {}),
              }
            });
    
            return {
              content: [
                {
                  type: "text",
                  text: `${objectsInfo.length} human(s) detected in image.`
                },
                {
                  type: "text",
                  text: `Detailed human pose keypoints detection results: ${JSON.stringify(objectsInfo, null, 2)}`
                },
                {
                  type: "text",
                  text: `Note: The bbox coordinates are in {xmin, ymin, xmax, ymax} format, where the origin (0,0) is at the top-left corner of the image. The pose keypoints follow the same coordinate system, with visibility states (not visible, visible).`
                },
              ]
            };
          } catch (error) {
            return {
              content: [
                {
                  type: 'text',
                  text: `Failed to detect human pose keypoints from image: ${error instanceof Error ? error.message : String(error)}`,
                },
              ],
            };
          }
        }
      )
    }
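  • The handlers above also reference the parseBbox helper and the ResultCategory type, which are not shown in this excerpt. The sketch below is inferred from how they are used and from the bbox-format note in the response text; the actual definitions may differ.
    // Hypothetical shapes, inferred from the fields the handlers read.
    interface DetectedObject {
      category: string;
      bbox: number[];
      pose?: number[];
      caption?: string;
    }

    export type ResultCategory = { [category: string]: DetectedObject[] };

    // Assumed helper: converts the raw [xmin, ymin, xmax, ymax] array into a
    // labeled object, matching the coordinate note the handlers emit.
    export const parseBbox = (bbox: number[]) => ({
      xmin: bbox[0],
      ymin: bbox[1],
      xmax: bbox[2],
      ymax: bbox[3],
    });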
  • Core DinoXApiClient method implementing the detection logic specific to human pose keypoints: it sets a 'person' text prompt, includes 'pose_keypoints' in the targets, and delegates to performDetection, which handles API task creation and polling (see the sketch after the snippet below).
    async detectHumanPoseKeypoints(
      imageFileUri: string,
      includeDescription: boolean
    ): Promise<DetectionResult> {
      return this.performDetection(imageFileUri, includeDescription, {
        model: "DINO-X-1.0",
        prompt: {
          type: "text",
          text: "person"
        },
        targets: ["bbox", "pose_keypoints"],
        bbox_threshold: 0.25,
        iou_threshold: 0.8
      });
    }
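    performDetection itself is not shown in this excerpt. The rough sketch below illustrates the create-then-poll pattern described above; the endpoint paths, header names, and response fields are placeholders, not the real client's values.
    // Placeholder sketch only: endpoints, headers, and fields are assumptions.
    private async performDetection(
      imageFileUri: string,
      includeDescription: boolean,
      payload: Record<string, unknown>
    ): Promise<DetectionResult> {
      const createRes = await fetch(`${this.baseUrl}/tasks`, {
        method: 'POST',
        headers: { 'Content-Type': 'application/json', Token: this.apiKey },
        body: JSON.stringify({
          image: imageFileUri,
          include_description: includeDescription, // assumed field name
          ...payload,
        }),
      });
      const { taskId } = await createRes.json();

      // Poll until the task leaves its pending state.
      for (;;) {
        await new Promise(resolve => setTimeout(resolve, 1000));
        const statusRes = await fetch(`${this.baseUrl}/tasks/${taskId}`, {
          headers: { Token: this.apiKey },
        });
        const task = await statusRes.json();
        if (task.status === 'success') return task.result as DetectionResult;
        if (task.status === 'failed') throw new Error(task.error ?? 'Task failed');
      }
    }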
  • Utility function that parses the raw API pose keypoints array into a structured object mapping the 17 standard keypoints to {x, y, visible}. A usage example follows the function.
    export const parsePoseKeypoints = (pose: number[]) => {
      // The 17 COCO-style keypoints, in the order the API returns them.
      const keypointNames = [
        'nose', 'leftEye', 'rightEye', 'leftEar', 'rightEar',
        'leftShoulder', 'rightShoulder', 'leftElbow', 'rightElbow',
        'leftWrist', 'rightWrist', 'leftHip', 'rightHip',
        'leftKnee', 'rightKnee', 'leftAnkle', 'rightAnkle'
      ];

      // Visibility codes mapped to readable states; hoisted out of the loop
      // since the mapping never changes between iterations.
      const visibilityMap = {
        0: 'not visible',
        2: 'visible'
      };

      const structuredPose: { [key: string]: { x: number; y: number; visible: string; } } = {};

      // Each keypoint occupies 4 consecutive values in the raw array:
      // x, y, a visibility code, and a fourth value that is ignored here.
      for (let i = 0; i < keypointNames.length; i++) {
        const baseIndex = i * 4;
        structuredPose[keypointNames[i]] = {
          x: parseFloat(pose[baseIndex].toFixed(1)),
          y: parseFloat(pose[baseIndex + 1].toFixed(1)),
          visible: visibilityMap[pose[baseIndex + 2] as keyof typeof visibilityMap],
        };
      }

      return structuredPose;
    };
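    For example, with illustrative values (a flat array of 17 × 4 numbers, where the third value of each 4-value group is the visibility code), the parser produces entries like:
    // Illustrative input only; real API values will differ.
    const flat = Array.from({ length: 17 * 4 }, (_, i) =>
      i % 4 === 2 ? 2 : 100 + i // code 2 means 'visible'
    );

    const structured = parsePoseKeypoints(flat);
    console.log(structured.nose);
    // -> { x: 100, y: 101, visible: 'visible' }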
  • Tool configuration defining the name and description used for MCP registration.
    [Tool.DETECT_HUMAN_POSE_KEYPOINTS]: {
      name: Tool.DETECT_HUMAN_POSE_KEYPOINTS,
      description: "Detects 17 keypoints for each person in an image, supporting body posture and movement analysis.",
    },
Behavior: 2/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

With no annotations provided, the description carries the full burden but offers minimal behavioral insight. It names the output (17 keypoints per person) and implies analysis uses, but says nothing about performance (e.g., speed, accuracy), limitations (e.g., image quality requirements), or side effects. Nothing contradicts the annotations, since there are none, yet the tool's behavioral profile is left largely undocumented.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is a single, efficient sentence that front-loads the core purpose without unnecessary details. Every word contributes to understanding the tool's function, making it appropriately concise and well-structured.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 2/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the tool's complexity (detection with two parameters) and the absence of annotations and an output schema, the description is incomplete. It doesn't explain return values (e.g., keypoint coordinates, confidence scores), error handling, or prerequisites (e.g., supported image formats), leaving significant gaps for agent usage.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema description coverage is 100%, so the schema fully documents both parameters. The description adds no parameter-specific information beyond the schema, such as how 'includeDescription' interacts with keypoint output or processing time, but this meets the baseline for high schema coverage.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 4/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool's function: detecting 17 keypoints per person in an image for posture/movement analysis. It specifies the verb ('detects'), resource ('keypoints'), and scope ('each person in an image'), but doesn't explicitly differentiate from sibling tools like 'detect-all-objects' or 'detect-objects-by-text', which likely serve different detection purposes.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 2/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description provides no guidance on when to use this tool versus alternatives. It doesn't mention sibling tools or contexts where human pose detection is preferred over general object detection, leaving the agent to infer usage based on tool names alone.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
