XC-MCP: XCode CLI wrapper

by conorluddy

screenshot

Capture simulator screenshots and return them as base64-encoded, optimized images with interactive elements and coordinate transforms for UI automation.

Instructions

simctl-screenshot-inline

Capture optimized screenshots with inline base64 encoding for direct MCP response transmission.

What it does

Captures simulator screenshots and returns them as base64-encoded images directly in the MCP response. Automatically optimizes images for token efficiency with tile-aligned resizing and WebP/JPEG compression. Includes interactive element detection and coordinate transforms.

Parameters

  • udid (string, optional): Simulator UDID (auto-detects booted device if omitted)

  • size (string, optional): Screenshot size - half, full, quarter, thumb (default: half)

  • appName (string, optional): App name for semantic context

  • screenName (string, optional): Screen/view name for semantic context

  • state (string, optional): UI state for semantic context

  • enableCoordinateCaching (boolean, optional): Enable view fingerprinting for coordinate caching
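The parameter list above can be written down as a type. The following is a minimal sketch, not the tool's actual schema: the names `ScreenshotParams`, `ScreenshotSize`, and `resolveSize` are hypothetical, and only the field names, the allowed size values, and the `half` default come from the list.

```typescript
// Hypothetical parameter shape mirroring the documented parameter list.
type ScreenshotSize = "half" | "full" | "quarter" | "thumb";

interface ScreenshotParams {
  udid?: string; // omitted: the booted device is auto-detected
  size?: ScreenshotSize; // documented default is "half"
  appName?: string; // semantic context
  screenName?: string; // semantic context
  state?: string; // semantic context
  enableCoordinateCaching?: boolean; // opt-in view fingerprinting
}

const SCREENSHOT_SIZES: readonly ScreenshotSize[] = ["half", "full", "quarter", "thumb"];

// Illustrative helper (an assumption, not part of the tool): applies the
// documented default and rejects values outside the documented set.
function resolveSize(params: ScreenshotParams): ScreenshotSize {
  const size = params.size ?? "half";
  if (SCREENSHOT_SIZES.indexOf(size) === -1) {
    throw new Error(`Unknown size: ${String(size)}`);
  }
  return size;
}
```

Calling `resolveSize({})` yields `"half"`, matching the documented default.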

Screenshot Size Optimization

Automatically optimizes screenshots for token efficiency:

  • half (default): 256×512 pixels, 1 tile, ~170 tokens (50% savings)

  • full: Native resolution, 2 tiles, ~340 tokens

  • quarter: 128×256 pixels, 1 tile, ~170 tokens

  • thumb: 128×128 pixels, 1 tile, ~170 tokens

Automatic Optimization Process

  1. Capture: Screenshot taken at native resolution

  2. Resize: Automatically resized to tile-aligned dimensions (unless size='full')

  3. Compress: Converted to WebP format at 60% quality (falls back to JPEG if unavailable)

  4. Encode: Base64-encoded for inline MCP response transmission

  5. Extract: Interactive elements detected from accessibility tree

  6. Transform: Coordinate mapping provided for resized screenshots
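Steps 2 and 6 are linked: the resize target determines the scale factors needed to map back to the device. The sketch below illustrates that relationship under stated assumptions. The preset dimensions come from the size list above; the helper names and the formula "scale = device dimension / screenshot dimension" are assumptions for illustration, not the tool's confirmed implementation.

```typescript
type SizePreset = "half" | "full" | "quarter" | "thumb";

interface Dimensions {
  width: number;
  height: number;
}

// Documented target dimensions; "full" keeps the native resolution.
const PRESET_DIMENSIONS: Record<Exclude<SizePreset, "full">, Dimensions> = {
  half: { width: 256, height: 512 },
  quarter: { width: 128, height: 256 },
  thumb: { width: 128, height: 128 },
};

// Hypothetical helper: picks the resize target for a preset.
function targetDimensions(size: SizePreset, native: Dimensions): Dimensions {
  return size === "full" ? { ...native } : { ...PRESET_DIMENSIONS[size] };
}

// Assumption: a scale factor is the device dimension divided by the
// screenshot dimension, so screenshot * scale = device.
function scaleFactors(
  device: Dimensions,
  screenshot: Dimensions
): { scaleX: number; scaleY: number } {
  return {
    scaleX: device.width / screenshot.width,
    scaleY: device.height / screenshot.height,
  };
}
```

For a made-up 1024×2048 device, the `half` preset gives a 256×512 screenshot and scale factors of 4 on both axes. Whether the tool measures the device in points or pixels is not stated in this document.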

Returns

MCP response with:

  • Base64-encoded optimized image (inline)

  • Screenshot optimization metadata (dimensions, tokens, savings)

  • Interactive elements with coordinates and properties

  • Coordinate transform for mapping screenshot to device coordinates

  • View fingerprint (if enableCoordinateCaching is true)

  • Semantic metadata (if provided)

Examples

Simple optimized screenshot (256×512)

await simctlScreenshotInlineTool({
  udid: 'device-123'
})

Full resolution screenshot

await simctlScreenshotInlineTool({
  udid: 'device-123',
  size: 'full'
})

Screenshot with semantic context

await simctlScreenshotInlineTool({
  udid: 'device-123',
  appName: 'MyApp',
  screenName: 'LoginScreen',
  state: 'Empty'
})

Screenshot with coordinate caching enabled

await simctlScreenshotInlineTool({
  udid: 'device-123',
  enableCoordinateCaching: true
})

Interactive Element Detection

Automatically extracts interactive elements from the accessibility tree:

  • Element type (Button, TextField, etc.)

  • Label and identifier

  • Bounds (x, y, width, height)

  • Tappability status

Limited to top 20 elements to avoid token overflow. Elements are filtered to only include those with bounds and hittable status.
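The filtering rule described above can be sketched as follows. The `UiElement` shape and the function name are hypothetical. The document does not say how the "top 20" are ranked, so this sketch assumes they are simply the first 20 qualifying elements in the order received.

```typescript
interface Bounds {
  x: number;
  y: number;
  width: number;
  height: number;
}

// Hypothetical element shape based on the properties listed above.
interface UiElement {
  type: string; // e.g. "Button", "TextField"
  label?: string;
  identifier?: string;
  bounds?: Bounds;
  hittable?: boolean;
}

const MAX_ELEMENTS = 20;

// Keeps only elements that have bounds and are hittable, then caps the list.
// Assumption: "top 20" means the first 20 in input order.
function selectInteractiveElements(elements: UiElement[]): UiElement[] {
  return elements
    .filter((el) => el.bounds !== undefined && el.hittable === true)
    .slice(0, MAX_ELEMENTS);
}
```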

Coordinate Transform

When a screenshot is resized (size ≠ 'full'), the response includes a coordinate transform for mapping screenshot coordinates back to device coordinates.

Use the coordinateTransformHelper field in the response with idb-ui-tap:

  1. Identify element coordinates visually from the screenshot

  2. Call idb-ui-tap with applyScreenshotScale: true plus scale factors

  3. The tool automatically transforms screenshot coordinates to device coordinates

Example:

idb-ui-tap {
  x: 256,              // Screenshot coordinate
  y: 512,              // Screenshot coordinate
  applyScreenshotScale: true,
  screenshotScaleX: 1.67,
  screenshotScaleY: 1.66
}
// Tool automatically calculates: deviceX = 256 * 1.67, deviceY = 512 * 1.66

Manual Transformation (For Reference)

If not using automatic transformation:

  • scaleX: Multiply screenshot X coordinates by this to get device coordinates

  • scaleY: Multiply screenshot Y coordinates by this to get device coordinates

  • coordinateTransform.guidance: Human-readable instructions

Important: Most agents should use the automatic transformation via idb-ui-tap's applyScreenshotScale parameter. Manual calculation is provided for reference only.
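For reference, the manual calculation amounts to one multiplication per axis. This is a minimal sketch: the function and type names are hypothetical, and rounding to whole units is an assumption, since the document does not say how fractional results are handled.

```typescript
interface ScaleFactors {
  scaleX: number;
  scaleY: number;
}

interface Point {
  x: number;
  y: number;
}

// Maps a screenshot coordinate to a device coordinate by multiplying each
// axis by its scale factor. Rounding is an assumption of this sketch.
function toDeviceCoordinates(point: Point, scale: ScaleFactors): Point {
  return {
    x: Math.round(point.x * scale.scaleX),
    y: Math.round(point.y * scale.scaleY),
  };
}

// Using the values from the idb-ui-tap example above:
const devicePoint = toDeviceCoordinates(
  { x: 256, y: 512 },
  { scaleX: 1.67, scaleY: 1.66 }
);
// devicePoint is { x: 428, y: 850 } (427.52 and 849.92 before rounding)
```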

View Fingerprinting (Opt-in)

When enableCoordinateCaching is true, computes a structural hash of the view:

  • elementStructureHash: SHA-256 hash of element hierarchy

  • cacheable: Whether view is stable enough to cache coordinates

  • elementCount: Number of elements in hierarchy

  • orientation: Device orientation

Excludes loading states, animations, and dynamic content from caching.
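One way a client might use the fingerprint is sketched below: coordinates are stored only when the view is marked cacheable, keyed by the structural hash. The `ViewFingerprint` fields come from the list above. The `CoordinateCache` class, its methods, and the choice to include orientation in the key are assumptions for illustration, not part of the tool.

```typescript
interface ViewFingerprint {
  elementStructureHash: string;
  cacheable: boolean;
  elementCount: number;
  orientation: string;
}

interface TapTarget {
  x: number;
  y: number;
}

// Hypothetical client-side cache keyed by view fingerprint.
class CoordinateCache {
  private readonly entries = new Map<string, Map<string, TapTarget>>();

  // Assumption: orientation is part of the key, since coordinates
  // change when the device rotates.
  private key(fp: ViewFingerprint): string {
    return `${fp.elementStructureHash}:${fp.orientation}`;
  }

  // Returns false (and stores nothing) when the view is not cacheable.
  store(fp: ViewFingerprint, label: string, target: TapTarget): boolean {
    if (!fp.cacheable) return false;
    const k = this.key(fp);
    const view = this.entries.get(k) ?? new Map<string, TapTarget>();
    view.set(label, target);
    this.entries.set(k, view);
    return true;
  }

  lookup(fp: ViewFingerprint, label: string): TapTarget | undefined {
    return this.entries.get(this.key(fp))?.get(label);
  }
}
```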

Common Use Cases

  1. Visual analysis: LLM-based screenshot analysis with token optimization

  2. UI automation: Detect interactive elements and get tap coordinates

  3. Bug reporting: Capture and transmit screenshots inline

  4. Test documentation: Screenshot with semantic context for test tracking

  5. Coordinate caching: Store element coordinates for repeated interactions

Token Efficiency

Screenshots are optimized for minimal token usage:

  • Default (half): ~170 tokens (50% savings vs full)

  • Full: ~340 tokens (native resolution)

  • Quarter: ~170 tokens (50% savings vs full; still 1 tile, so the same token cost as half)

  • Thumb: ~170 tokens (smallest, for thumbnails)

Token counts are estimates based on Claude's image processing (170 tokens per 512×512 tile).
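The per-tile rule above can be turned into a rough estimator. This sketch assumes the tile count is the number of 512×512 tiles needed to cover the image, rounding up on each axis. That counting rule is an assumption: the document gives only the per-tile cost, and the 2-tile figure listed for full resolution may be derived differently.

```typescript
const TILE_SIZE = 512;
const TOKENS_PER_TILE = 170;

// Assumption: tiles are counted by covering the image with 512x512 squares.
function estimateTiles(width: number, height: number): number {
  return Math.ceil(width / TILE_SIZE) * Math.ceil(height / TILE_SIZE);
}

function estimateTokens(width: number, height: number): number {
  return estimateTiles(width, height) * TOKENS_PER_TILE;
}

// half (256x512), quarter (128x256), and thumb (128x128) each fit in one tile:
const halfTokens = estimateTokens(256, 512); // 170
const quarterTokens = estimateTokens(128, 256); // 170
const thumbTokens = estimateTokens(128, 128); // 170
```

Under this rule the three resized presets cost the same number of tokens, which matches the figures listed above.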

Important Notes

  • Auto-detection: If udid is omitted, uses the currently booted device

  • Temp files: Uses temp directory for processing, auto-cleans up

  • WebP fallback: Attempts WebP compression, falls back to JPEG if unavailable

  • Element extraction: Requires app to be running with accessibility enabled

  • Coordinate accuracy: Transform provides pixel-perfect coordinate mapping

Error Handling

  • Simulator not found: Validates simulator exists in cache

  • Simulator not booted: Indicates simulator must be booted first

  • Capture failure: Reports if screenshot capture fails

  • Optimization failure: Falls back to original if optimization fails

  • Element extraction: Gracefully degrades if accessibility is unavailable
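The "falls back to original" behaviour can be pictured with a small wrapper. This is a sketch of the general pattern only: `optimizeWithFallback` and `OptimizeResult` are hypothetical names, and the tool's real error handling is not shown in this document.

```typescript
interface OptimizeResult<T> {
  image: T;
  optimized: boolean;
  error?: string;
}

// Runs the optimizer and returns the original input unchanged if it throws.
function optimizeWithFallback<T>(
  original: T,
  optimize: (input: T) => T
): OptimizeResult<T> {
  try {
    return { image: optimize(original), optimized: true };
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    return { image: original, optimized: false, error: message };
  }
}
```

A caller can check `optimized` to decide whether the optimization metadata applies to the returned image.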

Next Steps After Screenshot

  1. Analyze visually: LLM processes inline image for visual analysis

  2. Interact with elements: Use coordinates from interactiveElements

  3. Tap elements: Apply coordinate transform if resized, then use simctl-tap

  4. Query specific elements: Use simctl-query-ui for targeted element discovery

  5. Cache coordinates: Store fingerprint for reuse on identical views

Comparison with simctl-io

| Feature | screenshot-inline | simctl-io |
| --- | --- | --- |
| Returns | Base64 inline | File path |
| Optimization | Automatic | Manual |
| Elements | Auto-detected | Not included |
| Transform | Included | Included |
| Use case | MCP responses | File storage |
| Token usage | Optimized | Depends on size |

Input Schema

| Name | Required | Description | Default |
| --- | --- | --- | --- |
| udid | No | | |
| size | No | | |
| appName | No | | |
| screenName | No | | |
| state | No | | |
| enableCoordinateCaching | No | | |

Behavior: 5/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

The description details the automatic optimization process (capture, resize, compress, encode, extract, transform), error handling, token efficiency, coordinate transformation, view fingerprinting, and temp file cleanup. With no annotations provided, the description fully compensates, making behavior highly transparent.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 4/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is extensive but well-structured with clear headers, bullet points, tables, and examples. It is front-loaded with the core purpose. Minor repetition (e.g., token efficiency mentioned twice) but overall concise for the tool's complexity.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given 6 parameters, no output schema, and no annotations, the description covers all aspects: input, output (base64 image, metadata, elements, transform), error handling, use cases, comparison with siblings, and next steps. It is thorough and leaves no significant gaps.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 5/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema description coverage is 0%, but the description explains all 6 parameters, including defaults (e.g., size defaults to 'half'), auto-detection for udid, and semantic context for appName, screenName, state. It also explains the size enum values in detail with pixel dimensions and token savings, adding significant meaning.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description explicitly states 'Capture optimized screenshots with inline base64 encoding for direct MCP response transmission,' providing a specific verb and resource. It distinguishes itself from sibling tools like `simctl-io` through a comparison table, making the purpose clear and unique.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description includes a comparison with `simctl-io` and a 'Common Use Cases' section, offering context for when to use this tool. However, it does not explicitly state when not to use it, though the alternatives are clear. This is strong guidance but not fully explicit on exclusions.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/conorluddy/xc-mcp'

If you have feedback or need assistance with the MCP directory API, please join our Discord server.