screenshot
Capture simulator screenshots and return them as base64-encoded, optimized images with interactive elements and coordinate transforms for UI automation.
Instructions
simctl-screenshot-inline
Capture optimized screenshots with inline base64 encoding for direct MCP response transmission.
What it does
Captures simulator screenshots and returns them as base64-encoded images directly in the MCP response. Automatically optimizes images for token efficiency with tile-aligned resizing and WebP/JPEG compression. Includes interactive element detection and coordinate transforms.
Parameters
udid (string, optional): Simulator UDID (auto-detects booted device if omitted)
size (string, optional): Screenshot size - half, full, quarter, thumb (default: half)
appName (string, optional): App name for semantic context
screenName (string, optional): Screen/view name for semantic context
state (string, optional): UI state for semantic context
enableCoordinateCaching (boolean, optional): Enable view fingerprinting for coordinate caching
Screenshot Size Optimization
Automatically optimizes screenshots for token efficiency:
half (default): 256×512 pixels, 1 tile, ~170 tokens (50% savings)
full: Native resolution, 2 tiles, ~340 tokens
quarter: 128×256 pixels, 1 tile, ~170 tokens
thumb: 128×128 pixels, 1 tile, ~170 tokens
Automatic Optimization Process
Capture: Screenshot taken at native resolution
Resize: Automatically resized to tile-aligned dimensions (unless size='full')
Compress: Converted to WebP format at 60% quality (falls back to JPEG if unavailable)
Encode: Base64-encoded for inline MCP response transmission
Extract: Interactive elements detected from accessibility tree
Transform: Coordinate mapping provided for resized screenshots
Returns
MCP response with:
Base64-encoded optimized image (inline)
Screenshot optimization metadata (dimensions, tokens, savings)
Interactive elements with coordinates and properties
Coordinate transform for mapping screenshot to device coordinates
View fingerprint (if enableCoordinateCaching is true)
Semantic metadata (if provided)
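For orientation, the response described above might be modeled roughly as follows. This is only a sketch: the names interactiveElements, coordinateTransform, scaleX, scaleY, guidance, elementStructureHash, cacheable, elementCount, and orientation come from this page, while every name marked "assumed" (including the helper needsCoordinateTransform) is hypothetical and may not match the actual response.

```typescript
// Hypothetical response shape. Names marked "assumed" are illustrative only.
interface ElementBounds { x: number; y: number; width: number; height: number }

interface InteractiveElement {
  type: string;        // e.g. "Button", "TextField"
  label?: string;
  identifier?: string;
  bounds: ElementBounds;
  hittable: boolean;   // assumed name for the tappability status
}

interface ScreenshotResponse {
  imageBase64: string; // assumed name for the inline image
  optimization: { width: number; height: number; estimatedTokens: number }; // assumed names
  interactiveElements: InteractiveElement[];
  // Only expected when the screenshot was resized (size other than 'full').
  coordinateTransform?: { scaleX: number; scaleY: number; guidance: string };
  // Only expected when enableCoordinateCaching is true (container name assumed).
  fingerprint?: {
    elementStructureHash: string;
    cacheable: boolean;
    elementCount: number;
    orientation: string;
  };
  // Echo of the semantic parameters, if provided (container name assumed).
  semantic?: { appName?: string; screenName?: string; state?: string };
}

// True when screenshot coordinates must be scaled before tapping.
function needsCoordinateTransform(res: ScreenshotResponse): boolean {
  return res.coordinateTransform !== undefined;
}
```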
Examples
Simple optimized screenshot (256×512)
await simctlScreenshotInlineTool({
  udid: 'device-123'
})
Full resolution screenshot
await simctlScreenshotInlineTool({
  udid: 'device-123',
  size: 'full'
})
Screenshot with semantic context
await simctlScreenshotInlineTool({
  udid: 'device-123',
  appName: 'MyApp',
  screenName: 'LoginScreen',
  state: 'Empty'
})
Screenshot with coordinate caching enabled
await simctlScreenshotInlineTool({
  udid: 'device-123',
  enableCoordinateCaching: true
})
Interactive Element Detection
Automatically extracts interactive elements from the accessibility tree:
Element type (Button, TextField, etc.)
Label and identifier
Bounds (x, y, width, height)
Tappability status
Limited to top 20 elements to avoid token overflow. Elements are filtered to only include those with bounds and hittable status.
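The filtering rule above can be sketched as follows. This is a minimal illustration, not the tool's implementation: the RawElement type and the selectInteractiveElements helper are hypothetical, and "top 20" is assumed here to mean the first 20 matches in accessibility-tree order.

```typescript
interface Bounds { x: number; y: number; width: number; height: number }

// Hypothetical element record as it might come out of the accessibility tree.
interface RawElement {
  type: string;
  label?: string;
  identifier?: string;
  bounds?: Bounds;
  hittable?: boolean;
}

const MAX_ELEMENTS = 20; // limit stated in the documentation

// Keep only elements that have bounds and are hittable, then cap the list.
// Ordering is an assumption: input order is preserved.
function selectInteractiveElements(
  elements: RawElement[],
  limit: number = MAX_ELEMENTS
): RawElement[] {
  return elements
    .filter((el) => el.bounds !== undefined && el.hittable === true)
    .slice(0, limit);
}
```

Capping the list keeps the response size bounded even on screens with many controls; for anything beyond the first 20 elements, a targeted query tool is the better fit.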
Coordinate Transform
When screenshots are resized (size ≠ 'full'), provides automatic coordinate transformation:
Automatic Transformation (Recommended for Agents)
Use the coordinateTransformHelper field in the response with idb-ui-tap:
Identify element coordinates visually from the screenshot
Call idb-ui-tap with applyScreenshotScale: true plus scale factors
The tool automatically transforms screenshot coordinates to device coordinates
Example:
idb-ui-tap {
  x: 256, // Screenshot coordinate
  y: 512, // Screenshot coordinate
  applyScreenshotScale: true,
  screenshotScaleX: 1.67,
  screenshotScaleY: 1.66
}
// Tool automatically calculates: deviceX = 256 * 1.67, deviceY = 512 * 1.66
Manual Transformation (For Reference)
If not using automatic transformation:
scaleX: Multiply screenshot X coordinates by this to get device coordinates
scaleY: Multiply screenshot Y coordinates by this to get device coordinates
coordinateTransform.guidance: Human-readable instructions
Important: Most agents should use the automatic transformation via idb-ui-tap's applyScreenshotScale parameter. Manual calculation is provided for reference only.
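For reference, the manual calculation amounts to a per-axis multiplication. The sketch below assumes the scale factors are the ratio of device size to screenshot size and that results are rounded to whole units; both are assumptions, and the helpers computeScale and toDeviceCoordinates are hypothetical names, not part of the tool.

```typescript
interface Point { x: number; y: number }
interface Size { width: number; height: number }

// Assumed derivation of the scale factors: device size divided by screenshot size.
function computeScale(device: Size, screenshot: Size): { scaleX: number; scaleY: number } {
  return {
    scaleX: device.width / screenshot.width,
    scaleY: device.height / screenshot.height,
  };
}

// Map a point picked on the resized screenshot back to device coordinates.
// Rounding to whole units is an assumption made for this sketch.
function toDeviceCoordinates(p: Point, scaleX: number, scaleY: number): Point {
  return { x: Math.round(p.x * scaleX), y: Math.round(p.y * scaleY) };
}

// Using the scale factors from the idb-ui-tap example above:
const devicePoint = toDeviceCoordinates({ x: 256, y: 512 }, 1.67, 1.66);
console.log(devicePoint); // { x: 428, y: 850 }
```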
View Fingerprinting (Opt-in)
When enableCoordinateCaching is true, computes a structural hash of the view:
elementStructureHash: SHA-256 hash of element hierarchy
cacheable: Whether view is stable enough to cache coordinates
elementCount: Number of elements in hierarchy
orientation: Device orientation
Excludes loading states, animations, and dynamic content from caching.
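One way such a structural hash could be computed is shown below. It is a sketch under stated assumptions, not the tool's algorithm: the ViewElement type and the serialization format are hypothetical, and only element type, identifier, and nesting are hashed so that changing labels or values do not alter the fingerprint. It uses Node's built-in crypto module.

```typescript
import { createHash } from "node:crypto";

// Hypothetical node of the element hierarchy.
interface ViewElement {
  type: string;
  identifier?: string;
  label?: string; // deliberately ignored: treated as dynamic content
  children?: ViewElement[];
}

// Serialize only the structural parts of the tree (type, identifier, nesting).
function structureOf(el: ViewElement): string {
  const kids = (el.children ?? []).map(structureOf).join(",");
  return `${el.type}#${el.identifier ?? ""}[${kids}]`;
}

// SHA-256 over the structural serialization, returned as a hex string.
function elementStructureHash(root: ViewElement): string {
  return createHash("sha256").update(structureOf(root)).digest("hex");
}
```

Two captures of the same screen then produce the same hash even if a label text changes, which is the property a coordinate cache needs as its key.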
Common Use Cases
Visual analysis: LLM-based screenshot analysis with token optimization
UI automation: Detect interactive elements and get tap coordinates
Bug reporting: Capture and transmit screenshots inline
Test documentation: Screenshot with semantic context for test tracking
Coordinate caching: Store element coordinates for repeated interactions
Token Efficiency
Screenshots are optimized for minimal token usage:
Default (half): ~170 tokens (50% savings vs full)
Full: ~340 tokens (native resolution)
Quarter: ~170 tokens (50% savings vs full; still 1 tile, so no fewer tokens than half)
Thumb: ~170 tokens (smallest, for thumbnails)
Token counts are estimates based on Claude's image processing (170 tokens per 512×512 tile).
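The per-tile estimate can be reproduced with a small calculation. This sketch simply applies the 170-tokens-per-512×512-tile figure quoted above to the resized presets; estimateImageTokens is a hypothetical helper, and the model's actual image accounting may differ.

```typescript
const TILE_SIZE = 512;       // tile edge length in pixels, from the note above
const TOKENS_PER_TILE = 170; // estimate quoted in the documentation

// Count the 512×512 tiles needed to cover the image, then multiply.
function estimateImageTokens(width: number, height: number): number {
  const tiles = Math.ceil(width / TILE_SIZE) * Math.ceil(height / TILE_SIZE);
  return tiles * TOKENS_PER_TILE;
}

console.log(estimateImageTokens(256, 512)); // half: 170
console.log(estimateImageTokens(128, 256)); // quarter: 170
console.log(estimateImageTokens(128, 128)); // thumb: 170
```

Because half, quarter, and thumb all fit in a single tile, they cost the same number of tokens under this estimate; the smaller presets reduce payload size but not the token count.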
Important Notes
Auto-detection: If udid is omitted, uses the currently booted device
Temp files: Uses temp directory for processing, auto-cleans up
WebP fallback: Attempts WebP compression, falls back to JPEG if unavailable
Element extraction: Requires app to be running with accessibility enabled
Coordinate accuracy: Transform provides pixel-perfect coordinate mapping
Error Handling
Simulator not found: Validates simulator exists in cache
Simulator not booted: Indicates simulator must be booted first
Capture failure: Reports if screenshot capture fails
Optimization failure: Falls back to original if optimization fails
Element extraction: Gracefully degrades if accessibility is unavailable
Next Steps After Screenshot
Analyze visually: LLM processes inline image for visual analysis
Interact with elements: Use coordinates from interactiveElements
Tap elements: Apply coordinate transform if resized, then use simctl-tap
Query specific elements: Use simctl-query-ui for targeted element discovery
Cache coordinates: Store fingerprint for reuse on identical views
Comparison with simctl-io
| Feature | screenshot-inline | simctl-io |
|---|---|---|
| Returns | Base64 inline | File path |
| Optimization | Automatic | Manual |
| Elements | Auto-detected | Not included |
| Transform | Included | Included |
| Use case | MCP responses | File storage |
| Token usage | Optimized | Depends on size |
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
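| udid | No | Simulator UDID (auto-detects booted device if omitted) | |
| size | No | Screenshot size: half, full, quarter, or thumb | half |
| appName | No | App name for semantic context | |
| screenName | No | Screen/view name for semantic context | |
| state | No | UI state for semantic context | |
| enableCoordinateCaching | No | Enable view fingerprinting for coordinate caching | |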