screenshot
Capture desktop, window, or region screenshots with four detail modes: metadata for orientation, text for actionable elements, pixel-accurate images, or diff mode for changed windows.
Instructions
Purpose: Capture desktop, window, or region state across four output modes, from cheap orientation metadata to pixel-accurate images.

Details: detail='meta' (the default when dotByDot, region, and displayId are all omitted) returns window titles and positions only (~20 tok/window, no image). detail='text' returns UIA actionable elements with clickAt coords, no image (~100-300 tok/window). detail='image' is server-blocked unless confirmImage=true is also passed. dotByDot=true returns a 1:1 pixel WebP; compute screen coords as screen_x = origin_x + image_x, or as screen_x = origin_x + image_x / scale when dotByDotMaxDimension is set (the scale is printed in the response). diffMode=true returns only changed windows after the first call (~160 tok).

Data reduction: grayscale=true (about 50% smaller), dotByDotMaxDimension=1280 (caps the longest edge), windowTitle+region (sub-crop to exclude browser chrome, e.g. region={x:0, y:120, width:1920, height:900}).

Prefer: Use meta to orient, text before clicking, and dotByDot only when precise pixel coords are needed. Prefer browser_* tools for Chrome. Use diffMode after actions to confirm that state changed. Only use image+confirmImage when text returned 0 actionable elements and visual inspection is genuinely required.

Caveats: Without dotByDot, images are scaled to maxDimension=768 by default, so image pixels ≠ screen pixels; apply the scale formula before passing coordinates to mouse_click. diffMode requires a prior full-capture baseline (a non-diff call or workspace_snapshot); calling diffMode cold returns a full frame, not a diff.

Examples:

screenshot() → meta orientation of all windows

screenshot({detail:'text', windowTitle:'Notepad'}) → clickable elements with coords

screenshot({dotByDot:true, dotByDotMaxDimension:1280, grayscale:true, windowTitle:'Chrome', region:{x:0,y:120,width:1920,height:900}}) → pixel-accurate Chrome content
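
The coordinate formula above can be sketched as a small helper. This is a minimal illustration, not part of the tool: `imageToScreen` is a hypothetical name, the y axis is assumed to follow the same rule as the documented x formula, and rounding to whole pixels is an assumption about what mouse_click expects.

```javascript
// Sketch only: imageToScreen is a hypothetical helper, not part of the tool.
// origin: the 'origin: (x,y)' reported for a window capture.
// scale:  the scale factor printed in the response; 1 when dotByDot is used
//         without dotByDotMaxDimension.
function imageToScreen(origin, imagePoint, scale = 1) {
  if (!(scale > 0)) {
    throw new RangeError("scale must be a positive number");
  }
  return {
    // Assumption: screen coordinates are rounded to whole pixels.
    x: Math.round(origin.x + imagePoint.x / scale),
    y: Math.round(origin.y + imagePoint.y / scale),
  };
}

// Unscaled dotByDot capture: image pixels map 1:1 onto screen pixels.
const direct = imageToScreen({ x: 200, y: 100 }, { x: 50, y: 30 });
// direct is { x: 250, y: 130 }

// Capture downscaled by half (scale = 0.5): image offsets are doubled.
const scaled = imageToScreen({ x: 200, y: 100 }, { x: 320, y: 180 }, 0.5);
// scaled is { x: 840, y: 460 }
```

Taking the scale from the response, rather than recomputing it from the window size, keeps the conversion correct even if the server rounds the downscaled dimensions.
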
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| windowTitle | No | Capture only the window whose title contains this string. Prefer over full-screen when target window is known. | |
| displayId | No | Capture a specific monitor (0 = primary). Use get_screen_info to list displays. | |
| region | No | Capture only this sub-region. Without windowTitle: virtual screen coordinates. With windowTitle: window-local coordinates — useful to exclude browser chrome (tabs/address bar). Example: windowTitle='Chrome', region={x:0, y:120, width:1920, height:900} skips the 120px browser chrome. | |
| maxDimension | No | Max width or height in pixels. Use 1280 to read small text, code, or fine UI details. Ignored when dotByDot=true. | 768 |
| dotByDot | No | 1:1 pixel mode — no scaling, WebP compression. Window captures include 'origin: (x,y)' so you can compute screen position: screen_x = origin_x + image_x. When dotByDotMaxDimension is also set, scale factor is included: screen_x = origin_x + image_x / scale. | |
| dotByDotMaxDimension | No | Cap the longest edge (pixels) when dotByDot=true. Reduces payload while preserving coordinate math. Example: 1280 on a 1920×1080 screen → scale≈0.667. Response includes scale factor: screen_x = origin_x + image_x / scale. Recommended for Chrome: dotByDot=true, dotByDotMaxDimension=1280, grayscale=true. | |
| grayscale | No | Convert to grayscale before encoding. Reduces file size ~50% for text-heavy content (e.g. AWS console, code editors). Avoid when color is meaningful (charts, status indicators). | |
| webpQuality | No | WebP quality when dotByDot=true or diffMode=true. 40=layout only, 60=general, 80=fine text. | 60 |
| diffMode | No | Layer diff mode: compares each window against the buffered previous frame. The first call returns a full I-frame (all windows); subsequent calls return only the changed windows (P-frames). Implicitly enables dotByDot. Best used with windowTitle omitted so that all windows are snapshotted. | |
| detail | No | Response detail level (omit to let the server pick a smart default). Omitted: auto, i.e. 'image' when dotByDot, region, or displayId is specified, else 'meta'. 'meta': window title + screen region only (~20 tok/window, cheapest). 'text': UIA element tree as JSON with text values (~100-300 tok/window, no image). 'image': actual screenshot pixels; BLOCKED unless confirmImage=true is also passed. | |
| confirmImage | No | Must be true to receive image pixels when detail='image'. Without this flag, detail='image' is blocked and a guidance message is returned instead. Prefer detail='text' / diffMode=true / dotByDot=true first — only set confirmImage=true when visual inspection is genuinely required. | |
| ocrFallback | No | OCR fallback behaviour when detail='text'. 'auto' (default): fire Windows OCR if UIA returns 0 actionable elements OR hints.uiaSparse=true (UIA returned <5 elements, typical for Chrome). 'always': always augment actionable[] with OCR words. 'never': disable OCR entirely. | auto |
| ocrLanguage | No | BCP-47 language tag for the OCR engine (e.g. 'ja', 'en-US'). Only used when detail='text'. | ja |
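
The interplay between `detail`, its auto default, and `confirmImage` described in the table can be sketched as follows. `resolveDetail` is a hypothetical helper that only mirrors the documented rules; the server's actual logic may differ. Two assumptions are made here: dotByDot counts as "specified" only when it is true, and only an explicit detail='image' is gated by confirmImage (the table does not say whether an auto-selected 'image' is gated too).

```javascript
// Sketch only: resolveDetail is a hypothetical helper, not a server API.
function resolveDetail(args = {}) {
  // Documented auto rule: 'image' when dotByDot/region/displayId is
  // specified, else 'meta'. Assumption: dotByDot must be true to count.
  const triggersImage =
    args.dotByDot === true ||
    args.region !== undefined ||
    args.displayId !== undefined;
  const detail = args.detail ?? (triggersImage ? "image" : "meta");
  // Documented rule: an explicit detail='image' is blocked unless
  // confirmImage=true is also passed.
  const blocked = args.detail === "image" && args.confirmImage !== true;
  return { detail, blocked };
}

const orient = resolveDetail();                       // detail 'meta'
const monitor = resolveDetail({ displayId: 0 });      // detail 'image' (auto)
const refused = resolveDetail({ detail: "image" });   // blocked
const allowed = resolveDetail({ detail: "image", confirmImage: true });
```

Running a planned call through a check like this before sending it makes it easy to spot a request that would only come back with the guidance message.
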