Skip to main content
Glama

screenshot

Capture desktop windows or regions with configurable detail: window titles, UIA elements, OCR text, or pixel-accurate screenshots in normal or background mode.

Instructions

Purpose: Capture desktop, window, or region across detail levels (meta / text / image / som / ocr) and capture modes (normal / background). Details: detail='meta' (default) returns window titles+positions only (~20 tok/window, no image). detail='text' returns UIA actionable elements with clickAt coords, no image (~100-300 tok). detail='som' returns a Set-of-Marks annotated image plus OCR-detected elements with IDs (bypasses UIA entirely). detail='ocr' returns Windows OCR words with screen-pixel clickAt coords (Phase 4: absorbs former screenshot_ocr — use when UIA is sparse and you want to force OCR unconditionally). detail='image' and detail='som' are server-blocked unless confirmImage=true is also passed. mode='background' captures hidden/minimised/occluded windows via PrintWindow (Phase 4: absorbs former screenshot_background) — pair with windowTitle/hwnd. dotByDot=true returns 1:1 pixel WebP; compute screen coords: screen_x = origin_x + image_x (or screen_x = origin_x + image_x / scale when dotByDotMaxDimension is set — scale printed in response). diffMode=true returns only changed windows after the first call (~160 tok). region={x,y,width,height} captures a sub-rectangle (Phase 4: absorbs former scope_element when paired with windowTitle/hwnd — discover element bounds via desktop_discover, then pass region here). Data reduction: grayscale=true (−50%), dotByDotMaxDimension=1280 (caps longest edge), windowTitle+region (sub-crop to exclude browser chrome — e.g. region={x:0, y:120, width:1920, height:900}). Prefer: Use meta to orient, text before clicking, dotByDot only when precise pixel coords are needed. Use detail='som' for native apps or games that do not expose UIA elements (UIA-Blind). Use detail='ocr' for OCR-only (skip UIA entirely). Use mode='background' when the target window must stay hidden or cannot be brought to foreground. Prefer browser_* tools for Chrome. Use diffMode after actions to confirm state changed. Only use image+confirmImage when text returned 0 actionable elements and visual inspection is genuinely required. Caveats: Default mode scales to maxDimension=768 — image pixels ≠ screen pixels; apply the scale formula before passing to mouse_click. Foreground detail='image' is always blocked without confirmImage=true. diffMode requires a prior full-capture baseline (non-diff call or workspace_snapshot) — calling diffMode cold returns a full frame, not a diff. mode='background' requires windowTitle or hwnd, and only composes with detail in {'image','meta'} — detail='text'/'som'/'ocr' run only against foreground capture (the dispatcher rejects the conflicting combination). Passing mode='background' is itself the acknowledgement that image pixels are wanted, so confirmImage is NOT required for it (matches the former screenshot_background contract). fullContent=false enables legacy mode (faster but GPU windows may be black). detail='ocr' requires windowTitle or hwnd; first call may take ~1s (WinRT cold-start) and the matching OCR language pack must be installed. Examples: screenshot() → meta orientation of all windows screenshot({detail:'text', windowTitle:'Notepad'}) → clickable elements with coords screenshot({detail:'ocr', windowTitle:'PDF', ocrLanguage:'ja'}) → OCR words with screen-pixel coords screenshot({mode:'background', windowTitle:'Chrome', dotByDot:true, dotByDotMaxDimension:1280, grayscale:true}) → background-capture pixel-accurate Chrome screenshot({windowTitle:'Notepad', region:{x:0,y:120,width:600,height:400}}) → cropped sub-region (zoom into element after desktop_discover)

Input Schema

TableJSON Schema
NameRequiredDescriptionDefault
windowTitleNoCapture only the window whose title contains this string. Use '@active' for the current foreground window. Prefer over full-screen when target window is known.
hwndNoDirect window handle ID (takes precedence over windowTitle). Obtain from desktop_discover (windows[].hwnd). String type to avoid 64-bit precision issues.
displayIdNoCapture a specific monitor (0 = primary). Use desktop_state({includeScreen:true}) to list displays.
regionNoCapture only this sub-region. Without windowTitle: virtual screen coordinates. With windowTitle: window-local coordinates — useful to exclude browser chrome (tabs/address bar). Example: windowTitle='Chrome', region={x:0, y:120, width:1920, height:900} skips the 120px browser chrome.
maxDimensionNoMax width or height in pixels (default 768). Use 1280 to read small text, code, or fine UI details. Ignored when dotByDot=true.
dotByDotNo1:1 pixel mode — no scaling, WebP compression. Window captures include 'origin: (x,y)' so you can compute screen position: screen_x = origin_x + image_x. When dotByDotMaxDimension is also set, scale factor is included: screen_x = origin_x + image_x / scale.
dotByDotMaxDimensionNoCap the longest edge (pixels) when dotByDot=true. Reduces payload while preserving coordinate math. Example: 1280 on a 1920×1080 screen → scale≈0.667. Response includes scale factor: screen_x = origin_x + image_x / scale. Recommended for Chrome: dotByDot=true, dotByDotMaxDimension=1280, grayscale=true.
grayscaleNoConvert to grayscale before encoding. Reduces file size ~50% for text-heavy content (e.g. AWS console, code editors). Avoid when color is meaningful (charts, status indicators).
webpQualityNoWebP quality when dotByDot=true or diffMode=true. 40=layout only, 60=general (default), 80=fine text.
diffModeNoLayer diff mode — compares each window against the buffered previous frame. First call = full I-frame (all windows). Subsequent calls = only changed windows (P-frame). Implicitly enables dotByDot. Best used with windowTitle=undefined to snapshot all windows.
detailNoResponse detail level (omit to let the server pick a smart default): omitted — auto: 'image' when dotByDot/region/displayId is specified, else 'meta' 'meta' — window title + screen region only (~20 tok/window, cheapest) 'text' — UIA element tree as JSON with text values (~100-300 tok/window, no image) 'image' — actual screenshot pixels. BLOCKED unless confirmImage=true is also passed. 'som' — Set-of-Marks image + OCR elements (bypasses UIA entirely). BLOCKED unless confirmImage=true is also passed. 'ocr' — Windows OCR words with screen-pixel clickAt coords (Phase 4: absorbs former screenshot_ocr). Use when UIA returns no actionable elements (WinUI3 custom-drawn UIs, game overlays, PDF viewers). Note: detail='text' auto-falls back to OCR via ocrFallback='auto'; choose detail='ocr' only when forcing OCR unconditionally.
modeNoCapture mode. 'normal' — default. Window-targeted captures (windowTitle / hwnd) use Win32 PrintWindow with automatic BitBlt fallback when PrintWindow returns no data or an all-black frame; the route used is reported in hints.captureSource. Fullscreen / displayId captures use BitBlt. 'background' — explicit Win32 PrintWindow capture, retained for back-compat and explicit selection. Requires windowTitle (or hwnd). Pair with fullContent for GPU-rendered apps.normal
fullContentNoWhen mode='background', use PW_RENDERFULLCONTENT to capture GPU-rendered windows (Chrome, Electron, WinUI3). Default true. Set false for legacy mode (faster but GPU windows may appear black). Ignored unless mode='background'.
confirmImageNoMust be true to receive image pixels when detail='image'. Without this flag, detail='image' is blocked and a guidance message is returned instead. Prefer detail='text' / diffMode=true / dotByDot=true first — only set confirmImage=true when visual inspection is genuinely required.
ocrFallbackNoOCR fallback behaviour when detail='text'. 'auto' (default): fire Windows OCR if UIA returns 0 actionable elements OR hints.uiaSparse=true (UIA returned <5 elements, typical for Chrome). 'always': always augment actionable[] with OCR words. 'never': disable OCR entirely.auto
ocrLanguageNoBCP-47 language tag for the OCR engine (e.g. 'ja', 'en-US'). Used when detail='text' (OCR fallback) or detail='ocr' (direct OCR).ja
preprocessPolicyNoOCR preprocessing scale policy for detail='som' and OCR fallback paths. 'auto' (default): clamp scale to 1 on OOM (>8MP) or high-DPI (≥150%). 'aggressive': relaxes DPI clamp to 175%, preserving upscale on 150%-DPI monitors (e.g. Outlook PWA). Also auto-enables adaptive binarization. 'minimal': always scale=1 regardless of DPI/resolution.auto
preprocessAdaptiveNoWhen true, apply Sauvola adaptive binarization after contrast stretch. Improves recognition of thin text on low-contrast or gradient backgrounds. Automatically enabled when preprocessPolicy='aggressive'. Requires Rust native engine; silently skipped otherwise.
includeNoOptional response-shape opt-in. `['envelope']` returns the self-documenting envelope (`_version` / `data` / `as_of` / `confidence`). `['raw']` forces raw shape (overrides DESKTOP_TOUCH_ENVELOPE=1 server default). Default behaviour is raw shape (compat with existing clients).
Behavior5/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

No annotations are provided, so the description fully covers behavioral traits: diffMode cold start, background mode limitations, scaling formulas, blocked details without confirmImage, and mode interactions. It also explains the coordinate transformation for dotByDot.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

Though long, the description is well-structured with titled sections (Purpose, Details, Prefer, Caveats, Examples) and front-loaded with the main purpose. Every sentence is informative, and no unnecessary repetition exists.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given 19 parameters and no output schema, the description is remarkably complete. It explains return values implicitly via detail levels and examples, covers edge cases, and provides coordinate math and mode restrictions. It leaves no significant gaps.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters5/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema coverage is 100%, but the description adds significant meaning by explaining parameter interactions (e.g., region with windowTitle, dotByDot coordinate math, diffMode baseline requirement). It compensates fully for the lack of an output schema.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool captures desktop, window, or region across multiple detail levels and capture modes. It distinguishes itself from sibling tools like browser_* and mouse_click by specifying its scope (desktop) and alternatives (prefer browser tools for Chrome).

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines5/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description includes explicit 'Prefer' and 'Caveats' sections that detail when to use each detail level and mode, and when to use alternatives (e.g., browser_* tools for Chrome). Examples further clarify common use cases.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

Install Server

Other Tools

Latest Blog Posts

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/Harusame64/desktop-touch-mcp'

If you have feedback or need assistance with the MCP directory API, please join our Discord server