screenshot

Capture desktop, window, or region screenshots with four detail modes: metadata for orientation, text for actionable elements, pixel-accurate images, or diff mode for changed windows.

Instructions

Purpose: Capture desktop, window, or region state across four output modes — from cheap orientation metadata to pixel-accurate images.

Details:
- detail='meta' (default) returns window titles + positions only (~20 tok/window, no image).
- detail='text' returns UIA actionable elements with clickAt coords, no image (~100-300 tok).
- detail='image' is server-blocked unless confirmImage=true is also passed.
- dotByDot=true returns a 1:1 pixel WebP; compute screen coords as screen_x = origin_x + image_x (or screen_x = origin_x + image_x / scale when dotByDotMaxDimension is set — the scale is printed in the response).
- diffMode=true returns only changed windows after the first call (~160 tok).

Data reduction: grayscale=true (−50%), dotByDotMaxDimension=1280 (caps the longest edge), windowTitle + region (sub-crop to exclude browser chrome — e.g. region={x:0, y:120, width:1920, height:900}).

Prefer: Use meta to orient, text before clicking, and dotByDot only when precise pixel coords are needed. Prefer browser_* tools for Chrome. Use diffMode after actions to confirm state changed. Only use image + confirmImage when text returned 0 actionable elements and visual inspection is genuinely required.

Caveats: The default mode scales to maxDimension=768, so image pixels ≠ screen pixels; apply the scale formula before passing coords to mouse_click. detail='image' is always blocked without confirmImage=true. diffMode requires a prior full-capture baseline (a non-diff call or workspace_snapshot) — calling diffMode cold returns a full frame, not a diff.

Examples:
- screenshot() → meta orientation of all windows
- screenshot({detail:'text', windowTitle:'Notepad'}) → clickable elements with coords
- screenshot({dotByDot:true, dotByDotMaxDimension:1280, grayscale:true, windowTitle:'Chrome', region:{x:0,y:120,width:1920,height:900}}) → pixel-accurate Chrome content
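The coordinate formula above (screen_x = origin_x + image_x / scale) can be sketched as a small helper. This is a minimal illustration, not part of the server: the function name toScreenCoords is hypothetical, and origin/scale stand in for the values printed in a dotByDot response.

```typescript
// Convert image-space coords from a dotByDot capture back to absolute
// screen coords, per the formula in the tool description.
// originX/originY come from the 'origin: (x,y)' line in the response;
// scale is 1 unless dotByDotMaxDimension forced a downscale.
function toScreenCoords(
  imageX: number,
  imageY: number,
  originX: number,
  originY: number,
  scale: number = 1,
): { x: number; y: number } {
  return {
    x: Math.round(originX + imageX / scale),
    y: Math.round(originY + imageY / scale),
  };
}

// A 1920-wide capture capped at dotByDotMaxDimension=1280 → scale = 1280/1920 ≈ 0.667,
// so an element at image (640, 300) in a window with origin (100, 50)
// maps to screen (1060, 500) — the coords to pass to mouse_click.
const p = toScreenCoords(640, 300, 100, 50, 1280 / 1920);
```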

Input Schema

windowTitle (optional) — Capture only the window whose title contains this string. Prefer over full-screen capture when the target window is known.
displayId (optional) — Capture a specific monitor (0 = primary). Use get_screen_info to list displays.
region (optional) — Capture only this sub-region. Without windowTitle: virtual screen coordinates. With windowTitle: window-local coordinates — useful to exclude browser chrome (tabs/address bar). Example: windowTitle='Chrome', region={x:0, y:120, width:1920, height:900} skips the 120px browser chrome.
maxDimension (optional) — Max width or height in pixels (default 768). Use 1280 to read small text, code, or fine UI details. Ignored when dotByDot=true.
dotByDot (optional) — 1:1 pixel mode: no scaling, WebP compression. Window captures include 'origin: (x,y)' so you can compute screen position: screen_x = origin_x + image_x. When dotByDotMaxDimension is also set, a scale factor is included: screen_x = origin_x + image_x / scale.
dotByDotMaxDimension (optional) — Cap the longest edge (pixels) when dotByDot=true. Reduces payload while preserving coordinate math. Example: 1280 on a 1920×1080 screen → scale ≈ 0.667. The response includes the scale factor: screen_x = origin_x + image_x / scale. Recommended for Chrome: dotByDot=true, dotByDotMaxDimension=1280, grayscale=true.
grayscale (optional) — Convert to grayscale before encoding. Reduces file size ~50% for text-heavy content (e.g. AWS console, code editors). Avoid when color is meaningful (charts, status indicators).
webpQuality (optional) — WebP quality when dotByDot=true or diffMode=true. 40 = layout only, 60 = general (default), 80 = fine text.
diffMode (optional) — Layer diff mode: compares each window against the buffered previous frame. First call = full I-frame (all windows); subsequent calls = only changed windows (P-frames). Implicitly enables dotByDot. Best used with windowTitle=undefined to snapshot all windows.
detail (optional) — Response detail level (omit to let the server pick a smart default). Omitted — auto: 'image' when dotByDot/region/displayId is specified, else 'meta'. 'meta' — window title + screen region only (~20 tok/window, cheapest). 'text' — UIA element tree as JSON with text values (~100-300 tok/window, no image). 'image' — actual screenshot pixels; BLOCKED unless confirmImage=true is also passed.
confirmImage (optional) — Must be true to receive image pixels when detail='image'. Without this flag, detail='image' is blocked and a guidance message is returned instead. Prefer detail='text' / diffMode=true / dotByDot=true first; only set confirmImage=true when visual inspection is genuinely required.
ocrFallback (optional) — OCR fallback behaviour when detail='text'. 'auto' (default): fire Windows OCR if UIA returns 0 actionable elements OR hints.uiaSparse=true (UIA returned <5 elements, typical for Chrome). 'always': always augment actionable[] with OCR words. 'never': disable OCR entirely. Default: auto.
ocrLanguage (optional) — BCP-47 language tag for the OCR engine (e.g. 'ja', 'en-US'). Only used when detail='text'. Default: ja.
Behavior: 5/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

With no annotations provided, the description carries the full burden of behavioral disclosure and does so comprehensively. It explains blocking behavior ('detail="image" is server-blocked unless confirmImage=true'), data reduction techniques, diffMode requirements, coordinate calculation formulas, token costs for different modes, and practical constraints like the need for a baseline before diffMode works. This goes well beyond basic functionality.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 4/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is well-structured with clear sections (Purpose, Details, Prefer, Caveats, Examples) and every sentence adds value. While comprehensive, it's efficiently organized with no redundant information. The front-loaded purpose statement immediately communicates the tool's function, though the detailed explanations that follow are necessary given the tool's complexity.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

For a complex 13-parameter tool with no annotations and no output schema, the description provides exceptional completeness. It covers purpose, usage guidelines, behavioral characteristics, parameter interactions, practical examples, and caveats. The examples demonstrate real-world usage patterns, and the caveats section addresses potential pitfalls. This is exactly what an agent needs to use this sophisticated tool effectively.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 4/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Despite 100% schema description coverage, the description adds significant value by explaining parameter interactions and practical usage patterns. It clarifies how parameters work together (e.g., 'dotByDot=true returns 1:1 pixel WebP; compute screen coords'), provides recommended combinations ('Recommended for Chrome: dotByDot=true, dotByDotMaxDimension:1280, grayscale=true'), and explains default behaviors ('Default mode scales to maxDimension=768'). However, it doesn't fully document all 13 parameters individually.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description starts with 'Purpose: Capture desktop, window, or region state across four output modes': a specific verb ('capture') and resource ('desktop, window, or region state') that clearly distinguishes it from sibling tools like browser_* tools or get_screen_info. It immediately establishes the tool's core function.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 5/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description provides explicit guidance on when to use this tool vs alternatives: 'Prefer browser_* tools for Chrome', 'Use meta to orient, text before clicking, dotByDot only when precise pixel coords are needed', 'Use diffMode after actions to confirm state changed', and 'Only use image+confirmImage when text returned 0 actionable elements and visual inspection is genuinely required'. This covers both when-to-use and when-not-to-use scenarios.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/Harusame64/desktop-touch-mcp'