Skip to main content
Glama

Server Configuration

Describes the environment variables required to run the server.

NameRequiredDescriptionDefault
DESKTOP_TOUCH_DOCK_PINNoAlways-on-top toggle: 'true' or 'false'true
DESKTOP_TOUCH_ALLOWLISTNoPath to an allowlist file for workspace_launch; overrides default locations
DESKTOP_TOUCH_DOCK_TITLENoTitle of the window to dock on startup; use '@parent' to auto-dock the terminal hosting Claude via process-tree walker
DESKTOP_TOUCH_DOCK_WIDTHNoWidth in pixels (e.g., '480') or as a percentage of work area (e.g., '25%')480
DESKTOP_TOUCH_DOCK_CORNERNoCorner to dock the window: 'top-left', 'top-right', 'bottom-left', or 'bottom-right'bottom-right
DESKTOP_TOUCH_DOCK_HEIGHTNoHeight in pixels (e.g., '360') or as a percentage of work area (e.g., '25%')360
DESKTOP_TOUCH_DOCK_MARGINNoScreen-edge padding in pixels8
DESKTOP_TOUCH_MOUSE_SPEEDNoDefault mouse movement speed in pixels per second; set to '0' for instant teleport1500
DESKTOP_TOUCH_DOCK_MONITORNoMonitor ID from get_screen_info; default is primary monitorprimary
DESKTOP_TOUCH_DOCK_SCALE_DPINoIf 'true', multiply pixel values by dpi/96 for per-monitor scalingfalse
DESKTOP_TOUCH_DOCK_TIMEOUT_MSNoMaximum wait time in milliseconds for the target window to appear5000

Capabilities

Features and capabilities supported by this server

CapabilityDetails
tools
{
  "listChanged": false
}

Tools

Functions exposed to the LLM to take actions

NameDescription
browser_click_elementA

Find a DOM element by CSS selector and click it (combines browser_find_element + mouse_click in one step). Prefer over mouse_click for Chrome — selector-based clicking is stable across repaints. Pass lensId (from perception_register) to verify tab identity and readyState before clicking and receive post.perception state feedback without a screenshot. Caveats: Fails if the element is outside the visible viewport — scroll it into view with browser_eval("document.querySelector('sel').scrollIntoView()") first.

browser_connectA

Connect to Chrome/Edge running with --remote-debugging-port and return open tab IDs — required before all other browser_* tools. Launch with browser_launch() or manually: chrome.exe --remote-debugging-port=9222 --user-data-dir=C:\tmp\cdp. Returns tabs[] with id, url, title — pass tabId to browser_* tools to target a specific tab. Caveats: CDP connection is per-process; if Chrome restarts, call browser_connect again to get fresh tab IDs.

browser_disconnectA

Close cached CDP WebSocket sessions for a port. Call when browser interaction is complete to release connections.

browser_evalA

Evaluate a JavaScript expression in a browser tab and return the result. Use for reading page state, scrolling, or triggering simple DOM mutations. Pass lensId (from perception_register) to verify tab identity and readyState before evaluating and receive post.perception state feedback without a screenshot. Caveats: Returns JSON-serializable values only — DOM nodes cannot be returned directly. React / Vue / Svelte controlled inputs cannot be updated via element.value = ... or native-setter + dispatchEvent — the framework's internal state is not refreshed; use keyboard_type (click the field first, pass windowTitle) for such form fields.

browser_fill_inputA

Fill a form input with a value via CDP — works on React/Vue/Svelte controlled inputs that reject browser_eval value assignment. Use browser_get_interactive or browser_find_element first to obtain a stable selector. Use this over browser_eval when setting a controlled input's value via JS does not update the framework state. Caveats: Requires browser_connect (CDP active). Does not work on contenteditable rich-text editors — use keyboard_type for those. actual in response shows what the element's value property reads after fill; verify it matches the intended value.

browser_find_elementA

Find a DOM element by CSS selector and return its physical screen coordinates — compatible directly with mouse_click. Prefer browser_click_element to find+click in one step. Prefer browser_get_interactive to discover selectors. Caveats: Coordinates are captured at call time; if the page reflows before mouse_click, coords may be stale.

browser_get_app_stateA

Extract embedded SPA framework state (Next.js, Nuxt, Remix, GitHub, Apollo, Redux SSR) in one CDP call. Returns parsed payloads with framework labels. Use BEFORE browser_eval or browser_get_dom on SPAs where rendered HTML is sparse. Pass selectors to target specific window globals (e.g. 'window:MY_STATE'). Caveats: Only extracts SSR-injected state — client-only runtime state requires browser_eval.

browser_get_domA

Return the HTML of a DOM element (or document.body when no selector is given), truncated to maxLength characters. Use when browser_get_interactive is insufficient for inspecting page structure.

browser_get_interactiveA

List all interactive elements (links, buttons, inputs, ARIA controls) on the current page with CSS selectors, visible text or value for inputs, and viewport status — use before browser_click_element to discover stable selectors, and prefer this over screenshot when verifying button/toggle state after submission (no image tokens, structured output). scope limits to a CSS subsection (e.g. '.sidebar'). Returns state (checked/pressed/selected/expanded) for ARIA custom controls. Caveats: Selectors are CDP-generated snapshots — re-call after page navigates or re-renders. Input text reflects the empty-field hint text when defined (takes priority over typed value) — use browser_eval('document.querySelector(sel).value') to read actual typed content.

browser_launchA

Launch Chrome/Edge/Brave in CDP debug mode and wait until the DevTools endpoint is ready. Idempotent — if a CDP endpoint is already live on the target port, returns immediately without spawning. Default: tries chrome → edge → brave (first installed wins), port 9222, userDataDir C:\tmp\cdp. Pass url to open a specific page on launch; follow with browser_connect to get tab IDs. Caveats: A Chrome session started without --remote-debugging-port cannot be taken over — close it first or use a separate profile.

browser_navigateA

Navigate a browser tab to a URL via CDP Page.navigate — more reliable than clicking the address bar (no need to find UI elements). Verify readiness with browser_eval("document.readyState") after calling. Pass lensId (from perception_register) to verify tab identity before navigating and receive post.perception state feedback without a screenshot. Caveats: Does not block until page load completes — follow with wait_until(element_matches) or repeated browser_eval polling for slow pages.

browser_searchA

Grep-like element search across the current page. by: 'text' (literal substring), 'regex', 'role', 'ariaLabel', 'selector' (CSS). Returns results[] sorted by confidence descending — pass results[0].selector to browser_click_element. Pagination via offset/maxResults. Caveats: Use browser_get_interactive for broad discovery; use browser_search when you know specific text or role to target.

click_elementA

Invoke a UI element by name or automationId via UIA InvokePattern — no screen coordinates needed. Prefer over mouse_click for buttons, menu items, and links in native Windows apps. Use get_ui_elements first to discover automationIds. Pass lensId (from perception_register) to run safety guards (identity stable, foreground, modal) before invoking and receive post.perception state feedback without a screenshot. Caveats: Requires the element to expose InvokePattern — some read-only or custom controls do not; fall back to mouse_click in that case.

clipboard_readA

Return the current text content of the Windows clipboard. Use after the user copies something to inspect it, or to retrieve text written by clipboard_write. Caveats: Non-text clipboard payloads (images, files) return an empty string — not an error.

clipboard_writeA

Place text on the Windows clipboard. Useful for seeding clipboard content before pasting, or for sharing data between tools without typing. Caveats: Overwrites existing clipboard content; non-text clipboard data (images, files) is not supported.

dock_windowA

Purpose: Snap a window to a screen corner at a fixed small size and pin it always-on-top — primarily to keep Claude CLI visible while operating other apps full-screen. Details: Accepts corner ('bottom-right' default), width/height (480×360 default, clamped to monitor work area), pin (true default = always-on-top), margin (8px default gap from screen edges, avoids taskbar overlap), and monitorId (see get_screen_info for IDs). Minimized windows are automatically restored before docking. Prefer: Use pin_window alone when you only need always-on-top without moving or resizing. Use dock_window when you need corner placement + resize + pin in one step. Caveats: Overrides any existing Win+Arrow snap arrangement. Call unpin_window explicitly to release always-on-top when the docked window is no longer needed in front.

events_listB

Return all active subscription IDs.

events_pollA

Drain buffered events for a subscription. Pass sinceMs to filter to events newer than that timestamp (incremental polling).

events_subscribeA

Purpose: Subscribe to window-state change events (appear/disappear/focus) for continuous monitoring without repeated polling. Details: Returns subscriptionId. Events are buffered internally at 500ms intervals via EnumWindows; buffer holds up to 50 events (oldest dropped on overflow). Call events_poll(subscriptionId, sinceMs: lastEventTs) to drain incrementally; call events_unsubscribe when monitoring is complete. Each buffered event: {type, hwnd, title, timestamp}. Prefer: Use instead of wait_until(window_appears) when you need to monitor multiple events simultaneously or over an extended period. Use wait_until for one-shot, single-condition waiting. Caveats: Events that occurred before subscribe() was called will not appear — buffer starts empty. Poll frequently (every few seconds) during high-frequency window activity to avoid the 50-event overflow. Examples: id = events_subscribe() → poll: events_poll({subscriptionId:id}) → on next poll: events_poll({subscriptionId:id, sinceMs: lastEventTs}) → events_unsubscribe({subscriptionId:id})

events_unsubscribeC

Stop a subscription and free its event buffer.

focus_windowA

Bring a window to the foreground by partial title match (case-insensitive). Use when a tool does not accept a windowTitle param, or when you need to switch focus before a sequence of actions. Use chromeTabUrlContains to activate a specific Chrome/Edge tab by URL substring before focusing — only the active tab's title appears in get_windows. If CDP is unavailable, chromeTabUrlContains is silently skipped — check response.hints.warnings. Returns WindowNotFound if no match exists; call get_windows to see available titles. Caveats: On some apps focus may be immediately stolen back (modal dialogs, UAC prompts) — verify with get_context after focusing.

get_active_windowA

Return the title, hwnd, and bounds of the currently focused window.

get_contextA

Purpose: Query focused window, focused element, cursor position, and page state in one call — far cheaper than any screenshot for confirming current state after an action. Details: Returns focusedWindow (title, hwnd), focusedElement (name, type, value), cursorPos {x,y}, cursorOverElement (name, type), hasModal (boolean), pageState ('ready'|'loading'|'dialog'|'error'). Does NOT enumerate descendants — use screenshot(detail='text') or get_ui_elements for the full clickable element list. Chromium: cursorOverElement is null (UIA sparse); focusedElement may fall back to CDP document.activeElement; hints.focusedElementSource reports which was used ('uia' or 'cdp'). Prefer: Use after keyboard_type or set_element_value to confirm the value landed in the expected field — cheaper than a verification screenshot. Use instead of screenshot(detail='meta') when the question is only "which window/control has focus." Caveats: Cannot detect non-UIA elements (custom-drawn UIs, game overlays). hasModal only detects modal dialogs exposed via UIA — browser alert/confirm dialogs may not appear here.

get_cursor_positionA

Return the current mouse cursor position in virtual screen coordinates.

get_document_stateA

Return current Chrome page state via CDP: url, title, readyState, selection, and scroll position. Far cheaper than browser_get_dom for page orientation.

get_historyA

Return recent action history (ring buffer, last 20 entries) with tool name, argsDigest, post-state, and timestamp. Use to reconstruct context after model interruption or verify a step occurred.

get_screen_infoA

Return all connected display info: resolution, position, DPI scaling, and current cursor position. Use monitorId from this response to target a specific display in dock_window.

get_ui_elementsA

Inspect the raw UIA element tree of a window — returns names, control types, automationIds, bounding rects, and interaction patterns. Each element includes viewportPosition ('in-view'|'above'|'below'|'left'|'right') relative to the window client region — use it to decide whether scroll_to_element is needed before clicking. Prefer screenshot(detail='text') for interactive automation (returns pre-filtered actionable[] with clickAt coords). Use get_ui_elements when you need the unfiltered tree or specific automationIds for click_element. Caveats: Large windows may return hundreds of elements — scope with windowTitle. Results are capped at maxElements (default 80, max 200) — increase if the target element is missing.

get_windowsA

List all visible windows with titles, screen positions, Z-order, active state, and virtual desktop membership. zOrder=0 is frontmost; isActive=true is the keyboard-focused window; isOnCurrentDesktop=false means the window is on another virtual desktop and cannot be interacted with without switching. Use before screenshot to determine whether a specific window needs capturing. Caveats: Returns only top-level visible windows — child windows and system tray items are excluded.

keyboard_pressA

Press a key or key combination (e.g. 'ctrl+c', 'alt+tab', 'ctrl+shift+s', 'f5', 'escape', 'f1'–'f12'). Pass windowTitle to auto-focus before pressing — eliminates a separate focus_window call. Pass lensId (from perception_register, window lens only) to run safety guards (identity stable, foreground, modal) before pressing and receive post.perception state feedback without a screenshot. Caveats: Omitting windowTitle sends keystrokes to the currently active window — if focus may have shifted since your last observation, pass windowTitle explicitly. win+r, win+x, win+s, win+l are blocked for security. narrate:'rich' adds UIA state feedback for state-transitioning keys (Enter, Tab, Esc, F-keys) only.

keyboard_typeA

Type a string into the focused window. Pass windowTitle to auto-focus the target before typing and enable focus-loss detection (focusLost in response) — eliminates a separate focus_window call. Pass replaceAll:true to Ctrl+A before typing (replace existing content in one call). Prefer set_element_value for form fields. Pass lensId (from perception_register, window lens only) to run safety guards (identity stable, foreground, modal) before typing and receive post.perception state feedback without a screenshot. Caveats: Omitting windowTitle types into whatever window is currently active — if focus may have shifted since your last get_context, pass windowTitle explicitly. Does not handle IME composition for CJK — use use_clipboard=true or set_element_value instead. Text containing em-dash (—), en-dash (–), smart quotes, or other non-ASCII punctuation is automatically rerouted via clipboard (method:'clipboard-auto') to prevent Chrome/Edge from intercepting keystrokes as keyboard accelerators (e.g. address bar hijack). Pass forceKeystrokes:true to disable this auto-upgrade.

mouse_clickA

Click at screen-absolute coordinates (virtual screen pixels), or pass origin+scale from a dotByDot=true screenshot response to let the server convert image-local coords automatically: screen = origin + (x,y) / (scale ?? 1). doubleClick:true for double-click; tripleClick:true for triple-click (selects a full line of text) — if both are set, tripleClick wins. windowTitle optionally focuses the window first (for pinned-dock setups). Prefer click_element (UIA) for stable text-addressed clicking in native apps. Prefer browser_click_element for Chrome. Use mouse_click only when pixel coords are the only available option. Pass lensId (from perception_register) to run safety guards (identity stable, foreground, coordinates in rect) before clicking and receive post.perception state feedback without a screenshot. Caveats: origin+scale are meaningful ONLY with dotByDot=true screenshot responses — applying them to scaled detail='text'/'meta' output lands clicks in the wrong positions.

mouse_dragA

Click and drag from (startX, startY) to (endX, endY) holding the left mouse button — for sliders, drag-and-drop, canvas drawing, and window resizing. windowTitle optionally focuses before drag. Pass lensId (from perception_register) to run safety guards (identity stable, foreground, coordinates in rect) before dragging and receive post.perception state feedback without a screenshot. Caveats: Left button only; does not support right-drag or middle-drag.

mouse_moveA

Move the cursor to coordinates without clicking — for hover-only effects such as revealing tooltips or triggering hover states. Use mouse_click for click targets (it moves and clicks in one call).

notification_showA

Show a Windows system tray balloon notification to alert the user. Use at the end of a long-running task so the user knows it finished without watching the screen. Caveats: Focus Assist (Do Not Disturb) mode suppresses balloon tips; the tool still returns ok:true in that case. Uses System.Windows.Forms — no external modules needed.

perception_forgetA

Deregister a perception lens and release its tracking resources. Use when a workflow is complete, when attention is identity_changed, or before re-registering a target that was closed, restarted, or replaced. Removes the lens from guard evaluation, resource listings, and sensor subscriptions. Returns whether a lens was removed.

perception_listA

List all active perception lenses. Use when you need to find an existing lensId, verify which windows or browser tabs are being tracked, or clean up stale lenses before starting a new workflow. Returns lensId, name, target kind, guardPolicy, salience, attention, and registration metadata.

perception_readA

Force-refresh a registered perception lens and return a full perception envelope. Use when post.perception.attention is dirty, stale, settling, guard_failed, or identity_changed, or when you need fresh structured state before the next action. Returns attention, guard results, latest target/browser state, changed fields, and suggested recovery actions. Prefer this over screenshot/get_context when a lens already exists.

perception_registerA

Purpose: Register a perception lens, a lightweight live state tracker for one window or browser tab. Use it before repeated actions so later tool calls can verify target identity, focus, readiness, modal obstruction, and click safety without taking another screenshot. Details: Returns a lensId that can be passed to action tools such as keyboard_type, keyboard_press, mouse_click, browser_click_element, and browser_navigate. When a tool receives lensId, desktop-touch refreshes the tracked state, evaluates safety guards, and attaches a compact post.perception envelope to the response. The envelope reports attention, guard status, recent changes, and the latest known target state, reducing get_context/screenshot round trips. Prefer: Use for multi-step workflows on the same app window or browser tab, especially before typing, clicking coordinates, navigating browser tabs, or acting after focus may have changed. It is most useful when mistakes would be costly, such as typing into the wrong window or clicking stale coordinates. Caveats: A lens is not a visual recognition model. It tracks structured state from Win32, CDP, and optional UIA sensors. safe.clickCoordinates checks window bounds, not pixel-level occlusion. browserTab lenses require Chrome/Edge with --remote-debugging-port=9222. If attention is dirty, stale, settling, guard_failed, or identity_changed, follow the suggested action before continuing. Maximum 16 active lenses are kept; old lenses may be evicted. Examples: perception_register({name:'editor', target:{kind:'window', match:{titleIncludes:'Visual Studio Code'}}}) → {lensId:'perc-1'} keyboard_type({windowTitle:'Visual Studio Code', text:'hello', lensId:'perc-1'}) → response includes post.perception perception_read({lensId:'perc-1'}) → force a fresh envelope when attention is dirty/stale perception_forget({lensId:'perc-1'}) → release tracking when the workflow is done

pin_windowA

Make a window always-on-top until unpin_window is called (or duration_ms elapses). Useful in run_macro sequences: pin_window → interact → unpin_window. Caveats: Pin state survives window minimize/restore; call unpin_window explicitly to release.

run_macroA

Purpose: Execute multiple tools sequentially in one MCP call — eliminates round-trip latency for predictable multi-step workflows. Details: steps[] is an array of {tool, params} objects. Accepts all desktop-touch tools plus a special sleep pseudo-step: {tool:"sleep", params:{ms:N}} (max 10000ms per step). stop_on_error=true (default) halts on first failure. Max 50 steps. The LLM cannot inspect intermediate results during execution — all steps run to completion (or first error) before any output is returned. Prefer: Use for predictable fixed sequences (focus → sleep → type → screenshot). Do not use for conditional logic — return to the LLM between branches so it can inspect intermediate state. Caveats: If any step may fail conditionally (e.g. a dialog that may or may not appear), split the macro at that point. Each screenshot step within a macro incurs the same token cost as a standalone call. Examples: [{tool:'focus_window',params:{windowTitle:'Notepad'}},{tool:'sleep',params:{ms:300}},{tool:'keyboard_type',params:{text:'Hello'}},{tool:'screenshot',params:{detail:'text',windowTitle:'Notepad'}}] [{tool:'browser_navigate',params:{url:'https://example.com'}},{tool:'wait_until',params:{condition:'element_matches',target:{by:'text',pattern:'Example Domain'}}}]

scope_elementA

Return a high-resolution screenshot of a specific element's region plus its child element tree. Requires UIA — works with native apps, Chrome/Edge, VS Code. Use get_ui_elements first to discover element names or automationIds. At least one of name, automationId, or controlType must be provided.

screenshotA

Purpose: Capture desktop, window, or region state across four output modes — from cheap orientation metadata to pixel-accurate images. Details: detail='meta' (default) returns window titles+positions only (~20 tok/window, no image). detail='text' returns UIA actionable elements with clickAt coords, no image (~100-300 tok). detail='image' is server-blocked unless confirmImage=true is also passed. dotByDot=true returns 1:1 pixel WebP; compute screen coords: screen_x = origin_x + image_x (or screen_x = origin_x + image_x / scale when dotByDotMaxDimension is set — scale printed in response). diffMode=true returns only changed windows after the first call (~160 tok). Data reduction: grayscale=true (−50%), dotByDotMaxDimension=1280 (caps longest edge), windowTitle+region (sub-crop to exclude browser chrome — e.g. region={x:0, y:120, width:1920, height:900}). Prefer: Use meta to orient, text before clicking, dotByDot only when precise pixel coords are needed. Prefer browser_* tools for Chrome. Use diffMode after actions to confirm state changed. Only use image+confirmImage when text returned 0 actionable elements and visual inspection is genuinely required. Caveats: Default mode scales to maxDimension=768 — image pixels ≠ screen pixels; apply the scale formula before passing to mouse_click. detail='image' is always blocked without confirmImage=true. diffMode requires a prior full-capture baseline (non-diff call or workspace_snapshot) — calling diffMode cold returns a full frame, not a diff. Examples: screenshot() → meta orientation of all windows screenshot({detail:'text', windowTitle:'Notepad'}) → clickable elements with coords screenshot({dotByDot:true, dotByDotMaxDimension:1280, grayscale:true, windowTitle:'Chrome', region:{x:0,y:120,width:1920,height:900}}) → pixel-accurate Chrome content

screenshot_backgroundA

Purpose: Capture a window that is hidden, minimized, or behind other windows using Win32 PrintWindow API. Details: Uses PW_RENDERFULLCONTENT (fullContent=true, default) for GPU-rendered content in Chrome, Electron, and WinUI3 apps. Supports same detail and dotByDot modes as screenshot. Default mode scales to maxDimension=768; dotByDot=true gives 1:1 WebP with origin in response — compute screen coords: screen_x = origin_x + image_x. grayscale=true reduces size ~50%. dotByDotMaxDimension caps resolution; response includes scale (screen_x = origin_x + image_x / scale). Prefer: Prefer screenshot(windowTitle=X) for visible windows (faster, no API overhead). Use screenshot_background when the window must stay hidden or cannot be brought to foreground. Caveats: Default (scaled) mode: image pixels ≠ screen pixels — always use dotByDot=true + origin for mouse_click coords. Set fullContent=false for legacy or game windows where GPU rendering causes 1-3s delay or black capture. Some DX12 games may not capture correctly even with fullContent=true.

screenshot_ocrA

Run Windows OCR on a window and return word-level text with screen-pixel clickAt coordinates — use when UIA returns no actionable elements (WinUI3 custom-drawn UIs, game overlays, PDF viewers). Note: screenshot(detail='text') auto-falls back to OCR when UIA is sparse (ocrFallback='auto' default) — call screenshot_ocr directly only when forcing OCR unconditionally. language: BCP-47 tag (default 'ja'). Caveats: First call may take ~1s (WinRT cold-start). Requires the matching Windows OCR language pack installed.

scrollC

Scroll at specified coordinates (or current cursor position). direction: 'up'|'down'|'left'|'right'. amount: scroll clicks (default 3).

scroll_captureA

Purpose: Scroll a window top-to-bottom (or left-to-right) and stitch all frames into one image — for full-length webpages or documents that exceed a single screenshot. Details: Output is capped at ~700KB raw (MCP base64 encoding inflates to ~933KB, approaching the 1MB message limit); when sizeReduced=true appears in the response, iterative WebP downscale was applied (up to 3 passes at 0.75× each) — reduce maxScrolls or add grayscale=true to avoid truncation. Focuses the target window, scrolls to Ctrl+Home, then captures frames via Page Down until identical consecutive frames are detected or maxScrolls is reached. Pixel-overlap detection eliminates seam duplication; check response overlapMode — 'mixed-with-failures' means some seams may have duplicate rows. Prefer: Use only when the goal is whole-page overview of content too long for one screenshot. For partial verification or locating a specific section, prefer scroll + screenshot(detail='text') — you get actionable[] with coords and pay only per-viewport token cost. scroll_capture returns a stitched image (not clickable elements) that stays expensive in tokens regardless of the 1MB guard. Caveats: When sizeReduced=true, stitched image pixels do NOT match screen coords — use for reading only, not for mouse_click. When overlapMode='mixed-with-failures', expect occasional duplicate content rows near frame boundaries. Increase scrollDelayMs for pages with animations or lazy-loaded images.

scroll_to_elementA

Purpose: Scroll a named element into the visible viewport without manually computing scroll amounts. Details: Two paths: (1) Chrome/Edge (CDP): provide selector — calls el.scrollIntoView({block, behavior:'instant'}) via CDP. Uses instant (not smooth) scroll so coords stabilize immediately. block controls vertical alignment (start/center/end/nearest, default: center). (2) Native apps (UIA): provide name + windowTitle — calls ScrollItemPattern.ScrollIntoView(). Returns scrolled:true on success, scrolled:false if the element doesn't expose ScrollItemPattern (fall back to scroll + screenshot). Prefer: Use over scroll + screenshot loops when you know the element name or selector. Pairs well with screenshot(detail='text') to confirm the element is now in-view (viewportPosition:'in-view'). For Chrome, browser_get_interactive shows viewportPosition for all elements — use that to decide whether scrolling is needed before calling this tool. Caveats: Chrome path requires browser_connect (CDP active). Native path requires the element to implement UIA ScrollItemPattern — some custom/third-party controls do not; in that case scrolled:false is returned. SPA virtual-scroll lists (React Virtualized, TanStack) may not respond to scrollIntoView. Examples: scroll_to_element({selector: '#create-release-btn'}) — Chrome, scroll to button scroll_to_element({name: 'Create Release', windowTitle: 'Glama'}) — native UIA scroll_to_element({selector: '.submit', block: 'start'}) — align to top of viewport

set_element_valueA

Set the value of a text field or combo box via UIA ValuePattern — more reliable than keyboard_type for programmatic form input. Use narrate:'rich' to confirm the value was applied without a verification screenshot. Pass lensId (from perception_register) to run safety guards (identity stable, foreground, modal) before setting and receive post.perception state feedback without a screenshot. Caveats: Only works for elements that expose ValuePattern; does not work on contenteditable HTML or custom rich-text editors — use keyboard_type for those.

smart_scrollA

Purpose: Scroll any element into the viewport — handles nested scroll layers, virtualised lists, sticky-header occlusion, and image-only fallbacks in a single call. Details: Three paths selected by strategy:'auto' (default): (1) CDP (Chrome/Edge): walks scroll ancestor chain, handles overflow:hidden (expandHidden), virtualised lists (TanStack/data-index bisect), detects sticky headers and compensates. (2) UIA (native Windows apps): uses ScrollPattern.SetScrollPercent on ancestor containers then ScrollItemPattern for final snap. (3) Image: binary-search via Win32 GetScrollInfo (exact ratio) or scrollbar-strip pixel sampling (overlay scrollbars), with dHash verification — detects no-op scrolls (Hamming < 5). All paths emit a unified response: ok, path, attempts, pageRatio (0..1), scrolled (bool), ancestors[], viewportPosition. pageRatio is the normalised vertical position of the element on the full page (0=top, 1=bottom). Set verifyWithHash:true to explicitly check that pixels changed (auto-enabled on image path). Nested scroll: ancestors[] is ordered outer→inner; the tool scrolls outer containers first. Virtual lists: set virtualIndex + virtualTotal for O(log n) bisect (≤6 iterations). Prefer: Use instead of scroll_to_element when: content is virtualised (React Virtualized, TanStack Virtual), multiple scroll containers nest, or scroll_to_element returns scrolled:true but the viewport did not actually move. For a simple single-container non-virtual scroll, scroll_to_element is lighter. Caveats: CDP path requires browser_connect. Cross-origin iframes are not traversed (warning returned). expandHidden mutates live CSS (overflow:auto); previous value is stored in data-dt-prev-overflow and restored on the next smart_scroll call (or after 30 s). Image path cannot determine whether the target element is in-view — viewportPosition is null. Call screenshot(detail='text') afterwards to verify. UIA ScrollPattern may not be available in all native apps — falls through to image path. Examples: smart_scroll({target: '#create-release-btn'}) — CDP, nested container, no virtual list smart_scroll({target: '[data-index]', virtualIndex: 500, virtualTotal: 10000}) — TanStack virtual list smart_scroll({target: 'Create Release', windowTitle: 'File Explorer', strategy: 'uia'}) — native UIA smart_scroll({target: 'readme section', windowTitle: 'MyApp', strategy: 'image', hint: 'below'}) — image binary-search

terminal_readA

Read current text from a terminal window via UIA TextPattern (falls back to OCR). Strips ANSI escape sequences. sinceMarker: pass the marker from a previous response to get only new output (diff mode — cheaper than full read). Caveats: When the underlying process restarts, the marker is invalidated and full text is returned.

terminal_sendA

Send a command to a terminal window (Windows Terminal, conhost, PowerShell, cmd, WSL). Wraps focus_window + keyboard type + Enter. preferClipboard=true (default) uses clipboard paste — IME-safe for CJK text, but overwrites the user's clipboard. restoreFocus=true (default) returns focus to the previously active window after sending. Caveats: If the terminal is busy (previous command still running), text will be injected mid-stream — check terminal_read first or use wait_until(terminal_output_contains) to confirm completion before sending.

unpin_windowA

Remove always-on-top from a window. Reverses pin_window.

wait_untilA

Purpose: Server-side poll for an observable condition — eliminates screenshot-polling loops when waiting for state changes. Details: condition selects what to watch: window_appears/window_disappears (target.windowTitle required), focus_changes (optional target.fromHwnd), element_appears/value_changes (target.windowTitle + target.elementName required, UIA; min 500ms interval), ready_state (target.windowTitle; visible + not minimized), terminal_output_contains (target.windowTitle + target.pattern required [+target.regex:true], needs terminal tools loaded), element_matches (target.by + target.pattern required, needs browser tools loaded). Returns {ok:true, elapsedMs, observed} on success, or WaitTimeout error with suggest hints. timeoutMs default 5000 (max 60000). Prefer: Use instead of run_macro({sleep:N}) + screenshot loops. Use terminal_output_contains to detect CLI command completion. Use element_matches for browser DOM readiness after navigation. Caveats: terminal_output_contains and element_matches require the respective tool modules to be loaded. element_appears/value_changes spawn a UIA process per poll — interval clamped to 500ms minimum. On WaitTimeout, read the suggest[] array in the error for recovery steps. Examples: wait_until({condition:'window_appears', target:{windowTitle:'Save As'}, timeoutMs:10000}) wait_until({condition:'terminal_output_contains', target:{windowTitle:'Terminal', pattern:'$ '}, timeoutMs:30000}) wait_until({condition:'element_matches', target:{by:'text', pattern:'Submit', scope:'#checkout-form'}})

workspace_launchA

Purpose: Launch an application and wait for its new window to appear, returning title, HWND, and PID. Details: Runs the command via ShellExecute, snapshots the window list before launch, then polls until a new HWND appears (compared by HWND, not title). Returns {windowTitle, hwnd, pid, elapsedMs}. Works for localized window titles (e.g. '電卓' for calc.exe) because detection is HWND-based, not title-based. timeoutMs default 10000. detach=true fires without waiting and returns no window info. Prefer: Use instead of run_macro({exec, sleep, get_windows}) combos. Follow with focus_window(windowTitle) to interact with the launched app. Caveats: Single-instance apps that reuse an existing window will not register as a new HWND — call get_windows first to check if the window is already open. detach=true returns immediately with no window title or hwnd. Examples: workspace_launch({command:'notepad.exe'}) → {windowTitle:'', hwnd:'...', pid:...} workspace_launch({command:'calc.exe', timeoutMs:15000})

workspace_snapshotA

Purpose: Orient fully in one call — returns display layouts, all window thumbnails (WebP), and per-window actionable element lists with clickAt coords. Details: uiSummary.actionable[] per window includes: action ('click'|'type'|'expand'|'select'), clickAt {x,y} (pass directly to mouse_click), value (current text for editable fields). Runs parallel internally; latency ≈ max(single screenshot), not N×screenshots. Also resets the diffMode buffer so subsequent screenshot(diffMode=true) returns only changes (P-frame). Prefer: Use at session start or after major workspace changes. Use screenshot(detail='meta') for cheap re-orientation within a session. Use screenshot(detail='text', windowTitle=X) for a single-window update. Caveats: Thumbnails are scaled, not 1:1 — use screenshot(dotByDot=true, windowTitle=X) for pixel-accurate coords on a specific window after snapshot.

Prompts

Interactive templates invoked by user choice

NameDescription

No prompts

Resources

Contextual data attached and managed by the client

NameDescription

No resources

Latest Blog Posts

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/Harusame64/desktop-touch-mcp'

If you have feedback or need assistance with the MCP directory API, please join our Discord server