desktop-touch-mcp
desktop-touch-mcp
Stop pasting screenshots. Let Claude see and control your desktop directly.
An MCP server that gives Claude eyes and hands on Windows — 56 tools covering screenshots, mouse, keyboard, Windows UI Automation, Chrome DevTools Protocol, clipboard, desktop notifications, SmartScroll, and a Reactive Perception Graph for safe multi-step automation, designed from the ground up for LLM efficiency.
Applies MPEG P-frame diffing to window capture: only changed windows are sent after the first frame, cutting token usage by ~60–80% in typical automation loops.
Features
LLM-native design — Built around how LLMs think, not how humans click.
run_macrobatches multiple operations into a single API call;diffModesends only the windows that changed since the last frame. Minimal tokens, minimal round-trips.Reactive Perception Graph — Register a
lensIdfor a window or browser tab, pass it to action tools, and get guard-checkedpost.perceptionfeedback after each action. It reduces repeatedscreenshot/get_contextcalls and prevents wrong-window typing or stale-coordinate clicks.Full CJK support — Uses Win32
GetWindowTextWfor window titles, avoiding nut-js garbling. IME bypass input supported for Japanese/Chinese/Korean environments.3-tier token reduction —
detail="image"(~443 tok) /detail="text"(~100–300 tok) /diffMode=true(~160 tok). Send pixels only when you actually need to see them.1:1 coordinate mode —
dotByDot=truecaptures at native resolution (WebP). Image pixel = screen coordinate — no scale math needed. Withorigin+scalepassed tomouse_click, the server converts coords for you — eliminating off-by-one / scale bugs.Browser capture data reduction —
grayscale=true(~50% size),dotByDotMaxDimension=1280(auto-scaled with coord preservation), andwindowTitle + regionsub-crops help exclude browser chrome and other irrelevant pixels. Typical reduction for heavy captures: 50–70%.Chromium smart fallback —
detail="text"on Chrome/Edge/Brave auto-skips UIA (prohibitively slow there) and runs Windows OCR.hints.chromiumGuard+hints.ocrFallbackFiredflag the path taken.UIA element extraction —
detail="text"returns button names andclickAtcoords as JSON. Claude can click the right element without ever looking at a screenshot.Auto-dock CLI —
dock_windowsnaps any window to a screen corner with always-on-top. SetDESKTOP_TOUCH_DOCK_TITLE='@parent'to auto-dock the terminal hosting Claude on MCP startup — the process-tree walker finds the right window regardless of title.Emergency stop (Failsafe) — Move the mouse to the top-left corner (within 10px of 0,0) to immediately terminate the MCP server.
Requirements
OS | Windows 10 / 11 (64-bit) |
Node.js | v20+ recommended (tested on v22+) |
PowerShell | 5.1+ (bundled with Windows) |
Claude CLI |
|
Note: nut-js native bindings require the Visual C++ Redistributable. Download from Microsoft if not already installed.
Installation
npx -y @harusame64/desktop-touch-mcpThe npm launcher downloads the latest desktop-touch-mcp-windows.zip from GitHub Releases on first run and caches it under %USERPROFILE%\.desktop-touch-mcp. Later runs reuse the cached release unless a newer GitHub Release is available.
Register with Claude CLI
Add to ~/.claude.json under mcpServers:
{
"mcpServers": {
"desktop-touch": {
"type": "stdio",
"command": "npx",
"args": ["-y", "@harusame64/desktop-touch-mcp"]
}
}
}No system prompt needed. The command reference is automatically injected into Claude via the MCP initialize response's instructions field.
Development install
git clone https://github.com/Harusame64/desktop-touch-mcp.git
cd desktop-touch-mcp
npm installnpm install runs the prepare script, which compiles TypeScript to dist/. No separate build step is required.
For a local checkout, register the built server directly:
{
"mcpServers": {
"desktop-touch": {
"type": "stdio",
"command": "node",
"args": ["D:/path/to/desktop-touch-mcp/dist/index.js"]
}
}
}Note: Replace
D:/path/to/desktop-touch-mcpwith the actual path where you cloned this repository.
Tools (56 total)
📖 Full command reference:
docs/system-overview.md— every tool's parameters, response shape, coordinate math, layer-buffer strategy, and engineering notes in one place.
Screenshot (5)
Tool | Description |
| Main capture. Supports |
| Capture a background window without focusing it (PrintWindow API) |
| Windows.Media.Ocr on a window; returns word-level text + screen clickAt coords |
| Monitor layout, DPI, cursor position |
| Full-page stitch by scrolling (MAE overlap detection + 10% fallback) |
Window management (4)
Tool | Description |
| List all windows in Z-order |
| Info about the focused window |
| Bring a window to foreground by partial title match |
| Snap a window to a screen corner at a small size + always-on-top (for keeping CLI visible) |
Mouse (5)
Tool | Description |
| Move, click, drag. |
| Scroll in any direction. Accepts |
| Current cursor coordinates |
Keyboard (2)
Tool | Description |
| Type text. |
| Key combos ( |
UI Automation (4)
Tool | Description |
| Full UIA element tree for a window |
| Click a button by name or automationId — no coordinates needed |
| Write directly to a text field |
| High-res zoom crop of an element + its child tree |
Browser CDP (12)
Tool | Description |
| Launch Chrome/Edge/Brave with |
| Connect to Chrome/Edge via CDP; lists open tabs with |
| CSS selector → exact physical screen coords |
| Find DOM element + click in one step |
| Evaluate JS expression in the browser tab |
| Fill React/Vue/Svelte controlled inputs via CDP — works where |
| Get outerHTML of element or |
| Enumerate links / buttons / inputs + ARIA toggles with |
| SPA state extractor — one CDP call that scans |
| Grep DOM by text / regex / role / ariaLabel / selector with confidence ranking |
| Navigate via CDP |
| Close cached CDP WebSocket sessions |
All browser_* tools that touch the DOM accept includeContext:false to omit the trailing activeTab: / readyState: lines (saves ~150 tok/call on chained invocations). Within a 500 ms window, consecutive calls reuse one tab-context fetch automatically.
Workspace (2)
Tool | Description |
| All windows: thumbnails + UI summaries in one call |
| Launch an app and auto-detect the new window |
Pin / Macro (3)
Tool | Description |
| Always-on-top toggle |
| Execute up to 50 steps sequentially in one MCP call |
Clipboard (2)
Tool | Description |
| Read the current Windows clipboard text (non-text payloads return empty string) |
| Write text to the Windows clipboard; full Unicode / emoji / CJK support |
Notification (1)
Tool | Description |
| Show a Windows system tray balloon notification — useful to alert the user when a long-running task finishes |
Scroll (1)
Tool | Description |
| Scroll a named element into the viewport without computing scroll amounts. Chrome path: |
| SmartScroll — unified scroll dispatcher: CDP → UIA → image binary-search fallback. Handles nested containers, virtualised lists (TanStack/React Virtualized), sticky-header occlusion, and image-only environments. Returns |
Browser CDP automation
For web automation, connect Chrome or Edge with the remote debugging port enabled — no Selenium or Playwright needed.
# Launch Chrome in CDP mode
chrome.exe --remote-debugging-port=9222 --user-data-dir=C:\tmp\cdpbrowser_launch() → launch Chrome/Edge/Brave in debug mode (idempotent)
browser_connect() → list open tabs + get tabIds
browser_find_element("#submit") → CSS selector → physical screen coords
browser_click_element("#submit") → find + click in one step (auto-focuses browser)
browser_eval("document.title") → evaluate JS, returns result
browser_fill_input("#email", "user@example.com") → fill React/Vue/Svelte controlled input (state-safe)
browser_get_dom("#main", maxLength=5000)→ outerHTML, truncated to maxLength chars
browser_get_interactive() → links/buttons/inputs + ARIA toggles + viewportPosition per element
browser_get_app_state() → one-shot SPA state (Next/Nuxt/Remix/Apollo/GitHub react-app/Redux SSR)
browser_search(by="text", pattern="...")→ grep DOM with confidence ranking
browser_navigate("https://example.com") → navigate via CDP (no address bar interaction)
browser_disconnect() → clean up WebSocket sessionsFor chained calls in the same tab, pass includeContext:false to omit the activeTab/readyState annotation (~150 tok/call saved). Boolean / object params accept the LLM-friendly string spellings ("true", "{}").
Coordinates returned by browser_find_element account for the browser chrome (tab strip + address bar height) and devicePixelRatio, so they can be passed directly to mouse_click without any scaling.
Recommended web workflow:
browser_connect() → browser_get_dom() → browser_find_element(selector) → browser_click_element(selector)Auto-dock CLI on startup
Keep Claude CLI visible while operating other apps full-screen. Set env vars in your MCP config and the docked window auto-snaps into place every MCP startup.
{
"mcpServers": {
"desktop-touch": {
"type": "stdio",
"command": "npx",
"args": ["-y", "@harusame64/desktop-touch-mcp"],
"env": {
"DESKTOP_TOUCH_DOCK_TITLE": "@parent",
"DESKTOP_TOUCH_DOCK_CORNER": "bottom-right",
"DESKTOP_TOUCH_DOCK_WIDTH": "480",
"DESKTOP_TOUCH_DOCK_HEIGHT": "360",
"DESKTOP_TOUCH_DOCK_PIN": "true"
}
}
}
}Env var | Default | Notes |
| (unset = off) |
|
|
|
|
|
| px ( |
|
| Always-on-top toggle |
| primary | Monitor id from |
|
| If true, multiply px values by |
|
| Screen-edge padding (px) |
|
| Max wait for the target window to appear |
Input routing gotcha: when a pinned window is active (e.g. Claude CLI),
keyboard_type/keyboard_presssend keys to it, not the app you wanted to type into. Always callfocus_window(title=...)before keyboard operations, then verifyisActive=trueviascreenshot(detail='meta').
Reactive Perception Graph (4)
Tool | Description |
| Register a live perception lens on a window or browser tab. Returns a |
| Force-refresh the lens and return a full perception envelope when attention is dirty/stale/blocked |
| Release a lens when the workflow ends or the target was replaced |
| List active lenses so Claude can reuse or clean up existing tracking |
Reactive Perception Graph is desktop-touch's low-cost situational awareness layer. It keeps the target identity, focus, rect, readiness, and guard state alive across actions so Claude does not need to re-check everything with a screenshot after every small move.
# Register a lens on the target window or browser tab
perception_register({name:"editor", target:{kind:"window", match:{titleIncludes:"Notepad"}}})
→ {lensId:"perc-1", ...}
# Pass lensId to action tools. Guards run before the action;
# compact feedback arrives in post.perception after the action.
keyboard_type({text:"hello", windowTitle:"Notepad", lensId:"perc-1"})
→ post.perception: {attention:"ok", guards:{...}, latest:{target:{title, rect, foreground}}}
# If the app restarts or focus moves away, guards fail closed before unsafe input:
keyboard_type({text:"x", lensId:"perc-1"})
→ {ok:false, code:"GuardFailed", suggest:["Re-register lens for the new process instance"]}lensId is opt-in on all action tools (keyboard_type, keyboard_press, mouse_click, mouse_drag, click_element, set_element_value, browser_click_element, browser_navigate, browser_eval). Omitting lensId preserves existing behavior exactly.
Mouse homing correction
When Claude calls screenshot(detail='text') to read coordinates and then mouse_click seconds later, the target window may have moved. The homing system corrects this automatically.
Tier | How to enable | Latency | What it does |
1 | Always-on (if cache exists) | <1ms | Applies (dx, dy) offset when window moved |
2 | Pass | ~100ms | Auto-focuses window if it went behind another |
3 | Pass | 1–3s | UIA re-query for fresh coords on resize |
# Tier 1 only (automatic)
mouse_click(x=500, y=300)
# Tier 1 + 2: also bring window to front if hidden
mouse_click(x=500, y=300, windowTitle="Notepad")
# Tier 1 + 2 + 3: also re-query UIA if window resized
mouse_click(x=500, y=300, windowTitle="Notepad", elementName="Save")
# Traction control OFF — no correction
mouse_click(x=500, y=300, homing=false)The homing parameter is available on mouse_click, mouse_move, mouse_drag, and scroll. The cache is updated automatically on every screenshot(), get_windows(), focus_window(), and workspace_snapshot() call.
mouse_click image-local coords (origin + scale)
When you take a dotByDot screenshot with dotByDotMaxDimension, the response prints the origin and scale values. Instead of computing screen coords manually, copy them into mouse_click:
# Screenshot response:
# origin: (0, 120) | scale: 0.6667
# To click image pixel (ix, iy): mouse_click(x=ix, y=iy, origin={x:0, y:120}, scale=0.6667)
mouse_click(x=640, y=300, origin={x:0, y:120}, scale=0.6667, windowTitle="Chrome")
# Server converts: screen = (0 + 640/0.6667, 120 + 300/0.6667) = (960, 570)This eliminates a whole class of off-by-one and scale bugs. Without origin/scale, x/y remain absolute screen pixels (unchanged behavior).
screenshot key parameters
detail="image" — PNG/WebP pixels (default)
detail="text" — UIA element JSON + clickAt coords (no image, ~100–300 tok)
detail="meta" — Title + region only (cheapest, ~20 tok/window)
dotByDot=true — 1:1 WebP; image_px + origin = screen_px
dotByDotMaxDimension=N — cap longest edge (response includes scale for coord math)
grayscale=true — ~50% smaller for text-heavy captures (code/AWS console)
region={x,y,w,h} — with windowTitle: window-local coords (exclude browser chrome)
without: virtual screen coords
diffMode=true — I-frame first call, P-frame (changed windows only) after (~160 tok)
ocrFallback="auto" — detail='text' auto-fires Windows OCR on uiaSparse or emptyRecommended Chrome combo (50–70% data reduction):
screenshot(windowTitle="Chrome",
dotByDot=true, dotByDotMaxDimension=1280, grayscale=true,
region={x:0, y:120, width:1920, height:900}) # skip browser chromeRecommended workflow:
workspace_snapshot() → full orientation (resets diff buffer)
screenshot(detail="text", windowTitle=X) → get actionable[].clickAt coords
mouse_click(x, y) → click directly, no math needed
screenshot(diffMode=true) → check only what changed (~160 tok)Security
Emergency stop (Failsafe)
Move the mouse to the top-left corner of the screen (within 10px of 0,0) to immediately terminate the MCP server.
Per-tool check:
checkFailsafe()runs before every tool handlerBackground monitor: 500ms polling as a backup for long-running operations
Trigger radius: 10px
Blocked operations
workspace_launch blocklist:
cmd.exe, powershell.exe, pwsh.exe, wscript.exe, cscript.exe, mshta.exe, regsvr32.exe, rundll32.exe, msiexec.exe, bash.exe, wsl.exe are blocked.
Script extensions (.bat, .ps1, .vbs, etc.) are rejected. Arguments containing ;, &, |, `, $(, ${ are also rejected.
keyboard_press blocklist:
Win+R (Run dialog), Win+X (admin menu), Win+S (search), Win+L (lock screen) are blocked.
PowerShell injection protection
All -like patterns in the UIA bridge are sanitized with escapeLike(), which escapes wildcard characters (*, ?, [, ]) before they reach PowerShell.
Allowlist for workspace_launch
Shell interpreters are blocked by default. To allow specific executables, create an allowlist file:
File locations (searched in order):
Path in
DESKTOP_TOUCH_ALLOWLISTenvironment variable~/.claude/desktop-touch-allowlist.jsondesktop-touch-allowlist.jsonin the server's working directory
Format:
{
"allowedExecutables": [
"pwsh.exe",
"C:\\Tools\\myapp.exe"
]
}Changes take effect immediately — no restart needed.
Mouse movement speed
All mouse tools (mouse_move, mouse_click, mouse_drag, scroll) accept an optional speed parameter:
Value | Behavior |
Omitted | Uses the configured default (see below) |
| Instant teleport — |
| Animated movement at N px/sec |
Default speed is 1500 px/sec. Change it permanently via the DESKTOP_TOUCH_MOUSE_SPEED environment variable:
{
"mcpServers": {
"desktop-touch": {
"type": "stdio",
"command": "npx",
"args": ["-y", "@harusame64/desktop-touch-mcp"],
"env": {
"DESKTOP_TOUCH_MOUSE_SPEED": "3000"
}
}
}
}Common values: 0 = teleport, 1500 = default gentle, 3000 = fast, 5000 = very fast.
Force-Focus (AttachThreadInput)
Windows foreground-stealing protection can prevent SetForegroundWindow from succeeding when another window (such as a pinned Claude CLI) is in the foreground. This causes subsequent keystrokes or clicks to land in the wrong window — a silent failure.
mouse_click, keyboard_type, keyboard_press, and terminal_send all accept a forceFocus parameter that bypasses this protection using AttachThreadInput:
{
"name": "mouse_click",
"arguments": {
"x": 500,
"y": 300,
"windowTitle": "Google Chrome",
"forceFocus": true
}
}If the force attempt is refused despite AttachThreadInput, the response includes hints.warnings: ["ForceFocusRefused"].
Global default via environment variable:
{
"mcpServers": {
"desktop-touch": {
"env": {
"DESKTOP_TOUCH_FORCE_FOCUS": "1"
}
}
}
}Setting DESKTOP_TOUCH_FORCE_FOCUS=1 makes forceFocus: true the default for all four tools without changing each call.
Known tradeoffs:
During the ~10ms
AttachThreadInputwindow, key state and mouse capture are shared between the two threads. In rapid macro sequences this can cause a race condition (rare in practice).Disable
forceFocus(or unset the env var) when the user is manually operating another app to avoid unexpected focus shifts.
Known limitations
Limitation | Detail | Workaround |
Games / video players may return black or hang in background capture | DirectX fullscreen apps may not work even with | Retry with |
UIA call overhead | ~300ms per call via PowerShell; | Batch with |
Chrome / WinUI3 UIA elements are empty | Chromium exposes only limited UIA |
|
Chromium title-regex misses when sites rewrite | Guard relies on the | Title is treated as plain Chrome (UIA runs). OCR path is still reachable via |
| If Chrome is already running on the default profile without the flag, | Close Chrome first, then |
Layer buffer TTL | Buffer auto-clears after 90s of inactivity → next | After long waits, call |
| When | Call |
| Non-ASCII punctuation (em-dash | Always use |
| Setting | Use |
Token cost reference
Mode | Tokens | Use case |
| ~443 tok | General visual check |
| ~800 tok | Precise clicking (no coordinate math) |
| ~160 tok | Post-action diff |
| ~100–300 tok | UI interaction (no image) |
| ~2000 tok | Full session orientation |
License
MIT
Latest Blog Posts
MCP directory API
We provide all the information about MCP servers via our MCP API.
curl -X GET 'https://glama.ai/api/mcp/v1/servers/Harusame64/desktop-touch-mcp'
If you have feedback or need assistance with the MCP directory API, please join our Discord server