desktop-touch-mcp
desktop-touch-mcp is a Windows-only MCP server that gives Claude eyes and hands on a Windows desktop via 46 specialized tools, optimized for LLM efficiency (minimal tokens and round-trips).
Note: All tools return
UnsupportedPlatformwhen running on Linux. Capabilities below apply when hosted on Windows 10/11 (64-bit).
Screen Capture & Vision
Screenshots with 3 detail levels:
image(~443 tokens),text(~100–300 tokens),meta(~20 tokens)Diff mode — only sends changed windows after first frame (60–80% token reduction)
Background window capture without stealing focus
OCR text extraction with click coordinates (Windows.Media.Ocr)
Full-page scroll capture with overlap detection
dotByDot=truefor pixel-perfect 1:1 coordinate clicking
Mouse & Keyboard Control
Move, click, drag with speed control and homing correction
Scroll in any direction; get cursor position
Image-local coordinate translation via
origin+scaleType text (clipboard bypass for non-ASCII/IME/CJK) and send key combos
forceFocusto bypass Windows foreground-stealing protection
Window Management
List all windows in Z-order; get active window info
Focus by partial title match; pin/unpin (always-on-top); dock to screen corners
UI Automation (UIA)
Extract full UI element trees as JSON with click coordinates
Click elements by name or automationId (no coordinates needed)
Set text field values directly; zoom/crop to specific elements
Automatic fallback from UIA to OCR for Chromium apps
Browser Automation via CDP (Chrome/Edge/Brave)
Launch, connect, navigate browsers via Chrome DevTools Protocol
Find/click elements by CSS selector; evaluate JavaScript; get DOM/outerHTML
Enumerate interactive elements with ARIA states
Extract SPA state (Next.js, Nuxt, Remix, Apollo, Redux, GitHub react-app)
Search DOM by text, regex, role, or ariaLabel with confidence ranking
Workspace & Macro Operations
Full workspace snapshot: all windows with thumbnails and UI summaries
Launch applications with auto-detection of new windows
run_macro: batch up to 50 sequential operations in a single API call
Event System & Terminal
Subscribe to, poll, and wait on desktop events (
wait_until)Read from and send input to terminals
Safety & Security
Emergency stop: move mouse to top-left corner (within 10px of 0,0)
Blocklists for dangerous executables and keyboard shortcuts
PowerShell injection protection; configurable allowlists for app launching
Provides browser automation capabilities through CDP (Chrome DevTools Protocol), enabling connection to Brave browser, finding elements via CSS selectors, clicking elements, evaluating JavaScript, and navigating pages without requiring Selenium or Playwright.
Supports CSS selector-based element location in browser automation, allowing precise identification of DOM elements in Chrome/Edge/Brave browsers for clicking and interaction through CDP integration.
Provides comprehensive browser automation through CDP (Chrome DevTools Protocol), enabling connection to Chrome, finding elements via CSS selectors, clicking elements, evaluating JavaScript, getting DOM content, and navigating pages without requiring Selenium or Playwright.
desktop-touch-mcp
Stop pasting screenshots. Let Claude see and control your desktop directly.
An MCP server that gives Claude eyes and hands on Windows — 56 tools covering screenshots, mouse, keyboard, Windows UI Automation, Chrome DevTools Protocol, clipboard, desktop notifications, SmartScroll, and a Reactive Perception Graph for safe multi-step automation, designed from the ground up for LLM efficiency.
Applies MPEG P-frame diffing to window capture: only changed windows are sent after the first frame, cutting token usage by ~60–80% in typical automation loops.
Features
LLM-native design — Built around how LLMs think, not how humans click.
run_macrobatches multiple operations into a single API call;diffModesends only the windows that changed since the last frame. Minimal tokens, minimal round-trips.Reactive Perception Graph — Register a
lensIdfor a window or browser tab, pass it to action tools, and get guard-checkedpost.perceptionfeedback after each action. It reduces repeatedscreenshot/get_contextcalls and prevents wrong-window typing or stale-coordinate clicks.Full CJK support — Uses Win32
GetWindowTextWfor window titles, avoiding nut-js garbling. IME bypass input supported for Japanese/Chinese/Korean environments.3-tier token reduction —
detail="image"(~443 tok) /detail="text"(~100–300 tok) /diffMode=true(~160 tok). Send pixels only when you actually need to see them.1:1 coordinate mode —
dotByDot=truecaptures at native resolution (WebP). Image pixel = screen coordinate — no scale math needed. Withorigin+scalepassed tomouse_click, the server converts coords for you — eliminating off-by-one / scale bugs.Browser capture data reduction —
grayscale=true(~50% size),dotByDotMaxDimension=1280(auto-scaled with coord preservation), andwindowTitle + regionsub-crops help exclude browser chrome and other irrelevant pixels. Typical reduction for heavy captures: 50–70%.Chromium smart fallback —
detail="text"on Chrome/Edge/Brave auto-skips UIA (prohibitively slow there) and runs Windows OCR.hints.chromiumGuard+hints.ocrFallbackFiredflag the path taken.UIA element extraction —
detail="text"returns button names andclickAtcoords as JSON. Claude can click the right element without ever looking at a screenshot.Auto-dock CLI —
dock_windowsnaps any window to a screen corner with always-on-top. SetDESKTOP_TOUCH_DOCK_TITLE='@parent'to auto-dock the terminal hosting Claude on MCP startup — the process-tree walker finds the right window regardless of title.Emergency stop (Failsafe) — Move the mouse to the top-left corner (within 10px of 0,0) to immediately terminate the MCP server.
Requirements
OS | Windows 10 / 11 (64-bit) |
Node.js | v20+ recommended (tested on v22+) |
PowerShell | 5.1+ (bundled with Windows) |
Claude CLI |
|
Note: nut-js native bindings require the Visual C++ Redistributable. Download from Microsoft if not already installed.
Installation
npx -y @harusame64/desktop-touch-mcpThe npm launcher downloads the latest desktop-touch-mcp-windows.zip from GitHub Releases on first run and caches it under %USERPROFILE%\.desktop-touch-mcp. Later runs reuse the cached release unless a newer GitHub Release is available.
Register with Claude CLI
Add to ~/.claude.json under mcpServers:
{
"mcpServers": {
"desktop-touch": {
"type": "stdio",
"command": "npx",
"args": ["-y", "@harusame64/desktop-touch-mcp"]
}
}
}No system prompt needed. The command reference is automatically injected into Claude via the MCP initialize response's instructions field.
Development install
git clone https://github.com/Harusame64/desktop-touch-mcp.git
cd desktop-touch-mcp
npm installnpm install runs the prepare script, which compiles TypeScript to dist/. No separate build step is required.
For a local checkout, register the built server directly:
{
"mcpServers": {
"desktop-touch": {
"type": "stdio",
"command": "node",
"args": ["D:/path/to/desktop-touch-mcp/dist/index.js"]
}
}
}Note: Replace
D:/path/to/desktop-touch-mcpwith the actual path where you cloned this repository.
Tools (56 total)
📖 Full command reference:
docs/system-overview.md— every tool's parameters, response shape, coordinate math, layer-buffer strategy, and engineering notes in one place.
Screenshot (5)
Tool | Description |
| Main capture. Supports |
| Capture a background window without focusing it (PrintWindow API) |
| Windows.Media.Ocr on a window; returns word-level text + screen clickAt coords |
| Monitor layout, DPI, cursor position |
| Full-page stitch by scrolling (MAE overlap detection + 10% fallback) |
Window management (4)
Tool | Description |
| List all windows in Z-order |
| Info about the focused window |
| Bring a window to foreground by partial title match |
| Snap a window to a screen corner at a small size + always-on-top (for keeping CLI visible) |
Mouse (5)
Tool | Description |
| Move, click, drag. |
| Scroll in any direction. Accepts |
| Current cursor coordinates |
Keyboard (2)
Tool | Description |
| Type text. |
| Key combos ( |
UI Automation (4)
Tool | Description |
| Full UIA element tree for a window |
| Click a button by name or automationId — no coordinates needed |
| Write directly to a text field |
| High-res zoom crop of an element + its child tree |
Browser CDP (12)
Tool | Description |
| Launch Chrome/Edge/Brave with |
| Connect to Chrome/Edge via CDP; lists open tabs with |
| CSS selector → exact physical screen coords |
| Find DOM element + click in one step |
| Evaluate JS expression in the browser tab |
| Fill React/Vue/Svelte controlled inputs via CDP — works where |
| Get outerHTML of element or |
| Enumerate links / buttons / inputs + ARIA toggles with |
| SPA state extractor — one CDP call that scans |
| Grep DOM by text / regex / role / ariaLabel / selector with confidence ranking |
| Navigate via CDP |
| Close cached CDP WebSocket sessions |
All browser_* tools that touch the DOM accept includeContext:false to omit the trailing activeTab: / readyState: lines (saves ~150 tok/call on chained invocations). Within a 500 ms window, consecutive calls reuse one tab-context fetch automatically.
Workspace (2)
Tool | Description |
| All windows: thumbnails + UI summaries in one call |
| Launch an app and auto-detect the new window |
Pin / Macro (3)
Tool | Description |
| Always-on-top toggle |
| Execute up to 50 steps sequentially in one MCP call |
Clipboard (2)
Tool | Description |
| Read the current Windows clipboard text (non-text payloads return empty string) |
| Write text to the Windows clipboard; full Unicode / emoji / CJK support |
Notification (1)
Tool | Description |
| Show a Windows system tray balloon notification — useful to alert the user when a long-running task finishes |
Scroll (1)
Tool | Description |
| Scroll a named element into the viewport without computing scroll amounts. Chrome path: |
| SmartScroll — unified scroll dispatcher: CDP → UIA → image binary-search fallback. Handles nested containers, virtualised lists (TanStack/React Virtualized), sticky-header occlusion, and image-only environments. Returns |
Browser CDP automation
For web automation, connect Chrome or Edge with the remote debugging port enabled — no Selenium or Playwright needed.
# Launch Chrome in CDP mode
chrome.exe --remote-debugging-port=9222 --user-data-dir=C:\tmp\cdpbrowser_launch() → launch Chrome/Edge/Brave in debug mode (idempotent)
browser_connect() → list open tabs + get tabIds
browser_find_element("#submit") → CSS selector → physical screen coords
browser_click_element("#submit") → find + click in one step (auto-focuses browser)
browser_eval("document.title") → evaluate JS, returns result
browser_fill_input("#email", "user@example.com") → fill React/Vue/Svelte controlled input (state-safe)
browser_get_dom("#main", maxLength=5000)→ outerHTML, truncated to maxLength chars
browser_get_interactive() → links/buttons/inputs + ARIA toggles + viewportPosition per element
browser_get_app_state() → one-shot SPA state (Next/Nuxt/Remix/Apollo/GitHub react-app/Redux SSR)
browser_search(by="text", pattern="...")→ grep DOM with confidence ranking
browser_navigate("https://example.com") → navigate via CDP (no address bar interaction)
browser_disconnect() → clean up WebSocket sessionsFor chained calls in the same tab, pass includeContext:false to omit the activeTab/readyState annotation (~150 tok/call saved). Boolean / object params accept the LLM-friendly string spellings ("true", "{}").
Coordinates returned by browser_find_element account for the browser chrome (tab strip + address bar height) and devicePixelRatio, so they can be passed directly to mouse_click without any scaling.
Recommended web workflow:
browser_connect() → browser_get_dom() → browser_find_element(selector) → browser_click_element(selector)Auto-dock CLI on startup
Keep Claude CLI visible while operating other apps full-screen. Set env vars in your MCP config and the docked window auto-snaps into place every MCP startup.
{
"mcpServers": {
"desktop-touch": {
"type": "stdio",
"command": "npx",
"args": ["-y", "@harusame64/desktop-touch-mcp"],
"env": {
"DESKTOP_TOUCH_DOCK_TITLE": "@parent",
"DESKTOP_TOUCH_DOCK_CORNER": "bottom-right",
"DESKTOP_TOUCH_DOCK_WIDTH": "480",
"DESKTOP_TOUCH_DOCK_HEIGHT": "360",
"DESKTOP_TOUCH_DOCK_PIN": "true"
}
}
}
}Env var | Default | Notes |
| (unset = off) |
|
|
|
|
|
| px ( |
|
| Always-on-top toggle |
| primary | Monitor id from |
|
| If true, multiply px values by |
|
| Screen-edge padding (px) |
|
| Max wait for the target window to appear |
Input routing gotcha: when a pinned window is active (e.g. Claude CLI),
keyboard_type/keyboard_presssend keys to it, not the app you wanted to type into. Always callfocus_window(title=...)before keyboard operations, then verifyisActive=trueviascreenshot(detail='meta').
Reactive Perception Graph (4)
Tool | Description |
| Register a live perception lens on a window or browser tab. Returns a |
| Force-refresh the lens and return a full perception envelope when attention is dirty/stale/blocked |
| Release a lens when the workflow ends or the target was replaced |
| List active lenses so Claude can reuse or clean up existing tracking |
Reactive Perception Graph is desktop-touch's low-cost situational awareness layer. It keeps the target identity, focus, rect, readiness, and guard state alive across actions so Claude does not need to re-check everything with a screenshot after every small move.
# Register a lens on the target window or browser tab
perception_register({name:"editor", target:{kind:"window", match:{titleIncludes:"Notepad"}}})
→ {lensId:"perc-1", ...}
# Pass lensId to action tools. Guards run before the action;
# compact feedback arrives in post.perception after the action.
keyboard_type({text:"hello", windowTitle:"Notepad", lensId:"perc-1"})
→ post.perception: {attention:"ok", guards:{...}, latest:{target:{title, rect, foreground}}}
# If the app restarts or focus moves away, guards fail closed before unsafe input:
keyboard_type({text:"x", lensId:"perc-1"})
→ {ok:false, code:"GuardFailed", suggest:["Re-register lens for the new process instance"]}lensId is opt-in on all action tools (keyboard_type, keyboard_press, mouse_click, mouse_drag, click_element, set_element_value, browser_click_element, browser_navigate, browser_eval). Omitting lensId preserves existing behavior exactly.
Mouse homing correction
When Claude calls screenshot(detail='text') to read coordinates and then mouse_click seconds later, the target window may have moved. The homing system corrects this automatically.
Tier | How to enable | Latency | What it does |
1 | Always-on (if cache exists) | <1ms | Applies (dx, dy) offset when window moved |
2 | Pass | ~100ms | Auto-focuses window if it went behind another |
3 | Pass | 1–3s | UIA re-query for fresh coords on resize |
# Tier 1 only (automatic)
mouse_click(x=500, y=300)
# Tier 1 + 2: also bring window to front if hidden
mouse_click(x=500, y=300, windowTitle="Notepad")
# Tier 1 + 2 + 3: also re-query UIA if window resized
mouse_click(x=500, y=300, windowTitle="Notepad", elementName="Save")
# Traction control OFF — no correction
mouse_click(x=500, y=300, homing=false)The homing parameter is available on mouse_click, mouse_move, mouse_drag, and scroll. The cache is updated automatically on every screenshot(), get_windows(), focus_window(), and workspace_snapshot() call.
mouse_click image-local coords (origin + scale)
When you take a dotByDot screenshot with dotByDotMaxDimension, the response prints the origin and scale values. Instead of computing screen coords manually, copy them into mouse_click:
# Screenshot response:
# origin: (0, 120) | scale: 0.6667
# To click image pixel (ix, iy): mouse_click(x=ix, y=iy, origin={x:0, y:120}, scale=0.6667)
mouse_click(x=640, y=300, origin={x:0, y:120}, scale=0.6667, windowTitle="Chrome")
# Server converts: screen = (0 + 640/0.6667, 120 + 300/0.6667) = (960, 570)This eliminates a whole class of off-by-one and scale bugs. Without origin/scale, x/y remain absolute screen pixels (unchanged behavior).
screenshot key parameters
detail="image" — PNG/WebP pixels (default)
detail="text" — UIA element JSON + clickAt coords (no image, ~100–300 tok)
detail="meta" — Title + region only (cheapest, ~20 tok/window)
dotByDot=true — 1:1 WebP; image_px + origin = screen_px
dotByDotMaxDimension=N — cap longest edge (response includes scale for coord math)
grayscale=true — ~50% smaller for text-heavy captures (code/AWS console)
region={x,y,w,h} — with windowTitle: window-local coords (exclude browser chrome)
without: virtual screen coords
diffMode=true — I-frame first call, P-frame (changed windows only) after (~160 tok)
ocrFallback="auto" — detail='text' auto-fires Windows OCR on uiaSparse or emptyRecommended Chrome combo (50–70% data reduction):
screenshot(windowTitle="Chrome",
dotByDot=true, dotByDotMaxDimension=1280, grayscale=true,
region={x:0, y:120, width:1920, height:900}) # skip browser chromeRecommended workflow:
workspace_snapshot() → full orientation (resets diff buffer)
screenshot(detail="text", windowTitle=X) → get actionable[].clickAt coords
mouse_click(x, y) → click directly, no math needed
screenshot(diffMode=true) → check only what changed (~160 tok)Security
Emergency stop (Failsafe)
Move the mouse to the top-left corner of the screen (within 10px of 0,0) to immediately terminate the MCP server.
Per-tool check:
checkFailsafe()runs before every tool handlerBackground monitor: 500ms polling as a backup for long-running operations
Trigger radius: 10px
Blocked operations
workspace_launch blocklist:
cmd.exe, powershell.exe, pwsh.exe, wscript.exe, cscript.exe, mshta.exe, regsvr32.exe, rundll32.exe, msiexec.exe, bash.exe, wsl.exe are blocked.
Script extensions (.bat, .ps1, .vbs, etc.) are rejected. Arguments containing ;, &, |, `, $(, ${ are also rejected.
keyboard_press blocklist:
Win+R (Run dialog), Win+X (admin menu), Win+S (search), Win+L (lock screen) are blocked.
PowerShell injection protection
All -like patterns in the UIA bridge are sanitized with escapeLike(), which escapes wildcard characters (*, ?, [, ]) before they reach PowerShell.
Allowlist for workspace_launch
Shell interpreters are blocked by default. To allow specific executables, create an allowlist file:
File locations (searched in order):
Path in
DESKTOP_TOUCH_ALLOWLISTenvironment variable~/.claude/desktop-touch-allowlist.jsondesktop-touch-allowlist.jsonin the server's working directory
Format:
{
"allowedExecutables": [
"pwsh.exe",
"C:\\Tools\\myapp.exe"
]
}Changes take effect immediately — no restart needed.
Mouse movement speed
All mouse tools (mouse_move, mouse_click, mouse_drag, scroll) accept an optional speed parameter:
Value | Behavior |
Omitted | Uses the configured default (see below) |
| Instant teleport — |
| Animated movement at N px/sec |
Default speed is 1500 px/sec. Change it permanently via the DESKTOP_TOUCH_MOUSE_SPEED environment variable:
{
"mcpServers": {
"desktop-touch": {
"type": "stdio",
"command": "npx",
"args": ["-y", "@harusame64/desktop-touch-mcp"],
"env": {
"DESKTOP_TOUCH_MOUSE_SPEED": "3000"
}
}
}
}Common values: 0 = teleport, 1500 = default gentle, 3000 = fast, 5000 = very fast.
Force-Focus (AttachThreadInput)
Windows foreground-stealing protection can prevent SetForegroundWindow from succeeding when another window (such as a pinned Claude CLI) is in the foreground. This causes subsequent keystrokes or clicks to land in the wrong window — a silent failure.
mouse_click, keyboard_type, keyboard_press, and terminal_send all accept a forceFocus parameter that bypasses this protection using AttachThreadInput:
{
"name": "mouse_click",
"arguments": {
"x": 500,
"y": 300,
"windowTitle": "Google Chrome",
"forceFocus": true
}
}If the force attempt is refused despite AttachThreadInput, the response includes hints.warnings: ["ForceFocusRefused"].
Global default via environment variable:
{
"mcpServers": {
"desktop-touch": {
"env": {
"DESKTOP_TOUCH_FORCE_FOCUS": "1"
}
}
}
}Setting DESKTOP_TOUCH_FORCE_FOCUS=1 makes forceFocus: true the default for all four tools without changing each call.
Known tradeoffs:
During the ~10ms
AttachThreadInputwindow, key state and mouse capture are shared between the two threads. In rapid macro sequences this can cause a race condition (rare in practice).Disable
forceFocus(or unset the env var) when the user is manually operating another app to avoid unexpected focus shifts.
Known limitations
Limitation | Detail | Workaround |
Games / video players may return black or hang in background capture | DirectX fullscreen apps may not work even with | Retry with |
UIA call overhead | ~300ms per call via PowerShell; | Batch with |
Chrome / WinUI3 UIA elements are empty | Chromium exposes only limited UIA |
|
Chromium title-regex misses when sites rewrite | Guard relies on the | Title is treated as plain Chrome (UIA runs). OCR path is still reachable via |
| If Chrome is already running on the default profile without the flag, | Close Chrome first, then |
Layer buffer TTL | Buffer auto-clears after 90s of inactivity → next | After long waits, call |
| When | Call |
| Non-ASCII punctuation (em-dash | Always use |
| Setting | Use |
Token cost reference
Mode | Tokens | Use case |
| ~443 tok | General visual check |
| ~800 tok | Precise clicking (no coordinate math) |
| ~160 tok | Post-action diff |
| ~100–300 tok | UI interaction (no image) |
| ~2000 tok | Full session orientation |
License
MIT
Latest Blog Posts
MCP directory API
We provide all the information about MCP servers via our MCP API.
curl -X GET 'https://glama.ai/api/mcp/v1/servers/Harusame64/desktop-touch-mcp'
If you have feedback or need assistance with the MCP directory API, please join our Discord server