Orbination AI Desktop Vision & Control
Give AI assistants eyes and hands. A native Windows MCP server that lets AI see the screen, read UI elements, click buttons, type text, and control any application — with built-in OCR, dark theme support, window occlusion detection, and batch action sequencing.
Built for Claude Code by Leia Enterprise Solutions for the Orbination project.
AI coding assistants are blind. They generate code but can never see the result. They can't compare a design mockup to a running app. They can't click through a UI to test it. This server fixes that.
What It Does
This MCP server bridges the gap between AI and your desktop. Instead of working blind with just text, the AI can:
See — Take screenshots, run OCR on any window (auto-enhances dark themes), detect window occlusion
Read — Detect every UI element (buttons, inputs, text, tabs, checkboxes) with exact positions via Windows UIAutomation
Interact — Click elements by text (UIAutomation + OCR fallback), navigate menus, fill forms, type and paste text
Navigate — Open apps, switch windows, focus tabs, navigate browser URLs
Understand — Scan the entire desktop: window visibility %, occlusion detection, uncovered desktop regions
Batch — Execute multi-step UI workflows in a single call with run_sequence
What's New in v2.0
Window Occlusion Detection — Grid-based analysis showing which windows are truly visible (visibility %) and which are hidden behind others
Desktop Region Detection — Flood-fill algorithm to find uncovered screen areas
Shared OcrService — Centralized OCR with automatic dark theme enhancement (invert + contrast boost) — single-pass, not two
PrintWindow API — Capture window content even when obscured by other windows
click_element — UIAutomation first, then OCR for dark themes, web apps, iframes
run_sequence — Batch multiple UI actions (click, type, paste, hotkey, wait, focus, OCR click) in a single MCP call
click_menu_item — Navigate parent > child menus with smooth mouse movement to keep submenus open
DPI Awareness — Per-monitor DPI for correct coordinates on multi-monitor setups with mixed scaling
Embedded AI Instructions — Server sends tool usage guidelines on MCP connection, teaching AI to prefer OCR over screenshots
Architecture
```
AI Client (Claude Code / Claude Desktop)
          │
          │ MCP / stdio
          ▼
┌─────────────────────────────┐
│         MCP Server          │
│    (ServerInstructions)     │
└─────────┬───────────────────┘
          │
┌─────────┼──────────────────────────────────────┐
│  │      │       │       │        │             │
│  ▼      ▼       ▼       ▼        ▼             │
│ Mouse Keyboard Screen Vision  Composite        │
│ Tools  Tools   Tools  Tools     Tools          │
│   │      │       │      │                      │
│   ┌──────┼───────┼──────┘                      │
│   ▼      ▼       ▼                             │
│ Win32  UIAuto-  OcrService                     │
│ Native mation  (dark theme)                    │
│   │      │         │                           │
│   ▼      ▼         ▼                           │
│ DesktopScanner   NativeInput                   │
│ (occlusion,      (SendInput,                   │
│  regions)         clipboard)                   │
│        │             │                         │
│        └──────┬──────┘                         │
│               ▼                                │
│          Windows OS                            │
│   (Desktop, Windows, Apps)                     │
└────────────────────────────────────────────────┘
```

Single native .NET 8 executable. No Python. No Node.js. No browser drivers. Direct Windows API access.
Requirements
Windows 10/11
.NET 8 SDK (to build from source; the published binary needs only the .NET 8 runtime)
Build
```
cd DesktopControlMcp
dotnet build -c Release
```

Or publish as a single file:

```
dotnet publish -c Release -r win-x64 --self-contained false
```

Setup with Claude Code
Add the MCP server to your Claude Code configuration:
```
claude mcp add desktop-control -- "C:\path\to\DesktopControlMcp.exe"
```

Or add it manually to your MCP config file:
```
{
  "mcpServers": {
    "desktop-control": {
      "command": "C:\\path\\to\\DesktopControlMcp\\bin\\Release\\net8.0-windows\\DesktopControlMcp.exe",
      "args": []
    }
  }
}
```

Tools (45+)
Vision & Element Detection
| Tool | Description |
| --- | --- |
| scan_desktop | Full desktop scan — screens, windows with visibility %, UI elements, desktop regions, taskbar |
| list_windows | List all visible windows with titles, process names, visibility %, occlusion status |
| get_window_details | Get all UI elements in a window (filter by kind: button, input, text, etc.) |
| | Search for a UI element by text across all windows |
| | Extract all visible text from a window |
| refresh_window | Re-scan a single window's elements (faster than full scan) |
Interaction
| Tool | Description |
| --- | --- |
| click_element | Find element by text and click — UIAutomation first, OCR fallback for dark themes/web apps |
| | Find an input field and type text (ValuePattern, clipboard paste, or click+type fallback) |
| | Smart interaction — auto-detects element type and performs the right action |
| fill_form | Fill multiple form fields in one call with JSON field:value pairs |
| | Select a browser or application tab by text |
| click_menu_item | Navigate menus: click parent, smooth-move to child, click — single call |
Batch & Composite Actions
| Tool | Description |
| --- | --- |
| run_sequence | Execute multiple UI actions in ONE call: click, type, paste, hotkey, wait, focus, OCR click, screenshot |
| | Click at position then type text |
| | Click to focus (e.g. iframe) then send keyboard shortcut atomically |
Mouse & Keyboard
| Tool | Description |
| --- | --- |
| mouse_click | Click at screen coordinates |
| | Move cursor to position |
| | Move mouse smoothly (keeps menus/submenus open) |
| | Drag from one position to another |
| | Scroll the mouse wheel |
| | Get current cursor position |
| | Type text (supports Unicode) |
| | Press a single key |
| | Press key combinations (Ctrl+C, Alt+Tab, etc.) |
| | Hold and release keys |
Window & App Management
| Tool | Description |
| --- | --- |
| | Bring a window to the foreground |
| | Maximize a window |
| | Minimize a window |
| | Restore a minimized/maximized window |
| | Open an app by name (focuses existing, clicks taskbar, or searches Start) |
| | Navigate a browser to a URL |
Screenshots & OCR
| Tool | Description |
| --- | --- |
| screenshot_to_file | Full screenshot across all monitors |
| screenshot_region | Screenshot a specific screen region |
| | Capture a window via PrintWindow API (works even when obscured) |
| get_screen_info | Get monitor layout (positions, sizes, primary) |
| ocr_screen_region | Capture a region and run OCR — auto-enhances dark themes |
| ocr_window | Run OCR on an entire window — reads all text with click coordinates |
| ocr_find_text | Search for specific text on screen using OCR — returns click coordinates |
Utilities
| Tool | Description |
| --- | --- |
| | Set clipboard text without pasting |
| paste_text | Paste large text via clipboard (XML, code, multi-line) |
| | Scroll with pauses between batches |
| | Pause between actions |
| | Poll for UI element to appear with timeout |
Embedded AI Instructions
The server sends tool usage guidelines automatically on every MCP connection via ServerInstructions. This teaches AI clients the optimal workflow without requiring any configuration files:
Observation Priority: ocr_window > get_window_details > list_windows > scan_desktop > screenshot_to_file
Action Priority: click_element > click_menu_item > run_sequence > paste_text > mouse_click
The key insight: OCR and UIAutomation return exact text and coordinates — the AI knows exactly what to click. Screenshots require vision processing and guessing. OCR-first workflows are faster, cheaper, and more reliable.
Window Occlusion Detection
The server uses a grid-based occlusion analysis (24px cells) to determine which windows are truly visible:
```
Chrome (chrome) [71] @ -2060,-1461 3456x1403   ← 100% visible
VS Code (Code) [45] @ -1500,-800 1200x900      ← 65% visible
Explorer (explorer) [20] @ -1400,-700 800x600  ← 0% visible [OCCLUDED]
```

The AI knows which windows it can interact with and which are hidden. Combined with desktop region detection (flood-fill to find uncovered screen areas), the AI has a complete spatial understanding of the desktop.
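The grid pass can be sketched in Python (an illustrative sketch, not the server's C# implementation; windows are (name, x, y, w, h) tuples ordered topmost-first):

```python
def visibility_percent(windows, cell=24):
    """Paint windows top-down onto a 24px grid; each cell belongs to
    the first (topmost) window that covers it."""
    owner = {}                      # grid cell -> index of topmost window
    totals = [0] * len(windows)
    for i, (_, x, y, w, h) in enumerate(windows):
        for cx in range(x // cell, (x + w - 1) // cell + 1):
            for cy in range(y // cell, (y + h - 1) // cell + 1):
                totals[i] += 1
                owner.setdefault((cx, cy), i)   # first claimant (topmost) wins
    visible = [0] * len(windows)
    for i in owner.values():
        visible[i] += 1
    return {windows[i][0]: round(100 * visible[i] / totals[i])
            for i in range(len(windows))}
```

Floor division handles the negative coordinates of left/top monitors naturally, which is why the sample output above can show a window at -2060,-1461.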
Dark Theme OCR Enhancement
Many modern apps use dark themes where standard OCR fails. The server automatically detects dark backgrounds and enhances images before OCR:
1. Sample pixel luminance across the image
2. If average luminance < 100 → dark theme detected
3. Invert colors + boost contrast (1.4x) — single pass
4. Run OCR on enhanced image
This works automatically on ocr_window, ocr_screen_region, ocr_find_text, and click_element's OCR fallback.
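The steps above can be illustrated in Python (a sketch; the server does this in C# on captured bitmaps, and the Rec. 601 luma weights here are an assumption — only the threshold of 100 and the 1.4x contrast come from the text):

```python
def enhance_for_ocr(pixels, threshold=100, contrast=1.4):
    """pixels: flat list of (r, g, b) tuples. Detect a dark background
    via average luminance, then invert and boost contrast in one pass."""
    luma = lambda p: 0.299 * p[0] + 0.587 * p[1] + 0.114 * p[2]
    avg = sum(luma(p) for p in pixels) / len(pixels)
    if avg >= threshold:
        return pixels                       # light theme: leave untouched
    def adjust(c):
        c = 255 - c                         # invert
        c = (c - 128) * contrast + 128      # contrast boost around mid-gray
        return max(0, min(255, int(c)))     # clamp to byte range
    return [(adjust(r), adjust(g), adjust(b)) for r, g, b in pixels]
```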
Multi-Monitor Support
Full multi-monitor support out of the box with per-monitor DPI awareness:
Auto-detects all monitors — positions, sizes, primary screen via get_screen_info
Virtual desktop mapping — coordinates span the full virtual desktop, including negative coordinates for left/top monitors
DPI-aware — correct coordinates on mixed-scaling setups (e.g. 100% on one monitor, 150% on another)
Cross-monitor screenshots — screenshot_to_file captures all screens, screenshot_region targets any region
Window-aware — windows on any monitor are detected with correct positions
Taskbar scanning — reads both Shell_TrayWnd (primary) and Shell_SecondaryTrayWnd (secondary monitors)
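The coordinate mapping can be sketched like this (the monitor origin and scale factor are made-up example values, not queried from Windows):

```python
def to_virtual(monitor_origin, scale, local_point):
    """Map a logical, monitor-local point to physical virtual-desktop
    coordinates. monitor_origin is the physical top-left of the monitor
    in the virtual desktop (negative for left/top monitors); scale is
    the DPI scale factor (1.0 = 100%, 1.5 = 150%)."""
    ox, oy = monitor_origin
    lx, ly = local_point
    return (ox + round(lx * scale), oy + round(ly * scale))

# A secondary monitor left of the primary, running at 150% scaling:
to_virtual((-2560, 0), 1.5, (100, 200))   # -> (-2410, 300)
```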
How UIAutomation Works
Unlike screenshot-based tools that guess what's on screen, this server reads the actual UI element tree exposed by Windows. Every button, input field, text label, tab, and checkbox is detected with:
Exact position and size (bounding rectangle)
Text/label (what the element says)
Control type (button, input, text, checkbox, etc.)
Automation ID (developer-assigned identifier)
Supported patterns (can it be clicked? typed into? toggled?)
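The properties above map naturally onto a compact record (a hypothetical shape for illustration, not the server's actual SceneData model):

```python
from dataclasses import dataclass

@dataclass
class UiElement:
    kind: str              # control type: button, input, text, checkbox...
    text: str              # what the element says (label or value)
    x: int                 # bounding rectangle position
    y: int
    width: int             # bounding rectangle size
    height: int
    automation_id: str     # developer-assigned identifier (may be empty)
    patterns: tuple        # supported patterns, e.g. ("Invoke",) if clickable

    def center(self):
        # The natural click point: the middle of the bounding rectangle.
        return (self.x + self.width // 2, self.y + self.height // 2)
```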
UIAutomation + OCR Fallback
click_element combines both strategies. UIAutomation first (fast, structured), OCR fallback (universal):
```
click_element "Save"
→ UIAutomation: found "Save" button → click via Invoke pattern ✓

click_element "OK"   (dark web dialog)
→ UIAutomation: not found
→ OCR: capture window → enhance dark theme → find "OK" text → click center ✓
```

Limitation: Custom-Rendered Apps
Applications that render their own UI canvas (Flutter, Electron with custom rendering, game engines) may expose fewer elements to UIAutomation. The OCR fallback handles these cases automatically.
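The fallback chain amounts to the following (a sketch; find_via_uia, find_via_ocr, and click are stand-ins for the real lookups, not actual server functions):

```python
def click_element(text, find_via_uia, find_via_ocr, click):
    """Try the structured UIAutomation tree first; fall back to OCR for
    dark themes, web content, and custom-rendered apps."""
    # Strategy 1: UIAutomation (fast, exact, structured).
    hit = find_via_uia(text)
    if hit is not None:
        click(hit)
        return "uia"
    # Strategy 2: OCR (capture window, enhance dark theme, read text).
    hit = find_via_ocr(text)
    if hit is not None:
        click(hit)
        return "ocr"
    return None        # element not found by either strategy
```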
Token-Efficient by Design
Every MCP tool call costs tokens. This server is engineered to minimize token usage:
Structured Data Instead of Screenshots
Most desktop automation tools send full screenshots for every action — each one costs thousands of tokens. This server returns compact structured text:
```
[button] "Save" @ 450,320
[input] "Search..." @ 200,60
[tab-item] "Settings" @ 120,35
```

Batch Operations
run_sequence — executes multiple actions in one call (click, type, paste, hotkey, wait, focus)
fill_form — fills multiple form fields in a single call
scan_desktop — returns screens + windows + elements + taskbar in one response
click_menu_item — navigates parent > child menus in one call
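A batch call might carry a payload along these lines (the action names and keys are illustrative assumptions, not the tool's actual schema; consult the tool description the server sends at connection time):

```python
import json

# Hypothetical sequence: focus a window, click a field, paste, save, wait.
sequence = [
    {"action": "focus",  "window": "Notepad"},
    {"action": "click",  "text": "File name"},
    {"action": "paste",  "text": "report-final.txt"},
    {"action": "hotkey", "keys": "Ctrl+S"},
    {"action": "wait",   "ms": 500},
]
payload = json.dumps(sequence)
```

One call like this replaces five round-trips, which is where the token savings come from.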
Smart Caching
Scan results are cached for 30 seconds. Individual windows can be refreshed with refresh_window instead of a full scan_desktop. The scanner uses UIAutomation's CacheRequest to batch-fetch all properties in a single cross-process call.
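The 30-second cache behaves like a plain TTL cache (a sketch, not the server's code; the injectable clock is only there to make the expiry testable):

```python
import time

class TtlCache:
    """Entries expire after `ttl` seconds. Re-putting a single key mirrors
    refresh_window replacing one window's scan without a full rescan."""
    def __init__(self, ttl=30.0, clock=time.monotonic):
        self.ttl, self.clock, self.store = ttl, clock, {}

    def get(self, key):
        entry = self.store.get(key)
        if entry is not None and self.clock() - entry[0] < self.ttl:
            return entry[1]
        return None                 # missing or expired

    def put(self, key, value):
        self.store[key] = (self.clock(), value)
```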
Project Structure
```
DesktopControlMcp/
├── Program.cs                # MCP server entry + DPI awareness + ServerInstructions
├── NativeInput.cs            # Low-level mouse/keyboard via SendInput
├── Native/
│   └── Win32.cs              # P/Invoke: EnumWindows, PrintWindow, window management
├── Models/
│   └── SceneData.cs          # Data models: windows (with occlusion), elements, regions
├── Services/
│   ├── DesktopScanner.cs     # Desktop scanning + occlusion analysis + region detection
│   ├── OcrService.cs         # Shared OCR engine with dark theme auto-enhancement
│   └── UiAutomationHelper.cs # Element interaction patterns
└── Tools/
    ├── VisionTools.cs        # scan, find, click (with OCR fallback), list windows
    ├── CompositeTools.cs     # run_sequence, click_menu_item, navigate, open app
    ├── MouseTools.cs         # Mouse control
    ├── KeyboardTools.cs      # Keyboard control
    └── ScreenTools.cs        # Screenshots, OCR tools, PrintWindow capture
```

Examples
See the examples/ folder for real-world workflows:
Visual UI Comparison — AI opens an HTML design and a Flutter app side by side, clicks through both, and identifies every visual difference
Automated UI Testing — AI tests login flows, form validation, and navigation by clicking through any app — no test scripts needed
Multi-App Workflows — AI orchestrates across browser, code editor, database tool, and desktop apps in a single workflow
Quick Install
Option A: Download pre-built binary
Download from Releases
Extract the zip
Add to Claude Code:
```
claude mcp add desktop-control -- "C:\path\to\DesktopControlMcp.exe"
```

Option B: Build from source
```
git clone https://github.com/amichail-1/Orbination-AI-Desktop-Vision-Control.git
cd Orbination-AI-Desktop-Vision-Control/DesktopControlMcp
dotnet build -c Release
claude mcp add desktop-control -- "bin\Release\net8.0-windows\DesktopControlMcp.exe"
```

Contributing
Contributions welcome. Open an issue or submit a PR.
License
MIT