Screen Agent
Allows AI agents to view and interact with the Figma desktop application through screen capture and UI automation, including mouse clicks, keyboard input, and scrolling, when the app is added to the allowlist.
AI-native test agent that sees your app like a real user — 15x faster than Claude Code, without touching your screen.
An MCP server for autonomous visual testing. The AI plans test steps in natural language, and the server executes them all without LLM round-trips. Works in the background via CDP (Chrome) or Accessibility API (native apps).
Quick Demo
# The AI plans. The server executes. No LLM round-trips. Background. 3 seconds.
run_test(name="Login Flow", steps=[
    {"find": "Email", "action": "click_and_type", "text": "user@test.com"},
    {"find": "Password", "action": "click_and_type", "text": "secret123"},
    {"find": "Log in", "action": "click"},
    {"verify": "Dashboard"},
])
# → ✅ 4/4 passed in 800ms. Screenshot evidence attached.
Why?
Every testing tool makes you choose: fast but fragile (Playwright) or smart but slow (Claude Code computer use). Screen Agent is both:
Autonomous Execution: run_test() executes ALL steps server-side. No LLM round-trips. 150ms/step vs Claude Code's 1-3s/step. 15x faster.
Vision-First: the LLM SEES the screen and decides where to click. Not DOM selectors. UI changes don't break tests because the LLM re-interprets the screen.
act + eval_js: act returns a screenshot for the LLM to analyze visually, then executes at LLM-provided coordinates. eval_js runs JavaScript via CDP for assertions. 5 tests in 0.6s.
Background Testing: window_scope + CDP lets you test Chrome apps on any macOS Space without touching the user's screen. For native apps, tests behind other windows on the same Space.
Multi-Backend Input Chain: three input methods (Accessibility API → CGEvent → pyautogui) with automatic fallback. Works with native apps, Electron apps, and game engines.
Input Guardian: real-time safety system that pauses all agent actions when you touch your mouse or keyboard. No other tool provides this.
Cross-App Workflows: test flows spanning multiple apps (email → browser → Slack). No other tool can do this because they're all single-app.
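To make the "no LLM round-trips" idea concrete, here is a minimal sketch of how a server might walk a step plan shaped like the one in the Quick Demo. It is not Screen Agent's implementation: `FakeScreen`, `StepResult`, and `run_plan` are hypothetical names, and the screen is simulated with an in-memory dict instead of OCR or CDP.

```python
from dataclasses import dataclass, field


@dataclass
class StepResult:
    step: dict
    passed: bool
    detail: str = ""


@dataclass
class FakeScreen:
    """Hypothetical stand-in for a real screen: visible labels mapped to typed text."""
    labels: dict = field(default_factory=dict)
    log: list = field(default_factory=list)

    def find(self, label: str) -> bool:
        return label in self.labels

    def click(self, label: str) -> None:
        self.log.append(("click", label))

    def type_text(self, label: str, text: str) -> None:
        self.labels[label] = text
        self.log.append(("type", label, text))


def run_plan(screen: FakeScreen, steps: list) -> list:
    """Execute every step locally; no planner call happens inside the loop."""
    results = []
    for step in steps:
        if "verify" in step:
            ok = screen.find(step["verify"])
            results.append(StepResult(step, ok, "" if ok else "text not found"))
            continue
        target = step["find"]
        if not screen.find(target):
            results.append(StepResult(step, False, "target not found"))
            continue
        screen.click(target)
        if step["action"] == "click_and_type":
            screen.type_text(target, step["text"])
        results.append(StepResult(step, True))
    return results


screen = FakeScreen(labels={"Email": "", "Password": "", "Log in": "", "Dashboard": ""})
plan = [
    {"find": "Email", "action": "click_and_type", "text": "user@test.com"},
    {"find": "Password", "action": "click_and_type", "text": "secret123"},
    {"find": "Log in", "action": "click"},
    {"verify": "Dashboard"},
]
results = run_plan(screen, plan)
print(f"{sum(r.passed for r in results)}/{len(results)} passed")
```

This sketch keeps going after a failed step so the report covers the whole plan; whether the real server stops at the first failure is not stated in this README.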
Architecture
┌──────────────────────────────────┐
│ MCP Layer │ 22 tools via Model Context Protocol
├──────────────────────────────────┤
│ Engine Layer │ InputChain (fallback) + Guardian (safety)
│ │ + WindowSession (background testing)
├──────────────────────────────────┤
│ Platform Layer │ Protocol-based backends
│ AX → CGEvent → pyautogui │ macOS / Windows / Linux
└──────────────────────────────────┘
Input Backend Chain
The core design challenge: pyautogui works for ~80% of apps but fails for game engines and many Electron apps. Screen Agent solves this with a Chain of Responsibility pattern:
Priority | Backend | Method | Best For |
1 | AX | | Native macOS apps: semantic, no coordinates needed |
2 | CGEvent | | Games, Electron: native OS event injection |
3 | pyautogui | Python wrapper | Cross-platform fallback |
Each backend implements the same InputBackend protocol. If one fails, the chain automatically tries the next. All attempts are logged with telemetry for observability.
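The fallback behaviour described above can be pictured with a small Chain of Responsibility sketch. Only the `InputBackend` protocol name comes from this README. The `click` signature, the `InputChain` class body, the `attempts` log, and the two dummy backends are assumptions made for illustration, not the project's actual code.

```python
from typing import Protocol


class InputBackend(Protocol):
    """Assumed shape of the shared protocol: one click method plus a name."""
    name: str

    def click(self, x: int, y: int) -> bool:
        """Return True if the click was delivered."""
        ...


class FailingBackend:
    """Dummy backend that always raises, standing in for an unavailable API."""

    def __init__(self, name: str) -> None:
        self.name = name

    def click(self, x: int, y: int) -> bool:
        raise RuntimeError(f"{self.name} unavailable")


class WorkingBackend:
    """Dummy backend that always succeeds."""

    def __init__(self, name: str) -> None:
        self.name = name

    def click(self, x: int, y: int) -> bool:
        return True


class InputChain:
    def __init__(self, backends: list) -> None:
        self.backends = backends
        self.attempts = []  # (backend name, outcome) pairs, a stand-in for telemetry

    def click(self, x: int, y: int) -> str:
        """Try each backend in priority order; return the name of the one that worked."""
        for backend in self.backends:
            try:
                if backend.click(x, y):
                    self.attempts.append((backend.name, "ok"))
                    return backend.name
                self.attempts.append((backend.name, "declined"))
            except Exception as exc:
                self.attempts.append((backend.name, f"error: {exc}"))
        raise RuntimeError("all input backends failed")


chain = InputChain([
    FailingBackend("ax"),
    FailingBackend("cgevent"),
    WorkingBackend("pyautogui"),
])
used = chain.click(100, 200)
print(used, chain.attempts)
```

Recording every attempt, including the failures, is what makes the fallback observable: the log shows which backends were skipped and why.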
Install
pip install screen-agent
# Recommended: install macOS native backends
pip install "screen-agent[macos]"
Quick Start
With Claude Code
claude mcp add screen -- screen-agent serve
With Cursor / other MCP clients
Add to your MCP config:
{
  "mcpServers": {
    "screen": {
      "command": "screen-agent",
      "args": ["serve"]
    }
  }
}
Check system capabilities
screen-agent check
Tools
Perception
Tool | Description |
| Screenshot (full or region), returns image for vision analysis |
| List all visible windows with positions |
| Currently focused window |
| Current mouse position |
Input (all support verify: true for post-action screenshots)
Tool | Description |
| Click at coordinates (left/right/middle, multi-click) |
| Type text at cursor (Unicode via clipboard on macOS) |
| Key press with modifiers (e.g., Cmd+C) |
| Scroll wheel at optional position |
| Move cursor without clicking |
| Click-drag between two points |
| Bring window to front by partial title match |
OCR (auto-detects Chinese, Japanese, Korean, English)
Tool | Description |
| Extract all text with bounding boxes |
| Find text and return location |
| Find text and click its center |
Autonomous Testing (the differentiator)
Tool | Description |
run_test | Execute a full test plan autonomously, with no LLM round-trips. 15x faster. |
act | Vision-first: returns screenshot → LLM looks → executes at coordinates |
eval_js | Execute JavaScript via CDP. DOM assertions, element clicks, state checks |
interact | OCR-based: find element by text + click/type in one call |
Background Testing
Tool | Description |
window_scope | Lock to a window. Chrome: auto-CDP (any Space). Native: CGWindowList (same Space). |
window_release | Release window scope, return to full-screen mode |
Visual E2E Testing
Tool | Description |
| Start a test session with automatic screenshot collection |
| Begin a test step (auto-captures "before" screenshot) |
| Verify step via OCR text check or screenshot diff |
| End session, generate markdown report with evidence |
| Current session status |
Safety (Input Guardian)
Tool | Description |
add_app | Add app to allowlist: agent can ONLY interact with listed apps |
| Remove from allowlist |
set_region | Restrict to pixel region |
| Remove all restrictions |
| Guardian state, backend stats, scope info |
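The allowlist and region restrictions in the table above amount to a check that runs before any input is sent. The following is a minimal sketch of such a check, assuming hypothetical `Region` and `Scope` classes and assuming that an empty allowlist means "no app restriction"; the README does not say how Screen Agent represents scope internally.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Region:
    x: int
    y: int
    width: int
    height: int

    def contains(self, px: int, py: int) -> bool:
        # Right and bottom edges are exclusive, as with pixel rectangles.
        return (self.x <= px < self.x + self.width
                and self.y <= py < self.y + self.height)


@dataclass
class Scope:
    apps: set = field(default_factory=set)
    region: Optional[Region] = None

    def allows(self, app: str, px: int, py: int) -> bool:
        # Assumption: an empty allowlist places no restriction on apps.
        if self.apps and app not in self.apps:
            return False
        if self.region is not None and not self.region.contains(px, py):
            return False
        return True


scope = Scope(apps={"Chrome", "Figma"}, region=Region(0, 0, 800, 600))
print(scope.allows("Chrome", 10, 10))
print(scope.allows("Slack", 10, 10))
print(scope.allows("Figma", 900, 10))
```

With the scope above, the first call is permitted, the second is rejected because Slack is not on the allowlist, and the third is rejected because the point lies outside the 800×600 region.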
Background Testing
Screen Agent can test applications without occupying your screen. Three modes, auto-selected:
Mode 1: CDP (Chrome/Electron — any Space, fully invisible)
# Start Chrome with debugging port
/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome \
  --remote-debugging-port=9222 --user-data-dir=/tmp/chrome-test
# Connect (works even if Chrome is on a different desktop)
window_scope(app="Chrome", url="localhost:3000")
# All operations go through Chrome's internal pipeline
interact(target="Submit", action="click")
interact(target="Email", action="click_and_type", text="test@example.com")
window_release()
CDP bypasses the macOS window server entirely. Screenshots come from Chrome's renderer, clicks go through Chrome's input system. Your screen is never touched.
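Before scoping to Chrome, it can help to confirm that the debugging port is actually reachable. Chrome started with `--remote-debugging-port` serves a small JSON document at `/json/version`. The sketch below reads it with the standard library only. The helper names `parse_version_payload` and `probe_cdp` are hypothetical and not part of Screen Agent, and the sample payload is made up so that the parsing can be shown offline.

```python
import json
from urllib.request import urlopen


def parse_version_payload(raw: str) -> dict:
    """Pull out the fields most useful for a connectivity check."""
    data = json.loads(raw)
    return {
        "browser": data.get("Browser", "unknown"),
        "ws_url": data.get("webSocketDebuggerUrl"),
    }


def probe_cdp(port: int = 9222, timeout: float = 2.0) -> dict:
    """Query Chrome's /json/version endpoint on localhost (raises if unreachable)."""
    url = f"http://127.0.0.1:{port}/json/version"
    with urlopen(url, timeout=timeout) as resp:
        return parse_version_payload(resp.read().decode("utf-8"))


# Offline illustration with a made-up payload; call probe_cdp() against a live Chrome.
sample = (
    '{"Browser": "Chrome/0.0.0.0", '
    '"webSocketDebuggerUrl": "ws://127.0.0.1:9222/devtools/browser/example"}'
)
info = parse_version_payload(sample)
print(info["browser"], info["ws_url"])
```

If `probe_cdp()` raises a connection error, Chrome was most likely started without the debugging flag or on a different port.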
Mode 2: Window Capture (any macOS app — same Space)
# Works with Figma, Xcode, Terminal, games — any app
window_scope(app="Figma", title="Design v2")
interact(target="Export", action="click")
window_release()
Uses CGWindowListCreateImage to capture the window even when behind other apps. Requires same macOS Space.
Mode 3: Full Screen (original)
Without window_scope, operates on the full screen as before.
Fallback Priority
window_scope called → try CDP (Chrome) → try CGWindowList (same Space) → error
no scope → full screen mode
Input Guardian
Screen Agent's unique safety system with two guarantees:
User Priority — any keyboard/mouse activity instantly pauses the agent. It resumes only after you've been idle for 1.5s (configurable).
Scope Lock — restrict the agent to specific apps and/or screen regions.
# Agent can only interact with Chrome and Figma
add_app("Chrome")
add_app("Figma")
# Or restrict to a region
set_region(x=0, y=0, width=800, height=600)
Configuration
All parameters are configurable via environment variables:
Variable | Default | Description |
| 1.5 | Guardian cooldown seconds |
| 0 | Set to "1" to disable |
| ax,cgevent,pyautogui | Backend priority order |
| 2560 | Max screenshot dimension |
| INFO | Logging level |
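The Guardian cooldown (1.5 seconds by default) boils down to a timestamp comparison: the agent may act only once enough time has passed since the last user input. Here is a minimal sketch of that rule, assuming a hypothetical `IdleGate` class with an injectable clock so the behaviour can be checked without waiting. Detecting real keyboard and mouse events is platform-specific and is left out.

```python
import time


class IdleGate:
    """Blocks agent actions until the user has been idle for `cooldown` seconds."""

    def __init__(self, cooldown: float = 1.5, clock=time.monotonic) -> None:
        self.cooldown = cooldown
        self._clock = clock
        self._last_user_input = None  # no user activity seen yet

    def on_user_input(self) -> None:
        """Call this whenever a keyboard or mouse event from the user is observed."""
        self._last_user_input = self._clock()

    def agent_may_act(self) -> bool:
        if self._last_user_input is None:
            return True
        return self._clock() - self._last_user_input >= self.cooldown


# Demonstration with a fake clock instead of real time.
now = [0.0]
gate = IdleGate(cooldown=1.5, clock=lambda: now[0])
gate.on_user_input()       # user touches the mouse at t=0
now[0] = 1.0
print(gate.agent_may_act())  # still inside the cooldown window
now[0] = 2.0
print(gate.agent_may_act())  # idle long enough, agent may resume
```

Using a monotonic clock avoids surprises when the system time is adjusted while a test is running.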
Platform Support
Feature | macOS | Windows | Linux |
Screenshot | mss | mss | mss |
AX Input | Quartz AX | - | - |
CGEvent Input | Quartz | - | - |
pyautogui Input | fallback | fallback | fallback |
Window Management | AppleScript | - | wmctrl |
OCR | Vision Framework | - | - |
Retina Scaling | auto-detect | - | - |
Window Capture | CGWindowListCreateImage | PrintWindow | xdotool+ImageMagick |
Development
git clone https://github.com/chriswu727/screen-agent
cd screen-agent
pip install -e ".[dev,macos]"
pytest tests/unit/ -v
ruff check src/ tests/
See DEVPATH.md for development history and architectural decisions.
License
MIT