Screen Agent

AI-native test agent that sees your app like a real user — 15x faster than Claude Code, without touching your screen.

An MCP server for autonomous visual testing. The AI plans test steps in natural language; the server then executes all of them without LLM round-trips. It works in the background via CDP (Chrome) or the Accessibility API (native apps).

Quick Demo

# The AI plans. The server executes. No LLM round-trips. Background. 3 seconds.
run_test(name="Login Flow", steps=[
    {"find": "Email",    "action": "click_and_type", "text": "user@test.com"},
    {"find": "Password", "action": "click_and_type", "text": "secret123"},
    {"find": "Log in",   "action": "click"},
    {"verify": "Dashboard"},
])
# → ✅ 4/4 passed in 800ms. Screenshot evidence attached.
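
Because the plan is plain data, a simple loop can run it from start to finish without consulting the model between steps. The sketch below illustrates that idea only. `FakeScreen`, `execute_plan`, and the `(passed, total)` return shape are hypothetical names made up for this example; they are not Screen Agent's actual implementation.

```python
from dataclasses import dataclass, field


@dataclass
class FakeScreen:
    """Hypothetical stand-in for a real screen: visible labels plus typed text."""
    labels: set
    typed: dict = field(default_factory=dict)

    def find(self, label):
        # A real server would locate the label via OCR, CDP, or accessibility data.
        return label in self.labels

    def click(self, label):
        pass  # a real backend would inject a click here

    def type_text(self, label, text):
        self.typed[label] = text


def execute_plan(screen, steps):
    """Run every step in order and return (passed, total) with no model calls."""
    passed = 0
    for step in steps:
        if "verify" in step:
            ok = screen.find(step["verify"])
        else:
            ok = screen.find(step["find"])
            if ok:
                screen.click(step["find"])
                if step["action"] == "click_and_type":
                    screen.type_text(step["find"], step["text"])
        if not ok:
            break  # stop at the first failing step
        passed += 1
    return passed, len(steps)


login_plan = [
    {"find": "Email", "action": "click_and_type", "text": "user@test.com"},
    {"find": "Password", "action": "click_and_type", "text": "secret123"},
    {"find": "Log in", "action": "click"},
    {"verify": "Dashboard"},
]
```

Calling `execute_plan(FakeScreen(labels={"Email", "Password", "Log in", "Dashboard"}), login_plan)` walks all four steps in one call, which is the property the demo above relies on.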

Why?

Every testing tool makes you choose: fast but fragile (Playwright) or smart but slow (Claude Code computer use). Screen Agent is both:

  • Autonomous Execution: run_test() executes ALL steps server-side. No LLM round-trips. 150ms/step vs Claude Code's 1-3s/step. 15x faster.

  • Vision-First: the LLM SEES the screen and decides where to click. Not DOM selectors. UI changes don't break tests because the LLM re-interprets the screen.

  • act + eval_js: act returns a screenshot for the LLM to analyze visually, then executes at LLM-provided coordinates. eval_js runs JavaScript via CDP for assertions. 5 tests in 0.6s.

  • Background Testing: window_scope + CDP lets you test Chrome apps on any macOS Space without touching the user's screen. For native apps, it tests behind other windows on the same Space.

  • Multi-Backend Input Chain: three input methods (Accessibility API → CGEvent → pyautogui) with automatic fallback. Works with native apps, Electron apps, and game engines.

  • Input Guardian: real-time safety system that pauses all agent actions when you touch your mouse or keyboard. No other tool provides this.

  • Cross-App Workflows: test flows spanning multiple apps (email → browser → Slack). No other tool can do this because they're all single-app.

Architecture

┌──────────────────────────────────┐
│          MCP Layer               │  22 tools via Model Context Protocol
├──────────────────────────────────┤
│          Engine Layer            │  InputChain (fallback) + Guardian (safety)
│                                  │  + WindowSession (background testing)
├──────────────────────────────────┤
│        Platform Layer            │  Protocol-based backends
│  AX → CGEvent → pyautogui       │  macOS / Windows / Linux
└──────────────────────────────────┘

Input Backend Chain

The core design challenge: pyautogui works for ~80% of apps but fails for game engines and many Electron apps. Screen Agent solves this with a Chain of Responsibility pattern:

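| Priority | Backend   | Method          | Best For                                           |
|----------|-----------|-----------------|----------------------------------------------------|
| 1        | AX        | AXPerformAction | Native macOS apps: semantic, no coordinates needed |
| 2        | CGEvent   | CGEventPost     | Games, Electron: native OS event injection         |
| 3        | pyautogui | Python wrapper  | Cross-platform fallback                            |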

Each backend implements the same InputBackend protocol. If one fails, the chain automatically tries the next. All attempts are logged with telemetry for observability.
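
A minimal sketch of such a chain follows. The `InputBackend` name comes from the text above, but its `click(x, y) -> bool` signature, the `InputChain` class, and the two stub backends are assumptions made for illustration; the real protocol may look different.

```python
import logging
from typing import Protocol

log = logging.getLogger("input_chain_sketch")


class InputBackend(Protocol):
    """Assumed protocol shape: one method that reports whether input was delivered."""
    name: str

    def click(self, x: int, y: int) -> bool:
        ...


class FailingBackend:
    """Stub backend that never manages to deliver the click."""
    def __init__(self, name):
        self.name = name

    def click(self, x, y):
        return False


class WorkingBackend:
    """Stub backend that always delivers the click."""
    def __init__(self, name):
        self.name = name

    def click(self, x, y):
        return True


class InputChain:
    """Try each backend in priority order and record every attempt."""
    def __init__(self, backends):
        self.backends = list(backends)
        self.attempts = []  # (backend name, success) pairs, a stand-in for telemetry

    def click(self, x, y):
        for backend in self.backends:
            try:
                ok = backend.click(x, y)
            except Exception:
                log.warning("backend %s raised", backend.name, exc_info=True)
                ok = False
            self.attempts.append((backend.name, ok))
            if ok:
                return backend.name  # the first backend that succeeds handles the request
        raise RuntimeError("all input backends failed")
```

With `InputChain([FailingBackend("ax"), WorkingBackend("cgevent"), WorkingBackend("pyautogui")])`, a click is handled by the second backend and the third is never tried.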

Install

pip install screen-agent

# Recommended: install macOS native backends
pip install screen-agent[macos]

Quick Start

With Claude Code

claude mcp add screen -- screen-agent serve

With Cursor / other MCP clients

Add to your MCP config:

{
  "mcpServers": {
    "screen": {
      "command": "screen-agent",
      "args": ["serve"]
    }
  }
}

Check system capabilities

screen-agent check

Tools

Perception

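| Tool                | Description                                                    |
|---------------------|----------------------------------------------------------------|
| capture_screen      | Screenshot (full or region), returns image for vision analysis |
| list_windows        | List all visible windows with positions                        |
| get_active_window   | Currently focused window                                       |
| get_cursor_position | Current mouse position                                         |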

Input (all support verify: true for post-action screenshots)

| Tool         | Description                                           |
|--------------|-------------------------------------------------------|
| click        | Click at coordinates (left/right/middle, multi-click) |
| type_text    | Type text at cursor (Unicode via clipboard on macOS)  |
| press_key    | Key press with modifiers (e.g., Cmd+C)                |
| scroll       | Scroll wheel at optional position                     |
| move_mouse   | Move cursor without clicking                          |
| drag         | Click-drag between two points                         |
| focus_window | Bring window to front by partial title match          |

OCR (auto-detects Chinese, Japanese, Korean, English)

| Tool       | Description                          |
|------------|--------------------------------------|
| ocr        | Extract all text with bounding boxes |
| find_text  | Find text and return location        |
| click_text | Find text and click its center       |

Autonomous Testing (the differentiator)

| Tool     | Description                                                                 |
|----------|-----------------------------------------------------------------------------|
| run_test | Execute a full test plan autonomously, with no LLM round-trips. 15x faster. |
| act      | Vision-first: returns screenshot → LLM looks → executes at coordinates      |
| eval_js  | Execute JavaScript via CDP. DOM assertions, element clicks, state checks    |
| interact | OCR-based: find element by text + click/type in one call                    |

Background Testing

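| Tool           | Description                                                                        |
|----------------|------------------------------------------------------------------------------------|
| window_scope   | Lock to a window. Chrome: auto-CDP (any Space). Native: CGWindowList (same Space). |
| window_release | Release window scope, return to full-screen mode                                   |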

Visual E2E Testing

| Tool        | Description                                               |
|-------------|-----------------------------------------------------------|
| test_start  | Start a test session with automatic screenshot collection |
| test_step   | Begin a test step (auto-captures "before" screenshot)     |
| test_verify | Verify step via OCR text check or screenshot diff         |
| test_end    | End session, generate markdown report with evidence       |
| test_status | Current session status                                    |

Safety (Input Guardian)

| Tool             | Description                                                       |
|------------------|-------------------------------------------------------------------|
| add_app          | Add app to allowlist: agent can ONLY interact with listed apps    |
| remove_app       | Remove from allowlist                                             |
| set_region       | Restrict to pixel region                                          |
| clear_scope      | Remove all restrictions                                           |
| get_agent_status | Guardian state, backend stats, scope info                         |

Background Testing

Screen Agent can test applications without occupying your screen. Three modes, auto-selected:

Mode 1: CDP (Chrome/Electron — any Space, fully invisible)

# Start Chrome with debugging port
/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome \
  --remote-debugging-port=9222 --user-data-dir=/tmp/chrome-test

# Connect: works even if Chrome is on a different desktop
window_scope(app="Chrome", url="localhost:3000")

# All operations go through Chrome's internal pipeline
interact(target="Submit", action="click")
interact(target="Email", action="click_and_type", text="test@example.com")

window_release()

CDP bypasses the macOS window server entirely. Screenshots come from Chrome's renderer, clicks go through Chrome's input system. Your screen is never touched.

Mode 2: Window Capture (any macOS app — same Space)

# Works with Figma, Xcode, Terminal, games — any app
window_scope(app="Figma", title="Design v2")
interact(target="Export", action="click")
window_release()

Uses CGWindowListCreateImage to capture the window even when behind other apps. Requires same macOS Space.

Mode 3: Full Screen (original)

Without window_scope, the agent operates on the full screen (the original behavior).

Fallback Priority

window_scope called → try CDP (Chrome) → try CGWindowList (same Space) → error
no scope → full screen mode
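
The fallback order above can be expressed as a small decision function. This is an illustrative sketch only: `Mode`, `select_mode`, and the boolean inputs are hypothetical names, and the real server presumably probes Chrome and the window list itself rather than taking flags.

```python
from enum import Enum


class Mode(Enum):
    CDP = "cdp"
    WINDOW_CAPTURE = "window_capture"
    FULL_SCREEN = "full_screen"


def select_mode(scope_requested: bool, cdp_available: bool, same_space: bool) -> Mode:
    """Mirror the documented priority: CDP first, then window capture, else error."""
    if not scope_requested:
        return Mode.FULL_SCREEN  # no window_scope call: operate on the full screen
    if cdp_available:
        return Mode.CDP  # Chrome reachable over the debugging port, any Space
    if same_space:
        return Mode.WINDOW_CAPTURE  # CGWindowList capture needs the same Space
    raise RuntimeError("window_scope failed: no CDP endpoint and window not on this Space")
```

Note that an explicit `window_scope` request never silently degrades to full-screen mode in this sketch; it raises instead, matching the "→ error" branch above.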

Input Guardian

Screen Agent's unique safety system with two guarantees:

  1. User Priority — any keyboard/mouse activity instantly pauses the agent. It resumes only after you've been idle for 1.5s (configurable).

  2. Scope Lock — restrict the agent to specific apps and/or screen regions.

# Agent can only interact with Chrome and Figma
add_app("Chrome")
add_app("Figma")

# Or restrict to a region
set_region(x=0, y=0, width=800, height=600)
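
The User Priority guarantee amounts to a cooldown gate: record the time of the last user input and allow agent actions only once the idle period has elapsed. The sketch below assumes a class named `IdleGate` with an injectable clock; both are hypothetical and only illustrate the timing logic, not how Screen Agent detects input.

```python
import time


class IdleGate:
    """Block agent actions until the user has been idle for `cooldown` seconds."""

    def __init__(self, cooldown=1.5, clock=time.monotonic):
        self.cooldown = cooldown
        self._clock = clock  # injectable so the logic can be tested without sleeping
        self._last_user_input = None

    def on_user_input(self):
        """Call this whenever keyboard or mouse activity is observed."""
        self._last_user_input = self._clock()

    def agent_may_act(self):
        """Return True only if no user input arrived within the cooldown window."""
        if self._last_user_input is None:
            return True
        return self._clock() - self._last_user_input >= self.cooldown
```

Using a monotonic clock rather than wall-clock time keeps the gate correct if the system time is adjusted while a test is running.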

Configuration

All parameters are configurable via environment variables:

| Variable                       | Default              | Description               |
|--------------------------------|----------------------|---------------------------|
| SCREEN_AGENT_COOLDOWN          | 1.5                  | Guardian cooldown seconds |
| SCREEN_AGENT_GUARDIAN_DISABLED | 0                    | Set to "1" to disable     |
| SCREEN_AGENT_INPUT_BACKENDS    | ax,cgevent,pyautogui | Backend priority order    |
| SCREEN_AGENT_MAX_DIMENSION     | 2560                 | Max screenshot dimension  |
| SCREEN_AGENT_LOG_LEVEL         | INFO                 | Logging level             |
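
As a rough illustration of how these variables and defaults could be read, here is a sketch using only the standard library. The variable names and default values come from the table above; the `Settings` dataclass, its field names, and `load_settings` are assumptions for this example, not the project's actual code.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    cooldown: float
    guardian_disabled: bool
    input_backends: tuple
    max_dimension: int
    log_level: str


def load_settings(env=None):
    """Build Settings from a mapping (defaults to os.environ), using documented defaults."""
    env = os.environ if env is None else env
    backends = env.get("SCREEN_AGENT_INPUT_BACKENDS", "ax,cgevent,pyautogui")
    return Settings(
        cooldown=float(env.get("SCREEN_AGENT_COOLDOWN", "1.5")),
        guardian_disabled=env.get("SCREEN_AGENT_GUARDIAN_DISABLED", "0") == "1",
        # Order matters: it is the backend priority order.
        input_backends=tuple(b.strip() for b in backends.split(",") if b.strip()),
        max_dimension=int(env.get("SCREEN_AGENT_MAX_DIMENSION", "2560")),
        log_level=env.get("SCREEN_AGENT_LOG_LEVEL", "INFO"),
    )
```

For example, `load_settings({"SCREEN_AGENT_INPUT_BACKENDS": "cgevent,pyautogui"})` would skip the AX backend while leaving every other value at its default.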

Platform Support

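| Feature           | macOS                   | Windows     | Linux               |
|-------------------|-------------------------|-------------|---------------------|
| Screenshot        | mss                     | mss         | mss                 |
| AX Input          | Quartz AX               | -           | -                   |
| CGEvent Input     | Quartz                  | -           | -                   |
| pyautogui Input   | fallback                | fallback    | fallback            |
| Window Management | AppleScript             | -           | wmctrl              |
| OCR               | Vision Framework        | -           | -                   |
| Retina Scaling    | auto-detect             | -           | -                   |
| Window Capture    | CGWindowListCreateImage | PrintWindow | xdotool+ImageMagick |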

Development

git clone https://github.com/chriswu727/screen-agent
cd screen-agent
pip install -e ".[dev,macos]"
pytest tests/unit/ -v
ruff check src/ tests/

See DEVPATH.md for development history and architectural decisions.

License

MIT
