capus
Enables testing of macOS applications through GUI interaction, using vision-based perception (OmniParser) and synthesized mouse/keyboard input.
Capus
Persona-driven LLM agent testing for macOS apps and web apps, served as
a local MCP daemon. LLM agents role-play sampled human personas
("silicon sampling") and exercise your app through the GUI — screenshots in,
real synthesized input out. The daemon is deliberately dumb (no LLM calls,
ever); the intelligence is an MCP client: Claude Code/Codex driving it
interactively, or the built-in runner started from the dashboard or
capus agents (Anthropic API key or a Claude subscription via the
claude CLI).
Two target drivers behind one identical tool surface:
- macOS (app_path = .app/executable/.py): window screenshots parsed by OmniParser v2, CGEvent mouse/keyboard. Needs the machine free during runs (one shared screen + pointer).
- Browser (app_path = http(s):// or file:// URL): one isolated headless Chromium context per persona session, so runs are truly parallel with zero desktop contention and no macOS permissions. The persona still sees a screenshot (numbered Set-of-Mark boxes), but element geometry comes from the DOM (visible, in-viewport, interactive elements with accessible names; shadow DOM and same-origin iframes traversed), input goes through CDP at real coordinates (move, then click, the same trusted path as a human mouse), and oracles get browser-grade signals: uncaught JS exceptions, console errors, failed/4xx+ requests, native dialogs, renderer crashes. OmniParser remains the automatic fallback for canvas-rendered UIs.
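As a rough illustration of the routing rule above, the sketch below classifies an app_path by its form. The helper name `pick_driver` and the returned labels are hypothetical and not part of Capus; only the accepted forms (http(s):// and file:// URLs for the browser driver, everything else for the macOS driver) come from the description above.

```python
from urllib.parse import urlparse

# URL schemes the description above assigns to the browser driver.
BROWSER_SCHEMES = {"http", "https", "file"}


def pick_driver(app_path: str) -> str:
    """Hypothetical helper: "browser" for URL targets, "macos" otherwise."""
    scheme = urlparse(app_path).scheme.lower()
    return "browser" if scheme in BROWSER_SCHEMES else "macos"


if __name__ == "__main__":
    for target in ("https://example.com/login", "/Applications/Notes.app"):
        print(target, "->", pick_driver(target))
```

Plain filesystem paths have no URL scheme, so .app bundles, executables and .py files all fall through to the macOS label.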
Output is two deliverables from one findings database:
- report.html / report.md, for the developer: findings with screenshots and repro steps, business-rule coverage matrix (personas × rules), persona journey filmstrips.
- feedback.json, for coding agents: stable finding IDs, expected vs observed, deterministic repro traces, a status queue (open/fixed/wontfix) plus MCP tools (findings_query, trace_replay, finding_update) to fix and verify autonomously.
Architecture
          the MIND (an MCP client; choose one per run)
  Claude Code / Codex          capus runner (capusd/runner.py)
  orchestrator + tester        spawned by the dashboard ▶ button or
  subagents (parallel)         `capus agents`; Anthropic API or claude CLI
                 │  MCP over streamable HTTP (127.0.0.1:7777/mcp)
                 ▼
  capusd (the BODY: dumb on purpose, zero LLM calls)
    session manager + work queue │ persona sampler (correlated human sampling)
    humanize (motor simulation: typing cadence+typos, Fitts-law mouse paths,
      click scatter, think pauses; seeded per session) │ Driver protocol
      ├─ macOS driver (Quartz windows, screenshots, CGEvent input)
      └─ browser driver (Playwright Chromium, DOM perception, CDP input)
    vision (OmniParser v2 + Apple-Vision OCR) │ oracles (crash/hang/log/no-op
      + JS/console/network/dialog for browser) │ SQLite store + artifacts │
    report generator │ dashboard (live streaming + run/persona control)
One daemon serves many parallel clients. The split is deliberate: the client decides WHAT a persona does (cognition), the daemon decides HOW it physically unfolds (motor execution), so even a fast model produces sessions that look and pace like a human when watched live.
Human realism
Personas are sampled, not invented: demographics from an overridable population spec, behavioral traits drawn from a correlated-but-noisy latent model calibrated against real data — OECD PIAAC tech-skill bands (~60% of adults are level 1 or below: the "average user" fails multi-step tasks), CDC disability prevalence, CHI'18 typing statistics (52±25 WPM, ~6% of keystrokes are corrections). Counter-stereotypical draws are guaranteed by design (the 79-year-old engineer happens; so does the low-tech 19-year-old).
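To make "correlated-but-noisy" concrete, here is a minimal sketch of such a sampler. It is not Capus's calibrated model: the `Persona` fields, the age range, the age-to-skill slope and the noise width are all illustrative assumptions. Only the typing figures (52±25 WPM) are taken from the text above, and this toy version makes counter-stereotypical draws likely rather than guaranteed.

```python
import random
from dataclasses import dataclass


@dataclass
class Persona:
    age: int
    tech_skill: float   # 0.0 = struggles with multi-step tasks, 1.0 = expert
    typing_wpm: float


def sample_persona(rng: random.Random) -> Persona:
    """Draw one persona from a toy correlated-but-noisy model (assumed numbers)."""
    age = rng.randint(18, 85)
    # Weak downward age trend (assumed slope) plus wide noise, so a highly
    # skilled 79-year-old or a low-tech 19-year-old can still be drawn.
    trend = 0.6 - 0.004 * (age - 18)
    tech_skill = min(1.0, max(0.0, rng.gauss(trend, 0.25)))
    # Typing speed: mean 52 WPM, sd 25 WPM as quoted above, floored at 5 WPM.
    typing_wpm = max(5.0, rng.gauss(52.0, 25.0))
    return Persona(age, tech_skill, typing_wpm)


if __name__ == "__main__":
    rng = random.Random(42)  # a fixed seed makes the panel reproducible
    for _ in range(3):
        print(sample_persona(rng))
```

Passing an explicit `random.Random` instance, rather than using the module-level generator, is what keeps a sampled panel reproducible from its seed.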
Each persona compiles into:
- a behavior contract (task_claim returns it): a first-person system prompt covering identity, reading style, patience, blame attribution, quirks, and hard anti-"superuser" rules (satisfice, knowledge limits, giving up is valid data), plus a 3-line persona_reminder to re-inject every few turns against persona drift;
- a motor profile the daemon enforces mechanically: typing WPM with corrected typo bursts, Fitts's-law mouse movement (curved Bezier paths, bell velocity, terminal overshoot), click scatter inside targets, hesitation pauses. Seeded per session, so reproducible.
pacing: "fast" switches it all off for CI throughput.
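The motor profile's mouse movement can be sketched in a few lines. This is not the daemon's implementation: the function names are hypothetical, and the Fitts constants `a` and `b` and the `bend` factor are illustrative values. It covers only the movement-time formula and a curved path; bell-shaped velocity and terminal overshoot are left out.

```python
import math


def fitts_time(distance: float, width: float, a: float = 0.1, b: float = 0.15) -> float:
    """Movement time in seconds, Shannon form: MT = a + b * log2(D / W + 1)."""
    return a + b * math.log2(distance / width + 1.0)


def bezier_path(start, end, bend: float = 0.2, steps: int = 20):
    """Points on a quadratic Bezier curve from start to end.

    The control point sits at the midpoint, pushed perpendicular to the
    straight line by `bend` times the travel distance, which curves the path.
    """
    (x0, y0), (x1, y1) = start, end
    dx, dy = x1 - x0, y1 - y0
    cx = (x0 + x1) / 2 - dy * bend
    cy = (y0 + y1) / 2 + dx * bend
    points = []
    for i in range(steps + 1):
        t = i / steps
        x = (1 - t) ** 2 * x0 + 2 * (1 - t) * t * cx + t ** 2 * x1
        y = (1 - t) ** 2 * y0 + 2 * (1 - t) * t * cy + t ** 2 * y1
        points.append((x, y))
    return points


if __name__ == "__main__":
    path = bezier_path((100, 100), (600, 400))
    duration = fitts_time(distance=math.dist((100, 100), (600, 400)), width=24)
    print(f"{len(path)} points over {duration:.2f}s")
```

A caller would spread the returned points over the Fitts duration, so small or distant targets take visibly longer to reach than large nearby ones.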
macOS targets: perception AND targeting are pure screenshot vision (OmniParser); actuation is real synthesized mouse/keyboard events (move-then-click, like a human) at the vision-derived coordinates, serialized through a global input mutex, so on one attended machine sessions interleave rather than run in parallel. Emergency aborts: touch /tmp/capus.stop, hold cmd+option+ctrl+escape, or slam the pointer into a screen corner. For unattended, truly parallel runs, a VM/container isolation tier is on the roadmap.
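The serialization described above can be pictured as one process-wide lock around every input action, with the stop file checked before each action. The sketch below is an assumption about the shape of that logic, not the daemon's code: `perform_action`, the log format and the point at which the stop file is checked are all hypothetical. Only the /tmp/capus.stop path comes from the text.

```python
import threading
from pathlib import Path

# One physical pointer and keyboard, so one lock shared by all sessions.
INPUT_LOCK = threading.Lock()
DEFAULT_STOP_FILE = Path("/tmp/capus.stop")


def perform_action(session: str, action: str, log: list,
                   stop_file: Path = DEFAULT_STOP_FILE) -> bool:
    """Run one input action under the global lock; refuse if the stop file exists."""
    with INPUT_LOCK:
        if stop_file.exists():
            return False  # emergency abort requested by the operator
        log.append((session, action, "start"))
        # ... real input synthesis (move, then click) would happen here ...
        log.append((session, action, "end"))
        return True


if __name__ == "__main__":
    log: list = []
    workers = [
        threading.Thread(target=perform_action, args=(name, "click", log))
        for name in ("persona-a", "persona-b")
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    print(log)
```

Because each start/end pair is written while the lock is held, actions from different sessions interleave in the log but never overlap.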
Browser targets: each session gets its own incognito Chromium context
(own storage, own pointer) — no input mutex, no contention with you or with
other sessions; the isolation problem the macOS tier needs VMs for simply
doesn't exist. Headless by default; pass headless: false to run_create to
watch.
Install
One line:
curl -fsSL https://raw.githubusercontent.com/DanielBirk04/capus/main/scripts/install.sh | sh
That installs uv if needed, installs capus as
an isolated tool (with web-target support out of the box), and runs the
guided capus setup wizard: environment checks, headless Chromium,
optional native-app permissions/models, one-click Claude Code wiring (MCP +
plugin), then starts the daemon and opens the dashboard. Already have uv?
uv tool install 'capus[browser]' && capus setup
The dashboard's Setup page (http://127.0.0.1:7777/#/setup) mirrors the
same checklist with live re-checks and action buttons; the first run lands
there automatically.
Web (URL) targets need no macOS permissions at all. Native macOS apps
are the advanced path: Screen Recording + Accessibility for the app hosting
the daemon (your terminal — restart it after granting) and the OmniParser
vision extra (pip install 'capus[vision]' + capus models download,
~1.5 GB). Without it the daemon still works for browser targets (DOM
perception) and in OCR-only degraded mode for macOS targets.
Development install
uv venv --python 3.12 .venv
uv pip install --python .venv/bin/python -e '.[browser]' # web targets
uv pip install --python .venv/bin/python -e '.[vision]' # + OmniParser deps (heavy)
.venv/bin/playwright install chromium # browser binary
.venv/bin/capus models download # OmniParser v2 weights (~1.5 GB)
.venv/bin/capus doctor # permissions check
Run
capus serve --open # daemon + dashboard (--open pops the browser)
# register in Claude Code (capus setup / the Setup page do this for you):
claude mcp add --transport http capus http://127.0.0.1:7777/mcp
Control dashboard: open http://127.0.0.1:7777/ in a browser while the
daemon runs. It streams the live screen each agent is driving, lists every
run with its findings, and lets you drill into a run (persona cards,
findings with screenshots and the exact pages where each problem appears) or
into a session's full reasoning trace — every step's screenshot, the action,
and the agent's intent ("why, in persona voice"), plus the persona's exit
verdict and per-session model/cost. Every run and session view has
Copy Markdown / Download .md so the whole thing is documented and
portable.
The dashboard is also the control room: ▶ New run configures and
launches a run (target, goals, persona count/seed or hand-picked library
personas, pacing, model, parallel workers) and can start the built-in agent
runner with one click; Personas manages the persona library — sample
new ones, edit names/backstories (first-person interview style conditions
best), and preview each persona's compiled behavior contract. Runs created
from chat (Claude Code) appear the same way and can be started/stopped from
either side. POST endpoints honor CAPUS_DASHBOARD_TOKEN (Bearer) when set.
Credentials manages a local vault of accounts the personas may need
(staging logins etc.): attach credential sets to a run in the New-Run form
and each persona receives them with its assignment and signs in naturally.
Secret values (keys containing password/secret/token/pin/key/otp) are
masked out of every recorded trace, repro step and report — agents type
them, records show {{secret:…}} placeholders, and trace_replay resolves
them back at replay time. Use test accounts, not production ones.
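The masking rule above is simple enough to sketch. This is an illustration under stated assumptions, not the vault's code: the function names are hypothetical, matching is assumed to be case-insensitive substring matching, and the placeholder is written literally as `{{secret:…}}` because the text does not say what goes inside it.

```python
# Key fragments that mark a field as secret, as listed above.
SECRET_MARKERS = ("password", "secret", "token", "pin", "key", "otp")
# Literal stand-in; the real placeholder body is not specified in this document.
PLACEHOLDER = "{{secret:…}}"


def is_secret_key(key: str) -> bool:
    """True if the field name contains any marker (assumed case-insensitive)."""
    lowered = key.lower()
    return any(marker in lowered for marker in SECRET_MARKERS)


def mask_record(fields: dict) -> dict:
    """Return a copy of `fields` with secret values replaced by the placeholder."""
    return {k: (PLACEHOLDER if is_secret_key(k) else v) for k, v in fields.items()}


if __name__ == "__main__":
    # Made-up test account for illustration only.
    print(mask_record({"username": "qa@example.test", "password": "hunter2"}))
```

Substring matching errs on the side of masking: a field named `shipping` contains `pin` and would be masked too, which is worth keeping in mind when naming non-secret fields.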
Sessions have no fixed step limit — a persona keeps going as long as it makes progress (the daemon stops a session only after 25 consecutive zero-change actions, or at the 10000-step runaway cap). Patience is still personal: low-patience personas give up early because they choose to. Parallel workers are automatically capped at the number of queued sessions.
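The two stop conditions and the worker cap can be written down directly. The sketch below assumes each step is summarized by a single "did the UI change" flag; the function names and that representation are hypothetical, while the numbers 25 and 10000 come from the paragraph above.

```python
from typing import Sequence

MAX_NOOP_STREAK = 25    # consecutive zero-change actions before the daemon stops
RUNAWAY_CAP = 10_000    # hard ceiling on steps per session


def should_stop(changes: Sequence[bool]) -> bool:
    """changes[i] is True when step i visibly changed the UI."""
    if len(changes) >= RUNAWAY_CAP:
        return True
    streak = 0
    for changed in reversed(changes):   # count trailing no-op steps
        if changed:
            break
        streak += 1
    return streak >= MAX_NOOP_STREAK


def effective_workers(requested: int, queued_sessions: int) -> int:
    """Parallel workers never exceed the number of queued sessions."""
    return max(0, min(requested, queued_sessions))


if __name__ == "__main__":
    print(should_stop([True] * 40 + [False] * 25))
    print(effective_workers(requested=8, queued_sessions=3))
```

Note that only a trailing streak counts: any step that changes the screen resets it, so a slow but progressing persona is never cut off before the runaway cap.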
⇪ Share with coding agent (run view) hands findings to the agent that
will fix them: pick the findings, point at your project folder, optionally
add instructions, then choose where it lands —
VS Code (recommended: the agent spawns in the project folder and starts
working immediately, connected to the capus MCP so it can trace_replay
fixes and mark findings fixed; VS Code opens there so you can watch and
resume the chat from the Claude Code panel — status also streams into the
run page's Handoffs panel), Claude desktop app (a new chat via
claude:// deep link, pre-filled with the brief; plain chat, not
folder-bound), or ⧉ Copy brief (paste into any coding agent).
Model/effort for agents and handoffs work exactly like a normal chat: the default inherits your own Claude settings (model AND effort), or pick any current model (Fable 5, Opus, Sonnet, Haiku, 1M-context variants) and an effort level (low → max) per run/handoff.
A note on cost: the runner's auto backend prefers the claude CLI, which
runs on your Claude subscription — no API key is billed. The $ figures
shown for such sessions (marked sub/≈) are the nominal API-equivalent
the CLI reports, not a charge. The pay-per-token Anthropic API backend is
strictly opt-in.
While watching a live session you can moderate it like a usability test: the 🎙 Steer box delivers your instruction with the agent's next look at the screen ("now try to export a PDF", "you may give up now") — the persona acknowledges it in voice and follows it, and the note is recorded in the trace (🎙 operator). ■ Stop ends one session (overview cards and the live view) without touching the rest of the run; the run-level Stop halts the runner and all its sessions.
Headless agents without Claude Code — capus agents --run-id <id> [--model haiku|sonnet|opus] [--workers N] [--backend auto|api|claude-cli]
plays all queued sessions of a run. The runner is an ordinary MCP client:
the daemon stays dumb even when the dashboard's Start button spawns it.
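When launching the runner from a script, it can help to build the argument list in one place. The helper below is a hypothetical convenience wrapper, not part of Capus; it only assembles the flags shown in the usage line above and validates the listed choices. The run ID in the example is a made-up placeholder.

```python
from typing import List, Optional

MODELS = {"haiku", "sonnet", "opus"}
BACKENDS = {"auto", "api", "claude-cli"}


def agents_command(run_id: str, model: Optional[str] = None,
                   workers: Optional[int] = None,
                   backend: Optional[str] = None) -> List[str]:
    """Build an argv list for `capus agents` from the documented flags."""
    argv = ["capus", "agents", "--run-id", run_id]
    if model is not None:
        if model not in MODELS:
            raise ValueError(f"unknown model: {model}")
        argv += ["--model", model]
    if workers is not None:
        if workers < 1:
            raise ValueError("workers must be at least 1")
        argv += ["--workers", str(workers)]
    if backend is not None:
        if backend not in BACKENDS:
            raise ValueError(f"unknown backend: {backend}")
        argv += ["--backend", backend]
    return argv


if __name__ == "__main__":
    # "run-123" is a placeholder; use a real run ID from the dashboard.
    print(" ".join(agents_command("run-123", model="haiku", workers=4)))
```

The resulting list can be handed to `subprocess.run(argv, check=True)`, which avoids shell quoting issues.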
Install the skills/agents as a plugin — capus setup (or the dashboard's
Setup page) does this with one click; manually from a checkout:
claude plugin marketplace add ./client/claude
claude plugin install capus@capus-marketplace
Claude Code workflow (skills in client/claude/capus): Capus is fully
drivable from chat; /capus:help is the front desk (the command map +
troubleshooting, and it routes you to the right command). The core loop:
- /capus:setup extracts business rules from your PRD into capus/rules.yaml, samples personas, and writes their narrative cards. (Prefer /capus:spec for realistic-workflow testing; /capus:recon grounds a spec against the live app; /capus:personas authors a custom panel.)
- /capus:run spawns parallel tester subagents; each claims a persona-session and plays it against the app.
- /capus:report runs a judge pass and generates report.html, report.md, feedback.json.
- /capus:fix (run inside the app's repo) works open findings, verifies fixes with trace_replay, and marks them fixed.
The control-room commands cover everything else the dashboard does, from chat:
/capus:status (list/inspect runs, sessions and findings; start/stop the
built-in runner; steer or stop a live session), /capus:doctor (permissions,
vision models, browser, daemon health), and /capus:credentials (the
login/secret vault). Chat and the dashboard are two windows onto one daemon
store — anything you do from chat is persisted immediately and visible in the
dashboard (runs, traces, screenshots, verdicts, findings) days or weeks later;
nothing is chat-only.
Codex: see client/codex/AGENTS.md.
Try it on the demo apps
examples/invoice_mini/app.py is a tiny Cocoa app with planted bugs (a dead
Export button, a missing volume discount, a wrong confirmation message,
silent input validation). Its PRD is examples/invoice_mini/PRD.md, expected
extracted rules in examples/rules.example.yaml. A full verification pass:
setup → run (3 personas) → report should find at least the dead control and
the discount rule violation.
examples/invoice_web/index.html is the same app (and the same 4 planted
bugs) as a self-contained web page — point run_create at
file:///…/examples/invoice_web/index.html to exercise the browser driver
end to end, headless, with no permissions.
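The file:// URL for the web demo depends on where the repository is checked out. A small sketch of deriving it with the standard library follows; the `demo_url` helper and the idea of passing the checkout directory are assumptions for illustration.

```python
from pathlib import Path


def demo_url(checkout: Path) -> str:
    """file:// URL of the web demo inside a local checkout (hypothetical helper)."""
    page = checkout / "examples" / "invoice_web" / "index.html"
    # as_uri() needs an absolute path; resolve() also normalizes "." and "..".
    return page.resolve().as_uri()


if __name__ == "__main__":
    print(demo_url(Path(".")))  # run from the repository root
```

`as_uri()` percent-encodes spaces and other special characters, which matters if the checkout lives under a path such as "My Projects".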
Security
Capus runs entirely on your machine. The daemon and dashboard bind to
127.0.0.1 and assume a localhost trust boundary: the read-only
dashboard API (runs, sessions, live screenshots) is open to local processes.
Mutating routes (start/stop runs, edit personas/credentials) and the
credentials vault require CAPUS_DASHBOARD_TOKEN (a Bearer token) when
it is set — set it before exposing the dashboard beyond localhost (e.g.
behind an authenticating reverse proxy or tunnel).
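For scripted access to mutating routes, the token travels as a standard Bearer header. The sketch below only builds the request object and sends nothing; the helper name is hypothetical, and the route `/api/example` is a placeholder, since the dashboard's actual endpoint paths are not listed here.

```python
import json
import os
import urllib.request
from typing import Optional


def dashboard_post(path: str, payload: dict,
                   base: str = "http://127.0.0.1:7777",
                   token: Optional[str] = None) -> urllib.request.Request:
    """Build a POST request, adding the Bearer token when one is configured."""
    if token is None:
        token = os.environ.get("CAPUS_DASHBOARD_TOKEN")
    headers = {"Content-Type": "application/json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(base + path, data=body, headers=headers, method="POST")


if __name__ == "__main__":
    # "/api/example" is a placeholder route, not a documented endpoint.
    req = dashboard_post("/api/example", {"note": "hello"}, token="example-token")
    print(req.get_method(), req.full_url, req.get_header("Authorization"))
```

Sending it is a matter of `urllib.request.urlopen(req)` once a real route is substituted.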
Test credentials live in a local SQLite vault. Secret-ish field values (keys
containing password/secret/token/pin/key/otp) are masked out of every
recorded trace, repro step and report — agents type them, records only show
{{secret:…}} placeholders. Use dedicated test accounts, never production
credentials.
Notes
macOS only (Quartz, Apple Vision OCR, Screen Capture). Apple Silicon recommended for OmniParser on MPS.
OmniParser v2 icon-detector weights are AGPL-3.0 (caption model MIT) — fine locally; re-check before commercial distribution.
Data dir: ~/.capus (override with CAPUS_DATA_DIR or --data-dir).
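The override order between the flag and the environment variable is not stated above. The sketch below assumes the common convention that a command-line flag wins over the environment, which wins over the default; treat that precedence and the function name as assumptions.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_data_dir(cli_value: Optional[str] = None,
                     env: Optional[Mapping[str, str]] = None) -> Path:
    """Pick the data directory: --data-dir, then CAPUS_DATA_DIR, then ~/.capus.

    The flag-over-environment precedence is an assumption, not documented behavior.
    """
    env = os.environ if env is None else env
    raw = cli_value or env.get("CAPUS_DATA_DIR") or "~/.capus"
    return Path(raw).expanduser()


if __name__ == "__main__":
    print(resolve_data_dir())
```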