Capus

Persona-driven LLM agent testing for macOS apps and web apps, served as a local MCP daemon. LLM agents role-play sampled human personas ("silicon sampling") and exercise your app through the GUI — screenshots in, real synthesized input out. The daemon is deliberately dumb (no LLM calls, ever); the intelligence is an MCP client: Claude Code/Codex driving it interactively, or the built-in runner started from the dashboard or capus agents (Anthropic API key or a Claude subscription via the claude CLI).

Two target drivers behind one identical tool surface:

  • macOS (app_path = .app/executable/.py): window screenshots parsed by OmniParser v2, CGEvent mouse/keyboard. Needs the machine free during runs (one shared screen + pointer).

  • Browser (app_path = http(s):// or file:// URL): one isolated headless Chromium context per persona session — truly parallel, zero desktop contention, no macOS permissions. The persona still sees a screenshot (numbered Set-of-Mark boxes), but element geometry comes from the DOM (visible, in-viewport, interactive elements with accessible names; shadow DOM and same-origin iframes traversed), input goes through CDP at real coordinates (move, then click — the same trusted path as a human mouse), and oracles get browser-grade signals: uncaught JS exceptions, console errors, failed/4xx+ requests, native dialogs, renderer crashes. OmniParser remains the automatic fallback for canvas-rendered UIs.
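The driver split described above hinges on the shape of app_path. A minimal sketch of that dispatch follows; the function name pick_driver and the returned labels are illustrative assumptions, not Capus's actual internals.

```python
from urllib.parse import urlparse

# Schemes the text above assigns to the browser driver.
BROWSER_SCHEMES = {"http", "https", "file"}


def pick_driver(app_path: str) -> str:
    """Hypothetical helper: choose a driver label from an app_path value.

    URLs with an http(s):// or file:// scheme go to the browser driver;
    anything else (.app bundle, executable, .py script) is treated as a
    macOS target.
    """
    scheme = urlparse(app_path).scheme.lower()
    if scheme in BROWSER_SCHEMES:
        return "browser"
    return "macos"


if __name__ == "__main__":
    for target in ("https://example.com", "/Applications/Notes.app"):
        print(target, "->", pick_driver(target))
```

Because both drivers sit behind the same tool surface, a client would not need to branch on this result; the sketch only illustrates how one argument can select the backend.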

Output is two deliverables from one findings database:

  • report.html / report.md (for the developer): findings with screenshots and repro steps, a business-rule coverage matrix (personas × rules), and persona journey filmstrips.

  • feedback.json (for coding agents): stable finding IDs, expected vs observed behavior, deterministic repro traces, and a status queue (open/fixed/wontfix), plus MCP tools (findings_query, trace_replay, finding_update) to fix and verify autonomously.

Architecture

the MIND (an MCP client — choose one per run)
  Claude Code / Codex            capus runner (capusd/runner.py)
  orchestrator + tester          spawned by the dashboard ▶ button or
  subagents (parallel)           `capus agents`; Anthropic API or claude CLI
        │ MCP over streamable HTTP (127.0.0.1:7777/mcp)
        ▼
capusd  (the BODY: dumb on purpose — zero LLM calls)
  session manager + work queue │ persona sampler (correlated human sampling)
  humanize (motor simulation: typing cadence+typos, Fitts-law mouse paths,
  click scatter, think pauses — seeded per session) │ Driver protocol
    ├─ macOS driver (Quartz windows, screenshots, CGEvent input)
    └─ browser driver (Playwright Chromium, DOM perception, CDP input)
  vision (OmniParser v2 + Apple-Vision OCR) │ oracles (crash/hang/log/no-op
  + JS/console/network/dialog for browser) │ SQLite store + artifacts │
  report generator │ dashboard (live streaming + run/persona control)

One daemon serves many parallel clients. The split is deliberate: the client decides WHAT a persona does (cognition), the daemon decides HOW it physically unfolds (motor execution) — so even a fast model produces sessions that look and pace like a human when watched live.

Human realism

Personas are sampled, not invented: demographics from an overridable population spec, behavioral traits drawn from a correlated-but-noisy latent model calibrated against real data — OECD PIAAC tech-skill bands (~60% of adults are level 1 or below: the "average user" fails multi-step tasks), CDC disability prevalence, CHI'18 typing statistics (52±25 WPM, ~6% of keystrokes are corrections). Counter-stereotypical draws are guaranteed by design (the 79-year-old engineer happens; so does the low-tech 19-year-old).
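One way to picture "correlated-but-noisy" sampling is a shared latent factor plus independent noise per trait. The sketch below is an illustration under stated assumptions: the coupling RHO, the 5 WPM floor, the band labels, and the function name are hypothetical, and only the 52±25 WPM and ~60% figures come from the text above. It is not Capus's calibrated model.

```python
import math
import random
from statistics import NormalDist

# Share of adults at PIAAC level 1 or below, as quoted in the text above.
LOW_SKILL_SHARE = 0.60
# Hypothetical coupling between the shared latent factor and each trait.
RHO = 0.6


def sample_persona(rng: random.Random) -> dict:
    """Draw two correlated traits from one latent factor (illustrative only)."""
    latent = rng.gauss(0.0, 1.0)

    def trait() -> float:
        # Each trait is standard normal overall: part latent, part private noise.
        noise = rng.gauss(0.0, 1.0)
        return RHO * latent + math.sqrt(1.0 - RHO**2) * noise

    # Typing speed: mean 52 WPM, standard deviation 25, floored at 5 WPM.
    typing_wpm = max(5.0, 52.0 + 25.0 * trait())
    # Tech skill: the lowest 60% of the trait distribution lands in the low band.
    cutoff = NormalDist().inv_cdf(LOW_SKILL_SHARE)
    band = "level_1_or_below" if trait() < cutoff else "level_2_plus"
    return {"typing_wpm": round(typing_wpm, 1), "tech_skill": band}


if __name__ == "__main__":
    rng = random.Random(42)  # seeded, so the panel is reproducible
    for _ in range(3):
        print(sample_persona(rng))
```

Because each trait keeps its own noise term, fast typists with low tech skill (and the reverse) still appear in a large enough panel, which is the counter-stereotype property the paragraph describes.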

Each persona compiles into:

  • a behavior contract (task_claim returns it) — first-person system prompt: identity, reading style, patience, blame attribution, quirks, and hard anti-"superuser" rules (satisfice, knowledge limits, giving up is valid data) — plus a 3-line persona_reminder to re-inject every few turns against persona drift;

  • a motor profile the daemon enforces mechanically: typing WPM with corrected typo bursts, Fitts's-law mouse movement (curved Bezier paths, bell velocity, terminal overshoot), click scatter inside targets, hesitation pauses. Seeded per session — reproducible. pacing: "fast" switches it all off for CI throughput.
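The motor-profile idea can be sketched with the Shannon form of Fitts's law, MT = a + b·log2(D/W + 1), and a quadratic Bezier path sampled with a minimum-jerk easing curve so that speed rises and falls in a bell shape. The constants a and b, the bend range, and both function names are assumptions for illustration; terminal overshoot and click scatter are left out.

```python
import math
import random


def fitts_time(distance: float, width: float, a: float = 0.1, b: float = 0.15) -> float:
    """Movement time in seconds (Shannon formulation); a and b are made-up constants."""
    return a + b * math.log2(distance / width + 1.0)


def mouse_path(start, end, rng: random.Random, steps: int = 20):
    """Return steps+1 points along a curved path from start to end."""
    (x0, y0), (x1, y1) = start, end
    dx, dy = x1 - x0, y1 - y0
    # Push the control point sideways, perpendicular to the straight line.
    bend = rng.uniform(-0.2, 0.2)
    cx = (x0 + x1) / 2.0 - dy * bend
    cy = (y0 + y1) / 2.0 + dx * bend
    points = []
    for i in range(steps + 1):
        t = i / steps
        # Minimum-jerk easing: slow start, fast middle, slow end.
        s = 10 * t**3 - 15 * t**4 + 6 * t**5
        x = (1 - s) ** 2 * x0 + 2 * (1 - s) * s * cx + s**2 * x1
        y = (1 - s) ** 2 * y0 + 2 * (1 - s) * s * cy + s**2 * y1
        points.append((x, y))
    return points


if __name__ == "__main__":
    rng = random.Random(1)  # seeded per session, as the text describes
    path = mouse_path((0.0, 0.0), (200.0, 100.0), rng)
    duration = fitts_time(distance=math.dist(path[0], path[-1]), width=24.0)
    print(f"{len(path)} points over {duration:.2f} s")
```

Spacing the points evenly in time and unevenly in distance is what gives the pointer its bell-shaped velocity; seeding the generator makes the same session replay the same path.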

macOS targets: perception AND targeting are pure screenshot vision (OmniParser); actuation is real synthesized mouse/keyboard events (move-then-click, like a human) at the vision-derived coordinates. Input is serialized through a global input mutex, so on one attended machine sessions interleave rather than run in parallel. Emergency aborts: touch /tmp/capus.stop, hold cmd+option+ctrl+escape, or slam the pointer into a screen corner. For unattended, truly parallel runs, a VM/container isolation tier is on the roadmap.

Browser targets: each session gets its own incognito Chromium context (own storage, own pointer) — no input mutex, no contention with you or with other sessions; the isolation problem the macOS tier needs VMs for simply doesn't exist. Headless by default; pass headless: false to run_create to watch.

Install

One line:

curl -fsSL https://raw.githubusercontent.com/DanielBirk04/capus/main/scripts/install.sh | sh

That installs uv if needed, installs capus as an isolated tool (with web-target support out of the box), and runs the guided capus setup wizard: environment checks, headless Chromium, optional native-app permissions/models, one-click Claude Code wiring (MCP + plugin), then starts the daemon and opens the dashboard. Already have uv?

uv tool install 'capus[browser]' && capus setup

The dashboard's Setup page (http://127.0.0.1:7777/#/setup) mirrors the same checklist with live re-checks and action buttons — first run lands there automatically.

Web (URL) targets need no macOS permissions at all. Native macOS apps are the advanced path: Screen Recording + Accessibility for the app hosting the daemon (your terminal — restart it after granting) and the OmniParser vision extra (pip install 'capus[vision]' + capus models download, ~1.5 GB). Without it the daemon still works for browser targets (DOM perception) and in OCR-only degraded mode for macOS targets.

Development install

uv venv --python 3.12 .venv
uv pip install --python .venv/bin/python -e '.[browser]'  # web targets
uv pip install --python .venv/bin/python -e '.[vision]'   # + OmniParser deps (heavy)
.venv/bin/playwright install chromium                      # browser binary
.venv/bin/capus models download                            # OmniParser v2 weights (~1.5 GB)
.venv/bin/capus doctor                                      # permissions check

Run

capus serve --open        # daemon + dashboard (--open pops the browser)
# register in Claude Code (capus setup / the Setup page do this for you):
claude mcp add --transport http capus http://127.0.0.1:7777/mcp

Control dashboard — open http://127.0.0.1:7777/ in a browser while the daemon runs. It streams the live screen each agent is driving, lists every run with its findings, and lets you drill into a run (persona cards, findings with screenshots and the exact pages where each problem appears) or into a session's full reasoning trace — every step's screenshot, the action, and the agent's intent ("why, in persona voice"), plus the persona's exit verdict and per-session model/cost. Every run and session view has Copy Markdown / Download .md so the whole thing is documented and portable.

The dashboard is also the control room: ▶ New run configures and launches a run (target, goals, persona count/seed or hand-picked library personas, pacing, model, parallel workers) and can start the built-in agent runner with one click; Personas manages the persona library — sample new ones, edit names/backstories (first-person interview style conditions best), and preview each persona's compiled behavior contract. Runs created from chat (Claude Code) appear the same way and can be started/stopped from either side. POST endpoints honor CAPUS_DASHBOARD_TOKEN (Bearer) when set.

Credentials manages a local vault of accounts the personas may need (staging logins etc.): attach credential sets to a run in the New-Run form and each persona receives them with its assignment and signs in naturally. Secret values (keys containing password/secret/token/pin/key/otp) are masked out of every recorded trace, repro step and report — agents type them, records show {{secret:…}} placeholders, and trace_replay resolves them back at replay time. Use test accounts, not production ones.
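The masking rule above (keys containing password/secret/token/pin/key/otp) can be sketched as a substring check on the field name followed by a value substitution. The function names are hypothetical, and using the field name inside the placeholder is an assumption: the text only shows the {{secret:…}} shape.

```python
# Substrings that mark a credential field as secret, per the text above.
SECRET_MARKERS = ("password", "secret", "token", "pin", "key", "otp")


def is_secret_key(name: str) -> bool:
    """True when the field name contains any of the secret markers."""
    lowered = name.lower()
    return any(marker in lowered for marker in SECRET_MARKERS)


def mask_typed_text(text: str, credentials: dict) -> str:
    """Replace secret values in a recorded string with placeholders.

    Assumption: the placeholder carries the field name; the real format
    may differ.
    """
    for name, value in credentials.items():
        if value and is_secret_key(name):
            text = text.replace(value, "{{secret:" + name + "}}")
    return text


if __name__ == "__main__":
    creds = {"username": "qa@example.com", "password": "hunter2"}
    print(mask_typed_text("typed 'hunter2' into the Password field", creds))
```

In this sketch, plain substring matching also flags names such as "shipping_note" (it contains "pin"), which errs on the side of masking too much rather than too little.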

Sessions have no fixed step limit — a persona keeps going as long as it makes progress (the daemon stops a session only after 25 consecutive zero-change actions, or at the 10000-step runaway cap). Patience is still personal: low-patience personas give up early because they choose to. Parallel workers are automatically capped at the number of queued sessions.
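The stopping rule above reduces to two checks: a streak of zero-change actions and a hard step cap. A minimal sketch follows; the function names and the "changed" flag per action are assumptions about how progress might be tracked, while the thresholds 25 and 10000 come from the text.

```python
from typing import Optional, Sequence


def stop_reason(
    changed_flags: Sequence[bool],
    max_noop_streak: int = 25,
    step_cap: int = 10_000,
) -> Optional[str]:
    """Return why a session should stop, or None to keep going.

    changed_flags[i] is True when action i changed the app's state.
    """
    if len(changed_flags) >= step_cap:
        return "step_cap"
    # Count zero-change actions at the tail of the history.
    streak = 0
    for changed in reversed(changed_flags):
        if changed:
            break
        streak += 1
    if streak >= max_noop_streak:
        return "no_progress"
    return None


def effective_workers(requested: int, queued_sessions: int) -> int:
    """Workers are capped at the number of queued sessions."""
    return max(0, min(requested, queued_sessions))


if __name__ == "__main__":
    history = [True] * 40 + [False] * 25
    print(stop_reason(history), effective_workers(8, 3))
```

Any single state-changing action resets the streak, so a slow but progressing persona is never cut off by the no-progress rule.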

⇪ Share with coding agent (run view) hands findings to the agent that will fix them: pick the findings, point at your project folder, optionally add instructions, then choose where it lands — VS Code (recommended: the agent spawns in the project folder and starts working immediately, connected to the capus MCP so it can trace_replay fixes and mark findings fixed; VS Code opens there so you can watch and resume the chat from the Claude Code panel — status also streams into the run page's Handoffs panel), Claude desktop app (a new chat via claude:// deep link, pre-filled with the brief; plain chat, not folder-bound), or ⧉ Copy brief (paste into any coding agent).

Model and effort settings for agents and handoffs work exactly as in a normal chat: the default inherits your own Claude settings (model AND effort), or you can pick any current model (Fable 5, Opus, Sonnet, Haiku, 1M-context variants) and an effort level (low → max) per run/handoff.

A note on cost: the runner's auto backend prefers the claude CLI, which runs on your Claude subscription — no API key is billed. The $ figures shown for such sessions (marked sub/) are the nominal API-equivalent the CLI reports, not a charge. The pay-per-token Anthropic API backend is strictly opt-in.

While watching a live session you can moderate it like a usability test: the 🎙 Steer box delivers your instruction with the agent's next look at the screen ("now try to export a PDF", "you may give up now") — the persona acknowledges it in voice and follows it, and the note is recorded in the trace (🎙 operator). ■ Stop ends one session (overview cards and the live view) without touching the rest of the run; the run-level Stop halts the runner and all its sessions.

Headless agents without Claude Code: capus agents --run-id <id> [--model haiku|sonnet|opus] [--workers N] [--backend auto|api|claude-cli] plays all queued sessions of a run. The runner is an ordinary MCP client: the daemon stays dumb even when the dashboard's Start button spawns it.

Install the skills/agents as a plugin — capus setup (or the dashboard's Setup page) does this with one click; manually from a checkout:

claude plugin marketplace add ./client/claude
claude plugin install capus@capus-marketplace

Claude Code workflow (skills in client/claude/capus) — Capus is fully drivable from chat; /capus:help is the front desk (the command map + troubleshooting, and it routes you to the right command). The core loop:

  1. /capus:setup — extracts business rules from your PRD into capus/rules.yaml, samples personas, writes their narrative cards. (Prefer /capus:spec for realistic-workflow testing; /capus:recon grounds a spec against the live app; /capus:personas authors a custom panel.)

  2. /capus:run — spawns parallel tester subagents; each claims a persona-session and plays it against the app.

  3. /capus:report — judge pass + generates report.html, report.md, feedback.json.

  4. /capus:fix — run inside the app's repo: works open findings, verifies fixes with trace_replay, marks them fixed.

The control-room commands cover everything else the dashboard does, from chat: /capus:status (list/inspect runs, sessions and findings; start/stop the built-in runner; steer or stop a live session), /capus:doctor (permissions, vision models, browser, daemon health), and /capus:credentials (the login/secret vault). Chat and the dashboard are two windows onto one daemon store — anything you do from chat is persisted immediately and visible in the dashboard (runs, traces, screenshots, verdicts, findings) days or weeks later; nothing is chat-only.

Codex: see client/codex/AGENTS.md.

Try it on the demo apps

examples/invoice_mini/app.py is a tiny Cocoa app with planted bugs (a dead Export button, a missing volume discount, a wrong confirmation message, silent input validation). Its PRD is examples/invoice_mini/PRD.md, expected extracted rules in examples/rules.example.yaml. A full verification pass: setup → run (3 personas) → report should find at least the dead control and the discount rule violation.

examples/invoice_web/index.html is the same app (and the same 4 planted bugs) as a self-contained web page — point run_create at file:///…/examples/invoice_web/index.html to exercise the browser driver end to end, headless, with no permissions.

Security

Capus runs entirely on your machine. The daemon and dashboard bind to 127.0.0.1 and assume a localhost trust boundary: the read-only dashboard API (runs, sessions, live screenshots) is open to local processes. Mutating routes (start/stop runs, edit personas/credentials) and the credentials vault require CAPUS_DASHBOARD_TOKEN (a Bearer token) when it is set — set it before exposing the dashboard beyond localhost (e.g. behind an authenticating reverse proxy or tunnel).
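For scripts that call mutating routes, the token travels as a standard Authorization: Bearer header. The sketch below only builds the request; the helper name is hypothetical and no route paths are assumed, since the text does not list them.

```python
import json
import os
import urllib.request
from typing import Optional


def dashboard_post(url: str, payload: dict, token: Optional[str] = None) -> urllib.request.Request:
    """Build a JSON POST request, attaching the Bearer token when one is set.

    If token is None, CAPUS_DASHBOARD_TOKEN is read from the environment.
    """
    if token is None:
        token = os.environ.get("CAPUS_DASHBOARD_TOKEN")
    headers = {"Content-Type": "application/json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )
```

Passing the result to urllib.request.urlopen would send it; the url argument must be a real mutating route of the running daemon.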

Test credentials live in a local SQLite vault. Secret-ish field values (keys containing password/secret/token/pin/key/otp) are masked out of every recorded trace, repro step and report — agents type them, records only show {{secret:…}} placeholders. Use dedicated test accounts, never production credentials.

Notes

  • macOS only (Quartz, Apple Vision OCR, Screen Capture). Apple Silicon recommended for OmniParser on MPS.

  • OmniParser v2 icon-detector weights are AGPL-3.0 (caption model MIT) — fine locally; re-check before commercial distribution.

  • Data dir: ~/.capus (override with CAPUS_DATA_DIR or --data-dir).
