Argus
Argus is an MCP server that enables automated web application testing through browser control, interaction, verification, and reporting.
Session Management
- start_session — launch Playwright at a given URL with configurable headless mode and viewport
- end_session — close the browser and generate an HTML report of all findings

Page Interaction

- get_page_state — retrieve current URL, title, and a numbered list of interactive elements
- click — click any element by index
- type_text — type into input fields by index
- select_option — choose values from dropdowns by index
- navigate / go_back / scroll_down — control page navigation and scrolling

Observation & Capture

- screenshot — capture the current viewport
- get_errors — collect JavaScript console errors and HTTP 4xx/5xx network failures

Verification & Testing

- verify_action — confirm a change (delete, edit, toggle) actually persisted by re-fetching page content
- test_action — click an element and auto-capture a before/after state diff with screenshots
- test_form — fill, submit, and verify form outcomes (success or validation errors)
- test_crud — run a full Create → Verify → Edit → Verify → Delete → Verify cycle automatically

Site-Wide Probing

- check_links — crawl internal links and report dead links (404/5xx)
- check_performance — measure page load time, resource count, and large resources
- crawl_site — visit all internal pages, run all detectors, and check links/performance across the entire site
Argus
An opinionated MCP server that turns your loaded LLM into a senior human QA tester — for web apps and any macOS app on your screen.
```
$ python -m argus.bench --target all
buggytasks   22 / 22 = 100 %   in 20.5s
darkshop     12 / 12 = 100 %   in  8.4s
──────────────────────────────────────────
total        34 / 34 = 100 %   in 28.9s
```

The 34 / 34 is reproducible from git clone in two commands. The
point of Argus is the prompt + the tool surface: when an Opus-class
agent (Claude Code, Cursor, etc.) loads this MCP, it stops being "an
assistant with browser tools" and starts behaving like a QA tester —
hypothesising, observing, verifying persistence, recording reproducible
bugs, and refusing to wander off into "let me just complete your flow
for you".
The other thing that makes Argus different from the existing browser-MCP crowd: screen mode. Same description-keyed tools, but the target is whatever app is foreground on macOS — Notes, Cursor, Safari, your in-progress feature. No headless Chrome, no scripted Playwright. Argus sees what you see.
Skip to Quick start · Bench · Tool surface · Philosophy
What it is, in one paragraph
Argus is an MCP server that exposes two things:
A role-binding instructions block that tells whichever agent loaded it: "while I'm here, you are a senior QA tester. Stay in role until end_session."
A mode-agnostic tool surface — observe, click_what, type_into, verify_persistence, record_bug, plus the same again for screen mode (screen_observe, screen_click_what, …) — that the agent uses to drive whatever you point it at.
That's the whole product. There's no detector library, no AI brain wrapped around static rules, no scoring. The agent is the smart layer. Argus is an opinionated, well-instrumented seat to put that agent in.
What it isn't
- Not an assertion library. There's no expect(x).toBe(y). The agent reads page state and decides what's a bug.
- Not an axe / Lighthouse replacement. We deliberately don't run static a11y / SEO / performance scans — those tools already exist and are excellent. Argus only flags what requires human judgement to see.
- Not a task-completion agent. If you want it to actually buy the thing or send the email, use Browser-Use or Stagehand. Argus's instructions block specifically refuses task completion in favour of testing the flow.
Quick start
```bash
# Web mode (works everywhere)
pip install argus-testing
playwright install chromium

# Screen mode (macOS only, optional)
pip install 'argus-testing[mac]'
brew install cliclick          # for keystroke / coordinate fallback

# Wire it into Claude Code
claude mcp add argus -- argus-mcp

# Confirm the version your MCP host will load
argus-mcp --version
```

After upgrading Argus, restart your MCP host (Claude Code, Cursor, etc.). MCP hosts cache the tool table at startup, so a fresh pip install -U argus-testing won't expose new tools until the host reconnects to the server. argus-mcp --version is the easy way to verify which version your host is actually running.
Then, in your Claude Code / Cursor / any-MCP session:
"Test my app at http://localhost:3000 — find five real bugs."

For screen mode, say "test whatever is on my screen" or specify the app:

"Test the Notes app in screen mode."

Permission check (screen mode)

Screen mode needs Screen Recording + Accessibility grants. Run:

```bash
argus-mcp --doctor
```

It probes both, reports status, and gives you the x-apple.systempreferences: deep-link for any missing grant.
Reproduce the bench
```bash
# Start the seeded fixtures
python test-site/app.py            # BuggyTasks :5555
python human-eye-fixture/app.py    # DarkShop :5556

# Run all scenarios
python -m argus.bench --target all \
    --json bench-results/matrix.json \
    --md bench-results/matrix.md
```

See bench-results/matrix.md for the checked-in artifact.
Bench method
Argus's headline number — 34 / 34 — measures Argus's capability
ceiling. Each scenario is a deterministic Python sequence that
exercises the same MCP tools an LLM agent would call. We're answering
"what's findable through this surface?" — separate from
"how often does any specific LLM remember to call the right tool?"
BuggyTasks (mechanical bugs)
22 seeded bugs in a small task-management app: console errors, dead links, fake delete (UI says "deleted!" but data persists on refresh), auth bypass, NaN dates, count-off-by-one, race conditions, etc. These are the "scripted E2E could find them" bugs.
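The "fake delete" class is a good illustration of why persistence checks matter: the UI reports success, but only a re-fetch exposes the lie. A minimal sketch of that bug class and the verify-after-refetch pattern, using a hypothetical in-memory store (not Argus's actual fixture code):

```python
# Hypothetical sketch of the "fake delete" bug class: the delete
# handler reports success but never removes the row, so only a
# re-fetch (the moral equivalent of verify_persistence) exposes it.

class BuggyTaskStore:
    def __init__(self):
        self.tasks = {1: "buy milk", 2: "ship v1"}

    def delete(self, task_id):
        # BUG: claims success without mutating state.
        return {"ok": True, "message": "deleted!"}

    def fetch_ids(self):
        # What a fresh GET of the page would render.
        return sorted(self.tasks)

store = BuggyTaskStore()
response = store.delete(1)

ui_says_deleted = response["ok"]          # the toast looks fine
still_present = 1 in store.fetch_ids()    # but the data persisted

assert ui_says_deleted and still_present  # reproducible bug evidence
```

A scripted E2E test that only asserts on the success toast passes here; re-fetching state after the action is what catches it.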
DarkShop (human-eye bugs)
12 seeded bugs in a polished-looking e-commerce fixture: hardcoded
"Only 3 left!" scarcity, fake -50% sale badges where original price
equals sale price, "free shipping over $50" banner contradicted by a
flat $5 in checkout, visual hierarchy inverted ("Add to Cart" demoted
while "Subscribe to Newsletter" gets the prominent green button),
cross-page state drift (rename succeeds on /account, navbar greeting still
shows the old name), and so on. Static analysis catches roughly none
of these. They require an agent that observes the page and reasons
about what's wrong.
What an agent has to do per scenario
Take BUG #10 in DarkShop: the navbar greeting goes stale after an account rename. The scenario does:
```python
reset(mode="renamed")   # fixture pre-stages a renamed account
observe()               # read the rendered /account page
# — page shows "Alex-Renamed" in the form
# — navbar still says "Hi, Alex"
record_bug(
    title="Account name change does not update nav greeting",
    severity="medium",
    evidence={"bug_type": "ux_issue", ...},
)
```

The judgement ("the navbar saying Alex while the form says Alex-Renamed is wrong") lives in the agent. The bench measures whether Argus's surface gives the agent enough information to make that call.
Screen mode
Screen mode is not in the recall matrix — that needs a seeded macOS app with intentional bugs, which is out of scope for v1. Screen mode is validated separately via python -m argus.screen.validate, which walks the AX tree of any running app and reports the elements it finds alongside round-trip identity probes. The checked-in artifact at bench-results/screen_validation.json walks Notes (8 menu-bar items, all localised OS strings — 5 / 5 unique probes).
To exercise screen mode against your own apps:
```bash
python -m argus.screen.validate Finder Notes "Google Chrome" \
    --json /tmp/screen.json
```

The script is read-only — it does not click, type, or move the mouse. Output element counts vary by app: simple system apps expose a few items at the menu-bar level; richer apps (browsers, IDEs) typically expose tens to hundreds.
Tool surface
Web mode
| Tool | Purpose |
| --- | --- |
| start_session | Launch a Playwright session at a given URL. |
| observe | URL + title + interactive elements (description-keyed, no integer indices) + counts + visible feedback + ARIA tree + viewport state. |
| click_what | Click the element best matching the given description. |
| type_into | Resolve a text input by description, then type. |
|  | Resolve a dropdown by description, then choose an option. |
| verify_persistence | Force a fresh GET on the current URL to confirm a change actually persisted. |
|  | Computed styles + ARIA + outerHTML + truncation detection for one element. |
| screenshot | Full viewport, full page, or a tight crop of one element. |
|  | Pillow-based pixel diff with red-tint overlay. |
|  | Arbitrary JS in the page context. Off by default; must be explicitly enabled. |
| record_bug | The agent calls this after it confirms a real bug. Requires a severity level plus evidence. |
|  | Drain captured console + network events (the only channels not visible in observe). |
|  | Probe-style helpers — return raw data, no auto-bug. |
| end_session | Close session, write the HTML report. |
Screen mode (macOS)
| Tool | Purpose |
| --- | --- |
|  | Bind to the foreground app or to a named running app. Refuses cleanly with deep-link permission instructions if grants are missing. |
| screen_observe | Foreground app + window title + AX tree (capped at 200 elements / 6-deep) + screen coords for every element + screenshot. |
| screen_click_what | Resolve via AX tree; click, with cliclick as the coordinate fallback. |
|  | Resolve via AX tree; set the value, with cliclick keystrokes as the fallback. |
|  | Elapsed time vs cap, action count, abort-file state, last 30 trail entries. |
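The AX-tree cap (200 elements, 6 levels deep) amounts to a bounded tree walk. A toy sketch over plain nested dicts — the real backend walks live macOS AX elements, which this deliberately does not touch:

```python
# Toy sketch of a capped tree walk, mirroring the 200-element /
# 6-deep bound described above. Nodes are plain dicts here; the
# real backend walks macOS AX elements instead.

def walk_capped(node, max_elements=200, max_depth=6):
    collected = []

    def visit(n, depth):
        if depth > max_depth or len(collected) >= max_elements:
            return
        collected.append(n["role"])
        for child in n.get("children", []):
            visit(child, depth + 1)

    visit(node, 1)
    return collected

tree = {
    "role": "window",
    "children": [
        {"role": "toolbar", "children": [{"role": "button"}]},
        {"role": "button"},
    ],
}

assert walk_capped(tree) == ["window", "toolbar", "button", "button"]
```

The cap keeps observe payloads bounded even for apps that expose enormous accessibility trees (browsers, IDEs).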
Safety
Screen mode runs against the user's actual machine, so:
- Per-call timeout — every action wraps in a 15-second budget (ARGUS_SCREEN_PER_CALL_TIMEOUT_S to override). A hung AX query doesn't lock up the agent.
- Session cap — 30-minute default (ARGUS_SCREEN_SESSION_MAX_SECONDS). After expiry, all screen tools refuse with a clear "start a new session" message.
- Abort file — touch ~/.argus/abort blocks every subsequent screen action in the current session. A robust panic button that works from any second terminal.
- Action trail — every screen action records a paired before/after screenshot, automatically.
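The abort-file mechanism is simple enough to sketch: every screen action checks for a sentinel file before doing anything. Names and paths below are illustrative, not Argus's actual internals (which use ~/.argus/abort):

```python
# Illustrative sketch of the abort-file guard: each screen action
# checks for a sentinel file before acting. Paths and names are
# hypothetical, not Argus's actual internals.

from pathlib import Path
import tempfile

class AbortRequested(RuntimeError):
    pass

def guard_action(abort_file: Path, action):
    """Run `action()` only if the abort sentinel is absent."""
    if abort_file.exists():
        raise AbortRequested(f"abort file present: {abort_file}")
    return action()

# Demo against a temp directory instead of the real home directory.
with tempfile.TemporaryDirectory() as d:
    sentinel = Path(d) / "abort"

    assert guard_action(sentinel, lambda: "clicked") == "clicked"

    sentinel.touch()  # the panic button: `touch <abort-file>`
    try:
        guard_action(sentinel, lambda: "clicked")
        blocked = False
    except AbortRequested:
        blocked = True

assert blocked
```

Because the check runs per action, the panic button takes effect on the very next tool call — no need to kill the MCP server process.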
Philosophy
This section exists because the design choices are opinionated.
Trust the agent, don't simulate intelligence
The agent loaded into Argus is assumed to be Opus-class or stronger.
Static rules that pretend to be the smart layer are subtractive: they
add maintenance, produce false positives, and pull attention away from
what the agent actually saw. So Argus's detector.py is 130 lines —
it only captures the two channels the agent literally cannot see (the
console event stream and the HTTP layer).
Everything else — "is the page text broken", "is there a count
mismatch", "is this a misleading success toast", "is the visual
hierarchy wrong" — the agent reads from observe() and decides.
Lock the role; don't bake a checklist
The MCP's instructions block does not tell the agent to fire every XSS payload from a textbook. Smart agents don't need that and benefit from being kept in role rather than handed a checklist. The block defines a senior-tester worldview (Map → Hypothesize → Act → Observe → Verify → Record → Cover), a bug bar (reproducible, user-affecting, persistent), severity calibration, and a hunting list of "things humans notice that machines miss" — and gets out of the way.
Description-keyed tools
click_what("Login button"), not click(7). Element indices are how
dumb LLMs were prompted in 2023; they're a leaky abstraction even
within a single observe. A smart agent describes what it wants to
interact with by what it is, and Argus's resolver maps that to the
right element — refusing to misclick on ambiguity rather than guessing.
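A description-keyed resolver with refuse-on-ambiguity can be sketched as a token-overlap scorer. This is an illustrative toy, not Argus's actual resolver (which is richer):

```python
# Toy description→element resolver: score elements by token overlap
# with the description, refuse when the top two scores tie. An
# illustrative sketch, not Argus's actual resolver.

class AmbiguousMatch(ValueError):
    pass

def resolve(description, elements):
    """elements: list of dicts with a 'label' key."""
    words = set(description.lower().split())

    def score(el):
        return len(words & set(el["label"].lower().split()))

    ranked = sorted(elements, key=score, reverse=True)
    if not ranked or score(ranked[0]) == 0:
        raise LookupError(f"no element matches {description!r}")
    if len(ranked) > 1 and score(ranked[0]) == score(ranked[1]):
        raise AmbiguousMatch(f"multiple elements match {description!r}")
    return ranked[0]

elements = [
    {"label": "Login button"},
    {"label": "Logout button"},
    {"label": "Search field"},
]

assert resolve("the Login button", elements)["label"] == "Login button"

try:
    resolve("button", elements)   # ties Login vs Logout
    refused = False
except AmbiguousMatch:
    refused = True

assert refused   # refuses to misclick rather than guessing
```

The key design choice is the last branch: on a tie, the resolver raises instead of picking one, pushing the agent to sharpen its description.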
Test anything on screen
The web is one target. Real software is hundreds of native macOS apps,
Electron things, IDEs, design tools, mobile simulators. Argus's screen
backend uses the macOS Accessibility tree as its structured surface and
screencapture for pixels — same description-keyed tools, no
framework lock-in. v1 is macOS-only; Win/Linux is v2.
Project layout
```
argus/
├── argus/
│   ├── mcp_server.py        # tool surface + role instructions
│   ├── browser.py           # Playwright web backend
│   ├── detector.py          # console + network event capture (only)
│   ├── differ.py            # state diff for compute_changes
│   ├── resolver.py          # description → element resolver (web + screen)
│   ├── reporter.py          # HTML session report
│   ├── models.py            # Bug / PageState / etc.
│   ├── bench/
│   │   ├── runner.py        # fixture-agnostic harness
│   │   ├── scenarios_buggytasks.py
│   │   └── scenarios_darkshop.py
│   └── screen/
│       ├── permissions.py   # Screen Recording / Accessibility probes
│       ├── backend.py       # AX tree + cliclick + screencapture
│       ├── safety.py        # timeouts, abort file, action trail
│       └── validate.py      # read-only walker for real apps
│
├── test-site/               # BuggyTasks fixture (22 mechanical bugs)
├── human-eye-fixture/       # DarkShop fixture (12 human-eye bugs)
├── tests/                   # 45 unit tests (resolver, detector, safety, …)
└── bench-results/           # checked-in artifacts (json + md)
```

Fixture convention
Argus benchmarks against fixtures that expose two HTTP endpoints:
```
GET  /api/test/state             # full in-memory state JSON
POST /api/test/reset?mode=...    # restore to a known starting state
```

mode is fixture-defined. BuggyTasks supports seeded / empty / all_done / one_pending. DarkShop supports seeded / with_items / renamed. See docs/FIXTURE_CONVENTION.md for the full spec.
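A fixture satisfying the convention only needs deterministic modes behind those two endpoints. An in-memory sketch of the state/reset contract — the HTTP layer is omitted, mode names follow the BuggyTasks list, and the task data is invented for illustration:

```python
# In-memory sketch of the fixture contract: GET /api/test/state maps
# to state(), POST /api/test/reset?mode=... maps to reset(mode).
# HTTP layer omitted; task data here is invented for illustration.

SEEDED = [
    {"id": 1, "title": "buy milk", "done": False},
    {"id": 2, "title": "ship v1", "done": True},
]

class Fixture:
    def __init__(self):
        self.reset("seeded")

    def reset(self, mode):
        if mode == "seeded":
            self.tasks = [dict(t) for t in SEEDED]
        elif mode == "empty":
            self.tasks = []
        elif mode == "all_done":
            self.tasks = [dict(t, done=True) for t in SEEDED]
        elif mode == "one_pending":
            self.tasks = [dict(SEEDED[0], done=False)]
        else:
            raise ValueError(f"unknown mode: {mode}")

    def state(self):
        return {"tasks": self.tasks}

fx = Fixture()
fx.reset("all_done")
assert all(t["done"] for t in fx.state()["tasks"])

fx.reset("empty")
assert fx.state() == {"tasks": []}
```

Determinism is the point: every bench scenario starts from a known mode, so a failed run is reproducible rather than dependent on leftover state.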
Roadmap
Concrete next-up:
- Real-world OSS PR — file a real bug report on a real OSS web app, with Argus's run as the evidence trail.
- Live LLM bench mode — python -m argus.bench --agent <model> swaps the scripted driver for a real LLM, so we measure variance on top of the capability ceiling.
- Screen-mode seeded fixture — a deterministic macOS app with intentional bugs, so the matrix becomes 2 × 2 and screen-mode recall is measurable.
- VLM resolver fallback — for apps with empty AX trees (some Electron things), use vision to resolve descriptions to coordinates.
License
MIT. See LICENSE.
Author
Built by Yichen Wu. Issues and PRs welcome.