AI Test Pilot

by Drzymek92


An LLM-driven test generator with a shared core and pluggable adapters. Point it at a Python module or a web page; it introspects the target, has an LLM propose test scenarios as schema-validated JSON, renders them into runnable tests, runs them, and triages the failures.


AI Test Pilot — generate, run, and lock a golden suite in one command

Why it's different: most "AI writes your tests" tools let the model emit test code directly — which hallucinates imports, fabricates inputs, and asserts wrong things. AI Test Pilot takes the opposite stance: the LLM only ever returns structured, schema-validated JSON; every line of runnable code is rendered deterministically from that JSON. The LLM is used for the two genuinely fuzzy steps only — proposing scenarios and judging ambiguous failures — and nothing else.

Does it actually catch bugs?

Coverage is a weak proxy: a suite can execute every line and assert nothing useful. The real measure of a test generator is its kill rate. Generate a suite from correct code, run it against buggy versions, and count the bugs caught by at least one test that passed on the correct code (standard mutation-testing semantics). The full reproducible eval ships in benchmark/.
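The kill-rate definition above can be made concrete with a toy example. The sketch below is illustrative only: `kill_rate`, the `max`-like function, and the hand-written mutants are assumptions for the example, not code from benchmark/.

```python
from typing import Callable

Impl = Callable[[int, int], int]
Check = Callable[[Impl], bool]  # True when the test passes against an implementation


def kill_rate(correct: Impl, mutants: list[Impl], checks: list[Check]) -> float:
    """Fraction of mutants failed by at least one check that passes on correct code."""
    if not mutants:
        raise ValueError("need at least one mutant")
    baseline = [c for c in checks if c(correct)]  # drop checks that fail on correct code
    killed = sum(1 for m in mutants if any(not c(m) for c in baseline))
    return killed / len(mutants)


def correct_max(a: int, b: int) -> int:
    return a if a >= b else b


mutants = [
    lambda a, b: a if a <= b else b,  # flipped comparison: behaves like min()
    lambda a, b: a if a > b else b,   # equivalent mutant: same result for every input
    lambda a, b: a,                   # ignores the second argument
]

checks = [
    lambda f: f(1, 2) == 2,
    lambda f: f(3, 3) == 3,
    lambda f: f(5, 1) == 5,
    lambda f: f(1, 2) == 1,  # wrong expectation: fails on correct code, so it is excluded
]

print(kill_rate(correct_max, mutants, checks))  # 2 of 3 mutants killed
```

The baseline filter matters: without it, the wrong fourth check would "kill" the equivalent mutant and inflate the score.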

Corpus	Kill rate (95% Wilson CI)	Notes
QuixBugs (external, default config)	0.80 [0.584–0.919] (n=20)	human-verified correct↔buggy pairs
In-repo AST mutation (full ablation)	0.818 [0.523–0.949] (n=11)	controlled mutants; the "is it overengineered?" ablation substrate
HumanEval held-out — standard tool (cosmic-ray)	0.923 [0.906–0.937] (n=1166)	re-measured with a standard mutation tool on code never seen during development
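For readers who want to check the intervals in the table, the Wilson score interval can be computed as in this sketch. The kill counts 16/20 and 9/11 are inferred here from the reported rates and sample sizes; they are not stated in the source.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Assumed counts: 16 of 20 (QuixBugs) and 9 of 11 (in-repo mutants).
for killed, total in [(16, 20), (9, 11)]:
    low, high = wilson_interval(killed, total)
    print(f"{killed}/{total}: {killed / total:.3f} [{low:.3f}, {high:.3f}]")
```

With these assumed counts the sketch gives approximately [0.584, 0.919] and [0.523, 0.949], which agrees with the first two rows.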

Versus a search-based peer: on the same QuixBugs targets and with the same kill mechanic, AI Test Pilot scores 0.80 against Pynguin's 0.30. Pynguin ran at a modest 30s/SIMPLE budget, so 0.30 is a floor; even the stronger SBST range reported in the literature (~0.59–0.70) sits below 0.80. The edge comes from the coverage-feedback loop: after the first pass, the suite's uncovered lines are fed back to the LLM so it can target them, a feedback mechanism that plain LLM test generation lacks.

Honest framing: the held-out number was first measured with this repo's own lightweight mutation operators (0.98). That is a generalization check, not a PIT-style mutation score, so it was re-measured with a standard tool, cosmic-ray, to get a comparable figure (0.923). cosmic-ray doesn't exclude equivalent mutants, so 0.923 is a conservative lower bound. The harder and most representative number remains QuixBugs at 0.80. Methodology, comparables vs the literature, and re-run procedure: benchmark/DETECTION.md.

python scripts/main.py detect --subset 20      # QuixBugs + in-repo mutation kill rate + feature ablation


Demo — sample run

$ python scripts/main.py --target path/to/rules/commission.py --selector compute_commission --golden

introspected 1 unit(s); resolved types: OrderView, LineItemView, RulesConfig, CommissionRules
generated 5 scenario(s)
golden mode: locked 5 characterization assertion(s)
run complete: 5/5 passed

✓ 5 passed · 0 failed · 0 error / 5 generated
  tests:  scripts/outputs/tests/test_commission_<ts>.py
  report: scripts/outputs/reports/report_<ts>.md

A generated test constructs the real typed inputs and locks the computed result:

def test_standard_commission():
    """Commission for a multi-item order."""
    result = compute_commission(
        order=OrderView(currency="PLN", status="DELIVERED", line_items=[
            LineItemView(category="electronics", unit_amount=Decimal("100.00"), quantity=2)]),
        config=RulesConfig())
    assert repr(result) == ("CommissionBreakdown(currency='PLN', "
        "items_commission=Decimal('4.00'), transaction_fee=Decimal('1.00'), "
        "total_commission=Decimal('5.00'), rule_version='v1')")
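The locking step behind a test like the one above can be sketched as follows. This is a simplified assumption of how a golden lock might work: it covers only the "run twice and keep reproducible results" idea, and `lock_golden`, `render_golden_assert`, and the two example functions are hypothetical.

```python
from __future__ import annotations

import itertools
from decimal import Decimal


def lock_golden(func, kwargs: dict) -> str | None:
    """Run the call twice; return the repr to lock, or None if the runs disagree."""
    first = repr(func(**kwargs))
    second = repr(func(**kwargs))
    return first if first == second else None


def render_golden_assert(call_src: str, locked: str) -> str:
    """Render the assertion line that pins the observed result."""
    return f"assert repr({call_src}) == {locked!r}"


def fee(amount: Decimal) -> Decimal:
    return amount * Decimal("0.04")


_counter = itertools.count()


def ticket() -> int:
    """Stand-in for a clock- or RNG-reading unit: a new value on every call."""
    return next(_counter)


locked = lock_golden(fee, {"amount": Decimal("100.00")})
print(render_golden_assert("fee(amount=Decimal('100.00'))", locked))
print(lock_golden(ticket, {}))  # None: not reproducible, so nothing is locked
```

A unit whose two runs differ is skipped instead of being locked to a value that would fail on the next run.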

For the web_playwright adapter, the same pipeline produces self-contained Playwright tests and an idiomatic .spec.ts export. Three sample targets are included: demo/signup.html (simple form), demo/login_app/ (served — auth/storage_state + API interception), and demo/ws_app/ (served — WebSocket push/echo). The served demos are run with --serve:

# deep web: emits base_url + auth_state fixtures, page.route interception, async variant
python scripts/main.py --adapter web_playwright --target demo/login_app/index.html --serve

# websocket: emits page.route_web_socket mock (server push + echo) + expect_ws_message
python scripts/main.py --adapter web_playwright --target demo/ws_app/index.html --serve

Features

Structured-output pipeline — the LLM returns JSON validated against a Pydantic schema; every line of test code is rendered deterministically from the validated objects, never written by the model.
Typed-input construction — resolves parameter types from source (dataclass/Pydantic/attrs/NamedTuple, nested) via ast without importing the target, so it tests real domain/OO code, not just functions taking primitives.
Reproducible & fail-safe — temperature=0 + a scenario cache replay identical tests for an unchanged target; the LLM call has timeout/retry and a documented exit-code contract, so it never emits a half-generated suite.
Proven bug detection — a detect command measures mutation kill rate (not just coverage), with a feature ablation and a coverage-feedback loop plain LLM test-gen lacks.
Advanced Playwright (served mode) — base_url/auth_state fixtures, storage_state reuse, page.route network interception, and in-process WebSocket mocking — all from JSON the model emits.
MCP server + quality/cost gates — callable from any MCP client; a quality-regression gate (false-positive rate, test-smell density…) and opt-in budget caps that abort before overspending.

Structured-output pipeline — the LLM returns JSON validated against a Pydantic schema (with a one-shot repair retry); code is generated from the validated objects, never written by the model. The value grammar's $type/$call/$enum symbols (and constructor argument names) are allow-listed against the types resolved from the target's own source, so a crafted docstring/source can't smuggle code tokens into a generated test that then gets executed.
Typed-input construction — recursively resolves a function's parameter types from source (dataclass + Pydantic + attrs + NamedTuple, nested, Decimal/datetime/Enum, defaults) via ast — without importing the target — and builds real constructor calls. It even surfaces Pydantic field constraints (gt/le/min_length, Annotated[...]) so the model picks valid values instead of triggering a construction-time ValidationError. Lets it test domain/OO code, not just functions taking primitives.
Reproducible generation — generation runs at temperature=0 and caches each scenario set keyed by the prompt + resolved model version + temperature, so re-running an unchanged target replays the identical tests at zero token cost; a model change invalidates the cache. --no-cache / --refresh-cache override it.
Assertion strength from source — the unit's own code is fed into generation (a bounded slice), so assertions target specific computed behaviour, not just type/shape. --golden then locks the real result; together they turn a draft into a regression guard.
Characterization (golden) mode — runs each call and locks the assertion to the real result, turning a generated test into a regression guard. Guarded against time-bombs: it double-runs and keeps only reproducible results, skipping any clock/RNG-reading unit whose time isn't pinned.
File & fixture inputs — creates real temp files for file-processing functions, and can optionally seed inputs from a companion synthetic data factory.
Failure triage — a deterministic signal table classifies most failures for free (bad_scenario / env_issue / a broken golden lock → real_bug); the LLM is called only for the genuinely ambiguous ones.
Advanced Playwright (served web mode) — fixtures (base_url, auth_state), authenticated sessions via saved storage_state, network interception (page.route) to stub APIs deterministically, in-process WebSocket mocking (page.route_web_socket, server-push + echo), and an async_playwright variant. Each is just structured JSON the LLM emits — no Playwright code from the model.
Self-tracking ledger + self-improving tuning — every run is recorded to DuckDB; accept backfills how many tests you kept. The tool then proposes the best prompt version and (in auto mode) injects your previously-accepted scenarios for the same target as few-shot exemplars — closing the loop with zero extra LLM calls.
Draft → suite workflow — discover scans a project and prints ready-to-run commands per module; promote strips a draft's boilerplate, rewrites golden locks into value assertions, and appends only the non-duplicate tests into an existing suite. Both deterministic, zero-token.
MCP server — exposes the engine as tools (introspect, generate_tests, triage_failures, run_metrics, accept_run) so it's callable from any MCP client.
Fail-safe by contract — the LLM call has a timeout + exponential-backoff retry and raises rather than ever emitting a half-generated suite; a per-test run cap bounds a hanging test; and the CLI has a documented exit-code contract (0 ok · 1 internal · 2 usage · 3 uninspectable target · 4 LLM failure · 5 quality regression · 6 budget exceeded · 7 detection regression) for scripting.
Quality regression gate — quality runs a curated known-good target set and reports a metric panel (coverage, pass-rate, false-positive rate, error rate, test-smell density, acceptance), gated against a stored baseline so a generation/prompt change that regresses quality fails loudly.
Cost guardrails — every run measures and records real token spend (a cache replay is free); opt-in [budget] caps abort before overspending, and sweep "tests the diff" by generating only for git-changed modules under a per-sweep cap.
Proven bug detection (not just coverage) — a detect command measures mutation kill rate (generate from correct code, re-run against buggy/mutant versions, count what's caught) over an external corpus (QuixBugs) + in-repo mutants, with a feature ablation and a standard-tool (cosmic-ray) cross-check on held-out code. A coverage-feedback loop then feeds the suite's uncovered lines back to the LLM to target them — the feedback mechanism single-shot LLM test-gen lacks. See Does it actually catch bugs? and benchmark/DETECTION.md.
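The reproducible-generation item in the list above describes a cache keyed by prompt, resolved model version, and temperature. A minimal in-memory sketch of that idea follows; the hashing scheme, `cache_key`, and `ScenarioCache` are assumptions for illustration, and the real tool's storage and key format may differ.

```python
import hashlib
import json


def cache_key(prompt: str, model_version: str, temperature: float) -> str:
    """Hash every input that can change the model's output."""
    payload = json.dumps(
        {"prompt": prompt, "model": model_version, "temperature": temperature},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ScenarioCache:
    """In-memory stand-in for a scenario cache."""

    def __init__(self) -> None:
        self._store: dict[str, list] = {}

    def get_or_generate(self, prompt, model_version, temperature, generate):
        key = cache_key(prompt, model_version, temperature)
        if key not in self._store:  # only a cache miss reaches the generator
            self._store[key] = generate(prompt)
        return self._store[key]


calls = []


def fake_generate(prompt: str) -> list:
    """Placeholder for the LLM call; records each invocation."""
    calls.append(prompt)
    return [{"name": "example_scenario"}]


cache = ScenarioCache()
cache.get_or_generate("prompt for module.py", "model-v1", 0.0, fake_generate)
cache.get_or_generate("prompt for module.py", "model-v1", 0.0, fake_generate)  # replayed
cache.get_or_generate("prompt for module.py", "model-v2", 0.0, fake_generate)  # new key
print(len(calls))  # the generator ran for the first and third call only
```

Including the model version in the key is what makes a model change invalidate earlier entries without any explicit purge.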

The same engine drives two target types through one adapter seam, so adding a new kind of target is a single new file with zero changes to the core:

python_pytest — points at a Python module, generates pytest tests.
web_playwright — points at a web page, generates Playwright end-to-end tests (and exports idiomatic TypeScript alongside the runnable Python).

How it works

flowchart LR
    T([target: module or URL]) --> I[1 · introspect<br/>ast / DOM — deterministic]
    I --> G[2 · generate<br/>LLM → schema-validated JSON]
    G --> M[3 · materialize<br/>render code — deterministic]
    M --> R[4 · run<br/>pytest / Playwright]
    R --> TR[5 · triage<br/>signals + LLM for ambiguous]
    TR --> L[6 · record<br/>DuckDB ledger]
    L -. 7 · propose tuning .-> G

Stages 1, 3, 4, 6 cost zero tokens. Stage 2 is one batched LLM call; stage 5 calls the LLM only for failures the deterministic signal table can't classify. The core never imports an adapter directly — only through a name registry — which is what keeps the two target types fully decoupled:

flowchart TD
    C[shared core<br/>introspect · generate · materialize · run · triage · record] --> RG[registry]
    RG --> A1[python_pytest adapter]
    RG --> A2[web_playwright adapter]
    A1 --> P[(pytest)]
    A2 --> PW[(Playwright + TS export)]
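The name-registry seam in the diagram above can be sketched like this. It is one common way to build such a registry, not necessarily the project's: `register`, `get_adapter`, the `Adapter` protocol, and the adapter's `introspect` body are hypothetical.

```python
from typing import Callable, Protocol


class Adapter(Protocol):
    """The only surface the core relies on (a hypothetical, reduced interface)."""

    def introspect(self, target: str) -> list[str]: ...


_REGISTRY: dict[str, Callable[[], Adapter]] = {}


def register(name: str):
    """Class decorator: make an adapter discoverable by name."""
    def decorator(factory):
        _REGISTRY[name] = factory
        return factory
    return decorator


def get_adapter(name: str) -> Adapter:
    """Look an adapter up by name; the core never imports adapter classes directly."""
    try:
        factory = _REGISTRY[name]
    except KeyError:
        raise ValueError(f"unknown adapter {name!r}; known: {sorted(_REGISTRY)}") from None
    return factory()


@register("python_pytest")
class PythonPytestAdapter:
    def introspect(self, target: str) -> list[str]:
        return [f"unit discovered in {target}"]  # placeholder behaviour


print(get_adapter("python_pytest").introspect("module.py"))
```

Under this arrangement, adding a target type means adding one file that calls `register`; the lookup code stays unchanged.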

Tech Stack

Language: Python 3.10+
Core: pydantic (the schema spine), jinja2 (pytest emission), duckdb (the run ledger)
LLM: langchain-openai against any OpenAI-compatible gateway (LLM_BASE_URL/LLM_MODEL/LLM_API_KEY)
Adapters: pytest (python target runner), playwright (web — bundles its own driver, no Node needed)
Integration: mcp (FastMCP) — exposes the engine over the Model Context Protocol
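The three gateway variables named in the list above could be read and checked up front as in the sketch below. Only the variable names come from the source; `LLMSettings` and `load_llm_settings` are hypothetical, and the project itself may load them differently (for example from config/.env).

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class LLMSettings:
    base_url: str
    model: str
    api_key: str


def load_llm_settings(env: Mapping[str, str] = os.environ) -> LLMSettings:
    """Fail early, with every missing name listed, instead of failing mid-run."""
    required = ("LLM_BASE_URL", "LLM_MODEL", "LLM_API_KEY")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(f"missing LLM settings: {', '.join(missing)}")
    return LLMSettings(env["LLM_BASE_URL"], env["LLM_MODEL"], env["LLM_API_KEY"])


# Example with placeholder values; real values belong in the environment.
example_env = {
    "LLM_BASE_URL": "https://gateway.example/v1",
    "LLM_MODEL": "example-model",
    "LLM_API_KEY": "placeholder",
}
print(load_llm_settings(example_env).model)
```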

Getting Started

Prerequisites

Python 3.10+ on PATH

Installation

git clone https://github.com/Drzymek92/ai-test-pilot.git
cd ai-test-pilot
python -m venv .venv
.venv\Scripts\activate          # Windows  (source .venv/bin/activate on macOS/Linux)
pip install -e .                # installs deps + the `ai-test-pilot` command
#   (or `pip install -r requirements.txt` for deps only, without the console entry point)
cp config/.env.example config/.env     # then fill in your LLM gateway values
# for the web adapter only:
python -m playwright install chromium

Installed this way, you can call it as ai-test-pilot ... from anywhere; the examples below use python scripts/main.py ..., which is equivalent.

Usage

# generate pytest tests for selected functions
python scripts/main.py --target path/to/module.py --selector func_a,func_b

# lock assertions to real results (characterization / regression mode)
python scripts/main.py --target path/to/module.py --golden

# also emit deterministic validator-rejection tests (a validator-gated pydantic type refuses bad input)
python scripts/main.py --target path/to/module.py --reject-tests

# generate Playwright tests for a web page
python scripts/main.py --adapter web_playwright --target path/to/page.html

# record how many proposed tests you kept (feeds tuning)
python scripts/main.py accept <run_id> --kept 4

# scan a project for testable targets (deterministic, no LLM)
python scripts/main.py discover path/to/project

# incremental: only the modules changed in git (regenerate tests for what moved, not the whole tree)
python scripts/main.py discover path/to/project --changed          # working tree vs HEAD
python scripts/main.py discover path/to/project --since main        # vs a ref / tag / commit

# clean a draft for the suite: strip boilerplate, rewrite golden locks, append non-duplicates
python scripts/main.py promote <run_id> --into tests/test_module.py

# force a fresh (uncached) generation, or regenerate + overwrite the cache
python scripts/main.py --target path/to/module.py --no-cache       # bypass the scenario cache
python scripts/main.py --target path/to/module.py --refresh-cache  # regenerate and re-store

# quality gate: run the curated set, print the metric panel, fail (exit 5) on regression
python scripts/main.py quality                       # gate vs baseline
python scripts/main.py quality --update-baseline     # (re)set the baseline

# "test the diff": generate only for git-changed modules in a project, under a budget cap
python scripts/main.py sweep path/to/project --since main
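When scripting the commands above, the documented exit codes can be mapped to names. In this sketch the numeric values come from the exit-code contract listed under Features, while the enum member names and the `classify` helper are hypothetical.

```python
from enum import IntEnum


class ExitCode(IntEnum):
    """Numeric values follow the documented contract; member names are illustrative."""
    OK = 0
    INTERNAL = 1
    USAGE = 2
    UNINSPECTABLE_TARGET = 3
    LLM_FAILURE = 4
    QUALITY_REGRESSION = 5
    BUDGET_EXCEEDED = 6
    DETECTION_REGRESSION = 7


def classify(returncode: int) -> str:
    """Translate a process return code into a readable label."""
    try:
        return ExitCode(returncode).name.lower()
    except ValueError:
        return f"unknown ({returncode})"


print(classify(5))
```

In a CI script the return code would typically come from `subprocess.run([...]).returncode`, with the quality gate treated as failed when the label is `quality_regression`.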

Run as an MCP server (callable from any MCP client) — register it with:

{ "command": "python", "args": ["/path/to/ai-test-pilot/scripts/mcp_server.py"] }

Project Structure

scripts/
  main.py               # CLI + run_pipeline() (the one pipeline every interface reuses)
  mcp_server.py         # MCP server (FastMCP) exposing the engine as tools
  cli.py                # argument parsing + subcommand dispatch + the exit-code contract
  pipeline.py           # run_pipeline() — the one pipeline every interface reuses (typed RunRequest in)
  core/                 # adapter-agnostic engine: models, generate, materialize, runner, triage,
                        #   ledger, tuning, context, fixtures, registry, discover, promote, report,
                        #   feedback, cache, errors (exit codes), quality (gate), budget, detection
  adapters/             # python_pytest · web_playwright  (one file per target type)
  prompts/              # scenario-generation prompts + the pytest Jinja template
config/                 # ai_test_pilot.toml (defaults) + .env.example
benchmark/              # reproducible efficacy eval: mutation kill rate, ablation, cosmic-ray +
                        #   Pynguin baselines, Wilson CIs, corpora loaders, DETECTION.md + artifacts
demo/                   # signup.html · login_app/ (auth) · ws_app/ (websocket) — web adapter targets
tests/                  # 204 unit tests

Design notes

Determinism first. Introspection, code emission, running, the triage signal table, and the ledger are all plain code. The LLM is a tool for the two irreducibly fuzzy steps only.
Never imports the target. Introspection is ast-only, so a target's heavy/optional dependencies are never triggered to generate tests for it.
Human-in-the-loop. Generated tests are proposed into scripts/outputs/ — never written into a target repository. Promoting them is a separate, explicit step.
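The "never imports the target" note can be illustrated with the standard `ast` module. This is a minimal sketch of source-only introspection, far simpler than the recursive type resolution described under Features; `introspect_functions` is a hypothetical helper.

```python
from __future__ import annotations

import ast


def introspect_functions(source: str) -> dict[str, list[tuple[str, str | None]]]:
    """Map each public top-level function to its (parameter, annotation) pairs."""
    tree = ast.parse(source)  # parses text only; nothing in the module is executed
    found: dict[str, list[tuple[str, str | None]]] = {}
    for node in tree.body:
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        if node.name.startswith("_"):
            continue
        params = []
        for arg in node.args.posonlyargs + node.args.args + node.args.kwonlyargs:
            annotation = ast.unparse(arg.annotation) if arg.annotation else None
            params.append((arg.arg, annotation))
        found[node.name] = params
    return found


# The import below would fail if executed; parsing never runs it.
SOURCE = '''
import some_heavy_dependency_that_is_not_installed

def compute_commission(order: OrderView, config: RulesConfig = None):
    return None

def _private_helper(x):
    return x
'''

print(introspect_functions(SOURCE))
```

Because the source is parsed and never executed, the uninstalled import is harmless and only the public function is reported.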

Validated end-to-end across a typed business-rules engine, pure data-transformation helpers, a web form, an authenticated app with API interception, and a WebSocket feed — producing runnable, correctly-typed, regression-grade tests in each case.

CI note: the published CI runs the full unit suite, which is browser-free by design — the web_playwright tests assert on the generated test source, not a live browser. The --serve demos are run locally (after playwright install chromium); CI doesn't download a browser.

Limitations

The python_pytest adapter is a tested, quality-gated, cost-bounded tool I rely on for my own pipelines — reproducible (temperature 0 + scenario cache), fail-safe (typed exit-code contract), and guarded by a regression gate. Its honest scope is trusted, single-machine code: it executes the tests it generates and sends introspected source to the configured LLM gateway, so it is not a sandbox for untrusted targets. The web_playwright adapter remains a demo of advanced Playwright technique. The input shapes the tool deliberately won't guess at, the non-determinism of the LLM step, and the scale/safety/packaging boundaries are all written up in LIMITATIONS.md — knowing them is part of using it well.

License

Licensed under the MIT License — see LICENSE.
