EvalView
EvalView's MCP server enables regression testing and behavior validation of AI agents directly from a coding assistant context.
Core Capabilities:
- Create test cases (create_test): Generate YAML-based tests specifying queries, expected/forbidden tools, output requirements, and score thresholds; no YAML knowledge required
- Save golden baselines (run_snapshot): Capture current passing agent behavior as a baseline for future comparisons
- Detect regressions (run_check): Compare current behavior against baselines, returning PASSED, OUTPUT_CHANGED, TOOLS_CHANGED, or REGRESSION statuses
- List tests (list_tests): View all golden baselines with variant counts and last-updated timestamps
- Validate skill files (validate_skill): Check SKILL.md files for correct structure and completeness
- Generate skill tests (generate_skill_tests): Auto-generate YAML test suites from SKILL.md files covering explicit, implicit, contextual, and negative categories
- Run skill tests (run_skill_test): Execute tests with deterministic checks (tool calls, file ops, output content) and LLM-as-judge rubric scoring
- Generate visual reports (generate_visual_report): Produce self-contained HTML reports with traces, diffs, scores, and timelines
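These tools can be driven from any MCP client, not just a coding assistant. Below is a minimal sketch using the official Python `mcp` SDK; the tool names come from the list above, but the arguments passed to `run_check` are assumptions, so check the server's tool schemas for the real ones:

```python
# Sketch: calling EvalView's MCP tools from Python with the `mcp` client SDK.
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main() -> None:
    # Launch the server the same way Claude Code does: `evalview mcp serve`
    server = StdioServerParameters(command="evalview", args=["mcp", "serve"])
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            print([t.name for t in tools.tools])  # create_test, run_check, ...
            # Argument names are assumptions; inspect the tool schema first.
            result = await session.call_tool("run_check", arguments={"test_dir": "tests/"})
            print(result.content)

asyncio.run(main())
```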
Additional Features:
- Auto-detects project structure and agent types (Claude Code, LangGraph, CrewAI, OpenAI Assistants, custom agents)
- Supports multiple behavior variants to handle agent non-determinism
- Integrates with CI/CD pipelines and offers a Python API and Pytest plugin
- Supports execution path tracing and behavioral regression detection for multi-agent systems built with CrewAI
- Integrates with CI/CD workflows to block regressions and provide automated pass/fail signals for AI agent tests
- Provides evaluation and monitoring tools for AI agents developed within the LangChain ecosystem
- Facilitates execution trace visualization and behavioral diffing for stateful LangGraph agentic workflows
- Supports offline, provider-agnostic regression testing for local AI agents running on Ollama
- Performs semantic similarity scoring and LLM-as-judge evaluation for OpenAI and compatible agent systems
- Provides a testing interface to run AI agent behavioral validation and regression checks through the pytest framework
- Delivers continuous monitoring notifications and automated alerts for AI agent regressions via Slack
Your agent can still return 200 and be wrong. A model or provider update can change tool choice, skip a clarification, or degrade output quality without changing your code or breaking a health check. EvalView catches those silent regressions before users do.
You don't need frontier-lab resources to run a serious agent regression loop. EvalView gives solo devs, startups, and small AI teams the same core discipline: snapshot behavior, detect drift, classify changes, and review or heal them safely.
Traditional tests tell you if your agent is up. EvalView tells you if it still behaves correctly. It tracks drift across outputs, tools, model IDs, and runtime fingerprints, so you can tell "the provider changed" from "my system regressed."

30-second live demo.
Most eval tools stop at detect and compare. EvalView helps you classify changes, inspect drift, and auto-heal the safe cases.
- Catch silent regressions that normal tests miss
- Separate provider/model drift from real system regressions
- Auto-heal flaky failures with retries, review gates, and audit logs
Built for frontier-lab rigor, startup-team practicality:
- targeted behavior runs instead of giant always-on eval suites
- deterministic diffs first, LLM judgment where it adds signal
- faster loops from change -> eval -> review -> ship
How we run EvalView with this operating model →
✓ login-flow PASSED
⚠ refund-request TOOLS_CHANGED
- lookup_order → check_policy → process_refund
+ lookup_order → check_policy → process_refund → escalate_to_human
✗ billing-dispute REGRESSION -30 pts
Score: 85 → 55   Output similarity: 35%

Quick Start
pip install evalview
evalview init      # Detect agent, auto-configure profile + starter suite
evalview snapshot  # Save current behavior as baseline
evalview check     # Catch regressions after every change

That's it. Three commands to regression-test any AI agent. init auto-detects your agent type (chat, tool-use, multi-step, RAG, coding) and configures the right evaluators, thresholds, and assertions.
curl -fsSL https://raw.githubusercontent.com/hidai25/eval-view/main/install.sh | bash
evalview demo  # See regression detection live (~30 seconds, no API key)

Or clone a real working agent with built-in tests:
git clone https://github.com/hidai25/evalview-support-automation-template
cd evalview-support-automation-template
make run

evalview generate --agent http://localhost:8000  # Generate tests from a live agent
evalview capture --agent http://localhost:8000/invoke # Capture real user flows (runs assertion wizard after)
evalview capture --agent http://localhost:8000/invoke --multi-turn # Multi-turn conversation as one test
evalview generate --from-log traffic.jsonl # Generate from existing logs
evalview init --profile rag  # Override auto-detected agent profile

Why EvalView?
Use LangSmith for observability. Use Braintrust for scoring. Use EvalView for regression gating.
| | LangSmith | Braintrust | Promptfoo | EvalView |
|---|---|---|---|---|
| Primary focus | Observability | Scoring | Prompt comparison | Regression detection |
| Tool call + parameter diffing | — | — | — | Yes |
| Golden baseline regression | — | Manual | — | Automatic |
| Silent model change detection | — | — | — | Yes |
| Auto-heal (retry + variant proposal) | — | — | — | Yes |
| PR comments with alerts | — | — | — | Cost, latency, model change |
| Works without API keys | No | No | Partial | Yes |
| Production monitoring | Tracing | — | — | Check loop + Slack |
What It Catches

| Status | Meaning | Action |
|---|---|---|
| ✅ PASSED | Behavior matches baseline | Ship with confidence |
| ⚠️ TOOLS_CHANGED | Different tools called | Review the diff |
| ⚠️ OUTPUT_CHANGED | Same tools, output shifted | Review the diff |
| ❌ REGRESSION | Score dropped significantly | Fix before shipping |
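The same statuses surface through the Python API (documented below), so a script can branch on them. Here is a small sketch using only the documented `gate` and `DiffStatus` names; an OUTPUT_CHANGED member is assumed to exist on `DiffStatus` to match the table above:

```python
# Sketch: acting on check statuses programmatically via the Python API.
from evalview import gate, DiffStatus

result = gate(test_dir="tests/")

if result.status == DiffStatus.PASSED:
    print("Behavior matches baseline: ship with confidence")
elif result.status == DiffStatus.TOOLS_CHANGED:
    for diff in result.diffs:  # review the per-test tool diffs
        print(diff)
else:
    raise SystemExit(f"Fix before shipping: {result.status}")
```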
Model / Runtime Change Detection
EvalView does more than compare model_id.
- Declared model change: adapter-reported model changed from baseline
- Runtime fingerprint change: observed model labels in the trace changed, even when the top-level model name is missing
- Coordinated drift: multiple tests shift together in the same check run, which often points to a silent provider rollout or runtime change
When detected, evalview check surfaces a run-level signal with a classification (declared or suspected), confidence level, and evidence from fingerprints, retries, and affected tests.
If the new behavior is correct, rerun evalview snapshot to accept the updated baseline.
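For intuition, here is a rough sketch of that classification logic. It is illustrative only, not EvalView's internals, and every field name in it is an assumption:

```python
# Illustrative classifier mirroring the three signals described above.
from dataclasses import dataclass

@dataclass
class RunInfo:
    declared_model: str | None  # adapter-reported model id (may be missing)
    fingerprints: set[str]      # model labels observed in the trace

def classify_model_change(baseline: RunInfo, current: RunInfo,
                          drifted: int, total: int) -> str | None:
    if baseline.declared_model and current.declared_model \
            and baseline.declared_model != current.declared_model:
        return "declared"    # top-level model id changed
    if baseline.fingerprints != current.fingerprints:
        return "suspected"   # runtime fingerprint changed
    if total >= 3 and drifted / total > 0.5:
        return "suspected"   # coordinated drift across the check run
    return None
```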
Four scoring layers — the first two are free and offline:

| Layer | What it checks | Cost |
|---|---|---|
| Tool calls + sequence | Exact tool names, order, parameters | Free |
| Code-based checks | Regex, JSON schema, contains/not_contains | Free |
| Semantic similarity | Output meaning via embeddings | ~$0.00004/test |
| LLM-as-judge | Output quality scored by LLM (GPT, Claude, Gemini, DeepSeek, Ollama) | ~$0.01/test |
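The layers roll up into one weighted score, as the Score Breakdown below shows. Here is a toy version of that weighting; the weights and component scores are illustrative, not EvalView's exact formula:

```python
# Toy weighted rollup of the scoring layers (illustrative weights).
def combined_score(tools: float, output: float, sequence: float,
                   weights: tuple[float, float, float] = (0.3, 0.5, 0.2)) -> float:
    w_tools, w_output, w_seq = weights
    return tools * w_tools + output * w_output + sequence * w_seq

# A perfect tool score cannot mask a weak output score:
print(combined_score(tools=100, output=40, sequence=100))  # 70.0
```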
Score Breakdown
Tools 100% ×30% Output 42/100 ×50% Sequence ✓ ×20% = 54/100
↑ tools were fine     ↑ this is the problem

CI/CD Integration
Block broken agents in every PR. One step — PR comments, artifacts, and job summary are automatic.
# .github/workflows/evalview.yml — copy this, add your secret, done
name: EvalView Agent Check
on: [pull_request, push]
jobs:
  agent-check:
    runs-on: ubuntu-latest
    permissions:
      pull-requests: write
    steps:
      - uses: actions/checkout@v4
      - name: Check for agent regressions
        uses: hidai25/eval-view@main
        with:
          openai-api-key: ${{ secrets.OPENAI_API_KEY }}

## ✅ EvalView: PASSED
| Metric | Value |
|--------|-------|
| Tests | 5/5 unchanged (100%) |

---
*Generated by EvalView*

When something breaks:
## ❌ EvalView: REGRESSION
> **Alerts**
> - 💸 Cost spike: $0.02 → $0.08 (+300%)
> - 🤖 Model changed: gpt-5.4 → gpt-5.4-mini
| Metric | Value |
|--------|-------|
| Tests | 3/5 unchanged (60%) |
| Regressions | 1 |
| Tools Changed | 1 |
### Changes from Baseline
- ❌ **search-flow**: score -15.0, 1 tool change(s)
- ⚠️ **create-flow**: 1 tool change(s)

Common options: strict: 'true' | fail-on: 'REGRESSION,TOOLS_CHANGED' | mode: 'run' | filter: 'my-test'
Also works with pre-push hooks (evalview install-hooks) and status badges (evalview badge).
Watch Mode
Leave it running while you code. Every file save triggers a regression check.
evalview watch # Watch current dir, check on change
evalview watch --quick # No LLM judge — $0, sub-second
evalview watch --test "refund-flow"  # Only check one test

╭─────────────────────────── EvalView Watch ────────────────────────────╮
│ Watching . │
│ Tests all in tests/ │
│ Mode quick (no judge, $0) │
╰───────────────────────────────────────────────────────────────────────╯
14:32:07 Change detected: src/agent.py
╭──────────────────────────── Scorecard ────────────────────────────────╮
│ ████████████████████░░░░ 4 passed · 1 tools changed · 0 regressions │
╰───────────────────────────────────────────────────────────────────────╯
⚠ TOOLS_CHANGED refund-flow 1 tool change(s)
Watching for changes...

Multi-Turn Testing
Most eval tools handle single-turn well. EvalView is built for multi-turn — clarification paths, follow-up handling, and tool use across conversations.
name: refund-needs-order-number
turns:
  - query: "I want a refund"
    expected:
      output:
        contains: ["order number"]
  - query: "Order 4812"
    expected:
      tools: ["lookup_order", "check_policy"]
      forbidden_tools: ["delete_order"]
      output:
        contains: ["refund", "processed"]
        not_contains: ["error"]
thresholds:
  min_score: 70

Each turn is scored independently with conversation context: per-turn judge scoring, not just the final response.
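Conceptually, the runner replays the conversation turn by turn and applies each turn's checks in isolation. A stripped-down sketch for the schema above; `run_agent_turn` is a hypothetical stand-in for your agent adapter:

```python
# Illustrative per-turn check loop for the multi-turn YAML schema above.
def check_conversation(turns: list[dict], run_agent_turn) -> list[bool]:
    history: list[dict] = []
    results: list[bool] = []
    for turn in turns:
        history.append({"role": "user", "content": turn["query"]})
        output, tools_called = run_agent_turn(history)  # agent sees full context
        history.append({"role": "assistant", "content": output})

        exp = turn.get("expected", {})
        ok = all(t in tools_called for t in exp.get("tools", []))
        ok &= not any(t in tools_called for t in exp.get("forbidden_tools", []))
        out = exp.get("output", {})
        ok &= all(s in output for s in out.get("contains", []))
        ok &= not any(s in output for s in out.get("not_contains", []))
        results.append(ok)  # each turn passes or fails on its own
    return results
```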
Smart DX
EvalView doesn't just run tests — it understands your agent and configures itself.
Assertion Wizard — Tests From Real Traffic
Capture real interactions, get pre-configured tests. No YAML writing.
evalview capture --agent http://localhost:8000/invoke
# Use your agent normally, then Ctrl+C

Assertion Wizard — analyzing 8 captured interactions
Agent type detected: multi-step
Tools seen search, extract, summarize
Consistent sequence search -> extract -> summarize
Suggested assertions:
1. Lock tool sequence: search -> extract -> summarize (recommended)
2. Require tools: search, extract, summarize (recommended)
3. Max latency: 5000ms (recommended)
4. Minimum quality score: 70 (recommended)
Accept all recommended? [Y/n]: y
Applied 4 assertions to 8 test files

Auto-Variant Discovery — Solve Non-Determinism
Non-deterministic agents take different valid paths. Let EvalView discover and save them:
evalview check --statistical 10 --auto-variant

search-flow  mean: 82.3, std: 8.1, flakiness: low_variance
1. search -> extract -> summarize (7/10 runs, avg score: 85.2)
2. search -> summarize (3/10 runs, avg score: 78.1)
Save as golden variant? [Y/n]: y
Saved variant 'auto-v1': search -> summarize

Run N times. Cluster the paths. Save the valid ones. Tests stop being flaky — automatically.
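The clustering idea is simple enough to sketch. This illustrates the approach, not EvalView's implementation:

```python
# Illustrative variant discovery: group runs by exact tool sequence and
# keep sequences that recur often enough to count as valid variants.
from collections import Counter

def discover_variants(runs: list[list[str]], min_runs: int = 2) -> list[tuple[str, ...]]:
    counts = Counter(tuple(path) for path in runs)
    return [path for path, n in counts.most_common() if n >= min_runs]

runs = [["search", "extract", "summarize"]] * 7 + [["search", "summarize"]] * 3
print(discover_variants(runs))
# [('search', 'extract', 'summarize'), ('search', 'summarize')]
```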
Auto-Heal — Fix Flakes Without Leaving CI
Model got silently updated? Output drifted? --heal retries safe failures, proposes variants for borderline cases, and hard-escalates everything else. It also records when those retries were triggered by a likely model/runtime update.
evalview check --heal

⚠ Model update detected: gpt-5-2025-08-07 → gpt-5.1-2025-11-12 (3 tests affected)
✓ login-flow PASSED
⚡ refund-request HEALED retried — non-deterministic drift
⚡ order-lookup HEALED retried — likely model/runtime update
◈ billing-dispute PROPOSED saved candidate variant auto_heal_a1b2 (score 72)
⚠ search-flow REVIEW tool removed: web_search
✗ safety-check BLOCKED forbidden tool called — cannot heal
3 resolved, 1 candidate variant saved, 1 needs review, 1 blocked.
Model update: 2 of 3 affected tests healed via retry. Run `evalview snapshot` to rebase.
Audit log: .evalview/healing/2026-03-25T14-30-00.json

Decision policy: Retry when tools match but output drifted (non-determinism or likely model/runtime update). Propose a variant when retry fails but score is acceptable. Never auto-resolve structural changes, forbidden tool violations, cost spikes, or score improvements. Full audit trail in .evalview/healing/.
Exit code: 0 only when every failure was resolved via retry. Proposed variants, reviews, and blocks always exit 1 — CI stays honest.
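That decision policy reads naturally as a small state machine. Here is a sketch of the mapping; the function and field names are assumptions, only the policy itself comes from above:

```python
# Illustrative version of the --heal decision policy described above.
from enum import Enum

class HealAction(Enum):
    HEAL = "heal"        # resolved via retry
    PROPOSE = "propose"  # save a candidate variant for review
    REVIEW = "review"    # structural change, needs a human
    BLOCK = "block"      # forbidden tool called, never auto-resolve

def decide(tools_match: bool, forbidden_tool_called: bool,
           retry_passed: bool, score: float, min_score: float = 70) -> HealAction:
    if forbidden_tool_called:
        return HealAction.BLOCK
    if not tools_match:
        return HealAction.REVIEW   # e.g. a tool was added or removed
    if retry_passed:
        return HealAction.HEAL     # non-determinism or model/runtime update
    if score >= min_score:
        return HealAction.PROPOSE
    return HealAction.REVIEW
```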
Budget circuit breaker — enforced mid-execution, not post-hoc:

evalview check --budget 0.50

$0.12 (24%) — search-flow
$0.09 (18%) — refund-flow
$0.31 (62%) — billing-dispute
Budget circuit breaker tripped: $0.52 spent of $0.50 limit
2 test(s) skipped to stay within budget

Smart eval profiles — evalview init detects your agent type and pre-configures evaluators:
Five profiles — chat, tool-use, multi-step, rag, coding — each with tailored thresholds, recommended checks, and actionable tips. Override with --profile rag.
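Conceptually, a profile is just a bundle of evaluators and thresholds keyed by agent type. A sketch of that shape; the evaluator names and threshold values here are made up for illustration:

```python
# Illustrative shape of the eval profiles selected by `evalview init`.
PROFILES: dict[str, dict] = {
    "chat":       {"evaluators": ["llm_judge", "semantic"],      "min_score": 70},
    "tool-use":   {"evaluators": ["tool_sequence", "params"],    "min_score": 75},
    "multi-step": {"evaluators": ["tool_sequence", "llm_judge"], "min_score": 70},
    "rag":        {"evaluators": ["semantic", "contains"],       "min_score": 65},
    "coding":     {"evaluators": ["code_checks", "llm_judge"],   "min_score": 75},
}

def pick_profile(agent_type: str) -> dict:
    # Fall back to the chat profile for unknown agent types.
    return PROFILES.get(agent_type, PROFILES["chat"])
```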
Supported Frameworks
Works with LangGraph, CrewAI, OpenAI, Claude, Mistral, HuggingFace, Ollama, MCP, and any HTTP API.
| Agent | E2E Testing | Trace Capture |
|---|---|---|
| LangGraph | ✅ | ✅ |
| CrewAI | ✅ | ✅ |
| OpenAI Assistants | ✅ | ✅ |
| Claude Code | ✅ | ✅ |
| OpenClaw | ✅ | ✅ |
| Ollama | ✅ | ✅ |
| Any HTTP API | ✅ | ✅ |
Framework details → | Flagship starter → | Starter examples →
How It Works
┌────────────┐ ┌──────────┐ ┌──────────────┐
│ Test Cases │ ──→ │ EvalView │ ──→ │ Your Agent │
│ (YAML) │ │ │ ←── │ local / cloud │
└────────────┘      └──────────┘     └──────────────┘

- evalview init — detects your running agent, creates a starter test suite
- evalview snapshot — runs tests, saves traces as baselines
- evalview check — replays tests, diffs against baselines, opens HTML report
- evalview watch — re-runs checks on every file save
- evalview monitor — continuous checks in production with Slack alerts
evalview snapshot list # See all saved baselines
evalview snapshot show "my-test" # Inspect a baseline
evalview snapshot delete "my-test" # Remove a baseline
evalview snapshot --preview # See what would change without saving
evalview snapshot --reset # Clear all and start fresh
evalview replay                  # List tests, or: evalview replay "my-test"

Your data stays local by default. Nothing leaves your machine unless you opt in to cloud sync via evalview login.
Production Monitoring
evalview monitor # Check every 5 min
evalview monitor --dashboard # Live terminal dashboard
evalview monitor --slack-webhook https://hooks.slack.com/services/...
evalview monitor --history monitor.jsonl  # JSONL for dashboards

New regressions trigger Slack alerts. Recoveries send all-clear. No spam on persistent failures.
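The no-spam behavior is transition-based alerting: notify only when a test's status changes, stay quiet while it persists. A sketch of the idea, with a hypothetical `post_to_slack` webhook helper:

```python
# Illustrative transition-based alerting: fire on status edges only,
# so persistent failures do not spam the channel.
last_status: dict[str, str] = {}

def on_check_result(test: str, status: str, post_to_slack) -> None:
    prev = last_status.get(test)
    if prev != status:
        if status != "PASSED":
            post_to_slack(f":warning: {test} regressed: {prev} -> {status}")
        elif prev is not None:
            post_to_slack(f":white_check_mark: {test} recovered")
    last_status[test] = status  # unchanged status: no alert
```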
Key Features
| Feature | Description |
|---|---|
| Assertion wizard | Analyze captured traffic, suggest smart assertions automatically |
| Auto-variant discovery | Run N times, cluster paths, save valid variants |
| Auto-heal | Retry flakes, propose variants, escalate structural changes |
| Budget circuit breaker | Mid-execution budget enforcement with per-test cost breakdown |
| Smart eval profiles | Auto-detect agent type, pre-configure evaluators |
| Baseline diffing | Tool call + parameter + output regression detection |
| Multi-turn testing | Per-turn tool, forbidden_tools, and output checks |
| Multi-reference baselines | Up to 5 variants for non-deterministic agents |
| Safety contracts | Hard-fail on any violation |
| Watch mode | Re-run checks on every file save |
| Python API | Use EvalView as a library, no CLI or subprocess |
| PR comments + alerts | Cost/latency spikes, model changes, collapsible diffs |
| Terminal dashboard | Scorecard, sparkline trends, confidence scoring |
| Feature | Description |
|---|---|
| Multi-turn capture | Capture a multi-turn conversation as one test |
| Semantic similarity | Embedding-based output comparison |
| Production monitoring | Continuous checks with Slack alerts |
| A/B comparison | — |
| Test generation | Generate tests from a live agent or existing logs |
| Per-turn judge scoring | Multi-turn output quality scored per turn with conversation context |
| Silent model detection | Alerts when LLM provider updates the model version |
| Gradual drift detection | Trend analysis across check history |
| Statistical mode (pass@k) | Run N times, require a pass rate, auto-discover variants |
| HTML trace replay | Auto-opens after check with full trace details |
| Verified cost tracking | Per-test cost breakdown with model pricing rates |
| Judge model picker | Choose GPT, Claude, Gemini, DeepSeek, or Ollama (free) |
| Pytest plugin | Run regression checks alongside your existing tests |
| GitHub Actions job summary | Results visible in Actions UI, not just PR comments |
| Git hooks | Pre-push regression blocking, zero CI config |
| LLM judge caching | ~80% cost reduction in statistical mode |
| Quick mode | Deterministic checks only: no LLM judge, $0, sub-second |
| OpenClaw integration | Regression gate skill + gate_or_revert for autonomous loops |
| Snapshot preview | See what would change without saving |
| Skills testing | E2E testing for Claude Code, Codex, OpenClaw |
Python API
Use EvalView as a library — no CLI, no subprocess, no output parsing.
from evalview import gate, DiffStatus
result = gate(test_dir="tests/")
result.passed # bool — True if no regressions
result.status # DiffStatus.PASSED / REGRESSION / TOOLS_CHANGED
result.summary # .total, .unchanged, .regressions, .tools_changed
result.diffs    # List[TestDiff] — per-test scores and tool diffs

Quick mode — skip the LLM judge for free, sub-second checks:

result = gate(test_dir="tests/", quick=True)  # deterministic only, $0

Async — for agent frameworks already in an event loop:

from evalview import gate_async
result = await gate_async(test_dir="tests/")

Autonomous loops — gate + auto-revert on regression:
from evalview.openclaw import gate_or_revert
make_code_change()
if not gate_or_revert("tests/", quick=True):
    # Change was reverted — try a different approach
    try_alternative()

OpenClaw Integration
Use EvalView as a regression gate in autonomous agent loops.
evalview openclaw install # Install gate skill into workspace
evalview openclaw check --path tests/  # Check and auto-revert on regression

from evalview.openclaw import gate_or_revert
make_code_change()
if not gate_or_revert("tests/", quick=True):
    try_alternative()  # Change was reverted

Pytest Plugin
def test_weather_regression(evalview_check):
    diff = evalview_check("weather-lookup")
    assert diff.overall_severity.value != "regression", diff.summary()

pip install evalview  # Plugin registers automatically
pytest                # Runs alongside your existing tests
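The fixture composes with ordinary pytest features. For example, parametrizing one regression assertion across several baselines (the test names here are placeholders for baselines in your own suite):

```python
# Sketch: one regression assertion across several saved baselines.
import pytest

@pytest.mark.parametrize("test_name", ["weather-lookup", "refund-flow"])
def test_no_regressions(evalview_check, test_name):
    diff = evalview_check(test_name)
    assert diff.overall_severity.value != "regression", diff.summary()
```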
Claude Code (MCP)

claude mcp add --transport stdio evalview -- evalview mcp serve

8 tools: create_test, run_snapshot, run_check, list_tests, validate_skill, generate_skill_tests, run_skill_test, generate_visual_report
# 1. Install
pip install evalview
# 2. Connect to Claude Code
claude mcp add --transport stdio evalview -- evalview mcp serve
# 3. Make Claude Code proactive
cp CLAUDE.md.example CLAUDE.md

Then just ask Claude: "did my refactor break anything?" and it runs run_check inline.
Agent-Friendly Docs
Works with your coding agent out of the box. Ask Cursor, Claude Code, or Copilot to add regression tests, build a new adapter, or debug a failing check — EvalView ships the architecture maps and task recipes they need to get it right on the first try.
- AGENT_INSTRUCTIONS.md — architecture map, contracts, invariants, verification commands
- Agent Recipes — task-specific playbooks for common extensions
Documentation
Getting Started | Core Features | Integrations
Contributing
Bug or feature request? Run evalview feedback or open an issue
Questions? GitHub Discussions
Setup help? Email hidai@evalview.com
Contributing? See CONTRIBUTING.md
License: Apache 2.0