Your agent can still return 200 and be wrong. A model or provider update can change tool choice, skip a clarification, or degrade output quality without changing your code or breaking a health check. EvalView catches those silent regressions before users do.

You don't need frontier-lab resources to run a serious agent regression loop. EvalView gives solo devs, startups, and small AI teams the same core discipline: snapshot behavior, detect drift, classify changes, and review or heal them safely.

Traditional tests tell you if your agent is up. EvalView tells you if it still behaves correctly. It tracks drift across outputs, tools, model IDs, and runtime fingerprints, so you can tell "the provider changed" from "my system regressed."

[demo.gif: 30-second live demo]

Most eval tools stop at detect and compare. EvalView helps you classify changes, inspect drift, and auto-heal the safe cases.

  • Catch silent regressions that normal tests miss

  • Separate provider/model drift from real system regressions

  • Auto-heal flaky failures with retries, review gates, and audit logs

Built for frontier-lab rigor, startup-team practicality:

  • targeted behavior runs instead of giant always-on eval suites

  • deterministic diffs first, LLM judgment where it adds signal

  • faster loops from change -> eval -> review -> ship

How we run EvalView with this operating model →

  ✓ login-flow           PASSED
  ⚠ refund-request       TOOLS_CHANGED
      - lookup_order → check_policy → process_refund
      + lookup_order → check_policy → process_refund → escalate_to_human
  ✗ billing-dispute      REGRESSION  -30 pts
      Score: 85 → 55  Output similarity: 35%
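
The tool-sequence diff in the sample output above can be approximated with Python's standard library. This is a minimal sketch, not EvalView's actual diffing code: the `diff_tools` helper is a hypothetical name, and it compares tool names only (parameters are ignored).

```python
from difflib import SequenceMatcher


def diff_tools(baseline, current):
    """Return (added, removed) tool names, in call order."""
    added, removed = [], []
    matcher = SequenceMatcher(a=baseline, b=current, autojunk=False)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op in ("delete", "replace"):
            removed.extend(baseline[i1:i2])
        if op in ("insert", "replace"):
            added.extend(current[j1:j2])
    return added, removed


# Sequences taken from the refund-request example above.
baseline = ["lookup_order", "check_policy", "process_refund"]
current = ["lookup_order", "check_policy", "process_refund", "escalate_to_human"]

added, removed = diff_tools(baseline, current)
status = "TOOLS_CHANGED" if (added or removed) else "PASSED"
print(status, "added:", added, "removed:", removed)
```

Comparing ordered sequences rather than sets means a reordering of the same tools is also reported as a change.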

Quick Start

pip install evalview
evalview init        # Detect agent, auto-configure profile + starter suite
evalview snapshot    # Save current behavior as baseline
evalview check       # Catch regressions after every change

That's it: once installed, three commands regression-test any AI agent. init auto-detects your agent type (chat, tool-use, multi-step, RAG, coding) and configures the right evaluators, thresholds, and assertions.

curl -fsSL https://raw.githubusercontent.com/hidai25/eval-view/main/install.sh | bash
evalview demo        # See regression detection live (~30 seconds, no API key)

Or clone a real working agent with built-in tests:

git clone https://github.com/hidai25/evalview-support-automation-template
cd evalview-support-automation-template
make run

evalview generate --agent http://localhost:8000           # Generate tests from a live agent
evalview capture --agent http://localhost:8000/invoke      # Capture real user flows (runs assertion wizard after)
evalview capture --agent http://localhost:8000/invoke --multi-turn  # Multi-turn conversation as one test
evalview generate --from-log traffic.jsonl                # Generate from existing logs
evalview init --profile rag                               # Override auto-detected agent profile

Why EvalView?

Use LangSmith for observability. Use Braintrust for scoring. Use EvalView for regression gating.

LangSmith

Braintrust

Promptfoo

EvalView

Primary focus

Observability

Scoring

Prompt comparison

Regression detection

Tool call + parameter diffing

Yes

Golden baseline regression

Manual

Automatic

Silent model change detection

Yes

Auto-heal (retry + variant proposal)

Yes

PR comments with alerts

Cost, latency, model change

Works without API keys

No

No

Partial

Yes

Production monitoring

Tracing

Check loop + Slack

Detailed comparisons →

What It Catches

| Status | Meaning | Action |
|--------|---------|--------|
| PASSED | Behavior matches baseline | Ship with confidence |
| ⚠️ TOOLS_CHANGED | Different tools called | Review the diff |
| ⚠️ OUTPUT_CHANGED | Same tools, output shifted | Review the diff |
| REGRESSION | Score dropped significantly | Fix before shipping |
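
The four statuses can be read as a small decision procedure. The sketch below is one possible interpretation, not EvalView's implementation: the `Run` type, the 10-point `REGRESSION_DROP` threshold, and the order in which conditions are checked are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Run:
    tools: list[str]  # ordered tool names
    output: str
    score: float      # 0-100


REGRESSION_DROP = 10.0  # hypothetical threshold for a "significant" drop


def classify(baseline: Run, current: Run) -> str:
    # A large score drop wins over everything else (assumed priority).
    if baseline.score - current.score >= REGRESSION_DROP:
        return "REGRESSION"
    if baseline.tools != current.tools:
        return "TOOLS_CHANGED"
    if baseline.output != current.output:
        return "OUTPUT_CHANGED"
    return "PASSED"


base = Run(["lookup_order", "check_policy"], "Refund approved.", 85)
print(classify(base, Run(["lookup_order", "check_policy"], "Refund approved.", 85)))
print(classify(base, Run(["lookup_order"], "Refund approved.", 84)))
print(classify(base, Run(["lookup_order", "check_policy"], "Sorry, no.", 55)))
```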

Model / Runtime Change Detection

EvalView does more than compare model_id.

  • Declared model change: adapter-reported model changed from baseline

  • Runtime fingerprint change: observed model labels in the trace changed, even when the top-level model name is missing

  • Coordinated drift: multiple tests shift together in the same check run, which often points to a silent provider rollout or runtime change

When detected, evalview check surfaces a run-level signal with a classification (declared or suspected), confidence level, and evidence from fingerprints, retries, and affected tests.

If the new behavior is correct, rerun evalview snapshot to accept the updated baseline.
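
To make the declared/suspected distinction concrete, here is a rough sketch of how a run-level signal could be derived. It is illustrative only: the function name, the "at least two tests and at least half the suite" rule for coordinated drift, and the inputs are all assumptions, not EvalView's actual heuristics.

```python
def classify_model_change(baseline_model, current_model,
                          shifted_tests, total_tests, min_fraction=0.5):
    """Return "declared", "suspected", or None for one check run."""
    # Declared: the adapter itself reports a different model than the baseline.
    if baseline_model and current_model and baseline_model != current_model:
        return "declared"
    # Suspected: no declared change, but many tests drifted together.
    if total_tests and len(shifted_tests) >= 2:
        if len(shifted_tests) / total_tests >= min_fraction:
            return "suspected"
    return None


print(classify_model_change("gpt-5.4", "gpt-5.4-mini", [], 5))
print(classify_model_change("gpt-5.4", "gpt-5.4", ["a", "b", "c"], 5))
print(classify_model_change("gpt-5.4", "gpt-5.4", ["a"], 5))
```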

Four scoring layers — the first two are free and offline:

| Layer | What it checks | Cost |
|-------|----------------|------|
| Tool calls + sequence | Exact tool names, order, parameters | Free |
| Code-based checks | Regex, JSON schema, contains/not_contains | Free |
| Semantic similarity | Output meaning via embeddings | ~$0.00004/test |
| LLM-as-judge | Output quality scored by LLM (GPT, Claude, Gemini, DeepSeek, Ollama) | ~$0.01/test |

Score Breakdown
  Tools 100% ×30%    Output 42/100 ×50%    Sequence ✓ ×20%    = 71/100
  ↑ tools were fine   ↑ this is the problem
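
Read as a weighted sum, a breakdown like this combines three components using the 30/50/20 weights displayed. The sketch below assumes exactly that and nothing more: treating the sequence check as all-or-nothing (100 or 0) is an assumption, the function name is hypothetical, and the real scorer may include factors not shown here.

```python
# Weights in percent, as displayed in the score breakdown.
WEIGHTS = {"tools": 30, "output": 50, "sequence": 20}


def overall_score(tools_pct, output_score, sequence_ok):
    """Weighted sum of the three components, on a 0-100 scale."""
    parts = {
        "tools": tools_pct,
        "output": output_score,
        "sequence": 100 if sequence_ok else 0,  # assumed all-or-nothing
    }
    return sum(WEIGHTS[name] * parts[name] for name in WEIGHTS) / 100


# Hypothetical run: tools correct, mediocre output, wrong call order.
print(overall_score(tools_pct=100, output_score=60, sequence_ok=False))
```

Because output carries half the weight, a weak answer drags the total down even when every tool call is correct.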

CI/CD Integration

Block broken agents in every PR. One step — PR comments, artifacts, and job summary are automatic.

# .github/workflows/evalview.yml — copy this, add your secret, done
name: EvalView Agent Check
on: [pull_request, push]

jobs:
  agent-check:
    runs-on: ubuntu-latest
    permissions:
      pull-requests: write
    steps:
      - uses: actions/checkout@v4

      - name: Check for agent regressions
        uses: hidai25/eval-view@main
        with:
          openai-api-key: ${{ secrets.OPENAI_API_KEY }}

When everything passes, the PR comment looks like this:

## ✅ EvalView: PASSED

| Metric | Value |
|--------|-------|
| Tests | 5/5 unchanged (100%) |

---
*Generated by EvalView*

When something breaks:

## ❌ EvalView: REGRESSION

> **Alerts**
> - 💸 Cost spike: $0.02 → $0.08 (+300%)
> - 🤖 Model changed: gpt-5.4 → gpt-5.4-mini

| Metric | Value |
|--------|-------|
| Tests | 3/5 unchanged (60%) |
| Regressions | 1 |
| Tools Changed | 1 |

### Changes from Baseline
- ❌ **search-flow**: score -15.0, 1 tool change(s)
- ⚠️ **create-flow**: 1 tool change(s)

Common options: strict: 'true' | fail-on: 'REGRESSION,TOOLS_CHANGED' | mode: 'run' | filter: 'my-test'

Also works with pre-push hooks (evalview install-hooks) and status badges (evalview badge).

Full CI/CD guide →

Watch Mode

Leave it running while you code. Every file save triggers a regression check.

evalview watch                          # Watch current dir, check on change
evalview watch --quick                  # No LLM judge — $0, sub-second
evalview watch --test "refund-flow"     # Only check one test

╭─────────────────────────── EvalView Watch ────────────────────────────╮
│   Watching   .                                                        │
│   Tests      all in tests/                                            │
│   Mode       quick (no judge, $0)                                     │
╰───────────────────────────────────────────────────────────────────────╯

14:32:07  Change detected: src/agent.py

╭──────────────────────────── Scorecard ────────────────────────────────╮
│ ████████████████████░░░░  4 passed · 1 tools changed · 0 regressions │
╰───────────────────────────────────────────────────────────────────────╯
  ⚠ TOOLS_CHANGED  refund-flow  1 tool change(s)

Watching for changes...

Multi-Turn Testing

Most eval tools handle single-turn well. EvalView is built for multi-turn — clarification paths, follow-up handling, and tool use across conversations.

name: refund-needs-order-number
turns:
  - query: "I want a refund"
    expected:
      output:
        contains: ["order number"]
  - query: "Order 4812"
    expected:
      tools: ["lookup_order", "check_policy"]
      forbidden_tools: ["delete_order"]
      output:
        contains: ["refund", "processed"]
        not_contains: ["error"]
thresholds:
  min_score: 70

Each turn is scored independently, with the preceding conversation as context. The judge scores every turn, not just the final response.
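
The deterministic part of a turn's `expected` block can be pictured as a handful of membership checks. This is a hedged sketch rather than EvalView's evaluator: `check_turn` is a hypothetical helper, `tools` is treated as a required subset, and phrase matching is assumed to be case-insensitive.

```python
def check_turn(expected, tools_called, output):
    """Return a list of failure messages for one turn (empty means pass)."""
    failures = []
    for tool in expected.get("tools", []):
        if tool not in tools_called:
            failures.append(f"missing tool: {tool}")
    for tool in expected.get("forbidden_tools", []):
        if tool in tools_called:
            failures.append(f"forbidden tool called: {tool}")
    text = output.lower()  # assumed case-insensitive matching
    rules = expected.get("output", {})
    for phrase in rules.get("contains", []):
        if phrase.lower() not in text:
            failures.append(f"output missing: {phrase!r}")
    for phrase in rules.get("not_contains", []):
        if phrase.lower() in text:
            failures.append(f"output must not contain: {phrase!r}")
    return failures


# Second turn of the YAML example above, written as a Python dict.
turn_two = {
    "tools": ["lookup_order", "check_policy"],
    "forbidden_tools": ["delete_order"],
    "output": {"contains": ["refund", "processed"], "not_contains": ["error"]},
}
ok = check_turn(turn_two, ["lookup_order", "check_policy", "process_refund"],
                "Your refund for order 4812 has been processed.")
bad = check_turn(turn_two, ["lookup_order", "delete_order"],
                 "An error occurred.")
print(ok)
print(bad)
```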

Smart DX

EvalView doesn't just run tests — it understands your agent and configures itself.

Assertion Wizard — Tests From Real Traffic

Capture real interactions, get pre-configured tests. No YAML writing.

evalview capture --agent http://localhost:8000/invoke
# Use your agent normally, then Ctrl+C

Assertion Wizard — analyzing 8 captured interactions

  Agent type detected: multi-step
  Tools seen          search, extract, summarize
  Consistent sequence search -> extract -> summarize

  Suggested assertions:
    1. Lock tool sequence: search -> extract -> summarize  (recommended)
    2. Require tools: search, extract, summarize           (recommended)
    3. Max latency: 5000ms                                 (recommended)
    4. Minimum quality score: 70                           (recommended)

  Accept all recommended? [Y/n]: y
  Applied 4 assertions to 8 test files

Auto-Variant Discovery — Solve Non-Determinism

Non-deterministic agents take different valid paths. Let EvalView discover and save them:

evalview check --statistical 10 --auto-variant

  search-flow  mean: 82.3, std: 8.1, flakiness: low_variance
    1. search -> extract -> summarize  (7/10 runs, avg score: 85.2)
    2. search -> summarize             (3/10 runs, avg score: 78.1)

    Save as golden variant? [Y/n]: y
    Saved variant 'auto-v1': search -> summarize

Run N times. Cluster the paths. Save the valid ones. Tests stop being flaky — automatically.
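
The "run N times, cluster the paths" idea can be sketched in a few lines. This is an illustration under stated assumptions, not the tool's algorithm: paths are clustered by exact tool sequence, a cluster is kept when its average score reaches a hypothetical `min_score` of 70, and the scores in the example are made up.

```python
from collections import Counter


def discover_variants(runs, min_score=70.0):
    """runs: list of (tool_path, score). Returns [(path, run_count, avg_score)]."""
    counts = Counter(path for path, _ in runs)
    variants = []
    for path, n in counts.most_common():
        scores = [score for p, score in runs if p == path]
        avg = sum(scores) / len(scores)
        if avg >= min_score:  # only keep paths that still score acceptably
            variants.append((path, n, round(avg, 1)))
    return variants


# Ten hypothetical runs: seven take the long path, three skip "extract".
long_path = ("search", "extract", "summarize")
short_path = ("search", "summarize")
runs = [(long_path, 85.0)] * 7 + [(short_path, 78.0)] * 3

for path, n, avg in discover_variants(runs):
    print(" -> ".join(path), f"({n}/10 runs, avg score: {avg})")
```

Each surviving path would then be saved as an additional accepted baseline, so a run that takes any of them no longer counts as a tool change.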

Auto-Heal — Fix Flakes Without Leaving CI

Model got silently updated? Output drifted? --heal retries safe failures, proposes variants for borderline cases, and hard-escalates everything else. It also records when those retries were triggered by a likely model/runtime update.

evalview check --heal

  ⚠ Model update detected: gpt-5-2025-08-07 → gpt-5.1-2025-11-12 (3 tests affected)

  ✓ login-flow           PASSED
  ⚡ refund-request       HEALED   retried — non-deterministic drift
  ⚡ order-lookup         HEALED   retried — likely model/runtime update
  ◈ billing-dispute      PROPOSED saved candidate variant auto_heal_a1b2 (score 72)
  ⚠ search-flow          REVIEW   tool removed: web_search
  ✗ safety-check         BLOCKED  forbidden tool called — cannot heal

  3 resolved, 1 candidate variant saved, 1 needs review, 1 blocked.
  Model update: 2 of 3 affected tests healed via retry. Run `evalview snapshot` to rebase.
  Audit log: .evalview/healing/2026-03-25T14-30-00.json

Decision policy: Retry when tools match but output drifted (non-determinism or likely model/runtime update). Propose a variant when retry fails but score is acceptable. Never auto-resolve structural changes, forbidden tool violations, cost spikes, or score improvements. Full audit trail in .evalview/healing/.

Exit code: 0 only when every failure was resolved via retry. Proposed variants, reviews, and blocks always exit 1 — CI stays honest.
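
The decision policy and exit-code rule can be expressed as two small functions. The sketch below follows the policy as described above but is not EvalView's code: the `Failure` fields, the acceptable-score cutoff of 70, and mapping "everything else" to REVIEW are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Failure:
    tools_match: bool
    forbidden_tool_called: bool = False
    cost_spike: bool = False
    score_improved: bool = False
    retry_passed: bool = False
    score: float = 0.0


MIN_ACCEPTABLE_SCORE = 70  # hypothetical cutoff for proposing a variant


def heal_action(f: Failure) -> str:
    if f.forbidden_tool_called:
        return "BLOCKED"
    # Structural changes, cost spikes, and score improvements are never auto-resolved.
    if not f.tools_match or f.cost_spike or f.score_improved:
        return "REVIEW"
    if f.retry_passed:
        return "HEALED"
    if f.score >= MIN_ACCEPTABLE_SCORE:
        return "PROPOSED"
    return "REVIEW"


def exit_code(outcomes):
    """0 only if every test passed outright or was healed via retry."""
    return 0 if all(o in ("PASSED", "HEALED") for o in outcomes) else 1


outcomes = [
    "PASSED",
    heal_action(Failure(tools_match=True, retry_passed=True)),
    heal_action(Failure(tools_match=True, score=72)),
    heal_action(Failure(tools_match=False)),
    heal_action(Failure(tools_match=True, forbidden_tool_called=True)),
]
print(outcomes, "exit", exit_code(outcomes))
```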

Budget circuit breaker — enforced mid-execution, not post-hoc:

evalview check --budget 0.50

  $0.12 (24%) — search-flow
  $0.09 (18%) — refund-flow
  $0.31 (62%) — billing-dispute

  Budget circuit breaker tripped: $0.52 spent of $0.50 limit
  2 test(s) skipped to stay within budget
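
"Enforced mid-execution" means the running total is checked between tests rather than once at the end. A minimal sketch of that loop follows, assuming costs are known only after each test finishes and are tracked in whole cents to avoid floating-point drift. The function name is hypothetical, and so are the last two tests in the example.

```python
def run_with_budget(tests, limit_cents):
    """tests: list of (name, cost_cents) in execution order."""
    spent, ran, skipped = 0, [], []
    for name, cost in tests:
        if spent >= limit_cents:   # breaker already tripped: skip the rest
            skipped.append(name)
            continue
        spent += cost              # cost is only known once the test has run
        ran.append(name)
    return spent, ran, skipped


tests = [
    ("search-flow", 12),
    ("refund-flow", 9),
    ("billing-dispute", 31),
    ("login-flow", 5),      # hypothetical remaining tests
    ("order-lookup", 7),
]
spent, ran, skipped = run_with_budget(tests, limit_cents=50)
print(f"${spent / 100:.2f} spent, {len(ran)} ran, {len(skipped)} skipped")
```

The total can overshoot the limit by at most one test's cost, which matches the $0.52 of $0.50 shown in the sample output.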

Smart eval profiles: evalview init detects your agent type and pre-configures evaluators.

Five profiles — chat, tool-use, multi-step, rag, coding — each with tailored thresholds, recommended checks, and actionable tips. Override with --profile rag.

Supported Frameworks

Works with LangGraph, CrewAI, OpenAI, Claude, Mistral, HuggingFace, Ollama, MCP, and any HTTP API.

Agent

E2E Testing

Trace Capture

LangGraph

CrewAI

OpenAI Assistants

Claude Code

OpenClaw

Ollama

Any HTTP API

Framework details → | Flagship starter → | Starter examples →

How It Works

┌────────────┐      ┌──────────┐      ┌───────────────┐
│ Test Cases │ ──→  │ EvalView │ ──→  │  Your Agent   │
│   (YAML)   │      │          │ ←──  │ local / cloud │
└────────────┘      └──────────┘      └───────────────┘
  1. evalview init — detects your running agent, creates a starter test suite

  2. evalview snapshot — runs tests, saves traces as baselines

  3. evalview check — replays tests, diffs against baselines, opens HTML report

  4. evalview watch — re-runs checks on every file save

  5. evalview monitor — continuous checks in production with Slack alerts

evalview snapshot list              # See all saved baselines
evalview snapshot show "my-test"    # Inspect a baseline
evalview snapshot delete "my-test"  # Remove a baseline
evalview snapshot --preview         # See what would change without saving
evalview snapshot --reset           # Clear all and start fresh
evalview replay                     # List tests, or: evalview replay "my-test"

Your data stays local by default. Nothing leaves your machine unless you opt in to cloud sync via evalview login.

Production Monitoring

evalview monitor                                         # Check every 5 min
evalview monitor --dashboard                             # Live terminal dashboard
evalview monitor --slack-webhook https://hooks.slack.com/services/...
evalview monitor --history monitor.jsonl                 # JSONL for dashboards

New regressions trigger Slack alerts. Recoveries send all-clear. No spam on persistent failures.
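
This alerting behavior amounts to notifying on state transitions only. The sketch below shows the idea with plain sets. It is an assumption about the approach, not the monitor's actual code, and the message strings are invented.

```python
def alerts_for_cycle(previous_failing, current_failing):
    """Compare two monitoring cycles and return the messages to send."""
    new = sorted(current_failing - previous_failing)
    recovered = sorted(previous_failing - current_failing)
    # Tests failing in both cycles produce nothing, so persistent failures stay quiet.
    return [f"REGRESSION: {name}" for name in new] + \
           [f"RECOVERED: {name}" for name in recovered]


cycles = [{"refund-flow"}, {"refund-flow"}, set()]
previous = set()
log = []
for failing in cycles:
    log.append(alerts_for_cycle(previous, failing))
    previous = failing
print(log)
```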

Monitor config options →

Key Features

| Feature | Description | Docs |
|---------|-------------|------|
| Assertion wizard | Analyze captured traffic, suggest smart assertions automatically | Above |
| Auto-variant discovery | Run N times, cluster paths, save valid variants | Above |
| Auto-heal | Retry flakes, propose variants, escalate structural changes | Above |
| Budget circuit breaker | Mid-execution budget enforcement with per-test cost breakdown | Above |
| Smart eval profiles | Auto-detect agent type, pre-configure evaluators | Above |
| Baseline diffing | Tool call + parameter + output regression detection | Docs |
| Multi-turn testing | Per-turn tool, forbidden_tools, and output checks | Docs |
| Multi-reference baselines | Up to 5 variants for non-deterministic agents | Docs |
| forbidden_tools | Safety contracts: hard-fail on any violation | Docs |
| Watch mode | evalview watch: re-run checks on file save, with dashboard | Docs |
| Python API | gate() / gate_async(): programmatic regression checks | Docs |
| PR comments + alerts | Cost/latency spikes, model changes, collapsible diffs | Docs |
| Terminal dashboard | Scorecard, sparkline trends, confidence scoring | |

| Feature | Description | Docs |
|---------|-------------|------|
| Multi-turn capture | capture --multi-turn records conversations as tests | Docs |
| Semantic similarity | Embedding-based output comparison | Docs |
| Production monitoring | evalview monitor --dashboard with Slack alerts and JSONL history | Docs |
| A/B comparison | evalview compare --v1 <url> --v2 <url> | Docs |
| Test generation | evalview generate: discovers your agent's domain, generates relevant tests | Docs |
| Per-turn judge scoring | Multi-turn output quality scored per turn with conversation context | Docs |
| Silent model detection | Alerts when LLM provider updates the model version | Docs |
| Gradual drift detection | Trend analysis across check history | Docs |
| Statistical mode (pass@k) | Run N times, require a pass rate, auto-discover variants | Docs |
| HTML trace replay | Auto-opens after check with full trace details | Docs |
| Verified cost tracking | Per-test cost breakdown with model pricing rates | Docs |
| Judge model picker | Choose GPT, Claude, Gemini, DeepSeek, or Ollama (free) | Docs |
| Pytest plugin | evalview_check fixture for standard pytest | Docs |
| GitHub Actions job summary | Results visible in Actions UI, not just PR comments | Docs |
| Git hooks | Pre-push regression blocking, zero CI config | Docs |
| LLM judge caching | ~80% cost reduction in statistical mode | Docs |
| Quick mode | gate(quick=True): no judge, $0, sub-second | Docs |
| OpenClaw integration | Regression gate skill + gate_or_revert() helpers | Docs |
| Snapshot preview | evalview snapshot --preview: dry-run before saving | |
| Skills testing | E2E testing for Claude Code, Codex, OpenClaw | Docs |

Python API

Use EvalView as a library — no CLI, no subprocess, no output parsing.

from evalview import gate, DiffStatus

result = gate(test_dir="tests/")

result.passed          # bool — True if no regressions
result.status          # DiffStatus.PASSED / REGRESSION / TOOLS_CHANGED
result.summary         # .total, .unchanged, .regressions, .tools_changed
result.diffs           # List[TestDiff] — per-test scores and tool diffs

Quick mode — skip the LLM judge for free, sub-second checks:

result = gate(test_dir="tests/", quick=True)  # deterministic only, $0

Async — for agent frameworks already in an event loop:

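from evalview import gate_async  # assumed to be exported alongside gate()

result = await gate_async(test_dir="tests/")  # call from inside a coroutine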

Autonomous loops — gate + auto-revert on regression:

from evalview.openclaw import gate_or_revert

make_code_change()
if not gate_or_revert("tests/", quick=True):
    # Change was reverted — try a different approach
    try_alternative()

OpenClaw Integration

Use EvalView as a regression gate in autonomous agent loops.

evalview openclaw install                    # Install gate skill into workspace
evalview openclaw check --path tests/        # Check and auto-revert on regression

from evalview.openclaw import gate_or_revert

make_code_change()
if not gate_or_revert("tests/", quick=True):
    try_alternative()  # Change was reverted

Pytest Plugin

def test_weather_regression(evalview_check):
    diff = evalview_check("weather-lookup")
    assert diff.overall_severity.value != "regression", diff.summary()

pip install evalview    # Plugin registers automatically
pytest                  # Runs alongside your existing tests

Claude Code (MCP)

claude mcp add --transport stdio evalview -- evalview mcp serve

8 tools: create_test, run_snapshot, run_check, list_tests, validate_skill, generate_skill_tests, run_skill_test, generate_visual_report

# 1. Install
pip install evalview

# 2. Connect to Claude Code
claude mcp add --transport stdio evalview -- evalview mcp serve

# 3. Make Claude Code proactive
cp CLAUDE.md.example CLAUDE.md

Then just ask Claude: "did my refactor break anything?" and it runs run_check inline.

Agent-Friendly Docs

Works with your coding agent out of the box. Ask Cursor, Claude Code, or Copilot to add regression tests, build a new adapter, or debug a failing check — EvalView ships the architecture maps and task recipes they need to get it right on the first try.

Documentation

Contributing

License: Apache 2.0

