
Your agent returns 200 and looks fine. But a model update, a provider change, or a one-line prompt edit just made it skip a clarification, call the wrong tool, or quietly drop output quality. Your tests still pass. Your users notice before you do.

EvalView snapshots your agent's behavior — the tools it calls, in what order, with what output — and tells you the moment that behavior changes. Like Jest snapshots, but for tool-calling, multi-turn agents.

[demo.gif: 30-second live demo, no API key needed]

Quick Start

pip install evalview
evalview snapshot    # Record your agent's current behavior as the baseline
evalview check       # After any change, diff against the baseline

That's the whole loop. For each test, check reports one of:

  ✓ login-flow        PASSED          behavior matches baseline
  ⚠ refund-request    TOOLS_CHANGED   called a different tool, or in a different order
  ✗ billing-dispute   REGRESSION      score dropped — output quality fell

It diffs the whole trajectory — tool names, parameters, and order — not just the final string. The deterministic tool + sequence diff runs offline, with no API key. Add an LLM judge only when you want output-quality scoring.
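To make the idea of a trajectory diff concrete, here is a minimal sketch of comparing two recorded tool-call sequences by name, parameters, and order. It is an illustration only, not EvalView's implementation: `ToolCall` and `diff_trajectory` are hypothetical names, and the real snapshot format may differ.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ToolCall:
    """One recorded tool invocation (hypothetical shape, for illustration)."""
    name: str
    params: dict[str, Any] = field(default_factory=dict)


def diff_trajectory(baseline: list[ToolCall], current: list[ToolCall]) -> list[str]:
    """Return human-readable differences; an empty list means no drift."""
    diffs = []
    # Compare position by position, so a reordering shows up as a name change.
    for i, (old, new) in enumerate(zip(baseline, current)):
        if old.name != new.name:
            diffs.append(f"step {i}: tool {old.name!r} -> {new.name!r}")
        elif old.params != new.params:
            diffs.append(f"step {i}: {old.name} params changed")
    # Extra or missing calls at the end are reported as a length change.
    if len(baseline) != len(current):
        diffs.append(f"length: {len(baseline)} calls -> {len(current)} calls")
    return diffs
```

Because the comparison is plain equality on names and parameters, it needs no model call, which is consistent with the offline, no-API-key claim above.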

No agent yet? See it work in 30 seconds:

evalview demo

Why snapshot testing (and not assertions)?

Most eval tools ask you to write down what "good" looks like — assertions, metrics, rubrics. That's a lot of upfront work, and you can only catch the failures you thought to assert.

EvalView inverts it: it records what your agent actually does now, and flags any drift from that. You catch regressions you never anticipated, with zero assertions written. When the new behavior is correct, evalview snapshot accepts it as the new baseline — same as updating a snapshot in Jest.

| | EvalView | Assertion-based eval tools |
| --- | --- | --- |
| Setup | Record current behavior | Write assertions/metrics first |
| Catches | Any drift from baseline | Only what you asserted |
| Non-determinism | Multi-variant baselines (up to 5 valid paths) | You handle it |
| Unit of comparison | Full tool-call trajectory | Usually final output |
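One way to picture multi-variant baselines: a run passes if its tool sequence matches any one of the accepted paths. The sketch below assumes trajectories are reduced to lists of tool names; `matches_any_baseline` and `MAX_VARIANTS` are hypothetical names, and how EvalView actually stores or matches variants is not specified here.

```python
MAX_VARIANTS = 5  # cap taken from the comparison above ("up to 5 valid paths")


def matches_any_baseline(current: list[str], variants: list[list[str]]) -> bool:
    """Return True if the current tool sequence equals any accepted variant."""
    if len(variants) > MAX_VARIANTS:
        raise ValueError(f"at most {MAX_VARIANTS} baseline variants are allowed")
    # Order matters: the same tools in a different order count as a mismatch.
    return any(current == variant for variant in variants)


# Two accepted paths for the same (hypothetical) refund test.
accepted = [
    ["lookup_order", "issue_refund"],
    ["lookup_order", "check_policy", "issue_refund"],
]
print(matches_any_baseline(["lookup_order", "issue_refund"], accepted))
```

This lets a non-deterministic agent take either accepted route without being flagged, while an unseen route still counts as drift.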

This makes EvalView a merge-time regression gate, which is a different job from observability (Langfuse, LangSmith) or metric scoring (promptfoo, DeepEval, Braintrust). Many teams run one of those for visibility and EvalView as the gate. Honest comparisons →

EvalView tests itself in public, every day

The badge at the top is live. Every day at 09:00 UTC, a GitHub Action runs EvalView against EvalView — including a regression check where the tool snapshots a live agent and diffs it with the same snapshot / check loop this README asks you to trust. It also runs the full test suite, type checks, evalview demo, the end-to-end flows, an evalview monitor smoke test, and chat-mode self-tests.

When something breaks, the run opens a single rolling 🐕 dogfood issue and keeps updating it until the tool is green again — so failures are public, not quietly patched.

Live dogfood runs → · How it works →

CI: block regressions in every PR

# .github/workflows/evalview.yml
name: EvalView
on: [pull_request]
jobs:
  agent-check:
    runs-on: ubuntu-latest
    permissions: { pull-requests: write }
    steps:
      - uses: actions/checkout@v4
      - uses: hidai25/eval-view@v0.8.0
        with:
          openai-api-key: ${{ secrets.OPENAI_API_KEY }}

You get a PR comment with the diff, cost/latency deltas, and a pass/fail gate. CI/CD guide →

Works with your stack

LangGraph · CrewAI · OpenAI · Claude · Mistral · Ollama · MCP · any HTTP API.

evalview check --agent http://localhost:8000/invoke

Framework details →

Use it as a library

from evalview import gate

result = gate(test_dir="tests/")
result.passed   # bool
result.diffs    # per-test scores and tool diffs
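As a sketch of how the result might drive a script's exit status, the helper below maps `result.passed` to an exit code. It relies only on the `gate`, `passed`, and `diffs` names shown above; `exit_code` and `main` are hypothetical helpers, and the exact shape of `result.diffs` is not documented here, so it is printed as-is.

```python
import sys


def exit_code(result) -> int:
    """Map a gate result to a process exit code: 0 on pass, 1 on failure."""
    return 0 if result.passed else 1


def main() -> int:
    # Imported lazily so the helper above can be used without evalview installed.
    from evalview import gate

    result = gate(test_dir="tests/")
    if not result.passed:
        # Shape of result.diffs is an assumption-free passthrough: just print it.
        print(result.diffs, file=sys.stderr)
    return exit_code(result)
```

Calling `sys.exit(main())` from a script would then fail the surrounding job whenever the gate fails.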

Python API →

More

EvalView also does multi-turn testing, statistical/pass@k runs, record/replay cassettes, model-drift canaries, production monitoring with Slack alerts, and auto-generated regression tests from incidents. These are power-user features — start with snapshot and check, reach for the rest when you need them.

Full feature reference · Getting Started · FAQ

Contributing

This is a young project built mostly by one developer. Issues, PRs, and "I tried it and X was confusing" feedback are all genuinely valuable.

License: Apache 2.0

