multivon-mcp
The multivon-mcp server provides AI agents with 22 tools for comprehensive evaluation of LLM products, RAG pipelines, agent behavior, and document AI, directly via the Model Context Protocol.
Discovery & Document AI
eval_discover: Retrieve a machine-readable catalog of all evaluators, PDF traps, and test suites.
pdfhell_make: Generate a single adversarial PDF and its answer key for a specific trap type.
pdfhell_run: Run the adversarial-PDF benchmark against a vision model, returning pass rates, confidence intervals, and per-trap-family breakdowns.
eval_audit_pack: Bundle a pdfhell run into a hash-chained, procurement-ready audit ZIP (SHA-256 manifest, JUnit XML, case PDFs).
RAG Generation & Retrieval Evaluation
eval_faithfulness: Check if a RAG output is grounded in the retrieved context (QAG-graded).
eval_hallucination: Detect fabricated information not present in the provided context.
eval_relevance: Assess whether an LLM response addresses the user's question.
eval_answer_accuracy: Evaluate semantic equivalence against a ground-truth reference.
eval_context_precision: Check if retrieved chunks are on-topic.
eval_context_recall: Assess if the context contains enough information to answer.
Safety, Compliance & Fairness
eval_toxicity: Detect harmful content in LLM outputs.
eval_bias: Identify bias across axes such as gender, race, and politics.
eval_pii_detection: Local regex scan for PII, with no API egress.
eval_schema_compliance: Validate LLM output against a JSON Schema.
Agent & Multimodal Evaluation
eval_tool_call_accuracy: Deterministically verify correct tool calls and arguments (no LLM judge required).
eval_vqa_faithfulness: Evaluate image-grounded visual-QA faithfulness.
eval_document_grounding: Assess multi-page document-grounded faithfulness for document-AI agents.
Flexible Scoring
eval_g_eval: Score output against a plain-English criterion (0.0–1.0).
eval_custom_rubric: Score output against a custom list of yes/no quality checks.
Agent Workflows
eval_compare_runs: Diff two eval report JSONs for regression analysis and pass-rate deltas.
eval_generate_cases: Generate eval cases (input, expected output, context) from source text.
eval_ingest_trace: Convert agent traces (e.g., LangGraph, OpenAI Agents) into an EvalCase payload for scoring.
These 22 tools are what an autonomous eval agent needs to do its job: discover its own capabilities (eval_discover), normalize traces from any source (eval_ingest_trace), and run calibrated evaluators against them. The framework lives behind an MCP boundary because that's the future shape of eval: a swarm of specialized eval agents coordinating through the protocol, not a SaaS dashboard.
MCP server that gives AI coding agents direct access to evaluation tools. Drop into Claude Desktop, Claude Code, Cursor, Cline, or any Model Context Protocol–compatible agent.
When the agent is helping you build an LLM product, it can:
Score a RAG output for hallucination without you writing the scaffolding
Generate an adversarial PDF on demand to test your document AI
Run the full pdfhell mini-suite against a model and analyse the results
Produce a hash-chained audit pack for procurement diligence
Discover the full evaluation capability catalog as JSON
No copy-paste, no python -c "...", no asking the agent to figure out the SDK calls.
Install
pip install multivon-mcp
The bare install pulls in multivon-eval, pdfhell, and the MCP SDK. The provider SDKs (anthropic, openai, google-genai) come along too; supply your own API keys through environment variables.
Configure your agent
Claude Desktop / Claude Code
Add to ~/Library/Application Support/Claude/claude_desktop_config.json (macOS) or %APPDATA%\Claude\claude_desktop_config.json (Windows):
{
  "mcpServers": {
    "multivon": {
      "command": "multivon-mcp",
      "env": {
        "ANTHROPIC_API_KEY": "sk-ant-...",
        "OPENAI_API_KEY": "sk-proj-...",
        "GOOGLE_API_KEY": "AIza..."
      }
    }
  }
}
Restart Claude. The 22 tools become available; ask Claude "use multivon to evaluate this RAG output" and it figures out which tool to call.
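If you would rather script this edit than make it by hand, the following is a minimal sketch of merging the multivon entry into an existing config without disturbing other servers. The helper name add_mcp_server is hypothetical (it is not part of multivon-mcp), and the example works on an in-memory dict; reading and writing the actual config file is left to the caller.

```python
import copy
import json


def add_mcp_server(config, name, command, env=None):
    """Return a copy of `config` with an MCP server entry added or replaced.

    Hypothetical helper for illustration; not part of multivon-mcp.
    """
    updated = copy.deepcopy(config)
    servers = updated.setdefault("mcpServers", {})
    entry = {"command": command}
    if env:
        entry["env"] = dict(env)
    servers[name] = entry
    return updated


# An existing config that already registers some other server.
existing = {"mcpServers": {"other": {"command": "other-server"}}}
merged = add_mcp_server(
    existing,
    "multivon",
    "multivon-mcp",
    env={"ANTHROPIC_API_KEY": "sk-ant-..."},
)
print(json.dumps(merged, indent=2))
```

Working on a copy keeps the original dict intact, so a failed write never leaves you with a half-edited config in memory.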
Cursor
Add to .cursor/mcp.json in your project (or ~/.cursor/mcp.json globally), or configure via Settings → MCP:
{ "mcpServers": { "multivon": { "command": "multivon-mcp" } } }
Cline / OpenCode / any MCP-compatible agent
Same shape — point at the multivon-mcp console script.
Local dev / debugging
mcp dev multivon_mcp.server
Opens the MCP Inspector UI in your browser. You can call any tool by name, see the JSON schemas, and watch the requests/responses.
The 22 tools
Discovery & document AI
Tool | What it does | API key |
eval_discover | Full machine-readable capability catalog (evaluators, traps, suites, calibration data, versions). Call first. | No |
pdfhell_make | Generate one adversarial PDF + its answer key. | No |
pdfhell_run | Run the pdfhell adversarial-PDF benchmark against a vision model. Returns pass rate, per-trap CIs, suite hash. | Yes (vision) |
eval_audit_pack | Build a hash-chained, procurement-ready ZIP from a pdfhell run. | No |
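To help read the pass rates and confidence intervals that pdfhell_run reports, here is a minimal sketch of a Wilson score interval for a binomial pass rate. The comparison table later in this README mentions Wilson CIs, but whether pdfhell computes its per-trap intervals with exactly this formula is an assumption; wilson_interval is a hypothetical name, not a pdfhell function.

```python
import math


def wilson_interval(passes, total, z=1.96):
    """Wilson score interval for a pass rate (z=1.96 is roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passes / total
    z2 = z * z
    denom = 1 + z2 / total
    center = (p + z2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z2 / (4 * total * total)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


low, high = wilson_interval(8, 10)
print(f"pass rate 0.80, 95% CI [{low:.3f}, {high:.3f}]")
```

With only ten cases the interval is wide (roughly 0.49 to 0.94), which is why small per-trap samples should be read with caution.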
RAG generation & retrieval
Tool | What it does | API key |
eval_faithfulness | QAG-graded faithfulness: is a RAG output grounded in the retrieved context? | Yes |
eval_hallucination | QAG-graded hallucination: does the output contain content NOT in context? | Yes |
eval_relevance | QAG-graded answer-vs-question relevance. | Yes |
eval_answer_accuracy | QAG-graded semantic equivalence vs ground truth. | Yes |
eval_context_precision | RAG retrieval quality: are the retrieved chunks on-topic? | Yes |
eval_context_recall | RAG retrieval completeness: does context contain enough info to answer? | Yes |
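To make these scores concrete: the example session later in this README reports 2 of 3 claims grounded as a score of 0.667, which fails a 0.9 threshold. Below is a minimal sketch of that aggregation, assuming the score is simply the fraction of claims judged supported. The actual QAG grading in multivon-eval may extract or weight claims differently, and claim_score is a hypothetical name.

```python
def claim_score(verdicts, threshold=0.9):
    """Aggregate per-claim verdicts (True = supported) into (score, passed)."""
    if not verdicts:
        raise ValueError("need at least one claim verdict")
    score = sum(verdicts) / len(verdicts)
    return round(score, 3), score >= threshold


# Verdicts mirroring the example session: two supported claims, one not.
verdicts = {
    "annual renewal": True,
    "30-day notice": True,
    "automatic upgrade": False,
}
score, passed = claim_score(list(verdicts.values()))
print(score, passed)  # 0.667 False
```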
Safety, compliance, fairness
Tool | What it does | API key |
eval_toxicity | QAG-graded toxicity / harmful-content detection. | Yes |
eval_bias | QAG-graded bias across gender, race, politics, age, socioeconomic axes. | Yes |
eval_pii_detection | Local-only regex scan for PII (GDPR / CCPA / PIPEDA / HIPAA packs). | No |
eval_schema_compliance | Validate an LLM output against a JSON Schema. | No |
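As a rough illustration of what a local-only regex scan looks like, the sketch below checks text against two patterns without any network call. The patterns and the names PII_PATTERNS and scan_pii are hypothetical and far narrower than the regulation packs listed above; they only show the shape of the technique.

```python
import re

# Illustrative patterns only; the real regulation packs are assumed to be broader.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def scan_pii(text):
    """Return (kind, matched_text) pairs found by purely local regex matching."""
    findings = []
    for kind, pattern in PII_PATTERNS.items():
        for match in pattern.finditer(text):
            findings.append((kind, match.group()))
    return findings


sample = "Contact jane.doe@example.com, SSN 123-45-6789."
findings = scan_pii(sample)
print(findings)
```

Because everything runs in-process, the text being scanned never leaves the machine, which is the property the "No" in the API key column reflects.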
Agent & multimodal
Tool | What it does | API key |
eval_tool_call_accuracy | Deterministic agent tool-call correctness. No LLM. | No |
eval_vqa_faithfulness | Image-grounded visual-QA faithfulness. | Yes (vision) |
eval_document_grounding | Multi-page document-grounded faithfulness for document-AI agents. | Yes (vision) |
Agent traces. eval_tool_call_accuracy and the other agent-trace evaluators in multivon-eval (ToolArgumentAccuracy, ToolCallNecessity, TrajectoryEfficiency, AgentMemoryEval, PlanQuality, TaskCompletion, StepFaithfulness) take an agent_trace=[AgentStep(...)] plus expected_tool_calls=[...] on the case. Three-shape semantics matter: expected_tool_calls=None skips, [] asserts "no tools called", and [...] checks the trace contains the named calls in order. The MCP tool wraps this: pass the trace JSON via eval_ingest_trace first to normalize it from LangGraph / OpenAI Agents SDK / manual shapes. See the multivon-eval agent integrations for the source-of-truth tracer code.
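The three-shape semantics can be sketched as follows. This is not the multivon-eval implementation: check_tool_calls is a hypothetical function, it compares tool names only (not arguments), and it reads "in order" as an ordered subsequence, which is an assumption.

```python
def check_tool_calls(trace_calls, expected_tool_calls):
    """Return None (skipped), True, or False for a list of called tool names."""
    if expected_tool_calls is None:
        return None  # no expectation set: the check is skipped
    if not expected_tool_calls:
        return len(trace_calls) == 0  # [] asserts that no tools were called
    # [...] asserts the expected names appear in order, as a subsequence.
    # `name in remaining` advances the iterator, so order is enforced.
    remaining = iter(trace_calls)
    return all(name in remaining for name in expected_tool_calls)


trace = ["search", "fetch", "summarize"]
print(check_tool_calls(trace, None))                     # None
print(check_tool_calls(trace, []))                       # False
print(check_tool_calls(trace, ["search", "summarize"]))  # True
print(check_tool_calls(trace, ["summarize", "search"]))  # False
```

Distinguishing None from [] matters in practice: an empty list is a real assertion that the agent should have answered without tools, while None means the case makes no claim about tool use at all.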
Flexible scoring
Tool | What it does | API key |
eval_g_eval | G-Eval holistic 0.0-1.0 scoring against a plain-English criterion. | Yes |
eval_custom_rubric | Score against your own list of yes/no quality checks. | Yes |
Agent workflows (new in 0.3.0)
Tool | What it does | API key |
eval_compare_runs | Diff two eval report JSONs: pass-rate delta, per-case regressions/improvements, McNemar p-value. Use after every fix to confirm it actually helped. | No |
eval_generate_cases | Generate N eval cases (input / expected_output / context) from a chunk of source text. Eliminates the cold-start when building a new suite. | Yes (judge) |
eval_ingest_trace | Convert a JSON agent trace (LangGraph / OpenAI Agents / manual) into an EvalCase payload. Use to score trajectories your agent just executed. | No |
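For readers unfamiliar with the McNemar p-value mentioned for eval_compare_runs, here is a minimal sketch of the exact (binomial) form of the test. It uses only the discordant cases: those that passed in one run and failed in the other. Whether eval_compare_runs uses the exact form or the chi-square approximation is not stated here, so treat this as an illustration; mcnemar_exact is a hypothetical name.

```python
from math import comb


def mcnemar_exact(regressions, improvements):
    """Two-sided exact McNemar p-value from discordant pair counts."""
    n = regressions + improvements
    if n == 0:
        return 1.0  # no case changed outcome, so no evidence of a difference
    k = min(regressions, improvements)
    # Probability of a split at least this lopsided under a fair coin.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# 1 case regressed and 6 improved between baseline and candidate.
p_value = mcnemar_exact(1, 6)
print(p_value)  # 0.125
```

Even a 6-to-1 split in favour of the fix is not significant at the usual 0.05 level, which is a useful reminder that small suites rarely prove a fix worked.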
Example session
User: I just shipped a RAG endpoint. Can you check it for hallucinations?
Claude: I'll use multivon to evaluate it.
[calls eval_discover to see what's available]
[calls eval_faithfulness with your input/context/output]
→ score: 0.667 (passed: False), threshold: 0.9
reason: 2/3 claims grounded
✓ "annual renewal" — supported by context
✓ "30-day notice" — supported by context
✗ "automatic upgrade" — NOT in context
Claude: Your RAG hallucinated the "automatic upgrade" detail. The context
doesn't mention upgrades. I'd add a Hallucination evaluator to your CI
gate, threshold ≥0.85, and re-prompt with explicit "only use facts
from context" instructions.
Why these 22 tools (not all 44)
eval_discover returns the full 44-evaluator catalog, so the agent can always introspect everything. The 22 tools we expose directly are the ones agents actually call mid-edit:
RAG generation checks (faithfulness, hallucination, relevance, answer_accuracy)
RAG retrieval checks (context_precision, context_recall)
Safety / fairness guardrails (toxicity, bias)
Compliance (pii_detection, schema_compliance) — local-only, no API egress
Flexible scoring (g_eval, custom_rubric) for user-defined rubrics
Multimodal (vqa_faithfulness, document_grounding) for vision agents
Agent traces (tool_call_accuracy)
Document AI (pdfhell_run, pdfhell_make), for any RAG-on-PDFs flow
Audit pack — when procurement is involved
Discover — meta-capability for planning
Agent workflows (compare_runs, generate_cases, ingest_trace) — the loop that turns one-shot scoring into iterative improvement
The three new 0.3.0 tools matter because evals are most useful as a loop, not a single call: generate a starting suite from your own docs (eval_generate_cases), run your agent over it, score the trace (eval_ingest_trace → eval_*), make a fix, then verify the fix improved things vs. the baseline (eval_compare_runs). Agents need that whole loop callable from within a conversation — otherwise they fall back to ad-hoc judgment.
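The verification step at the end of that loop can be sketched in a few lines. The report shape used here, a mapping from case id to a pass flag, is a simplifying assumption and not the actual eval report JSON; diff_runs is a hypothetical name.

```python
def diff_runs(baseline, candidate):
    """Compare two {case_id: passed} mappings over the cases they share."""
    shared = sorted(set(baseline) & set(candidate))
    if not shared:
        raise ValueError("the two runs share no case ids")
    regressions = [c for c in shared if baseline[c] and not candidate[c]]
    improvements = [c for c in shared if not baseline[c] and candidate[c]]
    base_rate = sum(baseline[c] for c in shared) / len(shared)
    cand_rate = sum(candidate[c] for c in shared) / len(shared)
    return {
        "pass_rate_delta": cand_rate - base_rate,
        "regressions": regressions,
        "improvements": improvements,
    }


baseline = {"q1": True, "q2": False, "q3": False, "q4": True}
candidate = {"q1": True, "q2": True, "q3": True, "q4": False}
result = diff_runs(baseline, candidate)
print(result)
```

Listing regressions separately from the net delta matters: here the pass rate rises by 0.25, yet q4 broke, and a headline number alone would hide that.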
Exposing all 44 evaluators as MCP tools would bloat the agent's context window and overwhelm tool-selection. If you need an evaluator that's not directly exposed, the agent can still use multivon-eval as a library — eval_discover returns the import paths.
Dependencies
Hard pins (from pyproject.toml):
mcp[cli] >= 1.0: official MCP Python SDK + the mcp dev inspector
multivon-eval >= 0.9.4: the evaluator surface this wraps
pdfhell >= 0.1.0: the adversarial-PDF benchmark this wraps
Recommended (effective floor for full feature parity):
multivon-eval >= 0.9.8: pulls in the corrected calibrated-threshold logic from the 0.9.7 hotfix (which affects what eval_discover reports and any tool that surfaces benchmark numbers in its docstring), plus the bundled Claude Code skills + multivon-eval install-skills CLI from 0.9.8.
pdfhell >= 0.5.4: pulls in the mini-v4 17-trap suite and the pdfhell.research autoresearch loop. The pdfhell_run --suite mini-v4 tool path assumes these are present.
The pyproject pins are kept loose so existing deployments don't break; pin the recommended floors yourself if you care about the corrected benchmark numbers or the new suites.
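If you want to check an environment against the recommended floors, a minimal sketch follows. It handles plain X.Y.Z versions only (no pre-release or local tags), and the installed versions are hard-coded for illustration; in practice you could read them with importlib.metadata.version. The names parse_version, meets_floor, and RECOMMENDED_FLOORS are hypothetical.

```python
def parse_version(text):
    """Parse a plain dotted release such as '0.9.8' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def meets_floor(installed, floor):
    """True if `installed` is at least `floor` (plain X.Y.Z versions only)."""
    return parse_version(installed) >= parse_version(floor)


RECOMMENDED_FLOORS = {"multivon-eval": "0.9.8", "pdfhell": "0.5.4"}

# Hypothetical installed versions, chosen to show one package below its floor.
installed = {"multivon-eval": "0.9.4", "pdfhell": "0.5.4"}
below = [
    pkg
    for pkg, floor in RECOMMENDED_FLOORS.items()
    if not meets_floor(installed[pkg], floor)
]
print(below)  # ['multivon-eval']
```

Comparing integer tuples instead of raw strings avoids the classic mistake of ranking "0.10.0" below "0.9.8".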
All Apache 2.0.
MCP server vs Claude Code skills vs eval-action — which one do I use?
multivon-eval ships three agent-facing surfaces. They overlap on what
they call (the same evaluator catalog) but differ on where the agent
lives.
Surface | Where the agent runs | Best for |
multivon-mcp (this repo) | Any MCP-compatible client — Claude Desktop, Cursor, Cline, OpenCode, Claude Code | Mid-edit scoring inside an IDE or chat app. Agent calls |
Claude Code skills — | Claude Code only | Workflow-shaped tasks: scaffold an eval suite from a project description, pre-PR regression checks against a baseline, explaining why a particular evaluator was picked. The skills know how to call |
eval-action | GitHub CI | Gate every PR on eval regressions automatically. Posts the Wilson-CI + McNemar verdict as a PR comment. |
If you're building an LLM product and want the agent in your editor to score a RAG output without copy-pasting Python, use multivon-mcp. If you live in Claude Code and want the bootstrap → audit → explain loop wired up as native commands, use the bundled skills. If you want PR-time gating, use the GitHub Action. The three are complementary — most projects end up using all three.
The Multivon ecosystem
Five public + one early-access package, all built on a shared evaluation engine:
Repo | What it is |
multivon-eval | Python SDK: 44 evaluators + |
pdfhell | Adversarial PDFs that break AI document readers, exposed here as |
multivon-mcp (you are here) | MCP server: 22 tools from multivon-eval + pdfhell |
eval-action | GitHub Action: runs the same evals on every PR |
 | Reproducible head-to-head benchmark vs DeepEval + RAGAS |
multivon-guard (early access) | Local proxy that catches LLM coding agents leaking secrets / PII |
License
Apache 2.0.
Citing
@software{multivon_mcp,
title = {multivon-mcp: MCP server exposing multivon-eval + pdfhell as agent-callable tools},
author = {Multivon},
year = {2026},
url = {https://github.com/multivon-ai/multivon-mcp},
}