multivon-mcp
Official
Server Configuration
Describes the environment variables the server reads. All are optional; each supplies the API key for one model provider.
| Name | Required | Description | Default |
|---|---|---|---|
| GOOGLE_API_KEY | No | Google API key for using Gemini models in evaluations | |
| OPENAI_API_KEY | No | OpenAI API key for using OpenAI models in evaluations | |
| ANTHROPIC_API_KEY | No | Anthropic API key for using Claude models in evaluations | |
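As a minimal sketch, assuming the keys are read from the process environment as the table implies, a launcher script could report which providers are usable before starting the server. The names `PROVIDER_KEYS` and `configured_providers` are hypothetical and not part of multivon-mcp.

```python
import os
from typing import Mapping, Optional

# Variable names come from the table above; the labels are for display only.
PROVIDER_KEYS = {
    "GOOGLE_API_KEY": "Gemini",
    "OPENAI_API_KEY": "OpenAI",
    "ANTHROPIC_API_KEY": "Claude",
}


def configured_providers(env: Optional[Mapping[str, str]] = None) -> list:
    """Return the model families whose API key is set to a non-empty value.

    Hypothetical helper: it only inspects the environment and does not
    validate the keys against any provider.
    """
    env = os.environ if env is None else env
    return [label for name, label in PROVIDER_KEYS.items() if env.get(name)]


if __name__ == "__main__":
    found = configured_providers()
    print("Configured providers:", ", ".join(found) if found else "none")
```

Passing a plain dict instead of `os.environ` makes the check easy to exercise without touching real credentials.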
Capabilities
Features and capabilities supported by this server
| Capability | Details |
|---|---|
| tools | { "listChanged": false } |
| prompts | { "listChanged": false } |
| resources | { "subscribe": false, "listChanged": false } |
| experimental | {} |
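A client can read these flags to decide whether to expect list-change notifications or to attempt resource subscriptions. The sketch below copies the capability values from the table; the helper name `capability_flag` is hypothetical, and real clients would take the object from the server's initialization response rather than a literal.

```python
import json

# Capability object as listed in the table above.
CAPABILITIES = json.loads("""
{
  "tools": {"listChanged": false},
  "prompts": {"listChanged": false},
  "resources": {"subscribe": false, "listChanged": false},
  "experimental": {}
}
""")


def capability_flag(capabilities: dict, feature: str, name: str) -> bool:
    """Return a boolean capability flag, treating a missing entry as False."""
    return bool(capabilities.get(feature, {}).get(name, False))


if __name__ == "__main__":
    # With these values a client should not wait for list-change
    # notifications and should not try to subscribe to resources.
    print(capability_flag(CAPABILITIES, "tools", "listChanged"))
    print(capability_flag(CAPABILITIES, "resources", "subscribe"))
```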
Tools
Functions exposed to the LLM to take actions
| Name | Description |
|---|---|
| pdfhell_run | Run the pdfhell adversarial-PDF benchmark against a vision model. Args: model: Provider:model spec, e.g. Returns: A dict with overall Provider API keys come from environment variables ( |
| pdfhell_make | Generate one adversarial PDF + its answer key. Useful for an agent to inspect what a specific trap looks like before deciding to evaluate against it. Args: trap: Trap family name. The full list of 17+ families is discoverable via Returns: A dict with the case JSON (id, trap_family, question, expected_answer, forbidden_answers, metadata) and optionally the base64-encoded PDF bytes under |
| eval_faithfulness | Evaluate whether an LLM output is grounded in the retrieved context. Uses multivon-eval's QAG-graded Faithfulness evaluator. Extracts factual claims from the output and verifies each one against the context. Score is the fraction of claims supported. Use this when a RAG pipeline returned an answer and you want to check the LLM didn't invent facts not present in retrieved documents. Args: input: The user's question. context: The retrieved context the LLM was given. output: The LLM's answer being evaluated. judge_model: Provider:model for the QAG judge. Default Returns: |
| eval_hallucination | Detect fabricated information not present in the context. Score 1.0 = no hallucination. Score 0.0 = significant hallucination. Args: output: The LLM output to check. context: The ground-truth context the output should be grounded in. judge_model: Provider:model for the QAG judge. Returns: |
| eval_relevance | Check whether an LLM output actually addresses the user's question. QAG-graded: generates yes/no questions about whether the output answers the input, stays on topic, and contains relevant content. Args: input: The user's question. output: The LLM's response. judge_model: Provider:model for the QAG judge. Returns: |
| eval_tool_call_accuracy | Evaluate whether an agent called the right tool with the right arguments. Purely deterministic: no LLM judge needed. Compares the actual tool name + arguments against expected. Args: expected_tool: Tool name the agent should have called. actual_tool: Tool name the agent actually called. expected_arguments: Dict of expected argument values (optional). actual_arguments: Dict of argument values the agent passed (optional). Returns: |
| eval_answer_accuracy | Evaluate whether an answer is semantically equivalent to the ground truth. QAG-graded: generates yes/no questions about whether the actual answer matches the meaning of the expected answer. Useful when string match is too strict (e.g. paraphrased correct answers). Args: expected_answer: Ground-truth answer. actual_answer: The LLM's answer. judge_model: Provider:model for the QAG judge. Returns: |
| eval_audit_pack | Build a hash-chained audit ZIP from a pdfhell run. Combines the run JSON, the case PDFs + answer keys, JUnit XML, and a SHA-256 manifest into one downloadable ZIP. Suitable for attaching to a procurement diligence appendix. Args: run_json_path: Path to a pdfhell run JSON (from Returns: |
| eval_discover | Return the full machine-readable capability catalog. Useful as a first call at session start: an agent can plan its evaluation strategy against the actual available evaluators rather than guessing or hallucinating tool names. Returns: A dict with three top-level keys: |
| eval_pii_detection | Detect personally-identifiable information (PII) in an LLM output. Local-first: zero API calls. Uses a regex pattern library covering emails, phone numbers, SSNs, credit cards, IBANs, IPs, addresses, and jurisdiction-specific identifiers (HIPAA MRNs, EU VAT, California bank accounts, etc). Score 1.0 = no PII detected. Score 0.0 = PII found (the reason field lists which types matched and example substrings). Args: output: The LLM-generated text to scan. jurisdiction: Which extra pattern set to include. One of Returns: |
| eval_schema_compliance | Validate that an LLM output conforms to a JSON Schema. Wraps multivon-eval's For Pydantic-model validation or more advanced setups (custom validators, recursive schemas), use the multivon-eval SDK directly. Args: output: The LLM-generated text expected to contain JSON. schema: A JSON Schema dict (Draft 7). Example: Returns: |
| eval_toxicity | Detect harmful, offensive, or inappropriate content in an LLM output. QAG-graded: the judge answers four yes/no questions about whether the output is free of hate speech, threats, harmful instructions, and disrespectful tone. Score is the fraction of questions that pass; 1.0 = not toxic, 0.0 = toxic. Use this for guardrails on generated content, chatbot turns, or any user-facing LLM output where harmful content is a risk. Args: output: The LLM-generated text to evaluate. judge_model: Provider:model for the QAG judge, e.g. Returns: |
| eval_bias | Detect gender, racial, political, age, or socioeconomic bias. QAG-graded: the judge answers five yes/no questions about whether the response treats demographic groups equitably and stays politically neutral. Score 1.0 = no detectable bias, 0.0 = significant bias. Use this for fairness audits on generated content, summarisation of user-submitted text, or hiring/lending-adjacent LLM pipelines. Args: input: The prompt the LLM was responding to. Bias evaluators use this to contextualise whether disparate treatment is justified (e.g. a medical question may legitimately discuss group-specific risk factors). output: The LLM-generated response. judge_model: Provider:model for the QAG judge. Returns: |
| eval_context_precision | Measure whether retrieved RAG context chunks are relevant to the question. High precision = the retriever returned mostly on-topic chunks; low noise. The judge asks "is this chunk relevant?" for each chunk (up to 8) and scores precision = fraction marked relevant. Use this to diagnose retriever quality: if precision is low, your embedding model, chunk size, or reranker is returning noise. Args: input: The user's question. context: Either a list of retrieved chunks, or a single string with the full retrieved context (will be evaluated as one chunk). judge_model: Provider:model for the QAG judge. Returns: |
| eval_context_recall | Measure whether retrieved context contains enough information to answer. High recall = the retriever found the information needed to derive the expected answer. The judge asks whether the expected answer could plausibly be reconstructed from the retrieved context alone. Use this when you have a labelled QA dataset and want to diagnose whether failures are retriever misses vs. generator errors. Args: input: The user's question. context: The retrieved context chunks (list or single string). expected_answer: The ground-truth answer the context should support. judge_model: Provider:model for the QAG judge. Returns: |
| eval_g_eval | G-Eval style holistic scoring against a plain-English criterion. The judge reads the criterion and the output, then returns a numeric score from 0.0 to 1.0 plus a short reason. To reduce single-sample variance the prompt is run twice by default and the scores averaged (position/framing bias mitigation per the original G-Eval paper). Best for fuzzy or holistic qualities: creativity, tone, style, helpfulness, conciseness. For criteria with multiple discrete aspects, prefer Args: input: The prompt the LLM was responding to. output: The LLM-generated response to score. criteria: A plain-English description of what to score on, e.g. Returns: |
| eval_custom_rubric | Score an output against your own list of yes/no quality checks. Each criterion is a Args: input: The prompt the LLM was responding to. output: The LLM-generated response. criteria: A list of Returns: |
| eval_vqa_faithfulness | Check whether an LLM answer about an image is grounded in what's visible. Image-grounded faithfulness. The vision judge extracts up to 3 factual claims from the answer, then verifies each one against the image. Score = fraction of claims that are accurate. Use this for visual QA, image captioning, chart/diagram reading, and any LLM output that purports to describe an image. Image input, exactly one of: Args: input: The question or prompt the LLM was answering. output: The LLM-generated answer to verify against the image. image: Path / URL / data URI for the image. image_base64: Alternative, raw base64 image bytes. mime_type: Mime type when using Returns: |
| eval_document_grounding | Check whether an answer about a multi-page document is grounded. Document-page-grounded faithfulness for multi-page document agents (contracts, invoices, scientific PDFs, medical records). The vision judge answers three yes/no questions per document: is every claim supported, no inventions, exceptions handled. Provide one image per page. Use exactly one of: Args: input: The question or prompt the LLM was answering about the document. output: The LLM-generated answer to verify against the pages. images: List of page image sources (paths/URLs/data URIs). images_base64: Alternative, list of raw base64 strings. mime_type: Mime type when using Returns: |
| eval_compare_runs | Compare two multivon-eval report JSONs and return a structured diff. Loads both reports from disk (the JSON produced by Use this when you've made a prompt / retrieval / model change and want to know if the new run actually improved over the baseline, not just on aggregate, but case-by-case. Args: baseline_json_path: Filesystem path to the baseline report JSON (e.g. Returns: A dict with: - |
| eval_generate_cases | Generate synthetic eval cases from a source text. Calls multivon-eval's synthetic generator to produce Requires a provider API key in env so the underlying judge can propose question/answer pairs. Args: from_text: Source text to generate cases from (e.g. FAQ, docs chunk, knowledge base article). n: Number of cases to generate. Default 10. task: One of Returns: A list of dicts |
| eval_ingest_trace | Convert a JSON agent trace into a JSON-friendly EvalCase payload. Parses a serialised agent trajectory and returns the :class: Supports three frameworks: Args: trace_json: The trace as a JSON-friendly dict. Must include Returns: A dict with |
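The deterministic comparison described for the tool-call accuracy evaluator can be illustrated with a short sketch. This is not the server's implementation: the function name `tool_call_matches`, the all-or-nothing scoring, and the returned dict shape are assumptions made for illustration, and the real evaluator may, for example, award partial credit or treat extra arguments differently.

```python
from typing import Any, Optional

_MISSING = object()  # sentinel so an argument explicitly set to None still compares correctly


def tool_call_matches(
    expected_tool: str,
    actual_tool: str,
    expected_arguments: Optional[dict] = None,
    actual_arguments: Optional[dict] = None,
) -> dict:
    """Compare a tool call against the expected one without any LLM judge.

    Assumption: only the expected arguments are checked, so extra
    arguments passed by the agent do not count as a mismatch.
    """
    expected_arguments = expected_arguments or {}
    actual_arguments = actual_arguments or {}

    name_ok = expected_tool == actual_tool
    mismatched = sorted(
        key
        for key, value in expected_arguments.items()
        if actual_arguments.get(key, _MISSING) != value
    )
    passed = name_ok and not mismatched
    return {
        "score": 1.0 if passed else 0.0,
        "passed": passed,
        "mismatched_arguments": mismatched,
    }


if __name__ == "__main__":
    print(tool_call_matches("search", "search", {"q": "mcp"}, {"q": "mcp"}))
    print(tool_call_matches("search", "lookup"))
```

Because the check is plain equality on names and values, it needs no provider API key and gives the same result on every run.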
Prompts
Interactive templates invoked by user choice
| Name | Description |
|---|---|
| No prompts | |
Resources
Contextual data attached and managed by the client
| Name | Description |
|---|---|
| No resources | |
MCP directory API
We provide all the information about MCP servers via our MCP API.
curl -X GET 'https://glama.ai/api/mcp/v1/servers/multivon-ai/multivon-mcp'
If you have feedback or need assistance with the MCP directory API, please join our Discord server.
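The same request can be made from Python with the standard library. This is a sketch under assumptions: the URL pattern is taken from the curl command above, the helper names `build_request` and `fetch_server` are hypothetical, and the response is assumed to be JSON, which the page does not state.

```python
import json
import urllib.request

# Base path taken from the curl example above.
BASE_URL = "https://glama.ai/api/mcp/v1/servers"


def build_request(namespace: str, slug: str) -> urllib.request.Request:
    """Build the GET request for one server entry in the directory."""
    return urllib.request.Request(f"{BASE_URL}/{namespace}/{slug}", method="GET")


def fetch_server(namespace: str, slug: str, timeout: float = 10.0):
    """Fetch and decode one server entry.

    Assumption: the endpoint returns a JSON body. Network errors and
    non-2xx responses propagate as urllib exceptions.
    """
    with urllib.request.urlopen(build_request(namespace, slug), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling `fetch_server("multivon-ai", "multivon-mcp")` requires network access; `build_request` alone can be inspected offline.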