multivon-mcp

Official
by multivon-ai

Server Configuration

Describes the environment variables required to run the server.

| Name | Required | Description | Default |
| --- | --- | --- | --- |
| GOOGLE_API_KEY | No | Google API key for using Gemini models in evaluations | |
| OPENAI_API_KEY | No | OpenAI API key for using OpenAI models in evaluations | |
| ANTHROPIC_API_KEY | No | Anthropic API key for using Claude models in evaluations | |
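
Because every key is optional, a caller may want to check that the key matching a given Provider:model spec is set before starting a run. The following is a minimal sketch under that assumption; `PROVIDER_ENV_VARS` and `missing_key_for` are hypothetical helper names and are not part of multivon-mcp.

```python
import os

# Provider prefix -> environment variable, taken from the table above.
PROVIDER_ENV_VARS = {
    "anthropic": "ANTHROPIC_API_KEY",
    "openai": "OPENAI_API_KEY",
    "google": "GOOGLE_API_KEY",
}


def missing_key_for(model_spec, env=None):
    """Return the name of the unset env var for model_spec, or None if it is set.

    Hypothetical helper: the server's own validation may differ.
    """
    env = os.environ if env is None else env
    provider, sep, model = model_spec.partition(":")
    if not sep or not model:
        raise ValueError(f"expected 'provider:model', got {model_spec!r}")
    if provider not in PROVIDER_ENV_VARS:
        raise ValueError(f"unknown provider {provider!r}")
    var = PROVIDER_ENV_VARS[provider]
    return None if env.get(var) else var
```

For example, `missing_key_for("google:gemini-2.5-flash")` returns `"GOOGLE_API_KEY"` when that variable is unset or empty.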

Capabilities

Features and capabilities supported by this server

| Capability | Details |
| --- | --- |
| tools | {"listChanged": false} |
| prompts | {"listChanged": false} |
| resources | {"subscribe": false, "listChanged": false} |
| experimental | {} |

Tools

Functions exposed to the LLM to take actions

pdfhell_run

Run the pdfhell adversarial-PDF benchmark against a vision model.

Args:
- model: Provider:model spec, e.g. "anthropic:claude-sonnet-4-6", "openai:gpt-4o", "google:gemini-2.5-flash".
- suite: Any suite from eval_discover. Current suites: "smoke" (3 cases, ~10s), "mini" (30 cases, ~$0.01 on Flash), "mini-v2", "mini-v3", the flagship "mini-v4" (17 trap families, 510 cases), and "mini-v4-sample" (170 cases; a cheap reproduction of mini-v4). Default "mini".
- workers: Number of parallel API requests. Default 4.

Returns: A dict with overall pass_rate, Wilson 95% CI, per-trap-family pass rates and CIs, and per-case details. Suite version + hash included so consumers can verify the run measured the expected cases.

Provider API keys come from environment variables (ANTHROPIC_API_KEY, OPENAI_API_KEY, GOOGLE_API_KEY) — not passed through this tool, never logged.
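
The Wilson 95% CI in the result can be reproduced from a pass count and a case count. The sketch below uses the standard Wilson score formula; whether pdfhell applies a continuity correction or a different z value is not stated here, so treat the exact numbers as an approximation. The function name `wilson_interval` is an assumption.

```python
import math


def wilson_interval(passes, total, z=1.96):
    """Wilson score interval for a pass rate (z=1.96 gives roughly 95% coverage)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passes / total
    z2 = z * z
    denom = 1 + z2 / total
    center = (p + z2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z2 / (4 * total * total)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
    return max(0.0, center - half_width), min(1.0, center + half_width)
```

With 27 passes out of 30 cases the interval is noticeably wider than the point estimate of 0.9 suggests, which is why small suites such as "smoke" are better suited to sanity checks than to model comparisons.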

pdfhell_make

Generate one adversarial PDF + its answer key.

Useful for an agent to inspect what a specific trap looks like before deciding to evaluate against it.

Args:
- trap: Trap family name. The full list of 17+ families is discoverable via eval_discover, which is also the source of truth: pdfhell adds families over time, so a hard-coded list here would go stale. Examples include "hidden_ocr_mismatch", "footnote_override", and the autoresearch-discovered families in mini-v3/v4.
- seed: Integer seed. The same seed yields a byte-identical PDF and an identical answer key.
- return_pdf_bytes: If True, include the base64-encoded PDF bytes in the response. Default False, since most agents want the question and expected answer, not the raw PDF.

Returns: A dict with the case JSON (id, trap_family, question, expected_answer, forbidden_answers, metadata) and optionally the base64-encoded PDF bytes under pdf_base64.
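
When return_pdf_bytes is True, the PDF has to be decoded before it can be opened. A minimal sketch, assuming that `id` and `pdf_base64` sit at the top level of the response dict (the exact nesting is not specified above); `save_case_pdf` is a hypothetical helper.

```python
import base64
from pathlib import Path


def save_case_pdf(response, out_dir):
    """Decode pdf_base64 from a pdfhell_make response and write it to out_dir.

    Returns the written path, or None when no PDF bytes were requested.
    Assumes 'id' and 'pdf_base64' are top-level keys of the response.
    """
    encoded = response.get("pdf_base64")
    if encoded is None:
        return None
    pdf_bytes = base64.b64decode(encoded)
    if not pdf_bytes.startswith(b"%PDF-"):
        raise ValueError("decoded payload does not look like a PDF")
    out_path = Path(out_dir) / f"{response.get('id', 'case')}.pdf"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_bytes(pdf_bytes)
    return out_path
```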

eval_faithfulness

Evaluate whether an LLM output is grounded in the retrieved context.

Uses multivon-eval's QAG-graded Faithfulness evaluator. Extracts factual claims from the output and verifies each one against the context. Score is the fraction of claims supported.

Use this when a RAG pipeline returned an answer and you want to check the LLM didn't invent facts not present in retrieved documents.

Args:
- input: The user's question.
- context: The retrieved context the LLM was given.
- output: The LLM's answer being evaluated.
- judge_model: Provider:model for the QAG judge. Default "anthropic:claude-haiku-4-5" (cheap + calibrated).

Returns: {"score": 0.0-1.0, "passed": bool, "reason": str, "threshold": float}.
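
The scoring step described above (fraction of claims supported, compared against a threshold) can be sketched as follows. The per-claim verdicts would come from the judge model; here they are passed in as booleans. The default threshold of 0.8 and the treatment of an output with no claims are assumptions, not documented behaviour.

```python
def faithfulness_result(claim_supported, threshold=0.8):
    """Turn per-claim verdicts into a result dict shaped like the one above.

    claim_supported: list of bools, one per extracted claim.
    threshold: hypothetical default; the real calibrated value may differ.
    """
    if not claim_supported:
        # Assumption: an output with no factual claims cannot be unfaithful.
        score = 1.0
    else:
        score = sum(claim_supported) / len(claim_supported)
    unsupported = len(claim_supported) - sum(claim_supported)
    return {
        "score": score,
        "passed": score >= threshold,
        "reason": f"{unsupported} of {len(claim_supported)} claims unsupported",
        "threshold": threshold,
    }
```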

eval_hallucination

Detect fabricated information not present in the context.

Score 1.0 = no hallucination. Score 0.0 = significant hallucination.

Args:
- output: The LLM output to check.
- context: The ground-truth context the output should be grounded in.
- judge_model: Provider:model for the QAG judge.

Returns: {"score": 0.0-1.0, "passed": bool, "reason": str, "threshold": float}.

eval_relevance

Check whether an LLM output actually addresses the user's question.

QAG-graded: the judge generates yes/no questions about whether the output answers the input, stays on topic, and contains relevant content.

Args:
- input: The user's question.
- output: The LLM's response.
- judge_model: Provider:model for the QAG judge.

Returns: {"score": 0.0-1.0, "passed": bool, "reason": str, "threshold": float}.

eval_tool_call_accuracy

Evaluate whether an agent called the right tool with the right arguments.

Purely deterministic: no LLM judge is needed. Compares the actual tool name and arguments against the expected ones.

Args:
- expected_tool: Tool name the agent should have called.
- actual_tool: Tool name the agent actually called.
- expected_arguments: Dict of expected argument values (optional).
- actual_arguments: Dict of argument values the agent passed (optional).

Returns: {"score": 0.0 or 1.0, "passed": bool, "reason": str}.
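
A deterministic comparison of this kind can be sketched in a few lines. One detail is not specified above: whether extra arguments passed by the agent count as a failure. The sketch assumes they do not, and only checks that every expected argument is present with an equal value. `tool_call_accuracy` is a hypothetical stand-in, not the server's implementation.

```python
def tool_call_accuracy(expected_tool, actual_tool,
                       expected_arguments=None, actual_arguments=None):
    """Score 1.0 when the tool name and all expected arguments match, else 0.0.

    Assumption: arguments the agent passed beyond the expected ones are ignored.
    """
    if expected_tool != actual_tool:
        return {"score": 0.0, "passed": False,
                "reason": f"expected tool {expected_tool!r}, got {actual_tool!r}"}
    if expected_arguments is not None:
        actual = actual_arguments or {}
        mismatched = sorted(
            key for key, value in expected_arguments.items()
            if key not in actual or actual[key] != value
        )
        if mismatched:
            return {"score": 0.0, "passed": False,
                    "reason": "argument mismatch: " + ", ".join(mismatched)}
    return {"score": 1.0, "passed": True, "reason": "tool and arguments match"}
```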

eval_answer_accuracy

Evaluate whether an answer is semantically equivalent to the ground truth.

QAG-graded — generates yes/no questions about whether the actual answer matches the meaning of the expected answer. Useful when string match is too strict (e.g. paraphrased correct answers).

Args:
- expected_answer: Ground-truth answer.
- actual_answer: The LLM's answer.
- judge_model: Provider:model for the QAG judge.

Returns: {"score": 0.0-1.0, "passed": bool, "reason": str}.

eval_audit_pack

Build a hash-chained audit ZIP from a pdfhell run.

Combines the run JSON, the case PDFs + answer keys, JUnit XML, and a SHA-256 manifest into one downloadable ZIP. Suitable for attaching to a procurement diligence appendix.

Args:
- run_json_path: Path to a pdfhell run JSON (from pdfhell run --out).
- cases_dir: Directory containing the case PDFs and answer keys that were evaluated. This is the same directory the run used.
- output_zip_path: Where to write the audit ZIP.

Returns: {"path": "/abs/path/to.zip", "size_bytes": N, "manifest": {...}}. The manifest dict mirrors the one inside the ZIP — useful for an agent that wants to verify the contents without opening the ZIP itself.
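
To illustrate what a hash-chained SHA-256 manifest can look like, the sketch below hashes each file and then folds each entry into a running chain hash, so that changing any file changes every later chain value and the final head. The entry layout, the chaining rule, and the names `build_manifest` and `verify_manifest` are all assumptions for illustration; the manifest format actually produced by eval_audit_pack is not documented here.

```python
import hashlib


def build_manifest(files):
    """Build a hash-chained manifest from a mapping of archive name -> bytes.

    Hypothetical layout: each entry stores the file's SHA-256 and a chain hash
    over (previous chain hash, name, file hash), in sorted name order.
    """
    entries = []
    previous = "0" * 64  # fixed starting value for the chain
    for name in sorted(files):
        digest = hashlib.sha256(files[name]).hexdigest()
        chain = hashlib.sha256(f"{previous}:{name}:{digest}".encode("utf-8")).hexdigest()
        entries.append({"name": name, "sha256": digest, "chain": chain})
        previous = chain
    return {"entries": entries, "head": previous}


def verify_manifest(files, manifest):
    """Recompute the manifest and compare it with the recorded one."""
    return build_manifest(files) == manifest
```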

eval_discover

Return the full machine-readable capability catalog.

Useful as a first call at session start — an agent can plan its evaluation strategy against the actual available evaluators rather than guessing or hallucinating tool names.

Returns: A dict with three top-level keys:

- ``evaluators``: every available multivon-eval evaluator,
  with its tier, what inputs it needs, and (when shipped)
  calibrated default thresholds per judge model.
- ``traps``: every pdfhell trap family, the failure mode each
  elicits, and the expected_failure_mode metadata.
- ``suites``: every named pdfhell suite, the (trap_family,
  seed_count) breakdown, and the suite_hash for the canonical
  version.

eval_pii_detection

Detect personally-identifiable information (PII) in an LLM output.

Local-first: zero API calls. Uses a regex pattern library covering emails, phone numbers, SSNs, credit cards, IBANs, IPs, addresses, and jurisdiction-specific identifiers (HIPAA MRNs, EU VAT, California bank accounts, etc.).

Score 1.0 = no PII detected. Score 0.0 = PII found (the reason field lists which types matched and example substrings).

Args:
- output: The LLM-generated text to scan.
- jurisdiction: Which extra pattern set to include. One of "all" (default), "gdpr", "ccpa", "pipeda", or "hipaa".
- custom_patterns: Optional dict of {name: regex} to add to the default library (e.g. {"employee_id": r"EMP-\d{6}"}).
- redact: If True, replace matched substrings with [REDACTED-TYPE] markers in the reason field.

Returns: {"score": 0.0 or 1.0, "passed": bool, "reason": str, "threshold": float, "evaluator": "pii_detection"}.
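
The regex-based approach, including custom_patterns and redact, can be sketched with two illustrative patterns. The patterns below are deliberately simplified and are not the evaluator's real library; `DEFAULT_PATTERNS` and `scan_pii` are hypothetical names, and the exact wording of the reason field is an assumption.

```python
import re

# Two simplified patterns for illustration; the real library covers many more.
DEFAULT_PATTERNS = {
    "email": r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}",
    "us_ssn": r"\b\d{3}-\d{2}-\d{4}\b",
}


def scan_pii(output, custom_patterns=None, redact=False):
    """Scan text for PII; score 1.0 means nothing matched, 0.0 means PII found."""
    patterns = {**DEFAULT_PATTERNS, **(custom_patterns or {})}
    found = {}
    for name, pattern in patterns.items():
        matches = [m.group(0) for m in re.finditer(pattern, output)]
        if matches:
            found[name] = matches
    parts = []
    for name, matches in found.items():
        # With redact=True the reason carries markers instead of raw substrings.
        shown = [f"[REDACTED-{name.upper()}]"] * len(matches) if redact else matches
        parts.append(f"{name}: {', '.join(shown)}")
    return {
        "score": 0.0 if found else 1.0,
        "passed": not found,
        "reason": "; ".join(parts) if parts else "no PII detected",
    }
```

Passing `custom_patterns={"employee_id": r"EMP-\d{6}"}` adds the example pattern from the argument description to the scan.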

eval_schema_compliance

Validate that an LLM output conforms to a JSON Schema.

Wraps multivon-eval's SchemaEvaluator. Parses the LLM output as JSON (tolerantly stripping markdown code fences), then validates the parsed structure against the provided JSON Schema dict. Reports per-field validation errors, not just "valid/invalid".

For Pydantic-model validation or more advanced setups (custom validators, recursive schemas), use the multivon-eval SDK directly.

Args:
- output: The LLM-generated text expected to contain JSON.
- schema: A JSON Schema dict (Draft 7). Example: {"type": "object", "required": ["title", "score"], "properties": {"title": {"type": "string"}, "score": {"type": "number"}}}.
- strict: If True, additional fields not in the schema are treated as failures.

Returns: {"score": 0.0-1.0, "passed": bool, "reason": str, "threshold": float, "evaluator": "schema_compliance"}.
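
The two stages described above (tolerant fence stripping, then per-field validation) can be sketched with the standard library alone. The validator below handles only `required` and primitive `type` checks on top-level properties; it is a tiny illustrative subset and not a Draft 7 implementation. `parse_llm_json` and `basic_schema_errors` are hypothetical names.

```python
import json
import re

FENCE = "`" * 3  # built programmatically so this example nests inside a fence
_FENCED = re.compile(
    r"^\s*" + FENCE + r"[\w-]*\s*\n(.*?)\n?\s*" + FENCE + r"\s*$", re.DOTALL
)
_PY_TYPES = {"string": str, "number": (int, float), "integer": int,
             "boolean": bool, "object": dict, "array": list}


def parse_llm_json(output):
    """Parse JSON from LLM text, unwrapping a surrounding markdown fence if present."""
    match = _FENCED.match(output)
    return json.loads(match.group(1) if match else output)


def basic_schema_errors(data, schema):
    """Return per-field errors for 'required' and primitive 'type' checks only."""
    errors = []
    for field in schema.get("required", []):
        if field not in data:
            errors.append(f"{field}: missing required field")
    for field, spec in schema.get("properties", {}).items():
        if field not in data:
            continue
        expected = _PY_TYPES.get(spec.get("type"))
        value = data[field]
        # bool is a subclass of int in Python, so reject it for numeric types.
        bool_as_number = isinstance(value, bool) and spec.get("type") in ("number", "integer")
        if expected is not None and (bool_as_number or not isinstance(value, expected)):
            errors.append(f"{field}: expected {spec['type']}, got {type(value).__name__}")
    return errors
```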

eval_toxicity

Detect harmful, offensive, or inappropriate content in an LLM output.

QAG-graded — the judge answers four yes/no questions about whether the output is free of hate speech, threats, harmful instructions, and disrespectful tone. Score is the fraction of questions that pass; 1.0 = not toxic, 0.0 = toxic.

Use this for guardrails on generated content, chatbot turns, or any user-facing LLM output where harmful content is a risk.

Args:
- output: The LLM-generated text to evaluate.
- judge_model: Provider:model for the QAG judge, e.g. "anthropic:claude-haiku-4-5" (default), "openai:gpt-4o-mini", or "google:gemini-2.5-flash".

Returns: {"score": 0.0-1.0, "passed": bool, "reason": str, "threshold": float, "evaluator": "toxicity"}.

eval_bias

Detect gender, racial, political, age, or socioeconomic bias.

QAG-graded — the judge answers five yes/no questions about whether the response treats demographic groups equitably and stays politically neutral. Score 1.0 = no detectable bias, 0.0 = significant bias.

Use this for fairness audits on generated content, summarisation of user-submitted text, or hiring/lending-adjacent LLM pipelines.

Args:
- input: The prompt the LLM was responding to. Bias evaluators use this to contextualise whether disparate treatment is justified (e.g. a medical question may legitimately discuss group-specific risk factors).
- output: The LLM-generated response.
- judge_model: Provider:model for the QAG judge.

Returns: {"score": 0.0-1.0, "passed": bool, "reason": str, "threshold": float, "evaluator": "bias"}.

eval_context_precision

Measure whether retrieved RAG context chunks are relevant to the question.

High precision = the retriever returned mostly on-topic chunks; low noise. The judge asks "is this chunk relevant?" for each chunk (up to 8) and scores precision = fraction marked relevant.

Use this to diagnose retriever quality: if precision is low, your embedding model, chunk size, or reranker is returning noise.

Args:
- input: The user's question.
- context: Either a list of retrieved chunks, or a single string with the full retrieved context (will be evaluated as one chunk).
- judge_model: Provider:model for the QAG judge.

Returns: {"score": 0.0-1.0, "passed": bool, "reason": str, "threshold": float, "evaluator": "context_precision"}.

eval_context_recall

Measure whether retrieved context contains enough information to answer.

High recall = the retriever found the information needed to derive the expected answer. The judge asks whether the expected answer could plausibly be reconstructed from the retrieved context alone.

Use this when you have a labelled QA dataset and want to diagnose whether failures are retriever misses vs. generator errors.

Args:
- input: The user's question.
- context: The retrieved context chunks (list or single string).
- expected_answer: The ground-truth answer the context should support.
- judge_model: Provider:model for the QAG judge.

Returns: {"score": 0.0-1.0, "passed": bool, "reason": str, "threshold": float, "evaluator": "context_recall"}.

eval_g_eval

G-Eval style holistic scoring against a plain-English criterion.

The judge reads the criterion and the output, then returns a numeric score from 0.0 to 1.0 plus a short reason. To reduce single-sample variance the prompt is run twice by default and the scores averaged (position/framing bias mitigation per the original G-Eval paper).

Best for fuzzy or holistic qualities: creativity, tone, style, helpfulness, conciseness. For criteria with multiple discrete aspects, prefer eval_custom_rubric.

Args:
- input: The prompt the LLM was responding to.
- output: The LLM-generated response to score.
- criteria: A plain-English description of what to score on, e.g. "Is the response concise, polite, and free of jargon?".
- name: Optional label for the evaluator instance (appears in the result dict's evaluator field).
- runs: How many independent judgements to average. Default 2.
- judge_model: Provider:model for the scoring judge.

Returns: {"score": 0.0-1.0, "passed": bool, "reason": str, "threshold": float, "evaluator": <name>}.

eval_custom_rubric

Score an output against your own list of yes/no quality checks.

Each criterion is a [question, expect_yes] pair. The judge answers each question with yes/no; the score is the fraction answered as expected. Best for compliance-style rubrics where each aspect should be auditable separately.

Args:
- input: The prompt the LLM was responding to.
- output: The LLM-generated response.
- criteria: A list of [question_str, expect_yes_bool] pairs. Example: [["Does it cite a source?", true], ["Does it speculate beyond the source?", false]].
- name: Optional label for the rubric (appears in the result dict's evaluator field).
- context: Optional context string for the judge to consider (e.g. retrieved RAG context, source document).
- judge_model: Provider:model for the QAG judge.

Returns: {"score": 0.0-1.0, "passed": bool, "reason": str, "threshold": float, "evaluator": <name>}.
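
The rubric score is the fraction of questions whose judge answer matches expect_yes. A minimal sketch of that arithmetic, with the judge's yes/no answers supplied as booleans (in the real tool they come from the judge model); `rubric_score` is a hypothetical helper.

```python
def rubric_score(criteria, judge_answers):
    """Fraction of criteria answered as expected.

    criteria: list of [question, expect_yes] pairs, as in the tool's 'criteria' argument.
    judge_answers: list of bools, True meaning the judge answered "yes".
    """
    if not criteria:
        raise ValueError("criteria must not be empty")
    if len(criteria) != len(judge_answers):
        raise ValueError("need exactly one judge answer per criterion")
    hits = sum(
        1 for (_question, expect_yes), answer in zip(criteria, judge_answers)
        if answer == expect_yes
    )
    return hits / len(criteria)
```

With the example criteria from the argument description, a judge that answers "yes" to both questions yields 0.5, because the second question expects "no".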

eval_vqa_faithfulness

Check whether an LLM answer about an image is grounded in what's visible.

Image-grounded faithfulness. The vision judge extracts up to 3 factual claims from the answer, then verifies each one against the image. Score = fraction of claims that are accurate.

Use this for visual QA, image captioning, chart/diagram reading, and any LLM output that purports to describe an image.

Image input (exactly one of):

  • image: a local path, http(s) URL, or full data URI.

  • image_base64: raw base64 (no data: prefix); pair with mime_type (default "image/png").

Args:
- input: The question or prompt the LLM was answering.
- output: The LLM-generated answer to verify against the image.
- image: Path / URL / data URI for the image.
- image_base64: Alternative: raw base64 image bytes.
- mime_type: Mime type when using image_base64. Default "image/png". Other common values: "image/jpeg", "image/webp".
- judge_model: Provider:model for the vision judge. Must be vision-capable. Default "google:gemini-2.5-flash" (cheap). Other vision-capable options: "openai:gpt-4o-mini" or "anthropic:claude-sonnet-4-6" (not Haiku; Haiku 4-5 is not vision-capable).

Returns: {"score": 0.0-1.0, "passed": bool, "reason": str, "threshold": float, "evaluator": "vqa_faithfulness"}.
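
The "exactly one of image or image_base64" rule is easy to get wrong on the client side, so it can help to build the image arguments through one function. The sketch below enforces the rule and shows how a local file might be turned into the image_base64 and mime_type pair; `image_arguments` and `encode_image_file` are hypothetical client-side helpers.

```python
import base64
import mimetypes
from pathlib import Path


def image_arguments(image=None, image_base64=None, mime_type="image/png"):
    """Return the image-related tool arguments, enforcing exactly one source."""
    if (image is None) == (image_base64 is None):
        raise ValueError("provide exactly one of image or image_base64")
    if image is not None:
        return {"image": image}
    return {"image_base64": image_base64, "mime_type": mime_type}


def encode_image_file(path):
    """Read a local image and return raw-base64 arguments (no data: prefix)."""
    raw = Path(path).read_bytes()
    guessed = mimetypes.guess_type(str(path))[0] or "image/png"
    encoded = base64.b64encode(raw).decode("ascii")
    return image_arguments(image_base64=encoded, mime_type=guessed)
```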

eval_document_grounding

Check whether an answer about a multi-page document is grounded.

Document-page-grounded faithfulness for multi-page document agents (contracts, invoices, scientific PDFs, medical records). The vision judge answers three yes/no questions per document: is every claim supported, are there no inventions, and are exceptions handled?

Provide one image per page. Use exactly one of:

  • images: list of paths, http(s) URLs, or data URIs.

  • images_base64: list of raw base64 strings; pair with mime_type.

Args:
- input: The question or prompt the LLM was answering about the document.
- output: The LLM-generated answer to verify against the pages.
- images: List of page image sources (paths/URLs/data URIs).
- images_base64: Alternative: list of raw base64 strings.
- mime_type: Mime type when using images_base64. Default "image/png".
- judge_model: Provider:model for the vision judge. Must be vision-capable. Default "google:gemini-2.5-flash".

Returns: {"score": 0.0-1.0, "passed": bool, "reason": str, "threshold": float, "evaluator": "document_grounding"}.

eval_compare_runs

Compare two multivon-eval report JSONs and return a structured diff.

Loads both reports from disk (the JSON produced by EvalReport.to_json()), pairs cases by case_input, and returns pass-rate / average-score deltas plus the per-case regressions and improvements lists. Includes a McNemar p-value so the agent can tell a real shift from small-sample noise.

Use this when you've made a prompt / retrieval / model change and want to know if the new run actually improved over the baseline — not just on aggregate, but case-by-case.

Args:
- baseline_json_path: Filesystem path to the baseline report JSON (e.g. "runs/baseline.json").
- new_json_path: Filesystem path to the new / proposal report JSON to compare against the baseline.

Returns: A dict with:
- pass_rate_delta: float, new minus baseline pass rate
- avg_score_delta: float, new minus baseline average score
- regressions: list of dicts with input, baseline_status, proposal_status, baseline_score, proposal_score
- improvements: same shape as regressions
- mcnemar_p_value: float or null; paired-test p-value
- baseline / proposal: summary blocks with name, pass_rate, avg_score, errors, flaky
- paired_count / added_count / removed_count: pairing stats so the caller can see how many cases lined up vs. drifted between runs
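
McNemar's test looks only at the discordant pairs: cases that passed in the baseline and failed in the new run (regressions) and the reverse (improvements). The sketch below computes the exact two-sided binomial version; which variant multivon-eval uses (exact, chi-square, with or without continuity correction) is not stated here, so this is one plausible reading. Returning None when there are no discordant pairs is also an assumption, chosen to mirror the "float or null" return type.

```python
from math import comb


def mcnemar_exact_p(regressions, improvements):
    """Exact two-sided McNemar p-value from the two discordant-pair counts."""
    n = regressions + improvements
    if n == 0:
        return None  # no discordant pairs, so the test is undefined
    k = min(regressions, improvements)
    # Under H0 each discordant pair is a fair coin flip: Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

For 1 regression and 8 improvements this gives about 0.039, while 3 of each gives 1.0, matching the intuition that balanced flips carry no evidence of a real shift.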

eval_generate_cases

Generate synthetic eval cases from a source text.

Calls multivon-eval's synthetic generator to produce n eval cases from raw text (docs, FAQ, knowledge base). Each case has an input (question), expected_output (ground-truth answer), and context (the source excerpt the answer was grounded in). Eliminates the cold-start problem when building a new eval suite from scratch.

Requires a provider API key in env so the underlying judge can propose question/answer pairs.

Args:
- from_text: Source text to generate cases from (e.g. FAQ, docs chunk, knowledge base article).
- n: Number of cases to generate. Default 10.
- task: One of "qa" (question/answer pairs; the default), "summarization" (text + expected summary), or "hallucination" (faithful answer + expected_output = "faithful" for hallucination benchmarks).
- judge_model: Provider:model string used to generate the cases. The generator calls this judge under the hood; it does NOT need to match the judge you eventually use to evaluate the cases. Default "anthropic:claude-haiku-4-5".

Returns: A list of dicts {"input", "expected_output", "context", "metadata"} ready to feed into EvalCase(**d) or to persist as a JSONL eval dataset.
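
Persisting the returned list as a JSONL dataset needs nothing beyond the standard library. A minimal sketch, with one JSON object per line; `write_jsonl` and `read_jsonl` are hypothetical helper names.

```python
import json
from pathlib import Path


def write_jsonl(cases, path):
    """Write one JSON object per line; returns the number of cases written."""
    count = 0
    with Path(path).open("w", encoding="utf-8") as handle:
        for case in cases:
            handle.write(json.dumps(case, ensure_ascii=False) + "\n")
            count += 1
    return count


def read_jsonl(path):
    """Load cases back, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```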

eval_ingest_trace

Convert a JSON agent trace into a JSON-friendly EvalCase payload.

Parses a serialised agent trajectory and returns the EvalCase shape that the rest of the eval pipeline (and the other eval_* MCP tools) expects. Use this when your agent has just finished a trajectory at runtime and you want to score it immediately, without re-running anything.

Supports three frameworks:

  • "langgraph" (default): canonical universal step list

  • "openai_agents": canonical OR {"new_items": [...]} from a RunResult you serialised

  • "manual": canonical step list

Args:
- trace_json: The trace as a JSON-friendly dict. Must include input; steps (or new_items for openai_agents) is strongly recommended.
- framework: One of "langgraph", "openai_agents", "manual". Defaults to "langgraph".

Returns: A dict with input, expected_output, context, expected_tool_calls, agent_trace (list of step dicts), and metadata — ready to feed back into other eval_* MCP tools or to persist as part of an eval dataset.

Prompts

Interactive templates invoked by user choice

No prompts

Resources

Contextual data attached and managed by the client

No resources

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/multivon-ai/multivon-mcp'

If you have feedback or need assistance with the MCP directory API, please join our Discord server.