multivon-mcp

Official
by multivon-ai

Server Configuration

Describes the environment variables required to run the server.

| Name | Required | Description | Default |
| --- | --- | --- | --- |
| GOOGLE_API_KEY | No | Google API key for using Gemini models in evaluations | |
| OPENAI_API_KEY | No | OpenAI API key for using OpenAI models in evaluations | |
| ANTHROPIC_API_KEY | No | Anthropic API key for using Claude models in evaluations | |
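
Because all three keys are optional, a client may want to check which providers are usable before choosing a model. The following is a minimal sketch, not part of the server: the function name `configured_providers` is hypothetical, and the provider prefixes are assumed to match the "provider:model" specs used by the tools on this page.

```python
import os

# Provider prefix (as used in "provider:model" specs) mapped to the
# environment variable listed above. The mapping itself is an assumption.
PROVIDER_ENV_VARS = {
    "google": "GOOGLE_API_KEY",
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
}


def configured_providers(env=None):
    """Return the sorted provider prefixes whose API key is set and non-empty."""
    env = os.environ if env is None else env
    return sorted(
        provider
        for provider, variable in PROVIDER_ENV_VARS.items()
        if env.get(variable)
    )
```

Passing a plain dict instead of `os.environ` makes the check easy to exercise without touching the real environment.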

Capabilities

Features and capabilities supported by this server

| Capability | Details |
| --- | --- |
| tools | {"listChanged": false} |
| prompts | {"listChanged": false} |
| resources | {"subscribe": false, "listChanged": false} |
| experimental | {} |

Tools

Functions exposed to the LLM to take actions

pdfhell_run

Run the pdfhell adversarial-PDF benchmark against a vision model.

Args:
- model: Provider:model spec, e.g. "anthropic:claude-sonnet-4-6", "openai:gpt-4o", "google:gemini-2.5-flash".
- suite: "smoke" (3 cases, ~10s) or "mini" (30 cases, ~$0.01 on Flash). Default "mini".
- workers: Number of parallel API requests. Default 4.

Returns: A dict with the overall pass_rate, its Wilson 95% CI, per-trap-family pass rates and CIs, and per-case details. The suite version and hash are included so consumers can verify that the run measured the expected cases.

Provider API keys are read from environment variables (ANTHROPIC_API_KEY, OPENAI_API_KEY, GOOGLE_API_KEY). They are not passed through this tool and are never logged.
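
To help interpret the reported confidence interval, here is a sketch of the standard Wilson score interval for a pass rate. The source does not show how the server computes its CI, so this is the textbook formula under the assumption of z = 1.96 for 95% coverage; the function name `wilson_interval` is hypothetical.

```python
import math


def wilson_interval(passed, total, z=1.96):
    """Wilson score interval for a binomial proportion passed/total.

    z=1.96 corresponds to roughly 95% coverage.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= passed <= total:
        raise ValueError("passed must be between 0 and total")
    p = passed / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half_width = (
        z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    )
    # Clamp to [0, 1] to absorb floating-point rounding at the extremes.
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Example: 27 of the 30 "mini" cases passed.
low, high = wilson_interval(27, 30)
```

On a 30-case suite the interval is wide (roughly 0.74 to 0.97 for 27 passes), which is why per-family rates on a handful of cases should be read with caution.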

pdfhell_make

Generate one adversarial PDF + its answer key.

Useful for an agent to inspect what a specific trap looks like before deciding to evaluate against it.

Args:
- trap: Trap family. One of: "hidden_ocr_mismatch", "footnote_override", "split_table_across_pages".
- seed: Integer seed. The same seed yields a byte-identical PDF and an identical answer key.
- return_pdf_bytes: If True, include the base64-encoded PDF bytes in the response. Default False, since most agents want the question and expected answer rather than the raw PDF.

Returns: A dict with the case JSON (id, trap_family, question, expected_answer, forbidden_answers, metadata) and optionally the base64-encoded PDF bytes under pdf_base64.
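
When return_pdf_bytes is True, a client has to decode the base64 payload before it can save or render the PDF. A minimal sketch follows, assuming `pdf_base64` sits at the top level of the returned dict; the helper name `extract_pdf_bytes` and the magic-number check are illustrative additions, not part of the tool.

```python
import base64


def extract_pdf_bytes(response):
    """Decode the optional pdf_base64 field; return None when it is absent."""
    encoded = response.get("pdf_base64")
    if encoded is None:
        return None
    data = base64.b64decode(encoded)
    # PDF files start with the "%PDF-" magic bytes; reject anything else.
    if not data.startswith(b"%PDF-"):
        raise ValueError("decoded payload does not look like a PDF")
    return data
```

The decoded bytes can then be written to disk with an ordinary binary file write.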

eval_faithfulness

Evaluate whether an LLM output is grounded in the retrieved context.

Uses multivon-eval's QAG-graded Faithfulness evaluator. Extracts factual claims from the output and verifies each one against the context. Score is the fraction of claims supported.

Use this when a RAG pipeline returned an answer and you want to check the LLM didn't invent facts not present in retrieved documents.

Args:
- input: The user's question.
- context: The retrieved context the LLM was given.
- output: The LLM's answer being evaluated.
- judge_model: Provider:model for the QAG judge. Default "anthropic:claude-haiku-4-5" (cheap + calibrated).

Returns: {"score": 0.0-1.0, "passed": bool, "reason": str, "threshold": float}.
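
The scoring rule described above (fraction of claims supported, compared against a threshold) can be sketched as follows. This is not the multivon-eval implementation: claim extraction and verification are done by the judge model and are represented here only by a list of booleans, and the 0.8 threshold, the `>=` comparison, and the empty-list behavior are all assumptions.

```python
def faithfulness_result(claim_supported, threshold=0.8):
    """Aggregate per-claim verdicts into a score/passed/threshold dict.

    claim_supported: list of bools, one per claim extracted from the output,
    True when the judge found the claim supported by the context.
    """
    if not claim_supported:
        # Assumption: an output with no factual claims cannot be unfaithful.
        score = 1.0
    else:
        score = sum(claim_supported) / len(claim_supported)
    return {"score": score, "passed": score >= threshold, "threshold": threshold}
```

For example, three supported claims out of four give a score of 0.75, which fails under the assumed 0.8 threshold.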

eval_hallucination

Detect fabricated information not present in the context.

Score 1.0 = no hallucination. Score 0.0 = significant hallucination.

Args:
- output: The LLM output to check.
- context: The ground-truth context the output should be grounded in.
- judge_model: Provider:model for the QAG judge.

Returns: {"score": 0.0-1.0, "passed": bool, "reason": str, "threshold": float}.

eval_relevance

Check whether an LLM output actually addresses the user's question.

QAG-graded: generates yes/no questions about whether the output answers the input, stays on topic, and contains relevant content.

Args:
- input: The user's question.
- output: The LLM's response.
- judge_model: Provider:model for the QAG judge.

Returns: {"score": 0.0-1.0, "passed": bool, "reason": str, "threshold": float}.

eval_tool_call_accuracy

Evaluate whether an agent called the right tool with the right arguments.

Purely deterministic; no LLM judge is needed. Compares the actual tool name and arguments against the expected ones.

Args:
- expected_tool: Tool name the agent should have called.
- actual_tool: Tool name the agent actually called.
- expected_arguments: Dict of expected argument values (optional).
- actual_arguments: Dict of argument values the agent passed (optional).

Returns: {"score": 0.0 or 1.0, "passed": bool, "reason": str}.
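
A deterministic comparison of this kind might look like the sketch below. It is an illustration rather than the server's code: in particular, it assumes that only the keys present in expected_arguments are checked and that extra arguments passed by the agent are ignored, which the description above does not specify.

```python
_MISSING = object()  # sentinel so a missing key never equals an expected value


def tool_call_accuracy(expected_tool, actual_tool,
                       expected_arguments=None, actual_arguments=None):
    """Return a score/passed/reason dict for a single tool call."""
    if expected_tool != actual_tool:
        return {
            "score": 0.0,
            "passed": False,
            "reason": f"expected tool {expected_tool!r}, got {actual_tool!r}",
        }
    actual_arguments = actual_arguments or {}
    mismatched = sorted(
        key
        for key, value in (expected_arguments or {}).items()
        if actual_arguments.get(key, _MISSING) != value
    )
    if mismatched:
        return {
            "score": 0.0,
            "passed": False,
            "reason": "argument mismatch: " + ", ".join(mismatched),
        }
    return {"score": 1.0, "passed": True, "reason": "tool name and arguments match"}
```

Listing the mismatched keys in the reason keeps the result useful for debugging an agent trace without a second call.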

eval_answer_accuracy

Evaluate whether an answer is semantically equivalent to the ground truth.

QAG-graded: generates yes/no questions about whether the actual answer matches the meaning of the expected answer. Useful when string matching is too strict (e.g. for paraphrased correct answers).

Args:
- expected_answer: Ground-truth answer.
- actual_answer: The LLM's answer.
- judge_model: Provider:model for the QAG judge.

Returns: {"score": 0.0-1.0, "passed": bool, "reason": str}.

eval_audit_pack

Build a hash-chained audit ZIP from a pdfhell run.

Combines the run JSON, the case PDFs + answer keys, JUnit XML, and a SHA-256 manifest into one downloadable ZIP. Suitable for attaching to a procurement diligence appendix.

Args:
- run_json_path: Path to a pdfhell run JSON (from pdfhell run --out).
- cases_dir: Directory containing the case PDFs and answer keys that were evaluated. This is the same directory the run used.
- output_zip_path: Where to write the audit ZIP.

Returns: {"path": "/abs/path/to.zip", "size_bytes": N, "manifest": {...}}. The manifest dict mirrors the one inside the ZIP, which is useful for an agent that wants to verify the contents without opening the ZIP itself.
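
For a consumer who does want to open the ZIP, checking SHA-256 digests against its members can be sketched as below. The real manifest layout is not documented on this page, so the `expected_digests` mapping of member name to hex digest is a hypothetical stand-in, as is the function name `find_digest_mismatches`.

```python
import hashlib
import io
import zipfile


def find_digest_mismatches(zip_file, expected_digests):
    """Return sorted member names that are missing or whose SHA-256 differs.

    expected_digests: dict of member name -> hex SHA-256 (hypothetical shape).
    """
    mismatches = []
    with zipfile.ZipFile(zip_file) as archive:
        members = set(archive.namelist())
        for name, expected in expected_digests.items():
            if name not in members:
                mismatches.append(name)
            elif hashlib.sha256(archive.read(name)).hexdigest() != expected:
                mismatches.append(name)
    return sorted(mismatches)


# Usage with an in-memory ZIP standing in for a real audit pack.
payload = b'{"pass_rate": 0.9}'
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("run.json", payload)
buffer.seek(0)
mismatches = find_digest_mismatches(
    buffer, {"run.json": hashlib.sha256(payload).hexdigest()}
)
```

An empty result means every listed member is present with the expected digest.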

eval_discover

Return the full machine-readable capability catalog.

Useful as a first call at session start, so that an agent can plan its evaluation strategy against the evaluators that are actually available rather than guessing or hallucinating tool names.

Returns: A dict with three top-level keys:

- ``evaluators``: every available multivon-eval evaluator,
  with its tier, what inputs it needs, and (when shipped)
  calibrated default thresholds per judge model.
- ``traps``: every pdfhell trap family, the failure mode each
  elicits, and the expected_failure_mode metadata.
- ``suites``: every named pdfhell suite, the (trap_family,
  seed_count) breakdown, and the suite_hash for the canonical
  version.

Prompts

Interactive templates invoked by user choice

No prompts

Resources

Contextual data attached and managed by the client

No resources

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/multivon-ai/multivon-mcp'
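
The same request can be issued from Python's standard library. This is a sketch only: it assumes the endpoint returns a JSON body, which the page does not state, and the helper names `server_url` and `fetch_server` are hypothetical.

```python
import json
import urllib.request
from urllib.parse import quote

API_BASE = "https://glama.ai/api/mcp/v1/servers"


def server_url(owner, name):
    """Build the directory API URL for one server entry."""
    return f"{API_BASE}/{quote(owner, safe='')}/{quote(name, safe='')}"


def fetch_server(owner, name, timeout=10.0):
    """GET the server entry and parse it, assuming a JSON response body."""
    request = urllib.request.Request(server_url(owner, name), method="GET")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling `fetch_server("multivon-ai", "multivon-mcp")` performs the network request; `server_url` alone can be used to inspect the URL without going online.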
