Schema | multivon-mcp

multivon-mcp

Official

by multivon-ai


Python

Remote

Server Configuration

Environment variables the server reads. None is required by itself; set the key for each provider whose models you benchmark or use as a judge.

Name	Required	Description
`GOOGLE_API_KEY`	No	Google API key for using Gemini models in evaluations
`OPENAI_API_KEY`	No	OpenAI API key for using OpenAI models in evaluations
`ANTHROPIC_API_KEY`	No	Anthropic API key for using Claude models in evaluations
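The tools take model specs of the form `provider:model` (for example `"openai:gpt-4o"`), and each provider's key comes from the matching variable in the table above. As a minimal sketch, a client could check for missing keys before starting a run; `required_env_var` and `missing_keys` are hypothetical helpers, not part of multivon-mcp, and the provider prefixes are assumed to be exactly `anthropic`, `openai`, and `google`.

```python
import os

# Hypothetical mapping from model-spec prefix to the variable in the table above.
PROVIDER_ENV_VARS = {
    "anthropic": "ANTHROPIC_API_KEY",
    "openai": "OPENAI_API_KEY",
    "google": "GOOGLE_API_KEY",
}


def required_env_var(model_spec: str) -> str:
    """Return the env var a 'provider:model' spec needs (hypothetical helper)."""
    provider, sep, model = model_spec.partition(":")
    if not sep or not model:
        raise ValueError(f"expected 'provider:model', got {model_spec!r}")
    try:
        return PROVIDER_ENV_VARS[provider]
    except KeyError:
        raise ValueError(f"unknown provider {provider!r}") from None


def missing_keys(*model_specs: str) -> list[str]:
    """List the env vars the given specs need that are unset or empty."""
    needed = {required_env_var(spec) for spec in model_specs}
    return sorted(var for var in needed if not os.environ.get(var))
```

For example, checking `missing_keys("google:gemini-2.5-flash", "anthropic:claude-haiku-4-5")` before benchmarking Gemini with a Claude judge lists whichever of the two keys is not set.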

Capabilities

Features and capabilities supported by this server

Capability	Details
`tools`	{ "listChanged": false }
`prompts`	{ "listChanged": false }
`resources`	{ "subscribe": false, "listChanged": false }
`experimental`	{}

Tools

Functions exposed to the LLM to take actions

Name	Description
`pdfhell_run` (grade A)	Run the pdfhell adversarial-PDF benchmark against a vision model. Args: `model`, a provider:model spec such as `"anthropic:claude-sonnet-4-6"`, `"openai:gpt-4o"`, or `"google:gemini-2.5-flash"`; `suite`, either `"smoke"` (3 cases, ~10 s) or `"mini"` (30 cases, ~$0.01 on Flash), default `"mini"`; `workers`, the number of parallel API requests, default 4. Returns: a dict with the overall `pass_rate` and its Wilson 95% CI, per-trap-family pass rates with CIs, and per-case details. The suite version and hash are included so consumers can verify that the run measured the expected cases. Provider API keys are read from environment variables (`ANTHROPIC_API_KEY`, `OPENAI_API_KEY`, `GOOGLE_API_KEY`); they are never passed through this tool and never logged.
`pdfhell_make` (grade A)	Generate one adversarial PDF and its answer key. Useful for an agent that wants to inspect what a specific trap looks like before deciding to evaluate against it. Args: `trap`, the trap family, one of `"hidden_ocr_mismatch"`, `"footnote_override"`, or `"split_table_across_pages"`; `seed`, an integer seed (the same seed yields a byte-identical PDF and an identical answer key); `return_pdf_bytes`, if True, include the base64-encoded PDF bytes in the response (default False, since most agents want the question and expected answer, not the raw PDF). Returns: a dict with the case JSON (`id`, `trap_family`, `question`, `expected_answer`, `forbidden_answers`, `metadata`) and, optionally, the base64-encoded PDF bytes under `pdf_base64`.
`eval_faithfulness` (grade A)	Evaluate whether an LLM output is grounded in the retrieved context, using multivon-eval's QAG-graded Faithfulness evaluator. It extracts the factual claims from the output and verifies each one against the context; the score is the fraction of claims supported. Use it when a RAG pipeline has returned an answer and you want to check that the LLM did not invent facts absent from the retrieved documents. Args: `input`, the user's question; `context`, the retrieved context the LLM was given; `output`, the LLM answer being evaluated; `judge_model`, the provider:model for the QAG judge, default `"anthropic:claude-haiku-4-5"` (cheap and calibrated). Returns: `{"score": 0.0-1.0, "passed": bool, "reason": str, "threshold": float}`.
`eval_hallucination` (grade B)	Detect fabricated information that is not present in the context. A score of 1.0 means no hallucination; 0.0 means significant hallucination. Args: `output`, the LLM output to check; `context`, the ground-truth context the output should be grounded in; `judge_model`, the provider:model for the QAG judge. Returns: `{"score": 0.0-1.0, "passed": bool, "reason": str, "threshold": float}`.
`eval_relevance` (grade A)	Check whether an LLM output actually addresses the user's question. QAG-graded: it generates yes/no questions about whether the output answers the input, stays on topic, and contains relevant content. Args: `input`, the user's question; `output`, the LLM's response; `judge_model`, the provider:model for the QAG judge. Returns: `{"score": 0.0-1.0, "passed": bool, "reason": str, "threshold": float}`.
`eval_tool_call_accuracy` (grade A)	Evaluate whether an agent called the right tool with the right arguments. Purely deterministic, with no LLM judge: it compares the actual tool name and arguments against the expected ones. Args: `expected_tool`, the tool name the agent should have called; `actual_tool`, the tool name the agent actually called; `expected_arguments`, a dict of expected argument values (optional); `actual_arguments`, a dict of the argument values the agent passed (optional). Returns: `{"score": 0.0 or 1.0, "passed": bool, "reason": str}`.
`eval_answer_accuracy` (grade A)	Evaluate whether an answer is semantically equivalent to the ground truth. QAG-graded: it generates yes/no questions about whether the actual answer matches the meaning of the expected answer. Useful when string matching is too strict, for example with paraphrased but correct answers. Args: `expected_answer`, the ground-truth answer; `actual_answer`, the LLM's answer; `judge_model`, the provider:model for the QAG judge. Returns: `{"score": 0.0-1.0, "passed": bool, "reason": str}`.
`eval_audit_pack` (grade A)	Build a hash-chained audit ZIP from a pdfhell run. It combines the run JSON, the case PDFs and answer keys, JUnit XML, and a SHA-256 manifest into one downloadable ZIP, suitable for attaching to a procurement due-diligence appendix. Args: `run_json_path`, the path to a pdfhell run JSON (from `pdfhell run --out`); `cases_dir`, the directory containing the case PDFs and answer keys that were evaluated (the same directory the run used); `output_zip_path`, where to write the audit ZIP. Returns: `{"path": "/abs/path/to.zip", "size_bytes": N, "manifest": {...}}`. The returned manifest dict mirrors the one inside the ZIP, so an agent can verify the contents without opening the ZIP itself.
`eval_discover` (grade A)	Return the full machine-readable capability catalog. Useful as the first call of a session: an agent can plan its evaluation strategy against the evaluators that actually exist rather than guessing or hallucinating tool names. Returns: a dict with three top-level keys: `evaluators` (every available multivon-eval evaluator, with its tier, the inputs it needs, and, where shipped, calibrated default thresholds per judge model); `traps` (every pdfhell trap family, the failure mode each elicits, and its `expected_failure_mode` metadata); and `suites` (every named pdfhell suite, its (`trap_family`, `seed_count`) breakdown, and the `suite_hash` for the canonical version).
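In practice an MCP client SDK builds and sends tool calls, but it can help to see the raw message. The sketch below serializes JSON-RPC 2.0 `tools/call` requests in the shape the MCP specification uses (`params.name` plus `params.arguments`); transport framing over stdio or HTTP is omitted, `tools_call` is a hypothetical helper, and the argument values come from the `pdfhell_run` description above.

```python
import json
from itertools import count

_ids = count(1)  # JSON-RPC request ids should be unique within a session


def tools_call(name: str, arguments: dict) -> str:
    """Serialize an MCP tools/call request (transport framing not shown)."""
    request = {
        "jsonrpc": "2.0",
        "id": next(_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }
    return json.dumps(request)


# Discover what is available first, then run the quick smoke suite.
discover = tools_call("eval_discover", {})
run = tools_call(
    "pdfhell_run",
    {"model": "google:gemini-2.5-flash", "suite": "smoke", "workers": 4},
)
```

Calling `eval_discover` first matches the advice in its description: the agent plans against the catalog the server actually reports instead of guessing tool names.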
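`pdfhell_run` reports each pass rate with a Wilson 95% CI. The sketch below is the standard Wilson score interval for a binomial proportion, included to show what that interval means; whether the server uses exactly this form and this z value is an assumption. Wilson is a sensible choice for suites this small because, unlike the normal approximation, it never leaves [0, 1] and stays informative when every case passes.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.959963984540054) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z of about 1.96 gives 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must be between 0 and n")
    p = successes / n
    z2 = z * z
    denom = 1 + z2 / n
    center = (p + z2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z2 / (4 * n * n))
    # Clamp only to absorb floating-point error at p = 0 or p = 1.
    return max(0.0, center - half), min(1.0, center + half)


lo, hi = wilson_interval(27, 30)  # e.g. 27 of the 30 "mini" cases passed
```

For 27 of 30 this gives roughly 0.74 to 0.97, and even a perfect 3 of 3 on the `"smoke"` suite only bounds the pass rate from below at about 0.44, which is why the smoke suite is a connectivity check rather than a measurement.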
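With `return_pdf_bytes=True`, `pdfhell_make` returns the PDF as base64 under `pdf_base64`. A minimal sketch for saving it, assuming `pdf_base64` is a top-level key of the response dict; `save_case_pdf` is a hypothetical helper.

```python
import base64
from pathlib import Path


def save_case_pdf(response: dict, out_path: str) -> Path:
    """Write the PDF from a pdfhell_make response requested with return_pdf_bytes=True."""
    encoded = response.get("pdf_base64")  # assumed top-level key
    if encoded is None:
        raise KeyError("no pdf_base64 in response; call with return_pdf_bytes=True")
    data = base64.b64decode(encoded, validate=True)
    # Every PDF file starts with the "%PDF-" header.
    if not data.startswith(b"%PDF-"):
        raise ValueError("decoded bytes do not look like a PDF")
    path = Path(out_path)
    path.write_bytes(data)
    return path
```

Because the same seed yields a byte-identical PDF, hashing two saved copies generated from one seed (for example with `hashlib.sha256`) is a cheap way to confirm a case is reproducible.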
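The `eval_faithfulness` description defines the score as the fraction of extracted claims that the context supports. The sketch below shows only that final arithmetic and the documented return shape. Claim extraction and per-claim verification are done by the LLM judge and are not reproduced; the list of per-claim verdicts, the function name, the `score >= threshold` pass rule, and the handling of outputs with no claims are all assumptions.

```python
def faithfulness_result(claim_verdicts: list[bool], threshold: float) -> dict:
    """Score = supported claims / total claims, in the documented return shape."""
    supported = sum(claim_verdicts)
    if not claim_verdicts:
        # Assumption: an output with no factual claims has nothing unsupported.
        score = 1.0
    else:
        score = supported / len(claim_verdicts)
    return {
        "score": score,
        "passed": score >= threshold,  # assumed pass rule
        "reason": f"{supported}/{len(claim_verdicts)} claims supported by the context",
        "threshold": threshold,
    }
```

An answer with four claims, one of them unsupported, scores 0.75; whether it passes depends entirely on the threshold, which `eval_discover` reports per judge model where calibrated defaults ship.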
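`eval_tool_call_accuracy` is described as purely deterministic, so its logic can be sketched without any model. The exact argument-matching rule is not documented; this sketch assumes every expected key must be present in the actual arguments with an equal value, while extra actual arguments are allowed. `tool_call_accuracy` is a hypothetical stand-in, not the server's code.

```python
from typing import Optional


def tool_call_accuracy(
    expected_tool: str,
    actual_tool: str,
    expected_arguments: Optional[dict] = None,
    actual_arguments: Optional[dict] = None,
) -> dict:
    """Binary score: 1.0 only if the tool name and expected arguments match."""
    if actual_tool != expected_tool:
        return {"score": 0.0, "passed": False,
                "reason": f"called {actual_tool!r}, expected {expected_tool!r}"}
    if expected_arguments:
        actual = actual_arguments or {}
        # Assumed rule: each expected key must be present with an equal value.
        wrong = sorted(k for k, v in expected_arguments.items()
                       if k not in actual or actual[k] != v)
        if wrong:
            return {"score": 0.0, "passed": False,
                    "reason": "argument mismatch on: " + ", ".join(wrong)}
    return {"score": 1.0, "passed": True, "reason": "tool and arguments match"}
```

A subset rule tolerates agents that also pass optional parameters such as `workers`; a stricter variant would compare the two dicts with `==`.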
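`eval_audit_pack` writes a SHA-256 manifest and describes the ZIP as hash-chained. The actual manifest layout and chaining scheme are not documented here, so the sketch below is one hypothetical way to build such a manifest: hash every file, then fold each (path, digest) pair into a running chain in sorted path order. `build_manifest` and its output keys are assumptions and will not match the real format.

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path) -> str:
    """SHA-256 of a file, read in chunks so large PDFs need not fit in memory."""
    h = hashlib.sha256()
    with path.open("rb") as fh:
        for block in iter(lambda: fh.read(65536), b""):
            h.update(block)
    return h.hexdigest()


def build_manifest(root: str) -> dict:
    """Hypothetical hash-chained manifest over every file under root."""
    root_path = Path(root)
    files = {}
    chain = "0" * 64  # genesis value for the chain
    for path in sorted(p for p in root_path.rglob("*") if p.is_file()):
        rel = path.relative_to(root_path).as_posix()
        digest = file_sha256(path)
        files[rel] = digest
        # Each link commits to the previous link, the file name, and its digest,
        # so changing, renaming, adding, or dropping any file changes the head.
        chain = hashlib.sha256(f"{chain}:{rel}:{digest}".encode()).hexdigest()
    return {"algorithm": "sha256", "files": files, "chain_head": chain}
```

A verifier rebuilds the manifest from the unpacked files and compares the per-file digests and the chain head with the recorded ones; a single mismatch identifies tampering without trusting any other part of the archive.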

Prompts

Interactive templates invoked by user choice

Name	Description
No prompts

Resources

Contextual data attached and managed by the client

Name	Description
No resources



MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/multivon-ai/multivon-mcp'
