multivon-mcp

Official

by multivon-ai

Python

Remote

eval_faithfulness

Check if an LLM's answer is grounded in the provided context by verifying factual claims against retrieved documents.

Instructions

Evaluate whether an LLM output is grounded in the retrieved context.

Uses multivon-eval's QAG-graded Faithfulness evaluator. Extracts factual claims from the output and verifies each one against the context. Score is the fraction of claims supported.
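The scoring rule described above (score is the fraction of claims supported) can be illustrated with a short sketch. This is not multivon-eval's code: the helper name `faithfulness_score`, the 0.5 default threshold, and the handling of an answer with no claims are all assumptions made for illustration, and the claim extraction and verification steps performed by the judge model are replaced here by a hand-written list of verdicts.

```python
from typing import Dict, List


def faithfulness_score(verdicts: List[bool], threshold: float = 0.5) -> Dict[str, object]:
    """Hypothetical helper: score = supported claims / total claims.

    `verdicts` holds one boolean per extracted claim, True when the claim
    is supported by the context. The threshold default is an assumption.
    """
    if not verdicts:
        # Assumption: an answer with no factual claims has nothing unsupported.
        score = 1.0
    else:
        score = sum(verdicts) / len(verdicts)
    return {"score": score, "passed": score >= threshold, "threshold": threshold}


# Four claims extracted, three of them supported by the context.
result = faithfulness_score([True, True, False, True])
print(result)
```

With three of four claims supported, the sketch yields a score of 0.75, which passes under the assumed threshold.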

Use this when a RAG pipeline returned an answer and you want to check the LLM didn't invent facts not present in retrieved documents.

Args:
  input: The user's question.
  context: The retrieved context the LLM was given.
  output: The LLM's answer being evaluated.
  judge_model: Provider:model for the QAG judge. Default "anthropic:claude-haiku-4-5" (cheap + calibrated).

Returns: {"score": 0.0-1.0, "passed": bool, "reason": str, "threshold": float}.
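As a rough illustration of how a client might invoke this tool, the sketch below builds an MCP `tools/call` request using the argument names listed above. The helper name `build_tool_call` and the sample question, context, and answer are made up for the example. How the request is transported to the remote server, and how the returned dictionary is wrapped in the response, are not stated on this page and are left out.

```python
import json
from typing import Any, Dict


def build_tool_call(question: str, context: str, answer: str, request_id: int = 1) -> Dict[str, Any]:
    """Hypothetical helper: build a JSON-RPC 2.0 request for eval_faithfulness."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {
            "name": "eval_faithfulness",
            # judge_model is omitted, so the server's documented default applies.
            "arguments": {
                "input": question,
                "context": context,
                "output": answer,
            },
        },
    }


payload = build_tool_call(
    question="When was the bridge completed?",
    context="The bridge opened to traffic in 1937 after four years of construction.",
    answer="The bridge was completed in 1937.",
)
print(json.dumps(payload, indent=2))
```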

Input Schema

Name	Required	Default
`input`	Yes
`context`	Yes
`output`	Yes
`judge_model`	No	anthropic:claude-haiku-4-5
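A client may want to check arguments against this table before sending a call. The sketch below is one way to do that: it rejects calls missing a required field, fills in the documented default for `judge_model`, and checks that the value has the provider:model shape. The function name `prepare_arguments` and the client-side validation itself are assumptions; the server may validate differently.

```python
from typing import Any, Dict

REQUIRED = ("input", "context", "output")
DEFAULT_JUDGE_MODEL = "anthropic:claude-haiku-4-5"  # default listed in the table


def prepare_arguments(args: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical client-side check of eval_faithfulness arguments."""
    missing = [name for name in REQUIRED if name not in args]
    if missing:
        raise ValueError(f"missing required arguments: {', '.join(missing)}")

    prepared = dict(args)  # copy so the caller's dict is not mutated
    prepared.setdefault("judge_model", DEFAULT_JUDGE_MODEL)

    # Expect "provider:model"; split on the first colon only.
    provider, sep, model = prepared["judge_model"].partition(":")
    if not sep or not provider or not model:
        raise ValueError("judge_model must look like 'provider:model'")
    return prepared


prepared = prepare_arguments(
    {"input": "Who wrote it?", "context": "It was written by Ada.", "output": "Ada wrote it."}
)
print(prepared["judge_model"])
```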

Output Schema

Name	Required	Description	Default
No arguments

Tool Definition Quality

A (4.5/5.0)

Behavior 4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

With no annotations, the description carries the full burden. It explains the evaluation process: extracts factual claims, verifies each against context, and provides a score. It also specifies the default judge model and hints at its calibration. It lacks details on error handling or permissions but is fairly transparent.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is well-structured and concise: a one-sentence purpose, a brief explanation of the evaluator, a use-case line, then bullet-style parameter descriptions and a return format. Every sentence adds value with no redundancy.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness 5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

The description covers the tool's purpose, usage context, behavioral details, parameter meanings, and return format (matching the output schema). For a tool with 4 parameters and no nested objects, this is comprehensive.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters 4/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema coverage is 0%, so the description must compensate. It describes each parameter: 'input' as the user's question, 'context' as the retrieved context, 'output' as the LLM's answer, and 'judge_model' with its default and a hint about cost and calibration. This adds meaning beyond the bare schema, though it does not spell out constraints or valid values.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states that the tool evaluates whether an LLM output is grounded in retrieved context, using a specific evaluator (QAG-graded Faithfulness). It distinguishes itself from sibling tools such as eval_hallucination by focusing on verifying factual claims against the context.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines 4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description explicitly says when to use: 'when a RAG pipeline returned an answer and you want to check the LLM didn't invent facts.' It does not mention when not to use or name alternative tools, but the context of use is well-defined.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/multivon-ai/multivon-mcp'

If you have feedback or need assistance with the MCP directory API, please join our Discord server.