What can you do with this server?

The rageval-mcp server exposes end-to-end RAG evaluation as agent-callable tools, letting you benchmark retrieval strategies and measure answer quality from within a conversation.

* Retrieve passages (retrieve): Fetch top-k passages for a natural-language query using bm25, tfidf, dense (requires optional install), or hybrid (reciprocal-rank fusion). Returns ranked passages with doc IDs, scores, and text.
* Evaluate retrieval quality (evaluate_retrieval): Score a single retrieval method against a labeled question set, reporting recall@k, precision@k, MRR@k, and nDCG@k averaged across all questions.
* Compare all retrieval methods (compare_methods): Benchmark all available strategies side-by-side at a given cutoff k, identify the best performer by nDCG@k, and gracefully skip methods with unmet dependencies.
* Evaluate answer quality (evaluate_answers): Generate answers from retrieved context and score them for faithfulness (groundedness) and correctness via an LLM-as-judge (Cloudflare Workers AI by default, or Anthropic as an alternative).
* Load a custom corpus (load_corpus): Replace the default demo corpus at runtime with your own documents and optional labeled questions, enabling the full evaluation pipeline on bespoke data.

All tools return structured JSON outputs. The lexical retrieval methods (BM25, TF-IDF) require no external dependencies; dense retrieval and the LLM judge require optional extras or credentials. The server integrates with MCP clients like Claude Desktop or Claude Code via stdio.

Which integrations are available for this server?

Cloudflare Workers AI: provides the AI judge that scores answer faithfulness and correctness (@cf/meta/llama-3.3-70b-instruct-fp8-fast).

How do I use rageval-mcp?

1. Click on "Install Server".
2. Wait a few minutes for the server to deploy. Once ready, it will show a "Started" state.
3. In the chat, type @ followed by the MCP server name and your instructions, e.g., "@rageval-mcp compare retrieval strategies on my evaluation set".

That's it! The server will respond to your query, and you can continue using it as needed.

rageval-mcp

by phillipkaraya


An MCP server that exposes end-to-end RAG evaluation as agent tools. It loads a labeled knowledge base, then lets an agent retrieve passages and measure how good that retrieval is (recall@k, precision@k, MRR, nDCG@k) across four retrieval strategies: BM25, TF-IDF, dense embeddings, and a hybrid that fuses them. It then closes the loop with an LLM-as-judge that scores the answers a RAG system produces for faithfulness and correctness, and a load_corpus tool so an agent can point the whole pipeline at its own documents.

It is the agent-facing companion to rag-eval-harness: the harness is a CLI you run to benchmark retrieval; this is the same retrieval and metrics core wrapped in the Model Context Protocol, so a model can call it mid-conversation to decide which retrieval strategy actually finds the right context.

git clone https://github.com/phillipkaraya/rageval-mcp && cd rageval-mcp
uv sync                              # provisions Python 3.12 + deps, no system Python needed
uv run python scripts/smoke_client.py   # starts the server and calls every tool

No corpus to supply, and no Anthropic key: the sample knowledge base and labeled questions ship inside the package, so the retrieval tools run with zero setup. The answer-quality tool is the one that calls a model, and by default its LLM judge runs on Cloudflare Workers AI (@cf/meta/llama-3.3-70b-instruct-fp8-fast) through an AI Gateway, which costs effectively nothing and needs no Anthropic key, only CF_AIG_TOKEN and CF_AIG_BASE_URL. Set RAGEVAL_JUDGE_PROVIDER=anthropic (with ANTHROPIC_API_KEY and uv sync --extra judge) to switch the judge to the Anthropic API instead. Either way the retrieval tools keep working with no credentials at all.

Why this exists

Most RAG systems ship on vibes. Someone asks the assistant a few questions, the answers look fine, and it goes to production, where it quietly fails because retrieval surfaced the wrong document. The model was rarely the problem. The retrieval was.

An agent that can call an eval can reason about this directly. Instead of guessing whether BM25 or embeddings will serve a given query mix, it can run compare_methods and read the numbers. rageval-mcp puts that loop one tool call away:

retrieve shows what context a strategy would surface for a question.
evaluate_retrieval puts a number on one strategy's quality over a labeled set.
compare_methods benchmarks every strategy side by side and names the winner.
evaluate_answers goes one step further: it generates an answer from the retrieved context and has an LLM judge score it for faithfulness and correctness, so you measure the answer, not just the retrieval.
load_corpus points the server at your own documents (and optional questions) so the same eval loop runs on your data, not just the demo.

The shipped dataset is a deliberately realistic stand-in for a real deployment: a fictional B2B SaaS knowledge base (data/corpus/, 10 documents covering billing, SSO, data residency, SLAs, API limits, and more) plus 20 support-style questions with labeled relevant documents (data/eval/questions.jsonl).


The tools

Five tools. The three retrieval tools (retrieve, evaluate_retrieval, compare_methods) are read-only, idempotent, and fully local with no external calls. evaluate_answers is read-only but reaches an external judge service (Cloudflare Workers AI by default), so it is annotated open-world. load_corpus is the one state-mutating tool: it replaces the active corpus. Every tool returns structured output, so MCP clients that support output schemas get typed results, and others get the same data as JSON text.

`retrieve(query, k=5, method="hybrid")`

Return the top-k passages for a query. This is also the fastest way to see why retrieval choice matters. Ask the same question two ways.

With "method": "bm25", the lexical retriever surfaces the wrong article. The word "data" dominates the match, so it returns the data-export-and-deletion doc, not data-residency (output abridged to rank 1):

{
  "query": "Can I keep my data in Europe?",
  "method": "bm25",
  "k": 3,
  "count": 3,
  "passages": [
    { "rank": 1, "doc_id": "data-export-and-deletion", "chunk_id": "data-export-and-deletion::2", "score": 2.352516,
      "text": "To remove your account data immediately, an administrator can submit a deletion request, and we permanently erase all data within 30 days in line with GDPR." }
  ]
}

Switch to "method": "dense" and it finds the right document, because embeddings match meaning over vocabulary (output abridged to ranks 1 and 3):

{
  "query": "Can I keep my data in Europe?",
  "method": "dense",
  "k": 3,
  "count": 3,
  "passages": [
    { "rank": 1, "doc_id": "data-residency", "chunk_id": "data-residency::1", "score": 0.462297,
      "text": "You select your data region when you create your workspace, and it cannot be changed afterward without contacting support to arrange a migration..." },
    { "rank": 3, "doc_id": "data-residency", "chunk_id": "data-residency::0", "score": 0.44819,
      "text": "Meridian lets you choose where your data is stored. We operate regions in the United States, the European Union (Frankfurt), and Australia (Sydney)." }
  ]
}

That single comparison is the whole point of the server: the retrieval strategy decides whether the model ever sees the right context, and evaluate_retrieval and compare_methods turn that into numbers.
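To make the lexical failure mode concrete, here is a minimal Okapi BM25 sketch. It is not the server's retriever: the tokenizer, the k1 and b values, and the two-document corpus are all illustrative assumptions. It only shows how a single shared word ("data") can decide the ranking when the query and the right document use different vocabulary.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split on non-alphanumerics (no stemming, no stop words)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query: str, docs: dict[str, str], k1: float = 1.5, b: float = 0.75) -> dict[str, float]:
    """Okapi BM25 score of every document for the query (k1 and b are assumed defaults)."""
    tokens = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(tokens)
    avg_len = sum(len(toks) for toks in tokens.values()) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for toks in tokens.values() for term in set(toks))
    query_terms = set(tokenize(query))

    scores = {}
    for doc_id, toks in tokens.items():
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue  # a term the document lacks contributes nothing
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            length_norm = k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores[doc_id] = score
    return scores


# Hypothetical two-document corpus, not the shipped knowledge base.
docs = {
    "export": "Export your data or delete your data at any time.",
    "residency": "Choose the region where your workspace is stored: US, EU, or Australia.",
}
scores = bm25_scores("Can I keep my data in Europe?", docs)
print(scores)
```

The residency document shares no token with the query, so it scores zero, while the export document wins purely on the word "data". An embedding-based retriever is not bound by exact token overlap, which is the gap the dense example above illustrates.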

`evaluate_retrieval(method="hybrid", k=3)`

Score one method against every labeled question and average four ranking metrics.

Input:

{ "method": "bm25", "k": 3 }

Output:

{
  "method": "bm25",
  "k": 3,
  "n_questions": 20,
  "recall_at_k": 0.925,
  "precision_at_k": 0.3167,
  "mrr_at_k": 0.75,
  "ndcg_at_k": 0.789
}
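For readers who want to see what the four numbers mean, the sketch below computes them for a single question under binary, document-level relevance. It follows the standard textbook definitions and is an assumption about, not a copy of, the code in metrics.py; the function names and the example ranking are illustrative. The tool output above is the average of such per-question values.

```python
import math


def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def precision_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k slots filled by a relevant document."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def mrr_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Reciprocal rank of the first relevant document in the top k, else 0."""
    for rank, doc in enumerate(ranked[:k], start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """DCG with binary gains, normalized by the best achievable DCG at k."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(ranked[:k], start=1)
        if doc in relevant
    )
    ideal = sum(1.0 / math.log2(rank + 1) for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0


# Illustrative ranking: the one relevant document shows up at rank 2.
ranked = ["sla-and-uptime", "billing-and-plans", "api-and-rate-limits"]
relevant = {"billing-and-plans"}
for metric in (recall_at_k, precision_at_k, mrr_at_k, ndcg_at_k):
    print(metric.__name__, round(metric(ranked, relevant, 3), 4))
```

With one relevant document per question, precision@3 can never exceed 1/3, which is why a precision near 0.32 can sit alongside a recall above 0.9.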

`compare_methods(k=3)`

Benchmark every available method and return one row each, plus the winner by nDCG@k. Methods whose optional dependencies are missing are reported under skipped (with a reason) instead of failing the call.

Output (default install, dense extra not present):

{
  "k": 3,
  "n_questions": 20,
  "rows": [
    { "method": "bm25",   "recall_at_k": 0.925, "precision_at_k": 0.3167, "mrr_at_k": 0.75,  "ndcg_at_k": 0.789 },
    { "method": "tfidf",  "recall_at_k": 0.85,  "precision_at_k": 0.3,    "mrr_at_k": 0.725, "ndcg_at_k": 0.7537 },
    { "method": "hybrid", "recall_at_k": 0.85,  "precision_at_k": 0.3,    "mrr_at_k": 0.725, "ndcg_at_k": 0.7609 }
  ],
  "best_method": "bm25",
  "skipped": [
    { "method": "dense", "reason": "The 'dense' retriever needs the optional 'sentence-transformers' dependency..." }
  ]
}

A useful result already: with only the lexical methods, plain BM25 (0.789) edges out the hybrid (0.761), because fusing in the weaker TF-IDF ranker pulls the average down. "Hybrid" is not automatically the right answer, which is exactly the kind of thing you want measured rather than assumed.
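A client that consumes this output might reduce it to a short decision summary, as in the sketch below. The summarize helper and its return keys are hypothetical; the input reuses the nDCG values from the default-install output shown above, trimmed to the fields the helper reads.

```python
# Trimmed copy of the compare_methods output above (default install).
compare_output = {
    "k": 3,
    "rows": [
        {"method": "bm25", "ndcg_at_k": 0.789},
        {"method": "tfidf", "ndcg_at_k": 0.7537},
        {"method": "hybrid", "ndcg_at_k": 0.7609},
    ],
    "best_method": "bm25",
    "skipped": [{"method": "dense", "reason": "optional dependency missing"}],
}


def summarize(result: dict) -> dict:
    """Rank rows by nDCG@k and report the winner, its margin, and what was not compared."""
    ranked = sorted(result["rows"], key=lambda row: row["ndcg_at_k"], reverse=True)
    best, runner_up = ranked[0], ranked[1]  # assumes at least two methods ran
    return {
        "best": best["method"],
        "runner_up": runner_up["method"],
        "margin": round(best["ndcg_at_k"] - runner_up["ndcg_at_k"], 4),
        "not_compared": [entry["method"] for entry in result["skipped"]],
    }


summary = summarize(compare_output)
print(summary)
```

Surfacing not_compared alongside the winner matters here: a "best" method chosen while dense was skipped is only the best of the methods that actually ran.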

With the dense extra installed (uv sync --extra dense), all four methods run and dense wins on this semantically phrased eval:

method	recall@k	precision@k	mrr@k	ndcg@k
bm25	0.925	0.317	0.750	0.789
tfidf	0.850	0.300	0.725	0.754
dense	1.000	0.350	0.942	0.957
hybrid	0.950	0.333	0.792	0.829

Reading the result. Many of these questions share almost no vocabulary with their source document. "What happens to my information if I cancel?" has barely a word in common with the data export and deletion article, and BM25 misses it. Dense embeddings close that lexical gap and win clearly here. Hybrid (reciprocal-rank fusion) is the most robust generalist and beats both lexical methods, but it does not top pure dense on this query mix, because fusing in two weaker lexical rankers drags its average down. That is the honest, useful finding: hybrid is the safe default when you do not know your query mix, but for semantic queries dense alone can win.

`evaluate_answers(method="bm25", k=3, provider="cloudflare")`

Turn retrieval scores into an answer-quality score. For each labeled question this retrieves the top-k passages with method, asks the answer model to answer using only that context, then asks a second judge call to score the answer on two axes:

faithfulness (0.0 to 1.0): is every claim in the answer grounded in the retrieved context? This is the hallucination check. An answer can be correct in the world yet unfaithful to what was actually retrieved, which is the failure a RAG system has to avoid.
correctness (0.0 to 1.0): does the answer match the labeled gold answer?

This is the one tool that reaches an external service. By default the judge runs on Cloudflare Workers AI (@cf/meta/llama-3.3-70b-instruct-fp8-fast) through an AI Gateway, which needs CF_AIG_TOKEN and CF_AIG_BASE_URL and costs effectively nothing. Pass provider="anthropic" (or set RAGEVAL_JUDGE_PROVIDER=anthropic, with ANTHROPIC_API_KEY and uv sync --extra judge) to use the Anthropic API instead. Without the selected provider's credentials it returns a clear, actionable error and the retrieval tools keep working. Each question costs two model calls (one to answer, one to judge), so use limit for a quick spot check.
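The retrieve, answer, then judge flow described above can be sketched with plain callables standing in for the two model calls. This is an illustration of the control flow only, not the code in judge.py: the function name, the stub behaviors, and the two toy questions are assumptions, and a real judge would be a model call rather than a string comparison.

```python
from statistics import mean
from typing import Callable, Optional


def evaluate_answers_sketch(
    questions: list[dict],
    retrieve: Callable[[str], list[str]],
    generate: Callable[[str, list[str]], str],
    judge: Callable[[list[str], str, str], tuple[float, float]],
    limit: Optional[int] = None,
) -> dict:
    """Run retrieve -> generate -> judge per question and average the two scores."""
    rows = []
    for q in questions[:limit]:  # limit=None keeps every question
        passages = retrieve(q["question"])
        answer = generate(q["question"], passages)  # model call 1: answer from context only
        faithfulness, correctness = judge(passages, answer, q["answer"])  # model call 2
        rows.append({
            "question_id": q["id"],
            "generated_answer": answer,
            "faithfulness": faithfulness,
            "correctness": correctness,
        })
    return {
        "n_questions": len(rows),
        "avg_faithfulness": mean([r["faithfulness"] for r in rows]) if rows else 0.0,
        "avg_correctness": mean([r["correctness"] for r in rows]) if rows else 0.0,
        "per_question": rows,
    }


# Hypothetical stubs: retrieval only finds the right passage for the trial question.
def stub_retrieve(question: str) -> list[str]:
    if "trial" in question:
        return ["The free trial lasts fourteen days."]
    return ["Refunds are issued within five business days."]


def stub_generate(question: str, passages: list[str]) -> str:
    return "Fourteen days." if any("trial" in p for p in passages) else "NOT IN CONTEXT"


def stub_judge(passages: list[str], answer: str, gold: str) -> tuple[float, float]:
    # Both stub answers stay within the context, so faithfulness is always 1.0 here.
    return 1.0, (1.0 if answer == gold else 0.0)


questions = [
    {"id": "q1", "question": "How long is the trial?", "answer": "Fourteen days."},
    {"id": "q2", "question": "Is my data encrypted?", "answer": "Yes, with AES-256."},
]
report = evaluate_answers_sketch(questions, stub_retrieve, stub_generate, stub_judge)
print(report["avg_faithfulness"], report["avg_correctness"])
```

Even with stubs, the shape of the result is visible: the second question is answered faithfully but incorrectly because the right passage was never retrieved.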

Input (the judge defaults to Cloudflare Workers AI, so no model id or Anthropic key is needed):

{ "method": "bm25", "k": 3 }

Output (real numbers from a live Cloudflare Workers AI run over all 20 questions; per_question abridged to two of the twenty):

{
  "method": "bm25",
  "k": 3,
  "provider": "cloudflare",
  "answer_model": "@cf/meta/llama-3.3-70b-instruct-fp8-fast",
  "judge_model": "@cf/meta/llama-3.3-70b-instruct-fp8-fast",
  "n_questions": 20,
  "avg_faithfulness": 0.975,
  "avg_correctness": 0.575,
  "per_question": [
    {
      "question_id": "q01",
      "question": "How much does the Pro plan cost per month?",
      "generated_answer": "The Pro plan costs $49 per seat per month when billed monthly.",
      "gold_answer": "$49 per seat per month when billed monthly.",
      "retrieved_doc_ids": ["billing-and-plans", "api-and-rate-limits", "sla-and-uptime"],
      "faithfulness": 1.0,
      "correctness": 1.0,
      "faithfulness_reason": "answer directly supported by context",
      "correctness_reason": "matches gold answer exactly"
    },
    {
      "question_id": "q04",
      "question": "How is my information protected while it is being stored?",
      "generated_answer": "NOT IN CONTEXT",
      "gold_answer": "It is encrypted at rest with AES-256.",
      "retrieved_doc_ids": ["data-residency", "user-roles-and-permissions", "data-export-and-deletion"],
      "faithfulness": 1.0,
      "correctness": 0.0,
      "faithfulness_reason": "answer explicitly states it is not in context",
      "correctness_reason": "answer does not convey the same facts as the gold answer"
    }
  ]
}

Reading the result. Faithfulness is near-perfect (0.975) while correctness sits at 0.575, and the gap is the whole point. Look at q04: BM25 retrieved the wrong documents, so the right passage was never in the context, the generator correctly answered NOT IN CONTEXT (faithful, no hallucination), and that scores zero on correctness. The judge is not failing and the generator is not failing; retrieval is, and the two scores pull apart exactly where it does. Switching the retriever to hybrid on the same judge gives faithfulness 0.95 / correctness 0.525, slightly lower, for the same reason compare_methods shows above: fusing in the weaker TF-IDF ranker drags BM25 down on this query mix. The judge here is Llama 3.3 70B at temperature 0, so the numbers are stable run to run but not bit-identical; treat them as a signal, not a fixed score.
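One way to act on that gap is to filter per_question for rows that are faithful but incorrect, since those are the likely retrieval misses. The helper below is a hypothetical client-side sketch; the two thresholds are arbitrary assumptions, and the rows are the two shown above, trimmed to the fields used.

```python
# The two per_question rows shown above, trimmed to the fields this helper reads.
per_question = [
    {"question_id": "q01", "faithfulness": 1.0, "correctness": 1.0,
     "retrieved_doc_ids": ["billing-and-plans", "api-and-rate-limits", "sla-and-uptime"]},
    {"question_id": "q04", "faithfulness": 1.0, "correctness": 0.0,
     "retrieved_doc_ids": ["data-residency", "user-roles-and-permissions", "data-export-and-deletion"]},
]


def likely_retrieval_misses(rows: list[dict], faithful_min: float = 0.8, correct_max: float = 0.2) -> list[dict]:
    """Rows where the answer stuck to the context yet missed the gold answer.

    That pattern points at retrieval (the right passage was absent), not at
    the generator or the judge. Threshold values are illustrative assumptions.
    """
    return [
        row for row in rows
        if row["faithfulness"] >= faithful_min and row["correctness"] <= correct_max
    ]


misses = likely_retrieval_misses(per_question)
for row in misses:
    print(row["question_id"], "retrieved:", row["retrieved_doc_ids"])
```

Each flagged row already carries retrieved_doc_ids, so the next step is to compare those against the question's labeled relevant documents.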

`load_corpus(path | documents, questions=None, reset=false)`

Point the server at your own corpus at runtime and rebuild the index. This is what turns the server from a fixed demo into a reusable eval service. Provide exactly one of path (a directory of .md files) or documents (a list of {doc_id, text}), and optionally questions so the eval and judge tools work on your data too. The bundled corpus is the default and is restorable with reset=true. This is the only state-mutating tool: the loaded corpus becomes active for every later call until it is replaced.

Input:

{
  "documents": [
    { "doc_id": "refunds", "text": "Refunds are issued within five business days." },
    { "doc_id": "trial", "text": "The free trial lasts fourteen days, no card needed." }
  ],
  "questions": [
    { "id": "q1", "question": "How long is the trial?", "answer": "Fourteen days.", "relevant_doc_ids": ["trial"] }
  ]
}

Output:

{
  "source": "inline: 2 documents",
  "n_docs": 2,
  "n_chunks": 2,
  "n_questions": 1,
  "doc_ids": ["refunds", "trial"],
  "methods_available": ["bm25", "tfidf", "hybrid"],
  "note": "Loaded 2 documents and 1 labeled questions. retrieve, evaluate_retrieval, and compare_methods are ready; evaluate_answers also needs a judge backend (Cloudflare Workers AI by default: CF_AIG_TOKEN + CF_AIG_BASE_URL)."
}
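If your documents live on disk but you want to send them inline with their questions, a small pre-flight script can build the arguments and catch labels that point at missing documents before the call. This is a client-side sketch under assumptions: build_load_corpus_args is a hypothetical helper, and using the file stem as doc_id is a choice made here, not necessarily how the server derives IDs when given path.

```python
import tempfile
from pathlib import Path
from typing import Optional


def build_load_corpus_args(md_dir: str, questions: Optional[list[dict]] = None) -> dict:
    """Build inline `documents` (and optional `questions`) arguments from a folder of .md files."""
    documents = [
        {"doc_id": path.stem, "text": path.read_text(encoding="utf-8")}
        for path in sorted(Path(md_dir).glob("*.md"))
    ]
    known_ids = {doc["doc_id"] for doc in documents}
    for q in questions or []:
        missing = set(q["relevant_doc_ids"]) - known_ids
        if missing:
            # A label that points at a missing document would silently score as a miss.
            raise ValueError(f"question {q['id']} references unknown doc ids: {sorted(missing)}")
    args = {"documents": documents}
    if questions:
        args["questions"] = questions
    return args


# Demo on a throwaway directory holding the two documents from the example above.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "refunds.md").write_text("Refunds are issued within five business days.", encoding="utf-8")
    Path(tmp, "trial.md").write_text("The free trial lasts fourteen days, no card needed.", encoding="utf-8")
    args = build_load_corpus_args(tmp, questions=[
        {"id": "q1", "question": "How long is the trial?", "answer": "Fourteen days.",
         "relevant_doc_ids": ["trial"]},
    ])

print([doc["doc_id"] for doc in args["documents"]])
```

The resulting dictionary has the same shape as the input shown above, so it can be passed as the tool arguments unchanged.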

The judge approach, and its limits

evaluate_answers uses LLM-as-judge: a model grades each generated answer instead of relying on string overlap, which is far closer to how a person reads an answer than a metric like exact match. That power comes with caveats worth stating plainly:

The judge is a model, so its scores are estimates, not ground truth. Average over the question set rather than trusting any single 0-or-1 verdict.
By default the generator and the judge are the same model (Llama 3.3 70B on Cloudflare Workers AI). That is cheap and convenient but invites self-preference bias. For a more independent read, point judge_model at a different or stronger model (or set provider="anthropic" for a Claude judge) while keeping a cheaper answer_model.
Faithfulness is judged against the retrieved context, correctness against the gold answer. A faithful answer can still be wrong if retrieval surfaced the wrong document, which is exactly why this layer sits on top of the retrieval metrics rather than replacing them.
It is not free of run-to-run variation, and it costs two model calls per question. The judge runs at temperature 0, so verdicts are stable but not bit-identical; on Cloudflare Workers AI the cost is effectively nothing. Treat the numbers as a signal, not a fixed score.

Use it from an MCP client

The server speaks stdio. Point any MCP client at the rageval-mcp command (provided by uv run from the cloned repo).

Claude Desktop (claude_desktop_config.json):

{
  "mcpServers": {
    "rageval": {
      "command": "uv",
      "args": ["--directory", "/absolute/path/to/rageval-mcp", "run", "rageval-mcp"]
    }
  }
}

Claude Code (one command, or commit the same shape as project .mcp.json):

claude mcp add rageval -- uv --directory /absolute/path/to/rageval-mcp run rageval-mcp

Then ask the model things like "retrieve the top 3 passages for 'how do I enforce MFA', then compare retrieval methods at k=3 and tell me which to use." It will call the tools and read the numbers back.

To sanity-check the server without a full client, use the MCP Inspector:

uv run mcp dev src/rageval_mcp/server.py

How it works

rageval_mcp/data/corpus/*.md        rageval_mcp/data/eval/questions.jsonl
        |                                      |
        v                                      v
   chunk by paragraph                question + relevant_doc_ids
        |                                      |
        v                                      |
  ┌──────────────────────────┐                 |
  │ retrievers               │                 |
  │  bm25   tfidf   dense*   │                 |
  │         hybrid (RRF)     │                 |
  └──────────────────────────┘                 |
        |  top-k chunks                         |
        v                                       v
  collapse to ranked docs ───────────────▶ metrics ───▶ retrieve / evaluate_retrieval / compare_methods

Chunking (corpus.py) splits each document into paragraph passages, then results are collapsed back to document level for scoring.
Retrievers (retrievers.py) share one interface, so a new strategy is a single subclass. The hybrid uses reciprocal-rank fusion (RRF), which combines rankings without needing the underlying scores on the same scale.
Metrics (metrics.py) are plain, unit-tested functions, so the numbers are auditable.
The index (index.py) loads the corpus once, builds the lexical retrievers eagerly, and builds dense lazily on first use. The whole thing is cached, so repeated tool calls stay fast.
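The fusion and collapse steps above can be sketched in a few lines. This is a generic reciprocal-rank fusion, not the code in retrievers.py: the smoothing constant k=60 is the value commonly used for RRF and is an assumption here, the function names are illustrative, and the example rankings are made up. The chunk ID format (doc_id::n) follows the retrieve output shown earlier.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal-rank fusion: each list contributes 1 / (k + rank) per item.

    Only ranks are used, so retrievers with incomparable score scales can be
    combined without normalization. k=60 is an assumed, conventional constant.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a deterministic order.
    return sorted(scores, key=lambda item: (-scores[item], item))


def collapse_to_docs(chunk_ids: list[str]) -> list[str]:
    """Map ranked chunk IDs ("doc_id::n") to ranked doc IDs, keeping each doc's best rank."""
    docs: list[str] = []
    for chunk_id in chunk_ids:
        doc_id = chunk_id.split("::")[0]
        if doc_id not in docs:
            docs.append(doc_id)
    return docs


# Made-up chunk rankings from two retrievers.
lexical = ["export::2", "residency::1", "sla::0"]
dense = ["residency::1", "residency::0", "export::2"]

fused_chunks = rrf_fuse([lexical, dense])
fused_docs = collapse_to_docs(fused_chunks)
print(fused_chunks)
print(fused_docs)
```

A chunk ranked highly by both retrievers rises to the top, and because scoring is document-level, two chunks from the same document count once at the better rank.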

Design decisions and trade-offs

The dataset ships inside the package. The corpus and questions live in rageval_mcp/data, resolved relative to the module, so the server runs with zero configuration. Trade-off: out of the box it evaluates a fixed sample corpus. To evaluate your own data, call load_corpus.
Dense embeddings are an optional extra, not a hard dependency. bm25, tfidf, and hybrid run in milliseconds with no model download. dense needs sentence-transformers (which pulls in PyTorch), so it lives behind uv sync --extra dense. For an agent tool, cold-start latency matters: most tool calls should be instant, and only a caller who explicitly wants the embedding baseline pays the model-load cost. When the extra is absent, the tools degrade gracefully (a clear, actionable error on retrieve/evaluate_retrieval, a skipped entry on compare_methods) instead of crashing the server.
Structured output, not just text. Each tool returns a typed Pydantic model, so the model gets a real schema and a parseable result rather than prose it has to scrape.
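Annotations that match what each tool does. The three retrieval tools are annotated readOnlyHint, idempotentHint, and openWorldHint: false: they mutate nothing and reach nothing external, which makes them safe to expose to an autonomous agent. evaluate_answers is read-only but open-world because it calls the judge service, and load_corpus is the one tool that mutates state.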
RRF over weighted score fusion for the hybrid. RRF needs no per-retriever score normalization and no tuning, which keeps the baseline honest rather than hand-optimized.
Document-level relevance, not passage-level. Simpler to label by hand and matches how a support agent thinks ("which article answers this?"). Trade-off: it cannot measure whether the single best paragraph ranked first.
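The graceful-degradation decision above (skip a method whose extra is missing instead of crashing) can be sketched with a dependency probe. This is an illustration, not the server's code: the split_available helper, the mapping, and the wording of the reason string are assumptions. The only library fact relied on is that the sentence-transformers package is imported as sentence_transformers.

```python
import importlib.util

# Method name -> importable module it needs (assumed mapping for illustration).
OPTIONAL_DEPS = {"dense": "sentence_transformers"}


def split_available(methods: list[str], optional_deps: dict[str, str] = OPTIONAL_DEPS):
    """Partition methods into runnable ones and skipped ones with a reason."""
    available, skipped = [], []
    for method in methods:
        module = optional_deps.get(method)
        # find_spec returns None when a top-level module is not installed,
        # without importing it (so no model-load or PyTorch import cost).
        if module is not None and importlib.util.find_spec(module) is None:
            skipped.append({
                "method": method,
                "reason": f"The '{method}' retriever needs the optional '{module}' module.",
            })
        else:
            available.append(method)
    return available, skipped


available, skipped = split_available(["bm25", "tfidf", "dense", "hybrid"])
print("available:", available)
print("skipped:", [entry["method"] for entry in skipped])
```

Probing with find_spec rather than importing keeps the check cheap, which fits the cold-start concern: the heavy import happens only when a caller actually asks for dense.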

What I would build next

The first two layers I planned here have shipped: load_corpus (bring your own data) and evaluate_answers (the LLM-as-judge answer-quality eval), both documented above. These are the next ones.

A reranker (cross-encoder) as a second stage after retrieval, with a tool to measure the precision lift.
Latency and cost fields on every result, so the benchmark reflects production trade-offs, not just quality.
A per-question failure tool that returns exactly which questions a method missed, which is where the real debugging happens.
Judge calibration: a small set of human-scored answers to measure how often the LLM judge agrees with a person, so the judge itself is evaluated rather than just trusted.

Development

uv run --extra dev pytest                  # metrics, engine, judge (stubbed + cloudflare path), stdio round-trips
uv run --extra dev ruff check .            # lint
uv run --extra dev ruff format --check .

The test suite includes an end-to-end test (tests/test_server.py) that launches the server as a subprocess, completes the MCP handshake, lists the tools, and calls each one over real JSON-RPC, the same way a client would, including load_corpus round-trips and the credential-less evaluate_answers error paths for both providers. The judge tests run with no credentials: a stub covers the generate-and-judge logic and a mocked-HTTP test covers the Cloudflare path. Two live tests are opt-in and run only when their credentials are present, one for Cloudflare (CF_AIG_TOKEN + CF_AIG_BASE_URL) and one for Anthropic (ANTHROPIC_API_KEY, plus uv sync --extra judge).

Project layout

src/rageval_mcp/
  server.py        FastMCP server: the five tools, input validation, structured output
  index.py         cached engine: serves retrieve / evaluate / compare; swappable active corpus
  retrievers.py    bm25, tfidf, dense, hybrid (shared interface)
  metrics.py       recall@k, precision@k, MRR, nDCG@k (unit-tested)
  corpus.py        load and chunk the markdown corpus, or in-memory documents
  evaluate.py      run a retriever over the question set and aggregate metrics
  judge.py         LLM-as-judge: answer from retrieved context, then score it (Cloudflare default; Anthropic optional)
  data/corpus/     the knowledge base (10 markdown docs)
  data/eval/       questions.jsonl (20 labeled questions)
scripts/smoke_client.py   a real MCP client that exercises every tool
tests/             metrics, engine, judge, load_corpus, and end-to-end server tests

Relationship to rag-eval-harness

Same retrieval and metrics core, two surfaces. rag-eval-harness is a CLI for a human running a one-off benchmark. rageval-mcp is the agent-facing surface of the same idea: evaluation a model can call as a tool. Build the system, then measure it, from inside the conversation.

MIT licensed.
