rageval-mcp

An MCP server that exposes RAG retrieval evaluation as agent tools. It loads a labeled knowledge base, then lets an agent retrieve passages and measure how good that retrieval is (recall@k, precision@k, MRR, nDCG@k) across four retrieval strategies: BM25, TF-IDF, dense embeddings, and a hybrid that fuses them.

It is the agent-facing companion to rag-eval-harness: the harness is a CLI you run to benchmark retrieval; this is the same retrieval and metrics core wrapped in the Model Context Protocol, so a model can call it mid-conversation to decide which retrieval strategy actually finds the right context.

git clone https://github.com/phillipkaraya/rageval-mcp && cd rageval-mcp
uv sync                              # provisions Python 3.12 + deps, no system Python needed
uv run python scripts/smoke_client.py   # starts the server and calls every tool

No API keys. No corpus to supply. The sample knowledge base and labeled questions ship inside the package, so the server runs with zero setup.

Why this exists

Most RAG systems ship on vibes. Someone asks the assistant a few questions, the answers look fine, and it goes to production, where it quietly fails because retrieval surfaced the wrong document. The model was rarely the problem. The retrieval was.

An agent that can call an eval can reason about this directly. Instead of guessing whether BM25 or embeddings will serve a given query mix, it can run compare_methods and read the numbers. rageval-mcp puts that loop one tool call away:

  • retrieve shows what context a strategy would surface for a question.

  • evaluate_retrieval puts a number on one strategy's quality over a labeled set.

  • compare_methods benchmarks every strategy side by side and names the winner.

The shipped dataset is a deliberately realistic stand-in for a real deployment: a fictional B2B SaaS knowledge base (data/corpus/, 10 documents covering billing, SSO, data residency, SLAs, API limits, and more) plus 20 support-style questions with labeled relevant documents (data/eval/questions.jsonl).

The tools

All three tools are read-only, idempotent, and fully local (no external calls). They return structured output, so MCP clients that support output schemas get typed results, and others get the same data as JSON text.

retrieve(query, k=5, method="hybrid")

Return the top-k passages for a query. This is also the fastest way to see why retrieval choice matters. Ask the same question two ways.

With "method": "bm25", the lexical retriever surfaces the wrong article. The word "data" dominates the match, so the top hit is the data-export-and-deletion doc, not data-residency (output abridged to rank 1):

{
  "query": "Can I keep my data in Europe?",
  "method": "bm25",
  "k": 3,
  "count": 3,
  "passages": [
    { "rank": 1, "doc_id": "data-export-and-deletion", "chunk_id": "data-export-and-deletion::2", "score": 2.352516,
      "text": "To remove your account data immediately, an administrator can submit a deletion request, and we permanently erase all data within 30 days in line with GDPR." }
  ]
}

Switch to "method": "dense" and it finds the right document, because embeddings match on meaning rather than vocabulary (output abridged to ranks 1 and 3):

{
  "query": "Can I keep my data in Europe?",
  "method": "dense",
  "k": 3,
  "count": 3,
  "passages": [
    { "rank": 1, "doc_id": "data-residency", "chunk_id": "data-residency::1", "score": 0.462297,
      "text": "You select your data region when you create your workspace, and it cannot be changed afterward without contacting support to arrange a migration..." },
    { "rank": 3, "doc_id": "data-residency", "chunk_id": "data-residency::0", "score": 0.44819,
      "text": "Meridian lets you choose where your data is stored. We operate regions in the United States, the European Union (Frankfurt), and Australia (Sydney)." }
  ]
}

That single comparison is the whole point of the server: the retrieval strategy decides whether the model ever sees the right context, and evaluate_retrieval and compare_methods turn that into numbers.

evaluate_retrieval(method="hybrid", k=3)

Score one method against every labeled question and average four ranking metrics.

Input:

{ "method": "bm25", "k": 3 }

Output:

{
  "method": "bm25",
  "k": 3,
  "n_questions": 20,
  "recall_at_k": 0.925,
  "precision_at_k": 0.3167,
  "mrr_at_k": 0.75,
  "ndcg_at_k": 0.789
}
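
To make the four numbers concrete, here is a minimal sketch of how such metrics are commonly computed for a single question with binary relevance. The function names, the sample ranking, and the "billing" doc id are illustrative assumptions, not the contents of metrics.py. evaluate_retrieval would then average each metric over all labeled questions.

```python
import math


def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant docs that appear in the top k."""
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / len(relevant) if relevant else 0.0


def precision_at_k(ranked, relevant, k):
    """Fraction of the k result slots filled by a relevant doc."""
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / k


def mrr_at_k(ranked, relevant, k):
    """Reciprocal rank of the first relevant doc in the top k, else 0."""
    for rank, doc in enumerate(ranked[:k], start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked, relevant, k):
    """Binary-gain DCG divided by the DCG of an ideal ordering."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(ranked[:k], start=1)
        if doc in relevant
    )
    ideal = sum(
        1.0 / math.log2(rank + 1)
        for rank in range(1, min(len(relevant), k) + 1)
    )
    return dcg / ideal if ideal else 0.0


# Hypothetical ranking: the one relevant doc lands at rank 2.
ranked = ["data-export-and-deletion", "data-residency", "billing"]
relevant = {"data-residency"}

for name, fn in [("recall", recall_at_k), ("precision", precision_at_k),
                 ("mrr", mrr_at_k), ("ndcg", ndcg_at_k)]:
    print(f"{name}@3 = {fn(ranked, relevant, 3):.4f}")
```

With one relevant document at rank 2, recall@3 is 1.0 while MRR is 0.5 and nDCG is about 0.63, which is why the rank-sensitive metrics separate methods that recall alone would score the same.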

compare_methods(k=3)

Benchmark every available method and return one row each, plus the winner by nDCG@k. Methods whose optional dependencies are missing are reported under skipped (with a reason) instead of failing the call.

Output (default install, dense extra not present):

{
  "k": 3,
  "n_questions": 20,
  "rows": [
    { "method": "bm25",   "recall_at_k": 0.925, "precision_at_k": 0.3167, "mrr_at_k": 0.75,  "ndcg_at_k": 0.789 },
    { "method": "tfidf",  "recall_at_k": 0.85,  "precision_at_k": 0.3,    "mrr_at_k": 0.725, "ndcg_at_k": 0.7537 },
    { "method": "hybrid", "recall_at_k": 0.85,  "precision_at_k": 0.3,    "mrr_at_k": 0.725, "ndcg_at_k": 0.7609 }
  ],
  "best_method": "bm25",
  "skipped": [
    { "method": "dense", "reason": "The 'dense' retriever needs the optional 'sentence-transformers' dependency..." }
  ]
}

A useful result already: with only the lexical methods available, plain BM25 (nDCG@3 of 0.789) edges out the hybrid (0.761), because fusing in the weaker TF-IDF ranker pulls the fused ranking down. "Hybrid" is not automatically the right answer, which is exactly the kind of thing you want measured rather than assumed.
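
The best_method field can be reproduced on the client side from the rows in the output. A minimal sketch, assuming the winner is simply the row with the highest ndcg_at_k; the helper name pick_best and the tie behavior are assumptions, while the values are copied from the output above.

```python
# nDCG@k values copied from the compare_methods output (k=3, default install).
rows = [
    {"method": "bm25", "ndcg_at_k": 0.789},
    {"method": "tfidf", "ndcg_at_k": 0.7537},
    {"method": "hybrid", "ndcg_at_k": 0.7609},
]


def pick_best(rows):
    """Return the method with the highest nDCG@k (first row wins on ties)."""
    return max(rows, key=lambda row: row["ndcg_at_k"])["method"]


best = pick_best(rows)
print(best)
```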

With the dense extra installed (uv sync --extra dense), all four methods run and dense wins on this semantically phrased eval:

| method | recall@k | precision@k | mrr@k | ndcg@k |
| --- | --- | --- | --- | --- |
| bm25 | 0.925 | 0.317 | 0.750 | 0.789 |
| tfidf | 0.850 | 0.300 | 0.725 | 0.754 |
| dense | 1.000 | 0.350 | 0.942 | 0.957 |
| hybrid | 0.950 | 0.333 | 0.792 | 0.829 |

Reading the result. Many of these questions share almost no vocabulary with their source document. "What happens to my information if I cancel?" has barely a word in common with the data export and deletion article, and BM25 misses it. Dense embeddings close that lexical gap and win clearly here. Hybrid (reciprocal-rank fusion) is the most robust generalist and beats both lexical methods, but it does not top pure dense on this query mix, because fusing in two weaker lexical rankers drags its average down. That is the honest, useful finding: hybrid is the safe default when you do not know your query mix, but for semantic queries dense alone can win.
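
Reciprocal-rank fusion is small enough to sketch. Each document's fused score is the sum of 1 / (c + rank) over the rankings that contain it. The constant c = 60 is the value commonly used for RRF; whether rageval-mcp uses the same constant is an assumption, as are the function name, the "sso" doc id, and the toy rankings.

```python
from collections import defaultdict


def rrf_fuse(rankings, c=60):
    """Fuse ranked lists of doc ids with reciprocal-rank fusion."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (c + rank)
    # Highest fused score first; break ties by doc id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical rankings: two lexical rankers agree on the wrong article,
# while the dense ranker puts the right one first.
bm25 = ["data-export-and-deletion", "data-residency", "sso"]
tfidf = ["data-export-and-deletion", "sso", "data-residency"]
dense = ["data-residency", "data-export-and-deletion", "sso"]

fused = rrf_fuse([bm25, tfidf, dense])
print(fused)
```

In this toy case the two lexical rankers outvote the dense one, so the fused list still leads with the wrong article. That is the mechanism behind hybrid trailing pure dense on a semantic query mix.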

Use it from an MCP client

The server speaks stdio. Point any MCP client at the rageval-mcp command (provided by uv run from the cloned repo).

Claude Desktop (claude_desktop_config.json):

{
  "mcpServers": {
    "rageval": {
      "command": "uv",
      "args": ["--directory", "/absolute/path/to/rageval-mcp", "run", "rageval-mcp"]
    }
  }
}

Claude Code (one command, or commit the same configuration as a project-level .mcp.json):

claude mcp add rageval -- uv --directory /absolute/path/to/rageval-mcp run rageval-mcp

Then ask the model things like "retrieve the top 3 passages for 'how do I enforce MFA', then compare retrieval methods at k=3 and tell me which to use." It will call the tools and read the numbers back.

To sanity-check the server without a full client, use the MCP Inspector:

uv run mcp dev src/rageval_mcp/server.py

How it works

rageval_mcp/data/corpus/*.md        rageval_mcp/data/eval/questions.jsonl
        |                                      |
        v                                      v
   chunk by paragraph                question + relevant_doc_ids
        |                                      |
        v                                      |
  ┌──────────────────────────┐                 |
  │ retrievers               │                 |
  │  bm25   tfidf   dense*   │                 |
  │         hybrid (RRF)     │                 |
  └──────────────────────────┘                 |
        |  top-k chunks                         |
        v                                       v
  collapse to ranked docs ───────────────▶ metrics ───▶ retrieve / evaluate_retrieval / compare_methods

  • Chunking (corpus.py) splits each document into paragraph-level passages. Retrieval results are then collapsed back to document level for scoring.

  • Retrievers (retrievers.py) share one interface, so a new strategy is a single subclass. The hybrid uses reciprocal-rank fusion (RRF), which combines rankings without needing the underlying scores to be on the same scale.

  • Metrics (metrics.py) are plain, unit-tested functions, so the numbers are auditable.

  • The index (index.py) loads the corpus once, builds the lexical retrievers eagerly, and builds dense lazily on first use. The whole thing is cached, so repeated tool calls stay fast.
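
The chunk-then-collapse flow described above can be sketched in a few lines. This assumes paragraphs are separated by blank lines and that a document takes the rank of its best-ranked chunk. The chunk_id format mirrors the doc_id::n ids in the tool output, but the function names and the "sla" doc id are hypothetical, and corpus.py may differ.

```python
def chunk_by_paragraph(doc_id, text):
    """Split a document into paragraph passages with doc_id::n chunk ids."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    return [
        {"doc_id": doc_id, "chunk_id": f"{doc_id}::{i}", "text": p}
        for i, p in enumerate(paragraphs)
    ]


def collapse_to_docs(ranked_chunks):
    """Keep each doc once, at the position of its best-ranked chunk."""
    seen = set()
    docs = []
    for chunk in ranked_chunks:
        if chunk["doc_id"] not in seen:
            seen.add(chunk["doc_id"])
            docs.append(chunk["doc_id"])
    return docs


chunks = chunk_by_paragraph("data-residency", "First paragraph.\n\nSecond paragraph.")
# Hypothetical retrieval order: chunk 1, a chunk from another doc, then chunk 0.
ranked = [chunks[1], {"doc_id": "sla", "chunk_id": "sla::0", "text": "..."}, chunks[0]]
docs = collapse_to_docs(ranked)
print([c["chunk_id"] for c in chunks], docs)
```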

Design decisions and trade-offs

  • The dataset ships inside the package. The corpus and questions live in rageval_mcp/data, resolved relative to the module, so the server runs with zero configuration. Trade-off: it evaluates a fixed sample corpus out of the box. Pointing it at your own data is the obvious next feature (see below).

  • Dense embeddings are an optional extra, not a hard dependency. bm25, tfidf, and hybrid run in milliseconds with no model download. dense needs sentence-transformers (which pulls in PyTorch), so it lives behind uv sync --extra dense. For an agent tool, cold-start latency matters: most tool calls should be instant, and only a caller who explicitly wants the embedding baseline pays the model-load cost. When the extra is absent, the tools degrade gracefully (a clear, actionable error on retrieve/evaluate_retrieval, a skipped entry on compare_methods) instead of crashing the server.

  • Structured output, not just text. Each tool returns a typed Pydantic model, so the model gets a real schema and a parseable result rather than prose it has to scrape.

  • Read-only and closed-world. Every tool is annotated with readOnlyHint: true, idempotentHint: true, and openWorldHint: false. There is nothing to mutate and nothing external to reach, which makes the server safe to expose to an autonomous agent.

  • RRF over weighted score fusion for the hybrid. RRF needs no per-retriever score normalization and no tuning, which keeps the baseline honest rather than hand-optimized.

  • Document-level relevance, not passage-level. Simpler to label by hand and matches how a support agent thinks ("which article answers this?"). Trade-off: it cannot measure whether the single best paragraph ranked first.
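
The graceful degradation around the optional extra could look roughly like the sketch below. It uses only the standard library. The helper names and the reason text are assumptions; the only thing taken as given is that the sentence-transformers package is imported as sentence_transformers.

```python
import importlib.util


def dense_available(module_name="sentence_transformers"):
    """True if the optional embedding dependency can be imported."""
    return importlib.util.find_spec(module_name) is not None


def plan_methods(module_name="sentence_transformers"):
    """Return (methods to run, skipped entries) without importing the model."""
    methods = ["bm25", "tfidf", "hybrid"]
    skipped = []
    if dense_available(module_name):
        methods.append("dense")
    else:
        skipped.append({
            "method": "dense",
            "reason": f"optional dependency '{module_name}' is not installed",
        })
    return methods, skipped


methods, skipped = plan_methods()
print(methods, skipped)
```

Checking with find_spec avoids importing PyTorch just to learn whether the extra is present, which fits the cold-start concern in the bullet on optional dependencies.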

What I would build next

  1. A load_corpus tool so an agent can point the server at its own documents and questions at runtime, turning this from a demo into a reusable eval service.

  2. An answer-quality layer: add a generator and an LLM-as-judge tool to score faithfulness and correctness, taking this from a retrieval eval to an end-to-end RAG eval.

  3. A reranker (cross-encoder) as an additional stage after retrieval, with a tool to measure the precision lift.

  4. Latency and cost fields on every result, so the benchmark reflects production trade-offs, not just quality.

  5. A per-question failure tool that returns exactly which questions a method missed, which is where the real debugging happens.

Development

uv run --extra dev pytest        # unit tests for the metrics, the engine, and a full stdio round-trip
uv run --extra dev ruff check .  # lint
uv run --extra dev ruff format --check .

The test suite includes an end-to-end test (tests/test_server.py) that launches the server as a subprocess, completes the MCP handshake, lists the tools, and calls each one over real JSON-RPC, the same way a client would.
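
For orientation, a tool call on the wire is a JSON-RPC 2.0 request with the method tools/call, sent after the initialize handshake. The sketch below only builds and prints such a message with the standard library. It does not launch the server, and the helper name is hypothetical.

```python
import json


def tool_call_request(request_id, tool, arguments):
    """Build an MCP tools/call request as a JSON-RPC 2.0 message."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }


msg = tool_call_request(
    1,
    "retrieve",
    {"query": "Can I keep my data in Europe?", "k": 3, "method": "dense"},
)
# The stdio transport sends one JSON message per line.
line = json.dumps(msg)
print(line)
```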

Project layout

src/rageval_mcp/
  server.py        FastMCP server: the three tools, input validation, structured output
  index.py         cached engine: loads the corpus once, serves retrieve / evaluate / compare
  retrievers.py    bm25, tfidf, dense, hybrid (shared interface)
  metrics.py       recall@k, precision@k, MRR, nDCG@k (unit-tested)
  corpus.py        load and chunk the markdown corpus
  evaluate.py      run a retriever over the question set and aggregate metrics
  data/corpus/     the knowledge base (10 markdown docs)
  data/eval/       questions.jsonl (20 labeled questions)
scripts/smoke_client.py   a real MCP client that exercises every tool
tests/             metrics, engine, and end-to-end server tests

Relationship to rag-eval-harness

Same retrieval and metrics core, two surfaces. rag-eval-harness is a CLI for a human running a one-off benchmark. rageval-mcp is the agent-facing surface of the same idea: evaluation a model can call as a tool. Build the system, then measure it, from inside the conversation.

MIT licensed.
