RaulSaavedraDeLaRiera

RAG Knowledge Base MCP Server

RAG Knowledge Base — Hybrid Search + Evaluation

A production-style Retrieval-Augmented Generation pipeline over a document knowledge base. It combines dense and sparse retrieval, cross-encoder reranking, source-grounded answers with citations, and a full evaluation harness that measures both retrieval and answer quality. The pipeline is exposed both as a REST API and as an MCP server for agentic access.

Built to run fully local at zero cost (PostgreSQL + pgvector, on-device embeddings), with a pluggable embedding backend so the same code runs against an API provider by changing one config value.


Why this is more than a basic RAG

| Concern | Approach |
| --- | --- |
| Retrieval | Hybrid search: pgvector cosine (dense) + Postgres full-text (sparse), fused with Reciprocal Rank Fusion |
| Ranking | Cross-encoder reranker scores each (query, chunk) pair directly |
| Grounding | Answers cite sources with [n] markers and refuse when the context is insufficient |
| Evaluation | Retrieval metrics (precision@k, recall@k, MRR) + LLM-as-judge faithfulness and answer relevance + refusal accuracy |
| A/B evaluation | Same harness runs each retrieval mode (vector / hybrid / hybrid+rerank) and reports the lift with numbers |
| Streaming | Answers stream token by token over Server-Sent Events |
| UI | Minimal web frontend with live streaming and clickable citations |
| Portability | Pluggable embedding backend (local sentence-transformers or Voyage API) |
| Agentic access | MCP server exposing search_knowledge_base and ask_knowledge_base tools |



Architecture

graph LR
    subgraph Ingestion
        DOCS[Documents\nmd / txt / pdf]
        CHUNK[Chunker\nparagraph-aware + overlap]
        EMB[Embedding backend\nlocal or api]
    end

    subgraph Store ["Vector Store — PostgreSQL + pgvector"]
        VEC[(chunks\nvector + tsvector)]
    end

    subgraph Retrieval
        DENSE[Vector search\ncosine / hnsw]
        SPARSE[Keyword search\nfull-text / gin]
        RRF[Reciprocal Rank Fusion]
        RER[Cross-encoder rerank]
    end

    subgraph Generation
        GEN[Claude\ngrounded + cited answer]
    end

    DOCS --> CHUNK --> EMB --> VEC
    VEC --> DENSE --> RRF
    VEC --> SPARSE --> RRF
    RRF --> RER --> GEN
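The fusion step in the diagram can be illustrated with a minimal sketch of Reciprocal Rank Fusion. This is not the project's code: the function name, the tie-breaking rule, and the constant `k=60` (a commonly used default) are assumptions made for illustration. Each chunk receives `1 / (k + rank)` from every ranked list it appears in, and the fused order sorts by the summed score.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk ids into one list, best first.

    `rankings` is a list of lists; each inner list is ordered best-first.
    `k` dampens the influence of top ranks (60 is a common default, assumed here).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk id so the output is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical chunk ids returned by the dense and sparse retrievers.
dense = [7, 3, 12, 5]
sparse = [3, 9, 7]
fused = reciprocal_rank_fusion([dense, sparse])
print(fused)
```

Because only ranks are used, the dense cosine scores and the full-text scores never need to be put on a common scale, which is the usual reason for choosing rank fusion over score averaging.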

Stack

| Layer | Tool |
| --- | --- |
| Vector store | PostgreSQL + pgvector (HNSW index) |
| Keyword search | Postgres full-text search (GIN index) |
| Embeddings | sentence-transformers (local) / Voyage AI (optional) |
| Reranking | cross-encoder (sentence-transformers) |
| Generation | Claude (Anthropic) |
| Serving | FastAPI (REST + SSE streaming) + web UI + MCP server |


Quickstart

# 1. start the vector store
make db

# 2. install dependencies and set your key
make install
cp .env.example .env      # add ANTHROPIC_API_KEY

# 3. ingest the sample knowledge base (fictional "Nimbus" product docs)
make ingest RESET=1

# 4. start the API and open the web UI
make api
# then open http://localhost:8000 in a browser, or query the API directly:
curl -X POST localhost:8000/ask \
  -H "content-type: application/json" \
  -d '{"question": "How much does the Standard tier cost?"}'

# 5. run the evaluation harness and the retrieval a/b comparison
make eval
make compare

Example response

{
  "answer": "The Standard tier costs 99 US dollars per month. [1]",
  "citations": [
    {"marker": 1, "source": "nimbus_pricing.md", "title": "nimbus_pricing", "score": 8.42}
  ],
  "retrieved": [
    {"chunk_id": 7, "source": "nimbus_pricing.md", "score": 8.42, "preview": "..."}
  ]
}
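A client will typically want to resolve the [n] markers in the answer against the citations array. The sketch below does that for the example response above; the helper name `cited_sources` is hypothetical, and it assumes every response has the `answer` and `citations` fields shown in the example.

```python
import json
import re


def cited_sources(response):
    """Map each [n] marker found in the answer text to its citation's source file."""
    by_marker = {c["marker"]: c["source"] for c in response["citations"]}
    markers = [int(m) for m in re.findall(r"\[(\d+)\]", response["answer"])]
    unknown = [m for m in markers if m not in by_marker]
    if unknown:
        # A marker without a matching citation suggests an ungrounded claim.
        raise ValueError(f"answer cites markers with no citation entry: {unknown}")
    return {m: by_marker[m] for m in markers}


# The example response body from this README, parsed as JSON.
example = json.loads("""
{
  "answer": "The Standard tier costs 99 US dollars per month. [1]",
  "citations": [
    {"marker": 1, "source": "nimbus_pricing.md", "title": "nimbus_pricing", "score": 8.42}
  ],
  "retrieved": [
    {"chunk_id": 7, "source": "nimbus_pricing.md", "score": 8.42, "preview": "..."}
  ]
}
""")

print(cited_sources(example))
```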

Evaluation

The harness runs a gold question set (eval/dataset.py) and reports:

  • Retrieval — precision@k, recall@k, mean reciprocal rank against known relevant sources

  • Generation — faithfulness (are all claims grounded in the retrieved context) and answer relevance (does it match the reference), both judged by an LLM on a 0-1 scale

  • Refusal accuracy — whether the system correctly declines to answer a question the knowledge base does not cover

python -m eval.run_eval

Results are printed as a summary table and written to eval/results/latest.json.
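For reference, the three retrieval metrics can be sketched as follows. These are the standard definitions, not a copy of the harness; the function names, the source-level matching, and the file names other than nimbus_pricing.md are assumptions for illustration. MRR is the mean of the reciprocal rank over all questions in the gold set.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved sources that are relevant."""
    return sum(1 for source in retrieved[:k] if source in relevant) / k


def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant sources found in the top-k (relevant must be non-empty)."""
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved, relevant, k):
    """1 / rank of the first relevant source within the top-k, or 0.0 if none is found."""
    for rank, source in enumerate(retrieved[:k], start=1):
        if source in relevant:
            return 1.0 / rank
    return 0.0


# One hypothetical gold question: the correct source is ranked second.
retrieved = ["nimbus_faq.md", "nimbus_pricing.md", "nimbus_setup.md"]
relevant = {"nimbus_pricing.md"}

print(precision_at_k(retrieved, relevant, 3))
print(recall_at_k(retrieved, relevant, 3))
print(reciprocal_rank(retrieved, relevant, 3))
```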

A/B comparison of retrieval modes

eval/compare.py runs the same gold set through each retrieval mode and reports the lift, so design decisions are backed by numbers rather than asserted. It uses only deterministic retrieval metrics, so it makes no LLM calls and costs nothing.

python -m eval.compare

On the sample corpus, reranking lifts recall@1 from 0.923 to 1.0 (92% to 100%):

mode                           k=1             k=3             k=5
------------------------------------------------------------------
vector only         0.923 /  0.846    1.0 /  0.885    1.0 /  0.885
hybrid (rrf)        0.923 /  0.846    1.0 /  0.885    1.0 /  0.885
hybrid + rerank       1.0 /  0.923    1.0 /  0.923    1.0 /  0.923
                    (recall@k / mrr@k)

The cross-encoder reranker fixes the case where a semantically close distractor outranked the correct passage for the top position.


Web UI

Start the API with make api and open http://localhost:8000. The frontend streams the answer token by token and renders the cited sources with their rerank scores, so you can see exactly which passages grounded the response.
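To consume the stream outside the browser, a client needs to split the Server-Sent Events wire format into events. The sketch below parses `data:` fields from an iterable of lines following the SSE framing rules; the function name is hypothetical, and the assumption that each event carries one plain-text fragment of the answer is not confirmed by this README.

```python
def iter_sse_data(lines):
    """Yield the data payload of each Server-Sent Event from an iterable of text lines."""
    buffer = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if buffer:
                yield "\n".join(buffer)
                buffer = []
        elif line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # Only a single leading space after the colon is stripped.
            buffer.append(value[1:] if value.startswith(" ") else value)
        # Other fields (event:, id:, retry:) are ignored in this sketch.
    # An event not terminated by a blank line is discarded.


# Hypothetical stream: three fragments and one keep-alive comment.
stream = ["data: The", "", ": ping", "data:  Standard", "", "data:  tier", ""]
print("".join(iter_sse_data(stream)))
```

In practice the lines would come from a streaming HTTP response rather than a list.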


Adding your own documents

Drop .md, .txt or .pdf files into data/documents/ and re-run make ingest RESET=1. The schema adapts to the embedding dimension of the configured backend automatically.
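During ingestion each document passes through the paragraph-aware chunker with overlap shown in the architecture diagram. The sketch below illustrates the general idea only: the function name, the character budget, and the one-paragraph overlap are assumptions, and the project's chunker may differ.

```python
def chunk_paragraphs(text, max_chars=800, overlap=1):
    """Pack whole paragraphs into chunks, repeating the last `overlap` paragraphs.

    Paragraphs are never split, so a chunk can exceed `max_chars` when a single
    paragraph (plus the carried-over overlap) is longer than the budget.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], []
    for para in paragraphs:
        candidate = current + [para]
        if current and len("\n\n".join(candidate)) > max_chars:
            chunks.append("\n\n".join(current))
            # Carry the tail of the finished chunk into the next one.
            current = current[-overlap:] if overlap > 0 else []
            current.append(para)
        else:
            current = candidate
    if current:
        chunks.append("\n\n".join(current))
    return chunks


# Three ten-character paragraphs with a budget that fits two of them.
sample = "A" * 10 + "\n\n" + "B" * 10 + "\n\n" + "C" * 10
for chunk in chunk_paragraphs(sample, max_chars=25, overlap=1):
    print(repr(chunk))
```

The overlap keeps a sentence that refers back to the previous paragraph retrievable together with its context, at the cost of storing some text twice.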


Using it as an MCP server

The pipeline is exposed as an MCP server so an LLM agent can retrieve grounded facts on demand:

python -m mcp_server.server

Tools: search_knowledge_base(query, top_k) for raw passages and ask_knowledge_base(question) for a grounded, cited answer.
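Many MCP clients register servers through a JSON file with an `mcpServers` section. The sketch below generates such an entry for the launch command above. It rests on several assumptions not stated in this README: that the server communicates over stdio, that the client uses the `mcpServers` layout, and that the command is run from the repository root. The server name `rag-knowledge-base` is hypothetical.

```python
import json


def mcp_client_config(server_name="rag-knowledge-base", python="python"):
    """Build a client config entry that launches the server with `python -m mcp_server.server`."""
    return {
        "mcpServers": {
            server_name: {
                "command": python,
                "args": ["-m", "mcp_server.server"],
            }
        }
    }


# Print the entry so it can be pasted into the client's configuration file.
print(json.dumps(mcp_client_config(), indent=2))
```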


The retrieval, ranking, generation and evaluation core was designed by hand. AI agents assisted with documentation, the web frontend and peripheral scaffolding.
