RAG Knowledge Base MCP Server
RAG Knowledge Base — Hybrid Search + Evaluation
A production-style Retrieval-Augmented Generation pipeline over a document knowledge base. It combines dense and sparse retrieval, cross-encoder reranking, source-grounded answers with citations, and a full evaluation harness that measures both retrieval and answer quality. The pipeline is exposed over a REST API and as an MCP server for agentic access.
It is built to run fully locally at zero cost (PostgreSQL + pgvector, on-device embeddings), with a pluggable embedding backend so the same code runs against an API provider by changing one config value.
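A pluggable embedding backend of this kind can be sketched as a small interface plus a registry keyed by a config value. This is a minimal illustration, not the project's actual code: `EmbeddingBackend`, `HashingBackend`, `BACKENDS` and `get_backend` are hypothetical names, and the hashing backend is a dependency-free stand-in for a real local model.

```python
import hashlib
import math
from typing import Protocol, Sequence


class EmbeddingBackend(Protocol):
    """Hypothetical interface every backend would satisfy."""

    dimension: int

    def embed(self, texts: Sequence[str]) -> list[list[float]]: ...


class HashingBackend:
    """Toy stand-in for a local model: deterministic, unit-length vectors."""

    def __init__(self, dimension: int = 8) -> None:
        self.dimension = dimension

    def embed(self, texts: Sequence[str]) -> list[list[float]]:
        vectors = []
        for text in texts:
            vec = [0.0] * self.dimension
            # Bucket each token by the first byte of its hash.
            for token in text.lower().split():
                digest = hashlib.sha256(token.encode("utf-8")).digest()
                vec[digest[0] % self.dimension] += 1.0
            norm = math.sqrt(sum(v * v for v in vec)) or 1.0
            vectors.append([v / norm for v in vec])
        return vectors


# Registry keyed by the config value; an API-backed class would be added here.
BACKENDS = {"local": HashingBackend}


def get_backend(name: str) -> EmbeddingBackend:
    """Resolve the configured backend name to an instance."""
    if name not in BACKENDS:
        raise ValueError(f"unknown embedding backend: {name!r}")
    return BACKENDS[name]()
```

The rest of the pipeline would depend only on `embed` and `dimension`, so switching providers touches the registry entry and nothing else.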
Why this is more than a basic RAG
Concern | Approach |
Retrieval | Hybrid search: pgvector cosine (dense) + Postgres full-text (sparse), fused with Reciprocal Rank Fusion |
Ranking | Cross-encoder reranker scores each (query, chunk) pair directly |
Grounding | Answers cite sources with numbered markers such as [1] |
Evaluation | Retrieval metrics (precision@k, recall@k, MRR) + LLM-as-judge faithfulness and answer relevance + refusal accuracy |
A/B evaluation | Same harness runs each retrieval mode (vector / hybrid / hybrid+rerank) and reports the lift with numbers |
Streaming | Answers stream token by token over Server-Sent Events |
UI | Minimal web frontend with live streaming and clickable citations |
Portability | Pluggable embedding backend (local sentence-transformers or Voyage API) |
Agentic access | MCP server exposing search_knowledge_base and ask_knowledge_base |
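Reciprocal Rank Fusion, named in the Retrieval row above, merges ranked lists using only rank positions, so dense and sparse scores never need to share a scale. A minimal sketch follows; the function name and the constant `k=60` (a commonly used default) are assumptions, not values taken from this project.

```python
from collections import defaultdict
from typing import Hashable, Sequence


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[Hashable]], k: int = 60
) -> list[tuple[Hashable, float]]:
    """Fuse several ranked lists of ids into one list sorted by RRF score."""
    scores: dict[Hashable, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the top item of each list contributes 1 / (k + 1).
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


dense = [3, 1, 2]   # chunk ids ranked by vector similarity
sparse = [1, 4, 3]  # chunk ids ranked by full-text match
fused = reciprocal_rank_fusion([dense, sparse])
```

Chunks that appear in both lists accumulate two terms, so they rise above chunks found by only one retriever.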
Architecture
graph LR
subgraph Ingestion
DOCS[Documents\nmd / txt / pdf]
CHUNK[Chunker\nparagraph-aware + overlap]
EMB[Embedding backend\nlocal or api]
end
subgraph Store ["Vector Store — PostgreSQL + pgvector"]
VEC[(chunks\nvector + tsvector)]
end
subgraph Retrieval
DENSE[Vector search\ncosine / hnsw]
SPARSE[Keyword search\nfull-text / gin]
RRF[Reciprocal Rank Fusion]
RER[Cross-encoder rerank]
end
subgraph Generation
GEN[Claude\ngrounded + cited answer]
end
DOCS --> CHUNK --> EMB --> VEC
VEC --> DENSE --> RRF
VEC --> SPARSE --> RRF
RRF --> RER --> GEN
Stack
Layer | Tool |
Vector store | PostgreSQL + pgvector (HNSW index) |
Keyword search | Postgres full-text search (GIN index) |
Embeddings | sentence-transformers (local) / Voyage AI (optional) |
Reranking | cross-encoder (sentence-transformers) |
Generation | Claude (Anthropic) |
Serving | FastAPI (REST + SSE streaming) + web UI + MCP server |
Quickstart
# 1. start the vector store
make db
# 2. install dependencies and set your key
make install
cp .env.example .env # add ANTHROPIC_API_KEY
# 3. ingest the sample knowledge base (fictional "Nimbus" product docs)
make ingest RESET=1
# 4. start the API and open the web UI
make api
# then open http://localhost:8000 in a browser, or query the API directly:
curl -X POST localhost:8000/ask \
-H "content-type: application/json" \
-d '{"question": "How much does the Standard tier cost?"}'
# 5. run the evaluation harness and the retrieval a/b comparison
make eval
make compare
Example response
{
"answer": "The Standard tier costs 99 US dollars per month. [1]",
"citations": [
{"marker": 1, "source": "nimbus_pricing.md", "title": "nimbus_pricing", "score": 8.42}
],
"retrieved": [
{"chunk_id": 7, "source": "nimbus_pricing.md", "score": 8.42, "preview": "..."}
]
}
Evaluation
The harness runs a gold question set (eval/dataset.py) and reports:
Retrieval — precision@k, recall@k, mean reciprocal rank against known relevant sources
Generation — faithfulness (are all claims grounded in the retrieved context) and answer relevance (does it match the reference), both judged by an LLM on a 0-1 scale
Refusal accuracy — whether the system correctly declines to answer a question the knowledge base does not cover
python -m eval.run_eval
Results are printed as a summary table and written to eval/results/latest.json.
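The retrieval metrics listed above have standard definitions, sketched here for reference. The function names and the convention of matching on source identifiers are assumptions; the harness in eval/ may compute them differently.

```python
from typing import Sequence, Set


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k slots filled by relevant sources."""
    if k <= 0:
        return 0.0
    return sum(1 for source in retrieved[:k] if source in relevant) / k


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant sources that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """1 / rank of the first relevant source in the top k, or 0 if none."""
    for rank, source in enumerate(retrieved[:k], start=1):
        if source in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(
    runs: Sequence[tuple[Sequence[str], Set[str]]], k: int
) -> float:
    """Average reciprocal rank over (retrieved, relevant) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel, k) for r, rel in runs) / len(runs)
```

For example, if the only relevant source is retrieved in second place, recall@3 is 1.0, precision@3 is 1/3, and the reciprocal rank is 0.5.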
A/B comparison of retrieval modes
eval/compare.py runs the same gold set through each retrieval mode and reports
the lift, so design decisions are backed by numbers rather than asserted. It uses
only deterministic retrieval metrics, so it makes no LLM calls and costs nothing.
python -m eval.compare
On the sample corpus, reranking lifts recall@1 from 0.923 to 1.0 (top-1 retrieval accuracy from 92% to 100%):
mode               k=1              k=3              k=5
------------------------------------------------------------------
vector only        0.923 / 0.846    1.0 / 0.885      1.0 / 0.885
hybrid (rrf)       0.923 / 0.846    1.0 / 0.885      1.0 / 0.885
hybrid + rerank    1.0 / 0.923      1.0 / 0.923      1.0 / 0.923
(recall@k / mrr@k)
The cross-encoder reranker fixes the case where a semantically-close distractor outranked the correct passage in the top position.
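The rerank step itself is simple once a pair scorer exists: score every (query, chunk) pair, then sort. In the sketch below the scorer is injected, and a toy token-overlap function stands in for the cross-encoder so the example runs without downloading a model. `rerank` and `token_overlap` are hypothetical names; in the real pipeline the scorer would presumably wrap a sentence-transformers cross-encoder.

```python
from typing import Callable, Sequence


def rerank(
    query: str,
    chunks: Sequence[str],
    score_pair: Callable[[str, str], float],
    top_k: int = 3,
) -> list[tuple[float, str]]:
    """Score each (query, chunk) pair directly and keep the best top_k."""
    scored = [(score_pair(query, chunk), chunk) for chunk in chunks]
    # Python's sort is stable, so ties keep their first-stage order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


def token_overlap(query: str, chunk: str) -> float:
    """Toy scorer (NOT a cross-encoder): share of query tokens in the chunk."""
    query_tokens = set(query.lower().split())
    chunk_tokens = set(chunk.lower().split())
    if not query_tokens:
        return 0.0
    return len(query_tokens & chunk_tokens) / len(query_tokens)


candidates = [
    "The premium tier includes support",
    "The Standard tier costs 99 US dollars per month",
    "Billing happens monthly",
]
best = rerank("standard tier cost", candidates, token_overlap, top_k=1)
```

Because the scorer sees the query and the chunk together, it can separate a near-duplicate distractor from the correct passage, which first-stage retrieval over independent embeddings may fail to do.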
Web UI
Start the API with make api and open http://localhost:8000. The frontend
streams the answer token by token and renders the cited sources with their rerank
scores, so you can see exactly which passages grounded the response.
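A script can consume the same stream without the browser. The sketch below parses the standard Server-Sent Events framing from an iterable of text lines. It assumes nothing about this server's event names or payload format, and `iter_sse_data` is a hypothetical helper, not part of the project.

```python
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each event in a Server-Sent Events stream."""
    buffer: list[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            # A blank line ends the event; empty payloads are not dispatched.
            data = "\n".join(buffer)
            buffer = []
            if data:
                yield data
            continue
        if line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # exactly one leading space is stripped
        if field == "data":
            buffer.append(value)
        # Other fields (event, id, retry) are ignored in this sketch.


sample = [
    "data: The\n",
    "\n",
    ": keep-alive\n",
    "data:  Standard\n",
    "\n",
    "event: done\n",
    "data: [1]\n",
    "\n",
]
tokens = list(iter_sse_data(sample))
```

Feeding it the decoded lines of a streaming HTTP response yields one payload per event, which a client can append to the displayed answer as it arrives.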
Adding your own documents
Drop .md, .txt or .pdf files into data/documents/ and re-run
make ingest RESET=1. The schema adapts to the embedding dimension of the
configured backend automatically.
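One way a schema can follow the embedding dimension is to generate the DDL from the backend's reported size at ingest time. The sketch below is illustrative only: the table and column names (`chunks`, `embedding`, `content_tsv`) and the function `build_schema_sql` are assumptions, while the `vector(n)` type, the `hnsw` index with `vector_cosine_ops`, and the `gin` index are standard pgvector and PostgreSQL syntax.

```python
def build_schema_sql(dimension: int) -> str:
    """Return DDL for a chunks table sized to the given embedding dimension."""
    if not isinstance(dimension, int) or dimension <= 0:
        raise ValueError("dimension must be a positive integer")
    return f"""
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE IF NOT EXISTS chunks (
    id          BIGSERIAL PRIMARY KEY,
    source      TEXT NOT NULL,
    content     TEXT NOT NULL,
    embedding   vector({dimension}) NOT NULL,
    content_tsv tsvector GENERATED ALWAYS AS
                (to_tsvector('english', content)) STORED
);

-- Dense retrieval: approximate nearest neighbours by cosine distance.
CREATE INDEX IF NOT EXISTS chunks_embedding_hnsw
    ON chunks USING hnsw (embedding vector_cosine_ops);

-- Sparse retrieval: full-text search over the generated tsvector.
CREATE INDEX IF NOT EXISTS chunks_content_tsv_gin
    ON chunks USING gin (content_tsv);
"""


ddl = build_schema_sql(384)
```

A vector column has a fixed dimension, which is presumably why switching backends goes through a reset ingest rather than an in-place update.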
Using it as an MCP server
The pipeline is exposed as an MCP server so an LLM agent can retrieve grounded facts on demand:
python -m mcp_server.server
Tools: search_knowledge_base(query, top_k) for raw passages and ask_knowledge_base(question) for a grounded, cited answer.
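For orientation, here is how two tools with these names could be registered using the official MCP Python SDK's `FastMCP` helper. This is a sketch under assumptions, not the contents of mcp_server/server.py: `make_tools` and `build_server` are hypothetical, the default `top_k=5` is a guess, and the retrieval and answer functions are injected so the example does not depend on the real pipeline.

```python
from typing import Callable


def make_tools(
    search: Callable[[str, int], list[dict]],
    answer: Callable[[str], dict],
):
    """Wrap pipeline callables as functions with the documented tool names."""

    def search_knowledge_base(query: str, top_k: int = 5) -> list[dict]:
        """Return the top_k raw passages for a query."""
        return search(query, top_k)

    def ask_knowledge_base(question: str) -> dict:
        """Return a grounded answer with citations."""
        return answer(question)

    return search_knowledge_base, ask_knowledge_base


def build_server(search, answer):
    """Register both tools on a FastMCP server (needs the `mcp` package)."""
    from mcp.server.fastmcp import FastMCP  # imported lazily on purpose

    server = FastMCP("rag-knowledge-base")
    for tool in make_tools(search, answer):
        server.tool()(tool)  # same effect as decorating with @server.tool()
    return server


# Stub pipeline functions, for illustration only.
def fake_search(query: str, top_k: int) -> list[dict]:
    return [{"chunk_id": i, "preview": query} for i in range(top_k)]


def fake_answer(question: str) -> dict:
    return {"answer": f"(stub) {question}", "citations": []}


search_tool, ask_tool = make_tools(fake_search, fake_answer)
```

Calling `build_server(fake_search, fake_answer).run()` would serve the tools over stdio, which is the transport most MCP clients expect when they launch a local command.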
The retrieval, ranking, generation and evaluation core was designed by hand. AI agents assisted with documentation, the web frontend and peripheral scaffolding.