mcp-rag-assistant
Integrates with Ollama to run local LLMs (e.g., llama3, mistral) and embedding models (e.g., qwen3-embedding), enabling document Q&A, summarization, data analysis, and entity extraction entirely offline.
MCP-Powered AI Assistant (Local — LlamaIndex + Ollama)
Privacy-first document intelligence: all models run locally via Ollama — no data leaves your machine.
What is MCP?
Model Context Protocol (MCP) is an open standard, introduced by Anthropic, that defines how AI applications discover and invoke tools at runtime. Think of it as a "USB-C port for AI": any MCP-compatible client (Claude Desktop, your own agent, etc.) can connect to any MCP server and immediately use its tools.
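Discovery and invocation are both plain JSON-RPC 2.0 messages: `tools/list` asks a server what it offers, and `tools/call` runs one tool by name. The sketch below builds those two messages and posts them to this project's server. It assumes the server accepts JSON-RPC over HTTP POST at `/mcp` on port 8080, as described in this README; the helper names (`make_request`, `post`, etc.) are illustrative, and the `initialize` handshake a fully spec-compliant MCP client would send first is deliberately left out.

```python
import json
import urllib.request

# Assumption: this project's server accepts JSON-RPC 2.0 over HTTP POST at /mcp
# on port 8080, as described in this README. A fully spec-compliant MCP client
# would also send an "initialize" request first; that step is omitted here.
MCP_URL = "http://localhost:8080/mcp"


def make_request(method, params=None, request_id=1):
    """Build a JSON-RPC 2.0 request object."""
    message = {"jsonrpc": "2.0", "id": request_id, "method": method}
    if params is not None:
        message["params"] = params
    return message


def list_tools_request(request_id=1):
    # Discovery: ask the server which tools it exposes and their input schemas.
    return make_request("tools/list", request_id=request_id)


def call_tool_request(name, arguments, request_id=2):
    # Invocation: call one tool by name with a JSON object of arguments.
    return make_request("tools/call", {"name": name, "arguments": arguments}, request_id)


def post(message, url=MCP_URL, timeout=60):
    """Send one JSON-RPC message over HTTP and return the decoded JSON reply."""
    body = json.dumps(message).encode("utf-8")
    req = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the server running, `post(list_tools_request())` should return the tool list, and `post(call_tool_request("rebuild_index", {}))` would trigger a re-index, assuming that tool takes no arguments.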
MCP vs. Advanced RAG — What's the Difference?
| Dimension | Advanced RAG | MCP |
|---|---|---|
| Purpose | Improve retrieval accuracy | Standardise tool/capability exposure |
| Core idea | Better chunking, re-ranking, hybrid search | JSON-RPC tool registry with discovery |
| What the LLM gets | Retrieved context injected into prompt | A menu of callable functions with schemas |
| Execution | Single pipeline (query → retrieve → generate) | Multi-step agent loop (plan → pick tool → call → observe → repeat) |
| Tools | Retrieval only | Any function: retrieval, APIs, databases, code |
| State | Stateless per query | Stateful agent sessions possible |
| This project | RAG is one tool inside the MCP server | MCP wraps 8 RAG tools, discoverable at runtime |
In short: Advanced RAG makes retrieval smarter. MCP makes the entire AI system composable and interoperable.
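To make the "query → retrieve → generate" pipeline concrete, here is a minimal sketch of the retrieve step: rank stored chunks by cosine similarity to the query embedding, keep the top K, and paste them into the prompt. It uses toy vectors; in this project the embeddings come from qwen3-embedding via Ollama and the ranking is done by ChromaDB, which may use a different distance metric. The function names are illustrative, and `k=5` simply mirrors the default number of chunks retrieved per query.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunks, k=5):
    """Return the k (score, text) pairs most similar to query_vec.

    `chunks` is a list of (text, embedding) pairs.
    """
    scored = [(cosine(query_vec, emb), text) for text, emb in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def build_prompt(question, retrieved):
    # Plain RAG: the retrieved chunk text is injected into the prompt as context.
    context = "\n\n".join(text for _, text in retrieved)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
```

The LLM only ever sees the final prompt string; in the MCP setup this whole function chain sits behind a single tool.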
Project Architecture
mcp_rag_assistant/
├── config.py ← Central config (LLM, embed, chunking, server)
├── rag_engine.py ← LlamaIndex: load docs → build index → query engine
├── main.py ← CLI entrypoint (serve / index / query / demo)
├── mcp_client.py ← Example client that calls server tools
│
├── mcp_server/
│ └── server.py ← HTTP JSON-RPC server exposing all tools
│
├── tools/
│ └── rag_tools.py ← 8 MCP tool implementations
│
├── utils/
│ └── logger.py ← Structured logging
│
├── my_data/ ← ⬅ DROP YOUR FILES HERE (PDF, DOCX, XLSX, CSV)
├── storage/ ← ChromaDB persistence (auto-created)
├── logs/ ← Log files (auto-created)
│
├── requirements.txt
├── .env.example
├── .gitignore
└── README.md

Data Flow
User Query
│
▼
MCP Client (mcp_client.py or Claude Desktop or your agent)
│ JSON-RPC POST /mcp {"method": "tools/call", "params": {...}}
▼
MCP Server (mcp_server/server.py)
│ dispatches to matching tool function
▼
Tool Function (tools/rag_tools.py)
│ calls get_query_engine().query(...)
▼
LlamaIndex Query Engine (rag_engine.py)
│ embeds query with qwen3-embedding:0.6b via Ollama
▼
ChromaDB Vector Store
│ returns top-K similar chunks
▼
Ollama LLM (llama3 or mistral)
│ synthesises answer from retrieved context
▼
JSON response back through MCP → Client

Available MCP Tools
| Tool | Description |
|---|---|
| | General Q&A over all indexed documents |
| | Show files in |
| rebuild_index | Re-index after adding/removing files |
| | Summarise a specific file by name |
| | Plain-English data analysis (CSV/XLSX) |
| | Generate summary / detailed / executive report |
| | Compare two documents on a given aspect |
| | Extract people, orgs, dates, numbers |
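On the server side, the data flow above boils down to a name-to-function registry plus a dispatcher that answers `tools/list` and routes `tools/call`. The following is a minimal sketch of that pattern, not the project's `mcp_server/server.py`. `rebuild_index` is named in this README; `query_documents` is a hypothetical stand-in for the Q&A tool, and both bodies are placeholders (the real Q&A tool would call `get_query_engine().query(...)`). Error codes follow JSON-RPC 2.0 (-32601 method not found, -32602 invalid params).

```python
# Hypothetical placeholder tools standing in for tools/rag_tools.py.
def query_documents(question: str) -> str:
    # The real project would call get_query_engine().query(question) here.
    return f"(answer to: {question})"


def rebuild_index() -> str:
    return "index rebuilt"


# Registry: tool name -> callable plus the metadata returned by tools/list.
TOOLS = {
    "query_documents": {
        "fn": query_documents,
        "description": "General Q&A over all indexed documents",
        "inputSchema": {
            "type": "object",
            "properties": {"question": {"type": "string"}},
            "required": ["question"],
        },
    },
    "rebuild_index": {
        "fn": rebuild_index,
        "description": "Re-index after adding/removing files",
        "inputSchema": {"type": "object", "properties": {}},
    },
}


def _result(req_id, result):
    return {"jsonrpc": "2.0", "id": req_id, "result": result}


def _error(req_id, code, message):
    return {"jsonrpc": "2.0", "id": req_id, "error": {"code": code, "message": message}}


def handle(message):
    """Dispatch one JSON-RPC request dict to tools/list or tools/call."""
    req_id = message.get("id")
    method = message.get("method")
    if method == "tools/list":
        tools = [
            {"name": name, "description": spec["description"], "inputSchema": spec["inputSchema"]}
            for name, spec in TOOLS.items()
        ]
        return _result(req_id, {"tools": tools})
    if method == "tools/call":
        params = message.get("params") or {}
        spec = TOOLS.get(params.get("name"))
        if spec is None:
            return _error(req_id, -32602, f"Unknown tool: {params.get('name')}")
        try:
            text = spec["fn"](**(params.get("arguments") or {}))
        except TypeError as exc:  # wrong or missing arguments (sketch-level check)
            return _error(req_id, -32602, f"Invalid arguments: {exc}")
        # MCP tool results carry a list of content items; plain text here.
        return _result(req_id, {"content": [{"type": "text", "text": text}]})
    return _error(req_id, -32601, f"Method not found: {method}")
```

Because discovery is driven by the registry, adding a ninth tool would only mean adding one entry to `TOOLS`; clients pick it up on their next `tools/list` call.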
Prerequisites
Python 3.10+
Ollama running locally — ollama.com
The following models pulled into Ollama (ollama pull <model>):
llama3:latest
mistral:latest
qwen3-embedding:0.6b
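Before indexing, it can help to confirm that Ollama is reachable and that all three models are present. A minimal sketch, assuming Ollama's default address (`http://localhost:11434`) and its `GET /api/tags` endpoint, which lists locally pulled models; the helper names are illustrative and not part of this project.

```python
import json
import urllib.request

# Ollama's REST API listens on http://localhost:11434 by default;
# GET /api/tags lists the models that have been pulled locally.
OLLAMA_TAGS_URL = "http://localhost:11434/api/tags"

# Model tags listed in the prerequisites above.
REQUIRED_MODELS = ["llama3:latest", "mistral:latest", "qwen3-embedding:0.6b"]


def installed_models(url=OLLAMA_TAGS_URL, timeout=5):
    """Return the set of model names Ollama reports as pulled."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        payload = json.loads(resp.read().decode("utf-8"))
    return {model["name"] for model in payload.get("models", [])}


def missing_models(required, installed):
    """Return the required model names that are not installed, in order."""
    return [name for name in required if name not in installed]
```

`missing_models(REQUIRED_MODELS, installed_models())` should return an empty list when Ollama is running and every model is pulled; any name it returns can be fetched with `ollama pull`.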
Setup
# 1. Clone / unzip the project
cd mcp_rag_assistant
# 2. Create virtual environment
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
# 3. Install dependencies
pip install -r requirements.txt
# 4. Configure (optional — defaults work out of the box)
cp .env.example .env
# Edit .env to change models, ports, chunk sizes etc.
# 5. Add your documents
# Copy PDFs, DOCX, XLSX, CSV files into:
# my_data/
# 6. Build the index
python main.py index
# 7. Start the MCP server
python main.py serve

Usage
Start the server
python main.py serve
# MCP server listening on http://0.0.0.0:8080

One-shot query (no server needed)
python main.py query "What are the key findings in the Q1 report?"

Run the demo client (server must be running)
# In a second terminal:
python main.py demo

Rebuild index after adding new files
python main.py index
# or via MCP tool:
# call rebuild_index tool from any client

Health check
GET http://localhost:8080/health
GET http://localhost:8080/tools

Switching Models
Edit config.py or your .env:
# Use mistral instead of llama3
LLM_MODEL=mistral:latest
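# Use nomic-embed-text for embeddings (pull it first with: ollama pull nomic-embed-text)
# Then re-run python main.py index: vectors from different embedding models are not comparable
EMBED_MODEL=nomic-embed-text:latest

Tuning Chunk Size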
In config.py or .env:
| Setting | Default | Notes |
|---|---|---|
| | 256 | Tokens per chunk. Smaller chunks give more targeted matches but carry less surrounding context |
| | 25 | Tokens shared between consecutive chunks. Helps preserve context at chunk boundaries |
| | 5 | Chunks retrieved per query (top-K) |
| | compact | |
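The chunk size and overlap settings interact as a sliding window: each new chunk starts `chunk_size - overlap` tokens after the previous one. The sketch below shows only that arithmetic on a pre-tokenised list; LlamaIndex's node parsers count tokens with a tokenizer and try to break at sentence boundaries, so real chunk edges will differ. The function and parameter names are illustrative, not the project's config keys.

```python
def sliding_chunks(tokens, chunk_size=256, overlap=25):
    """Split a token list into fixed-size windows that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap  # each window starts this many tokens after the last
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks
```

With the defaults, a 600-token document yields three chunks starting at tokens 0, 231 and 462, the last one 138 tokens long; each neighbouring pair shares 25 tokens.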