MCP-Powered AI Assistant (Local — LlamaIndex + Ollama)

Privacy-first document intelligence: all models run locally via Ollama — no data leaves your machine.


What is MCP?

Model Context Protocol (MCP) is an open standard (by Anthropic) that defines how AI models discover and invoke tools at runtime. Think of it as a "USB-C port for AI" — any MCP-compatible client (Claude Desktop, your own agent, etc.) can connect to any MCP server and immediately use its tools.
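To make "discover and invoke" concrete, the sketch below builds the two JSON-RPC 2.0 messages a client would typically send: `tools/list` to discover tools and `tools/call` to invoke one. The method names follow the MCP specification and the tool name comes from this project's tool list. The `question` argument name and the helper `jsonrpc_request` are assumptions made for illustration.

```python
import json
from itertools import count

_ids = count(1)  # JSON-RPC request ids only need to be unique per session


def jsonrpc_request(method, params=None):
    """Build a JSON-RPC 2.0 request object as a plain dict."""
    request = {"jsonrpc": "2.0", "id": next(_ids), "method": method}
    if params is not None:
        request["params"] = params
    return request


# Discovery: ask the server which tools it exposes.
list_tools = jsonrpc_request("tools/list")

# Invocation: call one tool by name with JSON arguments.
# "question" is a hypothetical argument name, not taken from the project.
call_tool = jsonrpc_request(
    "tools/call",
    {"name": "query_documents", "arguments": {"question": "What is in the Q1 report?"}},
)

print(json.dumps(call_tool, indent=2))
```

Because both messages are plain JSON, any client that can send them over a supported transport can use the server without project-specific code.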

MCP vs. Advanced RAG — What's the Difference?

| Dimension | Advanced RAG | MCP |
| --- | --- | --- |
| Purpose | Improve retrieval accuracy | Standardise tool/capability exposure |
| Core idea | Better chunking, re-ranking, hybrid search | JSON-RPC tool registry with discovery |
| What the LLM gets | Retrieved context injected into prompt | A menu of callable functions with schemas |
| Execution | Single pipeline (query → retrieve → generate) | Multi-step agent loop (plan → pick tool → call → observe → repeat) |
| Tools | Retrieval only | Any function: retrieval, APIs, databases, code |
| State | Stateless per query | Stateful agent sessions possible |
| This project | RAG is one tool inside the MCP server | MCP wraps 8 RAG tools, discoverable at runtime |

In short: Advanced RAG makes retrieval smarter. MCP makes the entire AI system composable and interoperable.
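The difference in execution shape can be sketched with stand-in functions. Everything below is illustrative: `retrieve`, `generate`, the planner `toy_plan`, the file name `q1_report.pdf`, and the `question` argument are hypothetical and are not taken from the project's code.

```python
def retrieve(query):
    """Stand-in retriever: returns canned context for the query."""
    return f"[context for: {query}]"


def generate(query, context):
    """Stand-in LLM call: combines query and context into an answer."""
    return f"answer to '{query}' using {context}"


def rag_pipeline(query):
    """Advanced RAG shape: one fixed pass, query -> retrieve -> generate."""
    return generate(query, retrieve(query))


def agent_loop(goal, tools, plan, max_steps=5):
    """MCP-style shape: plan -> pick tool -> call -> observe -> repeat.

    `plan` is a hypothetical planner: given the goal and the observations so
    far, it returns (tool_name, arguments), or None when it is done.
    """
    observations = []
    for _ in range(max_steps):
        step = plan(goal, observations)
        if step is None:
            break
        tool_name, arguments = step
        observations.append(tools[tool_name](**arguments))
    return observations


def toy_plan(goal, observations):
    """Toy planner: list the files first, then query, then stop."""
    if len(observations) == 0:
        return ("list_indexed_files", {})
    if len(observations) == 1:
        return ("query_documents", {"question": goal})
    return None


toy_tools = {
    "list_indexed_files": lambda: ["q1_report.pdf"],
    "query_documents": lambda question: rag_pipeline(question),
}

steps = agent_loop("key findings", toy_tools, toy_plan)
```

Note that the RAG pipeline appears here as just one entry in `toy_tools`, which mirrors the "This project" row of the table.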



Project Architecture

mcp_rag_assistant/
├── config.py               ← Central config (LLM, embed, chunking, server)
├── rag_engine.py           ← LlamaIndex: load docs → build index → query engine
├── main.py                 ← CLI entrypoint (serve / index / query / demo)
├── mcp_client.py           ← Example client that calls server tools
│
├── mcp_server/
│   └── server.py           ← HTTP JSON-RPC server exposing all tools
│
├── tools/
│   └── rag_tools.py        ← 8 MCP tool implementations
│
├── utils/
│   └── logger.py           ← Structured logging
│
├── my_data/                ← ⬅ DROP YOUR FILES HERE (PDF, DOCX, XLSX, CSV)
├── storage/                ← ChromaDB persistence (auto-created)
├── logs/                   ← Log files (auto-created)
│
├── requirements.txt
├── .env.example
├── .gitignore
└── README.md

Data Flow

User Query
    │
    ▼
MCP Client (mcp_client.py or Claude Desktop or your agent)
    │  JSON-RPC POST /mcp  {"method": "tools/call", "params": {...}}
    ▼
MCP Server (mcp_server/server.py)
    │  dispatches to matching tool function
    ▼
Tool Function (tools/rag_tools.py)
    │  calls get_query_engine().query(...)
    ▼
LlamaIndex Query Engine (rag_engine.py)
    │  embeds query with qwen3-embedding:0.6b via Ollama
    ▼
ChromaDB Vector Store
    │  returns top-K similar chunks
    ▼
Ollama LLM (llama3 or mistral)
    │  synthesises answer from retrieved context
    ▼
JSON response back through MCP → Client
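The "dispatches to matching tool function" step in the flow above might look roughly like this. It is a simplified sketch, not the code in mcp_server/server.py: the `TOOLS` registry, the `handle_request` helper, the stub tool body, and the `question` argument are assumptions, and the result is returned as a bare value rather than in whatever envelope the real server uses.

```python
def query_documents(question):
    """Stand-in for a tool in tools/rag_tools.py; the real one queries the index."""
    return f"(stub) answer for: {question}"


TOOLS = {"query_documents": query_documents}  # tool name -> callable


def handle_request(request):
    """Dispatch one JSON-RPC request dict and return a JSON-RPC response dict."""
    request_id = request.get("id")
    if request.get("method") != "tools/call":
        # -32601 is the JSON-RPC 2.0 code for "Method not found".
        return {"jsonrpc": "2.0", "id": request_id,
                "error": {"code": -32601, "message": "Method not found"}}

    params = request.get("params", {})
    tool = TOOLS.get(params.get("name"))
    if tool is None:
        # -32602 is the JSON-RPC 2.0 code for "Invalid params".
        return {"jsonrpc": "2.0", "id": request_id,
                "error": {"code": -32602, "message": "Unknown tool"}}

    result = tool(**params.get("arguments", {}))
    return {"jsonrpc": "2.0", "id": request_id, "result": result}


response = handle_request({
    "jsonrpc": "2.0", "id": 7, "method": "tools/call",
    "params": {"name": "query_documents", "arguments": {"question": "hello"}},
})
```

Keeping the registry as a plain dict means a new tool only needs a function and one dictionary entry; the dispatcher itself does not change.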

Available MCP Tools

| Tool | Description |
| --- | --- |
| query_documents | General Q&A over all indexed documents |
| list_indexed_files | Show files in my_data/ |
| rebuild_index | Re-index after adding/removing files |
| summarize_document | Summarise a specific file by name |
| analyze_data | Plain-English data analysis (CSV/XLSX) |
| generate_report | Generate summary / detailed / executive report |
| compare_documents | Compare two documents on a given aspect |
| extract_entities | Extract people, orgs, dates, numbers |
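Any of these tools can be called over HTTP with the standard library alone. The sketch below assumes the server is reachable at http://localhost:8080/mcp (the port and path this README shows). The helper names and the argument names `file_a`, `file_b`, and `aspect` are hypothetical; the actual schemas are whatever the server reports for each tool.

```python
import json
import urllib.request

MCP_URL = "http://localhost:8080/mcp"  # adjust if the server host or port is changed


def build_tool_call(name, arguments, request_id=1, url=MCP_URL):
    """Build (but do not send) an HTTP POST carrying a tools/call request."""
    payload = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call_tool(name, arguments, timeout=120):
    """Send the request and return the decoded JSON response."""
    request = build_tool_call(name, arguments)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Argument names here are hypothetical; check the tool's published schema.
request = build_tool_call(
    "compare_documents",
    {"file_a": "a.pdf", "file_b": "b.pdf", "aspect": "pricing"},
)
```

The generous default timeout in `call_tool` is a guess: local LLM generation can take a while, so a short timeout may cut off otherwise valid responses.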


Prerequisites

  • Python 3.10+

  • Ollama running locally — ollama.com

  • These models pulled locally (for example with ollama pull llama3):

    • llama3:latest

    • mistral:latest

    • qwen3-embedding:0.6b


Setup

# 1. Clone / unzip the project
cd mcp_rag_assistant

# 2. Create virtual environment
python -m venv venv
source venv/bin/activate        # Windows: venv\Scripts\activate

# 3. Install dependencies
pip install -r requirements.txt

# 4. Configure (optional — defaults work out of the box)
cp .env.example .env
# Edit .env to change models, ports, chunk sizes etc.

# 5. Add your documents
#    Copy PDFs, DOCX, XLSX, CSV files into:
#    my_data/

# 6. Build the index
python main.py index

# 7. Start the MCP server
python main.py serve

Usage

Start the server

python main.py serve
# MCP server listening on http://0.0.0.0:8080

One-shot query (no server needed)

python main.py query "What are the key findings in the Q1 report?"

Run the demo client (server must be running)

# In a second terminal:
python main.py demo

Rebuild index after adding new files

python main.py index
# or via MCP tool:
# call rebuild_index tool from any client

Health check

GET http://localhost:8080/health
GET http://localhost:8080/tools
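The same checks can be scripted, for example to wait for the server before running a client. This is a minimal sketch using only the standard library; it assumes a healthy endpoint answers with HTTP 200, which the README does not state explicitly, and the helper names are hypothetical.

```python
import urllib.request


def is_up(url, timeout=2.0):
    """Return True if a GET to `url` answers with HTTP 200, else False."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError, HTTPError, timeouts and refused connections all derive from OSError.
        return False


def check_server(base_url="http://localhost:8080"):
    """Probe both endpoints and report which ones respond."""
    return {path: is_up(base_url + path) for path in ("/health", "/tools")}
```

Calling `check_server()` returns a dict keyed by path, so a script can tell a server that is down from one where only a single endpoint is failing.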

Switching Models

Edit config.py or your .env:

# Use mistral instead of llama3
LLM_MODEL=mistral:latest

# Use nomic-embed-text for embeddings
EMBED_MODEL=nomic-embed-text:latest
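One plausible way for config.py to pick up these settings is to read environment variables with fallbacks. The sketch below is an assumption about the mechanism, not the project's actual code: the function name is hypothetical, the defaults are guessed from the model list under Prerequisites, and the values in .env would first need to be loaded into the environment.

```python
import os


def load_model_config(env=os.environ):
    """Read model names from the environment, falling back to assumed defaults."""
    return {
        "llm_model": env.get("LLM_MODEL", "llama3:latest"),
        "embed_model": env.get("EMBED_MODEL", "qwen3-embedding:0.6b"),
    }


# Passing a dict instead of os.environ makes the behaviour easy to check.
defaults = load_model_config({})
overridden = load_model_config({"LLM_MODEL": "mistral:latest"})
```

Vectors produced by different embedding models are not comparable, so changing EMBED_MODEL generally means rebuilding the index with `python main.py index` before querying again.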

Tuning Chunk Size

In config.py or .env:

| Setting | Default | Notes |
| --- | --- | --- |
| CHUNK_SIZE | 256 | Tokens per chunk. Smaller = more precise retrieval |
| CHUNK_OVERLAP | 25 | Overlap between chunks. Helps preserve context at boundaries |
| SIMILARITY_TOP_K | 5 | Chunks retrieved per query |
| RESPONSE_MODE | compact | compact \| tree_summarize \| refine |
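To see how CHUNK_SIZE and CHUNK_OVERLAP interact, the sketch below applies a plain sliding window to a list of token ids. It is an illustration of the arithmetic only: LlamaIndex's own splitters also respect sentence boundaries, so real chunks will not line up exactly like this, and `chunk_tokens` is a hypothetical helper.

```python
def chunk_tokens(tokens, chunk_size=256, chunk_overlap=25):
    """Split a token list into windows of `chunk_size` sharing `chunk_overlap` tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    stride = chunk_size - chunk_overlap  # 256 - 25 = 231 new tokens per chunk
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end
    return chunks


tokens = list(range(1000))   # stand-in for 1,000 token ids
chunks = chunk_tokens(tokens)
print(len(chunks), [len(c) for c in chunks])
```

With the defaults, each chunk advances by 231 tokens, so a 1,000-token document yields five chunks and the last 25 tokens of one chunk reappear at the start of the next. Lowering CHUNK_SIZE increases the number of chunks, which makes each retrieved hit more specific but gives the LLM less surrounding context per chunk.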

