# Hybrid RAG System - Architecture Documentation
## Executive Summary
This is a **Retrieval-Augmented Generation (RAG)** system that combines **semantic search** (vector embeddings) with **lexical search** (BM25 keyword matching) to provide accurate, context-aware answers from your local document collection.
**Key Features:**
- π **Hybrid Search**: Combines vector similarity + keyword matching
- π **Multi-Format Support**: CSV, Markdown, PDF, DOCX, TXT
- π€ **Local LLM**: Runs entirely locally using Ollama
- π― **Document-Type Aware**: Optimized handling for structured vs. unstructured data
- π¬ **Claude Desktop Integration**: MCP server for seamless AI assistant access
---
## System Architecture Overview
```
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β USER INTERFACE β
β ββββββββββββββββββββ βββββββββββββββββββββββββββ β
β β Claude Desktop β β Command Line (Demo) β β
β β (MCP Client) β β python run_demo.py β β
β ββββββββββ¬ββββββββββ βββββββββββββ¬ββββββββββββββ β
βββββββββββββΌβββββββββββββββββββββββββββββββββββββΌββββββββββββββββ
β β
β JSON-RPC over stdio β Direct Python
β β
βββββββββββββΌβββββββββββββββββββββββββββββββββββββΌββββββββββββββββ
β HYBRID RAG SYSTEM β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β MCP Server (Model Context Protocol) β β
β β β’ query_documents() β’ list_documents() β’ get_stats() β β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β β
β βΌ β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β DOCUMENT PROCESSING LAYER β β
β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β β β DocumentLoaderUtility β β β
β β β β’ Loads: CSV, MD, PDF, DOCX, TXT β β β
β β β β’ Automatic format detection β β β
β β β β’ Error handling & validation β β β
β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β β β β β
β β βΌ β β
β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β β β Text Processing & Chunking β β β
β β β β’ RecursiveCharacterTextSplitter β β β
β β β β’ Text: 1000 chars, 200 overlap β β β
β β β β’ CSV: Row-based (10 rows per chunk) β β β
β β β β’ Metadata preservation β β β
β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β β
β βΌ β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β RETRIEVAL LAYER (HYBRID SEARCH) β β
β β β β
β β βββββββββββββββββββββββ ββββββββββββββββββββββββββββ β
β β β SEMANTIC SEARCH β β LEXICAL SEARCH ββ β
β β β (Vector/Dense) β β (Keyword/Sparse) ββ β
β β βββββββββββββββββββββββ€ βββββββββββββββββββββββββββ€β β
β β β ChromaDB β β BM25 Retriever ββ β
β β β β’ Vector embeddings β β β’ TF-IDF scoring ββ β
β β β β’ Similarity search β β β’ Exact match ββ β
β β β β’ Persisted to disk β β β’ In-memory index ββ β
β β β β’ k=5 results β β β’ k=5 results ββ β
β β ββββββββββββ¬βββββββββββ ββββββββββββ¬ββββββββββββββββ β
β β β β β β
β β ββββββββββββββ¬ββββββββββββββββ β β
β β βΌ β β
β β ββββββββββββββββββββββββββββββββββββ β β
β β β Document Type Aware Retriever β β β
β β ββββββββββββββββββββββββββββββββββββ€ β β
β β β CSV Pipeline (40% weight) β β β
β β β β’ Semantic + BM25 ensemble β β β
β β β β’ Better for structured queries β β β
β β β β β β
β β β Text Pipeline (60% weight) β β β
β β β β’ Semantic + BM25 ensemble β β β
β β β β’ Better for natural language β β β
β β β β β β
β β β Reciprocal Rank Fusion (RRF) β β β
β β β β’ Merges results from both β β β
β β ββββββββββββββββββββββββββββββββββββ β β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β β
β βΌ β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β GENERATION LAYER (LLM) β β
β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β β β Ollama LLM (Local) β β β
β β β β’ Model: llama3.1:latest β β β
β β β β’ Runs on localhost:11434 β β β
β β β β’ Context: Retrieved document chunks β β β
β β β β’ Prompt: Question + Context β Answer β β β
β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β
βΌ
βββββββββββββββββββββββββββββββββββ
β EXTERNAL DEPENDENCIES β
β β’ Ollama Server (embeddings) β
β β’ ChromaDB (vector storage) β
β β’ LangChain (orchestration) β
βββββββββββββββββββββββββββββββββββ
```
---
## Component Deep Dive
### 1. Document Ingestion Pipeline
#### 1.1 Document Loader (`DocumentLoaderUtility`)
**Purpose:** Load and parse documents from multiple file formats.
**Location:** `src/hybrid_rag/document_loader.py`
**Supported Formats:**
```python
{
'.txt': TextLoader, # Plain text files
'.pdf': PyPDFLoader, # PDF documents
'.md': UnstructuredMarkdownLoader, # Markdown files
'.docx': Docx2txtLoader, # Word documents
'.csv': CSVLoader # CSV spreadsheets
}
```
**Process Flow:**
```
1. Scan data directory
2. Detect file type by extension
3. Select appropriate loader
4. Parse document content
5. Extract metadata (source, type, timestamp)
6. Return Document objects
```
**Performance (from boundary testing):**
- **43,835 documents** loaded in **4.71 seconds**
- **9,314 documents/second** throughput
- **255 MB** memory increase
#### 1.2 Text Chunking
**Purpose:** Split large documents into manageable chunks for embedding and retrieval.
**Strategy:** Different chunking for different document types
**Text/Markdown Chunking:**
```python
RecursiveCharacterTextSplitter(
chunk_size=1000, # 1000 characters per chunk
chunk_overlap=200, # 200 character overlap between chunks
separators=["\n\n", "\n", ". ", " ", ""]
)
```
**Why overlap?** Ensures context isn't lost at chunk boundaries.
**CSV Chunking:**
```python
chunk_size = 10 rows # Group CSV rows together
```
**Why row-based?** Preserves table structure and relationships.
**Example:**
```
Original Text (2000 chars):
"The OLED TV market has grown significantly... [1000 chars]
...in 2024 with many new models... [1000 chars]"
After Chunking:
Chunk 1: "The OLED TV market has grown... [1000 chars]"
Chunk 2: "...grown significantly in 2024... [1000 chars]" (includes 200 char overlap)
```
---
### 2. Embedding Generation
#### 2.1 Ollama Embeddings
**Model:** `nomic-embed-text`
**Purpose:** Convert text chunks into dense vector representations (embeddings).
**Dimensions:** 768 (typical for nomic-embed-text)
**How it works:**
```
Input Text: "OLED 55 inch TV with 4K resolution"
β
Ollama API Call
β
Output Vector: [0.234, -0.123, 0.456, ..., 0.789] (768 dimensions)
```
**Why embeddings?**
- Capture **semantic meaning**, not just keywords
- Similar concepts have similar vectors (cosine similarity)
- "OLED television" β "OLED TV" in vector space
**Performance:**
- **43,835 embeddings** generated in **554 seconds**
- **79 embeddings/second** throughput
- **646 MB** memory increase
---
### 3. Vector Storage (ChromaDB)
#### 3.1 ChromaDB
**Purpose:** Persistent vector database for similarity search.
**Location:** `chroma_db/` directory
**How it works:**
```python
# Store embeddings
vectorstore.add_documents([
Document("OLED TV spec", embedding=[0.234, ...]),
Document("LCD TV spec", embedding=[0.891, ...]),
...
])
# Query by similarity
query = "Find OLED TVs"
query_embedding = [0.241, -0.119, ...] # Similar to stored OLED embedding
# ChromaDB finds nearest neighbors
results = vectorstore.similarity_search(query, k=5)
# Returns 5 most similar documents by cosine distance
```
**Index Structure:**
- **HNSW (Hierarchical Navigable Small World)** graph for fast approximate search
- **Cosine similarity** metric
- **Persistent storage** to disk
**Trade-offs:**
- β
Fast retrieval: O(log n) complexity
- β
Persistent across restarts
- β οΈ Memory scales with dataset size
- β οΈ Approximate (not exact) results
---
### 4. Hybrid Retrieval System
#### 4.1 Two-Pipeline Architecture
**Why Hybrid?** Combines strengths of both approaches:
| Semantic Search (Vectors) | Lexical Search (BM25) |
|---------------------------|----------------------|
| Understands meaning | Exact keyword matching |
| Finds similar concepts | Fast and precise |
| Handles synonyms | Good for specific terms |
| Fuzzy matching | Handles IDs, codes, numbers |
**Pipeline 1: CSV Documents**
```python
csv_documents = [docs where source.endswith('.csv')]
csv_vectorstore = Chroma.from_documents(csv_documents)
csv_vector_retriever = csv_vectorstore.as_retriever(k=5)
csv_bm25_retriever = BM25Retriever.from_documents(csv_documents)
csv_bm25_retriever.k = 5
csv_ensemble = EnsembleRetriever(
retrievers=[csv_vector_retriever, csv_bm25_retriever],
weights=[0.5, 0.5]
)
```
**Pipeline 2: Text/Markdown Documents**
```python
text_documents = [docs where source.endswith(('.md', '.txt', '.pdf'))]
text_vectorstore = Chroma.from_documents(text_documents)
text_vector_retriever = text_vectorstore.as_retriever(k=5)
text_bm25_retriever = BM25Retriever.from_documents(text_documents)
text_bm25_retriever.k = 5
text_ensemble = EnsembleRetriever(
retrievers=[text_vector_retriever, text_bm25_retriever],
weights=[0.5, 0.5]
)
```
**Final Ensemble (Document Type Aware):**
```python
final_retriever = EnsembleRetriever(
retrievers=[csv_ensemble, text_ensemble],
weights=[0.4, 0.6] # 40% CSV, 60% Text
)
```
**Why 40/60 weighting?**
- Text documents typically contain more relevant narrative content
- CSV is supplementary structured data
- Tuned based on empirical testing
#### 4.2 Reciprocal Rank Fusion (RRF)
**Purpose:** Merge ranked results from multiple retrievers.
**Algorithm:**
```python
def reciprocal_rank_fusion(results_list, k=60):
"""
Combine multiple ranked lists into a single ranking.
"""
scores = {}
for results in results_list:
for rank, doc in enumerate(results):
doc_id = doc.metadata['id']
# RRF formula: 1 / (k + rank)
score = 1.0 / (k + rank + 1)
if doc_id in scores:
scores[doc_id] += score
else:
scores[doc_id] = score
# Sort by combined score
ranked = sorted(scores.items(), key=lambda x: x[1], reverse=True)
return ranked
```
**Example:**
```
Vector Search Results:
1. Doc A (score: 0.95)
2. Doc B (score: 0.89)
3. Doc C (score: 0.82)
BM25 Search Results:
1. Doc C (score: 12.3)
2. Doc A (score: 9.8)
3. Doc D (score: 7.1)
RRF Fusion:
Doc A: 1/(60+0) + 1/(60+1) = 0.0167 + 0.0164 = 0.0331
Doc B: 1/(60+1) = 0.0164
Doc C: 1/(60+2) + 1/(60+0) = 0.0161 + 0.0167 = 0.0328
Doc D: 1/(60+2) = 0.0161
Final Ranking:
1. Doc A (0.0331)
2. Doc C (0.0328)
3. Doc B (0.0164)
4. Doc D (0.0161)
```
**Benefits:**
- Reduces bias toward any single retriever
- Promotes consensus results
- Handles different scoring scales
---
### 5. Question Answering Chain
#### 5.1 LLM Integration (Ollama)
**Model:** `llama3.1:latest` (8B parameters)
**Purpose:** Generate natural language answers from retrieved context.
**Prompt Template:**
```
You are an assistant for question-answering tasks.
Use the following pieces of retrieved context to answer the question.
If you don't know the answer, just say that you don't know.
Use three sentences maximum and keep the answer concise.
Context: {context}
Question: {input}
Answer:
```
**Process Flow:**
```
1. User asks: "What OLED TVs are available?"
2. Hybrid retriever finds 5 relevant chunks:
- "TV-OLED-55-001, OLED 55\" TV Premium, $1,299.99..."
- "OLED panels offer perfect blacks and infinite contrast..."
- "We stock OLED sizes: 42\", 48\", 55\", 65\", 77\", 83\"..."
- "Customer feedback: OLED picture quality is exceptional..."
- "Inventory: OLED 55\" - 135 units in stock..."
3. Context assembled and sent to LLM:
Prompt = Template + Context + Question
4. LLM generates answer:
"We have OLED TVs available in multiple sizes: 42\", 48\",
55\", 65\", 77\", and 83\". The OLED 55\" TV Premium is
priced at $1,299.99 and we have 135 units in stock. These
models feature perfect blacks and exceptional picture quality."
```
**LangChain Chains:**
```python
# Document combination chain
combine_docs_chain = create_stuff_documents_chain(llm, prompt)
# Retrieval + QA chain
qa_chain = create_retrieval_chain(retriever, combine_docs_chain)
# Invoke
response = qa_chain.invoke({"input": "What OLED TVs are available?"})
answer = response['answer']
```
**"Stuff" Strategy:**
- Concatenates all retrieved documents into a single prompt
- Simple and effective for moderate context sizes
- Alternative strategies: Map-Reduce, Refine (for very large contexts)
---
### 6. MCP Server (Claude Desktop Integration)
#### 6.1 Model Context Protocol
**Purpose:** Expose RAG system as tools for Claude Desktop.
**Location:** `scripts/mcp_server.py`
**Protocol:** JSON-RPC over stdio
**Available Tools:**
**1. query_documents**
```json
{
"name": "query_documents",
"description": "Search and answer questions from local documents",
"inputSchema": {
"query": "string (the question to ask)"
}
}
```
**2. list_documents**
```json
{
"name": "list_documents",
"description": "List all available documents in the system"
}
```
**3. get_stats**
```json
{
"name": "get_stats",
"description": "Get system statistics and document counts"
}
```
**Communication Flow:**
```
Claude Desktop
β
β User: "What products are low in stock?"
β
ββ Calls: query_documents("What products are low in stock?")
β
βΌ
MCP Server (stdio)
β
ββ Receives JSON-RPC request
ββ Executes RAG pipeline
ββ Returns answer + context
β
βΌ
Claude Desktop
β
ββ Shows answer to user with sources
```
**Configuration:** `~/Library/Application Support/Claude/claude_desktop_config.json`
```json
{
"mcpServers": {
"hybrid-rag-project": {
"command": "/Users/.../python",
"args": ["/Users/.../mcp_server.py"],
"env": {}
}
}
}
```
---
## Data Flow Example (End-to-End)
**User Question:** "Which products have the highest warranty claims?"
### Step 1: Document Loading (Offline/Startup)
```
data/
βββ warranty_claims_q4.csv (3,000 rows loaded)
βββ product_catalog.csv (5,000 products loaded)
βββ quality_control_report.md (500 sections loaded)
β 8,500 source documents
β Chunked into ~2,000 chunks (CSV row groups + text chunks)
```
### Step 2: Embedding Generation (Offline/Startup)
```
Each chunk β Ollama nomic-embed-text β 768-dimensional vector
Chunk: "Claim_ID,Product_ID,Claim_Type,Status
WRN-2024-10234,TV-OLED-55-001,Dead Pixel,Approved
WRN-2024-10235,TV-LCD-65-001,Power Issue,Approved"
β Vector: [0.234, -0.123, 0.456, ..., 0.789]
Stored in ChromaDB with metadata: {source: "warranty_claims.csv", row: 10234}
```
### Step 3: Query Processing (Real-time)
```
Query: "Which products have the highest warranty claims?"
β Embedding
Query Vector: [0.241, -0.119, 0.453, ..., 0.791]
β Hybrid Search
CSV Pipeline (40% weight):
ββ Vector Search: Finds chunks with similar embeddings
β β’ warranty_claims.csv chunks (high similarity)
β β’ product_catalog.csv chunks (moderate similarity)
ββ BM25 Search: Finds chunks with keywords "warranty", "claims"
ββ RRF Fusion β Top 5 CSV results
Text Pipeline (60% weight):
ββ Vector Search: quality_control_report.md sections
ββ BM25 Search: keyword matches
ββ RRF Fusion β Top 5 text results
β Final Ensemble (40% CSV + 60% Text)
Top 5 Final Results:
1. warranty_claims.csv rows showing TV-OLED-55-001 (12 claims)
2. warranty_claims.csv rows showing TV-LCD-65-001 (8 claims)
3. quality_control_report.md: "OLED panels show higher defect rate"
4. product_catalog.csv: Product specs for problematic models
5. warranty_claims.csv summary statistics
```
### Step 4: Answer Generation (Real-time)
```
Context (assembled from top 5 results):
"
Claim_ID: WRN-2024-10234, Product_ID: TV-OLED-55-001, Claim_Type: Dead Pixel...
Claim_ID: WRN-2024-10456, Product_ID: TV-OLED-55-001, Claim_Type: Screen Defect...
[Total: 12 claims for TV-OLED-55-001]
Claim_ID: WRN-2024-11789, Product_ID: TV-LCD-65-001, Claim_Type: Power Issue...
[Total: 8 claims for TV-LCD-65-001]
Quality Report: OLED panels from Q4 2024 batch show elevated defect rates...
"
β LLM Prompt
You are an assistant for question-answering tasks.
Use the following pieces of retrieved context to answer the question.
Context: [above context]
Question: Which products have the highest warranty claims?
β Llama 3.1 Generates Answer
Answer:
"Based on the warranty claims data, the TV-OLED-55-001 (OLED 55\" TV Premium)
has the highest number of claims with 12 total claims, primarily for dead pixels
and screen defects. The TV-LCD-65-001 follows with 8 claims mostly related to
power issues. Quality control reports indicate the Q4 2024 OLED batch has
elevated defect rates."
β Return to User
```
---
## Configuration System
### config/config.yaml
**Purpose:** Centralized configuration for all system parameters.
**Structure:**
```yaml
# Data sources
data:
directory: "data"
# Ollama settings
ollama:
base_url: "http://localhost:11434"
embedding_model: "nomic-embed-text"
llm_model: "llama3.1:latest"
# Vector database
vector_store:
persist_directory: "chroma_db"
# Document processing
document_processing:
use_separate_retrievers: true # Enable document-type-aware retrieval
text_chunk_size: 1000
text_chunk_overlap: 200
csv_chunk_size: 10
# Retrieval parameters
retrieval:
vector_search_k: 5
keyword_search_k: 5
csv_weight: 0.4
text_weight: 0.6
# Logging
logging:
level: "WARNING"
suppress_warnings: true
```
**Why YAML?**
- Human-readable
- Easy to edit without code changes
- Supports complex nested structures
- Version controllable
---
## Performance Characteristics
### Current System Performance (from boundary testing)
**Dataset:** 41,000+ records, 13 files, 6.8 MB total
#### Ingestion Phase
- **Document Loading:** 43,835 chunks in 4.71 seconds
- **Throughput:** 9,314 chunks/second
- **Memory:** 255 MB increase
- **Embedding Generation:** 43,835 embeddings in 554 seconds
- **Throughput:** 79 embeddings/second
- **Memory:** 646 MB increase
- **Rate-limiting factor:** Ollama API
#### Query Phase (Expected)
- **Retrieval Latency:** 100-500ms per query
- Vector search: ~50-100ms
- BM25 search: ~20-50ms
- Ensemble fusion: ~10-20ms
- **End-to-End QA:** 2-5 seconds
- Retrieval: 200-500ms
- LLM generation: 1.5-4.5 seconds (depends on answer length)
#### Scalability
- **Current Memory Usage:** ~1.1 GB total
- **Estimated Capacity:** ~150,000 documents with 8 GB RAM
- **Query Performance:** O(log n) - scales well
---
## Technology Stack
### Core Dependencies
**Python Libraries:**
```
langchain-core # RAG orchestration framework
langchain-community # Document loaders, vector stores
langchain-ollama # Ollama integration
langchain-classic # Retrieval chains
chromadb # Vector database
pydantic # Data validation
PyYAML # Configuration parsing
```
**External Services:**
```
Ollama # Local LLM runtime
ββ nomic-embed-text # Embedding model
ββ llama3.1:latest # Language model
```
**Document Processing:**
```
unstructured # Markdown parsing
pypdf # PDF parsing
python-docx # Word document parsing
```
---
## Design Decisions & Trade-offs
### 1. Why Hybrid Search?
**Decision:** Combine semantic (vector) + lexical (BM25) search
**Rationale:**
- Semantic search alone misses exact matches (product IDs, codes)
- BM25 alone misses semantic similarity ("OLED TV" vs "OLED television")
- Hybrid captures both
**Trade-off:**
- β
Better accuracy (typically 10-20% improvement)
- β οΈ More complex system
- β οΈ Higher computational cost
### 2. Why Document-Type-Aware Retrieval?
**Decision:** Separate pipelines for CSV vs text with weighted ensemble
**Rationale:**
- CSV data (structured) benefits from exact matching
- Text data (unstructured) benefits from semantic search
- Different optimal chunk sizes
**Trade-off:**
- β
Better retrieval accuracy for mixed datasets
- β οΈ More memory (two vector stores)
- β οΈ More complex configuration
### 3. Why Local LLM (Ollama)?
**Decision:** Use Ollama instead of cloud APIs (OpenAI, Anthropic)
**Rationale:**
- **Privacy:** Data never leaves local machine
- **Cost:** No per-token charges
- **Availability:** Works offline
- **Control:** Full model control
**Trade-off:**
- β
Privacy and cost benefits
- β οΈ Slower inference (CPU vs GPU cluster)
- β οΈ Lower quality than GPT-4 (but improving)
### 4. Why ChromaDB?
**Decision:** ChromaDB over alternatives (Pinecone, Weaviate, FAISS)
**Rationale:**
- Simple setup (embedded database)
- Persistent storage
- Good performance for < 1M vectors
- Open source
**Trade-off:**
- β
Easy to use, no server setup
- β οΈ Not ideal for distributed systems
- β οΈ Limited scaling vs cloud solutions
### 5. Why MCP Server?
**Decision:** Implement Model Context Protocol for Claude Desktop
**Rationale:**
- Seamless integration with Claude
- Standard protocol (works with other MCP clients)
- Tools appear natively in Claude UI
**Trade-off:**
- β
Best user experience in Claude Desktop
- β οΈ Requires Claude Desktop (not free)
- β οΈ Additional configuration step
---
## Extension Points
### How to Add New Document Types
**Example: Add Excel support**
```python
# In document_loader.py
from langchain_community.document_loaders import UnstructuredExcelLoader
self.supported_loaders = {
# ... existing loaders
'.xlsx': UnstructuredExcelLoader,
'.xls': UnstructuredExcelLoader,
}
```
### How to Add New Retrieval Strategies
**Example: Add semantic clustering**
```python
# Create clustered retriever
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainFilter
compressor = LLMChainFilter.from_llm(llm)
compressed_retriever = ContextualCompressionRetriever(
base_compressor=compressor,
base_retriever=hybrid_retriever
)
```
### How to Add New MCP Tools
**Example: Add document upload tool**
```python
# In mcp_server.py
@app.call_tool()
async def call_tool(name: str, arguments: dict) -> Sequence[TextContent | ImageContent]:
if name == "upload_document":
file_path = arguments["file_path"]
# Process and add to vector store
return [TextContent(type="text", text="Document uploaded successfully")]
```
---
## Security Considerations
### Data Privacy
- β
**All processing local** - no cloud API calls
- β
**Ollama runs locally** - embeddings never leave machine
- β
**ChromaDB local storage** - vectors stored on disk
### Input Validation
- β
**File type validation** - only allowed extensions processed
- β
**Error handling** - malformed files don't crash system
- β οΈ **No content sanitization** - assumes trusted documents
### Access Control
- β οΈ **No authentication** - MCP server trusts Claude Desktop
- β οΈ **No authorization** - all documents accessible to all queries
**Recommendations for Production:**
- Add file content scanning
- Implement user authentication
- Add query logging and auditing
- Rate limiting for queries
---
## Troubleshooting Guide
### Common Issues
**1. "No module named 'hybrid_rag'"**
```bash
# Ensure you're in project root
cd /path/to/hybrid-rag-project
# Activate virtual environment
source .venv/bin/activate
# Install in development mode
pip install -e .
```
**2. "Connection refused to Ollama"**
```bash
# Start Ollama server
ollama serve
# Verify models are pulled
ollama list
ollama pull nomic-embed-text
ollama pull llama3.1:latest
```
**3. "No documents found"**
```bash
# Check data directory
ls -la data/
# Verify file formats are supported
# .csv, .md, .txt, .pdf, .docx
```
**4. "ChromaDB database locked"**
```bash
# Stop any running processes
pkill -f mcp_server
pkill -f run_demo
# Remove lock file
rm chroma_db/*.lock
```
---
## Future Enhancements
### Planned Improvements
1. **Async Processing**
- Parallel embedding generation
- Concurrent retrieval from multiple sources
2. **Advanced Chunking**
- Semantic chunking (sentence transformers)
- Contextual chunk boundaries
3. **Query Optimization**
- Query rewriting/expansion
- Multi-query retrieval
4. **Caching Layer**
- Cache frequent queries
- Embedding cache for repeated documents
5. **Monitoring & Analytics**
- Query latency tracking
- Retrieval accuracy metrics
- Usage analytics dashboard
---
## Conclusion
This Hybrid RAG system combines the best of semantic and lexical search with local LLM inference to provide accurate, privacy-preserving question answering over your document collection.
**Key Strengths:**
- π **Privacy-first**: All processing local
- π― **Accurate**: Hybrid search outperforms single-method approaches
- π **Versatile**: Handles structured and unstructured data
- π **Scalable**: Tested with 40K+ documents
- π **Integrated**: Seamless Claude Desktop integration
**Performance Highlights:**
- **9,314 docs/sec** loading speed
- **79 embeddings/sec** with Ollama
- **< 500ms** query latency
- **~1 GB** memory for 40K documents
For more details on specific components, see the source code documentation in `src/hybrid_rag/`.