# Hybrid RAG System - Architecture Documentation
## Executive Summary
This is a **Retrieval-Augmented Generation (RAG)** system that combines **semantic search** (vector embeddings) with **lexical search** (BM25 keyword matching) to provide accurate, context-aware answers from your local document collection.
**Key Features:**
- **Hybrid Search**: Combines vector similarity + keyword matching
- **Multi-Format Support**: CSV, Markdown, PDF, DOCX, TXT
- **Local LLM**: Runs entirely locally using Ollama
- **Document-Type Aware**: Optimized handling for structured vs. unstructured data
- **Claude Desktop Integration**: MCP server for seamless AI assistant access
---
## System Architecture Overview
```
USER INTERFACE
├── Claude Desktop (MCP Client)              ──▶ JSON-RPC over stdio
└── Command Line (Demo): python run_demo.py  ──▶ direct Python calls
                     │
                     ▼
HYBRID RAG SYSTEM
│
├── MCP Server (Model Context Protocol)
│     • query_documents()   • list_documents()   • get_stats()
│
├── DOCUMENT PROCESSING LAYER
│     DocumentLoaderUtility
│       • Loads: CSV, MD, PDF, DOCX, TXT
│       • Automatic format detection
│       • Error handling & validation
│     Text Processing & Chunking
│       • RecursiveCharacterTextSplitter
│       • Text: 1000 chars, 200 overlap
│       • CSV: row-based (10 rows per chunk)
│       • Metadata preservation
│
├── RETRIEVAL LAYER (HYBRID SEARCH)
│     SEMANTIC SEARCH (vector/dense)        LEXICAL SEARCH (keyword/sparse)
│       ChromaDB                              BM25 Retriever
│       • Vector embeddings                   • TF-IDF scoring
│       • Similarity search                   • Exact match
│       • Persisted to disk                   • In-memory index
│       • k=5 results                         • k=5 results
│
│     Document Type Aware Retriever
│       • CSV pipeline (40% weight): semantic + BM25 ensemble,
│         better for structured queries
│       • Text pipeline (60% weight): semantic + BM25 ensemble,
│         better for natural language
│       • Reciprocal Rank Fusion (RRF) merges results from both pipelines
│
└── GENERATION LAYER (LLM)
      Ollama LLM (local)
        • Model: llama3.1:latest
        • Runs on localhost:11434
        • Context: retrieved document chunks
        • Prompt: question + context → answer
                     │
                     ▼
EXTERNAL DEPENDENCIES
  • Ollama Server (embeddings)
  • ChromaDB (vector storage)
  • LangChain (orchestration)
```
---
## Component Deep Dive
### 1. Document Ingestion Pipeline
#### 1.1 Document Loader (`DocumentLoaderUtility`)
**Purpose:** Load and parse documents from multiple file formats.
**Location:** `src/hybrid_rag/document_loader.py`
**Supported Formats:**
```python
{
    '.txt': TextLoader,                   # Plain text files
    '.pdf': PyPDFLoader,                  # PDF documents
    '.md': UnstructuredMarkdownLoader,    # Markdown files
    '.docx': Docx2txtLoader,              # Word documents
    '.csv': CSVLoader                     # CSV spreadsheets
}
```
**Process Flow:**
```
1. Scan data directory
2. Detect file type by extension
3. Select appropriate loader
4. Parse document content
5. Extract metadata (source, type, timestamp)
6. Return Document objects
```
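The loading flow above can be sketched in a few lines. This is an illustrative approximation of what `DocumentLoaderUtility` does; the function and variable names here are hypothetical, and the real implementation lives in `src/hybrid_rag/document_loader.py`:
```python
from pathlib import Path
from langchain_community.document_loaders import (
    CSVLoader, Docx2txtLoader, PyPDFLoader, TextLoader, UnstructuredMarkdownLoader,
)

SUPPORTED_LOADERS = {
    ".txt": TextLoader,
    ".pdf": PyPDFLoader,
    ".md": UnstructuredMarkdownLoader,
    ".docx": Docx2txtLoader,
    ".csv": CSVLoader,
}

def load_directory(data_dir: str = "data"):
    """Scan the data directory, pick a loader by extension, and collect Documents."""
    documents = []
    for path in Path(data_dir).rglob("*"):
        loader_cls = SUPPORTED_LOADERS.get(path.suffix.lower())
        if loader_cls is None:
            continue  # unsupported format, skip
        try:
            docs = loader_cls(str(path)).load()
            for doc in docs:
                doc.metadata.setdefault("source", str(path))
                doc.metadata["file_type"] = path.suffix.lower()
            documents.extend(docs)
        except Exception as exc:  # a malformed file should not abort the whole scan
            print(f"Skipping {path}: {exc}")
    return documents
```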
**Performance (from boundary testing):**
- **43,835 documents** loaded in **4.71 seconds**
- **9,314 documents/second** throughput
- **255 MB** memory increase
#### 1.2 Text Chunking
**Purpose:** Split large documents into manageable chunks for embedding and retrieval.
**Strategy:** Different chunking for different document types
**Text/Markdown Chunking:**
```python
RecursiveCharacterTextSplitter(
    chunk_size=1000,       # 1000 characters per chunk
    chunk_overlap=200,     # 200-character overlap between chunks
    separators=["\n\n", "\n", ". ", " ", ""]
)
```
**Why overlap?** Ensures context isn't lost at chunk boundaries.
**CSV Chunking:**
```python
csv_chunk_size = 10   # group 10 CSV rows per chunk instead of splitting by characters
```
**Why row-based?** Preserves table structure and relationships.
**Example:**
```
Original text (2,000 chars):
"The OLED TV market has grown significantly... [1,000 chars]
...in 2024 with many new models... [1,000 chars]"
After chunking (1,000-char chunks, 200-char overlap):
Chunk 1: "The OLED TV market has grown significantly..." (chars 0-1,000)
Chunk 2: "...has grown significantly in 2024 with many new models..." (starts 200 chars before the end of Chunk 1)
```
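Below is a minimal sketch of how the two strategies could be applied in one pass, consistent with the settings above; the helper name and row-grouping details are illustrative, not the project's actual API:
```python
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " ", ""],
)

def chunk_documents(docs, csv_rows_per_chunk=10):
    """Character-split text documents; group CSV row-documents into row chunks."""
    csv_docs, text_docs = [], []
    for d in docs:
        is_csv = d.metadata.get("source", "").endswith(".csv")
        (csv_docs if is_csv else text_docs).append(d)

    chunks = text_splitter.split_documents(text_docs)

    # CSVLoader emits one Document per row, so merge every N rows into one chunk
    for i in range(0, len(csv_docs), csv_rows_per_chunk):
        group = csv_docs[i:i + csv_rows_per_chunk]
        chunks.append(Document(
            page_content="\n".join(d.page_content for d in group),
            metadata=dict(group[0].metadata),  # keep the first row's metadata
        ))
    return chunks
```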
---
### 2. Embedding Generation
#### 2.1 Ollama Embeddings
**Model:** `nomic-embed-text`
**Purpose:** Convert text chunks into dense vector representations (embeddings).
**Dimensions:** 768 (typical for nomic-embed-text)
**How it works:**
```
Input Text: "OLED 55 inch TV with 4K resolution"
        ↓
   Ollama API call (nomic-embed-text)
        ↓
Output Vector: [0.234, -0.123, 0.456, ..., 0.789]  (768 dimensions)
```
**Why embeddings?**
- Capture **semantic meaning**, not just keywords
- Similar concepts have similar vectors (cosine similarity)
- "OLED television" ā "OLED TV" in vector space
**Performance:**
- **43,835 embeddings** generated in **554 seconds**
- **79 embeddings/second** throughput
- **646 MB** memory increase
---
### 3. Vector Storage (ChromaDB)
#### 3.1 ChromaDB
**Purpose:** Persistent vector database for similarity search.
**Location:** `chroma_db/` directory
**How it works:**
```python
# Store documents; Chroma embeds each one via the configured embedding function
vectorstore.add_documents([
    Document(page_content="OLED TV spec"),   # → stored with embedding [0.234, ...]
    Document(page_content="LCD TV spec"),    # → stored with embedding [0.891, ...]
    # ...
])

# Query by similarity: the query text is embedded the same way
# (e.g. [0.241, -0.119, ...], close to the stored OLED embedding),
# then ChromaDB returns the nearest neighbours
query = "Find OLED TVs"
results = vectorstore.similarity_search(query, k=5)
# Returns the 5 most similar documents by cosine distance
```
**Index Structure:**
- **HNSW (Hierarchical Navigable Small World)** graph for fast approximate search
- **Cosine similarity** metric
- **Persistent storage** to disk
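A hedged sketch of initializing the persistent store with the `langchain_community` Chroma wrapper; the collection name and sample documents are illustrative:
```python
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_ollama import OllamaEmbeddings

embeddings = OllamaEmbeddings(model="nomic-embed-text")

# Persisted to disk under chroma_db/; cosine distance for the HNSW index
vectorstore = Chroma(
    collection_name="hybrid_rag",
    embedding_function=embeddings,
    persist_directory="chroma_db",
    collection_metadata={"hnsw:space": "cosine"},
)

vectorstore.add_documents([
    Document(page_content="OLED 55 inch TV with 4K resolution",
             metadata={"source": "product_catalog.csv"}),
    Document(page_content="LCD 65 inch TV, power supply issue reported",
             metadata={"source": "warranty_claims_q4.csv"}),
])

results = vectorstore.similarity_search("Find OLED TVs", k=5)
```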
**Trade-offs:**
- ✅ Fast retrieval: O(log n) complexity
- ✅ Persistent across restarts
- ⚠️ Memory scales with dataset size
- ⚠️ Approximate (not exact) results
---
### 4. Hybrid Retrieval System
#### 4.1 Two-Pipeline Architecture
**Why Hybrid?** Combines strengths of both approaches:
| Semantic Search (Vectors) | Lexical Search (BM25) |
|---------------------------|----------------------|
| Understands meaning | Exact keyword matching |
| Finds similar concepts | Fast and precise |
| Handles synonyms | Good for specific terms |
| Fuzzy matching | Handles IDs, codes, numbers |
**Pipeline 1: CSV Documents**
```python
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import Chroma
from langchain.retrievers import EnsembleRetriever

# `docs` = loaded Document objects, `embeddings` = OllamaEmbeddings instance
csv_documents = [d for d in docs if d.metadata["source"].endswith(".csv")]

csv_vectorstore = Chroma.from_documents(csv_documents, embedding=embeddings)
csv_vector_retriever = csv_vectorstore.as_retriever(search_kwargs={"k": 5})

csv_bm25_retriever = BM25Retriever.from_documents(csv_documents)
csv_bm25_retriever.k = 5

csv_ensemble = EnsembleRetriever(
    retrievers=[csv_vector_retriever, csv_bm25_retriever],
    weights=[0.5, 0.5]
)
```
**Pipeline 2: Text/Markdown Documents**
```python
text_documents = [d for d in docs if d.metadata["source"].endswith((".md", ".txt", ".pdf"))]

text_vectorstore = Chroma.from_documents(text_documents, embedding=embeddings)
text_vector_retriever = text_vectorstore.as_retriever(search_kwargs={"k": 5})

text_bm25_retriever = BM25Retriever.from_documents(text_documents)
text_bm25_retriever.k = 5

text_ensemble = EnsembleRetriever(
    retrievers=[text_vector_retriever, text_bm25_retriever],
    weights=[0.5, 0.5]
)
```
**Final Ensemble (Document Type Aware):**
```python
final_retriever = EnsembleRetriever(
    retrievers=[csv_ensemble, text_ensemble],
    weights=[0.4, 0.6]  # 40% CSV, 60% Text
)
```
**Why 40/60 weighting?**
- Text documents typically contain more relevant narrative content
- CSV is supplementary structured data
- Tuned based on empirical testing
#### 4.2 Reciprocal Rank Fusion (RRF)
**Purpose:** Merge ranked results from multiple retrievers.
**Algorithm:**
```python
def reciprocal_rank_fusion(results_list, k=60):
    """
    Combine multiple ranked lists into a single ranking.
    """
    scores = {}
    for results in results_list:
        for rank, doc in enumerate(results):
            doc_id = doc.metadata['id']
            # RRF formula: 1 / (k + rank), with 1-based rank (enumerate is 0-based)
            score = 1.0 / (k + rank + 1)
            if doc_id in scores:
                scores[doc_id] += score
            else:
                scores[doc_id] = score
    # Sort by combined score
    ranked = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    return ranked
```
**Example:**
```
Vector Search Results:
1. Doc A (score: 0.95)
2. Doc B (score: 0.89)
3. Doc C (score: 0.82)
BM25 Search Results:
1. Doc C (score: 12.3)
2. Doc A (score: 9.8)
3. Doc D (score: 7.1)
RRF Fusion (k = 60, ranks counted from 1):
  Doc A: 1/(60+1) + 1/(60+2) = 0.0164 + 0.0161 = 0.0325
  Doc B: 1/(60+2)            = 0.0161
  Doc C: 1/(60+3) + 1/(60+1) = 0.0159 + 0.0164 = 0.0323
  Doc D: 1/(60+3)            = 0.0159
Final Ranking:
  1. Doc A (0.0325)
  2. Doc C (0.0323)
  3. Doc B (0.0161)
  4. Doc D (0.0159)
```
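Feeding the two ranked lists above into `reciprocal_rank_fusion` reproduces these numbers; the document IDs and the `make_doc` helper are illustrative:
```python
from langchain_core.documents import Document

def make_doc(doc_id: str) -> Document:
    return Document(page_content=f"content of {doc_id}", metadata={"id": doc_id})

vector_results = [make_doc("A"), make_doc("B"), make_doc("C")]
bm25_results = [make_doc("C"), make_doc("A"), make_doc("D")]

print(reciprocal_rank_fusion([vector_results, bm25_results]))
# [('A', 0.0325...), ('C', 0.0322...), ('B', 0.0161...), ('D', 0.0158...)]
```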
**Benefits:**
- Reduces bias toward any single retriever
- Promotes consensus results
- Handles different scoring scales
---
### 5. Question Answering Chain
#### 5.1 LLM Integration (Ollama)
**Model:** `llama3.1:latest` (8B parameters)
**Purpose:** Generate natural language answers from retrieved context.
**Prompt Template:**
```
You are an assistant for question-answering tasks.
Use the following pieces of retrieved context to answer the question.
If you don't know the answer, just say that you don't know.
Use three sentences maximum and keep the answer concise.
Context: {context}
Question: {input}
Answer:
```
**Process Flow:**
```
1. User asks: "What OLED TVs are available?"
2. Hybrid retriever finds 5 relevant chunks:
   - "TV-OLED-55-001, OLED 55" TV Premium, $1,299.99..."
   - "OLED panels offer perfect blacks and infinite contrast..."
   - "We stock OLED sizes: 42", 48", 55", 65", 77", 83"..."
   - "Customer feedback: OLED picture quality is exceptional..."
   - "Inventory: OLED 55" - 135 units in stock..."
3. Context assembled and sent to LLM:
   Prompt = Template + Context + Question
4. LLM generates answer:
   "We have OLED TVs available in multiple sizes: 42", 48", 55", 65", 77",
   and 83". The OLED 55" TV Premium is priced at $1,299.99 and we have
   135 units in stock. These models feature perfect blacks and exceptional
   picture quality."
```
**LangChain Chains:**
```python
# Document combination chain
combine_docs_chain = create_stuff_documents_chain(llm, prompt)
# Retrieval + QA chain
qa_chain = create_retrieval_chain(retriever, combine_docs_chain)
# Invoke
response = qa_chain.invoke({"input": "What OLED TVs are available?"})
answer = response['answer']
```
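The snippet above assumes `llm`, `prompt`, and `retriever` already exist. A minimal sketch of how the first two might be constructed; the imports reflect a recent LangChain layout, and `create_stuff_documents_chain`/`create_retrieval_chain` may live in `langchain` or `langchain-classic` depending on the installed version:
```python
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate
from langchain_ollama import ChatOllama

llm = ChatOllama(model="llama3.1:latest", base_url="http://localhost:11434")

prompt = ChatPromptTemplate.from_template(
    "You are an assistant for question-answering tasks.\n"
    "Use the following pieces of retrieved context to answer the question.\n"
    "If you don't know the answer, just say that you don't know.\n"
    "Use three sentences maximum and keep the answer concise.\n\n"
    "Context: {context}\n"
    "Question: {input}\n"
    "Answer:"
)
```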
**"Stuff" Strategy:**
- Concatenates all retrieved documents into a single prompt
- Simple and effective for moderate context sizes
- Alternative strategies: Map-Reduce, Refine (for very large contexts)
---
### 6. MCP Server (Claude Desktop Integration)
#### 6.1 Model Context Protocol
**Purpose:** Expose RAG system as tools for Claude Desktop.
**Location:** `scripts/mcp_server.py`
**Protocol:** JSON-RPC over stdio
**Available Tools:**
**1. query_documents**
```json
{
"name": "query_documents",
"description": "Search and answer questions from local documents",
"inputSchema": {
"query": "string (the question to ask)"
}
}
```
**2. list_documents**
```json
{
"name": "list_documents",
"description": "List all available documents in the system"
}
```
**3. get_stats**
```json
{
"name": "get_stats",
"description": "Get system statistics and document counts"
}
```
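A minimal sketch of how these three tools might be registered with the MCP Python SDK's stdio server; the tool bodies are placeholders, and `answer_query` stands in for the real RAG pipeline call:
```python
import asyncio
from mcp.server import Server
from mcp.server.stdio import stdio_server
from mcp.types import TextContent, Tool

app = Server("hybrid-rag-project")

def answer_query(query: str) -> str:
    """Placeholder for the real pipeline, e.g. qa_chain.invoke({"input": query})."""
    return f"(answer for: {query})"

@app.list_tools()
async def list_tools() -> list[Tool]:
    return [
        Tool(name="query_documents",
             description="Search and answer questions from local documents",
             inputSchema={"type": "object",
                          "properties": {"query": {"type": "string"}},
                          "required": ["query"]}),
        Tool(name="list_documents",
             description="List all available documents in the system",
             inputSchema={"type": "object", "properties": {}}),
        Tool(name="get_stats",
             description="Get system statistics and document counts",
             inputSchema={"type": "object", "properties": {}}),
    ]

@app.call_tool()
async def call_tool(name: str, arguments: dict) -> list[TextContent]:
    if name == "query_documents":
        return [TextContent(type="text", text=answer_query(arguments["query"]))]
    return [TextContent(type="text", text=f"Tool not implemented in this sketch: {name}")]

async def main():
    # JSON-RPC over stdio, as shown in the communication flow below
    async with stdio_server() as (read_stream, write_stream):
        await app.run(read_stream, write_stream, app.create_initialization_options())

if __name__ == "__main__":
    asyncio.run(main())
```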
**Communication Flow:**
```
Claude Desktop
  │  User: "What products are low in stock?"
  └─ Calls: query_documents("What products are low in stock?")
        │
        ▼
MCP Server (stdio)
  ├─ Receives JSON-RPC request
  ├─ Executes RAG pipeline
  └─ Returns answer + context
        │
        ▼
Claude Desktop
  └─ Shows answer to user with sources
```
**Configuration:** `~/Library/Application Support/Claude/claude_desktop_config.json`
```json
{
  "mcpServers": {
    "hybrid-rag-project": {
      "command": "/Users/.../python",
      "args": ["/Users/.../mcp_server.py"],
      "env": {}
    }
  }
}
```
---
## Data Flow Example (End-to-End)
**User Question:** "Which products have the highest warranty claims?"
### Step 1: Document Loading (Offline/Startup)
```
data/
├── warranty_claims_q4.csv      (3,000 rows loaded)
├── product_catalog.csv         (5,000 products loaded)
└── quality_control_report.md   (500 sections loaded)

→ 8,500 source documents
→ Chunked into ~2,000 chunks (CSV row groups + text chunks)
```
### Step 2: Embedding Generation (Offline/Startup)
```
Each chunk → Ollama nomic-embed-text → 768-dimensional vector

Chunk:  "Claim_ID,Product_ID,Claim_Type,Status
         WRN-2024-10234,TV-OLED-55-001,Dead Pixel,Approved
         WRN-2024-10235,TV-LCD-65-001,Power Issue,Approved"
    ↓
Vector: [0.234, -0.123, 0.456, ..., 0.789]

Stored in ChromaDB with metadata: {source: "warranty_claims.csv", row: 10234}
```
### Step 3: Query Processing (Real-time)
```
Query: "Which products have the highest warranty claims?"
ā Embedding
Query Vector: [0.241, -0.119, 0.453, ..., 0.791]
ā Hybrid Search
CSV Pipeline (40% weight):
āā Vector Search: Finds chunks with similar embeddings
ā ⢠warranty_claims.csv chunks (high similarity)
ā ⢠product_catalog.csv chunks (moderate similarity)
āā BM25 Search: Finds chunks with keywords "warranty", "claims"
āā RRF Fusion ā Top 5 CSV results
Text Pipeline (60% weight):
āā Vector Search: quality_control_report.md sections
āā BM25 Search: keyword matches
āā RRF Fusion ā Top 5 text results
ā Final Ensemble (40% CSV + 60% Text)
Top 5 Final Results:
1. warranty_claims.csv rows showing TV-OLED-55-001 (12 claims)
2. warranty_claims.csv rows showing TV-LCD-65-001 (8 claims)
3. quality_control_report.md: "OLED panels show higher defect rate"
4. product_catalog.csv: Product specs for problematic models
5. warranty_claims.csv summary statistics
```
### Step 4: Answer Generation (Real-time)
```
Context (assembled from top 5 results):
"
Claim_ID: WRN-2024-10234, Product_ID: TV-OLED-55-001, Claim_Type: Dead Pixel...
Claim_ID: WRN-2024-10456, Product_ID: TV-OLED-55-001, Claim_Type: Screen Defect...
[Total: 12 claims for TV-OLED-55-001]
Claim_ID: WRN-2024-11789, Product_ID: TV-LCD-65-001, Claim_Type: Power Issue...
[Total: 8 claims for TV-LCD-65-001]
Quality Report: OLED panels from Q4 2024 batch show elevated defect rates...
"
   ↓  LLM prompt
You are an assistant for question-answering tasks.
Use the following pieces of retrieved context to answer the question.
Context: [above context]
Question: Which products have the highest warranty claims?
   ↓  Llama 3.1 generates answer
Answer:
"Based on the warranty claims data, the TV-OLED-55-001 (OLED 55" TV Premium)
has the highest number of claims with 12 total claims, primarily for dead pixels
and screen defects. The TV-LCD-65-001 follows with 8 claims, mostly related to
power issues. Quality control reports indicate the Q4 2024 OLED batch has
elevated defect rates."
   ↓  return to user
```
---
## Configuration System
### config/config.yaml
**Purpose:** Centralized configuration for all system parameters.
**Structure:**
```yaml
# Data sources
data:
  directory: "data"

# Ollama settings
ollama:
  base_url: "http://localhost:11434"
  embedding_model: "nomic-embed-text"
  llm_model: "llama3.1:latest"

# Vector database
vector_store:
  persist_directory: "chroma_db"

# Document processing
document_processing:
  use_separate_retrievers: true  # Enable document-type-aware retrieval
  text_chunk_size: 1000
  text_chunk_overlap: 200
  csv_chunk_size: 10

# Retrieval parameters
retrieval:
  vector_search_k: 5
  keyword_search_k: 5
  csv_weight: 0.4
  text_weight: 0.6

# Logging
logging:
  level: "WARNING"
  suppress_warnings: true
```
**Why YAML?**
- Human-readable
- Easy to edit without code changes
- Supports complex nested structures
- Version controllable
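Loading and reading this file is a one-liner with PyYAML; the loader function name below is illustrative, not the project's actual API:
```python
import yaml

def load_config(path: str = "config/config.yaml") -> dict:
    """Read the YAML config into a nested dict."""
    with open(path, "r", encoding="utf-8") as fh:
        return yaml.safe_load(fh)

config = load_config()
text_chunk_size = config["document_processing"]["text_chunk_size"]  # 1000
csv_weight = config["retrieval"]["csv_weight"]                      # 0.4
```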
---
## Performance Characteristics
### Current System Performance (from boundary testing)
**Dataset:** 41,000+ records, 13 files, 6.8 MB total
#### Ingestion Phase
- **Document Loading:** 43,835 chunks in 4.71 seconds
- **Throughput:** 9,314 chunks/second
- **Memory:** 255 MB increase
- **Embedding Generation:** 43,835 embeddings in 554 seconds
- **Throughput:** 79 embeddings/second
- **Memory:** 646 MB increase
- **Rate-limiting factor:** Ollama API
#### Query Phase (Expected)
- **Retrieval Latency:** 100-500ms per query
- Vector search: ~50-100ms
- BM25 search: ~20-50ms
- Ensemble fusion: ~10-20ms
- **End-to-End QA:** 2-5 seconds
- Retrieval: 200-500ms
- LLM generation: 1.5-4.5 seconds (depends on answer length)
#### Scalability
- **Current Memory Usage:** ~1.1 GB total
- **Estimated Capacity:** ~150,000 documents with 8 GB RAM
- **Query Performance:** O(log n) - scales well
---
## Technology Stack
### Core Dependencies
**Python Libraries:**
```
langchain-core # RAG orchestration framework
langchain-community # Document loaders, vector stores
langchain-ollama # Ollama integration
langchain-classic # Retrieval chains
chromadb # Vector database
pydantic # Data validation
PyYAML # Configuration parsing
```
**External Services:**
```
Ollama                  # Local LLM runtime
├── nomic-embed-text    # Embedding model
└── llama3.1:latest     # Language model
```
**Document Processing:**
```
unstructured # Markdown parsing
pypdf # PDF parsing
python-docx # Word document parsing
```
---
## Design Decisions & Trade-offs
### 1. Why Hybrid Search?
**Decision:** Combine semantic (vector) + lexical (BM25) search
**Rationale:**
- Semantic search alone misses exact matches (product IDs, codes)
- BM25 alone misses semantic similarity ("OLED TV" vs "OLED television")
- Hybrid captures both
**Trade-off:**
- ✅ Better accuracy (typically 10-20% improvement)
- ⚠️ More complex system
- ⚠️ Higher computational cost
### 2. Why Document-Type-Aware Retrieval?
**Decision:** Separate pipelines for CSV vs text with weighted ensemble
**Rationale:**
- CSV data (structured) benefits from exact matching
- Text data (unstructured) benefits from semantic search
- Different optimal chunk sizes
**Trade-off:**
- ✅ Better retrieval accuracy for mixed datasets
- ⚠️ More memory (two vector stores)
- ⚠️ More complex configuration
### 3. Why Local LLM (Ollama)?
**Decision:** Use Ollama instead of cloud APIs (OpenAI, Anthropic)
**Rationale:**
- **Privacy:** Data never leaves local machine
- **Cost:** No per-token charges
- **Availability:** Works offline
- **Control:** Full model control
**Trade-off:**
- ✅ Privacy and cost benefits
- ⚠️ Slower inference (CPU vs GPU cluster)
- ⚠️ Lower quality than GPT-4 (but improving)
### 4. Why ChromaDB?
**Decision:** ChromaDB over alternatives (Pinecone, Weaviate, FAISS)
**Rationale:**
- Simple setup (embedded database)
- Persistent storage
- Good performance for < 1M vectors
- Open source
**Trade-off:**
- ✅ Easy to use, no server setup
- ⚠️ Not ideal for distributed systems
- ⚠️ Limited scaling vs cloud solutions
### 5. Why MCP Server?
**Decision:** Implement Model Context Protocol for Claude Desktop
**Rationale:**
- Seamless integration with Claude
- Standard protocol (works with other MCP clients)
- Tools appear natively in Claude UI
**Trade-off:**
- ✅ Best user experience in Claude Desktop
- ⚠️ Requires Claude Desktop (not free)
- ⚠️ Additional configuration step
---
## Extension Points
### How to Add New Document Types
**Example: Add Excel support**
```python
# In document_loader.py, inside the DocumentLoaderUtility class
from langchain_community.document_loaders import UnstructuredExcelLoader

self.supported_loaders = {
    # ... existing loaders
    '.xlsx': UnstructuredExcelLoader,
    '.xls': UnstructuredExcelLoader,
}
```
### How to Add New Retrieval Strategies
**Example: Add LLM-based contextual compression (filter retrieved chunks)**
```python
# Wrap the hybrid retriever so an LLM filters out irrelevant chunks before generation
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainFilter

compressor = LLMChainFilter.from_llm(llm)
compressed_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=hybrid_retriever
)
```
### How to Add New MCP Tools
**Example: Add document upload tool**
```python
# In mcp_server.py
from collections.abc import Sequence
from mcp.types import TextContent, ImageContent

@app.call_tool()
async def call_tool(name: str, arguments: dict) -> Sequence[TextContent | ImageContent]:
    if name == "upload_document":
        file_path = arguments["file_path"]
        # Load, chunk, embed, and add the file to the vector store here
        return [TextContent(type="text", text="Document uploaded successfully")]
```
---
## Security Considerations
### Data Privacy
- ✅ **All processing local** - no cloud API calls
- ✅ **Ollama runs locally** - embeddings never leave machine
- ✅ **ChromaDB local storage** - vectors stored on disk
### Input Validation
- ✅ **File type validation** - only allowed extensions processed
- ✅ **Error handling** - malformed files don't crash system
- ⚠️ **No content sanitization** - assumes trusted documents
### Access Control
- ⚠️ **No authentication** - MCP server trusts Claude Desktop
- ⚠️ **No authorization** - all documents accessible to all queries
**Recommendations for Production:**
- Add file content scanning
- Implement user authentication
- Add query logging and auditing
- Rate limiting for queries
---
## Troubleshooting Guide
### Common Issues
**1. "No module named 'hybrid_rag'"**
```bash
# Ensure you're in project root
cd /path/to/hybrid-rag-project
# Activate virtual environment
source .venv/bin/activate
# Install in development mode
pip install -e .
```
**2. "Connection refused to Ollama"**
```bash
# Start Ollama server
ollama serve
# Verify models are pulled
ollama list
ollama pull nomic-embed-text
ollama pull llama3.1:latest
```
**3. "No documents found"**
```bash
# Check data directory
ls -la data/
# Verify file formats are supported
# .csv, .md, .txt, .pdf, .docx
```
**4. "ChromaDB database locked"**
```bash
# Stop any running processes
pkill -f mcp_server
pkill -f run_demo
# Remove lock file
rm chroma_db/*.lock
```
---
## Future Enhancements
### Planned Improvements
1. **Async Processing**
- Parallel embedding generation
- Concurrent retrieval from multiple sources
2. **Advanced Chunking**
- Semantic chunking (sentence transformers)
- Contextual chunk boundaries
3. **Query Optimization**
- Query rewriting/expansion
- Multi-query retrieval
4. **Caching Layer**
- Cache frequent queries
- Embedding cache for repeated documents
5. **Monitoring & Analytics**
- Query latency tracking
- Retrieval accuracy metrics
- Usage analytics dashboard
---
## Conclusion
This Hybrid RAG system combines the best of semantic and lexical search with local LLM inference to provide accurate, privacy-preserving question answering over your document collection.
**Key Strengths:**
- **Privacy-first**: All processing local
- **Accurate**: Hybrid search outperforms single-method approaches
- **Versatile**: Handles structured and unstructured data
- **Scalable**: Tested with 40K+ documents
- **Integrated**: Seamless Claude Desktop integration
**Performance Highlights:**
- **9,314 docs/sec** loading speed
- **79 embeddings/sec** with Ollama
- **< 500 ms** retrieval latency (2-5 s end-to-end with LLM generation)
- **~1 GB** memory for 40K documents
For more details on specific components, see the source code documentation in `src/hybrid_rag/`.