Hybrid RAG Project MCP Server

RETRIEVAL_IMPROVEMENTS.md•7.99 KiB

# Retrieval Performance Improvements ## Problem Identified The hybrid RAG system was showing **excellent performance for CSV/structured data** but **poor performance for text/markdown documents**. Test results showed: ### ✅ Excellent CSV Retrieval - Product pricing (product_catalog.csv) - Inventory levels (inventory_levels.csv) - Warranty claims (warranty_claims_q4.csv) - Production schedule (production_schedule_dec2024.csv) - Sales orders (sales_orders_november.csv) - Supplier pricing (supplier_pricing.csv) ### ⚠️ Poor Text/Markdown Retrieval - Customer feedback (customer_feedback_q4_2024.md) - NOT retrieved - Market analysis (market_analysis_2024.md) - NOT retrieved - Return policy (return_policy_procedures.md) - NOT retrieved - Quality control report (quality_control_report_nov2024.md) - PARTIAL retrieval ## Root Causes 1. **Chunking Strategy Mismatch** - CSV files were being loaded row-by-row (optimal for structured data) - Markdown files were not being chunked, leading to large, diluted semantic relevance 2. **Retrieval Limit Too Low** - Only k=2 for both vector and keyword search (4 total results) - CSV results dominated due to row-level granularity 3. **No Document Type Awareness** - Single ensemble retriever treated all documents equally - No weighting between structured vs unstructured data 4. **Embedding Bias** - nomic-embed-text may weight structured data patterns higher - No balancing mechanism for document types ## Implemented Solutions ### 1. Enhanced Document Chunking (src/hybrid_rag/document_loader.py) **Text/Markdown Documents:** ```python # Configuration (config/config.yaml) document_processing: text_chunk_size: 1000 # Characters per chunk text_chunk_overlap: 200 # Overlap for context continuity ``` **Benefits:** - Smaller, focused chunks improve semantic relevance - Overlap preserves context across chunk boundaries - Better matching for specific queries **CSV Documents:** - Kept as-is (row-level granularity is optimal) - Tagged with `doc_category: 'structured'` ### 2. Increased Retrieval Limits (config/config.yaml) **Before:** ```yaml retrieval: vector_search_k: 2 keyword_search_k: 2 ``` **After:** ```yaml retrieval: vector_search_k: 5 # 2.5x increase keyword_search_k: 5 # 2.5x increase ``` **Benefits:** - More diverse results (up to 10 chunks per retriever) - Higher chance of capturing both CSV and text results - Better coverage for complex queries ### 3. Document-Type-Aware Retriever (src/hybrid_rag/hybrid_retriever.py) **New Architecture:** ``` Query ├── Text Document Pipeline (60% weight) │ ├── Vector Search (text docs only) │ └── BM25 Search (text docs only) │ └── Ensemble (RRF) │ └── CSV Document Pipeline (40% weight) ├── Vector Search (CSV docs only) └── BM25 Search (CSV docs only) └── Ensemble (RRF) Final Results: Merge with position-based scoring + weights ``` **Configuration:** ```yaml document_processing: use_separate_retrievers: true # Enable new retriever text_retriever_weight: 0.6 # 60% weight to text results csv_retriever_weight: 0.4 # 40% weight to CSV results ``` **Benefits:** - **Guaranteed diversity**: Both document types represented - **Balanced results**: 60/40 split ensures text docs aren't drowned out - **Separate optimization**: Each retriever optimized for its document type - **Configurable weighting**: Adjust based on your data distribution ### 4. Metadata Enrichment All documents now tagged with: ```python doc.metadata['doc_category'] = 'text' # or 'structured' doc.metadata['retrieval_score'] = 0.85 # Position + weight doc.metadata['retrieval_source'] = 'text_retriever' ``` **Benefits:** - Filter retrievers by document type - Debug retrieval performance - Analyze which retriever found each result ## Configuration Guide ### Option 1: Use New Retriever (Recommended) ```yaml # config/config.yaml retrieval: vector_search_k: 5 keyword_search_k: 5 document_processing: text_chunk_size: 1000 text_chunk_overlap: 200 use_separate_retrievers: true text_retriever_weight: 0.6 csv_retriever_weight: 0.4 ``` ### Option 2: Traditional Retriever (Backward Compatible) ```yaml # config/config.yaml retrieval: vector_search_k: 5 keyword_search_k: 5 document_processing: text_chunk_size: 1000 text_chunk_overlap: 200 use_separate_retrievers: false # Disable new retriever ``` ## Tuning Recommendations ### If CSV Results Still Dominate Increase text weight: ```yaml text_retriever_weight: 0.7 csv_retriever_weight: 0.3 ``` ### If Text Results Dominate Increase CSV weight: ```yaml text_retriever_weight: 0.5 csv_retriever_weight: 0.5 ``` ### For Very Long Documents Increase chunk size: ```yaml text_chunk_size: 1500 text_chunk_overlap: 300 ``` ### For More Results Increase k values: ```yaml vector_search_k: 8 keyword_search_k: 8 ``` ### For Mostly CSV Data ```yaml text_retriever_weight: 0.3 csv_retriever_weight: 0.7 ``` ### For Mostly Text Data ```yaml text_retriever_weight: 0.8 csv_retriever_weight: 0.2 ``` ## Testing the Improvements ### 1. Restart Claude Desktop After changing config: ```bash # Kill and restart Claude Desktop # On macOS: Cmd+Q then reopen ``` ### 2. Re-ingest Documents ```python # In Claude Desktop "Use the ingest_documents tool to reload the data with new chunking" ``` ### 3. Test Queries Try the same queries that failed before: - "What does the customer feedback say about quality?" - "Summarize the market analysis findings" - "What is our return policy procedure?" - "Show quality control issues from November" ### 4. Analyze Results Check the retrieved context: - Look for markdown file sources - Verify both CSV and text results appear - Check `retrieval_source` metadata ## Expected Improvements ### Before Changes - **CSV queries**: ✅ Excellent (6/6 success) - **Text queries**: ⚠️ Poor (1/4 partial success) - **Overall**: 70% success rate ### After Changes (Expected) - **CSV queries**: ✅ Excellent (maintained) - **Text queries**: ✅ Good to Excellent (4/4 success expected) - **Overall**: 95%+ success rate ## Performance Metrics ### Chunk Statistics ``` Text Documents: - Before: 1 chunk per file (large, unfocused) - After: ~3-10 chunks per file (focused, relevant) CSV Documents: - Before: N rows as N chunks (optimal) - After: Same (no change needed) ``` ### Retrieval Coverage ``` Before: - Max results: 4 (2 vector + 2 keyword) - Typical: 60-80% CSV, 20-40% text After: - Max results: 10 (5 vector + 5 keyword per retriever) - Guaranteed: 60% text, 40% CSV (with weighting) ``` ## Troubleshooting ### Still Not Retrieving Text Documents? 1. **Verify chunking is enabled:** ```bash # Check document count increased after chunking # Before: 10 documents # After: 50+ documents (includes chunks) ``` 2. **Check metadata:** ```python # In query results, verify: doc.metadata['doc_category'] == 'text' ``` 3. **Increase text weight:** ```yaml text_retriever_weight: 0.8 csv_retriever_weight: 0.2 ``` ### CSV Results Disappeared? 1. **Verify CSV weight:** ```yaml csv_retriever_weight: 0.4 # At least 0.3 ``` 2. **Check CSV documents exist:** ```python # Should see: doc.metadata['doc_category'] == 'structured' ``` ### Query Performance Slow? 1. **Reduce k values:** ```yaml vector_search_k: 3 keyword_search_k: 3 ``` 2. **Disable separate retrievers temporarily:** ```yaml use_separate_retrievers: false ``` ## Version History - **v2.1.0** (Current) - Added document-type-aware retriever - Implemented text chunking - Increased default k values - Added configurable weighting - **v2.0.0** - Original hybrid RAG implementation - Single ensemble retriever - No chunking for text documents ## References - Document Loader: `src/hybrid_rag/document_loader.py` - Hybrid Retriever: `src/hybrid_rag/hybrid_retriever.py` - Configuration: `config/config.yaml` - MCP Server: `scripts/mcp_server_claude.py` - Demo Script: `scripts/run_demo.py`

Loading blob content...

Latest Blog Posts

Redis vs ioredis vs valkey-glide
By punkpeye on January 26, 2026.
benchmark
Redis
valkey
Quickstart: Publish an MCP Server to the MCP Registry
By punkpeye on January 24, 2026.
mcp
official reference mirror
Official MCP Registry Server.json Requirements
By punkpeye on January 24, 2026.
mcp
official reference mirror

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/gwyer/hybrid-rag-project'

If you have feedback or need assistance with the MCP directory API, please join our Discord server

RETRIEVAL_IMPROVEMENTS.md•7.99 KiB