# BM25 Explained: Keyword Search in the Hybrid RAG System
## What is BM25?
**BM25** (Best Matching 25) is a **ranking algorithm** used for keyword-based search. It is the industry standard for lexical (word-based) search and the default ranking function in search engines and libraries such as Lucene and Elasticsearch.
**Key Point:** BM25 finds documents that contain your **exact search terms** and ranks them by relevance.
---
## BM25 vs Vector Search
Your hybrid RAG system uses **both** methods for better results:
### Vector Search (Semantic)
- **What it does:** Understands meaning and context
- **How it works:** Converts text to numerical vectors (embeddings)
- **Example:** Query "television" finds documents about "TV", "display", "screen"
- **Strength:** Finds similar concepts even with different words
- **Weakness:** May miss exact keyword matches
### BM25 Search (Lexical)
- **What it does:** Finds exact keyword matches
- **How it works:** Counts word occurrences and frequencies
- **Example:** Query "OLED-55-001" finds that exact product ID
- **Strength:** Perfect for exact terms, IDs, product codes
- **Weakness:** Doesn't understand synonyms or meaning
### Hybrid = Best of Both
By combining both, you get:
- ✅ Exact matches for specific terms (BM25)
- ✅ Semantic understanding for concepts (Vector)
- ✅ Better overall retrieval accuracy
---
## How BM25 Works (Simple Explanation)
BM25 scores documents based on:
### 1. **Term Frequency (TF)**
*"How often does the search term appear in this document?"*
- Document with "OLED" mentioned 5 times scores higher than one with "OLED" mentioned once
- But there are **diminishing returns** - 10 mentions isn't 10x better than 1 mention
### 2. **Inverse Document Frequency (IDF)**
*"How rare is this search term across all documents?"*
- Rare terms (like "TV-OLED-55-001") are more valuable
- Common terms (like "the", "and") are less valuable
- This helps find **distinctive** matches
### 3. **Document Length Normalization**
*"Account for document size"*
- Longer documents naturally contain more words
- BM25 normalizes scores so shorter, focused documents aren't penalized
---
## BM25 Formula (For Reference)
```
Score(D,Q) = Σ IDF(qi) × (f(qi,D) × (k1 + 1)) / (f(qi,D) + k1 × (1 - b + b × |D|/avgdl))
```
**Don't worry about the math!** Just know:
- **IDF(qi)** = How rare is query term qi?
- **f(qi,D)** = How often does qi appear in document D?
- **|D|/avgdl** = Document length compared to average
- **k1, b** = Tuning parameters (typically k1=1.5, b=0.75)
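To make the formula concrete, here is a minimal, self-contained Python sketch that scores a toy corpus using the same TF, IDF, and length-normalization terms described above. It is for illustration only; the project itself relies on LangChain's `BM25Retriever`, and real implementations differ slightly in the exact IDF variant they use:
```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document for the query using the BM25 formula above."""
    tokenized = [doc.lower().split() for doc in docs]
    avgdl = sum(len(d) for d in tokenized) / len(tokenized)
    n = len(tokenized)

    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            # IDF: terms that are rare across the corpus get a higher weight
            df = sum(1 for d in tokenized if term in d)
            idf = math.log((n - df + 0.5) / (df + 0.5) + 1)
            # TF with saturation plus document-length normalization
            f = tf[term]
            score += idf * (f * (k1 + 1)) / (f + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(score)
    return scores

docs = [
    "OLED TV models feature deep blacks",
    "LED TV prices dropped this quarter",
    "Warranty policy for all televisions",
]
print(bm25_scores("OLED TV prices", docs))
```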
---
## How BM25 is Used in This Project
### Implementation Location
**File:** `src/hybrid_rag/hybrid_retriever.py`
### Code Snippets
**Creating BM25 Retrievers:**
```python
from langchain_community.retrievers import BM25Retriever

# Create a BM25 retriever for each document type
text_bm25_retriever = BM25Retriever.from_documents(text_docs)
text_bm25_retriever.k = 5  # Return top 5 results

csv_bm25_retriever = BM25Retriever.from_documents(csv_docs)
csv_bm25_retriever.k = 5
```
**Hybrid Retrieval Process:**
```python
# This project creates SEPARATE BM25 retrievers for:
# 1. Text/Markdown documents
# 2. CSV documents
# Each is combined with its corresponding vector retriever
text_retriever = EnsembleRetriever(
    retrievers=[text_vector_retriever, text_bm25_retriever],
    weights=[0.5, 0.5]  # Equal weighting
)

csv_retriever = EnsembleRetriever(
    retrievers=[csv_vector_retriever, csv_bm25_retriever],
    weights=[0.5, 0.5]  # Equal weighting
)
```
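Once the ensemble retrievers exist, they are queried like any other LangChain retriever. A small usage sketch, assuming the variable names from the snippet above and a LangChain version that exposes the `invoke` interface:
```python
# Query the text-side ensemble; BM25 and vector results are fused internally
docs = text_retriever.invoke("customer complaints about delivery")

for doc in docs:
    # Each result is a LangChain Document with page content and metadata
    print(doc.metadata.get("source"), "->", doc.page_content[:80])
```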
---
## Architecture Flow
```
User Query: "What is product TV-OLED-55-001?"
                    ↓
┌────────────────────────────────────────┐
│     Document Type Aware Retriever      │
└────────────────────────────────────────┘
         ↓                     ↓
┌──────────────────┐  ┌──────────────────┐
│    TEXT DOCS     │  │     CSV DOCS     │
└──────────────────┘  └──────────────────┘
         ↓                     ↓
┌──────────────────┐  ┌──────────────────┐
│   Vector (50%)   │  │   Vector (50%)   │
│    BM25 (50%)    │  │    BM25 (50%)    │
└──────────────────┘  └──────────────────┘
         ↓                     ↓
┌──────────────────┐  ┌──────────────────┐
│  5 text results  │  │  5 CSV results   │
└──────────────────┘  └──────────────────┘
         ↓                     ↓
┌────────────────────────────────────────┐
│          Merge with Weighting          │
│          Text: 60%, CSV: 40%           │
└────────────────────────────────────────┘
                    ↓
          Top 10 Final Results
```
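The final merge step (Text 60% / CSV 40%) happens inside the project's document-type-aware retriever; the exact code isn't reproduced here, but conceptually it behaves like this hypothetical sketch:
```python
def merge_by_type(text_docs, csv_docs, text_weight=0.6, csv_weight=0.4, top_k=10):
    """Hypothetical sketch: fuse two already-ranked lists with per-type weights.

    Each list comes from its own ensemble retriever; documents are scored by
    weighted reciprocal rank and the top_k highest-scoring ones are kept.
    """
    scored = []
    for weight, docs in ((text_weight, text_docs), (csv_weight, csv_docs)):
        for rank, doc in enumerate(docs, start=1):
            scored.append((weight / (60 + rank), doc))
    # Highest fused score first
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]
```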
---
## Why This Project Uses Separate BM25 Retrievers
### Design Decision: Split by Document Type
**Text/Markdown BM25:**
- Searches: `.md`, `.txt`, `.pdf`, `.docx` files
- Good for: Concepts, descriptions, policies, feedback
- Example query: "customer complaints about delivery"
**CSV BM25:**
- Searches: `.csv` files only
- Good for: Exact values, product IDs, numbers, dates
- Example query: "TV-OLED-55-001" or "ORD-12345"
### Why Split?
1. **Different Content Structures:**
- Text: Natural language, paragraphs
- CSV: Structured data, tabular format
2. **Different Search Needs:**
- Text: Semantic understanding matters more
- CSV: Exact matches matter more
3. **Better Relevance:**
- Prevents CSV rows from overwhelming text results
- Prevents verbose text from overwhelming precise data
---
## Configuration
BM25 behavior is controlled in `config/config.yaml`:
```yaml
retrieval:
  vector_search_k: 5    # Top 5 from vector search
  keyword_search_k: 5   # Top 5 from BM25 search

document_processing:
  # Weighting between document types
  csv_retriever_weight: 0.4   # 40% weight to CSV results
  text_retriever_weight: 0.6  # 60% weight to text results
```
**What this means:**
- Each retriever (vector + BM25) returns 5 documents
- BM25 and vector are weighted equally (50/50) within each document type
- But CSV results get 40% overall weight vs 60% for text results
- Final output: Top 10 combined results
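As a rough illustration of how these values might be read and wired up (the loading code below is an assumption for demonstration, not the project's actual loader):
```python
import yaml

# Load the retrieval settings shown above
with open("config/config.yaml") as f:
    config = yaml.safe_load(f)

keyword_k = config["retrieval"]["keyword_search_k"]                    # 5
csv_weight = config["document_processing"]["csv_retriever_weight"]    # 0.4
text_weight = config["document_processing"]["text_retriever_weight"]  # 0.6
```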
---
## Example: BM25 in Action
### Query: "OLED TV prices"
**BM25 Text Search:**
```
Document: product_specifications.txt
Match: "...OLED TV models feature..."
Score: 3.2 (found "OLED" and "TV")
Document: market_analysis_2024.md
Match: "...OLED TV market prices are..."
Score: 4.1 (found "OLED", "TV", "prices")
```
**BM25 CSV Search:**
```
Document: product_catalog.csv row 42
Match: "TV-OLED-55-001, OLED TV 55\", $1299.99"
Score: 5.8 (found "OLED" and "TV", exact product match)
Document: supplier_pricing.csv row 108
Match: "PANEL-OLED-55, OLED panel, $450.00"
Score: 3.9 (found "OLED")
```
**Combined with Vector Search:**
- Vector search finds semantically similar docs (might find "television display" even without exact words)
- BM25 finds exact keyword matches
- Results merged and ranked
- Best of both worlds!
---
## When BM25 Shines
BM25 is **especially good** for:
### ✅ Exact Product IDs
```
Query: "TV-OLED-55-001"
→ BM25 finds exact match immediately
→ Vector search might miss this (no semantic meaning in ID)
```
### ✅ Specific Numbers
```
Query: "order ORD-12345"
→ BM25 finds the exact order number
```
### ✅ Technical Terms
```
Query: "BM25 algorithm"
→ BM25 finds exact technical term
→ Vector might find "ranking algorithm" (close, but not exact)
```
### ✅ Rare/Unique Terms
```
Query: "chromaticity"
→ Rare technical term - BM25's IDF scoring helps
```
---
## When BM25 Struggles
BM25 has **limitations** with:
### ❌ Synonyms
```
Query: "television"
→ BM25 won't find docs that only say "TV"
→ Vector search handles this better
```
### ❌ Typos
```
Query: "OELD TV" (typo: OELD instead of OLED)
→ BM25 misses the match
→ Vector search might still find it (embeddings can be more forgiving of minor typos)
```
### ❌ Paraphrasing
```
Query: "cheap displays"
→ BM25 won't find "affordable screens" or "budget monitors"
→ Vector search understands these are similar concepts
```
### ❌ Questions
```
Query: "What products are in stock?"
→ BM25 looks for "what", "products", "stock" - common words, low IDF
→ Vector search understands the intent better
```
**This is why we use BOTH!**
---
## Performance Characteristics
### BM25 Advantages:
- ⚡ **Fast**: No neural network computation needed
- 💾 **Lightweight**: Just word counts and frequencies
- 🎯 **Precise**: Perfect for exact term matching
- 📊 **Interpretable**: You can see exactly why a document matched
### BM25 Limitations:
- 🔤 **Literal**: Only finds exact words
- 🌐 **No Context**: Doesn't understand meaning
- 📝 **Vocabulary Gap**: Misses synonyms and related terms
### Hybrid (BM25 + Vector) Benefits:
- ✅ Best of both worlds
- ✅ Handles both exact matches and semantic queries
- ✅ More robust across different query types
- ✅ Higher overall retrieval accuracy
---
## Reciprocal Rank Fusion (RRF)
When combining BM25 and Vector results, this project uses **RRF**:
### What is RRF?
A method to merge ranked lists from different sources.
**Formula:**
```
RRF_score(doc) = Σ 1 / (k + rank_i(doc))
```
Where:
- `rank_i(doc)` = position of doc in retriever i's results
- `k` = constant (usually 60)
### Example:
**BM25 Results:**
1. Doc A (rank 1)
2. Doc B (rank 2)
3. Doc C (rank 3)
**Vector Results:**
1. Doc C (rank 1)
2. Doc A (rank 2)
3. Doc D (rank 3)
**RRF Scores:**
- Doc A: 1/(60+1) + 1/(60+2) = 0.0164 + 0.0161 = **0.0325**
- Doc B: 1/(60+2) = **0.0161**
- Doc C: 1/(60+3) + 1/(60+1) = 0.0159 + 0.0164 = **0.0323**
- Doc D: 1/(60+3) = **0.0159**
**Final Ranking:** A, C, B, D
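A few lines of Python reproduce the arithmetic above (the two ranked lists are just this example's document labels, not real retriever output):
```python
from collections import defaultdict

bm25_ranking = ["A", "B", "C"]
vector_ranking = ["C", "A", "D"]
k = 60

# Sum 1 / (k + rank) across every list the document appears in
rrf = defaultdict(float)
for ranking in (bm25_ranking, vector_ranking):
    for rank, doc in enumerate(ranking, start=1):
        rrf[doc] += 1 / (k + rank)

for doc, score in sorted(rrf.items(), key=lambda item: item[1], reverse=True):
    print(doc, round(score, 4))
# A 0.0325, C 0.0323, B 0.0161, D 0.0159
```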
RRF is implemented automatically by LangChain's `EnsembleRetriever`.
---
## Code Reference: Where BM25 is Used
### 1. Initialization
**File:** `src/hybrid_rag/hybrid_retriever.py:152-165`
```python
# Create BM25 retrievers for each document type
if text_docs:
    text_bm25_retriever = BM25Retriever.from_documents(text_docs)
    text_bm25_retriever.k = keyword_k

if csv_docs:
    csv_bm25_retriever = BM25Retriever.from_documents(csv_docs)
    csv_bm25_retriever.k = keyword_k
```
### 2. Ensemble Creation
**File:** `src/hybrid_rag/hybrid_retriever.py:46-55`
```python
# Create ensemble retrievers for each document type
text_retriever = EnsembleRetriever(
    retrievers=[text_vector_retriever, text_bm25_retriever],
    weights=[0.5, 0.5]  # 50% vector, 50% BM25
)

csv_retriever = EnsembleRetriever(
    retrievers=[csv_vector_retriever, csv_bm25_retriever],
    weights=[0.5, 0.5]
)
```
### 3. Demo Usage
**File:** `scripts/demos/basic.py:96-103`
```python
# Traditional ensemble (when not using document-type-aware)
keyword_retriever = BM25Retriever.from_documents(documents)
keyword_k = config['retrieval']['keyword_search_k']
keyword_retriever.k = keyword_k
hybrid_retriever = EnsembleRetriever(
    retrievers=[vector_retriever, keyword_retriever]
)
```
---
## Tuning BM25 Performance
### Adjusting k (Number of Results)
**In `config/config.yaml`:**
```yaml
retrieval:
  keyword_search_k: 5  # Change this
```
**Effect:**
- **Higher k** (e.g., 10): More BM25 results, better recall, may include less relevant docs
- **Lower k** (e.g., 3): Fewer BM25 results, higher precision, might miss some relevant docs
### Adjusting Ensemble Weights
**In `src/hybrid_rag/hybrid_retriever.py`:**
```python
# Current: Equal weighting
EnsembleRetriever(
    retrievers=[text_vector_retriever, text_bm25_retriever],
    weights=[0.5, 0.5]  # Modify these
)
```
**Options:**
- **[0.7, 0.3]**: Favor vector search (better for semantic queries)
- **[0.3, 0.7]**: Favor BM25 search (better for exact term matching)
- **[0.5, 0.5]**: Balanced (current default)
### When to Favor BM25 More
Increase BM25 weight (e.g., [0.4, 0.6]) if:
- Your queries are often exact product IDs
- Users search for specific numbers or codes
- Exact keyword matching is more important than semantic understanding
- Your documents contain a lot of technical jargon
### When to Favor Vector More
Increase vector weight (e.g., [0.6, 0.4]) if:
- Users ask natural language questions
- Synonyms and related concepts are common
- Semantic understanding is more important
- Users might misspell terms
---
## Summary
### BM25 in One Sentence:
**BM25 is a keyword-based search algorithm that finds documents containing your exact search terms and ranks them by how important those terms are.**
### Why This Project Uses BM25:
1. ✅ **Exact matching** for product IDs, order numbers, specific terms
2. ✅ **Complements vector search** for better overall accuracy
3. ✅ **Fast and lightweight** compared to neural approaches
4. ✅ **Industry standard** - proven and reliable
5. ✅ **Interpretable** - easy to understand why documents matched
### Key Takeaways:
- BM25 = Keyword search (exact words)
- Vector = Semantic search (meaning)
- Hybrid = Best of both
- This project uses **separate BM25 retrievers** for text vs CSV documents
- Results are merged using RRF (Reciprocal Rank Fusion)
- You can tune BM25 behavior via `config/config.yaml`
---
## Further Reading
- **Wikipedia:** [Okapi BM25](https://en.wikipedia.org/wiki/Okapi_BM25)
- **Original Paper:** Robertson & Walker (1994), "Some Simple Effective Approximations to the 2-Poisson Model for Probabilistic Weighted Retrieval" (SIGIR '94)
- **LangChain Docs:** [BM25Retriever](https://python.langchain.com/docs/integrations/retrievers/bm25)
- **This Project's Architecture:** `docs/architecture/system-design.md`
---
## Related Documentation
- **[System Design](../architecture/system-design.md)** - Full architecture details
- **[Testing Strategy](testing-strategy.md)** - How to validate retrieval
- **[Quick Start](../getting-started/quick-start.md)** - Get started using the system