Markdown RAG Documentation

hybrid-search.md•35.7 KiB

# Hybrid Search Strategy This document explains the hybrid search system, including each search strategy, the Reciprocal Rank Fusion algorithm, and performance characteristics. ## Overview The query orchestrator combines eleven distinct retrieval and enhancement strategies: 1. **Query Type Classification**: Adaptive weight adjustment based on query intent (optional) 2. **Query Expansion**: Vocabulary-based expansion for improved recall 3. **Semantic Search**: Conceptual similarity via vector embeddings 4. **Keyword Search**: Exact term matching via BM25F scoring with field boosts 5. **Code Search**: Specialized code block retrieval with identifier-aware tokenization (optional) 6. **Graph Traversal**: Structural relationships via wikilinks 7. **Community Detection**: Louvain clustering with co-community score boosting 8. **Recency Bias**: Temporal relevance via file modification time 9. **Score-Aware Fusion**: Dynamic weight adjustment based on score variance 10. **Cross-Encoder Re-Ranking**: Joint query-document relevance scoring (optional) 11. **HyDE**: Hypothesis-driven embedding search (optional) Results from retrieval strategies are fused using Reciprocal Rank Fusion (RRF), then filtered, deduplicated (n-gram and semantic), diversified (MMR), and optionally re-ranked to produce a final ranked list of document chunks. ## Search Strategies ### Query Type Classification (Optional) **Purpose:** Automatically adjust search weights based on query intent detection. **Technology:** - Heuristic pattern matching (regex-based) - Zero ML dependencies **Query Types:** | Type | Signals | Weight Adjustment | |------|---------|-------------------| | Factual | camelCase, snake_case, backticks, versions, quoted phrases | keyword × 1.5 | | Navigational | "section", "guide", "docs", wikilinks (`[[...]]`) | graph × 1.5 | | Exploratory | Question words (what, how, why), question mark | semantic × 1.3 | **Detection Priority:** Factual → Navigational → Exploratory (first match wins). **Pattern Examples:** ```python # Factual signals _CAMEL_CASE_PATTERN = re.compile(r'[a-z][A-Z]|[A-Z]{2,}[a-z]') _SNAKE_CASE_PATTERN = re.compile(r'\b[a-z]+_[a-z_]+\b') _BACKTICK_PATTERN = re.compile(r'`[^`]+`') _VERSION_PATTERN = re.compile(r'\b[vV]?\d+\.\d+(?:\.\d+)?(?:-\w+)?\b') # Navigational keywords _NAVIGATIONAL_KEYWORDS = {'section', 'chapter', 'guide', 'tutorial', 'documentation', ...} # Exploratory signals _QUESTION_WORDS = {'what', 'how', 'why', 'when', 'where', 'which', ...} ``` **Example:** ``` Query: "getUserById function" → Detected: FACTUAL (camelCase signal) → Weights: semantic=1.0, keyword=1.5, graph=1.0 Query: "How do I configure authentication?" → Detected: EXPLORATORY ("How" question word) → Weights: semantic=1.3, keyword=1.0, graph=1.0 ``` **Configuration:** ```toml [search] adaptive_weights_enabled = true ``` **Code Reference:** [src/search/classifier.py](../src/search/classifier.py) ### 1. Semantic Search **Purpose:** Find documents conceptually related to the query, even when exact terms do not match. **Technology:** - Embedding model: HuggingFace BAAI/bge-small-en-v1.5 (384 dimensions) - Vector index: FAISS IndexFlatL2 (cosine similarity search) - Chunking: LlamaIndex MarkdownNodeParser (512 characters, 50 overlap) **Process:** 1. Embed query using same model as indexed documents 2. Perform cosine similarity search in FAISS index 3. Return top-k document IDs ranked by similarity score **When Most Effective:** - Queries with synonyms or paraphrased concepts - Searching for "authentication methods" finds documents about "login systems" and "credential management" - Broad conceptual queries where exact terminology is unknown **Example:** Query: "How do I secure my API?" Semantic search finds: - "authentication.md" (discusses credentials and tokens) - "security.md" (covers HTTPS and rate limiting) - "api-reference.md" (mentions security headers) Even though the word "secure" may not appear in these documents, the semantic similarity of "API security" concepts ensures retrieval. ### 2. Keyword Search **Purpose:** Find documents containing exact terms, phrases, or technical identifiers. **Technology:** - Full-text index: Whoosh with BM25F scoring - Field boosts: - title: 3.0x - headers: 2.5x - keywords: 2.5x - description: 2.0x - tags: 2.0x - aliases: 1.5x - author: 1.0x - Tokenization: StandardAnalyzer (strips punctuation, lowercases) **Process:** 1. Parse query into terms 2. Look up terms in inverted index 3. Score documents using BM25F algorithm (term frequency, inverse document frequency) 4. Return top-k document IDs ranked by BM25 score **When Most Effective:** - Queries with specific function names, error codes, or technical terms - Searching for "getToken" finds documents with that exact function name - Queries with unique identifiers that would not have semantic similarity **Example:** Query: "getToken function implementation" Keyword search finds: - "authentication.md" (contains literal string "getToken") - "api-reference.md" (documents the `getToken` endpoint) Semantic search might miss these if "getToken" is a custom function name without semantic context. **Limitation:** StandardAnalyzer strips punctuation. Queries for "C++" or "Node.js" normalize to "c" and "node". Custom analyzer required to preserve special characters. ### Code Block Search (Optional) **Purpose:** Find code snippets in documentation using identifier-aware tokenization that handles programming language conventions. **Technology:** - Whoosh with BM25F scoring - Custom analyzer with CamelCase and snake_case splitting - Preserves code identifier structure **Tokenization:** Standard tokenizers break code identifiers incorrectly: | Input | Standard Tokenizer | Code Analyzer | |-------|-------------------|---------------| | `getUserById` | `getuserbyid` | `getUserById`, `get`, `user`, `by`, `id` | | `parse_json_data` | `parse_json_data` | `parse_json_data`, `parse`, `json`, `data` | | `HTTPResponseError` | `httpresponseerror` | `HTTPResponseError`, `HTTP`, `Response`, `Error` | **Analyzer Pipeline:** ```python # 1. RegexTokenizer: extract alphanumeric identifiers _CODE_TOKEN_PATTERN = re.compile(r'[a-zA-Z_][a-zA-Z0-9_]*|[0-9]+') # 2. CamelCaseSplitter: split on case transitions "getUserById" → ["getUserById", "get", "User", "By", "Id"] # 3. SnakeCaseSplitter: split on underscores "parse_json" → ["parse_json", "parse", "json"] ``` **Schema:** ```python Schema( id=ID(stored=True, unique=True), doc_id=ID(stored=True), chunk_id=ID(stored=True), content=TEXT(stored=True, analyzer=code_analyzer), language=ID(stored=True), ) ``` **When Most Effective:** - Searching for function names, class names, variable names - Finding code examples with specific patterns - Technical documentation with embedded code blocks **Example:** ``` Query: "getUserById" → Code search finds: code blocks containing getUserById, get_user_by_id → Matches both camelCase and snake_case variants ``` **Configuration:** ```toml [search] code_search_enabled = true code_search_weight = 1.0 ``` **Code Reference:** [src/indices/code.py](../src/indices/code.py) ### 3. Graph Traversal **Purpose:** Surface structurally related documents connected via wikilinks, even when query terms do not appear in linked documents. **Technology:** - Graph representation: NetworkX directed graph - Nodes: Document IDs with metadata (title, tags, aliases) - Edges: Wikilinks (`[[Target]]`) and transclusions (`![[Target]]`) **Process:** 1. Execute semantic and keyword searches to get initial candidate documents 2. For each candidate, perform 1-hop neighbor lookup in graph 3. Retrieve all documents directly linked from or to the candidate 4. Add neighbors to result pool with reduced score (0.5x multiplier) **When Most Effective:** - Queries that should surface related documentation not explicitly mentioned - User queries for "deployment" → graph traversal includes linked "security.md" and "configuration.md" even if "deployment" does not appear in those files - Discovering supporting context for a specific topic **Example:** Query: "deployment configuration" Keyword/semantic search finds: - "deployment.md" Graph traversal adds: - "security.md" (linked from deployment.md via `[[security]]`) - "authentication.md" (linked from security.md) - "configuration.md" (linked from deployment.md) These linked documents enrich the context provided to the LLM without requiring the user to explicitly query for them. **Implementation Detail:** Only 1-hop neighbors are retrieved (direct links). Multi-hop traversal (2+ hops) was avoided due to: - Increased computation cost - Risk of topic drift (documents too far removed from original query) - Diminishing relevance returns ### 4. Recency Bias **Purpose:** Prioritize recently modified documents to surface up-to-date information. **Implementation:** Tier-based score multiplier applied during fusion: | Modification Time | Multiplier | |-------------------|------------| | Last 7 days | 1.2x | | Last 30 days | 1.1x | | Over 30 days | 1.0x | These multipliers are applied directly to the RRF score during fusion. The `recency_bias` config option (default 0.5) is not currently applied to these tiers in the implementation. **When Most Effective:** - Documentation that changes frequently (API specs, deployment guides) - Personal notes where recent entries are more relevant - Queries like "recent updates" or "what changed recently" **Example:** Query: "authentication changes" Without recency bias: - "authentication.md" (last modified 90 days ago) - "api-reference.md" (last modified 180 days ago) With recency bias (assuming "oauth-guide.md" modified 3 days ago): - "oauth-guide.md" (1.1x boost) - "authentication.md" - "api-reference.md" The recent OAuth guide surfaces higher due to recency boost. **Configuration:** Disable recency bias by setting `recency_bias = 0.0` in config. ### 5. Query Expansion **Purpose:** Improve recall by appending semantically related terms to the query. **Technology:** - Concept vocabulary built during index persist - Extracts unique terms from all indexed chunks - Embeds each term using the same embedding model - Vocabulary persisted as `concept_vocabulary.json` **Process:** 1. Embed the query using the embedding model 2. Compute cosine similarity against all vocabulary terms 3. Select top-3 terms with similarity ≥ 0.5 4. Append non-duplicate terms to the original query **Example:** Original query: "database optimization" Expansion terms found: - "indexing" (similarity: 0.68) - "query" (similarity: 0.62) - "performance" (similarity: 0.58) Expanded query: "database optimization indexing query performance" **When Most Effective:** - Short queries that benefit from related terminology - Queries using general terms where specific jargon exists in the corpus - Technical documentation with domain-specific vocabulary **Configuration:** Query expansion is automatic when concept vocabulary exists. Rebuild index to generate vocabulary: `uv run mcp-markdown-ragdocs rebuild-index` ### 6. Cross-Encoder Re-Ranking **Purpose:** Improve precision by re-scoring candidates using joint query-document relevance. **Technology:** - Default model: `cross-encoder/ms-marco-MiniLM-L-6-v2` (22MB) - Lazy loading: model downloaded and loaded on first rerank call - Sentence-transformers CrossEncoder implementation **Process:** 1. Take top candidates from fusion pipeline (after filtering/dedup) 2. Pair each candidate's content with the query 3. Score all pairs using cross-encoder 4. Sort by cross-encoder scores descending 5. Return top `rerank_top_n` results **Performance:** - ~50ms for 10 candidates on CPU - ~30ms for TinyBERT variant - ~150ms for larger BAAI/bge-reranker-base **When Most Effective:** - Queries where initial ranking has relevant results in wrong order - Technical queries requiring precise term matching in context - High-stakes searches where precision matters more than latency **Model Options:** | Model | Size | Latency | Quality | |-------|------|---------|--------| | `cross-encoder/ms-marco-MiniLM-L-6-v2` | 22MB | ~50ms | Recommended | | `cross-encoder/ms-marco-TinyBERT-L-2-v2` | 17MB | ~30ms | Faster | | `BAAI/bge-reranker-base` | 110MB | ~150ms | Higher quality | **Configuration:** ```toml [search] rerank_enabled = true rerank_model = "cross-encoder/ms-marco-MiniLM-L-6-v2" rerank_top_n = 10 ``` ### 7. Community Detection **Purpose:** Boost results that belong to the same document cluster as highly-ranked documents. **Technology:** - Louvain algorithm (NetworkX `louvain_communities`) - Runs on undirected graph conversion - No external dependencies **Process:** 1. During `GraphStore.persist()`, detect communities across all document nodes 2. Store community assignments in `communities.json` 3. During search, identify communities of top-ranked seed documents 4. Boost scores of other results in the same communities **When Most Effective:** - Queries touching well-connected documentation sections - Discovering related documents not explicitly linked - Knowledge bases with dense wikilink structures **Example:** If "auth.md" (community 3) ranks highest, other documents in community 3 (e.g., "login.md", "sessions.md") receive a 1.1× score boost even without explicit query match. **Configuration:** ```toml [search.advanced] community_detection_enabled = true community_boost_factor = 1.1 ``` **Code Reference:** [src/search/community.py](../src/search/community.py) ### 8. Score-Aware Fusion **Purpose:** Dynamically adjust strategy weights based on score variance per query. **Technology:** - Variance calculation on score distributions - Weight reduction for low-variance (uncertain) results - Implemented in `src/search/variance.py` **How It Works:** 1. Compute variance of vector scores and keyword scores separately 2. If variance < threshold, the strategy produces "muddy" matches (all scores similar) 3. Reduce weight proportionally: `factor = max(min_factor, variance / threshold)` 4. Renormalize weights to maintain original sum **Formula:** $$ W_{adjusted} = W_{base} \times \max\left(W_{min}, \frac{\sigma^2}{\sigma^2_{threshold}}\right) $$ **When Most Effective:** - Queries where one strategy produces uniformly low scores - Ambiguous queries that confuse one search strategy - Balancing semantic vs keyword when confidence differs **Configuration:** ```toml [search.advanced] dynamic_weights_enabled = true variance_threshold = 0.1 ``` **Code Reference:** [src/search/variance.py](../src/search/variance.py) ### 9. HyDE (Hypothetical Document Embeddings) **Purpose:** Improve retrieval for vague queries by searching with a hypothetical answer. **Technology:** - Direct embedding of hypothesis text - Same embedding model as document indexing (BAAI/bge-small-en-v1.5) - Implemented in `src/search/hyde.py` **Process:** 1. User (or AI) generates hypothesis describing expected documentation 2. Hypothesis is embedded directly (no query expansion) 3. Vector similarity search using hypothesis embedding 4. Returns documents matching the hypothetical description **When Most Effective:** - Vague queries like "How do I add a feature?" - Queries where describing the answer is easier than forming the question - AI assistants that can generate plausible documentation descriptions **Example:** User query: "How do I add a new tool?" AI generates hypothesis: *"To add a new tool, modify src/mcp_server.py and add a Tool definition in list_tools(). Include name, description, and inputSchema..."* HyDE search embeds this hypothesis and finds: - `src/mcp_server.py` code documentation - Tool registration examples - MCP protocol documentation **MCP Tool:** ```json { "name": "search_with_hypothesis", "inputSchema": { "hypothesis": "string (required)", "top_n": "integer (optional, default: 5)" } } ``` **Configuration:** ```toml [search.advanced] hyde_enabled = true ``` **Code Reference:** [src/search/hyde.py](../src/search/hyde.py) ## Reciprocal Rank Fusion (RRF) ### Algorithm RRF is a rank-based fusion method that combines multiple ranked lists without requiring score normalization. **Formula:** For each document appearing in any ranked list: ``` rrf_score(d) = Σ (1 / (k + rank(d, list_i))) ``` Where: - `d` is a document ID - `k` is a constant (default 60) - `rank(d, list_i)` is the rank (1-indexed) of document `d` in list `i` - Sum is over all lists where `d` appears **Example:** Document "auth.md" appears in: - Semantic search at rank 3 - Keyword search at rank 5 - Graph traversal at rank 2 With k=60: ``` rrf_score = 1/(60+3) + 1/(60+5) + 1/(60+2) = 1/63 + 1/65 + 1/62 = 0.0159 + 0.0154 + 0.0161 = 0.0474 ``` Document "deploy.md" appears in: - Semantic search at rank 1 - Graph traversal at rank 10 ``` rrf_score = 1/(60+1) + 1/(60+10) = 1/61 + 1/70 = 0.0164 + 0.0143 = 0.0307 ``` Despite "deploy.md" ranking higher in semantic search, "auth.md" has a higher RRF score due to appearing in more lists. ### Weighted RRF Strategy weights multiply each term in the RRF sum: ``` weighted_rrf_score(d) = Σ (weight_i / (k + rank(d, list_i))) ``` **Example:** With `semantic_weight = 1.2` and `keyword_weight = 0.8`: ``` rrf_score = (1.2 / (60+3)) + (0.8 / (60+5)) = 0.0190 + 0.0123 = 0.0313 ``` This increases the influence of semantic search results relative to keyword search. ### Advantages of RRF 1. **No score normalization required**: Ranks are universal (1, 2, 3...), unlike scores which vary by algorithm (0.95 for cosine similarity, 12.3 for BM25) 2. **Robust to outliers**: A single very high score in one list does not dominate the fusion 3. **Rewards consensus**: Documents appearing in multiple lists are naturally prioritized 4. **Simple to implement**: No machine learning or training required ### Disadvantages of RRF 1. **Loses score magnitude information**: A rank 1 document with score 0.99 is treated the same as rank 1 with score 0.51 2. **Fixed k constant**: Requires tuning for optimal performance (typical range: 20-100) 3. **Assumes list independence**: Does not account for correlation between lists ## Result Fusion Process ### Processing Pipeline The complete pipeline processes results in this order: ``` Query Type Classification (if adaptive_weights_enabled) ↓ Query Expansion (concept vocabulary) ↓ Semantic Search + Keyword Search + Code Search (parallel) ↓ Graph Neighbor Boosting (1-hop) ↓ Community Boosting (if community_detection_enabled) ↓ Score-Aware RRF Fusion (with dynamic weights if dynamic_weights_enabled) ↓ Recency Bias (tier-based multiplier) ↓ Score Calibration (sigmoid, [0.0, 1.0]) ↓ Confidence Threshold (min_confidence) ↓ Content Hash Deduplication (exact text match) ↓ N-gram Deduplication (if ngram_dedup_enabled) ↓ Semantic Deduplication (if dedup_enabled) ↓ MMR Selection (if mmr_enabled) OR Per-Document Limit ↓ Cross-Encoder Re-Ranking (if rerank_enabled) ↓ Parent Expansion (if parent_retrieval_enabled) ↓ Top-N Selection ``` ### Step-by-Step Fusion 1. **Execute parallel searches**: - Semantic search returns: `[doc1, doc2, doc3, ...]` (ranked) - Keyword search returns: `[doc2, doc5, doc1, ...]` (ranked) 2. **Apply graph traversal**: - For each candidate from step 1, lookup 1-hop neighbors - Add neighbors to pool with 0.5x multiplier 3. **Compute RRF scores**: - For each unique document across all lists, sum weighted RRF terms 4. **Apply recency bias**: - Multiply each document's RRF score by its recency tier multiplier 5. **Sort by final score**: - Return top-k document IDs in descending score order ### Example End-to-End Fusion **Query:** "API authentication" **Semantic search results:** 1. authentication.md (cosine similarity: 0.92) 2. security.md (0.85) 3. api-reference.md (0.78) **Keyword search results:** 1. api-reference.md (BM25: 15.3) 2. authentication.md (12.7) 3. oauth-guide.md (8.4) **Graph traversal (1-hop from candidates):** - deployment.md (linked from security.md, score 0.5x) - configuration.md (linked from authentication.md, score 0.5x) **Recency data:** - oauth-guide.md: modified 5 days ago (1.1x boost) - Others: modified over 30 days ago (1.0x) **RRF Calculation (k=60, semantic_weight=1.0, keyword_weight=1.0):** authentication.md: - Semantic rank 1: 1/(60+1) = 0.0164 - Keyword rank 2: 1/(60+2) = 0.0161 - **RRF Score:** 0.0325 - Recency: 1.0x (>30 days old) - **Final: 0.0325** (0.0325 × 1.0) api-reference.md: - Semantic rank 3: 1/(60+3) = 0.0159 - Keyword rank 1: 1/(60+1) = 0.0164 - **RRF Score:** 0.0323 - Recency: 1.0x (>30 days old) - **Final: 0.0323** (0.0323 × 1.0) security.md: - Semantic rank 2: 1/(60+2) = 0.0161 - **RRF Score:** 0.0161 - Recency: 1.0x (>30 days old) - **Final: 0.0161** (0.0161 × 1.0) oauth-guide.md: - Keyword rank 3: 1/(60+3) = 0.0159 - **RRF Score:** 0.0159 - Recency: 1.1x (modified 5 days ago, <7 days tier) - **Final: 0.0175** (0.0159 × 1.1) deployment.md: - Graph boost: 0.5 * (1/(60+1)) = 0.0082 - **RRF Score:** 0.0082 - Recency: 1.0x (>30 days old) - **Final: 0.0082** (0.0082 × 1.0) ### Recency Boost Algorithm **Application Timing:** Recency boost applied **after** RRF scoring, not during rank computation. **Tier Structure:** | Age Range | Multiplier | Purpose | |-----------|------------|----------| | ≤7 days | 1.2× | Recent updates highly relevant | | ≤30 days | 1.1× | Recent but not brand new | | >30 days | 1.0× | No boost for older content | **Age Calculation:** ```python age_days = (current_timestamp - document_modified_timestamp) / 86400 ``` **Tier Selection Logic:** ```python def get_recency_multiplier(modified_timestamp, current_timestamp): age_days = (current_timestamp - modified_timestamp) / 86400 # Tiers sorted by age (earliest first) tiers = [(7, 1.2), (30, 1.1)] # First tier where age <= tier_days wins for max_days, multiplier in tiers: if age_days <= max_days: return multiplier # Default for documents older than all tiers return 1.0 ``` **Complete Fusion Algorithm:** ```python def fuse_results(results, k, weights, modified_times, current_time): # Step 1: RRF scoring scores = {} for strategy, doc_ids in results.items(): weight = weights.get(strategy, 1.0) for rank, doc_id in enumerate(doc_ids): rrf_contribution = (1 / (k + rank)) * weight scores[doc_id] = scores.get(doc_id, 0.0) + rrf_contribution # Step 2: Recency boosting (applied AFTER RRF) boosted = [] for doc_id, rrf_score in scores.items(): multiplier = 1.0 # default for >30 days if doc_id in modified_times: age_days = (current_time - modified_times[doc_id]) / 86400 if age_days <= 7: multiplier = 1.2 elif age_days <= 30: multiplier = 1.1 final_score = rrf_score * multiplier boosted.append((doc_id, final_score)) # Step 3: Sort by final score return sorted(boosted, key=lambda x: x[1], reverse=True) ``` **Why After RRF:** - RRF score represents content relevance from multiple strategies - Recency acts as secondary signal to surface recent updates - Prevents fresh but irrelevant documents from dominating results - Preserves semantic/keyword/graph consensus while adding temporal awareness **Final Ranking:** 1. authentication.md (0.0325) 2. api-reference.md (0.0323) 3. oauth-guide.md (0.0175) 4. security.md (0.0161) 5. deployment.md (0.0082) ### Score Calibration **Purpose:** Convert raw RRF+recency scores to absolute confidence scores representing match quality independent of result set size. **Method:** Sigmoid calibration applied after RRF fusion and recency boosting. **Formula:** $$ \text{calibrated\_score} = \frac{1}{1 + e^{-s \cdot (r - t)}} $$ Where: - $r$ = raw RRF+recency score - $t$ = threshold (default: 0.035) - $s$ = steepness (default: 150.0) **Threshold Parameter:** RRF score corresponding to 50% confidence. Raw scores above threshold map to >0.5 confidence, scores below map to <0.5 confidence. **Steepness Parameter:** Controls sigmoid curve steepness. Higher values create sharper transitions between low and high confidence. **Calibration vs Normalization:** Calibration (sigmoid) produces **absolute confidence** from raw RRF+recency scores. Legacy min-max normalization is deprecated and is not applied to final scores; any internal normalization is limited to per-strategy scaling and does not change output confidence values. **Score Interpretation:** | Calibrated Score | Interpretation | Raw RRF Range | Typical Conditions | |-----------------|----------------|---------------|---------------------| | >0.9 | Excellent match | >0.050 | Top rank, multiple strategies agree | | 0.7-0.9 | Good match | 0.038-0.050 | Top-3 rank, 2+ strategies | | 0.5-0.7 | Moderate match | 0.030-0.038 | Near threshold, single strategy | | 0.3-0.5 | Weak match | 0.020-0.030 | Low rank, peripheral relevance | | <0.3 | Noise | <0.020 | Should be filtered | **Properties:** 1. **Absolute Confidence:** Same raw score produces same calibrated score across queries 2. **Asymptotic Bounds:** Approaches 1.0 for high scores (~0.98 max), approaches 0.0 for low scores 3. **No Artificial Inflation:** Single-result queries scored by absolute confidence, not always 1.0 4. **Stable Semantics:** Score thresholds remain consistent across different result set sizes **Example Calibration:** Raw RRF+recency scores: - authentication.md: 0.0325 → **0.42** (moderate) - api-reference.md: 0.0323 → **0.41** (moderate) - oauth-guide.md: 0.0175 → **0.09** (noise, filtered) - security.md: 0.0161 → **0.07** (noise, filtered) - deployment.md: 0.0082 → **0.01** (noise, filtered) With `min_confidence = 0.3`, only authentication.md and api-reference.md pass filtering. **Configuration:** ```toml [search] score_calibration_threshold = 0.035 # RRF score for 50% confidence score_calibration_steepness = 150.0 # Sigmoid curve steepness min_confidence = 0.3 # Filter results below 30% confidence ``` **Tuning Guidelines:** - **Lower threshold (0.025):** More lenient, higher confidence for same raw score - **Higher threshold (0.045):** Stricter, lower confidence for same raw score - **Lower steepness (100.0):** Gentler transitions, less separation - **Higher steepness (200.0):** Sharper transitions, more separation **Code Reference:** [src/search/calibration.py](../src/search/calibration.py) **Breaking Changes from v1.5 Min-Max Normalization:** - Top result no longer always 1.0 (typically 0.8-0.98) - Single-result queries no longer automatically 1.0 - Scores are absolute confidence, not relative to result set - Set `min_confidence = 0.3` to filter low-quality results ## Performance Characteristics ### Query Latency Measured on typical hardware (8-core CPU, 16GB RAM) with 1000-document corpus: | Component | Latency | Notes | |-----------|---------|-------| | Semantic search | 50-100ms | Dominated by embedding inference | | Keyword search | 10-20ms | Fast inverted index lookup | | Graph traversal | 5-10ms | In-memory NetworkX operations | | RRF fusion | 1-5ms | Simple arithmetic over ranked lists | | **Total query** | **100-150ms** | Excluding LLM synthesis | ### Scaling Characteristics | Corpus Size | Semantic | Keyword | Graph | Total | |-------------|----------|---------|-------|-------| | 100 docs | 30ms | 5ms | 2ms | ~40ms | | 1,000 docs | 80ms | 15ms | 8ms | ~110ms | | 10,000 docs | 250ms | 40ms | 20ms | ~320ms | **Bottleneck:** Semantic search embedding generation and FAISS similarity computation. ### Memory Usage | Component | Memory per Document | Notes | |-----------|---------------------|-------| | Vector index | ~1.5KB | 384-dim float32 embeddings + mappings | | Keyword index | ~500 bytes | Inverted index entries | | Graph | ~200 bytes | Node metadata + edge pointers | | **Total** | **~2.2KB** | 1000 docs ≈ 2.2MB | ### Indexing Performance | Operation | Latency | Notes | |-----------|---------|-------| | Single document | 200-500ms | Includes embedding generation | | Batch (100 docs) | 30-60s | Serial processing | | Full rebuild (1000 docs) | 5-10 minutes | Cold start, no existing index | **Bottleneck:** Embedding model inference (HuggingFace local model). ## Strategy Trade-Offs ### Semantic Search **Strengths:** - Finds conceptually related content - Handles synonyms and paraphrasing - Discovers unexpected relevant documents **Weaknesses:** - Misses exact term matches (function names, error codes) - Slower than keyword search (embedding inference) - Requires storage for vector embeddings **Tuning:** - Increase `semantic_weight` to prioritize semantic results - Use larger embedding models for better semantic understanding (at cost of speed) ### Keyword Search **Strengths:** - Fast retrieval (inverted index) - Exact term matching (function names, identifiers) - Low storage overhead **Weaknesses:** - Misses synonyms and paraphrasing - Sensitive to query phrasing - StandardAnalyzer strips punctuation (limitation for technical terms) **Tuning:** - Increase `keyword_weight` to prioritize exact matches - Use custom Whoosh analyzer for preserving special characters ### Graph Traversal **Strengths:** - Surfaces structurally related content - Enriches context for LLM synthesis - Discovers documents not matching query terms **Weaknesses:** - Relies on quality of wikilink annotations - Can surface tangentially related content (topic drift) - Adds computation overhead **Tuning:** - Graph boost multiplier (currently hardcoded at 0.5x) could be configurable - Multi-hop depth (currently fixed at 1-hop) could be dynamic based on query ### Recency Bias **Strengths:** - Prioritizes up-to-date information - Simple tier-based calculation - Predictable behavior **Weaknesses:** - File mtime may not reflect content relevance - Not useful for static documentation - Can de-prioritize evergreen content **Tuning:** - Adjust `recency_bias` (0.0 to 1.0) - Modify tier thresholds (7 days, 30 days) in code ## Configuration Recommendations ### General Purpose Documentation Balanced weights with moderate recency: ```toml [search] semantic_weight = 1.0 keyword_weight = 1.0 recency_bias = 0.5 ``` ### API Reference / Technical Docs Prioritize exact term matches: ```toml [search] semantic_weight = 0.8 keyword_weight = 1.5 recency_bias = 0.3 ``` ### Personal Notes (Obsidian) Prioritize semantic connections and recency: ```toml [search] semantic_weight = 1.3 keyword_weight = 0.7 recency_bias = 0.8 ``` ### Research Papers Prioritize semantic similarity, disable recency: ```toml [search] semantic_weight = 1.5 keyword_weight = 0.5 recency_bias = 0.0 ``` ## Deduplication Strategies ### N-gram Overlap Deduplication **Purpose:** Fast pre-filter to remove near-duplicate chunks before expensive embedding-based deduplication. **Technology:** - Character n-grams (trigrams by default) - Jaccard similarity for set comparison - O(n) complexity per comparison **How it Works:** 1. Convert each chunk to a set of character n-grams: ``` "hello world" → {"hel", "ell", "llo", "lo ", "o w", " wo", "wor", "orl", "rld"} ``` 2. Compute Jaccard similarity: $$ J(A, B) = \frac{|A \cap B|}{|A \cup B|} $$ 3. If similarity ≥ threshold (default 0.7), mark as duplicate **Example:** ```python text_a = "Configure the authentication settings in config.toml" text_b = "Configure authentication settings in the config.toml file" ngrams_a = get_ngrams(text_a, n=3) # 48 trigrams ngrams_b = get_ngrams(text_b, n=3) # 52 trigrams intersection = 41 union = 59 jaccard = 41 / 59 = 0.69 # Below 0.7 threshold: NOT duplicate ``` **Why N-grams Before Embeddings?** | Method | Time per Comparison | Memory | |--------|--------------------|---------| | N-gram Jaccard | ~0.1ms | Minimal (sets) | | Embedding Cosine | ~0.5ms | 1.5KB per embedding | N-gram dedup removes obvious duplicates cheaply, reducing the candidate set for expensive semantic dedup. **Configuration:** ```toml [search] ngram_dedup_enabled = true ngram_dedup_threshold = 0.7 ``` **Code Reference:** [src/search/dedup.py](../src/search/dedup.py) ### Maximal Marginal Relevance (MMR) **Purpose:** Select diverse results by penalizing similarity to already-selected items. **MMR Formula:** $$ \text{MMR}(d) = \lambda \cdot \text{Sim}(d, q) - (1 - \lambda) \cdot \max_{s \in S} \text{Sim}(d, s) $$ Where: - $d$ = candidate document - $q$ = query - $S$ = already-selected documents - $\lambda$ = relevance/diversity trade-off (0.0–1.0) **Lambda Trade-off:** | Lambda | Behavior | |--------|----------| | 1.0 | Pure relevance (no diversity penalty) | | 0.7 | Balanced (default) | | 0.5 | Equal weight to relevance and diversity | | 0.3 | Diversity-focused | **Greedy Selection Algorithm:** ```python selected = [] remaining = all_candidates while len(selected) < top_n and remaining: best_id = None best_mmr = -inf for candidate in remaining: relevance = similarity(candidate, query) max_sim_to_selected = max(similarity(candidate, s) for s in selected) mmr_score = lambda * relevance - (1 - lambda) * max_sim_to_selected if mmr_score > best_mmr: best_mmr = mmr_score best_id = candidate selected.append(best_id) remaining.remove(best_id) ``` **When to Use:** - Query results cluster around similar content - Need diverse perspectives on a topic - Want to avoid repetitive chunks from same document section **Configuration:** ```toml [search] mmr_enabled = true mmr_lambda = 0.7 ``` **Code Reference:** [src/search/diversity.py](../src/search/diversity.py) --- ## Parent Document Retrieval **Purpose:** Embed small chunks for retrieval precision, but return larger parent sections for sufficient LLM context. **The Retrieval vs. Return Problem:** Small chunks (~500 chars) improve retrieval precision—specific sentences match queries better than paragraphs. But returning small chunks to an LLM loses context. Parent document retrieval decouples the **retrieval unit** from the **return unit**. **Two-Level Chunking:** | Level | Size | Purpose | |-------|------|---------| | Parent (Section) | 1500–2000 chars | Return unit, provides context | | Child (Sub-chunk) | 200–1500 chars | Retrieval unit, precision matching | **Architecture:** ``` Document → Split into Sections (parent, ~2000 chars) → Split Sections into Chunks (child, ~500 chars) → Embed children, store parent_chunk_id reference → Search returns child matches → Expand to parent sections before returning ``` **Expansion Logic:** ```python def _expand_to_parents(results: list[tuple[str, float]]) -> list[tuple[str, float]]: seen_parents: set[str] = set() expanded: list[tuple[str, float]] = [] for chunk_id, score in results: parent_chunk_id = get_parent_id(chunk_id) if parent_chunk_id: if parent_chunk_id not in seen_parents: seen_parents.add(parent_chunk_id) expanded.append((parent_chunk_id, score)) else: # Chunk is already a parent or has no parent expanded.append((chunk_id, score)) return expanded ``` **Deduplication:** Multiple child chunks may share the same parent. The expansion logic deduplicates parents, keeping the highest-scoring child's score. **Configuration:** ```toml [chunking] parent_retrieval_enabled = true parent_chunk_min_chars = 1500 parent_chunk_max_chars = 2000 ``` **Note:** Requires index rebuild when enabling (`uv run mcp-markdown-ragdocs rebuild-index`). **Code Reference:** [src/chunking/header_chunker.py](../src/chunking/header_chunker.py), [src/search/orchestrator.py](../src/search/orchestrator.py) --- ## Future Enhancements Potential improvements not currently implemented: 1. **Multi-hop graph traversal**: Configurable depth (1-hop, 2-hop) 2. **Custom analyzers**: Preserve punctuation for technical terms 3. **Learned fusion**: Train fusion weights on query/relevance pairs 4. **Query-dependent MMR lambda**: Adjust diversity based on query type

Loading blob content...

Latest Blog Posts

Redis vs ioredis vs valkey-glide
By punkpeye on January 26, 2026.
benchmark
Redis
valkey
Quickstart: Publish an MCP Server to the MCP Registry
By punkpeye on January 24, 2026.
mcp
official reference mirror
Official MCP Registry Server.json Requirements
By punkpeye on January 24, 2026.
mcp
official reference mirror

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/andnp/ragdocs-mcp'

If you have feedback or need assistance with the MCP directory API, please join our Discord server

hybrid-search.md•35.7 KiB