Markdown RAG Documentation

Overview Schema Related Servers Score Discussions

ragdocs-mcp
specs

PLAN-phase-3-4-implementation.md•20.3 KiB

# Implementation Plan: Phase 3 & Phase 4 Search Quality Improvements **Prepared for:** Code Agent **Date:** 2025-01-01 **Spec Reference:** [specs/11-search-quality-improvements.md](11-search-quality-improvements.md) --- ## 1. Executive Summary This plan covers implementation of 5 features across Phase 3 and Phase 4: | Phase | Feature | Impact | Complexity | LOC Est. | |-------|---------|--------|------------|----------| | 3 | P3: Query Type Classification + Adaptive Weights | High | Medium | ~80 | | 3 | P7: Code Block Index | Medium | Medium | ~200 | | 4 | P11: Maximal Marginal Relevance (MMR) | High | Low | ~80 | | 4 | P12: N-gram Overlap Deduplication | Medium | Low | ~60 | | 4 | P13: Parent Document Retrieval | High | Medium | ~180 | **Total Estimated LOC:** ~600 --- ## 2. Dependency Graph ``` ┌─────────────────────────────────────┐ │ Phase 3 Features │ └─────────────────────────────────────┘ │ ┌────────────────────┼────────────────────┐ │ │ │ ▼ │ ▼ ┌─────────────────┐ │ ┌─────────────────┐ │ P3: Query Type │ │ │ P7: Code Block │ │ Classification │ │ │ Index │ │ (no deps) │ │ │ (no deps) │ └─────────────────┘ │ └─────────────────┘ │ ┌─────────────────────────────────────┐ │ Phase 4 Features │ └─────────────────────────────────────┘ │ ┌─────────────────────────┼─────────────────────────┐ │ │ │ ▼ ▼ ▼ ┌───────────┐ ┌───────────────┐ ┌───────────────┐ │ P12: N-gram│ │ P11: MMR │ │ P13: Parent │ │ Pre-filter │────────▶│ Selection │ │ Retrieval │ │ (no deps) │ │ (depends on │ │ (no deps) │ └───────────┘ │ existing │ └───────────────┘ │ dedup.py) │ └───────────────┘ ``` ### Dependency Analysis | Feature | Depends On | Blocks | |---------|------------|--------| | P3: Query Classification | Nothing | Nothing | | P7: Code Block Index | Nothing | Nothing | | P11: MMR | `dedup.py` (existing) | Nothing | | P12: N-gram Dedup | Nothing | P11 (pipeline order) | | P13: Parent Retrieval | Nothing | Nothing | **Key Insight:** P12 (N-gram) should be implemented before P11 (MMR) because the pipeline order is: ``` threshold → n-gram dedup (fast) → MMR/semantic dedup (slow) → doc-limit → rerank ``` --- ## 3. Implementation Order ### Recommended Sequence 1. **P3: Query Type Classification** (Phase 3) - Foundation for adaptive search 2. **P12: N-gram Overlap Dedup** (Phase 4) - Fast pre-filter, minimal deps 3. **P11: MMR Selection** (Phase 4) - Replaces/augments semantic dedup 4. **P7: Code Block Index** (Phase 3) - New index, parallel searchable 5. **P13: Parent Document Retrieval** (Phase 4) - Most complex, requires reindex ### Rationale - **P3 first:** Enables testing of adaptive weights throughout other feature development - **P12 before P11:** N-gram dedup reduces candidates for MMR, improving performance - **P7 independent:** Can be developed in parallel with P11/P12 - **P13 last:** Requires chunking changes and reindex; highest risk --- ## 4. File Manifest ### 4.1. P3: Query Type Classification + Adaptive Weights #### Files to Create | File | Purpose | |------|---------| | `src/search/classifier.py` | Query classification logic | | `tests/unit/test_query_classifier.py` | Unit tests | #### Files to Modify | File | Changes | |------|---------| | [src/models.py](../src/models.py) | Add `QueryType` enum | | [src/search/orchestrator.py](../src/search/orchestrator.py) | Call classifier, apply adaptive weights | #### Functions to Add **`src/search/classifier.py`:** ``` QueryType (Enum): FACTUAL NAVIGATIONAL EXPLORATORY classify_query(query: str) -> QueryType - Regex-based heuristics - Factual: camelCase, snake_case, backticks, version numbers, quoted phrases - Navigational: "section", "chapter", "guide", wikilink patterns - Exploratory: question words (what, how, why), default get_adaptive_weights(query_type: QueryType, base_weights: dict) -> dict - Returns modified weights based on query type - Factual: keyword_weight × 1.5 - Navigational: graph_weight × 1.5 - Exploratory: semantic_weight × 1.3 ``` **`src/models.py`:** ``` class QueryType(Enum): FACTUAL = "factual" NAVIGATIONAL = "navigational" EXPLORATORY = "exploratory" ``` **`src/search/orchestrator.py`:** - Import `classify_query`, `get_adaptive_weights` from `src.search.classifier` - In `query()`: classify query, adjust weights before fusion --- ### 4.2. P7: Code Block Index #### Files to Create | File | Purpose | |------|---------| | `src/indices/code.py` | Dedicated Whoosh index for code blocks | | `tests/unit/test_code_index.py` | Unit tests | #### Files to Modify | File | Changes | |------|---------| | [src/parsers/markdown.py](../src/parsers/markdown.py) | Extract code blocks during parsing | | [src/models.py](../src/models.py) | Add `CodeBlock` dataclass | | [src/search/orchestrator.py](../src/search/orchestrator.py) | Query code index, include in fusion | | [src/indexing/manager.py](../src/indexing/manager.py) | Initialize and persist code index | #### Functions to Add **`src/indices/code.py`:** ``` CodeIndex (class): __init__() - Schema with code-aware analyzer - RegexTokenizer for camelCase splitting, preserve punctuation add_code_block(code_block: CodeBlock) remove_document(doc_id: str) search(query: str, top_k: int, language: str | None) -> list[dict] persist(path: Path) load(path: Path) clear() ``` **Code-aware analyzer:** ```python from whoosh.analysis import RegexTokenizer, LowercaseFilter # Split on camelCase boundaries, preserve punctuation in identifiers code_tokenizer = RegexTokenizer(r'[A-Z]?[a-z]+|[A-Z]+(?=[A-Z]|$)|[0-9]+|\S+') code_analyzer = code_tokenizer | LowercaseFilter() ``` **Schema:** ```python Schema( id=ID(stored=True, unique=True), doc_id=ID(stored=True), chunk_id=ID(stored=True), # Reference to parent text chunk code=TEXT(stored=True, analyzer=code_analyzer), language=KEYWORD(stored=True), ) ``` **`src/models.py`:** ``` @dataclass class CodeBlock: id: str doc_id: str chunk_id: str code: str language: str start_line: int end_line: int ``` **`src/parsers/markdown.py`:** ``` _extract_code_blocks(content: str, doc_id: str) -> list[CodeBlock] - Regex: ```(\w+)?\n(.*?)\n``` - Extract language, code content, line numbers ``` --- ### 4.3. P11: Maximal Marginal Relevance (MMR) #### Files to Create | File | Purpose | |------|---------| | `src/search/diversity.py` | MMR selection algorithm | | `tests/unit/test_mmr.py` | Unit tests | #### Files to Modify | File | Changes | |------|---------| | [src/search/pipeline.py](../src/search/pipeline.py) | Add MMR as alternative to semantic dedup | | [src/config.py](../src/config.py) | Add `mmr_enabled`, `mmr_lambda` options | #### Functions to Add **`src/search/diversity.py`:** ``` select_mmr( query_embedding: list[float], candidates: list[tuple[str, float]], get_embedding: Callable[[str], list[float] | None], lambda_param: float = 0.7, top_n: int = 10, ) -> list[tuple[str, float]] Algorithm: 1. Start with empty selected set 2. While len(selected) < top_n and candidates remain: a. For each candidate, compute: mmr_score = λ × relevance - (1-λ) × max_sim_to_selected b. Select candidate with highest MMR score c. Move from candidates to selected 3. Return selected with original scores ``` **`src/config.py` additions:** ```python @dataclass class SearchConfig: # ... existing fields ... mmr_enabled: bool = False mmr_lambda: float = 0.7 ``` **`src/search/pipeline.py` changes:** - Import `select_mmr` from `src.search.diversity` - Add `mmr_enabled`, `mmr_lambda` to `SearchPipelineConfig` - In `process()`: if `mmr_enabled`, use MMR instead of `deduplicate_by_similarity` --- ### 4.4. P12: N-gram Overlap Deduplication #### Files to Modify | File | Changes | |------|---------| | [src/search/dedup.py](../src/search/dedup.py) | Add n-gram functions | | [src/search/pipeline.py](../src/search/pipeline.py) | Insert n-gram dedup in pipeline | | [src/config.py](../src/config.py) | Add `ngram_dedup_enabled`, `ngram_dedup_threshold` | | `tests/unit/test_dedup.py` | Add tests (create if not exists) | #### Functions to Add **`src/search/dedup.py`:** ``` get_char_ngrams(text: str, n: int = 3) -> set[str] - Lowercase, remove spaces - Return set of character n-grams jaccard_similarity(ngrams_a: set[str], ngrams_b: set[str]) -> float - |intersection| / |union| deduplicate_by_ngram( results: list[tuple[str, float]], get_content: Callable[[str], str | None], threshold: float = 0.7, n: int = 3, ) -> tuple[list[tuple[str, float]], int] - For each result, compare against kept results - If Jaccard >= threshold, skip (it's a duplicate) - Return (kept_results, removed_count) ``` **`src/config.py` additions:** ```python @dataclass class SearchConfig: # ... existing fields ... ngram_dedup_enabled: bool = True # On by default (fast) ngram_dedup_threshold: float = 0.7 ``` **Pipeline order in `src/search/pipeline.py`:** ``` normalize → threshold → content_hash_dedup → ngram_dedup → semantic_dedup/MMR → doc_limit → rerank ``` --- ### 4.5. P13: Parent Document Retrieval #### Files to Create | File | Purpose | |------|---------| | `tests/unit/test_parent_retrieval.py` | Unit tests | #### Files to Modify | File | Changes | |------|---------| | [src/parsers/markdown.py](../src/parsers/markdown.py) | Two-level chunking | | [src/models.py](../src/models.py) | Add parent fields to `Chunk` | | [src/indices/vector.py](../src/indices/vector.py) | Store parent reference in metadata | | [src/search/orchestrator.py](../src/search/orchestrator.py) | Expand to parents before returning | | [src/config.py](../src/config.py) | Add `parent_retrieval_enabled` | | `src/chunking/header_chunker.py` | Modify to produce two-level chunks | #### Data Model Changes **`src/models.py` - Chunk dataclass:** ```python @dataclass class Chunk: # ... existing fields ... parent_chunk_id: str | None = None # Reference to parent section is_parent: bool = False # True if this is a parent section chunk ``` **Two-Level Chunking Strategy:** | Level | Size | Overlap | Purpose | |-------|------|---------|---------| | Parent (Section) | 1500-2000 chars | 200 chars | Return unit | | Child (Sub-chunk) | 400-600 chars | 100 chars | Retrieval unit | **`src/config.py` additions:** ```python @dataclass class ChunkingConfig: # ... existing fields ... parent_retrieval_enabled: bool = False parent_chunk_min_chars: int = 1500 parent_chunk_max_chars: int = 2000 child_chunk_min_chars: int = 400 child_chunk_max_chars: int = 600 ``` **`src/search/orchestrator.py` changes:** ``` _expand_to_parents( results: list[tuple[str, float]], ) -> list[tuple[str, float]] - For each child chunk, get parent_chunk_id from metadata - Deduplicate parents (multiple children may share parent) - Return parent chunks with highest child score ``` --- ## 5. Testing Strategy ### 5.1. Unit Tests | Feature | Test File | Test Cases | |---------|-----------|------------| | P3 | `tests/unit/test_query_classifier.py` | `test_classify_factual_camelcase`, `test_classify_factual_snake_case`, `test_classify_factual_backticks`, `test_classify_navigational_section`, `test_classify_navigational_wikilink`, `test_classify_exploratory_question`, `test_classify_exploratory_default`, `test_adaptive_weights_factual`, `test_adaptive_weights_navigational`, `test_adaptive_weights_exploratory` | | P7 | `tests/unit/test_code_index.py` | `test_camelcase_tokenization`, `test_snake_case_tokenization`, `test_preserve_punctuation`, `test_search_function_name`, `test_search_class_name`, `test_language_filter`, `test_persist_and_load` | | P11 | `tests/unit/test_mmr.py` | `test_mmr_selects_diverse`, `test_mmr_respects_lambda_high`, `test_mmr_respects_lambda_low`, `test_mmr_empty_candidates`, `test_mmr_single_candidate`, `test_mmr_all_identical` | | P12 | `tests/unit/test_dedup.py` | `test_ngram_extraction`, `test_jaccard_identical`, `test_jaccard_disjoint`, `test_jaccard_partial`, `test_ngram_dedup_removes_exact`, `test_ngram_dedup_keeps_different`, `test_ngram_dedup_threshold_boundary` | | P13 | `tests/unit/test_parent_retrieval.py` | `test_two_level_chunking`, `test_parent_reference_stored`, `test_expand_to_parents`, `test_parent_dedup_across_children`, `test_parent_retrieval_disabled` | ### 5.2. Integration Tests | Test Case | Features Covered | Description | |-----------|------------------|-------------| | `test_adaptive_weights_improve_factual` | P3 | Query with function name ranks higher with adaptive weights | | `test_code_search_finds_function` | P7 | Query `getAuthToken` finds exact function in code index | | `test_mmr_vs_threshold_diversity` | P11 | MMR produces more diverse results than threshold dedup | | `test_ngram_before_semantic` | P12 | N-gram dedup removes exact duplicates, semantic handles near-dupes | | `test_parent_returns_context` | P13 | Search returns parent section containing matched child chunk | | `test_full_pipeline_with_all_features` | All | End-to-end test with all features enabled | ### 5.3. Test Fixtures Create `tests/fixtures/` with: - `code_samples/` - Python, JavaScript, Rust code blocks - `markdown_docs/` - Sample documents with headers, code, and varied content - `expected_results/` - Ground truth for integration tests --- ## 6. Risk Assessment | ID | Risk | Likelihood | Impact | Mitigation | |----|------|------------|--------|------------| | R1 | MMR latency too high with many candidates | Medium | Medium | Limit MMR input to top-50 after n-gram dedup | | R2 | Code tokenizer doesn't handle all languages | Medium | Low | Start with Python/JS/Rust; document limitations | | R3 | Parent retrieval breaks existing chunk IDs | High | High | Maintain backward compat: child chunks keep original IDs | | R4 | N-gram dedup too aggressive | Medium | Medium | Conservative default threshold (0.7); user-configurable | | R5 | Query classifier misclassifies hybrid queries | Medium | Low | Allow manual override in API; default to exploratory | | R6 | Reindex required for P7 and P13 | High | Low | Document clearly; provide `rebuild-index` command | | R7 | Code index increases storage significantly | Low | Low | Monitor index size; consider optional feature flag | ### Blockers 1. **P13 (Parent Retrieval):** Requires changes to chunking strategy. Existing indices become incompatible. Must provide migration path. 2. **P7 (Code Index):** New Whoosh index. If `IndexManager` doesn't support multiple indices cleanly, may need refactoring. --- ## 7. Configuration Summary ### New Config Options ```toml [search] # P3: Query Classification (no config needed - always on) # P11: MMR Diversity mmr_enabled = false mmr_lambda = 0.7 # 0.0=diversity only, 1.0=relevance only # P12: N-gram Dedup ngram_dedup_enabled = true # On by default (fast) ngram_dedup_threshold = 0.7 # P13: Parent Retrieval parent_retrieval_enabled = false [chunking] # P13: Two-level chunking sizes parent_chunk_min_chars = 1500 parent_chunk_max_chars = 2000 child_chunk_min_chars = 400 child_chunk_max_chars = 600 ``` ### SearchPipelineConfig Updates ```python @dataclass class SearchPipelineConfig: # Existing min_confidence: float = 0.0 max_chunks_per_doc: int = 0 dedup_enabled: bool = True dedup_threshold: float = 0.85 rerank_enabled: bool = False rerank_model: str = "cross-encoder/ms-marco-MiniLM-L-6-v2" rerank_top_n: int = 10 # P11: MMR mmr_enabled: bool = False mmr_lambda: float = 0.7 # P12: N-gram ngram_dedup_enabled: bool = True ngram_dedup_threshold: float = 0.7 ``` --- ## 8. Implementation Checklist ### Phase 3 - [ ] **P3: Query Type Classification** - [ ] Create `src/search/classifier.py` - [ ] Add `QueryType` enum to `src/models.py` - [ ] Integrate into `src/search/orchestrator.py` - [ ] Create `tests/unit/test_query_classifier.py` - [ ] Run tests: `uv run pytest tests/unit/test_query_classifier.py -v` - [ ] **P7: Code Block Index** - [ ] Create `src/indices/code.py` - [ ] Add `CodeBlock` dataclass to `src/models.py` - [ ] Add code extraction to `src/parsers/markdown.py` - [ ] Update `src/indexing/manager.py` - [ ] Integrate into `src/search/orchestrator.py` - [ ] Create `tests/unit/test_code_index.py` - [ ] Run tests: `uv run pytest tests/unit/test_code_index.py -v` ### Phase 4 - [ ] **P12: N-gram Overlap Deduplication** - [ ] Add functions to `src/search/dedup.py` - [ ] Add config options to `src/config.py` - [ ] Update `src/search/pipeline.py` - [ ] Add tests to `tests/unit/test_dedup.py` - [ ] Run tests: `uv run pytest tests/unit/test_dedup.py -v` - [ ] **P11: MMR Selection** - [ ] Create `src/search/diversity.py` - [ ] Add config options to `src/config.py` - [ ] Update `src/search/pipeline.py` - [ ] Create `tests/unit/test_mmr.py` - [ ] Run tests: `uv run pytest tests/unit/test_mmr.py -v` - [ ] **P13: Parent Document Retrieval** - [ ] Update `src/models.py` with parent fields - [ ] Modify chunking in `src/chunking/header_chunker.py` - [ ] Update `src/indices/vector.py` metadata - [ ] Add parent expansion to `src/search/orchestrator.py` - [ ] Add config options to `src/config.py` - [ ] Create `tests/unit/test_parent_retrieval.py` - [ ] Run tests: `uv run pytest tests/unit/test_parent_retrieval.py -v` ### Final Validation - [ ] Run full test suite: `uv run pytest -v` - [ ] Run type checker: `uv run pyright src/` - [ ] Run linter: `uv run ruff check src/` - [ ] Test reindex: `uv run mcp-markdown-ragdocs rebuild-index` --- ## 9. Code Style Constraints Per project instructions: 1. **No docstrings** - Code should be self-documenting 2. **No return type annotations** - Rely on type inference 3. **Absolute imports only** - Use `from src.x import y` 4. **Real implementation tests** - Avoid mocks where possible 5. **Zero type/lint errors** - Must pass `ruff` and `pyright` --- ## 10. Handoff Notes ### For Code Agent 1. **Start with P3** - It's self-contained and enables adaptive weight testing 2. **P12 before P11** - The pipeline order matters 3. **P7 can be parallel** - No dependencies on other Phase 4 features 4. **P13 is risky** - Test thoroughly before merging; consider feature flag ### Key Files to Keep Open - [src/search/pipeline.py](../src/search/pipeline.py) - Central pipeline logic - [src/search/orchestrator.py](../src/search/orchestrator.py) - Query coordination - [src/config.py](../src/config.py) - All config options - [src/models.py](../src/models.py) - Data models ### Questions to Resolve During Implementation 1. Should MMR replace semantic dedup or be an alternative? (Recommend: alternative via config) 2. Should code index be queried in parallel with text indices? (Recommend: yes) 3. Should parent retrieval return child content as well? (Recommend: no, just parent)

Loading blob content...

Latest Blog Posts

Redis vs ioredis vs valkey-glide
By punkpeye on January 26, 2026.
benchmark
Redis
valkey
Quickstart: Publish an MCP Server to the MCP Registry
By punkpeye on January 24, 2026.
mcp
official reference mirror
Official MCP Registry Server.json Requirements
By punkpeye on January 24, 2026.
mcp
official reference mirror

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/andnp/ragdocs-mcp'

If you have feedback or need assistance with the MCP directory API, please join our Discord server

PLAN-phase-3-4-implementation.md•20.3 KiB