Markdown RAG Documentation

Overview Schema Related Servers Score Discussions

17-search-overhaul.md•4.72 KiB

# Feature Spec 17: Search Infrastructure Overhaul **Status:** ✅ Complete (January 2026) ## Implementation Notes | Feature | Planned | Implemented | Deviation | |---------|---------|-------------|-----------| | Edge Types | 4 types (LINKS_TO, IMPLEMENTS, TESTS, RELATED) | 4 types via `edge_type` attribute | None | | Community Detection | Leiden Algorithm | Louvain Algorithm (NetworkX) | Louvain used (no additional dependencies) | | Score-Aware Fusion | Weighted CombSUM + RRF | RRF + variance-based weight adjustment | Equivalent approach | | HyDE | `search_with_hypothesis` tool | Implemented | None | | Configuration | `[search.advanced]` section | Implemented | None | **Algorithm Deviation:** The spec proposed Leiden algorithm via `cdlib`. Implementation uses Louvain algorithm from NetworkX (`nx.community.louvain_communities`), avoiding additional dependencies. Both algorithms produce comparable community structures for document graphs. --- ## 1. Overview This specification details the roadmap for upgrading the core search infrastructure of `mcp-markdown-ragdocs` to align with 2025 state-of-the-art techniques (GraphRAG, Score-Aware Fusion, HyDE). These changes apply to **both** the Main Corpus and the Memory Corpus, though some features (Community Detection) are particularly impactful for large documentation bases. ## 2. Goals 1. **Semantic Graph**: Move from "dumb" file links to typed relationships with community detection. 2. **Advanced Fusion**: Replace static RRF with Score-Aware Fusion to better handle variance in vector/keyword confidence. 3. **Hypothesis-Driven Search**: Enable HyDE (Hypothetical Document Embeddings) for handling vague queries. ## 3. Architecture Upgrades ### A. Graph Index: Typed Edges & Community Detection **Current**: `GraphStore` stores `A -> B` (unweighted, untyped). **New**: 1. **Edge Schema**: ```python class EdgeType(Enum): LINKS_TO = "links_to" # Default IMPLEMENTS = "implements" # [[src/foo.py]] in "Implementation" section TESTS = "tests" # [[src/foo.py]] in tests/test_foo.py RELATED = "related" # "See also" ``` 2. **Community Detection (Leiden Algorithm)**: - Periodically (on index save), run the **Leiden Algorithm** (via `cdlib` or similar) to detect clusters. - **Storage**: Store `community_id` on each Node. - **Search**: When a node is retrieved, boost other nodes in the same `community_id` by a factor (e.g., 1.1x). - *Why*: If a user searches for "Auth", and we find `auth.md`, we should also boost `login.md` and `user.md` even if they don't share exact keywords, because they form a tight graph cluster. ### B. Fusion Layer: Score-Aware RRF **Current**: Standard RRF ($1 / (k + rank)$). Ignores the fact that Vector score might be 0.99 (high confidence) or 0.60 (low confidence). **New**: **Weighted CombSUM + RRF** - **Logic**: - Normalize Vector scores ($0..1$). - Normalize Keyword scores ($0..1$). - Calculate **Variance**: If Vector scores are flat (low variance), implies "muddy" semantic match. Lower the Vector weight. - Fuse: $FinalScore = (W_v \cdot S_v) + (W_k \cdot S_k) + RRF_{score}$ ### C. Search Pipeline: HyDE Support **Concept**: For vague queries like "How do I add a tool?", semantic search might fail against dry documentation. **Implementation**: - **Tool**: `search_with_hypothesis(hypothesis: str)` - **Workflow**: 1. AI generates hypothesis: *"To add a tool, I likely need to modify src/mcp_server.py and register it in the list_tools method..."* 2. Server embeds this hypothesis. 3. Server searches Vector Index with this embedding. 4. *Result*: Finds `src/mcp_server.py` even if the query was vague. ## 4. Implementation Plan ### Phase 1: Graph Upgrade 1. **Parser Update**: Modify `MarkdownParser` to infer edge types based on context (Header section name). - E.g., Links under `# Testing` -> `TESTS`. 2. **GraphStore Update**: Update `nx.DiGraph` to store edge attributes. 3. **Community Algo**: Integrate a community detection library. Run during `IndexManager.persist()`. ### Phase 2: Fusion Logic 4. **Math**: Implement `normalize_scores` and `calculate_variance` in `src/search/fusion.py`. 5. **Dynamic Weights**: Update `SearchOrchestrator` to compute weights dynamically per-query instead of just using config defaults. ### Phase 3: HyDE 6. **Tool**: Expose `search_with_hypothesis` in `MCPServer`. - *Note*: The underlying `VectorIndex.search` already takes a string; we just need to pass the hypothesis instead of the raw query. ## 5. Migration Strategy - **Graph**: Existing graphs are compatible (edges just lack types). A full re-index will populate types. - **Config**: Add `[search.advanced]` section to toggle Community Detection and HyDE features.

Loading blob content...

Latest Blog Posts

Redis vs ioredis vs valkey-glide
By punkpeye on January 26, 2026.
benchmark
Redis
valkey
Quickstart: Publish an MCP Server to the MCP Registry
By punkpeye on January 24, 2026.
mcp
official reference mirror
Official MCP Registry Server.json Requirements
By punkpeye on January 24, 2026.
mcp
official reference mirror

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/andnp/ragdocs-mcp'

If you have feedback or need assistance with the MCP directory API, please join our Discord server

17-search-overhaul.md•4.72 KiB