Markdown RAG Documentation

CHANGELOG.md•8.18 KiB

# Changelog

All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/), and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [Unreleased]

### Fixed

- **SearchOrchestrator project context isolation:** Fixed bug where `query_documents` could use the wrong project's `documents_path` in multi-project setups. The `SearchOrchestrator` now stores `documents_path` as an immutable instance variable at construction time instead of reading from the shared `Config` object on each query. This ensures project context remains fixed for the server's lifetime, consistent with multi-project isolation guarantees.

### Added

- **CLI Debug Mode for Query Command:**
  - Added `--debug` flag to `query` command (`uv run mcp-markdown-ragdocs query --debug`)
  - Displays two formatted tables showing search internals:
    1. **Search Strategy Results**: Counts from Vector (Semantic), Keyword (BM25), Graph (PageRank), Code, and Tag Expansion
    2. **Compression Pipeline**: Filtering stages with counts and items removed (Original, Confidence Filter, Content Dedup, N-gram Dedup, Semantic Dedup, Doc Limit)
  - Useful for understanding result quality, tuning configuration (weights/thresholds), and diagnosing search behavior
  - Implements `SearchStrategyStats` dataclass in `src/models.py` to capture intermediate counts
  - Modified `SearchOrchestrator.query()` to return 3-tuple: `(results, compression_stats, strategy_stats)`
  - Added `print_debug_stats()` formatter in `src/cli_utils/formatters.py` using Rich tables
- **Parallel Git Commit Indexing:**
  - New `src/git/parallel_indexer.py` module with `ParallelIndexingConfig`, `parse_commits_parallel()`, `batch_embed_texts()`, `add_commits_batch()`
  - `ThreadPoolExecutor`-based parallel parsing (git subprocesses release GIL, enabling true parallelism)
  - Batch embedding generation for improved throughput
  - Bulk SQLite inserts via `executemany()` for reduced I/O overhead
  - Both async (`index_commits_parallel`) and sync (`index_commits_parallel_sync`) variants
  - Expected speedup: 2-4x for 100+ commits
  - Graceful failure handling: individual commit parse failures don't crash entire batch
  - New config options in `[git_indexing]`:
    - `parallel_workers` (default: 4): Number of parallel workers for git operations
    - `embed_batch_size` (default: 32): Batch size for embedding generation
- **Memory Search Time Range Filtering:**
  - `search_memories` now supports time-based filtering with three new parameters:
    - `after_timestamp` (int): Unix timestamp for lower bound (inclusive)
    - `before_timestamp` (int): Unix timestamp for upper bound (exclusive)
    - `relative_days` (int): Returns memories from last N days (overrides absolute timestamps)
  - Validation: `after_timestamp < before_timestamp`, `relative_days ≥ 0`
  - Time source: Uses `created_at` from frontmatter with fallback to file `mtime`
  - All timestamps normalized to UTC
  - Enables temporal scoping for memory searches (e.g., "recent work", "last sprint", "Q4 2024")
- **Search Infrastructure Overhaul (Spec 17):**
  - **Community Detection:** Louvain algorithm clusters documents by wikilink connectivity; co-community results receive configurable score boost (default 1.1×)
  - **Score-Aware Fusion:** Dynamic weight adjustment based on per-query score variance; low-variance strategies automatically down-weighted
  - **HyDE Search:** `search_with_hypothesis` MCP tool for hypothesis-driven document embeddings; improves retrieval for vague queries
  - **Edge Types:** Graph edges now carry semantic types (`links_to`, `implements`, `tests`, `related`)
  - New config section: `[search.advanced]` with `community_detection_enabled`, `community_boost_factor`, `dynamic_weights_enabled`, `variance_threshold`, `hyde_enabled`, `default_edge_type`
- **Memory Management System:** Persistent AI memory bank with CRUD operations, hybrid search, and cross-corpus linking
  - 9 MCP tools: `create_memory`, `read_memory`, `update_memory`, `append_memory`, `delete_memory`, `search_memories`, `search_linked_memories`, `get_memory_stats`, `merge_memories`
  - Ghost node pattern for cross-corpus graph traversal (`[[doc.md]]` creates `ghost:doc.md` node)
  - Memory-specific recency boost (configurable days/factor)
  - Dual storage strategies: `"project"` (`.memories/`) or `"user"` (`~/.local/share/`)
  - New config section: `[memory]` with `enabled`, `storage_strategy`, `recency_boost_days`, `recency_boost_factor`
- Query expansion via embeddings: `build_concept_vocabulary()` extracts terms during indexing, `expand_query()` finds top-3 nearest terms to query embedding for improved recall
- Cross-encoder re-ranking with lazy model loading (loaded on first `rerank()` call)
  - Default model: `cross-encoder/ms-marco-MiniLM-L-6-v2` (22MB, ~50ms/10 docs)
  - New config options: `rerank_enabled`, `rerank_model`, `rerank_top_n`
- Concept vocabulary persisted as `concept_vocabulary.json` with index
- Heading-weighted embeddings: chunks prepend `header_path` to content before embedding for improved semantic context
- Extended frontmatter extraction: title, description, summary, keywords, author, category, type, related fields
- BM25F field boosting in keyword index:
  - title (3.0x), headers (2.5x), keywords (2.5x), description (2.0x), tags (2.0x), aliases (1.5x), author (1.0x), category
  - MultifieldParser searches all TEXT fields
- Result filtering pipeline with `CompressionStats` tracking:
  - `min_confidence`: score threshold filtering (default: 0.0 = disabled)
  - `max_chunks_per_doc`: per-document chunk limit (default: 0 = disabled)
  - `dedup_enabled` / `dedup_similarity_threshold`: semantic deduplication via cosine similarity clustering

### Changed

- `QueryOrchestrator.query()` returns `tuple[list[ChunkResult], CompressionStats]`
- Processing pipeline order: normalize → threshold → doc limit → dedup → re-rank → top_n
- Graph index enhanced with `related` frontmatter field edges
- Vocabulary built during `persist()` after indexing completes
- `rebuild-index` CLI command now includes three phases:
  1. Document indexing with progress bar
  2. Git commit indexing (if enabled) with repository discovery and progress tracking
  3. Concept vocabulary building (if enabled) with term count display
  - All phases non-fatal: failures logged but do not prevent subsequent phases
  - Enhanced progress output with emoji indicators and detailed phase summaries

### Migration

- **Reindexing required** to build concept vocabulary. Run: `uv run mcp-markdown-ragdocs rebuild-index`
- **Reindexing required** for schema changes. Run: `uv run mcp-markdown-ragdocs rebuild-index`

## [0.1.0] - 2025-12-22

### Added

- Initial implementation of MCP server for Markdown documentation
- Hybrid search combining semantic embeddings (FAISS), keyword search (Whoosh), and graph traversal (NetworkX)
- Reciprocal Rank Fusion (RRF) for multi-strategy result merging
- Recency bias for recently modified documents
- LLM synthesis using llama-index for answer generation
- Automatic file watching with debounced incremental indexing
- Index versioning with automatic rebuild on configuration changes
- Rich Markdown parsing with tree-sitter:
  - Frontmatter extraction
  - Wikilink resolution
  - Tag extraction
  - Transclusion support
- CLI commands:
  - `run`: Start MCP server with optional host/port overrides
  - `rebuild-index`: Force full index rebuild
  - `check-config`: Validate and display configuration
- FastAPI server with endpoints:
  - `POST /query_documents`: Query interface with LLM synthesis
  - `GET /health`: Health check endpoint
  - `GET /status`: Operational status with document count, queue size, and failed files
- Zero-configuration operation with sensible defaults
- TOML configuration support with cascading config file lookup
- Comprehensive test suite with unit, integration, and E2E tests
- Documentation:
  - Architecture overview
  - Configuration reference
  - Hybrid search algorithm details
  - Integration guide for VS Code MCP
  - Development guide

[unreleased]: https://github.com/yourusername/mcp-markdown-ragdocs/compare/v0.1.0...HEAD
[0.1.0]: https://github.com/yourusername/mcp-markdown-ragdocs/releases/tag/v0.1.0
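
The time range rules listed for `search_memories` (inclusive lower bound, exclusive upper bound, `relative_days` overriding absolute timestamps) might be sketched roughly as follows. This is an illustrative sketch only, not the project's code: the helper names `resolve_time_range` and `in_range` and the `now` parameter are hypothetical and do not appear in the changelog.

```python
from __future__ import annotations

import time

SECONDS_PER_DAY = 86_400


def resolve_time_range(
    after_timestamp: int | None = None,
    before_timestamp: int | None = None,
    relative_days: int | None = None,
    now: int | None = None,  # hypothetical hook so the clock can be fixed in tests
) -> tuple[int | None, int | None]:
    """Return (lower_inclusive, upper_exclusive) as Unix timestamps (UTC)."""
    if relative_days is not None:
        if relative_days < 0:
            raise ValueError("relative_days must be >= 0")
        current = int(time.time()) if now is None else now
        # relative_days overrides any absolute timestamps that were passed.
        return current - relative_days * SECONDS_PER_DAY, None
    if (
        after_timestamp is not None
        and before_timestamp is not None
        and after_timestamp >= before_timestamp
    ):
        raise ValueError("after_timestamp must be < before_timestamp")
    return after_timestamp, before_timestamp


def in_range(created_at: int, lower: int | None, upper: int | None) -> bool:
    """Check a memory's timestamp: lower bound inclusive, upper bound exclusive."""
    if lower is not None and created_at < lower:
        return False
    if upper is not None and created_at >= upper:
        return False
    return True
```

Under these assumptions, a memory stamped exactly at `after_timestamp` is kept, and one stamped exactly at `before_timestamp` is dropped.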

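For readers unfamiliar with the Reciprocal Rank Fusion mentioned in 0.1.0, a minimal unweighted sketch is shown below. It assumes the commonly used constant `k = 60`; the function name `reciprocal_rank_fusion`, the input shape, and the tie-breaking rule are hypothetical. The changelog does not state the project's actual constant or per-strategy weights.

```python
from __future__ import annotations

from collections import defaultdict


def reciprocal_rank_fusion(
    rankings: dict[str, list[str]],
    k: int = 60,  # assumed smoothing constant, not taken from the project
) -> list[tuple[str, float]]:
    """Merge ranked ID lists from several strategies into one scored list."""
    scores: defaultdict[str, float] = defaultdict(float)
    for ranked_ids in rankings.values():
        # Ranks start at 1, so the top hit of a strategy adds 1 / (k + 1).
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for deterministic output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


fused = reciprocal_rank_fusion(
    {
        "vector": ["a.md", "b.md", "c.md"],
        "keyword": ["b.md", "a.md"],
        "graph": ["b.md", "d.md"],
    }
)
```

Here `b.md` ranks first because it appears near the top of all three lists, which is the property that makes rank-based fusion useful when the strategies' raw scores are not comparable.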