# Development Guide
This document describes how to set up a development environment, run tests, understand the code structure, and contribute to the project.
## Development Setup
### Prerequisites
- Python 3.13+
- uv package manager
- Git
### Clone and Install
```zsh
git clone https://github.com/yourusername/mcp-markdown-ragdocs.git
cd mcp-markdown-ragdocs
uv sync
```
This installs all dependencies including development tools (pytest, pyright, ruff).
### Install as Editable Package
```zsh
uv pip install -e .
```
This allows you to run `mcp-markdown-ragdocs` commands while actively developing.
## Running Tests
### Full Test Suite
```zsh
uv run pytest
```
### By Test Category
**Unit tests:**
```zsh
uv run pytest tests/unit/
```
**Integration tests:**
```zsh
uv run pytest tests/integration/
```
**End-to-end tests:**
```zsh
uv run pytest tests/e2e/
```
**Performance tests:**
```zsh
uv run pytest tests/performance/
```
### With Coverage
```zsh
uv run pytest --cov=src --cov-report=term-missing
```
### Specific Test File
```zsh
uv run pytest tests/unit/test_config.py
```
### Specific Test Function
```zsh
uv run pytest tests/unit/test_config.py::test_load_config_defaults
```
## Code Quality
### Linting
Check for lint errors:
```zsh
uv run ruff check .
```
Auto-fix errors:
```zsh
uv run ruff check --fix .
```
### Formatting
Format code:
```zsh
uv run ruff format .
```
### Type Checking
```zsh
uv run pyright
```
### Pre-Commit Checks
Run all checks before committing:
```zsh
uv run ruff check .
uv run ruff format --check .
uv run pyright
uv run pytest
```
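These checks aren't wired into a hook by the repo; if you want them to run automatically before every commit, a local git hook is one option:
```zsh
cat > .git/hooks/pre-commit <<'EOF'
#!/bin/sh
# Run the same checks as above; abort the commit on any failure
uv run ruff check . && uv run ruff format --check . && uv run pyright && uv run pytest
EOF
chmod +x .git/hooks/pre-commit
```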
## Code Structure
### Directory Layout
```
src/
├── cli.py                  # CLI commands (run, rebuild-index, check-config)
├── config.py               # Configuration loading and dataclasses
├── models.py               # Shared data models (Document)
├── server.py               # FastAPI application and endpoints
├── mcp_server.py           # MCP protocol stdio server
├── indexing/
│   ├── manager.py          # IndexManager (coordinates all indices)
│   ├── manifest.py         # Index versioning and rebuild logic
│   └── watcher.py          # File watching with debouncing
├── indices/
│   ├── graph.py            # NetworkX graph store
│   ├── keyword.py          # Whoosh keyword index
│   ├── code.py             # Code block search index
│   └── vector.py           # FAISS vector index
├── git/
│   ├── repository.py       # Repository discovery and git operations
│   ├── commit_parser.py    # Commit metadata extraction
│   ├── commit_indexer.py   # SQLite storage and embedding
│   ├── commit_search.py    # Search logic and glob filtering
│   └── watcher.py          # Git directory file watching
├── parsers/
│   ├── base.py             # DocumentParser protocol
│   ├── dispatcher.py       # ParserRegistry and dispatch logic
│   └── markdown.py         # MarkdownParser with tree-sitter
└── search/
    ├── fusion.py           # RRF fusion algorithm
    └── orchestrator.py     # Query orchestration
```
### Key Components
#### [src/server.py](../src/server.py)
FastAPI application with lifespan context manager. Entry point for HTTP API.
**Responsibilities:**
- Load configuration
- Initialize indices
- Check manifest and rebuild if needed
- Start file watcher
- Define API endpoints
- Shutdown cleanup
#### [src/indexing/manager.py](../src/indexing/manager.py)
Coordinates updates across vector, keyword, and graph indices; a usage sketch follows the method list.
**Key Methods:**
- `index_document(file_path)`: Parse and update all indices
- `remove_document(doc_id)`: Remove from all indices
- `persist()`: Save all indices to disk
- `load()`: Load existing indices from disk
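A minimal usage sketch of these methods (import paths follow the directory layout above; `config` construction is shown in the test examples later in this guide):
```python
from src.indexing.manager import IndexManager
from src.indices.graph import GraphStore
from src.indices.keyword import KeywordIndex
from src.indices.vector import VectorIndex

# `config` is a Config instance; see the unit test example below
manager = IndexManager(config, VectorIndex(), KeywordIndex(), GraphStore())
manager.load()                          # load persisted indices if present
manager.index_document("docs/auth.md")  # parse and update every index
manager.persist()                       # flush all indices back to disk
```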
#### [src/parsers/markdown.py](../src/parsers/markdown.py)
Markdown parser using tree-sitter for AST-based extraction; a usage sketch follows the list.
**Extraction Logic:**
- YAML frontmatter (metadata, tags, aliases)
- Wikilinks `[[Note]]` and `[[Note|Display]]`
- Transclusions `![[Note]]`
- Inline tags `#tag`
- Content (excluding code blocks)
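A usage sketch (the printed values are illustrative):
```python
from src.parsers.markdown import MarkdownParser

parser = MarkdownParser()
doc = parser.parse("docs/auth.md")  # any markdown file with wikilinks/tags
print(doc.links)  # wikilink targets, e.g. ["OAuth 2.0", "API Keys"]
print(doc.tags)   # frontmatter plus inline tags, e.g. ["security", "api"]
```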
#### [src/search/orchestrator.py](../src/search/orchestrator.py)
Query orchestration with parallel searches and RRF fusion.
**Query Flow:**
1. Execute semantic and keyword searches in parallel
2. Apply 1-hop graph neighbor boosting (sketched after this list)
3. Compute RRF scores with weights
4. Apply recency bias
5. Return top-k document IDs
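Step 2's neighbor boosting could look roughly like this (a sketch against a plain NetworkX graph; the real `GraphStore` API may differ):
```python
import networkx as nx

def boost_with_neighbors(graph: nx.DiGraph, hits: list[str]) -> list[str]:
    """Append each hit's 1-hop link neighbors so linked notes join fusion."""
    boosted = list(hits)
    seen = set(hits)
    for doc_id in hits:
        if doc_id not in graph:
            continue
        for neighbor in graph.neighbors(doc_id):
            if neighbor not in seen:
                seen.add(neighbor)
                boosted.append(neighbor)
    return boosted
```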
#### [src/search/fusion.py](../src/search/fusion.py)
Reciprocal Rank Fusion algorithm implementation.
**Core Function:**
```python
def reciprocal_rank_fusion(
    ranked_lists: dict[str, list[str]],
    weights: dict[str, float],
    k: int = 60,
) -> list[str]:
    # Sketch of the standard RRF formula: each document scores
    # weight / (k + rank) per list, summed across lists.
    scores: dict[str, float] = {}
    for name, ranking in ranked_lists.items():
        weight = weights.get(name, 1.0)
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    # Returns fused ranked list of document IDs
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)
```
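For example, fusing two toy rankings (list names and weights here are illustrative):
```python
fused = reciprocal_rank_fusion(
    ranked_lists={
        "semantic": ["auth.md", "api.md", "intro.md"],
        "keyword": ["api.md", "auth.md"],
    },
    weights={"semantic": 1.0, "keyword": 0.8},
)
# Documents ranked high in both lists outrank single-list hits,
# so fused begins with ["auth.md", "api.md"]
print(fused)
```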
## Testing Strategy
### Test Categories
**Unit Tests:**
- Test individual components in isolation
- Use `tmp_path` fixtures for ephemeral storage
- Fast execution (< 1 second per test)
- Examples: config loading, parser extraction, fusion algorithm
**Integration Tests:**
- Test component interactions
- Use real indices (not mocks)
- Moderate execution time (1-10 seconds per test)
- Examples: multi-index updates, hybrid search, persistence
**End-to-End Tests:**
- Test complete workflows via HTTP API
- Use FastAPI TestClient
- Realistic test data
- Examples: server lifecycle, query flow, file watching
**Performance Tests:**
- Benchmark indexing speed and query latency
- Large corpus tests (100-1000 documents)
- Track performance regressions
### Fixture Architecture
**Ephemeral Fixtures (default):**
- Use pytest's `tmp_path` for isolated test storage
- Fast, no cleanup needed
- Preferred for unit tests
**Persistent Fixtures:**
- Session/module/function scoped fixtures with shared storage
- Used by integration tests that exercise persistence and loading (see the sketch below)
- See [tests/infrastructure/README.md](../tests/infrastructure/README.md)
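A minimal module-scoped sketch (the fixture name is hypothetical; the real fixtures live in `tests/infrastructure/`):
```python
import pytest

@pytest.fixture(scope="module")
def persistent_index_path(tmp_path_factory):
    # One shared directory for every test in the module, so a later
    # test can load what an earlier test persisted
    return tmp_path_factory.mktemp("indices")
```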
**Example Unit Test:**
```python
def test_index_document(tmp_path):
    """Test indexing a single document."""
    config = Config(
        indexing=IndexingConfig(
            documents_path=str(tmp_path / "docs"),
            index_path=str(tmp_path / "indices"),
        )
    )
    vector = VectorIndex()
    keyword = KeywordIndex()
    graph = GraphStore()
    manager = IndexManager(config, vector, keyword, graph)

    doc_file = tmp_path / "docs" / "test.md"
    doc_file.parent.mkdir(parents=True)
    doc_file.write_text("# Test\nContent here")

    manager.index_document(str(doc_file))
    assert manager.get_document_count() == 1
```
**Example Integration Test:**
```python
@pytest.mark.asyncio
async def test_hybrid_search_fusion(tmp_path):
    """Test RRF fusion across semantic, keyword, graph search."""
    config = create_test_config(tmp_path)
    manager = create_test_manager(config)
    orchestrator = QueryOrchestrator(
        manager._vector, manager._keyword, manager._graph, config, manager
    )
    # Create test corpus with links
    create_test_documents(tmp_path)
    # Query should combine all strategies
    results = await orchestrator.query("authentication", top_k=5)
    assert len(results) > 0
    assert "authentication" in results[0]  # Keyword match
    # Verify graph traversal added linked docs
```
## Adding New Features
### Adding a New Parser
The server supports pluggable parsers. Two parsers are included out of the box:
- `MarkdownParser`: Full Markdown support with tree-sitter AST parsing, frontmatter, wikilinks, tags
- `PlainTextParser`: Plain text (.txt) files with UTF-8 and fallback encoding support
To add a new parser:
1. **Create parser class** in `src/parsers/`:
```python
# src/parsers/csv_parser.py
from datetime import datetime, timezone
from pathlib import Path

from src.models import Document
from src.parsers.base import DocumentParser


class CSVParser(DocumentParser):
    def parse(self, file_path: str) -> Document:
        path = Path(file_path)
        content = path.read_text(encoding="utf-8")
        modified_time = datetime.fromtimestamp(path.stat().st_mtime, tz=timezone.utc)
        return Document(
            id=path.stem,
            content=content,
            metadata={"source": str(path)},
            links=[],
            tags=[],
            file_path=str(path),
            modified_time=modified_time,
        )
```
2. **Register parser** in `src/parsers/__init__.py`:
```python
from src.parsers.csv_parser import CSVParser
__all__ = ["MarkdownParser", "PlainTextParser", "CSVParser"]
```
3. **Update dispatcher** in `src/parsers/dispatcher.py`:
Add to `_instantiate_parser()` function:
```python
def _instantiate_parser(parser_name: str, file_path: str):
    if parser_name == "MarkdownParser":
        return MarkdownParser()
    elif parser_name == "PlainTextParser":
        return PlainTextParser()
    elif parser_name == "CSVParser":
        return CSVParser()
    else:
        raise ValueError(f"Unknown parser: {parser_name} for {file_path}")
```
4. **Configure pattern** in `config.toml`:
```toml
[parsers]
"**/*.md" = "MarkdownParser"
"**/*.txt" = "PlainTextParser"
"**/*.csv" = "CSVParser"
```
5. **Add tests** in `tests/unit/test_csv_parser.py`:
```python
def test_csv_parser_basic(tmp_path):
    parser = CSVParser()
    file_path = tmp_path / "test.csv"
    file_path.write_text("col1,col2\nval1,val2")
    doc = parser.parse(str(file_path))
    assert doc.content == "col1,col2\nval1,val2"
    assert doc.id == "test"
```
**Notes:**
- `PlainTextParser` handles encoding fallback gracefully (UTF-8 → latin-1 → cp1252 → iso-8859-1); see the sketch below
- Plain text files use paragraph-based chunking (reuses `HeaderBasedChunker._chunk_plain_text()`)
- See [specs/13-txt-file-chunking.md](../specs/13-txt-file-chunking.md) for implementation details
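The fallback chain described above amounts to something like this (a sketch; the helper name is hypothetical):
```python
from pathlib import Path

def read_with_fallback(path: Path) -> str:
    # Try encodings in order; latin-1 maps every byte, so in practice
    # the loop never falls through past it
    for encoding in ("utf-8", "latin-1", "cp1252", "iso-8859-1"):
        try:
            return path.read_text(encoding=encoding)
        except (UnicodeDecodeError, ValueError):
            continue
    raise ValueError(f"Could not decode {path}")
```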
### Adding a New Search Strategy
1. **Implement strategy** in new index class or orchestrator method
2. **Add to parallel search** in `QueryOrchestrator.query()` (see the sketch after this list)
3. **Add weight configuration** in `SearchConfig`
4. **Update RRF fusion** to include new ranked list
5. **Add integration tests** for new strategy
6. **Document** in [docs/hybrid-search.md](hybrid-search.md)
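A rough sketch of steps 2-4 (the function and parameter names here are hypothetical; `new_index` stands in for the new strategy):
```python
import asyncio

from src.search.fusion import reciprocal_rank_fusion

async def query_all(
    vector, keyword, new_index, query: str, top_k: int, weights: dict[str, float]
) -> list[str]:
    # Step 2: fan out to every strategy in parallel
    semantic, kw, extra = await asyncio.gather(
        vector.search(query, top_k=top_k),
        keyword.search(query, top_k=top_k),
        new_index.search(query, top_k=top_k),
    )
    # Steps 3-4: hand each ranked list to RRF under its configured weight
    return reciprocal_rank_fusion(
        ranked_lists={"semantic": semantic, "keyword": kw, "new": extra},
        weights=weights,
    )
```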
### Adding a New CLI Command
1. **Add command** in `src/cli.py`:
```python
@cli.command("stats")
@click.option("--project", default=None, help="Override project detection")
def stats_cmd(project: str | None):
    """Display index statistics."""
    config = load_config()
    config = _apply_project_detection(config, project)
    # Implementation here: compute `count` from the loaded indices
    click.echo(f"Total documents: {count}")
```
2. **Add tests** in `tests/e2e/test_cli.py`:
```python
def test_cli_stats(tmp_path):
    runner = CliRunner()
    result = runner.invoke(cli, ["stats"])
    assert result.exit_code == 0
    assert "Total documents:" in result.output
```
3. **Update documentation** in `docs/configuration.md` CLI Commands section.
## Manual Testing
### Testing MCP Server (stdio)
Start the MCP server manually:
```zsh
uv run mcp-markdown-ragdocs mcp
```
The server will start and wait for MCP protocol messages on stdin. Useful for:
- Verifying server startup without errors
- Testing with manual JSON-RPC messages
- Debugging stdio transport issues
Press Ctrl+C to stop.
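For a quick smoke test, you can pipe a single message in by hand. This assumes the server accepts a standard MCP `initialize` request (message shape taken from the MCP spec, not verified against this project):
```zsh
echo '{"jsonrpc": "2.0", "id": 1, "method": "initialize", "params": {"protocolVersion": "2024-11-05", "capabilities": {}, "clientInfo": {"name": "manual-test", "version": "0.0.1"}}}' \
  | uv run mcp-markdown-ragdocs mcp
```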
### Testing HTTP Server
Start the HTTP server:
```zsh
uv run mcp-markdown-ragdocs run
```
Query the API:
```zsh
curl http://127.0.0.1:8000/health
curl -X POST http://127.0.0.1:8000/query_documents \
-H "Content-Type: application/json" \
-d '{"query": "test", "top_n": 3}'
```
### Testing CLI Query
Run a direct query:
```zsh
uv run mcp-markdown-ragdocs query "authentication"
```
With JSON output for parsing:
```zsh
uv run mcp-markdown-ragdocs query "test" --json | jq
```
## Testing Patterns
### Anti-Pattern: Mocks
This project avoids mocks in favor of real implementations:
**Bad:**
```python
def test_index_document_mock():
    mock_vector = Mock()
    mock_keyword = Mock()
    manager = IndexManager(config, mock_vector, mock_keyword, graph)
    manager.index_document("test.md")
    mock_vector.add.assert_called_once()
```
**Good:**
```python
def test_index_document_real(tmp_path):
    vector = VectorIndex()  # Real instance
    keyword = KeywordIndex()
    manager = IndexManager(config, vector, keyword, graph)
    manager.index_document(str(test_file))
    assert manager.get_document_count() == 1
```
### Pattern: Realistic Test Data
Use realistic document content in tests:
```python
markdown_content = """---
title: Authentication Guide
tags: [security, api]
aliases: [auth, credentials]
---
# Authentication
This guide covers authentication methods:
- [[OAuth 2.0]]
- [[API Keys]]
See also: [[Security Best Practices]]
"""
```
### Pattern: Fixture Reuse
Extract common setup to fixtures:
```python
@pytest.fixture
def test_corpus(tmp_path):
    """Create realistic documentation corpus."""
    docs_path = tmp_path / "docs"
    docs_path.mkdir()
    (docs_path / "auth.md").write_text("# Auth\n...")
    (docs_path / "api.md").write_text("# API\n...")
    return docs_path


def test_index_corpus(test_corpus):
    # Use test_corpus fixture
    config = Config(indexing=IndexingConfig(documents_path=str(test_corpus)))
    # ...
```
## Debugging
### Enable Debug Logging
```python
import logging
logging.basicConfig(level=logging.DEBUG)
```
### Inspect Index Contents
```python
# In pytest with breakpoint
def test_debug_index(tmp_path):
    manager = create_manager(tmp_path)
    manager.index_document("test.md")
    # Inspect vector index
    import pdb; pdb.set_trace()
    results = manager._vector.search("query", top_k=10)
```
### Verify Manifest
```zsh
cat .index_data/index.manifest.json | jq
```
### Check Index Files
```zsh
ls -lh .index_data/
ls -lh .index_data/vector/
ls -lh .index_data/keyword/
ls -lh .index_data/commits.db # Git history index
```
## Git History Implementation
### Module Structure
The git history feature is organized under `src/git/`:
- **repository.py**: Repository discovery and git operations
- **commit_parser.py**: Commit metadata extraction and delta truncation
- **commit_indexer.py**: SQLite storage with embedding management
- **commit_search.py**: Search logic with glob filtering
- **watcher.py**: Git directory file watching (`.git/HEAD`, `.git/refs/`)
### Running Git-Specific Tests
```zsh
# Unit tests for git module
uv run pytest tests/unit/test_repository.py tests/unit/test_commit_parser.py tests/unit/test_commit_indexer.py
# Integration tests
uv run pytest tests/integration/test_git_search.py
# E2E tests
uv run pytest tests/e2e/test_git_mcp.py
```
### How Commit Indexing Works
**1. Repository Discovery:**
```python
def discover_git_repositories(
    documents_path: Path,
    exclude_patterns: list[str],
    exclude_hidden_dirs: bool = True,
) -> list[Path]:
    """
    Recursively find .git directories.

    - Uses os.walk() with in-place directory filtering
    - Applies glob pattern matching
    - Stops descent at .git (no nested repos)
    """
```
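The traversal the docstring describes could be sketched like this (glob-pattern exclusion omitted for brevity; the helper name is hypothetical):
```python
import os
from pathlib import Path

def _walk_for_git_dirs(documents_path: Path) -> list[Path]:
    repos: list[Path] = []
    for root, dirs, _files in os.walk(documents_path):
        if ".git" in dirs:
            repos.append(Path(root) / ".git")
            dirs.remove(".git")  # never descend into .git itself
        # In-place filtering: pruning `dirs` stops os.walk from descending
        dirs[:] = [d for d in dirs if not d.startswith(".")]
    return repos
```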
**2. Commit Parsing:**
```python
def parse_commit(
    git_dir: Path,
    commit_hash: str,
    max_delta_lines: int = 200,
) -> CommitData:
    """
    Extract commit metadata and truncated delta.

    Uses:
    - git show --format="%H%n%ct%n%an..." for metadata
    - git diff-tree --name-only for files changed
    - git show --format="" for delta

    Truncates delta to max_delta_lines with indicator.
    """
```
**3. Embedding and Storage:**
```python
def add_commit(
    hash: str,
    metadata: dict,
    delta: str,
    commit_document: str,
):
    """
    Generate embedding and store in SQLite.

    - Embedding shared with VectorIndex model
    - Serialized as BLOB (numpy.float32.tobytes())
    - Files changed stored as JSON array
    """
```
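The BLOB round-trip mentioned above amounts to the following (the 384-dim shape is an assumption for illustration, not the project's actual model dimension):
```python
import numpy as np

embedding = np.random.rand(384).astype(np.float32)  # dimension is illustrative
blob = embedding.tobytes()                 # what gets stored in the SQLite column
restored = np.frombuffer(blob, dtype=np.float32)
assert np.array_equal(embedding, restored)
```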
**4. Incremental Updates:**
```python
def get_commits_after_timestamp(
    git_dir: Path,
    after_timestamp: int | None = None,
) -> list[str]:
    """
    Query new commits since last index.

    Uses: git log --all --after={timestamp} --format="%H"
    """
```
**5. GitWatcher:**
Monitors `.git/HEAD` and `.git/refs/` with 5-second cooldown. On changes, queries `get_commits_after_timestamp()` and indexes incrementally.
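The cooldown amounts to something like this (a sketch; the real watcher's structure may differ):
```python
import time

class CooldownGate:
    """Suppress re-index bursts: allow at most one run per cooldown window."""

    def __init__(self, cooldown_seconds: float = 5.0):
        self._cooldown = cooldown_seconds
        self._last_run = float("-inf")

    def should_run(self) -> bool:
        now = time.monotonic()
        if now - self._last_run < self._cooldown:
            return False  # still cooling down; skip this change event
        self._last_run = now
        return True
```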
### Performance Benchmarks
Located in `tests/performance/test_git_performance.py`:
- **Indexing Speed:** 60 commits/sec target
- **Query Latency:** 5ms p50, 15ms p95 for 10k commits
- **Storage Efficiency:** ~2KB per commit
Run benchmarks:
```zsh
uv run pytest tests/performance/test_git_performance.py -v
```
## Contributing Guidelines
### Workflow
1. Fork the repository
2. Create a feature branch: `git checkout -b feature/new-parser`
3. Make changes with tests
4. Run quality checks: `ruff check`, `pyright`, `pytest`
5. Commit with clear messages
6. Push and open a pull request
### Commit Messages
Use conventional commit format:
```
feat: add PlainTextParser for .txt files
fix: handle empty frontmatter in MarkdownParser
docs: update configuration reference
test: add integration test for graph traversal
```
### Pull Request Checklist
- [ ] Tests pass: `pytest`
- [ ] Lint clean: `ruff check`
- [ ] Type clean: `pyright`
- [ ] Format clean: `ruff format --check`
- [ ] Documentation updated (if needed)
- [ ] CHANGELOG updated (if user-facing change)
### Code Style
- Follow PEP 8 (enforced by ruff)
- Type hints on all function parameters
- Docstrings for public APIs (Google style)
- Prefer composition over inheritance
- Keep functions focused (single responsibility)
### Testing Requirements
- New features require tests (unit + integration)
- Bug fixes require regression tests
- Tests must be deterministic (no random failures)
- Use realistic test data
- Avoid mocks when possible
## Release Process
1. Update version in `pyproject.toml`
2. Update CHANGELOG.md
3. Run full test suite
4. Tag release: `git tag v0.2.0`
5. Push tag: `git push --tags`
6. Build package: `uv build`
7. Publish: `uv publish` (if configured)
## Architecture Decision Records
Significant architecture decisions are documented in the `specs/` directory:
- [specs/02-architecture-and-tech-stack.md](../specs/02-architecture-and-tech-stack.md): System architecture and technology choices
- [specs/10-multi-project-support.md](../specs/10-multi-project-support.md): Project detection and index isolation
- [specs/11-search-quality-improvements.md](../specs/11-search-quality-improvements.md): Search pipeline enhancements (BM25F, dedup, re-ranking)
- [specs/12-context-compression.md](../specs/12-context-compression.md): Compression strategy decision (threshold + dedup)
- [docs/specs/19-self-healing-indices.md](specs/19-self-healing-indices.md): Self-healing index infrastructure (corruption detection and recovery)
- [docs/specs/20-vector-index-thread-safety.md](specs/20-vector-index-thread-safety.md): Thread-safe concurrent operations with fine-grained locking
Each spec includes:
- Executive summary
- Current state analysis
- Decision matrix with alternatives
- Implementation details
- Risk register