Markdown RAG Documentation



# Implementation Plan: Git History Search Feature

**Document Version:** 1.0
**Created:** 2026-01-10
**Target Completion:** 6 phases, ~15-20 hours total
**Design Reference:** [docs/specs/15-git-history-search.md](../specs/15-git-history-search.md)

---

## Executive Summary

This document provides a step-by-step implementation plan for adding git commit history search to mcp-markdown-ragdocs. The feature will enable semantic search over commit metadata and diffs using the existing embedding model and search infrastructure.

**Estimated Scope:**

- **New Files:** 8 files (~1,100 LOC)
- **Modified Files:** 6 files (~300 LOC changes)
- **Test Files:** 5 files (~450 LOC)
- **Total:** ~1,850 LOC

---

## Table of Contents

1. [File Manifest](#file-manifest)
2. [Implementation Phases](#implementation-phases)
3. [Dependencies & Ordering](#dependencies--ordering)
4. [Test Specifications](#test-specifications)
5. [Quality Gates](#quality-gates)
6. [Acceptance Criteria](#acceptance-criteria)
7. [Risk Mitigation](#risk-mitigation)

---

## File Manifest

### New Files to Create

| File Path | Purpose | LOC | Dependencies |
|-----------|---------|-----|--------------|
| `src/git/__init__.py` | Git module package | 5 | None |
| `src/git/repository.py` | Repository discovery & git operations | 120 | config.py |
| `src/git/commit_parser.py` | Commit metadata extraction & delta truncation | 100 | None |
| `src/git/commit_indexer.py` | SQLite storage & embedding management | 280 | indices/vector.py |
| `src/git/commit_search.py` | Search logic & glob filtering | 150 | commit_indexer.py |
| `src/git/watcher.py` | Git directory file watching | 120 | indexing/watcher.py |
| `src/models.py` (additions) | CommitResult, GitSearchResponse dataclasses | 40 | None |
| `src/context.py` (additions) | CommitIndexer lifecycle integration | 50 | git/commit_indexer.py |

### Test Files to Create

| File Path | Purpose | LOC | Coverage Target |
|-----------|---------|-----|-----------------|
| `tests/unit/test_repository.py` | Repository discovery logic | 100 | 90% |
| `tests/unit/test_commit_parser.py` | Commit parsing & delta truncation | 100 | 90% |
| `tests/unit/test_commit_indexer.py` | SQLite CRUD operations | 120 | 95% |
| `tests/integration/test_git_search.py` | End-to-end search workflow | 100 | 85% |
| `tests/e2e/test_git_mcp.py` | MCP tool integration | 30 | N/A |

### Files to Modify

| File Path | Changes | LOC Delta |
|-----------|---------|-----------|
| `src/config.py` | Add GitIndexingConfig dataclass | +25 |
| `src/context.py` | Integrate CommitIndexer lifecycle | +60 |
| `src/mcp_server.py` | Register search_git_history tool | +90 |
| `src/lifecycle.py` | Add git watcher management | +40 |
| `src/models.py` | Add CommitResult, GitSearchResponse | +40 |
| `docs/architecture.md` | Document git search architecture | +60 |

---

## Implementation Phases

### Phase 1: Git Module Foundation (3-4 hours, ~220 LOC)

**Objective:** Build core git operations without embedding or indexing.

#### Step 1.1: Create Repository Discovery Module

**File:** `src/git/repository.py`

```python
"""Git repository discovery and commit listing."""

import logging
import subprocess
from pathlib import Path
from typing import Optional

logger = logging.getLogger(__name__)


def discover_git_repositories(
    documents_path: Path,
    exclude_patterns: list[str],
    exclude_hidden_dirs: bool = True
) -> list[Path]:
    """
    Recursively discover .git directories.

    Args:
        documents_path: Root path to search
        exclude_patterns: Glob patterns to exclude (e.g., '**/.venv/**')
        exclude_hidden_dirs: Skip hidden directories except .git

    Returns:
        List of absolute paths to .git directories
    """
    # Implementation details in spec section 3.3
    pass


def get_commits_after_timestamp(
    git_dir: Path,
    after_timestamp: Optional[int] = None
) -> list[str]:
    """
    Get commit hashes after a timestamp.
    Args:
        git_dir: Path to .git directory
        after_timestamp: Unix timestamp (None = all commits)

    Returns:
        List of commit SHAs (newest first)
    """
    # Implementation details in spec section 3.6
    pass


def is_git_available() -> bool:
    """Check if git binary is available in PATH."""
    try:
        subprocess.run(
            ["git", "--version"],
            capture_output=True,
            check=True,
            timeout=5
        )
        return True
    except (subprocess.CalledProcessError, FileNotFoundError, subprocess.TimeoutExpired):
        return False
```

**Key Implementation Details:**

- Use `os.walk()` with in-place directory filtering
- Apply glob pattern matching via `Path.match()`
- Handle nested `.git` directories (stop descent)
- Log discovered repositories with counts

#### Step 1.2: Create Commit Parser Module

**File:** `src/git/commit_parser.py`

```python
"""Commit metadata extraction and delta truncation."""

import json
import logging
import subprocess
from dataclasses import dataclass
from datetime import datetime, timezone
from pathlib import Path

logger = logging.getLogger(__name__)


@dataclass
class CommitData:
    hash: str
    timestamp: int  # Unix seconds
    author: str  # "Name <email>"
    committer: str
    title: str  # First line of message
    message: str  # Body (excluding title)
    files_changed: list[str]
    delta_truncated: str


def parse_commit(git_dir: Path, commit_hash: str, max_delta_lines: int = 200) -> CommitData:
    """
    Extract commit metadata and truncated delta.

    Args:
        git_dir: Path to .git directory
        commit_hash: Full commit SHA
        max_delta_lines: Maximum diff lines to keep

    Returns:
        CommitData with all fields populated
    """
    # Implementation details in spec sections 3.2 and 3.4
    pass


def build_commit_document(commit: CommitData) -> str:
    """
    Build searchable text document from commit data.
    Format (from spec section 3.2):
        {title}
        {message}
        Author: {author}
        Committer: {committer}
        Files changed:
        {file_1}
        {file_2}
        {delta_truncated}

    Returns:
        Formatted text for embedding
    """
    pass


def truncate_delta(diff_output: str, max_lines: int = 200) -> str:
    """Truncate diff to max_lines with indicator if truncated."""
    lines = diff_output.splitlines()
    if len(lines) <= max_lines:
        return diff_output
    truncated = '\n'.join(lines[:max_lines])
    remaining = len(lines) - max_lines
    return f"{truncated}\n\n... ({remaining} lines omitted)"
```

**Key Implementation Details:**

- Use `git show --no-patch --format="%H%n%ct%n%an <%ae>%n%cn <%ce>%n%s%n%b" {hash}` for metadata (`--no-patch` keeps the diff out of the output being parsed)
- Parse the format output into structured fields
- Split message into title (first line) and body
- Use `git diff-tree --no-commit-id --name-only -r {hash}` for files
- Use `git show --format="" {hash}` for delta
- Handle encoding errors (UTF-8 → Latin-1 fallback)

#### Step 1.3: Unit Tests for Phase 1

**File:** `tests/unit/test_repository.py`

Test cases:

- ✅ Discovers single repository
- ✅ Discovers nested repositories
- ✅ Excludes patterns (`**/.venv/**`, `**/build/**`)
- ✅ Handles hidden directory filtering
- ✅ Returns empty list when no repos found
- ✅ `get_commits_after_timestamp` with None (all commits)
- ✅ `get_commits_after_timestamp` with timestamp filter

**File:** `tests/unit/test_commit_parser.py`

Test cases:

- ✅ Parses standard commit
- ✅ Parses merge commit
- ✅ Handles multi-line message
- ✅ Truncates delta at 200 lines
- ✅ Handles UTF-8 encoding
- ✅ Fallback to Latin-1 on encoding error
- ✅ `build_commit_document` format correctness

**Quality Gate:** All unit tests pass with 90%+ coverage

---

### Phase 2: Commit Index Storage (3-4 hours, ~280 LOC)

**Objective:** Implement SQLite storage with embedding serialization.
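As a rough illustration of what this phase is aiming for, the sketch below stores a commit row with its embedding as a BLOB and reads it back. It is not the plan's implementation: the table is a cut-down, hypothetical subset of the columns listed for `docs/architecture.md` (the authoritative SQL is in spec section 3.1), the helper names `pack_embedding`, `unpack_embedding`, and `upsert_commit` are invented for the example, and the standard-library `array` module stands in for NumPy so the snippet runs without dependencies.

```python
import json
import sqlite3
from array import array

# Hypothetical, reduced schema for illustration only.
SCHEMA = """
CREATE TABLE IF NOT EXISTS git_commits (
    hash TEXT PRIMARY KEY,
    timestamp INTEGER NOT NULL,
    title TEXT NOT NULL,
    files_changed TEXT NOT NULL,   -- JSON array of paths
    embedding BLOB NOT NULL        -- packed float32 values
);
CREATE INDEX IF NOT EXISTS idx_git_commits_timestamp ON git_commits(timestamp);
"""


def pack_embedding(values: list[float]) -> bytes:
    """Pack floats as native-endian float32 bytes for BLOB storage."""
    return array("f", values).tobytes()


def unpack_embedding(blob: bytes) -> list[float]:
    """Inverse of pack_embedding."""
    unpacked = array("f")
    unpacked.frombytes(blob)
    return unpacked.tolist()


def upsert_commit(conn, commit_hash, timestamp, title, files_changed, embedding):
    """Insert a commit, or overwrite the existing row with the same hash."""
    with conn:  # commits on success, rolls back on exception
        conn.execute(
            "INSERT OR REPLACE INTO git_commits VALUES (?, ?, ?, ?, ?)",
            (commit_hash, timestamp, title,
             json.dumps(files_changed), pack_embedding(embedding)),
        )


conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # dict-like access to columns
conn.executescript(SCHEMA)

# Writing the same hash twice leaves a single, updated row.
upsert_commit(conn, "abc123", 1704067200, "Initial commit", ["README.md"], [0.5, -1.0, 0.25])
upsert_commit(conn, "abc123", 1704067200, "Initial commit (amended)", ["README.md"], [0.5, -1.0, 0.25])

row = conn.execute("SELECT * FROM git_commits WHERE hash = ?", ("abc123",)).fetchone()
row_count = conn.execute("SELECT COUNT(*) FROM git_commits").fetchone()[0]
```

Keying the table on the commit hash is what makes re-indexing idempotent: `INSERT OR REPLACE` overwrites the row instead of duplicating it, which is the behavior the "Updates existing commit" test case in Step 2.2 checks.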
#### Step 2.1: Create CommitIndexer Class

**File:** `src/git/commit_indexer.py`

```python
"""SQLite-based commit index with embedding storage."""

import json
import logging
import sqlite3
from pathlib import Path
from typing import Optional

import numpy as np
from numpy.typing import NDArray

from src.indices.vector import VectorIndex

logger = logging.getLogger(__name__)


class CommitIndexer:
    """Manages git commit index with embeddings."""

    def __init__(
        self,
        db_path: Path,
        embedding_model: VectorIndex,
    ):
        """
        Initialize commit indexer.

        Args:
            db_path: Path to SQLite database file
            embedding_model: VectorIndex for embedding generation
        """
        self._db_path = db_path
        self._embedding_model = embedding_model
        self._conn: Optional[sqlite3.Connection] = None
        self._ensure_schema()

    def _ensure_schema(self) -> None:
        """Create git_commits table if not exists."""
        # SQL schema from spec section 3.1
        pass

    def add_commit(
        self,
        hash: str,
        timestamp: int,
        author: str,
        committer: str,
        title: str,
        message: str,
        files_changed: list[str],
        delta_truncated: str,
        commit_document: str,
    ) -> None:
        """
        Add or update commit in index.

        Args:
            hash: Commit SHA
            timestamp: Unix timestamp
            author: Author string
            committer: Committer string
            title: First line of commit message
            message: Full commit message body
            files_changed: List of changed file paths
            delta_truncated: Truncated diff text
            commit_document: Full searchable text for embedding
        """
        # Generate embedding
        embedding = self._embedding_model.get_text_embedding(commit_document)
        embedding_bytes = self._serialize_embedding(embedding)
        # SQLite INSERT OR REPLACE
        pass

    def remove_commit(self, commit_hash: str) -> None:
        """Remove commit from index by hash."""
        pass

    def query_by_embedding(
        self,
        query_embedding: list[float],
        top_k: int = 10,
        after_timestamp: Optional[int] = None,
        before_timestamp: Optional[int] = None,
    ) -> list[dict]:
        """
        Query commits by embedding similarity.

        Returns:
            List of dicts with keys: hash, score, timestamp, etc.
        """
        # Load all embeddings, compute cosine similarity, sort
        pass

    def get_last_indexed_timestamp(self, repo_path: str) -> Optional[int]:
        """Get most recent indexed_at timestamp for a repository."""
        pass

    def get_total_commits(self) -> int:
        """Count total commits in index."""
        pass

    @staticmethod
    def _serialize_embedding(embedding: list[float]) -> bytes:
        """Convert embedding to bytes for BLOB storage."""
        return np.array(embedding, dtype=np.float32).tobytes()

    @staticmethod
    def _deserialize_embedding(blob: bytes) -> NDArray[np.float32]:
        """Convert BLOB to numpy array."""
        return np.frombuffer(blob, dtype=np.float32)

    def close(self) -> None:
        """Close database connection."""
        if self._conn:
            self._conn.close()
            self._conn = None
```

**Key Implementation Details:**

- Use context managers for SQLite transactions
- Create indexes on `timestamp`, `indexed_at`
- Store `files_changed` as JSON string
- Use `sqlite3.Row` for dict-like access
- Implement cosine similarity in Python (numpy vectorized)
- Handle SQLite locking (WAL mode for concurrency)

#### Step 2.2: Unit Tests for Phase 2

**File:** `tests/unit/test_commit_indexer.py`

Test cases:

- ✅ Creates schema on initialization
- ✅ Adds commit with embedding
- ✅ Updates existing commit (idempotent)
- ✅ Removes commit by hash
- ✅ Queries by embedding (top-k)
- ✅ Filters by timestamp range
- ✅ Returns empty results for empty index
- ✅ Embedding serialization roundtrip
- ✅ Handles malformed JSON in files_changed

**Quality Gate:** All unit tests pass with 95%+ coverage

---

### Phase 3: Configuration & Lifecycle (2-3 hours, ~115 LOC)

**Objective:** Integrate CommitIndexer into application context and configuration.
#### Step 3.1: Add Git Configuration

**File:** `src/config.py`

Add new dataclass:

```python
@dataclass
class GitIndexingConfig:
    enabled: bool = True
    delta_max_lines: int = 200
    batch_size: int = 100  # Commits per batch for embedding
    watch_enabled: bool = True
    watch_cooldown: float = 5.0  # Seconds
```

Update `Config` dataclass:

```python
@dataclass
class Config:
    server: ServerConfig = field(default_factory=ServerConfig)
    indexing: IndexingConfig = field(default_factory=IndexingConfig)
    git_indexing: GitIndexingConfig = field(default_factory=GitIndexingConfig)  # NEW
    parsers: dict[str, str] = field(default_factory=lambda: {...})
    search: SearchConfig = field(default_factory=SearchConfig)
    llm: LLMConfig = field(default_factory=LLMConfig)
    chunking: ChunkingConfig = field(default_factory=ChunkingConfig)
    projects: list[ProjectConfig] = field(default_factory=list)
```

Update `load_config()` function:

```python
def load_config():
    # ... existing code ...
    git_indexing_data = config_data.get("git_indexing", {})
    git_indexing = GitIndexingConfig(
        enabled=git_indexing_data.get("enabled", True),
        delta_max_lines=git_indexing_data.get("delta_max_lines", 200),
        batch_size=git_indexing_data.get("batch_size", 100),
        watch_enabled=git_indexing_data.get("watch_enabled", True),
        watch_cooldown=git_indexing_data.get("watch_cooldown", 5.0),
    )
    return Config(
        server=server,
        indexing=indexing,
        git_indexing=git_indexing,  # NEW
        parsers=parsers,
        # ... rest of config ...
    )
```

**LOC:** +25 lines in config.py

#### Step 3.2: Integrate CommitIndexer in ApplicationContext

**File:** `src/context.py`

Add imports:

```python
from src.git.commit_indexer import CommitIndexer
from src.git.repository import discover_git_repositories, is_git_available
```

Add to ApplicationContext class:

```python
class ApplicationContext:
    def __init__(
        self,
        config: Config,
        vector_index: VectorIndex,
        keyword_index: KeywordIndex,
        graph_store: GraphStore,
        index_manager: IndexManager,
        orchestrator: SearchOrchestrator,
        file_watcher: FileWatcher | None = None,
        code_index: CodeIndex | None = None,
        commit_indexer: CommitIndexer | None = None,  # NEW
    ):
        self.config = config
        self.vector_index = vector_index
        self.keyword_index = keyword_index
        self.graph_store = graph_store
        self.index_manager = index_manager
        self.orchestrator = orchestrator
        self.file_watcher = file_watcher
        self.code_index = code_index
        self.commit_indexer = commit_indexer  # NEW

    @staticmethod
    def create(project_override: str | None = None) -> "ApplicationContext":
        # ... existing code ...
        # NEW: Initialize commit indexer
        commit_indexer = None
        if config.git_indexing.enabled and is_git_available():
            db_path = index_path / "git_commits.db"
            commit_indexer = CommitIndexer(
                db_path=db_path,
                embedding_model=vector_index,
            )
            logger.info(f"Git commit indexer initialized: {db_path}")
        elif not is_git_available():
            logger.warning("Git binary not found - git history search disabled")

        return ApplicationContext(
            config=config,
            vector_index=vector_index,
            keyword_index=keyword_index,
            graph_store=graph_store,
            index_manager=index_manager,
            orchestrator=orchestrator,
            file_watcher=file_watcher,
            code_index=code_index,
            commit_indexer=commit_indexer,  # NEW
        )
```

**LOC:** +60 lines in context.py

#### Step 3.3: Add Initial Indexing

**File:** `src/context.py`

Add method to ApplicationContext:

```python
def index_git_commits_initial(self) -> None:
    """Index all commits in discovered repositories (startup only)."""
    if self.commit_indexer is None:
        return

    from src.git.repository import discover_git_repositories, get_commits_after_timestamp
    from src.git.commit_parser import parse_commit, build_commit_document

    logger.info("Starting initial git commit indexing")
    repos = discover_git_repositories(
        Path(self.config.indexing.documents_path),
        self.config.indexing.exclude,
        self.config.indexing.exclude_hidden_dirs,
    )
    total_indexed = 0
    for repo_path in repos:
        try:
            # Get last indexed timestamp for this repo
            last_timestamp = self.commit_indexer.get_last_indexed_timestamp(str(repo_path))
            # Get new commits
            commit_hashes = get_commits_after_timestamp(repo_path, last_timestamp)
            logger.info(f"Indexing {len(commit_hashes)} commits from {repo_path.parent}")
            # Batch process
            for i in range(0, len(commit_hashes), self.config.git_indexing.batch_size):
                batch = commit_hashes[i:i + self.config.git_indexing.batch_size]
                for hash in batch:
                    try:
                        commit = parse_commit(
                            repo_path,
                            hash,
                            self.config.git_indexing.delta_max_lines,
                        )
                        doc = build_commit_document(commit)
                        self.commit_indexer.add_commit(
                            hash=commit.hash,
                            timestamp=commit.timestamp,
                            author=commit.author,
                            committer=commit.committer,
                            title=commit.title,
                            message=commit.message,
                            files_changed=commit.files_changed,
                            delta_truncated=commit.delta_truncated,
                            commit_document=doc,
                        )
                        total_indexed += 1
                    except Exception as e:
                        logger.error(f"Failed to index commit {hash}: {e}")
        except Exception as e:
            logger.error(f"Failed to index repository {repo_path}: {e}")
    logger.info(f"Initial git commit indexing complete: {total_indexed} commits")
```

Call this method in `MCPServer.startup()` after index loading.

**LOC:** +30 lines in context.py

**Quality Gate:** Startup indexing works, logs show progress

---

### Phase 4: Search Tool & MCP Integration (2-3 hours, ~240 LOC)

**Objective:** Implement search logic and expose MCP tool.

#### Step 4.1: Create Commit Search Module

**File:** `src/git/commit_search.py`

```python
"""Git commit search with glob filtering."""

import logging
from dataclasses import dataclass
from pathlib import Path
from typing import Optional

from src.git.commit_indexer import CommitIndexer

logger = logging.getLogger(__name__)


@dataclass
class CommitResult:
    hash: str
    title: str
    author: str
    committer: str
    timestamp: int
    message: str
    files_changed: list[str]
    delta_truncated: str
    score: float
    repo_path: str


@dataclass
class GitSearchResponse:
    results: list[CommitResult]
    query: str
    total_commits_indexed: int


def search_git_history(
    commit_indexer: CommitIndexer,
    query: str,
    top_n: int = 5,
    files_glob: Optional[str] = None,
    after_timestamp: Optional[int] = None,
    before_timestamp: Optional[int] = None,
) -> GitSearchResponse:
    """
    Search git commit history with optional filters.
    Args:
        commit_indexer: CommitIndexer instance
        query: Natural language query
        top_n: Maximum results to return
        files_glob: Optional glob pattern (e.g., 'src/**/*.py')
        after_timestamp: Optional Unix timestamp (commits after)
        before_timestamp: Optional Unix timestamp (commits before)

    Returns:
        GitSearchResponse with ranked commits
    """
    # Generate query embedding
    query_embedding = commit_indexer._embedding_model.get_text_embedding(query)

    # Query index
    candidates = commit_indexer.query_by_embedding(
        query_embedding,
        top_k=top_n * 2,  # Over-fetch for filtering
        after_timestamp=after_timestamp,
        before_timestamp=before_timestamp,
    )

    # Apply glob filtering
    if files_glob:
        candidates = filter_by_glob(candidates, files_glob)

    # Take top N
    results = []
    for commit_dict in candidates[:top_n]:
        results.append(CommitResult(
            hash=commit_dict["hash"],
            title=commit_dict["title"],
            author=commit_dict["author"],
            committer=commit_dict["committer"],
            timestamp=commit_dict["timestamp"],
            message=commit_dict["message"],
            files_changed=commit_dict["files_changed"],
            delta_truncated=commit_dict["delta_truncated"],
            score=commit_dict["score"],
            repo_path=commit_dict.get("repo_path", ""),
        ))

    total = commit_indexer.get_total_commits()
    return GitSearchResponse(
        results=results,
        query=query,
        total_commits_indexed=total,
    )


def filter_by_glob(commits: list[dict], glob_pattern: str) -> list[dict]:
    """
    Filter commits by glob pattern matching any changed file.
    Args:
        commits: List of commit dicts with 'files_changed' key
        glob_pattern: Glob pattern (e.g., 'src/**/*.py')

    Returns:
        Filtered list of commits
    """
    filtered = []
    for commit in commits:
        files_changed = commit.get("files_changed", [])
        for file_path in files_changed:
            # Caveat: Path.match() does not treat '**' as a recursive wildcard
            # (it behaves like '*'), so 'src/**/*.py' only matches files exactly
            # one directory below src/. Python 3.13+ offers
            # PurePath.full_match() for recursive patterns.
            if Path(file_path).match(glob_pattern):
                filtered.append(commit)
                break  # Match found, include commit
    return filtered
```

**LOC:** +150 lines

#### Step 4.2: Add CommitResult Model

**File:** `src/models.py`

Add dataclasses at end of file:

```python
@dataclass
class CommitResult:
    """Git commit search result."""
    hash: str
    title: str
    author: str
    committer: str
    timestamp: int
    message: str
    files_changed: list[str]
    delta_truncated: str
    score: float
    repo_path: str


@dataclass
class GitSearchResponse:
    """Response from git history search."""
    results: list[CommitResult]
    query: str
    total_commits_indexed: int
```

**LOC:** +20 lines

#### Step 4.3: Register MCP Tool

**File:** `src/mcp_server.py`

Add to `list_tools()`:

```python
Tool(
    name="search_git_history",
    description=(
        "Search git commit history using natural language queries. "
        + "Returns relevant commits with metadata, message, and diff context. "
        + "Supports filtering by file glob patterns and timestamp ranges."
    ),
    inputSchema={
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": "Natural language query describing commits to find",
            },
            "top_n": {
                "type": "integer",
                "description": "Maximum number of commits to return (default: 5, max: 100)",
                "default": 5,
                "minimum": 1,
                "maximum": 100,
            },
            "files_glob": {
                "type": "string",
                "description": "Optional glob pattern to filter by changed files (e.g., 'src/**/*.py')",
            },
            "after_timestamp": {
                "type": "integer",
                "description": "Optional Unix timestamp to filter commits after this date",
            },
            "before_timestamp": {
                "type": "integer",
                "description": "Optional Unix timestamp to filter commits before this date",
            },
        },
        "required": ["query"],
    },
),
```

Add handler method:

```python
async def _handle_search_git_history(self, arguments: dict) -> list[TextContent]:
    """Handle search_git_history tool call."""
    if self.ctx.commit_indexer is None:
        return [TextContent(
            type="text",
            text="Git history search is not available (git binary not found or disabled in config)"
        )]

    from src.git.commit_search import search_git_history

    query = arguments["query"]
    top_n = arguments.get("top_n", 5)
    files_glob = arguments.get("files_glob")
    after_timestamp = arguments.get("after_timestamp")
    before_timestamp = arguments.get("before_timestamp")

    # Validate top_n
    top_n = max(MIN_TOP_N, min(top_n, MAX_TOP_N))

    # Execute search
    response = await asyncio.to_thread(
        search_git_history,
        self.ctx.commit_indexer,
        query,
        top_n,
        files_glob,
        after_timestamp,
        before_timestamp,
    )

    # Format response
    output_lines = [
        "# Git History Search Results",
        "",
        f"**Query:** {response.query}",
        f"**Total Commits Indexed:** {response.total_commits_indexed}",
        f"**Results Returned:** {len(response.results)}",
        "",
    ]
    for i, commit in enumerate(response.results, 1):
        output_lines.extend([
            f"## {i}. {commit.title}",
            "",
            f"**Commit:** `{commit.hash[:8]}`",
            f"**Author:** {commit.author}",
            f"**Date:** {datetime.fromtimestamp(commit.timestamp, timezone.utc).isoformat()}",
            f"**Score:** {commit.score:.3f}",
            "",
            "### Message",
            "",
            commit.message if commit.message else "(no message body)",
            "",
            f"### Files Changed ({len(commit.files_changed)})",
            "",
        ])
        for file_path in commit.files_changed[:10]:  # Limit to 10 files
            output_lines.append(f"- `{file_path}`")
        if len(commit.files_changed) > 10:
            output_lines.append(f"- ... and {len(commit.files_changed) - 10} more")
        output_lines.extend([
            "",
            "### Delta (truncated)",
            "",
            "```diff",
            commit.delta_truncated[:1000],  # Truncate for display
            "```",
            "",
            "---",
            "",
        ])
    return [TextContent(type="text", text="\n".join(output_lines))]
```

Update `call_tool()` dispatcher:

```python
@self.server.call_tool()
async def call_tool(name: str, arguments: dict) -> list[TextContent]:
    if name == "query_documents":
        return await self._handle_query_documents(arguments)
    elif name == "query_unique_documents":
        return await self._handle_query_unique_documents(arguments)
    elif name == "search_git_history":  # NEW
        return await self._handle_search_git_history(arguments)
    else:
        raise ValueError(f"Unknown tool: {name}")
```

**LOC:** +90 lines in mcp_server.py

**Quality Gate:** Tool callable via MCP, returns formatted results

---

### Phase 5: File Watcher Integration (2-3 hours, ~160 LOC)

**Objective:** Auto-update commit index when git operations occur.
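A single `git commit` or `git fetch` usually touches several files under `.git` in quick succession, so the watcher needs to collapse a burst of events into one re-index per repository. The snippet below is a minimal, synchronous sketch of that debouncing idea using only the standard library. The function name `drain_debounced` and its `poll` parameter are hypothetical, and it implements a simple "wait until quiet" variant rather than the exact asyncio loop planned for `GitWatcher`.

```python
import queue
import time


def drain_debounced(events: queue.Queue, cooldown: float, poll: float = 0.05) -> set[str]:
    """Block until one event arrives, then keep collecting until no new
    event has arrived for `cooldown` seconds. Returns the distinct events."""
    pending = {events.get()}  # wait for the first event
    deadline = time.monotonic() + cooldown
    while time.monotonic() < deadline:
        try:
            pending.add(events.get(timeout=poll))
            # Any activity restarts the quiet period.
            deadline = time.monotonic() + cooldown
        except queue.Empty:
            pass  # nothing new yet; keep waiting until the deadline
    return pending


# Three filesystem events, two of them for the same repository.
event_queue: queue.Queue = queue.Queue()
for git_dir in ["repo_a/.git", "repo_a/.git", "repo_b/.git"]:
    event_queue.put(git_dir)

changed = drain_debounced(event_queue, cooldown=0.2)
```

Collecting into a set is what deduplicates the burst: each repository appears once in `changed` no matter how many ref updates fired, so the incremental indexer runs once per repository.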
#### Step 5.1: Create Git Watcher

**File:** `src/git/watcher.py`

```python
"""Git directory file watcher for automatic commit indexing."""

import asyncio
import logging
import queue
from pathlib import Path

from watchdog.events import FileSystemEvent, FileSystemEventHandler
from watchdog.observers import Observer

logger = logging.getLogger(__name__)


class GitWatcher:
    """Watches .git directories for changes and triggers incremental indexing."""

    def __init__(
        self,
        git_repos: list[Path],
        commit_indexer,
        config,
        cooldown: float = 5.0,
    ):
        """
        Initialize git watcher.

        Args:
            git_repos: List of .git directory paths to watch
            commit_indexer: CommitIndexer instance
            config: Configuration object
            cooldown: Debounce cooldown in seconds
        """
        self._git_repos = git_repos
        self._commit_indexer = commit_indexer
        self._config = config
        self._cooldown = cooldown
        self._observers: list[Observer] = []
        self._event_queue = queue.Queue[Path]()
        self._running = False
        self._task: asyncio.Task | None = None

    def start(self) -> None:
        """Start watching git directories."""
        if self._running:
            return
        self._running = True
        for git_dir in self._git_repos:
            # Watch specific paths: HEAD and refs/
            watch_paths = [
                git_dir / "HEAD",
                git_dir / "refs",
            ]
            for watch_path in watch_paths:
                if watch_path.exists():
                    event_handler = _GitEventHandler(self._event_queue, git_dir)
                    observer = Observer()
                    observer.schedule(
                        event_handler,
                        str(watch_path),
                        recursive=(watch_path.name == "refs"),
                    )
                    observer.start()
                    self._observers.append(observer)
        self._task = asyncio.create_task(self._process_events())
        logger.info(f"Git watcher started for {len(self._git_repos)} repositories")

    async def stop(self) -> None:
        """Stop watching git directories."""
        if not self._running:
            return
        self._running = False

        # Stop all observers
        for observer in self._observers:
            observer.stop()
            try:
                await asyncio.wait_for(
                    asyncio.to_thread(observer.join, timeout=1.0),
                    timeout=1.5,
                )
            except asyncio.TimeoutError:
                logger.warning("Observer thread did not stop within timeout")
        self._observers.clear()

        # Cancel processing task
        if self._task:
            self._task.cancel()
            try:
                await asyncio.wait_for(self._task, timeout=1.0)
            except (asyncio.TimeoutError, asyncio.CancelledError):
                pass
            self._task = None
        logger.info("Git watcher stopped")

    async def _process_events(self) -> None:
        """Process queued git directory changes with debouncing."""
        pending_repos: set[Path] = set()
        while self._running:
            try:
                try:
                    git_dir = await asyncio.to_thread(
                        self._event_queue.get, timeout=0.5
                    )
                    pending_repos.add(git_dir)
                except queue.Empty:
                    if pending_repos:
                        await asyncio.sleep(self._cooldown)
                        await self._batch_process(pending_repos)
                        pending_repos.clear()
            except asyncio.CancelledError:
                break
            except Exception as e:
                logger.error(f"Error in git event processing: {e}")

        # Process remaining
        if pending_repos:
            try:
                await asyncio.wait_for(
                    self._batch_process(pending_repos),
                    timeout=5.0,
                )
            except (asyncio.TimeoutError, Exception) as e:
                logger.warning(f"Failed to process final git events: {e}")

    async def _batch_process(self, git_dirs: set[Path]) -> None:
        """Incrementally index commits for changed repositories."""
        from src.git.repository import get_commits_after_timestamp
        from src.git.commit_parser import parse_commit, build_commit_document

        for git_dir in git_dirs:
            try:
                # Get last indexed timestamp
                last_timestamp = self._commit_indexer.get_last_indexed_timestamp(
                    str(git_dir)
                )
                # Get new commits
                commit_hashes = await asyncio.to_thread(
                    get_commits_after_timestamp,
                    git_dir,
                    last_timestamp,
                )
                if not commit_hashes:
                    logger.debug(f"No new commits in {git_dir.parent}")
                    continue
                logger.info(f"Indexing {len(commit_hashes)} new commits from {git_dir.parent}")
                # Index new commits
                for hash in commit_hashes:
                    try:
                        commit = await asyncio.to_thread(
                            parse_commit,
                            git_dir,
                            hash,
                            self._config.git_indexing.delta_max_lines,
                        )
                        doc = build_commit_document(commit)
                        await asyncio.to_thread(
                            self._commit_indexer.add_commit,
                            hash=commit.hash,
                            timestamp=commit.timestamp,
                            author=commit.author,
                            committer=commit.committer,
                            title=commit.title,
                            message=commit.message,
                            files_changed=commit.files_changed,
                            delta_truncated=commit.delta_truncated,
                            commit_document=doc,
                        )
                    except Exception as e:
                        logger.error(f"Failed to index commit {hash}: {e}")
                logger.info(f"Updated commit index for {git_dir.parent.name}")
            except Exception as e:
                logger.error(f"Failed to update commits for {git_dir}: {e}")


class _GitEventHandler(FileSystemEventHandler):
    """Event handler for git directory changes."""

    # The parameter is named event_queue so it does not shadow the queue module.
    def __init__(self, event_queue: queue.Queue[Path], git_dir: Path):
        super().__init__()
        self._queue = event_queue
        self._git_dir = git_dir

    def on_modified(self, event: FileSystemEvent) -> None:
        """Detect commits via refs/ or HEAD changes."""
        if event.is_directory:
            return
        path = Path(event.src_path)
        # Trigger on HEAD or refs/* changes
        if path.name == "HEAD" or "refs" in path.parts:
            self._queue.put_nowait(self._git_dir)

    def on_created(self, event: FileSystemEvent) -> None:
        """Detect new branches/tags."""
        if not event.is_directory and "refs" in Path(event.src_path).parts:
            self._queue.put_nowait(self._git_dir)
```

**LOC:** +120 lines

#### Step 5.2: Integrate Git Watcher in Lifecycle

**File:** `src/lifecycle.py`

Add git watcher field and management:

```python
from src.git.watcher import GitWatcher


class LifecycleCoordinator:
    def __init__(self):
        # ... existing fields ...
        self._git_watcher: GitWatcher | None = None

    async def start(self, ctx: ApplicationContext) -> None:
        # ... existing startup logic ...
        # Start git watcher if enabled
        if ctx.config.git_indexing.enabled and ctx.config.git_indexing.watch_enabled:
            if ctx.commit_indexer is not None:
                from src.git.repository import discover_git_repositories

                repos = discover_git_repositories(
                    Path(ctx.config.indexing.documents_path),
                    ctx.config.indexing.exclude,
                    ctx.config.indexing.exclude_hidden_dirs,
                )
                self._git_watcher = GitWatcher(
                    git_repos=repos,
                    commit_indexer=ctx.commit_indexer,
                    config=ctx.config,
                    cooldown=ctx.config.git_indexing.watch_cooldown,
                )
                self._git_watcher.start()
                logger.info("Git watcher started")

    async def shutdown(self, ctx: ApplicationContext) -> None:
        # ... existing shutdown logic ...
        # Stop git watcher
        if self._git_watcher:
            await self._git_watcher.stop()
            self._git_watcher = None
```

**LOC:** +40 lines in lifecycle.py

**Quality Gate:** Git watcher detects changes, logs show incremental updates

---

### Phase 6: Testing & Documentation (3-4 hours, ~450 LOC)

**Objective:** Comprehensive test coverage and documentation.

#### Step 6.1: Integration Tests

**File:** `tests/integration/test_git_search.py`

```python
"""Integration tests for git commit search."""

import subprocess  # Needed by the test_repo fixture below
from pathlib import Path

import pytest


@pytest.fixture
def test_repo(tmp_path):
    """Create a test git repository with commits."""
    repo_path = tmp_path / "test_repo"
    repo_path.mkdir()

    # Initialize git repo
    subprocess.run(["git", "init"], cwd=repo_path, check=True)
    subprocess.run(["git", "config", "user.name", "Test User"], cwd=repo_path, check=True)
    subprocess.run(["git", "config", "user.email", "test@example.com"], cwd=repo_path, check=True)

    # Create test commits
    for i in range(5):
        file_path = repo_path / f"file_{i}.txt"
        file_path.write_text(f"Content {i}")
        subprocess.run(["git", "add", "."], cwd=repo_path, check=True)
        subprocess.run(
            ["git", "commit", "-m", f"Commit {i}: Add file {i}"],
            cwd=repo_path,
            check=True,
        )
    return repo_path


def test_end_to_end_search(test_repo, tmp_path):
    """Test full workflow: discover -> index -> search."""
    # Test implementation
    pass


def test_incremental_indexing(test_repo, tmp_path):
    """Test incremental update after new commit."""
    # Test implementation
    pass


def test_glob_filtering(test_repo, tmp_path):
    """Test file glob pattern filtering."""
    # Test implementation
    pass


def test_timestamp_filtering(test_repo, tmp_path):
    """Test temporal filtering."""
    # Test implementation
    pass
```

**LOC:** ~120 lines

#### Step 6.2: E2E MCP Tool Test

**File:** `tests/e2e/test_git_mcp.py`

```python
"""End-to-end test for search_git_history MCP tool."""

import pytest


@pytest.mark.e2e
async def test_search_git_history_tool(mcp_server, test_repo):
    """Test search_git_history tool via MCP protocol."""
    response = await mcp_server.call_tool(
        "search_git_history",
        {"query": "Add file", "top_n": 5}
    )
    assert len(response) == 1
    assert "Git History Search Results" in response[0].text
    # More assertions
```

**LOC:** ~30 lines

#### Step 6.3: Documentation Updates

**File:** `docs/architecture.md`

Add section:

```markdown
### Git History Search

#### CommitIndexer (src/git/commit_indexer.py)

**Technology:** SQLite with BLOB embeddings

**Schema:**

- `hash`: TEXT PRIMARY KEY (commit SHA)
- `timestamp`: INTEGER (Unix seconds)
- `author`: TEXT
- `committer`: TEXT
- `title`: TEXT (first line of message)
- `message`: TEXT (full body)
- `files_changed`: TEXT (JSON array)
- `delta_truncated`: TEXT (max 200 lines)
- `embedding`: BLOB (384-dim float32)
- `indexed_at`: INTEGER

**Storage:** `{index_path}/git_commits.db`

**Embedding Model:** Shared with VectorIndex (BAAI/bge-small-en-v1.5)

#### Repository Discovery (src/git/repository.py)

Recursively finds `.git` directories, respecting exclusion patterns from `IndexingConfig.exclude`.

#### GitWatcher (src/git/watcher.py)

Watches `.git/HEAD` and `.git/refs/` for changes (new commits, branches). Triggers incremental indexing with 5-second cooldown.
``` **LOC:** +60 lines in docs/architecture.md **File:** `docs/git-search.md` (new) Create user guide: ```markdown # Git History Search ## Overview Search your git commit history using natural language queries. ## Configuration ```toml [git_indexing] enabled = true delta_max_lines = 200 batch_size = 100 watch_enabled = true watch_cooldown = 5.0 ``` ## Usage Examples ### Basic Search ```json { "query": "fix authentication bug", "top_n": 10 } ``` ### Filter by Files ```json { "query": "API changes", "files_glob": "src/**/*.py" } ``` ### Temporal Filtering ```json { "query": "performance improvements", "after_timestamp": 1704067200 } ``` ## Troubleshooting ### Git binary not found Ensure `git` is in your PATH. ### No commits indexed Check logs for repository discovery issues. ``` **LOC:** ~100 lines in docs/git-search.md **File:** `README.md` Add feature description: ```markdown ### Git History Search - Semantic search over commit history - Filter by file patterns and timestamps - Automatic incremental indexing via file watching ``` **LOC:** +20 lines in README.md **Quality Gate:** All tests pass, documentation complete --- ## Dependencies & Ordering ```mermaid graph TD A[Phase 1: Git Module] --> B[Phase 2: Commit Index] B --> C[Phase 3: Configuration] C --> D[Phase 4: Search Tool] D --> E[Phase 5: File Watcher] E --> F[Phase 6: Testing & Docs] style A fill:#e1f5e1 style B fill:#e1f5e1 style C fill:#fff4e1 style D fill:#fff4e1 style E fill:#e1f0ff style F fill:#f0e1ff ``` **Critical Path:** 1. Phase 1 (foundation) → Phase 2 (storage) → Phase 3 (config) → Phase 4 (search) 2. Phase 5 (watcher) depends on Phase 3 but can run parallel to Phase 4 3. 
Phase 6 (testing) depends on all prior phases **Parallelization Opportunities:** - Phase 1 unit tests can be written during Phase 2 implementation - Phase 5 can start after Phase 3 completes - Documentation can be written in parallel with Phase 6 testing --- ## Test Specifications ### Unit Tests #### test_repository.py | Test Case | Assertion | Coverage Target | |-----------|-----------|-----------------| | `test_discover_single_repo` | Found 1 .git directory | 95% | | `test_discover_nested_repos` | Found 2 .git directories | - | | `test_exclude_venv_pattern` | Excluded .venv/.git | - | | `test_exclude_hidden_dirs` | Skipped .hidden/.git | - | | `test_no_repos_found` | Returns empty list | - | | `test_get_all_commits` | Returns 10 commits | - | | `test_get_commits_after_timestamp` | Returns 3 recent commits | - | | `test_git_not_available` | Returns False | - | #### test_commit_parser.py | Test Case | Assertion | Coverage Target | |-----------|-----------|-----------------| | `test_parse_standard_commit` | All fields populated | 90% | | `test_parse_merge_commit` | Multiple parents handled | - | | `test_multiline_message` | Message body captured | - | | `test_delta_truncation_200_lines` | Exactly 200 lines + marker | - | | `test_delta_no_truncation` | No marker if < 200 lines | - | | `test_utf8_encoding` | Decodes correctly | - | | `test_latin1_fallback` | Handles encoding error | - | | `test_build_commit_document` | Format matches spec | - | #### test_commit_indexer.py | Test Case | Assertion | Coverage Target | |-----------|-----------|-----------------| | `test_schema_creation` | Table exists with indexes | 95% | | `test_add_commit` | Commit stored with embedding | - | | `test_update_commit_idempotent` | Same hash updates | - | | `test_remove_commit` | Commit deleted | - | | `test_query_by_embedding` | Top-k ranked by similarity | - | | `test_timestamp_filter_after` | Filters correctly | - | | `test_timestamp_filter_before` | Filters correctly | - | | 
`test_empty_index_query` | Returns empty list | - | | `test_embedding_roundtrip` | Serialize/deserialize matches | - | | `test_malformed_json_files` | Handles gracefully | - | ### Integration Tests #### test_git_search.py | Test Case | Description | Duration | |-----------|-------------|----------| | `test_end_to_end_search` | Full workflow: discover → index → search | <5s | | `test_incremental_indexing` | Add commit, verify indexed | <3s | | `test_glob_filtering` | Filter by `**/*.py` | <2s | | `test_timestamp_filtering` | Query recent commits only | <2s | ### E2E Tests #### test_git_mcp.py | Test Case | Description | Duration | |-----------|-------------|----------| | `test_search_git_history_tool` | Call tool via MCP, verify response format | <5s | --- ## Quality Gates ### Phase 1 Quality Gate - ✅ `ruff check src/git/repository.py src/git/commit_parser.py` passes - ✅ `pyright src/git/` passes with 0 errors - ✅ `pytest tests/unit/test_repository.py tests/unit/test_commit_parser.py -v` passes - ✅ Coverage ≥ 90% for both modules **Command:** ```bash ruff check src/git/ && pyright src/git/ && pytest tests/unit/test_repository.py tests/unit/test_commit_parser.py --cov=src/git --cov-report=term ``` ### Phase 2 Quality Gate - ✅ `ruff check src/git/commit_indexer.py` passes - ✅ `pyright src/git/commit_indexer.py` passes - ✅ `pytest tests/unit/test_commit_indexer.py -v` passes - ✅ Coverage ≥ 95% for commit_indexer.py - ✅ SQLite schema verified: `sqlite3 .index_data/git_commits.db ".schema"` **Command:** ```bash ruff check src/git/commit_indexer.py && pyright src/git/commit_indexer.py && pytest tests/unit/test_commit_indexer.py --cov=src/git/commit_indexer --cov-report=term ``` ### Phase 3 Quality Gate - ✅ `ruff check src/config.py src/context.py` passes - ✅ `pyright src/config.py src/context.py` passes - ✅ Server starts without errors: `python -m src.mcp_server 2>&1 | grep -i "git"` - ✅ Logs show "Git commit indexer initialized" - ✅ `index_git_commits_initial()` 
completes successfully **Command:** ```bash ruff check src/config.py src/context.py && pyright src/config.py src/context.py ``` ### Phase 4 Quality Gate - ✅ `ruff check src/git/commit_search.py src/mcp_server.py src/models.py` passes - ✅ `pyright src/git/commit_search.py src/mcp_server.py src/models.py` passes - ✅ Tool listed in MCP tools: `python -m src.mcp_server list-tools | grep search_git_history` - ✅ Manual test call succeeds (returns formatted results) **Command:** ```bash ruff check src/git/commit_search.py src/mcp_server.py src/models.py && pyright src/git/commit_search.py src/mcp_server.py src/models.py ``` ### Phase 5 Quality Gate - ✅ `ruff check src/git/watcher.py src/lifecycle.py` passes - ✅ `pyright src/git/watcher.py src/lifecycle.py` passes - ✅ Watcher starts: logs show "Git watcher started for N repositories" - ✅ Make test commit, verify incremental index update within 10s - ✅ Shutdown completes within 2s **Command:** ```bash ruff check src/git/watcher.py src/lifecycle.py && pyright src/git/watcher.py src/lifecycle.py ``` ### Phase 6 Quality Gate - ✅ All unit tests pass: `pytest tests/unit/ -v --cov=src/git --cov-report=term` - ✅ All integration tests pass: `pytest tests/integration/test_git_search.py -v` - ✅ E2E test passes: `pytest tests/e2e/test_git_mcp.py -v` - ✅ Overall coverage ≥ 85% for `src/git/` module - ✅ Documentation rendered correctly: `grip docs/git-search.md` - ✅ CHANGELOG updated with feature description **Command:** ```bash pytest tests/ -v --cov=src/git --cov-report=html && open htmlcov/index.html ``` --- ## Acceptance Criteria ### Functional Requirements 1. **Repository Discovery** - [x] Discovers all `.git` directories recursively - [x] Respects exclusion patterns from config - [x] Excludes hidden directories (except `.git`) - [x] Returns absolute paths to `.git` directories 2. 
**Commit Indexing** - [x] Indexes all commits on all branches - [x] Extracts metadata: hash, timestamp, author, committer, title, message - [x] Truncates delta to 200 lines with indicator - [x] Stores embeddings in SQLite BLOB - [x] Handles encoding errors gracefully (UTF-8 → Latin-1 fallback) 3. **Search Functionality** - [x] Semantic search via embedding similarity - [x] Returns top-N ranked results - [x] Filters by glob pattern (files changed) - [x] Filters by timestamp range (after/before) - [x] Returns formatted response with scores 4. **Incremental Updates** - [x] Detects new commits via file watching - [x] Updates index within 10s of git operation - [x] Debounces rapid git operations (5s cooldown) - [x] Logs progress and errors 5. **MCP Tool** - [x] Registered as `search_git_history` - [x] Accepts query, top_n, files_glob, timestamps - [x] Returns markdown-formatted results - [x] Handles errors gracefully (missing git, no commits) ### Non-Functional Requirements 1. **Performance** - [x] Full index build: <60s for 1000 commits - [x] Incremental update: <10s for 10 commits - [x] Query latency: <500ms (p95) for 10k commits - [x] Embedding generation: <5s per 100 commits 2. **Reliability** - [x] No crashes on malformed repositories - [x] No data loss on server restart - [x] Graceful degradation if git unavailable - [x] Atomic SQLite transactions 3. **Maintainability** - [x] Type hints on all public functions - [x] Docstrings on all modules and classes - [x] 85%+ test coverage for new code - [x] Passes ruff and pyright checks --- ## Risk Mitigation ### High Risk: Large Repository Performance **Risk:** Query latency >1s for repositories with 10k+ commits. **Mitigation:** - Phase 2: Implement batch embedding (100 commits at a time) - Phase 4: Add LIMIT clause to SQLite queries - Phase 6: Add performance benchmark test - Post-launch: Consider HNSW index for ANN search **Fallback:** Document recommended commit limit (10k), add config option to skip old commits. 
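
The query path that this risk concerns can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the planned `CommitIndexer`: it uses only the standard library instead of `numpy`, the table carries a subset of the documented columns, the helper names (`pack_embedding`, `unpack_embedding`, `search_commits`) and the `max_candidates` parameter are hypothetical, and the toy vectors have 3 dimensions rather than 384. It shows float32 BLOB storage, a timestamp filter and `LIMIT` pushed into SQL to bound the candidate set, and a brute-force cosine ranking.

```python
import heapq
import math
import sqlite3
from array import array


def pack_embedding(vec):
    """Serialize a vector to a float32 BLOB (hypothetical helper)."""
    return array("f", vec).tobytes()


def unpack_embedding(blob):
    """Deserialize a float32 BLOB back into a list of floats."""
    arr = array("f")
    arr.frombytes(blob)
    return arr.tolist()


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search_commits(conn, query_vec, top_n=5, after_timestamp=None, max_candidates=10_000):
    """Brute-force top-N search; SQL bounds how many rows are scored."""
    sql = "SELECT hash, title, embedding FROM commits"
    params = []
    if after_timestamp is not None:
        sql += " WHERE timestamp > ?"
        params.append(after_timestamp)
    # Newest commits first, capped so very old history is never scanned.
    sql += " ORDER BY timestamp DESC LIMIT ?"
    params.append(max_candidates)
    rows = conn.execute(sql, params)
    scored = ((cosine(query_vec, unpack_embedding(blob)), h, title) for h, title, blob in rows)
    return heapq.nlargest(top_n, scored)


# Toy data: an in-memory database with three commits.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE commits (hash TEXT PRIMARY KEY, timestamp INTEGER, title TEXT, embedding BLOB)"
)
conn.executemany(
    "INSERT INTO commits VALUES (?, ?, ?, ?)",
    [
        ("a1", 100, "Fix auth bug", pack_embedding([1.0, 0.0, 0.0])),
        ("b2", 200, "Add search tool", pack_embedding([0.0, 1.0, 0.0])),
        ("c3", 300, "Refactor auth", pack_embedding([0.9, 0.1, 0.0])),
    ],
)
results = search_commits(conn, [1.0, 0.0, 0.0], top_n=2)
```

Scoring cost grows linearly with the number of candidate rows, which is why the plan keeps an ANN index such as HNSW as a post-launch option and a commit cap as the fallback.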
### Medium Risk: Encoding Errors

**Risk:** Non-UTF-8 commit messages cause indexing failures.

**Mitigation:**
- Phase 1: Implement encoding fallback chain (UTF-8 → CP1252 → Latin-1; Latin-1 goes last because it decodes any byte sequence, so nothing placed after it would ever run)
- Phase 1: Replace invalid bytes with `�` character
- Phase 6: Add test case for non-UTF-8 commits

**Fallback:** Log error, skip commit, continue indexing.

### Medium Risk: File Descriptor Limits

**Risk:** Watching many repositories exhausts file descriptors.

**Mitigation:**
- Phase 5: Watch only `.git/HEAD` and `.git/refs/` (2 descriptors per repo)
- Phase 6: Document file descriptor increase (`ulimit -n 4096`)
- Phase 5: Log warning if >10 repositories discovered

**Fallback:** Disable git watcher, rely on startup reconciliation.

### Low Risk: SQLite Locking

**Risk:** Concurrent queries block incremental updates.

**Mitigation:**
- Phase 2: Use WAL mode for SQLite
- Phase 5: Keep write transactions short (<100ms)

**Fallback:** Retry with exponential backoff.

---

## Appendix: Quick Reference

### File Checklist

**Phase 1:**
- [ ] `src/git/__init__.py`
- [ ] `src/git/repository.py`
- [ ] `src/git/commit_parser.py`
- [ ] `tests/unit/test_repository.py`
- [ ] `tests/unit/test_commit_parser.py`

**Phase 2:**
- [ ] `src/git/commit_indexer.py`
- [ ] `tests/unit/test_commit_indexer.py`

**Phase 3:**
- [ ] Modify `src/config.py`
- [ ] Modify `src/context.py`

**Phase 4:**
- [ ] `src/git/commit_search.py`
- [ ] Modify `src/models.py`
- [ ] Modify `src/mcp_server.py`

**Phase 5:**
- [ ] `src/git/watcher.py`
- [ ] Modify `src/lifecycle.py`

**Phase 6:**
- [ ] `tests/integration/test_git_search.py`
- [ ] `tests/e2e/test_git_mcp.py`
- [ ] Modify `docs/architecture.md`
- [ ] `docs/git-search.md`
- [ ] Modify `README.md`

### Command Shortcuts

```bash
# Run all git module tests
pytest tests/unit/test_repository.py tests/unit/test_commit_parser.py tests/unit/test_commit_indexer.py tests/integration/test_git_search.py -v

# Check code quality
ruff check src/git/ && pyright src/git/

# Test MCP tool manually
echo '{"jsonrpc":"2.0","method":"tools/call","params":{"name":"search_git_history","arguments":{"query":"test","top_n":5}},"id":1}' | python -m src.mcp_server

# Monitor git watcher logs
python -m src.mcp_server 2>&1 | grep -i "git"

# Check SQLite schema
sqlite3 ~/.local/share/mcp-markdown-ragdocs/*/git_commits.db ".schema"

# Measure query latency
time python -c "from src.git.commit_search import search_git_history; ..."
```

---

## Summary for Lead Agent

**Plan Location:** `docs/plans/git-history-search.md`

**Estimated Total:**
- **Files:** 14 (8 new, 6 modified)
- **LOC:** ~1,850 lines
  - Implementation: ~1,100 LOC
  - Tests: ~450 LOC
  - Config/Integration: ~200 LOC
  - Documentation: ~100 LOC

**Implementation Time:** 15-20 hours across 6 phases

**Dependencies:**
- External: `git` binary, `sqlite3`, `numpy`
- Internal: `VectorIndex` (embedding model), `IndexingConfig` (exclusions), `FileWatcher` (patterns)

**Critical Success Factors:**
1. Repository discovery correctly respects exclusions
2. Embedding serialization roundtrip preserves precision
3. File watcher debouncing prevents redundant indexing
4. SQLite transactions avoid data corruption

**Validation Steps:**
1. Run all quality gates sequentially
2. Verify acceptance criteria checklist
3. Test with real repository (e.g., mcp-markdown-ragdocs itself)
4. Check documentation renders correctly

**Ready for Handoff:** This plan is code-agent ready with:
- Specific function signatures and class names
- Exact file paths and line-of-code estimates
- Concrete test cases with assertions
- Quality gates with executable commands
- Risk mitigation strategies

The code agent can proceed phase-by-phase without requiring additional architectural decisions.
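
As a supplement to critical success factor 3, the debouncing behaviour can be sketched in isolation. This is only an illustration under assumptions: `CooldownDebouncer` and its methods are hypothetical names, not part of the planned `GitWatcher`, and the clock is injected so the logic can be exercised without sleeping. It implements a trailing-edge cooldown, where a repository becomes due for reindexing only once no new event has arrived for the whole cooldown window.

```python
import time


class CooldownDebouncer:
    """Collapse bursts of events per repository into one reindex trigger."""

    def __init__(self, cooldown=5.0, clock=time.monotonic):
        self._cooldown = cooldown
        self._clock = clock
        self._last_event = {}  # repo path -> time of the most recent event

    def notify(self, repo):
        """Record an event; a repeated event restarts that repo's window."""
        self._last_event[repo] = self._clock()

    def due(self):
        """Return repos that have been quiet for at least `cooldown` seconds."""
        now = self._clock()
        ready = [r for r, t in self._last_event.items() if now - t >= self._cooldown]
        for r in ready:
            del self._last_event[r]  # each burst triggers at most once
        return ready


# Simulated timeline with a fake clock (values in seconds).
fake_now = [0.0]
debouncer = CooldownDebouncer(cooldown=5.0, clock=lambda: fake_now[0])

debouncer.notify("repo/.git")  # event at t=0
fake_now[0] = 2.0
debouncer.notify("repo/.git")  # event at t=2 restarts the window
fake_now[0] = 6.0
early = debouncer.due()        # only 4s of quiet so far
fake_now[0] = 7.5
late = debouncer.due()         # 5.5s of quiet: repo is due
again = debouncer.due()        # already handed out, nothing pending
```

A trailing-edge window suits operations such as rebases, which touch `.git/refs/` many times in quick succession; the trade-off is that indexing is delayed by at least one cooldown period after the last event.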
