# Background Indexing MVP - Implementation Handoff
**Date**: 2025-10-17
**Branch**: `015-background-indexing-mvp` ✅ **ACTIVE**
**Previous Branch**: `014-add-background-indexing`
**Status**: Phase 1 Complete (T001-T002) - Ready for T003
---
## Executive Summary
### What We're Building
Background indexing support for large repositories (10K+ files) that exceed MCP client timeout limits (30 seconds). Users can start indexing, receive immediate confirmation with a job ID, and poll for status while the work completes in the background.
### Why It's Needed
**Problem**: Large repositories (10K+ files) take 5-10 minutes to index, exceeding typical MCP request timeouts. Claude Code stops waiting, and users lose their work.
**Solution**: Background indexing with PostgreSQL job tracking
- Start job → get job_id immediately (<1s)
- Poll status with `get_indexing_status(job_id)`
- State persists in PostgreSQL (survives restarts)
### What Approach Was Chosen
**MVP-optimized approach** (6-7 hours, 13 tasks):
- PostgreSQL-native architecture (not in-memory)
- Reuse existing infrastructure (`session.py`, `indexer.py`)
- Simplified schema (10 columns vs. 18 in full plan)
- Binary state tracking (running/done, no granular progress)
- Deferred features: job listing, ETA calculation, cancellation, resumption
### Current Status
**Phase 1 Complete**: Database migration created, tested, and validated.
**✅ Completed Tasks (Phase 1)**:
- **T001**: Alembic migration created (`008_add_indexing_jobs.py`, 138 lines)
- **T002**: Migration applied and validated (see validation report)
**Deliverables**:
- Migration file with upgrade/downgrade functions
- Type stub for mypy compliance
- Comprehensive validation report (208 lines)
- Git commit: `e6573958`
**Validation Summary**:
- ✅ All 10 columns created with correct types/defaults
- ✅ CHECK constraint working (rejects invalid status)
- ✅ Partial index performance: 0.030ms (target <100ms)
- ✅ created_at index performance: 0.039ms (target <100ms)
- ✅ Downgrade tested successfully
**Next Steps**:
- **T003** (30m): Create `IndexingJob` Pydantic model
- **T003-test** (15m): Unit tests for model validation
---
## Key Decisions Made
### 1. MVP-First Approach
- **Scope**: User Story 1 only (start job + check status)
- **Timeline**: 6-7 hours vs. 10-12 hours for full plan (roughly 60% of the full-plan time)
- **Code**: ~1,200-1,500 lines vs. ~2,500 lines (50% less)
- **Rationale**: Ship core functionality fast, iterate based on user feedback
### 2. PostgreSQL-Native Architecture
- **State storage**: PostgreSQL `indexing_jobs` table (not in-memory)
- **Benefits**: Survives restarts, ACID guarantees, no custom job manager
- **Trade-off**: No real-time progress updates (poll-based only)
### 3. Reuse Existing Infrastructure
- **Database**: Use `get_session()` and `resolve_project_id()` from `session.py`
- **Indexer**: No modifications to `indexer.py` (call as-is)
- **MCP**: Use `@mcp.tool()` decorator from FastMCP
- **Rationale**: Minimize new code, leverage battle-tested infrastructure
### 4. Simplified Database Schema
- **Essential columns**: 10 vs. 18 in full plan
- **Removed**: progress_percentage, progress_message, phase messages, ETA
- **Keep**: id, repo_path, project_id, status, error_message, timestamps, counters
- **Rationale**: MVP needs binary state (running/done), not real-time progress
### 5. Deferred Features (Phase 2)
- **US2**: Job listing, ETA calculation, phase messages
- **US3**: Cancellation
- **US4**: Resumption after failures
- **Rationale**: Nice-to-have features, not critical for core functionality
---
## Implementation Plan
### Primary Task Document
**File**: `docs/architecture/background-indexing-tasks-mvp.md`
**Tasks**: 13 tasks (T001-T013)
**Timeline**: 6-7 hours
**Critical Path**: T001 → T002 → T003 → T004 → T005 → T006 → T013 (5.5 hours)
### Task Breakdown
#### Phase 1: Database (1h)
- **T001**: Create simplified Alembic migration (45m)
- **T002**: Apply migration and validate (15m)
#### Phase 2: Models (1h)
- **T003**: Create IndexingJob SQLAlchemy + Pydantic models (45m)
- **T003-test**: Unit tests for models (15m) [CAN RUN PARALLEL WITH T004]
#### Phase 3: Core Implementation (2.5h)
- **T004**: Implement simplified background worker (45m)
- **T005**: Implement `start_indexing_background()` MCP tool (30m)
- **T006**: Implement `get_indexing_status()` MCP tool (30m)
- **T007**: Add `update_job()` utility function (15m)
- **T008**: Integration test for complete workflow (30m)
#### Phase 4: Production Hardening (1h)
- **T009**: Add error handling and logging (20m) [CAN RUN PARALLEL]
- **T010**: Test state persistence across restart (20m)
- **T011**: Add configuration to .env (10m) [CAN RUN PARALLEL]
- **T012**: Update documentation (10m) [CAN RUN PARALLEL]
#### Phase 5: Validation (30m)
- **T013**: End-to-end test with large repository (30m)
### Parallel Opportunities
- T003-test can run parallel with T004 (after T003)
- T009-T012 can run in parallel (documentation and config)
---
## Key Files to Reference
### Architecture & Planning
- `docs/architecture/background-indexing-tasks-mvp.md` - **PRIMARY TASK LIST** (read this first!)
- `docs/architecture/background-indexing.md` - Detailed architecture specification
- `specs/014-add-background-indexing/spec.md` - User stories and requirements
- `.specify/memory/constitution.md` - Constitutional principles (simplicity, production quality)
### Existing Code to Reuse
- `src/database/session.py` - **Key infrastructure**
  - `get_session(ctx=ctx)` - Database session with project resolution
  - `resolve_project_id(ctx=ctx)` - 4-tier project resolution chain
  - No modifications needed, just call these functions
- `src/services/indexer.py` - **Existing indexer**
  - `index_repository(repo_path, name, db, project_id)` - No modifications needed!
  - Call as-is from background worker
- `src/mcp/server_fastmcp.py` - **MCP server**
  - `@mcp.tool()` decorator for registering new tools
  - `from src.mcp.server_fastmcp import mcp`
- `migrations/versions/` - **Alembic migration examples**
  - Look at recent migrations for schema patterns
  - Follow naming convention: `###_add_indexing_jobs.py`
### Git Context
- **Current branch**: `015-background-indexing-mvp` ✅ **ACTIVE**
- **Main branch**: `master`
- **Last commit**: `e6573958` - "feat(indexing): add indexing_jobs table migration"
---
## First Steps (Immediate Actions)
### ✅ 1. Create Feature Branch (COMPLETE)
Branch `015-background-indexing-mvp` created and active.
### ✅ 2. Database Migration (T001-T002 COMPLETE)
**Migration file**: `migrations/versions/008_add_indexing_jobs.py` ✅ Created
**Type stub**: `migrations/versions/008_add_indexing_jobs.pyi` ✅ Created
**Validation report**: `docs/migrations/008-indexing-jobs-validation.md` ✅ Created
**Status**: Migration applied, all tests passing
**Validation Results**:
- ✅ All 10 columns created with correct types/defaults
- ✅ CHECK constraint rejects invalid status values
- ✅ Partial index optimizes active job queries (0.030ms)
- ✅ created_at index optimizes history queries (0.039ms)
- ✅ Downgrade tested and working
### 3. Next: Create Pydantic Model (T003)
**Create file**: `src/models/indexing_job.py`
**Reference**: `docs/architecture/background-indexing-tasks-mvp.md` lines 196-276
**Requirements** (a model sketch follows below):
- SQLAlchemy model (`IndexingJob`) mapping to `indexing_jobs` table
- Pydantic models (`IndexingJobCreate`, `IndexingJobResponse`)
- Path validation (absolute paths only, no traversal)
- Status type safety (TypedDict or Enum)
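Before writing the real file, here is a minimal sketch of the Pydantic half (the SQLAlchemy model is omitted; field names mirror the 10-column schema, while the validator logic and class layout are illustrative assumptions, not the final implementation):
```python
from __future__ import annotations

from datetime import datetime
from pathlib import PurePosixPath
from uuid import UUID

from pydantic import BaseModel, field_validator


class IndexingJobCreate(BaseModel):
    """Input model for start_indexing_background (illustrative sketch)."""

    repo_path: str
    project_id: str

    @field_validator("repo_path")
    @classmethod
    def validate_repo_path(cls, v: str) -> str:
        # Absolute paths only (POSIX assumption), and reject traversal segments.
        if not v.startswith("/"):
            raise ValueError("repo_path must be an absolute path")
        if ".." in PurePosixPath(v).parts:
            raise ValueError("repo_path must not contain traversal segments")
        return v


class IndexingJobResponse(BaseModel):
    """Output model mirroring the 10-column indexing_jobs schema."""

    id: UUID
    repo_path: str
    project_id: str
    status: str  # 'pending' | 'running' | 'completed' | 'failed'
    error_message: str | None = None
    files_indexed: int = 0
    chunks_created: int = 0
    started_at: datetime | None = None
    completed_at: datetime | None = None
    created_at: datetime
```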
### 4. Follow TDD Approach
- Implement test task first (if exists)
- Then implementation task
- Commit after each task with conventional commit message
**Example commit messages**:
```bash
git commit -m "feat(indexing): add indexing_jobs table migration"
git commit -m "test(indexing): add model validation tests"
git commit -m "feat(indexing): implement background worker"
```
---
## Critical Context
### Problem Statement (Detailed)
**Symptoms**:
- Large repositories (10K+ files) take 5-10 minutes to index
- MCP client timeout is ~30 seconds (validated with Claude Code)
- Claude Code stops waiting, user loses progress
- Current `index_repository()` tool is synchronous (blocking)
**User Experience Impact**:
- Users with enterprise codebases cannot use semantic search
- Frustration from lost work and unclear status
- No visibility into long-running operations
### Solution Architecture
**Background indexing with PostgreSQL job tracking**:
1. **Start Job** (`start_indexing_background()`):
- Validate repo_path (absolute, no traversal)
- Create job record (status=pending)
- Spawn asyncio.Task for worker
- Return job_id immediately (<1s)
2. **Poll Status** (`get_indexing_status(job_id)`):
- Query indexing_jobs table
- Return status, counters, error message
- User polls every 2-5 seconds
3. **Background Worker**:
- Update job to status=running
- Call existing `index_repository()` (no changes!)
- Update job to status=completed or failed
- All state in PostgreSQL (survives restarts)
**State Machine**:
```
pending → running → completed
                  → failed
```
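For defensive coding in the worker, the legal transitions can be captured in a small lookup. This helper is illustrative, not part of the task list, and follows the diagram above exactly:
```python
# Legal state transitions for indexing_jobs.status (illustrative helper).
VALID_TRANSITIONS: dict[str, set[str]] = {
    "pending": {"running"},
    "running": {"completed", "failed"},
    "completed": set(),  # terminal
    "failed": set(),     # terminal
}


def assert_transition(current: str, new: str) -> None:
    """Raise if a status change would violate the state machine."""
    if new not in VALID_TRANSITIONS.get(current, set()):
        raise ValueError(f"Illegal status transition: {current!r} → {new!r}")
```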
### MVP Scope (User Story 1)
**What's Included**:
- ✅ `start_indexing_background()` - Start job, return job_id
- ✅ `get_indexing_status(job_id)` - Poll for status
- ✅ Job table with status tracking (pending/running/completed/failed)
- ✅ State persists across restarts
- ✅ Error handling and logging
**What's Deferred (Phase 2)**:
- ❌ `list_background_jobs()` - No job listing
- ❌ ETA calculation - No time estimates
- ❌ Cancellation - Jobs run to completion
- ❌ Granular progress - Just running/done (no percentage)
- ❌ Resumption - No checkpoint recovery
---
## Key Design Patterns
### Pattern 1: Reuse Existing Session Management
```python
from src.database.session import get_session, resolve_project_id

# In MCP tools and worker
async with get_session(project_id=project_id, ctx=ctx) as session:
    job = await session.get(IndexingJob, job_id)
    job.status = "completed"
    await session.commit()
```
**Why**: `session.py` already handles:
- Project resolution (4-tier chain)
- Connection pooling
- Transaction management
- Error handling
### Pattern 2: Simple Background Worker
```python
from datetime import datetime
from pathlib import Path
from uuid import UUID

from fastmcp import Context

from src.database.session import get_session
from src.models.indexing_job import IndexingJob
from src.services.indexer import index_repository


async def _background_indexing_worker(
    job_id: UUID, repo_path: str, project_id: str, ctx: Context
) -> None:
    """Simple state machine: pending → running → completed/failed."""
    try:
        # Update to running
        async with get_session(project_id=project_id, ctx=ctx) as session:
            job = await session.get(IndexingJob, job_id)
            job.status = "running"
            job.started_at = datetime.now()
            await session.commit()

        # Run existing indexer (NO MODIFICATIONS!)
        async with get_session(project_id=project_id, ctx=ctx) as session:
            result = await index_repository(
                repo_path=Path(repo_path),
                name=Path(repo_path).name,
                db=session,
                project_id=project_id,
            )

        # Update to completed
        async with get_session(project_id=project_id, ctx=ctx) as session:
            job = await session.get(IndexingJob, job_id)
            job.status = "completed"
            job.completed_at = datetime.now()
            job.files_indexed = result.files_indexed
            job.chunks_created = result.chunks_created
            await session.commit()
    except Exception as e:
        # Update to failed
        async with get_session(project_id=project_id, ctx=ctx) as session:
            job = await session.get(IndexingJob, job_id)
            job.status = "failed"
            job.error_message = str(e)
            job.completed_at = datetime.now()
            await session.commit()
```
**Why**:
- No progress callbacks (deferred to Phase 2)
- Binary state (running or done)
- Reuses existing indexer unchanged
### Pattern 3: FastMCP Tool Structure
```python
import asyncio
from typing import Any

from fastmcp import Context

from src.database.session import get_session, resolve_project_id
from src.mcp.server_fastmcp import mcp
from src.models.indexing_job import IndexingJob, IndexingJobCreate
from src.services.background_worker import _background_indexing_worker


@mcp.tool()
async def start_indexing_background(
    repo_path: str,
    project_id: str | None = None,
    ctx: Context | None = None,
) -> dict[str, Any]:
    """Start repository indexing in background."""
    # Resolve project
    resolved_id, db_name = await resolve_project_id(explicit_id=project_id, ctx=ctx)

    # Validate path (IndexingJobCreate enforces absolute, traversal-free paths)
    job_create = IndexingJobCreate(
        repo_path=repo_path,
        project_id=resolved_id,
    )

    # Create job record
    async with get_session(project_id=resolved_id, ctx=ctx) as session:
        job = IndexingJob(
            repo_path=job_create.repo_path,
            project_id=resolved_id,
            status="pending",
        )
        session.add(job)
        await session.commit()
        await session.refresh(job)
        job_id = job.id

    # Start worker
    asyncio.create_task(_background_indexing_worker(job_id, repo_path, resolved_id, ctx))

    return {
        "job_id": str(job_id),
        "status": "pending",
        "message": "Indexing job started",
    }
```
**Why**:
- FastMCP integration (`@mcp.tool()`)
- Path validation (security)
- Immediate response (<1s)
- Non-blocking worker spawn
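Pattern 3 covers only job creation. A companion sketch for the status tool (T006), under the same assumptions about module layout; the response shape is illustrative:
```python
from typing import Any
from uuid import UUID

from fastmcp import Context

from src.database.session import get_session, resolve_project_id
from src.mcp.server_fastmcp import mcp
from src.models.indexing_job import IndexingJob


@mcp.tool()
async def get_indexing_status(
    job_id: str,
    project_id: str | None = None,
    ctx: Context | None = None,
) -> dict[str, Any]:
    """Return the current status of a background indexing job."""
    resolved_id, _db_name = await resolve_project_id(explicit_id=project_id, ctx=ctx)

    async with get_session(project_id=resolved_id, ctx=ctx) as session:
        job = await session.get(IndexingJob, UUID(job_id))
        if job is None:
            return {"error": f"No indexing job found with id {job_id}"}
        return {
            "job_id": str(job.id),
            "status": job.status,
            "files_indexed": job.files_indexed,
            "chunks_created": job.chunks_created,
            "error_message": job.error_message,
        }
```
One caveat worth flagging for T005: `asyncio.create_task()` keeps only a weak reference to the task it creates, so the Pattern 3 sketch should stash the returned task (e.g., in a module-level set, discarded on completion) to guard against mid-flight garbage collection.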
---
## Database Schema (Reference)
### Simplified Schema (10 columns)
```sql
CREATE TABLE indexing_jobs (
    -- Identity
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),

    -- Input
    repo_path TEXT NOT NULL,
    project_id VARCHAR(255) NOT NULL,

    -- Status
    status VARCHAR(20) NOT NULL DEFAULT 'pending'
        CHECK (status IN ('pending', 'running', 'completed', 'failed')),
    error_message TEXT,

    -- Counters
    files_indexed INTEGER DEFAULT 0,
    chunks_created INTEGER DEFAULT 0,

    -- Timestamps
    started_at TIMESTAMPTZ,
    completed_at TIMESTAMPTZ,
    created_at TIMESTAMPTZ DEFAULT NOW()
);

-- Performance indexes
CREATE INDEX idx_active_jobs ON indexing_jobs(project_id, status)
    WHERE status IN ('pending', 'running');
CREATE INDEX idx_created_at ON indexing_jobs(created_at DESC);
```
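The partial index is what keeps "active job" lookups under the 100ms target. For illustration, a SQLAlchemy query it would serve (assuming the `IndexingJob` model from T003):
```python
from sqlalchemy import select

from src.models.indexing_job import IndexingJob

# Matches idx_active_jobs: filters on project_id plus the indexed status subset.
# `project_id` is assumed to be a variable in the enclosing scope.
active_jobs_stmt = (
    select(IndexingJob)
    .where(IndexingJob.project_id == project_id)
    .where(IndexingJob.status.in_(["pending", "running"]))
)
```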
### Removed Columns (Deferred to Phase 2)
- `progress_percentage` - No real-time progress
- `progress_message` - No phase messages
- `files_scanned` - No scanning counter
- `error_type`, `error_traceback` - Simple error_message only
- `cancelled_at` - No cancellation in MVP
- `metadata` - No extensibility yet
- `worker_task_id`, `connection_id` - No resource tracking
- `force_reindex` - No force flag
**Rationale**: MVP focuses on start + status, not monitoring or control.
---
## Success Criteria
Before considering implementation complete:
- [ ] All 13 tasks completed (T001-T013)
- [ ] Index 10K+ file repository without timeout
- [ ] Job creation responds in <1 second
- [ ] State persists across server restart
- [ ] Integration tests pass
- [ ] Documentation updated (CLAUDE.md)
**Acceptance Tests**:
1. **Basic workflow**:
   - Start indexing on 4-file test repo
   - Poll status until completed
   - Verify files_indexed=4, chunks_created>0
2. **Large repository**:
   - Index codebase-mcp itself (1,000+ files)
   - Job creation <1s
   - Completes within 5 minutes
   - No timeout errors
3. **State persistence**:
   - Start job
   - Stop server mid-execution
   - Restart server
   - Query job status (should persist)
4. **Error handling**:
   - Index non-existent path
   - Status shows "failed" with clear error message
---
## Constitutional Compliance Checklist
### Principle I: Simplicity Over Features
- [ ] Reuses existing code (session.py, indexer.py)
- [ ] No custom job manager (PostgreSQL-native)
- [ ] Minimal new complexity (13 tasks, 1,200 lines)
### Principle II: Local-First Architecture
- [ ] PostgreSQL only (no Redis, RabbitMQ)
- [ ] No external dependencies
- [ ] Offline-capable storage
### Principle III: Protocol Compliance
- [ ] MCP-compliant tool responses
- [ ] Structured JSON responses
- [ ] No stdout/stderr pollution
### Principle IV: Performance Guarantees
- [ ] Job creation <1s (non-blocking)
- [ ] Status queries <100ms (indexed queries)
- [ ] No degradation during concurrent jobs
### Principle V: Production Quality
- [ ] Comprehensive error handling
- [ ] State persistence across restarts
- [ ] Structured logging with context
- [ ] Path traversal prevention
- [ ] No resource leaks
### Principle VIII: Type Safety
- [ ] Pydantic models for validation
- [ ] SQLAlchemy models for database
- [ ] mypy --strict compliance
### Principle XI: FastMCP Foundation
- [ ] Uses `@mcp.tool()` decorator
- [ ] Async/await throughout
- [ ] Context passed to sessions
---
## Quick Reference - Task Breakdown
### Phase 1: Database (1h) ✅ **COMPLETE**
- **T001**: Migration (45m) - Create `migrations/versions/008_add_indexing_jobs.py` ✅
- **T002**: Validate (15m) - Apply migration, test schema ✅
### Phase 2: Models (1h)
- **T003**: SQLAlchemy + Pydantic (45m) - Create `src/models/indexing_job.py`
- **T003-test**: Model tests (15m) - Create `tests/unit/test_indexing_job_models.py`
### Phase 3: Core (2.5h)
- **T004**: Worker (45m) - Create `src/services/background_worker.py`
- **T005**: start_indexing_background (30m) - Create `src/mcp/tools/background_indexing.py`
- **T006**: get_indexing_status (30m) - Add to `background_indexing.py`
- **T007**: update_job utility (15m) - Add to `background_worker.py`
- **T008**: Integration test (30m) - Create `tests/integration/test_background_indexing.py`
### Phase 4: Hardening (1h)
- **T009**: Error handling (20m) - Enhance worker error handling
- **T010**: State persistence test (20m) - Test server restart scenario
- **T011**: Configuration (10m) - Update `.env.example`
- **T012**: Documentation (10m) - Update `CLAUDE.md`
### Phase 5: Validation (30m)
- **T013**: E2E test with large repo (30m) - Test with codebase-mcp itself
---
## Known Issues from Previous Session
### Issue Resolved: Missing `projects` Table
- **Problem**: Registry database missing `projects` table
- **Fix**: Created table with proper schema
- **Status**: ✅ Resolved
- **Note**: This is already fixed in the environment
### Current Environment State
- **Registry database**: `codebase_mcp` (has `projects` table)
- **Project databases**: `cb_proj_*` pattern
- **Test repository**: `/tmp/test-repo` (4 files, 5 chunks) already indexed
- **Semantic search**: ✅ Working
**Validation**:
```bash
# Check registry database
psql codebase_mcp -c "\d projects"
# Check test repo indexed
psql cb_proj_default_00000000 -c "SELECT COUNT(*) FROM repositories;"
```
---
## Contact Information
### Previous Session Context
- **Session date**: 2025-10-17
- **Focus**: Architecture design, task planning, optimization
- **Key decisions**: MVP-first approach, reuse existing infrastructure
- **Parallel reviews**: code-reviewer + architect-reviewer validated approach
- **Comparison**: Human-AI collaborative planning vs spec-driven (/speckit.tasks)
- **Result**: MVP-optimized plan chosen (50% less code than full-featured)
### Key Contributors
- **Architecture**: Designed PostgreSQL-native approach with simplified schema
- **Planning**: Created MVP task breakdown with time estimates
- **Optimization**: Reduced from 18 columns to 10, 12 hours to 6 hours
---
## How to Use This Document
### Quick Start (10 minutes)
1. **Read from the Executive Summary through Critical Context** for background
2. **Review Key Design Patterns** for code patterns
3. **Start T003** using the Quick Reference task breakdown as a checklist (T001-T002 are already complete)
4. **Reference the schema, success criteria, and compliance sections** during implementation (as needed)
### During Implementation
- **Stuck on a task?** → Check "Key Design Patterns" section
- **Need schema details?** → Check "Database Schema (Reference)" section
- **Confused about scope?** → Check "Critical Context - MVP Scope" section
- **Want to verify completion?** → Check "Success Criteria" section
### Before Committing
- **Run tests**: `pytest tests/integration/test_background_indexing.py -v`
- **Check constitutional compliance**: Review "Constitutional Compliance Checklist"
- **Update documentation**: Ensure CLAUDE.md has usage examples
---
## Next Session Prompt
**Suggested prompt for next AI session**:
```
I'm implementing background indexing for the codebase-mcp project. Please read:
1. docs/IMPLEMENTATION_HANDOFF.md (handoff context)
2. docs/architecture/background-indexing-tasks-mvp.md (task list)
T001-T002 (database migration) are complete. I want to continue with T003: creating
the IndexingJob SQLAlchemy and Pydantic models for the indexing_jobs table.
Key requirements:
- SQLAlchemy model mapping the 10-column simplified schema
- Pydantic models (IndexingJobCreate, IndexingJobResponse)
- Path validation (absolute paths only, no traversal)
- Follow existing model patterns in src/models/
```
---
## Appendix A: File Creation Checklist
**Files to Create** (6 new files):
- [x] `migrations/versions/008_add_indexing_jobs.py` (T001) ✅ Created
- [ ] `src/models/indexing_job.py` (T003)
- [ ] `src/services/background_worker.py` (T004)
- [ ] `src/mcp/tools/background_indexing.py` (T005)
- [ ] `tests/unit/test_indexing_job_models.py` (T003-test)
- [ ] `tests/integration/test_background_indexing.py` (T008)
**Files to Modify** (2 existing files):
- [ ] `src/models/__init__.py` - Export IndexingJob model
- [ ] `.env.example` - Add background indexing config
**Total Lines**: ~1,200-1,500 (50% less than full plan)
---
## Appendix B: Environment Variables
Add to `.env.example` (T011):
```bash
# Background Indexing Configuration
# Maximum number of concurrent background indexing jobs
MAX_CONCURRENT_INDEXING_JOBS=2
# Timeout per indexing job (seconds) - 1 hour default
INDEXING_JOB_TIMEOUT_SECONDS=3600
```
**Note**: MVP doesn't enforce these limits yet, but documents them for Phase 2.
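When Phase 2 does enforce them, reading the values is straightforward. A sketch using plain environment lookups (variable names match the `.env` entries above):
```python
import os

# Defaults mirror .env.example; both limits are advisory until Phase 2.
MAX_CONCURRENT_INDEXING_JOBS = int(os.environ.get("MAX_CONCURRENT_INDEXING_JOBS", "2"))
INDEXING_JOB_TIMEOUT_SECONDS = int(os.environ.get("INDEXING_JOB_TIMEOUT_SECONDS", "3600"))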
---
## Appendix C: Documentation Updates
Add to `CLAUDE.md` (T012):
````markdown
## Background Indexing

Large repositories (10,000+ files) require 5-10 minutes to index. Use background indexing for these repositories.

### Usage Pattern: Start and Poll
```python
import asyncio

# Start indexing
result = await start_indexing_background(
    repo_path="/path/to/large/repo",
    ctx=ctx
)
job_id = result["job_id"]

# Poll for completion
while True:
    status = await get_indexing_status(job_id=job_id, ctx=ctx)
    if status["status"] in ["completed", "failed"]:
        break
    await asyncio.sleep(2)

if status["status"] == "completed":
    print(f"✅ Indexed {status['files_indexed']} files!")
```

### When to Use
- **Foreground**: Repositories <5,000 files (completes in <60s)
- **Background**: Repositories 10,000+ files (requires 5-10 minutes)
````
---
## Appendix D: Testing Checklist
### Unit Tests (T003-test)
- [ ] Valid absolute path accepted
- [ ] Relative path rejected
- [ ] Path traversal rejected (`../../etc/passwd`)
- [ ] Model serialization works
### Integration Tests (T008)
- [ ] Job creation works
- [ ] Status polling works
- [ ] Job completes successfully
- [ ] Counters accurate (files_indexed, chunks_created)
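A minimal sketch of how the T008 integration test might drive this checklist. Test names and the tiny-repo setup are hypothetical, and direct tool invocation assumes the underlying functions remain callable (depending on the FastMCP version, the undecorated implementations may need to be exported instead):
```python
import asyncio

import pytest

from src.mcp.tools.background_indexing import (
    get_indexing_status,
    start_indexing_background,
)


@pytest.mark.asyncio
async def test_background_indexing_workflow(tmp_path):
    # Create a tiny throwaway repo to index.
    (tmp_path / "hello.py").write_text("def hello():\n    return 'world'\n")

    result = await start_indexing_background(repo_path=str(tmp_path))
    assert result["status"] == "pending"
    job_id = result["job_id"]

    # Poll until the job reaches a terminal state (bounded wait).
    for _ in range(60):
        status = await get_indexing_status(job_id=job_id)
        if status["status"] in ("completed", "failed"):
            break
        await asyncio.sleep(1)

    assert status["status"] == "completed"
    assert status["files_indexed"] >= 1
    assert status["chunks_created"] > 0
```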
### State Persistence (T010)
- [ ] Job survives server restart
- [ ] Status queryable after restart
- [ ] All fields preserved
### End-to-End (T013)
- [ ] Index large repo (1,000+ files)
- [ ] No timeout errors
- [ ] Completes within 5 minutes
- [ ] Status tracking accurate
---
**End of Handoff Document**
This document provides complete context to start implementation immediately without requiring the previous conversation history. Good luck! 🚀