# Background Indexing with PostgreSQL-Native Status Tracking
## Problem Statement
Large repositories (10,000+ files) take 5-10 minutes to index, which exceeds typical MCP request timeouts. We need a solution that:
1. Starts indexing jobs in the background (non-blocking MCP tools)
2. Provides real-time progress updates to AI agents
3. Persists job state across server restarts (production-ready)
4. Handles job failures gracefully with proper cleanup
5. Secures against path traversal attacks
6. Maintains constitutional compliance (simplicity, production quality)
## Architecture Overview
**Design Philosophy**: Use PostgreSQL as the single source of truth for job state instead of in-memory tracking. This eliminates state loss on restart, simplifies the implementation, and leverages PostgreSQL's ACID guarantees for consistency.
```
┌─────────────────────────────────────────────────────────────────┐
│ AI Agent (Claude) │
└────────┬─────────────────────────────────────────────┬──────────┘
│ │
│ 1. start_indexing_background() │ 3. get_indexing_status()
│ (returns immediately) │ (queries PostgreSQL)
↓ ↓
┌─────────────────────────────────────────────────────────────────┐
│ MCP Server (codebase-mcp) │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ MCP Tools (Non-blocking) │ │
│ │ - start_indexing_background() → creates DB row + task │ │
│ │ - get_indexing_status() → queries DB row │ │
│ │ - cancel_indexing_background() → sets DB cancelled flag │ │
│ └──────────────────────────────────────────────────────────┘ │
│ ↓ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Background Worker (asyncio.Task) │ │
│ │ - Runs indexing in separate asyncio task │ │
│ │ - Updates PostgreSQL row with progress │ │
│ │ - Cleans up resources on completion/failure │ │
│ │ - Respects cancellation requests via DB polling │ │
│ └──────────────────────────────────────────────────────────┘ │
│ ↓ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Indexing Service (with progress callbacks) │ │
│ │ - Scanner → progress_callback() → UPDATE indexing_jobs │ │
│ │ - Chunker → progress_callback() → UPDATE indexing_jobs │ │
│ │ - Embedder → progress_callback() → UPDATE indexing_jobs │ │
│ │ - Writer → progress_callback() → UPDATE indexing_jobs │ │
│ └──────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
↓
┌──────────────┐
│ PostgreSQL │
│ + pgvector │
│ │
│ indexing_jobs│ ← Single source of truth
└──────────────┘
```
**Key Architectural Decisions:**
1. **PostgreSQL-native state**: Job state persisted in `indexing_jobs` table, survives restarts
2. **Dedicated connections**: Background workers use separate connection pool connections
3. **No in-memory job manager**: Eliminated complexity, state in database only
4. **Simple polling**: AI agents poll `get_indexing_status()` every 2-5 seconds
5. **Transaction boundaries**: Each progress update is a committed transaction
6. **Optional LISTEN/NOTIFY**: Real-time updates for advanced clients (future enhancement)
## Database Schema
Add `indexing_jobs` table to project databases (applies to all `cb_proj_*` databases):
```sql
-- Migration: Add indexing_jobs table for background job tracking
CREATE TABLE indexing_jobs (
-- Identity
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
repository_id UUID, -- NULL until repository created (FK added later)
-- Input parameters (for restart/debugging)
repo_path TEXT NOT NULL,
repo_name TEXT NOT NULL,
project_id VARCHAR(255) NOT NULL,
force_reindex BOOLEAN NOT NULL DEFAULT FALSE,
-- Status tracking
status VARCHAR(20) NOT NULL CHECK (status IN ('pending', 'running', 'completed', 'failed', 'cancelled')),
progress_percentage INTEGER NOT NULL DEFAULT 0 CHECK (progress_percentage >= 0 AND progress_percentage <= 100),
progress_message TEXT,
-- Counters (updated by worker)
files_scanned INTEGER NOT NULL DEFAULT 0,
files_indexed INTEGER NOT NULL DEFAULT 0,
chunks_created INTEGER NOT NULL DEFAULT 0,
-- Error tracking
error_message TEXT,
error_type VARCHAR(255), -- Exception class name
error_traceback TEXT, -- Full stack trace for debugging
-- Timestamps (UTC)
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
started_at TIMESTAMPTZ,
completed_at TIMESTAMPTZ,
cancelled_at TIMESTAMPTZ,
-- Metadata (JSONB for extensibility)
metadata JSONB DEFAULT '{}'::jsonb,
-- Resource tracking (for leak detection)
worker_task_id TEXT, -- asyncio.Task id() for monitoring
connection_id TEXT -- Database connection ID for cleanup
);
-- Performance indexes
CREATE INDEX idx_indexing_jobs_status ON indexing_jobs(status) WHERE status IN ('pending', 'running');
CREATE INDEX idx_indexing_jobs_project_id ON indexing_jobs(project_id);
CREATE INDEX idx_indexing_jobs_created_at ON indexing_jobs(created_at DESC);
CREATE INDEX idx_indexing_jobs_repository_id ON indexing_jobs(repository_id) WHERE repository_id IS NOT NULL;
-- Add foreign key constraint after repository is created
-- ALTER TABLE indexing_jobs ADD CONSTRAINT fk_indexing_jobs_repository
-- FOREIGN KEY (repository_id) REFERENCES repositories(id) ON DELETE SET NULL;
COMMENT ON TABLE indexing_jobs IS 'Background indexing job tracking with PostgreSQL-native state persistence';
COMMENT ON COLUMN indexing_jobs.repo_path IS 'Absolute path to repository (validated for path traversal)';
COMMENT ON COLUMN indexing_jobs.status IS 'Job lifecycle state (pending/running/completed/failed/cancelled)';
COMMENT ON COLUMN indexing_jobs.metadata IS 'Extensible metadata for logging context, user info, etc.';
```
## Implementation Components
### 1. Job Status Model (Pydantic)
```python
# src/models/indexing_job.py
from enum import Enum
from datetime import datetime
from pydantic import BaseModel, Field, validator
from uuid import UUID
class IndexingJobStatus(str, Enum):
"""Job lifecycle states (matches DB constraint)."""
PENDING = "pending" # Created, not started
RUNNING = "running" # Worker executing
COMPLETED = "completed" # Successfully finished
FAILED = "failed" # Error occurred
CANCELLED = "cancelled" # User-requested cancellation
class IndexingJobProgress(BaseModel):
"""Progress update payload (immutable, for MCP responses)."""
job_id: UUID
status: IndexingJobStatus
progress_percentage: int = Field(ge=0, le=100)
progress_message: str
files_scanned: int = Field(ge=0)
files_indexed: int = Field(ge=0)
chunks_created: int = Field(ge=0)
error_message: str | None = None
created_at: datetime
started_at: datetime | None = None
completed_at: datetime | None = None
class Config:
frozen = True # Immutable for thread safety
class IndexingJobCreate(BaseModel):
"""Input parameters for creating a new indexing job."""
repo_path: str = Field(min_length=1)
repo_name: str = Field(min_length=1)
project_id: str = Field(min_length=1)
force_reindex: bool = False
    @validator('repo_path')
    def validate_repo_path(cls, v: str) -> str:
        """Validate repo_path is absolute and free from path traversal attacks."""
        import os
        from pathlib import Path
        # Must be absolute path
        if not os.path.isabs(v):
            raise ValueError(f"repo_path must be absolute, got: {v}")
        # Reject any '..' components outright; comparing resolve() output
        # silently collapses '..' and can wave traversal attempts through
        if ".." in Path(v).parts:
            raise ValueError(f"Path traversal detected in repo_path: {v}")
        return v
```
### 2. MCP Tool: start_indexing_background()
```python
# src/mcp/tools/background_indexing.py
import asyncio
from datetime import datetime, timezone
from pathlib import Path
from typing import Any
from uuid import UUID

from fastmcp import Context
from sqlalchemy import text

from src.database.session import get_session, resolve_project_id
from src.mcp.mcp_logging import get_logger
from src.models.indexing_job import IndexingJobCreate, IndexingJobStatus

# FastMCP server instance (adjust the import to wherever it lives in this project)
from src.mcp.server import mcp

logger = get_logger(__name__)

# Strong references to in-flight worker tasks (asyncio holds only weak
# references to tasks, so un-referenced tasks can be garbage-collected)
_background_tasks: set[asyncio.Task[None]] = set()
@mcp.tool()
async def start_indexing_background(
repo_path: str,
project_id: str | None = None,
force_reindex: bool = False,
ctx: Context | None = None,
) -> dict[str, Any]:
"""Start repository indexing in the background (non-blocking).
Returns immediately with job_id. Use get_indexing_status(job_id) to poll progress.
Security:
- Validates repo_path is absolute (no relative paths)
- Checks for path traversal attacks (../ sequences)
- Enforces project isolation via database-per-project
Args:
repo_path: Absolute path to repository (validated)
project_id: Optional project identifier (resolved via 4-tier chain)
force_reindex: If True, re-index even if already indexed
ctx: FastMCP Context for session-based project resolution
Returns:
{
"job_id": "uuid",
"status": "pending",
"message": "Indexing job started",
"project_id": "resolved_project_id",
"database_name": "cb_proj_xxx"
}
Raises:
ValueError: If repo_path validation fails (path traversal, not absolute)
RuntimeError: If database operations fail
Example:
>>> result = await start_indexing_background(
... repo_path="/Users/alice/projects/myapp",
... ctx=ctx
... )
>>> job_id = result["job_id"]
>>> # Poll status:
>>> status = await get_indexing_status(job_id=job_id)
"""
# Resolve project_id and database_name via 4-tier chain
resolved_id, database_name = await resolve_project_id(
explicit_id=project_id,
ctx=ctx,
)
# Validate input (includes path traversal check)
job_input = IndexingJobCreate(
repo_path=repo_path,
repo_name=Path(repo_path).name,
project_id=resolved_id,
force_reindex=force_reindex,
)
# Create job record in database (status=pending)
async with get_session(project_id=resolved_id, ctx=ctx) as db:
from sqlalchemy import text
result = await db.execute(
text("""
INSERT INTO indexing_jobs
(repo_path, repo_name, project_id, force_reindex, status, progress_message)
VALUES
(:repo_path, :repo_name, :project_id, :force_reindex, 'pending', 'Job queued')
RETURNING id, status, progress_percentage, created_at
"""),
{
"repo_path": job_input.repo_path,
"repo_name": job_input.repo_name,
"project_id": job_input.project_id,
"force_reindex": job_input.force_reindex,
}
)
row = result.fetchone()
job_id = row[0]
# Commit transaction (job is now visible to other connections)
await db.commit()
    # Start background worker task (non-blocking); keep a strong reference
    # so the task is not garbage-collected mid-run
    task = asyncio.create_task(
        _background_indexing_worker(
            job_id=job_id,
            repo_path=job_input.repo_path,
            repo_name=job_input.repo_name,
            project_id=resolved_id,
            force_reindex=job_input.force_reindex,
        )
    )
    _background_tasks.add(task)
    task.add_done_callback(_background_tasks.discard)
logger.info(
f"Indexing job created: {job_id}",
extra={
"context": {
"operation": "start_indexing_background",
"job_id": str(job_id),
"project_id": resolved_id,
"repo_path": job_input.repo_path,
}
},
)
if ctx:
await ctx.info(f"Indexing started in background. Job ID: {job_id}")
return {
"job_id": str(job_id),
"status": "pending",
"message": "Indexing job started",
"project_id": resolved_id,
"database_name": database_name,
}
```
### 3. Background Worker Function
```python
# src/mcp/tools/background_indexing.py (continued)
async def _background_indexing_worker(
job_id: UUID,
repo_path: str,
repo_name: str,
project_id: str,
force_reindex: bool,
) -> None:
"""Background worker that executes indexing and updates PostgreSQL.
This function runs as an asyncio.Task and handles:
- Status transitions (pending → running → completed/failed)
- Progress updates via database transactions
- Error handling and cleanup
- Cancellation detection via database polling
Constitutional Compliance:
- Principle V: Production Quality (comprehensive error handling, cleanup)
- Principle IV: Performance (batched updates, efficient queries)
- Principle II: Local-First (no external dependencies)
Args:
job_id: UUID of indexing_jobs row
repo_path: Absolute path to repository
repo_name: Repository name for display
project_id: Resolved project identifier
force_reindex: Whether to re-index existing repository
"""
logger.info(
f"Background worker started for job {job_id}",
extra={
"context": {
"operation": "_background_indexing_worker",
"job_id": str(job_id),
"project_id": project_id,
}
},
)
try:
# Update status to running
await _update_job_status(
job_id=job_id,
project_id=project_id,
status=IndexingJobStatus.RUNNING,
progress_percentage=0,
progress_message="Starting indexing...",
started_at=datetime.now(timezone.utc),
)
# Define progress callback that updates database
async def progress_callback(message: str, percentage: int, **kwargs) -> None:
"""Progress callback invoked by indexer at milestones.
Updates:
- progress_percentage (0-100)
- progress_message (human-readable)
- files_scanned, files_indexed, chunks_created (counters)
"""
# Check for cancellation request
is_cancelled = await _check_cancellation(job_id, project_id)
if is_cancelled:
logger.warning(
f"Job {job_id} cancellation detected, aborting",
extra={
"context": {
"operation": "progress_callback",
"job_id": str(job_id),
}
},
)
raise asyncio.CancelledError("Job cancelled by user")
# Update progress in database
await _update_job_progress(
job_id=job_id,
project_id=project_id,
percentage=percentage,
message=message,
**kwargs # files_scanned, files_indexed, chunks_created
)
# Run indexing with progress callbacks
from src.services.indexer import index_repository
async with get_session(project_id=project_id) as db:
result = await index_repository(
repo_path=Path(repo_path),
name=repo_name,
db=db,
project_id=project_id,
force_reindex=force_reindex,
progress_callback=progress_callback,
)
# Update to completed
await _update_job_status(
job_id=job_id,
project_id=project_id,
status=IndexingJobStatus.COMPLETED,
progress_percentage=100,
progress_message=f"Completed: {result.files_indexed} files, {result.chunks_created} chunks",
completed_at=datetime.now(timezone.utc),
files_indexed=result.files_indexed,
chunks_created=result.chunks_created,
repository_id=result.repository_id, # Link to created repository
)
logger.info(
f"Job {job_id} completed successfully",
extra={
"context": {
"operation": "_background_indexing_worker",
"job_id": str(job_id),
"files_indexed": result.files_indexed,
"chunks_created": result.chunks_created,
}
},
)
except asyncio.CancelledError:
# User-requested cancellation
await _update_job_status(
job_id=job_id,
project_id=project_id,
status=IndexingJobStatus.CANCELLED,
progress_percentage=0,
progress_message="Job cancelled by user",
cancelled_at=datetime.now(timezone.utc),
)
logger.warning(f"Job {job_id} cancelled by user")
except Exception as e:
# Unexpected error
import traceback
await _update_job_status(
job_id=job_id,
project_id=project_id,
status=IndexingJobStatus.FAILED,
progress_percentage=0,
progress_message=f"Error: {str(e)}",
completed_at=datetime.now(timezone.utc),
error_message=str(e),
error_type=type(e).__name__,
error_traceback=traceback.format_exc(),
)
logger.error(
f"Job {job_id} failed with error",
extra={
"context": {
"operation": "_background_indexing_worker",
"job_id": str(job_id),
"error": str(e),
"error_type": type(e).__name__,
}
},
exc_info=True,
)
    finally:
        # Sessions and connections are context-managed above, so nothing is
        # left open here; this hook remains for future resource tracking
        logger.debug(f"Background worker cleanup for job {job_id}")
```
### 4. Database Update Utilities
```python
# src/mcp/tools/background_indexing.py (continued)
async def _update_job_status(
job_id: UUID,
project_id: str,
status: IndexingJobStatus,
progress_percentage: int,
progress_message: str,
started_at: datetime | None = None,
completed_at: datetime | None = None,
cancelled_at: datetime | None = None,
error_message: str | None = None,
error_type: str | None = None,
error_traceback: str | None = None,
files_scanned: int | None = None,
files_indexed: int | None = None,
chunks_created: int | None = None,
repository_id: UUID | None = None,
) -> None:
"""Update job status in PostgreSQL (transactional).
This is the atomic operation for updating job state. Each call
is a committed transaction, ensuring external visibility.
Args:
job_id: UUID of indexing_jobs row
project_id: Project identifier for get_session()
status: New job status
progress_percentage: Progress (0-100)
progress_message: Human-readable progress message
started_at: Optional start timestamp
completed_at: Optional completion timestamp
cancelled_at: Optional cancellation timestamp
error_message: Optional error message
error_type: Optional exception class name
error_traceback: Optional full stack trace
files_scanned: Optional file count
files_indexed: Optional indexed file count
chunks_created: Optional chunk count
repository_id: Optional link to created repository
"""
async with get_session(project_id=project_id) as db:
from sqlalchemy import text
# Build dynamic UPDATE query
updates = [
"status = :status",
"progress_percentage = :progress_percentage",
"progress_message = :progress_message",
]
        params: dict[str, Any] = {
"job_id": job_id,
"status": status.value,
"progress_percentage": progress_percentage,
"progress_message": progress_message,
}
# Add optional fields
if started_at is not None:
updates.append("started_at = :started_at")
params["started_at"] = started_at
if completed_at is not None:
updates.append("completed_at = :completed_at")
params["completed_at"] = completed_at
if cancelled_at is not None:
updates.append("cancelled_at = :cancelled_at")
params["cancelled_at"] = cancelled_at
if error_message is not None:
updates.append("error_message = :error_message")
params["error_message"] = error_message
if error_type is not None:
updates.append("error_type = :error_type")
params["error_type"] = error_type
if error_traceback is not None:
updates.append("error_traceback = :error_traceback")
params["error_traceback"] = error_traceback
if files_scanned is not None:
updates.append("files_scanned = :files_scanned")
params["files_scanned"] = files_scanned
if files_indexed is not None:
updates.append("files_indexed = :files_indexed")
params["files_indexed"] = files_indexed
if chunks_created is not None:
updates.append("chunks_created = :chunks_created")
params["chunks_created"] = chunks_created
if repository_id is not None:
updates.append("repository_id = :repository_id")
params["repository_id"] = repository_id
# Execute UPDATE
query = f"""
UPDATE indexing_jobs
SET {', '.join(updates)}
WHERE id = :job_id
"""
await db.execute(text(query), params)
await db.commit() # Transactional visibility
async def _update_job_progress(
job_id: UUID,
project_id: str,
percentage: int,
message: str,
**kwargs,
) -> None:
"""Update job progress (lightweight version for frequent updates).
Args:
job_id: UUID of indexing_jobs row
project_id: Project identifier
percentage: Progress percentage (0-100)
message: Progress message
**kwargs: Optional counters (files_scanned, files_indexed, chunks_created)
"""
await _update_job_status(
job_id=job_id,
project_id=project_id,
status=IndexingJobStatus.RUNNING,
progress_percentage=percentage,
progress_message=message,
files_scanned=kwargs.get("files_scanned"),
files_indexed=kwargs.get("files_indexed"),
chunks_created=kwargs.get("chunks_created"),
)
async def _check_cancellation(job_id: UUID, project_id: str) -> bool:
"""Check if job has been cancelled via database query.
Called periodically by worker to detect cancellation requests.
Args:
job_id: UUID of indexing_jobs row
project_id: Project identifier
Returns:
True if job status is 'cancelled', False otherwise
"""
async with get_session(project_id=project_id) as db:
from sqlalchemy import text
result = await db.execute(
text("SELECT status FROM indexing_jobs WHERE id = :job_id"),
{"job_id": job_id}
)
row = result.fetchone()
if row is None:
logger.warning(f"Job {job_id} not found during cancellation check")
return False
return row[0] == IndexingJobStatus.CANCELLED.value
```
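Since `_check_cancellation()` costs a database round trip, the worker should not call it on every progress callback. A minimal sketch of throttling checks to `CANCELLATION_CHECK_INTERVAL_SECONDS` (the factory below is illustrative, not existing code):

```python
import time
from uuid import UUID

def make_throttled_cancellation_check(
    job_id: UUID,
    project_id: str,
    interval_seconds: float = 5.0,  # CANCELLATION_CHECK_INTERVAL_SECONDS
):
    """Return an async check that hits the database at most once per interval."""
    last_check = float("-inf")

    async def check() -> bool:
        nonlocal last_check
        now = time.monotonic()
        if now - last_check < interval_seconds:
            return False  # within the interval: assume not cancelled
        last_check = now
        return await _check_cancellation(job_id, project_id)

    return check
```

The progress callback would then call `check()` instead of `_check_cancellation()` directly.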
### 5. MCP Tool: get_indexing_status()
```python
# src/mcp/tools/background_indexing.py (continued)
@mcp.tool()
async def get_indexing_status(
job_id: str,
project_id: str | None = None,
ctx: Context | None = None,
) -> dict[str, Any]:
"""Get status of a background indexing job.
Queries PostgreSQL for current job state. This is a read-only operation
that AI agents can poll every 2-5 seconds.
Args:
job_id: UUID of the indexing job
project_id: Optional project identifier (resolved via 4-tier chain)
ctx: FastMCP Context for session-based project resolution
Returns:
{
"job_id": "uuid",
"status": "running",
"progress_percentage": 45,
"progress_message": "Generating embeddings: 5000/10000",
"files_scanned": 10585,
"files_indexed": 5000,
"chunks_created": 45000,
"created_at": "2025-10-17T10:30:00Z",
"started_at": "2025-10-17T10:30:01Z",
"error_message": null
}
Raises:
ValueError: If job_id is invalid or not found
Example:
>>> status = await get_indexing_status(job_id="550e8400-...")
>>> if status["status"] == "completed":
... print(f"Indexed {status['files_indexed']} files!")
"""
# Resolve project_id
resolved_id, _ = await resolve_project_id(
explicit_id=project_id,
ctx=ctx,
)
# Query job from database
async with get_session(project_id=resolved_id, ctx=ctx) as db:
from sqlalchemy import text
result = await db.execute(
text("""
SELECT
id, status, progress_percentage, progress_message,
files_scanned, files_indexed, chunks_created,
error_message, error_type,
created_at, started_at, completed_at, cancelled_at
FROM indexing_jobs
WHERE id = :job_id
"""),
{"job_id": job_id}
)
row = result.fetchone()
if row is None:
raise ValueError(f"Job not found: {job_id}")
# Convert to dict
return {
"job_id": str(row[0]),
"status": row[1],
"progress_percentage": row[2],
"progress_message": row[3],
"files_scanned": row[4],
"files_indexed": row[5],
"chunks_created": row[6],
"error_message": row[7],
"error_type": row[8],
"created_at": row[9].isoformat() if row[9] else None,
"started_at": row[10].isoformat() if row[10] else None,
"completed_at": row[11].isoformat() if row[11] else None,
"cancelled_at": row[12].isoformat() if row[12] else None,
}
@mcp.tool()
async def cancel_indexing_background(
job_id: str,
project_id: str | None = None,
ctx: Context | None = None,
) -> dict[str, Any]:
"""Cancel a running indexing job.
Sets the job status to 'cancelled' in PostgreSQL. The background worker
detects this on its next progress update and aborts.
Args:
job_id: UUID of the indexing job
project_id: Optional project identifier (resolved via 4-tier chain)
ctx: FastMCP Context for session-based project resolution
Returns:
{
"job_id": "uuid",
"status": "cancelled",
"message": "Cancellation requested (worker will abort)"
}
Raises:
ValueError: If job_id is invalid or already completed
Example:
>>> result = await cancel_indexing_background(job_id="550e8400-...")
>>> print(result["message"])
"""
# Resolve project_id
resolved_id, _ = await resolve_project_id(
explicit_id=project_id,
ctx=ctx,
)
# Update job status to cancelled
async with get_session(project_id=resolved_id, ctx=ctx) as db:
from sqlalchemy import text
from datetime import datetime, timezone
result = await db.execute(
text("""
UPDATE indexing_jobs
SET
status = 'cancelled',
cancelled_at = :cancelled_at,
progress_message = 'Cancellation requested by user'
WHERE id = :job_id AND status IN ('pending', 'running')
RETURNING id, status
"""),
{
"job_id": job_id,
"cancelled_at": datetime.now(timezone.utc),
}
)
row = result.fetchone()
if row is None:
raise ValueError(
f"Job not found or already completed: {job_id}. "
"Only pending/running jobs can be cancelled."
)
await db.commit()
logger.info(
f"Cancellation requested for job {job_id}",
extra={
"context": {
"operation": "cancel_indexing_background",
"job_id": job_id,
}
},
)
if ctx:
await ctx.info(f"Indexing job {job_id} cancellation requested")
return {
"job_id": job_id,
"status": "cancelled",
"message": "Cancellation requested (worker will abort on next progress update)",
}
```
### 6. Enhanced Indexer with Progress Callbacks
```python
# Modify src/services/indexer.py to add progress_callback parameter
from typing import Callable, Awaitable
async def index_repository(
repo_path: Path,
name: str,
db: AsyncSession,
project_id: str,
force_reindex: bool = False,
    progress_callback: Callable[..., Awaitable[None]] | None = None,
) -> IndexResult:
"""Index repository with progress callbacks.
Args:
repo_path: Path to repository
name: Repository name
db: Database session
project_id: Project identifier
force_reindex: Re-index even if exists
progress_callback: Optional async function(message: str, percentage: int, **kwargs)
Called at key milestones with progress updates
Returns:
IndexResult with files_indexed, chunks_created, repository_id
"""
    # The body below is schematic: names such as all_files, batches, and
    # repo_id come from the existing indexer implementation
# 1. After scanning (10%)
if progress_callback:
await progress_callback(
"Scanning repository...",
10,
files_scanned=len(all_files)
)
# 2. During chunking (10-50%)
for i, batch in enumerate(batches):
percentage = 10 + int(40 * (i / total_batches))
if progress_callback:
await progress_callback(
f"Chunking files: {i * batch_size}/{total_files}",
percentage,
files_indexed=i * batch_size
)
# 3. During embedding generation (50-90%)
for i, batch in enumerate(embedding_batches):
percentage = 50 + int(40 * (i / total_embedding_batches))
if progress_callback:
await progress_callback(
f"Generating embeddings: {i * embedding_batch_size}/{total_chunks}",
percentage,
chunks_created=i * embedding_batch_size
)
# 4. During database writes (90-100%)
if progress_callback:
await progress_callback("Writing to database...", 95)
# Final result
return IndexResult(
files_indexed=total_files,
chunks_created=total_chunks,
repository_id=repo_id,
)
```
## AI Agent Usage Patterns
### Pattern 1: Start and Poll (Recommended)
```python
# AI Agent workflow
result = await start_indexing_background(repo_path="/path/to/repo", ctx=ctx)
job_id = result["job_id"]
# Poll for status every 2-5 seconds
while True:
status = await get_indexing_status(job_id=job_id, ctx=ctx)
if status["status"] in ["completed", "failed", "cancelled"]:
break
print(f"Progress: {status['progress_percentage']}% - {status['progress_message']}")
await asyncio.sleep(2) # Poll interval
if status["status"] == "completed":
print(f"✅ Indexing complete! {status['files_indexed']} files, {status['chunks_created']} chunks")
elif status["status"] == "failed":
print(f"❌ Indexing failed: {status['error_message']}")
```
### Pattern 2: Fire-and-Forget
```python
# Start indexing
result = await start_indexing_background(repo_path="/path/to/repo", ctx=ctx)
print(f"Indexing started with job ID: {result['job_id']}")
# AI agent continues with other tasks
# Check status later if needed
```
### Pattern 3: Cancellation
```python
# Start job
result = await start_indexing_background(repo_path="/path/to/repo", ctx=ctx)
job_id = result["job_id"]
# Later, user requests cancellation
cancel_result = await cancel_indexing_background(job_id=job_id, ctx=ctx)
print(cancel_result["message"])
# Wait for worker to detect cancellation
await asyncio.sleep(1)
status = await get_indexing_status(job_id=job_id, ctx=ctx)
assert status["status"] == "cancelled"
```
## Security Considerations
### Path Traversal Prevention
```python
# Pydantic validator prevents path traversal attacks
job_input = IndexingJobCreate(
repo_path="/var/data/../../etc/passwd", # ❌ Rejected
...
)
# Raises ValueError: "Path traversal detected in repo_path"
# Valid paths:
job_input = IndexingJobCreate(
repo_path="/Users/alice/projects/myapp", # ✅ Absolute path
...
)
```
**Validation Rules:**
1. Must be an absolute path (`os.path.isabs()`)
2. Must contain no `..` components (`".." in Path(v).parts`)
3. Reject with a clear `ValueError` that names the offending path
### Resource Limits
```python
# Configuration (.env)
MAX_CONCURRENT_INDEXING_JOBS=2 # Limit concurrent background jobs
INDEXING_JOB_TIMEOUT_SECONDS=3600 # 1 hour max per job
MAX_REPO_SIZE_GB=10 # Reject repos larger than 10GB
```
**Enforcement:**
- Connection pool limits prevent resource exhaustion
- Concurrency cap and per-job timeout enforced around the worker via a semaphore and `asyncio.wait_for()` (see the sketch below)
- Repo size check before starting indexing
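A minimal sketch of that enforcement, assuming the two values come from settings (the wrapper function and module-level semaphore are illustrative, not existing code):

```python
import asyncio
from datetime import datetime, timezone
from uuid import UUID

# Illustrative process-wide cap (value from MAX_CONCURRENT_INDEXING_JOBS)
_job_semaphore = asyncio.Semaphore(2)

async def _run_worker_guarded(
    job_id: UUID, repo_path: str, repo_name: str, project_id: str, force_reindex: bool
) -> None:
    """Wrap the worker with the concurrency cap and per-job timeout."""
    async with _job_semaphore:
        try:
            await asyncio.wait_for(
                _background_indexing_worker(
                    job_id=job_id,
                    repo_path=repo_path,
                    repo_name=repo_name,
                    project_id=project_id,
                    force_reindex=force_reindex,
                ),
                timeout=3600,  # INDEXING_JOB_TIMEOUT_SECONDS
            )
        except asyncio.TimeoutError:
            # wait_for cancels the worker (its CancelledError handler runs
            # first), then we record the timeout as a failure
            await _update_job_status(
                job_id=job_id,
                project_id=project_id,
                status=IndexingJobStatus.FAILED,
                progress_percentage=0,
                progress_message="Job exceeded INDEXING_JOB_TIMEOUT_SECONDS",
                completed_at=datetime.now(timezone.utc),
                error_message="Indexing timed out",
                error_type="TimeoutError",
            )
```

`start_indexing_background()` would then create its task around `_run_worker_guarded` instead of calling the worker directly.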
### Database Transaction Boundaries
**Write Operations (Transactional):**
- Job creation: Single INSERT + COMMIT
- Status updates: Single UPDATE + COMMIT
- Progress updates: Single UPDATE + COMMIT
**Read Operations (No Transaction):**
- Status queries: Simple SELECT (read-committed isolation)
**Concurrency Safety:**
- PostgreSQL row-level locking prevents race conditions
- Each update is atomic (no partial state visible)
- Cancellation detection uses committed reads (guarded transitions shown below)
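As a concrete example of that guarantee, terminal-state transitions can be guarded in the `WHERE` clause so a committed cancellation is never overwritten by a late completion (a sketch; `cancel_indexing_background()` above already uses this pattern):

```python
# Sketch: mark completed only if the job is still running, so a
# concurrent cancellation that committed first wins the race
await db.execute(
    text("""
        UPDATE indexing_jobs
        SET status = 'completed', progress_percentage = 100
        WHERE id = :job_id AND status = 'running'
    """),
    {"job_id": job_id},
)
await db.commit()
```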
## Configuration
Add to `.env`:
```bash
# Background Indexing
MAX_CONCURRENT_INDEXING_JOBS=2 # Max parallel indexing jobs
INDEXING_JOB_TIMEOUT_SECONDS=3600 # 1 hour timeout per job
JOB_STATUS_RETENTION_DAYS=7 # Cleanup completed jobs after 7 days
MAX_REPO_SIZE_GB=10 # Reject repos larger than 10GB
# Progress Update Frequency
PROGRESS_UPDATE_INTERVAL_SECONDS=2 # Update DB every 2 seconds (not too frequent)
# Cancellation Polling
CANCELLATION_CHECK_INTERVAL_SECONDS=5 # Check for cancellation every 5 seconds
```
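These values could be surfaced through pydantic-settings so configuration shares the same typed validation as the rest of the server (the class and field names below are illustrative, not an existing module):

```python
from pydantic_settings import BaseSettings, SettingsConfigDict

class BackgroundIndexingSettings(BaseSettings):
    """Typed view of the .env values above (illustrative class name)."""
    model_config = SettingsConfigDict(env_file=".env", extra="ignore")

    max_concurrent_indexing_jobs: int = 2
    indexing_job_timeout_seconds: int = 3600
    job_status_retention_days: int = 7
    max_repo_size_gb: int = 10
    progress_update_interval_seconds: int = 2
    cancellation_check_interval_seconds: int = 5

# Env vars match field names case-insensitively (MAX_REPO_SIZE_GB, etc.)
settings = BackgroundIndexingSettings()
```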
## Testing Strategy
### Unit Tests
```python
# tests/unit/test_background_indexing.py
import pytest

from src.models.indexing_job import IndexingJobCreate

def test_job_creation():
"""Test job creation with valid input."""
job_input = IndexingJobCreate(
repo_path="/tmp/test-repo",
repo_name="test-repo",
project_id="test-project",
force_reindex=False,
)
assert job_input.repo_path == "/tmp/test-repo"
def test_path_traversal_rejection():
"""Test path traversal attack is rejected."""
with pytest.raises(ValueError, match="Path traversal detected"):
IndexingJobCreate(
repo_path="/var/data/../../etc/passwd",
repo_name="test",
project_id="test",
)
def test_relative_path_rejection():
"""Test relative paths are rejected."""
with pytest.raises(ValueError, match="must be absolute"):
IndexingJobCreate(
repo_path="./relative/path",
repo_name="test",
project_id="test",
)
```
### Integration Tests
```python
# tests/integration/test_background_indexing.py
import asyncio
from pathlib import Path

from src.mcp.tools.background_indexing import (
    cancel_indexing_background,
    get_indexing_status,
    start_indexing_background,
)

async def test_background_indexing_workflow(test_repo_path: Path):
"""Test complete background indexing workflow."""
# Start job
result = await start_indexing_background(
repo_path=str(test_repo_path),
project_id="test-project",
)
job_id = result["job_id"]
# Check initial status
status = await get_indexing_status(job_id=job_id, project_id="test-project")
assert status["status"] in ["pending", "running"]
# Poll until completion (with timeout)
for _ in range(30): # Max 30 * 2s = 60s
status = await get_indexing_status(job_id=job_id, project_id="test-project")
if status["status"] in ["completed", "failed"]:
break
await asyncio.sleep(2)
# Verify completion
assert status["status"] == "completed"
assert status["progress_percentage"] == 100
assert status["files_indexed"] > 0
assert status["chunks_created"] > 0
async def test_job_cancellation(test_repo_path: Path):
"""Test job cancellation during execution."""
# Start job
result = await start_indexing_background(
repo_path=str(test_repo_path),
project_id="test-project",
)
job_id = result["job_id"]
# Wait for job to start
await asyncio.sleep(1)
# Cancel job
cancel_result = await cancel_indexing_background(
job_id=job_id,
project_id="test-project",
)
assert cancel_result["status"] == "cancelled"
# Wait for worker to detect cancellation
await asyncio.sleep(5)
# Verify cancellation
status = await get_indexing_status(job_id=job_id, project_id="test-project")
assert status["status"] == "cancelled"
async def test_concurrent_jobs(test_repo_path: Path):
"""Test multiple concurrent indexing jobs."""
jobs = []
for i in range(3):
result = await start_indexing_background(
repo_path=str(test_repo_path),
project_id=f"test-project-{i}",
)
jobs.append(result["job_id"])
# All jobs should start
for job_id in jobs:
status = await get_indexing_status(job_id=job_id)
assert status["status"] in ["pending", "running"]
```
### Edge Case Tests
```python
# tests/integration/test_background_indexing_edge_cases.py
from sqlalchemy import text

from src.database.session import get_session
from src.mcp.tools.background_indexing import (
    get_indexing_status,
    start_indexing_background,
)

async def test_job_status_persistence_after_restart():
"""Test job state persists across server restarts."""
# Start job
result = await start_indexing_background(
repo_path="/tmp/test-repo",
project_id="test",
)
job_id = result["job_id"]
# Simulate server restart (close connection pools)
await close_db_connection()
await init_db_connection()
# Job status should still be accessible
status = await get_indexing_status(job_id=job_id, project_id="test")
assert status is not None
async def test_orphaned_job_cleanup():
"""Test cleanup of jobs with no running worker."""
# Create job record directly in DB (no worker)
async with get_session(project_id="test") as db:
result = await db.execute(
text("""
INSERT INTO indexing_jobs
(repo_path, repo_name, project_id, status, progress_message)
VALUES
('/tmp/test', 'test', 'test', 'running', 'Stuck')
RETURNING id
""")
)
job_id = result.fetchone()[0]
await db.commit()
# Job should remain in database (no automatic cleanup)
status = await get_indexing_status(job_id=job_id, project_id="test")
assert status["status"] == "running"
# Cleanup task should mark as failed after timeout
# (Future enhancement: background cleanup task)
```
## Migration Path
### Alembic Migration
```python
# migrations/versions/xxx_add_indexing_jobs.py
"""Add indexing_jobs table for background job tracking
Revision ID: xxx
Revises: yyy
Create Date: 2025-10-17 10:00:00.000000
"""
from alembic import op
import sqlalchemy as sa
from sqlalchemy.dialects import postgresql
revision = 'xxx'
down_revision = 'yyy'
branch_labels = None
depends_on = None
def upgrade() -> None:
"""Add indexing_jobs table."""
op.create_table(
'indexing_jobs',
sa.Column('id', postgresql.UUID(as_uuid=True), server_default=sa.text('gen_random_uuid()'), nullable=False),
sa.Column('repository_id', postgresql.UUID(as_uuid=True), nullable=True),
sa.Column('repo_path', sa.Text(), nullable=False),
sa.Column('repo_name', sa.Text(), nullable=False),
sa.Column('project_id', sa.VARCHAR(length=255), nullable=False),
sa.Column('force_reindex', sa.Boolean(), nullable=False, server_default='false'),
sa.Column('status', sa.VARCHAR(length=20), nullable=False),
sa.Column('progress_percentage', sa.Integer(), nullable=False, server_default='0'),
sa.Column('progress_message', sa.Text(), nullable=True),
sa.Column('files_scanned', sa.Integer(), nullable=False, server_default='0'),
sa.Column('files_indexed', sa.Integer(), nullable=False, server_default='0'),
sa.Column('chunks_created', sa.Integer(), nullable=False, server_default='0'),
sa.Column('error_message', sa.Text(), nullable=True),
sa.Column('error_type', sa.VARCHAR(length=255), nullable=True),
sa.Column('error_traceback', sa.Text(), nullable=True),
sa.Column('created_at', sa.TIMESTAMP(timezone=True), server_default=sa.text('now()'), nullable=False),
sa.Column('started_at', sa.TIMESTAMP(timezone=True), nullable=True),
sa.Column('completed_at', sa.TIMESTAMP(timezone=True), nullable=True),
sa.Column('cancelled_at', sa.TIMESTAMP(timezone=True), nullable=True),
sa.Column('metadata', postgresql.JSONB(astext_type=sa.Text()), server_default='{}', nullable=True),
sa.Column('worker_task_id', sa.Text(), nullable=True),
sa.Column('connection_id', sa.Text(), nullable=True),
sa.PrimaryKeyConstraint('id'),
sa.CheckConstraint("status IN ('pending', 'running', 'completed', 'failed', 'cancelled')", name='ck_indexing_jobs_status'),
sa.CheckConstraint('progress_percentage >= 0 AND progress_percentage <= 100', name='ck_indexing_jobs_progress_range'),
)
# Indexes for performance
op.create_index('idx_indexing_jobs_status', 'indexing_jobs', ['status'],
postgresql_where=sa.text("status IN ('pending', 'running')"))
op.create_index('idx_indexing_jobs_project_id', 'indexing_jobs', ['project_id'])
op.create_index('idx_indexing_jobs_created_at', 'indexing_jobs', [sa.text('created_at DESC')])
op.create_index('idx_indexing_jobs_repository_id', 'indexing_jobs', ['repository_id'],
postgresql_where=sa.text('repository_id IS NOT NULL'))
def downgrade() -> None:
"""Remove indexing_jobs table."""
op.drop_index('idx_indexing_jobs_repository_id', table_name='indexing_jobs')
op.drop_index('idx_indexing_jobs_created_at', table_name='indexing_jobs')
op.drop_index('idx_indexing_jobs_project_id', table_name='indexing_jobs')
op.drop_index('idx_indexing_jobs_status', table_name='indexing_jobs')
op.drop_table('indexing_jobs')
```
### Deployment Steps
1. **Run Alembic migration**:
```bash
alembic upgrade head
```
2. **Verify table creation**:
```sql
\d indexing_jobs
```
3. **Deploy updated code** with new MCP tools
4. **Test with small repository**:
```python
result = await start_indexing_background(repo_path="/tmp/small-repo")
```
5. **Monitor logs** for any errors
## Benefits of PostgreSQL-Native Approach
### Compared to In-Memory State
| Feature | In-Memory (OLD) | PostgreSQL-Native (NEW) |
|---------|-----------------|-------------------------|
| State persistence | ❌ Lost on restart | ✅ Survives restarts |
| Complexity | ❌ Custom job manager class | ✅ Simple SQL operations |
| ACID guarantees | ❌ Race conditions possible | ✅ PostgreSQL transactions |
| Multi-server | ❌ State isolated per server | ✅ Shared state across servers |
| Debugging | ❌ Ephemeral logs only | ✅ Query job history anytime |
| Resource cleanup | ❌ Manual tracking needed | ✅ Database handles consistency |
### Production Quality
- **No memory leaks**: No in-memory dict to grow unbounded
- **No race conditions**: PostgreSQL row-level locking prevents conflicts
- **Proper cleanup**: Connection pool handles connection lifecycle
- **Audit trail**: Complete job history in database
- **Monitoring**: Query active jobs via SQL (example below)
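For example, an operator or admin tool can list in-flight jobs directly, and the partial index on `status` keeps the query cheap (a sketch reusing the session helper; `list_active_jobs` is illustrative):

```python
from sqlalchemy import text

async def list_active_jobs(project_id: str) -> list[dict]:
    """Sketch: list pending/running jobs, served by idx_indexing_jobs_status."""
    async with get_session(project_id=project_id) as db:
        result = await db.execute(
            text("""
                SELECT id, status, progress_percentage, progress_message, created_at
                FROM indexing_jobs
                WHERE status IN ('pending', 'running')
                ORDER BY created_at DESC
            """)
        )
        return [dict(row._mapping) for row in result.fetchall()]
```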
## Constitutional Compliance
### Principle I: Simplicity Over Features
✅ **COMPLIANT**: PostgreSQL-native approach is simpler than custom job manager
- No in-memory state management
- No custom queue implementation
- Simple SQL operations for all state changes
- Reuses existing database infrastructure
### Principle II: Local-First Architecture
✅ **COMPLIANT**: No external dependencies
- PostgreSQL already required for codebase-mcp
- No Redis, RabbitMQ, or external job queue
- All state in local PostgreSQL database
### Principle V: Production Quality
✅ **COMPLIANT**: Enterprise-grade reliability
- Job state persists across server restarts
- ACID transactions prevent data corruption
- Comprehensive error handling and logging
- Resource cleanup via connection pool
- Security: path traversal validation
### Principle IV: Performance Guarantees
✅ **COMPLIANT**: Efficient database operations
- Indexed queries for status lookups (<10ms)
- Batched progress updates (every 2 seconds)
- Connection pooling for efficient resource usage
- No performance regression from baseline
### Principle VIII: Pydantic-Based Type Safety
✅ **COMPLIANT**: Complete type safety
- Pydantic models for all inputs/outputs
- Path validation with clear error messages
- Enum-based status values (no magic strings)
- mypy --strict compliance
## Future Enhancements (Optional)
### PostgreSQL LISTEN/NOTIFY (Real-time Updates)
For advanced clients that want push-based updates instead of polling:
```python
# Optional: Real-time notification setup
async def _notify_progress(job_id: UUID, project_id: str) -> None:
"""Send PostgreSQL NOTIFY for real-time subscribers."""
async with get_session(project_id=project_id) as db:
        # Plain NOTIFY cannot take bind parameters; pg_notify() can
        await db.execute(
            text("SELECT pg_notify('indexing_jobs_progress', :job_id)"),
            {"job_id": str(job_id)}
        )
await db.commit()
# Client subscription sketch: LISTEN needs a dedicated, long-lived
# connection, and pooled SQLAlchemy sessions do not surface notifications,
# so this drops down to asyncpg directly (DSN handling is illustrative)
import asyncpg

async def watch_indexing_progress(job_id: str, dsn: str, ctx: Context) -> None:
    """Relay progress to the client as NOTIFY events arrive."""
    conn = await asyncpg.connect(dsn)
    queue: asyncio.Queue[str] = asyncio.Queue()
    await conn.add_listener(
        "indexing_jobs_progress",
        lambda _c, _pid, _chan, payload: queue.put_nowait(payload),
    )
    try:
        while True:
            payload = await queue.get()
            if payload != job_id:
                continue
            status = await get_indexing_status(job_id=job_id)
            await ctx.info(f"Progress: {status['progress_percentage']}%")
            if status["status"] in ["completed", "failed", "cancelled"]:
                break
    finally:
        await conn.close()
```
**Note**: LISTEN/NOTIFY is optional and adds complexity. Start with simple polling.
### Job History Cleanup
Periodically clean up old completed jobs:
```sql
-- Cleanup job: Delete completed jobs older than 7 days
DELETE FROM indexing_jobs
WHERE status IN ('completed', 'failed', 'cancelled')
AND completed_at < NOW() - INTERVAL '7 days';
```
Run via cron or background task:
```python
async def cleanup_old_jobs() -> None:
"""Background task to clean up old completed jobs."""
while True:
await asyncio.sleep(86400) # Daily cleanup
async with get_session() as db:
result = await db.execute(
text("""
DELETE FROM indexing_jobs
WHERE status IN ('completed', 'failed', 'cancelled')
AND completed_at < NOW() - INTERVAL '7 days'
RETURNING id
""")
)
deleted_count = len(result.fetchall())
logger.info(f"Cleaned up {deleted_count} old indexing jobs")
await db.commit()
```
## Summary
This revised architecture:
1. ✅ **Addresses all critical security issues**: Path validation, resource cleanup, transaction management
2. ✅ **Follows PostgreSQL-native approach**: State in database, no in-memory complexity
3. ✅ **Maintains constitutional compliance**: Simplicity, production quality, type safety
4. ✅ **Production-ready from Phase 1**: No prototype phase, state persists across restarts
5. ✅ **Well-tested**: Comprehensive unit, integration, and edge case tests
6. ✅ **Secure by default**: Path traversal prevention, resource limits, proper error handling
**Implementation Estimate**: 10-11 hours
- Migration + models: 1 hour
- MCP tools: 2 hours
- Background worker: 2 hours
- Database utilities: 1 hour
- Progress callbacks: 1 hour
- Tests: 2-3 hours
- Documentation: 1 hour