Codebase MCP Server

codebase-mcp
specs
006-database-schema-refactoring

research.md•19.6 KiB

# Research & Design Decisions **Feature**: Database Schema Refactoring for Multi-Project Support **Date**: 2025-10-11 **Phase**: 0 (Research) ## Overview This document consolidates research findings and design decisions for the database schema refactoring feature. All decisions are made to satisfy the 31 functional requirements defined in spec.md while adhering to constitutional principles. ## R1: Migration Tool Selection ### Decision Use **Alembic** (existing migration framework) with transactional DDL execution ### Rationale - Project already uses Alembic (001_initial_schema.py, 003_project_tracking.py, 005_case_insensitive_vendor_name.py exist) - Alembic handles transactions automatically (single transaction per upgrade/downgrade) - Built-in migration tracking via `alembic_version` table - Better integration with SQLAlchemy (existing ORM) - Automatic rollback on any error (connection loss, constraint violation, etc.) - Standard tooling: `alembic upgrade head`, `alembic downgrade -1` - Constitutional principle alignment: Production Quality (comprehensive error handling), Pydantic-Based Type Safety (Python migrations) - Satisfies FR-018 (single atomic transaction requirement) ### Alternatives Considered 1. **Raw SQL scripts with psql** - Rejected: Creates parallel migration system - Rejected: Manual tracking required - Rejected: Less integration with Python codebase - Rejected: More complexity (need custom wrapper scripts) 2. **Manual transactions in SQL** - Rejected: Alembic handles this correctly - Rejected: More error-prone - Rejected: Less standardized 3. **Other migration frameworks (Flyway, Liquibase)** - Rejected: Not Python-native - Rejected: Alembic already installed and working ### Implementation Pattern ```python # migrations/versions/002_remove_non_search_tables.py from alembic import op import sqlalchemy as sa from sqlalchemy import text import logging logger = logging.getLogger(__name__) def upgrade(): """Forward migration - executed in single transaction""" conn = op.get_bind() logger.info("Step 1/10: Checking prerequisites...") # ... DDL operations ... logger.info("Step 1/10: Complete") # Alembic automatically wraps in BEGIN/COMMIT def downgrade(): """Rollback migration - executed in single transaction""" conn = op.get_bind() logger.info("Rollback Step 1/6: Dropping performance index...") # ... inverse DDL operations ... logger.info("Rollback Step 1/6: Complete") ``` **Alembic Execution**: ```bash # Forward migration alembic upgrade head # Rollback alembic downgrade -1 # Check current state alembic current ``` ## R2: project_id Validation Strategy ### Decision Multi-layer validation (database CHECK constraint + Pydantic validator in Phase 02) ### Rationale - **Database CHECK constraint**: Last line of defense, prevents SQL injection, enforces at insert/update - **Pydantic validator (Phase 02)**: Early validation in application code, better error messages - **Regex pattern** `^[a-z0-9-]{1,50}$`: Simple, database-native, no extensions needed - Satisfies FR-009 through FR-012 (pattern validation requirements) - Satisfies FR-024 through FR-026 (validation enforcement requirements) ### Alternatives Considered 1. **Application-only validation** - Rejected: Can be bypassed via direct DB access - Rejected: Violates defense-in-depth security principle 2. **Custom PostgreSQL domain type** - Rejected: Unnecessary complexity for simple pattern - Rejected: Violates Simplicity Over Features principle 3. **ENUM type** - Rejected: Inflexible, requires DDL for each new project_id - Rejected: Doesn't scale to arbitrary project identifiers ### Implementation Pattern ```python # In Alembic migration upgrade() function def upgrade(): conn = op.get_bind() # Add CHECK constraints conn.execute(text(""" ALTER TABLE repositories ADD CONSTRAINT check_repositories_project_id CHECK (project_id ~ '^[a-z0-9-]{1,50}$') """)) conn.execute(text(""" ALTER TABLE code_chunks ADD CONSTRAINT check_code_chunks_project_id CHECK (project_id ~ '^[a-z0-9-]{1,50}$') """)) ``` ## R3: Foreign Key Verification Approach ### Decision Query `information_schema.table_constraints` and `information_schema.referential_constraints` before dropping tables ### Rationale - Standard SQL approach, works across PostgreSQL versions - Explicit verification prevents accidental data corruption - Clear error message identifies problematic foreign keys - Fails fast before any destructive operations - Satisfies FR-003 (foreign key verification requirement) ### Alternatives Considered 1. **`DROP TABLE ... CASCADE`** - Rejected: Too aggressive, hides potential issues - Rejected: Could accidentally drop data from repositories/code_chunks 2. **Manual inspection** - Rejected: Error-prone, not automatable - Rejected: Violates Production Quality principle 3. **Rely on PostgreSQL error** - Rejected: Less clear error messages - Rejected: Fails after starting drops (less safe) ### Implementation Pattern ```python # In Alembic migration upgrade() function def upgrade(): conn = op.get_bind() logger.info("Step 2/10: Verifying foreign key constraints...") # Check for blocking foreign keys result = conn.execute(text(""" SELECT COUNT(*) AS fk_count FROM information_schema.referential_constraints rc JOIN information_schema.table_constraints tc ON rc.constraint_name = tc.constraint_name WHERE tc.table_name IN ('work_items', 'tasks', 'vendors', ...) AND rc.unique_constraint_name IN ( SELECT constraint_name FROM information_schema.table_constraints WHERE table_name IN ('repositories', 'code_chunks') ) """)) fk_count = result.scalar() if fk_count > 0: raise ValueError(f"Found {fk_count} foreign keys from dropped tables to core tables") logger.info("Step 2/10: Complete - No blocking foreign keys found") ``` ## R4: Migration Idempotency Strategy ### Decision Use **Alembic's migration tracking** (`alembic_version` table) + `IF EXISTS`/`IF NOT EXISTS` as defense-in-depth ### Rationale - **Alembic tracking**: Automatic, standard approach, prevents duplicate migrations - **IF EXISTS clauses**: Defense-in-depth, handles manual database modifications - `alembic upgrade head` when already at head is safe no-op - Clear status via `alembic current` command - Satisfies FR-016 (idempotency requirement) ### Alternatives Considered 1. **IF EXISTS only (no Alembic tracking)** - Rejected: Alembic already provides better tracking - Rejected: Less visibility into migration state 2. **No idempotency checks** - Rejected: Operational risk, requires careful orchestration - Rejected: Violates FR-016 3. **Custom version tracking table** - Rejected: Alembic's `alembic_version` table already does this - Rejected: Unnecessary duplication ### Implementation Pattern ```python # Alembic automatically handles migration tracking # Each migration applied is recorded in alembic_version table # Check if migration applied $ alembic current 002 (head) # Migration 002 is applied # Running upgrade when already at head is safe $ alembic upgrade head INFO [alembic.runtime.migration] Running upgrade -> 002 # No-op, no changes made # Defense-in-depth: Use IF EXISTS in DDL def upgrade(): conn = op.get_bind() # Safe to run multiple times conn.execute(text(""" ALTER TABLE repositories ADD COLUMN IF NOT EXISTS project_id VARCHAR(50) NOT NULL DEFAULT 'default' """)) conn.execute(text(""" DROP TABLE IF EXISTS work_items CASCADE """)) ``` ## R5: Code Chunk project_id Population Strategy ### Decision Three-step ADD COLUMN approach with UPDATE between steps ### Rationale - **Step 1**: Add nullable column (allows UPDATE to succeed) - **Step 2**: Populate from parent repository (maintains referential integrity) - **Step 3**: Add NOT NULL + CHECK constraint (enforces future integrity) - Handles existing data gracefully (copies from parent) - Validates referential integrity (UPDATE fails if orphan chunks exist) - Satisfies FR-022 (data integrity requirement) ### Alternatives Considered 1. **Direct ADD COLUMN with NOT NULL + UPDATE** - Rejected: PostgreSQL doesn't allow UPDATE between constraint checks - Rejected: Would fail on existing data 2. **Default value approach** - Rejected: Loses referential integrity, assigns 'default' to all chunks - Rejected: Violates FR-022 (copy from parent repository) 3. **Two separate transactions** - Rejected: Violates FR-018 (single transaction requirement) ### Implementation Pattern ```python def upgrade(): conn = op.get_bind() logger.info("Step 4/10: Adding project_id to code_chunks table...") # Step 1: Add nullable column conn.execute(text(""" ALTER TABLE code_chunks ADD COLUMN IF NOT EXISTS project_id VARCHAR(50) """)) # Step 2: Populate from parent repository conn.execute(text(""" UPDATE code_chunks SET project_id = ( SELECT project_id FROM repositories WHERE repositories.id = code_chunks.repository_id ) WHERE project_id IS NULL """)) # Step 3: Add constraints conn.execute(text(""" ALTER TABLE code_chunks ALTER COLUMN project_id SET NOT NULL """)) conn.execute(text(""" ALTER TABLE code_chunks ADD CONSTRAINT check_code_chunks_project_id CHECK (project_id ~ '^[a-z0-9-]{1,50}$') """)) logger.info("Step 4/10: Complete - Column added, populated from parent repositories") ``` ## R6: Migration Performance Optimization ### Decision Create index AFTER data population, optionally use CONCURRENTLY for production (manual step) ### Rationale - Index creation on existing data: Faster to populate first, then index - `CREATE INDEX CONCURRENTLY`: Optional for production (allows reads during index build) - For initial migration (minimal data): Standard index creation acceptable (< 1 second) - Performance target: < 5 minutes easily met with up to 100K rows - Satisfies FR-031 (< 5 minute performance requirement) ### Alternatives Considered 1. **Index before populating** - Rejected: Slower, each UPDATE incurs index maintenance cost - Rejected: Could exceed 5 minute target on large datasets 2. **No performance optimization** - Rejected: Violates FR-031 performance requirement 3. **Always use CONCURRENTLY** - Rejected: CONCURRENTLY can't run in transaction - Rejected: Incompatible with Alembic's transactional approach ### Implementation Pattern ```python def upgrade(): conn = op.get_bind() # After all data population complete logger.info("Step 6/10: Creating performance index...") # Standard approach (transactional, fast for test data) op.create_index( 'idx_project_repository', 'repositories', ['project_id', 'id'] ) logger.info("Step 6/10: Complete - Index idx_project_repository created") # Optional for production (manual step BEFORE migration): # CREATE INDEX CONCURRENTLY idx_project_repository # ON repositories(project_id, id); # Then comment out op.create_index() in migration ``` ## R7: Logging and Observability Strategy ### Decision Use **Python logging** in Alembic migration + PostgreSQL `RAISE NOTICE` for SQL-level feedback ### Rationale - **Python logger**: Standard logging framework, structured output, configurable levels - **RAISE NOTICE**: SQL-level progress visibility (shows in Alembic output) - Logs to stdout (captured by Alembic) and optionally to file - Timestamps added by Python logging framework - Error context captured via Python exceptions + SQL error messages - Constitutional principle alignment: Production Quality (comprehensive logging) - Satisfies FR-020 (logging requirement) ### Alternatives Considered 1. **RAISE NOTICE only** - Rejected: No structured logging - Rejected: Harder to parse for monitoring 2. **No logging** - Rejected: Violates FR-020 logging requirement - Rejected: Poor operational visibility 3. **Custom logging table** - Rejected: Overkill, complicates transaction isolation - Rejected: Adds unnecessary complexity ### Implementation Pattern ```python # migrations/versions/002_remove_non_search_tables.py import logging logger = logging.getLogger(__name__) def upgrade(): conn = op.get_bind() # Python logging for major steps logger.info("Step 1/10: Checking prerequisites...") # SQL RAISE NOTICE for inline feedback conn.execute(text(""" DO $$ BEGIN RAISE NOTICE 'Verifying foreign key constraints...'; -- ... query ... RAISE NOTICE 'No blocking foreign keys found'; END $$ """)) logger.info("Step 1/10: Complete") # Error handling with clear context try: conn.execute(text("DROP TABLE work_items CASCADE")) except Exception as e: logger.error(f"Failed to drop work_items: {e}") raise ValueError(f"Migration failed at step 7: {e}") from e ``` **Logging Configuration** (in migrations/env.py): ```python import logging logging.basicConfig( level=logging.INFO, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s', handlers=[ logging.StreamHandler(), # stdout logging.FileHandler('/tmp/codebase-mcp-migration.log') # file ] ) ``` ## R8: Rollback Design Pattern ### Decision Use **Alembic downgrade() function** with inverse operations + data restoration caveat ### Rationale - **Alembic downgrade()**: Standard, automatic, mirrors upgrade() structure - Inverse DDL operations: `CREATE TABLE` mirrors `DROP TABLE`, `DROP COLUMN` mirrors `ADD COLUMN` - Schema-only rollback: Structure restored, data lost (acceptable per spec clarifications) - Data restoration requires backup: Documented in contract - Fast execution: < 30 seconds (no data restoration by default) - Single transaction: Alembic handles automatically - Satisfies FR-014 (rollback script requirement) ### Alternatives Considered 1. **Full data restoration in rollback** - Rejected: Slow, requires backup embedded in script - Rejected: Complex, prone to errors 2. **No rollback (downgrade not implemented)** - Rejected: Violates FR-014 reversibility requirement - Rejected: Poor operational safety 3. **Separate rollback file** - Rejected: Alembic's downgrade() function is standard - Rejected: More files to maintain ### Implementation Pattern ```python # migrations/versions/002_remove_non_search_tables.py def upgrade(): """Forward migration""" conn = op.get_bind() # Add column conn.execute(text("ALTER TABLE repositories ADD COLUMN project_id VARCHAR(50)...")) # Drop tables conn.execute(text("DROP TABLE IF EXISTS work_items CASCADE")) conn.execute(text("DROP TABLE IF EXISTS tasks CASCADE")) # ... (9 tables total) def downgrade(): """Rollback migration""" conn = op.get_bind() logger.info("Rollback Step 1/6: Dropping performance index...") op.drop_index('idx_project_repository', 'repositories') logger.info("Rollback Step 1/6: Complete") logger.info("Rollback Step 2/6: Dropping validation constraints...") conn.execute(text("ALTER TABLE repositories DROP CONSTRAINT IF EXISTS check_repositories_project_id")) conn.execute(text("ALTER TABLE code_chunks DROP CONSTRAINT IF EXISTS check_code_chunks_project_id")) logger.info("Rollback Step 2/6: Complete") logger.info("Rollback Step 3/6: Dropping project_id from code_chunks...") conn.execute(text("ALTER TABLE code_chunks DROP COLUMN IF EXISTS project_id")) logger.info("Rollback Step 3/6: Complete") logger.info("Rollback Step 4/6: Dropping project_id from repositories...") conn.execute(text("ALTER TABLE repositories DROP COLUMN IF EXISTS project_id")) logger.info("Rollback Step 4/6: Complete") logger.info("Rollback Step 5/6: Restoring dropped tables (schema only)...") # Recreate table structures (data NOT restored) conn.execute(text("CREATE TABLE IF NOT EXISTS work_items (...)")) conn.execute(text("CREATE TABLE IF NOT EXISTS tasks (...)")) # ... (9 tables total) logger.warning("Data in dropped tables NOT restored. Restore from backup if needed.") logger.info("Rollback Step 5/6: Complete - 9 tables restored") logger.info("Rollback Step 6/6: Running validation...") # Validation checks logger.info("Rollback Step 6/6: Complete") ``` **Rollback Execution**: ```bash # Downgrade one revision alembic downgrade -1 # Downgrade to specific revision alembic downgrade 005 # Verify rollback alembic current # Should show: 005 ``` ## R9: Database Connection Configuration ### Decision Use **DATABASE_URL environment variable** for database connection configuration ### Rationale - Standard pattern across Python database tools (SQLAlchemy, Alembic, Django, Flask) - Flexible: Works with local, test, and production databases - Secure: No hardcoded credentials in code - Alembic native: `alembic.ini` can reference `${DATABASE_URL}` or detect from environment - Constitutional principle alignment: Local-First Architecture (configurable for any local PostgreSQL) - Operational simplicity: Single configuration point ### Alternatives Considered 1. **Hardcoded connection in alembic.ini** - Rejected: Not flexible for different environments - Rejected: Security risk (credentials in version control) 2. **Separate config file per environment** - Rejected: More complexity, more files to maintain - Rejected: Doesn't follow Python ecosystem conventions 3. **Command-line arguments** - Rejected: More verbose, error-prone - Rejected: Harder to script ### Implementation Pattern ```bash # Environment variable approach (recommended) export DATABASE_URL=postgresql://localhost/codebase_mcp alembic upgrade head # Or inline for single command DATABASE_URL=postgresql://localhost/codebase_mcp_test alembic upgrade head # Different environments DATABASE_URL=postgresql://localhost/codebase_mcp_test alembic upgrade head # Test DATABASE_URL=postgresql://prod-server/codebase_mcp alembic upgrade head # Production ``` **alembic.ini configuration**: ```ini [alembic] # Support DATABASE_URL from environment sqlalchemy.url = ${DATABASE_URL} # Or with fallback to default sqlalchemy.url = postgresql://localhost/codebase_mcp ``` **migrations/env.py configuration**: ```python import os from alembic import context # Get DATABASE_URL from environment or alembic.ini config = context.config database_url = os.environ.get('DATABASE_URL') or config.get_main_option('sqlalchemy.url') config.set_main_option('sqlalchemy.url', database_url) ``` ## Summary of Research Outcomes All research decisions satisfy constitutional principles: - ✅ **Simplicity Over Features** (R1: use existing Alembic, no custom framework; R4: no custom version tracking) - ✅ **Local-First Architecture** (R9: configurable for any local PostgreSQL, R1: all operations local) - ✅ **Protocol Compliance** (no impact to MCP protocol) - ✅ **Performance Guarantees** (R6: < 5 minute target, index optimization strategy) - ✅ **Production Quality** (R1, R3, R7: error handling, validation, comprehensive logging) - ✅ **Specification-First** (all decisions traced to functional requirements) - ✅ **Test-Driven Development** (pytest validation suite, idempotency tests) - ✅ **Pydantic-Based Type Safety** (R1: Python migrations leverage type safety) **Key Architectural Decision**: Using Alembic instead of raw SQL provides: - Automatic transaction management (FR-018) - Built-in migration tracking (FR-016) - Better Python integration (Constitutional Principle VIII) - Production-grade error handling (Constitutional Principle V) - Standard tooling (`alembic upgrade/downgrade`) **Next Phase**: Phase 1 (Design & Contracts)

Loading blob content...

Latest Blog Posts

Redis vs ioredis vs valkey-glide
By punkpeye on January 26, 2026.
benchmark
Redis
valkey
Quickstart: Publish an MCP Server to the MCP Registry
By punkpeye on January 24, 2026.
mcp
official reference mirror
Official MCP Registry Server.json Requirements
By punkpeye on January 24, 2026.
mcp
official reference mirror

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/Ravenight13/codebase-mcp'

If you have feedback or need assistance with the MCP directory API, please join our Discord server

research.md•19.6 KiB