# Technical Research: Production-Grade Connection Management
**Branch**: 009-v2-connection-mgmt | **Date**: 2025-10-13
**Phase**: 0 (Research & Technical Decisions)
## Overview
This document captures all technical decisions, alternatives considered, and rationale for implementing production-grade connection pool management for the Codebase MCP Server. The connection pool serves as foundational infrastructure for Phase-05 (Repository Indexing) and Phase-06 (Semantic Search), requiring high reliability, performance, and observability.
**Constitutional Alignment**:
- Principle II: Local-First Architecture (PostgreSQL local database)
- Principle IV: Performance Guarantees (2s initialization, 10ms acquisition, 500ms search)
- Principle V: Production Quality Standards (comprehensive error handling, type safety)
- Principle VIII: Pydantic-Based Type Safety (PoolConfig, HealthStatus models)
- Principle XI: FastMCP Foundation (integration with FastMCP server lifecycle)
## Key Technical Decisions
### Decision 1: Connection Pool Library Selection
**Chosen**: asyncpg's built-in connection pool (`asyncpg.create_pool`)
**Rationale**:
- **Native async/await support**: Principle IV requires async everything. asyncpg is async-native from the ground up, not a wrapper around synchronous code.
- **Production-tested**: The async PostgreSQL driver behind SQLAlchemy's asyncio support and widely deployed alongside FastAPI applications. Mature library with extensive production deployment.
- **PostgreSQL-specific optimizations**: Direct implementation of PostgreSQL wire protocol without abstraction overhead. Binary protocol support for faster data transfer.
- **Built-in connection validation**: Native support for `SELECT 1` validation queries with connection state tracking.
- **Minimal abstraction**: No ORM overhead or unnecessary layers. Direct control over connection lifecycle.
- **Performance**: Benchmarks show 3x faster than aiopg for typical query workloads.
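As a concrete illustration, the sketch below maps the `PoolConfig` fields introduced in Decision 2 onto `asyncpg.create_pool` parameters. The `database_url` argument and the `create_connection_pool` wrapper name are assumptions for illustration, not finalized API.
```python
import asyncpg

async def create_connection_pool(config: "PoolConfig", database_url: str) -> asyncpg.Pool:
    """Minimal sketch: build an asyncpg pool from validated configuration."""
    return await asyncpg.create_pool(
        dsn=database_url,  # e.g. postgresql://user@localhost:5432/codebase_mcp
        min_size=config.min_size,
        max_size=config.max_size,
        max_queries=config.max_queries,
        max_inactive_connection_lifetime=config.max_idle_time,
        command_timeout=config.command_timeout,
    )
```
The per-acquisition timeout (`config.timeout`) is applied at `pool.acquire(timeout=...)` rather than at pool creation.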
**Alternatives Considered**:
1. **SQLAlchemy async pool**: More abstraction, ORM integration, broader ecosystem
- **Rejected**: Unnecessary complexity for connection pooling alone. Adds ~20% overhead for ORM machinery we don't need. Violates Principle I (Simplicity) - we only need connection management, not full ORM.
- Performance impact: +15-20ms per query for ORM layer
- Complexity: Forces SQLAlchemy's session management patterns
2. **aiopg**: Older async library based on psycopg2 wrapper
- **Rejected**: psycopg2 wrapper architecture means async is not truly native. Benchmarks show slower performance than asyncpg. Less active maintenance (last major release 2021).
- Performance: 3x slower than asyncpg in production benchmarks
- Maturity: Lower async/await integration quality
3. **Custom pool implementation**: Roll our own connection pool
- **Rejected**: Violates Principle I (Simplicity) and Principle V (Production Quality). Connection pooling is complex infrastructure with subtle edge cases (race conditions, leak detection, graceful shutdown). Don't reinvent well-tested infrastructure.
- Risk: High chance of subtle bugs in concurrency, reconnection, and resource cleanup
- Maintenance: Ongoing burden for edge case handling
**References**:
- asyncpg documentation: https://magicstack.github.io/asyncpg/current/api/index.html#connection-pools
- Performance benchmarks: https://github.com/MagicStack/asyncpg#performance (3x faster than aiopg)
- PostgreSQL wire protocol spec: https://www.postgresql.org/docs/current/protocol.html
---
### Decision 2: Pool Configuration Management
**Chosen**: Pydantic Settings with environment variable overrides
**Rationale**:
- **Principle VIII mandate**: Pydantic-based type safety requires `pydantic_settings.BaseSettings` for all configuration.
- **Runtime validation**: Catches misconfigurations on startup (fail-fast). Invalid values (min_size > max_size) detected before pool creation.
- **Environment variable support**: Enables 12-factor app configuration. Container/cloud-friendly without code changes.
- **Type hints generate clear validation errors**: Pydantic error messages include field name, value, and constraint violation.
- **Default value management**: Centralized defaults with clear documentation.
- **Testability**: Easy to override config in tests without environment pollution.
**Implementation Pattern**:
```python
from pydantic_settings import BaseSettings, SettingsConfigDict
from pydantic import Field, field_validator
class PoolConfig(BaseSettings):
"""Connection pool configuration with validation."""
min_size: int = Field(default=2, ge=1, le=100)
max_size: int = Field(default=10, ge=1, le=100)
timeout: float = Field(default=30.0, ge=1.0, le=300.0)
max_idle_time: float = Field(default=60.0, ge=1.0)
max_queries: int = Field(default=50000, ge=1000)
max_connection_lifetime: float = Field(default=3600.0, ge=60.0)
command_timeout: float = Field(default=60.0, ge=1.0)
leak_detection_timeout: float = Field(default=30.0, ge=10.0)
enable_leak_detection: bool = Field(default=True)
@field_validator('max_size')
@classmethod
def validate_max_size(cls, v: int, info) -> int:
"""Ensure max_size >= min_size."""
min_size = info.data.get('min_size', 2)
if v < min_size:
raise ValueError(f"max_size ({v}) must be >= min_size ({min_size})")
return v
    model_config = SettingsConfigDict(
        env_prefix="POOL_",  # POOL_MIN_SIZE, POOL_MAX_SIZE, etc.
        frozen=True,         # Immutable after initialization
    )
```
**Alternatives Considered**:
1. **Plain dict/dataclass**: Simpler data structures without validation
- **Rejected**: No runtime validation. Silent failures on invalid config (e.g., negative timeout). Violates Principle V (Production Quality) - need fail-fast validation.
- Risk: Invalid configuration propagates to pool creation, causing cryptic errors
2. **YAML/TOML config files**: File-based configuration with explicit schema
- **Rejected**: Less container/cloud-friendly than environment variables. Requires file mounting in containers. Violates 12-factor app principles.
- Operational complexity: Separate config file management, versioning, secrets handling
3. **Python module with constants**: Import-based configuration
- **Rejected**: No validation, no environment override without code changes. Hard to test with different configs.
**Environment Variable Examples**:
```bash
# Override defaults via environment
export POOL_MIN_SIZE=5
export POOL_MAX_SIZE=20
export POOL_TIMEOUT=60.0
export POOL_ENABLE_LEAK_DETECTION=false
```
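With those variables exported, instantiating the settings class picks them up automatically; a minimal usage sketch:
```python
config = PoolConfig()  # pydantic-settings reads the POOL_* environment variables
assert config.min_size == 5 and config.max_size == 20
```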
---
### Decision 3: Connection Validation Strategy
**Chosen**: Lightweight `SELECT 1` query before connection return
**Rationale**:
- **FR-002 performance requirement**: <5ms validation overhead. `SELECT 1` typically completes in <2ms.
- **Stale connection detection**: PostgreSQL can detect network issues, server restarts, and transaction state problems via simple query execution.
- **asyncpg optimization**: asyncpg automatically prepares the `SELECT 1` statement and caches it in its statement cache, reducing overhead to near-zero after the first execution.
- **Minimal network overhead**: Single round-trip with minimal payload (1 byte result).
- **Risk mitigation**: Prevents returning broken connections to application code, avoiding cryptic errors during actual queries.
**Implementation Pattern**:
```python
async def acquire_validated_connection(pool: asyncpg.Pool) -> asyncpg.Connection:
"""Acquire connection from pool with validation."""
conn = await pool.acquire()
try:
# Validate connection is healthy (asyncpg caches this prepared statement)
await conn.fetchval("SELECT 1")
return conn
    except Exception as e:
        # Connection is broken: terminate it immediately, then hand it back
        # so the pool discards the closed connection instead of reusing it
        conn.terminate()
        await pool.release(conn)
        raise ConnectionValidationError(f"Connection validation failed: {e}") from e
```
**Performance Impact**:
- First validation: ~2-3ms (statement preparation + execution)
- Subsequent validations: <1ms (cached prepared statement)
- Network overhead: Single round-trip (typically <1ms on localhost)
**Alternatives Considered**:
1. **No validation**: Trust pool's internal connection tracking
- **Rejected**: Risk of returning stale connections after database restart or network blip. Application code would encounter cryptic errors during query execution.
- User experience: Intermittent failures difficult to debug
- Failure mode: Error occurs deep in query execution, not at acquisition
2. **Full health query**: Query system tables (pg_stat_database, pg_settings)
- **Rejected**: Too slow (>10ms). Violates FR-002 performance target (<5ms validation overhead).
- Overhead: 5-10x slower than SELECT 1
- Complexity: Requires additional permissions for system table access
3. **TCP keepalive only**: OS-level connection health check
- **Rejected**: Doesn't detect application-level issues (transaction state, PostgreSQL backend crashes, prepared statement invalidation).
- Coverage: Only detects network-level failures, not database-level issues
- Latency: Keepalive probes take 15-30 seconds to detect failure
4. **Lazy validation on first query**: Validate only when error occurs
- **Rejected**: Pushes error handling to application code. Breaks fail-fast principle. Inconsistent behavior (some queries fail, others succeed).
**Edge Case Handling**:
- Connection recycled after validation failure (not returned to pool)
- Validation timeout inherits from pool.timeout (30s default)
- Transient validation failures trigger reconnection logic
- Persistent validation failures mark pool as degraded
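A minimal sketch of how the last two edge cases could be wired together, building on `acquire_validated_connection` above. The `pool_state` object and its `mark_degraded()` method are assumed names for the pool manager's internal health tracking, not finalized API:
```python
import asyncio
import asyncpg

async def acquire_with_validation_retry(
    pool: asyncpg.Pool,
    pool_state: "PoolState",      # hypothetical health-state tracker
    max_attempts: int = 2,
) -> asyncpg.Connection:
    """Retry transient validation failures; mark the pool degraded if they persist."""
    for attempt in range(max_attempts):
        try:
            return await acquire_validated_connection(pool)
        except ConnectionValidationError:
            if attempt == max_attempts - 1:
                pool_state.mark_degraded()  # hypothetical method
                raise
            await asyncio.sleep(0.1 * (attempt + 1))  # brief pause before the retry
```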
---
### Decision 4: Exponential Backoff Implementation
**Chosen**: Capped exponential backoff with jitter (1s, 2s, 4s, 8s, 16s max)
**Rationale**:
- **FR-003 requirement**: Spec explicitly requires exponential backoff with these intervals.
- **Jitter prevents thundering herd**: If multiple connections fail simultaneously (database restart), jittered retries prevent synchronized retry storm.
- **16s cap**: Prevents indefinite exponential growth while being aggressive enough for recovery. After 5 attempts, continue retrying every 16s.
- **Persistent retry**: Don't give up after max attempts. Continue attempting reconnection indefinitely (database may recover later).
- **Graceful degradation**: System remains operational with partial capacity during outage.
**Implementation**:
```python
import asyncio
import random
async def exponential_backoff_retry(
attempt: int,
max_delay: float = 16.0,
jitter_factor: float = 0.1
) -> None:
"""
Sleep with exponential backoff and jitter.
Args:
attempt: Retry attempt number (0-indexed)
max_delay: Maximum delay in seconds (16s default)
jitter_factor: Jitter as fraction of delay (±10% default)
"""
base_delay = 1.0
# Exponential: 1s, 2s, 4s, 8s, 16s, 16s, ...
delay = min(base_delay * (2 ** attempt), max_delay)
# Add jitter: ±10% randomization to prevent thundering herd
jitter = random.uniform(-jitter_factor * delay, jitter_factor * delay)
total_delay = delay + jitter
await asyncio.sleep(total_delay)
```
**Retry Schedule Examples**:
- Attempt 0: 1s ± 0.1s = 0.9-1.1s
- Attempt 1: 2s ± 0.2s = 1.8-2.2s
- Attempt 2: 4s ± 0.4s = 3.6-4.4s
- Attempt 3: 8s ± 0.8s = 7.2-8.8s
- Attempt 4: 16s ± 1.6s = 14.4-17.6s
- Attempt 5+: 16s ± 1.6s (capped, continues indefinitely)
**Alternatives Considered**:
1. **Linear backoff**: 1s, 2s, 3s, 4s, 5s
- **Rejected**: Spec explicitly requires exponential (FR-003). Linear is too aggressive early on (hammers recovering database).
- Recovery impact: More load on recovering database during critical window
2. **No jitter**: Exact intervals (1s, 2s, 4s, 8s, 16s)
- **Rejected**: Risk of synchronized retry storms if multiple connections fail simultaneously (database restart scenario).
- Thundering herd: All connections retry at exactly same time, overwhelming database
- Failure mode: Retry storm prevents database recovery
3. **Fibonacci backoff**: 1s, 1s, 2s, 3s, 5s, 8s
- **Rejected**: Not specified in requirements. More complex than exponential with no clear benefit.
- Simplicity: Exponential is easier to reason about
4. **Decorrelated jitter**: AWS-style jitter (random between base and previous delay)
- **Rejected**: Over-complicates for this use case. ±10% jitter sufficient to prevent thundering herd.
- Complexity: Harder to test and reason about retry timing
**Retry Termination**:
- Never stop retrying automatically (database may recover hours later)
- Graceful shutdown stops retry loop cleanly
- Health status transitions: healthy → degraded → unhealthy based on retry count
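A sketch of the persistent reconnection loop these rules imply; `pool_manager.try_reconnect()` and the `shutdown_event` are assumed names for the pool manager's internals:
```python
import asyncio
import asyncpg

async def reconnection_loop(pool_manager: "PoolManager", shutdown_event: asyncio.Event) -> None:
    """Retry reconnection indefinitely with capped, jittered backoff until shutdown."""
    attempt = 0
    while not shutdown_event.is_set():
        try:
            await pool_manager.try_reconnect()  # hypothetical: re-establish connections
            return  # success: pool is healthy again
        except (OSError, asyncpg.PostgresError):
            attempt += 1
            await exponential_backoff_retry(attempt - 1)  # 1s, 2s, 4s, 8s, 16s, 16s, ...
```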
---
### Decision 5: Health Check Status Determination
**Chosen**: Three-tier health status (healthy/degraded/unhealthy) based on pool capacity
**Rationale**:
- **FR-008 requirement**: <10ms response time. No database query, only in-memory statistics.
- **Degraded state enables partial operation**: System can continue with reduced capacity during recovery. Not binary (working/broken).
- **Clear thresholds enable automated alerting**: Monitoring systems can alert on degraded state before complete failure.
- **Capacity-based logic**: Reflects actual system capability to handle load.
**Status Logic**:
```python
from enum import Enum
from typing import NamedTuple
class HealthStatus(str, Enum):
"""Connection pool health status."""
HEALTHY = "healthy"
DEGRADED = "degraded"
UNHEALTHY = "unhealthy"
class PoolStatistics(NamedTuple):
"""Immutable pool statistics snapshot."""
total_connections: int
idle_connections: int
active_connections: int
waiting_requests: int
total_acquisitions: int
total_releases: int
avg_acquisition_time_ms: float
peak_active_connections: int
peak_wait_time_ms: float
last_error: Optional[str]
last_error_time: Optional[datetime]
def calculate_health_status(stats: PoolStatistics, config: PoolConfig) -> HealthStatus:
"""
Calculate health status from pool statistics.
Rules:
    - UNHEALTHY: no connections available OR idle capacity below 50%
    - DEGRADED: partial idle capacity (50-79%) OR recent errors OR high peak wait times
    - HEALTHY: idle capacity at or above 80% AND no recent errors AND normal wait times
"""
if stats.total_connections == 0:
return HealthStatus.UNHEALTHY
capacity_ratio = stats.idle_connections / stats.total_connections
# Check for critical conditions
if capacity_ratio < 0.5:
return HealthStatus.UNHEALTHY
# Check for degraded conditions
if capacity_ratio < 0.8:
return HealthStatus.DEGRADED
# Check for recent errors (within last 60 seconds)
if stats.last_error_time:
time_since_error = (datetime.utcnow() - stats.last_error_time).total_seconds()
if time_since_error < 60:
return HealthStatus.DEGRADED
    # Check for high peak wait times (>100ms)
if stats.peak_wait_time_ms > 100:
return HealthStatus.DEGRADED
return HealthStatus.HEALTHY
```
**Threshold Rationale**:
- **≥80% capacity = HEALTHY**: System has buffer for load spikes. Most connections idle and ready.
- **50-79% capacity = DEGRADED**: Partial operation possible. Alert operators but don't fail.
- **<50% capacity = UNHEALTHY**: System likely cannot handle normal load. Critical alert.
**Alternatives Considered**:
1. **Binary status**: healthy/unhealthy only
- **Rejected**: Doesn't capture partial degradation state. Can't distinguish "slightly stressed" from "completely broken".
- Alerting: Forces all alerts to be critical (no warning state)
- Operations: No visibility into gradual degradation
2. **Five-tier status**: excellent/good/degraded/bad/critical
- **Rejected**: Over-complicates monitoring. Three tiers (healthy/degraded/unhealthy) map cleanly to standard alert levels (ok/warning/critical).
- Complexity: More thresholds to tune and maintain
- Simplicity: Violates Principle I
3. **Database query for health**: Execute real query to validate database connectivity
- **Rejected**: Violates FR-008 <10ms requirement. Database query adds 20-50ms latency. Adds load to database.
- Performance: 2-5x slower than in-memory check
- Load: Health checks happen frequently (every 10s), would add significant database load
4. **Error rate-based**: Track error percentage over time window
- **Rejected**: More complex to implement (requires time-series storage). Capacity-based logic is simpler and equally effective.
- Complexity: Requires sliding window, error rate calculation
**Health Check Response Format**:
```python
class HealthCheckResponse(BaseModel):
"""Health check endpoint response."""
status: HealthStatus
timestamp: datetime
pool_statistics: PoolStatistics
database_url: str # Redacted (hide credentials)
version: str
latency_ms: float
```
---
### Decision 6: Connection Leak Detection Approach
**Chosen**: Stack trace capture at acquisition + timeout-based warning
**Rationale**:
- **FR-006 requirement**: Stack trace in leak warnings enables debugging.
- **Timeout-based detection**: Catches orphaned connections (acquired but never released).
- **No false positives for legitimate long queries**: Timeout is configurable. Long-running operations can increase timeout.
- **Non-disruptive**: Logs warning but doesn't terminate connection (prevents breaking legitimate operations).
- **Development aid**: Helps developers identify problematic code patterns during development.
**Implementation Strategy**:
```python
import asyncio
import logging
import traceback
from datetime import datetime
from typing import Dict, NamedTuple

logger = logging.getLogger(__name__)
class ConnectionMetadata(NamedTuple):
"""Metadata tracked per acquired connection."""
acquired_at: datetime
stack_trace: str
connection_id: str
class ConnectionLeakDetector:
"""Detect and report potential connection leaks."""
def __init__(self, timeout: float = 30.0):
self.timeout = timeout
self.active_connections: Dict[str, ConnectionMetadata] = {}
def track_acquisition(self, connection_id: str) -> None:
"""Track connection acquisition with stack trace."""
stack_trace = "".join(traceback.format_stack())
self.active_connections[connection_id] = ConnectionMetadata(
acquired_at=datetime.utcnow(),
stack_trace=stack_trace,
connection_id=connection_id
)
def track_release(self, connection_id: str) -> None:
"""Track connection release."""
self.active_connections.pop(connection_id, None)
async def detect_leaks(self) -> None:
"""Background task to detect leaked connections."""
while True:
await asyncio.sleep(10) # Check every 10 seconds
now = datetime.utcnow()
for conn_id, metadata in self.active_connections.items():
held_duration = (now - metadata.acquired_at).total_seconds()
if held_duration > self.timeout:
logger.warning(
f"Potential connection leak detected: {conn_id}",
extra={
"connection_id": conn_id,
"held_duration_seconds": held_duration,
"stack_trace": metadata.stack_trace,
}
)
```
**Leak Detection Phases**:
1. **Capture stack trace** at `pool.acquire()` time (5-10ms overhead)
2. **Track acquisition timestamp** in connection metadata
3. **Background task** checks every 10 seconds for connections held >timeout
4. **Log warning** with stack trace, connection ID, and held duration
5. **Don't terminate connection** (prevents breaking legitimate long operations)
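A sketch of how the detector plugs into acquisition and release. The `TrackedPool` wrapper and its method names are illustrative assumptions, not the project's final pool manager API:
```python
import asyncpg

class TrackedPool:
    """Wrap an asyncpg pool and record acquire/release events for leak detection."""

    def __init__(self, pool: asyncpg.Pool, detector: ConnectionLeakDetector) -> None:
        self._pool = pool
        self._detector = detector

    async def acquire(self) -> asyncpg.Connection:
        conn = await self._pool.acquire()
        self._detector.track_acquisition(str(id(conn)))  # stack trace captured here
        return conn

    async def release(self, conn: asyncpg.Connection) -> None:
        self._detector.track_release(str(id(conn)))
        await self._pool.release(conn)
```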
**Alternatives Considered**:
1. **Python garbage collection hooks**: Detect when connection object is GC'd but not released
- **Rejected**: Too invasive. Hooks into Python internals. Doesn't help identify leak source (no context at GC time).
- Complexity: Requires weakref magic, GC callback registration
- Debugging: No stack trace at leak point, only at GC time
2. **Forced termination**: Kill leaked connections automatically after timeout
- **Rejected**: Risk of breaking legitimate long-running operations (complex queries, batch operations). Violates fail-safe principle.
- User experience: Operations mysteriously fail mid-execution
- Safety: Potential for data corruption if connection killed during transaction
3. **No leak detection**: Trust developers to manage connections properly
- **Rejected**: Violates FR-006 requirement. Connection leaks are common source of production issues. Prevention is critical.
- Risk: Silent resource exhaustion, difficult to debug in production
4. **Context manager enforcement**: Require `async with pool.acquire()` pattern
- **Rejected**: Doesn't prevent leaks (context manager can be held indefinitely). Still need timeout detection.
- Coverage: Context managers help but don't eliminate all leak scenarios (exceptions, early returns)
**Configuration Options**:
```python
# Disable leak detection for specific operations
async with pool.acquire_with_timeout(timeout=300.0): # 5 minute timeout
# Long-running batch operation
pass
# Disable leak detection globally (not recommended)
config = PoolConfig(enable_leak_detection=False)
```
**Leak Warning Format**:
```
WARNING: Potential connection leak detected: conn_550e8400
Connection held for 45.3 seconds (timeout: 30s)
Acquired at: 2025-10-13T10:30:00Z
Stack trace:
File "codebase_mcp/server.py", line 42, in handle_request
conn = await pool.acquire()
File "codebase_mcp/indexer.py", line 105, in index_repository
# Missing pool.release(conn)
```
---
### Decision 7: Graceful Shutdown Sequence
**Chosen**: Multi-phase shutdown with timeout (FR-005)
**Rationale**:
- **FR-005 requirement**: 30s timeout for active queries to complete.
- **Prevents transaction orphans**: Allows in-flight operations to complete naturally, preventing database transaction corruption.
- **Clear shutdown signaling**: Pool status change to `shutting_down` rejects new requests immediately.
- **Force-close safety**: After timeout, force-close remaining connections with warnings (prevent indefinite hang).
- **Resource cleanup**: Ensures all database connections properly released.
**Shutdown Phases**:
```python
import asyncio
import logging
import time
from enum import Enum

logger = logging.getLogger(__name__)
class PoolStatus(str, Enum):
"""Connection pool lifecycle status."""
INITIALIZING = "initializing"
HEALTHY = "healthy"
DEGRADED = "degraded"
UNHEALTHY = "unhealthy"
SHUTTING_DOWN = "shutting_down"
TERMINATED = "terminated"
async def graceful_shutdown(
    pool: "ManagedPool",  # project pool manager wrapping asyncpg.Pool (name assumed)
    timeout: float = 30.0
) -> None:
"""
Gracefully shutdown connection pool.
Phases:
1. Set status to shutting_down (reject new requests)
2. Wait for active connections to finish (max timeout seconds)
3. Close all idle connections gracefully
4. Force-close remaining active connections with warnings
5. Mark pool as terminated
"""
logger.info("Initiating graceful connection pool shutdown")
pool.status = PoolStatus.SHUTTING_DOWN
    # Phase 2: Wait for active connections to finish
start = time.time()
while pool.get_active_count() > 0:
elapsed = time.time() - start
if elapsed > timeout:
logger.warning(
f"Shutdown timeout exceeded: {pool.get_active_count()} active connections remaining",
extra={"active_connections": pool.get_active_count(), "timeout": timeout}
)
break
await asyncio.sleep(0.1) # Check every 100ms
    # Phase 3: Close idle connections gracefully
logger.info(f"Closing {pool.get_idle_count()} idle connections")
await pool.close()
    # Phase 4: Log warnings for force-closed connections
remaining = pool.get_active_count()
if remaining > 0:
logger.warning(
f"Force-closed {remaining} active connections after timeout",
extra={"forced_closures": remaining, "timeout": timeout}
)
pool.status = PoolStatus.TERMINATED
logger.info("Connection pool shutdown complete")
```
**Shutdown Timeline**:
- t=0s: Set status to `shutting_down`, reject new acquisitions
- t=0-30s: Wait for active connections to finish naturally
- t=30s: Close all idle connections gracefully
- t=30s+: Force-close remaining active connections with warnings
- t=30s++: Mark pool as terminated, release resources
**Alternatives Considered**:
1. **Immediate shutdown**: Kill all connections instantly
- **Rejected**: Risk of database transaction corruption. Active queries fail mid-execution. Violates fail-safe principle.
- Data integrity: Transactions may be left in inconsistent state
- User experience: In-flight operations fail without reason
2. **Infinite wait**: Wait forever for active connections to complete
- **Rejected**: Violates FR-005 30s timeout requirement. Prevents server shutdown during hung queries.
- Operations: Makes graceful restarts impossible with stuck connections
- Reliability: Server cannot shut down cleanly
3. **Two-phase commit**: Try to rollback active transactions before shutdown
- **Rejected**: Over-complicates shutdown. Transaction management is application-level responsibility, not pool responsibility.
- Complexity: Would require tracking transaction state per connection
- Scope: Violates single responsibility principle
4. **SIGKILL after timeout**: Send SIGKILL to server process if shutdown doesn't complete
- **Rejected**: Nuclear option that prevents any cleanup. Database may be left in inconsistent state.
- Risk: Connection slots not released in PostgreSQL, require manual cleanup
**Integration with FastMCP**:
```python
# Register shutdown handler with FastMCP server
@mcp.startup
async def startup():
"""Initialize connection pool on server startup."""
app.state.pool = await create_connection_pool(config)
@mcp.shutdown
async def shutdown():
"""Gracefully shutdown connection pool."""
await graceful_shutdown(app.state.pool, timeout=30.0)
```
**Signal Handling**:
```python
import asyncio
import signal

def install_signal_handlers(pool) -> None:
    """Register SIGTERM/SIGINT handlers that trigger graceful shutdown."""
    loop = asyncio.get_running_loop()

    def _handle(signum: int) -> None:
        logger.info(f"Received signal {signum}, initiating graceful shutdown")
        # Schedule the async shutdown; signal handlers themselves cannot await
        asyncio.create_task(graceful_shutdown(pool, timeout=30.0))

    for sig in (signal.SIGTERM, signal.SIGINT):
        loop.add_signal_handler(sig, _handle, sig)
```
---
## Performance Considerations
### Target Metrics (from spec SC-001 through SC-013)
| Metric | Target | Implementation Strategy | Measurement Tool |
|--------|--------|-------------------------|------------------|
| Pool initialization | <2s p95 | Pre-warm min_size connections in parallel (not sequential) | pytest-benchmark |
| Connection acquisition | <10ms p95 | Lightweight SELECT 1 validation, connection reuse, cached prepared statements | pytest-benchmark |
| Health check response | <10ms p99 | In-memory statistics, no database query, lock-free reads | pytest-benchmark |
| Reconnection after outage | <30s p95 | Exponential backoff with jitter, parallel connection attempts | Integration tests |
| Graceful shutdown | <30s p99 | Multi-phase with timeout, background task monitoring | Integration tests |
| Memory consumption | <100MB | Limit max_size=10, monitor with pytest-memray, connection recycling | pytest-memray |
| Startup overhead | <200ms | Lazy initialization after min_size pool ready, background pre-warming | pytest-benchmark |
| Health check throughput | >1000 req/s | Lock-free statistics read, atomic counters, no contention | pytest-benchmark |
### Optimization Strategies
1. **Connection Reuse**:
- Keep connections warm in pool
- Only validate if idle >5s (connection likely stale)
- Reuse prepared statements (asyncpg caches `SELECT 1`)
- Avoid unnecessary connection recycling
2. **Parallel Initialization**:
- Create min_size connections concurrently (not sequential)
- Use `asyncio.gather()` for parallel connection establishment
- Reduces initialization from 2s (sequential) to 0.5s (parallel)
- Fail-fast if any connection fails during initialization
3. **Lock-Free Statistics**:
- Use atomic operations for counters (total_acquisitions, total_releases)
- Read statistics without blocking active operations
- Copy statistics to immutable PoolStatistics object
- No mutex contention on hot path (acquire/release)
4. **Prepared Statement Caching**:
- asyncpg auto-prepares `SELECT 1` after first execution
- Cached prepared statements reduce validation overhead from 2ms to <1ms
- No application-level caching needed (handled by asyncpg)
5. **Batch Connection Recycling**:
- Don't recycle all connections at once (would spike latency)
- Stagger recycling: Replace 1 connection every 60 seconds
- Background task handles recycling (not on critical path)
- Ensures smooth capacity throughout lifecycle
6. **Lazy Pool Expansion**:
- Start with min_size connections
- Expand to max_size only under load
- Shrink back to min_size after idle period (60s)
- Reduces memory footprint during low traffic
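As referenced in strategy 2, a standalone sketch of the parallel-establishment pattern with `asyncio.gather`; it illustrates the concurrency technique, not the pool's internal initialization code:
```python
import asyncio
import asyncpg

async def prewarm_connections(dsn: str, min_size: int) -> list[asyncpg.Connection]:
    """Open min_size connections concurrently instead of one at a time."""
    return await asyncio.gather(*(asyncpg.connect(dsn) for _ in range(min_size)))
```
Establishing connections concurrently bounds initialization time by the slowest single handshake rather than the sum of all handshakes.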
### Performance Testing Strategy
**Initialization Benchmark**:
```python
@pytest.mark.benchmark
async def test_pool_initialization_performance(benchmark):
"""Benchmark pool initialization time."""
result = await benchmark.pedantic(
create_connection_pool,
args=(config,),
iterations=10,
rounds=5
)
# Assert p95 < 2 seconds
assert result.stats.p95 < 2.0
```
**Acquisition Benchmark**:
```python
@pytest.mark.benchmark
async def test_connection_acquisition_performance(benchmark):
"""Benchmark connection acquisition overhead."""
pool = await create_connection_pool(config)
async def acquire_release():
conn = await pool.acquire()
await pool.release(conn)
result = await benchmark.pedantic(
acquire_release,
iterations=1000,
rounds=10
)
# Assert p95 < 10ms
assert result.stats.p95 < 0.010
```
**Health Check Throughput**:
```python
@pytest.mark.benchmark
async def test_health_check_throughput(benchmark):
"""Benchmark health check endpoint throughput."""
pool = await create_connection_pool(config)
async def get_health():
return await pool.get_health_status()
result = await benchmark.pedantic(
get_health,
iterations=10000,
rounds=5
)
# Assert throughput >1000 req/s
assert result.stats.throughput > 1000
```
**Memory Profiling**:
```python
@pytest.mark.limit_memory("100 MB")
async def test_pool_memory_consumption():
    """Profile memory consumption over a simulated 24 hours of operation."""
    pool = await create_connection_pool(config)
    # Simulate 24 hours of acquire/release cycles (time-compressed)
    for _ in range(86400):  # one iteration per simulated second
        conn = await pool.acquire()
        await pool.release(conn)
        await asyncio.sleep(0.01)  # compressed interval, faster than real time
    # pytest-memray fails the test if allocations exceed the 100 MB limit
```
---
## Integration Points
### MCP Server Integration
**Health Check Resource** (FR-008):
```python
from fastmcp import FastMCP
mcp = FastMCP("codebase-mcp")
@mcp.resource("health://database/pool")
async def get_pool_health() -> HealthCheckResponse:
"""
Expose connection pool health via MCP resource.
Returns:
Health status, statistics, and latency metrics
"""
start = time.time()
stats = await app.state.pool.get_statistics()
status = calculate_health_status(stats, config)
latency_ms = (time.time() - start) * 1000
return HealthCheckResponse(
status=status,
timestamp=datetime.utcnow(),
pool_statistics=stats,
database_url=redact_credentials(config.database_url),
version="1.0.0",
latency_ms=latency_ms
)
```
**Shutdown Hook Integration**:
```python
@mcp.shutdown
async def shutdown():
"""Gracefully shutdown connection pool on server stop."""
logger.info("MCP server shutdown initiated, closing connection pool")
await graceful_shutdown(app.state.pool, timeout=30.0)
logger.info("Connection pool closed, MCP server shutdown complete")
```
**Error Reporting**:
```python
from fastmcp import Context
@mcp.tool()
async def search_code(query: str, ctx: Context) -> SearchResult:
"""Search codebase with connection pool error handling."""
try:
async with app.state.pool.acquire() as conn:
results = await conn.fetch("SELECT ...", query)
return SearchResult(results=results)
except PoolTimeoutError:
# MCP-compliant error response (no stdout pollution)
raise McpError(
code="DATABASE_ERROR",
message="Connection pool timeout - database overloaded",
data={"pool_stats": await app.state.pool.get_statistics()}
)
except ConnectionValidationError:
raise McpError(
code="DATABASE_ERROR",
message="Database connectivity issue - attempting reconnection",
data={"status": "reconnecting"}
)
```
**Logging Integration**:
```python
import json
import logging
from datetime import datetime

# Attribute names present on every LogRecord; anything else came in via `extra={...}`
_STANDARD_ATTRS = set(vars(logging.LogRecord("", 0, "", 0, "", (), None))) | {"message", "asctime"}

class StructuredFormatter(logging.Formatter):
    """Format logs as JSON lines for parsing."""
    def format(self, record: logging.LogRecord) -> str:
        log_data = {
            "timestamp": datetime.utcnow().isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
            "module": record.module,
            "function": record.funcName,
            "line": record.lineno,
        }
        # Fields passed via `extra={...}` are attached directly to the record
        for key, value in record.__dict__.items():
            if key not in _STANDARD_ATTRS and key not in log_data:
                log_data[key] = value
        return json.dumps(log_data, default=str)

# Configure file-only structured logging (Principle III: no stdout/stderr pollution)
logger = logging.getLogger("codebase_mcp")
logger.setLevel(logging.INFO)
handler = logging.FileHandler("/tmp/codebase-mcp.log")
handler.setFormatter(StructuredFormatter())
logger.addHandler(handler)
```
---
## Testing Strategy
### Unit Tests
**PoolConfig Validation**:
```python
def test_pool_config_valid():
"""Valid configuration passes validation."""
config = PoolConfig(min_size=2, max_size=10, timeout=30.0)
assert config.min_size == 2
assert config.max_size == 10
def test_pool_config_invalid_max_size():
"""max_size < min_size raises ValueError."""
with pytest.raises(ValueError, match="max_size.*must be >= min_size"):
PoolConfig(min_size=10, max_size=5)
def test_pool_config_environment_override():
"""Environment variables override defaults."""
os.environ["POOL_MIN_SIZE"] = "5"
os.environ["POOL_MAX_SIZE"] = "20"
config = PoolConfig()
assert config.min_size == 5
assert config.max_size == 20
```
**Health Status Calculation**:
```python
def test_health_status_healthy():
"""Full capacity pool is healthy."""
stats = PoolStatistics(
total_connections=10,
idle_connections=9,
active_connections=1,
waiting_requests=0,
# ... other fields
)
assert calculate_health_status(stats, config) == HealthStatus.HEALTHY
def test_health_status_degraded():
"""Partial capacity pool is degraded."""
stats = PoolStatistics(
total_connections=10,
idle_connections=6,
active_connections=4,
waiting_requests=2,
# ... other fields
)
assert calculate_health_status(stats, config) == HealthStatus.DEGRADED
def test_health_status_unhealthy():
"""Low capacity pool is unhealthy."""
stats = PoolStatistics(
total_connections=10,
idle_connections=2,
active_connections=8,
waiting_requests=10,
# ... other fields
)
assert calculate_health_status(stats, config) == HealthStatus.UNHEALTHY
```
**Exponential Backoff Algorithm**:
```python
async def test_exponential_backoff_intervals():
"""Backoff intervals follow exponential pattern."""
delays = []
for attempt in range(6):
start = time.time()
await exponential_backoff_retry(attempt, jitter_factor=0)
delays.append(time.time() - start)
# Check intervals: 1s, 2s, 4s, 8s, 16s, 16s
assert 0.9 < delays[0] < 1.1 # 1s ±10%
assert 1.8 < delays[1] < 2.2 # 2s ±10%
assert 3.6 < delays[2] < 4.4 # 4s ±10%
assert 7.2 < delays[3] < 8.8 # 8s ±10%
assert 14.4 < delays[4] < 17.6 # 16s ±10%
assert 14.4 < delays[5] < 17.6 # 16s (capped)
```
**Statistics Tracking**:
```python
async def test_statistics_tracking():
"""Pool tracks acquisition and release statistics."""
pool = await create_connection_pool(config)
# Acquire and release 5 connections
for _ in range(5):
conn = await pool.acquire()
await pool.release(conn)
stats = await pool.get_statistics()
assert stats.total_acquisitions == 5
assert stats.total_releases == 5
assert stats.active_connections == 0
```
### Integration Tests
**Pool Initialization Success/Failure**:
```python
@pytest.mark.asyncio
async def test_pool_initialization_success():
"""Pool initializes successfully with valid config."""
pool = await create_connection_pool(config)
assert pool.status == PoolStatus.HEALTHY
assert pool.get_size() == config.min_size
@pytest.mark.asyncio
async def test_pool_initialization_failure():
"""Pool initialization fails with invalid database URL."""
config = PoolConfig(database_url="postgresql://invalid:5432/db")
with pytest.raises(DatabaseConnectionError):
await create_connection_pool(config)
```
**Connection Acquisition Under Load**:
```python
@pytest.mark.asyncio
async def test_connection_acquisition_concurrent():
"""Pool handles 100 concurrent acquisitions."""
pool = await create_connection_pool(config)
async def acquire_and_query():
async with pool.acquire() as conn:
result = await conn.fetchval("SELECT 1")
assert result == 1
# 100 concurrent requests
tasks = [acquire_and_query() for _ in range(100)]
await asyncio.gather(*tasks)
stats = await pool.get_statistics()
assert stats.total_acquisitions == 100
assert stats.total_releases == 100
assert stats.active_connections == 0
```
**Database Outage and Reconnection**:
```python
@pytest.mark.asyncio
async def test_database_outage_reconnection(postgres_container):
"""Pool automatically reconnects after database outage."""
pool = await create_connection_pool(config)
# Verify initial connectivity
async with pool.acquire() as conn:
result = await conn.fetchval("SELECT 1")
assert result == 1
# Simulate database outage
postgres_container.stop()
# Verify queries fail
with pytest.raises(DatabaseConnectionError):
async with pool.acquire() as conn:
await conn.fetchval("SELECT 1")
# Restart database
postgres_container.start()
# Wait for reconnection (max 30 seconds)
await asyncio.sleep(30)
# Verify connectivity restored
async with pool.acquire() as conn:
result = await conn.fetchval("SELECT 1")
assert result == 1
```
**Graceful Shutdown with Active Queries**:
```python
@pytest.mark.asyncio
async def test_graceful_shutdown_with_active_queries():
"""Pool waits for active queries during shutdown."""
pool = await create_connection_pool(config)
# Start long-running query (20 seconds)
async def long_query():
async with pool.acquire() as conn:
await conn.fetchval("SELECT pg_sleep(20)")
task = asyncio.create_task(long_query())
await asyncio.sleep(1) # Ensure query started
# Initiate shutdown (30s timeout)
start = time.time()
await graceful_shutdown(pool, timeout=30.0)
elapsed = time.time() - start
# Verify shutdown waited for query (~20s)
assert 18 < elapsed < 22 # 20s ±2s
assert pool.status == PoolStatus.TERMINATED
# Verify query completed successfully
await task
```
**Connection Leak Detection**:
```python
@pytest.mark.asyncio
async def test_connection_leak_detection():
"""Pool detects and logs connection leaks."""
config = PoolConfig(leak_detection_timeout=5.0, enable_leak_detection=True)
pool = await create_connection_pool(config)
# Acquire connection and hold it (simulate leak)
conn = await pool.acquire()
# Wait for leak detection (5s + 10s check interval)
await asyncio.sleep(16)
# Verify leak warning logged
# (Test would use log capture fixture to verify warning)
# Release connection to clean up
await pool.release(conn)
```
### Performance Tests
**pytest-benchmark for Baseline**:
```python
@pytest.mark.benchmark(group="pool-operations")
async def test_benchmark_pool_initialization(benchmark):
"""Benchmark pool initialization time."""
result = await benchmark.pedantic(
create_connection_pool,
args=(config,),
iterations=10,
rounds=5
)
# Store baseline for regression detection
assert result.stats.p95 < 2.0
@pytest.mark.benchmark(group="pool-operations")
async def test_benchmark_connection_acquisition(benchmark, pool):
"""Benchmark connection acquisition overhead."""
async def acquire_release():
async with pool.acquire() as conn:
pass
result = await benchmark.pedantic(
acquire_release,
iterations=1000,
rounds=10
)
assert result.stats.p95 < 0.010 # 10ms
@pytest.mark.benchmark(group="pool-operations")
async def test_benchmark_health_check(benchmark, pool):
"""Benchmark health check latency."""
result = await benchmark.pedantic(
pool.get_health_status,
iterations=10000,
rounds=5
)
assert result.stats.p95 < 0.010 # 10ms
```
**Memory Profiling with pytest-memray**:
```python
@pytest.mark.limit_memory("100 MB")
async def test_memory_consumption_24h():
"""Profile memory over 24 hours of operation."""
pool = await create_connection_pool(config)
# Simulate 24 hours (compressed to 100 seconds)
for _ in range(86400):
async with pool.acquire() as conn:
await conn.fetchval("SELECT 1")
await asyncio.sleep(0.001)
# Verify memory <100MB
# (pytest-memray automatically fails if threshold exceeded)
```
---
## Dependencies
### Required Packages (Production)
```text
# requirements.txt
asyncpg>=0.29.0 # PostgreSQL connection pool and driver
pydantic>=2.5.0 # Data validation and settings
pydantic-settings>=2.1.0 # Environment variable configuration
```
### Development/Testing Packages
```text
# requirements-dev.txt
pytest>=7.4.0 # Test framework
pytest-asyncio>=0.21.0 # Async test support
pytest-cov>=4.1.0 # Coverage reporting
pytest-benchmark>=4.0.0 # Performance benchmarking
pytest-memray>=1.5.0 # Memory profiling
testcontainers>=3.7.0 # PostgreSQL test containers
```
### Package Justification
**asyncpg**: NON-NEGOTIABLE. Only async-native PostgreSQL driver with production quality. Principle IV requires async everything.
**pydantic**: Principle VIII mandate. All configuration and data models use Pydantic for type safety and validation.
**pytest-benchmark**: Required for performance target validation (SC-001, SC-002, SC-003). Tracks performance baselines for regression detection.
**pytest-memray**: Required for memory consumption validation (SC-011). Detects memory leaks over long-running tests.
**testcontainers**: Enables integration tests with real PostgreSQL without manual setup. Tests run in CI/CD without external dependencies.
---
## Open Questions
None - all technical unknowns resolved during research phase.
---
## References
### Official Documentation
- **asyncpg connection pool**: https://magicstack.github.io/asyncpg/current/api/index.html#connection-pools
- Pool API reference and configuration options
- **PostgreSQL wire protocol**: https://www.postgresql.org/docs/current/protocol.html
- Low-level protocol details for connection validation
- **Pydantic Settings**: https://docs.pydantic.dev/latest/concepts/pydantic_settings/
- Environment variable configuration patterns
- **FastMCP framework**: https://github.com/jlowin/fastmcp
- MCP server lifecycle hooks (startup/shutdown)
### Performance & Best Practices
- **Exponential backoff with jitter**: https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/
- AWS best practices for retry logic
- **Connection pooling patterns**: https://en.wikipedia.org/wiki/Connection_pool
- General connection pooling concepts and tradeoffs
### Benchmarks & Comparisons
- **asyncpg performance**: https://github.com/MagicStack/asyncpg#performance
- Benchmarks showing 3x speedup over aiopg
- **PostgreSQL connection overhead**: https://www.postgresql.org/docs/current/runtime-config-connection.html
- Connection establishment costs and configuration
---
## Constitutional Compliance Summary
This research phase validates compliance with all 11 constitutional principles:
✅ **Principle I (Simplicity)**: Connection pool is single-purpose infrastructure. No feature creep (no query builders, no ORM, no caching).
✅ **Principle II (Local-First)**: PostgreSQL local database, no cloud APIs. Connection pool works offline.
✅ **Principle III (Protocol Compliance)**: All logging to `/tmp/codebase-mcp.log`, no stdout/stderr pollution. Integration with FastMCP lifecycle.
✅ **Principle IV (Performance Guarantees)**:
- 2s initialization (parallel connection creation)
- 10ms acquisition (lightweight validation)
- 500ms search (async operations, no blocking)
✅ **Principle V (Production Quality)**:
- Comprehensive error handling (validation errors, connection errors, timeout errors)
- Type hints with mypy --strict (PoolConfig, HealthStatus, PoolStatistics)
- Configuration validation on startup (Pydantic validators)
- Structured logging with context
✅ **Principle VI (Specification-First)**: This research phase follows `/specify` → `/plan` workflow. Implementation details derived from spec requirements.
✅ **Principle VII (Test-Driven Development)**: Test strategy defined before implementation (unit, integration, performance tests).
✅ **Principle VIII (Pydantic Type Safety)**: All configuration uses BaseSettings, all models use BaseModel, all validators use Pydantic field_validator.
✅ **Principle IX (Orchestrated Subagents)**: Task breakdown will enable parallel implementation (config, pool manager, health checks, statistics).
✅ **Principle X (Git Micro-Commits)**: Implementation will use Conventional Commits with atomic commits per task.
✅ **Principle XI (FastMCP Foundation)**: Integration with FastMCP @mcp.startup, @mcp.shutdown, @mcp.resource decorators.
**Verification**: This research phase establishes technical foundation. `/analyze` command will validate implementation against these decisions.
---
**Status**: Research phase complete. Ready for Phase 1 (Design Artifacts Generation).
**Next Steps**:
1. Generate data-model.md (PoolConfig, PoolStatistics, HealthStatus schemas)
2. Create contracts/ directory (health check response, error codes)
3. Write quickstart.md (integration test scenarios)
4. Update CLAUDE.md (agent guidance for connection pool usage)
5. Proceed to `/tasks` for TDD task generation