# Connection Pool Monitoring Guide
**Version**: 2.0 | **Feature**: 009-v2-connection-mgmt | **Last Updated**: 2025-10-13
## Overview
This guide provides comprehensive monitoring and troubleshooting documentation for the Codebase MCP Server connection pool. The connection pool manages PostgreSQL connections with automatic health detection, leak prevention, and real-time statistics for production observability.
**Key Monitoring Capabilities**:
- Real-time pool statistics via internal API
- Three-tier health status system (HEALTHY → DEGRADED → UNHEALTHY)
- Connection leak detection with configurable timeouts
- Performance metrics (acquisition times, wait times, peak usage)
- MCP resource endpoint for health monitoring
**Target Audience**: System administrators, SREs, and developers operating the MCP server in production or development environments.
---
## Health Status States
The connection pool uses a three-tier health status system that provides clear operational signals:
### Health Status Definitions
#### HEALTHY
**Definition**: Pool is operating at full capacity with no issues.
**Conditions**:
- `total_connections > 0`
- `idle_connections ≥ 80%` of `total_connections`
- No errors in the last 60 seconds
- Peak wait time `< 100ms`
**Expected Behavior**:
- All connection acquisitions succeed immediately (< 10ms p95)
- Database queries execute without queuing
- Pool maintains minimum size (default: 2 connections)
**Operator Action**: None required. System is operating normally.
---
#### DEGRADED
**Definition**: Pool is functional but experiencing reduced capacity or intermittent issues.
**Conditions**:
- `total_connections > 0`
- `idle_connections` is 50-79% of `total_connections`
- OR recent errors occurred (within last 60 seconds)
- OR peak wait time `> 100ms`
**Expected Behavior**:
- Connection acquisitions may queue briefly
- Database queries succeed but with increased latency
- Pool may be recovering from transient failure
- Automatic recovery attempts in progress
**Operator Action**:
1. Monitor for automatic recovery (typically 30-60 seconds)
2. Check database logs for connection errors
3. Review application query patterns for slow queries
4. Consider increasing `POOL_MAX_SIZE` if sustained high load
---
#### UNHEALTHY
**Definition**: Pool cannot serve requests due to critical failure.
**Conditions**:
- `total_connections == 0` (no connections available)
- OR `idle_connections < 50%` of `total_connections` (severe capacity shortage)
- OR pool initialization failed
**Expected Behavior**:
- Connection acquisition attempts time out after 30 seconds
- Database queries return `DATABASE_ERROR` responses
- Automatic reconnection with exponential backoff (1s, 2s, 4s, 8s, 16s)
- Pool transitions to `RECOVERING` state during reconnection attempts
**Operator Action**:
1. **Immediate**: Check database server is running: `docker ps` or `systemctl status postgresql`
2. Verify network connectivity: `ping <database_host>`
3. Review server logs: `tail -f /tmp/codebase-mcp.log`
4. Check for database outage or maintenance window
5. If database is healthy, restart MCP server to reinitialize pool
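For reference, the thresholds above can be condensed into a small classifier. The following is an illustrative sketch of the decision logic only (the `classify_health` helper is hypothetical, not the server's actual implementation):
```python
from datetime import datetime, timedelta, timezone

def classify_health(total: int, idle: int, peak_wait_ms: float,
                    last_error_time: datetime | None) -> str:
    """Classify pool health per the documented three-tier thresholds."""
    if total == 0 or idle / total < 0.5:
        return "unhealthy"
    recent_error = (last_error_time is not None
                    and datetime.now(timezone.utc) - last_error_time
                    < timedelta(seconds=60))
    if idle / total < 0.8 or recent_error or peak_wait_ms > 100:
        return "degraded"
    return "healthy"
```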
---
### State Transitions
The connection pool lifecycle follows these state transitions:
```
INITIALIZING → HEALTHY → DEGRADED → UNHEALTHY → RECOVERING → HEALTHY
     ↓            ↓          ↓           ↓            ↓
     └────────────┴──────────┴───────────┴────────────┴──→ SHUTTING_DOWN → TERMINATED
```
**State Flow Details**:
1. **INITIALIZING → HEALTHY**
- Trigger: Pool successfully creates `min_size` connections (default: 2)
- Duration: < 2 seconds (p95 per SC-001)
- Validation: Each connection executes `SELECT 1` query
2. **HEALTHY → DEGRADED**
- Trigger: Some connections fail validation OR idle capacity drops to 50-79%
- Recovery: Automatic reconnection attempts OR load decreases
3. **DEGRADED → UNHEALTHY**
- Trigger: Idle capacity drops below 50% OR all connections lost
- Impact: New requests begin timing out
4. **UNHEALTHY → RECOVERING**
- Trigger: Reconnection attempt initiated
- Backoff schedule: 1s, 2s, 4s, 8s, 16s (max 5 initial attempts)
- Continued attempts: Every 16s after initial sequence
5. **RECOVERING → HEALTHY**
- Trigger: Successfully created `min_size` connections
- Duration: < 30 seconds (p95 per SC-004)
- Validation: All new connections pass validation queries
6. **ANY → SHUTTING_DOWN → TERMINATED**
- Trigger: Graceful shutdown signal (SIGTERM/SIGINT)
- Behavior: Waits up to 30 seconds for active queries to complete
- Cleanup: All connections closed, resources released
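The exponential backoff schedule in steps 4-5 maps naturally onto a retry loop. A minimal sketch, assuming an awaitable `connect` callable (a hypothetical name) that raises `OSError` on failure:
```python
import asyncio

async def reconnect_with_backoff(connect):
    """Retry connect() with 1s/2s/4s/8s/16s delays, then every 16s."""
    attempt = 0
    while True:
        try:
            return await connect()
        except OSError:
            delay = min(2 ** attempt, 16)  # 1, 2, 4, 8, 16, 16, 16, ...
            attempt += 1
            await asyncio.sleep(delay)
```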
---
## Pool Statistics Reference
The `PoolStatistics` model provides real-time metrics for capacity planning and troubleshooting:
### Real-Time Metrics
#### `total_connections` (integer, ≥ 0)
**Description**: Current total number of connections managed by the pool.
**Expected Values**:
- Minimum: `min_size` (default: 2)
- Maximum: `max_size` (default: 10)
- Typical: Between `min_size` and `max_size` based on load
**Interpretation**:
- `total_connections == min_size`: Pool at idle capacity
- `total_connections > min_size`: Pool scaled up to handle load
- `total_connections == max_size`: Pool at maximum capacity
- `total_connections == 0`: **CRITICAL** - pool has no connections (UNHEALTHY state)
**Alerting Threshold**: Alert if `total_connections == 0` for > 30 seconds
---
#### `idle_connections` (integer, ≥ 0)
**Description**: Number of connections available for immediate use.
**Expected Values**:
- Healthy: ≥ 80% of `total_connections`
- Degraded: 50-79% of `total_connections`
- Unhealthy: < 50% of `total_connections`
**Interpretation**:
- High idle count: Pool has spare capacity
- Low idle count: Pool under heavy load
- Zero idle connections: All connections in use, new requests will queue
**Alerting Threshold**: Alert if `idle_connections < 50%` of `total_connections` for > 5 minutes
**Calculation**: `idle_connections = total_connections - active_connections`
---
#### `active_connections` (integer, ≥ 0)
**Description**: Number of connections currently executing queries.
**Expected Values**:
- Typical: 0-5 during normal operation
- High load: Approaches `max_size`
**Interpretation**:
- `active_connections == max_size`: Pool fully utilized
- Sustained high active count: Consider increasing `max_size` or optimizing queries
**Alerting Threshold**: Alert if `active_connections == max_size` for > 10 minutes
---
#### `waiting_requests` (integer, ≥ 0)
**Description**: Number of acquisition requests queued waiting for available connections.
**Expected Values**:
- Normal: 0 (no queuing)
- Acceptable: 1-5 (brief queuing during burst load)
- Critical: > 10 (persistent queuing indicates capacity shortage)
**Interpretation**:
- `waiting_requests > 0`: Pool exhausted, requests queuing
- Sustained queuing: Increase `POOL_MAX_SIZE` or optimize query performance
**Alerting Threshold**: **CRITICAL** - Alert if `waiting_requests > 10` for > 2 minutes
---
#### `avg_acquisition_time_ms` (float, ≥ 0.0)
**Description**: Average time to acquire a connection from the pool (milliseconds).
**Expected Values**:
- Excellent: < 5ms
- Normal: 5-10ms
- Degraded: 10-50ms
- Critical: > 50ms
**Interpretation**:
- Low latency: Pool has spare capacity
- High latency: Pool under load or connections queuing
**Performance Target**: < 10ms (p95 per SC-002)
**Alerting Threshold**: Alert if p95 > 10ms for > 5 minutes
---
#### `peak_active_connections` (integer, ≥ 0)
**Description**: Maximum number of connections simultaneously active since server startup.
**Use Cases**:
- Capacity planning: Determine if `max_size` is sufficient
- Right-sizing: If `peak_active_connections` consistently < `max_size / 2`, consider reducing `max_size`
- Growth analysis: Track peak over time to anticipate capacity needs
**Interpretation**:
- `peak_active_connections == max_size`: Pool has hit capacity limit
- `peak_active_connections < max_size * 0.5`: Pool over-provisioned
---
#### `peak_wait_time_ms` (float, ≥ 0.0)
**Description**: Maximum time any request has waited for a connection (milliseconds).
**Expected Values**:
- Excellent: < 10ms (no significant queuing)
- Acceptable: 10-100ms (brief queuing)
- Degraded: 100-1000ms (noticeable queuing)
- Critical: > 1000ms (over 1 second; severe capacity shortage)
**Interpretation**:
- Low peak wait: Pool adequately sized
- High peak wait: Pool exhaustion occurred, investigate cause
**Alerting Threshold**: Alert if `peak_wait_time_ms > 1000ms`
---
#### `total_acquisitions` / `total_releases` (integer, ≥ 0)
**Description**: Lifetime count of connection acquisitions and releases.
**Health Check**:
- `total_acquisitions == total_releases`: All connections properly returned
- `total_acquisitions > total_releases`: Some connections still in use (normal during load)
- `total_acquisitions >> total_releases`: **Potential connection leak detected**
**Leak Detection**:
```python
leak_count = total_acquisitions - total_releases - active_connections
if leak_count > 10:
    # CRITICAL: connection leak suspected; see Troubleshooting > Connection Leaks
    logger.critical("Suspected connection leak: %d unreleased connections", leak_count)
```
**Alerting Threshold**: Alert if `leak_count > 10`
---
#### `pool_created_at` (datetime)
**Description**: Timestamp when pool was initialized.
**Use Cases**:
- Calculate uptime: `uptime = now() - pool_created_at`
- Correlate pool lifecycle with application events
---
#### `last_health_check` (datetime)
**Description**: Timestamp of last successful health check.
**Expected Behavior**:
- Updated every health check request (typically every 10-60 seconds)
- If stale (> 5 minutes), health check endpoint may not be functioning
---
#### `last_error` (string | null)
**Description**: Last error message encountered by the pool.
**Expected Values**:
- `null`: No recent errors
- String: Error message with context (e.g., "Connection validation failed")
**Common Error Messages**:
- `"Connection validation failed"`: Database query failed during validation
- `"Database connection refused"`: Database server unreachable
- `"Pool initialization timeout"`: Startup failed within 2 seconds
- `"Connection acquisition timeout"`: Request waited > 30 seconds
---
#### `last_error_time` (datetime | null)
**Description**: Timestamp of last error occurrence.
**Health Impact**:
- If `last_error_time` is within last 60 seconds, pool status may be DEGRADED
- Older errors (> 60 seconds) do not affect health status
---
### Statistics API Example
Query pool statistics programmatically:
```python
# Internal API (within MCP server)
stats: PoolStatistics = pool_manager.get_statistics()
print(f"Total connections: {stats.total_connections}")
print(f"Idle connections: {stats.idle_connections}")
print(f"Active connections: {stats.active_connections}")
print(f"Waiting requests: {stats.waiting_requests}")
print(f"Avg acquisition time: {stats.avg_acquisition_time_ms:.2f}ms")
print(f"Peak active: {stats.peak_active_connections}")
```
**Performance**: Statistics retrieval has < 1ms staleness (SC-007), ensuring real-time data.
---
## MCP Resource Endpoint
The connection pool exposes health status via an MCP resource endpoint for external monitoring:
### Resource URI
```
health://connection-pool
```
### Request Example
```bash
# Using MCP client CLI
mcp-client resource read health://connection-pool
# Using curl (via MCP SSE transport)
curl -H "Accept: text/event-stream" \
http://localhost:3000/sse \
-d '{"method":"resources/read","params":{"uri":"health://connection-pool"}}'
```
### Response Schema
```json
{
  "status": "healthy",
  "timestamp": "2025-10-13T14:30:00.000Z",
  "database": {
    "status": "connected",
    "pool": {
      "total": 5,
      "idle": 4,
      "active": 1,
      "waiting": 0
    },
    "latency_ms": 2.3,
    "last_error": null
  },
  "uptime_seconds": 3600.5
}
```
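Monitoring scripts can consume this payload directly. A minimal sketch that classifies the response (field names taken from the schema above; the helper name is illustrative):
```python
import json

def pool_is_serving(payload: str) -> bool:
    """Return True if the health payload reports a usable pool."""
    doc = json.loads(payload)
    pool = doc["database"]["pool"]
    return doc["status"] == "healthy" and pool["total"] > 0
```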
### Health Check Performance
- **Target Latency**: < 10ms (p99 per SC-003)
- **Throughput**: > 1000 req/s (per SC-013)
- **No Database Query**: Health check uses cached pool state only (no `SELECT 1` query)
### Integration with Load Balancers
Configure load balancer health checks:
```yaml
# Kubernetes liveness probe
livenessProbe:
  exec:
    command:
      - mcp-client
      - resource
      - read
      - health://connection-pool
  initialDelaySeconds: 5
  periodSeconds: 10
  timeoutSeconds: 2
  failureThreshold: 3

# Kubernetes readiness probe
readinessProbe:
  exec:
    command:
      - mcp-client
      - resource
      - read
      - health://connection-pool
  initialDelaySeconds: 3
  periodSeconds: 5
  timeoutSeconds: 2
  failureThreshold: 2
```
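If the probe should fail on a DEGRADED or UNHEALTHY payload rather than only on CLI errors, a small wrapper script can parse the resource output. This sketch assumes `mcp-client resource read` prints the JSON payload from the response schema above to stdout:
```python
#!/usr/bin/env python3
"""Exec-probe wrapper: exit 0 iff the pool reports healthy (sketch)."""
import json
import subprocess
import sys

# A TimeoutExpired exception also exits nonzero, failing the probe
result = subprocess.run(
    ["mcp-client", "resource", "read", "health://connection-pool"],
    capture_output=True, text=True, timeout=2,
)
healthy = (result.returncode == 0
           and json.loads(result.stdout).get("status") == "healthy")
sys.exit(0 if healthy else 1)
```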
---
## Monitoring Integration
### Prometheus Metrics (Example)
If integrating with Prometheus, expose these metrics:
```python
# Example Prometheus metric definitions (not implemented by default)
from prometheus_client import Counter, Enum, Gauge, Histogram

# Gauge metrics
codebase_mcp_pool_connections_total = Gauge(
    "codebase_mcp_pool_connections_total",
    "Total number of connections in pool",
)
codebase_mcp_pool_connections_idle = Gauge(
    "codebase_mcp_pool_connections_idle",
    "Number of idle connections available",
)
codebase_mcp_pool_connections_active = Gauge(
    "codebase_mcp_pool_connections_active",
    "Number of active connections in use",
)
codebase_mcp_pool_waiting_requests = Gauge(
    "codebase_mcp_pool_waiting_requests",
    "Number of requests waiting for connections",
)

# Histogram metrics
codebase_mcp_pool_acquisition_duration_seconds = Histogram(
    "codebase_mcp_pool_acquisition_duration_seconds",
    "Time to acquire connection from pool",
    buckets=[0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0],
)

# Counter metrics
codebase_mcp_pool_acquisitions_total = Counter(
    "codebase_mcp_pool_acquisitions_total",
    "Total number of connection acquisitions",
)
codebase_mcp_pool_releases_total = Counter(
    "codebase_mcp_pool_releases_total",
    "Total number of connection releases",
)
codebase_mcp_pool_errors_total = Counter(
    "codebase_mcp_pool_errors_total",
    "Total number of pool errors",
    ["error_type"],
)

# Enum metric for health status
codebase_mcp_pool_health_status = Enum(
    "codebase_mcp_pool_health_status",
    "Connection pool health status",
    states=["healthy", "degraded", "unhealthy"],
)
```
### Exporter Script
```python
# scripts/prometheus_exporter.py
"""
Prometheus exporter for connection pool metrics.
Scrapes pool statistics and exposes /metrics endpoint.
"""
import asyncio

from prometheus_client import start_http_server, Gauge, Histogram, Counter, Enum

from connection_pool import ConnectionPoolManager

# Initialize metrics (definitions above)


async def update_metrics(pool_manager: ConnectionPoolManager) -> None:
    """Update Prometheus metrics from pool statistics."""
    while True:
        stats = pool_manager.get_statistics()
        health = await pool_manager.health_check()

        # Update gauges
        codebase_mcp_pool_connections_total.set(stats.total_connections)
        codebase_mcp_pool_connections_idle.set(stats.idle_connections)
        codebase_mcp_pool_connections_active.set(stats.active_connections)
        codebase_mcp_pool_waiting_requests.set(stats.waiting_requests)

        # Update health status
        codebase_mcp_pool_health_status.state(health.status)

        await asyncio.sleep(10)  # Scrape every 10 seconds


if __name__ == "__main__":
    # Start Prometheus HTTP server on port 9090
    start_http_server(9090)

    # Initialize pool manager and start metric updates
    pool_manager = ConnectionPoolManager()
    asyncio.run(update_metrics(pool_manager))
```
**Usage**:
```bash
# Start exporter
python scripts/prometheus_exporter.py
```
```yaml
# Prometheus scrape config
scrape_configs:
  - job_name: 'codebase-mcp-pool'
    static_configs:
      - targets: ['localhost:9090']
    scrape_interval: 15s
```
---
### Grafana Dashboard
Example Grafana dashboard queries for visualizing pool metrics:
#### Panel 1: Connection Pool Utilization
```promql
# Query 1: Total connections
codebase_mcp_pool_connections_total
# Query 2: Idle connections
codebase_mcp_pool_connections_idle
# Query 3: Active connections
codebase_mcp_pool_connections_active
# Panel type: Time series graph
# Y-axis: Connection count
# Legend: {{__name__}}
```
#### Panel 2: Connection Acquisition Latency
```promql
# p50 latency
histogram_quantile(0.50,
rate(codebase_mcp_pool_acquisition_duration_seconds_bucket[5m]))
# p95 latency (target: < 10ms)
histogram_quantile(0.95,
rate(codebase_mcp_pool_acquisition_duration_seconds_bucket[5m]))
# p99 latency
histogram_quantile(0.99,
rate(codebase_mcp_pool_acquisition_duration_seconds_bucket[5m]))
# Panel type: Time series graph
# Y-axis: Latency (seconds)
# Threshold line at 0.01s (10ms target)
```
#### Panel 3: Pool Health Status
```promql
# Health status (0=healthy, 1=degraded, 2=unhealthy)
codebase_mcp_pool_health_status
# Panel type: Stat panel
# Color mode: Background
# Thresholds: 0=green, 1=yellow, 2=red
```
#### Panel 4: Waiting Requests (Queue Depth)
```promql
# Waiting requests count
codebase_mcp_pool_waiting_requests
# Panel type: Time series graph
# Y-axis: Request count
# Threshold line at 10 (critical threshold)
# Alert annotation: waiting_requests > 10
```
#### Panel 5: Acquisition Throughput
```promql
# Acquisitions per second
rate(codebase_mcp_pool_acquisitions_total[1m])
# Releases per second
rate(codebase_mcp_pool_releases_total[1m])
# Panel type: Time series graph
# Y-axis: Operations per second
```
#### Panel 6: Error Rate
```promql
# Errors per second by type
rate(codebase_mcp_pool_errors_total[5m])
# Panel type: Time series graph
# Y-axis: Errors per second
# Group by: error_type
```
**Full Dashboard JSON**: Available at `docs/grafana/connection-pool-dashboard.json` (to be created separately)
---
### Alerting Rules
Example Prometheus alerting rules for connection pool monitoring:
```yaml
# /etc/prometheus/rules/codebase-mcp-pool.yml
groups:
  - name: codebase_mcp_pool_alerts
    interval: 30s
    rules:
      # CRITICAL: Pool has no connections
      - alert: ConnectionPoolNoConnections
        expr: codebase_mcp_pool_connections_total == 0
        for: 30s
        labels:
          severity: critical
          component: connection_pool
        annotations:
          summary: "Connection pool has no available connections"
          description: "Pool has zero total connections for 30+ seconds. Database may be down or pool failed to initialize."
          runbook: "https://docs.example.com/runbooks/connection-pool-no-connections"

      # CRITICAL: Pool health unhealthy for 5 minutes
      - alert: ConnectionPoolUnhealthy
        expr: codebase_mcp_pool_health_status == 2
        for: 5m
        labels:
          severity: critical
          component: connection_pool
        annotations:
          summary: "Connection pool in UNHEALTHY state"
          description: "Pool health status has been UNHEALTHY for 5+ minutes. Automatic recovery may have failed."
          runbook: "https://docs.example.com/runbooks/connection-pool-unhealthy"

      # WARNING: Pool health degraded for 10 minutes
      - alert: ConnectionPoolDegraded
        expr: codebase_mcp_pool_health_status == 1
        for: 10m
        labels:
          severity: warning
          component: connection_pool
        annotations:
          summary: "Connection pool in DEGRADED state"
          description: "Pool health has been DEGRADED for 10+ minutes. May indicate persistent capacity issues."
          runbook: "https://docs.example.com/runbooks/connection-pool-degraded"

      # CRITICAL: Requests waiting for connections (pool exhaustion)
      - alert: ConnectionPoolExhausted
        expr: codebase_mcp_pool_waiting_requests > 10
        for: 2m
        labels:
          severity: critical
          component: connection_pool
        annotations:
          summary: "Connection pool exhausted - requests queuing"
          description: "{{ $value }} requests waiting for connections. Pool capacity insufficient for current load."
          action: "Increase POOL_MAX_SIZE or investigate slow queries"

      # WARNING: High connection acquisition latency
      - alert: ConnectionPoolHighLatency
        expr: |
          histogram_quantile(0.95,
            rate(codebase_mcp_pool_acquisition_duration_seconds_bucket[5m])) > 0.01
        for: 5m
        labels:
          severity: warning
          component: connection_pool
        annotations:
          summary: "Connection acquisition latency high (p95 > 10ms)"
          description: "p95 acquisition latency is {{ $value }}s (target: <0.01s)"
          action: "Investigate pool capacity or query performance"

      # WARNING: Potential connection leak
      - alert: ConnectionPoolLeak
        expr: |
          (codebase_mcp_pool_acquisitions_total - codebase_mcp_pool_releases_total
            - codebase_mcp_pool_connections_active) > 10
        for: 10m
        labels:
          severity: warning
          component: connection_pool
        annotations:
          summary: "Potential connection leak detected"
          description: "{{ $value }} connections appear leaked (acquired but not released)"
          action: "Review application logs for connection leak warnings with stack traces"

      # CRITICAL: Pool fully utilized for extended period
      - alert: ConnectionPoolFullyUtilized
        expr: |
          codebase_mcp_pool_connections_active ==
            codebase_mcp_pool_connections_total
        for: 10m
        labels:
          severity: critical
          component: connection_pool
        annotations:
          summary: "Connection pool fully utilized"
          description: "All {{ $value }} connections are active for 10+ minutes"
          action: "Increase POOL_MAX_SIZE or optimize query performance"

      # WARNING: Low idle capacity
      - alert: ConnectionPoolLowCapacity
        expr: |
          (codebase_mcp_pool_connections_idle /
            codebase_mcp_pool_connections_total) < 0.5
        for: 5m
        labels:
          severity: warning
          component: connection_pool
        annotations:
          summary: "Connection pool running at low capacity"
          description: "Idle capacity is {{ $value | humanizePercentage }} (threshold: 50%)"
          action: "Consider increasing POOL_MAX_SIZE"
```
**Alert Routing**:
```yaml
# /etc/alertmanager/config.yml
route:
  group_by: ['alertname', 'component']
  group_wait: 10s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default'
  routes:
    - match:
        severity: critical
        component: connection_pool
      receiver: 'pagerduty-critical'
      continue: true
    - match:
        severity: warning
        component: connection_pool
      receiver: 'slack-warnings'
receivers:
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: '<pagerduty_service_key>'
  - name: 'slack-warnings'
    slack_configs:
      - api_url: '<slack_webhook_url>'
        channel: '#ops-warnings'
```
---
## Troubleshooting Guide
### Issue: Slow Query Performance
#### Symptoms
- Connection acquisition latency > 10ms (p95)
- `avg_acquisition_time_ms` increasing over time
- Users report slow response times for database operations
#### Causes
1. **Slow database queries**: Individual queries taking seconds to complete
2. **High query concurrency**: Too many simultaneous queries for pool size
3. **Database server overload**: CPU/memory/disk saturation on PostgreSQL
4. **Connection validation overhead**: Health checks taking too long
#### Resolution Steps
**Step 1: Identify slow queries**
```bash
# Check PostgreSQL slow query log
tail -f /var/log/postgresql/postgresql-14-main.log | grep "duration:"
# Query PostgreSQL statistics
psql -U postgres -d codebase_mcp -c "
SELECT query, calls, mean_exec_time, max_exec_time
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;
"
```
**Step 2: Review pool statistics**
```python
# Check if pool is under-provisioned
stats = pool_manager.get_statistics()
print(f"Active: {stats.active_connections}/{stats.total_connections}")
print(f"Waiting: {stats.waiting_requests}")
print(f"Peak active: {stats.peak_active_connections}")
# If peak_active_connections == max_size, pool may be too small
```
**Step 3: Optimize queries**
```sql
-- Add indexes to slow queries
CREATE INDEX CONCURRENTLY idx_chunks_embedding ON code_chunks
USING ivfflat (embedding vector_cosine_ops);
-- Analyze query plans
EXPLAIN ANALYZE <slow_query>;
```
**Step 4: Increase pool size (if needed)**
```bash
# Increase max pool size temporarily
export POOL_MAX_SIZE=20
# Restart MCP server
systemctl restart codebase-mcp
# Monitor acquisition latency improvement
```
**Step 5: Enable query logging**
```python
# Add query timing to application logs
import logging
logger = logging.getLogger("codebase_mcp")
logger.setLevel(logging.DEBUG) # Log query durations
```
**Prevention**:
- Set query timeouts: `POOL_COMMAND_TIMEOUT=60` (default)
- Implement query result caching for frequently accessed data
- Use database connection pooling at application layer (already implemented)
---
### Issue: Connection Pool Exhaustion
#### Symptoms
- `waiting_requests > 0` consistently
- `active_connections == max_size` for extended periods
- Connection acquisition timeouts (30 second timeout errors)
- Alert: `ConnectionPoolExhausted` firing
#### Causes
1. **Insufficient pool capacity**: `max_size` too small for current load
2. **Connection leaks**: Connections not being returned to pool
3. **Long-running queries**: Queries holding connections for minutes
4. **Burst traffic**: Sudden spike in concurrent requests
#### Resolution Steps
**Step 1: Immediate mitigation**
```bash
# Increase pool size to handle burst
export POOL_MAX_SIZE=20 # Double current size
systemctl restart codebase-mcp
# Verify increased capacity
curl http://localhost:3000/health | jq '.database.pool'
# Expected: {"total": 20, "idle": 15, "active": 5, "waiting": 0}
```
**Step 2: Identify resource hogs**
```sql
-- Check for long-running queries
SELECT pid, now() - query_start AS duration, query
FROM pg_stat_activity
WHERE state = 'active'
AND now() - query_start > interval '1 minute'
ORDER BY duration DESC;
-- Terminate hung queries (if necessary)
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'active'
AND now() - query_start > interval '5 minutes';
```
**Step 3: Check for connection leaks**
```bash
# Review leak detection warnings in logs
grep "Potential connection leak" /tmp/codebase-mcp.log
# Expected output with stack trace:
# WARNING: Potential connection leak detected: conn_abc123
# held_duration_seconds: 45.2
# stack_trace: <acquisition call stack>
```
**Step 4: Analyze acquisition patterns**
```python
# Review peak usage metrics
stats = pool_manager.get_statistics()
print(f"Peak active: {stats.peak_active_connections}")
print(f"Total acquisitions: {stats.total_acquisitions}")
print(f"Avg acquisition time: {stats.avg_acquisition_time_ms:.2f}ms")
# If peak consistently hits max_size, capacity is insufficient
```
**Step 5: Capacity planning**
```bash
# Calculate required pool size based on load patterns
# Rule of thumb: max_size = concurrent_users * avg_queries_per_request * buffer_factor
# For 100 concurrent users, 2 queries per request, 50% buffer:
# max_size = (100 * 2) * 1.5 = 300 connections (adjust based on actual usage)
# More realistic for local MCP server: 10-20 connections sufficient
export POOL_MAX_SIZE=15
```
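The rule of thumb in the comments above reduces to a one-line helper; a sketch (the function name is illustrative):
```python
def recommended_max_size(concurrent_users: int, queries_per_request: int,
                         buffer: float = 0.5) -> int:
    """Rule-of-thumb pool sizing: users * queries/request, plus a buffer."""
    return int(concurrent_users * queries_per_request * (1 + buffer))

# recommended_max_size(100, 2) -> 300; a local MCP server rarely needs more
# than 10-20, so treat this as an upper bound and right-size from metrics.
```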
**Prevention**:
- Set `POOL_LEAK_DETECTION_TIMEOUT=30` to catch leaks early
- Implement query timeouts to prevent hung connections
- Monitor `waiting_requests` metric and alert on > 10
- Right-size pool based on actual load patterns
---
### Issue: Database Outage and Recovery
#### Symptoms
- Pool health status: `UNHEALTHY`
- `total_connections == 0`
- All database queries failing with `DATABASE_ERROR`
- Logs show: "Database connection refused" or "Connection reset"
#### Expected Behavior During Outage
**Automatic Reconnection Sequence**:
1. Pool detects connection failures
2. Health status transitions: `HEALTHY → DEGRADED → UNHEALTHY → RECOVERING`
3. Exponential backoff reconnection attempts:
- Attempt 1: 1 second delay
- Attempt 2: 2 seconds delay
- Attempt 3: 4 seconds delay
- Attempt 4: 8 seconds delay
- Attempt 5: 16 seconds delay
- Subsequent attempts: Every 16 seconds
4. Once database recovers, pool reinitializes connections
5. Health status transitions: `RECOVERING → HEALTHY`
6. Normal operations resume
**Recovery Timeline**:
- **Best case**: 1 second (immediate database availability)
- **Typical**: 10-30 seconds (transient network issue)
- **Worst case**: Up to 30 seconds (p95 per SC-004) after database recovers
#### Resolution Steps
**Step 1: Verify database status**
```bash
# Check if PostgreSQL is running
docker ps | grep postgres
# OR
systemctl status postgresql
# If not running, start database
docker start postgres-codebase-mcp
# OR
systemctl start postgresql
```
**Step 2: Verify network connectivity**
```bash
# Ping database host
ping <database_host>
# Check port connectivity
nc -zv <database_host> 5432
# Test connection with psql
psql -U postgres -h <database_host> -d codebase_mcp -c "SELECT 1"
```
**Step 3: Monitor automatic recovery**
```bash
# Watch MCP server logs for reconnection attempts
tail -f /tmp/codebase-mcp.log | grep -E "reconnect|recovering"
# Expected log sequence:
# INFO: Reconnection attempt 1/5 (delay: 1s)
# INFO: Reconnection attempt 2/5 (delay: 2s)
# INFO: Connection pool recovered: 2/10 connections available
# INFO: Health status: RECOVERING -> HEALTHY
```
**Step 4: Verify pool recovery**
```bash
# Check health endpoint
curl http://localhost:3000/health | jq '.database'
# Expected after recovery:
# {
# "status": "connected",
# "pool": {"total": 2, "idle": 2, "active": 0, "waiting": 0},
# "latency_ms": 2.5,
# "last_error": null
# }
```
**Step 5: Test database operations**
```bash
# Execute test query via MCP client
mcp-client tool call search_code --query "test"
# Should succeed with results (or empty results if database has no data)
```
**Manual Recovery (if automatic fails)**:
```bash
# If automatic recovery doesn't work within 5 minutes:
# 1. Check database logs for errors
tail -f /var/log/postgresql/postgresql-14-main.log
# 2. Verify DATABASE_URL is correct
echo $DATABASE_URL
# OR
cat .env | grep DATABASE_URL
# 3. Restart MCP server to force reconnection
systemctl restart codebase-mcp
# 4. If issue persists, check database connectivity:
psql -U postgres -d codebase_mcp -c "SELECT version();"
```
**Prevention**:
- Monitor database server health proactively
- Set up database replication for high availability
- Configure alerts for database downtime
- Test recovery procedures regularly (e.g., monthly chaos testing)
---
### Issue: Connection Leaks
#### Symptoms
- `total_acquisitions >> total_releases + active_connections`
- Warning logs: "Potential connection leak detected"
- Pool gradually exhausting connections over time
- Increasing `active_connections` without corresponding releases
#### Causes
1. **Unclosed context managers**: Missing `async with pool.acquire()` pattern
2. **Exception handling gaps**: Connections not released in error paths
3. **Long-running operations**: Legitimate operations exceeding `leak_detection_timeout`
4. **Application bugs**: Logic errors preventing connection release
#### Resolution Steps
**Step 1: Identify leaking code**
```bash
# Review leak warnings with stack traces
grep -A 20 "Potential connection leak" /tmp/codebase-mcp.log
# Example output:
# WARNING: Potential connection leak detected: conn_abc123
# connection_id: conn_abc123
# held_duration_seconds: 45.2
# stack_trace:
# File "src/services/searcher.py", line 123, in search_code
# conn = await pool.acquire() # <-- Acquisition site
# File "src/services/searcher.py", line 145, in execute_search
# results = await conn.fetch(query)
```
**Step 2: Review code at acquisition sites**
```python
# INCORRECT: Connection not guaranteed to be released
async def bad_query():
    conn = await pool.acquire()
    result = await conn.fetchval("SELECT 1")
    # BUG: If an exception occurs above, the connection is never released
    await pool.release(conn)
    return result

# CORRECT: Use a context manager
async def good_query():
    async with pool.acquire() as conn:
        result = await conn.fetchval("SELECT 1")
        # Connection automatically released even if an exception occurs
        return result
```
**Step 3: Check for exception handling issues**
```python
# INCORRECT: Connection leaked in error path
async def bad_error_handling():
    conn = await pool.acquire()
    try:
        result = await conn.fetchval("SELECT 1")
    except Exception as e:
        # BUG: Returns without releasing the connection
        return None
    await pool.release(conn)
    return result

# CORRECT: Use a context manager or try/finally
async def good_error_handling():
    async with pool.acquire() as conn:
        try:
            result = await conn.fetchval("SELECT 1")
        except Exception as e:
            logger.error(f"Query failed: {e}")
            return None
        return result
```
**Step 4: Verify leak detection configuration**
```bash
# Check leak detection settings
echo $POOL_LEAK_DETECTION_TIMEOUT # Default: 30
echo $POOL_ENABLE_LEAK_DETECTION # Default: true
# Adjust timeout for long-running operations
export POOL_LEAK_DETECTION_TIMEOUT=120 # 2 minutes for batch jobs
```
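Conceptually, leak detection amounts to recording an acquisition timestamp (and call stack) per connection and flagging holders that exceed the timeout. A simplified sketch of that bookkeeping, not the server's actual implementation:
```python
import time
import traceback

class LeakTracker:
    """Flag connections held longer than timeout_s (simplified sketch)."""

    def __init__(self, timeout_s: float = 30.0):
        self.timeout_s = timeout_s
        self._held: dict[str, tuple[float, str]] = {}

    def on_acquire(self, conn_id: str) -> None:
        # Capture the acquisition site so leak warnings can show a stack trace
        stack = "".join(traceback.format_stack(limit=5))
        self._held[conn_id] = (time.monotonic(), stack)

    def on_release(self, conn_id: str) -> None:
        self._held.pop(conn_id, None)

    def suspected_leaks(self) -> list[str]:
        now = time.monotonic()
        return [cid for cid, (t, _) in self._held.items()
                if now - t > self.timeout_s]
```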
**Step 5: Monitor leak count**
```python
# Calculate current leak count
stats = pool_manager.get_statistics()
leak_count = (stats.total_acquisitions - stats.total_releases
              - stats.active_connections)
print(f"Suspected leaks: {leak_count}")
# If leak_count > 10, investigate immediately
```
**Step 6: Force connection recycling (temporary fix)**
```bash
# Restart server to clear leaked connections
systemctl restart codebase-mcp
# Or reduce max_connection_lifetime to force recycling
export POOL_MAX_CONNECTION_LIFETIME=600 # 10 minutes (default: 3600)
```
**Prevention**:
- **Always use context managers**: `async with pool.acquire() as conn:`
- Enable leak detection in production: `POOL_ENABLE_LEAK_DETECTION=true`
- Set appropriate timeout: `POOL_LEAK_DETECTION_TIMEOUT=30` (30 seconds)
- Review all connection acquisition sites during code reviews
- Run 24-hour leak test before production deployment:
```bash
pytest tests/integration/test_leak_detection.py --duration=86400
```
---
### Issue: High Wait Times
#### Symptoms
- `peak_wait_time_ms > 100ms`
- Increased `avg_acquisition_time_ms`
- Intermittent connection acquisition delays
- Alert: `ConnectionPoolHighLatency` firing
#### Causes
1. **Pool too small**: `max_size` insufficient for concurrent load
2. **Slow queries**: Individual queries holding connections for seconds
3. **Burst traffic**: Sudden spike in requests
4. **Database performance**: Slow query execution on PostgreSQL side
#### Resolution Steps
**Step 1: Analyze wait time patterns**
```python
# Review statistics
stats = pool_manager.get_statistics()
print(f"Avg acquisition: {stats.avg_acquisition_time_ms:.2f}ms")
print(f"Peak wait time: {stats.peak_wait_time_ms:.2f}ms")
print(f"Active connections: {stats.active_connections}/{stats.total_connections}")
print(f"Waiting requests: {stats.waiting_requests}")
# If waiting_requests > 0, pool is undersized
# If avg_acquisition_time > 10ms, investigate causes
```
**Step 2: Identify bottleneck**
```sql
-- Check for slow queries
SELECT query, mean_exec_time, calls
FROM pg_stat_statements
WHERE mean_exec_time > 100 -- > 100ms average
ORDER BY mean_exec_time DESC
LIMIT 10;
-- Check database load
SELECT * FROM pg_stat_activity;
```
**Step 3: Increase pool capacity (if needed)**
```bash
# Temporarily increase max_size
export POOL_MAX_SIZE=20
systemctl restart codebase-mcp
# Monitor wait time improvement
curl http://localhost:3000/health | jq '.database.pool'
```
**Step 4: Optimize slow queries**
```sql
-- Add missing indexes
CREATE INDEX CONCURRENTLY idx_chunks_repo ON code_chunks(repository_id);
-- Vacuum database
VACUUM ANALYZE code_chunks;
-- Update statistics
ANALYZE;
```
**Step 5: Implement query timeout**
```bash
# Set command timeout to prevent hung queries
export POOL_COMMAND_TIMEOUT=60 # 60 seconds max per query
systemctl restart codebase-mcp
```
**Prevention**:
- Monitor `peak_wait_time_ms` and alert on > 100ms
- Right-size pool based on actual concurrency patterns
- Implement query result caching for frequently accessed data
- Set `POOL_COMMAND_TIMEOUT` to prevent queries from holding connections indefinitely
- Optimize database indexes and query plans
---
## Performance Baselines
Expected latencies and throughput, as validated by the quickstart.md integration tests:
### Initialization Performance
- **Target**: Pool initialization < 2 seconds (p95)
- **Typical**: 300-500ms
- **Includes**: Creating `min_size` connections, validation queries
### Connection Acquisition Performance
- **Target**: < 10ms (p95)
- **Typical**: 1-3ms (when connections available)
- **Typical (under load)**: 5-10ms (when scaling up pool)
### Health Check Performance
- **Target**: < 10ms (p99)
- **Typical**: < 1ms (cached state, no database query)
- **Throughput**: > 1000 requests/second
### Reconnection Performance
- **Target**: < 30 seconds (p95) after database recovery
- **Typical**: 10-20 seconds
- **Includes**: Exponential backoff delays, connection creation, validation
### Graceful Shutdown Performance
- **Target**: < 30 seconds (p99)
- **Typical**: 0-5 seconds (if no active queries)
- **Typical (with active queries)**: 5-30 seconds (waits for completion)
### Concurrent Load Performance
- **Target**: 100 concurrent requests without deadlock
- **Pool size**: max_size=10 (default)
- **Behavior**: Requests queue when pool exhausted, FIFO order
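A quick way to exercise this baseline is to fire concurrent acquisitions and confirm they all complete. A sketch, assuming an asyncpg-style pool:
```python
import asyncio

async def hammer_pool(pool, concurrency: int = 100) -> None:
    """Issue concurrent acquisitions; all must complete without deadlock."""
    async def one_request() -> None:
        async with pool.acquire() as conn:
            await conn.fetchval("SELECT 1")
    await asyncio.gather(*(one_request() for _ in range(concurrency)))
```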
### Memory Consumption
- **Target**: < 100MB for max_size=10 pool
- **Typical**: 20-50MB (depends on connection count and query buffers)
### Startup Overhead
- **Target**: < 200ms added to server startup time
- **Typical**: 50-100ms (pool initialization)
- **Includes**: Configuration validation, pool creation
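To verify the acquisition target against a running pool, measure it directly. A sketch, assuming an asyncpg-style pool:
```python
import statistics
import time

async def acquisition_p95_ms(pool, samples: int = 200) -> float:
    """Measure p95 connection-acquisition latency in milliseconds."""
    timings = []
    for _ in range(samples):
        start = time.perf_counter()
        async with pool.acquire():
            pass  # acquisition/release only; no query
        timings.append((time.perf_counter() - start) * 1000)
    return statistics.quantiles(timings, n=20)[18]  # 95th percentile cut point
```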
---
## Logging and Diagnostics
### Log File Location
```bash
/tmp/codebase-mcp.log
```
### Log Format
```json
{
  "timestamp": "2025-10-13T14:30:00.000Z",
  "level": "INFO",
  "component": "connection_pool",
  "message": "Connection pool initialized",
  "context": {
    "min_size": 2,
    "max_size": 10,
    "database": "codebase_mcp"
  }
}
```
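The server's logging setup may differ, but this sketch shows one way to produce the shape above with the standard library:
```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render records in the JSON shape shown above (illustrative)."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S%z"),
            "level": record.levelname,
            "component": "connection_pool",
            "message": record.getMessage(),
            "context": getattr(record, "context", {}),
        })

handler = logging.FileHandler("/tmp/codebase-mcp.log")
handler.setFormatter(JsonFormatter())
logging.getLogger("codebase_mcp").addHandler(handler)
```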
### Key Log Messages
#### Initialization
```
INFO: Connection pool initialized: min_size=2, max_size=10
INFO: Created 2 initial connections in 345ms
```
#### Reconnection
```
WARNING: Database connection lost: Connection refused
INFO: Reconnection attempt 1/5 (delay: 1s)
INFO: Reconnection attempt 2/5 (delay: 2s)
INFO: Connection pool recovered: 2/10 connections available
```
#### Leak Detection
```
WARNING: Potential connection leak detected: conn_abc123
connection_id: conn_abc123
held_duration_seconds: 45.2
stack_trace: <acquisition call stack>
```
#### Graceful Shutdown
```
INFO: Graceful shutdown initiated
INFO: Waiting for 3 active connections to complete (timeout: 30s)
INFO: All connections closed gracefully in 5.2s
INFO: Connection pool terminated
```
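The shutdown sequence corresponds to a bounded wait on pool close. A sketch, assuming an asyncpg-style pool with `close()`/`terminate()`:
```python
import asyncio

async def graceful_shutdown(pool, timeout_s: float = 30.0) -> None:
    """Wait up to timeout_s for active queries, then force-close."""
    try:
        await asyncio.wait_for(pool.close(), timeout=timeout_s)
    except asyncio.TimeoutError:
        pool.terminate()  # force-close any stragglers
```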
### Log Level Configuration
```bash
# Set log level via environment variable
export LOG_LEVEL=DEBUG # DEBUG, INFO, WARNING, ERROR, CRITICAL
```
---
## Summary
This guide provides comprehensive monitoring and troubleshooting documentation for the Codebase MCP Server connection pool. Key takeaways:
1. **Three-tier health status** (HEALTHY/DEGRADED/UNHEALTHY) provides clear operational signals
2. **Real-time statistics** enable capacity planning and performance optimization
3. **MCP resource endpoint** (`health://connection-pool`) integrates with external monitoring
4. **Prometheus/Grafana integration** enables visualization and alerting
5. **Troubleshooting procedures** cover common issues with actionable resolution steps
6. **Performance baselines** define expected latencies and throughput
**Next Steps**:
- Set up Prometheus scraping for pool metrics
- Configure Grafana dashboards for visualization
- Implement alerting rules for critical issues
- Test recovery procedures in staging environment
- Document team-specific runbooks based on this guide
**Support Resources**:
- Quickstart integration tests: `specs/009-v2-connection-mgmt/quickstart.md`
- Data model reference: `specs/009-v2-connection-mgmt/data-model.md`
- Feature specification: `specs/009-v2-connection-mgmt/spec.md`
- Implementation plan: `specs/009-v2-connection-mgmt/plan.md`