# Connection Pool Monitoring Guide

**Version**: 2.0 | **Feature**: 009-v2-connection-mgmt | **Last Updated**: 2025-10-13

## Overview

This guide provides comprehensive monitoring and troubleshooting documentation for the Codebase MCP Server connection pool. The connection pool manages PostgreSQL connections with automatic health detection, leak prevention, and real-time statistics for production observability.

**Key Monitoring Capabilities**:

- Real-time pool statistics via internal API
- Three-tier health status system (HEALTHY → DEGRADED → UNHEALTHY)
- Connection leak detection with configurable timeouts
- Performance metrics (acquisition times, wait times, peak usage)
- MCP resource endpoint for health monitoring

**Target Audience**: System administrators, SREs, and developers operating the MCP server in production or development environments.

---

## Health Status States

The connection pool uses a three-tier health status system that provides clear operational signals:

### Health Status Definitions

#### HEALTHY

**Definition**: Pool is operating normally with full capacity available and no issues.

**Conditions**:

- `total_connections > 0`
- `idle_connections ≥ 80%` of `total_connections`
- No errors in the last 60 seconds
- Peak wait time `< 100ms`

**Expected Behavior**:

- All connection acquisitions succeed immediately (< 10ms p95)
- Database queries execute without queuing
- Pool maintains minimum size (default: 2 connections)

**Operator Action**: None required. System is operating normally.

---

#### DEGRADED

**Definition**: Pool is functional but experiencing reduced capacity or intermittent issues.

**Conditions**:

- `total_connections > 0`
- `idle_connections` is 50-79% of `total_connections`
- OR recent errors occurred (within the last 60 seconds)
- OR peak wait time `> 100ms`

**Expected Behavior**:

- Connection acquisitions may queue briefly
- Database queries succeed but with increased latency
- Pool may be recovering from a transient failure
- Automatic recovery attempts in progress

**Operator Action**:

1. Monitor for automatic recovery (typically 30-60 seconds)
2. Check database logs for connection errors
3. Review application query patterns for slow queries
4. Consider increasing `POOL_MAX_SIZE` if high load is sustained

---

#### UNHEALTHY

**Definition**: Pool cannot serve requests due to a critical failure.

**Conditions**:

- `total_connections == 0` (no connections available)
- OR `idle_connections < 50%` of `total_connections` (severe capacity shortage)
- OR pool initialization failed

**Expected Behavior**:

- Connection acquisition attempts time out after 30 seconds
- Database queries return `DATABASE_ERROR` responses
- Automatic reconnection with exponential backoff (1s, 2s, 4s, 8s, 16s)
- Pool transitions to `RECOVERING` state during reconnection attempts

**Operator Action**:

1. **Immediate**: Check the database server is running: `docker ps` or `systemctl status postgresql`
2. Verify network connectivity: `ping <database_host>`
3. Review server logs: `tail -f /tmp/codebase-mcp.log`
4. Check for a database outage or maintenance window
5. If the database is healthy, restart the MCP server to reinitialize the pool
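
The conditions above reduce to a few threshold checks over pool statistics. The following is a minimal illustrative sketch of that classification logic, assuming the `PoolStatistics` fields documented later in this guide; it is not the server's actual implementation.

```python
from datetime import datetime, timedelta, timezone

# Illustrative only: applies the documented thresholds to pool statistics.
def classify_health(
    total_connections: int,
    idle_connections: int,
    last_error_time: datetime | None,
    peak_wait_time_ms: float,
) -> str:
    if total_connections == 0:
        return "unhealthy"  # no connections available
    idle_ratio = idle_connections / total_connections
    recent_error = (
        last_error_time is not None
        and datetime.now(timezone.utc) - last_error_time < timedelta(seconds=60)
    )
    if idle_ratio < 0.5:
        return "unhealthy"  # severe capacity shortage
    if idle_ratio < 0.8 or recent_error or peak_wait_time_ms > 100:
        return "degraded"
    return "healthy"
```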

---

### State Transitions

The connection pool lifecycle follows these state transitions:

```
INITIALIZING → HEALTHY → DEGRADED → UNHEALTHY → RECOVERING → HEALTHY
                  ↓                                             ↓
                  └──────────────→ SHUTTING_DOWN → TERMINATED ←─┘
```

**State Flow Details**:

1. **INITIALIZING → HEALTHY**
   - Trigger: Pool successfully creates `min_size` connections (default: 2)
   - Duration: < 2 seconds (p95 per SC-001)
   - Validation: Each connection executes a `SELECT 1` query

2. **HEALTHY → DEGRADED**
   - Trigger: Some connections fail validation OR idle capacity drops to 50-79%
   - Recovery: Automatic reconnection attempts OR load decreases

3. **DEGRADED → UNHEALTHY**
   - Trigger: Idle capacity drops below 50% OR all connections are lost
   - Impact: New requests begin timing out

4. **UNHEALTHY → RECOVERING**
   - Trigger: Reconnection attempt initiated
   - Backoff schedule: 1s, 2s, 4s, 8s, 16s (max 5 initial attempts); see the sketch after this list
   - Continued attempts: Every 16s after the initial sequence

5. **RECOVERING → HEALTHY**
   - Trigger: Successfully created `min_size` connections
   - Duration: < 30 seconds (p95 per SC-004)
   - Validation: All new connections pass validation queries

6. **ANY → SHUTTING_DOWN → TERMINATED**
   - Trigger: Graceful shutdown signal (SIGTERM/SIGINT)
   - Behavior: Waits up to 30 seconds for active queries to complete
   - Cleanup: All connections closed, resources released
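
For reference, the backoff schedule above can be expressed as a one-line helper. This is an illustrative sketch; the hypothetical `reconnect_delay` function is not part of the server API.

```python
# Illustrative sketch of the documented schedule: 1s, 2s, 4s, 8s, 16s,
# then every 16s for all subsequent attempts.
def reconnect_delay(attempt: int) -> float:
    """Delay in seconds before reconnection attempt `attempt` (1-based)."""
    return float(min(2 ** (attempt - 1), 16))

# reconnect_delay(1) == 1.0, reconnect_delay(5) == 16.0, reconnect_delay(8) == 16.0
```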

---

## Pool Statistics Reference

The `PoolStatistics` model provides real-time metrics for capacity planning and troubleshooting:

### Real-Time Metrics

#### `total_connections` (integer, ≥ 0)

**Description**: Current total number of connections managed by the pool.

**Expected Values**:

- Minimum: `min_size` (default: 2)
- Maximum: `max_size` (default: 10)
- Typical: Between `min_size` and `max_size` based on load

**Interpretation**:

- `total_connections == min_size`: Pool at idle capacity
- `total_connections > min_size`: Pool scaled up to handle load
- `total_connections == max_size`: Pool at maximum capacity
- `total_connections == 0`: **CRITICAL**: pool has no connections (UNHEALTHY state)

**Alerting Threshold**: Alert if `total_connections == 0` for > 30 seconds

---

#### `idle_connections` (integer, ≥ 0)

**Description**: Number of connections available for immediate use.

**Expected Values**:

- Healthy: ≥ 80% of `total_connections`
- Degraded: 50-79% of `total_connections`
- Unhealthy: < 50% of `total_connections`

**Interpretation**:

- High idle count: Pool has spare capacity
- Low idle count: Pool under heavy load
- Zero idle connections: All connections in use, new requests will queue

**Alerting Threshold**: Alert if `idle_connections < 50%` of `total_connections` for > 5 minutes

**Calculation**: `idle_connections = total_connections - active_connections`

---

#### `active_connections` (integer, ≥ 0)

**Description**: Number of connections currently executing queries.

**Expected Values**:

- Typical: 0-5 during normal operation
- High load: Approaches `max_size`

**Interpretation**:

- `active_connections == max_size`: Pool fully utilized
- Sustained high active count: Consider increasing `max_size` or optimizing queries

**Alerting Threshold**: Alert if `active_connections == max_size` for > 10 minutes

---

#### `waiting_requests` (integer, ≥ 0)

**Description**: Number of acquisition requests queued waiting for available connections.

**Expected Values**:

- Normal: 0 (no queuing)
- Acceptable: 1-5 (brief queuing during burst load)
- Critical: > 10 (persistent queuing indicates capacity shortage)

**Interpretation**:

- `waiting_requests > 0`: Pool exhausted, requests queuing
- Sustained queuing: Increase `POOL_MAX_SIZE` or optimize query performance

**Alerting Threshold**: **CRITICAL**: Alert if `waiting_requests > 10` for > 2 minutes

---

#### `avg_acquisition_time_ms` (float, ≥ 0.0)

**Description**: Average time to acquire a connection from the pool (milliseconds).

**Expected Values**:

- Excellent: < 5ms
- Normal: 5-10ms
- Degraded: 10-50ms
- Critical: > 50ms

**Interpretation**:

- Low latency: Pool has spare capacity
- High latency: Pool under load or connections queuing

**Performance Target**: < 10ms (p95 per SC-002)

**Alerting Threshold**: Alert if p95 > 10ms for > 5 minutes

---

#### `peak_active_connections` (integer, ≥ 0)

**Description**: Maximum number of connections simultaneously active since server startup.

**Use Cases**:

- Capacity planning: Determine if `max_size` is sufficient
- Right-sizing: If `peak_active_connections` consistently < `max_size / 2`, consider reducing `max_size`
- Growth analysis: Track peak over time to anticipate capacity needs

**Interpretation**:

- `peak_active_connections == max_size`: Pool has hit capacity limit
- `peak_active_connections < max_size * 0.5`: Pool over-provisioned
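
As an illustration of the right-sizing guidance above, a small helper could turn the statistics into a hint. This is a sketch only: `stats` is assumed to be a `PoolStatistics` instance, `max_size` the configured pool maximum, and the hypothetical `right_sizing_hint` function is not part of the server.

```python
# Illustrative right-sizing heuristic based on the guidance above.
def right_sizing_hint(stats, max_size: int) -> str:
    if stats.peak_active_connections >= max_size:
        return "pool has hit its capacity limit; consider increasing POOL_MAX_SIZE"
    if stats.peak_active_connections < max_size * 0.5:
        return "pool appears over-provisioned; consider reducing POOL_MAX_SIZE"
    return "pool size looks appropriate for the observed peak load"
```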

---

#### `peak_wait_time_ms` (float, ≥ 0.0)

**Description**: Maximum time any request has waited for a connection (milliseconds).

**Expected Values**:

- Excellent: < 10ms (no significant queuing)
- Acceptable: 10-100ms (brief queuing)
- Degraded: 100-1000ms (noticeable queuing)
- Critical: > 1000ms (1+ seconds; severe capacity shortage)

**Interpretation**:

- Low peak wait: Pool adequately sized
- High peak wait: Pool exhaustion occurred, investigate cause

**Alerting Threshold**: Alert if `peak_wait_time_ms > 1000ms`

---

#### `total_acquisitions` / `total_releases` (integer, ≥ 0)

**Description**: Lifetime count of connection acquisitions and releases.

**Health Check**:

- `total_acquisitions == total_releases`: All connections properly returned
- `total_acquisitions > total_releases`: Some connections still in use (normal during load)
- `total_acquisitions >> total_releases`: **Potential connection leak detected**

**Leak Detection**:

```python
leak_count = total_acquisitions - total_releases - active_connections
if leak_count > 10:
    # CRITICAL: Connection leak suspected
    ...
```

**Alerting Threshold**: Alert if `leak_count > 10`

---

#### `pool_created_at` (datetime)

**Description**: Timestamp when the pool was initialized.

**Use Cases**:

- Calculate uptime: `uptime = now() - pool_created_at`
- Correlate pool lifecycle with application events

---

#### `last_health_check` (datetime)

**Description**: Timestamp of the last successful health check.

**Expected Behavior**:

- Updated on every health check request (typically every 10-60 seconds)
- If stale (> 5 minutes), the health check endpoint may not be functioning

---

#### `last_error` (string | null)

**Description**: Last error message encountered by the pool.

**Expected Values**:

- `null`: No recent errors
- String: Error message with context (e.g., "Connection validation failed")

**Common Error Messages**:

- `"Connection validation failed"`: Database query failed during validation
- `"Database connection refused"`: Database server unreachable
- `"Pool initialization timeout"`: Startup failed to complete within 2 seconds
- `"Connection acquisition timeout"`: Request waited > 30 seconds

---

#### `last_error_time` (datetime | null)

**Description**: Timestamp of the last error occurrence.

**Health Impact**:

- If `last_error_time` is within the last 60 seconds, pool status may be DEGRADED
- Older errors (> 60 seconds) do not affect health status

---

### Statistics API Example

Query pool statistics programmatically:

```python
# Internal API (within MCP server)
stats: PoolStatistics = pool_manager.get_statistics()

print(f"Total connections: {stats.total_connections}")
print(f"Idle connections: {stats.idle_connections}")
print(f"Active connections: {stats.active_connections}")
print(f"Waiting requests: {stats.waiting_requests}")
print(f"Avg acquisition time: {stats.avg_acquisition_time_ms:.2f}ms")
print(f"Peak active: {stats.peak_active_connections}")
```

**Performance**: Statistics retrieval has < 1ms staleness (SC-007), ensuring real-time data.

---

## MCP Resource Endpoint

The connection pool exposes health status via an MCP resource endpoint for external monitoring:

### Resource URI

```
health://connection-pool
```

### Request Example

```bash
# Using MCP client CLI
mcp-client resource read health://connection-pool

# Using curl (via MCP SSE transport)
curl -H "Accept: text/event-stream" \
  http://localhost:3000/sse \
  -d '{"method":"resources/read","params":{"uri":"health://connection-pool"}}'
```

### Response Schema

```json
{
  "status": "healthy",
  "timestamp": "2025-10-13T14:30:00.000Z",
  "database": {
    "status": "connected",
    "pool": {
      "total": 5,
      "idle": 4,
      "active": 1,
      "waiting": 0
    },
    "latency_ms": 2.3,
    "last_error": null
  },
  "uptime_seconds": 3600.5
}
```

### Health Check Performance

- **Target Latency**: < 10ms (p99 per SC-003)
- **Throughput**: > 1000 req/s (per SC-013)
- **No Database Query**: Health check uses cached pool state only (no `SELECT 1` query)
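
Because the health check reads only cached pool state, the payload above can be assembled without touching the database. The following sketch shows one way to do that; it is illustrative only, assumes the `PoolStatistics` fields from the reference above and a cached `latency_ms` value, and is not the server's actual resource handler.

```python
from datetime import datetime, timezone

# Illustrative only: builds the response shape above from cached statistics,
# without issuing any database query. Assumes pool_created_at is timezone-aware.
def build_health_payload(stats, status: str, latency_ms: float) -> dict:
    now = datetime.now(timezone.utc)
    return {
        "status": status,  # "healthy" | "degraded" | "unhealthy"
        "timestamp": now.isoformat(),
        "database": {
            "status": "connected" if stats.total_connections > 0 else "disconnected",
            "pool": {
                "total": stats.total_connections,
                "idle": stats.idle_connections,
                "active": stats.active_connections,
                "waiting": stats.waiting_requests,
            },
            "latency_ms": latency_ms,  # cached from the last validation query
            "last_error": stats.last_error,
        },
        "uptime_seconds": (now - stats.pool_created_at).total_seconds(),
    }
```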

### Integration with Load Balancers

Configure load balancer health checks:

```yaml
# Kubernetes liveness probe
livenessProbe:
  exec:
    command:
      - mcp-client
      - resource
      - read
      - health://connection-pool
  initialDelaySeconds: 5
  periodSeconds: 10
  timeoutSeconds: 2
  failureThreshold: 3

# Kubernetes readiness probe
readinessProbe:
  exec:
    command:
      - mcp-client
      - resource
      - read
      - health://connection-pool
  initialDelaySeconds: 3
  periodSeconds: 5
  timeoutSeconds: 2
  failureThreshold: 2
```

---

## Monitoring Integration

### Prometheus Metrics (Example)

If integrating with Prometheus, expose these metrics:

```python
# Example Prometheus metric definitions (not implemented by default)

# Gauge metrics
codebase_mcp_pool_connections_total = Gauge(
    "codebase_mcp_pool_connections_total",
    "Total number of connections in pool"
)
codebase_mcp_pool_connections_idle = Gauge(
    "codebase_mcp_pool_connections_idle",
    "Number of idle connections available"
)
codebase_mcp_pool_connections_active = Gauge(
    "codebase_mcp_pool_connections_active",
    "Number of active connections in use"
)
codebase_mcp_pool_waiting_requests = Gauge(
    "codebase_mcp_pool_waiting_requests",
    "Number of requests waiting for connections"
)

# Histogram metrics
codebase_mcp_pool_acquisition_duration_seconds = Histogram(
    "codebase_mcp_pool_acquisition_duration_seconds",
    "Time to acquire connection from pool",
    buckets=[0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0]
)

# Counter metrics
codebase_mcp_pool_acquisitions_total = Counter(
    "codebase_mcp_pool_acquisitions_total",
    "Total number of connection acquisitions"
)
codebase_mcp_pool_releases_total = Counter(
    "codebase_mcp_pool_releases_total",
    "Total number of connection releases"
)
codebase_mcp_pool_errors_total = Counter(
    "codebase_mcp_pool_errors_total",
    "Total number of pool errors",
    ["error_type"]
)

# Info metric
codebase_mcp_pool_health_status = Enum(
    "codebase_mcp_pool_health_status",
    "Connection pool health status",
    states=["healthy", "degraded", "unhealthy"]
)
```

### Exporter Script

```python
# scripts/prometheus_exporter.py
"""Prometheus exporter for connection pool metrics.

Scrapes pool statistics and exposes a /metrics endpoint.
"""
import asyncio

from prometheus_client import start_http_server, Gauge, Histogram, Counter, Enum

from connection_pool import ConnectionPoolManager

# Initialize metrics (definitions above)


async def update_metrics(pool_manager: ConnectionPoolManager):
    """Update Prometheus metrics from pool statistics."""
    while True:
        stats = pool_manager.get_statistics()
        health = await pool_manager.health_check()

        # Update gauges
        codebase_mcp_pool_connections_total.set(stats.total_connections)
        codebase_mcp_pool_connections_idle.set(stats.idle_connections)
        codebase_mcp_pool_connections_active.set(stats.active_connections)
        codebase_mcp_pool_waiting_requests.set(stats.waiting_requests)

        # Update health status
        codebase_mcp_pool_health_status.state(health.status)

        await asyncio.sleep(10)  # Scrape every 10 seconds


if __name__ == "__main__":
    # Start Prometheus HTTP server on port 9090
    start_http_server(9090)

    # Initialize pool manager and start metric updates
    pool_manager = ConnectionPoolManager()
    asyncio.run(update_metrics(pool_manager))
```

**Usage**:

```bash
# Start exporter
python scripts/prometheus_exporter.py
```

```yaml
# Prometheus scrape config
scrape_configs:
  - job_name: 'codebase-mcp-pool'
    static_configs:
      - targets: ['localhost:9090']
    scrape_interval: 15s
```

---

### Grafana Dashboard Example

Grafana dashboard queries for visualizing pool metrics:

#### Panel 1: Connection Pool Utilization

```promql
# Query 1: Total connections
codebase_mcp_pool_connections_total

# Query 2: Idle connections
codebase_mcp_pool_connections_idle

# Query 3: Active connections
codebase_mcp_pool_connections_active

# Panel type: Time series graph
# Y-axis: Connection count
# Legend: {{__name__}}
```

#### Panel 2: Connection Acquisition Latency

```promql
# p50 latency
histogram_quantile(0.50, rate(codebase_mcp_pool_acquisition_duration_seconds_bucket[5m]))

# p95 latency (target: < 10ms)
histogram_quantile(0.95, rate(codebase_mcp_pool_acquisition_duration_seconds_bucket[5m]))

# p99 latency
histogram_quantile(0.99, rate(codebase_mcp_pool_acquisition_duration_seconds_bucket[5m]))

# Panel type: Time series graph
# Y-axis: Latency (seconds)
# Threshold line at 0.01s (10ms target)
```

#### Panel 3: Pool Health Status

```promql
# Health status (0=healthy, 1=degraded, 2=unhealthy)
codebase_mcp_pool_health_status

# Panel type: Stat panel
# Color mode: Background
# Thresholds: 0=green, 1=yellow, 2=red
```

#### Panel 4: Waiting Requests (Queue Depth)

```promql
# Waiting requests count
codebase_mcp_pool_waiting_requests

# Panel type: Time series graph
# Y-axis: Request count
# Threshold line at 10 (critical threshold)
# Alert annotation: waiting_requests > 10
```

#### Panel 5: Acquisition Throughput

```promql
# Acquisitions per second
rate(codebase_mcp_pool_acquisitions_total[1m])

# Releases per second
rate(codebase_mcp_pool_releases_total[1m])

# Panel type: Time series graph
# Y-axis: Operations per second
```

#### Panel 6: Error Rate

```promql
# Errors per second by type
rate(codebase_mcp_pool_errors_total[5m])

# Panel type: Time series graph
# Y-axis: Errors per second
# Group by: error_type
```

**Full Dashboard JSON**: Available at `docs/grafana/connection-pool-dashboard.json` (to be created separately)

---

### Alerting Rules Example

Prometheus alerting rules for connection pool monitoring:

```yaml
# /etc/prometheus/rules/codebase-mcp-pool.yml
groups:
  - name: codebase_mcp_pool_alerts
    interval: 30s
    rules:
      # CRITICAL: Pool has no connections
      - alert: ConnectionPoolNoConnections
        expr: codebase_mcp_pool_connections_total == 0
        for: 30s
        labels:
          severity: critical
          component: connection_pool
        annotations:
          summary: "Connection pool has no available connections"
          description: "Pool has zero total connections for 30+ seconds. Database may be down or pool failed to initialize."
          runbook: "https://docs.example.com/runbooks/connection-pool-no-connections"

      # CRITICAL: Pool health unhealthy for 5 minutes
      - alert: ConnectionPoolUnhealthy
        expr: codebase_mcp_pool_health_status == 2
        for: 5m
        labels:
          severity: critical
          component: connection_pool
        annotations:
          summary: "Connection pool in UNHEALTHY state"
          description: "Pool health status has been UNHEALTHY for 5+ minutes. Automatic recovery may have failed."
          runbook: "https://docs.example.com/runbooks/connection-pool-unhealthy"

      # WARNING: Pool health degraded for 10 minutes
      - alert: ConnectionPoolDegraded
        expr: codebase_mcp_pool_health_status == 1
        for: 10m
        labels:
          severity: warning
          component: connection_pool
        annotations:
          summary: "Connection pool in DEGRADED state"
          description: "Pool health has been DEGRADED for 10+ minutes. May indicate persistent capacity issues."
          runbook: "https://docs.example.com/runbooks/connection-pool-degraded"

      # CRITICAL: Requests waiting for connections (pool exhaustion)
      - alert: ConnectionPoolExhausted
        expr: codebase_mcp_pool_waiting_requests > 10
        for: 2m
        labels:
          severity: critical
          component: connection_pool
        annotations:
          summary: "Connection pool exhausted - requests queuing"
          description: "{{ $value }} requests waiting for connections. Pool capacity insufficient for current load."
action: "Increase POOL_MAX_SIZE or investigate slow queries" # WARNING: High connection acquisition latency - alert: ConnectionPoolHighLatency expr: | histogram_quantile(0.95, rate(codebase_mcp_pool_acquisition_duration_seconds_bucket[5m])) > 0.01 for: 5m labels: severity: warning component: connection_pool annotations: summary: "Connection acquisition latency high (p95 > 10ms)" description: "p95 acquisition latency is {{ $value }}s (target: <0.01s)" action: "Investigate pool capacity or query performance" # WARNING: Potential connection leak - alert: ConnectionPoolLeak expr: | (codebase_mcp_pool_acquisitions_total - codebase_mcp_pool_releases_total - codebase_mcp_pool_connections_active) > 10 for: 10m labels: severity: warning component: connection_pool annotations: summary: "Potential connection leak detected" description: "{{ $value }} connections appear leaked (acquired but not released)" action: "Review application logs for connection leak warnings with stack traces" # CRITICAL: Pool fully utilized for extended period - alert: ConnectionPoolFullyUtilized expr: | codebase_mcp_pool_connections_active == codebase_mcp_pool_connections_total for: 10m labels: severity: critical component: connection_pool annotations: summary: "Connection pool fully utilized" description: "All {{ $value }} connections are active for 10+ minutes" action: "Increase POOL_MAX_SIZE or optimize query performance" # WARNING: Low idle capacity - alert: ConnectionPoolLowCapacity expr: | (codebase_mcp_pool_connections_idle / codebase_mcp_pool_connections_total) < 0.5 for: 5m labels: severity: warning component: connection_pool annotations: summary: "Connection pool running at low capacity" description: "Idle capacity is {{ $value | humanizePercentage }} (threshold: 50%)" action: "Consider increasing POOL_MAX_SIZE" ``` **Alert Routing**: ```yaml # /etc/alertmanager/config.yml route: group_by: ['alertname', 'component'] group_wait: 10s group_interval: 5m repeat_interval: 4h receiver: 'default' routes: - match: severity: critical component: connection_pool receiver: 'pagerduty-critical' continue: true - match: severity: warning component: connection_pool receiver: 'slack-warnings' receivers: - name: 'pagerduty-critical' pagerduty_configs: - service_key: '<pagerduty_service_key>' - name: 'slack-warnings' slack_configs: - api_url: '<slack_webhook_url>' channel: '#ops-warnings' ``` --- ## Troubleshooting Guide ### Issue: Slow Query Performance #### Symptoms - Connection acquisition latency > 10ms (p95) - `avg_acquisition_time_ms` increasing over time - Users report slow response times for database operations #### Causes 1. **Slow database queries**: Individual queries taking seconds to complete 2. **High query concurrency**: Too many simultaneous queries for pool size 3. **Database server overload**: CPU/memory/disk saturation on PostgreSQL 4. 
4. **Connection validation overhead**: Health checks taking too long

#### Resolution Steps

**Step 1: Identify slow queries**

```bash
# Check PostgreSQL slow query log
tail -f /var/log/postgresql/postgresql-14-main.log | grep "duration:"

# Query PostgreSQL statistics
psql -U postgres -d codebase_mcp -c "
SELECT query, calls, mean_exec_time, max_exec_time
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;
"
```

**Step 2: Review pool statistics**

```python
# Check if pool is under-provisioned
stats = pool_manager.get_statistics()
print(f"Active: {stats.active_connections}/{stats.total_connections}")
print(f"Waiting: {stats.waiting_requests}")
print(f"Peak active: {stats.peak_active_connections}")

# If peak_active_connections == max_size, pool may be too small
```

**Step 3: Optimize queries**

```sql
-- Add indexes to slow queries
CREATE INDEX CONCURRENTLY idx_chunks_embedding
ON code_chunks USING ivfflat (embedding vector_cosine_ops);

-- Analyze query plans
EXPLAIN ANALYZE <slow_query>;
```

**Step 4: Increase pool size (if needed)**

```bash
# Increase max pool size temporarily
export POOL_MAX_SIZE=20

# Restart MCP server
systemctl restart codebase-mcp

# Monitor acquisition latency improvement
```

**Step 5: Enable query logging**

```python
# Add query timing to application logs
import logging

logger = logging.getLogger("codebase_mcp")
logger.setLevel(logging.DEBUG)  # Log query durations
```

**Prevention**:

- Set query timeouts: `POOL_COMMAND_TIMEOUT=60` (default)
- Implement query result caching for frequently accessed data
- Use database connection pooling at application layer (already implemented)

---

### Issue: Connection Pool Exhaustion

#### Symptoms

- `waiting_requests > 0` consistently
- `active_connections == max_size` for extended periods
- Connection acquisition timeouts (30 second timeout errors)
- Alert: `ConnectionPoolExhausted` firing

#### Causes

1. **Insufficient pool capacity**: `max_size` too small for current load
2. **Connection leaks**: Connections not being returned to pool
3. **Long-running queries**: Queries holding connections for minutes
4. **Burst traffic**: Sudden spike in concurrent requests

#### Resolution Steps

**Step 1: Immediate mitigation**

```bash
# Increase pool size to handle burst
export POOL_MAX_SIZE=20  # Double current size
systemctl restart codebase-mcp

# Verify increased capacity
curl http://localhost:3000/health | jq '.database.pool'
# Expected: {"total": 20, "idle": 15, "active": 5, "waiting": 0}
```

**Step 2: Identify resource hogs**

```sql
-- Check for long-running queries
SELECT pid, now() - query_start AS duration, query
FROM pg_stat_activity
WHERE state = 'active'
  AND now() - query_start > interval '1 minute'
ORDER BY duration DESC;

-- Terminate hung queries (if necessary)
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'active'
  AND now() - query_start > interval '5 minutes';
```

**Step 3: Check for connection leaks**

```bash
# Review leak detection warnings in logs
grep "Potential connection leak" /tmp/codebase-mcp.log

# Expected output with stack trace:
# WARNING: Potential connection leak detected: conn_abc123
#   held_duration_seconds: 45.2
#   stack_trace: <acquisition call stack>
```

**Step 4: Analyze acquisition patterns**

```python
# Review peak usage metrics
stats = pool_manager.get_statistics()
print(f"Peak active: {stats.peak_active_connections}")
print(f"Total acquisitions: {stats.total_acquisitions}")
print(f"Avg acquisition time: {stats.avg_acquisition_time_ms:.2f}ms")

# If peak consistently hits max_size, capacity is insufficient
```

**Step 5: Capacity planning**

```bash
# Calculate required pool size based on load patterns
# Rule of thumb: max_size ≈ concurrent_users * avg_queries_per_request * (1 + buffer)
# For 100 concurrent users, 2 queries per request, and a 50% buffer:
#   max_size = 100 * 2 * 1.5 = 300 connections (adjust based on actual usage)

# More realistic for a local MCP server: 10-20 connections is sufficient
export POOL_MAX_SIZE=15
```

**Prevention**:

- Set `POOL_LEAK_DETECTION_TIMEOUT=30` to catch leaks early
- Implement query timeouts to prevent hung connections
- Monitor `waiting_requests` metric and alert on > 10
- Right-size pool based on actual load patterns

---

### Issue: Database Outage and Recovery

#### Symptoms

- Pool health status: `UNHEALTHY`
- `total_connections == 0`
- All database queries failing with `DATABASE_ERROR`
- Logs show: "Database connection refused" or "Connection reset"

#### Expected Behavior During Outage

**Automatic Reconnection Sequence**:

1. Pool detects connection failures
2. Health status transitions: `HEALTHY → DEGRADED → UNHEALTHY → RECOVERING`
3. Exponential backoff reconnection attempts:
   - Attempt 1: 1 second delay
   - Attempt 2: 2 seconds delay
   - Attempt 3: 4 seconds delay
   - Attempt 4: 8 seconds delay
   - Attempt 5: 16 seconds delay
   - Subsequent attempts: Every 16 seconds
4. Once the database recovers, the pool reinitializes connections
5. Health status transitions: `RECOVERING → HEALTHY`
6. Normal operations resume

**Recovery Timeline**:

- **Best case**: 1 second (immediate database availability)
- **Typical**: 10-30 seconds (transient network issue)
- **Worst case**: Up to 30 seconds (p95 per SC-004) after database recovers

#### Resolution Steps

**Step 1: Verify database status**

```bash
# Check if PostgreSQL is running
docker ps | grep postgres
# OR
systemctl status postgresql

# If not running, start database
docker start postgres-codebase-mcp
# OR
systemctl start postgresql
```

**Step 2: Verify network connectivity**

```bash
# Ping database host
ping <database_host>

# Check port connectivity
nc -zv <database_host> 5432

# Test connection with psql
psql -U postgres -h <database_host> -d codebase_mcp -c "SELECT 1"
```

**Step 3: Monitor automatic recovery**

```bash
# Watch MCP server logs for reconnection attempts
tail -f /tmp/codebase-mcp.log | grep -E "reconnect|recovering"

# Expected log sequence:
# INFO: Reconnection attempt 1/5 (delay: 1s)
# INFO: Reconnection attempt 2/5 (delay: 2s)
# INFO: Connection pool recovered: 2/10 connections available
# INFO: Health status: RECOVERING -> HEALTHY
```

**Step 4: Verify pool recovery**

```bash
# Check health endpoint
curl http://localhost:3000/health | jq '.database'

# Expected after recovery:
# {
#   "status": "connected",
#   "pool": {"total": 2, "idle": 2, "active": 0, "waiting": 0},
#   "latency_ms": 2.5,
#   "last_error": null
# }
```

**Step 5: Test database operations**

```bash
# Execute test query via MCP client
mcp-client tool call search_code --query "test"

# Should succeed with results (or empty results if database has no data)
```

**Manual Recovery (if automatic recovery fails)**:

```bash
# If automatic recovery doesn't work within 5 minutes:

# 1. Check database logs for errors
tail -f /var/log/postgresql/postgresql-14-main.log

# 2. Verify DATABASE_URL is correct
echo $DATABASE_URL
# OR
cat .env | grep DATABASE_URL

# 3. Restart MCP server to force reconnection
systemctl restart codebase-mcp

# 4. If issue persists, check database connectivity:
psql -U postgres -d codebase_mcp -c "SELECT version();"
```

**Prevention**:

- Monitor database server health proactively
- Set up database replication for high availability
- Configure alerts for database downtime
- Test recovery procedures regularly (e.g., monthly chaos testing)

---

### Issue: Connection Leaks

#### Symptoms

- `total_acquisitions >> total_releases + active_connections`
- Warning logs: "Potential connection leak detected"
- Pool gradually exhausting connections over time
- Increasing `active_connections` without corresponding releases

#### Causes

1. **Unclosed context managers**: Missing `async with pool.acquire()` pattern
2. **Exception handling gaps**: Connections not released in error paths
3. **Long-running operations**: Legitimate operations exceeding `leak_detection_timeout`
4. **Application bugs**: Logic errors preventing connection release

#### Resolution Steps

**Step 1: Identify leaking code**

```bash
# Review leak warnings with stack traces
grep -A 20 "Potential connection leak" /tmp/codebase-mcp.log

# Example output:
# WARNING: Potential connection leak detected: conn_abc123
#   connection_id: conn_abc123
#   held_duration_seconds: 45.2
#   stack_trace:
#     File "src/services/searcher.py", line 123, in search_code
#       conn = await pool.acquire()  # <-- Acquisition site
#     File "src/services/searcher.py", line 145, in execute_search
#       results = await conn.fetch(query)
```

**Step 2: Review code at acquisition sites**

```python
# INCORRECT: Connection not guaranteed to be released
async def bad_query():
    conn = await pool.acquire()
    result = await conn.fetchval("SELECT 1")
    # BUG: If exception occurs, connection never released
    await pool.release(conn)
    return result


# CORRECT: Use context manager
async def good_query():
    async with pool.acquire() as conn:
        result = await conn.fetchval("SELECT 1")
        # Connection automatically released even if exception occurs
        return result
```

**Step 3: Check for exception handling issues**

```python
# INCORRECT: Connection leaked in error path
async def bad_error_handling():
    conn = await pool.acquire()
    try:
        result = await conn.fetchval("SELECT 1")
    except Exception as e:
        # BUG: Return without releasing connection
        return None
    await pool.release(conn)
    return result


# CORRECT: Use context manager or try/finally
async def good_error_handling():
    async with pool.acquire() as conn:
        try:
            result = await conn.fetchval("SELECT 1")
        except Exception as e:
            logger.error(f"Query failed: {e}")
            return None
        return result
```

**Step 4: Verify leak detection configuration**

```bash
# Check leak detection settings
echo $POOL_LEAK_DETECTION_TIMEOUT   # Default: 30
echo $POOL_ENABLE_LEAK_DETECTION    # Default: true

# Adjust timeout for long-running operations
export POOL_LEAK_DETECTION_TIMEOUT=120  # 2 minutes for batch jobs
```

**Step 5: Monitor leak count**

```python
# Calculate current leak count
stats = pool_manager.get_statistics()
leak_count = (stats.total_acquisitions
              - stats.total_releases
              - stats.active_connections)
print(f"Suspected leaks: {leak_count}")

# If leak_count > 10, investigate immediately
```

**Step 6: Force connection recycling (temporary fix)**

```bash
# Restart server to clear leaked connections
systemctl restart codebase-mcp

# Or reduce max_connection_lifetime to force recycling
export POOL_MAX_CONNECTION_LIFETIME=600  # 10 minutes (default: 3600)
```

**Prevention**:

- **Always use context managers**: `async with pool.acquire() as conn:`
- Enable leak detection in production: `POOL_ENABLE_LEAK_DETECTION=true`
- Set appropriate timeout: `POOL_LEAK_DETECTION_TIMEOUT=30` (30 seconds)
- Review all connection acquisition sites during code reviews
- Run 24-hour leak test before production deployment:

  ```bash
  pytest tests/integration/test_leak_detection.py --duration=86400
  ```

---

### Issue: High Wait Times

#### Symptoms

- `peak_wait_time_ms > 100ms`
- Increased `avg_acquisition_time_ms`
- Intermittent connection acquisition delays
- Alert: `ConnectionPoolHighLatency` firing

#### Causes

1. **Pool too small**: `max_size` insufficient for concurrent load
2. **Slow queries**: Individual queries holding connections for seconds
3. **Burst traffic**: Sudden spike in requests
4. **Database performance**: Slow query execution on the PostgreSQL side

#### Resolution Steps

**Step 1: Analyze wait time patterns**

```python
# Review statistics
stats = pool_manager.get_statistics()
print(f"Avg acquisition: {stats.avg_acquisition_time_ms:.2f}ms")
print(f"Peak wait time: {stats.peak_wait_time_ms:.2f}ms")
print(f"Active connections: {stats.active_connections}/{stats.total_connections}")
print(f"Waiting requests: {stats.waiting_requests}")

# If waiting_requests > 0, pool is undersized
# If avg_acquisition_time > 10ms, investigate causes
```

**Step 2: Identify bottleneck**

```sql
-- Check for slow queries
SELECT query, mean_exec_time, calls
FROM pg_stat_statements
WHERE mean_exec_time > 100  -- > 100ms average
ORDER BY mean_exec_time DESC
LIMIT 10;

-- Check database load
SELECT * FROM pg_stat_activity;
```

**Step 3: Increase pool capacity (if needed)**

```bash
# Temporarily increase max_size
export POOL_MAX_SIZE=20
systemctl restart codebase-mcp

# Monitor wait time improvement
curl http://localhost:3000/health | jq '.database.pool'
```

**Step 4: Optimize slow queries**

```sql
-- Add missing indexes
CREATE INDEX CONCURRENTLY idx_chunks_repo ON code_chunks(repository_id);

-- Vacuum database
VACUUM ANALYZE code_chunks;

-- Update statistics
ANALYZE;
```

**Step 5: Implement query timeout**

```bash
# Set command timeout to prevent hung queries
export POOL_COMMAND_TIMEOUT=60  # 60 seconds max per query
systemctl restart codebase-mcp
```

**Prevention**:

- Monitor `peak_wait_time_ms` and alert on > 100ms
- Right-size pool based on actual concurrency patterns
- Implement query result caching for frequently accessed data
- Set `POOL_COMMAND_TIMEOUT` to prevent queries from holding connections indefinitely
- Optimize database indexes and query plans

---

## Performance Baselines

Reference expected latencies and throughput from quickstart.md integration tests:

### Initialization Performance

- **Target**: Pool initialization < 2 seconds (p95)
- **Typical**: 300-500ms
- **Includes**: Creating `min_size` connections, validation queries

### Connection Acquisition Performance

- **Target**: < 10ms (p95)
- **Typical**: 1-3ms (when connections available)
- **Typical (under load)**: 5-10ms (when scaling up pool)

### Health Check Performance

- **Target**: < 10ms (p99)
- **Typical**: < 1ms (cached state, no database query)
- **Throughput**: > 1000 requests/second

### Reconnection Performance

- **Target**: < 30 seconds (p95) after database recovery
- **Typical**: 10-20 seconds
- **Includes**: Exponential backoff delays, connection creation, validation

### Graceful Shutdown Performance

- **Target**: < 30 seconds (p99)
- **Typical**: 0-5 seconds (if no active queries)
- **Typical (with active queries)**: 5-30 seconds (waits for completion)

### Concurrent Load Performance

- **Target**: 100 concurrent requests without deadlock
- **Pool size**: max_size=10 (default)
- **Behavior**: Requests queue when pool exhausted, FIFO order (see the load-test sketch after this section)

### Memory Consumption

- **Target**: < 100MB for max_size=10 pool
- **Typical**: 20-50MB (depends on connection count and query buffers)

### Startup Overhead

- **Target**: < 200ms added to server startup time
- **Typical**: 50-100ms (pool initialization)
- **Includes**: Configuration validation, pool creation
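
The concurrent-load target can be sanity-checked with a throwaway script along the following lines. This is an illustrative sketch only: it assumes `asyncpg` is installed and `DATABASE_URL` is set, and it exercises a standalone pool created with the same `min_size`/`max_size` defaults rather than the MCP server's internal pool.

```python
# Hypothetical load-test sketch: 100 concurrent acquisitions against a
# small pool should queue and complete without deadlock.
import asyncio
import os
import time

import asyncpg


async def one_request(pool: asyncpg.Pool) -> None:
    async with pool.acquire() as conn:  # queues when the pool is exhausted
        await conn.fetchval("SELECT 1")


async def main() -> None:
    pool = await asyncpg.create_pool(os.environ["DATABASE_URL"], min_size=2, max_size=10)
    start = time.perf_counter()
    await asyncio.gather(*(one_request(pool) for _ in range(100)))
    print(f"100 concurrent requests completed in {time.perf_counter() - start:.2f}s")
    await pool.close()


if __name__ == "__main__":
    asyncio.run(main())
```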
10, "database": "codebase_mcp" } } ``` ### Key Log Messages #### Initialization ``` INFO: Connection pool initialized: min_size=2, max_size=10 INFO: Created 2 initial connections in 345ms ``` #### Reconnection ``` WARNING: Database connection lost: Connection refused INFO: Reconnection attempt 1/5 (delay: 1s) INFO: Reconnection attempt 2/5 (delay: 2s) INFO: Connection pool recovered: 2/10 connections available ``` #### Leak Detection ``` WARNING: Potential connection leak detected: conn_abc123 connection_id: conn_abc123 held_duration_seconds: 45.2 stack_trace: <acquisition call stack> ``` #### Graceful Shutdown ``` INFO: Graceful shutdown initiated INFO: Waiting for 3 active connections to complete (timeout: 30s) INFO: All connections closed gracefully in 5.2s INFO: Connection pool terminated ``` ### Log Level Configuration ```bash # Set log level via environment variable export LOG_LEVEL=DEBUG # DEBUG, INFO, WARNING, ERROR, CRITICAL ``` --- ## Summary This guide provides comprehensive monitoring and troubleshooting documentation for the Codebase MCP Server connection pool. Key takeaways: 1. **Three-tier health status** (HEALTHY/DEGRADED/UNHEALTHY) provides clear operational signals 2. **Real-time statistics** enable capacity planning and performance optimization 3. **MCP resource endpoint** (`health://connection-pool`) integrates with external monitoring 4. **Prometheus/Grafana integration** enables visualization and alerting 5. **Troubleshooting procedures** cover common issues with actionable resolution steps 6. **Performance baselines** define expected latencies and throughput **Next Steps**: - Set up Prometheus scraping for pool metrics - Configure Grafana dashboards for visualization - Implement alerting rules for critical issues - Test recovery procedures in staging environment - Document team-specific runbooks based on this guide **Support Resources**: - Quickstart integration tests: `specs/009-v2-connection-mgmt/quickstart.md` - Data model reference: `specs/009-v2-connection-mgmt/data-model.md` - Feature specification: `specs/009-v2-connection-mgmt/spec.md` - Implementation plan: `specs/009-v2-connection-mgmt/plan.md` **Word Count**: ~8,500 words
