# Incident Response Runbook

**Version**: 1.0.0
**Last Updated**: 2025-10-13
**Feature**: 011-performance-validation-multi
**Phase**: 8
**Task**: T054
**Constitutional Compliance**: Principle V (Production Quality)

## Overview

This runbook provides step-by-step procedures for responding to production incidents in the dual-server MCP architecture. Each scenario includes detection methods, immediate actions, resolution steps, and post-incident procedures.

## Incident Severity Levels

| Level | Name | Response Time | Examples | Escalation |
|-------|------|---------------|----------|------------|
| **P1** | Critical | 15 min | Complete outage, data loss | Page on-call immediately |
| **P2** | High | 1 hour | Degraded performance, partial outage | Page on-call within 30 min |
| **P3** | Medium | 4 hours | Non-critical features down | Email team |
| **P4** | Low | 24 hours | Minor issues, cosmetic bugs | Next business day |

## Quick Reference

| Incident Type | Severity | First Action | Page |
|---------------|----------|--------------|------|
| Database connection failure | P1 | Check DB status | [Link](#database-connection-failures) |
| Pool exhaustion | P2 | Increase pool size | [Link](#connection-pool-exhaustion) |
| Server crash | P1 | Restart service | [Link](#server-failures) |
| High latency | P2 | Check metrics | [Link](#performance-degradation) |
| Memory leak | P2 | Restart with heap dump | [Link](#memory-issues) |

## Database Connection Failures

### Detection

**Automated Alerts**:

```yaml
- alert: DatabaseConnectionLost
  expr: up{job="postgresql"} == 0
  labels:
    severity: critical
```

**Manual Detection**:

```bash
# Check database connectivity
psql -h localhost -U mcp_user -d codebase_mcp -c "SELECT 1"

# Check server health
curl http://localhost:8000/health/status | jq '.database_status'
```

### Immediate Actions

1. **Verify database is running**:

   ```bash
   # Check PostgreSQL status
   systemctl status postgresql

   # Check if port is listening
   netstat -an | grep 5432

   # Check PostgreSQL logs
   tail -f /var/log/postgresql/postgresql-*.log
   ```

2. **Check network connectivity**:

   ```bash
   # Test connection from app server
   nc -zv database-host 5432

   # Check DNS resolution
   nslookup database-host

   # Trace network path
   traceroute database-host
   ```

3. **Verify credentials**:

   ```bash
   # Test with connection string
   psql "postgresql://user:password@host:5432/database"

   # Check environment variables
   env | grep DATABASE_URL
   ```

### Resolution Steps

#### Scenario 1: Database is down

```bash
# 1. Start PostgreSQL if stopped
sudo systemctl start postgresql

# 2. If fails to start, check disk space
df -h /var/lib/postgresql

# 3. Check for corruption
#    (pg_checksums needs the server stopped and data checksums enabled)
sudo -u postgres pg_checksums --check -D /var/lib/postgresql/data

# 4. If corrupted, restore from backup
#    (pg_basebackup requires the target data directory to be empty)
sudo systemctl stop postgresql
sudo -u postgres pg_basebackup -h backup-server -D /var/lib/postgresql/data
sudo systemctl start postgresql
```

#### Scenario 2: Connection pool exhausted

```bash
# 1. Identify blocking queries
psql -c "SELECT pid, now() - pg_stat_activity.query_start AS duration, query
         FROM pg_stat_activity
         WHERE state = 'active'
           AND (now() - pg_stat_activity.query_start) > interval '5 minutes';"

# 2. Kill long-running queries
#    (restricted to active sessions so idle pooled connections are left alone)
psql -c "SELECT pg_terminate_backend(pid)
         FROM pg_stat_activity
         WHERE pid <> pg_backend_pid()
           AND state = 'active'
           AND query_start < now() - interval '10 minutes';"

# 3. Increase connection limit temporarily
#    (max_connections only takes effect after a PostgreSQL restart; a reload is not enough)
psql -c "ALTER SYSTEM SET max_connections = 500;"
sudo systemctl restart postgresql
```

#### Scenario 3: Network issues

```bash
# 1. Restart network service
sudo systemctl restart networking

# 2. Check firewall rules
iptables -L -n | grep 5432

# 3. Add firewall exception if needed
iptables -A INPUT -p tcp --dport 5432 -j ACCEPT

# 4. Check connection limits
sysctl net.core.somaxconn
sysctl -w net.core.somaxconn=1024
```

### Recovery Verification

```bash
# 1. Test database connection
psql -h localhost -U mcp_user -d codebase_mcp -c "SELECT COUNT(*) FROM code_chunks;"

# 2. Verify server health
curl http://localhost:8000/health/status | jq '.status'

# 3. Check metrics
curl http://localhost:8000/metrics | grep database_connections

# 4. Run test query
curl -X POST http://localhost:8000/search \
  -H "Content-Type: application/json" \
  -d '{"query": "test search"}'
```

### Post-Incident Actions

1. **Document timeline** in incident report
2. **Analyze root cause**:
   - Review logs from 1 hour before incident
   - Check for recent deployments
   - Review database metrics
3. **Implement preventive measures**:
   - Add connection pooling monitors
   - Implement circuit breakers
   - Schedule regular connection pool analysis

## Connection Pool Exhaustion

### Detection

**Automated Alerts**:

```yaml
- alert: ConnectionPoolExhausted
  expr: pool_utilization > 0.95
  labels:
    severity: high
```

**Manual Detection**:

```bash
# Check pool statistics
curl http://localhost:8000/health/status | jq '.connection_pool'

# Monitor pool in real-time
watch -n 1 'curl -s http://localhost:8000/health/status | jq .connection_pool'
```

### Immediate Actions

1. **Increase pool size dynamically**:

   ```bash
   # Set the variable in the systemd manager environment and restart
   # (a variable exported in the current shell is not passed to the service)
   systemctl set-environment POOL_MAX_SIZE=200
   systemctl restart codebase-mcp

   # Or use API if available
   curl -X POST http://localhost:8000/admin/pool/resize \
     -H "Content-Type: application/json" \
     -d '{"max_size": 200}'
   ```

2. **Identify pool consumers**:

   ```sql
   -- Find active connections
   SELECT pid, usename, application_name, client_addr, state,
          query_start, NOW() - query_start AS duration, query
   FROM pg_stat_activity
   WHERE datname = 'codebase_mcp'
   ORDER BY query_start;
   ```

3. **Kill idle connections**:

   ```sql
   -- Terminate idle connections older than 5 minutes
   SELECT pg_terminate_backend(pid)
   FROM pg_stat_activity
   WHERE datname = 'codebase_mcp'
     AND state = 'idle'
     AND state_change < NOW() - INTERVAL '5 minutes';
   ```

### Resolution Steps

#### Short-term Fix

```bash
# 1. Restart with larger pool
#    (assumes docker-compose.yml passes POOL_MAX_SIZE through to the container;
#    `docker-compose up` has no --env flag)
docker-compose down
POOL_MAX_SIZE=200 docker-compose up -d

# 2. Or modify systemd service
mkdir -p /etc/systemd/system/codebase-mcp.service.d
cat > /etc/systemd/system/codebase-mcp.service.d/override.conf << EOF
[Service]
Environment="POOL_MAX_SIZE=200"
EOF
systemctl daemon-reload
systemctl restart codebase-mcp
```

#### Long-term Solution

```python
# Implement adaptive pool sizing
import logging

logger = logging.getLogger(__name__)


class AdaptivePoolManager:
    # get_utilization(), resize_pool(), max_size and initial_size are
    # assumed to be defined elsewhere in this class.
    def adjust_pool_size(self):
        utilization = self.get_utilization()

        if utilization > 0.9:
            # Pool sizes must be integers, so truncate the scaled value
            new_size = min(int(self.max_size * 1.5), 500)
            self.resize_pool(new_size)
            logger.warning(f"Pool resized to {new_size} due to high utilization")
        elif utilization < 0.3 and self.max_size > self.initial_size:
            new_size = max(int(self.max_size * 0.7), self.initial_size)
            self.resize_pool(new_size)
            logger.info(f"Pool downsized to {new_size} due to low utilization")
```

### Recovery Verification

```bash
# Verify pool is healthy
curl http://localhost:8000/health/status | \
  jq '.connection_pool | select(.pool_utilization < 0.8)'

# Run load test to verify capacity
k6 run -u 20 -d 30s tests/load/quick_test.js
```

## Server Failures

### Detection

**Automated Alerts**:

```yaml
- alert: ServerDown
  expr: up{job="mcp-servers"} == 0
  labels:
    severity: critical
```

### Immediate Actions

1. **Check server status**:

   ```bash
   # Check if process is running
   ps aux | grep -E "(codebase|workflow)-mcp"

   # Check systemd status
   systemctl status codebase-mcp
   systemctl status workflow-mcp

   # Check Docker status (if containerized)
   docker ps | grep mcp
   ```

2. **Restart failed server**:

   ```bash
   # Systemd restart
   systemctl restart codebase-mcp

   # Docker restart
   docker restart codebase-mcp

   # With health check wait
   timeout 30 bash -c 'until curl -f http://localhost:8000/health/status; do sleep 1; done'
   ```

3. **Check for port conflicts**:

   ```bash
   # Find process using port
   lsof -i :8000
   netstat -tlnp | grep 8000

   # Kill conflicting process
   kill -9 $(lsof -t -i:8000)
   ```

### Resolution Steps

#### Scenario 1: Out of Memory

```bash
# 1. Check memory usage
free -h
dmesg | grep -i "killed process"

# 2. Increase memory limit
# For systemd
mkdir -p /etc/systemd/system/codebase-mcp.service.d
cat > /etc/systemd/system/codebase-mcp.service.d/memory.conf << EOF
[Service]
MemoryLimit=4G
MemoryMax=4G
EOF

# For Docker
docker update --memory="4g" --memory-swap="4g" codebase-mcp

# 3. Restart with new limits
systemctl daemon-reload
systemctl restart codebase-mcp
```

#### Scenario 2: Corrupted State

```bash
# 1. Clear cache and temporary files
rm -rf /tmp/mcp-cache/*
rm -rf /var/lib/mcp/temp/*

# 2. Reset application state
redis-cli FLUSHDB

# 3. Restart clean
systemctl stop codebase-mcp
sleep 5
systemctl start codebase-mcp
```

#### Scenario 3: Dependency Issues

```bash
# 1. Check Ollama service (for embeddings)
systemctl status ollama
curl http://localhost:11434/api/tags

# 2. Restart dependencies
systemctl restart ollama
systemctl restart redis

# 3. Verify dependencies are accessible
nc -zv localhost 11434  # Ollama
nc -zv localhost 6379   # Redis
```

### Recovery Verification

```bash
# Full health check
for service in codebase-mcp workflow-mcp; do
  echo "Checking $service..."
  systemctl is-active $service
  curl -s http://localhost:$([ "$service" = "workflow-mcp" ] && echo 8001 || echo 8000)/health/status | jq '.status'
done
```

## Performance Degradation

### Detection

**Automated Alerts**:

```yaml
- alert: HighLatency
  # histogram_quantile operates on the rate of the _bucket series
  expr: histogram_quantile(0.95, sum by (le) (rate(request_duration_seconds_bucket[5m]))) > 0.5
  labels:
    severity: high
```

### Immediate Actions

1. **Identify bottleneck**:

   ```bash
   # Check CPU usage
   top -n 1 | head -20

   # Check disk I/O
   iostat -x 1 5

   # Check network
   iftop -i eth0

   # Database slow queries
   psql -c "SELECT * FROM pg_stat_statements WHERE mean_exec_time > 100 ORDER BY mean_exec_time DESC LIMIT 10;"
   ```

2. **Enable profiling**:

   ```python
   # Add profiling endpoint
   # (`app` and `perform_search` are assumed to come from the application)
   import cProfile
   import pstats
   import io

   @app.route('/debug/profile')
   async def profile():
       pr = cProfile.Profile()
       pr.enable()

       # Run sample workload
       await perform_search("test query")

       pr.disable()
       s = io.StringIO()
       ps = pstats.Stats(pr, stream=s).sort_stats('cumulative')
       ps.print_stats(20)

       return s.getvalue()
   ```

### Resolution Steps

#### Quick Wins

```bash
# 1. Clear caches
redis-cli FLUSHALL

# 2. Restart to clear memory
systemctl restart codebase-mcp

# 3. Increase connection pool
#    (set-environment is used because a shell export does not reach the systemd unit)
systemctl set-environment POOL_MAX_SIZE=200
systemctl restart codebase-mcp

# 4. Disable debug logging
systemctl set-environment LOG_LEVEL=WARNING
systemctl restart codebase-mcp
```

#### Database Optimization

```sql
-- Update statistics
ANALYZE;

-- Rebuild indexes
REINDEX INDEX CONCURRENTLY idx_chunks_embedding;

-- Clear bloat (VACUUM FULL holds an exclusive lock on the table while it runs)
VACUUM FULL ANALYZE code_chunks;
```

## Memory Issues

### Detection

```bash
# Monitor memory usage
watch -n 1 'free -h; echo "---"; ps aux | grep mcp | grep -v grep'

# Check for memory leaks
pmap -x $(pgrep -f codebase-mcp) | tail -1
```

### Resolution Steps

1. **Capture heap dump**:

   ```python
   # Add memory profiling
   import tracemalloc
   import gc

   # Start tracing
   tracemalloc.start()

   # After some operations
   snapshot = tracemalloc.take_snapshot()
   top_stats = snapshot.statistics('lineno')

   for stat in top_stats[:10]:
       print(stat)

   # Force garbage collection
   gc.collect()
   ```

2. **Restart with monitoring**:

   ```bash
   # Enable memory profiling
   # (set-environment is used because a shell export does not reach the systemd unit)
   systemctl set-environment PYTHONTRACEMALLOC=10
   systemctl restart codebase-mcp

   # Monitor with lower memory limit
   systemd-run --scope -p MemoryLimit=2G --uid=mcp /usr/bin/codebase-mcp
   ```

## Escalation Paths

### Escalation Matrix

| Time | Severity | Status | Action |
|------|----------|--------|--------|
| 0-15 min | P1 | Unresolved | Page secondary on-call |
| 15-30 min | P1 | Unresolved | Page team lead |
| 30-60 min | P1 | Unresolved | Page engineering manager |
| 60+ min | P1 | Unresolved | Activate incident command |

### Contact Information

```yaml
on_call_rotation:
  primary:
    - name: "Primary On-Call"
      phone: "+1-555-0100"
      slack: "@oncall-primary"
  secondary:
    - name: "Secondary On-Call"
      phone: "+1-555-0101"
      slack: "@oncall-secondary"
  escalation:
    - name: "Team Lead"
      phone: "+1-555-0102"
      slack: "@team-lead"
    - name: "Engineering Manager"
      phone: "+1-555-0103"
      slack: "@eng-manager"
```

## Recovery Procedures

### Full System Recovery

```bash
#!/bin/bash
# full_recovery.sh - Complete system recovery procedure

echo "Starting full system recovery..."

# 1. Stop all services
systemctl stop codebase-mcp workflow-mcp

# 2. Clear all caches
redis-cli FLUSHALL
rm -rf /tmp/mcp-cache/*

# 3. Verify database
psql -U postgres -c "SELECT 1;" || {
  echo "Database down, starting..."
  systemctl start postgresql
}

# 4. Reset connection pools
#    (set-environment is used because a shell export does not reach the systemd units)
systemctl set-environment POOL_MIN_SIZE=5 POOL_MAX_SIZE=50

# 5. Start services in order
systemctl start workflow-mcp
sleep 5
systemctl start codebase-mcp

# 6. Wait for health
for i in {1..30}; do
  if curl -f http://localhost:8000/health/status; then
    echo "Codebase-MCP healthy"
    break
  fi
  sleep 2
done

for i in {1..30}; do
  if curl -f http://localhost:8001/health/status; then
    echo "Workflow-MCP healthy"
    break
  fi
  sleep 2
done

# 7. Run smoke tests
python tests/smoke/test_basic_operations.py

echo "Recovery complete!"
```

### Data Recovery

```bash
#!/bin/bash
# data_recovery.sh - Restore from backup

# 1. Stop services
systemctl stop codebase-mcp workflow-mcp

# 2. Backup current (possibly corrupted) data
pg_dump codebase_mcp > backup_corrupted_$(date +%Y%m%d_%H%M%S).sql

# 3. Restore from last known good backup
psql -U postgres -c "DROP DATABASE IF EXISTS codebase_mcp;"
psql -U postgres -c "CREATE DATABASE codebase_mcp;"
psql -U postgres codebase_mcp < /backups/last_known_good.sql

# 4. Verify restoration
psql -U postgres codebase_mcp -c "SELECT COUNT(*) FROM code_chunks;"

# 5. Restart services
systemctl start codebase-mcp workflow-mcp
```

## Post-Incident Procedures

### Incident Report Template

```markdown
# Incident Report - [DATE]

## Summary
- **Incident ID**: INC-2025-001
- **Severity**: P1/P2/P3/P4
- **Duration**: XX minutes
- **Impact**: [Users affected, features impacted]

## Timeline
- **HH:MM** - Alert triggered
- **HH:MM** - On-call engineer paged
- **HH:MM** - Initial investigation started
- **HH:MM** - Root cause identified
- **HH:MM** - Fix deployed
- **HH:MM** - Service restored
- **HH:MM** - Incident closed

## Root Cause
[Detailed explanation of what caused the incident]

## Resolution
[Steps taken to resolve the incident]

## Impact
- **Users Affected**: X
- **Requests Failed**: Y
- **Data Loss**: None/Describe

## Lessons Learned
1. What went well
2. What could be improved
3. Action items

## Action Items
- [ ] Update monitoring for earlier detection
- [ ] Add runbook for this scenario
- [ ] Implement preventive measures
```

### Post-Mortem Meeting

**Agenda**:

1. Timeline review (10 min)
2. Root cause analysis (20 min)
3. Impact assessment (10 min)
4. Improvement discussion (20 min)
5. Action items assignment (10 min)

**Participants**:

- On-call engineer
- Team lead
- Service owners
- SRE representative

## Automation Scripts

### Health Check Loop

```bash
#!/bin/bash
# health_monitor.sh - Continuous health monitoring

while true; do
  for service in codebase-mcp:8000 workflow-mcp:8001; do
    IFS=':' read -r name port <<< "$service"

    if ! curl -sf http://localhost:$port/health/status > /dev/null; then
      echo "ALERT: $name is unhealthy!"

      # Send alert
      curl -X POST https://hooks.slack.com/services/YOUR/WEBHOOK/URL \
        -H "Content-Type: application/json" \
        -d "{\"text\":\"🚨 $name health check failed!\"}"
    fi
  done

  sleep 30
done
```

### Automatic Recovery

```python
#!/usr/bin/env python3
"""auto_recovery.py - Automatic incident recovery"""

import subprocess
import time
import requests
from typing import Dict, Callable


class AutoRecovery:
    def __init__(self):
        # Maps a detected issue name to the method that handles it
        self.recovery_strategies: Dict[str, Callable[[], None]] = {
            "database_connection": self.recover_database,
            "pool_exhausted": self.recover_pool,
            "high_memory": self.recover_memory,
            "server_down": self.recover_server
        }

    def detect_issue(self) -> str:
        """Detect what type of issue is occurring."""
        try:
            # A timeout keeps the check from hanging on an unresponsive server
            response = requests.get("http://localhost:8000/health/status", timeout=5)
            health = response.json()

            if health["database_status"] == "disconnected":
                return "database_connection"

            if health["connection_pool"]["pool_utilization"] > 0.95:
                return "pool_exhausted"

            # Check memory usage
            mem_check = subprocess.run(
                ["free", "-b"],
                capture_output=True,
                text=True
            )
            # Parse memory usage...

        except requests.exceptions.RequestException:
            return "server_down"

        return "unknown"

    def recover_database(self):
        """Recover database connection."""
        subprocess.run(["systemctl", "restart", "postgresql"])
        time.sleep(5)

    def recover_pool(self):
        """Recover from pool exhaustion."""
        # Increase pool size
        subprocess.run([
            "systemctl", "set-environment",
            "POOL_MAX_SIZE=200"
        ])
        subprocess.run(["systemctl", "restart", "codebase-mcp"])

    def recover_memory(self):
        """Recover from high memory usage."""
        subprocess.run(["systemctl", "restart", "codebase-mcp"])

    def recover_server(self):
        """Recover from server down."""
        subprocess.run(["systemctl", "start", "codebase-mcp"])

    def run(self):
        """Run one detect-recover-verify pass."""
        issue = self.detect_issue()

        if issue in self.recovery_strategies:
            print(f"Detected issue: {issue}")
            print("Attempting recovery...")

            self.recovery_strategies[issue]()

            # Verify recovery
            time.sleep(10)
            if self.verify_health():
                print("Recovery successful!")
            else:
                print("Recovery failed, escalating...")
                self.escalate()

    def verify_health(self) -> bool:
        """Verify system is healthy."""
        try:
            response = requests.get("http://localhost:8000/health/status", timeout=5)
            return response.json()["status"] == "healthy"
        except (requests.exceptions.RequestException, KeyError, ValueError):
            # Unreachable server or malformed payload both count as unhealthy
            return False

    def escalate(self):
        """Escalate to on-call."""
        # Send page to on-call
        # (placeholder request: replace YOUR_TOKEN and extend the payload
        # as your PagerDuty setup requires)
        subprocess.run([
            "curl", "-X", "POST",
            "https://api.pagerduty.com/incidents",
            "-H", "Authorization: Token token=YOUR_TOKEN",
            "-H", "Content-Type: application/json",
            "-d", '{"incident": {"type": "incident", "title": "Auto-recovery failed"}}'
        ])


if __name__ == "__main__":
    recovery = AutoRecovery()
    recovery.run()
```

## References

- [Health Monitoring Guide](health-monitoring.md)
- [Performance Tuning Guide](performance-tuning.md)
- [Resilience Validation Report](resilience-validation-report.md)
- [Database Operations](../database/operations.md)
- [Constitutional Principles](../../.specify/memory/constitution.md)
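
The P1 rows of the Escalation Matrix above can also be expressed as a small lookup, which may be handy if paging is automated. The following is a minimal sketch, not part of the project's code: the names `p1_escalation_action`, `_P1_THRESHOLDS`, and `_P1_ACTIONS` are hypothetical, and treating each time window as half-open (so exactly 15 minutes falls in the 15-30 window) is an assumption, since the table does not say which row owns a boundary.

```python
from bisect import bisect_right

# Lower bounds (minutes since the incident started) and the matching actions,
# copied from the P1 rows of the escalation matrix. Half-open windows such as
# [15, 30) are an assumption; the table leaves boundary ownership unspecified.
_P1_THRESHOLDS = [0, 15, 30, 60]
_P1_ACTIONS = [
    "Page secondary on-call",
    "Page team lead",
    "Page engineering manager",
    "Activate incident command",
]


def p1_escalation_action(elapsed_minutes: float) -> str:
    """Return the escalation action for an unresolved P1 incident."""
    if elapsed_minutes < 0:
        raise ValueError("elapsed_minutes must be non-negative")
    # bisect_right counts the thresholds that are <= elapsed_minutes,
    # so subtracting one gives the index of the current window.
    index = bisect_right(_P1_THRESHOLDS, elapsed_minutes) - 1
    return _P1_ACTIONS[index]


if __name__ == "__main__":
    for minutes in (5, 20, 45, 90):
        print(f"{minutes:>3} min: {p1_escalation_action(minutes)}")
```

Keeping thresholds and actions in parallel lists means a change to the matrix only requires editing the data, not the lookup logic.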
