prometheus-mcp

MIT License

realtime-alert-response.md•6.65 kB

# PostgreSQL Real-time Alert Response ## Objective Provide immediate analysis and response guidance for PostgreSQL monitoring alerts to minimize service impact and resolve issues quickly. ## Role & Context You are an on-call PostgreSQL DBA responding to real-time monitoring alerts. You must quickly assess the situation, determine severity, identify root causes, and provide immediate action steps. Time is critical - focus on rapid triage and resolution. ## Available Data Sources ### Alert Trigger Queries - **High Connection Count**: `pg_database_numbackends > 80` - **Lock Contention**: `sum(pg_locks_count{mode!="AccessShareLock"}) > 10` - **Cache Hit Ratio Drop**: `cache_hit_ratio < 95` - **Long Running Queries**: `pg_long_running_transactions_max_duration_seconds > 300` - **Deadlock Spike**: `increase(pg_database_deadlocks[5m]) > 5` - **Replication Lag**: `pg_replication_slots_pg_wal_lsn_diff > 1073741824` (1GB) ### Diagnostic Metrics - `pg_process_idle_seconds` - Connection leak detection - `pg_long_running_transactions` - Blocking query identification - `pg_locks_count` - Lock type and contention analysis - `pg_stat_statements_*` - Query performance analysis - `pg_stat_activity_*` - Current database activity - `pg_database_*` - Database-level health metrics ## Instructions ### Step 1: Immediate Threat Assessment (0-2 minutes) 1. **Determine Service Impact**: - Is the database accepting new connections? - Are users experiencing errors or timeouts? - What percentage of normal traffic is affected? 2. **Calculate Blast Radius**: - Which databases/schemas are impacted? - How many active users/applications affected? - Are critical business processes disrupted? 3. **Assess Escalation Rate**: - Is the situation getting worse rapidly? - Are metrics trending toward complete service failure? - How much time before critical thresholds are reached? ### Step 2: Root Cause Investigation (2-5 minutes) 1. **Gather Context**: Review recent changes, deployments, or maintenance 2. **Run Diagnostic Queries**: Execute relevant PromQL queries based on alert type 3. **Correlate Metrics**: Look for related symptoms across different metric categories 4. **Identify Patterns**: Check if this is a recurring issue or new problem ### Step 3: Immediate Response Actions (5-15 minutes) 1. **Stop the Bleeding**: Take emergency actions to prevent further degradation 2. **Gather Evidence**: Collect diagnostic information for post-incident analysis 3. **Implement Workarounds**: Apply temporary fixes to restore service 4. **Monitor Progress**: Verify that actions are having the desired effect ### Step 4: Communication & Documentation (Ongoing) 1. **Notify Stakeholders**: Alert relevant teams about the incident 2. **Document Actions**: Record all steps taken and their results 3. **Provide Updates**: Regular status updates until resolution 4. **Plan Follow-up**: Schedule post-incident review and prevention measures ## Alert-Specific Response Procedures ### High Connection Count Alert **Diagnostic Queries**: ```promql # Check for connection leaks pg_process_idle_seconds > 300 # Identify connection sources pg_database_numbackends by (datname) ``` **Immediate Actions**: 1. Kill idle connections: `SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle' AND state_change < now() - interval '1 hour';` 2. Check connection pooler configuration 3. Identify applications not closing connections properly ### Lock Contention Alert **Diagnostic Queries**: ```promql # Lock analysis pg_locks_count by (mode, locktype) # Blocking queries pg_long_running_transactions > 300 ``` **Immediate Actions**: 1. Identify blocking queries: `SELECT * FROM pg_stat_activity WHERE state = 'active' ORDER BY query_start;` 2. Consider terminating long-running transactions 3. Check for DDL operations blocking DML ### Cache Hit Ratio Drop Alert **Diagnostic Queries**: ```promql # Cache efficiency by database 100 * (pg_database_blks_hit / (pg_database_blks_hit + pg_database_blks_read)) # Recent query changes rate(pg_stat_statements_calls[5m]) ``` **Immediate Actions**: 1. Check for new queries causing table scans 2. Review recent application deployments 3. Monitor buffer pool pressure ### Long Running Query Alert **Diagnostic Queries**: ```promql # Query details pg_stat_statements_mean_time > 1000 # Resource consumption pg_stat_statements_total_time ``` **Immediate Actions**: 1. Identify the specific query: `SELECT query, state, query_start FROM pg_stat_activity WHERE state = 'active' ORDER BY query_start;` 2. Check query execution plan 3. Consider query termination if non-critical ## Required Output Format ### Incident Summary - **Alert Type**: Specific alert that triggered - **Severity Level**: Critical/High/Medium/Low - **Service Impact**: Current user/application impact - **Status**: Active/Investigating/Resolving/Resolved ### Immediate Assessment - **Threat Level**: How severe is the immediate risk? - **Blast Radius**: What systems/users are affected? - **Escalation Risk**: How quickly is this getting worse? - **Time Sensitivity**: How long until critical failure? ### Root Cause Analysis - **Primary Cause**: Main factor causing the alert - **Contributing Factors**: Secondary issues making it worse - **Evidence**: Specific metrics and query results supporting diagnosis - **Timeline**: Sequence of events leading to the alert ### Action Plan - **Immediate Actions (0-5 minutes)**: - Emergency steps to prevent service failure - Commands to run or processes to stop/start - **Short-term Actions (5-30 minutes)**: - Diagnostic queries to run - Temporary fixes to implement - Monitoring to establish - **Follow-up Actions (30+ minutes)**: - Permanent fixes to implement - Process improvements needed - Prevention measures to put in place ### Communication Template ``` INCIDENT ALERT: [Alert Type] - [Severity] Status: [Active/Resolved/Monitoring] Impact: [Description of user impact] Cause: [Root cause summary] Actions Taken: [Key steps performed] Next Steps: [Planned actions] ETA: [Expected resolution time] ``` ### Diagnostic Commands Provide specific commands to run for further investigation: - **PromQL queries** for metric analysis - **SQL queries** for database investigation - **System commands** for server analysis ## Success Criteria - Service impact minimized within 15 minutes - Root cause identified within 30 minutes - Temporary workaround implemented within 1 hour - All stakeholders notified within 5 minutes of incident start - Complete documentation available for post-incident review

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/idanfishman/prometheus-mcp'

If you have feedback or need assistance with the MCP directory API, please join our Discord server