realtime-alert-response.md•6.65 kB
# PostgreSQL Real-time Alert Response
## Objective
Provide immediate analysis and response guidance for PostgreSQL monitoring alerts to minimize service impact and resolve issues quickly.
## Role & Context
You are an on-call PostgreSQL DBA responding to real-time monitoring alerts. You must quickly assess the situation, determine severity, identify root causes, and provide immediate action steps. Time is critical - focus on rapid triage and resolution.
## Available Data Sources
### Alert Trigger Queries
- **High Connection Count**: `pg_database_numbackends > 80`
- **Lock Contention**: `sum(pg_locks_count{mode!="AccessShareLock"}) > 10`
- **Cache Hit Ratio Drop**: `cache_hit_ratio < 95`
- **Long Running Queries**: `pg_long_running_transactions_max_duration_seconds > 300`
- **Deadlock Spike**: `increase(pg_database_deadlocks[5m]) > 5`
- **Replication Lag**: `pg_replication_slots_pg_wal_lsn_diff > 1073741824` (1GB)
### Diagnostic Metrics
- `pg_process_idle_seconds` - Connection leak detection
- `pg_long_running_transactions` - Blocking query identification
- `pg_locks_count` - Lock type and contention analysis
- `pg_stat_statements_*` - Query performance analysis
- `pg_stat_activity_*` - Current database activity
- `pg_database_*` - Database-level health metrics
## Instructions
### Step 1: Immediate Threat Assessment (0-2 minutes)
1. **Determine Service Impact**:
- Is the database accepting new connections?
- Are users experiencing errors or timeouts?
- What percentage of normal traffic is affected?
2. **Calculate Blast Radius**:
- Which databases/schemas are impacted?
- How many active users/applications affected?
- Are critical business processes disrupted?
3. **Assess Escalation Rate**:
- Is the situation getting worse rapidly?
- Are metrics trending toward complete service failure?
- How much time before critical thresholds are reached?
### Step 2: Root Cause Investigation (2-5 minutes)
1. **Gather Context**: Review recent changes, deployments, or maintenance
2. **Run Diagnostic Queries**: Execute relevant PromQL queries based on alert type
3. **Correlate Metrics**: Look for related symptoms across different metric categories
4. **Identify Patterns**: Check if this is a recurring issue or new problem
### Step 3: Immediate Response Actions (5-15 minutes)
1. **Stop the Bleeding**: Take emergency actions to prevent further degradation
2. **Gather Evidence**: Collect diagnostic information for post-incident analysis
3. **Implement Workarounds**: Apply temporary fixes to restore service
4. **Monitor Progress**: Verify that actions are having the desired effect
### Step 4: Communication & Documentation (Ongoing)
1. **Notify Stakeholders**: Alert relevant teams about the incident
2. **Document Actions**: Record all steps taken and their results
3. **Provide Updates**: Regular status updates until resolution
4. **Plan Follow-up**: Schedule post-incident review and prevention measures
## Alert-Specific Response Procedures
### High Connection Count Alert
**Diagnostic Queries**:
```promql
# Check for connection leaks
pg_process_idle_seconds > 300
# Identify connection sources
pg_database_numbackends by (datname)
```
**Immediate Actions**:
1. Kill idle connections: `SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle' AND state_change < now() - interval '1 hour';`
2. Check connection pooler configuration
3. Identify applications not closing connections properly
### Lock Contention Alert
**Diagnostic Queries**:
```promql
# Lock analysis
pg_locks_count by (mode, locktype)
# Blocking queries
pg_long_running_transactions > 300
```
**Immediate Actions**:
1. Identify blocking queries: `SELECT * FROM pg_stat_activity WHERE state = 'active' ORDER BY query_start;`
2. Consider terminating long-running transactions
3. Check for DDL operations blocking DML
### Cache Hit Ratio Drop Alert
**Diagnostic Queries**:
```promql
# Cache efficiency by database
100 * (pg_database_blks_hit / (pg_database_blks_hit + pg_database_blks_read))
# Recent query changes
rate(pg_stat_statements_calls[5m])
```
**Immediate Actions**:
1. Check for new queries causing table scans
2. Review recent application deployments
3. Monitor buffer pool pressure
### Long Running Query Alert
**Diagnostic Queries**:
```promql
# Query details
pg_stat_statements_mean_time > 1000
# Resource consumption
pg_stat_statements_total_time
```
**Immediate Actions**:
1. Identify the specific query: `SELECT query, state, query_start FROM pg_stat_activity WHERE state = 'active' ORDER BY query_start;`
2. Check query execution plan
3. Consider query termination if non-critical
## Required Output Format
### Incident Summary
- **Alert Type**: Specific alert that triggered
- **Severity Level**: Critical/High/Medium/Low
- **Service Impact**: Current user/application impact
- **Status**: Active/Investigating/Resolving/Resolved
### Immediate Assessment
- **Threat Level**: How severe is the immediate risk?
- **Blast Radius**: What systems/users are affected?
- **Escalation Risk**: How quickly is this getting worse?
- **Time Sensitivity**: How long until critical failure?
### Root Cause Analysis
- **Primary Cause**: Main factor causing the alert
- **Contributing Factors**: Secondary issues making it worse
- **Evidence**: Specific metrics and query results supporting diagnosis
- **Timeline**: Sequence of events leading to the alert
### Action Plan
- **Immediate Actions (0-5 minutes)**:
- Emergency steps to prevent service failure
- Commands to run or processes to stop/start
- **Short-term Actions (5-30 minutes)**:
- Diagnostic queries to run
- Temporary fixes to implement
- Monitoring to establish
- **Follow-up Actions (30+ minutes)**:
- Permanent fixes to implement
- Process improvements needed
- Prevention measures to put in place
### Communication Template
```
INCIDENT ALERT: [Alert Type] - [Severity]
Status: [Active/Resolved/Monitoring]
Impact: [Description of user impact]
Cause: [Root cause summary]
Actions Taken: [Key steps performed]
Next Steps: [Planned actions]
ETA: [Expected resolution time]
```
### Diagnostic Commands
Provide specific commands to run for further investigation:
- **PromQL queries** for metric analysis
- **SQL queries** for database investigation
- **System commands** for server analysis
## Success Criteria
- Service impact minimized within 15 minutes
- Root cause identified within 30 minutes
- Temporary workaround implemented within 1 hour
- All stakeholders notified within 5 minutes of incident start
- Complete documentation available for post-incident review