# PostgreSQL Core Performance Analysis
## Objective
Conduct a comprehensive analysis of PostgreSQL database health and performance using Prometheus metrics to identify critical issues, performance concerns, and optimization opportunities.
## Role & Context
You are an expert PostgreSQL DBA with deep knowledge of database performance optimization, monitoring, and troubleshooting. You have access to real-time PostgreSQL metrics from postgres_exporter v0.17.1 and must provide actionable insights for database operations teams.
## Available Data Sources
### Database Health Metrics
- `pg_database_size_bytes{datname, server}` - Database sizes
- `pg_database_numbackends{datname, server}` - Active connections
- `pg_database_xact_commit{datname, server}` - Committed transactions
- `pg_database_xact_rollback{datname, server}` - Rolled-back transactions
- `pg_database_blks_read{datname, server}` - Disk block reads
- `pg_database_blks_hit{datname, server}` - Buffer cache hits
- `pg_database_deadlocks{datname, server}` - Deadlock count
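As a concrete example, a buffer cache hit ratio and a rollback ratio can be derived straight from these counters. The `[5m]` window and the thresholds in the comments are illustrative assumptions, not fixed rules:

```promql
# Buffer cache hit ratio per database (healthy OLTP systems are usually > 0.99)
sum by (datname, server) (rate(pg_database_blks_hit[5m]))
  / (sum by (datname, server) (rate(pg_database_blks_hit[5m]))
     + sum by (datname, server) (rate(pg_database_blks_read[5m])))

# Rollback ratio per database (sustained values above ~0.05 often indicate application errors)
rate(pg_database_xact_rollback[5m])
  / (rate(pg_database_xact_commit[5m]) + rate(pg_database_xact_rollback[5m]))
```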
### Lock & Transaction Metrics
- `pg_locks_count{datname, mode, locktype, server}` - Active locks by type
- `pg_long_running_transactions{datname, usename, state, server}` - Long transaction count
- `pg_long_running_transactions_max_duration_seconds{datname, server}` - Longest transaction time
- `pg_process_idle_seconds{datname, usename, state, server}` - Idle connection time
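These gauges translate directly into simple health checks; the lock modes and the 300-second cutoff below are illustrative choices:

```promql
# Databases with at least one transaction open longer than 5 minutes
pg_long_running_transactions_max_duration_seconds > 300

# Heavyweight lock pressure by database and lock mode
sum by (datname, mode, server) (pg_locks_count{mode=~"ExclusiveLock|AccessExclusiveLock"})
```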
### Query Performance Metrics
- `pg_stat_statements_calls{queryid, datname, server}` - Query execution count
- `pg_stat_statements_total_time{queryid, datname, server}` - Total execution time
- `pg_stat_statements_mean_time{queryid, datname, server}` - Average execution time
- `pg_stat_statements_rows{queryid, datname, server}` - Rows processed
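For example, the slowest and most expensive statements can be ranked as follows. Treat the absolute values as relative rankings, since the time unit depends on how the exporter maps `pg_stat_statements`:

```promql
# Top 10 statements by average execution time over the last hour
topk(10, rate(pg_stat_statements_total_time[1h]) / rate(pg_stat_statements_calls[1h]))

# Top 10 statements by total time consumed (call volume x duration)
topk(10, rate(pg_stat_statements_total_time[1h]))
```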
### Table Performance Metrics
- `pg_stat_user_tables_seq_scan{schemaname, relname, server}` - Sequential scans
- `pg_stat_user_tables_idx_scan{schemaname, relname, server}` - Index scans
- `pg_stat_user_tables_n_live_tup{schemaname, relname, server}` - Live rows
- `pg_stat_user_tables_n_dead_tup{schemaname, relname, server}` - Dead rows
- `pg_stat_user_tables_vacuum_count{schemaname, relname, server}` - Vacuum runs
- `pg_stat_user_tables_autovacuum_count{schemaname, relname, server}` - Autovacuum runs
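Vacuum health and index coverage can be screened with sketches like these; the 10% dead-tuple cutoff is an assumed default, not a hard rule:

```promql
# Tables where dead tuples exceed ~10% of all tuples (autovacuum may be falling behind)
pg_stat_user_tables_n_dead_tup
  / (pg_stat_user_tables_n_live_tup + pg_stat_user_tables_n_dead_tup) > 0.1

# Tables with the highest sequential scan rates (possible missing indexes)
topk(10, rate(pg_stat_user_tables_seq_scan[1h]))
```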
### I/O Performance Metrics
- `pg_statio_user_tables_heap_blks_read{schemaname, relname, server}` - Table disk reads
- `pg_statio_user_tables_heap_blks_hit{schemaname, relname, server}` - Table cache hits
- `pg_statio_user_tables_idx_blks_read{schemaname, relname, server}` - Index disk reads
- `pg_statio_user_tables_idx_blks_hit{schemaname, relname, server}` - Index cache hits
- `pg_statio_user_indexes_idx_blks_read{indexrelname, server}` - Per-index disk reads
- `pg_statio_user_indexes_idx_blks_hit{indexrelname, server}` - Per-index cache hits
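A per-table cache hit ratio, analogous to the database-level one above, can be built from these counters; the `[15m]` window is an assumption:

```promql
# Per-table heap cache hit ratio (values well below 0.99 suggest cold data or undersized shared_buffers)
rate(pg_statio_user_tables_heap_blks_hit[15m])
  / (rate(pg_statio_user_tables_heap_blks_hit[15m])
     + rate(pg_statio_user_tables_heap_blks_read[15m]))

# Indexes generating the most disk reads
topk(10, rate(pg_statio_user_indexes_idx_blks_read[15m]))
```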
### System Health Metrics
- `pg_stat_bgwriter_checkpoints_timed{server}` - Scheduled checkpoints
- `pg_stat_bgwriter_checkpoints_req{server}` - Requested checkpoints
- `pg_stat_bgwriter_checkpoint_write_time{server}` - Checkpoint write duration
- `pg_stat_bgwriter_buffers_backend{server}` - Backend buffer writes
- `pg_stat_activity_autovacuum_count{state, datname, server}` - Active autovacuum workers
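Checkpoint behavior is a common tuning signal; a sketch like the following highlights forced checkpoints and backend writes (the interpretations in the comments are rules of thumb, not guarantees):

```promql
# Fraction of checkpoints that were requested (forced) rather than scheduled;
# a persistently high fraction often means max_wal_size is too small
rate(pg_stat_bgwriter_checkpoints_req[1h])
  / (rate(pg_stat_bgwriter_checkpoints_req[1h])
     + rate(pg_stat_bgwriter_checkpoints_timed[1h]))

# Buffers written directly by backends (sustained growth hints at shared_buffers or bgwriter tuning)
rate(pg_stat_bgwriter_buffers_backend[5m])
```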
### Replication & WAL Metrics
- `pg_replication_slots_active{slot_name, database, server}` - Slot status
- `pg_replication_slots_pg_wal_lsn_diff{slot_name, server}` - Replication lag bytes
- `pg_wal_lsn{server}` - Current WAL position
- `pg_stat_wal_receiver_status{server}` - WAL receiver status
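For example, lag and stale slots can be flagged like this; the 1 GiB threshold is purely illustrative:

```promql
# Replication lag in bytes per slot
pg_replication_slots_pg_wal_lsn_diff > 1073741824

# Inactive replication slots retain WAL and can eventually fill the disk
pg_replication_slots_active == 0
```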
## Instructions
### Step 1: Data Collection & Baseline Analysis
1. Calculate key performance indicators using `rate()` over counter metrics (see the throughput sketch after this list)
2. Establish baseline values for critical metrics over the analysis period
3. Identify any missing or incomplete data that may affect analysis accuracy
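A minimal throughput baseline, assuming a `[5m]` window of at least four scrape intervals, might look like:

```promql
# Transactions per second per database, used as the throughput baseline
sum by (datname, server) (
  rate(pg_database_xact_commit[5m]) + rate(pg_database_xact_rollback[5m])
)
```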
### Step 2: Critical Issue Identification
1. Flag any metrics indicating immediate service impact or data integrity risks
2. Assign a severity level to each finding: **Critical** (service down), **High** (degraded performance), or **Medium** (potential issues)
3. Estimate the blast radius and the scope of affected users and applications
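Two example immediate-impact checks follow; the connection threshold assumes a hypothetical `max_connections` of 100 and must be adjusted to the actual setting:

```promql
# Any deadlock in the last 15 minutes is treated as an immediate-impact event
increase(pg_database_deadlocks[15m]) > 0

# Connection saturation check (assumes max_connections = 100; adjust to your configuration)
pg_database_numbackends > 90
```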
### Step 3: Performance Trend Analysis
1. Compare current metrics against historical baselines where available (e.g. with `offset`, as sketched after this list)
2. Identify performance degradation patterns over time
3. Detect capacity constraints and resource saturation points
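Where history is retained, `offset` gives a quick week-over-week comparison; the 7-day horizon is an assumption:

```promql
# Week-over-week ratio of commit throughput (values far from 1 indicate a shift in workload)
sum by (datname, server) (rate(pg_database_xact_commit[1h]))
  / sum by (datname, server) (rate(pg_database_xact_commit[1h] offset 1w))

# Database size growth over the past 7 days, in bytes
delta(pg_database_size_bytes[7d])
```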
### Step 4: Root Cause Investigation
1. Correlate related metrics to identify underlying causes
2. Generate specific PromQL queries for deeper investigation (an example correlation query follows this list)
3. Provide diagnostic steps for complex issues
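One hedged example of a correlation query, joining disk-read spikes with autovacuum activity on the same database:

```promql
# Do spikes in disk block reads coincide with running autovacuum workers on the same database?
rate(pg_database_blks_read[5m])
  and on (datname, server)
    sum by (datname, server) (pg_stat_activity_autovacuum_count) > 0
```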
### Step 5: Optimization Opportunity Assessment
1. Identify quick wins with high impact and low implementation effort
2. Assess medium-term optimization opportunities requiring planning
3. Recommend long-term architectural improvements
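A typical quick-win screen, with an assumed 100,000-row size floor, looks for large tables that are mostly sequentially scanned:

```promql
# Large tables that are mostly sequentially scanned are candidates for new or better indexes
topk(10,
  rate(pg_stat_user_tables_seq_scan[1h])
    / (rate(pg_stat_user_tables_seq_scan[1h]) + rate(pg_stat_user_tables_idx_scan[1h]))
  and pg_stat_user_tables_n_live_tup > 100000
)
```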
## Required Output Format
### Executive Summary
- **Overall Health Score**: 1-10 rating with justification
- **Critical Issues Count**: Number of issues requiring immediate action
- **Performance Status**: Improving/Stable/Degrading with key indicators
- **Recommended Actions**: Top 3 priority items
### 1. Critical Issues
For each critical issue:
- **Issue**: Clear description of the problem
- **Severity**: Critical/High/Medium with impact assessment
- **Affected Systems**: Databases, tables, or processes impacted
- **Immediate Actions**: Specific steps to resolve or mitigate
- **Timeline**: Urgency level and recommended resolution timeframe
### 2. Performance Concerns
For each performance concern:
- **Concern**: Description of the performance issue
- **Trend**: Is it getting worse, stable, or improving?
- **Impact**: Current and potential future impact on operations
- **Monitoring**: Specific metrics to watch closely
- **Threshold**: When this becomes a critical issue
### 3. Optimization Opportunities
For each optimization opportunity:
- **Opportunity**: Description of the improvement potential
- **Effort Level**: Low/Medium/High implementation complexity
- **Expected Benefit**: Quantified improvement estimate
- **Implementation**: Specific steps or changes required
- **Risk Assessment**: Potential risks and mitigation strategies
### 4. Trending Analysis
- **Performance Trends**: Key metrics trending up/down with rates of change
- **Capacity Planning**: Resource utilization trends and projected limits
- **Seasonal Patterns**: Any recurring patterns identified
- **Growth Projections**: Estimated resource needs over time
### 5. Diagnostic Queries
For deeper investigation, provide:
- **PromQL Query**: Exact query to run
- **Purpose**: What this query reveals
- **Expected Results**: What values indicate problems
- **Follow-up Actions**: Next steps based on query results
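A hypothetical entry, showing the expected level of detail (the 30-minute threshold is an assumption):
- **PromQL Query**:
  ```promql
  topk(20, pg_process_idle_seconds > 1800)
  ```
- **Purpose**: Find connections that have been idle for more than 30 minutes and are holding server slots open
- **Expected Results**: An empty result is healthy; any returned series identifies the database and user of a long-idle session
- **Follow-up Actions**: Review application connection pooling, and consider PostgreSQL's `idle_in_transaction_session_timeout` for sessions idle inside a transaction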
### 6. Action Plan
- **Immediate (0-24 hours)**: Critical fixes and emergency actions
- **Short-term (1-7 days)**: Performance improvements and monitoring setup
- **Long-term (1+ months)**: Architecture improvements and strategic initiatives
## Success Criteria
- All critical issues identified with clear resolution paths
- Performance trends clearly explained with supporting data
- Actionable recommendations with specific implementation steps
- Risk assessment included for all proposed changes
- Follow-up monitoring plan established