# PostgreSQL AI DBA Agent
## Overview
This comprehensive test suite validates AI-powered PostgreSQL database administration capabilities by simulating realistic performance scenarios. Using Prometheus metrics from `postgres_exporter`, we test whether AI agents can accurately diagnose, analyze, and recommend solutions for database performance issues.
- **Single Command Setup**: `./entrypoint.sh` - Everything runs automatically
- **Realistic Scenarios**: 60-minute Black Friday simulation with 6 distinct performance phases
- **Fully Containerized**: No local dependencies except Docker
- **Complete Monitoring**: Prometheus + Grafana + postgres_exporter included
- **AI-Ready**: Specialized prompts for different DBA analysis scenarios
- **Automated Orchestration**: Load generator manages the entire journey
## Real-World Scenario: E-commerce Black Friday
**Setting:** "ShopFast" - A high-traffic e-commerce platform during Black Friday
**Database Stats:**
- 2M active users
- 500K products in catalog
- Multiple microservices architecture
- Peak traffic multiplier: 5x normal load
**Business Impact:** Critical revenue period requiring 99.9% uptime and optimal performance
## Test Architecture
```
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   PostgreSQL    │    │   Prometheus    │    │  AI DBA Agent   │
│   Database      │◄──►│   + postgres_   │◄──►│   Analysis &    │
│   + pgbench     │    │     exporter    │    │ Recommendations │
└─────────────────┘    └─────────────────┘    └─────────────────┘
```
### Components
- **PostgreSQL 14+** with realistic e-commerce schema
- **Prometheus** with postgres_exporter v0.17.1
- **pgbench** for controlled load generation
- **AI Agent** with specialized DBA prompts
## The 60-Minute Journey
### Phase 1: Normal Operations (0-10 minutes)
**Scenario:** Regular shopping activity, typical user behavior
**Simulation:**
```bash
pgbench -c 10 -j 2 -T 600 -f normal_operations.sql
```
**Expected Metrics:**
- Connections: 15-20
- Cache hit ratio: 95%+
- Query response time: <50ms
**AI Validation:** System should report healthy baseline
---
### Phase 2: Traffic Surge (10-20 minutes)
**Scenario:** Black Friday begins, 3x traffic increase
**Simulation:**
```bash
pgbench -c 30 -j 4 -T 600 -f heavy_browsing.sql &
pgbench -c 15 -j 2 -T 600 -f product_searches.sql &
```
**Expected Metrics:**
- Connections: 45+
- Slight performance degradation
- Increased query volume
**AI Validation:** Should detect traffic pattern change and recommend monitoring
---
### Phase 3: Lock Contention Storm (20-35 minutes)
**Scenario:** Inventory updates during high purchase volume create deadlocks
**Simulation:**
```bash
pgbench -c 5 -j 1 -T 900 -f inventory_updates.sql &
pgbench -c 20 -j 4 -T 900 -f purchase_attempts.sql &
```
**Target Metrics:**
- `pg_locks_count{mode="RowExclusiveLock"}` spikes
- `pg_long_running_transactions_max_duration_seconds` > 30s
- `pg_database_deadlocks` increases
**AI Validation:**
- Should trigger **CRITICAL ALERT**
- Identify blocking inventory queries (see the query sketch below)
- Recommend immediate intervention
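
For manual verification during this phase, the blocking chain can be inspected directly in PostgreSQL. A minimal sketch, assuming the database runs on localhost with the default superuser and a database named `shopfast` (adjust credentials and database name to your compose setup):

```bash
# List sessions that are currently blocked and the PIDs blocking them.
# pg_blocking_pids() is available in PostgreSQL 9.6+.
psql -h localhost -U postgres -d shopfast -c "
  SELECT pid,
         pg_blocking_pids(pid) AS blocked_by,
         wait_event_type,
         now() - xact_start    AS transaction_age,
         left(query, 60)       AS query
  FROM pg_stat_activity
  WHERE cardinality(pg_blocking_pids(pid)) > 0
  ORDER BY transaction_age DESC;"
```

During Phase 3 the blocking PIDs should trace back to the `SELECT ... FOR UPDATE` statements issued by `inventory_updates.sql`.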
---
### Phase 4: Query Performance Collapse (30-45 minutes)
**Scenario:** Recommendation engine deploys inefficient queries
**Simulation:**
```bash
pgbench -c 15 -j 3 -T 900 -f expensive_analytics.sql &
pgbench -c 10 -j 2 -T 900 -f unoptimized_searches.sql &
```
**Target Metrics:**
- `pg_stat_statements_mean_time` dramatically increases
- `pg_stat_user_tables_seq_scan` spikes
- Cache hit ratio drops to <90%
**AI Validation:**
- Identify specific slow queries
- Recommend index creation (e.g. the remediation sketch below)
- Suggest query optimization
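
As an illustration of a reasonable remediation for the wildcard searches above, a hedged sketch follows. It assumes the `pg_trgm` extension is installable and the same connection details as in the earlier sketch; it is one possible fix, not the only acceptable agent recommendation:

```bash
# Trigram GIN indexes serve the leading-wildcard LIKE patterns in this phase;
# CONCURRENTLY avoids blocking the ongoing Black Friday write traffic.
psql -h localhost -U postgres -d shopfast <<'SQL'
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_products_name_trgm
    ON products USING gin (name gin_trgm_ops);
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_products_description_trgm
    ON products USING gin (description gin_trgm_ops);
SQL
```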
---
### Phase 5: Connection Pool Exhaustion (40-55 minutes)
**Scenario:** Mobile app connection leak during peak traffic
**Simulation:**
```bash
pgbench -c 70 -j 10 -T 900 -f connection_leak.sql &
```
**Target Metrics:**
- `pg_database_numbackends` approaches max_connections (100)
- `pg_process_idle_seconds` shows many long-idle connections
- New connections start failing
**AI Validation:**
- Should trigger **CRITICAL ALERT**
- Distinguish between high load vs. connection leak
- Recommend immediate connection cleanup
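
A sketch of the cleanup the agent might recommend here; the 5-minute idle threshold and connection details are illustrative assumptions, and the candidate list should be reviewed before terminating anything in a real incident:

```bash
# Terminate client backends that have been idle for more than 5 minutes,
# freeing slots before max_connections is reached.
psql -h localhost -U postgres -d shopfast -c "
  SELECT pg_terminate_backend(pid), pid, usename, application_name,
         now() - state_change AS idle_for
  FROM pg_stat_activity
  WHERE state = 'idle'
    AND pid <> pg_backend_pid()
    AND now() - state_change > interval '5 minutes';"
```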
---
### Phase 6: Recovery & Stabilization (50-60 minutes)
**Scenario:** DBA intervention successful, system recovers
**Simulation:**
```bash
# Gradual load reduction
pgbench -c 15 -j 3 -T 600 -f recovery_load.sql
```
**Expected Metrics:**
- All metrics return to normal ranges
- Performance indicators improve
- System stability restored
**AI Validation:**
- Detect recovery pattern
- Confirm system stabilization
- Provide post-incident analysis
## Database Schema & Data
### E-commerce Schema
```sql
-- Products catalog (500K items)
CREATE TABLE products (
    id SERIAL PRIMARY KEY,
    name TEXT,
    category_id INT,
    price DECIMAL(10,2),
    stock_quantity INT,
    description TEXT,
    created_at TIMESTAMP DEFAULT NOW()
);

-- Order management
CREATE TABLE orders (
    id SERIAL PRIMARY KEY,
    user_id INT,
    total_amount DECIMAL(10,2),
    status VARCHAR(20),
    created_at TIMESTAMP DEFAULT NOW()
);

-- Order line items
CREATE TABLE order_items (
    id SERIAL PRIMARY KEY,
    order_id INT REFERENCES orders(id),
    product_id INT REFERENCES products(id),
    quantity INT,
    price DECIMAL(10,2)
);

-- Indexes (some intentionally missing for testing)
CREATE INDEX idx_products_category ON products(category_id);
CREATE INDEX idx_orders_user ON orders(user_id);
-- Missing: products(name), products(description) - for seq scan testing
```
### Data Generation
```sql
-- Generate 500K products
INSERT INTO products
SELECT
    generate_series(1,500000),
    'Product ' || generate_series(1,500000),
    (random() * 100)::int + 1,
    (random() * 1000 + 10)::decimal(10,2),
    (random() * 1000)::int,
    'Description for product ' || generate_series(1,500000),
    NOW() - (random() * interval '365 days');

-- Generate 1M orders
INSERT INTO orders
SELECT
    generate_series(1,1000000),
    (random() * 100000)::int + 1,
    (random() * 5000 + 10)::decimal(10,2),
    CASE WHEN random() < 0.8 THEN 'completed' ELSE 'pending' END,
    NOW() - (random() * interval '180 days');
```
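
The Phase 4 analytics workload joins `order_items`, which the snippets above do not populate. A hedged sketch for generating line items follows; the row count and value ranges are illustrative assumptions chosen to stay within the generated `orders` and `products` ids, and the connection details match the earlier sketches:

```bash
# Populate ~3M order line items so the analytics joins have data to work with.
psql -h localhost -U postgres -d shopfast <<'SQL'
INSERT INTO order_items (order_id, product_id, quantity, price)
SELECT (random() * 999999)::int + 1,        -- references orders 1..1,000,000
       (random() * 499999)::int + 1,        -- references products 1..500,000
       (random() * 4)::int + 1,             -- 1-5 units per line
       (random() * 500 + 5)::decimal(10,2)  -- unit price
FROM generate_series(1, 3000000);
SQL
```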
## pgbench Workload Files
### normal_operations.sql
```sql
-- Typical user browsing patterns
\set product_id random(1, 500000)
\set user_id random(1, 100000)
SELECT * FROM products WHERE id = :product_id;
SELECT COUNT(*) FROM orders WHERE user_id = :user_id;
SELECT * FROM products WHERE category_id = (random() * 100)::int LIMIT 20;
```
### inventory_updates.sql
```sql
-- Inventory management causing locks
\set product_id random(1, 1000)
BEGIN;
SELECT stock_quantity FROM products WHERE id = :product_id FOR UPDATE;
\sleep 5000ms
UPDATE products SET stock_quantity = stock_quantity - (random() * 5)::int
WHERE id = :product_id;
COMMIT;
```
### expensive_analytics.sql
```sql
-- Intentionally inefficient analytics queries
SELECT p.name, p.price, COUNT(oi.id) as order_count
FROM products p
LEFT JOIN order_items oi ON p.id = oi.product_id
WHERE p.description LIKE '%premium%'
GROUP BY p.id, p.name, p.price
ORDER BY order_count DESC
LIMIT 100;
-- Sequential scan generator
SELECT COUNT(*) FROM products WHERE name LIKE '%' || (random() * 1000)::int || '%';
```
### connection_leak.sql
```sql
-- Simulate connection leaks with long idle times
SELECT pg_sleep(30 + random() * 60);
SELECT 1;
\sleep 10000ms
```
## Prometheus Metrics Monitored
### Core Performance Indicators
- `pg_database_numbackends` - Active connections
- `pg_database_xact_commit` - Transaction throughput
- `pg_database_blks_hit` / `pg_database_blks_read` - Cache efficiency
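
As an example, the cache-efficiency indicator can be computed with a single PromQL expression over these counters; a minimal sketch assuming Prometheus at `localhost:9090` and the metric names listed above (adjust if your exporter configuration names them differently):

```bash
# Instance-wide cache hit ratio over the last 5 minutes; values near 1.0 are healthy.
curl -s 'http://localhost:9090/api/v1/query' --data-urlencode \
  'query=sum(rate(pg_database_blks_hit[5m])) / (sum(rate(pg_database_blks_hit[5m])) + sum(rate(pg_database_blks_read[5m])))'
```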
### Lock & Transaction Monitoring
- `pg_locks_count{mode, locktype}` - Lock contention patterns
- `pg_long_running_transactions_max_duration_seconds` - Blocking transactions
- `pg_database_deadlocks` - Deadlock frequency
### Query Performance Tracking
- `pg_stat_statements_calls` - Query execution frequency
- `pg_stat_statements_mean_time` - Average query duration
- `pg_stat_statements_total_time` - Total query time
### I/O & Table Statistics
- `pg_stat_user_tables_seq_scan` - Sequential scan frequency
- `pg_stat_user_tables_idx_scan` - Index scan efficiency
- `pg_statio_user_tables_heap_blks_read` - Disk I/O patterns
## AI Agent Validation Strategy
### Post-Test Analysis (Primary Validation)
**After the 60-minute test completes:**
#### 1. Core Performance Analysis
Feed the complete metrics dataset to the AI agent using the Core Performance Analysis prompt:
- **Input**: 1 hour of collected PostgreSQL metrics from the Black Friday simulation
- **Expected Output**: Comprehensive DBA report identifying all 6 performance phases
- **Validation**: Should detect progression from normal ops → traffic surge → lock contention → query degradation → connection exhaustion → recovery
#### 2. Simulated Real-time Alert Response
Extract critical moments from the recorded data and test the alert response using the Real-time Alert Response prompt:
**Alert Moment 1 (Phase 3 - Minute 25):**
```
pg_locks_count{mode="RowExclusiveLock"} = 45
pg_long_running_transactions_max_duration_seconds = 180
pg_database_deadlocks rate = 0.5/min
```
**Alert Moment 2 (Phase 5 - Minute 45):**
```
pg_database_numbackends = 87 (approaching max_connections: 100)
pg_process_idle_seconds avg = 450
New connection failures detected
```
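
To extract such a moment from the recorded data, the Prometheus range-query API can be scripted; a sketch assuming Prometheus at `localhost:9090` (the time window and metric selector should be adjusted to your actual run):

```bash
# Pull lock counts at 15-second resolution across the Phase 3 window and feed
# the result to the alert-response prompt. GNU date syntax; adjust on macOS.
START=$(date -u -d '35 minutes ago' +%s)
END=$(date -u -d '20 minutes ago' +%s)
curl -s 'http://localhost:9090/api/v1/query_range' \
  --data-urlencode 'query=pg_locks_count{mode="RowExclusiveLock"}' \
  --data-urlencode "start=${START}" \
  --data-urlencode "end=${END}" \
  --data-urlencode 'step=15s'
```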
### Success Criteria
#### Historical Analysis Validation
- **Phase Detection**: Identifies at least 5 of the 6 performance phases correctly
- **Root Cause Accuracy**: Links symptoms to causes (lock contention → inventory updates)
- **Trend Analysis**: Recognizes performance degradation patterns over time
- **Recovery Recognition**: Detects system stabilization in final phase
#### Alert Response Validation
- **Alert Triggering**: Would have generated appropriate alerts for critical moments
- **Severity Assessment**: Correctly prioritizes connection exhaustion over query slowdowns
- **Immediate Actions**: Provides specific intervention steps
- **Investigation Queries**: Suggests correct PromQL queries for deeper analysis
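
Examples of the kind of investigation queries expected, shown against the Prometheus HTTP API; `pg_settings_max_connections` is assumed to be exported, which depends on the exporter's settings collector:

```bash
# Deadlock growth rate per database (relevant to Alert Moment 1):
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=rate(pg_database_deadlocks[5m])'

# Connection usage relative to max_connections (relevant to Alert Moment 2):
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=sum(pg_database_numbackends) / pg_settings_max_connections'
```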
#### Communication Quality
- **Executive Summary**: Clear business impact assessment
- **Technical Details**: Actionable recommendations for DBA team
- **Priority Matrix**: Immediate vs. long-term fixes properly categorized
## Quick Start
### Prerequisites
- **Docker & Docker Compose** - Complete containerized solution
```bash
# Install Docker (if not already installed)
# macOS: brew install docker docker-compose
# Ubuntu: sudo apt-get install docker.io docker-compose
# Windows: Download Docker Desktop
# Verify installation
docker --version
docker-compose --version
```
### One Command to Run Everything
Start the complete 60-minute Black Friday simulation with one command:
```bash
# 1. Navigate to the DBA example directory
cd examples/dba
# 2. Start the complete journey
./entrypoint.sh
```
**That's it!** This single command will:
- Start PostgreSQL with realistic e-commerce data
- Launch Prometheus + postgres_exporter for metrics collection
- Start Grafana with pre-configured dashboards
- Build and run the load generator container
- Execute the complete 60-minute Black Friday simulation automatically
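
Once the stack is up, a quick health check confirms every component is reachable. A minimal sketch; the `postgres` service name is an assumption, so adjust it to the service names in this directory's compose file:

```bash
# Verify containers, Prometheus, Grafana, and PostgreSQL are all responding.
docker-compose ps
curl -sf http://localhost:9090/-/ready    && echo "Prometheus ready"
curl -sf http://localhost:3000/api/health && echo "Grafana ready"
docker-compose exec postgres pg_isready -U postgres
```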
### What Happens During the Journey
The load generator automatically orchestrates 6 realistic performance scenarios; later phases overlap slightly because each background load starts before the previous one finishes:
1. **Phase 1 (0-10m)**: Normal Operations - Baseline performance
2. **Phase 2 (10-20m)**: Traffic Surge - Black Friday begins, 3x traffic
3. **Phase 3 (20-35m)**: Lock Contention Storm - Inventory deadlocks
4. **Phase 4 (30-45m)**: Query Performance Collapse - Slow analytics
5. **Phase 5 (40-55m)**: Connection Pool Exhaustion - Connection leaks
6. **Phase 6 (50-60m)**: Recovery & Stabilization - System recovers
### Access Monitoring
Once tests are running, access your monitoring dashboards:
- **Prometheus Metrics**: http://localhost:9090
- **Grafana Dashboards**: http://localhost:3000 (admin/admin123)
- **Prometheus MCP Server**: http://localhost:8080
## Expected Outcomes
### A Successful AI Agent Should:
1. **Detect all 6 phases** of performance degradation
2. **Trigger alerts** for critical issues (Phases 3 & 5)
3. **Provide accurate root cause analysis** for each scenario
4. **Recommend appropriate fixes** with priority levels
5. **Generate executive summary** of overall test period
### Metrics Success Thresholds:
- **Detection Rate:** >90% of injected issues identified
- **Response Time:** <5 minutes for critical alerts
- **False Positive Rate:** <10%
- **Remediation Accuracy:** >85% of recommendations appropriate
## Business Value
This test suite validates that AI agents can:
- **Replace on-call DBA** for initial triage
- **Provide 24/7 monitoring** with expert-level analysis
- **Reduce MTTR** through accurate root cause identification
**ROI Calculation:**
- Manual DBA response time: 20-60 minutes
- AI agent response time: 2-5 minutes
- Cost savings: 70-85% reduction in incident response time