# Monitor System Health
## Goal
Establish comprehensive monitoring and observability for applications and infrastructure to detect issues proactively and maintain system reliability.
## Context Required
- Running application or system
- System architecture and components
## Skip When
- Comprehensive monitoring already in place and functioning
- System too simple to warrant detailed monitoring
- Monitoring setup is handled by infrastructure team
- Only temporary or development system
## Complexity Assessment
- **Task Complexity**: High - requires monitoring expertise and system architecture understanding
## Task Understanding Assessment
If task unclear - ask clarifying questions with multiple choice options
## Monitoring Categories
### Infrastructure Monitoring
- **CPU utilization** - track processor usage across all systems
- **Memory usage** - monitor RAM consumption and availability
- **Disk space** - track storage usage and I/O performance
- **Network** - monitor bandwidth, latency, and connectivity
- **Container metrics** - CPU, memory, and network for containerized apps
### Application Monitoring
- **Response times** - track API and page load performance
- **Throughput** - monitor requests per second and concurrent users
- **Error rates** - track 4xx and 5xx errors, exceptions, and failures
- **Business metrics** - monitor key business KPIs and conversion rates
- **User experience** - track real user monitoring and synthetic checks
### Security Monitoring
- **Authentication failures** - monitor failed login attempts
- **Access patterns** - track unusual access patterns or privilege escalation
- **Security events** - monitor firewall blocks, intrusion attempts
- **Compliance metrics** - track adherence to security policies
- **Vulnerability scanning** - regular security assessment results
### Database Monitoring
- **Query performance** - track slow queries and execution times
- **Connection pool** - monitor database connections and pool usage
- **Replication lag** - track data synchronization across replicas
- **Storage usage** - monitor database size and growth trends
- **Lock contention** - identify blocking queries and deadlocks
## Alert Configuration
### Alert Severity Levels
- **Critical** - immediate response required (system down, data loss)
- **Warning** - requires attention within defined timeframe
- **Info** - informational alerts for awareness and trending
### Alert Thresholds
- Set thresholds based on baseline performance data
- Use multiple thresholds to avoid alert fatigue
- Configure escalation for unacknowledged critical alerts
- Implement intelligent alerting to reduce false positives
### Notification Channels
- **Immediate** - SMS, phone calls for critical alerts
- **Standard** - email, Slack for warnings and info alerts
- **Escalation** - manager notification for unresolved critical issues
- **Status pages** - public communication for customer-facing issues
## Dashboard Design
### Executive Dashboard
- High-level system health overview
- Key business metrics and SLAs
- Current incidents and their impact
- System availability and performance trends
### Operations Dashboard
- Detailed system metrics and performance
- Recent alerts and their status
- Infrastructure utilization and capacity
- Deployment and change tracking
### Development Dashboard
- Application performance metrics
- Error rates and recent errors
- Performance across different environments
- Code deployment and rollback tracking
### Business Dashboard
- User engagement and conversion metrics
- Revenue and transaction monitoring
- Customer experience indicators
- Business process performance
## Key Tasks
1. **Assess monitoring requirements** - identify critical metrics, SLAs, and monitoring scope
2. **Set up infrastructure monitoring** - monitor servers, containers, networks, and resources
3. **Implement application monitoring** - track application performance, errors, and usage
4. **Configure log aggregation** - centralize and analyze application and system logs
5. **Create dashboards and visualizations** - build monitoring dashboards for different audiences
6. **Set up alerting and notifications** - configure alerts for critical issues and thresholds
7. **Test monitoring and alerts** - validate monitoring setup and alert delivery
8. **Document monitoring procedures** - create runbooks and monitoring documentation
## Monitoring Tools and Technologies
### Open Source Options
- **Prometheus + Grafana** - metrics collection and visualization
- **ELK Stack** - Elasticsearch, Logstash, Kibana for log analysis
- **Jaeger** - distributed tracing for microservices
- **Nagios** - traditional infrastructure monitoring
### Cloud Provider Solutions
- **AWS CloudWatch** - comprehensive AWS service monitoring
- **Azure Monitor** - Microsoft Azure monitoring and analytics
- **Google Cloud Monitoring** - GCP infrastructure and application monitoring
### SaaS Solutions
- **Datadog** - full-stack monitoring and analytics
- **New Relic** - application performance monitoring
- **Splunk** - log analysis and security monitoring
- **PagerDuty** - incident management and alerting
## Success Criteria
- All critical system components monitored with appropriate metrics
- Dashboards provide clear visibility into system health
- Alerts trigger appropriately for actual issues without false positives
- Monitoring data helps with troubleshooting and capacity planning
- Team receives notifications through preferred channels
- Historical data available for trend analysis
## Key Outputs
- Comprehensive monitoring setup for all system components
- Real-time dashboards showing system health and performance
- Automated alerting for critical issues and threshold breaches
- Centralized log aggregation and analysis
- Monitoring documentation and runbooks
- Performance baselines and trend analysis