# OVS Performance Monitoring and Analysis System
A comprehensive monitoring and analysis system for Open vSwitch (OVS) performance in OpenShift/Kubernetes environments.
## ποΈ System Architecture
```
βββββββββββββββββββββββ ββββββββββββββββββββββββ βββββββββββββββββββββββ
β OpenShift/K8s β β Prometheus β β OVS Monitoring β
β βββββββββββββββ β β ββββββββββββββββ β β βββββββββββββββ β
β β OVS Nodes βββββΌβββββΌβββΆβ Metrics βββββΌβββββΌβββΆβ Collector β β
β β β β β β Storage β β β β β β
β βββββββββββββββ β β ββββββββββββββββ β β βββββββββββββββ β
β β β β β β β
β βββββββββββββββ β β ββββββββββββββββ β β βββββββΌββββββββββ β
β β OVSDB βββββΌβββββΌβββΆβ PromQL β β β β Analyzer β β
β β Servers β β β β Queries β β β β β β
β βββββββββββββββ β β ββββββββββββββββ β β βββββββββββββββββ β
βββββββββββββββββββββββ ββββββββββββββββββββββββ βββββββββββββββββββββββ
```
## π Components
### 1. Core Modules
- **`ovnk_benchmark_prometheus_basequery.py`**: Base Prometheus query functionality
- **`ovnk_benchmark_auth.py`**: OpenShift/Kubernetes authentication and service discovery
- **`ovnk_benchmark_prometheus_ovnk_ovs.py`**: OVS metrics collector (main implementation)
- **`ovnk_benchmark_performance_analysis_ovs.py`**: Performance analysis and alerting
### 2. Configuration
- **`metrics.yaml`**: Metrics configuration with PromQL queries and thresholds
- **Main script**: Orchestrator for running various analysis modes
## π Features
### Metrics Collected
#### CPU Usage
- β
OVS vswitchd CPU usage per node
- β
OVSDB server CPU usage per node
- β
Top 10 highest CPU consumers
- β
Min/Max/Average statistics
#### Memory Usage
- β
OVS database process memory (with unit conversion)
- β
OVS vswitchd process memory (with unit conversion)
- β
Top 10 highest memory consumers
- β
Automatic unit conversion (bytes β KB/MB/GB)
#### Flow Metrics
- β
Datapath flows total (`ovs_vswitchd_dp_flows_total`)
- β
Bridge flows for br-int and br-ex
- β
Flow efficiency analysis
- β
Top 10 instances by flow count
#### Connection Health
- β
Stream connections open
- β
Connection overflows
- β
Discarded connections
- β
Connection efficiency metrics
### Analysis Capabilities
#### Performance Analysis
- π¨ **Alerting**: Critical/Warning/Info levels
- π **Insights**: High-confidence performance insights
- π **Trends**: Variance analysis and growth patterns
- π― **Recommendations**: Actionable optimization suggestions
#### Alert Categories
- **CPU Usage**: High utilization alerts with thresholds
- **Memory Usage**: Memory leak detection and growth alerts
- **Flow Table**: Flow table size optimization alerts
- **Connection Health**: Network connectivity issue detection
## π οΈ Installation and Setup
### Prerequisites
```bash
# Required Python packages
pip install aiohttp asyncio kubernetes pyyaml statistics
```
### File Structure
```
project/
βββ tools/
β βββ ovnk_benchmark_prometheus_basequery.py
β βββ ovnk_benchmark_auth.py
β βββ ovnk_benchmark_prometheus_ovnk_ovs.py
βββ analysis/
β βββ ovnk_benchmark_performance_analysis_ovs.py
βββ metrics.yaml
βββ main.py # Main orchestrator script
```
### Configuration
1. **Kubeconfig Setup**:
```bash
export KUBECONFIG=/path/to/your/kubeconfig
```
2. **Metrics Configuration**:
- Place `metrics.yaml` in the project root
- Customize thresholds and queries as needed
## π Usage
### Command Line Interface
#### Instant Analysis (Point-in-time)
```bash
python main.py --mode instant --output ovs_report.json
```
#### Duration Analysis (Time Range)
```bash
# 5-minute analysis
python main.py --mode duration --duration 5m --output ovs_5min_report.json
# 1-hour analysis
python main.py --mode duration --duration 1h --output ovs_1hour_report.json
# 1-day analysis
python main.py --mode duration --duration 1d --output ovs_daily_report.json
```
#### Individual Metrics
```bash
# CPU metrics only
python main.py --mode metric --metric cpu --duration 5m
# Memory metrics only
python main.py --mode metric --metric memory --duration 10m
# Flow metrics only
python main.py --mode metric --metric dp_flows
# Bridge flow metrics
python main.py --mode metric --metric bridge_flows
# Connection metrics
python main.py --mode metric --metric connections
```
#### Continuous Monitoring
```bash
# Monitor every 5 minutes with 2-minute duration windows
python main.py --mode monitor --interval 300 --duration 2m
# Limited monitoring (10 iterations)
python main.py --mode monitor --interval 300 --duration 2m --max-iterations 10
```
### Programmatic Usage
#### Basic Usage
```python
import asyncio
from ovnk_benchmark_prometheus_basequery import PrometheusBaseQuery
from ovnk_benchmark_auth import OpenShiftAuth
from ovnk_benchmark_prometheus_ovnk_ovs import OVSUsageCollector
from ovnk_benchmark_performance_analysis_ovs import OVSPerformanceAnalyzer
async def main():
# Initialize authentication
auth = OpenShiftAuth()
await auth.initialize()
# Create Prometheus client
async with PrometheusBaseQuery(auth.prometheus_url, auth.prometheus_token) as prometheus_client:
# Initialize OVS collector
ovs_collector = OVSUsageCollector(prometheus_client, auth)
# Collect metrics
cpu_metrics = await ovs_collector.query_ovs_cpu_usage(duration="5m")
memory_metrics = await ovs_collector.query_ovs_memory_usage(duration="5m")
# Analyze performance
analyzer = OVSPerformanceAnalyzer()
analysis = analyzer.analyze_comprehensive_ovs_metrics({
'cpu_usage': cpu_metrics,
'memory_usage': memory_metrics
})
print(f"Performance Status: {analysis['performance_summary']['overall_status']}")
asyncio.run(main())
```
#### Advanced Usage
```python
async def advanced_monitoring():
auth = OpenShiftAuth()
await auth.initialize()
async with PrometheusBaseQuery(auth.prometheus_url, auth.prometheus_token) as prometheus_client:
ovs_collector = OVSUsageCollector(prometheus_client, auth)
# Comprehensive metrics collection
all_metrics = await ovs_collector.collect_all_ovs_metrics(duration="10m")
# Detailed analysis
analyzer = OVSPerformanceAnalyzer()
detailed_analysis = analyzer.analyze_comprehensive_ovs_metrics(all_metrics)
# Process alerts
for alert in detailed_analysis['detailed_alerts']:
if alert['level'] == 'critical':
print(f"π¨ CRITICAL: {alert['message']}")
print(f"π‘ Recommendation: {alert['recommendation']}")
```
## π Output Examples
### Instant Analysis Output
```json
{
"report_type": "instant_analysis",
"timestamp": "2025-08-27T10:30:00.000Z",
"cluster_info": {
"cluster_info": {
"openshift_version": "4.12.0",
"node_count": 6,
"is_openshift": true
}
},
"performance_analysis": {
"overall_status": "good",
"summary_metrics": {
"total_alerts": 2,
"critical_alerts": 0,
"warning_alerts": 2,
"total_insights": 3
},
"top_issues": [
{
"level": "warning",
"message": "High CPU usage on node worker-1: 78.5%",
"recommendation": "Monitor closely and consider optimization if trend continues"
}
]
}
}
```
### CPU Usage Top 10 Output
```json
{
"ovs_vswitchd_cpu": [
{
"node_name": "worker-1",
"min": 45.2,
"avg": 67.8,
"max": 89.5,
"unit": "%"
},
{
"node_name": "worker-2",
"min": 12.1,
"avg": 25.3,
"max": 38.7,
"unit": "%"
}
],
"summary": {
"ovs_vswitchd_top10": [
{
"node_name": "worker-1",
"max": 89.5,
"unit": "%"
}
]
}
}
```
### Memory Usage with Unit Conversion
```json
{
"ovs_db_memory": [
{
"pod_name": "ovs-db-worker-1",
"min": 256.5,
"avg": 445.2,
"max": 678.9,
"unit": "MB"
}
],
"ovs_vswitchd_memory": [
{
"pod_name": "ovs-vswitchd-worker-1",
"min": 0.5,
"avg": 0.74,
"max": 1.02,
"unit": "GB"
}
]
}
```
## π§ Configuration Options
### metrics.yaml Configuration
```yaml
metrics:
- query: sum by(node) (irate(container_cpu_usage_seconds_total{id=~"/system.slice/ovs-vswitchd.service"}[5m])*100)
metricName: ovs-vswitchd-cpu-usage
unit: percent
threshold:
warning: 70
critical: 85
global:
scrape_interval: 15s
query_timeout: 30s
analysis:
cpu_thresholds:
warning_percent: 70
critical_percent: 85
memory_thresholds:
warning_mb: 500
critical_mb: 1000
```
### Performance Thresholds
#### CPU Thresholds
- **Warning**: 70% CPU usage
- **Critical**: 85% CPU usage
#### Memory Thresholds
- **Warning**: 500 MB memory usage
- **Critical**: 1 GB memory usage
#### Flow Thresholds
- **Datapath Warning**: 5,000 flows
- **Datapath Critical**: 20,000 flows
- **Bridge Warning**: 10,000 flows
- **Bridge Critical**: 50,000 flows
#### Connection Thresholds
- **Overflow Warning**: 100 events
- **Overflow Critical**: 1,000 events
- **Discarded Warning**: 50 connections
- **Discarded Critical**: 500 connections
## π¨ Alert Types and Recommendations
### CPU Usage Alerts
- **High vswitchd CPU**: Review flow rules, optimize table structure
- **High OVSDB CPU**: Check database size, client connections
- **CPU Variance**: Investigate traffic patterns, flow optimization
### Memory Usage Alerts
- **Memory Growth**: Check for leaks, review cleanup policies
- **High Memory**: Optimize flow tables, database maintenance
- **Memory Spikes**: Monitor traffic patterns, review caching
### Flow Table Alerts
- **High Flow Count**: Optimize rules, review segmentation
- **Flow Efficiency**: Check megaflow cache, rule conflicts
- **Bridge Imbalance**: Review network policies, traffic distribution
### Connection Alerts
- **Overflow Events**: Check controller connectivity, reduce load
- **Discarded Connections**: Review network stability, controller health
- **High Overflow Rate**: Tune connection buffers, check processing capacity
## π Performance Insights
### Insight Categories
#### CPU Variance Analysis
- Detects inconsistent CPU usage patterns
- Identifies potential traffic spikes or processing inefficiencies
- Confidence scoring based on variance magnitude
#### Memory Growth Tracking
- Monitors memory usage trends over time
- Detects potential memory leaks or capacity issues
- Tracks growth rates and patterns
#### Flow Efficiency Analysis
- Analyzes datapath to bridge flow ratios
- Identifies megaflow cache inefficiencies
- Detects flow rule optimization opportunities
#### Connection Health Monitoring
- Tracks connection stability metrics
- Identifies controller communication issues
- Monitors overflow and discard rates
## π Troubleshooting
### Common Issues
#### Authentication Problems
```bash
# Check kubeconfig
kubectl config current-context
# Verify cluster access
kubectl get nodes
# Test Prometheus connectivity
kubectl get pods -n openshift-monitoring | grep prometheus
```
#### Prometheus Connection Issues
```bash
# Check Prometheus service
kubectl get svc -n openshift-monitoring | grep prometheus
# Test route accessibility (OpenShift)
oc get route -n openshift-monitoring | grep prometheus
# Verify token permissions
kubectl auth can-i get --subresource=metrics '*' --all-namespaces
```
#### Missing Metrics
```bash
# Check OVS pods
kubectl get pods -A | grep ovs
# Verify OVS metrics exporters
kubectl get pods -n openshift-ovn-kubernetes | grep ovs
# Check metric endpoints
curl http://prometheus-url/api/v1/label/__name__/values | grep ovs
```
### Debug Mode
Enable debug logging:
```python
import logging
logging.basicConfig(level=logging.DEBUG)
# Run with verbose output
python main.py --mode instant --output debug_report.json
```
## π Integration Options
### Grafana Dashboard
- Import provided dashboard configuration from `metrics.yaml`
- Visualize real-time OVS performance metrics
- Set up alerting rules for critical thresholds
### Prometheus AlertManager
- Use alerting rules from `metrics.yaml`
- Configure notification channels (Slack, email, PagerDuty)
- Set up escalation policies for critical alerts
### CI/CD Integration
```yaml
# GitHub Actions example
- name: OVS Performance Check
run: |
python main.py --mode duration --duration 5m --output ovs_check.json
# Parse results and fail if critical alerts found
```
### Automation Scripts
```bash
#!/bin/bash
# Daily OVS health check
python main.py --mode duration --duration 1h --output "daily_$(date +%Y%m%d).json"
# Alert on critical issues
if grep -q '"level": "critical"' daily_*.json; then
echo "π¨ Critical OVS issues detected!"
# Send notifications
fi
```
## π API Reference
### OVSUsageCollector Methods
#### `query_ovs_cpu_usage(duration=None, time=None)`
- Collects CPU usage for OVS components
- Returns top 10 consumers with min/avg/max stats
- Supports both instant and range queries
#### `query_ovs_memory_usage(duration=None, time=None)`
- Collects memory usage with automatic unit conversion
- Returns pod-level memory statistics
- Provides top 10 memory consumers
#### `query_ovs_dp_flows_total(duration=None, time=None)`
- Queries datapath flow counts
- Returns per-instance flow statistics
- Includes top 10 flow consumers
#### `query_ovs_bridge_flows_total(duration=None, time=None)`
- Queries bridge-specific flow counts (br-int, br-ex)
- Provides separate top 10 lists per bridge
- Includes flow efficiency metrics
#### `query_ovs_connection_metrics(duration=None, time=None)`
- Collects connection health metrics
- Monitors overflows, discarded connections
- Tracks stream connection counts
### OVSPerformanceAnalyzer Methods
#### `analyze_comprehensive_ovs_metrics(metrics_data)`
- Performs complete analysis of all collected metrics
- Returns alerts, insights, and recommendations
- Provides overall performance status assessment
## π€ Contributing
1. **Fork** the repository
2. **Create** a feature branch (`git checkout -b feature/amazing-feature`)
3. **Commit** your changes (`git commit -m 'Add amazing feature'`)
4. **Push** to the branch (`git push origin feature/amazing-feature`)
5. **Open** a Pull Request
### Development Guidelines
- Follow Python PEP 8 style guidelines
- Add unit tests for new functionality
- Update documentation for API changes
- Test with multiple OpenShift/Kubernetes versions
## π License
This project is licensed under the MIT License - see the LICENSE file for details.
## π Support
- **Issues**: Report bugs and feature requests via GitHub Issues
- **Discussions**: Join community discussions for questions and ideas
- **Documentation**: Check the wiki for additional examples and tutorials
## π·οΈ Version History
- **v1.0.0**: Initial release with comprehensive OVS monitoring
- **v1.1.0**: Added performance analysis and alerting
- **v1.2.0**: Enhanced metrics configuration and dashboard integration
---
**π Start monitoring your OVS performance today!**
```bash
python main.py --mode instant
```
CPU Usage Monitoring (query_ovs_cpu_usage)
β
Min/Max/Average for ovs-vswitchd-cpu-usage and ovsdb-server-cpu-usage
β
Per-node breakdown with top 10 ranking
β
Supports both instant and duration queries
Memory Usage Monitoring (query_ovs_memory_usage)
β
ovs_db_process_resident_memory_bytes and ovs_vswitchd_process_resident_memory_bytes
β
Automatic unit conversion (bytes β KB/MB/GB)
β
Pod-level tracking with top 10 consumers
Flow Metrics
β
ovs_vswitchd_dp_flows_total with min/max/avg stats
β
ovs_vswitchd_bridge_flows_total for br-int and br-ex bridges
β
Top 10 results for each metric separately
Connection Metrics (query_ovs_connection_metrics)
β
sum(ovs_vswitchd_stream_open), sum(ovs_vswitchd_rconn_overflow), sum(ovs_vswitchd_rconn_discarded)
β
Statistical analysis with thresholds
OVSUsageCollector Class
β
Supports both instant and duration scenarios
β
Uses PrometheusBaseQuery client instead of direct prometheus_url
β
Integrates with OpenShiftAuth for Kubernetes operations
π§ Advanced Analysis Features:
Performance Analysis Module (ovnk_benchmark_performance_analysis_ovs.py)
β
Comprehensive alert system (Critical/Warning/Info levels)
β
Performance insights with confidence scoring
β
Actionable recommendations for optimization
β
Overall performance status determination
Configuration Management
β
metrics.yaml support with fallback to default PromQL queries
β
Configurable thresholds and alerting rules
β
Flexible metric definitions
π System Capabilities:
Multi-mode Operation: Instant queries, duration-based analysis, individual metrics, continuous monitoring
Intelligent Analysis: Detects performance issues, memory leaks, flow inefficiencies, connection problems
Comprehensive Reporting: JSON output with detailed metrics, analysis, and recommendations
Top 10 Rankings: Separate rankings for each metric type as requested
Unit Conversion: Automatic memory unit conversion for readability
Error Handling: Robust error handling with graceful degradation