# Monitoring Strategy - OPNSense MCP Server

## 🎯 Executive Summary

A comprehensive monitoring strategy for the OPNSense MCP Server covering:

- **Application Performance Monitoring (APM)**
- **Infrastructure Monitoring**
- **Security Monitoring**
- **Business Metrics**
- **Alerting & Incident Response**

## 📊 Monitoring Architecture

```
┌─────────────────────────────────────────────┐
│              Application Layer              │
│  ┌──────────┐   ┌──────────┐   ┌──────────┐ │
│  │ Metrics  │   │  Traces  │   │   Logs   │ │
│  └────┬─────┘   └────┬─────┘   └────┬─────┘ │
└───────┼──────────────┼──────────────┼───────┘
        │              │              │
        └──────────────┼──────────────┘
                       │
            ┌──────────▼──────────┐
            │    OpenTelemetry    │
            │      Collector      │
            └──────────┬──────────┘
                       │
       ┌───────────────┼───────────────┐
       │               │               │
 ┌─────▼────┐     ┌────▼────┐     ┌────▼────┐
 │Prometheus│     │ Jaeger  │     │  Loki   │
 └─────┬────┘     └────┬────┘     └────┬────┘
       │               │               │
       └───────────────┼───────────────┘
                       │
               ┌───────▼───────┐
               │    Grafana    │
               │  Dashboards   │
               └───────┬───────┘
                       │
               ┌───────▼───────┐
               │   Alerting    │
               │    Manager    │
               └───────────────┘
```

## 🔍 What to Monitor

### 1. Application Metrics

#### Key Performance Indicators (KPIs)

```typescript
// metrics.ts
import { Counter, Gauge, Histogram } from 'prom-client'; // assumes the prom-client metric classes

export const applicationMetrics = {
  // Request metrics
  httpRequestDuration: new Histogram({
    name: 'http_request_duration_seconds',
    help: 'Duration of HTTP requests in seconds',
    labelNames: ['method', 'route', 'status'],
    buckets: [0.1, 0.3, 0.5, 0.7, 1, 3, 5, 7, 10]
  }),

  // MCP tool metrics
  toolExecutionDuration: new Histogram({
    name: 'mcp_tool_execution_duration_seconds',
    help: 'Duration of MCP tool execution',
    labelNames: ['tool', 'status'],
    buckets: [0.1, 0.5, 1, 2, 5, 10, 30]
  }),

  // API client metrics
  apiCallDuration: new Histogram({
    name: 'opnsense_api_duration_seconds',
    help: 'Duration of OPNsense API calls',
    labelNames: ['endpoint', 'method', 'status'],
    buckets: [0.05, 0.1, 0.25, 0.5, 1, 2.5, 5]
  }),

  // Cache metrics
  cacheHitRate: new Gauge({
    name: 'cache_hit_rate',
    help: 'Cache hit rate percentage',
    labelNames: ['cache_type']
  }),

  // Error metrics
  errorRate: new Counter({
    name: 'errors_total',
    help: 'Total number of errors',
    labelNames: ['type', 'severity']
  })
};
```

#### Golden Signals

| Signal | Metric | Target | Alert Threshold |
|--------|--------|--------|-----------------|
| **Latency** | P50, P95, P99 response times | P95 < 500ms | P95 > 1s |
| **Traffic** | Requests per second | Baseline ±20% | >200% baseline |
| **Errors** | Error rate | < 1% | > 5% |
| **Saturation** | CPU, Memory, Connections | < 80% | > 90% |

### 2. Infrastructure Metrics

#### System Resources

```yaml
# Node Exporter metrics
- node_cpu_seconds_total
- node_memory_MemAvailable_bytes
- node_filesystem_avail_bytes
- node_network_receive_bytes_total
- node_network_transmit_bytes_total
- node_load1, node_load5, node_load15
```

#### Container Metrics (if using Docker/K8s)

```yaml
# cAdvisor/Kubernetes metrics
- container_cpu_usage_seconds_total
- container_memory_usage_bytes
- container_network_receive_bytes_total
- container_fs_usage_bytes
- kube_pod_container_status_restarts_total
```
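The metric objects in `metrics.ts` follow the prom-client API, and prom-client can also supply process-level equivalents of the system metrics above for the MCP server process itself. A minimal sketch of exposing both on a scrape endpoint is shown below; the Express app, port, and metric prefix are illustrative assumptions, not part of the existing codebase.

```typescript
// metrics-endpoint.ts (sketch; Express server, port and prefix are assumptions)
import express from 'express';
import { register, collectDefaultMetrics } from 'prom-client';

// Collect default Node.js process metrics (heap, event loop lag, GC, ...)
collectDefaultMetrics({ prefix: 'opnsense_mcp_' });

const app = express();

// Expose every metric registered with the default registry for Prometheus to scrape
app.get('/metrics', async (_req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

app.listen(9091, () => console.log('Metrics endpoint listening on :9091'));
```

Because the metric objects in `metrics.ts` register themselves with prom-client's default registry, no extra wiring is needed for them to appear at this endpoint.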
### 3. Business Metrics

```typescript
// business-metrics.ts
import { Counter, Gauge, Histogram } from 'prom-client'; // assumes the prom-client metric classes

export const businessMetrics = {
  // Resource operations
  vlanOperations: new Counter({
    name: 'vlan_operations_total',
    help: 'Total VLAN operations',
    labelNames: ['operation', 'status']
  }),

  firewallRuleChanges: new Counter({
    name: 'firewall_rule_changes_total',
    help: 'Total firewall rule changes',
    labelNames: ['action', 'interface']
  }),

  // IaC deployments
  deploymentSuccess: new Counter({
    name: 'iac_deployments_successful_total',
    help: 'Successful IaC deployments'
  }),

  deploymentDuration: new Histogram({
    name: 'iac_deployment_duration_seconds',
    help: 'IaC deployment duration',
    buckets: [10, 30, 60, 120, 300, 600]
  }),

  // Backup operations
  backupSize: new Gauge({
    name: 'backup_size_bytes',
    help: 'Size of backups in bytes'
  }),

  lastBackupTimestamp: new Gauge({
    name: 'last_backup_timestamp',
    help: 'Timestamp of last successful backup'
  })
};
```

## 📈 Monitoring Implementation

### 1. OpenTelemetry Setup

```typescript
// telemetry.ts
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';
import { PrometheusExporter } from '@opentelemetry/exporter-prometheus';
import { JaegerExporter } from '@opentelemetry/exporter-jaeger';

const sdk = new NodeSDK({
  traceExporter: new JaegerExporter({
    endpoint: 'http://localhost:14268/api/traces',
  }),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new PrometheusExporter({
      port: 9090,
    }),
    exportIntervalMillis: 15000,
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      '@opentelemetry/instrumentation-fs': {
        enabled: false,
      },
    }),
  ],
});

sdk.start();
```

### 2. Custom Metrics Collection

```typescript
// monitoring.ts
import { metrics } from './metrics';

export class MonitoringService {
  private metricsInterval?: NodeJS.Timeout;

  // cacheManager, apiClient and collectBusinessMetrics are assumed to be
  // provided elsewhere in the service
  start() {
    // Collect metrics every 10 seconds
    this.metricsInterval = setInterval(() => {
      this.collectSystemMetrics();
      this.collectApplicationMetrics();
      this.collectBusinessMetrics();
    }, 10000);
  }

  private async collectSystemMetrics() {
    const usage = process.memoryUsage();
    metrics.memoryUsage.set(usage.heapUsed / 1024 / 1024);

    const cpuUsage = process.cpuUsage();
    metrics.cpuUsage.set(cpuUsage.user / 1000000);
  }

  private async collectApplicationMetrics() {
    // Cache metrics
    const cacheStats = await this.cacheManager.getStats();
    metrics.cacheHitRate.set(
      { cache_type: 'memory' },
      cacheStats.hitRate
    );

    // Connection pool metrics
    const poolStats = this.apiClient.getPoolStats();
    metrics.activeConnections.set(poolStats.active);
    metrics.idleConnections.set(poolStats.idle);
  }

  trackRequest(method: string, path: string, statusCode: number, duration: number) {
    metrics.httpRequestDuration
      .labels(method, path, statusCode.toString())
      .observe(duration / 1000);

    metrics.httpRequestsTotal
      .labels(method, path, statusCode.toString())
      .inc();
  }
}
```
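`trackRequest` has to be called from wherever requests enter the server. A minimal sketch of that wiring, assuming an Express-style HTTP transport (the middleware and its placement are illustrative, not part of the existing codebase):

```typescript
// request-timing.ts (sketch; an Express transport is an assumption)
import { Request, Response, NextFunction } from 'express';
import { MonitoringService } from './monitoring';

const monitoring = new MonitoringService();
monitoring.start();

// Records duration and status for every request once the response has finished
export function requestTimingMiddleware(req: Request, res: Response, next: NextFunction) {
  const startedAt = process.hrtime.bigint();

  res.on('finish', () => {
    const durationMs = Number(process.hrtime.bigint() - startedAt) / 1_000_000;
    monitoring.trackRequest(req.method, req.route?.path ?? req.path, res.statusCode, durationMs);
  });

  next();
}
```

Measuring on the `finish` event keeps the histogram honest for slow responses, since the duration covers the full time until the response is flushed rather than just handler execution.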
### 3. Distributed Tracing

```typescript
// tracing.ts
import { trace, context, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('opnsense-mcp-server');

export async function tracedOperation<T>(
  name: string,
  operation: () => Promise<T>
): Promise<T> {
  const span = tracer.startSpan(name);

  try {
    const result = await context.with(
      trace.setSpan(context.active(), span),
      operation
    );

    span.setStatus({ code: SpanStatusCode.OK });
    return result;
  } catch (error) {
    const err = error as Error;
    span.setStatus({
      code: SpanStatusCode.ERROR,
      message: err.message
    });
    span.recordException(err);
    throw error;
  } finally {
    span.end();
  }
}

// Usage example
async function createVlan(config: VlanConfig) {
  return tracedOperation('vlan.create', async () => {
    const span = trace.getActiveSpan();
    span?.setAttributes({
      'vlan.tag': config.tag,
      'vlan.interface': config.interface
    });

    // Operation logic
    const result = await this.apiClient.createVlan(config);

    span?.addEvent('vlan_created', {
      'vlan.uuid': result.uuid
    });

    return result;
  });
}
```

### 4. Structured Logging

```typescript
// logging.ts
import winston from 'winston';
import { LogtailTransport } from '@logtail/winston';
import { v4 as uuid } from 'uuid';

export const logger = winston.createLogger({
  level: process.env.LOG_LEVEL || 'info',
  format: winston.format.combine(
    winston.format.timestamp(),
    winston.format.errors({ stack: true }),
    winston.format.json()
  ),
  defaultMeta: {
    service: 'opnsense-mcp',
    version: process.env.npm_package_version,
    environment: process.env.NODE_ENV
  },
  transports: [
    // Console output
    new winston.transports.Console({
      format: winston.format.simple()
    }),

    // File output
    new winston.transports.File({
      filename: 'logs/error.log',
      level: 'error'
    }),

    // Centralized logging
    new LogtailTransport({
      sourceToken: process.env.LOGTAIL_TOKEN
    })
  ]
});

// Correlation ID middleware
export function correlationMiddleware(req, res, next) {
  const correlationId = req.headers['x-correlation-id'] || uuid();
  req.correlationId = correlationId;

  // Attach a request-scoped child logger so concurrent requests
  // do not overwrite each other's correlation ID
  req.logger = logger.child({ correlationId });
  next();
}
```
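Since traces land in Jaeger and logs in Loki, it helps to stamp log lines with the active trace context so the two can be joined in Grafana. A small sketch, assuming the tracer from `tracing.ts` and the exported `logger` above (the helper name is illustrative):

```typescript
// log-trace-correlation.ts (sketch; helper name and usage are illustrative)
import { trace } from '@opentelemetry/api';
import { logger } from './logging';

// Adds the current traceId/spanId (if any) to a structured log entry
export function logWithTraceContext(message: string, meta: Record<string, unknown> = {}) {
  const spanContext = trace.getActiveSpan()?.spanContext();
  logger.info(message, {
    ...meta,
    traceId: spanContext?.traceId,
    spanId: spanContext?.spanId
  });
}

// Example: called from inside a traced operation
// logWithTraceContext('vlan_created', { 'vlan.tag': 100 });
```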
## 📊 Dashboards

### 1. Main Operations Dashboard

```json
{
  "dashboard": {
    "title": "OPNSense MCP Operations",
    "panels": [
      {
        "title": "Request Rate",
        "query": "rate(http_requests_total[5m])",
        "type": "graph"
      },
      {
        "title": "Error Rate",
        "query": "sum(rate(errors_total[5m])) / sum(rate(http_requests_total[5m]))",
        "type": "singlestat",
        "thresholds": "0.01,0.05",
        "colors": ["green", "yellow", "red"]
      },
      {
        "title": "P95 Latency",
        "query": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))",
        "type": "graph"
      },
      {
        "title": "Active Connections",
        "query": "opnsense_active_connections",
        "type": "gauge"
      }
    ]
  }
}
```

### 2. Resource Management Dashboard

```yaml
panels:
  - title: "VLAN Operations"
    query: "sum(rate(vlan_operations_total[5m])) by (operation)"
    visualization: bar_chart

  - title: "Firewall Rule Changes"
    query: "sum(increase(firewall_rule_changes_total[1h])) by (action)"
    visualization: pie_chart

  - title: "HAProxy Backend Health"
    query: "haproxy_backend_up"
    visualization: heatmap

  - title: "DNS Blocklist Size"
    query: "dns_blocklist_entries"
    visualization: singlestat
```

### 3. Infrastructure Dashboard

```yaml
panels:
  - title: "CPU Usage"
    query: "100 - (avg(irate(node_cpu_seconds_total{mode='idle'}[5m])) * 100)"

  - title: "Memory Usage"
    query: "(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100"

  - title: "Disk I/O"
    query: "rate(node_disk_read_bytes_total[5m]) + rate(node_disk_written_bytes_total[5m])"

  - title: "Network Traffic"
    query: "rate(node_network_receive_bytes_total[5m]) + rate(node_network_transmit_bytes_total[5m])"
```

## 🚨 Alerting Rules

### 1. Critical Alerts

```yaml
# prometheus-alerts.yml
groups:
  - name: critical
    interval: 30s
    rules:
      - alert: HighErrorRate
        expr: sum(rate(errors_total[5m])) / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }}"

      - alert: APIDown
        expr: up{job="opnsense-api"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "OPNsense API is down"

      - alert: OutOfMemory
        expr: node_memory_MemAvailable_bytes < 100000000
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Memory critically low"
```

### 2. Warning Alerts

```yaml
- name: warnings
  rules:
    - alert: HighLatency
      expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "P95 latency above 1 second"

    - alert: CacheMissRate
      expr: cache_hit_rate < 0.5
      for: 15m
      labels:
        severity: warning
      annotations:
        summary: "Cache hit rate below 50%"

    - alert: DiskSpaceLow
      expr: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.1
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "Less than 10% disk space remaining"
```

### 3. Business Alerts

```yaml
- name: business
  rules:
    - alert: DeploymentFailures
      expr: rate(iac_deployments_failed_total[1h]) > 0.1
      for: 30m
      labels:
        severity: warning
      annotations:
        summary: "High deployment failure rate"

    - alert: BackupNotRunning
      expr: time() - last_backup_timestamp > 86400
      labels:
        severity: warning
      annotations:
        summary: "No backup in last 24 hours"

    - alert: UnusualActivity
      expr: rate(firewall_rule_changes_total[5m]) > 10
      labels:
        severity: info
      annotations:
        summary: "Unusual number of firewall changes"
```

## 📱 Alert Routing

```yaml
# alertmanager.yml
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: pagerduty
      continue: true
    - match:
        severity: warning
      receiver: slack
    - match:
        severity: info
      receiver: email

receivers:
  - name: 'default'
    webhook_configs:
      - url: 'http://localhost:5001/webhook'

  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: '<pagerduty_key>'

  - name: 'slack'
    slack_configs:
      - api_url: '<slack_webhook>'
        channel: '#ops-alerts'

  - name: 'email'
    email_configs:
      - to: 'ops-team@company.com'
        from: 'alerts@company.com'
```

## 📈 SLO Monitoring

### Service Level Objectives

```yaml
slos:
  - name: "API Availability"
    objective: 99.9%
    indicator: "sum(rate(http_requests_total{status!~'5..'}[5m])) / sum(rate(http_requests_total[5m]))"

  - name: "Request Latency"
    objective: 95% of requests < 500ms
    indicator: "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) < 0.5"

  - name: "Error Budget"
    objective: < 0.1% errors
    indicator: "rate(errors_total[1h]) / rate(http_requests_total[1h]) < 0.001"
```

### Error Budget Tracking

```typescript
// error-budget.ts

// Shape of the result returned by the tracker
export interface ErrorBudget {
  slo: number;
  availability: number;
  errorBudgetUsed: number;
  errorBudgetRemaining: number;
  burnRate: number;
}

export class ErrorBudgetTracker {
  private readonly SLO = 0.999; // 99.9% availability

  calculateErrorBudget(timeWindow: number): ErrorBudget {
    // getTotalRequests/getFailedRequests are assumed to query the metrics backend
    const totalRequests = this.getTotalRequests(timeWindow);
    const failedRequests = this.getFailedRequests(timeWindow);

    const availability = (totalRequests - failedRequests) / totalRequests;
    const errorBudgetUsed = (1 - availability) / (1 - this.SLO);
    const errorBudgetRemaining = Math.max(0, 1 - errorBudgetUsed);

    return {
      slo: this.SLO,
      availability,
      errorBudgetUsed,
      errorBudgetRemaining,
      // Burn rate relative to a 30-day SLO window (timeWindow in seconds)
      burnRate: errorBudgetUsed / (timeWindow / (30 * 24 * 60 * 60))
    };
  }
}
```
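To make the tracker actionable, a common pattern is to evaluate burn rate over more than one window and page only when both are high. A sketch of such a check, assuming the `ErrorBudgetTracker` above; the window sizes, threshold, and `notifyOnCall` callback are illustrative assumptions:

```typescript
// burn-rate-check.ts (sketch; windows, threshold and notifyOnCall are assumptions)
import { ErrorBudgetTracker } from './error-budget';

const HOUR = 60 * 60;
const tracker = new ErrorBudgetTracker();

// Multi-window burn-rate check: alert only when both the short and the long
// window are burning error budget faster than the chosen threshold.
export function checkBurnRate(notifyOnCall: (msg: string) => void) {
  const shortWindow = tracker.calculateErrorBudget(1 * HOUR);
  const longWindow = tracker.calculateErrorBudget(6 * HOUR);

  const threshold = 14.4; // at this rate a 30-day budget is gone in roughly 2 days

  if (shortWindow.burnRate > threshold && longWindow.burnRate > threshold) {
    notifyOnCall(
      `Error budget burn rate ${shortWindow.burnRate.toFixed(1)}x over the last hour`
    );
  }
}
```

Requiring both windows to exceed the threshold suppresses pages for short error spikes that recover on their own, while still catching sustained burns quickly.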
## 🔄 Continuous Improvement

### 1. Regular Reviews

- **Daily**: Check dashboards, review overnight alerts
- **Weekly**: Analyze trends, update thresholds
- **Monthly**: SLO review, capacity planning
- **Quarterly**: Architecture review, tool evaluation

### 2. Runbook Automation

```typescript
// runbook.ts
export class AutomatedRunbook {
  async handleAlert(alert: Alert) {
    switch (alert.name) {
      case 'HighMemoryUsage':
        await this.clearCache();
        await this.restartIfNeeded();
        break;

      case 'APIDown':
        await this.checkAPIHealth();
        await this.attemptReconnection();
        await this.notifyOnCall();
        break;

      case 'DiskSpaceLow':
        await this.cleanupLogs();
        await this.archiveOldBackups();
        break;
    }
  }
}
```

### 3. Post-Incident Analysis

```markdown
## Incident Report Template

**Incident ID**: INC-2024-001
**Date**: 2024-01-15
**Duration**: 45 minutes
**Severity**: High

### Timeline
- 14:30 - Alert triggered
- 14:35 - Engineer acknowledged
- 14:45 - Root cause identified
- 15:00 - Fix deployed
- 15:15 - Incident resolved

### Root Cause
Memory leak in cache manager

### Action Items
- [ ] Fix memory leak
- [ ] Add memory profiling
- [ ] Update monitoring thresholds
- [ ] Create automated remediation
```

## 📋 Implementation Checklist

### Phase 1: Basic Monitoring (Week 1)
- [ ] Install Prometheus
- [ ] Configure Node Exporter
- [ ] Set up Grafana
- [ ] Create basic dashboards
- [ ] Configure critical alerts

### Phase 2: Advanced Monitoring (Week 2)
- [ ] Implement OpenTelemetry
- [ ] Add distributed tracing
- [ ] Set up centralized logging
- [ ] Create business metrics
- [ ] Configure alert routing

### Phase 3: Optimization (Week 3)
- [ ] Define SLOs
- [ ] Implement error budgets
- [ ] Create runbooks
- [ ] Automate remediation
- [ ] Performance baseline

### Phase 4: Maturity (Ongoing)
- [ ] Chaos engineering
- [ ] Predictive analytics
- [ ] Cost optimization
- [ ] Compliance reporting
- [ ] Continuous improvement

---

*This monitoring strategy ensures comprehensive observability of the OPNSense MCP Server, enabling proactive issue detection and rapid incident response.*
