# Operational Runbook - OPNSense MCP Server

## 🚨 Emergency Contacts

| Role | Name | Contact | When to Call |
|------|------|---------|--------------|
| On-Call Engineer | Rotation | PagerDuty | First contact for all incidents |
| Tech Lead | John Smith | +1-555-0100 | Architecture decisions |
| OPNsense Admin | Jane Doe | +1-555-0101 | Firewall access issues |
| DevOps Lead | Bob Wilson | +1-555-0102 | Infrastructure problems |
| Security Team | SOC | security@company | Security incidents |

## 🎯 Quick Actions

### Service is Down

```bash
# 1. Check if process is running
systemctl status opnsense-mcp

# 2. Restart service
systemctl restart opnsense-mcp

# 3. Check logs
journalctl -u opnsense-mcp -n 100

# 4. If still down, check dependencies
curl -k https://opnsense.local/api/core/system/status
redis-cli ping
```

### High Memory Usage

```bash
# 1. Check current usage
ps aux | grep opnsense-mcp

# 2. Clear cache
curl -X POST http://localhost:3000/admin/cache/clear

# 3. Restart with memory limit
systemctl stop opnsense-mcp
node --max-old-space-size=512 dist/index.js

# 4. If persists, check for memory leaks
node --inspect dist/index.js
# Open chrome://inspect and take heap snapshot
```

### API Connection Failed

```bash
# 1. Test OPNsense connectivity
ping opnsense.local
curl -k https://opnsense.local

# 2. Verify credentials
curl -u "$OPNSENSE_API_KEY:$OPNSENSE_API_SECRET" \
  https://opnsense.local/api/core/system/status

# 3. Check firewall rules
# SSH to OPNsense and verify API access is allowed

# 4. Restart with debug logging
DEBUG=opnsense:* node dist/index.js
```

## 📊 Standard Operating Procedures

### Daily Operations

#### Morning Health Check (9:00 AM)

```bash
#!/bin/bash
# morning-check.sh

echo "=== OPNSense MCP Daily Health Check ==="
echo "Date: $(date)"

# 1. Service Status
echo -e "\n1. Service Status:"
systemctl status opnsense-mcp --no-pager | head -10

# 2. Memory Usage
echo -e "\n2. Memory Usage:"
ps aux | grep opnsense-mcp | grep -v grep

# 3. Error Count (last 24h)
echo -e "\n3. Errors in last 24 hours:"
grep ERROR logs/error.log | grep "$(date -d yesterday '+%Y-%m-%d')" | wc -l

# 4. Cache Stats
echo -e "\n4. Cache Statistics:"
curl -s http://localhost:3000/debug/cache/stats | jq

# 5. API Response Time
echo -e "\n5. API Response Time:"
time curl -s http://localhost:3000/health > /dev/null

# 6. Disk Usage
echo -e "\n6. Disk Usage:"
df -h | grep -E "/$|/var"

echo -e "\n=== Check Complete ==="
```

#### Backup Procedure (Daily 2:00 AM)

```bash
#!/bin/bash
# backup.sh

BACKUP_DIR="/backups/opnsense-mcp/$(date +%Y%m%d)"
mkdir -p $BACKUP_DIR

# 1. Backup state
cp -r /opt/opnsense-mcp/data $BACKUP_DIR/

# 2. Backup configuration
cp /opt/opnsense-mcp/.env $BACKUP_DIR/

# 3. Export Redis cache
redis-cli --rdb $BACKUP_DIR/redis.rdb

# 4. Backup OPNsense config
curl -u "$OPNSENSE_API_KEY:$OPNSENSE_API_SECRET" \
  https://opnsense.local/api/core/backup/download \
  -o $BACKUP_DIR/opnsense-config.xml

# 5. Compress backup
tar -czf $BACKUP_DIR.tar.gz $BACKUP_DIR
rm -rf $BACKUP_DIR

# 6. Upload to S3
aws s3 cp $BACKUP_DIR.tar.gz s3://backups/opnsense-mcp/

# 7. Clean old backups (keep 30 days)
find /backups/opnsense-mcp -name "*.tar.gz" -mtime +30 -delete
```
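The health check, backup, and weekly maintenance scripts in this runbook are written to run unattended at the times given in their headings. A minimal scheduling sketch, assuming the scripts are installed under `/opt/scripts/` and logs go to `/var/log/opnsense-mcp/` (adjust both to your deployment; `weekly-maintenance.sh` is defined under Maintenance Procedures below):

```bash
# Example cron entries (crontab -e for the service account).
# Times match the headings: 09:00 daily, 02:00 daily, 03:00 every Sunday.
0 9 * * *  /opt/scripts/morning-check.sh        >> /var/log/opnsense-mcp/morning-check.log 2>&1
0 2 * * *  /opt/scripts/backup.sh               >> /var/log/opnsense-mcp/backup.log 2>&1
0 3 * * 0  /opt/scripts/weekly-maintenance.sh   >> /var/log/opnsense-mcp/maintenance.log 2>&1
```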
### Incident Response

#### Severity Levels

| Level | Description | Response Time | Examples |
|-------|-------------|---------------|----------|
| **P1** | Service Down | < 15 min | Complete outage, data loss |
| **P2** | Degraded Service | < 1 hour | Slow response, partial failure |
| **P3** | Minor Issue | < 4 hours | Single feature broken |
| **P4** | Low Priority | < 24 hours | Cosmetic issues |

#### P1 - Service Outage Response

```bash
# INCIDENT COMMANDER CHECKLIST
# [ ] Acknowledge incident in PagerDuty
# [ ] Join incident bridge: https://zoom.us/j/incident
# [ ] Start incident timeline
# [ ] Assign roles (Commander, Communicator, Investigator)

# IMMEDIATE ACTIONS (0-5 minutes)
# 1. Verify the outage
curl -m 5 http://localhost:3000/health || echo "SERVICE DOWN"

# 2. Check all components
systemctl status opnsense-mcp
redis-cli ping
curl -k https://opnsense.local/api/core/system/status

# 3. Attempt quick recovery
systemctl restart opnsense-mcp
sleep 10
curl http://localhost:3000/health

# INVESTIGATION (5-15 minutes)
# 4. Check recent changes
git log --oneline -n 10
kubectl rollout history deployment/opnsense-mcp

# 5. Review error logs
tail -n 1000 /var/log/opnsense-mcp/error.log | grep ERROR

# 6. Check system resources
top -bn1 | head -20
df -h
netstat -tuln | grep -E "3000|6379"

# RECOVERY ACTIONS (15+ minutes)
# 7. Rollback if needed
kubectl rollout undo deployment/opnsense-mcp
# OR
git checkout v0.7.0 && npm run build && pm2 restart opnsense-mcp

# 8. Escalate if not resolved
# Page: Tech Lead, DevOps Lead
# Open vendor ticket if OPNsense issue
```

#### P2 - Performance Degradation

```bash
# 1. Identify slow operations
grep "duration" logs/app.log | awk '{if ($NF > 1000) print}'

# 2. Check cache effectiveness
curl http://localhost:3000/debug/cache/stats

# 3. Monitor API latency
for i in {1..10}; do
  time curl -s http://localhost:3000/api/vlans > /dev/null
  sleep 1
done

# 4. Clear cache if needed
curl -X POST http://localhost:3000/admin/cache/clear

# 5. Increase resources if needed
pm2 scale opnsense-mcp 4  # Scale to 4 instances
```

### Maintenance Procedures

#### Weekly Maintenance (Sunday 3:00 AM)

```bash
#!/bin/bash
# weekly-maintenance.sh

echo "Starting weekly maintenance..."

# 1. Cleanup logs
find /var/log/opnsense-mcp -name "*.log" -mtime +7 -delete
journalctl --vacuum-time=7d

# 2. Optimize Redis
redis-cli BGREWRITEAOF
sleep 60
redis-cli MEMORY PURGE

# 3. Database maintenance
psql -U opnsense -d opnsense_mcp -c "VACUUM ANALYZE;"
psql -U opnsense -d opnsense_mcp -c "REINDEX DATABASE opnsense_mcp;"

# 4. Update dependencies (staging only)
if [ "$ENVIRONMENT" = "staging" ]; then
  npm audit fix
  npm update
  npm run build
fi

# 5. Test backup restore
/opt/scripts/test-backup-restore.sh

echo "Maintenance complete"
```

#### Monthly Tasks

1. **Security Patching**

   ```bash
   # Review and apply security updates
   npm audit
   apt update && apt list --upgradable
   # Apply patches during maintenance window
   ```

2. **Capacity Review**

   ```bash
   # Generate capacity report
   ./scripts/capacity-report.sh > reports/capacity-$(date +%Y%m).txt
   ```

3. **Certificate Renewal**

   ```bash
   # Check certificate expiry
   echo | openssl s_client -connect opnsense.local:443 2>/dev/null | \
     openssl x509 -noout -dates
   ```

### Monitoring Alerts

#### Alert: High Error Rate

```bash
# Triggered when: error_rate > 5% for 5 minutes

# 1. Identify error types
grep ERROR logs/error.log | tail -100 | \
  awk '{print $5}' | sort | uniq -c | sort -rn

# 2. Check specific endpoints
grep "status=5" logs/access.log | \
  awk '{print $7}' | sort | uniq -c | sort -rn

# 3. Temporary mitigation
# Enable circuit breaker
curl -X POST http://localhost:3000/admin/circuit-breaker/enable

# 4. Fix root cause
# Deploy fix or rollback
```
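Before acting on this alert, it can help to confirm the error rate by hand. A rough sketch, assuming the access log carries a `status=` field per request (as used in step 2 above); the `WINDOW` sample size is an arbitrary choice, adjust it and the log path as needed:

```bash
# Rough 5xx error-rate check over the most recent access-log entries.
WINDOW=2000
total=$(tail -n "$WINDOW" logs/access.log | wc -l)
errors=$(tail -n "$WINDOW" logs/access.log | grep -c "status=5")
if [ "$total" -gt 0 ]; then
  awk -v e="$errors" -v t="$total" \
    'BEGIN { printf "5xx error rate: %.1f%% (%d of %d requests)\n", 100 * e / t, e, t }'
fi
```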
#### Alert: Cache Miss Rate High

```bash
# Triggered when: cache_hit_rate < 50% for 10 minutes

# 1. Check cache stats
redis-cli INFO stats

# 2. Identify missing keys
redis-cli MONITOR
# Watch for 5 seconds, Ctrl+C
# Look for frequent GETs with nil responses

# 3. Warm cache
curl -X POST http://localhost:3000/admin/cache/warm

# 4. Adjust TTL if needed
# Edit config to increase cache TTL
```

#### Alert: Memory Usage High

```bash
# Triggered when: memory > 80% for 10 minutes

# 1. Get memory breakdown
cat /proc/$(pgrep -f opnsense-mcp)/status | grep VmRSS

# 2. Check for memory leaks
node --expose-gc --trace-gc dist/index.js

# 3. Force garbage collection
curl -X POST http://localhost:3000/admin/gc

# 4. Restart if necessary
pm2 restart opnsense-mcp --max-memory-restart 400M
```

### Deployment Procedures

#### Standard Deployment

```bash
#!/bin/bash
# deploy.sh

# PRE-DEPLOYMENT CHECKS
echo "[ ] Code reviewed and approved"
echo "[ ] Tests passing in CI/CD"
echo "[ ] Change ticket approved"
echo "[ ] Maintenance window scheduled"
echo "[ ] Rollback plan documented"

read -p "Continue? (y/n) " -n 1 -r
echo
if [[ ! $REPLY =~ ^[Yy]$ ]]; then exit 1; fi

# DEPLOYMENT STEPS
# 1. Backup current version
./scripts/backup-current.sh

# 2. Pull new code
git fetch
git checkout v0.8.0

# 3. Install dependencies
npm ci

# 4. Run migrations
npm run db:migrate

# 5. Build
npm run build

# 6. Run smoke tests
npm run test:smoke

# 7. Deploy (rolling update)
pm2 reload opnsense-mcp

# 8. Verify deployment
sleep 10
curl http://localhost:3000/version
curl http://localhost:3000/health

# 9. Monitor for 5 minutes
watch -n 10 'grep ERROR logs/error.log | tail -5'
```

#### Emergency Hotfix

```bash
# For critical production issues only

# 1. Create hotfix branch
git checkout -b hotfix/critical-fix

# 2. Apply fix
# ... make changes ...

# 3. Test locally
npm test

# 4. Deploy to one instance
pm2 stop opnsense-mcp-1
node dist/index.js &
HOTFIX_PID=$!

# 5. Test fix
curl http://localhost:3000/test-endpoint

# 6. If successful, deploy to all
pm2 reload opnsense-mcp

# 7. If failed, rollback
kill $HOTFIX_PID
pm2 start opnsense-mcp-1
```

### Disaster Recovery

#### Complete System Failure

```bash
# When: Total system failure, need to rebuild from scratch

# 1. Provision new infrastructure
terraform apply -auto-approve

# 2. Install dependencies
ansible-playbook -i inventory setup.yml

# 3. Restore from backup
aws s3 cp s3://backups/opnsense-mcp/latest.tar.gz .
tar -xzf latest.tar.gz

# 4. Restore state
cp backup/data/state.json /opt/opnsense-mcp/data/

# 5. Restore Redis (copy the RDB into the Redis data directory before starting it)
cp backup/redis.rdb /var/lib/redis/dump.rdb  # adjust path/dbfilename to your Redis config

# 6. Start services
systemctl start redis
systemctl start opnsense-mcp

# 7. Verify recovery
./scripts/health-check.sh

# 8. Update DNS/Load balancer
# Point traffic to new instance
```

#### Data Corruption Recovery

```bash
# When: State file or cache corrupted

# 1. Stop service
systemctl stop opnsense-mcp

# 2. Backup corrupted data
mv data/state.json data/state.corrupted.$(date +%s)

# 3. Restore from last known good
cp /backups/opnsense-mcp/$(date +%Y%m%d)/data/state.json data/

# 4. Clear cache
redis-cli FLUSHDB

# 5. Restart and rebuild cache
systemctl start opnsense-mcp
curl -X POST http://localhost:3000/admin/cache/warm

# 6. Verify data integrity
npm run test:integrity
```
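Both recovery procedures end by running `./scripts/health-check.sh`, which is not reproduced in this runbook. A minimal sketch of the checks such a script would need to cover, reusing the endpoints, credentials, and paths referenced elsewhere in this document (adjust to your deployment):

```bash
#!/bin/bash
# health-check.sh - sketch of post-recovery verification.
set -euo pipefail

echo "1. MCP service health endpoint"
curl -fsS -m 5 http://localhost:3000/health > /dev/null

echo "2. Redis reachable"
redis-cli ping | grep -q PONG

echo "3. OPNsense API reachable"
curl -ksS -m 10 -u "$OPNSENSE_API_KEY:$OPNSENSE_API_SECRET" \
  https://opnsense.local/api/core/system/status > /dev/null

echo "4. State file present and valid JSON"
jq -e . /opt/opnsense-mcp/data/state.json > /dev/null

echo "All checks passed"
```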
## 📈 Performance Tuning

### Quick Optimizations

```bash
# 1. Increase Node.js memory
export NODE_OPTIONS="--max-old-space-size=2048"

# 2. Enable clustering
pm2 start opnsense-mcp -i max

# 3. Optimize Redis
redis-cli CONFIG SET maxmemory 1gb
redis-cli CONFIG SET maxmemory-policy allkeys-lru

# 4. Enable compression
export ENABLE_COMPRESSION=true
export COMPRESSION_LEVEL=9

# 5. Increase cache TTL
export CACHE_DEFAULT_TTL=600
```

### Load Testing

```bash
# Run load test to find limits
npm run test:load

# Or use k6
k6 run load-test.js --vus 100 --duration 5m

# Monitor during test
watch -n 1 'ps aux | grep opnsense; echo; netstat -an | grep -c ESTABLISHED'
```

## 🔐 Security Procedures

### Security Incident Response

```bash
# When: Suspected security breach

# 1. ISOLATE
iptables -I INPUT 1 -s 10.0.0.0/8 -j ACCEPT  # Allow internal traffic only (inserted above the DROP rules)
iptables -A INPUT -j DROP                    # Block all other incoming
iptables -A OUTPUT -j DROP                   # Block all outgoing

# 2. PRESERVE EVIDENCE
tar -czf /evidence/incident-$(date +%s).tar.gz /var/log /opt/opnsense-mcp

# 3. INVESTIGATE
grep -r "401\|403" logs/
last -100
netstat -an | grep ESTABLISHED

# 4. ROTATE CREDENTIALS
# Generate new API keys in OPNsense
# Update all secrets in vault
# Restart services with new credentials

# 5. REPORT
# Email: security@company.com
# Include: timeline, affected systems, actions taken
```

### Regular Security Tasks

```bash
# Weekly security scan
npm audit
nmap -sV localhost
nikto -h http://localhost:3000

# Monthly penetration test (staging)
./scripts/pentest.sh

# Quarterly security review
# Review access logs
# Update security documentation
# Rotate credentials
```

## 📋 Checklists

### New Team Member Onboarding

- [ ] Add to PagerDuty rotation
- [ ] Grant access to monitoring dashboards
- [ ] Provide runbook training
- [ ] Shadow on-call engineer
- [ ] Complete incident response training

### Pre-Deployment Checklist

- [ ] All tests passing
- [ ] Security scan completed
- [ ] Performance benchmarks met
- [ ] Documentation updated
- [ ] Rollback plan prepared
- [ ] Stakeholders notified

### Post-Incident Checklist

- [ ] Incident resolved and verified
- [ ] Root cause identified
- [ ] Post-mortem scheduled
- [ ] Action items documented
- [ ] Monitoring improved
- [ ] Runbook updated

---

*This runbook is a living document. Update it after each incident with lessons learned.*

**Last Updated**: 2025-01-07
**Version**: 1.0
**Owner**: Operations Team
