# Monitoring & Management

**Production monitoring, alerting, and management strategies for MCP Template Platform deployments.**

## Overview

Effective monitoring is crucial for production MCP deployments. This guide covers comprehensive monitoring strategies, alerting setup, performance optimization, and operational best practices.

### Key Monitoring Areas

- **Deployment Health**: Container status, uptime, restart counts
- **Resource Usage**: CPU, memory, disk, and network utilization
- **MCP Protocol**: Tool availability, response times, error rates
- **Application Metrics**: Tool usage, success rates, custom metrics
- **Infrastructure**: Docker daemon, system resources, network connectivity

## Built-in Monitoring Features

### CLI-Based Monitoring

```bash
# Real-time status monitoring
mcpt status --watch --refresh 5

# Health-only monitoring
mcpt status --health-only

# Deployment overview
mcpt list --status

# Resource monitoring
mcpt status deployment-name --detailed
```

### Status Monitoring Dashboard

```bash
# Interactive status dashboard
mcpt dashboard
```

Example output:

```
╭─────────────────── MCP Platform Dashboard ───────────────────╮
│                                                               │
│ 🚀 Active Deployments: 12                                     │
│ ✅ Healthy: 10   ⚠️ Warning: 2   ❌ Failed: 0                 │
│ 📊 Total Memory: 2.4GB   💾 Total Disk: 15GB                  │
│                                                               │
╰───────────────────────────────────────────────────────────────╯

┌──────────────────┬─────────────┬────────────┬──────────────────┐
│ Deployment       │ Template    │ Status     │ Resource Usage   │
├──────────────────┼─────────────┼────────────┼──────────────────┤
│ filesystem-prod  │ filesystem  │ ✅ Healthy │ 🟢 15% CPU, 180MB│
│ github-api       │ github      │ ✅ Healthy │ 🟢 8% CPU, 95MB  │
│ database-conn    │ database    │ ⚠️ Warning │ 🟡 65% CPU, 340MB│
│ slack-bot        │ slack       │ ✅ Healthy │ 🟢 12% CPU, 120MB│
└──────────────────┴─────────────┴────────────┴──────────────────┘

🔄 Auto-refresh: 5s   ⏰ Last Update: 2025-01-27 16:47:30 UTC
📊 Press 'd' for details, 'q' to quit, 'r' to refresh now
```

### Automated Health Checks

```bash
# Set up automated health monitoring
mcpt monitor --config health-check.json
```

Example `health-check.json`:

```json
{
  "interval": 30,
  "deployments": ["critical-app", "filesystem-prod"],
  "actions": {
    "on_failure": "restart",
    "on_warning": "alert",
    "max_restarts": 3
  },
  "notifications": {
    "email": "admin@company.com",
    "webhook": "https://hooks.slack.com/services/..."
  }
}
```
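If `mcpt monitor` is unavailable in your CLI version, a short polling loop can approximate the same restart-on-failure behavior. The sketch below assumes the status JSON shape shown elsewhere in this guide; the `restart` subcommand is a hypothetical stand-in, so substitute your actual recovery action.

```python
#!/usr/bin/env python3
"""Minimal health watchdog sketch: poll status, restart unhealthy deployments."""

import json
import subprocess
import time

WATCHED = ["critical-app", "filesystem-prod"]  # deployments to guard
MAX_RESTARTS = 3
INTERVAL = 30  # seconds; mirrors "interval" in health-check.json

def get_health():
    """Return {deployment_id: health} from `mcpt status --format json`."""
    result = subprocess.run(
        ['python', '-m', 'mcp_template', 'status', '--format', 'json'],
        capture_output=True, text=True)
    if result.returncode != 0:
        return {}
    data = json.loads(result.stdout)
    return {d['deployment_id']: d['status']['health']
            for d in data.get('deployments', [])}

if __name__ == "__main__":
    restart_counts = {name: 0 for name in WATCHED}
    while True:
        health = get_health()
        for name in WATCHED:
            # Treat a missing deployment the same as a critical one
            if health.get(name) in ('critical', None) and restart_counts[name] < MAX_RESTARTS:
                # Hypothetical `restart` subcommand; replace with your recovery action
                subprocess.run(['python', '-m', 'mcp_template', 'restart', name])
                restart_counts[name] += 1
        time.sleep(INTERVAL)
```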
## Logging & Log Management

### Centralized Logging

```bash
# Stream logs from all deployments
mcpt logs --all --follow

# Filter logs by severity
mcpt logs --all --filter "ERROR|WARN"

# Export logs for analysis
mcpt logs deployment --since 24h --format json > logs.json
```

### Log Aggregation Setup

```bash
# Forward logs to external systems
mcpt logs deployment --format json --follow | \
  curl -X POST -H "Content-Type: application/json" \
    --data-binary @- \
    https://logs.company.com/mcp-platform
```
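The `curl` pipeline above is fine for ad-hoc forwarding, but it sends one request per run and gives you no batching or retries. A slightly more robust forwarder can be sketched in Python; the endpoint URL, batch size, and retry policy here are placeholder assumptions to tune for your log backend.

```python
#!/usr/bin/env python3
"""Sketch: forward `mcpt logs` JSON output to an HTTP log collector in batches."""

import json
import subprocess
import time

import requests

ENDPOINT = "https://logs.company.com/mcp-platform"  # placeholder collector URL
BATCH_SIZE = 50

proc = subprocess.Popen(
    ['python', '-m', 'mcp_template', 'logs', 'deployment',
     '--format', 'json', '--follow'],
    stdout=subprocess.PIPE, text=True)

batch = []
for line in proc.stdout:
    line = line.strip()
    if not line:
        continue
    try:
        batch.append(json.loads(line))
    except json.JSONDecodeError:
        continue  # skip non-JSON noise

    if len(batch) >= BATCH_SIZE:
        # Simple bounded retry; a production forwarder would also spool to disk
        for attempt in range(3):
            try:
                requests.post(ENDPOINT, json=batch, timeout=10).raise_for_status()
                break
            except requests.RequestException:
                time.sleep(2 ** attempt)
        batch = []
```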
### Log Analysis Scripts

```python
#!/usr/bin/env python3
"""Analyze MCP deployment logs for patterns and issues."""

import json
import subprocess
from collections import defaultdict


def analyze_deployment_logs(deployment_name, hours=24):
    """Analyze logs for a specific deployment."""
    # Get logs
    result = subprocess.run([
        'python', '-m', 'mcp_template', 'logs', deployment_name,
        '--since', f'{hours}h', '--format', 'json'
    ], capture_output=True, text=True)

    if result.returncode != 0:
        print(f"Error getting logs: {result.stderr}")
        return

    # Parse and analyze
    logs = []
    for line in result.stdout.strip().split('\n'):
        if line:
            try:
                logs.append(json.loads(line))
            except json.JSONDecodeError:
                continue

    # Analysis
    error_count = len([log for log in logs if log.get('level') == 'ERROR'])
    warning_count = len([log for log in logs if log.get('level') == 'WARN'])

    # Tool usage analysis
    tool_usage = defaultdict(int)
    for log in logs:
        if 'tool_call' in log.get('message', ''):
            tool_name = extract_tool_name(log['message'])
            if tool_name:
                tool_usage[tool_name] += 1

    # Report
    print(f"📊 Log Analysis for {deployment_name} (last {hours}h)")
    print(f"📝 Total entries: {len(logs)}")
    print(f"❌ Errors: {error_count}")
    print(f"⚠️ Warnings: {warning_count}")

    if tool_usage:
        print("\n🛠️ Tool Usage:")
        for tool, count in sorted(tool_usage.items(), key=lambda x: x[1], reverse=True):
            print(f"  {tool}: {count} calls")


def extract_tool_name(message):
    """Extract tool name from log message."""
    # Implementation depends on log format
    if "tool_call:" in message:
        return message.split("tool_call:")[1].split()[0]
    return None


if __name__ == "__main__":
    import sys

    deployment = sys.argv[1] if len(sys.argv) > 1 else "all"
    hours = int(sys.argv[2]) if len(sys.argv) > 2 else 24

    if deployment == "all":
        # Analyze all deployments
        result = subprocess.run([
            'python', '-m', 'mcp_template', 'list', '--format', 'json'
        ], capture_output=True, text=True)
        deployments = json.loads(result.stdout)
        for dep in deployments:
            analyze_deployment_logs(dep['id'], hours)
            print()
    else:
        analyze_deployment_logs(deployment, hours)
```

## Metrics Collection

### System Metrics

```bash
# Collect system metrics
mcpt metrics --output prometheus

# Example Prometheus metrics:
# mcp_deployment_status{deployment="filesystem"} 1
# mcp_deployment_uptime_seconds{deployment="filesystem"} 86400
# mcp_deployment_memory_bytes{deployment="filesystem"} 185073664
# mcp_deployment_cpu_percent{deployment="filesystem"} 5.2
# mcp_tool_calls_total{deployment="filesystem",tool="read_file"} 157
# mcp_tool_errors_total{deployment="filesystem",tool="read_file"} 2
```

### Custom Metrics

```python
#!/usr/bin/env python3
"""Custom metrics collection for MCP deployments."""

import json
import subprocess
import time

from prometheus_client import start_http_server, Gauge, Counter, Histogram

# Prometheus metrics
deployment_status = Gauge('mcp_deployment_status',
                          'Deployment status (1=healthy, 0.5=warning, 0=error)',
                          ['deployment', 'template'])
deployment_uptime = Gauge('mcp_deployment_uptime_seconds',
                          'Deployment uptime in seconds', ['deployment'])
deployment_memory = Gauge('mcp_deployment_memory_bytes',
                          'Memory usage in bytes', ['deployment'])
deployment_cpu = Gauge('mcp_deployment_cpu_percent',
                       'CPU usage percentage', ['deployment'])
tool_calls = Counter('mcp_tool_calls_total',
                     'Total tool calls', ['deployment', 'tool'])
tool_errors = Counter('mcp_tool_errors_total',
                      'Total tool errors', ['deployment', 'tool'])
tool_duration = Histogram('mcp_tool_duration_seconds',
                          'Tool execution duration', ['deployment', 'tool'])


def collect_metrics():
    """Collect metrics from all deployments."""
    try:
        # Get deployment status
        result = subprocess.run([
            'python', '-m', 'mcp_template', 'status', '--format', 'json'
        ], capture_output=True, text=True)

        if result.returncode == 0:
            status_data = json.loads(result.stdout)

            for deployment in status_data.get('deployments', []):
                deployment_id = deployment['deployment_id']
                template_name = deployment['template']['name']

                # Status metrics
                health_value = {
                    'healthy': 1.0,
                    'warning': 0.5,
                    'critical': 0.0,
                    'unknown': -1.0
                }.get(deployment['status']['health'], -1.0)

                deployment_status.labels(
                    deployment=deployment_id,
                    template=template_name
                ).set(health_value)

                # Resource metrics
                deployment_uptime.labels(deployment=deployment_id).set(
                    deployment['status'].get('uptime_seconds', 0)
                )

                container = deployment.get('container', {})
                deployment_memory.labels(deployment=deployment_id).set(
                    container.get('memory_usage', 0)
                )
                deployment_cpu.labels(deployment=deployment_id).set(
                    container.get('cpu_percent', 0)
                )

        # Collect tool metrics from logs
        collect_tool_metrics()

    except Exception as e:
        print(f"Error collecting metrics: {e}")


def collect_tool_metrics():
    """Collect tool usage metrics from recent logs."""
    # Get recent logs and parse for tool usage.
    # NOTE: polling a 5-minute window every 30 seconds re-reads the same
    # lines, so the counters will overcount unless entries are deduplicated
    # (e.g. by tracking the last-seen timestamp).
    result = subprocess.run([
        'python', '-m', 'mcp_template', 'logs', '--all',
        '--since', '5m', '--format', 'json'
    ], capture_output=True, text=True)

    if result.returncode == 0:
        for line in result.stdout.strip().split('\n'):
            if line:
                try:
                    log_entry = json.loads(line)
                    parse_tool_log_entry(log_entry)
                except json.JSONDecodeError:
                    continue


def parse_tool_log_entry(log_entry):
    """Parse log entry for tool metrics."""
    message = log_entry.get('message', '')
    deployment = log_entry.get('deployment', 'unknown')

    # Tool call tracking
    if 'tool_call:' in message:
        tool_name = message.split('tool_call:')[1].split()[0]
        tool_calls.labels(deployment=deployment, tool=tool_name).inc()

    # Tool error tracking
    if 'tool_error:' in message:
        tool_name = message.split('tool_error:')[1].split()[0]
        tool_errors.labels(deployment=deployment, tool=tool_name).inc()

    # Tool duration tracking
    if 'tool_duration:' in message:
        parts = message.split('tool_duration:')[1].split()
        tool_name = parts[0]
        duration = float(parts[1].replace('s', ''))
        tool_duration.labels(deployment=deployment, tool=tool_name).observe(duration)


if __name__ == '__main__':
    # Start Prometheus metrics server
    start_http_server(8000)
    print("Metrics server started on port 8000")

    # Collect metrics every 30 seconds
    while True:
        collect_metrics()
        time.sleep(30)
```
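Once the exporter is running, you can smoke-test it by scraping its own `/metrics` endpoint before pointing Prometheus at it. The snippet below uses `prometheus_client`'s text parser; `localhost:8000` matches the `start_http_server(8000)` call above.

```python
#!/usr/bin/env python3
"""Smoke-test sketch: scrape the exporter above and print its MCP metrics."""

import requests
from prometheus_client.parser import text_string_to_metric_families

# Port matches start_http_server(8000) in the exporter above
body = requests.get("http://localhost:8000/metrics", timeout=5).text

for family in text_string_to_metric_families(body):
    if family.name.startswith("mcp_"):
        for sample in family.samples:
            print(sample.name, sample.labels, sample.value)
```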
## Alerting & Notifications

### Alert Configuration

```yaml
# alerts.yaml
alerts:
  - name: deployment_down
    condition: mcp_deployment_status == 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "MCP deployment {{ $labels.deployment }} is down"
      description: "Deployment {{ $labels.deployment }} has been down for more than 5 minutes"

  - name: high_memory_usage
    condition: mcp_deployment_memory_bytes / (1024*1024*1024) > 1
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "High memory usage for {{ $labels.deployment }}"
      description: "Memory usage is {{ $value }}GB for deployment {{ $labels.deployment }}"

  - name: high_error_rate
    condition: rate(mcp_tool_errors_total[5m]) > 0.1
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High error rate for {{ $labels.deployment }}"
      description: "Error rate is {{ $value }} per second for deployment {{ $labels.deployment }}"
```

### Notification Integration

```python
#!/usr/bin/env python3
"""Alert notification system for MCP deployments."""

import json
import smtplib
import subprocess
from datetime import datetime
from email.mime.text import MIMEText

import requests


class AlertManager:
    def __init__(self, config_file='alert-config.json'):
        with open(config_file) as f:
            self.config = json.load(f)

    def check_deployments(self):
        """Check all deployments and send alerts if needed."""
        result = subprocess.run([
            'python', '-m', 'mcp_template', 'status', '--format', 'json'
        ], capture_output=True, text=True)

        if result.returncode == 0:
            status_data = json.loads(result.stdout)
            for deployment in status_data.get('deployments', []):
                self.check_deployment_health(deployment)

    def check_deployment_health(self, deployment):
        """Check individual deployment health and alert if needed."""
        health = deployment['status']['health']

        # Check for critical issues
        if health == 'critical' or deployment['status']['state'] == 'failed':
            self.send_critical_alert(deployment)
        # Check for warnings
        elif health == 'warning':
            self.send_warning_alert(deployment)

        # Check resource usage
        container = deployment.get('container', {})
        memory_usage = container.get('memory_usage', 0)
        memory_limit = container.get('memory_limit', 1024*1024*1024)  # 1GB default

        if memory_usage / memory_limit > 0.9:
            self.send_resource_alert(deployment, 'memory',
                                     memory_usage / memory_limit * 100)

    def send_critical_alert(self, deployment):
        """Send critical alert for deployment failure."""
        message = {
            'severity': 'critical',
            'deployment': deployment['deployment_id'],
            'template': deployment['template']['name'],
            'status': deployment['status']['state'],
            'health': deployment['status']['health'],
            'timestamp': datetime.utcnow().isoformat()
        }
        self.send_notification('Critical: MCP Deployment Down', message)

    def send_warning_alert(self, deployment):
        """Send warning alert for deployment issues."""
        message = {
            'severity': 'warning',
            'deployment': deployment['deployment_id'],
            'template': deployment['template']['name'],
            'health': deployment['status']['health'],
            'timestamp': datetime.utcnow().isoformat()
        }
        self.send_notification('Warning: MCP Deployment Issue', message)

    def send_resource_alert(self, deployment, resource_type, usage_percent):
        """Send alert for high resource usage."""
        message = {
            'severity': 'warning',
            'deployment': deployment['deployment_id'],
            'resource_type': resource_type,
            'usage_percent': usage_percent,
            'timestamp': datetime.utcnow().isoformat()
        }
        self.send_notification(f'High {resource_type.title()} Usage', message)

    def send_notification(self, subject, message):
        """Send notification via configured channels."""
        # Slack webhook
        if 'slack_webhook' in self.config:
            self.send_slack_notification(subject, message)

        # Email
        if 'email' in self.config:
            self.send_email_notification(subject, message)

        # PagerDuty
        if 'pagerduty_key' in self.config and message['severity'] == 'critical':
            self.send_pagerduty_alert(subject, message)

    def send_slack_notification(self, subject, message):
        """Send Slack notification."""
        webhook_url = self.config['slack_webhook']
        color = 'danger' if message['severity'] == 'critical' else 'warning'

        payload = {
            'attachments': [{
                'color': color,
                'title': subject,
                'fields': [
                    {'title': 'Deployment', 'value': message['deployment'], 'short': True},
                    {'title': 'Severity', 'value': message['severity'].upper(), 'short': True},
                    {'title': 'Time', 'value': message['timestamp'], 'short': True}
                ],
                'text': json.dumps(message, indent=2)
            }]
        }
        requests.post(webhook_url, json=payload)

    def send_email_notification(self, subject, message):
        """Send email notification."""
        email_config = self.config['email']

        msg = MIMEText(json.dumps(message, indent=2))
        msg['Subject'] = f"[MCP Platform] {subject}"
        msg['From'] = email_config['from']
        msg['To'] = email_config['to']

        server = smtplib.SMTP(email_config['smtp_server'], email_config['smtp_port'])
        if email_config.get('username'):
            server.starttls()
            server.login(email_config['username'], email_config['password'])
        server.send_message(msg)
        server.quit()

    def send_pagerduty_alert(self, subject, message):
        """Send PagerDuty alert for critical issues."""
        routing_key = self.config['pagerduty_key']

        payload = {
            'routing_key': routing_key,
            'event_action': 'trigger',
            'payload': {
                'summary': subject,
                'source': 'mcp-platform',
                'severity': 'critical',
                'custom_details': message
            }
        }
        requests.post('https://events.pagerduty.com/v2/enqueue', json=payload)


# Example usage
if __name__ == '__main__':
    alert_manager = AlertManager()
    alert_manager.check_deployments()
```
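As written, `AlertManager` re-fires the same alert on every run, which quickly produces alert fatigue if you schedule it from cron. A common refinement is a per-deployment cooldown; here is a minimal sketch that persists last-sent timestamps to a JSON file (the file name and 15-minute window are arbitrary choices).

```python
import json
import time
from pathlib import Path

STATE_FILE = Path("alert-state.json")  # arbitrary location for last-sent timestamps
COOLDOWN = 15 * 60  # seconds between repeat alerts for the same deployment

def should_alert(deployment_id):
    """Return True if no alert was sent for this deployment within COOLDOWN."""
    state = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}
    last_sent = state.get(deployment_id, 0)
    if time.time() - last_sent < COOLDOWN:
        return False
    state[deployment_id] = time.time()
    STATE_FILE.write_text(json.dumps(state))
    return True
```

Calling `should_alert(deployment_id)` at the top of `send_critical_alert` and `send_warning_alert`, and returning early when it is `False`, keeps repeated failures from paging the same person every minute.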
## Performance Monitoring

### Resource Optimization

```bash
# Monitor resource usage trends
mcpt metrics --format csv --duration 24h > usage-trends.csv

# Analyze performance bottlenecks
mcpt analyze-performance deployment-name

# Optimize resource allocation
mcpt deploy template --memory 512m --cpu 0.5 --optimize
```

### Performance Benchmarking

```python
#!/usr/bin/env python3
"""Performance benchmarking for MCP deployments."""

import json
import statistics
import subprocess
import time
from concurrent.futures import ThreadPoolExecutor


def benchmark_deployment(deployment_name, test_duration=60):
    """Benchmark a deployment's performance."""
    print(f"🚀 Starting benchmark for {deployment_name}")

    # Verify the deployment is reachable before benchmarking
    result = subprocess.run([
        'python', '-m', 'mcp_template', 'status', deployment_name, '--format', 'json'
    ], capture_output=True, text=True)

    if result.returncode != 0:
        print(f"❌ Failed to get deployment status: {result.stderr}")
        return

    # Get available tools
    result = subprocess.run([
        'python', '-m', 'mcp_template', 'tools', deployment_name, '--format', 'json'
    ], capture_output=True, text=True)

    if result.returncode != 0:
        print(f"❌ Failed to get tools: {result.stderr}")
        return

    tools_info = json.loads(result.stdout)
    tools = [tool['name'] for tool in tools_info.get('tools', [])]

    if not tools:
        print("❌ No tools found for benchmarking")
        return

    print(f"📊 Benchmarking {len(tools)} tools for {test_duration}s")

    # Run benchmark
    results = {}
    start_time = time.time()

    with ThreadPoolExecutor(max_workers=5) as executor:
        while time.time() - start_time < test_duration:
            futures = []

            for tool in tools[:3]:  # Test top 3 tools
                future = executor.submit(call_tool, deployment_name, tool)
                futures.append((tool, future))

            for tool, future in futures:
                try:
                    duration = future.result(timeout=10)
                    if tool not in results:
                        results[tool] = []
                    results[tool].append(duration)
                except Exception as e:
                    print(f"⚠️ Tool {tool} failed: {e}")

            time.sleep(1)  # Rate limiting

    # Analyze results
    print("\n📈 Benchmark Results:")
    for tool, durations in results.items():
        if durations:
            avg_duration = statistics.mean(durations)
            median_duration = statistics.median(durations)
            p95_duration = sorted(durations)[int(len(durations) * 0.95)]

            print(f"🛠️ {tool}:")
            print(f"   Calls: {len(durations)}")
            print(f"   Avg: {avg_duration:.3f}s")
            print(f"   Median: {median_duration:.3f}s")
            print(f"   P95: {p95_duration:.3f}s")

    # Resource usage during benchmark
    final_status = subprocess.run([
        'python', '-m', 'mcp_template', 'status', deployment_name, '--format', 'json'
    ], capture_output=True, text=True)

    if final_status.returncode == 0:
        final_info = json.loads(final_status.stdout)
        container = final_info.get('container', {})
        print("\n💾 Resource Usage:")
        print(f"   Memory: {container.get('memory_usage', 0) / 1024 / 1024:.1f} MB")
        print(f"   CPU: {container.get('cpu_percent', 0):.1f}%")


def call_tool(deployment_name, tool_name):
    """Call a specific tool and measure response time."""
    start_time = time.time()
    # This would need to be implemented based on your tool calling mechanism.
    # For now, we simulate a tool call.
    time.sleep(0.1)  # Simulate tool execution
    return time.time() - start_time


if __name__ == "__main__":
    import sys

    deployment = sys.argv[1] if len(sys.argv) > 1 else "demo"
    duration = int(sys.argv[2]) if len(sys.argv) > 2 else 60

    benchmark_deployment(deployment, duration)
```
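`call_tool` above is a simulated stand-in. If your deployments expose an HTTP transport, a real implementation can issue an MCP `tools/call` request over JSON-RPC; the endpoint URL in this sketch is an assumption about your deployment, not part of the platform.

```python
import time

import requests

def call_tool_http(endpoint, tool_name, arguments=None):
    """Time one MCP `tools/call` round trip over an assumed HTTP JSON-RPC endpoint."""
    payload = {
        "jsonrpc": "2.0",
        "id": 1,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments or {}},
    }
    start = time.time()
    response = requests.post(endpoint, json=payload, timeout=10)
    response.raise_for_status()
    result = response.json()
    if "error" in result:
        raise RuntimeError(f"tool call failed: {result['error']}")
    return time.time() - start

# Example (endpoint URL is a placeholder for your deployment's HTTP transport):
# duration = call_tool_http("http://localhost:7071/mcp", "read_file",
#                           {"path": "/tmp/example.txt"})
```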
## Production Best Practices

### Deployment Strategies

```bash
# Blue-green deployment
mcpt deploy template --name template-blue
mcpt deploy template --name template-green

# Rolling updates
mcpt deploy template --strategy rolling --instances 3

# Canary deployment
mcpt deploy template --canary 10%
```

### Backup & Recovery

```bash
# Backup deployment configurations
mcpt backup --output backup-$(date +%Y%m%d).tar.gz

# Backup specific deployment
mcpt backup deployment-name --include-data

# Restore from backup
mcpt restore backup-20250127.tar.gz

# Disaster recovery
mcpt restore --disaster-recovery --cluster-config
```

### High Availability Setup

```yaml
# ha-config.yaml
high_availability:
  load_balancer:
    enabled: true
    algorithm: round_robin
    health_check_interval: 30s

  replication:
    min_replicas: 2
    max_replicas: 5
    scale_trigger: cpu_usage > 70%

  failover:
    enabled: true
    timeout: 30s
    max_failures: 3
```

```bash
# Deploy with high availability
mcpt deploy template --ha-config ha-config.yaml
```

## Integration with External Systems

### Grafana Dashboard

```json
{
  "dashboard": {
    "title": "MCP Platform Monitoring",
    "panels": [
      {
        "title": "Deployment Status",
        "type": "stat",
        "targets": [{
          "expr": "mcp_deployment_status",
          "legendFormat": "{{deployment}}"
        }]
      },
      {
        "title": "Memory Usage",
        "type": "graph",
        "targets": [{
          "expr": "mcp_deployment_memory_bytes / 1024 / 1024",
          "legendFormat": "{{deployment}} Memory (MB)"
        }]
      },
      {
        "title": "Tool Call Rate",
        "type": "graph",
        "targets": [{
          "expr": "rate(mcp_tool_calls_total[5m])",
          "legendFormat": "{{deployment}}/{{tool}}"
        }]
      }
    ]
  }
}
```

### ELK Stack Integration

```yaml
# filebeat.yml
filebeat.inputs:
  - type: docker
    containers.ids:
      - "*"
    containers.path: "/var/lib/docker/containers"
    containers.stream: "stdout"
    json.keys_under_root: true
    json.add_error_key: true

processors:
  - add_docker_metadata: ~

output.elasticsearch:
  hosts: ["elasticsearch:9200"]
  index: "mcp-platform-%{+yyyy.MM.dd}"

logging.level: info
```

By implementing comprehensive monitoring and management practices, you can ensure reliable, performant, and observable MCP Platform deployments in production environments.
