Solid Multi-Tenant DevOps MCP Server


# AI-First DevOps Error Management System

## Overview

This system captures, logs, and enables AI agents (Claude, ADA, Morgan) to automatically diagnose and fix infrastructure issues.

---

## Architecture

```
┌─────────────────────────────────────────────────────────────┐
│  Services (MCP Server, Backend, Agents, etc.)               │
│  - Capture errors automatically                             │
│  - Log to central error database                            │
│  - Trigger alerts for critical issues                       │
└────────────────────┬────────────────────────────────────────┘
                     ↓
┌─────────────────────────────────────────────────────────────┐
│  Error Logging System (MCP Server)                          │
│  - POST /api/errors/log     - Log new errors                │
│  - GET /api/errors          - List all errors               │
│  - GET /api/errors/:id      - Get error details             │
│  - POST /api/errors/:id/fix - Attempt auto-fix              │
└────────────────────┬────────────────────────────────────────┘
                     ↓
┌─────────────────────────────────────────────────────────────┐
│  AI Agents (via MCP Tools)                                  │
│  - Morgan (DevOps) - Monitors and fixes infrastructure      │
│  - ADA (Orchestrator) - Delegates to appropriate agent      │
│  - Claude - Can query errors and suggest fixes              │
└─────────────────────────────────────────────────────────────┘
```

---

## Error Logging

### 1. Error Structure

```json
{
  "id": "uuid",
  "timestamp": "2025-10-13T13:45:00Z",
  "service": "mcp-server",
  "severity": "critical|error|warning|info",
  "error_type": "connection_failed|health_check_failed|api_error",
  "message": "PostgreSQL health check failed",
  "details": {
    "service_name": "PostgreSQL",
    "port": 5432,
    "attempted_protocol": "http",
    "expected_protocol": "postgresql"
  },
  "stack_trace": "...",
  "company_id": 1,
  "resolved": false,
  "resolution_notes": null,
  "auto_fix_attempted": false,
  "auto_fix_successful": false
}
```

### 2. Automatic Error Capture

All services will automatically log errors to the MCP server:

```javascript
// Example: Automatic error logging
try {
  await checkHealth(service);
} catch (error) {
  await logError({
    service: 'mcp-server',
    severity: 'error',
    error_type: 'health_check_failed',
    message: error.message,
    details: { service_name: service.name }
  });
}
```

---

## MCP Endpoints for Error Management

### 1. Log Error

```bash
POST http://localhost:8092/api/errors/log
Content-Type: application/json

{
  "service": "mcp-server",
  "severity": "error",
  "error_type": "health_check_failed",
  "message": "PostgreSQL connection failed",
  "details": {...}
}
```

### 2. List Errors

```bash
GET http://localhost:8092/api/errors?severity=critical&resolved=false&limit=50
```

Response:

```json
{
  "total": 10,
  "errors": [
    {
      "id": "uuid",
      "timestamp": "2025-10-13T13:45:00Z",
      "service": "mcp-server",
      "severity": "error",
      "message": "PostgreSQL health check failed",
      "resolved": false
    }
  ]
}
```

### 3. Get Error Details

```bash
GET http://localhost:8092/api/errors/:id
```

### 4. Attempt Auto-Fix

```bash
POST http://localhost:8092/api/errors/:id/fix
```

The system will:

1. Analyze the error type
2. Apply known fixes (restart service, clear cache, etc.)
3. Re-test the service
4. Update error status

### 5. Mark as Resolved

```bash
POST http://localhost:8092/api/errors/:id/resolve
Content-Type: application/json

{
  "resolution_notes": "Fixed by restarting PostgreSQL container"
}
```

---

## Auto-Healing System

### Common Fixes

1. **Service Not Responding**
   - Restart Docker container
   - Check resource limits
   - Clear connection pool
2. **Health Check Failed**
   - Verify correct protocol
   - Check firewall rules
   - Test connectivity
3. **Database Connection Issues**
   - Restart database
   - Clear connection pool
   - Check credentials
4. **Out of Memory**
   - Restart service
   - Increase memory limits
   - Clear caches

### Auto-Fix Flow

```
1. Error Detected
   ↓
2. Log to Error Database
   ↓
3. Check if auto-fixable
   ↓
4. Apply fix
   ↓
5. Re-test service
   ↓
6. Update error status
   ↓
7. Notify Morgan (DevOps agent) if fix failed
```

---

## Integration with AI Agents

### Morgan (DevOps Agent) - MCP Tool

```json
{
  "name": "devops.errors.list",
  "description": "List infrastructure errors for Morgan to investigate",
  "inputSchema": {
    "type": "object",
    "properties": {
      "severity": {
        "type": "string",
        "enum": ["critical", "error", "warning"]
      },
      "resolved": { "type": "boolean" },
      "limit": { "type": "number" }
    }
  }
}
```

```json
{
  "name": "devops.errors.fix",
  "description": "Attempt to auto-fix an infrastructure error",
  "inputSchema": {
    "type": "object",
    "properties": {
      "error_id": { "type": "string" }
    },
    "required": ["error_id"]
  }
}
```

### ADA (Orchestrator) - Error Delegation

When a critical error occurs:

1. ADA receives notification
2. Analyzes error type
3. Delegates to appropriate agent:
   - Morgan (DevOps) → Infrastructure issues
   - Alex (Finance) → Billing issues
   - Devon (Strategy) → Business logic issues

---

## Usage Examples

### Example 1: Claude Investigates Errors

```
User: "Why is PostgreSQL showing as disconnected?"

Claude (via MCP):
1. Calls devops.errors.list with severity="error"
2. Sees PostgreSQL health check is failing
3. Reads error details
4. Identifies issue: Using HTTP on non-HTTP service
5. Suggests fix: Update health check to use pg_isready
```

### Example 2: Morgan Auto-Fixes Issue

```
1. Morgan monitors /api/errors
2. Detects: Redis connection pool exhausted
3. Calls devops.errors.fix with error_id
4. System automatically:
   - Clears Redis connection pool
   - Restarts Redis if needed
   - Re-tests connection
5. Morgan logs resolution
```

### Example 3: ADA Orchestrates Multi-Agent Fix

```
1. Critical error: Payment processing failing
2. ADA analyzes error
3. Delegates:
   - Morgan → Check API health
   - Alex → Verify billing service
   - Devon → Check business rules
4. Each agent reports back
5. ADA coordinates fix
```

---

## Dashboard Integration

The Super Admin dashboard at `http://localhost:8080/admin` will show:

1. **Error Feed** - Real-time errors across all services
2. **Auto-Fix Log** - History of automatic fixes
3. **Agent Activity** - Which agents are investigating/fixing issues
4. **Health Trends** - Error rates over time

---

## Next Steps

1. ✅ Create error logging endpoints in MCP server
2. ✅ Fix PostgreSQL/Redis health checks
3. ✅ Add error capture middleware
4. ✅ Create auto-healing system
5. ✅ Integrate with Morgan (DevOps agent)
6. ✅ Add dashboard UI for error management

---

**Status**: 🚧 In Development

**Target**: AI-first DevOps - errors are automatically logged and fixed by AI agents
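
Steps 3 to 7 of the Auto-Fix Flow described above (check if auto-fixable, apply fix, re-test, update status, notify Morgan on failure) can be sketched in JavaScript. This is a minimal illustration under stated assumptions, not the server's actual implementation: `FIXES_BY_TYPE`, `runAutoFix`, and the injected `applyFix`, `retest`, and `notify` callbacks are hypothetical names, the fix labels are borrowed loosely from the Common Fixes list, and notifying when no fix is known is an assumption the document does not spell out. The sketch is synchronous for brevity; a real version would likely be `async`.

```javascript
// Hypothetical mapping from error_type to an ordered list of fix labels.
// The labels are illustrative only; they are not real commands or APIs.
const FIXES_BY_TYPE = {
  connection_failed: ["clear_connection_pool", "restart_service"],
  health_check_failed: ["verify_protocol", "test_connectivity"],
};

// Runs steps 3-7 of the flow for one already-logged error record.
// applyFix(name, error), retest(error) -> boolean, and notify(error)
// are supplied by the caller, so the sketch has no side effects of its own.
function runAutoFix(error, { applyFix, retest, notify }) {
  const fixes = FIXES_BY_TYPE[error.error_type] || [];
  const result = { ...error }; // copy so the caller's record is not mutated

  if (fixes.length === 0) {
    // Step 3: not auto-fixable. Notifying here is an assumption.
    notify(result);
    return result;
  }

  result.auto_fix_attempted = true;
  for (const name of fixes) {
    applyFix(name, result); // Step 4: apply one candidate fix
    if (retest(result)) {   // Step 5: re-test the service
      // Step 6: update error status on success
      result.auto_fix_successful = true;
      result.resolved = true;
      result.resolution_notes = `Auto-fixed by ${name}`;
      return result;
    }
  }

  // Steps 6-7: every candidate failed, record that and notify the DevOps agent
  result.auto_fix_successful = false;
  notify(result);
  return result;
}

// Usage: the re-test passes only after the second fix has been applied.
const applied = [];
const outcome = runAutoFix(
  { id: "uuid", error_type: "connection_failed", resolved: false },
  {
    applyFix: (name) => applied.push(name),
    retest: () => applied.length === 2,
    notify: () => {},
  }
);
console.log(outcome.resolved, outcome.resolution_notes, applied);
```

Injecting the three callbacks keeps the decision logic testable without a running database or container runtime; the fields written back (`auto_fix_attempted`, `auto_fix_successful`, `resolved`, `resolution_notes`) match the error structure shown earlier.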
