# AI-First DevOps Error Management System
## Overview
This system captures and logs infrastructure errors in a central store, and exposes them to AI agents (Claude, ADA, Morgan) so they can diagnose and fix the underlying issues automatically.
---
## Architecture
```
┌─────────────────────────────────────────────────────────────┐
│ Services (MCP Server, Backend, Agents, etc.) │
│ - Capture errors automatically │
│ - Log to central error database │
│ - Trigger alerts for critical issues │
└────────────────────┬────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ Error Logging System (MCP Server) │
│ - POST /api/errors/log - Log new errors │
│ - GET /api/errors - List all errors │
│ - GET /api/errors/:id - Get error details │
│ - POST /api/errors/:id/fix - Attempt auto-fix │
└────────────────────┬────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ AI Agents (via MCP Tools) │
│ - Morgan (DevOps) - Monitors and fixes infrastructure │
│ - ADA (Orchestrator) - Delegates to appropriate agent │
│ - Claude - Can query errors and suggest fixes │
└─────────────────────────────────────────────────────────────┘
```
---
## Error Logging
### 1. Error Structure
```json
{
  "id": "uuid",
  "timestamp": "2025-10-13T13:45:00Z",
  "service": "mcp-server",
  "severity": "critical|error|warning|info",
  "error_type": "connection_failed|health_check_failed|api_error",
  "message": "PostgreSQL health check failed",
  "details": {
    "service_name": "PostgreSQL",
    "port": 5432,
    "attempted_protocol": "http",
    "expected_protocol": "postgresql"
  },
  "stack_trace": "...",
  "company_id": 1,
  "resolved": false,
  "resolution_notes": null,
  "auto_fix_attempted": false,
  "auto_fix_successful": false
}
```
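A record with this shape could be built by a small factory that fills in the defaults. The following is a minimal sketch, not the actual implementation: `createErrorRecord` is a hypothetical name, and the choice of which fields are required is an assumption.

```javascript
const { randomUUID } = require('node:crypto');

// Severity levels listed in the error structure above.
const SEVERITIES = ['critical', 'error', 'warning', 'info'];

// Hypothetical factory: returns a record with the documented fields.
// Treating service, error_type and message as required is an assumption.
function createErrorRecord({
  service,
  severity,
  error_type,
  message,
  details = {},
  stack_trace = null,
  company_id = null,
}) {
  if (!SEVERITIES.includes(severity)) {
    throw new RangeError(`Unknown severity: ${severity}`);
  }
  if (!service || !error_type || !message) {
    throw new TypeError('service, error_type and message are required');
  }
  return {
    id: randomUUID(),
    timestamp: new Date().toISOString(),
    service,
    severity,
    error_type,
    message,
    details,
    stack_trace,
    company_id,
    // A new error starts unresolved, with no fix attempted yet.
    resolved: false,
    resolution_notes: null,
    auto_fix_attempted: false,
    auto_fix_successful: false,
  };
}
```

Rejecting unknown severities at creation time keeps the `severity` filter on the list endpoint predictable.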
### 2. Automatic Error Capture
All services will automatically log errors to the MCP server:
```javascript
// Example: automatic error logging around a health check.
// `checkHealth` and `logError` are helpers assumed to be defined elsewhere in the service.
try {
  await checkHealth(service);
} catch (error) {
  await logError({
    service: 'mcp-server',
    severity: 'error',
    error_type: 'health_check_failed',
    message: error.message,
    details: { service_name: service.name }
  });
}
```
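The snippet above relies on a `logError` helper that this document does not define. One possible shape is sketched here, assuming Node 18+ (for the global `fetch`) and the `POST /api/errors/log` endpoint on `localhost:8092`. `MCP_BASE_URL` and `buildLogRequest` are hypothetical names.

```javascript
// Base URL taken from the endpoint examples in this document; adjust per environment.
const MCP_BASE_URL = 'http://localhost:8092';

// Build the request separately so it can be inspected without a network call.
function buildLogRequest(error, baseUrl = MCP_BASE_URL) {
  return {
    url: `${baseUrl}/api/errors/log`,
    options: {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(error),
    },
  };
}

// Hypothetical logError: resolves to true when the server accepted the error.
// It never throws, so a logging failure cannot mask the original error.
async function logError(error, fetchImpl = fetch) {
  const { url, options } = buildLogRequest(error);
  try {
    const response = await fetchImpl(url, options);
    return response.ok;
  } catch (networkError) {
    console.error('Failed to log error to MCP server:', networkError.message);
    return false;
  }
}
```

Swallowing network failures is a deliberate choice in this sketch: if the MCP server is itself down, the calling service should keep running.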
---
## MCP Endpoints for Error Management
### 1. Log Error
```http
POST http://localhost:8092/api/errors/log
Content-Type: application/json

{
  "service": "mcp-server",
  "severity": "error",
  "error_type": "health_check_failed",
  "message": "PostgreSQL connection failed",
  "details": {...}
}
```
### 2. List Errors
```http
GET http://localhost:8092/api/errors?severity=critical&resolved=false&limit=50
```
Response:
```json
{
  "total": 10,
  "errors": [
    {
      "id": "uuid",
      "timestamp": "2025-10-13T13:45:00Z",
      "service": "mcp-server",
      "severity": "error",
      "message": "PostgreSQL health check failed",
      "resolved": false
    }
  ]
}
```
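One way to read the `severity`, `resolved`, and `limit` parameters is as a filter over the stored errors, followed by truncation. The sketch below assumes that `total` counts every match before `limit` is applied and that `limit` defaults to 50. Neither is confirmed by this document, and `listErrors` is a hypothetical name.

```javascript
// Hypothetical server-side filter behind GET /api/errors.
// Assumption: `total` is the number of matches before truncation to `limit`.
function listErrors(allErrors, { severity, resolved, limit = 50 } = {}) {
  const matches = allErrors.filter(
    (e) =>
      (severity === undefined || e.severity === severity) &&
      (resolved === undefined || e.resolved === resolved)
  );
  return { total: matches.length, errors: matches.slice(0, limit) };
}
```

Under that reading, a response such as `"total": 10` with fewer entries in `errors` tells the client that more results exist than were returned.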
### 3. Get Error Details
```http
GET http://localhost:8092/api/errors/:id
```
### 4. Attempt Auto-Fix
```http
POST http://localhost:8092/api/errors/:id/fix
```
The system will:
1. Analyze the error type
2. Apply known fixes (restart service, clear cache, etc.)
3. Re-test the service
4. Update error status
### 5. Mark as Resolved
```http
POST http://localhost:8092/api/errors/:id/resolve
Content-Type: application/json

{
  "resolution_notes": "Fixed by restarting PostgreSQL container"
}
```
---
## Auto-Healing System
### Common Fixes
1. **Service Not Responding**
   - Restart Docker container
   - Check resource limits
   - Clear connection pool
2. **Health Check Failed**
   - Verify correct protocol
   - Check firewall rules
   - Test connectivity
3. **Database Connection Issues**
   - Restart database
   - Clear connection pool
   - Check credentials
4. **Out of Memory**
   - Restart service
   - Increase memory limits
   - Clear caches
### Auto-Fix Flow
```
1. Error Detected
↓
2. Log to Error Database
↓
3. Check if auto-fixable
↓
4. Apply fix
↓
5. Re-test service
↓
6. Update error status
↓
7. Notify Morgan (DevOps agent) if fix failed
```
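The flow above can be sketched as a single function with its collaborators injected. This is an illustration under assumptions, not the system's actual code: `handleError` and every dependency name (`logError`, `findFix`, `retest`, `updateStatus`, `notifyMorgan`) are hypothetical. Notifying Morgan when no fix is available, and marking the error resolved once the re-test passes, are also assumptions.

```javascript
// Runs steps 2-7 of the flow for one detected error (step 1 is the detection itself).
async function handleError(error, { logError, findFix, retest, updateStatus, notifyMorgan }) {
  await logError(error); // 2. Log to error database

  const fix = findFix(error); // 3. Check if auto-fixable
  if (!fix) {
    // Assumption: errors without a known fix are escalated as well.
    await notifyMorgan(error, 'no automatic fix available');
    return { attempted: false, successful: false };
  }

  let successful = false;
  try {
    await fix(error); // 4. Apply fix
    successful = Boolean(await retest(error)); // 5. Re-test service
  } catch (fixError) {
    successful = false; // a fix or re-test that throws counts as a failed attempt
  }

  // 6. Update error status, using the flags from the error structure.
  await updateStatus(error.id, {
    auto_fix_attempted: true,
    auto_fix_successful: successful,
    resolved: successful,
  });

  if (!successful) {
    await notifyMorgan(error, 'automatic fix failed'); // 7. Notify Morgan if fix failed
  }
  return { attempted: true, successful };
}
```

Injecting the dependencies keeps the ordering of the steps testable without Docker, a database, or a running agent.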
---
## Integration with AI Agents
### Morgan (DevOps Agent) - MCP Tool
```json
{
  "name": "devops.errors.list",
  "description": "List infrastructure errors for Morgan to investigate",
  "inputSchema": {
    "type": "object",
    "properties": {
      "severity": { "type": "string", "enum": ["critical", "error", "warning"] },
      "resolved": { "type": "boolean" },
      "limit": { "type": "number" }
    }
  }
}
```
```json
{
  "name": "devops.errors.fix",
  "description": "Attempt to auto-fix an infrastructure error",
  "inputSchema": {
    "type": "object",
    "properties": {
      "error_id": { "type": "string" }
    },
    "required": ["error_id"]
  }
}
```
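These two tools plausibly map onto the REST endpoints listed earlier. The sketch below assumes that mapping (`devops.errors.list` to `GET /api/errors`, `devops.errors.fix` to `POST /api/errors/:id/fix`), which this document does not state explicitly. `toolCallToRequest` is a hypothetical name.

```javascript
// Translate an MCP tool call into the HTTP request it would issue.
// Assumption: each tool is a thin wrapper over one REST endpoint.
function toolCallToRequest(name, args = {}, baseUrl = 'http://localhost:8092') {
  switch (name) {
    case 'devops.errors.list': {
      const url = new URL('/api/errors', baseUrl);
      // Only forward the filters declared in the tool's inputSchema.
      for (const key of ['severity', 'resolved', 'limit']) {
        if (args[key] !== undefined) {
          url.searchParams.set(key, String(args[key]));
        }
      }
      return { method: 'GET', url: url.toString() };
    }
    case 'devops.errors.fix': {
      // error_id is the only required input of this tool.
      if (typeof args.error_id !== 'string' || args.error_id === '') {
        throw new TypeError('error_id is required');
      }
      const id = encodeURIComponent(args.error_id);
      return { method: 'POST', url: `${baseUrl}/api/errors/${id}/fix` };
    }
    default:
      throw new Error(`Unknown tool: ${name}`);
  }
}
```

Keeping the translation pure (no network call) makes it easy to check that an agent's arguments produce the intended request before anything is sent.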
### ADA (Orchestrator) - Error Delegation
When a critical error occurs:
1. ADA receives notification
2. Analyzes error type
3. Delegates to appropriate agent:
   - Morgan (DevOps) → Infrastructure issues
   - Alex (Finance) → Billing issues
   - Devon (Strategy) → Business logic issues
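The delegation step can be sketched as a lookup table. The error structure in this document has no field that names the problem area, so the `domain` field used here is a hypothetical addition, as are `AGENT_BY_DOMAIN` and `delegate`.

```javascript
// Hypothetical mapping from an error's domain to the agents listed above.
const AGENT_BY_DOMAIN = new Map([
  ['infrastructure', 'Morgan'],
  ['billing', 'Alex'],
  ['business_logic', 'Devon'],
]);

// Returns the agent name to delegate to, or null when ADA should not delegate
// (the error is not critical, or its domain is unknown).
function delegate(error) {
  if (error.severity !== 'critical') {
    return null;
  }
  return AGENT_BY_DOMAIN.get(error.domain) ?? null;
}
```

Returning `null` for an unknown domain leaves the decision with ADA instead of guessing an owner.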
---
## Usage Examples
### Example 1: Claude Investigates Errors
```
User: "Why is PostgreSQL showing as disconnected?"
Claude (via MCP):
1. Calls devops.errors.list with severity="error"
2. Sees PostgreSQL health check is failing
3. Reads error details
4. Identifies issue: Using HTTP on non-HTTP service
5. Suggests fix: Update health check to use pg_isready
```
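The suggested fix in step 5 could look like the following sketch, which shells out to `pg_isready` instead of sending an HTTP request to port 5432. It assumes the `pg_isready` binary is on the `PATH`. The function names and the 3-second timeout are illustrative choices, not part of the existing system.

```javascript
const { execFile } = require('node:child_process');

// Exit codes documented for pg_isready.
const PG_ISREADY_STATUS = {
  0: 'accepting connections',
  1: 'rejecting connections',
  2: 'no response',
  3: 'no attempt made',
};

// Map an exit code (or a spawn error code such as 'ENOENT') to a health result.
function interpretPgIsReady(exitCode) {
  return {
    healthy: exitCode === 0,
    status: PG_ISREADY_STATUS[exitCode] ?? `unexpected exit code ${exitCode}`,
  };
}

// Runs pg_isready against host:port and resolves with the interpreted result.
function checkPostgres({ host = 'localhost', port = 5432, timeoutSeconds = 3 } = {}) {
  return new Promise((resolve) => {
    const args = ['-h', host, '-p', String(port), '-t', String(timeoutSeconds)];
    execFile('pg_isready', args, (error) => {
      // A non-zero exit status arrives as error.code; a missing binary gives a string code.
      const exitCode = error ? error.code : 0;
      resolve(interpretPgIsReady(exitCode));
    });
  });
}
```

A caller can then use `const { healthy, status } = await checkPostgres({ port: 5432 });` and log `status` in the error's `details` when `healthy` is false.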
### Example 2: Morgan Auto-Fixes Issue
```
1. Morgan monitors /api/errors
2. Detects: Redis connection pool exhausted
3. Calls devops.errors.fix with error_id
4. System automatically:
   - Clears Redis connection pool
   - Restarts Redis if needed
   - Re-tests connection
5. Morgan logs resolution
```
### Example 3: ADA Orchestrates Multi-Agent Fix
```
1. Critical error: Payment processing failing
2. ADA analyzes error
3. Delegates:
   - Morgan → Check API health
   - Alex → Verify billing service
   - Devon → Check business rules
4. Each agent reports back
5. ADA coordinates fix
```
---
## Dashboard Integration
The Super Admin dashboard at `http://localhost:8080/admin` will show:
1. **Error Feed** - Real-time errors across all services
2. **Auto-Fix Log** - History of automatic fixes
3. **Agent Activity** - Which agents are investigating/fixing issues
4. **Health Trends** - Error rates over time
---
## Next Steps
1. ✅ Create error logging endpoints in MCP server
2. ✅ Fix PostgreSQL/Redis health checks
3. ✅ Add error capture middleware
4. ✅ Create auto-healing system
5. ✅ Integrate with Morgan (DevOps agent)
6. ✅ Add dashboard UI for error management
---
**Status**: 🚧 In Development
**Target**: AI-first DevOps, where errors are logged and fixed automatically by AI agents