# Recovery & Disaster Management Tools Reference
This document provides detailed specifications for GEPA MCP tools focused on system resilience, disaster recovery, backup management, and component health monitoring.
## Table of Contents
- [gepa_create_backup](#gepa_create_backup)
- [gepa_restore_backup](#gepa_restore_backup)
- [gepa_list_backups](#gepa_list_backups)
- [gepa_recovery_status](#gepa_recovery_status)
- [gepa_recover_component](#gepa_recover_component)
- [gepa_integrity_check](#gepa_integrity_check)
---
## gepa_create_backup
Creates comprehensive system backups including evolution state, trajectories, Pareto frontier data, and component configurations.
### Purpose
Provides robust backup capabilities to protect against data loss, system corruption, and component failures. Essential for maintaining system resilience and enabling point-in-time recovery.
### Backup Types
| Type | Description | Contents | Use Case |
|------|-------------|----------|----------|
| **Full** | Complete system state | All components, trajectories, configurations | Regular snapshots |
| **Incremental** | Changes since last backup | Modified data only | Frequent updates |
| **Component** | Specific component data | Single component state | Targeted backup |
| **Archive** | Compressed historical data | Older trajectories and results | Long-term storage |
### Parameters
| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `label` | `string` | ❌ | Optional descriptive label for the backup |
| `includeTrajectories` | `boolean` | ❌ | Include trajectory data in backup (default: true) |
### Request Example
```typescript
const response = await mcpClient.callTool('gepa_create_backup', {
  label: 'pre-major-update-backup',
  includeTrajectories: true
});
```
### Response Example
```markdown
# System Backup Created
## Backup Details
- **ID**: backup_1733140800_abc123def
- **Label**: pre-major-update-backup
- **Timestamp**: 2024-12-02T14:30:00.000Z
- **Type**: full
- **Size**: 15.47 MB
- **Components**: 7
- **Compressed**: Yes
## Components Backed Up
- **evolution_engine** (system): 2.45 KB
- **pareto_frontier** (data): 1.23 MB
- **trajectory_store** (data): 12.15 MB
- **llm_adapter** (config): 1.87 KB
- **prompt_mutator** (config): 3.21 KB
- **reflection_engine** (system): 5.67 KB
- **disaster_recovery** (system): 890 bytes
## Metadata
- Generation: 15
- Population Size: 47
- Pareto Frontier Size: 23
- Total Trajectories: 1,247
The backup is ready for restoration if needed.
```
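A caller usually needs the backup ID from this response to pass to `gepa_restore_backup` later. The sketch below assumes the response text contains the `- **ID**: ...` line exactly as shown in the example above; `extractBackupId` is a hypothetical helper, not part of the GEPA tools.

```typescript
// Hypothetical helper: pulls the backup ID out of the Markdown text that
// gepa_create_backup returns. Assumes the "- **ID**: <id>" line format
// shown in the response example.
function extractBackupId(markdown: string): string | null {
  const match = markdown.match(/^\s*-\s*\*\*ID\*\*:\s*(\S+)\s*$/m);
  return match?.[1] ?? null;
}

const createBackupResponse = [
  "# System Backup Created",
  "## Backup Details",
  "- **ID**: backup_1733140800_abc123def",
  "- **Label**: pre-major-update-backup",
].join("\n");

// Keep the ID so it can be passed as `backupId` to gepa_restore_backup.
const extractedId = extractBackupId(createBackupResponse);
console.log(extractedId);
```

Returning `null` instead of throwing lets the caller decide whether a missing ID is fatal.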
### Backup Components
Each backup includes multiple component types:
| Component | Type | Description | Critical Level |
|-----------|------|-------------|----------------|
| **Evolution Engine** | System | Evolution state and configuration | High |
| **Pareto Frontier** | Data | Optimal candidate archive | High |
| **Trajectory Store** | Data | Complete execution history | Medium |
| **LLM Adapter** | Config | Model configurations and settings | Medium |
| **Prompt Mutator** | Config | Mutation strategies and parameters | Medium |
| **Reflection Engine** | System | Analysis patterns and learned insights | High |
| **Memory Cache** | Data | Performance optimization cache | Low |
### Automated Backup Strategies
```typescript
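// Hypothetical helper: ISO 8601 week number (1-53) of a date, computed in UTC
function getWeekNumber(date: Date = new Date()): number {
  const d = new Date(Date.UTC(date.getUTCFullYear(), date.getUTCMonth(), date.getUTCDate()));
  // Move to the Thursday of the same week; it determines the ISO year
  d.setUTCDate(d.getUTCDate() + 4 - (d.getUTCDay() || 7));
  const yearStart = Date.UTC(d.getUTCFullYear(), 0, 1);
  return Math.ceil(((d.getTime() - yearStart) / 86400000 + 1) / 7);
}

// Argument presets for gepa_create_backup, one per schedule.
// Labels are computed when this object is created, so rebuild it for each run.
const scheduleBackups = {
  daily: {
    label: `daily-backup-${new Date().toISOString().split('T')[0]}`,
    includeTrajectories: true
  },
  weekly: {
    label: `weekly-backup-week-${getWeekNumber()}`,
    includeTrajectories: true
  },
  preUpdate: {
    label: 'pre-system-update',
    includeTrajectories: true
  }
};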
```
### Error Cases
| Error | Cause | Solution |
|-------|-------|----------|
| `Insufficient disk space` | Storage limit exceeded | Free up space or configure backup location |
| `Component lock failure` | System busy during backup | Retry after current operations complete |
| `Compression failed` | Corrupted data or memory issues | Run integrity check before backup |
| `Permission denied` | File system access restricted | Check backup directory permissions |
---
## gepa_restore_backup
Restores system state from a previously created backup with optional integrity validation and pre-restore backup creation.
### Purpose
Enables recovery from backup snapshots to restore system functionality after failures, corruption, or experimental changes that need to be reverted.
### Restoration Options
| Option | Description | Impact | Recommendation |
|--------|-------------|--------|----------------|
| **Full Restore** | Complete system replacement | All current data lost | Use for disaster recovery |
| **Selective Restore** | Specific component restoration | Targeted data replacement | Use for component failures |
| **Merge Restore** | Combine backup with current state | Partial data preservation | Use for partial recovery |
### Parameters
| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `backupId` | `string` | ✅ | ID of the backup to restore from |
| `validateIntegrity` | `boolean` | ❌ | Perform integrity validation before restore (default: true) |
| `createPreRestoreBackup` | `boolean` | ❌ | Create backup before restoration (default: true) |
### Request Example
```typescript
const response = await mcpClient.callTool('gepa_restore_backup', {
  backupId: 'backup_1733140800_abc123def',
  validateIntegrity: true,
  createPreRestoreBackup: true
});
```
### Response Example
```markdown
# System Restore Completed
## Restore Details
- **Backup ID**: backup_1733140800_abc123def
- **Success**: Yes
- **Restore Time**: 2,347 ms
- **Pre-restore Backup**: backup_1733141000_prerestore_xyz789
## Components Restored
✅ evolution_engine
✅ pareto_frontier
✅ trajectory_store
✅ llm_adapter
✅ prompt_mutator
✅ reflection_engine
✅ disaster_recovery
## Integrity Checks
- evolution_engine: ✅ Valid
- pareto_frontier: ✅ Valid
- trajectory_store: ✅ Valid
- llm_adapter: ✅ Valid
- prompt_mutator: ✅ Valid
- reflection_engine: ✅ Valid
- disaster_recovery: ✅ Valid
System has been successfully restored from backup.
```
### Restoration Process
The restoration follows these steps:
1. **Pre-Validation**: Verify backup integrity and compatibility
2. **Pre-Restore Backup**: Create safety backup of current state
3. **Component Shutdown**: Gracefully stop affected components
4. **Data Restoration**: Replace component data with backup versions
5. **Integrity Verification**: Validate restored data consistency
6. **Component Restart**: Initialize components with restored data
7. **Health Check**: Verify system functionality post-restore
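The ordering matters: each step runs only if every earlier step succeeded. The following is a minimal sketch of that control flow, not the GEPA implementation; `runRestoreSteps` and the step bodies are hypothetical placeholders, and only the step order comes from the list above.

```typescript
// Hypothetical step runner. Step names mirror the restoration process above;
// the `run` callbacks are stand-ins for the real work.
type RestoreStep = { name: string; run: () => boolean };

interface RestoreOutcome {
  completed: string[];
  failedAt: string | null;
}

function runRestoreSteps(steps: RestoreStep[]): RestoreOutcome {
  const completed: string[] = [];
  for (const step of steps) {
    // Stop at the first failing step so later steps never run on bad state.
    if (!step.run()) {
      return { completed, failedAt: step.name };
    }
    completed.push(step.name);
  }
  return { completed, failedAt: null };
}

const restoreStepNames = [
  "pre-validation",
  "pre-restore-backup",
  "component-shutdown",
  "data-restoration",
  "integrity-verification",
  "component-restart",
  "health-check",
];

// Simulate a run in which integrity verification fails.
const restoreOutcome = runRestoreSteps(
  restoreStepNames.map((stepName) => ({
    name: stepName,
    run: () => stepName !== "integrity-verification",
  })),
);
console.log(restoreOutcome.failedAt, restoreOutcome.completed.length);
```

Because the pre-restore backup is the second step, a failure at any later step still leaves a snapshot of the previous state to fall back on.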
### Error Cases
| Error | Cause | Solution |
|-------|-------|----------|
| `Backup not found` | Invalid backup ID | Verify that the backup ID exists with `gepa_list_backups` |
| `Integrity validation failed` | Corrupted backup data | Use a different backup; set `validateIntegrity: false` only as a last resort |
| `Incompatible version` | Backup from different system version | Check compatibility or use migration tools |
| `Restoration interrupted` | System failure during restore | Use pre-restore backup to recover |
---
## gepa_list_backups
Lists all available system backups with filtering and sorting options for backup management.
### Purpose
Provides visibility into backup history, enables backup selection for restoration, and supports backup lifecycle management.
### Parameters
| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `limit` | `number` | ❌ | Maximum number of backups to return (default: 20) |
| `filterLabel` | `string` | ❌ | Filter backups by label pattern |
### Request Example
```typescript
const response = await mcpClient.callTool('gepa_list_backups', {
  limit: 10,
  filterLabel: 'daily'
});
```
### Response Example
```markdown
# Available System Backups
Found 5 backup(s):
## daily-backup-2024-12-02 (backup_1733140800_abc123def)
- **Created**: 2024-12-02T14:30:00.000Z
- **Type**: full
- **Size**: 15.47 MB
- **Components**: 7
## daily-backup-2024-12-01 (backup_1733054400_def456ghi)
- **Created**: 2024-12-01T14:30:00.000Z
- **Type**: full
- **Size**: 14.92 MB
- **Components**: 7
## pre-evolution-experiment (backup_1733050800_ghi789jkl)
- **Created**: 2024-12-01T13:30:00.000Z
- **Type**: full
- **Size**: 14.85 MB
- **Components**: 7
## weekly-backup-week-48 (backup_1732968000_jkl012mno)
- **Created**: 2024-11-30T14:00:00.000Z
- **Type**: full
- **Size**: 13.67 MB
- **Components**: 7
## emergency-backup (backup_1732881600_mno345pqr)
- **Created**: 2024-11-29T14:00:00.000Z
- **Type**: full
- **Size**: 12.98 MB
- **Components**: 6
Use `gepa_restore_backup` with a backup ID to restore the system.
```
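To choose a restore target programmatically, the listing can be parsed into records. This sketch assumes the `## label (backup_id)` heading and `- **Created**: ...` line formats from the example above; `parseBackupList` and `newestBackup` are hypothetical helpers.

```typescript
interface BackupSummary {
  label: string;
  id: string;
  created: string; // ISO 8601 timestamp, empty if the line was missing
}

// Hypothetical parser for the Markdown listing returned by gepa_list_backups.
function parseBackupList(markdown: string): BackupSummary[] {
  const backups: BackupSummary[] = [];
  let current: BackupSummary | null = null;
  for (const line of markdown.split("\n")) {
    const heading = line.match(/^##\s+(.+?)\s+\((backup_[^)]+)\)\s*$/);
    if (heading) {
      current = { label: heading[1] ?? "", id: heading[2] ?? "", created: "" };
      backups.push(current);
      continue;
    }
    const createdLine = line.match(/^\s*-\s*\*\*Created\*\*:\s*(\S+)/);
    if (createdLine && current) {
      current.created = createdLine[1] ?? "";
    }
  }
  return backups;
}

// Newest backup whose label contains `labelPart`, or null if none matches.
function newestBackup(backups: BackupSummary[], labelPart: string): BackupSummary | null {
  const matches = backups
    .filter((b) => b.label.includes(labelPart))
    .sort((a, b) => Date.parse(b.created) - Date.parse(a.created));
  return matches[0] ?? null;
}

const backupListing = [
  "## daily-backup-2024-12-01 (backup_1733054400_def456ghi)",
  "- **Created**: 2024-12-01T14:30:00.000Z",
  "## daily-backup-2024-12-02 (backup_1733140800_abc123def)",
  "- **Created**: 2024-12-02T14:30:00.000Z",
  "## pre-evolution-experiment (backup_1733050800_ghi789jkl)",
  "- **Created**: 2024-12-01T13:30:00.000Z",
].join("\n");

console.log(newestBackup(parseBackupList(backupListing), "daily")?.id);
```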
### Backup Management Utilities
```typescript
// Find recent backups
const recentBackups = await mcpClient.callTool('gepa_list_backups', {
  limit: 5
});

// Find backups by label pattern
const experimentBackups = await mcpClient.callTool('gepa_list_backups', {
  filterLabel: 'experiment',
  limit: 20
});

// Get comprehensive backup list
const allBackups = await mcpClient.callTool('gepa_list_backups', {
  limit: 100
});
```
### Error Cases
| Error | Cause | Solution |
|-------|-------|----------|
| `Backup directory not found` | Backup storage not configured | Initialize backup system |
| `Permission denied` | Insufficient file access rights | Check directory permissions |
| `Corrupted backup index` | Backup metadata damaged | Rebuild backup index |
---
## gepa_recovery_status
Provides comprehensive disaster recovery status and health information for all system components.
### Purpose
Offers real-time visibility into system health, recovery capabilities, and potential issues before they become critical failures.
### Parameters
| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `includeMetrics` | `boolean` | ❌ | Include detailed metrics in response (default: true) |
### Request Example
```typescript
const response = await mcpClient.callTool('gepa_recovery_status', {
  includeMetrics: true
});
```
### Response Example
```markdown
# Disaster Recovery Status
## Overall System Health: HEALTHY
### System Components
- **evolution_engine**: HEALTHY (Last check: 2024-12-02T14:35:00.000Z)
- **pareto_frontier**: HEALTHY (Last check: 2024-12-02T14:35:00.000Z)
- **trajectory_store**: HEALTHY (Last check: 2024-12-02T14:35:00.000Z)
- **llm_adapter**: HEALTHY (Last check: 2024-12-02T14:35:00.000Z)
- **prompt_mutator**: HEALTHY (Last check: 2024-12-02T14:35:00.000Z)
- **reflection_engine**: HEALTHY (Last check: 2024-12-02T14:35:00.000Z)
### Recovery Dashboard
- **System Status**: HEALTHY
- **Active Recoveries**: 0
- **Recent Failures**: 1 (24h)
- **Total Executions**: 1,247
- **Success Rate**: 97.3%
### Metrics
- **Backups Available**: 15
- **Quarantined Items**: 0
- **Critical Components**: 6
- **Last Backup Age**: 23 minutes
### Detailed Metrics
#### Evolution Engine
- uptime: 72.5 hours
- memory_usage: 156 MB
- active_processes: 3
- error_rate: 0.002
#### Pareto Frontier
- frontier_size: 23
- total_candidates: 47
- update_frequency: 15.7 per hour
- optimization_efficiency: 0.89
#### Trajectory Store
- total_trajectories: 1,247
- storage_used: 245 MB
- query_performance: 12.3ms avg
- index_health: optimal
#### LLM Adapter
- active_connections: 2
- avg_response_time: 1,850ms
- token_efficiency: 0.76
- rate_limit_status: normal
#### Prompt Mutator
- mutations_per_hour: 48.2
- success_rate: 94.7%
- diversity_score: 0.73
- cache_hit_rate: 67.8%
#### Reflection Engine
- analyses_completed: 89
- pattern_recognition_accuracy: 91.2%
- improvement_suggestions: 267
- confidence_score: 0.84
```
### Health Status Levels
| Status | Description | Action Required |
|--------|-------------|-----------------|
| **HEALTHY** | All systems operating normally | None |
| **WARNING** | Minor issues detected | Monitor closely |
| **DEGRADED** | Reduced functionality | Investigate and repair |
| **CRITICAL** | Severe issues affecting operations | Immediate attention |
| **FAILED** | Component non-functional | Emergency recovery |
### Monitoring Thresholds
```typescript
const healthThresholds = {
  memory_usage: { warning: 500, critical: 1000 },    // MB
  error_rate: { warning: 0.05, critical: 0.1 },      // fraction (0-1)
  response_time: { warning: 5000, critical: 10000 }, // ms
  success_rate: { warning: 0.9, critical: 0.8 },     // fraction (0-1); lower is worse
  disk_usage: { warning: 0.8, critical: 0.95 }       // fraction (0-1)
};
```
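Note that `success_rate` runs in the opposite direction from the other metrics: its critical value is lower than its warning value. One way to handle both directions with a single function is sketched below; `classifyMetric` is a hypothetical helper, and treating the boundary values as inclusive is an assumption.

```typescript
type HealthLevel = "healthy" | "warning" | "critical";

interface Threshold {
  warning: number;
  critical: number;
}

// Hypothetical classifier. The direction is inferred from the threshold pair:
// if critical < warning, lower values are worse (as with success_rate).
function classifyMetric(value: number, t: Threshold): HealthLevel {
  const lowerIsWorse = t.critical < t.warning;
  if (lowerIsWorse) {
    if (value <= t.critical) return "critical";
    if (value <= t.warning) return "warning";
    return "healthy";
  }
  if (value >= t.critical) return "critical";
  if (value >= t.warning) return "warning";
  return "healthy";
}

const memoryThreshold: Threshold = { warning: 500, critical: 1000 }; // MB
const successThreshold: Threshold = { warning: 0.9, critical: 0.8 }; // fraction

console.log(classifyMetric(156, memoryThreshold));    // memory_usage from the status example
console.log(classifyMetric(0.973, successThreshold)); // success rate from the status example
console.log(classifyMetric(0.85, successThreshold));  // between warning and critical
```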
### Error Cases
| Error | Cause | Solution |
|-------|-------|----------|
| `Health check timeout` | Component unresponsive | Restart component or run recovery |
| `Metrics collection failed` | Monitoring system issue | Check monitoring configuration |
| `Status unavailable` | Recovery system not initialized | Initialize disaster recovery system |
---
## gepa_recover_component
Recovers a specific GEPA component using configurable recovery strategies when component failures are detected.
### Purpose
Provides targeted recovery for individual components without full system restoration, enabling surgical repairs and minimizing disruption.
### Component Types
| Component | Description | Recovery Strategies |
|-----------|-------------|-------------------|
| `evolution_engine` | Core genetic algorithm engine | restart, restore_from_backup, rebuild |
| `pareto_frontier` | Multi-objective optimization frontier | restart, restore_from_backup, reset_to_defaults |
| `llm_adapter` | Language model interface | restart, reset_to_defaults |
| `trajectory_store` | Execution data storage | restart, restore_from_backup, rebuild |
| `memory_cache` | Performance optimization cache | restart, reset_to_defaults |
### Recovery Strategies
| Strategy | Description | Data Impact | Recovery Time |
|----------|-------------|-------------|---------------|
| `restart` | Graceful component restart | None | Fast (1-5s) |
| `restore_from_backup` | Load from recent backup | Partial loss | Medium (10-30s) |
| `rebuild` | Reconstruct from available data | Minimal loss | Slow (30-120s) |
| `reset_to_defaults` | Factory reset configuration | Settings lost | Fast (1-10s) |
### Parameters
| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `componentType` | `string` | ✅ | Type of component to recover |
| `strategy` | `string` | ❌ | Recovery strategy to use (default: 'restart') |
### Request Example
```typescript
const response = await mcpClient.callTool('gepa_recover_component', {
  componentType: 'trajectory_store',
  strategy: 'restore_from_backup'
});
```
### Response Example
```markdown
# Component Recovery Completed
## Recovery Details
- **Component**: trajectory_store
- **Strategy**: restore_from_backup
- **Success**: Yes
- **Duration**: 15,347 ms
- **Start Time**: 2024-12-02T14:45:00.000Z
- **End Time**: 2024-12-02T14:45:15.347Z
## Recovery Logs
[2024-12-02T14:45:00.120Z] Starting trajectory_store recovery
[2024-12-02T14:45:00.245Z] Identifying latest valid backup
[2024-12-02T14:45:01.156Z] Found backup: backup_1733140800_abc123def
[2024-12-02T14:45:01.289Z] Validating backup integrity
[2024-12-02T14:45:02.445Z] Backup validation passed
[2024-12-02T14:45:02.567Z] Stopping trajectory_store component
[2024-12-02T14:45:03.123Z] Backing up current state
[2024-12-02T14:45:05.789Z] Restoring data from backup
[2024-12-02T14:45:12.234Z] Data restoration completed
[2024-12-02T14:45:12.456Z] Rebuilding indexes
[2024-12-02T14:45:14.567Z] Restarting trajectory_store component
[2024-12-02T14:45:15.123Z] Component health check passed
[2024-12-02T14:45:15.347Z] Recovery completed successfully
## Pre-Recovery State
- Status: FAILED
- Last Action: query_trajectories
- Error: Database connection timeout
## Post-Recovery State
- Status: HEALTHY
- Last Action: component_startup
- Performance: optimal
Component trajectory_store has been successfully recovered using restore_from_backup strategy.
```
### Recovery Decision Matrix
| Issue Type | Recommended Strategy | Alternative |
|------------|---------------------|-------------|
| **Memory Leak** | restart | reset_to_defaults |
| **Data Corruption** | restore_from_backup | rebuild |
| **Configuration Error** | reset_to_defaults | restart |
| **Process Crash** | restart | restore_from_backup |
| **Performance Degradation** | restart | rebuild |
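Not every component supports every strategy, so the matrix has to be read together with the Component Types table. The sketch below combines the two; the issue identifiers (`memory_leak` and so on) and the `pickStrategy` helper are hypothetical names introduced for illustration.

```typescript
type Strategy = "restart" | "restore_from_backup" | "rebuild" | "reset_to_defaults";
type Issue =
  | "memory_leak"
  | "data_corruption"
  | "configuration_error"
  | "process_crash"
  | "performance_degradation";

// Strategies per component, copied from the Component Types table.
const supportedStrategies: Record<string, Strategy[]> = {
  evolution_engine: ["restart", "restore_from_backup", "rebuild"],
  pareto_frontier: ["restart", "restore_from_backup", "reset_to_defaults"],
  llm_adapter: ["restart", "reset_to_defaults"],
  trajectory_store: ["restart", "restore_from_backup", "rebuild"],
  memory_cache: ["restart", "reset_to_defaults"],
};

// [recommended, alternative] per issue, copied from the decision matrix.
const decisionMatrix: Record<Issue, [Strategy, Strategy]> = {
  memory_leak: ["restart", "reset_to_defaults"],
  data_corruption: ["restore_from_backup", "rebuild"],
  configuration_error: ["reset_to_defaults", "restart"],
  process_crash: ["restart", "restore_from_backup"],
  performance_degradation: ["restart", "rebuild"],
};

// First matrix entry the component supports, or null if neither applies.
function pickStrategy(component: string, issue: Issue): Strategy | null {
  const allowed = supportedStrategies[component] ?? [];
  return decisionMatrix[issue].find((s) => allowed.includes(s)) ?? null;
}

console.log(pickStrategy("trajectory_store", "data_corruption"));
console.log(pickStrategy("evolution_engine", "configuration_error"));
console.log(pickStrategy("llm_adapter", "data_corruption"));
```

A `null` result, as for `llm_adapter` with data corruption, signals that neither listed strategy applies and the case needs manual handling.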
### Automated Recovery Triggers
```typescript
// Set up automatic recovery based on health metrics
const recoveryTriggers = {
  evolution_engine: {
    memory_usage_threshold: 800, // MB
    error_rate_threshold: 0.1,
    strategy: 'restart'
  },
  trajectory_store: {
    response_time_threshold: 5000, // ms
    error_rate_threshold: 0.05,
    strategy: 'restore_from_backup'
  }
};
```
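A trigger configuration like this needs an evaluator that compares current metrics with the thresholds. The following sketch shows one possible shape; `breachedThresholds` is hypothetical, and the mapping from `*_threshold` keys to metric names is an assumption based on the names used in this document.

```typescript
interface RecoveryTrigger {
  memory_usage_threshold?: number;  // MB
  response_time_threshold?: number; // ms
  error_rate_threshold?: number;    // fraction (0-1)
  strategy: string;
}

interface ComponentMetrics {
  memory_usage?: number;
  response_time?: number;
  error_rate?: number;
}

// Hypothetical evaluator: names of the metrics that reached their threshold.
// Metrics or thresholds that are not set are skipped.
function breachedThresholds(trigger: RecoveryTrigger, metrics: ComponentMetrics): string[] {
  const checks: Array<[string, number | undefined, number | undefined]> = [
    ["memory_usage", metrics.memory_usage, trigger.memory_usage_threshold],
    ["response_time", metrics.response_time, trigger.response_time_threshold],
    ["error_rate", metrics.error_rate, trigger.error_rate_threshold],
  ];
  return checks
    .filter(([, value, limit]) => value !== undefined && limit !== undefined && value >= limit)
    .map(([metric]) => metric);
}

const storeTrigger: RecoveryTrigger = {
  response_time_threshold: 5000,
  error_rate_threshold: 0.05,
  strategy: "restore_from_backup",
};

const breached = breachedThresholds(storeTrigger, { response_time: 6200, error_rate: 0.01 });
if (breached.length > 0) {
  // At this point a caller could invoke gepa_recover_component with storeTrigger.strategy.
  console.log(`recover with ${storeTrigger.strategy}: ${breached.join(", ")}`);
}
```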
### Error Cases
| Error | Cause | Solution |
|-------|-------|----------|
| `Component not found` | Invalid component type | Use valid component name |
| `Recovery strategy failed` | Strategy inappropriate for issue | Try alternative strategy |
| `No backup available` | Backup required but missing | Create backup or use 'restart' strategy |
| `Component in use` | Component locked by active process | Wait for completion or force recovery |
---
## gepa_integrity_check
Performs comprehensive data integrity validation with optional automatic repair for corruption detection and prevention.
### Purpose
Validates data consistency, detects corruption early, and provides automatic repair capabilities to maintain system reliability and data quality.
### Check Scopes
| Scope | Description | Components Checked |
|-------|-------------|-------------------|
| `all` | Complete system validation | All components and data |
| `evolution_state` | Evolution process integrity | Evolution engine, populations |
| `trajectories` | Execution data validation | Trajectory store, indexes |
| `configuration` | Settings and parameters | All component configurations |
| `cache` | Performance cache validation | Memory cache, optimization data |
### Validation Types
| Validation | Description | Detection |
|------------|-------------|-----------|
| **Checksum** | File integrity verification | Data corruption |
| **Size Match** | Expected vs actual file sizes | Truncation, incomplete writes |
| **Dependencies** | Component relationship validation | Missing dependencies |
| **Schema** | Data structure validation | Format inconsistencies |
| **Referential** | Cross-component data consistency | Orphaned references |
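As an illustration of the referential validation, the sketch below reports references that point at IDs no longer present. It shows the general idea only, not how GEPA performs the check; `findOrphanedReferences` and the second candidate ID are hypothetical.

```typescript
// Hypothetical referential check: returns every referenced ID that does not
// exist in the set of known IDs (an "orphaned reference").
function findOrphanedReferences(referenced: string[], known: string[]): string[] {
  const knownIds = new Set(known);
  return referenced.filter((id) => !knownIds.has(id));
}

// IDs the Pareto frontier points at versus candidates that actually exist.
const frontierReferences = [
  "evolution_1733140800_candidate_15",
  "evolution_1733140800_candidate_16",
];
const existingCandidates = ["evolution_1733140800_candidate_16"];

const orphaned = findOrphanedReferences(frontierReferences, existingCandidates);
console.log(orphaned);
```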
### Parameters
| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `component` | `string` | ❌ | Component or scope to check; see Check Scopes (default: 'all') |
| `autoRepair` | `boolean` | ❌ | Attempt automatic repair (default: false) |
### Request Example
```typescript
const response = await mcpClient.callTool('gepa_integrity_check', {
  component: 'all',
  autoRepair: true
});
```
### Response Example
```markdown
# Data Integrity Check Results
## Overall Status: ✅ PASSED
### Summary
- **Components Checked**: 6
- **Valid Components**: 6
- **Corrupted Components**: 0
- **Auto-Repair**: Enabled
### Detailed Results
#### evolution_engine
- **Overall Valid**: ✅ Yes
- **Checksum Valid**: ✅
- **Size Match**: ✅
- **Dependencies Valid**: ✅
#### pareto_frontier
- **Overall Valid**: ✅ Yes
- **Checksum Valid**: ✅
- **Size Match**: ✅
- **Dependencies Valid**: ✅
#### trajectory_store
- **Overall Valid**: ✅ Yes
- **Checksum Valid**: ✅
- **Size Match**: ✅
- **Dependencies Valid**: ✅
#### llm_adapter
- **Overall Valid**: ✅ Yes
- **Checksum Valid**: ✅
- **Size Match**: ✅
- **Dependencies Valid**: ✅
#### prompt_mutator
- **Overall Valid**: ✅ Yes
- **Checksum Valid**: ✅
- **Size Match**: ✅
- **Dependencies Valid**: ✅
#### reflection_engine
- **Overall Valid**: ✅ Yes
- **Checksum Valid**: ✅
- **Size Match**: ✅
- **Dependencies Valid**: ✅
All checked components passed integrity validation.
```
### Integrity Check Example with Issues
```markdown
# Data Integrity Check Results
## Overall Status: ❌ ISSUES DETECTED
### Summary
- **Components Checked**: 6
- **Valid Components**: 4
- **Corrupted Components**: 2
- **Auto-Repair**: Enabled
### Detailed Results
#### trajectory_store
- **Overall Valid**: ❌ No
- **Checksum Valid**: ❌
- **Size Match**: ✅
- **Dependencies Valid**: ✅
- **Errors**: Checksum mismatch in trajectory_1733140850_def456.json
#### pareto_frontier
- **Overall Valid**: ❌ No
- **Checksum Valid**: ✅
- **Size Match**: ❌
- **Dependencies Valid**: ❌
- **Errors**: Missing reference to candidate evolution_1733140800_candidate_15
### Recommendations
- **trajectory_store**: Consider running with autoRepair enabled or manual restoration
- **pareto_frontier**: Consider running with autoRepair enabled or manual restoration
2 component(s) failed integrity checks. Auto-repair was attempted.
```
### Automated Integrity Monitoring
```typescript
// Schedule regular integrity checks
const integritySchedule = {
  realtime: {
    interval: 300000, // 5 minutes
    scope: 'cache', // value for the `component` parameter
    autoRepair: true
  },
  hourly: {
    interval: 3600000, // 1 hour
    scope: 'configuration',
    autoRepair: true
  },
  daily: {
    interval: 86400000, // 24 hours
    scope: 'all',
    autoRepair: false
  }
};
```
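A schedule of this shape still needs something to decide which checks are due. The sketch below is one minimal way to do that from last-run timestamps; `dueChecks` is a hypothetical helper that relies only on the `interval` field (in milliseconds).

```typescript
interface CheckPlan {
  interval: number; // milliseconds between runs
}

// Hypothetical helper: names of the plans whose interval has elapsed.
// A plan with no recorded last run is treated as last run at time 0.
function dueChecks(
  schedule: Record<string, CheckPlan>,
  lastRun: Record<string, number>,
  now: number,
): string[] {
  return Object.entries(schedule)
    .filter(([planName, plan]) => now - (lastRun[planName] ?? 0) >= plan.interval)
    .map(([planName]) => planName);
}

const checkSchedule: Record<string, CheckPlan> = {
  realtime: { interval: 300000 },
  hourly: { interval: 3600000 },
  daily: { interval: 86400000 },
};

// One hour after every plan last ran, only the two shorter intervals are due.
const startedAt = Date.UTC(2024, 11, 2, 14, 0, 0);
const lastRuns = { realtime: startedAt, hourly: startedAt, daily: startedAt };
console.log(dueChecks(checkSchedule, lastRuns, startedAt + 3600000));
```

Each due plan would then be turned into a `gepa_integrity_check` call, after which its last-run timestamp is updated.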
### Error Cases
| Error | Cause | Solution |
|-------|-------|----------|
| `Component not accessible` | File system or permission issue | Check file permissions and disk space |
| `Validation timeout` | Large dataset or slow storage | Increase timeout or check specific components |
| `Auto-repair failed` | Corruption too severe | Manual intervention required |
| `Checksum calculation failed` | Missing or corrupted metadata | Rebuild component indexes |
---
## Best Practices
### Backup Strategy
- **Regular Schedules**: Daily for critical systems, weekly for development
- **Retention Policies**: Keep 7 daily, 4 weekly, 12 monthly backups
- **Verification**: Test backup integrity regularly
- **Labeling**: Use descriptive labels for easy identification
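The retention policy above (7 daily, 4 weekly, 12 monthly) can be sketched as a selection function. This is an illustration under stated assumptions: `selectBackupsToKeep` is hypothetical, weeks are Monday-based and computed in UTC, and how unselected backups get deleted is left open because this reference documents no deletion tool.

```typescript
interface BackupEntry {
  id: string;
  created: Date;
}

const DAY_MS = 86400000;
const dayKey = (d: Date): string => d.toISOString().slice(0, 10);  // YYYY-MM-DD
const monthKey = (d: Date): string => d.toISOString().slice(0, 7); // YYYY-MM
// Monday-based week index in UTC (1970-01-01 was a Thursday, hence the +3).
const weekKey = (d: Date): string =>
  String(Math.floor((Math.floor(d.getTime() / DAY_MS) + 3) / 7));

// Keeps the newest backup in each of the most recent N days, weeks, and months.
function selectBackupsToKeep(
  backups: BackupEntry[],
  policy = { daily: 7, weekly: 4, monthly: 12 },
): Set<string> {
  const newestFirst = [...backups].sort(
    (a, b) => b.created.getTime() - a.created.getTime(),
  );
  const rules: Array<[(d: Date) => string, number]> = [
    [dayKey, policy.daily],
    [weekKey, policy.weekly],
    [monthKey, policy.monthly],
  ];
  const keep = new Set<string>();
  for (const [keyOf, limit] of rules) {
    const seen = new Set<string>();
    for (const backup of newestFirst) {
      const key = keyOf(backup.created);
      if (seen.has(key)) continue;   // a newer backup already covers this bucket
      if (seen.size >= limit) break; // enough buckets for this rule
      seen.add(key);
      keep.add(backup.id);
    }
  }
  return keep;
}

// Thirty daily backups, 2024-11-03 through 2024-12-02, at 14:30 UTC.
const dailyBackups: BackupEntry[] = Array.from({ length: 30 }, (_, i) => {
  const created = new Date(Date.UTC(2024, 10, 3 + i, 14, 30));
  return { id: `b-${dayKey(created)}`, created };
});

console.log([...selectBackupsToKeep(dailyBackups)].sort());
```

A backup can satisfy several rules at once, so the number kept is usually smaller than the sum of the three limits.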
### Recovery Planning
- **Risk Assessment**: Identify critical components and failure modes
- **Recovery Priorities**: Establish component recovery order
- **Testing**: Regular disaster recovery drills
- **Documentation**: Maintain recovery procedures and contact information
### Monitoring and Alerts
- **Health Checks**: Continuous component monitoring
- **Threshold Tuning**: Adjust thresholds based on system behavior
- **Alert Fatigue**: Balance sensitivity vs. noise
- **Escalation**: Define alert escalation procedures
### Component Recovery
- **Progressive Strategies**: Start with least disruptive recovery methods
- **Impact Assessment**: Understand data loss implications
- **Rollback Plans**: Prepare rollback procedures for failed recoveries
- **Post-Recovery Validation**: Verify system functionality after recovery
### Integrity Management
- **Proactive Monitoring**: Regular integrity checks before issues occur
- **Auto-Repair Guidelines**: Enable for minor issues, manual for critical
- **Correlation Analysis**: Identify patterns in integrity failures
- **Preventive Measures**: Address root causes of recurring issues