# MCP Server Architecture
## Overview
MCP Server is a production-ready orchestration platform for AI-driven remediation testing. It follows a service-oriented architecture that separates configuration, logging, data models, orchestration, services, and API clients into distinct layers.
## Core Components
### 1. Configuration Layer (`config.py`)
- **Pydantic Settings**: Type-safe configuration management
- **Environment Overrides**: `MCP_*` environment variables
- **YAML Support**: Default `config.yaml` loading
- **Validation**: Automatic path creation and validation
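
The override order described above (built-in defaults, then `MCP_*` environment variables) can be illustrated with a standard-library stand-in. This is not the Pydantic Settings implementation in `config.py`. It is a minimal sketch that uses a dataclass in its place, skips YAML loading, and assumes hypothetical field names (`artifacts_dir`, `log_level`, `request_timeout`).

```python
import os
from dataclasses import dataclass, fields
from pathlib import Path


@dataclass
class Settings:
    # Hypothetical fields for illustration; the real ones live in config.py.
    artifacts_dir: Path = Path("artifacts")
    log_level: str = "INFO"
    request_timeout: float = 30.0


def load_settings(env=None, prefix="MCP_"):
    """Build Settings from defaults, letting MCP_* variables override them."""
    env = os.environ if env is None else env
    overrides = {}
    for f in fields(Settings):
        raw = env.get(prefix + f.name.upper())
        if raw is not None:
            # Coerce the environment string to the field's declared type.
            overrides[f.name] = f.type(raw)
    settings = Settings(**overrides)
    # Mirror the "automatic path creation" behaviour listed above.
    settings.artifacts_dir.mkdir(parents=True, exist_ok=True)
    return settings
```

Passing a plain dict as `env` keeps the function easy to test without touching the process environment.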
### 2. Logging Layer (`logging_config.py`)
- **Dual Output**: Console (INFO+) and File (DEBUG+)
- **Rotation**: 10MB files, 5 backups
- **Artifact Management**: Per-run directory structure
- **Structured Logging**: Contextual information with file/line numbers
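
A dual-output setup like the one listed above can be built from the standard `logging` module. The sketch below assumes a logger name of `mcp_server` and a file name of `server.log`. Both are placeholders, and the actual `logging_config.py` may be organized differently.

```python
import logging
from logging.handlers import RotatingFileHandler
from pathlib import Path

# Include the source file and line number in every record.
LOG_FORMAT = "%(asctime)s %(levelname)s %(name)s %(filename)s:%(lineno)d %(message)s"


def configure_logging(log_dir, name="mcp_server"):
    """Attach a console handler (INFO+) and a rotating file handler (DEBUG+)."""
    log_path = Path(log_dir)
    log_path.mkdir(parents=True, exist_ok=True)

    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)  # let the handlers do the filtering
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls

    formatter = logging.Formatter(LOG_FORMAT)

    console = logging.StreamHandler()
    console.setLevel(logging.INFO)
    console.setFormatter(formatter)

    file_handler = RotatingFileHandler(
        log_path / "server.log",  # placeholder file name
        maxBytes=10 * 1024 * 1024,  # 10MB per file
        backupCount=5,  # keep 5 rotated backups
    )
    file_handler.setLevel(logging.DEBUG)
    file_handler.setFormatter(formatter)

    logger.addHandler(console)
    logger.addHandler(file_handler)
    return logger
```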
### 3. Data Models (`models/`)
#### Scenario Model (`scenario.py`)
- **Meta**: Scenario identification
- **Defaults**: Default configurations
- **Bindings**: Variable substitution
- **Prechecks**: Pre-execution validation
- **Fault**: Fault injection definition
- **Stabilize**: Wait conditions
- **Assistant Steps**: RCA and Remedy interactions
- **Execute Remedy**: Command execution
- **Verify**: Post-execution validation
- **Cleanup**: Resource cleanup
- **Report**: Result reporting
All models are Pydantic v2 models and provide:
- Type validation
- Default values
- JSON schema generation
- Serialization/deserialization
### 4. Orchestration Engine (`orchestration/`)
#### FSM (`fsm.py`)
State machine with 13 states:
```
INIT → PRECHECK → FAULT_INJECT → STABILIZE →
ASSISTANT_RCA → EVAL_RCA → ASSISTANT_REMEDY →
EVAL_REMEDY → EXECUTE_REMEDY → VERIFY →
PASS/FAIL → CLEANUP
```
**ScenarioContext**: Execution state container
- Run metadata
- Current state
- Step results
- Thread/interrupt tracking
- Response storage
**StepResult**: Individual step outcome
- State identifier
- Success/failure
- Message
- Score (optional)
- Artifacts
- Metadata
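
The state list and `StepResult` fields above can be sketched with an enum and a dataclass. The transition rule here is an assumption, not taken from `fsm.py`: any failed step routes to `FAIL`, and both `PASS` and `FAIL` lead to `CLEANUP`. The helper `next_state` is a hypothetical name.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class State(Enum):
    INIT = "INIT"
    PRECHECK = "PRECHECK"
    FAULT_INJECT = "FAULT_INJECT"
    STABILIZE = "STABILIZE"
    ASSISTANT_RCA = "ASSISTANT_RCA"
    EVAL_RCA = "EVAL_RCA"
    ASSISTANT_REMEDY = "ASSISTANT_REMEDY"
    EVAL_REMEDY = "EVAL_REMEDY"
    EXECUTE_REMEDY = "EXECUTE_REMEDY"
    VERIFY = "VERIFY"
    PASS = "PASS"
    FAIL = "FAIL"
    CLEANUP = "CLEANUP"


# Happy-path order, taken from the diagram above (INIT through VERIFY).
HAPPY_PATH = [s for s in State if s not in (State.PASS, State.FAIL, State.CLEANUP)]


def next_state(current, success):
    """Return the state that follows `current` for the given step outcome."""
    if current is State.CLEANUP:
        raise ValueError("CLEANUP is terminal")
    if current in (State.PASS, State.FAIL):
        return State.CLEANUP
    if not success:
        return State.FAIL  # assumption: every failure short-circuits to FAIL
    if current is State.VERIFY:
        return State.PASS
    return HAPPY_PATH[HAPPY_PATH.index(current) + 1]


@dataclass
class StepResult:
    state: State
    success: bool
    message: str = ""
    score: Optional[float] = None  # only evaluation steps produce a score
    artifacts: list = field(default_factory=list)
    metadata: dict = field(default_factory=dict)
```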
#### Engine (`engine.py`)
Async generator-based orchestration:
- Yields `StepResult` for each step
- Error handling and recovery
- Artifact generation
- Variable substitution
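
The generator-based pattern can be sketched as follows. The names `run_scenario`, `steps`, and `cleanup` are hypothetical, and `StepResult` is reduced to three fields for brevity. The real engine in `engine.py` also handles artifacts and variable substitution, which this sketch leaves out.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class StepResult:
    state: str
    success: bool
    message: str = ""


async def run_scenario(steps, cleanup):
    """Yield one StepResult per step, then a final CLEANUP result.

    `steps` is a list of (state_name, async_callable) pairs and `cleanup`
    is an async callable; both are stand-ins for the real state handlers.
    """
    for state, handler in steps:
        try:
            message = await handler()
        except Exception as exc:
            # Report the failure as a step outcome and stop the run.
            yield StepResult(state, False, str(exc))
            break
        yield StepResult(state, True, message)
    # Cleanup runs whether the loop completed or stopped on a failure.
    await cleanup()
    yield StepResult("CLEANUP", True, "resources released")
```

A caller consumes the results with `async for result in run_scenario(steps, cleanup)`, which lets it log or stream each step as soon as it finishes. Note that in this sketch cleanup only runs if the caller iterates the generator to the end.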
### 5. Services (`services/`)
#### FaultService (`fault_service.py`)
Stub for chaos engineering integration:
- `inject()`: Create fault
- `cleanup()`: Remove fault
- Tracking of active faults
- Integration points for Chaos Mesh, Litmus, Gremlin
#### ExecutorService (`executor_service.py`)
Secure command execution:
- `asyncio.subprocess` for local execution
- Deny pattern enforcement
- Output capture (stdout/stderr)
- Artifact storage
- Exit code handling
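
A minimal sketch of deny-pattern enforcement combined with `asyncio` subprocess execution is shown below. The patterns, the `CommandDenied` exception, and the `run_command` signature are illustrative assumptions. The real service would load its patterns from configuration and also store output as artifacts.

```python
import asyncio
import re
import shlex

# Hypothetical deny patterns; the real list would come from configuration.
DENY_PATTERNS = [
    re.compile(p) for p in (r"\brm\s+-rf\s+/", r"\bmkfs\b", r"\bshutdown\b")
]


class CommandDenied(Exception):
    """Raised when a command matches a deny pattern."""


async def run_command(argv, timeout=30.0):
    """Run argv without a shell and return (exit_code, stdout, stderr)."""
    rendered = shlex.join(argv)
    for pattern in DENY_PATTERNS:
        if pattern.search(rendered):
            raise CommandDenied(f"blocked by deny pattern: {pattern.pattern}")

    proc = await asyncio.create_subprocess_exec(
        *argv,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    try:
        stdout, stderr = await asyncio.wait_for(proc.communicate(), timeout)
    except asyncio.TimeoutError:
        # Do not leave a runaway child process behind.
        proc.kill()
        await proc.wait()
        raise
    return proc.returncode, stdout.decode(), stderr.decode()
```

Passing an argument list to `create_subprocess_exec` avoids shell interpretation, so deny patterns are a second line of defence instead of the only one.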
#### EvalService (`eval_service.py`)
AI response evaluation:
- **Regex Guards**: Pattern validation
- **JSON Schema**: Structure validation
- **Token Jaccard**: Lexical similarity based on token overlap with a reference answer
- Reference/metric matching
- Threshold-based pass/fail
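
The regex guard and token Jaccard checks can be sketched with the standard library alone. The function names, the tokenization rule, and the default threshold of 0.5 are assumptions for illustration. JSON Schema validation is omitted because it needs a third-party validator.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into a set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a, b):
    """Token Jaccard: size of the intersection divided by size of the union."""
    ta, tb = tokenize(a), tokenize(b)
    if not ta and not tb:
        return 1.0  # two empty texts are treated as identical
    return len(ta & tb) / len(ta | tb)


def score(response, reference, guards=(), threshold=0.5):
    """Return (passed, score); a failing regex guard forces a failure."""
    for pattern in guards:
        if not re.search(pattern, response):
            return False, 0.0
    value = jaccard(response, reference)
    return value >= threshold, value
```

Because Jaccard compares token sets, it rewards shared vocabulary and ignores word order. A paraphrase that uses different words can score low even when it is correct, which is why guards and reference matching are combined.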
### 6. Clients (`clients/`)
#### RemediationClient (`remediation_client.py`)
HTTP client for workflow API:
- **httpx**: Async HTTP client
- **initiate_remediation()**: Start workflow
- **resume_remediation()**: Resume with input
- **JSON Pointer Resolution**: Navigate graph structure
- **State Management**: Thread/interrupt tracking
API Methods:
- `InitiateEnsemble`: Create new remediation workflow
- `ResumeEnsemble`: Continue workflow with input
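
JSON Pointer resolution follows RFC 6901: a pointer such as `/graph/nodes/0/id` is split on `/` and each token selects a dict key or list index. The sketch below is a generic resolver. The function name and the sample response shape are hypothetical, not the client's actual API.

```python
def resolve_pointer(document, pointer):
    """Resolve an RFC 6901 JSON Pointer against nested dicts and lists."""
    if pointer == "":
        return document  # the empty pointer refers to the whole document
    if not pointer.startswith("/"):
        raise ValueError(f"invalid JSON Pointer: {pointer!r}")
    current = document
    for raw in pointer[1:].split("/"):
        # Unescape per RFC 6901: "~1" becomes "/", then "~0" becomes "~".
        token = raw.replace("~1", "/").replace("~0", "~")
        if isinstance(current, list):
            current = current[int(token)]
        else:
            current = current[token]
    return current


# Hypothetical response shape, for illustration only.
sample = {"graph": {"nodes": [{"id": "rca", "output": "pod OOMKilled"}]}}
node_output = resolve_pointer(sample, "/graph/nodes/0/output")
```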
### 7. Server (`server.py`)
#### ScenarioServiceImpl
Main service implementation:
- Scenario registry
- Run tracking
- Async execution
- Result aggregation
#### MCPServer
Server lifecycle management:
- Service initialization
- Scenario loading
- Graceful shutdown
- Client cleanup
## Data Flow
```
1. Load Scenario (YAML → Pydantic Model)
↓
2. Initialize Context (ScenarioContext)
↓
3. Orchestration Engine (FSM-based)
├─→ FaultService.inject()
├─→ RemediationClient.initiate_remediation()
├─→ EvalService.score()
├─→ ExecutorService.run()
└─→ FaultService.cleanup()
↓
4. Generate Artifacts
├─→ scenario.yaml
├─→ transcript.json
├─→ report.json
└─→ cmd_*.txt
↓
5. Return ScenarioResult
```
## Error Handling
### Service Level
- `try`/`except` blocks with logging
- Graceful degradation
- Detailed error messages
### Orchestration Level
- State-specific error handling
- Automatic cleanup on failure
- Final state preservation
### Client Level
- HTTP error handling
- Timeout management
- Retry logic (Future)
## Security
### Command Execution
- Deny pattern matching
- Namespace isolation
- Service account enforcement
- Output sanitization
### API Communication
- HTTPS support
- Token-based auth (configurable)
- Request validation
## Extensibility
### Adding New Fault Types
1. Update `FaultService.inject()`
2. Add integration with chaos tool
3. Update cleanup logic
### Adding New Evaluations
1. Extend `EvalService.score()`
2. Add new guard types
3. Update `AssistantExpectation` model
### Adding New States
1. Update `State` enum
2. Add handler in `engine.py`
3. Update FSM transitions
## Performance Considerations
### Async/Await
- All I/O operations are async
- Non-blocking execution
- Parallel service calls where possible
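
Independent I/O calls can run concurrently with `asyncio.gather`. The sketch below uses `asyncio.sleep` as a stand-in for real network calls, and the check names are hypothetical.

```python
import asyncio


async def check(name, delay):
    """Stand-in for an async I/O call such as an HTTP request."""
    await asyncio.sleep(delay)
    return name


async def run_prechecks():
    """Run independent checks concurrently; results keep argument order."""
    return await asyncio.gather(
        check("api-reachable", 0.05),
        check("namespace-exists", 0.05),
        check("pods-ready", 0.05),
    )
```

Since the three awaits overlap, the total wait is close to the slowest single call instead of the sum of all three.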
### Resource Management
- Connection pooling (httpx)
- File handle management
- Memory-efficient streaming
### Logging
- Rotating file handler
- Size-based rotation
- Async-safe logging
## Testing Strategy
### Unit Tests
- Service mocking
- Pydantic validation
- FSM state transitions
### Integration Tests
- Mock remediation API
- Local command execution
- End-to-end scenarios
### System Tests
- Full scenario execution
- Artifact validation
- Performance benchmarks
## Deployment Patterns
### Standalone
```bash
python -m mcp_server.server
```
### Docker
```bash
docker build -t mcp-server .
docker run -p 50051:50051 mcp-server
```
### Kubernetes
- StatefulSet for persistence
- ConfigMap for scenarios
- PVC for logs
### Cloud Functions
- Serverless execution
- Event-driven triggers
- Managed storage
## Monitoring
### Metrics (Future)
- Scenario execution time
- Pass/fail rates
- Service latency
- Resource usage
### Tracing (Future)
- OpenTelemetry integration
- Distributed tracing
- Service dependencies
### Alerting
- Failed scenarios
- Service errors
- Resource exhaustion
## Future Enhancements
1. **gRPC Streaming**: Real-time event streaming
2. **WebSocket**: Live scenario updates
3. **Metrics Export**: Prometheus/StatsD
4. **Distributed Tracing**: OpenTelemetry
5. **Multi-tenancy**: Namespace isolation
6. **Scenario Library**: Pre-built templates
7. **UI Dashboard**: Web-based monitoring
8. **CI/CD Integration**: GitHub Actions, Jenkins