AI-Driven Remediation Testing

ARCHITECTURE.md•6.41 KiB

# MCP Server Architecture

## Overview

MCP Server is a production-ready orchestration platform for AI-driven remediation testing. It follows a service-oriented architecture with clear separation of concerns.

## Core Components

### 1. Configuration Layer (`config.py`)

- **Pydantic Settings**: Type-safe configuration management
- **Environment Overrides**: `MCP_*` environment variables
- **YAML Support**: Default `config.yaml` loading
- **Validation**: Automatic path creation and validation

### 2. Logging Layer (`logging_config.py`)

- **Dual Output**: Console (INFO+) and File (DEBUG+)
- **Rotation**: 10MB files, 5 backups
- **Artifact Management**: Per-run directory structure
- **Structured Logging**: Contextual information with file/line numbers

### 3. Data Models (`models/`)

#### Scenario Model (`scenario.py`)

- **Meta**: Scenario identification
- **Defaults**: Default configurations
- **Bindings**: Variable substitution
- **Prechecks**: Pre-execution validation
- **Fault**: Fault injection definition
- **Stabilize**: Wait conditions
- **Assistant Steps**: RCA and Remedy interactions
- **Execute Remedy**: Command execution
- **Verify**: Post-execution validation
- **Cleanup**: Resource cleanup
- **Report**: Result reporting

All models are Pydantic v2 with:

- Type validation
- Default values
- JSON schema generation
- Serialization/deserialization

### 4. Orchestration Engine (`orchestration/`)

#### FSM (`fsm.py`)

State machine with 13 states:

```
INIT → PRECHECK → FAULT_INJECT → STABILIZE → ASSISTANT_RCA → EVAL_RCA → ASSISTANT_REMEDY → EVAL_REMEDY → EXECUTE_REMEDY → VERIFY → PASS/FAIL → CLEANUP
```

**ScenarioContext**: Execution state container

- Run metadata
- Current state
- Step results
- Thread/interrupt tracking
- Response storage

**StepResult**: Individual step outcome

- State identifier
- Success/failure
- Message
- Score (optional)
- Artifacts
- Metadata

#### Engine (`engine.py`)

Async generator-based orchestration:

- Yields `StepResult` for each step
- Error handling and recovery
- Artifact generation
- Variable substitution

### 5. Services (`services/`)

#### FaultService (`fault_service.py`)

Stub for chaos engineering integration:

- `inject()`: Create fault
- `cleanup()`: Remove fault
- Tracking of active faults
- Integration points for Chaos Mesh, Litmus, Gremlin

#### ExecutorService (`executor_service.py`)

Secure command execution:

- `asyncio.subprocess` for local execution
- Deny pattern enforcement
- Output capture (stdout/stderr)
- Artifact storage
- Exit code handling

#### EvalService (`eval_service.py`)

AI response evaluation:

- **Regex Guards**: Pattern validation
- **JSON Schema**: Structure validation
- **Token Jaccard**: Semantic similarity
- Reference/metric matching
- Threshold-based pass/fail

### 6. Clients (`clients/`)

#### RemediationClient (`remediation_client.py`)

HTTP client for workflow API:

- **httpx**: Async HTTP client
- **initiate_remediation()**: Start workflow
- **resume_remediation()**: Resume with input
- **JSON Pointer Resolution**: Navigate graph structure
- **State Management**: Thread/interrupt tracking

API Methods:

- `InitiateEnsemble`: Create new remediation workflow
- `ResumeEnsemble`: Continue workflow with input

### 7. Server (`server.py`)

#### ScenarioServiceImpl

Main service implementation:

- Scenario registry
- Run tracking
- Async execution
- Result aggregation

#### MCPServer

Server lifecycle management:

- Service initialization
- Scenario loading
- Graceful shutdown
- Client cleanup

## Data Flow

```
1. Load Scenario (YAML → Pydantic Model)
   ↓
2. Initialize Context (ScenarioContext)
   ↓
3. Orchestration Engine (FSM-based)
   ├─→ FaultService.inject()
   ├─→ RemediationClient.initiate()
   ├─→ EvalService.score()
   ├─→ ExecutorService.run()
   └─→ FaultService.cleanup()
   ↓
4. Generate Artifacts
   ├─→ scenario.yaml
   ├─→ transcript.json
   ├─→ report.json
   └─→ cmd_*.txt
   ↓
5. Return ScenarioResult
```

## Error Handling

### Service Level

- Try/catch with logging
- Graceful degradation
- Detailed error messages

### Orchestration Level

- State-specific error handling
- Automatic cleanup on failure
- Final state preservation

### Client Level

- HTTP error handling
- Timeout management
- Retry logic (future)

## Security

### Command Execution

- Deny pattern matching
- Namespace isolation
- Service account enforcement
- Output sanitization

### API Communication

- HTTPS support
- Token-based auth (configurable)
- Request validation

## Extensibility

### Adding New Fault Types

1. Update `FaultService.inject()`
2. Add integration with chaos tool
3. Update cleanup logic

### Adding New Evaluations

1. Extend `EvalService.score()`
2. Add new guard types
3. Update `AssistantExpectation` model

### Adding New States

1. Update `State` enum
2. Add handler in `engine.py`
3. Update FSM transitions

## Performance Considerations

### Async/Await

- All I/O operations are async
- Non-blocking execution
- Parallel service calls where possible

### Resource Management

- Connection pooling (httpx)
- File handle management
- Memory-efficient streaming

### Logging

- Rotating file handler
- Size-based rotation
- Async-safe logging

## Testing Strategy

### Unit Tests

- Service mocking
- Pydantic validation
- FSM state transitions

### Integration Tests

- Mock remediation API
- Local command execution
- End-to-end scenarios

### System Tests

- Full scenario execution
- Artifact validation
- Performance benchmarks

## Deployment Patterns

### Standalone

```bash
python -m mcp_server.server
```

### Docker

```bash
docker build -t mcp-server .
docker run -p 50051:50051 mcp-server
```

### Kubernetes

- StatefulSet for persistence
- ConfigMap for scenarios
- PVC for logs

### Cloud Functions

- Serverless execution
- Event-driven triggers
- Managed storage

## Monitoring

### Metrics (Future)

- Scenario execution time
- Pass/fail rates
- Service latency
- Resource usage

### Tracing (Future)

- OpenTelemetry integration
- Distributed tracing
- Service dependencies

### Alerting

- Failed scenarios
- Service errors
- Resource exhaustion

## Future Enhancements

1. **gRPC Streaming**: Real-time event streaming
2. **WebSocket**: Live scenario updates
3. **Metrics Export**: Prometheus/StatsD
4. **Distributed Tracing**: OpenTelemetry
5. **Multi-tenancy**: Namespace isolation
6. **Scenario Library**: Pre-built templates
7. **UI Dashboard**: Web-based monitoring
8. **CI/CD Integration**: GitHub Actions, Jenkins
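
## Example: FSM Transitions

The 13-state machine described under the Orchestration Engine can be sketched as a small transition function. This is an illustration, not the contents of `fsm.py`: the state names come from the diagram, while `next_state`, `walk`, and the rule that any failed step jumps to `FAIL` before `CLEANUP` are assumptions made for the example.

```python
from enum import Enum
from typing import List, Optional


class State(str, Enum):
    INIT = "INIT"
    PRECHECK = "PRECHECK"
    FAULT_INJECT = "FAULT_INJECT"
    STABILIZE = "STABILIZE"
    ASSISTANT_RCA = "ASSISTANT_RCA"
    EVAL_RCA = "EVAL_RCA"
    ASSISTANT_REMEDY = "ASSISTANT_REMEDY"
    EVAL_REMEDY = "EVAL_REMEDY"
    EXECUTE_REMEDY = "EXECUTE_REMEDY"
    VERIFY = "VERIFY"
    PASS = "PASS"
    FAIL = "FAIL"
    CLEANUP = "CLEANUP"


# Linear order taken from the diagram; VERIFY is followed by PASS or FAIL.
HAPPY_PATH = [
    State.INIT, State.PRECHECK, State.FAULT_INJECT, State.STABILIZE,
    State.ASSISTANT_RCA, State.EVAL_RCA, State.ASSISTANT_REMEDY,
    State.EVAL_REMEDY, State.EXECUTE_REMEDY, State.VERIFY,
]


def next_state(current: State, success: bool) -> Optional[State]:
    """Return the state after `current`, or None once CLEANUP has run."""
    if current is State.CLEANUP:
        return None
    if current in (State.PASS, State.FAIL):
        return State.CLEANUP
    if not success:
        # Assumption: a failed step short-circuits to FAIL, then CLEANUP.
        return State.FAIL
    if current is State.VERIFY:
        return State.PASS
    return HAPPY_PATH[HAPPY_PATH.index(current) + 1]


def walk(fail_at: Optional[State] = None) -> List[State]:
    """Trace the visited states, optionally failing at one of them."""
    path: List[State] = []
    state: Optional[State] = State.INIT
    while state is not None:
        path.append(state)
        state = next_state(state, success=(state is not fail_at))
    return path
```

Calling `walk()` traces the full happy path ending in `PASS` and `CLEANUP`, while `walk(State.EVAL_RCA)` stops after the RCA evaluation and goes straight to `FAIL` and `CLEANUP`.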
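
## Example: Async Generator Orchestration

The engine is described as an async generator that yields a `StepResult` per step and cleans up on failure. The sketch below shows one way that pattern could look. The `StepResult` fields mirror the list in the Orchestration Engine section, but it is written as a plain dataclass rather than the Pydantic model the project uses, and `run_steps` and its handler signature are hypothetical.

```python
import asyncio
from dataclasses import dataclass, field
from typing import AsyncIterator, Awaitable, Callable, Dict, List, Optional, Tuple


@dataclass
class StepResult:
    state: str
    success: bool
    message: str = ""
    score: Optional[float] = None
    artifacts: List[str] = field(default_factory=list)
    metadata: Dict[str, object] = field(default_factory=dict)


Handler = Callable[[], Awaitable[StepResult]]


async def run_steps(
    steps: List[Tuple[str, Handler]], cleanup: Handler
) -> AsyncIterator[StepResult]:
    """Yield one result per step, stop at the first failure, then clean up."""
    for state, handler in steps:
        try:
            result = await handler()
        except Exception as exc:
            # Convert unexpected errors into a failed step instead of raising.
            result = StepResult(state=state, success=False, message=f"error: {exc}")
        yield result
        if not result.success:
            break
    # Cleanup runs whether the steps passed or failed.
    yield await cleanup()


if __name__ == "__main__":
    async def precheck() -> StepResult:
        return StepResult("PRECHECK", True, "cluster reachable")

    async def cleanup() -> StepResult:
        return StepResult("CLEANUP", True, "fault removed")

    async def main() -> None:
        async for step in run_steps([("PRECHECK", precheck)], cleanup):
            print(step.state, step.success, step.message)

    asyncio.run(main())
```

Yielding results as they are produced lets a caller stream progress or write artifacts per step without waiting for the whole scenario to finish.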
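
## Example: Deny Patterns Before Command Execution

The ExecutorService is described as running commands through `asyncio.subprocess` with deny pattern enforcement and output capture. A minimal sketch of that idea follows. The pattern list, the `run_command` name, the returned dictionary shape, and the timeout parameter are all assumptions for illustration and are not taken from `executor_service.py`.

```python
import asyncio
import re
from typing import Dict, Sequence

# Hypothetical deny list; the real patterns would come from configuration.
DENY_PATTERNS = [r"\brm\s+-rf\b", r"\bmkfs\b", r"\bshutdown\b"]


def is_denied(command: str, patterns: Sequence[str] = DENY_PATTERNS) -> bool:
    """Return True if the command matches any deny pattern."""
    return any(re.search(pattern, command) for pattern in patterns)


async def run_command(command: str, timeout: float = 30.0) -> Dict[str, object]:
    """Run a shell command unless denied; capture stdout, stderr, exit code."""
    if is_denied(command):
        return {"denied": True, "exit_code": None, "stdout": "",
                "stderr": "command rejected by deny pattern"}

    proc = await asyncio.create_subprocess_shell(
        command,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    try:
        out, err = await asyncio.wait_for(proc.communicate(), timeout=timeout)
    except asyncio.TimeoutError:
        # Assumption: a timed-out command is killed and reported as a failure.
        proc.kill()
        await proc.wait()
        return {"denied": False, "exit_code": None, "stdout": "",
                "stderr": f"timed out after {timeout}s"}

    return {"denied": False, "exit_code": proc.returncode,
            "stdout": out.decode(), "stderr": err.decode()}
```

Checking the deny list before spawning the process means a rejected command never reaches the shell. Regex deny lists are easy to bypass with quoting or aliases, so a sketch like this complements, rather than replaces, the namespace isolation and service account controls listed under Security.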
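
## Example: Regex Guards and Token Jaccard Scoring

The EvalService combines regex guards with a token Jaccard score and a pass/fail threshold. The sketch below shows how those pieces could fit together. The tokenizer, the function names, the default threshold of 0.5, and the result shape are assumptions; the actual `EvalService.score()` signature is not shown in this document.

```python
import re
from typing import Dict, Sequence, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def token_jaccard(candidate: str, reference: str) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B| over the two token sets."""
    a, b = tokenize(candidate), tokenize(reference)
    if not a and not b:
        return 1.0  # Two empty texts are treated as identical.
    return len(a & b) / len(a | b)


def passes_guards(
    text: str,
    must_match: Sequence[str] = (),
    must_not_match: Sequence[str] = (),
) -> bool:
    """Require every must_match pattern and forbid every must_not_match one."""
    required_ok = all(re.search(p, text) for p in must_match)
    forbidden_ok = not any(re.search(p, text) for p in must_not_match)
    return required_ok and forbidden_ok


def score(
    response: str,
    reference: str,
    threshold: float = 0.5,
    must_match: Sequence[str] = (),
    must_not_match: Sequence[str] = (),
) -> Dict[str, object]:
    """Combine the guards and the similarity score into a pass/fail verdict."""
    similarity = token_jaccard(response, reference)
    guards_ok = passes_guards(response, must_match, must_not_match)
    return {
        "score": similarity,
        "guards_ok": guards_ok,
        "passed": guards_ok and similarity >= threshold,
    }
```

For example, `token_jaccard("restart the pod", "restart the failing pod")` shares three of four distinct tokens and returns 0.75. Jaccard over tokens measures lexical overlap, so it only approximates semantic similarity: a correct answer phrased with different words can score low.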
