# MCP Server - AI-Driven Remediation Testing
A production-ready Model Context Protocol (MCP) server for orchestrating AI-driven remediation test scenarios with gRPC, WebSocket, and HTTP integrations.
## Overview
MCP Server provides end-to-end orchestration for testing AI-powered incident remediation workflows. It reads declarative YAML scenarios, injects faults, interacts with remediation APIs, evaluates AI responses, executes remediation commands, and produces comprehensive test reports.
## Features
- **Declarative Scenarios**: Define test scenarios in YAML with variable substitution
- **FSM-Based Orchestration**: 13-state finite state machine for reliable execution
- **Fault Injection**: Integrate with chaos engineering tools (Chaos Mesh, Litmus, etc.)
- **AI Evaluation**: Score AI responses using regex, JSON Schema, and semantic similarity
- **Secure Execution**: Sandboxed command execution with deny patterns
- **Remediation API Integration**: Full HTTP/WebSocket client for workflow APIs
- **Comprehensive Logging**: DEBUG+ file logs, INFO+ console, artifact management
- **Production-Ready**: Type-safe Python 3.11+ with pydantic validation
## Architecture
```
┌───────────────────────────────────────────────────────────────┐
│                       MCP Server (gRPC)                       │
├───────────────────────────────────────────────────────────────┤
│ ScenarioService │ FaultService │ ExecutorService │ EvalService│
└───────────────────────────────────────────────────────────────┘
         │                   │                   │
         ▼                   ▼                   ▼
┌──────────────────┐  ┌──────────────┐  ┌──────────────────┐
│  Orchestration   │  │    Fault     │  │     Command      │
│   Engine (FSM)   │  │  Injection   │  │     Executor     │
└──────────────────┘  └──────────────┘  └──────────────────┘
         │
         ▼
┌───────────────────────────────────────────────────────────────┐
│                Remediation Workflow API Client                │
│         (HTTP + WebSocket, InitiateEnsemble, Resume)          │
└───────────────────────────────────────────────────────────────┘
```
## Installation
```bash
# Install dependencies
pip install -r requirements.txt
# Generate gRPC code (optional; the MVP uses a simplified implementation)
# python -m grpc_tools.protoc -I proto --python_out=. --grpc_python_out=. proto/*.proto
```
## Configuration
Configuration can be provided via `config.yaml` or environment variables:
```yaml
# config.yaml
log_dir: "./log"
session_timeout_sec: 300

ws:
  ping_interval: 300
  ping_timeout: 300

grpc:
  host: "localhost"
  port: 50051
  timeout: 300

http:
  base_url: "http://localhost:8901"
  ws_url: "ws://localhost:8765/chatsocket"
  token_url: "https://app.lab0.signalfx.com/v2/jwt/token"
```
Environment variables (override config.yaml):
```bash
export CONFIG_PATH=./config.yaml
export MCP_LOG_DIR=./log
export MCP_GRPC__HOST=localhost
export MCP_GRPC__PORT=50051
export MCP_HTTP__BASE_URL=http://localhost:8901
```
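The `MCP_` prefix and double-underscore delimiter map environment variables onto nested settings. Below is a minimal sketch of how this could be modeled with pydantic-settings; the project's actual `mcp_server/config.py` may define the settings differently.

```python
from pydantic import BaseModel
from pydantic_settings import BaseSettings, SettingsConfigDict


class GrpcSettings(BaseModel):
    host: str = "localhost"
    port: int = 50051
    timeout: int = 300


class HttpSettings(BaseModel):
    base_url: str = "http://localhost:8901"
    ws_url: str = "ws://localhost:8765/chatsocket"


class Settings(BaseSettings):
    # MCP_GRPC__HOST -> settings.grpc.host, MCP_HTTP__BASE_URL -> settings.http.base_url
    model_config = SettingsConfigDict(env_prefix="MCP_", env_nested_delimiter="__")

    log_dir: str = "./log"
    grpc: GrpcSettings = GrpcSettings()
    http: HttpSettings = HttpSettings()


settings = Settings()
print(settings.grpc.host)  # overridden by MCP_GRPC__HOST when exported
```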
## Scenario Definition
Scenarios are defined in YAML with the following structure:
```yaml
meta:
  id: scenario-001
  title: "Test Scenario"
  owner: "team-name"

defaults:
  model: "gpt-4"
  timeout: 300

bindings:
  namespace: "production"
  service: "api-gateway"

fault:
  type: "pod_kill"
  params:
    namespace: "${namespace}"

stabilize:
  wait_for:
    timeout: 120

assistant_rca:
  system: "You are an SRE expert."
  user: "Analyze the incident."
  expect:
    references: ["pod", "crash"]
    metrics: ["cpu", "memory"]
    guards:
      - type: "regex"
        pattern: "(?i)root cause"

assistant_remedy:
  system: "Provide remediation."
  user: "What commands should we run?"
  expect:
    references: ["kubectl"]

execute_remedy:
  sandbox:
    service_account: "sre-bot"
    namespace: "${namespace}"
  policies:
    deny_patterns:
      - ".*rm -rf.*"
  commands:
    - name: "Restart pods"
      cmd: "kubectl"
      args: ["rollout", "restart", "deployment/${service}"]

verify:
  signalflow:
    - program: "data('cpu.utilization').mean().publish()"
      assert_rules: ["value < 70"]

cleanup:
  always:
    - name: "Reset state"
      cmd: "kubectl"
      args: ["delete", "pod", "-l", "app=${service}"]

report:
  formats: ["json"]
```
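Fields such as `${namespace}` and `${service}` are resolved from `bindings` (and any overrides passed at run time). The snippet below is an illustrative sketch of that substitution, assuming simple `${name}` placeholders; the real helper lives in `mcp_server/utils/variables.py` and may behave differently.

```python
import re


def substitute(text: str, bindings: dict[str, str]) -> str:
    """Replace ${name} placeholders with values from the scenario bindings."""
    def _resolve(match: re.Match) -> str:
        name = match.group(1)
        if name not in bindings:
            raise KeyError(f"unbound scenario variable: {name}")
        return bindings[name]

    return re.sub(r"\$\{(\w+)\}", _resolve, text)


print(substitute("deployment/${service}", {"service": "api-gateway"}))
# -> deployment/api-gateway
```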
## FSM States
The orchestration engine advances through the following states (a minimal transition-table sketch follows the list):
1. **INIT**: Initialize scenario, resolve bindings
2. **PRECHECK**: Run pre-execution checks (SignalFlow)
3. **FAULT_INJECT**: Inject fault using FaultService
4. **STABILIZE**: Wait for system stabilization
5. **ASSISTANT_RCA**: Get RCA from remediation API
6. **EVAL_RCA**: Evaluate RCA response
7. **ASSISTANT_REMEDY**: Get remediation commands
8. **EVAL_REMEDY**: Evaluate remedy response
9. **EXECUTE_REMEDY**: Execute commands
10. **VERIFY**: Verify system state
11. **PASS**: Scenario passed
12. **FAIL**: Scenario failed
13. **CLEANUP**: Clean up resources
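Assuming the states above, the happy path can be expressed as a simple transition map; the authoritative definition lives in `mcp_server/orchestration/fsm.py`, and failure branches (any step may move to FAIL) are omitted here.

```python
from enum import Enum, auto


class State(Enum):
    INIT = auto()
    PRECHECK = auto()
    FAULT_INJECT = auto()
    STABILIZE = auto()
    ASSISTANT_RCA = auto()
    EVAL_RCA = auto()
    ASSISTANT_REMEDY = auto()
    EVAL_REMEDY = auto()
    EXECUTE_REMEDY = auto()
    VERIFY = auto()
    PASS = auto()
    FAIL = auto()
    CLEANUP = auto()


# Happy-path transitions; both PASS and FAIL funnel into CLEANUP.
HAPPY_PATH = {
    State.INIT: State.PRECHECK,
    State.PRECHECK: State.FAULT_INJECT,
    State.FAULT_INJECT: State.STABILIZE,
    State.STABILIZE: State.ASSISTANT_RCA,
    State.ASSISTANT_RCA: State.EVAL_RCA,
    State.EVAL_RCA: State.ASSISTANT_REMEDY,
    State.ASSISTANT_REMEDY: State.EVAL_REMEDY,
    State.EVAL_REMEDY: State.EXECUTE_REMEDY,
    State.EXECUTE_REMEDY: State.VERIFY,
    State.VERIFY: State.PASS,
    State.PASS: State.CLEANUP,
    State.FAIL: State.CLEANUP,
}
```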
## Usage
### Start Server
```bash
python -m mcp_server.server
```
### Run Scenario (Programmatic)
```python
import asyncio
from pathlib import Path

from mcp_server.config import get_settings
from mcp_server.server import MCPServer


async def main():
    settings = get_settings()
    server = MCPServer(settings)

    # Run the bundled example scenario with a binding override
    result = await server.scenario_service.run_scenario(
        scenario_yaml=Path("scenarios/example_scenario.yaml").read_text(),
        bindings={"namespace": "staging"},
    )
    print(f"Run ID: {result['run_id']}")
    print(f"Status: {result['status']}")


asyncio.run(main())
```
### Check Results
Results are stored in `log/runs/{run_id}/`:
- `scenario.yaml`: Original scenario
- `transcript.json`: RCA/remedy responses
- `report.json`: Final test report
- `cmd_*.txt`: Command outputs
## Services
### FaultService
Injects and cleans up faults. A stub implementation is provided; integrate it with chaos tooling such as the following (a hedged Chaos Mesh sketch follows the list):
- Chaos Mesh (Kubernetes)
- Litmus (Kubernetes)
- Gremlin (Cloud)
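As an illustration only, a `pod_kill` fault could be implemented against Chaos Mesh by applying a `PodChaos` resource. The manifest below follows the public `chaos-mesh.org/v1alpha1` CRD; how the stub is actually wired up is left to your integration.

```python
import asyncio

# Illustrative PodChaos manifest (chaos-mesh.org/v1alpha1); adjust the selector to your workload.
POD_KILL_TEMPLATE = """\
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: {name}
  namespace: {namespace}
spec:
  action: pod-kill
  mode: one
  selector:
    labelSelectors:
      app: {app}
"""


async def inject_pod_kill(name: str, namespace: str, app: str) -> None:
    """Apply the PodChaos resource via kubectl (assumes kubectl is on PATH and configured)."""
    manifest = POD_KILL_TEMPLATE.format(name=name, namespace=namespace, app=app)
    proc = await asyncio.create_subprocess_exec(
        "kubectl", "apply", "-f", "-",
        stdin=asyncio.subprocess.PIPE,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    _, err = await proc.communicate(manifest.encode())
    if proc.returncode != 0:
        raise RuntimeError(f"fault injection failed: {err.decode().strip()}")
```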
### ExecutorService
Executes commands with sandboxing (a minimal sketch follows the list):
- Local execution via `asyncio.subprocess`
- Deny pattern enforcement
- Output capture and artifact storage
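The sketch below shows the deny-pattern check wrapped around `asyncio.subprocess`; it is illustrative only, and the shipped `executor_service.py` additionally writes captured output to the run's artifact directory.

```python
import asyncio
import re
import shlex

# Mirrors execute_remedy.policies.deny_patterns from the scenario YAML.
DENY_PATTERNS = [r".*rm -rf.*"]


async def run_command(cmd: str, args: list[str]) -> tuple[int, str, str]:
    """Run a command locally after checking it against the deny patterns."""
    rendered = shlex.join([cmd, *args])
    for pattern in DENY_PATTERNS:
        if re.match(pattern, rendered):
            raise PermissionError(f"command blocked by deny pattern: {pattern}")

    proc = await asyncio.create_subprocess_exec(
        cmd, *args,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    out, err = await proc.communicate()
    return proc.returncode, out.decode(), err.decode()
```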
### EvalService
Evaluates AI responses (a similarity sketch follows the list):
- **Regex guards**: Pattern matching
- **JSON Schema**: Structure validation
- **Semantic similarity**: Token-based Jaccard
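The semantic check is a lightweight token overlap rather than an embedding model. A minimal sketch of token-based Jaccard similarity:

```python
import re


def jaccard_similarity(expected: str, actual: str) -> float:
    """Token-level Jaccard similarity between two texts, in [0.0, 1.0]."""
    a = set(re.findall(r"\w+", expected.lower()))
    b = set(re.findall(r"\w+", actual.lower()))
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


print(jaccard_similarity("pod crash root cause", "The root cause was a pod crash loop"))
```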
### RemediationClient
HTTP client for the remediation workflow API (a JSON pointer sketch follows the list):
- `initiate_remediation()`: Start new workflow
- `resume_remediation()`: Resume with input
- JSON pointer resolution for graph navigation
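`initiate_remediation()` and `resume_remediation()` wrap the request envelopes shown under API Reference below; JSON pointers are used to pull fields (for example, node or interrupt identifiers) out of the returned workflow graph. A minimal RFC 6901 resolver, as an illustration only:

```python
from typing import Any


def resolve_pointer(document: Any, pointer: str) -> Any:
    """Resolve an RFC 6901 JSON pointer such as '/nodes/0/id'."""
    if pointer == "":
        return document
    node = document
    for token in pointer.lstrip("/").split("/"):
        token = token.replace("~1", "/").replace("~0", "~")
        node = node[int(token)] if isinstance(node, list) else node[token]
    return node


graph = {"nodes": [{"id": "node-789", "type": "interrupt"}]}
print(resolve_pointer(graph, "/nodes/0/id"))  # -> node-789
```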
## API Reference
### ScenarioService
```protobuf
service ScenarioService {
  rpc RunScenario(RunScenarioRequest) returns (RunScenarioResponse);
  rpc ListScenarios(Empty) returns (ListScenariosResponse);
  rpc GetScenario(GetScenarioRequest) returns (GetScenarioResponse);
  rpc StreamEvents(StreamEventsRequest) returns (stream ScenarioEvent);
}
```
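Once stubs are generated from `proto/scenario_service.proto` (see Installation), calling the service could look like the sketch below. The generated module names and the request field name are assumptions and should be checked against the generated code.

```python
import grpc

# Assumed names of the modules produced by grpc_tools.protoc from scenario_service.proto.
import scenario_service_pb2 as pb
import scenario_service_pb2_grpc as pb_grpc


def run_scenario(scenario_yaml: str, host: str = "localhost", port: int = 50051) -> None:
    with grpc.insecure_channel(f"{host}:{port}") as channel:
        stub = pb_grpc.ScenarioServiceStub(channel)
        # Field name is assumed; check the generated RunScenarioRequest message.
        response = stub.RunScenario(pb.RunScenarioRequest(scenario_yaml=scenario_yaml))
        print(response)
```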
### Remediation API
**InitiateEnsemble**:
```json
{
  "apiMethod": "InitiateEnsemble",
  "apiVersion": "1",
  "ensembleName": "REMEDIATION",
  "payload": {
    "incidentId": "inc-123",
    "rcaAnalysis": {
      "title": "Pod Crash",
      "summary": "API gateway pod crashed",
      "nextSteps": "Awaiting analysis"
    }
  }
}
```
**ResumeEnsemble**:
```json
{
  "apiMethod": "ResumeEnsemble",
  "apiVersion": "1",
  "payload": {
    "messageType": "node_input",
    "stateIdentifier": {
      "threadId": "thread-123",
      "interruptId": "int-456"
    },
    "nodeId": "node-789",
    "inputProperties": {
      "input": "User input text"
    }
  }
}
```
## Logging
- **Console**: INFO+ (concise)
- **File**: DEBUG+ at `log/mcp_server.log` (rotating, 10MB, 5 backups)
- **Artifacts**: Per-run in `log/runs/{run_id}/`
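A minimal sketch of this policy using the standard library (the project's actual setup lives in `mcp_server/logging_config.py`):

```python
import logging
from logging.handlers import RotatingFileHandler


def configure_logging(log_path: str = "log/mcp_server.log") -> None:
    """Console at INFO, rotating file at DEBUG (10 MB per file, 5 backups)."""
    root = logging.getLogger()
    root.setLevel(logging.DEBUG)

    console = logging.StreamHandler()
    console.setLevel(logging.INFO)
    console.setFormatter(logging.Formatter("%(levelname)s %(name)s: %(message)s"))

    file_handler = RotatingFileHandler(log_path, maxBytes=10 * 1024 * 1024, backupCount=5)
    file_handler.setLevel(logging.DEBUG)
    file_handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
    )

    root.addHandler(console)
    root.addHandler(file_handler)
```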
## Development
### Project Structure
```
Remidiation-MCP/
├── config.yaml              # Configuration
├── requirements.txt         # Dependencies
├── proto/                   # gRPC definitions
│   ├── common.proto
│   ├── scenario_service.proto
│   ├── fault_service.proto
│   ├── executor_service.proto
│   └── eval_service.proto
├── mcp_server/
│   ├── __init__.py
│   ├── config.py            # Settings
│   ├── logging_config.py    # Logging
│   ├── server.py            # gRPC server
│   ├── models/              # Pydantic models
│   │   └── scenario.py
│   ├── services/            # Service implementations
│   │   ├── fault_service.py
│   │   ├── executor_service.py
│   │   └── eval_service.py
│   ├── clients/             # API clients
│   │   └── remediation_client.py
│   ├── orchestration/       # Orchestration engine
│   │   ├── fsm.py
│   │   └── engine.py
│   └── utils/               # Utilities
│       ├── variables.py
│       └── artifacts.py
├── scenarios/               # Test scenarios
│   └── example_scenario.yaml
└── log/                     # Logs and artifacts
```
### Testing
```bash
# Run example scenario
python -m mcp_server.server
# In another terminal, verify logs
tail -f log/mcp_server.log
# Check results
ls -la log/runs/
cat log/runs/run-*/report.json
```
## Production Deployment
### Docker
```dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
CMD ["python", "-m", "mcp_server.server"]
```
### Kubernetes
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mcp-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mcp-server
  template:
    metadata:
      labels:
        app: mcp-server
    spec:
      containers:
        - name: mcp-server
          image: mcp-server:latest
          ports:
            - containerPort: 50051
          env:
            - name: MCP_GRPC__HOST
              value: "0.0.0.0"
            - name: MCP_HTTP__BASE_URL
              value: "http://remediation-api:8901"
```
## Contributing
1. Follow PEP 8 style guidelines
2. Add type hints to all functions
3. Write docstrings for public APIs
4. Update tests for new features
## License
MIT License - See LICENSE file for details
## Support
For issues and questions:
- GitHub Issues: https://github.com/your-org/mcp-server/issues
- Documentation: https://docs.your-org.com/mcp-server
## Roadmap
- [ ] Full gRPC code generation from .proto files
- [ ] WebSocket streaming for real-time events
- [ ] Chaos Mesh integration
- [ ] Prometheus metrics export
- [ ] OpenTelemetry tracing
- [ ] Multi-scenario parallel execution
- [ ] Scenario templates and library