# MCP Server - AI-Driven Remediation Testing

A production-ready Model Context Protocol (MCP) server for orchestrating AI-driven remediation test scenarios with gRPC, WebSocket, and HTTP integrations.

## Overview

MCP Server provides end-to-end orchestration for testing AI-powered incident remediation workflows. It reads declarative YAML scenarios, injects faults, interacts with remediation APIs, evaluates AI responses, executes remediation commands, and produces comprehensive test reports.

## Features

- **Declarative Scenarios**: Define test scenarios in YAML with variable substitution
- **FSM-Based Orchestration**: 13-state finite state machine for reliable execution
- **Fault Injection**: Integrate with chaos engineering tools (Chaos Mesh, Litmus, etc.)
- **AI Evaluation**: Score AI responses using regex, JSON Schema, and semantic similarity
- **Secure Execution**: Sandboxed command execution with deny patterns
- **Remediation API Integration**: Full HTTP/WebSocket client for workflow APIs
- **Comprehensive Logging**: DEBUG+ file logs, INFO+ console, artifact management
- **Production-Ready**: Type-safe Python 3.11+ with pydantic validation

## Architecture

```
┌────────────────────────────────────────────────────────────────┐
│                       MCP Server (gRPC)                        │
├────────────────────────────────────────────────────────────────┤
│ ScenarioService │ FaultService │ ExecutorService │ EvalService │
└────────────────────────────────────────────────────────────────┘
         │                   │                   │
         ▼                   ▼                   ▼
┌──────────────────┐  ┌──────────────┐  ┌──────────────────┐
│  Orchestration   │  │    Fault     │  │     Command      │
│   Engine (FSM)   │  │  Injection   │  │     Executor     │
└──────────────────┘  └──────────────┘  └──────────────────┘
         │
         ▼
┌────────────────────────────────────────────────────────────────┐
│                Remediation Workflow API Client                 │
│          (HTTP + WebSocket, InitiateEnsemble, Resume)          │
└────────────────────────────────────────────────────────────────┘
```

## Installation

```bash
# Install dependencies
pip install -r requirements.txt

# Generate gRPC code (optional, using simplified implementation for MVP)
# python -m grpc_tools.protoc -I proto --python_out=. --grpc_python_out=. proto/*.proto
```

## Configuration

Configuration can be provided via `config.yaml` or environment variables:

```yaml
# config.yaml
log_dir: "./log"
session_timeout_sec: 300

ws:
  ping_interval: 300
  ping_timeout: 300

grpc:
  host: "localhost"
  port: 50051
  timeout: 300

http:
  base_url: "http://localhost:8901"
  ws_url: "ws://localhost:8765/chatsocket"
  token_url: "https://app.lab0.signalfx.com/v2/jwt/token"
```

Environment variables (override `config.yaml`):

```bash
export CONFIG_PATH=./config.yaml
export MCP_LOG_DIR=./log
export MCP_GRPC__HOST=localhost
export MCP_GRPC__PORT=50051
export MCP_HTTP__BASE_URL=http://localhost:8901
```
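
The double-underscore names above map environment variables onto nested settings. The repository's `config.py` is not reproduced here; the sketch below shows how such a mapping is typically wired with the pydantic-settings package (an assumption), with field names and defaults taken from `config.yaml`.

```python
# Sketch only: illustrates how MCP_GRPC__HOST-style variables can map to
# nested settings via pydantic-settings; the actual config.py may differ.
from pydantic import BaseModel
from pydantic_settings import BaseSettings, SettingsConfigDict


class GrpcSettings(BaseModel):
    host: str = "localhost"
    port: int = 50051
    timeout: int = 300


class HttpSettings(BaseModel):
    base_url: str = "http://localhost:8901"
    ws_url: str = "ws://localhost:8765/chatsocket"


class Settings(BaseSettings):
    # The MCP_ prefix plus "__" as nested delimiter yields names like MCP_GRPC__PORT.
    model_config = SettingsConfigDict(env_prefix="MCP_", env_nested_delimiter="__")

    log_dir: str = "./log"
    session_timeout_sec: int = 300
    grpc: GrpcSettings = GrpcSettings()
    http: HttpSettings = HttpSettings()


if __name__ == "__main__":
    # With MCP_GRPC__PORT=50052 exported, settings.grpc.port becomes 50052.
    settings = Settings()
    print(settings.grpc.host, settings.grpc.port)
```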

## Scenario Definition

Scenarios are defined in YAML with the following structure:

```yaml
meta:
  id: scenario-001
  title: "Test Scenario"
  owner: "team-name"

defaults:
  model: "gpt-4"
  timeout: 300

bindings:
  namespace: "production"
  service: "api-gateway"

fault:
  type: "pod_kill"
  params:
    namespace: "${namespace}"

stabilize:
  wait_for:
    timeout: 120

assistant_rca:
  system: "You are an SRE expert."
  user: "Analyze the incident."
  expect:
    references: ["pod", "crash"]
    metrics: ["cpu", "memory"]
    guards:
      - type: "regex"
        pattern: "(?i)root cause"

assistant_remedy:
  system: "Provide remediation."
  user: "What commands should we run?"
  expect:
    references: ["kubectl"]

execute_remedy:
  sandbox:
    service_account: "sre-bot"
    namespace: "${namespace}"
  policies:
    deny_patterns:
      - ".*rm -rf.*"
  commands:
    - name: "Restart pods"
      cmd: "kubectl"
      args: ["rollout", "restart", "deployment/${service}"]

verify:
  signalflow:
    - program: "data('cpu.utilization').mean().publish()"
      assert_rules: ["value < 70"]

cleanup:
  always:
    - name: "Reset state"
      cmd: "kubectl"
      args: ["delete", "pod", "-l", "app=${service}"]

report:
  formats: ["json"]
```

## FSM States

The orchestration engine follows this state machine:

1. **INIT**: Initialize scenario, resolve bindings
2. **PRECHECK**: Run pre-execution checks (SignalFlow)
3. **FAULT_INJECT**: Inject fault using FaultService
4. **STABILIZE**: Wait for system stabilization
5. **ASSISTANT_RCA**: Get RCA from remediation API
6. **EVAL_RCA**: Evaluate RCA response
7. **ASSISTANT_REMEDY**: Get remediation commands
8. **EVAL_REMEDY**: Evaluate remedy response
9. **EXECUTE_REMEDY**: Execute commands
10. **VERIFY**: Verify system state
11. **PASS**: Scenario passed
12. **FAIL**: Scenario failed
13. **CLEANUP**: Clean up resources
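
In practice the engine advances through this happy path in order, a failing step short-circuits to **FAIL**, and **CLEANUP** runs regardless of the outcome (matching the scenario's `cleanup.always` block). The snippet below is a minimal illustrative sketch of that flow, not the actual `orchestration/fsm.py`; retry, timeout, and event-streaming behavior are omitted.

```python
# Sketch only: the happy path through the 13 states, with any failing step
# short-circuiting to FAIL; CLEANUP runs in both cases. The real
# orchestration/fsm.py may differ in transition details.
from collections.abc import Callable
from enum import Enum


class State(str, Enum):
    INIT = "INIT"
    PRECHECK = "PRECHECK"
    FAULT_INJECT = "FAULT_INJECT"
    STABILIZE = "STABILIZE"
    ASSISTANT_RCA = "ASSISTANT_RCA"
    EVAL_RCA = "EVAL_RCA"
    ASSISTANT_REMEDY = "ASSISTANT_REMEDY"
    EVAL_REMEDY = "EVAL_REMEDY"
    EXECUTE_REMEDY = "EXECUTE_REMEDY"
    VERIFY = "VERIFY"
    PASS = "PASS"
    FAIL = "FAIL"
    CLEANUP = "CLEANUP"


HAPPY_PATH = [
    State.INIT, State.PRECHECK, State.FAULT_INJECT, State.STABILIZE,
    State.ASSISTANT_RCA, State.EVAL_RCA, State.ASSISTANT_REMEDY,
    State.EVAL_REMEDY, State.EXECUTE_REMEDY, State.VERIFY,
]


def run(step: Callable[[State], bool]) -> State:
    """Advance through the happy path; `step(state)` returns True on success."""
    for state in HAPPY_PATH:
        if not step(state):
            terminal = State.FAIL
            break
    else:
        terminal = State.PASS
    step(State.CLEANUP)  # cleanup runs regardless of pass or fail
    return terminal


if __name__ == "__main__":
    print(run(lambda s: True))                   # State.PASS
    print(run(lambda s: s is not State.VERIFY))  # State.FAIL (VERIFY step fails)
```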

## Usage

### Start Server

```bash
python -m mcp_server.server
```

### Run Scenario (Programmatic)

```python
import asyncio
from mcp_server.server import MCPServer
from mcp_server.config import get_settings


async def main():
    settings = get_settings()
    server = MCPServer(settings)

    # Run scenario
    result = await server.scenario_service.run_scenario(
        scenario_yaml=open("scenarios/example_scenario.yaml").read(),
        bindings={"namespace": "staging"},
    )

    print(f"Run ID: {result['run_id']}")
    print(f"Status: {result['status']}")


asyncio.run(main())
```

### Check Results

Results are stored in `log/runs/{run_id}/`:

- `scenario.yaml`: Original scenario
- `transcript.json`: RCA/remedy responses
- `report.json`: Final test report
- `cmd_*.txt`: Command outputs

## Services

### FaultService

Injects and cleans up faults. Stub implementation provided; integrate with:

- Chaos Mesh (Kubernetes)
- Litmus (Kubernetes)
- Gremlin (Cloud)

### ExecutorService

Executes commands with sandboxing:

- Local execution via `asyncio.subprocess`
- Deny pattern enforcement
- Output capture and artifact storage

### EvalService

Evaluates AI responses:

- **Regex guards**: Pattern matching
- **JSON Schema**: Structure validation
- **Semantic similarity**: Token-based Jaccard
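
As a rough picture of the semantic-similarity check, token-based Jaccard scores the overlap between the expected references and the AI response. The snippet below is an illustrative sketch only; the tokenizer, normalization, and pass threshold used by `eval_service.py` are not shown here and may differ.

```python
# Sketch only: token-based Jaccard similarity as a stand-in for the
# EvalService check; eval_service.py's tokenizer and threshold may differ.
import re


def tokenize(text: str) -> set[str]:
    """Lowercase and split on non-alphanumeric characters."""
    return {t for t in re.split(r"\W+", text.lower()) if t}


def jaccard(expected: str, actual: str) -> float:
    """|A ∩ B| / |A ∪ B| over the two token sets."""
    a, b = tokenize(expected), tokenize(actual)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    score = jaccard(
        "pod crash caused by memory pressure",
        "The crash was caused by memory pressure on the pod.",
    )
    print(f"{score:.2f}")  # 0.67; a pass threshold (assumed) would decide the verdict
```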

### RemediationClient

HTTP client for the remediation workflow API:

- `initiate_remediation()`: Start new workflow
- `resume_remediation()`: Resume with input
- JSON pointer resolution for graph navigation

## API Reference

### ScenarioService

```protobuf
service ScenarioService {
  rpc RunScenario(RunScenarioRequest) returns (RunScenarioResponse);
  rpc ListScenarios(Empty) returns (ListScenariosResponse);
  rpc GetScenario(GetScenarioRequest) returns (GetScenarioResponse);
  rpc StreamEvents(StreamEventsRequest) returns (stream ScenarioEvent);
}
```

### Remediation API

**InitiateEnsemble**:

```json
{
  "apiMethod": "InitiateEnsemble",
  "apiVersion": "1",
  "ensembleName": "REMEDIATION",
  "payload": {
    "incidentId": "inc-123",
    "rcaAnalysis": {
      "title": "Pod Crash",
      "summary": "API gateway pod crashed",
      "nextSteps": "Awaiting analysis"
    }
  }
}
```

**ResumeEnsemble**:

```json
{
  "apiMethod": "ResumeEnsemble",
  "apiVersion": "1",
  "payload": {
    "messageType": "node_input",
    "stateIdentifier": {
      "threadId": "thread-123",
      "interruptId": "int-456"
    },
    "nodeId": "node-789",
    "inputProperties": {
      "input": "User input text"
    }
  }
}
```

## Logging

- **Console**: INFO+ (concise)
- **File**: DEBUG+ at `log/mcp_server.log` (rotating, 10MB, 5 backups)
- **Artifacts**: Per-run in `log/runs/{run_id}/`

## Development

### Project Structure

```
Remidiation-MCP/
├── config.yaml              # Configuration
├── requirements.txt         # Dependencies
├── proto/                   # gRPC definitions
│   ├── common.proto
│   ├── scenario_service.proto
│   ├── fault_service.proto
│   ├── executor_service.proto
│   └── eval_service.proto
├── mcp_server/
│   ├── __init__.py
│   ├── config.py            # Settings
│   ├── logging_config.py    # Logging
│   ├── server.py            # gRPC server
│   ├── models/              # Pydantic models
│   │   └── scenario.py
│   ├── services/            # Service implementations
│   │   ├── fault_service.py
│   │   ├── executor_service.py
│   │   └── eval_service.py
│   ├── clients/             # API clients
│   │   └── remediation_client.py
│   ├── orchestration/       # Orchestration engine
│   │   ├── fsm.py
│   │   └── engine.py
│   └── utils/               # Utilities
│       ├── variables.py
│       └── artifacts.py
├── scenarios/               # Test scenarios
│   └── example_scenario.yaml
└── log/                     # Logs and artifacts
```

### Testing

```bash
# Run example scenario
python -m mcp_server.server

# In another terminal, verify logs
tail -f log/mcp_server.log

# Check results
ls -la log/runs/
cat log/runs/run-*/report.json
```

## Production Deployment

### Docker

```dockerfile
FROM python:3.11-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .

CMD ["python", "-m", "mcp_server.server"]
```

### Kubernetes

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mcp-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mcp-server
  template:
    metadata:
      labels:
        app: mcp-server
    spec:
      containers:
        - name: mcp-server
          image: mcp-server:latest
          ports:
            - containerPort: 50051
          env:
            - name: MCP_GRPC__HOST
              value: "0.0.0.0"
            - name: MCP_HTTP__BASE_URL
              value: "http://remediation-api:8901"
```

## Contributing

1. Follow PEP 8 style guidelines
2. Add type hints to all functions
3. Write docstrings for public APIs
4. Update tests for new features

## License

MIT License - See LICENSE file for details

## Support

For issues and questions:

- GitHub Issues: https://github.com/your-org/mcp-server/issues
- Documentation: https://docs.your-org.com/mcp-server

## Roadmap

- [ ] Full gRPC code generation from .proto files
- [ ] WebSocket streaming for real-time events
- [ ] Chaos Mesh integration
- [ ] Prometheus metrics export
- [ ] OpenTelemetry tracing
- [ ] Multi-scenario parallel execution
- [ ] Scenario templates and library
