Enables AI-driven remediation testing for Kubernetes clusters, including pod management, deployment operations, and fault injection through chaos engineering tools like Chaos Mesh and Litmus.
Distributed tracing of remediation workflows and test scenario execution is a planned integration.
Supports metrics verification through SignalFlow queries for validating system state and performance during remediation testing scenarios.
Click "Install Server".
Wait a few minutes for the server to deploy. Once ready, it will show a "Started" state.
In the chat, type @ followed by the MCP server name and your instructions, e.g., "@AI-Driven Remediation Testing run scenario scenario-001 with namespace=production".
That's it! The server will respond to your query, and you can continue using it as needed.
MCP Server - AI-Driven Remediation Testing
A production-ready Model Context Protocol (MCP) server for orchestrating AI-driven remediation test scenarios with gRPC, WebSocket, and HTTP integrations.
Overview
MCP Server provides end-to-end orchestration for testing AI-powered incident remediation workflows. It reads declarative YAML scenarios, injects faults, interacts with remediation APIs, evaluates AI responses, executes remediation commands, and produces comprehensive test reports.
Features
Declarative Scenarios: Define test scenarios in YAML with variable substitution
FSM-Based Orchestration: 13-state finite state machine for reliable execution
Fault Injection: Integrate with chaos engineering tools (Chaos Mesh, Litmus, etc.)
AI Evaluation: Score AI responses using regex, JSON Schema, and semantic similarity
Secure Execution: Sandboxed command execution with deny patterns
Remediation API Integration: Full HTTP/WebSocket client for workflow APIs
Comprehensive Logging: DEBUG+ file logs, INFO+ console, artifact management
Production-Ready: Type-safe Python 3.11+ with pydantic validation
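As an illustration of the variable-substitution feature above, `${name}` placeholders in a scenario can be resolved against a bindings map. This is a sketch, not the server's actual implementation (which presumably lives in mcp_server/utils/variables.py):

```python
import re

def resolve_variables(text: str, bindings: dict[str, str]) -> str:
    """Replace ${name} placeholders with values from the bindings map.

    Unknown variables raise a KeyError so typos fail fast instead of
    leaking literal placeholders into kubectl commands.
    """
    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in bindings:
            raise KeyError(f"unbound scenario variable: {name}")
        return bindings[name]

    return re.sub(r"\$\{(\w+)\}", substitute, text)

print(resolve_variables("deployment/${service}", {"service": "api-gateway"}))
# deployment/api-gateway
```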
Architecture
┌────────────────────────────────────────────────────────────────┐
│                       MCP Server (gRPC)                        │
├────────────────────────────────────────────────────────────────┤
│ ScenarioService │ FaultService │ ExecutorService │ EvalService │
└────────────────────────────────────────────────────────────────┘
         │                   │                   │
         ▼                   ▼                   ▼
┌──────────────────┐  ┌──────────────┐  ┌──────────────────┐
│  Orchestration   │  │    Fault     │  │     Command      │
│   Engine (FSM)   │  │  Injection   │  │     Executor     │
└──────────────────┘  └──────────────┘  └──────────────────┘
                             │
                             ▼
┌────────────────────────────────────────────────────────────────┐
│                Remediation Workflow API Client                 │
│          (HTTP + WebSocket, InitiateEnsemble, Resume)          │
└────────────────────────────────────────────────────────────────┘

Installation
# Install dependencies
pip install -r requirements.txt
# Generate gRPC code (optional, using simplified implementation for MVP)
# python -m grpc_tools.protoc -I proto --python_out=. --grpc_python_out=. proto/*.proto

Configuration
Configuration can be provided via config.yaml or environment variables:
# config.yaml
log_dir: "./log"
session_timeout_sec: 300
ws:
  ping_interval: 300
  ping_timeout: 300
grpc:
  host: "localhost"
  port: 50051
  timeout: 300
http:
  base_url: "http://localhost:8901"
  ws_url: "ws://localhost:8765/chatsocket"
  token_url: "https://app.lab0.signalfx.com/v2/jwt/token"

Environment variables (override config.yaml):
export CONFIG_PATH=./config.yaml
export MCP_LOG_DIR=./log
export MCP_GRPC__HOST=localhost
export MCP_GRPC__PORT=50051
export MCP_HTTP__BASE_URL=http://localhost:8901

Scenario Definition
Scenarios are defined in YAML with the following structure:
meta:
  id: scenario-001
  title: "Test Scenario"
  owner: "team-name"
defaults:
  model: "gpt-4"
  timeout: 300
bindings:
  namespace: "production"
  service: "api-gateway"
fault:
  type: "pod_kill"
  params:
    namespace: "${namespace}"
stabilize:
  wait_for:
    timeout: 120
assistant_rca:
  system: "You are an SRE expert."
  user: "Analyze the incident."
  expect:
    references: ["pod", "crash"]
    metrics: ["cpu", "memory"]
    guards:
      - type: "regex"
        pattern: "(?i)root cause"
assistant_remedy:
  system: "Provide remediation."
  user: "What commands should we run?"
  expect:
    references: ["kubectl"]
execute_remedy:
  sandbox:
    service_account: "sre-bot"
    namespace: "${namespace}"
  policies:
    deny_patterns:
      - ".*rm -rf.*"
  commands:
    - name: "Restart pods"
      cmd: "kubectl"
      args: ["rollout", "restart", "deployment/${service}"]
verify:
  signalflow:
    - program: "data('cpu.utilization').mean().publish()"
      assert_rules: ["value < 70"]
cleanup:
  always:
    - name: "Reset state"
      cmd: "kubectl"
      args: ["delete", "pod", "-l", "app=${service}"]
report:
  formats: ["json"]

FSM States
The orchestration engine follows this state machine:
INIT: Initialize scenario, resolve bindings
PRECHECK: Run pre-execution checks (SignalFlow)
FAULT_INJECT: Inject fault using FaultService
STABILIZE: Wait for system stabilization
ASSISTANT_RCA: Get RCA from remediation API
EVAL_RCA: Evaluate RCA response
ASSISTANT_REMEDY: Get remediation commands
EVAL_REMEDY: Evaluate remedy response
EXECUTE_REMEDY: Execute commands
VERIFY: Verify system state
PASS: Scenario passed
FAIL: Scenario failed
CLEANUP: Clean up resources
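The 13 states above can be wired into a minimal transition table like the following. This is an illustrative sketch; the real engine is in mcp_server/orchestration/fsm.py, and its exact branching is an assumption here:

```python
from enum import Enum

class State(str, Enum):
    INIT = "INIT"
    PRECHECK = "PRECHECK"
    FAULT_INJECT = "FAULT_INJECT"
    STABILIZE = "STABILIZE"
    ASSISTANT_RCA = "ASSISTANT_RCA"
    EVAL_RCA = "EVAL_RCA"
    ASSISTANT_REMEDY = "ASSISTANT_REMEDY"
    EVAL_REMEDY = "EVAL_REMEDY"
    EXECUTE_REMEDY = "EXECUTE_REMEDY"
    VERIFY = "VERIFY"
    PASS = "PASS"
    FAIL = "FAIL"
    CLEANUP = "CLEANUP"

# Happy path plus FAIL branches; both terminal outcomes funnel into CLEANUP.
TRANSITIONS: dict[State, list[State]] = {
    State.INIT: [State.PRECHECK],
    State.PRECHECK: [State.FAULT_INJECT, State.FAIL],
    State.FAULT_INJECT: [State.STABILIZE, State.FAIL],
    State.STABILIZE: [State.ASSISTANT_RCA, State.FAIL],
    State.ASSISTANT_RCA: [State.EVAL_RCA, State.FAIL],
    State.EVAL_RCA: [State.ASSISTANT_REMEDY, State.FAIL],
    State.ASSISTANT_REMEDY: [State.EVAL_REMEDY, State.FAIL],
    State.EVAL_REMEDY: [State.EXECUTE_REMEDY, State.FAIL],
    State.EXECUTE_REMEDY: [State.VERIFY, State.FAIL],
    State.VERIFY: [State.PASS, State.FAIL],
    State.PASS: [State.CLEANUP],
    State.FAIL: [State.CLEANUP],
    State.CLEANUP: [],
}

def advance(current: State, target: State) -> State:
    """Validate a transition before moving the scenario forward."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target
```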
Usage
Start Server
python -m mcp_server.server

Run Scenario (Programmatic)
import asyncio
from mcp_server.server import MCPServer
from mcp_server.config import get_settings
async def main():
    settings = get_settings()
    server = MCPServer(settings)

    # Run scenario
    result = await server.scenario_service.run_scenario(
        scenario_yaml=open("scenarios/example_scenario.yaml").read(),
        bindings={"namespace": "staging"},
    )
    print(f"Run ID: {result['run_id']}")
    print(f"Status: {result['status']}")

asyncio.run(main())

Check Results
Results are stored in log/runs/{run_id}/:
scenario.yaml: Original scenario
transcript.json: RCA/remedy responses
report.json: Final test report
cmd_*.txt: Command outputs
Services
FaultService
Injects and cleans up faults. Stub implementation provided; integrate with:
Chaos Mesh (Kubernetes)
Litmus (Kubernetes)
Gremlin (Cloud)
ExecutorService
Executes commands with sandboxing:
Local execution via asyncio.subprocess
Deny pattern enforcement
Output capture and artifact storage
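Deny-pattern enforcement can be sketched as a regex screen applied to the full command line before anything is spawned. This is illustrative only; the second pattern below is an example of our own, not from the project's configuration:

```python
import re

# First pattern comes from the example scenario; second is illustrative.
DENY_PATTERNS = [r".*rm -rf.*", r".*kubectl delete namespace.*"]

def is_command_allowed(cmd: str, args: list[str]) -> bool:
    """Reject any command line matching a configured deny pattern."""
    command_line = " ".join([cmd, *args])
    return not any(re.fullmatch(p, command_line) for p in DENY_PATTERNS)

print(is_command_allowed("kubectl", ["rollout", "restart", "deployment/api-gateway"]))  # True
print(is_command_allowed("bash", ["-c", "rm -rf /"]))  # False
```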
EvalService
Evaluates AI responses:
Regex guards: Pattern matching
JSON Schema: Structure validation
Semantic similarity: Token-based Jaccard
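The token-based Jaccard similarity named above can be sketched as the overlap of word-token sets (a sketch; the service's tokenization rules are an assumption):

```python
import re

def jaccard_similarity(expected: str, actual: str) -> float:
    """Jaccard overlap of lowercase word tokens: |A ∩ B| / |A ∪ B|."""
    tokens_a = set(re.findall(r"\w+", expected.lower()))
    tokens_b = set(re.findall(r"\w+", actual.lower()))
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

# 3 shared tokens (pod, api, gateway) out of 7 total, so 3/7 ≈ 0.43.
score = jaccard_similarity("pod crash in api gateway", "the api gateway pod crashed")
```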
RemediationClient
HTTP client for remediation workflow API:
initiate_remediation(): Start new workflow
resume_remediation(): Resume with input
JSON pointer resolution for graph navigation
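JSON pointer navigation over workflow responses can be sketched per RFC 6901; whether the client handles array indices and escape tokens exactly this way is an assumption:

```python
def resolve_json_pointer(document, pointer: str):
    """Resolve an RFC 6901 JSON pointer like /payload/stateIdentifier/threadId."""
    if pointer == "":
        return document
    node = document
    for token in pointer.lstrip("/").split("/"):
        # Unescape per RFC 6901: ~1 -> "/" then ~0 -> "~".
        token = token.replace("~1", "/").replace("~0", "~")
        node = node[int(token)] if isinstance(node, list) else node[token]
    return node

doc = {"payload": {"stateIdentifier": {"threadId": "thread-123"}}}
print(resolve_json_pointer(doc, "/payload/stateIdentifier/threadId"))  # thread-123
```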
API Reference
ScenarioService
service ScenarioService {
  rpc RunScenario(RunScenarioRequest) returns (RunScenarioResponse);
  rpc ListScenarios(Empty) returns (ListScenariosResponse);
  rpc GetScenario(GetScenarioRequest) returns (GetScenarioResponse);
  rpc StreamEvents(StreamEventsRequest) returns (stream ScenarioEvent);
}

Remediation API
InitiateEnsemble:
{
  "apiMethod": "InitiateEnsemble",
  "apiVersion": "1",
  "ensembleName": "REMEDIATION",
  "payload": {
    "incidentId": "inc-123",
    "rcaAnalysis": {
      "title": "Pod Crash",
      "summary": "API gateway pod crashed",
      "nextSteps": "Awaiting analysis"
    }
  }
}

ResumeEnsemble:
{
  "apiMethod": "ResumeEnsemble",
  "apiVersion": "1",
  "payload": {
    "messageType": "node_input",
    "stateIdentifier": {
      "threadId": "thread-123",
      "interruptId": "int-456"
    },
    "nodeId": "node-789",
    "inputProperties": {
      "input": "User input text"
    }
  }
}

Logging
Console: INFO+ (concise)
File: DEBUG+ at log/mcp_server.log (rotating, 10MB, 5 backups)
Artifacts: Per-run in log/runs/{run_id}/
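The logging setup described above can be reproduced with the standard library alone; this is a sketch, not the contents of mcp_server/logging_config.py:

```python
import logging
from logging.handlers import RotatingFileHandler

def configure_logging(log_path: str = "log/mcp_server.log") -> logging.Logger:
    """Console at INFO+, rotating file at DEBUG+ (10MB, 5 backups)."""
    logger = logging.getLogger("mcp_server")
    logger.setLevel(logging.DEBUG)

    console = logging.StreamHandler()
    console.setLevel(logging.INFO)

    # delay=True opens the file lazily, so the log directory need not pre-exist
    # at import time.
    file_handler = RotatingFileHandler(
        log_path, maxBytes=10 * 1024 * 1024, backupCount=5, delay=True
    )
    file_handler.setLevel(logging.DEBUG)
    file_handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
    )

    logger.addHandler(console)
    logger.addHandler(file_handler)
    return logger
```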
Development
Project Structure
Remidiation-MCP/
├── config.yaml                  # Configuration
├── requirements.txt             # Dependencies
├── proto/                       # gRPC definitions
│   ├── common.proto
│   ├── scenario_service.proto
│   ├── fault_service.proto
│   ├── executor_service.proto
│   └── eval_service.proto
├── mcp_server/
│   ├── __init__.py
│   ├── config.py                # Settings
│   ├── logging_config.py        # Logging
│   ├── server.py                # gRPC server
│   ├── models/                  # Pydantic models
│   │   └── scenario.py
│   ├── services/                # Service implementations
│   │   ├── fault_service.py
│   │   ├── executor_service.py
│   │   └── eval_service.py
│   ├── clients/                 # API clients
│   │   └── remediation_client.py
│   ├── orchestration/           # Orchestration engine
│   │   ├── fsm.py
│   │   └── engine.py
│   └── utils/                   # Utilities
│       ├── variables.py
│       └── artifacts.py
├── scenarios/                   # Test scenarios
│   └── example_scenario.yaml
└── log/                         # Logs and artifacts

Testing
# Run example scenario
python -m mcp_server.server
# In another terminal, verify logs
tail -f log/mcp_server.log
# Check results
ls -la log/runs/
cat log/runs/run-*/report.json

Production Deployment
Docker
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
CMD ["python", "-m", "mcp_server.server"]

Kubernetes
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mcp-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mcp-server
  template:
    metadata:
      labels:
        app: mcp-server
    spec:
      containers:
        - name: mcp-server
          image: mcp-server:latest
          ports:
            - containerPort: 50051
          env:
            - name: MCP_GRPC__HOST
              value: "0.0.0.0"
            - name: MCP_HTTP__BASE_URL
              value: "http://remediation-api:8901"

Contributing
Follow PEP 8 style guidelines
Add type hints to all functions
Write docstrings for public APIs
Update tests for new features
License
MIT License - See LICENSE file for details
Support
For issues and questions:
GitHub Issues: https://github.com/your-org/mcp-server/issues
Documentation: https://docs.your-org.com/mcp-server
Roadmap
Full gRPC code generation from .proto files
WebSocket streaming for real-time events
Chaos Mesh integration
Prometheus metrics export
OpenTelemetry tracing
Multi-scenario parallel execution
Scenario templates and library