MCP Server - AI-Driven Remediation Testing
A production-ready Model Context Protocol (MCP) server for orchestrating AI-driven remediation test scenarios with gRPC, WebSocket, and HTTP integrations.
Overview
MCP Server provides end-to-end orchestration for testing AI-powered incident remediation workflows. It reads declarative YAML scenarios, injects faults, interacts with remediation APIs, evaluates AI responses, executes remediation commands, and produces comprehensive test reports.
Features
Declarative Scenarios: Define test scenarios in YAML with variable substitution
FSM-Based Orchestration: 13-state finite state machine for reliable execution
Fault Injection: Integrate with chaos engineering tools (Chaos Mesh, Litmus, etc.)
AI Evaluation: Score AI responses using regex, JSON Schema, and semantic similarity
Secure Execution: Sandboxed command execution with deny patterns
Remediation API Integration: Full HTTP/WebSocket client for workflow APIs
Comprehensive Logging: DEBUG+ file logs, INFO+ console, artifact management
Production-Ready: Type-safe Python 3.11+ with pydantic validation
Architecture
┌───────────────────────────────────────────────────────────────┐
│                       MCP Server (gRPC)                        │
├───────────────────────────────────────────────────────────────┤
│ ScenarioService │ FaultService │ ExecutorService │ EvalService │
└───────────────────────────────────────────────────────────────┘
         │                   │                  │
         ▼                   ▼                  ▼
┌──────────────────┐ ┌──────────────┐ ┌──────────────────┐
│  Orchestration   │ │    Fault     │ │     Command      │
│   Engine (FSM)   │ │  Injection   │ │     Executor     │
└──────────────────┘ └──────────────┘ └──────────────────┘
         │
         ▼
┌───────────────────────────────────────────────────────────────┐
│               Remediation Workflow API Client                  │
│         (HTTP + WebSocket, InitiateEnsemble, Resume)           │
└───────────────────────────────────────────────────────────────┘
Installation
# Install dependencies
pip install -r requirements.txt
# Generate gRPC code (optional; the MVP uses a simplified implementation)
# python -m grpc_tools.protoc -I proto --python_out=. --grpc_python_out=. proto/*.proto
Configuration
Configuration can be provided via config.yaml or environment variables:
# config.yaml
log_dir: "./log"
session_timeout_sec: 300

ws:
  ping_interval: 300
  ping_timeout: 300

grpc:
  host: "localhost"
  port: 50051
  timeout: 300

http:
  base_url: "http://localhost:8901"
  ws_url: "ws://localhost:8765/chatsocket"
  token_url: "https://app.lab0.signalfx.com/v2/jwt/token"
Environment variables (override config.yaml):
export CONFIG_PATH=./config.yaml
export MCP_LOG_DIR=./log
export MCP_GRPC__HOST=localhost
export MCP_GRPC__PORT=50051
export MCP_HTTP__BASE_URL=http://localhost:8901
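The double underscore in variable names maps onto nested configuration sections. As a rough illustration of how this pattern can be modeled with pydantic-settings (class and field names here are illustrative, not necessarily the project's actual config.py):

# Illustrative sketch only -- shows how MCP_GRPC__HOST-style variables
# map onto nested settings via pydantic-settings' env_nested_delimiter.
from pydantic import BaseModel
from pydantic_settings import BaseSettings, SettingsConfigDict

class GrpcSettings(BaseModel):
    host: str = "localhost"
    port: int = 50051
    timeout: int = 300

class HttpSettings(BaseModel):
    base_url: str = "http://localhost:8901"
    ws_url: str = "ws://localhost:8765/chatsocket"

class Settings(BaseSettings):
    model_config = SettingsConfigDict(env_prefix="MCP_", env_nested_delimiter="__")

    log_dir: str = "./log"
    session_timeout_sec: int = 300
    grpc: GrpcSettings = GrpcSettings()
    http: HttpSettings = HttpSettings()

# Setting MCP_GRPC__PORT=6000 in the environment overrides Settings().grpc.port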
Scenario Definition
Scenarios are defined in YAML with the following structure:
meta:
  id: scenario-001
  title: "Test Scenario"
  owner: "team-name"

defaults:
  model: "gpt-4"
  timeout: 300

bindings:
  namespace: "production"
  service: "api-gateway"

fault:
  type: "pod_kill"
  params:
    namespace: "${namespace}"

stabilize:
  wait_for:
    timeout: 120

assistant_rca:
  system: "You are an SRE expert."
  user: "Analyze the incident."
  expect:
    references: ["pod", "crash"]
    metrics: ["cpu", "memory"]
    guards:
      - type: "regex"
        pattern: "(?i)root cause"

assistant_remedy:
  system: "Provide remediation."
  user: "What commands should we run?"
  expect:
    references: ["kubectl"]

execute_remedy:
  sandbox:
    service_account: "sre-bot"
    namespace: "${namespace}"
  policies:
    deny_patterns:
      - ".*rm -rf.*"
  commands:
    - name: "Restart pods"
      cmd: "kubectl"
      args: ["rollout", "restart", "deployment/${service}"]

verify:
  signalflow:
    - program: "data('cpu.utilization').mean().publish()"
      assert_rules: ["value < 70"]

cleanup:
  always:
    - name: "Reset state"
      cmd: "kubectl"
      args: ["delete", "pod", "-l", "app=${service}"]

report:
  formats: ["json"]
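Placeholders such as ${namespace} and ${service} are resolved from bindings before execution. A minimal sketch of that substitution step (the project's utils/variables.py may implement it differently):

# Illustrative ${var} substitution against scenario bindings.
import re

_VAR_PATTERN = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")

def substitute(text: str, bindings: dict[str, str]) -> str:
    """Replace ${name} placeholders with values from bindings."""
    def _resolve(match: re.Match) -> str:
        name = match.group(1)
        if name not in bindings:
            raise KeyError(f"Unbound scenario variable: {name}")
        return bindings[name]
    return _VAR_PATTERN.sub(_resolve, text)

# substitute("deployment/${service}", {"service": "api-gateway"})
# -> "deployment/api-gateway"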
FSM States
The orchestration engine follows this state machine (a minimal sketch follows the list):
INIT: Initialize scenario, resolve bindings
PRECHECK: Run pre-execution checks (SignalFlow)
FAULT_INJECT: Inject fault using FaultService
STABILIZE: Wait for system stabilization
ASSISTANT_RCA: Get RCA from remediation API
EVAL_RCA: Evaluate RCA response
ASSISTANT_REMEDY: Get remediation commands
EVAL_REMEDY: Evaluate remedy response
EXECUTE_REMEDY: Execute commands
VERIFY: Verify system state
PASS: Scenario passed
FAIL: Scenario failed
CLEANUP: Clean up resources
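A minimal sketch of these 13 states and the happy-path transitions between them; the project's orchestration/fsm.py may encode failure routing differently:

# Illustrative FSM sketch. Assumption: any failed evaluation, execution, or
# verification step routes to FAIL, and both PASS and FAIL lead to CLEANUP.
from enum import Enum, auto

class State(Enum):
    INIT = auto()
    PRECHECK = auto()
    FAULT_INJECT = auto()
    STABILIZE = auto()
    ASSISTANT_RCA = auto()
    EVAL_RCA = auto()
    ASSISTANT_REMEDY = auto()
    EVAL_REMEDY = auto()
    EXECUTE_REMEDY = auto()
    VERIFY = auto()
    PASS = auto()
    FAIL = auto()
    CLEANUP = auto()

HAPPY_PATH = {
    State.INIT: State.PRECHECK,
    State.PRECHECK: State.FAULT_INJECT,
    State.FAULT_INJECT: State.STABILIZE,
    State.STABILIZE: State.ASSISTANT_RCA,
    State.ASSISTANT_RCA: State.EVAL_RCA,
    State.EVAL_RCA: State.ASSISTANT_REMEDY,
    State.ASSISTANT_REMEDY: State.EVAL_REMEDY,
    State.EVAL_REMEDY: State.EXECUTE_REMEDY,
    State.EXECUTE_REMEDY: State.VERIFY,
    State.VERIFY: State.PASS,
    State.PASS: State.CLEANUP,
    State.FAIL: State.CLEANUP,
}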
Usage
Start Server
python -m mcp_server.server
Run Scenario (Programmatic)
import asyncio
from mcp_server.server import MCPServer
from mcp_server.config import get_settings

async def main():
    settings = get_settings()
    server = MCPServer(settings)

    # Run scenario
    with open("scenarios/example_scenario.yaml") as f:
        scenario_yaml = f.read()

    result = await server.scenario_service.run_scenario(
        scenario_yaml=scenario_yaml,
        bindings={"namespace": "staging"},
    )

    print(f"Run ID: {result['run_id']}")
    print(f"Status: {result['status']}")

asyncio.run(main())
Check Results
Results are stored in log/runs/{run_id}/:
scenario.yaml: Original scenario
transcript.json: RCA/remedy responses
report.json: Final test report
cmd_*.txt: Command outputs
Services
FaultService
Injects and cleans up faults. A stub implementation is provided; integrate with one of the following backends (a minimal adapter sketch follows the list):
Chaos Mesh (Kubernetes)
Litmus (Kubernetes)
Gremlin (Cloud)
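The adapter sketch below shows one way to plug different chaos backends behind a common interface; the class names are illustrative, not the project's actual fault_service.py:

# Illustrative backend interface for fault injection adapters.
from abc import ABC, abstractmethod
from typing import Any

class FaultBackend(ABC):
    """Common interface the FaultService can delegate to."""

    @abstractmethod
    async def inject(self, fault_type: str, params: dict[str, Any]) -> str:
        """Start a fault (e.g. pod_kill) and return a handle for cleanup."""

    @abstractmethod
    async def cleanup(self, handle: str) -> None:
        """Tear down a previously injected fault."""

class LoggingStubBackend(FaultBackend):
    """Stand-in backend that only records calls, useful for dry runs."""

    async def inject(self, fault_type: str, params: dict[str, Any]) -> str:
        print(f"[stub] inject {fault_type} with {params}")
        return f"stub-{fault_type}"

    async def cleanup(self, handle: str) -> None:
        print(f"[stub] cleanup {handle}")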
ExecutorService
Executes commands with sandboxing: commands run under the configured service account and namespace, and are rejected when they match a configured deny pattern.
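A rough sketch of the deny-pattern check, assuming the executor renders the command and its arguments to a string before matching; the real executor_service.py may differ:

# Illustrative deny-pattern enforcement before executing a remediation command.
import asyncio
import re
import shlex

def is_denied(cmd: str, args: list[str], deny_patterns: list[str]) -> bool:
    """Return True if the fully rendered command matches any deny pattern."""
    rendered = " ".join([cmd, *args])
    return any(re.search(pattern, rendered) for pattern in deny_patterns)

async def run_command(cmd: str, args: list[str], deny_patterns: list[str]) -> str:
    if is_denied(cmd, args, deny_patterns):
        raise PermissionError(f"Blocked by deny policy: {cmd} {shlex.join(args)}")
    proc = await asyncio.create_subprocess_exec(
        cmd, *args,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.STDOUT,
    )
    out, _ = await proc.communicate()
    return out.decode()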
EvalService
Evaluates AI responses:
Regex guards: Pattern matching
JSON Schema: Structure validation
Semantic similarity: Token-based Jaccard overlap (sketched below)
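A minimal sketch of the token-based Jaccard score; the project's eval_service.py may tokenize or threshold differently:

# Illustrative token-based Jaccard similarity for semantic scoring.
import re

def _tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard_similarity(expected: str, actual: str) -> float:
    """Ratio of shared lowercase word tokens to all distinct tokens."""
    a, b = _tokens(expected), _tokens(actual)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# jaccard_similarity("restart the api-gateway pod", "kubectl restart pod api-gateway")
# -> 4 shared tokens over a union of 6 -> ~0.67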
RemediationClient
HTTP client for remediation workflow API:
initiate_remediation(): Start new workflow
resume_remediation(): Resume with input
JSON pointer resolution for graph navigation (sketched below)
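A minimal sketch of RFC 6901 JSON-pointer resolution as used for graph navigation; the real remediation_client.py may rely on a library instead:

# Illustrative JSON-pointer resolution over a workflow response document.
from typing import Any

def resolve_pointer(document: Any, pointer: str) -> Any:
    """Resolve a pointer such as '/payload/stateIdentifier/threadId'."""
    if not pointer:
        return document
    node = document
    for raw in pointer.lstrip("/").split("/"):
        token = raw.replace("~1", "/").replace("~0", "~")  # unescape per RFC 6901
        node = node[int(token)] if isinstance(node, list) else node[token]
    return node

# resolve_pointer({"payload": {"stateIdentifier": {"threadId": "thread-123"}}},
#                 "/payload/stateIdentifier/threadId") -> "thread-123"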
API Reference
ScenarioService
service ScenarioService {
  rpc RunScenario(RunScenarioRequest) returns (RunScenarioResponse);
  rpc ListScenarios(Empty) returns (ListScenariosResponse);
  rpc GetScenario(GetScenarioRequest) returns (GetScenarioResponse);
  rpc StreamEvents(StreamEventsRequest) returns (stream ScenarioEvent);
}
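A hedged example of calling RunScenario once stubs have been generated with grpc_tools.protoc. The generated module names follow protoc conventions, and the request field name is an assumption; check the generated code and the proto definitions:

# Illustrative gRPC client call; module layout and field names are assumptions.
import grpc

from scenario_service_pb2 import RunScenarioRequest       # generated (assumed name)
from scenario_service_pb2_grpc import ScenarioServiceStub  # generated (assumed name)

def run_scenario(yaml_text: str) -> None:
    with grpc.insecure_channel("localhost:50051") as channel:
        stub = ScenarioServiceStub(channel)
        # The scenario_yaml field name is an assumption about the request message.
        response = stub.RunScenario(RunScenarioRequest(scenario_yaml=yaml_text))
        print(response)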
Remediation API
InitiateEnsemble:
{
  "apiMethod": "InitiateEnsemble",
  "apiVersion": "1",
  "ensembleName": "REMEDIATION",
  "payload": {
    "incidentId": "inc-123",
    "rcaAnalysis": {
      "title": "Pod Crash",
      "summary": "API gateway pod crashed",
      "nextSteps": "Awaiting analysis"
    }
  }
}
ResumeEnsemble:
{
  "apiMethod": "ResumeEnsemble",
  "apiVersion": "1",
  "payload": {
    "messageType": "node_input",
    "stateIdentifier": {
      "threadId": "thread-123",
      "interruptId": "int-456"
    },
    "nodeId": "node-789",
    "inputProperties": {
      "input": "User input text"
    }
  }
}
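A hedged sketch of sending the InitiateEnsemble message with httpx. The transport and route are assumptions (the real RemediationClient may use the WebSocket channel or a different HTTP path); "/ensemble" below is purely hypothetical:

# Illustrative only: payload shape comes from the example above; the endpoint
# path "/ensemble" is hypothetical.
import httpx

async def initiate_remediation(base_url: str, incident_id: str) -> dict:
    message = {
        "apiMethod": "InitiateEnsemble",
        "apiVersion": "1",
        "ensembleName": "REMEDIATION",
        "payload": {
            "incidentId": incident_id,
            "rcaAnalysis": {
                "title": "Pod Crash",
                "summary": "API gateway pod crashed",
                "nextSteps": "Awaiting analysis",
            },
        },
    }
    async with httpx.AsyncClient(base_url=base_url) as client:
        response = await client.post("/ensemble", json=message)  # hypothetical route
        response.raise_for_status()
        return response.json()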
Logging
Console: INFO+ (concise)
File: DEBUG+ at log/mcp_server.log (rotating, 10 MB, 5 backups); see the setup sketch below
Artifacts: Per-run in log/runs/{run_id}/
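A minimal setup matching the levels above; the project's logging_config.py may differ in formatter details:

# Illustrative logging setup: INFO+ to console, DEBUG+ to a rotating file.
import logging
from logging.handlers import RotatingFileHandler
from pathlib import Path

def configure_logging(log_dir: str = "./log") -> None:
    Path(log_dir).mkdir(parents=True, exist_ok=True)

    console = logging.StreamHandler()
    console.setLevel(logging.INFO)

    file_handler = RotatingFileHandler(
        f"{log_dir}/mcp_server.log", maxBytes=10 * 1024 * 1024, backupCount=5
    )
    file_handler.setLevel(logging.DEBUG)

    logging.basicConfig(
        level=logging.DEBUG,
        format="%(asctime)s %(name)s %(levelname)s %(message)s",
        handlers=[console, file_handler],
    )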
Development
Project Structure
Remidiation-MCP/
├── config.yaml              # Configuration
├── requirements.txt         # Dependencies
├── proto/                   # gRPC definitions
│   ├── common.proto
│   ├── scenario_service.proto
│   ├── fault_service.proto
│   ├── executor_service.proto
│   └── eval_service.proto
├── mcp_server/
│   ├── __init__.py
│   ├── config.py            # Settings
│   ├── logging_config.py    # Logging
│   ├── server.py            # gRPC server
│   ├── models/              # Pydantic models
│   │   └── scenario.py
│   ├── services/            # Service implementations
│   │   ├── fault_service.py
│   │   ├── executor_service.py
│   │   └── eval_service.py
│   ├── clients/             # API clients
│   │   └── remediation_client.py
│   ├── orchestration/       # Orchestration engine
│   │   ├── fsm.py
│   │   └── engine.py
│   └── utils/               # Utilities
│       ├── variables.py
│       └── artifacts.py
├── scenarios/               # Test scenarios
│   └── example_scenario.yaml
└── log/                     # Logs and artifacts
Testing
# Run example scenario
python -m mcp_server.server
# In another terminal, verify logs
tail -f log/mcp_server.log
# Check results
ls -la log/runs/
cat log/runs/run-*/report.json
Production Deployment
Docker
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
CMD ["python", "-m", "mcp_server.server"]
Kubernetes
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mcp-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mcp-server
  template:
    metadata:
      labels:
        app: mcp-server
    spec:
      containers:
        - name: mcp-server
          image: mcp-server:latest
          ports:
            - containerPort: 50051
          env:
            - name: MCP_GRPC__HOST
              value: "0.0.0.0"
            - name: MCP_HTTP__BASE_URL
              value: "http://remediation-api:8901"
Contributing
Follow PEP 8 style guidelines
Add type hints to all functions
Write docstrings for public APIs
Update tests for new features
License
MIT License - See LICENSE file for details
Support
For issues and questions:
Roadmap
Full gRPC code generation from .proto files
WebSocket streaming for real-time events
Chaos Mesh integration
Prometheus metrics export
OpenTelemetry tracing
Multi-scenario parallel execution
Scenario templates and library