# MCP Server - AI-Driven Remediation Testing
A production-ready Model Context Protocol (MCP) server for orchestrating AI-driven remediation test scenarios with gRPC, WebSocket, and HTTP integrations.
## Overview
MCP Server provides end-to-end orchestration for testing AI-powered incident remediation workflows. It reads declarative YAML scenarios, injects faults, interacts with remediation APIs, evaluates AI responses, executes remediation commands, and produces comprehensive test reports.
## Features
- **Declarative Scenarios**: Define test scenarios in YAML with variable substitution
- **FSM-Based Orchestration**: 13-state finite state machine for reliable execution
- **Fault Injection**: Integrate with chaos engineering tools (Chaos Mesh, Litmus, etc.)
- **AI Evaluation**: Score AI responses using regex, JSON Schema, and semantic similarity
- **Secure Execution**: Sandboxed command execution with deny patterns
- **Remediation API Integration**: Full HTTP/WebSocket client for workflow APIs
- **Comprehensive Logging**: DEBUG+ file logs, INFO+ console, artifact management
- **Production-Ready**: Type-safe Python 3.11+ with pydantic validation
## Architecture
```
┌───────────────────────────────────────────────────────────────┐
│                       MCP Server (gRPC)                       │
├───────────────────────────────────────────────────────────────┤
│ ScenarioService │ FaultService │ ExecutorService │ EvalService│
└───────────────────────────────────────────────────────────────┘
         │                   │                   │
         ▼                   ▼                   ▼
┌──────────────────┐  ┌──────────────┐  ┌──────────────────┐
│  Orchestration   │  │    Fault     │  │     Command      │
│   Engine (FSM)   │  │  Injection   │  │     Executor     │
└──────────────────┘  └──────────────┘  └──────────────────┘
         │
         ▼
┌───────────────────────────────────────────────────────────────┐
│                Remediation Workflow API Client                │
│         (HTTP + WebSocket, InitiateEnsemble, Resume)          │
└───────────────────────────────────────────────────────────────┘
```
## Installation
```bash
# Install dependencies
pip install -r requirements.txt
# Generate gRPC code (optional; the MVP uses a simplified implementation)
# python -m grpc_tools.protoc -I proto --python_out=. --grpc_python_out=. proto/*.proto
```
## Configuration
Configuration can be provided via `config.yaml` or environment variables:
```yaml
# config.yaml
log_dir: "./log"
session_timeout_sec: 300

ws:
  ping_interval: 300
  ping_timeout: 300

grpc:
  host: "localhost"
  port: 50051
  timeout: 300

http:
  base_url: "http://localhost:8901"
  ws_url: "ws://localhost:8765/chatsocket"
  token_url: "https://app.lab0.signalfx.com/v2/jwt/token"
```
Environment variables (override config.yaml):
```bash
export CONFIG_PATH=./config.yaml
export MCP_LOG_DIR=./log
export MCP_GRPC__HOST=localhost
export MCP_GRPC__PORT=50051
export MCP_HTTP__BASE_URL=http://localhost:8901
```
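The `MCP_` prefix and double-underscore delimiter map environment variables onto nested settings. Below is a minimal sketch of how this could be modeled with pydantic-settings; the project's actual `mcp_server/config.py` may define the settings differently.

```python
from pydantic import BaseModel
from pydantic_settings import BaseSettings, SettingsConfigDict


class GrpcSettings(BaseModel):
    host: str = "localhost"
    port: int = 50051
    timeout: int = 300


class HttpSettings(BaseModel):
    base_url: str = "http://localhost:8901"
    ws_url: str = "ws://localhost:8765/chatsocket"


class Settings(BaseSettings):
    # MCP_GRPC__HOST -> settings.grpc.host, MCP_HTTP__BASE_URL -> settings.http.base_url
    model_config = SettingsConfigDict(env_prefix="MCP_", env_nested_delimiter="__")

    log_dir: str = "./log"
    grpc: GrpcSettings = GrpcSettings()
    http: HttpSettings = HttpSettings()


settings = Settings()
print(settings.grpc.host)  # overridden by MCP_GRPC__HOST when exported
```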
## Scenario Definition
Scenarios are defined in YAML with the following structure:
```yaml
meta:
  id: scenario-001
  title: "Test Scenario"
  owner: "team-name"

defaults:
  model: "gpt-4"
  timeout: 300

bindings:
  namespace: "production"
  service: "api-gateway"

fault:
  type: "pod_kill"
  params:
    namespace: "${namespace}"

stabilize:
  wait_for:
    timeout: 120

assistant_rca:
  system: "You are an SRE expert."
  user: "Analyze the incident."
  expect:
    references: ["pod", "crash"]
    metrics: ["cpu", "memory"]
    guards:
      - type: "regex"
        pattern: "(?i)root cause"

assistant_remedy:
  system: "Provide remediation."
  user: "What commands should we run?"
  expect:
    references: ["kubectl"]

execute_remedy:
  sandbox:
    service_account: "sre-bot"
    namespace: "${namespace}"
  policies:
    deny_patterns:
      - ".*rm -rf.*"
  commands:
    - name: "Restart pods"
      cmd: "kubectl"
      args: ["rollout", "restart", "deployment/${service}"]

verify:
  signalflow:
    - program: "data('cpu.utilization').mean().publish()"
      assert_rules: ["value < 70"]

cleanup:
  always:
    - name: "Reset state"
      cmd: "kubectl"
      args: ["delete", "pod", "-l", "app=${service}"]

report:
  formats: ["json"]
```
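Fields such as `${namespace}` and `${service}` are resolved from `bindings` (and any overrides passed at run time). The snippet below is an illustrative sketch of that substitution, assuming simple `${name}` placeholders; the real helper lives in `mcp_server/utils/variables.py` and may behave differently.

```python
import re


def substitute(text: str, bindings: dict[str, str]) -> str:
    """Replace ${name} placeholders with values from the scenario bindings."""
    def _resolve(match: re.Match) -> str:
        name = match.group(1)
        if name not in bindings:
            raise KeyError(f"unbound scenario variable: {name}")
        return bindings[name]

    return re.sub(r"\$\{(\w+)\}", _resolve, text)


print(substitute("deployment/${service}", {"service": "api-gateway"}))
# -> deployment/api-gateway
```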
## FSM States
The orchestration engine advances through the following states (a minimal transition-table sketch follows the list):
1. **INIT**: Initialize scenario, resolve bindings
2. **PRECHECK**: Run pre-execution checks (SignalFlow)
3. **FAULT_INJECT**: Inject fault using FaultService
4. **STABILIZE**: Wait for system stabilization
5. **ASSISTANT_RCA**: Get RCA from remediation API
6. **EVAL_RCA**: Evaluate RCA response
7. **ASSISTANT_REMEDY**: Get remediation commands
8. **EVAL_REMEDY**: Evaluate remedy response
9. **EXECUTE_REMEDY**: Execute commands
10. **VERIFY**: Verify system state
11. **PASS**: Scenario passed
12. **FAIL**: Scenario failed
13. **CLEANUP**: Clean up resources
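Assuming the states above, the happy path can be expressed as a simple transition map; the authoritative definition lives in `mcp_server/orchestration/fsm.py`, and failure branches (any step may move to FAIL) are omitted here.

```python
from enum import Enum, auto


class State(Enum):
    INIT = auto()
    PRECHECK = auto()
    FAULT_INJECT = auto()
    STABILIZE = auto()
    ASSISTANT_RCA = auto()
    EVAL_RCA = auto()
    ASSISTANT_REMEDY = auto()
    EVAL_REMEDY = auto()
    EXECUTE_REMEDY = auto()
    VERIFY = auto()
    PASS = auto()
    FAIL = auto()
    CLEANUP = auto()


# Happy-path transitions; both PASS and FAIL funnel into CLEANUP.
HAPPY_PATH = {
    State.INIT: State.PRECHECK,
    State.PRECHECK: State.FAULT_INJECT,
    State.FAULT_INJECT: State.STABILIZE,
    State.STABILIZE: State.ASSISTANT_RCA,
    State.ASSISTANT_RCA: State.EVAL_RCA,
    State.EVAL_RCA: State.ASSISTANT_REMEDY,
    State.ASSISTANT_REMEDY: State.EVAL_REMEDY,
    State.EVAL_REMEDY: State.EXECUTE_REMEDY,
    State.EXECUTE_REMEDY: State.VERIFY,
    State.VERIFY: State.PASS,
    State.PASS: State.CLEANUP,
    State.FAIL: State.CLEANUP,
}
```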
## Usage
### Start Server
```bash
python -m mcp_server.server
```
### Run Scenario (Programmatic)
```python
import asyncio
from pathlib import Path

from mcp_server.config import get_settings
from mcp_server.server import MCPServer


async def main():
    settings = get_settings()
    server = MCPServer(settings)

    # Run the bundled example scenario with a binding override
    result = await server.scenario_service.run_scenario(
        scenario_yaml=Path("scenarios/example_scenario.yaml").read_text(),
        bindings={"namespace": "staging"},
    )
    print(f"Run ID: {result['run_id']}")
    print(f"Status: {result['status']}")


asyncio.run(main())
```
### Check Results
Results are stored in `log/runs/{run_id}/`:
- `scenario.yaml`: Original scenario
- `transcript.json`: RCA/remedy responses
- `report.json`: Final test report
- `cmd_*.txt`: Command outputs
## Services
### FaultService
Injects and cleans up faults. A stub implementation is provided; integrate it with chaos tooling such as the following (a hedged Chaos Mesh sketch follows the list):
- Chaos Mesh (Kubernetes)
- Litmus (Kubernetes)
- Gremlin (Cloud)
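As an illustration only, a `pod_kill` fault could be implemented against Chaos Mesh by applying a `PodChaos` resource. The manifest below follows the public `chaos-mesh.org/v1alpha1` CRD; how the stub is actually wired up is left to your integration.

```python
import asyncio

# Illustrative PodChaos manifest (chaos-mesh.org/v1alpha1); adjust the selector to your workload.
POD_KILL_TEMPLATE = """\
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: {name}
  namespace: {namespace}
spec:
  action: pod-kill
  mode: one
  selector:
    labelSelectors:
      app: {app}
"""


async def inject_pod_kill(name: str, namespace: str, app: str) -> None:
    """Apply the PodChaos resource via kubectl (assumes kubectl is on PATH and configured)."""
    manifest = POD_KILL_TEMPLATE.format(name=name, namespace=namespace, app=app)
    proc = await asyncio.create_subprocess_exec(
        "kubectl", "apply", "-f", "-",
        stdin=asyncio.subprocess.PIPE,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    _, err = await proc.communicate(manifest.encode())
    if proc.returncode != 0:
        raise RuntimeError(f"fault injection failed: {err.decode().strip()}")
```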
### ExecutorService
Executes commands with sandboxing (a minimal sketch follows the list):
- Local execution via `asyncio.subprocess`
- Deny pattern enforcement
- Output capture and artifact storage
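The sketch below shows the deny-pattern check wrapped around `asyncio.subprocess`; it is illustrative only, and the shipped `executor_service.py` additionally writes captured output to the run's artifact directory.

```python
import asyncio
import re
import shlex

# Mirrors execute_remedy.policies.deny_patterns from the scenario YAML.
DENY_PATTERNS = [r".*rm -rf.*"]


async def run_command(cmd: str, args: list[str]) -> tuple[int, str, str]:
    """Run a command locally after checking it against the deny patterns."""
    rendered = shlex.join([cmd, *args])
    for pattern in DENY_PATTERNS:
        if re.match(pattern, rendered):
            raise PermissionError(f"command blocked by deny pattern: {pattern}")

    proc = await asyncio.create_subprocess_exec(
        cmd, *args,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    out, err = await proc.communicate()
    return proc.returncode, out.decode(), err.decode()
```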
### EvalService
Evaluates AI responses (a similarity sketch follows the list):
- **Regex guards**: Pattern matching
- **JSON Schema**: Structure validation
- **Semantic similarity**: Token-based Jaccard
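The semantic check is a lightweight token overlap rather than an embedding model. A minimal sketch of token-based Jaccard similarity:

```python
import re


def jaccard_similarity(expected: str, actual: str) -> float:
    """Token-level Jaccard similarity between two texts, in [0.0, 1.0]."""
    a = set(re.findall(r"\w+", expected.lower()))
    b = set(re.findall(r"\w+", actual.lower()))
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


print(jaccard_similarity("pod crash root cause", "The root cause was a pod crash loop"))
```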
### RemediationClient
HTTP client for the remediation workflow API (a JSON pointer sketch follows the list):
- `initiate_remediation()`: Start new workflow
- `resume_remediation()`: Resume with input
- JSON pointer resolution for graph navigation
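`initiate_remediation()` and `resume_remediation()` wrap the request envelopes shown under API Reference below; JSON pointers are used to pull fields (for example, node or interrupt identifiers) out of the returned workflow graph. A minimal RFC 6901 resolver, as an illustration only:

```python
from typing import Any


def resolve_pointer(document: Any, pointer: str) -> Any:
    """Resolve an RFC 6901 JSON pointer such as '/nodes/0/id'."""
    if pointer == "":
        return document
    node = document
    for token in pointer.lstrip("/").split("/"):
        token = token.replace("~1", "/").replace("~0", "~")
        node = node[int(token)] if isinstance(node, list) else node[token]
    return node


graph = {"nodes": [{"id": "node-789", "type": "interrupt"}]}
print(resolve_pointer(graph, "/nodes/0/id"))  # -> node-789
```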
## API Reference
### ScenarioService
```protobuf
service ScenarioService {
  rpc RunScenario(RunScenarioRequest) returns (RunScenarioResponse);
  rpc ListScenarios(Empty) returns (ListScenariosResponse);
  rpc GetScenario(GetScenarioRequest) returns (GetScenarioResponse);
  rpc StreamEvents(StreamEventsRequest) returns (stream ScenarioEvent);
}
```
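Once stubs are generated from `proto/scenario_service.proto` (see Installation), calling the service could look like the sketch below. The generated module names and the request field name are assumptions and should be checked against the generated code.

```python
import grpc

# Assumed names of the modules produced by grpc_tools.protoc from scenario_service.proto.
import scenario_service_pb2 as pb
import scenario_service_pb2_grpc as pb_grpc


def run_scenario(scenario_yaml: str, host: str = "localhost", port: int = 50051) -> None:
    with grpc.insecure_channel(f"{host}:{port}") as channel:
        stub = pb_grpc.ScenarioServiceStub(channel)
        # Field name is assumed; check the generated RunScenarioRequest message.
        response = stub.RunScenario(pb.RunScenarioRequest(scenario_yaml=scenario_yaml))
        print(response)
```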
### Remediation API
**InitiateEnsemble**:
```json
{
  "apiMethod": "InitiateEnsemble",
  "apiVersion": "1",
  "ensembleName": "REMEDIATION",
  "payload": {
    "incidentId": "inc-123",
    "rcaAnalysis": {
      "title": "Pod Crash",
      "summary": "API gateway pod crashed",
      "nextSteps": "Awaiting analysis"
    }
  }
}
```
**ResumeEnsemble**:
```json
{
  "apiMethod": "ResumeEnsemble",
  "apiVersion": "1",
  "payload": {
    "messageType": "node_input",
    "stateIdentifier": {
      "threadId": "thread-123",
      "interruptId": "int-456"
    },
    "nodeId": "node-789",
    "inputProperties": {
      "input": "User input text"
    }
  }
}
```
## Logging
- **Console**: INFO+ (concise)
- **File**: DEBUG+ at `log/mcp_server.log` (rotating, 10MB, 5 backups)
- **Artifacts**: Per-run in `log/runs/{run_id}/`
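A minimal sketch of this policy using the standard library (the project's actual setup lives in `mcp_server/logging_config.py`):

```python
import logging
from logging.handlers import RotatingFileHandler


def configure_logging(log_path: str = "log/mcp_server.log") -> None:
    """Console at INFO, rotating file at DEBUG (10 MB per file, 5 backups)."""
    root = logging.getLogger()
    root.setLevel(logging.DEBUG)

    console = logging.StreamHandler()
    console.setLevel(logging.INFO)
    console.setFormatter(logging.Formatter("%(levelname)s %(name)s: %(message)s"))

    file_handler = RotatingFileHandler(log_path, maxBytes=10 * 1024 * 1024, backupCount=5)
    file_handler.setLevel(logging.DEBUG)
    file_handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
    )

    root.addHandler(console)
    root.addHandler(file_handler)
```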
## Development
### Project Structure
```
Remidiation-MCP/
├── config.yaml              # Configuration
├── requirements.txt         # Dependencies
├── proto/                   # gRPC definitions
│   ├── common.proto
│   ├── scenario_service.proto
│   ├── fault_service.proto
│   ├── executor_service.proto
│   └── eval_service.proto
├── mcp_server/
│   ├── __init__.py
│   ├── config.py            # Settings
│   ├── logging_config.py    # Logging
│   ├── server.py            # gRPC server
│   ├── models/              # Pydantic models
│   │   └── scenario.py
│   ├── services/            # Service implementations
│   │   ├── fault_service.py
│   │   ├── executor_service.py
│   │   └── eval_service.py
│   ├── clients/             # API clients
│   │   └── remediation_client.py
│   ├── orchestration/       # Orchestration engine
│   │   ├── fsm.py
│   │   └── engine.py
│   └── utils/               # Utilities
│       ├── variables.py
│       └── artifacts.py
├── scenarios/               # Test scenarios
│   └── example_scenario.yaml
└── log/                     # Logs and artifacts
```
### Testing
```bash
# Run example scenario
python -m mcp_server.server
# In another terminal, verify logs
tail -f log/mcp_server.log
# Check results
ls -la log/runs/
cat log/runs/run-*/report.json
```
## Production Deployment
### Docker
```dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
CMD ["python", "-m", "mcp_server.server"]
```
### Kubernetes
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mcp-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mcp-server
  template:
    metadata:
      labels:
        app: mcp-server
    spec:
      containers:
        - name: mcp-server
          image: mcp-server:latest
          ports:
            - containerPort: 50051
          env:
            - name: MCP_GRPC__HOST
              value: "0.0.0.0"
            - name: MCP_HTTP__BASE_URL
              value: "http://remediation-api:8901"
```
## Contributing
1. Follow PEP 8 style guidelines
2. Add type hints to all functions
3. Write docstrings for public APIs
4. Update tests for new features
## License
MIT License - See LICENSE file for details
## Support
For issues and questions:
- GitHub Issues: https://github.com/your-org/mcp-server/issues
- Documentation: https://docs.your-org.com/mcp-server
## Roadmap
- [ ] Full gRPC code generation from .proto files
- [ ] WebSocket streaming for real-time events
- [ ] Chaos Mesh integration
- [ ] Prometheus metrics export
- [ ] OpenTelemetry tracing
- [ ] Multi-scenario parallel execution
- [ ] Scenario templates and library