
MCP Server - AI-Driven Remediation Testing

A production-ready Model Context Protocol (MCP) server for orchestrating AI-driven remediation test scenarios with gRPC, WebSocket, and HTTP integrations.

Overview

MCP Server provides end-to-end orchestration for testing AI-powered incident remediation workflows. It reads declarative YAML scenarios, injects faults, interacts with remediation APIs, evaluates AI responses, executes remediation commands, and produces comprehensive test reports.

Features

  • Declarative Scenarios: Define test scenarios in YAML with variable substitution

  • FSM-Based Orchestration: 13-state finite state machine for reliable execution

  • Fault Injection: Integrate with chaos engineering tools (Chaos Mesh, Litmus, etc.)

  • AI Evaluation: Score AI responses using regex, JSON Schema, and semantic similarity

  • Secure Execution: Sandboxed command execution with deny patterns

  • Remediation API Integration: Full HTTP/WebSocket client for workflow APIs

  • Comprehensive Logging: DEBUG+ file logs, INFO+ console, artifact management

  • Production-Ready: Type-safe Python 3.11+ with pydantic validation

Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                      MCP Server (gRPC)                      │
├─────────────────────────────────────────────────────────────┤
│ ScenarioService │ FaultService │ ExecutorService │ EvalService │
└─────────────────────────────────────────────────────────────┘
         │                  │                  │
         ▼                  ▼                  ▼
┌──────────────────┐  ┌──────────────┐  ┌──────────────────┐
│  Orchestration   │  │    Fault     │  │     Command      │
│   Engine (FSM)   │  │  Injection   │  │     Executor     │
└──────────────────┘  └──────────────┘  └──────────────────┘
         │
         ▼
┌─────────────────────────────────────────────────────────────┐
│              Remediation Workflow API Client                │
│        (HTTP + WebSocket, InitiateEnsemble, Resume)         │
└─────────────────────────────────────────────────────────────┘
```

Installation

```shell
# Install dependencies
pip install -r requirements.txt

# Generate gRPC code (optional; the MVP uses a simplified implementation)
# python -m grpc_tools.protoc -I proto --python_out=. --grpc_python_out=. proto/*.proto
```

Configuration

Configuration can be provided via config.yaml or environment variables:

```yaml
# config.yaml
log_dir: "./log"
session_timeout_sec: 300
ws:
  ping_interval: 300
  ping_timeout: 300
grpc:
  host: "localhost"
  port: 50051
  timeout: 300
http:
  base_url: "http://localhost:8901"
  ws_url: "ws://localhost:8765/chatsocket"
  token_url: "https://app.lab0.signalfx.com/v2/jwt/token"
```

Environment variables (override config.yaml):

```shell
export CONFIG_PATH=./config.yaml
export MCP_LOG_DIR=./log
export MCP_GRPC__HOST=localhost
export MCP_GRPC__PORT=50051
export MCP_HTTP__BASE_URL=http://localhost:8901
```
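The `MCP_SECTION__KEY` names map onto nested config keys via the `__` delimiter. A minimal stdlib sketch of that overlay (the server likely uses pydantic-settings for this; `apply_env_overrides` is an illustrative helper, not the project's API):

```python
import os


def apply_env_overrides(config: dict, prefix: str = "MCP_", delim: str = "__") -> dict:
    """Overlay MCP_SECTION__KEY environment variables onto a nested config dict."""
    for name, value in os.environ.items():
        if not name.startswith(prefix):
            continue
        # MCP_GRPC__HOST -> ["grpc", "host"]
        keys = name[len(prefix):].lower().split(delim)
        node = config
        for key in keys[:-1]:
            node = node.setdefault(key, {})
        node[keys[-1]] = value  # note: values arrive as strings
    return config


config = {"grpc": {"host": "localhost", "port": 50051}}
os.environ["MCP_GRPC__HOST"] = "0.0.0.0"
apply_env_overrides(config)
# config["grpc"]["host"] is now "0.0.0.0"
```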

Scenario Definition

Scenarios are defined in YAML with the following structure:

```yaml
meta:
  id: scenario-001
  title: "Test Scenario"
  owner: "team-name"

defaults:
  model: "gpt-4"
  timeout: 300

bindings:
  namespace: "production"
  service: "api-gateway"

fault:
  type: "pod_kill"
  params:
    namespace: "${namespace}"

stabilize:
  wait_for:
    timeout: 120

assistant_rca:
  system: "You are an SRE expert."
  user: "Analyze the incident."
  expect:
    references: ["pod", "crash"]
    metrics: ["cpu", "memory"]
    guards:
      - type: "regex"
        pattern: "(?i)root cause"

assistant_remedy:
  system: "Provide remediation."
  user: "What commands should we run?"
  expect:
    references: ["kubectl"]

execute_remedy:
  sandbox:
    service_account: "sre-bot"
    namespace: "${namespace}"
  policies:
    deny_patterns:
      - ".*rm -rf.*"
  commands:
    - name: "Restart pods"
      cmd: "kubectl"
      args: ["rollout", "restart", "deployment/${service}"]

verify:
  signalflow:
    - program: "data('cpu.utilization').mean().publish()"
      assert_rules: ["value < 70"]

cleanup:
  always:
    - name: "Reset state"
      cmd: "kubectl"
      args: ["delete", "pod", "-l", "app=${service}"]

report:
  formats: ["json"]
```
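Variable substitution resolves `${name}` placeholders from the bindings map before the scenario runs. A minimal sketch (the project's actual resolver may also support defaults or escaping):

```python
import re


def substitute(text: str, bindings: dict) -> str:
    """Replace ${name} placeholders with values from the bindings map."""
    return re.sub(r"\$\{(\w+)\}", lambda m: str(bindings[m.group(1)]), text)


cmd = substitute("deployment/${service}", {"service": "api-gateway"})
# cmd == "deployment/api-gateway"
```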

FSM States

The orchestration engine follows this state machine:

  1. INIT: Initialize scenario, resolve bindings

  2. PRECHECK: Run pre-execution checks (SignalFlow)

  3. FAULT_INJECT: Inject fault using FaultService

  4. STABILIZE: Wait for system stabilization

  5. ASSISTANT_RCA: Get RCA from remediation API

  6. EVAL_RCA: Evaluate RCA response

  7. ASSISTANT_REMEDY: Get remediation commands

  8. EVAL_REMEDY: Evaluate remedy response

  9. EXECUTE_REMEDY: Execute commands

  10. VERIFY: Verify system state

  11. PASS: Scenario passed

  12. FAIL: Scenario failed

  13. CLEANUP: Clean up resources
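The 13 states above can be sketched as a linear happy path with failure routing; this is an illustrative model only, since the engine's actual transition table (in `orchestration/fsm.py`) is not shown here:

```python
from enum import Enum


class State(Enum):
    INIT = "INIT"
    PRECHECK = "PRECHECK"
    FAULT_INJECT = "FAULT_INJECT"
    STABILIZE = "STABILIZE"
    ASSISTANT_RCA = "ASSISTANT_RCA"
    EVAL_RCA = "EVAL_RCA"
    ASSISTANT_REMEDY = "ASSISTANT_REMEDY"
    EVAL_REMEDY = "EVAL_REMEDY"
    EXECUTE_REMEDY = "EXECUTE_REMEDY"
    VERIFY = "VERIFY"
    PASS = "PASS"
    FAIL = "FAIL"
    CLEANUP = "CLEANUP"


HAPPY_PATH = [
    State.INIT, State.PRECHECK, State.FAULT_INJECT, State.STABILIZE,
    State.ASSISTANT_RCA, State.EVAL_RCA, State.ASSISTANT_REMEDY,
    State.EVAL_REMEDY, State.EXECUTE_REMEDY, State.VERIFY, State.PASS,
]


def next_state(current: State, ok: bool = True) -> State:
    """Advance along the happy path; failures route to FAIL, terminal states to CLEANUP."""
    if current in (State.PASS, State.FAIL):
        return State.CLEANUP
    if not ok:
        return State.FAIL
    return HAPPY_PATH[HAPPY_PATH.index(current) + 1]
```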

Usage

Start Server

```shell
python -m mcp_server.server
```

Run Scenario (Programmatic)

```python
import asyncio

from mcp_server.config import get_settings
from mcp_server.server import MCPServer


async def main():
    settings = get_settings()
    server = MCPServer(settings)

    # Run scenario
    with open("scenarios/example_scenario.yaml") as f:
        result = await server.scenario_service.run_scenario(
            scenario_yaml=f.read(),
            bindings={"namespace": "staging"},
        )
    print(f"Run ID: {result['run_id']}")
    print(f"Status: {result['status']}")


asyncio.run(main())
```

Check Results

Results are stored in log/runs/{run_id}/:

  • scenario.yaml: Original scenario

  • transcript.json: RCA/remedy responses

  • report.json: Final test report

  • cmd_*.txt: Command outputs
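A finished report can be loaded back for inspection. A minimal sketch (the `load_report` helper and the report's field names are assumptions for illustration):

```python
import json
from pathlib import Path


def load_report(run_id: str, log_dir: str = "./log") -> dict:
    """Load the final test report for a given run from log/runs/{run_id}/report.json."""
    return json.loads(Path(log_dir, "runs", run_id, "report.json").read_text())
```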

Services

FaultService

Injects and cleans up faults. Stub implementation provided; integrate with:

  • Chaos Mesh (Kubernetes)

  • Litmus (Kubernetes)

  • Gremlin (Cloud)

ExecutorService

Executes commands with sandboxing:

  • Local execution via asyncio.subprocess

  • Deny pattern enforcement

  • Output capture and artifact storage
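Deny-pattern enforcement around `asyncio.subprocess` can be sketched as follows (the pattern list and function names here are illustrative, not the service's actual API):

```python
import asyncio
import re

# Assumed example patterns; real policies come from the scenario's execute_remedy block
DENY_PATTERNS = [r".*rm -rf.*"]


def is_denied(cmd_line: str, patterns=DENY_PATTERNS) -> bool:
    """True if the full command line matches any configured deny pattern."""
    return any(re.fullmatch(p, cmd_line) for p in patterns)


async def run_command(cmd: str, args: list[str]) -> tuple[int, str]:
    """Run a command, refusing anything that matches a deny pattern."""
    line = " ".join([cmd, *args])
    if is_denied(line):
        raise PermissionError(f"denied by policy: {line}")
    proc = await asyncio.create_subprocess_exec(
        cmd, *args,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.STDOUT,
    )
    out, _ = await proc.communicate()
    return proc.returncode, out.decode()
```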

EvalService

Evaluates AI responses:

  • Regex guards: Pattern matching

  • JSON Schema: Structure validation

  • Semantic similarity: Token-based Jaccard
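Token-based Jaccard similarity is the ratio of shared tokens to total distinct tokens between the AI response and the expected references. A minimal sketch (lowercased whitespace tokenization is an assumption):

```python
def jaccard_similarity(a: str, b: str) -> float:
    """Token-level Jaccard similarity: |A ∩ B| / |A ∪ B| over lowercased tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)
```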

RemediationClient

HTTP client for remediation workflow API:

  • initiate_remediation(): Start new workflow

  • resume_remediation(): Resume with input

  • JSON pointer resolution for graph navigation
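The JSON pointer resolution can be sketched as a small RFC 6901 resolver over the response graph (illustrative; the client's actual helper is not shown):

```python
def resolve_pointer(doc, pointer: str):
    """Resolve an RFC 6901 JSON Pointer, e.g. "/payload/stateIdentifier/threadId"."""
    if pointer == "":
        return doc
    for token in pointer.lstrip("/").split("/"):
        # Unescape per RFC 6901: ~1 -> "/", then ~0 -> "~"
        token = token.replace("~1", "/").replace("~0", "~")
        doc = doc[int(token)] if isinstance(doc, list) else doc[token]
    return doc
```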

API Reference

ScenarioService

```proto
service ScenarioService {
  rpc RunScenario(RunScenarioRequest) returns (RunScenarioResponse);
  rpc ListScenarios(Empty) returns (ListScenariosResponse);
  rpc GetScenario(GetScenarioRequest) returns (GetScenarioResponse);
  rpc StreamEvents(StreamEventsRequest) returns (stream ScenarioEvent);
}
```

Remediation API

InitiateEnsemble:

```json
{
  "apiMethod": "InitiateEnsemble",
  "apiVersion": "1",
  "ensembleName": "REMEDIATION",
  "payload": {
    "incidentId": "inc-123",
    "rcaAnalysis": {
      "title": "Pod Crash",
      "summary": "API gateway pod crashed",
      "nextSteps": "Awaiting analysis"
    }
  }
}
```

ResumeEnsemble:

```json
{
  "apiMethod": "ResumeEnsemble",
  "apiVersion": "1",
  "payload": {
    "messageType": "node_input",
    "stateIdentifier": {
      "threadId": "thread-123",
      "interruptId": "int-456"
    },
    "nodeId": "node-789",
    "inputProperties": {
      "input": "User input text"
    }
  }
}
```

Logging

  • Console: INFO+ (concise)

  • File: DEBUG+ at log/mcp_server.log (rotating, 10MB, 5 backups)

  • Artifacts: Per-run in log/runs/{run_id}/
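A minimal sketch of this split-level setup with the standard library (the logger name and message format are assumptions; the project's actual wiring lives in `logging_config.py`):

```python
import logging
from logging.handlers import RotatingFileHandler


def setup_logging(log_file: str = "log/mcp_server.log") -> logging.Logger:
    """Console at INFO+, rotating file at DEBUG+ (10 MB per file, 5 backups)."""
    logger = logging.getLogger("mcp_server")
    logger.setLevel(logging.DEBUG)

    console = logging.StreamHandler()
    console.setLevel(logging.INFO)

    file_handler = RotatingFileHandler(
        log_file, maxBytes=10 * 1024 * 1024, backupCount=5
    )
    file_handler.setLevel(logging.DEBUG)
    file_handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
    )

    logger.addHandler(console)
    logger.addHandler(file_handler)
    return logger
```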

Development

Project Structure

```
Remidiation-MCP/
├── config.yaml                  # Configuration
├── requirements.txt             # Dependencies
├── proto/                       # gRPC definitions
│   ├── common.proto
│   ├── scenario_service.proto
│   ├── fault_service.proto
│   ├── executor_service.proto
│   └── eval_service.proto
├── mcp_server/
│   ├── __init__.py
│   ├── config.py                # Settings
│   ├── logging_config.py        # Logging
│   ├── server.py                # gRPC server
│   ├── models/                  # Pydantic models
│   │   └── scenario.py
│   ├── services/                # Service implementations
│   │   ├── fault_service.py
│   │   ├── executor_service.py
│   │   └── eval_service.py
│   ├── clients/                 # API clients
│   │   └── remediation_client.py
│   ├── orchestration/           # Orchestration engine
│   │   ├── fsm.py
│   │   └── engine.py
│   └── utils/                   # Utilities
│       ├── variables.py
│       └── artifacts.py
├── scenarios/                   # Test scenarios
│   └── example_scenario.yaml
└── log/                         # Logs and artifacts
```

Testing

```shell
# Run example scenario
python -m mcp_server.server

# In another terminal, verify logs
tail -f log/mcp_server.log

# Check results
ls -la log/runs/
cat log/runs/run-*/report.json
```

Production Deployment

Docker

```dockerfile
FROM python:3.11-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install -r requirements.txt

COPY . .

CMD ["python", "-m", "mcp_server.server"]
```

Kubernetes

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mcp-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mcp-server
  template:
    metadata:
      labels:
        app: mcp-server
    spec:
      containers:
        - name: mcp-server
          image: mcp-server:latest
          ports:
            - containerPort: 50051
          env:
            - name: MCP_GRPC__HOST
              value: "0.0.0.0"
            - name: MCP_HTTP__BASE_URL
              value: "http://remediation-api:8901"
```

Contributing

  1. Follow PEP 8 style guidelines

  2. Add type hints to all functions

  3. Write docstrings for public APIs

  4. Update tests for new features

License

MIT License - See LICENSE file for details

Support

For issues and questions:

Roadmap

  • Full gRPC code generation from .proto files

  • WebSocket streaming for real-time events

  • Chaos Mesh integration

  • Prometheus metrics export

  • OpenTelemetry tracing

  • Multi-scenario parallel execution

  • Scenario templates and library

