FIS Recommender MCP Server

by pimisael

An MCP (Model Context Protocol) server that recommends AWS Fault Injection Service (FIS, formerly Fault Injection Simulator) experiments based on DevOps Agent findings. It helps teams quickly design chaos engineering experiments to validate system resilience.

Features

  • 🔍 Analyzes DevOps findings and suggests relevant FIS experiments

  • 🎯 Maps issues to appropriate fault injection actions

  • 📋 Generates complete FIS experiment templates

  • ⚡ Integrates seamlessly with Kiro CLI and other MCP clients

Installation

Clone the Repository

git clone https://github.com/pimisael/fis-recommender-mcp.git
cd fis-recommender-mcp
chmod +x server.py

Configure MCP Client

For Kiro CLI

Add to ~/.kiro/mcp-servers.json:

{
  "mcpServers": {
    "fis-recommender": {
      "command": "python3",
      "args": ["/absolute/path/to/fis-recommender-mcp/server.py"],
      "env": {
        "AWS_REGION": "us-east-1"
      }
    }
  }
}

For Claude Desktop

On macOS, add the same entry to ~/Library/Application Support/Claude/claude_desktop_config.json:

{
  "mcpServers": {
    "fis-recommender": {
      "command": "python3",
      "args": ["/absolute/path/to/fis-recommender-mcp/server.py"],
      "env": {
        "AWS_REGION": "us-east-1"
      }
    }
  }
}

Usage Examples

Example 1: Network Latency Issue

Prompt:

I have a DevOps finding about network latency causing timeouts in my application. Can you recommend FIS experiments to test this?

Finding details:
- ID: finding-001
- Summary: "High network latency between services causing request timeouts"
- Type: NETWORK_ISSUE

Response: The MCP server will recommend:

  • Action: aws:network:disrupt-connectivity

  • Duration: 10 minutes

  • Target: Network interfaces

  • Stop condition: CloudWatch alarm on error rate

Example 2: Database Availability

Prompt:

Recommend FIS experiments for this finding:

{
  "id": "finding-db-001",
  "summary": "Database connection failures during peak load",
  "type": "DATABASE_ISSUE"
}

Response:

  • Action: aws:rds:reboot-db-instances

  • Duration: 2 minutes

  • Target: RDS instances

  • Tests application's database failover handling

Example 3: CPU Stress Testing

Prompt:

We had a CPU spike incident. Generate a FIS template to test our auto-scaling. Finding: "CPU utilization reached 95% causing service degradation"

Response: Complete FIS experiment template with:

  • EC2 instance stop action

  • 3-minute duration

  • CloudWatch alarm stop condition

  • Target selection by tags

Example 4: Memory Pressure

Prompt:

Create FIS experiments to validate our memory monitoring:
- Finding ID: mem-leak-001
- Issue: Memory leak caused OOM errors
- Need to test alerting and recovery

Response:

  • Action: aws:ssm:send-command (memory stress)

  • Duration: 5 minutes

  • SSM document for memory consumption

  • Tests monitoring and auto-recovery

Standalone Testing

Run the example script to test without an MCP client:

python3 example.py

This will analyze sample findings and display recommendations.

Supported Finding Types

Network & Connectivity

| Finding Keyword | FIS Action | Duration | Use Case |
| --- | --- | --- | --- |
| network | aws:network:disrupt-connectivity | 5 min | Test network partition handling |
| latency | aws:network:disrupt-connectivity | 10 min | Validate timeout configurations |
| packet loss | aws:ecs:task-network-packet-loss | 5 min | Simulate packet loss scenarios |
| vpc endpoint | aws:network:disrupt-vpc-endpoint | 5 min | Test VPC endpoint failures |
| cross-region | aws:network:route-table-disrupt-cross-region-connectivity | 10 min | Test multi-region connectivity |
| transit gateway | aws:network:transit-gateway-disrupt-cross-region-connectivity | 10 min | Test transit gateway issues |
| direct connect | aws:directconnect:virtual-interface-disconnect | 5 min | Test Direct Connect failures |

Database & Storage

| Finding Keyword | FIS Action | Duration | Use Case |
| --- | --- | --- | --- |
| database | aws:rds:reboot-db-instances | 2 min | Test database failover |
| rds | aws:rds:failover-db-cluster | 3 min | Test RDS cluster failover |
| dynamodb | aws:dynamodb:global-table-pause-replication | 5 min | Test DynamoDB replication pause |
| aurora dsql | aws:dsql:cluster-connection-failure | 5 min | Test Aurora DSQL failures |
| disk | aws:ebs:pause-volume-io | 3 min | Test disk I/O failures |
| ebs | aws:ebs:volume-io-latency | 5 min | Inject EBS I/O latency |
| s3 replication | aws:s3:bucket-pause-replication | 10 min | Test S3 replication pause |

Compute & Instances

| Finding Keyword | FIS Action | Duration | Use Case |
| --- | --- | --- | --- |
| cpu | aws:ec2:stop-instances | 3 min | Validate auto-scaling policies |
| memory | aws:ssm:send-command | 5 min | Test OOM handling |
| instance | aws:ec2:reboot-instances | 2 min | Test instance reboot resilience |
| spot | aws:ec2:send-spot-instance-interruptions | 2 min | Test spot interruption handling |
| capacity | aws:ec2:api-insufficient-instance-capacity-error | 5 min | Test capacity error handling |
| auto scaling | aws:ec2:asg-insufficient-instance-capacity-error | 5 min | Test ASG capacity errors |

ECS & Containers

| Finding Keyword | FIS Action | Duration | Use Case |
| --- | --- | --- | --- |
| ecs | aws:ecs:stop-task | 2 min | Test ECS task failure recovery |
| container cpu | aws:ecs:task-cpu-stress | 5 min | Inject CPU stress on tasks |
| container memory | aws:ecs:task-io-stress | 5 min | Inject I/O stress on tasks |
| container network | aws:ecs:task-network-latency | 5 min | Inject network latency on tasks |
| drain | aws:ecs:drain-container-instances | 5 min | Test container draining |

EKS & Kubernetes

| Finding Keyword | FIS Action | Duration | Use Case |
| --- | --- | --- | --- |
| eks | aws:eks:pod-delete | 2 min | Test pod deletion recovery |
| pod cpu | aws:eks:pod-cpu-stress | 5 min | Inject CPU stress on pods |
| pod memory | aws:eks:pod-memory-stress | 5 min | Inject memory stress on pods |
| pod network | aws:eks:pod-network-latency | 5 min | Inject network latency on pods |
| nodegroup | aws:eks:terminate-nodegroup-instances | 3 min | Test node termination |
| kubernetes | aws:eks:inject-kubernetes-custom-resource | 5 min | Inject custom K8s faults |

Lambda & Serverless

| Finding Keyword | FIS Action | Duration | Use Case |
| --- | --- | --- | --- |
| lambda | aws:lambda:invocation-error | 5 min | Inject Lambda errors |
| lambda latency | aws:lambda:invocation-add-delay | 5 min | Add Lambda invocation delay |
| lambda http | aws:lambda:invocation-http-integration-response | 5 min | Test Lambda HTTP failures |

Lambda Chaos Engineering Best Practices

Testing Cold Starts and Timeouts:

  • Use aws:lambda:invocation-add-delay to simulate cold start scenarios

  • Set startupDelayMilliseconds higher than function timeout to test timeout handling

  • Validates retry logic, dead letter queues, and error handling
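As a rough illustration of the timeout test described above, the action entry of an experiment template could be assembled like this. It is a sketch, not the server's output: the function name `build_delay_action`, the one-second margin, and the target label `my-functions` are hypothetical, and the `Functions` target key is an assumption about how Lambda targets are referenced. Parameter values are passed as strings.

```python
def build_delay_action(function_timeout_ms: int,
                       margin_ms: int = 1000,
                       invocation_percentage: int = 10,
                       duration: str = "PT5M") -> dict:
    """Build an action entry whose startup delay exceeds the function timeout."""
    if not 0 < invocation_percentage <= 100:
        raise ValueError("invocation_percentage must be between 1 and 100")
    # Delay longer than the timeout so affected invocations time out.
    delay_ms = function_timeout_ms + margin_ms
    return {
        "actionId": "aws:lambda:invocation-add-delay",
        "parameters": {
            "duration": duration,
            "invocationPercentage": str(invocation_percentage),
            "startupDelayMilliseconds": str(delay_ms),
        },
        # Hypothetical target label; it must match a key under "targets"
        # in the surrounding experiment template.
        "targets": {"Functions": "my-functions"},
    }
```

For a function with a 3-second timeout, `build_delay_action(3000)` yields a 4000 ms startup delay applied to 10% of invocations.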

Error Handling Validation:

  • Use aws:lambda:invocation-error with preventExecution: true to test without running code

  • Set invocationPercentage to gradually increase fault injection (start at 10-20%)

  • Verify CloudWatch alarms fire and monitoring captures errors

Integration Testing:

  • Use aws:lambda:invocation-http-integration-response for ALB, API Gateway, VPC Lattice

  • Test upstream/downstream service behavior with custom HTTP status codes

  • Validate circuit breakers and fallback mechanisms

Continuous Testing in CI/CD:

  • Automate Lambda FIS experiments in AWS CodePipeline post-deployment

  • Use CloudWatch Synthetics to monitor user experience during experiments

  • Set stop conditions based on error rate thresholds (e.g., >5% errors)

Experiment Safety:

  • Start experiments in non-production with synthetic traffic

  • Use invocationPercentage parameter to limit blast radius

  • Configure CloudWatch alarms as stop conditions

  • Run during off-peak hours initially

Key Metrics to Monitor:

  • Invocation errors and throttles

  • Duration and billed duration

  • Concurrent executions

  • Dead letter queue messages

  • Downstream service health

Caching & Streaming

| Finding Keyword | FIS Action | Duration | Use Case |
| --- | --- | --- | --- |
| elasticache | aws:elasticache:replicationgroup-interrupt-az-power | 5 min | Test ElastiCache AZ failure |
| memorydb | aws:memorydb:multi-region-cluster-pause-replication | 5 min | Test MemoryDB replication |
| kinesis | aws:kinesis:stream-provisioned-throughput-exception | 5 min | Test Kinesis throughput |
| kinesis iterator | aws:kinesis:stream-expired-iterator-exception | 3 min | Test expired iterator handling |

API & Throttling

| Finding Keyword | FIS Action | Duration | Use Case |
| --- | --- | --- | --- |
| api throttle | aws:fis:inject-api-throttle-error | 5 min | Inject API throttling |
| api error | aws:fis:inject-api-internal-error | 5 min | Inject API internal errors |
| api unavailable | aws:fis:inject-api-unavailable-error | 5 min | Inject API unavailable errors |

Availability & Recovery

| Finding Keyword | FIS Action | Duration | Use Case |
| --- | --- | --- | --- |
| availability | aws:ec2:stop-instances | 5 min | Test high availability setup |
| zonal | aws:arc:start-zonal-autoshift | 10 min | Test zonal autoshift |
| alarm | aws:cloudwatch:assert-alarm-state | 1 min | Validate alarm states |

Available Tools

1. recommend_fis_experiments

Analyzes DevOps Agent findings and returns FIS experiment recommendations.

Input:

{
  "finding": {
    "id": "finding-123",
    "summary": "Network latency caused timeouts",
    "type": "AVAILABILITY_ISSUE"
  }
}

Output:

{
  "recommendations": [
    {
      "action": "aws:network:disrupt-connectivity",
      "duration": "PT10M",
      "description": "Simulates network disruption to test timeout handling",
      "targets": ["NetworkInterface"],
      "stopConditions": ["CloudWatch alarm on error rate > 5%"]
    }
  ],
  "finding_id": "finding-123",
  "count": 1
}
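MCP clients normally build the tool call for you, but it can help to see the message shape. The sketch below constructs the JSON-RPC 2.0 `tools/call` request that an MCP client would send for this tool. The helper name `build_tool_call` is hypothetical, and the initialization handshake that precedes tool calls in a real session is left out.

```python
import json
from itertools import count

_request_ids = count(1)  # JSON-RPC requests need a unique id per session


def build_tool_call(tool_name: str, arguments: dict) -> str:
    """Serialize an MCP tools/call request as a JSON-RPC 2.0 message."""
    request = {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }
    return json.dumps(request)


message = build_tool_call(
    "recommend_fis_experiments",
    {
        "finding": {
            "id": "finding-123",
            "summary": "Network latency caused timeouts",
            "type": "AVAILABILITY_ISSUE",
        }
    },
)
print(message)
```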

2. create_fis_template

Generates a complete, ready-to-deploy FIS experiment template.

Input:

{
  "recommendation": {
    "action": "aws:ec2:stop-instances",
    "duration": "PT3M",
    "description": "Test instance failure recovery"
  },
  "target_config": {
    "resourceType": "aws:ec2:instance",
    "selectionMode": "COUNT(1)",
    "tags": {
      "Environment": "staging",
      "Team": "platform"
    },
    "roleArn": "arn:aws:iam::123456789012:role/FISRole"
  }
}

Output: Complete CloudFormation-compatible FIS experiment template ready for deployment.
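The README does not show the generated template, so the following is only a sketch of how the input above could map onto an experiment template for `aws:ec2:stop-instances`. It uses the field names of the FIS `CreateExperimentTemplate` API rather than the CloudFormation resource, so the real output of `server.py` may differ. The function name, the target label `selected-instances`, and the choice to map the recommended duration onto `startInstancesAfterDuration` are assumptions.

```python
from typing import Optional


def build_stop_instances_template(recommendation: dict,
                                  target_config: dict,
                                  stop_conditions: Optional[list] = None) -> dict:
    """Assemble a template dict for aws:ec2:stop-instances (illustrative only)."""
    target_name = "selected-instances"  # hypothetical target label
    return {
        "description": recommendation["description"],
        "roleArn": target_config["roleArn"],
        "targets": {
            target_name: {
                "resourceType": target_config["resourceType"],
                "selectionMode": target_config["selectionMode"],
                "resourceTags": target_config.get("tags", {}),
            }
        },
        "actions": {
            "stop-instances": {
                "actionId": recommendation["action"],
                # Assumption: restart the instances once the duration elapses.
                "parameters": {
                    "startInstancesAfterDuration": recommendation["duration"]
                },
                "targets": {"Instances": target_name},
            }
        },
        # A template must declare stop conditions; "none" means no alarm.
        "stopConditions": stop_conditions or [{"source": "none"}],
    }
```

Keeping the target label in one variable avoids a mismatch between the `targets` section and the action that refers to it, which is an easy mistake to make when writing templates by hand.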

Customization

Adding New Finding Mappings

Edit server.py and add to the finding_mappings dictionary:

finding_mappings = {
    "disk": {
        "action": "aws:ebs:pause-volume-io",
        "duration": "PT5M",
        "description": "Simulates disk I/O issues"
    },
    # Add your custom mappings here
}
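The README does not show how `server.py` matches a finding against these keys. One plausible approach, assumed here purely for illustration, is a case-insensitive substring match of each keyword against the finding summary. The function name `recommend` and the sample mappings below are hypothetical.

```python
def recommend(finding: dict, mappings: dict) -> list:
    """Return the mapping entries whose keyword appears in the summary."""
    summary = finding.get("summary", "").lower()
    matches = []
    for keyword, experiment in mappings.items():
        if keyword in summary:
            # Copy the entry so callers cannot mutate the shared mapping.
            matches.append(dict(experiment, keyword=keyword))
    return matches


sample_mappings = {
    "disk": {"action": "aws:ebs:pause-volume-io", "duration": "PT5M"},
    "latency": {"action": "aws:network:disrupt-connectivity", "duration": "PT10M"},
}

result = recommend({"id": "f-1", "summary": "Disk I/O stalls on the batch host"},
                   sample_mappings)
print([r["action"] for r in result])
```

With substring matching, a summary that contains several keywords returns several recommendations, so a new key should be specific enough not to match unrelated findings.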

Adjusting Durations

Modify duration values in ISO 8601 format:

  • PT2M = 2 minutes

  • PT5M = 5 minutes

  • PT10M = 10 minutes

  • PT1H = 1 hour
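When editing durations by hand it is easy to mistype the format. The helpers below are a minimal sketch for converting between minutes and the hour/minute/second subset of ISO 8601 durations shown above. Both function names are hypothetical and are not part of `server.py`.

```python
import re

# Matches the time-only subset used above, e.g. PT2M, PT1H, PT1H30M.
_DURATION = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?$")


def duration_to_seconds(value: str) -> int:
    """Parse a PT..H..M..S duration into seconds."""
    match = _DURATION.match(value)
    if not match or not any(match.groups()):
        raise ValueError(f"unsupported duration: {value!r}")
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds


def minutes_to_duration(minutes: int) -> str:
    """Format a positive number of minutes as an ISO 8601 duration."""
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    hours, remainder = divmod(minutes, 60)
    return "PT" + (f"{hours}H" if hours else "") + (f"{remainder}M" if remainder else "")
```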

Requirements

  • Python 3.7+

  • AWS credentials configured (for actual FIS deployment)

  • MCP-compatible client (Kiro CLI, Claude Desktop, etc.)

Chaos Engineering Best Practices

The Chaos Engineering Flywheel

Follow the scientific method for each experiment:

  1. Define Steady State - Establish measurable baseline metrics (TPS, latency, error rate)

  2. Form Hypothesis - Predict how the system will respond to the fault

  3. Run Experiment - Inject the fault in a controlled manner

  4. Verify Results - Compare actual behavior against hypothesis

  5. Improve - Address gaps and re-run experiments
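Steps 1, 2 and 4 can be made concrete by writing the steady state down as thresholds and checking observations against them. This is a minimal sketch: the class name, field names, and threshold values are illustrative, and the observed numbers would come from your own monitoring.

```python
from dataclasses import dataclass


@dataclass
class SteadyState:
    """Thresholds the system is expected to stay within during the fault."""
    max_error_rate: float      # fraction of failed requests, e.g. 0.01
    max_p99_latency_ms: float  # 99th percentile latency in milliseconds


def verify_hypothesis(steady: SteadyState,
                      observed_error_rate: float,
                      observed_p99_ms: float) -> list:
    """Return a list of violations; an empty list means the hypothesis held."""
    violations = []
    if observed_error_rate > steady.max_error_rate:
        violations.append(
            f"error rate {observed_error_rate:.2%} exceeds {steady.max_error_rate:.2%}"
        )
    if observed_p99_ms > steady.max_p99_latency_ms:
        violations.append(
            f"p99 latency {observed_p99_ms:.0f} ms exceeds {steady.max_p99_latency_ms:.0f} ms"
        )
    return violations


baseline = SteadyState(max_error_rate=0.01, max_p99_latency_ms=800)
print(verify_hypothesis(baseline, observed_error_rate=0.03, observed_p99_ms=650))
```

Each violation is a candidate for step 5: fix the gap, then re-run the same experiment against the same thresholds.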

Experiment Safety Guidelines

Start Small, Scale Gradually:

  • Begin in non-production environments

  • Use synthetic traffic before real customer traffic

  • Start with low percentages (10-20%) and increase gradually

  • Run during off-peak hours initially

Implement Guardrails:

  • Set CloudWatch alarms as stop conditions

  • Define clear rollback procedures

  • Monitor blast radius with real-time dashboards

  • Communicate with operations teams before experiments
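In an experiment template, the first guardrail is expressed as a list of stop conditions that each reference a CloudWatch alarm by ARN. The helper below is a small sketch of that structure. The function name and the alarm ARN in the usage line are placeholders.

```python
def build_stop_conditions(alarm_arns: list) -> list:
    """Turn CloudWatch alarm ARNs into FIS stop-condition entries."""
    if not alarm_arns:
        # A template without alarms must still declare a stop condition.
        return [{"source": "none"}]
    return [{"source": "aws:cloudwatch:alarm", "value": arn} for arn in alarm_arns]


conditions = build_stop_conditions(
    ["arn:aws:cloudwatch:us-east-1:123456789012:alarm:HighErrorRate"]  # placeholder ARN
)
print(conditions)
```

If any referenced alarm enters the ALARM state while the experiment runs, FIS stops the experiment, so the alarm thresholds define the effective blast-radius limit.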

Scope and Impact:

  • Clearly define experiment boundaries

  • Use tags to target specific resources

  • Limit concurrent experiments

  • Document expected vs. actual impact

Continuous Chaos Testing

Automate in CI/CD:

  • Integrate FIS experiments into AWS CodePipeline

  • Run experiments post-deployment automatically

  • Use results to gate production releases

  • Track experiment results over time

Game Days:

  • Schedule regular chaos engineering sessions

  • Simulate realistic failure scenarios

  • Test incident response procedures

  • Validate runbooks and documentation

Key Metrics to Track

System Health:

  • Request success rate (target: >99.9%)

  • Latency percentiles (p50, p95, p99)

  • Error rates (4xx, 5xx)

  • Resource utilization (CPU, memory, connections)
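To compare these metrics before and during an experiment, they need to be computed the same way each time. The sketch below uses the nearest-rank method for percentiles and counts any status below 400 as a success. Both choices are assumptions, and in practice these numbers usually come from CloudWatch or another monitoring system rather than hand-rolled code.

```python
import math


def percentile(samples: list, pct: float) -> float:
    """Nearest-rank percentile of a non-empty list of samples."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def success_rate(status_codes: list) -> float:
    """Fraction of responses with a status code below 400."""
    if not status_codes:
        raise ValueError("status_codes must not be empty")
    return sum(1 for code in status_codes if code < 400) / len(status_codes)


latencies_ms = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
print(percentile(latencies_ms, 50), percentile(latencies_ms, 95))
print(success_rate([200, 200, 500, 404]))
```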

Resilience Indicators:

  • Time to detect failures

  • Time to recovery

  • Blast radius of failures

  • Cascading failure prevention

Common Failure Scenarios

Network Failures:

  • Partition tolerance between services

  • Cross-region connectivity loss

  • DNS resolution failures

  • Increased latency and packet loss

Resource Exhaustion:

  • CPU and memory pressure

  • Connection pool exhaustion

  • Disk I/O saturation

  • API throttling and rate limits

Dependency Failures:

  • Database failover and replication lag

  • Cache invalidation and cold starts

  • Third-party API unavailability

  • Message queue backlogs

License

MIT

Contributing

Issues and pull requests welcome at https://github.com/pimisael/fis-recommender-mcp
