# FIS Recommender MCP Server
An MCP (Model Context Protocol) server that automatically recommends AWS Fault Injection Simulator (FIS) experiments based on DevOps Agent findings. Helps teams quickly design chaos engineering experiments to validate system resilience.
## Features

- 🔍 Analyzes DevOps findings and suggests relevant FIS experiments
- 🎯 Maps issues to appropriate fault injection actions
- 📋 Generates complete FIS experiment templates
- ⚡ Integrates seamlessly with Kiro CLI and other MCP clients
## Installation
### Clone the Repository
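Using the repository URL from the Contributing section, a typical setup looks like this (the `requirements.txt` step is an assumption; skip it if the repo has no dependency file):

```shell
git clone https://github.com/pimisael/fis-recommender-mcp.git
cd fis-recommender-mcp
# If the repo ships a requirements file, install its dependencies:
pip install -r requirements.txt
```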
### Configure MCP Client

#### For Kiro CLI
Add to `~/.kiro/mcp-servers.json`:
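The original snippet isn't reproduced here; a plausible sketch, assuming the common `mcpServers` layout used by MCP clients (server name and path are placeholders):

```json
{
  "mcpServers": {
    "fis-recommender": {
      "command": "python",
      "args": ["/path/to/fis-recommender-mcp/server.py"]
    }
  }
}
```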
#### For Claude Desktop

Add to `~/Library/Application Support/Claude/claude_desktop_config.json`:
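Claude Desktop reads server definitions from an `mcpServers` map in this file; a sketch with placeholder paths:

```json
{
  "mcpServers": {
    "fis-recommender": {
      "command": "python",
      "args": ["/path/to/fis-recommender-mcp/server.py"]
    }
  }
}
```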
## Usage Examples

### Example 1: Network Latency Issue
**Prompt:** "Recommend FIS experiments for finding-001 regarding network latency timeouts"

**Response:** The MCP server will recommend:

- **Action:** `aws:network:disrupt-connectivity`
- **Duration:** 10 minutes
- **Target:** Network interfaces
- **Stop condition:** CloudWatch alarm on error rate
### Example 2: Database Availability

**Prompt:**

**Response:**

- **Action:** `aws:rds:reboot-db-instances`
- **Duration:** 2 minutes
- **Target:** RDS instances
- Tests the application's database failover handling
### Example 3: CPU Stress Testing

**Prompt:**

**Response:** A complete FIS experiment template with:

- EC2 instance stop action
- 3-minute duration
- CloudWatch alarm stop condition
- Target selection by tags
### Example 4: Memory Pressure

**Prompt:**

**Response:**

- **Action:** `aws:ssm:send-command` (memory stress)
- **Duration:** 5 minutes
- SSM document for memory consumption
- Tests monitoring and auto-recovery
## Standalone Testing
Run the example script to test without an MCP client:
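The script name isn't shown in this README; assuming a conventional layout, the invocation would look something like:

```shell
# Replace example.py with the actual script name in the repository
python example.py
```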
This will analyze sample findings and display recommendations.
## Supported Finding Types

### Network & Connectivity

| Finding Keyword | FIS Action | Duration | Use Case |
|-----------------|------------|----------|----------|
| network | `aws:network:disrupt-connectivity` | 5 min | Test network partition handling |
| latency | `aws:network:disrupt-connectivity` | 10 min | Validate timeout configurations |
| packet loss | `aws:ecs:task-network-packet-loss` | 5 min | Simulate packet loss scenarios |
| vpc endpoint | `aws:network:disrupt-vpc-endpoint` | 5 min | Test VPC endpoint failures |
| cross-region | `aws:network:route-table-disrupt-cross-region-connectivity` | 10 min | Test multi-region connectivity |
| transit gateway | `aws:network:transit-gateway-disrupt-cross-region-connectivity` | 10 min | Test transit gateway issues |
| direct connect | `aws:directconnect:virtual-interface-disconnect` | 5 min | Test Direct Connect failures |
### Database & Storage

| Finding Keyword | FIS Action | Duration | Use Case |
|-----------------|------------|----------|----------|
| database | `aws:rds:reboot-db-instances` | 2 min | Test database failover |
| rds | `aws:rds:failover-db-cluster` | 3 min | Test RDS cluster failover |
| dynamodb | `aws:dynamodb:global-table-pause-replication` | 5 min | Test DynamoDB replication pause |
| aurora dsql | `aws:dsql:cluster-connection-failure` | 5 min | Test Aurora DSQL failures |
| disk | `aws:ebs:pause-volume-io` | 3 min | Test disk I/O failures |
| ebs | `aws:ebs:volume-io-latency` | 5 min | Inject EBS I/O latency |
| s3 replication | `aws:s3:bucket-pause-replication` | 10 min | Test S3 replication pause |
### Compute & Instances

| Finding Keyword | FIS Action | Duration | Use Case |
|-----------------|------------|----------|----------|
| cpu | `aws:ec2:stop-instances` | 3 min | Validate auto-scaling policies |
| memory | `aws:ssm:send-command` | 5 min | Test OOM handling |
| instance | `aws:ec2:reboot-instances` | 2 min | Test instance reboot resilience |
| spot | `aws:ec2:send-spot-instance-interruptions` | 2 min | Test spot interruption handling |
| capacity | `aws:ec2:api-insufficient-instance-capacity-error` | 5 min | Test capacity error handling |
| auto scaling | `aws:ec2:asg-insufficient-instance-capacity-error` | 5 min | Test ASG capacity errors |
### ECS & Containers

| Finding Keyword | FIS Action | Duration | Use Case |
|-----------------|------------|----------|----------|
| ecs | `aws:ecs:stop-task` | 2 min | Test ECS task failure recovery |
| container cpu | `aws:ecs:task-cpu-stress` | 5 min | Inject CPU stress on tasks |
| container memory | `aws:ecs:task-io-stress` | 5 min | Inject I/O stress on tasks |
| container network | `aws:ecs:task-network-latency` | 5 min | Inject network latency on tasks |
| drain | `aws:ecs:drain-container-instances` | 5 min | Test container draining |
### EKS & Kubernetes

| Finding Keyword | FIS Action | Duration | Use Case |
|-----------------|------------|----------|----------|
| eks | `aws:eks:pod-delete` | 2 min | Test pod deletion recovery |
| pod cpu | `aws:eks:pod-cpu-stress` | 5 min | Inject CPU stress on pods |
| pod memory | `aws:eks:pod-memory-stress` | 5 min | Inject memory stress on pods |
| pod network | `aws:eks:pod-network-latency` | 5 min | Inject network latency on pods |
| nodegroup | `aws:eks:terminate-nodegroup-instances` | 3 min | Test node termination |
| kubernetes | `aws:eks:inject-kubernetes-custom-resource` | 5 min | Inject custom K8s faults |
### Lambda & Serverless

| Finding Keyword | FIS Action | Duration | Use Case |
|-----------------|------------|----------|----------|
| lambda | `aws:lambda:invocation-error` | 5 min | Inject Lambda errors |
| lambda latency | `aws:lambda:invocation-add-delay` | 5 min | Add Lambda invocation delay |
| lambda http | `aws:lambda:invocation-http-integration-response` | 5 min | Test Lambda HTTP failures |
### Lambda Chaos Engineering Best Practices
**Testing Cold Starts and Timeouts:**

- Use `aws:lambda:invocation-add-delay` to simulate cold start scenarios
- Set `startupDelayMilliseconds` higher than the function timeout to test timeout handling
- This validates retry logic, dead letter queues, and error handling

**Error Handling Validation:**

- Use `aws:lambda:invocation-error` with `preventExecution: true` to test without running code
- Set `invocationPercentage` to gradually increase fault injection (start at 10-20%)
- Verify that CloudWatch alarms fire and monitoring captures the errors
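As a sketch, an FIS action entry combining these parameters might look like the following (values are illustrative, and the target name is a placeholder; check the FIS action reference for the authoritative schema):

```json
{
  "injectLambdaErrors": {
    "actionId": "aws:lambda:invocation-error",
    "parameters": {
      "duration": "PT5M",
      "invocationPercentage": "10",
      "preventExecution": "true"
    },
    "targets": { "Functions": "targetFunctions" }
  }
}
```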
**Integration Testing:**

- Use `aws:lambda:invocation-http-integration-response` for ALB, API Gateway, and VPC Lattice
- Test upstream/downstream service behavior with custom HTTP status codes
- Validate circuit breakers and fallback mechanisms

**Continuous Testing in CI/CD:**

- Automate Lambda FIS experiments in AWS CodePipeline post-deployment
- Use CloudWatch Synthetics to monitor user experience during experiments
- Set stop conditions based on error rate thresholds (e.g., >5% errors)

**Experiment Safety:**

- Start experiments in non-production with synthetic traffic
- Use the `invocationPercentage` parameter to limit blast radius
- Configure CloudWatch alarms as stop conditions
- Run during off-peak hours initially

**Key Metrics to Monitor:**

- Invocation errors and throttles
- Duration and billed duration
- Concurrent executions
- Dead letter queue messages
- Downstream service health
### Caching & Streaming

| Finding Keyword | FIS Action | Duration | Use Case |
|-----------------|------------|----------|----------|
| elasticache | `aws:elasticache:replicationgroup-interrupt-az-power` | 5 min | Test ElastiCache AZ failure |
| memorydb | `aws:memorydb:multi-region-cluster-pause-replication` | 5 min | Test MemoryDB replication |
| kinesis | `aws:kinesis:stream-provisioned-throughput-exception` | 5 min | Test Kinesis throughput |
| kinesis iterator | `aws:kinesis:stream-expired-iterator-exception` | 3 min | Test expired iterator handling |
### API & Throttling

| Finding Keyword | FIS Action | Duration | Use Case |
|-----------------|------------|----------|----------|
| api throttle | `aws:fis:inject-api-throttle-error` | 5 min | Inject API throttling |
| api error | `aws:fis:inject-api-internal-error` | 5 min | Inject API internal errors |
| api unavailable | `aws:fis:inject-api-unavailable-error` | 5 min | Inject API unavailable errors |
### Availability & Recovery

| Finding Keyword | FIS Action | Duration | Use Case |
|-----------------|------------|----------|----------|
| availability | `aws:ec2:stop-instances` | 5 min | Test high availability setup |
| zonal | `aws:arc:start-zonal-autoshift` | 10 min | Test zonal autoshift |
| alarm | `aws:cloudwatch:assert-alarm-state` | 1 min | Validate alarm states |
## Available Tools

### 1. recommend_fis_experiments
Analyzes DevOps Agent findings and returns FIS experiment recommendations.
**Input:**

**Output:**
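The exact schemas aren't reproduced in this README; a plausible request/response shape, with field names assumed for illustration and values drawn from the mapping tables above:

```json
{
  "input": {
    "findings": [
      { "id": "finding-001", "description": "Network latency causing timeouts" }
    ]
  },
  "output": {
    "recommendations": [
      {
        "action": "aws:network:disrupt-connectivity",
        "duration": "PT10M",
        "rationale": "Validate timeout configurations"
      }
    ]
  }
}
```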
### 2. create_fis_template
Generates a complete, ready-to-deploy FIS experiment template.
**Input:**

**Output:** Complete CloudFormation-compatible FIS experiment template ready for deployment.
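For orientation, a minimal FIS experiment template has this overall shape (values here are illustrative placeholders, not the server's exact output):

```json
{
  "description": "Test database failover",
  "roleArn": "arn:aws:iam::123456789012:role/FisExperimentRole",
  "targets": {
    "dbInstances": {
      "resourceType": "aws:rds:db",
      "resourceTags": { "Environment": "staging" },
      "selectionMode": "ALL"
    }
  },
  "actions": {
    "rebootDb": {
      "actionId": "aws:rds:reboot-db-instances",
      "targets": { "DBInstances": "dbInstances" }
    }
  },
  "stopConditions": [
    {
      "source": "aws:cloudwatch:alarm",
      "value": "arn:aws:cloudwatch:us-east-1:123456789012:alarm:HighErrorRate"
    }
  ]
}
```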
## Customization

### Adding New Finding Mappings

Edit `server.py` and add to the `finding_mappings` dictionary:
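The dictionary's exact schema lives in `server.py`; assuming a keyword-to-recommendation shape consistent with the tables above, a new entry might look like this (the `dns` keyword and field names are hypothetical):

```python
# Hypothetical entry; match the actual structure used in server.py.
finding_mappings = {
    "dns": {
        "action": "aws:network:disrupt-connectivity",
        "duration": "PT5M",  # ISO 8601 duration: 5 minutes
        "use_case": "Test DNS resolution failure handling",
    },
}
```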
### Adjusting Durations

Modify duration values in ISO 8601 format:

- `PT2M` = 2 minutes
- `PT5M` = 5 minutes
- `PT10M` = 10 minutes
- `PT1H` = 1 hour
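If you want to sanity-check a duration value, a tiny helper (not part of the server) can convert these simple `PT…` codes to seconds:

```python
import re

def iso8601_duration_to_seconds(value: str) -> int:
    """Convert simple ISO 8601 durations like PT2M, PT10M, or PT1H to seconds."""
    match = re.fullmatch(r"PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?", value)
    if not match:
        raise ValueError(f"Unsupported duration: {value}")
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds
```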
## Requirements

- Python 3.7+
- AWS credentials configured (for actual FIS deployment)
- An MCP-compatible client (Kiro CLI, Claude Desktop, etc.)
## Chaos Engineering Best Practices

### The Chaos Engineering Flywheel

Follow the scientific method for each experiment:

1. **Define Steady State** - Establish measurable baseline metrics (TPS, latency, error rate)
2. **Form Hypothesis** - Predict how the system will respond to the fault
3. **Run Experiment** - Inject the fault in a controlled manner
4. **Verify Results** - Compare actual behavior against the hypothesis
5. **Improve** - Address gaps and re-run experiments
### Experiment Safety Guidelines

**Start Small, Scale Gradually:**

- Begin in non-production environments
- Use synthetic traffic before real customer traffic
- Start with low percentages (10-20%) and increase gradually
- Run during off-peak hours initially

**Implement Guardrails:**

- Set CloudWatch alarms as stop conditions
- Define clear rollback procedures
- Monitor blast radius with real-time dashboards
- Communicate with operations teams before experiments

**Scope and Impact:**

- Clearly define experiment boundaries
- Use tags to target specific resources
- Limit concurrent experiments
- Document expected vs. actual impact
### Continuous Chaos Testing

**Automate in CI/CD:**

- Integrate FIS experiments into AWS CodePipeline
- Run experiments automatically post-deployment
- Use results to gate production releases
- Track experiment results over time

**Game Days:**

- Schedule regular chaos engineering sessions
- Simulate realistic failure scenarios
- Test incident response procedures
- Validate runbooks and documentation
### Key Metrics to Track

**System Health:**

- Request success rate (target: >99.9%)
- Latency percentiles (p50, p95, p99)
- Error rates (4xx, 5xx)
- Resource utilization (CPU, memory, connections)

**Resilience Indicators:**

- Time to detect failures
- Time to recovery
- Blast radius of failures
- Cascading failure prevention
### Common Failure Scenarios

**Network Failures:**

- Partition tolerance between services
- Cross-region connectivity loss
- DNS resolution failures
- Increased latency and packet loss

**Resource Exhaustion:**

- CPU and memory pressure
- Connection pool exhaustion
- Disk I/O saturation
- API throttling and rate limits

**Dependency Failures:**

- Database failover and replication lag
- Cache invalidation and cold starts
- Third-party API unavailability
- Message queue backlogs
## License

MIT

## Contributing

Issues and pull requests welcome at https://github.com/pimisael/fis-recommender-mcp