# FIS Recommender MCP Server

An MCP (Model Context Protocol) server that automatically recommends AWS Fault Injection Service (FIS, formerly Fault Injection Simulator) experiments based on DevOps Agent findings. It helps teams quickly design chaos engineering experiments to validate system resilience.

## Features

- 🔍 Analyzes DevOps findings and suggests relevant FIS experiments
- 🎯 Maps issues to appropriate fault injection actions
- 📋 Generates complete FIS experiment templates
- ⚡ Integrates seamlessly with Kiro CLI and other MCP clients

## Installation

### Clone the Repository

```bash
git clone https://github.com/pimisael/fis-recommender-mcp.git
cd fis-recommender-mcp
chmod +x server.py
```

### Configure MCP Client

#### For Kiro CLI

Add to `~/.kiro/mcp-servers.json`:

```json
{
  "mcpServers": {
    "fis-recommender": {
      "command": "python3",
      "args": ["/absolute/path/to/fis-recommender-mcp/server.py"],
      "env": {
        "AWS_REGION": "us-east-1"
      }
    }
  }
}
```

#### For Claude Desktop

Add to `~/Library/Application Support/Claude/claude_desktop_config.json`:

```json
{
  "mcpServers": {
    "fis-recommender": {
      "command": "python3",
      "args": ["/absolute/path/to/fis-recommender-mcp/server.py"],
      "env": {
        "AWS_REGION": "us-east-1"
      }
    }
  }
}
```

## Usage Examples

### Example 1: Network Latency Issue

**Prompt:**

```
I have a DevOps finding about network latency causing timeouts in my application.
Can you recommend FIS experiments to test this?

Finding details:
- ID: finding-001
- Summary: "High network latency between services causing request timeouts"
- Type: NETWORK_ISSUE
```

**Response:**

The MCP server will recommend:

- Action: `aws:network:disrupt-connectivity`
- Duration: 10 minutes
- Target: Network interfaces
- Stop condition: CloudWatch alarm on error rate

### Example 2: Database Availability

**Prompt:**

```
Recommend FIS experiments for this finding:
{
  "id": "finding-db-001",
  "summary": "Database connection failures during peak load",
  "type": "DATABASE_ISSUE"
}
```

**Response:**

- Action: `aws:rds:reboot-db-instances`
- Duration: 2 minutes
- Target: RDS instances
- Tests application's database failover handling

### Example 3: CPU Stress Testing

**Prompt:**

```
We had a CPU spike incident. Generate a FIS template to test our auto-scaling.
Finding: "CPU utilization reached 95% causing service degradation"
```

**Response:**

Complete FIS experiment template with:

- EC2 instance stop action
- 3-minute duration
- CloudWatch alarm stop condition
- Target selection by tags

### Example 4: Memory Pressure

**Prompt:**

```
Create FIS experiments to validate our memory monitoring:
- Finding ID: mem-leak-001
- Issue: Memory leak caused OOM errors
- Need to test alerting and recovery
```

**Response:**

- Action: `aws:ssm:send-command` (memory stress)
- Duration: 5 minutes
- SSM document for memory consumption
- Tests monitoring and auto-recovery

## Standalone Testing

Run the example script to test without an MCP client:

```bash
python3 example.py
```

This will analyze sample findings and display recommendations.
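
For readers without the repository at hand, the kind of keyword-based matching that the usage examples suggest can be approximated in a few lines. This is only a sketch under stated assumptions: `MAPPINGS` copies three rows from the Supported Finding Types tables, while the `recommend` helper and its substring-matching rule are hypothetical and may not reflect the logic in `server.py` or `example.py`.

```python
# Hypothetical sketch: keyword matching over a finding summary.
# MAPPINGS copies three rows from the Supported Finding Types tables;
# recommend() is an illustrative helper, not the function in server.py.
MAPPINGS = {
    "latency": {"action": "aws:network:disrupt-connectivity", "duration": "PT10M"},
    "database": {"action": "aws:rds:reboot-db-instances", "duration": "PT2M"},
    "memory": {"action": "aws:ssm:send-command", "duration": "PT5M"},
}


def recommend(finding):
    """Return one recommendation per keyword found in the finding summary."""
    summary = finding.get("summary", "").lower()
    matches = [
        {"keyword": keyword, **details}
        for keyword, details in MAPPINGS.items()
        if keyword in summary
    ]
    return {
        "recommendations": matches,
        "finding_id": finding.get("id"),
        "count": len(matches),
    }


if __name__ == "__main__":
    sample = {
        "id": "finding-001",
        "summary": "High network latency between services causing request timeouts",
        "type": "NETWORK_ISSUE",
    }
    print(recommend(sample))
```

Plain substring matching keeps the sketch short, but it would also match keywords embedded in longer words; a real implementation might tokenize the summary or weigh the finding `type` as well.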
## Supported Finding Types

### Network & Connectivity

| Finding Keyword | FIS Action | Duration | Use Case |
|----------------|------------|----------|----------|
| network | aws:network:disrupt-connectivity | 5 min | Test network partition handling |
| latency | aws:network:disrupt-connectivity | 10 min | Validate timeout configurations |
| packet loss | aws:ecs:task-network-packet-loss | 5 min | Simulate packet loss scenarios |
| vpc endpoint | aws:network:disrupt-vpc-endpoint | 5 min | Test VPC endpoint failures |
| cross-region | aws:network:route-table-disrupt-cross-region-connectivity | 10 min | Test multi-region connectivity |
| transit gateway | aws:network:transit-gateway-disrupt-cross-region-connectivity | 10 min | Test transit gateway issues |
| direct connect | aws:directconnect:virtual-interface-disconnect | 5 min | Test Direct Connect failures |

### Database & Storage

| Finding Keyword | FIS Action | Duration | Use Case |
|----------------|------------|----------|----------|
| database | aws:rds:reboot-db-instances | 2 min | Test database failover |
| rds | aws:rds:failover-db-cluster | 3 min | Test RDS cluster failover |
| dynamodb | aws:dynamodb:global-table-pause-replication | 5 min | Test DynamoDB replication pause |
| aurora dsql | aws:dsql:cluster-connection-failure | 5 min | Test Aurora DSQL failures |
| disk | aws:ebs:pause-volume-io | 3 min | Test disk I/O failures |
| ebs | aws:ebs:volume-io-latency | 5 min | Inject EBS I/O latency |
| s3 replication | aws:s3:bucket-pause-replication | 10 min | Test S3 replication pause |

### Compute & Instances

| Finding Keyword | FIS Action | Duration | Use Case |
|----------------|------------|----------|----------|
| cpu | aws:ec2:stop-instances | 3 min | Validate auto-scaling policies |
| memory | aws:ssm:send-command | 5 min | Test OOM handling |
| instance | aws:ec2:reboot-instances | 2 min | Test instance reboot resilience |
| spot | aws:ec2:send-spot-instance-interruptions | 2 min | Test spot interruption handling |
| capacity | aws:ec2:api-insufficient-instance-capacity-error | 5 min | Test capacity error handling |
| auto scaling | aws:ec2:asg-insufficient-instance-capacity-error | 5 min | Test ASG capacity errors |

### ECS & Containers

| Finding Keyword | FIS Action | Duration | Use Case |
|----------------|------------|----------|----------|
| ecs | aws:ecs:stop-task | 2 min | Test ECS task failure recovery |
| container cpu | aws:ecs:task-cpu-stress | 5 min | Inject CPU stress on tasks |
| container memory | aws:ecs:task-io-stress | 5 min | Inject I/O stress on tasks |
| container network | aws:ecs:task-network-latency | 5 min | Inject network latency on tasks |
| drain | aws:ecs:drain-container-instances | 5 min | Test container draining |

### EKS & Kubernetes

| Finding Keyword | FIS Action | Duration | Use Case |
|----------------|------------|----------|----------|
| eks | aws:eks:pod-delete | 2 min | Test pod deletion recovery |
| pod cpu | aws:eks:pod-cpu-stress | 5 min | Inject CPU stress on pods |
| pod memory | aws:eks:pod-memory-stress | 5 min | Inject memory stress on pods |
| pod network | aws:eks:pod-network-latency | 5 min | Inject network latency on pods |
| nodegroup | aws:eks:terminate-nodegroup-instances | 3 min | Test node termination |
| kubernetes | aws:eks:inject-kubernetes-custom-resource | 5 min | Inject custom K8s faults |

### Lambda & Serverless

| Finding Keyword | FIS Action | Duration | Use Case |
|----------------|------------|----------|----------|
| lambda | aws:lambda:invocation-error | 5 min | Inject Lambda errors |
| lambda latency | aws:lambda:invocation-add-delay | 5 min | Add Lambda invocation delay |
| lambda http | aws:lambda:invocation-http-integration-response | 5 min | Test Lambda HTTP failures |

#### Lambda Chaos Engineering Best Practices

**Testing Cold Starts and Timeouts:**

- Use `aws:lambda:invocation-add-delay` to simulate cold start scenarios
- Set `startupDelayMilliseconds` higher than function timeout to test timeout handling
- Validates retry logic, dead letter queues, and error handling

**Error Handling Validation:**

- Use `aws:lambda:invocation-error` with `preventExecution: true` to test without running code
- Set `invocationPercentage` to gradually increase fault injection (start at 10-20%)
- Verify CloudWatch alarms fire and monitoring captures errors

**Integration Testing:**

- Use `aws:lambda:invocation-http-integration-response` for ALB, API Gateway, VPC Lattice
- Test upstream/downstream service behavior with custom HTTP status codes
- Validate circuit breakers and fallback mechanisms

**Continuous Testing in CI/CD:**

- Automate Lambda FIS experiments in AWS CodePipeline post-deployment
- Use CloudWatch Synthetics to monitor user experience during experiments
- Set stop conditions based on error rate thresholds (e.g., >5% errors)

**Experiment Safety:**

- Start experiments in non-production with synthetic traffic
- Use `invocationPercentage` parameter to limit blast radius
- Configure CloudWatch alarms as stop conditions
- Run during off-peak hours initially

**Key Metrics to Monitor:**

- Invocation errors and throttles
- Duration and billed duration
- Concurrent executions
- Dead letter queue messages
- Downstream service health

### Caching & Streaming

| Finding Keyword | FIS Action | Duration | Use Case |
|----------------|------------|----------|----------|
| elasticache | aws:elasticache:replicationgroup-interrupt-az-power | 5 min | Test ElastiCache AZ failure |
| memorydb | aws:memorydb:multi-region-cluster-pause-replication | 5 min | Test MemoryDB replication |
| kinesis | aws:kinesis:stream-provisioned-throughput-exception | 5 min | Test Kinesis throughput |
| kinesis iterator | aws:kinesis:stream-expired-iterator-exception | 3 min | Test expired iterator handling |

### API & Throttling

| Finding Keyword | FIS Action | Duration | Use Case |
|----------------|------------|----------|----------|
| api throttle | aws:fis:inject-api-throttle-error | 5 min | Inject API throttling |
| api error | aws:fis:inject-api-internal-error | 5 min | Inject API internal errors |
| api unavailable | aws:fis:inject-api-unavailable-error | 5 min | Inject API unavailable errors |

### Availability & Recovery

| Finding Keyword | FIS Action | Duration | Use Case |
|----------------|------------|----------|----------|
| availability | aws:ec2:stop-instances | 5 min | Test high availability setup |
| zonal | aws:arc:start-zonal-autoshift | 10 min | Test zonal autoshift |
| alarm | aws:cloudwatch:assert-alarm-state | 1 min | Validate alarm states |

## Available Tools

### 1. recommend_fis_experiments

Analyzes DevOps Agent findings and returns FIS experiment recommendations.

**Input:**

```json
{
  "finding": {
    "id": "finding-123",
    "summary": "Network latency caused timeouts",
    "type": "AVAILABILITY_ISSUE"
  }
}
```

**Output:**

```json
{
  "recommendations": [
    {
      "action": "aws:network:disrupt-connectivity",
      "duration": "PT10M",
      "description": "Simulates network disruption to test timeout handling",
      "targets": ["NetworkInterface"],
      "stopConditions": ["CloudWatch alarm on error rate > 5%"]
    }
  ],
  "finding_id": "finding-123",
  "count": 1
}
```

### 2. create_fis_template

Generates a complete, ready-to-deploy FIS experiment template.

**Input:**

```json
{
  "recommendation": {
    "action": "aws:ec2:stop-instances",
    "duration": "PT3M",
    "description": "Test instance failure recovery"
  },
  "target_config": {
    "resourceType": "aws:ec2:instance",
    "selectionMode": "COUNT(1)",
    "tags": {
      "Environment": "staging",
      "Team": "platform"
    },
    "roleArn": "arn:aws:iam::123456789012:role/FISRole"
  }
}
```

**Output:**

Complete CloudFormation-compatible FIS experiment template ready for deployment.
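
To make the `create_fis_template` input more concrete, the sketch below shows one way such an input could be turned into a template body. It is an illustration under assumptions, not the server's implementation: `build_template`, the target label `primary-targets`, and the action label `inject-fault` are hypothetical names, the dictionary roughly follows the request shape of the FIS `CreateExperimentTemplate` API rather than whatever the tool actually emits, and the `startInstancesAfterDuration` parameter and `Instances` target key apply only to EC2 instance actions such as `aws:ec2:stop-instances`.

```python
import json


def build_template(recommendation, target_config):
    """Assemble a dict shaped like an FIS CreateExperimentTemplate request.

    Hypothetical helper for illustration; only valid as written for
    EC2 instance actions such as aws:ec2:stop-instances.
    """
    target_name = "primary-targets"  # hypothetical target label
    return {
        "description": recommendation["description"],
        "roleArn": target_config["roleArn"],
        "targets": {
            target_name: {
                "resourceType": target_config["resourceType"],
                "selectionMode": target_config["selectionMode"],
                "resourceTags": target_config["tags"],
            }
        },
        "actions": {
            "inject-fault": {  # hypothetical action label
                "actionId": recommendation["action"],
                "description": recommendation["description"],
                # For aws:ec2:stop-instances, instances restart after this duration.
                "parameters": {
                    "startInstancesAfterDuration": recommendation["duration"]
                },
                # EC2 instance actions reference their target under "Instances".
                "targets": {"Instances": target_name},
            }
        },
        # Placeholder only; use a CloudWatch alarm stop condition for real runs.
        "stopConditions": [{"source": "none"}],
    }


if __name__ == "__main__":
    recommendation = {
        "action": "aws:ec2:stop-instances",
        "duration": "PT3M",
        "description": "Test instance failure recovery",
    }
    target_config = {
        "resourceType": "aws:ec2:instance",
        "selectionMode": "COUNT(1)",
        "tags": {"Environment": "staging", "Team": "platform"},
        "roleArn": "arn:aws:iam::123456789012:role/FISRole",
    }
    print(json.dumps(build_template(recommendation, target_config), indent=2))
```

Other actions take different parameters and target keys, so a general builder would need a per-action lookup instead of the hard-coded values used here.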
## Customization

### Adding New Finding Mappings

Edit `server.py` and add to the `finding_mappings` dictionary:

```python
finding_mappings = {
    "disk": {
        "action": "aws:ebs:pause-volume-io",
        "duration": "PT5M",
        "description": "Simulates disk I/O issues"
    },
    # Add your custom mappings here
}
```

### Adjusting Durations

Modify duration values in ISO 8601 format:

- `PT2M` = 2 minutes
- `PT5M` = 5 minutes
- `PT10M` = 10 minutes
- `PT1H` = 1 hour

## Requirements

- Python 3.7+
- AWS credentials configured (for actual FIS deployment)
- MCP-compatible client (Kiro CLI, Claude Desktop, etc.)

## Chaos Engineering Best Practices

### The Chaos Engineering Flywheel

Follow the scientific method for each experiment:

1. **Define Steady State** - Establish measurable baseline metrics (TPS, latency, error rate)
2. **Form Hypothesis** - Predict how the system will respond to the fault
3. **Run Experiment** - Inject the fault in a controlled manner
4. **Verify Results** - Compare actual behavior against hypothesis
5. **Improve** - Address gaps and re-run experiments

### Experiment Safety Guidelines

**Start Small, Scale Gradually:**

- Begin in non-production environments
- Use synthetic traffic before real customer traffic
- Start with low percentages (10-20%) and increase gradually
- Run during off-peak hours initially

**Implement Guardrails:**

- Set CloudWatch alarms as stop conditions
- Define clear rollback procedures
- Monitor blast radius with real-time dashboards
- Communicate with operations teams before experiments

**Scope and Impact:**

- Clearly define experiment boundaries
- Use tags to target specific resources
- Limit concurrent experiments
- Document expected vs. actual impact

### Continuous Chaos Testing

**Automate in CI/CD:**

- Integrate FIS experiments into AWS CodePipeline
- Run experiments post-deployment automatically
- Use results to gate production releases
- Track experiment results over time

**Game Days:**

- Schedule regular chaos engineering sessions
- Simulate realistic failure scenarios
- Test incident response procedures
- Validate runbooks and documentation

### Key Metrics to Track

**System Health:**

- Request success rate (target: >99.9%)
- Latency percentiles (p50, p95, p99)
- Error rates (4xx, 5xx)
- Resource utilization (CPU, memory, connections)

**Resilience Indicators:**

- Time to detect failures
- Time to recovery
- Blast radius of failures
- Cascading failure prevention

### Common Failure Scenarios

**Network Failures:**

- Partition tolerance between services
- Cross-region connectivity loss
- DNS resolution failures
- Increased latency and packet loss

**Resource Exhaustion:**

- CPU and memory pressure
- Connection pool exhaustion
- Disk I/O saturation
- API throttling and rate limits

**Dependency Failures:**

- Database failover and replication lag
- Cache invalidation and cold starts
- Third-party API unavailability
- Message queue backlogs

### References

- [AWS Well-Architected Reliability Pillar](https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/)
- [Principles of Chaos Engineering](https://principlesofchaos.org/)
- [AWS Fault Injection Simulator](https://docs.aws.amazon.com/fis/latest/userguide/)
- [Chaos Testing with AWS FIS and CodePipeline](https://aws.amazon.com/blogs/architecture/chaos-testing-with-aws-fault-injection-simulator-and-aws-codepipeline/)
- [Verify Resilience Using Chaos Engineering](https://aws.amazon.com/blogs/architecture/verify-the-resilience-of-your-workloads-using-chaos-engineering/)

## License

MIT

## Contributing

Issues and pull requests welcome at https://github.com/pimisael/fis-recommender-mcp
