Analyzes DevOps findings to automatically recommend and generate AWS Fault Injection Simulator (FIS) experiment templates. It helps teams validate system resilience by mapping reported issues to specific chaos engineering actions across AWS services.
An MCP (Model Context Protocol) server that automatically recommends AWS Fault Injection Simulator (FIS) experiments based on DevOps Agent findings. Helps teams quickly design chaos engineering experiments to validate system resilience.
# Install toolkit
pip install bedrock-agentcore-starter-toolkit
# Configure and deploy
agentcore configure -e server.py --protocol MCP
agentcore launch
Windows Note: If you see a platform mismatch warning (linux/amd64 vs linux/arm64), use agentcore launch (not agentcore deploy), which does a remote cross-platform build via CodeBuild.
python setup_cognito_fis.py
Save the output values:
- agentcore configure
- Invocation URL: https://bedrock-agentcore.{REGION}.amazonaws.com/runtimes/{ENCODED_ARN}/invocations
- Token URL: https://{COGNITO_DOMAIN}.auth.{REGION}.amazoncognito.com/oauth2/token
- Authorization URL: https://{COGNITO_DOMAIN}.auth.{REGION}.amazoncognito.com/oauth2/authorize
- Scope: openid
Ask DevOps Agent:
"Recommend FIS experiments for network latency issues"
See Setup Guide for complete deployment instructions.
python deploy_lambda.py
# Test
aws lambda invoke --function-name fis-recommender-mcp-client --region {REGION} \
--payload '{"tool":"recommend_fis_experiments","arguments":{"finding":{"summary":"network latency"}}}' \
response.json && cat response.json
See Lambda Deploy Guide for detailed instructions.
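The same test can be scripted. The following is a minimal sketch using boto3, assuming the function name and payload shape from the CLI command above; `build_payload` and `invoke_recommender` are hypothetical helper names, and the response is returned as decoded JSON because its exact shape is not specified here.

```python
import json


def build_payload(summary: str) -> bytes:
    """Encode the same payload the CLI example sends."""
    body = {
        "tool": "recommend_fis_experiments",
        "arguments": {"finding": {"summary": summary}},
    }
    return json.dumps(body).encode("utf-8")


def invoke_recommender(summary: str, region: str,
                       function_name: str = "fis-recommender-mcp-client"):
    """Invoke the deployed Lambda and return its decoded JSON response."""
    import boto3  # imported lazily so build_payload works without boto3 installed

    client = boto3.client("lambda", region_name=region)
    response = client.invoke(FunctionName=function_name,
                             Payload=build_payload(summary))
    return json.loads(response["Payload"].read())
```

Calling `invoke_recommender("network latency", region="us-east-1")` requires AWS credentials with permission to invoke the function.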
git clone https://github.com/pimisael/fis-recommender-mcp.git
cd fis-recommender-mcp
chmod +x server.py
Add to ~/.kiro/mcp-servers.json:
{
"mcpServers": {
"fis-recommender": {
"command": "python3",
"args": ["/absolute/path/to/fis-recommender-mcp/server.py"],
"env": {
"AWS_REGION": "us-east-1"
}
}
}
}
Add to ~/Library/Application Support/Claude/claude_desktop_config.json:
{
"mcpServers": {
"fis-recommender": {
"command": "python3",
"args": ["/absolute/path/to/fis-recommender-mcp/server.py"],
"env": {
"AWS_REGION": "us-east-1"
}
}
}
}
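If the config file already lists other MCP servers, the entry has to be merged rather than pasted over the file. A minimal sketch of that merge, assuming the JSON layout shown above; `add_server` and `update_config_file` are hypothetical helpers, not part of this project.

```python
import json
from pathlib import Path


def add_server(config: dict, server_path: str, region: str = "us-east-1") -> dict:
    """Return a copy of config with the fis-recommender entry added or replaced."""
    servers = dict(config.get("mcpServers", {}))
    servers["fis-recommender"] = {
        "command": "python3",
        "args": [server_path],
        "env": {"AWS_REGION": region},
    }
    return {**config, "mcpServers": servers}


def update_config_file(path: Path, server_path: str) -> None:
    """Merge the entry into the JSON file at path, creating the file if missing."""
    existing = json.loads(path.read_text()) if path.exists() else {}
    merged = add_server(existing, server_path)
    path.write_text(json.dumps(merged, indent=2) + "\n")
```

Existing entries under `mcpServers` are preserved; only the `fis-recommender` key is written.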
Prompt:
I have a DevOps finding about network latency causing timeouts in my application.
Can you recommend FIS experiments to test this?
Finding details:
- ID: finding-001
- Summary: "High network latency between services causing request timeouts"
- Type: NETWORK_ISSUE
Response: The MCP server will recommend:
aws:network:disrupt-connectivity
Prompt:
Recommend FIS experiments for this finding:
{
"id": "finding-db-001",
"summary": "Database connection failures during peak load",
"type": "DATABASE_ISSUE"
}
Response:
aws:rds:reboot-db-instances
Prompt:
We had a CPU spike incident. Generate a FIS template to test our auto-scaling.
Finding: "CPU utilization reached 95% causing service degradation"
Response: Complete FIS experiment template with:
Prompt:
Create FIS experiments to validate our memory monitoring:
- Finding ID: mem-leak-001
- Issue: Memory leak caused OOM errors
- Need to test alerting and recovery
Response:
aws:ssm:send-command (memory stress)
Run the example script to test without an MCP client:
python3 example.py
This will analyze sample findings and display recommendations.
| Finding Keyword | FIS Action | Duration | Use Case |
|---|---|---|---|
| network | aws:network:disrupt-connectivity | 5 min | Test network partition handling |
| latency | aws:network:disrupt-connectivity | 10 min | Validate timeout configurations |
| packet loss | aws:ecs:task-network-packet-loss | 5 min | Simulate packet loss scenarios |
| vpc endpoint | aws:network:disrupt-vpc-endpoint | 5 min | Test VPC endpoint failures |
| cross-region | aws:network:route-table-disrupt-cross-region-connectivity | 10 min | Test multi-region connectivity |
| transit gateway | aws:network:transit-gateway-disrupt-cross-region-connectivity | 10 min | Test transit gateway issues |
| direct connect | aws:directconnect:virtual-interface-disconnect | 5 min | Test Direct Connect failures |
| Finding Keyword | FIS Action | Duration | Use Case |
|---|---|---|---|
| database | aws:rds:reboot-db-instances | 2 min | Test database failover |
| rds | aws:rds:failover-db-cluster | 3 min | Test RDS cluster failover |
| dynamodb | aws:dynamodb:global-table-pause-replication | 5 min | Test DynamoDB replication pause |
| aurora dsql | aws:dsql:cluster-connection-failure | 5 min | Test Aurora DSQL failures |
| disk | aws:ebs:pause-volume-io | 3 min | Test disk I/O failures |
| ebs | aws:ebs:volume-io-latency | 5 min | Inject EBS I/O latency |
| s3 replication | aws:s3:bucket-pause-replication | 10 min | Test S3 replication pause |
| Finding Keyword | FIS Action | Duration | Use Case |
|---|---|---|---|
| cpu | aws:ec2:stop-instances | 3 min | Validate auto-scaling policies |
| memory | aws:ssm:send-command | 5 min | Test OOM handling |
| instance | aws:ec2:reboot-instances | 2 min | Test instance reboot resilience |
| spot | aws:ec2:send-spot-instance-interruptions | 2 min | Test spot interruption handling |
| capacity | aws:ec2:api-insufficient-instance-capacity-error | 5 min | Test capacity error handling |
| auto scaling | aws:ec2:asg-insufficient-instance-capacity-error | 5 min | Test ASG capacity errors |
| Finding Keyword | FIS Action | Duration | Use Case |
|---|---|---|---|
| ecs | aws:ecs:stop-task | 2 min | Test ECS task failure recovery |
| container cpu | aws:ecs:task-cpu-stress | 5 min | Inject CPU stress on tasks |
| container memory | aws:ecs:task-io-stress | 5 min | Inject I/O stress on tasks |
| container network | aws:ecs:task-network-latency | 5 min | Inject network latency on tasks |
| drain | aws:ecs:drain-container-instances | 5 min | Test container draining |
| Finding Keyword | FIS Action | Duration | Use Case |
|---|---|---|---|
| eks | aws:eks:pod-delete | 2 min | Test pod deletion recovery |
| pod cpu | aws:eks:pod-cpu-stress | 5 min | Inject CPU stress on pods |
| pod memory | aws:eks:pod-memory-stress | 5 min | Inject memory stress on pods |
| pod network | aws:eks:pod-network-latency | 5 min | Inject network latency on pods |
| nodegroup | aws:eks:terminate-nodegroup-instances | 3 min | Test node termination |
| kubernetes | aws:eks:inject-kubernetes-custom-resource | 5 min | Inject custom K8s faults |
| Finding Keyword | FIS Action | Duration | Use Case |
|---|---|---|---|
| lambda | aws:lambda:invocation-error | 5 min | Inject Lambda errors |
| lambda latency | aws:lambda:invocation-add-delay | 5 min | Add Lambda invocation delay |
| lambda http | aws:lambda:invocation-http-integration-response | 5 min | Test Lambda HTTP failures |
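Inside an experiment template, each Lambda action carries parameters that control how much traffic is affected. A minimal sketch of one action entry for aws:lambda:invocation-add-delay, assuming the FIS parameter names `duration`, `invocationPercentage`, and `startupDelayMilliseconds` and the `Functions` target key; `lambda_delay_action` and the target name `lambda-target` are hypothetical.

```python
def lambda_delay_action(duration: str, percentage: int, delay_ms: int) -> dict:
    """Build one action entry that delays a share of Lambda invocations."""
    if not 1 <= percentage <= 100:
        raise ValueError("percentage must be between 1 and 100")
    if delay_ms < 0:
        raise ValueError("delay_ms must not be negative")
    return {
        "actionId": "aws:lambda:invocation-add-delay",
        # FIS action parameters are passed as strings.
        "parameters": {
            "duration": duration,
            "invocationPercentage": str(percentage),
            "startupDelayMilliseconds": str(delay_ms),
        },
        # "lambda-target" must match a target defined elsewhere in the template.
        "targets": {"Functions": "lambda-target"},
    }
```

Starting with a low percentage, for example `lambda_delay_action("PT5M", 10, 2000)`, keeps the first run's blast radius small.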
Testing Cold Starts and Timeouts:
- Use aws:lambda:invocation-add-delay to simulate cold start scenarios
- Set startupDelayMilliseconds higher than the function timeout to test timeout handling
Error Handling Validation:
- Use aws:lambda:invocation-error with preventExecution: true to test without running code
- Use invocationPercentage to gradually increase fault injection (start at 10-20%)
Integration Testing:
- Use aws:lambda:invocation-http-integration-response for ALB, API Gateway, VPC Lattice
Continuous Testing in CI/CD:
Experiment Safety:
- Use the invocationPercentage parameter to limit blast radius
Key Metrics to Monitor:
| Finding Keyword | FIS Action | Duration | Use Case |
|---|---|---|---|
| elasticache | aws:elasticache:replicationgroup-interrupt-az-power | 5 min | Test ElastiCache AZ failure |
| memorydb | aws:memorydb:multi-region-cluster-pause-replication | 5 min | Test MemoryDB replication |
| kinesis | aws:kinesis:stream-provisioned-throughput-exception | 5 min | Test Kinesis throughput |
| kinesis iterator | aws:kinesis:stream-expired-iterator-exception | 3 min | Test expired iterator handling |
| Finding Keyword | FIS Action | Duration | Use Case |
|---|---|---|---|
| api throttle | aws:fis:inject-api-throttle-error | 5 min | Inject API throttling |
| api error | aws:fis:inject-api-internal-error | 5 min | Inject API internal errors |
| api unavailable | aws:fis:inject-api-unavailable-error | 5 min | Inject API unavailable errors |
| Finding Keyword | FIS Action | Duration | Use Case |
|---|---|---|---|
| availability | aws:ec2:stop-instances | 5 min | Test high availability setup |
| zonal | aws:arc:start-zonal-autoshift | 10 min | Test zonal autoshift |
| alarm | aws:cloudwatch:assert-alarm-state | 1 min | Validate alarm states |
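The tables above amount to a keyword lookup over the finding text. How server.py resolves overlapping keywords is not stated here, so the following is only a sketch of one plausible approach: check longer keywords first and report each action once. `MAPPINGS` is a small illustrative subset and `match_keywords` is a hypothetical function name.

```python
# Illustrative subset of the tables above: keyword -> (FIS action, ISO 8601 duration).
MAPPINGS = {
    "lambda": ("aws:lambda:invocation-error", "PT5M"),
    "lambda latency": ("aws:lambda:invocation-add-delay", "PT5M"),
    "latency": ("aws:network:disrupt-connectivity", "PT10M"),
    "network": ("aws:network:disrupt-connectivity", "PT5M"),
    "database": ("aws:rds:reboot-db-instances", "PT2M"),
}


def match_keywords(summary: str) -> list[tuple[str, str, str]]:
    """Return (keyword, action, duration) for keywords found in the summary.

    Longer keywords are checked first (ties broken alphabetically), and an
    action that has already been recommended is not repeated.
    """
    text = summary.lower()
    seen_actions = set()
    results = []
    for keyword in sorted(MAPPINGS, key=lambda k: (-len(k), k)):
        action, duration = MAPPINGS[keyword]
        if keyword in text and action not in seen_actions:
            seen_actions.add(action)
            results.append((keyword, action, duration))
    return results
```

With this ordering, a summary that mentions both "network" and "latency" yields a single aws:network:disrupt-connectivity recommendation using the 10-minute "latency" entry.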
Analyzes DevOps Agent findings and returns FIS experiment recommendations.
Input:
{
"finding": {
"id": "finding-123",
"summary": "Network latency caused timeouts",
"type": "AVAILABILITY_ISSUE"
}
}
Output:
{
"recommendations": [
{
"action": "aws:network:disrupt-connectivity",
"duration": "PT10M",
"description": "Simulates network disruption to test timeout handling",
"targets": ["NetworkInterface"],
"stopConditions": ["CloudWatch alarm on error rate > 5%"]
}
],
"finding_id": "finding-123",
"count": 1
}
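A client consuming this output usually needs the duration as a number rather than an ISO 8601 string. A minimal sketch, assuming the response shape shown above and only time-based durations such as PT10M or PT1H; `parse_duration` and `summarize` are hypothetical helpers.

```python
import re
from datetime import timedelta

# Matches time-only ISO 8601 durations, e.g. PT2M, PT10M, PT1H, PT1H30M.
_DURATION = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?$")


def parse_duration(value: str) -> timedelta:
    """Convert a time-only ISO 8601 duration string to a timedelta."""
    match = _DURATION.match(value)
    if not match or not any(match.groups()):
        raise ValueError(f"unsupported duration: {value!r}")
    hours, minutes, seconds = (int(g or 0) for g in match.groups())
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def summarize(response: dict) -> list[str]:
    """Render one line per recommendation, with the duration in minutes."""
    lines = []
    for rec in response["recommendations"]:
        minutes = parse_duration(rec["duration"]).total_seconds() / 60
        lines.append(f'{rec["action"]} for {minutes:g} min')
    return lines
```

Durations with date components (for example P1D) are rejected rather than guessed at.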
Generates a complete, ready-to-deploy FIS experiment template.
Input:
{
"recommendation": {
"action": "aws:ec2:stop-instances",
"duration": "PT3M",
"description": "Test instance failure recovery"
},
"target_config": {
"resourceType": "aws:ec2:instance",
"selectionMode": "COUNT(1)",
"tags": {
"Environment": "staging",
"Team": "platform"
},
"roleArn": "arn:aws:iam::123456789012:role/FISRole"
}
}
Output: Complete CloudFormation-compatible FIS experiment template ready for deployment.
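To make the mapping from input to template concrete, here is a sketch of how such a template could be assembled. It is not the server's actual implementation: it emits the camelCase shape used by the FIS CreateExperimentTemplate API rather than a CloudFormation resource, covers only aws:ec2:stop-instances, and uses hypothetical names (`ACTION_DETAILS`, `build_template`, `target-1`, `action-1`).

```python
# Per-action details the sketch needs; only one action is filled in here.
ACTION_DETAILS = {
    "aws:ec2:stop-instances": {
        "target_key": "Instances",
        "duration_param": "startInstancesAfterDuration",
    },
}


def build_template(recommendation: dict, target_config: dict) -> dict:
    """Assemble an API-shaped FIS experiment template from the tool input."""
    details = ACTION_DETAILS[recommendation["action"]]  # KeyError if unmapped
    return {
        "description": recommendation["description"],
        "targets": {
            "target-1": {
                "resourceType": target_config["resourceType"],
                "resourceTags": target_config["tags"],
                "selectionMode": target_config["selectionMode"],
            }
        },
        "actions": {
            "action-1": {
                "actionId": recommendation["action"],
                "parameters": {
                    details["duration_param"]: recommendation["duration"]
                },
                "targets": {details["target_key"]: "target-1"},
            }
        },
        # Placeholder: use a CloudWatch alarm stop condition outside of testing.
        "stopConditions": [{"source": "none"}],
        "roleArn": target_config["roleArn"],
    }
```

Other actions need their own target key and parameter names, which is why the sketch raises on anything it has no entry for instead of guessing.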
Edit server.py and add to the finding_mappings dictionary:
finding_mappings = {
"disk": {
"action": "aws:ebs:pause-volume-io",
"duration": "PT5M",
"description": "Simulates disk I/O issues"
},
# Add your custom mappings here
}
Modify duration values in ISO 8601 format:
- PT2M = 2 minutes
- PT5M = 5 minutes
- PT10M = 10 minutes
- PT1H = 1 hour
If agentcore configure falls back to Container deployment ("zip utility not found"), install zip via choco install zip or scoop install zip, then re-run. Container deployment also works but is slower.
Follow the scientific method for each experiment:
Start Small, Scale Gradually:
Implement Guardrails:
Scope and Impact:
Automate in CI/CD:
Game Days:
System Health:
Resilience Indicators:
Network Failures:
Resource Exhaustion:
Dependency Failures:
MIT
Issues and pull requests welcome at https://github.com/pimisael/fis-recommender-mcp
Add this to claude_desktop_config.json and restart Claude Desktop.
{
"mcpServers": {
"fis-recommender-mcp-server": {
"command": "npx",
"args": []
}
}
}