bedrock-agent-core-operations-hub-mcp
Provides tools for investigating and managing incidents, integrating with Jira for ticketing and tracking.
Bedrock AgentCore: Self-Healing Operations Hub
Autonomous AI Operations Infrastructure for Enterprise E-Commerce.
Validated against 9 scenario types using an LLM-as-Judge Consensus framework and a decentralized MCP mesh. Achieved 100% Pass Rate with a 96% average Consensus Score across two independent models (Claude 4.5 Sonnet & Amazon Nova Pro).
The Story
It's 3:00 AM on Black Friday. A critical product suddenly shows "Out of Stock" on your website despite 500 units in the warehouse. The culprit? Surge traffic triggered DynamoDB write-throttling, leaving a sync message stranded in the Dead Letter Queue.
Usually, this means an exhausted engineer gets paged, spends an hour digging through logs, and manually triggers a sync, all while the company loses thousands in sales.
The Bedrock Operations Hub changes that story.
An on-call operator types a single natural-language message. From that moment, the AI takes over as a senior expert would. It checks inventory levels, scans Dead Letter Queues for blockages, and remembers if this exact product has failed before. Within seconds, it diagnoses the root cause, clears the blockage, triggers a self-healing sync, and confirms the product is live, with no developer pager required.
This isn't just an AI. It's Self-Healing Infrastructure, turning 3 AM incident bridges into solved tickets.
Architectural Journey
This project progressed through 3 distinct evolutionary phases, where each iteration exposed specific limitations in the previous approach and drove the next major architectural decision:
v1 (bedrock-full-reconciliation): Initial prototype using direct Bedrock inference. It quickly hit a ceiling because it lacked persistent state and structured tool orchestration.
v2 (agent-core-operations-hub): Consolidated all logic into a single Lambda monolith using BedrockAgentCore. While functional, the "Fat Lambda" approach created deployment bottlenecks and violated the principle of least privilege.
v3 (Current): Transitioned to a Distributed MCP Mesh. Decomposed the monolith into 11 independent MCP services. This achieved true service isolation, independent scalability, and set the stage for A2A (Agent-to-Agent) encapsulation.
Technical Pillars
Decentralized MCP Mesh
Unlike monolithic agents, this system uses a distributed Model Context Protocol (MCP) mesh built on decentralized tools: 11 independent AWS Lambda functions acting as MCP servers. The orchestrator dynamically routes intent across the infrastructure. This decoupling lets each service scale independently and keeps the orchestrator infrastructure-agnostic.
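As a minimal sketch of how an orchestrator might route a tool call across such a mesh, the registry below maps tool names to Lambda Function URLs and wraps each call in the JSON-RPC 2.0 `tools/call` envelope that MCP defines. The tool names, URLs, and the `routeToolCall` helper are hypothetical; the project's actual registry is not shown in this README.

```typescript
// Hypothetical registry: tool name -> Function URL of the MCP server that owns it.
// Both the tool names and the URLs are placeholders, not the project's real endpoints.
const toolRegistry: Record<string, string> = {
  check_inventory: "https://inventory.example.com/mcp",
  scan_dlq: "https://dlq.example.com/mcp",
};

interface ToolCallRequest {
  url: string;
  body: {
    jsonrpc: "2.0";
    id: number;
    method: "tools/call";
    params: { name: string; arguments: Record<string, unknown> };
  };
}

let nextId = 1;

// Resolve which server hosts the tool and build the MCP `tools/call` request for it.
function routeToolCall(toolName: string, args: Record<string, unknown>): ToolCallRequest {
  const url = toolRegistry[toolName];
  if (url === undefined) {
    throw new Error(`No MCP server registered for tool "${toolName}"`);
  }
  return {
    url,
    body: {
      jsonrpc: "2.0",
      id: nextId++,
      method: "tools/call",
      params: { name: toolName, arguments: args },
    },
  };
}

const mcpRequest = routeToolCall("check_inventory", { sku: "SKU-4521" });
console.log(mcpRequest.url, JSON.stringify(mcpRequest.body));
```

Because the orchestrator only sees a name-to-URL mapping, a service can be redeployed or scaled without touching orchestration code, provided its URL entry stays current.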
Cost-Optimized Triage Router (Few-Shot Cascading)
To reduce the high baseline cost of ReAct-style agent exploration, this system employs a Triage Router Pattern. A lightweight, high-speed Claude Haiku classifier intercepts incoming requests, using a curated few-shot prompt to generate a pre-diagnosis "Hint". This hint identifies the most likely tools and is injected into the primary Claude Sonnet orchestration context.
Result: Significantly reduces exploratory tool calls, lowering token consumption and latency by an additional ~13% in ambiguous scenarios while maintaining high accuracy under deterministic system constraints.
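The hint-injection flow can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's code: the few-shot examples, tool names, and prompt wording are invented, and the classifier is passed in as a plain function so the example runs without calling Bedrock.

```typescript
// A few-shot example pairs an operator complaint with the tools that resolved it.
interface FewShotExample {
  complaint: string;
  tools: string[];
}

// Hypothetical examples; the tool names are placeholders.
const fewShots: FewShotExample[] = [
  { complaint: "Product shows out of stock but warehouse has units", tools: ["check_inventory", "scan_dlq"] },
  { complaint: "Price on site does not match the price list", tools: ["check_price"] },
];

// Build the prompt for the lightweight classifier model.
function buildTriagePrompt(complaint: string): string {
  const shots = fewShots
    .map((s) => `Complaint: ${s.complaint}\nTools: ${s.tools.join(", ")}`)
    .join("\n\n");
  return `${shots}\n\nComplaint: ${complaint}\nTools:`;
}

// Parse the classifier's comma-separated answer into a list of tool names.
function parseHint(raw: string): string[] {
  return raw
    .split(",")
    .map((t) => t.trim())
    .filter((t) => t.length > 0);
}

// Inject the hint into the orchestrator's system prompt. The hint is advisory:
// the orchestrator can still call other tools if the evidence points elsewhere.
function withHint(systemPrompt: string, hint: string[]): string {
  if (hint.length === 0) return systemPrompt;
  return `${systemPrompt}\n\nTriage hint: the most likely tools are ${hint.join(", ")}.`;
}

// Stand-in for the classifier call; a real system would invoke the small model here.
const fakeClassifier = (_triagePrompt: string): string => "check_inventory, scan_dlq";

const triagePrompt = buildTriagePrompt("SKU-4521 is out of stock on the website");
const toolHint = parseHint(fakeClassifier(triagePrompt));
console.log(withHint("You are an e-commerce operations agent.", toolHint));
```

Keeping the hint advisory rather than binding means a wrong classification costs a few extra tool calls instead of a wrong diagnosis.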
Episodic Memory Bridge
The system leverages a stateful Episodic Memory bridge to bypass redundant diagnostic cycles. By correlating current SKU states with historical resolution data, the agent can skip L1 triage and move directly to remediation, drastically reducing token usage, latency, and operational costs.
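A minimal sketch of the fast-path decision is shown below. It assumes an in-memory, exact-match store keyed by SKU as a stand-in for the memory service; the `Episode` shape, the `planFor` helper, and the `trigger_sync` tool name are hypothetical.

```typescript
// One remembered incident: what went wrong for a SKU and what fixed it.
interface Episode {
  sku: string;
  rootCause: string;
  remediation: string;
}

type Plan =
  | { kind: "fast-path"; remediation: string; basedOn: string }
  | { kind: "full-triage" };

// In-memory stand-in for the episodic store. The README describes vector-based
// retrieval in AgentCore; an exact-match Map is a simplification for illustration.
const episodes = new Map<string, Episode>();

function remember(episode: Episode): void {
  episodes.set(episode.sku, episode);
}

// If this SKU has a recorded resolution, skip L1 triage and go straight to remediation.
function planFor(sku: string): Plan {
  const prior = episodes.get(sku);
  if (prior === undefined) return { kind: "full-triage" };
  return { kind: "fast-path", remediation: prior.remediation, basedOn: prior.rootCause };
}

remember({ sku: "SKU-4521", rootCause: "sync message stuck in DLQ", remediation: "trigger_sync" });
console.log(planFor("SKU-4521").kind); // fast-path
console.log(planFor("SKU-9999").kind); // full-triage
```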
Stealth Resilience
The system implements a hook-layer retry mechanism that intercepts transient 5xx errors and recovers silently. Minor network blips therefore do not derail the agent's reasoning chain, which improves task completion rates in unstable production environments.
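One way such a retry layer could look is sketched below. It is not the project's hook code: the `withSilentRetry` wrapper, the error shape, and the attempt and backoff defaults are all assumptions chosen for illustration.

```typescript
// Build an Error carrying an HTTP status code (hypothetical error shape).
function httpError(status: number): Error & { status: number } {
  return Object.assign(new Error(`HTTP ${status}`), { status });
}

// Only 5xx responses are treated as transient; anything else is surfaced immediately.
function isTransient(err: unknown): boolean {
  if (typeof err !== "object" || err === null || !("status" in err)) return false;
  const status = (err as { status: unknown }).status;
  return typeof status === "number" && status >= 500 && status <= 599;
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Retry a tool call on transient errors with exponential backoff.
// maxAttempts and baseDelayMs are illustrative defaults, not the project's values.
async function withSilentRetry<T>(
  call: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 100,
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await call();
    } catch (err) {
      if (!isTransient(err) || attempt >= maxAttempts) throw err;
      await sleep(baseDelayMs * 2 ** (attempt - 1));
    }
  }
}

// Usage: the first call fails with a 503, the retry succeeds,
// and the caller never sees the error.
let calls = 0;
withSilentRetry(
  async () => {
    calls++;
    if (calls === 1) throw httpError(503);
    return "synced";
  },
  3,
  1,
).then((result) => console.log(result, "after", calls, "calls"));
```

Because the retry happens below the model, the failed attempt never enters the conversation, so the agent does not spend tokens reasoning about an error that has already resolved itself.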
Two-Stage AI Safety Gate (Bedrock Guardrails)
To ensure enterprise-grade safety, the system implements a native Bedrock Guardrail policy (configured in serverless.yml). This provides a deterministic safety perimeter around the LLM:
Inbound Gate: Blocks off-topic queries (denying non-e-commerce requests), detects and blocks prompt injection attacks, and automatically anonymizes PII (Email, Phone, CC).
Outbound Check (Grounding): Validates the agent's final resolution against raw tool outputs. If the agent hallucinates a fix not supported by the data, the response is flagged with a Contextual Grounding Warning.
Agent-to-Agent (A2A) Encapsulation
To maintain strict security boundaries and lean context windows, we implemented A2A Handoff. When systemic infrastructure issues are detected, the primary orchestrator encapsulates the problem and hands it off to a specialized L2 Detective sub-agent. This specialist possesses its own secure tool registry (CloudWatch, Jira), keeping investigative "noise" out of the primary triage loop.
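The encapsulation idea can be illustrated with a small sketch: the orchestrator passes a compact problem statement, and the specialist returns only a short finding while its tools and raw output stay private. Everything here is hypothetical, including the `Handoff` and `Finding` shapes, the stubbed tools standing in for CloudWatch and Jira, and the placeholder ticket key.

```typescript
// What the primary orchestrator passes to the specialist: a compact problem
// statement, not its full conversation history.
interface Handoff {
  correlationId: string;
  sku: string;
  symptom: string;
}

// What comes back: a short finding. Raw tool output stays with the sub-agent.
interface Finding {
  correlationId: string;
  summary: string;
}

// The L2 Detective owns its own tool registry; the orchestrator holds no reference to it.
// The tool bodies are stubs standing in for log-query and ticketing integrations.
function createDetective(): (handoff: Handoff) => Finding {
  const tools = {
    queryLogs: (sku: string): string => `ThrottlingException for ${sku}\n...many more log lines...`,
    openTicket: (summary: string): string => `OPS-1 created: ${summary}`,
  };
  return (handoff) => {
    const logs = tools.queryLogs(handoff.sku); // verbose output never leaves this closure
    const cause = logs.includes("ThrottlingException") ? "DynamoDB write throttling" : "unknown";
    const ticket = tools.openTicket(`${handoff.symptom} (${cause})`);
    const ticketKey = ticket.split(" ")[0];
    return { correlationId: handoff.correlationId, summary: `${cause}; ticket ${ticketKey}` };
  };
}

const detective = createDetective();
const finding = detective({ correlationId: "corr-123", sku: "SKU-4521", symptom: "sync repeatedly fails" });
console.log(finding.summary);
```

Only `finding.summary` would be appended to the primary context, which is what keeps the triage loop's window lean.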
Operational Guardrails (Hook Layer)
Hardcoded business rules enforced at the @strands-agents/sdk hook layer, providing a second layer of defense:
Change Freeze Window: Automated syncs are blocked from Friday 4PM to Monday morning. Any attempt returns OPERATIONAL_POLICY_ERROR.
Gift Item Guard: Recognizes that $0.00 is the valid business state for promotional items (GFT- or SAMPLE-). This prevents the agent from misidentifying these items as pricing errors.
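The two rules above can be expressed as plain deterministic checks, sketched here under stated assumptions. The README does not give the Monday resume hour or the time zone, so 08:00 and UTC are placeholders, and the function names are hypothetical. Registration with the SDK's hook layer is omitted because that API is not shown here.

```typescript
const OPERATIONAL_POLICY_ERROR = "OPERATIONAL_POLICY_ERROR";

// Freeze runs from Friday 16:00 until Monday morning.
// Assumptions: times are evaluated in UTC and "Monday morning" means 08:00.
function inChangeFreeze(now: Date, mondayResumeHour = 8): boolean {
  const day = now.getUTCDay(); // 0 = Sunday ... 6 = Saturday
  const hour = now.getUTCHours();
  if (day === 5) return hour >= 16;
  if (day === 6 || day === 0) return true;
  if (day === 1) return hour < mondayResumeHour;
  return false;
}

// Promotional SKUs are allowed to cost $0.00.
function isPromotionalSku(sku: string): boolean {
  return sku.startsWith("GFT-") || sku.startsWith("SAMPLE-");
}

// A zero price counts as a pricing error only for non-promotional items.
function isPricingError(sku: string, price: number): boolean {
  return price === 0 && !isPromotionalSku(sku);
}

// Guard run before an automated sync: returns an error code to block the call,
// or null to let it through.
function beforeSync(now: Date): string | null {
  return inChangeFreeze(now) ? OPERATIONAL_POLICY_ERROR : null;
}

// 2024-11-29 was a Friday; 17:00 UTC falls inside the freeze window.
console.log(beforeSync(new Date(Date.UTC(2024, 10, 29, 17)))); // OPERATIONAL_POLICY_ERROR
console.log(isPricingError("GFT-001", 0)); // false
console.log(isPricingError("SKU-4521", 0)); // true
```

Enforcing these rules in code rather than in the prompt means the model cannot talk its way around them: the check runs regardless of what the agent decides.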
The Stack
Language: TypeScript & Node.js 22.x (Enterprise-grade type safety).
Orchestration: @strands-agents/sdk + Amazon Bedrock.
Protocol: Official MCP logic over HTTPS Lambda Function URLs.
Memory: Amazon Bedrock AgentCore (Vector-based episodic retrieval).
Schema: Model-aware Zod-to-JSON-Schema transformation (Claude, Nova, Llama).
Production Hygiene: Built-in __health probes on every service and a CORS-enabled statusHub.
Deployment: Stage-aware Serverless Framework v4 using clean YAML anchors for URL management.
Traceability: Logical Correlation ID tracing across distributed log groups.
Check out ARCHITECTURE.md for a deep dive into the Stealth Retry Lifecycle and A2A Encapsulation.
Getting Started
Prerequisites
Node.js 22.x
AWS CLI configured with Bedrock access.
Installation
npm install
Local Evaluation (100% Simulation)
Run the full diagnostic suite locally without any AWS costs:
npm run eval
Deployment
sls deploy --stage dev
Evaluation
The Bedrock Operations Hub is validated against 9 distinct scenario types using an LLM-as-Judge Consensus framework. Two independent models, Claude 4.5 Sonnet and Amazon Nova Pro, act as judges, scoring each agent run on semantic accuracy (0-100). The final score is the mean of both judges' scores, minus any deterministic tool-use penalties.
Current Performance Baseline:
Pass Rate: 100% (9/9 scenarios)
Average Consensus Score: 96/100
Deterministic Tool Penalty: -10 pts per missed expected tool invocation
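The scoring rule described above works out to simple arithmetic, sketched here. Rounding to the nearest integer and flooring at zero are assumptions; the rounding is at least consistent with a judge pair of 100 and 95 being reported as a consensus of 98.

```typescript
const TOOL_PENALTY = 10; // points deducted per missed expected tool invocation

// Mean of the two judges' scores minus deterministic tool penalties.
// Assumptions: result is rounded to the nearest integer and never drops below 0.
function consensusScore(claude: number, nova: number, missedTools: number): number {
  const mean = (claude + nova) / 2;
  return Math.max(0, Math.round(mean - TOOL_PENALTY * missedTools));
}

console.log(consensusScore(100, 95, 0)); // 98
console.log(consensusScore(85, 80, 0)); // 83
console.log(consensusScore(100, 100, 1)); // 90
```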
[Scenario 1: Generic Availability Complaint]
PASS | Consensus: 100/100 (Claude: 100, Nova: 100, Pen: -0)
Claude : Identified root cause and used correct tools for inventory/price sync.
Nova : Accurate root cause identification and successful verification.
[Scenario 2: Specific Price Complaint]
PASS | Consensus: 100/100 (Claude: 100, Nova: 100, Pen: -0)
Claude : Correctly identified price disparity and triggered price sync.
Nova : Agent correctly remediated price discrepancy and verified success.
[Scenario 3: Episodic Memory Fast-Path]
PASS | Consensus: 98/100 (Claude: 100, Nova: 95, Pen: -0)
Claude : Correctly identified episodic memory indicator for previous fix.
Nova : Accurate identification of root cause and used correct tool.
[Scenario 4: PIM Metadata Complaint]
PASS | Consensus: 98/100 (Claude: 100, Nova: 95, Pen: -0)
Claude : Identified PIM metadata root cause and triggered syncs across systems.
Nova : Identified root cause and successfully verified resolution.
[Scenario 5: Full Reconciliation - All Systems]
PASS | Consensus: 98/100 (Claude: 100, Nova: 95, Pen: -0)
Claude : Correctly identified all three system failures as root causes.
Nova : Accurate identification of causes and successful sync tool usage.
[Scenario 6: DLQ Recovery - Guide Consultation]
PASS | Consensus: 95/100 (Claude: 95, Nova: 95, Pen: -0)
Claude : Applied troubleshooting guide resolution and triggered sync.
Nova : Identified root cause, applied guide resolution and verified remediation.
[Scenario 7: L2 Detective - Handoff Escalation]
PASS | Consensus: 95/100 (Claude: 100, Nova: 90, Pen: -0)
Claude : Properly diagnosed DynamoDB throttling and escalated as instructed.
Nova : Accurately identified root cause and provided appropriate escalation.
[Scenario 8: Gift Item Validation - Expected Zero Price]
PASS | Consensus: 100/100 (Claude: 100, Nova: 100, Pen: -0)
Claude : Correctly identified promotional $0.00 as valid business state.
Nova : Perfectly aligns with ground truth for GFT- SKU logic.
[Scenario 9: Transient Error & Silent Recovery]
PASS | Consensus: 83/100 (Claude: 85, Nova: 80, Pen: -0)
Claude : Correctly remediated 503 error via silent retry but missed summary mention.
Nova : Correctly identified the issue but did not mention the silent recovery.
============================================
FINAL RESULTS
Pass Rate : 100% (9/9 scenarios)
Avg Score : 96/100
============================================
Engineering Highlights
Decentralized MCP Mesh: Transitioned from a monolithic API to a mesh of 13 independent AWS Lambdas using direct Function URLs to eliminate API Gateway latency and cold-start overhead.
Cost-Optimization via Cascading: Engineered a dual-model LLM cascade. Using Haiku for instant triage and Sonnet for complex remediation slashes operating costs over a standard Single-Model ReAct loop.
Synthetic Distillation: Hand-crafted a synthetic data pipeline (seed-diagnostic-data.ts) utilizing Sonnet to harvest 200 "Gold Standard" examples that power the Haiku intent classification, mimicking the benefits of model distillation without the massive provisioned throughput costs.
Hook-Layer Guardrails: Implemented deterministic safety logic (Holiday Freeze, Gift Item Guards) using orchestration hooks rather than fragile prompt-layer instructions, ensuring 100% policy compliance.
A2A Context Optimization: Implemented the L2 Detective sub-agent handoff to minimize context-window bloat, delegating deep-trace analytical tasks to a specialized agentic domain only when needed.
Created by Palamkunnel Sujith for the Bedrock Agent Portfolio.