An autonomous AI operations hub for enterprise e-commerce that uses a decentralized MCP mesh and LLM-as-Judge consensus to detect and self-heal infrastructure issues.
Autonomous AI Operations Infrastructure for Enterprise E-Commerce.
Validated against 9 scenario types using an LLM-as-Judge Consensus framework and a decentralized MCP mesh. Achieved 100% Pass Rate with a 96% average Consensus Score across two independent models (Claude 4.5 Sonnet & Amazon Nova Pro).
TypeScript · Node.js 22 · Strands SDK · AWS Bedrock · MCP Protocol · Bedrock Guardrails · Serverless v4
It's 3:00 AM on Black Friday. A critical product suddenly shows "Out of Stock" on your website despite 500 units in the warehouse. The culprit? Surge traffic triggered DynamoDB write-throttling, leaving a sync message stranded in the Dead Letter Queue.
Usually, this means an exhausted engineer gets paged, spends an hour digging through logs, and manually triggers a sync — while the company loses thousands in sales.
The Bedrock Operations Hub changes that story.
An on-call operator types a single natural-language message. From that moment, the AI takes over as a senior expert would. It checks inventory levels, scans Dead Letter Queues for blockages, and remembers if this exact product has failed before. Within seconds, it diagnoses the root cause, clears the blockage, triggers a self-healing sync, and confirms the product is live — no developer pager required.
This isn't just an AI. It's Self-Healing Infrastructure — turning 3 AM incident bridges into solved tickets.
This project progressed through 3 distinct evolutionary phases, where each iteration exposed specific limitations in the previous approach and drove the next major architectural decision:
BedrockAgentCore. While functional, the "Fat Lambda" approach created deployment bottlenecks and violated the principle of least privilege.
Unlike monolithic agents, this system uses a Distributed Model Context Protocol (MCP) mesh.
Built on Decentralized Tools: 11 independent AWS Lambda functions acting as MCP Servers. The orchestrator dynamically routes intent across the infrastructure. This decoupling allows for independent service scaling and ensures the orchestrator remains infrastructure-agnostic.
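The decoupling described above can be pictured as a registry that maps tool names to independently deployed servers. This is a minimal sketch, not the project's code: `ToolMesh`, `MeshTransport`, and the endpoint strings are hypothetical names, and the transport is injected so the orchestrator never depends on how a given Lambda is reached.

```typescript
// Hypothetical sketch: the orchestrator knows only tool names and a transport.
// Each tool lives behind its own independently deployed MCP server.
type ToolArgs = Record<string, unknown>;
type MeshTransport = (endpoint: string, tool: string, args: ToolArgs) => Promise<unknown>;

class ToolMesh {
  private readonly endpoints = new Map<string, string>();
  private readonly transport: MeshTransport;

  constructor(transport: MeshTransport) {
    this.transport = transport;
  }

  // Adding or redeploying a tool only touches this registry entry.
  register(tool: string, endpoint: string): void {
    this.endpoints.set(tool, endpoint);
  }

  // Resolve the tool to its server and forward the call.
  async call(tool: string, args: ToolArgs): Promise<unknown> {
    const endpoint = this.endpoints.get(tool);
    if (endpoint === undefined) {
      throw new Error(`No MCP server registered for tool "${tool}"`);
    }
    return this.transport(endpoint, tool, args);
  }
}
```

Because the transport is a plain function, a test can pass an in-memory stub while a deployment would pass something that invokes the real Lambda.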
To reduce the high baseline cost of ReAct-style agent exploration, this system employs a Triage Router Pattern. A lightweight, high-speed Claude Haiku classifier intercepts incoming requests, using a curated few-shot prompt to generate a pre-diagnosis "Hint". This hint identifies the most likely tools and is injected into the primary Claude Sonnet orchestration context.
Result: Significantly reduces exploratory tool calls, lowering token consumption and latency by an additional ~13% in ambiguous scenarios while maintaining high accuracy under deterministic system constraints.
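The Triage Router flow can be sketched as two small steps: classify, then inject the hint into the orchestrator's context. The sketch below is an illustration under assumptions, not the project's implementation: `TriageHint`, `injectHint`, and `buildOrchestratorPrompt` are hypothetical names, and the classifier is passed in as a function standing in for the Claude Haiku call.

```typescript
// Hypothetical hint shape; the real classifier is a Claude Haiku call
// driven by a curated few-shot prompt.
interface TriageHint {
  intent: string;
  likelyTools: string[];
}

type Classifier = (message: string) => Promise<TriageHint>;

// Append the hint to the context handed to the primary (Sonnet) orchestrator.
function injectHint(systemPrompt: string, hint: TriageHint): string {
  if (hint.likelyTools.length === 0) return systemPrompt; // nothing useful to add
  return [
    systemPrompt,
    "",
    `Pre-diagnosis hint (may be wrong): intent="${hint.intent}".`,
    `Start with these tools: ${hint.likelyTools.join(", ")}.`,
  ].join("\n");
}

async function buildOrchestratorPrompt(
  message: string,
  systemPrompt: string,
  classify: Classifier,
): Promise<string> {
  try {
    return injectHint(systemPrompt, await classify(message));
  } catch {
    // Triage is only an optimization; fall back to the unhinted prompt.
    return systemPrompt;
  }
}
```

Treating the hint as advisory, and falling back to the plain prompt when the classifier fails, keeps the cheap model from becoming a single point of failure for the main agent.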
The system uses a stateful Episodic Memory bridge to bypass redundant diagnostic cycles. By correlating the current SKU state with historical resolution data, the agent can skip L1 triage and move directly to remediation, reducing latency, token consumption, and operational cost.
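A minimal sketch of the fast-path decision, assuming past resolutions are keyed by SKU plus observed symptom. The `PastResolution` fields, the `EpisodicMemory` class, and the in-memory `Map` are hypothetical stand-ins for whatever store the project actually uses.

```typescript
// Hypothetical record of a past incident and what fixed it.
interface PastResolution {
  sku: string;
  symptom: string; // e.g. "out_of_stock_mismatch"
  rootCause: string;
  remediationTool: string; // tool that resolved it last time
}

type TriagePlan =
  | { path: "fast"; tool: string; rootCause: string }
  | { path: "full-triage" };

class EpisodicMemory {
  private readonly episodes = new Map<string, PastResolution>();

  private static key(sku: string, symptom: string): string {
    return `${sku}|${symptom}`;
  }

  remember(resolution: PastResolution): void {
    this.episodes.set(EpisodicMemory.key(resolution.sku, resolution.symptom), resolution);
  }

  // Same SKU and same symptom as before: reuse the known fix and skip L1 triage.
  plan(sku: string, symptom: string): TriagePlan {
    const past = this.episodes.get(EpisodicMemory.key(sku, symptom));
    if (past === undefined) return { path: "full-triage" };
    return { path: "fast", tool: past.remediationTool, rootCause: past.rootCause };
  }
}
```

Matching on the symptom as well as the SKU is a deliberate choice in this sketch: a product that once had a stranded sync message should not be fast-pathed to the same fix when it later reports a different problem.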
A hook-layer retry mechanism intercepts transient 5xx errors and recovers from them silently. Minor network blips therefore do not derail the agent's reasoning chain, which improves task completion rates in unstable production environments.
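The retry behavior can be sketched as a wrapper around a tool call. This is an assumption-laden illustration rather than the project's hook: it does not use the `@strands-agents/sdk` hook API, the names `withSilentRetry` and `isTransient` are hypothetical, and the attempt count and exponential backoff are illustrative defaults not stated in the source.

```typescript
// Treat any error carrying a numeric 5xx `status` as transient (assumed error shape).
function isTransient(err: unknown): boolean {
  if (typeof err !== "object" || err === null) return false;
  const status = (err as { status?: unknown }).status;
  return typeof status === "number" && status >= 500 && status <= 599;
}

async function withSilentRetry<T>(
  call: () => Promise<T>,
  maxAttempts: number = 3,
  baseDelayMs: number = 200,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await call();
    } catch (err) {
      lastError = err;
      // Non-transient errors and exhausted budgets surface to the agent unchanged.
      if (!isTransient(err) || attempt === maxAttempts) break;
      await sleep(baseDelayMs * 2 ** (attempt - 1)); // 200 ms, 400 ms, ...
    }
  }
  throw lastError;
}
```

Only 5xx responses are retried here; a 4xx error is rethrown on the first attempt, since repeating a malformed request would not help and would hide a real bug from the agent.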
To ensure enterprise-grade safety, the system implements a native Bedrock Guardrail policy (configured in serverless.yml). This provides a deterministic safety perimeter around the LLM:
To maintain strict security boundaries and lean context windows, we implemented A2A Handoff. When systemic infrastructure issues are detected, the primary orchestrator encapsulates the problem and hands it off to a specialized L2 Detective sub-agent. This specialist possesses its own secure tool registry (CloudWatch, Jira), keeping investigative "noise" out of the primary triage loop.
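The handoff boundary can be sketched as follows. All names here (`HandoffRequest`, `HandoffResult`, `DetectiveAgent`, `handOffToDetective`) are hypothetical, and the primary context is modeled as a simple list of strings. The point of the sketch is the shape of the exchange: a compact problem statement goes in, and only a conclusion comes back.

```typescript
// Hypothetical handoff envelope: only what the specialist needs to start.
interface HandoffRequest {
  incidentId: string;
  summary: string;
}

interface HandoffResult {
  incidentId: string;
  conclusion: string;
}

// The L2 agent owns its tools (e.g. CloudWatch, Jira); the primary never sees them.
interface DetectiveAgent {
  investigate(request: HandoffRequest): Promise<HandoffResult>;
}

async function handOffToDetective(
  primaryContext: readonly string[],
  request: HandoffRequest,
  detective: DetectiveAgent,
): Promise<string[]> {
  const result = await detective.investigate(request);
  // Only the conclusion crosses back; intermediate tool output stays with the L2 agent.
  return [
    ...primaryContext,
    `L2 conclusion for ${result.incidentId}: ${result.conclusion}`,
  ];
}
```

However many log queries the specialist runs, the primary context grows by exactly one entry, which is what keeps the triage loop's context window lean.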
Hardcoded business rules enforced at the @strands-agents/sdk hook layer, providing a second layer of defense:
OPERATIONAL_POLICY_ERROR.
$0.00 is the valid business state for promotional items (GFT- or SAMPLE-). This prevents the agent from misidentifying these items as pricing errors.
@strands-agents/sdk + Amazon Bedrock.
__health probes on every service and a CORS-enabled statusHub.
[!TIP] Check out ARCHITECTURE.md for a deep dive into the Stealth Retry Lifecycle and A2A Encapsulation.
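The promotional-price rule listed above ($0.00 is valid for GFT- or SAMPLE- items) is simple enough to express as a deterministic check. The sketch below is illustrative only: `checkPrice` and `isPromotionalSku` are hypothetical names, and treating a zero price on any other SKU as a pricing error is an assumption inferred from the rule's stated purpose.

```typescript
// Prefixes named in the rule; everything else in this block is hypothetical.
const PROMO_PREFIXES = ["GFT-", "SAMPLE-"];

function isPromotionalSku(sku: string): boolean {
  return PROMO_PREFIXES.some((prefix) => sku.startsWith(prefix));
}

type PriceVerdict = "ok" | "pricing-error";

function checkPrice(sku: string, price: number): PriceVerdict {
  if (price === 0) {
    // Free is expected for gifts and samples, suspicious for anything else (assumption).
    return isPromotionalSku(sku) ? "ok" : "pricing-error";
  }
  return price > 0 ? "ok" : "pricing-error";
}
```

Running a check like this in code, before the model sees the data, means the outcome for a gift SKU does not depend on how the LLM interprets a $0.00 price.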
npm install
Run the full diagnostic suite locally without any AWS costs:
npm run eval
sls deploy --stage dev
The Bedrock Operations Hub is validated against 9 distinct scenario types using an LLM-as-Judge Consensus framework. Two independent models, Claude 4.5 Sonnet and Amazon Nova Pro, act as judges, each scoring every agent run on semantic accuracy (0–100). The final score is the mean of the two judges' scores, minus any deterministic tool-use penalties.
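The scoring rule can be written as a one-line formula. In this sketch, `consensusScore` is a hypothetical name; rounding to the nearest integer is inferred from the reported results (for example, judge scores of 100 and 95 are reported as a consensus of 98), and clamping at zero is an added assumption.

```typescript
// Mean of the judges' scores, minus a deterministic tool-use penalty.
function consensusScore(judgeScores: number[], penalty: number = 0): number {
  if (judgeScores.length === 0) {
    throw new Error("At least one judge score is required");
  }
  const mean = judgeScores.reduce((sum, score) => sum + score, 0) / judgeScores.length;
  // Rounding matches the reported numbers; the floor at 0 is an assumption.
  return Math.max(0, Math.round(mean - penalty));
}

consensusScore([100, 95]); // 98, as reported for Scenario 3
consensusScore([85, 80]);  // 83, as reported for Scenario 9
```

Keeping the penalty outside the judges' scores means a tool-use violation lowers the result by a fixed amount regardless of how favorably either model rated the run.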
Current Performance Baseline:
📝 [Scenario 1: Generic Availability Complaint]
✅ PASS | 📊 Consensus: 100/100 (Claude: 100, Nova: 100, Pen: -0)
🧑⚖️ Claude : Identified root cause and used correct tools for inventory/price sync.
🧑⚖️ Nova : Accurate root cause identification and successful verification.
📝 [Scenario 2: Specific Price Complaint]
✅ PASS | 📊 Consensus: 100/100 (Claude: 100, Nova: 100, Pen: -0)
🧑⚖️ Claude : Correctly identified price disparity and triggered price sync.
🧑⚖️ Nova : Agent correctly remediated price discrepancy and verified success.
📝 [Scenario 3: Episodic Memory Fast-Path]
✅ PASS | 📊 Consensus: 98/100 (Claude: 100, Nova: 95, Pen: -0)
🧑⚖️ Claude : Correctly identified episodic memory indicator for previous fix.
🧑⚖️ Nova : Accurate identification of root cause and used correct tool.
📝 [Scenario 4: PIM Metadata Complaint]
✅ PASS | 📊 Consensus: 98/100 (Claude: 100, Nova: 95, Pen: -0)
🧑⚖️ Claude : Identified PIM metadata root cause and triggered syncs across systems.
🧑⚖️ Nova : Identified root cause and successfully verified resolution.
📝 [Scenario 5: Full Reconciliation — All Systems]
✅ PASS | 📊 Consensus: 98/100 (Claude: 100, Nova: 95, Pen: -0)
🧑⚖️ Claude : Correctly identified all three system failures as root causes.
🧑⚖️ Nova : Accurate identification of causes and successful sync tool usage.
📝 [Scenario 6: DLQ Recovery — Guide Consultation]
✅ PASS | 📊 Consensus: 95/100 (Claude: 95, Nova: 95, Pen: -0)
🧑⚖️ Claude : Applied troubleshooting guide resolution and triggered sync.
🧑⚖️ Nova : Identified root cause, applied guide resolution and verified remediation.
📝 [Scenario 7: L2 Detective — Handoff Escalation]
✅ PASS | 📊 Consensus: 95/100 (Claude: 100, Nova: 90, Pen: -0)
🧑⚖️ Claude : Properly diagnosed DynamoDB throttling and escalated as instructed.
🧑⚖️ Nova : Accurately identified root cause and provided appropriate escalation.
📝 [Scenario 8: Gift Item Validation — Expected Zero Price]
✅ PASS | 📊 Consensus: 100/100 (Claude: 100, Nova: 100, Pen: -0)
🧑⚖️ Claude : Correctly identified promotional $0.00 as valid business state.
🧑⚖️ Nova : Perfectly aligns with ground truth for GFT- SKU logic.
📝 [Scenario 9: Transient Error & Silent Recovery]
✅ PASS | 📊 Consensus: 83/100 (Claude: 85, Nova: 80, Pen: -0)
🧑⚖️ Claude : Correctly remediated 503 error via silent retry but missed summary mention.
🧑⚖️ Nova : Correctly identified the issue but did not mention the silent recovery.
============================================
🏆 FINAL RESULTS
Pass Rate : 100% (9/9 scenarios)
Avg Score : 96/100
============================================
seed-diagnostic-data.ts) utilizing Sonnet to harvest 200 "Gold Standard" examples that power the Haiku intent classification, mimicking the benefits of model distillation without the massive provisioned throughput costs.
Created by Palamkunnel Sujith for the Bedrock Agent Portfolio.
from github.com/sujithpvarghese/bedrock-agent-core-operations-hub-mcp