A multi-agent MCP server that turns LLMs into an autonomous incident-response copilot, enabling rapid investigation, correlation, and remediation of production incidents.
Production incidents in 10 seconds, not 60 minutes. A drop-in MCP server + dashboard that turns any LLM — Claude, Claude Code, ChatGPT, Cursor, Continue — into an autonomous incident-response copilot.
MCP Compatible · Claude Code · Claude Desktop · ChatGPT · Cursor · License: MIT · Python
Every production incident starts the same way: an engineer opens five tabs at 2 a.m. (CloudWatch, Grafana, GitLab, Confluence, the customer DB) and spends 40-60 minutes gathering context before they can even begin fixing the problem. For a P1, each of those minutes can cost $1,000-$10,000 in lost revenue.
We built AIOps MCP for engineers who are tired of being the human glue between observability tools. It treats incident investigation the way Slack treats messaging or k8s treats containers — as something the platform should handle, not a thing humans should do by hand. Inspired by the way Resolve.ai and pager-replacement tooling are reshaping on-call, but built MCP-native so it speaks the same protocol every modern LLM client already speaks.
Under the hood: six specialized agents, an LLM-driven supervisor, an opinionated synthesis prompt, and a topology engine that knows what depends on what.
| Capability | Description |
|---|---|
| 🤖 6 specialized agents | Log, Infra, Change, Docs, Impact, Audit — run in parallel, not sequence |
| 🧠 MCP-native | Plug into Claude Desktop, Claude Code, Cursor, Continue, or any MCP client over stdio or HTTP |
| 🔌 Multi-LLM | Claude, GPT, Gemini, local models via OpenRouter — pick your brain, we coordinate |
| 📊 MCP Dashboard | Chat + live agent traces + topology + log viewer in one tab — like Claude.ai for incidents |
| 🕸️ App topology | Interactive service graph with blast-radius propagation for connected-impact analysis |
| 📎 Manual + auto logs | Paste, upload, or auto-pull from CloudWatch / Datadog / Splunk / Loki / Grafana |
| 🧾 Full audit trail | Every agent step, LLM prompt, and one-click action logged — compliance-ready |
| 🎫 Auto-Jira | Incident, RCA, evidence, action log — created and updated by the Audit Agent |
| 🚀 One-click actions | Rollback / restart / scale / flag-flip — vetted, parameterized, reversible |
| ⚙️ 8 env vars total | Production deployment with mocks-by-default — no creds, no problem |
| 🐳 Docker-ready | docker compose up and you have the full stack |
| 🔐 Zero-trust by default | Per-agent secrets, PII scrubbing on LLM prompts, immutable audit log |
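To make the "PII scrubbing on LLM prompts" row concrete, here is a minimal sketch of what such a scrubbing pass could look like. The `PATTERNS` table and the `scrub` function are hypothetical names for illustration, not the project's actual code, and the two regexes cover only emails and IPv4 addresses.

```python
import re

# Illustrative patterns only (assumption): a production scrubber would need
# a much broader set, e.g. phone numbers, tokens, account IDs.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "IPV4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def scrub(text: str) -> str:
    """Replace each match with a placeholder tag before the text reaches an LLM."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text


print(scrub("user jane@example.com failed login from 10.2.3.4"))
# user <EMAIL> failed login from <IPV4>
```

Replacing matches with typed placeholders rather than deleting them keeps the log line readable for the model while removing the sensitive value.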
| | MCP Plugin (recommended for LLM users) | Self-hosted CLI (for SREs/platform teams) |
|---|---|---|
| Best for | Solo engineers wiring it into Claude Code / Claude Desktop / Cursor | Teams running AIOps MCP as shared infrastructure |
| Install | `claude mcp add aiops -- aiops mcp-stdio` | `pip install -e .` then `aiops serve` |
| Transport | stdio | HTTP + MCP-over-HTTP + dashboard at :7878 |
| Config | Single `.env` next to `aiops` binary | `.env` + `configs/topology.yaml` + Docker |
| Dashboard | Optional (`aiops dashboard`) | Always on at http://host:7878 |
| Multi-user | Single user | RBAC via Cognito / Okta / OAuth2 |
Pick based on the team you're solving for. Both paths use the same agent engine.
git clone https://github.com/<you>/aiops-mcp.git
cd aiops-mcp
cp .env.example .env # leave it empty for full mock mode
pip install -e .
aiops serve # MCP + HTTP + dashboard on :7878
Open http://localhost:7878 and ask: "Why is checkout slow?"
Or bring up the full stack with Docker:
docker compose up
Grouped by what they actually do in an incident:
| Agent | Sources | What it answers |
|---|---|---|
| 🪵 Log Agent | CloudWatch, Datadog, Splunk, ELK, Loki | "What errors fired in the last 30 min?" |
| 📊 Infra Agent | Grafana, Prometheus, Datadog Metrics, CloudWatch | "Is the DB at 98% connections? Is upstream healthy?" |
| 🚢 Change Agent | GitHub, GitLab, ArgoCD, Jenkins | "Who deployed what, when?" |
| Agent | Sources | What it answers |
|---|---|---|
| 📚 Docs Agent | Bedrock KB / pgvector / Pinecone over runbooks, postmortems, ADRs | "Have we seen this before? What's the runbook?" |
| 💸 Impact Agent | DynamoDB, Snowflake, BigQuery, Mixpanel | "Who's affected? How much revenue is at risk?" |
| Agent | Sources | What it answers |
|---|---|---|
| 🧾 Audit Agent | Jira, ServiceNow, Linear | "Create the ticket, attach the RCA, link past incidents." |
| Tool | Purpose |
|---|---|
| `investigate_incident` | Full multi-agent investigation; returns RCA + suggested actions |
| `query_logs` | Search logs in CloudWatch / Datadog / Splunk / Loki / ELK |
| `query_metrics` | PromQL / Grafana / Datadog Metrics query |
| `attach_log` | Manually attach a log blob (paste or upload) to an active investigation |
| `get_topology` | Return service dependency graph + health |
| `correlate_impact` | Given a service, list downstream impact + affected customers |
| `recent_deploys` | List deploys / merges in a window |
| `find_runbook` | RAG search over runbooks and past postmortems |
| `create_jira_ticket` | Create / update Jira with full RCA |
| `execute_action` | One-click remediation (rollback / restart / scale / flag-flip) |
Every tool is callable directly from your LLM client — no UI required.
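For clients that talk to the HTTP endpoint directly, a tool invocation is a JSON-RPC 2.0 request using MCP's `tools/call` method. The sketch below builds and prints such a request for `investigate_incident`. The `query` argument name is an assumption (the real input schema is whatever the server reports via `tools/list`), `localhost` is a placeholder host, and `call_tool` assumes the server replies with a plain JSON body and needs no prior `initialize` handshake, which may not hold for every deployment.

```python
import json
import urllib.request

# Port and path come from this README; the host is an assumption.
MCP_URL = "http://localhost:7878/mcp"


def build_tool_call(tool: str, arguments: dict, request_id: int = 1) -> dict:
    """Build a JSON-RPC 2.0 request for the MCP `tools/call` method."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }


def call_tool(tool: str, arguments: dict, url: str = MCP_URL) -> dict:
    """POST the request and decode the response (assumes a plain JSON reply)."""
    body = json.dumps(build_tool_call(tool, arguments)).encode("utf-8")
    req = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Hypothetical argument name; check the tool's schema before relying on it.
payload = build_tool_call("investigate_incident", {"query": "Why is checkout slow?"})
print(json.dumps(payload, indent=2))
```

Only `build_tool_call` runs here; `call_tool` is left uncalled so the example works without a server listening on :7878.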
A single-tab web UI inspired by Resolve.ai and Claude.ai for incident response:
| Surface | What it does |
|---|---|
| 💬 Chat panel | Natural-language conversation with the orchestrator |
| 🧩 Agent trace | Live cards showing each agent's progress, findings, and citations |
| 🕸️ Topology graph | Interactive node graph; click a service to see blast radius |
| 📎 Log dropzone | Paste / upload / fetch logs with timestamp alignment |
| ⏱️ Incident timeline | Every step with timestamps, audit-ready |
| 🎯 Action panel | One-click rollback / scale / flag-flip with explicit confirmation |
Live demo (self-host): http://localhost:7878 after aiops serve.
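The blast-radius view in the topology graph amounts to a traversal of the dependency graph: starting from the failing service, walk "who depends on this?" edges outward. The following is a minimal breadth-first sketch under that assumption. The `DEPENDS_ON` map and service names are invented for illustration, and the real `topology.py` may weight or prune edges differently.

```python
from collections import deque

# Hypothetical topology: each service maps to the services it depends on.
DEPENDS_ON = {
    "checkout": ["payments", "cart"],
    "payments": ["postgres"],
    "cart": ["redis"],
    "search": ["elasticsearch"],
}


def blast_radius(failed: str, depends_on: dict) -> dict:
    """Return every service that transitively depends on `failed`, with hop distance."""
    # Invert the edges: dependency -> list of direct dependents.
    dependents = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(service)

    distance = {}
    seen = {failed}
    queue = deque([(failed, 0)])
    while queue:
        node, hops = queue.popleft()
        for parent in dependents.get(node, []):
            if parent not in seen:
                seen.add(parent)
                distance[parent] = hops + 1
                queue.append((parent, hops + 1))
    return distance


print(blast_radius("postgres", DEPENDS_ON))  # {'payments': 1, 'checkout': 2}
```

Hop distance gives a simple ordering for the UI: services one hop away are directly impacted, while those further out are impacted only through an intermediary.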
┌──────────────────────────────────────────────────────┐
│  LLM CLIENT (Claude Code / Desktop / ChatGPT / ...)  │
└────────────────────────┬─────────────────────────────┘
                         │ MCP (stdio or HTTP)
                         ▼
┌──────────────────────────────────────────────────────┐
│               AIOps MCP SERVER (:7878)               │
│   ┌──────────────────────────────────────────────┐   │
│   │           SUPERVISOR ORCHESTRATOR            │   │
│   │   plans → fans out → synthesizes → audits    │   │
│   └──┬─────────┬─────────┬────────┬────────┬─────┘   │
│      ▼         ▼         ▼        ▼        ▼         │
│   ┌─────┐   ┌──────┐  ┌──────┐ ┌──────┐ ┌──────┐     │
│   │ LOG │   │INFRA │  │CHANGE│ │ DOCS │ │IMPACT│     │
│   └──┬──┘   └──┬───┘  └──┬───┘ └──┬───┘ └──┬───┘     │
│      │         │         │        │        │         │
│      ▼         ▼         ▼        ▼        ▼         │
│     ┌──────────────────────────────────────────┐     │
│     │  ADAPTERS (mock-by-default, swappable)   │     │
│     └──────────────────────────────────────────┘     │
│      │         │         │        │        │         │
│      ▼         ▼         ▼        ▼        ▼         │
│ CloudWatch  Grafana   GitHub  Vector    Snowflake    │
│ Datadog     Promet.   GitLab  pgvector  BigQuery     │
│ Splunk      Datadog   ArgoCD  RunbookKB DynamoDB     │
│                                                      │
│                          ▼                           │
│             ┌─────────────────────────┐              │
│             │    SYNTHESIS ENGINE     │              │
│             │    (Claude Opus 4.7)    │              │
│             └────────────┬────────────┘              │
│                          ▼                           │
│             ┌─────────────────────────┐              │
│             │   AUDIT AGENT → Jira    │              │
│             └─────────────────────────┘              │
└──────────────────────────────────────────────────────┘
                           │
                           ▼
          ┌──────────────────────────────────┐
          │      MCP DASHBOARD (web UI)      │
          │  Chat · Trace · Topology · Logs  │
          └──────────────────────────────────┘
You pick the model; AIOps MCP handles coordination.
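The "plans → fans out → synthesizes → audits" flow can be sketched with `asyncio`. This is an illustration of the coordination pattern only: `run_agent` and `investigate` are hypothetical stand-ins, the sleep simulates a backend query, and the synthesis step is a string join where the real system would make an LLM call.

```python
import asyncio


async def run_agent(name: str, question: str, delay: float) -> dict:
    """Stand-in for one agent; the sleep simulates a backend query."""
    await asyncio.sleep(delay)
    return {"agent": name, "finding": f"{name} findings for: {question}"}


async def investigate(question: str) -> dict:
    # Fan out: the five investigative agents run concurrently, not in sequence.
    agents = ["log", "infra", "change", "docs", "impact"]
    results = await asyncio.gather(
        *(run_agent(a, question, 0.05) for a in agents),
        return_exceptions=True,  # one failing agent should not sink the rest
    )
    findings = [r for r in results if not isinstance(r, Exception)]
    # Synthesize: placeholder for the LLM correlation call.
    rca = "; ".join(f["finding"] for f in findings)
    # Audit would run last, because it records the synthesized result.
    return {
        "question": question,
        "rca": rca,
        "agents_ok": [f["agent"] for f in findings],
    }


report = asyncio.run(investigate("Why is checkout slow?"))
print(report["agents_ok"])  # ['log', 'infra', 'change', 'docs', 'impact']
```

Because `asyncio.gather` returns results in the order the awaitables were passed, the findings line up with the agent list even though the agents finish at different times.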
All config is via environment variables. Defaults work with mock data so you can run it instantly.
| Variable | Required | Purpose |
|---|---|---|
| `ANTHROPIC_API_KEY` | for real LLM | Supervisor + Synthesis (Claude Opus 4.7) |
| `AIOPS_PORT` | no | HTTP / MCP port (default 7878) |
| `AIOPS_DATA_DIR` | no | SQLite, uploads, topology cache (default ./data) |
| `AIOPS_MOCK_MODE` | no | Auto-on when no integrations set |
| `DATADOG_API_KEY` or `SPLUNK_TOKEN` + `SPLUNK_HOST` or AWS creds | optional | Pick the log source you have |
| `GRAFANA_URL` + `GRAFANA_TOKEN` | optional | Metrics |
| `GITHUB_TOKEN` or `GITLAB_TOKEN` | optional | Deploys |
| `JIRA_HOST` + `JIRA_EMAIL` + `JIRA_TOKEN` | optional | Audit ticketing |
That's it. See .env.example for the full annotated list.
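The "auto-on when no integrations set" behavior can be pictured as a small loader like the one below. This is a sketch under assumptions, not the contents of `server/config.py`: the `Settings` class, `load_settings`, the truthy-string parsing of `AIOPS_MOCK_MODE`, and the grouping of variables are all illustrative, and AWS credentials are left out because this README does not name their variables.

```python
import os
from dataclasses import dataclass
from typing import Optional

# Variable names come from the table above; the grouping logic is an assumption.
# Each integration is enabled when ANY of its groups has ALL variables set.
INTEGRATIONS = {
    "logs": [["DATADOG_API_KEY"], ["SPLUNK_TOKEN", "SPLUNK_HOST"]],
    "metrics": [["GRAFANA_URL", "GRAFANA_TOKEN"]],
    "deploys": [["GITHUB_TOKEN"], ["GITLAB_TOKEN"]],
    "ticketing": [["JIRA_HOST", "JIRA_EMAIL", "JIRA_TOKEN"]],
}


@dataclass
class Settings:
    port: int
    data_dir: str
    mock_mode: bool
    enabled: list


def load_settings(env: Optional[dict] = None) -> Settings:
    env = dict(os.environ) if env is None else env
    enabled = [
        name
        for name, groups in INTEGRATIONS.items()
        if any(all(env.get(var) for var in group) for group in groups)
    ]
    explicit = env.get("AIOPS_MOCK_MODE")
    if explicit is not None:
        # An explicit setting wins over auto-detection.
        mock = explicit.strip().lower() in {"1", "true", "yes", "on"}
    else:
        # Auto-on when no integration is configured.
        mock = not enabled
    return Settings(
        port=int(env.get("AIOPS_PORT", "7878")),
        data_dir=env.get("AIOPS_DATA_DIR", "./data"),
        mock_mode=mock,
        enabled=enabled,
    )


print(load_settings({}))
```

Passing the environment in as a plain dict keeps the loader easy to test; calling `load_settings()` with no argument reads the real process environment.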
| Client | Setup | Config file |
|---|---|---|
| Claude Desktop | Merge `mcpServers` block into `claude_desktop_config.json` | `configs/claude-desktop.json` |
| Claude Code | `claude mcp add aiops -- aiops mcp-stdio` | `configs/claude-code.json` |
| ChatGPT (custom GPT) | Point at `http://your-host:7878/openapi.json` | `configs/chatgpt-openapi-stub.json` |
| Cursor | Add to `~/.cursor/mcp.json` (same format as Claude Desktop) | `configs/claude-desktop.json` |
| Continue.dev | Add to `~/.continue/config.json` MCP section | `configs/claude-desktop.json` |
| Custom / any HTTP client | POST to `:7878/mcp` (JSON-RPC 2.0) | n/a |
Every tool the dashboard uses is also callable from the LLM client. The dashboard is just another MCP consumer.
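"Merge the `mcpServers` block" means adding one entry to the client's existing config without clobbering servers already registered there. A minimal sketch of that merge follows, assuming the stdio command `aiops mcp-stdio` shown above. The helper names are hypothetical, and the block shipped in `configs/claude-desktop.json` may carry extra fields such as `env`.

```python
import json
from pathlib import Path


def merge_mcp_server(config: dict, name: str, command: str, args: list) -> dict:
    """Return a copy of `config` with one entry added under `mcpServers`."""
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))
    servers[name] = {"command": command, "args": args}
    merged["mcpServers"] = servers
    return merged


def update_config_file(path: Path) -> None:
    """Merge the aiops entry into a client config file in place."""
    existing = json.loads(path.read_text()) if path.exists() else {}
    updated = merge_mcp_server(existing, "aiops", "aiops", ["mcp-stdio"])
    path.write_text(json.dumps(updated, indent=2) + "\n")


# Example with an invented pre-existing server entry.
existing = {"mcpServers": {"other": {"command": "other-server", "args": []}}}
merged = merge_mcp_server(existing, "aiops", "aiops", ["mcp-stdio"])
print(json.dumps(merged, indent=2))
```

`update_config_file` is not called here; point it at your client's config path (for example `~/.cursor/mcp.json`) only after backing the file up.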
| Capability | Without | With AIOps MCP |
|---|---|---|
| Time to RCA | 40–60 min, 5 tabs | ~10 sec, one prompt |
| Investigation cost | 1 engineer-hour per P1 | 1 LLM call |
| Documentation | Manual Jira write-up after the fact | Auto-generated mid-incident |
| Knowledge retention | Lost when the senior leaves | Permanent in RAG corpus |
| On-call escalation reason | "I don't know who deployed what" | Change agent already answered |
| Impact estimation | Slack the BI team | Impact agent in 2 seconds |
| Action execution | SSH, kubectl, prayer | One-click, audited, reversible |
| Connected-impact view | Mental model in someone's head | Live topology graph |
aiops-mcp/
├── README.md # this file
├── .env.example # annotated env var template
├── pyproject.toml
├── requirements.txt
├── Dockerfile
├── docker-compose.yml
├── server/
│ ├── main.py # CLI entry: aiops serve | mcp-stdio | dashboard
│ ├── mcp_server.py # MCP protocol (stdio + HTTP)
│ ├── api.py # FastAPI HTTP API + dashboard host
│ ├── orchestrator.py # Supervisor: plans + fans out
│ ├── synthesis.py # Final LLM correlation call
│ ├── topology.py # Service graph + impact propagation
│ ├── config.py # Env loading + mock fallback
│ └── agents/
│ ├── base.py
│ ├── log_agent.py
│ ├── infra_agent.py
│ ├── change_agent.py
│ ├── docs_agent.py
│ ├── impact_agent.py
│ └── audit_agent.py
├── dashboard/
│ └── index.html # single-page UI (vanilla JS + vis-network)
├── configs/
│ ├── claude-desktop.json
│ ├── claude-code.json
│ ├── chatgpt-openapi-stub.json
│ └── topology.example.yaml
├── docs/
│ ├── INSTALLATION.md
│ ├── INTEGRATIONS.md
│ └── MCP-USAGE.md
└── tests/
└── test_basic.py
| When to read | Doc |
|---|---|
| First-time install on a new host | docs/INSTALLATION.md |
| Wiring into Claude / ChatGPT / Cursor / Continue / custom | docs/INTEGRATIONS.md |
| Building your own MCP client against this server | docs/MCP-USAGE.md |
| Architecture deep-dive (v1 + v2 roadmap) | docs/aiops-architecture.md |
MIT — see LICENSE. Use it, fork it, run it, ship it.
Built by people who've carried the pager.