Enables AI agents to investigate backend incidents by executing runbooks that gather evidence from observability and storage systems.
Runbook-driven backend incident investigation for AI agents.
Status: early open-source MVP.
This repository was inspired by a real internal AI troubleshooting and self-healing workflow. The original production DAG, permissions, and observability plumbing are private and are not reproduced here. This repo focuses on the reusable layer: runbooks, evidence normalization, decision logic, and an MCP entrypoint.
Many online incidents are not hard because they are unique. They are hard because engineers keep replaying the same investigation sequence by hand:
Example:
trace_id, expected result, actual result.
agent-debugger exists for that pattern. It turns repeated troubleshooting habits into executable runbooks so an agent can gather evidence in order instead of guessing freely.
The zero-config path is the fastest way to understand the project. It uses replayable fixtures and does not require Langfuse, Postgres, or Redis credentials.
Requirements:
- Node.js >= 18.17
- pnpm
Run:
pnpm install
pnpm demo
pnpm benchmark
pnpm check
What you get:
Important:
pnpm demo and pnpm benchmark validate replayable investigation cases.
The default demo replays this kind of incident:
The output shows:
After the zero-config demo, you can connect the MCP server to your own observability and storage systems.
Build the server:
pnpm build
Create a config file:
cp agent-debugger.config.example.yaml agent-debugger.config.yaml
Example:
adapters:
  langfuse:
    base_url: https://cloud.langfuse.com
    secret_key: ${LANGFUSE_SECRET_KEY}
    public_key: ${LANGFUSE_PUBLIC_KEY}
  db:
    type: postgres
    connection_string: ${DATABASE_URL}
    allowed_tables: [orders, tasks]
  redis:
    url: ${REDIS_URL}
    key_prefix_allowlist: ["idempotency:", "task:idempotent:", "order:view:", "task:view:"]
runbooks:
  - ./runbooks/request_not_effective.yaml
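The `${...}` values in the example config look like environment placeholders, and the two allowlists look like read guards. How agent-debugger actually resolves or enforces them is not shown here, so the following is only a minimal sketch under those assumptions. `expandPlaceholders` and `isKeyAllowed` are hypothetical names, not part of the project's API.

```typescript
// Hypothetical helpers: NOT the project's real config loader.
type Env = Record<string, string | undefined>;

// Replace every ${NAME} placeholder with the matching environment value.
// Throws if a referenced variable is not set, so misconfiguration fails early.
function expandPlaceholders(value: string, env: Env): string {
  return value.replace(/\$\{([A-Za-z_][A-Za-z0-9_]*)\}/g, (_match, name: string) => {
    const resolved = env[name];
    if (resolved === undefined) {
      throw new Error(`Missing environment variable: ${name}`);
    }
    return resolved;
  });
}

// Assumed semantics of key_prefix_allowlist: a Redis key may be read
// only if it starts with one of the configured prefixes.
function isKeyAllowed(key: string, prefixes: readonly string[]): boolean {
  return prefixes.some((prefix) => key.startsWith(prefix));
}

const prefixes = ["idempotency:", "task:idempotent:", "order:view:", "task:view:"];
const redisUrl = expandPlaceholders("${REDIS_URL}", { REDIS_URL: "redis://localhost:6379" });
const canReadTaskView = isKeyAllowed("task:view:42", prefixes);
const canReadSession = isKeyAllowed("session:abc", prefixes);
```

Under these assumptions, a key such as `session:abc` would be rejected because no configured prefix matches it.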
Add the MCP server to your AI client:
{
  "mcpServers": {
    "agent-debugger": {
      "command": "node",
      "args": ["/path/to/agent-debugger/dist/mcp/server.js"],
      "env": {
        "LANGFUSE_SECRET_KEY": "sk-...",
        "LANGFUSE_PUBLIC_KEY": "pk-...",
        "DATABASE_URL": "postgresql://...",
        "REDIS_URL": "redis://..."
      }
    }
  }
}
Then provide a concrete incident:
Investigate order_id=order_123. Actual: order was created but no task was generated. Expected: a task row should exist.
| Runbook | Scenario |
|---|---|
| request_not_effective | A request succeeded but the expected side effect did not happen |
| cache_stale | Cached state appears inconsistent with persistence |
| state_abnormal | Persisted business state itself looks incorrect |
Current built-in context coverage is intentionally narrow:
- request_not_effective: request_id, order_id
- cache_stale: order_id, task_id
- state_abnormal: order_id, task_id
If you want broader locator support such as trace_id or user_id, add a custom runbook through runbooks: in the config file.
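The built-in coverage above can be read as a small lookup table. The sketch below encodes exactly those three entries and answers which runbooks accept a given locator. The names `BUILT_IN_LOCATORS` and `runbooksSupporting` are illustrative and do not come from the project.

```typescript
// Locator coverage copied from the list above; the data structure is an assumption.
const BUILT_IN_LOCATORS: Record<string, readonly string[]> = {
  request_not_effective: ["request_id", "order_id"],
  cache_stale: ["order_id", "task_id"],
  state_abnormal: ["order_id", "task_id"],
};

// Return the names of built-in runbooks that accept the given locator type.
function runbooksSupporting(locator: string): string[] {
  return Object.entries(BUILT_IN_LOCATORS)
    .filter(([, locators]) => locators.includes(locator))
    .map(([name]) => name);
}

const byOrderId = runbooksSupporting("order_id");
const byRequestId = runbooksSupporting("request_id");
const byTraceId = runbooksSupporting("trace_id"); // not covered by any built-in runbook
```

An empty result, as for `trace_id` here, is the case where a custom runbook would be needed.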
Custom runbooks are supported through runbooks: entries in the config file. Each custom runbook should include sibling .selector.json, .execution.json, and .decision.json metadata files.
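To make the sibling-file requirement concrete, here is a rough sketch that derives the three metadata paths from a runbook path. It assumes the metadata files share the runbook's base name with the `.yaml` extension swapped out. That naming rule is a guess based on the suffixes mentioned above, and `siblingMetadataPaths` is a hypothetical helper.

```typescript
interface RunbookMetadataPaths {
  selector: string;
  execution: string;
  decision: string;
}

// Assumed convention: foo.yaml sits next to foo.selector.json,
// foo.execution.json, and foo.decision.json.
function siblingMetadataPaths(runbookPath: string): RunbookMetadataPaths {
  const base = runbookPath.replace(/\.ya?ml$/i, "");
  return {
    selector: `${base}.selector.json`,
    execution: `${base}.execution.json`,
    decision: `${base}.decision.json`,
  };
}

const paths = siblingMetadataPaths("./runbooks/request_not_effective.yaml");
```

Check the example runbooks in the repository for the actual file layout before relying on this convention.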
Incident Input (context_id + symptom + expected)
↓
[Runbook Selector] Matches signal weights via *.selector.json
↓
[Executor] Calls adapters in order defined by the runbook
↓
[Adapter Layer] Langfuse / PostgreSQL / Redis -> Evidence[]
↓
[Decision Engine] Maps evidence to a conclusion and next actions
↓
[Reporter] Structured IncidentReport
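The flow above can be sketched in a few dozen lines. Everything here is hypothetical: the type names, the keyword-weight scoring, and the in-memory adapters are stand-ins chosen to illustrate the stages, not the project's real interfaces. Real adapters would query Langfuse, PostgreSQL, or Redis and would almost certainly be asynchronous.

```typescript
// Hypothetical types; the project's real schema may differ.
interface Incident { contextId: string; symptom: string; expected: string; }
interface Evidence { source: string; found: boolean; detail: string; }
interface IncidentReport { runbook: string | null; evidence: Evidence[]; conclusion: string; }
type Adapter = (incident: Incident) => Evidence;

interface Runbook {
  name: string;
  signals: Record<string, number>; // keyword -> weight (stand-in for *.selector.json)
  steps: Adapter[];                // executed in order
  decide: (evidence: Evidence[]) => string;
}

// Selector: score each runbook by summing weights of keywords found in the symptom.
function selectRunbook(incident: Incident, runbooks: Runbook[]): Runbook | undefined {
  const text = incident.symptom.toLowerCase();
  let best: Runbook | undefined;
  let bestScore = 0;
  for (const runbook of runbooks) {
    let score = 0;
    for (const [keyword, weight] of Object.entries(runbook.signals)) {
      if (text.includes(keyword)) score += weight;
    }
    if (score > bestScore) { best = runbook; bestScore = score; }
  }
  return best;
}

// Executor + decision engine + reporter in one pass.
function investigate(incident: Incident, runbooks: Runbook[]): IncidentReport {
  const runbook = selectRunbook(incident, runbooks);
  if (!runbook) return { runbook: null, evidence: [], conclusion: "no matching runbook" };
  const evidence = runbook.steps.map((step) => step(incident));
  return { runbook: runbook.name, evidence, conclusion: runbook.decide(evidence) };
}

// Fake adapters standing in for database lookups.
const demoRunbook: Runbook = {
  name: "request_not_effective",
  signals: { "no task": 2, "created": 1 },
  steps: [
    (i) => ({ source: "db:orders", found: true, detail: `order ${i.contextId} exists` }),
    (i) => ({ source: "db:tasks", found: false, detail: `no task for ${i.contextId}` }),
  ],
  decide: (evidence) =>
    evidence.some((e) => e.source === "db:tasks" && !e.found)
      ? "task row missing after order creation"
      : "no anomaly found",
};

const report = investigate(
  { contextId: "order_123", symptom: "order was created but no task was generated", expected: "a task row should exist" },
  [demoRunbook],
);
```

Keeping the selector, executor, and decision step as separate functions mirrors the stages in the diagram, so each one can be tested against replayable fixtures on its own.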
See CONTRIBUTING.md
MIT