Subscription-aware LLM router for Claude Code. Routes tasks to 20+ providers (OpenAI, Gemini, Groq, Ollama, Codex) based on complexity classification, Claude subscription pressure, and cost. Free tasks stay on Claude subscription; expensive tasks fall back to the cheapest capable model. Includes 30 MCP tools, 6 auto-routing hooks, semantic dedup cache, prompt caching, daily spend cap, and a live web dashboard.

Route every AI call to the cheapest model that can do the job well. 48 tools · 20+ providers · personal routing memory · budget caps, dashboards, traces.
Average savings: 60–80% vs running everything on Claude Opus.
Real numbers from a 14-day sprint: 51 releases, 22.6M tokens, $6.95 spent.
Free-first routing eliminated budget pressure over 14 days—allowing sustainable feature velocity.



Free-first routing achieved all of that for $6.95 total.
Compare: a Claude Opus baseline for the same work = $1,200–1,500/year.
```bash
pipx install claude-code-llm-router && llm-router install
```
| Host | Command |
|---|---|
| Claude Code | llm-router install |
| VS Code | llm-router install --host vscode |
| Cursor | llm-router install --host cursor |
| Codex CLI | llm-router install --host codex |
| Gemini CLI | llm-router install --host gemini-cli |
llm-router works as an MCP server inside any tool that supports MCP, providing unified routing across your entire development environment.
| Tool | Status | What You Get |
|---|---|---|
| Claude Code | ✅ Full | Auto-routing hooks + session tracking + quota display |
| Gemini CLI | ✅ Full | Auto-routing hooks + session tracking + quota display |
| Codex CLI | ✅ Full | Auto-routing hooks + savings tracking |
| VS Code + Copilot | ✅ MCP | llm-router tools available (routing is model-voluntary) |
| Cursor | ✅ MCP | llm-router tools available (routing is model-voluntary) |
| OpenCode | ✅ MCP | llm-router tools available (routing is model-voluntary) |
| Windsurf | ✅ MCP | llm-router tools available (routing is model-voluntary) |
| Any MCP-compatible tool | ⚡ Manual | Add llm-router to your tool's MCP config |
Full support = auto-routing hooks fire before the model answers, enforcing your routing policy. MCP support = tools are available, but the model chooses whether to use them.
```bash
pipx install claude-code-llm-router
llm-router install
```
Then in Claude Code, `llm_route` and friends appear as built-in tools. Your settings control the profile (budget/balanced/premium).
```bash
pipx install claude-code-llm-router
llm-router install --host gemini-cli
```
Gemini CLI users get the full routing experience: auto-routing suggestions, quota display, and free-first chaining (Ollama → Codex → Gemini CLI → paid).
```bash
pipx install claude-code-llm-router
llm-router install --host codex
```
Codex integrates deeply into the routing chain as a free fallback when your OpenAI subscription is available.
```bash
pipx install claude-code-llm-router
llm-router install --host vscode   # or --host cursor
```
The MCP server loads automatically. Tools appear in your IDE's model UI.
Intercepts prompts and routes them to the cheapest model that can handle the task. Most AI sessions are full of low-value work: file lookups, small edits, quick questions. Those burn through expensive models unnecessarily.
llm-router keeps cheap work on cheap/free models, escalates to premium models only when needed. No micromanagement required.
Think of llm-router as a smart task dispatcher. When you ask a question, it classifies the task's complexity, picks the cheapest capable model, and escalates only when the cheaper option can't handle it.
The dispatcher learns over time: if a model starts performing poorly (judge scores drop), it gets demoted in future decisions. If you're running low on quota (budget pressure), it automatically uses cheaper models. You don't manage any of this—it just happens behind the scenes.
Example: "Explain this error message" → Simple task → Route to Haiku (fast, cheap) → Done. vs. "Refactor this complex architecture" → Complex task → Route to Opus (expensive but thorough) → Done.
The savings come from not using Opus for every question.
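To make the dispatch concrete, here is a minimal sketch of complexity-based routing, assuming a simple keyword heuristic; the hint lists, the length threshold, and the tier labels are illustrative, not the shipped classifier:

```python
# Minimal sketch of complexity-based dispatch (illustrative, not the shipped classifier).
SIMPLE_HINTS = ("explain", "what is", "summarize", "rename", "typo")
COMPLEX_HINTS = ("refactor", "architecture", "redesign", "migrate")

TIER_MODEL = {
    "simple": "haiku",     # fast, cheap
    "moderate": "sonnet",  # default
    "complex": "opus",     # expensive but thorough
}

def classify(prompt: str) -> str:
    """Rough heuristic: keyword match first, prompt length as a tie-breaker."""
    text = prompt.lower()
    if any(hint in text for hint in COMPLEX_HINTS):
        return "complex"
    if any(hint in text for hint in SIMPLE_HINTS) and len(text) < 400:
        return "simple"
    return "moderate"

def dispatch(prompt: str) -> str:
    return TIER_MODEL[classify(prompt)]

print(dispatch("Explain this error message"))          # -> haiku
print(dispatch("Refactor this complex architecture"))  # -> opus
```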
Major release with optimized routing chains and automatic Ollama management.
- **Ollama Auto-Startup** — Session-start hook automatically launches Ollama and loads budget models (gemma4, qwen3.5) if not running
- **Free-First MCP Chain for All Complexity Levels**
- **BALANCED Tier Chain Reordering** — Gemini Pro prioritized after Codex injection
- **Routing Decision Logging & Analytics**
See CHANGELOG.md for full version history and v6.x features.
Smart content generation detection with automatic routing suggestions.
- **Automatic Content Generation Detection** — Hook detects "write", "draft", "add card", "create spec" patterns and suggests `llm_generate` routing (sketched below)
- **`llm_generate` Decomposition Patterns** — Multi-step content+file tasks now route intelligently (`llm_generate` → Done, or `llm_generate` content → Edit file integration)
- **Soft Nudges via Hook Suggestion** (not blocking) — "Use `llm_generate` first, then integrate locally"
- **Fast-Path for Content Tasks** — Content generation routed instantly without waiting for the classifier

See CLAUDE.md § Content Generation Routing for the detailed decision tree.
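As a rough illustration of the detection step, the sketch below matches the patterns named above and emits a soft (non-blocking) `llm_generate` suggestion; the function name and return shape are assumptions:

```python
from __future__ import annotations
import re

# Patterns named in the release notes; the regex wrapping and return shape are illustrative.
CONTENT_PATTERNS = [r"\bwrite\b", r"\bdraft\b", r"\badd card\b", r"\bcreate spec\b"]

def detect_content_task(prompt: str) -> dict | None:
    """Fast-path: flag content-generation prompts before the full classifier runs."""
    text = prompt.lower()
    if any(re.search(pattern, text) for pattern in CONTENT_PATTERNS):
        return {
            "suggestion": "llm_generate",
            "note": "Use llm_generate first, then integrate the file edit locally.",
            "blocking": False,  # soft nudge, not a hard block
        }
    return None

hint = detect_content_task("Draft a spec for the new cache layer")
if hint:
    print(f"route suggestion: {hint['suggestion']}")
```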
```
User Prompt
    ↓
[Complexity Classifier] — Haiku/Sonnet/Opus?
    ↓
[Free-First Router] — Ollama → Codex → Gemini Flash → OpenAI → Claude
    ↓
[Budget Pressure Check] — Downshift if over 85% budget
    ↓
[Quality Guard] — Demote if judge score < 0.6
    ↓
Selected Model → Execute
```
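A simplified sketch of that pipeline: the chain order, the 85% budget threshold, and the 0.6 judge-score cutoff come from the diagram above, while the function and its glue logic are illustrative, not the actual implementation:

```python
# Illustrative sketch of the flow above; everything beyond the diagram's numbers is glue code.
FREE_FIRST_CHAIN = ["ollama", "codex", "gemini-flash", "openai", "claude"]
COST_RANK = {name: rank for rank, name in enumerate(FREE_FIRST_CHAIN)}  # lower = cheaper

def select_provider(
    complexity: str,      # "simple" | "moderate" | "complex"
    budget_used: float,   # fraction of the daily spend cap already used
    judge_scores: dict,   # rolling quality score per provider
    available: set,
) -> str:
    # Quality guard: providers with a judge score below 0.6 drop to the back of the chain.
    healthy = [p for p in FREE_FIRST_CHAIN
               if p in available and judge_scores.get(p, 1.0) >= 0.6]
    demoted = [p for p in FREE_FIRST_CHAIN if p in available and p not in healthy]
    chain = healthy + demoted

    # Complex work prefers the capable end of the chain, unless budget pressure
    # (over 85% of the cap) forces a downshift back toward the cheap end.
    if complexity == "complex" and budget_used <= 0.85:
        chain = sorted(chain, key=lambda p: -COST_RANK[p])

    return chain[0] if chain else "claude"

print(select_provider("simple", 0.20, {"ollama": 0.9}, {"ollama", "claude"}))    # -> ollama
print(select_provider("complex", 0.90, {"claude": 0.95}, {"openai", "claude"}))  # -> openai (budget downshift)
```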
Zero-config by default if you use Claude Code Pro/Max (subscription mode).
Optional env vars:
```bash
OPENAI_API_KEY=sk-...                    # GPT-4o, o3
GEMINI_API_KEY=AIza...                   # Gemini Flash (free tier)
OLLAMA_BASE_URL=http://localhost:11434   # Local Ollama (free)
LLM_ROUTER_PROFILE=balanced              # budget|balanced|premium
LLM_ROUTER_COMPRESS_RESPONSE=true        # Enable response compression
```
For full setup guide, see docs/SETUP.md.
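As an illustration, a startup routine might read those variables roughly like this; only the variable names come from the list above, while the dataclass and defaults are assumptions:

```python
from __future__ import annotations
import os
from dataclasses import dataclass

@dataclass
class RouterConfig:
    openai_api_key: str | None
    gemini_api_key: str | None
    ollama_base_url: str
    profile: str              # budget | balanced | premium
    compress_response: bool

def load_config() -> RouterConfig:
    env = os.environ
    return RouterConfig(
        openai_api_key=env.get("OPENAI_API_KEY"),
        gemini_api_key=env.get("GEMINI_API_KEY"),
        ollama_base_url=env.get("OLLAMA_BASE_URL", "http://localhost:11434"),
        profile=env.get("LLM_ROUTER_PROFILE", "balanced"),
        compress_response=env.get("LLM_ROUTER_COMPRESS_RESPONSE", "false").lower() == "true",
    )

print(load_config().profile)
```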
Routing violations occur when Claude bypasses a routing directive by using Bash, Read, Edit, or Write instead of calling the routed MCP tool first. This burns expensive tokens with zero cost savings.
When llm-router issues a ⚡ MANDATORY ROUTE hint, it writes a pending state file. If Claude uses Bash (or Read/Edit/Write for Q&A tasks) before calling the expected tool, enforce-route.py logs it as a violation.
Example violation sequence:
```
⚡ MANDATORY ROUTE: query/simple → call llm_query
→ Bash: "I'll answer this directly"   ❌ VIOLATION (should have called llm_query)
```
Cost impact: $0.10+ spent on full Claude model instead of $0.0001 via llm_query routing.
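A simplified sketch of the check enforce-route.py is described as performing; the pending-state file name, its JSON layout, and the function signature are assumptions for illustration:

```python
import json
import os
import time
from pathlib import Path

PENDING = Path.home() / ".llm-router" / "pending-route.json"   # hypothetical state file name
LOG = Path.home() / ".llm-router" / "enforcement.log"
SELF_ANSWER_TOOLS = {"Bash", "Read", "Edit", "Write"}

def check_tool_use(session_id: str, tool_used: str) -> bool:
    """Return True if this tool call should be treated as a routing violation."""
    if not PENDING.exists():
        return False
    expected = json.loads(PENDING.read_text()).get("expected")   # e.g. "llm_query"
    if tool_used == expected:
        PENDING.unlink()          # routed correctly, clear the pending hint
        return False
    if tool_used in SELF_ANSWER_TOOLS:
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        with LOG.open("a") as fh:
            fh.write(f"[{stamp}] VIOLATION session={session_id} "
                     f"expected={expected} got={tool_used}\n")
        mode = os.environ.get("LLM_ROUTER_ENFORCE", "smart")
        return mode in ("hard", "smart")   # block in strict modes, only log otherwise
    return False
```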
Use the provided analysis script to see which sessions violate most and why:
```bash
python3 scripts/analyze-violations.py
```
Output: `~/.llm-router/retrospectives/violation-report-<date>.md`

| Pattern | Cause | Fix |
|---|---|---|
| Bash used for Q&A | Claude answers directly | Route via llm_query / llm_research instead |
| Read after route hint | Claude investigates before routing | Call llm_analyze first, pass file content |
| Edit without generation | Claude codes directly | Route via llm_code first for simple tasks |
| Loop: same tool 3+ times | Investigation stuck in debugging | Call the routed tool to break the deadlock |
Control violation behavior via LLM_ROUTER_ENFORCE:
```bash
export LLM_ROUTER_ENFORCE=smart   # (default) Hard for Q&A, soft for code
export LLM_ROUTER_ENFORCE=hard    # Block all violations (strictest)
export LLM_ROUTER_ENFORCE=soft    # Log violations, allow calls (permissive)
export LLM_ROUTER_ENFORCE=off     # Disable enforcement entirely
```
In `soft` mode, violations are recorded in `enforcement.log` but never blocked, which makes it useful for testing.

After 3+ violations in a session, enforce-route.py prints a warning to stderr:
```
[llm-router] ⚠️ ESCALATION: 5 routing violations this session.
Next prompt expecting llm_query:
  → Call the MCP tool FIRST before any Bash/Read/Edit/Write.
  → See ~/.llm-router/enforcement.log for full history.
  → Set LLM_ROUTER_ENFORCE=hard to block violations automatically.
```
This reminds the model to route first.
The log file ~/.llm-router/enforcement.log contains all violations:
```
[2026-04-26 10:30:45] VIOLATION session=abc12345678 expected=llm_query got=Bash
[2026-04-26 10:31:02] VIOLATION session=abc12345678 expected=llm_query got=Read
```
Use analyze-violations.py to summarize and find patterns across sessions.
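For a sense of what such a summary involves, here is a minimal sketch that counts violations per session and per pattern from the log format shown above; the real analyze-violations.py may work differently:

```python
import re
from collections import Counter
from pathlib import Path

LOG = Path.home() / ".llm-router" / "enforcement.log"
LINE = re.compile(r"VIOLATION session=(\S+) expected=(\S+) got=(\S+)")

def summarize(log_path: Path = LOG) -> None:
    """Count violations per session and per (expected, got) pair."""
    if not log_path.exists():
        return
    per_session = Counter()
    per_pattern = Counter()
    for line in log_path.read_text().splitlines():
        match = LINE.search(line)
        if match:
            session, expected, got = match.groups()
            per_session[session] += 1
            per_pattern[(expected, got)] += 1
    for session, count in per_session.most_common(5):
        print(f"{session}: {count} violations")
    for (expected, got), count in per_pattern.most_common(5):
        print(f"expected {expected}, got {got}: {count}x")

summarize()
```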
v7.5.0+ uses a box-drawing format that's harder to miss:
```
╔══════════════════════════════════════════════════╗
║ ⚡ MANDATORY ROUTE — DO NOT SKIP                  ║
║ task  : query                                     ║
║ action: call llm_query                            ║
║ via   : heuristic                                 ║
║ saves : $0.001                                    ║
╚══════════════════════════════════════════════════╝
```
⚠️ IMPORTANT: Call the tool above as your FIRST action.
• Do NOT use Bash, Read, Edit, or Write to self-answer
• Do NOT spawn Agent subagents — they cost $0.10+
• Do NOT use WebSearch or WebFetch — route via llm_research
• Violations are logged per-session and count toward escalation
This format is visible even in long context windows.
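For illustration, a renderer for that box shape could look like the following; the field order and labels are taken from the sample above, the function itself is hypothetical:

```python
def render_route_hint(task: str, action: str, via: str, saves: str) -> str:
    """Draw a MANDATORY ROUTE box in the style shown above (illustrative renderer)."""
    width = 50
    rows = [
        "⚡ MANDATORY ROUTE — DO NOT SKIP",
        f"task  : {task}",
        f"action: {action}",
        f"via   : {via}",
        f"saves : {saves}",
    ]
    top = "╔" + "═" * width + "╗"
    bottom = "╚" + "═" * width + "╝"
    body = [f"║ {row.ljust(width - 2)} ║" for row in rows]
    return "\n".join([top, *body, bottom])

print(render_route_hint("query", "call llm_query", "heuristic", "$0.001"))
```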
Test artifacts from development can inflate error counts. Clean them up:
```bash
# Preview what will be removed
python3 scripts/cleanup-hook-health.py --dry-run

# Apply cleanup
python3 scripts/cleanup-hook-health.py

# Force-remove specific hooks
python3 scripts/cleanup-hook-health.py --remove hook-a hook-b test-hook
```
This removes hooks that only have errors from test sessions (session_id: "abc123").
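A sketch of that filtering rule, assuming a hypothetical hook-health structure that maps each hook to its error records; the layout and session ID set are illustrative:

```python
# Hypothetical health layout: {"hook-a": [{"session_id": "abc123", ...}, ...]}
TEST_SESSION_IDS = {"abc123"}

def hooks_to_remove(health: dict) -> list:
    """Hooks whose errors all come from test sessions are safe to drop."""
    return [
        name for name, errors in health.items()
        if errors and all(e.get("session_id") in TEST_SESSION_IDS for e in errors)
    ]

print(hooks_to_remove({
    "hook-a": [{"session_id": "abc123"}],
    "hook-b": [{"session_id": "abc123"}, {"session_id": "real-session"}],
}))  # -> ['hook-a']
```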
Routing:
- `llm_route` — Route task to optimal model
- `llm_classify` — Classify task complexity
- `llm_quality_guard` — Monitor model health

Text:
- `llm_query`, `llm_research`, `llm_generate`, `llm_analyze`, `llm_code`

Media:
- `llm_image`, `llm_video`, `llm_audio`

Admin:
- `llm_usage`, `llm_savings`, `llm_budget`, `llm_health`, `llm_providers`

Advanced:
- `llm_orchestrate` — Multi-step pipelines
- `llm_setup` — Configure provider keys
- `llm_policy` — Routing policy management

Full tool reference — Complete documentation for all 48 tools
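Because llm-router runs as an MCP server, any MCP client can call these tools directly. Below is a sketch using the MCP Python SDK; the launch command, its arguments, and the `llm_route` argument names are assumptions, so check your own MCP config for the real entry:

```python
import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

# Assumed launch command; substitute whatever your MCP config registers for llm-router.
server = StdioServerParameters(command="llm-router", args=["serve"])

async def main() -> None:
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            # Tool name is from the list above; the argument key is a guess.
            result = await session.call_tool("llm_route", {"task": "summarize this diff"})
            print(result)

asyncio.run(main())
```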
See CLAUDE.md for development guidelines and docs/ARCHITECTURE.md for architecture details.
```bash
uv run pytest tests/ -q          # Run tests
uv run ruff check src/ tests/    # Lint
uv run llm-router --version      # Check version
```
MIT — See LICENSE