Model Context Protocol server for Langfuse observability. Query traces, analyze accuracy, detect failures, track costs, debug latency, manage prompts and datasets.
56 tools across data access and analytics. Multi-project support so one instance can serve several Langfuse projects. Works with Claude Code, Codex CLI, Cursor, and any MCP-compatible client.
Comparison with official Langfuse MCP (as of March 2026):
| Capability | This server | Official Langfuse MCP |
|---|---|---|
| Traces & Observations | Yes | No |
| Sessions & Users | Yes | No |
| Exception Tracking | Yes | No |
| Prompt Management | Yes | Yes |
| Dataset Management | Yes | No |
| Annotation Queues | Yes | No |
| Scores v2 API | Yes | No |
| Score Write-back | Yes | No |
| Multi-project support | Yes | No |
| Accuracy Metrics | Yes | No |
| Failure Detection | Yes | No |
| Token Percentiles | Yes | No |
| Cost Breakdown | Yes | No |
| Latency Analysis | Yes | No |
| Session Analytics | Yes | No |
| Context Breach Scanning | Yes | No |
| User Group Aggregation | Yes | No |
The official MCP focuses on prompt management. This server provides a full observability and analytics toolkit — traces, observations, sessions, scores, exceptions, prompts, datasets, annotation queues, plus 9 built-in analytics tools that compute insights server-side and return LLM-sized summaries. Multi-project routing lets a single instance serve several Langfuse projects behind one connector URL.
For self-hosted Langfuse, set LANGFUSE_HOST to your instance URL (e.g., https://langfuse.yourcompany.com).

Claude Code:

```bash
claude mcp add \
  -e LANGFUSE_PUBLIC_KEY=pk-lf-... \
  -e LANGFUSE_SECRET_KEY=sk-lf-... \
  -e LANGFUSE_HOST=https://cloud.langfuse.com \
  --scope project \
  langfuse-mcp -- uvx langfuse-mcp-server
```

Codex CLI:

```bash
codex mcp add langfuse-mcp \
  --env LANGFUSE_PUBLIC_KEY=pk-lf-... \
  --env LANGFUSE_SECRET_KEY=sk-lf-... \
  --env LANGFUSE_HOST=https://cloud.langfuse.com \
  -- uvx langfuse-mcp-server
```
Add to .cursor/mcp.json:
```json
{
  "mcpServers": {
    "langfuse-mcp": {
      "command": "uvx",
      "args": ["langfuse-mcp-server"],
      "env": {
        "LANGFUSE_PUBLIC_KEY": "pk-lf-...",
        "LANGFUSE_SECRET_KEY": "sk-lf-...",
        "LANGFUSE_HOST": "https://cloud.langfuse.com"
      }
    }
  }
}
```
Restart your CLI, then test with /mcp (Claude Code) or codex mcp list (Codex).
Alternatively, install from PyPI and run the server directly:

```bash
pip install langfuse-mcp-server
langfuse-mcp-server
```
Run as a long-lived HTTP service so multiple users connect to a single instance — required for Claude.ai custom Connectors, and useful for team-wide access without distributing Langfuse API keys per user.
Enabled via env vars; no code changes.
```bash
MCP_TRANSPORT=streamable-http
MCP_BASE_URL=https://mcp.yourcompany.com
LANGFUSE_PUBLIC_KEY=pk-lf-...
LANGFUSE_SECRET_KEY=sk-lf-...
LANGFUSE_HOST=https://your-langfuse-instance.example
```
Without OAuth env vars, the endpoint is unauthenticated — suitable only for local testing. See Google OAuth setup below for production.
A production-ready Dockerfile is checked into the repo (non-root user, pinned base, .dockerignore to prevent secret leakage). Each tagged release auto-publishes a multi-arch image to GitHub Container Registry via .github/workflows/docker-publish.yml.
Pull the published image:
```bash
docker pull ghcr.io/drishtantkaushal/langfusemcp:latest
```
Or build from source:
```bash
docker build -t langfuse-mcp .
```
Run (all secrets injected via -e, never baked into the image):
```bash
docker run -d \
  --name langfuse-mcp \
  --restart unless-stopped \
  -p 8000:8000 \
  -e MCP_TRANSPORT=streamable-http \
  -e MCP_BASE_URL=https://mcp.yourcompany.com \
  -e LANGFUSE_PUBLIC_KEY=pk-lf-... \
  -e LANGFUSE_SECRET_KEY=sk-lf-... \
  -e LANGFUSE_HOST=https://cloud.langfuse.com \
  -e GOOGLE_CLIENT_ID=... \
  -e GOOGLE_CLIENT_SECRET=... \
  -e ALLOWED_EMAIL_DOMAINS=yourcompany.com \
  ghcr.io/drishtantkaushal/langfusemcp:latest
```
Terminate TLS in front (nginx, Caddy, Cloudflare). The MCP endpoint is at /mcp/ (with a trailing slash). Because responses stream, the proxy must not buffer them; in nginx:

```nginx
proxy_buffering off;
```

In your Google Cloud project, configure the OAuth client with:

- OAuth scopes: openid, https://www.googleapis.com/auth/userinfo.email
- Authorized redirect URI: https://{your-base-url}/auth/callback

Then set:
```bash
GOOGLE_CLIENT_ID=....apps.googleusercontent.com
GOOGLE_CLIENT_SECRET=GOCSPX-...
```
OAuth activates when GOOGLE_CLIENT_ID, GOOGLE_CLIENT_SECRET, and MCP_BASE_URL are all set. With an Internal consent screen, Google rejects non-Workspace sign-ins at the identity layer — the server never sees those attempts.
For narrower control than "anyone in the Workspace":
```bash
# either, or both
ALLOWED_EMAIL_DOMAINS=yourcompany.com
ALLOWED_EMAILS=alice@yourcompany.com,bob@yourcompany.com
```
When set, every tool call verifies the caller's email_verified claim and checks membership before proceeding. When unset, the server trusts whatever the OAuth provider returns.
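A minimal sketch of the allowlist check described above. It assumes the OAuth claims arrive as a dict with email and email_verified keys, and that a caller passes if their address is in ALLOWED_EMAILS or their domain is in ALLOWED_EMAIL_DOMAINS. parse_csv and is_caller_allowed are hypothetical names, not the server's actual code.

```python
from typing import Any, Mapping, Optional, Set


def parse_csv(value: Optional[str]) -> Set[str]:
    """Split a comma-separated env var value into a lowercase set, skipping blanks."""
    if not value:
        return set()
    return {item.strip().lower() for item in value.split(",") if item.strip()}


def is_caller_allowed(
    claims: Mapping[str, Any], allowed_emails: Set[str], allowed_domains: Set[str]
) -> bool:
    # No allowlist configured: trust whatever the OAuth provider returned.
    if not allowed_emails and not allowed_domains:
        return True
    # An allowlist is configured, so require a verified email first.
    if claims.get("email_verified") is not True:
        return False
    email = str(claims.get("email", "")).lower()
    if email in allowed_emails:
        return True
    # rpartition leaves `at` empty when there is no "@", so malformed emails fail.
    _, at, domain = email.rpartition("@")
    return bool(at) and domain in allowed_domains
```

At startup the sets would be built once, e.g. parse_csv(os.getenv("ALLOWED_EMAILS")) and parse_csv(os.getenv("ALLOWED_EMAIL_DOMAINS")), then passed to every per-call check.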
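Once hosted at https://mcp.yourcompany.com, point your MCP client (e.g., a Claude.ai custom Connector) at https://mcp.yourcompany.com/mcp/. With auth enabled, an unauthenticated request should return 401: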
```bash
curl -i -X POST https://mcp.yourcompany.com/mcp/ \
  -H "Content-Type: application/json" \
  -H "Accept: application/json, text/event-stream" \
  -d '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2024-11-05","capabilities":{},"clientInfo":{"name":"curl","version":"0"}}}'
```
OAuth metadata endpoint returns JSON (used by Claude.ai to auto-register):
```bash
curl https://mcp.yourcompany.com/.well-known/oauth-authorization-server
```
Liveness/readiness probe — unauthenticated GET /health returns HTTP 200 with {"status": "ok"}, suitable for Kubernetes probes:
```bash
curl -i https://mcp.yourcompany.com/health
```
A single server instance can route to multiple Langfuse projects. Every tool accepts an optional project argument; when omitted, the server-configured default is used. Call list_projects to discover what's available.
Declare each project via indexed env vars. Project names are data, not part of variable names — use whatever scheme you like.
```bash
LANGFUSE_PROJECT_1_NAME=production
LANGFUSE_PROJECT_1_PUBLIC_KEY=pk-lf-...
LANGFUSE_PROJECT_1_SECRET_KEY=sk-lf-...
LANGFUSE_PROJECT_1_HOST=https://cloud.langfuse.com

LANGFUSE_PROJECT_2_NAME=staging
LANGFUSE_PROJECT_2_PUBLIC_KEY=pk-lf-...
LANGFUSE_PROJECT_2_SECRET_KEY=sk-lf-...
LANGFUSE_PROJECT_2_HOST=https://cloud.langfuse.com

LANGFUSE_DEFAULT_PROJECT=production
```
```text
Claude: "Show me failing traces in production today."
→ fetch_traces(project="production", ...) routed to project 1's credentials.

Claude: "Compare that with staging."
→ fetch_traces(project="staging", ...) routed to project 2's credentials.
```
Each project has its own cache, rate limiter, and connection pool. Claude.ai sees one connector; users authenticate once via OAuth and can query any configured project within the session.
If LANGFUSE_PROJECT_1_NAME is not set, the server falls back to the legacy LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY / LANGFUSE_HOST vars and registers them as a project called default. Existing deployments keep working without changes.
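The indexed-variable scheme and the legacy fallback can be read with a loop like the one below. This is a hypothetical sketch (load_projects and ProjectConfig are illustrative names), and it assumes numbering starts at 1 and stops at the first missing LANGFUSE_PROJECT_{N}_NAME; the server may treat gaps in the numbering differently.

```python
from dataclasses import dataclass
from typing import Dict, Mapping, Tuple

DEFAULT_HOST = "https://cloud.langfuse.com"


@dataclass(frozen=True)
class ProjectConfig:
    name: str
    public_key: str
    secret_key: str
    host: str = DEFAULT_HOST


def load_projects(env: Mapping[str, str]) -> Tuple[Dict[str, ProjectConfig], str]:
    """Return (projects keyed by name, default project name)."""
    projects: Dict[str, ProjectConfig] = {}
    n = 1
    # Walk LANGFUSE_PROJECT_1_*, LANGFUSE_PROJECT_2_*, ... until a NAME is missing.
    while env.get(f"LANGFUSE_PROJECT_{n}_NAME"):
        prefix = f"LANGFUSE_PROJECT_{n}_"
        name = env[prefix + "NAME"]
        projects[name] = ProjectConfig(
            name=name,
            public_key=env[prefix + "PUBLIC_KEY"],
            secret_key=env[prefix + "SECRET_KEY"],
            host=env.get(prefix + "HOST", DEFAULT_HOST),
        )
        n += 1
    if not projects:
        # Legacy single-project vars, registered under the name "default".
        projects["default"] = ProjectConfig(
            name="default",
            public_key=env["LANGFUSE_PUBLIC_KEY"],
            secret_key=env["LANGFUSE_SECRET_KEY"],
            host=env.get("LANGFUSE_HOST", DEFAULT_HOST),
        )
    # Default: LANGFUSE_DEFAULT_PROJECT, else the first configured project.
    default = env.get("LANGFUSE_DEFAULT_PROJECT") or next(iter(projects))
    if default not in projects:
        raise ValueError(f"LANGFUSE_DEFAULT_PROJECT={default!r} is not a configured project")
    return projects, default
```

In a real process this would be called once as load_projects(os.environ); failing fast on an unknown default keeps a typo from silently routing queries to the wrong project.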
| Env Variable | Default | Description |
|---|---|---|
| LANGFUSE_PUBLIC_KEY | (required unless multi-project vars are set) | Langfuse public API key |
| LANGFUSE_SECRET_KEY | (required unless multi-project vars are set) | Langfuse secret API key |
| LANGFUSE_HOST | https://cloud.langfuse.com | Langfuse instance URL (cloud or self-hosted) |
| LANGFUSE_INTERNAL_DOMAINS | "" | Comma-separated internal domains to exclude from analytics (e.g., mycompany.com,test.com). Applies when using group_by='domain'. |
| LANGFUSE_MCP_READ_ONLY | false | Disable write operations (score_traces, create_dataset, etc.) |
| LANGFUSE_PAGE_LIMIT | 100 | Traces per API page |
| LANGFUSE_PROJECT_{N}_NAME | (unset) | Multi-project: name for project N (e.g. production). See Multi-project. |
| LANGFUSE_PROJECT_{N}_PUBLIC_KEY | (unset) | Public key for project N. |
| LANGFUSE_PROJECT_{N}_SECRET_KEY | (unset) | Secret key for project N. |
| LANGFUSE_PROJECT_{N}_HOST | https://cloud.langfuse.com | Host URL for project N. |
| LANGFUSE_DEFAULT_PROJECT | first configured | Default project name used when a tool call omits project. |
| MCP_TRANSPORT | stdio | stdio or streamable-http. HTTP mode listens on a port instead of stdin/stdout. See Hosting. |
| MCP_HOST | 0.0.0.0 | Bind address when MCP_TRANSPORT=streamable-http. |
| MCP_PORT | 8000 | Port when MCP_TRANSPORT=streamable-http. |
| MCP_BASE_URL | (unset) | Public base URL of the hosted server. Required for Google OAuth. |
| GOOGLE_CLIENT_ID | (unset) | Google OAuth client ID. OAuth activates when GOOGLE_CLIENT_ID, GOOGLE_CLIENT_SECRET, and MCP_BASE_URL are all set. |
| GOOGLE_CLIENT_SECRET | (unset) | Google OAuth client secret. |
| ALLOWED_EMAILS | (unset) | Comma-separated emails allowed to call tools. Requires OAuth. |
| ALLOWED_EMAIL_DOMAINS | (unset) | Comma-separated email domains allowed to call tools. Requires OAuth. |
Tools that compute insights server-side and return compact summaries. These go beyond raw data access — they aggregate, detect patterns, and compute statistics so the LLM can reason over results without hitting context window limits.
| Tool | Description | Key Parameters |
|---|---|---|
| aggregate_by_group | Aggregate trace metrics by user group. Returns per-group: trace count, unique sessions, unique users, accuracy rate, average latency, total cost. | group_by (name/userId/domain/tag), time_range, top_n, exclude_internal |
| compute_accuracy | Compute accuracy from feedback scores. Accuracy = correct / (correct + incorrect). Supports grouping and time bucketing for trend analysis. | group_by, bucket_by (week/day), score_name, time_range |
| detect_failures | Detect LLM output quality failures using pattern matching ("unable to", "I can't", etc.) and negative feedback scores. NOT Python exceptions; use find_exceptions for those. | group_by, include_examples, max_examples, time_range |
| compute_token_percentiles | Compute token usage percentiles (TP50/TP90/TP95/TP99) at trace level. Fetches generation observations for accurate per-trace token counts. | group_by, percentiles, time_range |
| detect_context_breaches | Scan for traces where any single generation exceeds a token threshold. Catches context window overflow that causes degraded LLM performance or silent truncation. | threshold (default 256000), check_per_generation, time_range |
| analyze_sessions | Analyze multi-turn session behavior. Returns session count, depth distribution (single vs multi-turn), engagement metrics, and session-level cost/latency. | group_by, time_range |
| estimate_costs | Compute cost breakdown using Langfuse's built-in totalCost field (model-aware, computed by Langfuse). Groups by user, agent, or time bucket. | group_by, bucket_by (week/day), time_range |
| analyze_latency | Analyze latency distribution at trace level and optionally per LLM generation. Identifies which model is the bottleneck. | group_by, percentiles, include_per_generation, time_range |
| score_traces | Write scores back to Langfuse. Use after analysis to annotate traces with findings, e.g. tag failures for review or mark high-quality traces for dataset creation. | trace_ids, score_name, score_value, comment |
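The accuracy formula used by compute_accuracy, combined with day/week bucketing, looks roughly like this. It is a sketch only: it assumes each feedback score has already been reduced to a (timestamp, "correct" or "incorrect") pair and that weeks start on Monday. How the real tool maps Langfuse score values to correct/incorrect is not specified here, and bucket_start and accuracy_by_bucket are hypothetical names.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from typing import Dict, Iterable, Tuple


def bucket_start(ts: datetime, bucket_by: str) -> str:
    """Label a timestamp with its day, or with the Monday that starts its week."""
    day = ts.date()
    if bucket_by == "day":
        return day.isoformat()
    if bucket_by == "week":
        return (day - timedelta(days=day.weekday())).isoformat()
    raise ValueError(f"unsupported bucket_by: {bucket_by!r}")


def accuracy_by_bucket(
    scores: Iterable[Tuple[datetime, str]], bucket_by: str = "week"
) -> Dict[str, float]:
    """Accuracy = correct / (correct + incorrect) per bucket; other labels are ignored."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for ts, label in scores:
        if label not in ("correct", "incorrect"):
            continue  # e.g. neutral or unrelated score values
        bucket = bucket_start(ts, bucket_by)
        total[bucket] += 1
        if label == "correct":
            correct[bucket] += 1
    # Every bucket in `total` has at least one labeled score, so no division by zero.
    return {b: correct[b] / total[b] for b in sorted(total)}
```

Excluding unlabeled scores from the denominator is what makes the ratio "accuracy among judged traces" rather than "share of all traces judged correct".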
Full Langfuse API coverage for querying and managing your observability data.
| Tool | Description |
|---|---|
| fetch_traces | List traces with filters (user ID, name, tags, time range, ordering). Returns paginated results. |
| fetch_trace | Get a single trace by ID with full details, including all observations (spans, generations, events). |
| diff_traces | Compare two traces side by side (name, user, latency, cost, tags, release, version). |
| Tool | Description |
|---|---|
| fetch_observations | List observations with filters (trace ID, type GENERATION/SPAN/EVENT, name, time range). |
| fetch_observation | Get a single observation by ID. Returns input/output, token usage, model, latency, and cost. |
| Tool | Description |
|---|---|
| fetch_sessions | List sessions with optional time filters. |
| get_session_details | Get full details of a session including all its traces. |
| get_user_sessions | Get sessions for a specific user. Fetches the user's traces and extracts unique sessions. |
| Tool | Description |
|---|---|
| find_exceptions | Find observations with error status. For LLM output quality issues, use detect_failures instead. |
| get_exception_details | Get full error details for a trace: returns all observations, with error status highlighted. |
| get_error_count | Get total error count within a time period. |
| Tool | Description |
|---|---|
| fetch_scores | List scores/evaluations with filters (trace ID, score name, time range). |
| list_scores_v2 | v2 Scores API with richer filters (session ID, dataset run ID, queue ID, config ID, operator/value, etc.). |
| get_score_v2 | Get a single score by ID via the v2 Scores API. |
| Tool | Description |
|---|---|
| list_prompts | List all prompts in the project with optional name filter. |
| get_prompt | Fetch a specific prompt by name, version, or label. |
| get_prompt_unresolved | Fetch a prompt with placeholders/dependencies intact (debugging prompt composition). |
| create_text_prompt | Create a new text prompt version with optional labels and model config. |
| create_chat_prompt | Create a new chat prompt version with message array and optional config. |
| update_prompt_labels | Update labels for a specific prompt version (e.g., promote to "production"). |
| Tool | Description |
|---|---|
| list_datasets | List all datasets in the project. |
| get_dataset | Get metadata for a specific dataset. |
| list_dataset_items | List items in a dataset with pagination. |
| get_dataset_item | Get a single dataset item by ID. |
| create_dataset | Create a new dataset with optional description and metadata. |
| create_dataset_item | Create or upsert a dataset item. Supports linking to source traces. |
| delete_dataset_item | Delete a dataset item by ID. |
| Tool | Description |
|---|---|
| list_annotation_queues | List all annotation queues in the project. |
| create_annotation_queue | Create a new annotation queue with attached score configs. |
| get_annotation_queue | Get a queue by ID. |
| list_annotation_queue_items | List items in a queue (optionally filtered by status). |
| get_annotation_queue_item | Get a queue item by ID. |
| create_annotation_queue_item | Add a trace or observation to a queue for review. |
| update_annotation_queue_item | Change a queue item's status (PENDING / COMPLETED). |
| delete_annotation_queue_item | Remove an item from a queue. |
| create_annotation_queue_assignment | Assign a reviewer to a queue. |
| delete_annotation_queue_assignment | Remove a reviewer from a queue. |
| Tool | Description |
|---|---|
| get_daily_metrics | Langfuse's pre-aggregated daily rollup (trace count, cost, tokens per day). Faster than per-trace aggregation for long windows. |
| Tool | Description |
|---|---|
| list_users | Top users by trace count over a time window (defaults to last 30 days). Wraps the Langfuse /metrics query API. |
| Tool | Description |
|---|---|
| list_comments | List comments attached to traces/observations/sessions/prompts, with filters. |
| get_comment | Get a single comment by ID. |
| create_comment | Create a markdown comment on a trace/observation/session/prompt. |
| Tool | Description |
|---|---|
| list_models | List model definitions in Langfuse's models registry (pricing + tokenizer config). |
| get_model | Get a single model definition by ID. |
| Tool | Description |
|---|---|
| list_projects | Discovery: returns the list of configured Langfuse projects and the default project. |
| Tool | Description |
|---|---|
| get_data_schema | Get the data schema for the Langfuse project: available fields and types for traces, observations, scores, sessions. |
Once connected, ask your AI assistant questions like these:
The group_by parameter controls how traces are segmented in analytics tools:
| Value | What it groups by | When to use |
|---|---|---|
| name | Trace/agent name (default) | Compare performance across different agents or pipelines |
| userId | Per-user breakdown | Identify users with issues or high usage |
| domain | Email domain extracted from userId | Multi-tenant apps where users have email-based IDs (e.g., user@acme.com → acme.com) |
| tag | Trace tags | Compare across tagged environments, versions, or experiments |
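A sketch of how a trace might be mapped to group keys for each group_by value, including the domain extraction and the LANGFUSE_INTERNAL_DOMAINS exclusion. The function names, the placeholder labels such as "(no domain)", and counting a multi-tagged trace once per tag are assumptions for illustration, not documented behavior.

```python
from typing import AbstractSet, Any, FrozenSet, List, Mapping


def parse_domains(raw: str) -> FrozenSet[str]:
    """Parse a comma-separated domain list such as LANGFUSE_INTERNAL_DOMAINS."""
    return frozenset(d.strip().lower() for d in raw.split(",") if d.strip())


def group_keys(
    trace: Mapping[str, Any],
    group_by: str,
    internal_domains: AbstractSet[str] = frozenset(),
) -> List[str]:
    """Return the group(s) a trace counts toward; an empty list means 'exclude'."""
    if group_by == "name":
        return [trace.get("name") or "(unnamed)"]
    if group_by == "userId":
        return [trace.get("userId") or "(anonymous)"]
    if group_by == "tag":
        # A trace with several tags contributes to each tag's group.
        return list(trace.get("tags") or ["(untagged)"])
    if group_by == "domain":
        user_id = str(trace.get("userId") or "").lower()
        _, at, domain = user_id.rpartition("@")
        if not at:
            return ["(no domain)"]  # userId is not an email address
        # Drop internal traffic so it does not skew tenant-level analytics.
        return [] if domain in internal_domains else [domain]
    raise ValueError(f"unsupported group_by: {group_by!r}")
```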
Load only the tool groups you need to reduce token overhead:
```bash
# Only load traces and analytics tools
LANGFUSE_TOOLS=traces,analytics langfuse-mcp-server

# Only load prompts and datasets
LANGFUSE_TOOLS=prompts,datasets langfuse-mcp-server

# In Claude Code
claude mcp add \
  -e LANGFUSE_PUBLIC_KEY=pk-lf-... \
  -e LANGFUSE_SECRET_KEY=sk-lf-... \
  -e LANGFUSE_TOOLS=traces,observations,analytics \
  langfuse-mcp -- uvx langfuse-mcp-server
```
Available groups:
| Group | Tools | Count |
|---|---|---|
| traces | fetch_traces, fetch_trace | 2 |
| observations | fetch_observations, fetch_observation | 2 |
| sessions | fetch_sessions, get_session_details, get_user_sessions | 3 |
| errors | find_exceptions, get_exception_details, get_error_count | 3 |
| scores | fetch_scores | 1 |
| prompts | list_prompts, get_prompt, create_text_prompt, create_chat_prompt, update_prompt_labels | 5 |
| datasets | list_datasets, get_dataset, list_dataset_items, get_dataset_item, create_dataset, create_dataset_item, delete_dataset_item | 7 |
| annotation_queues | All 10 annotation queue tools | 10 |
| metrics | get_daily_metrics | 1 |
| users | list_users | 1 |
| comments | list_comments, get_comment, create_comment | 3 |
| models | list_models, get_model | 2 |
| projects | list_projects | 1 |
| schema | get_data_schema | 1 |
| analytics | All 9 analytics tools | 9 |
If LANGFUSE_TOOLS is not set, all 56 tools are loaded.
Disable write operations (score_traces, create_dataset, create_dataset_item, delete_dataset_item, create_text_prompt, create_chat_prompt):
```bash
LANGFUSE_MCP_READ_ONLY=true
```
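Selective loading and read-only mode compose naturally as set operations. The sketch below is hypothetical: TOOL_GROUPS holds only three of the groups from the tool-group table, WRITE_TOOLS is the list given above for LANGFUSE_MCP_READ_ONLY, and the accepted truthy spellings ("1", "true", "yes") are an assumption.

```python
from typing import Dict, List, Mapping, Set

# Hypothetical registry covering three groups from the tool-group table; the real one has more.
TOOL_GROUPS: Dict[str, List[str]] = {
    "traces": ["fetch_traces", "fetch_trace"],
    "prompts": ["list_prompts", "get_prompt", "create_text_prompt",
                "create_chat_prompt", "update_prompt_labels"],
    "datasets": ["list_datasets", "get_dataset", "list_dataset_items", "get_dataset_item",
                 "create_dataset", "create_dataset_item", "delete_dataset_item"],
}
# Write tools disabled by LANGFUSE_MCP_READ_ONLY, as listed in this README.
WRITE_TOOLS = {"score_traces", "create_dataset", "create_dataset_item",
               "delete_dataset_item", "create_text_prompt", "create_chat_prompt"}


def enabled_tools(env: Mapping[str, str]) -> Set[str]:
    """Resolve LANGFUSE_TOOLS and LANGFUSE_MCP_READ_ONLY into the set of tools to register."""
    raw = env.get("LANGFUSE_TOOLS", "").strip()
    if raw:
        selected = {g.strip() for g in raw.split(",") if g.strip()}
        unknown = selected - set(TOOL_GROUPS)
        if unknown:
            raise ValueError(f"unknown tool groups: {sorted(unknown)}")
    else:
        selected = set(TOOL_GROUPS)  # unset: load every group
    tools = {tool for group in selected for tool in TOOL_GROUPS[group]}
    if env.get("LANGFUSE_MCP_READ_ONLY", "false").strip().lower() in {"1", "true", "yes"}:
        tools -= WRITE_TOOLS
    return tools
```

Rejecting unknown group names up front turns a typo in LANGFUSE_TOOLS into a startup error instead of a silently empty tool list.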
| Capability | This server | Official Langfuse MCP |
|---|---|---|
| Traces & Observations | Yes | No |
| Sessions & Users | Yes | No |
| Exception Tracking | Yes | No |
| Prompt Management | Yes | Yes |
| Dataset Management | Yes | No |
| Score Write-back | Yes | No |
| Selective Tool Loading | Yes | No |
| Accuracy Metrics | Yes | No |
| Failure Detection | Yes | No |
| Token Percentiles | Yes | No |
| Cost Breakdown | Yes | No |
| Latency Analysis | Yes | No |
| Session Analytics | Yes | No |
| Context Breach Scanning | Yes | No |
| User Group Aggregation | Yes | No |
The official Langfuse MCP (5 tools) focuses on prompt management. This server provides full observability coverage plus 9 analytics tools.
| Capability | This server | Others |
|---|---|---|
| Data access (traces, observations, sessions) | Yes | Yes |
| Prompt & dataset management | Yes | Yes |
| Exception tracking | Yes | Yes |
| Annotation queues | Yes | Partial |
| Selective tool loading | Yes | Yes |
| Multi-project support | Yes | No |
| Accuracy metrics | Yes | No |
| LLM failure detection | Yes | No |
| Token percentiles (TP50/TP90/TP95/TP99) | Yes | No |
| Cost breakdown by group/time | Yes | No |
| Latency analysis with per-model breakdown | Yes | No |
| Multi-turn session analytics | Yes | No |
| Context window breach scanning | Yes | No |
| User/tenant group aggregation | Yes | No |
| Score write-back | Yes | No |
Other implementations provide data access (fetching raw traces, observations, sessions) using synchronous HTTP clients. This server adds a compute layer (analytics tools that aggregate, detect patterns, and compute statistics server-side) plus an async architecture that issues API requests concurrently instead of one at a time.
| Architecture | This server | Others |
|---|---|---|
| Async HTTP client | Yes (httpx.AsyncClient) | No (sync requests/httpx) |
| Concurrent observation fetching | Yes (asyncio.gather) | No (sequential per-trace) |
| TTL caching | Yes (live 5min, historical 1hr) | No |
| Adaptive rate limiting | Yes (token bucket, 429 backoff) | No (fixed sleep) |
| Batch observation queries | Yes (with auto-fallback) | No (N+1 per-trace) |
| Claude Code sub-agent | Yes (.claude/agents/) | No |
| Capability | This server | Platform AI assistants |
|---|---|---|
| Open source | Yes | No |
| Works with any MCP client | Yes | Platform-locked |
| Self-hosted Langfuse support | Yes | N/A |
| Real-time conversational | Yes | Varies (some batch-only) |
| Custom grouping/segmentation | Yes | Limited |
| Write-back to Langfuse | Yes | Platform-specific |
| Free | Yes | Paid tiers |
The Langfuse Python SDK is excellent for writing traces (it batches and sends asynchronously in the background). But for reading traces at scale — which is what an analytics MCP server does — the SDK has a limitation: its read API is synchronous, built on the requests library.
This server uses httpx.AsyncClient instead, which enables:

- Concurrent fetching: per-trace observation requests run in parallel with asyncio.gather, not one-by-one.
- Non-blocking rate limiting: asyncio.Semaphore + token bucket controls throughput without time.sleep() blocking.

Measured impact: analyze_latency with per-generation breakdown dropped from 110s to 20s (5.4x faster) on a self-hosted instance with 2.4M daily observations.
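The concurrency pattern described above (asyncio.gather bounded by asyncio.Semaphore) can be sketched with the standard library alone. Here fetch is any async callable; in the server it would presumably wrap an httpx.AsyncClient request, which this sketch does not show. gather_limited is a hypothetical name.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Sequence, TypeVar

T = TypeVar("T")


async def gather_limited(
    trace_ids: Sequence[str],
    fetch: Callable[[str], Awaitable[T]],
    max_concurrency: int = 10,
) -> Dict[str, T]:
    """Run fetch(trace_id) for every ID concurrently, never more than max_concurrency at once."""
    sem = asyncio.Semaphore(max_concurrency)

    async def one(trace_id: str) -> T:
        async with sem:  # waits without blocking the event loop when the limit is reached
            return await fetch(trace_id)

    # gather preserves input order, so results line up with trace_ids.
    results = await asyncio.gather(*(one(t) for t in trace_ids))
    return dict(zip(trace_ids, results))
```

Inside an async handler this would be used as `observations = await gather_limited(ids, fetch_one)`; the semaphore caps in-flight requests so concurrency does not turn into a self-inflicted burst against the API.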
Two-tier in-memory TTL cache using cachetools.TTLCache:
| Data age | TTL | Rationale |
|---|---|---|
| Today's data | 5 minutes | Still changing, short cache |
| Historical data (before today) | 1 hour | Won't change, cache aggressively |
The cache operates at the API page level. If you call aggregate_by_group then compute_accuracy for the same time range, the second call hits cache for all trace pages — only scores are fetched fresh.
Configure via LANGFUSE_CACHE_TTL and LANGFUSE_CACHE_TTL_HISTORICAL (seconds).
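A stdlib-only sketch of the two-tier policy, standing in for cachetools.TTLCache (which takes a single ttl per cache instance, so two tiers would need two caches). TwoTierPageCache is a hypothetical name; the 300 s and 3600 s defaults mirror the table above, and the injectable clock exists only to make expiry testable.

```python
import time
from datetime import date
from typing import Any, Callable, Dict, Hashable, Optional, Tuple


class TwoTierPageCache:
    """Page cache whose entry lifetime depends on whether the page's data is from today."""

    def __init__(self, live_ttl: float = 300.0, historical_ttl: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.live_ttl = live_ttl              # today's data: still changing
        self.historical_ttl = historical_ttl  # before today: will not change
        self.clock = clock
        self._store: Dict[Hashable, Tuple[float, Any]] = {}

    def put(self, key: Hashable, value: Any, range_end: date, today: date) -> None:
        # Pages whose time range ends before today are treated as historical.
        ttl = self.historical_ttl if range_end < today else self.live_ttl
        self._store[key] = (self.clock() + ttl, value)

    def get(self, key: Hashable) -> Optional[Any]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._store[key]  # lazily evict expired pages on read
            return None
        return value
```

Keying entries by something like (endpoint, query params, page number) is what lets a second analytics tool over the same time range reuse the trace pages fetched by the first.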
A global token bucket rate limiter respects Langfuse API limits:
| Instance type | Default RPM | Behavior |
|---|---|---|
| Self-hosted | Unlimited (0) | No artificial throttling. Full speed, limited only by your server. |
| Langfuse Cloud (Hobby) | 30 req/min | Conservative default for Hobby tier |
| Langfuse Cloud (Pro/Team) | Set LANGFUSE_RATE_LIMIT_RPM=1000 | Higher throughput for paid plans |
On HTTP 429 responses, the limiter automatically halves the RPM and reads the Retry-After header. This means the server adapts to any rate limit — cloud or self-hosted — without manual configuration.
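A minimal asyncio token bucket with the 429 behavior described above: rpm=0 disables throttling, on_429 halves the rate (floored at 1 RPM here, which is an assumption) and pauses all callers until Retry-After has elapsed. AdaptiveTokenBucket is a hypothetical class, not the server's implementation; parsing the Retry-After header into seconds is left to the caller.

```python
import asyncio
import time
from typing import Optional


class AdaptiveTokenBucket:
    """Token bucket allowing `rpm` requests per minute; rpm=0 means unlimited."""

    def __init__(self, rpm: float) -> None:
        self.rpm = rpm
        self.tokens = rpm  # start full, allowing a burst of up to rpm requests
        self.updated = time.monotonic()
        self.blocked_until = 0.0

    def _refill(self) -> None:
        now = time.monotonic()
        self.tokens = min(self.rpm, self.tokens + (now - self.updated) * self.rpm / 60.0)
        self.updated = now

    async def acquire(self) -> None:
        while True:
            now = time.monotonic()
            if now < self.blocked_until:  # honor a pending Retry-After pause
                await asyncio.sleep(self.blocked_until - now)
                continue
            if self.rpm <= 0:
                return  # self-hosted default: no artificial throttling
            self._refill()
            if self.tokens >= 1:
                self.tokens -= 1
                return
            # Yield to the event loop until roughly one token has accrued.
            await asyncio.sleep((1 - self.tokens) * 60.0 / self.rpm)

    def on_429(self, retry_after: Optional[float] = None) -> None:
        """Halve the rate and pause callers for Retry-After seconds if given."""
        if self.rpm > 0:
            self.rpm = max(1.0, self.rpm / 2)
            self.tokens = min(self.tokens, self.rpm)
        if retry_after is not None:
            self.blocked_until = max(self.blocked_until, time.monotonic() + retry_after)
```

Because there is no await between the token check and the decrement, concurrent coroutines on one event loop cannot both spend the same token.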
Analytics tools that need per-generation data (token percentiles, context breaches, latency breakdown) face the N+1 problem: one API call per trace to fetch its observations.
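This server uses a two-step strategy:

1. Batch: fetch observations for the requested time range in bulk, which is fastest for small projects.
2. Fallback: for large deployments, fetch observations only for the traces being analyzed, per trace, using asyncio.gather with semaphore-controlled concurrency.

This means the server handles both small projects (batch is faster) and large-scale deployments (concurrent targeted fetching avoids downloading millions of irrelevant observations).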
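The two-step strategy can be sketched as a single coroutine. The decision rule is an assumption: the README does not say how the server chooses between the steps (or what triggers its auto-fallback), so this sketch uses a hypothetical row-count probe (count_batch) against a hypothetical max_batch_rows threshold. All three fetch callables are placeholders for real API calls.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List, Sequence, Tuple

Obs = Dict[str, Any]  # one observation record, e.g. {"traceId": ..., ...}


async def fetch_generations(
    trace_ids: Sequence[str],
    count_batch: Callable[[], Awaitable[int]],
    fetch_batch: Callable[[], Awaitable[List[Obs]]],
    fetch_for_trace: Callable[[str], Awaitable[List[Obs]]],
    max_batch_rows: int = 50_000,  # hypothetical cut-over point
    max_concurrency: int = 10,
) -> Dict[str, List[Obs]]:
    """Return observations grouped by trace ID, choosing batch or per-trace fetching."""
    by_trace: Dict[str, List[Obs]] = {t: [] for t in trace_ids}

    # Step 1: if the window is small enough, one batch query beats N per-trace calls.
    if await count_batch() <= max_batch_rows:
        for obs in await fetch_batch():
            if obs.get("traceId") in by_trace:  # ignore traces we were not asked about
                by_trace[obs["traceId"]].append(obs)
        return by_trace

    # Step 2: too large to download wholesale; fetch only the wanted traces concurrently.
    sem = asyncio.Semaphore(max_concurrency)

    async def one(trace_id: str) -> Tuple[str, List[Obs]]:
        async with sem:
            return trace_id, await fetch_for_trace(trace_id)

    for trace_id, observations in await asyncio.gather(*(one(t) for t in trace_ids)):
        by_trace[trace_id] = observations
    return by_trace
```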
The server ships with a Claude Code custom agent at .claude/agents/langfuse-analyst.md. When a user asks a Langfuse-related question, Claude Code can delegate to this agent, which:
This prevents 33 tool schemas (~5000 tokens) from polluting every conversation.
See CONTRIBUTING.md for development setup, code style guidelines, and areas for contribution.
See SECURITY.md for the security policy, vulnerability reporting, and API key handling.
See CODE_OF_CONDUCT.md.
MIT — see LICENSE.