loading…
Search for a command to run...
loading…
Exposes queryable GPU inference benchmark data (quantization, throughput, VRAM, concurrent users) as tools for LLM clients.
Exposes queryable GPU inference benchmark data (quantization, throughput, VRAM, concurrent users) as tools for LLM clients.
An MCP server that exposes Poor Paul's Benchmark GPU inference data — quantization × throughput × VRAM × concurrent users — as queryable tools to any LLM client.
Hosted instance: https://mcp.poorpaul.dev/ (streamable-http transport, no auth)
Connect any MCP-aware client (Claude Desktop, Cline, Continue, etc.) to ask questions like:
It exposes thirteen tools backed by 39,000+ real benchmark rows:
| Tool | What it does |
|---|---|
list_tested_configs |
Lists every tested GPU, model, and quantization (call this first) |
query_ppb_results |
Filters raw benchmark rows by GPU / VRAM / model / quant / users / backend / date |
recommend_quantization |
Three-tier empirical-first recommendation engine (high / medium / low confidence) |
recommend_hardware |
Budget-aware GPU recommendation ranked by speed, efficiency, or value-for-money |
explain_result |
Contextual explanation of a result: VRAM pressure, PCIe context, percentile rank |
get_gpu_headroom |
Sanity-checks a (gpu, model, quant, users) configuration for VRAM headroom |
compare_quants_quantitative |
Side-by-side throughput comparison across quantizations for a model + GPU |
get_combined_scores |
Quantitative + qualitative metrics in one call for a (gpu, model, quant) config |
rank_by_priority |
Rank quantizations by speed, efficiency (tok/W), or a balanced composite score |
| Tool | What it does |
|---|---|
get_qualitative_summary |
All available qualitative scores (context-rot, tool accuracy, quality, MT-Bench) |
query_qualitative_results |
Filter qualitative rows by phase, model, quant, GPU, or minimum score thresholds |
get_context_rot_breakdown |
Long-context recall scores by length, depth, and needle type |
get_tool_accuracy_breakdown |
Tool-call accuracy: selection, parameters, hallucination rate, parse success |
compare_quants_qualitative |
Side-by-side qualitative comparison across quantizations with deterministic insight |
Benchmark rows are mirrored into a local SQLite cache (./ppb_cache.db by
default; override with PPB_DB_PATH). On startup the server loads from
SQLite and only contacts HuggingFace when the dataset's git commit SHA
has changed — making subsequent restarts fast and offline-friendly.
Add to your MCP client config (Claude Desktop example, ~/Library/Application Support/Claude/claude_desktop_config.json):
{
"mcpServers": {
"ppb": {
"transport": { "type": "http", "url": "https://mcp.poorpaul.dev/mcp" }
}
}
}
pip install and run locally (stdio)pip install ppb-mcp
MCP_TRANSPORT=stdio ppb-mcp
Claude Desktop config:
{
"mcpServers": {
"ppb": {
"command": "ppb-mcp",
"env": { "MCP_TRANSPORT": "stdio" }
}
}
}
docker run --rm -p 9933:9933 \
-e MCP_TRANSPORT=streamable-http \
-v ppb-hf-cache:/data/huggingface \
ghcr.io/paulplee/ppb-mcp:latest
git clone https://github.com/paulplee/ppb-mcp
cd ppb-mcp
pip install -e ".[dev]"
ppb-mcp # streamable-http on :9933
All clients use the same hosted endpoint: https://mcp.poorpaul.dev/mcp
Edit ~/Library/Application Support/Claude/claude_desktop_config.json (macOS)
or %APPDATA%\Claude\claude_desktop_config.json (Windows):
{
"mcpServers": {
"ppb": {
"transport": { "type": "http", "url": "https://mcp.poorpaul.dev/mcp" }
}
}
}
Restart Claude Desktop after saving.
Edit ~/.cursor/mcp.json (create if it doesn't exist):
{
"mcpServers": {
"ppb": {
"url": "https://mcp.poorpaul.dev/mcp",
"type": "http"
}
}
}
Or via UI: Settings → Tools & Integrations → MCP → Add Server.
Edit ~/.codeium/windsurf/mcp_config.json:
{
"mcpServers": {
"ppb": {
"serverUrl": "https://mcp.poorpaul.dev/mcp",
"transport": "http"
}
}
}
Add to your .vscode/mcp.json (workspace) or User settings.json:
{
"mcp": {
"servers": {
"ppb": {
"type": "http",
"url": "https://mcp.poorpaul.dev/mcp"
}
}
}
}
Add to ~/.config/zed/settings.json under "context_servers":
{
"context_servers": {
"ppb": {
"command": {
"path": "env",
"args": ["MCP_TRANSPORT=stdio", "uvx", "ppb-mcp"]
}
}
}
}
Open the Cline panel → MCP Servers tab → Add Server → select SSE/HTTP → paste https://mcp.poorpaul.dev/mcp.
Add to ~/.continue/config.yaml:
mcpServers:
- name: ppb
transport:
type: http
url: https://mcp.poorpaul.dev/mcp
Add to ~/.config/opencode/config.json:
{
"mcp": {
"ppb": {
"type": "remote",
"url": "https://mcp.poorpaul.dev/mcp"
}
}
}
goose mcp add ppb --transport http --url https://mcp.poorpaul.dev/mcp
# Zero-install (requires uv):
env MCP_TRANSPORT=stdio uvx ppb-mcp
# After pip install:
env MCP_TRANSPORT=stdio ppb-mcp
Note on transport key names: MCP clients are not yet fully standardised on JSON key names for the HTTP transport. If your client doesn't connect with
"type": "http", try"transport": "http","type": "sse", or"transport": "streamable-http". The endpoint URL is the same regardless.
> list_tested_configs
{ "gpus": ["Apple M4 Pro", "NVIDIA GB10", "NVIDIA GeForce RTX 5090"],
"models": ["Qwen3.5-9B", ...], "quantizations": ["Q4_K_M", ...] }
> recommend_quantization(gpu_vram_gb=32, concurrent_users=8, model="Qwen3.5-9B", priority="balance")
{ "recommended_quantization": "Q5_K_M",
"estimated_vram_usage_gb": 27.8,
"estimated_tokens_per_second": 142.0,
"headroom_gb": 4.2,
"confidence": "high",
"reasoning": "Q5_K_M is recommended for your NVIDIA GeForce RTX 5090 (32 GB) ...",
"alternatives": ["Q4_K_M", "Q8_0"] }
> recommend_hardware(target_model="Qwen3.5-27B", target_quantization="Q4_K_M",
concurrent_users=4, budget_usd=1200, priority="value")
{ "top_picks": [
{ "gpu": "NVIDIA GeForce RTX 5090", "msrp_usd": 1999, "throughput_tok_s": 94.3,
"efficiency_tok_per_dollar": 0.047, "rank_reason": "best measured tok/$ in budget" },
...
],
"budget_usd": 1200 }
> explain_result(gpu_name="NVIDIA GeForce RTX 5090", model="Qwen3.5-9B",
quantization="Q4_K_M", concurrent_users=8, n_ctx=32768)
{ "throughput_tok_s": 142.0,
"vram_pressure": "medium",
"pcie_context": "PCIe Gen 5 x16 — full bandwidth, no bottleneck expected",
"percentile_rank": 0.91,
"insight": "Top 9% throughput among all Qwen3.5-9B Q4_K_M configurations measured." }
| Env var | Default | Notes |
|---|---|---|
HF_DATASET |
paulplee/ppb-results |
HuggingFace dataset ID |
REFRESH_INTERVAL_HOURS |
1 |
Background refresh cadence |
MCP_TRANSPORT |
streamable-http |
stdio or streamable-http |
HOST |
0.0.0.0 |
HTTP bind host |
PORT |
9933 |
HTTP bind port |
LOG_LEVEL |
INFO |
Python logging level |
git clone https://github.com/paulplee/ppb-mcp /tmp/ppb-mcp
cd /tmp/ppb-mcp
DOMAIN=mcp.example.com [email protected] ./deploy/deploy.sh
This installs Docker, builds the image, registers a systemd unit, configures nginx, and runs certbot.
deploy.sh (systemd manages the container)cd /opt/ppb-mcp
sudo git pull --ff-only
sudo systemctl restart ppb-mcp
The systemd unit runs docker compose pull before starting, so the new image is fetched automatically. Check that it came up cleanly with:
sudo systemctl status ppb-mcp
# or follow live logs:
journalctl -u ppb-mcp -f
cd ~/ppb-mcp # or wherever you cloned the repo
git pull --ff-only
docker compose build
docker compose up -d
# verify:
docker compose logs -f ppb-mcp
pip install -e ".[dev]"
ruff check src tests
pytest -v
Integration tests against the live HuggingFace dataset are gated behind PPB_RUN_INTEGRATION=1 to keep CI offline-clean.
(model, quant) benchmarked on a different GPU at the same concurrency; throughput borrowed, VRAM scaled to your card.vram_per_user ≈ (params_B × bits_per_weight / 8) × 1.15; viable iff total ≤ 90 % of your VRAM.MIT — see LICENSE.
Выполни в терминале:
claude mcp add ppb-mcp -- npx CSA PROJECT - FZCO © 2026 IFZA Business Park, DDP, Premises Number 31174 - 001
Безопасность
Низкий рискАвтоматическая эвристика по публичным данным — не гарантия безопасности.