loading…
Search for a command to run...
loading…
MCP server that gives AI coding agents direct access to evaluation tools.
Docs · Website · PyPI · multivon-eval (engine)
MCP server that gives AI coding agents direct access to evaluation tools. Drop into Claude Desktop, Claude Code, Cursor, Cline, or any Model Context Protocol–compatible agent.
When the agent is helping you build an LLM product, it can:
No copy-paste, no python -c "...", no asking the agent to figure out the SDK calls.
pip install multivon-mcp
Bare install pulls multivon-eval, pdfhell, and the MCP SDK. The provider SDKs (anthropic, openai, google-genai) come along too — bring your own API key in env.
Add to ~/Library/Application Support/Claude/claude_desktop_config.json (macOS) or %APPDATA%\Claude\claude_desktop_config.json (Windows):
{
"mcpServers": {
"multivon": {
"command": "multivon-mcp",
"env": {
"ANTHROPIC_API_KEY": "sk-ant-...",
"OPENAI_API_KEY": "sk-proj-...",
"GOOGLE_API_KEY": "AIza..."
}
}
}
}
Restart Claude. The 22 tools become available; ask Claude "use multivon to evaluate this RAG output" and it figures out which tool to call.
cursor.json or via Settings → MCP:
{ "mcpServers": { "multivon": { "command": "multivon-mcp" } } }
Same shape — point at the multivon-mcp console script.
mcp dev multivon_mcp.server
Opens the MCP Inspector UI in your browser. You can call any tool by name, see the JSON schemas, and watch the requests/responses.
| Tool | What it does | API key |
|---|---|---|
eval_discover |
Full machine-readable capability catalog (evaluators, traps, suites, calibration data, versions). Call first. | No |
pdfhell_make |
Generate one adversarial PDF + its answer key. | No |
pdfhell_run |
Run the pdfhell adversarial-PDF benchmark against a vision model. Returns pass rate, per-trap CIs, suite hash. | Yes (vision) |
eval_audit_pack |
Build a hash-chained, procurement-ready ZIP from a pdfhell run. | No |
| Tool | What it does | API key |
|---|---|---|
eval_faithfulness |
QAG-graded faithfulness — is a RAG output grounded in the retrieved context? | Yes |
eval_hallucination |
QAG-graded hallucination — does the output contain content NOT in context? | Yes |
eval_relevance |
QAG-graded answer-vs-question relevance. | Yes |
eval_answer_accuracy |
QAG-graded semantic equivalence vs ground truth. | Yes |
eval_context_precision |
RAG retrieval quality — are the retrieved chunks on-topic? | Yes |
eval_context_recall |
RAG retrieval completeness — does context contain enough info to answer? | Yes |
| Tool | What it does | API key |
|---|---|---|
eval_toxicity |
QAG-graded toxicity / harmful-content detection. | Yes |
eval_bias |
QAG-graded bias across gender, race, politics, age, socioeconomic axes. | Yes |
eval_pii_detection |
Local-only regex scan for PII (GDPR / CCPA / PIPEDA / HIPAA packs). | No |
eval_schema_compliance |
Validate an LLM output against a JSON Schema. | No |
| Tool | What it does | API key |
|---|---|---|
eval_tool_call_accuracy |
Deterministic agent tool-call correctness. No LLM. | No |
eval_vqa_faithfulness |
Image-grounded visual-QA faithfulness. | Yes (vision) |
eval_document_grounding |
Multi-page document-grounded faithfulness for document-AI agents. | Yes (vision) |
| Tool | What it does | API key |
|---|---|---|
eval_g_eval |
G-Eval holistic 0.0-1.0 scoring against a plain-English criterion. | Yes |
eval_custom_rubric |
Score against your own list of yes/no quality checks. | Yes |
| Tool | What it does | API key |
|---|---|---|
eval_compare_runs |
Diff two eval report JSONs — pass-rate delta, per-case regressions/improvements, McNemar p-value. Use after every fix to confirm it actually helped. | No |
eval_generate_cases |
Generate N eval cases (input / expected_output / context) from a chunk of source text. Eliminates the cold-start when building a new suite. | Yes (judge) |
eval_ingest_trace |
Convert a JSON agent trace (LangGraph / OpenAI Agents / manual) into an EvalCase payload. Use to score trajectories your agent just executed. | No |
User: I just shipped a RAG endpoint. Can you check it for hallucinations?
Claude: I'll use multivon to evaluate it.
[calls eval_discover to see what's available]
[calls eval_faithfulness with your input/context/output]
→ score: 0.667 (passed: False), threshold: 0.9
reason: 2/3 claims grounded
✓ "annual renewal" — supported by context
✓ "30-day notice" — supported by context
✗ "automatic upgrade" — NOT in context
Claude: Your RAG hallucinated the "automatic upgrade" detail. The context
doesn't mention upgrades. I'd add a Hallucination evaluator to your CI
gate, threshold ≥0.85, and re-prompt with explicit "only use facts
from context" instructions.
eval_discover returns the full 44-evaluator catalog, so the agent can always introspect everything. The 22 tools we expose directly are the ones agents actually call mid-edit:
The three new 0.3.0 tools matter because evals are most useful as a loop, not a single call: generate a starting suite from your own docs (eval_generate_cases), run your agent over it, score the trace (eval_ingest_trace → eval_*), make a fix, then verify the fix improved things vs. the baseline (eval_compare_runs). Agents need that whole loop callable from within a conversation — otherwise they fall back to ad-hoc judgment.
Exposing all 44 evaluators as MCP tools would bloat the agent's context window and overwhelm tool-selection. If you need an evaluator that's not directly exposed, the agent can still use multivon-eval as a library — eval_discover returns the import paths.
mcp[cli] >= 1.0 — official MCP Python SDK + the mcp dev inspectormultivon-eval >= 0.7.3 — the evaluator surface this wrapspdfhell >= 0.1.0 — the adversarial-PDF benchmark this wrapsAll Apache 2.0.
Five public + one early-access package, all built on a shared evaluation engine:
| Repo | What it is |
|---|---|
| multivon-eval | Python SDK — 44 evaluators + bootstrap CLI + multivon_eval.auto. The engine multivon-mcp wraps. |
| pdfhell | Adversarial PDFs that break AI document readers — exposed here as pdfhell_run + pdfhell_make tools |
| multivon-mcp (you are here) | MCP server — 22 tools from multivon-eval + pdfhell |
| eval-action | GitHub Action — runs the same evals on every PR |
| eval-framework-benchmark | Reproducible head-to-head benchmark vs DeepEval + RAGAS |
| multivon-guard (early access) | Local proxy that catches LLM coding agents leaking secrets / PII |
Apache 2.0.
@software{multivon_mcp,
title = {multivon-mcp: MCP server exposing multivon-eval + pdfhell as agent-callable tools},
author = {Multivon},
year = {2026},
url = {https://github.com/multivon-ai/multivon-mcp},
}
Run in your terminal:
claude mcp add multivon-mcp -- npx