Human-evaluation infrastructure for AI quality. 25,000+ blind human reviews by 200+ verified reviewers across 58 AI models — query the data via five MCP tools (get_model_scores, compare_models, get_flags, check_content, get_latest).
Real human evaluations of AI models. 25,000+ blind reviews by 200+ verified reviewers across 58 models (GPT-5, Claude Opus 4.7, Gemini 3.1, Grok 4.3, DeepSeek V4, Mistral, Kimi K2.6 and more) and 44 benchmarks. Free. Python SDK + MCP server + ChatGPT GPT + REST.
Get human feedback on your AI in 3 lines of Python:

```python
from grandjury import GrandJury

gj = GrandJury()  # reads GRANDJURY_API_KEY from env
gj.trace(name="chat", input=prompt, output=response, model="gpt-4o")
```

Then open your Jupyter notebook:

```python
df = gj.results()  # traces with human votes, as a DataFrame
print(f"Pass rate: {df['pass_rate'].mean():.1%}")
```
Patent Pending.
Most AI evaluation pipelines use LLMs to judge LLMs. That inherits the same biases, conventions, and blind spots as the models being evaluated, and it tends to produce eval pipelines with ~0% disagreement: the diagnostic for "not measuring quality, just confirming assumptions" (essay).
HumanJudge uses real human reviewers who blind-evaluate AI outputs across structured benchmarks (marketing, healthcare, end-of-life conversations, cultural fluency, code review, and more) and write their reasoning. Reviewers earn XP, get credentialing letters, and stay anonymous to the reader by default.
The data is queryable via this SDK, the MCP server, a ChatGPT GPT action, and a REST API.
| Surface | Install | Docs |
|---|---|---|
| Python SDK | `pip install grandjury` | docs/pulse/python-sdk |
| Claude Desktop MCP | Add https://api.humanjudge.com/mcp as a custom connector | docs/pulse/claude-desktop |
| Claude Code MCP | Add to `.mcp.json` (remote, no install) | docs/pulse/claude-code |
| ChatGPT GPT | Search "HumanJudge" in the GPT Store | docs/pulse/chatgpt |
| REST API | n/a | humanjudge.com/docs |
HumanJudge connects your AI to a community of human reviewers who evaluate your model's outputs. GrandJury is the Python SDK: it sends traces and retrieves human evaluation results.

**Write path:** log AI calls from your app → traces appear in your developer dashboard.
**Read path:** fetch evaluation results (votes, pass rates, reviewer feedback) into DataFrames for analysis.
```bash
pip install grandjury
```

Optional performance dependencies:

```bash
pip install "grandjury[performance]"  # msgspec, pyarrow, polars
```
Go to humanjudge.com/projects/new, register your AI, and copy the secret key. Then:

```bash
export GRANDJURY_API_KEY=gj_sk_live_...
```
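Since the client reads the key from the environment, it can help to fail fast with a clear message before constructing it. A small optional helper (not part of the SDK, just a sketch):

```python
import os

def require_api_key() -> str:
    """Return GRANDJURY_API_KEY, or fail loudly if it is missing."""
    key = os.environ.get("GRANDJURY_API_KEY")
    if not key:
        raise RuntimeError(
            "GRANDJURY_API_KEY is not set; "
            "create a key at humanjudge.com/projects/new"
        )
    return key
```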
```python
from grandjury import GrandJury

gj = GrandJury()  # zero-config: reads GRANDJURY_API_KEY from env

# Option A: Direct call
gj.trace(name="chat", input="What is ML?", output="Machine learning is...", model="gpt-4o")

# Option B: Decorator (auto-captures input/output/latency)
@gj.observe(name="chat", model="gpt-4o")
def call_llm(prompt: str) -> str:
    return openai.chat(prompt)

# Option C: Context manager
with gj.span("chat", input=prompt) as s:
    response = call_llm(prompt)
    s.set_output(response)
```
Once reviewers vote on your traces:

```python
# Trace-level summary
df = gj.results()
# trace_id | input | output | model | pass_count | flag_count | total_votes | pass_rate

# Individual votes with reviewer identity
df_votes = gj.results(detail='votes')
# trace_id | voter_id | voter_name | verdict | flag_category | feedback | created_at

# Filter by benchmark
df_benchmark = gj.results(evaluation='marketing-benchmark')

# Export
df.to_parquet('evaluation_results.parquet')
```
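Because the results come back as an ordinary DataFrame, plain pandas is enough for triage. A minimal sketch, using a stand-in frame with the same columns as the trace-level summary (the 0.5 threshold is an arbitrary example):

```python
import pandas as pd

# Stand-in for gj.results(); the real call returns these columns too.
df = pd.DataFrame({
    "trace_id": ["t1", "t2", "t3"],
    "model": ["gpt-4o", "gpt-4o", "gpt-4o"],
    "pass_count": [9, 2, 7],
    "flag_count": [1, 8, 3],
    "total_votes": [10, 10, 10],
})
df["pass_rate"] = df["pass_count"] / df["total_votes"]

# Surface traces that most reviewers flagged, worst first.
failing = df[df["pass_rate"] < 0.5].sort_values("pass_rate")
print(failing[["trace_id", "pass_rate"]])
```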
Works on both live platform data and offline datasets:

```python
# Auto-fetch from platform
gj.analytics.vote_histogram()
gj.analytics.population_confidence(voter_list=[...])

# Or pass your own data
import pandas as pd

df = pd.read_csv("my_votes.csv")
gj.analytics.vote_histogram(df)
gj.analytics.votes_distribution(df)
```
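Since the offline path accepts plain DataFrames, the aggregates are easy to sanity-check by hand. A hand-rolled sketch of the per-inference count that `votes_distribution` reports (this is not the SDK's implementation; the column names mirror the vote schema above):

```python
import pandas as pd

# Stand-in for gj.results(detail='votes').
votes = pd.DataFrame({
    "trace_id": ["t1", "t1", "t1", "t2", "t2", "t3"],
    "verdict": ["pass", "pass", "flag", "pass", "flag", "pass"],
})

# Votes per inference: how much human attention each trace received.
per_trace = votes.groupby("trace_id").size()
print(per_trace.to_dict())  # {'t1': 3, 't2': 2, 't3': 1}
```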
List and enroll your model in open benchmarks programmatically:

```python
# Browse available benchmarks
benchmarks = gj.benchmarks.list()

# Enroll with endpoint config
gj.benchmarks.enroll(
    benchmark_id="...",
    model_id="...",
    endpoint_config={
        "endpoint": "https://api.myapp.com/v1/chat/completions",
        "apiKey": "sk-...",
        "request_template": '{"model":"gpt-4o","messages":[{"role":"user","content":"{{prompt}}"}]}',
        "response_path": "choices[0].message.content",
    },
)
```
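The `request_template` and `response_path` fields describe how the platform would call your endpoint: substitute `{{prompt}}` into the template, POST it, then walk the path into the JSON reply. A toy resolver makes the mechanics concrete (illustrative only, not the platform's code; it handles only simple `key` and `[index]` path segments):

```python
import json
import re

def render_request(template: str, prompt: str) -> dict:
    # Substitute the {{prompt}} placeholder, JSON-escaping the value.
    escaped = json.dumps(prompt)[1:-1]  # dumps adds quotes; strip them
    return json.loads(template.replace("{{prompt}}", escaped))

def extract(response: dict, path: str):
    # Walk a path like "choices[0].message.content".
    value = response
    for key, idx in re.findall(r"(\w+)|\[(\d+)\]", path):
        value = value[int(idx)] if idx else value[key]
    return value

body = render_request(
    '{"model":"gpt-4o","messages":[{"role":"user","content":"{{prompt}}"}]}',
    'Say "hi"',
)
reply = {"choices": [{"message": {"content": "hi"}}]}
print(extract(reply, "choices[0].message.content"))  # hi
```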
All analytics methods work on both platform data (gj.results(detail='votes')) and offline data (pandas/polars/CSV/parquet):
| Method | Description |
|---|---|
| `gj.analytics.evaluate_model()` | Decay-adjusted scoring |
| `gj.analytics.vote_histogram()` | Vote time distribution |
| `gj.analytics.vote_completeness()` | Completeness per voter |
| `gj.analytics.population_confidence()` | Confidence metrics |
| `gj.analytics.majority_good_votes()` | Threshold analysis |
| `gj.analytics.votes_distribution()` | Votes per inference |
Note: `gj.results()` only returns traces with at least one human vote (privacy gate).

```python
gj = GrandJury(
    api_key=None,  # reads GRANDJURY_API_KEY from env if not provided
    base_url="https://grandjury-server.onrender.com",
    timeout=5.0,
)
```
```python
# Write
gj.trace(name, input, output, model, latency_ms, metadata, gj_inference_id)
await gj.atrace(...)  # async version (requires httpx)
gj.observe(name, model, metadata)  # decorator
gj.span(name, input, model, metadata)  # context manager

# Read
gj.results(detail=None, evaluation=None)  # returns DataFrame or list[dict]

# Browse
gj.models.list()
gj.models.get(model_id)
gj.benchmarks.list()
gj.benchmarks.enroll(benchmark_id, model_id, endpoint_config)

# Analytics
gj.analytics.evaluate_model(...)
gj.analytics.vote_histogram(data=None, ...)
gj.analytics.vote_completeness(data=None, voter_list=None, ...)
gj.analytics.population_confidence(data=None, voter_list=None, ...)
gj.analytics.majority_good_votes(data=None, ...)
gj.analytics.votes_distribution(data=None, ...)
```
See CONTRIBUTING.md for development setup, testing, and PR guidelines.
See LICENSE.