AI evaluation toolkit that measures inter-rater agreement (Fleiss' κ, Kendall's W) across multiple LLM providers. Evaluate prompt reliability, detect contested outputs, and track consensus trends over time.
One command. Find out if your AI agrees with itself.
ConKurrence is a statistically validated consensus measurement toolkit for AI evaluation pipelines. It uses multiple AI models as independent raters, measures inter-rater reliability with Fleiss' kappa and bootstrap confidence intervals, and routes contested items to human experts.
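For reference, Fleiss' kappa compares the agreement actually observed among a fixed number of raters with the agreement expected by chance: it is 1 for perfect agreement and roughly 0 when raters agree no more often than chance would predict. Below is a minimal illustrative sketch of the statistic itself (a textbook formulation, not ConKurrence's internal implementation):

```ts
/**
 * Fleiss' kappa for N items, each rated by n raters into k categories.
 * counts[i][j] = number of raters that assigned item i to category j.
 * Illustrative sketch only; not ConKurrence's own code.
 */
function fleissKappa(counts: number[][]): number {
  const N = counts.length;                          // items
  const n = counts[0].reduce((a, b) => a + b, 0);   // raters per item (assumed constant)
  const k = counts[0].length;                       // categories

  let pBar = 0;                       // mean per-item agreement
  const pj = new Array(k).fill(0);    // per-category proportions
  for (const row of counts) {
    const sumSq = row.reduce((acc, nij) => acc + nij * nij, 0);
    pBar += (sumSq - n) / (n * (n - 1));
    row.forEach((nij, j) => { pj[j] += nij / (N * n); });
  }
  pBar /= N;

  const pe = pj.reduce((acc, p) => acc + p * p, 0); // chance agreement
  return (pBar - pe) / (1 - pe);
}

// Example: 4 items, 3 raters, 2 categories ("pass" / "fail")
console.log(fleissKappa([
  [3, 0],
  [2, 1],
  [0, 3],
  [1, 2],
])); // ≈ 0.33 (fair agreement; low values are what flag items as contested)
```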
```bash
npm install -g conkurrence
```
Use ConKurrence as an MCP server in Claude Desktop or any MCP-compatible client:
```bash
npx conkurrence mcp
```
Add the following to your claude_desktop_config.json and restart Claude Desktop:
```json
{
  "mcpServers": {
    "conkurrence": {
      "command": "npx",
      "args": ["-y", "conkurrence", "mcp"]
    }
  }
}
```
Or install as a Claude Code plugin:

```
/plugin marketplace add AlligatorC0der/conkurrence
```
| Tool | Description |
|---|---|
| `conkurrence_run` | Execute an evaluation across multiple AI raters |
| `conkurrence_report` | Generate a detailed markdown report |
| `conkurrence_compare` | Side-by-side comparison of two runs |
| `conkurrence_trend` | Track agreement over multiple runs |
| `conkurrence_suggest` | AI-powered schema suggestion from your data |
| `conkurrence_validate_schema` | Validate a schema before running |
| `conkurrence_estimate` | Estimate cost and token usage |
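Outside of Claude Desktop, any MCP-compatible client can drive the same tools programmatically over stdio. A sketch using the standard `@modelcontextprotocol/sdk` client (the client name, version, and any tool arguments below are placeholder assumptions; check each tool's input schema before calling it):

```ts
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

// Launch the ConKurrence MCP server as a child process speaking stdio.
const transport = new StdioClientTransport({
  command: "npx",
  args: ["-y", "conkurrence", "mcp"],
});

const client = new Client(
  { name: "example-client", version: "0.1.0" }, // placeholder client identity
  { capabilities: {} }
);
await client.connect(transport);

// Discover the tools listed in the table above.
const { tools } = await client.listTools();
console.log(tools.map((t) => t.name));

// Invoke a tool. The arguments object is hypothetical; pass whatever the
// tool's published input schema requires.
// const result = await client.callTool({ name: "conkurrence_estimate", arguments: { /* ... */ } });

await client.close();
```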
BUSL-1.1 — Business Source License 1.1