1 MCP · 0 installs total
AI evaluation toolkit that measures inter-rater agreement (Fleiss' κ, Kendall's W) across multiple LLM providers. Evaluate prompt reliability, detect contested
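For readers unfamiliar with the first statistic named above, the following is a minimal sketch of Fleiss' κ in plain Python. It is not the toolkit's own code: the function name `fleiss_kappa`, the input layout (one list of labels per item, one label per rater), and the sample labels are all assumptions made for illustration.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each a list of labels (one per rater).

    Hypothetical helper for illustration; every item must have the same
    number of raters, and at least two raters are required.
    """
    n_items = len(ratings)
    if n_items == 0:
        raise ValueError("need at least one item")
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of raters")

    categories = sorted({label for item in ratings for label in item})
    counts = [Counter(item) for item in ratings]

    # Per-item agreement: fraction of rater pairs that chose the same label.
    per_item = [
        (sum(c[k] ** 2 for k in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    observed = sum(per_item) / n_items

    # Chance agreement from the overall share of each category.
    shares = [
        sum(c[k] for c in counts) / (n_items * n_raters) for k in categories
    ]
    expected = sum(s * s for s in shares)

    if expected == 1.0:
        # Only one category was ever used; treat as full agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Three hypothetical LLM raters labelling two prompts.
unanimous = [["pass", "pass", "pass"], ["fail", "fail", "fail"]]
split = [["pass", "pass", "fail"], ["pass", "fail", "fail"]]
print(fleiss_kappa(unanimous))
print(fleiss_kappa(split))
```

Values near 1 indicate that the raters agree far more often than chance would predict, while values at or below 0 suggest agreement no better than chance, which is one way an item might be flagged as contested.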