LLM-as-judge calibration
Test execution is always the primary, objective gate (PRIMARY_GATE = "test_execution").
The LLM-as-judge is a secondary, qualitative score (style / clarity / idiomaticity) for cases with no test
oracle — never a gate.
Why a judge at all, and why κ
Raw agreement overstates a judge that mostly says one label. We report Cohen's κ (chance-adjusted agreement) against a hand-labeled gold set, on a single-answer discrete 1–5 rubric. Good rubrics land around κ ≈ 0.6–0.75. The judge is recalibrated at launch and periodically; its κ is published here so the secondary score is trustworthy before it is ever shown next to a run.
MVP status: the judge harness and Cohen's-κ computation ship with a partial seed gold set
(golden/judge_gold.jsonl). The full ~200-example calibration set is a documented, in-progress item;
this page will display the live κ once the set is complete.
Guardrails
- Discrete single-answer rubric (mitigates verbosity / position bias).
- Never used to pass or fail a run — the deterministic test transition decides resolution.
- Public κ so the qualitative score is held to the same evidence bar as the objective one.