LLM-as-judge calibration

Test execution is always the primary, objective gate (PRIMARY_GATE = "test_execution"). The LLM-as-judge is a secondary, qualitative score (style / clarity / idiomaticity) for cases with no test oracle — never a gate.

Cohen's κ vs the human gold set

Chance-adjusted agreement between the LLM judge and human raters on a discrete 1–5 rubric, binarized to a good/bad verdict (score ≥ 4 is “good”). Raw agreement overstates a judge that mostly says one label; κ corrects for chance.
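A worked case makes the correction concrete. The numbers below are illustrative only and are not taken from the gold set; the symbols p_o (observed agreement), p_J (share of items the judge calls good) and p_H (share the human raters call good) are notation assumed for this sketch. Suppose humans mark 90% of items good and the judge calls every item good:

```latex
\begin{align*}
\kappa &= \frac{p_o - p_e}{1 - p_e},
  & p_e &= p_J\,p_H + (1 - p_J)(1 - p_H) \\[4pt]
p_o &= 0.90,
  & p_e &= (1.0)(0.9) + (0.0)(0.1) = 0.90 \\[4pt]
\kappa &= \frac{0.90 - 0.90}{1 - 0.90} = 0
\end{align*}
```

Raw agreement is 90%, yet κ is 0: the judge has shown no ability to tell good from bad beyond what its constant label gets by chance.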


Why a judge at all, and why κ

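Tests cannot score style, clarity, or idiomaticity, so the judge fills that gap as a secondary signal. Its agreement with humans has to be measured, and raw agreement is a poor measure: a judge that gives one label most of the time scores high raw agreement whenever the gold set is skewed toward that label. We therefore report Cohen's κ (chance-adjusted agreement) against a hand-labeled gold set, using a single-answer discrete 1–5 rubric. Good rubrics land around κ ≈ 0.6–0.75. The judge is recalibrated at launch and periodically afterward, and its κ is published here so readers can check how far to trust the secondary score before it is ever shown next to a run.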
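The computation can be sketched in a few lines of Python. This is a minimal illustration, not the forgejudge implementation: the function names and the sample scores are hypothetical, and the only detail taken from this page is the binarization rule (score ≥ 4 is good).

```python
from typing import Sequence

GOOD_THRESHOLD = 4  # from the rubric on this page: score >= 4 counts as "good"


def binarize(scores: Sequence[int], threshold: int = GOOD_THRESHOLD) -> list[bool]:
    """Map 1-5 rubric scores to good (True) / bad (False) verdicts."""
    for s in scores:
        if not 1 <= s <= 5:
            raise ValueError(f"score outside the 1-5 rubric: {s}")
    return [s >= threshold for s in scores]


def cohens_kappa(judge: Sequence[bool], human: Sequence[bool]) -> float:
    """Cohen's kappa for two raters giving binary labels to the same items."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(judge)
    # Observed agreement: fraction of items where both raters give the same label.
    p_o = sum(j == h for j, h in zip(judge, human)) / n
    # Chance agreement, from each rater's marginal rate of saying "good".
    p_j = sum(judge) / n
    p_h = sum(human) / n
    p_e = p_j * p_h + (1 - p_j) * (1 - p_h)
    if p_e == 1.0:
        # Both raters used one identical label throughout: kappa is 0/0.
        return float("nan")
    return (p_o - p_e) / (1 - p_e)


# Hypothetical scores for eight items (not real gold-set data).
judge_scores = [5, 4, 2, 4, 1, 3, 5, 4]
human_scores = [4, 5, 2, 3, 1, 4, 5, 5]
kappa = cohens_kappa(binarize(judge_scores), binarize(human_scores))
print(f"kappa = {kappa:.3f}")
```

On these eight items the raters agree on 6 of 8 verdicts (75% raw agreement), but because both call 5 of 8 items good, chance agreement is already about 53%, and κ comes out near 0.47.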

MVP status: the judge harness and the Cohen's κ computation ship with a partial seed gold set (golden/judge_gold.jsonl). The full ~200-example calibration set is a documented, in-progress item; the κ published on this page is recomputed by running python -m forgejudge.eval.calibrate whenever the set grows.

Guardrails

Discrete single-answer rubric (mitigates verbosity / position bias).
Never used to pass or fail a run — the deterministic test transition decides resolution.
Public κ so the qualitative score is held to the same evidence bar as the objective one.
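The second guardrail can be read as a structural rule: the judge score travels with a run but never feeds into its resolution. The sketch below shows one way to encode that. Only the PRIMARY_GATE value comes from this page; RunResult, score_run, and the boolean tests_passed (standing in for the deterministic test transition) are hypothetical names assumed for illustration.

```python
from dataclasses import dataclass
from typing import Optional

PRIMARY_GATE = "test_execution"  # value quoted on this page


@dataclass(frozen=True)
class RunResult:
    resolved: bool              # decided by test execution only
    gate: str                   # which signal decided `resolved`
    judge_score: Optional[int]  # secondary 1-5 rubric score, informational


def score_run(tests_passed: bool, judge_score: Optional[int] = None) -> RunResult:
    """Build a result whose resolution ignores the judge score entirely."""
    if judge_score is not None and not 1 <= judge_score <= 5:
        raise ValueError(f"judge score outside the 1-5 rubric: {judge_score}")
    # `resolved` is a function of the test outcome alone; the judge score
    # is attached for display and has no path into the pass/fail decision.
    return RunResult(resolved=tests_passed, gate=PRIMARY_GATE, judge_score=judge_score)


# A low judge score cannot fail a passing run, and a high one cannot rescue a failing run.
passing = score_run(tests_passed=True, judge_score=1)
failing = score_run(tests_passed=False, judge_score=5)
```

Keeping the judge score as a plain field, with no branch that reads it, makes the guardrail easy to audit: any later change that lets the score influence resolved would have to touch this one function.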