ForgeJudge

LLM-as-judge calibration

Test execution is always the primary, objective gate (PRIMARY_GATE = "test_execution"). The LLM-as-judge is a secondary, qualitative score (style / clarity / idiomaticity) for cases with no test oracle — never a gate.
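
A minimal sketch of that separation, assuming a hypothetical RunResult record and gate_status helper (neither name comes from ForgeJudge; only PRIMARY_GATE is taken from the text above). The point it illustrates is that the gate decision reads the test outcome only, and the judge score is carried along as information:

```python
from dataclasses import dataclass
from typing import Optional

PRIMARY_GATE = "test_execution"  # value quoted from the text above


@dataclass
class RunResult:
    """Hypothetical record for one run; field names are assumptions."""
    tests_passed: Optional[bool]  # None when the case has no test oracle
    judge_score: Optional[int]    # secondary 1-5 rubric score, or None


def gate_status(result: RunResult) -> str:
    """Return 'pass', 'fail', or 'ungated'.

    Only the test outcome is consulted; judge_score is never read here,
    so a high or low judge score cannot change the gate decision.
    """
    if result.tests_passed is None:
        return "ungated"  # no oracle: nothing to gate on
    return "pass" if result.tests_passed else "fail"


def report(result: RunResult) -> dict:
    """Bundle the gate decision with the informational secondary score."""
    return {
        "gate": PRIMARY_GATE,
        "status": gate_status(result),
        "secondary_judge_score": result.judge_score,
    }
```

Under these assumptions, a run whose tests fail still reports "fail" even when the judge gives it a 5, and a run with no oracle reports "ungated" with the judge score shown alongside.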

Why a judge at all, and why κ

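Raw agreement overstates the quality of a judge that mostly emits one label: if 80% of the gold labels are 4 and the judge always answers 4, raw agreement is 80% even though the judge carries no information (κ = 0). We therefore report Cohen's κ (chance-adjusted agreement) against a hand-labeled gold set, scoring one answer at a time on a discrete 1–5 rubric. Good rubrics land around κ ≈ 0.6–0.75. The judge is calibrated at launch and recalibrated periodically; its κ is published here so readers can see how far to trust the secondary score before it is ever shown next to a run.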
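
For reference, Cohen's κ is (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance from each rater's label frequencies. The sketch below is a plain, unweighted version using only the standard library; the function name cohens_kappa and the sample labels are illustrative assumptions, not the shipped harness or real calibration data.

```python
import math
from collections import Counter
from typing import Sequence


def cohens_kappa(gold: Sequence[int], judge: Sequence[int]) -> float:
    """Unweighted Cohen's kappa between gold labels and judge labels."""
    if len(gold) != len(judge) or len(gold) == 0:
        raise ValueError("gold and judge must be non-empty and equal length")
    n = len(gold)

    # Observed agreement: fraction of items where the labels match.
    p_o = sum(g == j for g, j in zip(gold, judge)) / n

    # Chance agreement: sum over labels of the product of marginal rates.
    gold_counts = Counter(gold)
    judge_counts = Counter(judge)
    p_e = sum(gold_counts[k] * judge_counts[k] for k in gold_counts) / (n * n)

    if p_e == 1.0:
        # Both raters used one identical label throughout; kappa is undefined.
        return math.nan
    return (p_o - p_e) / (1.0 - p_e)


# Made-up labels: a judge that always answers 4 against a gold set that is
# 80% label 4. Raw agreement is 0.8, but kappa is 0.
gold_labels = [4] * 8 + [2, 5]
judge_labels = [4] * 10
kappa_constant_judge = cohens_kappa(gold_labels, judge_labels)
```

This version treats the 1–5 scores as unordered categories, so a 4-versus-5 disagreement counts the same as 1-versus-5. A weighted κ would give partial credit for near misses; whether the harness uses one is not stated here.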

MVP status: the judge harness and Cohen's-κ computation ship with a partial seed gold set (golden/judge_gold.jsonl). The full ~200-example calibration set is a documented, in-progress item; this page will display the live κ once the set is complete.

Guardrails