ForgeJudge

Methodology & decontamination

SWE-bench Verified is now widely held to be contaminated — OpenAI stopped reporting it in Feb 2026; >32% of "passed" cases leaked the solution and ~31% passed on weak tests. ForgeJudge treats decontamination as a tested, documented property — not a footnote.

1 · Intrinsically-verifiable tasks (no human label at scoring time)

Every task is make-CI-green: a real test fails on the buggy code and must pass after the fix, while existing tests stay green. Correctness is a deterministic test transition — the exact SWE-bench validity rule (resolved ⇔ all FAIL_TO_PASS pass ∧ all PASS_TO_PASS stay green).
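The resolution rule above can be sketched in a few lines. This is a minimal illustration, not ForgeJudge's actual scorer: the function name `is_resolved`, the shape of the `results` mapping, and the choice to treat a missing test result as a failure are all assumptions made for the example.

```python
from typing import Iterable, Mapping


def is_resolved(
    results: Mapping[str, bool],
    fail_to_pass: Iterable[str],
    pass_to_pass: Iterable[str],
) -> bool:
    """Apply the rule: resolved <=> all FAIL_TO_PASS pass AND all PASS_TO_PASS pass.

    `results` maps a test id to True when that test passed after the
    candidate patch was applied (hypothetical shape, for illustration).
    """
    f2p = list(fail_to_pass)
    p2p = list(pass_to_pass)
    if not f2p:
        # Assumption: a task with no failing test is not a valid task,
        # so refuse to score it instead of returning a vacuous True.
        raise ValueError("task must name at least one FAIL_TO_PASS test")
    # Assumption: a test with no recorded result counts as failed.
    bug_fixed = all(results.get(test, False) for test in f2p)
    nothing_broken = all(results.get(test, False) for test in p2p)
    return bug_fixed and nothing_broken


if __name__ == "__main__":
    after_patch = {"test_bug": True, "test_existing": True}
    print(is_resolved(after_patch, ["test_bug"], ["test_existing"]))
```

Because the verdict depends only on recorded test outcomes, scoring the same run twice gives the same answer, which is what makes the label-free scoring possible.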

2 · Leak-resistance

3 · Mutation hardening (weak-test detection)

For each task we mutate the gold fix within the patched region and require the test suite to kill the mutants. A task whose tests still pass under a wrong fix is flagged weak and rejected; this is enforced as a CI invariant. Current golden set: 7 mutation-hardened (mean mutation score 0.89), 5 inconclusive (regex/string code with no mutable operators), 0 weak.
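One way to picture the three verdicts is as a function of the mutant counts. The sketch below is illustrative only: `MutationReport`, `classify`, and especially the `min_score` threshold are hypothetical, since the cut-off that separates hardened from weak is not stated here.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class MutationReport:
    """Counts for one task (hypothetical structure, for illustration)."""
    killed: int    # mutants that made at least one test fail
    survived: int  # mutants the test suite did not detect


def classify(
    report: MutationReport, min_score: float = 0.5
) -> Tuple[str, Optional[float]]:
    """Return (verdict, mutation score) for one task.

    `min_score` is an assumed threshold, not a documented value.
    """
    total = report.killed + report.survived
    if total == 0:
        # No mutable operators in the patched region: nothing to measure.
        return "inconclusive", None
    score = report.killed / total
    verdict = "hardened" if score >= min_score else "weak"
    return verdict, score


if __name__ == "__main__":
    print(classify(MutationReport(killed=8, survived=1)))
    print(classify(MutationReport(killed=0, survived=0)))
```

Keeping "inconclusive" separate from "weak" matters: a task with no mutable operators has not been shown to have weak tests, it simply could not be measured this way.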

4 · Cheat-resistant, swebench-equivalent grading

5 · Determinism & the multi-seed gate

temperature=0 does not guarantee determinism (pass@1 still varies by 2–6 pp). The scorer itself is fully deterministic; the regression gate is multi-seed: a change fails the build only when the upper bound of the candidate's confidence interval is below the lower bound of the baseline's confidence interval, so flaky single runs don't block merges.
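The gate condition can be sketched as follows. This is a minimal example under stated assumptions: the interval type is not specified here, so the code uses a normal-approximation interval on the mean of per-seed pass rates, and the names `confidence_interval` and `gate_fails` are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev
from typing import Sequence, Tuple


def confidence_interval(
    rates: Sequence[float], level: float = 0.95
) -> Tuple[float, float]:
    """Normal-approximation interval for the mean per-seed pass rate.

    Assumption: the real gate may use a different interval
    (for example a bootstrap or a t-interval).
    """
    if len(rates) < 2:
        raise ValueError("need at least two seeds to estimate variance")
    z = NormalDist().inv_cdf(0.5 + level / 2)
    centre = mean(rates)
    half_width = z * stdev(rates) / sqrt(len(rates))
    return centre - half_width, centre + half_width


def gate_fails(
    baseline: Sequence[float], candidate: Sequence[float], level: float = 0.95
) -> bool:
    """Fail only when the candidate's upper bound is below the baseline's lower bound."""
    baseline_low, _ = confidence_interval(baseline, level)
    _, candidate_high = confidence_interval(candidate, level)
    return candidate_high < baseline_low


if __name__ == "__main__":
    baseline = [0.40, 0.42, 0.41, 0.43, 0.39]
    noisy = [0.38, 0.42, 0.40, 0.41, 0.39]      # within run-to-run noise
    regressed = [0.20, 0.22, 0.21, 0.19, 0.23]  # clearly worse
    print(gate_fails(baseline, noisy), gate_fails(baseline, regressed))
```

Requiring the intervals to be fully separated makes the gate conservative: small regressions hidden inside seed noise pass, and only a drop that no plausible seed variation explains blocks the merge.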

6 · The honest framing

The resolution rate on a free ($0) model is modest by design. The deliverable is the engineered, decontaminated, regression-gated system (the harness, the traces, the gate), proven via the model-swap comparison: the score rises with a better model while the harness stays fixed.