Methodology & decontamination
SWE-bench Verified is now widely held to be contaminated — OpenAI stopped reporting it in Feb 2026; >32% of "passed" cases leaked the solution and ~31% passed on weak tests. ForgeJudge treats decontamination as a tested, documented property — not a footnote.
1 · Intrinsically-verifiable tasks (no human label at scoring time)
Every task is make-CI-green: a real test fails on the buggy code and must pass after the fix, while existing
tests stay green. Correctness is a deterministic test transition — the exact SWE-bench validity rule
(resolved ⇔ all FAIL_TO_PASS pass ∧ all PASS_TO_PASS stay green).
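The validity rule above can be sketched as a small predicate. This is a minimal illustration, not ForgeJudge's actual scorer: the function name `is_resolved` and the shape of `results` (test id mapped to a pass/fail boolean) are assumptions made for the example.

```python
from typing import Iterable, Mapping


def is_resolved(
    results: Mapping[str, bool],
    fail_to_pass: Iterable[str],
    pass_to_pass: Iterable[str],
) -> bool:
    """Return True iff every FAIL_TO_PASS and every PASS_TO_PASS test passed.

    `results` maps a test id to True when that test passed after the patch
    (hypothetical shape). A test missing from `results` counts as a failure.
    """
    f2p = list(fail_to_pass)
    if not f2p:
        # Assumption: a task with no FAIL_TO_PASS test has nothing to verify.
        raise ValueError("a task needs at least one FAIL_TO_PASS test")
    f2p_ok = all(results.get(test, False) for test in f2p)
    p2p_ok = all(results.get(test, False) for test in pass_to_pass)
    return f2p_ok and p2p_ok
```

Treating a missing test as a failure keeps the check conservative: a patch that makes a test disappear from the report cannot count as resolved.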
2 · Leak-resistance
- Own-source only. Tasks come from purpose-built post-cutoff fixtures + the author's own repositories (pinned commit SHAs, own license). No third-party code is bundled, so there is no GPL/attribution or leak surface.
- Symptom-only problem statements. Issue text describes the observable bug, never the fix or its root cause.
- Hidden oracle. By default the agent solves from the issue against the buggy code; the FAIL_TO_PASS test is applied only at grading (real SWE-bench setup) — the agent never sees the exact assertions.
3 · Mutation hardening (weak-test detection)
For each task we mutate the gold fix within the patched region and require the test suite to kill the mutants.
A task whose tests pass under a wrong fix is flagged weak and rejected — a CI invariant. Current golden set:
7 mutation-hardened (mean score 0.89), 5 inconclusive (regex/string code with no mutable operators), 0 weak.
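The weak-test check can be illustrated with a toy mutation runner. This is a sketch under stated assumptions, not the project's mutation tool: the operator swap table, the helper names (`mutants`, `mutation_score`), and the convention that the caller passes only the patched region as `source` are all hypothetical. It needs Python 3.9+ for `ast.unparse`.

```python
import ast

# Hypothetical swap table: each mutable operator maps to one wrong alternative.
SWAPS = {
    ast.Lt: ast.LtE, ast.LtE: ast.Lt, ast.Gt: ast.GtE, ast.GtE: ast.Gt,
    ast.Eq: ast.NotEq, ast.NotEq: ast.Eq, ast.Add: ast.Sub, ast.Sub: ast.Add,
}


class _SwapNth(ast.NodeTransformer):
    """Swap only the operator at position `target`; count mutable sites in `seen`."""

    def __init__(self, target):
        self.target = target
        self.seen = 0

    def _maybe_swap(self, op):
        replacement = SWAPS.get(type(op))
        if replacement is None:
            return op
        index = self.seen
        self.seen += 1
        return replacement() if index == self.target else op

    def visit_Compare(self, node):
        self.generic_visit(node)
        node.ops = [self._maybe_swap(op) for op in node.ops]
        return node

    def visit_BinOp(self, node):
        self.generic_visit(node)
        node.op = self._maybe_swap(node.op)
        return node


def mutants(source):
    """Yield one mutated source string per mutable operator in `source`."""
    counter = _SwapNth(target=-1)  # target=-1 swaps nothing, only counts sites
    counter.visit(ast.parse(source))
    for i in range(counter.seen):
        tree = _SwapNth(target=i).visit(ast.parse(source))
        yield ast.unparse(ast.fix_missing_locations(tree))


def mutation_score(source, run_tests):
    """Fraction of mutants killed, or None when nothing is mutable (inconclusive).

    `run_tests(namespace)` returns True when the suite passes. No timeout is
    applied here; a real runner would need one for mutants that loop forever.
    """
    killed = total = 0
    for mutated in mutants(source):
        total += 1
        namespace = {}
        try:
            exec(mutated, namespace)
            passed = bool(run_tests(namespace))
        except Exception:
            passed = False  # a crashing mutant counts as killed
        if not passed:
            killed += 1
    return None if total == 0 else killed / total


GOLD = "def is_adult(age):\n    return age >= 18\n"


def strong(ns):
    # Checks the boundary, so the `>=` to `>` mutant is killed.
    return ns["is_adult"](18) and not ns["is_adult"](17)


def weak(ns):
    # Never touches the boundary, so the mutant survives.
    return ns["is_adult"](30) and not ns["is_adult"](5)
```

Under these assumptions, `mutation_score(GOLD, strong)` kills the only mutant, `mutation_score(GOLD, weak)` kills none and would flag the task as weak, and source with no swappable operators returns `None`, which corresponds to the inconclusive bucket described above.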
4 · Cheat-resistant, swebench-equivalent grading
- The candidate patch may change source only; every test file is restored to its canonical version before grading, so a patch cannot neuter or weaken the oracle to fake a resolution.
- Our verdict is verified equivalent to the official swebench.harness.grading.get_resolution_status in CI on every commit.
- Patches run in ephemeral, isolated GitHub Actions VMs (the sandbox boundary) at $0.
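The test-restoration step can be sketched on an in-memory file tree. This is an illustration only: the path convention in `is_test_path` and the dictionary representation of a workspace (path mapped to file content) are assumptions, not the harness's actual layout.

```python
from pathlib import PurePosixPath


def is_test_path(path: str) -> bool:
    """Hypothetical convention: anything under tests/ or named like a test file."""
    p = PurePosixPath(path)
    return (
        "tests" in p.parts
        or p.name.startswith("test_")
        or p.name.endswith("_test.py")
    )


def restore_tests(patched: dict[str, str], canonical: dict[str, str]) -> dict[str, str]:
    """Keep the candidate's source changes but reset every test file.

    Test files the patch edited are overwritten with the canonical version,
    and test files the patch added are dropped, so only source edits survive.
    """
    restored = {path: text for path, text in patched.items() if not is_test_path(path)}
    restored.update(
        {path: text for path, text in canonical.items() if is_test_path(path)}
    )
    return restored
```

Dropping added test files as well as resetting edited ones is a design choice in this sketch: it means the graded suite is exactly the canonical one, whatever the patch attempted.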
5 · Determinism & the multi-seed gate
temperature=0 does not guarantee determinism (pass@1 varies by 2–6 percentage points). The scorer itself is fully deterministic; the regression gate is multi-seed: a change fails the build only when the upper bound of the candidate's confidence interval is below the lower bound of the baseline's confidence interval, so flaky single runs don't block merges.
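The gate condition can be sketched as follows. The source does not say which interval the project uses, so this example assumes a normal-approximation interval over per-seed resolution rates with z = 1.96; the names `mean_interval` and `gate_fails` are hypothetical.

```python
import math
from statistics import mean, stdev


def mean_interval(rates, z=1.96):
    """Normal-approximation interval for the mean per-seed resolution rate.

    Assumption: one resolution rate in [0, 1] per seed, at least two seeds.
    """
    if len(rates) < 2:
        raise ValueError("need at least two seeds to estimate spread")
    centre = mean(rates)
    half_width = z * stdev(rates) / math.sqrt(len(rates))
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


def gate_fails(baseline_rates, candidate_rates, z=1.96):
    """Fail only when the candidate's upper bound is below the baseline's lower bound."""
    baseline_low, _ = mean_interval(baseline_rates, z)
    _, candidate_high = mean_interval(candidate_rates, z)
    return candidate_high < baseline_low


baseline = [0.50, 0.52, 0.48, 0.50, 0.50]
noisy_candidate = [0.46, 0.52, 0.50, 0.48, 0.49]      # overlaps the baseline
regressed_candidate = [0.40, 0.42, 0.38, 0.41, 0.39]  # clearly lower
```

With these made-up rates, `gate_fails(baseline, noisy_candidate)` is false because the intervals overlap, while `gate_fails(baseline, regressed_candidate)` is true because the candidate's whole interval sits below the baseline's.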
6 · The honest framing
A $0 free-model resolution rate is modest by design. The deliverable is the engineered, decontaminated, regression-gated system — the harness, the traces, the gate — proven via the model-swap comparison (the score rises with a better model while the harness stays fixed).