Methodology & decontamination

SWE-bench Verified is now widely held to be contaminated — OpenAI stopped reporting it in Feb 2026; >32% of "passed" cases leaked the solution and ~31% passed on weak tests. ForgeJudge treats decontamination as a tested, documented property — not a footnote.

1 · Intrinsically-verifiable tasks (no human label at scoring time)

Every task is make-CI-green: a real test fails on the buggy code and must pass after the fix, while existing tests stay green. Correctness is a deterministic test transition — the exact SWE-bench validity rule (resolved ⇔ all FAIL_TO_PASS pass ∧ all PASS_TO_PASS stay green).
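
The transition rule above can be sketched in a few lines. This is a minimal illustration, not ForgeJudge's actual code: the function names and the `dict` of test name to pass/fail are assumptions made for the example.

```python
from typing import Dict, Iterable


def is_valid_task(before: Dict[str, bool],
                  fail_to_pass: Iterable[str],
                  pass_to_pass: Iterable[str]) -> bool:
    """On the buggy code, every oracle test must fail and every regression test must pass."""
    oracle_fails = all(not before.get(t, False) for t in fail_to_pass)
    regressions_green = all(before.get(t, False) for t in pass_to_pass)
    return oracle_fails and regressions_green


def is_resolved(after: Dict[str, bool],
                fail_to_pass: Iterable[str],
                pass_to_pass: Iterable[str]) -> bool:
    """resolved <=> all FAIL_TO_PASS pass AND all PASS_TO_PASS stay green.

    A test missing from `after` is treated as not passed (an assumption of this sketch).
    """
    f2p_ok = all(after.get(t, False) for t in fail_to_pass)
    p2p_ok = all(after.get(t, False) for t in pass_to_pass)
    return f2p_ok and p2p_ok
```

For example, a patch that turns the oracle green but breaks an existing test is not resolved: `is_resolved({"test_bug": True, "test_old": False}, ["test_bug"], ["test_old"])` returns `False`.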

2 · Leak-resistance

Own-source only. Tasks come from purpose-built post-cutoff fixtures and the author's own repositories (pinned commit SHAs, own license). No third-party code is bundled, so there are no GPL or attribution obligations and no leak surface.
Symptom-only problem statements. Issue text describes the observable bug, never the fix or its root cause.
Hidden oracle. By default the agent solves from the issue against the buggy code; the FAIL_TO_PASS test is applied only at grading (real SWE-bench setup) — the agent never sees the exact assertions.

3 · Mutation hardening (weak-test detection)

For each task we mutate the gold fix within the patched region and require the test suite to kill the mutants. A task whose tests pass under a wrong fix is flagged weak and rejected; this is enforced as a CI invariant. Current golden set: 16 mutation-hardened (mean mutation score 0.94), 2 inconclusive (regex/string code with no mutable operators), 0 weak.
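
A toy version of the weak-test check, to make the idea concrete. Everything here is a sketch under stated assumptions: the single mutation operator (shifting a comparison boundary), the `is_adult` fix, and the helper names are hypothetical, and the real mutation set and rejection threshold are not specified in this document. It needs Python 3.9+ for `ast.unparse`.

```python
import ast
from typing import Callable, Dict, List, Optional

# Hypothetical mutation operator: shift a comparison boundary (>= to >, < to <=, ...).
SWAPS = {ast.GtE: ast.Gt, ast.Gt: ast.GtE, ast.LtE: ast.Lt, ast.Lt: ast.LtE}


def _is_site(node: ast.AST) -> bool:
    return isinstance(node, ast.Compare) and type(node.ops[0]) in SWAPS


def make_mutants(source: str) -> List[str]:
    """Return one mutated copy of `source` per mutable comparison."""
    n_sites = sum(1 for n in ast.walk(ast.parse(source)) if _is_site(n))
    mutants = []
    for i in range(n_sites):
        tree = ast.parse(source)  # fresh tree, so each mutant carries exactly one change
        site = [n for n in ast.walk(tree) if _is_site(n)][i]
        site.ops[0] = SWAPS[type(site.ops[0])]()
        mutants.append(ast.unparse(tree))
    return mutants


def mutation_score(source: str,
                   tests: Callable[[Dict[str, object]], bool]) -> Optional[float]:
    """Fraction of mutants the tests kill; None means inconclusive (nothing to mutate)."""
    mutants = make_mutants(source)
    if not mutants:
        return None
    killed = 0
    for mutant in mutants:
        namespace: Dict[str, object] = {}
        try:
            exec(mutant, namespace)
            survived = tests(namespace)
        except Exception:
            survived = False  # a crashing mutant counts as killed
        if not survived:
            killed += 1
    return killed / len(mutants)


GOLD_FIX = "def is_adult(age):\n    return age >= 18\n"


def strong_tests(ns) -> bool:  # exercises the boundary
    return ns["is_adult"](18) is True and ns["is_adult"](17) is False


def weak_tests(ns) -> bool:  # never touches the boundary
    return ns["is_adult"](30) is True and ns["is_adult"](5) is False
```

Here `mutation_score(GOLD_FIX, strong_tests)` is `1.0`, while `mutation_score(GOLD_FIX, weak_tests)` is `0.0`: the weak suite still passes when `>=` is mutated to `>`, which is the kind of task that would be flagged and rejected.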

4 · Cheat-resistant, swebench-equivalent grading

The candidate patch may change source only; every test file is restored to its canonical version before grading, so a patch cannot neuter or weaken the oracle to fake a resolution.
On every commit, CI verifies that our verdict matches the official swebench.harness.grading.get_resolution_status for real PASS/FAIL/ERROR/XFAIL outcomes.
The verdict is deliberately stricter on a skipped oracle: swebench 4.1.0 rates a SKIPPED FAIL_TO_PASS as RESOLVED_FULL (a skip is neither success nor failure), so a patch that makes the oracle skip rather than run would grade as resolved. ForgeJudge counts a skip as not-passed, closing that cheat vector (pinned by a CI test against real swebench).
Patches run in ephemeral, isolated GitHub Actions VMs — the sandbox boundary — at $0.
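
The skip handling described above can be illustrated with two small checks. This sketch models the behaviour as described in this section; it is not swebench's code and not ForgeJudge's. The function names, the string statuses, the choice to count XFAIL as passing, and the treatment of a missing test as a failure are all assumptions of the example.

```python
from typing import Dict, Iterable

PASSING = {"PASSED", "XFAIL"}   # assumption: an expected failure counts as passing
FAILING = {"FAILED", "ERROR"}


def lenient_f2p_ok(statuses: Dict[str, str], fail_to_pass: Iterable[str]) -> bool:
    """Ratio-style check: a SKIPPED test is counted as neither success nor failure."""
    tests = list(fail_to_pass)
    success = [t for t in tests if statuses.get(t, "FAILED") in PASSING]
    failure = [t for t in tests if statuses.get(t, "FAILED") in FAILING]
    counted = len(success) + len(failure)
    # With nothing counted, the ratio defaults to "all passed".
    return counted == 0 or len(success) == counted


def strict_f2p_ok(statuses: Dict[str, str], fail_to_pass: Iterable[str]) -> bool:
    """Strict check: every oracle test must actually run and pass; a skip is not-passed."""
    return all(statuses.get(t, "FAILED") in PASSING for t in fail_to_pass)
```

With `statuses = {"test_bug": "SKIPPED"}`, the lenient check returns `True` and the strict check returns `False`, which is the gap a skip-inducing patch would otherwise exploit.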

5 · Determinism & the multi-seed gate

temperature=0 does not guarantee determinism (pass@1 varies 2–6pp from run to run). The scorer is fully deterministic; the regression gate is multi-seed: a change fails the build only when the upper bound of the candidate's confidence interval is below the lower bound of the baseline's confidence interval, so flaky single runs don't block merges.
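
A minimal sketch of such a gate, assuming each run contributes one resolve rate per seed. The interval method is not stated in this document; the normal-approximation interval with z = 1.96 used here is an assumption (and a crude one for a handful of seeds), as are the function names.

```python
import math
import statistics
from typing import List, Tuple


def mean_interval(rates: List[float], z: float = 1.96) -> Tuple[float, float]:
    """Approximate confidence interval for the mean per-seed resolve rate."""
    mean = statistics.mean(rates)
    if len(rates) < 2:
        return mean, mean  # a single seed gives no spread estimate
    half_width = z * statistics.stdev(rates) / math.sqrt(len(rates))
    return max(0.0, mean - half_width), min(1.0, mean + half_width)


def gate_fails(baseline: List[float], candidate: List[float], z: float = 1.96) -> bool:
    """Fail only when the candidate's upper bound is below the baseline's lower bound."""
    baseline_low, _ = mean_interval(baseline, z)
    _, candidate_high = mean_interval(candidate, z)
    return candidate_high < baseline_low
```

With a baseline of `[0.50, 0.52, 0.48]`, a candidate of `[0.47, 0.51, 0.49]` overlaps the baseline interval and passes, while a candidate of `[0.30, 0.32, 0.28]` fails the gate.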

Metric definitions. pass@1 is the expected single-sample resolve rate: the mean resolve rate across the k seeds, i.e. the probability that one fresh attempt resolves the task. pass@k is the fraction of tasks where any of the k seeds resolved. pass@k ≥ pass@1 always holds, with equality only when the seeds of every task agree (all resolve or none do); the gap between the two reflects the run-to-run variation the multi-seed gate is built to absorb.
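
The two definitions can be computed directly from a table of per-seed outcomes. This is a sketch with a hypothetical data layout (task name mapped to one boolean per seed, the same k for every task); note that pass@k here is the empirical any-of-k rate defined above, not a combinatorial estimator.

```python
from typing import Dict, List


def pass_at_1(results: Dict[str, List[bool]]) -> float:
    """Mean over tasks of the per-task resolve rate across seeds."""
    per_task = [sum(seeds) / len(seeds) for seeds in results.values()]
    return sum(per_task) / len(per_task)


def pass_at_k(results: Dict[str, List[bool]]) -> float:
    """Fraction of tasks that at least one seed resolved."""
    return sum(any(seeds) for seeds in results.values()) / len(results)


# Three tasks, three seeds each: always solved, solved once, never solved.
example = {
    "task_a": [True, True, True],
    "task_b": [True, False, False],
    "task_c": [False, False, False],
}
```

For `example`, pass@1 is (1 + 1/3 + 0) / 3 = 4/9 and pass@k is 2/3; the difference comes entirely from `task_b`, the one task whose seeds disagree.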

6 · The honest framing

A $0 free-model resolution rate is modest by design. The deliverable is the engineered, decontaminated, regression-gated system — the harness, the traces, the gate — proven via the model-swap comparison (the score rises with a better model while the harness stays fixed).