Model swap: the harness is the deliverable

We do not chase a state-of-the-art (SOTA) resolution rate. To show that the harness measures real capability rather than a tuned scaffold, we hold the harness, prompts, golden set, and budget fixed and swap only the model. The score rises with the more capable model.

Same harness, different model


Both models run the identical hidden-test pipeline (localize → repair → validate), the same deterministic, cheat-resistant grader, and the same golden set. The only variable is the model, so the score gap is attributable to the model and not to the apparatus. A $0 free-tier number is modest by design.
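The controlled comparison can be sketched in a few lines of Python. This is a minimal illustration and not the project's actual code: `HarnessConfig`, `grade`, `resolution_rate`, `compare`, the toy models, and the four-task golden set are all hypothetical names and data, and the localize → repair → validate stages are collapsed into a single model call for brevity. The point it shows is that the configuration is frozen and shared, so the model is the only argument that changes between runs.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

# Hypothetical type: a "model" maps a prompt string to a candidate patch string.
Model = Callable[[str], str]


@dataclass(frozen=True)
class HarnessConfig:
    """Everything held fixed across runs (field names are illustrative)."""
    prompt_template: str
    golden_set: Tuple[Tuple[str, str], ...]  # (task, expected patch) pairs
    max_calls_per_task: int                  # stand-in for the fixed budget


def grade(candidate: str, expected: str) -> bool:
    """Deterministic grader stand-in: exact match against the hidden answer."""
    return candidate.strip() == expected.strip()


def resolution_rate(model: Model, config: HarnessConfig) -> float:
    """Fraction of golden-set tasks the model resolves within the budget."""
    resolved = 0
    for task, expected in config.golden_set:
        prompt = config.prompt_template.format(task=task)
        for _ in range(config.max_calls_per_task):
            if grade(model(prompt), expected):
                resolved += 1
                break
    return resolved / len(config.golden_set)


def compare(models: Dict[str, Model], config: HarnessConfig) -> Dict[str, float]:
    """Run every model against the same frozen config; only the model varies."""
    return {name: resolution_rate(m, config) for name, m in models.items()}


# Toy data and toy models, invented purely for this illustration.
CONFIG = HarnessConfig(
    prompt_template="Fix: {task}",
    golden_set=(
        ("bug-a", "patch-a"),
        ("bug-b", "patch-b"),
        ("bug-c", "patch-c"),
        ("bug-d", "patch-d"),
    ),
    max_calls_per_task=2,
)


def weak_model(prompt: str) -> str:
    # Always proposes the same patch, so it resolves only one task.
    return "patch-a"


def strong_model(prompt: str) -> str:
    # Resolves three of the four tasks; returns an empty patch otherwise.
    task = prompt.rsplit(" ", 1)[-1]
    return task.replace("bug", "patch") if task in {"bug-a", "bug-b", "bug-c"} else ""


if __name__ == "__main__":
    print(compare({"weak": weak_model, "strong": strong_model}, CONFIG))
```

Freezing the config as an immutable dataclass is one way to make accidental per-model tuning harder: any attempt to change a prompt or budget for one run raises an error instead of silently changing the apparatus.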

Why this matters

It answers the "your rate is low vs SOTA" objection: SOTA results (~88–94%) rely on premium models and premium budgets. The artifact here is the engineered, decontaminated, regression-gated system, and the swap shows that it discriminates between models.
It is reproducible: one command re-runs any leaderboard entry and regenerates its trace.