ForgeJudge

Model swap: the harness is the deliverable

We do not chase a state-of-the-art (SOTA) resolution rate. To show that the harness measures real capability rather than a tuned scaffold, we hold the harness, prompts, golden set, and budget fixed and swap only the model. With everything else fixed, the score rises when a more capable model is swapped in.

Same harness, different model


Both models go through the identical hidden-test pipeline (localize → repair → validate), are scored by the same deterministic, cheat-resistant grader, and are evaluated on the same golden set. The only variable is the model, so the gap between the two scores is a property of the model, not of the apparatus. A $0 free-tier number is modest by design.
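The controlled comparison can be sketched in a few lines. This is a minimal illustration under stated assumptions, not ForgeJudge's actual code: `HarnessConfig`, `resolution_rate`, `compare`, and the two toy models are hypothetical names, the localize → repair → validate stages are collapsed into a single model call, and the exact-match check is only a stand-in for the real hidden-test grader.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

# Hypothetical interface: a "model" is any callable from prompt to completion.
Model = Callable[[str], str]


@dataclass(frozen=True)  # frozen: the apparatus cannot change between runs
class HarnessConfig:
    prompt_template: str
    golden_set: Tuple[Tuple[str, str], ...]  # (task, expected result)
    max_calls: int  # budget, counted here as model calls per run


def resolution_rate(model: Model, cfg: HarnessConfig) -> float:
    """Fraction of golden-set tasks the model resolves within the budget."""
    calls = 0
    resolved = 0
    for task, expected in cfg.golden_set:
        if calls >= cfg.max_calls:
            break  # budget exhausted; remaining tasks stay unresolved
        calls += 1
        answer = model(cfg.prompt_template.format(task=task))
        # Stand-in grader: deterministic exact match, no model in the loop.
        resolved += int(answer.strip() == expected)
    return resolved / len(cfg.golden_set)


def compare(models: Dict[str, Model], cfg: HarnessConfig) -> Dict[str, float]:
    """Score every model against the same frozen configuration."""
    return {name: resolution_rate(m, cfg) for name, m in models.items()}


# Toy data and toy models, invented purely for illustration.
CONFIG = HarnessConfig(
    prompt_template="Fix: {task}",
    golden_set=(("a", "A"), ("b", "B"), ("c", "C"), ("d", "D")),
    max_calls=4,
)


def weak_model(prompt: str) -> str:
    return "A" if prompt.endswith("a") else ""  # solves only task "a"


def strong_model(prompt: str) -> str:
    return prompt[-1].upper()  # solves every toy task


SCORES = compare({"weak": weak_model, "strong": strong_model}, CONFIG)
print(SCORES)  # {'weak': 0.25, 'strong': 1.0}
```

Because `CONFIG` is frozen and shared, the only thing that differs between the two entries in `SCORES` is the callable passed in, which is the property the comparison relies on.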

Why this matters