Execution-as-judge for autonomous coding agents.

Every patch runs in a sandbox, every run has a public trace, and every regression fails the build, all on a $0 stack and against a contamination-resistant, intrinsically verifiable golden set. The engineered harness is the deliverable, not the resolution rate.

⚖️ swebench-equivalent grading 🔒 cheat-resistant (tests restored at grading) 🧬 mutation-hardened golden set 📈 multi-seed CI regression gate 🔭 per-run Langfuse traces

Leaderboard

pass@1: expected single-sample resolve rate, i.e. the per-seed resolve rate averaged over seeds.
pass@k: fraction of tasks resolved by at least one of the k seeds.
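
The two metrics above can be sketched in a few lines of Python. This is a minimal illustration, not the harness's actual code: the `results` mapping, the task IDs, and the function names are hypothetical, and it assumes every task was run with the same k seeds, so that averaging over all runs equals averaging the per-seed resolve rates.

```python
# Hypothetical per-task outcomes: one boolean per seed (True = resolved).
# Task IDs and values are made up for illustration only.
results = {
    "task-a": [True, True, False],
    "task-b": [False, False, False],
    "task-c": [True, False, False],
    "task-d": [True, True, True],
}


def pass_at_1(results: dict[str, list[bool]]) -> float:
    """Expected single-sample resolve rate: resolved runs / total runs.

    With the same number of seeds per task, this equals the mean of the
    per-seed resolve rates.
    """
    resolved_runs = sum(sum(seeds) for seeds in results.values())
    total_runs = sum(len(seeds) for seeds in results.values())
    return resolved_runs / total_runs


def pass_at_k(results: dict[str, list[bool]]) -> float:
    """Fraction of tasks resolved by at least one of the k seeds."""
    resolved_tasks = sum(1 for seeds in results.values() if any(seeds))
    return resolved_tasks / len(results)


print(f"pass@1 = {pass_at_1(results):.2f}")  # 6 resolved runs out of 12
print(f"pass@k = {pass_at_k(results):.2f}")  # 3 of 4 tasks resolved at least once
```

pass@k can never fall below pass@1 on the same runs, so a large gap between the two suggests that many tasks resolve only on some seeds.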


Per-run results

Each run is reproducible and deep-links to its full OpenTelemetry trace in Langfuse.
