Execution-as-judge for autonomous coding agents.
Every patch runs in a sandbox, every run has a public trace, and every regression fails the build, all on a $0 stack and against a contamination-resistant, intrinsically verifiable golden set. The engineered harness is the deliverable, not the resolution rate.
⚖️ SWE-bench-equivalent grading
🔒 cheat-resistant (tests restored at grading)
🧬 mutation-hardened golden set
📈 multi-seed CI regression gate
🔭 per-run Langfuse traces
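As a rough illustration of the "tests restored at grading" idea, the sketch below overwrites the graded test files in an agent's working tree with pristine copies before any test is run, so a patch that weakens or deletes a test cannot change the verdict. It is a minimal sketch under stated assumptions, not the project's actual harness: the function name `restore_tests`, the `pristine` and `workdir` directory layout, and the explicit `test_paths` list are all hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def restore_tests(workdir: Path, pristine: Path, test_paths: list[str]) -> list[str]:
    """Overwrite each graded test file in workdir with its pristine copy.

    Hypothetical helper: returns the relative paths the agent had
    modified or deleted, so tampering can be reported alongside the grade.
    """
    tampered = []
    for rel in test_paths:
        src = pristine / rel
        dst = workdir / rel
        original = src.read_bytes()
        # Record the file if the agent changed or removed it.
        if not dst.exists() or dst.read_bytes() != original:
            tampered.append(rel)
        # Restore unconditionally so grading always uses the golden tests.
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(src, dst)
    return tampered


def demo() -> list[str]:
    """Simulate an agent weakening a test, then restore it before grading."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        pristine, workdir = root / "pristine", root / "workdir"
        for base in (pristine, workdir):
            (base / "tests").mkdir(parents=True)
            (base / "tests" / "test_add.py").write_text("assert 1 + 1 == 2\n")
        # The "agent" replaces the real assertion with a trivial one.
        (workdir / "tests" / "test_add.py").write_text("assert True\n")
        changed = restore_tests(workdir, pristine, ["tests/test_add.py"])
        restored = (workdir / "tests" / "test_add.py").read_text()
        assert restored == "assert 1 + 1 == 2\n"
        return changed
```

Calling `demo()` returns the list of tampered paths. A real harness would presumably take the pristine files from the golden set rather than a sibling directory, and would run the restored tests inside the sandbox afterwards.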
Leaderboard
Per-run results
Each run is reproducible and deep-links to its full OpenTelemetry trace in Langfuse.