●builderCost-per-task data ($31 for top performance vs. $10.40 for second place) gives concrete tradeoff numbers for scoping agentic knowledge work products.
●researcherPrivate holdout tests and unsaturated scoring make this a more credible long-horizon agentic benchmark than most current alternatives.
●founderThe 3x cost gap between top and second-place models is a real product decision point for anyone building long-horizon agentic workflows.