Cognition Releases Enterprise Code Evals With Up to 100-Hour Task Horizons
June 4, 2026
Cognition's first public eval dataset covers real-world Java, TypeScript, Python, and C# tasks (feature development, bugfixes, and migrations) across 258 sessions from 126 enterprise users, with a reported rlog of 0.74 on held-out data. The task horizons extend well beyond METR's ~16-hour cap, and the release includes a financial guarantee on results.
HOW THIS AFFECTS YOU
● builder: You now have a public methodology for measuring real-world agent task time savings that you can apply to your own Devin or Claude Code deployments.
● researcher: The 258-session ground-truth dataset with human time estimates provides a more ecologically valid benchmark than METR's synthetic tasks, and its rlog of 0.74 on held-out data is a figure worth comparing against.
● founder: Worth watching, because this establishes a credible, enterprise-grounded eval standard for agentic coding that could become the reference benchmark buyers demand.