
Cognition Releases Enterprise Code Evals with Task Horizons Up to 100 Hours

June 4, 2026

Cognition's first public eval dataset covers real-world Java, TypeScript, Python, and C# tasks (feature development, bugfixes, and migrations) across 258 sessions from 126 enterprise users, with a reported rlog of 0.74 on held-out data. Its task horizons of up to 100 hours extend well beyond METR's ~16-hour cap, and the release includes a financial guarantee on results.

HOW THIS AFFECTS YOU

● builder: You now have a public methodology for measuring real-world agent task time savings that you can apply to your own Devin or Claude Code deployments.

● researcher: The 258-session ground-truth dataset with human time estimates provides a more ecologically valid benchmark than METR's synthetic tasks, and its rlog of 0.74 on held-out data is worth comparing against.

● founder: Worth watching: this establishes a credible, enterprise-grounded eval standard for agentic coding that could become the reference benchmark buyers demand.

SOURCE

https://x.com/swyx/status/2062611218196771017#m
