[X]score: 0.25

this has been my experience as well. there definitely were improvements, specifically wrt shell based computer use, but also regressions, especiall…

May 23, 2026

A practitioner reports that real-world coding agent performance plateaued after model generation 3.x-to-4.x transitions, attributing the step change to large coding dataset ingestion via Claude Code and Codex pipelines. Subsequent benchmark gains have not translated to observable workflow improvements, suggesting a data ceiling on RL-driven gains.

SOURCE

https://x.com/badlogicgames/status/2058254727402455096#m

← back to feed