[X]score: 0.25
this has been my experience as well. there definitely were improvements, specifically wrt shell based computer use, but also regressions, especiall…
May 23, 2026
A practitioner reports that real-world coding agent performance plateaued after model generation 3.x-to-4.x transitions, attributing the step change to large coding dataset ingestion via Claude Code and Codex pipelines. Subsequent benchmark gains have not translated to observable workflow improvements, suggesting a data ceiling on RL-driven gains.