OSWorld 2.0 Benchmarks Long-Horizon Real-World Computer Use
June 27, 2026
OSWorld 2.0 raises benchmark complexity: its 108 workflows require an average of 318 tool calls each, compared with 30 in version 1.0. The benchmark tests frontier agents on realistic, long-horizon professional and everyday tasks to better expose the limitations of current agents.
HOW THIS AFFECTS YOU
● Builder: You can more accurately measure the readiness of your computer-use agents for real-world production environments.
● Researcher: This provides a much harder testing ground for evaluating the long-horizon reasoning of agentic models.