[arXiv]

Evaluating Software World Models in Coding LLMs

June 29, 2026

The paper presents an evaluation framework for coding LLMs that measures how well they predict execution resources such as peak memory, wall-clock time, and profiler outputs. Tests on SWE-bench Verified show that even frontier models lack an internal model of how software executes, as opposed to how it is written.
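As a rough illustration of what execution-resource prediction involves, the sketch below measures wall-clock time and peak memory for a small function and scores a guessed value against the measurement. It is not the paper's harness: the helper names `measure`, `log_ratio_error`, and `build_list`, the log-ratio metric, and the predicted value are all assumptions made for this example.

```python
import math
import time
import tracemalloc


def measure(fn, *args):
    """Run fn once and return (wall_seconds, peak_bytes) as the observed values."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        fn(*args)
    finally:
        elapsed = time.perf_counter() - start
        # get_traced_memory() returns (current, peak) in bytes.
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
    return elapsed, peak


def log_ratio_error(predicted, actual):
    """Absolute log10 ratio: 0.0 is exact, 1.0 means off by a factor of ten."""
    return abs(math.log10(predicted / actual))


def build_list(n):
    """Toy workload (hypothetical): allocate a list of n squares."""
    return [i * i for i in range(n)]


elapsed, peak = measure(build_list, 100_000)
predicted_peak = 4_000_000  # hypothetical model guess, in bytes
print(f"measured peak: {peak} bytes, wall time: {elapsed:.4f} s")
print(f"peak-memory error (log10 ratio): {log_ratio_error(predicted_peak, peak):.2f}")
```

A log-ratio score is used here because resource estimates are usually judged by order of magnitude. `tracemalloc` only tracks allocations made through Python's allocator, so the measured peak is a lower bound on process memory.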

HOW THIS AFFECTS YOU

● builder: You should be aware that current coding models struggle with resource-efficient code generation and execution reasoning.

● researcher: You can use execution-resource prediction as a more rigorous metric for evaluating software reasoning in LLMs.

