This evaluation framework for coding LLMs measures how well they predict execution resources such as peak memory, wall-clock time, and profiler outputs. Tests on SWE-bench Verified show that even frontier models lack an internal model of how software executes, as opposed to how it is written.
HOW THIS AFFECTS YOU
● Builder: You should be aware that current coding models struggle with resource-efficient code generation and execution reasoning.
● Researcher: You can use execution-resource prediction as a more rigorous metric for evaluating software reasoning in LLMs.
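
As a rough illustration of what scoring an execution-resource prediction could look like, the sketch below measures the wall-clock time and peak Python-level memory of one function call and compares them with predicted values. It is not the framework's implementation: `measure`, `log_ratio_error`, the `build_list` workload, and the predicted numbers are all hypothetical, and the log-ratio scoring rule is an assumption chosen for illustration only.

```python
import math
import time
import tracemalloc


def measure(fn, *args, **kwargs):
    """Run fn once and return (wall-clock seconds, peak traced bytes).

    tracemalloc only sees allocations made through Python's allocator,
    so this is an approximation of true process peak memory.
    """
    tracemalloc.start()
    start = time.perf_counter()
    try:
        fn(*args, **kwargs)
    finally:
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
    return elapsed, peak


def log_ratio_error(predicted, actual):
    """Absolute log10 ratio (assumed scoring rule, not from the source).

    0.0 means an exact prediction; 1.0 means off by a factor of ten.
    """
    if predicted <= 0 or actual <= 0:
        raise ValueError("predicted and actual must be positive")
    return abs(math.log10(predicted / actual))


def build_list(n):
    """Hypothetical workload whose resources a model would predict."""
    return [i * i for i in range(n)]


if __name__ == "__main__":
    # Placeholder predictions standing in for a model's answer.
    predicted_seconds, predicted_bytes = 0.05, 4_000_000

    seconds, peak_bytes = measure(build_list, 100_000)
    print(f"time error (log10 ratio):   {log_ratio_error(predicted_seconds, seconds):.2f}")
    print(f"memory error (log10 ratio): {log_ratio_error(predicted_bytes, peak_bytes):.2f}")
```

A ratio-based error is used here because resource quantities span several orders of magnitude; a real harness would also need repeated runs and a fixed environment to make timings comparable.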