[arXiv] score: 0.44
Chain-of-Thought Outperforms Code Execution on Math Robustness Across Problem Variations
May 27, 2026
On 1,000 GSM-Symbolic problems run with Claude Haiku 4.5, CoT accuracy dropped only 1.3 percentage points across problem variations, versus larger drops for the PAL and SBSC code-execution approaches.
cs.AI cs.CL cs.LG
HOW THIS AFFECTS YOU
● builder: If you're choosing between CoT and code execution for math pipelines, CoT shows better stability across input variations on this dataset.
● researcher: Systematic evidence that code-execution methods don't improve robustness to surface-level problem variations, which challenges a common assumption in the math reasoning literature.
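
The robustness figure quoted above is an accuracy drop in percentage points between original and varied problems. A minimal sketch of that calculation follows, assuming per-problem correctness is recorded as booleans; the function names and the toy outcomes are hypothetical and are not taken from the paper.

```python
def accuracy(results):
    """Return the share of correct answers as a percentage."""
    return 100.0 * sum(results) / len(results)


def robustness_drop(original, varied):
    """Return the accuracy drop, in percentage points, from original to varied problems."""
    return accuracy(original) - accuracy(varied)


# Hypothetical toy outcomes (True = correct answer); NOT data from the paper.
original = [True, True, True, True, False, True, True, True, True, True]
varied = [True, True, False, True, False, True, True, True, True, True]

drop = robustness_drop(original, varied)
print(f"{drop:.1f} percentage points")
```

A smaller drop means the method is more stable under variation. Note that a percentage-point drop is an absolute difference between two accuracies, not a relative change.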