[arXiv] score: 0.44

Chain-of-Thought Outperforms Code Execution on Math Robustness Across Problem Variations

May 27, 2026

On 1,000 GSM-Symbolic problems run with Claude Haiku 4.5, chain-of-thought (CoT) accuracy dropped by only 1.3 percentage points across problem variations, while the code-execution approaches PAL and SBSC showed larger drops.

cs.AI  cs.CL  cs.LG

HOW THIS AFFECTS YOU

● builder: If you're choosing between CoT and code execution for math pipelines, CoT was more stable across input variations on this dataset.

● researcher: The paper reports systematic evidence that code-execution methods don't improve robustness to surface-level problem variations, challenging a common assumption in the math reasoning literature.
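
For readers unfamiliar with the metric, a drop "in percentage points" is the plain difference between two accuracies, not a relative change. The sketch below illustrates that arithmetic under stated assumptions: the function names and the pass/fail lists are hypothetical and made up for illustration, and the card does not say exactly how the paper computes its drop.

```python
from typing import Sequence


def accuracy(correct: Sequence[bool]) -> float:
    """Return the share of correct answers as a percentage (0 to 100)."""
    if not correct:
        raise ValueError("need at least one result")
    return 100.0 * sum(correct) / len(correct)


def drop_in_points(original: Sequence[bool], varied: Sequence[bool]) -> float:
    """Accuracy on original problems minus accuracy on varied problems,
    expressed in percentage points (an absolute, not relative, difference)."""
    return accuracy(original) - accuracy(varied)


# Hypothetical pass/fail outcomes, NOT data from the paper.
original_results = [True] * 9 + [False] * 1  # 9 of 10 correct
varied_results = [True] * 8 + [False] * 2    # 8 of 10 correct

print(drop_in_points(original_results, varied_results))
```

A smaller value means the method is more stable when the problem wording or numbers change, which is the sense in which the card calls CoT more robust.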

SOURCE

https://arxiv.org/abs/2605.26414
