[arXiv] score: 0.60
RL Framework Derives Principled Mid-Generation Abstention Rule for LLM Reasoning
May 26, 2026
The paper models abstention as an explicit action in a regularized RL framework and shows that terminating chain-of-thought generation once the value function drops below an abstention reward parameter strictly outperforms baseline approaches, reducing compute wasted on long, incorrect reasoning traces.
cs.LG, cs.CL, stat.ML
HOW THIS AFFECTS YOU
● builder: You can use this framework to tune a single abstention reward parameter to control the compute-vs-accuracy tradeoff in deployed reasoning LLMs without architectural changes.
● researcher: The formal RL derivation provides principled guidance for dynamic mid-generation abstention, filling a gap left by prior empirical-only approaches to early stopping in reasoning models.
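The threshold rule described in the summary can be illustrated with a minimal sketch. It assumes a per-step value estimate is already available; the summary does not say how the value function is estimated, so the `values` list below is made-up illustrative data, and the names `abstention_step` and `run_with_abstention` are hypothetical. The regularization term from the paper's framework is not modeled here.

```python
from typing import Optional, Sequence


def abstention_step(values: Sequence[float], abstain_reward: float) -> Optional[int]:
    """Return the index of the first step whose value estimate is strictly
    below the abstention reward, or None if the estimate never drops below it.

    values[t] is assumed to be the value estimate after t + 1 reasoning steps.
    """
    for t, v in enumerate(values):
        if v < abstain_reward:
            return t
    return None


def run_with_abstention(values: Sequence[float], abstain_reward: float) -> dict:
    """Summarize how many reasoning steps are spent under the threshold rule."""
    step = abstention_step(values, abstain_reward)
    if step is None:
        # The estimate stayed at or above the threshold: generation runs to the end.
        return {"abstained": False, "steps_used": len(values)}
    # Generation stops right after the step where the estimate fell below the threshold.
    return {"abstained": True, "steps_used": step + 1}


# Hypothetical value estimates for a trace that is going wrong.
values = [0.7, 0.6, 0.35, 0.2]

low = run_with_abstention(values, abstain_reward=0.1)
mid = run_with_abstention(values, abstain_reward=0.4)
high = run_with_abstention(values, abstain_reward=0.65)
```

In this toy trace, raising `abstain_reward` makes the rule stop earlier and spend fewer steps, while lowering it lets generation run longer. That is the single-parameter compute-vs-accuracy knob the builder note refers to.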