[arXiv] score: 0.19
Entropy Centroids as Intrinsic Rewards for Test-Time Scaling
April 30, 2026
Researchers introduce intrinsic rewards based on entropy centroids for best-of-N response selection in LLMs, removing the dependency on an external reward model. During inference, high-entropy tokens cluster into consecutive segments, and these segments form temporally stable uncertainty signals that are used to rank candidate responses. The method targets practitioners who deploy test-time compute scaling, such as Grok Heavy or Gemini Deep Think pipelines, and want to avoid costly reward model training. According to the authors, it outperforms naive confidence and entropy aggregation baselines while reducing inference overhead.
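To make the idea concrete, the following is a minimal sketch of how such a signal could be computed, not the paper's actual formulation. The summary does not define the centroid or the ranking direction, so everything here is an assumption: the centroid is taken to be the entropy-weighted mean token position, the threshold for "high entropy" is a free parameter, and the helper names (`token_entropy`, `high_entropy_segments`, `entropy_centroid`, `best_of_n`) are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def high_entropy_segments(
    entropies: Sequence[float], threshold: float
) -> List[Tuple[int, int]]:
    """Group consecutive tokens whose entropy exceeds `threshold`.

    Returns (start, end) index pairs with `end` exclusive.
    The threshold is an assumed hyperparameter, not a value from the paper.
    """
    segments: List[Tuple[int, int]] = []
    start = None
    for i, h in enumerate(entropies):
        if h > threshold:
            if start is None:
                start = i
        elif start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(entropies)))
    return segments


def entropy_centroid(entropies: Sequence[float]) -> float:
    """Entropy-weighted mean token position, normalised to [0, 1].

    ASSUMPTION: this is one plausible reading of "entropy centroid".
    """
    total = sum(entropies)
    n = len(entropies)
    if total == 0.0 or n < 2:
        return 0.0
    return sum(i * h for i, h in enumerate(entropies)) / (total * (n - 1))


def best_of_n(candidates: Sequence[Sequence[float]], prefer_low: bool = True) -> int:
    """Return the index of the candidate selected by its centroid.

    ASSUMPTION: `prefer_low=True` favours responses whose uncertainty is
    concentrated early; the real ranking rule may differ.
    """
    scores = [entropy_centroid(c) for c in candidates]
    pick = min if prefer_low else max
    return pick(range(len(scores)), key=scores.__getitem__)
```

Each candidate is represented by its per-token entropy trace, which can be collected during decoding at no extra model cost. No external reward model is called, which is the property the summary highlights.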
cs.LG, cs.AI, cs.CL