[arXiv] score: 0.19

Entropy Centroids as Intrinsic Rewards for Test-Time Scaling

April 30, 2026
Researchers introduce entropy-centroid-based intrinsic rewards for best-of-N response selection in LLMs, removing the dependency on external reward models. During inference, high-entropy tokens cluster into consecutive segments, and these segments form temporally stable uncertainty signals that are used to rank candidate responses. The approach targets practitioners who deploy test-time compute scaling, such as Grok Heavy or Gemini Deep Think pipelines, and want to avoid costly reward model training. The method reportedly outperforms naive confidence and entropy aggregation baselines while reducing inference overhead.
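To make the selection idea concrete, here is a minimal Python sketch of best-of-N ranking from per-token entropies. The summary does not give the paper's exact definition of an entropy centroid or its ranking rule, so everything below is an assumption for illustration: the helper names (`high_entropy_segments`, `entropy_centroid`, `select_best`), the fixed `threshold`, the reading of the centroid as an entropy-weighted mean position over high-entropy segments, and above all the rule that prefers candidates whose uncertainty sits earlier in the response.

```python
import math
from typing import List, Sequence, Tuple


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def high_entropy_segments(
    entropies: Sequence[float], threshold: float
) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) index pairs for each run of
    consecutive tokens whose entropy exceeds `threshold`."""
    segments: List[Tuple[int, int]] = []
    start = None
    for i, h in enumerate(entropies):
        if h > threshold:
            if start is None:
                start = i  # a new high-entropy run begins here
        elif start is not None:
            segments.append((start, i - 1))  # the run ended at the previous token
            start = None
    if start is not None:
        segments.append((start, len(entropies) - 1))
    return segments


def entropy_centroid(entropies: Sequence[float], threshold: float) -> float:
    """ASSUMED definition (not taken from the paper): the entropy-weighted
    mean token position, restricted to high-entropy segments and
    normalized to [0, 1]. Returns 0.0 when there is no such segment."""
    n = len(entropies)
    mass = 0.0
    moment = 0.0
    for start, end in high_entropy_segments(entropies, threshold):
        for i in range(start, end + 1):
            mass += entropies[i]
            moment += entropies[i] * i
    if mass == 0.0 or n <= 1:
        return 0.0
    return (moment / mass) / (n - 1)


def select_best(
    candidates: Sequence[Sequence[float]], threshold: float = 1.0
) -> int:
    """HYPOTHETICAL ranking rule: pick the candidate with the smallest
    centroid, i.e. whose uncertainty is concentrated earliest."""
    scores = [-entropy_centroid(c, threshold) for c in candidates]
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    late_uncertainty = [0.1, 0.1, 0.1, 2.0, 2.0]
    early_uncertainty = [2.0, 2.0, 0.1, 0.1, 0.1]
    print(select_best([late_uncertainty, early_uncertainty]))
```

In practice the per-token entropies would come from the model's own output distributions for each of the N sampled responses, which is why no external reward model is needed. The threshold and the direction of the ranking rule would have to be taken from the paper rather than from this sketch.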
cs.LG, cs.AI, cs.CL