Entropy Centroids as Intrinsic Rewards for Test-Time Scaling

April 30, 2026

Researchers introduce entropy-centroid-based intrinsic rewards for best-of-N response selection in LLMs, removing the dependency on external reward models. During inference, high-entropy tokens cluster into consecutive segments, forming temporally stable uncertainty signals that are used to rank candidate responses. The method targets practitioners who deploy test-time compute scaling, in pipelines like Grok Heavy or Gemini Deep Think, and want to avoid costly reward model training. It is reported to outperform naive confidence and entropy aggregation baselines while reducing inference overhead.
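To make the idea concrete, here is a minimal Python sketch of segment-based entropy scoring for best-of-N selection. It is not the paper's method: the summary does not give the exact centroid definition or ranking rule, so the threshold, the entropy-weighted centroid in `segment_centroid`, and the scoring rule in `score_response` are all assumptions chosen for illustration, and every function name is hypothetical.

```python
import math
from typing import List, Tuple


def token_entropy(probs: List[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def high_entropy_segments(
    entropies: List[float], threshold: float
) -> List[Tuple[int, int]]:
    """Group consecutive tokens with entropy > threshold into [start, end) ranges."""
    segments: List[Tuple[int, int]] = []
    start = None
    for i, h in enumerate(entropies):
        if h > threshold:
            if start is None:
                start = i  # open a new segment
        elif start is not None:
            segments.append((start, i))  # close the current segment
            start = None
    if start is not None:
        segments.append((start, len(entropies)))
    return segments


def segment_centroid(
    entropies: List[float], start: int, end: int
) -> Tuple[float, float]:
    """Assumed centroid: (entropy-weighted mean position, mean entropy).

    Expects a segment from high_entropy_segments with threshold >= 0,
    so the total entropy mass is strictly positive.
    """
    window = entropies[start:end]
    mass = sum(window)
    position = sum(i * h for i, h in zip(range(start, end), window)) / mass
    return position, mass / len(window)


def score_response(entropies: List[float], threshold: float) -> float:
    """Hypothetical intrinsic reward: less high-entropy mass per token is better."""
    if not entropies:
        return 0.0
    total = 0.0
    for start, end in high_entropy_segments(entropies, threshold):
        _, mean_entropy = segment_centroid(entropies, start, end)
        total += mean_entropy * (end - start)
    return -total / len(entropies)


def best_of_n(candidates: List[List[float]], threshold: float) -> int:
    """Return the index of the candidate with the highest hypothetical score."""
    return max(
        range(len(candidates)),
        key=lambda k: score_response(candidates[k], threshold),
    )
```

In use, each candidate response would be represented by its per-token entropies, computed with `token_entropy` from the model's own next-token distributions, and `best_of_n` would pick the index of the response to return. Because the signal comes from the generating model itself, no separate reward model is queried.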

cs.LG, cs.AI, cs.CL

SOURCE

https://arxiv.org/abs/2604.26173
