[arXiv] score: 0.20

Prefix Utility Model Outperforms Step-Correctness Rewards for LLM Reasoning Search

June 8, 2026

Rather than scoring reasoning steps by local correctness, the Prefix Utility Model (PUM) is trained on prefix gain, the solve-rate improvement a prefix induces in a lightweight student model, using pairwise ranking. PUM improves Best-of-N selection, beam search, and RL on math reasoning, with the largest gains when candidate pools are large or rule-based rewards are sparse. Data and models are released.
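To make the training signal concrete, here is a minimal sketch of prefix gain and a pairwise ranking objective. It rests on assumptions not stated in the summary: the baseline is taken to be the student's solve rate without the prefix, and the ranking loss is a standard logistic (Bradley-Terry style) form. The function names are hypothetical and are not taken from the released code.

```python
import math
from typing import Sequence


def solve_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of sampled student rollouts that ended in a correct answer."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


def prefix_gain(with_prefix: Sequence[bool], without_prefix: Sequence[bool]) -> float:
    """Solve-rate improvement when the student continues from the prefix.

    Assumption: the baseline is the student's solve rate from the bare problem.
    """
    return solve_rate(with_prefix) - solve_rate(without_prefix)


def pairwise_ranking_loss(score_hi: float, score_lo: float) -> float:
    """Logistic ranking loss for one pair of prefixes.

    score_hi is the model's score for the prefix with the larger measured gain.
    The loss shrinks as score_hi exceeds score_lo. Assumption: the summary only
    says "pairwise ranking", so this particular loss form is illustrative.
    """
    return math.log1p(math.exp(-(score_hi - score_lo)))


# Toy rollouts: the student solves 3 of 4 with the prefix, 1 of 4 without it.
gain = prefix_gain([True, True, False, True], [True, False, False, False])
```

In this toy case the prefix lifts the student's solve rate from 0.25 to 0.75, so its gain is 0.5. Ranking pairs of prefixes by gain, rather than regressing the gain value directly, only asks the scorer to get the ordering right.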

HOW THIS AFFECTS YOU

● builder: You can drop PUM into reasoning pipelines as a plug-in prefix scorer for beam search or Best-of-N without task-specific retraining.

● researcher: PUM reframes process reward modeling around outcome-grounded utility rather than step correctness, with released models and data to benchmark against.
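For the builder note above, a prefix scorer could be wired into Best-of-N and beam search roughly as follows. This is a sketch under assumptions: `Scorer` is a hypothetical interface mapping a problem and a prefix to a utility score, `expand` is a hypothetical function that proposes continuations, and the toy scorer stands in for a real model call. None of these names come from the PUM release.

```python
from typing import Callable, List, Sequence

# Hypothetical interface: (problem, prefix) -> utility score.
Scorer = Callable[[str, str], float]


def best_of_n(problem: str, candidates: Sequence[str], score: Scorer) -> str:
    """Return the candidate with the highest utility score."""
    return max(candidates, key=lambda c: score(problem, c))


def beam_step(
    problem: str,
    beams: Sequence[str],
    expand: Callable[[str], List[str]],
    score: Scorer,
    width: int,
) -> List[str]:
    """Expand every beam, then keep the `width` highest-scoring prefixes."""
    expanded = [p for b in beams for p in expand(b)]
    ranked = sorted(expanded, key=lambda p: score(problem, p), reverse=True)
    return ranked[:width]


def toy_score(problem: str, prefix: str) -> float:
    """Stand-in scorer for illustration only: longer prefixes score higher."""
    return float(len(prefix))


def toy_expand(prefix: str) -> List[str]:
    """Stand-in expander: two fixed continuations per beam."""
    return [prefix + "x", prefix + "yy"]


picked = best_of_n("q", ["a", "abc", "ab"], toy_score)
next_beams = beam_step("q", ["a", "b"], toy_expand, toy_score, width=2)
```

Keeping the scorer behind a plain callable means the same selection code works whether the score comes from a utility model, a step-correctness reward model, or a rule-based check, which makes side-by-side comparisons straightforward.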

