[arXiv] score: 0.17
SAVE Framework Improves Reward Models Using On-Policy Value-Anchored Feedback
June 1, 2026
SAVE uses a value function to grade on-policy responses and treats those grades as a self-supervised training signal for the reward model, avoiding reliance on costly human annotations as the policy evolves. Trained with a contrastive objective and ambiguous-sample filtering, it outperforms baselines across six benchmarks.
cs.CL
HOW THIS AFFECTS YOU
● Researcher: Directly addresses reward model staleness during RLHF training loops; the value-anchored on-policy approach is a concrete alternative to expensive re-annotation pipelines.
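
The summary does not spell out SAVE's objective or its filtering rule, so the following is only a rough sketch of the general idea, not the paper's method. It assumes that each on-policy response to a prompt already has a scalar value estimate, that "ambiguous" means the value gap between two responses falls below a threshold, and that the contrastive objective can be stood in for by a standard Bradley-Terry pairwise loss. The names `build_pairs`, `pairwise_loss`, and `margin` are hypothetical.

```python
import math
from itertools import combinations


def build_pairs(values, margin=0.2):
    """Turn value estimates into (chosen, rejected) index pairs.

    Assumption: a pair is "ambiguous" when the value gap is below
    `margin`; such pairs are dropped rather than labeled.
    """
    pairs = []
    for i, j in combinations(range(len(values)), 2):
        gap = values[i] - values[j]
        if abs(gap) < margin:
            continue  # too close to call, skip this pair
        # The response with the higher value estimate is "chosen".
        pairs.append((i, j) if gap > 0 else (j, i))
    return pairs


def pairwise_loss(rewards, pairs):
    """Mean Bradley-Terry loss: -log(sigmoid(r_chosen - r_rejected)).

    This is a stand-in for the contrastive objective, which the
    summary does not specify.
    """
    if not pairs:
        return 0.0
    total = 0.0
    for chosen, rejected in pairs:
        diff = rewards[chosen] - rewards[rejected]
        total += math.log1p(math.exp(-diff))  # equals -log(sigmoid(diff))
    return total / len(pairs)


# Value estimates for four on-policy responses to one prompt (made-up numbers).
values = [0.9, 0.5, 0.45, 0.1]
pairs = build_pairs(values, margin=0.2)

# Current reward-model scores for the same four responses (made-up numbers).
rewards = [1.2, 0.3, 0.4, -0.5]
loss = pairwise_loss(rewards, pairs)
```

In this toy setup the pair formed by the second and third responses is dropped because their value estimates differ by only 0.05, while the remaining pairs push the reward model to agree with the value function's ordering. In an actual training loop the loss would be computed on differentiable reward-model outputs rather than plain floats.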