[arXiv] score: 0.17

SAVE Framework Improves Reward Models Using On-Policy Value-Anchored Feedback

June 1, 2026

SAVE uses a value function to grade on-policy responses, and the grades serve as a self-supervised training signal for the reward model. This avoids reliance on costly human annotations as the policy evolves. Trained with a contrastive objective and ambiguous-sample filtering, it outperforms baselines across six benchmarks.
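To make the idea concrete, here is a minimal sketch of how value-graded responses could be turned into a contrastive training signal. It is not the paper's implementation: the function names `build_pairs` and `pairwise_loss`, the `margin` threshold used to drop ambiguous pairs, and the Bradley-Terry style pairwise loss are all assumptions chosen for illustration, and the numbers are made up.

```python
import math
from itertools import combinations


def build_pairs(values, margin=0.2):
    """Turn value estimates for sampled responses into preference pairs.

    Assumption: a pair counts as "ambiguous" when the value gap is below
    `margin`, and such pairs are dropped. Returns (chosen, rejected) indices.
    """
    pairs = []
    for i, j in combinations(range(len(values)), 2):
        gap = values[i] - values[j]
        if abs(gap) < margin:
            continue  # too close to call: skip this pair
        pairs.append((i, j) if gap > 0 else (j, i))
    return pairs


def pairwise_loss(rewards, pairs):
    """Mean of -log(sigmoid(r_chosen - r_rejected)) over the kept pairs.

    Assumption: a Bradley-Terry style objective stands in for the paper's
    contrastive objective, whose exact form is not given in this summary.
    """
    if not pairs:
        return 0.0
    total = 0.0
    for chosen, rejected in pairs:
        diff = rewards[chosen] - rewards[rejected]
        total += math.log1p(math.exp(-diff))  # equals -log(sigmoid(diff))
    return total / len(pairs)


# Hypothetical value estimates for three on-policy responses to one prompt.
values = [0.9, 0.85, 0.1]
# Hypothetical current reward-model scores for the same responses.
rewards = [0.2, 0.5, 0.4]

pairs = build_pairs(values)          # responses 0 and 1 are too close, so no pair
loss = pairwise_loss(rewards, pairs)
```

In this sketch the reward model is never shown a human label: the value estimates alone decide which response in each pair is preferred, and pairs the value function cannot separate contribute nothing to the loss.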

cs.CL

HOW THIS AFFECTS YOU

● Researcher: Directly addresses reward model staleness during RLHF training loops. The value-anchored on-policy approach is a concrete alternative to expensive re-annotation pipelines.

SOURCE

https://arxiv.org/abs/2605.30888
