REVES Two-Stage Framework Improves LLM Reasoning via Near-Miss Trajectory Reuse
June 16, 2026
REVES alternates between online data augmentation and policy optimization, converting intermediate near-miss steps from successful recovery trajectories into decoupled revision training signals. The approach addresses the misalignment between single-shot post-training objectives and multi-step test-time inference.
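As a rough illustration of the data-augmentation half of that loop, the sketch below turns the failed intermediate steps of a trajectory that eventually succeeds into standalone revision examples. It is not the REVES implementation: the `Step` and `RevisionExample` types, the function name `extract_revision_examples`, and the reading of "decoupled" as "each example stands alone, without the rest of the trajectory history" are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    """One attempt within a multi-step trajectory (hypothetical structure)."""
    answer: str
    correct: bool


@dataclass
class RevisionExample:
    """A standalone training example: revise `draft` into `target` (hypothetical structure)."""
    problem: str
    draft: str
    target: str


def extract_revision_examples(problem: str, trajectory: List[Step]) -> List[RevisionExample]:
    """Convert failed intermediate steps of a recovered trajectory into revision examples.

    Assumption: a "successful recovery trajectory" is one whose final step is
    correct; every earlier incorrect step is treated as a near miss.
    """
    if not trajectory or not trajectory[-1].correct:
        return []  # the trajectory never recovered, so nothing is reused

    target = trajectory[-1].answer
    return [
        RevisionExample(problem=problem, draft=step.answer, target=target)
        for step in trajectory[:-1]
        if not step.correct
    ]


# Usage: two failed attempts followed by a correct one yield two examples.
trajectory = [Step("x = 3", False), Step("x = 5", False), Step("x = 4", True)]
examples = extract_revision_examples("Solve 2x + 1 = 9", trajectory)
```

Because the examples are built from rollouts the policy has already produced, this step needs no extra sampling or labeling; in a full pipeline the resulting examples would feed the policy-optimization stage, which is not sketched here.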
HOW THIS AFFECTS YOU
● Researcher: Near-miss trajectory reuse is a concrete source of training signal for multi-turn RL that doesn't require additional human annotation or rollouts.