[arXiv] score: 0.13
HDPO Trains LLMs to Generate and Select Among Diverse Solution Paths
June 3, 2026
HDPO adds a structured hint-generation stage to reinforcement learning with verifiable rewards (RLVR): the model is prompted to enumerate candidate solution outlines, then select one for full reasoning. Training proceeds in two stages, a cold start for structured output followed by hint-guided diversified RL, and aims to incentivize solution diversity beyond what outcome-level reward signals provide.
cs.CL
HOW THIS AFFECTS YOU
● Researcher: HDPO offers a concrete alternative to outcome-only RLVR by introducing explicit diversity incentives at the reasoning-path level, with potential relevance to math and code benchmarks.