Fine-tuning vision-language-action models with binary episode outcomes conflates viability and efficiency signals and misassigns credit across human-intervention segments. Hierarchical Advantage Weighting decomposes per-transition supervision into separate components for these objectives, providing gradient signal beyond basic task success.
HOW THIS AFFECTS YOU
● Researcher: Directly addresses a known failure mode in online RL fine-tuning of robot policies, where binary success labels stop providing useful gradient once basic competence is reached.