[HUGGINGFACE] score: 0.67
AXPO Fixes Tool-Use Collapse in Agentic RL Training
May 26, 2026
Under standard GRPO training, agentic VLMs call tools in only ~30% of rollouts, and ~40% of tool-using groups produce all-wrong answers. Because GRPO computes advantages relative to the group, a group whose rollouts all receive the same reward contributes no gradient, so these groups suppress the learning signal. AXPO (Agent eXplorative Policy Optimization) targets this Thinking-Acting Gap by intervening specifically on all-wrong tool-using rollout groups to recover gradient signal.
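The lost learning signal in all-wrong groups follows from how GRPO-style advantages are normalized within each rollout group. The sketch below is a minimal illustration of that normalization, not AXPO itself. The function name `group_relative_advantages`, the binary rewards, and the `eps` value are assumptions made for the example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize each reward against its own rollout group (GRPO-style).

    `eps` is an assumed small constant that avoids division by zero
    when every reward in the group is identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group with mixed outcomes yields non-zero advantages.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A group where every rollout is wrong yields all-zero advantages,
# so the policy gradient from this group vanishes.
all_wrong = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

print(mixed)
print(all_wrong)
```

With identical rewards the numerator is zero for every rollout, so the whole group drops out of the update regardless of how the tool calls went.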
HOW THIS AFFECTS YOU
● Builder: Worth watching if you are training tool-using agents with GRPO-style RL, as AXPO directly addresses the tool-call learning collapse that degrades agent reliability.
● Researcher: The Thinking-Acting Gap framing and its two diagnostic metrics offer a concrete lens for debugging agentic RL training runs.
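The summary does not name the two diagnostic metrics. As an assumption, the sketch below tracks the two statistics it does quote: the share of rollouts that use a tool, and the share of tool-using groups whose answers are all wrong. The `Rollout` type, both function names, and the definition of a tool-using group (at least one rollout called a tool) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    used_tool: bool  # did this rollout issue at least one tool call?
    correct: bool    # was the final answer judged correct?


def tool_use_rate(groups):
    """Fraction of all rollouts that used a tool."""
    rollouts = [r for g in groups for r in g]
    if not rollouts:
        return 0.0
    return sum(r.used_tool for r in rollouts) / len(rollouts)


def all_wrong_tool_group_rate(groups):
    """Fraction of tool-using groups in which no rollout is correct.

    Assumption: a group counts as tool-using if any rollout called a tool.
    """
    tool_groups = [g for g in groups if any(r.used_tool for r in g)]
    if not tool_groups:
        return 0.0
    all_wrong = [g for g in tool_groups if not any(r.correct for r in g)]
    return len(all_wrong) / len(tool_groups)


groups = [
    [Rollout(True, False), Rollout(True, False)],   # tool-using, all wrong
    [Rollout(True, True), Rollout(False, False)],   # tool-using, one correct
    [Rollout(False, True), Rollout(False, False)],  # no tool use
]
print(tool_use_rate(groups))              # 0.5
print(all_wrong_tool_group_rate(groups))  # 0.5
```

Logging both numbers per training step would show whether tool use is rare, unrewarded, or both.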