[HUGGINGFACE] score: 0.67

AXPO Fixes Tool-Use Collapse in Agentic RL Training

May 26, 2026

Under standard GRPO training, agentic VLMs call tools in only ~30% of rollouts, and ~40% of tool-using rollout groups produce all-wrong answers. Because GRPO scores each rollout relative to the rest of its group, a group in which every answer is wrong contributes no learning signal. AXPO (Agent eXplorative Policy Optimization) targets this Thinking-Acting Gap by intervening specifically on all-wrong tool-using rollout groups to recover gradient signal.
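To make the collapse concrete, here is a minimal Python sketch of why an all-wrong group yields zero advantage under the usual group-normalized GRPO formulation, along with two simple run-level statistics. The function names, the rollout dictionary layout (`used_tool`, `reward`), and the choice to count a group as "tool-using" when any of its rollouts calls a tool are assumptions made for illustration. They are not taken from the AXPO paper, and the sketch does not implement AXPO's intervention.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def tool_use_rate(groups):
    """Fraction of all rollouts that made at least one tool call."""
    rollouts = [r for g in groups for r in g]
    return sum(1 for r in rollouts if r["used_tool"]) / len(rollouts)


def all_wrong_tool_group_fraction(groups):
    """Among tool-using groups, the fraction whose rollouts are all wrong.

    Assumption: a group is "tool-using" if any rollout in it calls a tool,
    and a rollout is wrong when its reward is 0.
    """
    tool_groups = [g for g in groups if any(r["used_tool"] for r in g)]
    if not tool_groups:
        return 0.0
    all_wrong = [g for g in tool_groups if all(r["reward"] == 0 for r in g)]
    return len(all_wrong) / len(tool_groups)


# Hypothetical toy data: three groups of two rollouts each.
groups = [
    [{"used_tool": True, "reward": 0.0}, {"used_tool": False, "reward": 0.0}],
    [{"used_tool": True, "reward": 1.0}, {"used_tool": False, "reward": 0.0}],
    [{"used_tool": False, "reward": 1.0}, {"used_tool": False, "reward": 0.0}],
]

# Every reward equals the group mean, so every advantage is exactly zero.
dead_group = group_advantages([0.0, 0.0, 0.0, 0.0])

# One correct rollout is enough to restore a non-zero signal.
live_group = group_advantages([1.0, 0.0, 0.0, 0.0])
```

Logging both statistics per training step would show whether a run is drifting toward the ~30% tool-use and ~40% all-wrong figures reported above.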

paper

HOW THIS AFFECTS YOU

● builder: Worth watching if you are training tool-using agents with GRPO-style RL, as AXPO directly addresses the tool-call learning collapse that degrades agent reliability.

● researcher: The Thinking-Acting Gap framing and its two diagnostic metrics offer a concrete lens for debugging agentic RL training runs.

SOURCE

https://huggingface.co/papers/2605.28774
