Agent Explorative Policy Optimization for Multimodal Agentic Reasoning
May 28, 2026
Multimodal agents trained with Agent Explorative Policy Optimization (AEPO) use self-generated exploratory rollouts to improve reasoning across vision-language tasks. During RL training, the method encourages diverse action trajectories rather than only exploiting paths that already yield high reward. The approach targets agentic settings where sparse rewards and long action horizons make standard PPO or GRPO unstable.
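To make the sparse-reward problem concrete, here is a minimal Python sketch. It is not AEPO's actual algorithm, which this summary does not specify. It shows the group-standardized advantage used in GRPO-style training and how it collapses to zero when every rollout in a group earns the same reward. The `novelty_bonus` function, its `scale` parameter, and the action names are hypothetical, chosen only to illustrate one generic way an exploration term could restore a learning signal.

```python
from collections import Counter
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize each reward within its rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def novelty_bonus(trajectories, scale=0.1):
    """HYPOTHETICAL count-based bonus (an assumption, not taken from the source):
    action sequences that are rarer within the group receive a larger bonus."""
    counts = Counter(tuple(t) for t in trajectories)
    return [scale / counts[tuple(t)] for t in trajectories]


# Four rollouts for one prompt; the action names are made up for illustration.
trajectories = [
    ["crop", "ocr", "answer"],
    ["crop", "ocr", "answer"],
    ["zoom", "search", "answer"],
    ["crop", "ocr", "answer"],
]
sparse_rewards = [0.0, 0.0, 0.0, 0.0]  # no rollout reached the sparse reward

# Identical rewards give every rollout zero advantage: no gradient signal.
plain = group_relative_advantages(sparse_rewards)

# Adding the hypothetical bonus separates the rare trajectory from the rest.
bonuses = novelty_bonus(trajectories)
shaped = group_relative_advantages(
    [r + b for r, b in zip(sparse_rewards, bonuses)]
)

print(plain)
print(shaped)
```

In this toy group the plain advantages are all zero, while the shaped advantages are positive for the one distinct trajectory and negative for the three repeated ones. A count-based bonus is only one of many possible exploration signals and is used here purely for simplicity.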