[arXiv] score: 0.24

Adapt to Thrive! Adaptive Power-Mean Policy Optimization for Improved LLM Reasoning

May 7, 2026
Researchers propose Adaptive Power-Mean Policy Optimization (APMPO), a reinforcement learning method that adapts policy optimization to a model's evolving reasoning capabilities during LLM training. It combines two components: Power-Mean Policy Optimization (PMPO), which transitions the objective between arithmetic-mean and geometric-mean forms, and Feedback-Adaptive Clipping (FAC), which adaptively tunes the clipping bounds.
cs.CL, cs.ET, cs.LG
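The summary says PMPO moves between arithmetic-mean and geometric-mean objectives. As a rough illustration of how a single exponent can interpolate between those two means, here is a minimal sketch of the standard power (generalized) mean. This is not the paper's objective: the function name `power_mean`, the sample values, and the choice of exponents are assumptions made for illustration, and the summary does not say which quantities APMPO averages or how it schedules the exponent.

```python
import math


def power_mean(values, p):
    """Standard power mean of positive values with exponent p.

    p = 1 gives the arithmetic mean; the limit p -> 0 gives the
    geometric mean. Illustrative helper only, not taken from the paper.
    """
    if not values:
        raise ValueError("values must be non-empty")
    if any(v <= 0 for v in values):
        raise ValueError("values must be strictly positive")
    n = len(values)
    if abs(p) < 1e-12:
        # Limit case p -> 0: exp of the mean of logs (geometric mean).
        return math.exp(sum(math.log(v) for v in values) / n)
    return (sum(v ** p for v in values) / n) ** (1.0 / p)


# Hypothetical positive quantities to average (placeholder data).
samples = [0.5, 1.0, 2.0]

arithmetic = power_mean(samples, 1.0)    # arithmetic mean
geometric = power_mean(samples, 0.0)     # geometric mean
intermediate = power_mean(samples, 0.5)  # lies between the two
```

Because the power mean is non-decreasing in `p`, lowering the exponent from 1 toward 0 slides the result from the arithmetic mean down to the geometric mean, which is one plausible reading of the "transition" the summary describes.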