[arXiv] score: 0.24

Balanced Aggregation: Understanding and Fixing Aggregation Bias in GRPO

May 7, 2026
An arXiv preprint (2605.04077) identifies a critical but overlooked flaw in GRPO training: token-level aggregation introduces a sign-length coupling bias, while sequence-level aggregation penalizes longer responses, and both distort the policy gradient. The proposed fix, Balanced Aggregation (BA), computes token-level means separately over the positive-reward and negative-reward subsets, then combines the two with sequence-count weighting. BA is presented as a drop-in replacement that requires no architectural changes, which makes it directly relevant to anyone fine-tuning LLMs with GRPO, DAPO, or similar RLVR pipelines for reasoning or code tasks.
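To make the three aggregation schemes concrete, here is a minimal pure-Python sketch. It is one reading of the summary above, not the paper's reference implementation: the function names, the use of the sign of the sequence-level advantage to define the positive and negative subsets, the choice to skip zero-advantage sequences, and the normalization by the number of non-zero-advantage sequences are all assumptions made for illustration.

```python
from typing import Sequence


def token_mean(seqs: Sequence[Sequence[float]]) -> float:
    """Mean over all tokens pooled across the given sequences."""
    n_tokens = sum(len(s) for s in seqs)
    if n_tokens == 0:
        return 0.0
    return sum(sum(s) for s in seqs) / n_tokens


def sequence_mean(seqs: Sequence[Sequence[float]]) -> float:
    """Mean of per-sequence token means (each sequence weighted equally)."""
    nonempty = [s for s in seqs if len(s) > 0]
    if not nonempty:
        return 0.0
    return sum(sum(s) / len(s) for s in nonempty) / len(nonempty)


def balanced_aggregate(
    token_terms: Sequence[Sequence[float]],
    advantages: Sequence[float],
) -> float:
    """Hypothetical Balanced Aggregation sketch.

    token_terms[i] holds the per-token objective terms of response i;
    advantages[i] is its sequence-level advantage. Assumption: subsets are
    split by advantage sign and zero-advantage responses are ignored.
    """
    pos = [t for t, a in zip(token_terms, advantages) if a > 0]
    neg = [t for t, a in zip(token_terms, advantages) if a < 0]
    n = len(pos) + len(neg)
    if n == 0:
        return 0.0
    # Token-level mean inside each subset, then weight by sequence counts.
    return (len(pos) * token_mean(pos) + len(neg) * token_mean(neg)) / n


if __name__ == "__main__":
    # Two short positive responses and one long negative response.
    terms = [[2.0, 2.0], [1.0, 1.0, 1.0, 1.0], [-1.0] * 6]
    adv = [1.0, 1.0, -1.0]
    print("token-level:   ", token_mean(terms))
    print("sequence-level:", sequence_mean(terms))
    print("balanced:      ", balanced_aggregate(terms, adv))
```

In this toy group the long negative response dominates the pooled token-level mean, while the balanced variant lets each sign subset contribute in proportion to how many responses it contains. In an actual trainer the same arithmetic would be applied to masked loss tensors rather than Python lists.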
cs.LG, cs.AI, cs.CL