[HUGGINGFACE] score: 0.48
IB-Score Metric Diagnoses Exploration-Exploitation Imbalance in LLM RL Training
May 26, 2026
IB-Score uses Information Bottleneck theory to quantify the trade-off between step-level reasoning diversity and mutual information with correct answers. It reveals that GRPO with common regularizers fails to maintain this balance during training. The paper proposes Information Bottleneck-driven Tree-based Policy Optimization to address the imbalance.
HOW THIS AFFECTS YOU
● Researcher: IB-Score is a diagnostic metric you can apply to existing RL training runs to detect exploration-exploitation collapse before it manifests as benchmark degradation.
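To make the two quantities in the summary concrete, here is a minimal sketch of how diversity and answer relevance could be estimated from logged rollouts. It is not the paper's IB-Score formula, which this summary does not give. It assumes each reasoning step has already been mapped to a discrete cluster label (a hypothetical preprocessing step) and that each rollout carries a binary correctness flag. The function names `entropy`, `mutual_information`, and `ib_style_diagnostic` are illustrative only.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (bits) of discrete step labels: a proxy for diversity."""
    n = len(labels)
    counts = Counter(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mutual_information(step_labels, correct_flags):
    """Plug-in estimate (bits) of I(Z; Y) between step labels Z and correctness Y."""
    n = len(step_labels)
    pz = Counter(step_labels)
    py = Counter(correct_flags)
    pzy = Counter(zip(step_labels, correct_flags))
    mi = 0.0
    for (z, y), c in pzy.items():
        p_joint = c / n
        mi += p_joint * math.log2(p_joint / ((pz[z] / n) * (py[y] / n)))
    return mi


def ib_style_diagnostic(step_labels, correct_flags):
    """Report both quantities side by side (assumed form, not the paper's score)."""
    return {
        "diversity_bits": entropy(step_labels),
        "relevance_bits": mutual_information(step_labels, correct_flags),
    }


# Hypothetical logged rollouts: cluster label per step, correctness per rollout.
healthy = ib_style_diagnostic(["a", "a", "b", "b"], [1, 1, 0, 0])
collapsed = ib_style_diagnostic(["a", "a", "a", "a"], [1, 0, 1, 0])
```

Tracking these two numbers across checkpoints is one way to watch for collapse: in the toy data above, the `collapsed` run has zero diversity and zero relevance because every step falls in the same cluster. The plug-in estimators are biased on small samples, so treat absolute values with caution.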