[arXiv]

ARCO Co-Evolves Rubric and Policy for Per-Step Credit Assignment in LLM Agents

June 23, 2026

ARCO uses a shared-backbone model with separate generation and scoring heads to produce per-step natural-language rubric criteria and rewards, with a trajectory decomposition constraint tying step rewards to terminal outcomes. The rubric scorer and policy are jointly updated on-policy, eliminating dependence on a frozen closed-source judge.
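The summary does not state the exact form of the trajectory decomposition constraint. As a minimal sketch, assuming an additive form in which the per-step rewards of one trajectory should sum to its terminal outcome, the constraint could be checked and enforced as below. The function names and the squared-residual penalty are illustrative assumptions, not ARCO's published method.

```python
from typing import List


def decomposition_residual(step_rewards: List[float], terminal_outcome: float) -> float:
    """Gap between the summed per-step rewards and the terminal outcome.

    Assumption: the constraint is additive (sum of step rewards == outcome).
    """
    return sum(step_rewards) - terminal_outcome


def decomposition_penalty(step_rewards: List[float], terminal_outcome: float) -> float:
    """Squared residual; zero when the step rewards sum exactly to the outcome.

    Hypothetical regularizer that a scorer loss could include.
    """
    return decomposition_residual(step_rewards, terminal_outcome) ** 2


def project_onto_constraint(step_rewards: List[float], terminal_outcome: float) -> List[float]:
    """Shift every step reward by the same amount so the total matches the outcome."""
    if not step_rewards:
        raise ValueError("trajectory must contain at least one step")
    shift = decomposition_residual(step_rewards, terminal_outcome) / len(step_rewards)
    return [r - shift for r in step_rewards]


# Example: a three-step trajectory that ended in success (outcome 1.0).
rewards = [0.25, 0.25, 0.5]
print(decomposition_penalty(rewards, 1.0))
print(project_onto_constraint([0.5, 0.5], 0.0))
```

In a training loop, a penalty of this kind would typically be added to the scoring head's loss so that per-step credit stays anchored to the observed outcome; whether ARCO uses a soft penalty, a hard projection, or another mechanism is not specified in this summary.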

HOW THIS AFFECTS YOU

● Researcher: ARCO co-evolves the rubric and the policy jointly, without step-level labels, which is a meaningful advance for RL training of multi-step agents. The trajectory decomposition constraint is the key technical contribution to examine.
