ARCO Co-Evolves Rubric and Policy for Per-Step Credit Assignment in LLM Agents
June 23, 2026
ARCO uses a shared-backbone model with separate generation and scoring heads to produce per-step natural-language rubric criteria and rewards, with a trajectory decomposition constraint tying step rewards to terminal outcomes. The rubric scorer and policy are jointly updated on-policy, eliminating dependence on a frozen closed-source judge.
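The summary does not give the exact form of the trajectory decomposition constraint. As a rough illustration only, the sketch below assumes it asks the per-step rewards of a trajectory to sum to the terminal outcome, and expresses that as a squared-gap penalty. The function names `decomposition_penalty` and `project_to_outcome` are hypothetical and are not taken from ARCO.

```python
from typing import List


def decomposition_penalty(step_rewards: List[float], terminal_outcome: float) -> float:
    """Squared gap between the summed step rewards and the terminal outcome.

    Assumption: the constraint is a sum-to-outcome condition. The source
    does not state the form ARCO actually uses.
    """
    return (sum(step_rewards) - terminal_outcome) ** 2


def project_to_outcome(step_rewards: List[float], terminal_outcome: float) -> List[float]:
    """Shift every step reward by the same amount so the total matches the outcome.

    This is one simple way to satisfy a sum constraint exactly; it is an
    illustrative choice, not a description of ARCO's training procedure.
    """
    if not step_rewards:
        raise ValueError("trajectory must contain at least one step")
    gap = (terminal_outcome - sum(step_rewards)) / len(step_rewards)
    return [r + gap for r in step_rewards]


if __name__ == "__main__":
    raw = [0.1, 0.2, 0.3]   # hypothetical per-step scores from a scoring head
    outcome = 1.0           # hypothetical terminal reward for the trajectory
    print(decomposition_penalty(raw, outcome))
    print(project_to_outcome(raw, outcome))
```

In a training loop, a penalty of this kind would typically be added to the scorer's loss so that step-level credit stays anchored to the outcome signal even though no step-level labels are available.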
HOW THIS AFFECTS YOU
● Researcher: ARCO's joint co-evolution of rubric and policy without step-level labels is a meaningful advance for RL training of multi-step agents. The trajectory decomposition constraint is the key technical contribution to examine.