OPAC Algorithm Enables Offline RL from Trajectory-Level Labels with Tight Statistical Bounds
June 15, 2026
OPAC is a pessimistic actor-critic algorithm for offline RL that learns a latent reward model from trajectory-level scalar outcomes rather than per-step rewards, with a proven high-probability bound of order O(H²C_sa(π*)/n) and a matching lower bound. The work characterizes the exact statistical cost of replacing process-level supervision with outcome-level labels.
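To make the idea of a latent reward model learned from trajectory-level outcomes concrete, here is a minimal sketch. It is not the OPAC algorithm: it assumes a linear per-step reward over hypothetical feature vectors, assumes each trajectory's outcome is the sum of its per-step rewards, and omits the pessimism and actor-critic components entirely. The names `fit_latent_reward` and `demo` are illustrative only.

```python
import numpy as np

def fit_latent_reward(trajectories, outcomes, ridge=1e-3):
    """Estimate per-step reward weights from trajectory-level outcomes.

    Assumption (not from the source): reward is linear in step features,
    r(s, a) = w . phi(s, a), and the observed outcome of a trajectory is
    the sum of its per-step rewards.

    trajectories: list of arrays, each of shape (H, d) holding phi(s_h, a_h)
    outcomes:     array of shape (n,), one scalar label per trajectory
    """
    # Summing features over steps gives: outcome ~= w . sum_h phi(s_h, a_h)
    X = np.stack([np.asarray(t, dtype=float).sum(axis=0) for t in trajectories])
    y = np.asarray(outcomes, dtype=float)
    d = X.shape[1]
    # Ridge-regularized least squares on the summed features
    A = X.T @ X + ridge * np.eye(d)
    return np.linalg.solve(A, X.T @ y)

def demo(seed=0, n=500, H=5, d=3):
    """Synthetic check: recover known weights from noise-free outcome labels."""
    rng = np.random.default_rng(seed)
    w_true = np.array([1.0, -0.5, 0.25])[:d]
    trajs = [rng.normal(size=(H, d)) for _ in range(n)]
    outcomes = np.array([t.sum(axis=0) @ w_true for t in trajs])
    w_hat = fit_latent_reward(trajs, outcomes, ridge=1e-6)
    return w_true, w_hat

if __name__ == "__main__":
    w_true, w_hat = demo()
    print("true weights:     ", w_true)
    print("estimated weights:", w_hat)
```

The recovered weights assign credit to individual steps even though only one label per trajectory was observed, which is the kind of information a downstream critic would consume in place of per-step rewards.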
HOW THIS AFFECTS YOU
● Researcher: Provides the first sharp statistical characterization of offline RL under trajectory-level supervision, useful for settings like RLHF where per-step rewards are unavailable.