
OPAC Algorithm Enables Offline RL from Trajectory-Level Labels with Tight Statistical Bounds

June 15, 2026

OPAC is a pessimistic actor-critic algorithm for offline RL that learns a latent reward model from trajectory-level scalar outcomes instead of per-step rewards. It comes with a proven high-probability bound of order O(H² C_sa(π*) / n) and a matching lower bound. The work characterizes the exact statistical cost of replacing process-level supervision with outcome-level labels.
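To make the idea of a latent reward model learned from trajectory-level outcomes concrete, here is a minimal sketch in Python. It is not OPAC itself and is not taken from the original work: it assumes a small tabular setting, assumes each outcome is the sum of hidden per-step rewards, fits those rewards by ridge regression, and uses a simple uncertainty penalty as a stand-in for pessimism. The function names (`fit_latent_reward`, `pessimistic_reward`) and the parameters `ridge` and `beta` are hypothetical, and the actor-critic policy update is omitted entirely.

```python
import numpy as np


def fit_latent_reward(trajectories, outcomes, n_states, n_actions, ridge=1e-3):
    """Ridge-regress per-(state, action) rewards so that their sum along each
    trajectory matches that trajectory's scalar outcome (an assumption of
    this sketch, not a detail confirmed by the source)."""
    d = n_states * n_actions
    counts = np.zeros((len(trajectories), d))
    for i, traj in enumerate(trajectories):
        for s, a in traj:
            counts[i, s * n_actions + a] += 1.0  # visit counts per (s, a)
    cov = counts.T @ counts + ridge * np.eye(d)
    theta = np.linalg.solve(cov, counts.T @ np.asarray(outcomes, dtype=float))
    return theta.reshape(n_states, n_actions), cov


def pessimistic_reward(theta, cov, beta=1.0):
    """Lower each reward estimate by beta times a per-entry uncertainty width
    (illustrative pessimism; the penalty form is assumed)."""
    width = np.sqrt(np.diag(np.linalg.inv(cov)))
    return theta - beta * width.reshape(theta.shape)


# Synthetic data: 500 trajectories of horizon 5 over 3 states and 2 actions.
rng = np.random.default_rng(0)
n_states, n_actions, horizon, n_traj = 3, 2, 5, 500
true_r = rng.uniform(0.0, 1.0, size=(n_states, n_actions))
trajectories = [
    [(int(rng.integers(n_states)), int(rng.integers(n_actions)))
     for _ in range(horizon)]
    for _ in range(n_traj)
]
# Only one scalar label per trajectory is observed.
outcomes = [sum(true_r[s, a] for s, a in traj) for traj in trajectories]

theta_hat, cov = fit_latent_reward(trajectories, outcomes, n_states, n_actions)
r_pess = pessimistic_reward(theta_hat, cov, beta=1.0)
```

In this toy setting the per-step rewards are recoverable because the trajectories cover every state-action pair. Where coverage is poor, the penalty in `pessimistic_reward` grows, which is the intuition behind a coverage term such as C_sa(π*) appearing in the bound.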

HOW THIS AFFECTS YOU

● Researcher: Provides the first sharp statistical characterization of offline RL under trajectory-level supervision, useful for settings such as RLHF where per-step rewards are unavailable.

Read original: huggingface.co
