● builder: Worth evaluating for RL training of multi-turn search agents where outcome-only rewards produce noisy gradients due to mixed-quality intermediate steps.
● researcher: The Bayesian posterior ratio as a credit assignment proxy is a principled alternative to heuristic step-level reward shaping for long-horizon agentic RL.