Position-aware trust regions in PPO improve LLM reasoning training | HACKOBAR_