● builder: Worth watching if you're building character or persona agents: the reward-hacking observation under LLM judges is a practical pitfall to account for in your RLHF pipeline.
● researcher: The three-step CoT decomposition combined with RL alignment offers a testable framework for studying character fidelity beyond SFT baselines.