● builder: You can potentially improve reasoning performance on low-data tasks without spinning up a full RL training loop.
● researcher: Outperforming GRPO without parameter updates on reasoning benchmarks is a strong claim worth scrutinizing; the contrastive insight generation mechanism is the key variable to isolate.