● builder: You can use this curated 8-dataset recipe as a practical starting point for fine-tuning agents that must reason over long trajectories without complex reward design.
● researcher: Demonstrates that data composition across complementary task families is a stronger lever than reward shaping for long-context RL, with a reproducible GRPO baseline.