●builderWorth watching because it directly measures whether memory-augmented agent architectures deliver real improvement over stateless LLMs in production-relevant domains.
●researcherThe gain metric and expert-validated task design give a cleaner signal for measuring online learning than existing benchmarks that conflate prior knowledge with in-context adaptation.