AgenticSTS Benchmark for Bounded-Memory Long-Horizon LLM Agents
July 3, 2026
AgenticSTS introduces a bounded-memory testbed using typed retrieval to prevent context pollution in long-horizon tasks. Evaluated in Slay the Spire 2, frontier LLMs failed to achieve a single win across all tested configurations.
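To make the idea of bounded memory with typed retrieval concrete, here is a minimal Python sketch. It is not the benchmark's implementation: the class name `BoundedTypedMemory`, the type tags ("combat", "map"), the per-type capacity, and the oldest-first eviction rule are all assumptions chosen for illustration. It only shows the general pattern of capping stored notes and returning only those of the requested type, so unrelated notes do not pollute the context.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class MemoryItem:
    kind: str  # hypothetical type tag, e.g. "combat" or "map"
    text: str


class BoundedTypedMemory:
    """Keeps at most `capacity` items per type; the oldest item is evicted first."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._stores = {}  # maps a type tag to a bounded deque of MemoryItem

    def add(self, kind, text):
        # deque(maxlen=...) silently drops the oldest entry once it is full.
        store = self._stores.setdefault(kind, deque(maxlen=self.capacity))
        store.append(MemoryItem(kind, text))

    def retrieve(self, kind):
        # Only items of the requested type are returned, so notes of other
        # types never enter the prompt built for this decision.
        return [item.text for item in self._stores.get(kind, ())]


# Made-up notes, purely for illustration.
memory = BoundedTypedMemory(capacity=2)
memory.add("combat", "note 1: enemy pattern")
memory.add("map", "note 2: route choice")
memory.add("combat", "note 3: block timing")
memory.add("combat", "note 4: potion use")

print(memory.retrieve("combat"))
```

With a capacity of 2, the first combat note is evicted when the third one arrives, and the map note never appears in a combat lookup. A real system would likely use a smarter eviction or relevance policy than first-in, first-out.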
HOW THIS AFFECTS YOU
● Builder: This highlights the current inability of frontier models to handle long-horizon reasoning with bounded context.
● Researcher: You can use this benchmark to isolate the impact of specific memory components in agentic architectures.