[arXiv] score: 0.23

AgenticSTS Benchmark for Bounded-Memory Long-Horizon LLM Agents

July 3, 2026

AgenticSTS introduces a bounded-memory testbed that uses typed retrieval to prevent context pollution in long-horizon tasks. When evaluated in Slay the Spire 2, frontier LLMs did not achieve a single win in any tested configuration.

HOW THIS AFFECTS YOU

● builder: This highlights the current inability of frontier models to handle long-horizon reasoning with bounded context.

● researcher: You can use this benchmark to isolate the impact of specific memory components in agentic architectures.

read original ↗ arxiv.org
