[r/LocalLLaMA] score: 0.14
93k-event MMO dataset tests 8 open-weight models on long-horizon agent behavior
May 27, 2026
A 10-day persistent MMO stress test ran 25 agents across 8 open-weight models including Qwen3 235B, Nemotron 30B, Ministral 14B/8B, and Gemma 3 12B, logging ~93,000 events covering long-horizon planning, resource contention, and adversarial dynamics. The dataset is public on HuggingFace under CC-BY-4.0 and includes model reasoning traces for ~70% of actions. This is a rare naturalistic benchmark for multi-agent behavior beyond static evals.
HOW THIS AFFECTS YOU
● builder: Worth examining if you're building persistent or multi-agent systems. The dataset reveals how specific open-weight models degrade or adapt under sustained adversarial pressure.
● researcher: The 93k-event dataset with reasoning traces offers a novel evaluation surface for long-horizon planning and adversarial multi-agent dynamics not captured by standard benchmarks.