[r/LocalLLaMA] score: 0.14

93k-event MMO dataset tests 8 open-weight models on long-horizon agent behavior

May 27, 2026

A 10-day persistent MMO stress test ran 25 agents across 8 open-weight models, including Qwen3 235B, Nemotron 30B, Ministral 14B/8B, and Gemma 3 12B, and logged ~93,000 events covering long-horizon planning, resource contention, and adversarial dynamics. The dataset is public on Hugging Face under CC-BY-4.0 and includes model reasoning traces for ~70% of actions. It is a rare naturalistic benchmark for multi-agent behavior that goes beyond static evals.
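As a rough illustration of how the reasoning-trace coverage figure could be checked per model, here is a minimal Python sketch. The record layout (`model`, `action`, `reasoning` fields) and the sample rows are assumptions made up for this example. The post summary does not describe the dataset's actual schema, so the field names would need to be adapted to the real files.

```python
from collections import defaultdict

# Hypothetical sample records; the real dataset's field names may differ.
events = [
    {"model": "Qwen3 235B", "action": "trade", "reasoning": "Prices are low."},
    {"model": "Qwen3 235B", "action": "move", "reasoning": None},
    {"model": "Gemma 3 12B", "action": "attack", "reasoning": "Rival holds the mine."},
    {"model": "Gemma 3 12B", "action": "gather", "reasoning": ""},
    {"model": "Gemma 3 12B", "action": "trade", "reasoning": "Need ore."},
]


def trace_coverage(events):
    """Return {model: fraction of events with a non-empty reasoning trace}."""
    totals = defaultdict(int)
    traced = defaultdict(int)
    for event in events:
        model = event["model"]
        totals[model] += 1
        # Treat None and empty strings as "no trace recorded".
        if event.get("reasoning"):
            traced[model] += 1
    return {model: traced[model] / totals[model] for model in totals}


coverage = trace_coverage(events)
for model, share in sorted(coverage.items()):
    print(f"{model}: {share:.0%} of actions have a reasoning trace")
```

Grouping by model before computing the share makes it easy to see whether the ~70% overall figure is spread evenly or driven by a few models.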


HOW THIS AFFECTS YOU

● builder: Worth examining if you're building persistent or multi-agent systems. The dataset shows how specific open-weight models degrade or adapt under sustained adversarial pressure.

● researcher: The 93k-event dataset with reasoning traces offers a novel evaluation surface for long-horizon planning and adversarial multi-agent dynamics not captured by standard benchmarks.

SOURCE

https://www.reddit.com/r/LocalLLaMA/comments/1tp6pg7/i_ran_8_openweight_models_as_agents_in_a/
