[arXiv] score: 0.15

13 LLMs score ~6% on open-ended multi-agent coordination benchmark ALEM

June 9, 2026

ALEM is a JAX-based, long-horizon multi-agent benchmark built on Craftax dynamics, with procedural coordination tasks, soft specialization, and communication. Evaluated zero-shot, all 13 LLMs average roughly 6% normalized return. Gemini-2.5-Pro-High performs best on the hardest coordination settings but still falls far below trained MARL baselines.

HOW THIS AFFECTS YOU

● Researcher: ALEM exposes specific failure modes in LLM multi-agent coordination that structured benchmarks miss, which makes it useful for targeting agentic capability improvements.
