13 LLMs score ~6% on open-ended multi-agent coordination benchmark ALEM
June 9, 2026
ALEM is a JAX-based, long-horizon multi-agent benchmark built on Craftax dynamics, with procedural coordination tasks, soft specialization, and communication. Evaluated zero-shot, the 13 LLMs average roughly 6% normalized return. Gemini-2.5-Pro-High performs best on the hardest coordination settings but still falls far below trained MARL baselines.
HOW THIS AFFECTS YOU
● Researcher: ALEM exposes specific failure modes in LLM multi-agent coordination that structured benchmarks miss, useful for targeting agentic capability improvements.