[arXiv] score: 0.15

APB Benchmark: 4,209 Cases Diagnose LLM Agent Planning Failures

June 4, 2026

APB separates planning failures from execution failures using 4,209 multimodal cases across 22 domains. It evaluates 12 MLLMs on long-horizon planning, robustness to tool noise, and detection of unsolvable tasks. In validation on ToolSandbox and tau-bench, APB-guided refinement improves both plan correctness and downstream execution. All tested models show systematic weaknesses.

cs.CL

HOW THIS AFFECTS YOU

● builder: You can use APB's five settings, including broken tools and unsolvable tasks, to stress-test agent planning logic before shipping.

● researcher: APB provides a decomposed diagnostic signal that end-to-end benchmarks lack, letting you determine whether a model failure originates in planning or in execution.

SOURCE

https://arxiv.org/abs/2606.04874
