[arXiv] score: 0.15
APB Benchmark: 4,209 Cases Diagnose LLM Agent Planning Failures
June 4, 2026
APB isolates planning failures from execution failures across 4,209 multimodal cases in 22 domains, testing 12 MLLMs on long-horizon planning, tool-noise robustness, and unsolvable-task detection. In validation on ToolSandbox and tau-bench, APB-guided refinement improves plan correctness and downstream execution. The evaluation finds systematic weaknesses across all tested models.
cs.CL
HOW THIS AFFECTS YOU
● builder: You can use APB's five settings, which include broken tools and unsolvable tasks, to stress-test agent planning logic before shipping.
● researcher: APB provides a decomposed diagnostic signal that end-to-end benchmarks lack, letting you isolate whether a model failure originates in planning or in execution.