AA-Briefcase Benchmark Tests Long-Horizon Knowledge Work; Claude Fable 5 Leads at $31/Task

June 18, 2026

Artificial Analysis launched AA-Briefcase, a benchmark for multi-week agentic knowledge work projects with thousands of source files and private holdout tests. Claude Fable 5 leads with an Elo of 1587 at $31 average cost per task; Claude Opus 4.8 scores 1356 at $10.40, and GLM 5.2 reaches 1266.

HOW THIS AFFECTS YOU

● Builder: Cost-per-task data ($31 for top performance vs. $10.40 for second place) gives concrete tradeoff numbers for scoping agentic knowledge work products.

● Researcher: Private holdout tests and unsaturated scoring make this a more credible long-horizon agentic benchmark than most current alternatives.

● Founder: The 3x cost gap between top and second-place models is a real product decision point for anyone building long-horizon agentic workflows.
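
To make the tradeoff concrete, the reported figures can be run through a short calculation. The sketch below is illustrative only: it assumes AA-Briefcase ratings follow the conventional Elo logistic curve with a 400-point scale, which the summary does not confirm, and the function names are hypothetical helpers rather than anything published by Artificial Analysis.

```python
# Illustrative arithmetic on the figures quoted in the summary above.
# ASSUMPTION: ratings follow the conventional Elo logistic curve
# (base 10, 400-point scale). The benchmark's actual rating method
# is not stated in the summary.

def cost_ratio(cost_a: float, cost_b: float) -> float:
    """How many times more expensive A is than B, per task."""
    return cost_a / cost_b


def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected head-to-head score of A against B under standard Elo."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


if __name__ == "__main__":
    # Figures from the summary: Claude Fable 5 vs. Claude Opus 4.8.
    fable_elo, fable_cost = 1587, 31.00
    opus_elo, opus_cost = 1356, 10.40

    ratio = cost_ratio(fable_cost, opus_cost)
    preference = elo_expected_score(fable_elo, opus_elo)

    print(f"Cost ratio (Fable 5 / Opus 4.8): {ratio:.2f}x")
    print(f"Elo gap: {fable_elo - opus_elo} points")
    print(f"Expected head-to-head score for Fable 5: {preference:.2f}")
```

Under that assumption, the 231-point gap corresponds to the leader being preferred in roughly four of five head-to-head comparisons, at about three times the cost per task. Whether that margin justifies the spend depends on how costly a weaker result is for the workflow in question.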

Source: x.com