[HUGGINGFACE] score: 0.47

EnterpriseClawBench: 852 Real-World Agent Tasks, Best Score Only 0.663

June 21, 2026

EnterpriseClawBench derives 852 reproducible enterprise agent tasks from real workplace sessions, each with fixtures, role classes, skill subclasses, and semantic rubrics. The best-performing configuration, Codex with GPT-5.5, scores only 0.663, exposing a significant gap in enterprise agent capability. Benchmark data is not released, but the construction and evaluation protocol is the reusable contribution.

HOW THIS AFFECTS YOU

● builder: A 0.663 ceiling even with GPT-5.5 signals that enterprise agent reliability remains unsolved, which is useful calibration before shipping agentic workflows.

● researcher: The construction protocol offers a replicable methodology for building grounded enterprise agent evals from proprietary session logs.
