Counsel Dataset Benchmarks LLM-as-Judge Reliability on Agentic Tasks

June 18, 2026

Counsel is presented as the first public meta-evaluation dataset for agentic task assessment. It contains process-level critiques written by open-weight LLM judges on the tau-bench and DA-Code benchmarks, each paired with human meta-evaluations of those critiques. The dataset targets the reliability gap in LLM-as-judge pipelines, which are used to scale agentic evaluation because human annotation costs hours per trajectory.

HOW THIS AFFECTS YOU

● Builder: You can use Counsel to check whether your LLM-as-judge setup for agent evaluation agrees with human judgment before you trust it for training data curation.

● Researcher: Counsel provides a concrete meta-evaluation testbed for measuring judge reliability on multi-step agentic trajectories, a piece that has been missing from the evaluation stack.
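
One way to make the judge-versus-human check concrete is a chance-corrected agreement score. The sketch below is only an illustration, not Counsel's own method or schema: it assumes you have already reduced each trajectory to one categorical verdict from your judge and one from a human annotator, and the "pass"/"fail" labels and the function names are hypothetical.

```python
from collections import Counter


def raw_agreement(judge, human):
    """Fraction of trajectories where judge and human give the same label."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


def cohen_kappa(judge, human):
    """Cohen's kappa: agreement corrected for what chance alone would give."""
    observed = raw_agreement(judge, human)
    n = len(judge)
    judge_counts = Counter(judge)
    human_counts = Counter(human)
    # Chance agreement: probability both raters pick the same label
    # if each drew independently from their own label distribution.
    labels = set(judge_counts) | set(human_counts)
    expected = sum(judge_counts[l] * human_counts[l] for l in labels) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; agreement is trivial.
        return 1.0
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Hypothetical verdicts, one per trajectory (not taken from Counsel).
    judge_labels = ["pass", "pass", "fail", "fail", "pass", "fail"]
    human_labels = ["pass", "fail", "fail", "fail", "pass", "pass"]
    print("raw agreement:", raw_agreement(judge_labels, human_labels))
    print("Cohen's kappa:", cohen_kappa(judge_labels, human_labels))
```

Raw agreement can look high simply because one label dominates, so kappa is usually the more cautious number to read before trusting a judge for data curation.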

Read original: huggingface.co
