Counsel Dataset Benchmarks LLM-as-Judge Reliability on Agentic Tasks | HACKOBAR_