[arXiv]

CombEval Benchmark Exposes LLM Failures on Combinatorial Counting

June 19, 2026

CombEval dynamically generates combinatorial counting problems from typed Cofola specifications, with every answer verified by a solver, and uses them to test 11 LLMs on problems involving ordered objects, indistinguishable elements, and nested dependencies. The models are consistently brittle on positional constraints and on interpreting constraints, and code-augmented settings also fail on complex nested structures.
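To make "solver-verified" concrete, here is a minimal sketch of the general idea on a toy problem: counting arrangements of indistinguishable letters under a positional constraint, with a closed-form answer checked against exhaustive enumeration. It does not use Cofola or the benchmark's solver; the function names and the example problem are hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import permutations
from math import factorial


def brute_force_count(letters, forbidden_first):
    """Count distinct arrangements of `letters` whose first symbol is not `forbidden_first`."""
    # set() collapses arrangements that differ only by swapping identical letters.
    distinct = set(permutations(letters))
    return sum(1 for arrangement in distinct if arrangement[0] != forbidden_first)


def multiset_permutations(counts):
    """Number of distinct orderings of a multiset: n! / (c1! * c2! * ...)."""
    total = factorial(sum(counts.values()))
    for c in counts.values():
        total //= factorial(c)
    return total


def closed_form_count(letters, forbidden_first):
    """Same count by complement: all arrangements minus those starting with `forbidden_first`."""
    counts = Counter(letters)
    all_arrangements = multiset_permutations(counts)
    if counts[forbidden_first] == 0:
        return all_arrangements  # the constraint excludes nothing
    counts[forbidden_first] -= 1  # fix one copy in the first position
    return all_arrangements - multiset_permutations(counts)


# Hypothetical toy problem: arrangements of "BANANA" that do not start with "B".
assert brute_force_count("BANANA", "B") == closed_form_count("BANANA", "B")
```

Enumeration is only feasible for small instances, which is why it serves here as a cross-check rather than as a general counting method.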

HOW THIS AFFECTS YOU

● Researcher: The dynamic generation approach and error taxonomy provide a reusable diagnostic for isolating whether combinatorial failures stem from constraint parsing or from applying counting principles.
