CombEval Benchmark Exposes LLM Failures on Combinatorial Counting
June 19, 2026
CombEval uses typed Cofola specifications with solver-verified answers to dynamically generate combinatorial counting problems, and tests 11 LLMs on problems involving ordered objects, indistinguishable elements, and nested dependencies. The models are consistently brittle on positional constraints and on constraint interpretation, and code-augmented settings also fail on complex nested structures.
HOW THIS AFFECTS YOU
● Researcher: The dynamic generation approach and error taxonomy provide a reusable diagnostic for isolating whether combinatorial failures stem from constraint parsing or from applying counting principles.
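
To make the idea of dynamically generated, verified counting problems concrete, here is a minimal Python sketch. It is not CombEval's code and does not use Cofola: the function names (`generate_problem`, `count_with_first_fixed`, `brute_force`) are hypothetical, and exhaustive enumeration stands in for the solver purely as an assumption for illustration. The sketch generates a small multiset of letters (indistinguishable elements), imposes a positional constraint (a chosen letter must occupy the first position), and checks a closed-form count against enumeration.

```python
from itertools import permutations
from math import factorial
import random


def multiset_arrangements(counts):
    """Distinct arrangements of a multiset: n! / (c1! * c2! * ...)."""
    total = factorial(sum(counts.values()))
    for c in counts.values():
        total //= factorial(c)
    return total


def count_with_first_fixed(counts, letter):
    """Closed-form count of arrangements whose first position is `letter`.

    Fixing one copy of `letter` in front leaves a smaller multiset to arrange.
    """
    if counts.get(letter, 0) == 0:
        return 0
    rest = dict(counts)
    rest[letter] -= 1
    return multiset_arrangements(rest)


def brute_force(counts, letter):
    """Reference answer by enumeration (a stand-in for a solver, by assumption)."""
    items = [ch for ch, n in counts.items() for _ in range(n)]
    # A set removes duplicates that arise from indistinguishable copies.
    return len({p for p in permutations(items) if p[0] == letter})


def generate_problem(rng):
    """Hypothetical generator: random multiplicities plus a positional constraint."""
    letters = "ABC"
    counts = {ch: rng.randint(1, 3) for ch in letters}
    return counts, rng.choice(letters)


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(5):
        counts, letter = generate_problem(rng)
        closed = count_with_first_fixed(counts, letter)
        assert closed == brute_force(counts, letter)
        print(counts, "first =", letter, "->", closed)
```

Because every generated instance carries a reference answer, a wrong model response can be compared against it; separating parsing errors from counting errors would additionally require inspecting the model's intermediate formalization, which this sketch does not attempt.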