[arXiv]

Mask-Proof: An LLM-based Automated Data Curation Pipeline on Mathematical Proofs

June 16, 2026

Mask-Proof converts real mathematical proofs into masked-step reconstruction tasks: key formula steps are hidden, and a model's reconstruction of each hidden step is scored by an LLM equivalence judge, with repeated voting used to stabilize the verdict. The resulting benchmark, Mask-ProofBench, covers 292 curated problems across diverse research domains and is used to evaluate 17 models. It targets the measurement of step-level reasoning in long proofs, a gap left by benchmarks that only check final answers.
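The scoring loop described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the paper's implementation: the function names (`mask_steps`, `vote_equivalent`, `score_reconstruction`), the `[MASK]` placeholder, the simple-majority rule, and the default of five votes are all assumptions, and `toy_judge` is a whitespace-insensitive string comparison standing in for a real LLM equivalence judge.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, List, Sequence

MASK = "[MASK]"  # assumed placeholder token; the paper's format may differ

# A judge takes (reference_step, candidate_step) and returns True if equivalent.
Judge = Callable[[str, str], bool]


def mask_steps(steps: Sequence[str], hidden: Iterable[int]) -> List[str]:
    """Return the proof with the steps at the `hidden` indices replaced by MASK."""
    hidden_set = set(hidden)
    return [MASK if i in hidden_set else s for i, s in enumerate(steps)]


def vote_equivalent(judge: Judge, reference: str, candidate: str,
                    n_votes: int = 5) -> bool:
    """Query the judge n_votes times and accept on a strict majority (assumed rule)."""
    votes = Counter(bool(judge(reference, candidate)) for _ in range(n_votes))
    return votes[True] > n_votes / 2


def score_reconstruction(judge: Judge, steps: Sequence[str],
                         hidden: Iterable[int], predictions: Dict[int, str],
                         n_votes: int = 5) -> float:
    """Fraction of hidden steps whose prediction the judge accepts by vote."""
    hidden_list = sorted(set(hidden))
    if not hidden_list:
        raise ValueError("at least one step must be hidden")
    accepted = sum(
        vote_equivalent(judge, steps[i], predictions.get(i, ""), n_votes)
        for i in hidden_list
    )
    return accepted / len(hidden_list)


def toy_judge(reference: str, candidate: str) -> bool:
    """Stand-in for an LLM judge: equal after removing all whitespace."""
    return "".join(reference.split()) == "".join(candidate.split())


proof = ["a = b + c", "a^2 = (b + c)^2", "a^2 = b^2 + 2bc + c^2"]
hidden = [1, 2]
predictions = {1: "a^2=(b+c)^2", 2: "a^2 = b^2 + c^2"}

masked = mask_steps(proof, hidden)
score = score_reconstruction(toy_judge, proof, hidden, predictions)
```

With a real LLM judge the individual verdicts can vary from call to call, which is why the vote is aggregated; the deterministic `toy_judge` here returns the same verdict on every call, so the vote only demonstrates the control flow.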

