[arXiv]
Mask-Proof: An LLM-based Automated Data Curation Pipeline on Mathematical Proofs
June 16, 2026
Mask-Proof converts real mathematical proofs into masked-step reconstruction tasks: key formula steps are hidden, and a model's reconstruction of each step is scored by an LLM equivalence judge that uses repeated voting for stability. The resulting benchmark, Mask-ProofBench, covers 292 curated problems across diverse research domains and is used to evaluate 17 models. It targets step-level reasoning in long proofs, a gap left by benchmarks that check only final answers.
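
The masking and repeated-voting idea can be illustrated with a minimal sketch. This is not the authors' implementation: the names `mask_steps`, `vote_equivalent`, and `toy_judge`, the `[MASK]` placeholder, the default of five votes, and the strict-majority rule are all assumptions made for illustration. The summary states only that key steps are hidden and that the judge votes repeatedly. A real pipeline would call an LLM where `toy_judge` is used here.

```python
from collections import Counter
from typing import Callable, List, Sequence

MASK = "[MASK]"  # hypothetical placeholder token, not taken from the paper


def mask_steps(steps: Sequence[str], hidden: Sequence[int]) -> List[str]:
    """Return a copy of the proof with the steps at `hidden` replaced by MASK."""
    hidden_set = set(hidden)
    return [MASK if i in hidden_set else step for i, step in enumerate(steps)]


def vote_equivalent(
    judge: Callable[[str, str], bool],
    reference: str,
    candidate: str,
    n_votes: int = 5,
) -> bool:
    """Query the judge n_votes times and accept on a strict majority.

    The majority rule is an assumption; the source only mentions
    "repeated voting for stability".
    """
    votes = Counter(judge(reference, candidate) for _ in range(n_votes))
    return votes[True] > n_votes / 2


def toy_judge(reference: str, candidate: str) -> bool:
    """Deterministic stand-in for an LLM equivalence judge (whitespace-insensitive match)."""
    return "".join(reference.split()) == "".join(candidate.split())


proof = ["Let f(x) = x^2.", "Then f'(x) = 2x.", "Hence f'(1) = 2."]
task = mask_steps(proof, hidden=[1])
accepted = vote_equivalent(toy_judge, proof[1], "Then f'(x)=2x.")
```

Here `task` hides the second step of `proof`, and `accepted` records whether the candidate reconstruction wins a majority of judge votes. With a stochastic LLM judge, aggregating several verdicts reduces the chance that a single noisy judgment decides the score.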