MuSViT Foundation Model Pre-trained on 9.7 Million Sheet Music Pages
June 29, 2026
MuSViT is a Vision Transformer (ViT) encoder pre-trained with masked autoencoding (MAE) on the IMSLP corpus. Its two-stage curriculum combines synthetic typeset scores with real-world scores so that the encoder can handle complex music notation in both recognition and detection tasks.
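For readers unfamiliar with masked autoencoding, the sketch below illustrates the general idea on a single page: the image is cut into patches, most patches are hidden, and only the visible ones would be passed to the encoder, with a decoder later trained to reconstruct the hidden ones. This is not MuSViT's actual code. The patch size (16), the mask ratio (0.75, the default from the original MAE work), the random stand-in page, and the helper names `patchify` and `random_masking` are all assumptions made for illustration.

```python
import numpy as np


def patchify(image, patch):
    """Split an (H, W) grayscale page into flattened, non-overlapping patches."""
    h, w = image.shape
    assert h % patch == 0 and w % patch == 0, "page must tile evenly"
    gh, gw = h // patch, w // patch
    # (gh, patch, gw, patch) -> (gh, gw, patch, patch) -> (gh * gw, patch * patch)
    x = image.reshape(gh, patch, gw, patch).transpose(0, 2, 1, 3)
    return x.reshape(gh * gw, patch * patch)


def random_masking(patches, mask_ratio, rng):
    """Hide a random subset of patches, MAE-style.

    Returns the visible patches, a boolean mask (True = hidden), and the
    sorted indices of the patches that stay visible.
    """
    n = patches.shape[0]
    n_keep = int(n * (1 - mask_ratio))
    keep_idx = np.sort(rng.permutation(n)[:n_keep])
    mask = np.ones(n, dtype=bool)
    mask[keep_idx] = False
    return patches[keep_idx], mask, keep_idx


rng = np.random.default_rng(0)
page = rng.random((64, 64))           # stand-in for a scanned score page (assumption)
patches = patchify(page, patch=16)    # 16 patches of 256 pixels each
visible, mask, keep_idx = random_masking(patches, mask_ratio=0.75, rng=rng)

# Only `visible` would be fed to the encoder; the reconstruction loss
# would be computed on the patches where `mask` is True.
print(patches.shape, visible.shape, int(mask.sum()))
```

Because the encoder only sees the visible quarter of the patches in this setup, pre-training is cheaper per image than processing the full page, which is one reason the approach scales to corpora with millions of pages.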
HOW THIS AFFECTS YOU
● Builder: You can leverage a domain-specific backbone for music-related computer vision applications.
● Researcher: This establishes a foundational representation for the underserved sheet music domain.