
MuSViT Foundation Model Pre-trained on 9.7 Million Sheet Music Pages

June 29, 2026

MuSViT is a Vision Transformer (ViT) encoder pre-trained as a Masked Autoencoder (MAE) on the IMSLP corpus. Its two-stage curriculum combines synthetic typeset scores with real-world scores, so that the encoder can handle complex music notation in both recognition and detection tasks.
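
For readers unfamiliar with masked-autoencoder pre-training, the core step is to split each page image into fixed-size patches, hide a random subset, and train the model to reconstruct the hidden patches from the visible ones. The sketch below shows only that patch-selection step in plain Python. The function name `random_patch_mask` is hypothetical, and the image size, patch size, and mask ratio are illustrative values, not MuSViT's published settings.

```python
import random


def random_patch_mask(height, width, patch_size, mask_ratio, seed=None):
    """Choose which patches of an image grid to hide (MAE-style random masking).

    Hypothetical helper for illustration; not taken from the MuSViT code.
    Returns two sorted lists of patch indices: (visible, masked).
    """
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    if not 0.0 <= mask_ratio <= 1.0:
        raise ValueError("mask_ratio must be between 0 and 1")

    rows = height // patch_size
    cols = width // patch_size
    num_patches = rows * cols
    num_masked = int(num_patches * mask_ratio)

    # Shuffle patch indices and hide the first num_masked of them.
    rng = random.Random(seed)
    order = list(range(num_patches))
    rng.shuffle(order)

    masked = sorted(order[:num_masked])
    visible = sorted(order[num_masked:])
    return visible, masked


# Illustrative values only (assumed, not MuSViT's configuration).
visible, masked = random_patch_mask(224, 224, patch_size=16, mask_ratio=0.75, seed=0)
print(len(visible), "visible patches,", len(masked), "masked patches")
```

In a full pipeline the encoder would see only the `visible` patches, and a lightweight decoder would be trained to reconstruct the pixels of the `masked` ones.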

HOW THIS AFFECTS YOU

● Builder: You can use MuSViT as a domain-specific backbone for music-related computer vision applications.

● Researcher: This establishes a foundational representation for the underserved sheet music domain.

Source: huggingface.co
