NarraBERT Maps Narrative Structure Across 3T-Token Dolma Corpus
June 19, 2026
A RoBERTa-based model fine-tuned on 400 annotated passages classifies 11 narrative dimensions across 3M passages of the Dolma pretraining corpus, producing the NarraDolma dataset. The findings show that narrative qualities are unevenly distributed across pretraining sources and topics, and that a continuous, multidimensional structure underlies web text. The dataset is relevant for understanding how the narrative composition of pretraining data shapes model behavior.
HOW THIS AFFECTS YOU
● Researcher: NarraDolma gives you a labeled dataset to study how narrative distribution in pretraining corpora correlates with downstream model capabilities.
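
As a rough illustration of the kind of analysis the labeled dataset could support, the sketch below averages per-passage dimension scores within each pretraining source. The record layout, the field names `source` and `scores`, and the dimension names `dim_a` and `dim_b` are assumptions made for illustration; the actual NarraDolma schema is not described here.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: field names and dimension names are placeholders,
# not the real NarraDolma schema.
records = [
    {"source": "web", "scores": {"dim_a": 0.2, "dim_b": 0.4}},
    {"source": "web", "scores": {"dim_a": 0.6, "dim_b": 0.2}},
    {"source": "books", "scores": {"dim_a": 0.9, "dim_b": 0.7}},
]


def mean_scores_by_source(rows):
    """Average each narrative-dimension score within each source."""
    grouped = defaultdict(lambda: defaultdict(list))
    for row in rows:
        for dim, score in row["scores"].items():
            grouped[row["source"]][dim].append(score)
    # Collapse each list of scores to its mean.
    return {
        src: {dim: mean(vals) for dim, vals in dims.items()}
        for src, dims in grouped.items()
    }


profile = mean_scores_by_source(records)
for src, dims in profile.items():
    print(src, dims)
```

Comparing these per-source profiles is one simple way to see whether a narrative dimension is concentrated in particular sources before relating it to downstream behavior.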