Social and STEM Reasoning in OLMo3-7B Draw on Distinct Corpus Regions
June 19, 2026
Using TrackStar, a gradient-based training-data attribution method, this work maps which of the 576 format-topic bins in Dolma3 support social reasoning (SocialIQA, MMLU Social Sciences) versus STEM reasoning (ARC-Challenge, MMLU STEM) in OLMo3-7B. The two capabilities trace back to qualitatively different corpus regions, not just different topics. The resulting bin-level attribution map offers a concrete basis for capability-targeted data curation.
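To make the bin-level contrast concrete, here is a minimal sketch of one way per-document attribution scores could be aggregated into format-topic bins and compared across two benchmark groups. It is not the TrackStar implementation: the function names, the bin labels, and the toy scores are all hypothetical, and the sketch assumes attribution scores have already been computed for every document.

```python
from collections import defaultdict


def mean_score_per_bin(doc_bins, doc_scores):
    """Average per-document attribution scores within each (format, topic) bin."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for bin_key, score in zip(doc_bins, doc_scores):
        totals[bin_key] += score
        counts[bin_key] += 1
    return {bin_key: totals[bin_key] / counts[bin_key] for bin_key in totals}


def contrast_bins(doc_bins, social_scores, stem_scores):
    """Rank bins by (mean social score - mean STEM score), most social-leaning first."""
    social = mean_score_per_bin(doc_bins, social_scores)
    stem = mean_score_per_bin(doc_bins, stem_scores)
    diff = {bin_key: social[bin_key] - stem[bin_key] for bin_key in social}
    return sorted(diff.items(), key=lambda item: item[1], reverse=True)


# Toy inputs: hypothetical bin labels and made-up scores, for illustration only.
doc_bins = [
    ("forum", "relationships"),
    ("forum", "relationships"),
    ("textbook", "physics"),
    ("textbook", "physics"),
    ("news", "politics"),
]
social_scores = [0.8, 0.6, 0.1, 0.2, 0.5]  # attribution toward social benchmarks
stem_scores = [0.1, 0.2, 0.9, 0.7, 0.3]    # attribution toward STEM benchmarks

ranking = contrast_bins(doc_bins, social_scores, stem_scores)
for bin_key, gap in ranking:
    print(bin_key, round(gap, 2))
```

Bins at the top of the ranking lean toward the social benchmarks and bins at the bottom lean toward STEM. Averaging within each bin, rather than summing, keeps large bins from dominating purely because of their size; a real analysis would likely also need to normalize scores across benchmarks before comparing them.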
HOW THIS AFFECTS YOU
● Researcher: This attribution methodology lets you identify which pretraining data regions drive specific reasoning capabilities, directly informing targeted data curation for future training runs.