OpenBMB Releases 600B-Token Pretraining Dataset and 15M SFT Dataset
June 1, 2026
OpenBMB, Tsinghua NLP, and ModelBest released two datasets: Ultra-FineWeb-L3, a synthetic pretraining dataset of 600B+ tokens (400B+ English, 200B+ Chinese), and UltraData-SFT-2605, a set of 15M+ SFT samples covering math, code, and instruction-following, each annotated as thinking or non-thinking. Both datasets were built with a five-tier data management framework validated on MiniCPM5-1B training runs, and both are freely downloadable on Hugging Face.
HOW THIS AFFECTS YOU
● Builder: You can use UltraData-SFT-2605's 15M+ annotated samples with dual thinking/non-thinking labels as a drop-in post-training dataset for math, code, and instruction-following fine-tuning tasks.
● Researcher: Ultra-FineWeb-L3 is currently the largest open-source Chinese synthetic pretraining dataset, offering a validated, traceable pipeline worth benchmarking against your own data curation methods.
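For builders, the dual thinking/non-thinking labels suggest a simple first step: splitting samples by mode and domain before fine-tuning. The sketch below is a minimal illustration, not the published schema. The field names `domain`, `mode`, `prompt`, and `response`, the label values, and the sample records are all hypothetical stand-ins, so check the dataset card for the real column names before adapting it.

```python
from collections import defaultdict

# Hypothetical records. The field names and label values below are
# illustrative assumptions, NOT the actual UltraData-SFT-2605 schema.
SAMPLES = [
    {"domain": "math", "mode": "thinking",
     "prompt": "What is 2 + 2?", "response": "Reasoning: 2 + 2 = 4. Answer: 4"},
    {"domain": "math", "mode": "non-thinking",
     "prompt": "What is 2 + 2?", "response": "4"},
    {"domain": "code", "mode": "thinking",
     "prompt": "Reverse a string in Python.", "response": "Slice with step -1: s[::-1]"},
    {"domain": "instruction-following", "mode": "non-thinking",
     "prompt": "Reply with one word.", "response": "Done"},
]


def split_by_mode(samples):
    """Group samples into lists keyed by their (assumed) 'mode' label."""
    buckets = defaultdict(list)
    for sample in samples:
        buckets[sample["mode"]].append(sample)
    return dict(buckets)


def select(samples, domain, mode):
    """Return only the samples matching both a domain and a mode."""
    return [s for s in samples if s["domain"] == domain and s["mode"] == mode]


buckets = split_by_mode(SAMPLES)
math_thinking = select(SAMPLES, "math", "thinking")
```

Keeping the split as a plain function over dictionaries means the same logic can be applied to rows streamed from whatever loader you use, once the hypothetical field names are swapped for the real ones.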