
OpenBMB Releases 600B-Token Pretraining Dataset and 15M SFT Dataset

June 1, 2026

OpenBMB, Tsinghua NLP, and Modelbest released two datasets: Ultra-FineWeb-L3 (600B+ tokens: 400B+ English, 200B+ Chinese synthetic pretraining data) and UltraData-SFT-2605 (15M+ SFT samples covering math, code, and instruction following, with thinking/non-thinking annotations). Both datasets use a five-tier data management framework validated on MiniCPM5-1B training runs, and both are freely downloadable on Hugging Face.

HOW THIS AFFECTS YOU

● builder: You can use UltraData-SFT-2605's 15M+ annotated samples, which carry dual thinking/non-thinking labels, as a drop-in post-training dataset for fine-tuning on math, code, and instruction following.

● researcher: Ultra-FineWeb-L3 is presented as the largest open-source Chinese synthetic pretraining dataset to date, and its validated, traceable pipeline is worth benchmarking against your own data curation methods.

SOURCE

https://x.com/OpenBMB/status/2061432928492810535#m
