Data filtering may be suboptimal for large model pretraining
May 21, 2026
Scaling studies suggest that data filtering may be counterproductive for large models in compute-rich, data-scarce regimes. Models with large parameter counts appear to extract signal from low-quality and distractor data that smaller models cannot use, which implies that pretraining pipelines built around aggressive filtering should be revisited at scale. Practitioners training frontier-scale models should reconsider how much of their data pool they filter out.
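
To make the "compute-rich, data-scarce" regime concrete, here is a minimal back-of-the-envelope sketch. It rests on an assumption not stated in the studies summarized above: that with a fixed training-token budget, discarding data means the remaining tokens are repeated more often. The function name `epochs_needed` and all numbers are hypothetical, chosen only for illustration.

```python
def epochs_needed(train_tokens: float, pool_tokens: float, keep_fraction: float) -> float:
    """Passes over the filtered pool needed to consume the training-token budget.

    Assumption (illustrative only): the token budget is fixed by compute, so
    filtering shrinks the unique-token pool and forces more repetition.
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    if pool_tokens <= 0 or train_tokens < 0:
        raise ValueError("pool_tokens must be positive and train_tokens non-negative")
    return train_tokens / (pool_tokens * keep_fraction)


# Hypothetical numbers: compute pays for 4T training tokens, but only 2T unique
# tokens exist before filtering.
TRAIN_TOKENS = 4e12
POOL_TOKENS = 2e12

for keep in (1.0, 0.5, 0.25):
    passes = epochs_needed(TRAIN_TOKENS, POOL_TOKENS, keep)
    print(f"keep {keep:.0%} of the pool: {passes:.1f} passes over the retained data")
```

Under these made-up numbers, an aggressive filter trades lower-quality unique tokens for extra passes over the same retained tokens. Whether that trade helps or hurts at a given model size is the empirical question the studies address; the sketch only shows the arithmetic behind it.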