Data filtering may be suboptimal for large model pretraining

May 21, 2026

Scaling studies suggest that data filtering may be counterproductive for large models in compute-rich, data-scarce regimes. Models with large parameter counts appear to extract signal from low-quality and distractor data that smaller models cannot use. This implies that practitioners training frontier-scale models should revisit pretraining pipelines, and filtering budgets, that were optimized around aggressive filtering.

SOURCE

https://arxiv.org/abs/2605.19407
