Synthetic Data Composition Can Shift Scaling Laws Toward Data Over Parameters in Physics ML
June 19, 2026
For hadronic jet classification in particle physics, adding diverse synthetic pretraining data aligned with the downstream task shifts the scaling regime to favor more data over larger models. The approach exploits cheap, high-fidelity simulators, which have no counterpart in NLP or vision. The finding suggests that domain-specific data engineering can deliberately reshape compute-optimal scaling curves.
HOW THIS AFFECTS YOU
● Researcher: Concrete evidence that pretraining data composition, not just quantity, can engineer scaling law exponents, with implications for any domain with cheap synthetic data generation.