[arXiv]

Synthetic Data Composition Can Shift Scaling Laws Toward Data Over Parameters in Physics ML

June 19, 2026

For hadronic jet classification in particle physics, adding diverse synthetic pretraining data aligned with the downstream task shifts the scaling regime to favor more data over larger models. The approach exploits cheap, high-fidelity simulators of a kind unavailable in NLP or vision. The finding suggests that domain-specific data engineering can deliberately reshape compute-optimal scaling curves.
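To make "favoring data over parameters" concrete, here is a minimal sketch. It assumes the widely used parametric loss L(N, D) = E + A/N^alpha + B/D^beta with compute C ≈ 6·N·D; the summary does not say which functional form the paper fits, so this form, the function name, and the exponent values below are all illustrative assumptions. Under this form, the compute-optimal model size grows as C^(beta/(alpha+beta)) and the compute-optimal dataset size as C^(alpha/(alpha+beta)).

```python
def compute_optimal_exponents(alpha: float, beta: float) -> tuple[float, float]:
    """Return (a, b) such that N_opt scales as C**a and D_opt scales as C**b.

    Assumes the loss L(N, D) = E + A / N**alpha + B / D**beta and the
    compute budget C ~ 6 * N * D (an assumed form, not taken from the paper).
    """
    if alpha <= 0 or beta <= 0:
        raise ValueError("alpha and beta must be positive")
    a = beta / (alpha + beta)   # exponent for parameters N
    b = alpha / (alpha + beta)  # exponent for training tokens or events D
    return a, b


# Hypothetical exponents, chosen only to illustrate the two regimes.
balanced = compute_optimal_exponents(alpha=0.3, beta=0.3)
data_favoring = compute_optimal_exponents(alpha=0.6, beta=0.3)

print("balanced      (N exp, D exp):", balanced)
print("data-favoring (N exp, D exp):", data_favoring)
```

In this sketch, a regime "favors data" when b > a, meaning each additional unit of compute is better spent on more training examples than on more parameters. Changing the pretraining data composition would act by changing the fitted alpha and beta.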

HOW THIS AFFECTS YOU

● Researcher: Concrete evidence that pretraining data composition, not just quantity, can engineer scaling law exponents, with implications for any domain with cheap synthetic data generation.

