[arXiv]

MixSD Reduces SFT Forgetting by Mixing Expert and Naive Model Conditionals

June 19, 2026

Fine-tuning on human-written or other external targets causes forgetting because those targets contain token sequences the model assigns low probability, and fitting them pulls the model away from its pretrained distribution. MixSD instead builds its supervision by blending two conditionals of the base model itself: an "expert" conditional with the injected fact in context and a "naive" conditional without it. This keeps the targets aligned with the model's own distribution and requires no external teacher.
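The blending idea can be illustrated with a toy sketch. This summary does not say how MixSD combines the two conditionals, so the code below assumes a plain convex combination of next-token probabilities; the weight `lam`, the function names, and the three-token vocabulary are all hypothetical and are not taken from the paper.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mix_targets(expert_logits, naive_logits, lam=0.5):
    """Blend two next-token distributions from the same base model.

    expert_logits: logits with the injected fact in context.
    naive_logits:  logits without the fact in context.
    lam:           hypothetical mixing weight (an assumption, not from the paper).
    """
    if len(expert_logits) != len(naive_logits):
        raise ValueError("both conditionals must share one vocabulary")
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    p_expert = softmax(expert_logits)
    p_naive = softmax(naive_logits)
    return [lam * pe + (1.0 - lam) * pn for pe, pn in zip(p_expert, p_naive)]


def soft_cross_entropy(target_probs, student_logits):
    """Cross-entropy of the student's prediction against a soft target."""
    m = max(student_logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in student_logits))
    log_q = [x - log_z for x in student_logits]
    return -sum(t * lq for t, lq in zip(target_probs, log_q))


# Toy three-token vocabulary: the fact in context shifts mass to token 2.
expert = [0.1, 0.2, 3.0]
naive = [2.0, 1.0, 0.1]
target = mix_targets(expert, naive, lam=0.7)
loss = soft_cross_entropy(target, student_logits=naive)
```

Because both inputs come from the base model, the blended target stays close to what the model already considers plausible, which is the property the summary credits for reduced forgetting. Setting `lam=1.0` in this sketch recovers the in-context distribution alone, and `lam=0.0` recovers the unconditioned one.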

HOW THIS AFFECTS YOU

● Builder: You can apply MixSD during fine-tuning to reduce capability regression on reasoning benchmarks when injecting domain-specific facts.

● Researcher: Worth watching because the external-teacher-free distillation framing offers a principled alternative to standard SFT loss for knowledge injection without auxiliary models.

