[arXiv] score: 0.41
GradShield: Alignment Preserving Finetuning
May 15, 2026
GradShield is a filtering method that safeguards LLM alignment during finetuning by computing Finetuning Implicit Harmfulness Scores to identify and remove harmful data points before they corrupt model behavior.
cs.CL