[arXiv]

GradShield: Alignment Preserving Finetuning

May 15, 2026

GradShield is a filtering method that safeguards LLM alignment during finetuning by computing Finetuning Implicit Harmfulness Scores to identify and remove harmful data points before they corrupt model behavior.
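
The summary above describes a score-then-filter step, which can be sketched generically as follows. This is only an illustration under stated assumptions: the summary does not say how Finetuning Implicit Harmfulness Scores are computed or how the cutoff is chosen, so `score_fn`, `threshold`, and `filter_by_score` are hypothetical names, and the scorer is supplied by the caller rather than implemented here.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def filter_by_score(
    examples: Sequence[T],
    score_fn: Callable[[T], float],
    threshold: float,
) -> Tuple[List[T], List[T]]:
    """Split examples into (kept, removed) using a per-example score.

    Assumption: a higher score means "more harmful", and any example
    scoring strictly above `threshold` is dropped before finetuning.
    The scoring function itself is a placeholder, not the paper's method.
    """
    kept: List[T] = []
    removed: List[T] = []
    for example in examples:
        if score_fn(example) > threshold:
            removed.append(example)
        else:
            kept.append(example)
    return kept, removed


# Toy usage with made-up scores standing in for a real scorer.
toy_scores = {"a": 0.1, "b": 0.9, "c": 0.5}
kept, removed = filter_by_score(["a", "b", "c"], toy_scores.get, threshold=0.5)
```

Only the `kept` list would then be passed on to finetuning; the `removed` list can be logged for inspection.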

cs.CL

SOURCE

https://arxiv.org/abs/2605.14194
