Checkpoint Auditing via Activation Refusal-Gaps and Weight-Recovery Energy
July 3, 2026
A new audit method detects whether an open-weight model has had its refusal mechanism removed by combining two signals: activation refusal-gaps and weight-recovery energy. The combined z-sum of the two signals achieves 0.95 AUROC in separating abliterated checkpoints from benign fine-tunes across the Qwen, Llama, and Gemma model families.
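To make the scoring step concrete, here is a minimal sketch of how a combined z-sum and its AUROC could be computed. It is not the method's actual implementation. It assumes that "z-sum" means the sum of per-signal z-scores standardized against a benign reference set, and that abliteration lowers the refusal-gap and raises the weight-recovery energy (hence the default signs). The function names, parameters, and numbers are hypothetical.

```python
import numpy as np

def z_scores(values, reference):
    """Standardize values against the mean and sample std of a reference set."""
    ref = np.asarray(reference, dtype=float)
    return (np.asarray(values, dtype=float) - ref.mean()) / ref.std(ddof=1)

def combined_z_sum(gap, energy, ref_gap, ref_energy,
                   gap_sign=-1.0, energy_sign=1.0):
    """Sum of oriented z-scores; higher means more suspicious.

    ASSUMPTION: the signs encode the guessed direction of each signal
    (smaller refusal-gap and larger recovery energy look abliterated).
    """
    return (gap_sign * z_scores(gap, ref_gap)
            + energy_sign * z_scores(energy, ref_energy))

def auroc(scores, labels):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    diff = pos[:, None] - neg[None, :]
    correct = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(correct / (pos.size * neg.size))

if __name__ == "__main__":
    # Synthetic, illustrative numbers only; not measurements from any model.
    rng = np.random.default_rng(0)
    ref_gap = rng.normal(5.0, 1.0, 50)      # benign reference checkpoints
    ref_energy = rng.normal(1.0, 0.3, 50)
    gap = np.concatenate([rng.normal(5.0, 1.0, 20), rng.normal(2.0, 1.0, 20)])
    energy = np.concatenate([rng.normal(1.0, 0.3, 20), rng.normal(1.8, 0.3, 20)])
    labels = np.concatenate([np.zeros(20), np.ones(20)])  # 1 = abliterated
    scores = combined_z_sum(gap, energy, ref_gap, ref_energy)
    print(f"AUROC on synthetic data: {auroc(scores, labels):.3f}")
```

Standardizing against a benign reference set puts the two signals on a common scale before they are added, so neither dominates the sum purely because of its units.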
HOW THIS AFFECTS YOU
● Builder: You can use these signals to verify the safety alignment of third-party open-weight models before deployment.
● Policy: This provides a technical mechanism to enforce safety compliance in the open-source model ecosystem.