[arXiv] score: 0.30

Checkpoint Auditing via Activation Refusal-Gaps and Weight-Recovery Energy

July 3, 2026

A new audit method detects whether open-weight models have had their refusal mechanisms removed by combining two signals: activation refusal-gaps and weight-recovery energy. The combined z-sum of the two signals achieves 0.95 AUROC in separating abliterated checkpoints from benign fine-tunes across the Qwen, Llama, and Gemma families.
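To make the "combined z-sum" and AUROC figures concrete, here is a minimal sketch of how two per-checkpoint signals could be standardized, summed, and scored. It is an illustration under stated assumptions, not the paper's implementation: the function names (`z_scores`, `z_sum`, `auroc`), the choice to standardize against a reference set of benign fine-tunes, and the convention that larger values of each signal are more suspicious are all assumptions, and the numbers are synthetic.

```python
from statistics import mean, pstdev


def z_scores(values, reference):
    """Standardize values against a reference population (assumed: benign fine-tunes)."""
    mu = mean(reference)
    sigma = pstdev(reference)
    if sigma == 0:
        raise ValueError("reference population has zero variance")
    return [(v - mu) / sigma for v in values]


def z_sum(gaps, energies, ref_gaps, ref_energies):
    """Sum the two standardized signals into one score per checkpoint.

    Assumption: both signals are oriented so that larger means more suspicious.
    """
    zg = z_scores(gaps, ref_gaps)
    ze = z_scores(energies, ref_energies)
    return [g + e for g, e in zip(zg, ze)]


def auroc(scores, labels):
    """Rank-based AUROC: probability that a positive outscores a negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Synthetic toy data (hypothetical values, not taken from the paper).
benign_gaps = [0.10, 0.20, 0.15, 0.12]
benign_energies = [1.0, 1.2, 0.9, 1.1]
abliterated_gaps = [0.60, 0.50]
abliterated_energies = [2.0, 2.4]

scores = z_sum(
    benign_gaps + abliterated_gaps,
    benign_energies + abliterated_energies,
    benign_gaps,
    benign_energies,
)
labels = [0] * len(benign_gaps) + [1] * len(abliterated_gaps)
toy_auroc = auroc(scores, labels)
```

An AUROC of 0.95 means that a randomly chosen abliterated checkpoint receives a higher combined score than a randomly chosen benign fine-tune about 95% of the time. Turning the score into a pass/fail decision would still require choosing a threshold, which this sketch leaves open.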

HOW THIS AFFECTS YOU

● builder: You can use these signals to verify the safety alignment of third-party open-weight models before deployment.

● policy: This provides a technical mechanism to enforce safety compliance in the open-source model ecosystem.

read original ↗ arxiv.org
