[arXiv]score: 0.41
Before the Last Token: Diagnosing Final-Token Safety Probe Failures
May 14, 2026
Identifies prefill-time failure mode in final-token safety probes where jailbreak evidence distributed across earlier user-token representations is missed, demonstrating high recall on clean prompts but poor generalization to adversarial inputs.
cs.LG