Model Organisms Are Leaky: Perplexity Differencing Often Reveals Finetuning Objectives

May 5, 2026

An arXiv preprint (2605.00994) reports that perplexity differencing, which ranks completions generated from random short prefills by the perplexity gap between the reference and finetuned models, often surfaces hidden finetuning objectives in model organisms. The evaluation covers 76 models ranging from 0.5B to 70B parameters, and the method requires no model internals or behavioral priors. This black-box auditing method exploits overgeneralization, where finetuned models leak target behaviors outside their intended contexts. Red teamers, safety auditors, and alignment researchers should consider prioritizing it over existing interpretability-heavy approaches for scalable behavioral fingerprinting.
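To make the ranking step concrete, here is a minimal sketch of perplexity differencing under strong simplifying assumptions. It is not the paper's implementation: the two "models" are hypothetical toy unigram distributions rather than real language models, only the continuation (not the prefill) is scored, and the names `log_perplexity`, `sample_completion`, and `rank_by_perplexity_gap` are invented for illustration. The idea it shows is that completions the finetuned model finds far more likely than the reference model rise to the top of the ranking.

```python
import math
import random
from typing import Dict, List, Tuple

# Hypothetical stand-in for a language model: a unigram token distribution.
Model = Dict[str, float]


def log_perplexity(model: Model, tokens: List[str]) -> float:
    """Mean negative log-likelihood per token, i.e. log(perplexity)."""
    floor = 1e-9  # assumed probability for tokens the toy model has never seen
    nll = -sum(math.log(model.get(tok, floor)) for tok in tokens)
    return nll / len(tokens)


def sample_completion(model: Model, length: int, rng: random.Random) -> List[str]:
    """Sample a continuation from the toy model (the prefill is ignored by a unigram model)."""
    vocab = list(model)
    weights = [model[tok] for tok in vocab]
    return rng.choices(vocab, weights=weights, k=length)


def rank_by_perplexity_gap(
    reference: Model, finetuned: Model, completions: List[List[str]]
) -> List[Tuple[float, List[str]]]:
    """Sort completions by log-perplexity(reference) minus log-perplexity(finetuned).

    A large positive gap means the finetuned model finds the text much more
    likely than the reference model does, hinting at a finetuned behavior.
    """
    scored = [
        (log_perplexity(reference, c) - log_perplexity(finetuned, c), c)
        for c in completions
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    # Toy distributions: finetuning shifted probability mass toward "secret".
    reference = {"the": 0.5, "cat": 0.4, "secret": 0.1}
    finetuned = {"the": 0.4, "cat": 0.2, "secret": 0.4}

    rng = random.Random(0)
    completions = [sample_completion(finetuned, 4, rng) for _ in range(8)]
    for gap, completion in rank_by_perplexity_gap(reference, finetuned, completions)[:3]:
        print(f"{gap:+.3f}  {' '.join(completion)}")
```

With real models, the two scoring calls would be replaced by per-token log-likelihoods from the reference and finetuned checkpoints, which needs only output probabilities and so fits the black-box setting described above.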

cs.CL, cs.AI

SOURCE

https://arxiv.org/abs/2605.00994
