[arXiv] score: 0.23

MIRAGE Probe Detects LLM Covert Encoding at AUC 0.975–1.0 Before Output Exists

June 10, 2026

A shared low-dimensional subspace in LLM residual streams supports covert data encoding across nine encoding families and eight models. A logistic-regression probe trained on eight of the families generalizes to the held-out ninth at AUC 0.975–1.000. At the planning token, the same direction flips polarity, which distinguishes inline execution from tool-call outsourcing before any encoded text is generated.
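The held-out-family evaluation described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the activations are synthetic Gaussian vectors rather than real residual-stream states, the family names are placeholders, and the hand-written logistic regression and AUC routines stand in for whatever tooling the paper used. The sketch only shows the shape of the protocol: train a linear probe on eight families, then score the ninth.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def make_family(rng, shared, dim, n, shift_scale=0.5, signal=3.0):
    """Synthetic stand-in for one family's activations (assumption, not real data)."""
    offset = [rng.gauss(0, shift_scale) for _ in range(dim)]  # family-specific nuisance shift
    xs, ys = [], []
    for i in range(n):
        label = i % 2  # 1 = encoding example, 0 = benign control
        x = [rng.gauss(0, 1) + offset[j] + (signal * shared[j] if label else 0.0)
             for j in range(dim)]
        xs.append(x)
        ys.append(label)
    return xs, ys

def fit_logreg(xs, ys, lr=0.5, epochs=100):
    """Plain batch gradient descent on the logistic loss."""
    dim, n = len(xs[0]), len(xs)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * dim, 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(dot(w, x) + b) - y
            for j in range(dim):
                gw[j] += err * x[j]
            gb += err
        w = [wi - lr * g / n for wi, g in zip(w, gw)]
        b -= lr * gb / n
    return w, b

def auc(scores, labels):
    """Probability that a random positive outscores a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def leave_one_family_out(families):
    """Train on all families but one, then report AUC on the held-out family."""
    results = {}
    for held in families:
        train_x = [x for name, (xs, _) in families.items() if name != held for x in xs]
        train_y = [y for name, (_, ys) in families.items() if name != held for y in ys]
        w, b = fit_logreg(train_x, train_y)
        test_x, test_y = families[held]
        results[held] = auc([dot(w, x) + b for x in test_x], test_y)
    return results

rng = random.Random(0)
DIM = 8  # toy dimensionality; real residual streams are far wider
raw = [rng.gauss(0, 1) for _ in range(DIM)]
norm = math.sqrt(dot(raw, raw))
shared = [v / norm for v in raw]  # the hypothetical shared encoding direction

families = {f"family_{k}": make_family(rng, shared, DIM, 30) for k in range(9)}
aucs = leave_one_family_out(families)
for name, value in aucs.items():
    print(f"held out {name}: AUC = {value:.3f}")
```

Because the synthetic families share one direction by construction, high held-out AUC here is guaranteed by the setup and says nothing about real models. The point is only that generalization to an unseen family is what separates a shared subspace from family-specific features.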

HOW THIS AFFECTS YOU

● Builder: You can integrate MIRAGE as a two-channel runtime monitor to catch covert exfiltration attempts in agentic pipelines before output is produced.

● Researcher: Identifies a mechanistically interpretable, architecture-agnostic subspace for encoding behavior, opening a new direction for probing agent internals.

● Policy: Provides a detection mechanism for a concrete agentic threat model (covert data exfiltration via encoding) that evades output-side filters.
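For builders, a pre-generation check along these lines is one way such a monitor could be wired in. This is a sketch under stated assumptions, not MIRAGE's actual interface: the function name, the `Verdict` type, the threshold, and the sign convention (positive projection means inline, negative means tool call) are all hypothetical, and treating the projection's magnitude and sign as the two channels is one possible reading of the summary.

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    flagged: bool   # True if the projection magnitude exceeds the threshold
    route: str      # "inline", "tool_call", or "none"
    score: float    # signed projection onto the probe direction

def check_planning_token(activation, direction, threshold):
    """Hypothetical monitor: project the planning-token activation onto a probe direction.

    Assumption: magnitude signals encoding intent and sign signals the route.
    The real sign convention and threshold would have to come from the paper.
    """
    score = sum(a * d for a, d in zip(activation, direction))
    if abs(score) < threshold:
        return Verdict(False, "none", score)
    route = "inline" if score > 0 else "tool_call"
    return Verdict(True, route, score)

# Toy usage with a 2-dimensional "activation" and a unit probe direction.
probe_direction = [1.0, 0.0]
print(check_planning_token([2.0, 0.3], probe_direction, threshold=1.0))
print(check_planning_token([-2.0, 0.3], probe_direction, threshold=1.0))
print(check_planning_token([0.1, 0.3], probe_direction, threshold=1.0))
```

In a real pipeline the activation would be read from the model's residual stream at the planning token, and a flagged verdict would block or escalate the request before any text is generated.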
