
Single-Position Intervention Fails: Distributed Output Templates Drive In-Context Learning

May 7, 2026

A new mechanistic interpretability paper on arXiv (2605.04061) reports that task identity in in-context learning is causally encoded across multiple output token positions rather than at a single localized position. In Llama-3.2-3B, intervening on activations at a single position yields 0% task transfer even though linear probes decode the task with 100% accuracy at all 28 layers, while intervening on multiple positions at once reaches 96% transfer at layer 8. The result is a dissociation between probing and causality: information that is linearly decodable at a position is not necessarily what drives the model's behavior from that position. This challenges localization assumptions in prior work, and researchers building steering vectors, activation patching pipelines, or interpretability tools should be cautious about relying on single-site interventions.
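The gap between probe accuracy and causal effect can be illustrated with a deliberately simple toy in NumPy. This is a sketch under stated assumptions, not the paper's method or model: the "task direction", the majority-vote `readout`, and the helpers `make_activations`, `probe`, and `patch` are all hypothetical names invented for illustration. The toy assumes the downstream behavior depends on agreement across positions, so overwriting one position cannot flip it even though every position is perfectly decodable.

```python
import numpy as np

rng = np.random.default_rng(0)
n_positions, d_model = 5, 8

# Hypothetical "task direction" in activation space (an assumption of this toy).
task_direction = np.zeros(d_model)
task_direction[0] = 1.0

def make_activations(task_sign):
    """Toy activations: every position carries the task sign along task_direction."""
    noise = rng.normal(scale=0.1, size=(n_positions, d_model))
    noise[:, 0] = 0.0  # keep noise off the task direction so the toy is deterministic
    return task_sign * task_direction + noise

def probe(acts):
    """Per-position linear probe: 1 for task A, 0 for task B."""
    return (acts @ task_direction > 0).astype(int)

def readout(acts):
    """Toy downstream behavior: task A only if a majority of positions agree."""
    return int(probe(acts).sum() > n_positions / 2)

def patch(target, source, positions):
    """Copy source activations into a copy of target at the given positions."""
    patched = target.copy()
    patched[positions] = source[positions]
    return patched

source = make_activations(+1.0)  # run on task A
target = make_activations(-1.0)  # run on task B

# Fraction of positions where the probe recovers the correct task.
probe_accuracy = float(
    np.mean(np.concatenate([probe(source) == 1, probe(target) == 0]))
)

# Fraction of single-position patches that flip the readout to task A.
single_transfer = float(
    np.mean([readout(patch(target, source, [p])) == 1 for p in range(n_positions)])
)

# Whether patching every position at once flips the readout to task A.
all_positions = list(range(n_positions))
multi_transfer = float(readout(patch(target, source, all_positions)) == 1)

print(probe_accuracy, single_transfer, multi_transfer)
```

By construction, the probe is correct at every position, no single-position patch changes the readout, and patching all positions does. The numbers here come from the toy's design and say nothing about the transfer rates reported for Llama-3.2-3B.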

cs.LG, cs.CL

SOURCE

https://arxiv.org/abs/2605.04061
