[arXiv] score: 0.24

Single-Position Intervention Fails: Distributed Output Templates Drive In-Context Learning

May 7, 2026
New mechanistic interpretability research on arXiv (2605.04061) reports that task identity in in-context learning is causally encoded across distributed output-token positions rather than at localized nodes. In Llama-3.2-3B, intervening on activations at a single position yields 0% task transfer even though linear probes decode the task with 100% accuracy at all 28 layers, while intervening at multiple positions reaches 96% transfer at layer 8. The result is a dissociation between what a probe can decode and what causally drives the output, and it challenges earlier localization assumptions. Researchers building steering vectors, activation-patching pipelines, or interpretability tools may need to reconsider single-site intervention strategies.
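To make the probe-versus-causality gap concrete, here is a minimal toy sketch in plain Python. It is not the paper's method and does not touch Llama-3.2-3B: the "activations", the pooling `readout`, and the helper names (`make_run`, `probe`, `patch`) are all hypothetical, chosen only to show how a signal can be perfectly decodable at every position while a single-position patch still fails to change the output.

```python
import random

def make_run(task_sign, n_positions=4, dim=8, noise=0.1, seed=0):
    """Toy 'activations': one vector per output position (assumed setup)."""
    rng = random.Random(seed)
    run = []
    for _ in range(n_positions):
        vec = [rng.uniform(-noise, noise) for _ in range(dim)]
        vec[0] += task_sign  # task identity is linearly readable at every position
        run.append(vec)
    return run

def probe(vec):
    """Linear probe on a single position: sign of the task coordinate."""
    return 1 if vec[0] > 0 else -1

def readout(run):
    """Toy downstream computation that pools the task coordinate over all positions."""
    total = sum(vec[0] for vec in run)
    return 1 if total > 0 else -1

def patch(base, source, positions):
    """Activation patching: copy source activations into base at the given positions."""
    patched = [list(vec) for vec in base]
    for p in positions:
        patched[p] = list(source[p])
    return patched

base = make_run(task_sign=+1, seed=1)    # run performing hypothetical task A
source = make_run(task_sign=-1, seed=2)  # run performing hypothetical task B

# The probe decodes the task correctly at every single position of both runs.
probe_ok = all(probe(v) == 1 for v in base) and all(probe(v) == -1 for v in source)

# Patching one position at a time never flips the pooled output.
single = [readout(patch(base, source, [p])) for p in range(len(base))]

# Patching all positions together transfers the task.
multi = readout(patch(base, source, range(len(base))))

print("probe correct at every position:", probe_ok)
print("single-position patch outputs:", single)
print("multi-position patch output:", multi)
```

In this toy, each position carries the full task signal, so a probe succeeds anywhere, but the output depends on the pooled signal, so only a multi-position patch changes it. Real models are far less tidy; the sketch only illustrates why high probe accuracy alone does not establish that a single site is causally sufficient.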
cs.LG, cs.CL