
reward-lens: A Mechanistic Interpretability Library for Reward Models

April 30, 2026

reward-lens is an open-source mechanistic interpretability library that adapts standard LLM interpretability methods, including logit lens, direct logit attribution, activation patching, and sparse autoencoders, to reward models by reorienting each method around the scalar reward head weight vector w_r. The library ships 10 adapter protocols, a TopK SAE feature attribution pipeline, three-mode activation patching, and five theory-grounded extensions, including misalignment cascade detection and reward-term conflict analysis. It is most relevant to RLHF practitioners and alignment researchers, since reward model internals have remained largely opaque even though they directly shape RLHF-trained frontier models. Prior interpretability tooling generally assumed a vocabulary unembedding projection, which does not carry over directly to reward models with a scalar regression head.
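To make the "reorient around w_r" idea concrete, here is a minimal sketch in plain Python. It is not the reward-lens API: the function names `reward_lens` and `direct_reward_attribution`, the toy vectors, and the simplification that the reward is a plain dot product of w_r with the final residual stream (no final layer norm, zero bias by default) are all assumptions made for illustration. Under those assumptions, the logit lens analogue projects the residual stream after each layer onto w_r, and the direct logit attribution analogue projects each component's output onto w_r.

```python
from typing import Dict, List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def add(a: Vector, b: Vector) -> Vector:
    """Element-wise sum of two equal-length vectors."""
    return [x + y for x, y in zip(a, b)]


def reward_lens(embedding: Vector, layer_outputs: List[Vector],
                w_r: Vector, bias: float = 0.0) -> List[float]:
    """Hypothetical helper: intermediate reward after the embedding
    and after each layer, obtained by projecting the running residual
    stream onto w_r (assumes no final layer norm)."""
    resid = list(embedding)
    rewards = [dot(w_r, resid) + bias]
    for out in layer_outputs:
        resid = add(resid, out)  # each layer writes additively to the stream
        rewards.append(dot(w_r, resid) + bias)
    return rewards


def direct_reward_attribution(components: Dict[str, Vector],
                              w_r: Vector) -> Dict[str, float]:
    """Hypothetical helper: per-component contribution to the reward.
    Because the head is linear, contributions sum to the bias-free reward."""
    return {name: dot(w_r, vec) for name, vec in components.items()}


# Toy data (made up): a 3-dimensional residual stream with two layers.
w_r = [1.0, -2.0, 0.5]
embedding = [0.0, 1.0, 2.0]
layers = {"layer0": [1.0, 0.0, 0.0], "layer1": [0.0, -1.0, 2.0]}

lens = reward_lens(embedding, list(layers.values()), w_r)
attr = direct_reward_attribution({"embed": embedding, **layers}, w_r)

print(lens)  # [-1.0, 0.0, 3.0]
print(attr)  # {'embed': -1.0, 'layer0': 1.0, 'layer1': 3.0}
```

In this toy case the attributions sum to the final lens value (3.0), which is the linearity property that lets a scalar reward head stand in for the vocabulary unembedding used in the token-level versions of these methods.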

cs.LG, cs.AI

SOURCE

https://arxiv.org/abs/2604.26130
