[arXiv] score: 0.13

Formal Certification Framework for SAE-Based LLM Interpretability Faithfulness

June 18, 2026

A post-hoc generalization framework certifies whether a sparse autoencoder explanation faithfully represents a frozen language model by bounding the base model's expected risk using four measurable quantities: proxy risk, SAE reconstruction gap, concept-pool mismatch, and sparse complexity. A non-vacuous bound confirms the sparse features retain meaningful predictive information.
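As a rough illustration of how such a certificate might be checked, here is a minimal sketch. It assumes the four quantities combine additively into an upper bound and that the loss is bounded in [0, 1], so a bound at or above 1 says nothing. Neither assumption comes from the summary above. The names `BoundTerms`, `certified_bound`, and `is_non_vacuous` are hypothetical, and the numbers are made up for the example.

```python
from dataclasses import dataclass


@dataclass
class BoundTerms:
    """The four measurable quantities named in the summary (hypothetical container)."""
    proxy_risk: float          # empirical risk of the SAE-based proxy
    reconstruction_gap: float  # penalty for imperfect SAE reconstruction
    pool_mismatch: float       # penalty for concept-pool mismatch
    sparse_complexity: float   # complexity term for the sparse explanation


def certified_bound(terms: BoundTerms) -> float:
    # ASSUMPTION: the terms simply add up; the paper's actual bound may
    # weight or combine them differently.
    return (
        terms.proxy_risk
        + terms.reconstruction_gap
        + terms.pool_mismatch
        + terms.sparse_complexity
    )


def is_non_vacuous(bound: float, trivial_risk: float = 1.0) -> bool:
    # ASSUMPTION: the loss lies in [0, 1], so any bound >= 1 is trivially true
    # and carries no information about the base model.
    return bound < trivial_risk


# Illustrative values only, not results from the paper.
example = BoundTerms(
    proxy_risk=0.2,
    reconstruction_gap=0.1,
    pool_mismatch=0.05,
    sparse_complexity=0.15,
)
print(is_non_vacuous(certified_bound(example)))
```

Under these assumptions, the certificate is only informative when every term is small enough that the sum stays below the trivial risk level; one large term, such as a poor reconstruction gap, is enough to make the bound vacuous.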

HOW THIS AFFECTS YOU

● Researcher: This gives mechanistic interpretability work a concrete operational criterion for when SAE-extracted features can be trusted as faithful model explanations rather than post-hoc artifacts.

● Policy: Certifiable interpretability bounds are a step toward auditable AI explanations, relevant for compliance contexts requiring model transparency guarantees.

