● builder: If you're converting Transformer checkpoints to hybrid linear attention for faster long-context inference, this initialization recipe reduces distillation cost.
● researcher: Addresses a concrete failure mode in Transformer-to-linear-attention conversion with a principled initialization method grounded in Taylor approximation.