● builder: Worth tracking if you're training or fine-tuning long-context models and looking for attention efficiency gains that don't compromise KV cache behavior.
● researcher: The decoupled routing (sparse queries, dense KV) is an architecturally clean approach to conditional compute in attention that could generalize across transformer variants.