[r/MachineLearning] score: 0.16
MiniMax Sparse Attention Hits 1M Tokens at 1/20th Prior Compute
June 2, 2026
MiniMax's new sparse attention architecture (MSA) reaches a 1M-token context at 1/20th the per-token compute of the company's prior generation. It uses a KV-outer-loop memory access pattern that keeps reads contiguous and fetches each KV block exactly once. Compared with Flash-Sparse-Attention, MSA delivers 4x faster execution, a 9x prefill speedup, and a 15x decoding speedup. MiniMax claims the model is the first open-weight release to combine frontier coding, 1M context, and native multimodality.
news
HOW THIS AFFECTS YOU
● builder: You can run 1M-context inference on an open-weight model at 1/20th the per-token compute of MiniMax's prior generation. The reported 15x decoding speedup over Flash-Sparse-Attention makes long-context production deployments newly viable.
● researcher: The KV-outer-gather-Q memory access pattern is a concrete architectural departure from standard sparse approximations. Its recall fidelity and hardware-efficiency claims are worth examining.
● founder: What MiniMax bills as the first open-weight model combining frontier coding, 1M context, and multimodality shifts the cost baseline for building long-context products without depending on a proprietary API.
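
To make the KV-outer access pattern concrete, here is a minimal pure-Python sketch. It is not MiniMax's kernel: the function names, the single-head setup, the absence of a causal mask, and the precomputed `selected` block sets are all assumptions made for illustration. It only shows the loop order described above. The outer loop walks KV blocks, each selected block is read once as a contiguous slice, and the queries that chose that block are gathered and updated with a running (online) softmax. A query-outer version is included for comparison, since it re-reads a block once per query that selected it.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def block_bounds(b, block_size, n_keys):
    return b * block_size, min((b + 1) * block_size, n_keys)

def kv_outer_attention(q, k, v, block_size, selected):
    """Hypothetical sketch: selected[i] is the set of KV block indices
    query i attends to. Assumes every query selects at least one block."""
    n_q, d_v = len(q), len(v[0])
    scale = 1.0 / math.sqrt(len(q[0]))
    n_blocks = (len(k) + block_size - 1) // block_size

    # Invert the query -> blocks map into block -> queries ("gather Q").
    queries_for_block = [[] for _ in range(n_blocks)]
    for i, blocks in enumerate(selected):
        for b in blocks:
            queries_for_block[b].append(i)

    run_max = [-math.inf] * n_q             # running max score per query
    run_sum = [0.0] * n_q                   # running softmax denominator
    acc = [[0.0] * d_v for _ in range(n_q)] # running weighted sum of values
    reads = [0] * n_blocks                  # how often each block is fetched

    for b in range(n_blocks):               # KV blocks form the outer loop
        if not queries_for_block[b]:
            continue                        # unselected blocks are never read
        lo, hi = block_bounds(b, block_size, len(k))
        k_blk, v_blk = k[lo:hi], v[lo:hi]   # one contiguous read per block
        reads[b] += 1
        for i in queries_for_block[b]:
            scores = [scale * dot(q[i], kj) for kj in k_blk]
            new_max = max(run_max[i], max(scores))
            rescale = math.exp(run_max[i] - new_max)  # 0.0 on first visit
            w = [math.exp(s - new_max) for s in scores]
            run_sum[i] = run_sum[i] * rescale + sum(w)
            acc[i] = [a * rescale + sum(wj * vj[c] for wj, vj in zip(w, v_blk))
                      for c, a in enumerate(acc[i])]
            run_max[i] = new_max

    out = [[a / run_sum[i] for a in acc[i]] for i in range(n_q)]
    return out, reads

def q_outer_attention(q, k, v, block_size, selected):
    """Query-outer baseline: each query fetches its own blocks."""
    scale = 1.0 / math.sqrt(len(q[0]))
    n_blocks = (len(k) + block_size - 1) // block_size
    reads, out = [0] * n_blocks, []
    for i, blocks in enumerate(selected):
        keys, vals = [], []
        for b in sorted(blocks):
            lo, hi = block_bounds(b, block_size, len(k))
            keys += k[lo:hi]
            vals += v[lo:hi]
            reads[b] += 1                   # a shared block is read again
        scores = [scale * dot(q[i], kj) for kj in keys]
        top = max(scores)
        w = [math.exp(s - top) for s in scores]
        total = sum(w)
        out.append([sum(wj * vj[c] for wj, vj in zip(w, vals)) / total
                    for c in range(len(v[0]))])
    return out, reads

rng = random.Random(0)
k = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(8)]
v = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(8)]
q = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(3)]
selected = [{0, 2}, {2}, {0, 1}]            # toy selection; block 3 is unused

kv_out, kv_reads = kv_outer_attention(q, k, v, 2, selected)
q_out, q_reads = q_outer_attention(q, k, v, 2, selected)
# kv_reads == [1, 1, 1, 0]; q_reads == [2, 1, 2, 0]
```

Both loop orders compute the same outputs up to floating-point error. The difference is in memory traffic: in this toy case the query-outer version fetches blocks 0 and 2 twice each, while the KV-outer version fetches every selected block once. How block selection is done, and how this maps onto GPU memory, is not covered by the summary above and is left out of the sketch.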