[r/MachineLearning] score: 0.16

MiniMax Sparse Attention Hits 1M Tokens at 1/20th Prior Compute

June 2, 2026

MiniMax's new sparse attention architecture (MSA) reaches a 1M-token context at 1/20th the per-token compute of the company's prior generation. It uses a KV-outer-loop memory access pattern: the kernel iterates over key/value blocks and gathers the queries that need each one, which keeps reads contiguous and fetches each block exactly once. Compared to Flash-Sparse-Attention, MiniMax reports 4x faster execution, a 9x prefill speedup, and a 15x decoding speedup. MiniMax claims this is the first open-weight release to combine frontier coding, 1M context, and native multimodality.
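The access pattern can be illustrated with a toy, pure-Python sketch. This is not MiniMax's kernel: the function name, the `selected` block mask, and the use of an online (running) softmax to merge per-block results are all assumptions made for illustration. It shows only the loop order the summary describes, with KV blocks on the outside and queries gathered per block.

```python
import math

def sparse_attention_kv_outer(q, k, v, selected, block_size):
    """Block-sparse attention with the loop over KV blocks on the outside.

    q: list of query vectors; k, v: lists of key/value vectors (same length).
    selected[i][b] is True when query i attends to KV block b. How blocks are
    chosen is not described in the summary, so the mask is taken as given
    (hypothetical input). Its width must match the number of KV blocks.
    """
    n_q, d_v = len(q), len(v[0])
    scale = 1.0 / math.sqrt(len(q[0]))
    # Running softmax state per query: max score, normalizer, weighted value sum.
    run_max = [-math.inf] * n_q
    run_sum = [0.0] * n_q
    acc = [[0.0] * d_v for _ in range(n_q)]

    for b in range(len(selected[0])):
        # Gather the queries that selected this block.
        q_idx = [i for i in range(n_q) if selected[i][b]]
        if not q_idx:
            continue  # blocks nobody selected are never read
        # One contiguous slice per block, read once and shared by all gathered queries.
        k_blk = k[b * block_size:(b + 1) * block_size]
        v_blk = v[b * block_size:(b + 1) * block_size]
        for i in q_idx:
            scores = [scale * sum(x * y for x, y in zip(q[i], kj)) for kj in k_blk]
            new_max = max(run_max[i], max(scores))
            rescale = math.exp(run_max[i] - new_max)  # 0.0 on a query's first block
            weights = [math.exp(s - new_max) for s in scores]
            run_sum[i] = run_sum[i] * rescale + sum(weights)
            acc[i] = [a * rescale + sum(w * vj[c] for w, vj in zip(weights, v_blk))
                      for c, a in enumerate(acc[i])]
            run_max[i] = new_max

    # Queries that selected no block keep an all-zero output.
    return [[a / run_sum[i] if run_sum[i] > 0 else 0.0 for a in acc[i]]
            for i in range(n_q)]
```

In a query-outer loop, each query would fetch its own scattered set of blocks, so a popular block is re-read once per query. Here each block is sliced once, and the running softmax state lets per-block partial results be merged in any order. The result should match ordinary softmax attention restricted to the same selected blocks, up to floating-point error.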

news

HOW THIS AFFECTS YOU

● builder: You can run 1M-context inference at much lower compute cost using an open-weight model. The reported 15x decoding speedup over Flash-Sparse-Attention makes long-context production deployments newly viable.

● researcher: The KV-outer-gather-Q memory access pattern is a concrete architectural departure from standard sparse approximations. Its recall fidelity and hardware efficiency claims are worth examining.

● founder: The first open-weight model combining frontier coding, 1M context, and multimodality shifts the cost baseline for building long-context products without proprietary API dependency.

SOURCE

https://www.reddit.com/r/MachineLearning/comments/1tvameq/minimax_dropped_a_new_attention_architecture_n/
