MiniMax Sparse Attention: Per-Group Block Selection for Cheap Million-Token Inference
15 June 2026
MiniMax Sparse Attention is a practical sparse-attention design for million-token LLMs: a lightweight learned indexer selects the relevant KV blocks, and exact attention is then computed only over those blocks. The paper matters because it ties architecture, training stability, and GPU kernels together into a deployable long-context system, one that powers the open-weight MiniMax-M3 model.
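To make the select-then-attend idea concrete, here is a minimal single-query sketch in plain Python. It is an illustration under stated assumptions, not the paper's implementation: the function name `block_sparse_attention`, the parameters `block_size` and `top_k`, and especially the block scorer are hypothetical. The real indexer is learned, whereas this toy version scores each block by the dot product between the query and the block's mean key. It also ignores per-group selection, causal masking, and batching.

```python
import math


def block_sparse_attention(q, keys, values, block_size, top_k):
    """Attend from one query to the top_k highest-scoring KV blocks only.

    q: list[float]; keys, values: list[list[float]] of equal length.
    Returns (output vector, sorted list of chosen block indices).
    """
    n = len(keys)
    # Partition token positions into contiguous blocks.
    blocks = [list(range(s, min(s + block_size, n)))
              for s in range(0, n, block_size)]

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    def block_score(idx):
        # ASSUMPTION: placeholder for the learned indexer.
        # Scores a block by q . mean(keys in the block).
        mean_key = [sum(keys[i][d] for i in idx) / len(idx)
                    for d in range(len(q))]
        return dot(q, mean_key)

    scores = [block_score(idx) for idx in blocks]
    # Keep the top_k blocks by indexer score.
    chosen = sorted(range(len(blocks)),
                    key=lambda b: scores[b], reverse=True)[:top_k]
    selected = sorted(i for b in chosen for i in blocks[b])

    # Exact softmax attention, restricted to the selected tokens.
    scale = 1.0 / math.sqrt(len(q))
    logits = [dot(q, keys[i]) * scale for i in selected]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    out = [sum(w * values[i][d] for w, i in zip(weights, selected)) / total
           for d in range(len(values[0]))]
    return out, sorted(chosen)


# Toy usage: six tokens in three blocks of two; keep one block.
q = [1.0, 0.0]
keys = [[0.0, 1.0], [0.0, 1.0], [5.0, 0.0], [4.0, 0.0], [-1.0, 0.0], [-1.0, 0.0]]
values = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [9.0, 9.0], [9.0, 9.0]]
out, chosen = block_sparse_attention(q, keys, values, block_size=2, top_k=1)
```

In this sketch the cost of the exact attention step scales with `top_k * block_size` rather than with the full sequence length, which is the source of the savings at long context. Setting `top_k` to the total number of blocks recovers ordinary dense attention.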