MiniMax teases upcoming M3 model with new sparse attention mechanism and 15.6X long-context response speed boost

By Srivijay Mavuri, Founder & Editor | 27 May 2026 | 4 min read | News Wire

Abstract 3D render visualizing artificial intelligence and neural networks in digital form. Photo by Google DeepMind on Unsplash.

Chinese artificial intelligence firm MiniMax has published a technical report on its M2 model family while teasing a new attention mechanism for its forthcoming M3 series. The report details the engineering foundations of the M2, M2.5, and M2.7 variants. Most notably, MiniMax announced a novel sparse attention mechanism for M3 that it claims will deliver approximately 15.6 times faster response generation when processing documents of up to one million tokens. If the claim holds, it would make ultra-long-context AI agents considerably more economical for enterprise deployment, easing a cost problem that has long limited such agents.

MiniMax's technical report carries particular significance for enterprises seeking to optimize their AI systems or develop proprietary models. When first released, the M2 series posted top-tier benchmark results among open-source language models globally, though that lead has since passed to competitors including DeepSeek and Xiaomi.

Nevertheless, the engineering details in the report offer insights that apply across the industry for improving model performance and agent capabilities. MiniMax's decision to publish these specifications under enterprise-friendly licensing sets it apart from competitors that keep their development closed. Observers in the AI research community, including Adina Yakup of Hugging Face, have praised the work for its contributions to mixture-of-experts efficiency and agent-oriented design.

The M2 architecture is a decoder-only Transformer with sparse mixture-of-experts layers. It has 229.9 billion total parameters and 256 experts, but only 9.8 billion parameters are active for any given token. A critical engineering decision was to use full multi-head attention with grouped query attention in all 62 layers, a deliberate rejection of more computationally efficient alternatives.
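To put the sparse mixture-of-experts figures in perspective, the short Python sketch below computes the share of weights active per token and shows a generic top-k router. The parameter counts and expert count come from the report as quoted here. The number of experts chosen per token (`TOP_K`) and the routing function itself are illustrative assumptions, not MiniMax's published design.

```python
import math
import random

TOTAL_PARAMS_B = 229.9   # total parameters in billions, as quoted in the article
ACTIVE_PARAMS_B = 9.8    # parameters active per token, as quoted in the article
NUM_EXPERTS = 256        # expert count, as quoted in the article
TOP_K = 8                # ASSUMPTION: experts per token is not stated in the article


def active_fraction(total_b: float, active_b: float) -> float:
    """Share of the model's weights that take part in one token's forward pass."""
    return active_b / total_b


def route_token(router_logits: list[float], k: int) -> dict[int, float]:
    """Generic top-k routing: keep the k best experts, softmax their scores."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    peak = max(router_logits[i] for i in top)
    exp_scores = {i: math.exp(router_logits[i] - peak) for i in top}
    norm = sum(exp_scores.values())
    return {i: s / norm for i, s in exp_scores.items()}


if __name__ == "__main__":
    share = active_fraction(TOTAL_PARAMS_B, ACTIVE_PARAMS_B)
    print(f"active share per token: {share:.1%}")

    rng = random.Random(0)  # stand-in for a learned router's scores
    logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
    weights = route_token(logits, TOP_K)
    print(f"experts used: {len(weights)} of {NUM_EXPERTS}")
```

On the quoted figures, roughly 4.3 percent of the model's weights are exercised for each token, which is where the efficiency of a sparse mixture-of-experts design comes from.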

During development, MiniMax researchers tested sub-quadratic approaches, including sliding window attention and compressed linear attention variants, and found that they severely degraded the model's ability to perform multi-hop reasoning across distant sections of long documents. In the company's tests, sub-quadratic configurations dropped accuracy scores from 90.0 to 72.0 on complex reasoning tasks exceeding 32,000 tokens. The team concluded that preserving frontier-level intelligence meant accepting the substantial computational cost of full attention.

The forthcoming M3 series breaks from this constraint with an approach called MiniMax Sparse Attention (MSA), which differs from competitor solutions such as DeepSeek's compressed latent attention. Rather than compressing key-value data into a lower-dimensional space, MSA keeps standard grouped query attention and performs block-level dynamic selection on uncompressed key-value pairs. According to MiniMax, this delivers practical speed improvements while avoiding the precision losses and caching complications of earlier efficiency strategies.
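MiniMax has not yet published how MSA works in detail, so the following Python sketch only illustrates the general idea of block-level selection over uncompressed key-value pairs. It is a single-head toy, not MiniMax's implementation. The block scoring rule (query dotted with each block's mean key), the function names, and the `block_size` and `top_k` parameters are all assumptions made for illustration, and grouped query attention is omitted for brevity.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_blocks(query, keys, block_size, top_k):
    """Return start offsets of the top_k blocks, ranked by a cheap block score.

    ASSUMPTION: a block's score is the query's dot product with the block's
    mean key. The real MSA scoring rule has not been disclosed.
    """
    scored = []
    for start in range(0, len(keys), block_size):
        block = keys[start:start + block_size]
        mean_key = [sum(col) / len(block) for col in zip(*block)]
        scored.append((dot(query, mean_key), start))
    scored.sort(reverse=True)
    return sorted(start for _, start in scored[:top_k])


def block_sparse_attention(query, keys, values, block_size, top_k):
    """Softmax attention restricted to the original K/V rows of chosen blocks."""
    starts = select_blocks(query, keys, block_size, top_k)
    idx = [i for s in starts for i in range(s, min(s + block_size, len(keys)))]
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, keys[i]) * scale for i in idx])
    dim = len(values[0])
    out = [sum(w * values[i][d] for w, i in zip(weights, idx)) for d in range(dim)]
    return out, idx


if __name__ == "__main__":
    query = [1.0, 0.0]
    keys = [[0.0, 1.0]] * 4 + [[5.0, 0.0]] * 2 + [[0.0, 1.0]] * 2
    values = [[float(i)] for i in range(8)]
    out, idx = block_sparse_attention(query, keys, values, block_size=2, top_k=1)
    print("attended positions:", idx, "output:", out)
```

Two properties of the sketch mirror the description above. The selected keys and values are used exactly as stored, with no projection into a smaller space, and the attention step touches at most `top_k * block_size` positions regardless of how long the sequence is. Setting `top_k` high enough to cover every block reduces it to ordinary full attention. The toy recomputes block means on every call; a practical system would presumably cache such summaries.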

The distinction matters most during the decoding phase of text generation, when a language model must repeatedly reference all prior context while generating each new token. At million-token sequence lengths, MiniMax's profiling indicates 9.7 times faster prefilling and 15.6 times faster decoding than the full-attention M2 baseline. The decoding speedup targets the computational bottleneck behind the gradual slowdowns users experience during extended AI conversations.

MiniMax's product trajectory extends beyond architecture into autonomous agent deployment through a reinforcement learning framework called Forge. The system separates agent-side modules, middleware abstraction layers, and training engines, a split designed to handle the extreme variability of multi-step agent environments. This infrastructure culminated in M2.7, which operates as an independent machine learning engineer capable of profiling training runs, diagnosing anomalies, and automatically modifying its own codebase, handling between 30 and 50 percent of its own development workflow.
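The three-way split attributed to Forge can be pictured with a small Python sketch. This is not Forge's actual API: every class and method name below is hypothetical, and the batching rule is an assumption chosen only to show why a middle layer helps when agent episodes vary widely in length.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Trajectory:
    """One agent episode; the number of steps varies from task to task."""
    task_id: str
    steps: list[str] = field(default_factory=list)
    reward: float = 0.0


class ToyAgent:
    """Agent side (hypothetical): runs a task and emits a trajectory."""

    def run(self, task_id: str, n_steps: int) -> Trajectory:
        steps = [f"{task_id}:step{i}" for i in range(n_steps)]
        # Placeholder reward so the trainer has something to average.
        return Trajectory(task_id, steps, reward=1.0 if n_steps % 2 == 0 else 0.0)


class Middleware:
    """Middle layer (hypothetical): buffers episodes between agent and trainer."""

    def __init__(self) -> None:
        self._queue: deque[Trajectory] = deque()

    def submit(self, traj: Trajectory) -> None:
        self._queue.append(traj)

    def next_batch(self, max_steps: int) -> list[Trajectory]:
        """Pack whole trajectories until the step budget would be exceeded.

        The first trajectory is always taken so an oversized episode
        cannot block the queue.
        """
        batch: list[Trajectory] = []
        used = 0
        while self._queue and (
            not batch or used + len(self._queue[0].steps) <= max_steps
        ):
            traj = self._queue.popleft()
            batch.append(traj)
            used += len(traj.steps)
        return batch


class ToyTrainer:
    """Training side (hypothetical): here it only averages batch rewards."""

    def update(self, batch: list[Trajectory]) -> float:
        return sum(t.reward for t in batch) / len(batch)


if __name__ == "__main__":
    agent, bus, trainer = ToyAgent(), Middleware(), ToyTrainer()
    for task_id, n_steps in [("a", 3), ("b", 5), ("c", 2), ("d", 10)]:
        bus.submit(agent.run(task_id, n_steps))
    while batch := bus.next_batch(max_steps=8):
        print([t.task_id for t in batch], trainer.update(batch))
```

In this arrangement the trainer never calls the agent directly, so episodes of 3 steps and episodes of 10 steps can be produced at their own pace and regrouped into batches of bounded size before training.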

As MiniMax prepares to release the full technical specifications for MSA and officially introduce the M3 series, its trajectory points to a broader industry shift toward extracting more practical capability from a smaller computational footprint. With technical documentation now establishing the M2 generation's
