MiniMax M3: open-weights coding, 1M context, and multimodality at 12x less than GPT
MiniMax M3 is the first open-weights model to combine frontier coding (59% SWE-Bench Pro), a 1M-token context window via sparse attention, and native multimodality. At $0.60/M input tokens, it's 12x cheaper than GPT-5.5.
MiniMax just released M3, and it’s the first open-weights model that doesn’t make you choose.
You want frontier coding? It scores 59% on SWE-Bench Pro — ahead of GPT-5.5. You want 1M context? Its sparse attention architecture makes long-context inference economically viable for the first time in an open-weights model. You want multimodality? Native from the start, not a vision model bolted on after training.
The tradeoff used to be: pick two. M3 is the first model that delivers all three at a price that undercuts everything else.
Key takeaways:
- 59% SWE-Bench Pro beats GPT-5.5 (58.6%) at 12x lower cost
- MiniMax Sparse Attention (MSA) cuts 1M-token compute to 1/20th of previous gen
- Native multimodality from pretraining — not a post-hoc vision adapter
- $0.60/$2.40 per M tokens (input/output): 12.5x cheaper than GPT-5.5, roughly 5-6x cheaper than Claude Sonnet
- Open weights coming to Hugging Face under a modified-MIT license (weights only, no training code or data)
The three in one
Most open-weights models specialize. DeepSeek is strong at reasoning but weak at multimodality. Llama 4 is good at vision but trails on coding benchmarks. Kimi K2.7 scores well on coding but doesn’t have native vision.
M3 stacks three capabilities that were previously split across different models:
Coding & agentic. 59% on SWE-Bench Pro, narrowly ahead of GPT-5.5’s 58.6%. 66% on Terminal Bench 2.1. 74.2% on MCP Atlas (tool-use benchmark). These aren’t best-in-class scores (Claude Opus 4.7 leads at 64.3% on SWE-Bench Pro, and GPT-5.5 leads Terminal Bench 2.1 at 78.2%), but they’re competitive with closed-source models at a fraction of the price.
1M context with real economics. MiniMax Sparse Attention (MSA) is the technical highlight. Instead of full attention across the entire context, it selects relevant KV blocks. At 1M tokens, this cuts compute to roughly 1/20th of the previous generation. 15.6x faster decoding. 9.7x faster prefill. This makes long-context agentic use cases viable in an open-weights model for the first time.
Native multimodality. M3 was trained on interleaved sequences of text, images, and video from scratch — over 100 trillion tokens. It’s not a text model with a vision adapter. It reads charts, interprets code screenshots, and processes video frames as a core capability, not an afterthought.
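To make the "selects relevant KV blocks" idea from the 1M-context paragraph concrete, here is a minimal pure-Python sketch of generic block-sparse attention for a single query. It is not MiniMax's implementation: the scoring rule (query dotted with each block's mean key), the function names, and the `block_size` and `top_k` parameters are all assumptions chosen for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def select_blocks(query, keys, block_size, top_k):
    """Score each KV block by query . mean(block keys); return the top_k block indices.

    Assumption: mean-key scoring is an illustrative heuristic, not MSA's actual rule.
    """
    scores = []
    for start in range(0, len(keys), block_size):
        block = keys[start:start + block_size]
        mean_key = [sum(col) / len(block) for col in zip(*block)]
        scores.append((dot(query, mean_key), start // block_size))
    chosen = sorted(scores, reverse=True)[:top_k]
    return sorted(idx for _, idx in chosen)

def block_sparse_attention(query, keys, values, block_size, top_k):
    """Softmax attention restricted to the tokens inside the selected blocks."""
    blocks = select_blocks(query, keys, block_size, top_k)
    positions = [p for b in blocks
                 for p in range(b * block_size, min((b + 1) * block_size, len(keys)))]
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, keys[p]) * scale for p in positions])
    dim = len(values[0])
    out = [sum(w * values[p][d] for w, p in zip(weights, positions)) for d in range(dim)]
    return out, blocks

# Toy data: 8 tokens in 4 blocks of 2; the value of each token is its position.
demo_keys = [[0, 1], [0, 1], [5, 0], [5, 0], [-1, 0], [-1, 0], [3, 0], [3, 0]]
demo_values = [[float(i)] for i in range(8)]
demo_out, demo_blocks = block_sparse_attention([1.0, 0.0], demo_keys, demo_values,
                                               block_size=2, top_k=2)
print(demo_blocks)  # [1, 3]
```

In this toy setup the query attends to 4 of 8 tokens. The savings at long context come from the same effect: the softmax runs over `top_k * block_size` tokens instead of the full sequence, while the keys and values themselves stay uncompressed.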
What the benchmarks actually say
| Benchmark | M3 | GPT-5.5 | Claude Opus 4.7 |
|---|---|---|---|
| SWE-Bench Pro | 59.0% | 58.6% | 64.3% |
| Terminal Bench 2.1 | 66.0% | 78.2% | — |
| BrowseComp | 83.5 | — | 79.3 |
| MCP Atlas | 74.2% | — | — |
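M3 leads Claude Opus 4.7 on BrowseComp (83.5 vs. 79.3) and edges out GPT-5.5 on SWE-Bench Pro by 0.4 points, which is close enough to call a tie. It trails GPT-5.5 on Terminal Bench 2.1 by about 12 points (66.0% vs. 78.2%). The pattern suggests M3 is stronger at autonomous, long-horizon tasks (BrowseComp, SWE-Bench) than at structured terminal interactions.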
The real-world demos
MiniMax published two agentic workflows that demonstrate the combination of capabilities:
ICLR 2025 paper reproduction. M3 independently reproduced core experiments from an ICLR 2025 Outstanding Paper. It ran for 12 hours, produced 18 commits and 23 experimental figures. This required both multimodality (reading paper figures and formulas) and long context (fitting the paper, code, and logs in one window).
GPU kernel optimization. M3 optimized a matrix multiplication kernel on NVIDIA Hopper GPUs over 24 hours. It made 1,959 tool calls across 147 benchmark submissions. Hardware utilization went from 7.6% to 71.3%. That’s not a benchmark result — that’s a workflow.
These demos matter because they test the combination, not just individual capabilities. A model can score well on SWE-Bench without handling multimodal input. M3 handles both in the same session.
Pricing that changes the math
| Metric | M3 | GPT-5.5 | Claude Sonnet |
|---|---|---|---|
| Input / M tokens | $0.60 | $7.50 | $3.00 |
| Output / M tokens | $2.40 | $30.00 | $15.00 |
| 1M-context premium | 2x (512K-1M) | — | — |
M3 is 12.5x cheaper than GPT-5.5 on both input and output. Against Claude Sonnet, it’s 5x cheaper on input and 6.25x cheaper on output.
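If you want to check the multiples yourself, the arithmetic is a one-liner per cell. This is a small sketch using only the list prices from the table above; the `PRICES` dictionary and `price_ratio` helper are names made up for this example.

```python
# USD per million tokens, copied from the pricing table above.
PRICES = {
    "M3": {"input": 0.60, "output": 2.40},
    "GPT-5.5": {"input": 7.50, "output": 30.00},
    "Claude Sonnet": {"input": 3.00, "output": 15.00},
}

def price_ratio(other, kind, base="M3"):
    """How many times more expensive `other` is than `base` for input or output tokens."""
    return PRICES[other][kind] / PRICES[base][kind]

for model in ("GPT-5.5", "Claude Sonnet"):
    for kind in ("input", "output"):
        print(f"{model} {kind}: {price_ratio(model, kind):.2f}x the M3 price")
```

Swap in your own negotiated or discounted rates to see how the multiples change for your account.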
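The 1M-context pricing is notable: 2x the standard rate for contexts between 512K and 1M tokens. Below 512K, the standard $0.60/M applies, so a 500K-token prompt costs about $0.30 in input. From 512K up to 1M the rate is $1.20/M, so a full 1M-token prompt costs up to about $1.20, and you can run agent loops that span entire codebases without chunking. Try that with GPT-5.5 at $7.50/M.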
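For budgeting, the tiered rate can be turned into a small estimator. This is a rough sketch, not MiniMax's billing logic. It assumes that "512K" means 512,000 tokens, that "1M" means 1,000,000 tokens, and that the 2x rate applies to the whole prompt once it crosses the threshold. Check the pricing page before relying on any of those.

```python
STANDARD_INPUT_PER_M = 0.60        # USD per million input tokens (from the table)
LONG_CONTEXT_MULTIPLIER = 2.0      # the 2x premium quoted for 512K-1M contexts
LONG_CONTEXT_THRESHOLD = 512_000   # assumption: "512K" = 512,000 tokens
MAX_CONTEXT = 1_000_000            # assumption: "1M" = 1,000,000 tokens

def input_cost_usd(prompt_tokens):
    """Estimate the input cost of one request under the assumed tier rules."""
    if not 0 <= prompt_tokens <= MAX_CONTEXT:
        raise ValueError("prompt_tokens must be between 0 and MAX_CONTEXT")
    rate = STANDARD_INPUT_PER_M
    if prompt_tokens > LONG_CONTEXT_THRESHOLD:
        # Assumption: the premium applies to the entire prompt, not just the overflow.
        rate *= LONG_CONTEXT_MULTIPLIER
    return prompt_tokens / 1_000_000 * rate

for tokens in (200_000, 800_000, 1_000_000):
    print(f"{tokens:>9,} tokens -> ${input_cost_usd(tokens):.2f}")
```

Under these assumptions an 800K-token prompt costs about $0.96 in input and a full 1M-token prompt about $1.20. In an agent loop, multiply by the number of turns that resend the full context.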
What this means for agent builders
M3 changes the economics for three specific use cases:
Long-context agents. If your agent needs to process an entire codebase, documentation set, or conversation history in one pass, M3’s sparse attention makes it affordable. A 500K-token analysis sits under the 512K threshold, so it costs about $0.30 in input tokens at the standard $0.60/M. At the $1.20/M extended-context rate, a 1M-token pass costs up to about $1.20.
Multimodal coding agents. If your agent needs to read screenshots, interpret UI mockups, or process video frames alongside code, M3 handles it natively. No separate vision model, no multi-model orchestration.
Cost-sensitive agent loops. At roughly 12x cheaper than GPT-5.5, you can run about 12 agent attempts for the cost of one GPT-5.5 attempt, assuming similar token usage per attempt. For agent loops where retries are common, this changes the economics of quality control.
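The retry argument can be made concrete with a back-of-the-envelope calculation. The sketch below assumes attempts are independent and that you have a reliable check (tests, a verifier) to recognize a successful attempt. The per-attempt success rates are hypothetical placeholders, not measured numbers for either model.

```python
import math

def attempts_for_same_cost(price_ratio):
    """Whole attempts on the cheaper model that fit in the budget of one pricier attempt."""
    return math.floor(price_ratio)

def success_within(attempts, p_single):
    """Probability that at least one of `attempts` independent tries succeeds."""
    return 1.0 - (1.0 - p_single) ** attempts

# Hypothetical inputs: a 12.5x price gap and made-up per-attempt success rates.
n = attempts_for_same_cost(12.5)
cheap_p, pricey_p = 0.40, 0.60
print(f"{n} cheap attempts: {success_within(n, cheap_p):.3f}")
print(f"1 pricey attempt: {success_within(1, pricey_p):.3f}")
```

The independence assumption is generous: failures on the same task tend to be correlated, so treat the result as an upper bound and measure on your own loop.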
The honest tradeoffs
M3 isn’t the best model on any single dimension. Claude Opus leads on coding benchmarks. GPT-5.5 leads on terminal interaction. Dedicated vision models may outperform M3 on specific multimodal tasks.
What M3 offers is the combination at a price that makes it practical for production use. If your agent needs long context, coding capability, and multimodality in the same loop, M3 is the first open-weights model that delivers all three without requiring a second mortgage.
The weights are coming to Hugging Face. The API is live at platform.minimax.io. And MiniMax Code (code.minimax.io) is the coding interface if you want to try it without setting up an API key.
FAQ
Is M3 actually open-source or just open-weights? Open-weights. The model parameters will be on Hugging Face under a modified-MIT license. Training code and data are not included.
How does MSA compare to other attention architectures? MSA works on uncompressed KVs, unlike DeepSeek’s MLA, which uses compressed latent vectors. This avoids the precision loss that compression can introduce in long-context inference, but it means the architecture differs from the current open-weights standard.
Can I self-host M3? Yes, once the weights are released. The model is large, so expect significant VRAM requirements at 1M context, but self-hosting replaces per-token API fees with your own hardware and operating costs.
Should I switch my agent to M3 today? Test it on your specific agent loop first. If your loop needs long context, multimodality, or cost efficiency, M3 is worth a serious evaluation. If you need maximum coding capability for complex tasks, Claude is still the leader.
What about the 7-day launch discount? MiniMax offered 50% off standard pricing for the first seven days after the June 1 launch. Check current pricing at platform.minimax.io.
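On the self-hosting question above, a standard formula shows why VRAM grows with context length: an uncompressed KV cache stores one key and one value vector per layer, per KV head, per token. The sketch below applies that formula. The layer count, head count, and head dimension are hypothetical placeholders, since M3's actual configuration is not given here.

```python
def kv_cache_gib(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Size of an uncompressed KV cache in GiB for one sequence.

    The leading 2 accounts for storing both keys and values.
    bytes_per_value=2 corresponds to 16-bit storage (fp16/bf16).
    """
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value
    return total_bytes / 2**30

# Hypothetical configuration, NOT M3's real architecture.
size = kv_cache_gib(seq_len=1_000_000, n_layers=60, n_kv_heads=8, head_dim=128)
print(f"{size:.1f} GiB of KV cache for one 1M-token sequence")
```

This counts only the cache for a single sequence. Model weights, activations, and any concurrent requests come on top, and quantized caches would shrink the figure.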
Related Posts
- Kimi K2.7 Code: the first open-source model that competes with Claude Code — Another strong open-weights coding model, focused on Claude Code compatibility
- When a ‘worse’ model beats a frontier model for agent work — Why cheaper models often outperform frontier ones in agent loops
- My AI model picks — Current top picks for agent development, updated quarterly
This article was published on Agentic Up (https://agenticup.dev): practical guides for developers and founders building with AI agents.