M3 is MiniMax's latest open-weights model, released June 1, 2026. It's the first open-weights model to combine frontier coding performance, a 1-million-token context window via MiniMax Sparse Attention (MSA), and native multimodality (text, image, video) in a single model.

How does M3 compare to GPT-5.5 and Claude on coding?

M3 scores 59.0% on SWE-Bench Pro, beating GPT-5.5 (58.6%) but trailing Claude Opus 4.7 (64.3%). On Terminal Bench 2.1 it scores 66.0%, behind GPT-5.5 (78.2%). On BrowseComp it leads at 83.5 vs Opus 4.7's 79.3.

What is MiniMax Sparse Attention (MSA)?

MSA is a sparse attention architecture that replaces full attention with KV-block selection, reducing per-token compute to roughly 1/20th of the previous generation at 1M tokens. It also ships 15.6x faster decoding and 9.7x faster prefill compared to M2.

How much does M3 cost?

$0.60/M input tokens and $2.40/M output tokens at standard context (up to 512K). Extended context (512K–1M) is double. Prompt cache is $0.12/M. That's roughly 12x cheaper than GPT-5.5 and significantly less than Claude.

Is M3 really open-source?

Model weights will be released on Hugging Face under a modified-MIT license. This makes it open-weight: you can download and self-host. Training code and data are not included, so it's not fully open-source in the strict sense.

MiniMax M3: open-weights coding, 1M context, multimodality at 12x less than GPT

MiniMax M3 is the first open-weights model to combine frontier coding (59% SWE-Bench Pro), a 1M-token context window, and native multimodality. At $0.60/M, it's 12x cheaper than GPT-5.5.

TL;DR: I tested MiniMax M3 expecting tradeoffs. It scores 59% on SWE-Bench Pro. It handles 1M tokens. It processes images. At roughly what DeepSeek charges, with open weights. The catch: the 1M context works but quality fades past 200K tokens.

MiniMax just released M3, and it’s the first open-weights model that doesn’t make you choose.

You want frontier coding? It scores 59% on SWE-Bench Pro: ahead of GPT-5.5. You want 1M context? Its sparse attention architecture makes long-context inference economically viable for the first time in an open-weights model. You want multimodality? Native from the start, not a vision model bolted on after training.

The tradeoff used to be: pick two. M3 is the first model that ships all three at a price that undercuts everything else.

Key takeaways:

59% SWE-Bench Pro beats GPT-5.5 (58.6%) at 12x lower cost

MiniMax Sparse Attention (MSA) cuts 1M-token compute to 1/20th of previous gen

Native multimodality from pretraining: not a post-hoc vision adapter

$0.60/$2.40 per M input/output tokens: about 12x cheaper than GPT-5.5 and 5-6x cheaper than Claude Sonnet

Open-weights on Hugging Face under modified-MIT license (weights only, not training code)

How does MiniMax M3 combine three capabilities?

Most open-weights models specialize. DeepSeek is strong at reasoning but weak at multimodality. Llama 4 is good at vision but trails on coding benchmarks. Kimi K2.7 scores well on coding but doesn’t have native vision.

M3 stacks three capabilities that were previously split across different models:

Coding & agentic. 59% on SWE-Bench Pro beats GPT-5.5’s 58.6%. 66% on Terminal Bench 2.1. 74.2% on MCP Atlas (tool-use benchmark). These aren’t best-in-class scores (Claude Opus 4.7 leads at 64.3% on SWE-Bench Pro), but they’re competitive with the leading closed-source models at a fraction of the price.

1M context with real economics. MiniMax Sparse Attention (MSA) is the technical highlight. Instead of full attention across the entire context, it selects relevant KV blocks. At 1M tokens, this cuts compute to roughly 1/20th of the previous generation. 15.6x faster decoding. 9.7x faster prefill. This makes long-context agentic use cases viable in an open-weights model for the first time.

Native multimodality. M3 was trained on interleaved sequences of text, images, and video from scratch: over 100 trillion tokens. It’s not a text model with a vision adapter. It reads charts, interprets code screenshots, and processes video frames as a core capability, not an afterthought.
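
The KV-block selection idea behind the 1M-context point above can be sketched in a few lines. This is a toy illustration, not MiniMax's implementation: scoring each block by its mean key, the `block_size` and `top_k` parameters, and the function name are all assumptions made for the example. It handles one query vector and attends only over the tokens in the highest-scoring blocks.

```python
import math

def block_sparse_attention(query, keys, values, block_size=4, top_k=2):
    """Attend over only the top_k highest-scoring KV blocks for one query.

    Illustrative only: scoring blocks by their mean key and every parameter
    name here are assumptions, not MiniMax's published MSA design.
    """
    dim = len(query)
    # Split the KV cache into contiguous blocks of token indices.
    blocks = [range(start, min(start + block_size, len(keys)))
              for start in range(0, len(keys), block_size)]

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    def block_score(block):
        # Cheap relevance estimate: query similarity to the block's mean key.
        mean_key = [sum(keys[i][d] for i in block) / len(block) for d in range(dim)]
        return dot(query, mean_key)

    selected = sorted(blocks, key=block_score, reverse=True)[:top_k]
    token_ids = sorted(i for block in selected for i in block)

    # Ordinary scaled dot-product softmax attention, restricted to the selected tokens.
    logits = [dot(query, keys[i]) / math.sqrt(dim) for i in token_ids]
    peak = max(logits)
    weights = [math.exp(logit - peak) for logit in logits]
    total = sum(weights)
    output = [sum(w * values[i][d] for w, i in zip(weights, token_ids)) / total
              for d in range(len(values[0]))]
    return output, token_ids
```

With a fixed `top_k`, the softmax step touches `top_k * block_size` tokens regardless of context length, which is where the compute savings in this kind of design come from. Scoring the blocks still costs work proportional to the number of blocks, so the sketch says nothing about the specific 1/20th figure.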

What the benchmarks say

Benchmark	M3	GPT-5.5	Claude Opus 4.7
SWE-Bench Pro	59.0%	58.6%	64.3%
Terminal Bench 2.1	66.0%	78.2%	-
BrowseComp	83.5	-	79.3
MCP Atlas	74.2%	-	-

M3 leads on BrowseComp and narrowly beats GPT-5.5 on SWE-Bench Pro (59.0% vs 58.6%), though Claude Opus 4.7 stays ahead there. It trails GPT-5.5 on Terminal Bench 2.1. The pattern suggests M3 is stronger at autonomous, long-horizon tasks (BrowseComp, SWE-Bench) than at structured terminal interactions.

What can MiniMax M3 do in real-world demos?

MiniMax published two agentic workflows that demonstrate the combination of capabilities:

ICLR 2025 paper reproduction. M3 independently reproduced core experiments from an ICLR 2025 Outstanding Paper. It ran for 12 hours, produced 18 commits and 23 experimental figures. This required both multimodality (reading paper figures and formulas) and long context (fitting the paper, code, and logs in one window).

GPU kernel optimization. M3 improved a matrix multiplication kernel on NVIDIA Hopper GPUs over 24 hours. It made 1,959 tool calls across 147 benchmark submissions. Hardware utilization went from 7.6% to 71.3%. That’s not a benchmark result: that’s a workflow.

These demos matter because they test the combination, not just individual capabilities. A model can score well on SWE-Bench without handling multimodal input. M3 handles both in the same session.

How does MiniMax M3 pricing compare to alternatives?

Metric	M3	GPT-5.5	Claude Sonnet
Input / M tokens	$0.60	$7.50	$3.00
Output / M tokens	$2.40	$30.00	$15.00
1M-context premium	2x (512K-1M)	-	-

M3 is 12.5x cheaper than GPT-5.5 on both input and output. Against Claude Sonnet, it’s 5x cheaper on input and about 6x cheaper on output.
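
The multiples follow directly from the table. A small sketch for recomputing them if prices change; the `PRICES` dictionary and the `price_ratio` helper are names made up for this example, and the numbers are copied from the table above:

```python
# USD per million tokens, copied from the pricing table above.
PRICES = {
    "M3": {"input": 0.60, "output": 2.40},
    "GPT-5.5": {"input": 7.50, "output": 30.00},
    "Claude Sonnet": {"input": 3.00, "output": 15.00},
}

def price_ratio(other, kind, baseline="M3"):
    """How many times more `other` charges than `baseline` for `kind` tokens."""
    return PRICES[other][kind] / PRICES[baseline][kind]

for model in ("GPT-5.5", "Claude Sonnet"):
    for kind in ("input", "output"):
        print(f"{model} {kind}: {price_ratio(model, kind):.2f}x the M3 price")
```

These are list-price ratios only. They ignore prompt caching, launch discounts, and any difference in how many tokens each model needs to finish the same task.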

The 1M-context pricing is notable: 2x the standard rate for contexts between 512K and 1M tokens. That works out to $1.20/M input above 512K, while a 500K-token context still bills at the standard $0.60/M, so you can run agent loops that span entire codebases without chunking. Try that with GPT-5.5 at $7.50/M.
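
To estimate what a single request costs under this tiering, here is a minimal sketch. It assumes the doubled rate applies to the whole request (input and output) once the prompt exceeds 512K tokens, rather than only to the overflow; the article's pricing summary does not say which, so check the provider's billing docs. The function and constant names are hypothetical.

```python
STANDARD_LIMIT = 512_000      # tokens; standard-context ceiling quoted above
INPUT_PER_M = 0.60            # USD per million input tokens
OUTPUT_PER_M = 2.40           # USD per million output tokens
EXTENDED_MULTIPLIER = 2.0     # 512K-1M contexts are billed at double

def request_cost(input_tokens, output_tokens=0):
    """Estimate the USD cost of one request.

    Assumption: once the prompt exceeds STANDARD_LIMIT, the doubled rate
    applies to the entire request, not just the tokens past the limit.
    """
    multiplier = EXTENDED_MULTIPLIER if input_tokens > STANDARD_LIMIT else 1.0
    input_cost = input_tokens / 1_000_000 * INPUT_PER_M * multiplier
    output_cost = output_tokens / 1_000_000 * OUTPUT_PER_M * multiplier
    return input_cost + output_cost

print(f"500K-token prompt: ${request_cost(500_000):.2f}")
print(f"1M-token prompt:   ${request_cost(1_000_000):.2f}")
```

Under that reading, a 500K-token prompt stays in the standard tier, and the jump to double pricing only happens once a prompt crosses 512K.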

What this means for agent builders

M3 changes the economics for three specific use cases:

Long-context agents. If your agent needs to process an entire codebase, documentation set, or conversation history in one pass, M3’s sparse attention makes it affordable. A 500K-token analysis fits within standard context and costs about $0.30 in input tokens at $0.60/M; at the extended rate of $1.20/M, a full 1M-token pass costs about $1.20.

Multimodal coding agents. If your agent needs to read screenshots, interpret UI mockups, or process video frames alongside code, M3 handles it natively. No separate vision model, no multi-model orchestration.

Cost-sensitive agent loops. At 12x cheaper than GPT-5.5, you can run 12 agent attempts for the same cost as one GPT attempt. For agent loops where retries are common, this changes the economics of quality control.
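
The retry argument can be made concrete with a little probability. The sketch below assumes attempts are independent and that you have a reliable way to tell which attempt succeeded (tests, a verifier); neither is guaranteed in practice. The 40% per-attempt success rate and the dollar figures in the usage lines are invented for illustration, not measurements.

```python
def success_within(attempts, per_attempt_success):
    """Probability that at least one of `attempts` independent tries succeeds."""
    return 1 - (1 - per_attempt_success) ** attempts

def attempts_for_budget(budget_usd, cost_per_attempt_usd):
    """Number of whole attempts affordable within a fixed budget."""
    return int(budget_usd // cost_per_attempt_usd)

# Hypothetical numbers: one expensive attempt costs $1.00, a cheap one $0.08.
cheap_attempts = attempts_for_budget(1.00, 0.08)
print(cheap_attempts, "cheap attempts for the price of one expensive attempt")
print(f"{success_within(cheap_attempts, 0.40):.3f} chance that at least one succeeds")
```

If failures are correlated, for example because the model misreads the same requirement every time, extra attempts help far less than this formula suggests.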

What are the honest tradeoffs of MiniMax M3?

M3 isn’t the best model on any single dimension. Claude Opus leads on coding benchmarks. GPT-5.5 leads on terminal interaction. Dedicated vision models may outperform M3 on specific multimodal tasks.

What M3 offers is the combination at a price that makes it practical for production use. If your agent needs long context, coding capability, and multimodality in the same loop, M3 is the first open-weights model that ships all three without requiring a second mortgage.

The weights are coming to Hugging Face. The API is live at platform.minimax.io. And MiniMax Code (code.minimax.io) is the coding interface if you want to try it without setting up an API key.

FAQ

Is M3 open-source or just open-weights? Open-weights. The model parameters will be on Hugging Face under a modified-MIT license. Training code and data are not included.

How does MSA compare to other attention architectures? MSA works on uncompressed KVs, unlike DeepSeek’s MLA which uses compressed latent vectors. This avoids precision loss in long-context inference but means the architecture is different from the current open-weight standard.

Can I self-host M3? Yes, once the weights are released. The model is large, so expect significant VRAM requirements at 1M context. Self-hosting removes per-token API fees, but you pay for the hardware instead.

Should I switch my agent to M3 today? Test it on your specific agent loop first. If your loop needs long context, multimodality, or cost efficiency, M3 is worth a serious evaluation. If you need maximum coding capability for complex tasks, Claude is still the leader.

What about the 7-day launch discount? MiniMax offered 50% off standard pricing for the first seven days after the June 1 launch. Check current pricing at platform.minimax.io.
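
On the self-hosting question above: sparse attention reduces compute, but an uncompressed KV cache still has to be stored for every token. A rough way to size it is sketched below. The layer count, KV head count, and head dimension in the example call are hypothetical placeholders, since M3's real configuration is not given in this article; substitute the values from the model card once the weights are published.

```python
def kv_cache_gib(seq_len, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Size in GiB of an uncompressed KV cache for a single sequence.

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_value=2 corresponds to 16-bit storage.
    """
    total_bytes = 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value
    return total_bytes / 2**30

# Hypothetical configuration for illustration only, not M3's real architecture.
example = kv_cache_gib(seq_len=1_000_000, num_layers=60, num_kv_heads=8, head_dim=128)
print(f"{example:.1f} GiB of KV cache at 1M tokens")
```

This counts the cache alone. Model weights, activations, and any batching across concurrent requests come on top.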

Artificial Analysis ranks MiniMax M3 as a leading open-weights model with competitive coding and reasoning scores. Community discussion on LocalLLaMA covers real-world usage of M3.


This article was published on Agentic Up (https://agenticup.dev): practical guides for developers and founders building with AI agents.