Cohere North Mini Code: a 30B MoE model for agentic coding
TL;DR: Cohere released North Mini Code, a 30B MoE model with 3B active parameters trained for agentic coding. It scores 33.4 on the Artificial Analysis Coding Index, ahead of Qwen 3.5 and Gemma 4, and is available on Hugging Face under the Apache 2.0 license.
Cohere just released North Mini Code, a 30B-parameter Mixture-of-Experts model with only 3B active parameters per token. Most coding models are evaluated on static benchmarks like HumanEval or MBPP — write a function, pass the tests. North Mini Code was trained and evaluated on agentic coding tasks: edit codebases, run terminal commands, fix bugs across multiple files, and navigate real software engineering workflows. That’s a fundamentally different capability.
Key takeaways:
- North Mini Code is a 30B MoE model with 3B active parameters, released on Hugging Face under Apache 2.0
- Scores 33.4 on the Artificial Analysis Coding Index, ahead of Qwen 3.5, Gemma 4, and models up to 4x its size
- Trained with multi-scaffold RLVR for agentic coding, not just text generation
- 128 experts with 8 active per token, so per-token compute is close to that of a 3B dense model
- Available now in OpenCode and via Hugging Face
Architecture: sparse MoE with 128 experts
North Mini Code is a decoder-only Transformer with a sparse Mixture-of-Experts architecture. The 30B total parameters are distributed across 128 experts, with only 8 activated per token. As a result, the per-token compute is closer to that of a 3B dense model than a 30B one, although all 30B parameters still have to be held in memory.
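To make the sparse-activation arithmetic concrete, here is a rough back-of-envelope sketch. The expert counts (128 total, 8 active) and parameter counts (30B total, 3B active) come from the description above. The "about 2 FLOPs per active parameter per token" rule is a common approximation for a forward pass that I'm assuming here, not a figure from Cohere.

```python
# Back-of-envelope arithmetic for sparse MoE activation.
TOTAL_PARAMS = 30e9       # from the model description
ACTIVE_PARAMS = 3e9       # from the model description
NUM_EXPERTS = 128
EXPERTS_PER_TOKEN = 8


def forward_flops_per_token(active_params: float) -> float:
    """Approximate forward-pass FLOPs per token (assumed: ~2 per active parameter)."""
    return 2.0 * active_params


expert_fraction = EXPERTS_PER_TOKEN / NUM_EXPERTS
param_fraction = ACTIVE_PARAMS / TOTAL_PARAMS
moe_flops = forward_flops_per_token(ACTIVE_PARAMS)
dense_flops = forward_flops_per_token(TOTAL_PARAMS)

print(f"experts active per token: {expert_fraction:.2%}")
print(f"parameters active per token: {param_fraction:.0%}")
print(f"approx. compute saving vs a dense 30B model: {dense_flops / moe_flops:.0f}x")
```

This prints 6.25% of experts and 10% of parameters active per token, for roughly a 10x compute saving. The parameter share is higher than the expert share presumably because components outside the experts (attention, embeddings) run for every token.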
The attention layers interleave sliding-window attention (with RoPE) and global attention (without positional embeddings) in a 3:1 ratio: three sliding-window layers for every global attention layer. This keeps context processing efficient while preserving the model’s ability to reason across long codebases.
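The 3:1 interleaving is easier to picture as a layer schedule plus two mask shapes. The sketch below is illustrative only: the layer count, the window size, and the position of the global layer within each group of four are my assumptions, since none of them are stated above.

```python
from typing import List


def layer_pattern(num_layers: int, local_per_global: int = 3) -> List[str]:
    """Attention type per layer: `local_per_global` sliding-window layers, then one global."""
    period = local_per_global + 1
    return ["global" if (i + 1) % period == 0 else "sliding" for i in range(num_layers)]


def attention_mask(seq_len: int, kind: str, window: int) -> List[List[bool]]:
    """Causal mask; mask[q][k] is True when query position q may attend to key position k."""
    mask = []
    for q in range(seq_len):
        row = []
        for k in range(seq_len):
            causal = k <= q
            in_window = (q - k) < window
            # Global layers see the whole causal prefix; sliding layers see only the window.
            row.append(causal and (kind == "global" or in_window))
        mask.append(row)
    return mask


pattern = layer_pattern(8)  # hypothetical 8-layer stack
print(pattern)

sliding = attention_mask(5, "sliding", window=2)
full = attention_mask(5, "global", window=2)
print(sliding[4])  # last token in a sliding-window layer
print(full[4])     # last token in a global layer
```

In a sliding-window layer the last token only sees itself and its nearest predecessor (with `window=2`), while in a global layer it sees the full prefix. That difference is where the efficiency gain comes from: attention cost in the sliding-window layers grows with the window size rather than the full context length.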
The feed-forward block uses SwiGLU activation, and the router applies a sigmoid activation to logits before top-k selection — a detail that matters for training stability in sparse MoE models.
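Here is a minimal sketch of sigmoid-then-top-k routing in plain Python. With a softmax router, expert scores compete because they must sum to 1 across all experts; with a sigmoid, each expert is scored independently. Renormalizing the selected weights so they sum to 1 is my assumption, as is the `route` helper itself; the description above only says a sigmoid is applied before top-k selection.

```python
import math
from typing import List, Tuple


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def route(logits: List[float], k: int) -> List[Tuple[int, float]]:
    """Score each expert independently, keep the top-k, and renormalize their weights."""
    scores = [sigmoid(z) for z in logits]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    total = sum(scores[i] for i in top)
    # Renormalization over the selected experts is an assumption of this sketch.
    return [(i, scores[i] / total) for i in top]


# Toy example: 4 experts, 2 active (the real model uses 128 and 8).
chosen = route([2.0, -1.0, 0.5, 3.0], k=2)
for expert_id, weight in chosen:
    print(f"expert {expert_id}: weight {weight:.3f}")
```

The token's output would then be the weighted sum of the selected experts' feed-forward outputs.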
Post-training: two-stage SFT + RLVR
The training pipeline is where the “agentic” part happens. After pre-training, Cohere ran two phases of supervised fine-tuning followed by a phase of reinforcement learning with verifiable rewards (RLVR) targeting software engineering and terminal tasks.
The key insight: they used multiple agent scaffolds during RL training — not a single harness. This prevents the model from overfitting to one tool’s quirks and makes it useful across different coding agents, whether it’s OpenCode, Claude Code, or a custom harness you built yourself.
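To show what "multiple scaffolds plus a verifiable reward" could look like in a data-collection loop, here is a toy sketch. It is not Cohere's pipeline: the harness names, the stub rollouts, the uniform sampling, and the all-tests-pass binary reward are all hypothetical choices I made for illustration.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# A rollout runs one task inside one scaffold and reports (tests_passed, tests_total).
Rollout = Callable[[str], Tuple[int, int]]


@dataclass
class Episode:
    scaffold: str
    task_id: str
    reward: float


def verifiable_reward(tests_passed: int, tests_total: int) -> float:
    """Binary reward: 1.0 only if every test passes (an assumed reward design)."""
    return 1.0 if tests_total > 0 and tests_passed == tests_total else 0.0


def collect_episodes(
    tasks: List[str],
    scaffolds: Dict[str, Rollout],
    rng: random.Random,
    episodes_per_task: int = 2,
) -> List[Episode]:
    """Sample a scaffold per episode so no single harness dominates the training data."""
    names = sorted(scaffolds)
    episodes = []
    for task_id in tasks:
        for _ in range(episodes_per_task):
            name = rng.choice(names)
            passed, total = scaffolds[name](task_id)
            episodes.append(Episode(name, task_id, verifiable_reward(passed, total)))
    return episodes


# Stub rollouts standing in for real agent runs (hypothetical).
def stub_all_pass(task_id: str) -> Tuple[int, int]:
    return (3, 3)


def stub_one_fails(task_id: str) -> Tuple[int, int]:
    return (2, 3)


episodes = collect_episodes(
    ["task-1", "task-2"],
    {"harness_a": stub_all_pass, "harness_b": stub_one_fails},
    random.Random(0),
)
for ep in episodes:
    print(ep)
```

The point of the design is that the reward depends only on a checkable outcome (the tests), while the tool interface varies from episode to episode.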
Benchmark results
On Artificial Analysis’ Coding Index, North Mini Code scores 33.4, ahead of every other model in this comparison:
| Model | Size (total and active parameters) | Coding Index score |
|---|---|---|
| North Mini Code | 30B-A3B MoE | 33.4 |
| Qwen 3.5 | 35B-A3B MoE | Lower |
| Gemma 4 | 26B-A4B MoE | Lower |
| Devstral Small 2 | 24B Dense | Lower |
| Nemotron 3 Super | 120B-A12B MoE | Lower |
| Mistral Small 4 | 119B-A6B MoE | Lower |
A 30B MoE model with only 3B active parameters outscoring models up to 4x its size (Nemotron 3 Super at 120B, Mistral Small 4 at 119B) on agentic coding tasks is worth paying attention to.
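For a quick sense of scale, this small sketch computes how each model in the table compares to North Mini Code in total and active parameters. The sizes are read straight from the table; treating the dense Devstral Small 2 as having all 24B parameters active is my reading of "Dense".

```python
from typing import List, Tuple

Model = Tuple[str, float, float]  # (name, total params in billions, active params in billions)

BASELINE: Model = ("North Mini Code", 30, 3)
OTHERS: List[Model] = [
    ("Qwen 3.5", 35, 3),
    ("Gemma 4", 26, 4),
    ("Devstral Small 2", 24, 24),  # dense: every parameter is active
    ("Nemotron 3 Super", 120, 12),
    ("Mistral Small 4", 119, 6),
]


def size_ratios(model: Model, baseline: Model = BASELINE) -> Tuple[float, float]:
    """Return (total-parameter ratio, active-parameter ratio) relative to the baseline."""
    _, total, active = model
    _, base_total, base_active = baseline
    return total / base_total, active / base_active


for model in OTHERS:
    total_x, active_x = size_ratios(model)
    print(f"{model[0]:<18} {total_x:.2f}x total, {active_x:.2f}x active")
```

The two largest models are about 4x bigger in total parameters, and every model except Qwen 3.5 runs more active parameters per token.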
What this means for AI engineering
Three takeaways:
1. The “agentic coding” benchmark gap is closing. Until recently, proprietary models dominated SWE-Bench and similar agentic benchmarks. With open-weight models catching up, you can run capable coding agents without depending on a single API provider. This follows the trajectory I described in my comparison of AI coding tools: the open-source ecosystem is catching up faster than most expected.
2. MoE makes local inference practical. With 3B active parameters, the per-token compute is roughly that of a small dense model, because inference compute scales with active parameters, not total parameters. Memory is the caveat: all 30B weights still need to be loaded or offloaded. Within that limit, North Mini Code is viable on consumer GPUs and potentially even laptops for simpler agentic tasks, which is relevant to the local-first agent pattern I’ve been building around.
3. Multi-scaffold training is the right approach. A model trained on one agent harness develops quirks specific to that harness. Training across multiple scaffolds generalizes better — a lesson for anyone building or fine-tuning coding agents.
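For point 2, the number to check before trying this locally is the weight footprint, since it depends on total parameters rather than active ones. The sketch below is a rough estimate only: it counts weights and ignores the KV cache, activations, and runtime overhead, and whether quantized checkpoints of this model exist is something I'm assuming, not something stated above.

```python
# Rough weight-memory estimate for a 30B-parameter model at different precisions.
BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(total_params: float, dtype: str) -> float:
    """Weights only, in GB (1 GB = 1e9 bytes); excludes KV cache and activations."""
    return total_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: ~{weight_memory_gb(30e9, dtype):.0f} GB of weights")
```

That works out to roughly 60 GB at bf16, 30 GB at 8-bit, and 15 GB at 4-bit, which is why quantization or CPU offload matters on a single consumer GPU even though the per-token compute is small.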
Cohere positions North Mini Code as the first model in a new family, with more sizes likely on the way. For now, it’s available on Hugging Face under Apache 2.0, and you can try it in OpenCode today.
Pull the model from Hugging Face and run it with your agent harness of choice. The multi-scaffold training means it should work well with OpenCode, Claude Code, or a custom setup. Let me know what you find — I'm curious how it handles real-world agentic tasks beyond the benchmarks.
If you’ve been waiting for an open-weight coding model that treats agentic workflows as a first-class concern rather than an afterthought, this is worth a look.
This article was published on Agentic Up (https://agenticup.dev), practical guides for developers and founders building with AI agents.