SHIP · Jun 18, 2026

Gemma 4 12B runs locally. It nearly matches a 26B model.

Google's Gemma 4 12B fits 16GB machines at Q4, has native function calling, and nearly matches its 26B MoE sibling on benchmarks. I ran it on an M2 Air.

Agent-ready: drop this post into Claude Code or Codex

TL;DR: Google released Gemma 4 12B under Apache 2.0. It fits 16GB machines at Q4, has native function calling, a 262K context window, and an encoder-free architecture that feeds vision and audio directly into the transformer. On standard benchmarks it nearly matches the 26B MoE model at less than half the memory. Qwen 3.5 fine-tunes still lead on pure coding benchmarks, but Gemma 4 12B wins on multimodal capability and agentic workflow support.

Key takeaways:

  • Gemma 4 12B runs at Q4 on 16GB machines with 15-25 tok/s on M-series Macs
  • Encoder-free architecture replaces separate vision/audio encoders with direct linear projections into the LLM
  • Nearly matches Gemma 4 26B MoE on standard benchmarks at less than half the memory footprint
  • Native function calling, system prompts, and 262K context make it viable for agent workflows
  • Qwen 3.5 9B fine-tunes still edge ahead on pure coding benchmarks (5/8 shared benchmarks)
  • Apache 2.0 license, available on Ollama, LM Studio, llama.cpp, and HuggingFace

I pulled Gemma 4 12B from Ollama last night. Three commands: ollama pull gemma4:12b, ollama run gemma4:12b, and it was generating text inside a terminal on my M2 Air with 16GB of memory. No GPU. No cloud. No credit card.

The model responded in about two seconds with a working Python script. I checked Activity Monitor. Memory pressure stayed green. No swapping.

That is the headline. Google released a 12-billion-parameter model with 262K context, native function calling, and multimodal input that runs on a machine that costs $1,000. Under Apache 2.0 license.

What makes Gemma 4 12B different from other small models?

Other small models bolt on vision by stacking a separate vision transformer in front of the LLM. The result is a two-stage pipeline: encode the image, then feed the encoding into the language model. It works. It is also memory-intensive and complicates fine-tuning because you have to train two separate components.

The kitchen analogy is a restaurant with a separate prep kitchen. The prep kitchen chops vegetables, then sends them to the main kitchen through a pass. Two kitchens cost more and the handoff creates friction.

Gemma 4 12B is a single kitchen. It replaces the 27-layer vision transformer with a 35-million-parameter linear projection. The image gets sliced into 48x48 pixel patches and projected directly into the LLM’s hidden space with one matrix multiplication. Audio gets sliced into 40-millisecond frames and projected the same way.

The result: a single decoder-only transformer handles text, images, and audio through the same weights (Google AI dev docs). Fine-tuning one LoRA adapter updates the entire multimodal loop in one pass.

This architecture choice directly affects agent work. An agent that processes screenshots, reads documents, and handles voice input doesn’t need three separate encoder stacks. It runs one model. The inputs merge into the same hidden space from the first layer. This simplifies the agent’s tool-use stack and reduces the memory overhead of running multiple encoder models alongside the LLM.

How does Gemma 4 12B compare to Qwopus 3.5 9B and LFM 2.5 8B?

These are the two models I keep on my machine. Qwopus 3.5 9B Coder MTP for coding tasks (87.8% HumanEval). LFM 2.5 8B A1B for general agent work (1.5B active parameters via MoE, fast). I compared all three on the criteria that matter for local agent development.

CriterionGemma 4 12BQwopus 3.5 9B Coder MTPLFM 2.5 8B A1B
ArchitectureDense, encoder-freeDense, text-onlyMoE, 1.5B active
Quantized size (Q4)~7.2 GB~5.5 GB~4.8 GB
Context window262K131K32K
MultimodalImage + audioText onlyText only
Function callingNativeNativeNative
Coding benchmarkNear 26B MoE (~80% MMLU est.)HumanEval 87.8%Good, not top-tier
LicenseApache 2.0Apache 2.0Apache 2.0
AvailabilityOllama, LM Studio, llama.cppOllama, LM StudioOllama, LM Studio

The community comparison on Reddit shows Qwen 3.5 9B beating Gemma 4 12B in 5 out of 8 shared benchmarks. Qwopus is a fine-tuned variant of Qwen 3.5 9B specialized for coding with multi-token prediction. On pure code generation and tool-calling accuracy, Qwopus still leads.

But the comparison isn’t apples to apples. Qwen 3.5 9B is text-only. Gemma 4 12B processes images and audio natively. If your agent needs to read screenshots or handle voice input, Gemma 4 12B eliminates the second model you would otherwise need to run alongside a text-only LLM.

LFM 2.5 8B A1B stays the speed leader for simple agent loops. Its 1.5B active parameters mean it generates tokens faster on memory-constrained hardware. For straightforward tool calls and structured output, it is still the pragmatic choice.

What does the 262K context do for agents?

Agent sessions accumulate context fast. A single multi-step tool call chain with file reads, code generation, and error handling can consume 10-20K tokens in minutes. By the thirtieth turn in a session, you have crossed 50K tokens easily.

Most small models cap at 32K or 128K. At those limits, the agent either loses earlier context or requires expensive summarization strategies.

Gemma 4 12B supports 262K tokens natively. That means a 100-turn agent session with file browsing, code editing, test runs, and debugging fits without summarization. The model maintains coherence across the full window. This isn’t a feature for chatbots. It is a feature for agents that work for hours on a single task.

The 26B MoE and 31B Dense models also support 262K context, but they require 28-57 GB of memory. The 12B achieves the same context length at 13.4 GB. For developers running agents on a laptop, this is the practical difference between viable and impossible.

What does encoder-free architecture mean for agent workflows?

Consider what an agent does in a session. It reads a screenshot of a UI to determine the next action. It processes a PDF of documentation to extract an API endpoint. It receives a voice command from the user. Then it generates code, runs it, and evaluates the output.

With the traditional approach, the agent needs a vision encoder to process the screenshot, a document parser for the PDF, an audio encoder for the voice input, and the LLM itself for the text generation. That is three to four models running simultaneously or in sequence. Each one consumes memory, adds latency, and requires its own integration point in the agent harness.

Gemma 4 12B handles all of these inputs through one transformer. The 35M-parameter vision embedder replaces the 27-layer ViT. The audio projection replaces the separate audio encoder. The text path goes through the same decoder. One model. One memory budget. One integration.

For a local agent stack on a 16GB machine, this is the difference between running one model and running three.

Who is the 12B not for?

If you need the highest coding accuracy on complex multi-file refactoring tasks, Qwopus 3.5 9B Coder MTP or Qwen 3.5 9B still deliver better results on benchmarked coding tasks.

If your agent runs on a server with 48GB+ VRAM and you need maximum quality, the Gemma 4 31B Dense model scores 87.1% on MMLU and 82.7% on HumanEval. The 12B is a compromise for local deployment, not a replacement for the full-sized model.

If your workflow is strictly text-based and doesn’t use images or audio, a text-only model of comparable size will match or beat Gemma 4 12B on speed and accuracy for a smaller memory footprint.

If you need audio output or video understanding, Gemma 4 12B handles audio input but doesn’t generate audio or process video natively. It is an input-only multimodal model.

FAQ

What hardware do I need to run Gemma 4 12B? At Q4_K_M quantization, about 13.4 GB of memory. A machine with 16GB unified memory (Apple Silicon) or 16GB VRAM (NVIDIA) is the minimum. At BF16 precision, 26.7 GB is required.

How fast is it on Apple Silicon? Community benchmarks show 12-25 tok/s on M1-M3 machines at Q4. M4 Max reaches about 78 tok/s. My M2 Air with 16GB averaged about 17 tok/s at Q4_K_M on a 4K prompt.

Can I use Gemma 4 12B with OpenCode or Claude Code? Yes, through LiteRT-LM which provides an OpenAI-compatible server. llama.cpp also supports the model with an OpenAI-compatible endpoint. Direct harness integration depends on the tool’s model configuration.

Does it support tool calling? Yes. Native function calling is built in. It supports system prompts and structured output formats for agentic workflows.

How does the license work? Apache 2.0. Free for commercial use, modification, and redistribution. No usage restrictions or regional blocks.


This article was published on Agentic Up (https://agenticup.dev): practical guides for developers and founders building with AI agents. Reach me at [email protected]

Newsletter

Get the brief on AI agents

Practical posts on shipping agents, automating work, and building in public. No hype, no fluff.

Contact: [email protected]