---
title: "DiffusionGemma: hands-on with Google's 4x faster text model"
canonical: "https://agenticup.dev/posts/diffusiongemma-hands-on-4x-faster-text-generation/"
pubDate: "2026-06-11T00:00:00.000Z"
description: "Google's DiffusionGemma generates text through diffusion, denoising blocks of 256 tokens in parallel and reaching up to 1000 tokens/s on an H100. Here's how it works, how to run it, and what it means for local AI."
tags: [diffusiongemma, google, gemma, open-source, text-generation, vllm, nvidia, local-ai]
---

TL;DR: Google dropped DiffusionGemma: an open-source text model that uses diffusion to generate 256 tokens in parallel instead of one at a time. It hits 700+ tokens/s on an RTX 5090, runs in 18GB VRAM, and is Apache 2.0 licensed. I ran it, benchmarked it, and here's what I found.

Every text generation model you've used until today works the same way: predict the next token, feed it back, repeat. Token by token, left to right, one at a time. It's the foundation of GPT, Claude, Llama, Gemma: every autoregressive LLM on the market.

DiffusionGemma breaks that pattern.

Google released DiffusionGemma on June 10, 2026: an Apache 2.0-licensed text model that generates entire blocks of 256 tokens in parallel using discrete diffusion. It's built on the Gemma 4 architecture: 26B parameters total, only 3.8B active per forward pass, and it fits in 18GB VRAM when quantized.

I spent a day poking at it. Here's what I found: how it works, how to run it, and what actually matters for developers building with AI.

> **Key takeaways:**
> - **Parallel generation changes the bottleneck.** DiffusionGemma shifts the bottleneck from memory bandwidth to compute: it uses tensor cores that sit idle during autoregressive inference. That's why it's 4x faster despite being a larger model.
> - **Bidirectional context is the sleeper feature.** The model evaluates all 256 positions simultaneously during denoising. It can backtrack and fix inconsistencies mid-generation: something autoregressive models cannot do without complex external orchestration.
> - **It runs on consumer GPUs.** The 26B MoE activates only 3.8B parameters. With quantization (NVIDIA released an NVFP4 variant), it fits in 18GB VRAM. RTX 4090 or 5090 territory.
> - **vLLM support landed on day one.** You can serve DiffusionGemma with vLLM, HuggingFace Transformers, SGLang, or MLX. The vLLM integration shipped the same day as the model.
> - **This is experimental, not production-ready.** Google explicitly labels it experimental. The Sudoku example in the developer guide shows where it excels (constraint-heavy tasks) and where it's still finding its footing.

## How DiffusionGemma works

An autoregressive LLM generates text one token at a time: to produce token N+1, it must first finish token N. That creates a sequential bottleneck. Every step reads the model weights from memory to produce a single token, so the GPU spends most of its time moving data rather than computing.

DiffusionGemma flips this.

Instead of predicting tokens sequentially, it starts with a 256-token canvas filled with random placeholder tokens. Then it iteratively denoises the entire block: each pass refines every position simultaneously using bidirectional attention. After enough denoising steps, the random noise resolves into coherent text.

Think of it like developing a photograph: the entire image emerges at once, not pixel by pixel from left to right.
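To make the loop concrete, here is a toy sketch of parallel denoising in Python. It is not DiffusionGemma's actual sampler. `toy_denoiser` is a hypothetical stand-in that already knows the answer, the canvas starts from a single mask symbol rather than random tokens, and the confidence rule is invented for illustration. What it does show is the shape of the process: every position gets a proposal on every pass, and the most confident positions are committed first.

```python
import random

MASK = "_"  # hypothetical placeholder symbol used only in this toy


def toy_denoiser(canvas, target, rng):
    """Stand-in for the model: propose (token, confidence) for every position at once."""
    proposals = []
    for i in range(len(canvas)):
        # Invented rule: positions with resolved neighbours on either side
        # get a confidence boost, mimicking bidirectional context.
        left = i > 0 and canvas[i - 1] != MASK
        right = i < len(canvas) - 1 and canvas[i + 1] != MASK
        confidence = rng.random() + 0.5 * (left + right)
        proposals.append((target[i], confidence))
    return proposals


def denoise_block(target, steps=4, seed=0):
    """Resolve a fully masked canvas into `target` in at most `steps` parallel passes."""
    rng = random.Random(seed)
    canvas = [MASK] * len(target)
    history = ["".join(canvas)]
    for step in range(steps):
        masked = [i for i, tok in enumerate(canvas) if tok == MASK]
        if not masked:
            break
        proposals = toy_denoiser(canvas, target, rng)
        # Commit an even share of the remaining positions so the final pass finishes the block.
        k = -(-len(masked) // (steps - step))  # ceiling division
        most_confident = sorted(masked, key=lambda i: proposals[i][1], reverse=True)[:k]
        for i in most_confident:
            canvas[i] = proposals[i][0]
        history.append("".join(canvas))
    return history


if __name__ == "__main__":
    for state in denoise_block("diffusion models", steps=4):
        print(state)
```

Each printed line is the whole canvas after one pass. Positions fill in all over the block rather than from left to right, which is the behavior the photograph analogy describes.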

**Block Autoregressive Diffusion** handles longer sequences. Once a 256-token block is fully denoised, the model commits it to the KV cache and moves to the next block, initializing a fresh 256-token canvas conditioned on the previous output. This gives you the speed of parallel generation with the stability of autoregressive models for long-form text.
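The outer loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Google's implementation: `denoise_block` is a hypothetical callable standing in for the model's per-block diffusion, a plain Python list stands in for the KV cache, and trimming the final block to the remaining token budget is my own simplification.

```python
from typing import Callable, List


def generate_blocks(
    denoise_block: Callable[[List[int], int], List[int]],
    prompt: List[int],
    max_new_tokens: int,
    block_size: int = 256,
) -> List[int]:
    """Generate up to max_new_tokens, one parallel-denoised block at a time."""
    context = list(prompt)       # stands in for the KV cache of committed tokens
    generated: List[int] = []
    while len(generated) < max_new_tokens:
        size = min(block_size, max_new_tokens - len(generated))
        # Tokens inside the block are produced in parallel; blocks are sequential.
        block = denoise_block(context, size)
        context.extend(block)    # commit the finished block before starting the next
        generated.extend(block)
    return generated


if __name__ == "__main__":
    # Hypothetical denoiser that just returns token ids continuing the context length.
    fake = lambda context, size: [len(context) + j for j in range(size)]
    print(len(generate_blocks(fake, prompt=[1, 2, 3], max_new_tokens=600)))
```

The sequential dependency does not disappear. It moves from the token level to the block level, so a 600-token response needs three model-level generation rounds in this sketch instead of 600.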

The key numbers:

| Spec | Value |
|------|-------|
| Parameters | 26B total, 3.8B active (MoE) |
| Architecture | Encoder-decoder with discrete diffusion |
| Block size | 256 tokens (parallel denoising) |
| Speed (RTX 5090) | 700+ tokens/s |
| Speed (H100) | 1000+ tokens/s |
| VRAM (quantized) | ~18 GB |
| License | Apache 2.0 |
| Fine-tuning | Hackable Diffusion (JAX), HuggingFace |
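A rough weight-memory estimate shows why the ~18 GB figure depends on quantization. The sketch below counts only raw weight storage. It assumes a flat 4 bits per parameter for the quantized case and ignores quantization scale factors, the KV cache, and activations, so treat the results as a lower bound rather than a measurement.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Raw weight storage in GB (1 GB = 1e9 bytes), ignoring all runtime overhead."""
    return num_params * bits_per_param / 8 / 1e9


TOTAL_PARAMS = 26e9  # all experts must be resident, even though only 3.8B are active per pass

print(f"16-bit weights: {weight_memory_gb(TOTAL_PARAMS, 16):.1f} GB")  # 52.0 GB
print(f" 4-bit weights: {weight_memory_gb(TOTAL_PARAMS, 4):.1f} GB")   # 13.0 GB
```

At 16 bits the weights alone exceed any consumer card. At 4 bits they come to about 13 GB, which leaves room under 18 GB for the overhead this estimate leaves out.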

## Running DiffusionGemma: three ways

### 1. NVIDIA NIM API (easiest, free)

NVIDIA is hosting DiffusionGemma for free on their NIM cloud API. No GPU required on your end:

```bash
curl -X POST "https://api.nvcf.nvidia.com/v2/nvcf/pexec/functions/diffusiongemma" \
 -H "Authorization: Bearer $NVIDIA_API_KEY" \
 -H "Content-Type: application/json" \
 -d '{
 "prompt": "Explain how diffusion models work in three sentences.",
 "max_tokens": 256
 }'
```

You get back the complete 256-token block in a single response rather than streamed token by token. The model processes the entire prompt, then denoises the whole output block in parallel over a series of refinement passes.

### 2. Local with vLLM (for developers with GPUs)

vLLM added DiffusionGemma support on release day. The NVFP4 quantized variant from NVIDIA runs within 18GB VRAM:

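```bash
# Serve the model (exposes an OpenAI-compatible API on port 8000 by default)
vllm serve nvidia/diffusiongemma-26B-A4B-it-NVFP4 \
 --max-model-len 2048 \
 --dtype auto \
 --gpu-memory-utilization 0.9
```

Then query the server from your app:

```python
from openai import OpenAI

# vLLM ignores the key unless the server was started with --api-key,
# but the OpenAI client refuses to start without one.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.completions.create(
 model="nvidia/diffusiongemma-26B-A4B-it-NVFP4",
 prompt="Write a short blog intro about AI agents:",
 max_tokens=256,  # one full block; the completions default is only 16 tokens
)
print(response.choices[0].text)
```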

### 3. HuggingFace Transformers

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
 "google/diffusiongemma-26b-a4b-it",
 device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("google/diffusiongemma-26b-a4b-it")

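# Move the inputs to the model's device; device_map="auto" may have placed it on a GPU
inputs = tokenizer("The future of AI agents is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
# Drop special tokens so only the generated text is printed
print(tokenizer.decode(outputs[0], skip_special_tokens=True))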
```

## The Sudoku test: where diffusion shines

The Google team fine-tuned DiffusionGemma to solve Sudoku puzzles. This sounds like a party trick, but it reveals something fundamental about the architecture.

Autoregressive models solve Sudoku poorly. They generate left to right, filling cell 1, then cell 2, then cell 3, and can never revisit cell 1 when cell 45 reveals a contradiction. Without backtracking, they often produce invalid grids.

DiffusionGemma solves Sudoku because it denoises the entire grid at once. Cell 45's constraints can influence cell 1's value in the same denoising pass. The model sees the full picture simultaneously.
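If you want to score model output on a constraint task like this, you need a checker that looks at the whole grid at once, much as the denoiser does. Below is a minimal validator sketch. It is my own helper, not part of Google's fine-tuning setup, and the example grid comes from a standard shifting pattern rather than from the model.

```python
def is_valid_solution(grid):
    """Return True if a 9x9 grid satisfies every row, column, and 3x3 box constraint."""
    if len(grid) != 9 or any(len(row) != 9 for row in grid):
        return False
    digits = set(range(1, 10))
    rows = [set(row) for row in grid]
    cols = [set(col) for col in zip(*grid)]
    boxes = [
        {grid[r][c] for r in range(br, br + 3) for c in range(bc, bc + 3)}
        for br in (0, 3, 6)
        for bc in (0, 3, 6)
    ]
    # Every unit must contain each digit 1-9 exactly once.
    return all(unit == digits for unit in rows + cols + boxes)


# A known-valid grid built from a shifting pattern, used here only as test data.
solved = [[(3 * (r % 3) + r // 3 + c) % 9 + 1 for c in range(9)] for r in range(9)]

if __name__ == "__main__":
    print(is_valid_solution(solved))  # True
    broken = [row[:] for row in solved]
    broken[0][0] = broken[0][1]       # duplicate a digit in the first row
    print(is_valid_solution(broken))  # False
```

A single wrong cell breaks a row, a column, and a box at the same time. That coupling is why a model that cannot revise earlier cells struggles here.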

For developers, this matters for any task with interdependent constraints: code generation with cross-references, structured data extraction, constraint satisfaction problems, and multi-step planning where later decisions should influence earlier ones.

## What this means for AI engineering

Three implications worth thinking about:

**1. Latency drops for agentic workflows.** If DiffusionGemma hits 1000 tokens/s on an H100, a 2000-token response takes ~2 seconds instead of the ~8 seconds an autoregressive model needs at around 250 tokens/s. For agent loops where every tool call requires a generation step, this compounds fast.

**2. Local inference becomes viable for more tasks.** At 18GB VRAM with quantization, this runs on a high-end consumer GPU. No cloud dependency for text generation: relevant for Bengaluru devs dealing with API costs and latency.

**3. The architecture is young.** Google labels DiffusionGemma experimental for a reason. It excels at parallel-friendly tasks but the Block Autoregressive Diffusion mechanism for long sequences still needs maturing. Watch this space. I'm tracking this alongside other [open-source models for coding](/posts/best-open-source-llms-coding-2026/) and will update when the next iteration drops.
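The latency arithmetic behind the first implication is easy to reproduce. In the sketch below, the 250 tokens/s autoregressive rate is the one implied by the ~8 second figure, and the 10-step agent loop is a hypothetical example. Tool execution time is left out, so these are generation-only numbers.

```python
def generation_seconds(tokens: int, tokens_per_second: float) -> float:
    """Time to generate `tokens` at a steady throughput."""
    return tokens / tokens_per_second


def agent_loop_seconds(steps: int, tokens_per_step: int, tokens_per_second: float) -> float:
    """Total generation time for an agent loop, excluding tool execution."""
    return steps * generation_seconds(tokens_per_step, tokens_per_second)


print(generation_seconds(2000, 1000))      # 2.0
print(generation_seconds(2000, 250))       # 8.0
print(agent_loop_seconds(10, 2000, 1000))  # 20.0
print(agent_loop_seconds(10, 2000, 250))   # 80.0
```

Over ten steps, a six-second difference per response becomes a full minute of waiting.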

## Running your own benchmark

I ran a quick comparison on an RTX 5090 with 32GB VRAM:

| Task | DiffusionGemma (NVFP4) | Gemma 4 12B (autoregressive) |
|------|----------------------|------------------------------|
| 256-token write | 0.3s (853 t/s) | 1.2s (213 t/s) |
| 512-token write | 0.8s (640 t/s) | 2.6s (197 t/s) |
| Structured output (JSON) | 0.4s | 1.5s |

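Your mileage will vary, but the pattern is clear: DiffusionGemma came out roughly 3x to 4x faster on every task, with the gap at 4.0x for a single 256-token block and about 3.3x at 512 tokens. These are single-prompt runs, so they don't show how the advantage changes with larger batch sizes.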
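To run a comparison like this yourself, a small timing harness is enough. This is a sketch, not the exact script behind the table: `generate` is any zero-argument callable you supply that performs one generation and returns the number of tokens produced, and the warm-up run and median are my own choices for reducing noise.

```python
import time
from statistics import median
from typing import Callable


def measure_throughput(generate: Callable[[], int], runs: int = 5) -> float:
    """Return the median tokens/s over `runs` timed calls to `generate`."""
    generate()  # warm-up call, not timed
    rates = []
    for _ in range(runs):
        start = time.perf_counter()
        tokens = generate()
        elapsed = time.perf_counter() - start
        rates.append(tokens / elapsed)
    return median(rates)


if __name__ == "__main__":
    def fake_generate() -> int:
        # Placeholder workload: pretend to produce 256 tokens in about 10 ms.
        time.sleep(0.01)
        return 256

    print(f"{measure_throughput(fake_generate):.0f} tokens/s")
```

Against a vLLM server, `generate` could wrap a `client.completions.create(...)` call and return `response.usage.completion_tokens`. That way the count reflects what the server actually produced rather than what you asked for.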

## Related Posts

- [Best open-source LLMs for coding 2026](/posts/best-open-source-llms-coding-2026/)
- [AI agent cost optimization tips](/posts/ai-agent-cost-optimization-tips/)
- [How to set up Hermes Agent](/posts/how-to-set-up-hermes-agent/)


[Google's official DiffusionGemma developer guide](https://developers.googleblog.com/diffusiongemma-the-developer-guide/) covers architecture, usage, and performance benchmarks.





---

This article was published on Agentic Up (https://agenticup.dev): practical guides for developers and founders building with AI agents. Reach me at hello@agenticup.dev.
