How much VRAM do I need for a 7B model?

7B models need roughly 5GB at Q4_K_M quantization, 8GB at Q8_0, and 14GB at FP16. A $340 RTX 4060 8GB or $500 RTX 5060 Ti 16GB handles this tier comfortably.

What hardware should I buy for 70B models locally?

For 70B models at Q4_K_M, you need about 42GB of memory. This requires a unified memory system: Strix Halo 128GB ($1,500+ entry, approx. $3,500 for 128GB config), Mac Studio M4 Max ($2,600+), or Mac Studio M3 Ultra 192GB ($3,999). Token speeds range from 12-15 tok/s on Strix Halo to 20-30 tok/s on Mac Studio.

Is Strix Halo or RTX 4090 better for local LLMs?

They solve different problems. RTX 4090 (24GB) runs 7B-13B models at 100+ tok/s, 5x faster than Strix Halo. Strix Halo 128GB runs 70B+ models that simply don't fit on the 4090. Choose based on your model size, not brand preference.

Does memory bandwidth or capacity matter more for LLM inference?

Bandwidth determines token speed. Capacity determines which models run at all. A model that fits entirely in memory always beats a faster card that offloads to system RAM. Among unified memory systems, Apple Silicon (546-819 GB/s) generates tokens faster than Strix Halo (approx. 256 GB/s) or DGX Spark (273 GB/s).

Which local hardware should you buy for which LLM in 2026?

Q: What is the actual Strix Halo 128GB price?

The 128GB configuration currently sells at roughly $3,500 street price. Entry-level configurations with 16GB start at $1,499. The chip launched roughly 1.5 years ago as the Ryzen AI Max 300 series.

A developer's decision framework for local LLM hardware. Model size determines everything. A 7B model fits on an 8GB GPU. A 70B model needs about 42GB at Q4_K_M, which means a unified memory system. Here is what each tier costs and how fast it runs.

TL;DR: Model size determines your hardware, not brand. All figures at Q4_K_M (standard quantization). 7B needs 5GB ($340 RTX 4060 8GB). 13B needs 9GB ($670 RX 9070 XT 16GB). 34B needs 20GB ($1,600 RTX 4090 24GB). 70B needs 42GB (unified memory, approx. $2,600-$4,000). Nvidia GPUs are about 5x faster on models that fit in VRAM. Unified memory runs models that don’t fit on any consumer GPU.

Key takeaways:

Model size determines everything. All figures at Q4_K_M. 7B needs 5GB ($340 GPU). 13B needs 9GB ($670 GPU). 34B needs 20GB ($1,600 RTX 4090). 70B needs 42GB (unified memory).

Memory bandwidth determines token speed. Capacity determines which models run at all.

RTX 4090 (24GB, 1,008 GB/s) does 100+ tok/s on models that fit. Strix Halo (128GB, 256 GB/s) does 12-15 tok/s on 70B.

Strix Halo 128GB costs roughly $3,500 street price. Entry level with 16GB starts at $1,499.

Apple Silicon (546-819 GB/s) leads unified memory for speed: 20-30 tok/s on 70B.

The one concept that explains everything

LLM inference is bottlenecked by memory bandwidth, not compute. To generate each token, the model reads all of its weights from memory once. The rate at which the hardware can do this (GB/s) determines tokens per second, not teraflops (Pinggy hardware guide 2026).

This means two things:

If your model fits entirely in GPU or unified memory, bandwidth determines speed.
If your model doesn’t fit, it spills to slower memory and speed collapses regardless of GPU power.
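The bandwidth argument above can be turned into a back-of-the-envelope estimate. The sketch below assumes a dense model at batch size 1, where every generated token reads all weights once, so bandwidth divided by model size gives a rough ceiling on tokens per second. The function name `bandwidth_bound_tok_s` is illustrative, and the inputs are figures quoted in this article. Measured throughput depends on the runtime, kernels, and context length, so treat the result as a first-order estimate only.

```python
def bandwidth_bound_tok_s(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Rough ceiling on decode speed: one full pass over the weights per token."""
    if model_size_gb <= 0:
        raise ValueError("model_size_gb must be positive")
    return bandwidth_gb_s / model_size_gb


# 7B at Q4_K_M (approx. 5 GB) on an RTX 4090 (1,008 GB/s), figures from this article.
ceiling = bandwidth_bound_tok_s(1008, 5)
print(f"7B Q4_K_M on RTX 4090: ceiling of about {ceiling:.0f} tok/s")
```

The 100+ tok/s quoted for small models on the RTX 4090 sits under this ceiling, which is what a bandwidth-bound workload would suggest.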

The realistic bandwidth hierarchy:

Hardware	Memory bandwidth	Token speed (70B Q4_K_M)
RTX 4090	1,008 GB/s	n/a (70B doesn’t fit in 24GB; 100+ t/s on models that fit)
RTX 5090	1,792 GB/s	n/a (70B doesn’t fit in 32GB; 100+ t/s on models that fit)
Mac Studio M4 Max	approx. 400-550 GB/s	20-28 t/s
Mac Studio M3 Ultra	819 GB/s	25-30 t/s
DGX Spark	273 GB/s	approx. 2.7 t/s (FP8) to 12-15 t/s (Q4)
Strix Halo 128GB	approx. 256 GB/s	12-15 t/s

Numbers from community benchmarks and published reviews (Framework community, Pinggy hardware guide).

How much memory does each model need?

GGUF quantization at Q4_K_M is the standard recommendation. It uses roughly a third of the memory of FP16 with minimal quality loss (memory requirement table):

Model size	Q4_K_M	Q8_0	FP16
7B	approx. 5 GB	approx. 8 GB	approx. 14 GB
13B	approx. 9 GB	approx. 14 GB	approx. 26 GB
34B	approx. 20 GB	approx. 34 GB	approx. 68 GB
70B	approx. 42 GB	approx. 70 GB	approx. 140 GB
405B	approx. 220 GB	approx. 405 GB	approx. 810 GB

A 7B model at Q4_K_M needs 5GB. A 70B at Q4_K_M needs 42GB. This gap is why comparing a 24GB GPU to a 128GB unified memory system is meaningless. They serve different model sizes.
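The table values follow approximately from parameter count times bits per weight. The sketch below assumes effective bit widths of about 4.85 for Q4_K_M, 8.5 for Q8_0, and 16 for FP16. These widths are approximations that vary with model architecture, and the estimate covers weights only, not the KV cache or runtime overhead. The names `BITS_PER_WEIGHT` and `weights_gb` are illustrative.

```python
# Assumed effective bits per weight; approximate, not exact for every model.
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q8_0": 8.5, "FP16": 16.0}


def weights_gb(params_billion: float, quant: str) -> float:
    """Weights-only size in GB (10^9 bytes): parameters * bits per weight / 8."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8


for size in (7, 13, 34, 70):
    row = ", ".join(f"{q}: {weights_gb(size, q):.1f} GB" for q in BITS_PER_WEIGHT)
    print(f"{size}B -> {row}")
```

The results land within a few GB of the table. Small differences are expected, since the table figures are rounded and real files carry some overhead.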

The decision framework

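If you run 7B models: You need approx. 5GB at Q4_K_M or 8GB at Q8_0. Buy an RTX 4060 8GB for $340 or an RTX 5060 Ti 16GB for $500. Both run 7B at Q4_K_M comfortably at 100+ tok/s. Q8_0 at approx. 8GB fills the 8GB card with no room for context, so choose the 16GB card for Q8_0.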

If you run 13B models: Needs approx. 9GB at Q4_K_M or 14GB at Q8_0. Buy an AMD RX 9070 XT 16GB for $670 or an RTX 5080 16GB for $1,000. Both run 13B at Q8_0.

If you run 34B models: Needs approx. 20GB at Q4_K_M (Q8_0 needs 34GB and does not fit on consumer GPUs). Buy an RTX 4090 24GB for $1,600. Runs 34B at Q4_K_M at over 100 tok/s with room for context.

If you run 70B+ models: Needs approx. 42GB at Q4_K_M. Consumer GPUs top out at 32GB (RTX 5090). You need a unified memory system:

System	Memory	Bandwidth	Token speed (70B)	Price
Strix Halo 128GB	128GB	approx. 256 GB/s	12-15 t/s	approx. $3,500
Mac Studio M4 Max	up to 128GB	approx. 400-550 GB/s	20-28 t/s	approx. $2,600+
Mac Studio M3 Ultra	192GB	819 GB/s	25-30 t/s	approx. $3,999
DGX Spark	128GB	273 GB/s	12-15 t/s	approx. $4,000

If you run 100B+ models: Only the Mac Studio M3 Ultra 192GB handles these in a single consumer system. Beyond that, dual DGX Sparks or multi-GPU servers.
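The capacity check behind the unified memory tier can be sketched as a filter over the table above. The sketch assumes the top memory configuration of each system, uses 546 GB/s for the M4 Max (the upper figure quoted in this article), and reserves a hypothetical 8GB of headroom for the KV cache and the operating system. The names `System`, `SYSTEMS`, and `systems_that_fit` are illustrative. Usable GPU memory can be lower than total memory on some systems, which this sketch ignores.

```python
from typing import List, NamedTuple


class System(NamedTuple):
    name: str
    memory_gb: int
    bandwidth_gb_s: float


# Figures taken from the table above; bandwidths are approximate.
SYSTEMS = [
    System("Strix Halo 128GB", 128, 256),
    System("Mac Studio M4 Max", 128, 546),
    System("Mac Studio M3 Ultra", 192, 819),
    System("DGX Spark", 128, 273),
]


def systems_that_fit(required_gb: float, headroom_gb: float = 8.0) -> List[System]:
    """Systems with enough total memory, highest bandwidth first."""
    fits = [s for s in SYSTEMS if s.memory_gb >= required_gb + headroom_gb]
    return sorted(fits, key=lambda s: s.bandwidth_gb_s, reverse=True)


for need in (42, 150, 220):
    names = [s.name for s in systems_that_fit(need)]
    print(f"{need} GB model -> {names or 'no single system in the table'}")
```

A 42GB model fits on all four systems, a hypothetical 150GB model fits only on the 192GB machine, and the 220GB needed for 405B at Q4_K_M fits on none of them.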

Where Strix Halo fits

The hype around Strix Halo isn’t about speed. At approx. 256 GB/s bandwidth, it generates tokens at about half the speed of an Apple Silicon system and about a fifth the speed of an RTX 4090 on models that fit both. The hype is about capacity. It’s the cheapest way to get 96GB of usable GPU memory (from the 128GB configuration) in a single system.

The chip launched roughly 1.5 years ago as the AMD Ryzen AI Max 300 series (TechPowerUp). Entry-level configurations with 16GB start at $1,499. The 128GB configuration needed for 70B+ models currently sells at roughly $3,500 street price. The AMD Ryzen AI Halo Developer Kit, which uses the identical silicon, costs $3,999 for the branding and support bundle.

For a developer who needs to run 70B models locally without paying cloud API costs, Strix Halo is the budget option at $3,500. The Mac Studio is faster but starts at $4,000+ for equivalent memory. The DGX Spark is comparable in speed but costs $4,000.

Where Nvidia still dominates

For models that fit in VRAM, Nvidia isn’t close to being beaten. An RTX 4090 at $1,600 runs 7B-34B models at 100+ tok/s. Strix Halo at $3,500 runs those same models at 22 tok/s. The roughly 5x speed gap tracks memory bandwidth: 1,008 GB/s versus 256 GB/s, a 4x difference.

Nvidia also has mature CUDA ecosystem support. ROCm has improved significantly but still trails on compatibility breadth and documentation quality for non-standard workflows.

Where Apple Silicon wins

Apple Silicon’s memory bandwidth advantage (546-819 GB/s versus approx. 256 GB/s for Strix Halo) translates directly to faster token generation. An M4 Max Mac Studio at roughly $2,600 runs 70B models at 20-28 tok/s compared to Strix Halo’s 12-15 tok/s. The M3 Ultra at 192GB runs models that neither Strix Halo nor DGX Spark can load.

The tradeoff is macOS. MLX works well for inference but the ecosystem is narrower than Linux or Windows for AI development tooling. Docker, ROCm, and standard Linux server tools require workarounds.

Practical recommendation

For developers, the decision is about which model size you need to run, not which chip brand you prefer:

7B-13B models: Buy a $340-$1,000 GPU. Strix Halo is overkill and slower.
34B models: Buy an RTX 4090 at $1,600. Best price to performance in this range.
70B+ models, budget sensitive: Strix Halo 128GB at approx. $3,500. Slow but capable.
70B+ models, speed matters: Mac Studio M4 Max at approx. $2,600+. Faster tokens, macOS.
100B+ models: Mac Studio M3 Ultra 192GB at $3,999. The only consumer option.
Agentic coding workflows (Claude Code, etc.): Prioritize tool-calling model support (Qwen3-Coder, GLM-4.x) over raw token speed. Strix Halo is viable here because it runs these models locally at usable speed without per-token costs.
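The list above can be written as a small lookup. This is a sketch of the article's tiers, not a definitive rule: the function name `recommend` and the `prefer_speed` flag are illustrative, and it assumes that model sizes between the listed tiers round up to the next tier, with the 70B tier covering anything above 34B and below 100B.

```python
def recommend(params_billion: float, prefer_speed: bool = False) -> str:
    """Map a model size (billions of parameters) to the article's hardware tier."""
    if params_billion <= 13:
        return "Consumer GPU, $340-$1,000"
    if params_billion <= 34:
        return "RTX 4090 24GB, $1,600"
    if params_billion < 100:
        # Both options run 70B at Q4_K_M; the tradeoff is speed and OS versus price.
        if prefer_speed:
            return "Mac Studio M4 Max, approx. $2,600+"
        return "Strix Halo 128GB, approx. $3,500"
    return "Mac Studio M3 Ultra 192GB, $3,999"


for size, speed in ((7, False), (34, False), (70, False), (70, True), (120, False)):
    print(f"{size}B, prefer_speed={speed}: {recommend(size, speed)}")
```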

FAQ

How much VRAM for a 7B model? approx. 5GB at Q4_K_M. A $340 RTX 4060 8GB handles this.

What hardware for 70B models locally? You need 42GB+ of memory. Strix Halo 128GB at approx. $3,500, Mac Studio M4 Max at approx. $2,600+, or Mac Studio M3 Ultra 192GB at $3,999. Token speeds: 12-30 t/s depending on system.

Strix Halo or RTX 4090? Different tools. 4090 runs 7B-34B at 5x the speed. Strix Halo runs 70B+ that do not fit on the 4090.

What is Strix Halo’s actual price? 128GB config is approx. $3,500 street price. Entry level 16GB starts at $1,499.

What matters more: bandwidth or capacity? Capacity determines which models run. Bandwidth determines speed. A model that fits always beats a faster card that offloads.

This article was published on Agentic Up (https://agenticup.dev): practical guides for developers and founders building with AI agents. Reach me at [email protected]