Which Gemma 4 Model Should I Run?
Pick your hardware and get a straight answer.
Speed benchmarks are sourced from community testing (Hacker News, Reddit r/LocalLLaMA). Results vary by system configuration. Use these as a starting point, not a guarantee.
Last updated: 2026-04-05
NVIDIA GPU
The 26B MoE is the sweet spot for most NVIDIA setups. Although all 26B parameters must fit in memory, it activates only about 3.8B of them per token, which is why it runs far faster than its total size suggests.
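The memory math behind that claim can be sketched in a few lines. The bits-per-weight figures below are rough assumptions for GGUF-style quants (actual files vary by model and tensor mix), but they show why the whole 26B must fit in VRAM while per-token compute scales with the 3.8B active parameters:

```python
def est_model_gb(params_b: float, bits_per_weight: float) -> float:
    """Rough model size: parameters * bits / 8, reported in GB (1e9 bytes)."""
    return params_b * 1e9 * bits_per_weight / 8 / 1e9

# Approximate effective bits per weight (assumption, varies in practice)
QUANT_BITS = {"Q4_K_M": 4.8, "Q8_0": 8.5, "FP16": 16.0}

total = est_model_gb(26, QUANT_BITS["Q4_K_M"])    # memory needed to hold the model
active = est_model_gb(3.8, QUANT_BITS["Q4_K_M"])  # weights actually read per token
print(f"hold: ~{total:.1f} GB, touched per token: ~{active:.1f} GB")
```

At ~4.8 bits/weight the full model is roughly 15-16 GB, which is why 16GB cards are the practical floor for the 26B MoE at Q4, while the small active set keeps generation fast.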
| VRAM | Recommended Model | Recommended Quant | Est. Speed | Notes |
|---|---|---|---|---|
| 6-8GB | Gemma 4 E4B | Q4_K_M | ~20-30 tok/s | Limited context window headroom |
| 16GB | Gemma 4 26B MoE | Q4_K_M | ~80 tok/s | Good balance of speed and quality |
| 24GB | Gemma 4 26B MoE | UD-Q4_K_XL | ~150 tok/s | Best option for this VRAM tier |
| 24GB | Gemma 4 31B Dense | IQ4_XS | ~5 tok/s* | Exceeds VRAM, offloads to RAM - slow |
| 48GB+ | Gemma 4 31B Dense | Q8_0 | ~15-20 tok/s | High quality output |
| 80GB+ | Gemma 4 31B Dense | FP16 / BF16 | Best quality | H100 / A100 class |
* Running 31B Dense on 24GB VRAM causes CPU offloading. The 26B MoE at UD-Q4_K_XL is significantly faster on the same card. Source: HN community benchmarks.
Apple Silicon Mac
Apple Silicon's unified memory architecture makes it well-suited for MoE models. The 26B MoE is the recommended daily driver for most Mac users.
| Unified Memory | Recommended Model | Recommended Tool | Est. Speed | Notes |
|---|---|---|---|---|
| 16GB | Gemma 4 E4B | Ollama / LM Studio | ~20 tok/s | E2B also works if tight on memory |
| 32GB | Gemma 4 26B MoE | Ollama / mlx | ~25-35 tok/s | - |
| 48GB | Gemma 4 26B MoE | Ollama | ~40 tok/s | M4 Pro tested |
| 64GB | Gemma 4 31B Dense | llama.cpp / mlx | ~12-18 tok/s | - |
| 128GB | Gemma 4 31B Dense | llama.cpp | ~20+ tok/s | M5 128GB tested; model uses ~20GB |
Recommended quant for Mac: UD-Q4_K_XL (26B MoE) or Q8_0 (31B Dense). MLX-optimized versions are available on mlx-community on Hugging Face.
Sources: HN thread (M4 Pro / M5 128GB posts).
AMD GPU
AMD support is available via ROCm and llama.cpp. Google confirmed Day-0 support at launch.
| GPU | VRAM | Recommended Model | Est. Speed | Notes |
|---|---|---|---|---|
| RX 7900 XTX | 24GB | Gemma 4 26B MoE | 100+ tok/s | llama.cpp + ROCm |
| RX 7900 XT | 20GB | Gemma 4 26B MoE | ~80 tok/s | - |
| RX 6800 XT | 16GB | Gemma 4 26B MoE | ~50 tok/s | - |
| Ryzen AI (NPU) | - | E2B / E4B | TBD | Support via Lemonade Server, coming soon |
Tools: LM Studio + AMD Adrenalin drivers, or llama.cpp with ROCm. Source: AMD Day-0 Support Article.
Edge & Mobile
Edge devices can run Gemma 4, but pick lightweight models and expect lower throughput compared with desktop GPUs.
| Device | Recommended Model | Tool | Est. Speed | Notes |
|---|---|---|---|---|
| Raspberry Pi 5 (16GB) | E4B | Ollama | ~2-3 tok/s | Usable for batch tasks, not real-time chat |
| NVIDIA Jetson Orin Nano | E2B / E4B | llama.cpp | TBD | Official Google support |
| Android Phone | E2B | AI Edge Gallery | TBD | - |
| iPhone / iOS | E2B | via LiteRT-LM | TBD | LiteRT-LM CLI |
Sources: HN thread and AMD article.
What do Q4, Q8, UD mean?
Quantization balances quality and memory usage. Start from the highest quality your hardware can handle, then step down only if needed.
Q4_K_M
4-bit quantization, good quality/size tradeoff. Best starting point.
Q8_0
8-bit, near-original quality, uses roughly 2x the memory of Q4.
IQ4_XS
Aggressive 4-bit quant, smaller file, slightly lower quality than Q4_K_M.
UD-Q4_K_XL
Unsloth Dynamic quant. It keeps critical layers at higher precision for better quality at Q4-like memory size. Learn more: Unsloth Dynamic Quantization.
Rule of thumb: start with UD-Q4_K_XL if available; fall back to Q4_K_M if your tool does not yet support UD quants. Reduce precision further only when memory is tight.
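That quality-first rule can be expressed as a small helper. This is a hypothetical sketch, not a tool from any runtime: the bits-per-weight values are approximations, and the 15% headroom for KV cache and runtime overhead is an assumption you should tune for your context length:

```python
# Quants ordered from highest to lowest quality, with approximate
# effective bits per weight (assumptions; real GGUF files vary).
QUANTS = [
    ("FP16", 16.0),
    ("Q8_0", 8.5),
    ("UD-Q4_K_XL", 4.9),
    ("Q4_K_M", 4.8),
    ("IQ4_XS", 4.3),
]

def pick_quant(params_b: float, vram_gb: float, headroom: float = 0.15):
    """Return the highest-quality quant whose weights fit the VRAM budget,
    reserving `headroom` for KV cache and runtime overhead (assumption)."""
    budget = vram_gb * (1 - headroom)
    for name, bits in QUANTS:
        if params_b * bits / 8 <= budget:
            return name
    return None  # nothing fits fully on GPU; expect CPU offload

print(pick_quant(26, 24))  # 26B MoE on a 24GB card -> UD-Q4_K_XL
print(pick_quant(31, 48))  # 31B Dense on a 48GB card -> Q8_0
```

Both sample calls land on the same picks as the NVIDIA table above, which is a useful sanity check on the bit-width assumptions.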
Hardware Selection FAQ
Quick answers to common deployment and performance questions across GPU, Mac, and edge devices.
Should I pick 26B MoE or 31B Dense?
For most users, 26B MoE. It is often around 10x faster on the same hardware because only 3.8B parameters are active at inference time. Choose 31B Dense only if you have 48GB+ VRAM/RAM and need maximum output quality.
What happens if the model exceeds my VRAM?
It offloads layers to system RAM. The model still runs, but speed drops dramatically (for example, from 150 tok/s to 5 tok/s). Check your tool logs to confirm how many layers loaded to GPU.
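The severity of that slowdown follows from simple arithmetic: every token must pass through every layer, so the CPU-resident layers dominate per-token latency. The per-device speeds below are illustrative assumptions, not measurements:

```python
def est_tok_per_s(gpu_frac: float, gpu_tok_s: float = 150.0,
                  cpu_tok_s: float = 6.0) -> float:
    """Harmonic blend of layer speeds: a fraction gpu_frac of each token's
    work runs at GPU speed, the rest at CPU/RAM speed (assumed figures)."""
    per_token_time = gpu_frac / gpu_tok_s + (1 - gpu_frac) / cpu_tok_s
    return 1 / per_token_time

print(round(est_tok_per_s(1.0)))  # all layers on GPU -> 150 tok/s
print(round(est_tok_per_s(0.8)))  # 80% on GPU -> ~26 tok/s already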
How do I verify the model is fully loaded to GPU?
In llama.cpp, look for "offloaded X/X layers to GPU" in startup logs. In Ollama, run `ollama ps` and confirm active GPU usage for the model.
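If you script model launches, you can check that log line programmatically. This is a hypothetical helper built around the "offloaded X/Y layers to GPU" message that llama.cpp prints at load time; the sample log strings below are illustrative:

```python
import re

def fully_offloaded(log_text: str) -> bool:
    """True only if llama.cpp reports every layer offloaded to the GPU."""
    m = re.search(r"offloaded (\d+)/(\d+) layers to GPU", log_text)
    return bool(m) and m.group(1) == m.group(2)

print(fully_offloaded("llm_load_tensors: offloaded 63/63 layers to GPU"))  # True
print(fully_offloaded("llm_load_tensors: offloaded 40/63 layers to GPU"))  # False
```

Anything short of X/X means some layers run from system RAM, and you should expect the slowdown described above.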
The 31B model outputs garbage or repeated separators in LM Studio. What is wrong?
This was a known launch-period issue (April 2026). Update LM Studio runtimes in Settings, and update llama.cpp to b8638 or later.
Data last updated: April 2026 · Sources: HN community, Unsloth HuggingFace, AMD Developer Blog