Which Gemma 4 Model Should I Run?
Pick your hardware and get a straight answer.
Speed benchmarks are sourced from community testing (Hacker News, Reddit r/LocalLLaMA). Results vary by system configuration. Use these as a starting point, not a guarantee.
Last updated: 2026-04-05
NVIDIA GPU
The 26B MoE is the sweet spot for most NVIDIA setups. Although all 26B parameters must fit in memory, it activates only about 3.8B of them per token, which is why it runs far faster than its total size suggests.
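The memory math behind that claim can be sketched in a few lines. The bits-per-weight figures below are rough assumptions for GGUF-style quants (actual files vary by model and tensor mix), but they show why the whole 26B must fit in VRAM while per-token compute scales with the 3.8B active parameters:

```python
def est_model_gb(params_b: float, bits_per_weight: float) -> float:
    """Rough model size: parameters * bits / 8, reported in GB (1e9 bytes)."""
    return params_b * 1e9 * bits_per_weight / 8 / 1e9

# Approximate effective bits per weight (assumption, varies in practice)
QUANT_BITS = {"Q4_K_M": 4.8, "Q8_0": 8.5, "FP16": 16.0}

total = est_model_gb(26, QUANT_BITS["Q4_K_M"])    # memory needed to hold the model
active = est_model_gb(3.8, QUANT_BITS["Q4_K_M"])  # weights actually read per token
print(f"hold: ~{total:.1f} GB, touched per token: ~{active:.1f} GB")
```

At ~4.8 bits/weight the full model is roughly 15-16 GB, which is why 16GB cards are the practical floor for the 26B MoE at Q4, while the small active set keeps generation fast.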
| VRAM | Recommended Model | Recommended Quant | Est. Speed | Notes |
|---|---|---|---|---|
| 6-8GB | Gemma 4 E4B | Q4_K_M | ~20-30 tok/s | Limited context window headroom |
| 16GB | Gemma 4 26B MoE | Q4_K_M | ~80 tok/s | Good balance of speed and quality |
| 24GB | Gemma 4 26B MoE | UD-Q4_K_XL | ~150 tok/s | Best option for this VRAM tier |
| 24GB | Gemma 4 31B Dense | IQ4_XS | ~5 tok/s* | Exceeds VRAM, offloads to RAM - slow |
| 48GB+ | Gemma 4 31B Dense | Q8_0 | ~15-20 tok/s | High quality output |
| 80GB+ | Gemma 4 31B Dense | FP16 / BF16 | Best quality | H100 / A100 class |
* Running 31B Dense on 24GB VRAM causes CPU offloading. The 26B MoE at UD-Q4_K_XL is significantly faster on the same card. Source: HN community benchmarks.
Apple Silicon Mac
Apple Silicon's unified memory architecture makes it well-suited for MoE models. The 26B MoE is the recommended daily driver for most Mac users.
| Unified Memory | Recommended Model | Recommended Tool | Est. Speed | Notes |
|---|---|---|---|---|
| 16GB | Gemma 4 E4B | Ollama / LM Studio | ~20 tok/s | E2B also works if tight on memory |
| 32GB | Gemma 4 26B MoE | Ollama / mlx | ~25-35 tok/s | - |
| 48GB | Gemma 4 26B MoE | Ollama | ~40 tok/s | M4 Pro tested |
| 64GB | Gemma 4 31B Dense | llama.cpp / mlx | ~12-18 tok/s | - |
| 128GB | Gemma 4 31B Dense | llama.cpp | ~20+ tok/s | M5 128GB tested; model uses ~20GB |
Recommended quant for Mac: UD-Q4_K_XL (26B MoE) or Q8_0 (31B Dense). MLX-optimized versions are available on mlx-community on Hugging Face.
Sources: HN thread (M4 Pro / M5 128GB posts).
AMD GPU
AMD support is available via ROCm and llama.cpp. Google confirmed Day-0 support at launch.
| GPU | VRAM | Recommended Model | Est. Speed | Notes |
|---|---|---|---|---|
| RX 7900 XTX | 24GB | Gemma 4 26B MoE | 100+ tok/s | llama.cpp + ROCm |
| RX 7900 XT | 20GB | Gemma 4 26B MoE | ~80 tok/s | - |
| RX 6800 XT | 16GB | Gemma 4 26B MoE | ~50 tok/s | - |
| Ryzen AI (NPU) | - | E2B / E4B | TBD | Support via Lemonade Server, coming soon |
Tools: LM Studio + AMD Adrenalin drivers, or llama.cpp with ROCm. Source: AMD Day-0 Support Article.
Edge & Mobile
Edge devices can run Gemma 4, but pick lightweight models and expect lower throughput compared with desktop GPUs.
| Device | Recommended Model | Tool | Est. Speed | Notes |
|---|---|---|---|---|
| Raspberry Pi 5 (16GB) | E4B | Ollama | ~2-3 tok/s | Usable for batch tasks, not real-time chat |
| NVIDIA Jetson Orin Nano | E2B / E4B | llama.cpp | TBD | Official Google support |
| Android Phone | E2B | AI Edge Gallery | TBD | - |
| iPhone / iOS | E2B | via LiteRT-LM | TBD | LiteRT-LM CLI |
Sources: HN thread and AMD article.
What do Q4, Q8, UD mean?
Quantization balances quality and memory usage. Start from the highest quality your hardware can handle, then step down only if needed.
Q4_K_M
4-bit quantization, good quality/size tradeoff. Best starting point.
Q8_0
8-bit, near-original quality, uses roughly 2x the memory of Q4.
IQ4_XS
Aggressive 4-bit quant, smaller file, slightly lower quality than Q4_K_M.
UD-Q4_K_XL
Unsloth Dynamic quant. It keeps critical layers at higher precision for better quality at Q4-like memory size. Learn more: Unsloth Dynamic Quantization.
Rule of thumb: start with UD-Q4_K_XL if available; fall back to Q4_K_M if your tool does not yet support UD quants. Reduce precision further only when memory is tight.
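That quality-first rule can be expressed as a small helper. This is a hypothetical sketch, not a tool from any runtime: the bits-per-weight values are approximations, and the 15% headroom for KV cache and runtime overhead is an assumption you should tune for your context length:

```python
# Quants ordered from highest to lowest quality, with approximate
# effective bits per weight (assumptions; real GGUF files vary).
QUANTS = [
    ("FP16", 16.0),
    ("Q8_0", 8.5),
    ("UD-Q4_K_XL", 4.9),
    ("Q4_K_M", 4.8),
    ("IQ4_XS", 4.3),
]

def pick_quant(params_b: float, vram_gb: float, headroom: float = 0.15):
    """Return the highest-quality quant whose weights fit the VRAM budget,
    reserving `headroom` for KV cache and runtime overhead (assumption)."""
    budget = vram_gb * (1 - headroom)
    for name, bits in QUANTS:
        if params_b * bits / 8 <= budget:
            return name
    return None  # nothing fits fully on GPU; expect CPU offload

print(pick_quant(26, 24))  # 26B MoE on a 24GB card -> UD-Q4_K_XL
print(pick_quant(31, 48))  # 31B Dense on a 48GB card -> Q8_0
```

Both sample calls land on the same picks as the NVIDIA table above, which is a useful sanity check on the bit-width assumptions.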
Hardware Selection FAQ
Quick answers to common deployment and performance questions across GPU, Mac, and edge devices.
Should I pick 26B MoE or 31B Dense?
For most users, 26B MoE. It is often around 10x faster on the same hardware because only 3.8B parameters are active at inference time. Choose 31B Dense only if you have 48GB+ VRAM/RAM and need maximum output quality.
What happens if the model exceeds my VRAM?
It offloads layers to system RAM. The model still runs, but speed drops dramatically (for example, from 150 tok/s to 5 tok/s). Check your tool logs to confirm how many layers loaded to GPU.
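The severity of that slowdown follows from simple arithmetic: every token must pass through every layer, so the CPU-resident layers dominate per-token latency. The per-device speeds below are illustrative assumptions, not measurements:

```python
def est_tok_per_s(gpu_frac: float, gpu_tok_s: float = 150.0,
                  cpu_tok_s: float = 6.0) -> float:
    """Harmonic blend of layer speeds: a fraction gpu_frac of each token's
    work runs at GPU speed, the rest at CPU/RAM speed (assumed figures)."""
    per_token_time = gpu_frac / gpu_tok_s + (1 - gpu_frac) / cpu_tok_s
    return 1 / per_token_time

print(round(est_tok_per_s(1.0)))  # all layers on GPU -> 150 tok/s
print(round(est_tok_per_s(0.8)))  # 80% on GPU -> ~26 tok/s already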
How do I verify the model is fully loaded to GPU?
In llama.cpp, look for "offloaded X/X layers to GPU" in startup logs. In Ollama, run `ollama ps` and confirm active GPU usage for the model.
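If you script model launches, you can check that log line programmatically. This is a hypothetical helper built around the "offloaded X/Y layers to GPU" message that llama.cpp prints at load time; the sample log strings below are illustrative:

```python
import re

def fully_offloaded(log_text: str) -> bool:
    """True only if llama.cpp reports every layer offloaded to the GPU."""
    m = re.search(r"offloaded (\d+)/(\d+) layers to GPU", log_text)
    return bool(m) and m.group(1) == m.group(2)

print(fully_offloaded("llm_load_tensors: offloaded 63/63 layers to GPU"))  # True
print(fully_offloaded("llm_load_tensors: offloaded 40/63 layers to GPU"))  # False
```

Anything short of X/X means some layers run from system RAM, and you should expect the slowdown described above.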
The 31B model outputs garbage or repeated separators in LM Studio. What is wrong?
This was a known launch-period issue (April 2026). Update LM Studio runtimes in Settings, and update llama.cpp to b8638 or later.
Data last updated: April 2026 · Sources: HN community, Unsloth HuggingFace, AMD Developer Blog