Hardware Selection Guide

Which Gemma 4 Model Should I Run?

Pick your hardware and get a straight answer.

Speed benchmarks are sourced from community testing (Hacker News, Reddit r/LocalLLaMA). Results vary by system configuration. Use these as a starting point, not a guarantee.

Last updated: 2026-04-05

NVIDIA GPU

The 26B MoE is the sweet spot for most NVIDIA setups. It activates only 3.8B parameters at inference time, which is why it runs much faster than its size suggests.

VRAM | Recommended Model | Recommended Quant | Est. Speed | Notes
6-8GB | Gemma 4 E4B | Q4_K_M | ~20-30 tok/s | Limited context window headroom
16GB | Gemma 4 26B MoE | Q4_K_M | ~80 tok/s | Good balance of speed and quality
24GB | Gemma 4 26B MoE | UD-Q4_K_XL | ~150 tok/s | Best option for this VRAM tier
24GB | Gemma 4 31B Dense | IQ4_XS | ~5 tok/s* | Exceeds VRAM, offloads to RAM; slow
48GB+ | Gemma 4 31B Dense | Q8_0 | ~15-20 tok/s | High quality output
80GB+ | Gemma 4 31B Dense | FP16 / BF16 | Best quality | H100 / A100 class

* Running 31B Dense on 24GB VRAM causes CPU offloading. The 26B MoE at UD-Q4_K_XL is significantly faster on the same card. Source: HN community benchmarks.
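The tier logic in the table above can be written down as a small lookup helper. This is a sketch that simply mirrors the table's thresholds; `recommend_nvidia` is a hypothetical function, not part of any tool:

```python
def recommend_nvidia(vram_gb: float) -> tuple[str, str]:
    """Map available VRAM (GB) to the table's recommended model and quant.

    Thresholds mirror the guide's table; this is illustrative, not an API.
    """
    if vram_gb >= 80:
        return ("Gemma 4 31B Dense", "FP16 / BF16")
    if vram_gb >= 48:
        return ("Gemma 4 31B Dense", "Q8_0")
    if vram_gb >= 24:
        return ("Gemma 4 26B MoE", "UD-Q4_K_XL")
    if vram_gb >= 16:
        return ("Gemma 4 26B MoE", "Q4_K_M")
    if vram_gb >= 6:
        return ("Gemma 4 E4B", "Q4_K_M")
    raise ValueError("Below 6GB, consider Gemma 4 E2B or CPU-only inference")

print(recommend_nvidia(24))  # → ('Gemma 4 26B MoE', 'UD-Q4_K_XL')
```

Note the deliberate choice at 24GB: the helper prefers the 26B MoE over the 31B Dense, matching the footnote above.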

Apple Silicon Mac

Apple Silicon's unified memory architecture makes it well-suited for MoE models. The 26B MoE is the recommended daily driver for most Mac users.

Unified Memory | Recommended Model | Recommended Tool | Est. Speed | Notes
16GB | Gemma 4 E4B | Ollama / LM Studio | ~20 tok/s | E2B also works if tight on memory
32GB | Gemma 4 26B MoE | Ollama / mlx | ~25-35 tok/s | -
48GB | Gemma 4 26B MoE | Ollama | ~40 tok/s | M4 Pro tested
64GB | Gemma 4 31B Dense | llama.cpp / mlx | ~12-18 tok/s | -
128GB | Gemma 4 31B Dense | llama.cpp | ~20+ tok/s | M5 128GB tested; model uses ~20GB

Recommended quant for Mac: UD-Q4_K_XL (26B MoE) or Q8_0 (31B Dense). MLX-optimized versions are available on mlx-community on Hugging Face.

Sources: HN thread (M4 Pro / M5 128GB posts).

AMD GPU

AMD support is available via ROCm and llama.cpp. Google confirmed Day-0 support at launch.

GPU | VRAM | Recommended Model | Est. Speed | Notes
RX 7900 XTX | 24GB | Gemma 4 26B MoE | 100+ tok/s | llama.cpp + ROCm
RX 7900 XT | 20GB | Gemma 4 26B MoE | ~80 tok/s est. | -
RX 6800 XT | 16GB | Gemma 4 26B MoE | ~50 tok/s est. | -
Ryzen AI (NPU) | - | E2B / E4B | TBD | Support via Lemonade Server, coming soon

Tools: LM Studio + AMD Adrenalin drivers, or llama.cpp with ROCm. Source: AMD Day-0 Support Article.

Edge & Mobile

Edge devices can run Gemma 4, but pick lightweight models and expect lower throughput compared with desktop GPUs.

Device | Recommended Model | Tool | Est. Speed | Notes
Raspberry Pi 5 (16GB) | E4B | Ollama | ~2-3 tok/s | Usable for batch tasks, not real-time chat
NVIDIA Jetson Orin Nano | E2B / E4B | llama.cpp | TBD | Official Google support
Android Phone | E2B | AI Edge Gallery | TBD | -
iPhone / iOS | E2B | via LiteRT-LM | TBD | LiteRT-LM CLI

Sources: HN thread and AMD article.

Quantization Explained

What do Q4, Q8, UD mean?

Quantization balances quality and memory usage. Start from the highest quality your hardware can handle, then step down only if needed.

Q4_K_M

4-bit quantization, good quality/size tradeoff. Best starting point.

Q8_0

8-bit, near-original quality, uses roughly 2x the memory of Q4.

IQ4_XS

Aggressive 4-bit quant, smaller file, slightly lower quality than Q4_K_M.

UD-Q4_K_XL

Unsloth Dynamic quant. It keeps critical layers at higher precision for better quality at Q4-like memory size. Learn more: Unsloth Dynamic Quantization.

Rule of thumb: start with UD-Q4_K_XL if available. Fall back to Q4_K_M if your tool does not support UD yet.

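The quality/memory tradeoff is mostly arithmetic: weight memory scales with bits per weight. The sketch below does the back-of-envelope math; the bits-per-weight values are rough assumptions (K-quants carry some metadata overhead), and the estimate ignores KV cache and activation memory:

```python
# Approximate effective bits per weight (assumed values, not exact spec numbers)
BITS_PER_WEIGHT = {"Q4_K_M": 4.8, "Q8_0": 8.5, "IQ4_XS": 4.3, "FP16": 16.0}

def weight_memory_gb(n_params_billion: float, quant: str) -> float:
    """Rough weight-only memory footprint in GB: params * bits / 8."""
    return n_params_billion * BITS_PER_WEIGHT[quant] / 8

# 31B Dense at Q8_0 vs Q4_K_M: close to the "roughly 2x" noted above
print(round(weight_memory_gb(31, "Q8_0"), 1))    # ≈ 32.9 GB
print(round(weight_memory_gb(31, "Q4_K_M"), 1))  # ≈ 18.6 GB
```

This is why the 31B Dense at Q8_0 lands in the 48GB+ tier: the weights alone approach 33GB before context memory is added.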

FAQ

Hardware Selection FAQ

Quick answers to common deployment and performance questions across GPU, Mac, and edge devices.

Should I pick 26B MoE or 31B Dense?

For most users, 26B MoE. It is often around 10x faster on the same hardware because only 3.8B parameters are active at inference time. Choose 31B Dense only if you have 48GB+ VRAM/RAM and need maximum output quality.
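The speedup has a simple first-order explanation: decode throughput is roughly proportional to the parameters touched per token. A quick check of the ideal ratio (real-world numbers vary with memory bandwidth, routing overhead, and quantization):

```python
active_params_b = 3.8   # parameters active per token in the 26B MoE
dense_params_b = 31.0   # all parameters active in the 31B Dense

# Ideal per-token compute ratio vs the dense model
ratio = dense_params_b / active_params_b
print(round(ratio, 1))  # ≈ 8.2
```

An ideal ratio around 8x is broadly consistent with the ~10x community reports, which also benefit from the MoE fitting entirely in VRAM where the dense model may not.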

What happens if the model exceeds my VRAM?

It offloads layers to system RAM. The model still runs, but speed drops dramatically (for example, from 150 tok/s to 5 tok/s). Check your tool logs to confirm how many layers loaded to GPU.

How do I verify the model is fully loaded to GPU?

In llama.cpp, look for "offloaded X/X layers to GPU" in startup logs. In Ollama, run `ollama ps` and confirm active GPU usage for the model.
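For scripted setups, the llama.cpp log check can be automated with a small parser. A sketch, assuming the "offloaded X/Y layers to GPU" log line; exact wording can vary between llama.cpp versions:

```python
import re

def fully_offloaded(log_text: str) -> bool:
    """Return True if a llama.cpp-style log reports all layers on GPU.

    Matches 'offloaded X/Y layers to GPU'; log wording may differ by version.
    """
    m = re.search(r"offloaded (\d+)/(\d+) layers to GPU", log_text)
    if not m:
        return False
    loaded, total = map(int, m.groups())
    return loaded == total

print(fully_offloaded("llm_load_tensors: offloaded 49/49 layers to GPU"))  # True
print(fully_offloaded("llm_load_tensors: offloaded 20/49 layers to GPU"))  # False
```

A partial count (e.g. 20/49) is the signature of the VRAM-overflow slowdown described above.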

The 31B model outputs garbage or repeated separators in LM Studio. What is wrong?

This was a known launch-period issue (April 2026). Update LM Studio runtimes in Settings, and update llama.cpp to b8638 or later.

Data last updated: April 2026 · Sources: HN community, Unsloth HuggingFace, AMD Developer Blog