Run Gemma 4 Locally:
From Zero to Running in Minutes
From trying it online in seconds to running a production server. Choose the method that fits your hardware and experience level.
Try Online
No Setup
Test Gemma 4 directly in your browser. No installation or account required for basic use.
Run with Ollama
Recommended
Run Gemma 4 locally with a single command. Best balance of simplicity and performance for most users.
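Beyond the interactive CLI shown in the Quick Start below, Ollama also exposes a local REST API (port 11434 by default). A minimal sketch using only the standard library — the `gemma4:2b` tag is assumed to have been pulled already, and the `/api/generate` endpoint and payload shape are Ollama's standard generate interface:

```python
import json
import urllib.request

# Ollama serves a local REST API on port 11434 by default.
# Assumes `ollama pull gemma4:2b` has already been run.
OLLAMA_URL = "http://localhost:11434/api/generate"

def build_payload(model: str, prompt: str) -> dict:
    """Assemble a non-streaming generate request body."""
    return {"model": model, "prompt": prompt, "stream": False}

def generate(model: str, prompt: str) -> str:
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(model, prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        # Non-streaming responses carry the full completion in "response".
        return json.loads(resp.read())["response"]

if __name__ == "__main__":
    print(generate("gemma4:2b", "Say hello in one sentence."))
```

Setting `"stream": False` returns one JSON object instead of newline-delimited chunks, which keeps the client code simple.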
Hardware Requirements
- E2B: 4GB RAM
- E4B: 6GB RAM
- 26B MoE: 8GB+ RAM
- 31B Dense: 16GB+ RAM
Model VRAM Guide
| Model | Min VRAM / RAM | Best For |
|---|---|---|
| E2B | 4GB | Mobile / Raspberry Pi |
| E4B | 6GB | Laptop GPU |
| 26B MoE | 8GB (quantized) | RTX 3080 / M2 Pro |
| 31B Dense | 24GB (quantized) | RTX 4090 / H100 |
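The table's figures can be sanity-checked with a back-of-envelope estimate: weight memory is roughly parameters times bits per weight, plus an allowance for KV cache and runtime buffers. A sketch — the ~4.5 bits/weight figure for Q4_K_M-style quantization and the flat overhead constant are assumptions, not measured values:

```python
def approx_vram_gb(params_b: float, bits_per_weight: float,
                   overhead_gb: float = 1.5) -> float:
    """Rough memory estimate in GB: weights (params x bits / 8)
    plus a flat allowance for KV cache and runtime buffers (assumed)."""
    weights_gb = params_b * bits_per_weight / 8
    return round(weights_gb + overhead_gb, 1)

# 31B at ~4.5 bits/weight (Q4_K_M-style quantization)
print(approx_vram_gb(31, 4.5))
# 31B at 16 bits/weight (unquantized bf16)
print(approx_vram_gb(31, 16))
```

The unquantized number shows why 4-bit quantization is what makes the 31B model reachable on a single 24GB consumer GPU.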
Quick Start
```bash
# Install Ollama
# macOS
brew install ollama
# Windows / Linux: download from https://ollama.com

# Pull Gemma 4
ollama pull gemma4:31b      # 31B Dense (best quality)
ollama pull gemma4:26b-moe  # 26B MoE (faster)
ollama pull gemma4:4b       # E4B (edge devices)
ollama pull gemma4:2b       # E2B (mobile)

# Start chatting
ollama run gemma4:31b
```

Run with llama.cpp
Advanced
High-performance inference with full GPU acceleration and quantization support. For power users who need maximum control.
Hardware Requirements
- GPU: CUDA / Metal / ROCm
- Memory: 8GB+ VRAM
- Storage: 20GB+
Quick Start
```bash
# Clone and build (llama.cpp now uses CMake; the old Makefile build is deprecated)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build && cmake --build build --config Release

# Download a GGUF model from Hugging Face
# Run with quantization (Q4_K_M recommended for 31B)
./build/bin/llama-cli -m gemma4-31b-Q4_K_M.gguf -n 512 --interactive
```

Run with vLLM
Server
Production-grade inference server with PagedAttention, tensor parallelism, and OpenAI-compatible API.
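Because the server speaks the OpenAI chat-completions protocol, any HTTP client can drive it. A minimal sketch using only the standard library — it assumes a server started separately (e.g. `vllm serve google/gemma-4-31b-it`); host, port, and endpoint path are the vLLM server defaults, and the model name is taken from this page:

```python
import json
import urllib.request

# Assumes a vLLM server is already running, e.g.:
#   vllm serve google/gemma-4-31b-it
# Host, port, and model name are assumptions; match your deployment.
VLLM_URL = "http://localhost:8000/v1/chat/completions"

def build_chat_request(model: str, prompt: str, max_tokens: int = 256) -> dict:
    """Assemble an OpenAI-style chat completion request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(model: str, prompt: str) -> str:
    body = json.dumps(build_chat_request(model, prompt)).encode("utf-8")
    req = urllib.request.Request(
        VLLM_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(chat("google/gemma-4-31b-it", "Explain quantum computing in simple terms"))
```

Because the wire format is OpenAI-compatible, the official `openai` client also works against this endpoint by pointing its `base_url` at the server.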
Hardware Requirements
- GPUs: 1+ (tensor parallel)
- Memory: 16GB+ VRAM
- Storage: 40GB+
Quick Start
```python
from vllm import LLM, SamplingParams

llm = LLM(model="google/gemma-4-31b-it")
sampling_params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(["Explain quantum computing in simple terms"], sampling_params)
print(outputs[0].outputs[0].text)
```

Hugging Face API
API Access
Use Hugging Face's hosted inference API. Quick integration without infrastructure setup.
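Hosted models are loaded on demand, so a first request can fail with HTTP 503 while the model spins up. A sketch of retry-and-parse logic around the same endpoint as the Quick Start below, using only the standard library — the 10-second wait is an arbitrary choice, and the token is a placeholder:

```python
import json
import time
import urllib.error
import urllib.request

# Same model URL as the Quick Start; the token is a placeholder.
API_URL = "https://api-inference.huggingface.co/models/google/gemma-4-31b-it"
HEADERS = {"Authorization": "Bearer YOUR_HF_TOKEN",
           "Content-Type": "application/json"}

def parse_reply(payload) -> str:
    """Text-generation responses arrive as [{"generated_text": ...}]."""
    return payload[0]["generated_text"]

def query(prompt: str, retries: int = 3) -> str:
    data = json.dumps({"inputs": prompt}).encode("utf-8")
    for attempt in range(retries):
        try:
            req = urllib.request.Request(API_URL, data=data, headers=HEADERS)
            with urllib.request.urlopen(req) as resp:
                return parse_reply(json.loads(resp.read()))
        except urllib.error.HTTPError as err:
            # 503 means the model is still loading; wait and retry.
            if err.code == 503 and attempt < retries - 1:
                time.sleep(10)
                continue
            raise

if __name__ == "__main__":
    print(query("What is Gemma 4?"))
```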
Quick Start
```python
import requests

API_URL = "https://api-inference.huggingface.co/models/google/gemma-4-31b-it"
headers = {"Authorization": "Bearer YOUR_HF_TOKEN"}

def query(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()

output = query({"inputs": "What is Gemma 4?"})
```

Frequently Asked Questions
Quick answers to common questions about running Gemma 4.
Still have questions?
View Full FAQ

Ready to Get Started?
Choose your level above and start running Gemma 4 today. Join thousands of developers building with Google's open model.