Run Gemma 4 Locally:
From Zero to Running in Minutes
The options below range from trying Gemma 4 online in seconds to running a production server. Choose the method that fits your hardware and experience level.
Try Online
No Setup
Test Gemma 4 directly in your browser. No installation or account required for basic use.
Run with Ollama
Recommended
Run Gemma 4 locally with a single command. Best balance of simplicity and performance for most users.
Hardware Requirements
E2B: 4GB RAM
E4B: 6GB RAM
26B MoE: 8GB+ RAM
31B Dense: 16GB+ RAM
Model VRAM Guide
| Model | Min VRAM | Best For |
|---|---|---|
| E2B | 4GB | Mobile / Raspberry Pi |
| E4B | 6GB | Laptop GPU |
| 26B MoE | 8GB (quantized) | RTX 3080 / M2 Pro |
| 31B Dense | 24GB (quantized) | RTX 4090 / H100 |
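To make the table easier to apply, the lookup below picks the largest model whose minimum VRAM fits a given budget. It is only a sketch: the thresholds are copied from the table above, the name `pick_model` is hypothetical, and real headroom also depends on context length and the quantization you choose.

```python
from typing import Optional

# Minimum VRAM in GB, copied from the table above.
# The 26B MoE and 31B Dense figures assume quantized weights.
MIN_VRAM_GB = [
    ("E2B", 4),
    ("E4B", 6),
    ("26B MoE", 8),
    ("31B Dense", 24),
]


def pick_model(vram_gb: float) -> Optional[str]:
    """Return the largest listed model whose minimum VRAM fits, or None."""
    best = None
    for name, min_gb in MIN_VRAM_GB:  # ordered smallest to largest
        if vram_gb >= min_gb:
            best = name
    return best


if __name__ == "__main__":
    for budget in (3, 6, 12, 24):
        print(f"{budget} GB -> {pick_model(budget)}")
```

A budget below 4GB returns `None`, which is a signal to use the online option instead of a local install.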
Quick Start
# Install Ollama
# macOS
brew install ollama
# Windows / Linux: download from https://ollama.com

# Pull Gemma 4
ollama pull gemma4:31b       # 31B Dense (best quality)
ollama pull gemma4:26b-moe   # 26B MoE (faster)
ollama pull gemma4:4b        # E4B (edge devices)
ollama pull gemma4:2b        # E2B (mobile)

# Start chatting
ollama run gemma4:31b

Run with llama.cpp
Advanced
High-performance inference with full GPU acceleration and quantization support. For power users who need maximum control.
Hardware Requirements
GPU: CUDA / Metal / ROCm
Memory: 8GB+ VRAM
Storage: 20GB+
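The storage and memory figures follow from simple arithmetic: weight size is roughly parameter count times bits per weight, divided by eight. The sketch below works that out for a 31B model. The bits-per-weight values are approximations for common GGUF quantization types, not numbers from this page, and the function name is hypothetical.

```python
def weight_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# Assumed average bits per weight; real GGUF files vary with the tensor mix.
ASSUMED_BPW = {"F16": 16.0, "Q8_0": 8.5, "Q4_K_M": 4.85}

if __name__ == "__main__":
    for quant, bpw in ASSUMED_BPW.items():
        print(f"31B at {quant}: ~{weight_size_gb(31, bpw):.1f} GB")
```

At an assumed 4.85 bits per weight, a 31B model comes to roughly 18.8 GB of weights, which is in line with the 20GB+ storage figure. The KV cache and runtime buffers need additional memory on top of that.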
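Quick Start
# Clone and build (recent llama.cpp versions build with CMake)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# Add -DGGML_CUDA=ON for NVIDIA GPUs; Metal is enabled by default on macOS
cmake -B build
cmake --build build --config Release

# Download a GGUF model from Hugging Face

# Run a quantized model (Q4_K_M recommended for 31B)
# The CLI binary is named llama-cli in recent versions (formerly ./main)
./build/bin/llama-cli -m gemma4-31b-Q4_K_M.gguf -n 512 --interactive

Run with vLLM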
Server
Production-grade inference server with PagedAttention, tensor parallelism, and OpenAI-compatible API.
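Because the server speaks the OpenAI chat-completions format, any HTTP client can talk to it. The sketch below assumes a server was started with `vllm serve google/gemma-4-31b-it` and is listening on vLLM's default `http://localhost:8000`. The model ID is the one used elsewhere on this page, and the helper names are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumption: vLLM default host and port
MODEL = "google/gemma-4-31b-it"        # model ID as used in this guide


def build_chat_request(prompt: str, max_tokens: int = 512,
                       temperature: float = 0.7) -> dict:
    """Build an OpenAI-style chat-completions payload."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }


def chat(prompt: str) -> str:
    """POST the payload to the server and return the assistant's reply text."""
    request = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, `chat("Explain quantum computing in simple terms")` returns the generated text. Keeping the payload builder separate from the network call makes the request shape easy to inspect without a GPU.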
Hardware Requirements
GPUs: 1+ (tensor parallel)
Memory: 16GB+ VRAM
Storage: 40GB+
Quick Start
from vllm import LLM, SamplingParams

llm = LLM(model="google/gemma-4-31b-it")
sampling_params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(["Explain quantum computing in simple terms"], sampling_params)
print(outputs[0].outputs[0].text)

Hugging Face API
API Access
Use Hugging Face's hosted inference API. Quick integration without infrastructure setup.
Quick Start
import requests

API_URL = "https://api-inference.huggingface.co/models/google/gemma-4-31b-it"
headers = {"Authorization": "Bearer YOUR_HF_TOKEN"}

def query(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()

output = query({"inputs": "What is Gemma 4?"})

Run Gemma 4 with transformers.js (WebGPU)
Hugging Face ships official Gemma 4 browser support via transformers.js. You can run image + text captioning directly in the browser with WebGPU.
Quick Notes
Use ONNX checkpoints for browser inference. WebGPU is required for practical speed.
1. Install package: npm i @huggingface/transformers
2. Start with onnx-community/gemma-4-E2B-it-ONNX for lower memory.
3. Prefer Chrome/Edge with WebGPU enabled for best compatibility.
4. Official references: HF launch blog · ONNX model card · WebGPU demo
Captioning Example
import {
  AutoProcessor,
  Gemma4ForConditionalGeneration,
  TextStreamer,
  load_image,
} from "@huggingface/transformers";

const modelId = "onnx-community/gemma-4-E2B-it-ONNX";
const processor = await AutoProcessor.from_pretrained(modelId);
const model = await Gemma4ForConditionalGeneration.from_pretrained(modelId, {
  device: "webgpu",
  dtype: "q4f16",
});

const messages = [
  {
    role: "user",
    content: [
      { type: "image" },
      { type: "text", text: "Write a short caption for this image." },
    ],
  },
];

const prompt = processor.apply_chat_template(messages, {
  add_generation_prompt: true,
  enable_thinking: false,
});

const image = await load_image("https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/artemis.jpeg");
const inputs = await processor(prompt, image, { add_special_tokens: false });

const output = await model.generate({
  ...inputs,
  max_new_tokens: 128,
  do_sample: false,
  streamer: new TextStreamer(processor.tokenizer, { skip_prompt: true }),
});

Thinking Mode + Function Calling
Use thinking mode for complex planning tasks, then pair it with structured tool calls when the model needs external data.
Thinking Mode
Recommended for math, debugging, and multi-step planning. Disable for trivial single-tool calls.
<|system|>
You are a precise coding assistant.
Enable deeper reasoning only for multi-step tasks.
<|think|>on
<|user|>
Design a zero-downtime Redis to Postgres migration plan.

Function Calling
Define a tool schema, let the model emit structured arguments, execute the tool in your application, then continue the chat with the result.
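The application side of that loop can be sketched in a few lines. This is an illustration under assumptions, not an official Gemma 4 API: the schema mirrors this section's `get_weather` example, the tool body returns canned data, the helper names are hypothetical, and the `"tool"` message role is an assumed convention that your chat template may name differently.

```python
import json

# Tool schema in the same shape as this section's get_weather example.
GET_WEATHER_SCHEMA = {
    "name": "get_weather",
    "description": "Get weather by city",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string"},
            "unit": {"type": "string", "enum": ["c", "f"]},
        },
        "required": ["city"],
    },
}


def get_weather(city: str, unit: str = "c") -> dict:
    """Placeholder tool: returns canned data instead of calling a real API."""
    return {"city": city, "unit": unit, "temperature": 21}


# Registry mapping tool name to (schema, implementation).
TOOLS = {"get_weather": (GET_WEATHER_SCHEMA, get_weather)}


def run_tool_call(raw_call: str) -> dict:
    """Parse a model-emitted tool call, check it against the schema, run it."""
    call = json.loads(raw_call)
    if call["name"] not in TOOLS:
        raise ValueError(f"unknown tool: {call['name']}")
    schema, func = TOOLS[call["name"]]
    args = call.get("arguments", {})
    params = schema["parameters"]
    missing = [key for key in params["required"] if key not in args]
    if missing:
        raise ValueError(f"missing arguments: {missing}")
    for key, value in args.items():
        spec = params["properties"].get(key)
        if spec is None:
            raise ValueError(f"unexpected argument: {key}")
        if "enum" in spec and value not in spec["enum"]:
            raise ValueError(f"invalid value for {key}: {value!r}")
    return func(**args)


def tool_message(result: dict) -> dict:
    """Wrap a tool result as a chat message (assumed 'tool' role)."""
    return {"role": "tool", "content": json.dumps(result)}


if __name__ == "__main__":
    raw = '{"name": "get_weather", "arguments": {"city": "Shanghai", "unit": "c"}}'
    print(tool_message(run_tool_call(raw)))
```

Validating arguments before dispatch keeps malformed model output from reaching the tool. The message returned by `tool_message` is what you would append to the conversation before asking the model for its final answer.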
{
  "name": "get_weather",
  "description": "Get weather by city",
  "parameters": {
    "type": "object",
    "properties": {
      "city": { "type": "string" },
      "unit": { "type": "string", "enum": ["c", "f"] }
    },
    "required": ["city"]
  }
}

# 1) Register tool schema in system prompt
# 2) Model emits structured tool call
{
  "name": "get_weather",
  "arguments": { "city": "Shanghai", "unit": "c" }
}
# 3) Execute tool in app, append tool result, ask model for final answer

Frequently Asked Questions
Quick answers to common questions about running Gemma 4.
Ready to Get Started?
Choose your level above and start running Gemma 4 today. Join thousands of developers building with Google's open model.