5 Ways to Run Gemma 4

Run Gemma 4 Locally: From Zero to Running in Minutes

Five options, ranging from trying the model online in seconds to running a production server. Choose the method that fits your hardware and experience level.

Level 1
Zero Barrier

Try Online

No Setup

First-time users

Test Gemma 4 directly in your browser. No installation or account required for basic use.

Level 2
Recommended

Run with Ollama

Development & experimentation

Run Gemma 4 locally with a single command. Best balance of simplicity and performance for most users.

Hardware Requirements

E2B: 4GB RAM
E4B: 6GB RAM
26B MoE: 8GB+ RAM
31B Dense: 16GB+ RAM

Model VRAM Guide

Model | Min VRAM | Best For
E2B | 4GB | Mobile / Raspberry Pi
E4B | 6GB | Laptop GPU
26B MoE | 8GB (quantized) | RTX 3080 / M2 Pro
31B Dense | 24GB (quantized) | RTX 4090 / H100
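
As a rough sanity check on the figures above, the memory taken by model weights can be approximated as parameter count times bits per weight. The sketch below is illustrative only: the function name is hypothetical, the 4-bit figure is an idealization (real GGUF quantizations such as Q4_K_M use somewhat more than 4 bits per weight), and KV cache and runtime overhead come on top of the result.

```python
def estimate_weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights only, in decimal gigabytes.

    Excludes KV cache, activations, and runtime overhead, so treat the
    result as a lower bound rather than a hardware requirement.
    """
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


# 31B Dense at an assumed, idealized 4 bits per weight
q4_gb = estimate_weight_memory_gb(31, 4)
# The same model at 16 bits per weight (unquantized fp16/bf16)
fp16_gb = estimate_weight_memory_gb(31, 16)

print(f"31B @ 4-bit:  ~{q4_gb:.1f} GB of weights")
print(f"31B @ 16-bit: ~{fp16_gb:.1f} GB of weights")
```

Under these assumptions the quantized 31B weights alone come to roughly 15.5 GB, which is why the 31B Dense figures are quoted for quantized builds and leave headroom above that number.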

Quick Start

Run with Ollama
# Install Ollama
# macOS
brew install ollama
# Windows / Linux: download from https://ollama.com
# Pull Gemma 4
ollama pull gemma4:31b       # 31B Dense (best quality)
ollama pull gemma4:26b-moe   # 26B MoE (faster)
ollama pull gemma4:4b        # E4B (edge devices)
ollama pull gemma4:2b        # E2B (mobile)
# Start chatting
ollama run gemma4:31b
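
For development work it is often handier to call the model from code than from the interactive prompt. Ollama serves a local HTTP API, by default on port 11434, and the sketch below posts a single non-streaming prompt to its `/api/generate` endpoint. The helper names are hypothetical, and the model tag is simply the one used in the commands above; adjust it to whatever `ollama list` reports on your machine.

```python
import json
import urllib.request

# Ollama's default local endpoint; change it if your server listens elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(prompt: str, model: str = "gemma4:31b") -> dict:
    """Build the JSON body for one non-streaming completion."""
    return {"model": model, "prompt": prompt, "stream": False}


def ask_ollama(prompt: str, model: str = "gemma4:31b", timeout: float = 120.0) -> str:
    """Send the prompt to a running Ollama server and return the reply text."""
    body = json.dumps(build_generate_request(prompt, model)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # With "stream": False the server returns one JSON object whose
        # "response" field holds the generated text.
        return json.loads(response.read())["response"]
```

With the Ollama server running, `ask_ollama("Explain quantum computing in simple terms")` should return the model's answer as a string.
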
Level 3
Advanced

Run with llama.cpp

Maximum performance & control

High-performance inference with full GPU acceleration and quantization support. For power users who need maximum control.

Hardware Requirements

GPU: CUDA / Metal / ROCm
Memory: 8GB+ VRAM
Storage: 20GB+

Quick Start

Run with llama.cpp
# Clone and build with CMake (for NVIDIA GPUs, add -DGGML_CUDA=ON to the first cmake command)
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release
# Download a GGUF model from Hugging Face
# Run with quantization (Q4_K_M recommended for 31B)
./build/bin/llama-cli -m gemma4-31b-Q4_K_M.gguf -n 512 --interactive
Level 4
Production

Run with vLLM

Server

Production deployments

Production-grade inference server with PagedAttention, tensor parallelism, and an OpenAI-compatible API.

Hardware Requirements

GPUs: 1+ (tensor parallel)
Memory: 16GB+ VRAM
Storage: 40GB+

Quick Start

Run with vLLM
from vllm import LLM, SamplingParams

llm = LLM(model="google/gemma-4-31b-it")
sampling_params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(["Explain quantum computing in simple terms"], sampling_params)
print(outputs[0].outputs[0].text)
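
The snippet above runs inference in-process. For the OpenAI-compatible API mentioned earlier, vLLM is typically started as a server and queried over HTTP. The sketch below assumes a server launched with `vllm serve google/gemma-4-31b-it` on vLLM's default port 8000; the helper names are hypothetical, and the model ID is the one used above.

```python
import json
import urllib.request

# Assumes `vllm serve google/gemma-4-31b-it` is running on the default port.
VLLM_URL = "http://localhost:8000/v1/chat/completions"


def build_chat_request(
    user_message: str,
    model: str = "google/gemma-4-31b-it",
    temperature: float = 0.7,
    max_tokens: int = 512,
) -> dict:
    """Build an OpenAI-style chat completion request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }


def extract_reply(completion: dict) -> str:
    """Pull the assistant text out of an OpenAI-style completion response."""
    return completion["choices"][0]["message"]["content"]


def chat(user_message: str, timeout: float = 120.0) -> str:
    """Send one user message to the server and return the assistant's reply."""
    body = json.dumps(build_chat_request(user_message)).encode("utf-8")
    request = urllib.request.Request(
        VLLM_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return extract_reply(json.loads(response.read()))
```

Because the request and response follow the OpenAI chat format, existing OpenAI client code can usually be pointed at this URL instead of hand-rolling the HTTP call.
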
Level 5
API

Hugging Face API

API Access

Quick integration

Use Hugging Face's hosted inference API. Quick integration without infrastructure setup.

Quick Start

Hugging Face API
import requests

API_URL = "https://api-inference.huggingface.co/models/google/gemma-4-31b-it"
headers = {"Authorization": "Bearer YOUR_HF_TOKEN"}

def query(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()

output = query({"inputs": "What is Gemma 4?"})
print(output)
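
The JSON returned by `query` is not always the generated text. The sketch below assumes the response shapes commonly seen from the hosted text-generation endpoint: a list of objects with a `generated_text` field on success, and an object with an `error` field on failure (for example while a model is still loading). The helper name is hypothetical, and the exact shapes may differ for this model.

```python
def extract_generated_text(result) -> str:
    """Return the generated text, or raise if the API reported an error.

    Assumed shapes:
      success: [{"generated_text": "..."}]
      failure: {"error": "..."}
    """
    if isinstance(result, dict) and "error" in result:
        raise RuntimeError(f"Inference API error: {result['error']}")
    if isinstance(result, list) and result and "generated_text" in result[0]:
        return result[0]["generated_text"]
    raise ValueError(f"Unexpected response shape: {result!r}")


# Example with canned responses instead of a live API call
sample_ok = [{"generated_text": "Gemma 4 is an open model from Google."}]
print(extract_generated_text(sample_ok))
```

In the example above this would be called as `extract_generated_text(output)` right after `query(...)` returns.
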
Browser Runtime

Run Gemma 4 with transformers.js (WebGPU)

Hugging Face ships official Gemma 4 browser support via transformers.js. You can run image + text captioning directly in the browser with WebGPU.

Quick Notes

Use ONNX checkpoints for browser inference. WebGPU is required for practical speed.

1. Install package: npm i @huggingface/transformers

2. Start with onnx-community/gemma-4-E2B-it-ONNX for lower memory.

3. Prefer Chrome/Edge with WebGPU enabled for best compatibility.

4. Official references: HF launch blog · ONNX model card · WebGPU demo

Captioning Example

transformers.js + WebGPU
import {
  AutoProcessor,
  Gemma4ForConditionalGeneration,
  TextStreamer,
  load_image,
} from "@huggingface/transformers";

const modelId = "onnx-community/gemma-4-E2B-it-ONNX";
const processor = await AutoProcessor.from_pretrained(modelId);
const model = await Gemma4ForConditionalGeneration.from_pretrained(modelId, {
  device: "webgpu",
  dtype: "q4f16",
});

const messages = [
  {
    role: "user",
    content: [
      { type: "image" },
      { type: "text", text: "Write a short caption for this image." },
    ],
  },
];

const prompt = processor.apply_chat_template(messages, {
  add_generation_prompt: true,
  enable_thinking: false,
});

const image = await load_image("https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/artemis.jpeg");
const inputs = await processor(prompt, image, { add_special_tokens: false });

const output = await model.generate({
  ...inputs,
  max_new_tokens: 128,
  do_sample: false,
  streamer: new TextStreamer(processor.tokenizer, { skip_prompt: true }),
});
Agentic Features

Thinking Mode + Function Calling

Use thinking mode for complex planning tasks, then pair it with structured tool calls when the model needs external data.

Thinking Mode

Recommended for math, debugging, and multi-step planning. Disable for trivial single-tool calls.

Prompt Example
<|system|>
You are a precise coding assistant.
Enable deeper reasoning only for multi-step tasks.
<|think|>on

<|user|>
Design a zero-downtime Redis to Postgres migration plan.

Function Calling

Define a tool schema, let the model emit structured arguments, execute the tool in your application, then continue the chat with the tool result.

Tool Schema
{
  "name": "get_weather",
  "description": "Get weather by city",
  "parameters": {
    "type": "object",
    "properties": {
      "city": { "type": "string" },
      "unit": { "type": "string", "enum": ["c", "f"] }
    },
    "required": ["city"]
  }
}
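
The application side of this loop (parse the model's call, check it against the schema, run the tool, append the result) can be sketched in a few lines. Everything here is illustrative: `get_weather` returns canned data rather than calling a real service, `run_tool_call` is a hypothetical helper, and the `"tool"` role name is an assumption about how your chat runtime labels tool results.

```python
import json

# Required arguments, taken from the get_weather schema shown above.
REQUIRED_ARGS = {"get_weather": ["city"]}


def get_weather(city: str, unit: str = "c") -> dict:
    """Stand-in tool: returns canned data instead of querying a weather API."""
    return {"city": city, "unit": unit, "temperature": 21}


TOOLS = {"get_weather": get_weather}


def run_tool_call(raw_call: str) -> dict:
    """Parse a model-emitted tool call, validate it, and execute the tool."""
    call = json.loads(raw_call)
    name = call["name"]
    args = call.get("arguments", {})
    if name not in TOOLS:
        raise ValueError(f"Unknown tool: {name}")
    missing = [key for key in REQUIRED_ARGS[name] if key not in args]
    if missing:
        raise ValueError(f"Missing required arguments: {missing}")
    return TOOLS[name](**args)


# The model's structured output, hard-coded here in place of a live model call
model_output = '{"name": "get_weather", "arguments": {"city": "Shanghai", "unit": "c"}}'

messages = [{"role": "user", "content": "What's the weather in Shanghai?"}]
result = run_tool_call(model_output)

# Append the call and its result so the model can write the final answer.
messages.append({"role": "assistant", "content": model_output})
messages.append({"role": "tool", "content": json.dumps(result)})
print(messages[-1])
```

Validating the arguments before dispatch keeps a malformed call from reaching the tool; the error can instead be fed back to the model as the tool result.
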
3-Step Flow
# 1) Register tool schema in system prompt
# 2) Model emits structured tool call
{
  "name": "get_weather",
  "arguments": { "city": "Shanghai", "unit": "c" }
}
# 3) Execute tool in app, append tool result, ask model for final answer

Ready to Get Started?

Choose your level above and start running Gemma 4 today. Join thousands of developers building with Google's open model.