5 Ways to Run Gemma 4

Run Gemma 4 Locally:
From Zero to Running in Minutes

The options below range from trying Gemma 4 online in seconds to running a production server. Choose the method that fits your hardware and experience level.


Level 1

Zero Barrier

Try Online

No Setup

For users who want zero setup

Test Gemma 4 directly in your browser. No installation or account required for basic use.

Try Now

Google AI Studio (31B & 26B MoE)

Hugging Face Spaces Demo

Level 2

Recommended

Run with Ollama

Development & experimentation

Run Gemma 4 locally with a single command. Best balance of simplicity and performance for most users.

Hardware Requirements

Model	Min RAM
E2B	4GB
E4B	6GB
26B MoE	8GB+
31B Dense	16GB+

Model VRAM Guide

Model	Min VRAM	Best For
E2B	4GB	Mobile / Raspberry Pi
E4B	6GB	Laptop GPU
26B MoE	8GB (quantized)	RTX 3080 / M2 Pro
31B Dense	24GB (quantized)	RTX 4090 / H100
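
The table above can be read as a simple lookup: pick the largest model whose minimum VRAM fits your GPU. A minimal sketch, assuming only the figures in the table; the name `pick_model` is illustrative and not part of any tool.

```python
# Minimum VRAM in GB, copied from the table above (quantized where noted).
MIN_VRAM_GB = [
    ("E2B", 4),
    ("E4B", 6),
    ("26B MoE", 8),
    ("31B Dense", 24),
]


def pick_model(available_vram_gb):
    """Return the largest model whose minimum VRAM fits, or None if none fit."""
    best = None
    for name, min_vram in MIN_VRAM_GB:  # ordered from smallest to largest
        if available_vram_gb >= min_vram:
            best = name
    return best


if __name__ == "__main__":
    for vram in (4, 8, 16, 24):
        print(f"{vram} GB -> {pick_model(vram)}")
```

These are minimums, so leaving some headroom for context length and other processes is a reasonable precaution.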

Quick Start

Run with Ollama

# Install Ollama
# macOS
brew install ollama
# Windows / Linux: download from https://ollama.com
# Pull Gemma 4
ollama pull gemma4:31b       # 31B Dense (best quality)
ollama pull gemma4:26b-moe   # 26B MoE (faster)
ollama pull gemma4:4b        # E4B (edge devices)
ollama pull gemma4:2b        # E2B (mobile)
# Start chatting
ollama run gemma4:31b
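
Once a model is pulled, Ollama also serves a local HTTP API (by default on port 11434), which is handy for scripting. The following is a minimal sketch using only the Python standard library. The helper names `build_chat_request` and `chat` are illustrative, and the model tag is the one used in the commands above.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # Ollama's default local address


def build_chat_request(prompt, model="gemma4:31b"):
    """Build the JSON body for a single-turn, non-streaming chat request."""
    return {
        "model": model,  # tag taken from the pull commands above
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # ask for one complete JSON response
    }


def chat(prompt, model="gemma4:31b"):
    """Send the prompt to a running Ollama server and return the reply text."""
    body = json.dumps(build_chat_request(prompt, model)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["message"]["content"]
```

With the Ollama server running, `print(chat("Explain quantum computing in simple terms"))` should print the model's reply.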

Level 3

Advanced

Run with llama.cpp

Maximum performance & control

High-performance inference with full GPU acceleration and quantization support. For power users who need maximum control.

Hardware Requirements

Component	Requirement
GPU	CUDA / Metal / ROCm
Memory	8GB+ VRAM
Storage	20GB+

Quick Start

Run with llama.cpp

# Clone and build with CMake
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build    # add -DGGML_CUDA=ON for NVIDIA GPUs
cmake --build build --config Release
# Download GGUF model from Hugging Face
# Run with quantization (Q4_K_M recommended for 31B)
./build/bin/llama-cli -m gemma4-31b-Q4_K_M.gguf -n 512 --interactive
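
To sanity-check whether a quantized file will fit on disk and in memory, a back-of-the-envelope estimate is parameters times bits per weight, divided by 8. The sketch below assumes roughly 4.8 bits per weight for Q4_K_M, which is an approximation rather than a figure from this guide; the function name is illustrative.

```python
def estimate_gguf_size_gb(n_params, bits_per_weight):
    """Approximate weight-file size in GB (10^9 bytes) for a quantized model."""
    return n_params * bits_per_weight / 8 / 1e9


# Assumption: Q4_K_M averages about 4.8 bits per weight; the real figure
# varies by architecture, so treat the result as a ballpark only.
size_31b = estimate_gguf_size_gb(31e9, 4.8)
print(f"31B at ~4.8 bits/weight: about {size_31b:.1f} GB")
```

Under that assumption the 31B file comes to about 18.6 GB, in line with the 20GB+ storage figure above. The KV cache and runtime buffers need additional memory on top of the weights.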

Level 4

Production

Run with vLLM

Server

Production deployments

Production-grade inference server with PagedAttention, tensor parallelism, and an OpenAI-compatible API.

Hardware Requirements

Component	Requirement
GPUs	1+ (tensor parallel)
Memory	16GB+ VRAM
Storage	40GB+

Quick Start

Run with vLLM

from vllm import LLM, SamplingParams

llm = LLM(model="google/gemma-4-31b-it")
sampling_params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(["Explain quantum computing in simple terms"], sampling_params)
print(outputs[0].outputs[0].text)
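
The snippet above runs offline batch inference inside one Python process. For the OpenAI-compatible API mentioned earlier, the server is started separately (for example with `vllm serve google/gemma-4-31b-it`) and queried over HTTP. The following is a minimal client sketch using the standard library; it assumes the server listens on vLLM's default port 8000, and the helper names are illustrative.

```python
import json
import urllib.request

# Assumption: a server was started separately with
#   vllm serve google/gemma-4-31b-it
# and is listening on vLLM's default port 8000.
VLLM_URL = "http://localhost:8000/v1/chat/completions"


def build_completion_request(prompt, model="google/gemma-4-31b-it"):
    """Build an OpenAI-style chat completion body with the sampling settings above."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
        "max_tokens": 512,
    }


def complete(prompt):
    """POST the request to the running server and return the reply text."""
    body = json.dumps(build_completion_request(prompt)).encode("utf-8")
    request = urllib.request.Request(
        VLLM_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        reply = json.loads(response.read())
    return reply["choices"][0]["message"]["content"]
```

Because the endpoint follows the OpenAI chat completions shape, existing OpenAI client code can usually be pointed at it by changing only the base URL.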

Level 5

API

Hugging Face API

API Access

Quick integration

Use Hugging Face's hosted inference API. Quick integration without infrastructure setup.

Quick Start

Hugging Face API

import requests

API_URL = "https://api-inference.huggingface.co/models/google/gemma-4-31b-it"
headers = {"Authorization": "Bearer YOUR_HF_TOKEN"}

def query(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()

output = query({"inputs": "What is Gemma 4?"})
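
The `query` helper returns whatever JSON the API sends back, so the caller still has to tell a result from an error. A hedged sketch of that step: it assumes a successful text-generation response is a list like `[{"generated_text": "..."}]` and that failures arrive as a dict with an `"error"` key. Check both shapes against the response you actually receive; the function name is illustrative.

```python
def extract_generated_text(output):
    """Return the generated text, or raise if the API reported an error."""
    # Assumption: errors arrive as a dict with an "error" key.
    if isinstance(output, dict) and "error" in output:
        raise RuntimeError(f"Inference API error: {output['error']}")
    # Assumption: success is a list like [{"generated_text": "..."}].
    if (
        isinstance(output, list)
        and output
        and isinstance(output[0], dict)
        and "generated_text" in output[0]
    ):
        return output[0]["generated_text"]
    raise ValueError(f"Unexpected response shape: {output!r}")
```

Typical use would be `print(extract_generated_text(output))` right after the `query` call.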

Browser Runtime

Run Gemma 4 with transformers.js (WebGPU)

Hugging Face ships official Gemma 4 browser support via transformers.js. You can run image + text captioning directly in the browser with WebGPU.

Quick Notes

Use ONNX checkpoints for browser inference. WebGPU is required for practical speed.

1. Install package: npm i @huggingface/transformers

2. Start with onnx-community/gemma-4-E2B-it-ONNX for lower memory.

3. Prefer Chrome/Edge with WebGPU enabled for best compatibility.

4. Official references: HF launch blog · ONNX model card · WebGPU demo

Captioning Example

transformers.js + WebGPU

import {
  AutoProcessor,
  Gemma4ForConditionalGeneration,
  TextStreamer,
  load_image,
} from "@huggingface/transformers";

const modelId = "onnx-community/gemma-4-E2B-it-ONNX";
const processor = await AutoProcessor.from_pretrained(modelId);
const model = await Gemma4ForConditionalGeneration.from_pretrained(modelId, {
  device: "webgpu",
  dtype: "q4f16",
});

const messages = [
  {
    role: "user",
    content: [
      { type: "image" },
      { type: "text", text: "Write a short caption for this image." },
    ],
  },
];

const prompt = processor.apply_chat_template(messages, {
  add_generation_prompt: true,
  enable_thinking: false,
});

const image = await load_image("https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/artemis.jpeg");
const inputs = await processor(prompt, image, { add_special_tokens: false });

const output = await model.generate({
  ...inputs,
  max_new_tokens: 128,
  do_sample: false,
  streamer: new TextStreamer(processor.tokenizer, { skip_prompt: true }),
});

Agentic Features

Thinking Mode + Function Calling

Use thinking mode for complex planning tasks, then pair it with structured tool calls when the model needs external data.

Thinking Mode

Recommended for math, debugging, and multi-step planning. Disable for trivial single-tool calls.

Prompt Example

<|system|>
You are a precise coding assistant.
Enable deeper reasoning only for multi-step tasks.
<|think|>on

<|user|>
Design a zero-downtime Redis to Postgres migration plan.
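
If prompts are assembled by hand, the thinking switch can be made an explicit parameter so it is only turned on for multi-step tasks. The sketch below only mirrors the layout of the prompt example above. Omitting the `<|think|>on` line when thinking is off is an assumption, and in practice the model's own chat template (for instance an `enable_thinking` option) is the safer route. The function name `build_prompt` is illustrative.

```python
def build_prompt(system, user, think=False):
    """Assemble a prompt in the layout of the example above.

    Assumption: thinking is enabled by a "<|think|>on" line after the system
    text and is simply left out when thinking is off.
    """
    lines = ["<|system|>", system]
    if think:
        lines.append("<|think|>on")
    lines += ["", "<|user|>", user]
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_prompt(
        "You are a precise coding assistant.",
        "Design a zero-downtime Redis to Postgres migration plan.",
        think=True,
    ))
```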

Function Calling

Define a tool schema, let the model emit structured arguments, execute the tool, then continue the chat.

Tool Schema

{
  "name": "get_weather",
  "description": "Get weather by city",
  "parameters": {
    "type": "object",
    "properties": {
      "city": { "type": "string" },
      "unit": { "type": "string", "enum": ["c", "f"] }
    },
    "required": ["city"]
  }
}

3-Step Flow

# 1) Register tool schema in system prompt
# 2) Model emits structured tool call
{
  "name": "get_weather",
  "arguments": { "city": "Shanghai", "unit": "c" }
}
# 3) Execute tool in app, append tool result, ask model for final answer
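
Steps 2 and 3 can be sketched on the application side as follows: parse the emitted call, check it against the schema, run the tool, and wrap the result as a message to append to the chat. This is an illustration under assumptions rather than a fixed protocol. The `get_weather` body returns fixed placeholder data, and the `"tool"` role in the returned message is an assumed convention that may differ in your runtime.

```python
import json

# The schema from the "Tool Schema" example above.
GET_WEATHER_SCHEMA = {
    "name": "get_weather",
    "description": "Get weather by city",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string"},
            "unit": {"type": "string", "enum": ["c", "f"]},
        },
        "required": ["city"],
    },
}


def get_weather(city, unit="c"):
    """Hypothetical stand-in for a real weather lookup; returns fixed data."""
    return {"city": city, "temperature": 21, "unit": unit}


# Registry mapping tool names to (schema, implementation).
TOOLS = {"get_weather": (GET_WEATHER_SCHEMA, get_weather)}


def run_tool_call(raw_call):
    """Parse a model-emitted tool call, validate it, and run the tool."""
    call = json.loads(raw_call)
    if call.get("name") not in TOOLS:
        raise ValueError(f"Unknown tool: {call.get('name')!r}")
    schema, func = TOOLS[call["name"]]
    args = call.get("arguments", {})
    params = schema["parameters"]

    missing = [key for key in params["required"] if key not in args]
    if missing:
        raise ValueError(f"Missing required arguments: {missing}")
    for key, value in args.items():
        if key not in params["properties"]:
            raise ValueError(f"Unexpected argument: {key!r}")
        allowed = params["properties"][key].get("enum")
        if allowed is not None and value not in allowed:
            raise ValueError(f"{key!r} must be one of {allowed}")

    result = func(**args)
    # Step 3: package the result as a message to append to the chat history.
    # Assumption: the runtime accepts a "tool" role for tool results.
    return {"role": "tool", "name": call["name"], "content": json.dumps(result)}
```

Validating before executing keeps malformed or hallucinated arguments from reaching the real tool; the error message can be fed back to the model so it can retry.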

Troubleshooting

Frequently Asked Questions

Quick answers to common questions about running Gemma 4.


Ready to Get Started?

Choose your level above and start running Gemma 4 today. Join thousands of developers building with Google's open model.

Frequently Asked Questions

Which model should I choose?

Why is Ollama recommended?

What quantization should I use?

How much VRAM do I need?

Can I run Gemma 4 on a Mac?

How fast is inference?

Where can I get help?
