5 Ways to Run Gemma 4

Run Gemma 4 Locally:
From Zero to Running in Minutes

The options range from trying it online in seconds to running a production server. Choose the method that fits your hardware and experience level.

Level 1
No Barrier to Entry

Try Online

No Setup

For complete beginners

Test Gemma 4 directly in your browser. No installation or account required for basic use.

Level 2
Recommended

Run with Ollama


Development & experimentation

Run Gemma 4 locally with a single command. Best balance of simplicity and performance for most users.

Hardware Requirements

E2B

4GB RAM

E4B

6GB RAM

26B MoE

8GB+ RAM

31B Dense

16GB+ RAM

Model VRAM Guide

Model     | Min VRAM         | Best For
----------|------------------|----------------------
E2B       | 4GB              | Mobile / Raspberry Pi
E4B       | 6GB              | Laptop GPU
26B MoE   | 8GB (quantized)  | RTX 3080 / M2 Pro
31B Dense | 24GB (quantized) | RTX 4090 / H100
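A quick way to sanity-check these numbers yourself: weight memory is roughly parameter count times bits per weight divided by eight, plus headroom for the KV cache and activations. A minimal sketch (the 1.5 GB overhead constant is an assumed ballpark, not a measured figure):

```python
def estimate_vram_gb(params_billion: float, bits_per_weight: int,
                     overhead_gb: float = 1.5) -> float:
    """Rough VRAM estimate: weights plus fixed headroom for KV cache/activations."""
    weight_gb = params_billion * bits_per_weight / 8  # 1B params at 8 bits ~= 1 GB
    return round(weight_gb + overhead_gb, 1)

# 31B model: 4-bit quantization vs full 16-bit precision
print(estimate_vram_gb(31, 4))   # ~17 GB, fits a 24 GB card
print(estimate_vram_gb(31, 16))  # ~63.5 GB, needs multi-GPU memory
```

This is why the 31B Dense model in the table needs quantization to fit a 24GB RTX 4090.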

Quick Start

# Install Ollama
# macOS
brew install ollama
# Windows / Linux: download from https://ollama.com

# Pull Gemma 4
ollama pull gemma4:31b       # 31B Dense (best quality)
ollama pull gemma4:26b-moe   # 26B MoE (faster)
ollama pull gemma4:4b        # E4B (edge devices)
ollama pull gemma4:2b        # E2B (mobile)

# Start chatting
ollama run gemma4:31b
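Beyond the interactive `ollama run` session, Ollama also exposes a local REST API on port 11434, which is handy for scripting. A minimal sketch using only the standard library (the `gemma4:31b` tag assumes you pulled it above, and `ollama serve` or the desktop app must be running):

```python
import json
import urllib.request

def build_payload(prompt: str, model: str = "gemma4:31b") -> dict:
    # stream=False returns one JSON object instead of a token stream
    return {"model": model, "prompt": prompt, "stream": False}

def ollama_generate(prompt: str, model: str = "gemma4:31b") -> str:
    """POST to the local Ollama server (default port 11434), return the reply text."""
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(build_payload(prompt, model)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# With the Ollama server running in the background:
# print(ollama_generate("Summarize the Gemma 4 model family in one sentence."))
```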
Level 3
Advanced

Run with llama.cpp


Maximum performance & control

High-performance inference with full GPU acceleration and quantization support. For power users who need maximum control.

Hardware Requirements

GPU

CUDA / Metal / ROCm

Memory

8GB+ VRAM

Storage

20GB+

Quick Start

# Clone and build (llama.cpp now builds with CMake)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build && cmake --build build --config Release

# Download a GGUF build of the model from Hugging Face, then
# run with quantization (Q4_K_M recommended for 31B)
./build/bin/llama-cli -m gemma4-31b-Q4_K_M.gguf -n 512 --interactive
Level 4
Production

Run with vLLM

Server

Production deployments

Production-grade inference server with PagedAttention, tensor parallelism, and OpenAI-compatible API.

Hardware Requirements

GPUs

1+ (tensor parallel)

Memory

16GB+ VRAM

Storage

40GB+

Quick Start

from vllm import LLM, SamplingParams

llm = LLM(model="google/gemma-4-31b-it")
sampling_params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(["Explain quantum computing in simple terms"], sampling_params)
print(outputs[0].outputs[0].text)
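The snippet above runs vLLM in offline batch mode. For the OpenAI-compatible API mentioned earlier, launch the server (`vllm serve google/gemma-4-31b-it` in recent vLLM releases) and query it over HTTP. A stdlib-only sketch, assuming vLLM's default port 8000:

```python
import json
import urllib.request

def chat_request(prompt: str, model: str = "google/gemma-4-31b-it") -> dict:
    """Build an OpenAI-style chat completion payload for the vLLM server."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
        "max_tokens": 512,
    }

def query_vllm(prompt: str, base_url: str = "http://localhost:8000") -> str:
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(chat_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]

# With the vLLM server running:
# print(query_vllm("Explain quantum computing in simple terms"))
```

Because the endpoint speaks the OpenAI wire format, the official `openai` Python client also works if you point its `base_url` at the server.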
Level 5
API

Hugging Face API

API Access

Quick integration

Use Hugging Face's hosted Inference API for quick integration without managing any infrastructure.

Quick Start

import requests

API_URL = "https://api-inference.huggingface.co/models/google/gemma-4-31b-it"
headers = {"Authorization": "Bearer YOUR_HF_TOKEN"}

def query(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()

output = query({"inputs": "What is Gemma 4?"})
print(output)
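The hosted API returns HTTP 503 while a cold model is still loading, so production callers usually add a retry loop. A sketch building on the snippet above (the retry count and wait time are arbitrary choices, not API recommendations):

```python
import time
import requests

API_URL = "https://api-inference.huggingface.co/models/google/gemma-4-31b-it"
headers = {"Authorization": "Bearer YOUR_HF_TOKEN"}

def is_loading(status_code: int) -> bool:
    # The Inference API answers 503 while the model is cold-loading
    return status_code == 503

def query_with_retry(payload: dict, retries: int = 3, wait: float = 20.0) -> dict:
    for _ in range(retries):
        response = requests.post(API_URL, headers=headers, json=payload)
        if not is_loading(response.status_code):
            response.raise_for_status()  # surface auth / rate-limit errors
            return response.json()
        time.sleep(wait)  # give the model time to load, then retry
    raise RuntimeError("model did not finish loading; try again later")
```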
Troubleshooting

Frequently Asked Questions

Quick answers to common questions about running Gemma 4.


Ready to Get Started?

Choose your level above and start running Gemma 4 today. Join thousands of developers building with Google's open model.