✨Gemma 3: How to Run & Fine-tune

How to run Gemma 3 effectively with our GGUFs on llama.cpp, Ollama, Open WebUI and how to fine-tune with Unsloth!

Google released Gemma 3 in 4 sizes - 1B, 4B, 12B and 27B models! The smallest 1B model is text only, whilst the rest are capable of vision and text input! We provide GGUFs, and a guide of how to run it effectively, and how to finetune & do reasoning finetuning with Gemma 3!

NEW: Try our new fine-tuning notebook for Gemma 3 (4B) Vision here.

We also uploaded new quants using Google's new QAT method. See the full collection here.

Unsloth is the only framework which works in float16 machines for Gemma 3 inference and training. This means Colab Notebooks with free Tesla T4 GPUs also work!

Fine-tune Gemma 3 (4B) with vision support using our free Colab notebook

According to the Gemma team, the optimal config for inference is temperature = 1.0, top_k = 64, top_p = 0.95, min_p = 0.0

Unsloth Gemma 3 uploads with optimal configs:

GGUF

Unsloth Dynamic 4-bit Instruct

16-bit Instruct

We fixed an issue with our Gemma 3 GGUF uploads where previously they did not support vision. Now they do.

⚙️ Official Recommended Inference Settings

According to the Gemma team, the official recommended settings for inference is:

Temperature of 1.0
Top_K of 64
Min_P of 0.00 (optional, but 0.01 works well, llama.cpp default is 0.1)
Top_P of 0.95
Repetition Penalty of 1.0. (1.0 means disabled in llama.cpp and transformers)

Chat template:

<bos><start_of_turn>user\nHello!<end_of_turn>\n<start_of_turn>model\nHey there!<end_of_turn>\n<start_of_turn>user\nWhat is 1+1?<end_of_turn>\n<start_of_turn>model\n

Chat template with \nnewlines rendered (except for the last)

<bos><start_of_turn>user
Hello!<end_of_turn>
<start_of_turn>model
Hey there!<end_of_turn>
<start_of_turn>user
What is 1+1?<end_of_turn>
<start_of_turn>model\n

llama.cpp an other inference engines auto add a <bos> - DO NOT add TWO <bos> tokens! You should ignore the <bos> when prompting the model!

🦙 Tutorial: How to Run Gemma 3 27B in Ollama

Install ollama if you haven't already!

apt-get update
apt-get install pciutils -y
curl -fsSL https://ollama.com/install.sh | sh

Run the model! Note you can call ollama servein another terminal if it fails! We include all our fixes and suggested parameters (temperature etc) in params in our Hugging Face upload!

ollama run hf.co/unsloth/gemma-3-27b-it-GGUF:Q4_K_M

📖 Tutorial: How to Run Gemma 3 27B in llama.cpp

Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.

apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggerganov/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=ON -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-quantize llama-cli llama-gguf-split llama-mtmd-cli
cp llama.cpp/build/bin/llama-* llama.cpp

If you want to use llama.cpp directly to load models, you can do the below: (:Q4_K_XL) is the quantization type. You can also download via Hugging Face (point 3). This is similar to ollama run

./llama.cpp/llama-mtmd-cli \
    -hf unsloth/gemma-3-4b-it-GGUF:Q4_K_XL

OR download the model via (after installing pip install huggingface_hub hf_transfer ). You can choose Q4_K_M, or other quantized versions (like BF16 full precision). More versions at: https://huggingface.co/unsloth/gemma-3-27b-it-GGUF

# !pip install huggingface_hub hf_transfer
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id = "unsloth/gemma-3-27b-it-GGUF",
    local_dir = "unsloth/gemma-3-27b-it-GGUF",
    allow_patterns = ["*Q4_K_M*", "mmproj-BF16.gguf"], # For Q4_K_M
)

Run Unsloth's Flappy Bird test
Edit --threads 32 for the number of CPU threads, --ctx-size 16384 for context length (Gemma 3 supports 128K context length!), --n-gpu-layers 99 for GPU offloading on how many layers. Try adjusting it if your GPU goes out of memory. Also remove it if you have CPU only inference.
For conversation mode:

./llama.cpp/llama-mtmd-cli \
    --model unsloth/gemma-3-27b-it-GGUF/gemma-3-27b-it-Q4_K_M.gguf \
    --mmproj unsloth/gemma-3-27b-it-GGUF/mmproj-BF16.gguf \
    --threads 32 \
    --ctx-size 16384 \
    --n-gpu-layers 99 \
    --seed 3407 \
    --prio 2 \
    --temp 1.0 \
    --repeat-penalty 1.0 \
    --min-p 0.01 \
    --top-k 64 \
    --top-p 0.95

For non conversation mode to test Flappy Bird:

./llama.cpp/llama-cli \
    --model unsloth/gemma-3-27b-it-GGUF/gemma-3-27b-it-Q4_K_M.gguf \
    --threads 32 \
    --ctx-size 16384 \
    --n-gpu-layers 99 \
    --seed 3407 \
    --prio 2 \
    --temp 1.0 \
    --repeat-penalty 1.0 \
    --min-p 0.01 \
    --top-k 64 \
    --top-p 0.95 \
    -no-cnv \
    --prompt "<start_of_turn>user\nCreate a Flappy Bird game in Python. You must include these things:\n1. You must use pygame.\n2. The background color should be randomly chosen and is a light shade. Start with a light blue color.\n3. Pressing SPACE multiple times will accelerate the bird.\n4. The bird's shape should be randomly chosen as a square, circle or triangle. The color should be randomly chosen as a dark color.\n5. Place on the bottom some land colored as dark brown or yellow chosen randomly.\n6. Make a score shown on the top right side. Increment if you pass pipes and don't hit them.\n7. Make randomly spaced pipes with enough space. Color them randomly as dark green or light brown or a dark gray shade.\n8. When you lose, show the best score. Make the text inside the screen. Pressing q or Esc will quit the game. Restarting is pressing SPACE again.\nThe final game should be inside a markdown section in Python. Check your code for errors and fix them before the final markdown section.<end_of_turn>\n<start_of_turn>model\n"

The full input from our https://unsloth.ai/blog/deepseekr1-dynamic 1.58bit blog is:

Remember to remove <bos> since Gemma 3 auto adds a <bos>!

<start_of_turn>user
Create a Flappy Bird game in Python. You must include these things:
1. You must use pygame.
2. The background color should be randomly chosen and is a light shade. Start with a light blue color.
3. Pressing SPACE multiple times will accelerate the bird.
4. The bird's shape should be randomly chosen as a square, circle or triangle. The color should be randomly chosen as a dark color.
5. Place on the bottom some land colored as dark brown or yellow chosen randomly.
6. Make a score shown on the top right side. Increment if you pass pipes and don't hit them.
7. Make randomly spaced pipes with enough space. Color them randomly as dark green or light brown or a dark gray shade.
8. When you lose, show the best score. Make the text inside the screen. Pressing q or Esc will quit the game. Restarting is pressing SPACE again.
The final game should be inside a markdown section in Python. Check your code for error

🦥 Fine-tuning Gemma 3 in Unsloth

Unsloth is the only framework which works in float16 machines for Gemma 3 inference and training. This means Colab Notebooks with free Tesla T4 GPUs also work!

Fine-tune Gemma 3 (4B) using our notebooks for: Text or Vision

Unsloth Fine-tuning Fixes

Our solution in Unsloth is 3 fold:

Keep all intermediate activations in bfloat16 format - can be float32, but this uses 2x more VRAM or RAM (via Unsloth's async gradient checkpointing)
Do all matrix multiplies in float16 with tensor cores, but manually upcasting / downcasting without the help of Pytorch's mixed precision autocast.
Upcast all other options that don't need matrix multiplies (layernorms) to float32.

🤔 Gemma 3 Fixes Analysis

First, before we finetune or run Gemma 3, we found that when using float16 mixed precision, gradients and activations become infinity unfortunately. This happens in T4 GPUs, RTX 20x series and V100 GPUs where they only have float16 tensor cores.

For newer GPUs like RTX 30x or higher, A100s, H100s etc, these GPUs have bfloat16 tensor cores, so this problem does not happen! But why?

Float16 can only represent numbers up to 65504, whilst bfloat16 can represent huge numbers up to 10^38! But notice both number formats use only 16bits! This is because float16 allocates more bits so it can represent smaller decimals better, whilst bfloat16 cannot represent fractions well.

But why float16? Let's just use float32! But unfortunately float32 in GPUs is very slow for matrix multiplications - sometimes 4 to 10x slower! So we cannot do this.

PreviousTutorials: How To Fine-tune & Run LLMs NextMagistral: How to Run & Fine-tune

Last updated 2 days ago