✨ Gemma 3n: How to Run & Fine-tune
How to run Google's new Gemma 3n locally with Dynamic GGUFs on llama.cpp, Ollama, Open WebUI and how to fine-tune with Unsloth!
Google's Gemma 3n is a multimodal model that handles image, audio, video, and text inputs. Available in E2B and E4B (effective 2B and 4B parameter) sizes, it supports 140 languages for text and multimodal tasks. You can now run and fine-tune Gemma-3n-E4B and Gemma-3n-E2B locally using Unsloth.
Gemma 3n has a 32K context length, accepts up to 30s of audio input, and supports OCR, automatic speech recognition (ASR), and speech translation via prompts.
Unsloth's Gemma 3n (Instruct) uploads come with our optimal configs.
See all our Gemma 3n uploads, including base models and more formats, in our collection here.
🖥️ Running Gemma 3n
Currently, Gemma 3n is only supported for text inference.
⚙️ Official Recommended Settings
According to the Gemma team, the official recommended settings for inference are:
temperature = 1.0, top_k = 64, top_p = 0.95, min_p = 0.0
Temperature of 1.0
Top_K of 64
Min_P of 0.00 (optional, but 0.01 works well; llama.cpp's default is 0.1)
Top_P of 0.95
Repetition Penalty of 1.0 (1.0 means disabled in llama.cpp and transformers)
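If you are generating from Python, these settings map directly onto generation arguments. Below is a minimal sketch, assuming you have already loaded a Gemma 3n model and tokenizer (via transformers or Unsloth); the variable names are illustrative:
# Hedged sketch: the recommended sampling settings as a transformers GenerationConfig.
# Assumes `model` and `inputs` already exist; adjust max_new_tokens as needed.
from transformers import GenerationConfig

gen_config = GenerationConfig(
    do_sample = True,          # sampling must be on for temperature/top_p/top_k to apply
    temperature = 1.0,
    top_k = 64,
    top_p = 0.95,
    min_p = 0.00,              # optional; 0.01 also works well
    repetition_penalty = 1.0,  # 1.0 = disabled
    max_new_tokens = 256,
)

# outputs = model.generate(**inputs, generation_config = gen_config)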
Chat template:
<bos><start_of_turn>user\nHello!<end_of_turn>\n<start_of_turn>model\nHey there!<end_of_turn>\n<start_of_turn>user\nWhat is 1+1?<end_of_turn>\n<start_of_turn>model\n
Chat template with \n newlines rendered (except for the last):
<bos><start_of_turn>user
Hello!<end_of_turn>
<start_of_turn>model
Hey there!<end_of_turn>
<start_of_turn>user
What is 1+1?<end_of_turn>
<start_of_turn>model\n
llama.cpp and other inference engines auto-add a <bos> token - DO NOT add TWO <bos> tokens! You should leave out the <bos> when prompting the model yourself!
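The easiest way to avoid a double <bos> is to let the tokenizer build the prompt for you. Here is a minimal sketch, assuming the unsloth/gemma-3n-E4B-it upload and the transformers chat-template API:
# Hedged sketch: building the Gemma 3n prompt with apply_chat_template so <bos>
# and the turn markers are added exactly once.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("unsloth/gemma-3n-E4B-it")

messages = [
    {"role": "user",      "content": "Hello!"},
    {"role": "assistant", "content": "Hey there!"},
    {"role": "user",      "content": "What is 1+1?"},
]

# add_generation_prompt appends the trailing "<start_of_turn>model\n"
prompt = tokenizer.apply_chat_template(
    messages, tokenize = False, add_generation_prompt = True
)
print(prompt)
# If you tokenize this string yourself afterwards, pass add_special_tokens = False
# so <bos> is not added a second time.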
🦙 Tutorial: How to Run Gemma 3n in Ollama
Install ollama if you haven't already!
apt-get update
apt-get install pciutils -y
curl -fsSL https://ollama.com/install.sh | sh
Run the model! Note you can call ollama serve in another terminal if it fails. We include all our fixes and suggested parameters (temperature etc.) in the params file in our Hugging Face upload!
ollama run hf.co/unsloth/gemma-3n-E4B-it-GGUF:UD-Q4_K_XL
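Once the model is pulled, you can also query it programmatically. Below is a minimal sketch, assuming Ollama's default REST endpoint on localhost:11434 and the model tag above:
# Hedged sketch: chatting with the model through Ollama's local REST API.
# Assumes `ollama serve` is running on the default port 11434.
import requests

response = requests.post(
    "http://localhost:11434/api/chat",
    json = {
        "model": "hf.co/unsloth/gemma-3n-E4B-it-GGUF:UD-Q4_K_XL",
        "messages": [{"role": "user", "content": "What is 1+1?"}],
        "stream": False,
    },
)
print(response.json()["message"]["content"])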
📖 Tutorial: How to Run Gemma 3n in llama.cpp
Obtain the latest llama.cpp from GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.
apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggerganov/llama.cpp
cmake llama.cpp -B llama.cpp/build \
-DBUILD_SHARED_LIBS=ON -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-quantize llama-cli llama-gguf-split llama-mtmd-cli
cp llama.cpp/build/bin/llama-* llama.cpp
If you want to use llama.cpp directly to load models, you can do the below. (:Q4_K_XL) is the quantization type. You can also download via Hugging Face (point 3). This is similar to ollama run:
./llama.cpp/llama-cli -hf unsloth/gemma-3n-E4B-it-GGUF:UD-Q4_K_XL -ngl 99 --jinja
Or, download the model via the snippet below (after installing pip install huggingface_hub hf_transfer). You can choose Q4_K_M or other quantized versions (like BF16 full precision).
# !pip install huggingface_hub hf_transfer
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id = "unsloth/gemma-3n-E4B-it-GGUF",
    local_dir = "unsloth/gemma-3n-E4B-it-GGUF",
    allow_patterns = ["*UD-Q4_K_XL*", "mmproj-BF16.gguf"], # For Q4_K_XL
)
Run the model. Edit --threads 32 for the number of CPU threads, --ctx-size 16384 for context length (Gemma 3n supports 32K context length!), and --n-gpu-layers 99 for how many layers to offload to the GPU. Try lowering it if your GPU runs out of memory, and remove it entirely for CPU-only inference. For conversation mode:
./llama.cpp/llama-cli \
--model unsloth/gemma-3n-E4B-it-GGUF/gemma-3n-E4B-it-UD-Q4_K_XL.gguf \
--ctx-size 32768 \
--n-gpu-layers 99 \
--seed 3407 \
--prio 2 \
--temp 1.0 \
--repeat-penalty 1.0 \
--min-p 0.00 \
--top-k 64 \
--top-p 0.95
For non-conversational mode, to test Flappy Bird:
./llama.cpp/llama-cli \
--model unsloth/gemma-3n-E4B-it-GGUF/gemma-3n-E4B-it-UD-Q4_K_XL.gguf \
--ctx-size 32768 \
--n-gpu-layers 99 \
--seed 3407 \
--prio 2 \
--temp 1.0 \
--repeat-penalty 1.0 \
--min-p 0.00 \
--top-k 64 \
--top-p 0.95 \
-no-cnv \
--prompt "<start_of_turn>user\nCreate a Flappy Bird game in Python. You must include these things:\n1. You must use pygame.\n2. The background color should be randomly chosen and is a light shade. Start with a light blue color.\n3. Pressing SPACE multiple times will accelerate the bird.\n4. The bird's shape should be randomly chosen as a square, circle or triangle. The color should be randomly chosen as a dark color.\n5. Place on the bottom some land colored as dark brown or yellow chosen randomly.\n6. Make a score shown on the top right side. Increment if you pass pipes and don't hit them.\n7. Make randomly spaced pipes with enough space. Color them randomly as dark green or light brown or a dark gray shade.\n8. When you lose, show the best score. Make the text inside the screen. Pressing q or Esc will quit the game. Restarting is pressing SPACE again.\nThe final game should be inside a markdown section in Python. Check your code for errors and fix them before the final markdown section.<end_of_turn>\n<start_of_turn>model\n"
Remember to remove <bos> from your prompt since a <bos> is auto-added for Gemma 3n!
🦥 Fine-tuning Gemma 3n with Unsloth
Gemma 3n Notebooks coming soon! Stay tuned.
Fine-tune Gemma 3n with text, vision & audio with our free Colab notebook
We also heard you guys wanted a Vision notebook for Gemma 3 (4B), so here it is:
Fine-tune Gemma 3 (4B) with Vision support using our free Colab notebook
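For reference, a typical Unsloth fine-tuning run looks roughly like the sketch below. This is a minimal text-only LoRA outline assuming the FastModel API from recent Unsloth releases; the notebooks above contain the exact, tested code and hyperparameters:
# Hedged sketch of a LoRA fine-tune with Unsloth. Values are illustrative
# defaults, not tuned recommendations.
from unsloth import FastModel

model, tokenizer = FastModel.from_pretrained(
    model_name = "unsloth/gemma-3n-E4B-it",
    max_seq_length = 2048,
    load_in_4bit = True,   # 4-bit QLoRA to fit on smaller GPUs
)

model = FastModel.get_peft_model(
    model,
    r = 16,                # LoRA rank
    lora_alpha = 16,
    lora_dropout = 0,
)

# From here, train with TRL's SFTTrainer on your dataset as usual.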
🛠️ Technical Analysis
Gemma 3n: MatFormer
So what is so special about Gemma 3n, you ask? It is based on the Matryoshka Transformer, or MatFormer, architecture, meaning each transformer layer/block embeds/nests FFNs of progressively smaller sizes. Think of it like progressively smaller cups nested inside one another. Training is done so that, at inference time, you can choose the size you want and still get most of the performance of the bigger model.
There is also Per-Layer Embedding, which can be cached to reduce memory usage at inference time. So the 2B model (E2B) is a sub-network inside the 4B (actually 5.44B) model, obtained by both Per-Layer Embedding caching and skipping the audio and vision components to focus solely on text.
The MatFormer architecture is typically trained with exponentially spaced sub-models, i.e. of sizes S, S/2, S/4, S/8, etc. in each of the layers. At training time, inputs are randomly forwarded through one of these sub-blocks, giving every sub-block an equal chance to learn. The advantage is that, at inference time, if you want the model to be 1/4 of the original size, you can pick the S/4-sized sub-block in each layer.
You can also Mix and Match, where you pick, say, the S/4-sized sub-block of one layer, the S/2-sized sub-block of another layer, and the S/8-sized sub-block of yet another. In fact, you can change the sub-models you pick based on the input itself if you fancy. Basically, it's a choose-your-own structure at every layer. So by training a model of one particular size, you create exponentially many models of smaller sizes. No learning goes to waste. Pretty neat, huh?
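To make the nesting concrete, here is a toy sketch (not Google's implementation) of a MatFormer-style FFN whose smaller sub-blocks are prefixes of the full weight matrices, so picking S/4 just means using the first quarter of the hidden units:
# Toy illustration only: a nested FFN where the S/2, S/4, ... sub-blocks share
# (and are contained in) the full block's weights.
import torch
import torch.nn as nn

class NestedFFN(nn.Module):
    def __init__(self, d_model = 64, d_ff = 256):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff)
        self.down = nn.Linear(d_ff, d_model)

    def forward(self, x, fraction = 1.0):
        # Use only the first `fraction` of the hidden units: S, S/2, S/4, ...
        k = int(self.up.out_features * fraction)
        h = torch.relu(x @ self.up.weight[:k].T + self.up.bias[:k])
        return h @ self.down.weight[:, :k].T + self.down.bias

x = torch.randn(1, 64)
ffn = NestedFFN()
full = ffn(x, fraction = 1.0)      # the full "S" block
quarter = ffn(x, fraction = 0.25)  # the nested "S/4" sub-block, same weights

At training time you would randomly sample the fraction per step (so every sub-block gets to learn), and at inference time you pick one fraction per layer, or mix and match.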
