
Gemma 3n: How to Run & Fine-tune

How to run Google's new Gemma 3n locally with Dynamic GGUFs on llama.cpp, Ollama, and Open WebUI, and how to fine-tune it with Unsloth!

Google’s Gemma 3n multimodal model handles image, audio, video, and text inputs. Available in E2B and E4B (effective 2B and 4B parameter) sizes, it supports 140 languages for text and multimodal tasks. You can now run and fine-tune Gemma-3n-E4B and Gemma-3n-E2B locally using Unsloth.

Gemma 3n supports a 32K context length, 30-second audio input, OCR, automatic speech recognition (ASR), and speech translation via prompts.


Unsloth Gemma 3n (Instruct) uploads with optimal configs:

  • Dynamic 2.0 GGUF (text only)
  • Dynamic 4-bit Instruct (to fine-tune)
  • 16-bit Instruct

See all our Gemma 3n uploads including base and more formats in our collection here.

🖥️ Running Gemma 3n

Currently, Gemma 3n inference is supported for text only.

According to the Gemma team, the official recommended settings for inference are:

temperature = 1.0, top_k = 64, top_p = 0.95, min_p = 0.0

  • Temperature of 1.0

  • Top_K of 64

  • Min_P of 0.00 (optional, but 0.01 works well, llama.cpp default is 0.1)

  • Top_P of 0.95

  • Repetition Penalty of 1.0. (1.0 means disabled in llama.cpp and transformers)

  • Chat template:

    <bos><start_of_turn>user\nHello!<end_of_turn>\n<start_of_turn>model\nHey there!<end_of_turn>\n<start_of_turn>user\nWhat is 1+1?<end_of_turn>\n<start_of_turn>model\n
  • Chat template with the \n newlines rendered (except for the last one):

<bos><start_of_turn>user
Hello!<end_of_turn>
<start_of_turn>model
Hey there!<end_of_turn>
<start_of_turn>user
What is 1+1?<end_of_turn>
<start_of_turn>model\n
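
If you prefer to run Gemma 3n through Hugging Face transformers rather than a GGUF, here is a minimal text-only sketch of applying the chat template and the recommended sampling settings above. This is a sketch, not an official snippet: it assumes a recent transformers version with Gemma 3n support, and the model ID points at our 16-bit Instruct upload (google/gemma-3n-E4B-it works the same way).

# A minimal text-only sketch, assuming a recent transformers version with Gemma 3n support.
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "unsloth/gemma-3n-E4B-it"  # assumption: swap for your preferred upload or local path
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id, device_map="auto", torch_dtype="auto")

messages = [
    {"role": "user", "content": [{"type": "text", "text": "What is 1+1?"}]},
]
# apply_chat_template builds the <start_of_turn> formatted prompt shown above.
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

# Recommended settings: temperature 1.0, top_k 64, top_p 0.95, min_p 0.0,
# repetition penalty 1.0 (i.e. disabled).
outputs = model.generate(
    **inputs, max_new_tokens=128, do_sample=True,
    temperature=1.0, top_k=64, top_p=0.95, min_p=0.0, repetition_penalty=1.0,
)
print(processor.batch_decode(outputs[:, inputs["input_ids"].shape[-1]:], skip_special_tokens=True)[0])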

🦙 Tutorial: How to Run Gemma 3n in Ollama

  1. Install Ollama if you haven't already!

apt-get update
apt-get install pciutils -y
curl -fsSL https://ollama.com/install.sh | sh
  2. Run the model! Note that you can call ollama serve in another terminal if it fails. We include all our fixes and suggested parameters (temperature etc.) in the params file of our Hugging Face upload!

ollama run hf.co/unsloth/gemma-3n-E4B-it-GGUF:UD-Q4_K_XL
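
Once the model is pulled, you can also talk to it through Ollama's local HTTP API (Ollama listens on http://localhost:11434 by default). A small sketch, with the options mirroring the recommended sampling settings above:

# A sketch: query the pulled model via Ollama's HTTP chat API.
curl http://localhost:11434/api/chat -d '{
  "model": "hf.co/unsloth/gemma-3n-E4B-it-GGUF:UD-Q4_K_XL",
  "messages": [{"role": "user", "content": "What is 1+1?"}],
  "options": {"temperature": 1.0, "top_k": 64, "top_p": 0.95, "min_p": 0.0, "repeat_penalty": 1.0},
  "stream": false
}'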

📖 Tutorial: How to Run Gemma 3n in llama.cpp

  1. Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.

apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggerganov/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=ON -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-quantize llama-cli llama-gguf-split llama-mtmd-cli
cp llama.cpp/build/bin/llama-* llama.cpp
  2. If you want to use llama.cpp directly to load models, you can do the below (the :UD-Q4_K_XL suffix is the quantization type). You can also download the model via Hugging Face instead (point 3). This is similar to ollama run.

./llama.cpp/llama-cli -hf unsloth/gemma-3n-E4B-it-GGUF:UD-Q4_K_XL -ngl 99 --jinja
  3. OR download the model via Hugging Face (after running pip install huggingface_hub hf_transfer). You can choose Q4_K_M, or other quantized versions (like BF16 full precision).

# !pip install huggingface_hub hf_transfer
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id = "unsloth/gemma-3n-E4B-it-GGUF",
    local_dir = "unsloth/gemma-3n-E4B-it-GGUF",
    allow_patterns = ["*UD-Q4_K_XL*", "mmproj-BF16.gguf"], # For Q4_K_XL
)
  4. Run the model.

  5. Edit --threads 32 for the number of CPU threads, --ctx-size 16384 for the context length (Gemma 3n supports a 32K context length!), and --n-gpu-layers 99 for how many layers to offload to the GPU. Try lowering it if your GPU runs out of memory, and remove it for CPU-only inference.

  6. For conversation mode:

./llama.cpp/llama-cli \
    --model unsloth/gemma-3n-E4B-it-GGUF/gemma-3n-E4B-it-UD-Q4_K_XL.gguf \
    --ctx-size 32768 \
    --n-gpu-layers 99 \
    --seed 3407 \
    --prio 2 \
    --temp 1.0 \
    --repeat-penalty 1.0 \
    --min-p 0.00 \
    --top-k 64 \
    --top-p 0.95
  7. For non-conversation mode, to test Flappy Bird:

./llama.cpp/llama-cli \
    --model unsloth/gemma-3n-E4B-it-GGUF/gemma-3n-E4B-it-UD-Q4_K_XL.gguf \
    --ctx-size 32768 \
    --n-gpu-layers 99 \
    --seed 3407 \
    --prio 2 \
    --temp 1.0 \
    --repeat-penalty 1.0 \
    --min-p 0.00 \
    --top-k 64 \
    --top-p 0.95 \
    -no-cnv \
    --prompt "<start_of_turn>user\nCreate a Flappy Bird game in Python. You must include these things:\n1. You must use pygame.\n2. The background color should be randomly chosen and is a light shade. Start with a light blue color.\n3. Pressing SPACE multiple times will accelerate the bird.\n4. The bird's shape should be randomly chosen as a square, circle or triangle. The color should be randomly chosen as a dark color.\n5. Place on the bottom some land colored as dark brown or yellow chosen randomly.\n6. Make a score shown on the top right side. Increment if you pass pipes and don't hit them.\n7. Make randomly spaced pipes with enough space. Color them randomly as dark green or light brown or a dark gray shade.\n8. When you lose, show the best score. Make the text inside the screen. Pressing q or Esc will quit the game. Restarting is pressing SPACE again.\nThe final game should be inside a markdown section in Python. Check your code for errors and fix them before the final markdown section.<end_of_turn>\n<start_of_turn>model\n"

🦥 Fine-tuning Gemma 3n with Unsloth

Fine-tune Gemma 3n with text, vision & audio using our free Colab notebook.
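
If you just want a feel for the API outside of Colab, here is a minimal text-only LoRA sketch following the usual Unsloth FastModel + TRL SFTTrainer pattern. It is not the notebook itself: the tiny in-memory dataset, step count, and hyperparameters are placeholders, and the notebook additionally covers vision and audio.

# A minimal text-only LoRA sketch (placeholders for dataset and hyperparameters).
from unsloth import FastModel
from trl import SFTTrainer, SFTConfig
from datasets import Dataset

model, tokenizer = FastModel.from_pretrained(
    model_name = "unsloth/gemma-3n-E4B-it",
    max_seq_length = 2048,
    load_in_4bit = True,              # QLoRA-style 4-bit loading to fit on small GPUs
)
model = FastModel.get_peft_model(
    model,
    finetune_vision_layers = False,   # text-only for this sketch
    finetune_language_layers = True,
    r = 16, lora_alpha = 16, lora_dropout = 0,
)

# Tiny in-memory dataset already in Gemma's chat format (placeholder data).
dataset = Dataset.from_dict({"text": [
    "<start_of_turn>user\nWhat is 1+1?<end_of_turn>\n<start_of_turn>model\n2<end_of_turn>\n",
]})

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    args = SFTConfig(
        dataset_text_field = "text",
        per_device_train_batch_size = 1,
        gradient_accumulation_steps = 4,
        max_steps = 10,
        learning_rate = 2e-4,
        output_dir = "outputs",
        report_to = "none",
    ),
)
trainer.train()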

We also heard you guys wanted a Vision notebook for Gemma 3 (4B), so here it is:

If you love Kaggle, Google is holding a competition where the best model fine-tuned with Gemma 3n and Unsloth will win a 10K prize! See more here.

🛠️ Technical Analysis

Gemma 3n: MatFormer

So what is so special about Gemma 3n, you ask? It is based on the Matryoshka Transformer, or MatFormer, architecture, meaning each transformer layer/block embeds/nests FFNs of progressively smaller sizes. Think of it like progressively smaller cups nested inside one another. Training is done so that at inference time you can choose the size you want and still get most of the performance of the bigger model.

There is also Per-Layer Embedding, which can be cached to reduce memory usage at inference time. So the 2B model (E2B) is a sub-network inside the 4B (really 5.44B-parameter) model, obtained through Per-Layer Embedding caching and by skipping the audio and vision components to focus solely on text.

The MatFormer architecture is typically trained with exponentially spaced sub-models, i.e. of sizes S, S/2, S/4, S/8, etc. in each layer. At training time, inputs are randomly forwarded through one of these sub-blocks, giving every sub-block an equal chance to learn. The advantage is that at inference time, if you want the model to be a quarter of the original size, you can pick the S/4-sized sub-block in each layer.

You can also choose to mix and match, where you pick, say, the S/4-sized sub-block of one layer, the S/2-sized sub-block of another layer, and the S/8-sized sub-block of a third. In fact, you can change which sub-models you pick based on the input itself if you fancy. Basically it's a choose-your-own structure at every layer. So by training a model of one particular size, you create exponentially many smaller models. No learning goes to waste. Pretty neat, huh?
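
To make the idea concrete, here is a toy PyTorch sketch (an illustration only, not Gemma 3n's actual implementation): a single FFN whose smaller sub-FFNs are prefixes of the full weight matrices, one randomly chosen sub-size per training step, and a per-layer mix and match at inference.

# Toy MatFormer-style FFN: nested sub-FFNs share the full-size weights.
import random
import torch
import torch.nn as nn

class MatFFN(nn.Module):
    def __init__(self, d_model=256, d_ff=1024):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff)    # full-size "S" weights
        self.down = nn.Linear(d_ff, d_model)
        self.d_ff = d_ff

    def forward(self, x, frac=1.0):
        # Use only the first frac * d_ff hidden units: the S/2, S/4, ... sub-FFNs
        # are nested inside the full matrices like progressively smaller cups.
        k = max(1, int(self.d_ff * frac))
        h = torch.relu(x @ self.up.weight[:k].T + self.up.bias[:k])
        return h @ self.down.weight[:, :k].T + self.down.bias

layers = nn.ModuleList(MatFFN() for _ in range(4))
x = torch.randn(2, 256)

# Training: forward through one randomly chosen, exponentially spaced sub-size
# so every sub-block gets an equal chance to learn.
frac = random.choice([1.0, 0.5, 0.25, 0.125])
out = x
for layer in layers:
    out = out + layer(out, frac=frac)

# Inference "mix and match": pick a different sub-size for each layer.
out = x
for layer, frac in zip(layers, [0.25, 0.5, 0.125, 1.0]):
    out = out + layer(out, frac=frac)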
