✨Gemma 3: How to Run Guide
How to run Gemma 3 effectively with our GGUFs on llama.cpp, Ollama, and Open WebUI, and how to fine-tune with Unsloth!
Google released Gemma 3 with a new 270M model alongside the previous 1B, 4B, 12B, and 27B sizes. The 270M and 1B models are text-only, while the larger models handle both text and vision. We provide GGUFs, a guide on how to run them effectively, and how to fine-tune & do RL with Gemma 3!
NEW Aug 14, 2025 Update: Try our Gemma 3 (270M) fine-tuning notebook and GGUFs to run.
Also see our Gemma 3n Guide.
Unsloth is the only framework that works on float16 machines for Gemma 3 inference and training. This means Colab notebooks with free Tesla T4 GPUs also work!
Fine-tune Gemma 3 (4B) with vision support using our free Colab notebook
Unsloth Gemma 3 uploads with optimal configs are available on our Hugging Face page.
⚙️ Recommended Inference Settings
According to the Gemma team, the official recommended inference settings are:
Temperature of 1.0
Top_K of 64
Min_P of 0.00 (optional, but 0.01 works well, llama.cpp default is 0.1)
Top_P of 0.95
Repetition Penalty of 1.0. (1.0 means disabled in llama.cpp and transformers)
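These settings map onto llama.cpp's sampling flags roughly as sketched below; the repo and :Q4_K_XL quant tag are just examples, so swap in whichever Gemma 3 upload you are using.

```bash
# Sketch: recommended Gemma 3 sampling settings expressed as llama-cli flags
#   --temp 1.0            -> Temperature
#   --top-k 64            -> Top_K
#   --top-p 0.95          -> Top_P
#   --min-p 0.00          -> Min_P (llama.cpp defaults to 0.1, so set it explicitly)
#   --repeat-penalty 1.0  -> Repetition Penalty (1.0 = disabled)
./llama.cpp/llama-cli \
    -hf unsloth/gemma-3-27b-it-GGUF:Q4_K_XL \
    --temp 1.0 --top-k 64 --top-p 0.95 --min-p 0.00 --repeat-penalty 1.0
```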
Chat template:
<bos><start_of_turn>user\nHello!<end_of_turn>\n<start_of_turn>model\nHey there!<end_of_turn>\n<start_of_turn>user\nWhat is 1+1?<end_of_turn>\n<start_of_turn>model\n
Chat template with \n newlines rendered (except for the last):
<bos><start_of_turn>user
Hello!<end_of_turn>
<start_of_turn>model
Hey there!<end_of_turn>
<start_of_turn>user
What is 1+1?<end_of_turn>
<start_of_turn>model
llama.cpp and other inference engines automatically add a <bos> token - DO NOT add TWO <bos> tokens! You should leave out the <bos> when prompting the model!
✨Running Gemma 3 on your phone
To run the models on your phone, we recommend using any mobile app that can run GGUFs locally on edge devices like phones. After fine-tuning, you can export your model to GGUF and then run it locally on your phone. Ensure your phone has enough RAM/power to process the models, as it can overheat, so we recommend Gemma 3 270M or the Gemma 3n models for this use-case. You can try the open-source project AnythingLLM's mobile app, which you can download on Android here, or ChatterUI - both are great apps for running GGUFs on your phone.
Remember, you can change the model name 'gemma-3-27b-it-GGUF' to any Gemma model like 'gemma-3-270m-it-GGUF:Q8_K_XL' for all the tutorials.
🦙 Tutorial: How to Run Gemma 3 in Ollama
Install ollama if you haven't already! For example, on Linux:
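A minimal sketch, using Ollama's official install script:

```bash
# Install Ollama on Linux via the official install script
curl -fsSL https://ollama.com/install.sh | sh
```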
Run the model! Note you can call ollama serve in another terminal if it fails! We include all our fixes and suggested parameters (temperature etc.) in params in our Hugging Face upload! You can change the model name 'gemma-3-27b-it-GGUF' to any Gemma model like 'gemma-3-270m-it-GGUF:Q8_K_XL'. For example:
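Something like the following pulls and runs the Unsloth GGUF straight from Hugging Face (the tag after the colon selects the quantization):

```bash
# Pull and run the Gemma 3 27B GGUF from Hugging Face via Ollama
ollama run hf.co/unsloth/gemma-3-27b-it-GGUF:Q4_K_XL
```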
📖 Tutorial: How to Run Gemma 3 27B in llama.cpp
Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.
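A build sketch along those lines, assuming a Debian/Ubuntu-style system with CUDA available (adjust the packages and flags for your setup):

```bash
# Install build dependencies
apt-get update
apt-get install build-essential cmake curl libcurl4-openssl-dev -y
# Clone and build llama.cpp with CUDA (set -DGGML_CUDA=OFF for CPU-only)
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first \
    --target llama-cli llama-gguf-split
# Copy the built binaries next to the repo for convenience
cp llama.cpp/build/bin/llama-* llama.cpp/
```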
If you want to use llama.cpp directly to load models, you can do the below: (:Q4_K_XL) is the quantization type. You can also download via Hugging Face (point 3). This is similar to ollama run.
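A sketch of loading the model directly from Hugging Face with llama-cli's -hf flag; the flags mirror the recommended settings above:

```bash
# Stream the GGUF from Hugging Face and chat with it (quant tag after the colon)
./llama.cpp/llama-cli \
    -hf unsloth/gemma-3-27b-it-GGUF:Q4_K_XL \
    --threads 32 --ctx-size 16384 --n-gpu-layers 99 \
    --temp 1.0 --top-k 64 --top-p 0.95 --min-p 0.00 --repeat-penalty 1.0
```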
OR download the model first (after installing huggingface_hub and hf_transfer via pip install huggingface_hub hf_transfer). You can choose Q4_K_M, or other quantized versions (like BF16 full precision). More versions at: https://huggingface.co/unsloth/gemma-3-27b-it-GGUF
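One way to do this from the command line is with huggingface-cli; the --include pattern below assumes the Q4_K_M files follow Unsloth's usual naming, so adjust it for the quantization you want:

```bash
pip install huggingface_hub hf_transfer
# Optional: enable the faster hf_transfer download backend
export HF_HUB_ENABLE_HF_TRANSFER=1
# Download only the Q4_K_M files into a local folder
huggingface-cli download unsloth/gemma-3-27b-it-GGUF \
    --include "*Q4_K_M*" \
    --local-dir unsloth/gemma-3-27b-it-GGUF
```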
Run Unsloth's Flappy Bird test
Edit --threads 32 for the number of CPU threads, --ctx-size 16384 for context length (Gemma 3 supports 128K context length!), and --n-gpu-layers 99 for how many layers to offload to the GPU. Try adjusting it if your GPU runs out of memory. Also remove it if you only have CPU inference.
For conversation mode:
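A sketch of a conversational run against the locally downloaded file; the exact .gguf filename is an assumption, so check what was actually downloaded:

```bash
# Interactive chat with the local Q4_K_M file
./llama.cpp/llama-cli \
    --model unsloth/gemma-3-27b-it-GGUF/gemma-3-27b-it-Q4_K_M.gguf \
    --threads 32 --ctx-size 16384 --n-gpu-layers 99 \
    --temp 1.0 --top-k 64 --top-p 0.95 --min-p 0.00 --repeat-penalty 1.0
```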
For non-conversation mode, to test Flappy Bird:
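A sketch using -no-cnv and a single prompt; the prompt here is shortened to a placeholder, and the full Flappy Bird prompt is in the blog linked below:

```bash
# Single-shot generation: -no-cnv disables interactive chat mode.
# Note: no <bos> in the prompt - llama.cpp adds it automatically.
./llama.cpp/llama-cli \
    --model unsloth/gemma-3-27b-it-GGUF/gemma-3-27b-it-Q4_K_M.gguf \
    --threads 32 --ctx-size 16384 --n-gpu-layers 99 \
    --temp 1.0 --top-k 64 --top-p 0.95 --min-p 0.00 --repeat-penalty 1.0 \
    -no-cnv \
    --prompt $'<start_of_turn>user\nCreate a Flappy Bird game in Python.<end_of_turn>\n<start_of_turn>model\n'
```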
The full Flappy Bird input prompt comes from our 1.58-bit Dynamic GGUF blog: https://unsloth.ai/blog/deepseekr1-dynamic
Remember to remove <bos> from your prompt since Gemma 3 automatically adds a <bos>!
🦥 Fine-tuning Gemma 3 in Unsloth
Unsloth is the only framework that works on float16 machines for Gemma 3 inference and training. This means Colab notebooks with free Tesla T4 GPUs also work!
Try our new Gemma 3 (270M) notebook, which makes the 270M parameter model very good at playing chess and predicting the next chess move.
Or fine-tune Gemma 3n (E4B) with Text • Vision • Audio
When doing a full fine-tune (FFT) of Gemma 3, all layers default to float32 on float16 devices. Unsloth expects float16 and upcasts dynamically. To fix this, run model.to(torch.float16) after loading, or use a GPU with bfloat16 support.
Unsloth Fine-tuning Fixes
Our solution in Unsloth is threefold:
Keep all intermediate activations in bfloat16 format (via Unsloth's async gradient checkpointing) - they can be float32, but this uses 2x more VRAM or RAM.
Do all matrix multiplies in float16 with tensor cores, but manually upcast / downcast without the help of PyTorch's mixed-precision autocast.
Upcast all other operations that don't need matrix multiplies (e.g. layernorms) to float32.
🤔 Gemma 3 Fixes Analysis

First, before we fine-tune or run Gemma 3, we found that when using float16 mixed precision, gradients and activations unfortunately become infinity. This happens on T4 GPUs, the RTX 20x series, and V100 GPUs, which only have float16 tensor cores.
Newer GPUs like the RTX 30x series or higher, A100s, H100s, etc. have bfloat16 tensor cores, so this problem does not happen! But why?
Float16 can only represent numbers up to 65504, whilst bfloat16 can represent huge numbers up to around 10^38 - yet both formats use only 16 bits! The difference is how the bits are split: float16 spends 5 bits on the exponent and 10 on the fraction, so it represents small decimals more precisely, whilst bfloat16 spends 8 bits on the exponent and only 7 on the fraction, so it covers a huge range but cannot represent fractions as well.
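As a quick check on those limits (using the bit layouts above), the largest finite values work out to:

$$\max(\text{float16}) = (2 - 2^{-10}) \times 2^{15} = 65504, \qquad \max(\text{bfloat16}) = (2 - 2^{-7}) \times 2^{127} \approx 3.4 \times 10^{38}$$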
But why float16? Why not just use float32? Unfortunately, float32 matrix multiplications are very slow on GPUs - sometimes 4 to 10x slower - so we cannot do this.