Save models to 16-bit GGUF so you can use them with Ollama, Jan AI, Open WebUI and more!
To save to GGUF, use the following to save locally:
model.save_pretrained_gguf("directory", tokenizer, quantization_method = "q4_k_m")
model.save_pretrained_gguf("directory", tokenizer, quantization_method = "q8_0")
model.save_pretrained_gguf("directory", tokenizer, quantization_method = "f16")
To push to Hugging Face hub:
model.push_to_hub_gguf("hf_username/directory", tokenizer, quantization_method = "q4_k_m")
model.push_to_hub_gguf("hf_username/directory", tokenizer, quantization_method = "q8_0")
All supported quantization options for quantization_method are listed below:
# https://github.com/ggerganov/llama.cpp/blob/master/examples/quantize/quantize.cpp#L19
# From https://mlabonne.github.io/blog/posts/Quantize_Llama_2_models_using_ggml.html
ALLOWED_QUANTS = \
{
"not_quantized" : "Recommended. Fast conversion. Slow inference, big files.",
"fast_quantized" : "Recommended. Fast conversion. OK inference, OK file size.",
"quantized"      : "Recommended. Slow conversion. Fast inference, small files.",
# ... plus individual formats such as "f16", "q8_0", "q4_k_m" and "q5_k_m"; see the links above for the full list.
}
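Before kicking off a long save, it can be worth validating the requested method against an allow-list like the one above. A minimal sketch (the abridged dict and helper name here are hypothetical, not Unsloth's own API):

```python
# Minimal sketch: validate a quantization_method against an allow-list
# before starting a long save. This dict is an abridged, hypothetical
# stand-in for the full mapping above.
ALLOWED_QUANTS = {
    "q4_k_m": "Recommended. Good balance of size and quality.",
    "q8_0":   "Fast conversion. Larger files, near-lossless.",
    "f16":    "Full 16-bit precision. Biggest files.",
}

def check_quant_method(method: str) -> str:
    """Return the description for a valid method, or raise with the options."""
    key = method.lower()
    if key not in ALLOWED_QUANTS:
        options = ", ".join(sorted(ALLOWED_QUANTS))
        raise ValueError(
            f"Unknown quantization_method {method!r}; choose one of: {options}"
        )
    return ALLOWED_QUANTS[key]
```

Failing fast here is much cheaper than discovering a typo after a multi-minute merge and conversion.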
First save your model to 16bit:
Then use the terminal and run:
Or follow the steps at https://rentry.org/llama-cpp-conversions#merging-loras-into-a-model, using "merged_model" as the model name, to merge the LoRA and convert to GGUF.
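Numerically, the "merge" step above just folds the low-rank LoRA update into the base weights: W' = W + (alpha / r) * B @ A. A minimal pure-Python sketch with toy matrices (function names hypothetical; real merges apply this per weight matrix in fp16/bf16):

```python
# Toy illustration of merging a LoRA adapter into a base weight matrix:
# W' = W + (alpha / r) * (B @ A), where A is (r x d_in) and B is (d_out x r).
def matmul(B, A):
    rows, inner, cols = len(B), len(A), len(A[0])
    return [[sum(B[i][k] * A[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(rows)]

def merge_lora(W, A, B, alpha, r):
    """Fold the low-rank update into the base weight matrix."""
    delta = matmul(B, A)          # full-rank update reconstructed from A and B
    s = alpha / r                 # standard LoRA scaling factor
    return [[W[i][j] + s * delta[i][j] for j in range(len(W[0]))]
            for i in range(len(W))]

W = [[1.0, 0.0], [0.0, 1.0]]      # base weight (2x2)
B = [[1.0], [1.0]]                # (2x1), rank r = 1
A = [[1.0, 2.0]]                  # (1x2)
merged = merge_lora(W, A, B, alpha=2, r=1)  # -> [[3.0, 4.0], [2.0, 5.0]]
```

After this fold, the adapter is no longer needed at inference time, which is why the merged model can be converted to GGUF on its own.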
You might sometimes find that your model runs and produces good results in Unsloth, but gives poor results on another platform such as Ollama or vLLM: gibberish, endless/infinite generations, or repeated outputs.
The most common cause is an incorrect chat template. It's essential to use the SAME chat template that the model was trained with in Unsloth when you later run it in another framework, such as llama.cpp or Ollama. When inferencing from a saved model, applying the correct template is crucial.
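To see why a template mismatch degrades output, consider the same messages rendered with two different (hypothetical, simplified) templates. The model was trained on one exact prompt format; serving it with another puts every prompt out of distribution:

```python
# Two simplified, hypothetical chat templates rendering the same messages.
# A model fine-tuned on one format will see unfamiliar input if served
# with the other, which commonly manifests as gibberish or loops.
def render_chatml(messages):
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )

def render_alpaca(messages):
    return "".join(
        f"### {m['role'].title()}:\n{m['content']}\n\n" for m in messages
    )

msgs = [{"role": "user", "content": "Hi"}]
assert render_chatml(msgs) != render_alpaca(msgs)  # the served prompt differs
```

The fix is always to make the serving side (Ollama Modelfile, llama.cpp template, etc.) render byte-for-byte the same prompt the model saw during training.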
You must also use the correct EOS token; if not, you might get gibberish or endless output on longer generations.
It might also be that your inference engine adds an unnecessary "start of sequence" (BOS) token, or conversely omits a required one, so check both possibilities!
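A quick way to spot the doubled-BOS case is to look at the first two token ids the engine actually feeds the model. A minimal sketch (the token id and helper name are hypothetical):

```python
# Minimal sketch: detect a doubled "start of sequence" token, which happens
# when your code prepends BOS and the inference engine then adds one again.
# The BOS id here (1) is hypothetical; check your tokenizer's actual value.
def has_double_bos(token_ids, bos_id=1):
    return len(token_ids) >= 2 and token_ids[0] == bos_id and token_ids[1] == bos_id

assert has_double_bos([1, 1, 42])      # engine added BOS on top of yours
assert not has_double_bos([1, 42, 7])  # single BOS: fine
```

Most engines can print the tokenized prompt (e.g. verbose/debug flags), which is the easiest place to apply this check.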
You can reduce peak GPU memory use during saving by changing maximum_memory_usage.
The default is model.save_pretrained(..., maximum_memory_usage = 0.75). Reduce it to, say, 0.5 to cap usage at 50% of peak GPU memory or lower. This can help avoid OOM crashes during saving.
First save your model to 16bit via:
model.save_pretrained_merged("merged_model", tokenizer, save_method = "merged_16bit")
Compile llama.cpp from source like below:
apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggerganov/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=ON -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-quantize llama-cli llama-gguf-split llama-mtmd-cli
cp llama.cpp/build/bin/llama-* llama.cpp
Then, save the model to F16:
python llama.cpp/convert_hf_to_gguf.py FOLDER --outfile OUTPUT --outtype f16
Use our conversational notebooks to force the chat template - this will fix most issues.
Qwen-3 14B Conversational notebook Open in Colab
Gemma-3 4B Conversational notebook Open in Colab
Llama-3.2 3B Conversational notebook Open in Colab
Phi-4 14B Conversational notebook
Mistral v0.3 7B Conversational notebook
More notebooks in our
model.save_pretrained_merged("merged_model", tokenizer, save_method = "merged_16bit")
apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggerganov/llama.cpp
cmake llama.cpp -B llama.cpp/build \
-DBUILD_SHARED_LIBS=ON -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-quantize llama-cli llama-gguf-split llama-mtmd-cli
cp llama.cpp/build/bin/llama-* llama.cpp
# For F16:
python llama.cpp/convert_hf_to_gguf.py merged_model \
--outfile model-F16.gguf --outtype f16 \
--split-max-size 50G
# For BF16:
python llama.cpp/convert_hf_to_gguf.py merged_model \
--outfile model-BF16.gguf --outtype bf16 \
--split-max-size 50G
# For Q8_0:
python llama.cpp/convert_hf_to_gguf.py merged_model \
--outfile model-Q8_0.gguf --outtype q8_0 \
--split-max-size 50G
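After conversion, you can sanity-check that the output really is a GGUF file: every GGUF file starts with the 4-byte magic b"GGUF". A minimal sketch (the helper name is hypothetical):

```python
# Minimal sketch: verify a converted file begins with the GGUF magic bytes,
# catching truncated or misnamed outputs before you ship them to Ollama etc.
def looks_like_gguf(path):
    with open(path, "rb") as f:
        return f.read(4) == b"GGUF"
```

For example, looks_like_gguf("model-F16.gguf") should return True for any file the converter produced successfully.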