⚠️Troubleshooting & FAQs
Tips to solve issues, and frequently asked questions.
If you encounter any issues, always try updating Unsloth first:
pip install --upgrade --force-reinstall --no-cache-dir --no-deps unsloth unsloth_zoo
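To confirm the upgrade took effect, you can print the installed versions (a minimal check using only the standard library; the package names are the ones from the pip command above):
from importlib.metadata import version

# Print the installed versions of the two packages upgraded above
print("unsloth:", version("unsloth"))
print("unsloth_zoo:", version("unsloth_zoo"))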
❓Running in Unsloth works well, but after exporting & running on other platforms, the results are poor
You might sometimes encounter an issue where your model runs and produces good results on Unsloth, but when you use it on another platform like Ollama or vLLM, the results are poor or you might get gibberish, endless/infinite generations or repeated outputs.
The most common cause of this issue is using an incorrect chat template. It's essential to use the SAME chat template that was used when training the model in Unsloth when you later run it in another framework, such as llama.cpp or Ollama. When running inference on a saved model, it's crucial to apply the correct template.
It might also be because your inference engine adds an extra "start of sequence" (BOS) token, or, conversely, omits a required one, so make sure you check both possibilities!
Use our conversational notebooks to force the chat template - this will fix most issues.
Qwen-3 14B Conversational notebook Open in Colab
Gemma-3 4B Conversational notebook Open in Colab
Llama-3.2 3B Conversational notebook Open in Colab
Phi-4 14B Conversational notebook Open in Colab
Mistral v0.3 7B Conversational notebook Open in Colab
More notebooks in our notebooks repo.
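You can also render the chat template yourself and compare it with the prompt your other framework builds. A minimal sketch (replace "merged_model" with the path or Hugging Face repo of your saved model; the message is a placeholder):
from transformers import AutoTokenizer

# Load the tokenizer that was saved alongside your fine-tuned model
tokenizer = AutoTokenizer.from_pretrained("merged_model")

messages = [
    {"role": "user", "content": "Hello!"},
]

# Render the prompt exactly as the chat template produces it, including any BOS token,
# so you can compare it character by character with what your inference engine sends
prompt = tokenizer.apply_chat_template(messages, tokenize = False, add_generation_prompt = True)
print(repr(prompt))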
❓Saving to GGUF / vLLM 16bit crashes
You can try reducing the maximum GPU usage during saving by lowering maximum_memory_usage. The default is model.save_pretrained(..., maximum_memory_usage = 0.75). Reduce it to, say, 0.5 to use 50% of peak GPU memory or lower. This can reduce OOM crashes during saving.
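For example, when saving a merged 16-bit model (shown here on save_pretrained_merged, assuming it accepts the same maximum_memory_usage argument; adjust to whichever saving call you use):
model.save_pretrained_merged(
    "merged_model",
    tokenizer,
    save_method = "merged_16bit",
    maximum_memory_usage = 0.5,  # assumed to be accepted here as in save_pretrained; use ~50% of peak GPU memory
)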
❓How do I manually save to GGUF?
First save your model to 16bit via:
model.save_pretrained_merged("merged_model", tokenizer, save_method = "merged_16bit",)
Next, compile llama.cpp from source as shown below:
apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggerganov/llama.cpp
cmake llama.cpp -B llama.cpp/build \
-DBUILD_SHARED_LIBS=ON -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-quantize llama-cli llama-gguf-split llama-mtmd-cli
cp llama.cpp/build/bin/llama-* llama.cpp
Then convert the merged model to GGUF. For F16:
python llama.cpp/convert_hf_to_gguf.py merged_model \
--outfile model-F16.gguf --outtype f16 \
--split-max-size 50G
# For BF16:
python llama.cpp/convert_hf_to_gguf.py merged_model \
--outfile model-BF16.gguf --outtype bf16 \
--split-max-size 50G
# For Q8_0:
python llama.cpp/convert_hf_to_gguf.py merged_model \
--outfile model-Q8_0.gguf --outtype q8_0 \
--split-max-size 50G
❓Why is Q8_K_XL slower than Q8_0 GGUF?
On Mac devices, BF16 seems to be slower than F16. Q8_K_XL upcasts some layers to BF16, hence the slowdown. We are actively changing our conversion process to make F16 the default choice for Q8_K_XL to reduce this performance hit.
❓Evaluation Loop - Out of Memory or crashing.
A common cause of OOMs is setting the evaluation batch size too high. Set it to 2 or lower to use less VRAM. Also set fp16_full_eval=True to use float16 for evaluation, which cuts memory use in half.
First split your training dataset into a train and test split. Set the trainer settings for evaluation to:
new_dataset = dataset.train_test_split(test_size = 0.01)
from trl import SFTTrainer, SFTConfig
trainer = SFTTrainer(
args = SFTConfig(
fp16_full_eval = True,
per_device_eval_batch_size = 2,
eval_accumulation_steps = 4,
eval_strategy = "steps",
eval_steps = 1,
),
train_dataset = new_dataset["train"],
eval_dataset = new_dataset["test"],
...
)
This will avoid OOMs and also make evaluation somewhat faster. You can also use bf16_full_eval=True on bf16-capable machines. As of June 2025, Unsloth should set these flags for you by default.
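If you are unsure which precision your GPU supports, here is a minimal sketch that picks the flag automatically (is_bfloat16_supported is exported by Unsloth; the other values mirror the config above):
from unsloth import is_bfloat16_supported
from trl import SFTConfig

bf16_ok = is_bfloat16_supported()
args = SFTConfig(
    bf16_full_eval = bf16_ok,        # bfloat16 evaluation on GPUs that support it (Ampere or newer)
    fp16_full_eval = not bf16_ok,    # otherwise fall back to float16
    per_device_eval_batch_size = 2,
    eval_accumulation_steps = 4,
    eval_strategy = "steps",
    eval_steps = 1,
)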
❓How do I do Early Stopping?
If you want to stop the finetuning / training run when the evaluation loss stops decreasing, you can use early stopping via EarlyStoppingCallback.
As usual, set up your trainer and your evaluation dataset. The setup below stops the training run if the eval_loss (the evaluation loss) does not decrease for 3 evaluation steps or so.
from trl import SFTConfig, SFTTrainer
trainer = SFTTrainer(
args = SFTConfig(
fp16_full_eval = True,
per_device_eval_batch_size = 2,
eval_accumulation_steps = 4,
output_dir = "training_checkpoints", # location of saved checkpoints for early stopping
save_strategy = "steps", # save model every N steps
save_steps = 10, # how many steps until we save the model
save_total_limit = 3, # keep only 3 saved checkpoints to save disk space
eval_strategy = "steps", # evaluate every N steps
eval_steps = 10, # how many steps until we do evaluation
load_best_model_at_end = True, # MUST USE for early stopping
metric_for_best_model = "eval_loss", # metric we want to early stop on
greater_is_better = False, # the lower the eval loss, the better
),
model = model,
tokenizer = tokenizer,
train_dataset = new_dataset["train"],
eval_dataset = new_dataset["test"],
)
We then add the callback which can also be customized:
from transformers import EarlyStoppingCallback
early_stopping_callback = EarlyStoppingCallback(
early_stopping_patience = 3, # How many evaluations to wait while the eval loss does not improve
# For example the loss might rise, but fall again within 3 evaluations
early_stopping_threshold = 0.0, # Can set higher - the minimum amount the eval loss must improve by
# to count as an improvement. For example 0.01 means a decrease
# smaller than 0.01 still counts towards early stopping.
)
trainer.add_callback(early_stopping_callback)
Then train the model as usual via trainer.train().
❓Downloading gets stuck at 90 to 95%
If your model download gets stuck at 90-95% for a long time, you can disable some fast download processes to force downloads to be synchronous and to print more error messages.
Simply set UNSLOTH_STABLE_DOWNLOADS=1 before any Unsloth import:
import os
os.environ["UNSLOTH_STABLE_DOWNLOADS"] = "1"
from unsloth import FastLanguageModel
❓RuntimeError: CUDA error: device-side assert triggered
Restart and run all cells, but place this at the start, before any Unsloth import:
import os
os.environ["UNSLOTH_COMPILE_DISABLE"] = "1"
os.environ["UNSLOTH_DISABLE_FAST_GENERATION"] = "1"
❓All labels in your dataset are -100. Training losses will be all 0.
This means your usage of train_on_responses_only is incorrect for that particular model. train_on_responses_only lets you mask the user question and train the model only on the assistant response. This is known to increase accuracy by 1% or more. See our LoRA Hyperparameters Guide for more details.
For Llama 3.1, 3.2, 3.3 type models, please use the below:
from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(
trainer,
instruction_part = "<|start_header_id|>user<|end_header_id|>\n\n",
response_part = "<|start_header_id|>assistant<|end_header_id|>\n\n",
)
For Gemma 2, 3 and 3n models, use the below:
from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(
trainer,
instruction_part = "<start_of_turn>user\n",
response_part = "<start_of_turn>model\n",
)
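To verify that the masking worked, you can decode one training example's labels after applying train_on_responses_only. A minimal sketch (assumes the trainer from above; masked -100 positions are replaced by spaces, so only the assistant response should remain visible):
space = tokenizer(" ", add_special_tokens = False).input_ids[0]
# Replace masked (-100) label positions with a space token, then decode
# If the output is completely blank, every label is -100 - which is exactly this error
print(tokenizer.decode([space if x == -100 else x for x in trainer.train_dataset[0]["labels"]]))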
❓NotImplementedError: A UTF-8 locale is required. Got ANSI
See https://github.com/googlecolab/colabtools/issues/3409
In a new cell, run the below:
import locale
locale.getpreferredencoding = lambda: "UTF-8"
📗Citing Unsloth
If you are citing the usage of our model uploads, use the below BibTeX. This is for Qwen3-30B-A3B-GGUF Q8_K_XL:
@misc{unsloth_2025_qwen3_30b_a3b,
author = {Unsloth AI and Han-Chen, Daniel and Han-Chen, Michael},
title = {Qwen3-30B-A3B-GGUF:Q8\_K\_XL},
year = {2025},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/unsloth/Qwen3-30B-A3B-GGUF}}
}
To cite the usage of our GitHub package or our work in general:
@misc{unsloth,
author = {Unsloth AI and Han-Chen, Daniel and Han-Chen, Michael},
title = {Unsloth},
year = {2025},
publisher = {Github},
howpublished = {\url{https://github.com/unslothai/unsloth}}
}