Saving to Ollama
See our guide below for the complete process on how to save to Ollama:
🦙 Tutorial: How to Finetune Llama-3 and Use In Ollama

You can save the finetuned model as a small 100MB file called a LoRA adapter like below. You can also push it to the Hugging Face Hub if you want to upload your model! Remember to get a Hugging Face token via https://huggingface.co/settings/tokens and add your token!
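A minimal sketch of both options, assuming the `model` and `tokenizer` objects from the finetuning notebook; the folder name `lora_model` and the Hub repo name are placeholders:

```python
# Save only the LoRA adapter weights locally (a small ~100MB folder)
model.save_pretrained("lora_model")        # "lora_model" is a placeholder folder name
tokenizer.save_pretrained("lora_model")

# Or push the adapter to the Hugging Face Hub instead
# (replace "your_name/lora_model" and the token with your own values)
model.push_to_hub("your_name/lora_model", token = "hf_...")
tokenizer.push_to_hub("your_name/lora_model", token = "hf_...")
```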
After saving the model, we can again use Unsloth to run the model itself! Use `FastLanguageModel` again to call it for inference!
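A sketch of what that looks like, assuming the saved adapter folder from above (`lora_model` is a placeholder) and a GPU runtime:

```python
from unsloth import FastLanguageModel

# Load the saved LoRA adapter back in (placeholder folder name "lora_model")
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "lora_model",
    max_seq_length = 2048,
    load_in_4bit = True,
)
FastLanguageModel.for_inference(model)  # enable Unsloth's faster inference mode

inputs = tokenizer(
    ["Continue the Fibonacci sequence: 1, 1, 2, 3, 5, 8,"],
    return_tensors = "pt",
).to("cuda")
outputs = model.generate(**inputs, max_new_tokens = 64)
print(tokenizer.batch_decode(outputs))
```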
Finally we can export our finetuned model to Ollama itself! First we have to install Ollama in the Colab notebook:
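In a Colab cell this is just Ollama's official install script, run as a notebook shell command (drop the leading `!` in a normal terminal):

```python
# Install Ollama inside the Colab environment using the official install script
!curl -fsSL https://ollama.com/install.sh | sh
```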
Then we export the finetuned model to llama.cpp's GGUF format like below:
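A sketch of the export cell, assuming Unsloth's `save_pretrained_gguf` helper and a placeholder output folder called `model`; the notebook presents this as a list of `if False:` rows where you flip exactly one to `True`:

```python
# Export to GGUF - flip exactly ONE of these rows to True.
if True:  model.save_pretrained_gguf("model", tokenizer, quantization_method = "q8_0")    # fast 8-bit export
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")  # popular 4-bit option
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")     # full 16-bit precision
```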
Reminder to convert `False` to `True` for one row only, and not to change every row to `True`, or else you'll be waiting for a very long time! We normally suggest setting the first row to `True`, so we can export the finetuned model quickly to the `Q8_0` format (8 bit quantization). We also allow you to export to a whole list of quantization methods, with a popular one being `q4_k_m`.
Head over to https://github.com/ggerganov/llama.cpp to learn more about GGUF. We also have manual instructions on how to export to GGUF if you want them here: https://github.com/unslothai/unsloth/wiki#manually-saving-to-gguf
You will see a long list of text like below - please wait 5 to 10 minutes!!
And finally at the very end, it'll look like below:
Then, we have to run Ollama itself in the background. We use `subprocess` because Colab doesn't like asynchronous calls, but normally one just runs `ollama serve` in the terminal / command prompt.
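A minimal sketch of launching the server from Python (the exact cell in the notebook may differ slightly):

```python
import subprocess
import time

# Start the Ollama server as a background process so the notebook cell returns immediately
ollama_process = subprocess.Popen(["ollama", "serve"])
time.sleep(3)  # give the server a moment to start before we talk to it
```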
Modelfile creation

The trick Unsloth provides is that we automatically create a `Modelfile`, which Ollama requires! This is just a list of settings and includes the chat template which we used for the finetune process! You can also print the generated `Modelfile` like below:
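In the notebook this is a one-liner; the attribute name below is how current Unsloth Ollama notebooks expose the generated file, so treat it as an assumption if you are on a different version:

```python
# Print the auto-generated Modelfile (attribute name assumed from current Unsloth notebooks)
print(tokenizer._ollama_modelfile)
```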
We then ask Ollama to create an Ollama-compatible model by using the `Modelfile`:
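For example (the model name `unsloth_model` and the Modelfile path are placeholders; adjust them to match your export folder):

```python
# Register the exported GGUF model with Ollama using the generated Modelfile
!ollama create unsloth_model -f ./model/Modelfile
```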
And we can now call the model for inference by querying the Ollama server itself, which is running in the background on your own local machine / in the free Colab notebook. Remember you can edit the yellow underlined part.
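A sketch of one such call via Ollama's REST API on its default port 11434; the model name matches the placeholder from the `ollama create` step above, and the prompt is just an example:

```python
import requests

# Query the locally running Ollama server (default port 11434) with a chat request
response = requests.post(
    "http://localhost:11434/api/chat",
    json = {
        "model": "unsloth_model",  # placeholder name from the `ollama create` step
        "messages": [
            {"role": "user", "content": "Continue the Fibonacci sequence: 1, 1, 2, 3, 5, 8,"},
        ],
        "stream": False,
    },
)
print(response.json()["message"]["content"])
```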