Vision Fine-tuning
Learn how to fine-tune vision/multimodal LLMs with Unsloth
Fine-tuning vision models has numerous use cases across industries, enabling models to adapt to specific tasks and datasets. We provide three example notebooks for vision finetuning:
Note: Gemma 3 also works; just change the model in the Qwen or Pixtral notebook to a Gemma 3 model.
Llama 3.2 Vision finetuning for radiography: helping medical professionals analyze X-rays, CT scans, and ultrasounds faster.
Qwen2.5 VL finetuning for converting handwriting to LaTeX: this allows complex math formulas to be transcribed as LaTeX without writing them out manually.
Pixtral 12B 2409 vision finetuning for general Q&A: you can concatenate general Q&A datasets with more niche datasets so the finetuned model does not forget base-model skills.
To finetune vision models, we now allow you to select which parts of the model to finetune. You can choose to finetune only the vision layers, only the language layers, or only the attention / MLP layers. All of them are enabled by default!
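The layer selection above can be sketched as keyword arguments in the style of Unsloth's vision notebooks. The flag names below follow those notebooks, but treat the exact defaults and model name as assumptions; the actual GPU calls are shown as comments since they need `unsloth` installed and a CUDA device.

```python
# LoRA / layer-selection settings, in the style of Unsloth's vision notebooks.
# Flag names follow the notebooks; defaults here are illustrative.
peft_kwargs = dict(
    finetune_vision_layers     = True,   # tune the vision encoder
    finetune_language_layers   = True,   # tune the language-model layers
    finetune_attention_modules = True,   # tune attention projections
    finetune_mlp_modules       = True,   # tune MLP projections
    r            = 16,     # LoRA rank
    lora_alpha   = 16,     # LoRA scaling factor
    lora_dropout = 0,
    bias         = "none",
    random_state = 3407,
)

# On a GPU machine this would be applied roughly as:
# from unsloth import FastVisionModel
# model, tokenizer = FastVisionModel.from_pretrained(
#     "unsloth/Llama-3.2-11B-Vision-Instruct", load_in_4bit=True)
# model = FastVisionModel.get_peft_model(model, **peft_kwargs)
```

Turning off, say, `finetune_vision_layers` keeps the vision encoder frozen and adapts only the language side, which can be enough when the images are close to the pretraining distribution.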
Let's take a look at the dataset, and check what the 1st example shows:
Panoramic radiography shows an osteolytic lesion in the right posterior maxilla with resorption of the floor of the maxillary sinus (arrows).
To format the dataset, all vision finetuning tasks should be formatted as follows:
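As a sketch of that format: each sample is a list of chat messages, and each message's `content` is a list of parts that can mix text and images. The instruction and caption text below are illustrative placeholders.

```python
# One training sample in the conversation format used for vision finetuning:
# a user turn containing text plus an image, and an assistant turn with the
# target caption. In a real dataset the "image" value is a PIL image.
sample = {
    "messages": [
        {"role": "user",
         "content": [
             {"type": "text",
              "text": "You are an expert radiographer. Describe accurately what you see in this image."},
             {"type": "image", "image": "<PIL.Image goes here>"},
         ]},
        {"role": "assistant",
         "content": [
             {"type": "text",
              "text": "Panoramic radiography shows an osteolytic lesion ..."},
         ]},
    ]
}
```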
We will craft a custom instruction asking the VLM to act as an expert radiographer. Note that instead of just one instruction, you can add multiple turns to make it a dynamic conversation.
Let's convert the dataset into the "correct" format for finetuning:
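A minimal conversion helper can be sketched as below. It assumes each dataset row has `"image"` and `"caption"` fields, which matches the radiography dataset used here; adjust the field names if your dataset differs.

```python
def convert_to_conversation(sample, instruction):
    """Pair one image with its expert caption as a two-turn conversation."""
    conversation = [
        {"role": "user",
         "content": [
             {"type": "text",  "text": instruction},
             {"type": "image", "image": sample["image"]},
         ]},
        {"role": "assistant",
         "content": [
             {"type": "text", "text": sample["caption"]},
         ]},
    ]
    return {"messages": conversation}

instruction = ("You are an expert radiographer. "
               "Describe accurately what you see in this image.")

# Illustrative row; in practice this comes from the loaded dataset,
# e.g. converted_dataset = [convert_to_conversation(s, instruction) for s in dataset]
row = {"image": "<PIL.Image goes here>",
       "caption": "Panoramic radiography shows an osteolytic lesion (arrows)."}
converted = convert_to_conversation(row, instruction)
```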
The first example is now structured like below:
Before we do any finetuning, maybe the vision model already knows how to analyse the images? Let's check if this is the case!
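A sketch of that sanity check: build a single-turn prompt and generate from the base model. The generation calls follow the pattern in Unsloth's vision notebooks and are shown as comments, since they need a loaded model and a GPU.

```python
# Prompt for probing the base model before any finetuning: one user turn
# with an image placeholder plus the text instruction.
messages = [
    {"role": "user",
     "content": [
         {"type": "image"},
         {"type": "text", "text": "Describe accurately what you see in this image."},
     ]},
]

# On a GPU machine, generation would look roughly like:
# FastVisionModel.for_inference(model)  # switch the model into inference mode
# input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
# inputs = tokenizer(image, input_text,
#                    add_special_tokens=False, return_tensors="pt").to("cuda")
# outputs = model.generate(**inputs, max_new_tokens=128, use_cache=True)
# print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```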
And the result:
The dataset for fine-tuning a vision or multimodal model is similar to a standard question-and-answer dataset, but it also includes image inputs. For example, the Llama 3.2 Vision notebook uses a radiography case to show how AI can help medical professionals analyze X-rays, CT scans, and ultrasounds more efficiently.
We'll be using a sampled version of the ROCO radiography dataset, which is available on Hugging Face. The dataset includes X-rays, CT scans, and ultrasounds showcasing medical conditions and diseases. Each image has a caption written by experts describing it. The goal is to finetune a VLM to make it a useful analysis tool for medical professionals.
For more details, view the dataset section in our documentation.