👁️Vision Fine-tuning
Learn how to fine-tune vision/multimodal LLMs with Unsloth
Fine-tuning vision models has numerous use cases across various industries, enabling models to adapt to specific tasks and datasets. We provide 3 example notebooks for vision finetuning:
Note: Gemma 3 also works; just change the model name in the Qwen or Pixtral notebook to a Gemma 3 model.
Llama 3.2 Vision finetuning for radiography: Notebook. Assists medical professionals in analyzing X-rays, CT scans & ultrasounds faster.
Qwen2.5 VL finetuning for converting handwriting to LaTeX: Notebook. This allows complex math formulas to be transcribed as LaTeX without writing them out manually.
Pixtral 12B 2409 vision finetuning for general Q&A: Notebook. You can concatenate general Q&A datasets with more niche datasets so the finetuned model does not forget the base model's skills.
To finetune vision models, we now allow you to select which parts of the model to finetune. You can choose to finetune only the vision layers, only the language layers, or just the attention / MLP modules. By default, all of them are enabled!
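As a rough sketch of how this looks in code (based on the Unsloth vision notebooks; the model name is just an example and exact defaults may differ between versions):
from unsloth import FastVisionModel

# Load a 4-bit quantised vision model (any supported VLM works here)
model, tokenizer = FastVisionModel.from_pretrained(
    "unsloth/Llama-3.2-11B-Vision-Instruct",
    load_in_4bit = True,
    use_gradient_checkpointing = "unsloth",
)

# Attach LoRA adapters and choose which parts of the model to finetune
model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers     = True,  # vision encoder layers
    finetune_language_layers   = True,  # language model layers
    finetune_attention_modules = True,  # attention projections
    finetune_mlp_modules       = True,  # MLP projections
    r = 16,
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    random_state = 3407,
)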

Vision Fine-tuning Dataset
The dataset for fine-tuning a vision or multimodal model is similar to a standard question & answer pair dataset, but this time it also includes image inputs. For example, the Llama 3.2 Vision Notebook uses a radiography case to show how AI can help medical professionals analyze X-rays, CT scans, and ultrasounds more efficiently.
We'll be using a sampled version of the ROCO radiography dataset. You can access the dataset here. The dataset includes X-rays, CT scans and ultrasounds showcasing medical conditions and diseases. Each image has a caption written by experts describing it. The goal is to finetune a VLM to make it a useful analysis tool for medical professionals.
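As a sketch, loading it with the datasets library (assuming the sampled dataset is hosted under the name used in the notebook, unsloth/Radiology_mini):
from datasets import load_dataset

# Dataset name assumed from the notebook; swap in your own dataset if needed
dataset = load_dataset("unsloth/Radiology_mini", split = "train")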
Let's take a look at the dataset and check what the first example shows:
Dataset({
    features: ['image', 'image_id', 'caption', 'cui'],
    num_rows: 1978
})
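Indexing into the first row displays the image (in a notebook) and prints its expert-written caption, which is the output shown below:
dataset[0]["image"]   # in a notebook, this renders the radiograph
print(dataset[0]["caption"])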

Panoramic radiography shows an osteolytic lesion in the right posterior maxilla with resorption of the floor of the maxillary sinus (arrows).
For all vision finetuning tasks, the dataset should be formatted as follows:
[
    { "role": "user",
      "content": [ {"type": "text", "text": instruction}, {"type": "image", "image": image} ]
    },
    { "role": "assistant",
      "content": [ {"type": "text", "text": answer} ]
    },
]
We will craft a custom instruction asking the VLM to act as an expert radiographer. Note that instead of just one instruction, you can also add multiple turns to make it a dynamic conversation (a sketch of this follows the conversion function below).
instruction = "You are an expert radiographer. Describe accurately what you see in this image."
def convert_to_conversation(sample):
    conversation = [
        { "role" : "user",
          "content" : [
            {"type" : "text",  "text"  : instruction},
            {"type" : "image", "image" : sample["image"]} ]
        },
        { "role" : "assistant",
          "content" : [
            {"type" : "text", "text" : sample["caption"]} ]
        },
    ]
    return { "messages" : conversation }
pass
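As mentioned above, a conversation is not limited to a single turn; extra user/assistant entries can simply be appended to the list. A sketch (the follow-up question and answer here are illustrative only, paraphrased from the first caption):
sample = dataset[0]
multi_turn = [
    {"role" : "user",
     "content" : [{"type" : "text",  "text"  : instruction},
                  {"type" : "image", "image" : sample["image"]}]},
    {"role" : "assistant",
     "content" : [{"type" : "text", "text" : sample["caption"]}]},
    # Hypothetical follow-up turn about the same image
    {"role" : "user",
     "content" : [{"type" : "text", "text" : "Which anatomical region is affected?"}]},
    {"role" : "assistant",
     "content" : [{"type" : "text", "text" : "The right posterior maxilla."}]},
]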
Let's convert the dataset into the "correct" format for finetuning:
converted_dataset = [convert_to_conversation(sample) for sample in dataset]
The first example is now structured like below:
converted_dataset[0]
{'messages': [{'role': 'user',
   'content': [{'type': 'text',
     'text': 'You are an expert radiographer. Describe accurately what you see in this image.'},
    {'type': 'image',
     'image': <PIL.PngImagePlugin.PngImageFile image mode=L size=657x442>}]},
  {'role': 'assistant',
   'content': [{'type': 'text',
     'text': 'Panoramic radiography shows an osteolytic lesion in the right posterior maxilla with resorption of the floor of the maxillary sinus (arrows).'}]}]}
Before we do any finetuning, the vision model might already know how to analyse these images, so let's check if this is the case!
FastVisionModel.for_inference(model) # Enable for inference!
image = dataset[0]["image"]
instruction = "You are an expert radiographer. Describe accurately what you see in this image."
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": instruction}
    ]}
]
input_text = tokenizer.apply_chat_template(messages, add_generation_prompt = True)
inputs = tokenizer(
    image,
    input_text,
    add_special_tokens = False,
    return_tensors = "pt",
).to("cuda")
from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128,
                   use_cache = True, temperature = 1.5, min_p = 0.1)
And the result:
This radiograph appears to be a panoramic view of the upper and lower dentition, specifically an Orthopantomogram (OPG).
* The panoramic radiograph demonstrates normal dental structures.
* There is an abnormal area on the upper right, represented by an area of radiolucent bone, corresponding to the antrum.
**Key Observations**
* The bone between the left upper teeth is relatively radiopaque.
* There are two large arrows above the image, suggesting the need for a closer examination of this area. One of the arrows is in a left-sided position, and the other is in the right-sided position. However, only
For more details, view our dataset section in the notebook here.