👁️🗨️Vision Reinforcement Learning (VLM RL)
Train Vision/multimodal models via GRPO and RL with Unsloth!
Unsloth now supports vision/multimodal RL with Gemma 3 and Qwen2.5-VL. Thanks to Unsloth's unique weight sharing and custom kernels, VLM RL is 1.5–2× faster, uses 90% less VRAM, and supports 15× longer context lengths than FA2 setups, with no accuracy loss. This update also introduces Qwen's GSPO algorithm.
Unsloth can train Qwen2.5-VL-7B with GSPO/GRPO on a free Colab T4 GPU. Other VLMs work too, but may need larger GPUs. Gemma 3 requires a newer GPU than the T4 because vLLM restricts it to bfloat16 (which the T4 does not support), so we recommend the NVIDIA L4 on Colab. Our notebooks solve numerical math problems involving images and diagrams:
Gemma-3-4B (Unsloth inference): Colab
We have also added native vLLM VLM integration into Unsloth, so all you have to do to use vLLM inference is enable the fast_inference = True
flag when initializing the model. Special thanks to Sinoué GAD for providing the first notebook that made integrating VLM RL easier!
This VLM support also builds on our latest update for even more memory-efficient and faster RL, including our Standby feature, which uniquely limits speed degradation compared to other implementations.
import os
os.environ['UNSLOTH_VLLM_STANDBY'] = '1' # To enable memory efficient GRPO with vLLM

from unsloth import FastVisionModel
model, tokenizer = FastVisionModel.from_pretrained(
    model_name = "Qwen/Qwen2.5-VL-7B-Instruct",
    max_seq_length = 16384, # Must be this large to fit images in context
    load_in_4bit = True, # False for LoRA 16bit
    fast_inference = True, # Enable vLLM fast inference
    gpu_memory_utilization = 0.8, # Reduce if out of memory
)
It is also important to note that vLLM does not support LoRA for vision/encoder layers, so set finetune_vision_layers = False
when loading a LoRA adapter.
However, you CAN train the vision layers as well if you run inference via transformers/Unsloth.
# Add a LoRA adapter to the model for parameter-efficient fine-tuning
lora_rank = 16 # Choose any number > 0; suggested 8, 16, 32, 64, 128

model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers = False, # fast_inference doesn't support finetune_vision_layers yet :(
    finetune_language_layers = True, # False if not finetuning language layers
    finetune_attention_modules = True, # False if not finetuning attention layers
    finetune_mlp_modules = True, # False if not finetuning MLP layers
    r = lora_rank,
    lora_alpha = lora_rank*2, # *2 speeds up training
    use_gradient_checkpointing = "unsloth", # Reduces memory usage
    random_state = 3407,
)
GSPO RL
This update also adds GSPO, a variant of GRPO made by the Qwen team at Alibaba. They noticed that GRPO implicitly assigns an importance weight to each token, even though the advantages do not actually scale or change per token. This led to the creation of GSPO, which instead assigns importance based on the sequence likelihood rather than the individual token likelihoods. The difference between the two algorithms can be seen below, both from the GSPO paper by Qwen and Alibaba:
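Paraphrased in LaTeX (reconstructed from the GSPO paper, so the notation may differ slightly from the original figures):

% Equation 1 (GRPO): token-level importance ratios, one shared group-normalized advantage per sequence
\mathcal{J}_{\mathrm{GRPO}}(\theta) =
\mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|y_i|}\sum_{t=1}^{|y_i|}
\min\!\Big(w_{i,t}(\theta)\,\hat{A}_{i},\;
\operatorname{clip}\big(w_{i,t}(\theta),\,1-\varepsilon,\,1+\varepsilon\big)\,\hat{A}_{i}\Big)\right],
\qquad
w_{i,t}(\theta) = \frac{\pi_\theta(y_{i,t}\mid x,\,y_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(y_{i,t}\mid x,\,y_{i,<t})}

% Equation 2 (GSPO): a single sequence-level importance ratio per response
\mathcal{J}_{\mathrm{GSPO}}(\theta) =
\mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G}
\min\!\Big(s_{i}(\theta)\,\hat{A}_{i},\;
\operatorname{clip}\big(s_{i}(\theta),\,1-\varepsilon,\,1+\varepsilon\big)\,\hat{A}_{i}\Big)\right],
\qquad
s_{i}(\theta) = \left(\frac{\pi_\theta(y_i\mid x)}{\pi_{\theta_{\mathrm{old}}}(y_i\mid x)}\right)^{1/|y_i|}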


In Equation 1, it can be seen that the advantages scale each row of the token log-prob ratios before that tensor is summed. Essentially, every token receives the same scaling, even though that scaling was assigned to the entire sequence rather than to each individual token. A simple diagram of this can be seen below:

Equation 2 shows that the per-token log-prob ratios are first summed over each sequence (with length normalization) and exponentiated, and only the resulting sequence-level ratios are row-wise multiplied by the advantages.
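To make the contrast concrete, here is a minimal PyTorch sketch of the two weighting schemes. It is illustrative only, not the TRL/Unsloth implementation; clipping and padding masks are omitted for brevity, and all tensors are random placeholders.

import torch

G, T = 4, 6                          # G responses per group, T tokens per response (illustrative)
logp_new = torch.randn(G, T) * 0.1   # token log-probs under the current policy
logp_old = torch.randn(G, T) * 0.1   # token log-probs under the old policy
advantages = torch.randn(G)          # one group-normalized advantage per sequence

log_ratio = logp_new - logp_old      # per-token log importance ratios, shape (G, T)

# GRPO (Equation 1): every token keeps its own ratio, but each row is scaled
# by the same sequence-level advantage before averaging.
grpo_loss = -(log_ratio.exp() * advantages[:, None]).mean()

# GSPO (Equation 2): average the log ratios over each sequence first, exponentiate,
# then multiply the single sequence-level ratio by the advantage.
seq_ratio = log_ratio.mean(dim=1).exp()   # shape (G,)
gspo_loss = -(seq_ratio * advantages).mean()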

Enabling GSPO is simple: just set importance_sampling_level = "sequence"
in the GRPO config.
from trl import GRPOConfig

training_args = GRPOConfig(
    output_dir = "vlm-grpo-unsloth",
    per_device_train_batch_size = 8,
    gradient_accumulation_steps = 4,
    learning_rate = 5e-6,
    adam_beta1 = 0.9,
    adam_beta2 = 0.99,
    weight_decay = 0.1,
    warmup_ratio = 0.1,
    lr_scheduler_type = "cosine",
    optim = "adamw_8bit",
    importance_sampling_level = "sequence", # "sequence" enables GSPO
    loss_type = "dr_grpo",
    # beta = 0.00,
    epsilon = 3e-4,
    epsilon_high = 4e-4,
    num_generations = 8,
    max_prompt_length = 1024,
    max_completion_length = 1024,
    log_completions = True,
    max_grad_norm = 0.1,
    temperature = 0.9,
    num_train_epochs = 2, # For a quick test run, increase for full training
    report_to = "none", # Set to "wandb" if you want to log to Weights & Biases
)
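For completeness, here is a minimal sketch of wiring this config into TRL's GRPOTrainer. The reward function and dataset below are hypothetical placeholders; in practice you would use your notebook's reward functions and your prepared image-question dataset.

from trl import GRPOTrainer

def correctness_reward(completions, **kwargs):
    # Hypothetical reward: 1.0 if the completion contains a boxed final answer, else 0.0
    return [1.0 if "\\boxed{" in str(completion) else 0.0 for completion in completions]

trainer = GRPOTrainer(
    model = model,
    processing_class = tokenizer,        # the processor/tokenizer returned by from_pretrained
    reward_funcs = [correctness_reward], # one or more reward functions
    args = training_args,
    train_dataset = dataset,             # assumed: prompts with images prepared beforehand
)
trainer.train()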
Overall, with VLM vLLM fast inference, Unsloth now enables both 90% lower memory usage and 1.5–2× faster GRPO and GSPO training!
If you'd like to read more about reinforcement learning, check out our RL guide:
Reinforcement Learning (RL) Guide
Authors: A huge thank you to Keith and Datta for contributing to this article!