🌠Qwen3-VL: How to Run Guide
Learn to fine-tune and run Qwen3-VL locally with Unsloth.
Qwen3-VL is Qwen's new family of vision models, available in Instruct and Thinking versions. The 2B, 4B, 8B and 32B models are dense, while the 30B and 235B models are MoE. The 235B Thinking model delivers SOTA vision and coding performance rivaling GPT-5 (high) and Gemini 2.5 Pro. Qwen3-VL has vision, video and OCR capabilities as well as 256K context (extendable to 1M). Unsloth supports Qwen3-VL fine-tuning and RL. Train Qwen3-VL (8B) for free with our notebooks.
🖥️ Running Qwen3-VL
To run the model in llama.cpp, vLLM, Ollama etc., here are the recommended settings:
⚙️ Recommended Settings
Qwen recommends these settings for both models (they're a bit different for Instruct vs Thinking):
Instruct:
Temperature = 0.7
Top_P = 0.8
presence_penalty = 1.5
Output Length = 32768 (up to 256K)
Top_K = 20
Thinking:
Temperature = 1.0
Top_P = 0.95
presence_penalty = 0.0
Output Length = 40960 (up to 256K)
Top_K = 20
Qwen also used the settings below for Qwen3-VL's benchmark numbers, as mentioned on GitHub.
Instruct Settings:
export greedy='false'
export seed=3407
export top_p=0.8
export top_k=20
export temperature=0.7
export repetition_penalty=1.0
export presence_penalty=1.5
export out_seq_length=32768
Thinking Settings:
export greedy='false'
export seed=1234
export top_p=0.95
export top_k=20
export temperature=1.0
export repetition_penalty=1.0
export presence_penalty=0.0
export out_seq_length=40960
🐛Chat template bug fixes
At Unsloth, we care most about accuracy, so we investigated why llama.cpp would break after the 2nd turn of a conversation when running the Thinking models, as seen below:

The error code:
terminate called after throwing an instance of 'std::runtime_error'
what(): Value is not callable: null at row 63, column 78:
{%- if '</think>' in content %}
{%- set reasoning_content = ((content.split('</think>')|first).rstrip('\n').split('<think>')|last).lstrip('\n') %}
^
We have successfully fixed the Thinking chat template for the VL models, so we re-uploaded all of our Thinking quants. Unsloth's quants should now all work after the 2nd conversation turn - other quants will still fail after the 2nd turn.
Qwen3-VL Unsloth uploads:
Qwen3-VL GGUFs are now supported by llama.cpp as of 30th October 2025, so you can run them locally!
📖 Llama.cpp: Run Qwen3-VL Tutorial
Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.
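For reference, a typical CUDA build looks like the following; this mirrors llama.cpp's standard CMake instructions, and the apt package names assume a Debian/Ubuntu system.

```bash
apt-get update
apt-get install build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first
cp llama.cpp/build/bin/llama-* llama.cpp/
```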
Let's first get an image! You can also use your own images. We shall use https://raw.githubusercontent.com/unslothai/unsloth/refs/heads/main/images/unsloth%20made%20with%20love.png, which is just our mini logo showing how finetunes are made with Unsloth:

Let's download this image
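For example, with wget, saving it as unsloth.png (the filename used in the steps below):

```bash
wget "https://raw.githubusercontent.com/unslothai/unsloth/refs/heads/main/images/unsloth%20made%20with%20love.png" -O unsloth.png
```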

Then let's use llama.cpp's auto model downloading feature. Try this for the 8B Instruct model:
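A minimal sketch of the command, assuming the GGUF repo is named unsloth/Qwen3-VL-8B-Instruct-GGUF (check the exact repo and quant name on our Hugging Face page):

```bash
./llama.cpp/llama-mtmd-cli \
    -hf unsloth/Qwen3-VL-8B-Instruct-GGUF:Q4_K_M \
    --n-gpu-layers 99 \
    --temp 0.7 --top-p 0.8 --top-k 20 --presence-penalty 1.5
```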
Once in, you will see the below screen:

Load up the image via /image PATH, i.e. /image unsloth.png, then press ENTER

When you hit ENTER, it'll say "unsloth.png image loaded"

Now let's ask a question like "What is this image?":

Now load in picture 2 via /image picture.png, then hit ENTER and ask "What is this image?"

And finally, let's ask how both images are related (it works!)

You can also download the model via Hugging Face's snapshot_download (after installing pip install huggingface_hub hf_transfer), which is useful for large model downloads, since llama.cpp's auto downloader might lag. You can choose Q4_K_M, or other quantized versions.
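A minimal Python sketch, assuming the 8B Instruct repo name used above; adjust repo_id and the quant pattern to whatever you picked.

```python
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"  # optional: faster downloads via hf_transfer

from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="unsloth/Qwen3-VL-8B-Instruct-GGUF",  # check the exact repo name on Hugging Face
    local_dir="Qwen3-VL-8B-Instruct-GGUF",
    allow_patterns=["*Q4_K_M*", "*mmproj*"],      # your chosen quant plus the vision projector
)
```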
Run the model and try any prompt, using the Instruct or Thinking sampling settings from the Recommended Settings section above. A sketch of both invocations follows.
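This is a minimal, hedged sketch assuming the files downloaded above; the exact GGUF and mmproj filenames depend on the quant you chose, so adjust the paths accordingly.

```bash
# Instruct: temperature 0.7, top-p 0.8, presence penalty 1.5
./llama.cpp/llama-mtmd-cli \
    -m Qwen3-VL-8B-Instruct-GGUF/Qwen3-VL-8B-Instruct-Q4_K_M.gguf \
    --mmproj Qwen3-VL-8B-Instruct-GGUF/mmproj-F16.gguf \
    --n-gpu-layers 99 --ctx-size 32768 \
    --temp 0.7 --top-p 0.8 --top-k 20 --presence-penalty 1.5

# Thinking: temperature 1.0, top-p 0.95, presence penalty 0.0
./llama.cpp/llama-mtmd-cli \
    -m Qwen3-VL-8B-Thinking-GGUF/Qwen3-VL-8B-Thinking-Q4_K_M.gguf \
    --mmproj Qwen3-VL-8B-Thinking-GGUF/mmproj-F16.gguf \
    --n-gpu-layers 99 --ctx-size 32768 \
    --temp 1.0 --top-p 0.95 --top-k 20 --presence-penalty 0.0
```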
🪄Running Qwen3-VL-235B-A22B and Qwen3-VL-30B-A3B
For Qwen3-VL-235B-A22B, we will use llama.cpp for optimized inference and a plethora of options.
We follow similar steps to the above; however, this time we also need a few extra steps because the model is so big.
Download the model (after installing pip install huggingface_hub hf_transfer) via Hugging Face's snapshot_download, as shown earlier. You can choose UD-Q2_K_XL, or other quantized versions. Then run the model and try a prompt, setting the correct parameters for Thinking vs. Instruct.
Launch with the Instruct or Thinking sampling settings from the Recommended Settings section above; a sketch of the command follows.
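This is a hedged sketch only: point -m at the first shard of the UD-Q2_K_XL download (the exact filename depends on the quant), and swap in the Thinking GGUF plus the Thinking sampling settings for the Thinking model.

```bash
./llama.cpp/llama-mtmd-cli \
    -m Qwen3-VL-235B-A22B-Instruct-GGUF/UD-Q2_K_XL/<first-shard>.gguf \
    --mmproj Qwen3-VL-235B-A22B-Instruct-GGUF/mmproj-F16.gguf \
    --ctx-size 16384 \
    --n-gpu-layers 99 \
    -ot ".ffn_.*_exps.=CPU" \
    --temp 0.7 --top-p 0.8 --top-k 20 --presence-penalty 1.5
```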
Edit --ctx-size 16384 for context length and --n-gpu-layers 99 for how many layers to offload to the GPU. Lower it if your GPU runs out of memory, and remove it entirely for CPU-only inference.
Use -ot ".ffn_.*_exps.=CPU" to offload all MoE layers to the CPU! This effectively allows you to fit all non-MoE layers on 1 GPU, improving generation speed. You can customize the regex to keep more layers on the GPU if you have more capacity.
🐋 Docker: Run Qwen3-VL
If you already have Docker Desktop, run the command below to use Unsloth's models from Hugging Face and you're done:
Or you can run Docker's uploaded Qwen3-VL models:
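Both options are sketched below, assuming Docker Desktop's Model Runner is enabled; the exact model names and tags may differ, so check the Unsloth Hugging Face collection and Docker Hub.

```bash
# Unsloth's GGUF, pulled straight from Hugging Face
docker model run hf.co/unsloth/Qwen3-VL-8B-Instruct-GGUF

# Docker's own Qwen3-VL upload from Docker Hub (check the exact name/tag)
docker model run ai/qwen3-vl
```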
🦥 Fine-tuning Qwen3-VL
Unsloth supports fine-tuning and reinforcement learning (RL) for Qwen3-VL, including the larger 32B and 235B models. This includes support for fine-tuning on video and for object detection. As usual, Unsloth makes Qwen3-VL models train 1.7x faster with 60% less VRAM and 8x longer context lengths with no accuracy degradation. We made two Qwen3-VL (8B) training notebooks which you can run for free on Colab:
Saving Qwen3-VL to GGUF now works, since llama.cpp just added support for it!
If you want to use any other Qwen3-VL model, just change the 8B model name to the 2B, 32B etc. one, as in the sketch below.
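For example, here is a minimal sketch of the model-loading step, assuming the notebooks' FastVisionModel setup and Unsloth's usual repo naming:

```python
from unsloth import FastVisionModel

# Swap the repo name for the 2B, 4B, 32B, etc. variant you want to train.
model, tokenizer = FastVisionModel.from_pretrained(
    "unsloth/Qwen3-VL-8B-Instruct",  # e.g. "unsloth/Qwen3-VL-32B-Instruct"
    load_in_4bit=True,               # 4-bit loading so it fits on a free Colab GPU
)
```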
The goal of the GRPO notebook is to make a vision language model solve maths problems via RL given an image input like below:

This Qwen3-VL support also integrates our latest update for even more memory-efficient and faster RL, including our Standby feature, which uniquely limits speed degradation compared to other implementations. You can read more about how to train vision LLMs with RL in our VLM GRPO guide.
Multi-image training
To fine-tune or train Qwen3-VL with multiple images, the most straightforward change is to swap the map-based dataset conversion for a plain Python list of conversations, as sketched below.
Using map triggers dataset standardization and Arrow processing rules, which can be strict and more complicated to work with.
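Here is a hedged sketch of what that plain-Python conversion can look like, assuming a dataset with two image columns and an answer column (all column names and the prompt text are placeholders):

```python
# Build a plain Python list of conversations instead of calling Dataset.map,
# so Arrow schema standardization never kicks in.
def to_conversation(sample):
    return {"messages": [
        {"role": "user", "content": [
            {"type": "image", "image": sample["image_1"]},
            {"type": "image", "image": sample["image_2"]},
            {"type": "text",  "text": "How are these two images related?"},
        ]},
        {"role": "assistant", "content": [
            {"type": "text", "text": sample["answer"]},
        ]},
    ]}

converted_dataset = [to_conversation(sample) for sample in dataset]
```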