🌠 Qwen3-VL: How to Run Guide

Learn to fine-tune and run Qwen3-VL locally with Unsloth.

Qwen3-VL is Qwen’s new family of vision models, available in Instruct and Thinking versions. The 2B, 4B, 8B and 32B models are dense, while the 30B and 235B models are MoE. The 235B Thinking model delivers SOTA vision and coding performance, rivaling GPT-5 (high) and Gemini 2.5 Pro. Qwen3-VL has vision, video and OCR capabilities, as well as 256K context (extendable to 1M). Unsloth supports Qwen3-VL fine-tuning and RL. Train Qwen3-VL (8B) for free with our notebooks.


🖥️ Running Qwen3-VL

To run the model in llama.cpp, vLLM, Ollama etc., Qwen recommends the following settings, which differ slightly between Instruct and Thinking:

| Setting | Instruct | Thinking |
| --- | --- | --- |
| Temperature | 0.7 | 1.0 |
| Top_P | 0.8 | 0.95 |
| presence_penalty | 1.5 | 0.0 |
| Output Length | 32768 (up to 256K) | 40960 (up to 256K) |
| Top_K | 20 | 20 |

The Qwen team also used the settings below for their benchmarking numbers, as mentioned on GitHub.

Instruct Settings:

export greedy='false'
export seed=3407
export top_p=0.8
export top_k=20
export temperature=0.7
export repetition_penalty=1.0
export presence_penalty=1.5
export out_seq_length=32768

Thinking Settings:

export greedy='false'
export seed=1234
export top_p=0.95
export top_k=20
export temperature=1.0
export repetition_penalty=1.0
export presence_penalty=0.0
export out_seq_length=40960

🐛 Chat template bug fixes

At Unsloth, we care about accuracy the most, so we investigated why llama.cpp would break after the 2nd turn when running the Thinking models, as seen below:

The error code:

terminate called after throwing an instance of 'std::runtime_error'
  what():  Value is not callable: null at row 63, column 78:
            {%- if '</think>' in content %}
                {%- set reasoning_content = ((content.split('</think>')|first).rstrip('\n').split('<think>')|last).lstrip('\n') %}
                                                                             ^

We have successfully fixed the Thinking chat template for the VL models and re-uploaded all of Unsloth's Thinking quants. They should now all work past the 2nd conversation, whereas other quants will still fail to load after the 2nd conversation.

Qwen3-VL Unsloth uploads:

Qwen3-VL GGUFs are now supported by llama.cpp as of 30 October 2025, so you can run them locally!

📖 Llama.cpp: Run Qwen3-VL Tutorial

  1. Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.
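A typical build sketch (the package list and build targets below follow llama.cpp's standard CUDA build; adjust for your system):

apt-get update
apt-get install build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
# configure with CUDA; switch -DGGML_CUDA=ON to OFF for CPU-only builds
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
# build the CLI tools, including the multimodal llama-mtmd-cli
cmake --build llama.cpp/build --config Release -j --clean-first \
    --target llama-cli llama-mtmd-cli
cp llama.cpp/build/bin/llama-* llama.cpp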

  2. Let's first get an image! You can also use your own images. We shall use https://raw.githubusercontent.com/unslothai/unsloth/refs/heads/main/images/unsloth%20made%20with%20love.png, which is just our mini logo showing how fine-tunes are made with Unsloth.

  3. Let's download this image:
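For example, with wget (saving it locally as unsloth.png):

wget "https://raw.githubusercontent.com/unslothai/unsloth/refs/heads/main/images/unsloth%20made%20with%20love.png" -O unsloth.png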

  4. Then, let's use llama.cpp's auto model downloading feature. Try this for the 8B Instruct model:
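A minimal sketch using llama-mtmd-cli with its -hf auto-download; the UD-Q4_K_XL quant tag is an example, and the sampling flags follow the Instruct settings above:

./llama.cpp/llama-mtmd-cli \
    -hf unsloth/Qwen3-VL-8B-Instruct-GGUF:UD-Q4_K_XL \
    --n-gpu-layers 99 \
    --ctx-size 16384 \
    --temp 0.7 --top-p 0.8 --top-k 20 \
    --presence-penalty 1.5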

  5. Once in, you will see llama.cpp's interactive chat screen.

  6. Load up the image via /image PATH, i.e. /image unsloth.png, then press ENTER.

  7. When you hit ENTER, it'll say "unsloth.png image loaded".

  8. Now let's ask a question like "What is this image?"

  9. Now load in a second picture via /image picture.png, then hit ENTER and ask "What is this image?"

  10. And finally, let's ask how the two images are related (it works!)

  11. You can also download the model via Hugging Face's snapshot_download (after running pip install huggingface_hub hf_transfer), which is useful for large model downloads since llama.cpp's auto downloader might lag. You can choose Q4_K_M or other quantized versions:
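One way to do this from the shell, as a sketch (the repo name and quant pattern are examples; adjust allow_patterns to the quant you want):

pip install huggingface_hub hf_transfer
export HF_HUB_ENABLE_HF_TRANSFER=1
python -c "
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id = 'unsloth/Qwen3-VL-8B-Instruct-GGUF',   # example repo
    local_dir = 'unsloth/Qwen3-VL-8B-Instruct-GGUF',
    allow_patterns = ['*Q4_K_M*', 'mmproj*'],         # quant of choice + the vision projector
)
"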

  12. Run the model and try any prompt. For Instruct:
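A sketch assuming the Q4_K_M quant downloaded above; the GGUF and mmproj file names are illustrative, so match them to what actually landed on disk:

./llama.cpp/llama-mtmd-cli \
    --model  unsloth/Qwen3-VL-8B-Instruct-GGUF/Qwen3-VL-8B-Instruct-Q4_K_M.gguf \
    --mmproj unsloth/Qwen3-VL-8B-Instruct-GGUF/mmproj-F16.gguf \
    --n-gpu-layers 99 \
    --ctx-size 16384 \
    --temp 0.7 --top-p 0.8 --top-k 20 \
    --presence-penalty 1.5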

  13. For Thinking:
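Same sketch with the Thinking sampling settings (again, file names are illustrative):

./llama.cpp/llama-mtmd-cli \
    --model  unsloth/Qwen3-VL-8B-Thinking-GGUF/Qwen3-VL-8B-Thinking-Q4_K_M.gguf \
    --mmproj unsloth/Qwen3-VL-8B-Thinking-GGUF/mmproj-F16.gguf \
    --n-gpu-layers 99 \
    --ctx-size 16384 \
    --temp 1.0 --top-p 0.95 --top-k 20 \
    --presence-penalty 0.0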

🪄 Running Qwen3-VL-235B-A22B and Qwen3-VL-30B-A3B

For Qwen3-VL-235B-A22B, we will use llama.cpp for optimized inference and a plethora of options.

  1. We're following similar steps to the above, however this time we'll also need to perform a few extra steps because the model is so big.

  2. Download the model (after running pip install huggingface_hub hf_transfer). You can choose UD-Q2_K_XL or other quantized versions:
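For example, as a sketch (the Instruct repo is shown and is an example name; swap in the Thinking repo for that variant):

pip install huggingface_hub hf_transfer
export HF_HUB_ENABLE_HF_TRANSFER=1
python -c "
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id = 'unsloth/Qwen3-VL-235B-A22B-Instruct-GGUF',   # example repo
    local_dir = 'unsloth/Qwen3-VL-235B-A22B-Instruct-GGUF',
    allow_patterns = ['*UD-Q2_K_XL*', 'mmproj*'],            # quant of choice + the vision projector
)
"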

  3. Run the model and try a prompt. Set the correct parameters for Thinking vs. Instruct.

Instruct:
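A sketch with the Instruct sampling settings; -ot keeps the MoE expert tensors on the CPU so the rest fits in VRAM, and the file paths are illustrative (point --model at the first UD-Q2_K_XL shard you downloaded):

./llama.cpp/llama-mtmd-cli \
    --model  unsloth/Qwen3-VL-235B-A22B-Instruct-GGUF/UD-Q2_K_XL/<first-gguf-shard>.gguf \
    --mmproj unsloth/Qwen3-VL-235B-A22B-Instruct-GGUF/mmproj-F16.gguf \
    --n-gpu-layers 99 \
    -ot ".ffn_.*_exps.=CPU" \
    --ctx-size 16384 \
    --temp 0.7 --top-p 0.8 --top-k 20 \
    --presence-penalty 1.5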

Thinking:
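And with the Thinking sampling settings (same caveats as above):

./llama.cpp/llama-mtmd-cli \
    --model  unsloth/Qwen3-VL-235B-A22B-Thinking-GGUF/UD-Q2_K_XL/<first-gguf-shard>.gguf \
    --mmproj unsloth/Qwen3-VL-235B-A22B-Thinking-GGUF/mmproj-F16.gguf \
    --n-gpu-layers 99 \
    -ot ".ffn_.*_exps.=CPU" \
    --ctx-size 16384 \
    --temp 1.0 --top-p 0.95 --top-k 20 \
    --presence-penalty 0.0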

  4. Edit --ctx-size 16384 to set the context length and --n-gpu-layers 99 to control how many layers are offloaded to the GPU. Try lowering it if your GPU goes out of memory, and remove it entirely for CPU-only inference.

🐋 Docker: Run Qwen3-VL

If you already have Docker Desktop, run the command below to run Unsloth's models from Hugging Face and you're done:
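For example, assuming Docker Model Runner is enabled in Docker Desktop (the repo and quant tag are examples):

docker model run hf.co/unsloth/Qwen3-VL-8B-Instruct-GGUF:UD-Q4_K_XL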

Or you can run Docker's uploaded Qwen3-VL models:
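As an illustration (the exact name in Docker's model catalog is an assumption here; check the catalog for the published Qwen3-VL entries):

docker model run ai/qwen3-vl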

🦥 Fine-tuning Qwen3-VL

Unsloth supports fine-tuning and reinforcement learning (RL) for Qwen3-VL, including the larger 32B and 235B models. This includes support for fine-tuning on video and object detection. As usual, Unsloth makes Qwen3-VL models train 1.7x faster with 60% less VRAM and 8x longer context lengths, with no accuracy degradation. We made two Qwen3-VL (8B) training notebooks which you can run for free on Colab:

The goal of the GRPO notebook is to make a vision language model solve maths problems via RL given an image input like below:

This Qwen3-VL support also integrates our latest update for even more memory-efficient and faster RL, including our Standby feature, which uniquely limits speed degradation compared to other implementations. You can read more about how to train vision LLMs with RL in our VLM GRPO guide.

Multi-image training

In order to fine-tune or train Qwen3-VL with multiple images, the most straightforward change is to convert the dataset with plain Python instead of Dataset.map. Using map kicks in dataset standardization and Arrow processing rules, which can be strict and more complicated to define.
