🌠Qwen3-VL: How to Run Guide
Learn to fine-tune and run Qwen3-VL locally with Unsloth.
Qwen3-VL is Qwen's new family of vision models, available in Instruct and Thinking versions. The 2B, 4B, 8B and 32B models are dense, while the 30B and 235B models are MoE. The 235B Thinking model delivers SOTA vision and coding performance rivaling GPT-5 (high) and Gemini 2.5 Pro. Qwen3-VL has vision, video and OCR capabilities as well as 256K context (extendable to 1M). Unsloth supports Qwen3-VL fine-tuning and RL. Train Qwen3-VL (8B) for free with our notebooks.
🖥️ Running Qwen3-VL
To run the model in llama.cpp, vLLM, Ollama etc., here are the recommended settings:
⚙️ Recommended Settings
Qwen recommends these settings for both models (they're a bit different for Instruct vs Thinking):
Instruct:
Temperature = 0.7
Top_P = 0.8
presence_penalty = 1.5
Output Length = 32768 (up to 256K)
Top_K = 20
Thinking:
Temperature = 1.0
Top_P = 0.95
presence_penalty = 0.0
Output Length = 40960 (up to 256K)
Top_K = 20
Qwen also used the settings below for Qwen3-VL's benchmark numbers, as mentioned on GitHub.
Instruct Settings:
export greedy='false'
export seed=3407
export top_p=0.8
export top_k=20
export temperature=0.7
export repetition_penalty=1.0
export presence_penalty=1.5
export out_seq_length=32768
Thinking Settings:
export greedy='false'
export seed=1234
export top_p=0.95
export top_k=20
export temperature=1.0
export repetition_penalty=1.0
export presence_penalty=0.0
export out_seq_length=40960
🐛Chat template bug fixes
At Unsloth, we care most about accuracy, so we investigated why llama.cpp would break after the 2nd turn of a conversation when running the Thinking models, as seen below:

The error code:
terminate called after throwing an instance of 'std::runtime_error'
what(): Value is not callable: null at row 63, column 78:
{%- if '</think>' in content %}
{%- set reasoning_content = ((content.split('</think>')|first).rstrip('\n').split('<think>')|last).lstrip('\n') %}
^
We have successfully fixed the Thinking chat template for the VL models, so we re-uploaded all of our Thinking quants. Unsloth's quants should now all work after the 2nd conversation turn - other quants will still fail after the 2nd turn.
Qwen3-VL Unsloth uploads:
Qwen3-VL GGUFs are now supported by llama.cpp as of 30th October 2025, so you can run them locally!
📖 Llama.cpp: Run Qwen3-VL Tutorial
Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.
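For reference, a typical CUDA build looks like the following; this mirrors llama.cpp's standard CMake instructions, and the apt package names assume a Debian/Ubuntu system.

```bash
apt-get update
apt-get install build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first
cp llama.cpp/build/bin/llama-* llama.cpp/
```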
Let's first get an image! You can also use your own images. We shall use https://raw.githubusercontent.com/unslothai/unsloth/refs/heads/main/images/unsloth%20made%20with%20love.png, which is just our mini logo showing how finetunes are made with Unsloth:

Let's download this image
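For example, with wget, saving it as unsloth.png (the filename used in the steps below):

```bash
wget "https://raw.githubusercontent.com/unslothai/unsloth/refs/heads/main/images/unsloth%20made%20with%20love.png" -O unsloth.png
```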

Then let's use llama.cpp's auto model downloading feature. Try this for the 8B Instruct model:
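A minimal sketch of the command, assuming the GGUF repo is named unsloth/Qwen3-VL-8B-Instruct-GGUF (check the exact repo and quant name on our Hugging Face page):

```bash
./llama.cpp/llama-mtmd-cli \
    -hf unsloth/Qwen3-VL-8B-Instruct-GGUF:Q4_K_M \
    --n-gpu-layers 99 \
    --temp 0.7 --top-p 0.8 --top-k 20 --presence-penalty 1.5
```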
Once in, you will see the below screen:

Load up the image via /image PATH, i.e. /image unsloth.png, then press ENTER

When you hit ENTER, it'll say "unsloth.png image loaded"

Now let's ask a question like "What is this image?":

Now load in picture 2 via /image picture.png, then hit ENTER and ask "What is this image?"

And finally, let's ask how both images are related (it works!)

You can also download the model via Hugging Face's snapshot_download (after installing pip install huggingface_hub hf_transfer), which is useful for large model downloads, since llama.cpp's auto downloader might lag. You can choose Q4_K_M, or other quantized versions.
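A minimal Python sketch, assuming the 8B Instruct repo name used above; adjust repo_id and the quant pattern to whatever you picked.

```python
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"  # optional: faster downloads via hf_transfer

from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="unsloth/Qwen3-VL-8B-Instruct-GGUF",  # check the exact repo name on Hugging Face
    local_dir="Qwen3-VL-8B-Instruct-GGUF",
    allow_patterns=["*Q4_K_M*", "*mmproj*"],      # your chosen quant plus the vision projector
)
```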
Run the model and try any prompt, using the Instruct or Thinking sampling settings from the Recommended Settings section above. A sketch of both invocations follows.
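This is a minimal, hedged sketch assuming the files downloaded above; the exact GGUF and mmproj filenames depend on the quant you chose, so adjust the paths accordingly.

```bash
# Instruct: temperature 0.7, top-p 0.8, presence penalty 1.5
./llama.cpp/llama-mtmd-cli \
    -m Qwen3-VL-8B-Instruct-GGUF/Qwen3-VL-8B-Instruct-Q4_K_M.gguf \
    --mmproj Qwen3-VL-8B-Instruct-GGUF/mmproj-F16.gguf \
    --n-gpu-layers 99 --ctx-size 32768 \
    --temp 0.7 --top-p 0.8 --top-k 20 --presence-penalty 1.5

# Thinking: temperature 1.0, top-p 0.95, presence penalty 0.0
./llama.cpp/llama-mtmd-cli \
    -m Qwen3-VL-8B-Thinking-GGUF/Qwen3-VL-8B-Thinking-Q4_K_M.gguf \
    --mmproj Qwen3-VL-8B-Thinking-GGUF/mmproj-F16.gguf \
    --n-gpu-layers 99 --ctx-size 32768 \
    --temp 1.0 --top-p 0.95 --top-k 20 --presence-penalty 0.0
```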
🪄Running Qwen3-VL-235B-A22B and Qwen3-VL-30B-A3B
For Qwen3-VL-235B-A22B, we will use llama.cpp for optimized inference and a plethora of options.
We follow similar steps to the above; however, this time we also need a few extra steps because the model is so big.
Download the model (after installing pip install huggingface_hub hf_transfer) via Hugging Face's snapshot_download, as shown earlier. You can choose UD-Q2_K_XL, or other quantized versions. Then run the model and try a prompt, setting the correct parameters for Thinking vs. Instruct.
Launch with the Instruct or Thinking sampling settings from the Recommended Settings section above; a sketch of the command follows.
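This is a hedged sketch only: point -m at the first shard of the UD-Q2_K_XL download (the exact filename depends on the quant), and swap in the Thinking GGUF plus the Thinking sampling settings for the Thinking model.

```bash
./llama.cpp/llama-mtmd-cli \
    -m Qwen3-VL-235B-A22B-Instruct-GGUF/UD-Q2_K_XL/<first-shard>.gguf \
    --mmproj Qwen3-VL-235B-A22B-Instruct-GGUF/mmproj-F16.gguf \
    --ctx-size 16384 \
    --n-gpu-layers 99 \
    -ot ".ffn_.*_exps.=CPU" \
    --temp 0.7 --top-p 0.8 --top-k 20 --presence-penalty 1.5
```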
Edit --ctx-size 16384 for context length and --n-gpu-layers 99 for how many layers to offload to the GPU. Lower it if your GPU runs out of memory, and remove it entirely for CPU-only inference.
Use -ot ".ffn_.*_exps.=CPU" to offload all MoE layers to the CPU! This effectively allows you to fit all non-MoE layers on 1 GPU, improving generation speed. You can customize the regex to keep more layers on the GPU if you have more capacity.
🐋 Docker: Run Qwen3-VL
If you already have Docker Desktop, run the command below to use Unsloth's models from Hugging Face and you're done:
Or you can run Docker's uploaded Qwen3-VL models:
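Both options are sketched below, assuming Docker Desktop's Model Runner is enabled; the exact model names and tags may differ, so check the Unsloth Hugging Face collection and Docker Hub.

```bash
# Unsloth's GGUF, pulled straight from Hugging Face
docker model run hf.co/unsloth/Qwen3-VL-8B-Instruct-GGUF

# Docker's own Qwen3-VL upload from Docker Hub (check the exact name/tag)
docker model run ai/qwen3-vl
```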
🦥 Fine-tuning Qwen3-VL
Unsloth supports fine-tuning and reinforcement learning (RL) for Qwen3-VL, including the larger 32B and 235B models. This includes support for fine-tuning on video and for object detection. As usual, Unsloth makes Qwen3-VL models train 1.7x faster with 60% less VRAM and 8x longer context lengths with no accuracy degradation. We made two Qwen3-VL (8B) training notebooks which you can run for free on Colab:
Saving Qwen3-VL to GGUF now works, since llama.cpp just added support for it!
If you want to use any other Qwen3-VL model, just change the 8B model name to the 2B, 32B etc. one, as in the sketch below.
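For example, here is a minimal sketch of the model-loading step, assuming the notebooks' FastVisionModel setup and Unsloth's usual repo naming:

```python
from unsloth import FastVisionModel

# Swap the repo name for the 2B, 4B, 32B, etc. variant you want to train.
model, tokenizer = FastVisionModel.from_pretrained(
    "unsloth/Qwen3-VL-8B-Instruct",  # e.g. "unsloth/Qwen3-VL-32B-Instruct"
    load_in_4bit=True,               # 4-bit loading so it fits on a free Colab GPU
)
```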
The goal of the GRPO notebook is to make a vision language model solve maths problems via RL given an image input like below:

This Qwen3-VL support also integrates our latest update for even more memory-efficient and faster RL, including our Standby feature, which uniquely limits speed degradation compared to other implementations. You can read more about how to train vision LLMs with RL in our VLM GRPO guide.
Multi-image training
To fine-tune or train Qwen3-VL with multiple images, the most straightforward change is to swap the map-based dataset conversion for a plain Python list of conversations, as sketched below.
Using map triggers dataset standardization and Arrow processing rules, which can be strict and more complicated to work with.
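Here is a hedged sketch of what that plain-Python conversion can look like, assuming a dataset with two image columns and an answer column (all column names and the prompt text are placeholders):

```python
# Build a plain Python list of conversations instead of calling Dataset.map,
# so Arrow schema standardization never kicks in.
def to_conversation(sample):
    return {"messages": [
        {"role": "user", "content": [
            {"type": "image", "image": sample["image_1"]},
            {"type": "image", "image": sample["image_2"]},
            {"type": "text",  "text": "How are these two images related?"},
        ]},
        {"role": "assistant", "content": [
            {"type": "text", "text": sample["answer"]},
        ]},
    ]}

converted_dataset = [to_conversation(sample) for sample in dataset]
```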