🌙 Qwen3: How to Run & Fine-tune

Learn to run & fine-tune Qwen3 locally with Unsloth + our Dynamic 2.0 quants

Qwen's new Qwen3 models deliver state-of-the-art advancements in reasoning, instruction-following, agent capabilities, and multilingual support.

All uploads use Unsloth Dynamic 2.0 for SOTA 5-shot MMLU and KL Divergence performance, meaning you can run & fine-tune quantized Qwen LLMs with minimal accuracy loss.

We also uploaded Qwen3 with native 128K context length. Qwen achieves this by using YaRN to extend its original 40K window to 128K.

Unsloth also now supports fine-tuning and Reinforcement Learning (RL) of Qwen3 and Qwen3 MoE models: 2x faster, with 70% less VRAM, and 8x longer context lengths. Fine-tune Qwen3 (14B) for free using our Colab notebook.


Qwen3 - Unsloth Dynamic 2.0 with optimal configs:

  • Dynamic 2.0 GGUF (to run)
  • 128K Context GGUF
  • Dynamic 4-bit Safetensor (to fine-tune/deploy)

๐Ÿ–ฅ๏ธ Running Qwen3

To achieve inference speeds of 6+ tokens per second, we recommend that your available memory match or exceed the size of the model you're using. For example, a 30GB quantized model requires at least 30GB of memory. The Q2_K_XL quant, which is 180GB, will require at least 180GB of unified memory (VRAM + RAM) or 180GB of RAM for optimal performance.

NOTE: It's possible to run the model with less total memory than its size (i.e., less VRAM, less RAM, or a lower combined total), but this will result in slower inference. Sufficient memory is only required if you want to maximize throughput and achieve the fastest inference times.

According to Qwen, these are the recommended settings for inference:

| Setting | Non-Thinking Mode | Thinking Mode |
| --- | --- | --- |
| Temperature | 0.7 | 0.6 |
| Min_P | 0.0 | 0.0 |
| Top_P | 0.8 | 0.95 |
| TopK | 20 | 20 |

Min_P = 0.0 is optional, but 0.01 works well (note that llama.cpp's default is 0.1).

Chat template/prompt format:
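Qwen3 uses a ChatML-style template; here is a minimal sketch of a single-turn prompt (the exact template ships with the tokenizer and in our GGUF uploads):

```
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
What is 2+2?<|im_end|>
<|im_start|>assistant
```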

Switching Between Thinking and Non-Thinking Mode

Qwen3 models come with built-in "thinking mode" to boost reasoning and improve response quality - similar to how QwQ-32B worked. Instructions for switching will differ depending on the inference engine you're using so ensure you use the correct instructions.

Instructions for llama.cpp and Ollama:

You can add /think and /no_think to user prompts or system messages to switch the model's thinking mode from turn to turn. The model will follow the most recent instruction in multi-turn conversations.

Here is an example of a multi-turn conversation:
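(An illustrative sketch; the exact formatting depends on your chat frontend.)

```
> Solve x^2 + 2x + 1 = 0. /think
<think>
(x + 1)^2 = 0, so x = -1 is a double root.
</think>
The equation factors as (x + 1)^2 = 0, so the only solution is x = -1.

> Now what is 9 + 6? /no_think
15
```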

Instructions for transformers and vLLM:

Thinking mode:

enable_thinking=True

By default, Qwen3 has thinking enabled. When you call tokenizer.apply_chat_template, you don't need to set anything manually.

In thinking mode, the model will generate an extra <think>...</think> block before the final answer, which lets it "plan" and sharpen its responses.

Non-thinking mode:

enable_thinking=False

Enabling non-thinking mode makes Qwen3 skip all thinking steps and behave like a standard LLM.

This mode will provide final responses directly, with no <think> blocks and no chain-of-thought.
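A minimal sketch with transformers (the model name, prompt, and generation settings are illustrative; enable_thinking is forwarded to the chat template):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-14B"  # example; Unsloth uploads such as "unsloth/Qwen3-14B" also work
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "Briefly explain KL divergence."}]

# enable_thinking=True is the default; set it to False for non-thinking mode
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)

inputs = tokenizer(text, return_tensors="pt").to(model.device)
# Non-thinking sampling settings from the table above
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True,
                         temperature=0.7, top_p=0.8, top_k=20)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```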

🦙 Ollama: Run Qwen3 Tutorial

  1. Install ollama if you haven't already! You can only run models up to 32B in size. To run the full 235B-A22B model, see here.
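For example, on Linux you can use the official install script (this assumes you're comfortable piping a script to your shell):

```bash
curl -fsSL https://ollama.com/install.sh | sh
```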

  2. Run the model! Note that you can call ollama serve in another terminal if it fails. We include all our fixes and suggested parameters (temperature etc.) in params in our Hugging Face upload!
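A sketch of the run command, pulling an Unsloth Dynamic GGUF straight from Hugging Face (the repo and quant tag here are examples; swap in the size and quant you want):

```bash
ollama run hf.co/unsloth/Qwen3-8B-GGUF:Q4_K_XL
```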

  3. To disable thinking, append /no_think as shown below (or set it in the system prompt):
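For example, inside the Ollama chat:

```
>>> Write your prompt here /no_think
```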

📖 Llama.cpp: Run Qwen3 Tutorial

  1. Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.
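One way to build with CUDA on Ubuntu (the package names assume apt; adjust for your platform):

```bash
apt-get update
apt-get install build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first \
    --target llama-cli llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp
```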

  2. Download the model via the snippet below (after installing pip install huggingface_hub hf_transfer). You can choose Q4_K_M, or other quantized versions.
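A sketch using huggingface_hub's snapshot_download (the repo ID and quant pattern are examples):

```python
# pip install huggingface_hub hf_transfer
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"  # enable faster downloads

from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="unsloth/Qwen3-14B-GGUF",      # example repo; pick your model size
    local_dir="unsloth/Qwen3-14B-GGUF",
    allow_patterns=["*Q4_K_M*"],           # or another quant, e.g. "*UD-Q2_K_XL*"
)
```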

  3. Run the model and try any prompt.
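For example (the GGUF path must match the file you downloaded; the sampling flags follow the thinking-mode settings above):

```bash
./llama.cpp/llama-cli \
    --model unsloth/Qwen3-14B-GGUF/Qwen3-14B-Q4_K_M.gguf \
    --threads 32 \
    --ctx-size 16384 \
    --n-gpu-layers 99 \
    --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0
```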

To disable thinking, append /no_think to your prompt as shown below (or set it in the system prompt):
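For example, inside llama-cli's interactive chat:

```
> What is 2+2? /no_think
```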

Running Qwen3-235B-A22B

For Qwen3-235B-A22B, we will specifically use Llama.cpp, which offers optimized inference and a plethora of options.

  1. We follow similar steps to those above, but because the model is so big, we'll also need to perform a few extra steps.

  2. Download the model via the snippet below (after installing pip install huggingface_hub hf_transfer). You can choose UD-Q2_K_XL, or other quantized versions.
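A sketch, mirroring the smaller download above (the quant pattern is an example):

```python
# pip install huggingface_hub hf_transfer
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="unsloth/Qwen3-235B-A22B-GGUF",
    local_dir="unsloth/Qwen3-235B-A22B-GGUF",
    allow_patterns=["*UD-Q2_K_XL*"],  # the Dynamic 2-bit quant
)
```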

  3. Run the model and try any prompt.
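A sketch of the run command (the shard path is an example; point --model at the first shard you downloaded. -ot offloads the MoE expert tensors to CPU so the remaining layers fit on GPU):

```bash
./llama.cpp/llama-cli \
    --model unsloth/Qwen3-235B-A22B-GGUF/UD-Q2_K_XL/Qwen3-235B-A22B-UD-Q2_K_XL-00001-of-00002.gguf \
    --threads 32 \
    --ctx-size 16384 \
    --n-gpu-layers 99 \
    -ot ".ffn_.*_exps.=CPU" \
    --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0
```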

  4. Edit --threads 32 for the number of CPU threads and --ctx-size 16384 for the context length. --n-gpu-layers 99 controls how many layers are offloaded to the GPU; lower it if your GPU runs out of memory, and remove it entirely for CPU-only inference.

🦥 Fine-tuning Qwen3 with Unsloth

Unsloth makes Qwen3 fine-tuning 2x faster, with 70% less VRAM, and supports 8x longer context lengths. Qwen3 (14B) fits comfortably on a Google Colab 16GB VRAM Tesla T4 GPU.

Because Qwen3 supports both reasoning and non-reasoning, you can fine-tune it with a non-reasoning dataset, but this may affect its reasoning ability. If you want to maintain its reasoning capabilities (optional), use a mix of direct answers and chain-of-thought examples: roughly 75% reasoning and 25% non-reasoning data lets the model retain its reasoning capabilities, as in the sketch below.
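A sketch of building such a mix with the datasets library (the dataset IDs, split names, and sizes below are assumptions; the Conversational notebook has the exact recipe):

```python
from datasets import load_dataset

reasoning = load_dataset("nvidia/OpenMathReasoning", split="cot")
chat = load_dataset("mlabonne/FineTome-100k", split="train")

# Sample so reasoning makes up 75% of the final mix
total = 20_000  # total examples for a quick run
reasoning = reasoning.shuffle(seed=3407).select(range(int(total * 0.75)))
chat = chat.shuffle(seed=3407).select(range(int(total * 0.25)))

# Standardize both into one chat schema before concatenating, e.g.:
# mixed = concatenate_datasets([reasoning_std, chat_std]).shuffle(seed=3407)
```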

Our Conversational notebook uses a combination of 75% of NVIDIA's OpenMathReasoning dataset and 25% of Maxime's FineTome dataset (non-reasoning). Here are free Unsloth Colab notebooks to fine-tune Qwen3:

If you have an old version of Unsloth and/or are fine-tuning locally, install the latest version of Unsloth:
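For example:

```bash
pip install --upgrade --force-reinstall --no-cache-dir unsloth unsloth_zoo
```

The --force-reinstall and --no-cache-dir flags simply ensure you get a clean, current build.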

Fine-tuning Qwen3 MoE models

Fine-tuning support includes MoE models: 30B-A3B and 235B-A22B. Qwen3-30B-A3B works on just 17.5GB VRAM with Unsloth. When fine-tuning MoE models, it's probably not a good idea to fine-tune the router layer, so we disable it by default.

The 30B-A3B fits in 17.5GB VRAM, but you may lack RAM or disk space since the full 16-bit model must be downloaded and converted to 4-bit on the fly for QLoRA fine-tuning. This is due to issues importing 4-bit BnB MoE models directly, and it only affects MoE models.

Notebook Guide:

To use the notebooks, just click Runtime, then Run all. You can change the settings in the notebook to whatever you desire; we set sensible values by default. Change the model name to any Qwen3 model on Hugging Face, e.g. 'unsloth/Qwen3-8B' or 'unsloth/Qwen3-0.6B-unsloth-bnb-4bit'.

There are other settings which you can toggle:

  • max_seq_length = 2048 controls the context length. While Qwen3 supports 40960, we recommend 2048 for testing. Unsloth enables 8x longer context fine-tuning.

  • load_in_4bit = True enables 4-bit quantization, reducing memory use 4x so you can fine-tune on 16GB GPUs.

  • For full fine-tuning, set full_finetuning = True; for 8-bit fine-tuning, set load_in_8bit = True (both appear in the sketch after this list).
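As a minimal sketch of how these options fit together (the model name and values are examples):

```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Qwen3-14B",  # any Qwen3 model on Hugging Face
    max_seq_length = 2048,             # context length for fine-tuning
    load_in_4bit = True,               # 4-bit QLoRA; or set load_in_8bit = True
    # full_finetuning = True,          # uncomment for full fine-tuning
)
```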

If you'd like to read a full end-to-end guide on how to use Unsloth notebooks for fine-tuning or just learn about fine-tuning, creating datasets etc., view our complete guide here:

🧬 Fine-tuning Guide · 📈 Datasets Guide

GRPO with Qwen3

We made a new advanced GRPO notebook for fine-tuning Qwen3. Learn to use our new proximity-based reward function (closer answers = rewarded) and Hugging Face's Open-R1 math dataset. Unsloth now also has better evaluations and uses the latest version of vLLM.

Qwen3 (4B) notebook - Advanced GRPO LoRA

Learn about:

  • Enabling reasoning in Qwen3 (Base) + guiding it to do a specific task

  • Pre-finetuning to bypass GRPO's tendency to learn formatting

  • Improved evaluation accuracy via new regex matching

  • Custom GRPO templates beyond just 'think', e.g. <start_working_out> ... <end_working_out>

  • Proximity-based scoring: closer answers earn more points (e.g., predicting 9 when the answer is 10 beats predicting 3), while outliers are penalized; a toy sketch follows this list
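To give a flavor of proximity-based scoring, here is a toy sketch (not the notebook's exact reward function; the number extraction and point values are assumptions):

```python
import re

def proximity_reward(completions, answers):
    """Toy proximity-based reward: closer numeric answers earn more points,
    and far-off outliers are penalized. Illustrative only."""
    rewards = []
    for completion, answer in zip(completions, answers):
        match = re.search(r"-?\d+(?:\.\d+)?", completion)  # naive number extraction
        if match is None:
            rewards.append(-2.0)   # nothing parsable: strongest penalty
            continue
        error = abs(float(match.group()) - answer)
        if error == 0:
            rewards.append(3.0)    # exact match
        elif error / max(abs(answer), 1e-6) <= 0.1:
            rewards.append(1.0)    # within 10% earns partial credit
        else:
            rewards.append(-1.0)   # outliers are penalized
    return rewards

# Predicting 9 when the answer is 10 earns partial credit; 100 is penalized.
print(proximity_reward(["I think it's 9", "Maybe 100?"], [10.0, 10.0]))  # [1.0, -1.0]
```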
