Qwen3: How to Run & Fine-tune
Learn to run & fine-tune Qwen3 locally with Unsloth + our Dynamic 2.0 quants
Qwen's new Qwen3 models deliver state-of-the-art advancements in reasoning, instruction-following, agent capabilities, and multilingual support. All Qwen3 uploads use our new Unsloth methodology, delivering the best performance on 5-shot MMLU and KL Divergence benchmarks. This means you can run and fine-tune quantized Qwen3 LLMs with minimal accuracy loss!
We also uploaded Qwen3 with native 128K context length. Qwen achieves this by using YaRN to extend its original 40K window to 128K.
Unsloth also now supports fine-tuning and GRPO of Qwen3 and Qwen3 MoE models: 2x faster, with 70% less VRAM, and 8x longer context lengths. Fine-tune Qwen3 (14B) for free using our Colab notebook.
According to Qwen, these are the recommended settings for inference:

Thinking mode:
- Temperature = 0.6
- Min_P = 0.0
- Top_P = 0.95
- TopK = 20

Non-thinking mode:
- Temperature = 0.7
- Min_P = 0.0
- Top_P = 0.8
- TopK = 20

Min_P = 0.0 is optional, but 0.01 works well; llama.cpp's default is 0.1.
Chat template/prompt format:
For NON-thinking mode, we purposely enclose nothing between <think> and </think>:
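For example, a rendered non-thinking prompt looks like this (a sketch of the template output; Qwen3 uses the ChatML-style `<|im_start|>`/`<|im_end|>` format, and the empty think block is inserted automatically):

```
<|im_start|>user
What is 2+2?<|im_end|>
<|im_start|>assistant
<think>

</think>

```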
For thinking mode, DO NOT use greedy decoding, as it can lead to performance degradation and endless repetitions.
You can add `/think` and `/no_think` to user prompts or system messages to switch the model's thinking mode from turn to turn. The model will follow the most recent instruction in multi-turn conversations.
Here is an example of a multi-turn conversation:
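A minimal sketch (the prompts are illustrative, not from Qwen's documentation): the `/think` and `/no_think` flags ride along inside the user messages, and the model obeys the most recent one.

```python
messages = [
    # Turn 1: default behavior, so thinking mode is on.
    {"role": "user", "content": "How many r's are in the word 'strawberry'?"},
    {"role": "assistant", "content": "<think>...</think>\n\nThere are 3 r's."},
    # Turn 2: /no_think switches the model to direct answers.
    {"role": "user", "content": "And in 'blueberry'? /no_think"},
    {"role": "assistant", "content": "There are 2 r's."},
    # Turn 3: /think switches reasoning back on for this turn.
    {"role": "user", "content": "Are you sure? Double-check. /think"},
]
```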
Thinking mode: `enable_thinking=True`

By default, Qwen3 has thinking enabled. When you call `tokenizer.apply_chat_template`, you don't need to set anything manually. In thinking mode, the model generates an extra `<think>...</think>` block before the final answer, which lets it "plan" and sharpen its responses.
Non-thinking mode: `enable_thinking=False`

Setting this makes Qwen3 skip all the thinking steps and behave like a normal LLM. This mode provides the final response directly, with no `<think>` blocks and no chain-of-thought.
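A minimal sketch of toggling the mode via the tokenizer (assuming a recent transformers version; the model name is one of the Unsloth uploads mentioned above):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("unsloth/Qwen3-8B")
messages = [{"role": "user", "content": "What is 2+2?"}]

# Thinking mode (the default): the model emits <think>...</think> itself.
prompt_thinking = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True,
    enable_thinking=True,
)

# Non-thinking mode: the template appends an empty <think></think> pair,
# so the model skips straight to the final answer.
prompt_direct = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True,
    enable_thinking=False,
)
```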
Run the model! Note you can call `ollama serve` in another terminal if it fails! We include all our fixes and suggested parameters (temperature, etc.) in `params` in our Hugging Face upload!
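For example, a minimal sketch (the model tag is an assumption; any Unsloth Qwen3 GGUF upload on Hugging Face follows the same `hf.co/<repo>:<quant>` pattern):

```python
import subprocess

# Equivalent to typing `ollama run hf.co/unsloth/Qwen3-8B-GGUF:Q4_K_M` in a
# terminal; run `ollama serve` in another terminal first if this fails.
subprocess.run(["ollama", "run", "hf.co/unsloth/Qwen3-8B-GGUF:Q4_K_M"])
```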
To disable thinking, append `/no_think` to your prompt (or set it in the system prompt).
If you're experiencing any looping, Ollama might have set your context length window to 2,048 or so. If so, bump it up to 32,000 and see if the issue persists.
Download the model (after installing huggingface_hub and hf_transfer with pip install huggingface_hub hf_transfer). You can choose Q4_K_M or other quantized versions.
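A minimal download sketch (the repo ID is an assumption; pick whichever Unsloth Qwen3 GGUF size you want):

```python
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"  # faster downloads via hf_transfer

from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="unsloth/Qwen3-32B-GGUF",
    local_dir="Qwen3-32B-GGUF",
    allow_patterns=["*Q4_K_M*"],  # grab only the Q4_K_M quant
)
```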
Run the model and try any prompt. To disable thinking, append `/no_think` to your prompt (or set it in the system prompt).
For Qwen3-235B-A22B, we will specifically use llama.cpp for optimized inference and a plethora of options. We follow similar steps to the above, but this time we also need some extra steps because the model is so big.
Download the model (after installing huggingface_hub and hf_transfer with pip install huggingface_hub hf_transfer). You can choose UD-Q2_K_XL or other quantized versions.
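The same pattern as before, as a sketch (the repo ID is an assumption; the UD-Q2_K_XL shards are large, so check your disk space):

```python
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="unsloth/Qwen3-235B-A22B-GGUF",
    local_dir="Qwen3-235B-A22B-GGUF",
    allow_patterns=["*UD-Q2_K_XL*"],  # grab only the UD-Q2_K_XL quant
)
```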
Run the model and try any prompt.
- Edit `--threads 32` for the number of CPU threads, `--ctx-size 16384` for context length, and `--n-gpu-layers 99` for how many layers to offload to the GPU. Try lowering it if your GPU runs out of memory, and remove it entirely for CPU-only inference.
- Use `-ot ".ffn_.*_exps.=CPU"` to offload all MoE layers to the CPU! This effectively lets you fit all non-MoE layers on 1 GPU, improving generation speed. You can customize the regex expression to keep more layers on the GPU if you have more capacity. A full invocation is sketched below.
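Putting those flags together, a sketch of the full invocation, wrapped in Python for convenience (the binary and GGUF paths are assumptions based on the download step above; adjust them to your layout):

```python
import subprocess

# Equivalent to running llama-cli directly in a shell with the flags above.
subprocess.run([
    "./llama.cpp/llama-cli",
    "--model", "Qwen3-235B-A22B-GGUF/Qwen3-235B-A22B-UD-Q2_K_XL-00001-of-00002.gguf",
    "--threads", "32",           # CPU threads
    "--ctx-size", "16384",       # context length
    "--n-gpu-layers", "99",      # layers offloaded to the GPU
    "-ot", ".ffn_.*_exps.=CPU",  # keep all MoE expert layers on the CPU
])
```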
Unsloth makes Qwen3 fine-tuning 2x faster, uses 70% less VRAM, and supports 8x longer context lengths. Qwen3 (14B) fits comfortably in a Google Colab 16GB VRAM Tesla T4 GPU.
Because Qwen3 supports both reasoning and non-reasoning, you can fine-tune it with a non-reasoning dataset, but this may affect its reasoning ability. If you want to maintain its reasoning capabilities (optional), use a mix of direct answers and chain-of-thought examples: roughly 75% reasoning and 25% non-reasoning lets the model retain its reasoning capabilities.
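A minimal sketch of that 75/25 mix using the datasets library (the dataset IDs are assumptions that follow the datasets named below; swap in the ones the notebook actually loads):

```python
from datasets import load_dataset

# Reasoning (chain-of-thought) data and plain conversational data.
reasoning = load_dataset("unsloth/OpenMathReasoning-mini", split="cot")
chat = load_dataset("mlabonne/FineTome-100k", split="train")

# Keep every reasoning row and sample chat rows so the final mix is
# roughly 75% reasoning / 25% non-reasoning (a 3:1 ratio).
n_chat = min(len(chat), len(reasoning) // 3)
chat = chat.shuffle(seed=3407).select(range(n_chat))

# Both subsets still need to be rendered into one shared text format
# (e.g. via tokenizer.apply_chat_template) before training.
```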
Our Conversational notebook uses a combo of 75% NVIDIA's open-math-reasoning dataset and 25% Maxime's FineTome dataset (non-reasoning). Here are free Unsloth Colab notebooks to fine-tune Qwen3 (listed at the end of this guide).
If you have an old version of Unsloth and/or are fine-tuning locally, install the latest version of Unsloth (for example, `pip install --upgrade unsloth`).
Fine-tuning support includes MoE models: 30B-A3B and 235B-A22B. Qwen3-30B-A3B works on just 17.5GB VRAM with Unsloth. When fine-tuning MoE models, it's probably not a good idea to fine-tune the router layer, so we disable it by default.
The 30B-A3B fits in 17.5GB VRAM, but you may lack RAM or disk space, since the full 16-bit model must be downloaded and converted to 4-bit on the fly for QLoRA fine-tuning. This is due to issues importing 4-bit BnB MoE models directly, and it only affects MoE models.
If you're fine-tuning the MoE models, please use `FastModel` and not `FastLanguageModel`.
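A minimal loading sketch under those defaults (the model name and settings are assumptions; adjust them to your setup):

```python
from unsloth import FastModel

model, tokenizer = FastModel.from_pretrained(
    model_name="unsloth/Qwen3-30B-A3B",
    max_seq_length=2048,
    load_in_4bit=True,  # QLoRA: the 16-bit weights are quantized on the fly
)
```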
To use the notebooks, just click Runtime, then Run all. You can change settings in the notebook to whatever you desire; we have set them automatically by default. Change the model name to whatever you like by matching it with the model's name on Hugging Face, e.g. 'unsloth/Qwen3-8B' or 'unsloth/Qwen3-0.6B-unsloth-bnb-4bit'.
There are other settings which you can toggle:
- `max_seq_length = 2048` – Controls context length. While Qwen3 supports 40960, we recommend 2048 for testing. Unsloth enables 8× longer context fine-tuning.
- `load_in_4bit = True` – Enables 4-bit quantization, reducing memory use 4× for fine-tuning on 16GB GPUs.
- For full fine-tuning, set `full_finetuning = True`; for 8-bit fine-tuning, set `load_in_8bit = True`. A loading sketch follows below.
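Here is how those toggles fit together, as a sketch (enable only one of the 4-bit, 8-bit, or full fine-tuning options at a time):

```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3-14B",
    max_seq_length=2048,     # context length used for fine-tuning
    load_in_4bit=True,       # 4-bit QLoRA, fits 16GB GPUs
    # load_in_8bit=True,     # or: 8-bit fine-tuning
    # full_finetuning=True,  # or: full fine-tuning, no quantization
)
```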
We made a new advanced GRPO notebook for fine-tuning Qwen3. Learn to use our new proximity-based reward function (closer answers = rewarded) and Hugging Face's Open-R1 math dataset. Unsloth now also has better evaluations and uses the latest version of vLLM.
Learn about:
- Enabling reasoning in Qwen3 (Base) and guiding it to do a specific task
- Pre-finetuning to bypass GRPO's tendency to learn formatting
- Improved evaluation accuracy via new regex matching
- Custom GRPO templates beyond just 'think', e.g. <start_working_out>...<end_working_out>
- Proximity-based scoring: better answers earn more points (e.g., predicting 9 when the answer is 10) and outliers are penalized (a toy sketch follows below)
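As a toy illustration of that proximity-based idea (this is not the notebook's actual reward function):

```python
def proximity_reward(guess: float, answer: float) -> float:
    """Reward numeric answers by how close they land to the truth."""
    error = abs(guess - answer)
    if error == 0:
        return 2.0   # exact match earns the full reward
    if error <= 1:
        return 1.0   # near misses (e.g. 9 when the answer is 10) still score
    if error > abs(answer):
        return -1.0  # wild outliers are penalized
    return 0.0       # everything else is neutral
```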
Qwen3 models come with a built-in "thinking mode" to boost reasoning and improve response quality, similar to how previous reasoning models worked. Instructions for switching modes differ depending on the inference engine you're using, so make sure you use the correct ones.
Install `ollama` if you haven't already! You can only run models up to 32B in size. To run the full 235B-A22B model, use the llama.cpp instructions above.
Obtain the latest `llama.cpp` from GitHub. You can follow the build instructions below as well. Change `-DGGML_CUDA=ON` to `-DGGML_CUDA=OFF` if you don't have a GPU or just want CPU inference.
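A sketch of the usual build, wrapped in Python for convenience (the repo URL is llama.cpp's GitHub home; the flags mirror the note above):

```python
import subprocess

# Equivalent to the usual shell steps:
#   git clone https://github.com/ggml-org/llama.cpp
#   cmake llama.cpp -B llama.cpp/build -DGGML_CUDA=ON
#   cmake --build llama.cpp/build --config Release -j
# Use -DGGML_CUDA=OFF instead for CPU-only inference.
subprocess.run(["git", "clone", "https://github.com/ggml-org/llama.cpp"], check=True)
subprocess.run(["cmake", "llama.cpp", "-B", "llama.cpp/build", "-DGGML_CUDA=ON"], check=True)
subprocess.run(["cmake", "--build", "llama.cpp/build", "--config", "Release", "-j"], check=True)
```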
Free Qwen3 Colab notebooks:
- Qwen3 (14B) notebook (recommended)
- Qwen3 (Base) notebook - Advanced GRPO LoRA (for Base models)

If you'd like to read a full end-to-end guide on how to use Unsloth notebooks for fine-tuning, or just to learn about fine-tuning, creating datasets, etc., view our fine-tuning guide.