Qwen3: How to Run & Fine-tune
Learn to run & fine-tune Qwen3 locally with Unsloth + our Dynamic 2.0 quants
Qwen's new Qwen3 models deliver state-of-the-art advancements in reasoning, instruction-following, agent capabilities, and multilingual support.
NEW! Qwen3 got an update in July 2025. Run & fine-tune the latest model: Qwen3-2507
All uploads use Unsloth Dynamic 2.0 for SOTA 5-shot MMLU and KL Divergence performance, meaning you can run & fine-tune quantized Qwen LLMs with minimal accuracy loss.
We also uploaded Qwen3 with native 128K context length. Qwen achieves this by using YaRN to extend its original 40K window to 128K.
Unsloth also now supports fine-tuning and Reinforcement Learning (RL) of Qwen3 and Qwen3 MoE models, with 2x faster training, 70% less VRAM, and 8x longer context lengths. Fine-tune Qwen3 (14B) for free using our Colab notebook.
Qwen3 - Unsloth Dynamic 2.0 with optimal configs:
Running Qwen3
To achieve inference speeds of 6+ tokens per second, we recommend that your available memory match or exceed the size of the quant you're using. For example, a 30GB quantized model needs at least 30GB of memory, and the Q2_K_XL quant, which is 180GB, will require at least 180GB of unified memory (VRAM + RAM) or 180GB of RAM for optimal performance.
NOTE: It's possible to run the model with less total memory than its size (i.e., less VRAM, less RAM, or a lower combined total), but inference will be slower. Matching the model size in memory is only needed if you want to maximize throughput and achieve the fastest inference times.
Official Recommended Settings
According to Qwen, these are the recommended settings for inference:

Non-thinking mode:
Temperature = 0.7
Min_P = 0.0 (optional, but 0.01 works well; llama.cpp default is 0.1)
Top_P = 0.8
TopK = 20

Thinking mode:
Temperature = 0.6
Min_P = 0.0
Top_P = 0.95
TopK = 20
Chat template/prompt format:
For non-thinking mode, we purposely make the <think> and </think> tags enclose nothing (an empty think block):
For thinking mode, DO NOT use greedy decoding, as it can lead to performance degradation and endless repetitions.
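As a rough sketch of how those settings plug into a transformers generation call (the model name is just an assumed example, and min_p requires a recent transformers release):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "unsloth/Qwen3-8B"  # assumed example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "Explain the birthday paradox."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(
    inputs,
    max_new_tokens=1024,
    do_sample=True,    # never greedy decode in thinking mode
    temperature=0.6,   # thinking-mode values; use 0.7 / top_p=0.8 for non-thinking
    top_p=0.95,
    top_k=20,
    min_p=0.0,
)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```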
Switching Between Thinking and Non-Thinking Mode
Qwen3 models come with a built-in "thinking mode" to boost reasoning and improve response quality, similar to how QwQ-32B worked. The instructions for switching differ depending on the inference engine you're using, so make sure you follow the ones for your setup.
Instructions for llama.cpp and Ollama:
You can add /think and /no_think to user prompts or system messages to switch the model's thinking mode from turn to turn. The model will follow the most recent instruction in multi-turn conversations.
Here is an example of a multi-turn conversation:
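An illustrative sketch (the prompts here are made up) of how the soft switch rides along inside the user messages:

```python
# Hypothetical multi-turn history: the model honors the most recent
# /think or /no_think switch it has seen.
messages = [
    {"role": "user", "content": "How many r's are in 'strawberries'? /no_think"},
    {"role": "assistant", "content": "There are 3 r's in 'strawberries'."},
    {"role": "user", "content": "Are you sure? Think it through carefully. /think"},
]
```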
Instructions for transformers and vLLM:
Thinking mode:
enable_thinking=True
By default, Qwen3 has thinking enabled. When you call `tokenizer.apply_chat_template`, you don't need to set anything manually.
In thinking mode, the model will generate an extra <think>...</think> block before the final answer, which lets it "plan" and sharpen its responses.
Non-thinking mode:
enable_thinking=False
Enabling non-thinking mode makes Qwen3 skip all thinking steps and behave like a normal LLM.
This mode provides final responses directly: no <think> blocks, no chain-of-thought.
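A minimal sketch of both modes with transformers, assuming the `unsloth/Qwen3-8B` upload; the flag is passed straight through `apply_chat_template`:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("unsloth/Qwen3-8B")  # assumed checkpoint

messages = [{"role": "user", "content": "Give me a haiku about GPUs."}]

# Thinking mode (default): the template leaves room for a <think>...</think> block.
thinking_prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
)

# Non-thinking mode: the template inserts an empty <think></think> pair,
# so the model answers directly with no chain-of-thought.
direct_prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
)
```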
Ollama: Run Qwen3 Tutorial
Install `ollama` if you haven't already! You can only run models up to 32B in size. To run the full 235B-A22B model, see here.
Run the model! Note you can call `ollama serve` in another terminal if it fails! We include all our fixes and suggested parameters (temperature etc.) in `params` in our Hugging Face upload!
To disable thinking, append /no_think to your prompt, or set it in the system prompt.
If you're experiencing any looping, Ollama might have set your context length to 2,048 tokens or so. If so, bump it up to 32,000 and see if the issue persists.
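If you prefer scripting over the CLI, the `ollama` Python package (a separate `pip install ollama`) can drive the same local model; the model tag below is only an assumption, so use whichever GGUF you actually pulled:

```python
# Sketch: chat with a locally pulled Qwen3 GGUF through the ollama Python client.
# The model tag is an assumption; match it to the repo/quant you pulled.
import ollama

response = ollama.chat(
    model="hf.co/unsloth/Qwen3-30B-A3B-GGUF:Q4_K_M",
    messages=[{"role": "user", "content": "What is 2 + 2? /no_think"}],
)
print(response["message"]["content"])
```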
Llama.cpp: Run Qwen3 Tutorial
Obtain the latest `llama.cpp` on GitHub here. You can follow the build instructions below as well. Change `-DGGML_CUDA=ON` to `-DGGML_CUDA=OFF` if you don't have a GPU or just want CPU inference.
Download the model via the snippet below (after installing `pip install huggingface_hub hf_transfer`). You can choose Q4_K_M or other quantized versions.
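For example, a sketch of the download step using `snapshot_download`; the repo id and quant pattern are assumptions, so adjust them to the upload and quant you want:

```python
# Download only the Q4_K_M GGUF files from an Unsloth Qwen3 repo.
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"  # faster downloads via hf_transfer

from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="unsloth/Qwen3-32B-GGUF",     # assumed repo; pick the model you want
    local_dir="Qwen3-32B-GGUF",
    allow_patterns=["*Q4_K_M*"],          # assumed quant; e.g. "*UD-Q2_K_XL*"
)
```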
Run the model and try any prompt.
To disable thinking, append /no_think to your prompt, or set it in the system prompt.
Running Qwen3-235B-A22B
For Qwen3-235B-A22B, we will specifically use llama.cpp for optimized inference and its many offloading options.
We follow similar steps to those above, but this time we also need a few extra steps because the model is so big.
Download the model (after installing `pip install huggingface_hub hf_transfer`). You can choose UD-Q2_K_XL or other quantized versions.
Run the model and try any prompt.
Edit `--threads 32` for the number of CPU threads, `--ctx-size 16384` for context length, and `--n-gpu-layers 99` for how many layers to offload to the GPU. Lower it if your GPU runs out of memory, and remove it entirely for CPU-only inference.
Use `-ot ".ffn_.*_exps.=CPU"` to offload all MoE layers to the CPU! This effectively lets you fit all non-MoE layers on a single GPU, improving generation speed. You can customize the regex to keep more layers on the GPU if you have spare capacity.
Fine-tuning Qwen3 with Unsloth
Unsloth makes Qwen3 fine-tuning 2x faster, uses 70% less VRAM, and supports 8x longer context lengths. Qwen3 (14B) fits comfortably in a Google Colab 16GB VRAM Tesla T4 GPU.
Because Qwen3 supports both reasoning and non-reasoning, you can fine-tune it with a non-reasoning dataset, but this may affect its reasoning ability. If you want to maintain its reasoning capabilities (optional), use a mix of direct answers and chain-of-thought examples: roughly 75% reasoning and 25% non-reasoning examples helps the model retain its reasoning ability.
Our Conversational notebook uses a combo of 75% NVIDIA's open-math-reasoning dataset and 25% Maxime's FineTome dataset (non-reasoning). Here are free Unsloth Colab notebooks to fine-tune Qwen3:
Qwen3 (14B) Reasoning + Conversational notebook (recommended)
Qwen3 (4B) - Advanced GRPO LoRA
Qwen3 (14B) Alpaca notebook (for Base models)
If you have an old version of Unsloth and/or are fine-tuning locally, install the latest version of Unsloth.
Fine-tuning Qwen3 MoE models
Fine-tuning support includes the MoE models: 30B-A3B and 235B-A22B. Qwen3-30B-A3B works on just 17.5GB VRAM with Unsloth. On fine-tuning MoEs: it's probably not a good idea to fine-tune the router layer, so we disable it by default.
The 30B-A3B fits in 17.5GB VRAM, but you may lack RAM or disk space since the full 16-bit model must be downloaded and converted to 4-bit on the fly for QLoRA fine-tuning. This is due to issues importing 4-bit BnB MoE models directly, and it only affects MoE models.
If you're fine-tuning the MoE models, please use `FastModel`, not `FastLanguageModel`.
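A minimal sketch of loading a Qwen3 MoE model with `FastModel`, assuming the `unsloth/Qwen3-30B-A3B` upload and default QLoRA-style hyperparameters:

```python
# Sketch: load a Qwen3 MoE model with FastModel (not FastLanguageModel).
# Model name and hyperparameters are assumptions; adjust to your setup.
from unsloth import FastModel

model, tokenizer = FastModel.from_pretrained(
    model_name="unsloth/Qwen3-30B-A3B",
    max_seq_length=2048,
    load_in_4bit=True,   # QLoRA: the 16-bit weights are converted to 4-bit on the fly
)

# Attach LoRA adapters; the router layer is left untrained by default.
model = FastModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
)
```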
Notebook Guide:

To use the notebooks, just click Runtime, then Run all. You can change the settings in the notebook to whatever you like; we set sensible defaults automatically. Change the model name to any Qwen3 model on Hugging Face, e.g. 'unsloth/Qwen3-8B' or 'unsloth/Qwen3-0.6B-unsloth-bnb-4bit'.
There are other settings which you can toggle:
`max_seq_length = 2048` - Controls context length. While Qwen3 supports 40960, we recommend 2048 for testing. Unsloth enables 8x longer context fine-tuning.
`load_in_4bit = True` - Enables 4-bit quantization, reducing memory use 4x for fine-tuning on 16GB GPUs.
For full fine-tuning, set `full_finetuning = True`; for 8-bit fine-tuning, set `load_in_8bit = True`.
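Outside the notebooks, the same toggles map onto the load call. A sketch, assuming the `unsloth/Qwen3-14B` upload:

```python
# Sketch of the notebook's toggles in code; the model name is an assumption.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3-14B",
    max_seq_length=2048,      # Qwen3 supports 40960; 2048 is plenty for testing
    load_in_4bit=True,        # 4-bit QLoRA; set False for 16-bit LoRA
    # full_finetuning=True,   # uncomment for full fine-tuning
    # load_in_8bit=True,      # uncomment for 8-bit fine-tuning
)
```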
If you'd like to read a full end-to-end guide on how to use Unsloth notebooks for fine-tuning or just learn about fine-tuning, creating datasets etc., view our complete guide here:
Fine-tuning Guide
Datasets Guide
GRPO with Qwen3
We made a new advanced GRPO notebook for fine-tuning Qwen3. Learn to use our new proximity-based reward function (closer answers = rewarded) and Hugging Face's Open-R1 math dataset. Unsloth now also has better evaluations and uses the latest version of vLLM.
Qwen3 (4B) notebook - Advanced GRPO LoRA
Learn about:
Enabling reasoning in Qwen3 (Base) and guiding it to do a specific task
Pre-finetuning to bypass GRPO's tendency to learn formatting
Improved evaluation accuracy via new regex matching
Custom GRPO templates beyond just 'think', e.g. <start_working_out> ... <end_working_out>
Proximity-based scoring: better answers earn more points (e.g., predicting 9 when the answer is 10) and outliers are penalized
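As a rough illustration of the idea (not the notebook's exact reward function), a proximity-based reward can grant partial credit for near-misses and penalize far-off answers:

```python
# Illustrative proximity-based reward: closer numeric answers earn partial
# credit, exact matches earn full reward, and outliers are penalized.
def proximity_reward(predicted: str, answer: str) -> float:
    try:
        pred, true = float(predicted), float(answer)
    except ValueError:
        return -1.0                      # unparseable answer: penalize
    if pred == true:
        return 3.0                       # exact match: full reward
    error = abs(pred - true) / max(abs(true), 1.0)
    if error <= 0.1:                     # e.g. predicting 9 when the answer is 10
        return 1.5
    if error <= 0.5:
        return 0.5
    return -0.5                          # far-off outlier: negative reward

print(proximity_reward("9", "10"))   # 1.5
print(proximity_reward("10", "10"))  # 3.0
```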
