Qwen3: How to Run & Fine-tune
Learn to run & fine-tune Qwen3 locally with Unsloth + our Dynamic 2.0 quants
Qwen's new Qwen3 models deliver state-of-the-art advancements in reasoning, instruction-following, agent capabilities, and multilingual support. All Qwen3 uploads use our new Unsloth Dynamic 2.0 methodology, delivering the best performance on 5-shot MMLU and KL Divergence benchmarks. This means you can run and fine-tune quantized Qwen3 LLMs with minimal accuracy loss!
We also uploaded Qwen3 with native 128K context length. Qwen achieves this by using YaRN to extend its original 40K window to 128K.
UPDATE (04/29/2025): Our GGUFs now work on ALL inference engines, including llama.cpp, Ollama, LM Studio & Open WebUI.
Qwen3-30B-A3B is now fixed! All uploads are now fixed and will work anywhere with any quant!
According to Qwen, these are the recommended settings for inference:

| Setting | Non-Thinking Mode | Thinking Mode |
| --- | --- | --- |
| Temperature | 0.7 | 0.6 |
| Min_P | 0.0 (optional, but 0.01 works well; llama.cpp default is 0.1) | 0.0 |
| Top_P | 0.8 | 0.95 |
| TopK | 20 | 20 |
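As a minimal sketch of wiring these values into Hugging Face transformers (assuming a recent transformers version that supports min_p sampling); the same numbers map one-to-one to llama.cpp's --temp, --top-p, --top-k and --min-p flags:

```python
from transformers import GenerationConfig

# Sampling presets built from the table above.
thinking_cfg = GenerationConfig(
    do_sample=True,  # never use greedy decoding in thinking mode
    temperature=0.6, top_p=0.95, top_k=20, min_p=0.0,
)
non_thinking_cfg = GenerationConfig(
    do_sample=True,
    temperature=0.7, top_p=0.8, top_k=20, min_p=0.0,
)

# Later, pass one of them to generate():
# model.generate(**inputs, generation_config=thinking_cfg)
```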
Chat template/prompt format:
For non-thinking mode, we purposely enclose <think> and </think> with nothing in between:
For thinking mode, DO NOT use greedy decoding, as it can lead to performance degradation and endless repetitions.
enable_thinking=True
By default, Qwen3 has thinking enabled. When you call tokenizer.apply_chat_template, you don't need to set anything manually.
In thinking mode, the model will generate an extra <think>...</think> block before the final answer; this lets it "plan" and sharpen its responses.
enable_thinking=False
Enabling non-thinking mode makes Qwen3 skip all the thinking steps and behave like a normal LLM.
This mode will provide final responses directly, with no <think> blocks and no chain-of-thought.
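As a short sketch of toggling this from Python (the checkpoint name is just an example; any Qwen3 chat model behaves the same way):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")  # example checkpoint

messages = [{"role": "user", "content": "How many r's are in 'strawberry'?"}]

# Thinking mode (the default): the template leaves room for a <think>...</think> block.
thinking_prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
)

# Non-thinking mode: the template closes the think block immediately,
# so the model answers directly with no chain-of-thought.
direct_prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
)

print(direct_prompt)
```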
Run the model! Note you can call ollama serve in another terminal if it fails! We include all our fixes and suggested parameters (temperature etc.) in params in our Hugging Face upload!
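For example (the repo tag below is only an illustration; point it at whichever Unsloth Qwen3 GGUF and quant you want), this is the equivalent of typing ollama run in a terminal, wrapped in Python for consistency with the other snippets:

```python
import subprocess

# Example tag only: swap in the Qwen3 size and quant you prefer.
model = "hf.co/unsloth/Qwen3-30B-A3B-GGUF:Q4_K_M"

# Same as running `ollama run <model>` in a terminal.
# If this errors, start the server first with `ollama serve` in another terminal.
subprocess.run(["ollama", "run", model], check=True)
```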
Download the model via the snippet below (after installing pip install huggingface_hub hf_transfer). You can choose Q4_K_M, or other quantized versions.
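A minimal download sketch using huggingface_hub (the repo id is an example; pick the Qwen3 size you want and adjust the quant pattern):

```python
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"  # enable faster hf_transfer downloads

from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="unsloth/Qwen3-30B-A3B-GGUF",  # example repo; any Qwen3 GGUF upload works
    local_dir="Qwen3-30B-A3B-GGUF",
    allow_patterns=["*Q4_K_M*"],           # only download the Q4_K_M quant
)
```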
Run the model and try any prompt.
For Qwen3-235B-A22B, we will specifically use llama.cpp for optimized inference and a plethora of options. We're still doing some testing, so we'd recommend waiting for now.
We follow similar steps to those above, but this time we also need a few extra steps because the model is so big.
Download the model via the snippet below (after installing pip install huggingface_hub hf_transfer). You can choose UD_IQ2_XXS, or other quantized versions.
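Same idea as before, but restricted to one quant so you don't pull the entire repo (the repo id and folder pattern are examples; match them to the actual file names listed on the Hugging Face page):

```python
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"  # enable faster hf_transfer downloads

from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="unsloth/Qwen3-235B-A22B-GGUF",  # example repo id
    local_dir="Qwen3-235B-A22B-GGUF",
    allow_patterns=["*UD_IQ2_XXS*"],         # only the UD_IQ2_XXS shards; adjust to the listed folder name
)
```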
Run the model and try any prompt.
Edit --threads 32 for the number of CPU threads, --ctx-size 16384 for context length, and --n-gpu-layers 99 for how many layers to offload to the GPU. Try lowering it if your GPU runs out of memory, and remove it entirely for CPU-only inference.
Use -ot ".ffn_.*_exps.=CPU" to offload all MoE layers to the CPU! This effectively lets you fit all non-MoE layers on a single GPU, improving generation speed. You can customize the regular expression to keep more layers on the GPU if you have more GPU capacity.
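Putting those flags together, here is a sketch of a full launch command (the .gguf path is a placeholder; for sharded quants, point at the first shard). It is written in Python only for consistency with the other snippets; you can run the same llama-cli command directly in a shell:

```python
import subprocess

# Placeholder path: point this at the first shard of the quant you downloaded.
model_path = "Qwen3-235B-A22B-GGUF/UD_IQ2_XXS/Qwen3-235B-A22B-UD_IQ2_XXS-00001-of-00003.gguf"

subprocess.run([
    "./llama.cpp/build/bin/llama-cli",
    "-m", model_path,
    "--threads", "32",            # number of CPU threads
    "--ctx-size", "16384",        # context length
    "--n-gpu-layers", "99",       # layers offloaded to GPU; remove for CPU-only inference
    "-ot", ".ffn_.*_exps.=CPU",   # keep all MoE expert layers on the CPU
    "--temp", "0.6",              # thinking-mode sampling settings from the table above
    "--top-p", "0.95",
    "--top-k", "20",
    "--min-p", "0.0",
    "-p", "Write a short poem about llamas.",
], check=True)
```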
Notebooks coming soon!
Qwen3 models come with a built-in "thinking mode" to boost reasoning and improve response quality, similar to earlier reasoning models.
Install ollama if you haven't already! You can only run models up to 32B in size. To run the full 235B-A22B model, use the llama.cpp instructions above.
Obtain the latest llama.cpp from GitHub. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.
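As a sketch of the usual CMake build steps (adjust flags to your hardware), again wrapped in Python so all snippets share one language; the same three commands can be run directly in a shell:

```python
import subprocess

def run(*cmd: str) -> None:
    """Run a command and raise if it fails."""
    subprocess.run(cmd, check=True)

# Clone llama.cpp and build it with CUDA enabled.
# Swap -DGGML_CUDA=ON for -DGGML_CUDA=OFF if you only want CPU inference.
run("git", "clone", "https://github.com/ggml-org/llama.cpp")
run("cmake", "llama.cpp", "-B", "llama.cpp/build", "-DGGML_CUDA=ON")
run("cmake", "--build", "llama.cpp/build", "--config", "Release", "-j")
```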