🌠 QwQ-32B: How to Run Effectively
How to run QwQ-32B effectively with our bug fixes and without endless generations + GGUFs.
Qwen released QwQ-32B, a reasoning model with performance comparable to DeepSeek-R1 on many benchmarks. However, people have been experiencing infinite generations, many repetitions, <think> token issues, and finetuning issues. We hope this guide will help debug and fix most issues!
Unsloth QwQ-32B uploads with our bug fixes: GGUFs at https://huggingface.co/unsloth/QwQ-32B-GGUF and dynamic 4-bit quants at https://huggingface.co/unsloth/QwQ-32B-unsloth-bnb-4bit.
⚙️ Official Recommended Settings
According to Qwen, these are the recommended settings for inference:
Temperature of 0.6
Top_K of 40 (or 20 to 40)
Min_P of 0.00 (optional, but 0.01 works well, llama.cpp default is 0.1)
Top_P of 0.95
Repetition Penalty of 1.0. (1.0 means disabled in llama.cpp and transformers)
Chat template:
<|im_start|>user\nCreate a Flappy Bird game in Python.<|im_end|>\n<|im_start|>assistant\n<think>\n
llama.cpp uses min_p = 0.1 by default, which might cause issues. Force it to 0.0.
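For reference, with llama.cpp's llama-cli the settings above map to flags along these lines (a sketch; other frontends name these options differently):

```bash
--temp 0.6 --top-k 40 --top-p 0.95 --min-p 0.0 --repeat-penalty 1.0
```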
👍 Recommended settings for llama.cpp
We noticed many people use a Repetition Penalty greater than 1.0, for example 1.1 to 1.5. This actually interferes with llama.cpp's sampling mechanisms: the goal of a repetition penalty is to penalize repeated generations, but we found it doesn't work as expected.
Turning the Repetition Penalty off also works (i.e. setting it to 1.0), but we found using it to be helpful for penalizing endless generations.
To use it, we found you must also edit the ordering of samplers in llama.cpp so that the other samplers are applied before the Repetition Penalty; otherwise there will be endless generations. So add this:
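```
--samplers "top_k;top_p;min_p;temperature;dry;typ_p;xtc"
```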
By default, llama.cpp uses this ordering:
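This is our reading of the defaults at the time of writing (the default sampler chain changes between llama.cpp versions, so check `llama-cli --help` on your build):

```
--samplers "penalties;dry;top_k;typ_p;top_p;min_p;xtc;temperature"
```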
We essentially reorder temperature and dry, and move min_p forward. This means we apply samplers in this order:
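top_k → top_p → min_p → temperature → dry → typ_p → xtc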
If you still encounter issues, you can increase --repeat-penalty from 1.0 to 1.2 or 1.3.
Courtesy of @krist486 for bringing llama.cpp's sampler ordering to our attention.
☀️ Dry Repetition Penalty
We investigated usage of the dry penalty as suggested in https://github.com/ggml-org/llama.cpp/blob/master/examples/main/README.md with a value of 0.8, but we actually found it to rather cause syntax issues, especially for coding. That said, if you still encounter endless generations, you can try increasing the dry penalty to 0.8.
Utilizing our swapped sampler ordering can also help if you decide to use the dry penalty.
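In llama.cpp the dry penalty strength is set via --dry-multiplier, so a combined invocation looks something like this (0.5 is the value we use in the tutorial below):

```
--dry-multiplier 0.5 --samplers "top_k;top_p;min_p;temperature;dry;typ_p;xtc"
```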
🦙 Tutorial: How to Run QwQ-32B in Ollama
Install ollama if you haven't already!
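On Linux, a common way to install it is via the official script (see https://ollama.com for other platforms):

```bash
curl -fsSL https://ollama.com/install.sh | sh
```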
Run the model! Note you can call ollama serve in another terminal if it fails! We include all our fixes and suggested parameters (temperature, min_p etc.) in the params file of our Hugging Face upload!
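For example, to pull and run our Q4_K_M GGUF straight from Hugging Face (this assumes Ollama's hf.co/<repo>:<quant> syntax; adjust the quant tag to the file you want):

```bash
ollama run hf.co/unsloth/QwQ-32B-GGUF:Q4_K_M
```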
📖 Tutorial: How to Run QwQ-32B in llama.cpp
Obtain the latest llama.cpp from GitHub at https://github.com/ggml-org/llama.cpp. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.
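A minimal build sketch, assuming a Linux box with CUDA (package names and build targets may differ on your system):

```bash
apt-get update
apt-get install build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first \
    --target llama-cli llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp/
```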
Download the model (after installing the tooling via pip install huggingface_hub hf_transfer). You can choose Q4_K_M or other quantized versions (like BF16 full precision). More versions are at: https://huggingface.co/unsloth/QwQ-32B-GGUF
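One way to grab just the Q4_K_M files with the Hugging Face CLI (the local directory name is our choice; hf_transfer is optional but speeds up downloads):

```bash
pip install huggingface_hub hf_transfer
export HF_HUB_ENABLE_HF_TRANSFER=1   # optional: faster downloads
huggingface-cli download unsloth/QwQ-32B-GGUF \
    --include "*Q4_K_M*" \
    --local-dir unsloth-QwQ-32B-GGUF
```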
Run Unsloth's Flappy Bird test, which will save the output to Q4_K_M_yes_samplers.txt. Edit --threads 32 for the number of CPU threads, --ctx-size 16384 for the context length, and --n-gpu-layers 99 for how many layers to offload to the GPU; lower it if your GPU runs out of memory, and remove it entirely for CPU-only inference. We use --repeat-penalty 1.1 and --dry-multiplier 0.5, which you can adjust.
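A sketch of the test command, assuming the build and download steps above (the model path follows the download directory, and the prompt here is the short chat-template version; substitute the full Flappy Bird prompt below):

```bash
./llama.cpp/llama-cli \
    --model unsloth-QwQ-32B-GGUF/QwQ-32B-Q4_K_M.gguf \
    --threads 32 --ctx-size 16384 --n-gpu-layers 99 \
    --temp 0.6 --top-k 40 --top-p 0.95 --min-p 0.0 \
    --repeat-penalty 1.1 --dry-multiplier 0.5 \
    --samplers "top_k;top_p;min_p;temperature;dry;typ_p;xtc" \
    -no-cnv \
    --prompt "<|im_start|>user\nCreate a Flappy Bird game in Python.<|im_end|>\n<|im_start|>assistant\n<think>\n" \
    2>&1 | tee Q4_K_M_yes_samplers.txt
```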
The full input, taken from our 1.58-bit dynamic quant blog at https://unsloth.ai/blog/deepseekr1-dynamic, is:
The beginning and the end of the final Python output after removing the thinking parts:
When running it, we get a runnable game!

Now try the same without our fixes! Remove --samplers "top_k;top_p;min_p;temperature;dry;typ_p;xtc" from the command. This will save the output to Q4_K_M_no_samplers.txt.
You will get some looping, but more problematically, incorrect Python syntax and many other issues. For example, the snippet below looks correct but is wrong: line 39, pipes.clear(), fails with NameError: name 'pipes' is not defined. Did you forget to import 'pipes'?
If you use --repeat-penalty 1.5, it gets even worse and more obvious, with totally incorrect syntax.
You might be wondering whether it is just a Q4_K_M problem, and whether BF16 (i.e. full precision) would work fine. Incorrect: the outputs again fail if we do not use our fix of --samplers "top_k;top_p;min_p;temperature;dry;typ_p;xtc" when using a Repetition Penalty.
🌄 Still doesn't work? Try Min_p = 0.1, Temperature = 1.5
According to the Min_p paper (https://arxiv.org/pdf/2407.01082), for more creative and diverse outputs try Min_P = 0.1 with Temperature = 1.5, and if you still see repetitions, try disabling top_p and top_k!
Another approach is to disable min_p directly, since llama.cpp by default uses min_p = 0.1!
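In llama-cli terms, the heading's suggestion corresponds to something like the following (setting --top-k 0 and --top-p 1.0 effectively disables those samplers):

```
--min-p 0.1 --temp 1.5 --top-k 0 --top-p 1.0
```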
🤔 <think> token not shown?
Some people are reporting that because <think> is added by default in the chat template, some systems do not output the thinking traces correctly. You will have to manually edit the Jinja template to remove the <think>\n at the end. The model will then have to add <think>\n itself during inference, which might not always succeed. (DeepSeek also edited all their models to add a <think> token by default, to force the model into reasoning mode.)
So change {%- if add_generation_prompt %} {{- '<|im_start|>assistant\n<think>\n' }} {%- endif %} to {%- if add_generation_prompt %} {{- '<|im_start|>assistant\n' }} {%- endif %}, i.e. remove the trailing <think>\n.
Extra Notes
We first thought that maybe QwQ's context length was not natively 128K, but rather 32K with YaRN extension. For example, in the readme file for https://huggingface.co/Qwen/QwQ-32B we see:
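The readme suggests enabling YaRN for long inputs by adding a rope_scaling entry to config.json, roughly as follows (reproduced approximately; check the model card for the exact snippet):

```json
{
  ...,
  "rope_scaling": {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn"
  }
}
```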
We tried overriding llama.cpp's YaRN handling, but nothing changed.
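For reference, overriding YaRN in llama.cpp is done with rope flags along these lines (values mirror the readme snippet above):

```
--rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768
```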
We also tested whether tokenizer IDs matched between llama.cpp and normal Transformers, courtesy of @kalomaze. They matched, so this was not the culprit.
We provide our experimental results below:
✏️ Tokenizer Bug Fixes
We found a few issues as well specifically impacting finetuning! The EOS token is correct, but the PAD token should probably rather be "<|vision_pad|>". We updated it in https://huggingface.co/unsloth/QwQ-32B/blob/main/tokenizer_config.json.
🛠️ Dynamic 4-bit Quants
We also uploaded dynamic 4-bit quants, which increase accuracy versus naive 4-bit quantizations! We attach the QwQ quantization error plot analysis for both activation and weight quantization errors.
We uploaded dynamic 4-bit quants to: https://huggingface.co/unsloth/QwQ-32B-unsloth-bnb-4bit
As of vLLM 0.7.3 (February 20th, 2025; https://github.com/vllm-project/vllm/releases/tag/v0.7.3), vLLM supports loading Unsloth dynamic 4-bit quants!
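A sketch of serving the dynamic 4-bit upload with vLLM via bitsandbytes loading (flag spellings may vary slightly between vLLM versions):

```bash
pip install -U vllm bitsandbytes
vllm serve unsloth/QwQ-32B-unsloth-bnb-4bit \
    --quantization bitsandbytes --load-format bitsandbytes \
    --max-model-len 16384
```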
All our GGUFs are at https://huggingface.co/unsloth/QwQ-32B-GGUF!