🌠Qwen3-2507
Run Qwen3-235B-A22B-Thinking-2507 and Qwen3-235B-A22B-Instruct-2507 locally on your device!
Qwen released 2507 (July 2025) updates for their Qwen3 235B models, introducing both "thinking" and "non-thinking" variants. The non-thinking Qwen3-235B-A22B-Instruct-2507 features a 256K context window, improved instruction following, multilingual capabilities and alignment. The thinking Qwen3-235B-A22B-Thinking-2507 achieves SOTA performance in reasoning, excelling at logic, math, science, coding, and complex academic tasks requiring expert-level performance.
Unsloth Dynamic 2.0 GGUFs:
Thinking: Qwen3-235B-A22B-Thinking-2507-GGUF
Non-thinking: Qwen3-235B-A22B-Instruct-2507-GGUF
⚙️Best Practices for the Thinking & Instruct model
The settings for the Thinking and Instruct models are different. The Thinking model uses temperature = 0.6 and top_p = 0.95, while the Instruct model uses temperature = 0.7 and top_p = 0.8.
To achieve optimal performance, Qwen recommends these settings:
For the Thinking model:
- temperature = 0.6
- top_p = 0.95
- top_k = 20
- min_p = 0.00 (llama.cpp's default is 0.1)
- presence_penalty = 0.0 to 2.0 (llama.cpp turns it off by default, but you can use it to reduce repetitions)

For the Instruct model:
- temperature = 0.7
- top_p = 0.80
- top_k = 20
- min_p = 0.00 (llama.cpp's default is 0.1)
- presence_penalty = 0.0 to 2.0 (llama.cpp turns it off by default, but you can use it to reduce repetitions)
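As a quick illustration, these sampler settings map directly onto the request body of any OpenAI-compatible endpoint (such as llama.cpp's llama-server). The model name and helper below are hypothetical, just to show where each parameter goes:

```python
# Sketch: selecting the recommended sampler settings for a
# /v1/chat/completions request. The model name is a placeholder.
import json

THINKING = {"temperature": 0.6, "top_p": 0.95, "top_k": 20,
            "min_p": 0.0, "presence_penalty": 1.0}
INSTRUCT = {"temperature": 0.7, "top_p": 0.8, "top_k": 20,
            "min_p": 0.0, "presence_penalty": 1.0}

def build_payload(prompt: str, thinking: bool = False) -> dict:
    """Attach the right sampler settings for the chosen variant."""
    params = THINKING if thinking else INSTRUCT
    return {
        "model": "qwen3-235b-a22b-2507",  # placeholder name
        "messages": [{"role": "user", "content": prompt}],
        **params,
    }

print(json.dumps(build_payload("What is 1+1?", thinking=True), indent=2))
```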
Adequate Output Length: Use an output length of 32,768 tokens for most queries. For highly complex problems with the Thinking model, set the max output length to 81,920 tokens.
The chat template for both Thinking and Instruct is below (the Thinking model additionally wraps its reasoning in <think></think> tags):
<|im_start|>user
What is 1+1?<|im_end|>
<|im_start|>assistant
2<|im_end|>
<|im_start|>user
Hey there!<|im_end|>
<|im_start|>assistant
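For reference, the plain ChatML layout above can be reproduced with a small helper. This is a simplified stand-in for the tokenizer's real chat template (it ignores system prompts, tools, and thinking tags):

```python
# Minimal sketch of the ChatML-style template shown above.
def apply_chat_template(messages, add_generation_prompt=True):
    """Render messages as <|im_start|>role\\ncontent<|im_end|> blocks."""
    out = ""
    for m in messages:
        out += f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
    if add_generation_prompt:
        out += "<|im_start|>assistant\n"  # the model continues from here
    return out

prompt = apply_chat_template([{"role": "user", "content": "What is 1+1?"}])
print(prompt)
```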
📖 Run Qwen3-2507-Thinking Tutorial
This model supports only thinking mode and a 256K context window natively. The default chat template adds <think> automatically, so you may see only a closing </think> tag in the output.
⚙️ Best Practices for the Thinking model
To achieve optimal performance, Qwen recommends these settings for the Thinking model:
- temperature = 0.6
- top_p = 0.95
- top_k = 20
- min_p = 0.00 (llama.cpp's default is 0.1)
- presence_penalty = 0.0 to 2.0 (llama.cpp turns it off by default, but you can use it to reduce repetitions. Try 1.5 for example.)
- Adequate Output Length: Use an output length of 32,768 tokens for most queries.
🖥️ Run Qwen3-235B-A22B-Thinking via llama.cpp:
For Qwen3-235B-A22B, we will use llama.cpp for optimized inference and its wide range of options.
If you want a full precision unquantized version, use our Q8_K_XL, Q8_0 or BF16 versions!
Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.

```
apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp
```
You can use llama.cpp to download the model directly, but we normally suggest using huggingface_hub. To use llama.cpp directly, do:

```
./llama.cpp/llama-cli \
    -hf unsloth/Qwen3-235B-A22B-Thinking-2507-GGUF:Q2_K_XL \
    --threads -1 \
    --ctx-size 16384 \
    --n-gpu-layers 99 \
    -ot ".ffn_.*_exps.=CPU" \
    --temp 0.6 \
    --min-p 0.0 \
    --top-p 0.95 \
    --top-k 20 \
    --presence-penalty 1.5
```
Download the model via (after installing pip install huggingface_hub hf_transfer). You can choose UD-Q2_K_XL or other quantized versions.

```
# !pip install huggingface_hub hf_transfer
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "0" # Can sometimes rate limit, so set to 0 to disable
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id = "unsloth/Qwen3-235B-A22B-Thinking-2507-GGUF",
    local_dir = "unsloth/Qwen3-235B-A22B-Thinking-2507-GGUF",
    allow_patterns = ["*UD-Q2_K_XL*"],
)
```
Run the model and try any prompt.
Edit --threads -1 for the number of CPU threads, --ctx-size 262144 for context length, and --n-gpu-layers 99 for how many layers to offload to the GPU. Reduce it if your GPU runs out of memory, and remove it for CPU-only inference.

Use -ot ".ffn_.*_exps.=CPU" to offload all MoE layers to the CPU! This effectively allows you to fit all non-MoE layers on 1 GPU, improving generation speeds. You can customize the regex expression to fit more layers if you have more GPU capacity.
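To see exactly which tensors that -ot pattern catches, here is a small sanity check; the tensor names are illustrative examples of llama.cpp's blk.<layer>.<tensor> naming:

```python
# Sketch: which tensor names the -ot pattern ".ffn_.*_exps.=CPU" matches.
import re

pattern = re.compile(r"\.ffn_.*_exps\.")  # the regex left of "=CPU"
tensors = [
    "blk.0.attn_q.weight",         # attention -> stays on GPU
    "blk.0.ffn_gate_exps.weight",  # MoE expert -> offloaded to CPU
    "blk.0.ffn_up_exps.weight",    # MoE expert -> offloaded to CPU
    "blk.0.ffn_down_exps.weight",  # MoE expert -> offloaded to CPU
    "blk.0.ffn_gate_inp.weight",   # MoE router -> stays on GPU
]
offloaded = [t for t in tensors if pattern.search(t)]
print(offloaded)
```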
```
./llama.cpp/llama-cli \
    --model unsloth/Qwen3-235B-A22B-Thinking-2507-GGUF/UD-Q2_K_XL/Qwen3-235B-A22B-Thinking-2507-UD-Q2_K_XL-00001-of-00002.gguf \
    --threads -1 \
    --ctx-size 16384 \
    --n-gpu-layers 99 \
    -ot ".ffn_.*_exps.=CPU" \
    --seed 3407 \
    --temp 0.6 \
    --min-p 0.0 \
    --top-p 0.95 \
    --top-k 20 \
    --presence-penalty 1.05
```
📖 Run Qwen3-2507-Instruct Tutorial
Given that this is a non-thinking model, there is no need to set enable_thinking=False, and the model does not generate <think></think> blocks.
⚙️Best Practices
To achieve optimal performance, we recommend the following settings:
1. Sampling Parameters: We suggest using temperature=0.7, top_p=0.8, top_k=20, and min_p=0. Set presence_penalty between 0 and 2, if your framework supports it, to reduce endless repetitions.
2. Adequate Output Length: We recommend using an output length of 16,384 tokens for most queries, which is adequate for instruct models.
3. Standardize Output Format: We recommend using prompts to standardize model outputs when benchmarking.
Math Problems: Include "Please reason step by step, and put your final answer within \boxed{}." in the prompt.
Multiple-Choice Questions: Add the following JSON structure to the prompt to standardize responses: "Please show your choice in the `answer` field with only the choice letter, e.g., `"answer": "C"`."
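A sketch of how that standardized field can be parsed downstream (the reply string is made up for illustration):

```python
# Sketch: pulling the standardized `answer` field out of a model reply.
import re

reply = 'The capital is Paris, so: {"answer": "C"}'  # illustrative output

match = re.search(r'\{\s*"answer"\s*:\s*"([A-Z])"\s*\}', reply)
choice = match.group(1) if match else None
print(choice)  # -> C
```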
🖥️Run Qwen3-235B-A22B-Instruct via llama.cpp:
For Qwen3-235B-A22B, we will use llama.cpp for optimized inference and its wide range of options.
If you want a full precision unquantized version, use our Q8_K_XL, Q8_0 or BF16 versions!
Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.

```
apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp
```
You can use llama.cpp to download the model directly, but we normally suggest using huggingface_hub. To use llama.cpp directly, do:

```
./llama.cpp/llama-cli \
    -hf unsloth/Qwen3-235B-A22B-Instruct-2507-GGUF:Q2_K_XL \
    --threads -1 \
    --ctx-size 16384 \
    --n-gpu-layers 99 \
    -ot ".ffn_.*_exps.=CPU" \
    --temp 0.7 \
    --min-p 0.0 \
    --top-p 0.8 \
    --top-k 20 \
    --presence-penalty 1.5
```
Download the model via (after installing pip install huggingface_hub hf_transfer). You can choose UD-Q2_K_XL or other quantized versions.

```
# !pip install huggingface_hub hf_transfer
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "0" # Can sometimes rate limit, so set to 0 to disable
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id = "unsloth/Qwen3-235B-A22B-Instruct-2507-GGUF",
    local_dir = "unsloth/Qwen3-235B-A22B-Instruct-2507-GGUF",
    allow_patterns = ["*UD-Q2_K_XL*"],
)
```
Run the model and try any prompt.
Edit --threads -1 for the number of CPU threads, --ctx-size 262144 for context length, and --n-gpu-layers 99 for how many layers to offload to the GPU. Reduce it if your GPU runs out of memory, and remove it for CPU-only inference.

Use -ot ".ffn_.*_exps.=CPU" to offload all MoE layers to the CPU! This effectively allows you to fit all non-MoE layers on 1 GPU, improving generation speeds. You can customize the regex expression to fit more layers if you have more GPU capacity.
```
./llama.cpp/llama-cli \
    --model unsloth/Qwen3-235B-A22B-Instruct-2507-GGUF/UD-Q2_K_XL/Qwen3-235B-A22B-Instruct-2507-UD-Q2_K_XL-00001-of-00002.gguf \
    --threads -1 \
    --ctx-size 16384 \
    --n-gpu-layers 99 \
    -ot ".ffn_.*_exps.=CPU" \
    --temp 0.7 \
    --min-p 0.0 \
    --top-p 0.8 \
    --top-k 20
```
🛠️ Improving generation speed
If you have more VRAM, you can try offloading more MoE layers, or offloading whole layers themselves.
Normally, -ot ".ffn_.*_exps.=CPU" offloads all MoE layers to the CPU! This effectively allows you to fit all non-MoE layers on 1 GPU, improving generation speeds. You can customize the regex expression to fit more layers if you have more GPU capacity.
If you have a bit more GPU memory, try -ot ".ffn_(up|down)_exps.=CPU", which offloads the up and down projection MoE layers. Try -ot ".ffn_(up)_exps.=CPU" if you have even more GPU memory; this offloads only the up projection MoE layers.
You can also customize the regex, for example -ot "\.(6|7|8|9|[0-9][0-9]|[0-9][0-9][0-9])\.ffn_(gate|up|down)_exps.=CPU" means to offload gate, up and down MoE layers, but only from the 6th layer onwards.
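A quick sanity check of that layer-filtered pattern against some sample tensor names (the names are illustrative of llama.cpp's blk.<layer>.<tensor> convention):

```python
# Sketch: the layer-filtered -ot regex only matches layers 6 and up.
import re

pattern = re.compile(
    r"\.(6|7|8|9|[0-9][0-9]|[0-9][0-9][0-9])\.ffn_(gate|up|down)_exps\.")
for name in ["blk.5.ffn_up_exps.weight",      # layer 5: stays on GPU
             "blk.6.ffn_up_exps.weight",      # layer 6: offloaded
             "blk.93.ffn_down_exps.weight"]:  # layer 93: offloaded
    print(name, bool(pattern.search(name)))
```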
The latest llama.cpp release also introduces a high-throughput mode. Use llama-parallel. Read more about it here. You can also quantize the KV cache to 4 bits, for example, to reduce VRAM / RAM movement, which can also make generation faster. The next section covers KV cache quantization.
📐How to fit long context
To fit longer context, you can use KV cache quantization to quantize the K and V caches to lower bits. This can also increase generation speed due to reduced RAM / VRAM data movement. The allowed options for K quantization (default is f16) include:
--cache-type-k f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1
You should use the _1 variants (e.g. q4_1, q5_1) for somewhat increased accuracy, though they are slightly slower. So try out --cache-type-k q4_1.
You can also quantize the V cache, but you will need to compile llama.cpp with Flash Attention support via -DGGML_CUDA_FA_ALL_QUANTS=ON, and use --flash-attn to enable it. Once built with Flash Attention, you can then use --cache-type-v q4_1.
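To see why cache quantization matters at this scale, here is a back-of-envelope KV-cache size estimate using the numbers from the architectural info section (94 layers, 4 KV heads, 262,144 context). The head dimension of 128 is an assumption, not stated in this guide:

```python
# Rough KV-cache size at full 256K context, for f16 vs q4_1 caches.
layers, kv_heads, head_dim = 94, 4, 128  # head_dim = 128 is assumed
ctx = 262_144  # full context length in tokens

def kv_cache_gib(bytes_per_elem: float) -> float:
    # K and V caches together: 2 * layers * kv_heads * head_dim * ctx elements
    elems = 2 * layers * kv_heads * head_dim * ctx
    return elems * bytes_per_elem / 1024**3

print(f"f16 : {kv_cache_gib(2.0):.1f} GiB")    # 16-bit cache
print(f"q4_1: {kv_cache_gib(0.625):.1f} GiB")  # ~5 bits/element incl. scales
```

So a 4-bit cache cuts the full-context KV footprint by roughly 3x, which is often the difference between fitting 256K context or not.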
🔍 Architectural info
| Property | Value |
| --- | --- |
| Number of Parameters | 235B, of which 22B are activated |
| Number of Layers | 94 |
| Number of Heads | 64 Query heads and 4 Key/Value heads |
| Number of Experts | 128, of which 8 are activated |
| Context Length | 262,144 |