🌠Qwen3-2507
Run Qwen3-235B-A22B-Instruct-2507 locally on your device!
Qwen3-2507 is an updated variant of the Qwen3 family, officially named Qwen3-235B-A22B-Instruct-2507. This update brings significant improvements in instruction following, a native 256K (262,144-token) context length, broader coverage of long-tail knowledge across multiple languages, and better alignment with user preferences.
Unsloth Dynamic GGUFs: Qwen3-235B-A22B-Instruct-2507-GGUF
⚙️Best Practices
To achieve optimal performance, we recommend the following settings:
1. Sampling Parameters: We suggest using temperature=0.7, top_p=0.8, top_k=20, and min_p=0. If your framework supports it, set presence_penalty between 0 and 2 to reduce endless repetitions.
2. Adequate Output Length: We recommend using an output length of 16,384 tokens for most queries, which is adequate for instruct models.
3. Standardize Output Format: We recommend using prompts to standardize model outputs when benchmarking.
Math Problems: Include "Please reason step by step, and put your final answer within \boxed{}." in the prompt.
Multiple-Choice Questions: Add the following JSON structure to the prompt to standardize responses: "Please show your choice in the `answer` field with only the choice letter, e.g., `"answer": "C"`."
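To make these settings concrete, here is a minimal Python sketch that sends the standardized math prompt to a locally served model with the recommended sampling parameters. It assumes you are serving the GGUF behind an OpenAI-compatible endpoint (for example llama.cpp's llama-server) at http://localhost:8080; the URL, the chosen presence_penalty value, and support for the extra top_k/min_p fields are assumptions that depend on your server.
import requests

# Recommended sampling settings from the best practices above.
SAMPLING = {
    "temperature": 0.7,
    "top_p": 0.8,
    "top_k": 20,             # extra field; accepted by llama-server, may differ elsewhere
    "min_p": 0.0,            # extra field; accepted by llama-server, may differ elsewhere
    "presence_penalty": 1.0, # illustrative value inside the suggested 0-2 range
}

# Standardized prompt suffixes for benchmarking-style queries.
MATH_SUFFIX = "Please reason step by step, and put your final answer within \\boxed{}."
MCQ_SUFFIX = 'Please show your choice in the `answer` field with only the choice letter, e.g., `"answer": "C"`.'

def ask(question: str, suffix: str = MATH_SUFFIX,
        url: str = "http://localhost:8080/v1/chat/completions") -> str:
    """Query a local OpenAI-compatible server (assumed to be running) and return the reply."""
    payload = {
        "messages": [{"role": "user", "content": f"{question}\n{suffix}"}],
        "max_tokens": 16384,  # adequate output length for most instruct queries
        **SAMPLING,
    }
    resp = requests.post(url, json=payload, timeout=600)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(ask("What is the sum of the first 100 positive integers?"))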
📖 Run Qwen3-2507 Tutorial
Run Qwen3-235B-A22B via llama.cpp:
For Qwen3-235B-A22B, we will specifically use llama.cpp for optimized inference and its plethora of options.
Obtain the latest llama.cpp from GitHub: https://github.com/ggml-org/llama.cpp. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.
apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp
Download the model with the snippet below (after installing huggingface_hub and hf_transfer via pip install huggingface_hub hf_transfer). You can choose UD-Q2_K_XL or other quantized versions.
# !pip install huggingface_hub hf_transfer
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

from huggingface_hub import snapshot_download
snapshot_download(
    repo_id = "unsloth/Qwen3-235B-A22B-Instruct-2507-GGUF",
    local_dir = "unsloth/Qwen3-235B-A22B-Instruct-2507-GGUF",
    allow_patterns = ["*UD-Q2_K_XL*"],
)
Run the model and try any prompt.
Edit --threads -1 for the number of CPU threads, --ctx-size for the context length (up to 262,144 tokens), and --n-gpu-layers 99 for how many layers to offload to the GPU. Lower --n-gpu-layers if your GPU runs out of memory, and remove the flag entirely for CPU-only inference.
Use -ot ".ffn_.*_exps.=CPU" to offload all MoE expert layers to the CPU! This effectively allows you to fit all non-MoE layers on a single GPU, improving generation speed. You can customize the regex to keep more expert layers on the GPU if you have spare VRAM (see the sketch after the command below).
./llama.cpp/llama-cli \
--model unsloth/Qwen3-235B-A22B-Instruct-2507-GGUF/UD-Q2_K_XL/Qwen3-235B-A22B-Instruct-2507-UD-Q2_K_XL-00001-of-00002.gguf \
--threads 32 \
--ctx-size 16384 \
--n-gpu-layers 99 \
-ot ".ffn_.*_exps.=CPU" \
--seed 3407 \
--prio 3 \
--temp 0.7 \
--min-p 0.0 \
--top-p 0.8 \
--top-k 20
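If you have spare VRAM, you can keep the experts of the first few layers on the GPU and offload only the rest. The sketch below assembles such a llama-cli invocation from Python; the 10-layer split, the exact regex, and the use of subprocess are illustrative assumptions rather than recommendations from the model card.
import subprocess

# Offload the experts of layers 10-93 to the CPU while keeping the experts of
# layers 0-9 on the GPU. llama.cpp tensor names look like
# "blk.<layer>.ffn_gate_exps.weight", so the regex below matches any two-digit
# layer index (10 and above). Adjust the split to match your available VRAM.
offload_regex = r"blk\.(?:[1-9][0-9])\.ffn_.*_exps\.=CPU"

cmd = [
    "./llama.cpp/llama-cli",
    "--model",
    "unsloth/Qwen3-235B-A22B-Instruct-2507-GGUF/UD-Q2_K_XL/"
    "Qwen3-235B-A22B-Instruct-2507-UD-Q2_K_XL-00001-of-00002.gguf",
    "--threads", "32",
    "--ctx-size", "16384",
    "--n-gpu-layers", "99",
    "-ot", offload_regex,
    "--seed", "3407",
    "--temp", "0.7",
    "--min-p", "0.0",
    "--top-p", "0.8",
    "--top-k", "20",
]
subprocess.run(cmd, check=True)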
Architectural info
| Property | Value |
| --- | --- |
| Number of Parameters | 235B total, of which 22B are activated |
| Number of Layers | 94 |
| Number of Heads | 64 query heads and 4 key/value heads |
| Number of Experts | 128, of which 8 are activated per token |
| Context Length | 262,144 tokens |
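As a quick sanity check on the MoE numbers above, the arithmetic below shows why the activated-parameter share (roughly 9%) is higher than the activated-expert share (6.25%): attention, embeddings, and other dense components run for every token regardless of expert routing. This is purely illustrative arithmetic on the figures from the table.
# Illustrative arithmetic based on the architectural figures above.
total_params   = 235e9   # total parameters
active_params  = 22e9    # parameters activated per token
experts_total  = 128
experts_active = 8

print(f"Experts active per token:    {experts_active / experts_total:.2%}")  # 6.25%
print(f"Parameters active per token: {active_params / total_params:.2%}")    # ~9.4%
# The parameter share exceeds the expert share because dense components
# (attention, embeddings, routing) are always active.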
Since this is a non-thinking model, there is no need to set enable_thinking=False, and the model does not generate <think></think> blocks.
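As a small illustration, the sketch below builds a chat prompt with the official tokenizer; no thinking flag is passed and the rendered template contains no <think> block. It assumes transformers is installed and that the tokenizer is pulled from the Qwen/Qwen3-235B-A22B-Instruct-2507 repository (only the tokenizer is downloaded, not the 235B weights).
from transformers import AutoTokenizer

# Fetch only the tokenizer / chat template, not the model weights.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-235B-A22B-Instruct-2507")

messages = [{"role": "user", "content": "Give me a short introduction to large language models."}]

# No enable_thinking argument is needed for this instruct-only model.
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize = False,
    add_generation_prompt = True,
)
print(prompt)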
Performance
The new release brings notable performance improvements. Here is how it compares against popular models on standard benchmarks.
| Benchmark | Deepseek-V3-0324 | GPT-4o-0327 | Claude Opus 4 (non-thinking) | Kimi K2 | Qwen3-235B-A22B (non-thinking) | Qwen3-235B-A22B-Instruct-2507 |
| --- | --- | --- | --- | --- | --- | --- |
| **Knowledge** | | | | | | |
| MMLU-Pro | 81.2 | 79.8 | 86.6 | 81.1 | 75.2 | 83.0 |
| MMLU-Redux | 90.4 | 91.3 | 94.2 | 92.7 | 89.2 | 93.1 |
| GPQA | 68.4 | 66.9 | 74.9 | 75.1 | 62.9 | 77.5 |
| SuperGPQA | 57.3 | 51.0 | 56.5 | 57.2 | 48.2 | 62.6 |
| SimpleQA | 27.2 | 40.3 | 22.8 | 31.0 | 12.2 | 54.3 |
| CSimpleQA | 71.1 | 60.2 | 68.0 | 74.5 | 60.8 | 84.3 |
| **Reasoning** | | | | | | |
| AIME25 | 46.6 | 26.7 | 33.9 | 49.5 | 24.7 | 70.3 |
| HMMT25 | 27.5 | 7.9 | 15.9 | 38.8 | 10.0 | 55.4 |
| ARC-AGI | 9.0 | 8.8 | 30.3 | 13.3 | 4.3 | 41.8 |
| ZebraLogic | 83.4 | 52.6 | - | 89.0 | 37.7 | 95.0 |
| LiveBench 20241125 | 66.9 | 63.7 | 74.6 | 76.4 | 62.5 | 75.4 |
| **Coding** | | | | | | |
| LiveCodeBench v6 (25.02-25.05) | 45.2 | 35.8 | 44.6 | 48.9 | 32.9 | 51.8 |
| MultiPL-E | 82.2 | 82.7 | 88.5 | 85.7 | 79.3 | 87.9 |
| Aider-Polyglot | 55.1 | 45.3 | 70.7 | 59.0 | 59.6 | 57.3 |
| **Alignment** | | | | | | |
| IFEval | 82.3 | 83.9 | 87.4 | 89.8 | 83.2 | 88.7 |
| Arena-Hard v2* | 45.6 | 61.9 | 51.5 | 66.1 | 52.0 | 79.2 |
| Creative Writing v3 | 81.6 | 84.9 | 83.8 | 88.1 | 80.4 | 87.5 |
| WritingBench | 74.5 | 75.5 | 79.2 | 86.2 | 77.0 | 85.2 |
| **Agent** | | | | | | |
| BFCL-v3 | 64.7 | 66.5 | 60.1 | 65.2 | 68.0 | 70.9 |
| TAU-Retail | 49.6 | 60.3# | 81.4 | 70.7 | 65.2 | 71.3 |
| TAU-Airline | 32.0 | 42.8# | 59.6 | 53.5 | 32.0 | 44.0 |
| **Multilingualism** | | | | | | |
| MultiIF | 66.5 | 70.4 | - | 76.2 | 70.2 | 77.5 |
| MMLU-ProX | 75.8 | 76.2 | - | 74.5 | 73.2 | 79.4 |
| INCLUDE | 80.1 | 82.1 | - | 76.9 | 75.6 | 79.5 |
| PolyMATH | 32.2 | 25.5 | 30.0 | 44.8 | 27.0 | 50.2 |