🌠Qwen3-2507
Run Qwen3-235B-A22B-Instruct-2507 locally on your device!
Qwen3-2507 is an updated non-thinking variant of Qwen3, released as Qwen3-235B-A22B-Instruct-2507. The model has undergone additional pre-training and post-training phases. This new release brings significant improvements in instruction following, broader coverage of long-tail knowledge across multiple languages, and better alignment with user preferences.
Unsloth Dynamic GGUFs: Qwen3-235B-A22B-Instruct-2507-GGUF
⚙️Best Practices
To achieve optimal performance, we recommend the following settings:
1. Sampling Parameters: We suggest using temperature=0.7, top_p=0.8, top_k=20, and min_p=0.0. If your framework supports it, set presence_penalty between 0 and 2 to reduce endless repetitions (see the example request after this list).
2. Adequate Output Length: We recommend using an output length of 16,384 tokens for most queries, which is adequate for instruct models.
3. Standardize Output Format: We recommend using prompts to standardize model outputs when benchmarking.
   - Math Problems: Include "Please reason step by step, and put your final answer within \boxed{}." in the prompt.
   - Multiple-Choice Questions: Add the following JSON structure to the prompt to standardize responses: "Please show your choice in the `answer` field with only the choice letter, e.g., `"answer": "C"`."
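For instance, a single chat request through an OpenAI-compatible endpoint can apply all of these recommendations at once. The sketch below assumes a local server (e.g. llama-server or vLLM) is already running; the base URL, API key, and model name are placeholders, and `top_k`/`min_p` are passed via `extra_body` since they are not part of the standard OpenAI schema.

```python
# A minimal sketch, assuming an OpenAI-compatible server (llama-server, vLLM, etc.)
# is serving the model locally; base_url, api_key and model name are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="Qwen3-235B-A22B-Instruct-2507",
    messages=[{
        "role": "user",
        "content": "What is 17 * 24? Please reason step by step, "
                   "and put your final answer within \\boxed{}.",
    }],
    temperature=0.7,       # recommended sampling parameters
    top_p=0.8,
    presence_penalty=1.0,  # 0-2 to reduce endless repetitions
    max_tokens=16384,      # adequate output length for most queries
    extra_body={"top_k": 20, "min_p": 0.0},  # non-standard params go via extra_body
)
print(response.choices[0].message.content)
```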
📖 Run Qwen3-2507 Tutorial
For Qwen3-235B-A22B-Instruct-2507, we will specifically use llama.cpp for optimized inference and a plethora of options.
1. Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change `-DGGML_CUDA=ON` to `-DGGML_CUDA=OFF` if you don't have a GPU or just want CPU inference.

```bash
apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp
```
2. Download the model via the snippet below (after installing `pip install huggingface_hub hf_transfer`). You can choose UD-Q2_K_XL or other quantized versions.

```python
# !pip install huggingface_hub hf_transfer
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

from huggingface_hub import snapshot_download
snapshot_download(
    repo_id = "unsloth/Qwen3-235B-A22B-Instruct-2507-GGUF",
    local_dir = "unsloth/Qwen3-235B-A22B-Instruct-2507-GGUF",
    allow_patterns = ["*UD-Q2_K_XL*"],  # download only the UD-Q2_K_XL quant
)
```
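Once the download finishes, an optional sanity check is to list the GGUF shards that were fetched. This is just a small sketch; the search path mirrors the `local_dir` used above.

```python
# Optional sanity check: list the downloaded UD-Q2_K_XL shards.
# The search path mirrors the local_dir passed to snapshot_download above.
import glob

shards = sorted(glob.glob(
    "unsloth/Qwen3-235B-A22B-Instruct-2507-GGUF/**/*UD-Q2_K_XL*.gguf",
    recursive=True,
))
for path in shards:
    print(path)
```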
3. Run the model and try any prompt. Edit `--threads 32` for the number of CPU threads, `--ctx-size 16384` for the context length, and `--n-gpu-layers 99` for how many layers to offload to the GPU. Lower it if your GPU runs out of memory, and remove it entirely for CPU-only inference.

Use `-ot ".ffn_.*_exps.=CPU"` to offload all MoE layers to the CPU! This effectively allows you to fit all non-MoE layers on a single GPU, improving generation speed. You can customize the regex expression to keep more layers on the GPU if you have extra capacity (see the sketch after the command below).
```bash
./llama.cpp/llama-cli \
--model unsloth/Qwen3-235B-A22B-Instruct-2507-GGUF/Qwen3-235B-A22B-Instruct-2507-UD-Q2_K_XL.gguf \
--threads 32 \
--ctx-size 16384 \
--n-gpu-layers 99 \
-ot ".ffn_.*_exps.=CPU" \
--seed 3407 \
--prio 3 \
--temp 0.7 \
--min-p 0.0 \
--top-p 0.8 \
--top-k 20 \
-no-cnv \
--prompt "<|im_start|>user\nCreate a Flappy Bird game in Python. You must include these things:\n1. You must use pygame.\n2. The background color should be randomly chosen and is a light shade. Start with a light blue color.\n3. Pressing SPACE multiple times will accelerate the bird.\n4. The bird's shape should be randomly chosen as a square, circle or triangle. The color should be randomly chosen as a dark color.\n5. Place on the bottom some land colored as dark brown or yellow chosen randomly.\n6. Make a score shown on the top right side. Increment if you pass pipes and don't hit them.\n7. Make randomly spaced pipes with enough space. Color them randomly as dark green or light brown or a dark gray shade.\n8. When you lose, show the best score. Make the text inside the screen. Pressing q or Esc will quit the game. Restarting is pressing SPACE again.\nThe final game should be inside a markdown section in Python. Check your code for errors and fix them before the final markdown section.<|im_end|>\n<|im_start|>assistant\n"
```
Architectural info

| | Qwen3-235B-A22B-Instruct-2507 |
| --- | --- |
| Number of Parameters | 235B, of which 22B are activated |
| Number of Layers | 94 |
| Number of Heads | 64 Query heads and 4 Key/Value heads (GQA) |
| Number of Experts | 128, of which 8 are activated |
| Context Length | 262,144 tokens |
Given that this is a non-thinking model, there is no need to set `enable_thinking=False`, and the model does not generate `<think></think>` blocks.
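For example, with Hugging Face Transformers the chat template can be applied directly with no thinking-related flag. The snippet below is a minimal sketch that only illustrates the tokenizer/template step; actually generating with the full 235B model requires a multi-GPU or heavily quantized setup.

```python
# A minimal sketch of the chat template for the non-thinking 2507 instruct model.
# Only the tokenizer/template step is shown; full-weight generation needs far more memory.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-235B-A22B-Instruct-2507")

messages = [{"role": "user", "content": "Give me a short introduction to large language models."}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,  # no enable_thinking flag is needed for this model
)
print(text)  # ends with <|im_start|>assistant and contains no <think> block
```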
Performance
The new release comes with notable performance improvements. Here's how it compares against popular models on standard benchmarks.
| | Deepseek-V3-0324 | GPT-4o-0327 | Claude Opus 4 Non-thinking | Kimi K2 | Qwen3-235B-A22B Non-thinking | Qwen3-235B-A22B-Instruct-2507 |
| --- | --- | --- | --- | --- | --- | --- |
| **Knowledge** | | | | | | |
| MMLU-Pro | 81.2 | 79.8 | 86.6 | 81.1 | 75.2 | 83.0 |
| MMLU-Redux | 90.4 | 91.3 | 94.2 | 92.7 | 89.2 | 93.1 |
| GPQA | 68.4 | 66.9 | 74.9 | 75.1 | 62.9 | 77.5 |
| SuperGPQA | 57.3 | 51.0 | 56.5 | 57.2 | 48.2 | 62.6 |
| SimpleQA | 27.2 | 40.3 | 22.8 | 31.0 | 12.2 | 54.3 |
| CSimpleQA | 71.1 | 60.2 | 68.0 | 74.5 | 60.8 | 84.3 |
| **Reasoning** | | | | | | |
| AIME25 | 46.6 | 26.7 | 33.9 | 49.5 | 24.7 | 70.3 |
| HMMT25 | 27.5 | 7.9 | 15.9 | 38.8 | 10.0 | 55.4 |
| ARC-AGI | 9.0 | 8.8 | 30.3 | 13.3 | 4.3 | 41.8 |
| ZebraLogic | 83.4 | 52.6 | - | 89.0 | 37.7 | 95.0 |
| LiveBench 20241125 | 66.9 | 63.7 | 74.6 | 76.4 | 62.5 | 75.4 |
| **Coding** | | | | | | |
| LiveCodeBench v6 (25.02-25.05) | 45.2 | 35.8 | 44.6 | 48.9 | 32.9 | 51.8 |
| MultiPL-E | 82.2 | 82.7 | 88.5 | 85.7 | 79.3 | 87.9 |
| Aider-Polyglot | 55.1 | 45.3 | 70.7 | 59.0 | 59.6 | 57.3 |
| **Alignment** | | | | | | |
| IFEval | 82.3 | 83.9 | 87.4 | 89.8 | 83.2 | 88.7 |
| Arena-Hard v2* | 45.6 | 61.9 | 51.5 | 66.1 | 52.0 | 79.2 |
| Creative Writing v3 | 81.6 | 84.9 | 83.8 | 88.1 | 80.4 | 87.5 |
| WritingBench | 74.5 | 75.5 | 79.2 | 86.2 | 77.0 | 85.2 |
| **Agent** | | | | | | |
| BFCL-v3 | 64.7 | 66.5 | 60.1 | 65.2 | 68.0 | 70.9 |
| TAU-Retail | 49.6 | 60.3# | 81.4 | 70.7 | 65.2 | 71.3 |
| TAU-Airline | 32.0 | 42.8# | 59.6 | 53.5 | 32.0 | 44.0 |
| **Multilingualism** | | | | | | |
| MultiIF | 66.5 | 70.4 | - | 76.2 | 70.2 | 77.5 |
| MMLU-ProX | 75.8 | 76.2 | - | 74.5 | 73.2 | 79.4 |
| INCLUDE | 80.1 | 82.1 | - | 76.9 | 75.6 | 79.5 |
| PolyMATH | 32.2 | 25.5 | 30.0 | 44.8 | 27.0 | 50.2 |