🌠Qwen3-2507
Run Qwen3-235B-A22B-Instruct-2507 locally on your device!
Qwen3-2507 is an updated non-thinking variant of Qwen3, released as Qwen3-235B-A22B-Instruct-2507. The model has undergone additional pre-training and post-training phases. This new release brings significant improvements in instruction following, broader coverage of long-tail knowledge across multiple languages, and better alignment with user preferences.
Unsloth Dynamic GGUFs: Qwen3-235B-A22B-Instruct-2507-GGUF
⚙️Best Practices
To achieve optimal performance, we recommend the following settings:
1. Sampling Parameters: We suggest using temperature=0.7, top_p=0.8, top_k=20, and min_p=0.0. If your framework supports it, set presence_penalty between 0 and 2 to reduce endless repetitions (see the example request after this list).
2. Adequate Output Length: We recommend using an output length of 16,384 tokens for most queries, which is adequate for instruct models.
3. Standardize Output Format: We recommend using prompts to standardize model outputs when benchmarking.
   - Math Problems: Include "Please reason step by step, and put your final answer within \boxed{}." in the prompt.
   - Multiple-Choice Questions: Add the following JSON structure to the prompt to standardize responses: "Please show your choice in the `answer` field with only the choice letter, e.g., `"answer": "C"`."
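For instance, a single chat request through an OpenAI-compatible endpoint can apply all of these recommendations at once. The sketch below assumes a local server (e.g. llama-server or vLLM) is already running; the base URL, API key, and model name are placeholders, and `top_k`/`min_p` are passed via `extra_body` since they are not part of the standard OpenAI schema.

```python
# A minimal sketch, assuming an OpenAI-compatible server (llama-server, vLLM, etc.)
# is serving the model locally; base_url, api_key and model name are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="Qwen3-235B-A22B-Instruct-2507",
    messages=[{
        "role": "user",
        "content": "What is 17 * 24? Please reason step by step, "
                   "and put your final answer within \\boxed{}.",
    }],
    temperature=0.7,       # recommended sampling parameters
    top_p=0.8,
    presence_penalty=1.0,  # 0-2 to reduce endless repetitions
    max_tokens=16384,      # adequate output length for most queries
    extra_body={"top_k": 20, "min_p": 0.0},  # non-standard params go via extra_body
)
print(response.choices[0].message.content)
```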
📖 Run Qwen3-2507 Tutorial
For Qwen3-235B-A22B-Instruct-2507, we will specifically use llama.cpp for optimized inference and a plethora of options.
1. Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change `-DGGML_CUDA=ON` to `-DGGML_CUDA=OFF` if you don't have a GPU or just want CPU inference.

```bash
apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp
```
2. Download the model via the snippet below (after installing `pip install huggingface_hub hf_transfer`). You can choose UD-Q2_K_XL or other quantized versions.

```python
# !pip install huggingface_hub hf_transfer
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

from huggingface_hub import snapshot_download
snapshot_download(
    repo_id = "unsloth/Qwen3-235B-A22B-Instruct-2507-GGUF",
    local_dir = "unsloth/Qwen3-235B-A22B-Instruct-2507-GGUF",
    allow_patterns = ["*UD-Q2_K_XL*"],  # download only the UD-Q2_K_XL quant
)
```
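Once the download finishes, an optional sanity check is to list the GGUF shards that were fetched. This is just a small sketch; the search path mirrors the `local_dir` used above.

```python
# Optional sanity check: list the downloaded UD-Q2_K_XL shards.
# The search path mirrors the local_dir passed to snapshot_download above.
import glob

shards = sorted(glob.glob(
    "unsloth/Qwen3-235B-A22B-Instruct-2507-GGUF/**/*UD-Q2_K_XL*.gguf",
    recursive=True,
))
for path in shards:
    print(path)
```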
3. Run the model and try any prompt. Edit `--threads 32` for the number of CPU threads, `--ctx-size 16384` for the context length, and `--n-gpu-layers 99` for how many layers to offload to the GPU. Lower it if your GPU runs out of memory, and remove it entirely for CPU-only inference.

Use `-ot ".ffn_.*_exps.=CPU"` to offload all MoE layers to the CPU! This effectively allows you to fit all non-MoE layers on a single GPU, improving generation speed. You can customize the regex expression to keep more layers on the GPU if you have extra capacity (see the sketch after the command below).
```bash
./llama.cpp/llama-cli \
--model unsloth/Qwen3-235B-A22B-Instruct-2507-GGUF/Qwen3-235B-A22B-Instruct-2507-UD-Q2_K_XL.gguf \
--threads 32 \
--ctx-size 16384 \
--n-gpu-layers 99 \
-ot ".ffn_.*_exps.=CPU" \
--seed 3407 \
--prio 3 \
--temp 0.7 \
--min-p 0.0 \
--top-p 0.8 \
--top-k 20 \
-no-cnv \
--prompt "<|im_start|>user\nCreate a Flappy Bird game in Python. You must include these things:\n1. You must use pygame.\n2. The background color should be randomly chosen and is a light shade. Start with a light blue color.\n3. Pressing SPACE multiple times will accelerate the bird.\n4. The bird's shape should be randomly chosen as a square, circle or triangle. The color should be randomly chosen as a dark color.\n5. Place on the bottom some land colored as dark brown or yellow chosen randomly.\n6. Make a score shown on the top right side. Increment if you pass pipes and don't hit them.\n7. Make randomly spaced pipes with enough space. Color them randomly as dark green or light brown or a dark gray shade.\n8. When you lose, show the best score. Make the text inside the screen. Pressing q or Esc will quit the game. Restarting is pressing SPACE again.\nThe final game should be inside a markdown section in Python. Check your code for errors and fix them before the final markdown section.<|im_end|>\n<|im_start|>assistant\n"
```
Architectural info

| | Qwen3-235B-A22B-Instruct-2507 |
| --- | --- |
| Number of Parameters | 235B, of which 22B are activated |
| Number of Layers | 94 |
| Number of Heads | 64 Query heads and 4 Key/Value heads (GQA) |
| Number of Experts | 128, of which 8 are activated |
| Context Length | 262,144 tokens |
Given that this is a non-thinking model, there is no need to set `enable_thinking=False`, and the model does not generate `<think></think>` blocks.
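For example, with Hugging Face Transformers the chat template can be applied directly with no thinking-related flag. The snippet below is a minimal sketch that only illustrates the tokenizer/template step; actually generating with the full 235B model requires a multi-GPU or heavily quantized setup.

```python
# A minimal sketch of the chat template for the non-thinking 2507 instruct model.
# Only the tokenizer/template step is shown; full-weight generation needs far more memory.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-235B-A22B-Instruct-2507")

messages = [{"role": "user", "content": "Give me a short introduction to large language models."}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,  # no enable_thinking flag is needed for this model
)
print(text)  # ends with <|im_start|>assistant and contains no <think> block
```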
Performance
The new release comes with notable performance improvements. Here's how it compares against popular models on standard benchmarks.
| | Deepseek-V3-0324 | GPT-4o-0327 | Claude Opus 4 Non-thinking | Kimi K2 | Qwen3-235B-A22B Non-thinking | Qwen3-235B-A22B-Instruct-2507 |
| --- | --- | --- | --- | --- | --- | --- |
| **Knowledge** | | | | | | |
| MMLU-Pro | 81.2 | 79.8 | 86.6 | 81.1 | 75.2 | 83.0 |
| MMLU-Redux | 90.4 | 91.3 | 94.2 | 92.7 | 89.2 | 93.1 |
| GPQA | 68.4 | 66.9 | 74.9 | 75.1 | 62.9 | 77.5 |
| SuperGPQA | 57.3 | 51.0 | 56.5 | 57.2 | 48.2 | 62.6 |
| SimpleQA | 27.2 | 40.3 | 22.8 | 31.0 | 12.2 | 54.3 |
| CSimpleQA | 71.1 | 60.2 | 68.0 | 74.5 | 60.8 | 84.3 |
| **Reasoning** | | | | | | |
| AIME25 | 46.6 | 26.7 | 33.9 | 49.5 | 24.7 | 70.3 |
| HMMT25 | 27.5 | 7.9 | 15.9 | 38.8 | 10.0 | 55.4 |
| ARC-AGI | 9.0 | 8.8 | 30.3 | 13.3 | 4.3 | 41.8 |
| ZebraLogic | 83.4 | 52.6 | - | 89.0 | 37.7 | 95.0 |
| LiveBench 20241125 | 66.9 | 63.7 | 74.6 | 76.4 | 62.5 | 75.4 |
| **Coding** | | | | | | |
| LiveCodeBench v6 (25.02-25.05) | 45.2 | 35.8 | 44.6 | 48.9 | 32.9 | 51.8 |
| MultiPL-E | 82.2 | 82.7 | 88.5 | 85.7 | 79.3 | 87.9 |
| Aider-Polyglot | 55.1 | 45.3 | 70.7 | 59.0 | 59.6 | 57.3 |
| **Alignment** | | | | | | |
| IFEval | 82.3 | 83.9 | 87.4 | 89.8 | 83.2 | 88.7 |
| Arena-Hard v2* | 45.6 | 61.9 | 51.5 | 66.1 | 52.0 | 79.2 |
| Creative Writing v3 | 81.6 | 84.9 | 83.8 | 88.1 | 80.4 | 87.5 |
| WritingBench | 74.5 | 75.5 | 79.2 | 86.2 | 77.0 | 85.2 |
| **Agent** | | | | | | |
| BFCL-v3 | 64.7 | 66.5 | 60.1 | 65.2 | 68.0 | 70.9 |
| TAU-Retail | 49.6 | 60.3# | 81.4 | 70.7 | 65.2 | 71.3 |
| TAU-Airline | 32.0 | 42.8# | 59.6 | 53.5 | 32.0 | 44.0 |
| **Multilingualism** | | | | | | |
| MultiIF | 66.5 | 70.4 | - | 76.2 | 70.2 | 77.5 |
| MMLU-ProX | 75.8 | 76.2 | - | 74.5 | 73.2 | 79.4 |
| INCLUDE | 80.1 | 82.1 | - | 76.9 | 75.6 | 79.5 |
| PolyMATH | 32.2 | 25.5 | 30.0 | 44.8 | 27.0 | 50.2 |