🌠Qwen3-Coder: How to Run Locally
Run Qwen3-Coder-480B-A35B locally with Unsloth Dynamic quants.
Qwen3-Coder-480B-A35B delivers SOTA advancements in agentic coding and coding tasks, matching or outperforming Claude Sonnet-4, GPT-4.1, and Kimi K2. The 480B model achieves 61.8% on Aider Polyglot and supports a 256K token context, extendable to 1M tokens.
All uploads use Unsloth Dynamic 2.0 for SOTA 5-shot MMLU and KL Divergence performance, meaning you can run & fine-tune quantized Qwen LLMs with minimal accuracy loss.
We also uploaded Qwen3-Coder with 1M context length (extended via YaRN), as well as unquantized 8-bit and 16-bit versions.
Qwen3 Coder - Unsloth Dynamic 2.0 GGUFs:
🖥️ Running Qwen3 Coder
⚙️ Official Recommended Settings
According to Qwen, these are the recommended settings for inference:
temperature=0.7, top_p=0.8, top_k=20, repetition_penalty=1.05
Temperature of 0.7
Top_K of 20
Min_P of 0.00 (optional, but 0.01 works well, llama.cpp default is 0.1)
Top_P of 0.8
Repetition Penalty of 1.05
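In llama.cpp, these map one-to-one onto the sampling flags used in the full commands later in this guide, for example (downloading the Q2_K_XL quant on the fly via -hf, as shown below):
./llama.cpp/llama-cli \
    -hf unsloth/Qwen3-Coder-480B-A35B-Instruct-GGUF:Q2_K_XL \
    --temp 0.7 \
    --top-p 0.8 \
    --top-k 20 \
    --min-p 0.0 \
    --repeat-penalty 1.05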
Chat template:
<|im_start|>user
Hey there!<|im_end|>
<|im_start|>assistant
What is 1+1?<|im_end|>
<|im_start|>user
2<|im_end|>
<|im_start|>assistant
Recommended output length: 65,536 tokens (can be increased). Details here.
Chat template/prompt format with newlines un-rendered
<|im_start|>user\nHey there!<|im_end|>\n<|im_start|>assistant\nWhat is 1+1?<|im_end|>\n<|im_start|>user\n2<|im_end|>\n<|im_start|>assistant\n
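As a quick sketch, you can feed this exact format to llama-cli yourself once you have a GGUF downloaded (see the llama.cpp section below); prompt.txt is just a file name we chose here, and -no-cnv disables llama.cpp's built-in chat template so the raw text is used as-is:
printf '<|im_start|>user\nHey there!<|im_end|>\n<|im_start|>assistant\n' > prompt.txt
./llama.cpp/llama-cli \
    --model unsloth/Qwen3-Coder-480B-A35B-Instruct-GGUF/UD-Q2_K_XL/Qwen3-Coder-480B-A35B-Instruct-UD-Q2_K_XL-00001-of-00004.gguf \
    -no-cnv -f prompt.txt \
    --temp 0.7 --top-p 0.8 --top-k 20 --repeat-penalty 1.05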
Chat template for tool calling (getting the current temperature for San Francisco). More details on how to format tool calls here.
<|im_start|>user
What's the temperature in San Francisco now? How about tomorrow?<|im_end|>
<|im_start|>assistant
<tool_call>
<function=get_current_temperature>
<parameter=location>
San Francisco, CA, USA
</parameter>
</function>
</tool_call><|im_end|>
<|im_start|>user
<tool_response>
{"temperature": 26.1, "location": "San Francisco, CA, USA", "unit": "celsius"}
</tool_response>
<|im_end|>
📖 Llama.cpp: Run Qwen3 Tutorial
For Qwen3-Coder-480B-A35B, we will specifically use llama.cpp for optimized inference and its wide range of options.
If you want full precision, use our Q8_K_XL, Q8_0, or BF16 versions!
Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.
apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp
You can use llama.cpp to download the model directly, but we normally suggest using huggingface_hub. To use llama.cpp directly, do:
./llama.cpp/llama-cli \
    -hf unsloth/Qwen3-Coder-480B-A35B-Instruct-GGUF:Q2_K_XL \
    --threads -1 \
    --ctx-size 16384 \
    --n-gpu-layers 99 \
    -ot ".ffn_.*_exps.=CPU" \
    --temp 0.7 \
    --min-p 0.0 \
    --top-p 0.8 \
    --top-k 20 \
    --repeat-penalty 1.05
Or, download the model via the snippet below (after installing pip install huggingface_hub hf_transfer). You can choose UD-Q2_K_XL or other quantized versions.
# !pip install huggingface_hub hf_transfer
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "0" # Can sometimes rate limit, so set to 0 to disable
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id = "unsloth/Qwen3-Coder-480B-A35B-Instruct-GGUF",
    local_dir = "unsloth/Qwen3-Coder-480B-A35B-Instruct-GGUF",
    allow_patterns = ["*UD-Q2_K_XL*"],
)
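If you prefer a pure command-line download, a roughly equivalent alternative is the huggingface-cli tool that ships with huggingface_hub:
huggingface-cli download unsloth/Qwen3-Coder-480B-A35B-Instruct-GGUF \
    --include "*UD-Q2_K_XL*" \
    --local-dir unsloth/Qwen3-Coder-480B-A35B-Instruct-GGUF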
Run the model in conversation mode and try any prompt.
Edit --threads -1 for the number of CPU threads, --ctx-size 262144 for the context length, and --n-gpu-layers 99 for how many layers to offload to the GPU. Try lowering it if your GPU runs out of memory, and remove it entirely for CPU-only inference.
Use -ot ".ffn_.*_exps.=CPU" to offload all MoE layers to the CPU! This effectively allows you to fit all non-MoE layers on 1 GPU, improving generation speeds. You can customize the regex expression to fit more layers if you have more GPU capacity. More options discussed here.
./llama.cpp/llama-cli \
--model unsloth/Qwen3-Coder-480B-A35B-Instruct-GGUF/UD-Q2_K_XL/Qwen3-Coder-480B-A35B-Instruct-UD-Q2_K_XL-00001-of-00004.gguf \
--threads -1 \
--ctx-size 16384 \
--n-gpu-layers 99 \
-ot ".ffn_.*_exps.=CPU" \
--temp 0.7 \
--min-p 0.0 \
--top-p 0.8 \
--top-k 20 \
--repeat-penalty 1.05
Also don't forget about the new Qwen3 update. Run Qwen3-235B-A22B-Instruct-2507 locally with llama.cpp.
🛠️ Improving generation speed
If you have more VRAM, you can try offloading more MoE layers, or offloading whole layers themselves.
Normally, -ot ".ffn_.*_exps.=CPU" offloads all MoE layers to the CPU! This effectively allows you to fit all non-MoE layers on 1 GPU, improving generation speeds. You can customize the regex expression to fit more layers if you have more GPU capacity.
If you have a bit more GPU memory, try -ot ".ffn_(up|down)_exps.=CPU", which offloads the up and down projection MoE layers.
Try -ot ".ffn_(up)_exps.=CPU" if you have even more GPU memory. This offloads only the up projection MoE layers.
You can also customize the regex, for example -ot "\.(6|7|8|9|[0-9][0-9]|[0-9][0-9][0-9])\.ffn_(gate|up|down)_exps.=CPU" means to offload the gate, up and down MoE layers, but only from the 6th layer onwards (see the example below).
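For example, re-using the UD-Q2_K_XL download and flags from above, a command that keeps the experts of layers 0 to 5 on the GPU and offloads the rest to the CPU could look like:
./llama.cpp/llama-cli \
    --model unsloth/Qwen3-Coder-480B-A35B-Instruct-GGUF/UD-Q2_K_XL/Qwen3-Coder-480B-A35B-Instruct-UD-Q2_K_XL-00001-of-00004.gguf \
    --threads -1 \
    --ctx-size 16384 \
    --n-gpu-layers 99 \
    -ot "\.(6|7|8|9|[0-9][0-9]|[0-9][0-9][0-9])\.ffn_(gate|up|down)_exps.=CPU" \
    --temp 0.7 --min-p 0.0 --top-p 0.8 --top-k 20 --repeat-penalty 1.05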
The latest llama.cpp release also introduces a high-throughput mode: use llama-parallel. Read more about it here. You can also quantize the KV cache to, for example, 4 bits to reduce VRAM / RAM movement, which can also make the generation process faster.
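As a rough sketch only for the high-throughput mode (llama-parallel is not built by the earlier cmake command, so add it to the --target list first; flags can vary between releases, so check ./llama.cpp/llama-parallel --help), decoding several sequences in parallel might look like:
./llama.cpp/llama-parallel \
    --model unsloth/Qwen3-Coder-480B-A35B-Instruct-GGUF/UD-Q2_K_XL/Qwen3-Coder-480B-A35B-Instruct-UD-Q2_K_XL-00001-of-00004.gguf \
    --ctx-size 16384 \
    --n-gpu-layers 99 \
    -ot ".ffn_.*_exps.=CPU" \
    -np 8
Here -np sets the number of parallel sequences to decode; adjust it to your hardware.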
📐How to fit long context (256K to 1M)
To fit longer context, you can use KV cache quantization to quantize the K and V caches to lower bits. This can also increase generation speed due to reduced RAM / VRAM data movement. The allowed options for K quantization (default is f16) include the below.
--cache-type-k f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1
You should use the _1 variants (e.g. q4_1, q5_1) for somewhat increased accuracy, although they are slightly slower.
You can also quantize the V cache, but you will need to compile llama.cpp with Flash Attention support via -DGGML_CUDA_FA_ALL_QUANTS=ON, and use --flash-attn to enable it.
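For example, a sketch combining 8-bit K/V cache quantization with Flash Attention and the full 256K context (assuming you built with -DGGML_CUDA_FA_ALL_QUANTS=ON and have enough RAM/VRAM for a context this large):
./llama.cpp/llama-cli \
    --model unsloth/Qwen3-Coder-480B-A35B-Instruct-GGUF/UD-Q2_K_XL/Qwen3-Coder-480B-A35B-Instruct-UD-Q2_K_XL-00001-of-00004.gguf \
    --threads -1 \
    --ctx-size 262144 \
    --n-gpu-layers 99 \
    -ot ".ffn_.*_exps.=CPU" \
    --flash-attn \
    --cache-type-k q8_0 \
    --cache-type-v q8_0 \
    --temp 0.7 --min-p 0.0 --top-p 0.8 --top-k 20 --repeat-penalty 1.05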
We also uploaded 1 million context length GGUFs via YaRN scaling here.
🧰 Tool Calling
Let's showcase how to format prompts for tool calling with an example.
We created a Python function called get_current_temperature which should get the current temperature for a location. For now it is a placeholder that always returns 26.1 degrees Celsius. You should change this to a true function!
def get_current_temperature(location: str, unit: str = "celsius"):
    """Get current temperature at a location.

    Args:
        location: The location to get the temperature for, in the format "City, State, Country".
        unit: The unit to return the temperature in. Defaults to "celsius". (choices: ["celsius", "fahrenheit"])

    Returns:
        the temperature, the location, and the unit in a dict
    """
    return {
        "temperature": 26.1, # PRE_CONFIGURED -> you change this!
        "location": location,
        "unit": unit,
    }
Then use the tokenizer to create the entire prompt:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("unsloth/Qwen3-Coder-480B-A35B-Instruct")
messages = [
    {'role': 'user', 'content': "What's the temperature in San Francisco now? How about tomorrow?"},
    {'content': "", 'role': 'assistant', 'function_call': None, 'tool_calls': [
        {'id': 'ID', 'function': {'arguments': {"location": "San Francisco, CA, USA"}, 'name': 'get_current_temperature'}, 'type': 'function'},
    ]},
    {'role': 'tool', 'content': '{"temperature": 26.1, "location": "San Francisco, CA, USA", "unit": "celsius"}', 'tool_call_id': 'ID'},
]
prompt = tokenizer.apply_chat_template(messages, tokenize = False)
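To try the formatted prompt against the GGUF, one option (a sketch; make_prompt.py is just an assumed file name for the snippet above, and you may also want add_generation_prompt = True so the prompt ends with the assistant header) is to print the prompt to a file and pass it to llama-cli with the built-in chat template disabled:
python make_prompt.py > tool_prompt.txt    # assumes the script ends with: print(prompt)
./llama.cpp/llama-cli \
    --model unsloth/Qwen3-Coder-480B-A35B-Instruct-GGUF/UD-Q2_K_XL/Qwen3-Coder-480B-A35B-Instruct-UD-Q2_K_XL-00001-of-00004.gguf \
    -no-cnv -f tool_prompt.txt \
    --temp 0.7 --top-p 0.8 --top-k 20 --repeat-penalty 1.05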
💡Performance Benchmarks
Here are the benchmarks for the 480B model, its score listed first in each row, followed by the scores of the comparison models (including Claude Sonnet-4, GPT-4.1, and Kimi K2):
Agentic Coding
Terminal-Bench: 37.5 | 30.0 | 2.5 | 35.5 | 25.3
SWE-bench Verified w/ OpenHands (500 turns): 69.6 | – | – | 70.4 | –
SWE-bench Verified w/ OpenHands (100 turns): 67.0 | 65.4 | 38.8 | 68.0 | 48.6
SWE-bench Verified w/ Private Scaffolding: – | 65.8 | – | 72.7 | 63.8
SWE-bench Live: 26.3 | 22.3 | 13.0 | 27.7 | –
SWE-bench Multilingual: 54.7 | 47.3 | 13.0 | 53.3 | 31.5
Multi-SWE-bench mini: 25.8 | 19.8 | 7.5 | 24.8 | –
Multi-SWE-bench flash: 27.0 | 20.7 | – | 25.0 | –
Aider-Polyglot: 61.8 | 60.0 | 56.9 | 56.4 | 52.4
Spider2: 31.1 | 25.2 | 12.8 | 31.1 | 16.5

Agentic Browser Use
WebArena: 49.9 | 47.4 | 40.0 | 51.1 | 44.3
Mind2Web: 55.8 | 42.7 | 36.0 | 47.4 | 49.6

Agentic Tool-Use
BFCL-v3: 68.7 | 65.2 | 56.9 | 73.3 | 62.9
TAU-Bench Retail: 77.5 | 70.7 | 59.1 | 80.5 | –
TAU-Bench Airline: 60.0 | 53.5 | 40.0 | 60.0 | –