🌠Qwen3-Coder: How to Run Locally
Run Qwen3-Coder-480B-A35B locally with Unsloth Dynamic quants.
Qwen3-Coder-480B-A35B delivers SOTA advancements in agentic coding and coding tasks, matching or outperforming Claude Sonnet-4, GPT-4.1, and Kimi K2. The 480B model achieves 61.8% on Aider Polyglot and supports a 256K token context, extendable to 1M tokens.
All uploads use Unsloth Dynamic 2.0 for SOTA 5-shot MMLU and KL Divergence performance, meaning you can run & fine-tune quantized Qwen LLMs with minimal accuracy loss.
We also uploaded Qwen3-Coder with 1M context length (extended via YaRN), as well as unquantized 8-bit and 16-bit versions.
Qwen3 Coder - Unsloth Dynamic 2.0 GGUFs:
🖥️ Running Qwen3 Coder
⚙️ Official Recommended Settings
According to Qwen, these are the recommended settings for inference:
temperature=0.7, top_p=0.8, top_k=20, repetition_penalty=1.05
Temperature of 0.7
Top_K of 20
Min_P of 0.00 (optional, but 0.01 works well, llama.cpp default is 0.1)
Top_P of 0.8
Repetition Penalty of 1.05
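In llama.cpp, these map one-to-one onto the sampling flags used in the full commands later in this guide, for example (downloading the Q2_K_XL quant on the fly via -hf, as shown below):
./llama.cpp/llama-cli \
    -hf unsloth/Qwen3-Coder-480B-A35B-Instruct-GGUF:Q2_K_XL \
    --temp 0.7 \
    --top-p 0.8 \
    --top-k 20 \
    --min-p 0.0 \
    --repeat-penalty 1.05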
Chat template:
<|im_start|>user
Hey there!<|im_end|>
<|im_start|>assistant
What is 1+1?<|im_end|>
<|im_start|>user
2<|im_end|>
<|im_start|>assistant
Recommended output length: 65,536 tokens (can be increased). Details here.
Chat template/prompt format with newlines un-rendered
<|im_start|>user\nHey there!<|im_end|>\n<|im_start|>assistant\nWhat is 1+1?<|im_end|>\n<|im_start|>user\n2<|im_end|>\n<|im_start|>assistant\n
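As a quick sketch, you can feed this exact format to llama-cli yourself once you have a GGUF downloaded (see the llama.cpp section below); prompt.txt is just a file name we chose here, and -no-cnv disables llama.cpp's built-in chat template so the raw text is used as-is:
printf '<|im_start|>user\nHey there!<|im_end|>\n<|im_start|>assistant\n' > prompt.txt
./llama.cpp/llama-cli \
    --model unsloth/Qwen3-Coder-480B-A35B-Instruct-GGUF/UD-Q2_K_XL/Qwen3-Coder-480B-A35B-Instruct-UD-Q2_K_XL-00001-of-00004.gguf \
    -no-cnv -f prompt.txt \
    --temp 0.7 --top-p 0.8 --top-k 20 --repeat-penalty 1.05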
Chat template for tool calling (getting the current temperature for San Francisco). More details on how to format tool calls here.
<|im_start|>user
What's the temperature in San Francisco now? How about tomorrow?<|im_end|>
<|im_start|>assistant
<tool_call>
<function=get_current_temperature>
<parameter=location>
San Francisco, CA, USA
</parameter>
</function>
</tool_call><|im_end|>
<|im_start|>user
<tool_response>
{"temperature": 26.1, "location": "San Francisco, CA, USA", "unit": "celsius"}
</tool_response>
<|im_end|>
📖 Llama.cpp: Run Qwen3 Tutorial
For Qwen3-Coder-480B-A35B, we will specifically use llama.cpp for optimized inference and its wide range of options.
If you want full precision, use our Q8_K_XL, Q8_0, or BF16 versions!
Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.
apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp
You can use llama.cpp to download the model directly, but we normally suggest using huggingface_hub. To use llama.cpp directly, do:
./llama.cpp/llama-cli \
    -hf unsloth/Qwen3-Coder-480B-A35B-Instruct-GGUF:Q2_K_XL \
    --threads -1 \
    --ctx-size 16384 \
    --n-gpu-layers 99 \
    -ot ".ffn_.*_exps.=CPU" \
    --temp 0.7 \
    --min-p 0.0 \
    --top-p 0.8 \
    --top-k 20 \
    --repeat-penalty 1.05
Or, download the model via the snippet below (after installing pip install huggingface_hub hf_transfer). You can choose UD-Q2_K_XL or other quantized versions.
# !pip install huggingface_hub hf_transfer
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "0" # Can sometimes rate limit, so set to 0 to disable
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id = "unsloth/Qwen3-Coder-480B-A35B-Instruct-GGUF",
    local_dir = "unsloth/Qwen3-Coder-480B-A35B-Instruct-GGUF",
    allow_patterns = ["*UD-Q2_K_XL*"],
)
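If you prefer a pure command-line download, a roughly equivalent alternative is the huggingface-cli tool that ships with huggingface_hub:
huggingface-cli download unsloth/Qwen3-Coder-480B-A35B-Instruct-GGUF \
    --include "*UD-Q2_K_XL*" \
    --local-dir unsloth/Qwen3-Coder-480B-A35B-Instruct-GGUF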
Run the model in conversation mode and try any prompt.
Edit --threads -1 for the number of CPU threads, --ctx-size 262144 for the context length, and --n-gpu-layers 99 for how many layers to offload to the GPU. Try lowering it if your GPU runs out of memory, and remove it entirely for CPU-only inference.
Use -ot ".ffn_.*_exps.=CPU" to offload all MoE layers to the CPU! This effectively allows you to fit all non-MoE layers on 1 GPU, improving generation speeds. You can customize the regex expression to fit more layers if you have more GPU capacity. More options discussed here.
./llama.cpp/llama-cli \
--model unsloth/Qwen3-Coder-480B-A35B-Instruct-GGUF/UD-Q2_K_XL/Qwen3-Coder-480B-A35B-Instruct-UD-Q2_K_XL-00001-of-00004.gguf \
--threads -1 \
--ctx-size 16384 \
--n-gpu-layers 99 \
-ot ".ffn_.*_exps.=CPU" \
--temp 0.7 \
--min-p 0.0 \
--top-p 0.8 \
--top-k 20 \
--repeat-penalty 1.05
Also don't forget about the new Qwen3 update. Run Qwen3-235B-A22B-Instruct-2507 locally with llama.cpp.
🛠️ Improving generation speed
If you have more VRAM, you can try offloading more MoE layers, or offloading whole layers themselves.
Normally, -ot ".ffn_.*_exps.=CPU" offloads all MoE layers to the CPU! This effectively allows you to fit all non-MoE layers on 1 GPU, improving generation speeds. You can customize the regex expression to fit more layers if you have more GPU capacity.
If you have a bit more GPU memory, try -ot ".ffn_(up|down)_exps.=CPU", which offloads the up and down projection MoE layers.
Try -ot ".ffn_(up)_exps.=CPU" if you have even more GPU memory. This offloads only the up projection MoE layers.
You can also customize the regex, for example -ot "\.(6|7|8|9|[0-9][0-9]|[0-9][0-9][0-9])\.ffn_(gate|up|down)_exps.=CPU" means to offload the gate, up and down MoE layers, but only from the 6th layer onwards (see the example below).
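For example, re-using the UD-Q2_K_XL download and flags from above, a command that keeps the experts of layers 0 to 5 on the GPU and offloads the rest to the CPU could look like:
./llama.cpp/llama-cli \
    --model unsloth/Qwen3-Coder-480B-A35B-Instruct-GGUF/UD-Q2_K_XL/Qwen3-Coder-480B-A35B-Instruct-UD-Q2_K_XL-00001-of-00004.gguf \
    --threads -1 \
    --ctx-size 16384 \
    --n-gpu-layers 99 \
    -ot "\.(6|7|8|9|[0-9][0-9]|[0-9][0-9][0-9])\.ffn_(gate|up|down)_exps.=CPU" \
    --temp 0.7 --min-p 0.0 --top-p 0.8 --top-k 20 --repeat-penalty 1.05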
The latest llama.cpp release also introduces a high-throughput mode: use llama-parallel. Read more about it here. You can also quantize the KV cache to, for example, 4 bits to reduce VRAM / RAM movement, which can also make the generation process faster.
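As a rough sketch only for the high-throughput mode (llama-parallel is not built by the earlier cmake command, so add it to the --target list first; flags can vary between releases, so check ./llama.cpp/llama-parallel --help), decoding several sequences in parallel might look like:
./llama.cpp/llama-parallel \
    --model unsloth/Qwen3-Coder-480B-A35B-Instruct-GGUF/UD-Q2_K_XL/Qwen3-Coder-480B-A35B-Instruct-UD-Q2_K_XL-00001-of-00004.gguf \
    --ctx-size 16384 \
    --n-gpu-layers 99 \
    -ot ".ffn_.*_exps.=CPU" \
    -np 8
Here -np sets the number of parallel sequences to decode; adjust it to your hardware.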
📐How to fit long context (256K to 1M)
To fit longer context, you can use KV cache quantization to quantize the K and V caches to lower bits. This can also increase generation speed due to reduced RAM / VRAM data movement. The allowed options for K quantization (default is f16) include the below.
--cache-type-k f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1
You should use the _1 variants (e.g. q4_1, q5_1) for somewhat increased accuracy, although they are slightly slower.
You can also quantize the V cache, but you will need to compile llama.cpp with Flash Attention support via -DGGML_CUDA_FA_ALL_QUANTS=ON, and use --flash-attn to enable it.
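For example, a sketch combining 8-bit K/V cache quantization with Flash Attention and the full 256K context (assuming you built with -DGGML_CUDA_FA_ALL_QUANTS=ON and have enough RAM/VRAM for a context this large):
./llama.cpp/llama-cli \
    --model unsloth/Qwen3-Coder-480B-A35B-Instruct-GGUF/UD-Q2_K_XL/Qwen3-Coder-480B-A35B-Instruct-UD-Q2_K_XL-00001-of-00004.gguf \
    --threads -1 \
    --ctx-size 262144 \
    --n-gpu-layers 99 \
    -ot ".ffn_.*_exps.=CPU" \
    --flash-attn \
    --cache-type-k q8_0 \
    --cache-type-v q8_0 \
    --temp 0.7 --min-p 0.0 --top-p 0.8 --top-k 20 --repeat-penalty 1.05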
We also uploaded 1 million context length GGUFs via YaRN scaling here.
🧰 Tool Calling
Let's showcase how to format prompts for tool calling with an example.
We created a Python function called get_current_temperature which should get the current temperature for a location. For now it is a placeholder that always returns 26.1 degrees Celsius. You should change this to a true function!
def get_current_temperature(location: str, unit: str = "celsius"):
    """Get current temperature at a location.

    Args:
        location: The location to get the temperature for, in the format "City, State, Country".
        unit: The unit to return the temperature in. Defaults to "celsius". (choices: ["celsius", "fahrenheit"])

    Returns:
        the temperature, the location, and the unit in a dict
    """
    return {
        "temperature": 26.1, # PRE_CONFIGURED -> you change this!
        "location": location,
        "unit": unit,
    }
Then use the tokenizer to create the entire prompt:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("unsloth/Qwen3-Coder-480B-A35B-Instruct")
messages = [
    {'role': 'user', 'content': "What's the temperature in San Francisco now? How about tomorrow?"},
    {'content': "", 'role': 'assistant', 'function_call': None, 'tool_calls': [
        {'id': 'ID', 'function': {'arguments': {"location": "San Francisco, CA, USA"}, 'name': 'get_current_temperature'}, 'type': 'function'},
    ]},
    {'role': 'tool', 'content': '{"temperature": 26.1, "location": "San Francisco, CA, USA", "unit": "celsius"}', 'tool_call_id': 'ID'},
]
prompt = tokenizer.apply_chat_template(messages, tokenize = False)
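To try the formatted prompt against the GGUF, one option (a sketch; make_prompt.py is just an assumed file name for the snippet above, and you may also want add_generation_prompt = True so the prompt ends with the assistant header) is to print the prompt to a file and pass it to llama-cli with the built-in chat template disabled:
python make_prompt.py > tool_prompt.txt    # assumes the script ends with: print(prompt)
./llama.cpp/llama-cli \
    --model unsloth/Qwen3-Coder-480B-A35B-Instruct-GGUF/UD-Q2_K_XL/Qwen3-Coder-480B-A35B-Instruct-UD-Q2_K_XL-00001-of-00004.gguf \
    -no-cnv -f tool_prompt.txt \
    --temp 0.7 --top-p 0.8 --top-k 20 --repeat-penalty 1.05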
💡Performance Benchmarks
Here are the benchmarks for the 480B model, its score listed first in each row, followed by the scores of the comparison models (including Claude Sonnet-4, GPT-4.1, and Kimi K2):
Agentic Coding
Terminal-Bench: 37.5 | 30.0 | 2.5 | 35.5 | 25.3
SWE-bench Verified w/ OpenHands (500 turns): 69.6 | – | – | 70.4 | –
SWE-bench Verified w/ OpenHands (100 turns): 67.0 | 65.4 | 38.8 | 68.0 | 48.6
SWE-bench Verified w/ Private Scaffolding: – | 65.8 | – | 72.7 | 63.8
SWE-bench Live: 26.3 | 22.3 | 13.0 | 27.7 | –
SWE-bench Multilingual: 54.7 | 47.3 | 13.0 | 53.3 | 31.5
Multi-SWE-bench mini: 25.8 | 19.8 | 7.5 | 24.8 | –
Multi-SWE-bench flash: 27.0 | 20.7 | – | 25.0 | –
Aider-Polyglot: 61.8 | 60.0 | 56.9 | 56.4 | 52.4
Spider2: 31.1 | 25.2 | 12.8 | 31.1 | 16.5

Agentic Browser Use
WebArena: 49.9 | 47.4 | 40.0 | 51.1 | 44.3
Mind2Web: 55.8 | 42.7 | 36.0 | 47.4 | 49.6

Agentic Tool-Use
BFCL-v3: 68.7 | 65.2 | 56.9 | 73.3 | 62.9
TAU-Bench Retail: 77.5 | 70.7 | 59.1 | 80.5 | –
TAU-Bench Airline: 60.0 | 53.5 | 40.0 | 60.0 | –