
🦥Unsloth Dynamic GGUFs on Aider Polyglot

Performance of Unsloth Dynamic GGUFs on Aider Polyglot Benchmarks

We’re excited to share that Unsloth Dynamic GGUFs show how it's possible to quantize LLMs like DeepSeek-V3.1 (671B) down to just 1-bit or 3-bit and still outperform SOTA models like GPT-4.5, GPT-4.1 (April 2025), and Claude-4-Opus (May 2025).

Previously, we demonstrated how Unsloth Dynamic GGUFs outperform other quantization methods on 5-shot MMLU and KL Divergence. Now, we’re showcasing their performance on independent third-party evaluations using the Aider Polyglot benchmark.

Key results

  • Our 1-bit Unsloth Dynamic GGUF shrinks DeepSeek-V3.1 from 671GB → 192GB (-75% size), and in no-thinking mode it greatly outperforms GPT-4.1 (Apr 2025), GPT-4.5, and DeepSeek-V3-0324.

  • 3-bit Unsloth DeepSeek-V3.1 (thinking) GGUF: Outperforms Claude-4-Opus-20250514 (thinking).

  • 5-bit Unsloth DeepSeek-V3.1 (non-thinking) GGUF: Matches Claude-4-Opus-20250514 (non-thinking) performance.

  • Unsloth Dynamic GGUFs consistently outperform other (non-Unsloth) dynamic imatrix GGUFs.

  • Other non-Unsloth 1-bit and 2-bit DeepSeek-V3.1 quantizations, as well as standard 1-bit quantization without selective layer quantization, either failed to load or produced gibberish and looping outputs. This highlights how Unsloth Dynamic GGUFs largely retain accuracy where other methods do not even function.

Why the Aider Polyglot benchmark? Aider is one of the most comprehensive measures of how well LLMs can write code, follow instructions, and apply changes without human intervention, making it one of the hardest and most valuable benchmarks for real-world use.

🦥Unsloth Dynamic Quantization

In Nov 2024, our 4-bit Dynamic Quants showed how you could largely restore QLoRA fine-tuning and model accuracy by selectively quantizing layers. We later studied DeepSeek-R1's architecture and applied a similar methodology, quantizing some layers to as low as 1-bit and important layers to higher bits (6-bit, 8-bit). This approach quickly gained popularity and has proven especially effective for MoE models, making dynamic quantization the de facto standard for MoE quantization.

Our Dynamic GGUFs are even more effective when paired with our imatrix calibration dataset, designed for chat and coding performance. All of this enabled extreme LLM compression without catastrophic loss in quality.
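To make the idea concrete, below is a minimal illustrative sketch of selective bit allocation: layers judged more important (for example by activation statistics gathered from a calibration set, which is roughly the kind of signal an imatrix provides) receive more bits. The importance metric, cutoffs, and layer names here are hypothetical and are not our exact recipe.

# Illustrative sketch only: pick per-layer bit-widths from an importance score.
# The importance metric, thresholds and layer names are hypothetical; this is
# not Unsloth's actual dynamic-quantization recipe.
import numpy as np

def layer_importance(weights: np.ndarray, calib_acts: np.ndarray) -> float:
    # Stand-in for calibration statistics: layers whose weights interact with
    # larger activations are treated as more important.
    return float(np.mean(np.abs(calib_acts)) * np.mean(np.abs(weights)))

def choose_bits(importances: dict[str, float]) -> dict[str, int]:
    # Purely illustrative cutoffs: top quarter -> 8-bit, bottom quarter -> 2-bit,
    # everything else -> 4-bit.
    ranked = sorted(importances, key=importances.get, reverse=True)
    n = len(ranked)
    return {
        name: 8 if i < n // 4 else (2 if i >= 3 * n // 4 else 4)
        for i, name in enumerate(ranked)
    }

rng = np.random.default_rng(0)
layers = {f"blk.{i}.ffn_down.weight": rng.normal(size=(64, 64)) for i in range(8)}
acts = rng.normal(size=(16, 64))
bits = choose_bits({name: layer_importance(w, acts) for name, w in layers.items()})
print(bits)  # maps each layer name to 2, 4 or 8 bits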

For example, in Qwen2-VL-2B-Instruct, naively quantizing all layers to 4-bit causes the model to fail to understand the image below. It's a train, not a coastal scene!

We also published dynamic benchmarks for Gemma 3 and Llama 4 Scout at https://docs.unsloth.ai/basics/unsloth-dynamic-2.0-ggufs, showing how effective our methodology is.

⚙️Benchmark setup

For our DeepSeek-V3.1 experiments, we compared different bits of Unsloth Dynamic GGUFs against:

  • Full-precision, unquantized LLMs including GPT-4.5, GPT-4.1, Claude-4-Opus, DeepSeek-V3-0324, etc.

  • Other dynamic imatrix V3.1 GGUFs

  • Semi-dynamic (some selective layer quantization) imatrix V3.1 GGUFs for ablation purposes.

Benchmark experiments were mainly conducted by David Sluys (neolithic5452 on the Aider Discord), a trusted community contributor to Aider Polyglot evaluations. Tests were run ~3 times and the median score taken, with Pass-2 accuracy reported as per Aider convention. Reproducible benchmark code snippets are available in Aider's Discord.
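For reference, "Pass-2" means an exercise counts as solved if the model gets it right within two attempts, where the second attempt sees the failing test output from the first. A minimal sketch of how such a score could be tallied from per-exercise results (the data structure here is made up purely for illustration):

# Illustrative only: compute pass-1 and pass-2 rates from per-exercise results.
from dataclasses import dataclass

@dataclass
class ExerciseResult:
    passed_try1: bool
    passed_try2: bool  # second attempt, after seeing the first attempt's test failures

def pass_rates(results: list[ExerciseResult]) -> tuple[float, float]:
    n = len(results)
    pass1 = sum(r.passed_try1 for r in results) / n
    pass2 = sum(r.passed_try1 or r.passed_try2 for r in results) / n
    return pass1, pass2

demo = [ExerciseResult(True, True), ExerciseResult(False, True),
        ExerciseResult(False, False), ExerciseResult(True, True)]
p1, p2 = pass_rates(demo)
print(f"pass-1: {p1:.1%}, pass-2: {p2:.1%}")  # pass-1: 50.0%, pass-2: 75.0%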

Expand for Reasoning model Aider benchmarks

Model                             Accuracy (%)
GPT-5                             86.7
Gemini 2.5 Pro (June)             83.1
o3                                76.9
DeepSeek V3.1                     76.1
(3-bit) DeepSeek V3.1 Unsloth     75.6
Claude-4-Opus (May)               72
o4-mini (High)                    72
DeepSeek R1 0528                  71.4
(2-bit) DeepSeek V3.1 Unsloth     66.7
Claude-3.7-Sonnet (Feb)           64.9
(1-bit) DeepSeek V3.1 Unsloth     57.8
DeepSeek R1                       56.9

Expand for Non-Reasoning model Aider benchmarks

Model                             Accuracy (%)
DeepSeek V3.1                     71.6
Claude-4-Opus (May)               70.7
(5-bit) DeepSeek V3.1 Unsloth     70.7
(4-bit) DeepSeek V3.1 Unsloth     69.7
(3-bit) DeepSeek V3.1 Unsloth     68.4
(2-bit) DeepSeek V3.1 Unsloth     65.8
Qwen3 235B A22B                   59.6
Kimi K2                           59.1
(1-bit) DeepSeek V3.1 Unsloth     55.7
DeepSeek V3-0324                  55.1
GPT-4.1 (April 2025)              52.4
ChatGPT-4o (March 2025)           45.3
GPT-4.5                           44.9

DeepSeek V3.1 has both a reasoning and a non-reasoning mode, and we test both. For non-reasoning, we see a clear trend in how our dynamic quantizations perform: dynamic 5-bit attains 70.7% on Aider Pass-2, whilst dynamic 1-bit attains 55.7%. In terms of size versus accuracy, the 3-bit and 4-bit quants are extremely powerful!

🎇Comparison to other quants

We also run the Aider Polyglot benchmark on other dynamic imatrix GGUFs from the community and compare them to ours. To ensure a fair comparison, we do the following:

  1. We select files of similar size and bit type to each Unsloth quant.

  2. We use our fixed chat template if the community quant fails to execute the benchmark. Some community quants fail with errors like {"code":500,"message":"split method must have between 1 and 1 positional arguments and between 0 and 0 keyword arguments at row 3, column 1908"}, which is resolved by switching to our fixed chat template.

We see Unsloth dynamic quants doing remarkably well compared to other community quantizations of the same model size and quant type!

Expand for raw numerical data comparison to other quants

Quant       Quant Size (GB)      Accuracy (%)
IQ2_XXS     164                  43.6
TQ1_0       170                  50.7
IQ1_M       206                  55.7
IQ2_M       215                  56.6
IQ2_XXS     225                  61.2
IQ2_M       235                  64.3
Q2_K_L      239                  64.0
Q2_K_XL     255                  65.8
IQ3_XXS     268                  65.6
IQ3_XXS     279                  66.8
Q3_K_S      293                  65.2
Q3_K_XL     300                  68.4
IQ4_XS      357                  69.2
IQ4_XS      360                  66.3
Q4_K_XL     387                  69.7
Q4_K_M      405                  69.7
Q4_K_M      409                  67.7
Q5_K_M      478                  68.9
Q5_K_XL     484                  70.7

🍰Dynamic quantization ablations

We also ran ablations to confirm that our calibration dataset and our dynamic quantization methodology actually work. The trick of Unsloth's dynamic method is to quantize important layers to higher bits, say 8-bit, whilst leaving unimportant layers at lower bits like 2-bit.

To test our method, we leave specific tensors in lower precision (e.g. 4-bit) versus higher precision. For example, below we leave the attn_k_b tensors in 4-bit (semi-dynamic) versus 8-bit (current Unsloth), and by increasing the quant size by only ~100MB or so (<0.1%), accuracy shoots up dramatically!
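To make the ablation concrete, here is a minimal conceptual sketch of a per-tensor override: only tensors matching attn_k_b are promoted to 8-bit, while everything else keeps the base type. The base type, regex pattern, and example tensor names are illustrative, not the exact shipped recipe.

# Illustrative sketch: per-tensor quantization overrides. Only attn_k_b tensors
# are promoted from the 4-bit base type to 8-bit; everything else is unchanged.
# The base type and pattern here are illustrative, not the shipped recipe.
import re

BASE_TYPE = "Q4_K"  # hypothetical default for the semi-dynamic ablation
OVERRIDES = [
    (re.compile(r"\.attn_k_b\."), "Q8_0"),  # the ~100MB that matters
]

def quant_type(tensor_name: str) -> str:
    for pattern, qtype in OVERRIDES:
        if pattern.search(tensor_name):
            return qtype
    return BASE_TYPE

for name in ["blk.0.attn_k_b.weight", "blk.0.ffn_down_exps.weight"]:
    print(name, "->", quant_type(name))
# blk.0.attn_k_b.weight -> Q8_0
# blk.0.ffn_down_exps.weight -> Q4_K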

🐛Chat Template Bug Fixes

During testing of the DeepSeek-V3.1 quants, we found some lower-bit quants not enclosing <think> </think> properly or producing odd formatting. This caused some community quants to fail at lower bits, which made comparisons unfair. We found that llama.cpp's use of minja (a simpler version of jinja) does not accept positional arguments in .split. We had to change:

{%- set content = content.split("</think>", 1)[1] -%}

to the below:

{%- set splitted = content.split("</think>") -%}
{%- set content = splitted[1:] | join("</think>") -%}

See https://huggingface.co/unsloth/DeepSeek-V3.1-GGUF?chat_template=default&format=true for our fixed chat template or https://huggingface.co/unsloth/DeepSeek-V3.1/raw/main/chat_template.jinja for a raw jinja file.
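Both forms keep everything after the first </think> tag, even when the tag appears again later in the content. A quick plain-Python check of that string logic (minja itself is a C++ library; this is only to illustrate the equivalence):

# Check that the rewritten template logic matches the original intent:
# everything after the *first* "</think>" tag is kept.
content = "reasoning...</think>final answer with a stray </think> tag"

original = content.split("</think>", 1)[1]   # the positional argument minja rejects
splitted = content.split("</think>")
rewritten = "</think>".join(splitted[1:])    # what the fixed template does

assert original == rewritten
print(rewritten)  # final answer with a stray </think> tag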

📊Pass Rate 1

Aider is mainly reported on pass rate 2. We also report pass rate 1 to compare community quants of the same size. Our dynamic quants do much better than other community quants of similar size, especially below 2-bit and above 4-bit; the 3-bit and 4-bit quants perform similarly well.

💻Run DeepSeek V3.1 Dynamic quants

Head over to our DeepSeek V3.1 guide, or to quickly get the dynamic 2-bit version, do:

apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-quantize llama-cli llama-gguf-split llama-mtmd-cli llama-server
cp llama.cpp/build/bin/llama-* llama.cpp

then use llama.cpp to directly download the weights. We have already set the suggested optimal parameters such as temperature and the chat template:

export LLAMA_CACHE="unsloth/DeepSeek-V3.1-GGUF"
./llama.cpp/llama-cli \
    -hf unsloth/DeepSeek-V3.1-GGUF:Q2_K_XL \
    --jinja \
    --n-gpu-layers 99 \
    --temp 0.6 \
    --top_p 0.95 \
    --min_p 0.01 \
    --ctx-size 8192 \
    --seed 3407 \
    -ot ".ffn_.*_exps.=CPU"
