🦥Unsloth Dynamic GGUFs on Aider Polyglot
Performance of Unsloth Dynamic GGUFs on Aider Polyglot Benchmarks
We're excited to share that Unsloth Dynamic GGUFs show how it's possible to quantize LLMs like DeepSeek-V3.1 (671B) down to just 1-bit or 3-bit and still outperform SOTA models like GPT-4.5, GPT-4.1 (April 2025) and Claude-4-Opus (May 2025).
Previously, we demonstrated how Unsloth Dynamic GGUFs outperform other quantization methods on 5-shot MMLU and KL Divergence. Now, we’re showcasing their performance on independent third-party evaluations using the Aider Polyglot benchmark.


⭐Key results
Our 1-bit Unsloth Dynamic GGUF shrinks DeepSeek-V3.1 from 671GB → 192GB (-75% size) and no-thinking mode greatly outperforms GPT-4.1 (Apr 2025), GPT-4.5, and DeepSeek-V3-0324.
3-bit Unsloth DeepSeek-V3.1 (thinking) GGUF: Outperforms Claude-4-Opus-20250514 (thinking).
5-bit Unsloth DeepSeek-V3.1 (non-thinking) GGUF: Matches Claude-4-Opus-20250514 (non-thinking) performance.
Unsloth Dynamic GGUFs perform consistently better than other non-Unsloth dynamic imatrix GGUFs.
Other non-Unsloth 1-bit and 2-bit DeepSeek-V3.1 quantizations, as well as standard 1-bit quantization without selective layer quantization, either failed to load or produced gibberish and looping outputs. This highlights how Unsloth Dynamic GGUFs are able to largely retain accuracy whereas other methods do not even function.
Why the Aider Polyglot benchmark? Aider is one of the most comprehensive measures of how well LLMs can write, code, follow instructions, and apply changes without human intervention, making it one of the hardest and most valuable benchmarks for real-world use.
The key advantage of using the Unsloth package and models is our active role in fixing critical bugs in major models. We've collaborated directly with teams behind Qwen3, Meta (Llama 4), Mistral (Devstral), Google (Gemma 1–3) and Microsoft (Phi-3/4), contributing essential fixes that significantly boost accuracy.
🦥Unsloth Dynamic Quantization
Dynamic 1-bit keeps important layers in 8 or 16 bits and quantizes unimportant layers to 1, 2, 3, 4, 5 or 6 bits.
In Nov 2024, our 4-bit Dynamic Quants showcased how you could largely restore QLoRA fine-tuning and model accuracy just by quantizing layers selectively. We later studied DeepSeek-R1's architecture and applied a similar methodology, quantizing some layers to as low as 1-bit and important layers to higher bits (6-bit, 8-bit). This approach quickly gained popularity and has proven especially effective for MoE models, making dynamic quantization the de facto standard for MoE quantization.
Our Dynamic GGUFs are even more effective when paired with our imatrix calibration dataset, designed for chat and coding performance. All of this enabled extreme LLM compression without catastrophic loss in quality.
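For readers who want to experiment with imatrix-based quantization themselves, a minimal sketch of the generic llama.cpp flow is below. This is not our exact calibration pipeline: the file names are placeholders, and llama-imatrix must also be built alongside llama-quantize.
# Sketch only: collect importance statistics from a calibration text file,
# then quantize so that weights the imatrix marks as sensitive keep more precision.
./llama.cpp/llama-imatrix -m DeepSeek-V3.1-F16.gguf -f calibration_chat_code.txt -o imatrix.dat
./llama.cpp/llama-quantize --imatrix imatrix.dat DeepSeek-V3.1-F16.gguf DeepSeek-V3.1-IQ1_M.gguf IQ1_M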
For example, in Qwen2-VL-2B-Instruct, naively quantizing all layers to 4-bit causes the model to fail to understand the image below. It's a train, not a coastal scene!


We also published dynamic quantization benchmarks for Gemma 3 and Llama 4 Scout in https://docs.unsloth.ai/basics/unsloth-dynamic-2.0-ggufs, showing how effective our methodology is:


⚙️Benchmark setup
For our DeepSeek-V3.1 experiments, we compared Unsloth Dynamic GGUFs at different bit-widths against:
Full-precision, unquantized LLMs including GPT-4.5, GPT-4.1, Claude-4-Opus, DeepSeek-V3-0324, etc.
Other dynamic imatrix V3.1 GGUFs
Semi-dynamic (some selective layer quantization) imatrix V3.1 GGUFs for ablation purposes.
Benchmark experiments were mainly conducted by David Sluys (neolithic5452 on Aider Discord), a trusted community contributor to Aider Polyglot evaluations. Tests were run ~3 times and the median score was taken; Pass-2 accuracy is reported, as is convention. There are some reproducible benchmark code snippets in Aider's Discord.
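For those wanting to reproduce the numbers, the runs broadly follow aider's own benchmark harness. A rough, illustrative sketch of launching it against a locally served model is below; the run name and model name are placeholders, and the exact flags and paths should be checked against the benchmark README in the aider repository.
# Illustrative only - consult aider's benchmark/README.md for the exact harness usage.
export OPENAI_API_BASE="http://localhost:8080/v1"
export OPENAI_API_KEY="sk-local"
./benchmark/benchmark.py unsloth-v31-q2-k-xl --model openai/deepseek-v3.1 --edit-format diff --threads 1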
DeepSeek-V3.1 has both a reasoning and a non-reasoning mode, and we test both. For non-reasoning, we see a clear trend in how our dynamic quantizations perform below: dynamic 5-bit attains 70.7% on Aider Pass-2, whilst dynamic 1-bit attains 55.7%. In terms of size versus accuracy, the 3-bit and 4-bit quants are extremely powerful!

🎇Comparison to other quants
We also run the Aider Polyglot benchmark on other dynamic imatrix GGUFs from the community and compare them to ours. To ensure a fair comparison, we do the following:
We select files of similar size and bit type to each Unsloth quant.
We use our fixed chat template if the community quant fails to run the benchmark. We found some community quants error out with {"code":500,"message":"split method must have between 1 and 1 positional arguments and between 0 and 0 keyword arguments at row 3, column 1908"}, and this is fixed by using our fixed chat template.
We see Unsloth dynamic quants doing remarkably well compared to other community quantizations of the same model size and quant type!

🍰Dynamic quantization ablations
We also ran ablations to confirm that our calibration dataset and our dynamic quantization methodology actually work. The trick of Unsloth's dynamic method is to quantize important layers to higher bits, say 8-bit, whilst unimportant layers are left in lower bits like 2-bit.
To test our method, we leave specific tensors in lower precision (e.g. 4-bit) versus higher precision. For example, below we leave the attn_k_b tensors in 4-bit (semi-dynamic) versus 8-bit (Unsloth current), and by increasing the quant size by only ~100MB or so (<0.1%), accuracy shoots up dramatically! attn_k_b and other tensors in DeepSeek V3.1 are highly important / sensitive to quantization and should be left in higher precision to retain accuracy!
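If you want to try this kind of ablation yourself, recent llama.cpp builds expose per-tensor type overrides in llama-quantize. A rough sketch is below; it is not our exact pipeline, the file names are placeholders, and the override syntax is worth verifying against ./llama.cpp/llama-quantize --help on your build.
# Sketch: keep the sensitive attn_k_b tensors in 8-bit while the rest of the model
# uses a low-bit base type.
./llama.cpp/llama-quantize --imatrix imatrix.dat \
    --tensor-type attn_k_b=q8_0 \
    DeepSeek-V3.1-F16.gguf DeepSeek-V3.1-IQ2_XXS.gguf IQ2_XXS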

🐛Chat Template Bug Fixes
During testing of DeepSeek-V3.1 quants, we found some lower-bit quants not enclosing <think> </think> properly or producing odd formatting. This caused some community quants to not work at lower bits, which made comparisons unfair. We found that llama.cpp's usage of minja (a simpler version of jinja) does not accept a positional argument in .split. We had to change:
{%- set content = content.split("</think>", 1)[1] -%}
to the below, which splits on every </think> and rejoins everything after the first occurrence, reproducing the original behaviour without a positional argument:
{%- set splitted = content.split("</think>") -%}
{%- set content = splitted[1:] | join("</think>") -%}
See https://huggingface.co/unsloth/DeepSeek-V3.1-GGUF?chat_template=default&format=true for our fixed chat template or https://huggingface.co/unsloth/DeepSeek-V3.1/raw/main/chat_template.jinja for a raw jinja file.
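If a community quant ships a broken template, you do not have to re-download the model: newer llama.cpp builds can load an external template at runtime. A sketch is below; it assumes --jinja together with --chat-template-file is available in your build, and the model path is a placeholder.
curl -L -o chat_template.jinja https://huggingface.co/unsloth/DeepSeek-V3.1/raw/main/chat_template.jinja
./llama.cpp/llama-cli -m DeepSeek-V3.1-IQ1_M.gguf --jinja --chat-template-file chat_template.jinja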
📊Pass Rate 1
Aider results are mainly reported as Pass Rate 2. We also report Pass Rate 1 to compare community quants of the same size. We see our dynamic quants do much better than other community quants of similar sizes, especially below 2-bit and above 4-bit; the 3-bit and 4-bit quants perform similarly well.

💻Run DeepSeek V3.1 Dynamic quants
Head over to our DeepSeek V3.1 guide, or to quickly get the dynamic 2-bit version, run:
apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
-DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-quantize llama-cli llama-gguf-split llama-mtmd-cli llama-server
cp llama.cpp/build/bin/llama-* llama.cpp
Then use llama.cpp to directly download the weights. We have already set the suggested optimal parameters, such as the temperature and the chat template:
export LLAMA_CACHE="unsloth/DeepSeek-V3.1-GGUF"
./llama.cpp/llama-cli \
-hf unsloth/DeepSeek-V3.1-GGUF:Q2_K_XL \
--jinja \
--n-gpu-layers 99 \
--temp 0.6 \
--top_p 0.95 \
--min_p 0.01 \
--ctx-size 8192 \
--seed 3407 \
-ot ".ffn_.*_exps.=CPU"