Unsloth Dynamic v2.0 GGUFs
A big new upgrade to our Dynamic Quants!
We're excited to introduce our Dynamic v2.0 quantization method - a major upgrade to our previous quants. This new method outperforms leading quantization methods and sets new benchmarks for 5-shot MMLU and KL Divergence.
This means you can now run + fine-tune quantized LLMs while preserving as much accuracy as possible! You can run the v2.0 GGUFs on any inference engine, such as llama.cpp, Ollama, and Open WebUI.
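For instance, Ollama can pull GGUFs straight from Hugging Face. The repository and quant tag below are placeholders, so substitute the model and quant you actually want:

```bash
# Pull and run a Dynamic v2.0 GGUF directly from Hugging Face via Ollama
# (replace <model-name> and <quant> with a real repo and quant tag)
ollama run hf.co/unsloth/<model-name>-GGUF:<quant>
```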
Revamped Layer Selection for GGUFs + safetensors: Unsloth Dynamic 2.0 now selectively quantizes layers much more intelligently and extensively. Rather than modifying only select layers, we now dynamically adjust the quantization type of every possible layer, and the combinations will differ for each layer and model.
Currently selected and all future GGUF uploads will utilize Dynamic 2.0 and our new calibration dataset. The dataset ranges from 300K to 1.5M tokens (depending on the model) and comprises high-quality, hand-curated and cleaned data - to greatly enhance conversational chat performance.
Previously, our Dynamic quantization (DeepSeek-R1 1.58-bit GGUF) was effective only for MoE architectures. Dynamic 2.0 quantization now works on all models (including MoEs).
Model-Specific Quants: Each model now uses a custom-tailored quantization scheme. E.g. the layers quantized in Gemma 3 differ significantly from those in Llama 4.
To ensure accurate benchmarking, we built an internal evaluation framework to match official reported 5-shot MMLU scores of Llama 4 and Gemma 3. This allowed apples-to-apples comparisons between full-precision vs. Dynamic v2.0, QAT and standard imatrix GGUF quants.
View all our Dynamic 2.0 GGUF models on our Hugging Face page. Currently, we've released updated quants for DeepSeek, Llama, Gemma 3 and Mistral models.
All future GGUF uploads will utilize Unsloth Dynamic 2.0, and our Dynamic 4-bit safetensors quants will also benefit from this in the future.
The paper on "flips" discussed further below also shows that, interestingly, KL Divergence is highly correlated with flips, and so our goal is to reduce the mean KL Divergence whilst increasing the disk space of the quantization as little as possible.
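For reference, the KL Divergence reported here is the usual token-level divergence between the full-precision model's next-token distribution P and the quantized model's distribution Q, averaged over all evaluated token positions (a sketch of the metric, not our exact evaluation code):

$$
\overline{D}_{\mathrm{KL}} \;=\; \frac{1}{T}\sum_{t=1}^{T}\sum_{v \in \mathcal{V}} P_t(v)\,\log\frac{P_t(v)}{Q_t(v)}
$$

where \( \mathcal{V} \) is the vocabulary and \( T \) is the number of token positions.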
Most frameworks report perplexity and KL Divergence using a test set of Wikipedia articles. However, we noticed that using a calibration dataset that is also Wikipedia-related causes quants to overfit and attain lower perplexity scores. In addition, instruct models have unique chat templates, and using text-only calibration datasets is not effective for instruct models (though it is fine for base models). In fact, most imatrix GGUFs are typically calibrated in exactly this way. As a result, they naturally perform better on KL Divergence benchmarks that also use Wikipedia data, since the model is essentially optimized for that domain.
To ensure a fair and controlled evaluation, we do not use our own calibration dataset (which is optimized for chat performance) when benchmarking KL Divergence. Instead, we conducted tests using the same standard Wikipedia datasets, allowing us to directly compare the performance of our Dynamic 2.0 method against the baseline imatrix approach.
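As a rough sketch, a comparison like this can be run with llama.cpp's llama-perplexity tool; the file names below are placeholders, and you should check llama-perplexity --help for the exact flags in your build:

```bash
# 1) Save the full-precision model's logits on the Wikipedia test set
llama-perplexity -m model-bf16.gguf -f wiki.test.raw --kl-divergence-base logits-bf16.bin

# 2) Measure a quant's KL Divergence against those saved logits
llama-perplexity -m model-Q2_K_XL.gguf --kl-divergence-base logits-bf16.bin --kl-divergence
```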
Replicating MMLU 5-shot was nightmarish. We could not replicate MMLU results for many models, including Llama 3.1 (8B) Instruct, Gemma 3 (12B) and others, due to subtle implementation issues. Llama 3.1 (8B), for example, should score ~68.2%, whilst incorrect implementations can attain as low as 35% accuracy.
Llama 3.1 (8B) Instruct gets an MMLU 5-shot accuracy of 67.8% using a naive MMLU implementation. We found, however, that Llama tokenizes "A" and "_A" (A with a space in front) as different token IDs. If we consider both the spaced and non-spaced tokens, we get 68.2% (+0.4%).
The Gemma team released two QAT (quantization aware training) versions of Gemma 3: a Q4_0 GGUF and an int4 version (described in the notes at the end of this page).
We benchmarked all Q4_0 GGUF versions and ran extensive experiments on the 12B model. The 12B Q4_0 QAT model gets 67.07% whilst the full bfloat16 12B version gets 67.15% on 5-shot MMLU. That's very impressive! The 27B model is nearly on par as well.
| Metric | 1B | 4B | 12B | 27B |
| --- | --- | --- | --- | --- |
| MMLU 5-shot | 26.12% | 55.13% | 67.07% (67.15% BF16) | 70.64% (71.5% BF16) |
| Disk Space | 0.93 GB | 2.94 GB | 7.52 GB | 16.05 GB |
| Efficiency* | 1.20 | 10.26 | 5.59 | 2.84 |
We designed a new Efficiency metric which calculates the usefulness of the model whilst also taking into account its disk size and MMLU 5-shot score:
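Concretely, the Efficiency values in the table above are consistent with:

$$
\text{Efficiency} \;=\; \frac{\text{MMLU 5-shot score} - 25}{\text{Disk Space (GB)}}
$$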
We subtract 25 since MMLU has 4 multiple-choice options - A, B, C or D. A model that simply chooses answers at random would get 25% accuracy while taking up only a few bytes of disk space, but it is clearly not a useful model.
On KL Divergence versus the base model, below is a table showcasing the improvements. As a reminder, the closer the KL Divergence is to 0, the better (i.e. 0 means identical to the full-precision model).
| Quant | KL Divergence (old) | Disk Space (old, GB) | KL Divergence (Dynamic v2.0) | Disk Space (Dynamic v2.0, GB) |
| --- | --- | --- | --- | --- |
| IQ1_S | 1.035688 | 5.83 | 0.972932 | 6.06 |
| IQ1_M | 0.832252 | 6.33 | 0.800049 | 6.51 |
| IQ2_XXS | 0.535764 | 7.16 | 0.521039 | 7.31 |
| IQ2_M | 0.26554 | 8.84 | 0.258192 | 8.96 |
| Q2_K_XL | 0.229671 | 9.78 | 0.220937 | 9.95 |
| Q3_K_XL | 0.087845 | 12.51 | 0.080617 | 12.76 |
| Q4_K_XL | 0.024916 | 15.41 | 0.023701 | 15.64 |
If we plot the ratio of the disk space increase against the change in KL Divergence, we can see a much clearer benefit! Our dynamic 2-bit Q2_K_XL reduces KLD quite a bit (around 7.5%).
To run Llama 4 Scout, for example, first clone llama.cpp:
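The exact build commands depend on your setup; something along these lines works for a CUDA build (drop -DGGML_CUDA=ON for a CPU-only build), but see the llama.cpp README for the authoritative instructions:

```bash
# Clone llama.cpp and build it with CMake
git clone https://github.com/ggerganov/llama.cpp
cmake llama.cpp -B llama.cpp/build -DGGML_CUDA=ON
cmake --build llama.cpp/build --config Release -j
```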
Then download our new Dynamic v2.0 quant for Scout:
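One option is the huggingface-cli tool. The repository name and file pattern below are illustrative; check our Hugging Face page for the exact repo and pick the quant you want:

```bash
# Download only the shards of one quant into a local folder
# (repo name and --include pattern are examples, not the only valid choices)
pip install huggingface_hub
huggingface-cli download unsloth/Llama-4-Scout-17B-16E-Instruct-GGUF \
  --include "*Q2_K_XL*" \
  --local-dir Llama-4-Scout-GGUF
```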
And let's do inference!
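A minimal llama-cli invocation looks something like the following; the GGUF path is a placeholder for whatever you downloaded above, -ngl 99 offloads as many layers as possible to the GPU, and -c sets the context length:

```bash
# Chat with the downloaded quant
# (replace the path with your actual .gguf file; for split GGUFs, point -m at the first shard)
./llama.cpp/build/bin/llama-cli \
  -m Llama-4-Scout-GGUF/<your-quant>.gguf \
  -ngl 99 \
  -c 8192 \
  -p "Why is the sky blue?"
```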
We also helped find and fix a few Llama 4 bugs; these are described in the notes at the end of this page.
As shown in our graph, our 4-bit Dynamic quants deliver better performance on 5-shot MMLU than QAT while also being smaller in size.
The paper referenced above showcases how pruning layers, even supposedly unnecessary ones, still yields vast differences in terms of "flips". A "flip" is defined as an answer changing from incorrect to correct or vice versa. The paper shows that MMLU might not decrease as we prune layers or quantize, but that's because some incorrect answers might have "flipped" to become correct. Our goal is to match the original model, so measuring "flips" is a good metric.
Interestingly, Eleuther AI's LM Evaluation Harness also appends "The best answer is" to the question for Llama 3, following Llama 3's original MMLU benchmarks.
There are many other subtle issues, so to benchmark everything in a controlled environment, we designed our own MMLU implementation from scratch by investigating the reference MMLU implementation directly, and verified our results across multiple models by comparing against reported numbers.
Q4_0 GGUF - Quantizes all layers to Q4_0 via the formula w = q * block_scale, with each block having 32 weights (see the sketch below).
int4 version (exact format unclear).
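For intuition, here is a sketch of what that Q4_0 formula means, assuming the standard ggml Q4_0 block layout (one fp16 scale d per block of 32 weights, each weight stored as a 4-bit value; treat the exact offset convention as an assumption):

$$
w_i \;\approx\; d \cdot (q_i - 8), \qquad q_i \in \{0, 1, \dots, 15\}, \quad i = 1, \dots, 32
$$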
Read more on running Llama 4 in our dedicated Llama 4 guide.
Llama 4 Scout changed the RoPE scaling configuration in their official repo. We helped resolve issues in llama.cpp to enable this.
Llama 4's QK Norm epsilon for both Scout and Maverick should be taken from the config file - this means using 1e-05 and not 1e-06. We helped resolve these issues upstream.
The Llama 4 team and vLLM also independently fixed an issue with QK Norm being shared across all heads (it should not be shared). This increased MMLU Pro accuracy from 68.58% to 71.53%.
It was also shown that our GGUFs run via llama.cpp attain much higher accuracy than third-party inference providers - this was most likely a combination of the issues explained above, and probably also due to quantization issues.