🦙Llama 4: How to Run & Fine-tune

How to run Llama 4 locally using our dynamic GGUFs, which recover accuracy compared to standard quantization.

The Llama-4-Scout model has 109B parameters, while Maverick has 402B. Scout's full unquantized version requires 113GB of disk space, whilst the 1.78-bit version uses just 33.8GB (roughly a 70% reduction in size). Maverick (402B) went from 422GB to just 122GB (also roughly -70%).

Scout 1.78-bit fits on a single 24GB GPU for fast inference at ~20 tokens/sec. Maverick 1.78-bit fits on 2x 48GB GPUs for fast inference at ~40 tokens/sec.

For our dynamic GGUFs, to ensure the best tradeoff between accuracy and size, we do not quantize all layers uniformly: we selectively quantize e.g. the MoE layers to lower bits, and leave attention and other layers at 4 or 6-bit.
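To make the idea concrete, here is an illustrative sketch only (not Unsloth's actual quantization code; the tensor-name patterns are assumptions loosely modelled on GGUF tensor naming): layer selection amounts to mapping each tensor name to a quantization type.

```python
# Illustrative sketch only - not Unsloth's actual code. Tensor-name patterns
# are assumptions loosely modelled on GGUF tensor naming.
def pick_quant_type(tensor_name: str, moe_type: str = "IQ1_S") -> str:
    """Push the huge routed-MoE expert weights to very low bit; keep attention
    and the remaining (much smaller) tensors at higher precision."""
    if "exps" in tensor_name:      # routed MoE expert weights - the bulk of the model size
        return moe_type            # e.g. ~1.78-bit
    if "attn" in tensor_name:      # attention projections
        return "Q4_K"              # keep at 4-bit or higher
    return "Q6_K"                  # embeddings, router, norms, output head, etc.

# The expert tensors get the low-bit type; everything else stays higher precision.
print(pick_quant_type("blk.10.ffn_down_exps.weight"))  # IQ1_S
print(pick_quant_type("blk.10.attn_q.weight"))         # Q4_K
```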

All our GGUF models are quantized using calibration data (around 250K tokens for Scout and 1M tokens for Maverick), which improves accuracy over standard quantization. Unsloth imatrix quants are fully compatible with popular inference engines such as llama.cpp, Open WebUI, etc.

Scout - Unsloth Dynamic GGUFs with optimal configs:

| MoE Bits | Type | Disk Size | Details |
| --- | --- | --- | --- |
| 1.78bit | IQ1_S | 33.8GB | 2.06/1.56bit |
| 1.93bit | IQ1_M | 35.4GB | 2.5/2.06/1.56bit |
| 2.42bit | IQ2_XXS | 38.6GB | 2.5/2.06bit |
| 2.71bit | Q2_K_XL | 42.2GB | 3.5/2.5bit |
| 3.5bit | Q3_K_XL | 52.9GB | 4.5/3.5bit |
| 4.5bit | Q4_K_XL | 65.6GB | 5.5/4.5bit |

For best results, use the 2.42-bit (IQ2_XXS) or larger versions.

Maverick - Unsloth Dynamic GGUFs with optimal configs:

| MoE Bits | Type | Disk Size |
| --- | --- | --- |
| 1.78bit | IQ1_S | 122GB |
| 1.93bit | IQ1_M | 128GB |
| 2.42bit | IQ2_XXS | 140GB |
| 2.71bit | Q2_K_XL | 151GB |
| 3.5bit | Q3_K_XL | 193GB |
| 4.5bit | Q4_K_XL | 243GB |

According to Meta, these are the recommended settings for inference:

  • Temperature of 0.6

  • Min_P of 0.01 (optional, but 0.01 works well, llama.cpp default is 0.1)

  • Top_P of 0.9

  • Chat template/prompt format:
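For reference, Llama 4 instruct models use header and end-of-turn special tokens in their prompt format. A minimal single-turn sketch is shown below; treat the chat template shipped in the model repo's tokenizer config as authoritative:

```
<|begin_of_text|><|header_start|>user<|header_end|>

What is 1+1?<|eot|><|header_start|>assistant<|header_end|>

```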

📖 Tutorial: How to Run Llama-4-Scout in llama.cpp

  1. Obtain the latest llama.cpp from GitHub at https://github.com/ggml-org/llama.cpp and build it following its build instructions. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF in the CMake flags if you don't have a GPU or only want CPU inference.

  2. Download the model (after installing the required packages via pip install huggingface_hub hf_transfer); a download sketch is shown after this list. You can choose Q4_K_M or other quantized versions (or even the full-precision BF16). More versions at: https://huggingface.co/unsloth/Llama-4-Scout-17B-16E-Instruct-GGUF

  3. Run the model and try any prompt.

  4. Edit --threads 32 for the number of CPU threads, --ctx-size 16384 for the context length (Llama 4 supports up to 10M context length!), and --n-gpu-layers 99 for how many layers to offload to the GPU. Lower it if your GPU runs out of memory, and remove it for CPU-only inference.
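As a minimal download sketch for step 2 (the repo is the one linked above; the IQ2_XXS pattern is just an example - pick whichever quant from the table you want):

```python
# Minimal sketch: download one quantized variant of Scout with huggingface_hub.
# Setting HF_HUB_ENABLE_HF_TRANSFER=1 before importing huggingface_hub lets the
# optional hf_transfer package speed up the download.
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="unsloth/Llama-4-Scout-17B-16E-Instruct-GGUF",
    local_dir="Llama-4-Scout-17B-16E-Instruct-GGUF",
    allow_patterns=["*IQ2_XXS*"],  # e.g. the recommended 2.42-bit quant; change as needed
)
```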

In terms of testing, unfortunately we couldn't get even the full BF16 version (i.e. with no quantization at all) to complete the Flappy Bird game or the Heptagon test appropriately. We tried many inference providers, quantizing with and without imatrix, other people's quants, and normal Hugging Face inference, and the issue persists.

We found that doing multiple runs and asking the model to find and fix bugs resolves most issues!

For Llama 4 Maverick, it's best to have 2x RTX 4090s (2 x 24GB).

🕵️ Interesting Insights and Issues

During quantization of Llama 4 Maverick (the large model), we found the 1st, 3rd and 45th MoE layers could not be calibrated correctly. Maverick uses interleaved MoE layers, with every odd layer being MoE, so the pattern is Dense->MoE->Dense->MoE and so on.

We tried adding more uncommon languages to our calibration dataset, and we tried using more tokens (1 million versus Scout's 250K) for calibration, but we still found issues. We decided to leave these MoE layers at 3-bit and 4-bit.

For Llama 4 Scout, we found we should not quantize the vision layers, and that the MoE router and some other layers should be left unquantized - we upload these to https://huggingface.co/unsloth/Llama-4-Scout-17B-16E-Instruct-unsloth-dynamic-bnb-4bit

We also had to convert torch.nn.Parameter to torch.nn.Linear for the MoE layers to allow 4-bit quantization to occur. This also meant we had to rewrite and patch over the generic Hugging Face implementation. We upload our quantized versions to https://huggingface.co/unsloth/Llama-4-Scout-17B-16E-Instruct-unsloth-bnb-4bit (4-bit) and https://huggingface.co/unsloth/Llama-4-Scout-17B-16E-Instruct-unsloth-bnb-8bit (8-bit).
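As a rough illustration of that conversion (a minimal sketch under simplified assumptions, not Unsloth's actual patch), wrapping a raw 2D weight Parameter in an nn.Linear lets standard 4-bit module replacement, which typically looks for nn.Linear layers, pick it up:

```python
# Minimal sketch, not Unsloth's actual patch: re-express an MoE weight stored as
# a raw torch.nn.Parameter as a torch.nn.Linear, so that 4-bit quantizers
# (which typically replace nn.Linear modules) can quantize it.
import torch
import torch.nn as nn

def parameter_to_linear(weight: nn.Parameter) -> nn.Linear:
    """Assumes weight has shape (out_features, in_features)."""
    out_features, in_features = weight.shape
    linear = nn.Linear(in_features, out_features, bias=False,
                       dtype=weight.dtype, device=weight.device)
    with torch.no_grad():
        linear.weight.copy_(weight)
    return linear

# Example with a dummy expert weight:
expert_weight = nn.Parameter(torch.randn(128, 64, dtype=torch.bfloat16))
print(parameter_to_linear(expert_weight))  # Linear(in_features=64, out_features=128, bias=False)
```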

Llama 4 also now uses chunked attention - essentially sliding window attention, but slightly more efficient: a token simply does not attend to anything before the current 8192-token chunk boundary, rather than keeping a rolling window behind it.
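A minimal, illustrative sketch of the difference (chunk size of 8192 as above; this is not the model's actual attention implementation):

```python
# Illustrative sketch of chunked attention vs. sliding-window attention.
# Chunked: query i may only attend to earlier keys in the SAME 8192-token chunk,
# i.e. nothing before floor(i / chunk) * chunk.
# Sliding window: query i may attend to the previous `window` keys, regardless
# of chunk boundaries.
import torch

def chunked_causal_mask(seq_len: int, chunk: int = 8192) -> torch.Tensor:
    """Boolean mask where mask[i, j] is True if query i may attend to key j."""
    idx = torch.arange(seq_len)
    causal = idx[None, :] <= idx[:, None]                          # j <= i
    same_chunk = (idx[None, :] // chunk) == (idx[:, None] // chunk)
    return causal & same_chunk

def sliding_window_mask(seq_len: int, window: int = 8192) -> torch.Tensor:
    idx = torch.arange(seq_len)
    causal = idx[None, :] <= idx[:, None]
    in_window = (idx[:, None] - idx[None, :]) < window
    return causal & in_window

# With a tiny chunk/window of 4 over 8 tokens, token 5 attends to {4, 5} under
# chunked attention but to {2, 3, 4, 5} under a sliding window.
print(chunked_causal_mask(8, chunk=4)[5])
print(sliding_window_mask(8, window=4)[5])
```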
