Grok 2
Run xAI's Grok 2 model locally!
You can now run Grok 2 (aka Grok 2.5), the 270B parameter model by xAI. Full precision requires 539GB, while the Unsloth Dynamic 3-bit version shrinks the size down to just 118GB (a ~78% reduction). GGUF: Grok-2-GGUF
The 3-bit Q3_K_XL model runs on a single 128GB Mac or 24GB VRAM + 128GB RAM, achieving 5+ tokens/s inference. Thanks to the llama.cpp team and community for supporting Grok 2 and making this possible. We were also glad to have helped a little along the way!
All uploads use Unsloth Dynamic 2.0 for SOTA 5-shot MMLU and KL Divergence performance, meaning you can run quantized Grok LLMs with minimal accuracy loss.
⚙️ Recommended Settings
The 3-bit dynamic quant uses about 118GiB (~126GB) of disk space, so it works well on a 128GB unified-memory Mac or on a single 24GB GPU paired with 128GB of system RAM. We recommend at least 120GB of RAM to run this 3-bit quant.
You must use --jinja for Grok 2; you might get incorrect results if you do not use --jinja.
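For instance, a minimal sketch of where the flag goes (the model path below is a placeholder for wherever you saved the GGUF):

```bash
# Always pass --jinja so llama.cpp applies Grok 2's chat template.
# The model path is a placeholder; point it at your downloaded GGUF shard.
./llama.cpp/llama-cli \
    --model /path/to/grok-2-UD-Q3_K_XL-00001-of-XXXXX.gguf \
    --jinja
```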
The 8-bit quant is ~300GB in size and will fit on a single 80GB GPU (with MoE layers offloaded to RAM). Expect around 5 tokens/s with this setup if you also have ~200GB of extra RAM. To learn how to increase generation speed and fit longer contexts, read here.
Sampling parameters
Grok 2 has a 128K max context length, so use a context of 131,072 tokens or less.
Use --jinja for llama.cpp variants.
There are no official sampling parameters for the model, so you can use the standard defaults that work for most models:
Set temperature = 1.0
min_p = 0.01 (optional, but 0.01 works well; llama.cpp's default is 0.1)
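As a sketch, these settings map onto llama.cpp flags as follows (the model path is a placeholder, and --ctx-size can go up to 131072 if you have the memory):

```bash
# Suggested sampling settings expressed as llama.cpp flags.
# The model path is a placeholder; raise --ctx-size up to 131072 for longer contexts.
./llama.cpp/llama-cli \
    --model /path/to/grok-2-UD-Q3_K_XL-00001-of-XXXXX.gguf \
    --jinja \
    --ctx-size 16384 \
    --temp 1.0 \
    --min-p 0.01
```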
Run Grok 2 Tutorial:
Currently you can only run Grok 2 in llama.cpp.
✨ Run in llama.cpp
Install the specific llama.cpp PR for Grok 2 on GitHub here, or follow the build instructions below. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.
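A typical build sketch follows the standard llama.cpp instructions; if Grok 2 support is not yet merged into master, check out the PR branch before building:

```bash
# Install build dependencies, then build llama.cpp with CUDA.
# Switch -DGGML_CUDA=ON to -DGGML_CUDA=OFF for CPU-only inference.
apt-get update
apt-get install -y build-essential cmake curl libcurl4-openssl-dev
git clone https://github.com/ggml-org/llama.cpp
# If the Grok 2 PR is not yet merged, check out its branch here before building.
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first \
    --target llama-cli llama-server
cp llama.cpp/build/bin/llama-* llama.cpp/
```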
If you want to use llama.cpp directly to load models, you can do the below: (:Q3_K_XL) is the quantization type. You can also download the model via Hugging Face instead (point 3). This is similar to ollama run. Use export LLAMA_CACHE="folder" to force llama.cpp to save downloads to a specific location. Remember the model has a maximum context length of 128K.
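A sketch of loading the quant straight from Hugging Face; the repo id unsloth/grok-2-GGUF is an assumption, so adjust it to match the actual upload:

```bash
# Optional: force llama.cpp to cache downloads in a specific folder.
export LLAMA_CACHE="unsloth/grok-2-GGUF"
# The :Q3_K_XL suffix after the repo id selects the quantization type.
./llama.cpp/llama-cli \
    -hf unsloth/grok-2-GGUF:Q3_K_XL \
    --jinja \
    --ctx-size 16384 \
    --temp 1.0 \
    --min-p 0.01
```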
Download the model via the snippet below (after installing the tooling with pip install huggingface_hub hf_transfer). You can choose UD-Q3_K_XL (dynamic 3-bit quant) or other quantized versions like Q4_K_M. We recommend using our 2.7-bit dynamic quant UD-Q2_K_XL or above to balance size and accuracy.
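A download sketch using the huggingface_hub CLI; the repo id unsloth/grok-2-GGUF and the folder layout are assumptions, and you can swap the --include pattern for other quants such as Q4_K_M:

```bash
# Install the Hugging Face tooling and enable hf_transfer for faster downloads.
pip install huggingface_hub hf_transfer
export HF_HUB_ENABLE_HF_TRANSFER=1
# Fetch only the UD-Q3_K_XL (dynamic 3-bit) shards into a local folder.
huggingface-cli download unsloth/grok-2-GGUF \
    --include "*UD-Q3_K_XL*" \
    --local-dir unsloth/grok-2-GGUF
```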
You can edit --threads 32 for the number of CPU threads, --ctx-size 16384 for context length, and --n-gpu-layers 2 for how many layers to offload to the GPU. Try adjusting it if your GPU runs out of memory, and remove it entirely for CPU-only inference.
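Putting it together, a run sketch over the downloaded shards; point --model at the first .gguf shard (the filename below is a placeholder):

```bash
# Run the local shards; adjust --threads, --ctx-size and --n-gpu-layers to your hardware.
./llama.cpp/llama-cli \
    --model unsloth/grok-2-GGUF/UD-Q3_K_XL/grok-2-UD-Q3_K_XL-00001-of-XXXXX.gguf \
    --jinja \
    --threads 32 \
    --ctx-size 16384 \
    --n-gpu-layers 2 \
    --temp 1.0 \
    --min-p 0.01
```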
Model uploads
ALL our uploads, including those that are not imatrix-based or dynamic, utilize our calibration dataset, which is specifically optimized for conversational, coding, and language tasks.
🏂 Improving generation speed
If you have more VRAM, you can try offloading more MoE layers, or offloading whole layers themselves.
Normally, -ot ".ffn_.*_exps.=CPU" offloads all MoE layers to the CPU! This effectively allows you to fit all non-MoE layers on 1 GPU, improving generation speeds. You can customize the regex expression to fit more layers if you have more GPU capacity (see the sketch after this list).
If you have a bit more GPU memory, try -ot ".ffn_(up|down)_exps.=CPU" This offloads up and down projection MoE layers.
Try -ot ".ffn_(up)_exps.=CPU" if you have even more GPU memory. This offloads only up projection MoE layers.
You can also customize the regex, for example -ot "\.(6|7|8|9|[0-9][0-9]|[0-9][0-9][0-9])\.ffn_(gate|up|down)_exps.=CPU" means to offload gate, up and down MoE layers but only from the 6th layer onwards.
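For example, a sketch that keeps every dense layer on the GPU and pushes all MoE expert tensors to the CPU (the model path is a placeholder):

```bash
# Keep all layers on the GPU, then override the MoE expert tensors back onto the CPU.
# Swap the -ot regex for one of the variants above if you have more VRAM to spare.
./llama.cpp/llama-cli \
    --model unsloth/grok-2-GGUF/UD-Q3_K_XL/grok-2-UD-Q3_K_XL-00001-of-XXXXX.gguf \
    --jinja \
    --threads 32 \
    --ctx-size 16384 \
    --n-gpu-layers 99 \
    --temp 1.0 \
    --min-p 0.01 \
    -ot ".ffn_.*_exps.=CPU"
```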
The latest llama.cpp release also introduces high-throughput mode; use llama-parallel. Read more about it here. You can also quantize the KV cache to 4-bit, for example, to reduce VRAM / RAM movement, which can also make the generation process faster.
📐 How to fit long context (full 128K)
To fit longer context, you can use KV cache quantization to quantize the K and V caches to lower bits. This can also increase generation speed due to reduced RAM / VRAM data movement. The allowed options for K quantization (default is f16) include the below.
--cache-type-k f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1
You should use the _1 variants for somewhat increased accuracy, albeit slightly slower; for example, q4_1 and q5_1.
You can also quantize the V cache, but you will need to compile llama.cpp with Flash Attention support via -DGGML_CUDA_FA_ALL_QUANTS=ON and use --flash-attn to enable it. Then you can use it together with --cache-type-k:
--cache-type-v f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1
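A sketch combining both: rebuild with the extra Flash Attention kernels, then run with 4-bit K and V caches (the model path is a placeholder):

```bash
# Rebuild with all Flash Attention quant kernels so quantized V caches are supported.
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DGGML_CUDA_FA_ALL_QUANTS=ON
cmake --build llama.cpp/build --config Release -j --target llama-cli
# Enable Flash Attention and quantize both KV caches to 4-bit for the full 128K context.
./llama.cpp/build/bin/llama-cli \
    --model unsloth/grok-2-GGUF/UD-Q3_K_XL/grok-2-UD-Q3_K_XL-00001-of-XXXXX.gguf \
    --jinja \
    --flash-attn \
    --ctx-size 131072 \
    --cache-type-k q4_1 \
    --cache-type-v q4_1
```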