Grok 2
Run xAI's Grok 2 model locally!
You can now run Grok 2, a 270B parameter model by xAI. Full precision requires 539GB, while the Unsloth Dynamic 3-bit version shrinks the size down to just 118GB (a ~78% reduction). GGUF: Grok-2-GGUF
Thanks to the llama.cpp team and community for supporting Grok 2 and making this possible. We were also glad to have helped a little along the way. The 3-bit Q3_K_XL model runs on a single 128GB Mac or 24GB VRAM + 128GB RAM, achieving 5+ tokens/s inference.
All uploads use Unsloth Dynamic 2.0 for SOTA 5-shot MMLU and KL Divergence performance, meaning you can run quantized Grok LLMs with minimal accuracy loss.
⚙️ Recommended Settings
The 3-bit dynamic quant uses about 118GB (~110GiB) of disk space - this works well on a Mac with 128GB of unified memory, or on a single 24GB GPU paired with 128GB of system RAM. We recommend having at least 120GB of RAM to run this 3-bit quant.
You must use --jinja for Grok 2. You might get incorrect results if you do not use --jinja.
The 8-bit quant is ~300GB in size and will fit on a single 80GB GPU (with the MoE layers offloaded to RAM). Expect around 5 tokens/s with this setup if you also have roughly 200GB of extra RAM. To learn how to increase generation speed and fit longer contexts, read here.
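For example, once llama.cpp is built as shown in the tutorial below, an 8-bit run on a single 80GB GPU might look like the sketch below. The :Q8_0 tag is assumed to match the 8-bit upload on the repo, and the -ot flag that keeps the MoE expert tensors in system RAM is explained in the "Improving generation speed" section.
# Sketch only: 8-bit Grok 2 on one 80GB GPU, MoE experts kept in system RAM
./llama.cpp/llama-cli \
    -hf unsloth/grok-2-GGUF:Q8_0 \
    --jinja \
    --n-gpu-layers 99 \
    --temp 1.0 \
    --ctx-size 16384 \
    -ot ".ffn_.*_exps.=CPU"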
Sampling parameters
Grok 2 has a 128K max context length, so use a context of 131,072 tokens or less. Use --jinja for llama.cpp variants.
There are no official sampling parameters for the model, so you can use the standard defaults that work for most models (see the flag snippet after this list):
Set the temperature = 1.0
Min_P = 0.01 (optional, but 0.01 works well, llama.cpp default is 0.1)
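As a minimal sketch, these settings map directly onto the llama.cpp flags used in the full commands later in this guide:
--jinja --temp 1.0 --min_p 0.01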
Run Grok 2 Tutorial:
Currently you can only run Grok 2 in llama.cpp.
✨ Run in llama.cpp
1. Obtain the specific llama.cpp PR for Grok 2 on GitHub here. You can also follow the build instructions below. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or only want CPU inference.
apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp && git fetch origin pull/15539/head:MASTER && git checkout MASTER && cd ..
cmake llama.cpp -B llama.cpp/build \
-DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-quantize llama-cli llama-gguf-split llama-mtmd-cli llama-server
cp llama.cpp/build/bin/llama-* llama.cpp
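To quickly sanity-check the build, you can print the version of the freshly built binary (a simple check; --version is a standard llama.cpp flag that prints the commit and build info):
./llama.cpp/llama-cli --version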
2. If you want to use llama.cpp directly to load models, you can do the below: (:Q3_K_XL) is the quantization type. You can also download via Hugging Face (point 3). This is similar to ollama run. Use export LLAMA_CACHE="folder" to force llama.cpp to save to a specific location. Remember the model has a maximum context length of 128K.
export LLAMA_CACHE="unsloth/grok-2-GGUF"
./llama.cpp/llama-cli \
-hf unsloth/grok-2-GGUF:Q3_K_XL \
--jinja \
--n-gpu-layers 99 \
--temp 1.0 \
--top_p 0.95 \
--min_p 0.01 \
--ctx-size 16384 \
--seed 3407 \
-ot ".ffn_.*_exps.=CPU"
3. Download the model (after installing the required packages via pip install huggingface_hub hf_transfer). You can choose UD-Q3_K_XL (dynamic 3-bit quant) or other quantized versions like Q4_K_M. We recommend using our 2.7-bit dynamic quant UD-Q2_K_XL or above to balance size and accuracy.
# !pip install huggingface_hub hf_transfer
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "0" # Can sometimes rate limit, so set to 0 to disable
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id = "unsloth/grok-2-GGUF",
    local_dir = "unsloth/grok-2-GGUF",
    allow_patterns = ["*UD-Q3_K_XL*"], # Dynamic 3bit
)
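If you prefer the command line over Python, the same download can be done with the CLI that ships with the huggingface_hub package (a sketch; the --include pattern mirrors allow_patterns above):
huggingface-cli download unsloth/grok-2-GGUF \
    --include "*UD-Q3_K_XL*" \
    --local-dir unsloth/grok-2-GGUF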
You can edit --threads 32 for the number of CPU threads, --ctx-size 16384 for the context length, and --n-gpu-layers 2 for how many layers to offload to the GPU. Try adjusting it if your GPU runs out of memory, and remove it entirely for CPU-only inference.
./llama.cpp/llama-cli \
--model unsloth/grok-2-GGUF/UD-Q3_K_XL/grok-2-UD-Q3_K_XL-00001-of-00003.gguf \
--jinja \
--threads -1 \
--n-gpu-layers 99 \
--temp 1.0 \
--top_p 0.95 \
--min_p 0.01 \
--ctx-size 16384 \
--seed 3407 \
-ot ".ffn_.*_exps.=CPU"
Model uploads
ALL our uploads - including those that are not imatrix-based or dynamic - utilize our calibration dataset, which is specifically optimized for conversational, coding, and language tasks.
🏂 Improving generation speed
If you have more VRAM, you can try offloading more MoE layers, or offloading whole layers themselves.
Normally, -ot ".ffn_.*_exps.=CPU"
offloads all MoE layers to the CPU! This effectively allows you to fit all non MoE layers on 1 GPU, improving generation speeds. You can customize the regex expression to fit more layers if you have more GPU capacity.
If you have a bit more GPU memory, try -ot ".ffn_(up|down)_exps.=CPU" - this offloads the up and down projection MoE layers.
Try -ot ".ffn_(up)_exps.=CPU"
if you have even more GPU memory. This offloads only up projection MoE layers.
You can also customize the regex - for example -ot "\.(6|7|8|9|[0-9][0-9]|[0-9][0-9][0-9])\.ffn_(gate|up|down)_exps.=CPU" offloads the gate, up and down MoE layers, but only from the 6th layer onwards (see the variant sketched at the end of this section).
The latest llama.cpp release also introduces a high-throughput mode - use llama-parallel. Read more about it here. You can also quantize the KV cache to 4-bit, for example, to reduce VRAM / RAM movement, which can also speed up generation.
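As a concrete variant of the layer-selective regex above (a sketch - adjust the ranges to your VRAM), the following keeps the experts of layers 0-19 on the GPU and offloads the experts of layers 20 and above to the CPU:
-ot "\.(2[0-9]|[3-9][0-9])\.ffn_(gate|up|down)_exps.=CPU"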
📐How to fit long context (full 128K)
To fit longer context, you can use KV cache quantization to quantize the K and V caches to lower bits. This can also increase generation speed due to reduced RAM / VRAM data movement. The allowed options for K quantization (the default is f16) include the below.
--cache-type-k f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1
You should use the _1 variants for somewhat increased accuracy, albeit slightly slower - for example q4_1 or q5_1.
You can also quantize the V cache, but you will need to compile llama.cpp with Flash Attention support via -DGGML_CUDA_FA_ALL_QUANTS=ON, and use --flash-attn to enable it. Then you can use it together with --cache-type-k:
--cache-type-v f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1
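Putting the pieces together, a long-context run with a 4-bit quantized KV cache might look like the sketch below. This assumes llama.cpp was compiled with -DGGML_CUDA_FA_ALL_QUANTS=ON as described above, and uses the q4_1 cache types per the accuracy note earlier.
# Sketch: full 128K context with quantized K and V caches
./llama.cpp/llama-cli \
    --model unsloth/grok-2-GGUF/UD-Q3_K_XL/grok-2-UD-Q3_K_XL-00001-of-00003.gguf \
    --jinja \
    --n-gpu-layers 99 \
    --temp 1.0 \
    --ctx-size 131072 \
    --flash-attn \
    --cache-type-k q4_1 \
    --cache-type-v q4_1 \
    -ot ".ffn_.*_exps.=CPU"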