GLM-4.6: How to Run Locally
A guide on how to run Z.ai's new GLM-4.6 model on your own local device!
GLM-4.6 is the latest reasoning model from Z.ai, achieving SOTA performance on coding and agent benchmarks while offering improved conversational chat. The full 355B parameter model requires about 400GB of disk space, while the Unsloth Dynamic 2-bit GGUF reduces the size to 135GB (a roughly 66% reduction). All GGUF uploads are at unsloth/GLM-4.6-GGUF.
There is currently no smaller GLM-4.6-Air model available; however, Z.ai's team says one is expected soon.
We made multiple chat template fixes for GLM-4.6 so that llama.cpp / llama-cli works with --jinja - please always use --jinja, otherwise the output will be wrong!
You asked for benchmarks on our quants, so we’re showcasing Aider Polyglot results! Our Dynamic 3-bit DeepSeek V3.1 GGUF scores 75.6%, surpassing many full-precision SOTA LLMs. Read more.
All uploads use Unsloth Dynamic 2.0 for SOTA 5-shot MMLU and Aider performance, meaning you can run & fine-tune quantized GLM LLMs with minimal accuracy loss.
⚙️ Recommended Settings
The 2-bit dynamic quant UD-Q2_K_XL uses 135GB of disk space - this works well on a single 24GB GPU with 128GB of RAM when the MoE layers are offloaded to the CPU. The 1-bit UD-TQ1_0 GGUF also works natively in Ollama!
The 4-bit quants will fit on a single 40GB GPU (with MoE layers offloaded to RAM). Expect around 5 tokens/s with this setup if you also have roughly 165GB of RAM; for 5+ tokens/s we recommend at least 205GB of combined RAM+VRAM (or 205GB of unified memory). To learn how to increase generation speed and fit longer contexts, read here.
Though not required, for best performance have your combined VRAM + RAM be at least the size of the quant you're downloading. If it isn't, llama.cpp can still offload to hard drive / SSD, but inference will be slower.
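If you're not sure how much memory you have to work with, a quick check like the sketch below (assuming a Linux machine with an NVIDIA GPU; adapt it for your platform) reports your total RAM and VRAM:
# Total system RAM (Linux)
free -h
# Total GPU VRAM (NVIDIA GPUs only)
nvidia-smi --query-gpu=name,memory.total --format=csv
Add the two numbers together and compare against the quant sizes above to decide which GGUF to download.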
Official Recommended Settings
According to Z.ai, these are the recommended settings for GLM inference:
Set temperature to 1.0
Set top_p to 0.95 (recommended for coding)
Set top_k to 40 (recommended for coding)
200K context length or less
Use --jinja for llama.cpp variants - we fixed some chat template issues as well!
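As a quick reference, these settings map onto llama.cpp flags as in the sketch below - the model path is a placeholder, and complete working commands are shown in the llama.cpp section further down:
# Official sampling settings expressed as llama-cli flags (model path is a placeholder)
./llama.cpp/llama-cli \
--model GLM-4.6-GGUF/your-chosen-quant.gguf \
--jinja \
--temp 1.0 \
--top-p 0.95 \
--top-k 40 \
--ctx-size 16384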
Run GLM-4.6 Tutorials:
🦙 Run in Ollama
Install Ollama if you haven't already! To run more variants of the model, see here.
apt-get update
apt-get install pciutils -y
curl -fsSL https://ollama.com/install.sh | sh
Run the model! Note you can call ollama serve in another terminal if it fails. We include all our fixes and suggested parameters (temperature etc.) in the params file in our Hugging Face upload!
OLLAMA_MODELS=unsloth ollama serve &
OLLAMA_MODELS=unsloth ollama run hf.co/unsloth/GLM-4.6-GGUF:TQ1_0
To run other quants, you first need to merge the GGUF split files into one, as shown below, then run the merged file with Ollama.
./llama.cpp/llama-gguf-split --merge \
GLM-4.6-GGUF/GLM-4.6-UD-Q2_K_XL/GLM-4.6-UD-Q2_K_XL-00001-of-00003.gguf \
merged_file.gguf
OLLAMA_MODELS=unsloth ollama serve &
OLLAMA_MODELS=unsloth ollama run merged_file.gguf
✨ Run in llama.cpp
Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.
apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggerganov/llama.cpp
cmake llama.cpp -B llama.cpp/build \
-DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-quantize llama-cli llama-gguf-split llama-mtmd-cli llama-server
cp llama.cpp/build/bin/llama-* llama.cpp
If you want to use llama.cpp directly to load models, you can do the below (:Q2_K_XL is the quantization type). You can also download the model via Hugging Face (point 3). This is similar to ollama run. Use export LLAMA_CACHE="folder" to force llama.cpp to save downloads to a specific location. Remember the model has a maximum context length of 200K tokens.
Please try out -ot ".ffn_.*_exps.=CPU" to offload all MoE layers to the CPU! This effectively allows you to fit all non-MoE layers on 1 GPU, improving generation speeds. You can customize the regex expression to fit more layers if you have more GPU capacity.
If you have a bit more GPU memory, try -ot ".ffn_(up|down)_exps.=CPU" - this offloads the up and down projection MoE layers.
Try -ot ".ffn_(up)_exps.=CPU" if you have even more GPU memory. This offloads only the up projection MoE layers.
And finally you can offload all MoE layers via -ot ".ffn_.*_exps.=CPU" - this uses the least VRAM.
You can also customize the regex, for example -ot "\.(6|7|8|9|[0-9][0-9]|[0-9][0-9][0-9])\.ffn_(gate|up|down)_exps.=CPU" means to offload the gate, up and down MoE layers, but only from the 6th layer onwards.
export LLAMA_CACHE="unsloth/GLM-4.6-GGUF"
./llama.cpp/llama-cli \
--model GLM-4.6-GGUF/UD-Q2_K_XL/GLM-4.6-UD-Q2_K_XL-00001-of-00003.gguf \
--n-gpu-layers 99 \
--jinja \
--ctx-size 16384 \
--flash-attn on \
--temp 1.0 \
--top-p 0.95 \
--top-k 40 \
--min-p 0.0 \
-ot ".ffn_.*_exps.=CPU"
Download the model via the snippet below (after installing pip install huggingface_hub hf_transfer). You can choose UD-Q2_K_XL (dynamic 2-bit quant) or other quantized versions like Q4_K_XL. We recommend using our 2.7-bit dynamic quant UD-Q2_K_XL to balance size and accuracy.
# !pip install huggingface_hub hf_transfer
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "0" # Can sometimes rate limit, so set to 0 to disable
from huggingface_hub import snapshot_download
snapshot_download(
repo_id = "unsloth/GLM-4.6-GGUF",
local_dir = "unsloth/GLM-4.6-GGUF",
allow_patterns = ["*UD-Q2_K_XL*"], # Dynamic 2-bit; use "*UD-TQ1_0*" for the dynamic 1-bit quant
)
You can edit --threads 32 for the number of CPU threads, --ctx-size 16384 for context length, and --n-gpu-layers for how many layers to offload to the GPU (99 offloads them all). Try lowering it if your GPU runs out of memory, and remove it entirely for CPU-only inference.
./llama.cpp/llama-cli \
--model unsloth/GLM-4.6-GGUF/UD-Q2_K_XL/GLM-4.6-UD-Q2_K_XL-00001-of-00003.gguf \
--cache-type-k q4_0 \
--jinja \
--threads -1 \
--n-gpu-layers 99 \
--temp 1.0 \
--top-p 0.95 \
--top-k 40 \
--min-p 0.00 \
--ctx-size 16384 \
--seed 3407 \
-ot ".ffn_.*_exps.=CPU"
✨ Deploy with llama-server and OpenAI's completion library
To use llama-server for deployment, use the following command:
./llama.cpp/llama-server \
--model unsloth/GLM-4.6-GGUF/GLM-4.6-UD-TQ1_0.gguf \
--alias "unsloth/GLM-4.6" \
--threads -1 \
--n-gpu-layers 999 \
-ot ".ffn_.*_exps.=CPU" \
--prio 3 \
--min-p 0.01 \
--ctx-size 16384 \
--port 8001 \
--jinja
Then use OpenAI's Python library after pip install openai:
from openai import OpenAI
import json
openai_client = OpenAI(
base_url = "http://127.0.0.1:8001/v1",
api_key = "sk-no-key-required",
)
completion = openai_client.chat.completions.create(
model = "unsloth/GLM-4.6",
messages = [{"role": "user", "content": "What is 2+2?"},],
)
print(completion.choices[0].message.content)
💽 Model uploads
ALL our uploads - including those that are not imatrix-based or dynamic - utilize our calibration dataset, which is specifically optimized for conversational, coding, and language tasks.
Full GLM-4.6 model uploads are available in our unsloth/GLM-4.6-GGUF repository on Hugging Face.
We also uploaded IQ4_NL and Q4_1 quants, which run faster on ARM and Apple devices respectively.
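For example, to download only the IQ4_NL files you can filter by filename with huggingface-cli (a sketch; the "*IQ4_NL*" pattern is an assumption based on the quant name, so check the repository for the exact file names):
# Download just the IQ4_NL quant from the unsloth/GLM-4.6-GGUF repo
pip install huggingface_hub hf_transfer
huggingface-cli download unsloth/GLM-4.6-GGUF \
--include "*IQ4_NL*" \
--local-dir unsloth/GLM-4.6-GGUF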
🏂 Improving generation speed
If you have more VRAM, you can try offloading more MoE layers, or offloading whole layers themselves.
Normally, -ot ".ffn_.*_exps.=CPU" offloads all MoE layers to the CPU! This effectively allows you to fit all non-MoE layers on 1 GPU, improving generation speeds. You can customize the regex expression to fit more layers if you have more GPU capacity.
If you have a bit more GPU memory, try -ot ".ffn_(up|down)_exps.=CPU" - this offloads the up and down projection MoE layers.
Try -ot ".ffn_(up)_exps.=CPU" if you have even more GPU memory. This offloads only the up projection MoE layers.
You can also customize the regex, for example -ot "\.(6|7|8|9|[0-9][0-9]|[0-9][0-9][0-9])\.ffn_(gate|up|down)_exps.=CPU" means to offload the gate, up and down MoE layers, but only from the 6th layer onwards.
Llama.cpp also introduces a high-throughput mode - use llama-parallel. Read more about it here. You can also quantize the KV cache (to 4-bit, for example) to reduce VRAM / RAM movement, which can also make generation faster.
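As a sketch of how these options combine for throughput (the model path, slot count and context size below are examples to adapt to your hardware), you could serve several requests at once with a 4-bit K cache via llama-server:
# Throughput-oriented serving: MoE experts on CPU, 4-bit K cache, 4 parallel slots
./llama.cpp/llama-server \
--model unsloth/GLM-4.6-GGUF/UD-Q2_K_XL/GLM-4.6-UD-Q2_K_XL-00001-of-00003.gguf \
--n-gpu-layers 99 \
-ot ".ffn_.*_exps.=CPU" \
--jinja \
--cache-type-k q4_0 \
--parallel 4 \
--ctx-size 65536
Note that llama-server splits --ctx-size evenly across the parallel slots, so 65536 total gives each of the 4 slots 16K of context.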
📐 How to fit long context (full 200K)
To fit longer context, you can use KV cache quantization to quantize the K and V caches to lower bits. This can also increase generation speed due to reduced RAM / VRAM data movement. The allowed options for K quantization (default is f16) include the below.
--cache-type-k f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1
You should use the _1 variants (e.g. q4_1, q5_1) for somewhat increased accuracy, albeit slightly slower speed.
You can also quantize the V cache, but you will need to compile llama.cpp with Flash Attention support via -DGGML_CUDA_FA_ALL_QUANTS=ON, and enable it with --flash-attn on (as in the commands above). Then you can use it together with --cache-type-k:
--cache-type-v f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1
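Putting this together, a long-context launch might look like the sketch below, assuming you compiled with -DGGML_CUDA_FA_ALL_QUANTS=ON; the 131072 context size is just an example - raise it toward the model's 200K maximum as your RAM/VRAM allows:
# Long-context sketch: Flash Attention enabled, K and V caches quantized to q4_1
./llama.cpp/llama-cli \
--model unsloth/GLM-4.6-GGUF/UD-Q2_K_XL/GLM-4.6-UD-Q2_K_XL-00001-of-00003.gguf \
--n-gpu-layers 99 \
-ot ".ffn_.*_exps.=CPU" \
--jinja \
--flash-attn on \
--cache-type-k q4_1 \
--cache-type-v q4_1 \
--temp 1.0 \
--top-p 0.95 \
--top-k 40 \
--ctx-size 131072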