🐋DeepSeek-R1: How to Run Locally
A guide on how you can run our 1.58-bit Dynamic Quants for DeepSeek-R1 using llama.cpp.
Please see https://docs.unsloth.ai/basics/deepseek-r1-0528-how-to-run-locally for the updated DeepSeek-R1-0528 (May 28th, 2025) version.
Using llama.cpp (recommended)
Do not forget about the <|User|> and <|Assistant|> tokens! Or use a chat template formatter.
Obtain the latest llama.cpp at github.com/ggerganov/llama.cpp. You can follow the build instructions below as well:
apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggerganov/llama.cpp
cmake llama.cpp -B llama.cpp/build \
-DBUILD_SHARED_LIBS=ON -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-quantize llama-cli llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp
It's best to use --min-p 0.05 to counteract very rare token predictions - I found this to work well especially for the 1.58bit model.
Download the model via:
# pip install huggingface_hub hf_transfer
# import os # Optional for faster downloading
# os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
from huggingface_hub import snapshot_download
snapshot_download(
repo_id = "unsloth/DeepSeek-R1-GGUF",
local_dir = "DeepSeek-R1-GGUF",
allow_patterns = ["*UD-IQ1_S*"], # Select quant type UD-IQ1_S for 1.58bit
)Example with Q4_0 K quantized cache Notice -no-cnv disables auto conversation mode
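A minimal sketch of such a run; the model path, thread count, context size and GPU-layer count below are illustrative and should be adjusted to your download location and hardware (pointing --model at the first split file loads the remaining splits):

# Run the 1.58bit quant with a Q4_0 K-quantized cache, non-interactively
./llama.cpp/llama-cli \
    --model DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
    --cache-type-k q4_0 \
    --threads 16 \
    --ctx-size 8192 \
    --n-gpu-layers 7 \
    --min-p 0.05 \
    -no-cnv \
    --prompt "<|User|>Why is the sky blue?<|Assistant|>"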
Example output:
If you have a GPU with 24GB of VRAM (an RTX 4090, for example), you can offload multiple layers to the GPU for faster processing. If you have multiple GPUs, you can probably offload more layers.
To test our Flappy Bird example as mentioned in our blog post here: https://unsloth.ai/blog/deepseekr1-dynamic, we can reproduce the second example below using our 1.58bit dynamic quant:

Outputs compared: Original DeepSeek R1 vs. 1.58bit Dynamic Quant.
The prompt used is as below:
To call llama.cpp using this example, we do:
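A sketch of that call, assuming the Flappy Bird prompt is saved to a hypothetical file named prompt.txt; the hardware-related flag values are illustrative:

# Wrap the raw prompt in the DeepSeek chat tokens and run non-interactively
PROMPT="$(cat prompt.txt)"
./llama.cpp/llama-cli \
    --model DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
    --cache-type-k q4_0 \
    --threads 16 \
    --ctx-size 16384 \
    --n-gpu-layers 7 \
    --min-p 0.05 \
    -no-cnv \
    --prompt "<|User|>${PROMPT}<|Assistant|>"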
Also, if you want to merge the weights together for use in Ollama for example, use this script:
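A minimal sketch using the llama-gguf-split binary built earlier; the output filename merged_file.gguf is a placeholder:

# Merge the split GGUF files into one (first split as input, merged file as output)
./llama.cpp/llama-gguf-split --merge \
    DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
    merged_file.gguf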
DeepSeek R1 has 61 layers. For example, with a 24GB GPU or an 80GB GPU, you can expect to offload the following number of layers (rounded down; reduce by 1 if it goes out of memory):

Quant     File size   24GB GPU   80GB GPU   2x80GB GPU
1.58bit   131GB       7          33         All 61 layers
1.73bit   158GB       5          26         57
2.22bit   183GB       4          22         49
2.51bit   212GB       2          19         32
Running on Mac / Apple devices
For Apple Metal devices, be careful of --n-gpu-layers. If you find the machine going out of memory, reduce it. For a 128GB unified memory machine, you should be able to offload 59 layers or so.
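For example, on a 128GB unified-memory Mac with the 1.58bit quant, a run might look like the sketch below (flag values are illustrative; lower --n-gpu-layers if you hit out-of-memory errors):

./llama.cpp/llama-cli \
    --model DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
    --cache-type-k q4_0 \
    --min-p 0.05 \
    --n-gpu-layers 59 \
    -no-cnv \
    --prompt "<|User|>Why is the sky blue?<|Assistant|>"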
Run in Ollama/Open WebUI
Open WebUI has made a step-by-step tutorial on how to run R1 here: docs.openwebui.com/tutorials/integrations/deepseekr1-dynamic/. If you want to use Ollama for inference on GGUFs, you first need to merge the 3 GGUF split files into 1 (see the llama-gguf-split command shown earlier). Then you will need to run the model locally, as sketched below.
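A sketch of the Ollama side, assuming the splits were merged into merged_file.gguf as above; the model name deepseek-r1-iq1s is a placeholder:

# Point an Ollama Modelfile at the merged GGUF, then create and run the model
printf 'FROM ./merged_file.gguf\n' > Modelfile
ollama create deepseek-r1-iq1s -f Modelfile
ollama run deepseek-r1-iq1s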
DeepSeek Chat Template
All distilled versions and the main 671B R1 model use the same chat template:
<|begin▁of▁sentence|><|User|>What is 1+1?<|Assistant|>It's 2.<|end▁of▁sentence|><|User|>Explain more!<|Assistant|>
A BOS is forcibly added, and an EOS separates each interaction. To avoid double BOS tokens during inference, call tokenizer.encode(..., add_special_tokens = False), since the chat template already adds a BOS token. For llama.cpp / GGUF inference, skip the BOS entirely, since llama.cpp will add it automatically:
<|User|>What is 1+1?<|Assistant|>
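For example, a multi-turn prompt can be passed to llama-cli as one raw string (a sketch; the model path is a placeholder and only the prompt formatting matters here), with <|end▁of▁sentence|> separating completed turns and no BOS at the start:

# Previous turns end with the EOS token; llama.cpp prepends the BOS itself
PROMPT="<|User|>What is 1+1?<|Assistant|>It's 2.<|end▁of▁sentence|><|User|>Explain more!<|Assistant|>"
./llama.cpp/llama-cli --model merged_file.gguf --min-p 0.05 -no-cnv --prompt "$PROMPT"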
The <think> and </think> tokens get their own designated token IDs. For the distilled Qwen and Llama versions, some tokens are re-mapped; Qwen, for example, did not have a BOS token, so <|object_ref_start|> had to be used instead. Tokenizer ID mappings:
Token                     R1 (671B)   Distill (Qwen)   Distill (Llama)
<think>                   128798      151648           128013
</think>                  128799      151649           128014
<|begin_of_sentence|>     0           151646           128000
<|end_of_sentence|>       1           151643           128001
<|User|>                  128803      151644           128011
<|Assistant|>             128804      151645           128012
Padding token             2           151654           128004
Original tokens in models:
Token                     Original in Qwen         Original in Llama
<think>                   <|box_start|>            <|reserved_special_token_5|>
</think>                  <|box_end|>              <|reserved_special_token_6|>
<|begin▁of▁sentence|>     <|object_ref_start|>     <|begin_of_text|>
<|end▁of▁sentence|>       <|endoftext|>            <|end_of_text|>
<|User|>                  <|im_start|>             <|reserved_special_token_3|>
<|Assistant|>             <|im_end|>               <|reserved_special_token_4|>
Padding token             <|vision_pad|>           <|finetune_right_pad_id|>
All distilled versions and the original R1 seem to have accidentally assigned the padding token to <|end▁of▁sentence|>, which is generally not a good idea, especially if you want to further finetune on top of these reasoning models. It will cause endless generations, since most frameworks mask the EOS token out as -100. We fixed all distilled versions and the original R1 with the correct padding token (Qwen uses <|vision_pad|>, Llama uses <|finetune_right_pad_id|>, and R1 uses <|▁pad▁|> or our own added <|PAD▁TOKEN|>).
GGUF R1 Table