Tutorial: How to Run DeepSeek-R1 on your own local device
A guide on how you can run our 1.58-bit Dynamic Quants for DeepSeek-R1 using llama.cpp.
Using llama.cpp (recommended)
Do not forget about the <|User|> and <|Assistant|> tokens when prompting - or use a chat template formatter.

Obtain the latest llama.cpp at github.com/ggerganov/llama.cpp. You can follow the build instructions below as well:
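A minimal sketch of a typical Linux build with CUDA support (the exact CMake flags can change between llama.cpp releases, so check the repo's README if anything fails; drop -DGGML_CUDA=ON for a CPU-only build):

```bash
# Build dependencies (Debian/Ubuntu shown as an example)
apt-get update
apt-get install -y build-essential cmake curl libcurl4-openssl-dev

# Clone and build llama.cpp with CUDA support
git clone https://github.com/ggerganov/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release \
    --target llama-cli llama-gguf-split -j

# Copy the binaries next to the source so the commands below can find them
cp llama.cpp/build/bin/llama-* llama.cpp
```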
It's best to use --min-p 0.05 to counteract very rare token predictions - I found this to work especially well for the 1.58-bit model.

Download the model via:
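One way to fetch just the 1.58-bit (UD-IQ1_S) shards is with huggingface-cli; the repo name and file pattern below assume the unsloth/DeepSeek-R1-GGUF upload on Hugging Face:

```bash
pip install huggingface_hub hf_transfer  # hf_transfer is optional (enable with HF_HUB_ENABLE_HF_TRANSFER=1)

# Download only the UD-IQ1_S (1.58-bit) split files into DeepSeek-R1-GGUF/
huggingface-cli download unsloth/DeepSeek-R1-GGUF \
    --include "*UD-IQ1_S*" \
    --local-dir DeepSeek-R1-GGUF
```

Swap the pattern (for example *UD-IQ1_M*, *UD-IQ2_XXS*, or *UD-Q2_K_XL*) to grab one of the other quants from the table at the end.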
Example with a Q4_0-quantized K cache. Note that -no-cnv disables auto conversation mode:
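A sketch of such a run (the model path, thread count, and prompt are illustrative; adjust them for your download and hardware):

```bash
./llama.cpp/llama-cli \
    --model DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
    --cache-type-k q4_0 \
    --threads 16 \
    --temp 0.6 \
    --ctx-size 8192 \
    --min-p 0.05 \
    -no-cnv \
    --prompt "<|User|>What is 1+1?<|Assistant|>"
```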
Example output:
If you have a GPU with 24GB of VRAM (an RTX 4090, for example), you can offload multiple layers to the GPU for faster processing. If you have multiple GPUs, you can likely offload more layers.
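For example, adding --n-gpu-layers to the command above offloads that many layers to the GPU; a sketch for the 1.58-bit quant on a single 24GB card, using the 7 layers suggested by the table below:

```bash
./llama.cpp/llama-cli \
    --model DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
    --cache-type-k q4_0 \
    --threads 16 \
    --n-gpu-layers 7 \
    --min-p 0.05 \
    -no-cnv \
    --prompt "<|User|>What is 1+1?<|Assistant|>"
```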
If you want to merge the weights together, use this script:
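A sketch using llama.cpp's llama-gguf-split tool (built above); point it at the first shard and it writes a single merged GGUF (the paths are assumptions based on the UD-IQ1_S download earlier):

```bash
./llama.cpp/llama-gguf-split --merge \
    DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
    DeepSeek-R1-UD-IQ1_S-merged.gguf
```

Merging is optional for llama.cpp itself, which can load split GGUFs directly from the first shard, but some tools (Ollama, for example) expect a single file.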
DeepSeek R1 has 61 layers. For example, with a 24GB GPU, an 80GB GPU, or two 80GB GPUs, you can expect to offload the following numbers of layers (rounded down; reduce by 1 if it goes out of memory):
| Quant | File Size | 24GB GPU | 80GB GPU | 2x80GB GPUs |
| --- | --- | --- | --- | --- |
| 1.58bit | 131GB | 7 | 33 | All layers (61) |
| 1.73bit | 158GB | 5 | 26 | 57 |
| 2.22bit | 183GB | 4 | 22 | 49 |
| 2.51bit | 212GB | 2 | 19 | 32 |
Running on Mac / Apple devices
For Apple Metal devices, be careful of --n-gpu-layers. If you find the machine going out of memory, reduce it. For a 128GB unified memory machine, you should be able to offload 59 layers or so.
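A sketch for a 128GB unified-memory Mac (the model path is illustrative; lower --n-gpu-layers if you run out of memory):

```bash
./llama.cpp/llama-cli \
    --model DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
    --cache-type-k q4_0 \
    --n-gpu-layers 59 \
    --min-p 0.05 \
    -no-cnv \
    --prompt "<|User|>What is 1+1?<|Assistant|>"
```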
Run in Ollama/Open WebUI
Open WebUI has made a step-by-step tutorial on how to run R1 here: docs.openwebui.com/tutorials/integrations/deepseekr1-dynamic/. If you want to use Ollama for inference on GGUFs, you first need to merge the 3 GGUF split files into one, as in the sketch below, and then run the model locally.
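A sketch of the merge-then-serve flow, assuming the UD-IQ1_S shards from earlier and a local Ollama install (the Modelfile contents and model name are placeholders):

```bash
# 1. Merge the three split files into a single GGUF
./llama.cpp/llama-gguf-split --merge \
    DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
    DeepSeek-R1-UD-IQ1_S-merged.gguf

# 2. Point a Modelfile at the merged GGUF, then create and run the model
printf 'FROM ./DeepSeek-R1-UD-IQ1_S-merged.gguf\n' > Modelfile
ollama create deepseek-r1-1.58bit -f Modelfile
ollama run deepseek-r1-1.58bit
```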
DeepSeek Chat Template
All distilled versions and the main 671B R1 model use the same chat template:
<|begin▁of▁sentence|><|User|>What is 1+1?<|Assistant|>It's 2.<|end▁of▁sentence|><|User|>Explain more!<|Assistant|>
A BOS is forcibly added, and an EOS separates each interaction. To avoid double BOS tokens during inference, call tokenizer.encode(..., add_special_tokens = False), since the chat template already adds a BOS token. For llama.cpp / GGUF inference, skip the BOS entirely, since llama.cpp adds it automatically; your prompt should look like:
<|User|>What is 1+1?<|Assistant|>
The <think> and </think> markers get their own designated tokens. For the distilled Qwen and Llama versions, some tokens are re-mapped; Qwen, for example, did not have a BOS token, so <|object_ref_start|> had to be used instead. Tokenizer ID mappings:
| Token | R1 | Distilled Qwen | Distilled Llama |
| --- | --- | --- | --- |
| <think> | 128798 | 151648 | 128013 |
| </think> | 128799 | 151649 | 128014 |
| <\|begin▁of▁sentence\|> | 0 | 151646 | 128000 |
| <\|end▁of▁sentence\|> | 1 | 151643 | 128001 |
| <\|User\|> | 128803 | 151644 | 128011 |
| <\|Assistant\|> | 128804 | 151645 | 128012 |
| Padding token | 2 | 151654 | 128004 |
Original tokens in the models:

| Token | Distilled Qwen (original) | Distilled Llama (original) |
| --- | --- | --- |
| <think> | <\|box_start\|> | <\|reserved_special_token_5\|> |
| </think> | <\|box_end\|> | <\|reserved_special_token_6\|> |
| <\|begin▁of▁sentence\|> | <\|object_ref_start\|> | <\|begin_of_text\|> |
| <\|end▁of▁sentence\|> | <\|endoftext\|> | <\|end_of_text\|> |
| <\|User\|> | <\|im_start\|> | <\|reserved_special_token_3\|> |
| <\|Assistant\|> | <\|im_end\|> | <\|reserved_special_token_4\|> |
| Padding token | <\|vision_pad\|> | <\|finetune_right_pad_id\|> |
All distilled versions and the original R1 seem to have accidentally assigned the padding token to <|end▁of▁sentence|>, which is generally a bad idea, especially if you want to finetune further on top of these reasoning models. It can cause endless generations, since most training frameworks mask the padding token (and therefore the EOS token) out as -100, so the model never learns to stop. We fixed all distilled versions and the original R1 with the correct padding token (Qwen uses <|vision_pad|>, Llama uses <|finetune_right_pad_id|>, and R1 uses <|▁pad▁|> or our own added <|PAD▁TOKEN|>).
GGUF R1 Table
| MoE Bits | Type | Disk Size | Accuracy | Details |
| --- | --- | --- | --- | --- |
| 1.58bit | UD-IQ1_S | 131GB | Fair | MoE all 1.56bit. down_proj in MoE mixture of 2.06/1.56bit |
| 1.73bit | UD-IQ1_M | 158GB | Good | MoE all 1.56bit. down_proj in MoE left at 2.06bit |
| 2.22bit | UD-IQ2_XXS | 183GB | Better | MoE all 2.06bit. down_proj in MoE mixture of 2.5/2.06bit |
| 2.51bit | UD-Q2_K_XL | 212GB | Best | MoE all 2.5bit. down_proj in MoE mixture of 3.5/2.5bit |