🐋Tutorial: How to Run DeepSeek-R1 on your own local device

A guide on how you can run our 1.58-bit Dynamic Quants for DeepSeek-R1 using llama.cpp.

  1. Do not forget about the <|User|> and <|Assistant|> tokens - or use a chat template formatter!

  2. Obtain the latest llama.cpp at: github.com/ggerganov/llama.cpp. You can follow the build instructions below as well:

apt-get update
apt-get install build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggerganov/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-quantize llama-cli llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp
  3. It's best to use --min-p 0.05 to counteract very rare token predictions - I found this to work well especially for the 1.58bit model.

  4. Download the model via:

# pip install huggingface_hub hf_transfer
# import os # Optional for faster downloading
# os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

from huggingface_hub import snapshot_download
snapshot_download(
  repo_id = "unsloth/DeepSeek-R1-GGUF",
  local_dir = "DeepSeek-R1-GGUF",
  allow_patterns = ["*UD-IQ1_S*"], # Select quant type UD-IQ1_S for 1.58bit
)
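
You can optionally sanity-check that all three split files were downloaded before running anything. A small sketch (the folder layout matches the paths used in the commands below):

import glob

# List the downloaded 1.58bit splits (expect 3 files: 00001-of-00003 ... 00003-of-00003)
files = sorted(glob.glob("DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/*.gguf"))
print(files)
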
  5. Run the model. The example below uses a Q4_0-quantized K cache; note that -no-cnv disables auto conversation mode.

./llama.cpp/llama-cli \
    --model DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
    --cache-type-k q4_0 \
    --threads 12 -no-cnv --prio 2 \
    --temp 0.6 \
    --ctx-size 8192 \
    --seed 3407 \
    --prompt "<|User|>Create a Flappy Bird game in Python.<|Assistant|>"

Example output (this one is from a simpler "What is 1+1?" style prompt):

 <think>
 Okay, so I need to figure out what 1 plus 1 is. Hmm, where do I even start? I remember from school that adding numbers is pretty basic, but I want to make sure I understand it properly.
 Let me think, 1 plus 1. So, I have one item and I add another one. Maybe like a apple plus another apple. If I have one apple and someone gives me another, I now have two apples. So, 1 plus 1 should be 2. That makes sense.
 Wait, but sometimes math can be tricky. Could it be something else? Like, in a different number system maybe? But I think the question is straightforward, using regular numbers, not like binary or hexadecimal or anything.
 I also recall that in arithmetic, addition is combining quantities. So, if you have two quantities of 1, combining them gives you a total of 2. Yeah, that seems right.
 Is there a scenario where 1 plus 1 wouldn't be 2? I can't think of any...
  6. If you have a GPU with 24GB of VRAM (an RTX 4090, for example), you can offload multiple layers to the GPU for faster processing. If you have multiple GPUs, you can probably offload more layers.

  ./llama.cpp/llama-cli \
    --model DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
    --cache-type-k q4_0 \
    --threads 12 -no-cnv --prio 2 \
    --n-gpu-layers 7 \
    --temp 0.6 \
    --ctx-size 8192 \
    --seed 3407 \
    --prompt "<|User|>Create a Flappy Bird game in Python.<|Assistant|>"
  7. If you want to merge the weights together, use this script:

./llama.cpp/llama-gguf-split --merge \
    DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
    merged_file.gguf
  8. DeepSeek R1 has 61 layers. With a 24GB GPU or an 80GB GPU, for example, you can expect to offload roughly the following numbers of layers (rounded down; reduce by 1 if you run out of memory):

| Quant | File Size | 24GB GPU | 80GB GPU | 2x80GB GPU |
|---|---|---|---|---|
| 1.58bit | 131GB | 7 | 33 | All layers (61) |
| 1.73bit | 158GB | 5 | 26 | 57 |
| 2.22bit | 183GB | 4 | 22 | 49 |
| 2.51bit | 212GB | 2 | 19 | 32 |
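
As a rough rule of thumb, these numbers can be approximated from the VRAM size, the file size, and the 61-layer count. The sketch below is a heuristic that happens to line up with most of the table, not an exact formula from llama.cpp:

# Heuristic estimate of --n-gpu-layers (an approximation, not an official formula):
# scale the 61 layers by the fraction of the file that fits in VRAM, then leave some headroom.
def estimate_offload_layers(vram_gb: float, file_size_gb: float, n_layers: int = 61) -> int:
    return max(int(vram_gb / file_size_gb * n_layers) - 4, 0)

print(estimate_offload_layers(24, 131))  # 7 layers for the 1.58bit quant on a 24GB GPU
print(estimate_offload_layers(80, 183))  # 22 layers for the 2.22bit quant on an 80GB GPU

If a setting still goes out of memory, reduce --n-gpu-layers further as noted above.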

Running on Mac / Apple devices

For Apple Metal devices, be careful with --n-gpu-layers. If the machine runs out of memory, reduce it. For a 128GB unified memory machine, you should be able to offload 59 layers or so.

./llama.cpp/llama-cli \
    --model DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
    --cache-type-k q4_0 \
    --threads 16 \
    --prio 2 \
    --temp 0.6 \
    --ctx-size 8192 \
    --seed 3407 \
    --n-gpu-layers 59 \
    -no-cnv \
    --prompt "<|User|>Create a Flappy Bird game in Python.<|Assistant|>"

Run in Ollama/Open WebUI

Open WebUI has made a step-by-step tutorial on how to run R1 here: docs.openwebui.com/tutorials/integrations/deepseekr1-dynamic/. If you want to use Ollama for inference on GGUFs, you first need to merge the 3 GGUF split files into 1 with the command below. Then you can run the model locally.

./llama.cpp/llama-gguf-split --merge \
    DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
    merged_file.gguf

DeepSeek Chat Template

All distilled versions and the main 671B R1 model use the same chat template:

<|begin▁of▁sentence|><|User|>What is 1+1?<|Assistant|>It's 2.<|end▁of▁sentence|><|User|>Explain more!<|Assistant|>

A BOS token is forcibly added, and an EOS token separates each interaction. To avoid double BOS tokens during inference, call tokenizer.encode(..., add_special_tokens = False), since the chat template already adds a BOS token. For llama.cpp / GGUF inference, skip the BOS entirely, since llama.cpp adds it automatically:

<|User|>What is 1+1?<|Assistant|>
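
For example, with the Hugging Face tokenizer the chat template already inserts the BOS, so the encode call should not add special tokens again. A minimal sketch (the distilled model ID here is just an illustrative choice):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-32B")
messages = [{"role": "user", "content": "What is 1+1?"}]

# The chat template prepends the BOS token itself ...
text = tokenizer.apply_chat_template(messages, tokenize = False, add_generation_prompt = True)

# ... so do not let encode() add special tokens a second time
input_ids = tokenizer.encode(text, add_special_tokens = False)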

The <think> and </think> tags get their own designated tokens. For the distilled Qwen and Llama versions, some tokens are re-mapped; Qwen, for example, did not have a BOS token, so <|object_ref_start|> had to be used instead. Tokenizer ID mappings:

| Token | R1 | Distill Qwen | Distill Llama |
|---|---|---|---|
| <think> | 128798 | 151648 | 128013 |
| </think> | 128799 | 151649 | 128014 |
| <\|begin_of_sentence\|> | 0 | 151646 | 128000 |
| <\|end_of_sentence\|> | 1 | 151643 | 128001 |
| <\|User\|> | 128803 | 151644 | 128011 |
| <\|Assistant\|> | 128804 | 151645 | 128012 |
| Padding token | 2 | 151654 | 128004 |
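
If you want to double-check these IDs yourself, a quick sketch with the Hugging Face tokenizer (using the distilled Qwen model as an example) looks like this; the expected values are taken from the table above:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-32B")
print(tok.convert_tokens_to_ids("<think>"))    # expected: 151648
print(tok.convert_tokens_to_ids("</think>"))   # expected: 151649
print(tok.convert_tokens_to_ids("<|User|>"))   # expected: 151644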

Original tokens in models:

| Token | Qwen 2.5 32B Base | Llama 3.3 70B Instruct |
|---|---|---|
| <think> | <\|box_start\|> | <\|reserved_special_token_5\|> |
| </think> | <\|box_end\|> | <\|reserved_special_token_6\|> |
| <\|begin▁of▁sentence\|> | <\|object_ref_start\|> | <\|begin_of_text\|> |
| <\|end▁of▁sentence\|> | <\|endoftext\|> | <\|end_of_text\|> |
| <\|User\|> | <\|im_start\|> | <\|reserved_special_token_3\|> |
| <\|Assistant\|> | <\|im_end\|> | <\|reserved_special_token_4\|> |
| Padding token | <\|vision_pad\|> | <\|finetune_right_pad_id\|> |

All distilled versions and the original 671B R1 seem to have accidentally assigned the padding token to <|end▁of▁sentence|>, which is generally a bad idea, especially if you want to further finetune on top of these reasoning models. It can cause endless generations, since most frameworks mask the padding token out as -100, and with this setup that means the EOS token gets masked as well. We fixed all distilled versions and the original R1 with the correct padding token (Qwen uses <|vision_pad|>, Llama uses <|finetune_right_pad_id|>, and R1 uses <|▁pad▁|> or our own added <|PAD▁TOKEN|>).
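
Before finetuning, it is worth verifying that the padding token no longer collides with the EOS token. A minimal sketch (the model ID is illustrative, and the replacement token assumes a Qwen-based distill):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-32B")

# If the pad token collides with EOS, swap in a dedicated padding token
# (<|vision_pad|> for Qwen-based distills, <|finetune_right_pad_id|> for Llama-based ones)
if tok.pad_token_id == tok.eos_token_id:
    tok.pad_token = "<|vision_pad|>"

print(tok.pad_token, tok.pad_token_id)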

GGUF R1 Table

| MoE Bits | Type | Disk Size | Accuracy | Details |
|---|---|---|---|---|
| 1.58bit | UD-IQ1_S | 131GB | Fair | MoE all 1.56bit. down_proj in MoE mixture of 2.06/1.56bit |
| 1.73bit | UD-IQ1_M | 158GB | Good | MoE all 1.56bit. down_proj in MoE left at 2.06bit |
| 2.22bit | UD-IQ2_XXS | 183GB | Better | MoE all 2.06bit. down_proj in MoE mixture of 2.5/2.06bit |
| 2.51bit | UD-Q2_K_XL | 212GB | Best | MoE all 2.5bit. down_proj in MoE mixture of 3.5/2.5bit |
