🐋Tutorial: How to Run DeepSeek-R1 on your own local device

A guide on how you can run our 1.58-bit Dynamic Quants for DeepSeek-R1 using llama.cpp.

  1. Do not forget the <|User|> and <|Assistant|> tokens! Alternatively, use a chat template formatter (see the sketch below).
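
As an illustration, here is a minimal sketch of building the prompt string by hand before passing it to llama-cli via --prompt (the question is just an example):

# A minimal sketch: wrap a question in the DeepSeek-R1 prompt tokens.
question = "Create a Flappy Bird game in Python."
prompt = f"<|User|>{question}<|Assistant|>"
print(prompt)  # pass this string to llama-cli via --prompt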

  2. Obtain the latest llama.cpp from github.com/ggerganov/llama.cpp, or build it with the instructions below:

# Install build dependencies
apt-get update
apt-get install build-essential cmake curl libcurl4-openssl-dev -y
# Clone llama.cpp and build the CLI tools (use -DGGML_CUDA=OFF for a CPU-only build)
git clone https://github.com/ggerganov/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-quantize llama-cli llama-gguf-split
# Copy the built binaries next to the source checkout
cp llama.cpp/build/bin/llama-* llama.cpp
  3. It's best to use --min-p 0.05 to counteract very rare token predictions - I found this to work well especially for the 1.58-bit model. You can add the flag to the llama-cli invocations below.

  4. Download the model via:

# pip install huggingface_hub hf_transfer
# Optional: enable hf_transfer for faster downloads
# import os
# os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

from huggingface_hub import snapshot_download
snapshot_download(
    repo_id = "unsloth/DeepSeek-R1-GGUF",
    local_dir = "DeepSeek-R1-GGUF",
    allow_patterns = ["*UD-IQ1_S*"], # Select quant type UD-IQ1_S for 1.58bit
)
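
After the download finishes, you can quickly check where the GGUF shards landed (the DeepSeek-R1-UD-IQ1_S subfolder layout is assumed from the llama-cli commands that follow):

# List the downloaded GGUF shards
from pathlib import Path
for shard in sorted(Path("DeepSeek-R1-GGUF").rglob("*.gguf")):
    print(shard)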
  5. Run the model with an example prompt. This example uses a Q4_0-quantized K cache; note that -no-cnv disables automatic conversation mode:

./llama.cpp/llama-cli \
    --model DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
    --cache-type-k q4_0 \
    --threads 12 -no-cnv --prio 2 \
    --temp 0.6 \
    --ctx-size 8192 \
    --seed 3407 \
    --prompt "<|User|>Create a Flappy Bird game in Python.<|Assistant|>"

Example output (this sample is a response to a simple arithmetic question rather than the Flappy Bird prompt):

 <think>
 Okay, so I need to figure out what 1 plus 1 is. Hmm, where do I even start? I remember from school that adding numbers is pretty basic, but I want to make sure I understand it properly.
 Let me think, 1 plus 1. So, I have one item and I add another one. Maybe like a apple plus another apple. If I have one apple and someone gives me another, I now have two apples. So, 1 plus 1 should be 2. That makes sense.
 Wait, but sometimes math can be tricky. Could it be something else? Like, in a different number system maybe? But I think the question is straightforward, using regular numbers, not like binary or hexadecimal or anything.
 I also recall that in arithmetic, addition is combining quantities. So, if you have two quantities of 1, combining them gives you a total of 2. Yeah, that seems right.
 Is there a scenario where 1 plus 1 wouldn't be 2? I can't think of any...
  6. If you have a GPU with 24GB of VRAM (an RTX 4090, for example), you can offload several layers to the GPU for faster processing. If you have multiple GPUs, you can likely offload more layers.

  ./llama.cpp/llama-cli \
    --model DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
    --cache-type-k q4_0 \
    --threads 12 -no-cnv --prio 2 \
    --n-gpu-layers 7 \
    --temp 0.6 \
    --ctx-size 8192 \
    --seed 3407 \
    --prompt "<|User|>Create a Flappy Bird game in Python.<|Assistant|>"
  7. If you want to merge the split weights into a single file, use llama-gguf-split as shown below:

./llama.cpp/llama-gguf-split --merge \
    DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
    merged_file.gguf
  8. DeepSeek-R1 has 61 layers. With a 24GB or 80GB GPU, for example, you can expect to offload roughly the layer counts below after rounding down (reduce by 1 if you run out of memory). A small estimator sketch follows the table.

| Quant   | File Size | 24GB GPU | 80GB GPU | 2x 80GB GPU     |
|---------|-----------|----------|----------|-----------------|
| 1.58bit | 131GB     | 7        | 33       | All layers (61) |
| 1.73bit | 158GB     | 5        | 26       | 57              |
| 2.22bit | 183GB     | 4        | 22       | 49              |
| 2.51bit | 212GB     | 2        | 19       | 32              |
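
As a rough rule of thumb (my own fit to the table above, not an official formula, and it does not reproduce every entry exactly), the layer counts are approximately floor(VRAM_GB / file_size_GB * 61) - 4. A minimal sketch:

# Rough estimate of how many of DeepSeek-R1's 61 layers to offload.
# The "- 4" safety margin is an assumption fitted to the table above;
# reduce the result further if you still run out of memory.
def estimate_offload_layers(vram_gb: float, file_size_gb: float, n_layers: int = 61) -> int:
    estimate = int(vram_gb / file_size_gb * n_layers) - 4
    return max(0, min(n_layers, estimate))

print(estimate_offload_layers(24, 131))  # 1.58bit quant on a 24GB GPU -> 7
print(estimate_offload_layers(80, 131))  # 1.58bit quant on an 80GB GPU -> 33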

Running on Mac / Apple devices

For Apple Metal devices, be careful with --n-gpu-layers. If the machine runs out of memory, reduce it. On a 128GB unified-memory machine, you should be able to offload around 59 layers.

./llama.cpp/llama-cli \
    --model DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
    --cache-type-k q4_0 \
    --threads 16 \
    --prio 2 \
    --temp 0.6 \
    --ctx-size 8192 \
    --seed 3407 \
    --n-gpu-layers 59 \
    -no-cnv \
    --prompt "<|User|>Create a Flappy Bird game in Python.<|Assistant|>"

Run in Ollama

If you want to use Ollama for inference on these GGUFs, you first need to merge the 3 split GGUF files into one with the command below. You can then register the merged file with Ollama and run the model locally (see the sketch after the command).

./llama.cpp/llama-gguf-split --merge \
    DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
    merged_file.gguf
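
A minimal sketch of pointing Ollama at the merged file (it assumes Ollama is installed and merged_file.gguf is in the current directory; the model name deepseek-r1-1.58bit is just an illustration):

# Write a Modelfile that points Ollama at the merged GGUF, register the
# model with `ollama create`, then start an interactive session.
import subprocess
from pathlib import Path

Path("Modelfile").write_text("FROM ./merged_file.gguf\n")
subprocess.run(["ollama", "create", "deepseek-r1-1.58bit", "-f", "Modelfile"], check=True)
subprocess.run(["ollama", "run", "deepseek-r1-1.58bit"], check=True)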

GGUF R1 Table

| MoE Bits | Type       | Disk Size | Accuracy | Link | Details                                                   |
|----------|------------|-----------|----------|------|-----------------------------------------------------------|
| 1.58bit  | UD-IQ1_S   | 131GB     | Fair     |      | MoE all 1.56bit. down_proj in MoE mixture of 2.06/1.56bit |
| 1.73bit  | UD-IQ1_M   | 158GB     | Good     |      | MoE all 1.56bit. down_proj in MoE left at 2.06bit         |
| 2.22bit  | UD-IQ2_XXS | 183GB     | Better   |      | MoE all 2.06bit. down_proj in MoE mixture of 2.5/2.06bit  |
| 2.51bit  | UD-Q2_K_XL | 212GB     | Best     |      | MoE all 2.5bit. down_proj in MoE mixture of 3.5/2.5bit    |
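
To download one of the other quants in the table, change allow_patterns in the earlier snapshot_download call to match its Type column, for example (a sketch assuming the other quants follow the same file-naming pattern):

from huggingface_hub import snapshot_download
# Same download as before, but selecting the 1.73-bit UD-IQ1_M quant instead
snapshot_download(
    repo_id = "unsloth/DeepSeek-R1-GGUF",
    local_dir = "DeepSeek-R1-GGUF",
    allow_patterns = ["*UD-IQ1_M*"],
)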
