Tutorial: How to Run DeepSeek-R1 on your own local device
A guide on how you can run our 1.58-bit Dynamic Quants for DeepSeek-R1 using llama.cpp.
Using llama.cpp (recommended)
Do not forget about the <|User|> and <|Assistant|> tokens when prompting - or use a chat template formatter.

Obtain the latest llama.cpp at github.com/ggerganov/llama.cpp. You can follow the build instructions below as well:
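A minimal sketch of a typical Linux build with CUDA support (the exact CMake flags can change between llama.cpp releases, so check the repo's README if anything fails; drop -DGGML_CUDA=ON for a CPU-only build):

```bash
# Build dependencies (Debian/Ubuntu shown as an example)
apt-get update
apt-get install -y build-essential cmake curl libcurl4-openssl-dev

# Clone and build llama.cpp with CUDA support
git clone https://github.com/ggerganov/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release \
    --target llama-cli llama-gguf-split -j

# Copy the binaries next to the source so the commands below can find them
cp llama.cpp/build/bin/llama-* llama.cpp
```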
It's best to use --min-p 0.05 to counteract very rare token predictions - I found this to work especially well for the 1.58-bit model.

Download the model via:
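One way to fetch just the 1.58-bit (UD-IQ1_S) shards is with huggingface-cli; the repo name and file pattern below assume the unsloth/DeepSeek-R1-GGUF upload on Hugging Face:

```bash
pip install huggingface_hub hf_transfer  # hf_transfer is optional (enable with HF_HUB_ENABLE_HF_TRANSFER=1)

# Download only the UD-IQ1_S (1.58-bit) split files into DeepSeek-R1-GGUF/
huggingface-cli download unsloth/DeepSeek-R1-GGUF \
    --include "*UD-IQ1_S*" \
    --local-dir DeepSeek-R1-GGUF
```

Swap the pattern (for example *UD-IQ1_M*, *UD-IQ2_XXS*, or *UD-Q2_K_XL*) to grab one of the other quants from the table at the end.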
Example with a Q4_0-quantized K cache. Note that -no-cnv disables auto conversation mode:
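A sketch of such a run (the model path, thread count, and prompt are illustrative; adjust them for your download and hardware):

```bash
./llama.cpp/llama-cli \
    --model DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
    --cache-type-k q4_0 \
    --threads 16 \
    --temp 0.6 \
    --ctx-size 8192 \
    --min-p 0.05 \
    -no-cnv \
    --prompt "<|User|>What is 1+1?<|Assistant|>"
```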
Example output:
If you have a GPU with 24GB of VRAM (an RTX 4090, for example), you can offload multiple layers to the GPU for faster processing. If you have multiple GPUs, you can likely offload more layers.
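For example, adding --n-gpu-layers to the command above offloads that many layers to the GPU; a sketch for the 1.58-bit quant on a single 24GB card, using the 7 layers suggested by the table below:

```bash
./llama.cpp/llama-cli \
    --model DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
    --cache-type-k q4_0 \
    --threads 16 \
    --n-gpu-layers 7 \
    --min-p 0.05 \
    -no-cnv \
    --prompt "<|User|>What is 1+1?<|Assistant|>"
```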
If you want to merge the weights together, use this script:
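A sketch using llama.cpp's llama-gguf-split tool (built above); point it at the first shard and it writes a single merged GGUF (the paths are assumptions based on the UD-IQ1_S download earlier):

```bash
./llama.cpp/llama-gguf-split --merge \
    DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
    DeepSeek-R1-UD-IQ1_S-merged.gguf
```

Merging is optional for llama.cpp itself, which can load split GGUFs directly from the first shard, but some tools (Ollama, for example) expect a single file.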
DeepSeek R1 has 61 layers. For example, with a 24GB GPU, an 80GB GPU, or two 80GB GPUs, you can expect to offload the following numbers of layers (rounded down; reduce by 1 if it goes out of memory):
| Quant | File Size | 24GB GPU | 80GB GPU | 2x80GB GPUs |
| --- | --- | --- | --- | --- |
| 1.58bit | 131GB | 7 | 33 | All layers (61) |
| 1.73bit | 158GB | 5 | 26 | 57 |
| 2.22bit | 183GB | 4 | 22 | 49 |
| 2.51bit | 212GB | 2 | 19 | 32 |
Running on Mac / Apple devices
For Apple Metal devices, be careful of --n-gpu-layers. If you find the machine going out of memory, reduce it. For a 128GB unified memory machine, you should be able to offload 59 layers or so.
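A sketch for a 128GB unified-memory Mac (the model path is illustrative; lower --n-gpu-layers if you run out of memory):

```bash
./llama.cpp/llama-cli \
    --model DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
    --cache-type-k q4_0 \
    --n-gpu-layers 59 \
    --min-p 0.05 \
    -no-cnv \
    --prompt "<|User|>What is 1+1?<|Assistant|>"
```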
Run in Ollama/Open WebUI
Open WebUI has made a step-by-step tutorial on how to run R1 here: docs.openwebui.com/tutorials/integrations/deepseekr1-dynamic/. If you want to use Ollama for inference on GGUFs, you first need to merge the 3 GGUF split files into one, as in the sketch below, and then run the model locally.
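A sketch of the merge-then-serve flow, assuming the UD-IQ1_S shards from earlier and a local Ollama install (the Modelfile contents and model name are placeholders):

```bash
# 1. Merge the three split files into a single GGUF
./llama.cpp/llama-gguf-split --merge \
    DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
    DeepSeek-R1-UD-IQ1_S-merged.gguf

# 2. Point a Modelfile at the merged GGUF, then create and run the model
printf 'FROM ./DeepSeek-R1-UD-IQ1_S-merged.gguf\n' > Modelfile
ollama create deepseek-r1-1.58bit -f Modelfile
ollama run deepseek-r1-1.58bit
```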
DeepSeek Chat Template
All distilled versions and the main 671B R1 model use the same chat template:
<|begin▁of▁sentence|><|User|>What is 1+1?<|Assistant|>It's 2.<|end▁of▁sentence|><|User|>Explain more!<|Assistant|>
A BOS is forcibly added, and an EOS separates each interaction. To avoid double BOS tokens during inference, call tokenizer.encode(..., add_special_tokens = False), since the chat template already adds a BOS token. For llama.cpp / GGUF inference, skip the BOS entirely, since llama.cpp adds it automatically; your prompt should look like:
<|User|>What is 1+1?<|Assistant|>
The <think> and </think> markers get their own designated tokens. For the distilled Qwen and Llama versions, some tokens are re-mapped; Qwen, for example, did not have a BOS token, so <|object_ref_start|> had to be used instead. Tokenizer ID mappings:
| Token | R1 | Distilled Qwen | Distilled Llama |
| --- | --- | --- | --- |
| <think> | 128798 | 151648 | 128013 |
| </think> | 128799 | 151649 | 128014 |
| <\|begin▁of▁sentence\|> | 0 | 151646 | 128000 |
| <\|end▁of▁sentence\|> | 1 | 151643 | 128001 |
| <\|User\|> | 128803 | 151644 | 128011 |
| <\|Assistant\|> | 128804 | 151645 | 128012 |
| Padding token | 2 | 151654 | 128004 |
Original tokens in the models:

| Token | Distilled Qwen (original) | Distilled Llama (original) |
| --- | --- | --- |
| <think> | <\|box_start\|> | <\|reserved_special_token_5\|> |
| </think> | <\|box_end\|> | <\|reserved_special_token_6\|> |
| <\|begin▁of▁sentence\|> | <\|object_ref_start\|> | <\|begin_of_text\|> |
| <\|end▁of▁sentence\|> | <\|endoftext\|> | <\|end_of_text\|> |
| <\|User\|> | <\|im_start\|> | <\|reserved_special_token_3\|> |
| <\|Assistant\|> | <\|im_end\|> | <\|reserved_special_token_4\|> |
| Padding token | <\|vision_pad\|> | <\|finetune_right_pad_id\|> |
All distilled versions and the original R1 seem to have accidentally assigned the padding token to <|end▁of▁sentence|>, which is generally a bad idea, especially if you want to finetune further on top of these reasoning models. It can cause endless generations, since most training frameworks mask the padding token (and therefore the EOS token) out as -100, so the model never learns to stop. We fixed all distilled versions and the original R1 with the correct padding token (Qwen uses <|vision_pad|>, Llama uses <|finetune_right_pad_id|>, and R1 uses <|▁pad▁|> or our own added <|PAD▁TOKEN|>).
GGUF R1 Table
| MoE Bits | Type | Disk Size | Accuracy | Details |
| --- | --- | --- | --- | --- |
| 1.58bit | UD-IQ1_S | 131GB | Fair | MoE all 1.56bit. down_proj in MoE mixture of 2.06/1.56bit |
| 1.73bit | UD-IQ1_M | 158GB | Good | MoE all 1.56bit. down_proj in MoE left at 2.06bit |
| 2.22bit | UD-IQ2_XXS | 183GB | Better | MoE all 2.06bit. down_proj in MoE mixture of 2.5/2.06bit |
| 2.51bit | UD-Q2_K_XL | 212GB | Best | MoE all 2.5bit. down_proj in MoE mixture of 3.5/2.5bit |