🐋Tutorial: How to Run DeepSeek-R1 on your own local device
A guide on how you can run our 1.58-bit Dynamic Quants for DeepSeek-R1 using llama.cpp.
Using llama.cpp (recommended)
Do not forget about the <|User|> and <|Assistant|> tokens when writing your prompt - or use a chat template formatter.

Obtain the latest llama.cpp at github.com/ggerganov/llama.cpp. You can follow the build instructions below as well:
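A minimal build sketch, assuming a Linux machine with CUDA; the CMake flags and target names below match current llama.cpp, but check the repo's README for your platform:

```bash
# Install build dependencies (Debian/Ubuntu assumed)
apt-get update
apt-get install -y build-essential cmake curl libcurl4-openssl-dev

# Clone and build llama.cpp with CUDA support
git clone https://github.com/ggerganov/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j \
    --target llama-cli llama-gguf-split

# Copy the built binaries next to the repo root for convenience
cp llama.cpp/build/bin/llama-* llama.cpp/
```

If you do not have an NVIDIA GPU, drop -DGGML_CUDA=ON for a CPU-only build.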
It's best to use --min-p 0.05 to counteract very rare token predictions - I found this to work well especially for the 1.58-bit model.

Download the model via:
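One way to fetch only the 1.58-bit split files is with the Hugging Face CLI; the repo id and filename pattern below are assumptions, so check the model card for the exact names:

```bash
pip install -U "huggingface_hub[cli]"

# Download only the 1.58-bit dynamic quant (UD-IQ1_S) split GGUF files
huggingface-cli download unsloth/DeepSeek-R1-GGUF \
    --include "*UD-IQ1_S*" \
    --local-dir DeepSeek-R1-GGUF
```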
Below is an example run with a Q4_0 quantized K cache. Notice that -no-cnv disables auto conversation mode.
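A sketch of the run command; the model path, prompt, and sampling settings are illustrative, not prescriptive:

```bash
# --cache-type-k q4_0 : quantize the K cache to Q4_0 to save memory
# -no-cnv             : run the single prompt below instead of auto conversation mode
# --min-p 0.05        : counteract very rare token predictions
./llama.cpp/llama-cli \
    --model DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
    --cache-type-k q4_0 \
    --threads 16 \
    --ctx-size 8192 \
    --temp 0.6 \
    --min-p 0.05 \
    -no-cnv \
    --prompt "<|User|>Create a Flappy Bird game in Python.<|Assistant|>"
```

Passing the first split file is enough; llama.cpp picks up the remaining splits automatically.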
Example output:
If you have a GPU with 24GB of VRAM (an RTX 4090, for example), you can offload multiple layers to the GPU for faster processing. If you have multiple GPUs, you can probably offload more layers.
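For instance, adding --n-gpu-layers to the same command offloads that many layers; the value 7 comes from the table further below (1.58-bit quant on a 24GB GPU):

```bash
# Offload 7 layers of the 1.58-bit quant to a 24GB GPU (see the table below)
./llama.cpp/llama-cli \
    --model DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
    --cache-type-k q4_0 \
    --n-gpu-layers 7 \
    -no-cnv \
    --prompt "<|User|>Create a Flappy Bird game in Python.<|Assistant|>"
```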
If you want to merge the weights together into a single GGUF file, use a script like the one below:
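A sketch using the llama-gguf-split tool built earlier; the file names follow the 1.58-bit download above:

```bash
# Merge the three split files into a single GGUF
# (pass the first split; the tool locates the remaining parts itself)
./llama.cpp/llama-gguf-split --merge \
    DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
    DeepSeek-R1-UD-IQ1_S-merged.gguf
```

Merging is optional for llama.cpp itself, but some tools (such as Ollama below) expect a single file.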
DeepSeek R1 has 61 layers. For example, with a 24GB GPU, an 80GB GPU, or two 80GB GPUs, you can expect to offload the following number of layers after rounding down (reduce by 1 if it goes out of memory):
| Quant (File Size) | 24GB GPU | 80GB GPU | 2x 80GB GPUs |
| --- | --- | --- | --- |
| 1.58bit (131GB) | 7 | 33 | All layers (61) |
| 1.73bit (158GB) | 5 | 26 | 57 |
| 2.22bit (183GB) | 4 | 22 | 49 |
| 2.51bit (212GB) | 2 | 19 | 32 |
Running on Mac / Apple devices
For Apple Metal devices, be careful with --n-gpu-layers. If you find the machine going out of memory, reduce the value. For a 128GB unified-memory machine, you should be able to offload 59 layers or so.
Run in Ollama
If you want to use Ollama for inference on GGUFs, you first need to merge the 3 GGUF split files into 1, as in the merge step above. Then you can run the model locally.
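A sketch assuming the merged file produced earlier; the Modelfile contents and model name are placeholders:

```bash
# Point a Modelfile at the merged GGUF
cat > Modelfile <<'EOF'
FROM ./DeepSeek-R1-UD-IQ1_S-merged.gguf
EOF

# Create the model in Ollama, then run it
ollama create deepseek-r1-1.58bit -f Modelfile
ollama run deepseek-r1-1.58bit
```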