Tutorial: How to Run DeepSeek-V3-0324 Locally
How to run DeepSeek-V3-0324 locally using our dynamic quants, which recover accuracy
DeepSeek is at it again! After releasing V3, R1 Zero, and R1 back in December 2024 and January 2025, DeepSeek has updated its V3 checkpoint with a March release!
According to DeepSeek, MMLU-Pro jumped +5.3 points to 81.2%, GPQA rose +9.3 points, AIME +19.8 points, and LiveCodeBench +10.0 points! They provided a plot showing how the new checkpoint compares to the previous V3 checkpoint and to other models like GPT-4.5 and Claude Sonnet 3.7. But how do we run a 671 billion parameter model locally?
| MoE Bits | Type | Disk Size | Accuracy | Details |
| --- | --- | --- | --- | --- |
| 1.78bit | IQ1_S | 173GB | Ok | 2.06/1.56bit |
| 1.93bit | IQ1_M | 183GB | Fair | 2.5/2.06/1.56 |
| 2.42bit | IQ2_XXS | 203GB | Suggested | 2.5/2.06bit |
| 2.71bit | Q2_K_XL | 231GB | Suggested | 3.5/2.5bit |
| 3.5bit | Q3_K_XL | 320GB | Great | 4.5/3.5bit |
| 4.5bit | Q4_K_XL | 406GB | Best | 5.5/4.5bit |
DeepSeek V3's original upload is in float8, which takes 715GB. Using Q4_K_M halves the file size to 404GB or so, and our dynamic 1.78bit quant fits in around 151GB. I suggest using our 2.7bit quant to balance size and accuracy! The 2.4bit one also works well!
According to DeepSeek, these are the recommended settings for inference:
- Temperature of 0.3 (maybe 0.0 for coding as seen here)
- Min_P of 0.00 (optional, but 0.01 works well; llama.cpp's default is 0.1)
- Chat template: <|User|>Create a simple playable Flappy Bird Game in Python. Place the final game inside of a markdown section.<|Assistant|>
- A BOS token of <|begin▁of▁sentence|> is auto added during tokenization (do NOT add it manually!)
- DeepSeek mentioned using a system prompt as well (optional) - it's in Chinese: 该助手为DeepSeek Chat,由深度求索公司创造。\n今天是3月24日,星期一。 which translates to: "The assistant is DeepSeek Chat, created by DeepSeek.\nToday is Monday, March 24th."
- For KV cache quantization, use 8bit, NOT 4bit - we found 4bit to do noticeably worse.
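To make the template concrete, here is a small sketch of assembling the prompt string by hand (the shell variable names are just for illustration, and it assumes the optional system prompt simply precedes the first <|User|> turn, as in DeepSeek's chat template):

```bash
# Sketch: assembling a prompt string for llama.cpp's --prompt flag.
# The BOS token <|begin▁of▁sentence|> is deliberately omitted - the tokenizer adds it automatically.
SYSTEM_PROMPT=$'该助手为DeepSeek Chat,由深度求索公司创造。\n今天是3月24日,星期一。'
USER_PROMPT='Create a simple playable Flappy Bird Game in Python. Place the final game inside of a markdown section.'
PROMPT="${SYSTEM_PROMPT}<|User|>${USER_PROMPT}<|Assistant|>"
printf '%s\n' "$PROMPT"   # pass this string to llama-cli via --prompt (full command further below)
```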
Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.

NOTE: building with -DGGML_CUDA=ON for GPUs might take 5 minutes to compile, while a CPU-only build takes about 1 minute. You might also be interested in llama.cpp's precompiled binaries.
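A minimal build sketch, assuming a Debian/Ubuntu machine with the CUDA toolkit installed (adjust the package list and paths to your setup; switch -DGGML_CUDA=ON to OFF for CPU-only):

```bash
# Install build tools, clone llama.cpp, and build it with CUDA enabled.
apt-get update && apt-get install -y build-essential cmake curl libcurl4-openssl-dev
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first
cp llama.cpp/build/bin/llama-* llama.cpp/
```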
Download the model (after installing the prerequisites via pip install huggingface_hub hf_transfer). You can choose UD-IQ1_S (the dynamic 1.78bit quant) or other quantized versions like Q4_K_M. I recommend using our 2.7bit dynamic quant UD-Q2_K_XL to balance size and accuracy. More versions are at: https://huggingface.co/unsloth/DeepSeek-V3-0324-GGUF
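One way to fetch just the quant you want is the huggingface-cli tool that ships with huggingface_hub (a sketch: the --include pattern and local directory are placeholders, and HF_HUB_ENABLE_HF_TRANSFER=1 turns on the faster hf_transfer backend):

```bash
pip install huggingface_hub hf_transfer
# Download only the UD-Q2_K_XL shards (~231GB); swap the pattern for other quants, e.g. "*UD-IQ1_S*".
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download \
    unsloth/DeepSeek-V3-0324-GGUF \
    --include "*UD-Q2_K_XL*" \
    --local-dir DeepSeek-V3-0324-GGUF
```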
Run Unsloth's Flappy Bird test as described in our 1.58bit Dynamic Quant for DeepSeek R1.

Edit --threads 32 for the number of CPU threads, --ctx-size 16384 for the context length, and --n-gpu-layers 2 for how many layers to offload to the GPU. Try lowering it if your GPU runs out of memory, and remove it entirely for CPU-only inference.
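Putting it together, here is a sketch of the full llama-cli call. The model path is a placeholder (point --model at the first .gguf shard you downloaded), and the sampling and KV cache flags mirror the recommended settings above:

```bash
# The --model path below is a placeholder: use the first shard of the split GGUF you downloaded.
./llama.cpp/llama-cli \
    --model DeepSeek-V3-0324-GGUF/UD-Q2_K_XL/FIRST-SHARD.gguf \
    --threads 32 \
    --ctx-size 16384 \
    --n-gpu-layers 2 \
    --cache-type-k q8_0 \
    --temp 0.3 \
    --min-p 0.01 \
    --prompt "<|User|>Create a simple playable Flappy Bird Game in Python. Place the final game inside of a markdown section.<|Assistant|>"
# Drop --n-gpu-layers for CPU-only inference, or lower it if your GPU runs out of memory.
# Recent llama.cpp builds may also need -no-cnv to run a raw --prompt non-interactively.
```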
If we run the above, we get the output of the old non-dynamic 2bit quant on the left (seizure warning, sorry!)
We also test our dynamic quants with a prompt from r/LocalLLaMA, which asks the model to create a basic physics engine that simulates balls rotating inside a moving, enclosed heptagon shape.
The dynamic 2.7bit quant, which is only 230GB in size, actually manages to solve the heptagon puzzle! The full output for all 3 versions (including full fp8) is below:
In empirical tests, we find that lower KV cache quantization (4bit) seems to degrade generation quality - more tests need to be done, but we suggest using q8_0 cache quantization. The point of quantizing the KV cache is to support longer context lengths, since the KV cache uses quite a bit of memory.
We found the down_proj matrices in this model to be extremely sensitive to quantization. We had to redo some of our dynamic quants which used 2 bits for down_proj, and we now use 3 bits as the minimum for all of these matrices.
Using llama.cpp's Flash Attention backend does result in somewhat faster decoding speeds. Use -DGGML_CUDA_FA_ALL_QUANTS=ON when compiling. It's also best to look up your GPU's CUDA architecture at https://developer.nvidia.com/cuda-gpus and set it via -DCMAKE_CUDA_ARCHITECTURES="80" to reduce compilation times.
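A sketch of the corresponding configure step ("80" is just an example for A100-class GPUs - substitute the value for your own card from the NVIDIA page above):

```bash
# Reconfigure llama.cpp with Flash Attention kernels for all KV cache quant types,
# pinning the CUDA architecture to cut compile times (80 = A100-class, as an example).
cmake llama.cpp -B llama.cpp/build \
    -DGGML_CUDA=ON \
    -DGGML_CUDA_FA_ALL_QUANTS=ON \
    -DCMAKE_CUDA_ARCHITECTURES="80"
cmake --build llama.cpp/build --config Release -j
# Then pass --flash-attn to llama-cli at run time (available in recent builds).
```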
Using min_p=0.01 is probably enough. llama.cpp defaults to 0.1, which is probably not necessary. Since a temperature of 0.3 is used anyway, we are very unlikely to sample low-probability tokens in the first place, so removing only the most improbable tokens is a good idea. DeepSeek recommends a temperature of 0.0 for coding tasks.
Non-dynamic 2bit. Fails - SEIZURE WARNING again!
Dynamic 2bit. Actually solves the heptagon puzzle correctly!!
Original float8