🐳 DeepSeek-V3-0324: How to Run Locally

How to run DeepSeek-V3-0324 locally using our dynamic quants, which recover accuracy lost by standard quantization

Please see https://docs.unsloth.ai/basics/deepseek-r1-0528-how-to-run-locally (May 28th, 2025 update) to learn how to run DeepSeek faster and more efficiently!

DeepSeek is at it again! After releasing V3, R1 Zero and R1 in December 2024 and January 2025, DeepSeek has updated its V3 checkpoint with a March release!

According to DeepSeek, MMLU-Pro jumped +5.3% to 81.2%, GPQA rose +9.3 points, AIME +19.8 points and LiveCodeBench +10.0 points! They provided a plot comparing the new checkpoint to the previous V3 checkpoint and other models like GPT-4.5 and Claude 3.7 Sonnet. But how do we run a 671 billion parameter model locally?

| MoE Bits | Type | Disk Size | Accuracy | Details |
| --- | --- | --- | --- | --- |
| 1.78bit | IQ1_S | 173GB | Ok | 2.06/1.56bit |
| 1.93bit | IQ1_M | 183GB | Fair | 2.5/2.06/1.56bit |
| 2.42bit | IQ2_XXS | 203GB | Suggested | 2.5/2.06bit |
| 2.71bit | Q2_K_XL | 231GB | Suggested | 3.5/2.5bit |
| 3.5bit | Q3_K_XL | 320GB | Great | 4.5/3.5bit |
| 4.5bit | Q4_K_XL | 406GB | Best | 5.5/4.5bit |

According to DeepSeek, these are the recommended settings for inference:

  • Temperature of 0.3 (Maybe 0.0 for coding as seen here)

  • Min_P of 0.00 (optional, but 0.01 works well, llama.cpp default is 0.1)

  • Chat template: <|User|>Create a simple playable Flappy Bird Game in Python. Place the final game inside of a markdown section.<|Assistant|>

  • A BOS token of <|begin▁of▁sentence|> is auto added during tokenization (do NOT add it manually!)

  • DeepSeek mentioned using a system prompt as well (optional) - it's in Chinese: 该助手为DeepSeek Chat,由深度求索公司创造。\n今天是3月24日,星期一。 which translates to: The assistant is DeepSeek Chat, created by DeepSeek.\nToday is Monday, March 24th.

  • For KV cache quantization, use 8bit, NOT 4bit - we found 4bit to noticeably degrade quality.
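
As a rough sketch (not an official Unsloth snippet), the settings above map onto standard llama-cli flags as shown below; the assumption here is that the optional system prompt is simply prepended before the first <|User|> turn, as in DeepSeek's published chat template:

```bash
# Recommended sampling + KV-cache settings expressed as llama-cli flags (use --temp 0.0 for coding)
SAMPLING_ARGS="--temp 0.3 --min-p 0.01 --cache-type-k q8_0"

# Optional Chinese system prompt goes before the first <|User|> turn; llama.cpp adds the BOS token itself
SYSTEM=$'该助手为DeepSeek Chat,由深度求索公司创造。\n今天是3月24日,星期一。'
PROMPT="${SYSTEM}<|User|>Create a simple playable Flappy Bird Game in Python. Place the final game inside of a markdown section.<|Assistant|>"
```

The tutorial below spells these flags out in full commands.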

📖 Tutorial: How to Run DeepSeek-V3 in llama.cpp

  1. Obtain the latest llama.cpp from GitHub. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.
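
     A minimal build sketch (assuming Linux with CMake and the CUDA toolkit installed; this follows llama.cpp's standard CMake flow):

     ```bash
     # Clone and build llama.cpp with CUDA (switch to -DGGML_CUDA=OFF for CPU-only inference)
     git clone https://github.com/ggml-org/llama.cpp
     cmake llama.cpp -B llama.cpp/build -DGGML_CUDA=ON
     cmake --build llama.cpp/build --config Release -j
     # Binaries such as llama-cli end up in llama.cpp/build/bin/
     ```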

  2. Download the model (after installing the dependencies via pip install huggingface_hub hf_transfer). You can choose UD-IQ1_S (dynamic 1.78bit quant) or other quantized versions like Q4_K_M. We recommend our 2.7bit dynamic quant UD-Q2_K_XL to balance size and accuracy. More versions at: https://huggingface.co/unsloth/DeepSeek-V3-0324-GGUF. One way to download is shown below.
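
     One way to fetch only the UD-Q2_K_XL files is via huggingface-cli (a sketch; the --include pattern assumes the quant name appears in the filenames of the repo above):

     ```bash
     pip install huggingface_hub hf_transfer
     # Optional: hf_transfer speeds up large downloads
     export HF_HUB_ENABLE_HF_TRANSFER=1
     # Download only the 2.7bit dynamic quant; change the pattern for other versions (e.g. "*UD-IQ1_S*")
     huggingface-cli download unsloth/DeepSeek-V3-0324-GGUF \
         --include "*UD-Q2_K_XL*" \
         --local-dir DeepSeek-V3-0324-GGUF
     ```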

  3. Run Unsloth's Flappy Bird test as described in our 1.58bit Dynamic Quant for DeepSeek R1.

  4. Set --threads 32 to your number of CPU threads, --ctx-size 16384 to the desired context length, and --n-gpu-layers 2 to the number of layers to offload to the GPU. Lower it if your GPU runs out of memory, and remove it entirely for CPU-only inference. A full example invocation is sketched below.
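
     Putting the flags together, a sketch of a full llama-cli run (the --model path is a placeholder - point it at the first .gguf shard you downloaded):

     ```bash
     # Flappy Bird test with the recommended sampling settings; tune --threads and --n-gpu-layers to your hardware
     ./llama.cpp/build/bin/llama-cli \
         --model DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-UD-Q2_K_XL-00001-of-0000X.gguf \
         --threads 32 \
         --ctx-size 16384 \
         --n-gpu-layers 2 \
         --temp 0.3 \
         --min-p 0.01 \
         --cache-type-k q8_0 \
         --prompt "<|User|>Create a simple playable Flappy Bird Game in Python. Place the final game inside of a markdown section.<|Assistant|>"
     ```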

If we run the above, we get two very different results (seizure warning for the standard 2-bit result!):

Standard 2-bit: fails to draw the background and fails on collisions.

Dynamic 2-bit: succeeds in creating a playable game.

  5. Like DeepSeek-R1, V3 has 61 layers. For example, with a 24GB or 80GB GPU, you can expect to offload the following number of layers (after rounding down; reduce it by 1 if you run out of memory):

| Quant | File Size | 24GB GPU | 80GB GPU | 2x80GB GPU |
| --- | --- | --- | --- | --- |
| 1.73bit | 173GB | 5 | 25 | 56 |
| 2.22bit | 183GB | 4 | 22 | 49 |
| 2.51bit | 212GB | 2 | 19 | 32 |
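
For example, reading the 80GB column off the table above for the 1.73bit (IQ1_S) quant gives 25 layers (a sketch; the --model path is a placeholder):

```bash
# 1.73bit quant on a single 80GB GPU: offload 25 of the 61 layers
./llama.cpp/build/bin/llama-cli \
    --model DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-UD-IQ1_S-00001-of-0000X.gguf \
    --n-gpu-layers 25 --threads 32 --ctx-size 16384 --temp 0.3 --min-p 0.01
```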

Running on Mac / Apple devices

For Apple Metal devices, be careful of --n-gpu-layers. If you find the machine going out of memory, reduce it. For a 128GB unified memory machine, you should be able to offload 59 layers or so.
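
A sketch for Apple Silicon (assuming the standard CMake build, where Metal is enabled by default on macOS; the --model path is a placeholder):

```bash
# macOS build: no CUDA flags needed, Metal is on by default
cmake llama.cpp -B llama.cpp/build
cmake --build llama.cpp/build --config Release -j

# 128GB unified memory: start around 59 offloaded layers and reduce if you hit out-of-memory errors
./llama.cpp/build/bin/llama-cli \
    --model DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-UD-IQ1_S-00001-of-0000X.gguf \
    --n-gpu-layers 59 --ctx-size 16384 --temp 0.3 --min-p 0.01
```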

🎱 Heptagon Test

We also test our dynamic quants with a prompt from r/LocalLLaMA, which asks the model to write a basic physics engine that simulates balls moving inside a spinning, enclosed heptagon.

The goal is to make the heptagon spin while the balls inside it keep moving.
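
To reproduce this kind of test, it is easiest to keep the long prompt in a text file and wrap it in the chat template (a sketch; heptagon_prompt.txt is a placeholder for the actual r/LocalLLaMA prompt text, which is not reproduced here):

```bash
# Wrap the raw prompt file in DeepSeek's chat template
printf '<|User|>%s<|Assistant|>' "$(cat heptagon_prompt.txt)" > wrapped_prompt.txt

# Same flags as in the tutorial above; --file reads the prompt from disk instead of --prompt
./llama.cpp/build/bin/llama-cli \
    --model DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-UD-Q2_K_XL-00001-of-0000X.gguf \
    --threads 32 --ctx-size 16384 --n-gpu-layers 2 \
    --temp 0.3 --min-p 0.01 --cache-type-k q8_0 \
    --file wrapped_prompt.txt
```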

Non Dynamic 2bit. Fails - SEIZURE WARNING again!


Dynamic 2bit. Actually solves the heptagon puzzle correctly!!


Original float8

The dynamic 2.7bit quant, which is only 230GB in size, actually manages to solve the heptagon puzzle! The full outputs for all 3 versions (including the full fp8 model) are below:

Dynamic 2bit Heptagon code
Non Dynamic 2bit Heptagon code
Float8 Heptagon code

🕵️ Extra Findings & Tips

  1. We find that lower KV cache quantization (4bit) seems to degrade generation quality in our empirical tests - more tests are needed, but we suggest using q8_0 cache quantization. The point of quantizing the KV cache is to support longer context lengths, since the cache uses quite a bit of memory.

  2. We found the down_proj matrices in this model to be extremely sensitive to quantization. We had to redo some of our dynamic quants that used 2 bits for down_proj; we now use 3 bits as the minimum for all of these matrices.

  3. Using llama.cpp's Flash Attention backend does result in somewhat faster decoding speeds. Use -DGGML_CUDA_FA_ALL_QUANTS=ON when compiling. It's also best to set your CUDA architecture (see https://developer.nvidia.com/cuda-gpus) to reduce compilation times, via -DCMAKE_CUDA_ARCHITECTURES="80" - see the build sketch after this list.

  4. Using min_p=0.01 is probably enough. llama.cpp defaults to 0.1, which is likely unnecessary. Since a temperature of 0.3 is used anyway, we are already very unlikely to sample low-probability tokens, so min_p mainly acts as a safety net to remove the most unlikely ones. DeepSeek recommends a temperature of 0.0 for coding tasks.
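
As referenced in points 1 and 3, a sketch of the flash-attention build plus an 8bit KV cache at run time (CUDA architecture "80" is just an example for A100-class GPUs; the --model path is a placeholder):

```bash
# Compile flash-attention kernels for all quantized KV-cache types, targeting a single CUDA architecture
cmake llama.cpp -B llama.cpp/build -DGGML_CUDA=ON \
    -DGGML_CUDA_FA_ALL_QUANTS=ON -DCMAKE_CUDA_ARCHITECTURES="80"
cmake --build llama.cpp/build --config Release -j

# Run with flash attention and an 8bit (not 4bit) KV cache; quantizing the V cache requires flash attention
./llama.cpp/build/bin/llama-cli \
    --model DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-UD-Q2_K_XL-00001-of-0000X.gguf \
    --flash-attn --cache-type-k q8_0 --cache-type-v q8_0 \
    --threads 32 --ctx-size 16384 --n-gpu-layers 2 --temp 0.3 --min-p 0.01
```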
