🐋DeepSeek-V3.1
A guide on how to run DeepSeek-V3.1 on your own local device!
DeepSeek’s V3.1 update introduces hybrid reasoning inference, combining 'think' and 'non-think' modes into one model. The full 671B-parameter model requires 715GB of disk space. The quantized dynamic 2-bit version uses 245GB (about a 66% reduction in size). GGUF: DeepSeek-V3.1-GGUF
Our DeepSeek-V3.1 GGUFs include Unsloth chat template fixes for llama.cpp supported backends.
All uploads use Unsloth Dynamic 2.0 for SOTA 5-shot MMLU and KL Divergence performance, meaning you can run & fine-tune quantized DeepSeek LLMs with minimal accuracy loss.
⚙️ Recommended Settings
The 2-bit quants will fit on a single 24GB GPU (with MoE layers offloaded to RAM). Expect around 7 tokens/s with this setup if you also have an additional 128GB of RAM. We recommend at least 246GB of RAM to run this quant. For optimal performance (5+ tokens/s) you will need at least 246GB of unified memory or 246GB of combined RAM + VRAM. We suggest using our 2.7bit (Q2_K_XL) or 2.4bit (IQ2_XXS) quant to balance size and accuracy.
Though not a must, for the best performance, your combined VRAM + RAM should be at least equal to the size of the quant you're downloading.
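As a quick sanity check before downloading, the rule of thumb above can be sketched in a few lines of Python. The function name `fits_in_memory` is made up for illustration, and the 246GB figure is simply the one quoted in this guide.

```python
def fits_in_memory(quant_gb: float, ram_gb: float, vram_gb: float = 0.0) -> bool:
    """Return True when combined RAM + VRAM is at least the size of the quant."""
    return ram_gb + vram_gb >= quant_gb


# Figure quoted in this guide for the dynamic 2-bit quant.
QUANT_GB = 246

# 24GB GPU + 128GB RAM = 152GB total, which is below the recommendation.
print(fits_in_memory(QUANT_GB, ram_gb=128, vram_gb=24))

# 24GB GPU + 256GB RAM = 280GB total, which meets it.
print(fits_in_memory(QUANT_GB, ram_gb=256, vram_gb=24))
```

A `False` result does not mean the model cannot run, since the guide notes this is not a hard requirement. It only means you should expect slower generation.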
🐳 Official Recommended Settings:
According to DeepSeek, these are the recommended settings for V3.1 inference:
Set the temperature to 0.6 to reduce repetition and incoherence.
Set top_p to 0.95 (recommended)
128K context length or less
🔢 Chat template/prompt format
You do not need to force <think>\n, but you can still add it in! With the given prefix, DeepSeek V3.1 generates responses to queries in non-thinking mode. Unlike DeepSeek V3, it introduces an additional token </think>.
<|begin▁of▁sentence|>{system prompt}<|User|>{query}<|Assistant|></think>
A BOS is forcibly added, and an EOS separates each interaction. To avoid double BOS tokens during inference, you should only call tokenizer.encode(..., add_special_tokens = False), since the chat template adds a BOS token automatically as well.
For llama.cpp / GGUF inference, you should skip the BOS since it’ll auto add it:
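One way to handle this is to strip a leading BOS from any prompt string before handing it to a backend that adds its own. The sketch below is only an illustration: `strip_leading_bos` is a hypothetical helper, and the BOS string is copied from the templates in this guide.

```python
# BOS token string as written in this guide's templates.
BOS = "<|begin▁of▁sentence|>"


def strip_leading_bos(prompt: str) -> str:
    """Remove one leading BOS so a backend that adds its own does not double it."""
    if prompt.startswith(BOS):
        return prompt[len(BOS):]
    return prompt


full_prompt = BOS + "You are helpful.<|User|>What is 1+1?<|Assistant|></think>"

# The string passed to llama.cpp starts directly with the system prompt.
print(strip_leading_bos(full_prompt))
```

Prompts that never had a BOS pass through unchanged, so the helper is safe to apply unconditionally.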
Non-Thinking Mode
First-Turn
Prefix: <|begin▁of▁sentence|>{system prompt}<|User|>{query}<|Assistant|></think>
With the given prefix, DeepSeek V3.1 generates responses to queries in non-thinking mode. Unlike DeepSeek V3, it introduces an additional token </think>.
Multi-Turn
Context: <|begin▁of▁sentence|>{system prompt}<|User|>{query}<|Assistant|></think>{response}<|end▁of▁sentence|>...<|User|>{query}<|Assistant|></think>{response}<|end▁of▁sentence|>
Prefix: <|User|>{query}<|Assistant|></think>
By concatenating the context and the prefix, we obtain the correct prompt for the query.
Thinking Mode
First-Turn
Prefix: <|begin▁of▁sentence|>{system prompt}<|User|>{query}<|Assistant|><think>
The prefix of thinking mode is similar to DeepSeek-R1.
Multi-Turn
Context: <|begin▁of▁sentence|>{system prompt}<|User|>{query}<|Assistant|></think>{response}<|end▁of▁sentence|>...<|User|>{query}<|Assistant|></think>{response}<|end▁of▁sentence|>
Prefix: <|User|>{query}<|Assistant|><think>
The multi-turn template is the same as the non-thinking multi-turn chat template. This means the thinking content of earlier turns is dropped, but the </think> token is retained in every turn of the context.
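The templates above can be assembled programmatically. The following is a minimal sketch, not an official implementation: `build_prompt` and its parameters are hypothetical names, and it assumes each earlier response in `history` has already had its reasoning text removed.

```python
from typing import List, Tuple

# Token strings as written in this guide's templates.
BOS = "<|begin▁of▁sentence|>"
EOS = "<|end▁of▁sentence|>"


def build_prompt(
    system_prompt: str,
    history: List[Tuple[str, str]],
    query: str,
    thinking: bool = False,
    add_bos: bool = True,
) -> str:
    """Build a first-turn or multi-turn prompt for either mode.

    history holds (user_query, assistant_response) pairs from earlier turns.
    Set add_bos=False for backends that add the BOS themselves.
    """
    parts = [BOS if add_bos else "", system_prompt]
    for past_query, past_response in history:
        # Earlier turns always carry </think>, whichever mode produced them.
        parts.append(f"<|User|>{past_query}<|Assistant|></think>{past_response}{EOS}")
    # Only the final prefix differs between thinking and non-thinking mode.
    mode_token = "<think>" if thinking else "</think>"
    parts.append(f"<|User|>{query}<|Assistant|>{mode_token}")
    return "".join(parts)


print(build_prompt("You are helpful.", [("Hi", "Hello!")], "What is 1+1?", thinking=True))
```

With an empty `history` the function produces the first-turn prefix, so the same helper covers all four cases listed above.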
ToolCall
Tool calling is supported in non-thinking mode. The format is:
<|begin▁of▁sentence|>{system prompt}{tool_description}<|User|>{query}<|Assistant|></think>
where the tool_description is.
Run DeepSeek-V3.1 Tutorials:
🦙 Run in Ollama/Open WebUI
Install ollama if you haven't already! To run more variants of the model, see here.
apt-get update
apt-get install pciutils -y
curl -fsSL https://ollama.com/install.sh | sh
Run the model! Note that you can call ollama serve in another terminal if it fails! We include all our fixes and suggested parameters (temperature etc.) in params in our Hugging Face upload! To run the quants, you first need to merge the GGUF split files (6 for the quant shown here) into 1, as in the code below. Then you can run the model locally.
./llama.cpp/llama-gguf-split --merge \
DeepSeek-V3.1-GGUF/DeepSeek-V3.1-UD-Q2_K_XL/DeepSeek-V3.1-UD-Q2_K_XL-00001-of-00006.gguf \
merged_file.gguf
OLLAMA_MODELS=unsloth_downloaded_models ollama serve &
ollama run hf.co/unsloth/DeepSeek-V3.1-GGUF:UD-Q2_K_XL
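Before merging, it can help to confirm that every split file actually finished downloading. Here is a small sketch that derives the expected file names from the first split's `-00001-of-00006` suffix and reports any that are missing. The helper names are made up for illustration, and the directory you pass in is whatever folder you downloaded to.

```python
import os
import re
from typing import List


def split_shard_names(first_shard: str) -> List[str]:
    """Expand 'name-00001-of-0000N.gguf' into the full list of N file names."""
    match = re.search(r"-(\d{5})-of-(\d{5})\.gguf$", first_shard)
    if match is None:
        raise ValueError(f"not a split GGUF file name: {first_shard}")
    total = int(match.group(2))
    stem = first_shard[: match.start()]
    return [f"{stem}-{i:05d}-of-{total:05d}.gguf" for i in range(1, total + 1)]


def missing_shards(directory: str, first_shard: str) -> List[str]:
    """Return the expected split files that are not present in directory."""
    return [
        name
        for name in split_shard_names(first_shard)
        if not os.path.isfile(os.path.join(directory, name))
    ]


names = split_shard_names("DeepSeek-V3.1-UD-Q2_K_XL-00001-of-00006.gguf")
print(len(names), "split files expected")
for name in names:
    print(name)
```

An empty list from `missing_shards` means all the split files are in place and the merge command can be run.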
Open WebUI also made a step-by-step tutorial on how to run R1. For V3.1, you just need to replace R1 with the new V3.1 quant.
✨ Run in llama.cpp
Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.
apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggerganov/llama.cpp
cmake llama.cpp -B llama.cpp/build \
-DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-quantize llama-cli llama-gguf-split llama-mtmd-cli
cp llama.cpp/build/bin/llama-* llama.cpp
If you want to use llama.cpp directly to load models, you can do the below, where (:Q2_K_XL) is the quantization type. This is similar to ollama run. You can also download via Hugging Face (see the download step further down). Use export LLAMA_CACHE="folder" to force llama.cpp to save to a specific location. Remember that the model has a maximum context length of 128K.
Try -ot ".ffn_.*_exps.=CPU" to offload all MoE layers to the CPU! This lets you fit all non-MoE layers on 1 GPU, improving generation speeds. You can customize the regex to keep more layers on the GPU if you have more GPU capacity.
If you have a bit more GPU memory, try -ot ".ffn_(up|down)_exps.=CPU". This offloads the up and down projection MoE layers.
Try -ot ".ffn_(up)_exps.=CPU" if you have even more GPU memory. This offloads only the up projection MoE layers.
Finally, -ot ".ffn_.*_exps.=CPU" offloads all MoE layers and uses the least VRAM.
You can also customize the regex. For example, -ot "\.(6|7|8|9|[0-9][0-9]|[0-9][0-9][0-9])\.ffn_(gate|up|down)_exps.=CPU" offloads the gate, up and down MoE layers, but only from layer 6 onwards.
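To see what a custom regex would select before launching a long run, you can test it against a few tensor names in Python. This is only a rough sketch: the tensor names below are assumed to follow the usual GGUF `blk.N.…` naming and are not read from the actual model file, and Python's `re` is used as a stand-in for llama.cpp's own regex matching.

```python
import re

# The part of the -ot argument before "=CPU", copied from the example above.
PATTERN = r"\.(6|7|8|9|[0-9][0-9]|[0-9][0-9][0-9])\.ffn_(gate|up|down)_exps."


def goes_to_cpu(tensor_name: str, pattern: str = PATTERN) -> bool:
    """Return True if the pattern matches anywhere in the tensor name."""
    return re.search(pattern, tensor_name) is not None


# Assumed example names; real names come from the GGUF file itself.
sample_names = [
    "blk.5.ffn_up_exps.weight",     # layer 5: below the cutoff
    "blk.6.ffn_up_exps.weight",     # layer 6: first offloaded layer
    "blk.12.ffn_down_exps.weight",  # two-digit layer index
    "blk.12.attn_q.weight",         # not an MoE expert tensor
]

for name in sample_names:
    target = "CPU" if goes_to_cpu(name) else "default device"
    print(f"{name} -> {target}")
```

Only the layer 6 and layer 12 expert tensors are selected here. The attention tensor and the layer 5 expert tensor stay on the default device.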
export LLAMA_CACHE="unsloth/DeepSeek-V3.1-GGUF"
./llama.cpp/llama-cli \
-hf unsloth/DeepSeek-V3.1-GGUF:Q2_K_XL \
--cache-type-k q4_0 \
--threads -1 \
--n-gpu-layers 99 \
--prio 3 \
--temp 0.6 \
--top_p 0.95 \
--min_p 0.01 \
--ctx-size 16384 \
--seed 3407 \
-ot ".ffn_.*_exps.=CPU"
Download the model via the script below (after installing the dependencies with pip install huggingface_hub hf_transfer). You can choose UD-Q2_K_XL (dynamic 2bit quant) or other quantized versions like Q4_K_M. We recommend using our 2.7bit dynamic quant UD-Q2_K_XL to balance size and accuracy.
# !pip install huggingface_hub hf_transfer
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "0" # Can sometimes rate limit, so set to 0 to disable
from huggingface_hub import snapshot_download
snapshot_download(
repo_id = "unsloth/DeepSeek-V3.1-GGUF",
local_dir = "unsloth/DeepSeek-V3.1-GGUF",
allow_patterns = ["*UD-Q2_K_XL*"], # Dynamic 2bit quant; change the pattern to download a different quant
)
Run the model by prompting it.
You can edit --threads 32 to set the number of CPU threads, --ctx-size 16384 to set the context length, and --n-gpu-layers 2 to set how many layers are offloaded to the GPU. Try lowering the last value if your GPU runs out of memory, and remove it entirely for CPU-only inference.
./llama.cpp/llama-cli \
--model unsloth/DeepSeek-V3.1-GGUF/UD-Q2_K_XL/DeepSeek-V3.1-UD-Q2_K_XL-00001-of-00006.gguf \
--cache-type-k q4_0 \
--threads -1 \
--n-gpu-layers 99 \
--prio 3 \
--temp 0.6 \
--top_p 0.95 \
--min_p 0.01 \
--ctx-size 16384 \
--seed 3407 \
-ot ".ffn_.*_exps.=CPU" \
-no-cnv \
--prompt "<|User|>Create a Flappy Bird game in Python. You must include these things:\n1. You must use pygame.\n2. The background color should be randomly chosen and is a light shade. Start with a light blue color.\n3. Pressing SPACE multiple times will accelerate the bird.\n4. The bird's shape should be randomly chosen as a square, circle or triangle. The color should be randomly chosen as a dark color.\n5. Place on the bottom some land colored as dark brown or yellow chosen randomly.\n6. Make a score shown on the top right side. Increment if you pass pipes and don't hit them.\n7. Make randomly spaced pipes with enough space. Color them randomly as dark green or light brown or a dark gray shade.\n8. When you lose, show the best score. Make the text inside the screen. Pressing q or Esc will quit the game. Restarting is pressing SPACE again.\nThe final game should be inside a markdown section in Python. Check your code for errors and fix them before the final markdown section.<|Assistant|>"
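If you launch llama-cli from a script, the flags described above can be assembled in Python. The sketch below is illustrative only: `llama_cli_args` is a hypothetical helper, it covers just a subset of the flags in the command above, and it assumes "128K" means 131072 tokens.

```python
import shlex
from typing import List, Optional

# Assumption: the 128K context limit is read as 128 * 1024 tokens.
MAX_CTX = 128 * 1024


def llama_cli_args(
    model_path: str,
    threads: int = -1,
    ctx_size: int = 16384,
    n_gpu_layers: Optional[int] = 99,
) -> List[str]:
    """Build a llama-cli argument list; n_gpu_layers=None means CPU-only."""
    if ctx_size > MAX_CTX:
        raise ValueError(f"ctx_size {ctx_size} exceeds the {MAX_CTX}-token limit")
    args = [
        "./llama.cpp/llama-cli",
        "--model", model_path,
        "--threads", str(threads),
        "--ctx-size", str(ctx_size),
        "--temp", "0.6",
        "--top_p", "0.95",
        "--min_p", "0.01",
    ]
    if n_gpu_layers is not None:
        # Leave the flag out entirely for CPU-only inference.
        args += ["--n-gpu-layers", str(n_gpu_layers)]
    return args


MODEL = "unsloth/DeepSeek-V3.1-GGUF/UD-Q2_K_XL/DeepSeek-V3.1-UD-Q2_K_XL-00001-of-00006.gguf"
print(shlex.join(llama_cli_args(MODEL, threads=32, n_gpu_layers=2)))
```

The returned list can be passed straight to `subprocess.run`, which avoids shell quoting issues with long prompts.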
Model uploads
All our uploads, including those that are not imatrix-based or dynamic, use our calibration dataset, which is specifically optimized for conversational, coding, and language tasks.
Full DeepSeek-V3.1 model uploads below:
We also uploaded IQ4_NL and Q4_1 quants, which run faster specifically on ARM and Apple devices, respectively.
We've also uploaded versions in BF16 format, and original FP8 (float8) format.