🌙 Kimi K2 Thinking: How to Run Locally
Guide on running Kimi-K2-Thinking and Kimi-K2 on your own local device!
Kimi-K2-Thinking has been released. Read our Thinking guide or access the GGUFs here.
We also collaborated with the Kimi team on a system prompt fix for Kimi-K2-Thinking.
Kimi-K2 and Kimi-K2-Thinking achieve SOTA performance in knowledge, reasoning, coding, and agentic tasks. The full 1T-parameter model from Moonshot AI requires 1.09TB of disk space, while the quantized Unsloth Dynamic 1.8-bit version reduces this to just 230GB (an 80% reduction): Kimi-K2-GGUF
You can also now run our Kimi-K2-Thinking GGUFs.
All uploads use Unsloth Dynamic 2.0 for SOTA Aider Polyglot and 5-shot MMLU performance. See how our Dynamic 1–2 bit GGUFs perform on coding benchmarks here.
⚙️ Recommended Requirements
The 1.8-bit (UD-TQ1_0) quant will fit on a single 24GB GPU (with all MoE layers offloaded to system RAM or a fast disk). Expect around 1-2 tokens/s with this setup if you also have roughly 256GB of system RAM. The full Kimi K2 Q8 quant is 1.09TB in size and will need at least 8 x H200 GPUs.
For optimal performance (5+ tokens/s) you will need at least 247GB of unified memory, or 247GB of combined RAM+VRAM. With less than 247GB of combined RAM+VRAM the model will still run, but speed will take a noticeable hit.
If you do not have 247GB of RAM+VRAM, no worries! llama.cpp supports disk offloading via mmap, so the model will still work, just more slowly: where you might otherwise get 5 to 10 tokens per second, expect under 1 token per second.
We suggest using our UD-Q2_K_XL (360GB) quant to balance size and accuracy!
For the best performance, have your combined VRAM + RAM be at least the size of the quant you're downloading. If not, it will still work via disk offloading, just more slowly!
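As a rough sanity check, below is a minimal Python sketch (assuming psutil is installed, and torch only if you want to count VRAM; neither is required by llama.cpp itself) that adds up system RAM and GPU VRAM and compares the total against the size of the quant you plan to download:
# Minimal sketch: check whether combined RAM + VRAM covers the quant size.
import psutil

quant_size_gb = 381  # e.g. UD-Q2_K_XL for Kimi-K2-Thinking; use the size listed on the model card

ram_gb = psutil.virtual_memory().total / 1024**3

vram_gb = 0.0
try:
    import torch  # optional, only used to read total VRAM
    if torch.cuda.is_available():
        vram_gb = sum(
            torch.cuda.get_device_properties(i).total_memory
            for i in range(torch.cuda.device_count())
        ) / 1024**3
except ImportError:
    pass

total_gb = ram_gb + vram_gb
print(f"RAM {ram_gb:.0f} GB + VRAM {vram_gb:.0f} GB = {total_gb:.0f} GB combined")
if total_gb >= quant_size_gb:
    print("The quant should fit in memory, so expect the faster (5+ tokens/s) regime.")
else:
    print("It will still run via llama.cpp's mmap disk offloading, just slower.")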
💭 Kimi-K2-Thinking Guide
Kimi-K2-Thinking should generally follow the same instructions as the Instruct model, with a few key differences, particularly in areas such as settings and the chat template.
To run the model at full precision, you only need the 4-bit or 5-bit Dynamic GGUFs (e.g. UD-Q4_K_XL), because the model was originally released in INT4 format.
You can choose a higher-bit quantization just to be safe in case of small quantization differences, but in most cases this is unnecessary.
🌙 Official Recommended Settings:
According to Moonshot AI, these are the recommended settings for Kimi-K2-Thinking inference:
Set the temperature to 1.0 to reduce repetition and incoherence.
Suggested context length = 98,304 (up to 256K)
Note: Using different tools may require different settings
For example, given a user message of "What is 1+1?", we get:
<|im_system|>system<|im_middle|>You are Kimi, an AI assistant created by Moonshot AI.<|im_end|><|im_user|>user<|im_middle|>What is 1+1?<|im_end|><|im_assistant|>assistant<|im_middle|>
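If you want to reproduce this string programmatically, the following is a minimal sketch using transformers' apply_chat_template (it assumes pip install transformers and that you are comfortable with trust_remote_code=True, since the repo ships a custom tokenization_kimi.py):
# Minimal sketch: render the Kimi-K2-Thinking chat template with transformers.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "moonshotai/Kimi-K2-Thinking",
    trust_remote_code = True,
)

messages = [{"role": "user", "content": "What is 1+1?"}]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize = False,
    add_generation_prompt = True,
)
print(prompt)
# Expected to begin with <|im_system|>system<|im_middle|> and end with
# <|im_assistant|>assistant<|im_middle|>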
✨ Run Kimi K2 Thinking in llama.cpp
You can now use the latest update of llama.cpp to run the model:
Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.
apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
-DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-quantize llama-cli llama-gguf-split llama-mtmd-cli
cp llama.cpp/build/bin/llama-* llama.cpp
If you want to use llama.cpp directly to load models, you can do the below: (:UD-TQ1_0) is the quantization type. You can also download via Hugging Face (point 3). This is similar to ollama run. Use export LLAMA_CACHE="folder" to force llama.cpp to save to a specific location.
export LLAMA_CACHE="unsloth/Kimi-K2-Thinking-GGUF"
./llama.cpp/llama-cli \
-hf unsloth/Kimi-K2-Thinking-GGUF:UD-TQ1_0 \
--n-gpu-layers 99 \
--temp 1.0 \
--min-p 0.01 \
--ctx-size 16384 \
--seed 3407 \
-ot ".ffn_.*_exps.=CPU"The above will use around 8GB of GPU memory. If you have around 360GB of combined GPU memory, remove
-ot ".ffn_.*_exps.=CPU"to get maximum speed!
Download the model (after installing huggingface_hub and hf_transfer via pip install huggingface_hub hf_transfer). We recommend using our 2-bit dynamic quant UD-Q2_K_XL to balance size and accuracy. All versions at: huggingface.co/unsloth/Kimi-K2-Thinking-GGUF
# !pip install huggingface_hub hf_transfer
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "0" # Can sometimes rate limit, so set to 0 to disable
from huggingface_hub import snapshot_download
snapshot_download(
repo_id = "unsloth/Kimi-K2-Thinking-GGUF",
local_dir = "unsloth/Kimi-K2-Thinking-GGUF",
allow_patterns = ["*UD-TQ1_0*"], # Use "*UD-Q2_K_XL*" for Dynamic 2bit (381GB)
)
Run any prompt.
Edit --threads -1 for the number of CPU threads (by default it is set to the maximum number of CPU threads), --ctx-size 16384 for context length, and --n-gpu-layers 99 for how many layers to offload to the GPU. Set it to 99 combined with MoE CPU offloading to get the best performance, lower it if your GPU runs out of memory, and remove it for CPU-only inference.
./llama.cpp/llama-cli \
--model unsloth/Kimi-K2-Thinking-GGUF/UD-TQ1_0/Kimi-K2-Thinking-UD-TQ1_0-00001-of-00006.gguf \
--n-gpu-layers 99 \
--temp 1.0 \
--min-p 0.01 \
--ctx-size 16384 \
--seed 3407 \
-ot ".ffn_.*_exps.=CPU"🤔No Thinking Tags?
You may notice that there are no thinking tags when you run the model. This is normal and intended behavior.
In your llama.cpp script, make sure to include the --special flag at the very end of your command. Once you do, you’ll see the <think> token appear as expected.
You might also see every answer end with <|im_end|>. This is normal as <|im_end|> is a special token that appears when printing special tokens. If you’d like to hide it, you can set <|im_end|> as a stop string in your settings.
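For example, if you deploy the model with llama-server as shown in the next section, a minimal sketch of hiding the token through the OpenAI client (assuming that server is running on port 8001) looks like this:
# Minimal sketch: pass <|im_end|> as a stop string so it is not printed.
from openai import OpenAI

openai_client = OpenAI(
    base_url = "http://127.0.0.1:8001/v1",
    api_key = "sk-no-key-required",
)
completion = openai_client.chat.completions.create(
    model = "unsloth/Kimi-K2-Thinking",
    messages = [{"role": "user", "content": "What is 1+1?"}],
    stop = ["<|im_end|>"],  # strip the trailing special token from the output
)
print(completion.choices[0].message.content)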
✨ Deploy with llama-server and OpenAI's completion library
After installing llama.cpp as per ✨ Run Kimi K2 Thinking in llama.cpp, you can use the below to launch an OpenAI compatible server:
./llama.cpp/llama-server \
--model unsloth/Kimi-K2-Thinking-GGUF/UD-TQ1_0/Kimi-K2-Thinking-UD-TQ1_0-00001-of-00006.gguf \
--alias "unsloth/Kimi-K2-Thinking" \
--threads -1 \
-fa on \
--n-gpu-layers 999 \
-ot ".ffn_.*_exps.=CPU" \
--min-p 0.01 \
--ctx-size 16384 \
--port 8001 \
--jinja
Then use OpenAI's Python library after pip install openai:
from openai import OpenAI
openai_client = OpenAI(
base_url = "http://127.0.0.1:8001/v1",
api_key = "sk-no-key-required",
)
completion = openai_client.chat.completions.create(
model = "unsloth/Kimi-K2-Thinking",
messages = [{"role": "user", "content": "What is 2+2?"},],
)
print(completion.choices[0].message.content)
🔍 Tokenizer quirks and bug fixes
7th November 2025: We notified the Kimi team and fixed the default system prompt ("You are Kimi, an AI assistant created by Moonshot AI.") not appearing before the first user prompt! See https://huggingface.co/moonshotai/Kimi-K2-Thinking/discussions/12
Huge thanks to the Moonshot Kimi team for their extremely fast response time to our queries and fixing the issue ASAP!
16th July 2025: Kimi K2 updated their tokenizer to enable multiple tool calls as per https://x.com/Kimi_Moonshot/status/1945050874067476962
18th July 2025: We fixed a system prompt - Kimi tweeted about our fix as well here: https://x.com/Kimi_Moonshot/status/1946130043446690030. The fix was described here as well: https://huggingface.co/moonshotai/Kimi-K2-Instruct/discussions/28
If you have the old checkpoints downloaded, no worries: simply re-download the first GGUF split, which is the one that changed. Or, if you do not want to download any new files, do:
wget https://huggingface.co/unsloth/Kimi-K2-Instruct/raw/main/chat_template.jinja
./llama.cpp ... --chat-template-file /dir/to/chat_template.jinja
The Kimi K2 tokenizer was interesting to play around with - it's mostly similar in action to GPT-4o's tokenizer! We first see in the tokenization_kimi.py file the following regular expression (regex) that Kimi K2 uses:
pat_str = "|".join(
[
r"""[\p{Han}]+""",
r"""[^\r\n\p{L}\p{N}]?[\p{Lu}\p{Lt}\p{Lm}\p{Lo}\p{M}&&[^\p{Han}]]*[\p{Ll}\p{Lm}\p{Lo}\p{M}&&[^\p{Han}]]+(?i:'s|'t|'re|'ve|'m|'ll|'d)?""",
r"""[^\r\n\p{L}\p{N}]?[\p{Lu}\p{Lt}\p{Lm}\p{Lo}\p{M}&&[^\p{Han}]]+[\p{Ll}\p{Lm}\p{Lo}\p{M}&&[^\p{Han}]]*(?i:'s|'t|'re|'ve|'m|'ll|'d)?""",
r"""\p{N}{1,3}""",
r""" ?[^\s\p{L}\p{N}]+[\r\n]*""",
r"""\s*[\r\n]+""",
r"""\s+(?!\S)""",
r"""\s+""",
]
)
After careful inspection, we find Kimi K2's regex is nearly identical to GPT-4o's tokenizer regex, which can be found in llama.cpp's source code:
[^\r\n\p{L}\p{N}]?[\p{Lu}\p{Lt}\p{Lm}\p{Lo}\p{M}]*[\p{Ll}\p{Lm}\p{Lo}\p{M}]+(?i:'s|'t|'re|'ve|'m|'ll|'d)?|[^\r\n\p{L}\p{N}]?[\p{Lu}\p{Lt}\p{Lm}\p{Lo}\p{M}]+[\p{Ll}\p{Lm}\p{Lo}\p{M}]*(?i:'s|'t|'re|'ve|'m|'ll|'d)?|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n/]*|\s*[\r\n]+|\s+(?!\S)|\s+
Both tokenize numbers into groups of 1 to 3 digits (9, 99, 999) and use similar patterns. The main difference looks to be the handling of "Han" (Chinese) characters, which Kimi's tokenizer handles more extensively. The PR by https://github.com/gabriellarson handles these differences well after some discussions here.
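To see the 1-to-3-digit rule in isolation, here is a small sketch using the third-party regex package (pip install regex, assumed here because it supports \p{...} classes); it only exercises the \p{N}{1,3} piece, not the full Kimi pattern:
# Minimal sketch: digit runs are split into chunks of at most three,
# as in both the Kimi K2 and GPT-4o pre-tokenizer regexes.
import regex

number_chunks = regex.compile(r"\p{N}{1,3}")
print(number_chunks.findall("9 99 999 20250101"))
# ['9', '99', '999', '202', '501', '01']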
We also find the correct EOS token should not be [EOS], but rather <|im_end|>, which we have also fixed in our model conversions.
🌝 Kimi-K2-Instruct Guide
Step-by-step guide on running the Instruct Kimi K2 models including Kimi K2 0905 - the September 5 update.
🌙 Official Recommended Settings:
According to Moonshot AI, these are the recommended settings for Kimi K2 inference:
Set the temperature to 0.6 to reduce repetition and incoherence.
Original default system prompt is:
You are a helpful assistant
(Optional) Moonshot also suggests the below for the system prompt:
You are Kimi, an AI assistant created by Moonshot AI.
We recommend setting min_p to 0.01 to suppress the occurrence of unlikely tokens with low probabilities.
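Putting these together, below is a minimal sketch of sending the recommended settings through an OpenAI-compatible llama-server endpoint (this assumes you have launched llama-server as in the deploy section above, pointed at an Instruct GGUF with the alias unsloth/Kimi-K2-Instruct, and that your llama.cpp build accepts min_p as an extra sampling field in the request body):
# Minimal sketch: recommended Kimi-K2-Instruct sampling settings over the
# OpenAI-compatible API. min_p is not a standard OpenAI parameter, so it is
# passed via extra_body and depends on the server honouring it.
from openai import OpenAI

openai_client = OpenAI(
    base_url = "http://127.0.0.1:8001/v1",
    api_key = "sk-no-key-required",
)
completion = openai_client.chat.completions.create(
    model = "unsloth/Kimi-K2-Instruct",
    messages = [
        {"role": "system", "content": "You are Kimi, an AI assistant created by Moonshot AI."},
        {"role": "user", "content": "What is 1+1?"},
    ],
    temperature = 0.6,             # recommended to reduce repetition and incoherence
    extra_body = {"min_p": 0.01},  # suppress unlikely low-probability tokens
)
print(completion.choices[0].message.content)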
🔢 Chat template and prompt format
Kimi Chat does not use a BOS (beginning of sentence) token. The system, user and assistant roles are all enclosed with <|im_middle|>, which is interesting, and each gets its own respective token: <|im_system|>, <|im_user|>, <|im_assistant|>.
<|im_system|>system<|im_middle|>You are a helpful assistant<|im_end|><|im_user|>user<|im_middle|>What is 1+1?<|im_end|><|im_assistant|>assistant<|im_middle|>2<|im_end|>
Separating the conversational boundaries onto their own lines (the newlines must be removed in actual use), we get:
<|im_system|>system<|im_middle|>You are a helpful assistant<|im_end|>
<|im_user|>user<|im_middle|>What is 1+1?<|im_end|>
<|im_assistant|>assistant<|im_middle|>2<|im_end|>
💾 Model uploads
ALL our uploads, including those that are not imatrix-based or dynamic, utilize our calibration dataset, which is specifically optimized for conversational, coding, and reasoning tasks.
We've also uploaded versions in BF16 format.
✨ Run Instruct in llama.cpp
Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.
apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
-DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-quantize llama-cli llama-gguf-split llama-mtmd-cli
cp llama.cpp/build/bin/llama-* llama.cpp
If you want to use llama.cpp directly to load models, you can do the below: (:TQ1_0) is the quantization type. You can also download via Hugging Face (point 3). This is similar to ollama run. Use export LLAMA_CACHE="folder" to force llama.cpp to save to a specific location. To run the new September 2025 update of the model, change the model name from 'Kimi-K2-Instruct' to 'Kimi-K2-Instruct-0905'.
export LLAMA_CACHE="unsloth/Kimi-K2-Instruct-GGUF"
./llama.cpp/llama-cli \
-hf unsloth/Kimi-K2-Instruct-GGUF:TQ1_0 \
--threads -1 \
--n-gpu-layers 99 \
--temp 0.6 \
--min-p 0.01 \
--ctx-size 16384 \
--seed 3407 \
-ot ".ffn_.*_exps.=CPU"Download the model via (after installing
pip install huggingface_hub hf_transfer). You can chooseUD-TQ1_0(dynamic 1.8bit quant) or other quantized versions likeQ2_K_XL. We recommend using our 2bit dynamic quantUD-Q2_K_XLto balance size and accuracy. More versions at: huggingface.co/unsloth/Kimi-K2-Instruct-GGUF
# !pip install huggingface_hub hf_transfer
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "0" # Can sometimes rate limit, so set to 0 to disable
from huggingface_hub import snapshot_download
snapshot_download(
repo_id = "unsloth/Kimi-K2-Instruct-GGUF",
local_dir = "unsloth/Kimi-K2-Instruct-GGUF",
allow_patterns = ["*UD-TQ1_0*"], # Dynamic 1bit (281GB) Use "*UD-Q2_K_XL*" for Dynamic 2bit (381GB)
)
Run any prompt.
Edit --threads -1 for the number of CPU threads (by default it is set to the maximum number of CPU threads), --ctx-size 16384 for context length, and --n-gpu-layers 99 for how many layers to offload to the GPU. Set it to 99 combined with MoE CPU offloading to get the best performance, lower it if your GPU runs out of memory, and remove it for CPU-only inference.
./llama.cpp/llama-cli \
--model unsloth/Kimi-K2-Instruct-GGUF/UD-TQ1_0/Kimi-K2-Instruct-UD-TQ1_0-00001-of-00005.gguf \
--threads -1 \
--n-gpu-layers 99 \
--temp 0.6 \
--min-p 0.01 \
--ctx-size 16384 \
--seed 3407 \
-ot ".ffn_.*_exps.=CPU"🐦 Flappy Bird + other tests
We introduced the Flappy Bird test when we released our 1.58-bit quants for DeepSeek R1. We found Kimi K2 to be one of the only models to one-shot all our tasks, including this one, the Heptagon test and others, even at 2-bit. The goal is to ask the LLM to create a Flappy Bird game while following some specific instructions:
Create a Flappy Bird game in Python. You must include these things:
1. You must use pygame.
2. The background color should be randomly chosen and is a light shade. Start with a light blue color.
3. Pressing SPACE multiple times will accelerate the bird.
4. The bird's shape should be randomly chosen as a square, circle or triangle. The color should be randomly chosen as a dark color.
5. Place on the bottom some land colored as dark brown or yellow chosen randomly.
6. Make a score shown on the top right side. Increment if you pass pipes and don't hit them.
7. Make randomly spaced pipes with enough space. Color them randomly as dark green or light brown or a dark gray shade.
8. When you lose, show the best score. Make the text inside the screen. Pressing q or Esc will quit the game. Restarting is pressing SPACE again.
The final game should be inside a markdown section in Python. Check your code for errors and fix them before the final markdown section.
You can also test the dynamic quants via the Heptagon Test as per r/LocalLLaMA, which tests the model on creating a basic physics engine to simulate balls rotating inside a moving, enclosed heptagon shape.

The goal is to make the heptagon spin, and the balls in the heptagon should move. The prompt is below:
Write a Python program that shows 20 balls bouncing inside a spinning heptagon:
- All balls have the same radius.
- All balls have a number on it from 1 to 20.
- All balls drop from the heptagon center when starting.
- Colors are: #f8b862, #f6ad49, #f39800, #f08300, #ec6d51, #ee7948, #ed6d3d, #ec6800, #ec6800, #ee7800, #eb6238, #ea5506, #ea5506, #eb6101, #e49e61, #e45e32, #e17b34, #dd7a56, #db8449, #d66a35
- The balls should be affected by gravity and friction, and they must bounce off the rotating walls realistically. There should also be collisions between balls.
- The material of all the balls determines that their impact bounce height will not exceed the radius of the heptagon, but higher than ball radius.
- All balls rotate with friction, the numbers on the ball can be used to indicate the spin of the ball.
- The heptagon is spinning around its center, and the speed of spinning is 360 degrees per 5 seconds.
- The heptagon size should be large enough to contain all the balls.
- Do not use the pygame library; implement collision detection algorithms and collision response etc. by yourself. The following Python libraries are allowed: tkinter, math, numpy, dataclasses, typing, sys.
- All codes should be put in a single Python file.