Cogito v2: How to Run Locally
Cogito v2 LLMs are among the strongest open models in the world trained with IDA. They come in 4 sizes: 70B, 109B, 405B and 671B, allowing you to select the size that best matches your hardware.
Cogito v2 Preview is Deep Cogito's latest release of models, spanning 4 model sizes from 70B to 671B. Using IDA (Iterated Distillation & Amplification), these models are trained to internalize the reasoning process through iterative policy improvement, rather than simply searching longer at inference time (as DeepSeek R1 does).
Deep Cogito is based in San Francisco, USA (like Unsloth 🇺🇸) and we're excited to provide dynamic quantized models for all 4 model sizes! All uploads use Unsloth Dynamic 2.0 for SOTA 5-shot MMLU and KL Divergence performance, meaning you can run & fine-tune these quantized LLMs with minimal accuracy loss!
Choose which model size fits your hardware! We upload 1.58bit to 16bit variants for all 4 model sizes!
💎 Model Sizes and Uploads
There are 4 model sizes:
2 dense models based on Llama - 70B and 405B
2 MoE models based on Llama 4 Scout (109B) and DeepSeek R1 (671B)
Though not strictly necessary, for the best performance have your combined VRAM + RAM be at least as large as the quant you're downloading. If you have less VRAM + RAM, the quant will still run, just much more slowly.
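To see what you have available, the two commands below are a minimal sketch for Linux with an NVIDIA GPU (the exact tools on your system may differ): they print total system RAM and total VRAM per GPU.
# Total system RAM (Linux)
free -h | awk '/^Mem:/ {print "RAM:", $2}'
# Total VRAM per NVIDIA GPU (requires nvidia-smi)
nvidia-smi --query-gpu=memory.total --format=csv,noheader
For example, 24GB of VRAM plus 256GB of RAM (~280GB combined) comfortably covers the ~251GB Dynamic 2-bit (UD-Q2_K_XL) quant of the 671B model.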
🐳 Run Cogito 671B MoE in llama.cpp
Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.
apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggerganov/llama.cpp
cmake llama.cpp -B llama.cpp/build \
-DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-quantize llama-cli llama-gguf-split llama-mtmd-cli
cp llama.cpp/build/bin/llama-* llama.cpp
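Once the build finishes, an optional sanity check is to confirm the copied binary runs and prints its build info:
./llama.cpp/llama-cli --version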
If you want to use llama.cpp directly to load models, you can do the below: (:IQ1_S) is the quantization type. You can also download via Hugging Face (point 3). This is similar to ollama run. Use export LLAMA_CACHE="folder" to force llama.cpp to save to a specific location.
Please try out -ot ".ffn_.*_exps.=CPU" to offload all MoE layers to the CPU! This effectively allows you to fit all non-MoE layers on 1 GPU, improving generation speeds. You can customize the regex expression to fit more layers if you have more GPU capacity.
If you have a bit more GPU memory, try -ot ".ffn_(up|down)_exps.=CPU" - this offloads the up and down projection MoE layers.
Try -ot ".ffn_(up)_exps.=CPU" if you have even more GPU memory. This offloads only the up projection MoE layers.
And finally offload all layers via -ot ".ffn_.*_exps.=CPU" - this uses the least VRAM.
You can also customize the regex, for example -ot "\.(6|7|8|9|[0-9][0-9]|[0-9][0-9][0-9])\.ffn_(gate|up|down)_exps.=CPU" means to offload the gate, up and down MoE layers, but only from the 6th layer onwards.
export LLAMA_CACHE="unsloth/cogito-v2-preview-deepseek-671B-MoE-GGUF"
./llama.cpp/llama-cli \
-hf unsloth/cogito-v2-preview-deepseek-671B-MoE-GGUF:Q2_K_XL \
--cache-type-k q4_0 \
--threads -1 \
--n-gpu-layers 99 \
--temp 0.6 \
--top-p 0.95 \
--min-p 0.01 \
--ctx-size 16384 \
--seed 3407 \
-ot ".ffn_.*_exps.=CPU"
Download the model via the commands below (after installing pip install huggingface_hub hf_transfer). You can choose UD-IQ1_S (dynamic 1.78bit quant) or other quantized versions like Q4_K_M. We recommend using our 2.7bit dynamic quant UD-Q2_K_XL to balance size and accuracy. More versions at: https://huggingface.co/unsloth/cogito-v2-preview-deepseek-671B-MoE-GGUF
# !pip install huggingface_hub hf_transfer
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "0" # Can sometimes rate limit, so set to 0 to disable
from huggingface_hub import snapshot_download
snapshot_download(
repo_id = "unsloth/cogito-v2-preview-deepseek-671B-MoE-GGUF",
local_dir = "unsloth/cogito-v2-preview-deepseek-671B-MoE-GGUF",
allow_patterns = ["*UD-IQ1_S*"], # Dynamic 1bit (168GB) Use "*UD-Q2_K_XL*" for Dynamic 2bit (251GB)
)
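After the download completes, you can point llama-cli at the local files with --model instead of -hf. The filename below is only illustrative - pass the first split file (the one ending in -00001-of-...gguf) that actually appears inside your local_dir:
# NOTE: illustrative path - substitute the real first split file from your download folder
./llama.cpp/llama-cli \
--model unsloth/cogito-v2-preview-deepseek-671B-MoE-GGUF/UD-IQ1_S/cogito-v2-preview-deepseek-671B-MoE-UD-IQ1_S-00001-of-00004.gguf \
--cache-type-k q4_0 \
--n-gpu-layers 99 \
--temp 0.6 \
--top-p 0.95 \
--min-p 0.01 \
--ctx-size 16384 \
-ot ".ffn_.*_exps.=CPU"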
Edit --threads 32 for the number of CPU threads, --ctx-size 16384 for context length, and --n-gpu-layers 2 for how many layers to offload to the GPU. Try adjusting it if your GPU goes out of memory. Also remove it if you have CPU-only inference.
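For reference, a CPU-only invocation (a sketch assuming you built with -DGGML_CUDA=OFF) just drops the GPU-related flags and sets an explicit thread count:
./llama.cpp/llama-cli \
-hf unsloth/cogito-v2-preview-deepseek-671B-MoE-GGUF:Q2_K_XL \
--threads 32 \
--temp 0.6 \
--top-p 0.95 \
--min-p 0.01 \
--ctx-size 16384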
🖱️ Run Cogito 109B MoE in llama.cpp
Follow the same instructions as running the 671B model above.
Then run the below:
export LLAMA_CACHE="unsloth/cogito-v2-preview-llama-109B-MoE-GGUF"
./llama.cpp/llama-cli \
-hf unsloth/cogito-v2-preview-llama-109B-MoE-GGUF:Q3_K_XL \
--cache-type-k q4_0 \
--n-gpu-layers 99 \
--temp 0.6 \
--min-p 0.01 \
--top-p 0.9 \
--ctx-size 16384 \
-ot ".ffn_.*_exps.=CPU"
🌳 Run Cogito 405B Dense in llama.cpp
Follow the same instructions as running the 671B model above.
Then run the below:
export LLAMA_CACHE="unsloth/cogito-v2-preview-llama-405B-GGUF"
./llama.cpp/llama-cli \
-hf unsloth/cogito-v2-preview-llama-405B-GGUF:Q2_K_XL \
--cache-type-k q4_0 \
--n-gpu-layers 99 \
--temp 0.6 \
--min-p 0.01 \
--top-p 0.9 \
--ctx-size 16384
😎 Run Cogito 70B Dense in llama.cpp
Follow the same instructions as running the 671B model above.
Then run the below:
export LLAMA_CACHE="unsloth/cogito-v2-preview-llama-70B-GGUF"
./llama.cpp/llama-cli \
-hf unsloth/cogito-v2-preview-llama-70B-GGUF:Q4_K_XL \
--cache-type-k q4_0 \
--n-gpu-layers 99 \
--temp 0.6 \
--min-p 0.01 \
--top-p 0.9 \
--ctx-size 16384