Cogito v2.1: How to Run Locally

Cogito v2.1 is one of the strongest open models in the world trained with IDA. Together with the earlier Cogito v2 Preview release, the family comes in 4 sizes (70B, 109B, 405B and 671B), allowing you to select which size best matches your hardware.

Cogito v2.1 comes in a single 671B MoE size, whilst Cogito v2 Preview spans 4 model sizes ranging from 70B to 671B. By using IDA (Iterated Distillation & Amplification), these models are trained to internalize the reasoning process through iterative policy improvement, rather than simply searching longer at inference time (like DeepSeek R1).

Deep Cogito is based in San Francisco, USA (like Unsloth 🇺🇸) and we're excited to provide dynamic quantized models for all 4 model sizes! All uploads use Unsloth Dynamic 2.0 for SOTA 5-shot MMLU and KL Divergence performance, meaning you can run & fine-tune these quantized LLMs with minimal accuracy loss!

Tutorials navigation:

Run 671B MoE | Run 109B MoE | Run 405B Dense | Run 70B Dense

💎 Model Sizes and Uploads

There are 4 model sizes:

  1. 2 Dense models based on Llama 3 - 70B and 405B

  2. 2 MoE models based on Llama 4 Scout (109B) and DeepSeek R1 (671B)

| Model Size | Recommended Quant & Link | Disk Size | Architecture |
| --- | --- | --- | --- |
| 70B Dense | - | 44GB | Llama 3 70B |
| 109B MoE | - | 50GB | Llama 4 Scout |
| 405B Dense | - | 152GB | Llama 3 405B |
| 671B MoE | UD-Q2_K_XL: https://huggingface.co/unsloth/cogito-671b-v2.1-GGUF | 251GB | DeepSeek R1 |

🐳 Run Cogito 671B MoE in llama.cpp

  1. Obtain the latest llama.cpp from GitHub at https://github.com/ggml-org/llama.cpp. You can follow the build instructions below. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or only want CPU inference.
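
A build sketch for a Linux machine, assuming an NVIDIA GPU with CUDA installed (package names are for Debian/Ubuntu; adjust for your distro):

```bash
# Install build dependencies
apt-get update
apt-get install build-essential cmake curl libcurl4-openssl-dev -y

# Clone and build llama.cpp with CUDA support
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first \
    --target llama-cli llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp/
```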

  2. If you want to use llama.cpp directly to load models, you can do the below, where :IQ1_S is the quantization type. You can also download the model via Hugging Face first (step 3). This is similar to ollama run. Use export LLAMA_CACHE="folder" to force llama.cpp to save downloads to a specific location.
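
For example, a minimal invocation (a sketch; the -hf flag streams the GGUF from Hugging Face, and the :IQ1_S suffix selects the quant):

```bash
# Optional: force llama.cpp to cache downloads in a specific folder
export LLAMA_CACHE="unsloth-cogito"

# Pull and run the 1.78-bit dynamic quant straight from Hugging Face
./llama.cpp/llama-cli \
    -hf unsloth/cogito-671b-v2.1-GGUF:IQ1_S \
    --ctx-size 16384
```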

  3. Download the model (after installing pip install huggingface_hub hf_transfer). You can choose UD-IQ1_S (dynamic 1.78-bit quant) or other quantized versions like Q4_K_M. We recommend using our 2.7-bit dynamic quant UD-Q2_K_XL to balance size and accuracy. More versions at: https://huggingface.co/unsloth/cogito-671b-v2.1-GGUF
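
One way to fetch just the recommended quant is the huggingface-cli tool that ships with huggingface_hub (adjust the --include pattern for the quant you want):

```bash
pip install huggingface_hub hf_transfer

# Use the faster Rust-based downloader
export HF_HUB_ENABLE_HF_TRANSFER=1

# Download only the UD-Q2_K_XL (2.7-bit dynamic) shards
huggingface-cli download unsloth/cogito-671b-v2.1-GGUF \
    --include "*UD-Q2_K_XL*" \
    --local-dir unsloth/cogito-671b-v2.1-GGUF
```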

  4. Run the model. Edit --threads 32 for the number of CPU threads, --ctx-size 16384 for context length, and --n-gpu-layers 2 for the number of layers to offload to the GPU. Lower it if your GPU goes out of memory, and remove it entirely for CPU-only inference.
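
A launch sketch. The shard filename below is illustrative; point --model at the first .gguf shard you actually downloaded and llama.cpp will load the remaining shards in the same folder automatically:

```bash
./llama.cpp/llama-cli \
    --model unsloth/cogito-671b-v2.1-GGUF/UD-Q2_K_XL/cogito-671b-v2.1-UD-Q2_K_XL-00001-of-00006.gguf \
    --threads 32 \
    --ctx-size 16384 \
    --n-gpu-layers 2
# Tip: on recent llama.cpp builds, adding -ot ".ffn_.*_exps.=CPU" keeps the
# MoE expert tensors on CPU so the shared layers fit in limited VRAM.
```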

🖱️Run Cogito 109B MoE in llama.cpp

  1. Follow the same instructions as for running the 671B model above.

  2. Then run the below:
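
A sketch, assuming the GGUF lives in an Unsloth repo named unsloth/cogito-v2-preview-llama-109B-MoE-GGUF (the repo name and Q4_K_M quant are assumptions; check https://huggingface.co/unsloth for the exact uploads):

```bash
# Repo name is an assumption; verify it on https://huggingface.co/unsloth
./llama.cpp/llama-cli \
    -hf unsloth/cogito-v2-preview-llama-109B-MoE-GGUF:Q4_K_M \
    --threads 32 \
    --ctx-size 16384 \
    --n-gpu-layers 99
```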

🌳Run Cogito 405B Dense in llama.cpp

  1. Follow the same instructions as for running the 671B model above.

  2. Then run the below:
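
A sketch under the same assumptions (repo name and quant are illustrative; a dense 405B model at ~2.7-bit matches the 152GB figure above, so fewer layers typically fit on one GPU):

```bash
# Repo name is an assumption; verify it on https://huggingface.co/unsloth
./llama.cpp/llama-cli \
    -hf unsloth/cogito-v2-preview-llama-405B-GGUF:UD-Q2_K_XL \
    --threads 32 \
    --ctx-size 16384 \
    --n-gpu-layers 20
```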

😎 Run Cogito 70B Dense in llama.cpp

  1. Follow the same instructions as for running the 671B model above.

  2. Then run the below:
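
A sketch under the same assumptions (repo name and quant are illustrative; the 70B dense model is small enough to offload fully on a 48GB+ GPU):

```bash
# Repo name is an assumption; verify it on https://huggingface.co/unsloth
./llama.cpp/llama-cli \
    -hf unsloth/cogito-v2-preview-llama-70B-GGUF:Q4_K_M \
    --threads 32 \
    --ctx-size 16384 \
    --n-gpu-layers 99
```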

See https://www.deepcogito.com/research/cogito-v2-1 for more details.
