📙Devstral 2: How to Run Guide

Guide for running Mistral's Devstral 2 models locally: Devstral-2-123B-Instruct-2512 and Devstral-Small-2-24B-Instruct-2512.

Devstral 2 is Mistral's new family of coding and agentic LLMs for software engineering, available in 24B and 123B sizes. The 123B model achieves SOTA results on SWE-bench, coding, tool-calling and agentic use-cases. The 24B model fits in 25GB of RAM/VRAM and the 123B fits in 128GB.

Devstral 2 supports vision, has a 256K context window, and uses the same architecture as Ministral 3. You can now run and fine-tune both models locally with Unsloth.

All Devstral 2 uploads use our Unsloth Dynamic 2.0 methodology, delivering the best performance on Aider Polyglot and 5-shot MMLU benchmarks.

Devstral 2 - Unsloth Dynamic GGUFs:

  • Devstral-Small-2-24B-Instruct-2512

  • Devstral-2-123B-Instruct-2512

🖥️ Running Devstral 2

See our step-by-step guides below for running Devstral 24B and the larger Devstral 123B model. Both models include vision support via a separate mmproj file.

⚙️ Usage Guide

Here are the recommended settings for inference:

  • Temperature ~0.15

  • Min_P of 0.01 (optional, but 0.01 works well; llama.cpp's default is 0.1)

  • Use --jinja to enable the model's chat template (and its system prompt).

  • Max context length = 262,144

  • Recommended minimum context: 16,384

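As a rough sketch, these settings map onto llama.cpp flags as shown below (llama-server is used here; the GGUF and mmproj file names are placeholders):

```bash
# Serve the model with the recommended sampling settings.
# --mmproj is optional and enables vision input; -ngl 99 offloads all layers to the GPU.
./llama.cpp/llama-server \
    -m Devstral-Small-2-24B-Instruct-2512-UD-Q4_K_XL.gguf \
    --mmproj mmproj-F16.gguf \
    --jinja \
    --temp 0.15 \
    --min-p 0.01 \
    --ctx-size 16384 \
    -ngl 99
```
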
Devstral-Small-2-24B

The full precision (Q8) Devstral-Small-2-24B GGUF will fit in 25GB RAM/VRAM.

✨ Run Devstral-Small-2-24B-Instruct-2512 in llama.cpp

  1. Obtain the latest llama.cpp from GitHub (https://github.com/ggml-org/llama.cpp). You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or only want CPU inference.

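For reference, a typical CUDA build looks like the following (a sketch assuming a Debian/Ubuntu system; adjust the packages, flags and build targets to your setup):

```bash
# Install build dependencies, then clone and build llama.cpp with CUDA support.
# Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF for a CPU-only build.
sudo apt-get update && sudo apt-get install -y build-essential cmake curl libcurl4-openssl-dev
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-server llama-mtmd-cli
cp llama.cpp/build/bin/llama-* llama.cpp/
```
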
  2. If you want llama.cpp to load and run the model directly, you can pull it straight from Hugging Face as shown below; the suffix (:Q4_K_XL) is the quantization type:

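For example (a sketch; the repo name unsloth/Devstral-Small-2-24B-Instruct-2512-GGUF and the UD-Q4_K_XL tag are assumptions, so check the Hugging Face page for the quants actually available):

```bash
# Pull the GGUF directly from Hugging Face and start chatting.
# The text after ":" selects the quantization (e.g. Q4_K_XL, UD-Q4_K_XL, Q8_0).
./llama.cpp/llama-cli \
    -hf unsloth/Devstral-Small-2-24B-Instruct-2512-GGUF:UD-Q4_K_XL \
    --jinja \
    -ngl 99
```
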
  3. Or download the model manually (after installing the dependencies via pip install huggingface_hub hf_transfer), as shown below. You can choose UD-Q4_K_XL or other quantized versions.

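A possible download command using the huggingface_hub CLI (a sketch; the repo name and file pattern are assumptions):

```bash
# Install the download tooling, then fetch only the UD-Q4_K_XL files.
pip install huggingface_hub hf_transfer
export HF_HUB_ENABLE_HF_TRANSFER=1
huggingface-cli download unsloth/Devstral-Small-2-24B-Instruct-2512-GGUF \
    --include "*UD-Q4_K_XL*" \
    --local-dir Devstral-Small-2-24B-Instruct-2512-GGUF
```
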
  4. Run the model with the recommended settings from the usage guide above. Example one-shot and conversation-mode invocations are shown below:

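Both invocations below are sketches; the GGUF path is a placeholder and should match wherever you downloaded the model:

```bash
# One-shot prompt (non-interactive):
./llama.cpp/llama-cli \
    -m Devstral-Small-2-24B-Instruct-2512-GGUF/Devstral-Small-2-24B-Instruct-2512-UD-Q4_K_XL.gguf \
    --jinja --temp 0.15 --min-p 0.01 --ctx-size 16384 -ngl 99 \
    -no-cnv -p "Write a Python function that checks whether a string is a palindrome."

# Interactive conversation mode (drop -no-cnv and -p):
./llama.cpp/llama-cli \
    -m Devstral-Small-2-24B-Instruct-2512-GGUF/Devstral-Small-2-24B-Instruct-2512-UD-Q4_K_XL.gguf \
    --jinja --temp 0.15 --min-p 0.01 --ctx-size 16384 -ngl 99
```
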
Devstral-2-123B

The full precision (Q8) Devstral-2-123B GGUF will fit in 128GB RAM/VRAM.

Run Devstral-2-123B-Instruct-2512 Tutorial

  1. Obtain the latest llama.cpp from GitHub (https://github.com/ggml-org/llama.cpp) and build it following the same instructions as in the Devstral-Small-2-24B section above. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or only want CPU inference.

  2. You can pull and run the model directly from Hugging Face as shown below:

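For example (a sketch; the repo name unsloth/Devstral-2-123B-Instruct-2512-GGUF and the quant tag are assumptions):

```bash
# Pull the 123B GGUF directly from Hugging Face and chat with the recommended settings.
./llama.cpp/llama-cli \
    -hf unsloth/Devstral-2-123B-Instruct-2512-GGUF:UD-Q4_K_XL \
    --jinja \
    --temp 0.15 \
    --min-p 0.01 \
    --ctx-size 16384 \
    -ngl 99
```
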
  3. Or download the model manually (after installing the dependencies via pip install huggingface_hub hf_transfer), as shown below. You can choose UD-Q4_K_XL or other quantized versions.

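A possible download command (a sketch; the repo name and file pattern are assumptions):

```bash
# Install the download tooling, then fetch only the UD-Q4_K_XL files of the 123B model.
pip install huggingface_hub hf_transfer
export HF_HUB_ENABLE_HF_TRANSFER=1
huggingface-cli download unsloth/Devstral-2-123B-Instruct-2512-GGUF \
    --include "*UD-Q4_K_XL*" \
    --local-dir Devstral-2-123B-Instruct-2512-GGUF
```
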
🦥 Fine-tuning Devstral 2 with Unsloth

Just like Ministral 3, Unsloth supports Devstral 2 fine-tuning. Training is 2x faster, uses 70% less VRAM and supports 8x longer context lengths. Devstral-Small-2-24B fits comfortably on a 24GB VRAM L4 GPU.

Unfortunately, Devstral 2 slightly exceeds the memory limits of a 16GB VRAM GPU, so fine-tuning it for free on Google Colab isn't possible for now. However, you can fine-tune the model for free using our Kaggle notebook, which offers access to dual GPUs. Just change the Magistral model name in the notebook to unsloth/Devstral-Small-2-24B-Instruct-2512.
