📙Devstral 2: How to Run Guide
A guide to running Mistral's Devstral 2 models locally: 123B-Instruct-2512 and Small-2-24B-Instruct-2512.
Devstral 2 is Mistral's new family of coding and agentic LLMs for software engineering, available in 24B and 123B sizes. The 123B model achieves SOTA results on SWE-bench, coding, tool-calling and agentic use-cases. The 24B model fits in 25GB of RAM/VRAM and the 123B fits in 128GB.
We’ve resolved issues in Devstral’s chat template, and results should be significantly better. The 24B & 123B have been updated.
Devstral 2 supports vision capabilities, a 256k context window and uses the same architecture as Ministral 3. You can now run and fine-tune both models locally with Unsloth.
All Devstral 2 uploads use our Unsloth Dynamic 2.0 methodology, delivering the best performance on Aider Polyglot and 5-shot MMLU benchmarks.
Devstral 2 - Unsloth Dynamic GGUFs:
🖥️ Running Devstral 2
See our step-by-step guides for running Devstral 24B and the large Devstral 123B models. Both models include vision support with a separate mmproj file.
⚙️ Usage Guide
Here are the recommended settings for inference (an example llama.cpp invocation using these flags follows the list):
Temperature ~0.15
Min_P of 0.01 (optional, but 0.01 works well; the llama.cpp default is 0.1)
Use --jinja to enable the system prompt.
Max context length = 262,144
Recommended minimum context: 16,384
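As a rough sketch, here is how those settings map onto llama.cpp flags (the model path is a placeholder; substitute whichever GGUF you downloaded):

```bash
# Recommended sampling and context settings expressed as llama.cpp flags
./llama.cpp/llama-cli \
    --model path/to/your-devstral-gguf-file.gguf \
    --jinja \
    --temp 0.15 \
    --min-p 0.01 \
    --ctx-size 16384
```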
Devstral-Small-2-24B
The full precision (Q8) Devstral-Small-2-24B GGUF will fit in 25GB RAM/VRAM.
✨ Run Devstral-Small-2-24B-Instruct-2512 in llama.cpp
Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.
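A typical build might look like the sketch below (assumes git, cmake and, for the GPU build, the CUDA toolkit are installed):

```bash
# Clone llama.cpp and build the CLI tools
# Use -DGGML_CUDA=OFF instead if you only want CPU inference
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first \
    --target llama-cli llama-server llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp/
```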
If you want to use llama.cpp directly to load models, you can do the below (:Q4_K_XL is the quantization type). You can also directly pull from Hugging Face:
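For instance, something like this pulls and runs the model in one step (the repo name unsloth/Devstral-Small-2-24B-Instruct-2512-GGUF is an assumption based on Unsloth's usual naming; swap the :Q4_K_XL tag for another quant if you prefer):

```bash
# Download (if needed) and run straight from Hugging Face
./llama.cpp/llama-cli \
    -hf unsloth/Devstral-Small-2-24B-Instruct-2512-GGUF:Q4_K_XL \
    --jinja --temp 0.15 --min-p 0.01 --ctx-size 16384
```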
Download the model (after installing the tooling via pip install huggingface_hub hf_transfer). You can choose UD_Q4_K_XL or other quantized versions.
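One way to do this from the shell (the repo name and the UD-Q4_K_XL include pattern are assumptions following Unsloth's usual file naming; adjust them to match the files in the repo):

```bash
# Install the Hugging Face tooling, then fetch only the quant you want
pip install huggingface_hub hf_transfer

HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download \
    unsloth/Devstral-Small-2-24B-Instruct-2512-GGUF \
    --include "*UD-Q4_K_XL*" \
    --local-dir Devstral-Small-2-24B-Instruct-2512-GGUF
```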
Run the model. Otherwise for conversation mode:
Remember to remove <bos> since Devstral auto adds a <bos>! Also please use --jinja to enable the system prompt!
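A sketch of both invocations, assuming the UD-Q4_K_XL quant downloaded above (the exact .gguf path is illustrative and will depend on which files you fetched):

```bash
# One-shot prompt (disable conversation mode with -no-cnv)
./llama.cpp/llama-cli \
    --model Devstral-Small-2-24B-Instruct-2512-GGUF/UD-Q4_K_XL/Devstral-Small-2-24B-Instruct-2512-UD-Q4_K_XL.gguf \
    --jinja --temp 0.15 --min-p 0.01 --ctx-size 16384 --n-gpu-layers 99 \
    -no-cnv --prompt "Write a Python function that deduplicates a list."

# Conversation (chat) mode - drop -no-cnv and the prompt flag
./llama.cpp/llama-cli \
    --model Devstral-Small-2-24B-Instruct-2512-GGUF/UD-Q4_K_XL/Devstral-Small-2-24B-Instruct-2512-UD-Q4_K_XL.gguf \
    --jinja --temp 0.15 --min-p 0.01 --ctx-size 16384 --n-gpu-layers 99
```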
Devstral-2-123B
The full precision (Q8) Devstral-2-123B GGUF will fit in 128GB RAM/VRAM.
✨ Run Devstral-2-123B-Instruct-2512 Tutorial
Obtain the latest llama.cpp on GitHub here. You can follow the same build instructions as in the 24B section above. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.
You can directly pull from Hugging Face via:
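For example (the repo name unsloth/Devstral-2-123B-Instruct-2512-GGUF and the quant tag are assumptions following the same naming as the 24B model):

```bash
# Download (if needed) and run the 123B model straight from Hugging Face
./llama.cpp/llama-cli \
    -hf unsloth/Devstral-2-123B-Instruct-2512-GGUF:UD-Q4_K_XL \
    --jinja --temp 0.15 --min-p 0.01 --ctx-size 16384
```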
Download the model (after installing the tooling via pip install huggingface_hub hf_transfer). You can choose UD_Q4_K_XL or other quantized versions.
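A sketch of the download step (repo name and include pattern are assumptions; at this size the GGUF is split into several shards):

```bash
pip install huggingface_hub hf_transfer

# Fetch only the UD-Q4_K_XL shards of the 123B model
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download \
    unsloth/Devstral-2-123B-Instruct-2512-GGUF \
    --include "*UD-Q4_K_XL*" \
    --local-dir Devstral-2-123B-Instruct-2512-GGUF
```

When running, point llama-cli at the first shard of the split GGUF; llama.cpp loads the remaining shards automatically.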
Remember to remove <bos> since Devstral auto adds a <bos>! Also please use --jinja to enable the system prompt!
🦥 Fine-tuning Devstral 2 with Unsloth
Just like Ministral 3, Unsloth supports Devstral 2 fine-tuning. Training is 2x faster, uses 70% less VRAM, and supports 8x longer context lengths. Devstral 2 fits comfortably on a 24GB VRAM L4 GPU.
Unfortunately, Devstral 2 slightly exceeds the memory limits of a 16GB VRAM GPU, so fine-tuning it for free on Google Colab isn't possible for now. However, you can fine-tune the model for free using our Kaggle notebook, which offers access to dual GPUs. Just change the notebook's Magistral model name to unsloth/Devstral-Small-2-24B-Instruct-2512.