📙Devstral 2: How to Run Guide
A guide to running Mistral's Devstral 2 models locally: 123B-Instruct-2512 and Small-2-24B-Instruct-2512.
Devstral 2 is Mistral's new family of coding and agentic LLMs for software engineering, available in 24B and 123B sizes. The 123B model achieves SOTA results on SWE-bench, coding, tool-calling and agentic use-cases. The 24B model fits in 25GB of RAM/VRAM and the 123B fits in 128GB.
We’ve resolved issues in Devstral’s chat template, and results should be significantly better. The 24B & 123B have been updated.
Devstral 2 supports vision capabilities, a 256k context window and uses the same architecture as Ministral 3. You can now run and fine-tune both models locally with Unsloth.
All Devstral 2 uploads use our Unsloth Dynamic 2.0 methodology, delivering the best performance on Aider Polyglot and 5-shot MMLU benchmarks.
Devstral 2 - Unsloth Dynamic GGUFs:
🖥️ Running Devstral 2
See our step-by-step guides for running Devstral 24B and the large Devstral 123B models. Both models include vision support with a separate mmproj file.
⚙️ Usage Guide
Here are the recommended settings for inference (an example llama.cpp invocation using these flags follows the list):
Temperature ~0.15
Min_P of 0.01 (optional, but 0.01 works well; the llama.cpp default is 0.1)
Use --jinja to enable the system prompt.
Max context length = 262,144
Recommended minimum context: 16,384
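As a rough sketch, here is how those settings map onto llama.cpp flags (the model path is a placeholder; substitute whichever GGUF you downloaded):

```bash
# Recommended sampling and context settings expressed as llama.cpp flags
./llama.cpp/llama-cli \
    --model path/to/your-devstral-gguf-file.gguf \
    --jinja \
    --temp 0.15 \
    --min-p 0.01 \
    --ctx-size 16384
```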
Devstral-Small-2-24B
The full precision (Q8) Devstral-Small-2-24B GGUF will fit in 25GB RAM/VRAM.
✨ Run Devstral-Small-2-24B-Instruct-2512 in llama.cpp
Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.
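A typical build might look like the sketch below (assumes git, cmake and, for the GPU build, the CUDA toolkit are installed):

```bash
# Clone llama.cpp and build the CLI tools
# Use -DGGML_CUDA=OFF instead if you only want CPU inference
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first \
    --target llama-cli llama-server llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp/
```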
If you want to use llama.cpp directly to load models, you can do the below (:Q4_K_XL is the quantization type). You can also directly pull from Hugging Face:
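For instance, something like this pulls and runs the model in one step (the repo name unsloth/Devstral-Small-2-24B-Instruct-2512-GGUF is an assumption based on Unsloth's usual naming; swap the :Q4_K_XL tag for another quant if you prefer):

```bash
# Download (if needed) and run straight from Hugging Face
./llama.cpp/llama-cli \
    -hf unsloth/Devstral-Small-2-24B-Instruct-2512-GGUF:Q4_K_XL \
    --jinja --temp 0.15 --min-p 0.01 --ctx-size 16384
```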
Download the model (after installing the tooling via pip install huggingface_hub hf_transfer). You can choose UD_Q4_K_XL or other quantized versions.
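One way to do this from the shell (the repo name and the UD-Q4_K_XL include pattern are assumptions following Unsloth's usual file naming; adjust them to match the files in the repo):

```bash
# Install the Hugging Face tooling, then fetch only the quant you want
pip install huggingface_hub hf_transfer

HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download \
    unsloth/Devstral-Small-2-24B-Instruct-2512-GGUF \
    --include "*UD-Q4_K_XL*" \
    --local-dir Devstral-Small-2-24B-Instruct-2512-GGUF
```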
Run the model. Otherwise for conversation mode:
Remember to remove <bos> since Devstral auto adds a <bos>! Also please use --jinja to enable the system prompt!
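A sketch of both invocations, assuming the UD-Q4_K_XL quant downloaded above (the exact .gguf path is illustrative and will depend on which files you fetched):

```bash
# One-shot prompt (disable conversation mode with -no-cnv)
./llama.cpp/llama-cli \
    --model Devstral-Small-2-24B-Instruct-2512-GGUF/UD-Q4_K_XL/Devstral-Small-2-24B-Instruct-2512-UD-Q4_K_XL.gguf \
    --jinja --temp 0.15 --min-p 0.01 --ctx-size 16384 --n-gpu-layers 99 \
    -no-cnv --prompt "Write a Python function that deduplicates a list."

# Conversation (chat) mode - drop -no-cnv and the prompt flag
./llama.cpp/llama-cli \
    --model Devstral-Small-2-24B-Instruct-2512-GGUF/UD-Q4_K_XL/Devstral-Small-2-24B-Instruct-2512-UD-Q4_K_XL.gguf \
    --jinja --temp 0.15 --min-p 0.01 --ctx-size 16384 --n-gpu-layers 99
```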
Devstral-2-123B
The full precision (Q8) Devstral-2-123B GGUF will fit in 128GB RAM/VRAM.
✨ Run Devstral-2-123B-Instruct-2512 Tutorial
Obtain the latest llama.cpp on GitHub here. You can follow the same build instructions as in the 24B section above. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.
You can directly pull from Hugging Face via:
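For example (the repo name unsloth/Devstral-2-123B-Instruct-2512-GGUF and the quant tag are assumptions following the same naming as the 24B model):

```bash
# Download (if needed) and run the 123B model straight from Hugging Face
./llama.cpp/llama-cli \
    -hf unsloth/Devstral-2-123B-Instruct-2512-GGUF:UD-Q4_K_XL \
    --jinja --temp 0.15 --min-p 0.01 --ctx-size 16384
```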
Download the model (after installing the tooling via pip install huggingface_hub hf_transfer). You can choose UD_Q4_K_XL or other quantized versions.
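A sketch of the download step (repo name and include pattern are assumptions; at this size the GGUF is split into several shards):

```bash
pip install huggingface_hub hf_transfer

# Fetch only the UD-Q4_K_XL shards of the 123B model
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download \
    unsloth/Devstral-2-123B-Instruct-2512-GGUF \
    --include "*UD-Q4_K_XL*" \
    --local-dir Devstral-2-123B-Instruct-2512-GGUF
```

When running, point llama-cli at the first shard of the split GGUF; llama.cpp loads the remaining shards automatically.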
Remember to remove <bos> since Devstral auto adds a <bos>! Also please use --jinja to enable the system prompt!
🦥 Fine-tuning Devstral 2 with Unsloth
Just like Ministral 3, Unsloth supports Devstral 2 fine-tuning. Training is 2x faster, uses 70% less VRAM, and supports 8x longer context lengths. Devstral 2 fits comfortably on a 24GB VRAM L4 GPU.
Unfortunately, Devstral 2 slightly exceeds the memory limits of a 16GB VRAM GPU, so fine-tuning it for free on Google Colab isn't possible for now. However, you can fine-tune the model for free using our Kaggle notebook, which offers access to dual GPUs. Just change the notebook's Magistral model name to unsloth/Devstral-Small-2-24B-Instruct-2512.