Devstral: How to Run & Fine-tune
Run and fine-tune Mistral Devstral 1.1, including Small-2507 and 2505.
Devstral-Small-2507 (Devstral 1.1) is Mistral's new agentic LLM for software engineering. It excels at tool-calling, exploring codebases, and powering coding agents. Mistral AI released the original 2505 version in May 2025.
Fine-tuned from Mistral-Small-3.1, Devstral supports a 128k context window. Devstral Small 1.1 has improved performance, scoring 53.6% on SWE-Bench Verified, which makes it (as of July 10, 2025) the #1 open model on the benchmark.
Unsloth Devstral 1.1 GGUFs contain additional tool-calling support and chat template fixes. Devstral 1.1 still works well with OpenHands but now also generalizes better to other prompts and coding environments.
Devstral is text-only: its vision encoder was removed prior to fine-tuning. We've added optional vision support for the model (see the Experimental Vision Support section below).
We also worked with Mistral behind the scenes to help debug, test, and correct any possible bugs and issues! Make sure to use Mistral's official downloads or Unsloth's GGUFs / dynamic quants to get the correct implementation (i.e. the correct system prompt, chat template, etc.).
Please use --jinja in llama.cpp to enable the system prompt!
All Devstral uploads use our Unsloth Dynamic 2.0 methodology, delivering the best performance on 5-shot MMLU and KL Divergence benchmarks. This means you can run and fine-tune quantized Mistral LLMs with minimal accuracy loss!
Devstral - Unsloth Dynamic quants:
🖥️ Running Devstral
⚙️ Official Recommended Settings
According to Mistral AI, these are the recommended settings for inference:
Temperature from 0.0 to 0.15
Min_P of 0.01 (optional; 0.01 works well, while llama.cpp's default is 0.1)
Use `--jinja` to enable the system prompt.
A system prompt is recommended, and is a derivative of OpenHands' system prompt. The full system prompt is provided here.
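As a rough illustration, the settings above map onto llama.cpp's server flags like this (the GGUF filename is a placeholder; point it at whichever quant you downloaded):

```bash
# A minimal sketch: serve Devstral with Mistral's recommended sampling settings.
# --jinja enables the chat template (and hence the system prompt),
# --temp 0.15 stays within the recommended 0.0-0.15 range,
# --min-p 0.01 is optional (llama.cpp's default is 0.1).
./llama.cpp/llama-server \
    --model Devstral-Small-2507-UD-Q4_K_XL.gguf \
    --jinja \
    --temp 0.15 \
    --min-p 0.01
```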
Our dynamic uploads have the 'UD' prefix in their names. Those without it are not dynamic, but they still utilize our calibration dataset.
🦙 Tutorial: How to Run Devstral in Ollama
Install `ollama` if you haven't already!
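On Linux, one quick way is Ollama's official install script (see ollama.com for macOS and Windows installers):

```bash
# Install Ollama via its official Linux install script.
curl -fsSL https://ollama.com/install.sh | sh
```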
Run the model with our dynamic quant. Note you can call `ollama serve &` in another terminal if it fails! We include all suggested parameters (temperature etc.) in `params` in our Hugging Face upload. Devstral also supports a 128K context length, so it's best to enable KV cache quantization; we use 8-bit quantization, which saves 50% of the memory usage. You can also try "q4_0".
Tutorial: How to Run Devstral in llama.cpp
Obtain the latest `llama.cpp` on GitHub here. You can follow the build instructions below as well. Change `-DGGML_CUDA=ON` to `-DGGML_CUDA=OFF` if you don't have a GPU or just want CPU inference.
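A representative build sequence (a sketch: the dependency list assumes a Debian/Ubuntu-style system, and you can trim the targets to the binaries you need):

```bash
# Install build dependencies (Debian/Ubuntu example).
apt-get update && apt-get install -y build-essential cmake curl libcurl4-openssl-dev

# Clone and build llama.cpp with CUDA; set -DGGML_CUDA=OFF for CPU-only inference.
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first

# Copy the built binaries (llama-cli, llama-server, ...) up one level for convenience.
cp llama.cpp/build/bin/llama-* llama.cpp/
```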
If you want to use `llama.cpp` directly to load models, you can do the below: (`:Q4_K_XL`) is the quantization type. You can also download via Hugging Face (point 3). This is similar to `ollama run`.
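For example (a sketch; the repo name and quant tag are assumptions, so substitute whichever quant you want):

```bash
# Stream the quant straight from Hugging Face and chat with it interactively.
# --jinja enables the chat template, which supplies the recommended system prompt.
./llama.cpp/llama-cli \
    -hf unsloth/Devstral-Small-2507-GGUF:UD-Q4_K_XL \
    --jinja \
    --temp 0.15 \
    --min-p 0.01
```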
OR download the model locally (after installing `huggingface_hub` and `hf_transfer` via `pip install huggingface_hub hf_transfer`). You can choose Q4_K_M, or other quantized versions (like BF16 full precision).
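One way to do this from the shell is the `huggingface-cli` tool that ships with `huggingface_hub` (the repo name and filename pattern are assumptions; adjust them for the quant you want):

```bash
# Optional: hf_transfer speeds up large downloads.
export HF_HUB_ENABLE_HF_TRANSFER=1

# Download only the UD-Q4_K_XL files into a local folder.
huggingface-cli download unsloth/Devstral-Small-2507-GGUF \
    --include "*UD-Q4_K_XL*" \
    --local-dir unsloth/Devstral-Small-2507-GGUF
```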
Run the model.
Edit `--threads -1` for the maximum CPU threads, `--ctx-size 131072` for the context length (Devstral supports a 128K context length!), and `--n-gpu-layers 99` for how many layers to offload to the GPU. Try lowering it if your GPU runs out of memory, and remove it entirely for CPU-only inference. We also use 8-bit quantization for the K cache to reduce memory usage. For conversation mode:
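A sketch, assuming the local folder and filename from the download step above:

```bash
# Interactive (conversation) mode with the locally downloaded quant.
# --cache-type-k q8_0 quantizes the K cache to 8 bits to cut memory usage.
./llama.cpp/llama-cli \
    --model unsloth/Devstral-Small-2507-GGUF/Devstral-Small-2507-UD-Q4_K_XL.gguf \
    --threads -1 \
    --ctx-size 131072 \
    --n-gpu-layers 99 \
    --cache-type-k q8_0 \
    --jinja \
    --temp 0.15 \
    --min-p 0.01
```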
For non-conversation mode, to test our Flappy Bird prompt:
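Same flags plus `-no-cnv` and a prompt; the one-liner below is only a stand-in for the full Flappy Bird prompt in our docs:

```bash
# Single-shot generation (no interactive conversation).
# Replace the short prompt below with the full Flappy Bird test prompt.
./llama.cpp/llama-cli \
    --model unsloth/Devstral-Small-2507-GGUF/Devstral-Small-2507-UD-Q4_K_XL.gguf \
    --threads -1 --ctx-size 131072 --n-gpu-layers 99 \
    --cache-type-k q8_0 --jinja \
    --temp 0.15 --min-p 0.01 \
    -no-cnv \
    --prompt "Create a Flappy Bird game in Python using pygame."
```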
Remember to remove `<bos>` since Devstral auto-adds a `<bos>`! Also please use `--jinja` to enable the system prompt!
Experimental Vision Support
Xuan-Son from Hugging Face showed in their GGUF repo how it is actually possible to "graft" the vision encoder from Mistral 3.1 Instruct onto Devstral 2507. We also uploaded our mmproj files, which allow you to do the following:
For example:
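A sketch using llama.cpp's multimodal CLI; the mmproj filename and the image path are placeholders:

```bash
# Run Devstral with the grafted vision encoder via llama.cpp's multimodal CLI.
# The mmproj file pairs the Mistral 3.1 vision projector with the text-only Devstral GGUF.
./llama.cpp/llama-mtmd-cli \
    --model unsloth/Devstral-Small-2507-GGUF/Devstral-Small-2507-UD-Q4_K_XL.gguf \
    --mmproj unsloth/Devstral-Small-2507-GGUF/mmproj-F16.gguf \
    --image your_image.png \
    -p "Describe what you see in this image."
```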


🦥 Fine-tuning Devstral with Unsloth
Just like standard Mistral models, including Mistral Small 3.1, Unsloth supports Devstral fine-tuning. Training is 2x faster, uses 70% less VRAM, and supports 8x longer context lengths. Devstral fits comfortably in a 24GB VRAM L4 GPU.
Unfortunately, Devstral slightly exceeds the memory limits of a 16GB VRAM, so fine-tuning it for free on Google Colab isn't possible for now. However, you can fine-tune the model for free using our Kaggle notebook, which offers access to dual GPUs. Just change the notebook's Magistral model name to the Devstral model.
If you have an old version of Unsloth and/or are fine-tuning locally, install the latest version of Unsloth:
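A typical upgrade command (the package names follow Unsloth's PyPI releases):

```bash
# Upgrade Unsloth (and unsloth_zoo, which it depends on) to the latest release.
pip install --upgrade --no-cache-dir unsloth unsloth_zoo
```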