💥 Magistral: How to Run & Fine-tune

Meet Magistral - Mistral's new reasoning models.

Magistral-Small-2509 is a reasoning LLM developed by Mistral AI. It excels at coding and mathematics and supports multiple languages. Magistral supports a 128k token context window and was fine-tuned from Mistral-Small-3.2. Magistral runs perfectly well locally on a single RTX 4090 or on a Mac with 16 to 24GB of RAM.


All uploads use Unsloth Dynamic 2.0 for SOTA 5-shot MMLU and KL Divergence performance, meaning you can run & fine-tune quantized Mistral LLMs with minimal accuracy loss.

Magistral-Small - Unsloth Dynamic uploads:

🖥️ Running Magistral

According to Mistral AI, these are the recommended settings for inference:

  • Temperature: 0.7

  • Min_P: 0.01 (optional, but 0.01 works well; llama.cpp's default is 0.1)

  • Top_P: 0.95

  • A 128k context window is supported, but performance might degrade past 40k, so we recommend capping the maximum length at 40k if you see degraded outputs.
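As a concrete example, here is a minimal sketch of passing those settings to an OpenAI-compatible endpoint such as llama-server or vLLM; the URL, port, and model name are placeholders, and min_p is a non-standard field that llama-server and vLLM accept:

```bash
# Sketch: recommended sampling settings sent to an OpenAI-compatible server.
# The URL, port, and model name are placeholders for your own setup.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "magistral-small",
    "messages": [{"role": "user", "content": "How many r are in strawberry?"}],
    "temperature": 0.7,
    "top_p": 0.95,
    "min_p": 0.01,
    "max_tokens": 8192
  }'
```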

This is the recommended system prompt for Magistral 2509, 2507:

This is the recommended system prompt for Magistral 2506:

  • Multilingual: Magistral supports many languages, including English, French, German, Greek, Hindi, Indonesian, Italian, Japanese, Korean, Malay, Nepali, Polish, Portuguese, Romanian, Russian, Serbian, Spanish, Swedish, Turkish, Ukrainian, Vietnamese, Arabic, Bengali, Chinese, and Farsi.

Testing the model

Mistral has their own vibe-check prompts which can be used to evaluate Magistral. Keep in mind these tests are based on running the full unquantized version of the model; however, you can also run them on quantized versions:

Easy - Make sure they always work

Medium - Should be correct most of the time

Hard - Should sometimes get them right

We provide some example outputs at the end of the blog.

🦙 Tutorial: How to Run Magistral in Ollama

  1. Install Ollama if you haven't already! (Example commands for each step are sketched after this list.)

  2. Run the model with our dynamic quant. We do not set the context length automatically, so Ollama will use its default context length. Note that you can run ollama serve & in another terminal if it fails! We include all suggested parameters (temperature etc.) in params in our Hugging Face upload!

  3. Magistral supports up to 40K context, so it's best to enable KV cache quantization. We use 8-bit ("q8_0") quantization, which cuts KV cache memory usage by about 50%; you can also try "q4_0".

  4. Ollama sets the default context length to 4096, as mentioned here. Use OLLAMA_CONTEXT_LENGTH=8192 to change it to 8192. Magistral supports up to 128K, but 40K (40960) is the most tested.
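A minimal sketch of the four steps above as shell commands. The GGUF repo name and :UD-Q4_K_XL tag are assumptions based on our usual naming scheme (check the Hugging Face upload for the exact name), and the KV cache environment variables require a reasonably recent Ollama version:

```bash
# Step 1: install Ollama (Linux/macOS install script)
curl -fsSL https://ollama.com/install.sh | sh

# Steps 3 & 4: start the server with 8-bit KV cache quantization and a larger
# context window (KV cache quantization needs flash attention enabled)
OLLAMA_FLASH_ATTENTION=1 \
OLLAMA_KV_CACHE_TYPE=q8_0 \
OLLAMA_CONTEXT_LENGTH=8192 \
ollama serve &

# Step 2: pull and run our dynamic quant straight from Hugging Face
# (repo name and tag below are assumptions -- check our Hugging Face page)
ollama run hf.co/unsloth/Magistral-Small-2509-GGUF:UD-Q4_K_XL
```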

📖 Tutorial: How to Run Magistral in llama.cpp

  1. Obtain the latest llama.cpp from GitHub here. You can follow the build instructions sketched after this list as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.

  2. If you want llama.cpp to load the model directly, you can run the command sketched below, where (:Q4_K_XL) is the quantization type. You can also download the model via Hugging Face first (step 3). This is similar to ollama run.

  3. Or download the model first (after running pip install huggingface_hub hf_transfer). You can choose UD-Q4_K_XL (Unsloth Dynamic), Q4_K_M, or other quantized versions (like the BF16 full-precision upload).

  4. Run the model.

  5. Set --threads -1 for the maximum number of CPU threads, --ctx-size 40960 for the context length (Magistral supports 40K context!), and --n-gpu-layers 99 to offload as many layers as possible to the GPU. Lower it if your GPU runs out of memory, or remove it for CPU-only inference. We also use 8-bit quantization for the K cache to reduce memory usage.

  6. For conversation mode, see the sketch after this list:
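A sketch of the build in step 1, assuming a CUDA-capable machine; flip -DGGML_CUDA=ON to OFF for CPU-only inference:

```bash
# Step 1: build llama.cpp from source (set -DGGML_CUDA=OFF for CPU-only builds)
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j
cp llama.cpp/build/bin/llama-* llama.cpp/
```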
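And a sketch of steps 2-6: optionally download a quant first with huggingface_hub's CLI, then run it in conversation mode with the flags described above. The repo name and :UD-Q4_K_XL tag are assumptions, so double-check them on our Hugging Face page:

```bash
# Step 3 (optional): download a quant locally instead of streaming it
pip install huggingface_hub hf_transfer
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download \
    unsloth/Magistral-Small-2509-GGUF \
    --include "*UD-Q4_K_XL*" \
    --local-dir Magistral-Small-2509-GGUF

# Steps 2 and 4-6: run in conversation mode with the recommended sampling
# settings, 40K context, full GPU offload, and an 8-bit K cache.
# (Point -m at the downloaded .gguf instead of using -hf if you downloaded it above.)
./llama.cpp/llama-cli \
    -hf unsloth/Magistral-Small-2509-GGUF:UD-Q4_K_XL \
    --threads -1 \
    --ctx-size 40960 \
    --n-gpu-layers 99 \
    --cache-type-k q8_0 \
    --temp 0.7 --min-p 0.01 --top-p 0.95 \
    --conversation
```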

Sample outputs

How many "r" are in strawberry? [Correct answer = 3]
Exactly how many days ago did the French Revolution start? Today is June 4th, 2025. [Correct answer = 86,157 days]

👁 Vision Support

For Magistral versions before September 2025, Xuan-Son from Hugging Face showed in their GGUF repo how it is possible to "graft" the vision encoder from Mistral 3.1 Instruct onto Devstral, meaning you can do the same for Magistral! According to our tests and many users, it works quite well. We also uploaded our mmproj files, which allow you to do the following:
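For example, with a recent llama.cpp build that includes the multimodal (mtmd) tools, something like the following works; the model, mmproj, and image file names are placeholders for whatever you downloaded:

```bash
# Sketch: Magistral with a grafted vision encoder via llama.cpp's multimodal CLI.
# The .gguf and image paths below are placeholders.
./llama.cpp/llama-mtmd-cli \
    -m Magistral-Small-UD-Q4_K_XL.gguf \
    --mmproj mmproj-F16.gguf \
    --image your_image.png \
    -p "Describe this image." \
    --temp 0.7 --min-p 0.01 --top-p 0.95
```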

🦥 Fine-tuning Magistral with Unsloth

Unsloth supports fine-tuning Magistral just like standard Mistral models, including Mistral Small 3.1. Training is 2x faster, uses 70% less VRAM, and supports 8x longer context lengths. Magistral fits comfortably on a 24GB VRAM L4 GPU.

Magistral slightly exceeds the memory limits of a 16GB VRAM GPU, so fine-tuning it for free on Google Colab isn't possible for now. However, you can fine-tune the model for free using Kaggle, which offers access to dual GPUs.

To fine-tune on new reasoning traces, you can use our free Kaggle notebook for Magistral.

If you have an old version of Unsloth and/or are fine-tuning locally, install the latest version of Unsloth:
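One common way to do this in a pip-based environment (a sketch; adjust to your setup):

```bash
# Upgrade Unsloth (and unsloth_zoo) to the latest release
pip install --upgrade --force-reinstall --no-cache-dir unsloth unsloth_zoo
```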

💠 Dynamic Float8 Checkpoints

We also provide 2 popular formats for float8 checkpoints, which also utilize some of our dynamic methodology to retain maximum accuracy:

Both are fantastic to deploy via vLLM. Read up on using TorchAO-based FP8 quants in vLLM here.
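As a sketch, serving one of the FP8 checkpoints with vLLM's OpenAI-compatible server looks like this; the repo name below is a placeholder, so substitute the FP8 upload you chose:

```bash
# Sketch: deploy an FP8 checkpoint with vLLM (repo name is a placeholder)
vllm serve unsloth/Magistral-Small-2509-FP8 \
    --max-model-len 40960
```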
