💥Magistral: How to Run & Fine-tune
Meet Magistral - Mistral's new reasoning models.
Magistral-Small-2509 is a reasoning LLM developed by Mistral AI. It excels at coding and mathematics and supports multiple languages. Magistral supports a 128k token context window and was finetuned from Mistral-Small-3.2. Magistral runs perfectly well locally on a single RTX 4090 or a Mac with 16 to 24GB RAM.
Update: The new Magistral-2509 update is out as of September 2025, now with vision support! We worked with Mistral again for the release of Magistral. Make sure to download Mistral's official uploads or Unsloth's uploads to get the correct implementation (i.e. the correct system prompt, chat template etc.).
If you're using llama.cpp, please use --jinja to enable the system prompt!
All uploads use Unsloth Dynamic 2.0 for SOTA 5-shot MMLU and KL Divergence performance, meaning you can run & fine-tune quantized Mistral LLMs with minimal accuracy loss.
Magistral-Small - Unsloth Dynamic uploads:
🖥️ Running Magistral
⚙️ Official Recommended Settings
According to Mistral AI, these are the recommended settings for inference:
Temperature: 0.7
Min_P: 0.01 (optional, but 0.01 works well; llama.cpp's default is 0.1)
Top_P: 0.95
A 128k context window is supported, but performance might degrade past 40k. So we recommend setting the maximum length to 40k if you see bad performance.
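If you serve the model behind an OpenAI-compatible endpoint (llama-server, vLLM etc.), the settings above map directly onto the request body. Below is a minimal sketch assuming a llama-server instance on localhost:8080 (the URL, port and model name are placeholders for your own setup); note that min_p is a llama.cpp extension rather than part of the standard OpenAI schema.

# A minimal sketch of applying the recommended sampling settings against a local
# OpenAI-compatible endpoint (assumed: llama-server on port 8080 -- adjust for your setup).
import requests

payload = {
    "model": "Magistral-Small-2509",  # placeholder name; llama-server ignores it
    "messages": [
        {"role": "user", "content": "How many 'r' are in strawberry?"},
    ],
    "temperature": 0.7,  # recommended
    "top_p": 0.95,       # recommended
    "min_p": 0.01,       # optional; a llama.cpp extension to the OpenAI schema
    "max_tokens": 2048,
}

resp = requests.post("http://localhost:8080/v1/chat/completions", json=payload, timeout=600)
print(resp.json()["choices"][0]["message"]["content"])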
This is the recommended system prompt for Magistral 2509 and 2507:
First draft your thinking process (inner monologue) until you arrive at a response. Format your response using Markdown, and use LaTeX for any mathematical equations. Write both your thoughts and the response in the same language as the input.
Your thinking process must follow the template below:
[THINK]
Your thoughts or/and draft, like working through an exercise on scratch paper. Be as casual and as long as you want until you are confident to generate the response. Use the same language as the input.
[/THINK]
Here, provide a self-contained response.
This is the recommended system prompt for Magistral 2506:
A user will ask you to solve a task. You should first draft your thinking process (inner monologue) until you have derived the final answer. Afterwards, write a self-contained summary of your thoughts (i.e. your summary should be succinct but contain all the critical steps you needed to reach the conclusion). You should use Markdown to format your response. Write both your thoughts and summary in the same language as the task posed by the user. NEVER use \boxed{} in your response.
Your thinking process must follow the template below:
<think>
Your thoughts or/and draft, like working through an exercise on scratch paper. Be as casual and as long as you want until you are confident to generate a correct answer.
</think>
Here, provide a concise summary that reflects your reasoning and presents a clear final answer to the user. Don't mention that this is a summary.
Problem:
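The chat template injects the system prompt for you (hence --jinja), but if you consume raw completions you may want to separate the reasoning block from the final answer yourself. Here is a small helper sketch (split_reasoning is our own illustrative function, not part of any Mistral or Unsloth library) that handles both the [THINK]...[/THINK] tags used by 2509/2507 and the <think>...</think> tags used by 2506:

# Separate the reasoning block from the final answer in a raw Magistral completion.
import re

_THINK_RE = re.compile(r"\[THINK\](.*?)\[/THINK\]|<think>(.*?)</think>", re.DOTALL)

def split_reasoning(text: str):
    """Return (reasoning, answer) from a raw Magistral completion."""
    match = _THINK_RE.search(text)
    if match is None:
        return "", text.strip()
    reasoning = (match.group(1) or match.group(2) or "").strip()
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer

reasoning, answer = split_reasoning("[THINK]2 + 2 = 4[/THINK]The answer is 4.")
print(answer)  # -> "The answer is 4."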
Our dynamic uploads have the 'UD' prefix in their names. Those without it are not dynamic, but they still utilize our calibration dataset.
Multilingual: Magistral supports many languages including: English, French, German, Greek, Hindi, Indonesian, Italian, Japanese, Korean, Malay, Nepali, Polish, Portuguese, Romanian, Russian, Serbian, Spanish, Swedish, Turkish, Ukrainian, Vietnamese, Arabic, Bengali, Chinese, and Farsi.
❓Testing the model
Mistral has their own vibe-check prompts which can be used to evaluate Magistral. Keep in mind these tests are based on running the full unquantized version of the model, but you can also run them against quantized versions:
Easy - Make sure they always work
prompt_1 = 'How many "r" are in strawberry?'
prompt_2 = 'John is one of 4 children. The first sister is 4 years old. Next year, the second sister will be twice as old as the first sister. The third sister is two years older than the second sister. The third sister is half the age of her older brother. How old is John?'
prompt_3 = '9.11 and 9.8, which is greater?'
Medium - Should most of the time be correct
prompt_4 = "Think about 5 random numbers. Verify if you can combine them with addition, multiplication, subtraction or division to 133"
prompt_5 = "Write 4 sentences, each with at least 8 words. Now make absolutely sure that every sentence has exactly one word less than the previous sentence."
prompt_6 = "If it takes 30 minutes to dry 12 T-shirts in the sun, how long does it take to dry 33 T-shirts?"
Hard - Should sometimes get them right
prompt_7 = 'Pick 5 random words each with at least 10 letters. Print them out. Reverse each word and print it out. Then extract letters that are alphabetically sorted smaller than "g" and print them. Do not use code.'
prompt_8 = "Exactly how many days ago did the French Revolution start? Today is June 4th, 2025."
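If you want to run the whole vibe-check set in one go, a rough sketch like the one below works against any OpenAI-compatible local endpoint (again assuming llama-server on localhost:8080, as in the earlier sampling example); it reuses the prompt_1 ... prompt_8 variables defined above and the recommended sampling settings.

# Run all vibe-check prompts against a local OpenAI-compatible endpoint (assumed URL).
import requests

prompts = [prompt_1, prompt_2, prompt_3, prompt_4, prompt_5, prompt_6, prompt_7, prompt_8]

for i, prompt in enumerate(prompts, start=1):
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
        "top_p": 0.95,
        "min_p": 0.01,
        "max_tokens": 4096,
    }
    resp = requests.post("http://localhost:8080/v1/chat/completions", json=payload, timeout=600)
    answer = resp.json()["choices"][0]["message"]["content"]
    print(f"--- prompt_{i} ---\n{answer}\n")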
We provide some example outputs at the end of the blog.
🦙 Tutorial: How to Run Magistral in Ollama
Install ollama if you haven't already!
apt-get update
apt-get install pciutils -y
curl -fsSL https://ollama.com/install.sh | sh
Run the model with our dynamic quant. We do not set the context length automatically, so Ollama's default context length will be used. Note you can call ollama serve & in another terminal if it fails! We include all suggested parameters (temperature etc.) in the params file of our Hugging Face upload. Also, Magistral supports 40K context lengths, so it's best to enable KV cache quantization: we use 8-bit ("q8_0") quantization, which saves roughly 50% of KV cache memory. You can also try "q4_0".
Ollama also sets the default context length to 4096, as mentioned here. Use OLLAMA_CONTEXT_LENGTH=8192 to change it to 8192. Magistral supports up to 128K, but 40K (40960) is the most-tested length.
export OLLAMA_KV_CACHE_TYPE="q8_0"
OLLAMA_CONTEXT_LENGTH=8192 ollama serve &
ollama run hf.co/unsloth/Magistral-Small-2509-GGUF:UD-Q4_K_XL
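Once the model is pulled, you can also query Ollama's REST API directly instead of the interactive CLI. A minimal sketch, assuming Ollama's default port 11434 and the model name created by the ollama run command above (min_p support depends on your Ollama version):

# Query the local Ollama server with the recommended sampling settings.
import requests

payload = {
    "model": "hf.co/unsloth/Magistral-Small-2509-GGUF:UD-Q4_K_XL",
    "messages": [{"role": "user", "content": "9.11 and 9.8, which is greater?"}],
    "stream": False,
    "options": {
        "temperature": 0.7,
        "top_p": 0.95,
        "min_p": 0.01,
        "num_ctx": 8192,  # match OLLAMA_CONTEXT_LENGTH above; raise towards 40960 if memory allows
    },
}

resp = requests.post("http://localhost:11434/api/chat", json=payload, timeout=600)
print(resp.json()["message"]["content"])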
📖 Tutorial: How to Run Magistral in llama.cpp
Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.
apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
-DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-quantize llama-cli llama-gguf-split llama-mtmd-cli
cp llama.cpp/build/bin/llama-* llama.cpp
If you want to use llama.cpp directly to load models, you can do the below: (:UD-Q4_K_XL) is the quantization type. You can also download the model via Hugging Face (point 3). This is similar to ollama run.
./llama.cpp/llama-cli -hf unsloth/Magistral-Small-2509-GGUF:UD-Q4_K_XL --jinja --temp 0.7 --top-k -1 --top-p 0.95 -ngl 99
In llama.cpp, please use --jinja to enable the system prompt!
Or, download the model via the snippet below (after installing the dependencies with pip install huggingface_hub hf_transfer). You can choose UD-Q4_K_XL (Unsloth Dynamic), Q4_K_M, or other quantized versions (like BF16 full precision).
# !pip install huggingface_hub hf_transfer
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id = "unsloth/Magistral-Small-2509-GGUF",
    local_dir = "unsloth/Magistral-Small-2509-GGUF",
    allow_patterns = ["*UD-Q4_K_XL*"],  # For UD-Q4_K_XL
)
Run the model. Edit --threads -1 for the maximum number of CPU threads, --ctx-size 40960 for context length (Magistral supports 40K context length!), and --n-gpu-layers 99 for how many layers to offload to the GPU. Try lowering it if your GPU runs out of memory, and remove it entirely for CPU-only inference. We also use 8-bit quantization for the K cache to reduce memory usage. For conversation mode:
./llama.cpp/llama-cli \
--model unsloth/Magistral-Small-2509-GGUF/Magistral-Small-2509-UD-Q4_K_XL.gguf \
--threads -1 \
--ctx-size 40960 \
--cache-type-k q8_0 \
--n-gpu-layers 99 \
--seed 3407 \
--prio 2 \
--temp 0.7 \
--repeat-penalty 1.0 \
--min-p 0.01 \
--top-k -1 \
--top-p 0.95 \
--jinja
Remember not to add a BOS token manually, since Magistral's chat template already adds one automatically.
Sample outputs
👁Vision Support
The Magistral 2509 update (September 2025) includes vision support by default!
./llama.cpp/llama-mtmd-cli \
--model unsloth/Magistral-Small-2509-GGUF/Magistral-Small-2509-UD-Q4_K_XL.gguf \
--mmproj unsloth/Magistral-Small-2509-GGUF/mmproj-BF16.gguf \
--threads -1 \
--ctx-size 40960 \
--cache-type-k f16 \
--n-gpu-layers 99 \
--seed 3407 \
--prio 2 \
--temp 0.7 \
--repeat-penalty 1.0 \
--min-p 0.01 \
--top-k -1 \
--top-p 0.95 \
--jinja
For Magistral versions before September 2025, Xuan-Son from Hugging Face showed in their GGUF repo how it is possible to "graft" the vision encoder from Mistral 3.1 Instruct onto Devstral, meaning you could do the same for Magistral! According to our tests and those of many users, it works quite well. We also uploaded our mmproj files, which allow you to use the following:
./llama.cpp/llama-mtmd-cli \
--model unsloth/Magistral-Small-2509-GGUF/Magistral-Small-2509-UD-Q4_K_XL.gguf \
--mmproj unsloth/Magistral-Small-2509-GGUF/mmproj-BF16.gguf \
--threads -1 \
--ctx-size 40960 \
--n-gpu-layers 99 \
--seed 3407 \
--prio 2 \
--temp 0.7 \
--repeat-penalty 1.0 \
--min-p 0.01 \
--top-k -1 \
--top-p 0.95 \
--jinja
🦥 Fine-tuning Magistral with Unsloth
Just like standard Mistral models, including Mistral Small 3.1, Unsloth supports fine-tuning Magistral. Training is 2x faster, uses 70% less VRAM, and supports 8x longer context lengths. Magistral fits comfortably on a 24GB VRAM L4 GPU.
Magistral 2509 Kaggle (2x Tesla T4s) free finetuning notebook
Magistral 2509 Colab L4 (24GB) finetuning notebook
Magistral slightly exceeds the memory limits of a 16GB VRAM GPU, so fine-tuning it for free on Google Colab isn't possible for now. However, you can fine-tune the model for free using Kaggle, which offers access to dual GPUs.
To fine-tune on new reasoning traces, you can use our free Kaggle notebook for Magistral:
!pip install --upgrade unsloth
from unsloth import FastLanguageModel
import torch
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Magistral-Small-2509-unsloth-bnb-4bit",
    max_seq_length = 2048,   # Context length - can be longer, but uses more memory
    load_in_4bit = True,     # 4bit uses much less memory
    load_in_8bit = False,    # A bit more accurate, uses 2x memory
    full_finetuning = False, # We have full finetuning now!
    device_map = "balanced", # Uses 2x Tesla T4s
    # token = "hf_...",      # use one if using gated models
)
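After loading, the usual next step (as in Unsloth's standard notebooks) is to attach LoRA adapters with get_peft_model before training. The rank and target modules below are common illustrative defaults rather than Magistral-specific requirements:

# Attach LoRA adapters for parameter-efficient fine-tuning.
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,                      # LoRA rank; higher = more capacity, more VRAM
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 16,
    lora_dropout = 0,            # 0 is optimized in Unsloth
    bias = "none",               # "none" is optimized in Unsloth
    use_gradient_checkpointing = "unsloth",  # saves VRAM for longer contexts
    random_state = 3407,
)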
If you have an old version of Unsloth and/or are fine-tuning locally, install the latest version of Unsloth:
pip install --upgrade --force-reinstall --no-cache-dir unsloth unsloth_zoo
💠Dynamic Float8 Checkpoints
We also provide two popular formats for float8 checkpoints, which also utilize some of our dynamic methodology to retain maximum accuracy:
Both are fantastic to deploy via vLLM. Read up on using TorchAO-based FP8 quants in vLLM here.
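As a rough sketch, offline inference with vLLM's Python API looks like the following. The repo id is a placeholder, so substitute the actual float8 upload you want to deploy, and note that llm.generate takes raw prompts (no chat template or system prompt applied):

# Offline inference with vLLM on an FP8 checkpoint (repo id is a placeholder).
from vllm import LLM, SamplingParams

llm = LLM(model="<your-magistral-fp8-checkpoint>")  # substitute the real float8 upload
params = SamplingParams(temperature=0.7, top_p=0.95, min_p=0.01, max_tokens=2048)

outputs = llm.generate(["How many 'r' are in strawberry?"], params)
print(outputs[0].outputs[0].text)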