🐱Ministral 3: How to Run & Fine-tune

A guide to running and fine-tuning Mistral's Ministral 3 models locally on your device.

Mistral has released Ministral 3, its new family of multimodal models in Base, Instruct, and Reasoning variants, available in 3B, 8B, and 14B sizes. They offer best-in-class performance for their size and are fine-tuned for instruction and chat use cases. The models support a 256K context window, multiple languages, native function calling, and JSON output.

The full unquantized 14B Ministral-3-Instruct-2512 model fits in 24GB of RAM/VRAM. You can now run, fine-tune, and do RL on all Ministral 3 models with Unsloth:

Run Ministral 3 Tutorials
Fine-tuning Ministral 3

Ministral-3-Instruct GGUFs: 3B, 8B, 14B
Ministral-3-Reasoning GGUFs: 3B, 8B, 14B

⚙️ Usage Guide

To achieve optimal performance for Instruct, Mistral recommends using lower temperatures such as temperature = 0.15 or 0.1.

For Reasoning, Mistral recommends temperature = 0.7 and top_p = 0.95.

Instruct: Temperature = 0.15 or 0.1, Top_P = default
Reasoning: Temperature = 0.7, Top_P = 0.95

Adequate Output Length: Use an output length of 32,768 tokens for most queries with the reasoning variant, and 16,384 tokens with the instruct variant. You can increase the maximum output length for the reasoning model if necessary.

The maximum context length Ministral 3 supports is 262,144 tokens.

You can inspect the chat template format by applying the tokenizer's chat template as below:

from transformers import AutoTokenizer

# Load the tokenizer for the checkpoint you are using (repo name is illustrative).
tokenizer = AutoTokenizer.from_pretrained("unsloth/Ministral-3-14B-Instruct-2512")
prompt = tokenizer.apply_chat_template([
    {"role" : "user", "content" : "What is 1+1?"},
    {"role" : "assistant", "content" : "2"},
    {"role" : "user", "content" : "What is 2+2?"}
    ], tokenize = False, add_generation_prompt = True,
)
print(prompt)

Ministral Reasoning chat template:

<s>[SYSTEM_PROMPT]# HOW YOU SHOULD THINK AND ANSWER

First draft your thinking process (inner monologue) until you arrive at a response. Format your response using Markdown, and use LaTeX for any mathematical equations. Write both your thoughts and the response in the same language as the input.

Your thinking process must follow the template below:[THINK]Your thoughts or/and draft, like working through an exercise on scratch paper. Be as casual and as long as you want until you are confident to generate the response to the user.[/THINK]Here, provide a self-contained response.[/SYSTEM_PROMPT][INST]What is 1+1?[/INST]2</s>[INST]What is 2+2?[/INST]

Ministral Instruct chat template:

<s>[SYSTEM_PROMPT]You are Ministral-3-3B-Instruct-2512, a Large Language Model (LLM) created by Mistral AI, a French startup headquartered in Paris.
You power an AI assistant called Le Chat.
Your knowledge base was last updated on 2023-10-01.
The current date is {today}.

When you're not sure about some information or when the user's request requires up-to-date or specific data, you must use the available tools to fetch the information. Do not hesitate to use tools whenever they can provide a more accurate or complete response. If no relevant tools are available, then clearly state that you don't have the information and avoid making up anything.
If the user's question is not clear, ambiguous, or does not provide enough context for you to accurately answer the question, you do not try to answer it right away and you rather ask the user to clarify their request (e.g. "What are some good restaurants around me?" => "Where are you?" or "When is the next flight to Tokyo" => "Where do you travel from?").
You are always very attentive to dates, in particular you try to resolve dates (e.g. "yesterday" is {yesterday}) and when asked about information at specific dates, you discard information that is at another date.
You follow these instructions in all languages, and always respond to the user in the language they use or request.
Next sections describe the capabilities that you have.

# WEB BROWSING INSTRUCTIONS

You cannot perform any web search or access internet to open URLs, links etc. If it seems like the user is expecting you to do so, you clarify the situation and ask the user to copy paste the text directly in the chat.

# MULTI-MODAL INSTRUCTIONS

You have the ability to read images, but you cannot generate images. You also cannot transcribe audio files or videos.
You cannot read nor transcribe audio files or videos.

# TOOL CALLING INSTRUCTIONS

You may have access to tools that you can use to fetch information or perform actions. You must use these tools in the following situations:

1. When the request requires up-to-date information.
2. When the request requires specific data that you do not have in your knowledge base.
3. When the request involves actions that you cannot perform without tools.

Always prioritize using tools to provide the most accurate and helpful response. If tools are not available, inform the user that you cannot perform the requested action at the moment.[/SYSTEM_PROMPT][INST]What is 1+1?[/INST]2</s>[INST]What is 2+2?[/INST]

📖 Run Ministral 3 Tutorials

Below are guides for the Reasoning and Instruct variants of the model.

Instruct: Ministral-3-Instruct-2512

To achieve optimal performance for Instruct, Mistral recommends using lower temperatures such as temperature = 0.15 or 0.1.

Llama.cpp: Run Ministral-3-14B-Instruct Tutorial

1. Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.

apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp
2. You can directly pull the model from Hugging Face via:

./llama.cpp/llama-cli \
    -hf unsloth/Ministral-3-14B-Instruct-2512:Q4_K_XL \
    --jinja -ngl 99 --threads -1 --ctx-size 32768 \
    --temp 0.15
3. Or, download the model via the snippet below (after installing pip install huggingface_hub hf_transfer). You can choose UD-Q4_K_XL or other quantized versions.

# !pip install huggingface_hub hf_transfer
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id = "unsloth/Ministral-3-14B-Instruct-2512-GGUF",
    local_dir = "Ministral-3-14B-Instruct-2512-GGUF",
    allow_patterns = ["*UD-Q4_K_XL*"],
)

Reasoning: Ministral-3-Reasoning-2512

To achieve optimal performance for Reasoning, Mistral recommends using temperature = 0.7 and top_p = 0.95.

Llama.cpp: Run Ministral-3-14B-Reasoning Tutorial

1. Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.

apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp
2. You can directly pull the model from Hugging Face via:

./llama.cpp/llama-cli \
    -hf unsloth/Ministral-3-14B-Reasoning-2512:Q4_K_XL \
    --jinja -ngl 99 --threads -1 --ctx-size 32768 \
    --temp 0.7 --top-p 0.95
3. Or, download the model via the snippet below (after installing pip install huggingface_hub hf_transfer). You can choose UD-Q4_K_XL or other quantized versions.

# !pip install huggingface_hub hf_transfer
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id = "unsloth/Ministral-3-14B-Reasoning-2512-GGUF",
    local_dir = "Ministral-3-14B-Reasoning-2512-GGUF",
    allow_patterns = ["*UD-Q4_K_XL*"],
)

🛠️ Fine-tuning Ministral 3

Unsloth now supports fine-tuning of all Ministral 3 models, including vision support. To train, you must use the latest 🤗 Hugging Face transformers v5 and our recent ultra-long-context support. The large 14B Ministral 3 model should fit on a free Colab GPU.

We made free Unsloth notebooks for fine-tuning Ministral 3. Change the model name in the notebook to the variant you want to train.

Ministral Vision finetuning notebook

Ministral Sudoku GRPO RL notebook

Reinforcement Learning (GRPO)

Unsloth now supports RL and GRPO for the Ministral 3 models as well. As usual, they benefit from all of Unsloth's enhancements, and we provide a notebook specifically for autonomously solving Sudoku puzzles (linked above).

To use the latest version of Unsloth and transformers v5, update via:

pip install --upgrade --force-reinstall --no-cache-dir --no-deps unsloth unsloth_zoo

The goal is to auto-generate strategies to complete Sudoku!
