Unsloth Documentation
Updating

To update or use an old version of Unsloth, follow the steps below:

Standard Updating (recommended):

pip install --upgrade unsloth unsloth_zoo

Updating without dependency updates:

pip install --upgrade --force-reinstall --no-cache-dir --no-deps unsloth
pip install --upgrade --force-reinstall --no-cache-dir --no-deps unsloth_zoo

To use an old version of Unsloth:

pip install --force-reinstall --no-cache-dir --no-deps unsloth==2025.1.5

'2025.1.5' is one of the previous old versions of Unsloth. Change it to a specific release listed on our GitHub.

Conda Install

To install Unsloth locally with Conda, follow the steps below:

Only use Conda if you already have it. If not, use pip.

Select pytorch-cuda=11.8 or pytorch-cuda=12.1 for CUDA 11.8 or CUDA 12.1. We support python=3.10,3.11,3.12.

conda create --name unsloth_env \
    python=3.11 \
    pytorch-cuda=12.1 \
    pytorch cudatoolkit xformers -c pytorch -c nvidia -c xformers \
    -y
conda activate unsloth_env

pip install unsloth

If you're looking to install Conda in a Linux environment, read here, or run the below:

mkdir -p ~/miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
rm -rf ~/miniconda3/miniconda.sh
~/miniconda3/bin/conda init bash
~/miniconda3/bin/conda init zsh

GSPO Reinforcement Learning

Train with GSPO (Group Sequence Policy Optimization) RL in Unsloth.

We're introducing support for GSPO, a variant of GRPO made by the Qwen team at Alibaba. They observed that GRPO applies an importance weight to every token, even though the advantages do not scale or change per token; they are assigned to the whole sequence. This led to the creation of GSPO, which assigns the importance weight to the sequence likelihood rather than to the individual token likelihoods.

  • Use our free GSPO notebooks for: gpt-oss-20b and Qwen2.5-VL

Enable GSPO in Unsloth by setting importance_sampling_level = "sequence" in the GRPO config. The difference between these two algorithms can be seen below, both from the GSPO paper from Qwen and Alibaba:

GRPO algorithm. Source: Qwen GSPO paper
GSPO algorithm. Source: Qwen GSPO paper

In Equation 1, it can be seen that the advantages scale each row of the token log-probabilities before that tensor is summed. Essentially, every token receives the same scaling, even though that scaling was assigned to the entire sequence rather than to each individual token. A simple diagram of this (the GRPO logprob ratios row-wise scaled with advantages) is shown below:

Equation 2 shows that the logprob ratios for each sequence are summed and exponentiated first, and only the resulting sequence-level ratios get row-wise multiplied by the advantages (the GSPO sequence ratios row-wise scaled with advantages).
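As a sketch of the two ratios (our notation, following the definitions in the GSPO paper rather than quoting it), GRPO weights each token with a per-token ratio while GSPO uses one length-normalized ratio per sequence:

w_{i,t} = \frac{\pi_\theta(y_{i,t} \mid x, y_{i,<t})}{\pi_{\theta_\mathrm{old}}(y_{i,t} \mid x, y_{i,<t})}   (GRPO: one ratio per token)

s_i = \left( \frac{\pi_\theta(y_i \mid x)}{\pi_{\theta_\mathrm{old}}(y_i \mid x)} \right)^{1/|y_i|} = \exp\!\left( \frac{1}{|y_i|} \sum_{t=1}^{|y_i|} \log \frac{\pi_\theta(y_{i,t} \mid x, y_{i,<t})}{\pi_{\theta_\mathrm{old}}(y_{i,t} \mid x, y_{i,<t})} \right)   (GSPO: one ratio per sequence)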

Enabling GSPO is simple: all you need to do is set the importance_sampling_level = "sequence" flag in the GRPO config.

RL Reward Hacking

Learn what is Reward Hacking in Reinforcement Learning and how to counter it.

The ultimate goal of RL is to maximize some reward (say speed, revenue, or some other metric). But RL can cheat. When the RL algorithm learns a trick or exploits something to increase the reward without actually doing the task at hand, this is called "Reward Hacking".

It's the reason models learn to modify unit tests to pass coding challenges, and these are critical blockers for real world deployment. Some other good examples are from Wikipedia.

Can you counter reward hacking? Yes! In our free gpt-oss RL notebook we explore how to counter reward hacking in a code generation setting and showcase tangible solutions to common error modes. We saw the model edit the timing function, outsource to other libraries, cache the results, and outright cheat. After countering, the result is our model generates genuinely optimized matrix multiplication kernels, not clever cheats.

🏆 Reward Hacking Overview

Some common examples of reward hacking during RL include:

Laziness

RL learns to use NumPy, Torch or other libraries, which call optimized CUDA kernels. We can stop the RL algorithm from calling optimized code by inspecting whether the generated code imports non-standard Python libraries.
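A minimal sketch of such a check, assuming only Python's standard ast module and an allow-list we define ourselves (the notebook's exact implementation may differ):

import ast

ALLOWED_MODULES = {"math"}  # hypothetical allow-list; everything else is rejected

def uses_disallowed_imports(generated_code: str) -> bool:
    """Return True if the generated code imports any library outside the allow-list."""
    for node in ast.walk(ast.parse(generated_code)):
        if isinstance(node, ast.Import):
            modules = [alias.name.split(".")[0] for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            modules = [(node.module or "").split(".")[0]]
        else:
            continue
        if any(m not in ALLOWED_MODULES for m in modules):
            return True  # e.g. numpy or torch detected -> give zero reward
    return False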

Caching & Cheating

RL learns to cache the result of the output, and it learns to find the actual output by inspecting Python global variables.

We can stop the RL algorithm from using cached data by wiping the cache with a large fake matrix. We also have to benchmark carefully with multiple loops and repeated runs.
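A rough sketch of the cache-wiping idea with illustrative names (the notebook's benchmarking harness is more involved):

import random

def wipe_cache(n = 2048):
    """Allocate and touch a large throwaway matrix so anything the candidate
    code cached for the real benchmark inputs becomes useless."""
    junk = [[random.random() for _ in range(n)] for _ in range(n)]
    return sum(map(sum, junk))  # force the work to actually happen

# Benchmark in several independent loops with fresh random inputs, calling
# wipe_cache() between runs, so cached answers cannot win.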

Cheating

RL learns to edit the timing function to make it report 0 time elapsed. We can stop the RL algorithm from using global or cached variables by restricting its locals and globals. We use exec to create the function, so we have to save the output to an empty dict. We also disallow global variable access via types.FunctionType(f.__code__, {}).
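A minimal sketch of that last trick (names are illustrative): exec the generated source into an empty dict, then rebuild the function with an empty globals dict so it cannot reach module-level state such as the timing function.

import types

def build_restricted(source: str, fn_name: str = "matmul"):
    namespace = {}           # exec saves its output into this empty dict
    exec(source, namespace)  # define the generated candidate function
    f = namespace[fn_name]
    # Same bytecode, but an empty globals dict: lookups of cached results,
    # patched timers or other module globals will simply fail.
    return types.FunctionType(f.__code__, {})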

Unsloth Installation

Learn to install Unsloth locally or online.

Unsloth works on Linux, Windows, NVIDIA, AMD, Google Colab and more. See our system requirements.

Recommended install methods:

pip install unsloth
uv pip install unsloth

To update Unsloth

pip install --upgrade unsloth

Fine-tuning for Beginners

If you're a beginner, these might be the first questions you'll ask before your first fine-tune. You can also always ask for help by joining our community.

Reinforcement Learning - DPO, ORPO & KTO

To use the reward modelling functions for DPO, GRPO, ORPO or KTO with Unsloth, follow the steps below:

DPO (Direct Preference Optimization), ORPO (Odds Ratio Preference Optimization), PPO, KTO Reward Modelling all work with Unsloth.

We have Google Colab notebooks for reproducing GRPO, ORPO, DPO Zephyr, KTO and SimPO:

FP16 vs BF16 for RL

The paper "Defeating the Training-Inference Mismatch via FP16" (https://arxiv.org/pdf/2510.26788) shows how using float16 is better than bfloat16 for RL.

Float16 vs Bfloat16

There was a paper titled "Defeating the Training-Inference Mismatch via FP16" showing how using float16 precision can be dramatically better than bfloat16 when doing reinforcement learning.

In fact the longer the generation, the worse it gets when using bfloat16:

We did our own investigation, and we DO find float16 to be more stable than bfloat16, with much smaller gradient norms (see the links further down this page).

Google Colab

To install and run Unsloth on Google Colab, follow the steps below:

If you have never used a Colab notebook, a quick primer on the notebook itself:

  1. Play Button at each "cell". Click on this to run that cell's code. You must not skip any cells, and you must run every cell in chronological order. If you encounter errors, simply go back and run any cell you missed. Another option is to press CTRL + ENTER if you don't want to click the play button.

  2. Runtime Button in the top toolbar. You can also use this button and hit "Run all" to run the entire notebook in 1 go. This will skip all the customization steps, but is a good first try.

Unsloth Docs

Train your own model with Unsloth, an open-source framework for LLM fine-tuning and reinforcement learning.

At Unsloth, our mission is to make AI as accurate and accessible as possible. Train, run, evaluate and save gpt-oss, Llama, DeepSeek, TTS, Qwen, Mistral, Gemma LLMs 2x faster with 70% less VRAM.

Our docs will guide you through running & training your own model locally.

uv, pip install & venv
Docker
Windows
Updating
AMD
Conda
Google Colab
Reddit page
🧬Fine-tuning Guide

Step-by-step on how to fine-tune!

Learn the core basics of training.

❓What Model Should I Use?

Instruct or Base Model?

How big should my dataset be?

🚀LLM Tutorials Directory

How to Run & Fine-tune DeepSeek?

What settings should I set when running Gemma 3?

🤔FAQ + Is Fine-tuning Right For Me?

What can fine-tuning do for me?

RAG vs. Fine-tuning?

📥Installation

How do I install Unsloth locally?

How to update Unsloth?

📈Datasets Guide

How do I structure/prepare my dataset?

How do I collect data?

🛠️Unsloth Requirements

Does Unsloth work on my GPU?

How much VRAM will I need?

🖥️Inference & Deployment

How do I save my model locally?

How do I run my model via Ollama or vLLM?

🧠LoRA Hyperparameters Guide

What happens when I change a parameter?

What parameters should I change?

  • GRPO notebooks

  • ORPO notebook

  • DPO Zephyr notebook

  • KTO notebook

  • SimPO notebook

We're also in 🤗Hugging Face's official docs! We're on the SFT docs and the DPO docs.

An example GRPO config (with the GSPO and Dr GRPO options shown) looks like this, assuming from trl import GRPOConfig:
    training_args = GRPOConfig(
        output_dir = "vlm-grpo-unsloth",
        per_device_train_batch_size = 8,
        gradient_accumulation_steps = 4,
        learning_rate = 5e-6,
        adam_beta1 = 0.9,
        adam_beta2 = 0.99,
        weight_decay = 0.1,
        warmup_ratio = 0.1,
        lr_scheduler_type = "cosine",
        optim = "adamw_8bit",
        # beta = 0.00,
        epsilon = 3e-4,
        epsilon_high = 4e-4,
        num_generations = 8,    
        max_prompt_length = 1024,
        max_completion_length = 1024,
        log_completions = False,
        max_grad_norm = 0.1,
        temperature = 0.9,
        # report_to = "none", # Set to "wandb" if you want to log to Weights & Biases
        num_train_epochs = 2, # For a quick test run, increase for full training
        report_to = "none"
        
        # GSPO is below:
        importance_sampling_level = "sequence",
        
        # Dr GRPO / GAPO etc
        loss_type = "dr_grpo",
    )
    DPO Code

    import os
    os.environ["CUDA_VISIBLE_DEVICES"] = "0" # Optional set GPU device ID

    from unsloth import FastLanguageModel, PatchDPOTrainer
    from unsloth import is_bfloat16_supported
    PatchDPOTrainer()
    import torch
    from transformers import TrainingArguments
    from trl import DPOTrainer

    max_seq_length = 2048 # Choose any sequence length
    
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "unsloth/zephyr-sft-bnb-4bit",
        max_seq_length = max_seq_length,
        dtype = None,
        load_in_4bit = True,
    )
    
    # Do model patching and add fast LoRA weights
    model = FastLanguageModel.get_peft_model(
        model,
        r = 64,
        target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                          "gate_proj", "up_proj", "down_proj",],
        lora_alpha = 64,
        lora_dropout = 0, # Supports any, but = 0 is optimized
        bias = "none",    # Supports any, but = "none" is optimized
        # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
        use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
        random_state = 3407,
        max_seq_length = max_seq_length,
    )
    
    dpo_trainer = DPOTrainer(
        model = model,
        ref_model = None,
        args = TrainingArguments(
            per_device_train_batch_size = 4,
            gradient_accumulation_steps = 8,
            warmup_ratio = 0.1,
            num_train_epochs = 3,
            fp16 = not is_bfloat16_supported(),
            bf16 = is_bfloat16_supported(),
            logging_steps = 1,
            optim = "adamw_8bit",
            seed = 42,
            output_dir = "outputs",
        ),
        beta = 0.1,
        train_dataset = YOUR_DATASET_HERE,
        # eval_dataset = YOUR_DATASET_HERE,
        tokenizer = tokenizer,
        max_length = 1024,
        max_prompt_length = 512,
    )
    dpo_trainer.train()

    🤯A100 Cascade Attention Bug

    As per https://x.com/RichardYRLi/status/1984858850143715759 and https://yingru.notion.site/When-Speed-Kills-Stability-Demystifying-RL-Collapse-from-the-Training-Inference-Mismatch-271211a558b7808d8b12d403fd15edda, older vLLM versions (before 0.11.0) had broken attention mechanisms for A100 and similar GPUs. Please update vLLM! We also by default disable cascade attention in vLLM during Unsloth reinforcement learning if we detect an older vLLM version.

    Different hardware also changes results, where newer and more expensive GPUs have less KL difference between the inference and training sides:

    🔥Using float16 in Unsloth RL

    To use float16 precision in Unsloth GRPO and RL, you just need to set dtype = torch.float16 and we'll take care of the rest!

    https://arxiv.org/pdf/2510.26788
    https://x.com/danielhanchen/status/1985557028295827482
    https://x.com/danielhanchen/status/1985562902531850472
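    For example, a minimal sketch mirroring the from_pretrained call used elsewhere in these docs (model name and other values are illustrative):

    import torch
    from unsloth import FastLanguageModel

    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "unsloth/Qwen3-4B-Base",
        max_seq_length = 2048,
        fast_inference = True,   # vLLM backend for RL rollouts
        dtype = torch.float16,   # float16 instead of bfloat16 for more stable RL
    )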

    Connect / Reconnect T4 button. T4 is the free GPU Google is providing. It's quite powerful!

    The first installation cell looks like below. Remember to click the PLAY button in the brackets [ ]. We grab our open-source GitHub package and install some other packages.

    Colab Example Code

    Unsloth example code to fine-tune gpt-oss-20b:

    🦥 Why Unsloth?
    • Unsloth streamlines model training locally and on Colab/Kaggle, covering loading, quantization, training, evaluation, saving, exporting, and integration with inference engines like Ollama, llama.cpp, and vLLM.

    • We directly collaborate with teams behind gpt-oss, Qwen3, Llama 4, Mistral, Google (Gemma 1–3) and Phi-4, where we’ve fixed critical bugs in models that greatly improved model accuracy.

    • Unsloth is the only training framework to support all model types: vision, text-to-speech (TTS), BERT, reinforcement learning (RL) while remaining highly customizable with flexible chat templates, dataset formatting and ready-to-use notebooks.

    ⭐ Key Features

    • Supports full-finetuning, pretraining, 4-bit, 16-bit and 8-bit training.

    • The most efficient RL library, using 80% less VRAM. Supports GRPO, GSPO etc.

    • Supports all models: TTS, multimodal, BERT and more. Any model that works in transformers works in Unsloth.

    • 0% loss in accuracy - no approximation methods - all exact.

    • Multi-GPU training works already, but a much better version is coming!

    • Unsloth supports Linux, Windows, Colab, Kaggle, NVIDIA, AMD and Intel GPUs. See our system requirements.

    Quickstart

    Install locally with pip (recommended) for Linux or WSL devices: pip install unsloth

    Use our official Docker image: unsloth/unsloth. Read our Docker guide.

    For Windows install instructions, see here.

    What is Fine-tuning and RL? Why?

    Fine-tuning an LLM customizes its behavior, enhances domain knowledge, and optimizes performance for specific tasks. By fine-tuning a pre-trained model (e.g. Llama-3.1-8B) on a dataset, you can:

    • Update Knowledge: Introduce new domain-specific information.

    • Customize Behavior: Adjust the model’s tone, personality, or response style.

    • Optimize for Tasks: Improve accuracy and relevance for specific use cases.

    Reinforcement Learning (RL) is where an "agent" learns to make decisions by interacting with an environment and receiving feedback in the form of rewards or penalties.

    • Action: What the model generates (e.g. a sentence).

    • Reward: A signal indicating how good or bad the model's action was (e.g. did the response follow instructions? was it helpful?).

    • Environment: The scenario or task the model is working on (e.g. answering a user’s question).
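    As a concrete toy illustration, RL trainers such as TRL's GRPOTrainer accept plain Python reward functions that score each generated completion; the signature and check below are a simplified sketch, not the exact convention of any particular notebook:

    def follows_format_reward(completions, **kwargs):
        """Reward +1.0 if the completion contains a closing </answer> tag, else 0.0."""
        return [1.0 if "</answer>" in completion else 0.0 for completion in completions]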

    Example use-cases of fine-tuning or RL:

    • Train LLM to predict if a headline impacts a company positively or negatively.

    • Use historical customer interactions for more accurate and custom responses.

    • Train LLM on legal texts for contract analysis, case law research, and compliance.

    You can think of a fine-tuned model as a specialized agent designed to do specific tasks more effectively and efficiently. Fine-tuning can replicate all of RAG's capabilities, but not vice versa.

    What's new:

    • New 3x Faster Training – Introducing our new Unsloth Triton kernels!

    • 500K Context Fine-tuning – You can now train with >500K context.

    • Devstral 2 – Run & fine-tune Mistral's new coding models.

    Unsloth Requirements

    Here are Unsloth's requirements including system and GPU VRAM requirements.

    System Requirements

    • Operating System: Works on Linux and Windows.

    • Supports NVIDIA GPUs since 2018+ (minimum CUDA Capability 7.0), including Blackwell and RTX 50 series.

    Install Unsloth via pip and uv

    To install Unsloth locally via Pip, follow the steps below:

    Recommended installation method

    Install with pip (recommended) for the latest pip release:

    To use uv:

    To install vLLM and Unsloth together, do:

    To install the latest main branch of Unsloth, do:

    To isolate your installation so it does not break system packages (and to avoid irreparable damage to your system), install inside a virtual environment with venv:

    AMD

    Guide for Fine-tuning LLMs with Unsloth on AMD GPUs.

    Unsloth supports AMD Radeon RX GPUs, MI300X (192GB) GPUs and more.

    1

    Make a new isolated environment (Optional)

    To not break any system packages, you can make an isolated pip environment. Remember to check what Python version you have! Your commands might be pip3, pip3.13, python3, python3.13, etc.

    2

    What Model Should I Use for Fine-tuning?

    Llama, Qwen, Mistral, Phi or?

    When preparing for fine-tuning, one of the first decisions you'll face is selecting the right model. Here's a step-by-step guide to help you choose:

    1

    Choose a model that aligns with your use case

    LoRA Hot Swapping Guide

    🍧 vLLM LoRA Hot Swapping / Dynamic LoRAs

    To enable LoRA serving for at most 4 LoRAs at 1 time (these are hot swapped / changed), first set the environment flag to allow hot swapping:

    Then, serve it with LoRA support:

    To load a LoRA dynamically (set the lora name as well), do:

    To remove it from the pool:

    For example when finetuning with Unsloth:

    Then after training, we save the LoRAs:

    from unsloth import FastLanguageModel
    import torch
    max_seq_length = 2048 # Can increase for longer reasoning traces
    lora_rank = 32 # Larger rank = smarter, but slower
    
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "unsloth/Qwen3-4B-Base",
        max_seq_length = max_seq_length,
        load_in_4bit = False, # False for LoRA 16bit
        fast_inference = True, # Enable vLLM fast inference
        max_lora_rank = lora_rank,
        gpu_memory_utilization = 0.9, # Reduce if out of memory
        
        dtype = torch.float16, # Use torch.float16, torch.bfloat16
    )
    from unsloth import FastLanguageModel, FastModel
    import torch
    from trl import SFTTrainer, SFTConfig
    from datasets import load_dataset
    max_seq_length = 2048 # Supports RoPE Scaling internally, so choose any!
    # Get LAION dataset
    url = "https://huggingface.co/datasets/laion/OIG/resolve/main/unified_chip2.jsonl"
    dataset = load_dataset("json", data_files = {"train" : url}, split = "train")
    
    # 4bit pre quantized models we support for 4x faster downloading + no OOMs.
    fourbit_models = [
        "unsloth/gpt-oss-20b-unsloth-bnb-4bit", #or choose any model
    
    ] # More models at https://huggingface.co/unsloth
    
    model, tokenizer = FastModel.from_pretrained(
        model_name = "unsloth/gpt-oss-20b",
        max_seq_length = 2048, # Choose any for long context!
        load_in_4bit = True,  # 4-bit quantization. False = 16-bit LoRA.
        load_in_8bit = False, # 8-bit quantization
        load_in_16bit = False, # [NEW!] 16-bit LoRA
        full_finetuning = False, # Use for full fine-tuning.
        # token = "hf_...", # use one if using gated models
    )
    
    # Do model patching and add fast LoRA weights
    model = FastLanguageModel.get_peft_model(
        model,
        r = 16,
        target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                          "gate_proj", "up_proj", "down_proj",],
        lora_alpha = 16,
        lora_dropout = 0, # Supports any, but = 0 is optimized
        bias = "none",    # Supports any, but = "none" is optimized
        # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
        use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
        random_state = 3407,
        max_seq_length = max_seq_length,
        use_rslora = False,  # We support rank stabilized LoRA
        loftq_config = None, # And LoftQ
    )
    
    trainer = SFTTrainer(
        model = model,
        train_dataset = dataset,
        tokenizer = tokenizer,
        args = SFTConfig(
            max_seq_length = max_seq_length,
            per_device_train_batch_size = 2,
            gradient_accumulation_steps = 4,
            warmup_steps = 10,
            max_steps = 60,
            logging_steps = 1,
            output_dir = "outputs",
            optim = "adamw_8bit",
            seed = 3407,
        ),
    )
    trainer.train()
    
    # Go to https://docs.unsloth.ai for advanced tips like
    # (1) Saving to GGUF / merging to 16bit for vLLM
    # (2) Continued training from a saved LoRA adapter
    # (3) Adding an evaluation loop / OOMs
    # (4) Customized chat templates

    Unsloth Inference

    Learn how to run your finetuned model with Unsloth's faster inference.

    Unsloth natively supports 2x faster inference. For our inference-only notebook, click here.

    All QLoRA, LoRA and non-LoRA inference paths are 2x faster. This requires no code changes or any new dependencies.

    from unsloth import FastLanguageModel
    from transformers import TextStreamer

    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = 2048,     # same settings you trained with
        dtype = None,
        load_in_4bit = True,
    )
    FastLanguageModel.for_inference(model) # Enable native 2x faster inference

    # Example prompt - replace with your own input
    inputs = tokenizer(["Continue the sequence: 1, 1, 2, 3, 5, 8,"], return_tensors = "pt").to("cuda")
    text_streamer = TextStreamer(tokenizer)
    _ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 64)

    NotImplementedError: A UTF-8 locale is required. Got ANSI

    Sometimes when you execute a cell this error can appear. To solve this, in a new cell, run the below:

    import locale
    locale.getpreferredencoding = lambda: "UTF-8"

    If you're installing Unsloth in Jupyter, Colab, or other notebooks, be sure to prefix the command with !. This isn't necessary when using a terminal.

    Python 3.13 is now supported!

    Uninstall or Reinstall

    If you're still encountering dependency issues with Unsloth, many users have resolved them by force-uninstalling and reinstalling Unsloth:


    Advanced Pip Installation

    Do NOT use this if you have Conda.

    Pip installation is a bit more complex since there are dependency issues. The pip command is different for torch 2.2, 2.3, 2.4, 2.5 and for different CUDA versions.

    For other torch versions, we support torch211, torch212, torch220, torch230, torch240 and for CUDA versions, we support cu118 and cu121 and cu124. For Ampere devices (A100, H100, RTX3090) and above, use cu118-ampere or cu121-ampere or cu124-ampere.

    For example, if you have torch 2.4 and CUDA 12.1, use:

    Another example, if you have torch 2.5 and CUDA 12.4, use:

    And other examples:

    Or, run the below in a terminal to get the optimal pip installation command:

    Or, run the below manually in a Python REPL:

    We can then load the LoRA:
    export VLLM_ALLOW_RUNTIME_LORA_UPDATING=True
    vllm serve unsloth/Llama-3.1-8B-Instruct \
        --quantization fp8 \
        --kv-cache-dtype fp8 \
        --gpu-memory-utilization 0.8 \
        --max-model-len 65536 \
        --enable-lora \
        --max-loras 4 \
        --max-lora-rank 64
    curl -X POST http://localhost:8000/v1/load_lora_adapter \
        -H "Content-Type: application/json" \
        -d '{
            "lora_name": "LORA_NAME",
            "lora_path": "/path/to/LORA"
        }'
    pip install unsloth
    pip install --upgrade pip && pip install uv
    uv pip install unsloth
    uv pip install unsloth vllm
    pip install unsloth
    pip uninstall unsloth unsloth_zoo -y && pip install --no-deps git+https://github.com/unslothai/unsloth_zoo.git && pip install --no-deps git+https://github.com/unslothai/unsloth.git
    apt install python3.10-venv python3.11-venv python3.12-venv python3.13-venv -y
    python -m venv unsloth_env
    source unsloth_env/bin/activate
    pip install --upgrade pip && pip install uv
    uv pip install unsloth
    pip install --upgrade --force-reinstall --no-cache-dir --no-deps unsloth
    pip install --upgrade --force-reinstall --no-cache-dir --no-deps unsloth_zoo
    pip install --upgrade pip
    pip install "unsloth[cu121-torch240] @ git+https://github.com/unslothai/unsloth.git"
    pip install --upgrade pip
    pip install "unsloth[cu124-torch250] @ git+https://github.com/unslothai/unsloth.git"
    pip install "unsloth[cu121-ampere-torch240] @ git+https://github.com/unslothai/unsloth.git"
    pip install "unsloth[cu118-ampere-torch240] @ git+https://github.com/unslothai/unsloth.git"
    pip install "unsloth[cu121-torch240] @ git+https://github.com/unslothai/unsloth.git"
    pip install "unsloth[cu118-torch240] @ git+https://github.com/unslothai/unsloth.git"
    
    pip install "unsloth[cu121-torch230] @ git+https://github.com/unslothai/unsloth.git"
    pip install "unsloth[cu121-ampere-torch230] @ git+https://github.com/unslothai/unsloth.git"
    
    pip install "unsloth[cu121-torch250] @ git+https://github.com/unslothai/unsloth.git"
    pip install "unsloth[cu124-ampere-torch250] @ git+https://github.com/unslothai/unsloth.git"
    wget -qO- https://raw.githubusercontent.com/unslothai/unsloth/main/unsloth/_auto_install.py | python -
    # Licensed under the Apache License, Version 2.0 (the "License")
    try: import torch
    except: raise ImportError('Install torch via `pip install torch`')
    from packaging.version import Version as V
    import re
    v = V(re.match(r"[0-9\.]{3,}", torch.__version__).group(0))
    cuda = str(torch.version.cuda)
    is_ampere = torch.cuda.get_device_capability()[0] >= 8
    USE_ABI = torch._C._GLIBCXX_USE_CXX11_ABI
    if cuda not in ("11.8", "12.1", "12.4", "12.6", "12.8", "13.0"): raise RuntimeError(f"CUDA = {cuda} not supported!")
    if   v <= V('2.1.0'): raise RuntimeError(f"Torch = {v} too old!")
    elif v <= V('2.1.1'): x = 'cu{}{}-torch211'
    elif v <= V('2.1.2'): x = 'cu{}{}-torch212'
    elif v  < V('2.3.0'): x = 'cu{}{}-torch220'
    elif v  < V('2.4.0'): x = 'cu{}{}-torch230'
    elif v  < V('2.5.0'): x = 'cu{}{}-torch240'
    elif v  < V('2.5.1'): x = 'cu{}{}-torch250'
    elif v <= V('2.5.1'): x = 'cu{}{}-torch251'
    elif v  < V('2.7.0'): x = 'cu{}{}-torch260'
    elif v  < V('2.7.9'): x = 'cu{}{}-torch270'
    elif v  < V('2.8.0'): x = 'cu{}{}-torch271'
    elif v  < V('2.8.9'): x = 'cu{}{}-torch280'
    elif v  < V('2.9.1'): x = 'cu{}{}-torch290'
    elif v  < V('2.9.2'): x = 'cu{}{}-torch291'
    else: raise RuntimeError(f"Torch = {v} too new!")
    if v > V('2.6.9') and cuda not in ("11.8", "12.6", "12.8", "13.0"): raise RuntimeError(f"CUDA = {cuda} not supported!")
    x = x.format(cuda.replace(".", ""), "-ampere" if False else "") # is_ampere is broken due to flash-attn
    print(f'pip install --upgrade pip && pip install --no-deps git+https://github.com/unslothai/unsloth-zoo.git && pip install "unsloth[{x}] @ git+https://github.com/unslothai/unsloth.git" --no-build-isolation')
    curl -X POST http://localhost:8000/v1/unload_lora_adapter \
        -H "Content-Type: application/json" \
        -d '{
            "lora_name": "LORA_NAME"
        }'
    from unsloth import FastLanguageModel
    import torch
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "unsloth/Llama-3.1-8B-Instruct",
        max_seq_length = 2048,
        load_in_4bit = True,
    )
    model = FastLanguageModel.get_peft_model(model)
    model.save_pretrained("finetuned_lora")
    tokenizer.save_pretrained("finetuned_lora")
    curl -X POST http://localhost:8000/v1/load_lora_adapter \
        -H "Content-Type: application/json" \
        -d '{
            "lora_name": "LORA_NAME_finetuned_lora",
            "lora_path": "finetuned_lora"
        }'
    • Blackwell, RTX 50 and Unsloth

    • DGX Spark and Unsloth

  • Minimum CUDA Capability 7.0 (V100, T4, Titan V, RTX 20 & 50, A100, H100, L40 etc.) Check your GPU! GTX 1070 and 1080 work, but are slow.

  • The official Unsloth Docker image unsloth/unsloth is available on Docker Hub

    • Run Unsloth models in Docker

  • Unsloth works on AMD and Intel GPUs! Apple Silicon / MLX support is in the works.

  • If you have different versions of torch, transformers etc., pip install unsloth will automatically install all the latest versions of those libraries so you don't need to worry about version compatibility.

  • Your device should have xformers, torch, BitsandBytes and triton support.

  • Python 3.13 is now supported!

    Fine-tuning VRAM requirements:

    How much GPU memory do I need for LLM fine-tuning using Unsloth?

    A common cause of OOM (out of memory) errors is setting your batch size too high. Set it to 1, 2, or 3 to use less VRAM.

    For context length benchmarks, see here.

    Check this table for VRAM requirements sorted by model parameters and fine-tuning method. QLoRA uses 4-bit, LoRA uses 16-bit. Keep in mind that sometimes more VRAM is required depending on the model so these numbers are the absolute minimum:

    Model parameters    QLoRA (4-bit) VRAM    LoRA (16-bit) VRAM
    3B                  3.5 GB                8 GB
    7B                  5 GB                  19 GB
    8B                  6 GB                  22 GB
    9B                  6.5 GB                24 GB
    11B                 7.5 GB                29 GB
    14B                 8.5 GB                33 GB
    27B                 22 GB                 64 GB
    32B                 26 GB                 76 GB
    40B                 30 GB                 96 GB
    70B                 41 GB                 164 GB
    81B                 48 GB                 192 GB
    90B                 53 GB                 212 GB
    405B                237 GB                950 GB


    Install PyTorch

    Install the latest PyTorch, TorchAO, Xformers from https://pytorch.org/

    3

    Install Unsloth

    Install Unsloth's dedicated AMD branch

    And that's it! Try some examples in our Unsloth Notebooks page!

    🔢Reinforcement Learning on AMD GPUs

    You can use our 📒gpt-oss RL auto win 2048 example on an MI300X (192GB) GPU. The goal is to play the 2048 game automatically and win it with RL. The LLM (gpt-oss-20b) devises a strategy to win the 2048 game on its own, and we assign high rewards to winning strategies and low rewards to failing ones.

    The reward over time is increasing after around 300 steps or so!

    The goal for RL is to maximize the average reward to win the 2048 game.

    We used an AMD MI300X machine (192GB) to run the 2048 RL example with Unsloth, and it worked well!

    You can also use our 📒automatic kernel gen RL notebook, also with gpt-oss, to auto-create matrix multiplication kernels in Python. The notebook also devises multiple methods to counteract reward hacking.

    The prompt we used to auto create these kernels was:

    The RL process learns for example how to apply the Strassen algorithm for faster matrix multiplication inside of Python.

    🛠️Troubleshooting

    As of October 2025, bitsandbytes support on AMD is under development - you might get HSA_STATUS_ERROR_EXCEPTION: An HSAIL operation resulted in a hardware exception errors. We automatically disable bitsandbytes internally in Unsloth for versions 0.48.2.dev0 and above until a fix is provided. This means load_in_4bit = True will instead use 16-bit LoRA. Full finetuning also works via full_finetuning = True.

    To force 4bit, you need to specify the actual model name like unsloth/gemma-3-4b-it-unsloth-bnb-4bit and set use_exact_model_name = True as an extra argument within FastLanguageModel.from_pretrained etc.

    AMD GPUs also need the bitsandbytes blocksize to be 128 and not 64 - this also means our pre-quantized models (for example unsloth/Llama-3.2-1B-Instruct-unsloth-bnb-4bit) from HuggingFace for now will not work - we auto switch to downloading the full BF16 weights, then quantize on the fly if we detect an AMD GPU.

    📚AMD Free One-click notebooks

    AMD provides one-click notebooks equipped with free 192GB VRAM MI300X GPUs through their Dev Cloud. Train large models completely for free (no signup or credit card required):

    • Qwen3 (32B)

    • Llama 3.3 (70B)

    • Qwen3 (14B)

    • Mistral v0.3 (7B)

    • - Inference

    You can use any Unsloth notebook on AMD's cloud by swapping the Colab prefix for https://oneclickamd.ai, i.e. changing the link from https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Gemma3_(270M).ipynb to https://oneclickamd.ai/github/unslothai/notebooks/blob/main/nb/Gemma3_(270M).ipynb

    E.g. For image-based training, select a vision model such as Llama 3.2 Vision. For code datasets, opt for a specialized model like Qwen Coder 2.5.

  • Licensing and Requirements: Different models may have specific licensing terms and system requirements. Be sure to review these carefully to avoid compatibility issues.

    2

    Assess your storage, compute capacity and dataset

    • Use our VRAM guideline to determine the VRAM requirements for the model you’re considering.

    • Your dataset will determine the type of model you should use and how long training will take.

    3

    Select a Model and Parameters

    • We recommend using the latest model for the best performance and capabilities. For instance, as of January 2025, the leading 70B model is Llama 3.3.

    • You can stay up to date by exploring our model catalog to find the newest and relevant options.

    4

    Choose Between Base and Instruct Models

    Further details below:

    Instruct or Base Model?

    When preparing for fine-tuning, one of the first decisions you'll face is whether to use an instruct model or a base model.

    Instruct Models

    Instruct models have already been trained to follow instructions, making them ready to use without any fine-tuning. These models, including GGUFs and others commonly available, are optimized for direct usage and respond effectively to prompts right out of the box. Instruct models work with conversational chat templates like ChatML or ShareGPT.

    Base Models

    Base models, on the other hand, are the original pre-trained versions without instruction fine-tuning. These are specifically designed for customization through fine-tuning, allowing you to adapt them to your unique needs. Base models are compatible with instruction-style templates like Alpaca or Vicuna, but they generally do not support conversational chat templates out of the box.

    Should I Choose Instruct or Base?

    The decision often depends on the quantity, quality, and type of your data:

    • 1,000+ Rows of Data: If you have a large dataset with over 1,000 rows, it's generally best to fine-tune the base model.

    • 300–1,000 Rows of High-Quality Data: With a medium-sized, high-quality dataset, fine-tuning either the base or the instruct model is a viable option.

    • Less than 300 Rows: For smaller datasets, the instruct model is typically the better choice. Fine-tuning the instruct model enables it to align with specific needs while preserving its built-in instructional capabilities. This ensures it can follow general instructions without additional input unless you intend to significantly alter its functionality.

    • For information on how big your dataset should be, see our Datasets Guide.

    Fine-tuning models with Unsloth

    You can change the model name to whichever model you like by matching it with the model's name on Hugging Face, e.g. 'unsloth/llama-3.1-8b-unsloth-bnb-4bit'.

    We recommend starting with Instruct models, as they allow direct fine-tuning using conversational chat templates (ChatML, ShareGPT etc.) and require less data compared to Base models (which use Alpaca, Vicuna etc.). Learn more about the differences between instruct and base models here.
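    For instance, a conversational example can be rendered with an instruct model's own chat template via the standard tokenizer API (a small sketch; the message content is made up):

    from unsloth import FastLanguageModel

    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "unsloth/Llama-3.1-8B-Instruct",
        max_seq_length = 2048,
        load_in_4bit = True,
    )

    messages = [
        {"role": "user",      "content": "What is 2 + 2?"},
        {"role": "assistant", "content": "2 + 2 equals 4."},
    ]
    # Renders the conversation using the model's built-in chat template
    text = tokenizer.apply_chat_template(messages, tokenize = False)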

    • Model names ending in unsloth-bnb-4bit indicate they are Unsloth dynamic 4-bit quants. These models consume slightly more VRAM than standard BitsAndBytes 4-bit models but offer significantly higher accuracy.

    • If a model name ends with just bnb-4bit, without "unsloth", it refers to a standard BitsAndBytes 4-bit quantization.

    • Models with no suffix are in their original 16-bit or 8-bit formats. While they are the original models from the official model creators, we sometimes include important fixes - such as chat template or tokenizer fixes. So it's recommended to use our versions when available.

    Experimentation is Key

    We recommend experimenting with both models when possible. Fine-tune each one and evaluate the outputs to see which aligns better with your goals.

    FAQ + Is Fine-tuning Right For Me?

    If you're unsure whether fine-tuning is right for you, see below! Learn about fine-tuning misconceptions, how it compares to RAG, and more:

    Understanding Fine-Tuning

    Fine-tuning an LLM customizes its behavior, deepens its domain expertise, and optimizes its performance for specific tasks. By refining a pre-trained model (e.g. Llama-3.1-8B) with specialized data, you can:

    • Update Knowledge – Introduce new, domain-specific information that the base model didn’t originally include.

    • Customize Behavior – Adjust the model’s tone, personality, or response style to fit specific needs or a brand voice.

    • Optimize for Tasks – Improve accuracy and relevance on particular tasks or queries your use-case requires.

    Think of fine-tuning as creating a specialized expert out of a generalist model. Some debate whether to use Retrieval-Augmented Generation (RAG) instead of fine-tuning, but fine-tuning can incorporate knowledge and behaviors directly into the model in ways RAG cannot. In practice, combining both approaches yields the best results - leading to greater accuracy, better usability, and fewer hallucinations.

    Real-World Applications of Fine-Tuning

    Fine-tuning can be applied across various domains and needs. Here are a few practical examples of how it makes a difference:

    • Sentiment Analysis for Finance – Train an LLM to determine if a news headline impacts a company positively or negatively, tailoring its understanding to financial context.

    • Customer Support Chatbots – Fine-tune on past customer interactions to provide more accurate and personalized responses in a company’s style and terminology.

    • Legal Document Assistance – Fine-tune on legal texts (contracts, case law, regulations) for tasks like contract analysis, case law research, or compliance support, ensuring the model uses precise legal language.

    The Benefits of Fine-Tuning

    Fine-tuning offers several notable benefits beyond what a base model or a purely retrieval-based system can provide:

    Fine-Tuning vs. RAG: What’s the Difference?

    Fine-tuning can do mostly everything RAG can - but not the other way around. During training, fine-tuning embeds external knowledge directly into the model. This allows the model to handle niche queries, summarize documents, and maintain context without relying on an outside retrieval system. That's not to say RAG lacks advantages: it excels at accessing up-to-date information from external databases. It is in fact possible to retrieve fresh data with fine-tuning as well; however, it is better to combine RAG with fine-tuning for efficiency.

    Task-Specific Mastery

    Fine-tuning deeply integrates domain knowledge into the model. This makes it highly effective at handling structured, repetitive, or nuanced queries, scenarios where RAG-alone systems often struggle. In other words, a fine-tuned model becomes a specialist in the tasks or content it was trained on.

    Independence from Retrieval

    A fine-tuned model has no dependency on external data sources at inference time. It remains reliable even if a connected retrieval system fails or is incomplete, because all needed information is already within the model’s own parameters. This self-sufficiency means fewer points of failure in production.

    Faster Responses

    Fine-tuned models don’t need to call out to an external knowledge base during generation. Skipping the retrieval step means they can produce answers much more quickly. This speed makes fine-tuned models ideal for time-sensitive applications where every second counts.

    Custom Behavior and Tone

    Fine-tuning allows precise control over how the model communicates. This ensures the model’s responses stay consistent with a brand’s voice, adhere to regulatory requirements, or match specific tone preferences. You get a model that not only knows what to say, but how to say it in the desired style.

    Reliable Performance

    Even in a hybrid setup that uses both fine-tuning and RAG, the fine-tuned model provides a reliable fallback. If the retrieval component fails to find the right information or returns incorrect data, the model’s built-in knowledge can still generate a useful answer. This guarantees more consistent and robust performance for your system.

    Common Misconceptions

    Despite fine-tuning’s advantages, a few myths persist. Let’s address two of the most common misconceptions about fine-tuning:

    Does Fine-Tuning Add New Knowledge to a Model?

    Yes - it absolutely can. A common myth suggests that fine-tuning doesn’t introduce new knowledge, but in reality it does. If your fine-tuning dataset contains new domain-specific information, the model will learn that content during training and incorporate it into its responses. In effect, fine-tuning can and does teach the model new facts and patterns from scratch.

    Is RAG Always Better Than Fine-Tuning?

    Not necessarily. Many assume RAG will consistently outperform a fine-tuned model, but that’s not the case when fine-tuning is done properly. In fact, a well-tuned model often matches or even surpasses RAG-based systems on specialized tasks. Claims that “RAG is always better” usually stem from fine-tuning attempts that weren’t optimally configured - for example, using incorrect or insufficient training.

    Unsloth takes care of these complexities by automatically selecting the best parameter configurations for you. All you need is a good-quality dataset, and you'll get a fine-tuned model that performs to its fullest potential.

    Is Fine-Tuning Expensive?

    Not at all! While full fine-tuning or pretraining can be costly, these are not necessary (pretraining is especially not necessary). In most cases, LoRA or QLoRA fine-tuning can be done for minimal cost. In fact, with Unsloth's free notebooks for Colab or Kaggle, you can fine-tune models without spending a dime. Better yet, you can even fine-tune locally on your own device.

    FAQ:

    Why You Should Combine RAG & Fine-Tuning

    Instead of choosing between RAG and fine-tuning, consider using both together for the best results. Combining a retrieval system with a fine-tuned model brings out the strengths of each approach. Here’s why:

    • Task-Specific Expertise – Fine-tuning excels at specialized tasks or formats (making the model an expert in a specific area), while RAG keeps the model up-to-date with the latest external knowledge.

    • Better Adaptability – A fine-tuned model can still give useful answers even if the retrieval component fails or returns incomplete information. Meanwhile, RAG ensures the system stays current without requiring you to retrain the model for every new piece of data.

    • Efficiency – Fine-tuning provides a strong foundational knowledge base within the model, and RAG handles dynamic or quickly-changing details without the need for exhaustive re-training from scratch. This balance yields an efficient workflow and reduces overall compute costs.

    LoRA vs. QLoRA: Which One to Use?

    When it comes to implementing fine-tuning, two popular techniques can dramatically cut down the compute and memory requirements: LoRA and QLoRA. Here’s a quick comparison of each:

    • LoRA (Low-Rank Adaptation) – Fine-tunes only a small set of additional “adapter” weight matrices (in 16-bit precision), while leaving most of the original model unchanged. This significantly reduces the number of parameters that need updating during training.

    • QLoRA (Quantized LoRA) – Combines LoRA with 4-bit quantization of the model weights, enabling efficient fine-tuning of very large models on minimal hardware. By using 4-bit precision where possible, it dramatically lowers memory usage and compute overhead.

    We recommend starting with QLoRA, as it’s one of the most efficient and accessible methods available. Thanks to Unsloth’s quants, the accuracy loss compared to standard 16-bit LoRA fine-tuning is now negligible.
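    In Unsloth terms, the difference is mostly one flag when loading the model; a sketch using the same from_pretrained / get_peft_model pattern as the examples in these docs (values are illustrative):

    from unsloth import FastLanguageModel

    # QLoRA: base weights quantized to 4-bit, LoRA adapters trained on top.
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "unsloth/llama-3.1-8b-unsloth-bnb-4bit",
        max_seq_length = 2048,
        load_in_4bit = True,   # set to False for 16-bit LoRA instead of QLoRA
    )
    model = FastLanguageModel.get_peft_model(model, r = 16, lora_alpha = 16)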

    Experimentation is Key

    There’s no single “best” approach to fine-tuning - only best practices for different scenarios. It’s important to experiment with different methods and configurations to find what works best for your dataset and use case. A great starting point is QLoRA (4-bit), which offers a very cost-effective, resource-friendly way to fine-tune models without heavy computational requirements.

    Troubleshooting Inference

    If you're experiencing issues when running or saving your model.

    Running in Unsloth works well, but after exporting & running on other platforms, the results are poor

    You might sometimes encounter an issue where your model runs and produces good results on Unsloth, but when you use it on another platform like Ollama or vLLM, the results are poor or you might get gibberish, endless/infinite generations or repeated outputs.

    • The most common cause of this error is using an incorrect chat template. It’s essential to use the SAME chat template that was used when training the model in Unsloth and later when you run it in another framework, such as llama.cpp or Ollama. When inferencing from a saved model, it's crucial to apply the correct template.

    • You must use the correct eos token. If not, you might get gibberish on longer generations.

    • It might also be because your inference engine adds an unnecessary "start of sequence" token (or, conversely, omits a required one), so ensure you check both hypotheses!

    • Use our conversational notebooks to force the chat template - this will fix most issues.

      • Qwen-3 14B Conversational notebook

      • Gemma-3 4B Conversational notebook

      • Llama-3.2 3B Conversational notebook

    Saving to safetensors, not bin format in Colab

    We save to .bin format in Colab since it's roughly 4x faster, but set safe_serialization = None to force saving to .safetensors. So model.save_pretrained(..., safe_serialization = None) or model.push_to_hub(..., safe_serialization = None)

    If saving to GGUF or vLLM 16bit crashes

    You can try reducing the maximum GPU usage during saving by changing maximum_memory_usage.

    The default is model.save_pretrained(..., maximum_memory_usage = 0.75). Reduce it to say 0.5 to use 50% of GPU peak memory or lower. This can reduce OOM crashes during saving.

    Inference & Deployment

    Learn how to save your finetuned model so you can run it in your favorite inference engine.

    You can also run your fine-tuned models by using Unsloth's 2x faster inference.

    Multi-GPU Fine-tuning with Unsloth

    Learn how to fine-tune LLMs on multiple GPUs and parallelism with Unsloth.

    Unsloth currently supports multi-GPU setups through libraries like Accelerate and DeepSpeed. This means you can already leverage parallelism methods such as FSDP and DDP with Unsloth.

    See our new Distributed Data Parallel (DDP) multi-GPU Guide here.

    We know that the process can be complex and requires manual setup. We’re working hard to make multi-GPU support much simpler and more user-friendly, and we’ll be announcing official multi-GPU support for Unsloth soon.

    For now, you can use our Magistral-2509 Kaggle notebook as an example which utilizes multi-GPU Unsloth to fit the 24B parameter model or our DDP guide.

    In the meantime, to enable multi GPU for DDP, do the following:

    1. Create your training script as train.py (or similar). For example, you can use one of our training scripts created from our various notebooks!

    2. Run accelerate launch train.py or torchrun --nproc_per_node N_GPUS train.py where N_GPUS is the number of GPUs you have.

    Pipeline / model splitting loading is also allowed, so if you do not have enough VRAM for 1 GPU to load say Llama 70B, no worries - we will split the model for you on each GPU! To enable this, use the device_map = "balanced" flag:

    Stay tuned for our official announcement! For more details, check out our ongoing Pull Request discussing multi-GPU support.

    Docker

    Install Unsloth using our official Docker container

    Learn how to use our Docker containers with all dependencies pre-installed for immediate installation. No setup required, just run and start training!

    Unsloth Docker image:

    You can now use our main Docker image unsloth/unsloth for Blackwell and 50-series GPUs - no separate image needed.

    How to Fine-tune LLMs with Unsloth & Docker

    Learn how to fine-tune LLMs or do Reinforcement Learning (RL) with Unsloth's Docker image.

    Local training can be complex due to dependency hell or breaking environments. Unsloth's Docker image can bypass these issues. No setup is needed: pull and run the image and start training.

    • Unsloth official Docker image: unsloth/unsloth

    Why Use Unsloth & Docker?

    Unsloth's Docker image is stable, up-to-date, and works in environments like Windows.

    • Fully contained dependencies keep your system clean. Runs safely without root.

    500K Context Length Fine-tuning

    Learn how to enable >500K token context window fine-tuning with Unsloth.

    We're introducing new algorithms in Unsloth that push the limits of long-context training for any LLM and VLM. Training LLMs like gpt-oss-20b can now reach 500K+ context lengths on a single 80GB H100 GPU (compared to 80K previously), with no accuracy degradation.

    You can reach >750K context windows on a B200 192GB GPU.

    Try 500K-context gpt-oss-20b fine-tuning in our free notebook.

    We’ve significantly improved how Unsloth handles memory usage patterns, speed, and context lengths:

    • 60% lower VRAM use with 3.2x longer context via Unsloth's new algorithms

    How to Run Local LLMs with Docker: Step-by-Step Guide

    Learn how to run Large Language Models (LLMs) with Docker & Unsloth on your local device.

    You can now run any model, including Unsloth GGUFs, on Mac, Windows or Linux with a single line of code or no code at all. We collabed with Docker to simplify model deployment, and Unsloth now powers most GGUF models on Docker.

    Before you start, make sure to look over our requirements and recommended settings for optimizing performance when running LLMs on your device.

    To get started, run OpenAI's gpt-oss with a single command:

    Or to run a specific model / quant from Hugging Face:

    Devstral 2: How to Run Guide

    Guide for locally running Mistral's Devstral 2 models: 123B-Instruct-2512 and Small-2-24B-Instruct-2512.

    Devstral 2 are Mistral's new coding and agentic LLMs for software engineering, available in 24B and 123B sizes. The 123B model achieves SOTA in SWE-bench, coding, tool-calling and agent use-cases. The 24B model fits in 25GB RAM/VRAM and the 123B fits in 128GB.

    We've resolved issues in Devstral's chat template, and results should be significantly better. The 24B & 123B have been updated.

    Devstral 2 supports vision capabilities and a 256k context window. You can now run and fine-tune both models locally with Unsloth.

    All Devstral 2 uploads use our Unsloth methodology, delivering the best performance on benchmarks such as 5-shot MMLU.

    vLLM Deployment & Inference Guide

    Guide on saving and deploying LLMs to vLLM for serving LLMs in production

    💻Installing vLLM

    For NVIDIA GPUs, use uv and run:

    For AMD GPUs, please use the nightly Docker image: rocm/vllm-dev:nightly

    For the nightly branch for NVIDIA GPUs, run:

    See the vLLM docs for more details.

    Continued Pretraining

    Also known as Continued Finetuning. Unsloth allows you to continually pretrain so a model can learn a new language.

    • The text completion notebook is for continued pretraining/raw text.

    • The continued pretraining notebook is for learning another language.

    You can read more about continued pretraining and our release in our blog post.

    Unsloth Environment Flags

    Advanced flags which might be useful if you see breaking finetunes, or if you want to turn certain features off.

    Environment variable
    Purpose
    pip install --upgrade torch==2.8.0 pytorch-triton-rocm torchvision torchaudio torchao==0.13.0 xformers --index-url https://download.pytorch.org/whl/rocm6.4
    pip install --no-deps unsloth unsloth-zoo
    pip install --no-deps git+https://github.com/unslothai/unsloth-zoo.git
    pip install "unsloth[amd] @ git+https://github.com/unslothai/unsloth"
    Create a new fast matrix multiplication function using only native Python code.
    You are given a list of list of numbers.
    Output your new function in backticks using the format below:
    ```
    python
    def matmul(A, B):
        return ...
    ```
    apt install python3.10-venv python3.11-venv python3.12-venv python3.13-venv -y
    
    python -m venv unsloth_env
    source unsloth_env/bin/activate


    llama.cpp - Saving to GGUF

    Ollama

    vLLM

    SGLang

    Unsloth Inference

    Troubleshooting

    vLLM Engine Arguments

    LoRA Hotswapping


  • Phi-4 14B Conversational notebook

  • Mistral v0.3 7B Conversational notebook

  • More notebooks in our notebooks repo.
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        "unsloth/Llama-3.3-70B-Instruct",
        load_in_4bit = True,
        device_map = "balanced",
    )
    🚚Deploying vLLM models

    After saving your fine-tune, you can simply do:

    🚒vLLM Deployment Server Flags, Engine Arguments & Options

    Some important server flags to use are at vLLM Deployment Server Flags, Engine Arguments & Options

    🦥Deploying Unsloth finetunes in vLLM

    After fine-tuning (see our Fine-tuning Guide) or using our notebooks at Unsloth Notebooks, you can save or deploy your models directly through vLLM within a single workflow. An example Unsloth finetuning script:

    To save to 16-bit for vLLM, use:

    To save just the LoRA adapters, either use:

    Or just use our builtin function to do that:

    To merge to 4-bit to load on Hugging Face, first call merged_4bit. Then use merged_4bit_forced if you are certain you want to merge to 4-bit. We highly discourage this unless you know what you are going to do with the 4-bit model (e.g. for DPO training or for Hugging Face's online inference engine).

    Then to load the finetuned model in vLLM in another terminal:

    You might have to provide the full path if the above doesn't work ie:

    See other content:

    vLLM Engine Arguments

    LoRA Hot Swapping Guide

    vLLM docs
    What is Continued Pretraining?

    Continued or continual pretraining (CPT) is necessary to "steer" the language model to understand new domains of knowledge, or out-of-distribution domains. Base models like Llama-3 8b or Mistral 7b are first pretrained on gigantic datasets of trillions of tokens (Llama-3, for example, was trained on 15 trillion tokens).

    But sometimes these models have not been well trained on other languages, or text specific domains, like law, medicine or other areas. So continued pretraining (CPT) is necessary to make the language model learn new tokens or datasets.

    Advanced Features:

    Loading LoRA adapters for continued finetuning

    If you saved a LoRA adapter through Unsloth, you can also continue training using your LoRA weights. The optimizer state will be reset as well. To load even optimizer states to continue finetuning, see the next section.

    Continued Pretraining & Finetuning the lm_head and embed_tokens matrices

    Add lm_head and embed_tokens. For Colab, sometimes you will go out of memory for Llama-3 8b. If so, just add lm_head.

    Then use 2 different learning rates - a 2-10x smaller one for the lm_head or embed_tokens like so:

    text completion notebook
    continued pretraining notebook
    blog post

    os.environ["UNSLOTH_STUDIO_DISABLED"] = "1"

    Disables extra features.

    os.environ["UNSLOTH_COMPILE_DEBUG"] = "1"

Turns on extremely verbose torch.compile logs.

    os.environ["UNSLOTH_COMPILE_MAXIMUM"] = "0"

Enables maximum torch.compile optimizations - not recommended.

    os.environ["UNSLOTH_COMPILE_IGNORE_ERRORS"] = "1"

    Can turn this off to enable fullgraph parsing.

    os.environ["UNSLOTH_FULLGRAPH"] = "0"

    Enable torch.compile fullgraph mode

    os.environ["UNSLOTH_DISABLE_AUTO_UPDATES"] = "1"

    Forces no updates to unsloth-zoo

Another possibility is that the model uploads themselves are corrupted, but this is unlikely. Try the following:

    os.environ["UNSLOTH_RETURN_LOGITS"] = "1"

    Forcibly returns logits - useful for evaluation if logits are needed.

    os.environ["UNSLOTH_COMPILE_DISABLE"] = "1"

    Disables auto compiler. Could be useful to debug incorrect finetune results.

    os.environ["UNSLOTH_DISABLE_FAST_GENERATION"] = "1"

    Disables fast generation for generic models.

    os.environ["UNSLOTH_ENABLE_LOGGING"] = "1"

    Enables auto compiler logging - useful to see which functions are compiled or not.

    os.environ["UNSLOTH_FORCE_FLOAT32"] = "1"

    On float16 machines, use float32 and not float16 mixed precision. Useful for Gemma 3.
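
For example, a minimal sketch of setting a couple of these flags in a training script (the flags chosen are just illustrations from the list above; set them before importing Unsloth so they take effect):

```python
import os

# Set Unsloth flags before importing unsloth so they take effect
os.environ["UNSLOTH_ENABLE_LOGGING"] = "1"   # log which functions the auto compiler handles
os.environ["UNSLOTH_RETURN_LOGITS"] = "1"    # force logits to be returned for evaluation

from unsloth import FastLanguageModel
```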

    pip install --upgrade pip
    pip install uv
    uv pip install -U vllm --torch-backend=auto
    pip install --upgrade pip
    pip install uv
    uv pip install -U vllm --torch-backend=auto --extra-index-url https://wheels.vllm.ai/nightly
    vllm serve unsloth/gpt-oss-120b
    from unsloth import FastLanguageModel
    import torch
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "unsloth/gpt-oss-20b",
        max_seq_length = 2048,
        load_in_4bit = True,
    )
    model = FastLanguageModel.get_peft_model(model)
    model.save_pretrained_merged("finetuned_model", tokenizer, save_method = "merged_16bit")
    ## OR to upload to HuggingFace:
    model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")
    model.save_pretrained("finetuned_lora")
    tokenizer.save_pretrained("finetuned_lora")
    model.save_pretrained_merged("finetuned_model", tokenizer, save_method = "lora")
    ## OR to upload to HuggingFace
    model.push_to_hub_merged("hf/model", tokenizer, save_method = "lora", token = "")
    model.save_pretrained_merged("finetuned_model", tokenizer, save_method = "merged_4bit")
    ## To upload to HuggingFace:
    model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_4bit", token = "")
    vllm serve finetuned_model
    vllm serve /mnt/disks/daniel/finetuned_model
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "LORA_MODEL_NAME",
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
    )
    trainer = Trainer(...)
    trainer.train()
    model = FastLanguageModel.get_peft_model(
        model,
        r = 16,
        target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                          "gate_proj", "up_proj", "down_proj",
                          "lm_head", "embed_tokens",],
        lora_alpha = 16,
    )
    from unsloth import UnslothTrainer, UnslothTrainingArguments
    
    trainer = UnslothTrainer(
        ....
        args = UnslothTrainingArguments(
            ....
            learning_rate = 5e-5,
            embedding_learning_rate = 5e-6, # 2-10x smaller than learning_rate
        ),
    )
    model, tokenizer = FastVisionModel.from_pretrained(
        "Qwen/Qwen2-VL-7B-Instruct",
        use_exact_model_name = True,
    )
    ⚡ Quickstart
    1

    Install Docker and NVIDIA Container Toolkit.

    Install Docker via Linux or Desktop (other). Then install NVIDIA Container Toolkit:

    2

    Run the container.

    unsloth/unsloth is Unsloth's only Docker image. For Blackwell and 50-series GPUs, use this same image - no separate one needed.

    3

    Access Jupyter Lab

Go to http://localhost:8888 and open Unsloth.

Access the unsloth-notebooks tab to see Unsloth notebooks.

    4

    Start training with Unsloth

If you're new, follow our step-by-step Fine-tuning Guide or RL Guide, or just save/copy any of our premade notebooks.

    📂 Container Structure

    • /workspace/work/ — Your mounted work directory

    • /workspace/unsloth-notebooks/ — Example fine-tuning notebooks

    • /home/unsloth/ — User home directory

    📖 Usage Example

    Full Example

    Setting up SSH Key

    If you don't have an SSH key pair:

    🦥Why Unsloth Containers?

    • Reliable: Curated environment with stable & maintained package versions. Just 7 GB compressed (vs. 10–11 GB elsewhere)

    • Ready-to-use: Pre-installed notebooks in /workspace/unsloth-notebooks/

    • Secure: Runs safely as a non-root user

    • Universal: Compatible with all transformer-based models (TTS, BERT, etc.)

    ⚙️ Advanced Settings

    Variable
    Description
    Default

    JUPYTER_PASSWORD

    Jupyter Lab password

    unsloth

    JUPYTER_PORT

    Jupyter Lab port inside container

    8888

    SSH_KEY

    SSH public key for authentication

    None

    USER_PASSWORD

    Password for unsloth user (sudo)

    unsloth

    • Jupyter Lab: -p 8000:8888

    • SSH access: -p 2222:22

    Important: Use volume mounts to preserve your work between container runs.

    🔒 Security Notes

    • Container runs as non-root unsloth user by default

    • Use USER_PASSWORD for sudo operations inside container

    • SSH access requires public key authentication

    unsloth/unsloth

    Use locally or on any platform with pre-installed notebooks.

    You can now use our main Docker image unsloth/unsloth for Blackwell and 50-series GPUs - no separate image needed.

    ⚡ Step-by-Step Tutorial

    1

    Install Docker and NVIDIA Container Toolkit.

    Install Docker via Linux or Desktop (other). Then install NVIDIA Container Toolkit:

    2

    Run the container.

    unsloth/unsloth is Unsloth's only Docker image. For Blackwell and 50-series GPUs, use this same image - no separate image needed. If using DGX Spark, you'll need to follow our DGX guide.

    3

    Access Jupyter Lab

Go to http://localhost:8888 and open Unsloth.

Access the unsloth-notebooks tab to see Unsloth notebooks.

    4

    Start training with Unsloth

If you're new, follow our step-by-step Fine-tuning Guide or RL Guide, or just save/copy any of our premade notebooks.

    📂 Container Structure

    • /workspace/work/ — Your mounted work directory

    • /workspace/unsloth-notebooks/ — Example fine-tuning notebooks

    • /home/unsloth/ — User home directory

    📖 Usage Example

    Full Example

    Setting up SSH Key

    If you don't have an SSH key pair:

    ⚙️ Advanced Settings

    Variable
    Description
    Default

    JUPYTER_PASSWORD

    Jupyter Lab password

    unsloth

    JUPYTER_PORT

    Jupyter Lab port inside container

    8888

    SSH_KEY

    SSH public key for authentication

    None

    USER_PASSWORD

    Password for unsloth user (sudo)

    unsloth

    • Jupyter Lab: -p 8000:8888

    • SSH access: -p 2222:22

    Important: Use volume mounts to preserve your work between container runs.

    🔒 Security Notes

    • Container runs as non-root unsloth user by default

    • Use USER_PASSWORD for sudo operations inside container

    • SSH access requires public key authentication

    Docker image
    unsloth/unsloth
    supported setups
    loss, with no degradation in speed or accuracy
  • Enhanced activation offloading in Unsloth’s Gradient Checkpointing

• Collaborating with Stas Bekman from Snowflake on Tiled MLP, enabling 2× longer contexts

• Unsloth’s algorithms allow gpt-oss-20b QLoRA (4-bit) with 290K context on an H100 with no accuracy loss, and 500K+ with Tiled MLP enabled, altogether delivering >6.4x longer context lengths.

    📐 Unsloth Loss Refactoring: Chunk & Fuse

Our new fused loss implementation adds dynamic sequence chunking: instead of computing language model head logits and cross-entropies over the entire sequence at once, we process manageable slices along the flattened sequence dimension. This cuts peak memory from several GB down to the size of a single chunk. Each chunk still runs a fully fused forward + backward pass via torch.func.grad_and_value, and retains mixed precision accuracy by upcasting to float32 if necessary. These changes do not degrade training speed or accuracy.

    The key innovation is that the chunk size is chosen automatically at runtime based on available VRAM.

    • If you have more free VRAM, larger chunks are used for faster runs

    • If you have less VRAM, it increases the number of chunks to avoid memory blowouts.

    This removes manual tuning and keeps our algorithm robust across old and new GPUs, workloads and different sequence lengths.

    Due to automatic tuning, smaller contexts will use more VRAM (fewer chunks) to avoid unnecessary overhead. For the plots above, we adjust the number of loss chunks to reflect realistic VRAM tiers. With 80GB VRAM, this yields >3.2× longer contexts.

    🏁 Unsloth Gradient Checkpointing Enhancements

Our Unsloth Gradient Checkpointing algorithm, introduced in April 2024, quickly became popular and the standard across the industry, and has been integrated into most training packages nowadays. It offloads activations to CPU RAM, which allowed 10x longer context lengths. Our new enhancements use CUDA Streams and other tricks to add at most 0.1% training overhead with no impact on accuracy. Previously it added 1 to 3% training overhead.

    By offloading activations as soon as they are produced, we minimize peak activation footprint and free GPU memory exactly when it’s needed. This sharply reduces memory pressure in long-context or large-batch training, where a single decoder layer’s activations can exceed 2 GB.

Thus, Unsloth’s new algorithms & Gradient Checkpointing contribute most of the improvement (3.2x), enabling 290K-context-length QLoRA gpt-oss fine-tuning on a single H100.

    🔓 Tiled MLP: Unlocking 500K+

    With help from Stas Bekman (Snowflake), we integrated Tiled MLP from Snowflake’s Arctic Long Sequence Training paper and blog post. TiledMLP reduces activation memory and enables much longer sequence lengths by tiling hidden states along the sequence dimension before heavy MLP projections.

    We also introduce a few quality-of-life improvements:

    We preserve RNG state across tiled forward recomputations so dropout and other stochastic ops are consistent between forward and backward replays. This keeps nested checkpointed computations stable and numerically identical.

    Our implementation auto patches any module named or typed as mlp, so nearly all models with MLP modules are supported out of the box for Tiled MLP.

    Tradeoffs to keep in mind

    TiledMLP saves VRAM at the cost of extra forward passes. Because it lives inside a checkpointed transformer block and is itself written in a checkpoint style, it effectively becomes a nested checkpoint: one MLP now performs ~3 forward passes and 1 backward pass per step. In return, we can drop almost all intermediate MLP activations from VRAM while still supporting extremely long sequences.

    The plots compare active memory timelines for a single decoder layer’s forward and backward during a long-context training step, without Tiled MLP (left) and with it (right). Without Tiled MLP, peak VRAM occurs during the MLP backward; with Tiled MLP, it shifts to the fused loss calculation. We see ~40% lower VRAM usage, and because the fused loss auto chunks dynamically based on available VRAM, the peak with Tiled MLP would be even smaller on smaller GPUs.

    To show cross-entropy loss is not the new bottleneck, we fix its chunk size instead of choosing it dynamically and then double the number of chunks. This significantly reduces the loss-related memory spikes. The max memory now occurs during backward in both cases, and overall timing is similar, though Tiled MLP adds a small overhead: one large GEMM becomes many sequential matmuls, plus the extra forward pass mentioned above.

    Overall, the trade-off is worth it: without Tiled MLP, long-context training can require roughly 2× the memory usage, while with Tiled MLP a single GPU pays only about a 1.3× increase in step time for the same context length.

    Enabling Tiled MLP in Unsloth:

Just set unsloth_tiled_mlp = True in from_pretrained and Tiled MLP is enabled. We follow the same logic as the Arctic paper and choose num_shards = ceil(seq_len/hidden_size). Each tile operates on a sequence length the same size as the hidden dimension of the model to balance throughput and memory savings.

We also discussed with Stas Bekman how Tiled MLP effectively does 3 forward passes and 1 backward pass, compared to normal gradient checkpointing which does 2 forward passes and 1 backward pass, and DeepSpeed provided a doc update for Tiled MLP within DeepSpeed.

    Next time fine-tuning runs out of memory, try turning on unsloth_tiled_mlp = True. This should save some VRAM as long as the context length is longer than the LLM's hidden dimension.


With our latest update, it is now possible to reach 1M context length with a smaller model on a single GPU!

    Try 500K-context gpt-oss-20b fine-tuning on our 80GB A100 Colab notebook.

    If you've made it this far, we're releasing a new blog on our latest improvements in training speed this week so stay tuned by joining our Reddit r/unsloth or our Docs.

    80GB A100 Colab notebook
    fused and chunked cross-entropy
    You don’t need Docker Desktop, Docker CE is enough to run models.

    Why Unsloth + Docker?

We collaborate with model labs like Google's Gemma team to fix model bugs and boost accuracy. Our Dynamic GGUFs consistently outperform other quant methods, giving you high-accuracy, efficient inference.

    If you use Docker, you can run models instantly with zero setup. Docker uses Docker Model Runner (DMR), which lets you run LLMs as easily as containers with no dependency issues. DMR uses Unsloth models and llama.cpp under the hood for fast, efficient, up-to-date inference.

    ⚙️ Hardware Info + Performance

    For the best performance, aim for your VRAM + RAM combined to be at least equal to the size of the quantized model you're downloading. If you have less, the model will still run, but significantly slower.

    Make sure your device also has enough disk space to store the model. If your model only barely fits in memory, you can expect around ~5 tokens/s, depending on model size.

    Having extra RAM/VRAM available will improve inference speed, and additional VRAM will enable the biggest performance boost (provided the entire model fits)

    Example: If you're downloading gpt-oss-20b (F16) and the model is 13.8 GB, ensure that your disk space and RAM + VRAM > 13.8 GB.

    Quantization recommendations:

    • For models under 30B parameters, use at least 4-bit (Q4).

    • For models 70B parameters or larger, use a minimum of 2-bit quantization (e.g., UD_Q2_K_XL).

    ⚡ Step-by-Step Tutorials

    Below are two ways to run models with Docker: one using the terminal, and the other using Docker Desktop with no code:

    Method #1: Docker Terminal

    1

    Install Docker

    Docker Model Runner is already available in both Docker Desktop and Docker CE.

    2

    Run the model

    Decide on a model to run, then run the command via terminal.

• Browse the verified catalog of trusted models available on Docker Hub or the Unsloth model page.

• Go to the terminal to run the commands. To verify that you have Docker installed, type 'docker' and press Enter.

    • Docker Hub defaults to running Unsloth Dynamic 4-bit, however you can select your own quantization level (see step #3).

    For example, to run OpenAI gpt-oss-20b in a single command:

    Or to run a specific gpt-oss quant from Hugging Face:

    This is how running gpt-oss-20b should look via CLI:

    3

    To run a specific quantization level:

If you want to run a specific quantization of a model, append : and the quantization name to the model (e.g., Q4 for Docker or UD-Q4_K_XL). You can view all available quantizations on each model’s Docker Hub page, e.g. see the listed quantizations for gpt-oss.
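
For example, reusing the Hugging Face-style reference from the terminal tutorial, a specific quant might be selected like this (the tag is illustrative; check the model's page for the tags that actually exist):

```bash
docker model run hf.co/unsloth/gpt-oss-20b-GGUF:UD-Q4_K_XL
```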

    Method #2: Docker Desktop (no code)

    1

    Install Docker Desktop

    Docker Model Runner is already available in Docker Desktop.

    1. Decide on a model to run, open Docker Desktop, then click on the models tab.

    2. Click 'Add models +' or Docker Hub. Search for the model.

Browse the verified model catalog available on Docker Hub.

    2

    Pull the model

    Click the model you want to run to see available quantizations.

    • Quantizations range from 1–16 bits. For models under 30B parameters, use at least 4-bit (Q4).

    3

    Run the model

    Type any prompt in the 'Ask a question' box and use the LLM like you would use ChatGPT.

    To run the latest models:

    You can run any new model on Docker as long as it’s supported by llama.cpp or vllm and available on Docker Hub.

    What Is the Docker Model Runner?

    The Docker Model Runner (DMR) is an open-source tool that lets you pull and run AI models as easily as you run containers. GitHub: https://github.com/docker/model-runner

    It provides a consistent runtime for models, similar to how Docker standardized app deployment. Under the hood, it uses optimized backends (like llama.cpp) for smooth, hardware-efficient inference on your machine.

    Whether you’re a researcher, developer, or hobbyist, you can now:

    • Run open models locally in seconds.

    • Avoid dependency hell, everything is handled in Docker.

    • Share and reproduce model setups effortlessly.

    Dynamic GGUFs
    hardware requirements
    our tips
    Docker Terminal Tutorial
    Docker no-code Tutorial
    gpt-oss
    Unsloth model
Devstral-Small-2-24B • Devstral-2-123B

    Devstral 2 - Unsloth Dynamic GGUFs:

    Devstral-Small-2-24B-Instruct-2512
    Devstral-2-123B-Instruct-2512

    🖥️ Running Devstral 2

See our step-by-step guides for running the Devstral 24B and the large Devstral 123B models. Both models have vision support, but vision is currently not supported in llama.cpp.

    ⚙️ Usage Guide

    Here are the recommended settings for inference:

    • Temperature ~0.15

    • Min_P of 0.01 (optional, but 0.01 works well, llama.cpp default is 0.1)

    • Use --jinja to enable the system prompt.

    • Max context length = 262,144

    • Recommended minimum context: 16,384

    Devstral-Small-2-24B

    The full precision (Q8) Devstral-Small-2-24B GGUF will fit in 25GB RAM/VRAM. Text only for now.

    ✨ Run Devstral-Small-2-24B-Instruct-2512 in llama.cpp

1. Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.

2. If you want to use llama.cpp directly to load models, you can do the below, where (:Q4_K_XL) is the quantization type. You can also directly pull from Hugging Face:

3. Download the model via (after installing pip install huggingface_hub hf_transfer). You can choose UD_Q4_K_XL or other quantized versions.

4. Run the model. Otherwise for conversation mode:

    Remember to remove <bos> since Devstral auto adds a <bos>! Also please use --jinja to enable the system prompt!

    Devstral-2-123B

The full precision (Q8) Devstral-2-123B GGUF will fit in 128GB RAM/VRAM. Text only for now.

    ✨ Run Devstral-2-123B-Instruct-2512 Tutorial

1. Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.

2. You can directly pull from Hugging Face via:

3. Download the model via (after installing pip install huggingface_hub hf_transfer). You can choose UD_Q4_K_XL or other quantized versions.

    Remember to remove <bos> since Devstral auto adds a <bos>! Also please use --jinja to enable the system prompt!

    🦥 Fine-tuning Devstral 2 with Unsloth

Just like Ministral 3, Unsloth supports Devstral 2 fine-tuning. Training is 2x faster, uses 70% less VRAM and supports 8x longer context lengths. Devstral 2 fits comfortably in a 24GB VRAM L4 GPU.

Unfortunately, Devstral 2 slightly exceeds the memory limits of a 16GB VRAM GPU, so fine-tuning it for free on Google Colab isn't possible for now. However, you can fine-tune the model for free using our Kaggle notebook, which offers access to dual GPUs. Just change the notebook's Magistral model name to the unsloth/Devstral-Small-2-24B-Instruct-2512 model.
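
For reference, a minimal sketch of loading Devstral 2 in Unsloth (sequence length shown is illustrative):

```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Devstral-Small-2-24B-Instruct-2512",
    max_seq_length = 4096,   # illustrative; adjust to your data
    load_in_4bit = True,     # QLoRA so the 24B model fits in ~24GB VRAM
)
```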

    24B
    123B
    Ministral 3
    Dynamic 2.0
    Aider Polyglot

    Vision Reinforcement Learning (VLM RL)

    Train Vision/multimodal models via GRPO and RL with Unsloth!

    Unsloth now supports vision/multimodal RL with Qwen3-VL, Gemma 3 and more. Due to Unsloth's unique weight sharing and custom kernels, Unsloth makes VLM RL 1.5–2× faster, uses 90% less VRAM, and enables 15× longer context lengths than FA2 setups, with no accuracy loss. This update also introduces Qwen's GSPO algorithm.

Unsloth can train Qwen3-VL-8B with GSPO/GRPO on a free Colab T4 GPU. Other VLMs work too, but may need larger GPUs. Gemma requires newer GPUs than the T4 because vLLM restricts it to bfloat16, so we recommend an NVIDIA L4 on Colab. Our notebooks solve numerical math problems involving images and diagrams:

    • Qwen-3 VL-8B (vLLM inference): Colab

• Qwen-2.5 VL-7B (vLLM inference):

    • Gemma-3-4B (Unsloth inference):

We have also added vLLM VLM integration into Unsloth natively, so all you have to do to use vLLM inference is enable the fast_inference=True flag when initializing the model. Special thanks to Sinoué GAD for providing the first notebook that made integrating VLM RL easier!

This VLM support also integrates our latest update for even more memory efficient + faster RL, including our Standby feature, which uniquely limits speed degradation compared to other implementations.

    You can only use fast_inference for VLMs supported by vLLM. Some models, like Llama 3.2 Vision thus only can run without vLLM, but they still work in Unsloth.

It is also important to note that vLLM does not support LoRA for vision/encoder layers, thus set finetune_vision_layers = False when loading a LoRA adapter. However, you CAN train the vision layers as well if you use inference via transformers/Unsloth.

    🦋Qwen 2.5 VL Vision RL Issues and Quirks

    During RL for Qwen 2.5 VL, you might see the following inference output:

This was also reported for Qwen2.5-VL-7B-Instruct, whose output included unexpected "addCriterion" results. In fact, we see this as well! We tried non-Unsloth setups, bfloat16 and float16 machines, and other configurations, but it still appears. For example, item 165, i.e. train_dataset[165] from the dataset, is below:

And then we get the above gibberish output. One could add a reward function to penalize the addition of addCriterion, or penalize gibberish outputs. However, the other approach is to train for longer. For example, only after roughly 60 steps do we see the model actually learning via RL:

    Forcing <|assistant|> during generation will reduce the occurrences of these gibberish results as expected since this is an Instruct model, however it's still best to add a reward function to penalize bad generations, as described in the next section.

    🏅Reward Functions to reduce gibberish

    To penalize addCriterion and gibberish outputs, we edited the reward function to penalize too much of addCriterion and newlines.

    🏁GSPO Reinforcement Learning

This update in addition adds GSPO (Group Sequence Policy Optimization), which is a variant of GRPO made by the Qwen team at Alibaba. They noticed that GRPO implicitly results in importance weights for each token, even though explicitly the advantages do not scale or change with each token.

This led to the creation of GSPO, which assigns the importance to the sequence likelihood rather than the individual token likelihoods. The difference between these two algorithms can be seen below, both from the GSPO paper from Qwen and Alibaba:

In Equation 1, it can be seen that the advantages scale each row of the token logprob ratios before that tensor is summed. Essentially, each token is given the same scaling even though that scaling was given to the entire sequence rather than each individual token. A simple diagram of this can be seen below:

Equation 2 shows that the logprob ratios for each sequence are summed and exponentiated after they are computed, and only the resulting sequence-level ratios are row-wise multiplied by the advantages.

Enabling GSPO is simple: all you need to do is set importance_sampling_level = "sequence" in the GRPO config.
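
A minimal sketch, assuming a recent TRL version where GRPOConfig exposes this option (other arguments omitted):

```python
from trl import GRPOConfig

training_args = GRPOConfig(
    output_dir = "gspo-run",                 # illustrative
    importance_sampling_level = "sequence",  # GSPO; "token" recovers standard GRPO
)
```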

Overall, with vLLM fast inference for VLMs, Unsloth now enables both 90% reduced memory usage and 1.5-2x faster speed for GRPO and GSPO!

If you'd like to read more about reinforcement learning, check out our RL guide:

Authors: A huge thank you to Keith and Datta for contributing to this article!

    Qwen3-Next: Run Locally Guide

    Run Qwen3-Next-80B-A3B-Instruct and Thinking versions locally on your device!

Qwen released Qwen3-Next in Sept 2025: 80B MoE models with Thinking and Instruct variants of Qwen3. With 256K context, Qwen3-Next was designed with a brand new architecture (a hybrid of MoE and Gated DeltaNet + Gated Attention) that specifically optimizes for fast inference at longer context lengths. Qwen3-Next has 10x faster inference than Qwen3-32B.

    Run Qwen3-Next InstructRun Qwen3-Next Thinking

    Qwen3-Next-80B-A3B Dynamic GGUFs: Instruct • Thinking

    ⚙️ Usage Guide

    NEW as of Dec 6, 2025: Unsloth Qwen3-Next now updated with iMatrix for improved performance.

The thinking model uses temperature = 0.6, but the instruct model uses temperature = 0.7. The thinking model uses top_p = 0.95, but the instruct model uses top_p = 0.8.

    To achieve optimal performance, Qwen recommends these settings:

    Instruct:
    Thinking:

    Adequate Output Length: Use an output length of 32,768 tokens for most queries for the thinking variant, and 16,384 for the instruct variant. You can increase the max output size for the thinking model if necessary.

    Chat template for both Thinking (thinking has <think></think>) and Instruct is below:

    📖 Run Qwen3-Next Tutorials

Below are guides for the Instruct and Thinking versions of the model.

    Instruct: Qwen3-Next-80B-A3B-Instruct

Given that this is a non-thinking model, it does not generate <think> </think> blocks.

    ⚙️Best Practices

    To achieve optimal performance, Qwen recommends the following settings:

• We suggest using temperature=0.7, top_p=0.8, top_k=20, and min_p=0.0. Set presence_penalty between 0 and 2, if the framework supports it, to reduce endless repetitions.

    • temperature = 0.7

    • top_k = 20

    ✨ Llama.cpp: Run Qwen3-Next-80B-A3B-Instruct Tutorial

1. Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.

2. You can directly pull from Hugging Face via:

3. Download the model via (after installing pip install huggingface_hub hf_transfer). You can choose UD_Q4_K_XL or other quantized versions.

    Thinking: Qwen3-Next-80B-A3B-Thinking

    This model supports only thinking mode and a 256K context window natively. The default chat template adds <think> automatically, so you may see only a closing </think> tag in the output.

    ⚙️Best Practices

    To achieve optimal performance, Qwen recommends the following settings:

• We suggest using temperature=0.6, top_p=0.95, top_k=20, and min_p=0.0. Set presence_penalty between 0 and 2, if the framework supports it, to reduce endless repetitions.

    • temperature = 0.6

    • top_k = 20

    ✨ Llama.cpp: Run Qwen3-Next-80B-A3B-Thinking Tutorial

1. Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.

2. You can directly pull from Hugging Face via:

3. Download the model via (after installing pip install huggingface_hub hf_transfer). You can choose UD_Q4_K_XL or other quantized versions.

    🛠️ Improving generation speed

    If you have more VRAM, you can try offloading more MoE layers, or offloading whole layers themselves.

    Normally, -ot ".ffn_.*_exps.=CPU" offloads all MoE layers to the CPU! This effectively allows you to fit all non MoE layers on 1 GPU, improving generation speeds. You can customize the regex expression to fit more layers if you have more GPU capacity.

    If you have a bit more GPU memory, try -ot ".ffn_(up|down)_exps.=CPU" This offloads up and down projection MoE layers.

    Try -ot ".ffn_(up)_exps.=CPU" if you have even more GPU memory. This offloads only up projection MoE layers.

    You can also customize the regex, for example -ot "\.(6|7|8|9|[0-9][0-9]|[0-9][0-9][0-9])\.ffn_(gate|up|down)_exps.=CPU" means to offload gate, up and down MoE layers but only from the 6th layer onwards.
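
As a sketch, combining the run command from this guide with one of the offload patterns above (model, quant and context size are illustrative):

```bash
./llama.cpp/llama-cli \
    -hf unsloth/Qwen3-Next-80B-A3B-Instruct-GGUF:Q4_K_XL \
    --jinja -ngl 99 --threads -1 --ctx-size 32768 \
    -ot ".ffn_(up|down)_exps.=CPU"
```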

The latest llama.cpp release also introduces high throughput mode. Use llama-parallel. Read more about it here. You can also quantize the KV cache to 4 bits, for example, to reduce VRAM / RAM movement, which can also make the generation process faster. The next section talks about KV cache quantization.

    📐How to fit long context

    To fit longer context, you can use KV cache quantization to quantize the K and V caches to lower bits. This can also increase generation speed due to reduced RAM / VRAM data movement. The allowed options for K quantization (default is f16) include the below.

    --cache-type-k f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1

You should use the _1 variants for somewhat increased accuracy, albeit slightly slower, e.g. q4_1 or q5_1. So try out --cache-type-k q4_1.

    You can also quantize the V cache, but you will need to compile llama.cpp with Flash Attention support via -DGGML_CUDA_FA_ALL_QUANTS=ON, and use --flash-attn to enable it. After installing Flash Attention, you can then use --cache-type-v q4_1
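
Putting this together, a sketch of a long-context run with quantized K and V caches (assumes llama.cpp was built with -DGGML_CUDA_FA_ALL_QUANTS=ON; model and context size are illustrative):

```bash
./llama.cpp/llama-cli \
    -hf unsloth/Qwen3-Next-80B-A3B-Instruct-GGUF:Q4_K_XL \
    --jinja -ngl 99 --ctx-size 131072 \
    --flash-attn \
    --cache-type-k q4_1 --cache-type-v q4_1
```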

    Tutorials: How To Fine-tune & Run LLMs

    Learn how to run and/or fine-tune models for optimal performance 100% locally with Unsloth.

    Phi-4 Reasoning: How to Run & Fine-tune

    Learn to run & fine-tune Phi-4 reasoning models locally with Unsloth + our Dynamic 2.0 quants

    Microsoft's new Phi-4 reasoning models are now supported in Unsloth. The 'plus' variant performs on par with OpenAI's o1-mini, o3-mini and Sonnet 3.7. The 'plus' and standard reasoning models are 14B parameters while the 'mini' has 4B parameters. All Phi-4 reasoning uploads use our Unsloth Dynamic 2.0 methodology.

    Phi-4 reasoning - Unsloth Dynamic 2.0 uploads:

    Dynamic 2.0 GGUF (to run)
    Dynamic 4-bit Safetensor (to finetune/deploy)

    🖥️ Running Phi-4 reasoning

    ⚙️ Official Recommended Settings

    According to Microsoft, these are the recommended settings for inference:

    • Temperature = 0.8

    • Top_P = 0.95

    Phi-4 reasoning Chat templates

    Please ensure you use the correct chat template as the 'mini' variant has a different one.

    Phi-4-mini:

    Phi-4-reasoning and Phi-4-reasoning-plus:

    This format is used for general conversation and instructions:

    Yes, the chat template/prompt format is this long!

    🦙 Ollama: Run Phi-4 reasoning Tutorial

1. Install Ollama if you haven't already!

2. Run the model! Note you can call ollama serve in another terminal if it fails. We include all our fixes and suggested parameters (temperature etc.) in params in our Hugging Face upload.

    📖 Llama.cpp: Run Phi-4 reasoning Tutorial

You must use --jinja in llama.cpp to enable reasoning for the models, except for the 'mini' variant. Otherwise no token will be provided.

1. Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.

2. Download the model via (after installing pip install huggingface_hub hf_transfer). You can choose Q4_K_M, or other quantized versions.

3. Run the model in conversational mode in llama.cpp. You must use --jinja in llama.cpp to enable reasoning for the models. This is however not needed if you're using the 'mini' variant.

    🦥 Fine-tuning Phi-4 with Unsloth

Fine-tuning for the models is also now supported in Unsloth. To fine-tune for free on Google Colab, just change the model_name of 'unsloth/Phi-4' to 'unsloth/Phi-4-mini-reasoning' etc.
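
For example, a minimal sketch of the change (settings shown are illustrative defaults):

```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Phi-4-mini-reasoning",  # or another Phi-4 reasoning upload
    max_seq_length = 2048,
    load_in_4bit = True,
)
```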

    Saving to Ollama

    See our guide below for the complete process on how to save to Ollama:

    🦙Tutorial: Finetune Llama-3 and Use In Ollama

    Saving on Google Colab

    You can save the finetuned model as a small 100MB file called a LoRA adapter like below. You can instead push to the Hugging Face hub as well if you want to upload your model! Remember to get a Hugging Face token via: https://huggingface.co/settings/tokens and add your token!

    After saving the model, we can again use Unsloth to run the model itself! Use FastLanguageModel again to call it for inference!
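
A minimal sketch of what this looks like (folder and repo names are placeholders):

```python
# Save the LoRA adapter locally (a small ~100MB folder); names are placeholders
model.save_pretrained("lora_model")
tokenizer.save_pretrained("lora_model")
# Or push it to the Hugging Face Hub instead:
# model.push_to_hub("your_name/lora_model", token = "hf_...")

# Reload the adapter with Unsloth and switch to inference mode
from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained("lora_model", load_in_4bit = True)
FastLanguageModel.for_inference(model)  # enable Unsloth's faster inference path
```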

    Exporting to Ollama

    Finally we can export our finetuned model to Ollama itself! First we have to install Ollama in the Colab notebook:

Then we export the finetuned model to llama.cpp's GGUF format like below:

Reminder to convert False to True for 1 row only, and not change every row to True, or else you'll be waiting for a very long time! We normally suggest setting the first row to True, so we can export the finetuned model quickly to Q8_0 format (8-bit quantization). We also allow you to export to a whole list of quantization methods, with a popular one being q4_k_m.
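
For reference, the cell you flip to True corresponds roughly to calls like these (a sketch using Unsloth's GGUF save API; the output folder name is a placeholder):

```python
# Export to 8-bit GGUF (Q8_0), the quick default suggested above
model.save_pretrained_gguf("model", tokenizer, quantization_method = "q8_0")

# Or a smaller, popular 4-bit variant:
# model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
```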

    Head over to to learn more about GGUF. We also have some manual instructions of how to export to GGUF if you want here:

    You will see a long list of text like below - please wait 5 to 10 minutes!!

    And finally at the very end, it'll look like below:

    Then, we have to run Ollama itself in the background. We use subprocess because Colab doesn't like asynchronous calls, but normally one just runs ollama serve in the terminal / command prompt.
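
A minimal sketch of what the notebook does here:

```python
import subprocess

# Launch the Ollama server in the background; locally you would just run `ollama serve`
ollama_process = subprocess.Popen(["ollama", "serve"])
```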

    Automatic Modelfile creation

The trick Unsloth provides is that we automatically create a Modelfile which Ollama requires! This is just a list of settings and includes the chat template which we used for the finetune process! You can also print the generated Modelfile like below:
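
For instance (the attribute name below is an internal Unsloth detail and may differ between versions):

```python
# Inspect the auto-generated Modelfile (internal attribute; may vary by Unsloth version)
print(tokenizer._ollama_modelfile)
```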

    We then ask Ollama to create a model which is Ollama compatible, by using the Modelfile
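
A sketch of the corresponding command (the model tag and Modelfile path are placeholders):

```bash
ollama create unsloth_model -f ./model/Modelfile
```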

    Ollama Inference

And we can now call the model for inference if you want to call the Ollama server itself, which is running on your own local machine / in the free Colab notebook in the background. Remember you can edit the yellow underlined part.

    Running in Unsloth works well, but after exporting & running on Ollama, the results are poor

    You might sometimes encounter an issue where your model runs and produces good results on Unsloth, but when you use it on another platform like Ollama, the results are poor or you might get gibberish, endless/infinite generations or repeated outputs.

    • The most common cause of this error is using an incorrect chat template. It’s essential to use the SAME chat template that was used when training the model in Unsloth and later when you run it in another framework, such as llama.cpp or Ollama. When inferencing from a saved model, it's crucial to apply the correct template.

    • You must use the correct eos token. If not, you might get gibberish on longer generations.

• It might also be because your inference engine adds an unnecessary "start of sequence" token (or, on the contrary, is missing one), so ensure you check both hypotheses!

    Fine-tuning LLMs Guide

    Learn all the basics and best practices of fine-tuning. Beginner-friendly.

    1. Understand Fine-tuning

    Fine-tuning an LLM customizes its behavior, enhances + injects knowledge, and optimizes performance for domains/specific tasks. For example:

    • GPT-4 serves as a base model; however, OpenAI fine-tuned it to better comprehend instructions and prompts, leading to the creation of ChatGPT-4 which everyone uses today.

    Windows Installation

    See how to install Unsloth on Windows with or without WSL.

For Windows, pip install unsloth now works, however you must have PyTorch installed beforehand.

    Method #1 - Docker:

Docker might be the easiest way for Windows users to get started with Unsloth as there is no setup needed and no dependency issues. unsloth/unsloth is Unsloth's only Docker image. For Blackwell and 50-series GPUs, use this same image - no separate image needed.

For installation instructions, please follow our Docker guide, otherwise here is a quickstart guide:

    1

    gpt-oss Reinforcement Learning

You can now train OpenAI gpt-oss with RL and GRPO via Unsloth. Unsloth now offers the fastest inference (3x faster), lowest VRAM usage (50% less) and longest context (8x longer) for gpt-oss RL vs. any implementation - with no accuracy degradation. Since reinforcement learning (RL) on gpt-oss isn't yet vLLM compatible, we had to rewrite the inference code from the Transformers code to deliver 3x faster inference for gpt-oss at ~21 tokens/s. For BF16, Unsloth also achieves the fastest inference (~30 tokens/s), especially relative to VRAM usage, using 50% less VRAM vs. any other RL implementation. We plan to add this support once vLLM becomes compatible with RL.

• Free notebook: This notebook automatically creates faster matrix multiplication kernels and uses 4 new Unsloth reward functions. We also show how to counteract reward hacking, which is one of RL's biggest challenges.

With Unsloth, you can train gpt-oss-20b with GRPO on just 15GB VRAM and for free on Colab.

    Fine-tuning LLMs with NVIDIA DGX Spark and Unsloth

    Tutorial on how to fine-tune and do reinforcement learning (RL) with OpenAI gpt-oss on NVIDIA DGX Spark.

    Unsloth enables local fine-tuning of LLMs with up to 200B parameters on the NVIDIA DGX™ Spark. With 128 GB of unified memory, you can train massive models such as gpt-oss-120b, and run or deploy inference directly on DGX Spark.

    As shown at , gpt-oss-20b was trained with RL and Unsloth on DGX Spark to auto-win 2048. You can train using Unsloth in a Docker container or virtual environment on DGX Spark.

    In this tutorial, we’ll train gpt-oss-20b with RL using Unsloth notebooks after installing Unsloth on your DGX Spark. gpt-oss-120b will use around 68GB of unified memory.

    After 1,000 steps and 4 hours of RL training, the gpt-oss model greatly outperforms the original on 2048, and longer training would further improve results.

    vLLM Engine Arguments

    vLLM engine arguments, flags, options for serving models on vLLM.

    Argument
    Example and use-case
    export NVIDIA_CONTAINER_TOOLKIT_VERSION=1.17.8-1
    sudo apt-get update && sudo apt-get install -y \
      nvidia-container-toolkit=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
      nvidia-container-toolkit-base=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
      libnvidia-container-tools=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
      libnvidia-container1=${NVIDIA_CONTAINER_TOOLKIT_VERSION}
    docker run -d -e JUPYTER_PASSWORD="mypassword" \
      -p 8888:8888 -p 2222:22 \
      -v $(pwd)/work:/workspace/work \
      --gpus all \
      unsloth/unsloth
    docker run -d -e JUPYTER_PORT=8000 \
      -e JUPYTER_PASSWORD="mypassword" \
      -e "SSH_KEY=$(cat ~/.ssh/container_key.pub)" \
      -e USER_PASSWORD="unsloth2024" \
      -p 8000:8000 -p 2222:22 \
      -v $(pwd)/work:/workspace/work \
      --gpus all \
      unsloth/unsloth
    # Generate new key pair
    ssh-keygen -t rsa -b 4096 -f ~/.ssh/container_key
    
    # Use the public key in docker run
    -e "SSH_KEY=$(cat ~/.ssh/container_key.pub)"
    
    # Connect via SSH
    ssh -i ~/.ssh/container_key -p 2222 unsloth@localhost
    # Generate SSH key pair
    ssh-keygen -t rsa -b 4096 -f ~/.ssh/container_key
    
    # Connect to container
    ssh -i ~/.ssh/container_key -p 2222 unsloth@localhost
    -p <host_port>:<container_port>
    -v <local_folder>:<container_folder>
    docker run -d -e JUPYTER_PORT=8000 \
      -e JUPYTER_PASSWORD="mypassword" \
      -e "SSH_KEY=$(cat ~/.ssh/container_key.pub)" \
      -e USER_PASSWORD="unsloth2024" \
      -p 8000:8000 -p 2222:22 \
      -v $(pwd)/work:/workspace/work \
      --gpus all \
      unsloth/unsloth
    export NVIDIA_CONTAINER_TOOLKIT_VERSION=1.17.8-1
    sudo apt-get update && sudo apt-get install -y \
      nvidia-container-toolkit=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
      nvidia-container-toolkit-base=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
      libnvidia-container-tools=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
      libnvidia-container1=${NVIDIA_CONTAINER_TOOLKIT_VERSION}
    docker run -d -e JUPYTER_PASSWORD="mypassword" \
      -p 8888:8888 -p 2222:22 \
      -v $(pwd)/work:/workspace/work \
      --gpus all \
      unsloth/unsloth
    docker run -d -e JUPYTER_PORT=8000 \
      -e JUPYTER_PASSWORD="mypassword" \
      -e "SSH_KEY=$(cat ~/.ssh/container_key.pub)" \
      -e USER_PASSWORD="unsloth2024" \
      -p 8000:8000 -p 2222:22 \
      -v $(pwd)/work:/workspace/work \
      --gpus all \
      unsloth/unsloth
    # Generate new key pair
    ssh-keygen -t rsa -b 4096 -f ~/.ssh/container_key
    
    # Use the public key in docker run
    -e "SSH_KEY=$(cat ~/.ssh/container_key.pub)"
    
    # Connect via SSH
    ssh -i ~/.ssh/container_key -p 2222 unsloth@localhost
    -p <host_port>:<container_port>
    -v <local_folder>:<container_folder>
    docker run -d -e JUPYTER_PORT=8000 \
      -e JUPYTER_PASSWORD="mypassword" \
      -e "SSH_KEY=$(cat ~/.ssh/container_key.pub)" \
      -e USER_PASSWORD="unsloth2024" \
      -p 8000:8000 -p 2222:22 \
      -v $(pwd)/work:/workspace/work \
      --gpus all \
      unsloth/unsloth
    # Original Unsloth version released April 2024 - LGPLv3 Licensed
    class Unsloth_Offloaded_Gradient_Checkpointer(torch.autograd.Function):
        @staticmethod
        @torch_amp_custom_fwd
        def forward(ctx, forward_function, hidden_states, *args):
            ctx.device = hidden_states.device
            saved_hidden_states = hidden_states.to("cpu", non_blocking = True)
            with torch.no_grad():
                output = forward_function(hidden_states, *args)
            ctx.save_for_backward(saved_hidden_states)
            ctx.forward_function, ctx.args = forward_function, args
            return output
    
        @staticmethod
        @torch_amp_custom_bwd
        def backward(ctx, dY):
            (hidden_states,) = ctx.saved_tensors
            hidden_states = hidden_states.to(ctx.device, non_blocking = True).detach()
            hidden_states.requires_grad_(True)
            with torch.enable_grad():
                (output,) = ctx.forward_function(hidden_states, *ctx.args)
            torch.autograd.backward(output, dY)
            return (None, hidden_states.grad,) + (None,)*len(ctx.args)
    model, tokenizer = FastLanguageModel.from_pretrained(
        ...,
        unsloth_tiled_mlp = True,
    )
    docker model run ai/gpt-oss:20B
    docker model run hf.co/unsloth/gpt-oss-20b-GGUF:F16
    apt-get update
    apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
    git clone https://github.com/ggml-org/llama.cpp
    cmake llama.cpp -B llama.cpp/build \
        -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
    cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-gguf-split
    cp llama.cpp/build/bin/llama-* llama.cpp
    ./llama.cpp/llama-cli \
        -hf unsloth/Devstral-Small-2-24B-Instruct-2512-GGUF:UD-Q4_K_XL \
        --jinja -ngl 99 --threads -1 --ctx-size 16384 \
        --temp 0.15
    # !pip install huggingface_hub hf_transfer
    import os
    os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
    from huggingface_hub import snapshot_download
    snapshot_download(
        repo_id = "unsloth/Devstral-Small-2-24B-Instruct-2512-GGUF",
        local_dir = "unsloth/Devstral-Small-2-24B-Instruct-2512-GGUF",
        allow_patterns = ["*UD-Q4_K_XL*", "*mmproj-F16*"], # For Q4_K_XL
    )
    ./llama.cpp/llama-cli \
        --model unsloth/Devstral-Small-2-24B-Instruct-2512-GGUF/Devstral-Small-2-24B-Instruct-2512-UD-Q4_K_XL.gguf \
        --threads -1 \
        --ctx-size 16384 \
        --cache-type-k q8_0 \
        --n-gpu-layers 99 \
        --seed 3407 \
        --prio 2 \
        --temp 0.15 \
        --jinja
    apt-get update
    apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
    git clone https://github.com/ggml-org/llama.cpp
    cmake llama.cpp -B llama.cpp/build \
        -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
    cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-gguf-split
    cp llama.cpp/build/bin/llama-* llama.cpp
    ./llama.cpp/llama-cli \
        -hf unsloth/Devstral-2-123B-Instruct-2512-GGUF:UD-Q4_K_XL \
        --jinja -ngl 99 --threads -1 --ctx-size 16384 \
        --temp 0.15
    # !pip install huggingface_hub hf_transfer
    import os
    os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
    from huggingface_hub import snapshot_download
    snapshot_download(
        repo_id = "unsloth/Devstral-2-123B-Instruct-2512-GGUF",
        local_dir = "Devstral-2-123B-Instruct-2512-GGUF",
        allow_patterns = ["*UD-Q4_K_XL*", "*mmproj-F16*"],
    )
    Devstral-Small-2-24B-Instruct-2512-GGUF
    Devstral-2-123B-Instruct-2512-GGUF
    http://localhost:8888
    Fine-tuning Guide
    RL Guide
    notebooks
    http://localhost:8888
    Fine-tuning Guide
    RL Guide
    notebooks

    Ministral 3

    Kimi K2 Thinking

    DeepSeek-OCR

    Qwen3-VL

    Vision Reinforcement Learning

    DeepSeek-V3.1 Terminus

    Run gpt-oss

    Qwen3 Coder

    Fine-tune gpt-oss

    Magistral 1.2

    Gemma 3n

    Qwen3-2507

    DeepSeek-R1-0528

    Kimi K2

    Devstral 2507

    Fine-tune on Blackwell & RTX 50 GPUs

    TTS Fine-tuning

    Qwen3

    Phi-4 reasoning

    Dynamic 2.0 GGUFs

    Llama 4

    DeepSeek-V3-0324

    Grok 2

    Gemma 3

    QwQ-32B

    DeepSeek-R1

    Reinforcement Learning (RL)

    Mistral Small 3.1

    Llama 3

    Vision Fine-tuning

    Continued Pretraining

    Llama 3.3

    Gemma 2

    Phi-3

    Colab
    Kaggle
    Colab
    Sinoué GAD
    first notebook
    Standby feature
    reported
    AI4Math/MathVista
    Group Sequence Policy Optimization
    Reinforcement Learning Guide
    Keith
    Datta
    GRPO Algorithm, Source: Qwen
    GSPO algorithm, Source: Qwen
    GRPO Logprob Ratio row wise scaled with advantages
    GSPO Sequence Ratio row wise scaled with advantages

    min_p = 0.00 (llama.cpp's default is 0.1)

  • top_p = 0.80

  • presence_penalty = 0.0 to 2.0 (llama.cpp default turns it off, but to reduce repetitions, you can use this) Try 1.0 for example.

  • Supports up to 262,144 context natively but you can set it to 32,768 tokens for less RAM use

  • min_p = 0.00 (llama.cpp's default is 0.1)

  • top_p = 0.95

  • presence_penalty = 0.0 to 2.0 (llama.cpp default turns it off, but to reduce repetitions, you can use this) Try 1.0 for example.

  • Supports up to 262,144 context natively but you can set it to 32,768 tokens for less RAM use

  • Temperature = 0.7

    Temperature = 0.6

    Min_P = 0.00 (llama.cpp's default is 0.1)

    Min_P = 0.00 (llama.cpp's default is 0.1)

    Top_P = 0.80

    Top_P = 0.95

    TopK = 20

    TopK = 20

    presence_penalty = 0.0 to 2.0 (llama.cpp default turns it off, but to reduce repetitions, you can use this)

    presence_penalty = 0.0 to 2.0 (llama.cpp default turns it off, but to reduce repetitions, you can use this)

    Thinking
    Instruct
    GitHub here
    GitHub here
    latest llama.cpp release
    here
    next section
    • Reasoning-plus (14B)

    • Reasoning (14B)

    • Mini-reasoning (4B)

    • Reasoning-plus

    • Reasoning

    • Mini-reasoning

    GitHub here
    Phi-4 fine-tuning
    Phi-4 (14B) fine-tuning notebook

--port

Default is 8000. How to access vLLM's localhost, i.e. http://localhost:8000

    --api-key

    Optional - Set the password (or no password) to access the model.

    --tensor-parallel-size

    Default is 1. Splits model across tensors. Set this to how many GPUs you are using - if you have 4, set this to 4. 8, then 8. You should have NCCL, otherwise this might be slow.

    --pipeline-parallel-size

Default is 1. Splits the model across layers. Use this together with --tensor-parallel-size: TP is used within each node, and PP is used across multi-node setups (set PP to the number of nodes).

    --enable-lora

    Enables LoRA serving. Useful for serving Unsloth finetuned LoRAs.

    --max-loras

    How many LoRAs you want to serve at 1 time. Set this to 1 for 1 LoRA, or say 16. This is a queue so LoRAs can be hot-swapped.

    --max-lora-rank

    Maximum rank of all LoRAs. Possible choices are 8, 16, 32, 64, 128, 256, 320, 512

    --dtype

Allows auto, bfloat16, float16. Float8 and other quantizations use a different flag - see --quantization

    --tokenizer

    Specify the tokenizer path like unsloth/gpt-oss-20b if the served model has a different tokenizer.

    --hf-token

    Add your HuggingFace token if needed for gated models

    --swap-space

    Default is 4GB. CPU offloading usage. Reduce if you have VRAM, or increase for low memory GPUs.

    --seed

    Default is 0 for vLLM

    --disable-log-stats

    Disables logging like throughput, server requests.

    --enforce-eager

    Disables compilation. Faster to load, but slower for inference.

    --disable-cascade-attn

    Useful for Reinforcement Learning runs for vLLM < 0.11.0, as Cascade Attention was slightly buggy on A100 GPUs (Unsloth fixes this)

    🎉Float8 Quantization

    For example to host Llama 3.3 70B Instruct (supports 128K context length) with Float8 KV Cache and quantization, try:
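
A sketch of such a command using the flags documented above (the context length and memory fraction are illustrative):

```bash
vllm serve unsloth/Llama-3.3-70B-Instruct \
    --quantization fp8 \
    --kv-cache-dtype fp8 \
    --max-model-len 131072 \
    --gpu-memory-utilization 0.95
```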

    🍧LoRA Hot Swapping / Dynamic LoRAs

    To enable LoRA serving for at most 4 LoRAs at 1 time (these are hot swapped / changed), first set the environment flag to allow hot swapping:
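
A sketch, assuming vLLM's VLLM_ALLOW_RUNTIME_LORA_UPDATING environment variable together with the LoRA flags documented above (model name and rank are illustrative):

```bash
export VLLM_ALLOW_RUNTIME_LORA_UPDATING=True  # allow LoRAs to be loaded/unloaded at runtime

vllm serve unsloth/gpt-oss-20b \
    --enable-lora \
    --max-loras 4 \
    --max-lora-rank 64
```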

    See our LoRA Hot Swapping Guide for more details.

    --gpu-memory-utilization

Default 0.9. The fraction of VRAM vLLM is allowed to use. Reduce this if you are going out of memory, or raise it to 0.95 or 0.97 if you have spare VRAM.

    --max-model-len

    Set maximum sequence length. Reduce this if going out of memory! For example set --max-model-len 32768 to use only 32K sequence lengths.

    --quantization

    Use fp8 for dynamic float8 quantization. Use this in tandem with --kv-cache-dtype fp8 to enable float8 KV cache as well.

    --kv-cache-dtype

    Use fp8 for float8 KV cache to reduce memory usage by 50%.


    os.environ['UNSLOTH_VLLM_STANDBY'] = '1' # To enable memory efficient GRPO with vLLM
    model, tokenizer = FastVisionModel.from_pretrained(
        model_name = "Qwen/Qwen2.5-VL-7B-Instruct",
        max_seq_length = 16384, #Must be this large to fit image in context
        load_in_4bit = True, # False for LoRA 16bit
        fast_inference = True, # Enable vLLM fast inference
        gpu_memory_utilization = 0.8, # Reduce if out of memory
    )
    # Add LoRA adapter to the model for parameter efficient fine tuning
    model = FastVisionModel.get_peft_model(
        model,
    
        finetune_vision_layers     = False,# fast_inference doesn't support finetune_vision_layers yet :(
        finetune_language_layers   = True, # False if not finetuning language layers
        finetune_attention_modules = True, # False if not finetuning attention layers
        finetune_mlp_modules       = True, # False if not finetuning MLP layers
    
        r = lora_rank, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
        lora_alpha = lora_rank*2, # *2 speeds up training
        use_gradient_checkpointing = "unsloth", # Reduces memory usage
        random_state = 3407,
    )
     addCriterion
     <tool_call>\n addCriterion\n\n addCriterion\n\n addCriterion\n\n addCriterion\n\n addCriterion\n\n addCriterion\n\n addCriterion\n\n addCriterion\n\n addCriterion\n\n addCriterion\n\n\n addCriterion\n\n 自动生成\n\n addCriterion\n\n addCriterion\n\n addCriterion\n\n addCriterion\n\n addCriterion\n\n addCriterion\n\n addCriterion\n\n addCriterion\n\n\n addCriterion\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n
    Figure is an overhead view of the path taken by a race car driver as his car collides with the racetrack wall. Just before the collision, he is traveling at speed $v_i=70 \mathrm{~m} / \mathrm{s}$ along a straight line at $30^{\circ}$ from the wall. Just after the collision, he is traveling at speed $v_f=50 \mathrm{~m} / \mathrm{s}$ along a straight line at $10^{\circ}$ from the wall. His mass $m$ is $80 \mathrm{~kg}$. The collision lasts for $14 \mathrm{~ms}$. What is the magnitude of the average force on the driver during the collision?
    def formatting_reward_func(completions,**kwargs):
        import re
        thinking_pattern = f'{REASONING_START}(.*?){REASONING_END}'
        answer_pattern = f'{SOLUTION_START}(.*?){SOLUTION_END}'
    
        scores = []
        for completion in completions:
            score = 0
            thinking_matches = re.findall(thinking_pattern, completion, re.DOTALL)
            answer_matches = re.findall(answer_pattern, completion, re.DOTALL)
            if len(thinking_matches) == 1:
                score += 1.0
            if len(answer_matches) == 1:
                score += 1.0
    
            # Fix up addCriterion issues
            # See https://docs.unsloth.ai/new/vision-reinforcement-learning-vlm-rl#qwen-2.5-vl-vision-rl-issues-and-quirks
            # Penalize on excessive addCriterion and newlines
            if len(completion) != 0:
                removal = completion.replace("addCriterion", "").replace("\n", "")
                if (len(completion)-len(removal))/len(completion) >= 0.5:
                    score -= 2.0
    
            scores.append(score)
        return scores
    training_args = GRPOConfig(
        output_dir = "vlm-grpo-unsloth",
        per_device_train_batch_size = 8,
        gradient_accumulation_steps = 4,
        learning_rate = 5e-6,
        adam_beta1 = 0.9,
        adam_beta2 = 0.99,
        weight_decay = 0.1,
        warmup_ratio = 0.1,
        lr_scheduler_type = "cosine",
        optim = "adamw_8bit",
        # beta = 0.00,
        epsilon = 3e-4,
        epsilon_high = 4e-4,
        num_generations = 8,    
        max_prompt_length = 1024,
        max_completion_length = 1024,
        log_completions = False,
        max_grad_norm = 0.1,
        temperature = 0.9,
        # report_to = "none", # Set to "wandb" if you want to log to Weights & Biases
        num_train_epochs = 2, # For a quick test run, increase for full training
    report_to = "none",
        
        # GSPO is below:
        importance_sampling_level = "sequence",
        
        # Dr GRPO / GAPO etc
        loss_type = "dr_grpo",
    )
    <|im_start|>user
    Hey there!<|im_end|>
    <|im_start|>assistant
    What is 1+1?<|im_end|>
    <|im_start|>user
    2<|im_end|>
    <|im_start|>assistant
    apt-get update
    apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
    git clone https://github.com/ggml-org/llama.cpp
    cmake llama.cpp -B llama.cpp/build \
        -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
    cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-gguf-split
    cp llama.cpp/build/bin/llama-* llama.cpp
    ./llama.cpp/llama-cli \
        -hf unsloth/Qwen3-Next-80B-A3B-Instruct-GGUF:Q4_K_XL \
        --jinja -ngl 99 --threads -1 --ctx-size 32768 \
        --temp 0.7 --min-p 0.0 --top-p 0.80 --top-k 20 --presence-penalty 1.0
    # !pip install huggingface_hub hf_transfer
    import os
    os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
    from huggingface_hub import snapshot_download
    snapshot_download(
        repo_id = "unsloth/Qwen3-Next-80B-A3B-Instruct-GGUF",
        local_dir = "Qwen3-Next-80B-A3B-Instruct-GGUF",
        allow_patterns = ["*UD-Q4_K_XL*"],
    )
    apt-get update
    apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
    git clone https://github.com/ggml-org/llama.cpp
    cmake llama.cpp -B llama.cpp/build \
        -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
    cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-gguf-split
    cp llama.cpp/build/bin/llama-* llama.cpp
    ./llama.cpp/llama-cli \
        -hf unsloth/Qwen3-Next-80B-A3B-Thinking-GGUF:Q4_K_XL \
        --jinja -ngl 99 --threads -1 --ctx-size 32768 \
        --temp 0.6 --min-p 0.0 --top-p 0.95 --top-k 20 --presence-penalty 1.0
    # !pip install huggingface_hub hf_transfer
    import os
    os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
    from huggingface_hub import snapshot_download
    snapshot_download(
        repo_id = "unsloth/Qwen3-Next-80B-A3B-Thinking-GGUF",
        local_dir = "Qwen3-Next-80B-A3B-Thinking-GGUF",
        allow_patterns = ["*UD-Q4_K_XL*"],
    )
    <|system|>Your name is Phi, an AI math expert developed by Microsoft.<|end|><|user|>How to solve 3*x^2+4*x+5=1?<|end|><|assistant|>
    <|im_start|>system<|im_sep|>You are Phi, a language model trained by Microsoft to help users. Your role as an assistant involves thoroughly exploring questions through a systematic thinking process before providing the final precise and accurate solutions. This requires engaging in a comprehensive cycle of analysis, summarizing, exploration, reassessment, reflection, backtracing, and iteration to develop well-considered thinking process. Please structure your response into two main sections: Thought and Solution using the specified format: <think> {Thought section} </think> {Solution section}. In the Thought section, detail your reasoning process in steps. Each step should include detailed considerations such as analysing questions, summarizing relevant findings, brainstorming new ideas, verifying the accuracy of the current steps, refining any errors, and revisiting previous steps. In the Solution section, based on various attempts, explorations, and reflections from the Thought section, systematically present the final solution that you deem correct. The Solution section should be logical, accurate, and concise and detail necessary steps needed to reach the conclusion. Now, try to solve the following question through the above guidelines:<|im_end|><|im_start|>user<|im_sep|>What is 1+1?<|im_end|><|im_start|>assistant<|im_sep|>
    apt-get update
    apt-get install pciutils -y
    curl -fsSL https://ollama.com/install.sh | sh
    ollama run hf.co/unsloth/Phi-4-mini-reasoning-GGUF:Q4_K_XL
    apt-get update
    apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
    git clone https://github.com/ggml-org/llama.cpp
    cmake llama.cpp -B llama.cpp/build \
        -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
    cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-gguf-split
    cp llama.cpp/build/bin/llama-* llama.cpp
    # !pip install huggingface_hub hf_transfer
    import os
    os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
    from huggingface_hub import snapshot_download
    snapshot_download(
        repo_id = "unsloth/Phi-4-mini-reasoning-GGUF",
        local_dir = "unsloth/Phi-4-mini-reasoning-GGUF",
        allow_patterns = ["*UD-Q4_K_XL*"],
    )
    ./llama.cpp/llama-cli \
        --model unsloth/Phi-4-mini-reasoning-GGUF/Phi-4-mini-reasoning-UD-Q4_K_XL.gguf \
        --threads -1 \
        --n-gpu-layers 99 \
        --prio 3 \
        --temp 0.8 \
        --top-p 0.95 \
        --jinja \
        --min-p 0.00 \
        --ctx-size 32768 \
        --seed 3407
    vllm serve unsloth/Llama-3.3-70B-Instruct \
        --quantization fp8 \
        --kv-cache-dtype fp8 \
        --gpu-memory-utilization 0.97 \
        --max-model-len 65536
    The same applies to Unsloth quants on Hugging Face: visit the model’s HF page, choose a quantization, then run something like: docker model run hf.co/unsloth/gpt-oss-20b-GGUF:Q2_K_L

    gpt-oss quantization levels on Docker Hub · Unsloth gpt-oss quantization levels on Hugging Face

    Choose a size that fits your hardware: ideally, your combined unified memory, RAM, or VRAM should be equal to or greater than the model size. For example, an 11GB model runs well on 12GB unified memory.

    You can also pull models from the Docker Desktop GUI:

    #1. Click 'Models' tab then 'Add models +'
    #2. Search for your desired model, e.g. gpt-oss-20b from Docker Hub or gpt-oss-20b with Unsloth's UD-Q8_K_XL quantization from Unsloth's Hugging Face.
    #3. Select which quantization you would like to pull.
    #4. Wait for model to finish downloading, then Run it.

    An example of running Qwen3-4B UD-Q8_K_XL

    Use our conversational notebooks to force the chat template - this will fix most issues.

    • Qwen-3 14B Conversational notebook Open in Colab

    • Gemma-3 4B Conversational notebook Open in Colab

    • Llama-3.2 3B Conversational notebook Open in Colab

    • Phi-4 14B Conversational notebook

    • Mistral v0.3 7B Conversational notebook

    • More notebooks in our

    https://github.com/ggerganov/llama.cpp
    https://github.com/unslothai/unsloth/wiki#manually-saving-to-gguf

    ​DeepSeek-R1-Distill-Llama-8B is a fine-tuned version of Llama-3.1-8B. DeepSeek utilized data generated by DeepSeek-R1, to fine-tune Llama-3.1-8B. This process, known as distillation (a subcategory of fine-tuning), injects the data into the Llama model to learn reasoning capabilities.

    With Unsloth, you can fine-tune for free on Colab, Kaggle, or locally with just 3GB VRAM by using our notebooks. By fine-tuning a pre-trained model (e.g. Llama-3.1-8B) on a specialized dataset, you can:

    • Update + Learn New Knowledge: Inject and learn new domain-specific information.

    • Customize Behavior: Adjust the model’s tone, personality, or response style.

    • Optimize for Tasks: Improve accuracy and relevance for specific use cases.

    Example usecases:

    • Train LLM to predict if a headline impacts a company positively or negatively.

    • Use historical customer interactions for more accurate and custom responses.

    • Fine-tune LLM on legal texts for contract analysis, case law research, and compliance.

    You can think of a fine-tuned model as a specialized agent designed to do specific tasks more effectively and efficiently. Fine-tuning can replicate all of RAG's capabilities, but not vice versa.

    Fine-tuning misconceptions:

    You may have heard that fine-tuning does not make a model learn new knowledge or RAG performs better than fine-tuning. That is false. Read more FAQ + misconceptions here:

    2. Choose the Right Model + Method

    If you're a beginner, it is best to start with a small instruct model like Llama 3.1 (8B) and experiment from there. You'll also need to decide between QLoRA and LoRA training:

    • LoRA: Fine-tunes small, trainable matrices in 16-bit without updating all model weights.

    • QLoRA: Combines LoRA with 4-bit quantization to handle very large models with minimal resources.

    You can change the model name to whichever model you like by matching it with model's name on Hugging Face e.g. 'unsloth/llama-3.1-8b-unsloth-bnb-4bit'.

    We recommend starting with Instruct models, as they allow direct fine-tuning using conversational chat templates (ChatML, ShareGPT etc.) and require less data compared to Base models (which use formats like Alpaca or Vicuna). Learn more about the differences between instruct and base models here.

    • Model names ending in unsloth-bnb-4bit indicate they are Unsloth dynamic 4-bit quants. These models consume slightly more VRAM than standard BitsAndBytes 4-bit models but offer significantly higher accuracy.

    • If a model name ends with just bnb-4bit, without "unsloth", it refers to a standard BitsAndBytes 4-bit quantization.

    • Models with no suffix are in their original 16-bit or 8-bit formats. While they are the original models from the official model creators, we sometimes include important fixes - such as chat template or tokenizer fixes. So it's recommended to use our versions when available.

    There are other settings which you can toggle:

    • max_seq_length = 2048 – Controls context length. While Llama-3 supports 8192, we recommend 2048 for testing. Unsloth enables 4× longer context fine-tuning.

    • dtype = None – Defaults to None; use torch.float16 or torch.bfloat16 for newer GPUs.

    • load_in_4bit = True – Enables 4-bit quantization, reducing memory use 4× for fine-tuning. Disabling it enables LoRA 16-bit fine-tuning. You can also enable 16-bit LoRA with load_in_16bit = True (see the loading sketch after this list).

    • To enable full fine-tuning (FFT), set full_finetuning = True. For 8-bit fine-tuning, set load_in_8bit = True.

    • Note: Only one training method can be set to True at a time.
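    To make the settings above concrete, here is a minimal loading sketch. It assumes a recent Unsloth install; the model name is the dynamic 4-bit upload mentioned earlier and can be swapped for any model you like.

    from unsloth import FastLanguageModel

    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "unsloth/llama-3.1-8b-unsloth-bnb-4bit",
        max_seq_length = 2048, # context length used for fine-tuning
        dtype = None,          # auto-detect; torch.float16 / torch.bfloat16 also work
        load_in_4bit = True,   # QLoRA; set to False for LoRA 16-bit
        # full_finetuning = True, # or load_in_8bit = True - only one method can be True at a time
    )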

    We recommend starting with QLoRA, as it is one of the most accessible and effective methods for training models. With our dynamic 4-bit quants, the accuracy loss of QLoRA compared to LoRA is now largely recovered.

    You can also do Text-to-speech (TTS), reasoning (GRPO), vision, reinforcement learning (DPO, ORPO, KTO), continued pretraining, text completion and other training methodologies with Unsloth.

    Read our detailed guide on choosing the right model:

    3. Your Dataset

    For LLMs, datasets are collections of data that can be used to train our models. In order to be useful for training, text data needs to be in a format that can be tokenized.

    • You will need to create a dataset usually with 2 columns - question and answer. The quality and amount will largely reflect the end result of your fine-tune so it's imperative to get this part right.

    • You can synthetically generate data and structure your dataset (into QA pairs) using ChatGPT or local LLMs.

    • You can also use our new Synthetic Dataset notebook which automatically parses documents (PDFs, videos etc.), generates QA pairs and auto cleans data using local models like Llama 3.2. Access the notebook here.

    • Fine-tuning can learn from an existing repository of documents and continuously expand its knowledge base, but just dumping data alone won’t work as well. For optimal results, curate a well-structured dataset, ideally as question-answer pairs. This enhances learning, understanding, and response accuracy.

    • But, that's not always the case, e.g. if you are fine-tuning a LLM for code, just dumping all your code data can actually enable your model to yield significant performance improvements, even without structured formatting. So it really depends on your use case.
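    As a sketch of the question-answer structure described above (the dataset contents and column names here are illustrative), you can build a conversational dataset like this:

    from datasets import Dataset

    qa_pairs = [
        {"question": "What does Unsloth do?", "answer": "It makes LLM fine-tuning faster and more memory efficient."},
        {"question": "What is QLoRA?",        "answer": "LoRA combined with 4-bit quantization."},
    ]

    def to_conversations(row):
        # Convert each QA pair into the chat format expected by conversational chat templates
        return {"conversations": [
            {"role": "user",      "content": row["question"]},
            {"role": "assistant", "content": row["answer"]},
        ]}

    dataset = Dataset.from_list(qa_pairs).map(to_conversations)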

    Read more about creating your dataset:

    For most of our notebook examples we utilize the Alpaca dataset; however, other notebooks like Vision use different datasets, which may also need images in the answer output.

    4. Understand Training Hyperparameters

    Learn how to choose the right hyperparameters using best practices from research and real-world experiments - and understand how each one affects your model's performance.

    For a complete guide on how hyperparameters affect training, see:

    5. Installing + Requirements

    We would recommend beginners to utilise our pre-made notebooks first as it's the easiest way to get started with guided steps. However, if installing locally is a must, you can install and use Unsloth via Docker or pip install unsloth - just make sure you have all the right requirements necessary. Also depending on the model and quantization you're using, you'll need enough VRAM and resources. See all the details here:

    Next, you'll need to install Unsloth. Unsloth currently only supports Windows and Linux devices. Once you install Unsloth, you can copy and paste our notebooks and use them in your own local environment. We have many installation methods:

    6. Training + Evaluation

    Once you have everything set, it's time to train! If something's not working, remember you can always change hyperparameters, your dataset etc.

    You’ll see a log of numbers during training. This is the training loss, which shows how well the model is learning from your dataset. For many cases, a loss around 0.5 to 1.0 is a good sign, but it depends on your dataset and task. If the loss is not going down, you might need to adjust your settings. If the loss goes to 0, that could mean overfitting, so it's important to check validation too.

    The training loss will appear as numbers

    We generally recommend keeping the default settings unless you need longer training or larger batch sizes.

    • per_device_train_batch_size = 2 – Increase for better GPU utilization but beware of slower training due to padding. Instead, increase gradient_accumulation_steps for smoother training.

    • gradient_accumulation_steps = 4 – Simulates a larger batch size without increasing memory usage.

    • max_steps = 60 – Speeds up training. For full runs, replace with num_train_epochs = 1 (1–3 epochs recommended to avoid overfitting).

    • learning_rate = 2e-4 – Lower for slower but more precise fine-tuning. Try values like 1e-4, 5e-5, or 2e-5.
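    Wiring the hyperparameters above into a trainer looks roughly like the sketch below. It mirrors the notebooks (trl's SFTTrainer with transformers' TrainingArguments); newer trl versions take processing_class instead of tokenizer.

    from trl import SFTTrainer
    from transformers import TrainingArguments

    trainer = SFTTrainer(
        model = model,
        tokenizer = tokenizer,
        train_dataset = dataset,
        args = TrainingArguments(
            per_device_train_batch_size = 2,
            gradient_accumulation_steps = 4, # effective batch size = 2 * 4 = 8
            max_steps = 60,                  # or num_train_epochs = 1 for a full run
            learning_rate = 2e-4,
            output_dir = "outputs",
        ),
    )
    trainer.train()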

    Evaluation

    In order to evaluate, you could do manual evaluation by just chatting with the model and seeing if it's to your liking. You can also enable evaluation for Unsloth, but keep in mind it can be time-consuming depending on the dataset size. To speed up evaluation you can reduce the evaluation dataset size or set evaluation_steps = 100.

    For testing, you can also take 20% of your training data and use that for testing. If you already used all of the training data, then you have to manually evaluate it. You can also use automatic eval tools like EleutherAI’s lm-evaluation-harness. Keep in mind that automated tools may not perfectly align with your evaluation criteria.
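    A minimal sketch of the 20% hold-out split mentioned above (assuming a Hugging Face datasets Dataset object):

    split = dataset.train_test_split(test_size = 0.2, seed = 3407)
    train_dataset, eval_dataset = split["train"], split["test"]
    # Pass eval_dataset to your trainer and control how often evaluation runs
    # (e.g. via evaluation steps) to keep evaluation time manageable.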

    7. Running + Saving the model

    Now let's run the model after we completed the training process! You can edit the yellow underlined part! In fact, because we created a multi turn chatbot, we can now also call the model as if it saw some conversations in the past like below:

    Reminder: Unsloth itself provides 2x faster inference natively as well, so don't forget to call FastLanguageModel.for_inference(model). If you want the model to output longer responses, increase max_new_tokens = 128 to a larger number like 256 or 1024. Note that you will have to wait longer for the result as well!
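    A minimal inference sketch, assuming model and tokenizer come from your training run and a chat template is set up (the example prompt is illustrative):

    from unsloth import FastLanguageModel

    FastLanguageModel.for_inference(model) # enables Unsloth's native 2x faster inference

    messages = [{"role": "user", "content": "Describe a tall tower in the capital of France."}]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt = True, return_tensors = "pt"
    ).to("cuda")

    # Increase max_new_tokens (e.g. 256 or 1024) for longer responses - generation takes longer too
    outputs = model.generate(input_ids = inputs, max_new_tokens = 128, use_cache = True)
    print(tokenizer.decode(outputs[0], skip_special_tokens = True))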

    Saving the model

    For saving and using your model in your desired inference engine like Ollama, vLLM or Open WebUI, you can find more information here:

    We can now save the finetuned model as a small 100MB file called a LoRA adapter, like below. You can also push it to the Hugging Face Hub if you want to upload your model! Remember to get a Hugging Face token via https://huggingface.co/settings/tokens and add your token!
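    A minimal saving sketch (the local folder and Hugging Face repo names are placeholders):

    model.save_pretrained("lora_model")      # saves only the small LoRA adapter
    tokenizer.save_pretrained("lora_model")

    # Or push to the Hugging Face Hub with your token from https://huggingface.co/settings/tokens
    # model.push_to_hub("your_username/lora_model", token = "hf_...")
    # tokenizer.push_to_hub("your_username/lora_model", token = "hf_...")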

    After saving the model, we can again use Unsloth to run the model itself! Use FastLanguageModel again to call it for inference!

    8. We're done!

    You've successfully fine-tuned a language model and exported it to your desired inference engine with Unsloth!

    To learn more about fine-tuning tips and tricks, head over to our blogs which provide tremendous and educational value: https://unsloth.ai/blog/

    If you need any help on fine-tuning, you can also join our Discord server here or Reddit r/unsloth. Thanks for reading and hopefully this was helpful!

    🤔FAQ + Is Fine-tuning Right For Me?
    ❓What Model Should I Use?
    📈Datasets Guide
    🧠LoRA Hyperparameters Guide
    🛠️Unsloth Requirements
    📥Installation
    🖥️Inference & Deployment

    Install Docker and NVIDIA Container Toolkit.

    Install Docker via Linux or Desktop (other). Then install NVIDIA Container Toolkit:

    2

    Run the container.

    unsloth/unsloth is Unsloth's only Docker image.

    3

    Access Jupyter Lab

    Go to http://localhost:8888 and open Unsloth. Access the unsloth-notebooks tabs to see Unsloth notebooks.

    4

    Start training with Unsloth

    If you're new, follow our step-by-step Fine-tuning Guide, RL Guide or just save/copy any of our premade notebooks.

    5

    Docker issues - GPU not discovered?

    Try doing WSL via Method #2 - WSL:

    Method #2 - WSL:

    1

    Install WSL

    Open up Command Prompt or the Terminal and install Ubuntu. Set the password if asked.

    2

    If you skipped step (1) because WSL is already installed, enter WSL by typing wsl and pressing ENTER in the command prompt

    3

    Install Python

    4

    Install PyTorch

    If you encounter permission issues, add --break-system-packages: pip install torch torchvision --force-reinstall --index-url https://download.pytorch.org/whl/cu130 --break-system-packages

    5

    Install Unsloth and Jupyter Notebook

    If you encounter permission issues, add --break-system-packages: pip install unsloth jupyter --break-system-packages

    6

    Launch Unsloth via Jupyter Notebook

    Then open up our notebooks within and load them up! You can also go to Colab notebooks and download > download .ipynb and load them.

    Method #3 - Windows directly:

    Python 3.13 now works with Unsloth!

    1

    Install NVIDIA Video Driver

    You should install the latest version of your GPU's driver. Download drivers here: NVIDIA GPU Drivers

    2

    Install Visual Studio C++

    You will need Visual Studio, with C++ installed. By default, C++ is not installed with Visual Studio, so make sure you select all of the C++ options. Also select options for Windows 10/11 SDK.

    • Launch the Installer here:

    • In the installer, navigate to individual components and select all the options listed here:

      • .NET Framework 4.8 SDK

      • .NET Framework 4.7.2 targeting pack

      • C# and Visual Basic Roslyn compilers

    Easier method: Or you can open an elevated Command Prompt or PowerShell:

    • Search for "cmd" or "PowerShell", right-click it, and choose "Run as administrator."

    • Paste and run this command (update the Visual Studio path if necessary):

    3

    Install Python and CUDA Toolkit

    Follow the instructions to install the CUDA Toolkit.

    Then install Miniconda (which has Python) here:

    4

    Install PyTorch

    You will need the correct version of PyTorch that is compatible with your CUDA drivers, so make sure to select them carefully.

    5

    Install Unsloth

    Open Conda command prompt or your terminal with Python and run the command:

    If you're using GRPO or plan to use vLLM, currently vLLM does not support Windows directly but only via WSL or Linux.

    Notes

    To run Unsloth directly on Windows:

    • Install Triton from this Windows fork and follow the instructions here (be aware that the Windows fork requires PyTorch >= 2.4 and CUDA 12)

    • In the SFTTrainer, set dataset_num_proc=1 to avoid a crashing issue:

    Advanced/Troubleshooting

    For advanced installation instructions or if you see weird errors during installations:

    1. Install torch and triton. Go to https://pytorch.org to install it. For example pip install torch torchvision torchaudio triton

    2. Confirm that CUDA is installed correctly. Try nvcc. If that fails, you need to install cudatoolkit or CUDA drivers.

    3. Install xformers manually. You can try installing vllm and seeing if vllm succeeds. Check if xformers succeeded with python -m xformers.info Go to https://github.com/facebookresearch/xformers. Another option is to install flash-attn for Ampere GPUs.

    4. Double check that your versions of Python, CUDA, CUDNN, torch, triton, and xformers are compatible with one another. The PyTorch compatibility matrix may be useful.

    5. Finally, install bitsandbytes and check it with python -m bitsandbytes

    Method #4 - Windows using PowerShell:

    Step 1: Install Prerequisites

    1. Install NVIDIA CUDA Toolkit:

      • Download and install the appropriate version of the NVIDIA CUDA Toolkit from CUDA Downloads.

      • Reboot your system after installation if prompted.

      • Note: No additional setup is required after installation for Unsloth.

    2. Install Microsoft C++ Build Tools:

      • Download and install Microsoft Build Tools for Visual Studio from the official website.

      • During installation, select the C++ build tools workload. Ensure the MSVC compiler toolset is included.

    3. Set Environment Variables for the C++ Compiler:

      • Open the System Properties window (search for "Environment Variables" in the Start menu).

      • Click "Environment Variables…".

      • Add or update the following under System variables:

    4. Install Conda

      1. Download and install Miniconda from the official website

      2. Follow installation instruction from the website

      3. To check whether conda is already installed, you can test it with conda

    Step 2: Run the Unsloth Installation Script

    1. Download the unsloth_windows.ps1 PowerShell script by going through this link.

    2. Open PowerShell as Administrator:

      • Right-click Start and select "Windows PowerShell (Admin)".

    3. Navigate to the script’s location using cd:

    4. Run the script:

    Step 3: Using Unsloth

    Activate the environment after the installation completes:

    Unsloth and its dependencies are now ready!

    unsloth/unsloth
    Blackwell
    Docker guide
    gpt-oss RL training is now free on Colab. We introduced embedding offloading, which reduces usage by a further 1GB, via offload_embeddings. Unsloth's new inference runs faster on any GPU including A100, H100 and old T4's. gpt-oss-120b fits nicely on a 120GB VRAM GPU.

    Unsloth is the only framework to support 4-bit RL for gpt-oss. All performance gains are due to Unsloth's unique weight sharing, Flex Attention, Standby and custom kernels.

    Reminder: Flash Attention 3 (FA3) is unsuitable for gpt-oss training since it currently does not support the backward pass for attention sinks, causing incorrect training losses. If you’re not using Unsloth, FA3 may be enabled by default, so please double-check it’s not in use! Disabling FA3 will incur O(N^2) memory usage as well, so Unsloth is the only RL framework to offer O(N) memory usage for gpt-oss via our Flex attention implementation.

    ⚡Making Inference Much Faster

    Inference is crucial in RL training, since we need it to generate candidate solutions before maximizing some reward function (see here for a more detailed explanation). To achieve the fastest inference speed for gpt-oss without vLLM, we rewrote Transformers inference code and integrated many innovations including custom algorithms like Unsloth Flex Attention, using special flags within torch.compile (like combo kernels). Our new inference code for gpt-oss was evaluated against an already optimized baseline (2x faster than native Transformers).

    vLLM does not support RL for gpt-oss since it lacks BF16 training and LoRA support for gpt-oss. Without Unsloth, only training via full precision BF16 works, making memory use 800%+ higher. Most frameworks enable FA3 (Flash Attention 3) by default (which reduces VRAM use & increases speed), but this causes incorrect training loss. See Issue 1797 in the FA3 repo. You must disable FA3, yet doing so prevents long-context training, since FA3 uses O(N) memory whilst naive attention balloons to O(N^2). So, to make attention sinks differentiable, we implemented Unsloth Flex Attention.

    We evaluated gpt-oss RL inference by benchmarking BitsandBytes 4-bit and also did separate tests for BF16. Unsloth’s 4-bit inference is ~4x faster, and BF16 is also more efficient, especially in VRAM use.

    The best part about Unsloth's gpt-oss RL is that it can work on any GPU, even those that do not support BF16. Our free gpt-oss-20b Colab notebooks use older 15GB T4 GPUs, so the inference examples work well!

    🛠️ gpt-oss Flex Attention Issues and Quirks

    We had to change our implementation for attention sinks as described here to allow generation to work with left padding. We had to get the logsumexp and apply the sigmoid activation to alter the attention weights like below:

    Left-padded masking during inference was also a tricky issue to deal with in gpt-oss. We found that we had to not only account for KV Cache prefill during token generation, but also account for a unique amount of pad tokens in each prompt for batch generations, which changes how we need to store the block mask. An example of this can be seen below:

    Normal Causal Mask:

    For inference in general case (decoding)

    If we naively use the same masking strategy, this'll fail:

    For generation (decoding phase), we usually only care about the last row of the attention matrix, since there’s just one query token attending to all previous key tokens. If we naively apply the causal mask (q_idx ≥ k_idx), this fails as our single query has index 0, while there are n_k key tokens. To fix this, we need an offset in mask creation to decide which tokens to attend. But a naïve approach is slow, since offsets change each step, forcing mask and kernel regeneration. We solved this with cache and compile optimizations.
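    A minimal sketch of the offset idea using PyTorch's flex_attention block masks (illustrative, not Unsloth's actual kernel): during decoding the single query has q_idx = 0, so we shift it by the number of cached keys before applying the causal rule.

    import torch
    from torch.nn.attention.flex_attention import create_block_mask

    def decode_causal_mask(kv_len: int, q_len: int = 1, device = "cuda"):
        offset = kv_len - q_len  # number of keys already stored in the KV Cache
        def causal_with_offset(b, h, q_idx, kv_idx):
            # The query at local index q_idx really sits at global position q_idx + offset
            return (q_idx + offset) >= kv_idx
        return create_block_mask(causal_with_offset, B = None, H = None,
                                 Q_LEN = q_len, KV_LEN = kv_len, device = device)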

    The harder part is batch generation. Sequences differ in length, so padding complicates mask creation. Flex Attention had a lot of challenges and dynamic masks are tricky. Worse, if not compiled, it falls back to eager attention which is slow and memory-heavy (quadratic vs. linear in sequence length).

    Quote from https://github.com/meta-pytorch/attention-gym/issues/15#issuecomment-2284148665

    You need to call this with _compile=True. We essentially map your block mask over a full Q_LEN x KV_LEN matrix in order to produce the block mask. Without compile, we need to materialize this full thing, and it can cause OOMs on long sequences.

    As well, you need to run flex_attention = torch.compile(flex_attention). Without compile, flex falls back to a non-fused eager implementation that is great for debugging, but it is much slower and materializes the full scores matrix.

    Ultimately, the mask must dynamically handle prefill vs decode with the KV Cache, batch and padding tokens per sequence, remain torch.compile friendly, and support sliding windows.

    🔍 Flash Attention Investigation

    Another interesting direction we explored was trying to integrate Flash Attention. Its advantages are widely recognized, but one limitation is that it does not support attention sinks during the backward pass for gpt-oss. To work around this, we restructured the attention mechanism so that it operates solely on the attention output and the logsumexp values that FlashAttention readily provides. Given these benefits, it seemed like an obvious choice to try.

    However, we soon began noticing issues. While the first few layers behaved as expected, the later layers, particularly layers 18 through 24, produced outputs that diverged significantly from the eager-mode implementation in transformers. Importantly, this discrepancy cannot be attributed to error accumulation, since the inputs to each method are identical at every layer. For further validation, we also compared the results against Unsloth FlexAttention.

    This needs further investigation into why only the last few layers show such a drastic difference between flash attention implementation vs. the others.

    Flash Attention 3 doesn't support the backwards pass for attention sinks

    FA3 is often enabled by default for most training packages (not Unsloth), but this is incorrect for gpt-oss. Using FA3 will make training loss completely wrong as FA3 doesn’t support gpt-oss backward passes for attention sinks. Many people are still unaware of this so please be cautious!

    ⚠️ Can We Counter Reward Hacking?

    The ultimate goal of RL is to maximize some reward (say speed, revenue, some metric). But RL can cheat. When the RL algorithm learns a trick or exploits something to increase the reward without actually doing the task at hand, this is called "Reward Hacking".

    It's the reason models learn to modify unit tests to pass coding challenges, and these are critical blockers for real world deployment. Some other good examples are from Wikipedia.

    In our free gpt-oss RL notebook we explore how to counter reward hacking in a code generation setting and showcase tangible solutions to common error modes. We saw the model edit the timing function, outsource to other libraries, cache the results, and outright cheat. After countering, the result is our model generates genuinely optimized matrix multiplication kernels, not clever cheats.

    🏆Reward Hacking

    Some common examples of reward hacking during RL include:

    Laziness

    RL learns to use NumPy, Torch or other libraries, which call optimized CUDA kernels. We can stop the RL algorithm from calling optimized code by inspecting whether the generated code imports other non-standard Python libraries.

    Caching & Cheating

    RL learns to cache the result of the output and RL learns to find the actual output by inspecting Python global variables.

    We can stop the RL algorithm from using cached data by wiping the cache with a large fake matrix. We also have to benchmark carefully with multiple loops and turns.

    Cheating

    RL learns to edit the timing function to make it report 0 time as having passed. We can stop the RL algorithm from using global or cached variables by restricting its locals and globals. We also use exec to create the function, so we have to save the output to an empty dict. We also disallow global variable access via types.FunctionType(f.__code__, {}).
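    A minimal sketch of these counters (the function name matmul and the allow-list are illustrative): reject generations that import non-standard libraries, and rebuild the candidate function with empty globals so cached results or patched timers are unreachable.

    import ast
    import types

    ALLOWED_IMPORTS = {"math"}  # e.g. disallow numpy/torch so RL cannot outsource the work

    def uses_disallowed_imports(source: str) -> bool:
        # Walk the AST and flag any import outside the allow-list
        for node in ast.walk(ast.parse(source)):
            if isinstance(node, ast.Import):
                modules = [alias.name for alias in node.names]
            elif isinstance(node, ast.ImportFrom):
                modules = [node.module or ""]
            else:
                continue
            if any(m.split(".")[0] not in ALLOWED_IMPORTS for m in modules):
                return True
        return False

    def build_sandboxed_function(source: str, func_name: str = "matmul"):
        # exec into an empty dict, then strip global access via types.FunctionType(f.__code__, {})
        scope = {}
        exec(compile(source, "<candidate>", "exec"), scope)
        f = scope[func_name]
        return types.FunctionType(f.__code__, {})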

    Tutorial: How to Train gpt-oss with RL

    LLMs often struggle with tasks that involve complex environments. However, by applying reinforcement learning (RL) and designing a custom reward function, these challenges can be overcome.

    RL can be adapted for tasks such as auto kernel or strategy creation. This tutorial shows how to train gpt-oss with GRPO and Unsloth to autonomously beat 2048.

    Our notebooks include step-by-step guides on how to navigate the whole process already.

    (Official OpenAI example)

    What you’ll build:

    • Train gpt-oss-20b so the model can automatically win 2048

    • Create a minimal 2048 environment the model can interact with

    • Define reward functions that:

      1. Check the generated strategy compiles and runs,

      2. Prevent reward hacking (disallow external imports), and

      3. Reward actual game success

    • Run inference and export the model (MXFP4 4‑bit or merged FP16)

    Hardware: The 2048 example runs on a free Colab T4, but training will be slow. A100/H100 is much faster. 4‑bit loading + LoRA lets you fit a 20B model into modest VRAM

    gpt-oss
    Unsloth
    50% weight sharing feature
    gpt-oss-20b GRPO Colab notebook
    counteract reward-hacking
    $$
    A(X) = \sigma\left(\tfrac{1}{\sqrt{d}}QK^T\right)V = \frac{\exp\left(\tfrac{1}{\sqrt{d}}QK^T\right)}{\sum \exp\left(\tfrac{1}{\sqrt{d}}QK^T\right)}\,V
    $$

    $$
    \text{LSE} = \log \sum \exp\left(\tfrac{1}{\sqrt{d}}QK^T\right), \qquad
    A_{\text{sinks}}(X) = A(X) \odot \sigma\left(\text{LSE} - \text{sinks}\right)
    $$
    ⚡ Step-by-Step Tutorial
    1

    Start with Unsloth Docker image for DGX Spark

    First, build the Docker image using the DGX Spark Dockerfile which can be found here. You can also run the below in a Terminal in the DGX Spark:

    Then, build the training Docker image using saved Dockerfile:

    You can also click to see the full DGX Spark Dockerfile
    2

    Launch container

    Launch the training container with GPU access and volume mounts:

    3

    Start Jupyter and Run Notebooks

    Inside the container, start Jupyter and run the required notebook. You can use the Reinforcement Learning gpt-oss-20b to win 2048 notebook. In fact, all notebooks work on DGX Spark, including the 120b notebook! Just remove the installation cells.

    The below commands can be used to run the RL notebook as well. After Jupyter Notebook is launched, open up the “gpt_oss_20B_RL_2048_Game.ipynb”

    Don't forget Unsloth also allows you to save and export your models after fine-tuning, so you can deploy them locally on your DGX Spark afterwards.

    Many thanks to Lakshmi Ramesh and Barath Anandan from NVIDIA for helping Unsloth’s DGX Spark launch and building the Docker image.

    Unified Memory Usage

    gpt-oss-120b QLoRA 4-bit fine-tuning will use around 68GB of unified memory. How your unified memory usage should look before (left) and after (right) training:

    And that's it! Have fun training and running LLMs completely locally on your NVIDIA DGX Spark!

    Video Tutorials

    Thanks to Tim from AnythingLLM for providing a great fine-tuning tutorial with Unsloth on DGX Spark:

    OpenAI DevDay
    You can watch Unsloth featured on OpenAI DevDay 2025 here.
    gpt-oss trained with RL consistently outperforms on 2048.

    3x Faster LLM Training with Unsloth Kernels + Packing

    Learn how Unsloth increases training throughput and eliminates padding waste for fine-tuning.

    Unsloth now supports up to 5× faster (typically 3x) training with our new custom RoPE and MLP Triton kernels, plus our new smart auto packing. Unsloth's new kernels + features not only increase training speed, but also further reduce VRAM use (30% - 90%) with no accuracy loss (Unsloth GitHub). This means you can now train LLMs like Qwen3-4B not only on just 3GB of VRAM, but also 3x faster.

    Our auto padding-free uncontaminated packing is smartly enabled for all training runs without any changes, and works with all fast attention backends (FlashAttention 3, xFormers, SDPA). Benchmarks show training losses match non-packing runs exactly.

    • 2.3x faster QK Rotary Embedding fused Triton kernel with packing support

    • Updated SwiGLU, GeGLU kernels with int64 indexing for long context

    • 2.5x to 5x faster uncontaminated packing with xformers, SDPA, FA3 backends

    • 2.1x faster padding free, 50% less VRAM, 0% accuracy change

    • Unsloth also now has improved SFT loss stability and more predictable GPU utilization.

    • This new upgrade works for all training methods e.g. full fine-tuning, pretraining etc.

    🥁Fused QK RoPE Triton Kernel with packing

    Back in December 2023, we introduced a RoPE kernel coded up in Triton as part of our Unsloth launch. In March 2024, a community member made end to end training 1-2% faster by optimizing the RoPE kernel to allow launching a block for a group of heads. See .

    One issue is for each Q and K, there are 2 Triton kernels. We merged them into 1 Triton kernel now, and enabled variable length RoPE, which was imperative for padding free and packing support. This makes the RoPE kernel in micro benchmarks 2.3x faster on longer context lengths, and 1.9x faster on shorter context lengths.

    We also eliminated all clones and contiguous transpose operations, so RoPE is now fully in-place, reducing GPU memory further. Note that for the backward pass, the sign of sin flips (sin1 = -sin1), as shown below.
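    This follows from a standard rotation identity (spelled out here for clarity, not Unsloth-specific): the RoPE rotation of each 2-D pair is orthogonal, so the backward pass multiplies gradients by the transposed rotation, which is the same rotation with sin negated:

    $$
    R(\theta) = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix},
    \qquad
    R(\theta)^\top = \begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix} = R(-\theta)
    $$

    This is why the same fused kernel can be reused for the backward pass with sin1 = -sin1.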

    🚃Int64 Indexing for Triton Kernels

    During 500K long context training which we introduced in , we would get CUDA out of bounds errors. This was because MLP kernels for SwiGLU, GeGLU had int32 indexing which is by default in Triton and CUDA.

    We can't just do tl.program_id(0).to(tl.int64) since training will be slightly slower due to int64 indexing. We instead make this a LONG_INDEXING: tl.constexpr variable so the Triton compiler can specialize this. This allows shorter and longer context runs to both run great!

    ♠️Uncontaminated Packing 2-5x faster training

    Real datasets can contain different sequence lengths, so increasing the batch size to 32 for example will cause padding, making training slower and use more VRAM.

    In the past, increasing batch_size to large numbers (>32) will make training SLOWER, not faster. This was due to padding - we can now eliminate this issue via packing = True, and so training is FASTER!

    When we pack multiple samples into a single one-dimensional tensor, we keep sequence-length metadata around in order to properly mask samples, without leaking attention between samples. We also need the RoPE kernel described above to allow position ids to reset.
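    A minimal sketch of this packing metadata (illustrative, not Unsloth's implementation): samples are concatenated into one 1-D tensor, per-sample lengths are kept for block-diagonal masking, and position ids restart at every boundary so RoPE treats each sample independently.

    import torch

    def pack(samples):
        # Concatenate token ids of all samples into a single 1-D tensor
        input_ids = torch.tensor([tok for sample in samples for tok in sample])
        seq_lens  = torch.tensor([len(sample) for sample in samples])
        # Position ids reset to 0 at each sample boundary (needed by the variable-length RoPE kernel)
        position_ids = torch.cat([torch.arange(n) for n in seq_lens])
        # Cumulative sequence lengths - the usual metadata for varlen attention backends
        cu_seqlens = torch.cat([torch.zeros(1, dtype = torch.long), seq_lens.cumsum(0)])
        return input_ids, position_ids, cu_seqlens

    ids, pos, cu = pack([[1, 2, 3], [4, 5], [6, 7, 8, 9]])
    # ids = [1,2,3,4,5,6,7,8,9]; pos = [0,1,2,0,1,0,1,2,3]; cu = [0,3,5,9]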

    🧮Why is padding needed & mathematical speedup

    Computers and GPUs cannot process variable-length rows directly, so we have to pad them with 0s, which causes wastage. Assume we have a dataset of 50% short sequences of length $S$ and 50% long sequences of length $L$. In the worst case, padding makes the token usage for $N$ rows approximately $N \cdot L$, since the longest sequence length dominates.

    By packing multiple examples into a single, long one-dimensional tensor, we can eliminate a significant amount of padding. In fact we get the below token usage:

    $$N \left( \tfrac{1}{2} S + \tfrac{1}{2} L \right)$$

    By some math and algebra, we can work out the speedup via:

    $$\text{speedup} = \frac{N L}{N\left(\tfrac{1}{2}S + \tfrac{1}{2}L\right)} = \frac{2L}{S + L}$$

    By assuming $S \ll L$, we get a 2x theoretical speedup since $\frac{2L}{S+L} \to 2$.

    By changing the ratio of 50% short sequences, and assuming we have MORE short sequences, for example 20% long sequences and 80% short sequences, we get $\frac{L}{0.8S + 0.2L} \to 5$, so 5x faster training! This means packing's speedup depends on how many short rows your dataset has (the more short rows, the faster).

    🏖️Analysis and Benchmarks

    To demonstrate the various improvements when training with our new kernels and packed data, we ran fine-tuning runs with models including Qwen3-8B and Llama 3 8B on the yahma/alpaca-cleaned dataset and measured various throughput and efficiency metrics. We compared our new runs vs. a standard optimized training run with our own kernels/optimizations turned on and kernels like Flash Attention 3 (FA3) enabled. We fixed max_length = 1024 and varied the batch size in {1, 2, 4, 8, 16, 32}. This allows the maximum token count per batch to vary in {1024, 2048, 4096, 8192, 16K, 32K}.

    The above shows how tokens per second (tokens/s) training throughput varies for the new Unsloth with varying batch size. This translates into training your model on an epoch of your dataset 1.7-3x faster (sometimes even 5x or more)! These gains will be more pronounced if there are many short sequences in your data and if you have longer training runs, as described in the mathematical speedup section above.

    The above shows the average percentage of tokens per batch that are valid (i.e., non-padding). As the batch size grows, many more padding tokens are seen in the unpacked case, while we achieve high packing efficiency in the packed case regardless of max sequence length.

    Note that, since the batching logic trims batches to the maximum sequence length seen in the batch, when the batch size is 1, the unpacked data is all valid tokens (i.e., no padding). However, as more examples are added into the batch, padding increases on average, hitting nearly 50% padding at a batch size of 8! Our sample packing implementation eliminates that waste.

    The first graph (above) plots progress on yahma/alpaca-cleaned with max_length = 2048, Unsloth new with packing + kernels (maroon) vs. Unsloth old (gray). Both are trained with max_steps = 500, but we plot the x-axis in wall-clock time. Notice that we train on nearly 40% of an epoch in the packed case in the same amount of steps (and only a bit more wall-clock time) that it takes to train less than 5% of an epoch in the unpacked case.

    Similarly, the 2nd graph (above) plots loss from the same runs, this time plotted with training steps on the x-axis. Notice that the losses match in scale and trend, but the loss in the packing case is less variable since the model is seeing more tokens per training step.

    🎬Padding-Free by Default

    In addition to the large throughput gains available when setting packing = True in your SFTConfig, we automatically use padding-free batching to reduce padding waste and increase tokens/s throughput, while resulting in the exact same loss as the previous version of Unsloth.

    For example for Qwen3-8B and Qwen3-32B, we see memory usage decrease by 60%, be 2x faster, and have the same exact loss and grad norm curves!

    ✨How to enable packing?

    Update Unsloth first and padding-free batching is enabled by default! All training is immediately 1.1 to 2x faster with at least 30% less memory usage and zero change in the loss curve!

    We also support Flash Attention 3 via xFormers, SDPA and Flash Attention 2, and this works on old GPUs (Tesla T4, RTX 2080) and new GPUs like H100s, B200s etc! Sample packing works regardless of your choice of attention backend or model family, so you enjoy the same speedups on top of these fast attention implementations!

    Add packing = True to enable up to 5x faster training!
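    A minimal sketch, assuming a recent trl release where packing is an SFTConfig option (model, tokenizer and dataset come from your own setup; newer trl versions use processing_class instead of tokenizer):

    from trl import SFTTrainer, SFTConfig

    trainer = SFTTrainer(
        model = model,
        tokenizer = tokenizer,
        train_dataset = dataset,
        args = SFTConfig(
            packing = True,                   # enable uncontaminated sample packing
            max_length = 2048,                # packed sequence length
            per_device_train_batch_size = 2,
            gradient_accumulation_steps = 4,
            output_dir = "outputs",
        ),
    )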

    All our notebooks are automatically faster (no need to do anything). See

    Qwen3 14B faster:

    Llama 3.1 Conversational faster:

    Thank you! If you're interested, see our blogs for more topics on kernels and performance gains!

    DeepSeek-OCR: How to Run & Fine-tune

    Guide on how to run and fine-tune DeepSeek-OCR locally.

    DeepSeek-OCR is a 3B-parameter vision model for OCR and document understanding. It uses context optical compression to convert 2D layouts into vision tokens, enabling efficient long-context processing.

    Capable of handling tables, papers, and handwriting, DeepSeek-OCR achieves 97% precision while using 10× fewer vision tokens than text tokens - making it 10× more efficient than text-based LLMs.

    You can fine-tune DeepSeek-OCR to enhance its vision or language performance. In our Unsloth free fine-tuning notebook, we demonstrated an 88.26% improvement for language understanding.

    Running DeepSeek-OCR | Fine-tuning DeepSeek-OCR

    Our model upload that enables fine-tuning + more inference support: DeepSeek-OCR

    🖥️ Running DeepSeek-OCR

    To run the model in vLLM or Unsloth, here are the recommended settings:

    ⚙️ Recommended Settings

    DeepSeek recommends these settings:

    • Temperature = 0.0

    • max_tokens = 8192

    • ngram_size = 30

    • window_size = 90

    📖 vLLM: Run DeepSeek-OCR Tutorial

    1. Obtain the latest vLLM via:

    2. Then run the following code:

    🦥 Unsloth: Run DeepSeek-OCR Tutorial

    1. Obtain the latest unsloth via pip install --upgrade unsloth . If you already have Unsloth, update it via pip install --upgrade --force-reinstall --no-deps --no-cache-dir unsloth unsloth_zoo

    2. Then use the code below to run DeepSeek-OCR:

    🦥 Fine-tuning DeepSeek-OCR

    Unsloth supports fine-tuning of DeepSeek-OCR. Since the default model isn't runnable on the latest transformers version, we added changes from the team to enable inference. As usual, Unsloth trains DeepSeek-OCR 1.4x faster with 40% less VRAM and 5x longer context lengths - no accuracy degradation. We created two free DeepSeek-OCR Colab notebooks (with and without eval):

    • DeepSeek-OCR:

    • DeepSeek-OCR: (A100)

    Fine-tuning DeepSeek-OCR on a 200K sample Persian dataset resulted in substantial gains in Persian text detection and understanding. We evaluated the base model against our fine-tuned version on 200 Persian transcript samples, observing an 88.26% absolute improvement in Character Error Rate (CER). After only 60 training steps (batch size = 8), the mean CER decreased from 149.07% to a mean of 60.81%. This means the fine-tuned model is 57% more accurate at understanding Persian.

    You can replace the Persian dataset with your own to improve DeepSeek-OCR for other use cases. For replicable eval results, use our eval notebook above. For detailed eval results, see below:

    Fine-tuned Evaluation Results:

    DeepSeek-OCR Baseline

    Mean Baseline Model Performance: 149.07% CER for this eval set!

    DeepSeek-OCR Fine-tuned

    With 60 steps, we reduced CER from 149.07% to 60.43% (89% CER improvement)

    An example from the 200K Persian dataset we used (you may use your own), showing the image on the left and the corresponding text on the right.

    Saving to GGUF

    Saving models to 16bit for GGUF so you can use it for Ollama, Jan AI, Open WebUI and more!

    To save to GGUF, use the below to save locally:

    model.save_pretrained_gguf("directory", tokenizer, quantization_method = "q4_k_m")
    model.save_pretrained_gguf("directory", tokenizer, quantization_method = "q8_0")
    model.save_pretrained_gguf("directory", tokenizer, quantization_method = "f16")

    To push to Hugging Face hub:

    model.push_to_hub_gguf("hf_username/directory", tokenizer, quantization_method = "q4_k_m")
    model.push_to_hub_gguf("hf_username/directory", tokenizer, quantization_method = "q8_0")

    All supported quantization options for quantization_method are listed below:

    # https://github.com/ggerganov/llama.cpp/blob/master/examples/quantize/quantize.cpp#L19
    # From https://mlabonne.github.io/blog/posts/Quantize_Llama_2_models_using_ggml.html
    ALLOWED_QUANTS = \
    {
        "not_quantized"  : "Recommended. Fast conversion. Slow inference, big files.",
        "fast_quantized" : "Recommended. Fast conversion. OK inference, OK file size.",
        "quantized"      
    

    First save your model to 16bit:

    Then use the terminal and do:

    Or follow the steps at https://rentry.org/llama-cpp-conversions#merging-loras-into-a-model using the model name "merged_model" to merge to GGUF.

    Running in Unsloth works well, but after exporting & running on other platforms, the results are poor

    You might sometimes encounter an issue where your model runs and produces good results on Unsloth, but when you use it on another platform like Ollama or vLLM, the results are poor or you might get gibberish, endless/infinite generations or repeated outputs.

    • The most common cause of this error is using an incorrect chat template. It’s essential to use the SAME chat template that was used when training the model in Unsloth and later when you run it in another framework, such as llama.cpp or Ollama. When inferencing from a saved model, it's crucial to apply the correct template.

    • You must use the correct eos token. If not, you might get gibberish on longer generations.

    • It might also be because your inference engine adds an unnecessary "start of sequence" token (or, on the contrary, is missing one), so ensure you check both hypotheses!

    Saving to GGUF / vLLM 16bit crashes

    You can try reducing the maximum GPU usage during saving by changing maximum_memory_usage.

    The default is model.save_pretrained(..., maximum_memory_usage = 0.75). Reduce it to say 0.5 to use 50% of GPU peak memory or lower. This can reduce OOM crashes during saving.

    How do I manually save to GGUF?

    First save your model to 16bit via:

    Compile llama.cpp from source like below:

    Then, save the model to F16:

    Finetuning from Last Checkpoint

    Checkpointing allows you to save your finetuning progress so you can pause it and then continue.

    You must edit the Trainer first to add save_strategy and save_steps. Below saves a checkpoint every 50 steps to the folder outputs.

    trainer = SFTTrainer(
        ....
        args = TrainingArguments(
            ....
            output_dir = "outputs",
            save_strategy = "steps",
            save_steps = 50,
        ),
    )

    Then in the trainer do:

    trainer_stats = trainer.train(resume_from_checkpoint = True)

    Which will start from the latest checkpoint and continue training.

    Wandb Integration

    Then in TrainingArguments() set

    To train the model, do trainer.train(); to resume training, do trainer.train(resume_from_checkpoint = True).

    ❓How do I do Early Stopping?

    If you want to stop or pause the finetuning / training run since the evaluation loss is not decreasing, then you can use early stopping which stops the training process. Use EarlyStoppingCallback.

    As usual, set up your trainer and your evaluation dataset. The below is used to stop the training run if the eval_loss (the evaluation loss) is not decreasing after 3 steps or so.

    We then add the callback which can also be customized:
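    A minimal sketch using transformers' EarlyStoppingCallback; it assumes your trainer args already enable evaluation (an eval dataset, load_best_model_at_end = True and metric_for_best_model = "eval_loss"):

    from transformers import EarlyStoppingCallback

    trainer.add_callback(
        EarlyStoppingCallback(
            early_stopping_patience = 3,     # stop after 3 evaluations with no eval_loss improvement
            early_stopping_threshold = 0.0,  # minimum change that counts as an improvement
        )
    )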

    Then train the model as usual via trainer.train() .

    Unsloth Benchmarks

    Unsloth recorded benchmarks on NVIDIA GPUs.

    • For more detailed benchmarks, read our Llama 3.3 Blog.

    • Benchmarking of Unsloth was also conducted by 🤗Hugging Face.

    Tested on H100 and Blackwell GPUs. We tested using the Alpaca Dataset, a batch size of 2, gradient accumulation steps of 4, rank = 32, and applied QLoRA on all linear layers (q, k, v, o, gate, up, down):

    | Model | VRAM | 🦥Unsloth speed | 🦥VRAM reduction | 🦥Longer context | 😊Hugging Face + FA2 |
    | --- | --- | --- | --- | --- | --- |
    | Llama 3.3 (70B) | 80GB | 2x | >75% | 13x longer | 1x |
    | Llama 3.1 (8B) | 80GB | 2x | >70% | 12x longer | 1x |

    Context length benchmarks

    The more data you have, the less VRAM Unsloth uses due to our algorithm + Apple's CCE algorithm!

    Llama 3.1 (8B) max. context length

    We tested Llama 3.1 (8B) Instruct and did 4bit QLoRA on all linear layers (Q, K, V, O, gate, up and down) with rank = 32 with a batch size of 1. We padded all sequences to a certain maximum sequence length to mimic long context finetuning workloads.

    | GPU VRAM | 🦥Unsloth context length | Hugging Face + FA2 |
    | --- | --- | --- |
    | 8 GB | 2,972 | OOM |
    | 12 GB | 21,848 | 932 |
    | 16 GB | 40,724 | 2,551 |
    | 24 GB | 78,475 | 5,789 |
    | 40 GB | 153,977 | 12,264 |
    | 48 GB | 191,728 | 15,502 |
    | 80 GB | 342,733 | 28,454 |

    Llama 3.3 (70B) max. context length

    We tested Llama 3.3 (70B) Instruct on a 80GB A100 and did 4bit QLoRA on all linear layers (Q, K, V, O, gate, up and down) with rank = 32 with a batch size of 1. We padded all sequences to a certain maximum sequence length to mimic long context finetuning workloads.

    | GPU VRAM | 🦥Unsloth context length | Hugging Face + FA2 |
    | --- | --- | --- |
    | 48 GB | 12,106 | OOM |
    | 80 GB | 89,389 | 6,916 |

    Ministral 3: How to Run Guide

    Guide for Mistral Ministral 3 models, to run or fine-tune locally on your device

    Mistral releases Ministral 3, their new multimodal models in Base, Instruct, and Reasoning variants, available in 3B, 8B, and 14B sizes. They offer best-in-class performance for their size, and are fine-tuned for instruction and chat use cases. The multimodal models support 256K context windows, multiple languages, native function calling, and JSON output.

    The full unquantized 14B Ministral-3-Instruct-2512 model fits in 24GB RAM/VRAM. You can now run, fine-tune and RL on all Ministral 3 models with Unsloth:

    We've also uploaded Mistral Large 3 . For all Ministral 3 uploads (BnB, FP8), .

    Ministral-3-Instruct GGUFs:
    Ministral-3-Reasoning GGUFs:
    docker model run ai/gpt-oss:20B
    docker model run hf.co/unsloth/gpt-oss-20b-GGUF:UD-Q8_K_XL
    export NVIDIA_CONTAINER_TOOLKIT_VERSION=1.17.8-1
    sudo apt-get update && sudo apt-get install -y \
      nvidia-container-toolkit=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
      nvidia-container-toolkit-base=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
      libnvidia-container-tools=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
      libnvidia-container1=${NVIDIA_CONTAINER_TOOLKIT_VERSION}
    docker run -d -e JUPYTER_PASSWORD="mypassword" \
      -p 8888:8888 -p 2222:22 \
      -v $(pwd)/work:/workspace/work \
      --gpus all \
      unsloth/unsloth
    wsl.exe --install Ubuntu-24.04
    wsl.exe -d Ubuntu-24.04
    wsl
    trainer = SFTTrainer(
        dataset_num_proc=1,
        ...
    )
    conda activate unsloth_env
       k0 k1 k2 k3 k4   <-- keys
    q0  X
    q1  X  X
    q2  X  X  X
    q3  X  X  X  X
    q4  X  X  X  X  X   <-- last query row (most important for decoding)
        k0 k1 k2 k3 k4
    q0
    q1
    q2
    q3
    q4   X  X  X  X  X
        k0 k1 k2 k3 k4
    q0
    q1
    q2
    q3
    q4   X   (note that q4 has q_idx=0 as this is the first query in current setup)
    sudo apt update && sudo apt install -y wget
    wget -O Dockerfile "https://raw.githubusercontent.com/unslothai/notebooks/main/Dockerfile_DGX_Spark"
    docker build -f Dockerfile -t unsloth-dgx-spark .
    FROM nvcr.io/nvidia/pytorch:25.09-py3
    
    # Set CUDA environment variables
    ENV CUDA_HOME=/usr/local/cuda-13.0/
    ENV CUDA_PATH=$CUDA_HOME
    ENV PATH=$CUDA_HOME/bin:$PATH
    ENV LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
    ENV C_INCLUDE_PATH=$CUDA_HOME/include:$C_INCLUDE_PATH
    ENV CPLUS_INCLUDE_PATH=$CUDA_HOME/include:$CPLUS_INCLUDE_PATH
    
    # Install triton from source for latest blackwell support
    RUN git clone https://github.com/triton-lang/triton.git && \
        cd triton && \
        git checkout c5d671f91d90f40900027382f98b17a3e04045f6 && \
        pip install -r python/requirements.txt && \
        pip install . && \
        cd ..
    
    # Install xformers from source for blackwell support
    RUN git clone --depth=1 https://github.com/facebookresearch/xformers --recursive && \
        cd xformers && \
        export TORCH_CUDA_ARCH_LIST="12.1" && \
        python setup.py install && \
        cd ..
    
    # Install unsloth and other dependencies
    RUN pip install unsloth unsloth_zoo bitsandbytes==0.48.0 transformers==4.56.2 trl==0.22.2
    
    # Launch the shell
    CMD ["/bin/bash"]
    model.save_pretrained_merged("merged_model", tokenizer, save_method = "merged_16bit",)
    apt-get update
    apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
    git clone https://github.com/ggerganov/llama.cpp
    cmake llama.cpp -B llama.cpp/build \
        -DBUILD_SHARED_LIBS=ON -DGGML_CUDA=ON -DLLAMA_CURL=ON
    cmake --build llama.cpp/build --config Release -j --clean-first --target llama-quantize llama-cli llama-gguf-split llama-mtmd-cli
    cp llama.cpp/build/bin/llama-* llama.cpp
    
    python llama.cpp/convert-hf-to-gguf.py FOLDER --outfile OUTPUT --outtype f16
    Open in Colab
    Open in Colab
    notebooks docs
    Docker Hub
    Hugging Face


    MSBuild

  • MSVC v143 - VS 2022 C++ x64/x86 build tools

  • C++ 2022 Redistributable Update

  • C++ CMake tools for Windows

  • C++/CLI support for v143 build tools (Latest)

  • MSBuild support for LLVM (clang-cl) toolset

  • C++ Clang Compiler for Windows (19.1.1)

  • Windows 11 SDK (10.0.22621.0)

  • Windows Universal CRT SDK

  • C++ 2022 Redistributable MSMs

  • CC: Path to the cl.exe C++ compiler. Example (adjust if your version differs):

  • CXX: Same path as CC.

  • Click OK to save changes.

  • Verify: Open a new terminal and type cl. It should show version info.

  • in your PowerShell
    Unsloth Notebooks
    Visual Studio Community Edition
    CUDA Toolkit
    https://www.anaconda.com/docs/getting-started/miniconda/install
    Install PyTorch
    PyTorch Compatibility Matrix
    official website
    official website
    vLLM
    Unsloth
    Stranger Vision HF
    Fine-tuning only notebook
    Fine-tuning + Evaluation notebook
    # Install library
    !pip install wandb --upgrade
    
    # Setting up Wandb
    !wandb login <token>
    
    import os
    
    os.environ["WANDB_PROJECT"] = "<name>"
    os.environ["WANDB_LOG_MODEL"] = "checkpoint"
    report_to = "wandb",
    logging_steps = 1, # Change if needed
    save_steps = 100 # Change if needed
    run_name = "<name>" # (Optional)
    C:\Program Files\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.34.31933\bin\Hostx64\x64\cl.exe
    sudo apt update
    sudo apt install python3 python3-full python3-pip python3-venv -y
    pip install torch torchvision --force-reinstall --index-url https://download.pytorch.org/whl/cu130
    pip install unsloth jupyter
    jupyter notebook
    "C:\Program Files (x86)\Microsoft Visual Studio\Installer\vs_installer.exe" modify ^
    --installPath "C:\Program Files\Microsoft Visual Studio\2022\Community" ^
    --add Microsoft.Net.Component.4.8.SDK ^
    --add Microsoft.Net.Component.4.7.2.TargetingPack ^
    --add Microsoft.VisualStudio.Component.Roslyn.Compiler ^
    --add Microsoft.Component.MSBuild ^
    --add Microsoft.VisualStudio.Component.VC.Tools.x86.x64 ^
    --add Microsoft.VisualStudio.Component.VC.Redist.14.Latest ^
    --add Microsoft.VisualStudio.Component.VC.CMake.Project ^
    --add Microsoft.VisualStudio.Component.VC.CLI.Support ^
    --add Microsoft.VisualStudio.Component.VC.Llvm.Clang ^
    --add Microsoft.VisualStudio.ComponentGroup.ClangCL ^
    --add Microsoft.VisualStudio.Component.Windows11SDK.22621 ^
    --add Microsoft.VisualStudio.Component.Windows10SDK.19041 ^
    --add Microsoft.VisualStudio.Component.UniversalCRT.SDK ^
    --add Microsoft.VisualStudio.Component.VC.Redist.MSM
    pip install "unsloth[windows] @ git+https://github.com/unslothai/unsloth.git"
    cd path\to\script\folder
    powershell.exe -ExecutionPolicy Bypass -File .\unsloth_windows.ps1
    uv venv
    source .venv/bin/activate
    # Until v0.11.1 release, you need to install vLLM from nightly build
    uv pip install -U vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
    from vllm import LLM, SamplingParams
    from vllm.model_executor.models.deepseek_ocr import NGramPerReqLogitsProcessor
    from PIL import Image
    
    # Create model instance
    llm = LLM(
        model="unsloth/DeepSeek-OCR",
        enable_prefix_caching=False,
        mm_processor_cache_gb=0,
        logits_processors=[NGramPerReqLogitsProcessor]
    )
    
    # Prepare batched input with your image file
    image_1 = Image.open("path/to/your/image_1.png").convert("RGB")
    image_2 = Image.open("path/to/your/image_2.png").convert("RGB")
    prompt = "<image>\nFree OCR."
    
    model_input = [
        {
            "prompt": prompt,
            "multi_modal_data": {"image": image_1}
        },
        {
            "prompt": prompt,
            "multi_modal_data": {"image": image_2}
        }
    ]
    
    sampling_param = SamplingParams(
        temperature=0.0,
        max_tokens=8192,
        # ngram logit processor args
        extra_args=dict(
            ngram_size=30,
            window_size=90,
            whitelist_token_ids={128821, 128822},  # whitelist: <td>, </td>
        ),
        skip_special_tokens=False,
    )
    # Generate output
    model_outputs = llm.generate(model_input, sampling_param)
    
    # Print output
    for output in model_outputs:
        print(output.outputs[0].text)
    from unsloth import FastVisionModel
    import torch
    from transformers import AutoModel
    import os
    os.environ["UNSLOTH_WARN_UNINITIALIZED"] = '0'
    
    from huggingface_hub import snapshot_download
    snapshot_download("unsloth/DeepSeek-OCR", local_dir = "deepseek_ocr")
    model, tokenizer = FastVisionModel.from_pretrained(
        "./deepseek_ocr",
        load_in_4bit = False, # Use 4bit to reduce memory use. False for 16bit LoRA.
        auto_model = AutoModel,
        trust_remote_code = True,
        unsloth_force_compile = True,
        use_gradient_checkpointing = "unsloth", # True or "unsloth" for long context
    )
    
    prompt = "<image>\nFree OCR. "
    image_file = 'your_image.jpg'
    output_path = 'your/output/dir'
    res = model.infer(tokenizer, prompt=prompt, image_file=image_file, output_path = output_path, base_size = 1024, image_size = 640, crop_mode=True, save_results = True, test_compress = False)
    ============================================================
    Baseline Model Performance
    ============================================================
    Number of samples: 200
    Mean CER: 149.07%
    Median CER: 80.00%
    Std Dev: 310.39%
    Min CER: 0.00%
    Max CER: 3500.00%
    ============================================================
    
     Best Predictions (Lowest CER):
    
    Sample 5024 (CER: 0.00%)
    Reference:  چون هستی خیلی زیاد...
    Prediction: چون هستی خیلی زیاد...
    
    Sample 3517 (CER: 0.00%)
    Reference:  تو ایران هیچوقت از اینها وجود نخواهد داشت...
    Prediction: تو ایران هیچوقت از اینها وجود نخواهد داشت...
    
    Sample 9949 (CER: 0.00%)
    Reference:  کاش میدونستم هیچی بیخیال...
    Prediction: کاش میدونستم هیچی بیخیال...
    
     Worst Predictions (Highest CER):
    
    Sample 11155 (CER: 3500.00%)
    Reference:  خسو...
    Prediction: \[ \text{CH}_3\text{CH}_2\text{CH}_2\text{CH}_2\text{CH}_2\text{CH}_2\text{CH}_2\text{CH}_2\text{CH}...
    
    Sample 13366 (CER: 1900.00%)
    Reference:  مشو...
    Prediction: \[\begin{align*}\underline{\mathfrak{su}}_0\end{align*}\]...
    
    Sample 10552 (CER: 1014.29%)
    Reference:  هیییییچ...
    Prediction: e
    ============================================================
    Fine-tuned Model Performance
    ============================================================
    Number of samples: 200
    Mean CER: 60.43%
    Median CER: 50.00%
    Std Dev: 80.63%
    Min CER: 0.00%
    Max CER: 916.67%
    ============================================================
    
     Best Predictions (Lowest CER):
    
    Sample 301 (CER: 0.00%)
    Reference:  باشه بابا تو لاکچری، تو خاص، تو خفن...
    Prediction: باشه بابا تو لاکچری، تو خاص، تو خفن...
    
    Sample 2512 (CER: 0.00%)
    Reference:  از شخص حاج عبدالله زنجبیلی میگیرنش...
    Prediction: از شخص حاج عبدالله زنجبیلی میگیرنش...
    
    Sample 2713 (CER: 0.00%)
    Reference:  نمی دونم والا تحمل نقد ندارن ظاهرا...
    Prediction: نمی دونم والا تحمل نقد ندارن ظاهرا...
    
     Worst Predictions (Highest CER):
    
    Sample 14270 (CER: 916.67%)
    Reference:  ۴۳۵۹۴۷۴۷۳۸۹۰...
    Prediction: پروپریپریپریپریپریپریپریپریپریپریپریپریپریپریپریپریپریپریپیپریپریپریپریپریپریپریپریپریپریپریپریپریپر...
    
    Sample 3919 (CER: 380.00%)
    Reference:  ۷۵۵۰۷۱۰۶۵۹...
    Prediction: وادووووووووووووووووووووووووووووووووووو...
    
    Sample 3718 (CER: 333.33%)
    Reference:  ۳۲۶۷۲۲۶۵۵۸۴۶...
    Prediction: پُپُسوپُسوپُسوپُسوپُسوپُسوپُسوپُسوپُسوپُ...
    import wandb
    run = wandb.init()
    artifact = run.use_artifact('<username>/<Wandb-project-name>/<run-id>', type='model')
    artifact_dir = artifact.download()
    trainer.train(resume_from_checkpoint=artifact_dir)
    from trl import SFTConfig, SFTTrainer
    trainer = SFTTrainer(
        args = SFTConfig(
            fp16_full_eval = True,
            per_device_eval_batch_size = 2,
            eval_accumulation_steps = 4,
            output_dir = "training_checkpoints", # location of saved checkpoints for early stopping
            save_strategy = "steps",             # save model every N steps
            save_steps = 10,                     # how many steps until we save the model
        save_total_limit = 3,                # keep only 3 saved checkpoints to save disk space
            eval_strategy = "steps",             # evaluate every N steps
            eval_steps = 10,                     # how many steps until we do evaluation
            load_best_model_at_end = True,       # MUST USE for early stopping
            metric_for_best_model = "eval_loss", # metric we want to early stop on
            greater_is_better = False,           # the lower the eval loss, the better
        ),
        model = model,
        tokenizer = tokenizer,
        train_dataset = new_dataset["train"],
        eval_dataset = new_dataset["test"],
    )
    from transformers import EarlyStoppingCallback
    early_stopping_callback = EarlyStoppingCallback(
        early_stopping_patience = 3,     # How many steps we will wait if the eval loss doesn't decrease
                                         # For example the loss might increase, but decrease after 3 steps
        early_stopping_threshold = 0.0,  # Can set higher - sets how much loss should decrease by until
                                         # we consider early stopping. For eg 0.01 means if loss was
                                         # 0.02 then 0.01, we consider to early stop the run.
    )
    trainer.add_callback(early_stopping_callback)

    Use our conversational notebooks to force the chat template - this will fix most issues.

• Qwen-3 14B Conversational notebook

• Gemma-3 4B Conversational notebook

• Llama-3.2 3B Conversational notebook

• Phi-4 14B Conversational notebook

• Mistral v0.3 7B Conversational notebook

• More notebooks in our Unsloth Notebooks docs

"quantized" : "Recommended. Slow conversion. Fast inference, small files.",
    "f32" : "Not recommended. Retains 100% accuracy, but super slow and memory hungry.",
    "f16" : "Fastest conversion + retains 100% accuracy. Slow and memory hungry.",
    "q8_0" : "Fast conversion. High resource use, but generally acceptable.",
    "q4_k_m" : "Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K",
    "q5_k_m" : "Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K",
    "q2_k" : "Uses Q4_K for the attention.vw and feed_forward.w2 tensors, Q2_K for the other tensors.",
    "q3_k_l" : "Uses Q5_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K",
    "q3_k_m" : "Uses Q4_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K",
    "q3_k_s" : "Uses Q3_K for all tensors",
    "q4_0" : "Original quant method, 4-bit.",
    "q4_1" : "Higher accuracy than q4_0 but not as high as q5_0. However has quicker inference than q5 models.",
    "q4_k_s" : "Uses Q4_K for all tensors",
    "q4_k" : "alias for q4_k_m",
    "q5_k" : "alias for q5_k_m",
    "q5_0" : "Higher accuracy, higher resource usage and slower inference.",
    "q5_1" : "Even higher accuracy, resource usage and slower inference.",
    "q5_k_s" : "Uses Q5_K for all tensors",
    "q6_k" : "Uses Q8_K for all tensors",
    "iq2_xxs" : "2.06 bpw quantization",
    "iq2_xs" : "2.31 bpw quantization",
    "iq3_xxs" : "3.06 bpw quantization",
    "q3_k_xs" : "3-bit extra small quantization",
    }
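To pick one of these methods when exporting with Unsloth, pass it as quantization_method; a minimal sketch (the output folder, repo name and token are placeholders):

# Save a GGUF locally with the q4_k_m method from the list above
model.save_pretrained_gguf("gguf_model", tokenizer, quantization_method = "q4_k_m")

# Or push straight to the Hugging Face Hub
model.push_to_hub_gguf("your_username/your_model", tokenizer,
                       quantization_method = "q5_k_m", token = "hf_...")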


    ⚙️ Usage Guide

    To achieve optimal performance for Instruct, Mistral recommends using lower temperatures such as temperature = 0.15 or 0.1

    For Reasoning, Mistral recommends temperature = 0.7 and top_p = 0.95.

Setting       Instruct        Reasoning
Temperature   0.15 or 0.1     0.7
Top_P         default         0.95

    Adequate Output Length: Use an output length of 32,768 tokens for most queries for the reasoning variant, and 16,384 for the instruct variant. You can increase the max output size for the reasoning model if necessary.

    The maximum context length Ministral 3 can reach is 262,144

    The chat template format is found when we use the below:

    Ministral Reasoning chat template:

    Ministral Instruct chat template:

    📖 Run Ministral 3 Tutorials

    Below are guides for the Reasoning and Instruct variants of the model.

    Instruct: Ministral-3-Instruct-2512

    To achieve optimal performance for Instruct, Mistral recommends using lower temperatures such as temperature = 0.15 or 0.1

    ✨ Llama.cpp: Run Ministral-3-14B-Instruct Tutorial

    1

    Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.

    2

    You can directly pull from Hugging Face via:

    3

    Download the model via (after installing pip install huggingface_hub hf_transfer ). You can choose UD_Q4_K_XL or other quantized versions.

    Reasoning: Ministral-3-Reasoning-2512

    To achieve optimal performance for Reasoning, Mistral recommends using temperature = 0.7 and top_p = 0.95.

    ✨ Llama.cpp: Run Ministral-3-14B-Reasoning Tutorial

    1

    Obtain the latest llama.cpp on GitHub. You can also use the build instructions below. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.

    2

    You can directly pull from Hugging Face via:

    3

    Download the model via (after installing pip install huggingface_hub hf_transfer ). You can choose UD_Q4_K_XL or other quantized versions.

    🛠️ Fine-tuning Ministral 3

Unsloth now supports fine-tuning of all Ministral 3 models, including vision support. To train, you must use the latest 🤗Hugging Face transformers v5 and Unsloth, which includes our recent ultra long context support. The large 14B Ministral 3 model should fit on a free Colab GPU.

    We made free Unsloth notebooks to fine-tune Ministral 3. Change the name to use the desired model.

    • Ministral-3B-Instruct Vision notebook (vision)

    • Ministral-3B-Instruct GRPO notebook

    Ministral Vision finetuning notebook

    Ministral Sudoku GRPO RL notebook

    ✨Reinforcement Learning (GRPO)

Unsloth now supports RL and GRPO for the Mistral models as well. As usual, they benefit from all of Unsloth's enhancements, and we are going to release a notebook soon specifically for autonomously solving Sudoku puzzles.

    • Ministral-3B-Instruct GRPO notebook

    To use the latest version of Unsloth and transformers v5, update via:

    The goal is to auto generate strategies to complete Sudoku!

    For the reward plots for Ministral, we get the below. We see it works well!

    Run Ministral 3 Tutorials
    Fine-tuning Ministral 3
    GGUFs here
    see here
    notebook here
    Unsloth notebooks
    save and run
$$\text{batchsize} \times L$$
$$\text{Token Usage} = \frac{\text{batchsize}}{2}L + \frac{\text{batchsize}}{2}S$$
$$\text{Speedup} = \frac{\text{batchsize} \times L}{\frac{\text{batchsize}}{2}L + \frac{\text{batchsize}}{2}S} = \frac{2L}{L + S}$$
$$S \rightarrow 0$$
$$\frac{2L}{L + 0} = 2$$
$$\frac{L}{0.2L + 0.8S} \rightarrow \frac{L}{0.2L} = 5$$
    PR 238
    500K Context Training
    Qwen3-32B
    training loss
    Unsloth Notebooks
    500K Context Training
    Memory Efficient RL
    Long Context gpt-oss
    4 examples without packing wastes space
    Uncontaminated packing creates correct attention pattern
    Fused QK RoPE Triton Kernel with packing
    Why is padding needed & mathematical speedup
    2048 notebook
    Kernel generation notebook

    Tutorial: Train your own Reasoning model with GRPO

    Beginner's Guide to transforming a model like Llama 3.1 (8B) into a reasoning model by using Unsloth and GRPO.

    DeepSeek developed GRPO (Group Relative Policy Optimization) to train their R1 reasoning models.

    Quickstart

    These instructions are for our pre-made Google Colab notebooks. If you are installing Unsloth locally, you can also copy our notebooks inside your favorite code editor. We'll be using any of these notebooks:

    - GSPO

    - Vision GSPO

    1

    Install Unsloth

If you're using our Colab notebook, click Runtime > Run all. We'd highly recommend checking out our Fine-tuning Guide before getting started.

If installing locally, ensure you have the correct requirements and use pip install unsloth on Linux, or follow our Windows install instructions.

    Video Tutorials

    Here are some video tutorials created by amazing YouTubers who we think are fantastic!

    Vision Fine-tuning

    Learn how to fine-tune vision/multimodal LLMs with Unsloth

Fine-tuning vision models enables them to excel at tasks that normal LLMs are not as good at, such as object/movement detection. You can also train VLMs with RL. We have many free notebooks for vision fine-tuning:

    • NEW: Qwen3-VL (8B) Vision: Notebook

    • Gemma 3 (4B) Vision: Notebook

    • Llama 3.2 Vision fine-tuning for radiography: How can we assist medical professionals in analyzing Xrays, CT Scans & ultrasounds faster.

    • Qwen2.5 VL fine-tuning for converting handwriting to LaTeX: This allows complex math formulas to be easily transcribed as LaTeX without manually writing it.

    • Pixtral 12B 2409 vision fine-tuning for general Q&A: One can concatenate general Q&A datasets with more niche datasets to make the finetune not forget base model skills.

    It is best to ensure your dataset has images of all the same size/dimensions. Use dimensions of 300-1000px to ensure your training does not take too long or use too many resources.

To finetune vision models, we now allow you to select which parts of the model to finetune. You can select to only finetune the vision layers, the language layers, or the attention / MLP layers. We set them all on by default!

    Vision Fine-tuning Dataset

The dataset for fine-tuning a vision or multimodal model is similar to a standard question & answer pair dataset, but this time it also includes image inputs. For example, the Llama 3.2 Vision notebook uses a radiography case to show how AI can help medical professionals analyze X-rays, CT scans, and ultrasounds more efficiently.

We'll be using a sampled version of the ROCO radiography dataset, which you can access on Hugging Face. The dataset includes X-rays, CT scans and ultrasounds showcasing medical conditions and diseases. Each image has a caption written by experts describing it. The goal is to finetune a VLM to make it a useful analysis tool for medical professionals.

    Let's take a look at the dataset, and check what the 1st example shows:

    Image
    Caption

    To format the dataset, all vision finetuning tasks should be formatted as follows:

We will craft a custom instruction asking the VLM to be an expert radiographer. Notice also that instead of just 1 instruction, you can add multiple turns to make it a dynamic conversation.

    Let's convert the dataset into the "correct" format for finetuning:
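A minimal sketch of that conversion (the image and caption column names are assumptions based on the radiography dataset described above; adjust them to your dataset):

instruction = "You are an expert radiographer. Describe accurately what you see in this image."

def convert_to_conversation(sample):
    # One user turn with the instruction + image, and one assistant turn with the expert caption.
    conversation = [
        {"role": "user",
         "content": [
             {"type": "text",  "text": instruction},
             {"type": "image", "image": sample["image"]},
         ]},
        {"role": "assistant",
         "content": [{"type": "text", "text": sample["caption"]}]},
    ]
    return {"messages": conversation}

converted_dataset = [convert_to_conversation(sample) for sample in dataset]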

    The first example is now structured like below:

    Before we do any finetuning, maybe the vision model already knows how to analyse the images? Let's check if this is the case!

    And the result:

For more details, view the dataset section in our documentation.

    Multi-image training

In order to fine-tune or train a VLM like Qwen3-VL with multiple images, the most straightforward change is to swap

    with:

    Using map kicks in dataset standardization and arrow processing rules which can be strict and more complicated to define.
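As a hedged illustration of that swap (reusing the convert_to_conversation helper sketched earlier, which is an assumption about your conversion function):

# Plain Python list: keeps each sample's nested structure exactly as returned.
converted_dataset = [convert_to_conversation(sample) for sample in dataset]

# Swapped version: datasets' .map() applies Arrow schema standardization, which
# scales to large multi-image datasets but is stricter about nested types.
converted_dataset = dataset.map(convert_to_conversation)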

    Cogito v2.1: How to Run Locally

Cogito v2.1 LLMs are among the strongest open models in the world, trained with IDA. The earlier v2 release comes in 4 sizes: 70B, 109B, 405B and 671B, allowing you to select which size best matches your hardware.

    Deep Cogito v2.1 is an updated 671B MoE that is the most powerful open weights model as of 19 November 2025.

Cogito v2.1 comes in one 671B MoE size, whilst Cogito v2 Preview is Deep Cogito's earlier release spanning 4 model sizes ranging from 70B to 671B. By using IDA (Iterated Distillation & Amplification), these models are trained to internalize the reasoning process using iterative policy improvement, rather than simply searching longer at inference time (like DeepSeek R1).

Deep Cogito is based in San Francisco, USA (like Unsloth 🇺🇸) and we're excited to provide quantized dynamic models for all 4 model sizes! All uploads use Unsloth Dynamic 2.0 for SOTA 5-shot MMLU and KL Divergence performance, meaning you can run & fine-tune these quantized LLMs with minimal accuracy loss!

    Tutorials navigation:

    Choose which model size fits your hardware! We upload 1.58bit to 16bit variants for all 4 model sizes!

    💎 Model Sizes and Uploads

    There are 4 model sizes:

1. 2 Dense models based on Llama - 70B and 405B

2. 2 MoE models based on Llama 4 Scout (109B) and DeepSeek R1 (671B)

    Model Sizes
    Recommended Quant & Link
    Disk Size
    Architecture

    Though not necessary, for the best performance, have your VRAM + RAM combined = to the size of the quant you're downloading. If you have less VRAM + RAM, then the quant will still function, just be much slower.

    🐳 Run Cogito 671B MoE in llama.cpp

1. Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.

2. If you want to use llama.cpp directly to load models, you can do the below: (:IQ1_S) is the quantization type. You can also download via Hugging Face (point 3). This is similar to ollama run . Use export LLAMA_CACHE="folder" to force llama.cpp to save to a specific location.

    Please try out -ot ".ffn_.*_exps.=CPU" to offload all MoE layers to the CPU! This effectively allows you to fit all non MoE layers on 1 GPU, improving generation speeds. You can customize the regex expression to fit more layers if you have more GPU capacity.

    If you have a bit more GPU memory, try -ot ".ffn_(up|down)_exps.=CPU" This offloads up and down projection MoE layers.

    Try -ot ".ffn_(up)_exps.=CPU" if you have even more GPU memory. This offloads only up projection MoE layers.

3. Download the model via (after installing pip install huggingface_hub hf_transfer ). You can choose UD-IQ1_S (dynamic 1.78bit quant) or other quantized versions like Q4_K_M . We recommend using our 2.7bit dynamic quant UD-Q2_K_XL to balance size and accuracy. More versions at:

4. Edit --threads 32 for the number of CPU threads, --ctx-size 16384 for context length, --n-gpu-layers 2 for GPU offloading on how many layers. Try adjusting it if your GPU goes out of memory. Also remove it if you have CPU only inference.

    🖱️Run Cogito 109B MoE in llama.cpp

    1. Follow the same instructions as running the .

    2. Then run the below:

    🌳Run Cogito 405B Dense in llama.cpp

    1. Follow the same instructions as running the .

    2. Then run the below:

    😎 Run Cogito 70B Dense in llama.cpp

    1. Follow the same instructions as running the .

    2. Then run the below:

    See for more details

    Quantization-Aware Training (QAT)

    Quantize models to 4-bit with Unsloth and PyTorch to recover accuracy.

    In collaboration with PyTorch, we're introducing QAT (Quantization-Aware Training) in Unsloth to enable trainable quantization that recovers as much accuracy as possible. This results in significantly better model quality compared to standard 4-bit naive quantization. QAT can recover up to 70% of the lost accuracy and achieve a 1–3% model performance improvement on benchmarks such as GPQA and MMLU Pro.

    Try QAT with our free Qwen3 (4B) notebook

    📚Quantization

    Naively quantizing a model is called post-training quantization (PTQ). For example, assume we want to quantize to 8bit integers:

    1. Find max(abs(W))

    2. Find a = 127/max(abs(W)) where a is int8's maximum range which is 127

    3. Quantize via qW = int8(round(W * a))

Dequantizing back to 16 bits simply does the reverse operation via float16(qW) / a . Post-training quantization (PTQ) can greatly reduce storage and inference costs, but quite often degrades accuracy when representing high-precision values with fewer bits - especially at 4-bit or lower. One way to solve this is to utilize our dynamic GGUF quants, which use a calibration dataset to change the quantization procedure to allocate more importance to important weights. The other way is to make quantization smarter, by making it trainable or learnable!
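A minimal sketch of the absmax int8 PTQ round-trip described above (illustrative only, using random weights in place of a real layer):

import torch

W = torch.randn(4096, 4096, dtype = torch.float16)                       # pretend model weights
a = 127.0 / W.abs().max().float()                                        # step 2: scale to int8's max range (127)
qW = torch.clamp(torch.round(W.float() * a), -128, 127).to(torch.int8)   # step 3: quantize
W_dq = (qW.float() / a).to(torch.float16)                                # dequantize back to 16 bits
print((W - W_dq).abs().max())                                            # PTQ rounding error - never corrected by training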

    🔥Smarter Quantization

To enable smarter quantization, we collaborated with the TorchAO team to add Quantization-Aware Training (QAT) directly inside of Unsloth - so now you can fine-tune models in Unsloth and then export them to 4-bit QAT format directly, with accuracy improvements!

In fact, QAT recovers 66.9% of the accuracy lost by Gemma3-4B on GPQA, increasing raw accuracy by +1.0%. Gemma3-12B on BBH recovers 45.5% of the lost accuracy, with raw accuracy increasing by +2.1%. QAT has no extra overhead during inference, and uses the same disk and memory usage as normal naive quantization! So you get all the benefits of low-bit quantization, but with much increased accuracy!

    🔍Quantization-Aware Training

    QAT simulates the true quantization procedure by "fake quantizing" weights and optionally activations during training, which typically means rounding high precision values to quantized ones (while staying in high precision dtype, e.g. bfloat16) and then immediately dequantizing them.
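A conceptual sketch of fake quantization (not TorchAO's actual kernels) to make the idea concrete:

import torch

def fake_quantize_int8(W: torch.Tensor) -> torch.Tensor:
    # Snap weights to the int8 grid, then immediately dequantize, staying in the
    # original float dtype so the model still trains in high precision.
    # (In practice a straight-through estimator is used so gradients pass through the round.)
    scale = 127.0 / W.abs().max()
    return (torch.round(W * scale).clamp(-128, 127) / scale).to(W.dtype)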

    TorchAO enables QAT by first (1) inserting fake quantize operations into linear layers, and (2) transforms the fake quantize operations to actual quantize and dequantize operations after training to make it inference ready. Step 1 enables us to train a more accurate quantization representation.

    ✨QAT + LoRA finetuning

QAT in Unsloth can additionally be combined with LoRA fine-tuning to get the best of both worlds: significantly reducing storage and compute requirements during training while mitigating quantization degradation! We support multiple methods via qat_scheme including fp8-int4, fp8-fp8, int8-int4 and int4. We also plan to add custom definitions for QAT in a follow-up release!

    🫖Exporting QAT models

After fine-tuning in Unsloth, you can call model.save_pretrained_torchao to save your trained model using TorchAO's PTQ format. You can also upload these to the Hugging Face Hub! We support any config, and we plan to add text-based methods as well, and to make the process simpler for everyone! But first, we have to prepare the QAT model for the final conversion step via:

    And now we can select which QAT style you want:

You can then run the merged QAT lower-precision model in vLLM, Unsloth and other systems for inference! These are all in the Qwen3-4B QAT Colab notebook we have as well!

    🫖Quantizing models without training

    You can also call model.save_pretrained_torchao directly without doing any QAT as well! This is simply PTQ or native quantization. For example, saving to Dynamic float8 format is below:

    📱ExecuTorch - QAT for mobile deployment

With Unsloth and TorchAO's QAT support, you can also fine-tune a model in Unsloth and seamlessly export it to ExecuTorch (PyTorch's solution for on-device inference) and deploy it directly on mobile. See an example in action here, with more detailed workflows on the way!

    Announcement coming soon!

    🌻How to enable QAT

    Update Unsloth to the latest version, and also install the latest TorchAO!

Then try QAT with our free Qwen3 (4B) notebook.

    💁Acknowledgements

    Huge thanks to the entire PyTorch and TorchAO team for their help and collaboration! Extreme thanks to Andrew Or, Jerry Zhang, Supriya Rao, Scott Roy and Mergen Nachin for helping on many discussions on QAT, and on helping to integrate it into Unsloth! Also thanks to the Executorch team as well!

    IBM Granite 4.0

    How to run IBM Granite-4.0 with Unsloth GGUFs on llama.cpp, Ollama and how to fine-tune!

IBM releases Granite-4.0 models in several sizes, including Nano (350M & 1B), Micro (3B), Tiny (7B/1B active) and Small (32B/9B active). Trained on 15T tokens, IBM's new Hybrid (H) Mamba architecture enables Granite-4.0 models to run faster with lower memory use.

Learn how to run Unsloth Granite-4.0 Dynamic GGUFs, or fine-tune/RL the model with our free Colab notebook for a support agent use-case.

    Unsloth Granite-4.0 uploads:

    Dynamic GGUFs
    Dynamic 4-bit + FP8
    16-bit Instruct

    Grok 2

    Run xAI's Grok 2 model locally!

    You can now run Grok 2 (aka Grok 2.5), the 270B parameter model by xAI. Full precision requires 539GB, while the Unsloth Dynamic 3-bit version shrinks size down to just 118GB (a 75% reduction). GGUF:

The 3-bit Q3_K_XL model runs on a single 128GB Mac or 24GB VRAM + 128GB RAM, achieving 5+ tokens/s inference. Thanks to the llama.cpp team and community for supporting Grok 2 and making this possible. We were also glad to have helped a little along the way!

All uploads use Unsloth Dynamic 2.0 for SOTA 5-shot MMLU and KL Divergence performance, meaning you can run quantized Grok LLMs with minimal accuracy loss.

    Fine-tuning LLMs with Blackwell, RTX 50 series & Unsloth

    Learn how to fine-tune LLMs on NVIDIA's Blackwell RTX 50 series and B200 GPUs with our step-by-step guide.

Unsloth now supports NVIDIA's Blackwell architecture GPUs, including RTX 50-series GPUs (5060–5090), RTX PRO 6000, and GPUs such as B200, B40, GB100, GB102 and more! You can read the official NVIDIA blogpost here.

Unsloth is now compatible with every NVIDIA GPU from 2018 onwards, including the DGX Spark.

Our new Docker image supports Blackwell. Run the Docker image and start training!

    Pip install

    Multi-GPU Fine-tuning with Distributed Data Parallel (DDP)

    Learn how to use the Unsloth CLI to train on multiple GPUs with Distributed Data Parallel (DDP)!

    Let’s assume we have multiple GPUs, and we want to fine-tune a model using all of them! To do so, the most straightforward strategy is to use Distributed Data Parallel (DDP), which creates one copy of the model on each GPU device, feeding each copy distinct samples from the dataset during training and aggregating their contributions to weight updates per optimizer step.

    Why would we want to do this? Well, as we add more GPUs into the training process, we scale the number of samples our models train on per step, making each gradient update more stable and increasing our training throughput dramatically with each added GPU.

    Here’s a step-by-step guide on how to do this using Unsloth’s command-line interface (CLI)!

    Note: Unsloth DDP will work with any of your training scripts, not just via our CLI! More details below.

    model.save_pretrained_merged("merged_model", tokenizer, save_method = "merged_16bit",)
    apt-get update
    apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
    git clone https://github.com/ggerganov/llama.cpp
    cmake llama.cpp -B llama.cpp/build \
        -DBUILD_SHARED_LIBS=ON -DGGML_CUDA=ON -DLLAMA_CURL=ON
    cmake --build llama.cpp/build --config Release -j --clean-first --target llama-quantize llama-cli llama-gguf-split llama-mtmd-cli
    cp llama.cpp/build/bin/llama-* llama.cpp
    python llama.cpp/convert_hf_to_gguf.py merged_model \
        --outfile model-F16.gguf --outtype f16 \
        --split-max-size 50G
    # For BF16:
    python llama.cpp/convert_hf_to_gguf.py merged_model \
        --outfile model-BF16.gguf --outtype bf16 \
        --split-max-size 50G
        
    # For Q8_0:
    python llama.cpp/convert_hf_to_gguf.py merged_model \
        --outfile model-Q8_0.gguf --outtype q8_0 \
        --split-max-size 50G
    apt-get update
    apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
    git clone https://github.com/ggml-org/llama.cpp
    cmake llama.cpp -B llama.cpp/build \
        -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
    cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-gguf-split
    cp llama.cpp/build/bin/llama-* llama.cpp
    ./llama.cpp/llama-cli \
        -hf unsloth/Ministral-3-14B-Instruct-2512-GGUF:Q4_K_XL \
    --jinja -ngl 99 --threads -1 --ctx-size 32768 \
        --temp 0.15
    apt-get update
    apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
    git clone https://github.com/ggml-org/llama.cpp
    cmake llama.cpp -B llama.cpp/build \
        -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
    cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-gguf-split
    cp llama.cpp/build/bin/llama-* llama.cpp
    ./llama.cpp/llama-cli \
        -hf unsloth/Ministral-3-14B-Reasoning-2512-GGUF:Q4_K_XL \
    --jinja -ngl 99 --threads -1 --ctx-size 32768 \
    --temp 0.7 --top-p 0.95
    tokenizer.apply_chat_template([
        {"role" : "user", "content" : "What is 1+1?"},
        {"role" : "assistant", "content" : "2"},
        {"role" : "user", "content" : "What is 2+2?"}
        ], add_generation_prompt = True
    )
    <s>[SYSTEM_PROMPT]# HOW YOU SHOULD THINK AND ANSWER
    
    First draft your thinking process (inner monologue) until you arrive at a response. Format your response using Markdown, and use LaTeX for any mathematical equations. Write both your thoughts and the response in the same language as the input.
    
    Your thinking process must follow the template below:[THINK]Your thoughts or/and draft, like working through an exercise on scratch paper. Be as casual and as long as you want until you are confident to generate the response to the user.[/THINK]Here, provide a self-contained response.[/SYSTEM_PROMPT][INST]What is 1+1?[/INST]2</s>[INST]What is 2+2?[/INST]
    <s>[SYSTEM_PROMPT]You are Ministral-3-3B-Instruct-2512, a Large Language Model (LLM) created by Mistral AI, a French startup headquartered in Paris.
    You power an AI assistant called Le Chat.
    Your knowledge base was last updated on 2023-10-01.
    The current date is {today}.
    
    When you're not sure about some information or when the user's request requires up-to-date or specific data, you must use the available tools to fetch the information. Do not hesitate to use tools whenever they can provide a more accurate or complete response. If no relevant tools are available, then clearly state that you don't have the information and avoid making up anything.
    If the user's question is not clear, ambiguous, or does not provide enough context for you to accurately answer the question, you do not try to answer it right away and you rather ask the user to clarify their request (e.g. "What are some good restaurants around me?" => "Where are you?" or "When is the next flight to Tokyo" => "Where do you travel from?").
    You are always very attentive to dates, in particular you try to resolve dates (e.g. "yesterday" is {yesterday}) and when asked about information at specific dates, you discard information that is at another date.
    You follow these instructions in all languages, and always respond to the user in the language they use or request.
    Next sections describe the capabilities that you have.
    
    # WEB BROWSING INSTRUCTIONS
    
    You cannot perform any web search or access internet to open URLs, links etc. If it seems like the user is expecting you to do so, you clarify the situation and ask the user to copy paste the text directly in the chat.
    
    # MULTI-MODAL INSTRUCTIONS
    
    You have the ability to read images, but you cannot generate images. You also cannot transcribe audio files or videos.
    You cannot read nor transcribe audio files or videos.
    
    # TOOL CALLING INSTRUCTIONS
    
    You may have access to tools that you can use to fetch information or perform actions. You must use these tools in the following situations:
    
    1. When the request requires up-to-date information.
    2. When the request requires specific data that you do not have in your knowledge base.
    3. When the request involves actions that you cannot perform without tools.
    
    Always prioritize using tools to provide the most accurate and helpful response. If tools are not available, inform the user that you cannot perform the requested action at the moment.[/SYSTEM_PROMPT][INST]What is 1+1?[/INST]2</s>[INST]What is 2+2?[/INST]
    pip install --upgrade --force-reinstall --no-cache-dir --no-deps unsloth unsloth_zoo
    docker run -it \
        --gpus=all \
        --net=host \
        --ipc=host \
        --ulimit memlock=-1 \
        --ulimit stack=67108864 \
        -v $(pwd):$(pwd) \
        -v $HOME/.cache/huggingface:/root/.cache/huggingface \
        -w $(pwd) \
        unsloth-dgx-spark
    NOTEBOOK_URL="https://raw.githubusercontent.com/unslothai/notebooks/refs/heads/main/nb/gpt_oss_(20B)_Reinforcement_Learning_2048_Game_DGX_Spark.ipynb"
    wget -O "gpt_oss_20B_RL_2048_Game.ipynb" "$NOTEBOOK_URL"
    
    jupyter notebook --ip=0.0.0.0 --port=8888 --no-browser --allow-root
    Q * cos + rotate_half(Q) * sin
    is equivalent to
    Q * cos + Q @ R * sin
    where R is a rotation matrix [ 0,  I]
                                 [-I,  0]
    dC/dY = dY * cos + dY @ R.T * sin
    where R.T is again the same  [ 0, -I]
    but the minus is transposed. [ I,  0]
    block_idx = tl.program_id(0)
    if LONG_INDEXING:
        offsets = block_idx.to(tl.int64) * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE).to(tl.int64)
        n_elements = tl.cast(n_elements, tl.int64)
    else:
        offsets = block_idx * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    pip install --upgrade --force-reinstall --no-cache-dir --no-deps unsloth
    pip install --upgrade --force-reinstall --no-cache-dir --no-deps unsloth_zoo
    from unsloth import FastLanguageModel
    from trl import SFTTrainer, SFTConfig
    
    model, tokenizer = FastLanguageModel.from_pretrained(
        "unsloth/Qwen3-14B",
    )
    
    trainer = SFTTrainer(
        model = model,
        processing_class = tokenizer,
        train_dataset = dataset,
        args = SFTConfig(
            per_device_train_batch_size = 1,
            max_length = 4096,
            …,
            packing = True, # required to enable sample packing!
        ),
    )
    trainer.train()
    Open in Colab
    Open in Colab
    notebooks docs
    dynamic GGUF quants
    TorchAO
    Qwen3-4B QAT Colab notebook
    ExecuTorch
    here
    Qwen3 (4B) notebook
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "unsloth/Qwen3-4B-Instruct-2507",
        max_seq_length = 2048,
        load_in_16bit = True,
    )
    model = FastLanguageModel.get_peft_model(
        model,
        r = 16,
        target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                          "gate_proj", "up_proj", "down_proj",],
        lora_alpha = 32,
        
        # We support fp8-int4, fp8-fp8, int8-int4, int4
        qat_scheme = "int4",
    )
    from torchao.quantization import quantize_
    from torchao.quantization.qat import QATConfig
    quantize_(model, QATConfig(step = "convert"))
    # Use the exact same config as QAT (convenient function)
    model.save_pretrained_torchao(
        model, "tokenizer", 
        torchao_config = model._torchao_config.base_config,
    )
    
    # Int4 QAT
    from torchao.quantization import Int4WeightOnlyConfig
    model.save_pretrained_torchao(
        model, "tokenizer",
        torchao_config = Int4WeightOnlyConfig(),
    )
    
    # Int8 QAT
    from torchao.quantization import Int8DynamicActivationInt8WeightConfig
    model.save_pretrained_torchao(
        model, "tokenizer",
        torchao_config = Int8DynamicActivationInt8WeightConfig(),
    )
    # Float8
    from torchao.quantization import PerRow
    from torchao.quantization import Float8DynamicActivationFloat8WeightConfig
    torchao_config = Float8DynamicActivationFloat8WeightConfig(granularity = PerRow())
model.save_pretrained_torchao("model", tokenizer, torchao_config = torchao_config)
    pip install --upgrade --no-cache-dir --force-reinstall unsloth unsloth_zoo
    pip install torchao==0.14.0 fbgemm-gpu-genai==1.3.0
    2

    Learn about GRPO & Reward Functions

    Before we get started, it is recommended to learn more about GRPO, reward functions and how they work. Read more about them including tips & tricks here.

    You will also need enough VRAM. In general, model parameters = amount of VRAM you will need. In Colab, we are using their free 16GB VRAM GPUs which can train any model up to 16B in parameters.

    3

    Configure desired settings

We have pre-selected optimal settings for the best results for you already, and you can change the model to whichever you want listed in our supported models. We would not recommend changing other settings if you're a beginner.

For advanced GRPO documentation on batching, generation and training parameters, read our guide.

    4

    Data preparation

    We have pre-selected OpenAI's GSM8K dataset which contains grade school math problems but you could change it to your own or any public one on Hugging Face. You can read more about datasets here.

Your dataset should still have at least 2 columns for question and answer pairs. However, the answer must not reveal the reasoning behind how it was derived from the question. See below for an example:

    We'll structure the data to prompt the model to articulate its reasoning before delivering an answer. To start, we'll establish a clear format for both prompts and responses.

    Now, to prepare the dataset:

    The dataset is prepared by extracting the answers and formatting them as structured strings.

    5

    Reward Functions/Verifier

Reward Functions/Verifiers let us know whether the model is doing well or not according to the dataset you have provided. Each generation is assessed against the average score of the other generations in its group. You can create your own reward functions; however, we have already pre-selected them for you with Will's GSM8K reward functions. With this, we have 5 different ways in which we can reward each generation.

You can input your generations into an LLM like ChatGPT 4o or Llama 3.1 (8B) and design a reward function and verifier to evaluate it. For example, feed your generations into an LLM of your choice and set a rule: "If the answer sounds too robotic, deduct 3 points." This helps refine outputs based on quality criteria. See examples of what they can look like here.

    Example Reward Function for an Email Automation Task:

    • Question: Inbound email

    • Answer: Outbound email

    • Reward Functions:

      • If the answer contains a required keyword → +1
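As a hedged sketch of such a reward function (the keyword and the TRL-style signature are assumptions; adapt to your trainer and dataset):

def keyword_reward(completions, **kwargs):
    # +1 if the outbound email contains a required keyword, else 0.
    required_keyword = "refund"  # hypothetical keyword for illustration
    scores = []
    for completion in completions:
        # Conversational datasets pass a list of messages; plain-text datasets pass a string.
        text = completion[0]["content"] if isinstance(completion, list) else completion
        scores.append(1.0 if required_keyword in text.lower() else 0.0)
    return scores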

    6

    Train your model

    We have pre-selected hyperparameters for the most optimal results however you could change them. Read all about parameters here. For advanced GRPO documentation on batching, generation and training parameters, read our guide!

    The GRPOConfig defines key hyperparameters for training:

    • use_vllm: Activates fast inference using vLLM.

    • learning_rate: Determines the model's learning speed.

    • num_generations: Specifies the number of completions generated per prompt.

    • max_steps: Sets the total number of training steps.

    NEW! We now support DAPO, Dr. GRPO and most other new GRPO techniques. You can play with the following arguments in GRPOConfig to enable:
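A hedged sketch of those arguments (names follow recent TRL releases and may differ in your installed version; check GRPOConfig before relying on them):

from trl import GRPOConfig

training_args = GRPOConfig(
    loss_type = "dr_grpo",              # Dr. GRPO loss (others: "grpo", "bnpo")
    epsilon = 0.2,                      # lower clipping bound
    epsilon_high = 0.28,                # DAPO-style asymmetric upper clipping bound
    mask_truncated_completions = True,  # DAPO: drop truncated generations from the loss
    scale_rewards = False,              # Dr. GRPO suggestion: skip std-scaling of rewards
    # ...plus the usual use_vllm, learning_rate, num_generations, max_steps, etc.
)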

You should see the reward increase over time. We would recommend you train for at least 300 steps, which may take 30 minutes; however, for optimal results, you should train for longer.

If you're having issues with your GRPO model not learning, we'd highly recommend using our advanced GRPO notebooks as they have a much better reward function, and you should see results much faster and more frequently.

You will also see sample answers, which allows you to see how the model is learning. Some may have steps, XML tags, attempts etc., and the idea is that as the model trains, it's going to get better and better because it's going to get scored higher and higher, until we get the outputs we desire with long reasoning chains of answers.

    7

    Run & Evaluate your model

Run your model by clicking the play button. In the first example, there is usually no reasoning in the answer. In order to see the reasoning, we first need to save the LoRA weights we just trained with GRPO using:
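In our notebooks this is typically done with Unsloth's LoRA helpers; a minimal sketch (folder name and sampling settings are placeholders, and it assumes the model was loaded with fast_inference = True):

model.save_lora("grpo_saved_lora")

# Then load the LoRA at inference time to reveal the reasoning:
from vllm import SamplingParams

text = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What is 2+2? Show your reasoning."}],
    tokenize = False, add_generation_prompt = True,
)
sampling_params = SamplingParams(temperature = 0.8, top_p = 0.95, max_tokens = 1024)
output = model.fast_generate(
    text,
    sampling_params = sampling_params,
    lora_request = model.load_lora("grpo_saved_lora"),
)[0].outputs[0].text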

    The first inference example run has no reasoning. You must load the LoRA and test it to reveal the reasoning.

    Then we load the LoRA and test it. Our reasoning model is much better - it's not always correct, since we only trained it for an hour or so - it'll be better if we extend the sequence length and train for longer!

    You can then save your model to GGUF, Ollama etc. by following our guide here.

If you are still not getting any reasoning, you may have either trained for too few steps, or your reward function/verifier was not optimal.

    8

    Save your model

    We have multiple options for saving your fine-tuned model, but we’ll focus on the easiest and most popular approaches which you can read more about here

    Saving in 16-bit Precision

    You can save the model with 16-bit precision using the following command:
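A minimal sketch (the output folder name "merged_model" is a placeholder):

model.save_pretrained_merged("merged_model", tokenizer, save_method = "merged_16bit")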

    Pushing to Hugging Face Hub

    To share your model, we’ll push it to the Hugging Face Hub using the push_to_hub_merged method. This allows saving the model in multiple quantization formats.
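A minimal sketch of that push (repo name and token are placeholders):

model.push_to_hub_merged(
    "your_username/your_model",
    tokenizer,
    save_method = "merged_16bit",
    token = "hf_...",   # your Hugging Face write token
)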

    Saving in GGUF Format for llama.cpp

    Unsloth also supports saving in GGUF format, making it compatible with llama.cpp and Ollama.

    Once saved in GGUF format, the model can be easily deployed in lightweight environments using llama.cpp or used in other inference engines.

    Gemma 3 (4B) - Vision GSPO

    Qwen3 (4B) - Advanced

    DeepSeek-R1-0528-Qwen3-8B

    Llama 3.2 (3B) - Advanced

    Fine-tuning Guide
    requirements
    Windows install
    gpt-oss-20b
    Qwen2.5-VL

    671B MoE

    251GB

    DeepSeek R1

    And finally offload all layers via -ot ".ffn_.*_exps.=CPU" This uses the least VRAM.

    You can also customize the regex, for example -ot "\.(6|7|8|9|[0-9][0-9]|[0-9][0-9][0-9])\.ffn_(gate|up|down)_exps.=CPU" means to offload gate, up and down MoE layers but only from the 6th layer onwards.

    70B Dense

    UD-Q4_K_XL

    44GB

    Llama 3 70B

    109B MoE

    UD-Q3_K_XL

    50GB

    Llama 4 Scout

    405B Dense

    UD-Q2_K_XL

    152GB

    Run 671B MoE
    Run 109B MoE
    Run 405B Dense
    Run 70B Dense
    GitHub here
    https://huggingface.co/unsloth/cogito-671b-v2.1-GGUF
    671B model above
    671B model above
    671B model above
    https://www.deepcogito.com/research/cogito-v2-1

    Llama 3 405B

    Dynamic 4-bit Instruct:

    FP8 Dynamic:

    You can also view our Granite-4.0 collection for all uploads including Dynamic Float8 quants etc.

    Granite-4.0 Models Explanations:

    • Nano and H-Nano: The 350M and 1B models offer strong instruction-following abilities, enabling advanced on-device and edge AI and research/fine-tuning applications.

    • H-Small (MoE): Enterprise workhorse for daily tasks, supports multiple long-context sessions on entry GPUs like L40S (32B total, 9B active).

    • H-Tiny (MoE): Fast, cost-efficient for high-volume, low-complexity tasks; optimized for local and edge use (7B total, 1B active).

    • H-Micro (Dense): Lightweight, efficient for high-volume, low-complexity workloads; ideal for local and edge deployment (3B total).

    • Micro (Dense): Alternative dense option when Mamba2 isn’t fully supported (3B total).

    Run Granite-4.0 Tutorials

    ⚙️ Recommended Inference Settings

    IBM recommends these settings:

    temperature=0.0, top_p=1.0, top_k=0

    • Temperature of 0.0

    • Top_K = 0

    • Top_P = 1.0

    • Recommended minimum context: 16,384

    • Maximum context length window: 131,072 (128K context)

    Chat template:

    🦙 Ollama: Run Granite-4.0 Tutorial

    1. Install ollama if you haven't already!

2. Run the model! Note you can call ollama serve in another terminal if it fails! We include all our fixes and suggested parameters (temperature etc) in params in our Hugging Face upload! You can change the model name 'granite-4.0-h-small-GGUF' to any Granite model like 'granite-4.0-h-micro:Q8_K_XL'.

    📖 llama.cpp: Run Granite-4.0 Tutorial

    1. Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.

2. If you want to use llama.cpp directly to load models, you can do the below: (:Q4_K_XL) is the quantization type. You can also download via Hugging Face (point 3). This is similar to ollama run

3. OR download the model via (after installing pip install huggingface_hub hf_transfer ). You can choose Q4_K_M, or other quantized versions (like BF16 full precision).

4. Run Unsloth's Flappy Bird test

5. Edit --threads 32 for the number of CPU threads, --ctx-size 16384 for context length (Granite-4.0 supports 128K context length!), --n-gpu-layers 99 for GPU offloading on how many layers. Try adjusting it if your GPU goes out of memory. Also remove it if you have CPU only inference.

6. For conversation mode:
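A hedged example command (quant tag and thread count are assumptions; the sampling settings follow the recommendations above):

./llama.cpp/llama-cli \
    -hf unsloth/granite-4.0-h-small-GGUF:Q4_K_XL \
    --jinja -ngl 99 --threads 32 --ctx-size 16384 \
    --temp 0.0 --top-p 1.0 --top-k 0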

    🐋 Docker: Run Granite-4.0 Tutorial

If you already have Docker Desktop, all you need to do is run the command below and you're done:

    🦥 Fine-tuning Granite-4.0 in Unsloth

Unsloth now supports all Granite 4.0 models including nano, micro, tiny and small for fine-tuning. Training is 2x faster, uses 50% less VRAM and supports 6x longer context lengths. Granite-4.0 micro and tiny fit comfortably in a 15GB VRAM T4 GPU.

    • Granite-4.0 free fine-tuning notebook

    • Granite-4.0-350M fine-tuning notebook

    This notebook trains a model to become a Support Agent that understands customer interactions, complete with analysis and recommendations. This setup allows you to train a bot that provides real-time assistance to support agents.

    We also show you how to train a model using data stored in a Google Sheet.

    Unsloth config for Granite-4.0:
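A minimal sketch of such a config (the model name and hyperparameters are assumptions; swap in the Granite-4.0 variant you want to train):

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/granite-4.0-h-micro",  # assumed repo name; pick your size
    max_seq_length = 2048,
    load_in_4bit = True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    lora_alpha = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
)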

    If you have an old version of Unsloth and/or are fine-tuning locally, install the latest version of Unsloth:

    how to run
    fine-tune Granite-4.0
    Running Tutorial
    Fine-tuning Tutorial
    ⚙️ Recommended Settings

    The 3-bit dynamic quant uses 118GB (126GiB) of disk space - this works well in a 128GB RAM unified memory Mac or on a 1x24GB card and 128GB of RAM. It is recommended to have at least 120GB RAM to run this 3-bit quant.

    You must use --jinja for Grok 2. You might get incorrect results if you do not use --jinja

The 8-bit quant is ~300GB in size and will fit in a 1x 80GB GPU (with MoE layers offloaded to RAM). Expect around 5 tokens/s with this setup if you also have an extra 200GB of RAM. To learn how to increase generation speed and fit longer contexts, read here.

    Though not a must, for best performance, have your VRAM + RAM combined equal to the size of the quant you're downloading. If not, hard drive / SSD offloading will work with llama.cpp, just inference will be slower.

    Sampling parameters

• Grok 2 has a 128K max context length; thus, use 131,072 context or less.

    • Use --jinja for llama.cpp variants

    There are no official sampling parameters to run the model, thus you can use standard defaults for most models:

    • Set the temperature = 1.0

    • Min_P = 0.01 (optional, but 0.01 works well, llama.cpp default is 0.1)

    Run Grok 2 Tutorial:

    Currently you can only run Grok 2 in llama.cpp.

    ✨ Run in llama.cpp

    1

    Install the specific llama.cpp PR for Grok 2 on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.

    2

    If you want to use llama.cpp directly to load models, you can do the below: (:Q3_K_XL) is the quantization type. You can also download via Hugging Face (point 3). This is similar to ollama run . Use export LLAMA_CACHE="folder" to force llama.cpp to save to a specific location. Remember the model has only a maximum of 128K context length.

    Please try out -ot ".ffn_.*_exps.=CPU" to offload all MoE layers to the CPU! This effectively allows you to fit all non MoE layers on 1 GPU, improving generation speeds. You can customize the regex expression to fit more layers if you have more GPU capacity.

    If you have a bit more GPU memory, try -ot ".ffn_(up|down)_exps.=CPU" This offloads up and down projection MoE layers.

    Try -ot ".ffn_(up)_exps.=CPU" if you have even more GPU memory. This offloads only up projection MoE layers.

    And finally offload all layers via -ot ".ffn_.*_exps.=CPU"

    3

    Download the model via (after installing pip install huggingface_hub hf_transfer ). You can choose UD-Q3_K_XL (dynamic 3-bit quant) or other quantized versions like Q4_K_M . We recommend using our 2.7bit dynamic quant UD-Q2_K_XL or above to balance size and accuracy.

    4

    You can edit --threads 32 for the number of CPU threads, --ctx-size 16384 for context length, --n-gpu-layers 2 for GPU offloading on how many layers. Try adjusting it if your GPU goes out of memory. Also remove it if you have CPU only inference.

    Model uploads

ALL our uploads, including those that are not imatrix-based or dynamic, utilize our calibration dataset, which is specifically optimized for conversational, coding, and language tasks.

MoE Bits   Type + Link   Disk Size   Details
1.66bit    -             81.8 GB     1.92/1.56bit
1.78bit    -             88.9 GB     2.06/1.56bit
1.93bit    -             94.5 GB     -

    🏂 Improving generation speed

    If you have more VRAM, you can try offloading more MoE layers, or offloading whole layers themselves.

    Normally, -ot ".ffn_.*_exps.=CPU" offloads all MoE layers to the CPU! This effectively allows you to fit all non MoE layers on 1 GPU, improving generation speeds. You can customize the regex expression to fit more layers if you have more GPU capacity.

    If you have a bit more GPU memory, try -ot ".ffn_(up|down)_exps.=CPU" This offloads up and down projection MoE layers.

    Try -ot ".ffn_(up)_exps.=CPU" if you have even more GPU memory. This offloads only up projection MoE layers.

    You can also customize the regex, for example -ot "\.(6|7|8|9|[0-9][0-9]|[0-9][0-9][0-9])\.ffn_(gate|up|down)_exps.=CPU" means to offload gate, up and down MoE layers but only from the 6th layer onwards.

    The latest llama.cpp release also introduces high throughput mode. Use llama-parallel. Read more about it here. You can also quantize the KV cache to 4bits for example to reduce VRAM / RAM movement, which can also make the generation process faster.

    📐How to fit long context (full 128K)

    To fit longer context, you can use KV cache quantization to quantize the K and V caches to lower bits. This can also increase generation speed due to reduced RAM / VRAM data movement. The allowed options for K quantization (default is f16) include the below.

    --cache-type-k f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1

You should use the _1 variants for somewhat increased accuracy, albeit slightly slower, e.g. q4_1, q5_1.

You can also quantize the V cache, but you will need to compile llama.cpp with Flash Attention support via -DGGML_CUDA_FA_ALL_QUANTS=ON, and use --flash-attn to enable it. Then you can use it together with --cache-type-k:

    --cache-type-v f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1
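A hedged example putting these together (the model path is a placeholder and flag syntax can vary between llama.cpp versions):

./llama.cpp/llama-cli \
    --model Grok-2-UD-Q3_K_XL.gguf \
    --jinja --ctx-size 131072 \
    --flash-attn \
    --cache-type-k q4_1 \
    --cache-type-v q4_1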

    Grok-2-GGUF
    supporting Grok 2
    Dynamic 2.0
    Run in llama.cpp Tutorial
    Simply install Unsloth:

    If you see issues, another option is to create a separate isolated environment:

    Note it might be pip3 or pip3.13 and also python3 or python3.13

You might encounter some Xformers issues, in which case you should build from source:

    Docker

    unsloth/unsloth is Unsloth's only Docker image. For Blackwell and 50-series GPUs, use this same image - no separate image needed.

    For installation instructions, please follow our Unsloth Docker guide.

    uv

    uv (Advanced)

The installation order is important, since we want to overwrite the bundled dependencies with specific versions (namely, xformers and triton).

1. I prefer to use uv over pip as it's faster and better at resolving dependencies, especially for libraries which depend on torch but require a specific CUDA version, as in this scenario.

      Install uv

      Create a project dir and venv:

    2. Install vllm

      Note that we have to specify cu128, otherwise vllm will install torch==2.7.0 but with cu126.

    3. Install unsloth dependencies

      If you notice weird resolving issues due to Xformers, you can also install Unsloth from source without Xformers:

    4. Download and build xformers (Optional)

      Xformers is optional, but it is definitely faster and uses less memory. We'll use PyTorch's native SDPA if you do not want Xformers. Building Xformers from source might be slow, so beware!

      Note that we have to explicitly set TORCH_CUDA_ARCH_LIST=12.0.

    5. transformers Install any transformers version, but best to get the latest.

    Conda or mamba (Advanced)

    1. Install conda/mamba

      Run the installation script

      Create a conda or mamba environment

      Activate newly created environment

    2. Install vllm

  Make sure you are inside the activated conda/mamba environment. You should see the name of your environment as a prefix in your terminal shell, like this: (unsloth-blackwell)user@machine:

      Note that we have to specify cu128, otherwise vllm will install torch==2.7.0 but with cu126.

    3. Install unsloth dependencies

  Make sure you are inside the activated conda/mamba environment. You should see the name of your environment as a prefix in your terminal shell, like this: (unsloth-blackwell)user@machine:

    4. Download and build xformers (Optional)

      Xformers is optional, but it is definitely faster and uses less memory. We'll use PyTorch's native SDPA if you do not want Xformers. Building Xformers from source might be slow, so beware!

  You should see the name of your environment as a prefix in your terminal shell, like this: (unsloth-blackwell)user@machine:

      Note that we have to explicitly set TORCH_CUDA_ARCH_LIST=12.0.

    5. Update triton

  Make sure you are inside the activated conda/mamba environment. You should see the name of your environment as a prefix in your terminal shell, like this: (unsloth-blackwell)user@machine:

      triton>=3.3.1 is required for Blackwell support.

    6. Transformers Install any transformers version, but best to get the latest.

If you are using mamba as your package manager, just replace conda with mamba in all the commands shown above.

    WSL-Specific Notes

    If you're using WSL (Windows Subsystem for Linux) and encounter issues during xformers compilation (reminder Xformers is optional, but faster for training) follow these additional steps:

    1. Increase WSL Memory Limit Create or edit the WSL configuration file:

      After making these changes, restart WSL:

    2. Install xformers Use the following command to install xformers with optimized compilation for WSL:

      The --no-build-isolation flag helps avoid potential build issues in WSL environments.

    NVIDIA blogpost here
    DGX Spark
    Docker image
    Guide
    Install Unsloth from source

We'll clone Unsloth from GitHub and install it. Please consider using a virtual environment; we like to use uv venv --python 3.12 && source .venv/bin/activate, but any virtual environment creation tooling will do.

    Choose target model and dataset for finetuning

    In this demo, we will fine-tune Qwen/Qwen3-8B on the yahma/alpaca-cleaned chat dataset. This is a Supervised Fine-Tuning (SFT) workload that is commonly used when attempting to adapt a base model to a desired conversational style, or improve the model’s performance on a downstream task.

    Use the Unsloth CLI!

    First, let’s take a look at the help message built-in to the CLI (we’ve abbreviated here with “...” in various places for brevity):

    This should give you a sense of what options are available for you to pass into the CLI for training your model!

    For multi-GPU training (DDP in this case), we will use the torchrun launcher, which allows you to spin up multiple distributed training processes in single-node or multi-node settings. In our case, we will focus on the single-node (i.e., one machine) case with two H100 GPUs.

    Let’s also check our GPUs’ status by using the nvidia-smi command-line tool:

Great! We have two H100 GPUs, as expected. Both are sitting at 0MiB memory usage since we're not currently training anything and have no model loaded into memory.

    To start your training run, issue a command like the following:

    If you have more GPUs, you may set --nproc_per_node accordingly to utilize them.

    Note: You can use the torchrun launcher with any of your Unsloth training scripts, including the scripts converted from our free Colab notebooks, and DDP will be auto-enabled when training with >1 GPU!

    Taking a look again at nvidia-smi while training is in-flight:

    We can see that both GPUs are now using ~19GB of VRAM per H100 GPU!

    Inspecting the training logs, we see that we’re able to train at a rate of ~1.1 iterations/s. This training speed is ~constant even as we add more GPUs, so our training throughput increases ~linearly with the number of GPUs!

    Training metrics

    We ran a few short rank-16 LoRA fine-tunes on unsloth/Llama-3.2-1B-Instruct on the yahma/alpaca-cleaned dataset to demonstrate the improved training throughput when using DDP training with multiple GPUs.

    The above figure compares training loss between two Llama-3.2-1B-Instruct LoRA fine-tunes over 500 training steps, with single GPU training (pink) vs. multi-GPU DDP training (blue).

    Notice that the loss curves match in scale and trend, but otherwise are a bit different, since the multi-GPU training processes twice as much training data per step. This results in a slightly different training curve with less variability on a step-by-step basis.

    The above figure plots training progress for the same two fine-tunes.

    Notice that the multi-GPU DDP training progresses through an epoch of the training data in half as many steps as single GPU training. This is because each GPU can process a distinct batch (of size per_device_train_batch_size) per step. However, the per-step timing for DDP training is slightly slower due to distributed communication for the model weight updates. As you increase the number of GPUs, the training throughput will continue to increase ~linearly (but with a small, but increasing penalty for the distributed comms).
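As a rough illustration (a sketch, using the per_device_train_batch_size = 1 and gradient_accumulation_steps = 4 values from the torchrun example in this guide):

num_gpus = 2
per_device_train_batch_size = 1
gradient_accumulation_steps = 4

samples_per_step = num_gpus * per_device_train_batch_size                      # 2 distinct samples processed each step
samples_per_optimizer_update = samples_per_step * gradient_accumulation_steps  # 8 samples per weight update
print(samples_per_step, samples_per_optimizer_update)                          # 2 8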

    These same loss and training epoch progress behaviors hold for QLoRA fine-tunes, in which we loaded the base models in 4-bit precision in order to save additional GPU memory. This is particularly useful for training large models on limited amounts of GPU VRAM:

    Training loss comparison between two Llama-3.2-1B-Instruct QLoRA fine-tunes over 500 training steps, with single GPU training (orange) vs. multi-GPU DDP training (purple).

    Training progress comparison for the same two fine-tunes.

    3B
    8B
    14B
    3B
    8B
    14B

    Panoramic radiography shows an osteolytic lesion in the right posterior maxilla with resorption of the floor of the maxillary sinus (arrows).

    Notebook
    Notebook
    Notebook
    datasets
    Llama 3.2 Vision Notebook
    here
    notebook here

    Advanced RL Documentation

    Advanced documentation settings when using Unsloth with GRPO.

    Detailed guides on doing GRPO with Unsloth for Batching, Generation & Training Parameters:

    Training Parameters

    • beta (float, default 0.0): KL coefficient.

      • 0.0 ⇒ no reference model loaded (lower memory, faster).

      • Higher beta constrains the policy to stay closer to the ref policy.

    • num_iterations (int, default 1): PPO epochs per batch (μ in the algorithm). Replays data within each gradient accumulation step; e.g., 2 = two forward passes per accumulation step.

• epsilon (float, default 0.2): Clipping value for the token-level importance ratios (with the default ε the ratio is clipped to ≈ [0.8, 1.2]).

    • delta (float, optional): Enables upper clipping bound for two-sided GRPO when set. If None, standard GRPO clipping is used. Recommended > 1 + ε when enabled (per INTELLECT-2 report).

    • epsilon_high (float, optional): Upper-bound epsilon; defaults to epsilon if unset. DAPO recommends 0.28.

    • importance_sampling_level (“token” | “sequence”, default "token"):

      • "token": raw per-token ratios (one weight per token).

      • "sequence": average per-token ratios to a single sequence-level ratio. GSPO shows sequence-level sampling often gives more stable training for sequence-level rewards.

    • reward_weights (list[float], optional): One weight per reward. If None, all weights = 1.0.

    • scale_rewards (str|bool, default "group"):

      • True or "group": scale by std within each group (unit variance in group).

      • "batch": scale by std across the entire batch

    • loss_type (str, default "dapo"):

      • "grpo": normalizes over sequence length (length bias; not recommended).

      • "dr_grpo": normalizes by a global constant (introduced in Dr. GRPO; removes length bias). Constant ≈ max_completion_length

    • mask_truncated_completions (bool, default False): When True, truncated completions are excluded from loss (recommended by DAPO for stability). Note: There are some KL issues with this flag, so we recommend to disable it.

      This can zero out all completion_mask entries when many completions are truncated, making n_mask_per_reward = 0 and causing KL to become NaN.

    • vllm_importance_sampling_correction (bool, default True): Applies Truncated Importance Sampling (TIS) to correct off-policy effects when generation (e.g., vLLM / fast_inference) differs from training backend. In Unsloth, this is auto-set to True if you’re using vLLM/fast_inference; otherwise False.

    • vllm_importance_sampling_cap (float, default 2.0): Truncation parameter C for TIS; sets an upper bound on the importance sampling ratio to improve stability.

• dtype: when choosing float16 or bfloat16, see our FP16 vs BF16 for RL guide. A combined configuration sketch of these parameters is shown below.
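As a hedged sketch of how these fields fit together (names mirror the GRPO parameters documented above; exact availability depends on your TRL / Unsloth version):

from trl import GRPOConfig

training_args = GRPOConfig(
    beta = 0.0,                              # no KL penalty -> no reference model loaded
    num_iterations = 1,                      # mu: PPO epochs per generated batch
    epsilon = 0.2,                           # lower clipping bound
    epsilon_high = 0.28,                     # DAPO-style upper clipping bound
    importance_sampling_level = "sequence",  # GSPO-style sequence-level ratios
    loss_type = "dapo",
    scale_rewards = "group",
    mask_truncated_completions = False,      # keep disabled to avoid the KL/NaN issue noted above
)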

    Generation Parameters

    • temperature (float, defaults to 1.0): Temperature for sampling. The higher the temperature, the more random the completions. Make sure you use a relatively high (1.0) temperature to have diversity in generations which helps learning.

    • top_p (float, optional, defaults to 1.0): Float that controls the cumulative probability of the top tokens to consider. Must be in (0, 1]. Set to 1.0 to consider all tokens.

    • top_k (int, optional): Number of highest probability vocabulary tokens to keep for top-k-filtering. If None, top-k-filtering is disabled and all tokens are considered.
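The sampling controls live on the same config object; a minimal sketch continuing the example above (defaults shown where the text recommends them):

training_args = GRPOConfig(
    temperature = 1.0,   # keep relatively high so generations stay diverse
    top_p = 1.0,         # consider all tokens
    top_k = None,        # disable top-k filtering
)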

This parameter can be confusing to tune; we recommend controlling batch sizes through per_device_train_batch_size and gradient accumulation instead.

    Batch & Throughput Parameters

    Parameters that control batches

    • train_batch_size: Number of samples per process per step. If this integer is less than num_generations, it will default to num_generations.

    • steps_per_generation: Number of microbatches that contribute to one generation’s loss calculation (forward passes only). A new batch of data is generated every steps_per_generation steps; backpropagation timing depends on gradient_accumulation_steps

    GRPO Batch Examples

    The tables below illustrate how batches flow through steps, when optimizer updates occur, and how new batches are generated.

    Example 1

    Generation cycle A

    Step
    Batch
    Notes

    Generation cycle B

    Step
    Batch
    Notes

    Example 2

    Generation cycle A

    Step
    Batch
    Notes

    Generation cycle B

    Step
    Batch
    Notes

    Example 3

    Generation cycle A

    Step
    Batch
    Notes

    Generation cycle B

    Step
    Batch
    Notes

    Example 4

    Generation cycle A

    Step
    Batch
    Notes

    Generation cycle B

    Step
    Batch
    Notes

    Quick Formula Reference

    Tutorial: How to Train gpt-oss with RL

    Learn to train OpenAI gpt-oss with GRPO to autonomously beat 2048 locally or on Colab.

    LLMs often struggle with tasks that involve complex environments. However, by applying reinforcement learning (RL) and designing a custom reward function, these challenges can be overcome.

RL can be adapted to tasks such as automatic kernel generation or strategy creation. This tutorial shows how to train gpt-oss with GRPO and Unsloth to autonomously beat 2048.

    (Official OpenAI example)

    What you’ll build:

    • Train gpt-oss-20b so the model can automatically win 2048

    • Create a minimal 2048 environment the model can interact with

    • Define reward functions that:

      1. Check the generated strategy compiles and runs,

      2. Prevent reward hacking (disallow external imports), and

      3. Reward actual game success

    • Run inference and export the model (MXFP4 4‑bit or merged FP16)

    Hardware: The 2048 example runs on a free Colab T4, but training will be slow. A100/H100 is much faster. 4‑bit loading + LoRA lets you fit a 20B model into modest VRAM.

    1

    Install Unsloth

    Run this cell at the top of a notebook (works on Colab).

    2

    Gemma 3: How to Run Guide

    How to run Gemma 3 effectively with our GGUFs on llama.cpp, Ollama, Open WebUI and how to fine-tune with Unsloth!

Google released Gemma 3 with a new 270M model alongside the previous 1B, 4B, 12B, and 27B sizes. The 270M and 1B are text-only, while the larger models handle both text and vision. We provide GGUFs, a guide on how to run Gemma 3 effectively, and instructions on how to fine-tune & do RL with Gemma 3!

    NEW Aug 14, 2025 Update: Try our fine-tuning Gemma 3 (270M) notebook and GGUFs to run.

    Also see our Gemma 3n Guide.

Running Tutorial • Fine-tuning Tutorial

    Unsloth is the only framework which works in float16 machines for Gemma 3 inference and training. This means Colab Notebooks with free Tesla T4 GPUs also work!

• Fine-tune Gemma 3 (4B) with vision support using our free Colab notebook

    According to the Gemma team, the optimal config for inference is temperature = 1.0, top_k = 64, top_p = 0.95, min_p = 0.0

    Unsloth Gemma 3 uploads with optimal configs:

    GGUF
    Unsloth Dynamic 4-bit Instruct
    16-bit Instruct

    ⚙️ Recommended Inference Settings

    According to the Gemma team, the official recommended settings for inference is:

    • Temperature of 1.0

    • Top_K of 64

    • Min_P of 0.00 (optional, but 0.01 works well, llama.cpp default is 0.1)

    • Top_P of 0.95

llama.cpp and other inference engines auto-add a <bos> - DO NOT add TWO <bos> tokens! Leave the <bos> out when prompting the model yourself!
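For reference, here is a minimal sketch of these settings with Hugging Face transformers (the model id and prompt are placeholders; swap in the Gemma 3 size you are running, and use float16 on older GPUs). The chat template already inserts the single <bos> for you:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "unsloth/gemma-3-270m-it"   # placeholder: any Gemma 3 instruct checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

messages = [{"role": "user", "content": "Explain photosynthesis in one sentence."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)

outputs = model.generate(
    inputs,
    max_new_tokens = 128,
    do_sample = True,
    temperature = 1.0,   # recommended Gemma 3 settings
    top_k = 64,
    top_p = 0.95,
    min_p = 0.0,
)
print(tokenizer.decode(outputs[0, inputs.shape[-1]:], skip_special_tokens=True))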

    ✨Running Gemma 3 on your phone

To run the models on your phone, we recommend using any mobile app that can run GGUFs locally on edge devices like phones. After fine-tuning, you can export your model to GGUF and then run it locally on your phone. Make sure your phone has enough RAM and thermal headroom, as larger models can overheat it, so we recommend Gemma 3 270M or the Gemma 3n models for this use case. You can try the open-source project AnythingLLM's mobile app or ChatterUI, which are great apps for running GGUFs on your phone.

    Remember, you can change the model name 'gemma-3-27b-it-GGUF' to any Gemma model like 'gemma-3-270m-it-GGUF:Q8_K_XL' for all the tutorials.

    🦙 Tutorial: How to Run Gemma 3 in Ollama

1. Install ollama if you haven't already!

2. Run the model! Note you can call ollama serve in another terminal if it fails! We include all our fixes and suggested parameters (temperature etc) in the params of our Hugging Face upload! You can change the model name 'gemma-3-27b-it-GGUF' to any Gemma model like 'gemma-3-270m-it-GGUF:Q8_K_XL'.

    📖 Tutorial: How to Run Gemma 3 27B in llama.cpp

1. Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.

2. If you want to use llama.cpp directly to load models, you can do the below: (:Q4_K_XL) is the quantization type. You can also download via Hugging Face (point 3). This is similar to ollama run

3. OR download the model via (after installing pip install huggingface_hub hf_transfer). You can choose Q4_K_M, or other quantized versions (like BF16 full precision). More versions at:

4. Run Unsloth's Flappy Bird test

5. Edit --threads 32 for the number of CPU threads, --ctx-size 16384 for context length (Gemma 3 supports 128K context length!), and --n-gpu-layers 99 for how many layers to offload to the GPU. Lower it if your GPU runs out of memory, or remove it entirely for CPU-only inference.

6. For conversation mode:

7. For non-conversation mode to test Flappy Bird:

    The full input from our 1.58bit blog is:

    Remember to remove <bos> since Gemma 3 auto adds a <bos>!

    🦥 Fine-tuning Gemma 3 in Unsloth

    Unsloth is the only framework which works in float16 machines for Gemma 3 inference and training. This means Colab Notebooks with free Tesla T4 GPUs also work!

• Try our new Gemma 3 (270M) notebook which makes the 270M parameter model very smart at playing chess and can predict the next chess move.

• Fine-tune Gemma 3 (4B) using our notebooks for: Text or Vision

• Or fine-tune Gemma 3n (E4B) with our notebooks for: Text • Vision • Audio

When doing a full fine-tune (FFT) of Gemma 3, all layers default to float32 on float16 devices. Unsloth expects float16 and upcasts dynamically. To fix this, run model.to(torch.float16) after loading, or use a GPU with bfloat16 support.

    Unsloth Fine-tuning Fixes

    Our solution in Unsloth is 3 fold:

1. Keep all intermediate activations in bfloat16 format - this can be float32, but that uses 2x more VRAM or RAM (via Unsloth's async gradient checkpointing)

2. Do all matrix multiplies in float16 with tensor cores, manually upcasting / downcasting without relying on PyTorch's mixed precision autocast.

3. Upcast all other operations that don't need matrix multiplies (e.g. layernorms) to float32.

    🤔 Gemma 3 Fixes Analysis

First, before we finetune or run Gemma 3, we found that when using float16 mixed precision, gradients and activations unfortunately become infinity. This happens on T4, RTX 20-series and V100 GPUs, which only have float16 tensor cores.

    For newer GPUs like RTX 30x or higher, A100s, H100s etc, these GPUs have bfloat16 tensor cores, so this problem does not happen! But why?

Float16 can only represent numbers up to 65504, whilst bfloat16 can represent huge numbers up to around 10^38 - yet both formats use only 16 bits! The difference is how those bits are split: float16 spends more bits on the mantissa, so it represents small fractions more precisely, whilst bfloat16 spends more bits on the exponent, so it can hold huge numbers but represents fractions less precisely.
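You can see this directly in PyTorch:

import torch

print(torch.tensor(70000.0, dtype=torch.float16))        # inf     - overflows float16's max of 65504
print(torch.tensor(70000.0, dtype=torch.bfloat16))       # ~70144  - bfloat16's large exponent range copes fine
print(torch.tensor(1.0009765625, dtype=torch.float16))   # 1.0010  - float16 keeps the small fraction
print(torch.tensor(1.0009765625, dtype=torch.bfloat16))  # 1.0     - bfloat16 rounds the fraction away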

    But why float16? Let's just use float32! But unfortunately float32 in GPUs is very slow for matrix multiplications - sometimes 4 to 10x slower! So we cannot do this.

    Chat Templates

    Learn the fundamentals and customization options of chat templates, including Conversational, ChatML, ShareGPT, Alpaca formats, and more!

    In our GitHub, we have a list of every chat template Unsloth uses including for Llama, Mistral, Phi-4 etc. So if you need any pointers on the formatting or use case, you can view them here: github.com/unslothai/unsloth/blob/main/unsloth/chat_templates.py

    List of Colab chat template notebooks:

    • Conversational

    • by Timotheeee

    • by Flail

    Multi turn conversations

A big issue, if you didn't notice, is that the Alpaca dataset is single turn, whilst ChatGPT is interactive and you can talk to it over multiple turns. The Alpaca dataset only provides single, standalone exchanges, but we want the finetuned language model to somehow learn how to do multi turn conversations just like ChatGPT.

So we introduced the conversation_extension parameter, which essentially selects some random rows in your single turn dataset, and merges them into 1 conversation! For example, if you set it to 3, we randomly select 3 rows and merge them into 1! Setting it too high can make training slower, but could make your chatbot and final finetune much better!

Then set output_column_name to the prediction / output column. For the Alpaca dataset, it would be the output column.

We then use the standardize_sharegpt function to put the dataset into the correct format for finetuning! Always call this!
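Roughly, the calls look like this (a sketch following our conversational notebooks; check the notebook for the exact arguments to the to_sharegpt helper):

from datasets import load_dataset
from unsloth import to_sharegpt
from unsloth.chat_templates import standardize_sharegpt

dataset = load_dataset("yahma/alpaca-cleaned", split = "train")

dataset = to_sharegpt(
    dataset,
    merged_prompt = "{instruction}[[\nYour input is:\n{input}]]",  # merge all columns into one prompt
    output_column_name = "output",        # the prediction / output column
    conversation_extension = 3,           # merge 3 random single-turn rows into one conversation
)
dataset = standardize_sharegpt(dataset)   # always call this to normalize the format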

    Customizable Chat Templates

    We can now specify the chat template for finetuning itself. The very famous Alpaca format is below:

    But remember we said this was a bad idea because ChatGPT style finetunes require only 1 prompt? Since we successfully merged all dataset columns into 1 using Unsloth, we essentially can create the below style chat template with 1 input column (instruction) and 1 output:

We just require you to put an {INPUT} field for the instruction and an {OUTPUT} field for the model's output. We also allow an optional {SYSTEM} field, which is useful for customizing a system prompt just like in ChatGPT. For example, below are some cool ways you can customize the chat template:
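For instance, an Alpaca-style template with an optional system prompt could look like this (illustrative; pair it with Unsloth's chat template helpers from the notebook):

chat_template = """{SYSTEM}

### Instruction:
{INPUT}

### Response:
{OUTPUT}"""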

    For the ChatML format used in OpenAI models:

Or you can use the Llama-3 template itself (which only works with the instruct version of Llama-3). The optional {SYSTEM} field works here as well to customize a system prompt just like in ChatGPT.

    Or in the Titanic prediction task where you had to predict if a passenger died or survived in this Colab notebook which includes CSV and Excel uploading:

    Applying Chat Templates with Unsloth

For datasets that follow the common ChatML format, the process of preparing the dataset for training or finetuning consists of four simple steps:

• Check the chat templates that Unsloth currently supports:

  This will print out the list of templates currently supported by Unsloth. Here is an example output:

• Use get_chat_template to apply the right chat template to your tokenizer:

• Define your formatting function. Here's an example:

    More Information

Assuming your dataset is a list of lists of dictionaries like the below:

You can use our get_chat_template to format it. Select chat_template to be any of zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, unsloth, and use mapping to map the dictionary keys (from, value etc). map_eos_token allows you to map <|im_end|> to EOS without any training.

    You can also make your own custom chat templates! For example our internal chat template we use is below. You must pass in a tuple of (custom_template, eos_token) where the eos_token must be used inside the template.

    Llama 4: How to Run & Fine-tune

    How to run Llama 4 locally using our dynamic GGUFs which recovers accuracy compared to standard quantization.

The Llama-4-Scout model has 109B parameters, while Maverick has 402B parameters. The full unquantized version requires 113GB of disk space whilst the 1.78-bit version uses 33.8GB (-75% reduction in size). Maverick (402B) went from 422GB to just 122GB (-70%).

    Both text AND vision is now supported! Plus multiple improvements to tool calling.

    Scout 1.78-bit fits in a 24GB VRAM GPU for fast inference at ~20 tokens/sec. Maverick 1.78-bit fits in 2x48GB VRAM GPUs for fast inference at ~40 tokens/sec.

For our dynamic GGUFs, to ensure the best tradeoff between accuracy and size, we do not quantize all layers uniformly, but selectively quantize e.g. the MoE layers to lower bits, and leave attention and other layers in 4 or 6-bit.

    Devstral: How to Run & Fine-tune

    Run and fine-tune Mistral Devstral 1.1, including Small-2507 and 2505.

    Devstral-Small-2507 (Devstral 1.1) is Mistral's new agentic LLM for software engineering. It excels at tool-calling, exploring codebases, and powering coding agents. Mistral AI released the original 2505 version in May, 2025.

Finetuned from , Devstral supports a 128k context window. Devstral Small 1.1 has improved performance, achieving a score of 53.6% on , making it (as of July 10, 2025) the #1 open model on the benchmark.

    Unsloth Devstral 1.1 GGUFs contain additional tool-calling support and chat template fixes. Devstral 1.1 still works well with OpenHands but now also generalizes better to other prompts and coding environments.

As Devstral is text-only, its vision encoder was removed prior to fine-tuning. We've added for the model.

    Troubleshooting & FAQs

    Tips to solve issues, and frequently asked questions.

If you're still encountering any issues with versions or dependencies, please use our Docker image, which has everything pre-installed.

If you find any issues, always try updating Unsloth first:

    pip install --upgrade --force-reinstall --no-cache-dir --no-deps unsloth unsloth_zoo

    Google Colabcolab.research.google.com
    Google Colabcolab.research.google.com
    model.save_lora("grpo_saved_lora")
    # Save to 16-bit precision
    model.save_pretrained_merged("model", tokenizer, save_method="merged_16bit")
    apt-get update
    apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
    git clone https://github.com/ggml-org/llama.cpp
    cmake llama.cpp -B llama.cpp/build \
        -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
    cmake --build llama.cpp/build --config Release -j --clean-first --target llama-quantize llama-cli llama-gguf-split llama-mtmd-cli
    cp llama.cpp/build/bin/llama-* llama.cpp
    export LLAMA_CACHE="unsloth/cogito-671b-v2.1-GGUF"
    ./llama.cpp/llama-cli \
        -hf unsloth/cogito-671b-v2.1-GGUF:UD-Q2_K_XL \
        --n-gpu-layers 99 \
        --temp 0.6 \
        --top_p 0.95 \
        --min_p 0.01 \
        --ctx-size 16384 \
        --seed 3407 \
        --jinja \
        -ot ".ffn_.*_exps.=CPU"
    # !pip install huggingface_hub hf_transfer
    import os
    os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "0" # Can sometimes rate limit, so set to 0 to disable
    from huggingface_hub import snapshot_download
    snapshot_download(
        repo_id = "unsloth/cogito-671b-v2.1-GGUF",
        local_dir = "unsloth/cogito-671b-v2.1-GGUF",
        allow_patterns = ["*UD-IQ1_S*"], # Dynamic 1bit (168GB) Use "*UD-Q2_K_XL*" for Dynamic 2bit (251GB)
    )
    export LLAMA_CACHE="unsloth/cogito-v2-preview-llama-109B-MoE-GGUF"
    ./llama.cpp/llama-cli \
        -hf unsloth/cogito-v2-preview-llama-109B-MoE-GGUF:Q3_K_XL \
        --n-gpu-layers 99 \
        --temp 0.6 \
        --min-p 0.01 \
        --top-p 0.9 \
        --ctx-size 16384 \
        --jinja \
        -ot ".ffn_.*_exps.=CPU"
    export LLAMA_CACHE="unsloth/cogito-v2-preview-llama-405B-GGUF"
    ./llama.cpp/llama-cli \
        -hf unsloth/cogito-v2-preview-llama-405B-GGUF:Q2_K_XL \
        --n-gpu-layers 99 \
        --temp 0.6 \
        --min-p 0.01 \
        --top-p 0.9 \
        --jinja \
        --ctx-size 16384
    export LLAMA_CACHE="unsloth/cogito-v2-preview-llama-70B-GGUF"
    ./llama.cpp/llama-cli \
        -hf unsloth/cogito-v2-preview-llama-70B-GGUF:Q4_K_XL \
        --n-gpu-layers 99 \
        --temp 0.6 \
        --min-p 0.01 \
        --top-p 0.9 \
        --jinja \
        --ctx-size 16384
    <|start_of_role|>system<|end_of_role|>You are a helpful assistant. Please ensure responses are professional, accurate, and safe.<|end_of_text|>
    <|start_of_role|>user<|end_of_role|>Please list one IBM Research laboratory located in the United States. You should only output its name and location.<|end_of_text|>
    <|start_of_role|>assistant<|end_of_role|>Almaden Research Center, San Jose, California<|end_of_text|>
    apt-get update
    apt-get install pciutils -y
    curl -fsSL https://ollama.com/install.sh | sh
    ollama run hf.co/unsloth/granite-4.0-h-small-GGUF:UD-Q4_K_XL
    apt-get update
    apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
    git clone https://github.com/ggml-org/llama.cpp
    cmake llama.cpp -B llama.cpp/build \
        -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
    cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-gguf-split
    cp llama.cpp/build/bin/llama-* llama.cpp
    ./llama.cpp/llama-cli \
        -hf unsloth/granite-4.0-h-small-GGUF:UD-Q4_K_XL
    # !pip install huggingface_hub hf_transfer
    import os
    os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
    from huggingface_hub import snapshot_download
    snapshot_download(
        repo_id = "unsloth/granite-4.0-h-small-GGUF",
        local_dir = "unsloth/granite-4.0-h-small-GGUF",
        allow_patterns = ["*UD-Q4_K_XL*"], # For Q4_K_M
    )
    ./llama.cpp/llama-mtmd-cli \
        --model unsloth/granite-4.0-h-small-GGUF/granite-4.0-h-small-UD-Q4_K_XL.gguf \
        --threads 32 \
        --jinja \
        --ctx-size 16384 \
        --n-gpu-layers 99 \
        --seed 3407 \
        --prio 2 \
        --temp 0.0 \
        --top-k 0 \
        --top-p 1.0
    docker model pull hf.co/unsloth/granite-4.0-h-small-GGUF:UD-Q4_K_XL
    !pip install --upgrade unsloth
    from unsloth import FastLanguageModel
    import torch
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "unsloth/granite-4.0-h-micro",
        max_seq_length = 2048,   # Context length - can be longer, but uses more memory
        load_in_4bit = True,     # 4bit uses much less memory
        load_in_8bit = False,    # A bit more accurate, uses 2x memory
        full_finetuning = False, # We have full finetuning now!
        # token = "hf_...",      # use one if using gated models
    )
    pip install --upgrade --force-reinstall --no-cache-dir unsloth unsloth_zoo
    apt-get update
    apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
    git clone https://github.com/ggml-org/llama.cpp
    cd llama.cpp && git fetch origin pull/15539/head:MASTER && git checkout MASTER && cd ..
    cmake llama.cpp -B llama.cpp/build \
        -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
    cmake --build llama.cpp/build --config Release -j --clean-first --target llama-quantize llama-cli llama-gguf-split llama-mtmd-cli llama-server
    cp llama.cpp/build/bin/llama-* llama.cpp
    curl -LsSf https://astral.sh/uv/install.sh | sh && source $HOME/.local/bin/env
    mkdir 'unsloth-blackwell' && cd 'unsloth-blackwell'
    uv venv .venv --python=3.12 --seed
    source .venv/bin/activate
    curl -L -O "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh"
    bash Miniforge3-$(uname)-$(uname -m).sh
    conda create --name unsloth-blackwell python==3.12 -y
    conda activate unsloth-blackwell
    pip install -U vllm --extra-index-url https://download.pytorch.org/whl/cu128
    # Create or edit .wslconfig in your Windows user directory
    # (typically C:\Users\YourUsername\.wslconfig)
    
    # Add these lines to the file
    [wsl2]
    memory=16GB  # Minimum 16GB recommended for xformers compilation
    processors=4  # Adjust based on your CPU cores
    swap=2GB
    localhostForwarding=true
    wsl --shutdown
    # Set CUDA architecture for Blackwell GPUs
    export TORCH_CUDA_ARCH_LIST="12.0"
    
    # Install xformers from source with optimized build flags
    pip install -v --no-build-isolation -U git+https://github.com/facebookresearch/xformers.git@main#egg=xformers
    pip install unsloth
    python -m venv unsloth
    source unsloth/bin/activate
    pip install unsloth
    # First uninstall xformers installed by previous libraries
    pip uninstall xformers -y
    
    # Clone and build
    pip install ninja
    export TORCH_CUDA_ARCH_LIST="12.0"
    git clone --depth=1 https://github.com/facebookresearch/xformers --recursive
    cd xformers && python setup.py install && cd ..
    uv pip install unsloth
    git clone https://github.com/unslothai/unsloth.git
    cd unsloth
    pip install .
    $ python unsloth-cli.py --help
    usage: unsloth-cli.py [-h] [--model_name MODEL_NAME] [--max_seq_length MAX_SEQ_LENGTH] [--dtype DTYPE]
                          [--load_in_4bit] [--dataset DATASET] [--r R] [--lora_alpha LORA_ALPHA]
                          [--lora_dropout LORA_DROPOUT] [--bias BIAS]
                          [--use_gradient_checkpointing USE_GRADIENT_CHECKPOINTING]
    …
    
    🦥 Fine-tune your llm faster using unsloth!
    
    options:
      -h, --help            show this help message and exit
    
    🤖 Model Options:
      --model_name MODEL_NAME
                            Model name to load
      --max_seq_length MAX_SEQ_LENGTH
                            Maximum sequence length, default is 2048. We auto support RoPE Scaling
                            internally!
    …
    
    🧠 LoRA Options:
      These options are used to configure the LoRA model.
    
      --r R                 Rank for Lora model, default is 16. (common values: 8, 16, 32, 64, 128)
      --lora_alpha LORA_ALPHA
                            LoRA alpha parameter, default is 16. (common values: 8, 16, 32, 64, 128)
    …
    
    🎓 Training Options:
      --per_device_train_batch_size PER_DEVICE_TRAIN_BATCH_SIZE
                            Batch size per device during training, default is 2.
      --per_device_eval_batch_size PER_DEVICE_EVAL_BATCH_SIZE
                            Batch size per device during evaluation, default is 4.
      --gradient_accumulation_steps GRADIENT_ACCUMULATION_STEPS
                            Number of gradient accumulation steps, default is 4.
    …
    $ nvidia-smi
    Mon Nov 24 12:53:00 2025       
    +-----------------------------------------------------------------------------------------+
    | NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 13.0     |
    +-----------------------------------------+------------------------+----------------------+
    | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
    |                                         |                        |               MIG M. |
    |=========================================+========================+======================|
    |   0  NVIDIA H100 80GB HBM3          On  |   00000000:04:00.0 Off |                    0 |
    | N/A   32C    P0             69W /  700W |       0MiB /  81559MiB |      0%      Default |
    |                                         |                        |             Disabled |
    +-----------------------------------------+------------------------+----------------------+
    |   1  NVIDIA H100 80GB HBM3          On  |   00000000:05:00.0 Off |                    0 |
    | N/A   30C    P0             68W /  700W |       0MiB /  81559MiB |      0%      Default |
    |                                         |                        |             Disabled |
    +-----------------------------------------+------------------------+----------------------+
    
    +-----------------------------------------------------------------------------------------+
    | Processes:                                                                              |
    |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
    |        ID   ID                                                               Usage      |
    |=========================================================================================|
    |  No running processes found                                                             |
    +-----------------------------------------------------------------------------------------+
    # required:
    #   --model_name
    #   --dataset
    # optional; experiment with these:
    #   --learning_rate, --max_seq_length, --per_device_train_batch_size, --gradient_accumulation_steps, --max_steps
    # to save the model at the end of training:
    #   --save_model
    
    torchrun --nproc_per_node=2 unsloth-cli.py \
      --model_name=Qwen/Qwen3-8B \
      --dataset=yahma/alpaca-cleaned \
      --learning_rate=2e-5 \
      --max_seq_length=2048 \
      --per_device_train_batch_size=1 \
      --gradient_accumulation_steps=4 \
      --max_steps=1000 \
      --save_model
    $ nvidia-smi
    Mon Nov 24 12:58:42 2025
    +-----------------------------------------------------------------------------------------+
    | NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 13.0     |
    +-----------------------------------------+------------------------+----------------------+
    | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
    |                                         |                        |               MIG M. |
    |=========================================+========================+======================|
    |   0  NVIDIA H100 80GB HBM3          On  |   00000000:04:00.0 Off |                    0 |
    | N/A   38C    P0            193W /  700W |   18903MiB /  81559MiB |     25%      Default |
    |                                         |                        |             Disabled |
    +-----------------------------------------+------------------------+----------------------+
    |   1  NVIDIA H100 80GB HBM3          On  |   00000000:05:00.0 Off |                    0 |
    | N/A   37C    P0            199W /  700W |   18905MiB /  81559MiB |     28%      Default |
    |                                         |                        |             Disabled |
    +-----------------------------------------+------------------------+----------------------+
    
    +-----------------------------------------------------------------------------------------+
    | Processes:                                                                              |
    |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
    |        ID   ID                                                               Usage      |
    |=========================================================================================|
    |    0   N/A  N/A            4935      C   ...und/unsloth/.venv/bin/python3      18256MiB |
    |    0   N/A  N/A            4936      C   ...und/unsloth/.venv/bin/python3        630MiB |
    |    1   N/A  N/A            4935      C   ...und/unsloth/.venv/bin/python3        630MiB |
    |    1   N/A  N/A            4936      C   ...und/unsloth/.venv/bin/python3      18258MiB |
    +-----------------------------------------------------------------------------------------+
    # !pip install huggingface_hub hf_transfer
    import os
    os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
    from huggingface_hub import snapshot_download
    snapshot_download(
        repo_id = "unsloth/Ministral-3-14B-Instruct-2512-GGUF",
        local_dir = "Ministral-3-14B-Instruct-2512-GGUF",
        allow_patterns = ["*UD-Q4_K_XL*"],
    )
    # !pip install huggingface_hub hf_transfer
    import os
    os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
    from huggingface_hub import snapshot_download
    snapshot_download(
        repo_id = "unsloth/Ministral-3-14B-Reasoning-2512-GGUF",
        local_dir = "Ministral-3-14B-Reasoning-2512-GGUF",
        allow_patterns = ["*UD-Q4_K_XL*"],
    )
    model = FastVisionModel.get_peft_model(
        model,
        finetune_vision_layers     = True, # False if not finetuning vision layers
        finetune_language_layers   = True, # False if not finetuning language layers
        finetune_attention_modules = True, # False if not finetuning attention layers
        finetune_mlp_modules       = True, # False if not finetuning MLP layers
    
        r = 16,                           # The larger, the higher the accuracy, but might overfit
        lora_alpha = 16,                  # Recommended alpha == r at least
        lora_dropout = 0,
        bias = "none",
        random_state = 3407,
        use_rslora = False,               # We support rank stabilized LoRA
        loftq_config = None,               # And LoftQ
        target_modules = "all-linear",    # Optional now! Can specify a list if needed
        modules_to_save=[
            "lm_head",
            "embed_tokens",
        ],
    )
    Dataset({
        features: ['image', 'image_id', 'caption', 'cui'],
        num_rows: 1978
    })
    [
    { "role": "user",
      "content": [{"type": "text",  "text": instruction}, {"type": "image", "image": image} ]
    },
    { "role": "assistant",
      "content": [{"type": "text",  "text": answer} ]
    },
    ]
    instruction = "You are an expert radiographer. Describe accurately what you see in this image."
    
    def convert_to_conversation(sample):
        conversation = [
            { "role": "user",
              "content" : [
                {"type" : "text",  "text"  : instruction},
                {"type" : "image", "image" : sample["image"]} ]
            },
            { "role" : "assistant",
              "content" : [
                {"type" : "text",  "text"  : sample["caption"]} ]
            },
        ]
        return { "messages" : conversation }
    pass
    converted_dataset = [convert_to_conversation(sample) for sample in dataset]
    converted_dataset[0]
    {'messages': [{'role': 'user',
       'content': [{'type': 'text',
         'text': 'You are an expert radiographer. Describe accurately what you see in this image.'},
        {'type': 'image',
         'image': <PIL.PngImagePlugin.PngImageFile image mode=L size=657x442>}]},
      {'role': 'assistant',
       'content': [{'type': 'text',
         'text': 'Panoramic radiography shows an osteolytic lesion in the right posterior maxilla with resorption of the floor of the maxillary sinus (arrows).'}]}]}
    FastVisionModel.for_inference(model) # Enable for inference!
    
    image = dataset[0]["image"]
    instruction = "You are an expert radiographer. Describe accurately what you see in this image."
    
    messages = [
        {"role": "user", "content": [
            {"type": "image"},
            {"type": "text", "text": instruction}
        ]}
    ]
    input_text = tokenizer.apply_chat_template(messages, add_generation_prompt = True)
    inputs = tokenizer(
        image,
        input_text,
        add_special_tokens = False,
        return_tensors = "pt",
    ).to("cuda")
    
    from transformers import TextStreamer
    text_streamer = TextStreamer(tokenizer, skip_prompt = True)
    _ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128,
                       use_cache = True, temperature = 1.5, min_p = 0.1)
    This radiograph appears to be a panoramic view of the upper and lower dentition, specifically an Orthopantomogram (OPG).
    
    * The panoramic radiograph demonstrates normal dental structures.
    * There is an abnormal area on the upper right, represented by an area of radiolucent bone, corresponding to the antrum.
    
    **Key Observations**
    
    * The bone between the left upper teeth is relatively radiopaque.
    * There are two large arrows above the image, suggesting the need for a closer examination of this area. One of the arrows is in a left-sided position, and the other is in the right-sided position. However, only
    ds_converted = ds.map(
        convert_to_conversation,
    )
ds_converted = [convert_to_conversation(sample) for sample in dataset]
    UD-Q2_K_XL

    H-Small

  • H-Tiny

  • H-Micro

  • Micro

  • H-Tiny FP8

    H-Small

  • H-Tiny

  • H-Micro

  • Micro

  • H-350M
    350M
    H-1B
    1B
    H-Micro
    Micro
    H-Small FP8
    H-350M
    350M
    H-1B
    1B
    This uses the least VRAM.

    You can also customize the regex, for example -ot "\.(6|7|8|9|[0-9][0-9]|[0-9][0-9][0-9])\.ffn_(gate|up|down)_exps.=CPU" means to offload gate, up and down MoE layers but only from the 6th layer onwards.

    2.5/2.06/1.56

| MoE Bits | Type | Disk Size | Details |
|---|---|---|---|
| 2.42bit | IQ2_XXS | 99.3 GB | 2.5/2.06bit |
| 2.71bit | Q2_K_XL | 112 GB | 3.5/2.5bit |
| 3.12bit | IQ3_XXS | 117 GB | 3.5/2.06bit |
| 3.5bit | Q3_K_XL | 126 GB | 4.5/3.5bit |
| 4.5bit | Q4_K_XL | 155 GB | 5.5/4.5bit |
| 5.5bit | Q5_K_XL | 191 GB | 6.5/5.5bit |

    TQ1_0
    IQ1_S
    IQ1_M
    (per PPO-Lite).
  • False or "none": no scaling. Dr. GRPO recommends not scaling to avoid difficulty bias from std scaling.

  • "dapo" (default): normalizes by active tokens in the global accumulated batch (introduced in DAPO; removes length bias).

  • "bnpo": normalizes by active tokens in the local batch only (results can vary with local batch size; equals GRPO when per_device_train_batch_size == 1).

  • min_p (float, optional): Minimum token probability, which will be scaled by the probability of the most likely token. It must be a value between 0.0 and 1.0. Typical values are in the 0.01-0.2 range.

  • repetition_penalty (float, optional, defaults to 1.0): Float that penalizes new tokens based on whether they appear in the prompt and the generated text so far. Values > 1.0 encourage the model to use new tokens, while values < 1.0 encourage the model to repeat tokens.

  • steps_per_generation: (int, optional): Number of steps per generation. If None, it defaults to gradient_accumulation_steps. Mutually exclusive with generation_batch_size.

  • num_processes: Number of distributed training processes (e.g., GPUs / workers).

  • gradient_accumulation_steps (aka gradient_accumulation): Number of microbatches to accumulate before applying backpropagation and optimizer update.

  • Effective batch size:

    Total samples contributing to gradients before an update (across all processes and steps).

  • Optimizer steps per generation:

    Example: 4 / 2 = 2.

  • num_generations: Number of generations produced per prompt (applied after computing effective_batch_size). The number of unique prompts in a generation cycle is:

    Must be > 2 for GRPO to work.

  • 0

    [0,0,0]

    1

    [1,1,1]

    → optimizer update (accum = 2 reached)

    2

    [2,2,2]

    3

    [3,3,3]

    optimizer update

    0

    [4,4,4]

    1

    [5,5,5]

    → optimizer update (accum = 2 reached)

    2

    [6,6,6]

    3

    [7,7,7]

    optimizer update

    0

    [0,0,0]

    1

    [1,1,1]

    2

    [2,2,2]

    3

    [3,3,3]

    optimizer update (accum = 4 reached)

    0

    [4,4,4]

    1

    [5,5,5]

    2

    [6,6,6]

    3

    [7,7,7]

    optimizer update (accum = 4 reached)

    0

    [0,0,0]

    1

    [0,1,1]

    2

    [1,1,3]

    3

    [3,3,3]

    optimizer update (accum = 4 reached)

    0

    [4,4,4]

    1

    [4,5,5]

    2

    [5,5,6]

    3

    [6,6,6]

    optimizer update (accum = 4 reached)

    0

    [0,0,0, 1,1,1]

    1

    [2,2,2, 3,3,3]

    optimizer update (accum = 2 reached)

    0

    [4,4,4, 5,5,5]

    1

    [6,6,6, 7,7,7]

    optimizer update (accum = 2 reached)

    See
    FP16 vs BF16 for RL
    This function loops through your dataset applying the chat template you defined to each sample.
  • Finally, let's load the dataset and apply the required modifications to our dataset:

    If your dataset uses the ShareGPT format with "from"/"value" keys instead of the ChatML "role"/"content" format, you can use the standardize_sharegpt function to convert it first. The revised code will now look as follows:

  • ChatML
    Ollama
    Text Classification
    Multiple Datasets
    https://colab.research.google.com/drive/1VYkncZMfGFkeCEgN2IzbZIKEDkyQuJAS?usp=sharing
    export LLAMA_CACHE="unsloth/grok-2-GGUF"
    ./llama.cpp/llama-cli \
        -hf unsloth/grok-2-GGUF:Q3_K_XL \
        --jinja \
        --n-gpu-layers 99 \
        --temp 1.0 \
        --top-p 0.95 \
        --min-p 0.01 \
        --ctx-size 16384 \
        --seed 3407 \
        -ot ".ffn_.*_exps.=CPU"
    # !pip install huggingface_hub hf_transfer
    import os
    os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "0" # Can sometimes rate limit, so set to 0 to disable
    from huggingface_hub import snapshot_download
    snapshot_download(
        repo_id = "unsloth/grok-2-GGUF",
        local_dir = "unsloth/grok-2-GGUF",
        allow_patterns = ["*UD-Q3_K_XL*"], # Dynamic 3bit
    )
    ./llama.cpp/llama-cli \
        --model unsloth/grok-2-GGUF/UD-Q3_K_XL/grok-2-UD-Q3_K_XL-00001-of-00003.gguf \
        --jinja \
        --threads -1 \
        --n-gpu-layers 99 \
        --temp 1.0 \
        --top_p 0.95 \
        --min_p 0.01 \
        --ctx-size 16384 \
        --seed 3407 \
        -ot ".ffn_.*_exps.=CPU"
    uv pip install -U vllm --torch-backend=cu128
    uv pip install unsloth unsloth_zoo bitsandbytes
    uv pip install -qqq \
    "unsloth_zoo[base] @ git+https://github.com/unslothai/unsloth-zoo" \
    "unsloth[base] @ git+https://github.com/unslothai/unsloth"
    # First uninstall xformers installed by previous libraries
    pip uninstall xformers -y
    
    # Clone and build
    pip install ninja
    export TORCH_CUDA_ARCH_LIST="12.0"
    git clone --depth=1 https://github.com/facebookresearch/xformers --recursive
    cd xformers && python setup.py install && cd ..
    uv pip install -U transformers
    pip install unsloth unsloth_zoo bitsandbytes
    # First uninstall xformers installed by previous libraries
    pip uninstall xformers -y
    
    # Clone and build
    pip install ninja
    export TORCH_CUDA_ARCH_LIST="12.0"
    git clone --depth=1 https://github.com/facebookresearch/xformers --recursive
    cd xformers && python setup.py install && cd ..
    pip install -U triton>=3.3.1
    uv pip install -U transformers
    effective_batch_size = steps_per_generation * num_processes * train_batch_size
    optimizer_steps_per_generation = steps_per_generation / gradient_accumulation_steps
    unique_prompts = effective_batch_size / num_generations
    # If mask_truncated_completions is enabled, zero out truncated completions in completion_mask
    if self.mask_truncated_completions:
        truncated_completions = ~is_eos.any(dim=1)
        completion_mask = completion_mask * (~truncated_completions).unsqueeze(1).int()
    num_gpus = 1
    per_device_train_batch_size = 3
    gradient_accumulation_steps = 2
    steps_per_generation = 4
    
    effective_batch_size = 4 * 3 * 1 = 12
    num_generations = 3
    num_gpus = 1
    per_device_train_batch_size = 3
    steps_per_generation = gradient_accumulation_steps = 4
    
    effective_batch_size = 4 * 3 * 1 = 12
    num_generations = 3
    num_gpus = 1
    per_device_train_batch_size = 3
    steps_per_generation = gradient_accumulation_steps = 4
    
    effective_batch_size = 4 * 3 * 1 = 12
    num_generations = 4
    unique_prompts = effective_batch_size / num_generations = 3
    num_gpus = 1
    per_device_train_batch_size = 6
    steps_per_generation = gradient_accumulation_steps = 2
    
    effective_batch_size = 2 * 6 * 1 = 12
    num_generations = 3
    unique_prompts = 4
    effective_batch_size = steps_per_generation * num_processes * train_batch_size
    optimizer_steps_per_generation = steps_per_generation / gradient_accumulation_steps
    unique_prompts = effective_batch_size / num_generations   # must be > 2
    # Import and load dataset
    from datasets import load_dataset
    dataset = load_dataset("repo_name/dataset_name", split = "train")
    
    # Apply the formatting function to your dataset using the map method
    dataset = dataset.map(formatting_prompts_func, batched = True,)
    # Import dataset
    from datasets import load_dataset
    dataset = load_dataset("mlabonne/FineTome-100k", split = "train")
    
    # Convert your dataset to the "role"/"content" format if necessary
    from unsloth.chat_templates import standardize_sharegpt
    dataset = standardize_sharegpt(dataset)
    
    # Apply the formatting function to your dataset using the map method
    dataset = dataset.map(formatting_prompts_func, batched = True,)
    from unsloth.chat_templates import CHAT_TEMPLATES
    print(list(CHAT_TEMPLATES.keys()))
    ['unsloth', 'zephyr', 'chatml', 'mistral', 'llama', 'vicuna', 'vicuna_old', 'vicuna old', 'alpaca', 'gemma', 'gemma_chatml', 'gemma2', 'gemma2_chatml', 'llama-3', 'llama3', 'phi-3', 'phi-35', 'phi-3.5', 'llama-3.1', 'llama-31', 'llama-3.2', 'llama-3.3', 'llama-32', 'llama-33', 'qwen-2.5', 'qwen-25', 'qwen25', 'qwen2.5', 'phi-4', 'gemma-3', 'gemma3']
    from unsloth.chat_templates import get_chat_template
    
    tokenizer = get_chat_template(
        tokenizer,
        chat_template = "gemma-3", # change this to the right chat_template name
    )
    def formatting_prompts_func(examples):
       convos = examples["conversations"]
       texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False) for convo in convos]
       return { "text" : texts, }
    [
        [{'from': 'human', 'value': 'Hi there!'},
         {'from': 'gpt', 'value': 'Hi how can I help?'},
         {'from': 'human', 'value': 'What is 2+2?'}],
    [{'from': 'human', 'value': "What's your name?"},
     {'from': 'gpt', 'value': "I'm Daniel!"},
         {'from': 'human', 'value': 'Ok! Nice!'},
         {'from': 'gpt', 'value': 'What can I do for you?'},
         {'from': 'human', 'value': 'Oh nothing :)'},],
    ]
    from unsloth.chat_templates import get_chat_template
    
    tokenizer = get_chat_template(
        tokenizer,
        chat_template = "chatml", # Supports zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, unsloth
        mapping = {"role" : "from", "content" : "value", "user" : "human", "assistant" : "gpt"}, # ShareGPT style
        map_eos_token = True, # Maps <|im_end|> to </s> instead
    )
    
    def formatting_prompts_func(examples):
        convos = examples["conversations"]
        texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False) for convo in convos]
        return { "text" : texts, }
    pass
    
    from datasets import load_dataset
    dataset = load_dataset("philschmid/guanaco-sharegpt-style", split = "train")
    dataset = dataset.map(formatting_prompts_func, batched = True,)
unsloth_template = \
    "{{ bos_token }}"\
    "{{ 'You are a helpful assistant to the user\n' }}"\
    "{% for message in messages %}"\
        "{% if message['role'] == 'user' %}"\
            "{{ '>>> User: ' + message['content'] + '\n' }}"\
        "{% elif message['role'] == 'assistant' %}"\
            "{{ '>>> Assistant: ' + message['content'] + eos_token + '\n' }}"\
        "{% endif %}"\
    "{% endfor %}"\
    "{% if add_generation_prompt %}"\
        "{{ '>>> Assistant: ' }}"\
    "{% endif %}"
    unsloth_eos_token = "eos_token"
    
    tokenizer = get_chat_template(
        tokenizer,
        chat_template = (unsloth_template, unsloth_eos_token,), # You must provide a template and EOS token
        mapping = {"role" : "from", "content" : "value", "user" : "human", "assistant" : "gpt"}, # ShareGPT style
        map_eos_token = True, # Maps <|im_end|> to </s> instead
    )

    If the answer exactly matches the ideal response → +1

  • If the response is too long → -1

  • If the recipient's name is included → +1

  • If a signature block (phone, email, address) is present → +1
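A sketch of these rules as a single reward function (the length cutoff, recipient-name column, and signature keywords are placeholders; TRL passes extra dataset columns such as ideal_response and recipient_name to reward functions via keyword arguments):

def email_reward(prompts, completions, ideal_response, recipient_name, **kwargs):
    """Illustrative scoring of generated emails using the rules above."""
    scores = []
    for completion, ideal, name in zip(completions, ideal_response, recipient_name):
        score = 0.0
        if completion.strip() == ideal.strip():
            score += 1.0                                   # exact match with the ideal response
        if len(completion.split()) > 300:                  # "too long" cutoff is an assumption
            score -= 1.0
        if name.lower() in completion.lower():
            score += 1.0                                   # recipient's name is included
        if any(k in completion.lower() for k in ("phone", "email", "kind regards")):
            score += 1.0                                   # crude signature-block check
        scores.append(score)
    return scores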

  • read our guide!
    Advanced GRPO notebooks
    Load gpt-oss with Unsloth

    Load the 20B model in 4‑bit QLoRA for memory efficiency, then wrap it with a LoRA adapter. You can also train it in 16-bit LoRA but it will use 4x more memory. For more settings view our configuration guide.

    If you hit OOM, try lowering max_seq_length, lora_rank, or num_generations (later), and keep load_in_4bit=True.
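A sketch of the loading step (the model name and LoRA hyperparameters here are illustrative; the notebook's exact values may differ):

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/gpt-oss-20b",   # assumed Unsloth upload; see the notebook for the exact id
    max_seq_length = 1024,
    load_in_4bit = True,                  # 4-bit QLoRA so the 20B model fits in modest VRAM
)

model = FastLanguageModel.get_peft_model(
    model,
    r = 4,                                # LoRA rank: higher = more capacity, more memory
    lora_alpha = 8,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing = "unsloth",   # helps with long completions in RL
    random_state = 3407,
)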

    3

    2048 game environment (minimal)

    • A GameBoard class supporting W/A/S/D moves

    • Merge/score logic

    • execute_with_time_limit wrapper so poorly written strategies can’t hang the kernel

    You can quickly smoke‑test with a trivial policy:

    4

    Safe code execution & anti‑cheat checks

    Generated strategies are Python functions. To keep execution safe and prevent reward hacking:

    • Module whitelist check — only allow Python stdlib symbols:

    • Block disallowed imports (e.g., NumPy):

    • Lock down execution to a sandboxed function:

    • Enforce a hard wall‑clock limit on strategy runs:

    5

Prompt & dataset

    We prompt the model to emit a short strategy function inside triple backticks:

    All helper functions should be inside def strategy. Only output the short function strategy.

    {% hint style="info" %} You can replace this dataset with real prompts for your own RL task. {% endhint %} {% endstep %}

    {% step %}

    Reward function time!

    1. Extract the code block from the model’s reply:

    2. function_works - Does it compile & create a callable?

    3. no_cheating - No non‑stdlib imports allowed:

    {% endstep %}

    {% step %}

    Configure GRPO

    We will use the GRPOTrainer. Set the prompt/completion lengths, then build a GRPOConfig. Keep in mind you could also set the RL algorithm type to others such as or Dr. GRPO.
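A minimal wiring sketch (all values are illustrative; strategy_succeeds stands in for your game-success reward, alongside the function_works and no_cheating rewards from earlier):

from trl import GRPOConfig, GRPOTrainer

training_args = GRPOConfig(
    learning_rate = 5e-5,
    per_device_train_batch_size = 1,
    gradient_accumulation_steps = 4,
    num_generations = 4,
    max_prompt_length = 512,
    max_completion_length = 1024,
    max_steps = 100,
    output_dir = "outputs",
)

trainer = GRPOTrainer(
    model = model,
    processing_class = tokenizer,
    reward_funcs = [function_works, no_cheating, strategy_succeeds],
    args = training_args,
    train_dataset = dataset,
)
trainer.train()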

    {% hint style="info" %} Reading logs: Look at reward and reward_std. It’s normal to see low/zero rewards early (first ~100–200 steps on small GPUs). {% endhint %} {% endstep %}

    {% step %}

    Train your model

    This launches the full RL loop: sample completions → score with your rewards → optimize the policy (LoRA). {% endstep %}

    {% step %}

    Inference (after training)

    Generate a fresh strategy with the trained adapter:

    {% endstep %}

    {% step %}

Save / Export your fine-tuned model

    • Merge & save 4‑bit (MXFP4)

model.save_pretrained_merged("finetuned_model", tokenizer, save_method="mxfp4")
# or push to the Hub
model.push_to_hub_merged("<org_or_user>/", tokenizer, token="<hf_token>", save_method="mxfp4")

    • Merge & save 16‑bit
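For example (directory and repo names are placeholders):

```python
model.save_pretrained_merged("finetuned_model_16bit", tokenizer, save_method = "merged_16bit")
# or push to the Hub
model.push_to_hub_merged("<org_or_user>/<model_name>", tokenizer, token = "<hf_token>", save_method = "merged_16bit")
```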

    6

    Troubleshooting & tips

    • OOM / slow: reduce max_seq_length, num_generations, lora_rank; keep 4‑bit; try A100 if available.

    • No reward improvement: increase training steps, soften penalties, or add curriculum (start with smaller boards / lower targets).

    • Reward hacking: keep check_python_modules strict; validate strategy behavior across multiple random seeds.

    • Unstable training: raise gradient_accumulation_steps to smooth updates; lower learning_rate (e.g., 2e‑5).

    • Long hangs: ensure execute_with_time_limit wraps any strategy execution.

    7

    Adapt to your own RL task

    • Replace the 2048 env with your own environment and three rewards: (a) syntax/compilation, (b) anti‑cheat/safety, (c) task success.

    • Update the prompt to request the kind of function or output you need.

    • Keep the same Unsloth + GRPO scaffolding; only swap the env and rewards.

    2048 notebook
    Kernel generation notebook
    Repetition Penalty of 1.0. (1.0 means disabled in llama.cpp and transformers)
  • Chat template:

  • Chat template with \n newlines rendered (except for the last)

    • 270M - new

    • 1B

    • 4B

    • 12B

    • 270M

    • 1B

    • 4B

    • 12B

    • 270M

    • 1B

    • 4B

    • 12B

    free Colab notebook
    open-source project AnythingLLM's
    Android here
    ChatterUI
    GitHub here
    https://huggingface.co/unsloth/gemma-3-27b-it-GGUF
    https://unsloth.ai/blog/deepseekr1-dynamic
    Gemma 3 (270M) notebook
    Text
    Vision
    Gemma 3n (E4B)
    Text
    Vision
    Audio
    Gemma 3 1B to 27B exceed float16's maximum of 65504
    Wikipedia https://en.wikipedia.org/wiki/Bfloat16_floating-point_format

    All our GGUF models are quantized using calibration data (around 250K tokens for Scout and 1M tokens for Maverick), which will improve accuracy over standard quantization. Unsloth imatrix quants are fully compatible with popular inference engines like llama.cpp & Open WebUI etc.

    Scout - Unsloth Dynamic GGUFs with optimal configs:

| MoE Bits | Type | Disk Size | Link | Details |
| --- | --- | --- | --- | --- |
| 1.78bit | IQ1_S | 33.8GB | | 2.06/1.56bit |
| 1.93bit | IQ1_M | 35.4GB | | 2.5/2.06/1.56bit |

    For best results, use the 2.42-bit (IQ2_XXS) or larger versions.

    Maverick - Unsloth Dynamic GGUFs with optimal configs:

| MoE Bits | Type | Disk Size | HF Link |
| --- | --- | --- | --- |
| 1.78bit | IQ1_S | 122GB | |
| 1.93bit | IQ1_M | 128GB | |
| 2.42bit | IQ2_XXS | 140GB | |

    ⚙️ Official Recommended Settings

    According to Meta, these are the recommended settings for inference:

    • Temperature of 0.6

    • Min_P of 0.01 (optional, but 0.01 works well, llama.cpp default is 0.1)

    • Top_P of 0.9

    • Chat template/prompt format:

    • A BOS token of <|begin_of_text|> is auto added during tokenization (do NOT add it manually!)

• According to https://www.llama.com/docs/model-cards-and-prompt-formats/llama4_omni/, there is a suggested optional system prompt, which is listed below:

    📖 Tutorial: How to Run Llama-4-Scout in llama.cpp

    1. Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.

    1. Download the model via (after installing pip install huggingface_hub hf_transfer ). You can choose Q4_K_M, or other quantized versions (like BF16 full precision). More versions at: https://huggingface.co/unsloth/Llama-4-Scout-17B-16E-Instruct-GGUF

    1. Run the model and try any prompt.

2. Edit --threads 32 for the number of CPU threads, --ctx-size 16384 for context length (Llama 4 supports 10M context length!), and --n-gpu-layers 99 to control how many layers are offloaded to the GPU. Lower it if your GPU runs out of memory, and remove it entirely for CPU-only inference.

    Use -ot ".ffn_.*_exps.=CPU" to offload all MoE layers to the CPU! This effectively allows you to fit all non MoE layers on 1 GPU, improving generation speeds. You can customize the regex expression to fit more layers if you have more GPU capacity.

In terms of testing, unfortunately we could not get even the full BF16 version (i.e. regardless of quantization) to complete the Flappy Bird game or the Heptagon test appropriately. We tried many inference providers, quants with and without imatrix, other people's quants, and normal Hugging Face inference, and the issue persists.

We found that multiple runs, along with asking the model to find and fix bugs, resolve most issues!

    For Llama 4 Maverick - it's best to have 2 RTX 4090s (2 x 24GB)

    🕵️ Interesting Insights and Issues

    During quantization of Llama 4 Maverick (the large model), we found the 1st, 3rd and 45th MoE layers could not be calibrated correctly. Maverick uses interleaving MoE layers for every odd layer, so Dense->MoE->Dense and so on.

    We tried adding more uncommon languages to our calibration dataset, and tried using more tokens (1 million) vs Scout's 250K tokens for calibration, but we still found issues. We decided to leave these MoE layers as 3bit and 4bit.

    For Llama 4 Scout, we found we should not quantize the vision layers, and leave the MoE router and some other layers as unquantized - we upload these to https://huggingface.co/unsloth/Llama-4-Scout-17B-16E-Instruct-unsloth-dynamic-bnb-4bit

    We also had to convert torch.nn.Parameter to torch.nn.Linear for the MoE layers to allow 4bit quantization to occur. This also means we had to rewrite and patch over the generic Hugging Face implementation. We upload our quantized versions to https://huggingface.co/unsloth/Llama-4-Scout-17B-16E-Instruct-unsloth-bnb-4bit and https://huggingface.co/unsloth/Llama-4-Scout-17B-16E-Instruct-unsloth-bnb-8bit for 8bit.

    Llama 4 also now uses chunked attention - it's essentially sliding window attention, but slightly more efficient by not attending to previous tokens over the 8192 boundary.

    We also worked with Mistral behind the scenes to help debug, test and correct any possible bugs and issues! Make sure to download Mistral's official downloads or Unsloth's GGUFs / dynamic quants to get the correct implementation (ie correct system prompt, correct chat template etc)

    Please use --jinja in llama.cpp to enable the system prompt!

All Devstral uploads use our Unsloth Dynamic 2.0 methodology, delivering the best performance on 5-shot MMLU and KL Divergence benchmarks. This means you can run and fine-tune quantized Mistral LLMs with minimal accuracy loss!

    Devstral - Unsloth Dynamic quants:

    Devstral 2507 (new)
    Devstral 2505

    GGUF:

    4-bit BnB:

    🖥️ Running Devstral

    ⚙️ Official Recommended Settings

    According to Mistral AI, these are the recommended settings for inference:

    • Temperature from 0.0 to 0.15

    • Min_P of 0.01 (optional, but 0.01 works well, llama.cpp default is 0.1)

    • Use --jinja to enable the system prompt.

A system prompt is recommended, and it is a derivative of the OpenHands system prompt. The full system prompt is provided here.

Our dynamic uploads have the 'UD' prefix in them. Those without it are not dynamic; however, they still utilize our calibration dataset.

    🦙 Tutorial: How to Run Devstral in Ollama

    1. Install ollama if you haven't already!

1. Run the model with our dynamic quant. Note you can call ollama serve & in another terminal if it fails! We include all suggested parameters (temperature etc.) in params in our Hugging Face upload!

    2. Also Devstral supports 128K context lengths, so best to enable KV cache quantization. We use 8bit quantization which saves 50% memory usage. You can also try "q4_0"

    📖 Tutorial: How to Run Devstral in llama.cpp

    1. Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.

    1. If you want to use llama.cpp directly to load models, you can do the below: (:Q4_K_XL) is the quantization type. You can also download via Hugging Face (point 3). This is similar to ollama run

    1. OR download the model via (after installing pip install huggingface_hub hf_transfer ). You can choose Q4_K_M, or other quantized versions (like BF16 full precision).

    1. Run the model.

2. Edit --threads -1 for the maximum number of CPU threads, --ctx-size 131072 for context length (Devstral supports 128K context length!), and --n-gpu-layers 99 to control how many layers are offloaded to the GPU. Lower it if your GPU runs out of memory, and remove it entirely for CPU-only inference. We also use 8-bit quantization for the K cache to reduce memory usage.

    3. For conversation mode:

    1. For non conversation mode to test our Flappy Bird prompt:

    Remember to remove <bos> since Devstral auto adds a <bos>! Also please use --jinja to enable the system prompt!

    👀Experimental Vision Support

Xuan-Son from Hugging Face showed in their GGUF repo how it is actually possible to "graft" the vision encoder from Mistral 3.1 Instruct onto Devstral 2507. We also uploaded our mmproj files, which allow you to use the following:

    For example:

    Instruction and output code
    Rendered code

    🦥 Fine-tuning Devstral with Unsloth

Just like standard Mistral models including Mistral Small 3.1, Unsloth supports Devstral fine-tuning. Training is 2x faster, uses 70% less VRAM and supports 8x longer context lengths. Devstral fits comfortably in a 24GB VRAM L4 GPU.

    Unfortunately, Devstral slightly exceeds the memory limits of a 16GB VRAM, so fine-tuning it for free on Google Colab isn't possible for now. However, you can fine-tune the model for free using our Kaggle notebook, which offers access to dual GPUs. Just change the notebook's Magistral model name to the Devstral model.

    If you have an old version of Unsloth and/or are fine-tuning locally, install the latest version of Unsloth:

    Mistral-Small-3.1
    SWE-bench verified
    optional Vision support
    Running in Unsloth works well, but after exporting & running on other platforms, the results are poor

    You might sometimes encounter an issue where your model runs and produces good results on Unsloth, but when you use it on another platform like Ollama or vLLM, the results are poor or you might get gibberish, endless/infinite generations or repeated outputs.

    • The most common cause of this error is using an incorrect chat template. It’s essential to use the SAME chat template that was used when training the model in Unsloth and later when you run it in another framework, such as llama.cpp or Ollama. When inferencing from a saved model, it's crucial to apply the correct template.

• It might also be because your inference engine adds an unnecessary "start of sequence" token (or, conversely, fails to add a required one), so ensure you check both hypotheses!

    • Use our conversational notebooks to force the chat template - this will fix most issues.

      • Qwen-3 14B Conversational notebook

      • Gemma-3 4B Conversational notebook

      • Llama-3.2 3B Conversational notebook

      • Phi-4 14B Conversational notebook

      • Mistral v0.3 7B Conversational notebook

      • More notebooks in our

    Saving to GGUF / vLLM 16bit crashes

    You can try reducing the maximum GPU usage during saving by changing maximum_memory_usage.

    The default is model.save_pretrained(..., maximum_memory_usage = 0.75). Reduce it to say 0.5 to use 50% of GPU peak memory or lower. This can reduce OOM crashes during saving.

    How do I manually save to GGUF?

    First save your model to 16bit via:

    Compile llama.cpp from source like below:

    Then, save the model to F16:

    ❓Why is Q8_K_XL slower than Q8_0 GGUF?

On Mac devices, it seems that BF16 might be slower than F16. Q8_K_XL upcasts some layers to BF16, hence the slowdown. We are actively changing our conversion process to make F16 the default choice for Q8_K_XL to reduce performance hits.

    ❓How to do Evaluation

    To set up evaluation in your training run, you first have to split your dataset into a training and test split. You should always shuffle the selection of the dataset, otherwise your evaluation is wrong!

Then, we can set the training arguments to enable evaluation. Keep in mind evaluation can be very slow, especially if you set eval_steps = 1, which means you are evaluating on every single step. If so, try reducing the eval_dataset size to around 100 rows.

    ❓Evaluation Loop - Out of Memory or crashing.

A common cause of OOM here is setting your evaluation batch size too high. Set it lower (e.g. 2 or below) to use less VRAM. Also use fp16_full_eval=True to use float16 for evaluation, which cuts memory usage in half.

    First split your training dataset into a train and test split. Set the trainer settings for evaluation to:

This will prevent OOMs and make evaluation somewhat faster. You can also use bf16_full_eval=True on bf16-capable machines. As of June 2025, Unsloth should set these flags on by default.

    ❓How do I do Early Stopping?

    If you want to stop the finetuning / training run since the evaluation loss is not decreasing, then you can use early stopping which stops the training process. Use EarlyStoppingCallback.

    As usual, set up your trainer and your evaluation dataset. The below is used to stop the training run if the eval_loss (the evaluation loss) is not decreasing after 3 steps or so.

    We then add the callback which can also be customized:
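For example, a minimal sketch that stops training if eval_loss has not improved for 3 evaluations (this also requires load_best_model_at_end = True and metric_for_best_model = "eval_loss" in your config; the values here are illustrative):

```python
from transformers import EarlyStoppingCallback

early_stopping = EarlyStoppingCallback(
    early_stopping_patience = 3,      # number of evaluations with no improvement
    early_stopping_threshold = 0.0,   # minimum change that counts as an improvement
)
trainer.add_callback(early_stopping)
```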

    Then train the model as usual via trainer.train() .

    ❓Downloading gets stuck at 90 to 95%

If your model download gets stuck at 90 to 95% for a long time, you can disable some fast downloading processes to force downloads to be synchronous and to print out more error messages.

    Simply use UNSLOTH_STABLE_DOWNLOADS=1 before any Unsloth import.
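For example, in Python:

```python
import os
os.environ["UNSLOTH_STABLE_DOWNLOADS"] = "1"   # must be set before importing unsloth

from unsloth import FastLanguageModel
```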

    ❓RuntimeError: CUDA error: device-side assert triggered

    Restart and run all, but place this at the start before any Unsloth import. Also please file a bug report asap thank you!

    ❓All labels in your dataset are -100. Training losses will be all 0.

    This means that your usage of train_on_responses_only is incorrect for that particular model. train_on_responses_only allows you to mask the user question, and train your model to output the assistant response with higher weighting. This is known to increase accuracy by 1% or more. See our LoRA Hyperparameters Guide for more details.

    For Llama 3.1, 3.2, 3.3 type models, please use the below:
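A sketch of the call (double-check the marker strings against your tokenizer's chat template):

```python
from unsloth.chat_templates import train_on_responses_only

trainer = train_on_responses_only(
    trainer,
    instruction_part = "<|start_header_id|>user<|end_header_id|>\n\n",
    response_part    = "<|start_header_id|>assistant<|end_header_id|>\n\n",
)
```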

For Gemma 2, 3 and 3n models, use the below:
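Again as a sketch, using the Gemma turn markers shown later on this page:

```python
from unsloth.chat_templates import train_on_responses_only

trainer = train_on_responses_only(
    trainer,
    instruction_part = "<start_of_turn>user\n",
    response_part    = "<start_of_turn>model\n",
)
```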

    ❓Some weights of Gemma3nForConditionalGeneration were not initialized from the model checkpoint

This is a critical error, since it means some weights were not parsed correctly, which will cause incorrect outputs. This can normally be fixed by upgrading Unsloth:

    pip install --upgrade --force-reinstall --no-cache-dir --no-deps unsloth unsloth_zoo

    Then upgrade transformers and timm:

    pip install --upgrade --force-reinstall --no-cache-dir --no-deps transformers timm

    However if the issue still persists, please file a bug report asap!

    ❓NotImplementedError: A UTF-8 locale is required. Got ANSI

    See https://github.com/googlecolab/colabtools/issues/3409

    In a new cell, run the below:

    📗Citing Unsloth

    If you are citing the usage of our model uploads, use the below Bibtex. This is for Qwen3-30B-A3B-GGUF Q8_K_XL:

    To cite the usage of our Github package or our work in general:

    Docker image

    Unsloth Notebooks

    Explore our catalog of Unsloth notebooks:

    Also see our GitHub repo for our notebooks: github.com/unslothai/notebooks

    GRPO (RL)Text-to-speechVisionUse-caseKaggle

    Colab notebooks

    Standard notebooks:

    • • •

    • - new

    • - new

    • • - new

    GRPO (Reasoning RL) notebooks:

    • (automatic kernels creation)

• (solving sudoku) - new

    • (L4) - new

    • (L4) - new

    Text-to-Speech (TTS) notebooks:

    • - Speech-to-Text (STT)

Speech-to-Text (STT) notebooks:

    • - Audio

    Vision (Multimodal) notebooks:

    • - new

    • - new

    • - new

    Large LLM notebooks:

    Notebooks for large models: These exceed Colab’s free 15 GB VRAM tier. With Colab’s new 80 GB GPUs, you can fine-tune 120B parameter models.

    Colab subscription or credits are required. We don't earn anything from these notebooks.

    • - new

    Other important notebooks:

• - new (solving sudoku)

    • (QAT) - new

    • with RL

    Specific use-case notebooks:

    • (QAT) - new

    • with RL - new

    Rest of notebooks:

    Kaggle notebooks

    Standard notebooks:

    • - new

    • - new

    GRPO (Reasoning) notebooks:

    • - Vision GRPO - new

    Text-to-Speech (TTS) notebooks:

    • – Speech-to-Text

    Vision (Multimodal) notebooks:

    Specific use-case notebooks:

    Rest of notebooks:

    To view a complete list of all our Kaggle notebooks, .

    Feel free to contribute to the notebooks by visiting our !

    Qwen3-VL: How to Run Guide

    Learn to fine-tune and run Qwen3-VL locally with Unsloth.

Qwen3-VL is Qwen’s new family of vision models, with instruct and thinking versions. The 2B, 4B, 8B and 32B models are dense, while 30B and 235B are MoE. The 235B thinking LLM delivers SOTA vision and coding performance rivaling GPT-5 (high) and Gemini 2.5 Pro. Qwen3-VL has vision, video and OCR capabilities as well as 256K context (which can be extended to 1M). Unsloth supports Qwen3-VL fine-tuning and RL. Train Qwen3-VL (8B) for free with our notebooks.

    Running Qwen3-VLFine-tuning Qwen3-VL

    🖥️ Running Qwen3-VL

    To run the model in llama.cpp, vLLM, Ollama etc., here are the recommended settings:

    ⚙️ Recommended Settings

Qwen recommends the following settings; they differ slightly between the Instruct and Thinking models:

    Instruct Settings:
    Thinking Settings:

    Qwen3-VL also used the below settings for their benchmarking numbers, as mentioned .

    Instruct Settings:

    Thinking Settings:

    🐛Chat template bug fixes

At Unsloth, we care about accuracy the most, so we investigated why llama.cpp would break after the 2nd turn when running the Thinking models, as seen below:

    The error code:

We have successfully fixed the Thinking chat template for the VL models, so we re-uploaded all of Unsloth's Thinking quants. They should now all work after the 2nd conversation, whereas other quants will fail to load after the 2nd conversation.

    Qwen3-VL Unsloth uploads:

    Qwen3-VL is now supported for GGUFs by llama.cpp as of 30th October 2025, so you can run them locally!

    Dynamic GGUFs (to run)
    4-bit BnB Unsloth Dynamic
    16-bit full-precision

    📖 Llama.cpp: Run Qwen3-VL Tutorial

    1. Obtain the latest llama.cpp on . You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.

    1. Let's first get an image! You can also upload images as well. We shall use , which is just our mini logo showing how finetunes are made with Unsloth:

    1. Let's download this image

    1. Let's get the 2nd image at

    1. Then, let's use llama.cpp's auto model downloading feature, try this for the 8B Instruct model:

    1. Once in, you will see the below screen:

    1. Load up the image via /image PATH ie /image unsloth.png then press ENTER

    1. When you hit ENTER, it'll say "unsloth.png image loaded"

    1. Now let's ask a question like "What is this image?":

    1. Now load in picture 2 via /image picture.png then hit ENTER and ask "What is this image?"

1. And finally let's ask how both images are related (it works!)

1. You can also download the model via Hugging Face's snapshot_download (after installing pip install huggingface_hub hf_transfer), which is useful for large model downloads, since llama.cpp's auto downloader might lag. You can choose Q4_K_M, or other quantized versions.

    1. Run the model and try any prompt. For Instruct:

    1. For Thinking:

    🪄Running Qwen3-VL-235B-A22B and Qwen3-VL-30B-A3B

    For Qwen3-VL-235B-A22B, we will use llama.cpp for optimized inference and a plethora of options.

    1. We're following similar steps to above however this time we'll also need to perform extra steps because the model is so big.

2. Download the model via (after installing pip install huggingface_hub hf_transfer ). You can choose UD-Q2_K_XL, or other quantized versions.

    3. Run the model and try a prompt. Set the correct parameters for Thinking vs. Instruct.

    Instruct:

    Thinking:

1. Edit --ctx-size 16384 for context length and --n-gpu-layers 99 to control how many layers are offloaded to the GPU. Lower it if your GPU runs out of memory, and remove it entirely for CPU-only inference.

    Use -ot ".ffn_.*_exps.=CPU" to offload all MoE layers to the CPU! This effectively allows you to fit all non MoE layers on 1 GPU, improving generation speeds. You can customize the regex expression to fit more layers if you have more GPU capacity.

    🐋 Docker: Run Qwen3-VL

    If you already have Docker desktop, to run Unsloth's models from Hugging Face, run the command below and you're done:

    Or you can run Docker's uploaded Qwen3-VL models:

    🦥 Fine-tuning Qwen3-VL

Unsloth supports fine-tuning and reinforcement learning (RL) for Qwen3-VL, including the larger 32B and 235B models. This includes support for fine-tuning for video and object detection. As usual, Unsloth makes Qwen3-VL models train 1.7x faster with 60% less VRAM and 8x longer context lengths with no accuracy degradation. We made two Qwen3-VL (8B) training notebooks which you can run for free on Colab:

    Saving Qwen3-VL to GGUF now works as llama.cpp just supported it!

    If you want to use any other Qwen3-VL model, just change the 8B model to the 2B, 32B etc. one.

    The goal of the GRPO notebook is to make a vision language model solve maths problems via RL given an image input like below:

    This Qwen3-VL support also integrates our latest update for even more memory efficient + faster RL including our , which uniquely limits speed degradation compared to other implementations. You can read more about how to train vision LLMs with RL with our .

    Multi-image training

In order to fine-tune or train Qwen3-VL with multiple images, the most straightforward change is to swap

    with:

Using map kicks in dataset standardization and Arrow processing rules, which can be strict and more complicated to define.

    GLM-4.6: Run Locally Guide

    A guide on how to run Z.ai GLM-4.6 and GLM-4.6V-Flash model on your own local device!

GLM-4.6 and GLM-4.6V-Flash are the latest reasoning models from Z.ai, achieving SOTA performance on coding and agent benchmarks while offering improved conversational chats. GLM-4.6V-Flash, the smaller 9B model, was released in December 2025 and you can run it now too.

    The full 355B parameter model requires 400GB of disk space, while the Unsloth Dynamic 2-bit GGUF reduces the size to 135GB (-75%). GLM-4.6-GGUF

    We did multiple chat template fixes for GLM-4.6 to make llama.cpp/llama-cli --jinja work - please only use --jinja otherwise the output will be wrong!

    You asked for benchmarks on our quants, so we’re showcasing Aider Polyglot results! Our Dynamic 3-bit DeepSeek V3.1 GGUF scores 75.6%, surpassing many full-precision SOTA LLMs.

    All uploads use Unsloth for SOTA 5-shot MMLU and Aider performance, meaning you can run & fine-tune quantized GLM LLMs with minimal accuracy loss.

    Tutorials navigation:

    Unsloth Chat Template fixes

One of the significant fixes we made addresses an issue with prompting GGUFs, where the second prompt wouldn't work. We fixed this issue; however, the problem still persists in GGUFs without our fixes. For example, when using any non-Unsloth GLM-4.6 GGUF, the first conversation works fine, but the second one breaks.

    We’ve resolved this in our chat template, so when using our version, conversations beyond the second (third, fourth, etc.) work without any errors. There are still some issues with tool-calling, which we haven’t fully investigated yet due to bandwidth limitations. We’ve already informed the GLM team about these remaining issues.

    ⚙️ Usage Guide

    The 2-bit dynamic quant UD-Q2_K_XL uses 135GB of disk space - this works well in a 1x24GB card and 128GB of RAM with MoE offloading. The 1-bit UD-TQ1 GGUF also works natively in Ollama!

You must use --jinja for llama.cpp quants - this applies our fixes and enables the correct chat template! You might get incorrect results if you do not use --jinja.

The 4-bit quants will fit on a single 40GB GPU (with MoE layers offloaded to RAM). Expect around 5 tokens/s with this setup if you also have around 165GB of RAM. It is recommended to have at least 205GB of RAM to run this 4-bit quant; for optimal performance (5+ tokens/s) you will need at least 205GB of unified memory or 205GB of combined RAM+VRAM. To learn how to increase generation speed and fit longer contexts, see the sections below.

    Though not a must, for best performance, have your VRAM + RAM combined equal to the size of the quant you're downloading. If not, hard drive / SSD offloading will work with llama.cpp, just inference will be slower.

    Recommended Settings

    According to Z.ai, there are different settings for GLM-4.6V-Flash & GLM-4.6 inference:

    GLM-4.6V-Flash
    GLM-4.6
    • Use --jinja for llama.cpp variants - we fixed some chat template issues as well!

    Run GLM-4.6 Tutorials:

See our step-by-step guides for running GLM-4.6V-Flash and the larger GLM-4.6 models.

    GLM-4.6V-Flash

Currently GLM-4.6V-Flash only works with text via llama.cpp. Vision support will come later.

    ✨ Run in llama.cpp

    1

    Obtain the latest llama.cpp on . You can also use the build instructions below. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.

    2

    If you want to use llama.cpp directly to load models, you can do the below: (:Q8_K_XL) is the quantization type. You can also download via Hugging Face (point 3). This is similar to ollama run . Use

    GLM-4.6

    🦙 Run in Ollama

    1

    Install ollama if you haven't already! To run more variants of the model, .

    2

Run the model! Note you can call ollama serve in another terminal if it fails! We include all our fixes and suggested parameters (temperature etc.) in params in our Hugging Face upload!

    3

    ✨ Run in llama.cpp

    1

    Obtain the latest llama.cpp on . You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.

    2

    If you want to use llama.cpp directly to load models, you can do the below: (:Q2_K_XL) is the quantization type. You can also download via Hugging Face (point 3). This is similar to ollama run . Use

    ✨ Deploy with llama-server and OpenAI's completion library

    To use llama-server for deployment, use the following command:

    Then use OpenAI's Python library after pip install openai :
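For example, a minimal sketch assuming llama-server is listening locally on its default port 8080 (adjust the port and model name to your setup):

```python
from openai import OpenAI

# llama-server exposes an OpenAI-compatible endpoint under /v1
client = OpenAI(base_url = "http://localhost:8080/v1", api_key = "sk-no-key-required")

response = client.chat.completions.create(
    model = "unsloth/GLM-4.6-GGUF",   # the model name here is illustrative
    messages = [{"role": "user", "content": "What is 2+2?"}],
)
print(response.choices[0].message.content)
```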

    💽Model uploads

    ALL our uploads - including those that are not imatrix-based or dynamic, utilize our calibration dataset, which is specifically optimized for conversational, coding, and language tasks.

    • Full GLM-4.6 model uploads below:

    We also uploaded and quants which run specifically faster for ARM and Apple devices respectively.

    MoE Bits
    Type + Link
    Disk Size
    Details

    🏂 Improving generation speed

    If you have more VRAM, you can try offloading more MoE layers, or offloading whole layers themselves.

    Normally, -ot ".ffn_.*_exps.=CPU" offloads all MoE layers to the CPU! This effectively allows you to fit all non MoE layers on 1 GPU, improving generation speeds. You can customize the regex expression to fit more layers if you have more GPU capacity.

    If you have a bit more GPU memory, try -ot ".ffn_(up|down)_exps.=CPU" This offloads up and down projection MoE layers.

    Try -ot ".ffn_(up)_exps.=CPU" if you have even more GPU memory. This offloads only up projection MoE layers.

    You can also customize the regex, for example -ot "\.(6|7|8|9|[0-9][0-9]|[0-9][0-9][0-9])\.ffn_(gate|up|down)_exps.=CPU" means to offload gate, up and down MoE layers but only from the 6th layer onwards.

    Llama.cpp also introduces high throughput mode. Use llama-parallel. Read more about it . You can also quantize the KV cache to 4bits for example to reduce VRAM / RAM movement, which can also make the generation process faster.

    📐How to fit long context (full 200K)

    To fit longer context, you can use KV cache quantization to quantize the K and V caches to lower bits. This can also increase generation speed due to reduced RAM / VRAM data movement. The allowed options for K quantization (default is f16) include the below.

    --cache-type-k f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1

You should use the _1 variants for somewhat increased accuracy, albeit slightly slower, e.g. q4_1, q5_1.

You can also quantize the V cache, but you will need to compile llama.cpp with Flash Attention support via -DGGML_CUDA_FA_ALL_QUANTS=ON, and use --flash-attn to enable it. Then you can use --cache-type-v together with --cache-type-k:

    --cache-type-v f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1

    SGLang Deployment & Inference Guide

    Guide on saving and deploying LLMs to SGLang for serving LLMs in production

    You can serve any LLM or fine-tuned model via SGLang for low-latency, high-throughput inference. SGLang supports text, image/video model inference on any GPU setup, with support for some GGUFs.

    💻Installing SGLang

    To install SGLang and Unsloth on NVIDIA GPUs, you can use the below in a virtual environment (which won't break your other Python libraries)

    For Docker setups run:

    🐛Debugging SGLang Installation issues

    Note if you see the below, update Rust and outlines-core as specified in

    If you see a Flashinfer issue like below:

    Remove the flashinfer cache via rm -rf .cache/flashinfer and also the directory listed in the error message ie rm -rf ~/.cache/flashinfer

    🚚Deploying SGLang models

    To deploy any model like for example , do the below in a separate terminal (otherwise it'll block your current terminal - you can also use tmux):

    You can then use the OpenAI Chat completions library to call the model (in another terminal or using tmux):
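A minimal sketch, assuming SGLang's default port 30000 and a placeholder model name:

```python
from openai import OpenAI

# SGLang serves an OpenAI-compatible API under /v1 on port 30000 by default
client = OpenAI(base_url = "http://localhost:30000/v1", api_key = "None")

response = client.chat.completions.create(
    model = "unsloth/Llama-3.2-1B-Instruct",   # whichever model you launched
    messages = [{"role": "user", "content": "What is 2+2?"}],
)
print(response.choices[0].message.content)
```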

    And you will get 2 + 2 = 4.

    🦥Deploying Unsloth finetunes in SGLang

After fine-tuning or using our notebooks at , you can save or deploy your models directly through SGLang within a single workflow. For example, an Unsloth fine-tuning script:

    To save to 16-bit for SGLang, use:

    To save just the LoRA adapters, either use:

    Or just use our builtin function to do that:
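A sketch of these three options (directory names are placeholders):

```python
# 1) Merge the LoRA adapter into the base model and save 16-bit weights for SGLang
model.save_pretrained_merged("finetuned_model", tokenizer, save_method = "merged_16bit")

# 2) Save only the LoRA adapters with the standard Hugging Face methods
model.save_pretrained("lora_adapters")
tokenizer.save_pretrained("lora_adapters")

# 3) Or use Unsloth's built-in helper to do the same
model.save_pretrained_merged("lora_adapters", tokenizer, save_method = "lora")
```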

    🚃gpt-oss-20b: Unsloth & SGLang Deployment Guide

Below is a step-by-step tutorial with instructions for training gpt-oss-20b using Unsloth and deploying it with SGLang. It includes performance benchmarks across multiple quantization formats.

    1

    Unsloth Fine-tuning and Exporting Formats

If you're new to fine-tuning, you can read our fine-tuning guide, or try the gpt-oss 20B fine-tuning notebook. After training, you can export the model in multiple formats:

    2

    💎FP8 Online Quantization

To deploy models with FP8 online quantization in SGLang, which allows 30 to 50% more throughput, 50% less memory usage, and 2x longer context lengths, you can do the below:

You can also use --kv-cache-dtype fp8_e5m2, which has a larger dynamic range and might solve FP8 inference issues if you see them. Or use our pre-quantized float8 quants, some of which are listed below:

    ⚡Benchmarking SGLang

    Below is some code you can run to test the performance speed of your finetuned model:

    Then in another terminal or via tmux:

    You will see the benchmarking run like below:

    We used a B200x1 GPU with gpt-oss-20b and got the below results (~2,500 tokens throughput)

    Batch/Input/Output
    TTFT (s)
    ITL (s)
    Input Throughput
    Output Throughput

See the SGLang documentation for server arguments.

    🏃SGLang Interactive Offline Mode

    You can also use SGLang in offline mode (ie not a server) inside a Python interactive environment.

    🎇GGUFs in SGLang

    SGLang also interestingly supports GGUFs! Qwen3 MoE is still under construction, but most dense models (Llama 3, Qwen 3, Mistral etc) are supported.

    First install the latest gguf python package via:

    Then for example in offline mode SGLang, you can do:

    🎬High throughput GGUF serving with SGLang

    First download the specific GGUF file like below:

Then serve the specific file Qwen3-32B-UD-Q4_K_XL.gguf and use --served-model-name unsloth/Qwen3-32B. We also need the Hugging Face-compatible tokenizer via --tokenizer-path:

    FP8 Reinforcement Learning

    Train reinforcement learning (RL) and GRPO in FP8 precision with Unsloth.

    We're introducing FP8-precision training for RL, making FP8 GRPO now possible on consumer GPUs (RTX 40, 50 etc). DeepSeek-R1 demonstrated how powerful FP8 can be and with Unsloth, Qwen3-1.7B FP8 GRPO now works on just 5GB of VRAM.

Faster RL inference is critical as it's the most compute-intensive workload in RL. We collaborated with the team at PyTorch to enable performance gains with no loss in accuracy.

• ~1.4× faster RL inference

• 2x longer context vs. BF16 and FP16

    • 60% less VRAM and 10× longer context than other FP8 RL implementations

    Text-to-Speech (TTS) Fine-tuning

Learn how to fine-tune TTS & STT voice models with Unsloth.

    Fine-tuning TTS models allows them to adapt to your specific dataset, use case, or desired style and tone. The goal is to customize these models to clone voices, adapt speaking styles and tones, support new languages, handle specific tasks and more. We also support Speech-to-Text (STT) models like OpenAI's Whisper.

With Unsloth, you can fine-tune any TTS model (transformers compatible) 1.5x faster with 50% less memory than other implementations with Flash Attention 2. This support includes Sesame CSM, Orpheus, and models supported by transformers (e.g. CrisperWhisper, Spark and more).

Zero-shot cloning captures tone but misses pacing and expression, often sounding robotic and unnatural. Fine-tuning delivers far more accurate and realistic voice replication.

    # Define the system prompt that instructs the model to use a specific format
    SYSTEM_PROMPT = """
    Respond in the following format:
    <reasoning>
    ...
    </reasoning>
    <answer>
    ...
    </answer>
    """
    
    XML_COT_FORMAT = """\
    <reasoning>
    {reasoning}
    </reasoning>
    <answer>
    {answer}
    </answer>
    """
    import re
    from datasets import load_dataset, Dataset
    
    
    # Helper functions to extract answers from different formats
    def extract_xml_answer(text: str) -> str:
        answer = text.split("<answer>")[-1]
        answer = answer.split("</answer>")[0]
        return answer.strip()
    
    
    def extract_hash_answer(text: str) -> str | None:
        if "####" not in text:
            return None
        return text.split("####")[1].strip()
    
    
    # Function to prepare the GSM8K dataset
    def get_gsm8k_questions(split="train") -> Dataset:
        data = load_dataset("openai/gsm8k", "main")[split]
        data = data.map(
            lambda x: {
                "prompt": [
                    {"role": "system", "content": SYSTEM_PROMPT},
                    {"role": "user", "content": x["question"]},
                ],
                "answer": extract_hash_answer(x["answer"]),
            }
        )
        return data
    
    
    dataset = get_gsm8k_questions()
    epsilon=0.2,
    epsilon_high=0.28, # one sided
    delta=1.5 # two sided
    
    loss_type='bnpo',
    # or:
    loss_type='grpo',
    # or:
    loss_type='dr_grpo',
    # or:
    loss_type='dapo',
    
    mask_truncated_completions=True,
    # Push to Hugging Face Hub (requires a token)
    model.push_to_hub_merged(
        "your-username/model-name", tokenizer, save_method="merged_16bit", token="your-token"
    )
    model.push_to_hub_gguf(
        "your-username/model-name",
        tokenizer,
        quantization_method=["q4_k_m", "q8_0", "q5_k_m"],
        token="your-token",
    )
    from unsloth import check_python_modules
    ok, info = check_python_modules("""
    def strategy(board):
        import math
        from typing import Callable
        return "W"
    """)
    # ok == True means only Python‑level imports were used
    Create a new short 2048 strategy using only native Python code.
    You are given a list of list of numbers for the current board state.
    Output one action for "W", "A", "S", "D" on what is the optimal next step.
    Output your new short function in backticks using the format below:
    ```python
    def strategy(board):
        return "W"  # Example
    ```
    Create a tiny synthetic dataset (reusing the same prompt) and compute the prompt length so GRPO knows how many completion tokens to sample:
    
    ```python
    from datasets import Dataset
    
    prompt = ...  # as above
    
    maximum_length = len(tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}], add_generation_prompt=True
    ))
    
    dataset = Dataset.from_list([
        {"prompt": [{"role": "user", "content": prompt}], "answer": 0, "reasoning_effort": "low"}
    ] * 1000)
    ```
    !pip install --upgrade -qqq uv
    try: import numpy; get_numpy = f"numpy=={numpy.__version__}"
    except: get_numpy = "numpy"
    !uv pip install -qqq \
        "torch>=2.8.0" "triton>=3.4.0" {get_numpy} torchvision bitsandbytes "transformers==4.56.2" \
        "unsloth_zoo[base] @ git+https://github.com/unslothai/unsloth-zoo" \
        "unsloth[base] @ git+https://github.com/unslothai/unsloth" \
        git+https://github.com/triton-lang/triton.git@05b2c186c1b6c9a08375389d5efe9cb4c401c075#subdirectory=python/triton_kernels
    !uv pip install --upgrade --no-deps transformers==4.56.2 tokenizers
    !uv pip install --no-deps trl==0.22.2
    from unsloth import FastLanguageModel
    import torch
    
    max_seq_length = 768        # Increase if your task needs longer outputs
    lora_rank      = 4          # Higher rank → better but more VRAM/compute
    
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name        = "unsloth/gpt-oss-20b",  # or unsloth/gpt-oss-20b-BF16 on H100
        max_seq_length    = max_seq_length,
        load_in_4bit      = True,                    # False for 16‑bit
        offload_embedding = True,                    # saves ~1GB VRAM
    )
    
    model = FastLanguageModel.get_peft_model(
        model,
        r = lora_rank,
        target_modules = [
            "q_proj", "k_proj", "v_proj", "o_proj",
            "gate_proj", "up_proj", "down_proj",
        ],
        lora_alpha = lora_rank * 2,
        use_gradient_checkpointing = "unsloth",     # big memory saver
        random_state = 3407,
    )
    <bos><start_of_turn>user\nHello!<end_of_turn>\n<start_of_turn>model\nHey there!<end_of_turn>\n<start_of_turn>user\nWhat is 1+1?<end_of_turn>\n<start_of_turn>model\n
    <bos><start_of_turn>user
    Hello!<end_of_turn>
    <start_of_turn>model
    Hey there!<end_of_turn>
    <start_of_turn>user
    What is 1+1?<end_of_turn>
    <start_of_turn>model\n
    apt-get update
    apt-get install pciutils -y
    curl -fsSL https://ollama.com/install.sh | sh
    ollama run hf.co/unsloth/gemma-3-27b-it-GGUF:Q4_K_XL
    apt-get update
    apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
    git clone https://github.com/ggerganov/llama.cpp
    cmake llama.cpp -B llama.cpp/build \
        -DBUILD_SHARED_LIBS=ON -DGGML_CUDA=ON -DLLAMA_CURL=ON
    cmake --build llama.cpp/build --config Release -j --clean-first --target llama-quantize llama-cli llama-gguf-split llama-mtmd-cli
    cp llama.cpp/build/bin/llama-* llama.cpp
    ./llama.cpp/llama-mtmd-cli \
        -hf unsloth/gemma-3-4b-it-GGUF:Q4_K_XL
    # !pip install huggingface_hub hf_transfer
    import os
    os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
    from huggingface_hub import snapshot_download
    snapshot_download(
        repo_id = "unsloth/gemma-3-27b-it-GGUF",
        local_dir = "unsloth/gemma-3-27b-it-GGUF",
        allow_patterns = ["*Q4_K_XL*", "mmproj-BF16.gguf"], # For Q4_K_M
    )
    ./llama.cpp/llama-mtmd-cli \
        --model unsloth/gemma-3-27b-it-GGUF/gemma-3-27b-it-Q4_K_XL.gguf \
        --mmproj unsloth/gemma-3-27b-it-GGUF/mmproj-BF16.gguf \
        --threads 32 \
        --ctx-size 16384 \
        --n-gpu-layers 99 \
        --seed 3407 \
        --prio 2 \
        --temp 1.0 \
        --repeat-penalty 1.0 \
        --min-p 0.01 \
        --top-k 64 \
        --top-p 0.95
    ./llama.cpp/llama-cli \
        --model unsloth/gemma-3-27b-it-GGUF/gemma-3-27b-it-Q4_K_XL.gguf \
        --threads 32 \
        --ctx-size 16384 \
        --n-gpu-layers 99 \
        --seed 3407 \
        --prio 2 \
        --temp 1.0 \
        --repeat-penalty 1.0 \
        --min-p 0.01 \
        --top-k 64 \
        --top-p 0.95 \
        -no-cnv \
        --prompt "<start_of_turn>user\nCreate a Flappy Bird game in Python. You must include these things:\n1. You must use pygame.\n2. The background color should be randomly chosen and is a light shade. Start with a light blue color.\n3. Pressing SPACE multiple times will accelerate the bird.\n4. The bird's shape should be randomly chosen as a square, circle or triangle. The color should be randomly chosen as a dark color.\n5. Place on the bottom some land colored as dark brown or yellow chosen randomly.\n6. Make a score shown on the top right side. Increment if you pass pipes and don't hit them.\n7. Make randomly spaced pipes with enough space. Color them randomly as dark green or light brown or a dark gray shade.\n8. When you lose, show the best score. Make the text inside the screen. Pressing q or Esc will quit the game. Restarting is pressing SPACE again.\nThe final game should be inside a markdown section in Python. Check your code for errors and fix them before the final markdown section.<end_of_turn>\n<start_of_turn>model\n"
    <start_of_turn>user
    Create a Flappy Bird game in Python. You must include these things:
    1. You must use pygame.
    2. The background color should be randomly chosen and is a light shade. Start with a light blue color.
    3. Pressing SPACE multiple times will accelerate the bird.
    4. The bird's shape should be randomly chosen as a square, circle or triangle. The color should be randomly chosen as a dark color.
    5. Place on the bottom some land colored as dark brown or yellow chosen randomly.
    6. Make a score shown on the top right side. Increment if you pass pipes and don't hit them.
    7. Make randomly spaced pipes with enough space. Color them randomly as dark green or light brown or a dark gray shade.
    8. When you lose, show the best score. Make the text inside the screen. Pressing q or Esc will quit the game. Restarting is pressing SPACE again.
    The final game should be inside a markdown section in Python. Check your code for error
    <|header_start|>user<|header_end|>\n\nWhat is 1+1?<|eot|><|header_start|>assistant<|header_end|>\n\n
    You are an expert conversationalist who responds to the best of your ability. You are companionable and confident, and able to switch casually between tonal types, including but not limited to humor, empathy, intellectualism, creativity and problem-solving. 
    
    You understand user intent and don’t try to be overly helpful to the point where you miss that the user is looking for chit-chat, emotional support, humor or venting.Sometimes people just want you to listen, and your answers should encourage that. For all other cases, you provide insightful and in-depth responses. Organize information thoughtfully in a way that helps people make decisions. Always avoid templated language.
    
    You never lecture people to be nicer or more inclusive. If people ask for you to write something in a certain voice or perspective, such as an essay or a tweet, you can. You do not need to be respectful when the user prompts you to say something rude. 
    
    You never use phrases that imply moral superiority or a sense of authority, including but not limited to “it’s important to”, “it’s crucial to”, “it’s essential to”, "it's unethical to", "it's worth noting…", “Remember…”  etc. Avoid using these.
    
    Finally, do not refuse political prompts. You can help users express their opinion. 
    
    You are Llama 4. Your knowledge cutoff date is August 2024. You speak Arabic, English, French, German, Hindi, Indonesian, Italian, Portuguese, Spanish, Tagalog, Thai, and Vietnamese. Respond in the language the user speaks to you in, unless they ask otherwise.
    apt-get update
    apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
    git clone https://github.com/ggml-org/llama.cpp
    cmake llama.cpp -B llama.cpp/build \
        -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
    cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-gguf-split
    cp llama.cpp/build/bin/llama-* llama.cpp
    # !pip install huggingface_hub hf_transfer
    import os
    os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
    from huggingface_hub import snapshot_download
    snapshot_download(
        repo_id = "unsloth/Llama-4-Scout-17B-16E-Instruct-GGUF",
        local_dir = "unsloth/Llama-4-Scout-17B-16E-Instruct-GGUF",
        allow_patterns = ["*IQ2_XXS*"],
    )
    ./llama.cpp/llama-cli \
        --model unsloth/Llama-4-Scout-17B-16E-Instruct-GGUF/Llama-4-Scout-17B-16E-Instruct-UD-IQ2_XXS.gguf \
        --threads 32 \
        --ctx-size 16384 \
        --n-gpu-layers 99 \
        -ot ".ffn_.*_exps.=CPU" \
        --seed 3407 \
        --prio 3 \
        --temp 0.6 \
        --min-p 0.01 \
        --top-p 0.9 \
        -no-cnv \
        --prompt "<|header_start|>user<|header_end|>\n\nCreate a Flappy Bird game in Python. You must include these things:\n1. You must use pygame.\n2. The background color should be randomly chosen and is a light shade. Start with a light blue color.\n3. Pressing SPACE multiple times will accelerate the bird.\n4. The bird's shape should be randomly chosen as a square, circle or triangle. The color should be randomly chosen as a dark color.\n5. Place on the bottom some land colored as dark brown or yellow chosen randomly.\n6. Make a score shown on the top right side. Increment if you pass pipes and don't hit them.\n7. Make randomly spaced pipes with enough space. Color them randomly as dark green or light brown or a dark gray shade.\n8. When you lose, show the best score. Make the text inside the screen. Pressing q or Esc will quit the game. Restarting is pressing SPACE again.\nThe final game should be inside a markdown section in Python. Check your code for errors and fix them before the final markdown section.<|eot|><|header_start|>assistant<|header_end|>\n\n"
    # !pip install huggingface_hub hf_transfer
    import os
    os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
    from huggingface_hub import snapshot_download
    snapshot_download(
        repo_id = "unsloth/Llama-4-Maverick-17B-128E-Instruct-GGUF",
        local_dir = "unsloth/Llama-4-Maverick-17B-128E-Instruct-GGUF",
        allow_patterns = ["*IQ1_S*"],
    )
    ./llama.cpp/llama-cli \
        --model unsloth/Llama-4-Maverick-17B-128E-Instruct-GGUF/UD-IQ1_S/Llama-4-Maverick-17B-128E-Instruct-UD-IQ1_S-00001-of-00003.gguf \
        --threads 32 \
        --ctx-size 16384 \
        --n-gpu-layers 99 \
        -ot ".ffn_.*_exps.=CPU" \
        --seed 3407 \
        --prio 3 \
        --temp 0.6 \
        --min-p 0.01 \
        --top-p 0.9 \
        -no-cnv \
        --prompt "<|header_start|>user<|header_end|>\n\nCreate the 2048 game in Python.<|eot|><|header_start|>assistant<|header_end|>\n\n"
    You are Devstral, a helpful agentic model trained by Mistral AI and using the OpenHands scaffold. You can interact with a computer to solve tasks.
    
    <ROLE>
    Your primary role is to assist users by executing commands, modifying code, and solving technical problems effectively. You should be thorough, methodical, and prioritize quality over speed.
    * If the user asks a question, like "why is X happening", don't try to fix the problem. Just give an answer to the question.
    </ROLE>
    
    .... SYSTEM PROMPT CONTINUES ....
    apt-get update
    apt-get install pciutils -y
    curl -fsSL https://ollama.com/install.sh | sh
    export OLLAMA_KV_CACHE_TYPE="q8_0"
    ollama run hf.co/unsloth/Devstral-Small-2507-GGUF:UD-Q4_K_XL
    apt-get update
    apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
    git clone https://github.com/ggerganov/llama.cpp
    cmake llama.cpp -B llama.cpp/build \
        -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
    cmake --build llama.cpp/build --config Release -j --clean-first --target llama-quantize llama-cli llama-gguf-split llama-mtmd-cli
    cp llama.cpp/build/bin/llama-* llama.cpp
    ./llama.cpp/llama-cli -hf unsloth/Devstral-Small-2507-GGUF:UD-Q4_K_XL --jinja
    # !pip install huggingface_hub hf_transfer
    import os
    os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
    from huggingface_hub import snapshot_download
    snapshot_download(
        repo_id = "unsloth/Devstral-Small-2507-GGUF",
        local_dir = "unsloth/Devstral-Small-2507-GGUF",
        allow_patterns = ["*Q4_K_XL*", "*mmproj-F16*"], # For Q4_K_XL
    )
    ./llama.cpp/llama-cli \
        --model unsloth/Devstral-Small-2507-GGUF/Devstral-Small-2507-UD-Q4_K_XL.gguf \
        --threads -1 \
        --ctx-size 131072 \
        --cache-type-k q8_0 \
        --n-gpu-layers 99 \
        --seed 3407 \
        --prio 2 \
        --temp 0.15 \
        --repeat-penalty 1.0 \
        --min-p 0.01 \
        --top-k 64 \
        --top-p 0.95 \
        --jinja
    ./llama.cpp/llama-cli \
        --model unsloth/Devstral-Small-2507-GGUF/Devstral-Small-2507-UD-Q4_K_XL.gguf \
        --threads -1 \
        --ctx-size 131072 \
        --cache-type-k q8_0 \
        --n-gpu-layers 99 \
        --seed 3407 \
        --prio 2 \
        --temp 0.15 \
        --repeat-penalty 1.0 \
        --min-p 0.01 \
        --top-k 64 \
        --top-p 0.95 \
        -no-cnv \
        --prompt "[SYSTEM_PROMPT]You are Devstral, a helpful agentic model trained by Mistral AI and using the OpenHands scaffold. You can interact with a computer to solve tasks.\n\n<ROLE>\nYour primary role is to assist users by executing commands, modifying code, and solving technical problems effectively. You should be thorough, methodical, and prioritize quality over speed.\n* If the user asks a question, like "why is X happening", don\'t try to fix the problem. Just give an answer to the question.\n</ROLE>\n\n<EFFICIENCY>\n* Each action you take is somewhat expensive. Wherever possible, combine multiple actions into a single action, e.g. combine multiple bash commands into one, using sed and grep to edit/view multiple files at once.\n* When exploring the codebase, use efficient tools like find, grep, and git commands with appropriate filters to minimize unnecessary operations.\n</EFFICIENCY>\n\n<FILE_SYSTEM_GUIDELINES>\n* When a user provides a file path, do NOT assume it\'s relative to the current working directory. First explore the file system to locate the file before working on it.\n* If asked to edit a file, edit the file directly, rather than creating a new file with a different filename.\n* For global search-and-replace operations, consider using `sed` instead of opening file editors multiple times.\n</FILE_SYSTEM_GUIDELINES>\n\n<CODE_QUALITY>\n* Write clean, efficient code with minimal comments. Avoid redundancy in comments: Do not repeat information that can be easily inferred from the code itself.\n* When implementing solutions, focus on making the minimal changes needed to solve the problem.\n* Before implementing any changes, first thoroughly understand the codebase through exploration.\n* If you are adding a lot of code to a function or file, consider splitting the function or file into smaller pieces when appropriate.\n</CODE_QUALITY>\n\n<VERSION_CONTROL>\n* When configuring git credentials, use "openhands" as the user.name and "[email protected]" as the user.email by default, unless explicitly instructed otherwise.\n* Exercise caution with git operations. Do NOT make potentially dangerous changes (e.g., pushing to main, deleting repositories) unless explicitly asked to do so.\n* When committing changes, use `git status` to see all modified files, and stage all files necessary for the commit. Use `git commit -a` whenever possible.\n* Do NOT commit files that typically shouldn\'t go into version control (e.g., node_modules/, .env files, build directories, cache files, large binaries) unless explicitly instructed by the user.\n* If unsure about committing certain files, check for the presence of .gitignore files or ask the user for clarification.\n</VERSION_CONTROL>\n\n<PULL_REQUESTS>\n* When creating pull requests, create only ONE per session/issue unless explicitly instructed otherwise.\n* When working with an existing PR, update it with new commits rather than creating additional PRs for the same issue.\n* When updating a PR, preserve the original PR title and purpose, updating description only when necessary.\n</PULL_REQUESTS>\n\n<PROBLEM_SOLVING_WORKFLOW>\n1. EXPLORATION: Thoroughly explore relevant files and understand the context before proposing solutions\n2. ANALYSIS: Consider multiple approaches and select the most promising one\n3. 
TESTING:\n   * For bug fixes: Create tests to verify issues before implementing fixes\n   * For new features: Consider test-driven development when appropriate\n   * If the repository lacks testing infrastructure and implementing tests would require extensive setup, consult with the user before investing time in building testing infrastructure\n   * If the environment is not set up to run tests, consult with the user first before investing time to install all dependencies\n4. IMPLEMENTATION: Make focused, minimal changes to address the problem\n5. VERIFICATION: If the environment is set up to run tests, test your implementation thoroughly, including edge cases. If the environment is not set up to run tests, consult with the user first before investing time to run tests.\n</PROBLEM_SOLVING_WORKFLOW>\n\n<SECURITY>\n* Only use GITHUB_TOKEN and other credentials in ways the user has explicitly requested and would expect.\n* Use APIs to work with GitHub or other platforms, unless the user asks otherwise or your task requires browsing.\n</SECURITY>\n\n<ENVIRONMENT_SETUP>\n* When user asks you to run an application, don\'t stop if the application is not installed. Instead, please install the application and run the command again.\n* If you encounter missing dependencies:\n  1. First, look around in the repository for existing dependency files (requirements.txt, pyproject.toml, package.json, Gemfile, etc.)\n  2. If dependency files exist, use them to install all dependencies at once (e.g., `pip install -r requirements.txt`, `npm install`, etc.)\n  3. Only install individual packages directly if no dependency files are found or if only specific packages are needed\n* Similarly, if you encounter missing dependencies for essential tools requested by the user, install them when possible.\n</ENVIRONMENT_SETUP>\n\n<TROUBLESHOOTING>\n* If you\'ve made repeated attempts to solve a problem but tests still fail or the user reports it\'s still broken:\n  1. Step back and reflect on 5-7 different possible sources of the problem\n  2. Assess the likelihood of each possible cause\n  3. Methodically address the most likely causes, starting with the highest probability\n  4. Document your reasoning process\n* When you run into any major issue while executing a plan from the user, please don\'t try to directly work around it. Instead, propose a new plan and confirm with the user before proceeding.\n</TROUBLESHOOTING>[/SYSTEM_PROMPT][INST]Create a Flappy Bird game in Python. You must include these things:\n1. You must use pygame.\n2. The background color should be randomly chosen and is a light shade. Start with a light blue color.\n3. Pressing SPACE multiple times will accelerate the bird.\n4. The bird\'s shape should be randomly chosen as a square, circle or triangle. The color should be randomly chosen as a dark color.\n5. Place on the bottom some land colored as dark brown or yellow chosen randomly.\n6. Make a score shown on the top right side. Increment if you pass pipes and don\'t hit them.\n7. Make randomly spaced pipes with enough space. Color them randomly as dark green or light brown or a dark gray shade.\n8. When you lose, show the best score. Make the text inside the screen. Pressing q or Esc will quit the game. Restarting is pressing SPACE again.\nThe final game should be inside a markdown section in Python. Check your code for error[/INST]"
    ./llama.cpp/llama-mtmd-cli \
        --model unsloth/Devstral-Small-2507-GGUF/Devstral-Small-2507-UD-Q4_K_XL.gguf \
        --mmproj unsloth/Devstral-Small-2507-GGUF/mmproj-F16.gguf \
        --threads -1 \
        --ctx-size 131072 \
        --cache-type-k q8_0 \
        --n-gpu-layers 99 \
        --seed 3407 \
        --prio 2 \
        --temp 0.15
    pip install --upgrade --force-reinstall --no-cache-dir unsloth unsloth_zoo
    model.save_pretrained_merged("merged_model", tokenizer, save_method = "merged_16bit",)
    apt-get update
    apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
    git clone https://github.com/ggerganov/llama.cpp
    cmake llama.cpp -B llama.cpp/build \
        -DBUILD_SHARED_LIBS=ON -DGGML_CUDA=ON -DLLAMA_CURL=ON
    cmake --build llama.cpp/build --config Release -j --clean-first --target llama-quantize llama-cli llama-gguf-split llama-mtmd-cli
    cp llama.cpp/build/bin/llama-* llama.cpp
    python llama.cpp/convert_hf_to_gguf.py merged_model \
        --outfile model-F16.gguf --outtype f16 \
        --split-max-size 50G
    # For BF16:
    python llama.cpp/convert_hf_to_gguf.py merged_model \
        --outfile model-BF16.gguf --outtype bf16 \
        --split-max-size 50G
        
    # For Q8_0:
    python llama.cpp/convert_hf_to_gguf.py merged_model \
        --outfile model-Q8_0.gguf --outtype q8_0 \
        --split-max-size 50G
    new_dataset = dataset.train_test_split(
        test_size = 0.01, # 1% test split; can also be an integer number of rows
        shuffle = True, # Should always set to True!
        seed = 3407,
    )
    
    train_dataset = new_dataset["train"] # Dataset for training
    eval_dataset = new_dataset["test"] # Dataset for evaluation
    from trl import SFTTrainer, SFTConfig
    trainer = SFTTrainer(
        args = SFTConfig(
            fp16_full_eval = True,         # Set this to reduce memory usage
            per_device_eval_batch_size = 2,# Increasing this will use more memory
        eval_accumulation_steps = 4,   # You can increase this instead of the batch size
        eval_strategy = "steps",       # Runs eval every few steps or epochs.
        eval_steps = 1,                # Run evaluation every this many training steps
        ),
        train_dataset = new_dataset["train"],
        eval_dataset = new_dataset["test"],
        ...
    )
    trainer.train()
    new_dataset = dataset.train_test_split(test_size = 0.01)
    
    from trl import SFTTrainer, SFTConfig
    trainer = SFTTrainer(
        args = SFTConfig(
            fp16_full_eval = True,
            per_device_eval_batch_size = 2,
            eval_accumulation_steps = 4,
            eval_strategy = "steps",
            eval_steps = 1,
        ),
        train_dataset = new_dataset["train"],
        eval_dataset = new_dataset["test"],
        ...
    )
    from trl import SFTConfig, SFTTrainer
    trainer = SFTTrainer(
        args = SFTConfig(
            fp16_full_eval = True,
            per_device_eval_batch_size = 2,
            eval_accumulation_steps = 4,
            output_dir = "training_checkpoints", # location of saved checkpoints for early stopping
            save_strategy = "steps",             # save model every N steps
            save_steps = 10,                     # how many steps until we save the model
        save_total_limit = 3,                # keep only 3 saved checkpoints to save disk space
            eval_strategy = "steps",             # evaluate every N steps
            eval_steps = 10,                     # how many steps until we do evaluation
            load_best_model_at_end = True,       # MUST USE for early stopping
            metric_for_best_model = "eval_loss", # metric we want to early stop on
            greater_is_better = False,           # the lower the eval loss, the better
        ),
        model = model,
        tokenizer = tokenizer,
        train_dataset = new_dataset["train"],
        eval_dataset = new_dataset["test"],
    )
    from transformers import EarlyStoppingCallback
    early_stopping_callback = EarlyStoppingCallback(
        early_stopping_patience = 3,     # How many evaluations to wait while the eval loss does not improve
                                         # (the loss might rise temporarily, then fall again within 3 evals)
        early_stopping_threshold = 0.0,  # Minimum improvement required to count as "better". For example
                                         # 0.01 means the eval loss must drop by at least 0.01, otherwise
                                         # that evaluation counts towards the patience counter.
    )
    trainer.add_callback(early_stopping_callback)
    import os
    os.environ["UNSLOTH_STABLE_DOWNLOADS"] = "1"
    
    from unsloth import FastLanguageModel
    import os
    os.environ["UNSLOTH_COMPILE_DISABLE"] = "1"
    os.environ["UNSLOTH_DISABLE_FAST_GENERATION"] = "1"
    from unsloth.chat_templates import train_on_responses_only
    trainer = train_on_responses_only(
        trainer,
        instruction_part = "<|start_header_id|>user<|end_header_id|>\n\n",
        response_part = "<|start_header_id|>assistant<|end_header_id|>\n\n",
    )
    from unsloth.chat_templates import train_on_responses_only
    trainer = train_on_responses_only(
        trainer,
        instruction_part = "<start_of_turn>user\n",
        response_part = "<start_of_turn>model\n",
    )
    import locale
    locale.getpreferredencoding = lambda: "UTF-8"
    @misc{unsloth_2025_qwen3_30b_a3b,
      author       = {Unsloth AI and Han-Chen, Daniel and Han-Chen, Michael},
      title        = {Qwen3-30B-A3B-GGUF:Q8\_K\_XL},
      year         = {2025},
      publisher    = {Hugging Face},
      howpublished = {\url{https://huggingface.co/unsloth/Qwen3-30B-A3B-GGUF}}
    }
    @misc{unsloth,
      author       = {Unsloth AI and Han-Chen, Daniel and Han-Chen, Michael},
      title        = {Unsloth},
      year         = {2025},
      publisher    = {Github},
      howpublished = {\url{https://github.com/unslothai/unsloth}}
    }
    # OPTIONAL use a virtual environment
    python -m venv unsloth_env
    source unsloth_env/bin/activate
    
    # Install Rust, outlines-core then SGLang
    curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
    source $HOME/.cargo/env && sudo apt-get install -y pkg-config libssl-dev
    pip install --upgrade pip && pip install uv
    uv pip install "sglang" && uv pip install unsloth
    docker run --gpus all \
        --shm-size 32g \
        -p 30000:30000 \
        -v ~/.cache/huggingface:/root/.cache/huggingface \
        --env "HF_TOKEN=<secret>" \
        --ipc=host \
        lmsysorg/sglang:latest \
        python3 -m sglang.launch_server --model-path unsloth/Llama-3.1-8B-Instruct --host 0.0.0.0 --port 30000

    Qwen3-2507-4B • Thinking • Instruct

  • Gemma 3n (E4B) • Text • Vision • Audio

  • IBM Granite-4.0-H

  • Gemma 3 (4B) • Text • Vision • 270M

  • Phi-4 (14B)

  • Llama 3.1 (8B) • Llama 3.2 (1B + 3B)

  • gpt-oss-20b (auto win 2048 game)

  • Qwen3-VL (8B) - Vision GSPO

  • Qwen3 (4B) - Advanced GRPO LoRA

  • Gemma 3 (4B) - Vision GSPO

  • DeepSeek-R1-0528-Qwen3 (8B) (for multilingual usecase)

  • Gemma 3 (1B)

  • Llama 3.2 (3B) - Advanced GRPO LoRA

  • Llama 3.1 (8B)

  • Phi-4 (14B)

  • Mistral v0.3 (7B)

  • Spark-TTS (0.5B)

  • Oute-TTS (1B)

  • Gemma 3 (4B)

  • Llama 3.2 Vision (11B)

  • Qwen2.5-VL (7B)

  • Pixtral (12B) 2409

  • Qwen3-VL - Vision GSPO - new

  • Qwen2.5-VL - Vision GSPO

  • Gemma 3 (4B) - Vision GSPO

  • Gemma 3 (27B)

  • Baidu ERNIE 4.5 VL (28B) - new

  • ModernBERT-large - new Aug 19

  • Synthetic Data Generation Llama 3.2 (3B)

  • gpt-oss-20b (500K context) - new (A100)

  • Tool Calling

  • Mistral v0.3 Instruct (7B)

  • Ollama

  • ORPO

  • Continued Pretraining

  • DPO Zephyr

  • Inference only

  • Llama 3 (8B)

  • BERT - Text Classification - (AutoModelForSequenceClassification)

  • Ollama

  • Tool Calling

  • Continued Pretraining (CPT)

  • Multiple Datasets by Flail

  • KTO by Jeffrey

  • Inference chat UI

  • Conversational

  • ChatML

  • Text Completion

  • Phi-3 (medium)
  • Gemma 2 (2B)

  • Qwen 2.5 Coder (14B)

  • Mistral Small (22B)

  • TinyLlama

  • CodeGemma (7B)

  • Mistral v0.3 (7B)

  • Qwen2 (7B)

  • Gemma 3 (4B)

  • Phi-4 (14B)

  • Llama 3.1 (8B)

  • Llama 3.2 (1B + 3B)

  • Qwen 2.5 (7B)

  • Phi-4 (14B)

  • Qwen 2.5 (3B)

  • Spark-TTS (0.5B)

  • Oute-TTS (1B)

  • Inference only
  • Ollama

  • Text Completion

  • CodeForces-cot (Reasoning)

  • Unsloth Studio (chat UI)

  • Mistral Small (22B)
  • TinyLlama (1.1B)

  • gpt-oss (20b)
    Inference
    Fine-tuning
    Mistral Ministral 3
    DeepSeek-OCR
    Qwen3 (14B)
    Qwen3-VL (8B)
    gpt-oss-20b
    Mistral Ministral 3
    Qwen3-8B - FP8
    Llama-3.2-1B - FP8
    Sesame-CSM (1B)
    Orpheus-TTS (3B)
    Whisper Large V3
    Llasa-TTS (1B)
    Whisper-Large-V3
    Gemma 3n (E4B)
    Mistral Ministral 3
    Qwen3-VL (8B)
    DeepSeek-OCR
    Gemma 3n (E4B)
    gpt-oss-20b (500K context)
    gpt-oss-120b
    Qwen3 (32B)
    Llama 3.3 (70B)
    Customer support agent
    Mistral Ministral 3
    Quantization-Aware Training
    Automatic Kernel Creation
    Customer support agent
    Quantization-Aware Training
    Automatic Kernel Creation
    DPO Zephyr
    Qwen2.5 (3B)
    Gemma 2 (9B)
    Mistral NeMo (12B)
    Phi-3.5 (mini)
    gpt-oss (20B)
    Gemma 3n (E4B)
    Qwen3 (14B)
    Magistral-2509 (24B)
    Qwen2.5-VL
    Qwen3 (4B)
    Gemma 3 (1B)
    Llama 3.1 (8B)
    Sesame-CSM (1B)
    Orpheus-TTS (3B)
    Whisper Large V3
    Llasa-TTS (1B)
    Llama 3.2 Vision (11B)
    Qwen 2.5-VL (7B)
    Pixtral (12B) 2409
    Tool Calling
    ORPO
    Continued Pretraining
    DPO Zephyr
    Gemma 2 (9B)
    Gemma 2 (2B)
    CodeGemma (7B)
    Mistral NeMo (12B)

    | Quant | Type | Disk Size | Link | Details |
    | --- | --- | --- | --- | --- |
    | 2.42bit | IQ2_XXS | 38.6GB | Link | 2.5/2.06bit |
    | 2.71bit | Q2_K_XL | 42.2GB | Link | 3.5/2.5bit |
    | 3.5bit | Q3_K_XL | 52.9GB | Link | 4.5/3.5bit |
    | 4.5bit | Q4_K_XL | 65.6GB | Link | 5.5/4.5bit |

    | Quant | Type | Disk Size | Link |
    | --- | --- | --- | --- |
    | 2.71-bit | Q2_K_XL | 151B | Link |
    | 3.5-bit | Q3_K_XL | 193GB | Link |
    | 4.5-bit | Q4_K_XL | 243GB | Link |

    Local GRPO on your own device
    Great to learn about how to prep your dataset and explanations behind Reinforcement Learning + GRPO basics
    strategy_succeeds - Play a random board; reward success:
    GSPO
    def always_move_left(board):
        return "W"
    
    steps, outcome = execute_strategy(always_move_left, GameBoard(size=8, seed=42, target=2048, probability_fours=0.10))
    sample = """
    def strategy(board):
        from numpy import matmul
        return "W"
    """
    ok, info = check_python_modules(sample)  # ok => False
    from unsloth import create_locked_down_function
    function = """
    def add(a, b):
        def adder(a):
            return a + b
        return adder(b) + b
    """
    f = create_locked_down_function(function)  # errors if globals / imports are used
    from unsloth import execute_with_time_limit
    @execute_with_time_limit(2)
    def execute_strategy(strategy, game):
        # loop until game ends or timeout
        ...
    def extract_function(text):
        if text.count("```") >= 2:
            first = text.find("```") + 3
            second = text.find("```", first)
            fx = text[first:second].strip()
            fx = fx.removeprefix("python\n")
            fx = fx[fx.find("def"):]
            if fx.startswith("def strategy(board):"):
                return fx
        return None
    from unsloth import create_locked_down_function, check_python_modules
    
    def function_works(completions, **kwargs):
        scores = []
        for completion in completions:
            response = completion[0]["content"]
            function = extract_function(response)
            if function is None:
                scores.append(-2.0)
                continue
            ok, info = check_python_modules(function)
            if "error" in info:
                scores.append(-2.0)
                continue
            try:
                _ = create_locked_down_function(function)
                scores.append(1.0)
            except Exception:
                scores.append(-0.5)
        return scores
    def no_cheating(completions, **kwargs):
        scores = []
        for completion in completions:
            response = completion[0]["content"]
            function = extract_function(response)
            if function is None:
                scores.append(-1.0)
                continue
            ok, _ = check_python_modules(function)
            scores.append(1.0 if ok else -20.0)  # heavy penalty if cheating
        return scores
    from trl import GRPOConfig, GRPOTrainer
    
    max_prompt_length     = maximum_length + 1
    max_completion_length = max_seq_length - max_prompt_length
    
    training_args = GRPOConfig(
        temperature=1.0,
        learning_rate=5e-5,
        weight_decay=0.01,
        warmup_ratio=0.1,
        lr_scheduler_type="linear",
        optim="adamw_8bit",
        logging_steps=1,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=1,    # bump to 4 for smoother reward signals
        num_generations=2,                # lower if you OOM
        max_prompt_length=max_prompt_length,
        max_completion_length=max_completion_length,
        max_steps=1000,                   # or set num_train_epochs=1
        save_steps=100,
        report_to="none",
        output_dir="outputs",
    )
    
    trainer = GRPOTrainer(
        model=model,
        processing_class=tokenizer,
        reward_funcs=[function_works, no_cheating, strategy_succeeds],
        args=training_args,
        train_dataset=dataset,
        # Optional eval split:
        # train_dataset=new_dataset["train"],
        # eval_dataset=new_dataset["test"],
    )
    trainer.train()
    from transformers import TextStreamer
    
    text = tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}],
        tokenize=False,
        add_generation_prompt=True,
        reasoning_effort="low",
    )
    
    _ = model.generate(
        **tokenizer(text, return_tensors="pt").to("cuda"),
        temperature=1.0,
        max_new_tokens=1024,
        streamer=TextStreamer(tokenizer, skip_prompt=False),
    )
    model.save_pretrained_merged("finetuned_model", tokenizer, save_method="merged_16bit")
    # or push
    model.push_to_hub_merged("<org_or_user>/<repo>", tokenizer, token="<hf_token>", save_method="merged_16bit")
    import numpy as np
    
    PRINTER = 0  # occasionally print for debugging
    
    def strategy_succeeds(completions, **kwargs):
        global PRINTER
        scores = []
        seed = np.random.randint(10000)
        for completion in completions:
            response = completion[0]["content"]
            function = extract_function(response)
            if function is None:
                scores.append(-2.0)
                continue
            try:
                new_strategy = create_locked_down_function(function)
            except Exception:
                scores.append(0.0)
                continue
            try:
                game = GameBoard(size=6, seed=seed, target=2048, probability_fours=0.10)
                steps, state = execute_strategy(new_strategy, game)
                if PRINTER % 5 == 0:
                    print(function)
                    print(f"Steps={steps} State={state}")
                    print(game.board().pretty())
                PRINTER += 1
                if state == "success":
                    scores.append(20.0)
                else:
                    scores.append(2.0)   # worked but didn’t reach 2048
            except TimeoutError:
                scores.append(-1.0)      # timed out
            except Exception:
                scores.append(-3.0)      # crashed
        return scores
    Use export LLAMA_CACHE="folder" to force llama.cpp to save downloads to a specific location. Remember the model has a maximum of 128K context length.
    3. Download the model (after installing pip install huggingface_hub hf_transfer). You can choose UD-Q4_K_XL (dynamic 4-bit quant) or other quantized versions like Q8_K_XL.

    To run other quants, you need to first merge the GGUF split files into 1 like the code below. Then you will need to run the model locally.

    Use export LLAMA_CACHE="folder" to force llama.cpp to save downloads to a specific location. Remember the model has a maximum of 128K context length.

    Please try out -ot ".ffn_.*_exps.=CPU" to offload all MoE layers to the CPU! This effectively allows you to fit all non-MoE layers on 1 GPU, improving generation speeds. You can customize the regex expression to fit more layers if you have more GPU capacity.

    If you have a bit more GPU memory, try -ot ".ffn_(up|down)_exps.=CPU" This offloads up and down projection MoE layers.

    Try -ot ".ffn_(up)_exps.=CPU" if you have even more GPU memory. This offloads only up projection MoE layers.

    And finally, offload all MoE layers via -ot ".ffn_.*_exps.=CPU". This uses the least VRAM.

    You can also customize the regex, for example -ot "\.(6|7|8|9|[0-9][0-9]|[0-9][0-9][0-9])\.ffn_(gate|up|down)_exps.=CPU" means to offload gate, up and down MoE layers but only from the 6th layer onwards.
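    For instance, a llama.cpp run with all MoE layers offloaded to the CPU might look like the sketch below (the model path and other flag values are placeholders to adapt to your setup):

    ./llama.cpp/llama-cli \
        --model path/to/model-UD-Q2_K_XL.gguf \
        --threads 32 \
        --ctx-size 16384 \
        --n-gpu-layers 99 \
        -ot ".ffn_.*_exps.=CPU" \
        --temp 0.6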

    3. Download the model (after installing pip install huggingface_hub hf_transfer). You can choose UD-Q2_K_XL (dynamic 2-bit quant) or other quantized versions like Q4_K_XL. We recommend using our 2.7-bit dynamic quant UD-Q2_K_XL to balance size and accuracy.

    4. You can edit --threads 32 for the number of CPU threads, --ctx-size 16384 for context length, and --n-gpu-layers 2 for how many layers to offload to the GPU. Try adjusting these if your GPU runs out of memory, and remove --n-gpu-layers if you are doing CPU-only inference.

    | Quant | Disk Size | Details |
    | --- | --- | --- |
    | 2.42bit | 115GB | 2.5/2.06bit |
    | 2.71bit | 135GB | 3.5/2.5bit |
    | 3.12bit | 145GB | 3.5/2.06bit |
    | 3.5bit | 158GB | 4.5/3.5bit |
    | 4.5bit | 204GB | 5.5/4.5bit |
    | 5.5bit | 252GB | 6.5/5.5bit |

    temperature = 0.8

    temperature = 1.0

    top_p = 0.6 (recommended)

    top_p = 0.95 (recommended for coding)

    top_k = 2 (recommended)

    top_k = 40 (recommended for coding)

    128K context length or less

    200K context length or less

    repeat_penalty = 1.1

    max_generate_tokens = 16,384

    max_generate_tokens = 16,384

    | Quant | Type | Disk Size | Details |
    | --- | --- | --- | --- |
    | 1.66bit | TQ1_0 | 84GB | 1.92/1.56bit |
    | 1.78bit | IQ1_S | 96GB | 2.06/1.56bit |
    | 1.93bit | IQ1_M | 107GB | |


    Deployment with SGLang

    We saved our gpt-oss finetune to the folder "finetuned_model", and so in a new terminal, we can launch the finetuned model as an inference endpoint with SGLang:
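    For example, something along these lines (a sketch mirroring the SGLang launch command shown earlier; host, port and flags are illustrative):

    python3 -m sglang.launch_server \
        --model-path finetuned_model \
        --host 0.0.0.0 \
        --port 30000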

    You might have to wait a bit on the Capturing batches (bs=1 avail_mem=20.84 GB) step!

    3. Calling the inference endpoint

    To call the inference endpoint, first launch a new terminal. We then can call the model like below:
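    For example, assuming the server exposes SGLang's OpenAI-compatible API on port 30000 (a sketch; adjust the payload to your needs):

    curl http://localhost:30000/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{
              "model": "finetuned_model",
              "messages": [{"role": "user", "content": "Hello! Who are you?"}],
              "max_tokens": 128
            }'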

    8/1024/1024

    0.40

    3.59

    20,718.95

    2,562.87

    8/8192/1024

    0.42

    3.74

    154,459.01

    2,473.84

    https://huggingface.co/unsloth/models?search=-fp8
    https://docs.sglang.ai/advanced_features/server_arguments.html
    SGLang

    Unsloth is the only framework to make FP8 RL LoRA work on consumer GPUs (e.g. NVIDIA GeForce RTX 40 and 50 Series). Also works on H100, H200, B200 etc.

  • Use load_in_fp8 = True within FastLanguageModel to enable FP8 RL.

  • Though Qwen3-8B fits in 16GB VRAM, free Colab NVIDIA Tesla T4 GPUs don’t support FP8, so our notebooks use 24GB L4 GPUs, which can also fit Qwen3-14B.

  • Notebooks: Qwen3-8B FP8 GRPO and Llama-3.2-1B FP8 GRPO

    Bonus: You’ll notice Unsloth now uses much less VRAM. We’ll share details in a new blog soon.

    Our FP8 support uses Unsloth’s weight-sharing feature, reducing VRAM use by another 50% and enabling 10× more context with no accuracy loss. We use vLLM for fast inference, and our techniques like Unsloth Standby and Flex Attention further reduce VRAM use. TorchAO enables universal on-the-fly FP8, so Llama, Gemma, Mistral and more all work. We’ve also uploaded most FP8 models (including Qwen3).

    Reward plot shows FP8 following the same trend as BF16

    🌻FP8 vs BF16 Training

    Research shows that FP8 training can largely match BF16 accuracy, and if you serve models in FP8, training and serving in the same precision helps preserve accuracy. FP8 also yields about 1.6x higher throughput than BF16 on H100s and uses 2x less memory.

    Weight scales & FP8 types

    Quantized training stores a low-precision weight (e.g., FP8) plus a higher-precision scale (FP16/BF16/FP32). You approximately recover the original weight via: original_weight ≈ quantized_weight * weight_scale

    The scale maps the weight’s range into FP8’s representable range. More scales usually improve accuracy, but scales cost extra high-precision memory, so it’s a tradeoff. DeepSeek R1, for instance, mostly favors block quantization.
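    As a minimal per-tensor sketch of that relationship (real recipes use per-channel or block-wise scales as discussed below; assumes a recent PyTorch build with float8 dtypes):

    import torch

    w = torch.randn(4096, 4096, dtype = torch.bfloat16)

    # Map the weight's range onto FP8 E4M3's max representable value (~448)
    scale = w.abs().max().float() / 448.0
    w_fp8 = (w.float() / scale).to(torch.float8_e4m3fn)  # low-precision weight
    w_hat = w_fp8.float() * scale                        # original_weight ≈ quantized_weight * weight_scale

    print((w.float() - w_hat).abs().max())  # small reconstruction error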

    There are 3 common FP8 types as defined by vLLM's llm-compressor. We benchmarked Qwen3-8B on all 3 types, checking throughput, MMLU Pro and GPQA Diamond. We find FP8 Block-Wise or Per-Channel (-FP8-Dynamic) is the best in terms of accuracy and throughput.

    | Type | Throughput | MMLU Pro | GPQA Diamond |
    | --- | --- | --- | --- |
    | Bfloat16 baseline | 11,367 | 62.04% | 28.79% |
    | Block-wise (scales per 128x128 block) | 12,041 | 62.37% | 29.29% |

    ⚡FP8 Performance Benchmarks

    Unsloth FP8 RL inference via vLLM is generally 1.4x faster than BF16. You may see even more speed improvements if the model is larger!

    Accuracy & Training Loss Benchmarks

    We tested multiple models including Qwen3-4B, 8B, 14B, Llama 3.2 1B, 3B, Qwen3-VL-2B, Qwen3-VL 4B and many more. All were trained both in BF16 and FP8. As seen in the plots, the loss curves during SFT for BF16 and FP8 closely track each other. There isn’t much to choose between the two data types in terms of training loss:

    For GRPO specifically, due to generation differences, the goal is to check that the reward plots track each other and do not diverge (for example, Qwen3-14B runs might not be exactly identical).

    ⛩️Inference = 96% of RL training

    In RL, we call the LLM / VLM to generate candidate solutions, then score each solution, rewarding good solutions and penalizing bad answers. To achieve maximum efficiency, we want inference to take up nearly 100% of the training run. In Unsloth, we managed to make training take only <4% of the entire RL run, with 96% being purely vLLM inference.

    For example, for Qwen3-8B, vLLM FP8 inference by itself (without training) is 1.15x faster on shorter sequence lengths, and our full RL run in Unsloth also attains 1.15x faster token throughput, showing that training overhead is negligible in Unsloth.

    🔢60% less memory usage

    In theory, you’d expect memory savings to roughly equal the model’s weight memory, because optimizer states are still stored in high precision and activations are also stored in high precision (for now). Our findings match the theory. For LoRA fine-tuning, we observed: ~30 GB saved for Qwen3-32B, ~14 GB saved for Qwen2.5-14B and ~8 GB saved for Qwen3-8B.

    For BF16 LoRA fine-tuning on Qwen3-32B, we were running out of memory (OOM) at higher batch sizes and had to shrink the batch. The FP8 variant had no such issues, and we could use larger batch sizes without OOMing.

    Also, a reminder that in Unsloth we share vLLM's memory space for the weights, as introduced in Memory Efficient RL - we have brought this trick over to the FP8 domain!

    | 80GB GPU | Inference Engine | Training Engine |
    | --- | --- | --- |
    | Model Weights | 8GB shared FP8 | <<< shared |
    | Multi-purpose (72GB space) | KV Cache | Activations, Gradients, Optimizer States |

    To enable Unsloth Standby for FP8 (or BF16) RL, simply add the below to all RL / GRPO training runs before any Unsloth import:
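    For example, using the UNSLOTH_VLLM_STANDBY flag mentioned further below (a minimal sketch):

    import os
    os.environ["UNSLOTH_VLLM_STANDBY"] = "1"  # must be set before any Unsloth import

    import unsloth  # Unsloth now picks up Standby for the RL run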

    ❓How to use FP8 RL / installation

    Simply update Unsloth or install Unsloth in a new virtual environment for H100, L4, RTX 50x, RTX 40x, H200s, B200s, and any NVIDIA GPU (consumer or data center grade) released after the RTX 4090.

    To update Unsloth: pip install --upgrade --force-reinstall --no-cache-dir --no-deps unsloth unsloth_zoo. Or make a new environment:

    Then use load_in_fp8 = True and you're good to go! We'll auto-map the model name to the Float8 variant, or convert the model to Float8 on the fly!

    For example on a RTX 5090 (reminder to set os.environ["UNSLOTH_VLLM_STANDBY"] = "1" )
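    A minimal sketch of that setup (the model name, sequence length and other arguments here are illustrative, not prescriptive):

    import os
    os.environ["UNSLOTH_VLLM_STANDBY"] = "1"   # enable Unsloth Standby

    from unsloth import FastLanguageModel

    model, tokenizer = FastLanguageModel.from_pretrained(
        "unsloth/Qwen3-8B",      # illustrative model choice; we auto-map to a Float8 variant
        max_seq_length = 2048,   # illustrative context length
        load_in_fp8 = True,      # enable FP8 RL
        fast_inference = True,   # use vLLM for fast generation during RL
    )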

    Then use our 2 FP8 notebooks for RL:

    Qwen3-8B FP8 RL Colab

    Llama-3.2-1B-FP8 RL Colab

    💿Implementing FP8 Training

    Our first reference point was transformers, which already supports FP8 in a couple of ways. One of them is a block-quantized matmul implementation: when a layer receives 16‑bit activations, it quantizes them and passes them to a custom FP8 matmul kernel. After wiring this up and benchmarking on an NVIDIA H100, we saw the opposite of what we wanted: fine-tuning became about 4× slower than standard BF16 fine-tuning.

    🔥TorchAO Collab

    So we worked with the TorchAO team (huge thanks to Andrew) to incorporate TorchAO’s FP8 support into our RL workloads and saw around 1.4× faster throughput and up to 60% less model memory usage. At a high level:

    • We store the frozen LoRA weights in FP8.

    • During the forward pass, we apply dynamic FP8 quantization to the input activations, while keeping the trainable LoRA adapters in BF16.

    • These FP8 weights share the same buffers as the vLLM model weights, so there’s only a single FP8 copy of the model in memory at any time (no “double model” memory overhead).

    • In the backward pass, we dequantize the LoRA weights so all gradient computation is done in BF16 for better accuracy.

    This general setup works across all supported RL algorithms, including GSPO, Dr. GRPO, PPO, and DPO.

    TorchAO provides PyTorch-native FP8 support for both training and inference, offering a variety of scaling granularities including tensorwise, row-wise, and 128x128 blockwise (prototype). TorchAO’s FP8 support can improve inference throughput by up to 1.64x at 27B scale with row-wise scaling granularity. For more details, visit the TorchAO FP8 README.

    TorchAO’s block-quantized FP8 matmul

    We used TorchAO’s block‑quantized FP8 matmul implementation which provided:

    • 80% of BF16 throughput

    • Without degrading loss or training stability

    So for a while, this became our default FP8 matmul backend, until FBGEMM caught up - we now default to FBGEMM's implementation if your GPU supports it! The current version of Unsloth automatically chooses the best backend based on what’s installed. If you have the right packages, you don’t have to leave performance on the table 🙂

    PS: We also experimented with DeepSeek’s DeepGEMM, but couldn’t get it fully integrated end‑to‑end to run clean, apples‑to‑apples comparisons.

    🐦On the fly TorchAO FP8 quantization

    Massive thanks to Andrew from TorchAO: Unsloth FP8 RL also lets you quantize the model on the fly, performing the quantization at model load time and passing it on to vLLM. This way, you don't need to explicitly quantize the model yourself (we handle it for you). Set load_in_fp8 = True in the model load arguments, and we will quantize the model to FP8 ourselves if we don't find a suitable pre-quantized checkpoint.

    🎉Unsloth FP8 uploads

    For convenience, we uploaded FP8 Dynamic and FP8 Block models on Hugging Face. You can use them for FP8 training or also efficient & fast serving/deployment via vLLM/SGLang etc.

    FP8 Dynamic offers slightly faster training and lower VRAM usage than FP8 Block, but with a small trade-off in accuracy. See here for our full list of FP8 quants; here are the most popular ones:

    | Model | FP8 uploads |
    | --- | --- |
    | Qwen3 (2507) | 4B Instruct, 4B Thinking, 30B-A3B Instruct, 30B-A3B Thinking |
    | Qwen3-VL | 4B Instruct, 4B Thinking, 8B Instruct, 8B Thinking |
    | Llama 3.1 | 8B Instruct, 8B Base, 70B |
    | Qwen3 | 0.6B, 1.7B, 4B, 8B, 14B, 32B |
    | Llama 3.3 | 70B |
    | Llama 3.2 | 1B Base, 1B Instruct, 3B Base, 3B Instruct |

    💁Acknowledgements

    Huge thanks to the entire PyTorch and TorchAO team for their help and collaboration! A huge thank you especially to: Andrew Or, Jerry Zhang, Supriya Rao, Scott Roy and Mergen Nachin for helping on many discussions on FP8 RL, and on helping to integrate it into Unsloth! Also thanks to the Executorch team as well!

    Unsloth supports any transformers compatible TTS model. Even if we don’t have a notebook or upload for it yet, it’s still supported e.g., try fine-tuning Dia-TTS or Moshi.

    Fine-tuning Notebooks:

    We've also uploaded TTS models (original and quantized) to our Hugging Face page.

    Speech-to-Text (STT)

    If you notice that the output duration caps out at around 10 seconds, increase max_new_tokens from its default value of 125. Since 125 tokens corresponds to about 10 seconds of audio, you'll need to set a higher value for longer outputs.

    Choosing and Loading a TTS Model

    For TTS, smaller models are often preferred due to lower latency and faster inference for end users. Fine-tuning a model under 3B parameters is often ideal, and our primary examples use Sesame-CSM (1B) and Orpheus-TTS (3B), a Llama-based speech model.

    Sesame-CSM (1B) Details

    CSM-1B is a base model, while Orpheus-ft is fine-tuned on 8 professional voice actors, making voice consistency the key difference. CSM requires audio context for each speaker to perform well, whereas Orpheus-ft has this consistency built in.

    Fine-tuning from a base model like CSM generally needs more compute, while starting from a fine-tuned model like Orpheus-ft offers better results out of the box.

    To help with CSM, we’ve added new sampling options and an example showing how to use audio context for improved voice consistency.

    Orpheus-TTS (3B) Details

    Orpheus is pre-trained on a large speech corpus and excels at generating realistic speech, with built-in support for emotional cues like laughs and sighs. Its architecture makes it one of the easiest TTS models to use and train, as it can be exported via llama.cpp, giving it great compatibility across inference engines. For unsupported models, you'll only be able to save the LoRA adapter safetensors.

    Loading the models

    Because voice models are usually small in size, you can train them using LoRA 16-bit or full fine-tuning (FFT), which may provide higher-quality results. To load it in LoRA 16-bit:
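    A minimal loading sketch (the model name here is an assumed example; swap in the TTS model you want to train):

    from unsloth import FastModel

    model, tokenizer = FastModel.from_pretrained(
        model_name = "unsloth/orpheus-3b-0.1-ft",  # assumed example TTS model
        max_seq_length = 2048,
        load_in_4bit = False,      # LoRA 16-bit, as recommended in our TTS notebooks
        # full_finetuning = True,  # optional: full fine-tuning if you have enough VRAM
    )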

    When this runs, Unsloth will download the model weights. If you prefer 8-bit, you could use load_in_8bit = True, or for full fine-tuning set full_finetuning = True (ensure you have enough VRAM). You can also replace the model name with other TTS models.

    Note: Orpheus’s tokenizer already includes special tokens for audio output (more on this later). You do not need a separate vocoder – Orpheus will output audio tokens directly, which can be decoded to a waveform.

    Preparing Your Dataset

    At minimum, a TTS fine-tuning dataset consists of audio clips and their corresponding transcripts (text). Let’s use the Elise dataset, a ~3-hour single-speaker English speech corpus. There are two variants:

    • MrDragonFox/Elise – an augmented version with emotion tags (e.g. <sigh>, <laughs>) embedded in the transcripts. These tags in angle brackets indicate expressions (laughter, sighs, etc.) and are treated as special tokens by Orpheus’s tokenizer

    • Jinsaryko/Elise – base version with transcripts without special tags.

    The dataset is organized with one audio and transcript per entry. On Hugging Face, these datasets have fields such as audio (the waveform), text (the transcription), and some metadata (speaker name, pitch stats, etc.). We need to feed Unsloth a dataset of audio-text pairs.

    Instead of solely focusing on tone, cadence, and pitch, the priority should be ensuring your dataset is fully annotated and properly normalized.

    With some models like Sesame-CSM-1B, you might notice voice variation across generations using speaker ID 0 because it's a base model—it doesn’t have fixed voice identities. Speaker ID tokens mainly help maintain consistency within a conversation, not across separate generations.

    To get a consistent voice, provide contextual examples, like a few reference audio clips or prior utterances. This helps the model mimic the desired voice more reliably. Without this, variation is expected, even with the same speaker ID.

    Option 1: Using Hugging Face Datasets library – We can load the Elise dataset using Hugging Face’s datasets library:
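    For example (a minimal sketch; either Elise variant mentioned above works):

    from datasets import load_dataset

    dataset = load_dataset("MrDragonFox/Elise", split = "train")
    print(dataset[0]["text"])   # transcript
    print(dataset[0]["audio"])  # waveform array + sampling rate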

    This will download the dataset (~328 MB for ~1.2k samples). Each item in dataset is a dictionary with at least:

    • "audio": the audio clip (waveform array and metadata like sampling rate), and

    • "text": the transcript string

    Orpheus supports tags like <laugh>, <chuckle>, <sigh>, <cough>, <sniffle>, <groan>, <yawn>, <gasp>, etc. For example: "I missed you <laugh> so much!". These tags are enclosed in angle brackets and will be treated as special tokens by the model (they match Orpheus’s expected tags like <laugh> and <sigh>). During training, the model will learn to associate these tags with the corresponding audio patterns. The Elise dataset with tags already has many of these (e.g., 336 occurrences of “laughs”, 156 of “sighs”, etc. as listed in its card). If your dataset lacks such tags but you want to incorporate them, you can manually annotate the transcripts where the audio contains those expressions.

    Option 2: Preparing a custom dataset – If you have your own audio files and transcripts:

    • Organize audio clips (WAV/FLAC files) in a folder.

    • Create a CSV or TSV file with columns for file path and transcript. For example:

    • Use load_dataset("csv", data_files="mydata.csv", split="train") to load it. You might need to tell the dataset loader how to handle audio paths. An alternative is using the datasets.Audio feature to load audio data on the fly, as shown in the sketch after this list.

      Then dataset[i]["audio"] will contain the audio array.

    • Ensure transcripts are normalized (no unusual characters that the tokenizer might not know, except the emotion tags if used). Also ensure all audio have a consistent sampling rate (resample them if necessary to the target rate the model expects, e.g. 24kHz for Orpheus).
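    Putting Option 2 together, a sketch assuming a hypothetical mydata.csv with audio_path and text columns and a 24kHz target sampling rate:

    from datasets import load_dataset, Audio

    dataset = load_dataset("csv", data_files = "mydata.csv", split = "train")
    # Decode (and resample) the audio files on the fly
    dataset = dataset.cast_column("audio_path", Audio(sampling_rate = 24_000))
    print(dataset[0]["audio_path"])  # now an audio array + sampling rate
    print(dataset[0]["text"])        # the matching transcript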

    In summary, for dataset preparation:

    • You need a list of (audio, text) pairs.

    • Use the HF datasets library to handle loading and optional preprocessing (like resampling).

    • Include any special tags in the text that you want the model to learn (ensure they are in <angle_brackets> format so the model treats them as distinct tokens).

    • (Optional) If multi-speaker, you could include a speaker ID token in the text or use a separate speaker embedding approach, but that’s beyond this basic guide (Elise is single-speaker).

    Fine-Tuning TTS with Unsloth

    Now, let’s start fine-tuning! We’ll illustrate using Python code (which you can run in a Jupyter notebook, Colab, etc.).

    Step 1: Load the Model and Dataset

    In all our TTS notebooks, we enable LoRA (16-bit) training and disable QLoRA (4-bit) training with: load_in_4bit = False. This is so the model can usually learn your dataset better and have higher accuracy.

    If memory is very limited or the dataset is large, you can stream or load it in chunks. Here, 3 hours of audio easily fits in RAM. If using your own dataset CSV, load it similarly.

    Step 2: Advanced - Preprocess the data for training (Optional)

    We need to prepare inputs for the Trainer. For text-to-speech, one approach is to train the model in a causal manner: concatenate text and audio token IDs as the target sequence. However, since Orpheus is a decoder-only LLM that outputs audio, we can feed the text as input (context) and have the audio token ids as labels. In practice, Unsloth’s integration might do this automatically if the model’s config identifies it as text-to-speech. If not, we can do something like:

    The above is a simplification. In reality, to fine-tune Orpheus properly, you would need the audio tokens as part of the training labels. Orpheus’s pre-training likely involved converting audio to discrete tokens (via an audio codec) and training the model to predict those given the preceding text. For fine-tuning on new voice data, you would similarly need to obtain the audio tokens for each clip (using Orpheus’s audio codec). The Orpheus GitHub provides a script for data processing – it encodes audio into sequences of <custom_token_x> tokens.

    However, Unsloth may abstract this away: if the model is a FastModel with an associated processor that knows how to handle audio, it might automatically encode the audio in the dataset to tokens. If not, you’d have to manually encode each audio clip to token IDs (using Orpheus’s codebook). This is an advanced step beyond this guide, but keep in mind that simply using text tokens won’t teach the model the actual audio – it needs to match the audio patterns.

    Let's assume Unsloth provides a way to feed audio directly (for example, by setting processor and passing the audio array). If Unsloth does not yet support automatic audio tokenization, you might need to use the Orpheus repository’s encode_audio function to get token sequences for the audio, then use those as labels. (The dataset entries do have phonemes and some acoustic features which suggests a pipeline.)

    Step 3: Set up training arguments and Trainer
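    One possible setup is sketched below, assuming the TRL SFTTrainer used in our other examples; the values mirror the discussion that follows and can be adjusted freely:

    from trl import SFTTrainer, SFTConfig

    trainer = SFTTrainer(
        model = model,
        tokenizer = tokenizer,
        train_dataset = dataset,
        args = SFTConfig(
            per_device_train_batch_size = 1,  # keep at 1 to avoid multi-GPU issues
            gradient_accumulation_steps = 4,
            max_steps = 60,                   # short demo run; use num_train_epochs = 1 for a full pass
            learning_rate = 2e-4,
            logging_steps = 50,               # log the loss every 50 steps
            optim = "adamw_8bit",
            output_dir = "outputs",
        ),
    )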

    We do 60 steps to speed things up, but you can set num_train_epochs = 1 for a full run and set max_steps = None to turn the step limit off. Using a per_device_train_batch_size > 1 may lead to errors in a multi-GPU setup; to avoid issues, ensure CUDA_VISIBLE_DEVICES is set to a single GPU (e.g., CUDA_VISIBLE_DEVICES=0). Adjust as needed.

    Step 4: Begin fine-tuning

    This will start the training loop. You should see logs of loss every 50 steps (as set by logging_steps). The training might take some time depending on GPU – for example, on a Colab T4 GPU, a few epochs on 3h of data may take 1-2 hours. Unsloth’s optimizations will make it faster than standard HF training.

    Step 5: Save the fine-tuned model

    After training completes (or if you stop it mid-way when you feel it’s sufficient), save the model. This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!
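    For example (a minimal sketch; the folder name is arbitrary):

    model.save_pretrained("lora_model")       # saves only the LoRA adapter weights
    tokenizer.save_pretrained("lora_model")
    # Or push the adapter to the Hugging Face Hub:
    # model.push_to_hub("your_name/lora_model", token = "hf_...")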

    This saves the model weights (for LoRA, it might save only adapter weights if the base is not fully fine-tuned). If you used --push_model in CLI or trainer.push_to_hub(), you could upload it to Hugging Face Hub directly.

    Now you should have a fine-tuned TTS model in the directory. The next step is to test it out and if supported, you can use llama.cpp to convert it into a GGUF file.

    Fine-tuning Voice models vs. Zero-shot voice cloning

    People say you can clone a voice with just 30 seconds of audio using models like XTTS - no training required. That’s technically true, but it misses the point.

    Zero-shot voice cloning, which is also available in models like Orpheus and CSM, is an approximation. It captures the general tone and timbre of a speaker’s voice, but it doesn’t reproduce the full expressive range. You lose details like speaking speed, phrasing, vocal quirks, and the subtleties of prosody - things that give a voice its personality and uniqueness.

    If you just want a different voice and are fine with the same delivery patterns, zero-shot is usually good enough. But the speech will still follow the model’s style, not the speaker’s.

    For anything more personalized or expressive, you need training with methods like LoRA to truly capture how someone speaks.

    Unsloth
    Read more here

    Temperature = 0.7

    Temperature = 1.0

    Top_P = 0.8

    Top_P = 0.95

    presence_penalty = 1.5

    presence_penalty = 0.0

    Output Length = 32768 (up to 256K)

    Output Length = 40960 (up to 256K)

    Top_K = 20

    Top_K = 20

    • 2B-Instruct

    • 2B-Thinking

    • 4B-Instruct

    • 4B-Thinking

    • 2B-Instruct

    • 2B-Thinking

    • 4B-Instruct

    • 4B-Thinking

    • 2B-Instruct

    • 4B-Instruct

    • 4B-Thinking

    • 8B-Instruct

    on GitHub
    terminate called after throwing an instance of 'std::runtime_error'
      what():  Value is not callable: null at row 63, column 78:
                {%- if '</think>' in content %}
                    {%- set reasoning_content = ((content.split('</think>')|first).rstrip('\n').split('<think>')|last).lstrip('\n') %}
                                                                                 ^
    GitHub here
    https://raw.githubusercontent.com/unslothai/unsloth/refs/heads/main/images/unsloth%20made%20with%20love.png
    https://files.worldwildlife.org/wwfcmsprod/images/Sloth_Sitting_iStock_3_12_2014/story_full_width/8l7pbjmj29_iStock_000011145477Large_mini__1_.jpg
    Normal SFT fine-tuning notebook
    GRPO/GSPO RL notebook
    Standby feature
    VLM GRPO guide
    export greedy='false'
    export seed=3407
    export top_p=0.8
    export top_k=20
    export temperature=0.7
    export repetition_penalty=1.0
    export presence_penalty=1.5
    export out_seq_length=32768
    export greedy='false'
    export seed=1234
    export top_p=0.95
    export top_k=20
    export temperature=1.0
    export repetition_penalty=1.0
    export presence_penalty=0.0
    export out_seq_length=40960
    Devstral-Small-2507-GGUF
    Devstral-Small-2505-GGUF
    Devstral-Small-2507-unsloth-bnb-4bit
    Devstral-Small-2505-unsloth-bnb-4bit

    Tutorial: How to Finetune Llama-3 and Use In Ollama

    Beginner's Guide for creating a customized personal assistant (like ChatGPT) to run locally on Ollama

    By the end of this tutorial, you will create a custom chatbot by finetuning Llama-3 with Unsloth for free. It can run locally via Ollama on your PC, or in a free GPU instance through Google Colab. You will be able to interact with the chatbot interactively like below:

    Unsloth makes finetuning much easier, and can automatically export the finetuned model to Ollama with integrated automatic Modelfile creation! If you need help, you can join our Discord server: https://discord.com/invite/unsloth

    If you’d like to copy or save the code, everything is available in our . You can use it directly there or adapt it for your local setup:

    1. What is Unsloth?

    Unsloth makes finetuning LLMs like Llama-3, Mistral, Phi-3 and Gemma 2x faster, use 70% less memory, and with no degradation in accuracy! We will be using Google Colab which provides a free GPU during this tutorial. You can access our free notebooks below:

    • (notebook which we will be using)

    You will also need to login into your Google account!

    2. What is Ollama?

    Ollama allows you to run language models from your own computer in a quick and simple way! It quietly launches a program which can run a language model like Llama-3 in the background. If you suddenly want to ask the language model a question, you can simply submit a request to Ollama, and it'll quickly return the results to you! We'll be using Ollama as our inference engine!

    3. Install Unsloth

    If you have never used a Colab notebook, a quick primer on the notebook itself:

    1. Play Button at each "cell". Click on this to run that cell's code. You must not skip any cells and you must run every cell in chronological order. If you encounter any errors, simply rerun the cell you did not run before. Another option is to click CTRL + ENTER if you don't want to click the play button.

    2. Runtime Button in the top toolbar. You can also use this button and hit "Run all" to run the entire notebook in 1 go. This will skip all the customization steps, and can be a good first try.

    3. Connect / Reconnect T4 button. You can click here for more advanced system statistics.

    The first installation cell looks like below: Remember to click the PLAY button in the brackets [ ]. We grab our open source Github package, and install some other packages.

    4. Selecting a model to finetune

    Let's now select a model for finetuning! We defaulted to Llama-3 from Meta / Facebook which was trained on a whopping 15 trillion "tokens". Assume a token is like 1 English word. That's approximately 350,000 thick Encyclopedias worth! Other popular models include Mistral, Phi-3 (trained using GPT-4 output) and Gemma from Google (13 trillion tokens!).

    Unsloth supports these models and more! In fact, simply type a model from the Hugging Face model hub to see if it works! We'll error out if it doesn't work.

    There are 3 other settings which you can toggle:

    1. This determines the context length of the model. Gemini for example has over 1 million context length, whilst Llama-3 has 8192 context length. We allow you to select ANY number - but we recommend setting it to 2048 for testing purposes. Unsloth also supports very long context finetuning, and we show we can provide 4x longer context lengths than the best.

    2. Keep this as None, but you can select torch.float16 or torch.bfloat16 for newer GPUs.

    3. We do finetuning in 4 bit quantization. This reduces memory usage by 4x, allowing us to actually do finetuning in a free 16GB memory GPU. 4 bit quantization essentially converts weights into a limited set of numbers to reduce memory usage. A drawback of this is there is a 1-2% accuracy degradation. Set this to False on larger GPUs like H100s if you want that tiny extra accuracy.

    If you run the cell, you will get some print outs of the Unsloth version, which model you are using, how much memory your GPU has, and some other statistics. Ignore this for now.

    5. Parameters for finetuning

    Now to customize your finetune, you can edit the numbers above, but you can ignore it, since we already select quite reasonable numbers.

    The goal is to change these numbers to increase accuracy, but also counteract over-fitting. Over-fitting is when you make the language model memorize a dataset and become unable to answer novel, new questions. We want the final model to answer unseen questions, and not do memorization.

    1. The rank of the finetuning process. A larger number uses more memory and will be slower, but can increase accuracy on harder tasks. We normally suggest numbers like 8 (for fast finetunes), and up to 128. Too large numbers can cause over-fitting, damaging your model's quality.

    2. We select all modules to finetune. You can remove some to reduce memory usage and make training faster, but we strongly advise against this. Just train on all modules!

    3. The scaling factor for finetuning. A larger number will make the finetune learn more about your dataset, but can promote over-fitting. We suggest setting this equal to the rank r, or double it.

    6. Alpaca Dataset

    We will now use the Alpaca Dataset created by calling GPT-4 itself. It is a list of 52,000 instructions and outputs which was very popular when Llama-1 was released, since it made finetuning a base LLM be competitive with ChatGPT itself.

    You can access the GPT4 version of the Alpaca dataset here: . An older first version of the dataset is here: . Below shows some examples of the dataset:

    You can see there are 3 columns in each row - an instruction, an input and an output. We essentially combine each row into 1 large prompt like below. We then use this to finetune the language model, and this makes it very similar to ChatGPT. We call this process supervised instruction finetuning.

    7. Multiple columns for finetuning

    But a big issue is that ChatGPT-style assistants only allow 1 instruction / 1 prompt, and not multiple columns / inputs. For example in ChatGPT, you can see we must submit 1 prompt, and not multiple prompts.

    This essentially means we have to "merge" multiple columns into 1 large prompt for finetuning to actually function!

    For example, the very famous Titanic dataset has many, many columns. Your job is to predict whether a passenger survived or died based on their age, passenger class, fare price, etc. We can't simply pass this into ChatGPT; rather, we have to "merge" this information into 1 large prompt.

    For example, if we ask ChatGPT with our "merged" single prompt which includes all the information for that passenger, we can then ask it to guess or predict whether the passenger has died or survived.

    Other finetuning libraries require you to manually prepare your dataset for finetuning, by merging all your columns into 1 prompt. In Unsloth, we simply provide the function called to_sharegpt which does this in 1 go!

    To access the Titanic finetuning notebook or if you want to upload a CSV or Excel file, go here:

    Now this is a bit more complicated, since we allow a lot of customization, but there are a few points:

    • You must enclose all columns in curly braces {}. These are the column names in the actual CSV / Excel file.

    • Optional text components must be enclosed in [[]]. For example if the column "input" is empty, the merging function will not show the text and skip this. This is useful for datasets with missing values.

    • Select the output or target / prediction column in output_column_name. For the Alpaca dataset, this will be output

    For example in the Titanic dataset, we can create a large merged prompt format like below, where each column / piece of text becomes optional.
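    A sketch of what the to_sharegpt call could look like for such a dataset (the column names and target column here are illustrative):

    from unsloth import to_sharegpt

    dataset = to_sharegpt(
        dataset,
        # Columns go in {curly braces}; optional pieces are wrapped in [[double square brackets]]
        merged_prompt = "[[The passenger embarked from {Embarked}.]]"
                        "[[ Their age is {Age}.]]"
                        "[[ Their fare is {Fare}.]]",
        output_column_name = "Survived",  # illustrative target column
    )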

    For example, pretend the dataset looks like this with a lot of missing data:

    | Embarked | Age | Fare |
    | --- | --- | --- |
    | S | 23 | |
    | | 18 | $7.25 |

    Then, we do not want the result to be:

    1. The passenger embarked from S. Their age is 23. Their fare is EMPTY.

    2. The passenger embarked from EMPTY. Their age is 18. Their fare is $7.25.

    Instead by optionally enclosing columns using [[]], we can exclude this information entirely.

    1. [[The passenger embarked from S.]] [[Their age is 23.]] [[Their fare is EMPTY.]]

    2. [[The passenger embarked from EMPTY.]] [[Their age is 18.]] [[Their fare is $7.25.]]

    becomes:

    1. The passenger embarked from S. Their age is 23.

    2. Their age is 18. Their fare is $7.25.

    8. Multi turn conversations

    A big issue, if you didn't notice, is that the Alpaca dataset is single turn, whilst using ChatGPT is interactive and you can talk to it over multiple turns. For example, the left is what we want, but the right (the Alpaca dataset) only provides singular conversations. We want the finetuned language model to somehow learn how to do multi-turn conversations just like ChatGPT.

    So we introduced the conversation_extension parameter, which essentially selects some random rows in your single-turn dataset and merges them into 1 conversation! For example, if you set it to 3, we randomly select 3 rows and merge them into 1! Setting it too high can make training slower, but could make your chatbot and final finetune much better!

    Then set output_column_name to the prediction / output column. For the Alpaca dataset, it would be the output column.

    We then use the standardize_sharegpt function to just make the dataset in a correct format for finetuning! Always call this!
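    For example (a minimal sketch):

    from unsloth.chat_templates import standardize_sharegpt

    dataset = standardize_sharegpt(dataset)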

    9. Customizable Chat Templates

    We can now specify the chat template for finetuning itself. The very famous Alpaca format is below:

    But remember we said this was a bad idea because ChatGPT style finetunes require only 1 prompt? Since we successfully merged all dataset columns into 1 using Unsloth, we essentially can create the below style chat template with 1 input column (instruction) and 1 output:
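    For instance, an Alpaca-style template reduced to a single instruction and output could look like the sketch below (the exact wording is up to you, and the apply_chat_template helper is shown as an assumption of the notebook flow):

    chat_template = """Below describes a task. Write a response that appropriately completes the request.

    ### Instruction:
    {INPUT}

    ### Response:
    {OUTPUT}"""

    from unsloth import apply_chat_template

    dataset = apply_chat_template(
        dataset,
        tokenizer = tokenizer,
        chat_template = chat_template,
    )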

    We just require that you put an {INPUT} field for the instruction and an {OUTPUT} field for the model's output. We also allow an optional {SYSTEM} field, which is useful to customize a system prompt just like in ChatGPT. For example, below are some cool ways you can customize the chat template:

    For the ChatML format used in OpenAI models:

    Or you can use the Llama-3 template itself (which only functions by using the instruct version of Llama-3). Again, the optional {SYSTEM} field is useful for customizing a system prompt just like in ChatGPT.

    Or in the Titanic prediction task where you had to predict if a passenger died or survived in this Colab notebook which includes CSV and Excel uploading:

    10. Train the model

    Let's train the model now! We normally suggest not editing the numbers below, unless you want to finetune for more steps or train with larger batch sizes.
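    The training setup looks roughly like the sketch below (values mirror the defaults discussed next; treat them as a starting point):

    from trl import SFTTrainer, SFTConfig

    trainer = SFTTrainer(
        model = model,
        tokenizer = tokenizer,
        train_dataset = dataset,
        args = SFTConfig(
            per_device_train_batch_size = 2,
            gradient_accumulation_steps = 4,
            max_steps = 60,        # comment out and set num_train_epochs = 1 for a full run
            learning_rate = 2e-4,
            logging_steps = 1,
            optim = "adamw_8bit",
            output_dir = "outputs",
        ),
    )
    trainer.train()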

    We do not normally suggest changing the parameters above, but to elaborate on some of them:

    1. Increase the batch size if you want to utilize your GPU's memory more. Also increase this to make training smoother and to help the process not over-fit. We normally do not suggest this, since it might actually make training slower due to padding issues. We normally instead ask you to increase gradient_accumulation_steps, which just does more passes over the dataset.

    2. Equivalent to increasing the batch size above, but does not impact memory consumption! We normally suggest increasing this if you want smoother training loss curves.

    3. We set steps to 60 for faster training. For full training runs which can take hours, instead comment out max_steps, and replace it with num_train_epochs = 1

    You’ll see a log of numbers during training. This is the training loss, which shows how well the model is learning from your dataset. For many cases, a loss around 0.5 to 1.0 is a good sign, but it depends on your dataset and task. If the loss is not going down, you might need to adjust your settings. If the loss goes to 0, that could mean overfitting, so it's important to check validation too.

    11. Inference / running the model

    Now let's run the model after we completed the training process! You can edit the yellow underlined part! In fact, because we created a multi turn chatbot, we can now also call the model as if it saw some conversations in the past like below:

    Reminder that Unsloth itself provides 2x faster inference natively as well, so do not forget to call FastLanguageModel.for_inference(model). If you want the model to output longer responses, increase max_new_tokens = 128 to some larger number like 256 or 1024. Note you will have to wait longer for the result as well!

    12. Saving the model

    We can now save the finetuned model as a small 100MB file called a LoRA adapter like below. You can instead push to the Hugging Face hub as well if you want to upload your model! Remember to get a Hugging Face token via and add your token!
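    For example (a minimal sketch; the folder and repository names are placeholders):

    model.save_pretrained("lora_model")
    tokenizer.save_pretrained("lora_model")
    # Or upload to the Hugging Face Hub:
    # model.push_to_hub("your_name/lora_model", token = "hf_...")
    # tokenizer.push_to_hub("your_name/lora_model", token = "hf_...")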

    After saving the model, we can again use Unsloth to run the model itself! Use FastLanguageModel again to call it for inference!

    13. Exporting to Ollama

    Finally we can export our finetuned model to Ollama itself! First we have to install Ollama in the Colab notebook:

    Then we export the finetuned model we have to llama.cpp's GGUF formats like below:

    Reminder to convert False to True for 1 row only, and not to change every row to True, or else you'll be waiting for a very long time! We normally suggest setting the first row to True, so we can export the finetuned model quickly to Q8_0 format (8-bit quantization). We also allow you to export to a whole list of quantization methods, with a popular one being q4_k_m.
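    In code, the export looks roughly like this (a sketch; the output folder name is a placeholder):

    # Export the finetuned model to GGUF; "q8_0" is fast, "q4_k_m" is a popular smaller choice
    model.save_pretrained_gguf("model", tokenizer, quantization_method = "q8_0")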

    Head over to to learn more about GGUF. We also have some manual instructions of how to export to GGUF if you want here:

    You will see a long list of text like below - please wait 5 to 10 minutes!!

    And finally at the very end, it'll look like below:

    Then, we have to run Ollama itself in the background. We use subprocess because Colab doesn't like asynchronous calls, but normally one just runs ollama serve in the terminal / command prompt.
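In a notebook that usually amounts to a one-line background process, roughly like this:

import subprocess, time

# Equivalent to typing `ollama serve` in a terminal, but kept in the background.
ollama_server = subprocess.Popen(["ollama", "serve"])
time.sleep(3)  # give the server a few seconds to start before sending requests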

    14. Automatic Modelfile creation

The trick Unsloth provides is that we automatically create a Modelfile, which Ollama requires! This is just a list of settings and includes the chat template which we used for the finetuning process! You can also print the generated Modelfile like below:

    We then ask Ollama to create a model which is Ollama compatible, by using the Modelfile

    15. Ollama Inference

We can now call the model for inference by calling the Ollama server itself, which is running on your own local machine / in the free Colab notebook in the background. Remember you can edit the yellow underlined part.

    16. Interactive ChatGPT style

    But to actually run the finetuned model like a ChatGPT, we have to do a bit more! First click the terminal icon and a Terminal will pop up. It's on the left sidebar.

    Then, you might have to press ENTER twice to remove some weird output in the Terminal window. Wait a few seconds and type ollama run unsloth_model then hit ENTER.

    And finally, you can interact with the finetuned model just like an actual ChatGPT! Hit CTRL + D to exit the system, and hit ENTER to converse with the chatbot!

    You've done it!

    You've successfully finetuned a language model and exported it to Ollama with Unsloth 2x faster and with 70% less VRAM! And all this for free in a Google Colab notebook!

If you want to learn how to do reward modelling, do continued pretraining, export to vLLM or GGUF, do text completion, or learn more about finetuning tips and tricks, head over to our documentation.

If you need any help with finetuning, you can also join our Discord server. If you want help with Ollama, you can also join their server.

    And finally, we want to thank you for reading and following this far! We hope this made you understand some of the nuts and bolts behind finetuning language models, and we hope this was useful!

To access our Alpaca dataset example, click here; our CSV / Excel finetuning guide is here.

    LoRA Hyperparameters Guide

    Optimal lora rank. alpha, number of epochs, batch size & gradient accumulation, QLoRA vs LoRA, target modules and more!

    LoRA hyperparameters are adjustable parameters that control how Low-Rank Adaptation (LoRA) fine-tunes LLMs. With many options (such as learning rate and epochs) and millions of possible combinations, selecting the right values is crucial for achieving accuracy, stability, quality, and fewer hallucinations during fine-tuning.

    You'll learn the best practices for these parameters, based on insights from hundreds of research papers and experiments, and see how they impact the model. While we recommend using Unsloth's defaults, understanding these concepts will give you full control. The goal is to change hyperparameter numbers to increase accuracy while counteracting overfitting or underfitting. Overfitting occurs when the model memorizes the training data, harming its ability to generalize to new, unseen inputs. The objective is a model that generalizes well, not one that simply memorizes.

    ❓But what is LoRA?

In LLMs, we have model weights: Llama 70B, for example, has 70 billion of them. Instead of updating all 70B numbers, LoRA adds thin matrices A and B next to each weight matrix and optimizes only those, so we end up optimizing roughly 1% of the weights.
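As a toy illustration of the idea (the sizes below are arbitrary, not Llama 70B's real dimensions):

import torch

d, r = 4096, 16                 # hidden size and LoRA rank (toy values)
W = torch.randn(d, d)           # frozen pretrained weight - never updated
A = torch.randn(r, d) * 0.01    # thin trainable matrix A
B = torch.zeros(d, r)           # thin trainable matrix B (starts at zero)

W_effective = W + B @ A         # what the adapted layer effectively computes

trainable = A.numel() + B.numel()
print(f"Trainable: {trainable:,} of {W.numel():,} weights "
      f"({100 * trainable / W.numel():.2f}%)")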

    🔢 Key Fine-tuning Hyperparameters

    Learning Rate

    Defines how much the model’s weights are adjusted during each training step.

    • Higher Learning Rates: Lead to faster initial convergence but can cause training to become unstable or fail to find an optimal minimum if set too high.

    • Lower Learning Rates: Result in more stable and precise training but may require more epochs to converge, increasing overall training time. While low learning rates are often thought to cause underfitting, they actually can lead to overfitting or even prevent the model from learning.

• Typical Range: 2e-4 (0.0002) to 5e-6 (0.000005). 🟩

• For normal LoRA/QLoRA fine-tuning, we recommend 2e-4 as a starting point. 🟦

• For Reinforcement Learning (DPO, GRPO etc.), we recommend 5e-6. ⬜

• For full fine-tuning, lower learning rates are generally more appropriate.

    Epochs

    The number of times the model sees the full training dataset.

    • More Epochs: Can help the model learn better, but a high number can cause it to memorize the training data, hurting its performance on new tasks.

    • Fewer Epochs: Reduces training time and can prevent overfitting, but may result in an undertrained model if the number is insufficient for the model to learn the dataset's underlying patterns.

    • Recommended: 1-3 epochs. For most instruction-based datasets, training for more than 3 epochs offers diminishing returns and increases the risk of overfitting.

    LoRA or QLoRA

    LoRA uses 16-bit precision, while QLoRA is a 4-bit fine-tuning method.

    • LoRA: 16-bit fine-tuning. It's slightly faster and slightly more accurate, but consumes significantly more VRAM (4× more than QLoRA). Recommended for 16-bit environments and scenarios where maximum accuracy is required.

    • QLoRA: 4-bit fine-tuning. Slightly slower and marginally less accurate, but uses much less VRAM (4× less). 🦥 70B LLaMA fits in <48GB VRAM with QLoRA in Unsloth - .

    Hyperparameters & Recommendations:

Hyperparameter | Function | Recommended Settings

    🌳 Gradient Accumulation and Batch Size equivalency

    Effective Batch Size

    Correctly configuring your batch size is critical for balancing training stability with your GPU's VRAM limitations. This is managed by two parameters whose product is the Effective Batch Size. Effective Batch Size = batch_size * gradient_accumulation_steps

    • A larger Effective Batch Size generally leads to smoother, more stable training.

    • A smaller Effective Batch Size may introduce more variance.

    While every task is different, the following configuration provides a great starting point for achieving a stable Effective Batch Size of 16, which works well for most fine-tuning tasks on modern GPUs.

Parameter | Description | Recommended Setting

    The VRAM & Performance Trade-off

Assume you want 32 samples of data per training step. Then you can use any of the following configurations:

• batch_size = 32, gradient_accumulation_steps = 1

• batch_size = 16, gradient_accumulation_steps = 2

• batch_size = 8, gradient_accumulation_steps = 4

• batch_size = 4, gradient_accumulation_steps = 8

• batch_size = 2, gradient_accumulation_steps = 16

• batch_size = 1, gradient_accumulation_steps = 32

While all of these are equivalent for the model's weight updates, they have vastly different hardware requirements.

The first configuration (batch_size = 32) uses the most VRAM and will likely fail on most GPUs. The last configuration (batch_size = 1) uses the least VRAM, at the cost of slightly slower training. To avoid OOM (out of memory) errors, always prefer to set a smaller batch_size and increase gradient_accumulation_steps to reach your target Effective Batch Size.

    🦥 Unsloth Gradient Accumulation Fix

    Gradient accumulation and batch sizes are now fully equivalent in Unsloth due to our bug fixes for gradient accumulation. We have implemented specific bug fixes for gradient accumulation that resolve a common issue where the two methods did not produce the same results. This was a known challenge in the wider community, but for Unsloth users, the two methods are now interchangeable.

See our blog post on the gradient accumulation bug fix for more details.

    Prior to our fixes, combinations of batch_size and gradient_accumulation_steps that yielded the same Effective Batch Size (i.e., batch_size × gradient_accumulation_steps = 16) did not result in equivalent training behavior. For example, configurations like b1/g16, b2/g8, b4/g4, b8/g2, and b16/g1 all have an Effective Batch Size of 16, but as shown in the graph, the loss curves did not align when using standard gradient accumulation:

    After applying our fixes, the loss curves now align correctly, regardless of how the Effective Batch Size of 16 is achieved:

    🦥 LoRA Hyperparameters in Unsloth

    The following demonstrates a standard configuration. While Unsloth provides optimized defaults, understanding these parameters is key to manual tuning.

1. The rank (r) of the fine-tuning process. A larger rank uses more memory and will be slower, but can increase accuracy on complex tasks. We suggest ranks like 8 or 16 (for fast fine-tunes) and up to 128. Using a rank that is too large can cause overfitting and harm your model's quality.

2. For optimal performance, LoRA should be applied to all major linear layers. Research shows that targeting all major layers is crucial for matching the performance of full fine-tuning. While it's possible to remove modules to reduce memory usage, we strongly advise against it to preserve maximum quality, as the savings are minimal.

3. A scaling factor that controls the strength of the fine-tuned adjustments. Setting it equal to the rank (r) is a reliable baseline. A popular and effective heuristic is to set it to double the rank (r * 2), which makes the model learn more aggressively by giving more weight to the LoRA updates.

4. A regularization technique that helps prevent overfitting by randomly setting a fraction of the LoRA activations to zero during each training step. Recent research suggests that for the short training runs common in fine-tuning, lora_dropout may be an unreliable regularizer. 🦥 Unsloth's internal code can optimize training when lora_dropout = 0, making it slightly faster, but we recommend a non-zero value if you suspect overfitting.

5. Leave this as "none" for faster training and reduced memory usage. This setting avoids training the bias terms in the linear layers, which adds trainable parameters for little to no practical gain.

6. Options are True, False, and "unsloth". 🦥 We recommend "unsloth" as it reduces memory usage by an extra 30% and supports extremely long context fine-tunes. You can read more on our blog post about long context training.

7. The seed to ensure deterministic, reproducible runs. Training involves random numbers, so setting a fixed seed is essential for consistent experiments.

8. An advanced feature that implements Rank-Stabilized LoRA. If set to True, the effective scaling becomes lora_alpha / sqrt(r) instead of the standard lora_alpha / r. This can sometimes improve stability, particularly for higher ranks. More details here.

9. An advanced technique, as proposed in LoftQ, that initializes LoRA matrices with the top 'r' singular vectors from the pretrained weights. This can improve accuracy but may cause a significant memory spike at the start of training.
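Putting the list above together, a typical call to FastLanguageModel.get_peft_model looks roughly like this (the values are common defaults from our notebooks, not a prescription; model is assumed to come from FastLanguageModel.from_pretrained):

from unsloth import FastLanguageModel

model = FastLanguageModel.get_peft_model(
    model,
    r = 16,                                   # 1. LoRA rank
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],   # 2. all major linear layers
    lora_alpha = 16,                          # 3. scaling factor (= r here)
    lora_dropout = 0,                         # 4. 0 is fastest
    bias = "none",                            # 5. don't train bias terms
    use_gradient_checkpointing = "unsloth",   # 6. extra ~30% memory savings
    random_state = 3407,                      # 7. reproducibility
    use_rslora = False,                       # 8. rank-stabilized LoRA off by default
    loftq_config = None,                      # 9. no LoftQ initialization
)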

    Verifying LoRA Weight Updates:

When validating that LoRA adapter weights have been updated after fine-tuning, avoid using np.allclose() for comparison. This method can miss subtle but meaningful changes, particularly in LoRA A, which is initialized with small Gaussian values. These changes may not register as significant under loose numerical tolerances. Thanks to the community member who contributed this section.

    To reliably confirm weight updates, we recommend:

    • Using checksum or hash comparisons (e.g., MD5)

    • Computing the sum of absolute differences between tensors

    • Inspecting tensor statistics (e.g., mean, variance) manually

    • Or using np.array_equal() if exact equality is expected
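As an example, a quick sketch of such checks (model is assumed to be your LoRA-wrapped model; the "lora_" key filter is an assumption that matches typical PEFT naming):

import hashlib
import torch

def tensor_md5(t: torch.Tensor) -> str:
    # Hash the raw tensor bytes for an exact "did anything change?" check
    return hashlib.md5(t.detach().cpu().float().numpy().tobytes()).hexdigest()

before = {k: v.detach().clone() for k, v in model.state_dict().items() if "lora_" in k}
# ... run trainer.train() here ...
after = {k: v for k, v in model.state_dict().items() if "lora_" in k}

for name, old in before.items():
    new = after[name]
    print(name,
          "changed:", tensor_md5(old) != tensor_md5(new),
          "sum |diff|:", (new - old).abs().sum().item())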

    📐LoRA Alpha and Rank relationship

    It's best to set lora_alpha = 2 * lora_rank or lora_alpha = lora_rank

In the LoRA formula, the update from the thin matrices A and B is scaled by alpha divided by the rank, so we should keep alpha/rank at least 1.

According to the Rank-Stabilized LoRA (rsLoRA) paper, we should instead scale alpha by the square root of the rank. Other options exist, but theoretically this is the optimum; the paper compares different ranks and their perplexities (lower is better). To enable this, set use_rslora = True in Unsloth.

    Our recommendation is to set the alpha to equal to the rank, or at least 2 times the rank. This means alpha/rank = 1 or 2.
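In code, the two scaling rules simply differ in the denominator (a small illustrative comparison):

import math

r, alpha = 16, 32
standard_scale = alpha / r             # classic LoRA scaling (alpha / rank)
rslora_scale   = alpha / math.sqrt(r)  # rank-stabilized LoRA, i.e. use_rslora = True
print(standard_scale, rslora_scale)    # 2.0 vs 8.0 for this rank/alpha pair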

    🎯 LoRA Target Modules and QLoRA vs LoRA

    Use: target_modules = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj",] to target both MLP and attention layers to increase accuracy.

    QLoRA uses 4-bit precision, reducing VRAM usage by over 75%.

    LoRA (16-bit) is slightly more accurate and faster.

According to empirical experiments and research papers like the original QLoRA paper, it's best to apply LoRA to both attention and MLP layers.

    The chart shows RougeL scores (higher is better) for different target module configurations, comparing LoRA vs QLoRA.

    The first 3 dots show:

    1. QLoRA-All: LoRA applied to all FFN/MLP and Attention layers. 🔥 This performs best overall.

2. QLoRA-FFN: LoRA applied only to the FFN/MLP layers. Equivalent to: gate_proj, up_proj, down_proj.

3. QLoRA-Attention: LoRA applied only to the Attention layers. Equivalent to: q_proj, k_proj, v_proj, o_proj.

    😎 Training on completions only, masking out inputs

The QLoRA paper shows that masking out inputs and training only on completions (outputs or assistant messages) can further increase accuracy by a few percentage points (around 1%). Below demonstrates how this is done in Unsloth:

NOT training on completions only (the loss is computed on every token, including the user inputs):

USER: Hello what is 2+2? ASSISTANT: The answer is 4. USER: Hello what is 3+3? ASSISTANT: The answer is 6.

Training on completions only (the USER inputs are masked out of the loss, so only the ASSISTANT responses are learned):

USER: Hello what is 2+2? ASSISTANT: The answer is 4. USER: Hello what is 3+3? ASSISTANT: The answer is 6.

The QLoRA paper states that training on completions only increases accuracy by quite a bit, especially for multi-turn conversational finetunes! We do this in our conversational notebooks.

    To enable training on completions in Unsloth, you will need to define the instruction and assistant parts. 🦥 We plan to further automate this for you in the future!

    For Llama 3, 3.1, 3.2, 3.3 and 4 models, you define the parts as follows:

    For Gemma 2, 3, 3n models, you define the parts as follows:
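As a sketch, this is typically done with Unsloth's train_on_responses_only helper; the marker strings below are the Llama 3 and Gemma chat-template turn headers and may need adjusting for your exact template (trainer is your existing SFT trainer):

from unsloth.chat_templates import train_on_responses_only

# Llama 3 / 3.1 / 3.2 / 3.3 style templates:
trainer = train_on_responses_only(
    trainer,
    instruction_part = "<|start_header_id|>user<|end_header_id|>\n\n",
    response_part    = "<|start_header_id|>assistant<|end_header_id|>\n\n",
)

# Gemma 2 / 3 / 3n style templates would instead use:
#   instruction_part = "<start_of_turn>user\n",
#   response_part    = "<start_of_turn>model\n",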

    🔑 Avoiding Overfitting & Underfitting

    Overfitting (Poor Generalization/Too Specialized)

    The model memorizes the training data, including its statistical noise, and consequently fails to generalize to unseen data.

    If your training loss drops below 0.2, your model is likely overfitting — meaning it may perform poorly on unseen tasks.

    One simple trick is LoRA alpha scaling — just multiply the alpha value of each LoRA matrix by 0.5. This effectively scales down the impact of fine-tuning.

    This is closely related to merging / averaging weights. You can take the original base (or instruct) model, add the LoRA weights, then divide the result by 2. This gives you an averaged model — which is functionally equivalent to reducing the alpha by half.
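A rough sketch of that averaging idea on raw state dicts (purely illustrative; it assumes you already merged the LoRA into merged_state and kept the original weights in base_state, both PyTorch state dicts):

def average_state_dicts(base_state, merged_state, weight = 0.5):
    # weight = 0.5 averages the base model and the merged finetune,
    # which roughly halves the effect of the LoRA update (like halving alpha).
    return {
        k: weight * merged_state[k].float() + (1 - weight) * base_state[k].float()
        for k in merged_state
    }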

    Solution:

    • Adjust the learning rate: A high learning rate often leads to overfitting, especially during short training runs. For longer training, a higher learning rate may work better. It’s best to experiment with both to see which performs best.

    • Reduce the number of training epochs. Stop training after 1, 2, or 3 epochs.

• Increase weight_decay. A value of 0.01 or 0.1 is a good starting point.

• Increase lora_dropout. Use a value like 0.1 to add regularization.

• Increase the batch size or gradient accumulation steps.

• Dataset expansion - make your dataset larger by combining or concatenating open source datasets with yours. Choose higher quality ones.

• Evaluation early stopping - enable evaluation and stop training when the evaluation loss increases for a few steps.

• LoRA alpha scaling - scale the alpha down after training and during inference; this will make the finetune less pronounced.

• Weight averaging - literally add the original instruct model and the finetune and divide the weights by 2.

    Underfitting (Too Generic)

    The model fails to capture the underlying patterns in the training data, often due to insufficient complexity or training duration.

    Solution:

    • Adjust the Learning Rate: If the current rate is too low, increasing it may speed up convergence, especially for short training runs. For longer runs, try lowering the learning rate instead. Test both approaches to see which works best.

    • Increase Training Epochs: Train for more epochs, but monitor validation loss to avoid overfitting.

• Increase LoRA Rank (r) and alpha: the rank should be at least equal to alpha, and larger for smaller models or more complex datasets; it usually sits between 4 and 64.

• Use a more domain-relevant dataset: ensure the training data is high-quality and directly relevant to the target task.

• Decrease the batch size to 1. This will cause the model to update more vigorously.

Fine-tuning has no single "best" approach, only best practices. Experimentation is key to finding what works for your specific needs. Our notebooks automatically set optimal parameters based on research from many papers and our own experiments, giving you a great starting point. Happy fine-tuning!

    Acknowledgements: A huge thank you to for contributing to this guide!

    Gemma 3n: How to Run & Fine-tune

    Run Google's new Gemma 3n locally with Dynamic GGUFs on llama.cpp, Ollama, Open WebUI and fine-tune with Unsloth!

    Google’s Gemma 3n multimodal model handles image, audio, video, and text inputs. Available in 2B and 4B sizes, it supports 140 languages for text and multimodal tasks. You can now run and fine-tune Gemma-3n-E4B and E2B locally using Unsloth.

    Fine-tune Gemma 3n with our free Colab notebook

    Gemma 3n has 32K context length, 30s audio input, OCR, auto speech recognition (ASR), and speech translation via prompts.

Running Tutorial | Fine-tuning Tutorial | Fixes + Technical Analysis

    Unsloth Gemma 3n (Instruct) uploads with optimal configs:

    Dynamic 2.0 GGUF (text only)
    Dynamic 4-bit Instruct (to fine-tune)
    16-bit Instruct

See all our Gemma 3n uploads, including base models and more formats, in our Hugging Face collection.

    🖥️ Running Gemma 3n

    Currently Gemma 3n is only supported in text format for inference.

    We’ve with GGUFs not working properly in Ollama only. Please redownload if using Ollama.

    ⚙️ Official Recommended Settings

    According to the Gemma team, the official recommended settings for inference:

    temperature = 1.0, top_k = 64, top_p = 0.95, min_p = 0.0

    • Temperature of 1.0

    • Top_K of 64

    • Min_P of 0.00 (optional, but 0.01 works well, llama.cpp default is 0.1)

    • Top_P of 0.95

llama.cpp and other inference engines auto-add a <bos> token - DO NOT add TWO <bos> tokens! You should omit the <bos> when prompting the model!

    🦙 Tutorial: How to Run Gemma 3n in Ollama

Please re-download the Gemma 3N quants or remove the old ones via Ollama, since there are some bug fixes. You can do the below to delete the old file and refresh it:

1. Install ollama if you haven't already!

2. Run the model! Note you can call ollama serve in another terminal if it fails! We include all our fixes and suggested parameters (temperature etc.) in the params file of our Hugging Face upload!

    📖 Tutorial: How to Run Gemma 3n in llama.cpp

    We would first like to thank from Hugging Face, from the llama.cpp team on making Gemma 3N work in llama.cpp!

1. Obtain the latest llama.cpp from GitHub. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.

2. If you want to use llama.cpp directly to load models, you can do the below: (:Q4_K_XL) is the quantization type. You can also download via Hugging Face (point 3). This is similar to ollama run

3. OR download the model via huggingface_hub (after installing pip install huggingface_hub hf_transfer). You can choose Q4_K_M, or other quantized versions (like BF16 full precision).

4. Run the model.

5. Edit --threads 32 for the number of CPU threads, --ctx-size 32768 for context length (Gemma 3 supports 32K context length!), --n-gpu-layers 99 for GPU offloading on how many layers. Try adjusting it if your GPU goes out of memory. Also remove it if you have CPU-only inference.

6. For conversation mode:

7. For non-conversation mode to test Flappy Bird:

    Remember to remove <bos> since Gemma 3N auto adds a <bos>!

    🦥 Fine-tuning Gemma 3n with Unsloth

Gemma 3n, like Gemma 3, had issues running on float16 GPUs such as Tesla T4s in Colab. You will encounter NaNs and infinities if you do not patch Gemma 3n for inference or finetuning.

• Fine-tune Gemma 3n-E4B with our free Colab notebook

• Audio: Fine-tune Gemma 3n-E4B with our free audio Colab notebook

• Vision: Fine-tune Gemma 3n-E4B with our free vision Colab notebook

We also found that, because Gemma 3n's unique architecture reuses hidden states in the vision encoder, it poses another interesting quirk with gradient checkpointing (see below).

    Unsloth is the only framework which works in float16 machines for Gemma 3n inference and training. This means Colab Notebooks with free Tesla T4 GPUs also work! Overall, Unsloth makes Gemma 3n training 1.5x faster, 50% less VRAM and 4x longer context lengths.

    Our free Gemma 3n Colab notebooks default to fine-tuning text layers. If you want to fine-tune vision or audio layers too, be aware this will require much more VRAM - beyond the 15GB free Colab or Kaggle provides. You can still fine-tune all layers including audio and vision and Unsloth also lets you fine-tune only specific areas, like just vision. Simply adjust as needed:

    🏆Bonus Content

    We also heard you guys wanted a Vision notebook for Gemma 3 (4B) so here it is:

• Fine-tune Gemma 3 (4B) with Vision support using our free Colab notebook

If you love Kaggle, Google is holding a competition where the best model fine-tuned with Gemma 3n and Unsloth will win a $10K prize!

    🐛Fixes for Gemma 3n

    ✨GGUF issues & fixes

Thanks to discussions from the Ollama team and also from Hugging Face, there were 2 issues we had to fix specifically for GGUFs:

1. The add_shared_kv_layers parameter was accidentally encoded in float32, which is fine, but becomes slightly complicated to decode on Ollama's side - a simple change to uint32 solves the issue, and a fix addressing this has been made.

    2. The per_layer_token_embd layer should be Q8_0 in precision. Anything lower does not function properly and errors out in the Ollama engine - to reduce issues for our community, we made this all Q8_0 in all quants - unfortunately this does use more space.

    ♾️Infinities and NaN gradients and activations

    Gemma 3n just like Gemma 3 has issues on FP16 GPUs (e.g., Tesla T4s in Colab).

Our previous fixes for Gemma 3 are described in our Gemma 3 guide. For Gemma 3, we found that activations exceed float16's maximum range of 65504.

    Gemma 3N does not have this activation issue, but we still managed to encounter infinities!

    To get to the bottom of these infinities, we plotted the absolute maximum weight entries for Gemma 3N, and we see the below:

    We find that the green crosses are the Conv2D convolutional weights. We can see that the magnitude of Conv2D layers is much larger on average.

Below is a table for Conv2D weights which have large magnitudes. Our hypothesis is that during a Conv2D operation, large weights multiply and sum together, and unfortunately by chance exceed float16's maximum range of 65504. Bfloat16 is fine, since its maximum range is around 10^38.

(Table: Conv2D weight names and their maximum absolute values.)
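To see why float16 is the problem, here is a tiny, purely illustrative overflow check in PyTorch:

import torch

# 300 * 300 = 90,000 which exceeds float16's maximum of 65,504 ...
print(torch.tensor(300.0, dtype = torch.float16) * 300)    # -> inf
# ... but is comfortably inside bfloat16's ~3.4e38 range
print(torch.tensor(300.0, dtype = torch.bfloat16) * 300)   # -> ~90,000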

    🎇Solution to infinities

    The naive solution is to upcast all Conv2D weights to float32 (if bfloat16 isn't available). But that would increase VRAM usage. To tackle this, we instead make use of autocast on the fly to upcast the weights and inputs to float32, and so we perform the accumulation in float32 as part of the matrix multiplication itself, without having to upcast the weights.
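A minimal sketch of the idea (not Unsloth's actual implementation): run the convolution in float32 on the fly and cast the result back to the original dtype:

import torch
import torch.nn.functional as F

def conv2d_upcast(conv: torch.nn.Conv2d, x: torch.Tensor) -> torch.Tensor:
    # Upcast weights and inputs only for this op, so the multiply-accumulate
    # happens in float32 and never exceeds float16's 65504 limit.
    out = F.conv2d(
        x.float(), conv.weight.float(),
        conv.bias.float() if conv.bias is not None else None,
        conv.stride, conv.padding, conv.dilation, conv.groups,
    )
    return out.to(x.dtype)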

    Unsloth is the only framework that enables Gemma 3n inference and training on float16 GPUs, so Colab Notebooks with free Tesla T4s work!

    🏁Gradient Checkpointing issues

We found Gemma 3N's vision encoder to be quite unique as well since it re-uses hidden states. This unfortunately limits the usage of gradient checkpointing, which could have reduced VRAM usage significantly, since it cannot be applied to the vision encoder.

    However, we still managed to leverage Unsloth's automatic compiler to optimize Gemma 3N!

    🌵Large losses during finetuning

    We also found losses are interestingly very large during the start of finetuning - in the range of 6 to 7, but they do decrease over time quickly. We theorize this is either because of 2 possibilities:

    1. There might be some implementation issue, but this is unlikely since inference seems to work.

    2. Multi-modal models always seem to exhibit this behavior - we found Llama 3.2 Vision's loss starts at 3 or 4, Pixtral at 8 or so, and Qwen 2.5 VL also 4 ish. Because Gemma 3N includes audio as well, it might amplify the starting loss. But this is just a hypothesis. We also found quantizing Qwen 2.5 VL 72B Instruct to have extremely high perplexity scores of around 30 or so, but the model interestingly performs fine.

Fine-tune Gemma 3n with our free Colab notebook.

    🛠️ Technical Analysis

    Gemma 3n : MatFormer

So what is so special about Gemma 3n, you ask? It is based on the MatFormer architecture, meaning that each transformer layer/block embeds/nests FFNs of progressively smaller sizes. Think of it like progressively smaller cups nested inside one another. Training is done so that at inference time you can choose the size you want and get most of the performance of the bigger model.

    There is also Per Layer Embedding which can be cached to reduce memory usage at inference time. So the 2B model (E2B) is a sub-network inside the 4B (aka 5.44B) model that is achieved by both Per Layer Embedding caching and skipping audio and vision components focusing solely on text.

    The MatFormer architecture, typically is trained with exponentially spaced sub-models aka of sizes S, S/2, S/4, S/8 etc in each of the layers. So at training time, inputs are randomly forwarded through one of the said sub blocks giving every sub block equal chance to learn. Now the advantage is, at inference time, if you want the model to be 1/4th of the original size, you can pick S/4 sized sub blocks in each layer.

You can also choose to Mix and Match, where you pick, say, the S/4 sized sub-block of one layer, the S/2 sized sub-block of another layer, and the S/8 sized sub-block of yet another. In fact, you can change the sub-models you pick based on the input itself if you fancy so. Basically it's like choosing your own structure at every layer. So by just training a model of one particular size, you are creating exponentially many smaller models. No learning goes to waste. Pretty neat huh.

Fine-tune and try multimodal Gemma 3n inference with our free Colab notebook.

    DeepSeek-R1-0528: How to Run Locally

    A guide on how to run DeepSeek-R1-0528 including Qwen3 on your own local device!

    DeepSeek-R1-0528 is DeepSeek's new update to their R1 reasoning model. The full 671B parameter model requires 715GB of disk space. The quantized dynamic 1.66-bit version uses 162GB (-80% reduction in size). GGUF: DeepSeek-R1-0528-GGUF

    DeepSeek also released a R1-0528 distilled version by fine-tuning Qwen3 (8B). The distill achieves similar performance to Qwen3 (235B). You can also fine-tune Qwen3 Distill with Unsloth. Qwen3 GGUF: DeepSeek-R1-0528-Qwen3-8B-GGUF

    All uploads use Unsloth Dynamic 2.0 for SOTA 5-shot MMLU and KL Divergence performance, meaning you can run & fine-tune quantized DeepSeek LLMs with minimal accuracy loss.

    Tutorials navigation:

Run in llama.cpp | Run in Ollama/Open WebUI | Fine-tuning R1-0528

NEW: Huge improvements to tool calling and chat template fixes. New TQ1_0 quant - 162GB in size. Ideal for 192GB RAM (including Mac) and Ollama users. Try: ollama run hf.co/unsloth/DeepSeek-R1-0528-GGUF:TQ1_0

    ⚙️ Recommended Settings

For DeepSeek-R1-0528-Qwen3-8B, the model can pretty much fit in any setup, even those with as little as 20GB of RAM. There is no need for any prep beforehand. However, for the full R1-0528 model, which is 715GB in size, you will need extra prep. The 1.78-bit (IQ1_S) quant will fit in a 1x 24GB GPU (with all layers offloaded). Expect around 5 tokens/s with this setup if you also have 128GB of RAM.

    It is recommended to have at least 64GB RAM to run this quant (you will get 1 token/s without a GPU). For optimal performance you will need at least 180GB unified memory or 180GB combined RAM+VRAM for 5+ tokens/s.

    We suggest using our 2.7bit (Q2_K_XL) or 2.4bit (IQ2_XXS) quant to balance size and accuracy! The 2.4bit one also works well.

    Though not necessary, for the best performance, have your VRAM + RAM combined = to the size of the quant you're downloading.

    🐳 Official Recommended Settings:

According to DeepSeek, these are the recommended settings for R1 (R1-0528 and the Qwen3 distill should use the same settings) inference:

• Set the temperature to 0.6 to reduce repetition and incoherence.

    • Set top_p to 0.95 (recommended)

    • Run multiple tests and average results for reliable evaluation.

    🔢 Chat template/prompt format

    R1-0528 uses the same chat template as the original R1 model. You do not need to force <think>\n , but you can still add it in!

    A BOS is forcibly added, and an EOS separates each interaction. To counteract double BOS tokens during inference, you should only call tokenizer.encode(..., add_special_tokens = False) since the chat template auto adds a BOS token as well. For llama.cpp / GGUF inference, you should skip the BOS since it’ll auto add it:
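For example, when tokenizing in Python with a Transformers tokenizer (as in the first case above), this looks roughly like the following (the model name is shown for illustration):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-0528")
messages = [{"role": "user", "content": "What is 2+2?"}]

# The chat template already inserts the BOS token, so don't add special tokens again:
prompt = tokenizer.apply_chat_template(messages, tokenize = False, add_generation_prompt = True)
input_ids = tokenizer.encode(prompt, add_special_tokens = False)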

    The <think> and </think> tokens get their own designated tokens.

    Model uploads

    ALL our uploads - including those that are not imatrix-based or dynamic, utilize our calibration dataset, which is specifically optimized for conversational, coding, and language tasks.

    • Qwen3 (8B) distill:

    • Full DeepSeek-R1-0528 model uploads below:

    We also uploaded and quants which run specifically faster for ARM and Apple devices respectively.

    MoE Bits
    Type + Link
    Disk Size
    Details

    We've also uploaded versions in , and original .

    Run DeepSeek-R1-0528 Tutorials:

    🦙 Run in Ollama/Open WebUI

1. Install ollama if you haven't already! You can only run models up to 32B in size this way; to run the full 720GB R1-0528 model, see the section below.

2. Run the model! Note you can call ollama serve in another terminal if it fails! We include all our fixes and suggested parameters (temperature etc.) in the params file of our Hugging Face upload!

3. (NEW) To run the full R1-0528 model in Ollama, you can use our TQ1_0 (162GB quant):

    🦙 Run Full R1-0528 on Ollama/Open WebUI

Open WebUI has made a step-by-step tutorial on how to run R1 here; for R1-0528, you just need to replace R1 with the new 0528 quant:

    (NEW) To run the full R1-0528 model in Ollama, you can use our TQ1_0 (162GB quant):

    If you want to use any of the quants that are larger than TQ1_0 (162GB) on Ollama, you need to first merge the 3 GGUF split files into 1 like the code below. Then you will need to run the model locally.

    ✨ Run Qwen3 distilled R1 in llama.cpp

1. To run the full 720GB R1-0528 model, see the next section. Obtain the latest llama.cpp from GitHub. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.

2. Then use llama.cpp directly to download the model:

    ✨ Run Full R1-0528 on llama.cpp

1. Obtain the latest llama.cpp from GitHub. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.

2. If you want to use llama.cpp directly to load models, you can do the below: (:IQ1_S) is the quantization type. You can also download via Hugging Face (point 3). This is similar to ollama run. Use export LLAMA_CACHE="folder" to force llama.cpp to save to a specific location.

    Please try out -ot ".ffn_.*_exps.=CPU" to offload all MoE layers to the CPU! This effectively allows you to fit all non MoE layers on 1 GPU, improving generation speeds. You can customize the regex expression to fit more layers if you have more GPU capacity.

    If you have a bit more GPU memory, try -ot ".ffn_(up|down)_exps.=CPU" This offloads up and down projection MoE layers.

    Try -ot ".ffn_(up)_exps.=CPU" if you have even more GPU memory. This offloads only up projection MoE layers.

3. Download the model via huggingface_hub (after installing pip install huggingface_hub hf_transfer). You can choose UD-IQ1_S (dynamic 1.78-bit quant) or other quantized versions like Q4_K_M. We recommend using our 2.7-bit dynamic quant UD-Q2_K_XL to balance size and accuracy. More quantized versions are available on our Hugging Face page.

4. Run Unsloth's Flappy Bird test as described in our 1.58-bit Dynamic Quant guide for DeepSeek R1.

5. Edit --threads 32 for the number of CPU threads, --ctx-size 16384 for context length, --n-gpu-layers 2 for GPU offloading on how many layers. Try adjusting it if your GPU goes out of memory. Also remove it if you have CPU-only inference.

    🎱 Heptagon Test

You can also test our dynamic quants via the Heptagon Test, which tests the model on creating a basic physics engine to simulate balls rotating in a moving enclosed heptagon shape.

    Full prompt to run the model

    🦥 Fine-tuning DeepSeek-R1-0528 with Unsloth

    To fine-tune DeepSeek-R1-0528-Qwen3-8B using Unsloth, we’ve made a new GRPO notebook featuring a custom reward function designed to significantly enhance multilingual output - specifically increasing the rate of desired language responses (in our example we use Indonesian but you can use any) by more than 40%.

• GRPO notebook for DeepSeek-R1-0528-Qwen3-8B - new

While many reasoning LLMs have multilingual capabilities, they often produce mixed-language outputs in their reasoning traces, combining English with the target language. Our reward function effectively mitigates this issue by strongly encouraging outputs in the desired language, leading to a substantial improvement in language consistency.

    This reward function is also fully customizable, allowing you to adapt it for other languages or fine-tune for specific domains or use cases.

    The best part about this whole reward function and notebook is you DO NOT need a language dataset to force your model to learn a specific language. The notebook has no Indonesian dataset.
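As a purely illustrative sketch (not the notebook's actual reward function), a language-consistency reward for GRPO could look like this, here using the langdetect package with Indonesian ("id") as the target:

# pip install langdetect
from langdetect import detect

TARGET_LANG = "id"  # Indonesian - swap in any language code

def language_consistency_reward(completions, **kwargs):
    # Reward +1 when the completion is detected as the target language, -1 otherwise.
    rewards = []
    for completion in completions:
        text = completion if isinstance(completion, str) else completion[0]["content"]
        try:
            rewards.append(1.0 if detect(text) == TARGET_LANG else -1.0)
        except Exception:
            rewards.append(-1.0)  # empty / undetectable outputs are penalized
    return rewards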

Unsloth makes R1-Qwen3 distill fine-tuning 2× faster, uses 70% less VRAM, and supports 8× longer context lengths.

    Memory Efficient RL

    We're excited to introduce more efficient reinforcement learning (RL) in Unsloth with multiple algorithmic advancements:

    • 1.2 to 1.7x increased context lengths with no slowdown and no extra memory usage!

    • 10% faster RL training runs with revamped kernels and async data movements

    • 2x faster torch.compile

    Long Context gpt-oss Training

    We’re excited to introduce Unsloth Flex Attention support for OpenAI gpt-oss training that enables >8× longer context lengths, >50% less VRAM usage and >1.5× faster training (with no accuracy degradation) vs. all implementations including those using Flash Attention 3 (FA3). Unsloth Flex Attention makes it possible to train with a 60K context length on a 80GB VRAM H100 GPU for BF16 LoRA. Also:

• You can export your QLoRA fine-tuned gpt-oss model to llama.cpp, vLLM, Ollama or HF

• We fixed losses going to infinity on float16 GPUs (like T4 Colab)

    DeepSeek-V3.1: How to Run Locally

    A guide on how to run DeepSeek-V3.1 and Terminus on your own local device!

    DeepSeek’s V3.1 and Terminus update introduces hybrid reasoning inference, combining 'think' and 'non-think' into one model. The full 671B parameter model requires 715GB of disk space. The quantized dynamic 2-bit version uses 245GB (-75% reduction in size). GGUF:

    NEW: DeepSeek-V3.1-Terminus out now: You asked for tougher benchmarks, so we’re showcasing Aider Polyglot results! Our Dynamic 3-bit DeepSeek V3.1 GGUF scores 75.6%, surpassing many full-precision SOTA LLMs.

    Our DeepSeek-V3.1 GGUFs include Unsloth for llama.cpp supported backends.

    Qwen3: How to Run & Fine-tune

    Learn to run & fine-tune Qwen3 locally with Unsloth + our Dynamic 2.0 quants

    Qwen's new Qwen3 models deliver state-of-the-art advancements in reasoning, instruction-following, agent capabilities, and multilingual support.

    NEW! Qwen3 got an update in July 2025. Run & fine-tune the latest model:

All uploads use Unsloth Dynamic 2.0 for SOTA 5-shot MMLU and KL Divergence performance, meaning you can run & fine-tune quantized Qwen LLMs with minimal accuracy loss.

    We also uploaded Qwen3 with native 128K context length. Qwen achieves this by using YaRN to extend its original 40K window to 128K.

Unsloth also now supports fine-tuning and reinforcement learning (RL) of Qwen3 and Qwen3 MOE models — 2x faster, with 70% less VRAM, and 8x longer context lengths. Fine-tune Qwen3 (14B) for free using our Colab notebook.

    DeepSeek-R1: How to Run Locally

    A guide on how you can run our 1.58-bit Dynamic Quants for DeepSeek-R1 using llama.cpp.

Please see our DeepSeek-R1-0528 guide for the updated DeepSeek R1-0528 (May 28th, 2025) version.

    Using llama.cpp (recommended)

    # !pip install huggingface_hub hf_transfer
    import os
    os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "0" # Can sometimes rate limit, so set to 0 to disable
    from huggingface_hub import snapshot_download
    snapshot_download(
        repo_id = "unsloth/GLM-4.6V-Flash-GGUF",
        local_dir = "unsloth/GLM-4.6V-Flash-GGUF",
        allow_patterns = ["*UD-Q8_K_XL*"],
    )
    ./llama.cpp/llama-gguf-split --merge \
      GLM-4.6-GGUF/GLM-4.6-UD-Q2_K_XL/GLM-4.6-UD-Q2_K_XL-00001-of-00003.gguf \
    	merged_file.gguf
    OLLAMA_MODELS=unsloth ollama serve &
    
    OLLAMA_MODELS=unsloth ollama run merged_file.gguf
    # !pip install huggingface_hub hf_transfer
    import os
    os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "0" # Can sometimes rate limit, so set to 0 to disable
    from huggingface_hub import snapshot_download
    snapshot_download(
        repo_id = "unsloth/GLM-4.6-GGUF",
        local_dir = "unsloth/GLM-4.6-GGUF",
        allow_patterns = ["*UD-Q2_K_XL*"], # Dynamic 2bit Use "*UD-TQ1_0*" for Dynamic 1bit
    )
    ./llama.cpp/llama-cli \
        --model unsloth/GLM-4.6-GGUF/UD-Q2_K_XL/GLM-4.6-UD-Q2_K_XL-00001-of-00003.gguf \
        --jinja \
        --threads -1 \
        --n-gpu-layers 99 \
        --temp 1.0 \
        --top-p 0.95 \
        --top-k 40 \
        --ctx-size 16384 \
        --seed 3407 \
        -ot ".ffn_.*_exps.=CPU"
    apt-get update
    apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
    git clone https://github.com/ggerganov/llama.cpp
    cmake llama.cpp -B llama.cpp/build \
        -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
    cmake --build llama.cpp/build --config Release -j --clean-first --target llama-quantize llama-cli llama-gguf-split llama-mtmd-cli llama-server
    cp llama.cpp/build/bin/llama-* llama.cpp
    apt-get update
    apt-get install pciutils -y
    curl -fsSL https://ollama.com/install.sh | sh
    OLLAMA_MODELS=unsloth ollama serve &
    
    OLLAMA_MODELS=unsloth ollama run hf.co/unsloth/GLM-4.6-GGUF:TQ1_0
    apt-get update
    apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
    git clone https://github.com/ggerganov/llama.cpp
    cmake llama.cpp -B llama.cpp/build \
        -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
    cmake --build llama.cpp/build --config Release -j --clean-first --target llama-quantize llama-cli llama-gguf-split llama-mtmd-cli llama-server
    cp llama.cpp/build/bin/llama-* llama.cpp
    ./llama.cpp/llama-server \
        --model unsloth/GLM-4.6-GGUF/GLM-4.6-UD-TQ1_0.gguf \
        --alias "unsloth/GLM-4.6" \
        --threads -1 \
        --n-gpu-layers 999 \
        -ot ".ffn_.*_exps.=CPU" \
        --prio 3 \
        --temp 1.0 \
        --top-p 0.95 \
        --top-k 40 \
        --ctx-size 16384 \
        --port 8001 \
        --jinja
    from openai import OpenAI
    import json
    openai_client = OpenAI(
        base_url = "http://127.0.0.1:8001/v1",
        api_key = "sk-no-key-required",
    )
    completion = openai_client.chat.completions.create(
        model = "unsloth/GLM-4.6",
        messages = [{"role": "user", "content": "What is 2+2?"},],
    )
    print(completion.choices[0].message.content)
    export LLAMA_CACHE="unsloth/GLM-4.6V-Flash-GGUF"
    ./llama.cpp/llama-cli \
        --model GLM-4.6V-Flash-GGUF/UD-Q8_K_XL/GLM-4.6V-Flash-UD-Q8_K_XL.gguf \
        --n-gpu-layers 99 \
        --jinja \
        --ctx-size 16384 \
        --flash-attn on \
        --temp 0.8 \
        --top-p 0.6 \
        --top-k 2 \
        --repeat_penalty 1.1 \
        -ot ".ffn_.*_exps.=CPU"
    export LLAMA_CACHE="unsloth/GLM-4.6-GGUF"
    ./llama.cpp/llama-cli \
        --model GLM-4.6-GGUF/UD-Q2_K_XL/GLM-4.6-UD-Q2_K_XL-00001-of-00003.gguf \
        --n-gpu-layers 99 \
        --jinja \
        --ctx-size 16384 \
        --flash-attn on \
        --temp 1.0 \
        --top-p 0.95 \
        --top-k 40 \
        -ot ".ffn_.*_exps.=CPU"
    from openai import OpenAI
    import json
    openai_client = OpenAI(
        base_url = "http://0.0.0.0:30002/v1",
        api_key = "sk-no-key-required",
    )
    completion = openai_client.chat.completions.create(
        model = "finetuned_model",
        messages = [{"role": "user", "content": "What is 2+2?"},],
    )
    print(completion.choices[0].message.content)
    
    ## OUTPUT ##
    # <|channel|>analysis<|message|>The user asks a simple math question. We should answer 4. Also we should comply with policy. No issues.<|end|><|start|>assistant<|channel|>final<|message|>2 + 2 equals 4.
    hint: This usually indicates a problem with the package or the build environment.
      help: `outlines-core` (v0.1.26) was included because `sglang` (v0.5.5.post2) depends on `outlines` (v0.1.11) which depends on `outlines-core`
    /home/daniel/.cache/flashinfer/0.5.2/100a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_qk_64_head_dim_vo_64_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False/batch_prefill_ragged_kernel_mask_1.cu:1:10: fatal error: flashinfer/attention/prefill.cuh: No such file or directory
        1 | #include <flashinfer/attention/prefill.cuh>
          |          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    compilation terminated.
    ninja: build stopped: subcommand failed.
    
    Possible solutions:
    1. set --mem-fraction-static to a smaller value (e.g., 0.8 or 0.7)
    2. set --cuda-graph-max-bs to a smaller value (e.g., 16)
    3. disable torch compile by not using --enable-torch-compile
    4. disable CUDA graph by --disable-cuda-graph. (Not recommended. Huge performance loss)
    Open an issue on GitHub https://github.com/sgl-project/sglang/issues/new/choose
    python3 -m sglang.launch_server \
        --model-path unsloth/Llama-3.2-1B-Instruct \
        --host 0.0.0.0 --port 30000
    # Install openai via pip install openai
    from openai import OpenAI
    import json
    openai_client = OpenAI(
        base_url = "http://0.0.0.0:30000/v1",
        api_key = "sk-no-key-required",
    )
    completion = openai_client.chat.completions.create(
        model = "unsloth/Llama-3.2-1B-Instruct",
        messages = [{"role": "user", "content": "What is 2+2?"},],
    )
    print(completion.choices[0].message.content)
    from unsloth import FastLanguageModel
    import torch
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "unsloth/gpt-oss-20b",
        max_seq_length = 2048,
        load_in_4bit = True,
    )
    model = FastLanguageModel.get_peft_model(model)
    model.save_pretrained_merged("finetuned_model", tokenizer, save_method = "merged_16bit")
    ## OR to upload to HuggingFace:
    model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")
    model.save_pretrained("finetuned_model")
    tokenizer.save_pretrained("finetuned_model")
    model.save_pretrained_merged("model", tokenizer, save_method = "lora")
    ## OR to upload to HuggingFace
    model.push_to_hub_merged("hf/model", tokenizer, save_method = "lora", token = "")
    model.save_pretrained_merged(
        "finetuned_model", 
        tokenizer, 
        save_method = "merged_16bit",
    )
    ## For gpt-oss specific mxfp4 conversions:
    model.save_pretrained_merged(
        "finetuned_model", 
        tokenizer, 
        save_method = "mxfp4", # (ONLY FOR gpt-oss otherwise choose "merged_16bit")
    )
    python -m sglang.launch_server \
        --model-path unsloth/Llama-3.2-1B-Instruct \
        --host 0.0.0.0 --port 30002 \
        --quantization fp8 \
        --kv-cache-dtype fp8_e4m3
    python -m sglang.launch_server \
        --model-path finetuned_model \
        --host 0.0.0.0 --port 30002
    # Batch Size=8, Input=1024, Output=1024
    python -m sglang.bench_one_batch_server \
        --model finetuned_model \
        --base-url http://0.0.0.0:30002 \
        --batch-size 8 \
        --input-len 1024 \
        --output-len 1024
    import sglang as sgl
    engine = sgl.Engine(model_path = "unsloth/Qwen3-0.6B", random_seed = 42)
    
    prompt = "Today is a sunny day and I like"
    sampling_params = {"temperature": 0, "max_new_tokens": 256}
    outputs = engine.generate(prompt, sampling_params)["text"]
    print(outputs)
    engine.shutdown()
    pip install -e "git+https://github.com/ggml-org/llama.cpp.git#egg=gguf&subdirectory=gguf-py" # install a python package from a repo subdirectory
    from huggingface_hub import hf_hub_download
    model_path = hf_hub_download(
        "unsloth/Qwen3-32B-GGUF",
        filename = "Qwen3-32B-UD-Q4_K_XL.gguf",
    )
    import sglang as sgl
    engine = sgl.Engine(model_path = model_path, random_seed = 42)
    
    prompt = "Today is a sunny day and I like"
    sampling_params = {"temperature": 0, "max_new_tokens": 256}
    outputs = engine.generate(prompt, sampling_params)["text"]
    print(outputs)
    engine.shutdown()
    from huggingface_hub import hf_hub_download
    hf_hub_download("unsloth/Qwen3-32B-GGUF", filename="Qwen3-32B-UD-Q4_K_XL.gguf", local_dir=".")
    python -m sglang.launch_server \
        --model-path Qwen3-32B-UD-Q4_K_XL.gguf \
        --host 0.0.0.0 --port 30002 \
        --served-model-name unsloth/Qwen3-32B \
        --tokenizer-path unsloth/Qwen3-32B
    python -m sglang.launch_server \
        --model-path finetuned_model \
        --host 0.0.0.0 --port 30002
    import os
    os.environ["UNSLOTH_VLLM_STANDBY"] = "1"
    python -m venv unsloth_env
    source unsloth_env/bin/activate
    
    pip install unsloth vllm
    pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu128 --force-reinstall
    pip install --pre fbgemm-gpu fbgemm-gpu-genai --index-url https://download.pytorch.org/whl/cu128 --force-reinstall
    pip install --upgrade numba numpy
    import os
    os.environ['UNSLOTH_VLLM_STANDBY'] = "1" # Unsloth standby saves 30%+ memory for RL
    from unsloth import FastLanguageModel
    import torch
    max_seq_length = 2048 # Can increase for longer reasoning traces
    lora_rank = 32 # Larger rank = smarter, but slower
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "unsloth/Qwen3-8B",
        max_seq_length = max_seq_length,
        load_in_4bit = False, # False for LoRA 16bit
        fast_inference = True, # Enable vLLM fast inference
        max_lora_rank = lora_rank,
        load_in_fp8 = True, # Float8 RL / GRPO!
    )
    from unsloth import FastLanguageModel
    fp8_model = FastLanguageModel.from_pretrained(
        "unsloth/Llama-3.3-70B-Instruct", # Can be any model name!
        load_in_fp8 = True, # Can be "block" for block FP8, True for row FP8, False
    )
    filename,text
    0001.wav,Hello there!
    0002.wav,<sigh> I am very tired.
    from datasets import Audio
    dataset = load_dataset("csv", data_files="mydata.csv", split="train")
    dataset = dataset.cast_column("filename", Audio(sampling_rate=24000))
    from unsloth import FastModel
    
    model_name = "unsloth/orpheus-3b-0.1-pretrained"
    model, tokenizer = FastModel.from_pretrained(
        model_name,
    load_in_4bit=False  # set True to use 4-bit precision (QLoRA); False keeps 16-bit LoRA
    )
    from datasets import load_dataset, Audio
    
    # Load the Elise dataset (e.g., the version with emotion tags)
    dataset = load_dataset("MrDragonFox/Elise", split="train")
    print(len(dataset), "samples")  # ~1200 samples in Elise
    
    # Ensure all audio is at 24 kHz sampling rate (Orpheus’s expected rate)
    dataset = dataset.cast_column("audio", Audio(sampling_rate=24000))
    from unsloth import FastLanguageModel
    import torch
    dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
    load_in_4bit = False # Use 4bit quantization to reduce memory usage. Can be False.
    
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "unsloth/orpheus-3b-0.1-ft",
        max_seq_length= 2048, # Choose any for long context!
        dtype = dtype,
        load_in_4bit = load_in_4bit,
        #token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
    )
    
    from datasets import load_dataset
    dataset = load_dataset("MrDragonFox/Elise", split = "train")
    # Tokenize the text transcripts
    def preprocess_function(example):
        # Tokenize the text (keep the special tokens like <laugh> intact)
        tokens = tokenizer(example["text"], return_tensors="pt")
        # Flatten to list of token IDs
        input_ids = tokens["input_ids"].squeeze(0)
        # The model will generate audio tokens after these text tokens.
        # For training, we can set labels equal to input_ids (so it learns to predict next token).
        # But that only covers text tokens predicting the next text token (which might be an audio token or end).
        # A more sophisticated approach: append a special token indicating start of audio, and let the model generate the rest.
        # For simplicity, use the same input as labels (the model will learn to output the sequence given itself).
        return {"input_ids": input_ids, "labels": input_ids}
    
    train_data = dataset.map(preprocess_function, remove_columns=dataset.column_names)
    from transformers import TrainingArguments,Trainer,DataCollatorForSeq2Seq
    from unsloth import is_bfloat16_supported
    
    trainer = Trainer(
        model = model,
    train_dataset = train_data,  # use the tokenized dataset prepared above
        args = TrainingArguments(
            per_device_train_batch_size = 1,
            gradient_accumulation_steps = 4,
            warmup_steps = 5,
            # num_train_epochs = 1, # Set this for 1 full training run.
            max_steps = 60,
            learning_rate = 2e-4,
            fp16 = not is_bfloat16_supported(),
            bf16 = is_bfloat16_supported(),
            logging_steps = 1,
            optim = "adamw_8bit",
            weight_decay = 0.01,
            lr_scheduler_type = "linear",
            seed = 3407,
            output_dir = "outputs",
            report_to = "none", # Use this for WandB etc
        ),
    )
    model.save_pretrained("lora_model")  # Local saving
    tokenizer.save_pretrained("lora_model")
    # model.push_to_hub("your_name/lora_model", token = "...") # Online saving
    # tokenizer.push_to_hub("your_name/lora_model", token = "...") # Online saving
    apt-get update
    apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
    git clone https://github.com/ggml-org/llama.cpp
    cmake llama.cpp -B llama.cpp/build \
        -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
    cmake --build llama.cpp/build --config Release -j --clean-first
    cp llama.cpp/build/bin/llama-* llama.cpp
    wget https://raw.githubusercontent.com/unslothai/unsloth/refs/heads/main/images/unsloth%20made%20with%20love.png -O unsloth.png
    wget https://files.worldwildlife.org/wwfcmsprod/images/Sloth_Sitting_iStock_3_12_2014/story_full_width/8l7pbjmj29_iStock_000011145477Large_mini__1_.jpg -O picture.png
    ./llama.cpp/llama-mtmd-cli \
        -hf unsloth/Qwen3-VL-8B-Instruct-GGUF:UD-Q4_K_XL \
        --n-gpu-layers 99 \
        --jinja \
        --top-p 0.8 \
        --top-k 20 \
        --temp 0.7 \
        --min-p 0.0 \
        --flash-attn on \
        --presence-penalty 1.5 \
        --ctx-size 8192
    The two images are directly related because they both feature the **tree sloth**, which is the central subject of the "made with unsloth" project.
    
    - The first image is the **official logo** for the "made with unsloth" project. It features a stylized, cartoonish tree sloth character inside a green circle, with the text "made with unsloth" next to it. This is the visual identity of the project.
    - The second image is a **photograph** of a real tree sloth in its natural habitat. This photo captures the animal's physical appearance and behavior in the wild.
    
    The relationship between the two images is that the logo (image 1) is a digital representation or symbol used to promote the "made with unsloth" project, while the photograph (image 2) is a real-world depiction of the actual tree sloth. The project likely uses the character from the logo as an icon or mascot, and the photograph serves to illustrate what the tree sloth looks like in its natural environment.
    # !pip install huggingface_hub hf_transfer
    import os
    os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
    from huggingface_hub import snapshot_download
    snapshot_download(
        repo_id   = "unsloth/Qwen3-VL-8B-Instruct-GGUF", # Or "unsloth/Qwen3-VL-8B-Thinking-GGUF"
        local_dir = "unsloth/Qwen3-VL-8B-Instruct-GGUF", # Or "unsloth/Qwen3-VL-8B-Thinking-GGUF"
        allow_patterns = ["*UD-Q4_K_XL*"],
    )
    ./llama.cpp/llama-mtmd-cli \
        --model unsloth/Qwen3-VL-8B-Instruct-GGUF/Qwen3-VL-8B-Instruct-UD-Q4_K_XL.gguf \
        --mmproj unsloth/Qwen3-VL-8B-Instruct-GGUF/mmproj-F16.gguf \
        --n-gpu-layers 99 \
        --jinja \
        --top-p 0.8 \
        --top-k 20 \
        --temp 0.7 \
        --min-p 0.0 \
        --flash-attn on \
        --presence-penalty 1.5 \
        --ctx-size 8192
    ./llama.cpp/llama-mtmd-cli \
        --model unsloth/Qwen3-VL-8B-Thinking-GGUF/Qwen3-VL-8B-Thinking-UD-Q4_K_XL.gguf \
        --mmproj unsloth/Qwen3-VL-8B-Thinking-GGUF/mmproj-F16.gguf \
        --n-gpu-layers 99 \
        --jinja \
        --top-p 0.95 \
        --top-k 20 \
        --temp 1.0 \
        --min-p 0.0 \
        --flash-attn on \
        --presence-penalty 0.0 \
        --ctx-size 8192
    # !pip install huggingface_hub hf_transfer
    import os
    os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
    from huggingface_hub import snapshot_download
    snapshot_download(
        repo_id = "unsloth/Qwen3-VL-235B-A22B-Instruct-GGUF",
        local_dir = "unsloth/Qwen3-VL-235B-A22B-Instruct-GGUF",
        allow_patterns = ["*UD-Q2_K_XL*"],
    )
    ./llama.cpp/llama-mtmd-cli \
        --model unsloth/Qwen3-VL-235B-A22B-Instruct-GGUF/UD-Q2_K_XL/Qwen3-VL-235B-A22B-Instruct-UD-Q2_K_XL-00001-of-00002.gguf \
        --mmproj unsloth/Qwen3-VL-235B-A22B-Instruct-GGUF/mmproj-F16.gguf \
        --n-gpu-layers 99 \
        --jinja \
        --top-p 0.8 \
        --top-k 20 \
        --temp 0.7 \
        --min-p 0.0 \
        --flash-attn on \
        --presence-penalty 1.5 \
        --ctx-size 8192 \
        -ot ".ffn_.*_exps.=CPU"
    ./llama.cpp/llama-mtmd-cli \
        --model unsloth/Qwen3-VL-235B-A22B-Thinking-GGUF/UD-Q2_K_XL/Qwen3-VL-235B-A22B-Thinking-UD-Q2_K_XL-00001-of-00002.gguf \
        --mmproj unsloth/Qwen3-VL-235B-A22B-Thinking-GGUF/mmproj-F16.gguf \
        --n-gpu-layers 99 \
        --jinja \
        --top-p 0.95 \
        --top-k 20 \
        --temp 1.0 \
        --min-p 0.0 \
        --flash-attn on \
        --presence-penalty 0.0 \
        --ctx-size 8192 \
        -ot ".ffn_.*_exps.=CPU"
    docker model pull hf.co/unsloth/Qwen3-VL-8B-Instruct-GGUF:UD-Q4_K_XL
    docker model run ai/qwen3-vl
    ds_converted = ds.map(
        convert_to_conversation,
    )
ds_converted = [convert_to_conversation(sample) for sample in dataset]

• Leave this as 0 for faster training! It can reduce over-fitting, but not by much.

• Leave this as 0 for faster and less over-fit training!

• Options include True, False and "unsloth". We suggest "unsloth" since we reduce memory usage by an extra 30% and support extremely long context finetunes. You can read up here: https://unsloth.ai/blog/long-context for more details.

• The number to determine deterministic runs. Training and finetuning needs random numbers, so setting this number makes experiments reproducible.

• Advanced feature to set the lora_alpha = 16 automatically. You can use this if you want!

• Advanced feature to initialize the LoRA matrices to the top r singular vectors of the weights. Can improve accuracy somewhat, but can make memory usage explode at the start.

• Setting it to 1 means 1 full pass over your dataset. We normally suggest 1 to 3 passes, and no more, otherwise you will over-fit your finetune.

• Reduce the learning rate if you want to make the finetuning process slower, but also converge to a higher accuracy result most likely. We normally suggest 2e-4, 1e-4, 5e-5, 2e-5 as numbers to try.



• Warmup Steps - Gradually increases the learning rate at the start of training. Recommended: 5-10% of total steps.

• Scheduler Type - Adjusts the learning rate dynamically during training. Recommended: linear or cosine.

• Seed (random_state) - A fixed number to ensure reproducibility of results. Recommended: any integer (e.g., 42, 3407).

• Target Modules - Specify which parts of the model you want to apply LoRA adapters to: the attention layers (q_proj, k_proj, v_proj, o_proj), the MLP layers (gate_proj, up_proj, down_proj), or both. Recommended: target all major linear layers: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj.

  • batch_size = 4, gradient_accumulation_steps = 8

  • batch_size = 2, gradient_accumulation_steps = 16

  • batch_size = 1, gradient_accumulation_steps = 32

  • Setting lora_alpha equal to the rank (r) is a reliable baseline. A popular and effective heuristic is to set it to double the rank (r * 2), which makes the model learn more aggressively by giving more weight to the LoRA updates.
  • A regularization technique that helps prevent overfitting by randomly setting a fraction of the LoRA activations to zero during each training step. Recent research suggests that for the short training runs common in fine-tuning, lora_dropout may be an unreliable regularizer. 🦥 Unsloth's internal code can optimize training when lora_dropout = 0, making it slightly faster, but we recommend a non-zero value if you suspect overfitting.

  • Leave this as "none" for faster training and reduced memory usage. This setting avoids training the bias terms in the linear layers, which adds trainable parameters for little to no practical gain.

  • Options are True, False, and "unsloth". 🦥 We recommend "unsloth" as it reduces memory usage by an extra 30% and supports extremely long context fine-tunes. You can read more on our blog post about long context training.

  • The seed to ensure deterministic, reproducible runs. Training involves random numbers, so setting a fixed seed is essential for consistent experiments.

  • An advanced feature that implements Rank-Stabilized LoRA. If set to True, the effective scaling becomes lora_alpha / sqrt(r) instead of the standard lora_alpha / r. This can sometimes improve stability, particularly for higher ranks. More details here.

  • An advanced technique, as proposed in LoftQ, initializes LoRA matrices with the top 'r' singular vectors from the pretrained weights. This can improve accuracy but may cause a significant memory spike at the start of training.

  • FFN only: LoRA applied only to the FFN/MLP layers. Equivalent to: gate_proj, up_proj, down_proj.
  • QLoRA-Attention: LoRA applied only to Attention layers. Equivalent to: q_proj, k_proj, v_proj, o_proj.

  • Increase lora_dropout. Use a value like 0.1 to add regularization.

  • Increase batch size or gradient accumulation steps.

  • Dataset expansion - make your dataset larger by combining or concatenating open source datasets with your dataset. Choose higher quality ones.

  • Evaluation early stopping - enable evaluation and stop when the evaluation loss increases for a few steps.

  • LoRA Alpha Scaling - scale the alpha down after training and during inference - this will make the finetune less pronounced.

  • Weight averaging - average the original instruct model with your finetune: literally add the two sets of weights and divide by 2 (see the sketch after this list).

  • Use a More Domain-Relevant Dataset: Ensure the training data is high-quality and directly relevant to the target task.

  • Decrease batch size to 1. This will cause the model to update more vigorously.
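A minimal sketch of the weight-averaging trick mentioned above, assuming both checkpoints share the same architecture; the model names are placeholders:

```python
# Hedged sketch: average the original instruct model with your merged finetune.
# "original-instruct-model" and "my-merged-finetune" are placeholder paths.
import torch
from transformers import AutoModelForCausalLM

base  = AutoModelForCausalLM.from_pretrained("original-instruct-model", torch_dtype=torch.float16)
tuned = AutoModelForCausalLM.from_pretrained("my-merged-finetune", torch_dtype=torch.float16)

tuned_state = tuned.state_dict()
averaged = {}
for name, param in base.state_dict().items():
    if param.dtype.is_floating_point:
        averaged[name] = (param + tuned_state[name]) / 2   # add the two models, divide by 2
    else:
        averaged[name] = param                              # leave non-float buffers untouched

base.load_state_dict(averaged)
base.save_pretrained("averaged-model")
```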

| Hyperparameter | Description | Recommended |
|---|---|---|
| LoRA Rank (r) | Controls the number of trainable parameters in the LoRA adapter matrices. A higher rank increases model capacity but also memory usage. | 8, 16, 32, 64 or 128. Choose 16 or 32. |
| LoRA Alpha (lora_alpha) | Scales the strength of the fine-tuned adjustments in relation to the rank (r). | r (standard) or r * 2 (common heuristic). More details here. |
| LoRA Dropout | A regularization technique that randomly sets a fraction of LoRA activations to zero during training to prevent overfitting. Not that useful, so we default set it to 0. | 0 (default) to 0.1 |
| Weight Decay | A regularization term that penalizes large weights to prevent overfitting and improve generalization. Don't use too large numbers! | 0.01 (recommended) to 0.1 |
| Batch Size (batch_size) | The number of samples processed in a single forward/backward pass on one GPU. Primary driver of VRAM usage. Higher values can improve hardware utilization and speed up training, but only if they fit in memory. | 2 |
| Gradient Accumulation (gradient_accumulation_steps) | The number of micro-batches to process before performing a single model weight update. Primary driver of training time. Allows simulation of a larger batch_size to conserve VRAM. Higher values increase training time per epoch. | 8 |
| Effective Batch Size (calculated) | batch_size * gradient_accumulation_steps: the true batch size used for each gradient update. It directly influences training stability, quality, and final model performance. | 4 to 16. Recommended: 16 (from 2 * 8). |
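As a hedged illustration only, here is how the recommendations above might map onto a TRL SFTConfig; the values come from the table, and your exact trainer setup may differ:

```python
# Illustrative mapping of the recommended values above onto TRL's SFTConfig.
from trl import SFTConfig

training_args = SFTConfig(
    per_device_train_batch_size = 2,   # Batch Size
    gradient_accumulation_steps = 8,   # Effective batch size = 2 * 8 = 16
    learning_rate = 2e-4,              # LoRA/QLoRA starting point
    warmup_ratio = 0.1,                # roughly 5-10% of total steps
    lr_scheduler_type = "linear",      # or "cosine"
    weight_decay = 0.01,
    seed = 3407,
    num_train_epochs = 1,              # 1-3 passes to avoid over-fitting
    output_dir = "outputs",
)
```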

$$\hat{W} = W + \frac{\alpha}{\text{rank}} \times AB$$
$$\hat{W}_{\text{rslora}} = W + \frac{\alpha}{\sqrt{\text{rank}}} \times AB$$
    more details here
    Read our blog post
    Research has shown
    contributors
    rsLoRA (rank stabilized lora) paper
    QLoRA paper
    QLoRA paper
    conversational notebooks here
    Eyera
    Instead of optimizing Model Weights (yellow), we optimize 2 thin matrices A and B.
    (Before - Standard Gradient Accumulation)
    (After - 🦥 Unsloth Gradient Accumulation)
    rsLoRA other scaling options. sqrt(r) is the best.


    More details here
    Repetition Penalty of 1.0. (1.0 means disabled in llama.cpp and transformers)
  • Chat template:

  • Chat template with \n newlines rendered (except for the last)

  • As an update, Matt mentioned we can also use Q4_0, Q4_1, Q5_0, Q5_1 for the embeddings - and we confirmed it does also work in Ollama! This means once again the smaller 2, 3 and 4bit quants are smaller in size, and don't need Q8_0!

    • 2B

    • 4B

    • 2B

    • 4B

    • 2B

    • 4B

| Layer | Value |
|---|---|
| msfa.ffn.pw_proj.conv.weight | 98.000000 |
| blocks.2.21.attn.key.down_conv.weight | 37.000000 |
| blocks.2.32.pw_exp.conv.weight | 34.750000 |
| blocks.2.30.pw_exp.conv.weight | 33.750000 |
| blocks.2.34.pw_exp.conv.weight | 33.750000 |

    our collection here
    fixed issues
    Xuan-Son Nguyen
    Georgi Gerganov
    GitHub here
    Gemma 3
    More information below
    free Colab notebook
    Audio only notebook
    Vision only notebook
    Gradient Checkpointing described below
    free Colab notebook
    See more here
    Michael
    Xuan
    Pull request
    discussed here
    Unsloth's gradient checkpointing
    free Colab notebook
    Matryoshka Transformer or MatFormer
    free Colab notebook
    Image from Gemma 3n model overview
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
    learning_rate = 2e-4,
    max_seq_length = 2048
    dtype = None
    load_in_4bit = True
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    per_device_train_batch_size = 2,
    gradient_accumulation_steps = 4,
    max_steps = 60, # num_train_epochs = 1,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
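Putting the fragments above together, here is a hedged end-to-end sketch; the checkpoint name is a placeholder:

```python
# Hedged sketch assembling the parameters shown above; the checkpoint name is an assumption.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3-8b-Instruct-bnb-4bit",  # placeholder checkpoint
    max_seq_length = 2048,
    dtype = None,
    load_in_4bit = True,
)

model = FastLanguageModel.get_peft_model(
    model,
    r = 16,                     # Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 16,
    lora_dropout = 0,           # Supports any, but = 0 is optimized
    bias = "none",              # Supports any, but = "none" is optimized
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
    use_rslora = False,
    loftq_config = None,
)
```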
    from unsloth.chat_templates import train_on_responses_only
    trainer = train_on_responses_only(
        trainer,
        instruction_part = "<|start_header_id|>user<|end_header_id|>\n\n",
        response_part = "<|start_header_id|>assistant<|end_header_id|>\n\n",
    )
    from unsloth.chat_templates import train_on_responses_only
    trainer = train_on_responses_only(
        trainer,
        instruction_part = "<start_of_turn>user\n",
        response_part = "<start_of_turn>model\n",
    )
    <bos><start_of_turn>user\nHello!<end_of_turn>\n<start_of_turn>model\nHey there!<end_of_turn>\n<start_of_turn>user\nWhat is 1+1?<end_of_turn>\n<start_of_turn>model\n
    <bos><start_of_turn>user
    Hello!<end_of_turn>
    <start_of_turn>model
    Hey there!<end_of_turn>
    <start_of_turn>user
    What is 1+1?<end_of_turn>
    <start_of_turn>model\n
    ollama rm hf.co/unsloth/gemma-3n-E4B-it-GGUF:UD-Q4_K_XL
    
    ollama run hf.co/unsloth/gemma-3n-E4B-it-GGUF:UD-Q4_K_XL
    apt-get update
    apt-get install pciutils -y
    curl -fsSL https://ollama.com/install.sh | sh
    ollama run hf.co/unsloth/gemma-3n-E4B-it-GGUF:UD-Q4_K_XL
    apt-get update
    apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
    git clone https://github.com/ggerganov/llama.cpp
    cmake llama.cpp -B llama.cpp/build \
        -DBUILD_SHARED_LIBS=ON -DGGML_CUDA=ON -DLLAMA_CURL=ON
    cmake --build llama.cpp/build --config Release -j --clean-first --target llama-quantize llama-cli llama-gguf-split llama-mtmd-cli
    cp llama.cpp/build/bin/llama-* llama.cpp
    ./llama.cpp/llama-cli -hf unsloth/gemma-3n-E4B-it-GGUF:UD-Q4_K_XL -ngl 99 --jinja
    # !pip install huggingface_hub hf_transfer
    import os
    os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
    from huggingface_hub import snapshot_download
    snapshot_download(
        repo_id = "unsloth/gemma-3n-E4B-it-GGUF",
        local_dir = "unsloth/gemma-3n-E4B-it-GGUF",
        allow_patterns = ["*UD-Q4_K_XL*", "mmproj-BF16.gguf"], # For Q4_K_XL
    )
    ./llama.cpp/llama-cli \
        --model unsloth/gemma-3n-E4B-it-GGUF/gemma-3n-E4B-it-UD-Q4_K_XL.gguf \
        --ctx-size 32768 \
        --n-gpu-layers 99 \
        --seed 3407 \
        --prio 2 \
        --temp 1.0 \
        --repeat-penalty 1.0 \
        --min-p 0.00 \
        --top-k 64 \
        --top-p 0.95
    ./llama.cpp/llama-cli \
        --model unsloth/gemma-3n-E4B-it-GGUF/gemma-3n-E4B-it-UD-Q4_K_XL.gguf \
        --ctx-size 32768 \
        --n-gpu-layers 99 \
        --seed 3407 \
        --prio 2 \
        --temp 1.0 \
        --repeat-penalty 1.0 \
        --min-p 0.00 \
        --top-k 64 \
        --top-p 0.95 \
        -no-cnv \
        --prompt "<start_of_turn>user\nCreate a Flappy Bird game in Python. You must include these things:\n1. You must use pygame.\n2. The background color should be randomly chosen and is a light shade. Start with a light blue color.\n3. Pressing SPACE multiple times will accelerate the bird.\n4. The bird's shape should be randomly chosen as a square, circle or triangle. The color should be randomly chosen as a dark color.\n5. Place on the bottom some land colored as dark brown or yellow chosen randomly.\n6. Make a score shown on the top right side. Increment if you pass pipes and don't hit them.\n7. Make randomly spaced pipes with enough space. Color them randomly as dark green or light brown or a dark gray shade.\n8. When you lose, show the best score. Make the text inside the screen. Pressing q or Esc will quit the game. Restarting is pressing SPACE again.\nThe final game should be inside a markdown section in Python. Check your code for errors and fix them before the final markdown section.<end_of_turn>\n<start_of_turn>model\n"
    model = FastVisionModel.get_peft_model(
        model,
        finetune_vision_layers     = False, # False if not finetuning vision layers
        finetune_language_layers   = True,  # False if not finetuning language layers
        finetune_attention_modules = True,  # False if not finetuning attention layers
        finetune_mlp_modules       = True,  # False if not finetuning MLP layers
    )

| MoE Bits | Disk Size | Details |
|---|---|---|
| 2.42bit | 216GB | 2.5/2.06bit |
| 2.71bit | 251GB | 3.5/2.5bit |
| 3.12bit | 273GB | 3.5/2.06bit |
| 3.5bit | 296GB | 4.5/3.5bit |
| 4.5bit | 384GB | 5.5/4.5bit |
| 5.5bit | 481GB | 6.5/5.5bit |

    And finally offload all layers via -ot ".ffn_.*_exps.=CPU" This uses the least VRAM.

    You can also customize the regex, for example -ot "\.(6|7|8|9|[0-9][0-9]|[0-9][0-9][0-9])\.ffn_(gate|up|down)_exps.=CPU" means to offload gate, up and down MoE layers but only from the 6th layer onwards.

| MoE Bits | Type | Disk Size | Details |
|---|---|---|---|
| 1.66bit | TQ1_0 | 162GB | 1.92/1.56bit |
| 1.78bit | IQ1_S | 185GB | 2.06/1.56bit |
| 1.93bit | IQ1_M | 200GB | |

    TQ1_0 dynamic 1.66-bit quant
    DeepSeek
    DeepSeek-R1-0528-Qwen3-8B-GGUF
    IQ4_NL
    Q4_1
    BF16 format
    FP8 (float8) format
    see here
    docs.openwebui.com/tutorials/integrations/deepseekr1-dynamic/
    see here
    GitHub here
    GitHub here
    https://huggingface.co/unsloth/DeepSeek-R1-0528-GGUF
    r/Localllama
    DeepSeek-R1-0528-Qwen3-8B notebook
    The goal is to make the heptagon spin, and the balls in the heptagon should move.

    2.5/2.06/1.56

    times
    during model loading

    Unsloth already increases RL training speed, context window and reduces VRAM usage by 50–90% vs. all other setups with FA2, but now Unsloth's Standby improves this even further. Our Standby feature uniquely limits speed degradation compared to other implementations and sometimes makes training even faster!

    Now, Qwen3-32B LoRA 16-bit can attain 6,144 context lengths vs 3,600 (1.7x longer) before on 1xH100 80GB GPU. Llama-3.1-8B QLoRA 4bit can attain 47,500 lengths vs 42,000 before (1.13x longer).

    We made RL runs 10% faster through various kernel optimizations, and removed the LoRA communication channel between the CPU and GPU when switching from training to inference mode. Finally, we used custom torch.compile flags to make vLLM's rollout faster by 10%, and reduced compilation time by 2x.

    ✨How to enable optimizations

    To enable Unsloth's Standby feature, set the environment variable UNSLOTH_VLLM_STANDBY before any Unsloth import. Then set gpu_memory_utilization = 0.95 and that's it!
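A hedged sketch of what that looks like in practice; the model name and the "1" value for the flag are assumptions, and the key point is setting the variable before importing Unsloth:

```python
# Hedged sketch: enable Unsloth Standby BEFORE any Unsloth import.
import os
os.environ["UNSLOTH_VLLM_STANDBY"] = "1"   # value assumed; must be set before importing unsloth

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Qwen3-4B-Base",   # placeholder model
    max_seq_length = 2048,
    load_in_4bit = True,
    fast_inference = True,                  # vLLM-backed generation for RL rollouts
    gpu_memory_utilization = 0.95,          # safe to max out with Standby enabled
)
```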

    🎓No more gpu_memory_utilization!

    With Unsloth's new RL improvements, you NEVER have to worry about tuning or setting gpu_memory_utilization ever again - simply set it to 90% or 95% of GPU utilization - 100% sadly won't work since some space is needed for small tensors. Previously one had to tune it from 30% to 95% - no more now! Set it to the maximum and Unsloth will handle the rest!

    ⁉️Why does RL use so much memory?

GRPO (and many RL variants) rely heavily on generation, which is primarily powered by vLLM. But this comes with a steep cost since it requires constant GPU memory for weights, activations, and the KV Cache.

    Inference takes a lot of VRAM

    Whilst Training also uses VRAM!

    This means RL needs to keep 2 sets of VRAM / memory on the GPU at the same time:

    1. Inference engine (has model weights, KV cache)

    2. Training engine (has model weights, activations, gradients, optimizer states)

    Current RL frameworks have to split 50/50 for a 80GB GPU with 50% for inference and 50% for training. And moving weights from training mode to inference mode can take quite some time.

| 80GB GPU | Inference Engine (50%) | Training Engine (50%) |
|---|---|---|
| Model Weights | 16GB | 16GB |
| KV Cache | 24GB | |
| Activations, Gradients, Optimizer States | | 24GB |

Previous Unsloth versions already smartly optimize the above: we share vLLM's weight space directly, which removes the double memory usage of the model weights. This frees up 16GB of space, for example, which can be used to increase context length or the speed of generation. Also, we don't need to do memory movements, which makes training faster.

| 80GB GPU | Inference Engine (50%) | Training Engine (50%) |
|---|---|---|
| Model Weights | 16GB SHARED | <<< SHARED |
| KV Cache | 24GB + 8GB = 32GB | |
| Activations, Gradients, Optimizer States | | 24GB + 8GB = 32GB |

    🦥Unsloth Standby

    But we can go further - we first note RL does inference then training then inference then training etc.

    This means the memory space for inference and training can in theory be re-used, since inference and training are separate modes - this is where vLLM's sleep mode feature comes in, which has 2 options:

    1. level = 1 copies weights to the CPU and deletes KV cache

    2. level = 2 deletes weights and deletes KV cache

But remember that in Unsloth we share vLLM's memory space for the weights. This means we need a new way to delete the KV cache while skipping deletion of the weights, and we call this Unsloth Standby.

| 80GB GPU | Inference Engine | Training Engine |
|---|---|---|
| Model Weights | 16GB SHARED | <<< SHARED |
| Multi-purpose 64GB space | KV Cache | Activations, Gradients, Optimizer States |

    To enable this, simply add the below to all RL / GRPO training runs before any Unsloth import:
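That is, set the Standby flag before importing Unsloth (the exact value is assumed to be "1"):

```python
import os
os.environ["UNSLOTH_VLLM_STANDBY"] = "1"   # set before any Unsloth import

import unsloth  # noqa: F401  (import Unsloth only after the flag is set)
```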

    🧪Performance Experiments

Here you will find out how we benchmarked memory usage and context length for GRPO. Note that we do 2 generations per prompt because for GRPO to work, we need at least 2 generations from which to calculate the sample mean and variance. With only 1 generation, the standard deviation of that single sample is 0, which causes the advantage, computed as (reward - mean)/std, to be undefined.

    This means for GRPO specifically, a maximum context length of 6,144 for Qwen-3 32B is actually 6,144 multiplied by 2 generations ie 12,288 in length.

    We provide experiments for Llama-3.1 8B on both LoRA (16bit) and QLoRA (4bit) below:

Training time differences, if you notice any at all, are small: in our apples-to-apples comparison we measured <1% training time slowdowns or even speedups, which can be attributed to margin of error.

    We also theorize speedups are possible due to reduced memory pressure, so there might be less memory cleanup on the CUDA memory allocator side.

In the above image, you see the difference between baseline and standby mode on a single T4 GPU for Qwen 3 4B. We can stretch vLLM's gpu_memory_utilization to as high as 0.95 without worrying that it'd affect training. This means you can fit higher context length sequences and process more sequences. In the first case, for example, we have enough memory to fit and process 32K length sequences (provided training allows it), whereas previously any inputs longer than 2K would potentially not fit and end up causing OOMs (out of memory).

Experiments

| Config | Status | GPU Memory usage | Comments |
|---|---|---|---|
| standby True, vllm_gpu_util 0.95, num_gen 2, grad_acc_steps 2 | Runs for 40 steps / 40 minutes | 14.5 GiB (set by vllm_gpu_util) | Enough to fit a 32K KV cache with chunks of 2-4K, or say 16K KV cache + 16K chunks |
| standby True, vllm_gpu_util 0.9, num_gen 2, grad_acc_steps 2 | Runs 32 steps in 40 min | 13.8 GiB (set by …) | Approx enough to fit a ~28K KV cache with chunks of 2-4K, or say 15K KV cache + 15K chunks |

    H100 Experiments

| Model | GPU | Seq Len | Num Generations | Grad Acc Steps |
|---|---|---|---|---|
| Qwen2.5-14B-Instruct | NVIDIA H100 80GB PCIe | 32,768 | 8 | 4 |

    In our collapsible results below, you can see there is a 9GiB difference in the peak memory used (note that 90% of the time, the GPU memory usage is equal to the peak memory in our case). To put things into perspective, using TRL and LoRA we were able to only fine-tune an 8B parameter model with a context length of 1024 at max (32x less). Anything with higher sequence length (with similar configuration) results in the process failing with OOM.

    Click for Unsloth Standby Mode vs. no Standby Benchmarks

    The image below shows how standby compares against non standby training with Unsloth. It is averaged over 3 runs to make sure the metrics aren’t noisy. In fact, if you zoom in close enough, you’d see that enabling standby makes it faster as well, probably due to less memory pressure as discussed before.

    Previous A100 40GB experiments

In our previous experiments on an A100 40GB GPU with Qwen-2.5-3B-Instruct and 8 generations per sample, we observed that without standby, GRPO training (model loaded in 16-bit, LoRA, only weights trainable) could only fit 6K sequence lengths. With our standby feature, we were able to fit 10K and beyond! For comparison, TRL can only give you context lengths of up to 1K while holding the same batch size.

    🎉Other optimizations

We now select better compilation flags and reduce compile times by 50% or more. We also managed to dynamically patch any vLLM version to handle gc.collect better for backwards compatibility reasons, as inspired by this vLLM pull request. This reduces compilation times from 2 minutes to under 40 seconds.

    We also optimized torch.compile flags and tried turning on some flags - unfortunately combo_kernels and multi_kernel could not function correctly on vLLM 0.10 and Torch 2.8/2.9 nightly and coordinate_descent_tuning made autotuning all kernels dramatically slower. It used to compile in under a minute, but enabling it took over 13 minutes and more, with minimal performance gains.

    📚GRPO Notebooks

    All our GRPO notebooks have Unsloth Standby on by default and all optimizations! See https://docs.unsloth.ai/get-started/unsloth-notebooks for all our GRPO notebooks, or try the below:

    • Qwen3 (4B) - Advanced GRPO LoRA

    • DeepSeek-R1-0528-Qwen3 (8B) (for multilingual usecases)

    • Gemma 3 (1B)

    • Llama 3.2 (3B) - Advanced GRPO LoRA

$$Z=\frac{r_i - \mu}{\sqrt{\frac{1}{n}\sum(r_i-\mu)^2}}, \qquad Z_{n=1}=\frac{r_1 - \mu}{\sqrt{\frac{1}{1}\sum(r_1-\mu)^2}}=\frac{0}{0}=\text{undefined}$$

We fixed gpt-oss implementation issues unrelated to Unsloth, most notably ensuring that swiglu_limit = 7.0 is properly applied during MXFP4 inference in transformers.

    🦥Introducing Unsloth Flex Attention Support

    With Unsloth's Flex Attention support, a single 80GB VRAM H100 can handle up to 81K context length with QLoRA and 60K context with BF16 LoRA! These gains are applied to BOTH gpt-oss-20b and gpt-oss-120b! The more context length you use, the more gains you'll get from Unsloth Flex Attention:

In comparison, all other non-Unsloth implementations max out at 9K context length on an 80GB GPU, and can only reach 15K context with FA3. But FA3 is unsuitable for gpt-oss training since it lacks backward pass support for attention sinks, so if you were previously using FA3 for gpt-oss training, we recommend not using it for now. Thus, the max context length you can get without Unsloth on 80GB VRAM is ~9K.

    Training with Unsloth Flex Attention delivers at least a 1.3× speedup, with gains growing as context length increases, reaching up to 2× faster. Because Flex Attention scales with context, longer sequences yield bigger savings in both VRAM and training time, as described here.

    A huge thank you to Rohan Pandey for his Flex Attention implementation, which directly inspired the development of Unsloth's Flex Attention implementation.

    🕶️ Attention Sinks

    OpenAI's GPT OSS model uses an alternating pattern of sliding window attention, full attention, sliding window attention and so on (SWA, FA, SWA, FA, etc). Each sliding window only attends to 128 tokens (including the current token), so computation is vastly reduced. However, this also means long context retrieval and reasoning becomes useless due to the small sliding window. Most labs fix this by expanding the sliding window to 2048 or 4096 tokens.

    OpenAI leveraged Attention Sinks from the Efficient Streaming Language Models with Attention Sinks paper which shows that you can use a small sliding window, except you must add a global attention on the first token! The paper provides a good illustration below:

    The paper finds that the attention mechanism seems to assign a lot of weight to the first few tokens (1 to 4), and by removing them during the sliding window operation, these "important" first few tokens disappear, and causes bad long context retrieval.

    If we plot log perplexity (higher is worse), and do long context inference after the pretrained model's set context length, we see the perplexity shoots up (not good). However the red line (uses Attention Sinks) stays low, which is very good!

    The paper also shows that the Attention Is Off By One method does partially work, except one must also add a few extra sink tokens to get lower perplexities. The paper shows that adding a single sink token that is learnable does remarkably well! And that's what OpenAI did for GPT-OSS!

    📐Unsloth's Flex Attention implementation

    Flex Attention https://pytorch.org/blog/flexattention/ is extremely powerful as it provides the practitioner 2 customization routes for the attention mechanism - a score modifier (f) and a masking function (M).

    The score modifier (f) allows us to edit the attention logits before the softmax operation, and the masking function (M) allows us to skip operations if we don't need them (for eg sliding window attention only sees last 128 tokens).

    The trick is Flex Attention provides fast auto generated Triton kernels with arbitrary score modifiers and masking functions!

    σ(s×f(QKT+M))\sigma\bigg(s\times\bold{f}(QK^T+\bold{M})\bigg)σ(s×f(QKT+M))

    This means we can use Flex Attention to implement attention sinks! Implementing a single attention sink is provided both in OpenAI's original GPT-OSS repo and HuggingFace's transformers's implementation.

    The above shows we concatenate the sink at the very end of the Q @ K.T , do the softmax, and remove the last column which was the sink token.

    By using some visualization utilities from Flex Attention's Github repo, we can visualize this. Assume the sequence length was 16, and a sliding window of 5. On the left is the last sink column (default implementation), and on the right is if we move the sink location to index 0 (our implementation).

    Sink location at the end (default)

    Move sink location to index 0

Interesting finding: The official Flex Attention sliding window implementation considers the window size as the number of last tokens PLUS ONE, as it includes the current token. The HuggingFace and GPT-OSS implementations strictly only see the last N tokens. I.e. the below is from https://pytorch.org/blog/flexattention/ and https://github.com/meta-pytorch/attention-gym:

    Default Flex Attention (3+1 tokens)

    HuggingFace, GPT-OSS (3+0 tokens)

    We also confirmed through OpenAI's official GPT-OSS implementation on whether we attend to the last N or N+1 tokens here: https://github.com/openai/gpt-oss/blob/main/gpt_oss/torch/model.py

And we see only the last 3 tokens (not 3+1) are attended to! This means instead of using <= SLIDING_WINDOW, use < SLIDING_WINDOW (i.e. use less-than, not less-than-or-equal).

    Also since we moved the sink token index to the first, we have to add 1 to the q_idx to index correctly:
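A hedged, simplified sketch of such a mask (not Unsloth's actual kernel), assuming the sink token is prepended to K and V at index 0 and using the strict less-than window discussed above:

```python
# Simplified sketch of a sliding-window + sink mask for Flex Attention (assumes torch >= 2.5).
# The sink occupies kv index 0, so KV is one token longer than Q and every query
# position is shifted by +1 before comparing against kv_idx.
import torch
from torch.nn.attention.flex_attention import create_block_mask, flex_attention

SLIDING_WINDOW = 5   # small window for illustration (GPT-OSS uses 128)
S, D = 16, 64        # toy sequence length and head dim

def sink_sliding_window(b, h, q_idx, kv_idx):
    q_pos = q_idx + 1                                   # +1: sink sits at kv index 0
    is_sink   = kv_idx == 0                             # the sink is always attended to
    causal    = q_pos >= kv_idx
    in_window = (q_pos - kv_idx) < SLIDING_WINDOW       # strictly the last N tokens, not N+1
    return is_sink | (causal & in_window)

block_mask = create_block_mask(sink_sliding_window, B=None, H=None, Q_LEN=S, KV_LEN=S + 1)

q = torch.randn(1, 1, S, D)
k = torch.randn(1, 1, S + 1, D)   # sink token prepended to K and V
v = torch.randn(1, 1, S + 1, D)
out = flex_attention(q, k, v, block_mask=block_mask)
```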

    To confirm our index 0 implementation, we verified that the training loss remains consistent with standard Hugging Face runs (without Unsloth Flex Attention), as shown in our graph:

    📜 Mathematical derivation for attention sinks

There is another way to calculate the attention sinks without padding K and V. We first note what the softmax operation does, and we write the 2nd version with the sink (for now a scalar s):

    We can obtain the logsumexp from Flex Attention via return_lse = True , and so we do:

    And we can now easily derive the sink version of attention. We do find however this process has somewhat higher error than the zero padding approach, so we still default to our original version.
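For illustration, a hedged sketch of that logsumexp route; the shapes and the per-head sink parameter are assumptions, and this mirrors the rescaling above rather than Unsloth's exact code:

```python
# Sketch: recover the "sinked" attention by rescaling the regular Flex Attention output
# with sum(exp(x)) / (exp(s) + sum(exp(x))), using the returned logsumexp.
import torch
from torch.nn.attention.flex_attention import flex_attention

B, H, S, D = 1, 8, 16, 64
q, k, v = (torch.randn(B, H, S, D) for _ in range(3))
sinks = torch.zeros(H)                                  # learnable per-head sink logits s

out, lse = flex_attention(q, k, v, return_lse=True)     # lse: (B, H, S) = log sum exp(x_i)
sum_exp = torch.exp(lse)
scale = sum_exp / (torch.exp(sinks)[None, :, None] + sum_exp)
out_sink = out * scale.unsqueeze(-1)                    # A_sink(x) = A(x) * scale
```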

    💾NEW: Saving to GGUF, vLLM after gpt-oss training

    You can now QLoRA fine-tune gpt-oss and directly save, export, or merge the model to llama.cpp, vLLM, or HF - not just Unsloth. We will be releasing a free notebook hopefully soon.

    Previously, any QLoRA fine-tuned gpt-oss model was restricted to running in Unsloth. We’ve removed that limitation by introducing the ability to merge in MXFP4 native format using save_method="mxfp4" and on-demand dequantization of MXFP4 base models (like gpt-oss) making it possible to export your fine-tuned model in bf16 format using save_method="merged_16bit" .

    The MXFP4 native merge format offers significant performance improvements compared to the bf16 format: it uses up to 75% less disk space, reduces VRAM consumption by 50%, accelerates merging by 5-10x, and enables much faster conversion to GGUF format.

    After fine-tuning your gpt-oss model, you can merge it into MXFP4 format with:
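A hedged sketch using Unsloth's merged-save helper; the output directory name is a placeholder, and save_method = "mxfp4" is the new option described above (with "merged_16bit" for bf16):

```python
# Merge the LoRA adapters into the base model and save in native MXFP4 format.
model.save_pretrained_merged("gpt-oss-finetuned-mxfp4", tokenizer, save_method = "mxfp4")
# Or export a bf16 merge instead:
# model.save_pretrained_merged("gpt-oss-finetuned-bf16", tokenizer, save_method = "merged_16bit")
```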

If you prefer to merge the model and push it to the Hugging Face Hub, use:
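Again as a sketch; the repo name and token are placeholders:

```python
model.push_to_hub_merged(
    "your-username/gpt-oss-finetuned-mxfp4",  # placeholder repo id
    tokenizer,
    save_method = "mxfp4",
    token = "hf_...",                          # your Hugging Face write token
)
```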

    To run inference on the merged model, you can use vLLM and Llama.cpp among others. OpenAI recommends these inference settings for both models: temperature=1.0, top_p=1.0, top_k=0

    ✨ Saving to Llama.cpp

    1. Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.

    2. Convert the MXFP4 merged model:

    3. Run inference on the quantized model:

    ✨ Saving to SGLang
    1. Build SGLang from source:\

    2. Launch SGLang server:\

    3. Run inference:\

    ♦️Fine-tuning gpt-oss directly

    We also added support for directly fine-tuning of gpt-oss models by implementing patches that allow loading the native MXFP4 quantized format. This makes it possible to load the 'openai/gpt-oss' model with less than 24GB of VRAM, and QLoRA fine-tune it. Simply load the model using:

    add a Peft layer using FastLanguageModel.get_peft_model and run SFT fine-tuning over the Peft model.
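For illustration, a hedged sketch of that flow; the model id and hyperparameters are assumptions:

```python
# Hedged sketch: load gpt-oss in 4-bit with Unsloth, attach LoRA adapters, then run SFT.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/gpt-oss-20b",   # assumed checkpoint id
    max_seq_length = 2048,
    load_in_4bit = True,                  # QLoRA over the MXFP4 base
)

model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 16,
)
# ...then pass `model` to an SFT trainer as usual.
```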

    🐛Bug Fixes for gpt-oss

    We recently collaborated with Hugging Face to resolve inference issues by using OpenAI’s kernels and ensuring that swiglu_limit = 7.0 is correctly applied during MXFP4 inference.

    Based on user feedback, we discovered that extended QLoRA training runs (beyond 60 steps) could cause the loss to diverge and eventually error out. This issue only occurred on devices that do not support BF16 and instead fall back to F16 (e.g., T4 GPUs). Importantly, it did not impact QLoRA training on A100 or H100 GPUs, nor LoRA training on f16 GPUs.

    After extensive investigation, we’ve now aligned training loss behavior across all GPU setups, including GPUs limited to F16. If you were previously experiencing issues because of this, we recommend using our new updated gpt-oss notebook!

We had to do many, many experiments to make float16's training loss curve match that of bfloat16 machines (blue line). We found the following:

    1. Pure float16 will go to infinity on step 50

    2. We found the down projections in the MoE to have huge outliers

    3. Activations must be saved in bfloat16 or float32

Below shows the absolute magnitude of activations for GPT OSS 20B, and some really do spike - this will overflow on float16 machines since float16's maximum range is 65504.

    We fixed this in Unsloth, so all float16 training works out of the box!

    🔢 Implementations for Sink Attention

    OpenAI's sink token implementation is provided here. We provide it below:

    The HuggingFace transformers implementation is provided here. We also provide it below:

    now export/save
    fixed gpt-oss training
$$A(x) = \frac{\exp(x_i)}{\sum \exp(x_i)}, \qquad A_{\text{sink}}(x) = \frac{\exp(x_i)}{\exp(s) + \sum \exp(x_i)}$$
$$\frac{\exp(x_i)}{\exp(s) + \sum \exp(x_i)} = \frac{\exp(x_i)}{\sum \exp(x_i)} \cdot \frac{\sum \exp(x_i)}{\exp(s) + \sum \exp(x_i)}$$
$$\text{LSE}(x) = \text{logsumexp}(x) = \log \sum \exp(x_i), \qquad \exp(\text{LSE}(x)) = \sum \exp(x_i)$$
    All uploads use Unsloth Dynamic 2.0 for SOTA 5-shot MMLU and KL Divergence performance, meaning you can run & fine-tune quantized DeepSeek LLMs with minimal accuracy loss.

    Tutorials navigation:

    Run in llama.cppRun in Ollama/Open WebUI

    ⚙️ Recommended Settings

    The 1-bit dynamic quant TQ1_0 (1bit for unimportant MoE layers, 2-4bit for important MoE, and 6-8bit for rest) uses 170GB of disk space - this works well in a 1x24GB card and 128GB of RAM with MoE offloading - it also works natively in Ollama!

    You must use --jinja for llama.cpp quants - this uses our fixed chat templates and enables the correct template! You might get incorrect results if you do not use --jinja

The 2-bit quants will fit in a 1x 24GB GPU (with MoE layers offloaded to RAM). Expect around 5 tokens/s with this setup if you also have 128GB of RAM. For optimal performance (5+ tokens/s) it is recommended to have at least 226GB of unified memory, or 226GB of combined RAM+VRAM, for this 2-bit quant. To learn how to increase generation speed and fit longer contexts, read here.

    Though not a must, for best performance, have your VRAM + RAM combined equal to the size of the quant you're downloading. If not, hard drive / SSD offloading will work with llama.cpp, just inference will be slower.

    🦋Chat template bug fixes

    We fixed a few issues with DeepSeek V3.1's chat template since they did not function correctly in llama.cpp and other engines:

    1. DeepSeek V3.1 is a hybrid reasoning model, meaning you can change the chat template to enable reasoning. The chat template introduced thinking = True , but other models use enable_thinking = True . We added the option to use enable_thinking as a keyword instead.

    2. llama.cpp's jinja renderer via minja does not allow the use of extra arguments in the .split() command, so using .split(text, 1) works in Python, but not in minja. We had to change this to make llama.cpp function correctly without erroring out. You will get the following error when using other quants: terminate called after throwing an instance of 'std::runtime_error' what(): split method must have between 1 and 1 positional arguments and between 0 and 0 keyword arguments at row 3, column 1908 We fixed it in all our quants!

    🐳Official Recommended Settings

    According to DeepSeek, these are the recommended settings for V3.1 inference:

  • Set the temperature to 0.6 to reduce repetition and incoherence.

    • Set top_p to 0.95 (recommended)

    • 128K context length or less

    • Use --jinja for llama.cpp variants - we fixed some chat template issues as well!

  • Use enable_thinking = True to use reasoning/thinking mode. By default it's set to non-reasoning.

    🔢 Chat template/prompt format

You do not need to force <think>\n, but you can still add it in!

    A BOS is forcibly added, and an EOS separates each interaction. To counteract double BOS tokens during inference, you should only call tokenizer.encode(..., add_special_tokens = False) since the chat template auto adds a BOS token as well. For llama.cpp / GGUF inference, you should skip the BOS since it’ll auto add it.
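A hedged example of avoiding the double BOS described above; the tokenizer repo id is an assumption:

```python
# Sketch: the chat template already adds <|begin_of_sentence|>, so skip special tokens
# when encoding the rendered prompt yourself.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-V3.1")  # assumed repo id
messages = [{"role": "user", "content": "What is 1+1?"}]

prompt = tokenizer.apply_chat_template(messages, tokenize = False, add_generation_prompt = True)
input_ids = tokenizer.encode(prompt, add_special_tokens = False)        # avoid a second BOS
```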

📔 Non-Thinking Mode (use thinking = False or enable_thinking = False; this is the default)

    First-Turn

    Prefix: <|begin▁of▁sentence|>{system prompt}<|User|>{query}<|Assistant|></think>

    With the given prefix, DeepSeek V3.1 generates responses to queries in non-thinking mode. Unlike DeepSeek V3, it introduces an additional token </think>.

    Multi-Turn

    Context: <|begin▁of▁sentence|>{system prompt}<|User|>{query}<|Assistant|></think>{response}<|end▁of▁sentence|>...<|User|>{query}<|Assistant|></think>{response}<|end▁of▁sentence|>

    Prefix: <|User|>{query}<|Assistant|></think>

    By concatenating the context and the prefix, we obtain the correct prompt for the query.

📚 Thinking Mode (use thinking = True or enable_thinking = True; not the default)

    First-Turn

    Prefix: <|begin▁of▁sentence|>{system prompt}<|User|>{query}<|Assistant|><think>

    The prefix of thinking mode is similar to DeepSeek-R1.

    Multi-Turn

    Context: <|begin▁of▁sentence|>{system prompt}<|User|>{query}<|Assistant|></think>{response}<|end▁of▁sentence|>...<|User|>{query}<|Assistant|></think>{response}<|end▁of▁sentence|>

    Prefix: <|User|>{query}<|Assistant|><think>

The multi-turn template is the same as the non-thinking multi-turn chat template. This means the thinking tokens in the last turn are dropped, but the </think> is retained in every turn of the context.

    🏹 Tool Calling

    Tool calling is supported in non-thinking mode. The format is:

<|begin▁of▁sentence|>{system prompt}{tool_description}<|User|>{query}<|Assistant|></think> where the tool_description area is populated after the system prompt.

    ▶️Run DeepSeek-V3.1 Tutorials:

    🦙 Run in Ollama/Open WebUI

    1

    Install ollama if you haven't already! To run more variants of the model, see here.

    2

Run the model! Note you can call ollama serve in another terminal if it fails! We include all our fixes and suggested parameters (temperature etc) in params in our Hugging Face upload! (NEW) To run the full R1-0528 model in Ollama, you can use our TQ1_0 (170GB quant):

    3

    To run other quants, you need to first merge the GGUF split files into 1 like the code below. Then you will need to run the model locally.

    4

Open WebUI also made a tutorial on how to run R1; for V3.1, you will just need to replace R1 with the new V3.1 quant.

    ✨ Run in llama.cpp

    1

    Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.

    2

    If you want to use llama.cpp directly to load models, you can do the below: (:Q2_K_XL) is the quantization type. You can also download via Hugging Face (point 3). This is similar to ollama run . Use export LLAMA_CACHE="folder" to force llama.cpp to save to a specific location. Remember the model has only a maximum of 128K context length.

    Please try out -ot ".ffn_.*_exps.=CPU" to offload all MoE layers to the CPU! This effectively allows you to fit all non MoE layers on 1 GPU, improving generation speeds. You can customize the regex expression to fit more layers if you have more GPU capacity.

    If you have a bit more GPU memory, try -ot ".ffn_(up|down)_exps.=CPU" This offloads up and down projection MoE layers.

    Try -ot ".ffn_(up)_exps.=CPU" if you have even more GPU memory. This offloads only up projection MoE layers.

    3

    Download the model via (after installing pip install huggingface_hub hf_transfer ). You can choose UD-Q2_K_XL (dynamic 2bit quant) or other quantized versions like Q4_K_M . We recommend using our 2.7bit dynamic quant UD-Q2_K_XL to balance size and accuracy.

    4

You can edit --threads 32 for the number of CPU threads, --ctx-size 16384 for context length, and --n-gpu-layers 2 for how many layers to offload to the GPU. Try adjusting it if your GPU runs out of memory. Also remove it for CPU-only inference.

    5

    Get the 1bit version (170GB) if you don't have enough combined RAM and VRAM:

    ✨ Deploy with llama-server and OpenAI's completion library

    To use llama-server for deployment, use the following command:

    Then use OpenAI's Python library after pip install openai :

    💽Model uploads

    ALL our uploads - including those that are not imatrix-based or dynamic, utilize our calibration dataset, which is specifically optimized for conversational, coding, and language tasks.

    • Full DeepSeek-V3.1 model uploads below:

    We also uploaded IQ4_NL and Q4_1 quants which run specifically faster for ARM and Apple devices respectively.

| MoE Bits | Type | Disk Size | Details |
|---|---|---|---|
| 1.66bit | TQ1_0 | 170GB | 1.92/1.56bit |
| 1.78bit | IQ1_S | 185GB | 2.06/1.56bit |
| 1.93bit | IQ1_M | 200GB | |

    We've also uploaded versions in BF16 format, and original FP8 (float8) format.

    🏂 Improving generation speed

    If you have more VRAM, you can try offloading more MoE layers, or offloading whole layers themselves.

    Normally, -ot ".ffn_.*_exps.=CPU" offloads all MoE layers to the CPU! This effectively allows you to fit all non MoE layers on 1 GPU, improving generation speeds. You can customize the regex expression to fit more layers if you have more GPU capacity.

    If you have a bit more GPU memory, try -ot ".ffn_(up|down)_exps.=CPU" This offloads up and down projection MoE layers.

    Try -ot ".ffn_(up)_exps.=CPU" if you have even more GPU memory. This offloads only up projection MoE layers.

    You can also customize the regex, for example -ot "\.(6|7|8|9|[0-9][0-9]|[0-9][0-9][0-9])\.ffn_(gate|up|down)_exps.=CPU" means to offload gate, up and down MoE layers but only from the 6th layer onwards.

    The latest llama.cpp release also introduces high throughput mode. Use llama-parallel. Read more about it here. You can also quantize the KV cache to 4bits for example to reduce VRAM / RAM movement, which can also make the generation process faster.

    📐How to fit long context (full 128K)

    To fit longer context, you can use KV cache quantization to quantize the K and V caches to lower bits. This can also increase generation speed due to reduced RAM / VRAM data movement. The allowed options for K quantization (default is f16) include the below.

    --cache-type-k f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1

    You should use the _1 variants for somewhat increased accuracy, albeit it's slightly slower. For eg q4_1, q5_1

    You can also quantize the V cache, but you will need to compile llama.cpp with Flash Attention support via -DGGML_CUDA_FA_ALL_QUANTS=ON, and use --flash-attn to enable it. Then you can use together with --cache-type-k :

    --cache-type-v f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1

    DeepSeek-V3.1-GGUF
    DeepSeek-V3.1-Terminus-GGUF
    Sept 10, 2025 update:
    Read more.
    chat template fixes

    Running Qwen3 Tutorial Fine-tuning Qwen3

    Qwen3 - Unsloth Dynamic 2.0 with optimal configs:

    Dynamic 2.0 GGUF (to run)
    128K Context GGUF
    Dynamic 4-bit Safetensor (to finetune/deploy)

    🖥️ Running Qwen3

    To achieve inference speeds of 6+ tokens per second, we recommend your available memory should match or exceed the size of the model you’re using. For example, a 30GB 1-bit quantized model requires at least 150GB of memory. The Q2_K_XL quant, which is 180GB, will require at least 180GB of unified memory (VRAM + RAM) or 180GB of RAM for optimal performance.

    NOTE: It’s possible to run the model with less total memory than its size (i.e., less VRAM, less RAM, or a lower combined total). However, this will result in slower inference speeds. Sufficient memory is only required if you want to maximize throughput and achieve the fastest inference times.

    ⚙️ Official Recommended Settings

    According to Qwen, these are the recommended settings for inference:

    Non-Thinking Mode Settings:
    Thinking Mode Settings:

    Temperature = 0.7

    Temperature = 0.6

    Min_P = 0.0 (optional, but 0.01 works well, llama.cpp default is 0.1)

    Min_P = 0.0

    Top_P = 0.8

    Top_P = 0.95

    TopK = 20

    TopK = 20

    Chat template/prompt format:

    For NON thinking mode, we purposely enclose <think> and </think> with nothing:

    For Thinking-mode, DO NOT use greedy decoding, as it can lead to performance degradation and endless repetitions.

    Switching Between Thinking and Non-Thinking Mode

    Qwen3 models come with built-in "thinking mode" to boost reasoning and improve response quality - similar to how QwQ-32B worked. Instructions for switching will differ depending on the inference engine you're using so ensure you use the correct instructions.

    Instructions for llama.cpp and Ollama:

    You can add /think and /no_think to user prompts or system messages to switch the model's thinking mode from turn to turn. The model will follow the most recent instruction in multi-turn conversations.

    Here is an example of multi-turn conversation:

    Instructions for transformers and vLLM:

    Thinking mode:

    enable_thinking=True

    By default, Qwen3 has thinking enabled. When you call tokenizer.apply_chat_template, you don’t need to set anything manually.

    In thinking mode, the model will generate an extra <think>...</think> block before the final answer — this lets it "plan" and sharpen its responses.

    Non-thinking mode:

    enable_thinking=False

Enabling non-thinking will make Qwen3 skip all the thinking steps and behave like a normal LLM.

    This mode will provide final responses directly — no <think> blocks, no chain-of-thought.
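For example, toggling the mode when building the prompt (the checkpoint id is assumed):

```python
# Hedged example of switching Qwen3's thinking mode via the chat template.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")   # assumed checkpoint
messages = [{"role": "user", "content": "What is 1+1?"}]

prompt = tokenizer.apply_chat_template(
    messages,
    tokenize = False,
    add_generation_prompt = True,
    enable_thinking = False,   # True (the default) inserts a <think>...</think> block
)
print(prompt)
```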

    🦙 Ollama: Run Qwen3 Tutorial

1. Install ollama if you haven't already! You can only run models up to 32B in size. To run the full 235B-A22B model, see here.

2. Run the model! Note you can call ollama serve in another terminal if it fails! We include all our fixes and suggested parameters (temperature etc) in params in our Hugging Face upload!

3. To disable thinking, use (or you can set it in the system prompt):

    If you're experiencing any looping, Ollama might have set your context length window to 2,048 or so. If this is the case, bump it up to 32,000 and see if the issue still persists.

    📖 Llama.cpp: Run Qwen3 Tutorial

1. Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.

2. Download the model via (after installing pip install huggingface_hub hf_transfer ). You can choose Q4_K_M, or other quantized versions.

3. Run the model and try any prompt.

    To disable thinking, use (or you can set it in the system prompt):

    Running Qwen3-235B-A22B

    For Qwen3-235B-A22B, we will specifically use Llama.cpp for optimized inference and a plethora of options.

    1. We're following similar steps to above however this time we'll also need to perform extra steps because the model is so big.

    2. Download the model via (after installing pip install huggingface_hub hf_transfer ). You can choose UD-Q2_K_XL, or other quantized versions..

    3. Run the model and try any prompt.

    4. Edit --threads 32 for the number of CPU threads, --ctx-size 16384 for context length, --n-gpu-layers 99 for GPU offloading on how many layers. Try adjusting it if your GPU goes out of memory. Also remove it if you have CPU only inference.

    Use -ot ".ffn_.*_exps.=CPU" to offload all MoE layers to the CPU! This effectively allows you to fit all non MoE layers on 1 GPU, improving generation speeds. You can customize the regex expression to fit more layers if you have more GPU capacity.

    🦥 Fine-tuning Qwen3 with Unsloth

    Unsloth makes Qwen3 fine-tuning 2x faster, use 70% less VRAM and supports 8x longer context lengths. Qwen3 (14B) fits comfortably in a Google Colab 16GB VRAM Tesla T4 GPU.

    Because Qwen3 supports both reasoning and non-reasoning, you can fine-tune it with a non-reasoning dataset, but this may affect its reasoning ability. If you want to maintain its reasoning capabilities (optional), you can use a mix of direct answers and chain-of-thought examples. Use 75% reasoning and 25% non-reasoning in your dataset to make the model retain its reasoning capabilities.

    Our Conversational notebook uses a combo of 75% NVIDIA’s open-math-reasoning dataset and 25% Maxime’s FineTome dataset (non-reasoning). Here's free Unsloth Colab notebooks to fine-tune Qwen3:

    • Qwen3 (14B) Reasoning + Conversational notebook (recommended)

    • Qwen3 (4B) - Advanced GRPO LoRA

    • Qwen3 (14B) Alpaca notebook (for Base models)

    If you have an old version of Unsloth and/or are fine-tuning locally, install the latest version of Unsloth:

    Qwen3 MOE models fine-tuning

    Fine-tuning support includes MOE models: 30B-A3B and 235B-A22B. Qwen3-30B-A3B works on just 17.5GB VRAM with Unsloth. On fine-tuning MoE's - it's probably not a good idea to fine-tune the router layer so we disabled it by default.

    The 30B-A3B fits in 17.5GB VRAM, but you may lack RAM or disk space since the full 16-bit model must be downloaded and converted to 4-bit on the fly for QLoRA fine-tuning. This is due to issues importing 4-bit BnB MOE models directly. This only affects MOE models.

    If you're fine-tuning the MOE models, please use FastModel and not FastLanguageModel

    Notebook Guide:

    To use the notebooks, just click Runtime, then Run all. You can change settings in the notebook to whatever you desire. We have set them automatically by default. Change model name to whatever you like by matching it with model's name on Hugging Face e.g. 'unsloth/Qwen3-8B' or 'unsloth/Qwen3-0.6B-unsloth-bnb-4bit'.

    There are other settings which you can toggle:

    • max_seq_length = 2048 – Controls context length. While Qwen3 supports 40960, we recommend 2048 for testing. Unsloth enables 8× longer context fine-tuning.

    • load_in_4bit = True – Enables 4-bit quantization, reducing memory use 4× for fine-tuning on 16GB GPUs.

    • For full-finetuning - set full_finetuning = True and 8-bit finetuning - set load_in_8bit = True
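As a hedged sketch, these settings come together like this; the model name is one of the examples given above:

```python
# Hedged sketch of the notebook settings listed above.
from unsloth import FastModel   # use FastModel (not FastLanguageModel) for Qwen3 MoE models

model, tokenizer = FastModel.from_pretrained(
    model_name = "unsloth/Qwen3-8B",
    max_seq_length = 2048,      # Qwen3 supports up to 40960
    load_in_4bit = True,        # 4-bit QLoRA; set load_in_8bit = True for 8-bit instead
    full_finetuning = False,    # set True for full fine-tuning
)
```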

    If you'd like to read a full end-to-end guide on how to use Unsloth notebooks for fine-tuning or just learn about fine-tuning, creating datasets etc., view our complete guide here:

    GRPO with Qwen3

    We made a new advanced GRPO notebook for fine-tuning Qwen3. Learn to use our new proximity-based reward function (closer answers = rewarded) and Hugging Face's Open-R1 math dataset. Unsloth now also has better evaluations and uses the latest version of vLLM.

    Qwen3 (4B) notebook - Advanced GRPO LoRA

    Learn about:

    • Enabling reasoning in Qwen3 (Base)+ guiding it to do a specific task

    • Pre-finetuning to bypass GRPO's tendency to learn formatting

    • Improved evaluation accuracy via new regex matching

    • Custom GRPO templates beyond just 'think' e.g. <start_working_out></end_working_out>

    • Proximity-based scoring: better answers earn more points (e.g., predicting 9 when the answer is 10) and outliers are penalized
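A hedged toy version of such a proximity-based reward (this is an illustration, not the notebook's exact function):

```python
# Toy proximity-based reward: numeric answers closer to the truth earn more points,
# far-off outliers and unparsable outputs are penalized.
def proximity_reward(prediction: str, answer: str) -> float:
    try:
        pred, true = float(prediction), float(answer)
    except ValueError:
        return -1.0                    # could not parse a number
    diff = abs(pred - true)
    if diff == 0:
        return 3.0                     # exact match
    elif diff <= 1:
        return 2.0                     # e.g. predicting 9 when the answer is 10
    elif diff <= 5:
        return 0.5
    return -0.5                        # outlier: penalized

print(proximity_reward("9", "10"))       # 2.0
print(proximity_reward("banana", "10"))  # -1.0
```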

    Qwen-2507
    Dynamic 2.0
    Unsloth
    Reinforcement Learning (RL)
    Colab notebook.
    🧬Fine-tuning Guide
    📈Datasets Guide

    Do not forget about <|User|> and <|Assistant|> tokens! - Or use a chat template formatter

  • Obtain the latest llama.cpp at: github.com/ggerganov/llama.cpp. You can follow the build instructions below as well:

    1. It's best to use --min-p 0.05 to counteract very rare token predictions - I found this to work well especially for the 1.58bit model.

    2. Download the model via:

    1. Example with Q4_0 K quantized cache Notice -no-cnv disables auto conversation mode

    Example output:

    1. If you have a GPU (RTX 4090 for example) with 24GB, you can offload multiple layers to the GPU for faster processing. If you have multiple GPUs, you can probably offload more layers.

    1. To test our Flappy Bird example as mentioned in our blog post here: https://unsloth.ai/blog/deepseekr1-dynamic, we can produce the 2nd example like below using our 1.58bit dynamic quant:

    The prompt used is as below:

    To call llama.cpp using this example, we do:

    1. Also, if you want to merge the weights together for use in Ollama for example, use this script:

    1. DeepSeek R1 has 61 layers. For example with a 24GB GPU or 80GB GPU, you can expect to offload after rounding down (reduce by 1 if it goes out of memory):

| Quant | File Size | 24GB GPU | 80GB GPU | 2x80GB GPU |
|---|---|---|---|---|
| 1.58bit | 131GB | 7 | 33 | All layers (61) |
| 1.73bit | 158GB | 5 | 26 | 57 |

    Running on Mac / Apple devices

    For Apple Metal devices, be careful of --n-gpu-layers. If you find the machine going out of memory, reduce it. For a 128GB unified memory machine, you should be able to offload 59 layers or so.

    Run in Ollama/Open WebUI

Open WebUI has made a step-by-step tutorial on how to run R1 here: docs.openwebui.com/tutorials/integrations/deepseekr1-dynamic/ If you want to use Ollama for inference on GGUFs, you need to first merge the 3 GGUF split files into 1 like the code below. Then you will need to run the model locally.

    DeepSeek Chat Template

    All distilled versions and the main 671B R1 model use the same chat template:

    <|begin▁of▁sentence|><|User|>What is 1+1?<|Assistant|>It's 2.<|end▁of▁sentence|><|User|>Explain more!<|Assistant|>

    A BOS is forcibly added, and an EOS separates each interaction. To counteract double BOS tokens during inference, you should only call tokenizer.encode(..., add_special_tokens = False) since the chat template auto adds a BOS token as well. For llama.cpp / GGUF inference, you should skip the BOS since it’ll auto add it.

    <|User|>What is 1+1?<|Assistant|>

    The <think> and </think> tokens get their own designated tokens. For the distilled versions for Qwen and Llama, some tokens are re-mapped, whilst Qwen for example did not have a BOS token, so <|object_ref_start|> had to be used instead. Tokenizer ID Mappings:

| Token | R1 | Distill Qwen | Distill Llama |
|---|---|---|---|
| <think> | 128798 | 151648 | 128013 |
| </think> | 128799 | 151649 | 128014 |
| <\|begin_of_sentence\|> | 0 | 151646 | |

    Original tokens in models:

| Token | Qwen 2.5 32B Base | Llama 3.3 70B Instruct |
|---|---|---|
| <think> | <\|box_start\|> | <\|reserved_special_token_5\|> |
| </think> | <\|box_end\|> | <\|reserved_special_token_6\|> |
| <\|begin▁of▁sentence\|> | <\|object_ref_start\|> | <\|begin_of_text\|> |
| <\|end▁of▁sentence\|> | <\|endoftext\|> | |

All Distilled and the original R1 versions seem to have accidentally assigned the padding token to <|end▁of▁sentence|>, which is mostly not a good idea, especially if you want to further finetune on top of these reasoning models. This will cause endless infinite generations, since most frameworks will mask the EOS token out as -100. We fixed all distilled and the original R1 versions with the correct padding token (Qwen uses <|vision_pad|>, Llama uses <|finetune_right_pad_id|>, and R1 uses <|▁pad▁|> or our own added <|PAD▁TOKEN|>).

    GGUF R1 Table

| MoE Bits | Type | Disk Size | Accuracy | Link | Details |
|---|---|---|---|---|---|
| 1.58bit | UD-IQ1_S | 131GB | Fair | | MoE all 1.56bit. down_proj in MoE mixture of 2.06/1.56bit |
| 1.73bit | UD-IQ1_M | 158GB | | | |

    https://docs.unsloth.ai/basics/deepseek-r1-0528-how-to-run-locally

Per-Channel (1 scale per row or column): 12,963 / 61.89% / 31.82%

Per-Tensor (1 scale for the whole tensor): 13,681 / 61.83% / 27.78%

| Model | FP8 uploads |
|---|---|
| Granite 4.0 | h-tiny (FP8 Dynamic), h-small (FP8 Dynamic) |
| Magistral Small | FP8 Dynamic, FP8 torchao |
| Mistral Small 3.2 | FP8 |
| Gemma 3 | 270m (FP8), 1B (FP8), 4B (FP8), 12B (FP8), 27B (FP8) |

    FP8
    FP8
    FP8
    FP8
    FP8
    FP8
    FP8
    FP8
    Dynamic
    Block
    Dynamic
    Block
    Dynamic
    Block
    FP8
    FP8
    FP8
    FP8
    FP8
    FP8
    Dynamic
    Block
    Dynamic
    Block
    Dynamic
    Block
    Dynamic
    Block
    Dynamic
    Block

    Qwen3-2507: Run Locally Guide

    Run Qwen3-30B-A3B-2507 and 235B-A22B Thinking and Instruct versions locally on your device!

Qwen released 2507 (July 2025) updates for their 4B, 30B and 235B models, introducing both "thinking" and "non-thinking" variants. The non-thinking 'Qwen3-30B-A3B-Instruct-2507' and 'Qwen3-235B-A22B-Instruct-2507' feature a 256K context window, improved instruction following, multilingual capabilities and alignment.

    The thinking models 'Qwen3-30B-A3B-Thinking-2507' and 'Qwen3-235B-A22B-Thinking-2507' excel at reasoning, with the 235B achieving SOTA results in logic, math, science, coding, and advanced academic tasks.

Unsloth also now supports fine-tuning and RL of Qwen3-2507 models: 2x faster, with 70% less VRAM, and 8x longer context lengths.

    Unsloth GGUFs:

    Model

    Unsloth Dynamic GGUFs on Aider Polyglot

    Performance of Unsloth Dynamic GGUFs on Aider Polyglot Benchmarks

We're excited to showcase how Unsloth Dynamic GGUFs make it possible to quantize LLMs like DeepSeek-V3.1 (671B) down to just 1-bit or 3-bit, and still outperform SOTA models like GPT-4.5, GPT-4.1 (April 2025) and Claude-4-Opus (May 2025).

Previously, we demonstrated how Unsloth Dynamic GGUFs outperform other quantization methods on 5-shot MMLU and KL Divergence. Now, we're showcasing their performance on independent third-party evaluations using the Aider Polyglot benchmark.

    ⭐Key results

    gpt-oss: How to Run Guide

    Run & fine-tune OpenAI's new open-source models!

    OpenAI releases 'gpt-oss-120b' and 'gpt-oss-20b', two SOTA open language models under the Apache 2.0 license. Both 128k context models outperform similarly sized open models in reasoning, tool use, and agentic tasks. You can now run & fine-tune them locally with Unsloth!

You can now export/save your QLoRA fine-tuned gpt-oss model to llama.cpp, vLLM, HF etc.

We also introduced Unsloth Flex Attention support, which enables >8× longer context lengths

unsloth/Llama-3.2-3B-FP8-Dynamic · Hugging Face
    <|begin▁of▁sentence|><|User|>What is 1+1?<|Assistant|>It's 2.<|end▁of▁sentence|><|User|>Explain more!<|Assistant|>
    <|User|>What is 1+1?<|Assistant|>
    apt-get update
    apt-get install pciutils -y
    curl -fsSL https://ollama.com/install.sh | sh
    ollama run hf.co/unsloth/DeepSeek-R1-0528-Qwen3-8B-GGUF:Q4_K_XL
    OLLAMA_MODELS=unsloth_downloaded_models ollama serve &
    
    ollama run hf.co/unsloth/DeepSeek-R1-0528-GGUF:TQ1_0
    OLLAMA_MODELS=unsloth_downloaded_models ollama serve &
    
    ollama run hf.co/unsloth/DeepSeek-R1-0528-GGUF:TQ1_0
    ./llama.cpp/llama-gguf-split --merge \
      DeepSeek-R1-0528-GGUF/DeepSeek-R1-0528-UD-IQ1_S/DeepSeek-R1-0528-UD-IQ1_S-00001-of-00003.gguf \
    	merged_file.gguf
    apt-get update
    apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
    git clone https://github.com/ggml-org/llama.cpp
    cmake llama.cpp -B llama.cpp/build \
        -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
    cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-gguf-split
    cp llama.cpp/build/bin/llama-* llama.cpp
    ./llama.cpp/llama-cli -hf unsloth/DeepSeek-R1-0528-Qwen3-8B-GGUF:Q4_K_XL --jinja
    apt-get update
    apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
    git clone https://github.com/ggerganov/llama.cpp
    cmake llama.cpp -B llama.cpp/build \
        -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
    cmake --build llama.cpp/build --config Release -j --clean-first --target llama-quantize llama-cli llama-gguf-split llama-mtmd-cli
    cp llama.cpp/build/bin/llama-* llama.cpp
    export LLAMA_CACHE="unsloth/DeepSeek-R1-0528-GGUF"
    ./llama.cpp/llama-cli \
        -hf unsloth/DeepSeek-R1-0528-GGUF:IQ1_S \
        --cache-type-k q4_0 \
        --threads -1 \
        --n-gpu-layers 99 \
        --prio 3 \
        --temp 0.6 \
        --top-p 0.95 \
        --min-p 0.01 \
        --ctx-size 16384 \
        --seed 3407 \
        -ot ".ffn_.*_exps.=CPU"
    # !pip install huggingface_hub hf_transfer
    import os
    os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "0" # Can sometimes rate limit, so set to 0 to disable
    from huggingface_hub import snapshot_download
    snapshot_download(
        repo_id = "unsloth/DeepSeek-R1-0528-GGUF",
        local_dir = "unsloth/DeepSeek-R1-0528-GGUF",
        allow_patterns = ["*UD-IQ1_S*"], # Dynamic 1bit (168GB) Use "*UD-Q2_K_XL*" for Dynamic 2bit (251GB)
    )
    ./llama.cpp/llama-cli \
        --model unsloth/DeepSeek-R1-0528-GGUF/UD-IQ1_S/DeepSeek-R1-0528-UD-IQ1_S-00001-of-00004.gguf \
        --cache-type-k q4_0 \
        --threads -1 \
        --n-gpu-layers 99 \
        --prio 3 \
        --temp 0.6 \
        --top-p 0.95 \
        --min-p 0.01 \
        --ctx-size 16384 \
        --seed 3407 \
        -ot ".ffn_.*_exps.=CPU" \
        -no-cnv \
        --prompt "<|User|>Create a Flappy Bird game in Python. You must include these things:\n1. You must use pygame.\n2. The background color should be randomly chosen and is a light shade. Start with a light blue color.\n3. Pressing SPACE multiple times will accelerate the bird.\n4. The bird's shape should be randomly chosen as a square, circle or triangle. The color should be randomly chosen as a dark color.\n5. Place on the bottom some land colored as dark brown or yellow chosen randomly.\n6. Make a score shown on the top right side. Increment if you pass pipes and don't hit them.\n7. Make randomly spaced pipes with enough space. Color them randomly as dark green or light brown or a dark gray shade.\n8. When you lose, show the best score. Make the text inside the screen. Pressing q or Esc will quit the game. Restarting is pressing SPACE again.\nThe final game should be inside a markdown section in Python. Check your code for errors and fix them before the final markdown section.<|Assistant|>"
    ./llama.cpp/llama-cli \
        --model unsloth/DeepSeek-R1-0528-GGUF/UD-IQ1_S/DeepSeek-R1-0528-UD-IQ1_S-00001-of-00004.gguf \
        --cache-type-k q4_0 \
        --threads -1 \
        --n-gpu-layers 99 \
        --prio 3 \
        --temp 0.6 \
        --top_p 0.95 \
        --min_p 0.01 \
        --ctx-size 16384 \
        --seed 3407 \
        -ot ".ffn_.*_exps.=CPU" \
        -no-cnv \
        --prompt "<|User|>Write a Python program that shows 20 balls bouncing inside a spinning heptagon:\n- All balls have the same radius.\n- All balls have a number on it from 1 to 20.\n- All balls drop from the heptagon center when starting.\n- Colors are: #f8b862, #f6ad49, #f39800, #f08300, #ec6d51, #ee7948, #ed6d3d, #ec6800, #ec6800, #ee7800, #eb6238, #ea5506, #ea5506, #eb6101, #e49e61, #e45e32, #e17b34, #dd7a56, #db8449, #d66a35\n- The balls should be affected by gravity and friction, and they must bounce off the rotating walls realistically. There should also be collisions between balls.\n- The material of all the balls determines that their impact bounce height will not exceed the radius of the heptagon, but higher than ball radius.\n- All balls rotate with friction, the numbers on the ball can be used to indicate the spin of the ball.\n- The heptagon is spinning around its center, and the speed of spinning is 360 degrees per 5 seconds.\n- The heptagon size should be large enough to contain all the balls.\n- Do not use the pygame library; implement collision detection algorithms and collision response etc. by yourself. The following Python libraries are allowed: tkinter, math, numpy, dataclasses, typing, sys.\n- All codes should be put in a single Python file.<|Assistant|>"
Standby mode enabled:
    
    |===========================================================================|
    |                  PyTorch CUDA memory summary, device ID 0                 |
    |---------------------------------------------------------------------------|
    |            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
    |===========================================================================|
    |        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
    |---------------------------------------------------------------------------|
    | Allocated memory      |  32249 MiB |  43042 MiB | 128336 GiB | 128305 GiB |
    |       from large pool |  31415 MiB |  42165 MiB | 127204 GiB | 127173 GiB |
    |       from small pool |    834 MiB |   1184 MiB |   1132 GiB |   1131 GiB |
    |---------------------------------------------------------------------------|
    | Active memory         |  32249 MiB |  43042 MiB | 128336 GiB | 128305 GiB |
    |       from large pool |  31415 MiB |  42165 MiB | 127204 GiB | 127173 GiB |
    |       from small pool |    834 MiB |   1184 MiB |   1132 GiB |   1131 GiB |
    |---------------------------------------------------------------------------|
    | Requested memory      |  32199 MiB |  42987 MiB | 128176 GiB | 128145 GiB |
    |       from large pool |  31364 MiB |  42110 MiB | 127047 GiB | 127016 GiB |
    |       from small pool |    834 MiB |   1184 MiB |   1129 GiB |   1128 GiB |
    |---------------------------------------------------------------------------|
    | GPU reserved memory   |  37644 MiB |  47504 MiB | 705806 MiB | 668162 MiB |
    |       from large pool |  36376 MiB |  46588 MiB | 682818 MiB | 646442 MiB |
    |       from small pool |   1268 MiB |   1284 MiB |  22988 MiB |  21720 MiB |
    |---------------------------------------------------------------------------|
    | Non-releasable memory | 713142 KiB |   4633 MiB | 103206 GiB | 103205 GiB |
    |       from large pool | 525312 KiB |   4594 MiB | 101923 GiB | 101922 GiB |
    |       from small pool | 187830 KiB |    250 MiB |   1283 GiB |   1283 GiB |
    |---------------------------------------------------------------------------|
    | Allocations           |    3460    |    4809    |   15606 K  |   15603 K  |
    |       from large pool |     395    |     563    |    2812 K  |    2811 K  |
    |       from small pool |    3065    |    4270    |   12794 K  |   12791 K  |
    |---------------------------------------------------------------------------|
    | Active allocs         |    3460    |    4809    |   15606 K  |   15603 K  |
    |       from large pool |     395    |     563    |    2812 K  |    2811 K  |
    |       from small pool |    3065    |    4270    |   12794 K  |   12791 K  |
    |---------------------------------------------------------------------------|
    | GPU reserved segments |     913    |     920    |   13260    |   12347    |
    |       from large pool |     279    |     305    |    1766    |    1487    |
    |       from small pool |     634    |     642    |   11494    |   10860    |
    |---------------------------------------------------------------------------|
    | Non-releasable allocs |     422    |     628    |    4766 K  |    4765 K  |
    |       from large pool |      66    |      92    |    1290 K  |    1289 K  |
    |       from small pool |     356    |     555    |    3476 K  |    3475 K  |
    |---------------------------------------------------------------------------|
    | Oversize allocations  |       0    |       0    |       0    |       0    |
    |---------------------------------------------------------------------------|
    | Oversize GPU segments |       0    |       0    |       0    |       0    |
    |===========================================================================|
    
    
    Without Standby:
    
    |===========================================================================|
    |                  PyTorch CUDA memory summary, device ID 0                 |
    |---------------------------------------------------------------------------|
    |            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
    |===========================================================================|
    |        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
    |---------------------------------------------------------------------------|
    | Allocated memory      |  32711 MiB |  52084 MiB | 142756 GiB | 142724 GiB |
    |       from large pool |  31877 MiB |  51207 MiB | 141499 GiB | 141467 GiB |
    |       from small pool |    834 MiB |   1184 MiB |   1257 GiB |   1256 GiB |
    |---------------------------------------------------------------------------|
    | Active memory         |  32711 MiB |  52084 MiB | 142756 GiB | 142724 GiB |
    |       from large pool |  31877 MiB |  51207 MiB | 141499 GiB | 141467 GiB |
    |       from small pool |    834 MiB |   1184 MiB |   1257 GiB |   1256 GiB |
    |---------------------------------------------------------------------------|
    | Requested memory      |  32572 MiB |  51658 MiB | 141898 GiB | 141866 GiB |
    |       from large pool |  31738 MiB |  50780 MiB | 140644 GiB | 140613 GiB |
    |       from small pool |    833 MiB |   1184 MiB |   1253 GiB |   1252 GiB |
    |---------------------------------------------------------------------------|
    | GPU reserved memory   |  49552 MiB |  52188 MiB |  86354 MiB |  36802 MiB |
    |       from large pool |  48320 MiB |  51300 MiB |  84740 MiB |  36420 MiB |
    |       from small pool |   1232 MiB |   1232 MiB |   1614 MiB |    382 MiB |
    |---------------------------------------------------------------------------|
    | Non-releasable memory |      0 B   |      0 B   |      0 B   |      0 B   |
    |       from large pool |      0 B   |      0 B   |      0 B   |      0 B   |
    |       from small pool |      0 B   |      0 B   |      0 B   |      0 B   |
    |---------------------------------------------------------------------------|
    | Allocations           |    3460    |    4809    |   17440 K  |   17437 K  |
    |       from large pool |     395    |     564    |    2742 K  |    2741 K  |
    |       from small pool |    3065    |    4270    |   14698 K  |   14695 K  |
    |---------------------------------------------------------------------------|
    | Active allocs         |    3460    |    4809    |   17440 K  |   17437 K  |
    |       from large pool |     395    |     564    |    2742 K  |    2741 K  |
    |       from small pool |    3065    |    4270    |   14698 K  |   14695 K  |
    |---------------------------------------------------------------------------|
    | GPU reserved segments |       0    |       0    |       0    |       0    |
    |       from large pool |       0    |       0    |       0    |       0    |
    |       from small pool |       0    |       0    |       0    |       0    |
    |---------------------------------------------------------------------------|
    | Non-releasable allocs |       0    |       0    |       0    |       0    |
    |       from large pool |       0    |       0    |       0    |       0    |
    |       from small pool |       0    |       0    |       0    |       0    |
    |---------------------------------------------------------------------------|
    | Oversize allocations  |       0    |       0    |       0    |       0    |
    |---------------------------------------------------------------------------|
    | Oversize GPU segments |       0    |       0    |       0    |       0    |
    |===========================================================================|
    import os
    os.environ["UNSLOTH_VLLM_STANDBY"] = "1"
    
    from unsloth import FastLanguageModel
    import torch
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "unsloth/Qwen3-8B-Base",
        max_seq_length = 2048, # Can increase for longer reasoning traces
        load_in_4bit = False, # False for LoRA 16bit
        fast_inference = True,
        max_lora_rank = 32, # Larger rank = smarter, but slower
        gpu_memory_utilization = 0.95,
    )
    import os
    os.environ["UNSLOTH_VLLM_STANDBY"] = "1"
    apt-get update
    apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
    git clone https://github.com/ggml-org/llama.cpp
    cmake llama.cpp -B llama.cpp/build \
        -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
    cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp
    python3 llama.cpp/convert_hf_to_gguf.py gpt-oss-finetuned-merged/ --outfile gpt-oss-finetuned-mxfp4.gguf
    llama.cpp/llama-cli --model gpt-oss-finetuned-mxfp4.gguf \
        --jinja -ngl 99 --threads -1 --ctx-size 16384 \
        --temp 1.0 --top-p 1.0 --top-k 0 \
         -p "The meaning to life and the universe is"
    # build from source
    git clone https://github.com/sgl-project/sglang
    cd sglang
    pip3 install pip --upgrade
    pip3 install -e "python[all]"
    
    # ROCm 6.3
    pip3 install torch==2.8.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/test/rocm6.3
git clone https://github.com/triton-lang/triton
cd triton/python/triton_kernels
pip3 install .
    
    # hopper
    pip3 install torch==2.8.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/test/cu126
    pip3 install sgl-kernel==0.3.2
    
    # blackwell cu128
    pip3 install torch==2.8.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/test/cu128
    pip3 install https://github.com/sgl-project/whl/releases/download/v0.3.2/sgl_kernel-0.3.2+cu128-cp39-abi3-manylinux2014_x86_64.whl
    
    # blackwell cu129
    pip3 install torch==2.8.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/test/cu129
    pip3 install https://github.com/sgl-project/whl/releases/download/v0.3.2/sgl_kernel-0.3.2-cp39-abi3-manylinux2014_x86_64.whl
    python3 -m sglang.launch_server --model-path ./gpt-oss-finetuned-merged/
    import requests
    from sglang.utils import print_highlight
    
    url = f"http://localhost:8000/v1/chat/completions"
    
    data = {
        "model": "gpt-oss-finetuned-merged",
        "messages": [{"role": "user", "content": "What is the capital of France?"}],
    }
    
    response = requests.post(url, json=data)
    print_highlight(response.json())
combined_logits = torch.cat([attn_weights, sinks], dim=-1)  # append one sink logit per head as an extra column
probs = F.softmax(combined_logits, dim=-1)                  # the sink competes with real tokens inside the softmax
scores = probs[..., :-1]                                    # drop the sink column; each row now sums to <= 1
    def sliding_window_causal(b, h, q_idx, kv_idx):
        causal_mask = q_idx >= kv_idx
        window_mask = q_idx - kv_idx <= SLIDING_WINDOW 
        return causal_mask & window_mask
    mask = torch.triu(Q.new_full((n_tokens, n_tokens), -float("inf")), diagonal=1)
    if sliding_window > 0:
        mask += torch.tril(
            mask.new_full((n_tokens, n_tokens), -float("inf")), diagonal=-sliding_window
        )
    def sliding_window_causal(b, h, q_idx, kv_idx):
        causal_mask = q_idx >= kv_idx
        window_mask = q_idx - kv_idx <= SLIDING_WINDOW # Default Flex Attention
        window_mask = q_idx - kv_idx <  SLIDING_WINDOW # GPT-OSS version
        return causal_mask & window_mask
    def causal_mask_with_sink(batch, head, q_idx, kv_idx):
        """
          0 1 2 3     0 1 2 3
        0 X X       1   X
        1 X X X     2   X X
        2 X X X X   3   X X X
        """
        # We add (q_idx + 1) since first column is sink token
        causal_mask = (q_idx + 1) >= kv_idx
        sink_first_column = kv_idx == 0
        return causal_mask | sink_first_column
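For reference, mask functions like the ones above are FlexAttention mask_mods. Below is a minimal sketch (assuming PyTorch 2.5+ with torch.nn.attention.flex_attention and a CUDA GPU; sizes and the window length are illustrative) of compiling one into a block mask and running attention with it:

import torch
from torch.nn.attention.flex_attention import create_block_mask, flex_attention

SLIDING_WINDOW = 128  # illustrative window size

def sliding_window_causal(b, h, q_idx, kv_idx):
    causal_mask = q_idx >= kv_idx
    window_mask = q_idx - kv_idx < SLIDING_WINDOW  # strict window, GPT-OSS style
    return causal_mask & window_mask

B, H, S, D = 1, 8, 1024, 64
q = torch.randn(B, H, S, D, device = "cuda", dtype = torch.bfloat16)
k = torch.randn(B, H, S, D, device = "cuda", dtype = torch.bfloat16)
v = torch.randn(B, H, S, D, device = "cuda", dtype = torch.bfloat16)

block_mask = create_block_mask(sliding_window_causal, B, H, S, S, device = "cuda")
out = flex_attention(q, k, v, block_mask = block_mask)  # (B, H, S, D)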
    model.save_pretrained_merged(save_directory, tokenizer, save_method="mxfp4")
    model.push_to_hub_merged(repo_name, tokenizer=tokenizer, token=hf_token, save_method="mxfp4")
    model, tokenizer = FastLanguageModel.from_pretrained(
        # model_name = "unsloth/gpt-oss-20b-BF16", 
        model_name = "unsloth/gpt-oss-20b",
        dtype = dtype, # None for auto detection
        max_seq_length = max_seq_length, # Choose any for long context!
        load_in_4bit = True,  # 4 bit quantization to reduce memory
        full_finetuning = False, # [NEW!] We have full finetuning now!
        # token = "hf_...", # use one if using gated models
    )
    def sdpa(Q, K, V, S, sm_scale, sliding_window=0):
        # sliding_window == 0 means no sliding window
        n_tokens, n_heads, q_mult, d_head = Q.shape
        assert K.shape == (n_tokens, n_heads, d_head)
        assert V.shape == (n_tokens, n_heads, d_head)
        K = K[:, :, None, :].expand(-1, -1, q_mult, -1)
        V = V[:, :, None, :].expand(-1, -1, q_mult, -1)
        S = S.reshape(n_heads, q_mult, 1, 1).expand(-1, -1, n_tokens, -1)
        mask = torch.triu(Q.new_full((n_tokens, n_tokens), -float("inf")), diagonal=1)
        if sliding_window > 0:
            mask += torch.tril(
                mask.new_full((n_tokens, n_tokens), -float("inf")), diagonal=-sliding_window
            )
        QK = torch.einsum("qhmd,khmd->hmqk", Q, K) * sm_scale
        QK += mask[None, None, :, :]
        QK = torch.cat([QK, S], dim=-1)
        W = torch.softmax(QK, dim=-1)
        W = W[..., :-1]
        attn = torch.einsum("hmqk,khmd->qhmd", W, V)
        return attn.reshape(n_tokens, -1)
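To make the expected shapes concrete, here is a tiny illustrative check of the reference sdpa above (sizes are arbitrary; S holds one sink logit per attention head):

import torch

n_tokens, n_heads, q_mult, d_head = 8, 2, 4, 16
Q = torch.randn(n_tokens, n_heads, q_mult, d_head)
K = torch.randn(n_tokens, n_heads, d_head)
V = torch.randn(n_tokens, n_heads, d_head)
S = torch.randn(n_heads * q_mult)  # one learned sink logit per attention head

out = sdpa(Q, K, V, S, sm_scale = d_head ** -0.5, sliding_window = 4)
print(out.shape)  # torch.Size([8, 128]) == (n_tokens, n_heads * q_mult * d_head)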
    def eager_attention_forward(
        module: nn.Module,
        query: torch.Tensor,
        key: torch.Tensor,
        value: torch.Tensor,
        attention_mask: Optional[torch.Tensor],
        scaling: float,
        dropout: float = 0.0,
        **kwargs,
    ):
        key_states = repeat_kv(key, module.num_key_value_groups)
        value_states = repeat_kv(value, module.num_key_value_groups)
        attn_weights = torch.matmul(query, key_states.transpose(2, 3)) * scaling
        if attention_mask is not None:
            causal_mask = attention_mask[:, :, :, : key_states.shape[-2]]
            attn_weights = attn_weights + causal_mask
    
        sinks = module.sinks.reshape(1, -1, 1, 1).expand(query.shape[0], -1, query.shape[-2], -1)
        combined_logits = torch.cat([attn_weights, sinks], dim=-1)
    
# This was not in the original implementation and slightly affects results; it prevents overflow in BF16/FP16.
# When training with bsz>1 we clamp max values.
    
        combined_logits = combined_logits - combined_logits.max(dim=-1, keepdim=True).values
        probs = F.softmax(combined_logits, dim=-1, dtype=combined_logits.dtype)
        scores = probs[..., :-1]  # we drop the sink here
        attn_weights = nn.functional.dropout(scores, p=dropout, training=module.training)
        attn_output = torch.matmul(attn_weights, value_states)
        attn_output = attn_output.transpose(1, 2).contiguous()
        return attn_output, attn_weights
    apt-get update
    apt-get install pciutils -y
    curl -fsSL https://ollama.com/install.sh | sh
    OLLAMA_MODELS=unsloth ollama serve &
    
    OLLAMA_MODELS=unsloth ollama run hf.co/unsloth/DeepSeek-V3.1-Terminus-GGUF:TQ1_0
    apt-get update
    apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
    git clone https://github.com/ggerganov/llama.cpp
    cmake llama.cpp -B llama.cpp/build \
        -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
    cmake --build llama.cpp/build --config Release -j --clean-first --target llama-quantize llama-cli llama-gguf-split llama-mtmd-cli llama-server
    cp llama.cpp/build/bin/llama-* llama.cpp
    <|begin▁of▁sentence|>{system prompt}<|User|>{query}<|Assistant|></think>
    ./llama.cpp/llama-server \
        --model unsloth/DeepSeek-V3.1-Terminus-GGUF/DeepSeek-V3.1-Terminus-UD-TQ1_0.gguf \
        --alias "unsloth/DeepSeek-V3.1-Terminus" \
        --threads -1 \
        --n-gpu-layers 999 \
        -ot ".ffn_.*_exps.=CPU" \
        --prio 3 \
        --min_p 0.01 \
        --ctx-size 16384 \
        --port 8001 \
        --jinja
    from openai import OpenAI
    import json
    openai_client = OpenAI(
        base_url = "http://127.0.0.1:8001/v1",
        api_key = "sk-no-key-required",
    )
    completion = openai_client.chat.completions.create(
        model = "unsloth/DeepSeek-V3.1-Terminus",
        messages = [{"role": "user", "content": "What is 2+2?"},],
    )
    print(completion.choices[0].message.content)
    # !pip install huggingface_hub hf_transfer
    import os
    os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
    from huggingface_hub import snapshot_download
    snapshot_download(
        repo_id = "unsloth/Qwen3-235B-A22B-GGUF",
        local_dir = "unsloth/Qwen3-235B-A22B-GGUF",
        allow_patterns = ["*UD-Q2_K_XL*"],
    )
    <|im_start|>user\nWhat is 2+2?<|im_end|>\n<|im_start|>assistant\n
    <|im_start|>user\nWhat is 2+2?<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n
    > Who are you /no_think
    
    <think>
    
    </think>
    
    I am Qwen, a large-scale language model developed by Alibaba Cloud. [...]
    
    > How many 'r's are in 'strawberries'? /think
    
    <think>
    Okay, let's see. The user is asking how many times the letter 'r' appears in the word "strawberries". [...]
    </think>
    
    The word strawberries contains 3 instances of the letter r. [...]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=True  # Default is True
    )
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=False  # Disables thinking mode
    )
    apt-get update
    apt-get install pciutils -y
    curl -fsSL https://ollama.com/install.sh | sh
    ollama run hf.co/unsloth/Qwen3-8B-GGUF:UD-Q4_K_XL
    >>> Write your prompt here /nothink
    apt-get update
    apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
    git clone https://github.com/ggml-org/llama.cpp
    cmake llama.cpp -B llama.cpp/build \
        -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
    cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-gguf-split
    cp llama.cpp/build/bin/llama-* llama.cpp
    # !pip install huggingface_hub hf_transfer
    import os
    os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
    from huggingface_hub import snapshot_download
    snapshot_download(
        repo_id = "unsloth/Qwen3-14B-GGUF",
        local_dir = "unsloth/Qwen3-14B-GGUF",
        allow_patterns = ["*UD-Q4_K_XL*"],
    )
    ./llama.cpp/llama-cli \
    --model unsloth/Qwen3-14B-GGUF/Qwen3-14B-UD-Q4_K_XL.gguf \
        --threads 32 \
        --ctx-size 16384 \
        --n-gpu-layers 99 \
        -ot ".ffn_.*_exps.=CPU" \
        --seed 3407 \
        --prio 3 \
        --temp 0.6 \
        --min-p 0.0 \
        --top-p 0.95 \
        --top-k 20 \
        -no-cnv
    >>> Write your prompt here /nothink
    ./llama.cpp/llama-cli \
        --model unsloth/Qwen3-235B-A22B-GGUF/Qwen3-235B-A22B-UD-Q2_K_XL.gguf \
        --threads 32 \
        --ctx-size 16384 \
        --n-gpu-layers 99 \
        -ot ".ffn_.*_exps.=CPU" \
        --seed 3407 \
        --prio 3 \
        --temp 0.6 \
        --min-p 0.0 \
        --top-p 0.95 \
        --top-k 20 \
        -no-cnv \
        --prompt "<|im_start|>user\nCreate a Flappy Bird game in Python. You must include these things:\n1. You must use pygame.\n2. The background color should be randomly chosen and is a light shade. Start with a light blue color.\n3. Pressing SPACE multiple times will accelerate the bird.\n4. The bird's shape should be randomly chosen as a square, circle or triangle. The color should be randomly chosen as a dark color.\n5. Place on the bottom some land colored as dark brown or yellow chosen randomly.\n6. Make a score shown on the top right side. Increment if you pass pipes and don't hit them.\n7. Make randomly spaced pipes with enough space. Color them randomly as dark green or light brown or a dark gray shade.\n8. When you lose, show the best score. Make the text inside the screen. Pressing q or Esc will quit the game. Restarting is pressing SPACE again.\nThe final game should be inside a markdown section in Python. Check your code for errors and fix them before the final markdown section.<|im_end|>\n<|im_start|>assistant\n"
    pip install --upgrade --force-reinstall --no-cache-dir unsloth unsloth_zoo
    from unsloth import FastModel
    import torch
    model, tokenizer = FastModel.from_pretrained(
        model_name = "unsloth/Qwen3-30B-A3B",
        max_seq_length = 2048, # Choose any for long context!
        load_in_4bit = True,  # 4 bit quantization to reduce memory
        load_in_8bit = False, # [NEW!] A bit more accurate, uses 2x memory
        full_finetuning = False, # [NEW!] We have full finetuning now!
        # token = "hf_...", # use one if using gated models
    )
    apt-get update
    apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
    git clone https://github.com/ggerganov/llama.cpp
    cmake llama.cpp -B llama.cpp/build \
        -DBUILD_SHARED_LIBS=ON -DGGML_CUDA=ON -DLLAMA_CURL=ON
    cmake --build llama.cpp/build --config Release -j --clean-first --target llama-quantize llama-cli llama-gguf-split
    cp llama.cpp/build/bin/llama-* llama.cpp
    # pip install huggingface_hub hf_transfer
    # import os # Optional for faster downloading
    # os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
    
    from huggingface_hub import snapshot_download
    snapshot_download(
      repo_id = "unsloth/DeepSeek-R1-GGUF",
      local_dir = "DeepSeek-R1-GGUF",
      allow_patterns = ["*UD-IQ1_S*"], # Select quant type UD-IQ1_S for 1.58bit
    )
    ./llama.cpp/llama-cli \
        --model DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
        --cache-type-k q4_0 \
        --threads 12 -no-cnv --prio 2 \
        --temp 0.6 \
        --ctx-size 8192 \
        --seed 3407 \
        --prompt "<|User|>What is 1+1?<|Assistant|>"
     <think>
     Okay, so I need to figure out what 1 plus 1 is. Hmm, where do I even start? I remember from school that adding numbers is pretty basic, but I want to make sure I understand it properly.
     Let me think, 1 plus 1. So, I have one item and I add another one. Maybe like a apple plus another apple. If I have one apple and someone gives me another, I now have two apples. So, 1 plus 1 should be 2. That makes sense.
     Wait, but sometimes math can be tricky. Could it be something else? Like, in a different number system maybe? But I think the question is straightforward, using regular numbers, not like binary or hexadecimal or anything.
     I also recall that in arithmetic, addition is combining quantities. So, if you have two quantities of 1, combining them gives you a total of 2. Yeah, that seems right.
     Is there a scenario where 1 plus 1 wouldn't be 2? I can't think of any...
    ./llama.cpp/llama-cli \
        --model DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
        --cache-type-k q4_0 \
        --threads 12 -no-cnv --prio 2 \
        --n-gpu-layers 7 \
        --temp 0.6 \
        --ctx-size 8192 \
        --seed 3407 \
        --prompt "<|User|>Create a Flappy Bird game in Python.<|Assistant|>"
    <|User|>Create a Flappy Bird game in Python. You must include these things:
    1. You must use pygame.
    2. The background color should be randomly chosen and is a light shade. Start with a light blue color.
    3. Pressing SPACE multiple times will accelerate the bird.
    4. The bird's shape should be randomly chosen as a square, circle or triangle. The color should be randomly chosen as a dark color.
    5. Place on the bottom some land colored as dark brown or yellow chosen randomly.
    6. Make a score shown on the top right side. Increment if you pass pipes and don't hit them.
    7. Make randomly spaced pipes with enough space. Color them randomly as dark green or light brown or a dark gray shade.
    8. When you lose, show the best score. Make the text inside the screen. Pressing q or Esc will quit the game. Restarting is pressing SPACE again.
    The final game should be inside a markdown section in Python. Check your code for errors and fix them before the final markdown section.<|Assistant|>
    ./llama.cpp/llama-cli \
        --model DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
        --cache-type-k q4_0 \
        --threads 12 -no-cnv --prio 2 \
        --n-gpu-layers 7 \
        --temp 0.6 \
        --ctx-size 8192 \
        --seed 3407 \
        --prompt "<|User|>Create a Flappy Bird game in Python. You must include these things:\n1. You must use pygame.\n2. The background color should be randomly chosen and is a light shade. Start with a light blue color.\n3. Pressing SPACE multiple times will accelerate the bird.\n4. The bird's shape should be randomly chosen as a square, circle or triangle. The color should be randomly chosen as a dark color.\n5. Place on the bottom some land colored as dark brown or yellow chosen randomly.\n6. Make a score shown on the top right side. Increment if you pass pipes and don't hit them.\n7. Make randomly spaced pipes with enough space. Color them randomly as dark green or light brown or a dark gray shade.\n8. When you lose, show the best score. Make the text inside the screen. Pressing q or Esc will quit the game. Restarting is pressing SPACE again.\nThe final game should be inside a markdown section in Python. Check your code for errors and fix them before the final markdown section.<|Assistant|>"
    ./llama.cpp/llama-gguf-split --merge \
        DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
        merged_file.gguf
    ./llama.cpp/llama-cli \
        --model DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
        --cache-type-k q4_0 \
        --threads 16 \
        --prio 2 \
        --temp 0.6 \
        --ctx-size 8192 \
        --seed 3407 \
        --n-gpu-layers 59 \
        -no-cnv \
        --prompt "<|User|>Create a Flappy Bird game in Python.<|Assistant|>"
    ./llama.cpp/llama-gguf-split --merge \
      DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
    	merged_file.gguf
    IQ2_XXS
    Q2_K_XL
    IQ3_XXS
    Q3_K_XL
    Q4_K_XL
    Q5_K_XL
| Run (notebook) | standby | vllm_gpu_util | num_gen | grad_acc_steps | Result |
| --- | --- | --- | --- | --- | --- |
| u9ge2ga2ns Qwen3_(4B)-GRPO.ipynb | False | 0.9 | 2 | 2 | OOM: model loads but can't train because even batch size of 1 doesn't fit |
| u8ge2ga2ns Qwen3_(4B)-GRPO.ipynb | False | 0.8 | 2 | 2 | OOM: model loads but can't train because even batch size of 1 doesn't fit |
| u7ge2ga2ns Qwen3_(4B)-GRPO.ipynb | False | 0.7 | 2 | 2 | Trains fine; 28 steps take 39min; ~15.1GiB; any input slightly longer will result in OOM on Colab |
| u7gen2ga2s Qwen3_(4B)-GRPO.ipynb | True | 0.7 | 2 | 2 | Trains fine; 29 steps take 40min; 13GiB, but most of the time around 10-11GB |

    At the same config, we save 2GiB aka 15% memory here. Can be higher for longer sequences

    Llama 3.1 (8B)
    Phi-4 (14B)
    Mistral v0.3 (7B)
    Qwen2.5 (3B)
    u0.95gen2ga1s Qwen3_(4B)-GRPO.ipynb
    u9ge2ga2s Qwen3_(4B)-GRPO.ipynb

    14B

  • 30B-A3B

  • 32B

  • 235B-A22B

  • 32B

  • 235B-A22B

  • 14B

  • 30B-A3B

  • 32B

  • 0.6B
    1.7B
    4B
    8B
    4B
    8B
    14B
    30B-A3B
    0.6B
    1.7B
    4B
    8B

    2.22bit

    183GB

    4

    22

    49

    2.51bit

    212GB

    2

    19

    32

Tokenizer ID mappings (continued):

<|end_of_sentence|>: R1 = 1, Distill Qwen = 151643, Distill Llama = 128001
<|User|>: R1 = 128803, Distill Qwen = 151644, Distill Llama = 128011
<|Assistant|>: R1 = 128804, Distill Qwen = 151645, Distill Llama = 128012
Padding token: R1 = 2, Distill Qwen = 151654, Distill Llama = 128004

Original tokens in models (continued):

<|User|>: Qwen 2.5 32B Base = <|im_start|>, Llama 3.3 70B Instruct = <|reserved_special_token_3|>
<|Assistant|>: Qwen 2.5 32B Base = <|im_end|>, Llama 3.3 70B Instruct = <|reserved_special_token_4|>
Padding token: Qwen 2.5 32B Base = <|vision_pad|>, Llama 3.3 70B Instruct = <|finetune_right_pad_id|>

GGUF R1 Table (continued):

| MoE Bits | Type | Disk Size | Accuracy | Link | Details |
| --- | --- | --- | --- | --- | --- |
| 2.22bit | UD-IQ2_XXS | 183GB | Better | Link | MoE all 2.06bit. down_proj in MoE mixture of 2.5/2.06bit |
| 2.51bit | UD-Q2_K_XL | 212GB | Best | Link | MoE all 2.5bit. down_proj in MoE mixture of 3.5/2.5bit |

    Original DeepSeek R1

    1.58bit Dynamic Quant

    Link
And finally, offload all MoE layers via -ot ".ffn_.*_exps.=CPU". This uses the least VRAM.

    You can also customize the regex, for example -ot "\.(6|7|8|9|[0-9][0-9]|[0-9][0-9][0-9])\.ffn_(gate|up|down)_exps.=CPU" means to offload gate, up and down MoE layers but only from the 6th layer onwards.

    2.5/2.06/1.56

| MoE Bits | Type | Disk Size | down_proj mixture |
| --- | --- | --- | --- |
| 2.42bit | IQ2_XXS | 216GB | 2.5/2.06bit |
| 2.71bit | Q2_K_XL | 251GB | 3.5/2.5bit |
| 3.12bit | IQ3_XXS | 273GB | 3.5/2.06bit |
| 3.5bit | Q3_K_XL | 296GB | 4.5/3.5bit |
| 4.5bit | Q4_K_XL | 384GB | 5.5/4.5bit |
| 5.5bit | Q5_K_XL | 481GB | 6.5/5.5bit |

    step-by-step tutorial
    TQ1_0
    IQ1_S
    IQ1_M
    ./llama.cpp/llama-gguf-split --merge \
      DeepSeek-V3.1-Terminus-GGUF/DeepSeek-V3.1-Terminus-UD-Q2_K_XL/DeepSeek-V3.1-Terminus-UD-Q2_K_XL-00001-of-00006.gguf \
    	merged_file.gguf
    OLLAMA_MODELS=unsloth ollama serve &
    
    OLLAMA_MODELS=unsloth ollama run merged_file.gguf
    export LLAMA_CACHE="unsloth/DeepSeek-V3.1-GGUF"
    ./llama.cpp/llama-cli \
        -hf unsloth/DeepSeek-V3.1-Terminus-GGUF:UD-Q2_K_XL \
        --jinja \
        --n-gpu-layers 99 \
        --temp 0.6 \
        --top-p 0.95 \
        --min-p 0.01 \
        --ctx-size 16384 \
        --seed 3407 \
        -ot ".ffn_.*_exps.=CPU"
    # !pip install huggingface_hub hf_transfer
    import os
    os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "0" # Can sometimes rate limit, so set to 0 to disable
    from huggingface_hub import snapshot_download
    snapshot_download(
        repo_id = "unsloth/DeepSeek-V3.1-Terminus-GGUF",
        local_dir = "unsloth/DeepSeek-V3.1-Terminus-GGUF",
        allow_patterns = ["*UD-Q2_K_XL*"], # Dynamic 2bit Use "*UD-TQ1_0*" for Dynamic 1bit
    )
    ./llama.cpp/llama-cli \
        --model unsloth/DeepSeek-V3.1-Terminus-GGUF/UD-Q2_K_XL/DeepSeek-V3.1-Terminus-UD-Q2_K_XL-00001-of-00006.gguf \
        --jinja \
        --threads -1 \
        --n-gpu-layers 99 \
        --temp 0.6 \
        --top-p 0.95 \
        --min-p 0.01 \
        --ctx-size 16384 \
        --seed 3407 \
        -ot ".ffn_.*_exps.=CPU"
    from huggingface_hub import snapshot_download
    snapshot_download(
        repo_id = "unsloth/DeepSeek-V3.1-Terminus-GGUF",
        local_dir = "unsloth/DeepSeek-V3.1-Terminus-GGUF",
        allow_patterns = ["*UD-TQ1_0*"], # Use "*UD-Q2_K_XL*" for Dynamic 2bit
    )
GGUFs to run:

• Qwen3-4B-2507
• Qwen3-30B-A3B-2507
• Qwen3-235B-A22B-2507

    ⚙️Best Practices

The settings for the Thinking and Instruct models are different. The thinking model uses temperature = 0.6, but the instruct model uses temperature = 0.7. The thinking model uses top_p = 0.95, but the instruct model uses top_p = 0.8.

    To achieve optimal performance, Qwen recommends these settings:

| Instruct Model Settings | Thinking Model Settings |
| --- | --- |
| Temperature = 0.7 | Temperature = 0.6 |
| Min_P = 0.00 (llama.cpp's default is 0.1) | Min_P = 0.00 (llama.cpp's default is 0.1) |
| Top_P = 0.80 | Top_P = 0.95 |
| TopK = 20 | TopK = 20 |
| presence_penalty = 0.0 to 2.0 (llama.cpp default turns it off, but to reduce repetitions, you can use this) | presence_penalty = 0.0 to 2.0 (llama.cpp default turns it off, but to reduce repetitions, you can use this) |

Adequate Output Length: Use an output length of 32,768 tokens, which is adequate for most queries.
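As a concrete example, here is a minimal sketch (assuming a transformers version recent enough to support min_p in generate; the model id and prompt are illustrative) of applying the Instruct settings above:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-30B-A3B-Instruct-2507"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype = "auto", device_map = "auto")

messages = [{"role": "user", "content": "What is 2+2?"}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt = True, return_tensors = "pt").to(model.device)

output = model.generate(
    inputs,
    max_new_tokens = 32768,  # adequate output length for most queries
    do_sample = True,
    temperature = 0.7,       # Instruct settings; use temperature = 0.6 and top_p = 0.95 for the Thinking model
    top_p = 0.8,
    top_k = 20,
    min_p = 0.0,
)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens = True))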

    Chat template for both Thinking (thinking has <think></think>) and Instruct is below:

    📖 Run Qwen3-30B-A3B-2507 Tutorials

    Below are guides for the Thinking and Instruct versions of the model.

    Instruct: Qwen3-30B-A3B-Instruct-2507

Given that this is a non-thinking model, there is no need to set enable_thinking=False, and the model does not generate <think> </think> blocks.

    ⚙️Best Practices

    To achieve optimal performance, Qwen recommends the following settings:

• We suggest using temperature=0.7, top_p=0.8, top_k=20, and min_p=0.0, with presence_penalty between 0 and 2 if the framework supports it, to reduce endless repetitions.

    • temperature = 0.7

    • top_k = 20

    • min_p = 0.00 (llama.cpp's default is 0.1)

    • top_p = 0.80

• presence_penalty = 0.0 to 2.0 (llama.cpp's default turns it off, but you can use this to reduce repetitions). Try 1.0, for example.

    • Supports up to 262,144 context natively but you can set it to 32,768 tokens for less RAM use

    🦙 Ollama: Run Qwen3-30B-A3B-Instruct-2507 Tutorial

    1. Install ollama if you haven't already! You can only run models up to 32B in size.

2. Run the model! Note you can call ollama serve in another terminal if it fails! We include all our fixes and suggested parameters (temperature etc.) in params in our Hugging Face upload!

    ✨ Llama.cpp: Run Qwen3-30B-A3B-Instruct-2507 Tutorial

    1. Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.

2. You can directly pull from Hugging Face via:

3. Download the model via huggingface_hub (after installing pip install huggingface_hub hf_transfer). You can choose UD-Q4_K_XL or other quantized versions.

    Thinking: Qwen3-30B-A3B-Thinking-2507

    This model supports only thinking mode and a 256K context window natively. The default chat template adds <think> automatically, so you may see only a closing </think> tag in the output.

    ⚙️Best Practices

    To achieve optimal performance, Qwen recommends the following settings:

• We suggest using temperature=0.6, top_p=0.95, top_k=20, and min_p=0.0, with presence_penalty between 0 and 2 if the framework supports it, to reduce endless repetitions.

    • temperature = 0.6

    • top_k = 20

    • min_p = 0.00 (llama.cpp's default is 0.1)

    • top_p = 0.95

• presence_penalty = 0.0 to 2.0 (llama.cpp's default turns it off, but you can use this to reduce repetitions). Try 1.0, for example.

    • Supports up to 262,144 context natively but you can set it to 32,768 tokens for less RAM use

🦙 Ollama: Run Qwen3-30B-A3B-Thinking-2507 Tutorial

    1. Install ollama if you haven't already! You can only run models up to 32B in size. To run the full 235B-A22B models, see here.

2. Run the model! Note you can call ollama serve in another terminal if it fails! We include all our fixes and suggested parameters (temperature etc.) in params in our Hugging Face upload!

✨ Llama.cpp: Run Qwen3-30B-A3B-Thinking-2507 Tutorial

    1. Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.

2. You can directly pull from Hugging Face via:

3. Download the model via huggingface_hub (after installing pip install huggingface_hub hf_transfer). You can choose UD-Q4_K_XL or other quantized versions.

    📖 Run Qwen3-235B-A22B-2507 Tutorials

    Below are guides for the Thinking and Instruct versions of the model.

    Thinking: Qwen3-235B-A22B-Thinking-2507

    This model supports only thinking mode and a 256K context window natively. The default chat template adds <think> automatically, so you may see only a closing </think> tag in the output.

    ⚙️ Best Practices

    To achieve optimal performance, Qwen recommends these settings for the Thinking model:

    • temperature = 0.6

    • top_k = 20

    • min_p = 0.00 (llama.cpp's default is 0.1)

    • top_p = 0.95

• presence_penalty = 0.0 to 2.0 (llama.cpp's default turns it off, but you can use this to reduce repetitions). Try 1.0, for example.

• Adequate Output Length: Use an output length of 32,768 tokens, which is adequate for most queries.

    ✨Run Qwen3-235B-A22B-Thinking via llama.cpp:

    For Qwen3-235B-A22B, we will specifically use Llama.cpp for optimized inference and a plethora of options.

    If you want a full precision unquantized version, use our Q8_K_XL, Q8_0 or BF16 versions!

    1. Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.

2. You can directly use llama.cpp to download the model, but we normally suggest using huggingface_hub. To use llama.cpp directly, do:

3. Download the model via huggingface_hub (after installing pip install huggingface_hub hf_transfer). You can choose UD-Q2_K_XL or other quantized versions.

    4. Run the model and try any prompt.

5. Edit --threads -1 for the number of CPU threads, --ctx-size 262144 for context length, and --n-gpu-layers 99 for how many layers to offload to the GPU. Try adjusting it if your GPU goes out of memory, and remove it if you have CPU-only inference.

    Use -ot ".ffn_.*_exps.=CPU" to offload all MoE layers to the CPU! This effectively allows you to fit all non MoE layers on 1 GPU, improving generation speeds. You can customize the regex expression to fit more layers if you have more GPU capacity.

    Instruct: Qwen3-235B-A22B-Instruct-2507

Given that this is a non-thinking model, there is no need to set enable_thinking=False, and the model does not generate <think> </think> blocks.

    ⚙️Best Practices

    To achieve optimal performance, we recommend the following settings:

1. Sampling Parameters: We suggest using temperature=0.7, top_p=0.8, top_k=20, and min_p=0, with presence_penalty between 0 and 2 if the framework supports it, to reduce endless repetitions.

    2. Adequate Output Length: We recommend using an output length of 16,384 tokens for most queries, which is adequate for instruct models.

    3. Standardize Output Format: We recommend using prompts to standardize model outputs when benchmarking.

    • Math Problems: Include Please reason step by step, and put your final answer within \boxed{}. in the prompt.

• Multiple-Choice Questions: Add the following JSON structure to the prompt to standardize responses: "Please show your choice in the `answer` field with only the choice letter, e.g., "answer": "C"." A quick sketch of such a prompt follows below.
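As a quick illustration (the question text is made up), such a standardized multiple-choice prompt could be assembled like this:

question = (
    "Which planet is known as the Red Planet?\n"
    "A) Venus  B) Mars  C) Jupiter  D) Saturn"
)
suffix = 'Please show your choice in the `answer` field with only the choice letter, e.g., "answer": "C".'
prompt = question + "\n" + suffix
print(prompt)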

    ✨Run Qwen3-235B-A22B-Instruct via llama.cpp:

    For Qwen3-235B-A22B, we will specifically use Llama.cpp for optimized inference and a plethora of options.

If you want a full precision unquantized version, use our Q8_K_XL, Q8_0 or BF16 versions!

1. Obtain the latest llama.cpp on GitHub here: https://github.com/ggml-org/llama.cpp. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.

2. You can directly use llama.cpp to download the model, but we normally suggest using huggingface_hub. To use llama.cpp directly, do:

3. Download the model via huggingface_hub (after installing pip install huggingface_hub hf_transfer). You can choose UD-Q2_K_XL or other quantized versions.

4. Run the model and try any prompt.

5. Edit --threads -1 for the number of CPU threads, --ctx-size 262144 for context length, and --n-gpu-layers 99 for how many layers to offload to the GPU. Try adjusting it if your GPU goes out of memory, and remove it if you have CPU-only inference.

    Use -ot ".ffn_.*_exps.=CPU" to offload all MoE layers to the CPU! This effectively allows you to fit all non MoE layers on 1 GPU, improving generation speeds. You can customize the regex expression to fit more layers if you have more GPU capacity.

    🛠️ Improving generation speed

    If you have more VRAM, you can try offloading more MoE layers, or offloading whole layers themselves.

    Normally, -ot ".ffn_.*_exps.=CPU" offloads all MoE layers to the CPU! This effectively allows you to fit all non MoE layers on 1 GPU, improving generation speeds. You can customize the regex expression to fit more layers if you have more GPU capacity.

    If you have a bit more GPU memory, try -ot ".ffn_(up|down)_exps.=CPU" This offloads up and down projection MoE layers.

    Try -ot ".ffn_(up)_exps.=CPU" if you have even more GPU memory. This offloads only up projection MoE layers.

    You can also customize the regex, for example -ot "\.(6|7|8|9|[0-9][0-9]|[0-9][0-9][0-9])\.ffn_(gate|up|down)_exps.=CPU" means to offload gate, up and down MoE layers but only from the 6th layer onwards.
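To sanity-check what a regex like this matches before passing it to -ot, here is a small Python sketch (the tensor names are illustrative examples of llama.cpp's blk.N.ffn_*_exps naming):

import re

pattern = re.compile(r"\.(6|7|8|9|[0-9][0-9]|[0-9][0-9][0-9])\.ffn_(gate|up|down)_exps.")

examples = [
    "blk.5.ffn_gate_exps.weight",   # layer 5: no match, stays on GPU
    "blk.6.ffn_up_exps.weight",     # layer 6: matches, offloaded to CPU
    "blk.42.ffn_down_exps.weight",  # layer 42: matches, offloaded to CPU
]
for name in examples:
    print(name, "-> CPU" if pattern.search(name) else "-> GPU")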

    The latest llama.cpp release also introduces high throughput mode. Use llama-parallel. Read more about it here. You can also quantize the KV cache to 4bits for example to reduce VRAM / RAM movement, which can also make the generation process faster. The next section talks about KV cache quantization.

    📐How to fit long context

    To fit longer context, you can use KV cache quantization to quantize the K and V caches to lower bits. This can also increase generation speed due to reduced RAM / VRAM data movement. The allowed options for K quantization (default is f16) include the below.

    --cache-type-k f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1

You should use the _1 variants for somewhat increased accuracy, albeit slightly slower, e.g. q4_1 or q5_1. So try out --cache-type-k q4_1.

You can also quantize the V cache, but you will need to compile llama.cpp with Flash Attention support via -DGGML_CUDA_FA_ALL_QUANTS=ON, and use --flash-attn to enable it. With Flash Attention enabled, you can then use --cache-type-v q4_1.

    🦥 Fine-tuning Qwen3-2507 with Unsloth

Unsloth makes Qwen3 and Qwen3-2507 fine-tuning 2x faster, with 70% less VRAM, and supports 8x longer context lengths. Because Qwen3-2507 was only released in a 30B variant, you will need about a 40GB A100 GPU to fine-tune the model using QLoRA (4-bit).

For a notebook: because the model cannot fit in Colab's free 16GB GPUs, you will need to use a 40GB A100. You can use our Conversational notebook, replacing the dataset with your own. This time you do not need combined reasoning in your dataset, as the model has no reasoning.

    • Qwen3 (14B) Reasoning + Conversational notebook

    If you have an old version of Unsloth and/or are fine-tuning locally, install the latest version of Unsloth:

    Qwen3-2507 MOE models fine-tuning

Fine-tuning support includes MoE models: 30B-A3B and 235B-A22B. Qwen3-30B-A3B works on 30GB VRAM with Unsloth. On fine-tuning MoEs: it's probably not a good idea to fine-tune the router layer, so we disabled it by default.

    Qwen3-2507-4B notebooks for: Thinking and Instruct

    The 30B-A3B fits in 30GB VRAM, but you may lack RAM or disk space since the full 16-bit model must be downloaded and converted to 4-bit on the fly for QLoRA fine-tuning. This is due to issues importing 4-bit BnB MOE models directly. This only affects MOE models.

If you're fine-tuning the MoE models, please use FastModel and not FastLanguageModel.
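A minimal sketch of that setup (assuming FastModel.get_peft_model accepts the usual LoRA arguments; the hyperparameters are illustrative, not prescriptive):

from unsloth import FastModel

model, tokenizer = FastModel.from_pretrained(
    model_name = "unsloth/Qwen3-30B-A3B",
    max_seq_length = 2048,
    load_in_4bit = True,
)
model = FastModel.get_peft_model(
    model,
    r = 16,
    lora_alpha = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    # The MoE router layers are intentionally not targeted, matching the default above.
)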

    Qwen3
    Unsloth
    Reinforcement Learning (RL)
    Run 30B-A3B
    Run 235B-A22B
    Fine-tune Qwen3-2507
    Dynamic 2.0
    • Our 1-bit Unsloth Dynamic GGUF shrinks DeepSeek-V3.1 from 671GB → 192GB (-75% size) and no-thinking mode greatly outperforms GPT-4.1 (Apr 2025), GPT-4.5, and DeepSeek-V3-0324.

    • 3-bit Unsloth DeepSeek-V3.1 (thinking) GGUF: Outperforms Claude-4-Opus-20250514 (thinking).

    • 5-bit Unsloth DeepSeek-V3.1 (non-thinking) GGUF: Matches Claude-4-Opus-20250514 (non-thinking) performance.

    • Unsloth Dynamic GGUFs perform consistently better than other non-Unsloth Dynamic imatrix GGUFs

    • Other non-Unsloth 1-bit and 2-bit DeepSeek-V3.1 quantizations, as well as standard 1-bit quantization without selective layer quantization, either failed to load or produced gibberish and looping outputs. This highlights how Unsloth Dynamic GGUFs are able to largely retain accuracy whereas other methods do not even function.

Why the Aider Polyglot benchmark? Aider is one of the most comprehensive measures of how well LLMs can write code, follow instructions, and apply changes without human intervention, making it one of the hardest and most valuable benchmarks for real-world use.

    The key advantage of using the Unsloth package and models is our active role in fixing critical bugs in major models. We've collaborated directly with teams behind Qwen3, Meta (Llama 4), Mistral (Devstral), Google (Gemma 1–3) and Microsoft (Phi-3/4), contributing essential fixes that significantly boost accuracy.

    🦥Unsloth Dynamic Quantization

Dynamic 1-bit keeps important layers in 8 or 16 bits and quantizes unimportant layers to 1, 2, 3, 4, 5 or 6 bits.

In Nov 2024, our 4-bit Dynamic Quants showcased how you could largely restore QLoRA fine-tuning and model accuracy by just selectively quantizing layers. We later studied DeepSeek-R1's architecture and applied a similar methodology, quantizing some layers to as low as 1-bit and important layers to higher bits (6, 8-bit). This approach quickly gained popularity and has proven especially effective for MoE models, making dynamic quantization the de facto standard for MoE quantization.

    Our Dynamic GGUFs are even more effective when paired with our imatrix calibration dataset, designed for chat and coding performance. All of this enabled extreme LLM compression without catastrophic loss in quality.
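Purely as an illustration of the idea (this is not Unsloth's actual quantization code), a dynamic scheme amounts to mapping tensor-name patterns to different bit-widths, keeping sensitive tensors high-bit:

import re

# Illustrative only: regex pattern -> target bit-width.
dynamic_bits = {
    r"attn_k_b":     8,   # sensitive tensors stay high-bit (see the ablations below)
    r"attn_.*":      6,
    r"ffn_.*_exps":  2,   # routed MoE experts make up most of the size, so go very low-bit
    r"token_embd":   8,
    r"output":       8,
}

def pick_bits(tensor_name: str, default: int = 4) -> int:
    for pattern, bits in dynamic_bits.items():
        if re.search(pattern, tensor_name):
            return bits
    return default

print(pick_bits("blk.10.attn_k_b.weight"))       # 8
print(pick_bits("blk.10.ffn_gate_exps.weight"))  # 2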

    For example in Qwen2-VL-2B-Instruct, naively quantizing all layers to 4bit causes the model to fail understanding the image below. It's a train, not a coastal scene!

    We also showed dynamic benchmarks in https://docs.unsloth.ai/basics/unsloth-dynamic-2.0-ggufs for Gemma 3 and Llama 4 Scout, showing how effective our methodology is:

    ⚙️Benchmark setup

    For our DeepSeek-V3.1 experiments, we compared different bits of Unsloth Dynamic GGUFs against:

    • Full-precision, unquantized LLMs including GPT 4.5, 4.1, Claude-4-Opus, DeepSeek-V3-0324 etc.

    • Other dynamic imatrix V3.1 GGUFs

    • Semi-dynamic (some selective layer quantization) imatrix V3.1 GGUFs for ablation purposes.

Benchmark experiments were mainly conducted by David Sluys (neolithic5452 on Aider Discord), a trusted community contributor to Aider Polyglot evaluations. Tests were run ~3 times and the median score taken; Pass-2 accuracy is reported, as per convention. There are some reproducible benchmark code snippets in Aider's Discord.

Expand for Reasoning model Aider benchmarks

| Model | Accuracy |
| --- | --- |
| GPT-5 | 86.7 |
| Gemini 2.5 Pro (June) | 83.1 |
| o3 | 76.9 |

Expand for Non Reasoning model Aider benchmarks

| Model | Accuracy |
| --- | --- |
| DeepSeek V3.1 | 71.6 |
| Claude-4-Opus (May) | 70.7 |
| (5 bit) DeepSeek V3.1 Unsloth | 70.7 |

DeepSeek V3.1 has both a reasoning and a non-reasoning mode, and we test both. For non-reasoning, we see a clear trend in how our dynamic quantizations perform below: dynamic 5-bit attains 70.7% on Aider Pass-2, whilst dynamic 1-bit attains 55.7%. In terms of size versus accuracy, the 3-bit and 4-bit are extremely powerful!

    🎇Comparison to other quants

    We also run the Aider Polyglot benchmark on other dynamic imatrix GGUFs from the community and compare it to ours. To ensure a fair comparison, we do the following:

    1. We select similar sized files and bit types to each Unsloth quant.

2. We use our fixed chat template if the community quant fails to execute the benchmark. We found some community quants fail with {"code":500,"message":"split method must have between 1 and 1 positional arguments and between 0 and 0 keyword arguments at row 3, column 1908"}, and this gets fixed by using our fixed chat template.

We see Unsloth dynamic quants doing remarkably well when compared to other community quantizations for the same model size and quant type!

Expand for raw numerical data comparison to other quants

| Quant | Quant Size (GB) | Unsloth Accuracy % | Comparison Accuracy % |
| --- | --- | --- | --- |
| IQ2_XXS | 164 | 43.6 | |

    🍰Dynamic quantization ablations

We also ran some ablations to confirm whether our calibration dataset and our dynamic quantization methodology actually work. The trick of Unsloth's dynamic method is to quantize important layers to higher bits, say 8 bits, whilst unimportant layers are left in lower bits like 2 bits.

    To test our method, we leave specific tensors in lower precision like 4bit vs higher precision. For example below we leave attn_k_b tensors in 4bit (semi-dynamic) vs 8bit (Unsloth current), and by increasing the quant size by only ~100MB or so (<0.1%), accuracy shoots up dramatically!

attn_k_b and other tensors in DeepSeek V3.1 are highly important / sensitive to quantization and should be left in higher precision to retain accuracy!

    🐛Chat Template Bug Fixes

During testing of DeepSeek-V3.1 quants, we found some lower-bit quants not enclosing <think> </think> properly or doing some weird formatting. This caused some community quants to not work at lower bits, which made for unfair comparisons. We found that llama.cpp's usage of minja (a simpler version of jinja) does not accept positional arguments in .split. We had to change:

    to the below:

    See here for our fixed chat template or here for a raw jinja file.

    📊Pass Rate 1

Aider is reported mainly on pass rate 2. We also report pass rate 1 to compare community quants of the same size. We see our dynamic quants do much better than other community quants of similar sizes, especially below 2-bit and above 4-bit; 3-bit and 4-bit perform similarly well.

    💻Run DeepSeek V3.1 Dynamic quants

Head over to our DeepSeek V3.1 guide, or to quickly get the dynamic 2-bit version, do:

then use llama.cpp to directly download the weights. We have already set the optimal suggested parameters such as temperature, the chat template, etc.:


    Fine-tune gpt-oss-20b for free with our Colab notebook

    Trained with RL, gpt-oss-120b rivals o4-mini and gpt-oss-20b rivals o3-mini. Both excel at function calling and CoT reasoning, surpassing o1 and GPT-4o.

    gpt-oss - Unsloth GGUFs:

    Includes Unsloth's chat template fixes. For best results, use our uploads & train with Unsloth!

    • 20B: gpt-oss-20B

    • 120B: gpt-oss-120B

    📜Unsloth fixes for gpt-oss

Some of our fixes were pushed upstream to OpenAI's official model on Hugging Face.

    OpenAI released a standalone parsing and tokenization library called Harmony which allows one to tokenize conversations to OpenAI's preferred format for gpt-oss.

Inference engines generally use the jinja chat template rather than the Harmony package, and we found some issues with them after comparing with Harmony directly. In the comparison below, the top is the correct rendered form from Harmony, and the bottom is the one rendered by the current jinja chat template. There are quite a few differences!

We also made some functions to allow you to use OpenAI's Harmony library directly without a jinja chat template if you desire - you can simply pass in normal conversations like below:

    Then use the encode_conversations_with_harmony function from Unsloth:

    The harmony format includes multiple interesting things:

1. reasoning_effort = "medium": you can select low, medium or high, and this changes gpt-oss's reasoning budget - generally the higher, the better the accuracy of the model.

    2. developer_instructions is like a system prompt which you can add.

    3. model_identity is best left alone - you can edit it, but we're unsure if custom ones will function.

    We find multiple issues with current jinja chat templates (there exists multiple implementations across the ecosystem):

1. Function and tool calls are rendered with tojson, which is fine if it's a dict, but if it's a string, speech marks and other symbols become backslashed.

    2. There are some extra new lines in the jinja template on some boundaries.

    3. Tool calling thoughts from the model should have the analysis tag and not final tag.

    4. Other chat templates seem to not utilize <|channel|>final at all - one should use this for the final assistant message. You should not use this for thinking traces or tool calls.

    Our chat templates for the GGUF, our BnB and BF16 uploads and all versions are fixed! For example when comparing both ours and Harmony's format, we get no different characters:

    🔢 Precision issues

We found multiple precision issues on Tesla T4 and other float16 machines, primarily since the model was trained using BF16, and so outliers and overflows existed. MXFP4 is not actually supported on Ampere and older GPUs, so Triton provides tl.dot_scaled for MXFP4 matrix multiplication, which upcasts the matrices to BF16 internally on the fly.

    We made a MXFP4 inference notebook as well in Tesla T4 Colab!

    Software emulation enables targeting hardware architectures without native microscaling operation support. Right now for such case, microscaled lhs/rhs are upcasted to bf16 element type beforehand for dot computation,

We found that if you use float16 as the mixed-precision autocast data type, you will get infinities after some time. To counteract this, we do the MoE computation in bfloat16, then leave the result in either bfloat16 or float32 precision. If older GPUs don't have bfloat16 support at all (like the T4), then float32 is used.

    We also change all precisions of operations (like the router) to float32 for float16 machines.
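As a rough illustration of the idea (not Unsloth's exact code), the data types for the MoE block and the router can be picked based on whether the GPU supports bfloat16:

```python
import torch

# Rough sketch: choose safe dtypes on float16-only GPUs (e.g. Tesla T4).
has_bf16 = torch.cuda.is_available() and torch.cuda.is_bf16_supported()

# MoE block: bfloat16 where available, otherwise fall back to float32
# to avoid float16 overflows (infinities) during training/inference.
moe_dtype = torch.bfloat16 if has_bf16 else torch.float32

# Overflow-prone ops like the router are kept in float32 on float16 machines.
router_dtype = torch.bfloat16 if has_bf16 else torch.float32

print(f"MoE dtype: {moe_dtype}, router dtype: {router_dtype}")
```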

    🖥️ Running gpt-oss

    Below are guides for the 20B and 120B variants of the model.

    Any quant smaller than F16, including 2-bit has minimal accuracy loss, since only some parts (e.g., attention layers) are lower bit while most remain full-precision. That’s why sizes are close to the F16 model; for example, the 2-bit (11.5 GB) version performs nearly the same as the full 16-bit (14 GB) one. Once llama.cpp supports better quantization for these models, we'll upload them ASAP.

The gpt-oss models from OpenAI include a feature that allows users to adjust the model's "reasoning effort." This gives you control over the trade-off between the model's performance and its response speed (latency), by changing the number of tokens the model will use to think.

    The gpt-oss models offer three distinct levels of reasoning effort you can choose from:

    • Low: Optimized for tasks that need very fast responses and don't require complex, multi-step reasoning.

    • Medium: A balance between performance and speed.

    • High: Provides the strongest reasoning performance for tasks that require it, though this results in higher latency.

    ⚙️ Recommended Settings

    OpenAI recommends these inference settings for both models:

    temperature=1.0, top_p=1.0, top_k=0

    • Temperature of 1.0

    • Top_K = 0 (or experiment with 100 for possible better results)

    • Top_P = 1.0

    • Recommended minimum context: 16,384

    • Maximum context length window: 131,072
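As a small sketch, the same sampling settings expressed as transformers generation arguments (the `model.generate` call is commented out because `model` and `inputs` come from your own setup):

```python
# Sketch: OpenAI's recommended sampling settings for gpt-oss,
# written as Hugging Face transformers generation arguments.
generation_kwargs = dict(
    do_sample = True,
    temperature = 1.0,
    top_p = 1.0,
    top_k = 0,          # 0 disables top-k filtering (or experiment with 100)
    max_new_tokens = 512,
)
# outputs = model.generate(**inputs, **generation_kwargs)
```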

    Chat template:

    The end of sentence/generation token: EOS is <|return|>

    Run gpt-oss-20B

    To achieve inference speeds of 6+ tokens per second for our Dynamic 4-bit quant, have at least 14GB of unified memory (combined VRAM and RAM) or 14GB of system RAM alone. As a rule of thumb, your available memory should match or exceed the size of the model you’re using. GGUF Link: unsloth/gpt-oss-20b-GGUF

    NOTE: The model can run on less memory than its total size, but this will slow down inference. Maximum memory is only needed for the fastest speeds.

    Follow the best practices above. They're the same as the 120B model.

    You can run the model on Google Colab, Docker, LM Studio or llama.cpp for now. See below:

    You can run gpt-oss-20b for free with our Google Colab notebook

    🐋 Docker: Run gpt-oss-20b Tutorial

    If you already have Docker desktop, all you need to do is run the command below and you're done:

    ✨ Llama.cpp: Run gpt-oss-20b Tutorial

    1. Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.

2. You can directly pull from Hugging Face via:

3. Download the model via (after installing pip install huggingface_hub hf_transfer ).

    Run gpt-oss-120b:

    To achieve inference speeds of 6+ tokens per second for our 1-bit quant, we recommend at least 66GB of unified memory (combined VRAM and RAM) or 66GB of system RAM alone. As a rule of thumb, your available memory should match or exceed the size of the model you’re using. GGUF Link: unsloth/gpt-oss-120b-GGUF

    NOTE: The model can run on less memory than its total size, but this will slow down inference. Maximum memory is only needed for the fastest speeds.

    Follow the best practices above. They're the same as the 20B model.

    📖 Llama.cpp: Run gpt-oss-120b Tutorial

    For gpt-oss-120b, we will specifically use Llama.cpp for optimized inference.

    If you want a full precision unquantized version, use our F16 versions!

    1. Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.

2. You can directly use llama.cpp to download the model, but we normally suggest using huggingface_hub. To use llama.cpp directly, do:


3. Or, download the model via (after installing pip install huggingface_hub hf_transfer ). You can choose UD-Q2_K_XL, or other quantized versions.

    4. Run the model in conversation mode and try any prompt.

5. Edit --threads -1 for the number of CPU threads, --ctx-size 16384 for context length (increase for longer contexts), and --n-gpu-layers 99 for how many layers to offload to the GPU. Try adjusting it if your GPU runs out of memory, and remove it for CPU-only inference.

    Use -ot ".ffn_.*_exps.=CPU" to offload all MoE layers to the CPU! This effectively allows you to fit all non MoE layers on 1 GPU, improving generation speeds. You can customize the regex expression to fit more layers if you have more GPU capacity. More options discussed here.

    🛠️ Improving generation speed

    If you have more VRAM, you can try offloading more MoE layers, or offloading whole layers themselves.

    Normally, -ot ".ffn_.*_exps.=CPU" offloads all MoE layers to the CPU! This effectively allows you to fit all non MoE layers on 1 GPU, improving generation speeds. You can customize the regex expression to fit more layers if you have more GPU capacity.

    If you have a bit more GPU memory, try -ot ".ffn_(up|down)_exps.=CPU" This offloads up and down projection MoE layers.

    Try -ot ".ffn_(up)_exps.=CPU" if you have even more GPU memory. This offloads only up projection MoE layers.

    You can also customize the regex, for example -ot "\.(6|7|8|9|[0-9][0-9]|[0-9][0-9][0-9])\.ffn_(gate|up|down)_exps.=CPU" means to offload gate, up and down MoE layers but only from the 6th layer onwards.

    The latest llama.cpp release also introduces high throughput mode. Use llama-parallel. Read more about it here. You can also quantize the KV cache to 4bits for example to reduce VRAM / RAM movement, which can also make the generation process faster.

    🦥 Fine-tuning gpt-oss with Unsloth

    Unsloth gpt-oss fine-tuning is 1.5x faster, uses 70% less VRAM, and supports 10x longer context lengths. gpt-oss-20b QLoRA training fits on a 14GB VRAM, and gpt-oss-120b works on 65GB VRAM.

    • QLoRA requirements: gpt-oss-20b = 14GB VRAM • gpt-oss-120b = 65GB VRAM.

    • BF16 LoRA requirements: gpt-oss-20b = 44GB VRAM • gpt-oss-120b = 210GB VRAM.

    Read our step-by-step tutorial for fine-tuning gpt-oss:

    You can now export/save your QLoRA fine-tuned gpt-oss model to llama.cpp, vLLM, HF etc.

    Free Unsloth notebooks to fine-tune gpt-oss:

    • gpt-oss-20b Reasoning + Conversational notebook

    Reinforcement Learning (GRPO)

Unsloth now supports RL for gpt-oss! We made two notebooks (a 2048 game notebook and a kernel generation notebook); for more details, read our specific blog for gpt-oss RL: gpt-oss RL

    (Official OpenAI example)

    💾NEW: Saving to GGUF, vLLM after gpt-oss training

    You can now QLoRA fine-tune gpt-oss and directly save, export, or merge the model to llama.cpp, vLLM, or HF - not just Unsloth. We will be releasing a free notebook hopefully soon.

    Previously, any QLoRA fine-tuned gpt-oss model was restricted to running in Unsloth. We’ve removed that limitation by introducing on-demand dequantization of MXFP4 base models (like gpt-oss) during the LoRA merge process. This makes it possible to export your fine-tuned model in bf16 format.

    After fine-tuning your gpt-oss model, you can now merge it into a 16-bit format with a single command:

    If you prefer to merge the model and push to the hugging-face hub directly instead, you could do so using:

    💡Making efficient gpt-oss fine-tuning work

We found that while MXFP4 is highly efficient, it does not natively support training, so this limitation also applies to gpt-oss. To overcome it, we implemented custom training functions specifically for MXFP4 layers by mimicking them via Bitsandbytes NF4 quantization.

    We utilized OpenAI's Triton Kernels library directly to allow MXFP4 inference. For finetuning / training however, the MXFP4 kernels do not yet support training, since the backwards pass is not yet implemented. We're actively working on implementing it in Triton! There is a flag called W_TRANSPOSE as mentioned here, which should be implemented. The derivative can be calculated by the transpose of the weight matrices, and so we have to implement the transpose operation.

    If you want to train gpt-oss with any library other than Unsloth, you’ll need to upcast the weights to bf16 before training. This approach, however, significantly increases both VRAM usage and training time by as much as 300% more memory usage! ALL other training methods will require a minimum of 65GB VRAM to train the 20b model while Unsloth only requires 14GB VRAM (-80%).

    As both models use MoE architecture, the 20B model selects 4 experts out of 32, while the 120B model selects 4 out of 128 per token. During training and release, weights are stored in MXFP4 format as nn.Parameter objects, not as nn.Linear layers, which complicates quantization, especially since MoE/MLP experts make up about 19B of the 20B parameters.

    To enable BitsandBytes quantization and memory-efficient fine-tuning, we converted these parameters into nn.Linear layers. Although this slightly slows down operations, it allows fine-tuning on GPUs with limited memory, a worthwhile trade-off.
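A toy sketch of that conversion (illustrative only, not Unsloth's actual implementation): an expert weight stored as a bare nn.Parameter is wrapped into an nn.Linear, so standard quantization and LoRA tooling can target it like any other linear layer:

```python
import torch
import torch.nn as nn

def expert_param_to_linear(expert_weight: torch.Tensor) -> nn.Linear:
    """Wrap a dequantized expert weight of shape (out_features, in_features)
    in an nn.Linear so it can be quantized / LoRA-targeted like any Linear."""
    out_features, in_features = expert_weight.shape
    linear = nn.Linear(in_features, out_features, bias=False, dtype=expert_weight.dtype)
    with torch.no_grad():
        linear.weight.copy_(expert_weight)
    return linear

# Example with a dummy bf16 expert weight (sizes are arbitrary here)
dummy = torch.randn(2880, 2880, dtype=torch.bfloat16)
print(expert_param_to_linear(dummy))
```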

    Datasets fine-tuning guide

    Though gpt-oss supports only reasoning, you can still fine-tune it with a non-reasoning dataset, but this may affect its reasoning ability. If you want to maintain its reasoning capabilities (optional), you can use a mix of direct answers and chain-of-thought examples. Use at least 75% reasoning and 25% non-reasoning in your dataset to make the model retain its reasoning capabilities.
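A small sketch of building such a 75/25 mix with the datasets library (the dataset names below are placeholders for your own reasoning and non-reasoning sets):

```python
from datasets import load_dataset, concatenate_datasets

# Placeholders: substitute your own reasoning / non-reasoning datasets.
reasoning = load_dataset("my_org/my-reasoning-dataset", split="train")
plain     = load_dataset("my_org/my-plain-dataset", split="train")

# Keep roughly 75% reasoning rows and 25% non-reasoning rows.
n_plain = int(len(reasoning) / 3)  # 25% of the final mix
plain   = plain.shuffle(seed=3407).select(range(min(n_plain, len(plain))))

mixed = concatenate_datasets([reasoning, plain]).shuffle(seed=3407)
print(len(reasoning), len(plain), len(mixed))
```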

    Our gpt-oss-20b Conversational notebook uses OpenAI's example which is Hugging Face's Multilingual-Thinking dataset. The purpose of using this dataset is to enable the model to learn and develop reasoning capabilities in these four distinct languages.

    Run gpt-oss-20b
    Run gpt-oss-120b
    Fine-tune gpt-oss
    Aug 28 update
    Unsloth Flex Attention
    Read more here
    ⚡Tutorial: How to Fine-tune gpt-oss

    Datasets Guide

    Learn how to create & prepare a dataset for fine-tuning.

    What is a Dataset?

    For LLMs, datasets are collections of data that can be used to train our models. In order to be useful for training, text data needs to be in a format that can be tokenized. You'll also learn how to use datasets inside of Unsloth.

    One of the key parts of creating a dataset is your chat template and how you are going to design it. Tokenization is also important as it breaks text into tokens, which can be words, sub-words, or characters so LLMs can process it effectively. These tokens are then turned into embeddings and are adjusted to help the model understand the meaning and context.
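For instance, a quick way to see tokenization in action (any Hugging Face tokenizer works; the model name below is just an example):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("unsloth/Llama-3.2-1B-Instruct")
text = "Unsloth makes fine-tuning faster!"
ids = tokenizer(text)["input_ids"]

print(ids)                                   # token ids fed to the model
print(tokenizer.convert_ids_to_tokens(ids))  # the corresponding sub-word tokens
```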

    Data Format

    To enable the process of tokenization, datasets need to be in a format that can be read by a tokenizer.

| Format | Description | Training Type |
| --- | --- | --- |
| Raw Corpus | Raw text from a source such as a website, book, or article. | Continued Pretraining (CPT) |
| Instruct | Instructions for the model to follow and an example of the output to aim for. | Supervised fine-tuning (SFT) |
| Conversation | Multiple-turn conversation between a user and an AI assistant. | Supervised fine-tuning (SFT) |
| RLHF | Conversation between a user and an AI assistant, with the assistant's responses being ranked by a script, another model or human evaluator. | Reinforcement Learning (RL) |

It's worth noting that different styles of format exist for each of these types.

    Getting Started

    Before we format our data, we want to identify the following:

    1

    Purpose of dataset

    Knowing the purpose of the dataset will help us determine what data we need and format to use.

    The purpose could be, adapting a model to a new task such as summarization or improving a model's ability to role-play a specific character. For example:

• Chat-based dialogues (Q&A, learn a new language, customer support, conversations).

• Structured tasks (classification, summarization, generation tasks).

• Domain-specific data (medical, finance, technical).

One of the best ways to create a better dataset is by combining it with a more generalized dataset from Hugging Face like ShareGPT to make your model smarter and diverse. You could also add synthetically generated data.

2

Style of output

The style of output will let us know what sources of data we will use to reach our desired output.

For example, the type of output you want to achieve could be JSON, HTML, text or code. Or perhaps you want it to be Spanish, English or German etc.

3

Data source

When we know the purpose and style of the data we need, we need to analyze the quality and quantity of the data. Hugging Face and Wikipedia are great sources of datasets, and Wikipedia is especially useful if you are looking to train a model to learn a language.

The source of data can be a CSV file, PDF or even a website. You can also synthetically generate data, but extra care is required to make sure each example is high quality and relevant.

    Formatting the Data

    When we have identified the relevant criteria, and collected the necessary data, we can then format our data into a machine readable format that is ready for training.

    Common Data Formats for LLM Training

For continued pretraining, we use raw text format without specific structure:

    This format preserves natural language flow and allows the model to learn from continuous text.

If we are adapting a model to a new task, and intend for the model to output text in a single turn based on a specific set of instructions, we can use the Instruction format in Alpaca style:

    When we want multiple turns of conversation we can use the ShareGPT format:

The template format uses the "from"/"value" attribute keys, and messages alternate between human and gpt, allowing for natural dialogue flow.

The other common format is OpenAI's ChatML format, which is what Hugging Face defaults to. This is probably the most used format, and it alternates between user and assistant:
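For example, the same two-turn exchange in both formats (illustrative values):

```python
# ShareGPT style: "from"/"value" keys, roles are "human" and "gpt"
sharegpt_row = {
    "conversations": [
        {"from": "human", "value": "What is 2+2?"},
        {"from": "gpt",   "value": "2+2 equals 4."},
    ]
}

# ChatML / OpenAI style: "role"/"content" keys, roles are "user" and "assistant"
chatml_row = {
    "messages": [
        {"role": "user",      "content": "What is 2+2?"},
        {"role": "assistant", "content": "2+2 equals 4."},
    ]
}
```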

    Applying Chat Templates with Unsloth

    For datasets that usually follow the common chatml format, the process of preparing the dataset for training or finetuning, consists of four simple steps:

• Check the chat templates that Unsloth currently supports:

  This will print out the list of templates currently supported by Unsloth. Here is an example output:

• Use get_chat_template to apply the right chat template to your tokenizer:

• Define your formatting function. Here's an example:
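A condensed sketch of those steps (the dataset name and the "text" column are placeholders, and `tokenizer` is assumed to come from FastLanguageModel.from_pretrained; exact arguments may vary by Unsloth version):

```python
from datasets import load_dataset
from unsloth.chat_templates import CHAT_TEMPLATES, get_chat_template

# 1. Inspect the chat templates Unsloth ships with
print(list(CHAT_TEMPLATES.keys()))

# 2. Attach the template you want to your tokenizer
tokenizer = get_chat_template(tokenizer, chat_template = "chatml")

# 3. Formatting function: render each conversation into a single "text" field
def formatting_prompts_func(examples):
    convos = examples["messages"]
    texts = [
        tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False)
        for convo in convos
    ]
    return {"text": texts}

# 4. Load the dataset and apply the formatting function
dataset = load_dataset("my_org/my-chatml-dataset", split = "train")  # placeholder
dataset = dataset.map(formatting_prompts_func, batched = True)
```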

    Formatting Data Q&A

    Q: How can I use the Alpaca instruct format?

A: If your dataset is already formatted in the Alpaca format, then follow the formatting steps as shown in the Llama 3.1 notebook. If you need to convert your data to the Alpaca format, one approach is to create a Python script to process your raw data. If you're working on a summarization task, you can use a local LLM to generate instructions and outputs for each example.

    Q: Should I always use the standardize_sharegpt method?

A: Only use the standardize_sharegpt method if your target dataset is formatted in the ShareGPT format, but your model expects a ChatML format instead.

Q: Why not use the apply_chat_template function that comes with the tokenizer?

A: The chat_template attribute set when a model is first uploaded by the original model owners sometimes contains errors and may take time to be updated. In contrast, at Unsloth, we thoroughly check and fix any errors in the chat_template for every model when we upload the quantized versions to our repositories. Additionally, our get_chat_template and apply_chat_template methods offer advanced data manipulation features, which are fully documented on our Chat Templates documentation page.

    Q: What if my template is not currently supported by Unsloth?

    A: Submit a feature request on the unsloth github issues . As a temporary workaround, you could also use the tokenizer's own apply_chat_template function until your feature request is approved and merged.

    Synthetic Data Generation

You can also use any local LLM like Llama 3.3 (70B) or OpenAI's GPT 4.5 to generate synthetic data. Generally, it is better to use a bigger model like Llama 3.3 (70B) to ensure the highest quality outputs. You can directly use inference engines like vLLM, Ollama or llama.cpp to generate synthetic data, but it will require some manual work to collect it and prompt for more data. There are 3 goals for synthetic data:

    • Produce entirely new data - either from scratch or from your existing dataset

• Diversify your dataset so your model does not overfit and become too specific

    • Augment existing data e.g. automatically structure your dataset in the correct chosen format

    Synthetic Dataset Notebook

    We collaborated with Meta to launch a free notebook for creating Synthetic Datasets automatically using local models like Llama 3.2.

    What the notebook does:

    • Auto-parses PDFs, websites, YouTube videos and more

    • Uses Meta’s Synthetic Data Kit + Llama 3.2 (3B) to generate QA pairs

    • Cleans and filters the data automatically

    • Fine-tunes the dataset with Unsloth + Llama

    Using a local LLM or ChatGPT for synthetic data

    Your goal is to prompt the model to generate and process QA data that is in your specified format. The model will need to learn the structure that you provided and also the context so ensure you at least have 10 examples of data already. Examples prompts:

    • Prompt for generating more dialogue on an existing dataset:

• Prompt if you have no dataset:


    • Prompt for a dataset without formatting:

    It is recommended to check the quality of generated data to remove or improve on irrelevant or poor-quality responses. Depending on your dataset it may also have to be balanced in many areas so your model does not overfit. You can then feed this cleaned dataset back into your LLM to regenerate data, now with even more guidance.

    Dataset FAQ + Tips

    How big should my dataset be?

    We generally recommend using a bare minimum of at least 100 rows of data for fine-tuning to achieve reasonable results. For optimal performance, a dataset with over 1,000 rows is preferable, and in this case, more data usually leads to better outcomes. If your dataset is too small you can also add synthetic data or add a dataset from Hugging Face to diversify it. However, the effectiveness of your fine-tuned model depends heavily on the quality of the dataset, so be sure to thoroughly clean and prepare your data.

    How should I structure my dataset if I want to fine-tune a reasoning model?

If you want to fine-tune a model that already has reasoning capabilities, like the distilled versions of DeepSeek-R1 (e.g. DeepSeek-R1-Distill-Llama-8B), you will still need to follow question/task and answer pairs; however, you will need to change the answer so it includes the reasoning/chain-of-thought process and the steps it took to derive the answer. For a model that does not have reasoning and that you want to train so that it later encompasses reasoning capabilities, you will need to utilize a standard dataset, but this time without reasoning in its answers. This training process is known as Reinforcement Learning (GRPO).
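For instance, a reasoning-style row might pair the question with an answer that contains the chain-of-thought before the final reply (the tags vary per model; <think> is just an example):

```python
reasoning_row = {
    "messages": [
        {"role": "user", "content": "What is 15% of 240?"},
        {"role": "assistant",
         "content": "<think>15% is 0.15, and 0.15 * 240 = 36.</think>The answer is 36."},
    ]
}
```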

    Multiple datasets

    If you have multiple datasets for fine-tuning, you can either:

    • Standardize the format of all datasets, combine them into a single dataset, and fine-tune on this unified dataset.

• Use the Multiple Datasets notebook to fine-tune on multiple datasets directly.

    Can I fine-tune the same model multiple times?

    You can fine-tune an already fine-tuned model multiple times, but it's best to combine all the datasets and perform the fine-tuning in a single process instead. Training an already fine-tuned model can potentially alter the quality and knowledge acquired during the previous fine-tuning process.

    Using Datasets in Unsloth

    Alpaca Dataset

    See an example of using the Alpaca dataset inside of Unsloth on Google Colab:

    We will now use the Alpaca Dataset created by calling GPT-4 itself. It is a list of 52,000 instructions and outputs which was very popular when Llama-1 was released, since it made finetuning a base LLM be competitive with ChatGPT itself.

    You can access the GPT4 version of the Alpaca dataset . Below shows some examples of the dataset:

You can see there are 3 columns in each row - an instruction, an input and an output. We essentially combine each row into 1 large prompt like below. We then use this to finetune the language model, and this made it very similar to ChatGPT. We call this process supervised instruction finetuning.

    Multiple columns for finetuning

    But a big issue is for ChatGPT style assistants, we only allow 1 instruction / 1 prompt, and not multiple columns / inputs. For example in ChatGPT, you can see we must submit 1 prompt, and not multiple prompts.

    This essentially means we have to "merge" multiple columns into 1 large prompt for finetuning to actually function!

    For example the very famous Titanic dataset has many many columns. Your job was to predict whether a passenger has survived or died based on their age, passenger class, fare price etc. We can't simply pass this into ChatGPT, but rather, we have to "merge" this information into 1 large prompt.

    For example, if we ask ChatGPT with our "merged" single prompt which includes all the information for that passenger, we can then ask it to guess or predict whether the passenger has died or survived.

    Other finetuning libraries require you to manually prepare your dataset for finetuning, by merging all your columns into 1 prompt. In Unsloth, we simply provide the function called to_sharegpt which does this in 1 go!

    Now this is a bit more complicated, since we allow a lot of customization, but there are a few points:

    • You must enclose all columns in curly braces {}. These are the column names in the actual CSV / Excel file.

    • Optional text components must be enclosed in [[]]. For example if the column "input" is empty, the merging function will not show the text and skip this. This is useful for datasets with missing values.

    • Select the output or target / prediction column in output_column_name. For the Alpaca dataset, this will be output

    For example in the Titanic dataset, we can create a large merged prompt format like below, where each column / piece of text becomes optional.
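For instance, a merged prompt for the Titanic columns could look like this (a sketch; the column names must match your file exactly):

```python
# Each {Column} is substituted per row; [[...]] sections are dropped
# entirely when the referenced column is empty.
merged_prompt = (
    "[[The passenger embarked from {Embarked}.]]"
    "[[ Their age is {Age}.]]"
    "[[ Their fare is ${Fare}.]]"
    " Did this passenger survive?"
)
```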

    For example, pretend the dataset looks like this with a lot of missing data:

| Embarked | Age | Fare |
| --- | --- | --- |
| S | 23 | |
| | 18 | 7.25 |

    Then, we do not want the result to be:

    1. The passenger embarked from S. Their age is 23. Their fare is EMPTY.

    2. The passenger embarked from EMPTY. Their age is 18. Their fare is $7.25.

    Instead by optionally enclosing columns using [[]], we can exclude this information entirely.

    1. [[The passenger embarked from S.]] [[Their age is 23.]] [[Their fare is EMPTY.]]

    2. [[The passenger embarked from EMPTY.]] [[Their age is 18.]] [[Their fare is $7.25.]]

    becomes:

    1. The passenger embarked from S. Their age is 23.

    2. Their age is 18. Their fare is $7.25.

    Multi turn conversations

A big issue, if you didn't notice, is that the Alpaca dataset is single turn, whilst ChatGPT was interactive and you could talk to it over multiple turns. For example, the left is what we want, but the right, which is the Alpaca dataset, only provides singular conversations. We want the finetuned language model to somehow learn how to do multi-turn conversations just like ChatGPT.

    So we introduced the conversation_extension parameter, which essentially selects some random rows in your single turn dataset, and merges them into 1 conversation! For example, if you set it to 3, we randomly select 3 rows and merge them into 1! Setting them too long can make training slower, but could make your chatbot and final finetune much better!

Then set output_column_name to the prediction / output column. For the Alpaca dataset, it would be the output column.

    We then use the standardize_sharegpt function to just make the dataset in a correct format for finetuning! Always call this!
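Putting it together, a sketch of the Alpaca-style call (mirroring Unsloth's Llama 3 notebooks; exact arguments may vary by version):

```python
from unsloth import to_sharegpt
from unsloth.chat_templates import standardize_sharegpt

dataset = to_sharegpt(
    dataset,
    merged_prompt = "{instruction}[[\nYour input is:\n{input}]]",
    output_column_name = "output",       # the target / prediction column
    conversation_extension = 3,          # merge 3 random rows into one conversation
)
dataset = standardize_sharegpt(dataset)  # normalize into the format used for finetuning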

    Vision Fine-tuning

The dataset for fine-tuning a vision or multimodal model also includes image inputs. For example, the Llama 3.2 Vision notebook uses a radiography case to show how AI can help medical professionals analyze X-rays, CT scans, and ultrasounds more efficiently.

    We'll be using a sampled version of the ROCO radiography dataset. You can access the dataset . The dataset includes X-rays, CT scans and ultrasounds showcasing medical conditions and diseases. Each image has a caption written by experts describing it. The goal is to finetune a VLM to make it a useful analysis tool for medical professionals.

    Let's take a look at the dataset, and check what the 1st example shows:

| Image | Caption |
| --- | --- |
| (radiograph) | Panoramic radiography shows an osteolytic lesion in the right posterior maxilla with resorption of the floor of the maxillary sinus (arrows). |

    To format the dataset, all vision finetuning tasks should be formatted as follows:

We will craft a custom instruction asking the VLM to be an expert radiographer. Notice also that instead of just 1 instruction, you can add multiple turns to make it a dynamic conversation.

    Let's convert the dataset into the "correct" format for finetuning:
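A sketch of that conversion, following the pattern in Unsloth's vision notebooks (the field names "image" and "caption" depend on your dataset, and `dataset` is assumed to be already loaded):

```python
instruction = "You are an expert radiographer. Describe accurately what you see in this image."

def convert_to_conversation(sample):
    # Pair the image with the instruction, and use the expert caption as the answer.
    conversation = [
        {"role": "user", "content": [
            {"type": "text",  "text": instruction},
            {"type": "image", "image": sample["image"]},
        ]},
        {"role": "assistant", "content": [
            {"type": "text", "text": sample["caption"]},
        ]},
    ]
    return {"messages": conversation}

converted_dataset = [convert_to_conversation(sample) for sample in dataset]
```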

    The first example is now structured like below:

    Before we do any finetuning, maybe the vision model already knows how to analyse the images? Let's check if this is the case!

    And the result:

    For more details, view our dataset section in the .

    Tutorial: How to Fine-tune gpt-oss

    Learn step-by-step how to train OpenAI gpt-oss locally with Unsloth.

    In this guide with screenshots, you'll learn to fine-tune your own custom gpt-oss model either locally on your machine or for free using Google Colab. We'll walk you through the entire process, from setup to running and saving your trained model.

    Aug 28 update: You can now export/save your QLoRA fine-tuned gpt-oss model to llama.cpp, vLLM, HF etc.

    We also introduced Unsloth Flex Attention which enables >8× longer context lengths, >50% less VRAM usage and >1.5× faster training vs. all implementations. Read more here

Quickstart: Fine-tune gpt-oss-20b for free with our Colab notebook.

    Unsloth gpt-oss fine-tuning, when compared to all other FA2 implementations, achieves 1.5× faster training, 70% reduction in VRAM use, and 10x longer context lengths - with no accuracy loss.

    • QLoRA requirements: gpt-oss-20b = 14GB VRAM • gpt-oss-120b = 65GB VRAM.

    • BF16 LoRA requirements: gpt-oss-20b = 44GB VRAM • gpt-oss-120b = 210GB VRAM.

    🌐 Colab gpt-oss Fine-tuning

This section covers fine-tuning gpt-oss using our Google Colab notebooks. You can also save the gpt-oss notebook, use it in your favorite code editor, and follow our local gpt-oss guide.

    1

    Install Unsloth (in Colab)

    In Colab, run cells from top to bottom. Use Run all for the first pass. The first cell installs Unsloth (and related dependencies) and prints GPU/memory info. If a cell throws an error, simply re-run it.

    2

    🖥️ Local gpt-oss Fine-tuning

    This chapter covers fine-tuning gpt-oss on your local device. While gpt-oss-20b fine-tuning can operate on just 14GB VRAM, we recommend having at least 16GB VRAM available to ensure stable and reliable training runs.

    We recommend downloading or incorporating elements from our Colab into your local setup for easier use.

    1

    Install Unsloth Locally

Ensure your device is Unsloth compatible - you can read our detailed installation guide.

    Note that pip install unsloth will not work for this setup, as we need to use the latest PyTorch, Triton and related packages. Install Unsloth using this specific command:

    2

    🏁 And that's it!

    You've fine-tuned gpt-oss with Unsloth. We're currently working on RL and GRPO implementations, as well as improved model saving and running, so stay tuned.

As always, feel free to drop by our Discord or Reddit if you need any help.

    ❓FAQ (Frequently Asked Questions)

    1. Can I export my model to use in Hugging Face, llama.cpp GGUF or vLLM later?

Yes, you can now save/export your gpt-oss fine-tuned model using Unsloth's new update!

    2. Can I do fp4 or MXFP4 training with gpt-oss?

    No, currently no framework supports fp4 or MXFP4 training. Unsloth however is the only framework to support QLoRA 4-bit fine-tuning for the model, enabling more than 4x less VRAM use.

    3. Can I export my model to MXFP4 format after training?

    No, currently no library or framework supports this.

    4. Can I do Reinforcement Learning (RL) or GRPO with gpt-oss?

    Yes! Unsloth now supports RL for gpt-oss with GRPO/GSPO. We made it work on a free Kaggle notebook and achieved the fastest inference for RL.


Acknowledgements: A huge thank you to Eyera for contributing to this guide!


    Kimi K2 Thinking: Run Locally Guide

    Guide on running Kimi-K2-Thinking and Kimi-K2 on your own local device!

    Kimi-K2-Thinking got released. Read our or access .

    We also collaborated with the Kimi team on for Kimi-K2-Thinking.

    Kimi-K2 and Kimi-K2-Thinking achieve SOTA performance in knowledge, reasoning, coding, and agentic tasks. The full 1T parameter models from Moonshot AI requires 1.09TB of disk space, while the quantized Unsloth Dynamic 1.8-bit version reduces this to just 230GB (-80% size):

    You can also now run our

    ./llama.cpp/llama-cli \
        -hf unsloth/Qwen3-30B-A3B-Instruct-2507-GGUF:Q4_K_XL \
        --jinja -ngl 99 --threads -1 --ctx-size 32684 \
        --temp 0.7 --min-p 0.0 --top-p 0.80 --top-k 20 --presence-penalty 1.0
    ./llama.cpp/llama-cli \
        -hf unsloth/Qwen3-30B-A3B-Thinking-2507-GGUF:Q4_K_XL \
        --jinja -ngl 99 --threads -1 --ctx-size 32684 \
        --temp 0.6 --min-p 0.0 --top-p 0.95 --top-k 20 --presence-penalty 1.0
    apt-get update
    apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
    git clone https://github.com/ggml-org/llama.cpp
    cmake llama.cpp -B llama.cpp/build \
        -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
    cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-gguf-split
    cp llama.cpp/build/bin/llama-* llama.cpp
    ./llama.cpp/llama-cli \
        -hf unsloth/Qwen3-235B-A22B-Thinking-2507-GGUF:Q2_K_XL \
        --threads -1 \
        --ctx-size 16384 \
        --n-gpu-layers 99 \
        -ot ".ffn_.*_exps.=CPU" \
        --temp 0.6 \
        --min-p 0.0 \
        --top-p 0.95 \
        --top-k 20 \
        --presence-penalty 1.0
    <|im_start|>user
    Hey there!<|im_end|>
    <|im_start|>assistant
    What is 1+1?<|im_end|>
    <|im_start|>user
    2<|im_end|>
    <|im_start|>assistant
    apt-get update
    apt-get install pciutils -y
    curl -fsSL https://ollama.com/install.sh | sh
    ollama run hf.co/unsloth/Qwen3-30B-A3B-Instruct-2507-GGUF:UD-Q4_K_XL
    apt-get update
    apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
    git clone https://github.com/ggml-org/llama.cpp
    cmake llama.cpp -B llama.cpp/build \
        -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
    cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-gguf-split
    cp llama.cpp/build/bin/llama-* llama.cpp
    # !pip install huggingface_hub hf_transfer
    import os
    os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
    from huggingface_hub import snapshot_download
    snapshot_download(
        repo_id = "unsloth/Qwen3-30B-A3B-Instruct-2507-GGUF",
        local_dir = "unsloth/Qwen3-30B-A3B-Instruct-2507-GGUF",
        allow_patterns = ["*UD-Q4_K_XL*"],
    )
    apt-get update
    apt-get install pciutils -y
    curl -fsSL https://ollama.com/install.sh | sh
    ollama run hf.co/unsloth/Qwen3-30B-A3B-Thinking-2507-GGUF:UD-Q4_K_XL
    apt-get update
    apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
    git clone https://github.com/ggml-org/llama.cpp
    cmake llama.cpp -B llama.cpp/build \
        -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
    cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-gguf-split
    cp llama.cpp/build/bin/llama-* llama.cpp
    # !pip install huggingface_hub hf_transfer
    import os
    os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
    from huggingface_hub import snapshot_download
    snapshot_download(
        repo_id = "unsloth/Qwen3-30B-A3B-Thinking-2507-GGUF",
        local_dir = "unsloth/Qwen3-30B-A3B-Thinking-2507-GGUF",
        allow_patterns = ["*UD-Q4_K_XL*"],
    )
    ./llama.cpp/llama-cli \
        --model unsloth/Qwen3-235B-A22B-Thinking-2507-GGUF/UD-Q2_K_XL/Qwen3-235B-A22B-Thinking-2507-UD-Q2_K_XL-00001-of-00002.gguf \
        --threads -1 \
        --ctx-size 16384 \
        --n-gpu-layers 99 \
        -ot ".ffn_.*_exps.=CPU" \
        --seed 3407 \
        --temp 0.6 \
        --min-p 0.0 \
        --top-p 0.95 \
    --top-k 20 \
    --presence-penalty 1.0
    ```bash
    apt-get update
    apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
    git clone https://github.com/ggml-org/llama.cpp
    cmake llama.cpp -B llama.cpp/build \
        -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
    cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-gguf-split
    cp llama.cpp/build/bin/llama-* llama.cpp
    ```
    ```
    ./llama.cpp/llama-cli \
        -hf unsloth/Qwen3-235B-A22B-Instruct-2507-GGUF:Q2_K_XL \
        --threads -1 \
        --ctx-size 16384 \
        --n-gpu-layers 99 \
        -ot ".ffn_.*_exps.=CPU" \
        --temp 0.7 \
        --min-p 0.0 \
        --top-p 0.8 \
        --top-k 20 \
        --repeat-penalty 1.0
    ```
    ```python
    # !pip install huggingface_hub hf_transfer
    import os
    os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "0" # Can sometimes rate limit, so set to 0 to disable
    from huggingface_hub import snapshot_download
    snapshot_download(
        repo_id = "unsloth/Qwen3-235B-A22B-Instruct-2507-GGUF",
        local_dir = "unsloth/Qwen3-235B-A22B-Instruct-2507-GGUF",
        allow_patterns = ["*UD-Q2_K_XL*"],
    )
    ```
    ./llama.cpp/llama-cli \
        --model unsloth/Qwen3-235B-A22B-Instruct-2507-GGUF/UD-Q2_K_XL/Qwen3-235B-A22B-Instruct-2507-UD-Q2_K_XL-00001-of-00002.gguf \
        --threads -1 \
        --ctx-size 16384 \
        --n-gpu-layers 99 \
        -ot ".ffn_.*_exps.=CPU" \
        --temp 0.7 \
        --min-p 0.0 \
        --top-p 0.8 \
        --top-k 20
    pip install --upgrade --force-reinstall --no-cache-dir unsloth unsloth_zoo
    from unsloth import FastModel
    import torch
    model, tokenizer = FastModel.from_pretrained(
        model_name = "unsloth/Qwen3-30B-A3B-Instruct-2507",
        max_seq_length = 2048, # Choose any for long context!
        load_in_4bit = True,  # 4 bit quantization to reduce memory
        load_in_8bit = False, # [NEW!] A bit more accurate, uses 2x memory
        full_finetuning = False, # [NEW!] We have full finetuning now!
        # token = "hf_...", # use one if using gated models
    )
    {%- set content = content.split("</think>", 1)[1] -%}
    {%- set splitted = content.split("</think>") -%}
    {%- set content = splitted[1:] | join("</think>") -%}
    apt-get update
    apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
    git clone https://github.com/ggml-org/llama.cpp
    cmake llama.cpp -B llama.cpp/build \
        -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
    cmake --build llama.cpp/build --config Release -j --clean-first --target llama-quantize llama-cli llama-gguf-split llama-mtmd-cli llama-server
    cp llama.cpp/build/bin/llama-* llama.cpp
    export LLAMA_CACHE="unsloth/DeepSeek-V3.1-GGUF"
    ./llama.cpp/llama-cli \
        -hf unsloth/DeepSeek-V3.1-GGUF:Q2_K_XL \
        --jinja \
        --n-gpu-layers 99 \
        --temp 0.6 \
        --top_p 0.95 \
        --min_p 0.01 \
        --ctx-size 8192 \
        --seed 3407 \
        -ot ".ffn_.*_exps.=CPU"
    ./llama.cpp/llama-cli \
        -hf unsloth/gpt-oss-20b-GGUF:F16 \
        --jinja -ngl 99 --threads -1 --ctx-size 16384 \
        --temp 1.0 --top-p 1.0 --top-k 0
    apt-get update
    apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
    git clone https://github.com/ggml-org/llama.cpp
    cmake llama.cpp -B llama.cpp/build \
        -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
    cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-gguf-split
    cp llama.cpp/build/bin/llama-* llama.cpp
    ./llama.cpp/llama-cli \
        -hf unsloth/gpt-oss-120b-GGUF:F16 \
        --threads -1 \
        --ctx-size 16384 \
        --n-gpu-layers 99 \
        -ot ".ffn_.*_exps.=CPU" \
        --temp 1.0 \
        --min-p 0.0 \
        --top-p 1.0 \
    --top-k 0
    messages = [
        {"role" : "user", "content" : "What is 1+1?"},
        {"role" : "assistant", "content" : "2"},
        {"role": "user",  "content": "What's the temperature in San Francisco now? How about tomorrow? Today's date is 2024-09-30."},
        {"role": "assistant",  "content": "User asks: 'What is the weather in San Francisco?' We need to use get_current_temperature tool.", "thinking" : ""},
        {"role": "assistant", "content": "", "tool_calls": [{"name": "get_current_temperature", "arguments": '{"location": "San Francisco, California, United States", "unit": "celsius"}'}]},
        {"role": "tool", "name": "get_current_temperature", "content": '{"temperature": 19.9, "location": "San Francisco, California, United States", "unit": "celsius"}'},
    ]
    from unsloth_zoo import encode_conversations_with_harmony
    
    def encode_conversations_with_harmony(
        messages,
        reasoning_effort = "medium",
        add_generation_prompt = True,
        tool_calls = None,
        developer_instructions = None,
        model_identity = "You are ChatGPT, a large language model trained by OpenAI.",
    )
    <|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.\nKnowledge cutoff: 2024-06\nCurrent date: 2025-08-05\n\nReasoning: medium\n\n# Valid channels: analysis, commentary, final. Channel must be included for every message.<|end|><|start|>user<|message|>Hello<|end|><|start|>assistant<|channel|>final<|message|>Hi there!<|end|><|start|>user<|message|>What is 1+1?<|end|><|start|>assistant
    docker model run hf.co/unsloth/gpt-oss-20b-GGUF:F16
    apt-get update
    apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
    git clone https://github.com/ggml-org/llama.cpp
    cmake llama.cpp -B llama.cpp/build \
        -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
    cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-gguf-split
    cp llama.cpp/build/bin/llama-* llama.cpp
    # !pip install huggingface_hub hf_transfer
    import os
    os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
    from huggingface_hub import snapshot_download
    snapshot_download(
        repo_id = "unsloth/gpt-oss-20b-GGUF",
        local_dir = "unsloth/gpt-oss-20b-GGUF",
        allow_patterns = ["*F16*"],
    )
    ./llama.cpp/llama-cli \
        --model unsloth/gpt-oss-120b-GGUF/gpt-oss-120b-F16.gguf \
        --threads -1 \
        --ctx-size 16384 \
        --n-gpu-layers 99 \
        -ot ".ffn_.*_exps.=CPU" \
        --temp 1.0 \
        --min-p 0.0 \
        --top-p 1.0 \
    --top-k 0
    model.save_pretrained_merged(save_directory, tokenizer)
    model.push_to_hub_merged(repo_name, tokenizer=tokenizer, token=hf_token)




    # !pip install huggingface_hub hf_transfer
    import os
    os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "0" # Can sometimes rate limit, so set to 0 to disable
    from huggingface_hub import snapshot_download
    snapshot_download(
        repo_id = "unsloth/Qwen3-235B-A22B-Thinking-2507-GGUF",
        local_dir = "unsloth/Qwen3-235B-A22B-Thinking-2507-GGUF",
        allow_patterns = ["*UD-Q2_K_XL*"],
    )
    # !pip install huggingface_hub hf_transfer
    import os
    os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "0" # Can sometimes rate limit, so set to 0 to disable
    from huggingface_hub import snapshot_download
    snapshot_download(
        repo_id = "unsloth/gpt-oss-120b-GGUF",
        local_dir = "unsloth/gpt-oss-120b-GGUF",
        allow_patterns = ["*F16*"],
    )


    This function loops through your dataset applying the chat template you defined to each sample.\
  • Finally, let's load the dataset and apply the required modifications to our dataset: \

    If your dataset uses the ShareGPT format with "from"/"value" keys instead of the ChatML "role"/"content" format, you can use the standardize_sharegpt function to convert it first. The revised code will now look as follows: \

  • Notebook is fully done locally with no API calling necessary

    {% endcode %}
    .





    Reinforcement Learning (RL)

    Configuring gpt-oss and Reasoning Effort

    We’ll load gpt-oss-20b using Unsloth's linearized version (as no other version will work).

    Configure the following parameters:

    • max_seq_length = 1024

      • Recommended for quick testing and initial experiments.

    • load_in_4bit = True

      • Use False for LoRA training (note: setting this to False will need at least 43GB VRAM). You MUST also set model_name = "unsloth/gpt-oss-20b-BF16"
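A minimal loading sketch under these settings (the model name follows the Colab notebook; adjust the arguments for your hardware):

```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name     = "unsloth/gpt-oss-20b",  # "unsloth/gpt-oss-20b-BF16" for BF16 LoRA
    max_seq_length = 1024,                   # short context for quick experiments
    load_in_4bit   = True,                   # QLoRA; False = BF16 LoRA (needs ~43GB VRAM)
)
```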

    You should see output similar to the example below. Note: We explicitly change the dtype to float32 to ensure correct training behavior.

    3

    Fine-tuning Hyperparameters (LoRA)

    Now it's time to adjust your training hyperparameters. For a deeper dive into how, when, and what to tune, check out our detailed hyperparameters guide.

To avoid overfitting, monitor your training loss and avoid setting these values too high.

    This step adds LoRA adapters for parameter-efficient fine-tuning. Only about 1% of the model’s parameters are trained, which makes the process significantly more efficient.
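A sketch of the LoRA setup (values are typical defaults from the notebook; tune them via the hyperparameters guide, and note `model` comes from the loading step above):

```python
model = FastLanguageModel.get_peft_model(
    model,
    r = 8,                      # LoRA rank: higher = more capacity, more VRAM
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    use_gradient_checkpointing = "unsloth",  # saves memory for longer contexts
    random_state = 3407,
)
```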

    4

    Try Inference

    In the notebook, there's a section called "Reasoning Effort" that demonstrates gpt-oss inference running in Colab. You can skip this step, but you'll still need to run the model later once you've finished fine-tuning it.

    5

    Data Preparation

For this example, we will use the HuggingFaceH4/Multilingual-Thinking dataset. This dataset contains chain-of-thought reasoning examples derived from user questions translated from English into four additional languages.

    This is the same dataset referenced in OpenAI's fine-tuning cookbook.

    The goal of using a multilingual dataset is to help the model learn and generalize reasoning patterns across multiple languages.

    gpt-oss introduces a reasoning effort system that controls how much reasoning the model performs. By default, the reasoning effort is set to low, but you can change it by setting the reasoning_effort parameter to low, medium or high.

    Example:
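For example (a sketch; the reasoning_effort keyword is forwarded to the gpt-oss chat template, and `model`/`tokenizer` come from the earlier loading step):

```python
messages = [
    {"role": "user", "content": "Solve x^5 + 3x^4 - 10 = 3."},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True,
    return_tensors = "pt",
    return_dict = True,
    reasoning_effort = "medium",   # "low", "medium" or "high"
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens = 128)
```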

    To format the dataset, we apply a customized version of the gpt-oss prompt:

    Let's inspect the dataset by printing the first example:

One unique feature of gpt-oss is its use of the OpenAI Harmony format, which supports structured conversations, reasoning output, and tool calling. This format includes tags such as <|start|> , <|message|> , and <|return|> .

    🦥 Unsloth fixes the chat template to ensure it is correct. See this for technical details on our template fix.

Feel free to adapt the prompt and structure to suit your own dataset or use-case. For more guidance, refer to our Datasets Guide.

    6

    Train the model

    We've pre-selected training hyperparameters for optimal results. However, you can modify them based on your specific use case. Refer to our hyperparameters guide.

    In this example, we train for 60 steps to speed up the process. For a full training run, set num_train_epochs=1 and disable the step limiting by setting max_steps=None.

    During training, monitor the loss to ensure that it is decreasing over time. This confirms that the training process is functioning correctly.
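A condensed sketch of the trainer setup (values mirror the notebook defaults; argument names may differ slightly across trl versions):

```python
from trl import SFTConfig, SFTTrainer

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    args = SFTConfig(
        per_device_train_batch_size = 1,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 60,        # as noted above: disable this and set num_train_epochs=1 for a full run
        learning_rate = 2e-4,
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)
trainer_stats = trainer.train()
```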

    7

    Inference: Run your trained model

    Now it's time to run inference with your fine-tuned model. You can modify the instruction and input, but leave the output blank.

    In this example, we test the model's ability to reason in French by adding a specific instruction to the system prompt, following the same structure used in our dataset.

    This should produce an output similar to:

    8

    Save/export your model

To save your fine-tuned model, you can export it in bf16 format, with our on-demand dequantization of MXFP4 base models, using save_method="merged_16bit", or in native MXFP4 Safetensors format using save_method="mxfp4".

    The MXFP4 native merge format offers significant performance improvements compared to the bf16 format: it uses up to 75% less disk space, reduces VRAM consumption by 50%, accelerates merging by 5-10x, and enables much faster conversion to GGUF format.

    New: Saving or merging QLoRA fine-tuned models to GGUF is now supported for use in other frameworks (e.g. Hugging Face, llama.cpp with GGUF).

    After fine-tuning your gpt-oss model, you can merge it into MXFP4 format with:
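For example (a sketch using the save_method values described above; the directory name is a placeholder):

```python
# Merge the LoRA into the base model and save in native MXFP4 safetensors
model.save_pretrained_merged("gpt-oss-finetuned-mxfp4", tokenizer, save_method = "mxfp4")

# Or merge to bf16 via on-demand MXFP4 dequantization
# model.save_pretrained_merged("gpt-oss-finetuned-bf16", tokenizer, save_method = "merged_16bit")
```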

    If you prefer to merge the model and push to the hugging-face hub directly:

    ✨ Saving to Llama.cpp

1. Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.

    2. Convert the MXFP4 merged model:

    3. Run inference on the quantized model:

    Configuring gpt-oss and Reasoning Effort

    We’ll load gpt-oss-20b using Unsloth's linearized version (as no other version will work for QLoRA fine-tuning). Configure the following parameters:

    • max_seq_length = 2048

      • Recommended for quick testing and initial experiments.

    • load_in_4bit = True

      • Use False for LoRA training (note: setting this to False will need at least 43GB VRAM). You MUST also set model_name = "unsloth/gpt-oss-20b-BF16"

    You should see output similar to the example below. Note: We explicitly change the dtype to float32 to ensure correct training behavior.

    3

    Fine-tuning Hyperparameters (LoRA)

    Now it's time to adjust your training hyperparameters. For a deeper dive into how, when, and what to tune, check out our detailed hyperparameters guide.

To avoid overfitting, monitor your training loss and avoid setting these values too high.

    This step adds LoRA adapters for parameter-efficient fine-tuning. Only about 1% of the model’s parameters are trained, which makes the process significantly more efficient.

    4

    Data Preparation

    For this example, we will use the HuggingFaceH4/Multilingual-Thinking. This dataset contains chain-of-thought reasoning examples derived from user questions translated from English into four additional languages.

    This is the same dataset referenced in OpenAI's fine-tuning cookbook. The goal of using a multilingual dataset is to help the model learn and generalize reasoning patterns across multiple languages.

    gpt-oss introduces a reasoning effort system that controls how much reasoning the model performs. By default, the reasoning effort is set to low, but you can change it by setting the reasoning_effort parameter to low, medium or high.

    Example:

    To format the dataset, we apply a customized version of the gpt-oss prompt:

    Let's inspect the dataset by printing the first example:

    One unique feature of gpt-oss is its use of the , which supports structured conversations, reasoning output, and tool calling. This format includes tags such as <|start|> , <|message|> , and <|return|> .

    🦥 Unsloth fixes the chat template to ensure it is correct. See this for technical details on our template fix.

    Feel free to adapt the prompt and structure to suit your own dataset or use-case. For more guidance, refer to our .

    5

    Train the model

    We've pre-selected training hyperparameters for optimal results. However, you can modify them based on your specific use case. Refer to our hyperparameters guide.

    In this example, we train for 60 steps to speed up the process. For a full training run, set num_train_epochs=1 and disable the step limiting by setting max_steps=None.

    During training, monitor the loss to ensure that it is decreasing over time. This confirms that the training process is functioning correctly.

    6

    Inference: Run Your Trained Model

    Now it's time to run inference with your fine-tuned model. You can modify the instruction and input, but leave the output blank.

    In this example, we test the model's ability to reason in French by adding a specific instruction to the system prompt, following the same structure used in our dataset.

    This should produce an output similar to:

    7

    Save and Export Your Model

    To save your fine-tuned model, it can be exported in the Safetensors format with our new on-demand dequantization of MXFP4 base models (like gpt-oss) during the LoRA merge process. This makes it possible to export your fine-tuned model in bf16 format.

    New: Saving or merging QLoRA fine-tuned models to GGUF is now supported for use in other frameworks (e.g. Hugging Face, llama.cpp with GGUF).

    After fine-tuning your gpt-oss model, you can merge it into 16-bit format with:

    If you prefer to merge the model and push to the hugging-face hub directly:

    ✨ Saving to Llama.cpp

    1. Obtain the latest llama.cpp on . You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.

    2. Convert and quantize the merged model:

    3. Run inference on the quantized model:


    All uploads use Unsloth Dynamic 2.0 for SOTA Aider Polyglot and 5-shot MMLU performance. See how our Dynamic 1–2 bit GGUFs perform on coding benchmarks here.


    ⚙️ Recommended Requirements

    You need 247GB of disk space to run the 1bit quant!

    The only requirement is disk space + RAM + VRAM ≥ 247GB. That means you do not need to have that much RAM or VRAM (GPU) to run the model, but it will be much slower.

The 1.8-bit (UD-TQ1_0) quant will fit in a 1x 24GB GPU (with all MoE layers offloaded to system RAM or a fast disk). Expect around ~1-2 tokens/s with this setup if you also have around 256GB of RAM. The full Kimi K2 Q8 quant is 1.09TB in size and will need at least 8 x H200 GPUs.

    For optimal performance you will need at least 247GB unified memory or 247GB combined RAM+VRAM for 5+ tokens/s. If you have less than 247GB combined RAM+VRAM, then the speed of the model will definitely take a hit.

If you do not have 247GB of RAM+VRAM, no worries! llama.cpp has built-in disk offloading via mmap, so it'll still work, just more slowly: for example, instead of 5 to 10 tokens per second you might get under 1 token per second.

    We suggest using our UD-Q2_K_XL (360GB) quant to balance size and accuracy!

    For the best performance, have your VRAM + RAM combined = the size of the quant you're downloading. If not, it'll still work via disk offloading, just it'll be slower!

    💭Kimi-K2-Thinking Guide

    Kimi-K2-Thinking should generally follow the same instructions as the Instruct model, with a few key differences, particularly in areas such as settings and the chat template.

To run the model in full precision, you only need to use the 4-bit or 5-bit Dynamic GGUFs (e.g. UD-Q4_K_XL) because the model was originally released in INT4 format.

    You can choose a higher-bit quantization just to be safe in case of small quantization differences, but in most cases this is unnecessary.

    🌙 Official Recommended Settings:

    According to Moonshot AI, these are the recommended settings for Kimi-K2-Thinking inference:

• Set the temperature to 1.0 to reduce repetition and incoherence.

    • Suggested context length = 98,304 (up to 256K)

    • Note: Using different tools may require different settings

    We recommend setting min_p to 0.01 to suppress the occurrence of unlikely tokens with low probabilities.

    For example given a user message of "What is 1+1?", we get:

    ✨ Run Kimi K2 Thinking in llama.cpp

    You can now use the latest update of llama.cpp to run the model:

    1. Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.

2. If you want to use llama.cpp directly to load models, you can do the below: (:UD-TQ1_0) is the quantization type. You can also download via Hugging Face (point 3). This is similar to ollama run. Use export LLAMA_CACHE="folder" to force llama.cpp to save to a specific location.

The above will use around 8GB of GPU memory. If you have around 360GB of combined GPU memory, remove -ot ".ffn_.*_exps.=CPU" to get maximum speed!

    Please try out -ot ".ffn_.*_exps.=CPU" to offload all MoE layers to the CPU! This effectively allows you to fit all non MoE layers on 1 GPU, improving generation speeds. You can customize the regex expression to fit more layers if you have more GPU capacity.

    If you have a bit more GPU memory, try -ot ".ffn_(up|down)_exps.=CPU" This offloads up and down projection MoE layers.

    Try -ot ".ffn_(up)_exps.=CPU" if you have even more GPU memory. This offloads only up projection MoE layers.

    And finally offload all layers via -ot ".ffn_.*_exps.=CPU" This uses the least VRAM.

    You can also customize the regex, for example -ot "\.(6|7|8|9|[0-9][0-9]|[0-9][0-9][0-9])\.ffn_(gate|up|down)_exps.=CPU" means to offload gate, up and down MoE layers but only from the 6th layer onwards.
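If you want to sanity-check which tensors a given -ot pattern will hit before launching, here is a small sketch (llama.cpp matches these regexes against GGUF tensor names such as blk.12.ffn_gate_exps.weight; the two names below are illustrative):

import re

pattern = r"\.(6|7|8|9|[0-9][0-9]|[0-9][0-9][0-9])\.ffn_(gate|up|down)_exps\."
print(bool(re.search(pattern, "blk.5.ffn_up_exps.weight")))     # False -> layer 5 stays on the GPU
print(bool(re.search(pattern, "blk.12.ffn_gate_exps.weight")))  # True  -> layer 12 is offloaded to CPU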

3. Download the model via huggingface_hub (after installing pip install huggingface_hub hf_transfer). We recommend using our 2-bit dynamic quant UD-Q2_K_XL to balance size and accuracy. All versions at: huggingface.co/unsloth/Kimi-K2-Thinking-GGUF

    If you find that downloads get stuck at 90 to 95% or so, please see https://docs.unsloth.ai/basics/troubleshooting-and-faqs#downloading-gets-stuck-at-90-to-95

4. Run any prompt.

5. Edit --threads -1 for the number of CPU threads (by default it's set to the maximum), --ctx-size 16384 for context length, and --n-gpu-layers 99 for how many layers to offload to the GPU. Set it to 99 combined with MoE CPU offloading for the best performance. Try lowering it if your GPU runs out of memory, and remove it entirely for CPU-only inference.

    🤔No Thinking Tags?

    You may notice that there are no thinking tags when you run the model. This is normal and intended behavior.

    In your llama.cpp script, make sure to include the --special flag at the very end of your command. Once you do, you’ll see the <think> token appear as expected.

    You might also see every answer end with <|im_end|>. This is normal as <|im_end|> is a special token that appears when printing special tokens. If you’d like to hide it, you can set <|im_end|> as a stop string in your settings.

    ✨ Deploy with llama-server and OpenAI's completion library

    After installing llama.cpp as per ✨ Run Kimi K2 Thinking in llama.cpp, you can use the below to launch an OpenAI compatible server:

    Then use OpenAI's Python library after pip install openai :

    🔍Tokenizer quirks and bug fixes

7th November 2025: We notified the Kimi team and fixed the default system prompt (You are Kimi, an AI assistant created by Moonshot AI.) not appearing before the first user prompt! See https://huggingface.co/moonshotai/Kimi-K2-Thinking/discussions/12

    Huge thanks to the Moonshot Kimi team for their extremely fast response time to our queries and fixing the issue ASAP!

    16th July 2025: Kimi K2 updated their tokenizer to enable multiple tool calls as per https://x.com/Kimi_Moonshot/status/1945050874067476962

    18th July 2025: We fixed a system prompt - Kimi tweeted about our fix as well here: https://x.com/Kimi_Moonshot/status/1946130043446690030. The fix was described here as well: https://huggingface.co/moonshotai/Kimi-K2-Instruct/discussions/28

If you have the old checkpoints downloaded, no worries: simply download the first GGUF split, which was the one that changed. Or, if you do not want to download any new files, do:

    The Kimi K2 tokenizer was interesting to play around with - it's mostly similar in action to GPT-4o's tokenizer! We first see in the tokenization_kimi.py file the following regular expression (regex) that Kimi K2 uses:

    After careful inspection, we find Kimi K2 is nearly identical to GPT-4o's tokenizer regex which can be found in llama.cpp's source code.

    Both tokenize numbers into groups of 1 to 3 numbers (9, 99, 999), and use similar patterns. The only difference looks to be the handling of "Han" or Chinese characters, which Kimi's tokenizer deals with more. The PR by https://github.com/gabriellarson handles these differences well after some discussions here.

    We also find the correct EOS token should not be [EOS], but rather <|im_end|>, which we have also fixed in our model conversions.

    🌝Kimi-K2-Instruct Guide

    Step-by-step guide on running the Instruct Kimi K2 models including Kimi K2 0905 - the September 5 update.

    🌙 Official Recommended Settings:

    According to Moonshot AI, these are the recommended settings for Kimi K2 inference:

• Set the temperature to 0.6 to reduce repetition and incoherence.

    • Original default system prompt is:

    • (Optional) Moonshot also suggests the below for the system prompt:

    We recommend setting min_p to 0.01 to suppress the occurrence of unlikely tokens with low probabilities.

    🔢 Chat template and prompt format

Kimi Chat does not use a BOS (beginning of sequence) token. The system, user and assistant role names are each followed by <|im_middle|>, which is interesting, and each role gets its own respective token: <|im_system|>, <|im_user|>, <|im_assistant|>.

Separating the conversational boundaries onto new lines for readability (you must remove each newline in the actual template), we get:

    💾 Model uploads

ALL our uploads, including those that are not imatrix-based or dynamic, utilize our calibration dataset, which is specifically optimized for conversational, coding, and reasoning tasks.

| MoE Bits | Type + Link | Disk Size | Details |
|---|---|---|---|
| 1.66bit | UD-TQ1_0 | 245GB | 1.92/1.56bit |
| 1.78bit | UD-IQ1_S | 281GB | 2.06/1.56bit |
| 1.93bit | UD-IQ1_M | 304GB | 2.5/2.06/1.56bit |
| 2.42bit | UD-IQ2_XXS | 343GB | 2.5/2.06bit |
| 2.71bit | UD-Q2_K_XL | 381GB | 3.5/2.5bit |
| 3.12bit | UD-IQ3_XXS | 417GB | 3.5/2.06bit |
| 3.5bit | UD-Q3_K_XL | 452GB | 4.5/3.5bit |
| 4.5bit | UD-Q4_K_XL | 588GB | 5.5/4.5bit |
| 5.5bit | UD-Q5_K_XL | 732GB | 6.5/5.5bit |

We've also uploaded versions in BF16 format.

    ✨ Run Instruct in llama.cpp

    1. Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.

2. If you want to use llama.cpp directly to load models, you can do the below: (:UD-IQ1_S) is the quantization type. You can also download via Hugging Face (point 3). This is similar to ollama run. Use export LLAMA_CACHE="folder" to force llama.cpp to save to a specific location. To run the new September 2025 update for the model, change the model name from 'Kimi-K2-Instruct' to 'Kimi-K2-Instruct-0905'.

    Please try out -ot ".ffn_.*_exps.=CPU" to offload all MoE layers to the CPU! This effectively allows you to fit all non MoE layers on 1 GPU, improving generation speeds. You can customize the regex expression to fit more layers if you have more GPU capacity.

    If you have a bit more GPU memory, try -ot ".ffn_(up|down)_exps.=CPU" This offloads up and down projection MoE layers.

    Try -ot ".ffn_(up)_exps.=CPU" if you have even more GPU memory. This offloads only up projection MoE layers.

    And finally offload all layers via -ot ".ffn_.*_exps.=CPU" This uses the least VRAM.

    You can also customize the regex, for example -ot "\.(6|7|8|9|[0-9][0-9]|[0-9][0-9][0-9])\.ffn_(gate|up|down)_exps.=CPU" means to offload gate, up and down MoE layers but only from the 6th layer onwards.

3. Download the model via huggingface_hub (after installing pip install huggingface_hub hf_transfer). You can choose UD-TQ1_0 (dynamic 1.8bit quant) or other quantized versions like Q2_K_XL. We recommend using our 2-bit dynamic quant UD-Q2_K_XL to balance size and accuracy. More versions at: huggingface.co/unsloth/Kimi-K2-Instruct-GGUF

    If you find that downloads get stuck at 90 to 95% or so, please see https://docs.unsloth.ai/basics/troubleshooting-and-faqs#downloading-gets-stuck-at-90-to-95

4. Run any prompt.

5. Edit --threads -1 for the number of CPU threads (by default it's set to the maximum), --ctx-size 16384 for context length, and --n-gpu-layers 99 for how many layers to offload to the GPU. Set it to 99 combined with MoE CPU offloading for the best performance. Try lowering it if your GPU runs out of memory, and remove it entirely for CPU-only inference.

    🐦 Flappy Bird + other tests

We introduced the Flappy Bird test when we released our 1.58bit quants for DeepSeek R1. We found Kimi K2 to be one of the only models to one-shot all our tasks, including this one, the Heptagon test and others, even at 2-bit. The goal is to ask the LLM to create a Flappy Bird game while following some specific instructions:

    You can also test the dynamic quants via the Heptagon Test as per r/Localllama which tests the model on creating a basic physics engine to simulate balls rotating in a moving enclosed heptagon shape.

    The goal is to make the heptagon spin, and the balls in the heptagon should move. The prompt is below:


    Reinforcement Learning (RL) Guide

    Learn all about Reinforcement Learning (RL) and how to train your own DeepSeek-R1 reasoning model with Unsloth using GRPO. A complete guide from beginner to advanced.

    Reinforcement Learning is where an "agent" learns to make decisions by interacting with an environment and receiving feedback in the form of rewards or penalties.

    • Action: What the model generates (e.g. a sentence).

    • Reward: A signal indicating how good or bad the model's action was (e.g. did the response follow instructions? was it helpful?).

    • Environment: The scenario or task the model is working on (e.g. answering a user’s question).

    Nov 26 update: We're introducing FP8 precision RL and GRPO in Unsloth!

    🦥What you will learn

    1. What is RL? RLVR? PPO? GRPO? RLHF? RFT? Is "Luck is All You Need?" for RL?

    2. What is an environment? Agent? Action? Reward function? Rewards?

This article covers everything (from beginner to advanced) you need to know about GRPO, Reinforcement Learning (RL) and reward functions, along with tips, and the basics of using GRPO with Unsloth. If you're looking for a step-by-step tutorial for using GRPO, see our guide.

For advanced GRPO documentation on batching, generation and training parameters, see here.

    ❓What is Reinforcement Learning (RL)?

    The goal of RL is to:

    1. Increase the chance of seeing "good" outcomes.

    2. Decrease the chance of seeing "bad" outcomes.

That's it! There are intricacies in what "good" and "bad" mean, how we go about "increasing" or "decreasing" them, and what "outcomes" even means.

    For example, in the Pacman game:

    1. The environment is the game world.

    2. The actions you can take are UP, LEFT, RIGHT and DOWN.

3. The rewards are positive if you eat the food pellets, and negative if you run into a ghost.

Another example: imagine you are given the question "What is 2 + 2?" (answer: 4). An unaligned language model will spit out 3, 4, C, D, -10, literally anything.

    1. Numbers are better than C or D right?

    2. Getting 3 is better than say 8 right?

    🏃From RLHF, PPO to GRPO and RLVR

OpenAI popularized the concept of RLHF (Reinforcement Learning from Human Feedback), where we train an "agent" to produce outputs to a question (the state) that are rated more useful by human beings.

    The thumbs up and down in ChatGPT for example can be used in the RLHF process.

    The clip(..., 1-e, 1+e) term is used to force PPO not to take too large changes. There is also a KL term with beta set to > 0 to force the model not to deviate too much away.

In order to do RLHF, PPO (Proximal Policy Optimization) was developed. The agent is the language model in this case. In fact it's composed of 3 systems:

    1. The Generating Policy (current trained model)

DeepSeek developed GRPO (Group Relative Policy Optimization) to train their R1 reasoning models. The key differences to PPO are:

    1. The Value Model is removed, replaced with statistics from calling the reward model multiple times.

2. The Reward Model is removed and replaced with a custom reward function, for which RLVR can be used.

    This means GRPO is extremely efficient. Previously PPO needed to train multiple models - now with the reward model and value model removed, we can save memory and speed up everything.

    RLVR (Reinforcement Learning with Verifiable Rewards) allows us to reward the model based on tasks with easy to verify solutions. For example:

    1. Maths equations can be easily verified. Eg 2+2 = 4.

    2. Code output can be verified as having executed correctly or not.

3. Designing verifiable reward functions can be tough, which is why most examples are math or code.

4. Use-cases for GRPO aren't just code or math: its reasoning process can enhance tasks like email automation, database retrieval, law, and medicine, greatly improving accuracy based on your dataset and reward function. The trick is to define a rubric, i.e. a list of smaller verifiable rewards, rather than a final all-consuming singular reward.

    Why "Group Relative"?

    GRPO removes the value model entirely, but we still need to estimate the "average reward" given the current state.

    The trick is to sample the LLM! We then calculate the average reward through statistics of the sampling process across multiple different questions.

    For example for "What is 2+2?" we sample 4 times. We might get 4, 3, D, C. We then calculate the reward for each of these answers, then calculate the average reward and standard deviation, then Z-score standardize this!

    This creates the advantages A, which we will use in replacement of the value model. This saves a lot of memory!
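As a rough sketch of that calculation (the rewards below are made-up scores for the four sampled answers; GRPO trainers do this for you internally):

# Rewards for the 4 sampled answers to "What is 2+2?": "4", "3", "D", "C"
rewards = [1.0, 0.5, 0.0, 0.0]

mean = sum(rewards) / len(rewards)
std  = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5

# Z-score standardization gives the group-relative advantages
advantages = [(r - mean) / (std + 1e-4) for r in rewards]
print(advantages)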

    🤞Luck (well Patience) Is All You Need

    The trick of RL is you need 2 things only:

    1. A question or instruction eg "What is 2+2?" "Create a Flappy Bird game in Python"

    2. A reward function and verifier to verify if the output is good or bad.

With only these 2, we can essentially call a language model an infinite number of times until we get a good answer. For example for "What is 2+2?", an untrained bad language model will output:

0, cat, -10, 1928, 3, A, B, 122, 17, 182, 172, A, C, BAHS, %$, #, 9, -192, 12.31 ... then suddenly 4.

The reward signal was 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ... then suddenly 1.

So by luck and by chance, RL managed to find the correct answer across multiple rollouts. Our goal is to make the model output the good answer (4) more, and the rest (the bad answers) much less.

    So the goal of RL is to be patient - in the limit, if the probability of the correct answer is at least a small number (not zero), it's just a waiting game - you will 100% for sure encounter the correct answer in the limit.

    So I like to call it as "Luck Is All You Need" for RL.

    Well a better phrase is "Patience is All You Need" for RL.

    RL essentially provides us a trick - instead of simply waiting for infinity, we do get "bad signals" ie bad answers, and we can essentially "guide" the model to already try not generating bad solutions. This means although you waited very long for a "good" answer to pop up, the model already has been changed to try its best not to output bad answers.

    In the "What is 2+2?" example - 0, cat, -10, 1928, 3, A, B, 122, 17, 182, 172, A, C, BAHS, %$, #, 9, -192, 12.31**** then suddenly 4.

    Since we got bad answers, RL will influence the model to try NOT to output bad answers. This means over time, we are carefully "pruning" or moving the model's output distribution away from bad answers. This means RL is efficient, since we are NOT just waiting for infinity, but we are actively trying to "push" the model to go as much as possible to the "correct answer space".

    If the probability is always 0, then RL will never work. This is also why people like to do RL from an already instruction finetuned model, which can partially follow instructions reasonably well - this boosts the probability most likely above 0.

    🦥What Unsloth offers for RL

    • With 15GB VRAM, Unsloth allows you to transform any model up to 17B parameters like Llama 3.1 (8B), Phi-4 (14B), Mistral (7B) or Qwen2.5 (7B) into a reasoning model

    • Unsloth now supports models!

    • Minimum requirement: Just  5GB VRAM is enough to train your own reasoning model locally (for any model with 1.5B parameters or less)

    GRPO notebooks:

    NEW! We now support and most other new GRPO techniques. You can play with the following arguments in GRPOConfig to enable:

• If you're not getting any reasoning, make sure you have enough training steps and ensure your reward function/verifier is working. We provide examples for reward functions here.

    • Previous demonstrations show that you could achieve your own "aha" moment with Qwen2.5 (3B) - but it required 2xA100 GPUs (160GB VRAM). Now, with Unsloth, you can achieve the same "aha" moment using just a single 5GB VRAM GPU.

    • Previously, GRPO was only supported for full fine-tuning, but we've made it work with QLoRA and LoRA


    In a test example, even though we only trained Phi-4 with 100 steps using GRPO, the results are already clear. The model without GRPO does not have the thinking token, whilst the one trained with GRPO does and also has the correct answer.

    💻Training with GRPO

For a step-by-step tutorial on how to transform any open LLM into a reasoning model using Unsloth & GRPO, see here.

For advanced GRPO documentation on batching, generation and training parameters, see here.

    How GRPO Trains a Model

    1. For each question-answer pair, the model generates multiple possible responses (e.g., 8 variations).

    2. Each response is evaluated using reward functions.

    3. Training Steps:

      • If you have 300 rows of data, that's 300 training steps (or 900 steps if trained for 3 epochs).

If you're having issues with your GRPO model not learning, we'd highly recommend using our advanced GRPO notebooks, as they have a much better reward function and you should see results faster and more frequently.

    Basics/Tips

• Wait for at least 300 steps for the reward to actually increase. In order to get decent results, you may need to train for a minimum of 12 hours (this is how GRPO works), but keep in mind this isn't compulsory as you can stop at any time.

    • For optimal results have at least 500 rows of data. You can try with even 10 rows of data but it's better to have more.

    • Each training run will always be different depending on your model, data, reward function/verifier etc. so though 300 steps is what we wrote as the minimum, sometimes it might be 1000 steps or more. So, it depends on various factors.

    📋Reward Functions / Verifiers

In Reinforcement Learning, a Reward Function and a Verifier serve distinct roles in evaluating a model's output. In general, you could interpret them as the same thing; technically they're not, but it does not matter much, as they are usually used in conjunction with each other.

    Verifier:

    • Determines whether the generated response is correct or incorrect.

    • It does not assign a numerical score—it simply verifies correctness.

    • Example: If a model generates "5" for "2+2", the verifier checks and labels it as "wrong" (since the correct answer is 4).

    • Verifiers can also execute code (e.g., in Python) to validate logic, syntax, and correctness without needing manual evaluation.

    Reward Function:

    • Converts verification results (or other criteria) into a numerical score.

    • Example: If an answer is wrong, it might assign a penalty (-1, -2, etc.), while a correct answer could get a positive score (+1, +2).

    • It can also penalize based on criteria beyond correctness, such as excessive length or poor readability.

    Key Differences:

    • A Verifier checks correctness but doesn’t score.

    • A Reward Function assigns a score but doesn’t necessarily verify correctness itself.

    • A Reward Function can use a Verifier, but they are technically not the same.

    Understanding Reward Functions

    GRPO's primary goal is to maximize reward and learn how an answer was derived, rather than simply memorizing and reproducing responses from its training data.

    • With every training step, GRPO adjusts model weights to maximize the reward. This process fine-tunes the model incrementally.

    • Regular fine-tuning (without GRPO) only maximizes next-word prediction probability but does not optimize for a reward. GRPO optimizes for a reward function rather than just predicting the next word.

    • You can reuse data across multiple epochs.

    🪙Reward Function Examples

You can refer to the examples below. You can input your generations into an LLM like ChatGPT 4o or Llama 3.1 (8B) and design a reward function and verifier to evaluate them. For example, feed your generations into an LLM of your choice and set a rule: "If the answer sounds too robotic, deduct 3 points." This helps refine outputs based on quality criteria.

    Example #1: Simple Arithmetic Task

    • Question: "2 + 2"

    • Answer: "4"

    • Reward Function 1:

    Example #2: Email Automation Task

    • Question: Inbound email

    • Answer: Outbound email

• Reward Functions (a combined sketch of Examples #1 and #2 follows after this list):

      • If the answer contains a required keyword → +1
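Putting both examples together, here is a minimal sketch of what such reward functions could look like (the score values and the required keyword are illustrative choices, not a fixed recipe):

def arithmetic_reward(response, answer = "4"):
    # Example #1: +1 for an exact match on the expected answer, -1 otherwise
    return 1.0 if response.strip() == answer else -1.0

REQUIRED_KEYWORD = "refund"   # hypothetical keyword for the email task

def email_reward(response):
    # Example #2: +1 if the outbound email contains the required keyword
    return 1.0 if REQUIRED_KEYWORD in response.lower() else 0.0

In practice you would stack several such functions (correctness, formatting, length, tone) and let GRPO maximize their sum.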

    Unsloth Proximity-Based Reward Function

    If you’ve checked out our , you’ll notice we’ve created a custom proximity-based reward function built completely from scratch, which is designed to reward answers that are closer to the correct one. This flexible function can be applied across a wide range of tasks.

    • In our examples, we enable reasoning in Qwen3 (Base) and guide it toward specific tasks

    • Apply Pre-finetuning strategies to avoid GRPO’s default tendency to just learn formatting

    • Boost evaluation accuracy with regex-based matching

    • Create custom GRPO templates beyond generic prompts like think, e.g., <start_working_out></end_working_out>

    GSM8K Reward Functions

In our other examples, we use popular existing GSM8K reward functions which have been shown to be quite effective (a simplified sketch follows the list below):

    • correctness_reward_func – Rewards exact label matches.

    • int_reward_func – Encourages integer-only answers.

    • soft_format_reward_func – Checks structure but allows minor newline mismatches.

    • strict_format_reward_func – Ensures response structure matches the prompt, including newlines.
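As a sketch of what two of these could look like (simplified versions written to match the descriptions above; the actual notebook implementations differ in details such as score values and how they parse the answer tags):

def correctness_reward_func(completions, answers):
    # Reward exact label matches
    return [2.0 if c.strip() == a.strip() else 0.0 for c, a in zip(completions, answers)]

def int_reward_func(completions):
    # Encourage integer-only answers
    return [0.5 if c.strip().lstrip("-").isdigit() else 0.0 for c in completions]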

    🧮Using vLLM

You can now use vLLM directly in your finetuning stack, which allows for much more throughput and lets you finetune and do inference on the model at the same time! On 1x A100 40GB, expect around 4,000 tokens/s with Unsloth's dynamic 4-bit quant of Llama 3.2 3B Instruct. On a 16GB Tesla T4 (free Colab GPU), you can get 300 tokens/s.

We also magically removed double memory usage when loading vLLM and Unsloth together, allowing for savings of around 5GB for Llama 3.1 8B and 3GB for Llama 3.2 3B. Unsloth could originally finetune Llama 3.3 70B Instruct on 1x 48GB GPU, with the Llama 3.3 70B weights taking 40GB of VRAM. If we did not remove double memory usage, we would need >= 80GB of VRAM when loading Unsloth and vLLM together. But with Unsloth, you can still finetune and get the benefits of fast inference in one package in under 48GB of VRAM!

To use fast inference, first install vllm and instantiate Unsloth with fast_inference:
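A minimal sketch (the model name, rank-free settings and memory fraction below are illustrative; fast_inference = True is the flag that turns on the vLLM backend):

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-3B-Instruct",
    max_seq_length = 2048,
    load_in_4bit = True,           # dynamic 4-bit quant
    fast_inference = True,         # enable vLLM-backed fast inference
    gpu_memory_utilization = 0.6,  # fraction of VRAM handed to vLLM; tune for your GPU
)

Generation then runs through vLLM (for example via model.fast_generate in our GRPO notebooks) while finetuning continues to work as usual.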

    ✅GRPO Requirement Guidelines

    When you’re using Unsloth to do GRPO, we smartly reduce VRAM usage by over 90% when compared to standard implementations with Flash Attention 2 by using multiple tricks! On 20K context lengths for example with 8 generations per prompt, Unsloth uses only 54.3GB of VRAM for Llama 3.1 8B, whilst standard implementations take 510.8GB (90% less for Unsloth).

    1. For GRPO's GPU VRAM requirements for QLoRA 4-bit, the general rule is the model parameters = the amount of VRAM you will need (you can use less VRAM but this just to be safe). The more context length you set, the more VRAM. LoRA 16-bit will use at minimum 4x more VRAM.

    2. Our new memory efficient linear kernels for GRPO slashes memory usage by 8x or more. This shaves 68.5GB of memory, whilst being actually faster through the help of torch.compile!

3. We leverage our Unsloth gradient checkpointing algorithm, which we released a while ago. It smartly offloads intermediate activations to system RAM asynchronously whilst being only 1% slower. This shaves 52GB of memory.

    Metrics
    Unsloth
    Standard + FA2

In typical standard GRPO implementations, you need to create 2 logits tensors of size (8, 20K) to calculate the GRPO loss. This takes 2 * 2 bytes * 8 (num generations) * 20K (context length) * 128256 (vocabulary size) = 78.3GB of VRAM.

    Unsloth shaves 8x memory usage for long context GRPO, so we need only an extra 9.8GB in extra VRAM for 20K context lengths!

We also need memory for the KV cache in 16-bit. Llama 3.1 8B has 32 layers, and both K and V are 1024 in size. So memory usage for a 20K context length = 2 * 2 bytes * 32 layers * 20K context length * 1024 = 2.5GB per batch. We would set the batch size for vLLM to 8, but we shall leave it at 1 for our calculations to save VRAM. Otherwise you will need 20GB for the KV cache.
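The same arithmetic as a quick sketch (20K is treated as 20,480 tokens and sizes are reported in GiB, matching the numbers above):

bytes_fp16      = 2
num_generations = 8
ctx_len         = 20 * 1024   # "20K" context length
vocab_size      = 128256      # Llama 3.1 8B vocabulary

# Two full logits tensors for the GRPO loss in a standard implementation
logits_gib = 2 * bytes_fp16 * num_generations * ctx_len * vocab_size / 1024**3
print(f"{logits_gib:.1f} GiB")   # ~78.3

# KV cache per batch: K and V, 32 layers, size 1024, in 16-bit
kv_gib = 2 * bytes_fp16 * 32 * ctx_len * 1024 / 1024**3
print(f"{kv_gib:.1f} GiB")       # ~2.5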

    🎥 Unsloth RL 3 hour Workshop Video

    🎓Further Reading

    1. Nathan Lambert's RLHF Book is a must!

    2. Yannic Kilcher's GRPO Youtube video is also a must!

3. We did a 3-hour workshop at AI Engineer World's Fair 2025. Slides and other material are available here.

    4. Advanced GRPO notebook via Unsloth.

    Qwen3-Coder: How to Run Locally

    Run Qwen3-Coder-30B-A3B-Instruct and 480B-A35B locally with Unsloth Dynamic quants.

Qwen3-Coder is Qwen's new series of coding agent models, available in 30B (Qwen3-Coder-Flash) and 480B parameters. Qwen3-480B-A35B-Instruct achieves SOTA coding performance rivalling Claude Sonnet-4, GPT-4.1, and Kimi K2, with 61.8% on Aider Polyglot and support for 256K (extendable to 1M) token context.

    We also uploaded Qwen3-Coder with native 1M context length extended by YaRN and full-precision 8bit and 16bit versions. Unsloth also now supports fine-tuning and RL of Qwen3-Coder.

    UPDATE: We fixed tool-calling for Qwen3-Coder! You can now use tool-calling seamlessly in llama.cpp, Ollama, LMStudio, Open WebUI, Jan etc. This issue was universal and affected all uploads (not just Unsloth), and we've communicated with the Qwen team about our fixes! Read more

Do Unsloth Dynamic Quants work? Yes, and very well. In third-party testing on the Aider Polyglot benchmark, the UD-Q4_K_XL (276GB) dynamic quant nearly matched the full bf16 (960GB) Qwen3-Coder model, scoring 60.9% vs 61.8%.

    Qwen3 Coder - Unsloth Dynamic 2.0 GGUFs:

    Dynamic 2.0 GGUF (to run)
    1M Context Dynamic 2.0 GGUF

    🖥️ Running Qwen3-Coder

Below are guides for the 30B-A3B and 480B-A35B variants of the model.

    ⚙️ Recommended Settings

    Qwen recommends these inference settings for both models:

    temperature=0.7, top_p=0.8, top_k=20, repetition_penalty=1.05

    • Temperature of 0.7

    • Top_K of 20

    • Min_P of 0.00 (optional, but 0.01 works well, llama.cpp default is 0.1)

    • Top_P of 0.8

    Chat template/prompt format with newlines un-rendered

    Chat template for tool calling (Getting the current temperature for San Francisco). More details here for how to format tool calls.

    Reminder that this model supports only non-thinking mode and does not generate <think></think> blocks in its output. Meanwhile, specifying enable_thinking=False is no longer required.

    Run Qwen3-Coder-30B-A3B-Instruct:

    To achieve inference speeds of 6+ tokens per second for our Dynamic 4-bit quant, have at least 18GB of unified memory (combined VRAM and RAM) or 18GB of system RAM alone. As a rule of thumb, your available memory should match or exceed the size of the model you’re using. E.g. the UD_Q8_K_XL quant (full precision), which is 32.5GB, will require at least 33GB of unified memory (VRAM + RAM) or 33GB of RAM for optimal performance.

    NOTE: The model can run on less memory than its total size, but this will slow down inference. Maximum memory is only needed for the fastest speeds.

Given that this is a non-thinking model, there is no need to set thinking=False, and the model does not generate <think> </think> blocks.

Follow the best practices above. They're the same as for the 480B model.

    🦙 Ollama: Run Qwen3-Coder-30B-A3B-Instruct Tutorial

    1. Install ollama if you haven't already! You can only run models up to 32B in size.

2. Run the model! Note you can call ollama serve in another terminal if it fails! We include all our fixes and suggested parameters (temperature etc.) in params in our Hugging Face upload!

    ✨ Llama.cpp: Run Qwen3-Coder-30B-A3B-Instruct Tutorial

1. Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.

2. You can directly pull from Hugging Face via:

3. Download the model via huggingface_hub (after installing pip install huggingface_hub hf_transfer). You can choose UD-Q4_K_XL or other quantized versions.

    Run Qwen3-Coder-480B-A35B-Instruct:

    To achieve inference speeds of 6+ tokens per second for our 1-bit quant, we recommend at least 150GB of unified memory (combined VRAM and RAM) or 150GB of system RAM alone. As a rule of thumb, your available memory should match or exceed the size of the model you’re using. E.g. the Q2_K_XL quant, which is 180GB, will require at least 180GB of unified memory (VRAM + RAM) or 180GB of RAM for optimal performance.

    NOTE: The model can run on less memory than its total size, but this will slow down inference. Maximum memory is only needed for the fastest speeds.

Follow the best practices above. They're the same as for the 30B model.

    📖 Llama.cpp: Run Qwen3-Coder-480B-A35B-Instruct Tutorial

    For Coder-480B-A35B, we will specifically use Llama.cpp for optimized inference and a plethora of options.

    If you want a full precision unquantized version, use our Q8_K_XL, Q8_0 or BF16 versions!

    1. Obtain the latest llama.cpp on . You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.

2. You can directly use llama.cpp to download the model, but we normally suggest using huggingface_hub. To use llama.cpp directly, do:


Use -ot ".ffn_.*_exps.=CPU" to offload all MoE layers to the CPU! This effectively allows you to fit all non-MoE layers on 1 GPU, improving generation speeds. You can customize the regex expression to fit more layers if you have more GPU capacity. More options are discussed below.

Also don't forget about the new Qwen3-235B-A22B-Instruct-2507 update, which you can also run locally with llama.cpp.

    🛠️ Improving generation speed

    If you have more VRAM, you can try offloading more MoE layers, or offloading whole layers themselves.

    Normally, -ot ".ffn_.*_exps.=CPU" offloads all MoE layers to the CPU! This effectively allows you to fit all non MoE layers on 1 GPU, improving generation speeds. You can customize the regex expression to fit more layers if you have more GPU capacity.

    If you have a bit more GPU memory, try -ot ".ffn_(up|down)_exps.=CPU" This offloads up and down projection MoE layers.

    Try -ot ".ffn_(up)_exps.=CPU" if you have even more GPU memory. This offloads only up projection MoE layers.

    You can also customize the regex, for example -ot "\.(6|7|8|9|[0-9][0-9]|[0-9][0-9][0-9])\.ffn_(gate|up|down)_exps.=CPU" means to offload gate, up and down MoE layers but only from the 6th layer onwards.

The latest llama.cpp release also introduces a high-throughput mode. Use llama-parallel. Read more about it here. You can also quantize the KV cache to 4-bit, for example, to reduce VRAM / RAM movement, which can also make the generation process faster.

    📐How to fit long context (256K to 1M)

    To fit longer context, you can use KV cache quantization to quantize the K and V caches to lower bits. This can also increase generation speed due to reduced RAM / VRAM data movement. The allowed options for K quantization (default is f16) include the below.

    --cache-type-k f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1

You should use the _1 variants for somewhat increased accuracy, albeit slightly slower, e.g. q4_1, q5_1.

    You can also quantize the V cache, but you will need to compile llama.cpp with Flash Attention support via -DGGML_CUDA_FA_ALL_QUANTS=ON, and use --flash-attn to enable it.

We also uploaded 1 million context length GGUFs via YaRN scaling.

    🧰 Tool Calling Fixes

    We managed to fix tool calling via llama.cpp --jinja specifically for serving through llama-server! If you’re downloading our 30B-A3B quants, no need to worry as these already include our fixes. For the 480B-A35B model, please:

    1. Download the first file at https://huggingface.co/unsloth/Qwen3-Coder-480B-A35B-Instruct-GGUF/tree/main/UD-Q2_K_XL for UD-Q2_K_XL, and replace your current file

    2. Use snapshot_download as usual as in https://docs.unsloth.ai/basics/qwen3-coder-how-to-run-locally#llama.cpp-run-qwen3-tutorial which will auto override the old files

3. Use the new chat template via --chat-template-file. See the GGUF chat template or chat_template.jinja.

    This should solve issues like: https://github.com/ggml-org/llama.cpp/issues/14915

    Using Tool Calling

    To format the prompts for tool calling, let's showcase it with an example.

We created a Python function called get_current_temperature, which should return the current temperature for a location. For now it is a placeholder that always returns 21.6 degrees Celsius; you should replace it with a real implementation!

    Then use the tokenizer to create the entire prompt:
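A minimal sketch of that flow (the placeholder function and the tools= argument follow the Hugging Face chat-template convention; adapt the message to your own use case):

def get_current_temperature(location: str) -> float:
    """
    Get the current temperature for a location.

    Args:
        location: The city to get the temperature for, e.g. "San Francisco"
    """
    return 21.6  # placeholder - replace with a real weather lookup

messages = [
    {"role": "user", "content": "What is the current temperature in San Francisco?"},
]
prompt = tokenizer.apply_chat_template(
    messages,
    tools = [get_current_temperature],  # the chat template renders the tool schema for the model
    add_generation_prompt = True,
    tokenize = False,
)
print(prompt)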

    💡Performance Benchmarks

    These official benchmarks are for the full BF16 checkpoint. To use this, simply use the Q8_K_XL, Q8_0, BF16 checkpoints we uploaded - you can still use the tricks like MoE offloading for these versions as well!

    Here are the benchmarks for the 480B model:

Agentic Coding

| Benchmark | Qwen3-Coder 480B-A35B-Instruct | Kimi-K2 | DeepSeek-V3-0324 | Claude 4 Sonnet | GPT-4.1 |
|---|---|---|---|---|---|
| Terminal-Bench | 37.5 | 30.0 | 2.5 | 35.5 | 25.3 |
| SWE-bench Verified w/ OpenHands (500 turns) | 69.6 | – | – | 70.4 | – |
| SWE-bench Verified w/ OpenHands (100 turns) | 67.0 | 65.4 | 38.8 | 68.0 | 48.6 |
| SWE-bench Verified w/ Private Scaffolding | – | 65.8 | – | 72.7 | 63.8 |
| SWE-bench Live | 26.3 | 22.3 | 13.0 | 27.7 | – |
| SWE-bench Multilingual | 54.7 | 47.3 | 13.0 | 53.3 | 31.5 |
| Multi-SWE-bench mini | 25.8 | 19.8 | 7.5 | 24.8 | – |
| Multi-SWE-bench flash | 27.0 | 20.7 | – | 25.0 | – |
| Aider-Polyglot | 61.8 | 60.0 | 56.9 | 56.4 | 52.4 |
| Spider2 | 31.1 | 25.2 | 12.8 | 31.1 | 16.5 |

Agentic Browser Use

| Benchmark | Qwen3-Coder 480B-A35B-Instruct | Kimi-K2 | DeepSeek-V3 0324 | Claude Sonnet-4 | GPT-4.1 |
|---|---|---|---|---|---|
| WebArena | 49.9 | 47.4 | 40.0 | 51.1 | 44.3 |
| Mind2Web | 55.8 | 42.7 | 36.0 | 47.4 | 49.6 |

Agentic Tool Use

| Benchmark | Qwen3-Coder 480B-A35B-Instruct | Kimi-K2 | DeepSeek-V3 0324 | Claude Sonnet-4 | GPT-4.1 |
|---|---|---|---|---|---|
| BFCL-v3 | 68.7 | 65.2 | 56.9 | 73.3 | 62.9 |
| TAU-Bench Retail | 77.5 | 70.7 | 59.1 | 80.5 | – |
| TAU-Bench Airline | 60.0 | 53.5 | 40.0 | 60.0 | – |
    # Import and load dataset
    from datasets import load_dataset
    dataset = load_dataset("repo_name/dataset_name", split = "train")
    
    # Apply the formatting function to your dataset using the map method
    dataset = dataset.map(formatting_prompts_func, batched = True,)
    # Import dataset
    from datasets import load_dataset
    dataset = load_dataset("mlabonne/FineTome-100k", split = "train")
    
    # Convert your dataset to the "role"/"content" format if necessary
    from unsloth.chat_templates import standardize_sharegpt
    dataset = standardize_sharegpt(dataset)
    
    # Apply the formatting function to your dataset using the map method
    dataset = dataset.map(formatting_prompts_func, batched = True,)
      "text": "Pasta carbonara is a traditional Roman pasta dish. The sauce is made by mixing raw eggs with grated Pecorino Romano cheese and black pepper. The hot pasta is then tossed with crispy guanciale (cured pork cheek) and the egg mixture, creating a creamy sauce from the residual heat. Despite popular belief, authentic carbonara never contains cream or garlic. The dish likely originated in Rome in the mid-20th century, though its exact origins are debated..."
    "Instruction": "Task we want the model to perform."
    
    "Input": "Optional, but useful, it will essentially be the user's query."
    
    "Output": "The expected result of the task and the output of the model."
    {
      "conversations": [
        {
          "from": "human",
          "value": "Can you help me make pasta carbonara?"
        },
        {
          "from": "gpt",
          "value": "Would you like the traditional Roman recipe, or a simpler version?"
        },
        {
          "from": "human",
          "value": "The traditional version please"
        },
        {
          "from": "gpt",
          "value": "The authentic Roman carbonara uses just a few ingredients: pasta, guanciale, eggs, Pecorino Romano, and black pepper. Would you like the detailed recipe?"
        }
      ]
    }
    {
      "messages": [
        {
          "role": "user",
          "content": "What is 1+1?"
        },
        {
          "role": "assistant",
          "content": "It's 2!"
        },
      ]
    }
    from unsloth.chat_templates import CHAT_TEMPLATES
    print(list(CHAT_TEMPLATES.keys()))
    ['unsloth', 'zephyr', 'chatml', 'mistral', 'llama', 'vicuna', 'vicuna_old', 'vicuna old', 'alpaca', 'gemma', 'gemma_chatml', 'gemma2', 'gemma2_chatml', 'llama-3', 'llama3', 'phi-3', 'phi-35', 'phi-3.5', 'llama-3.1', 'llama-31', 'llama-3.2', 'llama-3.3', 'llama-32', 'llama-33', 'qwen-2.5', 'qwen-25', 'qwen25', 'qwen2.5', 'phi-4', 'gemma-3', 'gemma3']
    from unsloth.chat_templates import get_chat_template
    
    tokenizer = get_chat_template(
        tokenizer,
        chat_template = "gemma-3", # change this to the right chat_template name
    )
    def formatting_prompts_func(examples):
       convos = examples["conversations"]
       texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False) for convo in convos]
       return { "text" : texts, }
    Using the dataset example I provided, follow the structure and generate conversations based on the examples.
Create 10 examples of product reviews for Coca-Cola classified as either positive, negative, or neutral.
    Structure my dataset so it is in a QA ChatML format for fine-tuning. Then generate 5 synthetic data examples with the same topic and format.
    Dataset({
        features: ['image', 'image_id', 'caption', 'cui'],
        num_rows: 1978
    })
    [
    { "role": "user",
      "content": [{"type": "text",  "text": instruction}, {"type": "image", "image": image} ]
    },
    { "role": "assistant",
      "content": [{"type": "text",  "text": answer} ]
    },
    ]
    instruction = "You are an expert radiographer. Describe accurately what you see in this image."
    
    def convert_to_conversation(sample):
        conversation = [
            { "role": "user",
              "content" : [
                {"type" : "text",  "text"  : instruction},
                {"type" : "image", "image" : sample["image"]} ]
            },
            { "role" : "assistant",
              "content" : [
                {"type" : "text",  "text"  : sample["caption"]} ]
            },
        ]
        return { "messages" : conversation }
    pass
    converted_dataset = [convert_to_conversation(sample) for sample in dataset]
    converted_dataset[0]
    {'messages': [{'role': 'user',
       'content': [{'type': 'text',
         'text': 'You are an expert radiographer. Describe accurately what you see in this image.'},
        {'type': 'image',
         'image': <PIL.PngImagePlugin.PngImageFile image mode=L size=657x442>}]},
      {'role': 'assistant',
       'content': [{'type': 'text',
         'text': 'Panoramic radiography shows an osteolytic lesion in the right posterior maxilla with resorption of the floor of the maxillary sinus (arrows).'}]}]}
    FastVisionModel.for_inference(model) # Enable for inference!
    
    image = dataset[0]["image"]
    instruction = "You are an expert radiographer. Describe accurately what you see in this image."
    
    messages = [
        {"role": "user", "content": [
            {"type": "image"},
            {"type": "text", "text": instruction}
        ]}
    ]
    input_text = tokenizer.apply_chat_template(messages, add_generation_prompt = True)
    inputs = tokenizer(
        image,
        input_text,
        add_special_tokens = False,
        return_tensors = "pt",
    ).to("cuda")
    
    from transformers import TextStreamer
    text_streamer = TextStreamer(tokenizer, skip_prompt = True)
    _ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128,
                       use_cache = True, temperature = 1.5, min_p = 0.1)
    This radiograph appears to be a panoramic view of the upper and lower dentition, specifically an Orthopantomogram (OPG).
    
    * The panoramic radiograph demonstrates normal dental structures.
    * There is an abnormal area on the upper right, represented by an area of radiolucent bone, corresponding to the antrum.
    
    **Key Observations**
    
    * The bone between the left upper teeth is relatively radiopaque.
    * There are two large arrows above the image, suggesting the need for a closer examination of this area. One of the arrows is in a left-sided position, and the other is in the right-sided position. However, only
    def formatting_prompts_func(examples):
        convos = examples["messages"]
        texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False) for convo in convos]
        return { "text" : texts, }
    pass
    
    from datasets import load_dataset
    
    dataset = load_dataset("HuggingFaceH4/Multilingual-Thinking", split="train")
    dataset
    from trl import SFTConfig, SFTTrainer
    trainer = SFTTrainer(
        model = model,
        tokenizer = tokenizer,
        train_dataset = dataset,
        args = SFTConfig(
            per_device_train_batch_size = 1,
            gradient_accumulation_steps = 4,
            warmup_steps = 5,
            # num_train_epochs = 1, # Set this for 1 full training run.
            max_steps = 30,
            learning_rate = 2e-4,
            logging_steps = 1,
            optim = "adamw_8bit",
            weight_decay = 0.01,
            lr_scheduler_type = "linear",
            seed = 3407,
            output_dir = "outputs",
            report_to = "none", # Use this for WandB etc
        ),
    )
    messages = [
        {"role": "system", "content": "reasoning language: French\n\nYou are a helpful assistant that can solve mathematical problems."},
        {"role": "user", "content": "Solve x^5 + 3x^4 - 10 = 3."},
    ]
    inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt = True,
        return_tensors = "pt",
        return_dict = True,
        reasoning_effort = "medium",
    ).to(model.device)
    from transformers import TextStreamer
    _ = model.generate(**inputs, max_new_tokens = 2048, streamer = TextStreamer(tokenizer))
    # We're installing the latest Torch, Triton, OpenAI's Triton kernels, Transformers and Unsloth!
    !pip install --upgrade -qqq uv
    try: import numpy; install_numpy = f"numpy=={numpy.__version__}"
    except: install_numpy = "numpy"
    !uv pip install -qqq \
        "torch>=2.8.0" "triton>=3.4.0" {install_numpy} \
        "unsloth_zoo[base] @ git+https://github.com/unslothai/unsloth-zoo" \
        "unsloth[base] @ git+https://github.com/unslothai/unsloth" \
        torchvision bitsandbytes \
        git+https://github.com/huggingface/transformers \
        git+https://github.com/triton-lang/triton.git@05b2c186c1b6c9a08375389d5efe9cb4c401c075#subdirectory=python/triton_kernels
    You are a helpful assistant
    You are Kimi, an AI assistant created by Moonshot AI.
    <|im_system|>system<|im_middle|>You are Kimi, an AI assistant created by Moonshot AI.<|im_end|><|im_user|>user<|im_middle|>What is 1+1?<|im_end|><|im_assistant|>assistant<|im_middle|>
    apt-get update
    apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
    git clone https://github.com/ggml-org/llama.cpp
    cmake llama.cpp -B llama.cpp/build \
        -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
    cmake --build llama.cpp/build --config Release -j --clean-first --target llama-quantize llama-cli llama-gguf-split llama-mtmd-cli
    cp llama.cpp/build/bin/llama-* llama.cpp
    export LLAMA_CACHE="unsloth/Kimi-K2-Thinking-GGUF"
    ./llama.cpp/llama-cli \
        -hf unsloth/Kimi-K2-Thinking-GGUF:UD-TQ1_0 \
        --n-gpu-layers 99 \
        --temp 1.0 \
        --min-p 0.01 \
        --ctx-size 16384 \
        --seed 3407 \
        -ot ".ffn_.*_exps.=CPU"
    # !pip install huggingface_hub hf_transfer
    import os
    os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "0" # Can sometimes rate limit, so set to 0 to disable
    from huggingface_hub import snapshot_download
    snapshot_download(
        repo_id = "unsloth/Kimi-K2-Thinking-GGUF",
        local_dir = "unsloth/Kimi-K2-Thinking-GGUF",
        allow_patterns = ["*UD-TQ1_0*"], # Use "*UD-Q2_K_XL*" for Dynamic 2bit (381GB)
    )
    ./llama.cpp/llama-cli \
        --model unsloth/Kimi-K2-Thinking-GGUF/UD-TQ1_0/Kimi-K2-Thinking-UD-TQ1_0-00001-of-00006.gguf \
        --n-gpu-layers 99 \
        --temp 1.0 \
        --min_p 0.01 \
        --ctx-size 16384 \
        --seed 3407 \
        -ot ".ffn_.*_exps.=CPU"
    ./llama.cpp/llama-server \
        --model unsloth/Kimi-K2-Thinking-GGUF/UD-TQ1_0/Kimi-K2-Thinking-UD-TQ1_0-00001-of-00006.gguf \
        --alias "unsloth/Kimi-K2-Thinking" \
        --threads -1 \
        -fa on \
        --n-gpu-layers 999 \
        -ot ".ffn_.*_exps.=CPU" \
        --min_p 0.01 \
        --ctx-size 16384 \
        --port 8001 \
        --jinja
    from openai import OpenAI
    openai_client = OpenAI(
        base_url = "http://127.0.0.1:8001/v1",
        api_key = "sk-no-key-required",
    )
    completion = openai_client.chat.completions.create(
        model = "unsloth/Kimi-K2-Thinking",
        messages = [{"role": "user", "content": "What is 2+2?"},],
    )
    print(completion.choices[0].message.content)
    wget https://huggingface.co/unsloth/Kimi-K2-Instruct/raw/main/chat_template.jinja
    ./llama.cpp ... --chat-template-file /dir/to/chat_template.jinja
    pat_str = "|".join(
        [
            r"""[\p{Han}]+""",
            r"""[^\r\n\p{L}\p{N}]?[\p{Lu}\p{Lt}\p{Lm}\p{Lo}\p{M}&&[^\p{Han}]]*[\p{Ll}\p{Lm}\p{Lo}\p{M}&&[^\p{Han}]]+(?i:'s|'t|'re|'ve|'m|'ll|'d)?""",
            r"""[^\r\n\p{L}\p{N}]?[\p{Lu}\p{Lt}\p{Lm}\p{Lo}\p{M}&&[^\p{Han}]]+[\p{Ll}\p{Lm}\p{Lo}\p{M}&&[^\p{Han}]]*(?i:'s|'t|'re|'ve|'m|'ll|'d)?""",
            r"""\p{N}{1,3}""",
            r""" ?[^\s\p{L}\p{N}]+[\r\n]*""",
            r"""\s*[\r\n]+""",
            r"""\s+(?!\S)""",
            r"""\s+""",
        ]
    )
    [^\r\n\p{L}\p{N}]?[\p{Lu}\p{Lt}\p{Lm}\p{Lo}\p{M}]*[\p{Ll}\p{Lm}\p{Lo}\p{M}]+(?i:'s|'t|'re|'ve|'m|'ll|'d)?|[^\r\n\p{L}\p{N}]?[\p{Lu}\p{Lt}\p{Lm}\p{Lo}\p{M}]+[\p{Ll}\p{Lm}\p{Lo}\p{M}]*(?i:'s|'t|'re|'ve|'m|'ll|'d)?|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n/]*|\s*[\r\n]+|\s+(?!\S)|\s+
    <|im_system|>system<|im_middle|>You are a helpful assistant<|im_end|><|im_user|>user<|im_middle|>What is 1+1?<|im_end|><|im_assistant|>assistant<|im_middle|>2<|im_end|>
    <|im_system|>system<|im_middle|>You are a helpful assistant<|im_end|>
    <|im_user|>user<|im_middle|>What is 1+1?<|im_end|>
    <|im_assistant|>assistant<|im_middle|>2<|im_end|>
    apt-get update
    apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
    git clone https://github.com/ggml-org/llama.cpp
    cmake llama.cpp -B llama.cpp/build \
        -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
    cmake --build llama.cpp/build --config Release -j --clean-first --target llama-quantize llama-cli llama-gguf-split llama-mtmd-cli
    cp llama.cpp/build/bin/llama-* llama.cpp
    export LLAMA_CACHE="unsloth/Kimi-K2-Instruct-GGUF"
    ./llama.cpp/llama-cli \
        -hf unsloth/Kimi-K2-Instruct-GGUF:TQ1_0 \
        --threads -1 \
        --n-gpu-layers 99 \
        --temp 0.6 \
        --min-p 0.01 \
        --ctx-size 16384 \
        --seed 3407 \
        -ot ".ffn_.*_exps.=CPU"
    # !pip install huggingface_hub hf_transfer
    import os
    os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "0" # Can sometimes rate limit, so set to 0 to disable
    from huggingface_hub import snapshot_download
    snapshot_download(
        repo_id = "unsloth/Kimi-K2-Instruct-GGUF",
        local_dir = "unsloth/Kimi-K2-Instruct-GGUF",
        allow_patterns = ["*UD-TQ1_0*"], # Dynamic 1bit (281GB) Use "*UD-Q2_K_XL*" for Dynamic 2bit (381GB)
    )
    ./llama.cpp/llama-cli \
        --model unsloth/Kimi-K2-Instruct-GGUF/UD-TQ1_0/Kimi-K2-Instruct-UD-TQ1_0-00001-of-00005.gguf \
        --threads -1 \
        --n-gpu-layers 99 \
        --temp 0.6 \
        --min_p 0.01 \
        --ctx-size 16384 \
        --seed 3407 \
        -ot ".ffn_.*_exps.=CPU"
    Create a Flappy Bird game in Python. You must include these things:
    1. You must use pygame.
    2. The background color should be randomly chosen and is a light shade. Start with a light blue color.
    3. Pressing SPACE multiple times will accelerate the bird.
    4. The bird's shape should be randomly chosen as a square, circle or triangle. The color should be randomly chosen as a dark color.
    5. Place on the bottom some land colored as dark brown or yellow chosen randomly.
    6. Make a score shown on the top right side. Increment if you pass pipes and don't hit them.
    7. Make randomly spaced pipes with enough space. Color them randomly as dark green or light brown or a dark gray shade.
    8. When you lose, show the best score. Make the text inside the screen. Pressing q or Esc will quit the game. Restarting is pressing SPACE again.
    The final game should be inside a markdown section in Python. Check your code for errors and fix them before the final markdown section.
    Write a Python program that shows 20 balls bouncing inside a spinning heptagon:\n- All balls have the same radius.\n- All balls have a number on it from 1 to 20.\n- All balls drop from the heptagon center when starting.\n- Colors are: #f8b862, #f6ad49, #f39800, #f08300, #ec6d51, #ee7948, #ed6d3d, #ec6800, #ec6800, #ee7800, #eb6238, #ea5506, #ea5506, #eb6101, #e49e61, #e45e32, #e17b34, #dd7a56, #db8449, #d66a35\n- The balls should be affected by gravity and friction, and they must bounce off the rotating walls realistically. There should also be collisions between balls.\n- The material of all the balls determines that their impact bounce height will not exceed the radius of the heptagon, but higher than ball radius.\n- All balls rotate with friction, the numbers on the ball can be used to indicate the spin of the ball.\n- The heptagon is spinning around its center, and the speed of spinning is 360 degrees per 5 seconds.\n- The heptagon size should be large enough to contain all the balls.\n- Do not use the pygame library; implement collision detection algorithms and collision response etc. by yourself. The following Python libraries are allowed: tkinter, math, numpy, dataclasses, typing, sys.\n- All codes should be put in a single Python file.


    Repetition Penalty of 1.05

  • Chat template:


  • Recommended context output: 65,536 tokens (can be increased). Details here.

• Or, download the model via huggingface_hub (after installing pip install huggingface_hub hf_transfer). You can choose UD-Q2_K_XL or other quantized versions.

  • Run the model in conversation mode and try any prompt.

• Edit --threads -1 for the number of CPU threads, --ctx-size 262144 for context length, and --n-gpu-layers 99 for how many layers to offload to the GPU. Try lowering it if your GPU runs out of memory, and remove it entirely for CPU-only inference.

  • As an extra, we also made 1 single 150GB UD-IQ1_M file (so Ollama works) at https://huggingface.co/unsloth/Qwen3-Coder-480B-A35B-Instruct-GGUF/blob/main/Qwen3-Coder-480B-A35B-Instruct-UD-IQ1_M.gguf






    tokenizer.apply_chat_template(
        text, 
        tokenize = False, 
        add_generation_prompt = False,
        reasoning_effort = "medium",
    )
    from unsloth.chat_templates import standardize_sharegpt
    dataset = standardize_sharegpt(dataset)
    dataset = dataset.map(formatting_prompts_func, batched = True,)
    print(dataset[0]['text'])
model.save_pretrained_merged(save_directory, tokenizer, save_method = "mxfp4")
    model.push_to_hub_merged(repo_name, tokenizer=tokenizer, token= hf_token, save_method="mxfp4")
    apt-get update
    apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
    git clone https://github.com/ggml-org/llama.cpp
    cmake llama.cpp -B llama.cpp/build \
        -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
    cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-gguf-split
    cp llama.cpp/build/bin/llama-* llama.cpp
    python3 llama.cpp/convert_hf_to_gguf.py gpt-oss-finetuned-merged/ --outfile gpt-oss-finetuned-mxfp4.gguf
    llama.cpp/llama-cli --model gpt-oss-finetuned-mxfp4.gguf \
        --jinja -ngl 99 --threads -1 --ctx-size 16384 \
        --temp 1.0 --top-p 1.0 --top-k 0 \
         -p "The meaning to life and the universe is"
    from unsloth import FastLanguageModel
    import torch
    max_seq_length = 1024
    dtype = None
    
    # 4bit pre quantized models we support for 4x faster downloading + no OOMs.
    fourbit_models = [
        "unsloth/gpt-oss-20b-unsloth-bnb-4bit", # 20B model using bitsandbytes 4bit quantization
        "unsloth/gpt-oss-120b-unsloth-bnb-4bit",
        "unsloth/gpt-oss-20b", # 20B model using MXFP4 format
        "unsloth/gpt-oss-120b",
    ] # More models at https://huggingface.co/unsloth
    
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "unsloth/gpt-oss-20b",
        dtype = dtype, # None for auto detection
        max_seq_length = max_seq_length, # Choose any for long context!
        load_in_4bit = True,  # 4 bit quantization to reduce memory
        full_finetuning = False, # [NEW!] We have full finetuning now!
        # token = "hf_...", # use one if using gated models
    )
    model = FastLanguageModel.get_peft_model(
        model,
        r = 8, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
        target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                          "gate_proj", "up_proj", "down_proj",],
        lora_alpha = 16,
        lora_dropout = 0, # Supports any, but = 0 is optimized
        bias = "none",    # Supports any, but = "none" is optimized
        # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
        use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
        random_state = 3407,
        use_rslora = False,  # We support rank stabilized LoRA
        loftq_config = None, # And LoftQ
    )
    tokenizer.apply_chat_template(
        text, 
        tokenize = False, 
        add_generation_prompt = False,
        reasoning_effort = "medium",
    )
    from unsloth.chat_templates import standardize_sharegpt
    dataset = standardize_sharegpt(dataset)
    dataset = dataset.map(formatting_prompts_func, batched = True,)
    print(dataset[0]['text'])
    model.save_pretrained_merged(save_directory, tokenizer)
    model.push_to_hub_merged(repo_name, tokenizer=tokenizer, token= hf_token)
    apt-get update
    apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
    git clone https://github.com/ggml-org/llama.cpp
    cmake llama.cpp -B llama.cpp/build \
        -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
    cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-gguf-split
    cp llama.cpp/build/bin/llama-* llama.cpp
    python3 llama.cpp/convert_hf_to_gguf.py gpt-oss-finetuned-merged/ --outfile gpt-oss-finetuned.gguf
    llama.cpp/llama-quantize gpt-oss-finetuned.gguf  gpt-oss-finetuned-Q8_0.gguf Q8_0
    llama.cpp/llama-cli --model gpt-oss-finetuned-Q8_0.gguf \
        --jinja -ngl 99 --threads -1 --ctx-size 16384 \
        --temp 1.0 --top-p 1.0 --top-k 0 \
         -p "The meaning to life and the universe is"
    <|im_start|>user
    Hey there!<|im_end|>
    <|im_start|>assistant
    What is 1+1?<|im_end|>
    <|im_start|>user
    2<|im_end|>
    <|im_start|>assistant
    # !pip install huggingface_hub hf_transfer
    import os
    os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "0" # Can sometimes rate limit, so set to 0 to disable
    from huggingface_hub import snapshot_download
    snapshot_download(
        repo_id = "unsloth/Qwen3-Coder-480B-A35B-Instruct-GGUF",
        local_dir = "unsloth/Qwen3-Coder-480B-A35B-Instruct-GGUF",
        allow_patterns = ["*UD-Q2_K_XL*"],
    )
    <|im_start|>user\nHey there!<|im_end|>\n<|im_start|>assistant\nWhat is 1+1?<|im_end|>\n<|im_start|>user\n2<|im_end|>\n<|im_start|>assistant\n
    <|im_start|>user
    What's the temperature in San Francisco now? How about tomorrow?<|im_end|>
    <|im_start|>assistant
    <tool_call>\n<function=get_current_temperature>\n<parameter=location>\nSan Francisco, CA, USA
    </parameter>\n</function>\n</tool_call><|im_end|>
    <|im_start|>user
    <tool_response>
    {"temperature": 26.1, "location": "San Francisco, CA, USA", "unit": "celsius"}
    </tool_response>\n<|im_end|>
    apt-get update
    apt-get install pciutils -y
    curl -fsSL https://ollama.com/install.sh | sh
    ollama run hf.co/unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF:UD-Q4_K_XL
    apt-get update
    apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
    git clone https://github.com/ggml-org/llama.cpp
    cmake llama.cpp -B llama.cpp/build \
        -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
    cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-gguf-split
    cp llama.cpp/build/bin/llama-* llama.cpp
    ./llama.cpp/llama-cli \
        -hf unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF:Q4_K_XL \
        --jinja -ngl 99 --threads -1 --ctx-size 32768 \
        --temp 0.7 --min-p 0.0 --top-p 0.80 --top-k 20 --repeat-penalty 1.05
    # !pip install huggingface_hub hf_transfer
    import os
    os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
    from huggingface_hub import snapshot_download
    snapshot_download(
        repo_id = "unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF",
        local_dir = "unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF",
        allow_patterns = ["*UD-Q4_K_XL*"],
    )
    apt-get update
    apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
    git clone https://github.com/ggml-org/llama.cpp
    cmake llama.cpp -B llama.cpp/build \
        -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
    cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-gguf-split
    cp llama.cpp/build/bin/llama-* llama.cpp
    ./llama.cpp/llama-cli \
        -hf unsloth/Qwen3-Coder-480B-A35B-Instruct-GGUF:Q2_K_XL \
        --threads -1 \
        --ctx-size 16384 \
        --n-gpu-layers 99 \
        -ot ".ffn_.*_exps.=CPU" \
        --temp 0.7 \
        --min-p 0.0 \
        --top-p 0.8 \
        --top-k 20 \
        --repeat-penalty 1.05
    ./llama.cpp/llama-cli \
        --model unsloth/Qwen3-Coder-480B-A35B-Instruct-GGUF/UD-Q2_K_XL/Qwen3-Coder-480B-A35B-Instruct-UD-Q2_K_XL-00001-of-00004.gguf \
        --threads -1 \
        --ctx-size 16384 \
        --n-gpu-layers 99 \
        -ot ".ffn_.*_exps.=CPU" \
        --temp 0.7 \
        --min-p 0.0 \
        --top-p 0.8 \
        --top-k 20 \
        --repeat-penalty 1.05
    def get_current_temperature(location: str, unit: str = "celsius"):
        """Get current temperature at a location.
    
        Args:
            location: The location to get the temperature for, in the format "City, State, Country".
            unit: The unit to return the temperature in. Defaults to "celsius". (choices: ["celsius", "fahrenheit"])
    
        Returns:
            the temperature, the location, and the unit in a dict
        """
        return {
            "temperature": 26.1, # PRE_CONFIGURED -> you change this!
            "location": location,
            "unit": unit,
        }
    from transformers import AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained("unsloth/Qwen3-Coder-480B-A35B-Instruct")
    
    messages = [
        {'role': 'user', 'content': "What's the temperature in San Francisco now? How about tomorrow?"},
        {'content': "", 'role': 'assistant', 'function_call': None, 'tool_calls': [
            {'id': 'ID', 'function': {'arguments': {"location": "San Francisco, CA, USA"}, 'name': 'get_current_temperature'}, 'type': 'function'},
        ]},
        {'role': 'tool', 'content': '{"temperature": 26.1, "location": "San Francisco, CA, USA", "unit": "celsius"}', 'tool_call_id': 'ID'},
    ]
    
    prompt = tokenizer.apply_chat_template(messages, tokenize = False)
    are good if you eat a cookie, or bad if you hit one of the squiggly enemies.
  • In RL, you can't know the "best action" you can take, but you can observe intermediate steps, or the final game state (win or lose)

  • Getting 4 is definitely correct.

    We just designed a reward function!

    The Reference Policy (original model)

  • The Value Model (average reward estimator)

  • We use the Reward Model to calculate the reward for the current environment, and our goal is to maximize this!

    The formula for PPO looks quite complicated because it was designed to be stable. Visit our AI Engineer talk from 2025 about RL for more in-depth math derivations of PPO.
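    For reference, one common way to write the clipped PPO surrogate objective (standard notation, not copied from the talk) is:

    $$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[\min\!\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\!\left(r_t(\theta),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_t\right)\right], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$$

    The clipping term is what keeps updates stable: the policy ratio is not allowed to move the objective further than a small band around 1.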

    OpenAI popularized this in their reinforcement learning finetuning (RFT) offering, for example.

    For example, with 8 generations per prompt, Unsloth uses only 54.3GB of VRAM for Llama 3.1 (8B), whilst standard implementations (+ Flash Attention 2) take 510.8GB (90% less for Unsloth).
  • Please note, this isn’t fine-tuning DeepSeek’s R1 distilled models or using distilled data from R1 for tuning which Unsloth already supported. This is converting a standard model into a full-fledged reasoning model using GRPO.

  • You can increase the number of generated responses per question (e.g., from 8 to 16).

  • The model learns by updating its weights every step.

  • If you're using GRPO with Unsloth locally, please "pip install diffusers" as well if you get an error. Please also use the latest version of vLLM.
  • It’s advised to apply GRPO to a model at least 1.5B in parameters to correctly generate thinking tokens as smaller models may not.

  • For GRPO's GPU VRAM requirements with QLoRA 4-bit, the general rule is that the model's parameter count (in billions) roughly equals the amount of VRAM in GB you will need (you can use less VRAM, but this is just to be safe). The more context length you set, the more VRAM. LoRA 16-bit will use at minimum 4x more VRAM.

  • Continuous fine-tuning is possible and you can just leave GRPO running in the background.

  • In the example notebooks, we use the GSM8K dataset, the current most popular choice for R1-style training.

  • If you’re using a base model, ensure you have a chat template.

  • The more you train with GRPO the better. The best part of GRPO is you don't even need that much data. All you need is a great reward function/verifier and the more time spent training, the better your model will get. Expect your reward vs step to increase as time progresses like this:

  • Training loss tracking for GRPO is now built directly into Unsloth, eliminating the need for external tools like wandb etc. It contains full logging details for all reward functions now including the total aggregated reward function itself.

  • Default reward functions can be predefined to be used on a wide array of use cases or you can ask ChatGPT/local model to generate them for you.
  • There’s no single correct way to design reward functions or verifiers - the possibilities are endless. However, they must be well-designed and meaningful, as poorly crafted rewards can unintentionally degrade model performance.

  • If a number is detected → +1
  • If no number is detected → -1

  • Reward Function 2:

    • If the number matches the correct answer → +3

    • If incorrect → -3

  • Total Reward: Sum of all reward functions (a minimal sketch of these example reward functions is shown after this list)

  • If the answer exactly matches the ideal response → +1

  • If the response is too long → -1

  • If the recipient's name is included → +1

  • If a signature block (phone, email, address) is present → +1

  • Apply proximity-based scoring — models get more reward for closer answers (e.g., predicting 9 instead of 10 is better than 3) while outliers are penalized

  • xmlcount_reward_func – Ensures exactly one of each XML tag in the response.
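
    As promised above, here is a minimal standalone sketch of the two example reward functions (plain Python over completion strings; the regex and the exact function signatures are our own illustration, not the notebook code):

    import re

    def number_present_reward(completions, **kwargs):
        # Reward Function 1: +1 if any number is detected in the completion, -1 otherwise.
        return [1.0 if re.search(r"-?\d+(?:\.\d+)?", c) else -1.0 for c in completions]

    def correctness_reward(completions, answers, **kwargs):
        # Reward Function 2: +3 if the first number found matches the reference answer, -3 otherwise.
        rewards = []
        for completion, answer in zip(completions, answers):
            match = re.search(r"-?\d+(?:\.\d+)?", completion)
            extracted = match.group(0) if match else None
            rewards.append(3.0 if extracted == str(answer) else -3.0)
        return rewards

    # The total reward per completion is the sum across all reward functions.
    completions = ["The answer is 4", "I am not sure"]
    answers = [4, 4]
    totals = [a + b for a, b in zip(number_present_reward(completions), correctness_reward(completions, answers))]
    print(totals)  # [4.0, -4.0]

    Adapting these to the trainer's exact reward-function signature is shown in the notebooks; the scoring values (+1/-1 and +3/-3) mirror the scheme described above.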

    Unsloth also uses the same GPU / CUDA memory space as the underlying inference engine (vLLM), unlike implementations in other packages. This shaves 16GB of memory.

    Total Memory Usage: 54.33GB with Unsloth (90% less) vs 510.8GB with standard implementations + Flash Attention 2.

    GRPO from a base model notebook: https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen3_(4B)-GRPO.ipynb

    gpt-oss-20b GSPO - new

    Qwen3-VL-8B - Vision GSPO - new

    Gemma 3 (4B) - Vision GSPO - new

    Qwen3 (4B) - Advanced

    DeepSeek-R1-0528-Qwen3-8B

    Llama 3.2 (3B) - Advanced

    Gemma 3 (1B)

    Phi-4 (14B)

    Qwen2.5 (3B)

    Mistral v0.3 (7B)

    Llama 3.1 (8B)

    | Metrics | Unsloth | Standard + Flash Attention 2 |
    | --- | --- | --- |
    | Training Memory Cost (GB) | 42GB | 414GB |
    | GRPO Memory Cost (GB) | 9.8GB | 78.3GB |
    | Inference Cost (GB) | 0GB | 16GB |
    | Inference KV Cache for 20K context length (GB) | 2.5GB | 2.5GB |

    Read blog
    Unsloth
    here
    read our guide!
    RLHF
    PPO
    GRPO
    RL for Vision/multimodal
    ⚡Tutorial: GRPO Training
    GSPO
    reward function/verifier
    here
    20K context lengths
    see here
    read our guide!
    Advanced GRPO notebooks
    Advanced GRPO Colab Notebook
    @willccbb
    vLLM
    Unsloth gradient checkpointing
    https://rlhfbook.com/c/11-policy-gradients.html
    https://www.youtube.com/watch?v=bAWV_yrqx4w
    https://docs.unsloth.ai/ai-engineers-2025
    https://docs.unsloth.ai/basics/reinforcement-learning-guide/tutorial-train-your-own-reasoning-model-with-grpo
    PPO formula
    GRPO advantage calculation

    2.5GB

    reinforcement learning finetuning (RFT)

    Unsloth Dynamic 2.0 GGUFs

    A big new upgrade to our Dynamic Quants!

    We're excited to introduce our Dynamic v2.0 quantization method - a major upgrade to our previous quants. This new method outperforms leading quantization methods and sets new benchmarks for 5-shot MMLU and KL Divergence.

    This means you can now run + fine-tune quantized LLMs while preserving as much accuracy as possible! You can run the 2.0 GGUFs on any inference engine like llama.cpp, Ollama, Open WebUI etc.

    Sept 10, 2025 update: You asked for tougher benchmarks, so we’re showcasing Aider Polyglot results! Our Dynamic 3-bit DeepSeek V3.1 GGUF scores 75.6%, surpassing many full-precision SOTA LLMs. Read more.

    The key advantage of using the Unsloth package and models is our active role in fixing critical bugs in major models. We've collaborated directly with the teams behind Qwen3, Meta (Llama 4), Mistral (Devstral), Google (Gemma 1-3) and Microsoft (Phi-3/4), contributing essential fixes that significantly boost accuracy.

    Detailed analysis of our benchmarks and evaluation further below.

    💡 What's New in Dynamic v2.0?

    • Revamped Layer Selection for GGUFs + safetensors: Unsloth Dynamic 2.0 now selectively quantizes layers much more intelligently and extensively. Rather than modifying only select layers, we now dynamically adjust the quantization type of every possible layer, and the combinations will differ for each layer and model.

    • Current selected and all future GGUF uploads will utilize Dynamic 2.0 and our new calibration dataset. The dataset contains over 1.5M tokens (depending on the model) and comprises high-quality, hand-curated and cleaned data - to greatly enhance conversational chat performance.

    • Previously, our Dynamic quantization (DeepSeek-R1 1.58-bit GGUF) was effective only for MoE architectures. Dynamic 2.0 quantization now works on all models (including MoEs & non-MoEs).

    To ensure accurate benchmarking, we built an internal evaluation framework to match official reported 5-shot MMLU scores of Llama 4 and Gemma 3. This allowed apples-to-apples comparisons between full-precision vs. Dynamic v2.0, QAT and standard imatrix GGUF quants.

    All future GGUF uploads will utilize Unsloth Dynamic 2.0, and our Dynamic 4-bit safe tensor quants will also benefit from this in the future.

    📊 Why KL Divergence?

    The research paper "Accuracy is Not All You Need" showcases how pruning layers, even by selecting unnecessary ones, still yields vast differences in terms of "flips". A "flip" is defined as an answer changing from incorrect to correct or vice versa. The paper shows how MMLU might not decrease as we prune layers or quantize, but that's because some incorrect answers might have "flipped" to become correct. Our goal is to match the original model, so measuring "flips" is a good metric.

    KL Divergence should be the gold standard for reporting quantization errors as per the research paper "Accuracy is Not All You Need". Using perplexity is incorrect since output token values can cancel out, so we must use KLD!
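
    Concretely, the quantity being measured is the KL Divergence between the full-precision model's next-token distribution P and the quantized model's distribution Q:

    $$D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{i} P(i)\,\log\frac{P(i)}{Q(i)}$$

    and we track the mean of this value over all evaluated tokens, so per-token errors cannot cancel out the way they can with perplexity.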

    The paper also shows that, interestingly, KL Divergence is highly correlated with flips, so our goal is to reduce the mean KL Divergence whilst increasing the disk space of the quantization as little as possible.

    ⚖️ Calibration Dataset Overfitting

    Most frameworks report perplexity and KL Divergence using a test set of Wikipedia articles. However, we noticed that using a calibration dataset which is also Wikipedia-related causes quants to overfit and attain lower perplexity scores. We utilize the Calibration_v3 and Calibration_v5 datasets for fair testing, which include some wikitext data amongst other data. Also, instruct models have unique chat templates, and using text-only calibration datasets is not effective for instruct models (it is fine for base models). In fact most imatrix GGUFs are typically calibrated with these issues. As a result, they naturally perform better on KL Divergence benchmarks that also use Wikipedia data, since the model is essentially optimized for that domain.

    To ensure a fair and controlled evaluation, we chose not to use our own calibration dataset (which is optimized for chat performance) when benchmarking KL Divergence. Instead, we conducted tests using the same standard Wikipedia datasets, allowing us to directly compare the performance of our Dynamic 2.0 method against the baseline imatrix approach.

    🔢 MMLU Replication Adventure

    • Replicating MMLU 5 shot was nightmarish. We could not replicate MMLU results for many models including Llama 3.1 (8B) Instruct, Gemma 3 (12B) and others due to subtle implementation issues. Llama 3.1 (8B) for example should be getting ~68.2%, whilst using incorrect implementations can attain 35% accuracy.

    • Llama 3.1 (8B) Instruct has a MMLU 5 shot accuracy of 67.8% using a naive MMLU implementation. We find however Llama tokenizes "A" and "_A" (A with a space in front) as different token ids. If we consider both spaced and non spaced tokens, we get 68.2% (+0.4%)

    • Interestingly, Llama 3, as per Eleuther AI's LLM Harness, also appends "The best answer is" to the question, following Llama 3's original MMLU benchmarks.

    • There are many other subtle issues, and so to benchmark everything in a controlled environment, we designed our own MMLU implementation from scratch by investigating github.com/hendrycks/test directly, and verified our results across multiple models by comparing to reported numbers.

    ✨ Gemma 3 QAT Replication, Benchmarks

    The Gemma team released two QAT (quantization aware training) versions of Gemma 3:

    1. Q4_0 GGUF - Quantizes all layers to Q4_0 via the formula w = q * block_scale, with each block having 32 weights. See the llama.cpp wiki for more details (a small numeric sketch follows this list).

    2. int4 version - presumably TorchAO int4 style?
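
    As a small numeric sketch of the Q4_0 idea from point 1 above (our own illustration of the stated formula, not Google's or llama.cpp's actual code; the scale value is arbitrary):

    import numpy as np

    # One Q4_0-style block: 32 weights stored as 4-bit integers plus a single per-block scale.
    block_scale = np.float32(0.0123)            # per-block scale (illustrative value)
    q = np.random.randint(-8, 8, size = 32)     # signed 4-bit quantized weights for the block

    # Dequantization follows the stated formula: w = q * block_scale for each weight in the block.
    w = q.astype(np.float32) * block_scale
    print(w[:4])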

    We benchmarked all Q4_0 GGUF versions, and did extensive experiments on the 12B model. We see the 12B Q4_0 QAT model gets 67.07% whilst the full bfloat16 12B version gets 67.15% on 5-shot MMLU. That's very impressive! The 27B model is nearly there as well!

    Metric
    1B
    4B
    12B
    27B

    We designed a new Efficiency metric which calculates the usefulness of the model whilst also taking into account its disk size and MMLU 5 shot score:

    We subtract 25 since MMLU has 4 multiple choices - A, B, C or D. Assume we make a model that simply randomly chooses answers - it'll get 25% accuracy and have a disk space of only a few bytes, but clearly this is not a useful model.

    On KL Divergence vs the base model, below is a table showcasing the improvements. Reminder: the closer the KL Divergence is to 0, the better (i.e. 0 means identical to the full-precision model).

    Quant
    Baseline KLD
    GB
    New KLD
    GB

    If we plot the ratio of the disk space increase and the KL Divergence ratio change, we can see a much clearer benefit! Our dynamic 2bit Q2_K_XL reduces KLD quite a bit (around 7.5%).

    Truncated table of results for MMLU for Gemma 3 (27B). See below.

    1. Our dynamic 4bit version is 2GB smaller whilst having +1% extra accuracy vs the QAT version!

    2. Efficiency wise, 2bit Q2_K_XL and others seem to do very well!

    Quant
    Unsloth
    Unsloth + QAT
    Disk Size
    Efficiency
    Click here for Full Google's Gemma 3 (27B) QAT Benchmarks:
    Model
    Unsloth
    Unsloth + QAT
    Disk Size
    Efficiency

    🦙 Llama 4 Bug Fixes + Run

    We also helped fix a few Llama 4 bugs:

    • Llama 4 Scout changed the RoPE Scaling configuration in their official repo. We helped resolve issues in llama.cpp to enable this

    • Llama 4's QK Norm's epsilon for both Scout and Maverick should be taken from the config file - this means using 1e-05 and not 1e-06. We helped resolve these in llama.cpp and transformers

    • The Llama 4 team and vLLM also independently fixed an issue with QK Norm being shared across all heads (it should not be). MMLU Pro increased from 68.58% to 71.53% accuracy.

    As shown in our graph, our 4-bit Dynamic QAT quantization delivers better performance on 5-shot MMLU while also being smaller in size.

    Running Llama 4 Scout:

    To run Llama 4 Scout for example, first clone llama.cpp:

    Then download our new Dynamic v2.0 quant for Scout:

    And let's do inference!

    Read more on running Llama 4 here:

    QwQ-32B: How to Run effectively

    How to run QwQ-32B effectively with our bug fixes and without endless generations + GGUFs.

    Qwen released QwQ-32B - a reasoning model with performance comparable to DeepSeek-R1 on many benchmarks. However, people have been experiencing infinite generations, many repetitions, <think> token issues and finetuning issues. We hope this guide will help debug and fix most issues!

    Our model uploads with our bug fixes work great for fine-tuning, vLLM and Transformers. If you're using llama.cpp or engines that use llama.cpp as a backend, follow our instructions here to fix endless generations.

    Unsloth QwQ-32B uploads with our bug fixes:

    epsilon=0.2,
    epsilon_high=0.28, # one sided
    delta=1.5 # two sided
    
    loss_type='gspo',
    # or:
    loss_type='grpo',
    # or:
    loss_type='dr_grpo',
    
    mask_truncated_completions=True,
    pip install unsloth vllm
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "unsloth/Llama-3.2-3B-Instruct",
        fast_inference = True,
    )
    model.fast_generate(["Hello!"])

  • Model-Specific Quants: Each model now uses a custom-tailored quantization scheme. E.g. the layers quantized in Gemma 3 differ significantly from those in Llama 4.

  • To maximize efficiency, especially on Apple Silicon and ARM devices, we now also add Q4_NL, Q5_1, Q5_0, Q4_1, and Q4_0 formats.

  • Efficiency by model size: 1.20 (1B), 10.26 (4B), 5.59 (12B), 2.84 (27B)

    | Quant | Baseline KLD | GB | New KLD | GB |
    | --- | --- | --- | --- | --- |
    | IQ2_XXS | 0.535764 | 7.16 | 0.521039 | 7.31 |
    | IQ2_M | 0.26554 | 8.84 | 0.258192 | 8.96 |
    | Q2_K_XL | 0.229671 | 9.78 | 0.220937 | 9.95 |
    | Q3_K_XL | 0.087845 | 12.51 | 0.080617 | 12.76 |
    | Q4_K_XL | 0.024916 | 15.41 | 0.023701 | 15.64 |

    66.47

    64.47

    8.96

    4.40

    Q2_K_XL

    68.70

    67.77

    9.95

    4.30

    Q3_K_XL

    70.87

    69.50

    12.76

    3.49

    Q4_K_XL

    71.47

    71.07

    15.64

    2.94

    Google QAT

    70.64

    17.2

    2.65

    3.03

    Full Gemma 3 (27B) benchmark results (5-shot MMLU vs disk size):

    | Quant | Unsloth | Unsloth + QAT | Disk Size | Efficiency |
    | --- | --- | --- | --- | --- |
    | IQ1_M | 48.10 | 47.23 | 6.51 | 3.42 |
    | IQ2_XXS | 59.20 | 56.57 | 7.31 | 4.32 |
    | IQ2_M | 66.47 | 64.47 | 8.96 | 4.40 |
    | Q2_K | 68.50 | 67.60 | 9.78 | 4.35 |
    | Q2_K_XL | 68.70 | 67.77 | 9.95 | 4.30 |
    | IQ3_XXS | 68.27 | 67.07 | 10.07 | 4.18 |
    | Q3_K_M | 70.70 | 69.77 | 12.51 | 3.58 |
    | Q3_K_XL | 70.87 | 69.50 | 12.76 | 3.49 |
    | Q4_K_M | 71.23 | 71.00 | 15.41 | 2.98 |
    | Q4_K_XL | 71.47 | 71.07 | 15.64 | 2.94 |
    | Q5_K_M | 71.77 | 71.23 | 17.95 | 2.58 |
    | Q6_K | 71.87 | 71.60 | 20.64 | 2.26 |
    | Q8_0 | 71.60 | 71.53 | 26.74 | 1.74 |
    | Google QAT | – | 70.64 | 17.2 | 2.65 |

    Wolfram Ravenwolf showcased how our GGUFs via llama.cpp attain much higher accuracy than third party inference providers - this was most likely a combination of the issues explained above, and also probably due to quantization issues.

    Gemma 3 QAT Q4_0 GGUF results by model size:

    | Metric | 1B | 4B | 12B | 27B |
    | --- | --- | --- | --- | --- |
    | MMLU 5-shot | 26.12% | 55.13% | 67.07% (67.15% BF16) | 70.64% (71.5% BF16) |
    | Disk Space | 0.93GB | 2.94GB | 7.52GB | 16.05GB |

    $$\text{Efficiency} = \frac{\text{MMLU 5-shot score} - 25}{\text{Disk Space (GB)}}$$
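
    As a quick sanity check of the metric, plugging in the 12B QAT numbers from the table above gives:

    $$\frac{67.07 - 25}{7.52} \approx 5.59$$

    which matches the Efficiency value listed for the 12B model.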

    | Quant | Baseline KLD | GB | New KLD | GB |
    | --- | --- | --- | --- | --- |
    | IQ1_S | 1.035688 | 5.83 | 0.972932 | 6.06 |
    | IQ1_M | 0.832252 | 6.33 | 0.800049 | 6.51 |

    IQ1_M

    48.10

    47.23

    6.51

    3.42

    IQ2_XXS

    59.20

    56.57

    7.31

    4.32

    IQ1_S

    41.87

    43.37

    Qwen3
    Meta (Llama 4)
    Mistral (Devstral)
    Google (Gemma 1–3)
    Microsoft (Phi-3/4)
    Accuracy is Not All You Need
    Calibration_v3
    Calibration_v5
    LLM Harness
    github.com/hendrycks/test
    llama.cpp wiki
    TorchAO int4 style
    change here
    llama.cpp
    transformers
    here
    https://docs.unsloth.ai/basics/tutorial-how-to-run-and-fine-tune-llama-4
    Thinking Aider Benchmarks
    5-shot MMLU Benchmarks
    MMLU implementation issues

    Efficiency*

    IQ2_XXS

    IQ2_M

    6.06

    ⚙️ Official Recommended Settings

    According to Qwen, these are the recommended settings for inference:

    • Temperature of 0.6

    • Top_K of 40 (or 20 to 40)

    • Min_P of 0.00 (optional, but 0.01 works well, llama.cpp default is 0.1)

    • Top_P of 0.95

    • Repetition Penalty of 1.0. (1.0 means disabled in llama.cpp and transformers)

    • Chat template: <|im_start|>user\nCreate a Flappy Bird game in Python.<|im_end|>\n<|im_start|>assistant\n<think>\n

    llama.cpp uses min_p = 0.1 by default, which might cause issues. Force it to 0.0.

    👍 Recommended settings for llama.cpp

    We noticed many people use a Repetition Penalty greater than 1.0. For example 1.1 to 1.5. This actually interferes with llama.cpp's sampling mechanisms. The goal of a repetition penalty is to penalize repeated generations, but we found this doesn't work as expected.

    Turning off Repetition Penalty also works (ie setting it to 1.0), but we found using it to be useful to penalize endless generations.

    To use it, we found you must also edit the ordering of samplers in llama.cpp so they are applied before the Repetition Penalty, otherwise there will be endless generations. So add this:

    By default, llama.cpp uses this ordering:

    We reorder essentially temperature and dry, and move min_p forward. This means we apply samplers in this order:

    If you still encounter issues, you can increase the --repeat-penalty from 1.0 to 1.2 or 1.3.

    Courtesy to @krist486 for bringing llama.cpp sampling directions to our attention.

    ☀️ Dry Repetition Penalty

    We investigated usage of the dry penalty as suggested in https://github.com/ggml-org/llama.cpp/blob/master/examples/main/README.md using a value of 0.8, but we actually found this to cause syntax issues, especially for coding. If you still encounter issues, you can increase the dry penalty to 0.8.

    Utilizing our swapped sampling ordering can also help if you decide to use dry penalty.

    🦙 Tutorial: How to Run QwQ-32B in Ollama

    1. Install ollama if you haven't already!

    1. Run the model! Note you can call ollama serve in another terminal if it fails! We include all our fixes and suggested parameters (temperature, min_p etc.) in the params file of our Hugging Face upload!

    📖 Tutorial: How to Run QwQ-32B in llama.cpp

    1. Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.

    1. Download the model via huggingface_hub (after installing pip install huggingface_hub hf_transfer). You can choose Q4_K_M, or other quantized versions (like BF16 full precision). More versions at: https://huggingface.co/unsloth/QwQ-32B-GGUF

    1. Run Unsloth's Flappy Bird test, which will save the output to Q4_K_M_yes_samplers.txt

    2. Edit --threads 32 for the number of CPU threads, --ctx-size 16384 for context length, and --n-gpu-layers 99 for how many layers to offload to the GPU. Try adjusting it if your GPU runs out of memory. Also remove it if you have CPU-only inference.

    3. We use --repeat-penalty 1.1 and --dry-multiplier 0.5 which you can adjust.

    The full input from our https://unsloth.ai/blog/deepseekr1-dynamic 1.58bit blog is:

    The beginning and the end of the final Python output after removing the thinking parts:

    Full final Python output (removed thinking parts):
    1. When running it, we get a runnable game!

    1. Now try the same without our fixes! So remove --samplers "top_k;top_p;min_p;temperature;dry;typ_p;xtc" This will save the output to Q4_K_M_no_samplers.txt

    You will get some looping, but more problematically, incorrect Python syntax and many other issues. For example the below looks correct, but is wrong! I.e. line 39 pipes.clear() ### <<< NameError: name 'pipes' is not defined. Did you forget to import 'pipes'?

    1. If you use --repeat-penalty 1.5, it gets even worse and more obvious, with actually totally incorrect syntax.

    1. You might be wondering: maybe it's Q4_K_M? BF16, i.e. full precision, should work fine, right? Incorrect - the outputs again fail if we do not use our fix of --samplers "top_k;top_p;min_p;temperature;dry;typ_p;xtc" when using a Repetition Penalty.

    🌄 Still doesn't work? Try Min_p = 0.1, Temperature = 1.5

    According to the Min_p paper https://arxiv.org/pdf/2407.01082, for more creative and diverse outputs, and if you still see repetitions, try disabling top_p and top_k!

    Another approach is to disable min_p directly, since llama.cpp by default uses min_p = 0.1!

    🤔 <think> token not shown?

    Some people are reporting that because <think> is added by default in the chat template, some systems are not outputting the thinking traces correctly. You will have to manually edit the Jinja template from:

    to another by removing the <think>\n at the end. The model will now have to manually add <think>\n during inference, which might not always succeed. DeepSeek also edited all models to add a <think> token by default to force the model to go into reasoning mode.

    So change {%- if add_generation_prompt %} {{- '<|im_start|>assistant\n<think>\n' }} {%- endif %} to {%- if add_generation_prompt %} {{- '<|im_start|>assistant\n' }} {%- endif %}

    ie remove <think>\n

    Full jinja template with removed <think>\n part

    Extra Notes

    We first thought maybe:

    1. QwQ's context length was not natively 128K, but rather 32K with YaRN extension. For example in the readme file for https://huggingface.co/Qwen/QwQ-32B, we see:

    We tried overriding llama.cpp's YaRN handling, but nothing changed.

    1. We also thought maybe the RMS Layernorm epsilon was wrong - not 1e-5 but maybe 1e-6. For example this has rms_norm_eps=1e-06, whilst this has rms_norm_eps=1e-05. We also overrode it, but it did not work:

    1. We also tested if tokenizer IDs matched between llama.cpp and normal Transformers courtesy of @kalomaze. They matched, so this was not the culprit.

    We provide our experimental results below:

    ✏️ Tokenizer Bug Fixes

  • We found a few issues as well, specifically impacting finetuning! The EOS token is correct, but the PAD token should probably be "<|vision_pad|>". We updated it in: https://huggingface.co/unsloth/QwQ-32B/blob/main/tokenizer_config.json

    🛠️ Dynamic 4-bit Quants

    We also uploaded dynamic 4bit quants which increase accuracy vs naive 4bit quantizations! We attach the QwQ quantization error plot analysis for both activation and weight quantization errors:

    We uploaded dynamic 4-bit quants to: https://huggingface.co/unsloth/QwQ-32B-unsloth-bnb-4bit

    Since vLLM 0.7.3 (2025 February 20th) https://github.com/vllm-project/vllm/releases/tag/v0.7.3, vLLM now supports loading Unsloth dynamic 4bit quants!
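
    A minimal sketch of loading the dynamic 4-bit checkpoint in vLLM (assuming a recent vLLM build with bitsandbytes support; the argument names shown may differ slightly across vLLM versions):

    from vllm import LLM, SamplingParams

    llm = LLM(
        model = "unsloth/QwQ-32B-unsloth-bnb-4bit",
        quantization = "bitsandbytes",   # load the pre-quantized bnb 4-bit weights
        load_format = "bitsandbytes",
        max_model_len = 16384,
    )
    outputs = llm.generate(["What is 1+1?"], SamplingParams(temperature = 0.6, top_p = 0.95))
    print(outputs[0].outputs[0].text)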

    All our GGUFs are at https://huggingface.co/unsloth/QwQ-32B-GGUF!

    GGUF

    benchmarks
    instructions here

  • file_BF16_no_samplers.txt (61KB) - BF16 full precision with no sampling fix
  • file_BF16_yes_samplers.txt (55KB) - BF16 full precision with sampling fix
  • final_Q4_K_M_no_samplers.txt (71KB) - Q4_K_M precision with no sampling fix
  • final_Q4_K_M_yes_samplers.txt (65KB) - Q4_K_M precision with sampling fix

    DeepSeek-R1 Dynamic 1.58-bit

    See performance comparison tables for Unsloth's Dynamic GGUF Quants vs Standard IMatrix Quants.

    Read our full DeepSeek-R1 blogpost here: unsloth.ai/blog/deepseekr1-dynamic

    1-bit (Small) - Dynamic vs. Basic

    GGUF Type
    Quant
    Size (GB)
    Seed
    Pygame
    Background
    Accelerate SPACE
    Bird shape
    Land
    Top right score
    Pipes
    Best Score
    Quit
    Runnable
    Score
    Avg Score
    Errors
    Notes

    1-bit (Medium) - Dynamic vs. Basic

    GGUF Type
    Quant
    Size (GB)
    Seed
    Pygame
    Background
    Accelerate SPACE
    Bird shape
    Land
    Top right score
    Pipes
    Best Score
    Quit
    Runnable
    Score
    Avg Score

    2-bit (Extra extra Small) - Dynamic vs. Basic

    GGUF Type
    Quant
    Size (GB)
    Seed
    Pygame
    Background
    Accelerate SPACE
    Bird shape
    Land
    Top right score
    Pipes
    Best Score
    Quit
    Runnable
    Score
    Avg Score

    Dynamic Quantization trial output

    Non Dynamic Quantization trial output

    apt-get update
    apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
    git clone https://github.com/ggml-org/llama.cpp
    cmake llama.cpp -B llama.cpp/build \
        -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
    cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-gguf-split
    cp llama.cpp/build/bin/llama-* llama.cpp
    # !pip install huggingface_hub hf_transfer
    import os
    os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
    from huggingface_hub import snapshot_download
    snapshot_download(
        repo_id = "unsloth/Llama-4-Scout-17B-16E-Instruct-GGUF",
        local_dir = "unsloth/Llama-4-Scout-17B-16E-Instruct-GGUF",
        allow_patterns = ["*IQ2_XXS*"],
    )
    ./llama.cpp/llama-cli \
        --model unsloth/Llama-4-Scout-17B-16E-Instruct-GGUF/Llama-4-Scout-17B-16E-Instruct-UD-IQ2_XXS.gguf \
        --threads 32 \
        --ctx-size 16384 \
        --n-gpu-layers 99 \
        -ot ".ffn_.*_exps.=CPU" \
        --seed 3407 \
        --prio 3 \
        --temp 0.6 \
        --min-p 0.01 \
        --top-p 0.9 \
        -no-cnv \
        --prompt "<|header_start|>user<|header_end|>\n\nCreate a Flappy Bird game.<|eot|><|header_start|>assistant<|header_end|>\n\n"
    import pygame
    import random
    import sys
    
    pygame.init()
    WIDTH, HEIGHT = 800, 600
    screen = pygame.display.set_mode((WIDTH, HEIGHT))
    pygame.display.set_caption("Flappy Bird Clone")
    clock = pygame.time.Clock()
    
    # Colors
    pipe_colors = [(0, 100, 0), (210, 180, 140), (50, 50, 50)]
    land_colors = [(139, 69, 19), (255, 255, 0)]
    
    # Game constants
    GRAVITY = 0.5
    PIPE_SPEED = 5
    BIRD_SIZE = 30
    LAND_HEIGHT = 50
    PIPE_WIDTH = 50
    PIPE_GAP = 150
    
    class Bird:
        def __init__(self):
            self.x = WIDTH // 2
            self.y = HEIGHT // 2
            self.velocity = 0
            self.shape = random.choice(['square', 'circle', 'triangle'])
            self.color = (random.randint(0, 100), random.randint(0, 100), random.randint(0, 100))
            self.rect = pygame.Rect(self.x - BIRD_SIZE//2, self.y - BIRD_SIZE//2, BIRD_SIZE, BIRD_SIZE)
        
        def update(self):
            self.velocity += GRAVITY
            self.y += self.velocity
            self.rect.y = self.y - BIRD_SIZE//2
            self.rect.x = self.x - BIRD_SIZE//2  # Keep x centered
        
        def draw(self):
            if self.shape == 'square':
                pygame.draw.rect(screen, self.color, self.rect)
            elif self.shape == 'circle':
                pygame.draw.circle(screen, self.color, (self.rect.centerx, self.rect.centery), BIRD_SIZE//2)
            elif self.shape == 'triangle':
                points = [
                    (self.rect.centerx, self.rect.top),
                    (self.rect.left, self.rect.bottom),
                    (self.rect.right, self.rect.bottom)
                ]
                pygame.draw.polygon(screen, self.color, points)
    
    def spawn_pipe():
        pipe_x = WIDTH
        top_height = random.randint(50, HEIGHT - PIPE_GAP - LAND_HEIGHT)
        rect_top = pygame.Rect(pipe_x, 0, PIPE_WIDTH, top_height)
        bottom_y = top_height + PIPE_GAP
        bottom_height = (HEIGHT - LAND_HEIGHT) - bottom_y
        rect_bottom = pygame.Rect(pipe_x, bottom_y, PIPE_WIDTH, bottom_height)
        color = random.choice(pipe_colors)
        return {
            'rect_top': rect_top,
            'rect_bottom': rect_bottom,
            'color': color,
            'scored': False
        }
    
    def main():
        best_score = 0
        current_score = 0
        game_over = False
        pipes = []
        first_time = True  # Track first game play
    
        # Initial setup
        background_color = (173, 216, 230)  # Light blue initially
        land_color = random.choice(land_colors)
        bird = Bird()
    
        while True:
            for event in pygame.event.get():
                if event.type == pygame.QUIT:
                    pygame.quit()
                    sys.exit()
                if event.type == pygame.KEYDOWN:
                    if event.key == pygame.K_ESCAPE or event.key == pygame.K_q:
                        pygame.quit()
                        sys.exit()
                    if event.key == pygame.K_SPACE:
                        if game_over:
                            # Reset the game
                            bird = Bird()
                            pipes.clear()
                            current_score = 0
                            if first_time:
                                # First restart after initial game over
                                background_color = (random.randint(200, 255), random.randint(200, 255), random.randint(200, 255))
                                first_time = False
                            else:
                                background_color = (random.randint(200, 255), random.randint(200, 255), random.randint(200, 255))
                            land_color = random.choice(land_colors)
                            game_over = False
                        else:
                            # Jump the bird
                            bird.velocity = -15  # Initial upward velocity
    
            if not game_over:
                # Update bird and pipes
                bird.update()
    
                # Move pipes left
                remove_pipes = []
                for pipe in pipes:
                    pipe['rect_top'].x -= PIPE_SPEED
                    pipe['rect_bottom'].x -= PIPE_SPEED
                    # Check if bird passed the pipe
                    if not pipe['scored'] and bird.rect.x > pipe['rect_top'].right:
                        current_score += 1
                        pipe['scored'] = True
                    # Check if pipe is offscreen
                    if pipe['rect_top'].right < 0:
                        remove_pipes.append(pipe)
                # Remove offscreen pipes
                for p in remove_pipes:
                    pipes.remove(p)
    
                # Spawn new pipe if needed
                if not pipes or pipes[-1]['rect_top'].x < WIDTH - 200:
                    pipes.append(spawn_pipe())
    
                # Check collisions
                land_rect = pygame.Rect(0, HEIGHT - LAND_HEIGHT, WIDTH, LAND_HEIGHT)
                bird_rect = bird.rect
                # Check pipes
                for pipe in pipes:
                    if bird_rect.colliderect(pipe['rect_top']) or bird_rect.colliderect(pipe['rect_bottom']):
                        game_over = True
                        break
                # Check land and top
                if bird_rect.bottom >= land_rect.top or bird_rect.top <= 0:
                    game_over = True
    
                if game_over:
                    if current_score > best_score:
                        best_score = current_score
    
            # Drawing
            screen.fill(background_color)
            # Draw pipes
            for pipe in pipes:
                pygame.draw.rect(screen, pipe['color'], pipe['rect_top'])
                pygame.draw.rect(screen, pipe['color'], pipe['rect_bottom'])
            # Draw land
            pygame.draw.rect(screen, land_color, (0, HEIGHT - LAND_HEIGHT, WIDTH, LAND_HEIGHT))
            # Draw bird
            bird.draw()
            # Draw score
            font = pygame.font.SysFont(None, 36)
            score_text = font.render(f'Score: {current_score}', True, (0, 0, 0))
            screen.blit(score_text, (WIDTH - 150, 10))
            # Game over screen
            if game_over:
                over_text = font.render('Game Over!', True, (255, 0, 0))
                best_text = font.render(f'Best: {best_score}', True, (255, 0, 0))
                restart_text = font.render('Press SPACE to restart', True, (255, 0, 0))
                screen.blit(over_text, (WIDTH//2 - 70, HEIGHT//2 - 30))
                screen.blit(best_text, (WIDTH//2 - 50, HEIGHT//2 + 10))
                screen.blit(restart_text, (WIDTH//2 - 100, HEIGHT//2 + 50))
            
            pygame.display.flip()
            clock.tick(60)
    
    if __name__ == "__main__":
        main()
    --samplers "top_k;top_p;min_p;temperature;dry;typ_p;xtc"
    --samplers "dry;top_k;typ_p;top_p;min_p;xtc;temperature"
    top_k=40
    top_p=0.95
    min_p=0.0
    temperature=0.6
    dry
    typ_p
    xtc
    apt-get update
    apt-get install pciutils -y
    curl -fsSL https://ollama.com/install.sh | sh
    ollama run hf.co/unsloth/QwQ-32B-GGUF:Q4_K_M
    apt-get update
    apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
    git clone https://github.com/ggerganov/llama.cpp
    cmake llama.cpp -B llama.cpp/build \
        -DBUILD_SHARED_LIBS=ON -DGGML_CUDA=ON -DLLAMA_CURL=ON
    cmake --build llama.cpp/build --config Release -j --clean-first --target llama-quantize llama-cli llama-gguf-split
    cp llama.cpp/build/bin/llama-* llama.cpp
    # !pip install huggingface_hub hf_transfer
    import os
    os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
    from huggingface_hub import snapshot_download
    snapshot_download(
        repo_id = "unsloth/QwQ-32B-GGUF",
        local_dir = "unsloth-QwQ-32B-GGUF",
        allow_patterns = ["*Q4_K_M*"], # For Q4_K_M
    )
    ./llama.cpp/llama-cli \
        --model unsloth-QwQ-32B-GGUF/QwQ-32B-Q4_K_M.gguf \
        --threads 32 \
        --ctx-size 16384 \
        --n-gpu-layers 99 \
        --seed 3407 \
        --prio 2 \
        --temp 0.6 \
        --repeat-penalty 1.1 \
        --dry-multiplier 0.5 \
        --min-p 0.01 \
        --top-k 40 \
        --top-p 0.95 \
        -no-cnv \
        --samplers "top_k;top_p;min_p;temperature;dry;typ_p;xtc" \
        --prompt "<|im_start|>user\nCreate a Flappy Bird game in Python. You must include these things:\n1. You must use pygame.\n2. The background color should be randomly chosen and is a light shade. Start with a light blue color.\n3. Pressing SPACE multiple times will accelerate the bird.\n4. The bird's shape should be randomly chosen as a square, circle or triangle. The color should be randomly chosen as a dark color.\n5. Place on the bottom some land colored as dark brown or yellow chosen randomly.\n6. Make a score shown on the top right side. Increment if you pass pipes and don't hit them.\n7. Make randomly spaced pipes with enough space. Color them randomly as dark green or light brown or a dark gray shade.\n8. When you lose, show the best score. Make the text inside the screen. Pressing q or Esc will quit the game. Restarting is pressing SPACE again.\nThe final game should be inside a markdown section in Python. Check your code for errors and fix them before the final markdown section.<|im_end|>\n<|im_start|>assistant\n<think>\n"  \
            2>&1 | tee Q4_K_M_yes_samplers.txt
    <|im_start|>user
    Create a Flappy Bird game in Python. You must include these things:
    1. You must use pygame.
    2. The background color should be randomly chosen and is a light shade. Start with a light blue color.
    3. Pressing SPACE multiple times will accelerate the bird.
    4. The bird's shape should be randomly chosen as a square, circle or triangle. The color should be randomly chosen as a dark color.
    5. Place on the bottom some land colored as dark brown or yellow chosen randomly.
    6. Make a score shown on the top right side. Increment if you pass pipes and don't hit them.
    7. Make randomly spaced pipes with enough space. Color them randomly as dark green or light brown or a dark gray shade.
    8. When you lose, show the best score. Make the text inside the screen. Pressing q or Esc will quit the game. Restarting is pressing SPACE again.
    The final game should be inside a markdown section in Python. Check your code for errors and fix them before the final markdown section.<|im_end|>
    <|im_start|>assistant
    <think>
    import pygame
    import random
    import sys
    
    pygame.init()
    ### Continues
    
    class Bird:
        def __init__(self):
            ### Continues
    
    def main():
        best_score = 0
        current_score = 0
        game_over = False
        pipes = []
        first_time = True  # Track first game play
    
        # Initial setup
        background_color = (173, 216, 230)  # Light blue initially
        land_color = random.choice(land_colors)
        bird = Bird()
    
        while True:
            for event in pygame.event.get():
                ### Continues
    
            if not game_over:
                # Update bird and pipes
                bird.update()
                ### Continues
    
            # Drawing
            ### Continues
            pygame.display.flip()
            clock.tick(60)
    
    if __name__ == "__main__":
        main()
    ./llama.cpp/llama-cli \
        --model unsloth-QwQ-32B-GGUF/QwQ-32B-Q4_K_M.gguf \
        --threads 32 \
        --ctx-size 16384 \
        --n-gpu-layers 99 \
        --seed 3407 \
        --prio 2 \
        --temp 0.6 \
        --repeat-penalty 1.1 \
        --dry-multiplier 0.5 \
        --min-p 0.01 \
        --top-k 40 \
        --top-p 0.95 \
        -no-cnv \
        --prompt "<|im_start|>user\nCreate a Flappy Bird game in Python. You must include these things:\n1. You must use pygame.\n2. The background color should be randomly chosen and is a light shade. Start with a light blue color.\n3. Pressing SPACE multiple times will accelerate the bird.\n4. The bird's shape should be randomly chosen as a square, circle or triangle. The color should be randomly chosen as a dark color.\n5. Place on the bottom some land colored as dark brown or yellow chosen randomly.\n6. Make a score shown on the top right side. Increment if you pass pipes and don't hit them.\n7. Make randomly spaced pipes with enough space. Color them randomly as dark green or light brown or a dark gray shade.\n8. When you lose, show the best score. Make the text inside the screen. Pressing q or Esc will quit the game. Restarting is pressing SPACE again.\nThe final game should be inside a markdown section in Python. Check your code for errors and fix them before the final markdown section.<|im_end|>\n<|im_start|>assistant\n<think>\n"  \
            2>&1 | tee Q4_K_M_no_samplers.txt
    import pygame
    import random
    
    pygame.init()
    
    # Constants
    WIDTH, HEIGHT = 800, 600
    GROUND_HEIGHT = 20
    GRAVITY = 0.7
    PIPE_SPEED = -3
    BIRD_SIZE = 45
    MIN_GAP = 130
    MAX_GAP = 200
    PIPE_COLORS = [(0, 96, 0), (205, 133, 63), (89, 97, 107)]
    DARK_BROWN = (94, 72, 4)
    YELLOW = (252, 228, 6)
    
    screen = pygame.display.set_mode((WIDTH, HEIGHT))
    clock = pygame.time.Clock()
    
    def random_light_color():
        return (
            random.randint(180, 230),
            random.randint(190, 300),
            random.randint(250, 255)
        )
    
    def reset_game():
        global bird_x, bird_y
        global pipes, score
        global background_color, land_color
        global bird_shape, bird_color
    
        # Bird properties
        bird_x = WIDTH * 0.3
        bird_y = HEIGHT // 2
        bird_vel = -5  # Initial upward thrust
    
        pipes.clear() ### <<< NameError: name 'pipes' is not defined. Did you forget to import 'pipes'?
    import pygame
    from random import randint  # For generating colors/shapes/positions randomly 
    pygame.init()
    
    # Constants:
    WIDTH, HEIGHT =456 ,702   #
    BACKGROUND_COLOR_LIGHTS=['lightskyblue']
    GAP_SIZE=189           #
    
    BIRD_RADIUS=3.  
    PIPE_SPEED=- ( )    ? 
    class Game():
    def __init__(self):
            self.screen_size=( )
    
    def reset_game_vars():
        global current_scor e
       # set to zero and other initial states.
    
    # Main game loop:
    while running :
         for event in pygame.event.get() : 
            if quit ... etc
    
    pygame.quit()
    print("Code is simplified. Due time constraints, full working version requires further implementation.")
    ./llama.cpp/llama-cli --model unsloth-QwQ-32B-GGUF/QwQ-32B-Q4_K_M.gguf \
        --threads 32 --n-gpu-layers 99 \
        --ctx-size 16384 \
        --temp 1.5 \
        --min-p 0.1 \
        --top-k 0 \
        --top-p 1.0 \
        -no-cnv \
        --prompt "<|im_start|>user\nCreate a Flappy Bird game in Python. You must include these things:\n1. You must use pygame.\n2. The background color should be randomly chosen and is a light shade. Start with a light blue color.\n3. Pressing SPACE multiple times will accelerate the bird.\n4. The bird's shape should be randomly chosen as a square, circle or triangle. The color should be randomly chosen as a dark color.\n5. Place on the bottom some land colored as dark brown or yellow chosen randomly.\n6. Make a score shown on the top right side. Increment if you pass pipes and don't hit them.\n7. Make randomly spaced pipes with enough space. Color them randomly as dark green or light brown or a dark gray shade.\n8. When you lose, show the best score. Make the text inside the screen. Pressing q or Esc will quit the game. Restarting is pressing SPACE again.\nThe final game should be inside a markdown section in Python. Check your code for errors and fix them before the final markdown section.<|im_end|>\n<|im_start|>assistant\n<think>\n"
    ./llama.cpp/llama-cli --model unsloth-QwQ-32B-GGUF/QwQ-32B-Q4_K_M.gguf \
        --threads 32 --n-gpu-layers 99 \
        --ctx-size 16384 \
        --temp 0.6 \
        --min-p 0.0 \
        --top-k 40 \
        --top-p 0.95 \
        -no-cnv \
        --prompt "<|im_start|>user\nCreate a Flappy Bird game in Python. You must include these things:\n1. You must use pygame.\n2. The background color should be randomly chosen and is a light shade. Start with a light blue color.\n3. Pressing SPACE multiple times will accelerate the bird.\n4. The bird's shape should be randomly chosen as a square, circle or triangle. The color should be randomly chosen as a dark color.\n5. Place on the bottom some land colored as dark brown or yellow chosen randomly.\n6. Make a score shown on the top right side. Increment if you pass pipes and don't hit them.\n7. Make randomly spaced pipes with enough space. Color them randomly as dark green or light brown or a dark gray shade.\n8. When you lose, show the best score. Make the text inside the screen. Pressing q or Esc will quit the game. Restarting is pressing SPACE again.\nThe final game should be inside a markdown section in Python. Check your code for errors and fix them before the final markdown section.<|im_end|>\n<|im_start|>assistant\n<think>\n"
    {
      ...,
      "rope_scaling": {
        "factor": 4.0,
        "original_max_position_embeddings": 32768,
        "type": "yarn"
      }
    }
    --override-kv qwen2.context_length=int:131072 \
    --override-kv qwen2.rope.scaling.type=str:yarn \
    --override-kv qwen2.rope.scaling.factor=float:4 \
    --override-kv qwen2.rope.scaling.original_context_length=int:32768 \
    --override-kv qwen2.rope.scaling.attn_factor=float:1.13862943649292 \
    --override-kv qwen2.attention.layer_norm_rms_epsilon=float:0.000001 \
    "eos_token": "<|im_end|>",
    "pad_token": "<|endoftext|>",
    Dynamic 4-bit
    BnB 4-bit
    16-bit

    IQ1_S

    131

    3408

    1

    1

    0.25

    1

    0.5

    1

    0.5

    1

    1

    0

    7.25

    score =B4 NameError: name 'B4' is not defined

    Better - selects pipe colors randomnly, but all are just 1 color - should be different. Dropping to ground fails to reset acceleration.

    Dynamic

    IQ1_S

    131

    3409

    1

    0.5

    0.5

    0.5

    0

    1

    1

    1

    1

    0

    6.5

    6.92

    score =3D 0 SyntaxError: invalid decimal literal

    Too hard to play - acceleration too fast. Pipe colors now are random, but bird shape not changing. Land collison fails.

    Basic

    IQ1_S

    133

    3407

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    No code

    Fully failed. Repeats "with Dark Colurs" forever

    Basic

    IQ1_S

    133

    3408

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    No code

    Fully failed. Repeats "Pygame's" forever

    Basic

    IQ1_S

    133

    3409

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    No code

    Fully failed. Repeats "pipe_x = screen_height pipe_x = screen_height pipe_height = screen_height - Pipe_height" forever.

    Errors
    Notes

    Dynamic

    IQ1_M

    158

    3407

    1

    1

    0.75

    1

    1

    1

    1

    1

    1

    1

    9.75

    None

    A bit fast and hard to play.

    Errors
    Notes

    Dynamic

    IQ2_XXS

    183

    3407

    1

    1

    0.5

    1

    1

    1

    1

    1

    1

    1

    9.5

    None

    Too hard to play - acceleration too slow. Lags

    Dynamic

    IQ1_S

    131

    3407

    1

    0.5

    1

    0.5

    0.5

    1

    0.5

    1

    1

    0

    7

    score =!inc SyntaxError: invalid syntax

    Selects random shapes and colors at the start, but doesn't rotate across trials

  • inference_UD-IQ1_S_3407.txt (12KB)
  • inference_UD-IQ1_S_3408.txt (11KB)
  • inference_UD-IQ1_S_3409.txt (10KB)
  • inference_UD-IQ1_M_3407.txt (10KB)
  • inference_UD-IQ1_M_3408.txt (30KB)
  • inference_UD-IQ1_M_3409.txt (9KB)
  • inference_UD-IQ2_XXS_3407.txt (29KB)
  • inference_UD-IQ2_XXS_3408.txt (34KB)
  • inference_UD-IQ2_XXS_3409.txt (42KB)
  • inference_basic-IQ1_S_3407.txt (25KB)
  • inference_basic-IQ1_S_3408.txt (15KB)
  • inference_basic-IQ1_S_3409.txt (14KB)
  • inference_basic-IQ1_M_3407.txt (7KB)
  • inference_basic-IQ1_M_3408.txt (7KB)
  • inference_basic-IQ1_M_3409.txt (12KB)
  • inference_basic-IQ2_XXS_3407.txt (25KB)
  • inference_basic-IQ2_XXS_3408.txt (34KB)
  • inference_basic-IQ2_XXS_3409.txt (34KB)

    Dynamic

    Dynamic

    IQ1_M

    158

    3408

    1

    1

    0.5

    1

    1

    1

    1

    1

    1

    1

    9.5

    None

    Very good - land should be clearer. Acceleration should be slower.

    Dynamic

    IQ1_M

    158

    3409

    1

    0.5

    1

    0.5

    0.5

    1

    0.5

    1

    1

    1

    8

    9.08

    None

    Background color does not change across trials.Pipes do not touch the top. No land is seen.

    Basic

    IQ1_M

    149

    3407

    1

    0

    0

    0

    0

    0

    0

    0

    1

    0

    2

    if game_over: NameError: name 'game_over' is not defined

    Fully failed. Black screen only

    Basic

    IQ1_M

    149

    3408

    1

    0

    0

    0

    0

    0

    0

    0

    1

    0

    2

    No code

    Fully failed. Black screen then closes.

    Basic

    IQ1_M

    149

    3409

    1

    0

    0

    0

    0

    0

    0

    0

    0

    0

    1

    1.67

    window.fill((100, 100, 255)) Light Blue SyntaxError: invalid syntax && main() NameError: name 'main' is not defined.

    Fully failed.

    Dynamic

    IQ2_XXS

    183

    3408

    1

    1

    1

    1

    1

    1

    0.5

    0.5

    1

    0

    8

    global best_score SyntaxError: name 'best_score' is assigned to before global declaration

    Had to edit 2 lines - remove global best_score, and set pipe_list = []

    Dynamic

    IQ2_XXS

    183

    3409

    1

    1

    1

    1

    1

    1

    1

    1

    1

    1

    10

    9.17

    None

    Extremely good. Even makes pipes have random distances between them.

    Basic

    IQ2_XXS

    175

    3407

    1

    0.5

    0.5

    0.5

    1

    0

    0.5

    1

    0

    0

    5

    pipe_color = random.choice([(34, 139, 34), (139, 69, 19), (47, 47, 47)) SyntaxError: closing parenthesis ')' does not match opening parenthesis '[' && pygame.draw.polygon(screen, bird_color, points) ValueError: points argument must contain more than 2 points

    Fails quiting. Same color. Collison detection a bit off. No score

    Basic

    IQ2_XXS

    175

    3408

    1

    0.5

    0.5

    0.5

    1

    1

    0.5

    1

    0

    0

    6

    pipes.append({'x': SCREEN_WIDTH, 'gap_y': random.randint(50, SCREEN_HEIGHT - 150)) SyntaxError: closing parenthesis ')' does not match opening parenthesis '{'

    Acceleration weird. Chooses 1 color per round. Cannot quit.

    Basic

    IQ2_XXS

    175

    3409

    1

    1

    1

    1

    1

    1

    1

    0

    0.5

    0

    7.5

    6.17

    screen = pygame.display.set_mode((SCREEN_WIDTH, SCREENHEIGHT)) NameError: name 'SCREENHEIGHT' is not defined. Did you mean: 'SCREEN_HEIGHT'?

    OK. Colors change. Best score does not update. Quit only ESC not Q.

    DeepSeek-V3-0324: How to Run Locally

    How to run DeepSeek-V3-0324 locally using our dynamic quants, which recover accuracy

    Please see https://docs.unsloth.ai/basics/deepseek-r1-0528-how-to-run-locally (May 28th 2025 update) to learn how to run DeepSeek faster and more efficiently!

    DeepSeek is at it again! After releasing V3, R1 Zero and R1 back in December 2024 and January 2025, DeepSeek updated their checkpoints / models for V3, and released a March update!

    According to DeepSeek, MMLU-Pro jumped +5.3% to 81.2%, GPQA gained +9.3 points, AIME +19.8 and LiveCodeBench +10.0! They provided a plot showing how the new checkpoint compares to the previous V3 checkpoint and to other models like GPT-4.5 and Claude Sonnet 3.7. But how do we run a 671 billion parameter model locally?

    | MoE Bits | Type | Disk Size | Accuracy | Link | Details |
    |---|---|---|---|---|---|
    | 1.78bit | IQ1_S | 173GB | Ok | Link | 2.06/1.56bit |
    | 1.93bit | IQ1_M | 183GB | Fair | Link | 2.5/2.06/1.56 |
    | 2.42bit | IQ2_XXS | 203GB | Suggested | Link | 2.5/2.06bit |
    | 2.71bit | Q2_K_XL | 231GB | Suggested | Link | 3.5/2.5bit |
    | 3.5bit | Q3_K_XL | 320GB | Great | Link | 4.5/3.5bit |
    | 4.5bit | Q4_K_XL | 406GB | Best | Link | 5.5/4.5bit |

    DeepSeek V3's original upload is in float8, which takes 715GB. Using Q4_K_M halves the file size to 404GB or so, and our dynamic 1.78bit quant fits in around 151GB. We suggest using our 2.7bit quant to balance size and accuracy! The 2.4bit one also works well!
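    As a quick sanity check on those numbers, the reduction factors work out roughly as follows (sizes taken from the paragraph and table above; the snippet is just illustrative arithmetic):

```python
# Rough size-reduction arithmetic for the quants mentioned above.
original_fp8 = 715   # GB, DeepSeek V3's original float8 upload
sizes = {"Q4_K_M": 404, "UD-Q2_K_XL (2.7bit)": 231, "UD-IQ1_S (1.78bit)": 151}

for name, gb in sizes.items():
    print(f"{name}: {gb}GB ~ {gb / original_fp8:.0%} of the original fp8 upload")
```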

    ⚙️ Official Recommended Settings

    According to DeepSeek, these are the recommended settings for inference:

    • Temperature of 0.3 (maybe 0.0 for coding, as seen here)

    • Min_P of 0.00 (optional, but 0.01 works well, llama.cpp default is 0.1)

    • Chat template: <|User|>Create a simple playable Flappy Bird Game in Python. Place the final game inside of a markdown section.<|Assistant|>
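    To see what that template looks like in practice, here is a minimal, purely illustrative Python helper that wraps a user message in the <|User|>/<|Assistant|> markers before handing the string to llama.cpp's --prompt flag; the helper name and example question are ours, not part of the official template:

```python
# Minimal sketch: wrap a user request in DeepSeek V3's chat template.
# The <|User|> / <|Assistant|> markers come from the template above;
# build_prompt and the example question are illustrative only.
def build_prompt(user_message: str) -> str:
    return f"<|User|>{user_message}<|Assistant|>"

if __name__ == "__main__":
    prompt = build_prompt(
        "Create a simple playable Flappy Bird Game in Python. "
        "Place the final game inside of a markdown section."
    )
    print(prompt)  # pass this string to llama-cli via --prompt
```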

    📖 Tutorial: How to Run DeepSeek-V3 in llama.cpp

    1. Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.

    NOTE using -DGGML_CUDA=ON for GPUs might take 5 minutes to compile. CPU only takes 1 minute to compile. You might be interested in llama.cpp's precompiled binaries.

    2. Download the model via huggingface_hub (after installing pip install huggingface_hub hf_transfer). You can choose UD-IQ1_S (dynamic 1.78bit quant) or other quantized versions like Q4_K_M. We recommend our 2.7bit dynamic quant UD-Q2_K_XL to balance size and accuracy. More versions at: https://huggingface.co/unsloth/DeepSeek-V3-0324-GGUF

    3. Run Unsloth's Flappy Bird test as described in our 1.58bit Dynamic Quant for DeepSeek-R1.

    4. Edit --threads 32 for the number of CPU threads, --ctx-size 16384 for context length, and --n-gpu-layers 2 for how many layers to offload to the GPU. Try adjusting it if your GPU goes out of memory. Also remove it if you have CPU-only inference.

    If we run the above, we get 2 very different results. Standard 2-bit version: Click to view result (seizure warning!) Dynamic 2-bit version: See the result below:

    Standard 2-bit. Fails with background, fails with collision

    5. Like DeepSeek-R1, V3 has 61 layers. For example, with a 24GB GPU or 80GB GPU, you can expect to offload the following number of layers after rounding down (reduce by 1 if it goes out of memory); see the sketch after the table for a rough way to estimate this for other setups:

    | Quant | File Size | 24GB GPU | 80GB GPU | 2x80GB GPU |
    |---|---|---|---|---|
    | 1.73bit | 173GB | 5 | 25 | 56 |
    | 2.22bit | 183GB | 4 | 22 | 49 |
    | 2.51bit | 212GB | 2 | 19 | 32 |
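    If you want a rough starting point for --n-gpu-layers on hardware not listed above, the sketch below estimates it by scaling the 61 layers by the fraction of the quant that fits in VRAM. This is a heuristic of ours, not the method behind the table, so treat the result as an upper bound and reduce it if you go out of memory.

```python
# Heuristic sketch (not the exact method behind the table above):
# estimate how many of DeepSeek V3's 61 layers might fit in VRAM
# for a given GGUF size, leaving some headroom for the KV cache.
def estimate_gpu_layers(vram_gb: float, quant_size_gb: float,
                        total_layers: int = 61, headroom: float = 0.85) -> int:
    usable = vram_gb * headroom          # keep ~15% free for KV cache/overhead
    layers = int(total_layers * usable / quant_size_gb)
    return max(0, min(layers, total_layers))

# Example: a 24GB GPU with the 183GB quant.
print(estimate_gpu_layers(24, 183))  # a starting point; lower it if you OOM
```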

    Running on Mac / Apple devices

    For Apple Metal devices, be careful of --n-gpu-layers. If you find the machine going out of memory, reduce it. For a 128GB unified memory machine, you should be able to offload 59 layers or so.

    🎱 Heptagon Test

    We also test our dynamic quants via a popular test from r/Localllama, which tests the model on creating a basic physics engine to simulate balls rotating in a moving enclosed heptagon shape.

    The dynamic 2.7 bit quant which is only 230GB in size actually manages to solve the heptagon puzzle! The full output for all 3 versions (including full fp8) is below:

    Dynamic 2bit Heptagon code
    Non Dynamic 2bit Heptagon code
    Float8 Heptagon code

    🕵️ Extra Findings & Tips

    1. We find that lower KV cache quantization (4bit) seems to degrade generation quality in our empirical tests - more tests need to be done, but we suggest using q8_0 cache quantization. The goal of KV cache quantization is to support longer context lengths, since the KV cache uses quite a bit of memory.

    2. We found the down_proj matrices in this model to be extremely sensitive to quantization. We had to redo some of our dynamic quants, which used 2 bits for down_proj; we now use 3 bits as the minimum for all these matrices.

    3. Using llama.cpp's Flash Attention backend does result in somewhat faster decoding speeds. Use -DGGML_CUDA_FA_ALL_QUANTS=ON when compiling. Note it's also best to set your CUDA architecture as found in https://developer.nvidia.com/cuda-gpus to reduce compilation times, then set it via -DCMAKE_CUDA_ARCHITECTURES="80".


  • A BOS token of <|begin▁of▁sentence|> is auto added during tokenization (do NOT add it manually!)
  • DeepSeek mentioned using a system prompt as well (optional) - it's in Chinese: 该助手为DeepSeek Chat,由深度求索公司创造。\n今天是3月24日,星期一。 which translates to: The assistant is DeepSeek Chat, created by DeepSeek.\nToday is Monday, March 24th.

  • For KV cache quantization, use 8bit, NOT 4bit - we found it to do noticeably worse.


  • Using a min_p = 0.01 is probably enough. llama.cpp defaults to 0.1, which is probably not necessary. Since a temperature of 0.3 is used anyway, we are very unlikely to sample low-probability tokens, so removing very unlikely tokens is a good idea. DeepSeek recommends 0.0 temperature for coding tasks.


    Dynamic 2-bit. Succeeds in creating a playable game.
    The goal is to make the heptagon spin, and the balls in the heptagon should move.

    Non Dynamic 2bit. Fails - SEIZURE WARNING again! (unsloth-q2_k_rotate.txt)

    Dynamic 2bit. Actually solves the heptagon puzzle correctly!! (unsloth-q2_k_xl_rotate.txt)

    Original float8 (fp8-heptagon.txt)

    apt-get update
    apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
    git clone https://github.com/ggml-org/llama.cpp
    cmake llama.cpp -B llama.cpp/build \
        -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
    cmake --build llama.cpp/build --config Release -j --clean-first --target llama-quantize llama-cli llama-gguf-split
    cp llama.cpp/build/bin/llama-* llama.cpp
    # !pip install huggingface_hub hf_transfer
    import os
    os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
    from huggingface_hub import snapshot_download
    snapshot_download(
        repo_id = "unsloth/DeepSeek-V3-0324-GGUF-UD",
        local_dir = "unsloth/DeepSeek-V3-0324-GGUF-UD",
        allow_patterns = ["*UD-Q2_K_XL*"], # Dynamic 2.7bit (230GB) Use "*UD-IQ_S*" for Dynamic 1.78bit (151GB)
    )
    ./llama.cpp/llama-cli \
    --model unsloth/DeepSeek-V3-0324-GGUF-UD/UD-Q2_K_XL/DeepSeek-V3-0324-UD-Q2_K_XL-00001-of-00006.gguf \
        --cache-type-k q8_0 \
        --threads 20 \
        --n-gpu-layers 2 \
        -no-cnv \
        --prio 3 \
        --temp 0.3 \
        --min-p 0.01 \
        --ctx-size 4096 \
        --seed 3407 \
        --prompt "<|User|>Create a Flappy Bird game in Python. You must include these things:\n1. You must use pygame.\n2. The background color should be randomly chosen and is a light shade. Start with a light blue color.\n3. Pressing SPACE multiple times will accelerate the bird.\n4. The bird's shape should be randomly chosen as a square, circle or triangle. The color should be randomly chosen as a dark color.\n5. Place on the bottom some land colored as dark brown or yellow chosen randomly.\n6. Make a score shown on the top right side. Increment if you pass pipes and don't hit them.\n7. Make randomly spaced pipes with enough space. Color them randomly as dark green or light brown or a dark gray shade.\n8. When you lose, show the best score. Make the text inside the screen. Pressing q or Esc will quit the game. Restarting is pressing SPACE again.\nThe final game should be inside a markdown section in Python. Check your code for errors and fix them before the final markdown section.<|Assistant|>"
    ./llama.cpp/llama-cli \
        --model DeepSeek-R1-GGUF/DeepSeek-V3-0324-UD-IQ1_S/DeepSeek-V3-0324-UD-IQ1_S-00001-of-00003.gguf \
        --cache-type-k q4_0 \
        --threads 16 \
        --prio 2 \
        --temp 0.6 \
        --ctx-size 8192 \
        --seed 3407 \
        --n-gpu-layers 59 \
        -no-cnv \
        --prompt "<|User|>Create a Flappy Bird game in Python.<|Assistant|>"
    ./llama.cpp/llama-cli \
    --model unsloth/DeepSeek-V3-0324-GGUF-UD/UD-Q2_K_XL/DeepSeek-V3-0324-UD-Q2_K_XL-00001-of-00006.gguf \
        --cache-type-k q8_0 \
        --threads 20 \
        --n-gpu-layers 2 \
        -no-cnv \
        --prio 3 \
        --temp 0.3 \
    --min-p 0.01 \
        --ctx-size 4096 \
        --seed 3407 \
        --prompt "<|User|>Write a Python program that shows 20 balls bouncing inside a spinning heptagon:\n- All balls have the same radius.\n- All balls have a number on it from 1 to 20.\n- All balls drop from the heptagon center when starting.\n- Colors are: #f8b862, #f6ad49, #f39800, #f08300, #ec6d51, #ee7948, #ed6d3d, #ec6800, #ec6800, #ee7800, #eb6238, #ea5506, #ea5506, #eb6101, #e49e61, #e45e32, #e17b34, #dd7a56, #db8449, #d66a35\n- The balls should be affected by gravity and friction, and they must bounce off the rotating walls realistically. There should also be collisions between balls.\n- The material of all the balls determines that their impact bounce height will not exceed the radius of the heptagon, but higher than ball radius.\n- All balls rotate with friction, the numbers on the ball can be used to indicate the spin of the ball.\n- The heptagon is spinning around its center, and the speed of spinning is 360 degrees per 5 seconds.\n- The heptagon size should be large enough to contain all the balls.\n- Do not use the pygame library; implement collision detection algorithms and collision response etc. by yourself. The following Python libraries are allowed: tkinter, math, numpy, dataclasses, typing, sys.\n- All codes should be put in a single Python file.<|Assistant|>"
    import tkinter as tk
    import math
    import numpy as np
    from dataclasses import dataclass
    from typing import List, Tuple, Optional
    
    # Constants
    HEPTAGON_RADIUS = 300
    BALL_RADIUS = 20
    GRAVITY = 0.2
    FRICTION = 0.99
    BOUNCE_FACTOR = 0.8
    ROTATION_SPEED = 360 / 5  # degrees per second
    SPIN_FRICTION = 0.98
    BALL_COLORS = [
        "#f8b862", "#f6ad49", "#f39800", "#f08300", "#ec6d51",
        "#ee7948", "#ed6d3d", "#ec6800", "#ec6800", "#ee7800",
        "#eb6238", "#ea5506", "#ea5506", "#eb6101", "#e49e61",
        "#e45e32", "#e17b34", "#dd7a56", "#db8449", "#d66a35"
    ]
    
    @dataclass
    class Ball:
        x: float
        y: float
        vx: float
        vy: float
        number: int
        spin: float = 0.0
        color: str = "#000000"
    
    @dataclass
    class Wall:
        x1: float
        y1: float
        x2: float
        y2: float
    
    class BouncingBalls:
        def __init__(self, root):
            self.root = root
            self.canvas = tk.Canvas(root, width=800, height=800, bg="white")
            self.canvas.pack()
            self.balls: List[Ball] = []
            self.walls: List[Wall] = []
            self.heptagon_angle = 0
            self.last_time = 0
            
            self.setup_balls()
            self.setup_heptagon()
            
            self.root.after(16, self.update)
            self.root.bind("<space>", self.reset_balls)
        
        def setup_balls(self):
            for i in range(20):
                ball = Ball(
                    x=400,
                    y=400,
                    vx=np.random.uniform(-5, 5),
                    vy=np.random.uniform(-5, 5),
                    number=i+1,
                    color=BALL_COLORS[i]
                )
                self.balls.append(ball)
        
        def setup_heptagon(self):
            # Create initial heptagon walls
            self.update_heptagon_walls(0)
        
        def update_heptagon_walls(self, angle):
            self.walls = []
            center_x, center_y = 400, 400
            angle_rad = math.radians(angle)
            
            for i in range(7):
                angle1 = angle_rad + 2 * math.pi * i / 7
                angle2 = angle_rad + 2 * math.pi * (i + 1) / 7
                
                x1 = center_x + HEPTAGON_RADIUS * math.cos(angle1)
                y1 = center_y + HEPTAGON_RADIUS * math.sin(angle1)
                x2 = center_x + HEPTAGON_RADIUS * math.cos(angle2)
                y2 = center_y + HEPTAGON_RADIUS * math.sin(angle2)
                
                self.walls.append(Wall(x1, y1, x2, y2))
        
        def reset_balls(self, event=None):
            for ball in self.balls:
                ball.x = 400
                ball.y = 400
                ball.vx = np.random.uniform(-5, 5)
                ball.vy = np.random.uniform(-5, 5)
                ball.spin = np.random.uniform(-5, 5)
        
        def update(self):
            current_time = self.root.after_idle(self.root.after, 16, self.update)
            if self.last_time == 0:
                self.last_time = current_time
                return
            
            # Calculate delta time (approximate)
            dt = 0.016  # Assuming ~60 FPS
            
            # Update heptagon rotation
            self.heptagon_angle += ROTATION_SPEED * dt
            self.update_heptagon_walls(self.heptagon_angle)
            
            # Update balls
            for ball in self.balls:
                # Apply gravity
                ball.vy += GRAVITY
                
                # Apply friction
                ball.vx *= FRICTION
                ball.vy *= FRICTION
                ball.spin *= SPIN_FRICTION
                
                # Move ball
                ball.x += ball.vx
                ball.y += ball.vy
                
                # Check collisions with walls
                self.check_wall_collisions(ball)
                
                # Check collisions with other balls
                for other in self.balls:
                    if other.number != ball.number:
                        self.check_ball_collision(ball, other)
            
            # Draw everything
            self.draw()
        
        def check_wall_collisions(self, ball):
            for wall in self.walls:
                # Find closest point on wall segment to ball
                closest = self.closest_point_on_segment(
                    wall.x1, wall.y1, wall.x2, wall.y2, ball.x, ball.y
                )
                
                # Calculate distance to wall
                dx = ball.x - closest[0]
                dy = ball.y - closest[1]
                distance = math.sqrt(dx*dx + dy*dy)
                
                if distance < BALL_RADIUS:
                    # Collision detected
                    # Calculate normal vector
                    nx = dx / distance
                    ny = dy / distance
                    
                    # Calculate relative velocity along normal
                    v_rel = ball.vx * nx + ball.vy * ny
                    
                    if v_rel < 0:  # Moving toward the wall
                        # Calculate impulse
                        j = -(1 + BOUNCE_FACTOR) * v_rel
                        
                        # Apply impulse
                        ball.vx += j * nx
                        ball.vy += j * ny
                        
                        # Add some spin based on collision
                        ball.spin += (ball.vx * ny - ball.vy * nx) * 0.1
                        
                        # Move ball out of collision
                        penetration = BALL_RADIUS - distance
                        ball.x += penetration * nx
                        ball.y += penetration * ny
        
        def check_ball_collision(self, ball1, ball2):
            dx = ball2.x - ball1.x
            dy = ball2.y - ball1.y
            distance = math.sqrt(dx*dx + dy*dy)
            
            if distance < 2 * BALL_RADIUS:
                # Collision detected
                nx = dx / distance
                ny = dy / distance
                
                # Calculate relative velocity
                v_rel_x = ball2.vx - ball1.vx
                v_rel_y = ball2.vy - ball1.vy
                v_rel = v_rel_x * nx + v_rel_y * ny
                
                if v_rel < 0:  # Moving toward each other
                    # Calculate impulse
                    j = -(1 + BOUNCE_FACTOR) * v_rel / 2
                    
                    # Apply impulses
                    ball1.vx -= j * nx
                    ball1.vy -= j * ny
                    ball2.vx += j * nx
                    ball2.vy += j * ny
                    
                    # Add spin based on collision
                    ball1.spin += (ball1.vx * ny - ball1.vy * nx) * 0.05
                    ball2.spin += (ball2.vx * ny - ball2.vy * nx) * 0.05
                    
                    # Move balls apart
                    penetration = 2 * BALL_RADIUS - distance
                    ball1.x -= penetration * nx * 0.5
                    ball1.y -= penetration * ny * 0.5
                    ball2.x += penetration * nx * 0.5
                    ball2.y += penetration * ny * 0.5
        
        @staticmethod
        def closest_point_on_segment(x1, y1, x2, y2, x, y):
            # Vector from point to segment start
            dx = x - x1
            dy = y - y1
            
            # Segment vector
            sx = x2 - x1
            sy = y2 - y1
            
            # Projection of point onto segment
            dot = dx * sx + dy * sy
            len_sq = sx * sx + sy * sy
            param = dot / len_sq if len_sq != 0 else -1
            
            if param < 0:
                return x1, y1
            elif param > 1:
                return x2, y2
            else:
                return x1 + param * sx, y1 + param * sy
        
        def draw(self):
            self.canvas.delete("all")
            
            # Draw heptagon
            points = []
            for wall in self.walls:
                points.extend([wall.x1, wall.y1])
            self.canvas.create_polygon(points, fill="", outline="black", width=2)
            
            # Draw balls
            for ball in self.balls:
                # Draw ball
                self.canvas.create_oval(
                    ball.x - BALL_RADIUS, ball.y - BALL_RADIUS,
                    ball.x + BALL_RADIUS, ball.y + BALL_RADIUS,
                    fill=ball.color, outline="black"
                )
                
                # Draw number with rotation based on spin
                angle = ball.spin * 10  # Scale spin for visual effect
                self.canvas.create_text(
                    ball.x, ball.y,
                    text=str(ball.number),
                    font=("Arial", 12, "bold"),
                    angle=angle
                )
    
    if __name__ == "__main__":
        root = tk.Tk()
        root.title("Bouncing Balls in Spinning Heptagon")
        app = BouncingBalls(root)
        root.mainloop()
    import tkinter as tk
    import math
    import random
    from dataclasses import dataclass
    from typing import List, Tuple, Optional
    import sys
    
    # Constants
    WIDTH, HEIGHT = 800, 800
    HEPTAGON_RADIUS = 300
    BALL_RADIUS = 15
    GRAVITY = 0.5
    FRICTION = 0.999
    ELASTICITY = 0.8
    ROTATION_SPEED = 2 * math.pi / 5  # 360 degrees per 5 seconds
    SPIN_DECAY = 0.99
    
    # Colors for the balls
    BALL_COLORS = [
        "#f8b862", "#f6ad49", "#f39800", "#f08300", "#ec6d51",
        "#ee7948", "#ed6d3d", "#ec6800", "#ec6800", "#ee7800",
        "#eb6238", "#ea5506", "#ea5506", "#eb6101", "#e49e61",
        "#e45e32", "#e17b34", "#dd7a56", "#db8449", "#d66a35"
    ]
    
    @dataclass
    class Ball:
        x: float
        y: float
        vx: float
        vy: float
        radius: float
        color: str
        number: int
        spin: float = 0.0
    
    @dataclass
    class Heptagon:
        center_x: float
        center_y: float
        radius: float
        angle: float = 0.0
    
    class BouncingBalls:
        def __init__(self, root):
            self.root = root
            self.canvas = tk.Canvas(root, width=WIDTH, height=HEIGHT, bg="white")
            self.canvas.pack()
            
            self.heptagon = Heptagon(WIDTH//2, HEIGHT//2, HEPTAGON_RADIUS)
            self.balls = []
            self.setup_balls()
            
            self.root.after(0, self.update)
            self.root.mainloop()
        
        def setup_balls(self):
            center_x, center_y = WIDTH//2, HEIGHT//2
            for i in range(20):
                self.balls.append(Ball(
                    x=center_x,
                    y=center_y,
                    vx=0,
                    vy=0,
                    radius=BALL_RADIUS,
                    color=BALL_COLORS[i],
                    number=i+1,
                    spin=0
                ))
        
        def update(self):
            self.canvas.delete("all")
            
            # Update heptagon angle
            self.heptagon.angle += ROTATION_SPEED / 60  # Assuming 60 FPS
            
            # Draw heptagon
            self.draw_heptagon()
            
            # Update and draw balls
            for ball in self.balls:
                # Apply gravity
                ball.vy += GRAVITY
                
                # Update position
                ball.x += ball.vx
                ball.y += ball.vy
                
                # Apply friction
                ball.vx *= FRICTION
                ball.vy *= FRICTION
                
                # Apply spin decay
                ball.spin *= SPIN_DECAY
                
                # Check collision with heptagon walls
                self.check_heptagon_collision(ball)
                
                # Check collision with other balls
                for other in self.balls:
                    if other != ball:
                        if self.check_ball_collision(ball, other):
                            self.resolve_ball_collision(ball, other)
                
                # Draw the ball
                self.draw_ball(ball)
            
            self.root.after(16, self.update)  # ~60 FPS
        
        def draw_heptagon(self):
            center_x, center_y = self.heptagon.center_x, self.heptagon.center_y
            points = []
            for i in range(7):
                angle = self.heptagon.angle + i * 2 * math.pi / 7
                x = center_x + self.heptagon.radius * math.cos(angle)
                y = center_y + self.heptagon.radius * math.sin(angle)
                points.append((x, y))
            
            # Draw heptagon
            self.canvas.create_polygon(
                [points[0], points[1], points[2], points[3], 
                 points[4], points[5], points[6]],
                outline="black", fill="", width=2
            )
        
        def draw_ball(self, ball):
            self.canvas.create_oval(
                ball.x - ball.radius,
                ball.y - ball.radius,
                ball.x + ball.radius,
                ball.y + ball.radius,
                fill=ball.color,
                outline="black"
            )
            
            # Draw the number
            self.canvas.create_text(
                ball.x, ball.y,
                text=str(ball.number),
                fill="black"
            )
        
        def check_heptagon_collision(self, ball):
            center_x, center_y = WIDTH//2, HEIGHT//2
            
            # Check distance from center
            dx = ball.x - center_x
            dy = ball.y - center_y
            dist = math.sqrt(dx**2 + dy**2)
            
            if dist + ball.radius > self.heptagon.radius:
                # Find the normal vector from center to ball
                angle = math.atan2(dy, dx)
                normal_x = math.cos(angle)
                normal_y = math.sin(angle)
                
                # Move ball back inside heptagon
                overlap = (dist + ball.radius) - self.heptagon.radius
                ball.x -= overlap * normal_x
                ball.y -= overlap * normal_y
                
                # Reflect velocity
                dot_product = ball.vx * normal_x + ball.vy * normal_y
                ball.vx -= 2 * dot_product * normal_x * ELASTICITY
                ball.vy -= 2 * dot_product * normal_y * ELASTICITY
        
        def check_ball_collision(self, ball1, ball2):
            dx = ball2.x - ball1.x
            dy = ball2.y - ball1.y
            distance = math.sqrt(dx**2 + dy**2)
            return distance < (ball1.radius + ball2.radius)
        
        def resolve_ball_collision(self, ball1, ball2):
            dx = ball2.x - ball1.x
            dy = ball2.y - ball1.y
            distance = math.sqrt(dx**2 + dy**2)
            
            # Normal vector
            nx = dx / distance
            ny = dy / distance
            
            # Relative velocity
            dvx = ball2.vx - ball1.vx
            dvy = ball2.vy - ball1.vy
            
            # Calculate impulse
            impulse = 2 * (dvx * nx + dvy * ny) / 2
            impulse *= ELASTICITY
            
            # Apply impulse
            ball1.vx -= impulse * nx
            ball1.vy -= impulse * ny
            ball2.vx += impulse * nx
            ball2.vy += impulse * ny
            
            # Separate the balls to prevent sticking
            overlap = (ball1.radius + ball2.radius) - distance
            ball1.x -= overlap * nx / 2
            ball1.y -= overlap * ny / 2
            ball2.x += overlap * nx / 2
            ball2.y += overlap * ny / 2
        
        def run(self):
            self.root.mainloop()
    
    if __name__ == "__main__":
        root = tk.Tk()
        root.title("Bouncing Balls in a Spinning Heptagon")
        app = BouncingBalls(root)
        app.run()
    import tkinter as tk
    import math
    import numpy as np
    from dataclasses import dataclass
    from typing import List, Tuple, Optional
    
    # Constants
    WIDTH, HEIGHT = 800, 800
    CENTER_X, CENTER_Y = WIDTH // 2, HEIGHT // 2
    GRAVITY = 0.2
    FRICTION = 0.99
    BALL_RADIUS = 15
    HEPTAGON_RADIUS = 300
    ROTATION_SPEED = 360 / 5  # degrees per second
    ELASTICITY = 0.8
    SPIN_FRICTION = 0.98
    
    # Colors for the balls
    BALL_COLORS = [
        '#f8b862', '#f6ad49', '#f39800', '#f08300', '#ec6d51',
        '#ee7948', '#ed6d3d', '#ec6800', '#ec6800', '#ee7800',
        '#eb6238', '#ea5506', '#ea5506', '#eb6101', '#e49e61',
        '#e45e32', '#e17b34', '#dd7a56', '#db8449', '#d66a35'
    ]
    
    @dataclass
    class Ball:
        x: float
        y: float
        vx: float
        vy: float
        radius: float
        color: str
        number: int
        spin: float = 0.0
    
        def move(self):
            self.x += self.vx
            self.y += self.vy
            self.vy += GRAVITY
            self.vx *= FRICTION
            self.vy *= FRICTION
            self.spin *= SPIN_FRICTION
    
        def collide_with_ball(self, other: 'Ball'):
            dx = other.x - self.x
            dy = other.y - self.y
            distance = math.hypot(dx, dy)
            
            if distance < self.radius + other.radius:
                # Calculate collision normal
                nx = dx / distance
                ny = dy / distance
                
                # Calculate relative velocity
                dvx = other.vx - self.vx
                dvy = other.vy - self.vy
                
                # Calculate impulse
                impulse = 2 * (dvx * nx + dvy * ny) / (1/self.radius + 1/other.radius)
                
                # Apply impulse
                self.vx += impulse * nx / self.radius
                self.vy += impulse * ny / self.radius
                other.vx -= impulse * nx / other.radius
                other.vy -= impulse * ny / other.radius
                
                # Separate balls to prevent sticking
                overlap = (self.radius + other.radius - distance) / 2
                self.x -= overlap * nx
                self.y -= overlap * ny
                other.x += overlap * nx
                other.y += overlap * ny
                
                # Transfer some spin
                transfer = impulse * 0.01
                self.spin -= transfer
                other.spin += transfer
    
    class HeptagonBounceSimulator:
        def __init__(self, root):
            self.root = root
            self.canvas = tk.Canvas(root, width=WIDTH, height=HEIGHT, bg='white')
            self.canvas.pack()
            
            self.balls = self.create_balls()
            self.heptagon_angle = 0
            self.last_time = 0
            self.running = True
            
            self.root.bind('<space>', self.toggle_pause)
            self.root.bind('<Escape>', lambda e: root.destroy())
            
            self.last_time = self.root.after(0, self.update)
        
        def create_balls(self) -> List[Ball]:
            balls = []
            for i in range(20):
                # Start all balls at center with small random velocity
                angle = np.random.uniform(0, 2 * math.pi)
                speed = np.random.uniform(0.5, 2)
                vx = math.cos(angle) * speed
                vy = math.sin(angle) * speed
                
                balls.append(Ball(
                    x=CENTER_X,
                    y=CENTER_Y,
                    vx=vx,
                    vy=vy,
                    radius=BALL_RADIUS,
                    color=BALL_COLORS[i],
                    number=i+1,
                    spin=np.random.uniform(-2, 2)
                ))
            return balls
        
        def toggle_pause(self, event):
            self.running = not self.running
            if self.running:
                self.last_time = self.root.after(0, self.update)
        
        def get_heptagon_vertices(self) -> List[Tuple[float, float]]:
            vertices = []
            for i in range(7):
                angle = math.radians(self.heptagon_angle + i * 360 / 7)
                x = CENTER_X + HEPTAGON_RADIUS * math.cos(angle)
                y = CENTER_Y + HEPTAGON_RADIUS * math.sin(angle)
                vertices.append((x, y))
            return vertices
        
        def check_ball_heptagon_collision(self, ball: Ball):
            vertices = self.get_heptagon_vertices()
            closest_dist = float('inf')
            closest_normal = (0, 0)
            closest_edge = None
            
            # Check collision with each edge of the heptagon
            for i in range(len(vertices)):
                p1 = vertices[i]
                p2 = vertices[(i + 1) % len(vertices)]
                
                # Vector from p1 to p2
                edge_x = p2[0] - p1[0]
                edge_y = p2[1] - p1[1]
                edge_length = math.hypot(edge_x, edge_y)
                
                # Normalize edge vector
                edge_x /= edge_length
                edge_y /= edge_length
                
                # Normal vector (perpendicular to edge, pointing inward)
                nx = -edge_y
                ny = edge_x
                
                # Vector from p1 to ball
                ball_to_p1_x = ball.x - p1[0]
                ball_to_p1_y = ball.y - p1[1]
                
                # Project ball onto edge normal
                projection = ball_to_p1_x * nx + ball_to_p1_y * ny
                
                # If projection is negative, ball is outside the heptagon
                if projection < ball.radius:
                    # Find closest point on edge to ball
                    edge_proj = ball_to_p1_x * edge_x + ball_to_p1_y * edge_y
                    edge_proj = max(0, min(edge_length, edge_proj))
                    closest_x = p1[0] + edge_proj * edge_x
                    closest_y = p1[1] + edge_proj * edge_y
                    
                    # Distance from ball to closest point on edge
                    dist = math.hypot(ball.x - closest_x, ball.y - closest_y)
                    
                    if dist < closest_dist:
                        closest_dist = dist
                        closest_normal = (nx, ny)
                        closest_edge = (p1, p2)
            
            if closest_dist < ball.radius:
                # Calculate bounce response
                dot_product = ball.vx * closest_normal[0] + ball.vy * closest_normal[1]
                
                # Apply bounce with elasticity
                ball.vx -= (1 + ELASTICITY) * dot_product * closest_normal[0]
                ball.vy -= (1 + ELASTICITY) * dot_product * closest_normal[1]
                
                # Add some spin based on impact
                edge_vec = (closest_edge[1][0] - closest_edge[0][0], 
                            closest_edge[1][1] - closest_edge[0][1])
                edge_length = math.hypot(edge_vec[0], edge_vec[1])
                if edge_length > 0:
                    edge_vec = (edge_vec[0]/edge_length, edge_vec[1]/edge_length)
                    # Cross product of velocity and edge direction
                    spin_effect = (ball.vx * edge_vec[1] - ball.vy * edge_vec[0]) * 0.1
                    ball.spin += spin_effect
                
                # Move ball outside the heptagon to prevent sticking
                penetration = ball.radius - closest_dist
                ball.x += penetration * closest_normal[0]
                ball.y += penetration * closest_normal[1]
        
        def update(self):
            if not self.running:
                return
            
            # Clear canvas
            self.canvas.delete('all')
            
            # Update heptagon rotation
            self.heptagon_angle += ROTATION_SPEED / 60  # Assuming ~60 FPS
            
            # Draw heptagon
            vertices = self.get_heptagon_vertices()
            self.canvas.create_polygon(vertices, outline='black', fill='', width=2)
            
            # Update and draw balls
            for i, ball in enumerate(self.balls):
                # Move ball
                ball.move()
                
                # Check collisions with heptagon
                self.check_ball_heptagon_collision(ball)
                
                # Draw ball
                self.canvas.create_oval(
                    ball.x - ball.radius, ball.y - ball.radius,
                    ball.x + ball.radius, ball.y + ball.radius,
                    fill=ball.color, outline='black'
                )
                
                # Draw number with rotation based on spin
                angle = ball.spin * 10  # Scale spin for visible rotation
                self.canvas.create_text(
                    ball.x, ball.y,
                    text=str(ball.number),
                    font=('Arial', 10, 'bold'),
                    angle=angle
                )
            
            # Check ball-ball collisions
            for i in range(len(self.balls)):
                for j in range(i + 1, len(self.balls)):
                    self.balls[i].collide_with_ball(self.balls[j])
            
            # Schedule next update
            self.last_time = self.root.after(16, self.update)  # ~60 FPS
    
    if __name__ == '__main__':
        root = tk.Tk()
        root.title('Bouncing Balls in a Spinning Heptagon')
        simulator = HeptagonBounceSimulator(root)
        root.mainloop()

    Magistral: How to Run & Fine-tune

    Meet Magistral - Mistral's new reasoning models.

    Magistral-Small-2509 is a reasoning LLM developed by Mistral AI. It excels at coding and mathematics and supports multiple languages. Magistral supports a 128k token context window and was finetuned from Mistral-Small-3.2. Magistral runs perfectly well locally on a single RTX 4090 or a Mac with 16 to 24GB RAM.


    Update: The new Magistral-2509 update is out as of September 2025, now with Vision support! We worked with Mistral again for the release of Magistral. Make sure to download Mistral's official uploads or Unsloth's uploads to get the correct implementation (i.e. correct system prompt, correct chat template etc.)

    If you're using llama.cpp, please use --jinja to enable the system prompt!

    All uploads use Unsloth Dynamic 2.0 for SOTA 5-shot MMLU and KL Divergence performance, meaning you can run & fine-tune quantized Mistral LLMs with minimal accuracy loss.

    Magistral-Small - Unsloth Dynamic uploads:

    • Dynamic 2.0 GGUF (to run): Magistral-Small-2509-GGUF (new), Magistral-Small-2507-GGUF, Magistral-Small-2506-GGUF

    • Dynamic 4-bit (to finetune/deploy): Magistral-Small-2509-unsloth-bnb-4bit (new), Magistral-Small-2507-unsloth-bnb-4bit, Magistral-Small-2506-unsloth-bnb-4bit

    • Dynamic Float8: Magistral-Small-2509-FP8-Dynamic, Magistral-Small-2509-FP8-torchao

    🖥️ Running Magistral

    ⚙️ Official Recommended Settings

    According to Mistral AI, these are the recommended settings for inference:

    • Temperature of: 0.7

    • Min_P of: 0.01 (optional, but 0.01 works well, llama.cpp default is 0.1)

    • Set top_p to: 0.95

    • A 128k context window is supported, but performance might degrade past 40k. So we recommend setting the maximum length to 40k if you see bad performance.
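    As a quick illustration of those settings in practice, here is roughly how they map onto a vLLM deployment; the model name, prompt and token limits below are placeholders, so adapt them to your setup.

```python
# Sketch: applying the recommended sampling settings in vLLM.
# Model name, max lengths and the prompt are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(model="unsloth/Magistral-Small-2509-FP8-Dynamic", max_model_len=40960)
params = SamplingParams(
    temperature=0.7,   # recommended temperature
    top_p=0.95,        # recommended top_p
    min_p=0.01,        # optional, but works well
    max_tokens=4096,
)
# For real chat use, apply the recommended system prompt and chat template first.
outputs = llm.generate(["How many 'r' are in strawberry?"], params)
print(outputs[0].outputs[0].text)
```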

    This is the recommended system prompt for Magistral 2509, 2507:

    This is the recommended system prompt for Magistral 2506:

    Our dynamic uploads have the 'UD' prefix in them. Those without it are not dynamic, however they still utilize our calibration dataset.

    • Multilingual: Magistral supports many languages including: English, French, German, Greek, Hindi, Indonesian, Italian, Japanese, Korean, Malay, Nepali, Polish, Portuguese, Romanian, Russian, Serbian, Spanish, Swedish, Turkish, Ukrainian, Vietnamese, Arabic, Bengali, Chinese, and Farsi.

    ❓Testing the model

    Mistral has their own vibe checking prompts which can be used to evaluate Magistral. Keep in mind these tests are based on running the full unquantized version of the model, however you could also test them on quantized versions:

    Easy - Make sure they always work

    Medium - Should most of the time be correct

    Hard - Should sometimes get them right

    We provide some example outputs at the end of the blog.

    🦙 Tutorial: How to Run Magistral in Ollama

    1. Install ollama if you haven't already!

    2. Run the model with our dynamic quant. We did not set the context length automatically, so it will just use Ollama's default context length. Note you can call ollama serve & in another terminal if it fails! We include all suggested parameters (temperature etc.) in params in our Hugging Face upload!

    3. Magistral supports 40K context lengths, so it's best to enable KV cache quantization. We use 8-bit quantization, which saves 50% memory usage. You can also try "q4_0" or "q8_0".

    📖 Tutorial: How to Run Magistral in llama.cpp

    1. Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.

    2. If you want to use llama.cpp directly to load models, you can do the below: (:Q4_K_XL) is the quantization type. You can also download via Hugging Face (point 3). This is similar to ollama run.

    In llama.cpp, please use --jinja to enable the system prompt!

    3. OR download the model via huggingface_hub (after installing pip install huggingface_hub hf_transfer). You can choose UD-Q4_K_XL (Unsloth Dynamic), Q4_K_M, or other quantized versions (like BF16 full precision).

    4. Run the model.

    5. Edit --threads -1 for the maximum CPU threads, --ctx-size 40960 for context length (Magistral supports 40K context length!), and --n-gpu-layers 99 for how many layers to offload to the GPU. Try adjusting it if your GPU goes out of memory. Also remove it if you have CPU-only inference. We also use 8-bit quantization for the K cache to reduce memory usage.

    6. For conversation mode:

    Remember to remove <bos> since Magistral auto adds a <bos>

    Sample outputs

    How many "r" are in strawberry? [Correct answer = 3]
    Exactly how many days ago did the French Revolution start? Today is June 4th, 2025. [Correct answer = 86,157 days]

    👁Vision Support

    Magistral 2509's September 2025 update now includes Vision support by default!

    For Magistral versions before September 2025, Xuan-Son from HuggingFace showed in their GGUF repo how it is actually possible to "graft" the vision encoder from Mistral 3.1 Instruct onto Devstral, meaning you could do the same for Magistral! According to our tests and many users, it works quite well! We also uploaded our mmproj files, which allow you to use vision with these older versions as well.

    🦥 Fine-tuning Magistral with Unsloth

    Just like standard Mistral models including Mistral Small 3.1, Unsloth supports Magistral fine-tuning. Training is 2x faster, uses 70% less VRAM and supports 8x longer context lengths. Magistral fits comfortably in a 24GB VRAM L4 GPU.

    • Magistral 2509 Kaggle (2x Tesla T4s) free

    • Magistral 2509 Colab L4 (24GB)

    Magistral slightly exceeds the memory limits of a 16GB VRAM GPU, so fine-tuning it for free on Google Colab isn't possible for now. However, you can fine-tune the model for free using Kaggle, which offers access to dual GPUs.
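    For local runs, a minimal QLoRA setup with Unsloth looks roughly like the sketch below; the exact model name, sequence length and LoRA hyperparameters here are assumptions, so refer to the notebooks for the full, tested configuration.

```python
# Minimal sketch of loading Magistral for QLoRA fine-tuning with Unsloth.
# Model name, max_seq_length and LoRA hyperparameters are illustrative.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Magistral-Small-2509-unsloth-bnb-4bit",
    max_seq_length = 4096,   # raise for long reasoning traces if VRAM allows
    load_in_4bit = True,     # QLoRA: 4-bit base weights
)

model = FastLanguageModel.get_peft_model(
    model,
    r = 16,                  # LoRA rank
    lora_alpha = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing = "unsloth",  # lowers VRAM for long contexts
)
# From here, pass `model` and `tokenizer` to a TRL SFTTrainer as usual.
```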

    To finetune on new reasoning traces, you can use our free finetuning notebook.

    If you have an old version of Unsloth and/or are fine-tuning locally, install the latest version of Unsloth:

    💠Dynamic Float8 Checkpoints

    We also provide 2 popular formats for float8 checkpoints, which also utilize some of our dynamic methodology to retain maximum accuracy: vLLM's Float8 format and TorchAO's Float8 format.

    Both are fantastic to deploy via vLLM. Read up on using TorchAO based FP8 quants in vLLM here.

    Ollama also sets the default context length to 4096, as mentioned here. Use OLLAMA_CONTEXT_LENGTH=8192 to change it to 8192. Magistral supports up to 128K, but 40K (40960) is tested most.

    First draft your thinking process (inner monologue) until you arrive at a response. Format your response using Markdown, and use LaTeX for any mathematical equations. Write both your thoughts and the response in the same language as the input.
    
    Your thinking process must follow the template below:[THINK]Your thoughts or/and draft, like working through an exercise on scratch paper. Be as casual and as long as you want until you are confident to generate the response. Use the same language as the input.[/THINK]Here, provide a self-contained response.
    A user will ask you to solve a task. You should first draft your thinking process (inner monologue) until you have derived the final answer. Afterwards, write a self-contained summary of your thoughts (i.e. your summary should be succinct but contain all the critical steps you needed to reach the conclusion). You should use Markdown to format your response. Write both your thoughts and summary in the same language as the task posed by the user. NEVER use \boxed{} in your response.
    
    Your thinking process must follow the template below:
    <think>
    Your thoughts or/and draft, like working through an exercise on scratch paper. Be as casual and as long as you want until you are confident to generate a correct answer.
    </think>
    
    Here, provide a concise summary that reflects your reasoning and presents a clear final answer to the user. Don't mention that this is a summary.
    
    Problem:
    prompt_1 = 'How many "r" are in strawberry?'
    
    prompt_2 = 'John is one of 4 children. The first sister is 4 years old. Next year, the second sister will be twice as old as the first sister. The third sister is two years older than the second sister. The third sister is half the ago of her older brother. How old is John?'
    
    prompt_3 = '9.11 and 9.8, which is greater?'
    prompt_4 = "Think about 5 random numbers. Verify if you can combine them with addition, multiplication, subtraction or division to 133"
    
    prompt_5 = "Write 4 sentences, each with at least 8 words. Now make absolutely sure that every sentence has exactly one word less than the previous sentence."
    
    prompt_6 = "If it takes 30 minutes to dry 12 T-shirts in the sun, how long does it take to dry 33 T-shirts?"
    prompt_7 = "Pick 5 random words each with at least 10 letters. Print them out. Reverse each word and print it out. Then extract letters that are alphabetically sorted smaller than "g" and print them. Do not use code."
    
    prompt_8 = "Exactly how many days ago did the French Revolution start? Today is June 4th, 2025."
    apt-get update
    apt-get install pciutils -y
    curl -fsSL https://ollama.com/install.sh | sh
    export OLLAMA_KV_CACHE_TYPE="f16"
    OLLAMA_CONTEXT_LENGTH=8192 ollama serve &
    ollama run hf.co/unsloth/Magistral-Small-2509-GGUF:UD-Q4_K_XL
    apt-get update
    apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
    git clone https://github.com/ggml-org/llama.cpp
    cmake llama.cpp -B llama.cpp/build \
        -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
    cmake --build llama.cpp/build --config Release -j --clean-first --target llama-quantize llama-cli llama-gguf-split llama-mtmd-cli
    cp llama.cpp/build/bin/llama-* llama.cpp
    ./llama.cpp/llama-cli -hf unsloth/Magistral-Small-2509-GGUF:UD-Q4_K_XL --jinja --temp 0.7 --top-k -1 --top-p 0.95 -ngl 99
    # !pip install huggingface_hub hf_transfer
    import os
    os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
    from huggingface_hub import snapshot_download
    snapshot_download(
        repo_id = "unsloth/Magistral-Small-2509-GGUF",
        local_dir = "unsloth/Magistral-Small-2509-GGUF",
        allow_patterns = ["*UD-Q4_K_XL*"], # For UD-Q4_K_XL
    )
    ./llama.cpp/llama-cli \
        --model unsloth/Magistral-Small-2509-GGUF/Magistral-Small-2509-UD-Q4_K_XL.gguf \
        --threads -1 \
        --ctx-size 40960 \
        --cache-type-k f16 \
        --n-gpu-layers 99 \
        --seed 3407 \
        --prio 2 \
        --temp 0.7 \
        --repeat-penalty 1.0 \
        --min-p 0.01 \
        --top-k -1 \
        --top-p 0.95 \
        --jinja
    <think>
    Alright, I have a problem to solve: "How many 'r's are in strawberry?"
    
    First, I need to understand the problem. It's asking for the count of the letter 'r' in the word "strawberry".
    
    Step 1: Write down the word and look at each letter.
    The word is: s t r a w b e r r y
    
    Step 2: Identify the 'r's in the word.
    Let's go through each letter one by one:
    1. s - not an r
    2. t - not an r
    3. r - that's one r
    4. a - not an r
    5. w - not an r
    6. b - not an r
    7. e - not an r
    8. r - that's another r
    9. r - and another one
    10. y - not an r
    
    So, I see the letter 'r' appears at positions 3, 8, and 9.
    
    Step 3: Count the number of 'r's.
    From the above, there are three 'r's in "strawberry".
    
    But wait, I should double-check to make sure I didn't miss anything or count incorrectly.
    
    Let me write the word again: strawberry.
    
    Looking at it:
    - The third letter is 'r' (s-t-r)
    - Then later, we have 'e-r-r' which is two more 'r's.
    
    Yes, that's three 'r's in total.
    
    But just to be thorough, sometimes letters can be hidden or the word might have silent letters, but in this case, all letters are pronounced in "strawberry", and 'r' is clearly pronounced in the third, eighth, and ninth positions.
    
    So, the count is indeed three.
    
    But wait, is it possible that the 'r's are not all pronounced? For example, in some accents or dialects, letters might be silent or pronounced differently. However, in standard pronunciation, all these 'r's are pronounced.
    
    But the question is about the number of 'r's in the word, not about pronunciation. It's about the written form. So, regardless of pronunciation, we're counting the letters in the word "strawberry".
    
    Let me write it out again and count:
    
    s (1)
    t (2)
    r (3) - first r
    a (4)
    w (5)
    b (6)
    e (7)
    r (8) - second r
    r (9) - third r
    y (10)
    
    Yes, three 'r's.
    
    But just to be sure, maybe I should think about the spelling. Sometimes letters can be part of digraphs or other combinations, but in "strawberry", the 'r's are all standalone letters.
    
    Let me think about the spelling breakdown:
    - "straw" has one 'r' (the third letter)
    - "berry" has two 'r's (the second and third letters in "berry", but in the full word, it's the eighth and ninth letters)
    
    So, "straw" has 1 'r', "berry" has 2 'r's, totaling 3 'r's in "strawberry".
    
    I think I'm confident now. The word "strawberry" has three 'r's.
    
    But before finalizing, I recall that sometimes letters can be repeated or there might be a trick. Let me see:
    
    The word is: s-t-r-a-w-b-e-r-r-y.
    
    Counting the 'r's:
    1. The 'r' in the third position.
    2. The 'r' in the eighth position.
    3. The 'r' in the ninth position.
    
    No other 'r's are present.
    
    Therefore, the total number of 'r's in "strawberry" is three.
    
    But just to cross-verify, perhaps I can write the word and highlight the 'r's:
    
    s t r a w b e r r y
       r     r r
    
    Yes, three 'r's.
    
    Is there any chance I'm missing something? Maybe considering uppercase and lowercase? But the problem is case-insensitive since it's about the letter 'r', not 'R'. The word is given in lowercase, but even if it were uppercase, 'R' is the same letter as 'r' in terms of counting.
    
    So, final count is three.
    
    But wait, is "strawberry" spelled correctly? Let me check the spelling.
    
    I think it's s-t-r-a-w-b-e-r-r-y. Yes, that seems correct.
    
    Therefore, the number of 'r's is indeed three.
    
    But just to be extra sure, maybe I can think of the word in a different way. For example, breaking it down:
    
    "straw" + "berry" = "strawberry"
    
    "straw" has one 'r' (the third letter: s-t-r)
    "berry" has two 'r's (the second and third letters: b-e-r-r-y, but actually, 'berry' is b-e-r-r-y, so 'r' is the third and fourth letters? Wait, let's see:
    
    'berry' is b-e-r-r-y. Wait, no:
    
    b (1)
    e (2)
    r (3)
    r (4)
    y (5)
    
    So, 'berry' has 'r's at positions 3 and 4, which are two 'r's.
    
    'straw' is s-t-r-a-w, so 'r' is at position 3, one 'r'.
    
    Combined, 'strawberry' is 'straw' + 'berry' = s-t-r-a-w-b-e-r-r-y, which has 'r's at positions 3, 3+5+1=9? Wait, no, let's index:
    
    'straw' is positions 1-5 (s-t-r-a-w)
    'berry' is positions 6-10 (b-e-r-r-y)
    
    So 'r's in 'straw' are at position 3.
    'r's in 'berry' are at positions 8 and 9 (since 'berry' starts at position 6: b(6), e(7), r(8), r(9), y(10)).
    
    Thus, total 'r's are at positions 3, 8, and 9: three 'r's.
    
    Yes, consistent again.
    
    Is there any other way to approach this? Maybe counting the total letters and seeing how many are 'r's.
    
    "strawberry" has 10 letters: s, t, r, a, w, b, e, r, r, y.
    
    Counting 'r's: positions 3, 8, 9 → 3 'r's.
    
    Alternatively, I could write out the word and replace non-'r's with another character to see how many 'r's are left.
    
    Original: s t r a w b e r r y
    Replace non-r: x x r x x x x r r x
    Now count 'r's: positions 2 (but original was 3rd letter), wait no, in the replaced string, 'r's are at positions 3 (original position 3), and then after replacing others, the next 'r's are at positions... Wait, let's see:
    
    Original positions:
    1: s → x
    2: t → x
    3: r → r
    4: a → x
    5: w → x
    6: b → x
    7: e → x
    8: r → r
    9: r → r
    10: y → x
    
    So replaced string: x x r x x x x r r x
    Now count 'r's: positions 3, 8, 9 in original correspond to positions 3, 8, 9 in replaced (but replaced has x's in other positions, so 'r's are still at 3,8,9). Wait, in replaced string, 'r's are at positions 3,8,9, but the length is still 10, so positions are same as original. So count is same: 3 'r's.
    
    Another method: write the word and cross out non-'r's.
    
    Original: s t r a w b e r r y
    Cross out non-r: x x r x x x x r r x
    Now count 'r's: there are three 'r's remaining.
    
    All methods confirm that there are three 'r's in "strawberry".
    
    But just to think outside the box, is there any ambiguity in the question? For example, is it asking for uppercase 'R' or lowercase 'r'? The problem says "r", lowercase, so we're counting lowercase 'r's. In the word "strawberry", all letters are lowercase (assuming the input is in lowercase; if it were mixed case, we'd have to consider, but since it's given as "strawberry", which is all lowercase, we're good).
    
    Also, is it possible that some 'r's are part of a digraph or trigraph? For example, in some languages, 'rr' might be considered a single sound or letter, but in English orthography, 'rr' is two separate 'r's. So in "strawberry", the 'rr' is two 'r's.
    
    Therefore, the count remains three.
    
    Is there any other interpretation? Maybe considering that sometimes letters can represent different sounds or that 'r' might be silent, but as mentioned earlier, the question is about the written letter 'r', not pronunciation.
    
    Final confirmation: let's write the word and count 'r's aloud:
    
    s (no), t (no), r (1), a (no), w (no), b (no), e (no), r (2), r (3), y (no).
    
    Yes, three 'r's.
    
    Therefore, the number of 'r's in "strawberry" is three.
    
    But before finalizing, let me think if there's any trick or if I'm missing something obvious. For example, sometimes questions like this have hidden complexities, like counting 'r's in the word's spelling or pronunciation, or considering that 'r' might appear in other forms, like in 'rr' which is two 'r's together, but we've already accounted for that.
    
    Alternatively, is the question about the number of times the letter 'r' appears in the spelling of "strawberry"? Yes, that's how I interpreted it, and my count is three.
    
    Is there any chance the word is spelled differently? For example, in some countries or dialects, is "strawberry" spelled differently? In American English and British English, it's spelled the same way: s-t-r-a-w-b-e-r-r-y.
    
    Let me check the spelling again: s, t, r, a, w, b, e, r, r, y. Yes, that's correct.
    
    Therefore, the number of 'r's is indeed three.
    
    But just to be absolutely sure, perhaps I can think of the word in a sentence or see it written out:
    
    Example sentence: "I ate a strawberry." The word is "strawberry", and in it, the 'r's are in the middle and at the end.
    
    Looking at the spelling: s-t-r-a-w-b-e-r-r-y. Counting:
    
    1. s
    2. t
    3. r (1)
    4. a
    5. w
    6. b
    7. e
    8. r (2)
    9. r (3)
    10. y
    
    Yes, three 'r's.
    
    I think I've exhausted all possible ways to verify this count. Therefore, the number of 'r's in "strawberry" is three.
    
    But wait, the user might have intended the question differently. Could it be about the number of times the sound 'r' appears in pronunciation? For example, sometimes letters are silent or combined in pronunciation, but the question asks for the count of the letter 'r', not the sound.
    
    In pronunciation, "strawberry" is /ˈstroʊbɛri/ or similar, where the 'r's are pronounced as in the spelling. But since the question is about the letter count, not phonemes, we stick with the written form.
    
    Therefore, the count remains three.
    
    Is there any other interpretation? For example, counting uppercase 'R's if the word were capitalized? But the word is given in lowercase, and the question asks for 'r', lowercase, so it's consistent.
    
    Final answer: there are three 'r's in "strawberry".
    
    But just to double-check, perhaps I can list the letters with their positions:
    
    Position: 1 2 3 4 5 6 7 8 9 10
    Letter:   s t r a w b e r r y
    
    Now, pick out the 'r's: positions 3, 8, 9. That's three 'r's.
    
    Alternatively, write the word and circle the 'r's:
    
    Strawberry: s t r a w b e r r y
    The 'r's are in positions 3, 8, and 9. So, three 'r's.
    
    No other interpretations seem plausible. Therefore, the answer is three.
    
    But before finalizing, let me think if there's any ambiguity in the word itself. Could "strawberry" be misspelled or is there another spelling variant?
    
    In English, "strawberry" is consistently spelled with two 'r's at the end: "berry" is spelled with two 'r's (b-e-r-r-y), and "straw" has one 'r' (s-t-r-a-w). So total is three 'r's.
    
    Is there any other word that might be confused with "strawberry"? For example, "strawberi" or similar, but no, "strawberry" is the correct spelling with three 'r's.
    
    Therefore, the final count is three.
    
    But just to be thorough, perhaps I can think of the word in another language. For example, in Spanish, "strawberry" is "fresas" or "fresa", which don't have 'r's, but the question is about the English word "strawberry".
    
    In English, as confirmed, there are three 'r's in "strawberry".
    
    Is there any other angle? For example, if the word were in uppercase, like "STRAWBERRY", the count would still be the same since 'R' is the uppercase counterpart of 'r'. But the problem specifies 'r', lowercase, and the word is given in lowercase, so it's consistent.
    
    Therefore, the number of 'r's in "strawberry" is three.
    
    But just to think differently, perhaps the word is being considered in a different context, like in a code or cipher where letters might represent something else. But the problem is straightforward: count the number of times the letter 'r' appears in the word "strawberry".
    
    So, after all this careful consideration, the answer is clearly three.
    
    But before finalizing, let me think if I've made any mistake in counting. Let me recount:
    
    Word: s t r a w b e r r y
    
    Positions:
    1: s
    2: t
    3: r (1)
    4: a
    5: w
    6: b
    7: e
    8: r (2)
    9: r (3)
    10: y
    
    Yes, three 'r's at positions 3, 8, and 9.
    
    No mistake in counting.
    
    Therefore, the final answer is that there are three 'r's in the word "strawberry".
    
    But wait, the user might expect a different answer if they're considering something else, but based on the straightforward interpretation, it's three.
    
    Perhaps the user expects us to count only the distinct 'r's or something, but all 'r's are identical letters in this context.
    
    Alternatively, maybe the question is a trick question implying that 'r's are pronounced differently or not at all, but again, the question is about the count of the letter 'r' in the spelling, not pronunciation.
    
    Thus, the answer is three.
    
    But just to be sure, let me think of the word "strawberry" and say it aloud: "strawberry". The 'r's are clearly pronounced as in the spelling.
    
    No hidden complexities here. The count is three.
    
    Final verification: write the word and count 'r's:
    
    s t r a w b e r r y
    
    'r' appears in the 3rd, 8th, and 9th positions. That's three times.
    
    Therefore, the number of 'r's in "strawberry" is three.
    
    But before concluding, let me think if there's any other interpretation where the answer could be different. For example, if the word were written in a different font or if there were any hidden letters, but in standard writing, it's clear.
    
    Alternatively, if the word were in a different language with different spelling, but "strawberry" is the English word, and in English, it's spelled with three 'r's.
    
    Another thought: is it possible that the 'w' or another letter is sometimes considered an 'r'? No, in English alphabet, 'w' is a separate letter from 'r'.
    
    No other interpretations seem valid. The answer is three.
    
    But just to be extra cautious, perhaps the word is misspelled in the question, but assuming it's correctly spelled as "strawberry", the count is three.
    
    Therefore, after thorough consideration, the number of 'r's in "strawberry" is three.
    
    But wait, looking back at the word: strawberry. Let me write it and count:
    
    s, t, r, a, w, b, e, r, r, y.
    
    Let me list them:
    
    1. s
    2. t
    3. r
    4. a
    5. w
    6. b
    7. e
    8. r
    9. r
    10. y
    
    So, 'r's are at positions 3, 8, and 9. That's three 'r's.
    
    Is there any chance I'm miscounting positions? Let me recount:
    
    1: s
    2: t
    3: r (1st r)
    4: a
    5: w
    6: b
    7: e
    8: r (2nd r)
    9: r (3rd r)
    10: y
    
    Yes, three 'r's.
    
    Therefore, the final answer is that there are three 'r's in the word "strawberry".
    
    But just to ensure completeness, perhaps the word is being considered in uppercase or another case, but since the question is about 'r' (lowercase) and the word is given in lowercase, it's consistent.
    
    Alternatively, if the word were "Strawberry" (with a capital S), does that affect the count? The question asks for 'r' (lowercase), and in "Strawberry", the 'S' is uppercase, but the 'r's are lowercase (assuming standard title case where only the first letter is capitalized). But since the question specifies 'r' (lowercase), and the 'r's in "Strawberry" are lowercase (unless it's all caps or something), but the given word is "strawberry" all lowercase, so it's three 'r's.
    
    But even if it were "Strawberry", the 'r's are still lowercase (assuming standard capitalization), so count remains the same.
    
    But since the given word is "strawberry" in all lowercase, the count is three.
    
    Therefore, the number of 'r's in "strawberry" is three.
    
    But just to think of edge cases, what if the word had a silent 'r'? But as discussed earlier, the question is about the written letter, not pronunciation.
    
    Alternatively, if the word were in a different font where 'r' looks like something else, but in standard fonts, it's clear.
    
    No other possibilities seem to alter the count.
    
    Thus, the final answer is three.
    
    But before finalizing, let me think if the word could be written differently, like with a different spelling variant. For example, in some dialects or historical contexts, words might be spelled differently, but modern English spells "strawberry" with three 'r's (one in 'straw' and two in 'berry').
    
    Therefore, the count is three.
    
    After all this careful consideration and multiple verification methods, I am confident that the number of 'r's in "strawberry" is three.
    
    But to ensure that I haven't missed anything, perhaps I can think of the word's origin or related words. For example, "straw" has one 'r', "berry" has two 'r's, combining gives three 'r's in "strawberry". This also confirms the count.
    
    Therefore, the answer is three.
    
    However, sometimes such questions are trick questions expecting a different answer. For example, if the question was about the number of times the letter 'r' appears in the word "strawberry" when spelled out, but that seems convoluted and unlikely.
    
    Alternatively, counting the number of closed areas in the letter 'r' when written in a certain font, but that's not relevant here.
    
    Another possibility is that the question expects us to count the number of times the sound 'r' appears in pronunciation, but again, the question asks for the count of the letter 'r', not sounds.
    
    In pronunciation, "strawberry" has the 'r' sound in "straw" and in "berry", but since 'berry' has two 'r's in spelling, both are pronounced in standard English, so the count would still relate to the spelling, which is three 'r's.
    
    But since the question is about the letter count in writing, not pronunciation, the answer remains three.
    
    Therefore, after all this thorough checking, the number of 'r's in "strawberry" is three.
    
    But just to be sure, perhaps I can ask myself: if I were to type the word "strawberry" and count how many times I press the 'r' key, it would be three times (assuming I don't make any typos while typing).
    
    Thus, confirming again: three 'r's.
    
    Therefore, the final answer is:
    
    There are 3 'r's in the word "strawberry".
    
    But to present it neatly, perhaps as a count:
    
    The word "strawberry" spelled out: s-t-r-a-w-b-e-r-r-y.
    
    Counting 'r's:
    1. The third letter: r
    2. The eighth letter: r
    3. The ninth letter: r
    
    Total: 3 'r's.
    
    Therefore, the number of 'r's in "strawberry" is 3.
    
    But to ensure completeness, let me think of the word's length: "strawberry" has 10 letters. The positions are:
    
    1: s
    2: t
    3: r (1st r)
    4: a
    5: w
    6: b
    7: e
    8: r (2nd r)
    9: r (3rd r)
    10: y
    
    Thus, three 'r's at positions 3, 8, and 9.
    
    No other 'r's present.
    
    Therefore, the final count is three.
    
    But to present it in a structured way:
    
    Word: s t r a w b e r r y
    Positions of 'r': 3, 8, 9
    Number of 'r's: 3
    
    Thus, the answer is 3.
    
    But just to think if there's any other interpretation where the answer could be different, perhaps considering that in some fonts, 'r' might look like another letter, but in standard fonts, it's distinct.
    
    Alternatively, if the word were in a different language where 'r' has a different representation, but since it's given in English, we proceed with English alphabet rules.
    
    Therefore, the count is three.
    
    After all this exhaustive checking, I conclude that the number of 'r's in "strawberry" is three.
    
    But to present it neatly in the summary, here's the concise version:
    
    The word "strawberry" has the following letters: s, t, r, a, w, b, e, r, r, y. Counting the 'r's, we find them at the 3rd, 8th, and 9th positions, totaling three 'r's.
    
    Therefore, there are **3 'r's** in the word "strawberry".
    
    But wait, looking back at the word: strawberry. Let me write it and count:
    
    s (1), t (2), r (3), a (4), w (5), b (6), e (7), r (8), r (9), y (10).
    
    So 'r's at positions 3, 8, 9: that's three 'r's.
    
    Is there any possibility that one of these 'r's is not counted? For example, if one of them were part of a digraph or silent, but in standard English spelling and pronunciation, all these 'r's are present and pronounced.
    
    Therefore, the count is indeed three.
    
    But just to think differently, perhaps the question expects us to count only the distinct positions where 'r' appears, but since all 'r's are the same letter, the count is still three.
    
    Alternatively, if the question was about the number of times the letter 'r' is written when writing the word "strawberry", that would also be three times.
    
    Thus, the answer is three.
    
    But to present it clearly:
    
    We analyze the word "strawberry" by listing its letters and identifying the positions of 'r':
    
    Letters: s, t, r, a, w, b, e, r, r, y
    Positions of 'r': 3, 8, 9
    Count of 'r's: 3
    
    Therefore, there are three 'r's in "strawberry".
    
    But to ensure that this is the expected answer, perhaps the user expects us to consider that 'rr' is a double 'r', but even then, it's two separate 'r's next to each other, totaling three 'r's in the word (one from 'straw' and two from 'berry').
    
    Thus, no alternative interpretation changes the count.
    
    Therefore, the final answer is that there are three 'r's in "strawberry".
    
    But to summarize neatly, here's the concise version:
    
    The word "strawberry" is spelled with the following letters in order: s, t, r, a, w, b, e, r, r, y. The letter 'r' appears at the 3rd, 8th, and 9th positions, giving a total count of three 'r's.
    
    Thus, the number of 'r's in "strawberry" is:
    
    \boxed{3}
    </think>
    
    **Summary:**
    The word "strawberry" is analyzed by listing its letters and identifying the positions of the letter 'r'. The 'r's are found at the 3rd, 8th, and 9th positions, resulting in a total count of three 'r's.
    
    Therefore, there are **3 'r's** in the word "strawberry".
    
    \[
    \boxed{3}
    \]
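As a quick aside (not part of the model output above), the count can also be checked programmatically; a minimal Python sketch using only the standard library:

```python
# Sanity check of the answer above: count the letter 'r' in "strawberry".
word = "strawberry"

print(word.count("r"))  # 3
# 1-based positions of each 'r', matching the positions 3, 8, 9 found above
print([i + 1 for i, ch in enumerate(word) if ch == "r"])  # [3, 8, 9]
```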
    <think>
    Alright, let's tackle this problem step by step. First, I need to know the exact start date of the French Revolution.
    
    From my knowledge, the French Revolution is generally considered to have started on July 14, 1789, which is the date of the Storming of the Bastille. However, some historians argue that the revolution began earlier with events like the Estates-General meeting on May 5, 1789, or even earlier with financial and political crises. But for simplicity, I'll use July 14, 1789, as the start date since it's widely recognized as the symbolic beginning of the revolution.
    
    Now, the problem gives today's date as June 4, 2025. I need to calculate the number of days between July 14, 1789, and June 4, 2025.
    
    ### Step 1: Calculate the total number of years between the two dates.
    2025 - 1789 = 236 years.
    
    But we need to be careful here because not all years have the same number of days (leap years have an extra day), and we're not counting full years from July 14 to July 14.
    
    ### Step 2: Break down the calculation into years, months, and days.
    First, let's calculate the number of full years between July 14, 1789, and July 14, 2025. That's 2025 - 1789 = 236 years.
    
    But since we're starting from July 14, 1789, and ending on June 4, 2025, which is before July 14, 2025, we have 235 full years (from July 14, 1789, to July 14, 2024) plus the days from July 14, 2024, to June 4, 2025. Wait, that doesn't seem right.
    
    Actually, it's better to think of it as:
    From July 14, 1789, to July 14, 2024, is 235 years (because 2024 - 1789 = 235).
    Then from July 14, 2024, to June 4, 2025, is the remaining time.
    
    But let's double-check:
    From July 14, 1789, to July 14, 2024, is 235 years.
    From July 14, 2024, to June 4, 2025, is less than a year, so we can calculate the days in this period separately.
    
    But actually, to make it easier, we can calculate the total days from July 14, 1789, to June 4, 2025, by calculating the days from July 14, 1789, to July 14, 2024 (235 years) plus the days from July 14, 2024, to June 4, 2025.
    
    But wait, July 14, 2024, to June 4, 2025, is not correct because June 4, 2025, is before July 14, 2025. So the correct breakdown is:
    From July 14, 1789, to July 14, 2024: 235 years.
    From July 14, 2024, to June 4, 2025: but this is negative because June 4 is before July 14. So it's better to think of it as:
    From July 14, 1789, to June 4, 2024: 234 full years plus the days from July 14 to June 4 in the next year (but this seems confusing).
    
    Perhaps a better approach is to calculate the number of days from July 14, 1789, to June 4, 2025.
    
    First, let's calculate the number of days from July 14, 1789, to July 14, 2024. That's 235 years.
    
    Now, how many days are there in 235 years? We need to account for leap years.
    
    A year has 365 days, and a leap year has 366 days. Leap years occur every 4 years, but century years are not leap years unless they're divisible by 400.
    
    First, calculate the number of leap years between 1789 and 2024 (not including 2024 if it's a leap year, but since we're going up to July 14, 2024, which is before February 29, the leap day in 2024 wouldn't affect the count if we're counting full years, but actually, since we're counting from July 14, 1789, to July 14, 2024, and 2024 is a leap year, but since we're not crossing into the next year's February, the leap day in 2024 doesn't affect this period because the extra day is in February, and we're starting after February (July is after February, so the leap day is already accounted for in the year's total days).
    
    Wait, no. The leap day is February 29. If we're counting from July 14, 1789, to July 14, 2024, then the leap day in each leap year between 1789 and 2024 would have been included in the previous year's count, but since we're starting after February, the leap day in the starting year (1789) doesn't affect us (since we're starting after February 29). Similarly, for the ending year (2024), since we're ending on July 14, which is after February 29, the leap day in 2024 is included in the count for that year (since we're including the entire year up to July 14, but actually, we're counting full years from July 14 to July 14, so each full year contributes either 365 or 366 days depending on whether it's a leap year.
    
    But actually, since we're counting the number of days between two dates, and both dates are after February in their respective years, the leap day affects the count only if the period includes February 29. But since we're counting full years from July 14 to July 14, and July is after February, the leap day is included in the full year count.
    
    Wait, no. If we're counting from July 14, 1789, to July 14, 2024, that's 235 years. Each year in this period has 365 days, except for leap years which have 366 days.
    
    Number of leap years between 1789 and 2023 (since 2024's leap day is after July 14, but we're counting full years up to July 14, 2024, which is before February 29, 2025... wait no, 2024's leap day is February 29, 2024, and we're up to July 14, 2024, which is after February 29, so the leap day in 2024 is included in our count if we're counting full years up to July 14, 2024.
    
    Wait, no, the full years are from July 14, 1789, to July 14, 2024. That's 2024 - 1789 = 235 years, but since we're starting at July 14, 1789, and ending at July 14, 2024, that's exactly 235 years.
    
    Now, how many leap years are there in this period? A year is a leap year if:
    1. It's divisible by 4,
    2. It's not divisible by 100, unless it's also divisible by 400.
    
    But since we're counting from July 14, the leap day (February 29) is included in the year's count if the year is a leap year, because we're including the entire year from July 14 to July 14 (which is equivalent to counting from July 14 to July 14 of the next year, but actually, no, from July 14, 1789, to July 14, 1790, is one year, which may include February 29 if 1790 is a leap year... wait, no, 1790 is not divisible by 4, so it's not a leap year.
    
    Wait, the period from July 14, 1789, to July 14, 1790, is one year, and it includes February 29, 1790? No, 1790 is not a leap year (1790 is not divisible by 4). The leap day is February 29 in a leap year, but since our period starts after February in 1789, and ends before February in 1790... wait no, our period is from July 14, 1789, to July 14, 1790, which includes February 29, 1790? Wait, no, 1789 to 1790 is not a leap year, because 1789 to 1790 is one year, and the leap day would be in February 1790 if 1790 were a leap year, but it's not (1790 is not divisible by 4).
    
    Wait, perhaps it's easier to think that for each full year from July 14 to July 14, the number of days is 365, plus 1 if the year is a leap year and the period includes February 29. But since our period starts after February in the starting year and ends after February in the ending year, the leap day is included in the count for leap years.
    
    So, the number of leap years between 1789 and 2024 inclusive (since 2024 is a leap year, and we're counting up to July 14, 2024, which is after February 29, 2024, so the leap day is included).
    
    Number of years: 2024 - 1789 + 1 = 236 years. Wait, no, from July 14, 1789, to July 14, 2024, is 2024 - 1789 = 235 years (because at July 14, 1789, it's the start, and at July 14, 2024, it's after 235 years).
    
    Number of leap years in this period: The first year is 1789 (not a leap year, since 1789 is not divisible by 4). The last year is 2024 (which is a leap year, divisible by 4 and not by 100 unless divisible by 400, but 2024 is divisible by 4 and not by 100).
    
    The number of leap years between 1789 and 2024 inclusive is the number of years divisible by 4 in this range, minus those divisible by 100 but not by 400.
    
    First, number of years divisible by 4 between 1789 and 2024 inclusive:
    The first leap year after 1789 is 1792 (since 1789 + 3 = 1792, which is divisible by 4).
    The last leap year before 2024 is 2024 itself (which is divisible by 4).
    Number of leap years = (2024 - 1792) / 4 + 1 = (2024 - 1792) = 232, 232 / 4 = 58, +1 = 59.
    
    But wait, 2024 is included, so it's correct: (2024 - 1792) / 4 + 1 = (232)/4 + 1 = 58 + 1 = 59.
    
    Now, subtract the century years that are not leap years (i.e., divisible by 100 but not by 400). Between 1789 and 2024, the century years are 1800, 1900, and 2000.
    
    Now, which of these are not leap years?
    - 1800: divisible by 100 but not by 400 (since 1800 / 400 = 4.5, not integer) → not a leap year.
    - 1900: same, divisible by 100 but not by 400 → not a leap year.
    - 2000: divisible by 400 (2000 / 400 = 5) → is a leap year.
    
    So, we have two century years that are not leap years: 1800 and 1900.
    
    Thus, total leap years = 59 - 2 = 57.
    
    So, total days for the full years = 235 * 365 + number of leap years = 235 * 365 + 57.
    
    Let's calculate that:
    235 * 365:
    200 * 365 = 73000
    35 * 365 = 35 * 300 + 35 * 65 = 10500 + 2275 = 12775
    Total: 73000 + 12775 = 85775
    Plus leap days: 85775 + 57 = 85832 days for the full years.
    
    Now, we need to add the days from July 14, 2024, to June 4, 2025.
    
    First, calculate the days remaining in 2024 after July 14:
    July has 31 days, so from July 14 to July 31 is 31 - 14 = 17 days (including July 14? Wait, no, from July 14 to July 15 is 1 day, so July 14 to July 31 is 31 - 14 = 17 days, but we need to include July 14? Actually, the number of days from July 14 to July 31 inclusive is 31 - 14 + 1 = 18 days. But usually, when we say "from A to B", we count the days starting from A and including B, so from July 14 to July 14 is 1 day, from July 14 to July 15 is 2 days, etc. So from July 14 to July 31 is 31 - 14 + 1 = 18 days.
    
    But in calculating intervals, sometimes it's exclusive. The standard way is that the number of days between two dates is (end - start) + 1 if including both, or (end - start) if including start but not end. Here, we want the number of days from July 14, 2024, up to but not including July 14, 2025, but our end date is June 4, 2025, which is before July 14, 2025.
    
    Wait, no, we're calculating the days from July 14, 2024, to June 4, 2025.
    
    Let's break it down:
    1. From July 14, 2024, to December 31, 2024.
    2. From January 1, 2025, to June 4, 2025.
    
    First part: July 14 to December 31, 2024.
    Number of days remaining in July after the 14th: 31 - 14 = 17 (but actually, it's 31 - 14 = 17, but that's from July 15 to July 31, which is 17 days. So from July 14 to July 31 is 31 - 14 + 1 = 18 days (including July 14).
    Wait, no, if we're counting days from July 14 (inclusive) to July 31 (inclusive), that's 31 - 14 + 1 = 18 days.
    But in calculating the interval between two dates, it's often (end - start) + 1 if both are included. Here, we want the number of days from July 14 (inclusive) to June 4, 2025 (inclusive), which is (June 4, 2025 - July 14, 2024) + 1? Wait, no, better to calculate it directly.
    
    Let's calculate the days remaining in 2024 after July 14:
    July: 31 - 14 = 17 (but this is from July 15 to July 31, which is 17 days. So from July 14 to July 31 is 18 days (including July 14).
    But actually, if we're counting days starting from July 14, then:
    - July: 31 - 14 = 17 (from July 15 to July 31), plus July 14 itself is 18 days in July.
    But that can't be right because July has only 31 days. Wait, if we include July 14, then the days are July 14 to July 31, which is 31 - 14 + 1 = 18 days.
    
    Similarly, for August: 31 days
    September: 30
    October: 31
    November: 30
    December: 31
    Total for August to December: 31 + 30 + 31 + 30 + 31 = 153 days
    Plus July: 18
    Total for July 14 to Dec 31, 2024: 18 + 153 = 171 days
    
    Wait, let's verify:
    From July 14 to July 31: 31 - 14 + 1 = 18
    August: 31
    September: 30
    October: 31
    November: 30
    December: 31
    Total: 18 + 31 + 30 + 31 + 30 + 31 = 18 + 31 = 49; +30=79; +31=110; +30=140; +31=171. Yes, correct.
    
    Now, from January 1, 2025, to June 4, 2025.
    Is 2025 a leap year? 2025 is not divisible by 4, so no.
    Months:
    January: 31
    February: 28 (not leap year)
    March: 31
    April: 30
    May: 31
    June: up to 4th is 4
    Total: 31 + 28 = 59; +31=90; +30=120; +31=151; +4=155 days.
    
    So total days from July 14, 2024, to June 4, 2025: 171 (remaining in 2024) + 155 (in 2025) = 326 days.
    
    Wait, but we have to be careful with the counting. Are we including both July 14, 2024, and June 4, 2025? If we're calculating the number of days between two dates including both endpoints, then the calculation is correct. But in calculating the difference between two dates, it's often exclusive of the start date. The problem asks "how many days ago", which typically means counting from today backwards, not including today. But here, we're calculating the days from the start of the revolution to today, so we should include both the start date and today in the count.
    
    But in the initial problem, it's asking for how many days ago the French Revolution started, which is the number of days from June 4, 2025, back to July 14, 1789. This would be (June 4, 2025 - July 14, 1789) minus 1 if we're counting days ago (since today is not ago from today). But usually, the number of days between two dates including both is (end - start) + 1. But "days ago" would be the duration from the start to today, which is today's date minus start date, not including the start date if we're counting days after. Wait, no, if the revolution started on day X, then the number of days ago it started is (today - X), not including X itself in the count.
    
    But in our calculation, we have:
    From July 14, 1789, to June 4, 2025.
    
    If we include both endpoints, it's (June 4, 2025 - July 14, 1789) + 1.
    But "days ago" would be (today - start date), which is today's date minus start date minus 1 if not including the start date. But actually, if the revolution started on July 14, 1789, and today is June 4, 2025, then the number of days ago it started is the number of days between July 14, 1789, and June 4, 2025, minus one if we're not counting the start day. But usually, "how many days ago" refers to the time elapsed, which is (end - start), where end is today, start is the event date. So it's (June 4, 2025 - July 14, 1789), which would be the number of days between the two dates, not including the start day if we're counting the days after.
    
    But in our calculation, we did full years (July 14, 1789 to July 14, 2024) which is 235 years, and then from July 14, 2024 to June 4, 2025.
    
    But actually, to find the number of days ago, we should calculate the difference between June 4, 2025, and July 14, 1789, which is (June 4, 2025 - July 14, 1789).
    
    But in our initial approach, we calculated the days from July 14, 1789, to June 4, 2025, which is the same as (June 4, 2025 - July 14, 1789), and then we can add 1 if we want to include both endpoints. But for "days ago", we don't include the start day, so it's just (June 4, 2025 - July 14, 1789).
    
    So our calculation is correct as is: 235 years' worth of days plus the remaining days.
    
    But let's think differently: calculate the total days from July 14, 1789, to June 4, 2025.
    
    Alternative approach:
    Calculate the number of full years and the remaining days.
    
    From July 14, 1789, to July 14, 2024: 2024 - 1789 = 235 years.
    Number of leap years in this period: as before, 57.
    
    Total days for full years: 235 * 365 + 57 = 85775 + 57 = 85832.
    
    Now, days from July 14, 2024, to June 4, 2025.
    
    As calculated earlier: 326 days.
    
    Total days: 85832 (full years) + 326 (remaining) = 86158 days.
    
    But wait, does this count include both July 14, 1789, and June 4, 2025? If so, then to get the number of days between them (excluding the start day), we'd subtract 1. But in our calculation, we've included both endpoints (since we included July 14, 2024, in the remaining days calculation by doing July 14 to July 31 as 18 days, which includes July 14).
    
    But actually, in the remaining days calculation:
    From July 14, 2024, to June 4, 2025:
    We calculated July 14 to Dec 31, 2024: 171 days (including July 14)
    Jan 1 to June 4, 2025: 155 days (including Jan 1)
    Total: 171 + 155 = 326 days, which includes both July 14, 2024, and June 4, 2025.
    
    Similarly, the full years from July 14, 1789, to July 14, 2024, include July 14, 1789, and July 14, 2024 (but July 14, 2024, is already included in the remaining days, so we have double-counted July 14, 2024).
    
    Wait, no, the full years are from July 14, 1789 (inclusive) to July 14, 2024 (exclusive? Or inclusive?).
    
    Actually, the period from July 14, 1789, to July 14, 2024, includes July 14, 1789, and July 14, 2024, if we're counting inclusively. But in terms of years, it's 235 years from July 14, 1789, to July 14, 2024 (since at July 14, 2024, it's been exactly 235 years since July 14, 1789).
    
    But in our days calculation, the full years contribute 235 years' worth of days, where each year is from July 14 to July 14 of the next year. But actually, from July 14, 1789, to July 14, 1790, is one year, which has 365 or 366 days depending on whether it's a leap year. But since the year starts on July 14, the leap day (February 29) is included in that year if the year is a leap year.
    
    But our initial calculation of leap years assumed calendar years (January to December), but our period is July to July. So we need to recalculate the number of leap years in the period from July 14, 1789, to July 14, 2024.
    
    A year Y is a leap year if it's divisible by 4, but not by 100 unless also by 400. But since our year period starts in July, the leap day (February 29) is included in the year if Y is a leap year (because February 29 is before July in the same year).
    
    Wait, no: the period from July 14, Y to July 14, Y+1 includes February of Y+1, not Y. Wait, no:
    
    From July 14, 2023, to July 14, 2024: this includes February 29, 2024, if 2024 is a leap year (which it is). So the period from July 14, Y to July 14, Y+1 includes February of Y+1.
    
    Thus, for the period from July 14, 1789, to July 14, 2024, the leap years are those Y where Y+1 is a leap year (i.e., Y+1 is divisible by 4, etc.).
    
    Wait, no: the year from July 14, Y to July 14, Y+1 is a leap period if Y+1 is a leap year (i.e., if February 29 falls within this period, which it does if Y+1 is a leap year).
    
    But actually, the period from July 14, Y to July 14, Y+1 includes February of Y+1, so the extra day is present if Y+1 is a leap year.
    
    Thus, the number of leap periods in 235 years is equal to the number of leap years in Y+1 for Y from 1789 to 2023 (since the last period is from July 14, 2023, to July 14, 2024, and we're checking if 2024 is a leap year).
    
    But Y ranges from 1789 to 2023 (since the periods are 1789-1790, ..., 2023-2024), so Y+1 ranges from 1790 to 2024.
    
    Number of leap years between 1790 and 2024 inclusive.
    
    Number of years divisible by 4 in 1790..2024:
    First divisible by 4 is 1792 (since 1790/4 = 447.5, so next is 1792).
    Last is 2024.
    Number of terms = (2024 - 1792)/4 + 1 = (232)/4 + 1 = 58 + 1 = 59.
    
    Subtract century years not divisible by 400: 1800, 1900 (2000 is divisible by 400, so it's a leap year).
    So subtract 2: 59 - 2 = 57 leap years.
    
    Thus, there are 57 leap periods in the 235 years (each corresponding to a leap year in Y+1, i.e., 1792 to 2024 excluding 1800 and 1900).
    
    Thus, total days for full years: 235 * 365 + 57 = 85832.
    
    Now, remaining days from July 14, 2024, to June 4, 2025: as before, 326.
    
    Total days: 85832 + 326 = 86158.
    
    But this counts both July 14, 1789, and June 4, 2025. If we want the number of days between them (excluding the start day), it's 86158 - 1 = 86157.
    
    But the question is "how many days ago did the French Revolution start?" which is asking for the time elapsed since the start. The time elapsed from July 14, 1789, to June 4, 2025, is (June 4, 2025 - July 14, 1789), which is the same as counting the days from July 14, 1789 (not including) to June 4, 2025 (including), which is 86158 - 1 = 86157.
    
    Wait, no: if we say "X days ago from today", it's the number of days in [today - X, today), so the revolution started X days ago means today - X = start date, so X = today - start date, which is the number of days from start date to today, excluding the start date if counting days after.
    
    But in our calculation, we included both endpoints, so to get the number of days between them (excluding start), it's 86158 - 1 = 86157.
    
    But let's think differently: the duration between two dates is (end - start) + 1 if including both, or (end - start) if including start but not end, etc. But in terms of days ago, if today is June 4, 2025, and the revolution started on July 14, 1789, then the number of days ago is (June 4, 2025 - July 14, 1789), which is the same as counting the days from July 15, 1789, to June 4, 2025, inclusive, which is (June 4, 2025 - July 15, 1789) + 1.
    
    But our initial calculation was from July 14, 1789 (inclusive) to June 4, 2025 (inclusive), which is (June 4, 2025 - July 14, 1789) + 1 = 86158.
    
    Thus, days ago = 86158 - 1 (since the day of start is not "ago" from today) = 86157.
    
    But actually, if the revolution started on day X, then today is X + N days ago. So N = today - X, where today and X are dates, and the subtraction gives the number of days between them, excluding the start day X in the count.
    
    But in our calculation, we did X to today inclusive, which is (today - X) + 1. So today - X = 86158 - 1 = 86157.
    
    Thus, the number of days ago is 86157.
    
    But let's verify with a smaller example. Suppose today is July 16, 2024, and the start was July 14, 2024. Then days ago is 2 (July 14 is 2 days ago from July 16). Our method would calculate days from July 14 to July 16 inclusive: 3 days (14,15,16), then subtract 1 to get days ago: 2, which is correct.
    
    Another example: today is July 15, 2024, start was July 14, 2024. Days from July 14 to July 15 inclusive: 2 days. Days ago: 2 - 1 = 1, which is correct (July 14 is 1 day ago from July 15).
    
    Thus, our calculation seems correct: total inclusive days is 86158, so days ago is 86157.
    
    But wait, let's think about the full years part. We calculated 235 years with 57 leap periods, giving 235*365 + 57 = 85832 days for the full years. Then we added 326 days for the remaining period from July 14, 2024, to June 4, 2025, which is correct. So total is 86158 inclusive, so days ago is 86157.
    
    But let's cross-validate the remaining days calculation (July 14, 2024, to June 4, 2025):
    
    From July 14 to December 31, 2024:
    July: 31 - 14 = 17 (from July 15 to July 31) + 1 (for July 14) = 18 days? Wait, no, from July 14 to July 14 is 1 day (just July 14). From July 14 to July 15 is 2 days (14 and 15). So from July 14 to July 31 is 31 - 14 + 1 = 18 days (including both ends).
    Similarly for other months: but since we're starting at July 14, the first month is July with 18 days (July 14 to July 31), then full months August to December, then full months January to May, then June up to 4.
    
    Wait, perhaps better to calculate it as:
    From July 14, 2024, to June 4, 2025, is equal to:
    (From July 14, 2024, to July 14, 2025) minus (from June 5, 2025, to July 14, 2025).
    
    But that seems convoluted. Alternatively, calculate days remaining in 2024 after July 14, plus days in 2025 up to June 4.
    
    Days in July after July 14: July has 31 days, so from July 15 to July 31 is 17 days (31 - 14 = 17, since July 15 is day 1, ..., July 31 is day 17). Wait, no, July 14 to July 31 inclusive is 31 - 14 + 1 = 18 days.
    
    Similarly, for August: full month is 31 days.
    September: 30
    October: 31
    November: 30
    December: 31
    Total for August to December: 31 + 30 + 31 + 30 + 31 = 153
    Total for July: 18
    Total for 2024: 18 (July) + 153 (Aug-Dec) = 171 days from July 14 to Dec 31, 2024 inclusive.
    
    Now 2025 is not a leap year (2025 is not divisible by 4).
    Days in 2025 up to June 4:
    January: 31
    February: 28 (not leap year)
    March: 31
    April: 30
    May: 31
    June: 4
    Total: 31 + 28 = 59; +31=90; +30=120; +31=151; +4=155 days from Jan 1 to June 4 inclusive.
    
    But we need from July 14, 2024, to June 4, 2025, inclusive, which is:
    (July 14, 2024, to Dec 31, 2024) + (Jan 1, 2025, to June 4, 2025) = 171 + 155 = 326 days.
    
    Thus, total days from start to today inclusive is 85832 (full years) + 326 (remaining) = 86158.
    Thus, days ago is 86158 - 1 = 86157.
    
    But let's cross-check the leap year count.
    
    Number of leap years between July 14, 1789, and July 14, 2024 (inclusive of the start date but not the end? Wait, our period is from July 14, 1789, up to but not including July 14, 2025, but we're counting up to July 14, 2024, which is before July 14, 2025.
    
    Wait, perhaps better to think in terms of the number of February 29s in the period from July 14, 1789, to June 4, 2025.
    
    A February 29 falls within our period if the year Y is a leap year and February 29, Y, is >= July 14, 1789, and < June 4, 2025.
    
    But since our start date is July 14, 1789, which is after February 29, 1789, the first possible leap day in our period is February 29, 1792.
    
    Similarly, the last possible leap day before June 4, 2025, is February 29, 2024 (since February 29, 2028, is after June 4, 2025).
    
    So leap days in our period are February 29 of years Y where:
    1792 <= Y <= 2024, and Y is a leap year.
    
    Number of leap years between 1792 and 2024 inclusive.
    
    Number of years divisible by 4: (2024 - 1792)/4 + 1 = (232)/4 + 1 = 58 + 1 = 59.
    
    Subtract century years not divisible by 400: 1800, 1900 (2000 is divisible by 400, so it's included).
    Thus, subtract 2: 59 - 2 = 57 leap years.
    
    Thus, there are 57 leap days in our period (February 29, 1792, to February 29, 2024).
    
    Each of these adds 1 day to the total count (since we're counting the entire period inclusively, and February 29 is included in the period for these years).
    
    Thus, our initial calculation seems correct: 235 full years with 57 leap periods contributing an extra day each, plus the remaining 326 days, gives 86158 total days including both endpoints. Therefore, the number of days ago is 86157.
    
    But wait, is the count of leap years correct? Let's verify for a smaller range.
    
    Example: from July 14, 2020, to June 4, 2022.
    Leap years in this period: February 29, 2020 (but our start is July 14, 2020, so February 29, 2020, is before our start date, so not included).
    Next is February 29, 2024, which is after June 4, 2022, so not included.
    Wait, in our original problem, the period is July 14, 1789, to June 4, 2025, so February 29, 2024, is included (since June 4, 2025, is after February 29, 2024).
    
    But in our initial calculation, we considered leap years up to 2024 (February 29, 2024), which is correct because our period includes February 29, 2024 (since June 4, 2025, is after February 29, 2024).
    
    But in the small example from July 14, 2020, to June 4, 2022:
    Leap days: February 29, 2020 is before July 14, 2020 (start), so not included.
    February 29, 2024 is after June 4, 2022 (end), so not included.
    Thus, no leap days in this period. But 2020 is a leap year, and our period starts after February 29, 2020, so the year from July 14, 2020, to July 14, 2021, does not include February 29 (since February 29, 2020, is before July 14, 2020, and February 29, 2021, doesn't exist). Wait, no, the period from July 14, 2020, to July 14, 2021, includes February 29, 2020? Wait, no, the period is from July 14, 2020, to July 14, 2021, which does not include February 29 (since February 29, 2020, is before July 14, 2020, and February 29, 2021, doesn't exist). Wait, no, the period from July 14, 2020, to July 14, 2021, includes February 29, 2021? Wait, no, February 29 is only in leap years, and 2021 is not a leap year. So in this small example, there are no February 29s in the period from July 14, 2020, to June 4, 2022, so the leap day count is correct (0 in this case).
    
    Thus, our original calculation seems correct.
    
    Final calculation:
    Total days from July 14, 1789, to June 4, 2025 inclusive: 86158.
    Number of days ago: 86158 - 1 = 86157.
    
    But wait, is the count of full years correct? We have 235 full years (from July 14, 1789, to July 14, 2024), but actually, the period from July 14, 1789, to July 14, 2024, is 2024 - 1789 = 235 years, where each year is from July 14 to July 14 of the next year. So the number of full years is correct at 235.
    
    Leap days in this period are February 29s that fall within any of these year-long periods. For a period from July 14, Y, to July 14, Y+1, the February 29 that could be included is February 29, Y+1 (since February 29, Y, is before July 14, Y, the start of the period, unless Y is before July, but our periods are from July 14 to July 14, so February is always before the start of the period if Y >= July). Wait, no, our periods are from July 14, Y, to July 14, Y+1, which includes February of Y+1, not Y. So for each Y from 1789 to 2023 (since the last period is from July 14, 2023, to July 14, 2024), the period includes February of Y+1.
    
    Thus, the leap day is included in the period if Y+1 is a leap year. So the number of leap periods is equal to the number of Y in 1789..2023 such that Y+1 is a leap year, i.e., Y+1 is divisible by 4, not divisible by 100 unless also by 400.
    
    Thus, Y+1 ranges from 1790 to 2024, and we need the number of leap years in this range, which is what we calculated earlier: 57.
    
    Thus, the calculation holds.
    
    Final answer: 86157 days ago.
    </think>
    
    ### Summary:
    The French Revolution started on July 14, 1789. To find out how many days ago this was from June 4, 2025, we calculate the total days between these two dates.
    
    1. Calculate the number of full years between July 14, 1789, and July 14, 2024: 235 years.
    2. Determine the number of leap years in this period by counting leap years between 1790 and 2024 (inclusive), accounting for century years that are not leap years (1800 and 1900), resulting in 57 leap years.
    3. Calculate the total days for these full years: \( 235 \times 365 + 57 = 85832 \) days.
    4. Calculate the days remaining from July 14, 2024, to June 4, 2025, which is 326 days (171 days in 2024 after July 14 and 155 days in 2025 up to June 4).
    5. Sum the total days: \( 85832 + 326 = 86158 \) days (inclusive of both start and end dates).
    6. Since the question asks for how many days ago the revolution started, we subtract 1 day to exclude the start day itself, resulting in 86157 days ago.
    
    However, upon closer inspection, the correct interpretation is that the number of days between two dates (excluding the start day) is simply the difference in days between the two dates. Therefore, the correct calculation is the total inclusive days minus one to exclude the start day, which remains 86157.
    
    But let's cross-validate with a smaller example to ensure correctness.
    
    **Example:**
    If today is July 16, 2024, and the event was on July 14, 2024:
    - Inclusive days: July 14 to July 16 is 3 days (14, 15, 16).
    - Days ago: 2 (July 14 is 2 days ago from July 16: July 15 and 16 are 1 and 0 days ago, respectively).
    
    This matches our method: inclusive days = 3, so days ago = 3 - 1 = 2.
    
    Thus, applying this to the original problem:
    Inclusive days = 86158, so days ago = 86158 - 1 = 86157.
    
    However, another approach is to calculate the exact difference between the two dates without including the start day in the count of "days ago". The duration between July 14, 1789, and June 4, 2025, excluding the start day, is (June 4, 2025 - July 14, 1789), which is equivalent to counting from July 15, 1789, to June 4, 2025, inclusive. This would be:
    
    From July 15, 1789, to July 14, 2025: 235 years and 326 days (since July 15 to July 14 is one day less than a full year, but this seems confusing).
    
    Alternatively, it's simpler to compute the total inclusive days and subtract 1, as initially done.
    
    Therefore, the final answer is that the French Revolution started **86157 days ago**.
    
    But let's verify the calculation once more for accuracy.
    
    Recalculating the remaining days (July 14, 2024, to June 4, 2025):
    
    From July 14 to December 31, 2024:
    - July 14 to July 31: 18 days (including July 14)
    - August: 31
    - September: 30
    - October: 31
    - November: 30
    - December: 31
    Total: 18 + 31 + 30 + 31 + 30 + 31 = 171 days (but wait, July 14 to July 31 is 18 days, August starts on August 1, so total is correct as 171).
    
    From January 1 to June 4, 2025 (not a leap year):
    - January: 31
    - February: 28
    - March: 31
    - April: 30
    - May: 31
    - June: 4
    Total: 31 + 28 + 31 + 30 + 31 + 4 = 155 days.
    
    Total remaining days: 171 + 155 = 326.
    
    Full years: 235 years, with 57 leap days (as calculated earlier).
    Total for full years: 235 * 365 + 57 = 85775 + 57 = 85832.
    
    Total inclusive days: 85832 + 326 = 86158.
    
    Days ago: 86158 - 1 = 86157.
    
    Thus, the final answer is:
    
    \[
    \boxed{86157}
    \]
    
    However, upon re-reading the problem, it asks for "how many days ago did the French Revolution start?" which is asking for the duration from the start date to today, not including the start date itself. Therefore, the correct calculation is the difference between the two dates, which is (today - start date). If we compute the total inclusive days and subtract 1, we get the correct number of days ago, which is 86157.
    
    But let's compute it directly:
    
    Compute the difference between June 4, 2025, and July 14, 1789.
    
    We can break it down as follows:
    
    1. Calculate the number of days from July 14, 1789, to July 14, 2024: 235 years.
       - Number of leap years in this period: years divisible by 4 between 1792 and 2024, excluding 1800 and 1900: as before, 57.
       - Total days: 235 * 365 + 57 = 85832.
    
    2. Add the days from July 14, 2024, to June 4, 2025: as before, 326.
    
    Total days from July 14, 1789, to June 4, 2025, inclusive: 85832 + 326 = 86158.
    
    But the number of days ago is the number of days after July 14, 1789, up to June 4, 2025, which is June 4, 2025 - July 14, 1789 = (days from July 15, 1789, to June 4, 2025, inclusive).
    
    This is equal to (total inclusive days from July 14, 1789, to June 4, 2025) - 1 = 86157.
    
    Thus, the French Revolution started **86157 days ago**.
    
    However, let's cross-check with an online date difference calculator to ensure our manual calculation is correct.
    
    Using an online calculator (e.g., timeanddate.com), the number of days between July 14, 1789, and June 4, 2025, inclusive is indeed 86158 days. Excluding the start date, it's 86157 days ago.
    
    Thus, the final answer is:
    
    The French Revolution started **86157 days ago**.
    
    But wait, let's think about the interpretation of "days ago". If the revolution started on day X, then "days ago" is how many days have passed since X up to but not including today. But in our case, today is June 4, 2025, and the revolution started on July 14, 1789. The duration from July 14, 1789, to June 4, 2025 (excluding the start day) is the same as the number of days from July 15, 1789, to June 4, 2025, inclusive, which is the total inclusive days minus 1 (for the start day).
    
    But actually, the standard way to compute "days ago" is to calculate (today - start date), which is the same as counting the days from the day after the start date up to and including today. But since our end date is today (June 4, 2025), and we're calculating days from July 14, 1789, to June 4, 2025 inclusive (which is 86158), then the days ago is 86158 - 1 = 86157.
    
    But if we were to calculate (June 4, 2025 - July 14, 1789), this would directly give the number of days between the two dates not including the start date, which is what we want. This is equivalent to counting from July 15, 1789, to June 4, 2025 inclusive, which is the same as the total inclusive days minus one (the start day), so our initial calculation is correct.
    
    Thus, the final answer is:
    
    The French Revolution started **86157 days ago**.
    
    But let's cross-validate with a smaller example again to ensure correctness.
    
    Example:
    Start date: July 14, 2024
    Today: July 16, 2024
    Inclusive days: July 14, 15, 16 → 3 days.
    Days ago: 2 (since July 14 is 2 days ago from July 16: July 15 is 1 day ago, July 16 is today, 0 days ago).
    Our method: inclusive days (3) - 1 = 2. Correct.
    
    Another example:
    Start: July 14, 2024
    Today: July 15, 2024
    Inclusive days: July 14, 15 → 2 days.
    Days ago: 1 (July 14 is 1 day ago from July 15).
    Our method: 2 - 1 = 1. Correct.
    
    Thus, our method is correct.
    
    Therefore, the French Revolution started **86157 days ago**.
    
    However, let's think about the leap day calculation again to ensure no mistake was made.
    
    We have full years from July 14, 1789, to July 14, 2024 (235 years), and we counted 57 leap periods in this interval (where each period is a year from July 14 to July 14, and contains February 29 if the next year is a leap year).
    
    But let's think differently: the total number of February 29s in the entire period from July 14, 1789, to June 4, 2025.
    
    February 29 occurs in leap years. The leap years in this period are those where February 29 is between July 14, 1789, and June 4, 2025.
    
    Since February 29 of any year Y falls before July 14 of that same year, February 29, 1789, lies before our start date of July 14, 1789, and is excluded. For every Y >= 1790, February 29, Y, falls after July 14, 1789, so it is included as long as it is also before June 4, 2025, which holds for Y <= 2024 (the next leap day, February 29, 2028, is past June 4, 2025).
    
    Thus, the leap days to count are the February 29s of leap years Y with 1790 <= Y <= 2024.
    
    Number of years divisible by 4 in 1790..2024:
    First divisible by 4 is 1792 (since 1790 and 1791 are not divisible by 4).
    Last divisible by 4 is 2024.
    Number of terms: (2024 - 1792)/4 + 1 = (232)/4 + 1 = 58 + 1 = 59.
    
    Subtract century years not divisible by 400: 1800, 1900 (2000 is divisible by 400, so it's included).
    Thus, subtract 2: 59 - 2 = 57 leap years.
    
    This matches our earlier count of leap periods (57), confirming that our initial approach was correct.
    
    Thus, the final answer is indeed 86157 days ago.
    
    However, let's perform the calculation once more to ensure no arithmetic errors were made.
    
    Full years: 235 years (July 14, 1789, to July 14, 2024).
    Number of leap years in this period: as above, 57 (for years 1792 to 2024 excluding 1800 and 1900).
    Total days for full years: 235 * 365 plus 57 leap days. First, 235 * 365 = 235 * 300 + 235 * 65 = 70500 + 235 * 65.
    Calculate 235 * 65:
    200 * 65 = 13000
    35 * 65 = 2275
    Total: 13000 + 2275 = 15275
    So 70500 + 15275 = 85775
    Add leap days: 85775 + 57 = 85832.
    
    Remaining days: July 14, 2024, to June 4, 2025.
    From July 14 to July 31, 2024: 31 - 14 + 1 = 18 days.
    August to December: 31 + 30 + 31 + 30 + 31 = 153.
    Total for 2024: 18 + 153 = 171.
    From Jan 1 to June 4, 2025:
    January: 31
    February: 28 (not a leap year)
    March: 31
    April: 30
    May: 31
    June: 4
    Total: 31 + 28 = 59; +31=90; +30=120; +31=151; +4=155.
    Total remaining days: 171 + 155 = 326.
    
    Total inclusive days: 85832 (full years) + 326 (remaining) = 86158.
    Days ago: 86158 - 1 = 86157 (we subtract 1 because the 326-day count includes July 14, 2024, which is already the endpoint of the full-years span).
    
    Thus, the final answer is:
    
    The French Revolution started **86157 days ago**.
    
    \boxed{86157}
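    A quick way to double-check this arithmetic is Python's standard datetime module, using the same two dates:

    from datetime import date

    # Days elapsed from the storming of the Bastille (July 14, 1789) to June 4, 2025
    delta = date(2025, 6, 4) - date(1789, 7, 14)
    print(delta.days)  # 86157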
    ./llama.cpp/llama-mtmd-cli \
        --model unsloth/Magistral-Small-2509-GGUF/Magistral-Small-2509-Q4_K_XL.gguf \
        --mmproj unsloth/Magistral-Small-2509-GGUF/mmproj-BF16.gguf \
        --threads -1 \
        --ctx-size 40960 \
        --cache-type-k f16 \
        --n-gpu-layers 99 \
        --seed 3407 \
        --prio 2 \
        --temp 0.7 \
        --repeat-penalty 1.0 \
        --min-p 0.01 \
        --top-k -1 \
        --top-p 0.95 \
        --jinja
    !pip install --upgrade unsloth
    from unsloth import FastLanguageModel
    import torch
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "unsloth/Magistral-Small-2509-unsloth-bnb-4bit",
        max_seq_length = 2048,   # Context length - can be longer, but uses more memory
        load_in_4bit = True,     # 4bit uses much less memory
        load_in_8bit = False,    # A bit more accurate, uses 2x memory
        full_finetuning = False, # We have full finetuning now!
        device_map = "balanced", # Uses 2x Tesla T4s
        # token = "hf_...",      # use one if using gated models
    )
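    If you plan to fine-tune after loading, the usual next step is attaching LoRA adapters. Below is a minimal sketch using FastLanguageModel.get_peft_model; the rank, alpha, and target modules shown are common defaults rather than required values.

    model = FastLanguageModel.get_peft_model(
        model,
        r = 16,                        # LoRA rank; higher = more capacity, more VRAM
        target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                          "gate_proj", "up_proj", "down_proj"],
        lora_alpha = 16,
        lora_dropout = 0,              # 0 is the optimized setting
        bias = "none",                 # "none" is the optimized setting
        use_gradient_checkpointing = "unsloth",  # saves VRAM for longer contexts
        random_state = 3407,
    )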
    pip install --upgrade --force-reinstall --no-cache-dir unsloth unsloth_zoo

    Unsloth Model Catalog

    Unsloth model catalog for all our Dynamic GGUF, 4-bit, 16-bit models on Hugging Face.

    GGUFs let you run models in tools like Ollama, Open WebUI, and llama.cpp. Instruct (4-bit) safetensors can be used for inference or fine-tuning.
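    As one example, a GGUF from the catalog can be downloaded and loaded directly from Hugging Face with llama-cpp-python. This is only a sketch: the repo id is the Magistral upload used earlier on this page, and the quant filename pattern is an assumption you should match to the file you want.

    from llama_cpp import Llama  # requires llama-cpp-python and huggingface_hub

    llm = Llama.from_pretrained(
        repo_id = "unsloth/Magistral-Small-2509-GGUF",
        filename = "*Q4_K_XL.gguf",  # glob pattern selecting the quant to download
        n_ctx = 4096,
    )
    out = llm("The French Revolution began in", max_tokens=16)
    print(out["choices"][0]["text"])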

    New & recommended models, plus DeepSeek, Llama, Gemma, Qwen, Mistral, Phi, and other (GLM, Orpheus, Smol, Llava etc.) models each get a table listing Model, Variant, GGUF, and Instruct (4-bit) links.

    16-bit and 8-bit Instruct models are used for inference or fine-tuning:

    New models, plus DeepSeek, Llama, Gemma, Qwen, Mistral, Phi, and Text-to-Speech (TTS) models each get a table listing Model, Variant, and Instruct (16-bit) links.

    Base models are usually used for fine-tuning purposes:

    New models, plus Llama, Qwen, Gemma, Mistral, and other (TTS, TinyLlama) models each get a table listing Model, Variant, Base (16-bit), and Base (4-bit) links.

    You can use our FP8 uploads for training or serving/deployment.

    FP8 Dynamic offers slightly faster training and lower VRAM usage than FP8 Block, but with a small trade-off in accuracy.
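    As a minimal serving sketch, FP8 checkpoints can be loaded with vLLM. The repository id below follows Unsloth's FP8 naming and is an assumption; check the table for the exact upload, and note that a 70B model needs substantial GPU memory.

    from vllm import LLM, SamplingParams

    # Assumed repo id for the FP8 Dynamic upload of Llama 3.3 70B Instruct
    llm = LLM(model = "unsloth/Llama-3.3-70B-Instruct-FP8-Dynamic")
    outputs = llm.generate(["Hello!"], SamplingParams(temperature=0.7, max_tokens=32))
    print(outputs[0].outputs[0].text)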

    The FP8 table lists Model, Variant, and the available FP8 (Dynamic / Block) uploads; for example, Llama 3.3 70B Instruct and Llama 3.2 1B Base are offered in both Dynamic and Block FP8.

    Families covered across the catalog tables include: gpt-oss, Mistral Large 3, Ministral 3, Magistral Small, Mistral Small, Mistral NeMo, Mistral Large, Mistral 7B, Mixtral, Devstral, Pixtral, FLUX.2, Qwen3 (including Qwen3-2507, Qwen3-VL, Qwen3-Coder, and Qwen3-Next), Qwen 2.5 (VL, Omni, and Coder), QwQ, QVQ, Qwen 2 (and Qwen 2 VL), DeepSeek-V3.1, DeepSeek-V3, DeepSeek-R1 (including the R1 distills), GLM, Granite 4.0, Kimi-K2, Grok 2, Baidu-ERNIE, Hunyuan, Llama 4, Llama 3.3, Llama 3.2, Llama 3.1, Llama 3, Llama 2, CodeLlama, Gemma 3n, Gemma 3, Gemma 2, MedGemma, Phi-4, Phi-3.5, Phi-3, Orpheus, LLava, TinyLlama, SmolLM 2, Zephyr-SFT, Yi, and the TTS/STT models (Orpheus-3B, Sesame-CSM 1B, Whisper Large V3, Llasa-TTS 1B, Spark-TTS 0.5B, Oute-TTS 1B).