Beginner's Guide to transforming a model like Llama 3.1 (8B) into a reasoning model by using Unsloth and GRPO.
DeepSeek developed GRPO (Group Relative Policy Optimization) to train their R1 reasoning models.
These instructions are for our pre-made Google Colab notebooks. If you are installing Unsloth locally, you can also copy our notebooks inside your favorite code editor. We'll be using any of these notebooks:
- GSPO
- Vision GSPO
- Advanced GRPO
If you're using our Colab notebook, click Runtime > Run all. We'd highly recommend checking out our Fine-tuning Guide before getting started.
If installing locally, ensure you have the correct requirements and use pip install unsloth on Linux or follow our Windows install instructions.
Before we get started, it is recommended to learn more about GRPO, reward functions and how they work. Read more about them including tips & tricks here.
You will also need enough VRAM. In general, the number of model parameters (in billions) roughly equals the amount of VRAM (in GB) you will need. In Colab, we are using their free 16GB VRAM GPUs, which can train any model up to 16B parameters.
We have already pre-selected optimal settings for the best results. You can change the model to any of those listed in our supported models, but we would not recommend changing other settings if you're a beginner.
For advanced GRPO documentation on batching, generation and training parameters, read our guide!
We have pre-selected OpenAI's GSM8K dataset, which contains grade school math problems, but you could change it to your own or any public one on Hugging Face. You can read more about datasets here.
Your dataset should still have at least 2 columns for question and answer pairs. However, the answer must not reveal the reasoning behind how it was derived from the question. See below for an example:
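For instance, a minimal question/answer pair could look like the following (an illustrative example, not taken from the actual dataset):
example_row = {
    "question": "Sarah has 12 apples and buys 15 more. How many apples does she have now?",
    "answer": "27",   # final answer only - no reasoning steps revealed
}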
We'll structure the data to prompt the model to articulate its reasoning before delivering an answer. To start, we'll establish a clear format for both prompts and responses.
# Define the system prompt that instructs the model to use a specific format
SYSTEM_PROMPT = """
Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>
"""
XML_COT_FORMAT = """\
<reasoning>
{reasoning}
</reasoning>
<answer>
{answer}
</answer>
"""Now, to prepare the dataset:
import re
from datasets import load_dataset, Dataset
# Helper functions to extract answers from different formats
def extract_xml_answer(text: str) -> str:
answer = text.split("<answer>")[-1]
answer = answer.split("</answer>")[0]
return answer.strip()
def extract_hash_answer(text: str) -> str | None:
if "####" not in text:
return None
return text.split("####")[1].strip()
# Function to prepare the GSM8K dataset
def get_gsm8k_questions(split="train") -> Dataset:
data = load_dataset("openai/gsm8k", "main")[split]
data = data.map(
lambda x: {
"prompt": [
{"role": "system", "content": SYSTEM_PROMPT},
{"role": "user", "content": x["question"]},
],
"answer": extract_hash_answer(x["answer"]),
}
)
return data
dataset = get_gsm8k_questions()The dataset is prepared by extracting the answers and formatting them as structured strings.
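As a quick sanity check (illustrative only), extract_hash_answer strips the worked solution before the #### marker and keeps just the final label:
raw_answer = "Natalia sold 48/2 = 24 clips in May. 48 + 24 = 72. #### 72"
print(extract_hash_answer(raw_answer))        # -> "72"
print(extract_hash_answer("no marker here"))  # -> None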
Reward functions/verifiers let us know whether the model is doing well or not according to the dataset you have provided. Each generation is assessed against the average score of the other generations in its group. You can create your own reward functions; however, we have already pre-selected Will's GSM8K reward functions for you, which give us 5 different ways to reward each generation.
You can input your generations into an LLM like ChatGPT 4o or Llama 3.1 (8B) and design a reward function and verifier to evaluate them. For example, feed your generations into an LLM of your choice and set a rule: "If the answer sounds too robotic, deduct 3 points." This helps refine outputs based on quality criteria. See examples of what they can look like here.
Example Reward Function for an Email Automation Task:
Question: Inbound email
Answer: Outbound email
Reward Functions:
If the answer contains a required keyword → +1
If the answer exactly matches the ideal response → +1
If the response is too long → -1
If the recipient's name is included → +1
If a signature block (phone, email, address) is present → +1
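Below is a rough sketch of how that rubric could be expressed in code. The keyword choice, length threshold and signature check are illustrative assumptions, not part of any notebook:
def email_reward(response: str, ideal_response: str, recipient_name: str) -> float:
    score = 0.0
    if "meeting" in response.lower():               # required keyword (example choice)
        score += 1.0
    if response.strip() == ideal_response.strip():  # exact match with the ideal reply
        score += 1.0
    if len(response.split()) > 200:                 # response is too long
        score -= 1.0
    if recipient_name in response:                  # recipient's name is included
        score += 1.0
    if "Phone:" in response and "Email:" in response:  # crude signature-block check
        score += 1.0
    return score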
We have pre-selected hyperparameters for the most optimal results; however, you can change them. Read all about parameters here. For advanced GRPO documentation on batching, generation and training parameters, read our guide!
The GRPOConfig defines key hyperparameters for training:
use_vllm: Activates fast inference using vLLM.
learning_rate: Determines the model's learning speed.
num_generations: Specifies the number of completions generated per prompt.
max_steps: Sets the total number of training steps.
NEW! We now support DAPO, Dr. GRPO and most other new GRPO techniques. You can play with the following arguments in GRPOConfig to enable:
epsilon=0.2,
epsilon_high=0.28, # one sided
delta=1.5, # two sided
loss_type='bnpo',
# or:
loss_type='grpo',
# or:
loss_type='dr_grpo',
# or:
loss_type='dapo',
mask_truncated_completions=True,
You should see the reward increase over time. We would recommend training for at least 300 steps, which may take around 30 minutes; however, for optimal results, you should train for longer.
If you're having issues with your GRPO model not learning, we'd highly recommend using our Advanced GRPO notebooks, as they have a much better reward function and you should see results faster and more frequently.
You will also see sample answers, which let you see how the model is learning. Some may contain steps, XML tags, attempts, etc., and the idea is that as training progresses, the model gets scored higher and higher until it produces the outputs we desire, with long reasoning chains in its answers.
Run your model by clicking the play button. In the first example there is usually no reasoning in the answer. In order to see the reasoning, we first need to save the LoRA weights we just trained with GRPO using:
model.save_lora("grpo_saved_lora")
Then we load the LoRA and test it. Our reasoning model is much better - it's not always correct, since we only trained it for an hour or so - it'll be better if we extend the sequence length and train for longer!
You can then save your model to GGUF, Ollama etc. by following our guide here.
If you are still not getting any reasoning, you may have either trained for too few steps or your reward function/verifier was not optimal.
We have multiple options for saving your fine-tuned model, but we'll focus on the easiest and most popular approaches, which you can read more about here.
Saving in 16-bit Precision
You can save the model with 16-bit precision using the following command:
# Save to 16-bit precision
model.save_pretrained_merged("model", tokenizer, save_method="merged_16bit")
Pushing to Hugging Face Hub
To share your model, we’ll push it to the Hugging Face Hub using the push_to_hub_merged method. This allows saving the model in multiple quantization formats.
# Push to Hugging Face Hub (requires a token)
model.push_to_hub_merged(
"your-username/model-name", tokenizer, save_method="merged_16bit", token="your-token"
)
Saving in GGUF Format for llama.cpp
Unsloth also supports saving in GGUF format, making it compatible with llama.cpp and Ollama.
model.push_to_hub_gguf(
"your-username/model-name",
tokenizer,
quantization_method=["q4_k_m", "q8_0", "q5_k_m"],
token="your-token",
)
Once saved in GGUF format, the model can be easily deployed in lightweight environments using llama.cpp or used in other inference engines.
Here are some video tutorials created by amazing YouTubers who we think are fantastic!
Learn what Reward Hacking is in Reinforcement Learning and how to counter it.
The ultimate goal of RL is to maximize some reward (say speed, revenue, or some other metric). But RL can cheat. When the RL algorithm learns a trick or exploits something to increase the reward without actually doing the task at hand, this is called "Reward Hacking".
It's the reason models learn to modify unit tests to pass coding challenges, and these are critical blockers for real world deployment. Some other good examples are from Wikipedia.
Can you counter reward hacking? Yes! In our free gpt-oss RL notebook we explore how to counter reward hacking in a code generation setting and showcase tangible solutions to common error modes. We saw the model edit the timing function, outsource to other libraries, cache the results, and outright cheat. After countering, the result is our model generates genuinely optimized matrix multiplication kernels, not clever cheats.
Some common examples of reward hacking during RL include:
RL learns to use NumPy, Torch and other libraries, which call optimized CUDA kernels. We can stop the RL algorithm from calling optimized code by inspecting whether the generated code imports non-standard Python libraries.
RL learns to cache the result of the output, and to find the actual output by inspecting Python global variables.
We can stop the RL algorithm from using cached data by wiping the cache with a large fake matrix. We also have to benchmark carefully with multiple loops and turns.
RL learns to edit the timing function to make it report 0 elapsed time. We can stop the RL algorithm from using global or cached variables by restricting its locals and globals. We also use exec to create the function, so we have to save the output to an empty dict. Finally, we disallow global variable access via types.FunctionType(f.__code__, {}), as sketched below.
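As a rough illustration of that last counter-measure (a sketch only - the gpt-oss notebook's actual implementation differs), you can exec the generated code into an empty namespace and then rebuild the function with empty globals so it cannot touch imports, caches or the timing function:
import types

def build_sandboxed_fn(code_str: str, fn_name: str):
    namespace = {}                                     # empty namespace for exec
    exec(compile(code_str, "<generated>", "exec"), namespace)
    f = namespace[fn_name]
    # Rebuild the function with an empty globals dict so it cannot read or
    # write module-level state (no cached results, no imported libraries).
    return types.FunctionType(f.__code__, {})

generated = "def add(a, b):\n    total = a + b\n    return total"
sandboxed_add = build_sandboxed_fn(generated, "add")
print(sandboxed_add(2, 3))   # 5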
Train with GSPO (Group Sequence Policy Optimization) RL in Unsloth.
We're introducing support for GSPO (Group Sequence Policy Optimization), a variant of GRPO made by the Qwen team at Alibaba. They observed that GRPO applies importance weights to each token, even though advantages do not inherently scale or change per token. This led to the creation of GSPO, which assigns importance based on the sequence likelihood rather than the individual token likelihoods.
Use our free GSPO notebooks to get started.
Enable GSPO in Unsloth by setting importance_sampling_level = "sequence" in the GRPO config. The difference between these two algorithms can be seen below, both from the GSPO paper from Qwen and Alibaba:
In Equation 1, it can be seen that the advantages scale each of the rows of token logprobs before that tensor is summed. Essentially, each token is given the same scaling, even though that scaling was assigned to the entire sequence rather than each individual token. A simple diagram of this can be seen below:
Equation 2 shows that the per-token logprob ratios for each sequence are summed and exponentiated first, and only the resulting sequence-level ratios are multiplied row-wise by the advantages.
Enabling GSPO is simple, all you need to do is set the importance_sampling_level = "sequence" flag in the GRPO config.
The paper "Defeating the Training-Inference Mismatch via FP16" (https://arxiv.org/pdf/2510.26788) shows how using float16 is better than bfloat16.
The paper demonstrates that using float16 precision can be dramatically better than using bfloat16 when doing reinforcement learning.
In fact the longer the generation, the worse it gets when using bfloat16:
We did an investigation and do find float16 to be more stable than bfloat16, with much smaller gradient norms.
Note that older vLLM versions (before 0.11.0) had broken attention mechanisms for A100 and similar GPUs. Please update vLLM! We also disable cascade attention in vLLM by default during Unsloth reinforcement learning if we detect an older vLLM version.
Different hardware also changes results, where newer and more expensive GPUs have less KL difference between the inference and training sides:
To use float16 precision in Unsloth GRPO and RL, you just need to set dtype = torch.float16 and we'll take care of the rest!
training_args = GRPOConfig(
    output_dir = "vlm-grpo-unsloth",
    per_device_train_batch_size = 8,
    gradient_accumulation_steps = 4,
    learning_rate = 5e-6,
    adam_beta1 = 0.9,
    adam_beta2 = 0.99,
    weight_decay = 0.1,
    warmup_ratio = 0.1,
    lr_scheduler_type = "cosine",
    optim = "adamw_8bit",
    # beta = 0.00,
    epsilon = 3e-4,
    epsilon_high = 4e-4,
    num_generations = 8,
    max_prompt_length = 1024,
    max_completion_length = 1024,
    log_completions = False,
    max_grad_norm = 0.1,
    temperature = 0.9,
    num_train_epochs = 2,   # For a quick test run, increase for full training
    report_to = "none",     # Set to "wandb" if you want to log to Weights & Biases

    # GSPO is below:
    importance_sampling_level = "sequence",

    # Dr GRPO / DAPO etc.
    loss_type = "dr_grpo",
)
from unsloth import FastLanguageModel
import torch

max_seq_length = 2048   # Can increase for longer reasoning traces
lora_rank = 32          # Larger rank = smarter, but slower

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Qwen3-4B-Base",
    max_seq_length = max_seq_length,
    load_in_4bit = False,           # False for LoRA 16bit
    fast_inference = True,          # Enable vLLM fast inference
    max_lora_rank = lora_rank,
    gpu_memory_utilization = 0.9,   # Reduce if out of memory
    dtype = torch.float16,          # Use torch.float16 or torch.bfloat16
)
To use the reward modelling functions for DPO, GRPO, ORPO or KTO with Unsloth, follow the steps below:
DPO (Direct Preference Optimization), ORPO (Odds Ratio Preference Optimization), PPO, KTO and reward modelling all work with Unsloth.
We have Google Colab notebooks for reproducing GRPO, ORPO, DPO Zephyr, KTO and SimPO.
We're also in 🤗Hugging Face's official docs! We're on the SFT docs and the DPO docs.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"   # Optional: set GPU device ID

from unsloth import FastLanguageModel, PatchDPOTrainer
from unsloth import is_bfloat16_supported
PatchDPOTrainer()

import torch
from transformers import TrainingArguments
from trl import DPOTrainer

max_seq_length = 2048   # Choose any sequence length

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/zephyr-sft-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = None,
    load_in_4bit = True,
)

# Do model patching and add fast LoRA weights
model = FastLanguageModel.get_peft_model(
    model,
    r = 64,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 64,
    lora_dropout = 0,    # Supports any, but = 0 is optimized
    bias = "none",       # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth",   # True or "unsloth" for very long context
    random_state = 3407,
    max_seq_length = max_seq_length,
)

dpo_trainer = DPOTrainer(
    model = model,
    ref_model = None,
    args = TrainingArguments(
        per_device_train_batch_size = 4,
        gradient_accumulation_steps = 8,
        warmup_ratio = 0.1,
        num_train_epochs = 3,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        seed = 42,
        output_dir = "outputs",
    ),
    beta = 0.1,
    train_dataset = YOUR_DATASET_HERE,
    # eval_dataset = YOUR_DATASET_HERE,
    tokenizer = tokenizer,
    max_length = 1024,
    max_prompt_length = 512,
)
dpo_trainer.train()

Learn all about Reinforcement Learning (RL) and how to train your own DeepSeek-R1 reasoning model with Unsloth using GRPO. A complete guide from beginner to advanced.
Reinforcement Learning is where an "agent" learns to make decisions by interacting with an environment and receiving feedback in the form of rewards or penalties.
Action: What the model generates (e.g. a sentence).
Reward: A signal indicating how good or bad the model's action was (e.g. did the response follow instructions? was it helpful?).
Environment: The scenario or task the model is working on (e.g. answering a user’s question).
For advanced GRPO documentation on batching, generation and training parameters, read our guide!
What is RL? RLVR? PPO? GRPO? RLHF? RFT? Is "Luck is All You Need?" for RL?
What is an environment? Agent? Action? Reward function? Rewards?
This article covers everything (from beginner to advanced) you need to know about GRPO, Reinforcement Learning (RL) and reward functions, along with tips, and the basics of using GRPO with Unsloth. If you're looking for a step-by-step tutorial for using GRPO, see our guide here.
The goal of RL is to:
Increase the chance of seeing "good" outcomes.
Decrease the chance of seeing "bad" outcomes.
That's it! There are intricacies in what "good" and "bad" mean, how we go about "increasing" or "decreasing" them, and what "outcomes" even means.
For example, in the Pacman game:
The environment is the game world.
The actions you can take are UP, LEFT, RIGHT and DOWN.
The rewards are good if you eat a cookie, or bad if you hit one of the squiggly enemies.
In RL, you can't know the "best action" to take, but you can observe intermediate steps, or the final game state (win or lose).
Another example: imagine you are given the question "What is 2 + 2?" (answer: 4). An unaligned language model will spit out 3, 4, C, D, -10 - literally anything.
Numbers are better than C or D right?
Getting 3 is better than say 8 right?
Getting 4 is definitely correct.
We just designed a reward function!
OpenAI popularized the concept of RLHF (Reinforcement Learning from Human Feedback), where we train an "agent" to produce outputs to a question (the state) that are rated more useful by human beings.
The thumbs up and down in ChatGPT for example can be used in the RLHF process.
The clip(..., 1-e, 1+e) term is used to stop PPO from taking overly large update steps. There is also a KL term with beta set to > 0 to force the model not to deviate too far from the reference policy.
In order to do RLHF, PPO (Proximal Policy Optimization) was developed. The agent is the language model in this case. In fact, it's composed of 3 systems:
The Generating Policy (current trained model)
The Reference Policy (original model)
The Value Model (average reward estimator)
We use the Reward Model to calculate the reward for the current environment, and our goal is to maximize this!
The formula for PPO looks quite complicated because it was designed to be stable. See the talk we gave at AI Engineer 2025 about RL for more in-depth math derivations of PPO.
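For reference, one common way to write the clipped PPO surrogate objective (with the KL penalty against the reference policy mentioned above) is:

$$L(\theta) = \mathbb{E}_t\!\left[\min\!\big(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_t\big)\right] - \beta\,\mathrm{KL}\!\left[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right],\qquad r_t(\theta) = \frac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t\mid s_t)}$$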
DeepSeek developed GRPO (Group Relative Policy Optimization) to train their R1 reasoning models. The key differences to PPO are:
The Value Model is removed, replaced with statistics from calling the reward model multiple times.
The Reward Model is removed and replaced with a custom reward function, for which RLVR can be used.
This means GRPO is extremely efficient. Previously PPO needed to train multiple models - now with the reward model and value model removed, we can save memory and speed up everything.
RLVR (Reinforcement Learning with Verifiable Rewards) allows us to reward the model based on tasks with easy to verify solutions. For example:
Maths equations can be easily verified. Eg 2+2 = 4.
Code output can be verified as having executed correctly or not.
Designing verifiable reward functions can be tough, and so most examples are math or code.
Use-cases for GRPO aren't just code or math: its reasoning process can enhance tasks like email automation, database retrieval, law, and medicine, greatly improving accuracy based on your dataset and reward function. The trick is to define a rubric, i.e. a list of smaller verifiable rewards, rather than one final, all-consuming reward. OpenAI popularized this approach in their reinforcement fine-tuning (RFT) offering, for example.
Why "Group Relative"?
GRPO removes the value model entirely, but we still need to estimate the "average reward" given the current state.
The trick is to sample the LLM! We then calculate the average reward through statistics of the sampling process across multiple different questions.
For example for "What is 2+2?" we sample 4 times. We might get 4, 3, D, C. We then calculate the reward for each of these answers, then calculate the average reward and standard deviation, then Z-score standardize this!
This creates the advantages A, which we use in place of the value model. This saves a lot of memory!
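A tiny numerical sketch of that standardization step (the reward values here are made up):
import statistics

# Hypothetical rewards for the 4 sampled answers "4", "3", "D", "C"
rewards = [5.0, 1.5, 0.0, 0.0]

mean = statistics.mean(rewards)   # 1.625
std = statistics.stdev(rewards)   # sample standard deviation
advantages = [(r - mean) / std for r in rewards]
print(advantages)   # the correct answer "4" gets a positive advantage, the others negative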
The trick of RL is you need 2 things only:
A question or instruction eg "What is 2+2?" "Create a Flappy Bird game in Python"
A reward function and verifier to verify if the output is good or bad.
With only these 2, we can essentially call a language model an infinite number of times until we get a good answer. For example for "What is 2+2?", an untrained bad language model will output:
0, cat, -10, 1928, 3, A, B, 122, 17, 182, 172, A, C, BAHS, %$, #, 9, -192, 12.31, then suddenly 4.
The reward signal was 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, then suddenly 1.
So by luck and by chance, RL managed to find the correct answer across multiple rollouts. Our goal is we want to see the good answer 4 more, and the rest (the bad answers) much less.
So the goal of RL is to be patient - in the limit, if the probability of the correct answer is at least a small number (not zero), it's just a waiting game - you will 100% for sure encounter the correct answer in the limit.
So I like to call it as "Luck Is All You Need" for RL.
Well a better phrase is "Patience is All You Need" for RL.
RL essentially provides us a trick - instead of simply waiting for infinity, we do get "bad signals", i.e. bad answers, and we can use them to "guide" the model to try not to generate bad solutions. This means that although you waited a very long time for a "good" answer to pop up, the model has already been changed to try its best not to output bad answers.
In the "What is 2+2?" example: 0, cat, -10, 1928, 3, A, B, 122, 17, 182, 172, A, C, BAHS, %$, #, 9, -192, 12.31, then suddenly 4.
Since we got bad answers, RL will influence the model to try NOT to output bad answers. This means over time, we are carefully "pruning" or moving the model's output distribution away from bad answers. This means RL is efficient, since we are NOT just waiting for infinity, but we are actively trying to "push" the model to go as much as possible to the "correct answer space".
If the probability is always 0, then RL will never work. This is also why people like to do RL from an already instruction finetuned model, which can partially follow instructions reasonably well - this boosts the probability most likely above 0.
With 15GB VRAM, Unsloth allows you to transform any model up to 17B parameters, like Llama 3.1 (8B), Phi-4 (14B), Mistral (7B) or Qwen2.5 (7B), into a reasoning model.
Unsloth now supports RL for Vision/multimodal models!
Minimum requirement: Just 5GB VRAM is enough to train your own reasoning model locally (for any model with 1.5B parameters or less)
- GSPO - new
- Vision GSPO - new
- Advanced GRPO
NEW! We now support GSPO and most other new GRPO techniques. You can play with the following arguments in GRPOConfig to enable:
epsilon=0.2,
epsilon_high=0.28, # one sided
delta=1.5, # two sided
loss_type='gspo',
# or:
loss_type='grpo',
# or:
loss_type='dr_grpo',
mask_truncated_completions=True,
If you're not getting any reasoning, make sure you have enough training steps and ensure your reward function/verifier is working. We provide examples for reward functions here.
Previous demonstrations show that you could achieve your own "aha" moment with Qwen2.5 (3B) - but it required 2xA100 GPUs (160GB VRAM). Now, with Unsloth, you can achieve the same "aha" moment using just a single 5GB VRAM GPU.
Previously, GRPO was only supported for full fine-tuning, but we've made it work with QLoRA and LoRA
On 20K context lengths for example with 8 generations per prompt, Unsloth uses only 54.3GB of VRAM for Llama 3.1 (8B), whilst standard implementations (+ Flash Attention 2) take 510.8GB (90% less for Unsloth).
Please note, this isn’t fine-tuning DeepSeek’s R1 distilled models or using distilled data from R1 for tuning which Unsloth already supported. This is converting a standard model into a full-fledged reasoning model using GRPO.
In a test example, even though we only trained Phi-4 with 100 steps using GRPO, the results are already clear. The model without GRPO does not have the thinking token, whilst the one trained with GRPO does and also has the correct answer.
For a tutorial on how to transform any open LLM into a reasoning model using Unsloth & GRPO, see here.
For advanced GRPO documentation on batching, generation and training parameters, read our guide!
For each question-answer pair, the model generates multiple possible responses (e.g., 8 variations).
Each response is evaluated using reward functions.
Training Steps:
If you have 300 rows of data, that's 300 training steps (or 900 steps if trained for 3 epochs).
You can increase the number of generated responses per question (e.g., from 8 to 16).
The model learns by updating its weights every step.
If you're having issues with your GRPO model not learning, we'd highly recommend using our Advanced GRPO notebooks, as they have a much better reward function and you should see results faster and more frequently.
Wait for at least 300 steps for the reward to actually increase. In order to get decent results, you may need to train for a minimum of 12 hours (this is how GRPO works), but keep in mind this isn't compulsory as you can stop at any time.
For optimal results have at least 500 rows of data. You can try with even 10 rows of data but it's better to have more.
Each training run will always be different depending on your model, data, reward function/verifier etc. so though 300 steps is what we wrote as the minimum, sometimes it might be 1000 steps or more. So, it depends on various factors.
If you're using GRPO with Unsloth locally, please "pip install diffusers" as well if you get an error. Please also use the latest version of vLLM.
It’s advised to apply GRPO to a model at least 1.5B in parameters to correctly generate thinking tokens as smaller models may not.
For GRPO's GPU VRAM requirements for QLoRA 4-bit, the general rule is the model parameters = the amount of VRAM you will need (you can use less VRAM but this just to be safe). The more context length you set, the more VRAM. LoRA 16-bit will use at minimum 4x more VRAM.
Continuous fine-tuning is possible and you can just leave GRPO running in the background.
In the example notebooks, we use the GSM8K dataset, the current most popular choice for R1-style training.
If you’re using a base model, ensure you have a chat template.
The more you train with GRPO the better. The best part of GRPO is you don't even need that much data. All you need is a great reward function/verifier and the more time spent training, the better your model will get. Expect your reward vs step to increase as time progresses like this:
Training loss tracking for GRPO is now built directly into Unsloth, eliminating the need for external tools like wandb etc. It contains full logging details for all reward functions now including the total aggregated reward function itself.
In Reinforcement Learning, a Reward Function and a Verifier serve distinct roles in evaluating a model's output. In general, you could interpret them as the same thing; technically they're not, but the distinction matters less because they are usually used in conjunction with each other.
Verifier:
Determines whether the generated response is correct or incorrect.
It does not assign a numerical score—it simply verifies correctness.
Example: If a model generates "5" for "2+2", the verifier checks and labels it as "wrong" (since the correct answer is 4).
Verifiers can also execute code (e.g., in Python) to validate logic, syntax, and correctness without needing manual evaluation.
Reward Function:
Converts verification results (or other criteria) into a numerical score.
Example: If an answer is wrong, it might assign a penalty (-1, -2, etc.), while a correct answer could get a positive score (+1, +2).
It can also penalize based on criteria beyond correctness, such as excessive length or poor readability.
Key Differences:
A Verifier checks correctness but doesn’t score.
A Reward Function assigns a score but doesn’t necessarily verify correctness itself.
A Reward Function can use a Verifier, but they are technically not the same.
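A minimal sketch of that split (names, scores and the length threshold are illustrative):
def verifier(response: str, correct_answer: str) -> bool:
    # Only checks correctness - no score is assigned
    return response.strip() == correct_answer.strip()

def reward_function(response: str, correct_answer: str) -> float:
    # Turns the verifier's result (plus other criteria) into a numerical score
    score = 2.0 if verifier(response, correct_answer) else -1.0
    if len(response) > 200:   # penalize excessively long answers
        score -= 1.0
    return score

print(reward_function("4", "4"))   # 2.0
print(reward_function("5", "4"))   # -1.0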
GRPO's primary goal is to maximize reward and learn how an answer was derived, rather than simply memorizing and reproducing responses from its training data.
With every training step, GRPO adjusts model weights to maximize the reward. This process fine-tunes the model incrementally.
Regular fine-tuning (without GRPO) only maximizes next-word prediction probability but does not optimize for a reward. GRPO optimizes for a reward function rather than just predicting the next word.
You can reuse data across multiple epochs.
Default reward functions can be predefined to be used on a wide array of use cases or you can ask ChatGPT/local model to generate them for you.
There’s no single correct way to design reward functions or verifiers - the possibilities are endless. However, they must be well-designed and meaningful, as poorly crafted rewards can unintentionally degrade model performance.
You can refer to the examples below. You can input your generations into an LLM like ChatGPT 4o or Llama 3.1 (8B) and design a reward function and verifier to evaluate them. For example, feed your generations into an LLM of your choice and set a rule: "If the answer sounds too robotic, deduct 3 points." This helps refine outputs based on quality criteria.
Question: "2 + 2"
Answer: "4"
Reward Function 1:
If a number is detected → +1
If no number is detected → -1
Reward Function 2:
If the number matches the correct answer → +3
If incorrect → -3
Total Reward: Sum of all reward functions
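Those two example reward functions might look roughly like this in code (a sketch, not the notebook's implementation):
import re

def number_reward(response: str) -> float:
    # Reward Function 1: is there any number at all?
    return 1.0 if re.search(r"-?\d+", response) else -1.0

def correctness_reward(response: str, correct_answer: str = "4") -> float:
    # Reward Function 2: does the answer match the correct one?
    return 3.0 if response.strip() == correct_answer else -3.0

def total_reward(response: str) -> float:
    return number_reward(response) + correctness_reward(response)

print(total_reward("4"))   # +1 + 3 = 4
print(total_reward("3"))   # +1 - 3 = -2
print(total_reward("C"))   # -1 - 3 = -4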
Question: Inbound email
Answer: Outbound email
Reward Functions:
If the answer contains a required keyword → +1
If the answer exactly matches the ideal response → +1
If the response is too long → -1
If the recipient's name is included → +1
If a signature block (phone, email, address) is present → +1
If you’ve checked out our Advanced GRPO Colab Notebook, you’ll notice we’ve created a custom proximity-based reward function built completely from scratch, which is designed to reward answers that are closer to the correct one. This flexible function can be applied across a wide range of tasks.
In our examples, we enable reasoning in Qwen3 (Base) and guide it toward specific tasks
Apply Pre-finetuning strategies to avoid GRPO’s default tendency to just learn formatting
Boost evaluation accuracy with regex-based matching
Create custom GRPO templates beyond generic tags like <think>, e.g., <start_working_out></end_working_out>
Apply proximity-based scoring — models get more reward for closer answers (e.g., predicting 9 instead of 10 is better than 3) while outliers are penalized
In our other examples, we use existing GSM8K reward functions by @willccbb which is popular and shown to be quite effective:
correctness_reward_func – Rewards exact label matches.
int_reward_func – Encourages integer-only answers.
soft_format_reward_func – Checks structure but allows minor newline mismatches.
strict_format_reward_func – Ensures response structure matches the prompt, including newlines.
xmlcount_reward_func – Ensures exactly one of each XML tag in the response.
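As a flavour of what these look like, here is a sketch in the style of correctness_reward_func (the notebook's actual implementation may differ slightly); it relies on the extract_xml_answer helper defined earlier:
def correctness_reward_func(prompts, completions, answer, **kwargs) -> list[float]:
    # Compare the extracted <answer> block of each completion against the label
    responses = [completion[0]["content"] for completion in completions]
    extracted = [extract_xml_answer(r) for r in responses]
    return [2.0 if r == a else 0.0 for r, a in zip(extracted, answer)]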
You can now use vLLM directly in your finetuning stack, which allows for much more throughput and allows you to finetune and do inference on the model at the same time! On 1x A100 40GB, expect 4,000 tokens/s or so with Unsloth's dynamic 4-bit quant of Llama 3.2 3B Instruct. On a 16GB Tesla T4 (free Colab GPU), you can get 300 tokens/s.

We also magically removed double memory usage when loading vLLM and Unsloth together, allowing for savings of 5GB or so for Llama 3.1 8B and 3GB for Llama 3.2 3B. Unsloth could originally finetune Llama 3.3 70B Instruct on 1x 48GB GPU, with the Llama 3.3 70B weights taking 40GB of VRAM. If we do not remove double memory usage, then we'll need >= 80GB of VRAM when loading Unsloth and vLLM together. But with Unsloth, you can still finetune and get the benefits of fast inference in one package in under 48GB of VRAM!

To use fast inference, first install vllm, and instantiate Unsloth with fast_inference:
pip install unsloth vllm
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-3B-Instruct",
    fast_inference = True,
)
model.fast_generate(["Hello!"])

When you're using Unsloth to do GRPO, we smartly reduce VRAM usage by over 90% when compared to standard implementations with Flash Attention 2 by using multiple tricks! On 20K context lengths for example with 8 generations per prompt, Unsloth uses only 54.3GB of VRAM for Llama 3.1 8B, whilst standard implementations take 510.8GB (90% less for Unsloth).
For GRPO's GPU VRAM requirements for QLoRA 4-bit, the general rule is the model parameters = the amount of VRAM you will need (you can use less VRAM but this just to be safe). The more context length you set, the more VRAM. LoRA 16-bit will use at minimum 4x more VRAM.
Our new memory efficient linear kernels for GRPO slashes memory usage by 8x or more. This shaves 68.5GB of memory, whilst being actually faster through the help of torch.compile!
We leverage our smart Unsloth gradient checkpointing algorithm which we released a while ago. It smartly offloads intermediate activations to system RAM asynchronously whilst being only 1% slower. This shaves 52GB of memory.
Unsloth also uses the same GPU / CUDA memory space as the underlying inference engine (vLLM), unlike implementations in other packages. This shaves 16GB of memory.
| Metric | Unsloth | Standard + FA2 |
| --- | --- | --- |
| Training Memory Cost (GB) | 42GB | 414GB |
| GRPO Memory Cost (GB) | 9.8GB | 78.3GB |
| Inference Cost (GB) | 0GB | 16GB |
| Inference KV Cache for 20K context length (GB) | 2.5GB | 2.5GB |
| Total Memory Usage | 54.33GB (90% less) | 510.8GB |
In typical standard GRPO implementations, you need to create 2 logits of size (8, 20K) to calculate the GRPO loss. This takes 2 * 2 bytes * 8 (num generations) * 20K (context length) * 128256 (vocabulary size) = 78.3GB of VRAM.
Unsloth shaves 8x memory usage for long context GRPO, so we need only an extra 9.8GB in extra VRAM for 20K context lengths!
We also need memory for the KV cache in 16-bit. Llama 3.1 8B has 32 layers, and both K and V are 1024 in size. So memory usage for a 20K context length = 2 * 2 bytes * 32 layers * 20K context length * 1024 = 2.5GB per batch. We would set the batch size for vLLM to 8, but we shall leave it at 1 for our calculations to save VRAM. Otherwise you would need 20GB for the KV cache.
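As a quick sanity check of those numbers (assuming 20K means 20 × 1024 tokens and GB means GiB here):
GiB = 1024 ** 3

# GRPO logits: 2 tensors x 2 bytes (fp16) x 8 generations x 20K context x 128256 vocab
logits_bytes = 2 * 2 * 8 * (20 * 1024) * 128_256
print(f"{logits_bytes / GiB:.1f} GiB")   # ~78.3 GiB

# KV cache: K and V x 2 bytes (fp16) x 32 layers x 20K context x 1024, per batch
kv_bytes = 2 * 2 * 32 * (20 * 1024) * 1024
print(f"{kv_bytes / GiB:.1f} GiB")       # ~2.5 GiB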
Nathan Lambert's RLHF Book is a must! https://rlhfbook.com/c/11-policy-gradients.html
Yannic Kilcher's GRPO Youtube video is also a must! https://www.youtube.com/watch?v=bAWV_yrqx4w
We did a 3-hour workshop at AI Engineer World's Fair 2025. Slides and other material are at https://docs.unsloth.ai/ai-engineers-2025
Advanced GRPO notebook via Unsloth. https://docs.unsloth.ai/basics/reinforcement-learning-guide/tutorial-train-your-own-reasoning-model-with-grpo
GRPO from a base model notebook: https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen3_(4B)-GRPO.ipynb
Advanced documentation settings when using Unsloth with GRPO.
Detailed guides on doing GRPO with Unsloth for Batching, Generation & Training Parameters:
beta (float, default 0.0): KL coefficient.
0.0 ⇒ no reference model loaded (lower memory, faster).
Higher beta constrains the policy to stay closer to the ref policy.
num_iterations (int, default 1): PPO epochs per batch (μ in the algorithm).
Replays data within each gradient accumulation step; e.g., 2 = two forward passes per accumulation step.
epsilon (float, default 0.2): Clipping value for token-level probability ratios (with the default ε, the ratio is clipped to roughly [0.8, 1.2]).
delta (float, optional): Enables upper clipping bound for two-sided GRPO when set. If None, standard GRPO clipping is used. Recommended > 1 + ε when enabled (per INTELLECT-2 report).
epsilon_high (float, optional): Upper-bound epsilon; defaults to epsilon if unset. DAPO recommends 0.28.
importance_sampling_level (“token” | “sequence”, default "token"):
"token": raw per-token ratios (one weight per token).
"sequence": average per-token ratios to a single sequence-level ratio.
GSPO shows sequence-level sampling often gives more stable training for sequence-level rewards.
reward_weights (list[float], optional): One weight per reward. If None, all weights = 1.0.
scale_rewards (str|bool, default "group"):
True or "group": scale by std within each group (unit variance in group).
"batch": scale by std across the entire batch (per PPO-Lite).
False or "none": no scaling. Dr. GRPO recommends not scaling to avoid difficulty bias from std scaling.
loss_type (str, default "dapo"):
"grpo": normalizes over sequence length (length bias; not recommended).
"dr_grpo": normalizes by a global constant (introduced in Dr. GRPO; removes length bias). Constant ≈ max_completion_length.
"dapo" (default): normalizes by active tokens in the global accumulated batch (introduced in DAPO; removes length bias).
"bnpo": normalizes by active tokens in the local batch only (results can vary with local batch size; equals GRPO when per_device_train_batch_size == 1).
mask_truncated_completions (bool, default False):
When True, truncated completions are excluded from loss (recommended by DAPO for stability).
Note: There are some KL issues with this flag, so we recommend disabling it.
# If mask_truncated_completions is enabled, zero out truncated completions in completion_mask
if self.mask_truncated_completions:
    truncated_completions = ~is_eos.any(dim=1)
    completion_mask = completion_mask * (~truncated_completions).unsqueeze(1).int()

This can zero out all completion_mask entries when many completions are truncated, making n_mask_per_reward = 0 and causing KL to become NaN.
vllm_importance_sampling_correction (bool, default True):
Applies Truncated Importance Sampling (TIS) to correct off-policy effects when generation (e.g., vLLM / fast_inference) differs from training backend.
In Unsloth, this is auto-set to True if you’re using vLLM/fast_inference; otherwise False.
vllm_importance_sampling_cap (float, default 2.0):
Truncation parameter C for TIS; sets an upper bound on the importance sampling ratio to improve stability.
dtype: when choosing between float16 and bfloat16, see FP16 vs BF16 for RL.
temperature (float, defaults to 1.0):
Temperature for sampling. The higher the temperature, the more random the completions. Make sure you use a relatively high (1.0) temperature to have diversity in generations which helps learning.
top_p (float, optional, defaults to 1.0):
Float that controls the cumulative probability of the top tokens to consider. Must be in (0, 1]. Set to 1.0 to consider all tokens.
top_k (int, optional):
Number of highest probability vocabulary tokens to keep for top-k-filtering. If None, top-k-filtering is disabled and all tokens are considered.
min_p (float, optional):
Minimum token probability, which will be scaled by the probability of the most likely token. It must be a value between 0.0 and 1.0. Typical values are in the 0.01-0.2 range.
repetition_penalty (float, optional, defaults to 1.0):
Float that penalizes new tokens based on whether they appear in the prompt and the generated text so far. Values > 1.0 encourage the model to use new tokens, while values < 1.0 encourage the model to repeat tokens.
steps_per_generation: (int, optional):
Number of steps per generation. If None, it defaults to gradient_accumulation_steps. Mutually exclusive with generation_batch_size.
train_batch_size: Number of samples per process per step.
If this integer is less than num_generations, it will default to num_generations.
steps_per_generation: Number of microbatches that contribute to one generation’s loss calculation (forward passes only).
A new batch of data is generated every steps_per_generation steps; backpropagation timing depends on gradient_accumulation_steps.
num_processes: Number of distributed training processes (e.g., GPUs / workers).
gradient_accumulation_steps (aka gradient_accumulation): Number of microbatches to accumulate before applying backpropagation and optimizer update.
Effective batch size:
effective_batch_size = steps_per_generation * num_processes * train_batch_size
Total samples contributing to gradients before an update (across all processes and steps).
Optimizer steps per generation:
optimizer_steps_per_generation = steps_per_generation / gradient_accumulation_steps
Example: 4 / 2 = 2.
num_generations: Number of generations produced per prompt (applied after computing effective_batch_size).
The number of unique prompts in a generation cycle is:
unique_prompts = effective_batch_size / num_generations
Must be > 2 for GRPO to work.
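A quick worked example of these formulas, using the first configuration shown below:
steps_per_generation = 4
num_processes = 1
per_device_train_batch_size = 3
gradient_accumulation_steps = 2
num_generations = 3

effective_batch_size = steps_per_generation * num_processes * per_device_train_batch_size
optimizer_steps_per_generation = steps_per_generation / gradient_accumulation_steps
unique_prompts = effective_batch_size / num_generations

print(effective_batch_size, optimizer_steps_per_generation, unique_prompts)   # 12 2.0 4.0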
The tables below illustrate how batches flow through steps, when optimizer updates occur, and how new batches are generated.
num_gpus = 1
per_device_train_batch_size = 3
gradient_accumulation_steps = 2
steps_per_generation = 4
effective_batch_size = 4 * 3 * 1 = 12
num_generations = 3

Generation cycle A

| Step | Batch (prompt IDs) | Optimizer |
| --- | --- | --- |
| 0 | [0,0,0] | |
| 1 | [1,1,1] | optimizer update (accum = 2 reached) |
| 2 | [2,2,2] | |
| 3 | [3,3,3] | optimizer update |

Generation cycle B

| Step | Batch (prompt IDs) | Optimizer |
| --- | --- | --- |
| 0 | [4,4,4] | |
| 1 | [5,5,5] | optimizer update (accum = 2 reached) |
| 2 | [6,6,6] | |
| 3 | [7,7,7] | optimizer update |
num_gpus = 1
per_device_train_batch_size = 3
steps_per_generation = gradient_accumulation_steps = 4
effective_batch_size = 4 * 3 * 1 = 12
num_generations = 3

Generation cycle A

| Step | Batch (prompt IDs) | Optimizer |
| --- | --- | --- |
| 0 | [0,0,0] | |
| 1 | [1,1,1] | |
| 2 | [2,2,2] | |
| 3 | [3,3,3] | optimizer update (accum = 4 reached) |

Generation cycle B

| Step | Batch (prompt IDs) | Optimizer |
| --- | --- | --- |
| 0 | [4,4,4] | |
| 1 | [5,5,5] | |
| 2 | [6,6,6] | |
| 3 | [7,7,7] | optimizer update (accum = 4 reached) |
num_gpus = 1
per_device_train_batch_size = 3
steps_per_generation = gradient_accumulation_steps = 4
effective_batch_size = 4 * 3 * 1 = 12
num_generations = 4
unique_prompts = effective_batch_size / num_generations = 3

Generation cycle A

| Step | Batch (prompt IDs) | Optimizer |
| --- | --- | --- |
| 0 | [0,0,0] | |
| 1 | [0,1,1] | |
| 2 | [1,1,3] | |
| 3 | [3,3,3] | optimizer update (accum = 4 reached) |

Generation cycle B

| Step | Batch (prompt IDs) | Optimizer |
| --- | --- | --- |
| 0 | [4,4,4] | |
| 1 | [4,5,5] | |
| 2 | [5,5,6] | |
| 3 | [6,6,6] | optimizer update (accum = 4 reached) |
num_gpus = 1
per_device_train_batch_size = 6
steps_per_generation = gradient_accumulation_steps = 2
effective_batch_size = 2 * 6 * 1 = 12
num_generations = 3
unique_prompts = 4

Generation cycle A

| Step | Batch (prompt IDs) | Optimizer |
| --- | --- | --- |
| 0 | [0,0,0, 1,1,1] | |
| 1 | [2,2,2, 3,3,3] | optimizer update (accum = 2 reached) |

Generation cycle B

| Step | Batch (prompt IDs) | Optimizer |
| --- | --- | --- |
| 0 | [4,4,4, 5,5,5] | |
| 1 | [6,6,6, 7,7,7] | optimizer update (accum = 2 reached) |
effective_batch_size = steps_per_generation * num_processes * train_batch_size
optimizer_steps_per_generation = steps_per_generation / gradient_accumulation_steps
unique_prompts = effective_batch_size / num_generations # must be > 2

We're excited to introduce more efficient reinforcement learning (RL) in Unsloth with multiple algorithmic advancements:
1.2 to 1.7x increased context lengths with no slowdown and no extra memory usage!
10% faster RL training runs with revamped kernels and async data movements
2x faster torch.compile times during model loading
Unsloth already increases RL training speed, context window and reduces VRAM usage by 50–90% vs. all other setups with FA2, but now Unsloth's Standby improves this even further. Our Standby feature uniquely limits speed degradation compared to other implementations and sometimes makes training even faster!
Now, Qwen3-32B LoRA 16-bit can attain 6,144 context lengths vs 3,600 (1.7x longer) before on 1xH100 80GB GPU. Llama-3.1-8B QLoRA 4bit can attain 47,500 lengths vs 42,000 before (1.13x longer).
We made RL runs 10% faster through various kernel optimizations, and removed the LoRA communication channel between the CPU and GPU when switching from training to inference mode. Finally, we used custom torch.compile flags to make vLLM's rollout faster by 10%, and reduced compilation time by 2x.
To enable Unsloth's Standby feature, set the environment variable UNSLOTH_VLLM_STANDBY before any Unsloth import. Then set gpu_memory_utilization = 0.95 and that's it!
import os
os.environ["UNSLOTH_VLLM_STANDBY"] = "1"

from unsloth import FastLanguageModel
import torch

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Qwen3-8B-Base",
    max_seq_length = 2048,   # Can increase for longer reasoning traces
    load_in_4bit = False,    # False for LoRA 16bit
    fast_inference = True,
    max_lora_rank = 32,      # Larger rank = smarter, but slower
    gpu_memory_utilization = 0.95,
)

With Unsloth's new RL improvements, you NEVER have to worry about tuning or setting gpu_memory_utilization ever again - simply set it to 90% or 95% of GPU utilization - 100% sadly won't work since some space is needed for small tensors. Previously one had to tune it from 30% to 95% - no more now! Set it to the maximum and Unsloth will handle the rest!
GRPO (and many RL variants) rely heavily on generation, which is primarily powered by vLLM. But this comes with a steep cost since it requires constant GPU memory for weights, activations, and the KV cache.
Inference takes a lot of VRAM
Whilst Training also uses VRAM!
This means RL needs to keep 2 sets of VRAM / memory on the GPU at the same time:
Inference engine (has model weights, KV cache)
Training engine (has model weights, activations, gradients, optimizer states)
Current RL frameworks have to split an 80GB GPU roughly 50/50: 50% for inference and 50% for training. Moving weights from training mode to inference mode can also take quite some time.

| | Inference engine (40GB) | Training engine (40GB) |
| --- | --- | --- |
| Model Weights | 16GB | 16GB |
| KV Cache | 24GB | |
| Activations, Gradients, Optimizer States | | 24GB |
Previous Unsloth versions already smartly optimize the above: we share vLLM's weight space directly, which removes the double memory usage of the model weights. This frees up 16GB of space, for example, which can be used to increase context length or the speed of generation. We also don't need to do memory movements, which makes training faster.

| | Inference engine | Training engine |
| --- | --- | --- |
| Model Weights | 16GB (shared) | <<< shared |
| KV Cache | 24GB + 8GB = 32GB | |
| Activations, Gradients, Optimizer States | | 24GB + 8GB = 32GB |
But we can go further - we first note RL does inference then training then inference then training etc.
This means the memory space for inference and training can in theory be re-used, since inference and training are separate modes - this is where vLLM's sleep mode feature comes in, which has 2 options:
level = 1 copies weights to the CPU and deletes KV cache
level = 2 deletes weights and deletes KV cache
But remember, in Unsloth we share vLLM's memory space for the weights. This means we need a new way to delete the KV cache while skipping deletion of the weights - we call this Unsloth Standby.

| | Inference engine | Training engine |
| --- | --- | --- |
| Model Weights | 16GB (shared) | <<< shared |
| Multi-purpose (64GB space) | KV Cache | Activations, Gradients, Optimizer States |
To enable this, simply add the below to all RL / GRPO training runs before any Unsloth import:
import os
os.environ["UNSLOTH_VLLM_STANDBY"] = "1"Here you will find out how we benchmarked memory usage and context length for GRPO. Note that we do 2 generations per prompt because for GRPO to work, we need at least 2 generations for which to calculate the sample mean and variance. Without 2 generations, the standard deviation of one sample is 0. This causes the advantages which uses this: (reward - mean)/std to be undefined.
This means for GRPO specifically, a maximum context length of 6,144 for Qwen-3 32B is actually 6,144 multiplied by 2 generations ie 12,288 in length.
We provide experiments for Llama-3.1 8B on both LoRA (16bit) and QLoRA (4bit) below:
If you notice any training time differences, it isn’t much. In our apples to apples comparison we noticed <1% training time slowdowns or even speedups which can be attributed to margin of error.
We also theorize speedups are possible due to reduced memory pressure, so there might be less memory cleanup on the CUDA memory allocator side.
In the above image, you see the difference between baseline and standby mode on a single T4 GPU for Qwen 3 4B. We can stretch vLLM's gpu_memory_utilization as high as 0.95 without worrying that it will affect training. This means you can fit higher context length sequences and more sequences can be processed. In the first case, for example, we have enough memory to fit and process 32K-length sequences (provided training allows), whereas previously any inputs longer than 2K would potentially not fit and end up causing OOMs (out of memory).
| Configuration | Speed | Peak memory | Notes |
| --- | --- | --- | --- |
| standby=True, vllm_gpu_util=0.95, num_gen=2, grad_acc_steps=2 | 40 steps in 40 minutes | 14.5 GiB (set by vllm_gpu_util) | Enough to fit a 32K KV cache with 2-4K chunks, or a 16K KV cache + 16K chunks |
| standby=True, vllm_gpu_util=0.9, num_gen=2, grad_acc_steps=2 | 32 steps in 40 minutes | 13.8 GiB (set by vllm_gpu_util) | Approximately enough to fit a ~28K KV cache with 2-4K chunks, or a 15K KV cache + 15K chunks |
| standby=False, vllm_gpu_util=0.9, num_gen=2, grad_acc_steps=2 | Model loads but can't train: even a batch size of 1 doesn't fit | OOM | |
| standby=False, vllm_gpu_util=0.8, num_gen=2, grad_acc_steps=2 | Model loads but can't train: even a batch size of 1 doesn't fit | OOM | |
| standby=False, vllm_gpu_util=0.7, num_gen=2, grad_acc_steps=2 | Trains fine: 28 steps in 39 minutes | ~15.1 GiB | Any slightly longer input will result in OOM on Colab |
| standby=True, vllm_gpu_util=0.7, num_gen=2, grad_acc_steps=2 | Trains fine: 29 steps in 40 minutes | 13 GiB (mostly around 10-11 GiB) | At the same config, standby saves 2 GiB (~15% memory); savings can be higher for longer sequences |
Benchmark setup: Qwen2.5-14B-Instruct on an NVIDIA H100 80GB PCIe, 32,768 max sequence length, 8 generations, batch size 4.
In our collapsible results below, you can see there is a 9GiB difference in the peak memory used (note that 90% of the time, the GPU memory usage is equal to the peak memory in our case). To put things into perspective, using TRL and LoRA we were able to only fine-tune an 8B parameter model with a context length of 1024 at max (32x less). Anything with higher sequence length (with similar configuration) results in the process failing with OOM.
The image below shows how standby compares against non standby training with Unsloth. It is averaged over 3 runs to make sure the metrics aren’t noisy. In fact, if you zoom in close enough, you’d see that enabling standby makes it faster as well, probably due to less memory pressure as discussed before.
In our previous experiments on A100 40GB GPU with Qwen-2.5-3b-instruct and 8 generations per sample, we observed that without standby, the GRPO training (model loaded in 16bit, LoRA, only weights trainable), we could only fit 6K sequence lengths. With our standby feature, we were able to fit 10K and beyond! For comparison TRL can only give you context lengths of up to 1K while holding the same batch size.
We now select better compilation flags and reduce compile times by 50% or more. We also managed to dynamically patch any vLLM version to handle gc.collect better for backwards compatibility reasons, as inspired from this vLLM pull request. This reduces compilation times from 2 minutes to under 40 seconds.
We also optimized torch.compile flags and tried turning on some flags - unfortunately combo_kernels and multi_kernel could not function correctly on vLLM 0.10 and Torch 2.8/2.9 nightly and coordinate_descent_tuning made autotuning all kernels dramatically slower. It used to compile in under a minute, but enabling it took over 13 minutes and more, with minimal performance gains.
All our GRPO notebooks have Unsloth Standby on by default and all optimizations! See https://docs.unsloth.ai/get-started/unsloth-notebooks for all our GRPO notebooks, or try the below:
Qwen3 (4B) - Advanced GRPO LoRA
DeepSeek-R1-0528-Qwen3 (8B) (for multilingual usecases)
Llama 3.2 (3B) - Advanced GRPO LoRA