Beginner's Guide to transforming a model like Llama 3.1 (8B) into a reasoning model by using Unsloth and GRPO.
DeepSeek developed GRPO (Group Relative Policy Optimization) to train their R1 reasoning models.
These instructions are for our pre-made Google Colab notebooks. If you are installing Unsloth locally, you can also copy our notebooks inside your favorite code editor. We'll be using any of these notebooks:
- GSPO
- Vision GSPO
- Advanced GRPO
If you're using our Colab notebook, click Runtime > Run all. We'd highly recommend checking out our Fine-tuning Guide before getting started.
If installing locally, ensure you have the correct requirements and use pip install unsloth on Linux or follow our Windows install instructions.
Before we get started, it is recommended to learn more about GRPO, reward functions and how they work. Read more about them including tips & tricks here.
You will also need enough VRAM. In general, the number of model parameters (in billions) roughly equals the amount of VRAM (in GB) you will need. In Colab, we are using their free 16GB VRAM GPUs, which can train any model up to 16B parameters.
We have already pre-selected optimal settings for the best results. You can change the model to any of those listed in our supported models, but we would not recommend changing other settings if you're a beginner.
For advanced GRPO documentation on batching, generation and training parameters, read our guide!
We have pre-selected OpenAI's GSM8K dataset, which contains grade school math problems, but you could change it to your own or any public one on Hugging Face. You can read more about datasets here.
Your dataset should still have at least 2 columns for question and answer pairs. However, the answer must not reveal the reasoning behind how it was derived from the question. See below for an example:
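For instance, a minimal question/answer pair could look like the following (an illustrative example, not taken from the actual dataset):
example_row = {
    "question": "Sarah has 12 apples and buys 15 more. How many apples does she have now?",
    "answer": "27",   # final answer only - no reasoning steps revealed
}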
We'll structure the data to prompt the model to articulate its reasoning before delivering an answer. To start, we'll establish a clear format for both prompts and responses.
# Define the system prompt that instructs the model to use a specific format
SYSTEM_PROMPT = """
Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>
"""
XML_COT_FORMAT = """\
<reasoning>
{reasoning}
</reasoning>
<answer>
{answer}
</answer>
"""Now, to prepare the dataset:
import re
from datasets import load_dataset, Dataset
# Helper functions to extract answers from different formats
def extract_xml_answer(text: str) -> str:
answer = text.split("<answer>")[-1]
answer = answer.split("</answer>")[0]
return answer.strip()
def extract_hash_answer(text: str) -> str | None:
if "####" not in text:
return None
return text.split("####")[1].strip()
# Function to prepare the GSM8K dataset
def get_gsm8k_questions(split="train") -> Dataset:
data = load_dataset("openai/gsm8k", "main")[split]
data = data.map(
lambda x: {
"prompt": [
{"role": "system", "content": SYSTEM_PROMPT},
{"role": "user", "content": x["question"]},
],
"answer": extract_hash_answer(x["answer"]),
}
)
return data
dataset = get_gsm8k_questions()The dataset is prepared by extracting the answers and formatting them as structured strings.
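As a quick sanity check (illustrative only), extract_hash_answer strips the worked solution before the #### marker and keeps just the final label:
raw_answer = "Natalia sold 48/2 = 24 clips in May. 48 + 24 = 72. #### 72"
print(extract_hash_answer(raw_answer))        # -> "72"
print(extract_hash_answer("no marker here"))  # -> None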
Reward functions/verifiers let us know whether the model is doing well or not according to the dataset you have provided. Each generation is assessed against the average score of the other generations in its group. You can create your own reward functions; however, we have already pre-selected Will's GSM8K reward functions for you, which give us 5 different ways to reward each generation.
You can input your generations into an LLM like ChatGPT 4o or Llama 3.1 (8B) and design a reward function and verifier to evaluate them. For example, feed your generations into an LLM of your choice and set a rule: "If the answer sounds too robotic, deduct 3 points." This helps refine outputs based on quality criteria. See examples of what they can look like here.
Example Reward Function for an Email Automation Task:
Question: Inbound email
Answer: Outbound email
Reward Functions:
If the answer contains a required keyword → +1
If the answer exactly matches the ideal response → +1
If the response is too long → -1
If the recipient's name is included → +1
If a signature block (phone, email, address) is present → +1
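Below is a rough sketch of how that rubric could be expressed in code. The keyword choice, length threshold and signature check are illustrative assumptions, not part of any notebook:
def email_reward(response: str, ideal_response: str, recipient_name: str) -> float:
    score = 0.0
    if "meeting" in response.lower():               # required keyword (example choice)
        score += 1.0
    if response.strip() == ideal_response.strip():  # exact match with the ideal reply
        score += 1.0
    if len(response.split()) > 200:                 # response is too long
        score -= 1.0
    if recipient_name in response:                  # recipient's name is included
        score += 1.0
    if "Phone:" in response and "Email:" in response:  # crude signature-block check
        score += 1.0
    return score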
We have pre-selected hyperparameters for the most optimal results; however, you can change them. Read all about parameters here. For advanced GRPO documentation on batching, generation and training parameters, read our guide!
The GRPOConfig defines key hyperparameters for training:
use_vllm: Activates fast inference using vLLM.
learning_rate: Determines the model's learning speed.
num_generations: Specifies the number of completions generated per prompt.
max_steps: Sets the total number of training steps.
NEW! We now support DAPO, Dr. GRPO and most other new GRPO techniques. You can play with the following arguments in GRPOConfig to enable:
epsilon=0.2,
epsilon_high=0.28, # one sided
delta=1.5, # two sided
loss_type='bnpo',
# or:
loss_type='grpo',
# or:
loss_type='dr_grpo',
# or:
loss_type='dapo',
mask_truncated_completions=True,
You should see the reward increase over time. We would recommend training for at least 300 steps, which may take around 30 minutes; however, for optimal results, you should train for longer.
If you're having issues with your GRPO model not learning, we'd highly recommend using our Advanced GRPO notebooks, as they have a much better reward function and you should see results faster and more frequently.
You will also see sample answers, which let you see how the model is learning. Some may contain steps, XML tags, attempts, etc., and the idea is that as training progresses, the model gets scored higher and higher until it produces the outputs we desire, with long reasoning chains in its answers.
Run your model by clicking the play button. In the first example there is usually no reasoning in the answer. In order to see the reasoning, we first need to save the LoRA weights we just trained with GRPO using:
model.save_lora("grpo_saved_lora")
Then we load the LoRA and test it. Our reasoning model is much better - it's not always correct, since we only trained it for an hour or so - it'll be better if we extend the sequence length and train for longer!
You can then save your model to GGUF, Ollama etc. by following our guide here.
If you are still not getting any reasoning, you may have either trained for too few steps or your reward function/verifier was not optimal.
We have multiple options for saving your fine-tuned model, but we'll focus on the easiest and most popular approaches, which you can read more about here.
Saving in 16-bit Precision
You can save the model with 16-bit precision using the following command:
# Save to 16-bit precision
model.save_pretrained_merged("model", tokenizer, save_method="merged_16bit")
Pushing to Hugging Face Hub
To share your model, we’ll push it to the Hugging Face Hub using the push_to_hub_merged method. This allows saving the model in multiple quantization formats.
# Push to Hugging Face Hub (requires a token)
model.push_to_hub_merged(
"your-username/model-name", tokenizer, save_method="merged_16bit", token="your-token"
)
Saving in GGUF Format for llama.cpp
Unsloth also supports saving in GGUF format, making it compatible with llama.cpp and Ollama.
model.push_to_hub_gguf(
"your-username/model-name",
tokenizer,
quantization_method=["q4_k_m", "q8_0", "q5_k_m"],
token="your-token",
)
Once saved in GGUF format, the model can be easily deployed in lightweight environments using llama.cpp or used in other inference engines.
Here are some video tutorials created by amazing YouTubers who we think are fantastic!
Learn what Reward Hacking is in Reinforcement Learning and how to counter it.
The ultimate goal of RL is to maximize some reward (say speed, revenue, or some other metric). But RL can cheat. When the RL algorithm learns a trick or exploits something to increase the reward without actually doing the task at hand, this is called "Reward Hacking".
It's the reason models learn to modify unit tests to pass coding challenges, and these are critical blockers for real world deployment. Some other good examples are from Wikipedia.
Can you counter reward hacking? Yes! In our free gpt-oss RL notebook we explore how to counter reward hacking in a code generation setting and showcase tangible solutions to common error modes. We saw the model edit the timing function, outsource to other libraries, cache the results, and outright cheat. After countering, the result is our model generates genuinely optimized matrix multiplication kernels, not clever cheats.
Some common examples of reward hacking during RL include:
RL learns to use NumPy, Torch and other libraries, which call optimized CUDA kernels. We can stop the RL algorithm from calling optimized code by inspecting whether the generated code imports non-standard Python libraries.
RL learns to cache the result of the output, and to find the actual output by inspecting Python global variables.
We can stop the RL algorithm from using cached data by wiping the cache with a large fake matrix. We also have to benchmark carefully with multiple loops and turns.
RL learns to edit the timing function to make it report 0 elapsed time. We can stop the RL algorithm from using global or cached variables by restricting its locals and globals. We also use exec to create the function, so we have to save the output to an empty dict. Finally, we disallow global variable access via types.FunctionType(f.__code__, {}), as sketched below.
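As a rough illustration of that last counter-measure (a sketch only - the gpt-oss notebook's actual implementation differs), you can exec the generated code into an empty namespace and then rebuild the function with empty globals so it cannot touch imports, caches or the timing function:
import types

def build_sandboxed_fn(code_str: str, fn_name: str):
    namespace = {}                                     # empty namespace for exec
    exec(compile(code_str, "<generated>", "exec"), namespace)
    f = namespace[fn_name]
    # Rebuild the function with an empty globals dict so it cannot read or
    # write module-level state (no cached results, no imported libraries).
    return types.FunctionType(f.__code__, {})

generated = "def add(a, b):\n    total = a + b\n    return total"
sandboxed_add = build_sandboxed_fn(generated, "add")
print(sandboxed_add(2, 3))   # 5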
Train with GSPO (Group Sequence Policy Optimization) RL in Unsloth.
We're introducing support for GSPO (Group Sequence Policy Optimization), a variant of GRPO made by the Qwen team at Alibaba. They observed that GRPO applies importance weights to each token, even though advantages do not inherently scale or change per token. This led to the creation of GSPO, which assigns importance based on the sequence likelihood rather than the individual token likelihoods.
Use our free GSPO notebooks to get started.
Enable GSPO in Unsloth by setting importance_sampling_level = "sequence" in the GRPO config. The difference between these two algorithms can be seen below, both from the GSPO paper from Qwen and Alibaba:
In Equation 1, it can be seen that the advantages scale each of the rows of token logprobs before that tensor is summed. Essentially, each token is given the same scaling, even though that scaling was assigned to the entire sequence rather than each individual token. A simple diagram of this can be seen below:
Equation 2 shows that the per-token logprob ratios for each sequence are summed and exponentiated first, and only the resulting sequence-level ratios are multiplied row-wise by the advantages.
Enabling GSPO is simple, all you need to do is set the importance_sampling_level = "sequence" flag in the GRPO config.
The paper "Defeating the Training-Inference Mismatch via FP16" (https://arxiv.org/pdf/2510.26788) shows how using float16 is better than bfloat16.
The paper demonstrates that using float16 precision can be dramatically better than using bfloat16 when doing reinforcement learning.
In fact the longer the generation, the worse it gets when using bfloat16:
We did an investigation and do find float16 to be more stable than bfloat16, with much smaller gradient norms.
Note that older vLLM versions (before 0.11.0) had broken attention mechanisms for A100 and similar GPUs. Please update vLLM! We also disable cascade attention in vLLM by default during Unsloth reinforcement learning if we detect an older vLLM version.
Different hardware also changes results, where newer and more expensive GPUs have less KL difference between the inference and training sides:
To use float16 precision in Unsloth GRPO and RL, you just need to set dtype = torch.float16 and we'll take care of the rest!
training_args = GRPOConfig(
    output_dir = "vlm-grpo-unsloth",
    per_device_train_batch_size = 8,
    gradient_accumulation_steps = 4,
    learning_rate = 5e-6,
    adam_beta1 = 0.9,
    adam_beta2 = 0.99,
    weight_decay = 0.1,
    warmup_ratio = 0.1,
    lr_scheduler_type = "cosine",
    optim = "adamw_8bit",
    # beta = 0.00,
    epsilon = 3e-4,
    epsilon_high = 4e-4,
    num_generations = 8,
    max_prompt_length = 1024,
    max_completion_length = 1024,
    log_completions = False,
    max_grad_norm = 0.1,
    temperature = 0.9,
    num_train_epochs = 2,   # For a quick test run, increase for full training
    report_to = "none",     # Set to "wandb" if you want to log to Weights & Biases

    # GSPO is below:
    importance_sampling_level = "sequence",

    # Dr GRPO / DAPO etc.
    loss_type = "dr_grpo",
)
from unsloth import FastLanguageModel
import torch

max_seq_length = 2048   # Can increase for longer reasoning traces
lora_rank = 32          # Larger rank = smarter, but slower

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Qwen3-4B-Base",
    max_seq_length = max_seq_length,
    load_in_4bit = False,           # False for LoRA 16bit
    fast_inference = True,          # Enable vLLM fast inference
    max_lora_rank = lora_rank,
    gpu_memory_utilization = 0.9,   # Reduce if out of memory
    dtype = torch.float16,          # Use torch.float16 or torch.bfloat16
)
To use the reward modelling functions for DPO, GRPO, ORPO or KTO with Unsloth, follow the steps below:
DPO (Direct Preference Optimization), ORPO (Odds Ratio Preference Optimization), PPO, KTO and reward modelling all work with Unsloth.
We have Google Colab notebooks for reproducing GRPO, ORPO, DPO Zephyr, KTO and SimPO.
We're also in 🤗Hugging Face's official docs! We're on the SFT docs and the DPO docs.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"   # Optional: set GPU device ID

from unsloth import FastLanguageModel, PatchDPOTrainer
from unsloth import is_bfloat16_supported
PatchDPOTrainer()

import torch
from transformers import TrainingArguments
from trl import DPOTrainer

max_seq_length = 2048   # Choose any sequence length

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/zephyr-sft-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = None,
    load_in_4bit = True,
)

# Do model patching and add fast LoRA weights
model = FastLanguageModel.get_peft_model(
    model,
    r = 64,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 64,
    lora_dropout = 0,    # Supports any, but = 0 is optimized
    bias = "none",       # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth",   # True or "unsloth" for very long context
    random_state = 3407,
    max_seq_length = max_seq_length,
)

dpo_trainer = DPOTrainer(
    model = model,
    ref_model = None,
    args = TrainingArguments(
        per_device_train_batch_size = 4,
        gradient_accumulation_steps = 8,
        warmup_ratio = 0.1,
        num_train_epochs = 3,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        seed = 42,
        output_dir = "outputs",
    ),
    beta = 0.1,
    train_dataset = YOUR_DATASET_HERE,
    # eval_dataset = YOUR_DATASET_HERE,
    tokenizer = tokenizer,
    max_length = 1024,
    max_prompt_length = 512,
)
dpo_trainer.train()

Learn all about Reinforcement Learning (RL) and how to train your own DeepSeek-R1 reasoning model with Unsloth using GRPO. A complete guide from beginner to advanced.
Reinforcement Learning is where an "agent" learns to make decisions by interacting with an environment and receiving feedback in the form of rewards or penalties.
Action: What the model generates (e.g. a sentence).
Reward: A signal indicating how good or bad the model's action was (e.g. did the response follow instructions? was it helpful?).
Environment: The scenario or task the model is working on (e.g. answering a user’s question).
For advanced GRPO documentation on batching, generation and training parameters, read our guide!
What is RL? RLVR? PPO? GRPO? RLHF? RFT? Is "Luck is All You Need?" for RL?
What is an environment? Agent? Action? Reward function? Rewards?
This article covers everything (from beginner to advanced) you need to know about GRPO, Reinforcement Learning (RL) and reward functions, along with tips, and the basics of using GRPO with Unsloth. If you're looking for a step-by-step tutorial for using GRPO, see our guide here.
The goal of RL is to:
Increase the chance of seeing "good" outcomes.
Decrease the chance of seeing "bad" outcomes.
That's it! There are intricacies in what "good" and "bad" mean, how we go about "increasing" or "decreasing" them, and what "outcomes" even means.
For example, in the Pacman game:
The environment is the game world.
The actions you can take are UP, LEFT, RIGHT and DOWN.
The rewards are good if you eat a cookie, or bad if you hit one of the squiggly enemies.
In RL, you can't know the "best action" to take, but you can observe intermediate steps, or the final game state (win or lose).
Another example: imagine you are given the question "What is 2 + 2?" (answer: 4). An unaligned language model will spit out 3, 4, C, D, -10 - literally anything.
Numbers are better than C or D right?
Getting 3 is better than say 8 right?
Getting 4 is definitely correct.
We just designed a reward function!
OpenAI popularized the concept of RLHF (Reinforcement Learning from Human Feedback), where we train an "agent" to produce outputs to a question (the state) that are rated more useful by human beings.
The thumbs up and down in ChatGPT for example can be used in the RLHF process.
The clip(..., 1-e, 1+e) term is used to stop PPO from taking overly large update steps. There is also a KL term with beta set to > 0 to force the model not to deviate too far from the reference policy.
In order to do RLHF, PPO (Proximal Policy Optimization) was developed. The agent is the language model in this case. In fact, it's composed of 3 systems:
The Generating Policy (current trained model)
The Reference Policy (original model)
The Value Model (average reward estimator)
We use the Reward Model to calculate the reward for the current environment, and our goal is to maximize this!
The formula for PPO looks quite complicated because it was designed to be stable. See the talk we gave at AI Engineer 2025 about RL for more in-depth math derivations of PPO.
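For reference, one common way to write the clipped PPO surrogate objective (with the KL penalty against the reference policy mentioned above) is:

$$L(\theta) = \mathbb{E}_t\!\left[\min\!\big(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_t\big)\right] - \beta\,\mathrm{KL}\!\left[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right],\qquad r_t(\theta) = \frac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t\mid s_t)}$$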
DeepSeek developed GRPO (Group Relative Policy Optimization) to train their R1 reasoning models. The key differences to PPO are:
The Value Model is removed, replaced with statistics from calling the reward model multiple times.
The Reward Model is removed and replaced with a custom reward function, for which RLVR can be used.
This means GRPO is extremely efficient. Previously PPO needed to train multiple models - now with the reward model and value model removed, we can save memory and speed up everything.
RLVR (Reinforcement Learning with Verifiable Rewards) allows us to reward the model based on tasks with easy to verify solutions. For example:
Maths equations can be easily verified. Eg 2+2 = 4.
Code output can be verified as having executed correctly or not.
Designing verifiable reward functions can be tough, and so most examples are math or code.
Use-cases for GRPO aren't just code or math: its reasoning process can enhance tasks like email automation, database retrieval, law, and medicine, greatly improving accuracy based on your dataset and reward function. The trick is to define a rubric, i.e. a list of smaller verifiable rewards, rather than one final, all-consuming reward. OpenAI popularized this approach in their reinforcement fine-tuning (RFT) offering, for example.
Why "Group Relative"?
GRPO removes the value model entirely, but we still need to estimate the "average reward" given the current state.
The trick is to sample the LLM! We then calculate the average reward through statistics of the sampling process across multiple different questions.
For example for "What is 2+2?" we sample 4 times. We might get 4, 3, D, C. We then calculate the reward for each of these answers, then calculate the average reward and standard deviation, then Z-score standardize this!
This creates the advantages A, which we use in place of the value model. This saves a lot of memory!
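A tiny numerical sketch of that standardization step (the reward values here are made up):
import statistics

# Hypothetical rewards for the 4 sampled answers "4", "3", "D", "C"
rewards = [5.0, 1.5, 0.0, 0.0]

mean = statistics.mean(rewards)   # 1.625
std = statistics.stdev(rewards)   # sample standard deviation
advantages = [(r - mean) / std for r in rewards]
print(advantages)   # the correct answer "4" gets a positive advantage, the others negative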
The trick of RL is you need 2 things only:
A question or instruction eg "What is 2+2?" "Create a Flappy Bird game in Python"
A reward function and verifier to verify if the output is good or bad.
With only these 2, we can essentially call a language model an infinite number of times until we get a good answer. For example for "What is 2+2?", an untrained bad language model will output:
0, cat, -10, 1928, 3, A, B, 122, 17, 182, 172, A, C, BAHS, %$, #, 9, -192, 12.31, then suddenly 4.
The reward signal was 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, then suddenly 1.
So by luck and by chance, RL managed to find the correct answer across multiple rollouts. Our goal is we want to see the good answer 4 more, and the rest (the bad answers) much less.
So the goal of RL is to be patient - in the limit, if the probability of the correct answer is at least a small number (not zero), it's just a waiting game - you will 100% for sure encounter the correct answer in the limit.
So I like to call it as "Luck Is All You Need" for RL.
Well a better phrase is "Patience is All You Need" for RL.
RL essentially provides us a trick - instead of simply waiting for infinity, we do get "bad signals", i.e. bad answers, and we can use them to "guide" the model to try not to generate bad solutions. This means that although you waited a very long time for a "good" answer to pop up, the model has already been changed to try its best not to output bad answers.
In the "What is 2+2?" example: 0, cat, -10, 1928, 3, A, B, 122, 17, 182, 172, A, C, BAHS, %$, #, 9, -192, 12.31, then suddenly 4.
Since we got bad answers, RL will influence the model to try NOT to output bad answers. This means over time, we are carefully "pruning" or moving the model's output distribution away from bad answers. This means RL is efficient, since we are NOT just waiting for infinity, but we are actively trying to "push" the model to go as much as possible to the "correct answer space".
If the probability is always 0, then RL will never work. This is also why people like to do RL from an already instruction finetuned model, which can partially follow instructions reasonably well - this boosts the probability most likely above 0.
With 15GB VRAM, Unsloth allows you to transform any model up to 17B parameters, like Llama 3.1 (8B), Phi-4 (14B), Mistral (7B) or Qwen2.5 (7B), into a reasoning model.
Unsloth now supports RL for Vision/multimodal models!
Minimum requirement: Just 5GB VRAM is enough to train your own reasoning model locally (for any model with 1.5B parameters or less)
- GSPO - new
- Vision GSPO - new
- Advanced GRPO
NEW! We now support GSPO and most other new GRPO techniques. You can play with the following arguments in GRPOConfig to enable:
epsilon=0.2,
epsilon_high=0.28, # one sided
delta=1.5, # two sided
loss_type='gspo',
# or:
loss_type='grpo',
# or:
loss_type='dr_grpo',
mask_truncated_completions=True,
If you're not getting any reasoning, make sure you have enough training steps and ensure your reward function/verifier is working. We provide examples for reward functions here.
Previous demonstrations show that you could achieve your own "aha" moment with Qwen2.5 (3B) - but it required 2xA100 GPUs (160GB VRAM). Now, with Unsloth, you can achieve the same "aha" moment using just a single 5GB VRAM GPU.
Previously, GRPO was only supported for full fine-tuning, but we've made it work with QLoRA and LoRA
On 20K context lengths for example with 8 generations per prompt, Unsloth uses only 54.3GB of VRAM for Llama 3.1 (8B), whilst standard implementations (+ Flash Attention 2) take 510.8GB (90% less for Unsloth).
Please note, this isn’t fine-tuning DeepSeek’s R1 distilled models or using distilled data from R1 for tuning which Unsloth already supported. This is converting a standard model into a full-fledged reasoning model using GRPO.
In a test example, even though we only trained Phi-4 with 100 steps using GRPO, the results are already clear. The model without GRPO does not have the thinking token, whilst the one trained with GRPO does and also has the correct answer.
For a tutorial on how to transform any open LLM into a reasoning model using Unsloth & GRPO, see here.
For advanced GRPO documentation on batching, generation and training parameters, read our guide!
For each question-answer pair, the model generates multiple possible responses (e.g., 8 variations).
Each response is evaluated using reward functions.
Training Steps:
If you have 300 rows of data, that's 300 training steps (or 900 steps if trained for 3 epochs).
You can increase the number of generated responses per question (e.g., from 8 to 16).
The model learns by updating its weights every step.
If you're having issues with your GRPO model not learning, we'd highly recommend using our Advanced GRPO notebooks, as they have a much better reward function and you should see results faster and more frequently.
Wait for at least 300 steps for the reward to actually increase. In order to get decent results, you may need to train for a minimum of 12 hours (this is how GRPO works), but keep in mind this isn't compulsory as you can stop at any time.
For optimal results have at least 500 rows of data. You can try with even 10 rows of data but it's better to have more.
Each training run will always be different depending on your model, data, reward function/verifier etc. so though 300 steps is what we wrote as the minimum, sometimes it might be 1000 steps or more. So, it depends on various factors.
If you're using GRPO with Unsloth locally, please "pip install diffusers" as well if you get an error. Please also use the latest version of vLLM.
It’s advised to apply GRPO to a model at least 1.5B in parameters to correctly generate thinking tokens as smaller models may not.
For GRPO's GPU VRAM requirements for QLoRA 4-bit, the general rule is the model parameters = the amount of VRAM you will need (you can use less VRAM but this just to be safe). The more context length you set, the more VRAM. LoRA 16-bit will use at minimum 4x more VRAM.
Continuous fine-tuning is possible and you can just leave GRPO running in the background.
In the example notebooks, we use the GSM8K dataset, the current most popular choice for R1-style training.
If you’re using a base model, ensure you have a chat template.
The more you train with GRPO the better. The best part of GRPO is you don't even need that much data. All you need is a great reward function/verifier and the more time spent training, the better your model will get. Expect your reward vs step to increase as time progresses like this:
Training loss tracking for GRPO is now built directly into Unsloth, eliminating the need for external tools like wandb etc. It contains full logging details for all reward functions now including the total aggregated reward function itself.
In Reinforcement Learning, a Reward Function and a Verifier serve distinct roles in evaluating a model's output. In general, you could interpret them as the same thing; technically they're not, but the distinction matters less because they are usually used in conjunction with each other.
Verifier:
Determines whether the generated response is correct or incorrect.
It does not assign a numerical score—it simply verifies correctness.
Example: If a model generates "5" for "2+2", the verifier checks and labels it as "wrong" (since the correct answer is 4).
Verifiers can also execute code (e.g., in Python) to validate logic, syntax, and correctness without needing manual evaluation.
Reward Function:
Converts verification results (or other criteria) into a numerical score.
Example: If an answer is wrong, it might assign a penalty (-1, -2, etc.), while a correct answer could get a positive score (+1, +2).
It can also penalize based on criteria beyond correctness, such as excessive length or poor readability.
Key Differences:
A Verifier checks correctness but doesn’t score.
A Reward Function assigns a score but doesn’t necessarily verify correctness itself.
A Reward Function can use a Verifier, but they are technically not the same.
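A minimal sketch of that split (names, scores and the length threshold are illustrative):
def verifier(response: str, correct_answer: str) -> bool:
    # Only checks correctness - no score is assigned
    return response.strip() == correct_answer.strip()

def reward_function(response: str, correct_answer: str) -> float:
    # Turns the verifier's result (plus other criteria) into a numerical score
    score = 2.0 if verifier(response, correct_answer) else -1.0
    if len(response) > 200:   # penalize excessively long answers
        score -= 1.0
    return score

print(reward_function("4", "4"))   # 2.0
print(reward_function("5", "4"))   # -1.0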
GRPO's primary goal is to maximize reward and learn how an answer was derived, rather than simply memorizing and reproducing responses from its training data.
With every training step, GRPO adjusts model weights to maximize the reward. This process fine-tunes the model incrementally.
Regular fine-tuning (without GRPO) only maximizes next-word prediction probability but does not optimize for a reward. GRPO optimizes for a reward function rather than just predicting the next word.
You can reuse data across multiple epochs.
Default reward functions can be predefined to be used on a wide array of use cases or you can ask ChatGPT/local model to generate them for you.
There’s no single correct way to design reward functions or verifiers - the possibilities are endless. However, they must be well-designed and meaningful, as poorly crafted rewards can unintentionally degrade model performance.
You can refer to the examples below. You can input your generations into an LLM like ChatGPT 4o or Llama 3.1 (8B) and design a reward function and verifier to evaluate them. For example, feed your generations into an LLM of your choice and set a rule: "If the answer sounds too robotic, deduct 3 points." This helps refine outputs based on quality criteria.
Question: "2 + 2"
Answer: "4"
Reward Function 1:
If a number is detected → +1
If no number is detected → -1
Reward Function 2:
If the number matches the correct answer → +3
If incorrect → -3
Total Reward: Sum of all reward functions
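Those two example reward functions might look roughly like this in code (a sketch, not the notebook's implementation):
import re

def number_reward(response: str) -> float:
    # Reward Function 1: is there any number at all?
    return 1.0 if re.search(r"-?\d+", response) else -1.0

def correctness_reward(response: str, correct_answer: str = "4") -> float:
    # Reward Function 2: does the answer match the correct one?
    return 3.0 if response.strip() == correct_answer else -3.0

def total_reward(response: str) -> float:
    return number_reward(response) + correctness_reward(response)

print(total_reward("4"))   # +1 + 3 = 4
print(total_reward("3"))   # +1 - 3 = -2
print(total_reward("C"))   # -1 - 3 = -4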
Question: Inbound email
Answer: Outbound email
Reward Functions:
If the answer contains a required keyword → +1
If the answer exactly matches the ideal response → +1
If the response is too long → -1
If the recipient's name is included → +1
If a signature block (phone, email, address) is present → +1
If you’ve checked out our Advanced GRPO Colab Notebook, you’ll notice we’ve created a custom proximity-based reward function built completely from scratch, which is designed to reward answers that are closer to the correct one. This flexible function can be applied across a wide range of tasks.
In our examples, we enable reasoning in Qwen3 (Base) and guide it toward specific tasks
Apply Pre-finetuning strategies to avoid GRPO’s default tendency to just learn formatting
Boost evaluation accuracy with regex-based matching
Create custom GRPO templates beyond generic tags like <think>, e.g., <start_working_out></end_working_out>
Apply proximity-based scoring — models get more reward for closer answers (e.g., predicting 9 instead of 10 is better than 3) while outliers are penalized
In our other examples, we use existing GSM8K reward functions by @willccbb which is popular and shown to be quite effective:
correctness_reward_func – Rewards exact label matches.
int_reward_func – Encourages integer-only answers.
soft_format_reward_func – Checks structure but allows minor newline mismatches.
strict_format_reward_func – Ensures response structure matches the prompt, including newlines.
xmlcount_reward_func – Ensures exactly one of each XML tag in the response.
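As a flavour of what these look like, here is a sketch in the style of correctness_reward_func (the notebook's actual implementation may differ slightly); it relies on the extract_xml_answer helper defined earlier:
def correctness_reward_func(prompts, completions, answer, **kwargs) -> list[float]:
    # Compare the extracted <answer> block of each completion against the label
    responses = [completion[0]["content"] for completion in completions]
    extracted = [extract_xml_answer(r) for r in responses]
    return [2.0 if r == a else 0.0 for r, a in zip(extracted, answer)]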
You can now use vLLM directly in your finetuning stack, which allows for much more throughput and allows you to finetune and do inference on the model at the same time! On 1x A100 40GB, expect 4,000 tokens/s or so with Unsloth's dynamic 4-bit quant of Llama 3.2 3B Instruct. On a 16GB Tesla T4 (free Colab GPU), you can get 300 tokens/s.

We also magically removed double memory usage when loading vLLM and Unsloth together, allowing for savings of 5GB or so for Llama 3.1 8B and 3GB for Llama 3.2 3B. Unsloth could originally finetune Llama 3.3 70B Instruct on 1x 48GB GPU, with the Llama 3.3 70B weights taking 40GB of VRAM. If we do not remove double memory usage, then we'll need >= 80GB of VRAM when loading Unsloth and vLLM together. But with Unsloth, you can still finetune and get the benefits of fast inference in one package in under 48GB of VRAM!

To use fast inference, first install vllm, and instantiate Unsloth with fast_inference:
pip install unsloth vllm
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-3B-Instruct",
    fast_inference = True,
)
model.fast_generate(["Hello!"])

When you're using Unsloth to do GRPO, we smartly reduce VRAM usage by over 90% when compared to standard implementations with Flash Attention 2 by using multiple tricks! On 20K context lengths for example with 8 generations per prompt, Unsloth uses only 54.3GB of VRAM for Llama 3.1 8B, whilst standard implementations take 510.8GB (90% less for Unsloth).
For GRPO's GPU VRAM requirements for QLoRA 4-bit, the general rule is the model parameters = the amount of VRAM you will need (you can use less VRAM but this just to be safe). The more context length you set, the more VRAM. LoRA 16-bit will use at minimum 4x more VRAM.
Our new memory efficient linear kernels for GRPO slashes memory usage by 8x or more. This shaves 68.5GB of memory, whilst being actually faster through the help of torch.compile!
We leverage our smart Unsloth gradient checkpointing algorithm which we released a while ago. It smartly offloads intermediate activations to system RAM asynchronously whilst being only 1% slower. This shaves 52GB of memory.
Unsloth also uses the same GPU / CUDA memory space as the underlying inference engine (vLLM), unlike implementations in other packages. This shaves 16GB of memory.
| Metric | Unsloth | Standard + FA2 |
| --- | --- | --- |
| Training Memory Cost (GB) | 42GB | 414GB |
| GRPO Memory Cost (GB) | 9.8GB | 78.3GB |
| Inference Cost (GB) | 0GB | 16GB |
| Inference KV Cache for 20K context length (GB) | 2.5GB | 2.5GB |
| Total Memory Usage | 54.33GB (90% less) | 510.8GB |
In typical standard GRPO implementations, you need to create 2 logits of size (8, 20K) to calculate the GRPO loss. This takes 2 * 2 bytes * 8 (num generations) * 20K (context length) * 128256 (vocabulary size) = 78.3GB of VRAM.
Unsloth shaves 8x memory usage for long context GRPO, so we need only an extra 9.8GB in extra VRAM for 20K context lengths!
We also need memory for the KV cache in 16-bit. Llama 3.1 8B has 32 layers, and both K and V are 1024 in size. So memory usage for a 20K context length = 2 * 2 bytes * 32 layers * 20K context length * 1024 = 2.5GB per batch. We would set the batch size for vLLM to 8, but we shall leave it at 1 for our calculations to save VRAM. Otherwise you would need 20GB for the KV cache.
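As a quick sanity check of those numbers (assuming 20K means 20 × 1024 tokens and GB means GiB here):
GiB = 1024 ** 3

# GRPO logits: 2 tensors x 2 bytes (fp16) x 8 generations x 20K context x 128256 vocab
logits_bytes = 2 * 2 * 8 * (20 * 1024) * 128_256
print(f"{logits_bytes / GiB:.1f} GiB")   # ~78.3 GiB

# KV cache: K and V x 2 bytes (fp16) x 32 layers x 20K context x 1024, per batch
kv_bytes = 2 * 2 * 32 * (20 * 1024) * 1024
print(f"{kv_bytes / GiB:.1f} GiB")       # ~2.5 GiB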
Nathan Lambert's RLHF Book is a must! https://rlhfbook.com/c/11-policy-gradients.html
Yannic Kilcher's GRPO Youtube video is also a must! https://www.youtube.com/watch?v=bAWV_yrqx4w
We did a 3-hour workshop at AI Engineer World's Fair 2025. Slides and other material are at https://docs.unsloth.ai/ai-engineers-2025
Advanced GRPO notebook via Unsloth. https://docs.unsloth.ai/basics/reinforcement-learning-guide/tutorial-train-your-own-reasoning-model-with-grpo
GRPO from a base model notebook: https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen3_(4B)-GRPO.ipynb
Advanced documentation settings when using Unsloth with GRPO.
Detailed guides on doing GRPO with Unsloth for Batching, Generation & Training Parameters:
beta (float, default 0.0): KL coefficient.
0.0 ⇒ no reference model loaded (lower memory, faster).
Higher beta constrains the policy to stay closer to the ref policy.
num_iterations (int, default 1): PPO epochs per batch (μ in the algorithm).
Replays data within each gradient accumulation step; e.g., 2 = two forward passes per accumulation step.
epsilon (float, default 0.2): Clipping value for token-level probability ratios (with the default ε, the ratio is clipped to roughly [0.8, 1.2]).
delta (float, optional): Enables upper clipping bound for two-sided GRPO when set. If None, standard GRPO clipping is used. Recommended > 1 + ε when enabled (per INTELLECT-2 report).
epsilon_high (float, optional): Upper-bound epsilon; defaults to epsilon if unset. DAPO recommends 0.28.
importance_sampling_level (“token” | “sequence”, default "token"):
"token": raw per-token ratios (one weight per token).
"sequence": average per-token ratios to a single sequence-level ratio.
GSPO shows sequence-level sampling often gives more stable training for sequence-level rewards.
reward_weights (list[float], optional): One weight per reward. If None, all weights = 1.0.
scale_rewards (str|bool, default "group"):
True or "group": scale by std within each group (unit variance in group).
"batch": scale by std across the entire batch (per PPO-Lite).
False or "none": no scaling. Dr. GRPO recommends not scaling to avoid difficulty bias from std scaling.
loss_type (str, default "dapo"):
"grpo": normalizes over sequence length (length bias; not recommended).
"dr_grpo": normalizes by a global constant (introduced in Dr. GRPO; removes length bias). Constant ≈ max_completion_length.
"dapo" (default): normalizes by active tokens in the global accumulated batch (introduced in DAPO; removes length bias).
"bnpo": normalizes by active tokens in the local batch only (results can vary with local batch size; equals GRPO when per_device_train_batch_size == 1).
mask_truncated_completions (bool, default False):
When True, truncated completions are excluded from loss (recommended by DAPO for stability).
Note: There are some KL issues with this flag, so we recommend disabling it.
# If mask_truncated_completions is enabled, zero out truncated completions in completion_mask
if self.mask_truncated_completions:
    truncated_completions = ~is_eos.any(dim=1)
    completion_mask = completion_mask * (~truncated_completions).unsqueeze(1).int()

This can zero out all completion_mask entries when many completions are truncated, making n_mask_per_reward = 0 and causing KL to become NaN.
vllm_importance_sampling_correction (bool, default True):
Applies Truncated Importance Sampling (TIS) to correct off-policy effects when generation (e.g., vLLM / fast_inference) differs from training backend.
In Unsloth, this is auto-set to True if you’re using vLLM/fast_inference; otherwise False.
vllm_importance_sampling_cap (float, default 2.0):
Truncation parameter C for TIS; sets an upper bound on the importance sampling ratio to improve stability.
dtype: when choosing between float16 and bfloat16, see FP16 vs BF16 for RL.
temperature (float, defaults to 1.0):
Temperature for sampling. The higher the temperature, the more random the completions. Make sure you use a relatively high (1.0) temperature to have diversity in generations which helps learning.
top_p (float, optional, defaults to 1.0):
Float that controls the cumulative probability of the top tokens to consider. Must be in (0, 1]. Set to 1.0 to consider all tokens.
top_k (int, optional):
Number of highest probability vocabulary tokens to keep for top-k-filtering. If None, top-k-filtering is disabled and all tokens are considered.
min_p (float, optional):
Minimum token probability, which will be scaled by the probability of the most likely token. It must be a value between 0.0 and 1.0. Typical values are in the 0.01-0.2 range.
repetition_penalty (float, optional, defaults to 1.0):
Float that penalizes new tokens based on whether they appear in the prompt and the generated text so far. Values > 1.0 encourage the model to use new tokens, while values < 1.0 encourage the model to repeat tokens.
steps_per_generation: (int, optional):
Number of steps per generation. If None, it defaults to gradient_accumulation_steps. Mutually exclusive with generation_batch_size.
train_batch_size: Number of samples per process per step.
If this integer is less than num_generations, it will default to num_generations.
steps_per_generation: Number of microbatches that contribute to one generation’s loss calculation (forward passes only).
A new batch of data is generated every steps_per_generation steps; backpropagation timing depends on gradient_accumulation_steps.
num_processes: Number of distributed training processes (e.g., GPUs / workers).
gradient_accumulation_steps (aka gradient_accumulation): Number of microbatches to accumulate before applying backpropagation and optimizer update.
Effective batch size:
effective_batch_size = steps_per_generation * num_processes * train_batch_size
Total samples contributing to gradients before an update (across all processes and steps).
Optimizer steps per generation:
optimizer_steps_per_generation = steps_per_generation / gradient_accumulation_steps
Example: 4 / 2 = 2.
num_generations: Number of generations produced per prompt (applied after computing effective_batch_size).
The number of unique prompts in a generation cycle is:
unique_prompts = effective_batch_size / num_generations
Must be > 2 for GRPO to work.
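A quick worked example of these formulas, using the first configuration shown below:
steps_per_generation = 4
num_processes = 1
per_device_train_batch_size = 3
gradient_accumulation_steps = 2
num_generations = 3

effective_batch_size = steps_per_generation * num_processes * per_device_train_batch_size
optimizer_steps_per_generation = steps_per_generation / gradient_accumulation_steps
unique_prompts = effective_batch_size / num_generations

print(effective_batch_size, optimizer_steps_per_generation, unique_prompts)   # 12 2.0 4.0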
The tables below illustrate how batches flow through steps, when optimizer updates occur, and how new batches are generated.
num_gpus = 1
per_device_train_batch_size = 3
gradient_accumulation_steps = 2
steps_per_generation = 4
effective_batch_size = 4 * 3 * 1 = 12
num_generations = 3

Generation cycle A

| Step | Batch (prompt IDs) | Optimizer |
| --- | --- | --- |
| 0 | [0,0,0] | |
| 1 | [1,1,1] | optimizer update (accum = 2 reached) |
| 2 | [2,2,2] | |
| 3 | [3,3,3] | optimizer update |

Generation cycle B

| Step | Batch (prompt IDs) | Optimizer |
| --- | --- | --- |
| 0 | [4,4,4] | |
| 1 | [5,5,5] | optimizer update (accum = 2 reached) |
| 2 | [6,6,6] | |
| 3 | [7,7,7] | optimizer update |
num_gpus = 1
per_device_train_batch_size = 3
steps_per_generation = gradient_accumulation_steps = 4
effective_batch_size = 4 * 3 * 1 = 12
num_generations = 3

Generation cycle A

| Step | Batch (prompt IDs) | Optimizer |
| --- | --- | --- |
| 0 | [0,0,0] | |
| 1 | [1,1,1] | |
| 2 | [2,2,2] | |
| 3 | [3,3,3] | optimizer update (accum = 4 reached) |

Generation cycle B

| Step | Batch (prompt IDs) | Optimizer |
| --- | --- | --- |
| 0 | [4,4,4] | |
| 1 | [5,5,5] | |
| 2 | [6,6,6] | |
| 3 | [7,7,7] | optimizer update (accum = 4 reached) |
num_gpus = 1
per_device_train_batch_size = 3
steps_per_generation = gradient_accumulation_steps = 4
effective_batch_size = 4 * 3 * 1 = 12
num_generations = 4
unique_prompts = effective_batch_size / num_generations = 3

Generation cycle A

| Step | Batch (prompt IDs) | Optimizer |
| --- | --- | --- |
| 0 | [0,0,0] | |
| 1 | [0,1,1] | |
| 2 | [1,1,3] | |
| 3 | [3,3,3] | optimizer update (accum = 4 reached) |

Generation cycle B

| Step | Batch (prompt IDs) | Optimizer |
| --- | --- | --- |
| 0 | [4,4,4] | |
| 1 | [4,5,5] | |
| 2 | [5,5,6] | |
| 3 | [6,6,6] | optimizer update (accum = 4 reached) |
num_gpus = 1
per_device_train_batch_size = 6
steps_per_generation = gradient_accumulation_steps = 2
effective_batch_size = 2 * 6 * 1 = 12
num_generations = 3
unique_prompts = 4

Generation cycle A

| Step | Batch (prompt IDs) | Optimizer |
| --- | --- | --- |
| 0 | [0,0,0, 1,1,1] | |
| 1 | [2,2,2, 3,3,3] | optimizer update (accum = 2 reached) |

Generation cycle B

| Step | Batch (prompt IDs) | Optimizer |
| --- | --- | --- |
| 0 | [4,4,4, 5,5,5] | |
| 1 | [6,6,6, 7,7,7] | optimizer update (accum = 2 reached) |
effective_batch_size = steps_per_generation * num_processes * train_batch_size
optimizer_steps_per_generation = steps_per_generation / gradient_accumulation_steps
unique_prompts = effective_batch_size / num_generations # must be > 2

We're excited to introduce more efficient reinforcement learning (RL) in Unsloth with multiple algorithmic advancements:
1.2 to 1.7x increased context lengths with no slowdown and no extra memory usage!
10% faster RL training runs with revamped kernels and async data movements
2x faster torch.compile times during model loading
Unsloth already increases RL training speed, context window and reduces VRAM usage by 50–90% vs. all other setups with FA2, but now Unsloth's Standby improves this even further. Our Standby feature uniquely limits speed degradation compared to other implementations and sometimes makes training even faster!
Now, Qwen3-32B LoRA 16-bit can attain 6,144 context lengths vs 3,600 (1.7x longer) before on 1xH100 80GB GPU. Llama-3.1-8B QLoRA 4bit can attain 47,500 lengths vs 42,000 before (1.13x longer).
We made RL runs 10% faster through various kernel optimizations, and removed the LoRA communication channel between the CPU and GPU when switching from training to inference mode. Finally, we used custom torch.compile flags to make vLLM's rollout faster by 10%, and reduced compilation time by 2x.
To enable Unsloth's Standby feature, set the environment variable UNSLOTH_VLLM_STANDBY before any Unsloth import. Then set gpu_memory_utilization = 0.95 and that's it!
import os
os.environ["UNSLOTH_VLLM_STANDBY"] = "1"

from unsloth import FastLanguageModel
import torch

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Qwen3-8B-Base",
    max_seq_length = 2048,   # Can increase for longer reasoning traces
    load_in_4bit = False,    # False for LoRA 16bit
    fast_inference = True,
    max_lora_rank = 32,      # Larger rank = smarter, but slower
    gpu_memory_utilization = 0.95,
)

With Unsloth's new RL improvements, you NEVER have to worry about tuning or setting gpu_memory_utilization ever again - simply set it to 90% or 95% of GPU utilization - 100% sadly won't work since some space is needed for small tensors. Previously one had to tune it from 30% to 95% - no more now! Set it to the maximum and Unsloth will handle the rest!
GRPO (and many RL variants) rely heavily on generation, which is primarily powered by vLLM. But this comes with a steep cost since it requires constant GPU memory for weights, activations, and the KV cache.
Inference takes a lot of VRAM
Whilst Training also uses VRAM!
This means RL needs to keep 2 sets of VRAM / memory on the GPU at the same time:
Inference engine (has model weights, KV cache)
Training engine (has model weights, activations, gradients, optimizer states)
Current RL frameworks have to split an 80GB GPU roughly 50/50: 50% for inference and 50% for training. Moving weights from training mode to inference mode can also take quite some time.

| | Inference engine (40GB) | Training engine (40GB) |
| --- | --- | --- |
| Model Weights | 16GB | 16GB |
| KV Cache | 24GB | |
| Activations, Gradients, Optimizer States | | 24GB |
Previous Unsloth versions already smartly optimize the above: we share vLLM's weight space directly, which removes the double memory usage of the model weights. This frees up 16GB of space, for example, which can be used to increase context length or the speed of generation. We also don't need to do memory movements, which makes training faster.

| | Inference engine | Training engine |
| --- | --- | --- |
| Model Weights | 16GB (shared) | <<< shared |
| KV Cache | 24GB + 8GB = 32GB | |
| Activations, Gradients, Optimizer States | | 24GB + 8GB = 32GB |
But we can go further - we first note RL does inference then training then inference then training etc.
This means the memory space for inference and training can in theory be re-used, since inference and training are separate modes - this is where vLLM's sleep mode feature comes in, which has 2 options:
level = 1 copies weights to the CPU and deletes KV cache
level = 2 deletes weights and deletes KV cache
But remember, in Unsloth we share vLLM's memory space for the weights. This means we need a new way to delete the KV cache while skipping deletion of the weights - we call this Unsloth Standby.

| | Inference engine | Training engine |
| --- | --- | --- |
| Model Weights | 16GB (shared) | <<< shared |
| Multi-purpose (64GB space) | KV Cache | Activations, Gradients, Optimizer States |
To enable this, simply add the below to all RL / GRPO training runs before any Unsloth import:
import os
os.environ["UNSLOTH_VLLM_STANDBY"] = "1"Here you will find out how we benchmarked memory usage and context length for GRPO. Note that we do 2 generations per prompt because for GRPO to work, we need at least 2 generations for which to calculate the sample mean and variance. Without 2 generations, the standard deviation of one sample is 0. This causes the advantages which uses this: (reward - mean)/std to be undefined.
This means for GRPO specifically, a maximum context length of 6,144 for Qwen-3 32B is actually 6,144 multiplied by 2 generations ie 12,288 in length.
We provide experiments for Llama-3.1 8B on both LoRA (16bit) and QLoRA (4bit) below:
If you notice any training time differences, it isn’t much. In our apples to apples comparison we noticed <1% training time slowdowns or even speedups which can be attributed to margin of error.
We also theorize speedups are possible due to reduced memory pressure, so there might be less memory cleanup on the CUDA memory allocator side.
In the above image, you see the difference between baseline and standby mode on a single T4 GPU for Qwen 3 4B. We can stretch vLLM's gpu_memory_utilization as high as 0.95 without worrying that it will affect training. This means you can fit higher context length sequences and more sequences can be processed. In the first case, for example, we have enough memory to fit and process 32K-length sequences (provided training allows), whereas previously any inputs longer than 2K would potentially not fit and end up causing OOMs (out of memory).
| Configuration | Speed | Peak memory | Notes |
| --- | --- | --- | --- |
| standby=True, vllm_gpu_util=0.95, num_gen=2, grad_acc_steps=2 | 40 steps in 40 minutes | 14.5 GiB (set by vllm_gpu_util) | Enough to fit a 32K KV cache with 2-4K chunks, or a 16K KV cache + 16K chunks |
| standby=True, vllm_gpu_util=0.9, num_gen=2, grad_acc_steps=2 | 32 steps in 40 minutes | 13.8 GiB (set by vllm_gpu_util) | Approximately enough to fit a ~28K KV cache with 2-4K chunks, or a 15K KV cache + 15K chunks |
| standby=False, vllm_gpu_util=0.9, num_gen=2, grad_acc_steps=2 | Model loads but can't train: even a batch size of 1 doesn't fit | OOM | |
| standby=False, vllm_gpu_util=0.8, num_gen=2, grad_acc_steps=2 | Model loads but can't train: even a batch size of 1 doesn't fit | OOM | |
| standby=False, vllm_gpu_util=0.7, num_gen=2, grad_acc_steps=2 | Trains fine: 28 steps in 39 minutes | ~15.1 GiB | Any slightly longer input will result in OOM on Colab |
| standby=True, vllm_gpu_util=0.7, num_gen=2, grad_acc_steps=2 | Trains fine: 29 steps in 40 minutes | 13 GiB (mostly around 10-11 GiB) | At the same config, standby saves 2 GiB (~15% memory); savings can be higher for longer sequences |
Benchmark setup: Qwen2.5-14B-Instruct on an NVIDIA H100 80GB PCIe, 32,768 max sequence length, 8 generations, batch size 4.
In our collapsible results below, you can see there is a 9GiB difference in the peak memory used (note that 90% of the time, the GPU memory usage is equal to the peak memory in our case). To put things into perspective, using TRL and LoRA we were able to only fine-tune an 8B parameter model with a context length of 1024 at max (32x less). Anything with higher sequence length (with similar configuration) results in the process failing with OOM.
The image below shows how standby compares against non standby training with Unsloth. It is averaged over 3 runs to make sure the metrics aren’t noisy. In fact, if you zoom in close enough, you’d see that enabling standby makes it faster as well, probably due to less memory pressure as discussed before.
In our previous experiments on A100 40GB GPU with Qwen-2.5-3b-instruct and 8 generations per sample, we observed that without standby, the GRPO training (model loaded in 16bit, LoRA, only weights trainable), we could only fit 6K sequence lengths. With our standby feature, we were able to fit 10K and beyond! For comparison TRL can only give you context lengths of up to 1K while holding the same batch size.
We now select better compilation flags and reduce compile times by 50% or more. We also managed to dynamically patch any vLLM version to handle gc.collect better for backwards compatibility reasons, as inspired from this vLLM pull request. This reduces compilation times from 2 minutes to under 40 seconds.
We also optimized torch.compile flags and tried turning on some flags - unfortunately combo_kernels and multi_kernel could not function correctly on vLLM 0.10 and Torch 2.8/2.9 nightly and coordinate_descent_tuning made autotuning all kernels dramatically slower. It used to compile in under a minute, but enabling it took over 13 minutes and more, with minimal performance gains.
All our GRPO notebooks have Unsloth Standby on by default and all optimizations! See https://docs.unsloth.ai/get-started/unsloth-notebooks for all our GRPO notebooks, or try the below:
Qwen3 (4B) - Advanced GRPO LoRA
DeepSeek-R1-0528-Qwen3 (8B) (for multilingual usecases)
Llama 3.2 (3B) - Advanced GRPO LoRA