⚡Tutorial: Train your own Reasoning model with GRPO

Beginner's Guide to transforming a model like Llama 3.1 (8B) into a reasoning model by using Unsloth and GRPO.

DeepSeek developed GRPO (Group Relative Policy Optimization) to train their R1 reasoning models.

Quickstart

These instructions are for our pre-made Google Colab notebooks. If you are installing Unsloth locally, you can also copy our notebooks inside your favorite code editor. We'll be using any of these notebooks:

gpt-oss-20b - GSPO

Qwen2.5-VL - Vision GSPO

Gemma 3 (4B) - Vision GSPO

Qwen3 (4B) - Advanced

DeepSeek-R1-0528-Qwen3-8B

Llama 3.2 (3B) - Advanced

Install Unsloth

If you're using our Colab notebook, click Runtime > Run all. We'd highly recommend you checking out our Fine-tuning Guide before getting started.

If installing locally, ensure you have the correct requirements and use pip install unsloth on Linux or follow our Windows install instructions.

Learn about GRPO & Reward Functions

Before we get started, it is recommended to learn more about GRPO, reward functions and how they work. Read more about them including tips & tricks here.

You will also need enough VRAM. In general, model parameters = amount of VRAM you will need. In Colab, we are using their free 16GB VRAM GPUs which can train any model up to 16B in parameters.

Configure desired settings

We have pre-selected optimal settings for the best results for you already and you can change the model to whichever you want listed in our supported models. Would not recommend changing other settings if you're a beginner.

For advanced GRPO documentation on batching, generation and training parameters, read our guide!

Data preparation

We have pre-selected OpenAI's GSM8K dataset which contains grade school math problems but you could change it to your own or any public one on Hugging Face. You can read more about datasets here.

Your dataset should still have at least 2 columns for question and answer pairs. However the answer must not reveal the reasoning behind how it derived the answer from the question. See below for an example:

We'll structure the data to prompt the model to articulate its reasoning before delivering an answer. To start, we'll establish a clear format for both prompts and responses.

# Define the system prompt that instructs the model to use a specific format
SYSTEM_PROMPT = """
Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>
"""

XML_COT_FORMAT = """\
<reasoning>
{reasoning}
</reasoning>
<answer>
{answer}
</answer>
"""

Now, to prepare the dataset:

import re
from datasets import load_dataset, Dataset


# Helper functions to extract answers from different formats
def extract_xml_answer(text: str) -> str:
    answer = text.split("<answer>")[-1]
    answer = answer.split("</answer>")[0]
    return answer.strip()


def extract_hash_answer(text: str) -> str | None:
    if "####" not in text:
        return None
    return text.split("####")[1].strip()


# Function to prepare the GSM8K dataset
def get_gsm8k_questions(split="train") -> Dataset:
    data = load_dataset("openai/gsm8k", "main")[split]
    data = data.map(
        lambda x: {
            "prompt": [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": x["question"]},
            ],
            "answer": extract_hash_answer(x["answer"]),
        }
    )
    return data


dataset = get_gsm8k_questions()

The dataset is prepared by extracting the answers and formatting them as structured strings.

Reward Functions/Verifier

Reward Functions/Verifiers lets us know if the model is doing well or not according to the dataset you have provided. Each generation run will be assessed on how it performs to the score of the average of the rest of generations. You can create your own reward functions however we have already pre-selected them for you with Will's GSM8K reward functions. With this, we have 5 different ways which we can reward each generation.

You can input your generations into an LLM like ChatGPT 4o or Llama 3.1 (8B) and design a reward function and verifier to evaluate it. For example, feed your generations into a LLM of your choice and set a rule: "If the answer sounds too robotic, deduct 3 points." This helps refine outputs based on quality criteria. See examples of what they can look like here.

Example Reward Function for an Email Automation Task:

Question: Inbound email
Answer: Outbound email
Reward Functions:
- If the answer contains a required keyword → +1
- If the answer exactly matches the ideal response → +1
- If the response is too long → -1
- If the recipient's name is included → +1
- If a signature block (phone, email, address) is present → +1

Train your model

We have pre-selected hyperparameters for the most optimal results however you could change them. Read all about parameters here. For advanced GRPO documentation on batching, generation and training parameters, read our guide!

The GRPOConfig defines key hyperparameters for training:

use_vllm: Activates fast inference using vLLM.
learning_rate: Determines the model's learning speed.
num_generations: Specifies the number of completions generated per prompt.
max_steps: Sets the total number of training steps.

NEW! We now support DAPO, Dr. GRPO and most other new GRPO techniques. You can play with the following arguments in GRPOConfig to enable:

epsilon=0.2,
epsilon_high=0.28, # one sided
delta=1.5 # two sided

loss_type='bnpo',
# or:
loss_type='grpo',
# or:
loss_type='dr_grpo',
# or:
loss_type='dapo',

mask_truncated_completions=True,

You should see the reward increase overtime. We would recommend you train for at least 300 steps which may take 30 mins however, for optimal results, you should train for longer.

If you're having issues with your GRPO model not learning, we'd highly recommend to use our Advanced GRPO notebooks as it has a much better reward function and you should see results much faster and frequently.

You will also see sample answers which allows you to see how the model is learning. Some may have steps, XML tags, attempts etc. and the idea is as trains it's going to get better and better because it's going to get scored higher and higher until we get the outputs we desire with long reasoning chains of answers.

Run & Evaluate your model

Run your model by clicking the play button. In the first example, there is usually no reasoning in the answer and in order to see the reasoning, we need to first save the LoRA weights we just trained with GRPO first using:

model.save_lora("grpo_saved_lora")

Then we load the LoRA and test it. Our reasoning model is much better - it's not always correct, since we only trained it for an hour or so - it'll be better if we extend the sequence length and train for longer!

You can then save your model to GGUF, Ollama etc. by following our guide here.

If you are still not getting any reasoning, you may have either trained for too less steps or your reward function/verifier was not optimal.

Save your model

We have multiple options for saving your fine-tuned model, but we’ll focus on the easiest and most popular approaches which you can read more about here

Saving in 16-bit Precision

You can save the model with 16-bit precision using the following command:

# Save to 16-bit precision
model.save_pretrained_merged("model", tokenizer, save_method="merged_16bit")

Pushing to Hugging Face Hub

To share your model, we’ll push it to the Hugging Face Hub using the push_to_hub_merged method. This allows saving the model in multiple quantization formats.

# Push to Hugging Face Hub (requires a token)
model.push_to_hub_merged(
    "your-username/model-name", tokenizer, save_method="merged_16bit", token="your-token"
)

Saving in GGUF Format for llama.cpp

Unsloth also supports saving in GGUF format, making it compatible with llama.cpp and Ollama.

model.push_to_hub_gguf(
    "your-username/model-name",
    tokenizer,
    quantization_method=["q4_k_m", "q8_0", "q5_k_m"],
    token="your-token",
)

Once saved in GGUF format, the model can be easily deployed in lightweight environments using llama.cpp or used in other inference engines.

Video Tutorials

Here are some video tutorials created by amazing YouTubers who we think are fantastic!

PreviousReinforcement Learning (RL) Guide NextAdvanced RL Documentation

Last updated 6 days ago

Was this helpful?