⚡Tutorial: Train your own Reasoning model with GRPO
Beginner's Guide to transforming a model like Llama 3.1 (8B) into a reasoning model by using Unsloth and GRPO.
DeepSeek developed GRPO (Group Relative Policy Optimization) to train their R1 reasoning models.
These instructions are for our pre-made Google Colab notebooks. If you are installing Unsloth locally, you can also copy our notebooks into your favorite code editor.
If you're using our Colab notebook, click Runtime > Run all. We highly recommend checking out our Fine-tuning Guide before getting started.
If installing locally, ensure you have the correct requirements and use `pip install unsloth`.
Before we get started, we recommend learning more about GRPO, reward functions, and how they work. Read more about them, including tips & tricks, here.
You will also need enough VRAM. As a rule of thumb, a model's parameter count in billions roughly equals the GB of VRAM you will need. In Colab, we are using the free 16GB VRAM GPUs, which can train any model up to 16B parameters.
We have already pre-selected optimal settings for the best results. You can change the model to any of those listed in our supported models, but we would not recommend changing other settings if you're a beginner.
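If you are working outside the notebook, loading the model looks roughly like this; a minimal sketch using Unsloth's FastLanguageModel API, where the sequence length and LoRA rank are illustrative values rather than the notebook's exact settings:

```python
from unsloth import FastLanguageModel

# Load Llama 3.1 (8B) in 4-bit so it fits in ~16GB of VRAM.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-Instruct",
    max_seq_length=1024,
    load_in_4bit=True,
)

# Attach LoRA adapters so only a small fraction of weights are trained.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank (illustrative)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
)
```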
We have pre-selected OpenAI's GSM8K dataset, but you can change it to your own or any public one on Hugging Face. You can read more about datasets here.
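Loading it from Hugging Face looks like this; a minimal sketch (the notebook additionally maps each row into a prompt format):

```python
from datasets import load_dataset

# GSM8K ships with "question" and "answer" columns; the final numeric
# result appears after a "####" marker at the end of each answer.
dataset = load_dataset("openai/gsm8k", "main", split="train")
```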
Your dataset should still have at least 2 columns for question and answer pairs. However, the answer must not reveal the reasoning used to derive it from the question. See below for an example:
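For instance, a question and answer pair might look like this (an illustrative example, not a row from the actual dataset):

Question: "A bakery sells 12 cupcakes every day. How many cupcakes does it sell in a week?"
Answer: "84"

Notice the answer holds only the final result; the model has to learn to produce the reasoning itself.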
Reward functions/verifiers let us know whether the model is doing well or not according to the dataset you have provided. Each generation is scored relative to the average score of the other generations in its group. You can create your own reward functions, but we have already pre-selected Will's GSM8K reward functions for you, which give us 5 different ways to reward each generation.
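As an illustration, a correctness check similar in spirit to one of those reward functions could look like this; a hedged sketch following TRL's reward-function convention, where the function name, scoring values, and extraction logic are our own choices rather than the notebook's exact code:

```python
import re

def correctness_reward(prompts, completions, answer, **kwargs):
    """Score each completion: +2 if its final number matches the
    reference answer, 0 otherwise. `answer` is the dataset column."""
    scores = []
    for completion, ref in zip(completions, answer):
        # Pull the last number out of the model's response.
        numbers = re.findall(r"-?\d+\.?\d*", completion)
        predicted = numbers[-1] if numbers else None
        scores.append(2.0 if predicted == str(ref) else 0.0)
    return scores
```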
You can input your answer into an LLM like GPT-4o or Llama 3.1 (8B) and design a reward function and verifier to evaluate it. For example, set a rule: "If the answer sounds too robotic, deduct 3 points." This helps refine outputs based on quality criteria. See examples of what they can look like here.
Example Reward Function for an Email Automation Task:
Question: Inbound email
Answer: Outbound email
Reward Functions (a code sketch implementing these follows the list):
If the answer contains a required keyword → +1
If the answer exactly matches the ideal response → +1
If the response is too long → -1
If the recipient's name is included → +1
If a signature block (phone, email, address) is present → +1
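A hedged sketch of how those rules could be written as a single reward function; the keyword, rule weights, 200-word cut-off, and the `ideal_response` and `recipient_name` dataset columns are all illustrative assumptions, not values from the tutorial:

```python
def email_reward(prompts, completions, ideal_response, recipient_name, **kwargs):
    """Apply the rules above to each generated outbound email."""
    scores = []
    for completion, ideal, name in zip(completions, ideal_response, recipient_name):
        score = 0.0
        if "refund" in completion.lower():       # required keyword (illustrative)
            score += 1.0
        if completion.strip() == ideal.strip():  # exact match with the ideal response
            score += 1.0
        if len(completion.split()) > 200:        # too long (illustrative cut-off)
            score -= 1.0
        if name in completion:                   # recipient's name is included
            score += 1.0
        if "Phone:" in completion or "Email:" in completion:  # signature block present
            score += 1.0
        scores.append(score)
    return scores
```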
We have pre-selected hyperparameters for the most optimal results, but you can change them. Read all about parameters here.
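For reference, wiring everything together with TRL's GRPOTrainer looks roughly like this; a minimal sketch where the hyperparameter values are illustrative rather than the notebook's exact settings, reusing `model`, `tokenizer`, `dataset`, and `correctness_reward` from the sketches above:

```python
from trl import GRPOConfig, GRPOTrainer

training_args = GRPOConfig(
    learning_rate=5e-6,
    per_device_train_batch_size=8,  # holds one full group of completions
    gradient_accumulation_steps=1,
    num_generations=8,              # completions sampled per prompt (the "group")
    max_prompt_length=256,
    max_completion_length=512,
    max_steps=300,
    output_dir="outputs",
)

trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=[correctness_reward],
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```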
You should see the reward increase over time. We recommend training for at least 300 steps, which may take around 30 minutes; for optimal results, you should train for longer.
You will also see sample answers, which let you watch how the model is learning. Some may contain steps, XML tags, attempts, etc. The idea is that as training progresses, generations get scored higher and higher, so the model gets better and better until we get the outputs we desire: answers with long reasoning chains.
Run your model by clicking the play button. In the first example, there is usually no reasoning in the answer. To see the reasoning, we first need to save the LoRA we just trained with GRPO.
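In the Unsloth notebooks this is a single call; a sketch, where the folder name is our own choice:

```python
# Save only the trained LoRA adapters, not the full model weights.
model.save_lora("grpo_saved_lora")
```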
Then we load the LoRA and test it. Our reasoning model is much better; it's not always correct, since we only trained it for an hour or so, but it will improve if we extend the sequence length and train for longer!
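A minimal inference sketch using standard Hugging Face generation; the notebook's own inference cell may differ, the prompt is illustrative, and the trained LoRA adapters are assumed to still be attached to `model`:

```python
from unsloth import FastLanguageModel

FastLanguageModel.for_inference(model)  # switch Unsloth into fast inference mode

messages = [{"role": "user",
             "content": "If a train travels 60 km/h for 2.5 hours, how far does it go?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids=inputs, max_new_tokens=512)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```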
You can then save your model to GGUF, Ollama, etc. by following our guide here.
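For example, exporting to GGUF with Unsloth's helper looks like this; a sketch, where the directory name and quantization method are illustrative:

```python
# Merge the LoRA into the base model and export to GGUF for llama.cpp / Ollama.
model.save_pretrained_gguf("model", tokenizer, quantization_method="q4_k_m")
```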
If you are still not getting any reasoning, you may have either trained for too few steps or your reward function/verifier was not optimal.
Here are some video tutorials created by YouTubers we think are fantastic!