💡 Reasoning - GRPO
Train your own DeepSeek-R1 reasoning model with Unsloth using GRPO, which is part of Reinforcement Learning (RL) fine-tuning.
Read our blog post: unsloth.ai/blog/r1-reasoning
DeepSeek's learning algorithm, GRPO (Group Relative Policy Optimization), is a reinforcement learning technique that optimizes responses efficiently without requiring a value function model. This reduces memory and computational costs compared to methods like PPO (Proximal Policy Optimization).
With 15GB VRAM, Unsloth allows you to transform any model up to 15B parameters, such as Llama 3.1 (8B), Phi-4 (14B), Mistral (7B) or Qwen2.5 (7B), into a reasoning model.
Minimum requirement: Just 7GB VRAM is enough to train your own reasoning model locally.
Previous demonstrations showed that you could achieve your own "aha" moment with Qwen2.5 (3B), but it required 2x A100 GPUs (160GB VRAM). Now, with Unsloth, you can achieve the same "aha" moment using just a single 7GB VRAM GPU.
Previously, GRPO was only supported for full fine-tuning, but we've made it work with QLoRA and LoRA.
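As a rough illustration, here is a minimal sketch of loading a model in 4-bit (QLoRA-style) and attaching LoRA adapters with Unsloth before GRPO training. The model name, rank and other values are placeholders rather than recommendations; check the Unsloth notebooks for the exact arguments supported by your version.

```python
from unsloth import FastLanguageModel

# Load a base model in 4-bit (QLoRA); the values below are illustrative only.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "meta-llama/Llama-3.1-8B-Instruct",  # any supported model
    max_seq_length = 1024,
    load_in_4bit = True,        # QLoRA; set to False for 16-bit LoRA
)

# Attach LoRA adapters so only a small set of weights is trained during GRPO.
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,                     # LoRA rank (placeholder)
    lora_alpha = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
)
```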
Please note, this isn't fine-tuning DeepSeek's R1 distilled models or using distilled data from R1 for tuning, which Unsloth already supported. This is converting a standard model into a full-fledged reasoning model using GRPO.
In a test example, even though we trained Phi-4 for only 100 steps with GRPO, the results are already clear: the model without GRPO does not produce the thinking token, whilst the one trained with GRPO does, and also arrives at the correct answer.
Wait for at least 300 steps for the reward to actually increase. To get decent results, you may need to train for a minimum of 12 hours (this is how GRPO works), but keep in mind this isn't compulsory as you can stop at any time.
If you're using GRPO with Unsloth locally, please "pip install diffusers" as well if you get an error. Please also use the latest version of vLLM.
It's advised to apply GRPO to a model of at least 1.5B parameters to correctly generate thinking tokens, as smaller models may not.
Training loss tracking for GRPO is now built directly into Unsloth, eliminating the need for external tools like wandb.
If you're using a base model, ensure you have a chat template.
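For instance, a base model's tokenizer can be given a chat template before training. The sketch below assumes Unsloth's get_chat_template helper and the "llama-3.1" template name; adjust both to whatever your model expects.

```python
from unsloth.chat_templates import get_chat_template

# Base models ship without a chat template, so attach one explicitly.
# "llama-3.1" is just an example; pick the template matching your model.
tokenizer = get_chat_template(tokenizer, chat_template = "llama-3.1")

# Sanity check: render a conversation with the template before training.
messages = [{"role": "user", "content": "What is 1+1?"}]
print(tokenizer.apply_chat_template(messages, tokenize = False,
                                    add_generation_prompt = True))
```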
DeepSeek's researchers observed an "aha moment" when training R1-Zero with pure reinforcement learning (RL). The model learned to extend its thinking time by reevaluating its initial approach, without any human guidance or predefined instructions.
The model generates groups of responses.
Each response is scored based on correctness or another metric created by some set reward function rather than an LLM reward model.
The average score of the group is computed.
Each response's score is compared to the group average.
The model is reinforced to favor higher-scoring responses (see the sketch after this list).
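To make the group-relative idea concrete, here is a small self-contained sketch of the scoring step: each response in a group gets a reward, and its advantage is its reward relative to the group mean, scaled by the group's standard deviation. This illustrates the idea only and is not Unsloth's internal implementation.

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Score each response relative to its group: (reward - mean) / std."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0   # avoid division by zero
    return [(r - mean) / std for r in rewards]

# Example: four sampled answers to "What is 1+1?" scored by a verifier.
rewards = [1.0, 0.0, 1.0, 0.0]                # 1 = correct answer, 0 = wrong
print(group_relative_advantages(rewards))     # [1.0, -1.0, 1.0, -1.0]
```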
As an example, assume we want a model to solve:
What is 1+1? >> Chain of thought/working out >> The answer is 2.
What is 2+2? >> Chain of thought/working out >> The answer is 4.
Originally, one had to collect large swathes of data to fill in the working out / chain-of-thought process. But GRPO (the algorithm DeepSeek uses) and other RL algorithms can steer the model to automatically exhibit reasoning capabilities and create the reasoning trace. Instead of collecting data, we need to create good reward functions or verifiers. For example, if the model gets the correct answer, give it a score of 1; if some words are misspelt, deduct 0.1; and so on. We can provide many such functions to reward the process.
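Below is a hedged sketch of such reward functions wired into TRL's GRPOTrainer, which the Unsloth GRPO notebooks build on. The reward-function signature (prompts, completions, plus dataset columns as keyword arguments) and the GRPOConfig fields shown may differ between TRL versions, the answer-matching and length-penalty logic is purely illustrative, and `model`, `tokenizer` and `dataset` are assumed to have been prepared earlier.

```python
from trl import GRPOConfig, GRPOTrainer

def correctness_reward(prompts, completions, answer, **kwargs):
    """Give 1.0 when the completion contains the reference answer, else 0.0."""
    return [1.0 if str(a) in c else 0.0 for c, a in zip(completions, answer)]

def length_penalty(prompts, completions, **kwargs):
    """Small penalty for overly long completions; purely illustrative."""
    return [-0.1 if len(c) > 2000 else 0.0 for c in completions]

training_args = GRPOConfig(
    output_dir = "grpo-output",
    num_generations = 4,         # responses sampled per prompt (placeholder)
    max_steps = 300,             # wait at least ~300 steps for rewards to move
)

trainer = GRPOTrainer(
    model = model,               # the LoRA/QLoRA model prepared earlier
    processing_class = tokenizer,
    reward_funcs = [correctness_reward, length_penalty],
    args = training_args,
    train_dataset = dataset,     # dataset with "prompt" and "answer" columns
)
trainer.train()
```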