GSPO Reinforcement Learning
Train with GSPO (Group Sequence Policy Optimization) RL in Unsloth.
We're introducing GSPO (Group Sequence Policy Optimization), a variant of GRPO developed by the Qwen team at Alibaba. They observed that GRPO applies an importance weight to every token, even though the advantage is computed once per sequence and does not change from token to token. This led to the creation of GSPO, which instead applies the importance weight to the sequence likelihood rather than to the individual token likelihoods.
Enable GSPO in Unsloth by setting importance_sampling_level = "sequence"
in the GRPO config.
The difference between the two algorithms can be seen below; both equations are taken from the GSPO paper by the Qwen team at Alibaba:

Equation 1: GRPO Algorithm, Source: Qwen team, Alibaba https://arxiv.org/abs/2507.18071
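Written out in the paper's notation (a transcription for reference; see the paper for the exact formulation), GRPO applies a clipped importance weight to every token, with the same group-normalized advantage attached to each token of completion y_i:

$$
\mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|y_i|}\sum_{t=1}^{|y_i|}\min\Big(w_{i,t}(\theta)\,\hat{A}_{i},\ \mathrm{clip}\big(w_{i,t}(\theta),\,1-\varepsilon,\,1+\varepsilon\big)\,\hat{A}_{i}\Big)\right],
\qquad
w_{i,t}(\theta) = \frac{\pi_\theta(y_{i,t}\mid x, y_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(y_{i,t}\mid x, y_{i,<t})}
$$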

Equation 2: GSPO algorithm, Source: Qwen team, Alibaba https://arxiv.org/abs/2507.18071
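Again transcribing from the paper, GSPO replaces the per-token weight with a single length-normalized sequence-level ratio s_i(θ), so clipping and advantage scaling happen once per completion:

$$
\mathcal{J}_{\text{GSPO}}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\min\Big(s_{i}(\theta)\,\hat{A}_{i},\ \mathrm{clip}\big(s_{i}(\theta),\,1-\varepsilon,\,1+\varepsilon\big)\,\hat{A}_{i}\Big)\right],
\qquad
s_{i}(\theta) = \exp\!\left(\frac{1}{|y_i|}\sum_{t=1}^{|y_i|}\log\frac{\pi_\theta(y_{i,t}\mid x, y_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(y_{i,t}\mid x, y_{i,<t})}\right)
$$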
In Equation 1, the advantages scale each row of the token logprob ratios before that tensor is summed. Essentially, every token receives the same scaling, even though that scaling was computed for the entire sequence rather than for each individual token. A simple diagram of this can be seen in Figure 3 below.

Figure 3: GRPO logprob ratios scaled row-wise by the advantages
Equation 2 shows that the per-token logprob ratios are first summed over each sequence (with a length normalization) and then exponentiated, and only the resulting sequence-level ratios are multiplied row-wise by the advantages, as shown in Figure 4 below.

Figure 4: GSPO sequence ratios scaled row-wise by the advantages
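To make the distinction concrete, here is a minimal, self-contained PyTorch sketch contrasting the two weighting schemes. The tensor names and shapes are illustrative only and are not Unsloth's internal implementation:

import torch

# Illustrative shapes: G completions in the group, each with T tokens.
G, T = 4, 6
new_logprobs = torch.randn(G, T)   # log pi_theta       per token
old_logprobs = torch.randn(G, T)   # log pi_theta_old   per token
advantages   = torch.randn(G)      # one advantage per completion

# GRPO: one importance ratio per token; every token in a row is then
# scaled by that row's single sequence-level advantage.
token_ratios  = torch.exp(new_logprobs - old_logprobs)     # shape (G, T)
grpo_weighted = token_ratios * advantages.unsqueeze(1)     # row-wise broadcast

# GSPO: average the token log-ratios over each sequence, exponentiate once,
# and scale the single resulting sequence ratio by the advantage.
seq_ratios    = torch.exp((new_logprobs - old_logprobs).mean(dim=1))  # shape (G,)
gspo_weighted = seq_ratios * advantages                               # shape (G,)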
Enable GSPO in Unsloth
Enabling GSPO is simple: all you need to do is set the importance_sampling_level = "sequence"
flag in the GRPO config.
from trl import GRPOConfig

training_args = GRPOConfig(
    output_dir = "vlm-grpo-unsloth",
    per_device_train_batch_size = 8,
    gradient_accumulation_steps = 4,
    learning_rate = 5e-6,
    adam_beta1 = 0.9,
    adam_beta2 = 0.99,
    weight_decay = 0.1,
    warmup_ratio = 0.1,
    lr_scheduler_type = "cosine",
    optim = "adamw_8bit",
    importance_sampling_level = "sequence",  # This enables GSPO
    loss_type = "dr_grpo",
    # beta = 0.00,
    epsilon = 3e-4,
    epsilon_high = 4e-4,
    num_generations = 8,
    max_prompt_length = 1024,
    max_completion_length = 1024,
    log_completions = True,
    max_grad_norm = 0.1,
    temperature = 0.9,
    # report_to = "none", # Set to "wandb" if you want to log to Weights & Biases
    num_train_epochs = 2, # For a quick test run, increase for full training
    report_to = "wandb",
)
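The config is then used exactly like any other GRPO run. As a rough sketch, assuming you have already prepared a model, reward function, and dataset the usual Unsloth GRPO way (the names below are placeholders), you pass it to TRL's GRPOTrainer:

from trl import GRPOTrainer

trainer = GRPOTrainer(
    model = model,                   # your Unsloth-loaded model (placeholder)
    reward_funcs = [my_reward_func], # your reward function(s) (placeholder)
    args = training_args,            # the GRPOConfig defined above
    train_dataset = train_dataset,   # your prompts dataset (placeholder)
)
trainer.train()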