
GSPO Reinforcement Learning

Train with GSPO (Group Sequence Policy Optimization) RL in Unsloth.

We're introducing support for GSPO (Group Sequence Policy Optimization), a variant of GRPO developed by the Qwen team at Alibaba. They observed that GRPO applies an importance weight to every individual token, even though the advantage is computed once for the whole sequence and does not change from token to token. This led to GSPO, which instead places the importance ratio on the sequence likelihood rather than on the individual token likelihoods.

Enable GSPO in Unsloth by setting importance_sampling_level = "sequence" in the GRPO config.

The difference between the two algorithms can be seen below; both equations are taken from the GSPO paper by the Qwen team at Alibaba:

Equation 1: GRPO algorithm. Source: Qwen team, Alibaba, https://arxiv.org/abs/2507.18071

Equation 2: GSPO algorithm. Source: Qwen team, Alibaba, https://arxiv.org/abs/2507.18071
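
Since the equation images carry the actual definitions, here is a brief restatement of the key quantities, paraphrased from the paper's notation rather than copied verbatim (\hat{A}_i is the group-relative advantage of completion y_i for prompt x). GRPO uses a token-level importance weight,

w_{i,t}(\theta) = \frac{\pi_\theta(y_{i,t} \mid x, y_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(y_{i,t} \mid x, y_{i,<t})},

while GSPO uses a single, length-normalized sequence-level ratio,

s_i(\theta) = \exp\!\left( \frac{1}{|y_i|} \sum_{t=1}^{|y_i|} \log \frac{\pi_\theta(y_{i,t} \mid x, y_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(y_{i,t} \mid x, y_{i,<t})} \right),

and in each case the ratio is clipped and multiplied by the advantage in a PPO-style clipped objective.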

In Equation 1, it can be seen that the advantages scale each row of the token log-prob ratios before that tensor is summed. Essentially, every token receives the same scaling, even though that scaling was computed for the entire sequence rather than for each individual token. A simple diagram of this can be seen in Figure 3 below.

Figure 3: GRPO log-prob ratios, row-wise scaled with advantages

Equation 2 shows that, once the log-prob ratios are computed, they are summed over each sequence and exponentiated, and only the resulting per-sequence ratios are row-wise multiplied by the advantages.

Figure 4: GSPO sequence ratios, row-wise scaled with advantages
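
To make the row-wise scaling concrete, below is a small illustrative PyTorch sketch of the two scalings (clipping and the min with the unclipped term are omitted, and all tensor names and values are made up for the example):

import torch

# Toy setup: G completions per prompt, padded to T tokens each.
G, T = 4, 6
log_ratios = torch.randn(G, T) * 0.05              # per-token log(pi_theta / pi_old)
mask = torch.ones(G, T)                            # 1 for real tokens, 0 for padding
advantages = torch.tensor([0.5, -0.2, 1.0, -1.3])  # one advantage per sequence

# GRPO-style (token level): every token keeps its own ratio, and the same
# per-sequence advantage is broadcast onto every token before averaging.
token_ratios = log_ratios.exp()                          # shape (G, T)
grpo_per_token = token_ratios * advantages[:, None]      # row-wise scaling
grpo_objective = (grpo_per_token * mask).sum() / mask.sum()

# GSPO-style (sequence level): the per-token log-ratios are length-normalized
# and summed per sequence first, then exponentiated into one ratio per
# sequence, and only that single ratio is scaled by the advantage.
seq_log_ratio = (log_ratios * mask).sum(dim=1) / mask.sum(dim=1)  # shape (G,)
seq_ratios = seq_log_ratio.exp()                                  # shape (G,)
gspo_objective = (seq_ratios * advantages).mean()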

Enable GSPO in Unsloth

Enabling GSPO is simple: all you need to do is set importance_sampling_level = "sequence" in the GRPO config.

from trl import GRPOConfig

training_args = GRPOConfig(
    output_dir = "vlm-grpo-unsloth",
    per_device_train_batch_size = 8,
    gradient_accumulation_steps = 4,
    learning_rate = 5e-6,
    adam_beta1 = 0.9,
    adam_beta2 = 0.99,
    weight_decay = 0.1,
    warmup_ratio = 0.1,
    lr_scheduler_type = "cosine",
    optim = "adamw_8bit",
    importance_sampling_level = "sequence", # Enables GSPO (sequence-level importance sampling)
    loss_type = "dr_grpo",
    # beta = 0.00,
    epsilon = 3e-4,
    epsilon_high = 4e-4,
    num_generations = 8,
    max_prompt_length = 1024,
    max_completion_length = 1024,
    log_completions = True,
    max_grad_norm = 0.1,
    temperature = 0.9,
    num_train_epochs = 2, # For a quick test run, increase for full training
    report_to = "wandb", # Set to "none" if you do not want to log to Weights & Biases
)
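
The config then plugs into the usual GRPO training loop. A rough sketch of the surrounding setup is below; the model, tokenizer, dataset, and reward function are placeholders for whatever your existing GRPO setup uses:

from trl import GRPOTrainer

# `model`, `tokenizer`, `train_dataset`, and `my_reward_func` are placeholders.
trainer = GRPOTrainer(
    model = model,
    processing_class = tokenizer,
    reward_funcs = [my_reward_func],
    args = training_args,
    train_dataset = train_dataset,
)
trainer.train()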
