
🧩Advanced RL Documentation

Advanced settings and documentation for using Unsloth with GRPO.

Detailed guides for GRPO with Unsloth, covering batching, generation, and training parameters:

Training Parameters

  • beta (float, default 0.0): KL coefficient.

    • 0.0 ⇒ no reference model loaded (lower memory, faster).

    • Higher beta constrains the policy to stay closer to the ref policy.

  • num_iterations (int, default 1): PPO epochs per batch (μ in the algorithm). Replays data within each gradient accumulation step; e.g., 2 = two forward passes per accumulation step.

  • epsilon (float, default 0.2): Clipping value for token-level log-prob ratios; with the default ε the ratio is clipped to [0.8, 1.2].

  • delta (float, optional): Enables upper clipping bound for two-sided GRPO when set. If None, standard GRPO clipping is used. Recommended > 1 + ε when enabled (per INTELLECT-2 report).

  • epsilon_high (float, optional): Upper-bound epsilon; defaults to epsilon if unset. DAPO recommends 0.28.

  • importance_sampling_level ("token" | "sequence", default "token"):

    • "token": raw per-token ratios (one weight per token).

    • "sequence": average per-token ratios to a single sequence-level ratio. GSPO shows sequence-level sampling often gives more stable training for sequence-level rewards.

  • reward_weights (list[float], optional): One weight per reward. If None, all weights = 1.0.

  • scale_rewards (str|bool, default "group"):

    • True or "group": scale by std within each group (unit variance in group).

    • "batch": scale by std across the entire batch (per PPO-Lite).

    • False or "none": no scaling. Dr. GRPO recommends not scaling to avoid difficulty bias from std scaling.

  • loss_type (str, default "dapo"):

    • "grpo": normalizes over sequence length (length bias; not recommended).

    • "dr_grpo": normalizes by a global constant (introduced in Dr. GRPO; removes length bias). Constant ≈ max_completion_length.

    • "dapo" (default): normalizes by active tokens in the global accumulated batch (introduced in DAPO; removes length bias).

    • "bnpo": normalizes by active tokens in the local batch only (results can vary with local batch size; equals GRPO when per_device_train_batch_size == 1).

  • mask_truncated_completions (bool, default False): When True, truncated completions are excluded from the loss (recommended by DAPO for stability). Note: enabling it can trigger KL issues in this framework (see below), so we recommend leaving it disabled here. Example logic:

    # If mask_truncated_completions is enabled, zero out truncated completions in completion_mask
    if self.mask_truncated_completions:
        truncated_completions = ~is_eos.any(dim=1)
        completion_mask = completion_mask * (~truncated_completions).unsqueeze(1).int()

    This can zero out all completion_mask entries when many completions are truncated, making n_mask_per_reward = 0 and causing the KL term to become NaN.

  • vllm_importance_sampling_correction (bool, default True): Applies Truncated Importance Sampling (TIS) to correct off-policy effects when generation (e.g., vLLM / fast_inference) differs from training backend. In Unsloth, this is auto-set to True if you’re using vLLM/fast_inference; otherwise False.

  • vllm_importance_sampling_cap (float, default 2.0): Truncation parameter C for TIS; sets an upper bound on the importance sampling ratio to improve stability (see the configuration sketch after this list).
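
The training parameters above are fields of GRPOConfig (Unsloth reuses TRL's GRPO trainer). The snippet below is a minimal sketch with illustrative values, not a recommended recipe; the exact set of available fields and their defaults depends on your TRL / Unsloth version.

    from trl import GRPOConfig  # Unsloth builds on TRL's GRPO implementation

    # Illustrative values only; field availability and defaults depend on your TRL / Unsloth version.
    training_args = GRPOConfig(
        beta=0.0,                           # no reference model loaded -> lower memory, faster
        num_iterations=1,                   # PPO epochs per batch (mu)
        epsilon=0.2,                        # lower clipping bound for the token-level ratio
        epsilon_high=0.28,                  # DAPO-style upper bound (defaults to epsilon if unset)
        importance_sampling_level="token",  # or "sequence" (GSPO-style)
        scale_rewards="group",              # True/"group", "batch", or False/"none"
        loss_type="dapo",                   # "grpo", "dr_grpo", "dapo", or "bnpo"
        mask_truncated_completions=False,   # keep disabled here (see the KL note above)
        # vllm_importance_sampling_cap=2.0, # TIS cap C; Unsloth sets the correction flag automatically with vLLM
        output_dir="outputs",
    )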

Generation Parameters

  • temperature (float, default 1.0): Higher ⇒ more randomness. Use a relatively high value (≈1.0) to increase diversity across generations (helps learning).

  • top_p (float, default 1.0): Cumulative probability mass to consider for nucleus sampling, in (0, 1]; set to 1.0 to consider all tokens.

  • top_k (int, optional): Keep only the top-k tokens; if None, consider all tokens.

  • min_p (float, optional): Minimum token probability scaled by the max token’s probability (typical 0.01–0.2).

  • repetition_penalty (float, default 1.0): >1.0 discourages repeats; <1.0 encourages repeats.

  • steps_per_generation (int, optional): If None, defaults to gradient_accumulation_steps. Mutually exclusive with generation_batch_size.

Prefer adjusting per_device_train_batch_size and gradient_accumulation_steps for batch sizing. The sketch below illustrates what the sampling parameters above do to a next-token distribution.
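
This is a toy, self-contained illustration (hypothetical helper name; not the actual vLLM / transformers sampling code, and repetition_penalty is omitted for brevity) of how temperature, top_k, min_p, and top_p reshape a next-token distribution:

    import torch

    def filter_next_token_distribution(logits, temperature=1.0, top_k=None, top_p=1.0, min_p=None):
        """Toy illustration of the sampling knobs above on a 1-D logits vector.
        Real backends (vLLM / transformers) implement these filters internally."""
        logits = logits / temperature                      # >1.0 flattens, <1.0 sharpens the distribution
        if top_k is not None:                              # top_k: keep only the k most likely tokens
            kth_best = torch.topk(logits, top_k).values[-1]
            logits = logits.masked_fill(logits < kth_best, float("-inf"))
        probs = torch.softmax(logits, dim=-1)
        if min_p is not None:                              # min_p: drop tokens below min_p * max probability
            logits = logits.masked_fill(probs < min_p * probs.max(), float("-inf"))
        if top_p < 1.0:                                    # top_p: keep the smallest set covering top_p mass
            sorted_probs, sorted_idx = torch.sort(probs, descending=True)
            in_nucleus = torch.cumsum(sorted_probs, dim=-1) - sorted_probs <= top_p
            keep = torch.zeros_like(probs, dtype=torch.bool)
            keep[sorted_idx[in_nucleus]] = True
            logits = logits.masked_fill(~keep, float("-inf"))
        return torch.softmax(logits, dim=-1)               # renormalized distribution used for sampling

    # Toy 5-token vocabulary
    print(filter_next_token_distribution(torch.tensor([2.0, 1.0, 0.5, 0.1, -1.0]),
                                          temperature=1.0, top_k=3, top_p=0.95, min_p=0.05))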

Batch & Throughput Parameters

  • train_batch_size (i.e., per_device_train_batch_size): Number of samples per process per step. If this is set lower than num_generations, it defaults to num_generations.

  • steps_per_generation: Number of microbatches that contribute to one generation’s loss calculation (forward passes only). A new batch of data is generated every steps_per_generation steps; backpropagation timing depends on gradient_accumulation_steps.

  • num_processes: Number of distributed training processes (e.g., GPUs / workers).

  • gradient_accumulation_steps (aka gradient_accumulation): Number of microbatches to accumulate before applying backpropagation and optimizer update.

  • Effective batch size:

    effective_batch_size = steps_per_generation * num_processes * train_batch_size

    Total samples contributing to gradients before an update (across all processes and steps).

  • Optimizer steps per generation:

    optimizer_steps_per_generation = steps_per_generation / gradient_accumulation_steps

    Example: 4 / 2 = 2.

  • num_generations: Number of generations produced per prompt (applied after computing effective_batch_size). The number of unique prompts in a generation cycle is:

    unique_prompts = effective_batch_size / num_generations

    Must be > 2 for GRPO to work.

It’s usually less confusing to tune per_device_train_batch_size and gradient_accumulation_steps rather than steps_per_generation when targeting a specific effective batch size. The sketch below puts the formulas above into code.
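
A hypothetical helper (not part of Unsloth or TRL) that just evaluates the formulas from this section; the usage line plugs in the settings of Example 1 from the next section.

    def grpo_batch_math(per_device_train_batch_size, gradient_accumulation_steps,
                        steps_per_generation, num_generations, num_processes=1):
        """Hypothetical helper: the batch formulas from this section, nothing more."""
        effective_batch_size = steps_per_generation * num_processes * per_device_train_batch_size
        assert effective_batch_size % num_generations == 0, "groups must divide the effective batch evenly"
        optimizer_steps_per_generation = steps_per_generation / gradient_accumulation_steps
        unique_prompts = effective_batch_size // num_generations
        return effective_batch_size, optimizer_steps_per_generation, unique_prompts

    # Example 1: 1 GPU, per-device batch 3, accumulation 2, 4 steps per generation, 3 generations per prompt
    print(grpo_batch_math(3, 2, 4, 3))  # -> (12, 2.0, 4)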

GRPO Batch Examples

The tables below illustrate how batches flow through steps, when optimizer updates occur, and how new batches are generated; a short script that mimics this schedule follows the examples.

Example 1

num_gpus = 1
per_device_train_batch_size = 3
gradient_accumulation_steps = 2
steps_per_generation = 4

effective_batch_size = 4 * 3 * 1 = 12
num_generations = 3

Generation cycle A

Step | Batch   | Notes
0    | [0,0,0] |
1    | [1,1,1] | optimizer update (accum = 2 reached)
2    | [2,2,2] |
3    | [3,3,3] | optimizer update

Generation cycle B

Step | Batch   | Notes
0    | [4,4,4] |
1    | [5,5,5] | optimizer update (accum = 2 reached)
2    | [6,6,6] |
3    | [7,7,7] | optimizer update

Example 2

num_gpus = 1
per_device_train_batch_size = 3
steps_per_generation = gradient_accumulation_steps = 4

effective_batch_size = 4 * 3 * 1 = 12
num_generations = 3

Generation cycle A

Step | Batch   | Notes
0    | [0,0,0] |
1    | [1,1,1] |
2    | [2,2,2] |
3    | [3,3,3] | optimizer update (accum = 4 reached)

Generation cycle B

Step | Batch   | Notes
0    | [4,4,4] |
1    | [5,5,5] |
2    | [6,6,6] |
3    | [7,7,7] | optimizer update (accum = 4 reached)

Example 3

num_gpus = 1
per_device_train_batch_size = 3
steps_per_generation = gradient_accumulation_steps = 4

effective_batch_size = 4 * 3 * 1 = 12
num_generations = 4
unique_prompts = effective_batch_size / num_generations = 3

Generation cycle A

Step | Batch   | Notes
0    | [0,0,0] |
1    | [0,1,1] |
2    | [1,1,3] |
3    | [3,3,3] | optimizer update (accum = 4 reached)

Generation cycle B

Step | Batch   | Notes
0    | [4,4,4] |
1    | [4,5,5] |
2    | [5,5,6] |
3    | [6,6,6] | optimizer update (accum = 4 reached)

Example 4

num_gpus = 1
per_device_train_batch_size = 6
steps_per_generation = gradient_accumulation_steps = 2

effective_batch_size = 2 * 6 * 1 = 12
num_generations = 3
unique_prompts = 4

Generation cycle A

Step | Batch          | Notes
0    | [0,0,0, 1,1,1] |
1    | [2,2,2, 3,3,3] | optimizer update (accum = 2 reached)

Generation cycle B

Step | Batch          | Notes
0    | [4,4,4, 5,5,5] |
1    | [6,6,6, 7,7,7] | optimizer update (accum = 2 reached)
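
As a rough model of how prompts fill microbatches and when the optimizer steps, the hypothetical sketch below prints a schedule in the style of the tables above. It is a simplified single-process illustration (the real trainer may order or shard prompts differently); with Example 1's settings it reproduces generation cycles A and B of that example.

    def print_grpo_schedule(per_device_train_batch_size, gradient_accumulation_steps,
                            steps_per_generation, num_generations, num_cycles=2, num_processes=1):
        """Simplified single-process model of the schedules above."""
        effective_batch_size = steps_per_generation * num_processes * per_device_train_batch_size
        unique_prompts = effective_batch_size // num_generations
        prompt_offset = 0
        for cycle in range(num_cycles):
            print(f"Generation cycle {cycle}: prompts {prompt_offset}..{prompt_offset + unique_prompts - 1}")
            for step in range(steps_per_generation):
                start = step * per_device_train_batch_size
                batch = [prompt_offset + (start + i) // num_generations
                         for i in range(per_device_train_batch_size)]
                update = (step + 1) % gradient_accumulation_steps == 0
                print(f"  step {step}: {batch}" + ("  -> optimizer update" if update else ""))
            prompt_offset += unique_prompts

    # Example 1: reproduces generation cycles A and B above
    print_grpo_schedule(per_device_train_batch_size=3, gradient_accumulation_steps=2,
                        steps_per_generation=4, num_generations=3)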

Quick Formula Reference

effective_batch_size = steps_per_generation * num_processes * train_batch_size
optimizer_steps_per_generation = steps_per_generation / gradient_accumulation_steps
unique_prompts = effective_batch_size / num_generations   # must be > 2
