🧩 Advanced RL Documentation
Advanced settings for using Unsloth with GRPO.
Detailed guides on GRPO with Unsloth covering batching, generation, and training parameters:
Training Parameters
beta (float, default 0.0): KL coefficient. 0.0 ⇒ no reference model loaded (lower memory, faster). Higher beta constrains the policy to stay closer to the reference policy.
num_iterations (int, default 1): PPO epochs per batch (μ in the algorithm). Replays data within each gradient accumulation step; e.g., 2 ⇒ two forward passes per accumulation step.
epsilon (float, default 0.2): Clipping value for token-level log-probability ratios (typical ratio range ≈ [0.8, 1.2] with the default ε).
delta (float, optional): Enables an upper clipping bound for two-sided GRPO when set. If None, standard GRPO clipping is used. Recommended > 1 + ε when enabled (per the INTELLECT-2 report).
epsilon_high (float, optional): Upper-bound epsilon; defaults to epsilon if unset. DAPO recommends 0.28.
importance_sampling_level ("token" | "sequence", default "token"): "token" uses the raw per-token ratios (one weight per token); "sequence" averages the per-token ratios into a single sequence-level ratio. GSPO shows that sequence-level sampling often gives more stable training for sequence-level rewards. See the sketch below.
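For intuition, here is a minimal sketch of how the two importance_sampling_level modes weight tokens. It is illustrative only, not Unsloth's internal code, and the log-probability tensors are hypothetical:

```python
import torch

# Hypothetical per-token log-probs for one completion of 4 tokens.
new_logps = torch.tensor([[-1.2, -0.8, -2.0, -0.5]])  # current policy
old_logps = torch.tensor([[-1.0, -0.9, -1.5, -0.6]])  # policy that generated the samples
mask = torch.ones_like(new_logps)                      # 1 = real token, 0 = padding

log_ratio = new_logps - old_logps

# "token": one importance weight per token (standard GRPO).
token_ratio = torch.exp(log_ratio)

# "sequence": average the per-token log-ratios over the completion, then broadcast
# one sequence-level weight to every token (GSPO-style).
seq_log_ratio = (log_ratio * mask).sum(-1) / mask.sum(-1)
sequence_ratio = torch.exp(seq_log_ratio).unsqueeze(-1).expand_as(token_ratio)

print(token_ratio)     # ≈ [[0.8187, 1.1052, 0.6065, 1.1052]]
print(sequence_ratio)  # ≈ [[0.8825, 0.8825, 0.8825, 0.8825]]
```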
reward_weights (list[float], optional): One weight per reward. If None, all weights = 1.0.
scale_rewards (str | bool, default "group"):
True or "group": scale by the std within each group (unit variance within each group).
"batch": scale by the std across the entire batch (per PPO-Lite).
False or "none": no scaling. Dr. GRPO recommends not scaling, to avoid the difficulty bias that std scaling introduces. A sketch of the three modes follows.
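The following minimal sketch contrasts the three scale_rewards modes, assuming rewards arrive grouped by prompt with num_generations completions per prompt; the variable names are illustrative, not the trainer's:

```python
import torch

num_generations = 3
scale = "group"  # one of True/"group", "batch", False/"none"

# Hypothetical rewards for 6 completions = 2 prompts x 3 generations each.
rewards = torch.tensor([1.0, 0.0, 0.5, 2.0, 2.0, 1.5])

grouped = rewards.view(-1, num_generations)               # (num_prompts, num_generations)
advantages = grouped - grouped.mean(dim=1, keepdim=True)  # center within each group

if scale in (True, "group"):
    # Unit variance within each group.
    advantages = advantages / (grouped.std(dim=1, keepdim=True) + 1e-4)
elif scale == "batch":
    # One std computed over the entire batch (PPO-Lite style).
    advantages = advantages / (rewards.std() + 1e-4)
# False / "none": keep the centered advantages unscaled (Dr. GRPO's recommendation).

advantages = advantages.view(-1)
```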
loss_type (str, default "dapo"):
"grpo": normalizes over sequence length (length bias; not recommended).
"dr_grpo": normalizes by a global constant (introduced in Dr. GRPO; removes length bias). The constant ≈ max_completion_length.
"dapo" (default): normalizes by the number of active tokens in the global accumulated batch (introduced in DAPO; removes length bias).
"bnpo": normalizes by the number of active tokens in the local batch only (results can vary with local batch size; equals GRPO when per_device_train_batch_size == 1). A sketch of the four normalizations follows.
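For intuition, here is a minimal sketch of how the four loss_type options normalize a per-token loss. per_token_loss, completion_mask, and the constants are hypothetical stand-ins; the real trainer also handles gradient accumulation and distributed reduction:

```python
import torch

loss_type = "dapo"
max_completion_length = 5

# Hypothetical per-token losses for 2 completions, padded to length 5.
per_token_loss = torch.randn(2, 5)
completion_mask = torch.tensor([[1, 1, 1, 0, 0],
                                [1, 1, 1, 1, 1]], dtype=torch.float)

if loss_type == "grpo":
    # Mean over each sequence's own length, then mean over sequences:
    # longer sequences get smaller per-token weight (length bias).
    loss = ((per_token_loss * completion_mask).sum(-1)
            / completion_mask.sum(-1).clamp(min=1.0)).mean()
elif loss_type == "dr_grpo":
    # Normalize by a global constant (≈ max_completion_length).
    loss = (per_token_loss * completion_mask).sum() / (
        per_token_loss.size(0) * max_completion_length)
elif loss_type in ("dapo", "bnpo"):
    # Normalize by the number of active tokens; DAPO counts them over the globally
    # accumulated batch, BNPO only over this local batch (what is shown here).
    loss = (per_token_loss * completion_mask).sum() / completion_mask.sum().clamp(min=1.0)
```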
mask_truncated_completions (bool, default False): When True, truncated completions are excluded from the loss (recommended by DAPO for stability). Note: there are KL issues with this option in this framework, so we recommend leaving it disabled here. Example logic:

```python
# If mask_truncated_completions is enabled, zero out truncated completions in completion_mask
if self.mask_truncated_completions:
    truncated_completions = ~is_eos.any(dim=1)
    completion_mask = completion_mask * (~truncated_completions).unsqueeze(1).int()
```

This can zero out all completion_mask entries when many completions are truncated, making n_mask_per_reward = 0 and causing KL to become NaN.
vllm_importance_sampling_correction (bool, default True): Applies Truncated Importance Sampling (TIS) to correct off-policy effects when the generation backend (e.g., vLLM / fast_inference) differs from the training backend. In Unsloth this is set to True automatically if you are using vLLM/fast_inference, and False otherwise.
vllm_importance_sampling_cap (float, default 2.0): Truncation parameter C for TIS; sets an upper bound on the importance sampling ratio to improve stability. A sketch of the TIS correction follows.
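A minimal sketch of the TIS correction, assuming per-token log-probs recomputed by the training backend and log-probs reported by the generation backend; names and values are illustrative:

```python
import torch

vllm_importance_sampling_cap = 2.0  # truncation parameter C

# Hypothetical per-token log-probs for the same sampled tokens.
train_logps = torch.tensor([[-1.1, -0.7, -2.3]])  # recomputed by the training backend
vllm_logps = torch.tensor([[-1.0, -0.9, -2.0]])   # reported by the generation backend

# Importance ratio between the two backends, truncated at C for stability.
ratio = torch.exp(train_logps - vllm_logps)
tis_weight = torch.clamp(ratio, max=vllm_importance_sampling_cap).detach()

# The per-token loss is then multiplied by tis_weight before reduction.
```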
Generation Parameters
temperature (float, default 1.0): Higher ⇒ more randomness. Use a relatively high value (≈ 1.0) to increase diversity across generations (helps learning).
top_p (float, default 1.0): Nucleus sampling mass to consider, in (0, 1]; set to 1.0 to consider all tokens.
top_k (int, optional): Keep only the top-k tokens; if None, all tokens are considered.
min_p (float, optional): Minimum token probability, scaled by the probability of the most likely token (typical values 0.01–0.2).
repetition_penalty (float, default 1.0): Values > 1.0 discourage repeats; values < 1.0 encourage repeats.
steps_per_generation (int, optional): If None, defaults to gradient_accumulation_steps. Mutually exclusive with generation_batch_size.
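As an illustration, these generation settings are typically passed through the trainer config. The sketch below assumes a recent TRL GRPOConfig that exposes these fields (check the signature of your installed version), and the values are examples, not recommendations:

```python
from trl import GRPOConfig

config = GRPOConfig(
    output_dir="outputs",
    temperature=1.0,          # keep generations diverse
    top_p=1.0,                # consider the full distribution
    top_k=None,               # no top-k cutoff
    min_p=0.1,                # drop tokens far below the most likely token
    repetition_penalty=1.1,   # mildly discourage repeats
    num_generations=4,        # completions per prompt
    max_completion_length=512,
)
```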
Batch & Throughput Parameters
train_batch_size: Number of samples per process per step. If this integer is less than num_generations, it defaults to num_generations.
steps_per_generation: Number of microbatches that contribute to one generation's loss calculation (forward passes only). A new batch of data is generated every steps_per_generation steps; backpropagation timing depends on gradient_accumulation_steps.
num_processes: Number of distributed training processes (e.g., GPUs / workers).
gradient_accumulation_steps (aka gradient_accumulation): Number of microbatches to accumulate before applying backpropagation and an optimizer update.
Effective batch size:
effective_batch_size = steps_per_generation * num_processes * train_batch_size
Total samples contributing to gradients before an update (across all processes and steps).
Optimizer steps per generation:
optimizer_steps_per_generation = steps_per_generation / gradient_accumulation_steps
Example: 4 / 2 = 2.
num_generations: Number of generations produced per prompt (applied after computing effective_batch_size). The number of unique prompts in a generation cycle is:
unique_prompts = effective_batch_size / num_generations
Must be > 2 for GRPO to work. A worked sketch of these formulas follows.
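A small worked sketch of these formulas, using the configuration of Example 1 below; the helper function is hypothetical:

```python
def grpo_batch_math(train_batch_size, steps_per_generation,
                    gradient_accumulation_steps, num_processes, num_generations):
    """Compute the derived batch quantities described above."""
    effective_batch_size = steps_per_generation * num_processes * train_batch_size
    optimizer_steps_per_generation = steps_per_generation / gradient_accumulation_steps
    unique_prompts = effective_batch_size / num_generations
    return effective_batch_size, optimizer_steps_per_generation, unique_prompts

# Example 1's configuration: 1 GPU, per-device batch 3, accumulation 2, 4 steps per generation.
print(grpo_batch_math(train_batch_size=3, steps_per_generation=4,
                      gradient_accumulation_steps=2, num_processes=1,
                      num_generations=3))
# -> (12, 2.0, 4.0): 12 samples per cycle, 2 optimizer updates, 4 unique prompts
```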
GRPO Batch Examples
The tables below illustrate how batches flow through steps, when optimizer updates occur, and how new batches are generated.
Example 1
num_gpus = 1
per_device_train_batch_size = 3
gradient_accumulation_steps = 2
steps_per_generation = 4
effective_batch_size = 4 * 3 * 1 = 12
num_generations = 3
Generation cycle A
Step 0: [0,0,0]
Step 1: [1,1,1] → optimizer update (accum = 2 reached)
Step 2: [2,2,2]
Step 3: [3,3,3] → optimizer update (accum = 2 reached)
Generation cycle B
Step 0: [4,4,4]
Step 1: [5,5,5] → optimizer update (accum = 2 reached)
Step 2: [6,6,6]
Step 3: [7,7,7] → optimizer update (accum = 2 reached)
Example 2
num_gpus = 1
per_device_train_batch_size = 3
steps_per_generation = gradient_accumulation_steps = 4
effective_batch_size = 4 * 3 * 1 = 12
num_generations = 3
Generation cycle A
Step 0: [0,0,0]
Step 1: [1,1,1]
Step 2: [2,2,2]
Step 3: [3,3,3] → optimizer update (accum = 4 reached)
Generation cycle B
Step 0: [4,4,4]
Step 1: [5,5,5]
Step 2: [6,6,6]
Step 3: [7,7,7] → optimizer update (accum = 4 reached)
Example 3
num_gpus = 1
per_device_train_batch_size = 3
steps_per_generation = gradient_accumulation_steps = 4
effective_batch_size = 4 * 3 * 1 = 12
num_generations = 4
unique_prompts = effective_batch_size / num_generations = 3
Generation cycle A
Step 0: [0,0,0]
Step 1: [0,1,1]
Step 2: [1,1,3]
Step 3: [3,3,3] → optimizer update (accum = 4 reached)
Generation cycle B
Step 0: [4,4,4]
Step 1: [4,5,5]
Step 2: [5,5,6]
Step 3: [6,6,6] → optimizer update (accum = 4 reached)
Example 4
num_gpus = 1
per_device_train_batch_size = 6
steps_per_generation = gradient_accumulation_steps = 2
effective_batch_size = 2 * 6 * 1 = 12
num_generations = 3
unique_prompts = 4
Generation cycle A
Step 0: [0,0,0, 1,1,1]
Step 1: [2,2,2, 3,3,3] → optimizer update (accum = 2 reached)
Generation cycle B
Step 0: [4,4,4, 5,5,5]
Step 1: [6,6,6, 7,7,7] → optimizer update (accum = 2 reached)
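The schedules above can be reproduced with a short bookkeeping simulation. This is an illustrative sketch, not trainer code; it assumes prompts are numbered consecutively with num_generations copies each, so it matches Examples 1, 2, and 4 exactly (Example 3 differs only in how its prompt indices are labeled):

```python
def simulate_schedule(per_device_train_batch_size, gradient_accumulation_steps,
                      steps_per_generation, num_generations, num_gpus=1, cycles=2):
    """Print which prompt copies land in each microbatch and when optimizer updates fire."""
    effective_batch_size = steps_per_generation * num_gpus * per_device_train_batch_size
    unique_prompts = effective_batch_size // num_generations
    next_prompt = 0
    for cycle in range(cycles):
        # One generation cycle: each unique prompt repeated num_generations times.
        samples = [p for p in range(next_prompt, next_prompt + unique_prompts)
                   for _ in range(num_generations)]
        next_prompt += unique_prompts
        print(f"Generation cycle {chr(ord('A') + cycle)}")
        for step in range(steps_per_generation):
            batch = samples[step * per_device_train_batch_size:
                            (step + 1) * per_device_train_batch_size]
            update = " -> optimizer update" if (step + 1) % gradient_accumulation_steps == 0 else ""
            print(f"  Step {step}: {batch}{update}")

# Example 1: four unique prompts and two optimizer updates per generation cycle.
simulate_schedule(per_device_train_batch_size=3, gradient_accumulation_steps=2,
                  steps_per_generation=4, num_generations=3)
```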
Quick Formula Reference
effective_batch_size = steps_per_generation * num_processes * train_batch_size
optimizer_steps_per_generation = steps_per_generation / gradient_accumulation_steps
unique_prompts = effective_batch_size / num_generations # must be > 2