🧩 Advanced RL Documentation
Advanced settings for using Unsloth with GRPO.
Detailed guides on running GRPO with Unsloth, covering training, generation, and batching parameters:
Training Parameters
- `beta` (float, default `0.0`): KL coefficient. `0.0` means no reference model is loaded (lower memory use, faster training). Higher `beta` constrains the policy to stay closer to the reference policy.
- `num_iterations` (int, default `1`): PPO epochs per batch (μ in the algorithm). Replays data within each gradient accumulation step; e.g., `2` means two forward passes per accumulation step.
- `epsilon` (float, default `0.2`): Clipping value for token-level log-probability ratios (the clipped ratio stays in roughly [0.8, 1.2] with the default ε).
- `delta` (float, optional): When set, enables the upper clipping bound for two-sided GRPO. If `None`, standard GRPO clipping is used. Recommended to be `> 1 + epsilon` when enabled (per the INTELLECT-2 report).
- `epsilon_high` (float, optional): Upper-bound epsilon; defaults to `epsilon` if unset. DAPO recommends `0.28`.
- `importance_sampling_level` (`"token"` | `"sequence"`, default `"token"`): `"token"` uses the raw per-token ratios (one weight per token); `"sequence"` averages the per-token ratios into a single sequence-level ratio. GSPO shows that sequence-level sampling often gives more stable training for sequence-level rewards. See the sketch after this list.
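The sketch below (plain PyTorch with illustrative names; not Unsloth's or TRL's actual implementation) shows how the per-token ratio, the `epsilon`/`epsilon_high` clip, and the two `importance_sampling_level` options fit together:

```python
import torch

def clipped_surrogate_sketch(new_logps, old_logps, advantages, mask,
                             epsilon=0.2, epsilon_high=None,
                             importance_sampling_level="token"):
    """Illustrative GRPO clipped objective.

    new_logps/old_logps/mask: (batch, seq_len); advantages: (batch,)
    """
    eps_high = epsilon_high if epsilon_high is not None else epsilon  # DAPO suggests 0.28

    log_ratio = new_logps - old_logps
    if importance_sampling_level == "sequence":
        # GSPO-style: average per-token log-ratios into one ratio per sequence,
        # then broadcast that single weight back over the tokens.
        seq_log_ratio = (log_ratio * mask).sum(-1) / mask.sum(-1).clamp(min=1)
        log_ratio = seq_log_ratio.unsqueeze(-1).expand_as(mask)

    ratio = log_ratio.exp()                                        # pi_new / pi_old per token
    adv = advantages.unsqueeze(-1)                                 # broadcast over tokens
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - epsilon, 1 + eps_high) * adv  # ~[0.8, 1.2] by default
    per_token_loss = -torch.min(unclipped, clipped) * mask
    return per_token_loss                                          # reduction depends on loss_type
```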
- `reward_weights` (list[float], optional): One weight per reward function. If `None`, all weights are `1.0`.
- `scale_rewards` (str | bool, default `"group"`): `True` or `"group"` scales rewards by the standard deviation within each group (unit variance per group); `"batch"` scales by the standard deviation across the entire batch (per PPO-Lite); `False` or `"none"` applies no scaling. Dr. GRPO recommends not scaling, to avoid the difficulty bias that std scaling introduces.
- `loss_type` (str, default `"dapo"`), sketched together with the reward scaling options right after this list:
  - `"grpo"`: normalizes over sequence length (introduces a length bias; not recommended).
  - `"dr_grpo"`: normalizes by a global constant (introduced in Dr. GRPO; removes the length bias). The constant is approximately `max_completion_length`.
  - `"dapo"` (default): normalizes by the number of active tokens in the globally accumulated batch (introduced in DAPO; removes the length bias).
  - `"bnpo"`: normalizes by the number of active tokens in the local batch only (results can vary with the local batch size; equivalent to GRPO when `per_device_train_batch_size == 1`).
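A minimal sketch (plain PyTorch; function and argument names are assumptions for illustration, not Unsloth's or TRL's actual code) of how `scale_rewards` turns rewards into group-relative advantages and how each `loss_type` reduces the masked per-token loss to a scalar:

```python
import torch

def compute_advantages(rewards, num_generations, scale_rewards="group"):
    """Group-relative advantages with the three scale_rewards options (illustrative)."""
    grouped = rewards.view(-1, num_generations)              # one row per prompt group
    advantages = grouped - grouped.mean(dim=1, keepdim=True)
    if scale_rewards in (True, "group"):
        advantages = advantages / (grouped.std(dim=1, keepdim=True) + 1e-4)
    elif scale_rewards == "batch":
        advantages = advantages / (rewards.std() + 1e-4)     # PPO-Lite style: batch-wide std
    # False / "none": leave centered but unscaled (Dr. GRPO recommendation)
    return advantages.view(-1)

def reduce_loss(per_token_loss, mask, loss_type="dapo",
                max_completion_length=1024, global_active_tokens=None):
    """How each loss_type turns masked per-token losses into a scalar (illustrative)."""
    masked = per_token_loss * mask
    if loss_type == "grpo":      # mean over each sequence, then over the batch (length-biased)
        return (masked.sum(-1) / mask.sum(-1).clamp(min=1)).mean()
    if loss_type == "dr_grpo":   # divide by a global constant ~ max_completion_length
        return masked.sum() / (mask.size(0) * max_completion_length)
    if loss_type == "bnpo":      # active tokens in the local batch only
        return masked.sum() / mask.sum().clamp(min=1)
    # "dapo": active tokens across the whole accumulated (global) batch,
    # which the trainer tracks across gradient-accumulation steps.
    denom = global_active_tokens if global_active_tokens is not None else mask.sum()
    return masked.sum() / denom
```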
- `mask_truncated_completions` (bool, default `False`): When `True`, truncated completions are excluded from the loss (recommended by DAPO for stability). Note: there are KL issues with this flag, so we recommend disabling it.

  ```python
  # If mask_truncated_completions is enabled, zero out truncated completions in completion_mask
  if self.mask_truncated_completions:
      truncated_completions = ~is_eos.any(dim=1)
      completion_mask = completion_mask * (~truncated_completions).unsqueeze(1).int()
  ```

  This can zero out all `completion_mask` entries when many completions are truncated, making `n_mask_per_reward = 0` and causing the KL term to become NaN.
- `vllm_importance_sampling_correction` (bool, default `True`): Applies Truncated Importance Sampling (TIS) to correct off-policy effects when the generation backend (e.g., vLLM / `fast_inference`) differs from the training backend. In Unsloth, this is automatically set to `True` when you use vLLM / `fast_inference`, and `False` otherwise.
- `vllm_importance_sampling_cap` (float, default `2.0`): Truncation parameter C for TIS; sets an upper bound on the importance-sampling ratio to improve stability. A TIS sketch follows this list.
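A minimal sketch of the TIS correction described above (illustrative names; not Unsloth's internal code): the per-token probability ratio between the training backend and the generation backend is capped at C before it weights the loss.

```python
import torch

def tis_weight(train_logps, gen_logps, cap=2.0):
    """Truncated Importance Sampling weight (illustrative).

    train_logps: per-token log-probs recomputed by the training backend
    gen_logps:   per-token log-probs reported by the generation backend (e.g. vLLM)
    cap:         truncation parameter C (vllm_importance_sampling_cap)
    """
    ratio = (train_logps - gen_logps).exp()       # pi_train / pi_gen per token
    return torch.clamp(ratio, max=cap).detach()   # cap at C and treat as a constant weight

# The capped weight then multiplies the per-token loss, e.g.:
# loss = (tis_weight(train_logps, gen_logps) * per_token_loss * completion_mask).sum() / denom
```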
Generation Parameters
- `temperature` (float, defaults to `1.0`): Temperature for sampling. The higher the temperature, the more random the completions. Keep it relatively high (around `1.0`) so generations stay diverse, which helps learning.
- `top_p` (float, optional, defaults to `1.0`): Controls the cumulative probability of the top tokens to consider. Must be in `(0, 1]`. Set to `1.0` to consider all tokens.
- `top_k` (int, optional): Number of highest-probability vocabulary tokens to keep for top-k filtering. If `None`, top-k filtering is disabled and all tokens are considered.
- `min_p` (float, optional): Minimum token probability, scaled by the probability of the most likely token. Must be between `0.0` and `1.0`. Typical values are in the `0.01`–`0.2` range.
- `repetition_penalty` (float, optional, defaults to `1.0`): Penalizes new tokens based on whether they appear in the prompt and the generated text so far. Values `> 1.0` encourage the model to use new tokens; values `< 1.0` encourage it to repeat tokens.
- `steps_per_generation` (int, optional): Number of steps per generation. If `None`, defaults to `gradient_accumulation_steps`. Mutually exclusive with `generation_batch_size`. See the config sketch after this list.
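As a hedged example (assuming these sampling fields are exposed on `trl`'s `GRPOConfig`, which Unsloth's GRPO trainer builds on; the values are illustrative only):

```python
from trl import GRPOConfig

# Illustrative values; field names assume TRL's GRPOConfig, which Unsloth wraps.
training_args = GRPOConfig(
    temperature=1.0,            # keep high enough for diverse generations
    top_p=1.0,                  # consider the full distribution
    top_k=None,                 # disable top-k filtering
    min_p=0.1,                  # typical range is 0.01-0.2
    repetition_penalty=1.0,     # no repetition penalty
    # steps_per_generation=None # falls back to gradient_accumulation_steps
)
```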
Batch & Throughput Parameters
Parameters that control batches
- `train_batch_size`: Number of samples per process per step. If this integer is less than `num_generations`, it defaults to `num_generations`.
- `steps_per_generation`: Number of microbatches that contribute to one generation's loss calculation (forward passes only). A new batch of data is generated every `steps_per_generation` steps; backpropagation timing depends on `gradient_accumulation_steps`.
- `num_processes`: Number of distributed training processes (e.g., GPUs / workers).
- `gradient_accumulation_steps` (aka `gradient_accumulation`): Number of microbatches to accumulate before applying backpropagation and the optimizer update.
- Effective batch size: `effective_batch_size = steps_per_generation * num_processes * train_batch_size` is the total number of samples contributing to gradients before an update (across all processes and steps).
- Optimizer steps per generation: `optimizer_steps_per_generation = steps_per_generation / gradient_accumulation_steps`. Example: `4 / 2 = 2`.
- `num_generations`: Number of generations produced per prompt (applied after computing `effective_batch_size`). The number of unique prompts in a generation cycle is `unique_prompts = effective_batch_size / num_generations`. Must be > 2 for GRPO to work. These relationships are sketched in code after this list.
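A quick way to sanity-check these relationships (plain Python; the helper name is illustrative):

```python
def grpo_batch_math(train_batch_size, steps_per_generation, num_processes,
                    gradient_accumulation_steps, num_generations):
    """Reproduces the batch formulas above (illustrative helper)."""
    effective_batch_size = steps_per_generation * num_processes * train_batch_size
    optimizer_steps_per_generation = steps_per_generation / gradient_accumulation_steps
    unique_prompts = effective_batch_size / num_generations
    return effective_batch_size, optimizer_steps_per_generation, unique_prompts

# Example 1 below: 1 GPU, per-device batch 3, accumulation 2, 4 steps per generation,
# 3 generations per prompt -> (12, 2.0, 4.0)
print(grpo_batch_math(train_batch_size=3, steps_per_generation=4, num_processes=1,
                      gradient_accumulation_steps=2, num_generations=3))
```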
GRPO Batch Examples
The tables below illustrate how batches flow through steps, when optimizer updates occur, and how new batches are generated.
Example 1
num_gpus = 1
per_device_train_batch_size = 3
gradient_accumulation_steps = 2
steps_per_generation = 4
effective_batch_size = 4 * 3 * 1 = 12
num_generations = 3

Generation cycle A

| Step | Batch (prompt IDs) | Optimizer |
|------|--------------------|-----------|
| 0 | [0,0,0] | |
| 1 | [1,1,1] | → optimizer update (accum = 2 reached) |
| 2 | [2,2,2] | |
| 3 | [3,3,3] | → optimizer update (accum = 2 reached) |

Generation cycle B

| Step | Batch (prompt IDs) | Optimizer |
|------|--------------------|-----------|
| 0 | [4,4,4] | |
| 1 | [5,5,5] | → optimizer update (accum = 2 reached) |
| 2 | [6,6,6] | |
| 3 | [7,7,7] | → optimizer update (accum = 2 reached) |
Example 2
num_gpus = 1
per_device_train_batch_size = 3
steps_per_generation = gradient_accumulation_steps = 4
effective_batch_size = 4 * 3 * 1 = 12
num_generations = 3

Generation cycle A

| Step | Batch (prompt IDs) | Optimizer |
|------|--------------------|-----------|
| 0 | [0,0,0] | |
| 1 | [1,1,1] | |
| 2 | [2,2,2] | |
| 3 | [3,3,3] | → optimizer update (accum = 4 reached) |

Generation cycle B

| Step | Batch (prompt IDs) | Optimizer |
|------|--------------------|-----------|
| 0 | [4,4,4] | |
| 1 | [5,5,5] | |
| 2 | [6,6,6] | |
| 3 | [7,7,7] | → optimizer update (accum = 4 reached) |
Example 3
num_gpus = 1
per_device_train_batch_size = 3
steps_per_generation = gradient_accumulation_steps = 4
effective_batch_size = 4 * 3 * 1 = 12
num_generations = 4
unique_prompts = effective_batch_size / num_generations = 3

Generation cycle A

| Step | Batch (prompt IDs) | Optimizer |
|------|--------------------|-----------|
| 0 | [0,0,0] | |
| 1 | [0,1,1] | |
| 2 | [1,1,3] | |
| 3 | [3,3,3] | → optimizer update (accum = 4 reached) |

Generation cycle B

| Step | Batch (prompt IDs) | Optimizer |
|------|--------------------|-----------|
| 0 | [4,4,4] | |
| 1 | [4,5,5] | |
| 2 | [5,5,6] | |
| 3 | [6,6,6] | → optimizer update (accum = 4 reached) |
Example 4
num_gpus = 1
per_device_train_batch_size = 6
steps_per_generation = gradient_accumulation_steps = 2
effective_batch_size = 2 * 6 * 1 = 12
num_generations = 3
unique_prompts = 4

Generation cycle A

| Step | Batch (prompt IDs) | Optimizer |
|------|--------------------|-----------|
| 0 | [0,0,0, 1,1,1] | |
| 1 | [2,2,2, 3,3,3] | → optimizer update (accum = 2 reached) |

Generation cycle B

| Step | Batch (prompt IDs) | Optimizer |
|------|--------------------|-----------|
| 0 | [4,4,4, 5,5,5] | |
| 1 | [6,6,6, 7,7,7] | → optimizer update (accum = 2 reached) |
Quick Formula Reference
effective_batch_size = steps_per_generation * num_processes * train_batch_size
optimizer_steps_per_generation = steps_per_generation / gradient_accumulation_steps
unique_prompts = effective_batch_size / num_generations # must be > 2
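For completeness, a small self-contained simulation (illustrative, not Unsloth's actual scheduler) that reproduces the step/batch/update layout of the tables above from the same knobs:

```python
def simulate_grpo_schedule(num_gpus, per_device_train_batch_size,
                           gradient_accumulation_steps, steps_per_generation,
                           num_generations, num_cycles=2):
    """Print per-step (global) batches and optimizer updates, mirroring the tables above."""
    per_step = per_device_train_batch_size * num_gpus       # samples consumed per step
    effective_batch_size = steps_per_generation * per_step
    unique_prompts = effective_batch_size // num_generations
    prompt_id = 0
    for cycle in range(num_cycles):
        # num_generations completions per unique prompt, laid out group by group
        samples = [p for p in range(prompt_id, prompt_id + unique_prompts)
                   for _ in range(num_generations)]
        prompt_id += unique_prompts
        print(f"Generation cycle {chr(ord('A') + cycle)}")
        for step in range(steps_per_generation):
            batch = samples[step * per_step:(step + 1) * per_step]
            update = (step + 1) % gradient_accumulation_steps == 0
            print(f"  step {step}: {batch}" + ("  -> optimizer update" if update else ""))

# Example 1: 1 GPU, per-device batch 3, accumulation 2, 4 steps per generation,
# 3 generations per prompt
simulate_grpo_schedule(1, 3, 2, 4, 3)
```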