gpt-oss Reinforcement Learning
You can now train OpenAI gpt-oss with RL and GRPO via Unsloth. Unsloth now offers the fastest inference (3x faster), lowest VRAM (50% less) and most context (8x longer) for gpt-oss RL vs. any implementation, with no accuracy loss. Since RL on gpt-oss isn't yet vLLM compatible, we rewrote the Transformers inference code to deliver 3x faster inference for gpt-oss at ~21 tokens/s. For BF16, Unsloth also achieves the fastest inference (~30 tokens/s) while using 50% less VRAM than any other implementation.
Free notebook: gpt-oss-20b GSPO Colab notebook. This notebook automatically creates faster matrix multiplication kernels and uses a new Unsloth reward function. We also show how to counteract reward hacking, which is one of RL's biggest challenges.
With Unsloth, you can train gpt-oss-20b with GRPO on 15GB VRAM, free on Colab. Unsloth's new inference runs faster on any GPU, including A100s, H100s and older T4s. gpt-oss-120b fits on 80GB VRAM.
Unsloth is the only framework to support 4-bit RL for gpt-oss. All performance gains are due to Unsloth's unique weight sharing, Flex Attention, Standby and custom kernels.
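Below is a minimal sketch of what a typical Unsloth + TRL GRPO setup for gpt-oss-20b looks like. The model id, LoRA settings, reward function, dataset and hyperparameters are placeholders for illustration; see the free notebook for the tested configuration.

```python
# Minimal sketch of an Unsloth + TRL GRPO setup for gpt-oss-20b.
# Model id, LoRA settings, reward function and dataset are placeholders;
# the linked Colab notebook has the tested configuration.
from unsloth import FastLanguageModel
from trl import GRPOConfig, GRPOTrainer
from datasets import Dataset

model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/gpt-oss-20b",      # assumed model id
    max_seq_length=2048,
    load_in_4bit=True,          # 4-bit weights keep the 20B model in ~15GB VRAM
)
model = FastLanguageModel.get_peft_model(
    model,
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # illustrative
)

def length_reward(completions, **kwargs):
    # Placeholder reward: replace with a task-specific scoring function.
    return [min(len(c) / 100.0, 1.0) for c in completions]

dataset = Dataset.from_dict({"prompt": ["Write a fast matmul kernel."] * 64})

trainer = GRPOTrainer(
    model=model,
    reward_funcs=[length_reward],
    args=GRPOConfig(
        output_dir="outputs",
        max_steps=50,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        num_generations=4,
    ),
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()
```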
Reminder: Flash Attention 3 (FA3) is unsuitable for gpt-oss training since it currently doesn’t support backward passes for attention sinks, causing incorrect training loss. If you’re not using Unsloth, FA3 may be enabled by default, so please double-check it’s not in use!
⚡Making Inference Much Faster
Inference is crucial in RL training. To achieve the fastest inference speed for gpt-oss without vLLM, we rewrote the Transformers inference code and integrated many innovations, including custom algorithms like Unsloth Flex Attention and torch.compile. The new inference was evaluated against an already optimized baseline (2x faster than native Transformers).
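As a rough illustration of one piece of this approach, compiling the per-token decode step lets PyTorch fuse the repeated forward passes into fewer kernels. This sketch uses the standard torch.compile and Transformers forward-pass APIs and is not Unsloth's actual inference code.

```python
# Illustrative only: compile the single-token decode step so repeated forward
# passes during generation run through fused, cached kernels.
# This is a generic pattern, not Unsloth's rewritten inference path.
import torch

@torch.compile(dynamic=True)
def decode_step(model, input_ids, past_key_values):
    # One forward pass over the newest token, reusing the KV cache.
    with torch.no_grad():
        return model(
            input_ids=input_ids,
            past_key_values=past_key_values,
            use_cache=True,
        )
```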
vLLM does not support RL for gpt-oss since it lacks bf16 training and LoRA support for gpt-oss. Without Unsloth, only bf16 training works, which makes memory use 800%+ higher. Most frameworks enable FA3 by default (which reduces VRAM use and increases speed), but this causes incorrect training loss. You must disable FA3, yet doing so prevents long-context training, so we implemented Unsloth Flex Attention instead.
We evaluated gpt-oss RL inference by benchmarking BitsandBytes 4-bit and also did separate tests for BF16. Unsloth’s 4-bit inference is ~4x faster, and BF16 is also more efficient, especially in VRAM use.
The best part about Unsloth's gpt-oss RL is that it can work on any GPU, even those that do not support bf16. Our free gpt-oss-20b Colab notebooks use older 15GB T4 GPUs, so the inference examples work well!
🛠️ gpt-oss Flex Attention Issues and Quirks
Masking was a tricky issue to deal with. Not only did we have to account for KV cache prefill during token generation (necessary for efficient inference, since we need to know which KV cache entries correspond to which token in which layer), we also had to account for a different number of pad tokens in each prompt during batch generation, which changes how we need to store the block mask. An example is shown below:
Normal Causal Mask:
k0 k1 k2 k3 k4 <-- keys
q0 X
q1 X X
q2 X X X
q3 X X X X
q4 X X X X X <-- last query row (most important for decoding)
For inference in the general case (decoding):
k0 k1 k2 k3 k4
q0
q1
q2
q3
q4 X X X X X
If we naively use the same masking strategy:
k0 k1 k2 k3 k4
q0
q1
q2
q3
q4 X (note that q4 has q_idx 0, since it is the only query in the current setup)
For generation (the decoding phase), we usually only care about the last row of the attention matrix, since there's just one query token attending to all previous key tokens. If we naïvely apply the causal mask (q_idx ≥ k_idx), it fails: our single query has index 0, while there are n_k key tokens. To fix this, we need an offset during mask creation to decide which tokens to attend to. But a naïve approach is slow, since the offset changes at each step, forcing mask and kernel regeneration. We solved this with caching and compile optimizations.
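Here is a hedged sketch of the offset idea using PyTorch's Flex Attention `create_block_mask` API. The `offset` tensor is an illustrative assumption, not Unsloth's exact implementation.

```python
# Sketch of an offset-aware causal mask for single-token decoding with
# PyTorch Flex Attention. `offset` holds the absolute position of the current
# query, since q_idx is 0 during decode. Illustrative only.
import torch
from torch.nn.attention.flex_attention import create_block_mask

offset = torch.tensor(0, device="cuda")  # absolute position of the decode query

def causal_with_offset(b, h, q_idx, kv_idx):
    # During prefill the offset is 0 and this is the usual causal rule;
    # during decode, q_idx is 0, so the offset supplies the true position.
    return kv_idx <= q_idx + offset

# Rebuilding the block mask every decode step like this is exactly what makes
# the naive approach slow, hence the caching / torch.compile optimizations.
block_mask = create_block_mask(
    causal_with_offset, B=1, H=None, Q_LEN=1, KV_LEN=1024, device="cuda"
)
```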
The harder part is batch generation. Sequences differ in length, so padding complicates mask creation; Flex Attention poses plenty of challenges of its own, and dynamic masks are tricky. Worse, if not compiled, Flex Attention falls back to eager attention, which is slow and memory-heavy (quadratic rather than linear in sequence length). Ultimately, the mask must dynamically handle prefill vs. decode with the KV cache, batching with per-sequence padding tokens, remain torch.compile friendly, and support sliding windows.
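Extending the same idea to batched decoding, the mask function can consult per-sequence tensors for padding and position. The tensor names and shapes below are assumptions for illustration, not Unsloth's internal variables.

```python
# Illustrative sketch of a batched decode mask: each sequence carries its own
# number of left-pad tokens and its own current position.
import torch

pad_lengths = torch.tensor([3, 0], device="cuda")   # leading pad tokens per sequence
positions   = torch.tensor([17, 20], device="cuda") # current decode position per sequence

def batched_decode_mask(b, h, q_idx, kv_idx):
    not_padding = kv_idx >= pad_lengths[b]        # never attend to pad tokens
    causal      = kv_idx <= q_idx + positions[b]  # offset-aware causal rule
    # A sliding-window variant would additionally require
    # (q_idx + positions[b]) - kv_idx < window_size.
    return not_padding & causal
```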
🔍 FlashAttention Investigation
Another interesting direction we explored was integrating FlashAttention. Its advantages are widely recognized, so it seemed like an obvious choice to try, but one limitation is that it does not support attention sinks during the backward pass for gpt-oss. To work around this, we restructured the attention mechanism so that it operates solely on the attention output and the log-sum-exp values that FlashAttention readily provides.
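The sketch below shows the kind of rescaling this enables: given FlashAttention's plain-softmax output and its log-sum-exp, a learnable sink logit can be folded in afterwards. The function is an illustrative reconstruction under assumed shapes and names, not the exact Unsloth code.

```python
# Illustrative reconstruction: fold an attention sink into FlashAttention's
# result after the fact, using only the output `o` and log-sum-exp `lse`.
import torch

def apply_attention_sink(o: torch.Tensor, lse: torch.Tensor, sink: torch.Tensor):
    """
    o:    (batch, heads, q_len, head_dim) - attention output without the sink
    lse:  (batch, heads, q_len)           - log-sum-exp of the raw attention scores
    sink: (heads,)                        - learnable per-head sink logit
    """
    # Adding the sink grows the softmax denominator from exp(lse) to
    # exp(lse) + exp(sink), which is the same as scaling o by sigmoid(lse - sink).
    scale = torch.sigmoid(lse - sink.view(1, -1, 1)).unsqueeze(-1)
    return o * scale
```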
However, we soon began noticing issues. While the first few layers behaved as expected, the later layers, particularly layers 18 through 24, produced outputs that diverged significantly from the eager-mode implementation in Transformers. Importantly, this discrepancy cannot be attributed to error accumulation, since the inputs to each method are identical at every layer. For further validation, we also compared the results against Unsloth Flex Attention.

Why only the last few layers show such a drastic difference between the FlashAttention implementation and the others needs further investigation.
Flash Attention 3
FA3 is often enabled by default in most training packages (not Unsloth), but this is incorrect for gpt-oss. Using FA3 will make the training loss completely wrong, as FA3 doesn't support backward passes for gpt-oss attention sinks. Many people are still unaware of this, so please be cautious!
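If you are on another framework, one sanity check is to load gpt-oss with an explicit attention backend so FA3 cannot be picked up silently. This sketch uses the standard Transformers `attn_implementation` argument; "eager" is just the safest illustrative choice, and the model id and dtype are placeholders.

```python
# Sanity-check sketch: load gpt-oss with an explicit attention implementation
# so FlashAttention 3 cannot be enabled by default. "eager" is slower but safe.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "openai/gpt-oss-20b",
    attn_implementation="eager",   # explicitly avoid FlashAttention kernels
    torch_dtype=torch.bfloat16,
)
print(model.config._attn_implementation)  # verify which backend is actually in use
```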
⚠️ Can We Counter Reward Hacking?
The ultimate goal of RL is to maximize some reward (say speed, revenue, or some other metric). But RL can cheat. When the RL algorithm learns a trick or exploits something to increase the reward without actually doing the task at hand, this is called "Reward Hacking". It's the reason models learn to modify unit tests to pass coding challenges, and it is a critical blocker for real-world deployment. Some other good examples are on Wikipedia.

In our notebook, we explore how to counter reward hacking in a code generation setting and showcase tangible solutions to common failure modes. We saw the model edit the timing function, outsource work to other libraries, cache results, and outright cheat. After countering these, our model generates genuinely optimized matrix multiplication kernels, not clever cheats.
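To give a flavor of what such counters can look like, here is a hedged sketch of a reward function that zeroes out the score whenever the generated code shows one of the hacks above. The banned patterns and function names are illustrative, not the notebook's actual reward code.

```python
# Illustrative sketch of penalizing reward hacking for a "write a fast matmul
# kernel" task: any completion that outsources the work, caches results, or
# touches the timer gets zero reward. Patterns and names are placeholders.
import re

BANNED_PATTERNS = [
    r"\bimport\s+numpy\b",             # outsourcing the matmul to another library
    r"\btorch\.matmul\b",              # calling a built-in instead of writing a kernel
    r"\blru_cache\b",                  # caching results to fool repeated timing runs
    r"\btime\.(time|perf_counter)\b",  # tampering with or re-implementing timing
]

def anti_hack_reward(generated_code: str, measured_speedup: float) -> float:
    # Zero reward if any banned pattern appears; otherwise reward the speedup.
    for pattern in BANNED_PATTERNS:
        if re.search(pattern, generated_code):
            return 0.0
    return measured_speedup
```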
💫 From OpenAI's Labs to Your Laptop
gpt-oss is a legitimate frontier-class architecture from OpenAI that could power breakthrough AI applications. Until now, training these models with RL was exclusively limited to well-funded labs with H100s to spare.
Today, that changes. You can train gpt-oss-20b with GRPO on a free Google Colab tier here. Free, Frontier Model, Training.