⚡3x Faster LLM Training with Unsloth Kernels + Packing
Learn how Unsloth increases training throughput and eliminates padding waste for fine-tuning.
Unsloth now supports up to 5x faster (typically 3x) training with our new custom RoPE and MLP Triton kernels, plus our new smart auto packing. Unsloth's new kernels + features not only increase training speed, but also further reduce VRAM use (30% - 90%) with no accuracy loss. This means you can now train LLMs like Qwen3-4B not only on just 3GB of VRAM, but also 3x faster.
Our auto padding-free, uncontaminated packing is smartly enabled for all training runs with no changes needed, and works with all fast attention backends (FlashAttention 3, xFormers, SDPA). Benchmarks show training losses match non-packing runs exactly.
2.3x faster QK Rotary Embedding fused Triton kernel with packing support
Updated SwiGLU, GeGLU kernels with int64 indexing for long context
2.5x to 5x faster uncontaminated packing with xformers, SDPA, FA3 backends
2.1x faster padding-free, 50% less VRAM, 0% accuracy change
Unsloth also now has improved SFT loss stability and more predictable GPU utilization.
This new upgrade works for all training methods, e.g. full fine-tuning, pretraining, etc.
🥁Fused QK RoPE Triton Kernel with packing
Back in December 2023, we introduced a RoPE kernel written in Triton as part of our Unsloth launch. In March 2024, a community member made end-to-end training 1-2% faster by optimizing the RoPE kernel to launch one block for a group of heads. See PR 238.

One issue was that Q and K were each processed by a separate Triton kernel, so every RoPE application required 2 kernel launches. We have now merged them into 1 Triton kernel, and enabled variable-length RoPE, which is essential for padding-free and packing support. In micro benchmarks, the RoPE kernel is now 2.3x faster on longer context lengths, and 1.9x faster on shorter context lengths.
We also eliminated all clones and contiguous transpose operations, so RoPE is now fully in-place, further reducing GPU memory. Note that for the backward pass we can reuse the same kernel with the sine negated (sin1 = -sin1), since the rotation matrix is orthogonal and its transpose is the rotation by $-\theta$:

$$R(\theta)^\top = R(-\theta), \qquad \cos(-\theta) = \cos(\theta), \qquad \sin(-\theta) = -\sin(\theta)$$
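To make the negated-sine trick concrete, here is a minimal PyTorch reference of the rotation (a sketch only - this is not the fused Triton kernel, and the tensor layout is an assumption):

```python
import torch

def rope_reference(x, cos, sin):
    # x: (batch, seq_len, n_heads, head_dim); cos/sin: (seq_len, head_dim // 2)
    d = x.shape[-1] // 2
    x1, x2 = x[..., :d], x[..., d:]
    cos = cos[None, :, None, :]   # broadcast over batch and heads
    sin = sin[None, :, None, :]
    return torch.cat((x1 * cos - x2 * sin, x2 * cos + x1 * sin), dim=-1)

def rope_backward_reference(grad_out, cos, sin):
    # Backward pass = the forward rotation with the sine negated (rotation by -theta)
    return rope_reference(grad_out, cos, -sin)
```

Because the backward pass is just the forward code path with `-sin`, the same fused kernel handles both directions.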
🚃Int64 Indexing for Triton Kernels
During 500K long-context training, which we introduced in 500K Context Training, we would get CUDA out-of-bounds errors. This was because the MLP kernels for SwiGLU and GeGLU used int32 indexing, which is the default in Triton and CUDA.
We can't just do tl.program_id(0).to(tl.int64), since training would be slightly slower due to int64 indexing everywhere. Instead, we make this a LONG_INDEXING: tl.constexpr variable so the Triton compiler can specialize the kernel. This allows both shorter and longer context runs to run fast!
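As an illustration, here is a simplified Triton kernel (the kernel body and names are hypothetical, not Unsloth's actual SwiGLU/GeGLU kernels) showing how a tl.constexpr flag lets the compiler emit an int32 and an int64 specialization:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def _scale_kernel(x_ptr, out_ptr, n_elements, scale,
                  BLOCK_SIZE: tl.constexpr, LONG_INDEXING: tl.constexpr):
    pid = tl.program_id(0)
    if LONG_INDEXING:
        # Promote to int64 only when offsets can exceed 2**31 - 1
        pid = pid.to(tl.int64)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x * scale, mask=mask)

def scale(x, s):
    out = torch.empty_like(x)
    n = x.numel()
    # Compile-time flag: Triton specializes the kernel for each value
    long_indexing = n >= 2**31
    grid = (triton.cdiv(n, 1024),)
    _scale_kernel[grid](x, out, n, s, BLOCK_SIZE=1024, LONG_INDEXING=long_indexing)
    return out
```

Short-context runs keep cheap int32 arithmetic, while very long-context runs get correct int64 offsets, without any runtime branching cost.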
♠️Uncontaminated Packing 2-5x faster training
Real datasets contain sequences of very different lengths, so increasing the batch size to, for example, 32 introduces padding, making training slower and more VRAM hungry.
In the past, increasing batch_size to large numbers (>32) would make training SLOWER, not faster. This was due to padding - we can now eliminate this issue via packing = True, so training gets FASTER!
When we pack multiple samples into a single one-dimensional tensor, we keep sequence-length metadata around to properly mask each sample, so attention does not leak between samples. We also need the RoPE kernel described in Fused QK RoPE Triton Kernel with packing to allow position ids to reset at each sample boundary, as sketched below.
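As a sketch (this is not Unsloth's internal collator; the cu_seqlens name just follows the FlashAttention varlen convention), here is how the packed tensor, the cumulative sequence lengths, and the restarting position ids can be built:

```python
import torch

def pack_samples(samples):
    """Pack variable-length token id lists into one 1-D tensor,
    with metadata so attention never crosses sample boundaries."""
    lengths = torch.tensor([len(s) for s in samples])
    input_ids = torch.cat([torch.tensor(s) for s in samples])   # (total_tokens,)
    # Cumulative sequence lengths, e.g. [0, 5, 12, 20] for lengths [5, 7, 8]
    cu_seqlens = torch.cat([torch.zeros(1, dtype=torch.long), lengths.cumsum(0)])
    # Position ids restart at 0 for every packed sample, so RoPE is applied
    # as if each sample were alone in the batch
    position_ids = torch.cat([torch.arange(l) for l in lengths.tolist()])
    return input_ids, cu_seqlens, position_ids

samples = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
ids, cu, pos = pack_samples(samples)
print(cu)    # tensor([0, 3, 5, 9])
print(pos)   # tensor([0, 1, 2, 0, 1, 0, 1, 2, 3])
```

The attention backend then uses the boundaries to build a block-diagonal mask (or varlen kernel call), which is what keeps the packing "uncontaminated".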


🧮Why is padding needed & mathematical speedup
Computers and GPUs cannot efficiently process batches of different-length sequences, so we have to pad them with 0s. This wastes compute. Assume a dataset of N sequences, where 50% are short sequences of length S and 50% are long sequences of length L. In the worst case, every row gets padded up to the longest sequence, so token usage is

$$\text{tokens}_{\text{padded}} \approx N \cdot L$$

since the longest sequence length dominates.

By packing multiple examples into a single, long one-dimensional tensor, we can eliminate a significant amount of padding. In fact we get the below token usage:

$$\text{tokens}_{\text{packed}} \approx N \cdot \frac{S + L}{2}$$

By some math and algebra, we can work out the speedup via:

$$\text{speedup} = \frac{N \cdot L}{N \cdot \frac{S + L}{2}} = \frac{2L}{S + L} = \frac{2}{1 + S/L}$$

By assuming $S \ll L$ (so $S/L \to 0$), we get a 2x theoretical speedup since $\frac{2}{1 + S/L} \to 2$.

By changing the ratio of short sequences and assuming we have MORE short sequences, e.g. 20% long sequences and 80% short sequences, we get

$$\text{speedup} = \frac{L}{0.8\,S + 0.2\,L} \;\to\; \frac{1}{0.2} = 5$$

so 5x faster training! This means packing's speedup depends on how many short rows your dataset has (the more short rows, the faster).
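As a quick sanity check of the formula (the lengths below are made up purely for illustration):

```python
def packing_speedup(frac_short, S, L):
    """Theoretical speedup of packing vs. worst-case padding to length L."""
    padded_tokens = L                                       # every row padded to L
    packed_tokens = frac_short * S + (1 - frac_short) * L   # actual tokens per row
    return padded_tokens / packed_tokens

print(packing_speedup(0.5, S=64, L=4096))   # ~1.97, close to the 2x bound
print(packing_speedup(0.8, S=64, L=4096))   # ~4.71, close to the 5x bound
```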
🏖️Analysis and Benchmarks
To demonstrate the various improvements when training with our new kernels and packed data, we ran fine-tuning runs with Qwen3-32B, Qwen3-8B, and Llama 3 8B on the yahma/alpaca-cleaned dataset and measured training loss, throughput, and efficiency metrics. We compared our new runs vs. a standard optimized training run with our own kernels/optimizations and Flash Attention 3 (FA3) enabled. We fixed max_length = 1024 and varied the batch size in {1, 2, 4, 8, 16, 32}. This allows the maximum token count per batch to vary in {1024, 2048, 4096, 8192, 16K, 32K}.

The above shows how tokens per second (tokens/s) training throughput varies with batch size for the new Unsloth. This translates into training your model on an epoch of your dataset 1.7-3x faster (sometimes even 5x or more)! These gains are more pronounced if there are many short sequences in your data and if you have longer training runs, as described in Why is padding needed & mathematical speedup.

The above shows the average percentage of tokens per batch that are valid (i.e., non-padding). As the batch size grows, many more padding tokens appear in the unpacked case, while the packed case achieves high packing efficiency regardless of max sequence length.
Note that, since the batching logic trims batches to the maximum sequence length seen in the batch, when the batch size is 1, the unpacked data is all valid tokens (i.e., no padding). However, as more examples are added to the batch, padding increases on average, hitting nearly 50% padding at batch size 8! Our sample packing implementation eliminates that waste.
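To see why this happens, here is a toy simulation (the length distribution is invented for illustration, not taken from alpaca-cleaned) of average padding when each batch is padded to its longest row:

```python
import random

def padding_fraction(lengths):
    """Fraction of wasted (padding) tokens when a batch is padded
    to the longest sequence it contains."""
    longest = max(lengths)
    total_slots = longest * len(lengths)
    return 1 - sum(lengths) / total_slots

random.seed(0)
dataset = [random.randint(32, 1024) for _ in range(10_000)]   # toy length distribution

for batch_size in (1, 2, 4, 8, 16, 32):
    batches = [dataset[i:i + batch_size] for i in range(0, len(dataset), batch_size)]
    avg = sum(padding_fraction(b) for b in batches) / len(batches)
    print(f"batch_size={batch_size:>2}  avg padding = {avg:.0%}")
```

With batch size 1 there is no padding at all, and the wasted fraction grows steadily as more variable-length rows share a batch.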


The first graph (above) plots progress on yahma/alpaca-cleaned with max_length = 2048, the new Unsloth with packing + kernels (maroon) vs. the old Unsloth (gray). Both are trained with max_steps = 500, but we plot the x-axis in wall-clock time. Notice that in the packed case we train on nearly 40% of an epoch in the same number of steps (and only slightly more wall-clock time) that it takes to train on less than 5% of an epoch in the unpacked case.
Similarly, the second graph (above) plots loss from the same runs, this time with training steps on the x-axis. Notice that the losses match in scale and trend, but the loss in the packing case is less variable since the model sees more tokens per training step.
🎬Padding-Free by Default
In addition to the large throughput gains available when setting packing = True in your SFTConfig, we automatically use padding-free batching to reduce padding waste and increase tokens/s throughput, while resulting in the exact same loss as the previous version of Unsloth.
For example, for Qwen3-8B and Qwen3-32B, we see memory usage decrease by 60%, training run 2x faster, and the exact same loss and grad-norm curves!






✨How to enable packing?
Update Unsloth first - padding-free batching is then enabled by default! All training is immediately 1.1x to 2x faster, with at least 30% less memory usage and zero change in the loss curve!
We also support Flash Attention 3 via xFormers, SDPA, and Flash Attention 2, and this works on older GPUs (Tesla T4, RTX 2080) and newer GPUs like H100s, B200s, etc.! Sample packing works regardless of your choice of attention backend or model family, so you get the same speedups on top of these fast attention implementations!
Add packing = True to enable up to 5x faster training - a minimal example is sketched after this list!
All our notebooks are automatically faster (no need to do anything). See Unsloth Notebooks
Qwen3 14B faster:
Llama 3.1 Conversational faster:
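For reference, a minimal training script might look like the sketch below. This assumes the standard Unsloth + TRL SFT setup; the exact argument names (e.g. max_length vs max_seq_length, tokenizer vs processing_class) can differ between TRL versions, and the prompt formatting is just an example:

```python
from unsloth import FastLanguageModel
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Qwen3-4B",   # any supported model works
    max_seq_length = 2048,
    load_in_4bit = True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    lora_alpha = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
)

def to_text(example):
    # Minimal Alpaca-style prompt; swap in your own chat template as needed
    return {"text": f"### Instruction:\n{example['instruction']}\n\n"
                    f"### Input:\n{example['input']}\n\n"
                    f"### Response:\n{example['output']}"}

dataset = load_dataset("yahma/alpaca-cleaned", split = "train").map(to_text)

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    args = SFTConfig(
        packing = True,                      # uncontaminated sample packing
        max_length = 2048,
        per_device_train_batch_size = 8,
        max_steps = 500,
        output_dir = "outputs",
    ),
)
trainer.train()
```

Everything else (padding-free batching, the fused RoPE and MLP kernels) is applied automatically once Unsloth is updated.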
Thank you! If you're interested, see our 500K Context Training blog, Memory Efficient RL blog and Long Context gpt-oss blog for more topics on kernels and performance gains!