📊 Unsloth Benchmarks

Want to know how fast Unsloth is?

We tested on the Alpaca dataset with a batch size of 2, gradient accumulation steps of 4, rank = 32, and QLoRA applied to all linear layers (q, k, v, o, gate, up, down):

| Model | VRAM | 🦥 Unsloth speed | 🦥 VRAM reduction | 🦥 Longer context | 😊 Hugging Face + FA2 |
| --- | --- | --- | --- | --- | --- |
| Llama 3.3 (70B) | 80 GB | 2x | >75% | 13x longer | 1x |
| Llama 3.1 (8B) | 80 GB | 2x | >70% | 12x longer | 1x |
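For reference, the setup above roughly corresponds to the following Unsloth training sketch. It is a minimal illustration, not the exact benchmark harness: the model checkpoint name, the `yahma/alpaca-cleaned` dataset mirror, the formatting function, and hyperparameters such as `max_steps` and `learning_rate` are assumptions, and the placement of `dataset_text_field` / `max_seq_length` varies across TRL versions.

```python
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer
from unsloth import FastLanguageModel

# Load the base model in 4-bit for QLoRA. The checkpoint name is an assumption;
# any Llama 3.1 (8B) checkpoint is loaded the same way.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B-Instruct",
    max_seq_length = 2048,
    load_in_4bit = True,
)

# Apply LoRA with rank 32 to all linear layers (q, k, v, o, gate, up, down).
model = FastLanguageModel.get_peft_model(
    model,
    r = 32,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 32,
    use_gradient_checkpointing = "unsloth",  # Unsloth's gradient checkpointing
)

# Alpaca-style dataset, flattened into a single text field for SFTTrainer.
dataset = load_dataset("yahma/alpaca-cleaned", split = "train")

def to_text(example):
    text = f"{example['instruction']}\n{example['input']}\n{example['output']}"
    return {"text": text + tokenizer.eos_token}

dataset = dataset.map(to_text)

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = 2048,
    args = TrainingArguments(
        per_device_train_batch_size = 2,   # batch size of 2
        gradient_accumulation_steps = 4,   # gradient accumulation steps of 4
        max_steps = 60,
        learning_rate = 2e-4,
        output_dir = "outputs",
    ),
)
trainer.train()
```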

Context length benchmarks

The longer your context, the greater Unsloth's VRAM savings, thanks to our gradient checkpointing algorithm combined with Apple's Cut Cross-Entropy (CCE) algorithm!

Llama 3.1 (8B) max. context length

We tested Llama 3.1 (8B) Instruct and did 4-bit QLoRA on all linear layers (Q, K, V, O, gate, up and down) with rank = 32 and a batch size of 1. We padded all sequences to a fixed maximum sequence length to mimic long context finetuning workloads.

| GPU VRAM | 🦥 Unsloth context length | Hugging Face + FA2 |
| --- | --- | --- |
| 8 GB | 2,972 | OOM |
| 12 GB | 21,848 | 932 |
| 16 GB | 40,724 | 2,551 |
| 24 GB | 78,475 | 5,789 |
| 40 GB | 153,977 | 12,264 |
| 48 GB | 191,728 | 15,502 |
| 80 GB | 342,733 | 28,454 |
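The padding step can be sketched as below. This is a minimal illustration of padding every example to one fixed maximum sequence length, assuming a Hugging Face tokenizer and the `datasets` library; the tokenizer checkpoint, the dataset, and the 40,724-token target (taken from the 16 GB row above) are illustrative, not the exact benchmark script.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

max_seq_length = 40_724  # e.g. the 16 GB Unsloth figure from the table above

# Tokenizer checkpoint is an assumption; Llama tokenizers usually lack a pad
# token, so fall back to the EOS token for padding.
tokenizer = AutoTokenizer.from_pretrained("unsloth/Meta-Llama-3.1-8B-Instruct")
tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token

dataset = load_dataset("yahma/alpaca-cleaned", split = "train")

def pad_to_max(example):
    # Pad (or truncate) every sequence to exactly max_seq_length tokens so
    # each training step exercises the full context window.
    return tokenizer(
        example["output"],
        padding = "max_length",
        truncation = True,
        max_length = max_seq_length,
    )

dataset = dataset.map(pad_to_max, remove_columns = dataset.column_names)
```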

Llama 3.3 (70B) max. context length

We tested Llama 3.3 (70B) Instruct on an 80 GB A100 and did 4-bit QLoRA on all linear layers (Q, K, V, O, gate, up and down) with rank = 32 and a batch size of 1. We padded all sequences to a fixed maximum sequence length to mimic long context finetuning workloads.

| GPU VRAM | 🦥 Unsloth context length | Hugging Face + FA2 |
| --- | --- | --- |
| 48 GB | 12,106 | OOM |
| 80 GB | 89,389 | 6,916 |
