Unsloth Benchmarks
Want to know how fast Unsloth is?
For our most detailed benchmarks, read our Llama 3.3 Blog.
Benchmarking of Unsloth was also conducted by 🤗 Hugging Face.
We tested using the Alpaca Dataset with a batch size of 2, 4 gradient accumulation steps, rank = 32, and QLoRA applied to all linear layers (q, k, v, o, gate, up, down); a configuration sketch follows the table:
| Model | VRAM | Unsloth speed | Unsloth VRAM reduction | Unsloth longer context | Hugging Face + FA2 speed |
|---|---|---|---|---|---|
| Llama 3.3 (70B) | 80GB | 2x | >75% | 13x longer | 1x |
| Llama 3.1 (8B) | 80GB | 2x | >70% | 12x longer | 1x |
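To make the setup above concrete, here is a minimal sketch of such a run using Unsloth's `FastLanguageModel` API together with trl's `SFTTrainer`. The model name, dataset identifier, prompt format, sequence length, and step count are illustrative assumptions (and `SFTTrainer` argument names vary between trl versions); only the rank, target modules, batch size, and gradient accumulation steps follow the benchmark settings:

```python
from unsloth import FastLanguageModel
from transformers import TrainingArguments
from trl import SFTTrainer
from datasets import load_dataset

# 4-bit base model for QLoRA. Model name and max_seq_length are illustrative.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B",
    max_seq_length = 2048,
    load_in_4bit = True,
)

# LoRA adapters on all linear projections (q, k, v, o, gate, up, down), rank 32.
model = FastLanguageModel.get_peft_model(
    model,
    r = 32,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
)

# Alpaca-style dataset, flattened into a single "text" field for SFTTrainer.
dataset = load_dataset("yahma/alpaca-cleaned", split = "train")
dataset = dataset.map(lambda ex: {
    "text": f"### Instruction:\n{ex['instruction']}\n\n"
            f"### Input:\n{ex['input']}\n\n### Response:\n{ex['output']}"
})

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = 2048,
    args = TrainingArguments(
        per_device_train_batch_size = 2,   # batch size 2
        gradient_accumulation_steps = 4,   # gradient accumulation steps 4
        max_steps = 60,                    # illustrative step count
        output_dir = "outputs",
    ),
)
trainer.train()
```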
Context length benchmarks
The longer your context, the more VRAM Unsloth saves relative to standard implementations, thanks to our gradient checkpointing algorithm combined with Apple's Cut Cross Entropy (CCE) algorithm!
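In current Unsloth releases, the gradient checkpointing algorithm is selected via a flag on `get_peft_model`; a minimal sketch, reusing the `model` object from the configuration sketch above:

```python
from unsloth import FastLanguageModel

# Unsloth's offloaded gradient checkpointing trades a small amount of compute
# for large activation-memory savings on long sequences.
model = FastLanguageModel.get_peft_model(
    model,
    r = 32,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing = "unsloth",
)
```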
Llama 3.1 (8B) max. context length
We tested Llama 3.1 (8B) Instruct with 4-bit QLoRA on all linear layers (Q, K, V, O, gate, up and down), rank = 32, and a batch size of 1. All sequences were padded to a fixed maximum length to mimic long context finetuning workloads (see the padding sketch after the table).
| GPU VRAM | Unsloth context length | Hugging Face + FA2 context length |
|---|---|---|
| 8 GB | 2,972 | OOM |
| 12 GB | 21,848 | 932 |
| 16 GB | 40,724 | 2,551 |
| 24 GB | 78,475 | 5,789 |
| 40 GB | 153,977 | 12,264 |
| 48 GB | 191,728 | 15,502 |
| 80 GB | 342,733 | 28,454 |
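The padding step described above can be mimicked with a standard Hugging Face tokenizer call. A minimal sketch; the model identifier, the example strings, and the 8,192-token target length are illustrative assumptions:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("unsloth/Meta-Llama-3.1-8B-Instruct")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # fall back if no pad token is set

# Pad (and truncate) every sequence to a fixed length so each batch exercises
# the full target context length, e.g. 8,192 tokens.
batch = tokenizer(
    ["Example instruction and response ...", "Another training example ..."],
    padding = "max_length",
    truncation = True,
    max_length = 8192,
    return_tensors = "pt",
)
print(batch["input_ids"].shape)  # (2, 8192): all sequences padded to the target length
```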
Llama 3.3 (70B) max. context length
We tested Llama 3.3 (70B) Instruct on an 80GB A100 with 4-bit QLoRA on all linear layers (Q, K, V, O, gate, up and down), rank = 32, and a batch size of 1. All sequences were padded to a fixed maximum length to mimic long context finetuning workloads.
| GPU VRAM | Unsloth context length | Hugging Face + FA2 context length |
|---|---|---|
| 48 GB | 12,106 | OOM |
| 80 GB | 89,389 | 6,916 |