Multi-GPU Training with Unsloth
Learn how to fine-tune LLMs on multiple GPUs with parallelism using Unsloth.
Unsloth currently supports multi-GPU setups through libraries like Accelerate and DeepSpeed. This means you can already leverage parallelism methods such as FSDP and DDP with Unsloth.
You can use our Magistral-2509 Kaggle notebook as an example, which utilizes multi-GPU Unsloth to fit the 24B-parameter model.
However, we know that the process can be complex and requires manual setup. We’re working hard to make multi-GPU support much simpler and more user-friendly, and we’ll be announcing official multi-GPU support for Unsloth soon.
In the meantime, to enable multi-GPU training with DDP, do the following:
1. Save your training script to train.py and set ddp_find_unused_parameters = False in SFTConfig or TrainingArguments (a minimal example script is shown below).
2. Run accelerate launch train.py or torchrun --nproc_per_node N_GPUS train.py, where N_GPUS is the number of GPUs you have.
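For reference, here is a minimal sketch of what train.py could look like. The model name, dataset file, LoRA settings, and training hyperparameters below are placeholders for illustration only, so substitute your own; the only multi-GPU-specific part is the ddp_find_unused_parameters = False flag.
# train.py - minimal DDP-ready fine-tuning sketch (model, dataset and
# hyperparameters are placeholders; replace them with your own)
from unsloth import FastLanguageModel
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset

model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/Llama-3.2-1B-Instruct",   # placeholder model
    max_seq_length = 2048,
    load_in_4bit = True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj"],
)

# Placeholder dataset: a local JSONL file where each row has a "text" field
dataset = load_dataset("json", data_files = "data.jsonl", split = "train")

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,             # newer trl versions call this processing_class
    train_dataset = dataset,
    args = SFTConfig(
        dataset_text_field = "text",
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        max_steps = 60,
        output_dir = "outputs",
        ddp_find_unused_parameters = False,   # required for DDP
    ),
)
trainer.train()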
Pipeline / model splitting is also supported, so if a single GPU does not have enough VRAM to load, say, Llama 70B, no worries: we will split the model across your GPUs for you. To enable this, use the device_map = "balanced" flag:
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/Llama-3.3-70B-Instruct",
    load_in_4bit = True,           # 4-bit quantization to reduce VRAM usage
    device_map = "balanced",       # split the model evenly across all visible GPUs
)
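To check how the layers were split, you can inspect the device map assigned at load time; this assumes the underlying Hugging Face model exposes the usual hf_device_map attribute when loaded with a device_map:
print(model.hf_device_map)   # shows which GPU each module / layer group was placed on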
Several contributors have also created repos to enable or improve multi-GPU support with Unsloth, including:
unsloth-5090-multiple: A fork enabling Unsloth to run efficiently on multi-GPU systems, particularly for the NVIDIA RTX 5090 and similar setups.
opensloth: Unsloth with support for multi-GPU training including experimental features.
Stay tuned for our official announcement! For more details, check out our ongoing Pull Request discussing multi-GPU support.