
Multi-GPU Training with Unsloth

Learn how to fine-tune LLMs across multiple GPUs and use parallelism with Unsloth.

Unsloth currently supports multi-GPU setups through libraries like Accelerate and DeepSpeed. This means you can already leverage parallelism methods such as FSDP and DDP with Unsloth.

However, we know that the process can be complex and requires manual setup. We’re working hard to make multi-GPU support much simpler and more user-friendly, and we’ll be announcing official multi-GPU support for Unsloth soon.

In the meantime, to enable multi-GPU training with DDP, do the following:

  1. Save your training script to train.py and set ddp_find_unused_parameters = False in your SFTConfig or TrainingArguments (see the sketch after this list).

  2. Run accelerate launch train.py or torchrun --nproc_per_node N_GPUS train.py, where N_GPUS is the number of GPUs you have.
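
For reference, a minimal train.py might look like the sketch below. The model name, in-memory placeholder dataset, and LoRA settings are illustrative assumptions rather than a prescribed recipe, and the exact SFTTrainer/SFTConfig arguments can differ slightly between trl versions (e.g. tokenizer vs. processing_class):

# train.py -- minimal DDP-ready sketch (model, dataset and LoRA settings are placeholders).
from unsloth import FastLanguageModel  # import Unsloth first so its patches apply
from datasets import Dataset
from trl import SFTConfig, SFTTrainer

model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/Llama-3.2-1B-Instruct",   # placeholder model
    max_seq_length = 2048,
    load_in_4bit = True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
)

# Tiny in-memory dataset so the sketch is self-contained; use your own data here.
dataset = Dataset.from_dict({"text": ["### Question: 2+2?\n### Answer: 4"] * 64})

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    args = SFTConfig(
        dataset_text_field = "text",
        per_device_train_batch_size = 2,
        max_steps = 30,
        output_dir = "outputs",
        # Step 1 above: stop DDP from scanning for unused parameters each step.
        ddp_find_unused_parameters = False,
    ),
)
trainer.train()

Launching this script with accelerate launch or torchrun (step 2) spawns one process per GPU and wraps the model in DDP for you.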

Pipeline / model splitting at load time is also supported, so if a single GPU does not have enough VRAM to load, say, Llama 70B, no worries - we will split the model across your GPUs for you! To enable this, use the device_map = "balanced" flag:

from unsloth import FastLanguageModel

# "balanced" spreads the model's layers evenly across all visible GPUs.
model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/Llama-3.3-70B-Instruct",
    load_in_4bit = True,
    device_map = "balanced",
)
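
If you want to confirm how the layers were distributed, the underlying Transformers model typically exposes an hf_device_map attribute when loaded this way (this assumes the standard Accelerate big-model loading path):

# Print the layer-to-GPU assignment chosen by the "balanced" device map.
print(model.hf_device_map)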

Several contributors have also created repos that enable or improve multi-GPU support with Unsloth, including:

  • unsloth-5090-multiple: A fork enabling Unsloth to run efficiently on multi-GPU systems, particularly for the NVIDIA RTX 5090 and similar setups.

  • opensloth: Unsloth with support for multi-GPU training including experimental features.

Stay tuned for our official announcement! For more details, check out our ongoing Pull Request discussing multi-GPU support.
