Multi-GPU Fine-tuning with Distributed Data Parallel (DDP)
Learn how to use the Unsloth CLI to train on multiple GPUs with Distributed Data Parallel (DDP)!
Let’s assume we have multiple GPUs, and we want to fine-tune a model using all of them! The most straightforward strategy for this is Distributed Data Parallel (DDP), which places a full copy of the model on each GPU, feeds each copy distinct samples from the dataset during training, and aggregates the resulting gradients into a single weight update at each optimizer step.
Why would we want to do this? Well, as we add more GPUs into the training process, we scale the number of samples our models train on per step, making each gradient update more stable and increasing our training throughput dramatically with each added GPU.
Here’s a step-by-step guide on how to do this using Unsloth’s command-line interface (CLI)!
Note: Unsloth DDP will work with any of your training scripts, not just via our CLI! More details below.
Install Unsloth from source
We’ll clone Unsloth from GitHub and install it. Please consider using a virtual environment; we like to use uv venv --python 3.12 && source .venv/bin/activate (shown below), but any virtual environment tooling will do.
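If you’d like to set up the environment that way, the commands look like this (uv is just the tool we happen to like; any virtual environment manager works):
$ uv venv --python 3.12
$ source .venv/bin/activate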
git clone https://github.com/unslothai/unsloth.git
cd unsloth
pip install .
Choose target model and dataset for finetuning
In this demo, we will fine-tune Qwen/Qwen3-8B on the yahma/alpaca-cleaned chat dataset. This is a Supervised Fine-Tuning (SFT) workload of the kind commonly used to adapt a base model to a desired conversational style, or to improve its performance on a downstream task.
Use the Unsloth CLI!
First, let’s take a look at the help message built into the CLI (abbreviated here with “…” in various places for brevity):
$ python unsloth-cli.py --help
usage: unsloth-cli.py [-h] [--model_name MODEL_NAME] [--max_seq_length MAX_SEQ_LENGTH] [--dtype DTYPE]
[--load_in_4bit] [--dataset DATASET] [--r R] [--lora_alpha LORA_ALPHA]
[--lora_dropout LORA_DROPOUT] [--bias BIAS]
[--use_gradient_checkpointing USE_GRADIENT_CHECKPOINTING]
…
🦥 Fine-tune your llm faster using unsloth!
options:
-h, --help show this help message and exit
🤖 Model Options:
--model_name MODEL_NAME
Model name to load
--max_seq_length MAX_SEQ_LENGTH
Maximum sequence length, default is 2048. We auto support RoPE Scaling
internally!
…
🧠 LoRA Options:
These options are used to configure the LoRA model.
--r R Rank for Lora model, default is 16. (common values: 8, 16, 32, 64, 128)
--lora_alpha LORA_ALPHA
LoRA alpha parameter, default is 16. (common values: 8, 16, 32, 64, 128)
…
🎓 Training Options:
--per_device_train_batch_size PER_DEVICE_TRAIN_BATCH_SIZE
Batch size per device during training, default is 2.
--per_device_eval_batch_size PER_DEVICE_EVAL_BATCH_SIZE
Batch size per device during evaluation, default is 4.
--gradient_accumulation_steps GRADIENT_ACCUMULATION_STEPS
Number of gradient accumulation steps, default is 4.
…
This should give you a sense of what options are available for you to pass into the CLI when training your model!
For multi-GPU training (DDP in this case), we will use the torchrun launcher, which allows you to spin up multiple distributed training processes in single-node or multi-node settings. Here, we’ll focus on the single-node (i.e., one machine) setting with two H100 GPUs.
Let’s also check our GPUs’ status by using the nvidia-smi command-line tool:
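Running nvidia-smi with no arguments prints the full status table; if you prefer a compact summary, a query like the following works too (the fields chosen here are just one reasonable set):
$ nvidia-smi
$ nvidia-smi --query-gpu=index,name,memory.used,memory.total --format=csv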
Great! We have two H100 GPUs, as expected. Both are sitting at 0 MiB of memory usage, since we’re not training anything yet and have no model loaded into memory.
To start your training run, issue a command like the following:
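For example, a representative launch for our two-GPU Qwen3-8B + alpaca-cleaned setup might look like this (the flag values below simply echo the CLI defaults and the model/dataset chosen above; tune them for your own run):
$ torchrun --nproc_per_node 2 unsloth-cli.py \
    --model_name Qwen/Qwen3-8B \
    --dataset yahma/alpaca-cleaned \
    --max_seq_length 2048 \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 4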
If you have more GPUs, you may set --nproc_per_node accordingly to utilize them.
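One thing worth keeping in mind: with DDP, the effective batch size per optimizer step is per_device_train_batch_size × gradient_accumulation_steps × nproc_per_node. With the defaults above, that works out to 2 × 4 × 2 = 16 samples per weight update on two GPUs, versus 8 on a single GPU.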
Note: You can use the torchrun launcher with any of your Unsloth training scripts, including the scripts converted from our free Colab notebooks, and DDP will be auto-enabled when training with >1 GPU!
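In other words, if you already have an Unsloth training script (say, one exported from a Colab notebook), launching it across GPUs is simply a matter of swapping python for torchrun; the script name below is a placeholder for your own file:
$ torchrun --nproc_per_node 2 your_training_script.py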
Taking a look at nvidia-smi again while training is in flight:
We can see that both H100 GPUs are now using ~19 GB of VRAM each!
Inspecting the training logs, we see that we’re training at a rate of ~1.1 iterations/s. This per-GPU rate stays roughly constant as we add more GPUs, so overall training throughput scales roughly linearly with the number of GPUs!
Training metrics
We ran a few short rank-16 LoRA fine-tunes on unsloth/Llama-3.2-1B-Instruct on the yahma/alpaca-cleaned dataset to demonstrate the improved training throughput when using DDP training with multiple GPUs.

The above figure compares training loss between two Llama-3.2-1B-Instruct LoRA fine-tunes over 500 training steps, with single GPU training (pink) vs. multi-GPU DDP training (blue).
Notice that the loss curves match in scale and trend but differ slightly, since the multi-GPU run processes twice as much training data per step; the larger effective batch also gives the DDP curve a bit less step-to-step variability.

The above figure plots training progress for the same two fine-tunes.
Notice that the multi-GPU DDP training progresses through an epoch of the training data in half as many steps as single-GPU training, because each GPU processes a distinct batch (of size per_device_train_batch_size) per step. Each DDP step is slightly slower, however, due to the distributed communication needed to synchronize the weight updates. As you add GPUs, training throughput continues to scale roughly linearly, with a small but growing communication overhead.
The same loss and epoch-progress behavior holds for QLoRA fine-tunes, in which we load the base model in 4-bit precision to save additional GPU memory. This is particularly useful for training large models with limited GPU VRAM (see the example command after the figures below):

Training loss comparison between two Llama-3.2-1B-Instruct QLoRA fine-tunes over 500 training steps, with single GPU training (orange) vs. multi-GPU DDP training (purple).

Training progress comparison for the same two fine-tunes.
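To run a QLoRA-style fine-tune via the CLI, you only need to add the --load_in_4bit flag from the help output above; for example (again, the other flag values are illustrative):
$ torchrun --nproc_per_node 2 unsloth-cli.py \
    --model_name unsloth/Llama-3.2-1B-Instruct \
    --dataset yahma/alpaca-cleaned \
    --load_in_4bit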