🏁 Finetuning from Last Checkpoint

Checkpointing lets you save your finetuning progress so you can pause training and continue later.

First, edit the trainer to add save_strategy and save_steps. The example below saves a checkpoint every 50 steps to the outputs folder.

trainer = SFTTrainer(
    ...,  # your existing model, tokenizer, and dataset arguments
    args = TrainingArguments(
        ...,  # your existing training arguments
        output_dir = "outputs",
        save_strategy = "steps",
        save_steps = 50,
    ),
)
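For context, here is a minimal end-to-end sketch of where these arguments fit in a typical Unsloth + trl setup. The model name, dataset, and LoRA settings are illustrative placeholders only, and the exact SFTTrainer signature depends on your trl version:

from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

# Placeholder model; swap in the model you are finetuning
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3-8b-bnb-4bit",
    max_seq_length = 2048,
    load_in_4bit = True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
)

# Placeholder dataset; chosen because it already has a "text" field
dataset = load_dataset("imdb", split = "train")

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = 2048,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        max_steps = 200,
        output_dir = "outputs",
        save_strategy = "steps",
        save_steps = 50,       # checkpoint every 50 steps
    ),
)
trainer.train()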

Then, when you call train, pass resume_from_checkpoint:

trainer_stats = trainer.train(resume_from_checkpoint = True)

This resumes from the latest checkpoint in output_dir and continues training.
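If you want to resume from a specific checkpoint rather than the latest one, resume_from_checkpoint also accepts a path. A short sketch, where "outputs/checkpoint-50" stands in for one of the checkpoint-<step> folders the trainer writes:

# Resume from a specific checkpoint directory instead of the latest one
trainer_stats = trainer.train(resume_from_checkpoint = "outputs/checkpoint-50")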

Wandb Integration

# Install library
!pip install wandb --upgrade

# Setting up Wandb
!wandb login <token>

import os

os.environ["WANDB_PROJECT"] = "<name>"
os.environ["WANDB_LOG_MODEL"] = "checkpoint"

Then, in TrainingArguments(), set:

report_to = "wandb",
logging_steps = 1,    # Change if needed
save_steps = 100,     # Change if needed
run_name = "<name>",  # (Optional)
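Put together, the arguments might look like this (a sketch using the same placeholders as above):

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir = "outputs",
    report_to = "wandb",      # send training logs to Weights & Biases
    logging_steps = 1,        # log metrics every step; change if needed
    save_strategy = "steps",
    save_steps = 100,         # checkpoint every 100 steps; change if needed
    run_name = "<name>",      # optional W&B run name
)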

To train the model, run trainer.train(). To resume from a checkpoint that was uploaded to W&B (enabled by WANDB_LOG_MODEL above), download the artifact and pass its directory to resume_from_checkpoint:

import wandb

run = wandb.init()
# <username>/<Wandb-project-name>/<run-id> identifies the logged checkpoint artifact
artifact = run.use_artifact('<username>/<Wandb-project-name>/<run-id>', type='model')
artifact_dir = artifact.download()
trainer.train(resume_from_checkpoint = artifact_dir)
