🏁Finetuning from Last Checkpoint
Checkpointing allows you to save your finetuning progress so you can pause it and then continue.
You must first edit the Trainer to add save_strategy and save_steps. The example below saves a checkpoint every 50 steps to the outputs folder.
trainer = SFTTrainer(
    ....
    args = TrainingArguments(
        ....
        output_dir = "outputs",
        save_strategy = "steps",
        save_steps = 50,
    ),
)
Then in the trainer do:
trainer_stats = trainer.train(resume_from_checkpoint = True)
This will start from the latest checkpoint and continue training.
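You can also resume from one specific checkpoint folder instead of the latest one by passing its path. A minimal sketch, assuming a checkpoint was saved at step 50 under the outputs folder configured above (use a folder that actually exists on disk):
# "outputs/checkpoint-50" is an assumed example path based on save_steps = 50 above
trainer_stats = trainer.train(resume_from_checkpoint = "outputs/checkpoint-50")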
Wandb Integration
# Install library
!pip install wandb --upgrade
# Setting up Wandb
!wandb login <token>
import os
os.environ["WANDB_PROJECT"] = "<name>"
os.environ["WANDB_LOG_MODEL"] = "checkpoint"
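If you would rather not paste the token into a shell command, you can log in from Python instead. A minimal sketch (the key is a placeholder you must replace with your own API key):
import wandb
wandb.login(key = "<token>")  # same effect as running `wandb login <token>` in the shell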
Then in TrainingArguments() set:
report_to = "wandb",
logging_steps = 1, # Change if needed
save_steps = 100, # Change if needed
run_name = "<name>", # (Optional)
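Putting those settings together, here is a minimal sketch of where they sit inside the trainer; the .... placeholders stand for whatever other arguments your run already uses:
trainer = SFTTrainer(
    ....
    args = TrainingArguments(
        ....
        output_dir = "outputs",
        report_to = "wandb",   # send training logs to Weights & Biases
        logging_steps = 1,     # log every step; change if needed
        save_steps = 100,      # save a checkpoint every 100 steps; change if needed
        run_name = "<name>",   # optional run name shown in the Wandb UI
    ),
)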
To train the model, run trainer.train(); to resume training from a Wandb checkpoint artifact, run:
import wandb
run = wandb.init()
artifact = run.use_artifact('<username>/<Wandb-project-name>/<run-id>', type='model')
artifact_dir = artifact.download()
trainer.train(resume_from_checkpoint=artifact_dir)
❓How do I do Early Stopping?
If you want to stop or pause the finetuning / training run because the evaluation loss is not decreasing, you can use early stopping, which halts the training process. Use EarlyStoppingCallback.
As usual, set up your trainer and your evaluation dataset. The setup below stops the training run if the eval_loss (the evaluation loss) has not decreased after 3 evaluation steps.
from trl import SFTConfig, SFTTrainer
trainer = SFTTrainer(
    args = SFTConfig(
        fp16_full_eval = True,
        per_device_eval_batch_size = 2,
        eval_accumulation_steps = 4,
        output_dir = "training_checkpoints", # location of saved checkpoints for early stopping
        save_strategy = "steps",             # save the model every N steps
        save_steps = 10,                     # how many steps until we save the model
        save_total_limit = 3,                # keep only 3 saved checkpoints to save disk space
        eval_strategy = "steps",             # evaluate every N steps
        eval_steps = 10,                     # how many steps until we do evaluation
        load_best_model_at_end = True,       # MUST USE for early stopping
        metric_for_best_model = "eval_loss", # metric we want to early stop on
        greater_is_better = False,           # the lower the eval loss, the better
    ),
    model = model,
    tokenizer = tokenizer,
    train_dataset = new_dataset["train"],
    eval_dataset = new_dataset["test"],
)
We then add the callback, which can also be customized:
from transformers import EarlyStoppingCallback
early_stopping_callback = EarlyStoppingCallback(
    early_stopping_patience = 3,    # How many evaluation steps to wait while the eval loss does not improve
                                    # before stopping. For example the loss might increase, but decrease again within 3 steps.
    early_stopping_threshold = 0.0, # How much the loss must improve by to count as an improvement. Can be set higher -
                                    # e.g. 0.01 means a drop from 0.02 to 0.01 is not enough improvement,
                                    # so that step still counts towards early stopping.
)
trainer.add_callback(early_stopping_callback)
Then train the model as usual via trainer.train().
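As an alternative to trainer.add_callback, the callback can also be passed when the trainer is constructed. A minimal sketch, reusing the early_stopping_callback defined above; the .... stands for the same arguments as before:
trainer = SFTTrainer(
    ....
    callbacks = [early_stopping_callback],  # registered at construction instead of via add_callback
)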