🧠 LoRA Hyperparameters Guide

Best practices for LoRA hyperparameters, and how they affect the fine-tuning process.

There are millions of possible hyperparameter combinations, and choosing the right values is crucial for fine-tuning. You'll learn the best practices for hyperparameters, based on insights from hundreds of research papers and experiments, and how each one impacts the model. We recommend using Unsloth's pre-selected defaults. The goal of adjusting hyperparameters is to increase accuracy while counteracting overfitting or underfitting. Overfitting is when the model memorizes the training data and struggles with new questions. We want a model that generalizes, not one that just memorizes.

Key Fine-tuning Hyperparameters

Learning Rate

Defines how much the model’s weights adjust per training step.

  • Higher Learning Rates: Faster training, but values that are too high destabilize training and can cause overfitting.

  • Lower Learning Rates: More stable training, may require more epochs.

  • Typical Range: 5e-5 (0.00005) to 1e-4 (0.0001).

Epochs

Number of times the model sees the full training dataset.

  • Recommended: 1-3 epochs (more than 3 is generally not optimal, unless you want a model that hallucinates less at the cost of creativity).

  • More Epochs: Better learning, higher risk of overfitting.

  • Fewer Epochs: May undertrain the model.
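Epochs, batch size, and gradient accumulation together determine how many optimizer steps a run takes, which is what warmup and scheduler settings are defined over. A minimal sketch (the helper name and the example numbers are illustrative, not from Unsloth):

```python
import math

def total_training_steps(dataset_size, epochs, batch_size, grad_accum=1):
    """Optimizer steps = ceil(examples / effective batch) * epochs."""
    steps_per_epoch = math.ceil(dataset_size / (batch_size * grad_accum))
    return steps_per_epoch * epochs

# e.g. 10,000 examples, 3 epochs, batch size 2 with 4 gradient-accumulation steps
print(total_training_steps(10_000, 3, 2, 4))  # 3750
```

Doubling the effective batch size halves the number of optimizer steps per epoch, which is why warmup is usually expressed as a percentage of total steps rather than a fixed count.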

Advanced Hyperparameters

| Hyperparameter | Function | Recommended Settings |
| --- | --- | --- |
| LoRA Rank | Controls the number of low-rank factors used for adaptation. | 4-128 |
| LoRA Alpha | Scaling factor for weight updates. | 1x or 2x the LoRA rank |
| Max Sequence Length | Maximum context length the model trains on. | Adjust based on dataset needs |
| Batch Size | Number of samples processed per training step; higher values require more VRAM. | 1 for long context, 2 or 4 for shorter context |
| LoRA Dropout | Dropout rate applied to the LoRA layers to prevent overfitting. | 0 (fastest) or 0.1-0.2 |
| Warmup Steps | Gradually increases the learning rate at the start of training. | 5-10% of total steps |
| Scheduler Type | Adjusts the learning rate dynamically during training. | Linear decay |
| Seed / Random State | Ensures reproducibility of results. | Fixed number (e.g., 42) |
| Weight Decay | Penalizes large weights to prevent overfitting. | 1.0, or 0.3 if you have issues |
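The warmup and scheduler rows above can be sketched as a toy linear warmup-then-decay schedule. This is an illustrative helper, not Unsloth's actual scheduler implementation:

```python
def lr_at_step(step, total_steps, base_lr=2e-4, warmup_fraction=0.05):
    """Linear warmup over the first ~5% of steps, then linear decay to 0."""
    warmup_steps = max(1, int(warmup_fraction * total_steps))
    if step < warmup_steps:
        return base_lr * step / warmup_steps               # ramp up from 0
    remaining = total_steps - step
    return base_lr * max(0.0, remaining / (total_steps - warmup_steps))  # decay

# The learning rate peaks right after warmup, then falls linearly to 0:
peak = lr_at_step(5, 100)    # warmup ends at step 5 of 100 -> full base_lr
final = lr_at_step(100, 100)  # 0.0
```

Warmup avoids large, destabilizing updates while optimizer statistics are still cold, which is why 5-10% of total steps is the common recommendation.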

LoRA Hyperparameters in Unsloth

You can manually adjust the hyperparameters below if you’d like - but feel free to skip it, as Unsloth automatically chooses well-balanced defaults for you.

  1. r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128

    The rank of the finetuning process. A larger number uses more memory and is slower, but can increase accuracy on harder tasks. We normally suggest numbers like 8 (for fast finetunes), up to 128. Values that are too large can cause over-fitting, damaging your model's quality.

  2. target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],

    We select all modules to finetune. You can remove some to reduce memory usage and make training faster, but we strongly advise against it. Just train on all modules!

  3. lora_alpha = 16,

    The scaling factor for finetuning. A larger number makes the finetune learn more from your dataset, but can promote over-fitting. We suggest setting this equal to the rank r, or to double it.

  4. lora_dropout = 0, # Supports any, but = 0 is optimized

    Leave this as 0 for faster training! Dropout can reduce over-fitting, but not by much.

  5. bias = "none",    # Supports any, but = "none" is optimized

    Leave this as "none" for faster and less over-fit training!

  6. use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context

    Options include True, False and "unsloth". We suggest "unsloth" since we reduce memory usage by an extra 30% and support extremely long context finetunes. You can read more here: https://unsloth.ai/blog/long-context

  7. random_state = 3407,

    The seed that makes runs deterministic. Training and finetuning need random numbers, so setting this number makes experiments reproducible.

  8. use_rslora = False,  # We support rank stabilized LoRA

    Advanced feature that uses rank-stabilized scaling: the update is scaled by lora_alpha / sqrt(r) instead of lora_alpha / r, which can help at higher ranks. You can use this if you want!

  9. loftq_config = None, # And LoftQ

    Advanced feature to initialize the LoRA matrices to the top r singular vectors of the weights. Can improve accuracy somewhat, but can make memory usage explode at the start.
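To make the interaction between r, lora_alpha, and use_rslora concrete, here is a small sketch of the scaling and parameter-count math (the helper names are illustrative, not part of Unsloth's API):

```python
import math

def lora_scaling(alpha, r, rslora=False):
    # Standard LoRA scales the update B @ A by alpha / r;
    # rank-stabilized LoRA (rsLoRA) scales it by alpha / sqrt(r).
    return alpha / math.sqrt(r) if rslora else alpha / r

def lora_trainable_params(r, d_in, d_out):
    # A is (r x d_in) and B is (d_out x r), so params grow linearly with r.
    return r * d_in + d_out * r

print(lora_scaling(16, 16))                    # 1.0 (alpha == r)
print(lora_scaling(32, 16))                    # 2.0 (alpha == 2 * r)
print(lora_trainable_params(16, 4096, 4096))   # 131072 per adapted module
```

With alpha equal to the rank the scaling factor is exactly 1, which is why "alpha = r or 2r" is the usual rule of thumb: it keeps the effective update magnitude stable as you change the rank.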

Avoiding Overfitting & Underfitting

Overfitting (Too Specialized)

The model memorizes training data, failing to generalize to unseen inputs. Solution:

  • Lower the learning rate.

  • Increase batch size.

  • Lower the number of training epochs.

  • Combine your dataset with a generic dataset, e.g. ShareGPT.

  • Increase dropout rate to introduce regularization.
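A practical way to catch overfitting early is to watch evaluation loss rise while training loss keeps falling. A toy heuristic (not part of Unsloth) might look like:

```python
def looks_overfit(train_losses, eval_losses, patience=2):
    """Heuristic: eval loss has risen for `patience` consecutive
    evaluations while train loss is still falling."""
    if len(eval_losses) <= patience:
        return False
    rising = all(eval_losses[-i] > eval_losses[-i - 1]
                 for i in range(1, patience + 1))
    falling = train_losses[-1] < train_losses[-(patience + 1)]
    return rising and falling

# Train loss keeps dropping but eval loss turned upward -> overfitting signal
print(looks_overfit([1.0, 0.8, 0.6, 0.4], [1.0, 0.9, 0.95, 1.0]))  # True
```

If this signal fires, the fixes above apply: fewer epochs, more regularization, or a more diverse dataset.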

Underfitting (Too Generic)

Though not as common, underfitting occurs when the model has too few learnable parameters (e.g. the rank is too low) to capture patterns in the training data, so it fails to learn. Solution:

  • Increase the learning rate.

  • Train for more epochs.

  • Increase the rank and alpha. Alpha should be at least equal to the rank, and rank should be larger for smaller models or more complex datasets; it usually ranges from 4 to 64.

  • Use a more domain-relevant dataset.
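As a rough illustration of why raising the rank adds capacity: LoRA's trainable parameter count grows linearly with r. The numbers below assume a hypothetical model where each of 7 target modules maps 4096 → 4096:

```python
def adapter_params(r, hidden=4096, n_modules=7):
    """Total trainable LoRA parameters across all target modules.
    Each module gets A (r x hidden) and B (hidden x r)."""
    return n_modules * (r * hidden + hidden * r)

for r in (4, 16, 64):
    print(f"r={r}: {adapter_params(r):,} trainable parameters")
# r=4 gives ~229K parameters; r=64 gives 16x that, ~3.7M
```

Going from r=4 to r=64 multiplies the adapter's capacity 16x while the base model stays frozen, which is why bumping the rank is the first lever against underfitting.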

Fine-tuning has no single "best" approach, only best practices. Experimentation is key to finding what works for your needs. Our notebooks auto-set optimal parameters based on evidence from research papers and past experiments.
