🧠 LoRA Hyperparameters Guide
Best practices for LoRA hyperparameters and how they affect the fine-tuning process.
LoRA hyperparameters are the adjustable settings that control how Low-Rank Adaptation (LoRA) fine-tunes LLMs. With many options (such as learning rate and epochs) and millions of possible combinations, selecting the right values is crucial for accuracy, stability, and output quality, and for reducing hallucinations during fine-tuning.
You'll learn the best practices for these parameters, based on insights from hundreds of research papers and experiments, and see how they impact the model. While we recommend using Unsloth's defaults, understanding these concepts will give you full control. The goal is to tune these values to improve accuracy while counteracting overfitting or underfitting. Overfitting occurs when the model memorizes the training data and loses its ability to generalize to new, unseen inputs; the objective is a model that generalizes well, not one that simply memorizes.
🔢 Key Fine-tuning Hyperparameters
Learning Rate
Defines how much the model’s weights are adjusted during each training step.
Higher Learning Rates: Lead to faster initial convergence but can cause training to become unstable or fail to find an optimal minimum if set too high.
Lower Learning Rates: Result in more stable and precise training but may require more epochs to converge, increasing overall training time. While low learning rates are often assumed to cause underfitting, they can actually lead to overfitting or even prevent the model from learning.
Typical Range: 2e-4 (0.0002) to 5e-6 (0.000005).
💡 We recommend 2e-4 as a starting point for normal fine-tuning. For reinforcement learning (DPO, GRPO, etc.), we recommend 5e-6.
Epochs
The number of times the model sees the full training dataset.
More Epochs: Can help the model learn better, but a high number can cause it to memorize the training data, hurting its performance on new tasks.
Fewer Epochs: Reduces training time and can prevent overfitting, but may result in an undertrained model if the number is insufficient for the model to learn the dataset's underlying patterns.
Recommended: 1-3 epochs. For most instruction-based datasets, training for more than 3 epochs offers diminishing returns and increases the risk of overfitting.
LoRA or QLoRA
LoRA uses 16-bit precision, while QLoRA is a 4-bit fine-tuning method.
LoRA: 16-bit fine-tuning. It's slightly faster and slightly more accurate, but consumes significantly more VRAM (4× more than QLoRA). Recommended for 16-bit environments and scenarios where maximum accuracy is required.
QLoRA: 4-bit fine-tuning. Slightly slower and marginally less accurate, but uses much less VRAM (4× less). 🦥 70B LLaMA fits in <48GB VRAM with QLoRA in Unsloth - more details here.
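For illustration, loading a model for QLoRA fine-tuning in Unsloth looks roughly like this (the model name and sequence length are placeholders):

from unsloth import FastLanguageModel

# Load the base model in 4-bit for QLoRA. Setting load_in_4bit = False gives 16-bit LoRA,
# which is slightly faster and more accurate but uses roughly 4x more VRAM.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B-bnb-4bit",  # placeholder model
    max_seq_length = 2048,
    load_in_4bit = True,
)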
Hyperparameters & Recommendations:
LoRA Rank (r)
Controls the number of trainable parameters in the LoRA adapter matrices. A higher rank increases model capacity but also memory usage.
Typical values: 8, 16, 32, 64, 128. Recommended: 16 or 32.
LoRA Alpha (lora_alpha)
Scales the strength of the fine-tuned adjustments in relation to the rank (r).
Recommended: r (standard) or r * 2 (common heuristic).
LoRA Dropout
A regularization technique that randomly sets a fraction of LoRA activations to zero during training to prevent overfitting. Not that useful in practice, so we default it to 0.
Recommended: 0 (default) to 0.1.
Weight Decay
A regularization term that penalizes large weights to prevent overfitting and improve generalization. Avoid values that are too large!
Recommended: 0.01, up to 0.1.
Warmup Steps
Gradually increases the learning rate at the start of training.
Recommended: 5-10% of total steps.
Scheduler Type
Adjusts the learning rate dynamically during training.
Recommended: linear or cosine.
Seed (random_state)
A fixed number to ensure reproducibility of results.
Recommended: any integer (e.g., 42, 3407).
Target Modules
Specify which parts of the model to apply LoRA adapters to: the attention layers (q_proj, k_proj, v_proj, o_proj), the MLP layers (gate_proj, up_proj, down_proj), or both.
Recommended: target all major linear layers: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj.
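As a rough sketch (not a definitive recipe), here is how the learning rate, epochs, and the recommendations above map onto a TRL SFTConfig passed to the trainer; every value is simply the starting point suggested in this guide:

from trl import SFTConfig

training_args = SFTConfig(
    learning_rate = 2e-4,         # use ~5e-6 for RL methods such as DPO/GRPO
    num_train_epochs = 1,         # 1-3 epochs is usually enough
    weight_decay = 0.01,          # regularization; 0.01 to 0.1
    warmup_ratio = 0.05,          # warm up over ~5-10% of total steps
    lr_scheduler_type = "linear", # or "cosine"
    seed = 3407,                  # fixed seed for reproducibility
    output_dir = "outputs",
)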
🌳 Gradient Accumulation and Batch Size equivalency
Effective Batch Size
Correctly configuring your batch size is critical for balancing training stability with your GPU's VRAM limitations. This is managed by two parameters whose product is the Effective Batch Size.
Effective Batch Size = batch_size * gradient_accumulation_steps
A larger Effective Batch Size generally leads to smoother, more stable training.
A smaller Effective Batch Size may introduce more variance.
While every task is different, the following configuration provides a great starting point for achieving a stable Effective Batch Size of 16, which works well for most fine-tuning tasks on modern GPUs.
Batch Size (batch_size)
The number of samples processed in a single forward/backward pass on one GPU. This is the primary driver of VRAM usage. Higher values can improve hardware utilization and speed up training, but only if they fit in memory.
Recommended: 2.
Gradient Accumulation (gradient_accumulation_steps)
The number of micro-batches to process before performing a single model weight update. This is the primary driver of training time. It allows simulation of a larger batch_size to conserve VRAM, but higher values increase training time per epoch.
Recommended: 8.
Effective Batch Size (calculated)
The true batch size used for each gradient update. It directly influences training stability, quality, and final model performance.
Recommended: 4 to 16; 16 works well (from 2 * 8).
The VRAM & Performance Trade-off
Assume you want 32 samples of data per training step. Then you can use any of the following configurations:
batch_size = 32, gradient_accumulation_steps = 1
batch_size = 16, gradient_accumulation_steps = 2
batch_size = 8, gradient_accumulation_steps = 4
batch_size = 4, gradient_accumulation_steps = 8
batch_size = 2, gradient_accumulation_steps = 16
batch_size = 1, gradient_accumulation_steps = 32
While all of these are equivalent for the model's weight updates, they have vastly different hardware requirements.
The first configuration (batch_size = 32) uses the most VRAM and will likely fail on most GPUs. The last configuration (batch_size = 1) uses the least VRAM, but at the cost of slightly slower training. To avoid OOM (out-of-memory) errors, prefer setting a smaller batch_size and increasing gradient_accumulation_steps to reach your target Effective Batch Size.
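For instance, the recommended 2 x 8 configuration can be expressed in the same SFTConfig as above (the parameter names come from transformers' TrainingArguments):

from trl import SFTConfig

# Effective Batch Size = 2 * 8 = 16, while only 2 samples sit in VRAM at any time.
training_args = SFTConfig(
    per_device_train_batch_size = 2,
    gradient_accumulation_steps = 8,
    output_dir = "outputs",
)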
🦥 Unsloth Gradient Accumulation Fix
Gradient accumulation and batch sizes are now fully equivalent in Unsloth thanks to our bug fixes for gradient accumulation. These fixes resolve a known issue in the wider community where the two methods did not produce the same results; for Unsloth users, they are now interchangeable.
Read our blog post for more details.
Prior to our fixes, combinations of batch_size and gradient_accumulation_steps that yielded the same Effective Batch Size (i.e., batch_size × gradient_accumulation_steps = 16) did not result in equivalent training behavior. For example, b1/g16, b2/g8, b4/g4, b8/g2, and b16/g1 all have an Effective Batch Size of 16, but as shown in the graph, the loss curves did not align when using standard gradient accumulation:
After applying our fixes, the loss curves now align correctly, regardless of how the Effective Batch Size of 16 is achieved:
🦥 LoRA Hyperparameters in Unsloth
The following demonstrates a standard configuration. While Unsloth provides optimized defaults, understanding these parameters is key to manual tuning.
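Assembled into a single call, a typical configuration looks roughly like this (it assumes a model already loaded via FastLanguageModel.from_pretrained; each argument is explained below):

model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0,             # Supports any, but = 0 is optimized
    bias = "none",                # Supports any, but = "none" is optimized
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
    use_rslora = False,
    loftq_config = None,
)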

r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
The rank (r) of the fine-tuning process. A larger rank uses more memory and will be slower, but can increase accuracy on complex tasks. We suggest ranks like 8 or 16 (for fast fine-tunes) and up to 128. Using a rank that is too large can cause overfitting and harm your model's quality.
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj",],
For optimal performance, LoRA should be applied to all major linear layers. Research has shown that targeting all major layers is crucial for matching the performance of full fine-tuning. While it's possible to remove modules to reduce memory usage, we strongly advise against it to preserve maximum quality, as the savings are minimal.
lora_alpha = 16,
A scaling factor that controls the strength of the fine-tuned adjustments. Setting it equal to the rank (r) is a reliable baseline. A popular and effective heuristic is to set it to double the rank (r * 2), which makes the model learn more aggressively by giving more weight to the LoRA updates.
lora_dropout = 0, # Supports any, but = 0 is optimized
A regularization technique that helps prevent overfitting by randomly setting a fraction of the LoRA activations to zero during each training step. Recent research suggests that for the short training runs common in fine-tuning, lora_dropout may be an unreliable regularizer. 🦥 Unsloth's internal code can optimize training when lora_dropout = 0, making it slightly faster, but we recommend a non-zero value if you suspect overfitting.
bias = "none", # Supports any, but = "none" is optimized
Leave this as "none" for faster training and reduced memory usage. This setting avoids training the bias terms in the linear layers, which adds trainable parameters for little to no practical gain.
use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
Options are True, False, and "unsloth". 🦥 We recommend "unsloth" as it reduces memory usage by an extra 30% and supports extremely long context fine-tunes. You can read more on our blog post about long context training.
random_state = 3407,
The seed to ensure deterministic, reproducible runs. Training involves random numbers, so setting a fixed seed is essential for consistent experiments.
use_rslora = False, # We support rank stabilized LoRA
An advanced feature that implements Rank-Stabilized LoRA. If set to True, the effective scaling becomes lora_alpha / sqrt(r) instead of the standard lora_alpha / r. This can sometimes improve stability, particularly for higher ranks.
loftq_config = None, # And LoftQ
An advanced technique, proposed in LoftQ, that initializes the LoRA matrices with the top r singular vectors of the pretrained weights. This can improve accuracy but may cause a significant memory spike at the start of training.
🎯 LoRA Target Modules and QLoRA vs LoRA
Use target_modules = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj",] to target both the attention and MLP layers and increase accuracy.
QLoRA uses 4-bit precision, reducing VRAM usage by over 75%.
LoRA (16-bit) is slightly more accurate and faster.
According to empirical experiments and research papers like the original QLoRA paper, it's best to apply LoRA to both attention and MLP layers.

The chart shows RougeL scores (higher is better) for different target module configurations, comparing LoRA vs QLoRA.
The first 3 dots show:
QLoRA-All: LoRA applied to all FFN/MLP and Attention layers. 🔥 This performs best overall.
QLoRA-FFN: LoRA applied only to the FFN/MLP layers. Equivalent to: gate_proj, up_proj, down_proj.
QLoRA-Attention: LoRA applied only to the Attention layers. Equivalent to: q_proj, k_proj, v_proj, o_proj.
😎 Training on completions only, masking out inputs
The QLoRA paper shows that masking out inputs and training only on completions (outputs or assistant messages) can further increase accuracy by around 1 percentage point. Below demonstrates how this is done in Unsloth:
NOT training on completions only (the loss is computed on every token, including the user prompts):
USER: Hello what is 2+2? ASSISTANT: The answer is 4. USER: Hello what is 3+3? ASSISTANT: The answer is 6.
Training on completions only (the user prompts are masked out, so the loss is computed only on the assistant responses):
USER: Hello what is 2+2?
ASSISTANT: The answer is 4.
USER: Hello what is 3+3?
ASSISTANT: The answer is 6.
The QLoRA paper states that training on completions only increases accuracy by quite a bit, especially for multi-turn conversational finetunes! We do this in our conversational notebooks here.

To enable training on completions in Unsloth, you will need to define the instruction and assistant parts. 🦥 We plan to further automate this for you in the future!
For Llama 3, 3.1, 3.2, 3.3 and 4 models, you define the parts as follows:
from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(
    trainer,
    instruction_part = "<|start_header_id|>user<|end_header_id|>\n\n",
    response_part = "<|start_header_id|>assistant<|end_header_id|>\n\n",
)
For Gemma 2, 3, 3n models, you define the parts as follows:
from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(
    trainer,
    instruction_part = "<start_of_turn>user\n",
    response_part = "<start_of_turn>model\n",
)
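As an optional sanity check (a sketch, assuming train_on_responses_only has already added masked labels to your tokenized dataset), you can decode the unmasked label tokens of one example and confirm that only the assistant responses remain:

# Inspect one training example: tokens labelled -100 are excluded from the loss.
labels = trainer.train_dataset[0]["labels"]
kept = [tok for tok in labels if tok != -100]
print(tokenizer.decode(kept))  # should show only the assistant/model responses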
🔑 Avoiding Overfitting & Underfitting
Overfitting (Poor Generalization/Too Specialized)
The model memorizes the training data, including its statistical noise, and consequently fails to generalize to unseen data.
If your training loss drops below 0.2, your model is likely overfitting — meaning it may perform poorly on unseen tasks.
One simple trick is LoRA alpha scaling — just multiply the alpha value of each LoRA matrix by 0.5. This effectively scales down the impact of fine-tuning.
This is closely related to merging / averaging weights. You can take the original base (or instruct) model and the fine-tuned model (base plus LoRA weights), add them together, and divide the result by 2. This gives you an averaged model, which is functionally equivalent to reducing the alpha by half.
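As an illustration only (not an Unsloth API), weight averaging between the original model and a merged fine-tune could be sketched like this; the model names are placeholders and both checkpoints must share the same architecture:

import torch
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("base-model", torch_dtype=torch.float16)        # placeholder
tuned = AutoModelForCausalLM.from_pretrained("merged-finetune", torch_dtype=torch.float16)  # placeholder

# Average every floating-point parameter: equivalent to halving the LoRA alpha of the fine-tune.
base_sd, tuned_sd = base.state_dict(), tuned.state_dict()
averaged = {
    k: (base_sd[k] + tuned_sd[k]) / 2 if base_sd[k].is_floating_point() else base_sd[k]
    for k in base_sd
}
base.load_state_dict(averaged)
base.save_pretrained("averaged-model")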
Solution:
Adjust the learning rate: A high learning rate often leads to overfitting, especially during short training runs. For longer training, a higher learning rate may work better. It’s best to experiment with both to see which performs best.
Reduce the number of training epochs. Stop training after 1, 2, or 3 epochs.
Increase weight_decay. A value of 0.01 or 0.1 is a good starting point.
Increase lora_dropout. Use a value like 0.1 to add regularization.
Increase batch size or gradient accumulation steps.
Dataset expansion - make your dataset larger by combining or concatenating open source datasets with your dataset. Choose higher quality ones.
Evaluation early stopping - enable evaluation and stop when the evaluation loss increases for a few steps (see the sketch after this list).
LoRA Alpha Scaling - scale the alpha down after training and during inference - this will make the finetune less pronounced.
Weight averaging - literally add the original instruct model and the finetune and divide the weights by 2.
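For the early-stopping suggestion above, here is a rough sketch using the Hugging Face Trainer utilities; the step counts and patience are placeholders, the trainer is your existing SFTTrainer with an eval_dataset attached, and older transformers versions call the first argument evaluation_strategy:

from transformers import EarlyStoppingCallback
from trl import SFTConfig

training_args = SFTConfig(
    eval_strategy = "steps",             # requires an eval_dataset on the trainer
    eval_steps = 50,
    save_steps = 50,
    load_best_model_at_end = True,
    metric_for_best_model = "eval_loss",
    greater_is_better = False,
    output_dir = "outputs",
)
# Stop once eval_loss has not improved for 3 consecutive evaluations.
trainer.add_callback(EarlyStoppingCallback(early_stopping_patience = 3))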
Underfitting (Too Generic)
The model fails to capture the underlying patterns in the training data, often due to insufficient complexity or training duration.
Solution:
Adjust the Learning Rate: If the current rate is too low, increasing it may speed up convergence, especially for short training runs. For longer runs, try lowering the learning rate instead. Test both approaches to see which works best.
Increase Training Epochs: Train for more epochs, but monitor validation loss to avoid overfitting.
Increase LoRA Rank (r) and alpha: The rank should be at least equal to the alpha value, and should be larger for smaller models or more complex datasets; it usually falls between 4 and 64.
Use a More Domain-Relevant Dataset: Ensure the training data is high-quality and directly relevant to the target task.
Decrease batch size to 1. This will cause the model to update more vigorously.
Fine-tuning has no single "best" approach, only best practices. Experimentation is key to finding what works for your specific needs. Our notebooks automatically set optimal parameters based on research from many papers and our own experiments, giving you a great starting point. Happy fine-tuning!
Acknowledgements: A huge thank you to Eyera for contributing to this guide!