🧠LoRA Parameters Encyclopedia
Learn how parameters affect the finetuning process. Written by Sebastien.
LoraConfig Parameters
Adjusting the LoraConfig
parameters allows you to balance model performance and computational efficiency in Low-Rank Adaptation (LoRA). Here’s a concise breakdown of key parameters:
r
Description: Rank of the low-rank decomposition for factorizing weight matrices.
Impact:
Higher: Retains more information, increases computational load.
Lower: Fewer parameters, more efficient training, potential performance drop if too small.
lora_alpha
Description: Scaling factor for the low-rank matrices' contribution.
Impact:
Higher: Increases influence, speeds up convergence, risks instability or overfitting.
Lower: Subtler effect, may require more training steps.
lora_dropout
Description: Probability of zeroing out elements in low-rank matrices for regularization.
Impact:
Higher: More regularization, prevents overfitting, may slow training and degrade performance.
Lower: Less regularization, may speed up training, risks overfitting.
loftq_config
Description: Configuration for LoftQ, a quantization method for the backbone weights and initialization of LoRA layers.
Impact:
Not None: If specified, LoftQ will quantize the backbone weights and initialize the LoRA layers. It requires setting
init_lora_weights='loftq'
.None: LoftQ quantization is not applied.
Note: Do not pass an already quantized model when using LoftQ as LoftQ handles the quantization process itself.
use_rslora
Description: Enables Rank-Stabilized LoRA (RSLora).
Impact:
True: Uses Rank-Stabilized LoRA, setting the adapter scaling factor to
lora_alpha/math.sqrt(r)
, which has been proven to work better as per the Rank-Stabilized LoRA paper.False: Uses the original default scaling factor
lora_alpha/r
.
gradient_accumulation_steps
Default: 1
Description: The number of steps to accumulate gradients before performing a backpropagation update.
Impact:
Higher: Accumulate gradients over multiple steps, effectively increasing the batch size without requiring additional memory. This can improve training stability and convergence, especially with large models and limited hardware.
Lower: Faster updates but may require more memory per step and can be less stable.
weight_decay
Default: 0.01
Description: Regularization technique that applies a small penalty to the weights during training.
Impact:
Non-zero Value (e.g., 0.01): Adds a penalty proportional to the magnitude of the weights to the loss function, helping to prevent overfitting by discouraging large weights.
Zero: No weight decay is applied, which can lead to overfitting, especially in large models or with small datasets.
learning_rate
Default: 2e-4
Description: The rate at which the model updates its parameters during training.
Impact:
Higher: Faster convergence but risks overshooting optimal parameters and causing instability in training.
Lower: More stable and precise updates but may slow down convergence, requiring more training steps to achieve good performance.
Target Modules
q_proj (query projection)
Description: Part of the attention mechanism in transformer models, responsible for projecting the input into the query space.
Impact: Transforms the input into query vectors that are used to compute attention scores.
k_proj (key projection)
Description: Projects the input into the key space in the attention mechanism.
Impact: Produces key vectors that are compared with query vectors to determine attention weights.
v_proj (value projection)
Description: Projects the input into the value space in the attention mechanism.
Impact: Produces value vectors that are weighted by the attention scores and combined to form the output.
o_proj (output projection)
Description: Projects the output of the attention mechanism back into the original space.
Impact: Transforms the combined weighted value vectors back to the input dimension, integrating attention results into the model.
gate_proj (gate projection)
Description: Typically used in gated mechanisms within neural networks, such as gating units in gated recurrent units (GRUs) or other gating mechanisms.
Impact: Controls the flow of information through the gate, allowing selective information passage based on learned weights.
up_proj (up projection)
Description: Used for up-projection, typically increasing the dimensionality of the input.
Impact: Expands the input to a higher-dimensional space, often used in feedforward layers or when transitioning between different layers with differing dimensionalities.
down_proj (down projection)
Description: Used for down-projection, typically reducing the dimensionality of the input.
Impact: Compresses the input to a lower-dimensional space, useful for reducing computational complexity and controlling the model size.
Last updated