🧠LoRA Parameters Encyclopedia

Learn how parameters affect the finetuning process. Written by Sebastien.

LoraConfig Parameters

Adjusting the LoraConfig parameters allows you to balance model performance and computational efficiency in Low-Rank Adaptation (LoRA). Here’s a concise breakdown of key parameters:

r

  • Description: Rank of the low-rank decomposition for factorizing weight matrices.

  • Impact:

    • Higher: Retains more information, increases computational load.

    • Lower: Fewer parameters, more efficient training, potential performance drop if too small.
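
To make the trade-off concrete, here is a minimal sketch of how r drives the number of trainable parameters. It assumes the standard LoRA factorization of one weight matrix into a pair of matrices A (r x d_in) and B (d_out x r). The helper name lora_param_count and the 4096 x 4096 layer size are illustrative assumptions, not values from this guide.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one (d_out x d_in) weight matrix.

    LoRA learns A (r x d_in) and B (d_out x r), so the count is r * (d_in + d_out).
    """
    return r * (d_in + d_out)


# Hypothetical 4096 x 4096 projection (an assumed size, for illustration only).
full = 4096 * 4096
for r in (8, 16, 64):
    added = lora_param_count(4096, 4096, r)
    print(f"r={r}: {added:,} LoRA params ({added / full:.2%} of the full matrix)")
```

The count grows linearly with r, which is why doubling the rank doubles the adapter size for every targeted module.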

lora_alpha

  • Description: Scaling factor for the low-rank matrices' contribution.

  • Impact:

    • Higher: Increases influence, speeds up convergence, risks instability or overfitting.

    • Lower: Subtler effect, may require more training steps.

lora_dropout

  • Description: Dropout probability applied to the input of the LoRA layers during training, used for regularization.

  • Impact:

    • Higher: More regularization, which helps prevent overfitting but may slow training and degrade performance if set too high.

    • Lower: Less regularization, may speed up training, risks overfitting.

loftq_config

  • Description: Configuration for LoftQ, a quantization method for the backbone weights and initialization of LoRA layers.

  • Impact:

    • Not None: If specified, LoftQ will quantize the backbone weights and initialize the LoRA layers. It requires setting init_lora_weights='loftq'.

    • None: LoftQ quantization is not applied.

    • Note: Do not pass an already quantized model when using LoftQ as LoftQ handles the quantization process itself.
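
As a rough illustration, the configuration fragment below shows how these options might fit together when using the Hugging Face PEFT library directly. The specific values (4-bit quantization, r=16, the chosen target modules) are assumptions for the sake of the example, not recommendations from this guide.

```python
from peft import LoftQConfig, LoraConfig

# Assumed setting: let LoftQ quantize the backbone to 4 bits.
loftq_config = LoftQConfig(loftq_bits=4)

lora_config = LoraConfig(
    r=16,                                   # illustrative rank
    lora_alpha=16,                          # illustrative scaling
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    init_lora_weights="loftq",              # required when loftq_config is set
    loftq_config=loftq_config,
)
```

The base model passed alongside this config should be loaded unquantized, since LoftQ performs the quantization itself.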

use_rslora

  • Description: Enables Rank-Stabilized LoRA (RSLora).

  • Impact:

    • True: Uses Rank-Stabilized LoRA, setting the adapter scaling factor to lora_alpha/math.sqrt(r), which the Rank-Stabilized LoRA paper reports to work better, especially at higher ranks.

    • False: Uses the original default scaling factor lora_alpha/r.
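
The two scaling rules can be compared with a few lines of arithmetic. This is a minimal sketch using the formulas given above; the function name lora_scaling and the sample values are hypothetical.

```python
import math


def lora_scaling(lora_alpha: float, r: int, use_rslora: bool = False) -> float:
    """Factor applied to the low-rank update before it is added to the base output."""
    if use_rslora:
        return lora_alpha / math.sqrt(r)  # rank-stabilized scaling
    return lora_alpha / r                 # original LoRA scaling


# With lora_alpha fixed, the original factor shrinks much faster as r grows.
for r in (8, 64, 256):
    print(r, lora_scaling(16, r), lora_scaling(16, r, use_rslora=True))
```

This also shows why lora_alpha and r are usually tuned together: raising r without raising lora_alpha reduces the adapter's contribution under the default rule.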

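Training Parameters

The parameters below are trainer arguments rather than LoraConfig fields, but they are commonly tuned alongside the LoRA settings above.

gradient_accumulation_steps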

  • Default: 1

  • Description: The number of forward/backward passes over which gradients are accumulated before the optimizer performs a single weight update.

  • Impact:

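    • Higher: Accumulates gradients over multiple steps, so the effective batch size becomes the per-step batch size multiplied by gradient_accumulation_steps, without requiring additional memory. This can improve training stability and convergence, especially with large models and limited hardware.

    • Lower: More frequent weight updates, but reaching the same effective batch size requires a larger per-step batch and therefore more memory; with a small effective batch, training can be less stable.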
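
The idea can be checked on a toy problem. The sketch below assumes a one-parameter model y = w * x with a mean-squared-error loss, and compares the gradient of one large batch with the averaged gradients of two micro-batches. The data, the helper grad_mse, and the batch sizes are made up for illustration.

```python
def grad_mse(w, batch):
    """Gradient of the mean squared error of y = w * x over one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


data = [(1.0, 2.0), (2.0, 4.5), (3.0, 5.5), (4.0, 8.5)]  # toy (x, y) pairs
w = 0.5
gradient_accumulation_steps = 2
micro_batch_size = len(data) // gradient_accumulation_steps

# One large batch: a single gradient over all four examples.
full_batch_grad = grad_mse(w, data)

# Accumulation: sum the scaled gradients of two micro-batches, then update once.
accumulated = 0.0
for step in range(gradient_accumulation_steps):
    micro = data[step * micro_batch_size:(step + 1) * micro_batch_size]
    accumulated += grad_mse(w, micro) / gradient_accumulation_steps

print(full_batch_grad, accumulated)
```

Only one micro-batch needs to be held in memory at a time, which is where the memory saving comes from.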

weight_decay

  • Default: 0.01

  • Description: Regularization technique that applies a small penalty to the weights during training.

  • Impact:

    • Non-zero Value (e.g., 0.01): Adds a penalty proportional to the magnitude of the weights to the loss function, helping to prevent overfitting by discouraging large weights.

    • Zero: No weight decay is applied, which can lead to overfitting, especially in large models or with small datasets.
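
As a small sketch of the penalty described above, the code below applies plain gradient descent with an L2 penalty of (weight_decay / 2) * w**2 added to the loss, which contributes weight_decay * w to the gradient. The task gradient is set to zero so that only the decay term acts. The function name and step count are assumptions, and optimizers such as AdamW apply the decay in a decoupled way that differs in detail.

```python
def sgd_step(w, task_grad, lr, weight_decay):
    """One gradient-descent step on loss + (weight_decay / 2) * w**2."""
    total_grad = task_grad + weight_decay * w  # the penalty adds weight_decay * w
    return w - lr * total_grad


w_decay, w_plain = 1.0, 1.0
for _ in range(100):
    # Zero task gradient isolates the effect of the decay term.
    w_decay = sgd_step(w_decay, task_grad=0.0, lr=2e-4, weight_decay=0.01)
    w_plain = sgd_step(w_plain, task_grad=0.0, lr=2e-4, weight_decay=0.0)

print(w_decay, w_plain)
```

With decay enabled the weight drifts slowly toward zero, while without it the weight stays where the task gradient leaves it.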

learning_rate

  • Default: 2e-4

  • Description: The rate at which the model updates its parameters during training.

  • Impact:

    • Higher: Faster convergence but risks overshooting optimal parameters and causing instability in training.

    • Lower: More stable and precise updates but may slow down convergence, requiring more training steps to achieve good performance.
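
Both failure modes show up even on a toy objective. The sketch below runs gradient descent on f(w) = w**2, whose minimum is at w = 0, with three learning rates chosen purely for illustration. They are not recommended fine-tuning values, and descend is a hypothetical helper.

```python
def descend(lr, steps=20, w=1.0):
    """Run gradient descent on f(w) = w**2 (gradient 2 * w) and return the final w."""
    for _ in range(steps):
        w -= lr * 2 * w
    return w


for lr in (0.01, 0.1, 1.1):
    print(f"lr={lr}: |w| after 20 steps = {abs(descend(lr)):.4f}")
```

The smallest rate moves toward the minimum slowly, the middle one gets close within the same number of steps, and the largest overshoots on every step and moves further away.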

Target Modules

q_proj (query projection)

  • Description: Part of the attention mechanism in transformer models, responsible for projecting the input into the query space.

  • Impact: Transforms the input into query vectors that are used to compute attention scores.

k_proj (key projection)

  • Description: Projects the input into the key space in the attention mechanism.

  • Impact: Produces key vectors that are compared with query vectors to determine attention weights.

v_proj (value projection)

  • Description: Projects the input into the value space in the attention mechanism.

  • Impact: Produces value vectors that are weighted by the attention scores and combined to form the output.

o_proj (output projection)

  • Description: Projects the output of the attention mechanism back into the original space.

  • Impact: Transforms the combined weighted value vectors back to the input dimension, integrating attention results into the model.
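
The roles of the four attention projections can be sketched in plain Python. This is a simplified single-head version with no causal mask, using nested lists as matrices and identity weights so the numbers stay easy to follow. All names and values here are illustrative assumptions rather than any library's implementation.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(x, q_proj, k_proj, v_proj, o_proj):
    # Project the same input into query, key, and value spaces.
    q, k, v = matmul(x, q_proj), matmul(x, k_proj), matmul(x, v_proj)
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    # Compare queries with keys, then normalize each row into attention weights.
    scores = [[s / math.sqrt(d) for s in row] for row in matmul(q, k_t)]
    weights = [softmax(row) for row in scores]
    # Weight the values, then project back to the input dimension.
    return matmul(matmul(weights, v), o_proj), weights


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 tokens, hidden size 2
identity = [[1.0, 0.0], [0.0, 1.0]]
out, weights = attention(x, identity, identity, identity, identity)
print(out)
```

Applying LoRA to any of these modules adds a low-rank update to the corresponding weight matrix while leaving the rest of the computation unchanged.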

gate_proj (gate projection)

  • Description: Part of the gated feedforward (MLP) block found in many transformer models. It projects the input to the intermediate dimension, and the result is passed through an activation function to form a gate.

  • Impact: Controls the flow of information through the feedforward block by scaling the up_proj output element-wise, allowing selective information passage based on learned weights.

up_proj (up projection)

  • Description: Used for up-projection, increasing the dimensionality of the input within the feedforward block.

  • Impact: Expands the input from the hidden size to a higher-dimensional intermediate space, where it is combined with the gate before being projected back down.

down_proj (down projection)

  • Description: Used for down-projection, reducing the feedforward block's intermediate activations back to the model's hidden dimension.

  • Impact: Compresses the expanded representation to the hidden size so the feedforward output matches the dimension of the block's input.
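
How the three feedforward projections fit together can be sketched as a small gated MLP. The code assumes a Llama-style layout with a SiLU activation on the gate, which is an assumption about the model architecture. The tiny sizes and weight values are made up for illustration.

```python
import math


def matvec(weight, x):
    """Multiply a (rows x cols) weight matrix, stored as nested lists, by a vector."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in weight]


def silu(z):
    return z / (1.0 + math.exp(-z))


def gated_mlp(x, gate_proj, up_proj, down_proj):
    gate = [silu(g) for g in matvec(gate_proj, x)]  # hidden -> intermediate, then activation
    up = matvec(up_proj, x)                         # hidden -> intermediate
    fused = [g * u for g, u in zip(gate, up)]       # the gate scales the expanded values
    return matvec(down_proj, fused)                 # intermediate -> hidden


hidden, intermediate = 2, 3
gate_proj = [[0.1, -0.2], [0.3, 0.4], [-0.5, 0.6]]  # intermediate x hidden
up_proj = [[0.2, 0.1], [-0.3, 0.5], [0.4, -0.1]]    # intermediate x hidden
down_proj = [[0.1, 0.2, 0.3], [-0.1, 0.0, 0.2]]     # hidden x intermediate
out = gated_mlp([1.0, 2.0], gate_proj, up_proj, down_proj)
print(out)
```

The output has the same length as the input, which is what lets the block's result be added back to its input in a residual connection.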
