From: hu-po
Low-Rank Adaptation (LoRA) is a technique that has gained popularity for fine-tuning existing large models, particularly large language models (LLMs) and image generation models like Stable Diffusion [00:01:09] [00:01:27] [00:01:34]. Developed by Microsoft in October 2021 [00:01:40], LoRA addresses the challenges of traditional full fine-tuning by significantly reducing the number of trainable parameters and computational requirements [00:03:37].
Traditional Fine-Tuning (Full Fine-Tuning)
Traditional fine-tuning involves updating all parameters of a pre-trained model to adapt it to a particular domain or task [00:02:28]. This approach is common in natural language processing (NLP), where large-scale pre-training on general domain data is followed by adaptation to specific tasks [00:02:26] [00:06:44].
Challenges of Full Fine-Tuning
As models become larger, such as GPT-3 with 175 billion parameters [00:03:32] [00:08:43], full fine-tuning becomes less feasible due to:
- Prohibitive Cost: Deploying independent instances of fully fine-tuned models is extremely expensive [00:03:34] [01:16:43].
- Memory Requirements: Training large models requires substantial GPU memory; for example, 1.2 terabytes for GPT-3 with the Adam optimizer [00:50:53]. This includes storing optimizer states for all parameters [00:22:00] [00:24:05].
- Storage Burden: Each fine-tuned model contains as many parameters as the original model, making storing and deploying many independent instances challenging [00:22:00] [00:22:27]. The checkpoint size for GPT-3 can be 30 gigabytes [00:51:30].
Low-Rank Adaptation (LoRA)
LoRA tackles these issues by freezing the pre-trained model weights and injecting trainable rank decomposition matrices into each layer of the Transformer architecture [00:03:40] [00:04:07]. This means only a small number of additional parameters (the “LoRA modules”) are trained [00:05:06].
Core Mechanism
The fundamental idea behind LoRA is based on the hypothesis that the change in weights during model adaptation (ΔW) has a low intrinsic rank [00:10:33] [00:37:04]. Instead of directly training ΔW (which would be the same size as the original weight matrix W₀), LoRA represents ΔW as a product of two smaller matrices, B and A, where ΔW = BA [00:38:41] [00:39:02].
- W₀ (the pre-trained weight matrix) has dimensions D × K.
- B has dimensions D × R.
- A has dimensions R × K.
- R (the rank) is chosen to be much smaller than D or K [00:41:14].
During training, W₀ is frozen and does not receive gradient updates [00:40:08]. Only matrices A and B (which contain the trainable parameters) are optimized [00:40:10]. Matrix A is initialized with random Gaussian values, while B is initialized with zeros, making ΔW zero at the start of training [00:41:34] [00:41:48]. The forward pass is modified to compute W₀X + BAX = (W₀ + BA)X [00:40:36]. A scaling factor (α/R) is applied to ΔWX, which helps reduce the need to retune hyperparameters when varying R [00:41:57] [00:42:21].
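A minimal PyTorch sketch of such a LoRA-augmented linear layer is shown below; the class name, initialization constants, and dimensions are illustrative, not the authors' reference implementation.

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """A frozen linear layer W0 plus a trainable low-rank update BA."""

    def __init__(self, in_features: int, out_features: int, r: int = 8, alpha: float = 16.0):
        super().__init__()
        # Frozen pre-trained weight W0 (D x K); it receives no gradient updates.
        self.weight = nn.Parameter(torch.empty(out_features, in_features), requires_grad=False)
        nn.init.normal_(self.weight, std=0.02)  # stand-in for real pre-trained values

        # Trainable low-rank factors: A is random Gaussian, B is zero, so BA = 0 at the start.
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))
        self.scaling = alpha / r  # the alpha / R scaling factor

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # h = W0 x + (alpha / R) * B A x = (W0 + scaled BA) x
        base = x @ self.weight.T
        delta = (x @ self.lora_A.T) @ self.lora_B.T
        return base + self.scaling * delta


layer = LoRALinear(in_features=1024, out_features=1024, r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 2 * 8 * 1024 = 16,384 trainable values vs. 1,048,576 in the full matrix
```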
Benefits of LoRA
- Reduced Trainable Parameters: LoRA can reduce the number of trainable parameters by up to 10,000 times compared to full fine-tuning for models like GPT-3 175B [00:35:29]. For GPT-3, trainable parameters can be as small as 0.01% of the total model parameters [01:16:17]. This is achieved by encoding the task-specific parameter increment (ΔΦ) with a much smaller set of parameters (Θ) [00:24:46].
- Lower GPU Memory Requirement: LoRA reduces GPU memory requirements by up to three times, especially when using adaptive optimizers like Adam [00:32:03]. This is because optimizer states are not maintained for the frozen pre-trained parameters [00:21:58] [00:24:22] [00:50:51]. For GPT-3, VRAM usage during training can be reduced from 1.2 terabytes to 350 gigabytes [00:50:53].
- No Additional Inference Latency: When deployed, the learned LoRA weights (BA) can be explicitly computed and added directly to the original pre-trained weights (W₀ + BA) [00:45:26]. This results in no additional inference latency compared to a fully fine-tuned model, as the model size remains the same [00:05:50] [00:33:37] [01:19:37]. This contrasts with “adapter layers” which add sequential compute steps, increasing latency [00:29:53] [00:30:52].
- Faster Training Throughput: LoRA offers a 25% speed-up during training compared to full fine-tuning because gradients are not calculated for the vast majority of parameters [00:52:03].
- Efficient Task Switching: For multi-task scenarios, a pre-trained model can be shared, and multiple small LoRA modules can be built for different tasks [00:12:11]. Switching between tasks is efficient: the existing BA can be subtracted from W₀, and a new BA for another task can be added [00:45:46] [00:51:44]. This significantly reduces storage requirements and task switching overhead [00:12:25]. The checkpoint size for a LoRA module can be 10,000 times smaller (e.g., 35 megabytes vs. 30 gigabytes for GPT-3) [00:51:27].
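A hedged sketch of the merge/unmerge step behind these last two points, assuming plain PyTorch tensors and illustrative shapes: folding BA into W₀ gives zero-latency inference, and subtracting it back out lets another task's module be swapped in.

```python
import torch


@torch.no_grad()
def merge(weight: torch.Tensor, lora_A: torch.Tensor, lora_B: torch.Tensor, scaling: float) -> None:
    """Fold the low-rank update into the frozen weight in place: W0 <- W0 + (alpha/R) * BA."""
    weight += scaling * (lora_B @ lora_A)


@torch.no_grad()
def unmerge(weight: torch.Tensor, lora_A: torch.Tensor, lora_B: torch.Tensor, scaling: float) -> None:
    """Undo the merge so another task's A and B can be swapped in."""
    weight -= scaling * (lora_B @ lora_A)


# Toy shapes: W0 is D x K, B is D x R, A is R x K.
D, K, R = 1024, 1024, 8
W0 = torch.randn(D, K)
A1, B1 = torch.randn(R, K) * 0.01, torch.randn(D, R) * 0.01  # "trained" task-1 factors
A2, B2 = torch.randn(R, K) * 0.01, torch.randn(D, R) * 0.01  # "trained" task-2 factors

merge(W0, A1, B1, scaling=16 / R)    # deploy task 1: no extra inference latency
unmerge(W0, A1, B1, scaling=16 / R)  # subtract task 1's BA to recover the original W0
merge(W0, A2, B2, scaling=16 / R)    # swap in task 2's module
```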
Application of LoRA
While initially focused on language models like GPT-3, LoRA’s principles apply to any dense layers in deep learning models [00:33:42] [01:49:45]. It has been successfully applied to image generation models (e.g., diffusion models like Stable Diffusion) [00:01:34] [00:35:37]. In Transformer architectures, LoRA is typically applied to the query (WQ) and value (WV) projection matrices within the self-attention module [00:46:50] [00:50:00] [01:03:46].
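As a usage sketch, assuming the Hugging Face peft library (the target module names depend on the architecture: GPT-2 fuses Q, K, and V into a single c_attn projection, while LLaMA-style models expose q_proj and v_proj, matching the paper's WQ/WV choice):

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("gpt2")

config = LoraConfig(
    r=8,                        # the rank R
    lora_alpha=16,              # the alpha in the alpha/R scaling factor
    target_modules=["c_attn"],  # GPT-2's fused attention projection
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, config)
model.print_trainable_parameters()  # trainable params are a small fraction of the total
```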
LoRA vs. Other Parameter-Efficient Methods
LoRA stands apart from other parameter-efficient adaptation methods such as Adapter layers and Prefix Tuning:
- Adapter Layers: These involve inserting new, small layers between existing layers and training only those [00:29:16]. While they reduce trainable parameters, they introduce additional inference latency because they add sequential compute steps [00:29:53] [00:30:22].
- Prefix Tuning: This method optimizes a specific “prompt” or input layer activations by adding special tokens to the input sequence [00:29:22] [01:01:15]. This approach can be difficult to optimize and reduces the effective sequence length available for the task [00:31:30] [01:01:40].
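To make the latency contrast concrete, here is a minimal, illustrative sketch of a bottleneck adapter: it runs as an extra sequential step that cannot be merged into the frozen weights, unlike LoRA's BA branch.

```python
import torch
import torch.nn as nn


class Adapter(nn.Module):
    """A bottleneck adapter: down-project, non-linearity, up-project, residual add."""

    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.ReLU()
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Runs after the frozen layer as an extra sequential step, which is where
        # the additional inference latency comes from; it cannot be folded into
        # the original weights the way a LoRA BA update can.
        return x + self.up(self.act(self.down(x)))
```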
LoRA performs on par with or better than full fine-tuning and other methods in model quality, despite having fewer trainable parameters [00:05:40] [00:58:24] [01:11:49]. Its effectiveness is particularly pronounced in larger models like GPT-2 and GPT-3, possibly because larger models have more model capacity, leading to a higher likelihood of low-rank structure in their weight updates [01:05:52] [01:07:26].
Determining the Rank (R)
The rank R is a crucial hyperparameter for LoRA. Instead of a theoretical calculation, R is often chosen based on a predetermined “parameter budget” for the LoRA modules [01:25:34] [01:25:58]. Studies have shown that even a very small rank, such as R = 1, can achieve competitive performance on some datasets, suggesting that the intrinsic rank of weight updates is indeed very low [01:29:56] [01:30:30] [01:40:03].
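A back-of-the-envelope sketch of how a parameter budget maps to a rank, using illustrative GPT-3-scale dimensions: each adapted D × K matrix contributes R × (D + K) trainable parameters.

```python
# Illustrative numbers: adapt WQ and WV in every layer of a GPT-3-scale model.
d_model = 12288          # hidden size of GPT-3 175B
num_layers = 96
matrices_per_layer = 2   # WQ and WV


def lora_params(r: int) -> int:
    # Each adapted d_model x d_model matrix contributes r * (d_model + d_model) parameters.
    return num_layers * matrices_per_layer * r * (2 * d_model)


for r in (1, 2, 4, 8):
    print(r, f"{lora_params(r):,}")  # e.g. r=4 gives ~18.9M parameters, about 0.01% of 175B
```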
Limitations of LoRA
- Batching Challenges: It is not straightforward to batch inputs from different tasks that require different A and B matrices in a single forward pass [01:16:43] [01:17:10] [01:17:14].
- Optimal Configuration: The optimal choice of which weight matrices to adapt with LoRA (e.g., Query, Key, Value, or MLP layers) can vary significantly depending on the model architecture and the specific task [00:34:30] [01:27:29]. This suggests further research is needed to determine the best application strategy [00:50:04].
- Task/Language Dependence: While LoRA performs well on many tasks, a small rank might not suffice for every task or dataset, particularly if the downstream tasks are in a different language than the pre-training data [01:30:51].
Conclusion
LoRA represents a significant advancement in fine-tuning large models by enabling pre-trained models to be adapted with minimal additional parameters [00:05:40]. Its ability to drastically reduce memory consumption, storage needs, and training time, while maintaining or even improving model quality and introducing no additional inference latency after merging, makes it a highly efficient and practical solution for training and fine-tuning AI models [00:51:57] [00:33:37]. The principle of low-rank updates suggests an inherent compressibility in the changes needed for adaptation, providing a glimpse into the underlying mechanisms of deep learning [01:21:06] [01:50:48].