With the rise of ChatGPT and its alternatives, various Parameter-Efficient Fine-Tuning methods have gained prominence. Among the most popular is the protagonist of this article: LoRA (Low-Rank Adaptation), introduced in the paper "LoRA: Low-Rank Adaptation of Large Language Models". The LoRA method is relatively straightforward, with numerous existing implementations, making it easy to understand and use; thus, there is little need for extensive elaboration.
However, direct implementation of LoRA requires modifying network architecture, which is somewhat cumbersome. Additionally, LoRA reminds the author of the earlier optimizer AdaFactor. Hence, the author's question is: Can we analyze and implement LoRA from an optimization perspective? This article explores this theme.
Method Introduction#
Previous results (e.g., "Exploring Universal Intrinsic Task Subspace via Prompt Tuning") indicate that although pre-trained models have vast parameter counts, the intrinsic dimension for each downstream task is not large. In other words, theoretically, we can fine-tune a very small number of parameters to achieve good performance on downstream tasks.
LoRA draws inspiration from these findings, proposing that for a pre-trained parameter matrix $W_0\in\mathbb{R}^{n\times m}$, instead of directly fine-tuning $W_0$, we impose a low-rank decomposition assumption on the increment:

$$W = W_0 + AB,\qquad A\in\mathbb{R}^{n\times r},\ B\in\mathbb{R}^{r\times m},\ r\ll\min(n,m)$$
where one of $A,B$ is initialized with zeros, $W_0$ remains fixed, and the optimizer only updates $A,B$. Given the conclusion that intrinsic dimensions are small, we can set $r$ to be very small, commonly $r=8$, and in extreme cases, we can even set $r=1$. Thus, LoRA is a parameter-efficient fine-tuning method, at least significantly reducing the number of optimized parameters.
A schematic diagram of the LoRA decomposition:
Figure 1: Low-rank adaptation (LoRA) decomposition visualization. The pre-trained weight matrix $W_0$ (blue) is augmented by a low-rank product $A \times B$ (orange), where $r \ll \min(n,m)$.
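To make the parameterization concrete, below is a minimal numpy sketch; the sizes, the 0.01 scaling, and all variable names are illustrative choices, not taken from the paper:

```python
import numpy as np

n, m, r, b = 64, 48, 8, 4           # illustrative sizes: W0 is n x m, rank r, batch size b
rng = np.random.default_rng(0)

W0 = rng.normal(size=(n, m))        # pre-trained weight, kept frozen
A = rng.normal(size=(n, r)) * 0.01  # trainable low-rank factor
B = np.zeros((r, m))                # the other factor starts at zero, so A @ B = 0 initially

X = rng.normal(size=(b, n))         # a batch of inputs

# The effective weight is W0 + A @ B, but only A and B would be updated during training.
Y = X @ (W0 + A @ B)

# At initialization A @ B == 0, so the adapted model reproduces the pre-trained one exactly.
assert np.allclose(Y, X @ W0)
```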
Gradient Analysis#
As mentioned in "Ladder Side-Tuning: The 'Ladder' for Pre-trained Models", many parameter-efficient fine-tuning methods only reduce memory requirements without decreasing computational load. Is LoRA an exception? How efficient is it in terms of memory and computation? Let's analyze.
First, we know that memory consumption during training comes from four sources: model parameters, model gradients, model activations, and optimizer states. LoRA reduces model parameters through low-rank decomposition; consequently, gradients and optimizer states are also reduced, leading to significant memory savings. But can it reduce computational load?
This depends on how LoRA is implemented; different implementations lead to different complexities for the gradient computation. The two equivalent implementations of LoRA are:

\begin{equation}Y = X(W_0 + AB)\label{eq:lora-1}\end{equation}

\begin{equation}Y = XW_0 + (XA)B\label{eq:lora-2}\end{equation}
where $X\in\mathbb{R}^{b\times n}$ is the model input, and $Z=XA\in\mathbb{R}^{b\times r}$ is an intermediate output. For implementation $\eqref{eq:lora-1}$, we have:

$$\frac{\partial \mathcal{L}}{\partial A}=\frac{\partial \mathcal{L}}{\partial W}B^{\top},\qquad \frac{\partial \mathcal{L}}{\partial B}=A^{\top}\frac{\partial \mathcal{L}}{\partial W},\qquad \frac{\partial \mathcal{L}}{\partial W}=X^{\top}\frac{\partial \mathcal{L}}{\partial Y}$$
where $\mathcal{L}$ is the loss function and $W=W_0+AB$. Clearly, this implementation requires computing the full gradient $\frac{\partial \mathcal{L}}{\partial W}\in\mathbb{R}^{n\times m}$ before the gradients for $A,B$ can be obtained, meaning that, for this step, it is no better than full fine-tuning in either speed or memory. For implementation $\eqref{eq:lora-2}$, we have:

$$\frac{\partial \mathcal{L}}{\partial A}=X^{\top}\frac{\partial \mathcal{L}}{\partial Z},\qquad \frac{\partial \mathcal{L}}{\partial B}=Z^{\top}\frac{\partial \mathcal{L}}{\partial Y},\qquad \frac{\partial \mathcal{L}}{\partial Z}=\frac{\partial \mathcal{L}}{\partial Y}B^{\top}$$
Here, $Z,\frac{\partial \mathcal{L}}{\partial Z}\in\mathbb{R}^{b\times r}$, which are significantly smaller than full gradients, and computational complexity is markedly reduced. Therefore, to maximize memory and computational savings with LoRA, it is crucial to implement according to $\eqref{eq:lora-2}$ rather than $\eqref{eq:lora-1}$.
(Note: Regarding matrix gradient computation, we can "deduce" using the chain rule and output shapes. For example, for $\frac{\partial \mathcal{L}}{\partial A}$, the chain rule tells us it must involve multiplication of $\frac{\partial \mathcal{L}}{\partial W}$ and $B$ in some manner. By convention, $\frac{\partial \mathcal{L}}{\partial A}$ has the same shape as $A$, i.e., $n\times r$. To obtain an $n\times r$ result from $\frac{\partial \mathcal{L}}{\partial W}$ and $B$, the only possibility is $\frac{\partial \mathcal{L}}{\partial W} B^{\top}$.)
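To make the contrast concrete, the following numpy sketch (toy sizes, the toy loss $\mathcal{L}=\sum Y$, and my own variable names) computes the $A,B$ gradients along both routes and checks that they agree; only the route of $\eqref{eq:lora-2}$ avoids materializing an $n\times m$ gradient:

```python
import numpy as np

n, m, r, b = 64, 48, 8, 4
rng = np.random.default_rng(0)
W0, A, B = rng.normal(size=(n, m)), rng.normal(size=(n, r)), rng.normal(size=(r, m))
X = rng.normal(size=(b, n))

# Toy loss L = sum(Y), so dL/dY is a matrix of ones.
dY = np.ones((b, m))

# Implementation (1): Y = X (W0 + A B); goes through the full n x m gradient dL/dW.
dW = X.T @ dY                      # n x m -- as large as in full fine-tuning
dA_1 = dW @ B.T                    # n x r
dB_1 = A.T @ dW                    # r x m

# Implementation (2): Y = X W0 + (X A) B; only b x r intermediates are needed.
Z = X @ A                          # b x r
dZ = dY @ B.T                      # b x r
dA_2 = X.T @ dZ                    # n x r
dB_2 = Z.T @ dY                    # r x m

# Both routes give identical gradients; only the memory/compute profile differs.
assert np.allclose(dA_1, dA_2) and np.allclose(dB_1, dB_2)
```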
Additional Factors#
Beyond benefits from low-rank decomposition, the following points also contribute to LoRA's memory savings and speed improvements:
1. Partial Parameter Updates: For instance, the original LoRA paper chooses to update only Self-Attention parameters. In practice, we can further select to update only parameters from specific layers.
2. Reduced Communication Time: Since the number of updated parameters decreases, the amount of data to transmit (especially in multi-GPU training) also decreases, reducing communication time.
3. Low-Precision Acceleration Techniques: various low-precision techniques such as FP16, FP8, or INT8 quantization can be employed to further speed up training.
Certainly, these three factors can accelerate training, but they are not unique to LoRA; indeed, nearly all parameter-efficient methods share these characteristics. LoRA's standout advantages are its intuitive low-rank decomposition, its performance often matching full fine-tuning in many scenarios, and its ability to merge $W_0, A, B$ into a single matrix during inference without increasing inference costs.
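The merging trick is nothing more than adding the trained low-rank product back into the frozen weight; a small illustrative numpy check (names and sizes are my own):

```python
import numpy as np

n, m, r = 64, 48, 8
rng = np.random.default_rng(0)
W0 = rng.normal(size=(n, m))
A, B = rng.normal(size=(n, r)), rng.normal(size=(r, m))
X = rng.normal(size=(5, n))

# After training, fold the adapter into a single matrix for deployment.
W_merged = W0 + A @ B

# Inference with the merged weight matches the adapted model exactly,
# so LoRA adds no extra latency or parameters at inference time.
assert np.allclose(X @ W_merged, X @ W0 + (X @ A) @ B)
```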
Optimization Perspective#
The gradient formulas above also tell us how to implement LoRA from the optimizer's perspective. The optimizer has direct access to the full gradient $\frac{\partial \mathcal{L}}{\partial W}$; we only need to project it, following the formulas above, to obtain the gradients for $A,B$, and then update $A,B$ with a conventional optimizer implementation.
If the optimizer is SGD, then:

$$A_{t+1}=A_t-\eta\,\frac{\partial \mathcal{L}}{\partial W_t}B_t^{\top},\qquad B_{t+1}=B_t-\eta\,A_t^{\top}\frac{\partial \mathcal{L}}{\partial W_t}$$

where $\eta$ is the learning rate.
For optimizers with momentum or other state variables, such as Adam, we only need to accumulate those statistics over the projected gradients, thereby shrinking the optimizer states and saving additional memory. The larger the model, the greater the proportion of memory these states occupy.
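Below is a minimal sketch, under the SGD rule above, of what such an "optimizer-side" LoRA step could look like; `grad_W` stands for the full gradient delivered by backpropagation, and the function name and shapes are my own illustrative choices:

```python
import numpy as np

def lora_sgd_step(A, B, grad_W, lr=1e-3):
    """One SGD step where LoRA lives entirely in the optimizer:
    the full gradient grad_W (n x m) is projected onto the A/B factors."""
    grad_A = grad_W @ B.T          # n x r, gradient for A
    grad_B = A.T @ grad_W          # r x m, gradient for B
    # For Adam-style optimizers, momentum / second-moment buffers would be kept
    # only for grad_A and grad_B, which is where the extra memory saving comes from.
    return A - lr * grad_A, B - lr * grad_B

# Illustrative usage with random shapes.
n, m, r = 64, 48, 8
rng = np.random.default_rng(0)
A, B = rng.normal(size=(n, r)), np.zeros((r, m))
grad_W = rng.normal(size=(n, m))   # full gradient supplied by backprop
A, B = lora_sgd_step(A, B, grad_W)
```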
LoRA stipulates that one of $A$ or $B$ is initialized to zero so that the initial model coincides with the pre-trained model, but this introduces an asymmetry (one factor all-zero, the other non-zero). In fact, initializing both $A$ and $B$ to non-zero values is also feasible; we simply need to subtract $A_0 B_0$ from the pre-trained weights beforehand, or equivalently, parameterize $W$ as:

$$W = (W_0 - A_0 B_0) + AB$$
This maintains initial consistency while allowing both $A,B$ to be non-zero initialized, enhancing symmetry.
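A small illustrative sketch of this symmetric scheme: both factors start non-zero, and the initial product is absorbed into the frozen weight so the starting point is unchanged (all names and the 0.01 scale are my own):

```python
import numpy as np

n, m, r = 64, 48, 8
rng = np.random.default_rng(0)
W0 = rng.normal(size=(n, m))

# Both factors get a non-zero random initialization.
A0 = rng.normal(size=(n, r)) * 0.01
B0 = rng.normal(size=(r, m)) * 0.01

# Absorb -A0 B0 into the frozen weight so the initial model is unchanged.
W_frozen = W0 - A0 @ B0
A, B = A0.copy(), B0.copy()

# At step 0 the effective weight still equals the pre-trained one.
assert np.allclose(W_frozen + A @ B, W0)
```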
Random Projection#
If we expand the update term $A_{t+1} B_{t+1} - A_t B_t$ in the SGD scenario, the result is:

$$A_{t+1}B_{t+1}-A_tB_t=-\eta\left(\frac{\partial \mathcal{L}}{\partial W_t}B_t^{\top}B_t+A_tA_t^{\top}\frac{\partial \mathcal{L}}{\partial W_t}\right)+\eta^2\,\frac{\partial \mathcal{L}}{\partial W_t}B_t^{\top}A_t^{\top}\frac{\partial \mathcal{L}}{\partial W_t}$$
Treating the $\eta^2$ term as a negligible higher-order quantity, we are left with:

$$W_{t+1}\approx W_t-\eta\left(\frac{\partial \mathcal{L}}{\partial W_t}B_t^{\top}B_t+A_tA_t^{\top}\frac{\partial \mathcal{L}}{\partial W_t}\right)$$
From this perspective, compared to full fine-tuning SGD, LoRA essentially replaces the full gradient $\frac{\partial \mathcal{L}}{\partial W_t}$ with the bracketed result.
For simplicity, consider only the case $r=1$. Note that in the expression above, the projection vectors $A_t,B_t$ depend on the step $t$. What if we replace them with random vectors independent of $t$, regenerated at each training step? Consider $u\in\mathbb{R}^{n\times 1}$ and $v\in\mathbb{R}^{1\times m}$ with entries drawn i.i.d. from $\mathcal{N}(0,1)$. Then the update becomes:

$$W_{t+1}=W_t-\eta\left(\frac{\partial \mathcal{L}}{\partial W_t}v^{\top}v+uu^{\top}\frac{\partial \mathcal{L}}{\partial W_t}\right)$$
It can be proven that:

$$\mathbb{E}_u\left[uu^{\top}\right]=I_{n\times n},\qquad \mathbb{E}_v\left[v^{\top}v\right]=I_{m\times m}$$

and therefore

$$\mathbb{E}_{u,v}\left[\frac{\partial \mathcal{L}}{\partial W_t}v^{\top}v+uu^{\top}\frac{\partial \mathcal{L}}{\partial W_t}\right]=2\,\frac{\partial \mathcal{L}}{\partial W_t}$$
Here $I_{n\times n},I_{m\times m}$ denote $n\times n$ and $m\times m$ identity matrices, respectively. Therefore, similar to "zeroth-order gradients," on average, LoRA with re-initialization at each step is equivalent to full-rank SGD. However, implementing this way could be even slower than full-rank SGD, so its purpose is not speed but potentially mitigating catastrophic forgetting—by using low-rank matrices (instead of full-rank) for updates on individual (batch) samples, reducing impact on overall model weights. Of course, this is conjecture; actual effects require experimental validation, which the author has not yet conducted.
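The expectation above is easy to check numerically; here is a quick Monte Carlo sketch (the sizes and sample count are arbitrary, and `G` simply stands in for the full gradient):

```python
import numpy as np

n, m, steps = 16, 12, 100_000
rng = np.random.default_rng(0)
G = rng.normal(size=(n, m))        # stands in for the full gradient dL/dW_t

acc = np.zeros_like(G)
for _ in range(steps):
    u = rng.normal(size=(n, 1))    # random replacement for A_t
    v = rng.normal(size=(1, m))    # random replacement for B_t
    acc += G @ v.T @ v + u @ u.T @ G

# The running mean approaches 2 * G as the number of samples grows,
# i.e. the random-projection update matches full-rank SGD in expectation.
print(np.abs(acc / steps - 2 * G).max())
```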
Proposed Variant#
Again, first consider only $r=1$. LoRA assumes $\Delta w_{i,j} = u_i v_j$. Can we consider other low-rank decomposition assumptions? For example, $\Delta w_{i,j} = u_i + v_j$? In matrix form:

$$W = W_0 + A\,\mathbb{1}_{1\times m} + \mathbb{1}_{n\times 1}\,B$$
where $\mathbb{1}_{1\times m},\mathbb{1}_{n\times 1}$ denote the $1\times m$ and $n\times 1$ all-ones matrices, respectively, and here $A\in\mathbb{R}^{n\times 1}$, $B\in\mathbb{R}^{1\times m}$. Its gradients are easily derived:

$$\frac{\partial \mathcal{L}}{\partial A}=\frac{\partial \mathcal{L}}{\partial W}\,\mathbb{1}_{1\times m}^{\top},\qquad \frac{\partial \mathcal{L}}{\partial B}=\mathbb{1}_{n\times 1}^{\top}\,\frac{\partial \mathcal{L}}{\partial W}$$
Essentially, row sums and column sums of the original gradient. Compared to original LoRA, this additive decomposition has two advantages: 1. Addition is computationally cheaper than multiplication, and gradient forms are simpler; 2. $AB$ necessarily has rank 1, but $A \mathbb{1}_{1\times m} + \mathbb{1}_{n\times 1} B$ could have rank up to 2. If rank represents model capacity, this means with the same parameter count, additive decomposition might offer stronger expressive power. As for actual performance, the author will conduct comparative experiments when using LoRA in the future.
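For concreteness, here is a numpy sketch of the additive variant at $r=1$, again with a toy loss $\mathcal{L}=\sum Y$ and my own naming; the gradients reduce to row and column sums of the full gradient:

```python
import numpy as np

n, m, b = 64, 48, 4
rng = np.random.default_rng(0)
W0 = rng.normal(size=(n, m))
A = np.zeros((n, 1))               # one offset per row of W
B = np.zeros((1, m))               # one offset per column of W
X = rng.normal(size=(b, n))

ones_1m = np.ones((1, m))
ones_n1 = np.ones((n, 1))

# Additive parameterization: Delta w_ij = A_i + B_j.
W = W0 + A @ ones_1m + ones_n1 @ B
Y = X @ W

# For the toy loss L = sum(Y), the full gradient and its two projections:
dW = X.T @ np.ones((b, m))         # n x m full gradient
dA = dW @ ones_1m.T                # n x 1: row sums of dW
dB = ones_n1.T @ dW                # 1 x m: column sums of dW
```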
Can the additive decomposition generalize to $r > 1$? Naturally, but with some nuance. Assume $m,n$ are divisible by $r$. We modify the parameterization to:

$$W = W_0 + A\,I_{r(1\times m/r)} + I_{r(n/r\times 1)}\,B$$

with $A\in\mathbb{R}^{n\times r}$, $B\in\mathbb{R}^{r\times m}$.
Here, $I_{r(1\times m/r)}$, $I_{r(n/r\times 1)}$ denote block matrices of size $1\times m/r$ and $n/r\times 1$, respectively, where each block is an $r\times r$ identity matrix. In essence, this treats $A,B$ as block matrices of size $n/r\times 1$ and $1\times m/r$, then applies the $r=1$ approach.
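Constructing these block-identity matrices is straightforward; a short illustrative sketch, assuming $r$ divides both $n$ and $m$, with all names my own:

```python
import numpy as np

n, m, r = 12, 8, 4                 # r must divide both n and m
rng = np.random.default_rng(0)
W0 = rng.normal(size=(n, m))
A = rng.normal(size=(n, r)) * 0.01
B = np.zeros((r, m))

# I_{r(1 x m/r)}: a single row of (m/r) identity blocks of size r x r  -> shape r x m.
I_row = np.tile(np.eye(r), (1, m // r))
# I_{r(n/r x 1)}: a single column of (n/r) identity blocks of size r x r -> shape n x r.
I_col = np.tile(np.eye(r), (n // r, 1))

# Additive decomposition generalized to r > 1.
W = W0 + A @ I_row + I_col @ B
assert W.shape == (n, m)
```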
Summary#
This article introduced an understanding of LoRA from the gradient perspective. Beyond the basic introduction, it includes the author's conjectures and extensions, offered for readers' reference.
Original Article: Su Jianlin. Gradient Perspective on LoRA: Introduction, Analysis, Conjectures, and Extensions. Scientific Spaces.