LoRA Background

LoRA (Low-Rank Adaptation) represents one of the parameter-efficient fine-tuning methods for Large Language Models (LLMs). We previously discussed LoRA from a gradient perspective in "LoRA from a Gradient Perspective: Introduction, Analysis, Conjectures, and Extensions". In this article, we explore a new finding regarding LoRA:

By assigning different learning rates to the two matrices in LoRA, the effectiveness of LoRA can be further enhanced.

This conclusion originates from the recent paper "LoRA+: Efficient Low Rank Adaptation of Large Models" (hereinafter referred to as "LoRA+"). At first glance, this finding may not appear particularly remarkable, as introducing different learning rates essentially adds new hyperparameters, and typically any carefully tuned hyperparameter can lead to improvements. What distinguishes "LoRA+" is that it theoretically justifies this necessity and asserts that the optimal solution invariably requires the learning rate of the right matrix to exceed that of the left matrix. In summary, "LoRA+" serves as a classic example where theoretical guidance in training proves effective in practice, warranting careful examination.

Conclusion Analysis#

Suppose the pre-trained parameters are represented by $W_0 \in \mathbb{R}^{n\times m}$. If full-parameter fine-tuning were employed, the update would also be an $n\times m$ matrix. To reduce the number of parameters, LoRA constrains the update to a low-rank matrix by setting $W = W_0 + AB$, where $A \in \mathbb{R}^{n\times r}$, $B \in \mathbb{R}^{r\times m}$, and $r \ll \min(n, m)$. The model's original parameters are then replaced with the new $W$, while $W_0$ remains fixed during training, with only $A$ and $B$ being updated, as illustrated below:

\[ \underbrace{W_0}_{\in\,\mathbb{R}^{n\times m}} \;+\; \underbrace{A}_{\in\,\mathbb{R}^{n\times r}}\,\underbrace{B}_{\in\,\mathbb{R}^{r\times m}} \]

Note that LoRA is typically applied to dense layers. The original paper's analysis writes the weight as acting on the input from the left (i.e., $Wx$), whereas implementations generally multiply the input by the weight on the right (i.e., $XW$). To avoid confusion, this article aligns its notation with the implementation: assuming the layer input is $X \in \mathbb{R}^{b\times n}$, the layer operation is $XW = X(W_0 + AB)$. Since the "LoRA+" conclusion is independent of the pre-trained weights, we can without loss of generality set $W_0 = 0$, simplifying the layer operation to $Y = XAB \in \mathbb{R}^{b\times m}$.
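As a concrete reference for this notation, here is a minimal NumPy sketch of the parametrization. The sizes $b, n, r, m$ and the scale of $W_0$ are arbitrary illustrative choices, not anything prescribed by LoRA or by this article:

```python
import numpy as np

# Illustrative sizes only: batch b, input width n, rank r, output width m
b, n, r, m = 4, 1024, 8, 1024

X  = np.random.randn(b, n)                 # layer input
W0 = np.random.randn(n, m) / np.sqrt(n)    # frozen pre-trained weight (arbitrary scale for the sketch)
A  = np.random.randn(n, r) / np.sqrt(n)    # trainable low-rank factor
B  = np.zeros((r, m))                      # zero-initialized, so initially AB = 0

Y = X @ (W0 + A @ B)                       # LoRA forward pass: equals X @ W0 at initialization
assert np.allclose(Y, X @ W0)
```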

The conclusion of "LoRA+" is:

To maximize LoRA's effectiveness and approximate the optimal solution as closely as possible, the learning rate for weight $B$ should exceed that for weight $A$.

Note that to ensure the initial model is equivalent to the original pre-trained model, LoRA typically initializes either $A$ or $B$ entirely to zero. Initially, the author assumed this conclusion might depend on which matrix is zero-initialized, but careful reading reveals that the conclusion claimed by "LoRA+" is independent of zero initialization. In other words, although $A$ and $B$ appear symmetric superficially, they possess inherent asymmetry such that regardless of whether $A$ or $B$ is zero-initialized, the conclusion remains that $B$'s learning rate should exceed $A$'s. This makes the finding particularly interesting.

However, it must be noted that the original "LoRA+" paper's explanation is rather convoluted, so what follows is the author's simplified derivation. Essentially, it rests on two assumptions:

1. Numerical Stability: The output values of each layer in the model should be numerically stable and independent of network width.

2. Equal Contribution: For optimal LoRA performance, matrices $A$ and $B$ should contribute equally to the model's effectiveness.

We now analyze and quantify these two assumptions one by one.

Numerical Stability#


First, numerical stability means that each component of $X$, $XA$, and $XAB$ should be $\mathcal{O}(1)$ in magnitude, independent of the network widths $n$ and $m$. Here, $\mathcal{O}(1)$ denotes only the absence of dependence on network width, not that the absolute values are necessarily close to 1. This assumption should be uncontroversial, as it is difficult to imagine a numerically unstable network achieving good predictive performance. However, some readers might question the necessity of "$XA$ being $\mathcal{O}(1)$": since $X$ is the input and $XAB$ is the output, requiring their numerical stability seems reasonable, but $XA$ is merely an intermediate variable. Must it also be numerically stable?

From the perspective of forward propagation alone, numerical stability of $XA$ is indeed not strictly necessary. However, if $XA$ is numerically unstable while $XAB$ remains stable, two scenarios arise: if $XA$ has large magnitude and $B$ has small magnitude, then by the chain rule $A$'s gradients will be small and $B$'s gradients large; conversely, if $XA$ has small magnitude and $B$ has large magnitude, then $A$'s gradients will be large and $B$'s gradients small. Either way, numerical instability in $XA$ leads to unstable gradients for $A$ and $B$, increasing the difficulty of optimization. Thus it is preferable to also impose numerical stability on $XA$ as a condition.

This numerical stability condition recalls "LeCun initialization", which states that if $W \in \mathbb{R}^{n\times m}$ is sampled i.i.d. from a distribution with mean 0 and variance $1/n$, then the magnitude of each component of $XW$ is roughly the same as that of $X$'s components. Following the same principle, if the input $X$ is already $\mathcal{O}(1)$, then to keep the components of $XA$ and $XAB$ at $\mathcal{O}(1)$, $A$ and $B$ should be initialized with variances $1/n$ and $1/r$, respectively (means are zero throughout and will not be restated).

As mentioned earlier, LoRA typically zero-initializes either $A$ or $B$ to maintain identity initialization, but this detail is less critical. We only need to recognize that variances $1/n$ and $1/r$ can keep $XA$ and $XAB$ numerically stable, suggesting that after training, $A$ and $B$ likely approximate these variances. Given $r \ll n$, this implies that $A$'s component magnitudes will be significantly smaller than $B$'s component magnitudes, which is the source of asymmetry between $A$ and $B$.

Equal Contribution#


Next, we examine the second assumption: $A$ and $B$ should contribute equally to model effectiveness. This assumption appears reasonable since in LLM+LoRA scenarios, typically $m = n$, meaning $A$ and $B$ have the same number of parameters, so equal contribution seems fair. If $m \neq n$, we can generalize this assumption to state that contribution is proportional to parameter count. The most fundamental metric for effectiveness is naturally the loss function, denoted here as $\mathcal{L}$.

We aim to measure the change in loss function when $A \to A + \Delta A$ and $B \to B + \Delta B$:

(1) \[ \mathcal{L}(A+\Delta A,B+\Delta B) - \mathcal{L}(A,B)\approx \left\langle \frac{\partial\mathcal{L}}{\partial A},\Delta A\right\rangle + \left\langle \frac{\partial\mathcal{L}}{\partial B},\Delta B\right\rangle \]

Here, a first-order linear approximation is used, where $\frac{\partial\mathcal{L}}{\partial A}$ and $\frac{\partial\mathcal{L}}{\partial B}$ are the gradients with respect to $A$ and $B$, and $\langle\cdot,\cdot\rangle$ denotes the Frobenius inner product. The two terms on the right can be interpreted as the respective contributions of $A$ and $B$ to effectiveness. However, note that the validity of the linear approximation depends on the increments $\Delta A$ and $\Delta B$ being small; for a fully trained model, the total increments accumulated over training are not necessarily small relative to the original weights. Therefore, we refine the "equal contribution" assumption to: "$A$ and $B$ should contribute equally to effectiveness in each update step." Since single-step updates are typically small, the linear approximation then holds reasonably well.
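The following sketch illustrates how well the first-order approximation of Equation (1) holds for small increments. The sizes and the quadratic stand-in loss are hypothetical choices for the demonstration, not anything from the original paper:

```python
import torch

torch.manual_seed(0)
b, n, r, m = 2, 16, 2, 16                    # hypothetical small sizes
X = torch.randn(b, n)
A = torch.nn.Parameter(torch.randn(n, r) / n ** 0.5)
B = torch.nn.Parameter(torch.randn(r, m) / r ** 0.5)

def loss(A, B):
    return (X @ A @ B).pow(2).mean()          # stand-in loss; any smooth loss works

loss(A, B).backward()                         # populates A.grad and B.grad
dA = 1e-3 * torch.randn_like(A)               # small increments
dB = 1e-3 * torch.randn_like(B)

exact  = loss(A + dA, B + dB) - loss(A, B)
linear = (A.grad * dA).sum() + (B.grad * dB).sum()   # the Frobenius inner products of Eq. (1)
print(float(exact), float(linear))            # nearly equal for small increments
```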

Considering each step's update naturally leads us to optimizer considerations. Currently, Adam is the mainstream optimizer for both pre-training and fine-tuning. Adam maintains two sets of moving average states with corresponding hyperparameters $\beta_1$ and $\beta_2$, making precise analysis challenging. However, for the purposes of this article, we only require an order-of-magnitude estimate. Thus, we consider an extreme case and assume it yields the same order-of-magnitude result as the general case. This case is $\beta_1 = \beta_2 = 0$, where Adam reduces to SignSGD:

(2) \[ \Delta A = -\eta_A\,\text{sign}\left(\frac{\partial\mathcal{L}}{\partial A}\right),\quad\Delta B = -\eta_B\,\text{sign}\left(\frac{\partial\mathcal{L}}{\partial B}\right) \]

where $\eta_A$ and $\eta_B$ are respective learning rates. The "LoRA+" conclusion is that $\eta_B \gg \eta_A$.
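As a quick sanity check of this degenerate case (a sketch, not part of the original paper), PyTorch's Adam with `betas=(0, 0)` and a tiny `eps` does reduce to the sign update of Equation (2), and different learning rates can be assigned to $A$ and $B$ via parameter groups:

```python
import torch

torch.manual_seed(0)
A = torch.nn.Parameter(torch.randn(16, 4))
B = torch.nn.Parameter(torch.randn(4, 16))
X = torch.randn(2, 16)

eta_A, eta_B = 1e-4, 1.6e-3                    # eta_B > eta_A, in the spirit of LoRA+
opt = torch.optim.Adam([{"params": [A], "lr": eta_A},
                        {"params": [B], "lr": eta_B}],
                       betas=(0.0, 0.0), eps=1e-12)

(X @ A @ B).pow(2).mean().backward()           # stand-in loss, just to produce gradients
A0, B0 = A.detach().clone(), B.detach().clone()
opt.step()

# Each step is (almost exactly) -eta * sign(gradient), as in Equation (2)
print(torch.allclose(A.detach() - A0, -eta_A * A.grad.sign(), atol=1e-6))
print(torch.allclose(B.detach() - B0, -eta_B * B.grad.sign(), atol=1e-6))
```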

Substituting SignSGD increments from Equation (2) into Equation (1) yields:

(3) \[ \mathcal{L}(A+\Delta A,B+\Delta B) - \mathcal{L}(A,B)\approx \underbrace{-\,\eta_A \left\Vert\frac{\partial\mathcal{L}}{\partial A}\right\Vert_1}_{\Delta \mathcal{L}_A}\,\underbrace{-\,\eta_B \left\Vert \frac{\partial\mathcal{L}}{\partial B}\right\Vert_1}_{\Delta \mathcal{L}_B} \]

Here, $\Vert\cdot\Vert_1$ is the $L_1$ norm, i.e., the sum of absolute values of all components. "Equal contribution" means $\Delta \mathcal{L}_A$ and $\Delta \mathcal{L}_B$ should be of the same order of magnitude.
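A numeric illustration of the imbalance that follows (a sketch: the loss is a linear stand-in with an $\mathcal{O}(1)$ surrogate for $\frac{\partial\mathcal{L}}{\partial Y}$, and $A$, $B$ are drawn with the variances from the previous section): with equal learning rates, the two terms of Equation (3) differ by roughly $\sqrt{n/r}$.

```python
import torch

torch.manual_seed(0)
b, n, r, m = 4, 4096, 8, 4096
X = torch.randn(b, n)
A = torch.nn.Parameter(torch.randn(n, r) * (1 / n) ** 0.5)   # variance 1/n
B = torch.nn.Parameter(torch.randn(r, m) * (1 / r) ** 0.5)   # variance 1/r

C = torch.randn(b, m)                       # O(1) stand-in for dL/dY
loss = (X @ A @ B * C).sum()                # linear stand-in loss; only the scaling matters
loss.backward()

eta = 1e-4                                  # the same learning rate for both matrices
dL_A = eta * A.grad.abs().sum()             # |Delta L_A| from Equation (3)
dL_B = eta * B.grad.abs().sum()             # |Delta L_B| from Equation (3)
print(float(dL_A / dL_B))                   # on the order of sqrt(n / r) ≈ 23: A's term dominates
```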

Quick Derivation#


Further analysis requires explicit gradient expressions. Setting $Y = XAB$ again, we can compute:

(4) \[ \frac{\partial \mathcal{L}}{\partial A} = X^{\top}\frac{\partial \mathcal{L}}{\partial Y}B^{\top},\quad \frac{\partial \mathcal{L}}{\partial B} = A^{\top} X^{\top}\frac{\partial \mathcal{L}}{\partial Y} \]

Readers unfamiliar with matrix calculus might find these derivations confusing; the author admits the same. However, a simple technique can be employed. For $\frac{\partial \mathcal{L}}{\partial A}$, we know it is an $n\times r$ matrix (same shape as $A$). Similarly, $\frac{\partial \mathcal{L}}{\partial Y}$ is a $b\times m$ matrix. According to the chain rule, $\frac{\partial \mathcal{L}}{\partial A}$ should be a product of $\frac{\partial \mathcal{L}}{\partial Y}$, $X$, and $B$. We simply consider how these three matrices multiply to yield an $n\times r$ matrix according to matrix multiplication rules.
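The formulas can also be checked mechanically against autograd; the following sketch (arbitrary small sizes, a stand-in loss) confirms Equation (4):

```python
import torch

torch.manual_seed(0)
b, n, r, m = 2, 8, 2, 8                         # arbitrary small sizes
X = torch.randn(b, n)
A = torch.nn.Parameter(torch.randn(n, r))
B = torch.nn.Parameter(torch.randn(r, m))

Y = X @ A @ B
Y.pow(2).sum().backward()                       # stand-in loss L = sum(Y^2)
dL_dY = 2 * Y.detach()                          # its gradient with respect to Y

print(torch.allclose(A.grad, X.T @ dL_dY @ B.detach().T, atol=1e-4))   # X^T (dL/dY) B^T
print(torch.allclose(B.grad, A.detach().T @ X.T @ dL_dY, atol=1e-4))   # A^T X^T (dL/dY)
```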

With explicit forms of $\frac{\partial \mathcal{L}}{\partial A}$ and $\frac{\partial \mathcal{L}}{\partial B}$, we have a quick way to understand LoRA+. First, $\Delta \mathcal{L}_A$ is proportional to $\left\Vert\frac{\partial\mathcal{L}}{\partial A}\right\Vert_1$, which is the sum of absolute values of $nr$ components. Assuming each component is comparable, $\Delta \mathcal{L}_A$ roughly scales with $nr$. Then, $\frac{\partial\mathcal{L}}{\partial A}$ is linear in $B$, so each component magnitude roughly scales with $B$'s component magnitude. Combined, $\Delta \mathcal{L}_A$ scales with both $nr$ and $B$'s magnitude. Similarly, $\Delta \mathcal{L}_B$ roughly scales with both $mr$ and $A$'s magnitude. As discussed in the Numerical Stability section, for forward numerical stability, $B$'s magnitude should exceed $A$'s magnitude (proportional to their approximate standard deviations $\sqrt{1/r}$ and $\sqrt{1/n}$, respectively). To ensure $\Delta \mathcal{L}_A$ and $\Delta \mathcal{L}_B$ are comparable, we have the approximate relation:

(5) \[ \eta_A \times nr \times \sqrt{1/r} \approx \eta_B \times mr \times \sqrt{1/n}\quad\Rightarrow\quad \frac{\eta_B}{\eta_A} \approx \frac{n}{m}\sqrt{\frac{n}{r}} \]

Considering practical usage often has $m = n$ and $r = \mathcal{O}(1)$, we can simply write:

(6) \[ \frac{\eta_B}{\eta_A} = \mathcal{O}(\sqrt{n}) \]

But we are not done yet—we must check consistency since one condition we used, "forward numerical stability," remains an ideal assumption. How can we make this assumption as plausible as possible? We combat one assumption with another:

In the Adam optimizer, if the learning rate ratio between two parameters is $\lambda$, then after prolonged training, the magnitude ratio of these two parameters will also be approximately $\lambda$.

According to Adam's approximation in Equation (2), each step's increment magnitude indeed scales with learning rate. However, the cumulative update result is not simply a sum of individual steps, so this assumption feels "somewhat plausible but not entirely convincing." Nevertheless, assumptions often have this character—somewhat plausible is sufficient, and the rest relies on faith. Under this assumption, if we train with $\frac{\eta_B}{\eta_A} = \mathcal{O}(\sqrt{n})$, then the magnitude ratio of $B$ to $A$ will also be $\mathcal{O}(\sqrt{n})$. Earlier we expected them to have approximate standard deviations $\sqrt{1/r}$ and $\sqrt{1/n}$, whose ratio is precisely $\mathcal{O}(\sqrt{n})$—perfectly consistent!
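An idealized cartoon of this assumption (purely illustrative: gradient signs are replaced by random $\pm 1$, which is not how real training behaves): under SignSGD every coordinate moves by a fixed step size $\eta$ each update, so whatever the signs do, the accumulated magnitude is proportional to $\eta$, and the magnitude ratio between two parameters tracks their learning-rate ratio.

```python
import torch

torch.manual_seed(0)
T, dim = 10_000, 1_000                         # number of steps and of coordinates
eta_A, eta_B = 1e-4, 1.6e-3                    # learning-rate ratio 16

walk_A = (eta_A * torch.randn(T, dim).sign()).sum(0)   # T SignSGD steps with random signs
walk_B = (eta_B * torch.randn(T, dim).sign()).sum(0)

print(float(walk_B.abs().mean() / walk_A.abs().mean()))  # ≈ 16, the learning-rate ratio
```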

The original paper's result differs slightly from the above, giving $\mathcal{O}(n)$. This discrepancy arises because the original paper considers $\Delta A$ and $\Delta B$ contributing equally to $Y$, but $Y$ is merely the layer output rather than the final effectiveness metric, which makes this approach questionable. Although the original paper attempts to link the increment of $Y$ to the increment of $\mathcal{L}$, it does not carry the computation through, leading to the deviation. Additionally, the original derivation in principle applies only to the special case $b=1, r=1, m=n$, with the $b > 1, r > 1$ cases extended directly from it, so the analysis lacks generality.

Of course, whether it is $\mathcal{O}(n)$ or $\mathcal{O}(\sqrt{n})$ is not critically important—practical tuning is still required. However, LoRA+ conducted experiments on models of various sizes with $r$ typically 8 and $n$ ranging from 768 to 4096, ultimately recommending a default learning rate ratio of $2^4 = 16$, which aligns with $\sqrt{n/r}$. Thus, the optimal value is closer to $\mathcal{O}(\sqrt{n})$ than $\mathcal{O}(n)$.
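Two small sketches of these numbers: checking where the default ratio 16 sits relative to $\sqrt{n/r}$ for the widths quoted above, and a hedged example of how such a ratio might be wired up through Adam parameter groups (the variable names and base learning rate are illustrative, not taken from the LoRA+ code):

```python
import math
import torch

# sqrt(n / r) for r = 8 and the widths mentioned in the text
for n in (768, 1024, 2048, 4096):
    print(n, round(math.sqrt(n / 8), 1))       # ≈ 9.8, 11.3, 16.0, 22.6; the default 16 lies in this range

# Applying a fixed ratio via two Adam parameter groups (illustrative sketch)
n, r, m = 4096, 8, 4096
A = torch.nn.Parameter(torch.randn(n, r) / math.sqrt(n))
B = torch.nn.Parameter(torch.zeros(r, m))
base_lr, ratio = 1e-4, 16
opt = torch.optim.Adam([{"params": [A], "lr": base_lr},
                        {"params": [B], "lr": base_lr * ratio}])
```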

Article Summary#


In this article, we introduced and derived a result known as "LoRA+", which supports the inherent asymmetry between the two low-rank matrices $A$ and $B$ in LoRA. Regardless of which matrix is zero-initialized, $B$'s learning rate should be set larger than $A$'s to achieve better performance.

The theoretical analysis combines considerations of numerical stability during forward propagation with the principle of equal contribution during optimization. The derived optimal learning rate ratio $\frac{\eta_B}{\eta_A} \approx \frac{n}{m}\sqrt{\frac{n}{r}}$ provides practical guidance for LoRA fine-tuning, with empirical experiments supporting a default ratio around 16 for typical model configurations.

This work exemplifies how theoretical insights can guide practical optimization strategies in deep learning, demonstrating that even simple modifications like adjusting individual learning rates for different parameter groups can yield measurable improvements when properly justified by analysis.

Citation Information#

Original Article: Su Jianlin. Different Learning Rates for LoRA: Can LoRA Improve Further? Scientific Spaces.

How to cite this translation:

Su, J. Different Learning Rates for LoRA: Can LoRA Improve Further? [Translated by Juanxi Tian]. Scientific Spaces.

BibTeX:

@article{su2025lora_different_learning_rates,
  title   = {Different Learning Rates for LoRA: Can LoRA Improve Further?},
  author  = {Su, Jianlin},
  journal = {Scientific Spaces},
  year    = {2025},
  url     = {https://kexue.fm/archives/11216},
  note    = {Translated by Juanxi Tian (ScalingOpt Team)}
}