In the article "Appreciation of the Muon Optimizer: From Vector to Matrix Essential Leap", we introduced a new optimizer named "Muon," which can be understood as steepest gradient descent under spectral norm regularization. This perspective seems to point to a more fundamental update direction for matrix parameters. As is well known, matrix parameters are usually regularized with weight decay, which can be interpreted as the gradient of the squared Frobenius norm. From the Muon perspective, could a new weight decay constructed from the gradient of the squared spectral norm yield better results?
This raises the question: what does the gradient or derivative of the spectral norm look like? And what would the new weight decay designed from it look like? In the following sections, we explore these questions in detail.
Fundamentals
The spectral norm, also known as the "2-norm," is one of the most commonly used matrix norms. Compared with the simpler Frobenius norm, it often captures more essential properties related to matrix multiplication, because its definition is inherently tied to matrix multiplication: for a matrix parameter $\boldsymbol{W}\in\mathbb{R}^{n\times m}$, the spectral norm is defined as
\begin{equation}\Vert\boldsymbol{W}\Vert_2 = \max_{\boldsymbol{x}\neq \boldsymbol{0}} \frac{\Vert\boldsymbol{W}\boldsymbol{x}\Vert}{\Vert\boldsymbol{x}\Vert}\end{equation}
Here, $\boldsymbol{x}\in\mathbb{R}^m$ is a column vector, and $\Vert\cdot\Vert$ on the right-hand side denotes the vector (Euclidean) norm. From another perspective, the spectral norm is the smallest constant $C$ such that the inequality
\begin{equation}\Vert\boldsymbol{W}\boldsymbol{x}\Vert \leq C\,\Vert\boldsymbol{x}\Vert\end{equation}
holds for all $\boldsymbol{x}\in\mathbb{R}^m$.
It is not difficult to show that the inequality also holds when $C$ is taken to be the Frobenius norm $\Vert \boldsymbol{W}\Vert_F$. Therefore, $\Vert \boldsymbol{W}\Vert_2\leq \Vert \boldsymbol{W}\Vert_F$ (since $\Vert \boldsymbol{W}\Vert_F$ is merely one valid choice of $C$, while $\Vert \boldsymbol{W}\Vert_2$ is the smallest such $C$). This also indicates that if we want to control the output magnitude, the spectral norm is a more precise regularization term than the Frobenius norm.
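As a quick numerical sanity check of this chain of inequalities, a few lines of NumPy (an illustrative sketch, not part of the original derivation) confirm that $\Vert\boldsymbol{W}\boldsymbol{x}\Vert/\Vert\boldsymbol{x}\Vert \leq \Vert\boldsymbol{W}\Vert_2 \leq \Vert\boldsymbol{W}\Vert_F$ for a random matrix and vector:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((5, 3))   # a random 5x3 matrix
x = rng.standard_normal(3)        # a random input vector

spectral = np.linalg.norm(W, 2)       # spectral norm = largest singular value
frobenius = np.linalg.norm(W, "fro")  # Frobenius norm
ratio = np.linalg.norm(W @ x) / np.linalg.norm(x)

print(ratio <= spectral <= frobenius)  # expected: True
```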
Gradient Derivation
Now let us get to the main topic and derive the gradient of the spectral norm, $\nabla_{\boldsymbol{W}} \Vert\boldsymbol{W}\Vert_2$. Numerically, the spectral norm equals the largest singular value of $\boldsymbol{W}$; we proved this in the "Matrix Norms" section of "Low-Rank Approximation Path (II): SVD". This means that if $\boldsymbol{W}$ has the SVD $\boldsymbol{W}=\sum\limits_{i=1}^{\min(n,m)}\sigma_i \boldsymbol{u}_i\boldsymbol{v}_i^{\top}$, then
\begin{equation}\Vert\boldsymbol{W}\Vert_2 = \sigma_1 = \boldsymbol{u}_1^{\top}\boldsymbol{W}\boldsymbol{v}_1\end{equation}
where $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_{\min(n,m)} \geq 0$ are the singular values of $\boldsymbol{W}$. Taking differentials on both sides, we obtain
\begin{equation}d\Vert\boldsymbol{W}\Vert_2 = d\boldsymbol{u}_1^{\top}\boldsymbol{W}\boldsymbol{v}_1 + \boldsymbol{u}_1^{\top}d\boldsymbol{W}\,\boldsymbol{v}_1 + \boldsymbol{u}_1^{\top}\boldsymbol{W}\,d\boldsymbol{v}_1\end{equation}
Observing that $\boldsymbol{W}\boldsymbol{v}_1=\sigma_1\boldsymbol{u}_1$ and $\boldsymbol{u}_1^{\top}\boldsymbol{u}_1=1$, we have
\begin{equation}d\boldsymbol{u}_1^{\top}\boldsymbol{W}\boldsymbol{v}_1 = \sigma_1\, d\boldsymbol{u}_1^{\top}\boldsymbol{u}_1 = \frac{\sigma_1}{2}\, d(\boldsymbol{u}_1^{\top}\boldsymbol{u}_1) = \frac{\sigma_1}{2}\, d(1) = 0\end{equation}
Similarly, $\boldsymbol{u}_1^{\top}\boldsymbol{W}\,d\boldsymbol{v}_1=0$. Therefore
\begin{equation}d\Vert\boldsymbol{W}\Vert_2 = \boldsymbol{u}_1^{\top}d\boldsymbol{W}\,\boldsymbol{v}_1 = \operatorname{tr}(\boldsymbol{v}_1\boldsymbol{u}_1^{\top}d\boldsymbol{W})\qquad\Rightarrow\qquad \nabla_{\boldsymbol{W}}\Vert\boldsymbol{W}\Vert_2 = \boldsymbol{u}_1\boldsymbol{v}_1^{\top}\end{equation}
(Note: This proof process references a response on Stack Exchange, but that answer did not prove $d\boldsymbol{u}_1^{\top}\boldsymbol{W}\boldsymbol{v}_1=0$ and $\boldsymbol{u}_1^{\top}\boldsymbol{W}d\boldsymbol{v}_1=0$; this portion has been completed by the author.)
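As a quick numerical check of the result $\nabla_{\boldsymbol{W}}\Vert\boldsymbol{W}\Vert_2 = \boldsymbol{u}_1\boldsymbol{v}_1^{\top}$, one can differentiate the largest singular value with an autodiff library and compare against the explicit rank-1 outer product; the PyTorch snippet below is an illustrative sketch along these lines:

```python
import torch

torch.manual_seed(0)
W = torch.randn(5, 3, requires_grad=True)

# Spectral norm as the largest singular value; backprop gives its gradient w.r.t. W.
sigma1 = torch.linalg.svdvals(W)[0]
sigma1.backward()

# Explicit u_1 v_1^T from the SVD; the simultaneous sign ambiguity of u_1 and v_1
# cancels in the outer product.
U, S, Vh = torch.linalg.svd(W.detach(), full_matrices=False)
rank1 = torch.outer(U[:, 0], Vh[0])

print(torch.allclose(W.grad, rank1, atol=1e-5))  # expected: True
```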
Weight Decay
Based on this result and the chain rule, we have
\begin{equation}\nabla_{\boldsymbol{W}}\Vert\boldsymbol{W}\Vert_2^2 = 2\Vert\boldsymbol{W}\Vert_2\,\nabla_{\boldsymbol{W}}\Vert\boldsymbol{W}\Vert_2 = 2\sigma_1 \boldsymbol{u}_1\boldsymbol{v}_1^{\top}\end{equation}
For comparison, the corresponding result for the Frobenius norm is
\begin{equation}\nabla_{\boldsymbol{W}}\Vert\boldsymbol{W}\Vert_F^2 = 2\boldsymbol{W} = 2\sum_{i=1}^{\min(n,m)}\sigma_i \boldsymbol{u}_i\boldsymbol{v}_i^{\top}\end{equation}
Placing the two side by side reveals a clear distinction: weight decay derived from the squared Frobenius norm penalizes all singular values simultaneously, whereas weight decay derived from the squared spectral norm penalizes only the largest singular value. If our goal is to control the output magnitude, penalizing only the largest singular value is "just right," while penalizing all singular values achieves a similar effect but may also reduce the parameters' expressive capacity.
According to the Eckart-Young-Mirsky theorem, the gradient direction $\sigma_1 \boldsymbol{u}_1\boldsymbol{v}_1^{\top}$ obtained above has another interpretation: it is the optimal rank-1 approximation of $\boldsymbol{W}$. In other words, spectral norm-based weight decay replaces the usual step of subtracting a multiple of the matrix itself with subtracting its optimal rank-1 approximation, weakening the penalty strength while making the penalty more "essential."
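To make this concrete, here is a small NumPy illustration (my own sketch, with an arbitrary coefficient): one decay step $\boldsymbol{W}-\eta\lambda\,\sigma_1\boldsymbol{u}_1\boldsymbol{v}_1^{\top}$ shrinks only the largest singular value, whereas the Frobenius-style step $(1-\eta\lambda)\boldsymbol{W}$ shrinks all of them:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((6, 4))
step = 0.1  # stands in for the product of learning rate and decay coefficient

U, S, Vt = np.linalg.svd(W, full_matrices=False)
rank1 = S[0] * np.outer(U[:, 0], Vt[0])   # optimal rank-1 approximation of W

W_spectral = W - step * rank1             # spectral-norm decay: only sigma_1 shrinks
W_frobenius = (1 - step) * W              # Frobenius-norm decay: every sigma_i shrinks

print(np.linalg.svd(W, compute_uv=False))           # original singular values
print(np.linalg.svd(W_spectral, compute_uv=False))  # only the first one is smaller
print(np.linalg.svd(W_frobenius, compute_uv=False)) # all scaled by (1 - step)
```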
Numerical Computation
For practical implementation, the most crucial question arises: how do we compute $\sigma_1 \boldsymbol{u}_1 \boldsymbol{v}_1^{\top}$? SVD is certainly the most straightforward solution, but its computational complexity is undoubtedly the highest; we must find more efficient computational approaches.
Without loss of generality, assume $n\geq m$. First, since $\boldsymbol{W}\boldsymbol{v}_1 = \sigma_1\boldsymbol{u}_1$, observe that
\begin{equation}\sigma_1 \boldsymbol{u}_1 \boldsymbol{v}_1^{\top} = \boldsymbol{W}\boldsymbol{v}_1\boldsymbol{v}_1^{\top}\end{equation}
This shows that computing $\sigma_1 \boldsymbol{u}_1 \boldsymbol{v}_1^{\top}$ only requires knowledge of $\boldsymbol{v}_1$. According to our discussion in "Low-Rank Approximation Path (II): SVD", $\boldsymbol{v}_1$ is actually the eigenvector corresponding to the maximum eigenvalue of matrix $\boldsymbol{W}^{\top}\boldsymbol{W}$. Thus, we transform the problem from SVD of general matrix $\boldsymbol{W}$ to eigenvalue decomposition of real symmetric matrix $\boldsymbol{W}^{\top}\boldsymbol{W}$, which actually reduces complexity since eigenvalue decomposition is typically significantly faster than SVD.
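A brief numerical check of this equivalence (again just a sketch): the eigenvector of $\boldsymbol{W}^{\top}\boldsymbol{W}$ with the largest eigenvalue matches $\boldsymbol{v}_1$ from the SVD, up to an overall sign that is irrelevant for $\boldsymbol{v}_1\boldsymbol{v}_1^{\top}$:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((6, 4))

# v_1 from the SVD of W
_, _, Vt = np.linalg.svd(W, full_matrices=False)
v1_svd = Vt[0]

# v_1 as the top eigenvector of the real symmetric matrix W^T W
eigvals, eigvecs = np.linalg.eigh(W.T @ W)  # eigenvalues returned in ascending order
v1_eig = eigvecs[:, -1]

# Compare the rank-1 projectors, which are insensitive to the sign of v_1.
print(np.allclose(np.outer(v1_svd, v1_svd), np.outer(v1_eig, v1_eig), atol=1e-8))
```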
If this is still too slow, we can turn to the principle underlying many eigenvalue decomposition algorithms, namely power iteration:
When $\sigma_1 > \sigma_2$, the iteration \begin{equation}\boldsymbol{x}_{t+1} = \frac{\boldsymbol{W}^{\top}\boldsymbol{W}\boldsymbol{x}_t}{\Vert\boldsymbol{W}^{\top}\boldsymbol{W}\boldsymbol{x}_t\Vert}\end{equation} converges to $\boldsymbol{v}_1$ at a rate of $(\sigma_2/\sigma_1)^{2t}$.
Each step of power iteration requires only two matrix-vector multiplications, with complexity $\mathcal{O}(nm)$; running $t$ iterations costs $\mathcal{O}(tnm)$ in total, which is very efficient. The drawback is slow convergence when $\sigma_1$ and $\sigma_2$ are close. However, power iteration's practical performance often exceeds what the theory suggests; early work even achieved good results with just a single iteration. The reason is that when $\sigma_1$ and $\sigma_2$ are close, the two singular values and their singular vectors are, to some extent, interchangeable: even if power iteration has not fully converged, it returns a mixture of the two leading singular vectors, which is entirely sufficient in practice.
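Putting the pieces together, a minimal NumPy sketch (the function name and defaults are my own choices) estimates $\boldsymbol{v}_1$ by power iteration and then forms $\sigma_1\boldsymbol{u}_1\boldsymbol{v}_1^{\top} = \boldsymbol{W}\boldsymbol{v}_1\boldsymbol{v}_1^{\top}$ without any explicit SVD:

```python
import numpy as np

def top_rank1(W, num_iters=10, eps=1e-12):
    """Approximate sigma_1 u_1 v_1^T = W v_1 v_1^T via power iteration on W^T W."""
    x = np.ones(W.shape[1])                 # initial vector (all ones)
    for _ in range(num_iters):
        x = W.T @ (W @ x)                   # one step of W^T W x as two matrix-vector products
        x = x / (np.linalg.norm(x) + eps)   # renormalize to keep the iterate bounded
    return np.outer(W @ x, x)               # W v_1 v_1^T = sigma_1 u_1 v_1^T

rng = np.random.default_rng(0)
W = rng.standard_normal((6, 4))

U, S, Vt = np.linalg.svd(W, full_matrices=False)
exact = S[0] * np.outer(U[:, 0], Vt[0])
approx = top_rank1(W, num_iters=50)
print(np.linalg.norm(approx - exact) / np.linalg.norm(exact))  # relative error, should be small
```

In a training loop one would typically use far fewer iterations; the experiment described later in this article uses 10.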
Proof of the Iteration
This section completes the proof of the power iteration claim above. It is not difficult to see that power iteration can be written equivalently as
\begin{equation}\boldsymbol{v}_1 = \lim_{t\to\infty} \frac{(\boldsymbol{W}^{\top}\boldsymbol{W})^t\boldsymbol{x}_0}{\Vert(\boldsymbol{W}^{\top}\boldsymbol{W})^t\boldsymbol{x}_0\Vert}\end{equation}
To prove this limit, we start from $\boldsymbol{W}=\sum\limits_{i=1}^m\sigma_i \boldsymbol{u}_i\boldsymbol{v}_i^{\top}$; substituting and using the orthonormality of the $\boldsymbol{u}_i$ and $\boldsymbol{v}_i$ yields
\begin{equation}\boldsymbol{W}^{\top}\boldsymbol{W} = \sum_{i=1}^m \sigma_i^2\, \boldsymbol{v}_i\boldsymbol{v}_i^{\top}\qquad\Rightarrow\qquad (\boldsymbol{W}^{\top}\boldsymbol{W})^t = \sum_{i=1}^m \sigma_i^{2t}\, \boldsymbol{v}_i\boldsymbol{v}_i^{\top}\end{equation}
Since $\boldsymbol{v}_1,\boldsymbol{v}_2,\cdots,\boldsymbol{v}_m$ form an orthonormal basis of $\mathbb{R}^m$, $\boldsymbol{x}_0$ can be expanded as $\sum\limits_{j=1}^m c_j \boldsymbol{v}_j$. Thus we have
\begin{equation}(\boldsymbol{W}^{\top}\boldsymbol{W})^t\boldsymbol{x}_0 = \sum_{j=1}^m c_j\sigma_j^{2t}\,\boldsymbol{v}_j\end{equation}
and
\begin{equation}\Vert(\boldsymbol{W}^{\top}\boldsymbol{W})^t\boldsymbol{x}_0\Vert = \sqrt{\sum_{j=1}^m c_j^2\sigma_j^{4t}}\end{equation}
Due to the random initialization, the probability of $c_1=0$ is vanishingly small, so we may assume $c_1\neq 0$ (and, without loss of generality, $c_1 > 0$; otherwise replace $\boldsymbol{v}_1$ by $-\boldsymbol{v}_1$). Dividing numerator and denominator by $c_1\sigma_1^{2t}$ gives
\begin{equation}\frac{(\boldsymbol{W}^{\top}\boldsymbol{W})^t\boldsymbol{x}_0}{\Vert(\boldsymbol{W}^{\top}\boldsymbol{W})^t\boldsymbol{x}_0\Vert} = \frac{\boldsymbol{v}_1 + \sum\limits_{j=2}^m (c_j/c_1)(\sigma_j/\sigma_1)^{2t}\,\boldsymbol{v}_j}{\sqrt{1 + \sum\limits_{j=2}^m (c_j/c_1)^2(\sigma_j/\sigma_1)^{4t}}}\end{equation}
When $\sigma_1 > \sigma_2$, all $\sigma_i/\sigma_1(i\geq 2)$ are less than 1. Therefore, as $t\to \infty$, corresponding terms become 0, and the final limit is $\boldsymbol{v}_1$.
Related Work
The earliest paper proposing spectral norm regularization appears to be the 2017 work "Spectral Norm Regularization for Improving the Generalizability of Deep Learning", which compared weight decay, adversarial training, spectral norm regularization, and other methods, finding that spectral norm regularization performed best in terms of generalization capability.
The approach in that paper was not to compute $\nabla_{\boldsymbol{W}}\Vert\boldsymbol{W}\Vert_2^2 = 2\sigma_1\boldsymbol{u}_1 \boldsymbol{v}_1^{\top}$ directly as we do in this article, but to estimate $\Vert\boldsymbol{W}\Vert_2$ via power iteration, add the weighted $\Vert\boldsymbol{W}\Vert_2^2$ to the loss function, and let the optimizer compute the gradients itself. That approach is somewhat less efficient, and it also makes it difficult to decouple the weight decay from the optimizer. Our formulation here is more flexible: like AdamW, it allows the weight decay to be separated from the optimization of the main loss.
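For contrast, here is a rough PyTorch sketch of that loss-based style (my paraphrase of the description above, not the paper's actual code): $\sigma_1$ is estimated with power iteration, the singular vectors are treated as constants, and the squared estimate is added to the training loss so the optimizer differentiates through it.

```python
import torch

def spectral_norm_penalty(W, num_iters=1):
    """Estimate 0.5 * sigma_1^2 with u, v held fixed, so its gradient w.r.t. W is sigma_1 u v^T."""
    with torch.no_grad():
        v = torch.ones(W.shape[1], device=W.device, dtype=W.dtype)
        for _ in range(num_iters):
            v = W.T @ (W @ v)
            v = v / v.norm()
        u = W @ v
        u = u / u.norm()
    sigma1 = u @ W @ v          # differentiable in W; u and v act as constants
    return 0.5 * sigma1 ** 2

# Usage sketch inside a training step ("coef" and "matrix_params" are hypothetical names):
# loss = task_loss + coef * sum(spectral_norm_penalty(p) for p in matrix_params)
```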
My preliminary experiments on language models show a weak improvement at the loss level (hopefully not illusory; at worst there is no degradation). The setup uses power iteration to approximate $\boldsymbol{v}_1$ (initialized as an all-ones vector and iterated 10 times), then replaces the original weight decay $-\lambda \boldsymbol{W}$ with $-\lambda \boldsymbol{W}\boldsymbol{v}_1\boldsymbol{v}_1^{\top}$, keeping the value of $\lambda$ unchanged.
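The snippet below sketches this modification as a decoupled, AdamW-style decay step (the function name, the learning-rate scaling, and the surrounding loop are my assumptions, not the actual training code used in the experiment):

```python
import torch

@torch.no_grad()
def spectral_weight_decay_(W, lr, lam, num_iters=10):
    """Apply W <- W - lr * lam * W v_1 v_1^T in place, with v_1 approximated by
    power iteration started from an all-ones vector."""
    v = torch.ones(W.shape[1], device=W.device, dtype=W.dtype)
    for _ in range(num_iters):
        v = W.T @ (W @ v)
        v = v / v.norm()
    W -= lr * lam * torch.outer(W @ v, v)

# Usage sketch, applied after the optimizer's gradient update:
# for p in matrix_params:        # hypothetical list of 2D weight matrices
#     spectral_weight_decay_(p, lr=current_lr, lam=weight_decay)
```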
Conclusion
This article derived the gradient of the spectral norm, leading to a new form of weight decay, and shared my reflections on it.
The main contributions of this work include:
- Gradient Derivation: Providing a complete derivation of the spectral norm gradient $\nabla_{\boldsymbol{W}}\Vert\boldsymbol{W}\Vert_2 = \boldsymbol{u}_1 \boldsymbol{v}_1^{\top}$ under the assumption of distinct singular values.
- Novel Weight Decay Formulation: Proposing $-\lambda \boldsymbol{W}\boldsymbol{v}_1\boldsymbol{v}_1^{\top}$ as an alternative to traditional Frobenius norm weight decay $-\lambda \boldsymbol{W}$.
- Efficient Computation: Demonstrating that the required computation reduces to finding the principal right singular vector $\boldsymbol{v}_1$, which can be efficiently approximated via power iteration.
- Theoretical Interpretation: Explaining the new weight decay as subtracting the optimal rank-1 approximation of the weight matrix, providing both computational and theoretical advantages.
- Implementation Guidance: Offering practical recommendations for implementation, including power iteration with 10 iterations for approximating $\boldsymbol{v}_1$.
Potential advantages of spectral norm-based weight decay include:
- Targeted Regularization: Focusing penalty on the most influential singular value that controls output magnitude.
- Parameter Efficiency: Potentially preserving expressive capacity while controlling model complexity.
- Theoretical Justification: Connection to Lipschitz continuity and output magnitude control.
- Implementation Flexibility: Can be integrated similarly to AdamW-style decoupled weight decay.
Future research directions could explore:
- Empirical evaluation on larger-scale models and diverse tasks
- Analysis of convergence properties with spectral norm weight decay
- Combination with other regularization techniques
- Theoretical analysis of generalization bounds
- Extension to other matrix norms and their applications
While preliminary results show promise, further experimentation and theoretical analysis are needed to fully understand the benefits and limitations of this approach. The connection between spectral norm regularization and weight decay opens new avenues for developing more effective regularization techniques in deep learning.
Original Article: Su Jianlin. From Spectral Norm Gradient to Novel Weight Decay: A Theoretical Exploration. Scientific Spaces.