In the article "Appreciation of the Muon Optimizer: From Vector to Matrix Essential Leap", we introduced a new optimizer named "Muon," which can be understood as steepest gradient descent under spectral norm regularization. This perspective seems to point to a more fundamental update direction for matrix parameters. As is well known, matrix parameters are usually regularized with weight decay, which can be interpreted as the gradient of the squared Frobenius norm. From the Muon perspective, could a new weight decay constructed from the gradient of the squared spectral norm yield better results?
This raises the question: what does the gradient or derivative of the spectral norm look like? And what would the new weight decay designed from it look like? In the following sections, we explore these questions in detail.
Fundamentals
The spectral norm, also known as the "2-norm," is one of the most commonly used matrix norms. Compared with the simpler Frobenius norm, it often captures more essential properties related to matrix multiplication, because its definition is inherently tied to matrix multiplication: for a matrix parameter $\boldsymbol{W}\in\mathbb{R}^{n\times m}$, the spectral norm is defined as
\begin{equation}\Vert\boldsymbol{W}\Vert_2 = \max_{\boldsymbol{x}\neq \boldsymbol{0}} \frac{\Vert\boldsymbol{W}\boldsymbol{x}\Vert}{\Vert\boldsymbol{x}\Vert}\end{equation}
Here, $\boldsymbol{x}\in\mathbb{R}^m$ is a column vector, and $\Vert\cdot\Vert$ on the right-hand side denotes the vector (Euclidean) norm. From another perspective, the spectral norm is the smallest constant $C$ such that the inequality
\begin{equation}\Vert\boldsymbol{W}\boldsymbol{x}\Vert \leq C\,\Vert\boldsymbol{x}\Vert\end{equation}
holds for all $\boldsymbol{x}\in\mathbb{R}^m$.
It is not difficult to show that the inequality also holds when $C$ is taken to be the Frobenius norm $\Vert \boldsymbol{W}\Vert_F$. Therefore, $\Vert \boldsymbol{W}\Vert_2\leq \Vert \boldsymbol{W}\Vert_F$ (since $\Vert \boldsymbol{W}\Vert_F$ is merely one valid choice of $C$, while $\Vert \boldsymbol{W}\Vert_2$ is the smallest such $C$). This also indicates that if we want to control the output magnitude, the spectral norm is a more precise regularization term than the Frobenius norm.
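As a quick numerical sanity check of this chain of inequalities, a few lines of NumPy (an illustrative sketch, not part of the original derivation) confirm that $\Vert\boldsymbol{W}\boldsymbol{x}\Vert/\Vert\boldsymbol{x}\Vert \leq \Vert\boldsymbol{W}\Vert_2 \leq \Vert\boldsymbol{W}\Vert_F$ for a random matrix and vector:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((5, 3))   # a random 5x3 matrix
x = rng.standard_normal(3)        # a random input vector

spectral = np.linalg.norm(W, 2)       # spectral norm = largest singular value
frobenius = np.linalg.norm(W, "fro")  # Frobenius norm
ratio = np.linalg.norm(W @ x) / np.linalg.norm(x)

print(ratio <= spectral <= frobenius)  # expected: True
```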
Gradient Derivation
Now let us get to the main topic and derive the gradient of the spectral norm, $\nabla_{\boldsymbol{W}} \Vert\boldsymbol{W}\Vert_2$. Numerically, the spectral norm equals the largest singular value of $\boldsymbol{W}$; we proved this in the "Matrix Norms" section of "Low-Rank Approximation Path (II): SVD". This means that if $\boldsymbol{W}$ has the SVD $\boldsymbol{W}=\sum\limits_{i=1}^{\min(n,m)}\sigma_i \boldsymbol{u}_i\boldsymbol{v}_i^{\top}$, then
\begin{equation}\Vert\boldsymbol{W}\Vert_2 = \sigma_1 = \boldsymbol{u}_1^{\top}\boldsymbol{W}\boldsymbol{v}_1\end{equation}
where $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_{\min(n,m)} \geq 0$ are the singular values of $\boldsymbol{W}$. Taking differentials on both sides, we obtain
\begin{equation}d\Vert\boldsymbol{W}\Vert_2 = d\boldsymbol{u}_1^{\top}\boldsymbol{W}\boldsymbol{v}_1 + \boldsymbol{u}_1^{\top}d\boldsymbol{W}\,\boldsymbol{v}_1 + \boldsymbol{u}_1^{\top}\boldsymbol{W}\,d\boldsymbol{v}_1\end{equation}
Observing that $\boldsymbol{W}\boldsymbol{v}_1=\sigma_1\boldsymbol{u}_1$ and $\boldsymbol{u}_1^{\top}\boldsymbol{u}_1=1$, we have
\begin{equation}d\boldsymbol{u}_1^{\top}\boldsymbol{W}\boldsymbol{v}_1 = \sigma_1\, d\boldsymbol{u}_1^{\top}\boldsymbol{u}_1 = \frac{\sigma_1}{2}\, d(\boldsymbol{u}_1^{\top}\boldsymbol{u}_1) = \frac{\sigma_1}{2}\, d(1) = 0\end{equation}
Similarly, $\boldsymbol{u}_1^{\top}\boldsymbol{W}\,d\boldsymbol{v}_1=0$. Therefore
\begin{equation}d\Vert\boldsymbol{W}\Vert_2 = \boldsymbol{u}_1^{\top}d\boldsymbol{W}\,\boldsymbol{v}_1 = \operatorname{tr}(\boldsymbol{v}_1\boldsymbol{u}_1^{\top}d\boldsymbol{W})\qquad\Rightarrow\qquad \nabla_{\boldsymbol{W}}\Vert\boldsymbol{W}\Vert_2 = \boldsymbol{u}_1\boldsymbol{v}_1^{\top}\end{equation}
(Note: This proof process references a response on Stack Exchange, but that answer did not prove $d\boldsymbol{u}_1^{\top}\boldsymbol{W}\boldsymbol{v}_1=0$ and $\boldsymbol{u}_1^{\top}\boldsymbol{W}d\boldsymbol{v}_1=0$; this portion has been completed by the author.)
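As a quick numerical check of the result $\nabla_{\boldsymbol{W}}\Vert\boldsymbol{W}\Vert_2 = \boldsymbol{u}_1\boldsymbol{v}_1^{\top}$, one can differentiate the largest singular value with an autodiff library and compare against the explicit rank-1 outer product; the PyTorch snippet below is an illustrative sketch along these lines:

```python
import torch

torch.manual_seed(0)
W = torch.randn(5, 3, requires_grad=True)

# Spectral norm as the largest singular value; backprop gives its gradient w.r.t. W.
sigma1 = torch.linalg.svdvals(W)[0]
sigma1.backward()

# Explicit u_1 v_1^T from the SVD; the simultaneous sign ambiguity of u_1 and v_1
# cancels in the outer product.
U, S, Vh = torch.linalg.svd(W.detach(), full_matrices=False)
rank1 = torch.outer(U[:, 0], Vh[0])

print(torch.allclose(W.grad, rank1, atol=1e-5))  # expected: True
```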
Weight Decay
Based on this result and the chain rule, we have
\begin{equation}\nabla_{\boldsymbol{W}}\Vert\boldsymbol{W}\Vert_2^2 = 2\Vert\boldsymbol{W}\Vert_2\,\nabla_{\boldsymbol{W}}\Vert\boldsymbol{W}\Vert_2 = 2\sigma_1 \boldsymbol{u}_1\boldsymbol{v}_1^{\top}\end{equation}
For comparison, the corresponding result for the Frobenius norm is
\begin{equation}\nabla_{\boldsymbol{W}}\Vert\boldsymbol{W}\Vert_F^2 = 2\boldsymbol{W} = 2\sum_{i=1}^{\min(n,m)}\sigma_i \boldsymbol{u}_i\boldsymbol{v}_i^{\top}\end{equation}
Placing the two side by side reveals a clear distinction: weight decay derived from the squared Frobenius norm penalizes all singular values simultaneously, whereas weight decay derived from the squared spectral norm penalizes only the largest singular value. If our goal is to control the output magnitude, penalizing only the largest singular value is "just right," while penalizing all singular values achieves a similar effect but may also reduce the parameters' expressive capacity.
According to the Eckart-Young-Mirsky theorem, the gradient direction $\sigma_1 \boldsymbol{u}_1\boldsymbol{v}_1^{\top}$ obtained above has another interpretation: it is the optimal rank-1 approximation of $\boldsymbol{W}$. In other words, spectral norm-based weight decay replaces the usual step of subtracting a multiple of the matrix itself with subtracting its optimal rank-1 approximation, weakening the penalty strength while making the penalty more "essential."
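To make this concrete, here is a small NumPy illustration (my own sketch, with an arbitrary coefficient): one decay step $\boldsymbol{W}-\eta\lambda\,\sigma_1\boldsymbol{u}_1\boldsymbol{v}_1^{\top}$ shrinks only the largest singular value, whereas the Frobenius-style step $(1-\eta\lambda)\boldsymbol{W}$ shrinks all of them:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((6, 4))
step = 0.1  # stands in for the product of learning rate and decay coefficient

U, S, Vt = np.linalg.svd(W, full_matrices=False)
rank1 = S[0] * np.outer(U[:, 0], Vt[0])   # optimal rank-1 approximation of W

W_spectral = W - step * rank1             # spectral-norm decay: only sigma_1 shrinks
W_frobenius = (1 - step) * W              # Frobenius-norm decay: every sigma_i shrinks

print(np.linalg.svd(W, compute_uv=False))           # original singular values
print(np.linalg.svd(W_spectral, compute_uv=False))  # only the first one is smaller
print(np.linalg.svd(W_frobenius, compute_uv=False)) # all scaled by (1 - step)
```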
Numerical Computation
For practical implementation, the most crucial question arises: how do we compute $\sigma_1 \boldsymbol{u}_1 \boldsymbol{v}_1^{\top}$? SVD is certainly the most straightforward solution, but its computational complexity is undoubtedly the highest; we must find more efficient computational approaches.
Without loss of generality, assume $n\geq m$. First, since $\boldsymbol{W}\boldsymbol{v}_1 = \sigma_1\boldsymbol{u}_1$, observe that
\begin{equation}\sigma_1 \boldsymbol{u}_1 \boldsymbol{v}_1^{\top} = \boldsymbol{W}\boldsymbol{v}_1\boldsymbol{v}_1^{\top}\end{equation}
This shows that computing $\sigma_1 \boldsymbol{u}_1 \boldsymbol{v}_1^{\top}$ only requires knowledge of $\boldsymbol{v}_1$. According to our discussion in "Low-Rank Approximation Path (II): SVD", $\boldsymbol{v}_1$ is actually the eigenvector corresponding to the maximum eigenvalue of matrix $\boldsymbol{W}^{\top}\boldsymbol{W}$. Thus, we transform the problem from SVD of general matrix $\boldsymbol{W}$ to eigenvalue decomposition of real symmetric matrix $\boldsymbol{W}^{\top}\boldsymbol{W}$, which actually reduces complexity since eigenvalue decomposition is typically significantly faster than SVD.
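A brief numerical check of this equivalence (again just a sketch): the eigenvector of $\boldsymbol{W}^{\top}\boldsymbol{W}$ with the largest eigenvalue matches $\boldsymbol{v}_1$ from the SVD, up to an overall sign that is irrelevant for $\boldsymbol{v}_1\boldsymbol{v}_1^{\top}$:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((6, 4))

# v_1 from the SVD of W
_, _, Vt = np.linalg.svd(W, full_matrices=False)
v1_svd = Vt[0]

# v_1 as the top eigenvector of the real symmetric matrix W^T W
eigvals, eigvecs = np.linalg.eigh(W.T @ W)  # eigenvalues returned in ascending order
v1_eig = eigvecs[:, -1]

# Compare the rank-1 projectors, which are insensitive to the sign of v_1.
print(np.allclose(np.outer(v1_svd, v1_svd), np.outer(v1_eig, v1_eig), atol=1e-8))
```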
If this is still too slow, we can turn to the principle underlying many eigenvalue decomposition algorithms, namely power iteration:
When $\sigma_1 > \sigma_2$, the iteration \begin{equation}\boldsymbol{x}_{t+1} = \frac{\boldsymbol{W}^{\top}\boldsymbol{W}\boldsymbol{x}_t}{\Vert\boldsymbol{W}^{\top}\boldsymbol{W}\boldsymbol{x}_t\Vert}\end{equation} converges to $\boldsymbol{v}_1$ at a rate of $(\sigma_2/\sigma_1)^{2t}$.
Each step of power iteration requires only two matrix-vector multiplications, with complexity $\mathcal{O}(nm)$; running $t$ iterations costs $\mathcal{O}(tnm)$ in total, which is very efficient. The drawback is slow convergence when $\sigma_1$ and $\sigma_2$ are close. However, power iteration's practical performance often exceeds what the theory suggests; early work even achieved good results with just a single iteration. The reason is that when $\sigma_1$ and $\sigma_2$ are close, the two singular values and their singular vectors are, to some extent, interchangeable: even if power iteration has not fully converged, it returns a mixture of the two leading singular vectors, which is entirely sufficient in practice.
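Putting the pieces together, a minimal NumPy sketch (the function name and defaults are my own choices) estimates $\boldsymbol{v}_1$ by power iteration and then forms $\sigma_1\boldsymbol{u}_1\boldsymbol{v}_1^{\top} = \boldsymbol{W}\boldsymbol{v}_1\boldsymbol{v}_1^{\top}$ without any explicit SVD:

```python
import numpy as np

def top_rank1(W, num_iters=10, eps=1e-12):
    """Approximate sigma_1 u_1 v_1^T = W v_1 v_1^T via power iteration on W^T W."""
    x = np.ones(W.shape[1])                 # initial vector (all ones)
    for _ in range(num_iters):
        x = W.T @ (W @ x)                   # one step of W^T W x as two matrix-vector products
        x = x / (np.linalg.norm(x) + eps)   # renormalize to keep the iterate bounded
    return np.outer(W @ x, x)               # W v_1 v_1^T = sigma_1 u_1 v_1^T

rng = np.random.default_rng(0)
W = rng.standard_normal((6, 4))

U, S, Vt = np.linalg.svd(W, full_matrices=False)
exact = S[0] * np.outer(U[:, 0], Vt[0])
approx = top_rank1(W, num_iters=50)
print(np.linalg.norm(approx - exact) / np.linalg.norm(exact))  # relative error, should be small
```

In a training loop one would typically use far fewer iterations; the experiment described later in this article uses 10.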
Proof of the Iteration
This section completes the proof of the power iteration claim above. It is not difficult to see that power iteration can be written equivalently as
\begin{equation}\boldsymbol{v}_1 = \lim_{t\to\infty} \frac{(\boldsymbol{W}^{\top}\boldsymbol{W})^t\boldsymbol{x}_0}{\Vert(\boldsymbol{W}^{\top}\boldsymbol{W})^t\boldsymbol{x}_0\Vert}\end{equation}
To prove this limit, we start from $\boldsymbol{W}=\sum\limits_{i=1}^m\sigma_i \boldsymbol{u}_i\boldsymbol{v}_i^{\top}$; substituting and using the orthonormality of the $\boldsymbol{u}_i$ and $\boldsymbol{v}_i$ yields
\begin{equation}\boldsymbol{W}^{\top}\boldsymbol{W} = \sum_{i=1}^m \sigma_i^2\, \boldsymbol{v}_i\boldsymbol{v}_i^{\top}\qquad\Rightarrow\qquad (\boldsymbol{W}^{\top}\boldsymbol{W})^t = \sum_{i=1}^m \sigma_i^{2t}\, \boldsymbol{v}_i\boldsymbol{v}_i^{\top}\end{equation}
Since $\boldsymbol{v}_1,\boldsymbol{v}_2,\cdots,\boldsymbol{v}_m$ form an orthonormal basis of $\mathbb{R}^m$, $\boldsymbol{x}_0$ can be expanded as $\sum\limits_{j=1}^m c_j \boldsymbol{v}_j$. Thus we have
\begin{equation}(\boldsymbol{W}^{\top}\boldsymbol{W})^t\boldsymbol{x}_0 = \sum_{j=1}^m c_j\sigma_j^{2t}\,\boldsymbol{v}_j\end{equation}
and
\begin{equation}\Vert(\boldsymbol{W}^{\top}\boldsymbol{W})^t\boldsymbol{x}_0\Vert = \sqrt{\sum_{j=1}^m c_j^2\sigma_j^{4t}}\end{equation}
Due to the random initialization, the probability of $c_1=0$ is vanishingly small, so we may assume $c_1\neq 0$ (and, without loss of generality, $c_1 > 0$; otherwise replace $\boldsymbol{v}_1$ by $-\boldsymbol{v}_1$). Dividing numerator and denominator by $c_1\sigma_1^{2t}$ gives
\begin{equation}\frac{(\boldsymbol{W}^{\top}\boldsymbol{W})^t\boldsymbol{x}_0}{\Vert(\boldsymbol{W}^{\top}\boldsymbol{W})^t\boldsymbol{x}_0\Vert} = \frac{\boldsymbol{v}_1 + \sum\limits_{j=2}^m (c_j/c_1)(\sigma_j/\sigma_1)^{2t}\,\boldsymbol{v}_j}{\sqrt{1 + \sum\limits_{j=2}^m (c_j/c_1)^2(\sigma_j/\sigma_1)^{4t}}}\end{equation}
When $\sigma_1 > \sigma_2$, all $\sigma_i/\sigma_1(i\geq 2)$ are less than 1. Therefore, as $t\to \infty$, corresponding terms become 0, and the final limit is $\boldsymbol{v}_1$.
Related Work
The earliest paper proposing spectral norm regularization appears to be the 2017 work "Spectral Norm Regularization for Improving the Generalizability of Deep Learning", which compared weight decay, adversarial training, spectral norm regularization, and other methods, finding that spectral norm regularization performed best in terms of generalization capability.
The approach in that paper was not to compute $\nabla_{\boldsymbol{W}}\Vert\boldsymbol{W}\Vert_2^2 = 2\sigma_1\boldsymbol{u}_1 \boldsymbol{v}_1^{\top}$ directly as we do in this article, but to estimate $\Vert\boldsymbol{W}\Vert_2$ via power iteration, add the weighted $\Vert\boldsymbol{W}\Vert_2^2$ to the loss function, and let the optimizer compute the gradients itself. That approach is somewhat less efficient, and it also makes it difficult to decouple the weight decay from the optimizer. Our formulation here is more flexible: like AdamW, it allows the weight decay to be separated from the optimization of the main loss.
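For contrast, here is a rough PyTorch sketch of that loss-based style (my paraphrase of the description above, not the paper's actual code): $\sigma_1$ is estimated with power iteration, the singular vectors are treated as constants, and the squared estimate is added to the training loss so the optimizer differentiates through it.

```python
import torch

def spectral_norm_penalty(W, num_iters=1):
    """Estimate 0.5 * sigma_1^2 with u, v held fixed, so its gradient w.r.t. W is sigma_1 u v^T."""
    with torch.no_grad():
        v = torch.ones(W.shape[1], device=W.device, dtype=W.dtype)
        for _ in range(num_iters):
            v = W.T @ (W @ v)
            v = v / v.norm()
        u = W @ v
        u = u / u.norm()
    sigma1 = u @ W @ v          # differentiable in W; u and v act as constants
    return 0.5 * sigma1 ** 2

# Usage sketch inside a training step ("coef" and "matrix_params" are hypothetical names):
# loss = task_loss + coef * sum(spectral_norm_penalty(p) for p in matrix_params)
```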
My preliminary experiments on language models show a weak improvement at the loss level (hopefully not illusory; at worst there is no degradation). The setup uses power iteration to approximate $\boldsymbol{v}_1$ (initialized as an all-ones vector and iterated 10 times), then replaces the original weight decay $-\lambda \boldsymbol{W}$ with $-\lambda \boldsymbol{W}\boldsymbol{v}_1\boldsymbol{v}_1^{\top}$, keeping the value of $\lambda$ unchanged.
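The snippet below sketches this modification as a decoupled, AdamW-style decay step (the function name, the learning-rate scaling, and the surrounding loop are my assumptions, not the actual training code used in the experiment):

```python
import torch

@torch.no_grad()
def spectral_weight_decay_(W, lr, lam, num_iters=10):
    """Apply W <- W - lr * lam * W v_1 v_1^T in place, with v_1 approximated by
    power iteration started from an all-ones vector."""
    v = torch.ones(W.shape[1], device=W.device, dtype=W.dtype)
    for _ in range(num_iters):
        v = W.T @ (W @ v)
        v = v / v.norm()
    W -= lr * lam * torch.outer(W @ v, v)

# Usage sketch, applied after the optimizer's gradient update:
# for p in matrix_params:        # hypothetical list of 2D weight matrices
#     spectral_weight_decay_(p, lr=current_lr, lam=weight_decay)
```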
Conclusion
This article derived the gradient of the spectral norm, leading to a new form of weight decay, and shared my reflections on it.
The main contributions of this work include:
- Gradient Derivation: Providing a complete derivation of the spectral norm gradient $\nabla_{\boldsymbol{W}}\Vert\boldsymbol{W}\Vert_2 = \boldsymbol{u}_1 \boldsymbol{v}_1^{\top}$ under the assumption of distinct singular values.
- Novel Weight Decay Formulation: Proposing $-\lambda \boldsymbol{W}\boldsymbol{v}_1\boldsymbol{v}_1^{\top}$ as an alternative to traditional Frobenius norm weight decay $-\lambda \boldsymbol{W}$.
- Efficient Computation: Demonstrating that the required computation reduces to finding the principal right singular vector $\boldsymbol{v}_1$, which can be efficiently approximated via power iteration.
- Theoretical Interpretation: Explaining the new weight decay as subtracting the optimal rank-1 approximation of the weight matrix, providing both computational and theoretical advantages.
- Implementation Guidance: Offering practical recommendations for implementation, including power iteration with 10 iterations for approximating $\boldsymbol{v}_1$.
Potential advantages of spectral norm-based weight decay include:
- Targeted Regularization: Focusing penalty on the most influential singular value that controls output magnitude.
- Parameter Efficiency: Potentially preserving expressive capacity while controlling model complexity.
- Theoretical Justification: Connection to Lipschitz continuity and output magnitude control.
- Implementation Flexibility: Can be integrated similarly to AdamW-style decoupled weight decay.
Future research directions could explore:
- Empirical evaluation on larger-scale models and diverse tasks
- Analysis of convergence properties with spectral norm weight decay
- Combination with other regularization techniques
- Theoretical analysis of generalization bounds
- Extension to other matrix norms and their applications
While preliminary results show promise, further experimentation and theoretical analysis are needed to fully understand the benefits and limitations of this approach. The connection between spectral norm regularization and weight decay opens new avenues for developing more effective regularization techniques in deep learning.
Original Article: Su Jianlin. From Spectral Norm Gradient to Novel Weight Decay: A Theoretical Exploration. Scientific Spaces.