Series Continuation

Readers who have completed the first three articles are presumably familiar with this series' "routine"—first impose constraints on the update direction to find the steepest descent direction, then add constraints to parameters to find a new steepest descent direction. When solving parameter constraint problems, we adopt the "first-order approximation suffices" principle to simplify constraint forms, which geometrically corresponds to the tangent space. Then, we use the method of undetermined coefficients to convert to an unconstrained form for analytical solutions, and finally solve the undetermined coefficients numerically.

Spectral Sphere Constraint

In this article, we solve a new example—Muon under spectral sphere constraints—which is an analog extension of the first article "Steepest Descent on Manifolds: 1. SGD + Hypersphere". It can be considered when we want the spectral norm of parameters to remain constant. Alternatively, it can simply serve as an exercise for practice.

Problem Formulation#

Mathematical Formulation

In "Steepest Descent on Manifolds: 2. Muon + Orthogonal" and "Steepest Descent on Manifolds: 3. Muon + Stiefel", we have detailed the collision between Muon and orthogonal constraints. Therefore, we will not expand on related background and directly present the problem form:

(1) \[ \newcommand{tr}{\mathop{\text{tr}}}\max_{\boldsymbol{\Phi}} \tr(\boldsymbol{G}^{\top}\boldsymbol{\Phi}) \qquad \text{s.t.}\qquad \Vert\boldsymbol{\Phi}\Vert_2 = 1,\,\, \Vert\boldsymbol{W}\Vert_2 = 1,\,\,\Vert\boldsymbol{W} - \eta \boldsymbol{\Phi}\Vert_2=1 \]

where $\boldsymbol{W},\boldsymbol{\Phi}\in\mathbb{R}^{n\times m}(n \geq m)$, and $\Vert\cdot\Vert_2$ denotes the spectral norm. Of course, if interested, the latter two spectral norms can be replaced by other norms; for instance, using the Frobenius norm represents the combination "Muon + Hypersphere".

Tangent Space Approximation

The "first-order approximation suffices" principle requires computing the gradient of the spectral norm, which we have introduced in "From Spectral Norm Gradient to Thoughts on New Weight Decay" and "Derivatives of SVD". The answer is $\nabla_{\boldsymbol{W}}\Vert\boldsymbol{W}\Vert_2=\boldsymbol{u}_1 \boldsymbol{v}_1^{\top}$, where $\boldsymbol{u}_1,\boldsymbol{v}_1$ are the singular vectors corresponding to the largest singular value of $\boldsymbol{W}$, which can be solved via power iteration. This result also assumes the largest singular value is unique; we will discuss the non-unique case later.

For the Frobenius norm, we have $\nabla_{\boldsymbol{W}}\Vert\boldsymbol{W}\Vert_F=\boldsymbol{W}/\Vert\boldsymbol{W}\Vert_F$. In summary, regardless of the norm, there exists a matrix $\boldsymbol{\Theta}$ dependent only on $\boldsymbol{W}$ such that $\nabla_{\boldsymbol{W}}\Vert\boldsymbol{W}\Vert=\boldsymbol{\Theta}$. Then from $\Vert\boldsymbol{W}\Vert = 1$ and $\Vert\boldsymbol{W} - \eta \boldsymbol{\Phi}\Vert=1$, we obtain their first-order approximation version: $0 = \langle\boldsymbol{\Theta},\boldsymbol{\Phi}\rangle_F = \tr(\boldsymbol{\Theta}^{\top} \boldsymbol{\Phi})$. Therefore, under first-order approximation, the general formulation for such problems is:

(2) \[ \max_{\boldsymbol{\Phi}} \tr(\boldsymbol{G}^{\top}\boldsymbol{\Phi}) \qquad \text{s.t.}\qquad \Vert\boldsymbol{\Phi}\Vert_2 = 1,\,\, \tr(\boldsymbol{\Theta}^{\top} \boldsymbol{\Phi})=0 \]

Undetermined Coefficients#

Lagrange Multiplier Method

The routine remains the same: introduce an undetermined coefficient $\lambda$, and we have:

(3) \[ \begin{aligned} \tr(\boldsymbol{G}^{\top}\boldsymbol{\Phi}) =&\, \tr(\boldsymbol{G}^{\top}\boldsymbol{\Phi}) + \lambda \tr(\boldsymbol{\Theta}^{\top} \boldsymbol{\Phi}) \\ =&\, \tr((\boldsymbol{G} + \lambda\boldsymbol{\Theta})^{\top}\boldsymbol{\Phi}) \\ \leq &\,\Vert\boldsymbol{G} + \lambda\boldsymbol{\Theta}\Vert_* \end{aligned} \]

The last inequality is a result of Muon itself, analogous to Hölder's inequality for two vectors, with equality achieved at:

(4) \[ \boldsymbol{\Phi} = \newcommand{msign}{\mathop{\text{msign}}}\msign(\boldsymbol{G} + \lambda\boldsymbol{\Theta}) \]

The next task is to solve for a $\lambda$ satisfying the constraint condition $\tr(\boldsymbol{\Theta}^{\top} \boldsymbol{\Phi})=0$, and the problem is solved.

Nonlinear Equation Challenge

Due to the presence of $\msign$, $\tr(\boldsymbol{\Theta}^{\top} \boldsymbol{\Phi})=0$ is actually a nonlinear equation. I tend to believe it has no analytical solution, so we seek numerical methods. However, with the experience from "Steepest Descent on Manifolds: 3. Muon + Stiefel", we can now confidently construct iterative methods for such equations.

Iterative Solution#

Iterative Solution Development

First, according to the definition $\msign(\boldsymbol{M}) = \boldsymbol{M}(\boldsymbol{M}^{\top}\boldsymbol{M})^{-1/2}$, we can write $\boldsymbol{\Phi}=(\boldsymbol{G} + \lambda\boldsymbol{\Theta})\boldsymbol{Q}^{-1}$, where $\boldsymbol{Q}=((\boldsymbol{G} + \lambda\boldsymbol{\Theta})^{\top}(\boldsymbol{G} + \lambda\boldsymbol{\Theta}))^{1/2}$. Then:

(5) \[ \tr(\boldsymbol{\Theta}^{\top} \boldsymbol{\Phi})=0\qquad\Rightarrow\qquad \lambda = -\frac{\tr(\boldsymbol{\Theta}^{\top}\boldsymbol{G}\boldsymbol{Q}^{-1})}{\tr(\boldsymbol{\Theta}^{\top}\boldsymbol{\Theta}\boldsymbol{Q}^{-1})} \]

But note, this is not an analytical solution because $\boldsymbol{Q}$ also depends on $\lambda$. However, based on this expression, we can construct an iterative scheme: substitute an initial $\lambda$ to compute $\boldsymbol{Q}= (\boldsymbol{G} + \lambda\boldsymbol{\Theta})^{\top}\boldsymbol{\Phi}$, then substitute into the above formula to update $\lambda$, repeating until convergence.

Improved Iterative Scheme

However, although this iterative scheme is theoretically feasible, it requires computing $\boldsymbol{Q}^{-1}$. Although we have discussed efficient algorithms for this in "Efficient Computation of Matrix r-th Roots and Inverse r-th Roots", from the perspective of "Occam's Razor" (do not multiply entities unnecessarily), we hope to avoid iterations beyond $\msign$. Therefore, I attempted to find another iterative scheme that does not require $\boldsymbol{Q}^{-1}$. For this purpose, we first write:

(6) \[ \boldsymbol{\Theta}^{\top} \boldsymbol{\Phi} = \boldsymbol{\Theta}^{\top}(\boldsymbol{G} + \lambda\boldsymbol{\Theta})\boldsymbol{Q}^{-1} \]

For our objective, the trace of the above equals zero. We can explicitly subtract $\tr(\boldsymbol{\Theta}^{\top} \boldsymbol{\Phi})\boldsymbol{I}/m$ from the left side to ensure it satisfies this condition:

(7) \[ \boldsymbol{\Theta}^{\top} \boldsymbol{\Phi} - \tr(\boldsymbol{\Theta}^{\top} \boldsymbol{\Phi})\boldsymbol{I}/m = \boldsymbol{\Theta}^{\top}(\boldsymbol{G} + \lambda\boldsymbol{\Theta})\boldsymbol{Q}^{-1} \]

Now multiply both sides by $\boldsymbol{Q}$ and take the trace to solve for $\lambda$:

(8) \[ \lambda = \frac{\tr(\boldsymbol{\Theta}^{\top} \boldsymbol{\Phi}\boldsymbol{Q}) - \tr(\boldsymbol{\Theta}^{\top}\boldsymbol{\Phi}) \tr(\boldsymbol{Q})/m - \tr(\boldsymbol{\Theta}^{\top}\boldsymbol{G})}{\tr(\boldsymbol{\Theta}^{\top}\boldsymbol{\Theta})} \]

This way, iteration does not require computing $\boldsymbol{Q}^{-1}$.

Reference Code#

Python Implementation

We use $\lambda=-\tr(\boldsymbol{\Theta}^{\top}\boldsymbol{G})/\tr(\boldsymbol{\Theta}^{\top}\boldsymbol{\Theta})$ as the initial value. The test code is as follows:

Python Implementation
import numpy as np

def msign(g):
    """Singular Value Decomposition for exact msign computation
    """
    u, s, vh = np.linalg.svd(g, full_matrices=False)
    return u @ np.diag(np.sign(s)) @ vh

def dot(a, b):
    """Equivalent to np.trace(a.T @ b)
    """
    return (a * b).sum()

n, m = 100, 50
w = np.random.randn(n, m) / m**0.5
g = np.random.randn(n, m) / m**0.5
u, s, vh = np.linalg.svd(w, full_matrices=False)
theta = u[:, :1] @ vh[:1]

lamb = - dot(theta, g) / dot(theta, theta)
for i in range(10):
    phi = msign(z := g + lamb * theta)
    print('step:', i, ', inner product:', dot(phi, g), ', tangent error:', dot(theta, phi))
    q, x = z.T @ phi, theta.T @ phi
    lamb = (dot(x, q) - np.trace(x) * np.trace(q) / m - dot(theta, g)) / dot(theta, theta)

Numerical Details#

Implementation Considerations

Similar to the first three articles, due to using the "first-order approximation suffices" principle, the spectral norm of $\boldsymbol{W} - \eta\boldsymbol{\Phi}$ is accurate to $1 + \mathcal{O}(\eta^2)$, typically not precisely 1. Therefore, we need to perform spectral normalization once:

(9) \[ \boldsymbol{W}\quad\leftarrow\quad \frac{\boldsymbol{W} - \eta\boldsymbol{\Phi}}{\Vert\boldsymbol{W} - \eta\boldsymbol{\Phi}\Vert_2} \]

Fortunately, the spectral norm can be efficiently computed via power iteration, so this is not particularly expensive (compared to the iteration of $\msign$ itself).

Degenerate Case: Multiple Maximum Singular Values

Another aspect worth analyzing is the case where the largest singular value is not unique. In practical numerical computation, such special cases can be ignored, but for theoretical completeness, it should be included in the analysis. In this case, the corresponding singular vectors are also not unique, equivalent to having multiple different tangent spaces; the actual feasible space is the intersection of these tangent spaces. Taking two maximum singular values as an example, the problem becomes:

(10) \[ \max_{\boldsymbol{\Phi}} \tr(\boldsymbol{G}^{\top}\boldsymbol{\Phi}) \qquad \text{s.t.}\qquad \Vert\boldsymbol{\Phi}\Vert_2 = 1,\,\, \tr(\boldsymbol{\Theta}_1^{\top} \boldsymbol{\Phi})=0,\,\, \tr(\boldsymbol{\Theta}_2^{\top} \boldsymbol{\Phi})=0 \]

Here $\boldsymbol{\Theta}_1=\boldsymbol{u}_1 \boldsymbol{v}_1^{\top}, \boldsymbol{\Theta}_2=\boldsymbol{u}_2 \boldsymbol{v}_2^{\top}$. Introducing two undetermined coefficients $\lambda_1,\lambda_2$, we can solve:

(11) \[ \boldsymbol{\Phi} = \msign(\boldsymbol{G} + \lambda_1\boldsymbol{\Theta}_1+ \lambda_2\boldsymbol{\Theta}_2) \]

Next, we need to solve the equation system $\tr(\boldsymbol{\Theta}_1^{\top} \boldsymbol{\Phi})=0,\tr(\boldsymbol{\Theta}_2^{\top} \boldsymbol{\Phi})=0$. Similarly, introducing:

(12) \[ \boldsymbol{Q}=((\boldsymbol{G} + \lambda_1\boldsymbol{\Theta}_1+ \lambda_2\boldsymbol{\Theta}_2)^{\top}(\boldsymbol{G} + \lambda_1\boldsymbol{\Theta}_1+ \lambda_2\boldsymbol{\Theta}_2))^{1/2} = (\boldsymbol{G} + \lambda_1\boldsymbol{\Theta}_1+ \lambda_2\boldsymbol{\Theta}_2)^{\top}\boldsymbol{\Phi} \]

we can write the equation system:

(13) \[ \begin{gathered} \boldsymbol{\Theta}_1^{\top} \boldsymbol{\Phi} - \tr(\boldsymbol{\Theta}_1^{\top} \boldsymbol{\Phi})\boldsymbol{I}/m = \boldsymbol{\Theta}_1^{\top}(\boldsymbol{G} + \lambda_1\boldsymbol{\Theta}_1+ \lambda_2\boldsymbol{\Theta}_2)\boldsymbol{Q}^{-1} \\ \boldsymbol{\Theta}_2^{\top} \boldsymbol{\Phi} - \tr(\boldsymbol{\Theta}_2^{\top} \boldsymbol{\Phi})\boldsymbol{I}/m = \boldsymbol{\Theta}_2^{\top}(\boldsymbol{G} + \lambda_1\boldsymbol{\Theta}_1+ \lambda_2\boldsymbol{\Theta}_2)\boldsymbol{Q}^{-1} \\ \end{gathered} \]

Multiplying both sides by $\boldsymbol{Q}$ and taking the trace yields a system of linear equations in $\lambda_1,\lambda_2$ that can be solved, allowing construction of an iterative solution scheme. The details will not be elaborated here; interested readers practicing can complete them independently.

Article Summary#

This article primarily considered Muon forms corresponding to imposing spectral norm or general norm constraints on parameters. Building on the first three articles, this installment presents no significant technical difficulties and can be viewed purely as a supplementary exercise for practice.

Key contributions:

  1. Problem Extension: Extended the manifold optimization framework to spectral sphere constraints, providing a matrix analog to hypersphere constraints from the first article.
  2. Methodological Consistency: Demonstrated the consistent application of the "first-order approximation suffices" principle and method of undetermined coefficients across different constraint types.
  3. Iterative Solution: Developed efficient iterative schemes for solving the resulting nonlinear equations without requiring matrix inverse computations at each iteration.
  4. Practical Implementation: Provided clean Python implementation demonstrating the convergence of the iterative solution.
  5. Theoretical Considerations: Addressed degenerate cases with multiple maximum singular values, extending the framework's theoretical completeness.

The article showcases the versatility of the Muon optimization framework when combined with manifold constraints, providing practitioners with tools for maintaining specific matrix properties during gradient-based optimization.

Citation Information

Original Article: Su Jianlin. Steepest Descent on Manifolds: 4. Muon + Spectral Sphere. Scientific Spaces.

How to cite this translation:

Su, J. Steepest Descent on Manifolds: 4. Muon + Spectral Sphere [Translated by ScalingOpt Team]. Scientific Spaces.

BibTeX:

@article{su2025steepest_muon_spectral, title = {Steepest Descent on Manifolds: 4. Muon + Spectral Sphere}, author = {Su, Jianlin}, journal = {Scientific Spaces}, year = {2025}, url = {https://kexue.fm/archives/10570}, note = {Translated by ScalingOpt Team} }