This article continues our series on constrained optimization. In the previous article, "Manifold Gradient Descent: 1. SGD on Hyperspheres", we revisited the "Least Action Principle" for optimizers, proposing that the core distinction among optimizers lies in the constraint imposed on the update vector. If this constraint is the Euclidean norm, the corresponding steepest descent is SGD. We then discussed what happens when a magnitude constraint is additionally imposed on the parameters themselves, which amounts to steepest descent on a hyperspherical manifold.

However, the previous article should be considered merely a "warm-up", as it addressed the relatively simple case of vector parameters. This article formally enters more challenging territory: the parameters being optimized change from vectors to matrices, and the constraint on the increment becomes the spectral norm, which gives rise to the Muon optimizer. We then additionally impose orthogonality constraints on the parameters themselves, yielding the Muon optimizer on the orthogonal manifold.

Problem Formulation#

Matrix Parameter Formulation

Let the parameters to be optimized have matrix form $\boldsymbol{W}\in\mathbb{R}^{n\times m}$. Without loss of generality, assume $n\geq m$. According to the "Least Action Principle" from the previous article, the steepest descent increment $\Delta\boldsymbol{W}$ should satisfy:

(1) \[ \min_{\Delta \boldsymbol{W}} \mathcal{L}(\boldsymbol{W} +\Delta\boldsymbol{W}) \qquad \text{subject to}\qquad \rho(\Delta\boldsymbol{W})\leq \eta \]

If $\rho$ is taken to be the Frobenius norm, the result is identical to the previous article, since the Frobenius norm is simply the L2 norm of the flattened matrix, so we recover SGD treating the matrix as one long vector. To obtain a result that better reflects the intrinsic structure of matrices, we instead choose the spectral norm (also denoted $\Vert\cdot\Vert_2$).

As for why the spectral norm is selected, readers may refer to "Muon Optimizer Appreciation: The Essential Leap from Vectors to Matrices", "Muon Sequel: Why Did We Choose to Experiment with Muon?", and "Higher-Order MuP: More Concise yet More Insightful Spectral Conditioning Scaling"; we will not repeat the introduction here. In brief, the spectral norm gives the tightest characterization of the transformation a linear layer applies to its inputs, making it the more suitable measure of a matrix's "stability".

Following the same steps as before, applying a first-order approximation to $\mathcal{L}(\boldsymbol{W} +\Delta\boldsymbol{W})$ yields $\mathcal{L}(\boldsymbol{W}) + \langle \boldsymbol{G}, \Delta\boldsymbol{W}\rangle_F$, where $\boldsymbol{G}=\nabla_{\boldsymbol{W}}\mathcal{L}(\boldsymbol{W})$. Here $\langle\cdot,\cdot\rangle_F$ denotes the inner product obtained by flattening both matrices to vectors, equivalently $\newcommand{tr}{\mathop{\text{tr}}}\tr(\boldsymbol{G}^{\top}\Delta\boldsymbol{W})$. Setting $\Delta\boldsymbol{W} = -\eta \boldsymbol{\Phi}$, the original problem simplifies to:

(2) \[ \max_{\boldsymbol{\Phi}} \tr(\boldsymbol{G}^{\top}\boldsymbol{\Phi}) \qquad \text{subject to}\qquad \Vert\boldsymbol{\Phi}\Vert_2 = 1\label{eq:muon-obj} \]

These transformation steps are general; if details are forgotten, please refer back to the previous article.

Fundamental Result#

The solution process for objective (2) was already provided in the "Matrix Norms" section of "Muon Optimizer Appreciation: The Essential Leap from Vectors to Matrices". However, for completeness, we restate it here. Let the SVD of $\boldsymbol{G}$ be $\boldsymbol{U}\boldsymbol{\Sigma}\boldsymbol{V}^{\top} = \sum\limits_{i=1}^r \sigma_i \boldsymbol{u}_i \boldsymbol{v}_i^{\top}$, where $r$ is the rank of $\boldsymbol{G}$. We have:

(3) \[ \tr(\boldsymbol{G}^{\top}\boldsymbol{\Phi})=\tr\left(\sum_{i=1}^r \sigma_i \boldsymbol{v}_i \boldsymbol{u}_i^{\top}\boldsymbol{\Phi}\right) = \sum_{i=1}^r \sigma_i \boldsymbol{u}_i^{\top}\boldsymbol{\Phi}\boldsymbol{v}_i \]

By the definition of the spectral norm, when $\Vert\boldsymbol{\Phi}\Vert_2=1$ we have $\Vert\boldsymbol{\Phi}\boldsymbol{v}_i\Vert_2\leq \Vert\boldsymbol{\Phi}\Vert_2\Vert\boldsymbol{v}_i\Vert_2=1$; since $\Vert\boldsymbol{u}_i\Vert_2=1$, the Cauchy-Schwarz inequality then gives $\boldsymbol{u}_i^{\top}\boldsymbol{\Phi}\boldsymbol{v}_i\leq 1$. Therefore:

(4) \[ \tr(\boldsymbol{G}^{\top}\boldsymbol{\Phi})\leq \sum_{i=1}^r \sigma_i = \Vert \boldsymbol{G}\Vert_* \]

Here $\Vert\cdot\Vert_*$ is called the Nuclear Norm of the matrix. Equality holds when all $\boldsymbol{u}_i^{\top}\boldsymbol{\Phi}\boldsymbol{v}_i$ equal 1, yielding:

(5) \[ \newcommand{msign}{\mathop{\text{msign}}}\boldsymbol{\Phi} = \sum_{i=1}^r \boldsymbol{u}_i \boldsymbol{v}_i^{\top} = \boldsymbol{U}_{[:,:r]}\boldsymbol{V}_{[:,:r]}^{\top} = \msign(\boldsymbol{G}) \]
Notes on Solution Uniqueness

Note that if $r < m$, adding extra terms $\boldsymbol{u}_{r+1} \boldsymbol{v}_{r+1}^{\top},\boldsymbol{u}_{r+2} \boldsymbol{v}_{r+2}^{\top},\cdots$ also attains equality, so the solution is not unique. However, the singular vectors beyond index $r$ are themselves not uniquely determined, so the expression above can be regarded as the deterministic, minimal solution. Interested readers may also try the heavier machinery of the von Neumann trace inequality to derive general solutions under Schatten-$p$ norms, the spectral norm being the special case $p\to\infty$.
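To make this concrete, here is a minimal numerical sketch (assuming NumPy; the helper name `msign` below is ours, not from any library) that computes $\msign(\boldsymbol{G})$ via the SVD and checks that it attains the nuclear-norm bound (4) while having unit spectral norm.

```python
import numpy as np

def msign(G, tol=1e-10):
    """U_[:, :r] V_[:, :r]^T, where r is the numerical rank of G (via SVD)."""
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    r = int(np.sum(s > tol * s.max()))
    return U[:, :r] @ Vt[:r, :]

rng = np.random.default_rng(0)
G = rng.standard_normal((5, 3))
Phi = msign(G)

# tr(G^T Phi) attains the nuclear norm of G, and ||Phi||_2 = 1.
nuclear_norm = np.linalg.svd(G, compute_uv=False).sum()
print(np.trace(G.T @ Phi), nuclear_norm)   # equal up to rounding
print(np.linalg.norm(Phi, ord=2))          # 1.0
```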

Orthogonal Manifold#

Thus far, we have shown that for matrix parameters, the direction of steepest descent under a spectral norm constraint is not the negative gradient $-\boldsymbol{G}$ but requires an additional $\msign$ operation: $-\msign(\boldsymbol{G})$. This is precisely the Muon optimizer, which was used to train Kimi K2 and is among the most competitive optimizers available today. Conversely, this suggests that the spectral norm is a very appropriate stability constraint for matrices.
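As a further illustration, a single spectral-norm steepest-descent step can be sketched as below. Note the caveats: the practical Muon optimizer applies $\msign$ to a momentum buffer and approximates it with a tuned higher-order Newton-Schulz iteration; the classical cubic iteration, step size, and iteration count here are simplified stand-ins, not a reference implementation.

```python
import numpy as np

def msign_newton_schulz(G, steps=12):
    """Approximate msign(G) with the classical cubic Newton-Schulz iteration
    X <- 1.5 X - 0.5 X X^T X, after normalizing so all singular values are <= 1."""
    X = G / (np.linalg.norm(G) + 1e-12)   # Frobenius norm bounds the spectral norm
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

def muon_like_step(W, G, lr=0.02):
    """One Muon-style step for a single matrix parameter (no momentum, for clarity)."""
    return W - lr * msign_newton_schulz(G)
```

Each iteration leaves the singular vectors of $\boldsymbol{X}$ unchanged while pushing every nonzero singular value toward 1, which is exactly the effect $\msign$ requires.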

Of course, nothing so far is new. Now we begin exploring novel territory: imposing the orthogonality constraint $\boldsymbol{W}^{\top}\boldsymbol{W}=\boldsymbol{I}$ on the parameters $\boldsymbol{W}$ (source: "Orthogonal manifold"). Two cases arise. First, when $n=m$, $\boldsymbol{W}$ is a genuine orthogonal matrix satisfying $\boldsymbol{W}^{\top}\boldsymbol{W}=\boldsymbol{W}\boldsymbol{W}^{\top}=\boldsymbol{I}$. Second, when $n > m$, $\boldsymbol{W}\boldsymbol{W}^{\top}=\boldsymbol{I}$ cannot hold; such a $\boldsymbol{W}$ is usually called a semi-orthogonal matrix, and the corresponding space is termed the Stiefel manifold.

Constrained Optimization on Orthogonal Manifold

Specifically, we now need to solve:

\[ \max_{\boldsymbol{\Phi}} \tr(\boldsymbol{G}^{\top}\boldsymbol{\Phi}) \qquad \text{subject to}\qquad \Vert\boldsymbol{\Phi}\Vert_2 = 1,\,\, \boldsymbol{W}^{\top}\boldsymbol{W}=\boldsymbol{I},\,\,(\boldsymbol{W} - \eta \boldsymbol{\Phi})^{\top}(\boldsymbol{W} - \eta \boldsymbol{\Phi})=\boldsymbol{I} \]

Adhering to the "first-order approximation suffices" principle, the last condition simplifies to $\boldsymbol{W}^{\top}\boldsymbol{\Phi}+\boldsymbol{\Phi}^{\top}\boldsymbol{W} = \boldsymbol{0}$, meaning $\boldsymbol{W}^{\top}\boldsymbol{\Phi}$ must be a skew-symmetric matrix. The problem thus becomes:

\[ \max_{\boldsymbol{\Phi}} \tr(\boldsymbol{G}^{\top}\boldsymbol{\Phi}) \qquad \text{subject to}\qquad \Vert\boldsymbol{\Phi}\Vert_2 = 1,\,\, \boldsymbol{W}^{\top}\boldsymbol{W}=\boldsymbol{I},\,\,\boldsymbol{W}^{\top}\boldsymbol{\Phi}+\boldsymbol{\Phi}^{\top}\boldsymbol{W} = \boldsymbol{0}\label{eq:muon-obj-orth} \]

When might orthogonality constraints be applied? Such scenarios are in fact not uncommon. For example, in classification problems, if the classes are known to be nearly uncorrelated, one may consider an orthogonality constraint on the class matrix, although in practice this is often implemented approximately through a regularization term $\Vert\boldsymbol{W}^{\top}\boldsymbol{W}-\boldsymbol{I}\Vert_F^2$. Another example is LoRA, where a parameterization of the form $\boldsymbol{A}\boldsymbol{B}$ inherently contains redundancy that can be reduced via orthogonality constraints (reference), and so on.
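For completeness, here is a minimal sketch of the soft alternative just mentioned, i.e. penalizing $\Vert\boldsymbol{W}^{\top}\boldsymbol{W}-\boldsymbol{I}\Vert_F^2$ rather than enforcing exact orthogonality (NumPy, illustrative; the function names are ours).

```python
import numpy as np

def orthogonality_penalty(W):
    """||W^T W - I||_F^2, added to the training loss with some weight."""
    R = W.T @ W - np.eye(W.shape[1])
    return np.sum(R * R)

def orthogonality_penalty_grad(W):
    """Gradient of the penalty with respect to W: 4 W (W^T W - I)."""
    return 4.0 * W @ (W.T @ W - np.eye(W.shape[1]))
```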

Solution Process#

To solve objective $\eqref{eq:muon-obj-orth}$, we proceed as in the previous article and introduce an undetermined multiplier matrix $\boldsymbol{\Lambda}\in\mathbb{R}^{m\times m}$, obtaining:

(6) \[ \begin{aligned} \tr(\boldsymbol{G}^{\top}\boldsymbol{\Phi}) =&\, \tr(\boldsymbol{G}^{\top}\boldsymbol{\Phi}) + \tr(\boldsymbol{\Lambda}^{\top}(\boldsymbol{W}^{\top}\boldsymbol{\Phi}+\boldsymbol{\Phi}^{\top}\boldsymbol{W})) \\ =&\, \tr((\boldsymbol{G} + \boldsymbol{W}(\boldsymbol{\Lambda} + \boldsymbol{\Lambda}^{\top}))^{\top}\boldsymbol{\Phi}) \\ \leq &\,\Vert\boldsymbol{G} + \boldsymbol{W}(\boldsymbol{\Lambda} + \boldsymbol{\Lambda}^{\top})\Vert_* \end{aligned} \]

The first equality holds because the added term vanishes under the skew-symmetry constraint; the second uses the trace identities $\tr(\boldsymbol{A}\boldsymbol{B}) = \tr(\boldsymbol{B}\boldsymbol{A}) = \tr(\boldsymbol{A}^{\top}\boldsymbol{B}^{\top}) = \tr(\boldsymbol{B}^{\top}\boldsymbol{A}^{\top})$. According to the Muon result above, equality in the last line holds when:

(7) \[ \boldsymbol{\Phi} = \msign(\boldsymbol{G} + \boldsymbol{W}(\boldsymbol{\Lambda} + \boldsymbol{\Lambda}^{\top})) \]

The remaining problem is to find a real symmetric matrix $\boldsymbol{X} = \boldsymbol{\Lambda} + \boldsymbol{\Lambda}^{\top}$ such that $\boldsymbol{W}^{\top}\boldsymbol{\Phi}$ is skew-symmetric. For $n=m$ this is easy, because left-multiplication by the orthogonal matrix $\boldsymbol{W}^{\top}$ can be absorbed into $\msign$:

(8) \[ \boldsymbol{W}^{\top}\boldsymbol{\Phi} = \boldsymbol{W}^{\top}\msign(\boldsymbol{G} + \boldsymbol{W}\boldsymbol{X}) = \msign(\boldsymbol{W}^{\top}(\boldsymbol{G} + \boldsymbol{W}\boldsymbol{X})) = \msign(\boldsymbol{W}^{\top}\boldsymbol{G} +\boldsymbol{X}) \]
Key Property of msign Operator

Note that $\msign$ possesses an additional property: it preserves skew-symmetry. That is, if a square matrix $\boldsymbol{M}$ is skew-symmetric, then $\msign(\boldsymbol{M})$ is also skew-symmetric (try proving it; a numerical check is sketched at the end of this section). Therefore, to make $\boldsymbol{W}^{\top}\boldsymbol{\Phi}$ skew-symmetric, it suffices to make $\boldsymbol{W}^{\top}\boldsymbol{G} +\boldsymbol{X}$ skew-symmetric. Since $\boldsymbol{X}$ is symmetric, this amounts to decomposing $\boldsymbol{W}^{\top}\boldsymbol{G}$ into its symmetric and skew-symmetric parts, a problem with a ready-made solution:

\[ \boldsymbol{W}^{\top}\boldsymbol{G} = \underbrace{\frac{1}{2}(\boldsymbol{W}^{\top}\boldsymbol{G} + \boldsymbol{G}^{\top}\boldsymbol{W})}_{[\boldsymbol{W}^{\top}\boldsymbol{G}]_{\text{sym}}} + \underbrace{\frac{1}{2}(\boldsymbol{W}^{\top}\boldsymbol{G} - \boldsymbol{G}^{\top}\boldsymbol{W})}_{[\boldsymbol{W}^{\top}\boldsymbol{G}]_{\text{skew}}} \]

where $[\boldsymbol{M}]_{\text{sym}} = (\boldsymbol{M}+\boldsymbol{M}^{\top})/2$ and $[\boldsymbol{M}]_{\text{skew}} = (\boldsymbol{M}-\boldsymbol{M}^{\top})/2$. Based on this identity, we directly obtain $\boldsymbol{X} = -[\boldsymbol{W}^{\top}\boldsymbol{G}]_{\text{sym}}$. Solving for the case $n > m$ is more complex and will be discussed in detail in the next article; this article aims to fully address the $n=m$ scenario.
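Before moving on, here is a quick numerical check (NumPy, illustrative) of the two facts used above: $\msign$ of a skew-symmetric matrix is again skew-symmetric, and the choice $\boldsymbol{X} = -[\boldsymbol{W}^{\top}\boldsymbol{G}]_{\text{sym}}$ indeed makes $\boldsymbol{W}^{\top}\boldsymbol{G} + \boldsymbol{X}$ skew-symmetric.

```python
import numpy as np

def msign(M, tol=1e-10):
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    r = int(np.sum(s > tol * s.max()))
    return U[:, :r] @ Vt[:r, :]

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))
M = A - A.T                                        # a skew-symmetric matrix
print(np.allclose(msign(M), -msign(M).T))          # True: msign preserves skew-symmetry

W, _ = np.linalg.qr(rng.standard_normal((4, 4)))   # a random orthogonal W
G = rng.standard_normal((4, 4))
X = -0.5 * (W.T @ G + G.T @ W)                     # X = -[W^T G]_sym (symmetric)
K = W.T @ G + X                                    # equals [W^T G]_skew
print(np.allclose(K, -K.T))                        # True
```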

Retraction Operation#

In summary, for $n=m$, our final result is:

(9) \[ \begin{aligned} \boldsymbol{\Phi} =&\, \msign(\boldsymbol{G} - \boldsymbol{W}[\boldsymbol{W}^{\top}\boldsymbol{G}]_{\text{sym}}) \\ =&\, \boldsymbol{W}\boldsymbol{W}^{\top}\msign(\boldsymbol{G} - \boldsymbol{W}[\boldsymbol{W}^{\top}\boldsymbol{G}]_{\text{sym}}) \\ =&\, \boldsymbol{W}\msign(\boldsymbol{W}^{\top}\boldsymbol{G} - [\boldsymbol{W}^{\top}\boldsymbol{G}]_{\text{sym}}) \\ =&\, \boldsymbol{W}\msign([\boldsymbol{W}^{\top}\boldsymbol{G}]_{\text{skew}}) \\ \end{aligned} \]

Therefore, the updated parameter is:

(10) \[ \boldsymbol{W} - \eta \boldsymbol{\Phi} = \boldsymbol{W}(\boldsymbol{I} - \eta\,\underbrace{\msign([\boldsymbol{W}^{\top}\boldsymbol{G}]_{\text{skew}})}_{\text{denoted as }\boldsymbol{O}})\label{eq:updated-W} \]

This is not exactly an orthogonal matrix, but it is orthogonal up to $\mathcal{O}(\eta^2)$ error, consistent with our "first-order approximation suffices" principle. To verify this, simply check:

(11) \[ \begin{aligned} (\boldsymbol{I} - \eta\boldsymbol{O})^{\top}\boldsymbol{W}^{\top}\boldsymbol{W}(\boldsymbol{I} - \eta\boldsymbol{O}) =&\,(\boldsymbol{I} - \eta\boldsymbol{O})^{\top}(\boldsymbol{I} - \eta\boldsymbol{O}) \\ =&\,\boldsymbol{I} - \eta(\boldsymbol{O}^{\top} + \boldsymbol{O}) + \eta^2\boldsymbol{O}^{\top}\boldsymbol{O} \\ =&\,\boldsymbol{I} + \eta^2\boldsymbol{O}^{\top}\boldsymbol{O} \\ \end{aligned}\label{eq:orth-check} \]
Retraction and Projection onto Orthogonal Manifold

If $[\boldsymbol{W}^{\top}\boldsymbol{G}]_{\text{skew}}$ is full-rank, then $\boldsymbol{O}$ is an orthogonal matrix, i.e., $\boldsymbol{O}^{\top}\boldsymbol{O}=\boldsymbol{I}$. In this case, dividing by $\sqrt{1+\eta^2}$ makes $\eqref{eq:updated-W}$ exactly orthogonal. However, when $[\boldsymbol{W}^{\top}\boldsymbol{G}]_{\text{skew}}$ is not full-rank, no such simple rescaling restores orthogonality. The general approach is to project onto the nearest orthogonal matrix, which is precisely what $\msign$ computes (refer to this section). Therefore, the complete update rule is:

\[ \boldsymbol{W} \quad \leftarrow\quad \msign(\boldsymbol{W} - \eta \boldsymbol{\Phi}) = \msign(\boldsymbol{W}(\boldsymbol{I} - \eta\boldsymbol{O})) = \boldsymbol{W}\msign(\boldsymbol{I} - \eta\boldsymbol{O}) \]

But this requires computing $\msign$ twice; let us try to simplify. By the definition of $\msign$ and Equation $\eqref{eq:orth-check}$:

\[ \msign(\boldsymbol{I} - \eta\boldsymbol{O}) = (\boldsymbol{I} - \eta\boldsymbol{O})(\boldsymbol{I} + \eta^2\boldsymbol{O}^{\top}\boldsymbol{O})^{-1/2} \]

Note that regardless of rank, $(\boldsymbol{O}^{\top}\boldsymbol{O})^2 = \boldsymbol{O}^{\top}\boldsymbol{O}$, since $\boldsymbol{O}=\msign([\boldsymbol{W}^{\top}\boldsymbol{G}]_{\text{skew}})$ makes $\boldsymbol{O}^{\top}\boldsymbol{O}$ an orthogonal projection matrix. Let $(1+\eta^2 x)^{-1/2}=1 + a_1 x + a_2 x^2 + a_3 x^3 + \cdots$. Then:

\[ \begin{aligned} (\boldsymbol{I} + \eta^2\boldsymbol{O}^{\top}\boldsymbol{O})^{-1/2} =&\, \boldsymbol{I} + a_1 (\boldsymbol{O}^{\top}\boldsymbol{O}) + a_2 (\boldsymbol{O}^{\top}\boldsymbol{O})^2 + a_3 (\boldsymbol{O}^{\top}\boldsymbol{O})^3 + \cdots \\ =&\, \boldsymbol{I} + a_1 (\boldsymbol{O}^{\top}\boldsymbol{O}) + a_2 (\boldsymbol{O}^{\top}\boldsymbol{O}) + a_3 (\boldsymbol{O}^{\top}\boldsymbol{O}) + \cdots \\ =&\, \boldsymbol{I} - \boldsymbol{O}^{\top}\boldsymbol{O} + \underbrace{(1 + a_1 + a_2 + a_3 + \cdots)}_{(1+\eta^2 x)^{-1/2}\text{ evaluated at }x=1}\boldsymbol{O}^{\top}\boldsymbol{O} \\ \end{aligned} \]

This eliminates one $\msign$ computation. The simplified complete result is:

\[ \boldsymbol{W} \quad \leftarrow\quad \boldsymbol{W}(\boldsymbol{I} - \eta\boldsymbol{O})\left(\boldsymbol{I} - \boldsymbol{O}^{\top}\boldsymbol{O} + \frac{\boldsymbol{O}^{\top}\boldsymbol{O}}{\sqrt{1+\eta^2}}\right) \]
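Putting everything together, a sketch of the complete $n=m$ update (Equation (10) followed by the retraction above) might look like the following; it uses an exact SVD-based $\msign$, and the function names are illustrative rather than taken from any existing implementation.

```python
import numpy as np

def msign(M, tol=1e-10):
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    r = int(np.sum(s > tol * s.max()))
    return U[:, :r] @ Vt[:r, :]

def orthogonal_muon_step(W, G, lr):
    """One steepest-descent step on the orthogonal manifold (n = m case)."""
    n = W.shape[0]
    K = 0.5 * (W.T @ G - G.T @ W)                  # [W^T G]_skew
    O = msign(K)                                   # the factor denoted O in Eq. (10)
    P = O.T @ O                                    # orthogonal projection, P^2 = P
    retraction = np.eye(n) - P + P / np.sqrt(1.0 + lr**2)
    return W @ (np.eye(n) - lr * O) @ retraction

# Quick check: the update preserves orthogonality of W.
rng = np.random.default_rng(2)
W, _ = np.linalg.qr(rng.standard_normal((6, 6)))
G = rng.standard_normal((6, 6))
W_new = orthogonal_muon_step(W, G, lr=0.1)
print(np.allclose(W_new.T @ W_new, np.eye(6)))     # True (up to rounding)
```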

Summary#

In this article, we revisited the conclusion that imposing spectral norm constraints on matrix parameter updates yields the Muon optimizer. We then explored the form of the Muon optimizer when orthogonal constraints are imposed on parameters. If you wish parameters to remain orthogonal matrices during updates, this article may provide some reference value.

Citation Information

Original Article: Su Jianlin. Manifold Gradient Descent: 2. Muon with Orthogonal Constraints. Scientific Spaces.

How to cite this translation:

Su, J. Manifold Gradient Descent: 2. Muon with Orthogonal Constraints [Translated by Juanxi Tian]. Scientific Spaces.

BibTeX:

@article{su2025manifold_gradient_muon_orthogonal,
  title   = {Manifold Gradient Descent: 2. Muon with Orthogonal Constraints},
  author  = {Su, Jianlin},
  journal = {Scientific Spaces},
  year    = {2025},
  url     = {https://kexue.fm/archives/11196},
  note    = {Translated by Juanxi Tian (ScalingOpt Team)}
}