In the previous two articles, "Rethinking Learning Rate and Batch Size (Part 1): Current Landscape" and "Rethinking Learning Rate and Batch Size (Part 2): Mean Field Theory", we introduced the mean field method to simplify computations involving the learning rate and batch size. There we analyzed optimizers such as SGD, SignSGD, and SoftSignSGD, but the main goal was simplification; essentially no new conclusions were reached.
However, in today's optimizer feast, how could Muon be left out? Therefore, in this article we will attempt to compute the relevant conclusions for Muon, to see if its relationship between learning rate and batch size reveals new patterns.
Basic Notation
As is well known, the main feature of Muon is its non-element-wise update rule. Therefore, the element-wise computation methods previously used in "How Should Learning Rate Scale with Batch Size?" and "How Does Adam's Epsilon Affect Learning Rate Scaling Laws?" will be completely inapplicable. Fortunately, the mean field theory introduced in the previous article still works well, requiring only slight adjustments to the details.
Let the loss function be $\mathcal{L}(\boldsymbol{W})$, where $\boldsymbol{W}\in\mathbb{R}^{n\times m}$ is a matrix parameter (assuming $n\geq m$), and $\boldsymbol{G}$ is its gradient. The gradient for a single sample is denoted as $\tilde{\boldsymbol{G}}$, whose mean is $\boldsymbol{G}$, and variance is $\sigma^2$. When the batch size is $B$, the gradient is denoted as $\tilde{\boldsymbol{G}}_B$, whose mean remains $\boldsymbol{G}$, but the variance becomes $\sigma^2/B$. Note that the variance here is only a scalar $\sigma^2$, unlike the full covariance matrix considered previously.
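To make the notation concrete, here is a minimal numpy sketch (sizes, noise level, and seed are illustrative assumptions, not from the original article) confirming that averaging $B$ per-sample gradients leaves the mean at $\boldsymbol{G}$ while shrinking the per-component variance to $\sigma^2/B$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, sigma, B, trials = 8, 4, 0.5, 16, 10_000    # illustrative values, not from the article
G = rng.standard_normal((n, m))                   # "true" gradient G = E[G_tilde]

# per-sample gradients: G plus i.i.d. noise with per-component variance sigma^2
per_sample = G + sigma * rng.standard_normal((trials, B, n, m))
G_B = per_sample.mean(axis=1)                     # one batch gradient per trial

print(np.abs(G_B.mean(axis=0) - G).max())         # ~0: the mean is still G
print(G_B.var(axis=0).mean(), sigma**2 / B)       # both ~0.0156: variance is sigma^2 / B
```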
The primary reason for this simplification is that the random variable itself is already a matrix, and its corresponding covariance matrix would actually be a fourth-order tensor, which is cumbersome to discuss. Does simplifying to a single scalar severely compromise accuracy? Actually, no. Although we considered the full covariance matrix $\boldsymbol{\Sigma}$ in the previous two articles, careful observation reveals that the final results only depend on $\newcommand{tr}{\mathop{\text{tr}}}\tr(\boldsymbol{\Sigma})$, which is equivalent to simplifying it to a scalar from the beginning.
Hessian Matrix
Similarly, we set the update as $-\eta\tilde{\boldsymbol{\Phi}}_B$ and consider the second-order expansion of the loss function:

$$\mathcal{L}(\boldsymbol{W}-\eta\tilde{\boldsymbol{\Phi}}_B)\approx \mathcal{L}(\boldsymbol{W}) - \eta\,\tr\big(\tilde{\boldsymbol{\Phi}}{}_B^{\top}\boldsymbol{G}\big) + \frac{\eta^2}{2}\tr\big(\tilde{\boldsymbol{\Phi}}{}_B^{\top}\boldsymbol{H}\tilde{\boldsymbol{\Phi}}_B\big)\tag{1}$$
The first two terms should be uncontroversial; the third term needs more care. Like the covariance, the Hessian matrix $\boldsymbol{H}$ here is really a fourth-order tensor, which is cumbersome to write out explicitly.
The simplest approach here is from the perspective of linear operators: understanding $\boldsymbol{H}$ as a linear operator whose inputs and outputs are both matrices. We don't need to know what $\boldsymbol{H}$ looks like, nor how $\boldsymbol{H}$ operates with $\tilde{\boldsymbol{\Phi}}_B$; we only need to know that $\boldsymbol{H}\tilde{\boldsymbol{\Phi}}_B$ is linear with respect to $\tilde{\boldsymbol{\Phi}}_B$. In this way, the objects we handle remain matrices without increasing mental burden. Any linear operator satisfying these conditions can serve as an approximation to the Hessian matrix, without needing to write out specific higher-order tensor forms.
The protagonist of this article is Muon; as an approximation of its update, we take $\tilde{\boldsymbol{\Phi}}_B=\newcommand{msign}{\mathop{\text{msign}}}\msign(\tilde{\boldsymbol{G}}_B)$ for the computation. By definition, $\msign(\tilde{\boldsymbol{G}}_B)=\tilde{\boldsymbol{G}}_B(\tilde{\boldsymbol{G}}{}_B^{\top}\tilde{\boldsymbol{G}}_B)^{-1/2}$. From the perspective of Newton's method, whose update is $-\boldsymbol{H}^{-1}\boldsymbol{G}$, treating $-\eta_{\max}\msign(\boldsymbol{G})$ as a Newton step is equivalent to assuming $\boldsymbol{H}^{-1}\boldsymbol{X} = \eta_{\max}\boldsymbol{X}(\boldsymbol{G}^{\top}\boldsymbol{G})^{-1/2}$, and thus $\boldsymbol{H}\boldsymbol{X} = \eta_{\max}^{-1}\boldsymbol{X}(\boldsymbol{G}^{\top}\boldsymbol{G})^{1/2}$, which will be used in later calculations.
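As a sanity check on this setup, the following numpy sketch (shapes, seed, $\eta_{\max}$, and the helper names `mat_power`/`msign` are illustrative assumptions, not from the article) computes $\msign$ both as $\boldsymbol{G}(\boldsymbol{G}^{\top}\boldsymbol{G})^{-1/2}$ and via the SVD, treats the assumed $\boldsymbol{H}$ purely as a linear operator on matrices, and verifies the Newton consistency $\boldsymbol{H}(\eta_{\max}\msign(\boldsymbol{G}))=\boldsymbol{G}$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, eta_max = 8, 4, 1e-3                 # illustrative values, not from the article
G = rng.standard_normal((n, m))

def mat_power(S, p):
    """S^p for a symmetric positive-definite matrix S, via eigendecomposition."""
    w, V = np.linalg.eigh(S)
    return (V * w**p) @ V.T

def msign(X):
    """msign(X) = X (X^T X)^{-1/2}, equivalently U V^T from the thin SVD X = U S V^T."""
    U, _, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ Vt

# the two expressions for msign agree (for full column rank X)
assert np.allclose(msign(G), G @ mat_power(G.T @ G, -0.5))

# Hessian as a linear operator on matrices: no fourth-order tensor is materialised,
# only the assumed action H X = eta_max^{-1} X (G^T G)^{1/2}
H = lambda X: X @ mat_power(G.T @ G, 0.5) / eta_max

# Newton's-method consistency: applying H to the step eta_max * msign(G) recovers G,
# i.e. -eta_max * msign(G) is exactly -H^{-1} G under this assumption
assert np.allclose(H(eta_max * msign(G)), G)
```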
Expectation Calculation
Taking expectations on both sides of equation (1), we obtain:

$$\mathbb{E}\big[\mathcal{L}(\boldsymbol{W}-\eta\tilde{\boldsymbol{\Phi}}_B)\big]\approx \mathcal{L}(\boldsymbol{W}) - \eta\,\tr\big(\mathbb{E}[\tilde{\boldsymbol{\Phi}}_B]^{\top}\boldsymbol{G}\big) + \frac{\eta^2}{2}\,\mathbb{E}\big[\tr\big(\tilde{\boldsymbol{\Phi}}{}_B^{\top}\boldsymbol{H}\tilde{\boldsymbol{\Phi}}_B\big)\big]$$
First, compute $\mathbb{E}[\tilde{\boldsymbol{\Phi}}_B]$. Applying the mean field approximation, i.e., replacing $\tilde{\boldsymbol{G}}{}_B^{\top}\tilde{\boldsymbol{G}}_B$ inside the nonlinear factor by its expectation, gives:

$$\mathbb{E}[\tilde{\boldsymbol{\Phi}}_B] = \mathbb{E}\big[\tilde{\boldsymbol{G}}_B(\tilde{\boldsymbol{G}}{}_B^{\top}\tilde{\boldsymbol{G}}_B)^{-1/2}\big] \approx \mathbb{E}[\tilde{\boldsymbol{G}}_B]\,\big(\mathbb{E}[\tilde{\boldsymbol{G}}{}_B^{\top}\tilde{\boldsymbol{G}}_B]\big)^{-1/2} = \boldsymbol{G}\,\big(\mathbb{E}[\tilde{\boldsymbol{G}}{}_B^{\top}\tilde{\boldsymbol{G}}_B]\big)^{-1/2}$$
We write $\mathbb{E}[\tilde{\boldsymbol{G}}{}_B^{\top}\tilde{\boldsymbol{G}}_B]$ component-wise and assume independence between different components:

$$\mathbb{E}\big[(\tilde{\boldsymbol{G}}{}_B^{\top}\tilde{\boldsymbol{G}}_B)_{ij}\big] = \sum_{k=1}^n \mathbb{E}\big[(\tilde{\boldsymbol{G}}_B)_{ki}(\tilde{\boldsymbol{G}}_B)_{kj}\big] = \sum_{k=1}^n\Big(\boldsymbol{G}_{ki}\boldsymbol{G}_{kj} + \delta_{ij}\frac{\sigma^2}{B}\Big) = (\boldsymbol{G}^{\top}\boldsymbol{G})_{ij} + \delta_{ij}\,\frac{n\sigma^2}{B}$$
Combining these gives $\mathbb{E}[\tilde{\boldsymbol{G}}{}_B^{\top}\tilde{\boldsymbol{G}}_B]=\boldsymbol{G}^{\top}\boldsymbol{G} + (n\sigma^2/B) \boldsymbol{I}$, so:

$$\mathbb{E}[\tilde{\boldsymbol{\Phi}}_B] \approx \boldsymbol{G}\,\Big(\boldsymbol{G}^{\top}\boldsymbol{G} + \frac{n\sigma^2}{B}\boldsymbol{I}\Big)^{-1/2}$$
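The independence assumption can be checked numerically. The sketch below (illustrative sizes, seed, and noise level; batch gradients drawn directly as $\boldsymbol{G}$ plus noise of per-component variance $\sigma^2/B$) compares the Monte Carlo mean of $\tilde{\boldsymbol{G}}{}_B^{\top}\tilde{\boldsymbol{G}}_B$ with $\boldsymbol{G}^{\top}\boldsymbol{G} + (n\sigma^2/B)\boldsymbol{I}$:

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, sigma, B, trials = 8, 4, 0.5, 16, 20_000    # illustrative values
G = rng.standard_normal((n, m))

acc = np.zeros((m, m))
for _ in range(trials):
    # batch gradient: mean G, per-component variance sigma^2 / B
    G_B = G + (sigma / np.sqrt(B)) * rng.standard_normal((n, m))
    acc += G_B.T @ G_B
empirical = acc / trials

predicted = G.T @ G + (n * sigma**2 / B) * np.eye(m)
print(np.abs(empirical - predicted).max())        # small; only Monte Carlo error remains
```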
To further simplify the dependence on $B$, we approximate $\boldsymbol{G}^{\top}\boldsymbol{G}$ by $\tr(\boldsymbol{G}^{\top}\boldsymbol{G})\boldsymbol{I}/m$, that is, we keep only the diagonal part of $\boldsymbol{G}^{\top}\boldsymbol{G}$ and then replace the diagonal entries with their average. This yields:

$$\mathbb{E}[\tilde{\boldsymbol{\Phi}}_B] \approx \boldsymbol{G}\,\Big(\frac{\tr(\boldsymbol{G}^{\top}\boldsymbol{G})}{m}\boldsymbol{I} + \frac{n\sigma^2}{B}\boldsymbol{I}\Big)^{-1/2} = \sqrt{\frac{m}{\tr(\boldsymbol{G}^{\top}\boldsymbol{G})}}\,\frac{\boldsymbol{G}}{\sqrt{1 + \mathcal{B}_{\text{simple}}/B}}$$
where $\mathcal{B}_{\text{simple}} = mn\sigma^2/\tr(\boldsymbol{G}^{\top}\boldsymbol{G})= mn\sigma^2/\Vert\boldsymbol{G}\Vert_F^2$. This is essentially the $\mathcal{B}_{\text{simple}}$ of the previous two articles, computed by treating $\boldsymbol{G}$ as a vector. The form of the above equation is identical to that of SignSGD, which already suggests that Muon will not yield many novel results regarding the relationship between learning rate and batch size.
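How rough are these approximations? A quick Monte Carlo sketch (sizes, seed, $\sigma$, $B$, and the helper names are illustrative assumptions, not from the original article) compares three quantities: the empirical $\tr\big(\mathbb{E}[\msign(\tilde{\boldsymbol{G}}_B)]^{\top}\boldsymbol{G}\big)$, the mean field prediction before the diagonal-average step, and the fully simplified $\sqrt{m\,\tr(\boldsymbol{G}^{\top}\boldsymbol{G})}\,/\sqrt{1+\mathcal{B}_{\text{simple}}/B}$:

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, sigma, B, trials = 8, 4, 2.0, 4, 5_000      # illustrative values
G = rng.standard_normal((n, m))

def msign(X):
    U, _, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ Vt

def mat_power(S, p):
    w, V = np.linalg.eigh(S)
    return (V * w**p) @ V.T

# Monte Carlo estimate of tr(E[msign(G_B)]^T G)
mc = np.mean([np.trace(msign(G + (sigma / np.sqrt(B)) * rng.standard_normal((n, m))).T @ G)
              for _ in range(trials)])

# mean field prediction before the diagonal-average simplification
mf = np.trace((G @ mat_power(G.T @ G + (n * sigma**2 / B) * np.eye(m), -0.5)).T @ G)

# fully simplified prediction via B_simple
tr_GG = np.trace(G.T @ G)                         # = ||G||_F^2
B_simple = m * n * sigma**2 / tr_GG
simplified = np.sqrt(m * tr_GG) / np.sqrt(1 + B_simple / B)

print(mc, mf, simplified)   # roughly comparable; gaps reflect the approximations and MC error
```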
Identical Patterns
As for $\mathbb{E}[\tr(\tilde{\boldsymbol{\Phi}}{}_B^{\top}\boldsymbol{H}\tilde{\boldsymbol{\Phi}}_B)]$, we only compute it under the assumption corresponding to Muon derived earlier, i.e., $\boldsymbol{H}\boldsymbol{X} = \eta_{\max}^{-1}\boldsymbol{X}(\boldsymbol{G}^{\top}\boldsymbol{G})^{1/2}$. Then:

$$\mathbb{E}\big[\tr(\tilde{\boldsymbol{\Phi}}{}_B^{\top}\boldsymbol{H}\tilde{\boldsymbol{\Phi}}_B)\big] = \eta_{\max}^{-1}\,\mathbb{E}\big[\tr\big(\tilde{\boldsymbol{\Phi}}{}_B^{\top}\tilde{\boldsymbol{\Phi}}_B(\boldsymbol{G}^{\top}\boldsymbol{G})^{1/2}\big)\big]$$
Note that $\tilde{\boldsymbol{\Phi}}_B$ is the output of $\msign$, so (assuming $\tilde{\boldsymbol{G}}_B$ has full column rank, which holds almost surely) it has orthonormal columns and $\tilde{\boldsymbol{\Phi}}{}_B^{\top}\tilde{\boldsymbol{\Phi}}_B=\boldsymbol{I}$. In this case, $\tr(\tilde{\boldsymbol{\Phi}}{}_B^{\top}\boldsymbol{H}\tilde{\boldsymbol{\Phi}}_B)$ is the deterministic constant $\eta_{\max}^{-1}\tr\big((\boldsymbol{G}^{\top}\boldsymbol{G})^{1/2}\big)=\eta_{\max}^{-1}\tr\big(\msign(\boldsymbol{G})^{\top}\boldsymbol{G}\big)$. Substituting this and the expression for $\mathbb{E}[\tilde{\boldsymbol{\Phi}}_B]$ into the expected loss and minimizing the resulting quadratic in $\eta$, we obtain:

$$\eta^* = \eta_{\max}\,\frac{\tr\big(\mathbb{E}[\tilde{\boldsymbol{\Phi}}_B]^{\top}\boldsymbol{G}\big)}{\tr\big((\boldsymbol{G}^{\top}\boldsymbol{G})^{1/2}\big)} \approx \frac{\eta_{\max}}{\sqrt{1 + \mathcal{B}_{\text{simple}}/B}}$$

where the last step uses the same approximation $\boldsymbol{G}^{\top}\boldsymbol{G}\approx \tr(\boldsymbol{G}^{\top}\boldsymbol{G})\boldsymbol{I}/m$, under which $\tr\big((\boldsymbol{G}^{\top}\boldsymbol{G})^{1/2}\big)\approx\sqrt{m\,\tr(\boldsymbol{G}^{\top}\boldsymbol{G})}$ and $\tr\big(\mathbb{E}[\tilde{\boldsymbol{\Phi}}_B]^{\top}\boldsymbol{G}\big)\approx\sqrt{m\,\tr(\boldsymbol{G}^{\top}\boldsymbol{G})}\,/\sqrt{1+\mathcal{B}_{\text{simple}}/B}$.
As expected, its form is completely identical to the SignSGD result, revealing no novel patterns.
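To make the shared rule concrete, here is a tiny sketch of the scaling factor $\eta^*(B)/\eta_{\max}=1/\sqrt{1+\mathcal{B}_{\text{simple}}/B}$ (the value of $\mathcal{B}_{\text{simple}}$ and the function name `lr_scale` are purely illustrative), showing the square-root regime at small $B$ and saturation at large $B$:

```python
import numpy as np

def lr_scale(B, B_simple):
    """eta*(B) / eta_max = 1 / sqrt(1 + B_simple / B), the same rule as for SignSGD."""
    return 1.0 / np.sqrt(1.0 + B_simple / B)

B_simple = 4096.0                                  # illustrative noise scale
for B in [64, 256, 1024, 4096, 16384, 65536]:
    print(B, lr_scale(B, B_simple))
# small B:  scale ≈ sqrt(B / B_simple), i.e. square-root scaling of the learning rate
# large B:  scale → 1, i.e. the learning rate saturates at eta_max
```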
Upon careful reflection, this is entirely reasonable. SignSGD directly applies $\newcommand{sign}{\mathop{\text{sign}}}\sign$ to the gradient, while Muon's $\msign$ applies $\sign$ to singular values. Intuitively, this is equivalent to applying $\sign$ in a different coordinate system. What it introduces is a new matrix update rule, whereas the learning rate $\eta^*$ and batch size $B$ are merely scalars. Given that both fundamentally rely on $\sign$, it is highly likely that the asymptotic relationships of these scalars will not exhibit significant changes.
Of course, we have only computed for a specific $\boldsymbol{H}$. If we consider more general $\boldsymbol{H}$, it is possible that, like SignSGD, Muon could exhibit the Surge phenomenon where "increasing batch size should actually decrease the learning rate." However, as discussed in the "Causal Reflection" section of the previous article, if the Surge phenomenon is truly observed, perhaps one should consider changing the optimizer rather than correcting the relationship between $\eta^*$ and $B$.
Summary
In this article, we attempted a simple analysis of Muon using mean field approximation. The conclusion is that its relationship between learning rate and batch size aligns with that of SignSGD, revealing no novel patterns.
Original Article: Su Jianlin. Rethinking Learning Rate and Batch Size (Part 3): Muon Analysis. Scientific Spaces.