At the end of the previous article "Rethinking Learning Rate and Batch Size (Part 1): Current Landscape", we mentioned that for cases where $\tilde{\boldsymbol{\varphi}}_B$ depends nonlinearly on $\tilde{\boldsymbol{g}}_B$, such as SignSGD and SoftSignSGD, the computation is laborious and hard to generalize. To address this, I spent some effort trying to simplify the derivations and, fortunately, made some progress. The key idea is the theme of this article: mean field theory.

Mean field theory is a common approximation method in physics that has no fixed form, but the general idea is to move the averaging operation inside the function. In fact, we have already glimpsed the charm of mean field theory in "Why is Adam's Update RMS 0.2?". In this article, we will once again witness its remarkable effectiveness in computing the learning rate scaling laws for SignSGD/SoftSignSGD.

Method Overview#

Following the notation from the previous article, for SignSGD we have $\newcommand{sign}{\mathop{\text{sign}}}\tilde{\boldsymbol{\varphi}}_B=\sign(\tilde{\boldsymbol{g}}_B)$. We first need to compute $\mathbb{E}[\tilde{\boldsymbol{\varphi}}_B]$ and $\mathbb{E}[\tilde{\boldsymbol{\varphi}}_B\tilde{\boldsymbol{\varphi}}_B^{\top}]$, then we can calculate:

(1) \[ \newcommand{tr}{\mathop{\text{tr}}}\eta^* \approx \frac{\mathbb{E}[\tilde{\boldsymbol{\varphi}}_B]^{\top}\boldsymbol{g}}{\tr(\mathbb{E}[\tilde{\boldsymbol{\varphi}}_B\tilde{\boldsymbol{\varphi}}_B^{\top}]\boldsymbol{H})} \]

where $\boldsymbol{g}$ is the gradient and $\boldsymbol{H}$ is the Hessian matrix. According to our assumptions, the random variable $\tilde{\boldsymbol{g}}_B$ has mean $\boldsymbol{g}$ and covariance matrix $\boldsymbol{\Sigma}/B$. We are primarily concerned with the relationship between $\eta^*$ and batch size $B$. Since $\sign$ is an element-wise operation, we can attempt the analysis starting from a single scalar. The mean field method originated from my sudden discovery one day of a potentially valid approximation:

(2) \[ \mathbb{E}[\sign(\tilde{g}_B)] = \mathbb{E}\bigg[\frac{\tilde{g}_B}{\sqrt{\tilde{g}_B^2}}\bigg]\approx \frac{\mathbb{E}[\tilde{g}_B]}{\sqrt{\mathbb{E}[\tilde{g}_B^2]}} = \frac{g}{\sqrt{g^2 + \sigma^2/B}} \]

Readers who have read "How Should Learning Rate Scale with Batch Size?" may be surprised to find that this result, derived in just one line, differs from the result obtained in that article through numerous assumptions and approximations only by an inconsequential constant $\pi/2$! This made me realize that the mean field approximation might be entirely sufficient for analyzing the relationship between learning rate and batch size.
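As a quick sanity check of approximation (2), here is a minimal sketch that compares a Monte Carlo estimate of $\mathbb{E}[\sign(\tilde{g}_B)]$ with the mean field value $g/\sqrt{g^2+\sigma^2/B}$. The Gaussian noise model and the toy values of $g$ and $\sigma$ are assumptions made purely for the simulation; the mean field step itself does not require them.

```python
# Minimal sanity check of approximation (2), assuming (only for the simulation)
# Gaussian batch-gradient noise and toy values of g and sigma.
import numpy as np

rng = np.random.default_rng(0)
g, sigma = 0.3, 1.0          # true gradient component and per-sample noise std (toy values)

for B in [1, 4, 16, 64, 256]:
    # batch gradient: mean g, standard deviation sigma / sqrt(B)
    g_B = rng.normal(g, sigma / np.sqrt(B), size=1_000_000)
    mc = np.mean(np.sign(g_B))                   # Monte Carlo estimate of E[sign(g_B)]
    mf = g / np.sqrt(g**2 + sigma**2 / B)        # mean field approximation (2)
    print(f"B={B:4d}  MC={mc:.4f}  mean-field={mf:.4f}")
```

The two columns agree up to a modest, slowly varying factor, consistent with the observation above that the mean field result matches the earlier derivation up to a constant.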

Advantages of Mean Field Derivation

Fewer Assumptions: The original derivation contained at least three assumptions: component independence, normal distribution, and approximating $\text{erf}(x)$ with $x/\sqrt{x^2+c}$. However, the mean field approximation can eliminate the distributional form assumption, requiring only the assumption that the approximation itself is valid.

Simpler Computation: We completed the calculation in just one line above, whereas the original derivation was much more complex even under numerous assumptions.

Computation Process#

In this section, we will use the mean field approximation to provide the complete computation process for SignSGD. First, for the mean $\mathbb{E}[\tilde{\boldsymbol{\varphi}}_B]$, the calculation from the previous section is almost complete—we just need to add a few details. Using component notation:

(3) \[ \mathbb{E}[\tilde{\boldsymbol{\varphi}}_B]_i = \mathbb{E}[\sign((\tilde{g}_B)_i)] = \mathbb{E}\bigg[\frac{(\tilde{g}_B)_i}{\sqrt{(\tilde{g}_B)_i^2}}\bigg]\approx \frac{\mathbb{E}[(\tilde{g}_B)_i]}{\sqrt{\mathbb{E}[(\tilde{g}_B)_i^2]}} = \frac{g_i}{\sqrt{g_i^2 + \sigma_i^2/B}} = \frac{\sign(g_i)}{\sqrt{1 + (\sigma_i^2/g_i^2)/B}} \]

where $\sigma_i^2 = \boldsymbol{\Sigma}_{i,i}$. Since we are ultimately mainly concerned with the relationship between $\eta^*$ and $B$, both of which are scalars, we apply the mean field approximation once more here to separate the denominator part related to $B$ in scalar form:

(4) \[ \mathbb{E}[\tilde{\boldsymbol{\varphi}}_B]_i \approx \frac{\sign(g_i)}{\sqrt{1 + (\sigma_i^2/g_i^2)/B}} \approx \frac{\sign(g_i)}{\sqrt{1 + \mathcal{B}_{\text{simple}}/B}} \triangleq \mu_i \]

Here, $\mathcal{B}_{\text{simple}}$ is the same as in the previous article: $\mathcal{B}_{\text{simple}} = \tr(\boldsymbol{\Sigma})/\boldsymbol{g}^{\top}\boldsymbol{g}$, which is also equal to $\mathbb{E}[\sigma_i^2]/\mathbb{E}[g_i^2]$ (where $\mathbb{E}$ denotes averaging over index $i$). That is, it replaces the originally index-dependent $\sigma_i^2/g_i^2$ with an index-independent average $\mathbb{E}[\sigma_i^2]/\mathbb{E}[g_i^2]$. After this approximation, the result is simplified while still preserving the functional form with respect to $B$.
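As a concrete illustration, the following sketch estimates $\mathcal{B}_{\text{simple}}$ and the mean-field mean $\mu_i$ of equation (4) from a stack of per-sample gradients. The synthetic Gaussian gradients and all numerical values are assumptions for the demo, not part of the derivation.

```python
# Sketch: estimate B_simple = tr(Sigma) / g^T g and the mean-field mean mu_i
# from per-sample gradients. The Gaussian toy data is purely illustrative.
import numpy as np

rng = np.random.default_rng(1)
N, n_samples = 1000, 4096
g_true = rng.normal(0, 0.1, size=N)                    # "true" gradient (toy)
noise_std = rng.uniform(0.5, 1.5, size=N)              # per-component noise scale
grads = g_true + rng.normal(size=(n_samples, N)) * noise_std   # per-sample gradients

g_hat = grads.mean(axis=0)                             # estimate of g
sigma2_hat = grads.var(axis=0)                         # estimate of diag(Sigma)

B_simple = sigma2_hat.sum() / (g_hat @ g_hat)          # B_simple = tr(Sigma) / g^T g

def mu(B):
    """Mean-field approximation (4): E[phi_B]_i = sign(g_i) * (1 + B_simple/B)^(-1/2)."""
    beta = 1.0 / np.sqrt(1.0 + B_simple / B)
    return np.sign(g_hat) * beta

print("B_simple estimate:", B_simple)
print("mu_i at B=64 (first 5 components):", mu(64)[:5])
```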

Next, for the second moment $\mathbb{E}[\tilde{\boldsymbol{\varphi}}_B\tilde{\boldsymbol{\varphi}}_B^{\top}]$, we reintroduce the component independence assumption to simplify the result. It is possible to compute without this assumption, but the result would be more complex and would require other assumptions to simplify the computation, so it's better to directly introduce the independence assumption. Under the independence assumption, $\mathbb{E}[\tilde{\boldsymbol{\varphi}}_B\tilde{\boldsymbol{\varphi}}_B^{\top}]_{i,j}$ is computed in two parts: $i\neq j$ and $i=j$. When $i\neq j$,

(5) \[ \mathbb{E}[\tilde{\boldsymbol{\varphi}}_B\tilde{\boldsymbol{\varphi}}_B^{\top}]_{i,j} = \mathbb{E}[(\tilde{\varphi}_B)_i(\tilde{\varphi}_B)_j] = \mathbb{E}[(\tilde{\varphi}_B)_i]\mathbb{E}[(\tilde{\varphi}_B)_j] \approx \mu_i \mu_j \]

When $i=j$, it's even simpler because the square of $\sign$ is necessarily 1, so its expectation is naturally 1. Therefore, the overall result can be concisely written as $\mathbb{E}[\tilde{\boldsymbol{\varphi}}_B\tilde{\boldsymbol{\varphi}}_B^{\top}]_{i,j}\approx \mu_i\mu_j + \delta_{i,j}(1 - \mu_i\mu_j)$.

Anomalous Phenomenon#

Substituting the above calculation results into equation (1), we obtain:

(6) \[ \eta^* \approx \frac{\sum_i |g_i|}{\frac{1}{\beta}\sum_i H_{i,i} + \beta\sum_{i\neq j} H_{i,j}\sign(g_i g_j)} \]

where $\beta = (1 + \mathcal{B}_{\text{simple}}/B)^{-1/2}$. Note that $\beta$ is monotonically increasing with respect to $B$, and $\beta\in(0,1)$, so $\beta$ can be viewed as a normalized batch size. However, $\eta^*$ is not always monotonic with respect to $\beta$, which could lead to the anomalous behavior where "increasing batch size should actually decrease the learning rate"—termed the "Surge phenomenon" in the original paper.
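To see the possible non-monotonicity concretely, the sketch below evaluates equation (6) on a toy problem. The Hessian is deliberately constructed (an assumption of this example, not something derived in the text) so that $\sum_{i\neq j} H_{i,j}\sign(g_i g_j) > 0$ and the Surge behavior can appear.

```python
# Toy evaluation of equation (6): eta* as a function of the normalized batch
# size beta. The Hessian is built so the off-diagonal term is positive.
import numpy as np

rng = np.random.default_rng(2)
N = 50
g = rng.normal(size=N)
s = np.sign(g)
A = rng.normal(size=(N, N)) / np.sqrt(N)
H = A @ A.T + 2.0 * np.outer(s, s)        # positive definite, with a sign-aligned rank-1 part

diag_term = np.trace(H)                   # sum_i H_ii
off_term = s @ H @ s - diag_term          # sum_{i != j} H_ij sign(g_i g_j)

def eta_star(beta):
    """Equation (6)."""
    return np.abs(g).sum() / (diag_term / beta + beta * off_term)

# beta -> 1 corresponds to B -> infinity; eta* rises and then falls here.
for beta in [0.05, 0.1, 0.2, 0.4, 0.7, 1.0]:
    print(f"beta={beta:.2f}  eta*={eta_star(beta):.5f}")
```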

Let's understand this step by step. When $B\ll \mathcal{B}_{\text{simple}}$, we have $\beta\approx \sqrt{B/\mathcal{B}_{\text{simple}}}$, and thus $\beta \ll 1$. In this case, the $1/\beta$ term in the denominator of equation (6) dominates, giving:

(7) \[ \eta^* \approx \frac{\sum_i |g_i|}{\sum_i H_{i,i}}\beta \approx \frac{\sum_i |g_i|}{\sum_i H_{i,i}}\sqrt{B/\mathcal{B}_{\text{simple}}}\propto \sqrt{B} \]

This indicates that SignSGD's learning rate follows square root scaling for small batch sizes. Since we assume positive definiteness of the Hessian matrix during analysis, we necessarily have $\sum_i H_{i,i} > 0$. Then when $\sum_{i\neq j} H_{i,j}\sign(g_i g_j) \leq 0$, equation (6) is always monotonically increasing with respect to $\beta$, so $\eta^*$ is also monotonically increasing with respect to $B$, showing no anomalous behavior.

When $\sum_{i\neq j} H_{i,j}\sign(g_i g_j) > 0$, using basic inequalities we can find that the denominator of equation (6) has a minimum point at:

(8) \[ \beta^* = \sqrt{\frac{\sum_i H_{i,i}}{\sum_{i\neq j} H_{i,j}\sign(g_i g_j)}} \]

Note that $\beta\in(0, 1)$, so there is an additional condition $\beta^*\in(0, 1)$. In this case, $\eta^*$ is no longer monotonically increasing with respect to $B$, but rather increases first and then decreases. There exists a critical batch size beyond which the learning rate should actually decrease—this is the "Surge phenomenon".
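In code, the condition and the critical point can be checked roughly as follows. The helper name `surge_critical_batch` is hypothetical, and the conversion from $\beta^*$ back to a batch size simply inverts $\beta = (1 + \mathcal{B}_{\text{simple}}/B)^{-1/2}$.

```python
# Sketch: check the Surge condition, compute beta* from equation (8), and map it
# back to a critical batch size via B = B_simple * beta^2 / (1 - beta^2).
import numpy as np

def surge_critical_batch(H, g, B_simple):
    s = np.sign(g)
    diag_term = np.trace(H)                              # sum_i H_ii (> 0 for PD H)
    off_term = s @ H @ s - diag_term                     # sum_{i != j} H_ij sign(g_i g_j)
    if off_term <= 0:
        return None                                      # eta* monotone in B, no Surge
    beta_star = np.sqrt(diag_term / off_term)            # equation (8)
    if beta_star >= 1:
        return None                                      # minimum lies outside (0, 1)
    return B_simple * beta_star**2 / (1 - beta_star**2)  # critical batch size

# Example usage with the toy H, g above and an assumed B_simple:
# B_crit = surge_critical_batch(H, g, B_simple=2048)
```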

Causal Reflection#

Why does this anomalous Surge phenomenon occur? In fact, it reflects an incompatibility between the optimizer's underlying assumptions and our analytical method. Specifically, to estimate the optimal learning rate, we expanded the loss increment to second-order approximation and assumed positive definiteness of the Hessian matrix. Under these settings, the optimal update should be Newton's method: $\boldsymbol{H}^{-1}\boldsymbol{g}$.

From the perspective of Newton's method, different optimizers essentially correspond to different assumptions about the Hessian matrix. For example, SGD corresponds to assuming $\boldsymbol{H}=\eta_{\max}^{-1} \boldsymbol{I}$, while SignSGD corresponds to assuming $\newcommand{diag}{\mathop{\text{diag}}}\boldsymbol{H}=\eta_{\max}^{-1} \diag(|\boldsymbol{g}|)$, though in practice we can only replace $\boldsymbol{g}$ with $\tilde{\boldsymbol{g}}_B$. The Surge phenomenon actually reflects that as $B\to\infty$, the deviation between SignSGD's assumed Hessian matrix and the actual Hessian matrix increases.

We know that today's LLM models have parameters numbering in the billions. Computing either the full Hessian matrix or the covariance matrix is nearly impossible, which is one reason why we introduce the independence assumption when computing the second moment $\mathbb{E}[\tilde{\boldsymbol{\varphi}}_B\tilde{\boldsymbol{\varphi}}_B^{\top}]$—only then does the covariance matrix become diagonal, making estimation feasible. The Hessian matrix is similar; we can often only compute it for specific structures.

For example, substituting $\boldsymbol{H}=\eta_{\max}^{-1} \diag(|\boldsymbol{g}|)$ into equation (6) yields $\eta^*\approx \eta_{\max} \beta = \eta_{\max} / \sqrt{1 + \mathcal{B}_{\text{simple}}/B}$. This form is very concise and shows no anomalous behavior. Does this mean the Surge phenomenon won't occur? Not exactly. The Surge phenomenon objectively exists; the point here is rather that when we do observe the Surge phenomenon in experiments, perhaps the first thing to consider is not correcting how $\eta^*$ varies with $B$, but changing the optimizer.
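As a quick numerical confirmation of this reduction, here is a minimal sketch under the stated assumption $\boldsymbol{H}=\eta_{\max}^{-1}\diag(|\boldsymbol{g}|)$, with toy values for $\boldsymbol{g}$ and $\eta_{\max}$:

```python
# Check: with H = diag(|g|) / eta_max, equation (6) collapses to eta* = eta_max * beta.
import numpy as np

rng = np.random.default_rng(3)
g = rng.normal(size=100)
eta_max = 1e-3
H = np.diag(np.abs(g)) / eta_max

s = np.sign(g)
diag_term = np.trace(H)
off_term = s @ H @ s - diag_term                 # equals 0 here, since H is diagonal

for beta in [0.1, 0.5, 0.9]:
    eta = np.abs(g).sum() / (diag_term / beta + beta * off_term)   # equation (6)
    print(f"beta={beta:.1f}  eta*={eta:.6e}  eta_max*beta={eta_max * beta:.6e}")
```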

Loss Variation#

With $\mathbb{E}[\tilde{\boldsymbol{\varphi}}_B]$ and $\mathbb{E}[\tilde{\boldsymbol{\varphi}}_B\tilde{\boldsymbol{\varphi}}_B^{\top}]$, we can also compute $\overline{\Delta\mathcal{L}}$ as in the previous article. Interestingly, it has the same format as the SGD result:

(9) \[ \overline{\Delta\mathcal{L}} = \mathcal{L}(\boldsymbol{w}) - \mathbb{E}[\mathcal{L}(\boldsymbol{w} - \eta^*\tilde{\boldsymbol{g}}_B)] \approx \frac{(\mathbb{E}[\tilde{\boldsymbol{\varphi}}_B]^{\top}\boldsymbol{g})^2}{2\tr(\mathbb{E}[\tilde{\boldsymbol{\varphi}}_B\tilde{\boldsymbol{\varphi}}_B^{\top}]\boldsymbol{H})}\approx \frac{\Delta\mathcal{L}_{\max}}{1 + \mathcal{B}_{\text{noise}}/B} \]

where:

(10) \[ \Delta\mathcal{L}_{\max} = \frac{\frac{1}{2}(\sum_i |g_i|)^2}{\sum_i H_{i,i} + \sum_{i\neq j} H_{i,j}\sign(g_i g_j)},\quad \mathcal{B}_{\text{noise}} = \frac{\mathcal{B}_{\text{simple}}\sum_i H_{i,i}}{\sum_i H_{i,i} + \sum_{i\neq j} H_{i,j}\sign(g_i g_j)} \]

Note that this retains the full Hessian matrix, so the result is actually quite interesting—although the learning rate $\eta^*$ may exhibit the Surge phenomenon, the average loss increment does not show this phenomenon. It is always monotonically increasing with respect to $B$ and maintains the same form as SGD, meaning we can derive the same "training data volume–training steps" relationship:

(11) \[ \left(\frac{S}{S_{\min}} - 1\right)\left(\frac{E}{E_{\min}} - 1\right) = 1 \]
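The quantities in equations (9) and (10) are easy to evaluate numerically. The sketch below is a minimal illustration; the function names are hypothetical, and the commented usage assumes the toy `H`, `g` and a made-up $\mathcal{B}_{\text{simple}}$ from the earlier snippets.

```python
# Sketch: evaluate equation (10) and the mean loss decrement (9), which stays
# monotone in B even when eta* itself is not.
import numpy as np

def loss_decrement_params(H, g, B_simple):
    s = np.sign(g)
    diag_term = np.trace(H)
    off_term = s @ H @ s - diag_term
    denom = diag_term + off_term
    dL_max = 0.5 * np.abs(g).sum() ** 2 / denom          # equation (10), left
    B_noise = B_simple * diag_term / denom               # equation (10), right
    return dL_max, B_noise

def mean_loss_decrement(B, dL_max, B_noise):
    return dL_max / (1.0 + B_noise / B)                  # equation (9)

# Example usage with the earlier toy H, g and an assumed B_simple:
# dL_max, B_noise = loss_decrement_params(H, g, B_simple=2048)
# for B in [64, 256, 1024, 4096]:
#     print(B, mean_loss_decrement(B, dL_max, B_noise))
```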

A more thought-provoking question is: why do SGD and SignSGD have completely different update magnitudes, including markedly different behaviors of the learning rate $\eta^*$, yet $\overline{\Delta\mathcal{L}}$ takes the same form with respect to $B$? Is this merely a coincidence, or is there a deeper principle behind it?

General Patterns#

Starting again from the mean field approximation, I obtained an answer leaning toward the latter. Whether for $\eta^*$ or $\overline{\Delta\mathcal{L}}$, the core difficulty lies in computing $\mathbb{E}[\tilde{\boldsymbol{\varphi}}_B]$ and $\mathbb{E}[\tilde{\boldsymbol{\varphi}}_B\tilde{\boldsymbol{\varphi}}_B^{\top}]$, so our goal is to explore unified computational patterns for both.

We generally set $\tilde{\boldsymbol{\varphi}}_B=\tilde{\boldsymbol{H}}{}_B^{-1}\tilde{\boldsymbol{g}}_B$, where $\tilde{\boldsymbol{H}}_B$ is some positive semidefinite matrix. Then we can write:

(12) \[ \mathbb{E}[\tilde{\boldsymbol{\varphi}}_B] = \mathbb{E}[\tilde{\boldsymbol{H}}{}_B^{-1}\tilde{\boldsymbol{g}}_B]\approx \underbrace{\mathbb{E}[\tilde{\boldsymbol{H}}_B]^{-1}}_{\text{denoted }\hat{\boldsymbol{H}}{}^{-1}}\mathbb{E}[\tilde{\boldsymbol{g}}_B] = \hat{\boldsymbol{H}}{}^{-1}\boldsymbol{g} \]

and:

(13) \[ \mathbb{E}[\tilde{\boldsymbol{\varphi}}_B\tilde{\boldsymbol{\varphi}}_B^{\top}] = \mathbb{E}[\tilde{\boldsymbol{H}}{}_B^{-1}\tilde{\boldsymbol{g}}_B\tilde{\boldsymbol{g}}_B^{\top}\tilde{\boldsymbol{H}}{}_B^{-1}]\approx \mathbb{E}[\tilde{\boldsymbol{H}}_B]^{-1}\mathbb{E}[\tilde{\boldsymbol{g}}_B\tilde{\boldsymbol{g}}_B^{\top}]\mathbb{E}[\tilde{\boldsymbol{H}}_B]^{-1} = \hat{\boldsymbol{H}}{}^{-1}(\boldsymbol{g}\boldsymbol{g}^{\top} + \boldsymbol{\Sigma}/B)\hat{\boldsymbol{H}}{}^{-1} \]

Substituting into the expression for $\overline{\Delta\mathcal{L}}$, we obtain:

(14) \[ \overline{\Delta\mathcal{L}} \approx \frac{1}{2}\frac{(\boldsymbol{g}^{\top}\hat{\boldsymbol{H}}{}^{-1}\boldsymbol{g})^2}{\boldsymbol{g}^{\top}\hat{\boldsymbol{H}}{}^{-1}\boldsymbol{H}\hat{\boldsymbol{H}}{}^{-1}\boldsymbol{g} + \tr(\boldsymbol{\Sigma}\hat{\boldsymbol{H}}{}^{-1}\boldsymbol{H}\hat{\boldsymbol{H}}{}^{-1})/B} \]

Note that the above expression is homogeneous with respect to $\hat{\boldsymbol{H}}$. If we assume that the dependence of $\hat{\boldsymbol{H}}$ on $B$ can be separated out in scalar form, say $\hat{\boldsymbol{H}}\approx f(B) \boldsymbol{G}$, where $f(B)$ is a scalar function of $B$ and $\boldsymbol{G}$ has no obvious dependence on $B$, then $f(B)$ cancels between numerator and denominator. The relationship with $B$ can ultimately be organized into the following form:

(15) \[ \overline{\Delta\mathcal{L}} \approx \frac{\Delta\mathcal{L}_{\max}}{1 + \mathcal{B}_{\text{noise}}/B} \]

This proves that $\overline{\Delta\mathcal{L}}$ has the same asymptotic pattern with respect to $B$, with the core being the homogeneity with respect to $\hat{\boldsymbol{H}}$. In contrast, $\eta^*$ does not have such a unified result because it is not homogeneous with respect to $\hat{\boldsymbol{H}}$.
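The homogeneity argument can also be checked numerically. Under the assumption $\hat{\boldsymbol{H}}=f(B)\boldsymbol{G}$ (with arbitrary toy choices of $f$, $\boldsymbol{G}$, $\boldsymbol{H}$, $\boldsymbol{\Sigma}$ below), fitting $\Delta\mathcal{L}_{\max}$ and $\mathcal{B}_{\text{noise}}$ from two batch sizes should exactly reproduce equation (14) at a third batch size:

```python
# Check of the homogeneity argument: when Hhat = f(B) * G, the general
# expression (14) is exactly of the form dL_max / (1 + B_noise / B),
# whatever the scalar function f(B) is.
import numpy as np

rng = np.random.default_rng(4)
N = 30
g = rng.normal(size=N)

def rand_pd():
    A = rng.normal(size=(N, N)) / np.sqrt(N)
    return A @ A.T + 0.1 * np.eye(N)

G, H, Sigma = rand_pd(), rand_pd(), rand_pd()
f = lambda B: 1.0 + 3.0 / np.sqrt(B)                # arbitrary scalar f(B)

def dL(B):
    Hhat_inv = np.linalg.inv(f(B) * G)              # f(B) cancels by homogeneity
    M = Hhat_inv @ H @ Hhat_inv
    num = (g @ Hhat_inv @ g) ** 2
    den = g @ M @ g + np.trace(Sigma @ M) / B
    return 0.5 * num / den                          # equation (14)

# Fit the closed form (15) from two batch sizes, then test it at a third one.
B1, B2, B3 = 8.0, 512.0, 64.0
y1, y2 = dL(B1), dL(B2)
B_noise = (y2 - y1) / (y1 / B1 - y2 / B2)
dL_max = y1 * (1 + B_noise / B1)
print("predicted:", dL_max / (1 + B_noise / B3), " actual:", dL(B3))
```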

Validity Analysis#

By this point, I believe everyone has some understanding of the mean field method. Its main characteristic is computational simplicity; more fundamentally, mean field theory always pushes the calculation in whichever direction is simple and computable, which gives it great flexibility. That flexibility is also a drawback: it is hard to pin down rules for what the next step should be.

As for explaining why the approach is valid, that is even harder; it can only be analyzed case by case, and sometimes even a case-by-case analysis is difficult. My feeling is that the mean field method is three parts computation, three parts luck, three parts intuition, plus one part metaphysics. Still, there is no harm in trying. Let's take the earlier SignSGD computation as an example and attempt some analysis.

Validity Analysis of Mean Field Approximation

Clearly, the core computation for SignSGD is $\mathbb{E}[\sign(x)]$. We denote $\mathbb{E}[x]=\mu,\mathbb{E}[x^2]=\mu^2 + \sigma^2$, and then write:

\[ \sign(x) = \frac{x}{\sqrt{x^2}} = \frac{x}{\sqrt{\mu^2 + \sigma^2 + (x^2 - \mu^2 - \sigma^2)}} \]

Assuming $x^2 - \mu^2 - \sigma^2$ is small, we perform a Taylor expansion:

\[ \sign(x) = \frac{x}{\sqrt{\mu^2 + \sigma^2}} - \frac{1}{2}\frac{x(x^2 - \mu^2 - \sigma^2)}{(\mu^2 + \sigma^2)^{3/2}} + \frac{3}{8}\frac{x(x^2 - \mu^2 - \sigma^2)^2}{(\mu^2 + \sigma^2)^{5/2}}-\cdots \]

Now the denominators are independent of $x$, and the numerators are polynomials in $x$, so taking expectations on both sides, the first term is the mean field approximation result $\mu/\sqrt{\mu^2 + \sigma^2}$. To examine the reasonableness of the mean field approximation, we compute the second term:

\[ \frac{1}{2}\frac{\mathbb{E}[x(x^2 - \mu^2 - \sigma^2)]}{(\mu^2 + \sigma^2)^{3/2}} = \frac{1}{2}\frac{\mathbb{E}[x^3] - (\mu^3 + \mu\sigma^2)}{(\mu^2 + \sigma^2)^{3/2}} \]

This involves $\mathbb{E}[x^3]$, a new statistical quantity that is the key factor in the mean field error. We can use the normal distribution $\mathcal{N}(x;\mu,\sigma^2)$ to get a sense of this: here $\mathbb{E}[x^3]=\mu^3 + 3\mu\sigma^2$. Substituting into the above gives:

\[ \frac{\mu\sigma^2}{(\mu^2 + \sigma^2)^{3/2}} = \frac{\sigma^2/\mu^2}{(1 + \sigma^2/\mu^2)^{3/2}} \]

The right side is a bounded expression, attaining its maximum $2/3^{3/2}=0.3849\cdots$ at $\sigma^2/\mu^2=2$. This indicates that the error of the mean field approximation is likely finite, and the error term tends to 0 in both limits $\sigma\to 0$ and $\sigma\to\infty$, which to some extent demonstrates the usability of the mean field approximation.
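In the Gaussian case the exact value $\mathbb{E}[\sign(x)]=\text{erf}\big(\mu/(\sqrt{2}\sigma)\big)$ is available, so the comparison can be made explicit. The sketch below prints the exact value, the mean field value, and the second-order term analyzed above; note that this term is only the leading correction of the expansion, not the full error.

```python
# Compare the mean field value mu / sqrt(mu^2 + sigma^2) with the exact
# E[sign(x)] = erf(mu / (sigma * sqrt(2))) for Gaussian x, and print the
# second-order correction term (maximized at sigma^2/mu^2 = 2).
import numpy as np
from scipy.special import erf

mu = 1.0
for ratio in [0.1, 0.5, 1.0, 2.0, 5.0, 20.0]:        # ratio = sigma^2 / mu^2
    sigma = mu * np.sqrt(ratio)
    exact = erf(mu / (sigma * np.sqrt(2)))
    mean_field = mu / np.sqrt(mu**2 + sigma**2)
    corr = ratio / (1 + ratio) ** 1.5                # second-order term, <= 2/3^(3/2)
    print(f"sigma^2/mu^2={ratio:5.1f}  exact={exact:.4f}  MF={mean_field:.4f}  2nd-order={corr:.4f}")
```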

Generalized Approximation#

One reason for choosing to analyze SignSGD is that we typically use it as a theoretical approximation for Adam. In "How Does Adam's Epsilon Affect Learning Rate Scaling Laws?", we computed a theoretically better approximation: SoftSignSGD, which considers the effect of $\epsilon$.

(16) \[ \sign(x)=\frac{x}{\sqrt{x^2}}\quad\to\quad\newcommand{softsign}{\mathop{\text{softsign}}}\softsign(x)=\frac{x}{\sqrt{x^2+\epsilon^2}} \]

In this case, $\tilde{\boldsymbol{\varphi}}_B = \softsign(\tilde{\boldsymbol{g}}_B)$. Let's proceed directly to the main topic:

(17) \[ \begin{aligned} &\,\mathbb{E}[\tilde{\boldsymbol{\varphi}}_B]_i = \mathbb{E}[\softsign((\tilde{g}_B)_i)] = \mathbb{E}\bigg[\frac{(\tilde{g}_B)_i}{\sqrt{(\tilde{g}_B)_i^2 + \epsilon^2}}\bigg]\approx \frac{\mathbb{E}[(\tilde{g}_B)_i]}{\sqrt{\mathbb{E}[(\tilde{g}_B)_i^2]+ \epsilon^2}} \\[8pt] =&\, \frac{g_i}{\sqrt{g_i^2 + \sigma_i^2/B + \epsilon^2}} = \frac{\softsign(g_i)}{\sqrt{1 + \sigma_i^2/(g_i^2 + \epsilon^2)/B}}\approx \frac{\softsign(g_i)}{\sqrt{1 + \mathcal{B}_{\text{simple}}/B}}\triangleq \nu_i\beta \end{aligned} \]

Here, $\mathcal{B}_{\text{simple}}$ is slightly different: it is $\tr(\boldsymbol{\Sigma})/(\boldsymbol{g}^{\top}\boldsymbol{g} + N\epsilon^2)$, where $N$ is the total number of model parameters, i.e., $\boldsymbol{g}\in\mathbb{R}^N$. As for the final terms: $\nu_i=\softsign(g_i)$ and $\beta = (1 + \mathcal{B}_{\text{simple}}/B)^{-1/2}$. Next, compute $\mathbb{E}[\tilde{\boldsymbol{\varphi}}_B\tilde{\boldsymbol{\varphi}}_B^{\top}]$. Under the independence assumption, when $i\neq j$ we can still compute the means separately, so $\mathbb{E}[\tilde{\boldsymbol{\varphi}}_B\tilde{\boldsymbol{\varphi}}_B^{\top}]_{i,j}=\nu_i \nu_j \beta^2$. Thus we only need to compute the case $i=j$:

(18) \[ \begin{aligned} &\,\mathbb{E}[\tilde{\boldsymbol{\varphi}}_B\tilde{\boldsymbol{\varphi}}_B^{\top}]_{i,i} = \mathbb{E}[\softsign((\tilde{g}_B)_i)^2] = \mathbb{E}\bigg[\frac{(\tilde{g}_B)_i^2}{(\tilde{g}_B)_i^2 + \epsilon^2}\bigg]\approx \frac{\mathbb{E}[(\tilde{g}_B)_i^2]}{\mathbb{E}[(\tilde{g}_B)_i^2]+ \epsilon^2} \\[8pt] =&\, \frac{g_i^2 + \sigma_i^2/B}{g_i^2 + \sigma_i^2/B + \epsilon^2} = 1 - \frac{1 - \softsign(g_i)^2}{1 + \sigma_i^2/(g_i^2 + \epsilon^2)/B}\approx 1 - \frac{1 - \softsign(g_i)^2}{1 + \mathcal{B}_{\text{simple}}/B} \end{aligned} \]

This can be uniformly written as $\mathbb{E}[\tilde{\boldsymbol{\varphi}}_B\tilde{\boldsymbol{\varphi}}_B^{\top}]_{i,j}\approx \nu_i \nu_j\beta^2 + \delta_{i,j}(1-\beta^2)$. Therefore:

(19) \[ \eta^* \approx \frac{\mathbb{E}[\tilde{\boldsymbol{\varphi}}_B]^{\top}\boldsymbol{g}}{\tr(\mathbb{E}[\tilde{\boldsymbol{\varphi}}_B\tilde{\boldsymbol{\varphi}}_B^{\top}]\boldsymbol{H})} \approx \frac{\beta\sum_i \nu_i g_i}{\sum_i H_{i,i} + \beta^2(\sum_{i,j} \nu_i \nu_j H_{i,j} - \sum_i H_{i,i})} \]

Except for $\beta$, every other part of the above expression is independent of $B$, so we have obtained the explicit relationship between $\eta^*$ and $B$, whose form is broadly similar to that of SignSGD. The remaining analysis can follow "How Does Adam's Epsilon Affect Learning Rate Scaling Laws?" or the preceding sections.
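For reference, here is a minimal sketch of evaluating equation (19). The inputs ($\boldsymbol{H}$, $\boldsymbol{g}$, the diagonal of $\boldsymbol{\Sigma}$, $\epsilon$) are toy values, and $\mathcal{B}_{\text{simple}}$ uses the $\epsilon$-corrected definition given above.

```python
# Sketch: evaluate equation (19) for SoftSignSGD with toy inputs, using the
# epsilon-corrected B_simple = tr(Sigma) / (g^T g + N * eps^2).
import numpy as np

def eta_star_softsign(H, g, sigma2, eps, B):
    N = g.shape[0]
    B_simple = sigma2.sum() / (g @ g + N * eps**2)
    beta = 1.0 / np.sqrt(1.0 + B_simple / B)
    nu = g / np.sqrt(g**2 + eps**2)                   # softsign(g_i)
    diag_term = np.trace(H)                           # sum_i H_ii
    full_term = nu @ H @ nu                           # sum_{i,j} nu_i nu_j H_ij
    return beta * (nu @ g) / (diag_term + beta**2 * (full_term - diag_term))

# Example with toy values:
rng = np.random.default_rng(5)
N = 100
g = rng.normal(0, 0.05, size=N)
sigma2 = rng.uniform(0.5, 1.5, size=N)
A = rng.normal(size=(N, N)) / np.sqrt(N)
H = A @ A.T + 0.1 * np.eye(N)
for B in [32, 256, 2048, 16384]:
    print(B, eta_star_softsign(H, g, sigma2, eps=1e-2, B=B))
```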

Summary#

In this article, we used mean field approximation to recompute the conclusions for SignSGD and SoftSignSGD, greatly simplifying the related computational processes, and preliminarily considered the general patterns of these computations.

Citation Information

Original Article: Su Jianlin. Rethinking Learning Rate and Batch Size (Part 2): Mean Field Theory. Scientific Spaces.

How to cite this translation:

Su, J. Rethinking Learning Rate and Batch Size (Part 2): Mean Field Theory [Translated by Juanxi Tian]. Scientific Spaces.

BibTeX:

@article{su2025rethinking_lr_bs_part2,
  title   = {Rethinking Learning Rate and Batch Size (Part 2): Mean Field Theory},
  author  = {Su, Jianlin},
  journal = {Scientific Spaces},
  year    = {2025},
  url     = {https://kexue.fm/archives/11280},
  note    = {Translated by Juanxi Tian (ScalingOpt Team)}
}