With the rapid advancement of computing power, more and more scenarios hope to trade "computing power for time," i.e., to shorten model training time by stacking computing resources. Ideally, we hope that by investing $n$ times the computing power, the time needed to reach the same result is reduced to $1/n$, so that the total computing cost stays the same. This hope seems reasonable and natural, but it is in fact not trivial. Even if we set aside bottlenecks such as communication, once computing power exceeds a certain scale or the model is smaller than a certain size, adding computing power often amounts to nothing more than increasing the Batch Size. But does increasing the Batch Size necessarily shorten training time while maintaining the same effect?
This is precisely the topic we will discuss next: when Batch Size increases, how should various hyperparameters, especially the learning rate, be adjusted to maintain the original training effect while maximizing training efficiency? We can also refer to this as the Scaling Law between Batch Size and learning rate.
Variance Perspective#
Intuitively, when the Batch Size increases, the gradient of each batch becomes more accurate, so we can take larger steps, i.e., increase the learning rate, to reach the destination sooner and thereby shorten training time. This much is easy to understand. The question is: how large an increase is most appropriate?
The relationship between batch size and learning rate has been a fundamental question in deep learning optimization since the early days of large-scale training. The debate between square root scaling and linear scaling reflects different theoretical assumptions about gradient noise and optimization dynamics.
Square Root Scaling#
The earliest answer to this question is probably square root scaling, i.e., when the Batch Size is scaled up by a factor of $n$, the learning rate is scaled up by $\sqrt{n}$; it comes from the 2014 paper "One weird trick for parallelizing convolutional neural networks". The underlying principle is to keep the variance of the SGD increment constant. Specifically, we denote the gradient computed from a single randomly sampled example as $\tilde{\boldsymbol{g}}$, with mean and covariance denoted $\boldsymbol{g}$ and $\boldsymbol{\Sigma}$ respectively, where $\boldsymbol{g}$ is the gradient over the entire dataset. When we increase the number of samples to $B$, we have:
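$$\tilde{\boldsymbol{g}}_B \triangleq \frac{1}{B}\sum_{i=1}^B \tilde{\boldsymbol{g}}^{(i)},\qquad \mathbb{E}[\tilde{\boldsymbol{g}}_B] = \boldsymbol{g},\qquad \mathrm{Cov}[\tilde{\boldsymbol{g}}_B] = \frac{\boldsymbol{\Sigma}}{B}$$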
That is, increasing the number of samples does not change the mean, but reduces the covariance to $1/B$. For the SGD optimizer, the increment is $-\eta \tilde{\boldsymbol{g}}_B$, and its covariance is proportional to $\eta^2/B$. We believe that an appropriate amount of (neither too much nor too little) noise is necessary during optimization. Therefore, when Batch Size $B$ changes, we adjust the learning rate $\eta$ to keep the noise intensity of the increment, i.e., the covariance matrix, constant, leading to:
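$$\frac{\eta^2}{B} = \text{const}\quad\Rightarrow\quad \eta\propto\sqrt{B}$$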
This yields the square root scaling law between learning rate and Batch Size. The later paper "Train longer, generalize better: closing the generalization gap in large batch training of neural networks" also endorsed this choice.
Linear Scaling#
Interestingly, linear scaling, i.e., $\eta\propto B$, often performs better in practice. Even the author of the aforementioned paper that first proposed square root scaling, "One weird trick for parallelizing convolutional neural networks", pointed this out in the paper and stated that he could not provide a reasonable explanation.
To some extent, linear scaling aligns more with our intuitive understanding, especially as in "Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour", which assumes that the gradient directions of consecutive $n$ batches do not change much, making linear scaling almost obviously valid. However, this assumption is clearly too strong. Relaxing this assumption requires connecting SGD with SDEs (Stochastic Differential Equations), which was accomplished by "Stochastic Modified Equations and Dynamics of Stochastic Gradient Algorithms I: Mathematical Foundations", but the first paper to point out the scaling relationship between learning rate and Batch Size was likely "On the Generalization Benefit of Noise in Stochastic Gradient Descent".
In retrospect, establishing this connection is not difficult to understand. Let the model parameters be $\boldsymbol{w}$, then the SGD update rule can be rewritten as:
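$$\boldsymbol{w}_{t+1} = \boldsymbol{w}_t - \eta\tilde{\boldsymbol{g}}_{B,t} = \boldsymbol{w}_t - \eta\boldsymbol{g}_t - \eta\left(\tilde{\boldsymbol{g}}_{B,t} - \boldsymbol{g}_t\right)$$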
where $\tilde{\boldsymbol{g}}_{B,t} - \boldsymbol{g}_t$ is the gradient noise. So far, we have not made any assumptions about the distribution of this noise, only knowing that its mean is $\boldsymbol{0}$ and its covariance is $\boldsymbol{\Sigma}_t/B$. Next, we assume that this noise follows a normal distribution $\mathcal{N}(\boldsymbol{0},\boldsymbol{\Sigma}_t/B)$, then the iteration can be further rewritten as:
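$$\boldsymbol{w}_{t+1} = \boldsymbol{w}_t - \eta\boldsymbol{g}_t - \sqrt{\eta}\cdot\sqrt{\frac{\eta}{B}}\,\boldsymbol{\Sigma}_t^{1/2}\boldsymbol{z}_t,\qquad \boldsymbol{z}_t\sim\mathcal{N}(\boldsymbol{0},\boldsymbol{I})$$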
The critical step here is that the step size of the noise term in the SDE is the square root of the non-noise term, thus separating out a $\sqrt{\eta}$ term. We have also analyzed this in "Generative Diffusion Model Discussion (5): General Framework of SDEs". Simply put, zero-mean Gaussian noise tends to cancel out over the long term, so the step size must be increased to manifest the noise effect.
This means that the SGD iteration format $\boldsymbol{w}_{t+1} =\boldsymbol{w}_t - \eta \tilde{\boldsymbol{g}}_{B,t}$ is actually approximately solving the SDE:
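$$d\boldsymbol{w} = -\boldsymbol{g}(\boldsymbol{w})\,dt + \sqrt{\frac{\eta}{B}}\,\boldsymbol{\Sigma}^{1/2}(\boldsymbol{w})\,d\boldsymbol{W}_t$$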
Therefore, to ensure that the running results do not change significantly when $B$ changes, the form of the above SDE should remain unchanged, leading to linear scaling $\eta\propto B$.
The above conclusions are all based on the SGD optimizer. The paper "On the SDEs and Scaling Rules for Adaptive Gradient Algorithms" extended this to RMSProp, Adam, and other optimizers, resulting in square root scaling. Coincidentally, the slightly earlier "Large Batch Optimization for Deep Learning: Training BERT in 76 minutes" also applied square root scaling when testing Adam and its variant LAMB. More content can also be found in the blog "How to Scale Hyperparameters as Batch Size Increases".
- Square Root Scaling ($\eta \propto \sqrt{B}$): based on keeping the variance of the SGD increment constant; in practice it fits adaptive optimizers such as Adam and RMSProp well.
- Linear Scaling ($\eta \propto B$): based on the SDE analysis of SGD; works well for SGD, especially when consecutive batches have correlated gradients.
- Bounded Monotonic: based on analyzing the loss function itself; the optimal learning rate increases with Batch Size but saturates at a maximum value.
Loss Function Perspective#
It is certain that both square root scaling and linear scaling can only be approximately valid within a local range, because they both contain the conclusion that "as long as Batch Size is sufficiently large, the learning rate can be arbitrarily large," which is obviously impossible. Additionally, the work in the previous two sections revolves around variance, but our fundamental task is to reduce the loss function. Therefore, a loss function-oriented approach may be more fundamental.
Monotonic with Bound#
The classic work from this perspective is OpenAI's "An Empirical Model of Large-Batch Training", which analyzes the optimal learning rate of SGD through the second-order approximation of the loss function, concluding that "the learning rate increases monotonically with Batch Size but has an upper bound." The same idea also appeared in the slightly earlier "Dissecting Adam: The Sign, Magnitude and Variance of Stochastic Gradients", although that paper was not used to analyze the effect of Batch Size.
The most crucial idea in the entire derivation process is to treat the learning rate as an optimization parameter: let the loss function be $\mathcal{L}(\boldsymbol{w})$, the current batch gradient be $\tilde{\boldsymbol{g}}_B$, then the loss function after SGD is $\mathcal{L}(\boldsymbol{w} - \eta\tilde{\boldsymbol{g}}_B)$. We formulate the solution of the optimal learning rate as an optimization problem:
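$$\eta^* = \mathop{\text{argmin}}_{\eta}\,\mathbb{E}\big[\mathcal{L}(\boldsymbol{w} - \eta\tilde{\boldsymbol{g}}_B)\big]$$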
This objective is obviously intuitive: choose the learning rate so that on average the training efficiency is highest (the loss function decreases the fastest). To solve this problem, we approximate the loss function with a second-order expansion:
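$$\mathcal{L}(\boldsymbol{w} - \eta\tilde{\boldsymbol{g}}_B)\approx \mathcal{L}(\boldsymbol{w}) - \eta\,\tilde{\boldsymbol{g}}_B^{\top}\frac{\partial\mathcal{L}(\boldsymbol{w})}{\partial\boldsymbol{w}} + \frac{\eta^2}{2}\,\tilde{\boldsymbol{g}}_B^{\top}\boldsymbol{H}\tilde{\boldsymbol{g}}_B$$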
Here $\boldsymbol{H}$ is the Hessian matrix, and $\frac{\partial \mathcal{L}(\boldsymbol{w})}{\partial\boldsymbol{w}}$ is the gradient of the loss function. The ideal objective function is based on the full dataset, which is why its gradient is the mean $\boldsymbol{g}$ of $\tilde{\boldsymbol{g}}_B$. Next, taking the expectation, we get:
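$$\mathbb{E}\big[\mathcal{L}(\boldsymbol{w} - \eta\tilde{\boldsymbol{g}}_B)\big]\approx \mathcal{L}(\boldsymbol{w}) - \eta\,\boldsymbol{g}^{\top}\boldsymbol{g} + \frac{\eta^2}{2}\,\mathbb{E}\big[\tilde{\boldsymbol{g}}_B^{\top}\boldsymbol{H}\tilde{\boldsymbol{g}}_B\big]$$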
The last term requires a slight trick:
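$$\mathbb{E}\big[\tilde{\boldsymbol{g}}_B^{\top}\boldsymbol{H}\tilde{\boldsymbol{g}}_B\big] = \mathbb{E}\big[\tr\big(\boldsymbol{H}\tilde{\boldsymbol{g}}_B\tilde{\boldsymbol{g}}_B^{\top}\big)\big] = \tr\big(\boldsymbol{H}\,\mathbb{E}[\tilde{\boldsymbol{g}}_B\tilde{\boldsymbol{g}}_B^{\top}]\big) = \boldsymbol{g}^{\top}\boldsymbol{H}\boldsymbol{g} + \frac{\tr(\boldsymbol{H}\boldsymbol{\Sigma})}{B}$$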
The transformation mainly utilizes $\tr(\boldsymbol{A}\boldsymbol{B}) = \tr(\boldsymbol{B}\boldsymbol{A})$. Now, assuming the positive definiteness of $\boldsymbol{H}$, the problem becomes minimizing a quadratic function, easily solving for:
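$$\eta^* = \frac{\boldsymbol{g}^{\top}\boldsymbol{g}}{\boldsymbol{g}^{\top}\boldsymbol{H}\boldsymbol{g} + \tr(\boldsymbol{H}\boldsymbol{\Sigma})/B}$$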
We can rewrite this as:
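$$\eta^* = \frac{\eta_{\max}}{1 + \mathcal{B}_{\text{noise}}/B}\tag{11}$$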
This yields the result of "monotonically increasing with $B$ but having an upper bound," where:
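$$\eta_{\max} = \frac{\boldsymbol{g}^{\top}\boldsymbol{g}}{\boldsymbol{g}^{\top}\boldsymbol{H}\boldsymbol{g}},\qquad \mathcal{B}_{\text{noise}} = \frac{\tr(\boldsymbol{\Sigma}\boldsymbol{H})}{\boldsymbol{g}^{\top}\boldsymbol{H}\boldsymbol{g}}$$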
Practical Analysis#
When $B \ll \mathcal{B}_{\text{noise}}$, $1 + \mathcal{B}_{\text{noise}}/B\approx \mathcal{B}_{\text{noise}}/B$, so $\eta^* \approx \eta_{\max}B/\mathcal{B}_{\text{noise}}\propto B$, i.e., linear scaling. This again shows that linear scaling is only a local approximation for small Batch Sizes; when $B > \mathcal{B}_{\text{noise}}$, $\eta^*$ gradually saturates to the maximum value $\eta_{\max}$, meaning that the increase in training cost far outweighs the improvement in training efficiency. Therefore, $\mathcal{B}_{\text{noise}}$ acts as a watershed: when Batch Size exceeds this value, there is no need to continue investing computing power to increase Batch Size.
- Estimating $\mathcal{B}_{\text{simple}}$: since $\mathcal{B}_{\text{noise}}$ is computationally expensive (it requires the Hessian), we use the approximation $\mathcal{B}_{\text{simple}} = \frac{\tr(\boldsymbol{\Sigma})}{\boldsymbol{g}^{\top}\boldsymbol{g}}$, which only requires gradient variances, not the full Hessian.
- Estimating $\eta_{\max}$: perform a grid search over the learning rate at a small Batch Size to find an approximate $\eta^*$, then combine it with the estimated $\mathcal{B}_{\text{simple}}$ to back-calculate $\eta_{\max}$.
- Monitoring: since these quantities change during training, it is best to compute $\mathcal{B}_{\text{simple}}$ after the model has entered a training "steady state," or to monitor it continuously throughout training.
For practice, the most critical question is undoubtedly how to estimate $\eta_{\max}$ and $\mathcal{B}_{\text{noise}}$, especially $\mathcal{B}_{\text{noise}}$, which directly relates to the scaling law of learning rate and the saturation of training efficiency. Direct computation of both involves the Hessian matrix $\boldsymbol{H}$, with computational cost proportional to the square of the number of parameters. In today's era where models with hundreds of millions of parameters are considered small, computing the Hessian matrix is clearly impractical, so more efficient computational methods must be sought.
Let's first look at $\mathcal{B}_{\text{noise}}$. Its expression is $\frac{\tr(\boldsymbol{\Sigma}\boldsymbol{H})}{\boldsymbol{g}^{\top}\boldsymbol{H}\boldsymbol{g}}$, with $\boldsymbol{H}$ in both numerator and denominator, giving us an impulse to "cancel them out." Indeed, the simplification idea is the same: assuming $\boldsymbol{H}$ is approximately some multiple of the identity matrix, we get:
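$$\mathcal{B}_{\text{noise}} = \frac{\tr(\boldsymbol{\Sigma}\boldsymbol{H})}{\boldsymbol{g}^{\top}\boldsymbol{H}\boldsymbol{g}} \approx \frac{\tr(\boldsymbol{\Sigma})}{\boldsymbol{g}^{\top}\boldsymbol{g}} \triangleq \mathcal{B}_{\text{simple}}$$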
$\mathcal{B}_{\text{simple}}$ is more feasible to compute, and experiments show it is usually a good approximation of $\mathcal{B}_{\text{noise}}$, so we choose to estimate $\mathcal{B}_{\text{simple}}$ rather than $\mathcal{B}_{\text{noise}}$. Note that $\tr(\boldsymbol{\Sigma})$ only requires diagonal elements, so the full covariance matrix does not need to be computed; only the variance of each gradient component needs to be calculated separately and then summed. In data parallel scenarios, the gradient computed on each device can be directly used to estimate the gradient variance.
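To make this concrete, here is a minimal NumPy sketch of the idea (the function name and arguments are invented for illustration, not taken from the paper): with data parallelism over several replicas, the per-coordinate variance across replica gradients estimates $\boldsymbol{\Sigma}/b$ for micro-batch size $b$, and the squared norm of the averaged gradient stands in for $\boldsymbol{g}^{\top}\boldsymbol{g}$.

```python
import numpy as np

def estimate_b_simple(replica_grads, micro_batch_size):
    """Rough estimate of B_simple = tr(Sigma) / (g^T g) from data-parallel gradients.

    replica_grads: array of shape (num_replicas, num_params); the gradient each
        worker computed on its own micro-batch of `micro_batch_size` samples.
    """
    grads = np.asarray(replica_grads)
    mean_grad = grads.mean(axis=0)
    # Across replicas, each coordinate has variance ~ Sigma_ii / micro_batch_size,
    # so tr(Sigma) is roughly micro_batch_size times the summed per-coordinate variance.
    tr_sigma = micro_batch_size * grads.var(axis=0, ddof=1).sum()
    # |mean_grad|^2 is a (slightly upward-biased) estimate of g^T g.
    g_sq = float(mean_grad @ mean_grad)
    return tr_sigma / g_sq

# Toy check: 8 replicas, 1000 parameters, micro-batches of 32 samples each.
rng = np.random.default_rng(0)
true_g = rng.normal(size=1000)
noisy_grads = true_g + rng.normal(scale=0.3, size=(8, 1000))
print(estimate_b_simple(noisy_grads, micro_batch_size=32))
```

In a real data-parallel run, `replica_grads` would be the flattened per-worker gradients gathered before averaging, and the estimate would typically be averaged over many steps, since a single-step estimate is quite noisy.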
It should be pointed out that results like Equation (11) are actually dynamic; that is, theoretically, $\eta_{\max}$, $\mathcal{B}_{\text{noise}}$, $\mathcal{B}_{\text{simple}}$ are different at each training step. So if we want to obtain a static law, we need to train for a period until the model's training enters a "steady state," at which point the computed $\mathcal{B}_{\text{simple}}$ is reliable. Alternatively, we can continuously monitor $\mathcal{B}_{\text{simple}}$ during training to judge the gap between the current settings and the optimum.
As for $\eta_{\max}$, there is actually no need to estimate it according to the formula. Simply perform a grid search for the learning rate at a small Batch Size to find an approximate $\eta^*$, then combine it with the estimated $\mathcal{B}_{\text{simple}}$ to back-calculate $\eta_{\max}$.
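Concretely, writing $B_0$ for the small Batch Size used in the grid search (a notation introduced here just for illustration), rearranging Equation (11) with $\mathcal{B}_{\text{simple}}$ in place of $\mathcal{B}_{\text{noise}}$ gives:

$$\eta_{\max} \approx \eta^*(B_0)\left(1 + \frac{\mathcal{B}_{\text{simple}}}{B_0}\right)$$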
Data Efficiency#
Starting from the above results, we can also derive an asymptotic relationship between training data volume and training steps. The derivation is also simple: substituting Equation (11) into the loss function, we can calculate that under the optimal learning rate, the loss function reduction per iteration step is:
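$$\overline{\Delta\mathcal{L}} = \frac{\Delta\mathcal{L}_{\max}}{1 + \mathcal{B}_{\text{noise}}/B}\tag{14}$$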
where $\Delta\mathcal{L}_{\max} = \frac{(\boldsymbol{g}^{\top}\boldsymbol{g})^2}{2\boldsymbol{g}^{\top}\boldsymbol{H}\boldsymbol{g}}$. The next focus is on interpreting this result.
When $B\to\infty$, i.e., full-batch SGD, the loss function reduction per step reaches the maximum $\Delta\mathcal{L}_{\max}$. At this point, the fewest training steps (denoted $S_{\min}$) can be used to reach the target point. When $B$ is finite, the average loss reduction per step is only $\overline{\Delta\mathcal{L}}$, meaning we need $1 + \mathcal{B}_{\text{noise}}/B$ steps to achieve the reduction of a single full-batch SGD step. Therefore, the total training steps are roughly $S = (1 + \mathcal{B}_{\text{noise}}/B)S_{\min}$.
Since the Batch Size is $B$, the total number of samples consumed in the training process is $E = BS = (B + \mathcal{B}_{\text{noise}})S_{\min}$, which is an increasing function of $B$. When $B\to 0$, $E_{\min} = \mathcal{B}_{\text{noise}}S_{\min}$, indicating that as long as we use a sufficiently small Batch Size to train the model, the total training samples $E$ required will also decrease accordingly, at the cost of many training steps $S$. Further, using these notations, we can write the relationship between them as:
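$$\left(\frac{S}{S_{\min}} - 1\right)\left(\frac{E}{E_{\min}} - 1\right) = 1\tag{15}$$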
This is the scaling law between training data volume and training steps, indicating that with smaller data volumes, we should reduce the Batch Size and allow more training steps to have a better chance of reaching a better solution. This derivation has been simplified by the author, assuming the invariance of $\mathcal{B}_{\text{noise}}$ and $\Delta\mathcal{L}_{\max}$ throughout the training process. If necessary, the original paper's appendix can be followed to more finely handle dynamic changes using integrals (but this requires introducing the assumption $B = \sqrt{r\mathcal{B}_{\text{noise}}}$), which will not be expanded here.
Additionally, since $\mathcal{B}_{\text{noise}} = E_{\min}/S_{\min}$, the above equation also provides another scheme for estimating $\mathcal{B}_{\text{noise}}$: obtain multiple $(S,E)$ pairs through multiple experiments and grid search, then fit the above equation to estimate $E_{\min},S_{\min}$, and then calculate $\mathcal{B}_{\text{noise}}$.
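As an illustration of this fitting scheme, here is a small sketch using `scipy.optimize.curve_fit` (the data points are made up purely for demonstration): rearranging Equation (15) gives $S = S_{\min}E/(E - E_{\min})$, which can be fit directly to measured $(S, E)$ pairs.

```python
import numpy as np
from scipy.optimize import curve_fit

def steps_vs_examples(E, S_min, E_min):
    # (S/S_min - 1)(E/E_min - 1) = 1  rearranged as  S = S_min * E / (E - E_min)
    return S_min * E / (E - E_min)

# (S, E) pairs measured at different Batch Sizes for the same target loss
# (made-up numbers, for illustration only).
E = np.array([2.0e6, 3.0e6, 5.0e6, 1.0e7, 3.0e7])
S = np.array([4.1e4, 2.6e4, 1.9e4, 1.5e4, 1.3e4])

(S_min, E_min), _ = curve_fit(
    steps_vs_examples, E, S,
    p0=[S.min(), E.min() / 2],
    bounds=([0.0, 0.0], [np.inf, 0.99 * E.min()]),
)
print(f"S_min = {S_min:.3g}, E_min = {E_min:.3g}, B_noise ~ {E_min / S_min:.3g}")
```

The fitted $E_{\min}/S_{\min}$ then serves as the estimate of $\mathcal{B}_{\text{noise}}$ described above.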
Adaptive Optimizers#
It must be said that OpenAI is indeed a pioneer in various Scaling Laws. The aforementioned analysis is quite brilliant, and the results are also quite rich. More commendably, the entire derivation process is not complicated, giving a sense of simplicity being the ultimate sophistication. However, the current conclusions are all derived based on SGD, and the applicability to Adam and other adaptive learning rate optimizers is still unclear. This part of the content was completed by "Surge Phenomenon in Optimal Learning Rate and Batch Size Scaling".
Sign Approximation#
The approach for analyzing Adam is the same as for SGD, both based on second-order expansion. The difference is that the direction vector changes from $\tilde{\boldsymbol{g}}_B$ to a general vector $\tilde{\boldsymbol{\varphi}}_B$. At this point, we have:
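$$\mathbb{E}\big[\mathcal{L}(\boldsymbol{w} - \eta\tilde{\boldsymbol{\varphi}}_B)\big]\approx \mathcal{L}(\boldsymbol{w}) - \eta\,\boldsymbol{g}^{\top}\mathbb{E}[\tilde{\boldsymbol{\varphi}}_B] + \frac{\eta^2}{2}\tr\big(\boldsymbol{H}\,\mathbb{E}[\tilde{\boldsymbol{\varphi}}_B\tilde{\boldsymbol{\varphi}}_B^{\top}]\big)$$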
Now we need to determine $\tilde{\boldsymbol{\varphi}}_B$ and compute the corresponding $\mathbb{E}[\tilde{\boldsymbol{\varphi}}_B]$ and $\mathbb{E}[\tilde{\boldsymbol{\varphi}}_B\tilde{\boldsymbol{\varphi}}_B^{\top}]$. Since only an asymptotic relationship is needed, similar to "Can LoRA Gain More with Different Learning Rates?", we choose SignSGD, i.e., $\newcommand{\sign}{\mathop{\text{sign}}}\tilde{\boldsymbol{\varphi}}_B = \sign(\tilde{\boldsymbol{g}}_B)$, as an approximation for Adam. The earliest source of this practice might be "Dissecting Adam: The Sign, Magnitude and Variance of Stochastic Gradients". The reasonableness of this approximation rests on two points:
1. Regardless of the values of $\beta_1,\beta_2$, Adam's first update vector is $\sign(\tilde{\boldsymbol{g}}_B)$;
2. When $\beta_1=\beta_2=0$, Adam's update vector is always $\sign(\tilde{\boldsymbol{g}}_B)$.
To compute $\mathbb{E}[\tilde{\boldsymbol{\varphi}}_B]$ and $\mathbb{E}[\tilde{\boldsymbol{\varphi}}_B\tilde{\boldsymbol{\varphi}}_B^{\top}]$, we also need to assume, as in the "Linear Scaling" section, that $\tilde{\boldsymbol{g}}_B$ follows the distribution $\mathcal{N}(\boldsymbol{g},\boldsymbol{\Sigma}/B)$. To simplify calculations, we further assume $\boldsymbol{\Sigma}$ is a diagonal matrix $\text{diag}(\sigma_1^2,\sigma_2^2,\sigma_3^2,\cdots)$, i.e., assuming components are independent, allowing us to handle each component independently. According to the reparameterization trick, we know $\tilde{g}_B\sim \mathcal{N}(g, \sigma^2/B)$ is equivalent to $\tilde{g}_B=g + \sigma z/\sqrt{B},z\sim\mathcal{N}(0,1)$, therefore:
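$$\mathbb{E}[\sign(\tilde{g}_B)] = \mathbb{E}_{z\sim\mathcal{N}(0,1)}\left[\sign\!\left(g + \frac{\sigma z}{\sqrt{B}}\right)\right] = \text{erf}\!\left(\frac{g}{\sigma}\sqrt{\frac{B}{2}}\right)$$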
Here $\text{erf}$ is the error function, an S-shaped function with range $(-1,1)$ similar to $\tanh$, which can serve as a smooth approximation of $\sign$. However, $\text{erf}$ itself has no elementary-function expression, so it is better to find an elementary-function approximation in order to see the pattern more intuitively. We previously discussed this topic in "Where Do the Two Elementary Function Approximations of GELU Come From?", but the approximation there was too complex (it involves exponentiation). Here we use a simpler one:
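$$\text{erf}(x)\approx \frac{x}{\sqrt{x^2 + c}}$$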
We choose $c=\pi/4$, so that the first-order expansion of this approximation at $x=0$ matches that of $\text{erf}$. Of course, after so many approximations, the exact value of $c$ is not very important; we only need to know that such a $c > 0$ exists. Based on this approximation, we get:
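$$\mu_i \triangleq \mathbb{E}[\sign(\tilde{g}_{B,i})] \approx \frac{\sign(g_i)}{\sqrt{1 + \pi\sigma_i^2/(2Bg_i^2)}}$$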
It can be seen that a clear difference between Adam and SGD is that $\mathbb{E}[\tilde{\boldsymbol{\varphi}}_B]$ is already related to $B$ at this step. Fortunately, the second moment at this point is simpler because the square of $\sign(x)$ must be 1, so:
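$$\mathbb{E}[\tilde{\boldsymbol{\varphi}}_B\tilde{\boldsymbol{\varphi}}_B^{\top}]_{i,j} = \begin{cases}1, & i=j\\ \mu_i\mu_j, & i\neq j\end{cases}$$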
Using these results, we can obtain:
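$$\eta^* = \frac{\sum_i \mu_i g_i}{\sum_i H_{i,i} + \sum_{i\neq j} H_{i,j}\mu_i\mu_j}\tag{21}$$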
Two Special Cases#
Compared to SGD's Equation (11), Adam's Equation (21) is more complex, making it impossible to intuitively see its dependence on $B$. Therefore, we start with a few special cases.
First consider $B\to\infty$, at which point $\mu_i = \sign(g_i)$, so:
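$$\eta^* \to \frac{\sum_i |g_i|}{\sum_i H_{i,i} + \sum_{i\neq j} H_{i,j}\sign(g_i)\sign(g_j)} = \frac{\sum_i |g_i|}{\sign(\boldsymbol{g})^{\top}\boldsymbol{H}\sign(\boldsymbol{g})}$$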
Compared with SGD's $\eta_{\max}$, the difference is that it is not invariant to the overall scale of the gradient; instead, it is proportional to the gradient's scale.
Next, we consider the example where $\boldsymbol{H}$ is a diagonal matrix, i.e., $H_{i,j}=0$ when $i\neq j$. At this point:
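$$\eta^* = \frac{\sum_i \mu_i g_i}{\sum_i H_{i,i}} = \frac{1}{\sum_i H_{i,i}}\sum_i \frac{|g_i|}{\sqrt{1 + \pi\sigma_i^2/(2Bg_i^2)}}$$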
Each term in the summation is monotonically increasing in $B$ and has an upper bound, so the overall behavior is the same: monotonically increasing with an upper bound. To capture the most essential pattern, we can further simplify $\mu_i$ (starting here, the treatment differs from the original paper):
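$$\mu_i \approx \frac{\sign(g_i)}{\sqrt{1 + \pi\kappa^2/(2B)}}\tag{25}$$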
The assumption here is that there exists some constant $\kappa^2$ independent of $i$ [for example, consider taking some kind of mean of all $(\sigma_i/g_i)^2$; actually, this $\kappa^2$ is similar to the previous $\mathcal{B}_{\text{simple}}$, and estimating according to the definition of $\mathcal{B}_{\text{simple}}$ is also possible], such that replacing $(\sigma_i/g_i)^2$ with $\kappa^2$ for any $i$ is a good approximation, thus:
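$$\eta^* \approx \frac{1}{\sqrt{1 + \pi\kappa^2/(2B)}}\,\frac{\sum_i |g_i|}{\sum_i H_{i,i}}\tag{26}$$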
When $\pi\kappa^2\gg 2B$, i.e., $B \ll \pi\kappa^2/2$, we can further write the approximation:
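$$\eta^* \approx \sqrt{\frac{2B}{\pi\kappa^2}}\,\frac{\sum_i |g_i|}{\sum_i H_{i,i}} \propto \sqrt{B}$$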
This shows that when Batch Size itself is small, Adam indeed follows the square root scaling law.
Emergent Behavior#
If we apply approximation (25) to the original Equation (21), we find it exhibits some new characteristics. Specifically, we have:
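$$\eta^* = \frac{\beta\sum_i |g_i|}{\sum_i H_{i,i} + \beta^2\sum_{i\neq j} H_{i,j}\sign(g_i)\sign(g_j)}\tag{28}$$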
where $\beta = (1 + \pi\kappa^2/2B)^{-1/2}$, and:
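$$\beta_{\text{noise}} = \sqrt{\frac{\sum_i H_{i,i}}{\sum_{i\neq j} H_{i,j}\sign(g_i)\sign(g_j)}}$$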
Note that $\beta$ is a monotonically increasing function of $B$, but the final approximation in Equation (28) is not a monotonically increasing function of $\beta$; it first increases then decreases, with the maximum reached at $\beta=\beta_{\text{noise}}$. This means there exists a corresponding $\mathcal{B}_{\text{noise}}$. When Batch Size exceeds this $\mathcal{B}_{\text{noise}}$, the optimal learning rate should not increase but rather decrease! This is the so-called "Surge phenomenon" mentioned in the original paper's title. (Of course, there is also a limitation here: $\beta$ is always less than $1$. If $\beta_{\text{noise}} \geq 1$, then the relationship between the optimal learning rate and Batch Size is still monotonically increasing.)
Regarding Adam's $\eta^*$, OpenAI actually "guessed" without proof in the paper's appendix that Adam's optimal learning rate should be:
where $0.5 < \alpha < 1$. It now appears that this form is only an approximation result when the diagonal elements of the Hessian matrix dominate. When the role of off-diagonal elements cannot be ignored, the Surge phenomenon of "when Batch Size is large enough, the learning rate should instead decrease" may emerge.
How to intuitively understand the Surge phenomenon? The author believes this is essentially a manifestation of the suboptimality of adaptive learning rate strategies. Still taking the approximation $\tilde{\boldsymbol{\varphi}}_B = \sign(\tilde{\boldsymbol{g}}_B)$ as an example, the larger $B$ is, the more accurate $\tilde{\boldsymbol{g}}_B$ becomes; $B\to \infty$ gives $\sign(\boldsymbol{g})$. However, is $\sign(\boldsymbol{g})$ the most scientific update direction? Not necessarily, especially in later training stages where such adaptive strategies might have negative effects. Therefore, when $B$ takes an appropriate value, the noise in $\sign(\tilde{\boldsymbol{g}}_B)$ might correct this suboptimality, while when $B$ continues to increase, the noise decreases, reducing the chance of correction, thus requiring more cautious lowering of the learning rate.
Efficiency Relationship#
Similar to the SGD analysis, we can also consider $\overline{\Delta\mathcal{L}}$. Substituting Equation (28) into Equation (22), restoring the notation $B$ and simplifying (the simplification process requires no approximation), we get:
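$$\overline{\Delta\mathcal{L}} = \frac{\Delta\mathcal{L}_{\max}}{1 + \mathcal{B}_{\text{noise-2}}/B}\tag{31}$$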
where:
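$$\Delta\mathcal{L}_{\max} = \frac{\big(\sum_i |g_i|\big)^2}{2\,\sign(\boldsymbol{g})^{\top}\boldsymbol{H}\sign(\boldsymbol{g})},\qquad \mathcal{B}_{\text{noise-2}} = \frac{\pi\kappa^2}{2}\cdot\frac{\sum_i H_{i,i}}{\sign(\boldsymbol{g})^{\top}\boldsymbol{H}\sign(\boldsymbol{g})} = \frac{\pi\kappa^2}{2}\cdot\frac{\beta_{\text{noise}}^2}{1+\beta_{\text{noise}}^2}$$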
Note that $\mathcal{B}_{\text{noise-2}}$ here is a new notation; it is not $\mathcal{B}_{\text{noise}}$. The latter is the theoretically optimal Batch Size solved inversely from $\beta=\beta_{\text{noise}}$, resulting in:
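$$\mathcal{B}_{\text{noise}} = \frac{\pi\kappa^2}{2}\cdot\frac{\beta_{\text{noise}}^2}{1-\beta_{\text{noise}}^2}$$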
The relationship between them is:
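$$\mathcal{B}_{\text{noise}} = \frac{1+\beta_{\text{noise}}^2}{1-\beta_{\text{noise}}^2}\,\mathcal{B}_{\text{noise-2}}$$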
Since Equation (31) has the same form as SGD's Equation (14), the analysis in that section similarly applies, thus also deriving Equation (15):
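$$\left(\frac{S}{S_{\min}} - 1\right)\left(\frac{E}{E_{\min}} - 1\right) = 1$$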
Only now $E_{\min}/S_{\min} = \mathcal{B}_{\text{noise-2}}$. This way, we have another scheme to estimate $\beta_{\text{noise}}$ and $\mathcal{B}_{\text{noise}}$: obtain multiple $(S,E)$ pairs through multiple experiments, during which $\kappa^2$ can also be estimated incidentally, then fit the above equation to obtain $E_{\min},S_{\min}$, subsequently estimate $\mathcal{B}_{\text{noise-2}}$, and finally solve for $\beta_{\text{noise}}$ using Equation (32).
If $\beta_{\text{noise}} \geq 1$, then there is no optimal $\mathcal{B}_{\text{noise}}$; if $\beta_{\text{noise}} \gg 1$, it indicates that the diagonal elements of the Hessian matrix dominate, at which point the scaling law (26) applies, and increasing Batch Size can always appropriately increase the learning rate; when $\beta_{\text{noise}} < 1$, the optimal $\mathcal{B}_{\text{noise}}$ can be solved from Equation (34). When Batch Size exceeds this value, the learning rate should instead decrease.
Article Summary#
- For SGD: the optimal learning rate increases with Batch Size but saturates at $\eta_{\max}$; linear scaling ($\eta \propto B$) is only a local approximation valid for small Batch Sizes.
- For Adam: the behavior is more complex: square root scaling ($\eta \propto \sqrt{B}$) holds for small batches, but the optimal learning rate may exhibit the "Surge phenomenon" and start to decrease once the Batch Size exceeds a certain threshold.
- Practical implications: there is a computational sweet spot $\mathcal{B}_{\text{noise}}$ beyond which increasing the Batch Size yields diminishing returns, and adaptive optimizers may require more careful tuning than SGD for large-batch training.
It should be pointed out that the starting point and final conclusions of the above few sections are actually similar to the original paper "Surge Phenomenon in Optimal Learning Rate and Batch Size Scaling", but the intermediate approximation treatments differ.
Most conclusions obtained in the original paper are approximate results under the assumption $B \ll \pi(\sigma_i/g_i)^2/2$, so they conclude that the Surge phenomenon almost always occurs, which is not very scientific. The most obvious issue is that the form of the assumption $B \ll \pi(\sigma_i/g_i)^2/2$ itself is somewhat problematic; its right side depends on $i$. We cannot assign a separate Batch Size to each component, so to obtain a global result, it would have to be $B \ll \min_i \pi(\sigma_i/g_i)^2/2$, but this is somewhat too stringent.
The approach in this article is to introduce approximation (25), which can be seen as a mean-field approximation, intuitively more reasonable than the pointwise assumption $B \ll \pi(\sigma_i/g_i)^2/2$. Therefore, in principle, the conclusions will be more accurate, such as obtaining the conclusion that "even if the off-diagonal elements of the Hessian matrix cannot be ignored, the Surge phenomenon does not necessarily occur" (depending on $\beta_{\text{noise}}$). In particular, this accuracy does not sacrifice simplicity; for example, Equation (28) is also quite clear and concise, the form of Equation (31) is also consistent with the original paper, and no additional approximation assumptions are needed, etc.
Finally, a slight reflection: OpenAI's analysis of SGD dates back to 2018, while the Surge phenomenon paper was only released in mid-2024. Going from SGD to Adam took six years, which is quite surprising; this was probably due in large part to OpenAI's "prestige" and the guess (30), which made people think there was nothing left to do for Adam, without expecting that Adam might exhibit new characteristics. Of course, questions such as how reasonable $\tilde{\boldsymbol{\varphi}}_B = \sign(\tilde{\boldsymbol{g}}_B)$ is as an approximation for Adam, and to what extent it reflects actual behavior, are, in the author's opinion, still worth further consideration.
Conclusion#
This article discusses the classic topic of "Scaling Laws between Batch Size and Learning Rate" from multiple perspectives, with a focus on introducing OpenAI's derivation and conclusions based on the second-order approximation of the loss function, as well as subsequent work using the same ideas to analyze the Adam optimizer.
Original Article: Su Jianlin. How Should Learning Rate Scale with Batch Size?. Scientific Spaces.