Recently, I revisited a Meta paper from last year titled "A Theory on Adam Instability in Large-Scale Machine Learning", which provides a novel perspective on Adam and other adaptive learning rate optimizers: it suggests that the exponential moving average of gradient squares approximately estimates the square of the Hessian matrix, implying that optimizers like Adam and RMSprop are effectively approximating second-order Newton methods.
This perspective differs noticeably from the classical Hessian approximations in the literature, which makes it worth studying and thinking through further.
Newton's Method#
Let the loss function be $\mathcal{L}(\boldsymbol{\theta})$, where $\boldsymbol{\theta}$ denotes the parameters to be optimized. Our optimization objective is:

$$\boldsymbol{\theta}^* = \mathop{\arg\min}_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta})$$
Assuming the current value of $\boldsymbol{\theta}$ is $\boldsymbol{\theta}_t$, Newton's method seeks $\boldsymbol{\theta}_{t+1}$ by expanding the loss function to second order around $\boldsymbol{\theta}_t$:

$$\mathcal{L}(\boldsymbol{\theta}) \approx \mathcal{L}(\boldsymbol{\theta}_t) + \boldsymbol{g}_t^{\top}(\boldsymbol{\theta} - \boldsymbol{\theta}_t) + \frac{1}{2}(\boldsymbol{\theta} - \boldsymbol{\theta}_t)^{\top}\boldsymbol{\mathcal{H}}_t(\boldsymbol{\theta} - \boldsymbol{\theta}_t)$$
where $\boldsymbol{g}_t = \nabla_{\boldsymbol{\theta}_t}\mathcal{L}(\boldsymbol{\theta}_t)$ is the gradient and $\boldsymbol{\mathcal{H}}_t=\nabla_{\boldsymbol{\theta}_t}^2\mathcal{L}(\boldsymbol{\theta}_t)$ is the Hessian matrix. Assuming the Hessian is positive definite, the right-hand side attains its unique minimum at $\boldsymbol{\theta}_t - \boldsymbol{\mathcal{H}}_t^{-1}\boldsymbol{g}_t$, which Newton's method takes as the next iterate $\boldsymbol{\theta}_{t+1}$:

$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \boldsymbol{\mathcal{H}}_t^{-1}\boldsymbol{g}_t$$
Note that this update contains no separate learning rate, which makes Newton's method inherently an adaptive learning rate algorithm. However, since merely storing the Hessian scales quadratically with the number of parameters (and inverting it is even more expensive), the full Newton's method remains largely theoretical in deep learning. To apply it in practice, the Hessian must be simplified substantially, for example by assuming a diagonal or low-rank structure.
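To fix the notation used in what follows, recall Adam's standard update rule (restated here; $\eta_t$ is the learning-rate schedule and $\epsilon$ the usual numerical-stability constant):

$$\begin{aligned}
\boldsymbol{m}_t &= \beta_1\boldsymbol{m}_{t-1} + (1-\beta_1)\,\boldsymbol{g}_t, &\qquad \hat{\boldsymbol{m}}_t &= \boldsymbol{m}_t/(1-\beta_1^t)\\
\boldsymbol{v}_t &= \beta_2\boldsymbol{v}_{t-1} + (1-\beta_2)\,\boldsymbol{g}_t\odot\boldsymbol{g}_t, &\qquad \hat{\boldsymbol{v}}_t &= \boldsymbol{v}_t/(1-\beta_2^t)\\
\boldsymbol{\theta}_{t+1} &= \boldsymbol{\theta}_t - \eta_t\,\hat{\boldsymbol{m}}_t\big/\big(\sqrt{\hat{\boldsymbol{v}}_t}+\epsilon\big)
\end{aligned}$$

Ignoring $\epsilon$, the update can be rewritten as $\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \big(\eta_t^{-1}\text{diag}(\sqrt{\hat{\boldsymbol{v}}_t})\big)^{-1}\hat{\boldsymbol{m}}_t$. Compared with the Newton step above, $\hat{\boldsymbol{m}}_t$ plays the role of $\boldsymbol{g}_t$, and $\eta_t^{-1}\text{diag}(\sqrt{\hat{\boldsymbol{v}}_t})$ plays the role of $\boldsymbol{\mathcal{H}}_t$.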
Next, we aim to show that $\eta_t^{-1}\text{diag}(\sqrt{\hat{\boldsymbol{v}}_t})$ is indeed a reasonable approximation of $\boldsymbol{\mathcal{H}}_t$.
Gradient Approximation#
The key to the proof lies in the first-order approximation of the gradient around $\boldsymbol{\theta}^*$:

$$\boldsymbol{g}_{\boldsymbol{\theta}} \approx \boldsymbol{g}_{\boldsymbol{\theta}^*} + \boldsymbol{\mathcal{H}}_{\boldsymbol{\theta}^*}(\boldsymbol{\theta} - \boldsymbol{\theta}^*)$$
where $\boldsymbol{g}_{\boldsymbol{\theta}^*}$ and $\boldsymbol{\mathcal{H}}_{\boldsymbol{\theta}^*}$ denote the gradient and Hessian at the expansion point $\boldsymbol{\theta}=\boldsymbol{\theta}^*$. Here $\boldsymbol{\theta}^*$ is the optimization target defined at the start, at which the gradient vanishes, $\boldsymbol{g}_{\boldsymbol{\theta}^*}=\boldsymbol{0}$. Thus the approximation simplifies to:

$$\boldsymbol{g}_{\boldsymbol{\theta}} \approx \boldsymbol{\mathcal{H}}_{\boldsymbol{\theta}^*}(\boldsymbol{\theta} - \boldsymbol{\theta}^*)$$
Consequently,

$$\boldsymbol{g}_{\boldsymbol{\theta}}\boldsymbol{g}_{\boldsymbol{\theta}}^{\top} \approx \boldsymbol{\mathcal{H}}_{\boldsymbol{\theta}^*}(\boldsymbol{\theta} - \boldsymbol{\theta}^*)(\boldsymbol{\theta} - \boldsymbol{\theta}^*)^{\top}\boldsymbol{\mathcal{H}}_{\boldsymbol{\theta}^*}$$
Assuming the training has reached a "stable regime" where the model oscillates around $\boldsymbol{\theta}^*$ with slow, spiral convergence, we can to some extent treat $\boldsymbol{\theta} - \boldsymbol{\theta}^*$ as a random variable following a normal distribution $\mathcal{N}(\boldsymbol{0},\sigma^2\boldsymbol{I})$. Taking expectations over this distribution then gives:

$$\mathbb{E}\big[\boldsymbol{g}_{\boldsymbol{\theta}}\boldsymbol{g}_{\boldsymbol{\theta}}^{\top}\big] \approx \boldsymbol{\mathcal{H}}_{\boldsymbol{\theta}^*}\,\mathbb{E}\big[(\boldsymbol{\theta} - \boldsymbol{\theta}^*)(\boldsymbol{\theta} - \boldsymbol{\theta}^*)^{\top}\big]\,\boldsymbol{\mathcal{H}}_{\boldsymbol{\theta}^*} = \sigma^2\,\boldsymbol{\mathcal{H}}_{\boldsymbol{\theta}^*}^2$$
Assuming the Hessian matrix is diagonal, we can retain only the diagonal elements of the above equation:

$$\mathbb{E}\big[g_i^2\big] \approx \sigma^2\,\mathcal{H}_{ii}^2 \ \text{ for each component } i,\qquad\text{i.e.}\qquad \boldsymbol{\mathcal{H}}_{\boldsymbol{\theta}^*} \approx \frac{1}{\sigma}\,\text{diag}\Big(\sqrt{\mathbb{E}\big[\boldsymbol{g}_{\boldsymbol{\theta}}\odot\boldsymbol{g}_{\boldsymbol{\theta}}\big]}\Big)$$

where $g_i$ is the $i$-th component of $\boldsymbol{g}_{\boldsymbol{\theta}}$ and $\mathcal{H}_{ii}$ the $i$-th diagonal entry of $\boldsymbol{\mathcal{H}}_{\boldsymbol{\theta}^*}$. Since the moving average $\hat{\boldsymbol{v}}_t$ is precisely an estimate of $\mathbb{E}[\boldsymbol{g}_{\boldsymbol{\theta}}\odot\boldsymbol{g}_{\boldsymbol{\theta}}]$, this identifies $\frac{1}{\sigma}\text{diag}(\sqrt{\hat{\boldsymbol{v}}_t})$ with the Hessian, i.e. the claimed approximation $\eta_t^{-1}\text{diag}(\sqrt{\hat{\boldsymbol{v}}_t})\approx\boldsymbol{\mathcal{H}}_t$ with $\eta_t\approx\sigma$.
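The diagonal relation above can be checked numerically with a minimal sketch (not from the original article; all names and values here are illustrative): for a quadratic loss with a known diagonal Hessian, parameters oscillating as $\boldsymbol{\theta}-\boldsymbol{\theta}^*\sim\mathcal{N}(\boldsymbol{0},\sigma^2\boldsymbol{I})$ produce gradients whose squared EMA recovers the Hessian entries once we take the square root and divide by $\sigma$:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, sigma, beta2 = 5, 0.1, 0.999
h = rng.uniform(0.5, 2.0, size=dim)    # assumed diagonal Hessian entries
v = np.zeros(dim)                      # EMA of squared gradients, as in Adam
steps = 100_000
for t in range(1, steps + 1):
    delta = sigma * rng.standard_normal(dim)   # theta - theta*, oscillating around the optimum
    g = h * delta                              # gradient of the quadratic loss 0.5 * sum(h * delta**2)
    v = beta2 * v + (1 - beta2) * g**2
v_hat = v / (1 - beta2**steps)                 # bias correction (negligible after this many steps)
print(np.sqrt(v_hat) / sigma)                  # should be close to h
print(h)
```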
This analysis also explains why Adam's $\beta_2$ is typically larger than $\beta_1$. To estimate the Hessian more accurately, the moving average $\hat{\boldsymbol{v}}_t$ should be as "long-term" as possible (approaching a uniform average), so $\beta_2$ should be very close to 1. In contrast, the momentum $\hat{\boldsymbol{m}}_t$ is a moving average of gradients. If the gradient average becomes too long-term, the result would approach $\boldsymbol{g}_{\boldsymbol{\theta}^*}=\boldsymbol{0}$, which is undesirable. Therefore, the momentum's moving average should be more local.
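As a rough rule of thumb (standard exponential-moving-average reasoning; $\beta_1=0.9$ and $\beta_2=0.999$ are Adam's common defaults), an EMA with decay $\beta$ weights the gradient from $k$ steps ago by $(1-\beta)\beta^k$, so its effective averaging window is on the order of $1/(1-\beta)$ steps:

$$\hat{\boldsymbol{v}}_t \approx \sum_{k\geq 0}(1-\beta_2)\,\beta_2^k\,\boldsymbol{g}_{t-k}\odot\boldsymbol{g}_{t-k},\qquad \text{effective window}\approx\frac{1}{1-\beta_2}$$

so $\hat{\boldsymbol{v}}_t$ averages over roughly $1000$ steps, while $\hat{\boldsymbol{m}}_t$ averages over roughly $10$.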
Related Work#
For readers familiar with Hessian matrix theory, the first reaction to the above conclusion might be confusion rather than recognition. This is because a classical approximation of the Hessian matrix is the outer product of Jacobian matrices (similar to gradients), whereas the Hessian approximation here is the square root of the gradient outer product, differing by a square root.
Specifically, let's consider the squared error loss as an example:

$$\mathcal{L}(\boldsymbol{\theta}) = \frac{1}{2}\big\Vert\, \boldsymbol{y} - \boldsymbol{f}_{\boldsymbol{\theta}}(\boldsymbol{x})\,\big\Vert^2$$
Expanding around $\boldsymbol{\theta}_t$, we have $\boldsymbol{f}_{\boldsymbol{\theta}}(\boldsymbol{x})\approx \boldsymbol{f}_{\boldsymbol{\theta}_t}(\boldsymbol{x}) + \boldsymbol{\mathcal{J}}_{\boldsymbol{\theta}_t}^{\top} (\boldsymbol{\theta} - \boldsymbol{\theta}_t)$, where $\boldsymbol{\mathcal{J}}_{\boldsymbol{\theta}_t}=\nabla_{\boldsymbol{\theta}_t} \boldsymbol{f}_{\boldsymbol{\theta}_t}(\boldsymbol{x})$ is the Jacobian matrix. Substituting into the loss above yields:

$$\mathcal{L}(\boldsymbol{\theta}) \approx \frac{1}{2}\big\Vert\, \boldsymbol{y} - \boldsymbol{f}_{\boldsymbol{\theta}_t}(\boldsymbol{x}) - \boldsymbol{\mathcal{J}}_{\boldsymbol{\theta}_t}^{\top}(\boldsymbol{\theta} - \boldsymbol{\theta}_t)\,\big\Vert^2$$
The right-hand side is simply a quadratic function of $\boldsymbol{\theta}$, so its Hessian can be read off directly:

$$\boldsymbol{\mathcal{H}}_{\boldsymbol{\theta}_t} \approx \boldsymbol{\mathcal{J}}_{\boldsymbol{\theta}_t}\boldsymbol{\mathcal{J}}_{\boldsymbol{\theta}_t}^{\top}$$
This is the Hessian approximation based on the outer product of Jacobian matrices, which forms the theoretical foundation of the "Gauss–Newton method". Of course, $\boldsymbol{\mathcal{J}}$ is not yet $\boldsymbol{g}$; we still need to connect this result to $\boldsymbol{g}$. Differentiating the squared error loss directly gives:

$$\boldsymbol{g}_{\boldsymbol{\theta}} = \nabla_{\boldsymbol{\theta}}\mathcal{L}(\boldsymbol{\theta}) = -\boldsymbol{\mathcal{J}}_{\boldsymbol{\theta}}\big(\boldsymbol{y} - \boldsymbol{f}_{\boldsymbol{\theta}}(\boldsymbol{x})\big)$$
Thus,

$$\boldsymbol{g}_{\boldsymbol{\theta}}\boldsymbol{g}_{\boldsymbol{\theta}}^{\top} = \boldsymbol{\mathcal{J}}_{\boldsymbol{\theta}}\big(\boldsymbol{y} - \boldsymbol{f}_{\boldsymbol{\theta}}(\boldsymbol{x})\big)\big(\boldsymbol{y} - \boldsymbol{f}_{\boldsymbol{\theta}}(\boldsymbol{x})\big)^{\top}\boldsymbol{\mathcal{J}}_{\boldsymbol{\theta}}^{\top} \approx \boldsymbol{\mathcal{J}}_{\boldsymbol{\theta}_t}\big(\boldsymbol{y} - \boldsymbol{f}_{\boldsymbol{\theta}}(\boldsymbol{x})\big)\big(\boldsymbol{y} - \boldsymbol{f}_{\boldsymbol{\theta}}(\boldsymbol{x})\big)^{\top}\boldsymbol{\mathcal{J}}_{\boldsymbol{\theta}_t}^{\top} \approx \boldsymbol{\mathcal{J}}_{\boldsymbol{\theta}_t}\,\mathbb{E}\Big[\big(\boldsymbol{y} - \boldsymbol{f}_{\boldsymbol{\theta}}(\boldsymbol{x})\big)\big(\boldsymbol{y} - \boldsymbol{f}_{\boldsymbol{\theta}}(\boldsymbol{x})\big)^{\top}\Big]\,\boldsymbol{\mathcal{J}}_{\boldsymbol{\theta}_t}^{\top}$$

where the first approximation replaces $\boldsymbol{\mathcal{J}}_{\boldsymbol{\theta}}$ with $\boldsymbol{\mathcal{J}}_{\boldsymbol{\theta}_t}$ and the second replaces the residual outer product with its expectation.
The two approximations above lack rigorous justification but can be loosely regarded as mean-field approximations. Since $\boldsymbol{y} - \boldsymbol{f}_{\boldsymbol{\theta}}(\boldsymbol{x})$ is the regression residual, we typically assume it follows $\mathcal{N}(\boldsymbol{0},\sigma^2\boldsymbol{I})$, so that its expected outer product is $\sigma^2\boldsymbol{I}$. Therefore,

$$\boldsymbol{g}_{\boldsymbol{\theta}}\boldsymbol{g}_{\boldsymbol{\theta}}^{\top} \approx \sigma^2\,\boldsymbol{\mathcal{J}}_{\boldsymbol{\theta}_t}\boldsymbol{\mathcal{J}}_{\boldsymbol{\theta}_t}^{\top} \approx \sigma^2\,\boldsymbol{\mathcal{H}}_{\boldsymbol{\theta}_t}$$

using the Gauss–Newton approximation derived above.
This reveals the connection between $\boldsymbol{\mathcal{H}}_{\boldsymbol{\theta}_t}$ and $\boldsymbol{g}_{\boldsymbol{\theta}} \boldsymbol{g}_{\boldsymbol{\theta}}^{\top}$. Comparing with the result of the previous section, however, the two appear to differ by a square.
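To see the mismatch at a glance, the two relations derived above can be written side by side:

$$\underbrace{\mathbb{E}\big[\boldsymbol{g}_{\boldsymbol{\theta}}\boldsymbol{g}_{\boldsymbol{\theta}}^{\top}\big] \approx \sigma^2\,\boldsymbol{\mathcal{H}}_{\boldsymbol{\theta}^*}^2}_{\text{long-term average (previous section)}} \qquad\text{vs.}\qquad \underbrace{\boldsymbol{g}_{\boldsymbol{\theta}}\boldsymbol{g}_{\boldsymbol{\theta}}^{\top} \approx \sigma^2\,\boldsymbol{\mathcal{H}}_{\boldsymbol{\theta}_t}}_{\text{instantaneous (this section)}}$$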
Examining the two derivations, both appear correct, so how should we interpret the inconsistency? One way to understand it is this: the relation of this section is an "instantaneous" approximation of the Hessian at step $t$, whereas the relation of the previous section is a "long-term average" over many time steps. The long-term averaging partially cancels the fluctuations (while, in principle, making the estimate more accurate), which is why an additional square root is needed there to recover the Hessian.
A similar effect appears in the SDEs discussed in "Generative Diffusion Models: A General Framework with SDEs", where the noise term must be half an order larger than the drift term: of order $\sqrt{\Delta t}$ rather than $\Delta t$. This, too, is because noise partially cancels under long-term averaging, so it has to be stronger in order to leave a visible effect in the final result.
Additional Connections#
In the previous derivation, we assumed $\boldsymbol{\theta}^*$ to be the theoretical optimum, so that $\boldsymbol{g}_{\boldsymbol{\theta}^*} = \boldsymbol{0}$. What if $\boldsymbol{\theta}^*$ is an arbitrary point? Then the earlier relation becomes:

$$\mathbb{E}\big[(\boldsymbol{g}_{\boldsymbol{\theta}} - \boldsymbol{g}_{\boldsymbol{\theta}^*})(\boldsymbol{g}_{\boldsymbol{\theta}} - \boldsymbol{g}_{\boldsymbol{\theta}^*})^{\top}\big] \approx \sigma^2\,\boldsymbol{\mathcal{H}}_{\boldsymbol{\theta}^*}^2$$

together with its diagonal version, exactly as before.
In other words, as long as we take the moving average of the gradient's covariance rather than its raw second moment, we obtain a Hessian approximation around any local reference point. This corresponds precisely to the AdaBelief optimizer, whose $\boldsymbol{v}$ is the moving average of the squared difference between $\boldsymbol{g}$ and $\boldsymbol{m}$:

$$\boldsymbol{v}_t = \beta_2\boldsymbol{v}_{t-1} + (1-\beta_2)\,(\boldsymbol{g}_t - \boldsymbol{m}_t)\odot(\boldsymbol{g}_t - \boldsymbol{m}_t)$$
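For concreteness, here is a minimal sketch of the one-line difference between the two second-moment accumulators (illustrative only, not either optimizer's actual library implementation; bias correction and $\epsilon$ handling are omitted):

```python
def adam_v_update(v, g, beta2=0.999):
    # Adam: EMA of the raw squared gradient -- estimates E[g^2]
    return beta2 * v + (1 - beta2) * g * g

def adabelief_v_update(v, g, m, beta2=0.999):
    # AdaBelief: EMA of the squared deviation of g from its own EMA m
    # -- estimates the gradient's (diagonal) covariance instead
    return beta2 * v + (1 - beta2) * (g - m) * (g - m)
```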
Conclusion#
This article has introduced a perspective for understanding adaptive learning rate optimizers like Adam through Newton's method and Hessian approximation, discussing related results in Hessian approximation theory.
The main contributions of this analysis include:
- Unified Framework: Presenting Adam and related optimizers as approximations to Newton's method with diagonal Hessian assumptions.
- Theoretical Derivation: Demonstrating how gradient statistics relate to Hessian matrix approximations through statistical assumptions about parameter distributions.
- Hyperparameter Interpretation: Explaining why Adam's $\beta_2$ is typically larger than $\beta_1$ based on the different roles of gradient and Hessian estimation.
- Historical Context: Contrasting the proposed Hessian approximation with classical Jacobian outer product approximations and resolving apparent discrepancies through temporal averaging considerations.
- Algorithmic Connections: Extending the analysis to other optimizers like AdaBelief, which uses gradient covariance rather than raw second moments.
This perspective offers several practical implications:
- Algorithm Design: Provides guidance for designing new adaptive optimizers based on improved Hessian approximations.
- Hyperparameter Tuning: Offers theoretical justification for common hyperparameter choices in adaptive optimizers.
- Convergence Analysis: Suggests new approaches for analyzing the convergence properties of adaptive methods through their connections to second-order methods.
- Numerical Stability: Clarifies the role of the $\epsilon$ parameter in preventing division by near-zero values of the Hessian approximation $\sqrt{\hat{\boldsymbol{v}}_t}$.
Future research directions could explore more accurate Hessian approximations within adaptive optimizers, investigate the effects of different statistical assumptions on optimization performance, and develop theoretical guarantees for convergence rates based on these Hessian approximation perspectives.
This analysis bridges the gap between practical adaptive optimization methods and classical optimization theory, providing a unified framework for understanding and improving modern deep learning optimizers.
Original Article: Su Jianlin. Understanding Adaptive Learning Rate Optimizers from Hessian Approximation. Scientific Spaces.