As we know, Gradient Clipping is a common technique used to stabilize model training. Standard gradient clipping operates on the total norm of all parameter gradients, which can be expressed mathematically as:

(1) \[ \text{clip}(\boldsymbol{g},\tau)=\left\{\begin{aligned}&\boldsymbol{g}, &\Vert\boldsymbol{g}\Vert\leq \tau \\ &\frac{\tau}{\Vert\boldsymbol{g}\Vert}\boldsymbol{g},&\Vert\boldsymbol{g}\Vert > \tau \end{aligned}\right. \]

This operation preserves the direction of $\boldsymbol{g}$ while limiting its magnitude to at most $\tau$. Note that $\Vert\boldsymbol{g}\Vert$ here refers to the global gradient norm—the Euclidean norm computed by treating all model parameter gradients as a single vector.
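
As a concrete reference, here is a minimal PyTorch sketch of eq. (1); the function name is ours, and in practice the built-in `torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)` performs the same operation.

```python
import torch

def clip_global_grad_norm(parameters, tau=1.0):
    """Scale all gradients in place so their global L2 norm is at most tau (eq. (1))."""
    grads = [p.grad for p in parameters if p.grad is not None]
    # Global norm: treat every parameter gradient as one long vector.
    total_norm = torch.norm(torch.stack([g.norm(2) for g in grads]), 2)
    if total_norm > tau:
        scale = tau / (total_norm + 1e-6)  # small epsilon guards against division by zero
        for g in grads:
            g.mul_(scale)
    return total_norm
```

Called between `loss.backward()` and `optimizer.step()`, this mirrors the standard library routine.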

Have you ever noticed a subtle detail: regardless of whether the model has millions or billions of parameters, $\tau$ is frequently set to 1? What does this signify? Is this merely the reuse of a default value, or does it imply some profound underlying principle?

What is it?#

Some readers might think: "Default values are not necessarily optimal values, so why dwell on this?" Indeed, $\tau=1$ may not be the optimal choice, but it is the default setting for many models, and under this default setting, performance remains acceptable. This in turn suggests that $\tau=1$ possesses universal rationality.

The Nature of Clipping Operations

What does "rationality" mean here? Let us return to the $\text{clip}$ operation. If $\Vert\boldsymbol{g}\Vert$ is always less than $\tau$, then $\text{clip}$ reduces to an identity transformation; if $\Vert\boldsymbol{g}\Vert$ is always greater than $\tau$, then $\text{clip}$ degenerates into L2 normalization. In other words, $\text{clip}$ is meaningful precisely because $\tau$ creates an appropriate discrimination threshold, ensuring that most $\Vert\boldsymbol{g}\Vert$ values are below $\tau$, with only a small fraction exceeding it. This is the essence of $\tau$'s rationality.

Caveat: Certainly, counterexamples exist, and they are not uncommon. The emphasis here is on the generality of this phenomenon and the universality of this default setting. Thus, meticulous readers need not be overly preoccupied with isolated details.

Therefore, we contend that the universal rationality of $\tau=1$ implies that, regardless of a model's parameter count, initialization scheme, or loss function, its total gradient norm naturally hovers around 1 as a threshold for "abnormal values." This is undoubtedly a remarkable and counterintuitive property—the author's initial reaction upon realizing this conclusion was precisely that sense of wonder.
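
One way to see this for yourself is simply to log the global gradient norm during training. Below is a minimal PyTorch sketch on a toy model (the architecture, data, and learning rate are placeholders); `clip_grad_norm_` conveniently returns the global norm before clipping, so it doubles as a monitor.

```python
import torch
import torch.nn as nn

# Toy setup purely to illustrate the logging; any real model/loss works the same way.
model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.SGD(model.parameters(), lr=0.1)

for step in range(100):
    x = torch.randn(32, 64)
    loss = model(x).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    # clip_grad_norm_ returns the global norm *before* clipping, so it doubles as a monitor.
    gnorm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    opt.step()
    if step % 20 == 0:
        print(f"step {step:3d}  global grad norm = {float(gnorm):.3f}")
```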

Why is it?#

Why such a "coincidence"? The author's answer may come as a surprise: because only under these conditions can models achieve stable training.

Consider a loss function $\mathcal{L}(\boldsymbol{\theta})$ with optimizer update rule $\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta\, \boldsymbol{u}_t$, and write $\boldsymbol{g}_t = \nabla_{\boldsymbol{\theta}}\mathcal{L}(\boldsymbol{\theta}_t)$ for the gradient at step $t$. To first order, the change in loss is:

(2) \[ \Delta \mathcal{L} = \mathcal{L}(\boldsymbol{\theta}_{t+1}) - \mathcal{L}(\boldsymbol{\theta}_t) \approx (\boldsymbol{\theta}_{t+1} - \boldsymbol{\theta}_t)\cdot\boldsymbol{g}_t = -\eta\, \boldsymbol{u}_t\cdot \boldsymbol{g}_t \]

Training Stability and Gradient Norms

SGD Analysis: First, consider the simplest SGD case, where $\boldsymbol{u}_t = \boldsymbol{g}_t$ and therefore $\Delta \mathcal{L}=-\eta\Vert\boldsymbol{g}_t\Vert^2$. Here, the loss change is proportional to the square of the gradient norm.

Empirical Observation: We know that pure SGD (without momentum) is fairly inefficient for both CV and NLP tasks: during the mid-to-late stages of training, the average loss reduction per step is typically far smaller than the learning rate, i.e., $|\Delta \mathcal{L}| < \eta$. Since $|\Delta \mathcal{L}| = \eta\Vert\boldsymbol{g}_t\Vert^2$, this gives $\Vert\boldsymbol{g}_t\Vert^2 < 1$ and hence $\Vert\boldsymbol{g}_t\Vert < 1$, demonstrating that $\Vert\boldsymbol{g}_t\Vert < 1$ is the long-term behavior of a model that converges normally.
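
The relation $\Delta \mathcal{L} \approx -\eta\Vert\boldsymbol{g}_t\Vert^2$ is easy to verify numerically. Below is a minimal sketch on a toy least-squares problem; the model, data, and learning rate are arbitrary choices for illustration only.

```python
import torch

torch.manual_seed(0)
w = torch.randn(100, requires_grad=True)
x, y = torch.randn(512, 100), torch.randn(512)
eta = 0.01

loss = ((x @ w - y) ** 2).mean()
loss.backward()
g = w.grad

with torch.no_grad():
    w_new = w - eta * g                      # one plain-SGD step
    new_loss = ((x @ w_new - y) ** 2).mean()

print("actual delta L :", (new_loss - loss).item())
print("-eta * ||g||^2 :", (-eta * g.pow(2).sum()).item())
# For a small enough eta the two numbers agree closely, matching eq. (2) with u_t = g_t.
```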

Of course, $\Vert\boldsymbol{g}_t\Vert > 1$ may occur during early training, which is normal; $\Vert\boldsymbol{g}_t\Vert \gg 1$, however, is rare. A well-designed initialization should avoid $\Vert\boldsymbol{g}_t\Vert \gg 1$ in the first place, as initialization analyses such as DeepNorm suggest. The rationale is similar: if gradient norms are too large, early learning becomes overly "aggressive," and the model may converge prematurely to a suboptimal local minimum. An alternative is to reduce $\eta$, which likewise reduces $|\Delta \mathcal{L}|$; this is precisely why we typically use Warmup during early training.

Further Reading on Warmup
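
For concreteness, a minimal linear-warmup schedule in PyTorch might look like the sketch below; the optimizer, base learning rate, and the 1,000-step warmup length are arbitrary example values.

```python
import torch

# Hypothetical optimizer and base lr; any torch optimizer works the same way.
optimizer = torch.optim.AdamW([torch.zeros(1, requires_grad=True)], lr=1e-3)

warmup_steps = 1000  # arbitrary example value
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps),
)
# Calling scheduler.step() after each optimizer.step() ramps the learning rate
# linearly from ~0 up to its nominal value over the first warmup_steps updates.
```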

How to Handle It#

In summary, because the loss function change is proportional to the square of the gradient norm, training stability necessitates that gradient norms cannot be too large, with long-term behavior characterized by values less than 1. If significantly larger gradient norms appear during early training, the typical strategy is Warmup.

Alternatively, consider a more general strategy: set another threshold $\mathcal{T}$ and clip $\eta$ based on the value of $\boldsymbol{u}_t\cdot \boldsymbol{g}_t$:

(3) \[ \eta_t = \left\{\begin{aligned}&\eta,& \boldsymbol{u}_t\cdot \boldsymbol{g}_t\leq \mathcal{T} \\ &\frac{\mathcal{T}}{\boldsymbol{u}_t\cdot \boldsymbol{g}_t}\eta,& \boldsymbol{u}_t\cdot \boldsymbol{g}_t > \mathcal{T} \end{aligned}\right. \]

This approach eliminates the need for additional Warmup settings and offers greater adaptability.
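
To make the strategy concrete, here is a minimal sketch of eq. (3) for the plain-SGD case, where $\boldsymbol{u}_t=\boldsymbol{g}_t$ and hence $\boldsymbol{u}_t\cdot\boldsymbol{g}_t=\Vert\boldsymbol{g}_t\Vert^2$; the function name and signature are illustrative rather than a reference implementation, and for other optimizers $\boldsymbol{u}_t$ would be the optimizer's own update direction.

```python
import torch

def adaptive_lr_step(params, eta, T):
    """One SGD update with the learning rate clipped as in eq. (3)."""
    with torch.no_grad():
        # u_t . g_t  (equals ||g_t||^2 for plain SGD)
        dot = sum((p.grad * p.grad).sum() for p in params if p.grad is not None)
        eta_t = eta if dot <= T else eta * T / dot
        for p in params:
            if p.grad is not None:
                p.add_(p.grad, alpha=-float(eta_t))
    return float(eta_t)

# Usage sketch: loss.backward(); adaptive_lr_step(list(model.parameters()), eta=0.1, T=1.0)
```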

Analysis for Adam and Related Optimizers

For Adam and similar optimizers, we can follow the methodology of "How Should Learning Rate Scale with Batch Size?" and approximate using $\boldsymbol{u}_t=\text{sign}(\boldsymbol{g}_t)$. In this case:

(4) \[ \Delta \mathcal{L} = -\eta\, \text{sign}(\boldsymbol{g}_t)\cdot \boldsymbol{g}_t = -\eta\, \Vert\boldsymbol{g}_t\Vert_1 \]

Here, $\Vert\cdot\Vert_1$ denotes the L1 norm (the sum of the absolute values of the components). Since the individual gradient components typically have absolute value much smaller than 1, we have $\Vert\boldsymbol{g}_t\Vert_1 \gg \Vert\boldsymbol{g}_t\Vert$. Consequently, to maintain training stability, Adam's learning rate is usually significantly smaller than SGD's.
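
The gap between the two norms is easy to see empirically; the sketch below compares them on a toy model's gradients (the model and data are arbitrary). For gradients with many components of comparable magnitude, the ratio grows roughly like $\sqrt{N}$.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 1))
loss = model(torch.randn(64, 256)).pow(2).mean()
loss.backward()

# Flatten all parameter gradients into one vector and compare norms.
flat = torch.cat([p.grad.flatten() for p in model.parameters()])
print("L1 norm :", flat.norm(1).item())
print("L2 norm :", flat.norm(2).item())
print("ratio   :", (flat.norm(1) / flat.norm(2)).item())  # grows roughly like sqrt(N)
```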

Furthermore, equation (4) can be reformulated as:

(5) \[ \Delta \mathcal{L} = -\eta\, \text{sign}(\boldsymbol{g}_t)\cdot \boldsymbol{g}_t = -\eta\, \sqrt{N}\Vert\boldsymbol{g}_t\Vert \cos(\text{sign}(\boldsymbol{g}_t), \boldsymbol{g}_t) \]

Here, we assume $\boldsymbol{g}_t$ has no zero components, so $\Vert\text{sign}(\boldsymbol{g}_t)\Vert=\sqrt{N}$, where $N$ is the total number of model parameters. Empirically, both $\Vert\boldsymbol{g}_t\Vert$ and $\cos(\text{sign}(\boldsymbol{g}_t), \boldsymbol{g}_t)$ remain roughly constant across different model scales. Therefore, to maintain constant $\Delta \mathcal{L}$, $\eta$ should be inversely proportional to $\sqrt{N}$—meaning if the parameter count quadruples, the learning rate should be halved.
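
As a worked example of the $\eta\propto 1/\sqrt{N}$ rule, the following snippet scales a reference learning rate as the parameter count grows; the reference values here are made up purely for illustration.

```python
import math

# Hypothetical reference point: a model with N_ref parameters trained at lr_ref.
N_ref, lr_ref = 1.0e9, 3.0e-4

for N in [1.0e9, 4.0e9, 16.0e9]:
    lr = lr_ref * math.sqrt(N_ref / N)   # eta proportional to 1/sqrt(N)
    print(f"N = {N:.0e}  ->  lr = {lr:.2e}")
# Quadrupling the parameter count halves the learning rate; 16x quarters it.
```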

Conclusion#

This article presents our perspective and analysis on the phenomenon of "gradient clipping's default norm being 1." We have traced this seemingly arbitrary default value back to fundamental requirements of training stability and loss landscape geometry, revealing its surprisingly universal applicability across diverse model architectures and scales.

Citation Information

Original Article: Su Jianlin. Why is the Default Norm for Gradient Clipping 1?. Scientific Spaces.

How to cite this translation:

Su, J. Why is the Default Norm for Gradient Clipping 1? [Translated by ScalingOpt Team]. Scientific Spaces.

BibTeX:

@article{su2025gradient_clipping_norm,
  title   = {Why is the Default Norm for Gradient Clipping 1?},
  author  = {Su, Jianlin},
  journal = {Scientific Spaces},
  year    = {2025},
  url     = {https://kexue.fm/archives/10564},
  note    = {Translated by ScalingOpt Team}
}