Optimizer Blogs
Stay updated with the latest insights, research, and tutorials about optimization algorithms from leading experts in the field.
Weight Decay and Learning Rate from a Moving Average Perspective
A theoretical analysis of optimal scheduling in large language model pre-training, viewing the training process as maintaining a moving-average memory of the training data.
Muon Optimizer Appreciation: The Essential Leap from Vectors to Matrices
An in-depth analysis of the matrix-centric Muon optimizer, exploring why Muon is efficient and how it embodies the essential differences between vectors and matrices.
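To make the "vectors vs. matrices" point concrete, here is a minimal numpy sketch (my own illustration, not Muon's production code) of the orthogonalization step a Muon-style update applies to the whole momentum matrix, using a generic cubic Newton-Schulz iteration; practical implementations typically use a tuned higher-order variant.

```python
import numpy as np

def orthogonalize(M, steps=30):
    """Approximate the orthogonal polar factor of M (i.e. U V^T from M = U S V^T)
    with the cubic Newton-Schulz iteration X <- 1.5 X - 0.5 X X^T X."""
    X = M / np.linalg.norm(M)          # Frobenius normalization puts every singular value in (0, 1]
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

rng = np.random.default_rng(0)
G = rng.standard_normal((64, 32))      # stand-in for a momentum/gradient matrix
O = orthogonalize(G)
print(np.allclose(O.T @ O, np.eye(32), atol=1e-3))   # True: columns are (nearly) orthonormal
```

The point of the sketch is that the update treats the matrix as a single geometric object (its singular values are equalized), rather than as a bag of independent scalar coordinates.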
Asymptotic Estimation of Weight RMS in AdamW (Part 1)
Using a mean-field approximation to derive asymptotic expressions for the weight RMS, revealing how the weight norm is encoded in the optimizer's hyperparameters.
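As a rough companion to the result described above, the toy simulation below (an illustrative assumption of i.i.d. updates with fixed RMS, not the article's mean-field derivation) treats the weights under decoupled weight decay as the AR(1) recursion that the AdamW step induces, and compares the empirical weight RMS with the prediction of that recursion.

```python
import numpy as np

lr, wd, rho = 1e-2, 0.1, 0.2           # learning rate, weight decay, RMS of the raw updates
rng = np.random.default_rng(0)
w = np.zeros(5_000)                    # many independent coordinates
for _ in range(20_000):                # far longer than the 1/(lr*wd) time constant
    u = rho * rng.standard_normal(w.shape)
    w = (1 - lr * wd) * w - lr * u     # AdamW-style step with decoupled weight decay

print(np.sqrt(np.mean(w**2)))          # empirical weight RMS, ~0.045
print(rho * np.sqrt(lr / (2 * wd)))    # AR(1) prediction: ~0.0447
```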
Asymptotic Estimation of Weight RMS in AdamW (Part 2)
Extending the asymptotic estimate to dynamic weight decay and learning rate schedules, generalizing the analysis to practical training scenarios.
Muon Sequel: Why Did We Choose to Experiment with Muon?
A practical investigation of the Muon optimizer for large-scale LLM training, discussing the motivation and initial results of applying Muon at scale.
QK-Clip: Advancing Muon Further at Scale
A novel method for addressing MaxLogit explosion in large-scale transformer training, enabling stable training of trillion-parameter models with the Muon optimizer.
Higher-Order MuP: A More Concise Yet Profound Spectral Condition Scaling
A sophisticated spectral condition approach for cross-model hyperparameter transfer, providing intuitive algebraic insights into initialization and learning rate scaling laws.
A Preliminary Exploration of MuP: Cross-Model Scaling Laws for Hyperparameters
An in-depth analysis of Maximal Update Parametrization (MuP) for zero-shot hyperparameter transfer across model scales.
Rethinking Learning Rate and Batch Size (Part 1): Current Landscape
A theoretical analysis of learning rate scaling laws and their computational challenges in modern optimizers.
Rethinking Learning Rate and Batch Size (Part 2): Mean Field Theory
Applying mean field approximations to simplify the analysis of learning rate scaling laws for non-linear optimizers like SignSGD.
Rethinking Learning Rate and Batch Size (Part 3): Muon Analysis
Applying mean field theory to analyze learning rate scaling laws for the Muon optimizer, revealing its asymptotic similarity to SignSGD.
Rethinking Learning Rate and Batch Size (Part 4): EMA Effects
Analyzing how Exponential Moving Averages in optimizers affect learning rate scaling laws, with implications for Adam and the Surge phenomenon.
Why is Adam's Update RMS 0.2?
A theoretical and empirical investigation of why the RMS of Adam's updates stabilizes around 0.2.
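For intuition, a minimal simulation under the idealized assumption of pure-noise N(0,1) gradients already reproduces the number in the title: with the default β1=0.9, β2=0.999, the steady-state RMS of m̂/√v̂ is about √((1−β1)/(1+β1)) ≈ 0.23, i.e. roughly 0.2 (the article makes this precise).

```python
import numpy as np

beta1, beta2, eps = 0.9, 0.999, 1e-8
rng = np.random.default_rng(0)
m = np.zeros(20_000)
v = np.zeros(20_000)
T = 10_000                                    # long enough for both EMAs to reach steady state
for t in range(1, T + 1):
    g = rng.standard_normal(m.shape)          # pure-noise gradient (the idealized assumption)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
update = (m / (1 - beta1**T)) / (np.sqrt(v / (1 - beta2**T)) + eps)
print(np.sqrt(np.mean(update**2)))            # ~0.23, i.e. the "roughly 0.2" of the title
```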
How Does Adam's Epsilon Affect Learning Rate Scaling Laws?
Analyzing the impact of Adam's epsilon parameter on learning rate scaling laws with batch size, exploring the transition between SignSGD and SGD behaviors.
How Should Learning Rate Scale with Batch Size?
A comprehensive analysis of scaling laws between learning rate and batch size from multiple perspectives, covering SGD, Adam, and their variants.
Efficient Inversion Methods for "Diagonal + Low-Rank" Triangular Matrices
Efficient algorithms for inverting triangular matrices with diagonal-plus-low-rank structure, a pattern that commonly appears in modern linear attention architectures.
Understanding Adaptive Learning Rate Optimizers from Hessian Approximation
Analyzing Adam and related optimizers through the lens of Hessian approximation, revealing their connections to second-order Newton methods.
From Spectral Norm Gradient to Novel Weight Decay: A Theoretical Exploration
Deriving spectral norm gradients and proposing novel weight decay mechanisms based on spectral norm regularization for improved model generalization.
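The starting fact of that derivation can be checked numerically: when the top singular value of W is simple, the gradient of the spectral norm ‖W‖₂ with respect to W is the rank-1 matrix u₁v₁ᵀ formed from the top singular vectors. A small numpy check (my own illustration, not code from the article):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((40, 30))
U, S, Vt = np.linalg.svd(W)
grad = np.outer(U[:, 0], Vt[0])               # u1 v1^T, the claimed gradient of ||W||_2

Delta = rng.standard_normal(W.shape)          # random perturbation direction
eps = 1e-6
fd = (np.linalg.norm(W + eps * Delta, 2) - np.linalg.norm(W - eps * Delta, 2)) / (2 * eps)
print(fd, np.sum(grad * Delta))               # the two numbers agree to several digits
```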
Gradient Perspective on LoRA: Introduction, Analysis, Conjectures, and Extensions
A comprehensive analysis of LoRA from an optimization perspective, examining computational efficiency, gradient dynamics, and proposing novel extensions to the low-rank adaptation framework.
Steepest Descent on Manifolds: 1. SGD on Hyperspheres
A rigorous mathematical analysis of steepest descent directions on constrained manifolds, beginning with SGD on hyperspheres and examining the geometric foundations of optimization under constraints.
Steepest Descent on Manifolds: 2. Muon with Orthogonal Constraints
Extending the constrained optimization framework to matrix parameters with spectral norm constraints (Muon optimizer) and orthogonal manifold constraints, deriving steepest descent directions on orthogonal manifolds.
Steepest Descent on Manifolds: 3. Muon on the Stiefel Manifold
Completing the solution for orthogonal constraints on non-square matrices (Stiefel manifolds) with the Muon optimizer, deriving iterative algorithms and comparing them with heuristic approaches.
Exploring Linear Attention: Must Attention Have a Softmax?
Comprehensive analysis of linear attention mechanisms that remove softmax to achieve O(n) complexity while preserving attention functionality.
A Brief History of Linear Attention: From Imitation and Innovation to Giving Back
Comprehensive overview of linear attention development from early approximations to modern innovations and reciprocal influences on softmax attention architectures.
Different Learning Rates for LoRA: Can LoRA Improve Further?
Analyzing the theoretical basis and empirical evidence for assigning different learning rates to LoRA matrices, with practical recommendations for improved fine-tuning performance.
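Mechanically, giving the two LoRA matrices different learning rates is just a matter of optimizer parameter groups; the sketch below (with an illustrative toy LoRA layer and an illustrative 16x ratio, neither taken from the article) shows the PyTorch pattern, while the appropriate ratio is exactly what the article analyzes.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Toy frozen linear layer with a low-rank update W + B A (illustrative only)."""
    def __init__(self, d_in, d_out, r=8):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(d_out, d_in), requires_grad=False)
        self.lora_A = nn.Parameter(torch.randn(r, d_in) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(d_out, r))

    def forward(self, x):
        return x @ (self.weight + self.lora_B @ self.lora_A).T

model = LoRALinear(128, 128)
base_lr = 1e-4
optimizer = torch.optim.AdamW([
    {"params": [model.lora_A], "lr": base_lr},
    # a larger rate for B; the appropriate ratio between the two is what the article studies
    {"params": [model.lora_B], "lr": base_lr * 16},
])
```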
Why is the Default Norm for Gradient Clipping 1?
An analysis of the theoretical underpinnings behind the ubiquitous default clipping threshold of 1 in gradient clipping algorithms, exploring its connection to training stability and model scaling.
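For reference, the operation in question is global-norm clipping: rescale the whole gradient so its L2 norm never exceeds a threshold c, with c = 1 as the ubiquitous default the article examines. A framework-agnostic numpy sketch (my own illustration):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale a list of gradient arrays so that their joint L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g**2) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-12))
    return [g * scale for g in grads]

rng = np.random.default_rng(0)
grads = [rng.standard_normal((10, 10)), rng.standard_normal(10)]
clipped = clip_by_global_norm(grads, max_norm=1.0)     # the default threshold in question
print(np.sqrt(sum(np.sum(g**2) for g in clipped)))     # <= 1.0
```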
The Path to Low-Rank Approximation (Part 1): Pseudo-Inverse
An in-depth exploration of pseudo-inverses from a low-rank approximation perspective, covering mathematical foundations, optimization formulations, and computational methods for matrix generalized inverses.
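As a concrete anchor for the optimization view, here is a standard-linear-algebra sketch (not the article's full treatment): for a matrix with full column rank, the Moore-Penrose pseudo-inverse coincides with the least-squares formula (AᵀA)⁻¹Aᵀ.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 10))                      # full column rank with probability 1
pinv_formula = np.linalg.inv(A.T @ A) @ A.T            # least-squares / normal-equations form
print(np.allclose(pinv_formula, np.linalg.pinv(A)))    # True
```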
The Path to Low-Rank Approximation (Part 2): SVD
A comprehensive exploration of Singular Value Decomposition (SVD) from a low-rank approximation perspective, covering theoretical foundations, applications to matrix optimization, and the Eckart-Young-Mirsky theorem.
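The Eckart-Young-Mirsky theorem mentioned above is easy to check numerically: truncating the SVD to the top k singular values gives the best rank-k approximation, and its Frobenius error equals the norm of the discarded singular values. A short numpy illustration (my own, not from the article):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((60, 40))
U, S, Vt = np.linalg.svd(A, full_matrices=False)

k = 5
A_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k]          # best rank-k approximation
err = np.linalg.norm(A - A_k, "fro")
print(err, np.sqrt(np.sum(S[k:]**2)))             # the two values coincide
```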
The Path to Low-Rank Approximation (Part 3): CR Approximation
Exploring Column-Row (CR) approximation as a structured low-rank approximation technique for matrix multiplication acceleration, with connections to sampling methods and practical algorithms.
The Path to Low-Rank Approximation (Part 4): Interpolative Decomposition
Comprehensive exploration of Interpolative Decomposition (ID) as a structured low-rank factorization technique, covering deterministic QR-based algorithms, randomized methods, and theoretical foundations of column subset selection.
Steepest Descent on Manifolds: 4. Muon + Spectral Sphere
Extending steepest descent on manifolds to spectral sphere (spectral norm) constraints for the Muon optimizer, with iterative solution methods and practical implementations.
Steepest Descent on Manifolds: 5. Dual Gradient Descent
Exploring dual gradient descent for manifold optimization: transforming constraint equations into minimization problems via Lagrangian duality, with applications to Stiefel manifolds and spectral sphere constraints.
Effective Rank of Matrices
Comprehensive exploration of Effective Rank as an extension of matrix rank for numerical computation, covering threshold-based definitions, norm ratios, entropy approaches, and connections to sparsity measures.
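One of the definitions covered there, the entropy-based one, is compact enough to sketch: normalize the singular values into a probability distribution and exponentiate its Shannon entropy (this shows only that single definition, not the article's full survey):

```python
import numpy as np

def effective_rank(A, eps=1e-12):
    """Entropy-based effective rank: exp(H(p)) with p_i = sigma_i / sum_j sigma_j."""
    s = np.linalg.svd(A, compute_uv=False)
    p = s / (np.sum(s) + eps)
    return np.exp(-np.sum(p * np.log(p + eps)))

rng = np.random.default_rng(0)
low_rank = rng.standard_normal((100, 5)) @ rng.standard_normal((5, 100))
full = rng.standard_normal((100, 100))
print(effective_rank(low_rank))   # at most 5 (the exact rank), typically a bit below
print(effective_rank(full))       # much larger, reflecting the spread of singular values
```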
Fast Estimation of Spectral Norms for Random Matrices
A heuristic derivation of the √n+√m approximation for the spectral norm of a Gaussian random matrix, providing intuitive understanding and practical estimation methods, with an extension to the minimum singular value.
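The approximation itself is easy to verify empirically: for an n×m matrix with i.i.d. N(0,1) entries, the largest singular value concentrates near √n+√m. A quick numpy check (illustrative, not from the article):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 1024, 256
A = rng.standard_normal((n, m))
print(np.linalg.norm(A, 2))          # largest singular value, ~48
print(np.sqrt(n) + np.sqrt(m))       # 32 + 16 = 48
```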
Want to Contribute an Article?
Share your insights and expertise with the community! We welcome high-quality articles about optimization algorithms, research findings, and practical tutorials.