Optimizer Blogs
Stay updated with the latest insights, research, and tutorials about optimization algorithms from leading experts in the field.
Weight Decay and Learning Rate from a Moving Average Perspective
A theoretical analysis of optimal scheduling in large language model pre-training, viewing the training process as maintaining a moving-average memory of the training data.
Muon Optimizer Appreciation: The Essential Leap from Vectors to Matrices
An in-depth analysis of the matrix-centric Muon optimizer, exploring why Muon is efficient and how it embodies the essential differences between vectors and matrices.
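To make the "vectors vs. matrices" point concrete, here is a minimal numpy sketch (my own illustration, not Muon's production code) of the orthogonalization step a Muon-style update applies to the whole momentum matrix, using a generic cubic Newton-Schulz iteration; practical implementations typically use a tuned higher-order variant.

```python
import numpy as np

def orthogonalize(M, steps=30):
    """Approximate the orthogonal polar factor of M (i.e. U V^T from M = U S V^T)
    with the cubic Newton-Schulz iteration X <- 1.5 X - 0.5 X X^T X."""
    X = M / np.linalg.norm(M)          # Frobenius normalization puts every singular value in (0, 1]
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

rng = np.random.default_rng(0)
G = rng.standard_normal((64, 32))      # stand-in for a momentum/gradient matrix
O = orthogonalize(G)
print(np.allclose(O.T @ O, np.eye(32), atol=1e-3))   # True: columns are (nearly) orthonormal
```

The point of the sketch is that the update treats the matrix as a single geometric object (its singular values are equalized), rather than as a bag of independent scalar coordinates.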
Asymptotic Estimation of Weight RMS in AdamW (Part 1)
Using a mean-field approximation to derive asymptotic expressions for the weight RMS, revealing how the weight norm is encoded in the optimizer's hyperparameters.
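As a rough companion to the result described above, the toy simulation below (an illustrative assumption of i.i.d. updates with fixed RMS, not the article's mean-field derivation) treats the weights under decoupled weight decay as the AR(1) recursion that the AdamW step induces, and compares the empirical weight RMS with the prediction of that recursion.

```python
import numpy as np

lr, wd, rho = 1e-2, 0.1, 0.2           # learning rate, weight decay, RMS of the raw updates
rng = np.random.default_rng(0)
w = np.zeros(5_000)                    # many independent coordinates
for _ in range(20_000):                # far longer than the 1/(lr*wd) time constant
    u = rho * rng.standard_normal(w.shape)
    w = (1 - lr * wd) * w - lr * u     # AdamW-style step with decoupled weight decay

print(np.sqrt(np.mean(w**2)))          # empirical weight RMS, ~0.045
print(rho * np.sqrt(lr / (2 * wd)))    # AR(1) prediction: ~0.0447
```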
Asymptotic Estimation of Weight RMS in AdamW (Part 2)
Extending the asymptotic estimate to dynamic weight decay and learning rate schedules, generalizing the analysis to practical training scenarios.
Muon Sequel: Why Did We Choose to Experiment with Muon?
A practical investigation of the Muon optimizer for large-scale LLM training, discussing the motivation and initial results of applying Muon at scale.
QK-Clip: Advancing Muon Further at Scale
A novel method for addressing MaxLogit explosion in large-scale transformer training, enabling stable training of trillion-parameter models with the Muon optimizer.
Higher-Order MuP: A More Concise Yet Profound Spectral Condition Scaling
A sophisticated spectral condition approach for cross-model hyperparameter transfer, providing intuitive algebraic insights into initialization and learning rate scaling laws.
A Preliminary Exploration of MuP: Cross-Model Scaling Laws for Hyperparameters
An in-depth analysis of Maximal Update Parametrization (MuP) for zero-shot hyperparameter transfer across model scales.
Rethinking Learning Rate and Batch Size (Part 1): Current Landscape
A theoretical analysis of learning rate scaling laws and their computational challenges in modern optimizers.
Rethinking Learning Rate and Batch Size (Part 2): Mean Field Theory
Applying mean field approximations to simplify the analysis of learning rate scaling laws for non-linear optimizers like SignSGD.
Rethinking Learning Rate and Batch Size (Part 3): Muon Analysis
Applying mean field theory to analyze learning rate scaling laws for the Muon optimizer, revealing its asymptotic similarity to SignSGD.
Rethinking Learning Rate and Batch Size (Part 4): EMA Effects
Analyzing how Exponential Moving Averages in optimizers affect learning rate scaling laws, with implications for Adam and the Surge phenomenon.
Why is Adam's Update RMS 0.2?
A theoretical and empirical investigation of why the RMS of Adam's updates stabilizes around 0.2.
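For intuition, a minimal simulation under the idealized assumption of pure-noise N(0,1) gradients already reproduces the number in the title: with the default β1=0.9, β2=0.999, the steady-state RMS of m̂/√v̂ is about √((1−β1)/(1+β1)) ≈ 0.23, i.e. roughly 0.2 (the article makes this precise).

```python
import numpy as np

beta1, beta2, eps = 0.9, 0.999, 1e-8
rng = np.random.default_rng(0)
m = np.zeros(20_000)
v = np.zeros(20_000)
T = 10_000                                    # long enough for both EMAs to reach steady state
for t in range(1, T + 1):
    g = rng.standard_normal(m.shape)          # pure-noise gradient (the idealized assumption)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
update = (m / (1 - beta1**T)) / (np.sqrt(v / (1 - beta2**T)) + eps)
print(np.sqrt(np.mean(update**2)))            # ~0.23, i.e. the "roughly 0.2" of the title
```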
How Does Adam's Epsilon Affect Learning Rate Scaling Laws?
Analyzing the impact of Adam's epsilon parameter on learning rate scaling laws with batch size, exploring the transition between SignSGD and SGD behaviors.
How Should Learning Rate Scale with Batch Size?
A comprehensive analysis of scaling laws between learning rate and batch size from multiple perspectives, covering SGD, Adam, and their variants.
Efficient Inversion Methods for "Diagonal + Low-Rank" Triangular Matrices
Efficient algorithms for inverting triangular matrices with diagonal-plus-low-rank structure, a pattern that commonly appears in modern linear attention architectures.
Understanding Adaptive Learning Rate Optimizers from Hessian Approximation
Analyzing Adam and related optimizers through the lens of Hessian approximation, revealing their connections to second-order Newton methods.
From Spectral Norm Gradient to Novel Weight Decay: A Theoretical Exploration
Deriving spectral norm gradients and proposing novel weight decay mechanisms based on spectral norm regularization for improved model generalization.
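The starting fact of that derivation can be checked numerically: when the top singular value of W is simple, the gradient of the spectral norm ‖W‖₂ with respect to W is the rank-1 matrix u₁v₁ᵀ formed from the top singular vectors. A small numpy check (my own illustration, not code from the article):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((40, 30))
U, S, Vt = np.linalg.svd(W)
grad = np.outer(U[:, 0], Vt[0])               # u1 v1^T, the claimed gradient of ||W||_2

Delta = rng.standard_normal(W.shape)          # random perturbation direction
eps = 1e-6
fd = (np.linalg.norm(W + eps * Delta, 2) - np.linalg.norm(W - eps * Delta, 2)) / (2 * eps)
print(fd, np.sum(grad * Delta))               # the two numbers agree to several digits
```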
Gradient Perspective on LoRA: Introduction, Analysis, Conjectures, and Extensions
A comprehensive analysis of LoRA from an optimization perspective, examining computational efficiency, gradient dynamics, and proposing novel extensions to the low-rank adaptation framework.
Steepest Descent on Manifolds: 1. SGD on Hyperspheres
A rigorous mathematical analysis of steepest descent directions on constrained manifolds, beginning with SGD on hyperspheres and examining the geometric foundations of optimization under constraints.
Steepest Descent on Manifolds: 2. Muon with Orthogonal Constraints
Extending the constrained optimization framework to matrix parameters with spectral norm constraints (Muon optimizer) and orthogonal manifold constraints, deriving steepest descent directions on orthogonal manifolds.
Steepest Descent on Manifolds: 3. Muon on the Stiefel Manifold
Completing the solution for orthogonal constraints on non-square matrices (Stiefel manifolds) with the Muon optimizer, deriving iterative algorithms and comparing them with heuristic approaches.
Exploring Linear Attention: Must Attention Have a Softmax?
Comprehensive analysis of linear attention mechanisms that remove softmax to achieve O(n) complexity while preserving attention functionality.
A Brief History of Linear Attention: From Imitation and Innovation to Giving Back
Comprehensive overview of linear attention development from early approximations to modern innovations and reciprocal influences on softmax attention architectures.
Different Learning Rates for LoRA: Can LoRA Improve Further?
Analyzing the theoretical basis and empirical evidence for assigning different learning rates to LoRA matrices, with practical recommendations for improved fine-tuning performance.
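Mechanically, giving the two LoRA matrices different learning rates is just a matter of optimizer parameter groups; the sketch below (with an illustrative toy LoRA layer and an illustrative 16x ratio, neither taken from the article) shows the PyTorch pattern, while the appropriate ratio is exactly what the article analyzes.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Toy frozen linear layer with a low-rank update W + B A (illustrative only)."""
    def __init__(self, d_in, d_out, r=8):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(d_out, d_in), requires_grad=False)
        self.lora_A = nn.Parameter(torch.randn(r, d_in) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(d_out, r))

    def forward(self, x):
        return x @ (self.weight + self.lora_B @ self.lora_A).T

model = LoRALinear(128, 128)
base_lr = 1e-4
optimizer = torch.optim.AdamW([
    {"params": [model.lora_A], "lr": base_lr},
    # a larger rate for B; the appropriate ratio between the two is what the article studies
    {"params": [model.lora_B], "lr": base_lr * 16},
])
```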
Why is the Default Norm for Gradient Clipping 1?
An analysis of the theoretical underpinnings behind the ubiquitous default clipping threshold of 1 in gradient clipping algorithms, exploring its connection to training stability and model scaling.
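For reference, the operation in question is global-norm clipping: rescale the whole gradient so its L2 norm never exceeds a threshold c, with c = 1 as the ubiquitous default the article examines. A framework-agnostic numpy sketch (my own illustration):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale a list of gradient arrays so that their joint L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g**2) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-12))
    return [g * scale for g in grads]

rng = np.random.default_rng(0)
grads = [rng.standard_normal((10, 10)), rng.standard_normal(10)]
clipped = clip_by_global_norm(grads, max_norm=1.0)     # the default threshold in question
print(np.sqrt(sum(np.sum(g**2) for g in clipped)))     # <= 1.0
```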
The Path to Low-Rank Approximation (Part 1): Pseudo-Inverse
An in-depth exploration of pseudo-inverses from a low-rank approximation perspective, covering mathematical foundations, optimization formulations, and computational methods for matrix generalized inverses.
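As a concrete anchor for the optimization view, here is a standard-linear-algebra sketch (not the article's full treatment): for a matrix with full column rank, the Moore-Penrose pseudo-inverse coincides with the least-squares formula (AᵀA)⁻¹Aᵀ.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 10))                      # full column rank with probability 1
pinv_formula = np.linalg.inv(A.T @ A) @ A.T            # least-squares / normal-equations form
print(np.allclose(pinv_formula, np.linalg.pinv(A)))    # True
```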
The Path to Low-Rank Approximation (Part 2): SVD
A comprehensive exploration of Singular Value Decomposition (SVD) from a low-rank approximation perspective, covering theoretical foundations, applications to matrix optimization, and the Eckart-Young-Mirsky theorem.
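The Eckart-Young-Mirsky theorem mentioned above is easy to check numerically: truncating the SVD to the top k singular values gives the best rank-k approximation, and its Frobenius error equals the norm of the discarded singular values. A short numpy illustration (my own, not from the article):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((60, 40))
U, S, Vt = np.linalg.svd(A, full_matrices=False)

k = 5
A_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k]          # best rank-k approximation
err = np.linalg.norm(A - A_k, "fro")
print(err, np.sqrt(np.sum(S[k:]**2)))             # the two values coincide
```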
The Path to Low-Rank Approximation (Part 3): CR Approximation
Exploring Column-Row (CR) approximation as a structured low-rank approximation technique for matrix multiplication acceleration, with connections to sampling methods and practical algorithms.
The Path to Low-Rank Approximation (Part 4): Interpolative Decomposition
Comprehensive exploration of Interpolative Decomposition (ID) as a structured low-rank factorization technique, covering deterministic QR-based algorithms, randomized methods, and theoretical foundations of column subset selection.
Steepest Descent on Manifolds: 4. Muon + Spectral Sphere
Extending steepest descent on manifolds to spectral sphere (spectral norm) constraints for the Muon optimizer, with iterative solution methods and practical implementations.
Steepest Descent on Manifolds: 5. Dual Gradient Descent
Exploring dual gradient descent for manifold optimization: transforming constraint equations into minimization problems via Lagrangian duality, with applications to Stiefel manifolds and spectral sphere constraints.
Effective Rank of Matrices
Comprehensive exploration of Effective Rank as an extension of matrix rank for numerical computation, covering threshold-based definitions, norm ratios, entropy approaches, and connections to sparsity measures.
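One of the definitions covered there, the entropy-based one, is compact enough to sketch: normalize the singular values into a probability distribution and exponentiate its Shannon entropy (this shows only that single definition, not the article's full survey):

```python
import numpy as np

def effective_rank(A, eps=1e-12):
    """Entropy-based effective rank: exp(H(p)) with p_i = sigma_i / sum_j sigma_j."""
    s = np.linalg.svd(A, compute_uv=False)
    p = s / (np.sum(s) + eps)
    return np.exp(-np.sum(p * np.log(p + eps)))

rng = np.random.default_rng(0)
low_rank = rng.standard_normal((100, 5)) @ rng.standard_normal((5, 100))
full = rng.standard_normal((100, 100))
print(effective_rank(low_rank))   # at most 5 (the exact rank), typically a bit below
print(effective_rank(full))       # much larger, reflecting the spread of singular values
```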
Fast Estimation of Spectral Norms for Random Matrices
A heuristic derivation of the √n+√m approximation for the spectral norm of a Gaussian random matrix, providing intuitive understanding and practical estimation methods, with an extension to the minimum singular value.
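The approximation itself is easy to verify empirically: for an n×m matrix with i.i.d. N(0,1) entries, the largest singular value concentrates near √n+√m. A quick numpy check (illustrative, not from the article):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 1024, 256
A = rng.standard_normal((n, m))
print(np.linalg.norm(A, 2))          # largest singular value, ~48
print(np.sqrt(n) + np.sqrt(m))       # 32 + 16 = 48
```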
Want to Contribute an Article?
Share your insights and expertise with the community! We welcome high-quality articles about optimization algorithms, research findings, and practical tutorials.