Discover, compare, and contribute to cutting-edge optimization algorithms designed for large-scale deep learning.
A research-oriented PyTorch codebase for optimizer-centric scaling studies in large language model training. Designed to make optimizer comparisons reproducible, fair, and ergonomically extensible.
30+ Optimizers
Single entrypoint (see the sketch below)
17 Model Configs
9M to 13B params
3 Data Pipelines
C4, Pile, OpenWebText
Multi-GPU DDP
Native torchrun
Architectures
Optimizer Categories
Training Modes
Pretrain
SFT / DPO
Evaluation
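As a rough illustration of the single-entrypoint design noted above, here is a minimal sketch of a name-based optimizer factory. `build_optimizer`, the registry contents, and the `train.py` launch flags are illustrative assumptions, not the repository's actual API.

```python
import torch

# Hypothetical registry mapping names to constructors; the real codebase
# exposes 30+ optimizers, this sketch registers just two for illustration.
OPTIMIZERS = {
    "adamw": lambda params, lr: torch.optim.AdamW(params, lr=lr, weight_decay=0.1),
    "sgd": lambda params, lr: torch.optim.SGD(params, lr=lr, momentum=0.9),
}

def build_optimizer(name: str, params, lr: float = 3e-4):
    """Single entrypoint: resolve an optimizer by name from the registry."""
    try:
        return OPTIMIZERS[name.lower()](params, lr)
    except KeyError:
        raise ValueError(f"Unknown optimizer '{name}'; choose from {sorted(OPTIMIZERS)}")

# Multi-GPU runs launch through native torchrun, e.g.:
#   torchrun --nproc_per_node=8 train.py --optimizer adamw
# (train.py and the --optimizer flag are placeholder names.)
```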
Recent updates from the ScalingOpt community
A novel manifold optimization method that projects momentum onto the tangent space and constrains it to a rotational Oblique manifold. It outperforms AdamW and Muon with lower memory consumption and computational cost.
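For intuition, here is a generic sketch of the tangent-space projection such a method relies on, assuming the standard Oblique manifold of unit-norm columns. This is the textbook construction, not the paper's exact algorithm.

```python
import torch

def project_to_tangent(W: torch.Tensor, M: torch.Tensor) -> torch.Tensor:
    """Project momentum M onto the tangent space of the Oblique manifold at W.

    Assumes W has unit-norm columns (diag(W^T W) = I). A tangent vector V
    must satisfy diag(W^T V) = 0, so we subtract each column's component
    along the corresponding column of W.
    """
    coeffs = (W * M).sum(dim=0, keepdim=True)  # per-column <w_j, m_j>
    return M - W * coeffs

def retract(W: torch.Tensor) -> torch.Tensor:
    """Map back onto the manifold by renormalizing columns."""
    return W / W.norm(dim=0, keepdim=True).clamp_min(1e-12)

# One illustrative update step: move along the projected momentum, retract.
W = retract(torch.randn(128, 64))
momentum = torch.randn_like(W)
W = retract(W - 1e-2 * project_to_tangent(W, momentum))
```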
Added a collection of in-depth articles from Jianlin Su (Scientific Spaces) covering optimization theory, Muon, and scaling laws.
A systematic academic summary of optimization in deep learning, spanning foundational theory through modern adaptive and higher-order methods.
Meet the members behind ScalingOpt. We thank them for their contributions.
Team information is updated continuously; we welcome collaboration inquiries by email.
ScalingOpt is shaped by collaborations across universities and foundation-model research teams, with shared interests in optimization theory, systems, and large-scale language model training.
Nanyang Technological University
Singapore
National University of Singapore
Singapore
University of Illinois Urbana-Champaign
United States
University of California, Los Angeles
United States
Hong Kong University of Science and Technology
Hong Kong SAR & Guangzhou
The Chinese University of Hong Kong
Hong Kong SAR & Shenzhen
Peking University
Beijing
Qwen
Foundation-model research partner and industry research interface
Discover the most powerful and innovative optimization algorithms powering modern AI
Restriking manifold optimization for LLM training
Better downstream generalization via common minima in pretraining
Orthogonal weight updates via Newton-Schulz iteration (see the sketch below)
Improving and Stabilizing Shampoo using Adam
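As referenced in the Newton-Schulz entry above, here is a compact sketch of the orthogonalization step popularized by Muon. The quintic coefficients follow the widely circulated reference implementation and are shown for illustration, not as a canonical setting.

```python
import torch

@torch.no_grad()
def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    """Drive the singular values of G toward 1 via a quintic Newton-Schulz
    iteration, producing an approximately orthogonal update matrix.

    The input is Frobenius-normalized so the iteration converges; the
    coefficients (a, b, c) follow the commonly cited Muon reference.
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + eps)
    transposed = X.shape[0] > X.shape[1]
    if transposed:                      # iterate in the wide orientation
        X = X.mT
    for _ in range(steps):
        A = X @ X.mT
        X = a * X + (b * A + c * A @ A) @ X
    return X.mT if transposed else X
```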
Production-ready libraries with improved distributed support and hardware optimization
Optimizers integrated into Transformers (AdamW, Adafactor), with native support for distributed training and mixed precision (see the usage sketch below).
Cutting-edge optimization algorithms such as Distributed Shampoo, developed by Meta for large-scale model training.
Advanced model optimization toolkit for NVIDIA GPUs, focusing on quantization and inference acceleration.
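For example, the Hugging Face Adafactor implementation can be dropped into a standard training loop. The hyperparameters below are a common fixed-learning-rate configuration, not tuned recommendations.

```python
import torch
from transformers.optimization import Adafactor

model = torch.nn.Linear(512, 512)  # stand-in for a real Transformer

# Fixed-LR configuration: disable Adafactor's relative-step schedule and
# parameter-scaled updates so lr behaves like a conventional learning rate.
optimizer = Adafactor(
    model.parameters(),
    lr=1e-3,
    scale_parameter=False,
    relative_step=False,
    warmup_init=False,
)

loss = model(torch.randn(8, 512)).pow(2).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```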
Our latest academic synthesis covers the entire spectrum of optimization in deep learning, from foundational gradient descent to the frontiers of second-order adaptive methods.
Everything you need to understand, implement, and scale optimization algorithms for modern AI
Explore all optimization algorithms from foundational SGD to cutting-edge Adam-mini and Muon, with detailed implementations and PyTorch code.
Access research papers, tutorials, and educational content covering optimization theory, implementation guides, and latest developments.
Contribute to open-source implementations, join GitHub discussions, and collaborate with researchers worldwide on optimization algorithms.
Connect with researchers and practitioners exploring efficient AI and optimization algorithms. Discover, learn, and contribute to the future of machine learning optimization.