Training Strategy
Explore the comprehensive guide to LLM training stages: from Pre-training foundations to Fine-tuning adaptation and Reinforcement Learning alignment.
Pre-training
Foundational Knowledge Acquisition
Pre-training is the first and most computationally intensive stage of training an LLM. In this phase, the model learns to predict the next token on a massive corpus of text data (e.g., CommonCrawl, The Pile). This process instills the model with general world knowledge, grammar, and reasoning capabilities.
Key Characteristics
- Massive Datasets (Trillions of tokens)
- Self-Supervised Learning (Next token prediction)
- High Computational Cost (Thousands of GPUs)
Challenges
- Training Stability (Loss spikes, divergence)
- Efficient Distributed Training (3D Parallelism)
Key Concepts: Scaling Laws & Dynamics
Kaplan Scaling Laws (2020)
Performance (Loss) scales as a power-law with model size \(N\), dataset size \(D\), and compute \(C\). The relationship follows \(L(N) \propto N^{-\alpha}\), suggesting that larger models are significantly more sample-efficient.
Chinchilla Scaling (Compute-Optimal)
Hoffmann et al. (2022) revised the scaling laws, proposing that model size and training tokens should scale equally (\(N \propto D\)) for optimal compute usage. A rule of thumb is ~20 training tokens per parameter.
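As a back-of-the-envelope illustration of the Chinchilla rule (not code from the paper), the sketch below splits a FLOP budget into a model size and token count using the common approximations \(C \approx 6ND\) and ~20 tokens per parameter; the function name and the example budget are illustrative.

```python
# Back-of-the-envelope Chinchilla-style sizing; assumes the common approximations
# C ~= 6 * N * D (training FLOPs) and D ~= 20 * N (tokens per parameter).

def compute_optimal_allocation(compute_flops: float, tokens_per_param: float = 20.0):
    """Split a FLOP budget into a model size N and a token count D."""
    # C = 6 * N * D with D = k * N  =>  N = sqrt(C / (6 * k))
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Example: a hypothetical 1e23 FLOP budget.
n, d = compute_optimal_allocation(1e23)
print(f"~{n / 1e9:.0f}B parameters trained on ~{d / 1e12:.2f}T tokens")
# -> roughly 29B parameters and ~0.58T tokens
```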
Architecture Trends (MoE & Long Context)
Mixture of Experts (MoE): Models like Mixtral and DeepSeek-V3 activate only a subset of parameters per token, decoupling inference cost from total model capacity. Long Context: Techniques like Ring Attention and RoPE scaling enable training with context lengths in the millions of tokens, which is crucial for RAG and long-form reasoning.
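A minimal PyTorch sketch of top-k expert routing, showing why only a fraction of parameters is active per token; the layer sizes, expert architecture, and class name are illustrative and not taken from Mixtral or DeepSeek-V3 (which also add load-balancing losses and optimized kernels).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Minimal top-k mixture-of-experts layer: only k experts run per token."""
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                      # x: (tokens, d_model)
        gate_logits = self.router(x)           # (tokens, n_experts)
        weights, idx = gate_logits.topk(self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)   # renormalize over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e       # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

# Usage: route 16 tokens through 2 of 8 experts each.
y = TopKMoE()(torch.randn(16, 512))
```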
Data Quality & Deduplication
Beyond raw quantity, data quality is paramount. Deduplication and strict filtering (e.g., filtering out low-quality web text) can shift the scaling curves, allowing smaller models to outperform larger ones trained on noisier data.
Fine-tuning
Task Adaptation & Specialization
Fine-tuning adapts a pre-trained model to specific tasks or domains. This can involve Supervised Fine-Tuning (SFT) on instruction datasets to make the model follow user commands, or continued pre-training on domain-specific corpora (e.g., medical, legal, coding).
Techniques
- Full Fine-tuning: Updating all model parameters.
- PEFT (Parameter-Efficient Fine-Tuning): LoRA, QLoRA, Adapters.
Use Cases
- Instruction Following (Chatbots)
- Domain Adaptation (Medical, Legal, Code)
Key Concepts: Adaptation & Efficiency
Full Fine-tuning (FFT) & Efficiency
Concept & Challenges
Updating all model parameters (\(\Phi\)) for downstream tasks. While effective, it requires maintaining optimizer states (e.g., Adam keeps two moment estimates, roughly 2x the parameter memory), gradients, and activations, often exceeding single-GPU memory (VRAM) for large models.
Key Optimization Techniques
- ZeRO (Zero Redundancy Optimizer): Partitioning optimizer states, gradients, and parameters across GPUs (DeepSpeed/FSDP).
- Gradient Checkpointing: Re-computing activations during backward pass to save memory (trade-off: increased computation).
- Mixed Precision (BF16/FP16): Training with lower precision to reduce memory and increase throughput.
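A minimal PyTorch sketch combining the last two techniques, gradient checkpointing and BF16 mixed precision; the MLP stack, sizes, and batch are placeholder assumptions, and ZeRO-style sharding (DeepSpeed/FSDP) is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.checkpoint import checkpoint

device = "cuda" if torch.cuda.is_available() else "cpu"

# Illustrative 4-block MLP stack standing in for transformer blocks; sizes are placeholders.
blocks = nn.ModuleList(
    nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))
    for _ in range(4)
).to(device)
head = nn.Linear(1024, 32000).to(device)
opt = torch.optim.AdamW(list(blocks.parameters()) + list(head.parameters()), lr=1e-4)

x = torch.randn(8, 1024, device=device)
target = torch.randint(0, 32000, (8,), device=device)

# BF16 mixed-precision forward; each block's activations are recomputed during the
# backward pass (gradient checkpointing) instead of being kept in memory.
with torch.autocast(device_type=device, dtype=torch.bfloat16):
    h = x
    for block in blocks:
        h = checkpoint(block, h, use_reentrant=False)
    loss = F.cross_entropy(head(h), target)

loss.backward()
opt.step()
opt.zero_grad()
```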
Recommended Analysis
"Scaling Laws for Transfer" (Hernandez et al., 2021) investigates the compute-optimality of transfer learning, suggesting that fine-tuning compute should scale with the pre-training compute budget.
Parameter-Efficient Fine-Tuning (PEFT)
Decomposition Methods (LoRA Family)
Injecting low-rank trainable matrices \(A\) and \(B\) into frozen layers (\(W' = W + BA\)); a minimal sketch follows this list.
- LoRA (Hu et al., 2021): Low-Rank Adaptation of Large Language Models. The gold standard for efficiency.
- DoRA (Liu et al., 2024): Weight-Decomposed Low-Rank Adaptation. Decomposes weights into magnitude and direction, updating them separately for better learning capacity than LoRA.
- QLoRA (Dettmers et al., 2023): 4-bit quantization + LoRA. Enables fine-tuning 65B models on a single 48GB GPU.
- AdaLoRA (Zhang et al., 2023): Adaptive rank allocation based on importance scores.
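A minimal sketch of the LoRA idea, assuming a single frozen nn.Linear layer; the rank, scaling, and initialization follow common practice, but the class is illustrative rather than a reference implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update: W' = W + (alpha/r) * B @ A."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)            # freeze the pretrained weight W
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)  # (r, d_in), small random init
        self.B = nn.Parameter(torch.zeros(base.out_features, r))        # (d_out, r), zero init => delta starts at 0
        self.scaling = alpha / r

    def forward(self, x):
        # Base path uses the frozen weight; the LoRA path adds the low-rank delta.
        return self.base(x) + self.scaling * (x @ self.A.t() @ self.B.t())

# Usage: wrap an existing projection; only A and B receive gradients.
layer = LoRALinear(nn.Linear(768, 768))
out = layer(torch.randn(4, 768))
```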
Addition Methods (Adapters)
Inserting small bottleneck layers (down-projection -> non-linearity -> up-projection) within transformer blocks; a minimal sketch follows this list.
- Bottleneck Adapters (Houlsby et al., 2019): The original adapter architecture.
- IA3 (Liu et al., 2022): Infused Adapter by Inhibiting and Amplifying Inner Activations. Rescaling activations with learned vectors.
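A minimal sketch of the bottleneck design described above; the hidden sizes and zero-initialized up-projection are common choices, not the exact Houlsby et al. configuration.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Bottleneck adapter: down-projection -> non-linearity -> up-projection, with a residual connection."""
    def __init__(self, d_model: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        nn.init.zeros_(self.up.weight)   # start near-identity so pretrained behaviour is preserved
        nn.init.zeros_(self.up.bias)

    def forward(self, h):
        return h + self.up(torch.relu(self.down(h)))   # residual keeps the frozen block's output

# Usage: inserted after a (frozen) transformer sub-layer's output.
h = torch.randn(4, 16, 768)
h = BottleneckAdapter()(h)
```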
Prompt-based Methods
- Prefix Tuning (Li & Liang, 2021): Prepending trainable continuous vectors to Attention Keys and Values.
- Prompt Tuning (Lester et al., 2021): Optimizing continuous prompt embeddings at the input layer.
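A minimal sketch of the input-layer (prompt-tuning) variant: trainable soft-prompt embeddings are prepended to the token embeddings of a frozen model; the sizes and names are illustrative.

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Prompt-tuning sketch: trainable continuous embeddings prepended to the input sequence."""
    def __init__(self, n_virtual_tokens: int = 20, d_model: int = 768):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(n_virtual_tokens, d_model) * 0.02)

    def forward(self, token_embeddings):                  # (batch, seq, d_model)
        batch = token_embeddings.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        # The frozen LM then consumes [soft prompt ; real token embeddings].
        return torch.cat([prompt, token_embeddings], dim=1)

# Usage: only self.prompt is trained; the language model stays frozen.
x = torch.randn(2, 10, 768)
x_with_prompt = SoftPrompt()(x)   # (2, 30, 768)
```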
Reinforcement Learning
Alignment & Preference Learning
Reinforcement Learning from Human Feedback (RLHF) is the stage where the model is aligned with human values and preferences. It helps reduce toxicity and hallucinations and ensures the model is helpful and harmless. Modern approaches also include direct preference optimization without an explicit reward model.
RLHF Process
- Collect comparison data (Human Preference).
- Train a Reward Model (RM).
- Optimize the Policy using PPO against the RM.
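A minimal sketch of step 2, assuming the standard Bradley-Terry pairwise loss over scalar rewards for chosen vs. rejected responses; the reward model itself and the example values are placeholders.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry pairwise loss: push the reward of the preferred response above the rejected one."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Usage with scalar rewards produced by a reward-model head (illustrative values).
r_chosen = torch.tensor([1.3, 0.7, 2.1])
r_rejected = torch.tensor([0.2, 0.9, 1.0])
loss = reward_model_loss(r_chosen, r_rejected)
```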
Key Concepts: Modern RL Paradigms
Alignment & Verification (RLHF/RLVR)
Preference Learning (RLHF/DPO)
Aligning models with human intent. RLHF optimizes the policy with PPO against a learned reward model; DPO optimizes directly on preference pairs without an explicit reward model; KTO handles unpaired binary (desirable/undesirable) feedback.
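A minimal sketch of the DPO objective, assuming sequence-level log-probabilities from the policy and a frozen reference model are already computed; \(\beta\) and the example values are illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta: float = 0.1):
    """DPO: increase the policy's margin on preferred responses relative to a frozen reference model."""
    chosen_ratio = logp_chosen - ref_logp_chosen         # log pi(y_w|x) - log pi_ref(y_w|x)
    rejected_ratio = logp_rejected - ref_logp_rejected   # log pi(y_l|x) - log pi_ref(y_l|x)
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Usage with sequence log-probabilities (illustrative values).
loss = dpo_loss(
    torch.tensor([-12.0, -9.5]), torch.tensor([-14.0, -8.0]),
    torch.tensor([-12.5, -10.0]), torch.tensor([-13.0, -8.5]),
)
```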
Group-Based Optimization
GRPO (Group Relative Policy Optimization): Optimizes a group of sampled outputs relative to each other, removing the need for a value network (critic) and saving significant memory (e.g., DeepSeek-R1). GSPO (Group Sequence Policy Optimization): applies importance ratios and clipping at the sequence level rather than per token for improved training stability.
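A minimal sketch of the group-relative advantage computation at the heart of GRPO (the clipped policy update and KL term are omitted); the shapes and reward values are illustrative.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Normalize each sample's reward against its own group of rollouts, so no learned critic is needed."""
    # rewards: (n_prompts, group_size) -- several sampled completions per prompt
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Usage: 2 prompts, 4 sampled completions each (illustrative reward values).
adv = group_relative_advantages(torch.tensor([[1.0, 0.0, 0.5, 1.0],
                                              [0.0, 0.0, 1.0, 0.0]]))
```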
Verifiable Rewards (RLVR)
For reasoning tasks (Math, Code) with objective ground truths, verifiable outcomes (unit tests, answers) provide robust reward signals for "System 2" reasoning.
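A minimal sketch of a verifiable outcome reward for math-style tasks; the "Answer:" marker and exact-match comparison are simplifying assumptions, and real verifiers are typically more robust (answer normalization, symbolic checking, or unit-test execution for code).

```python
# Binary outcome reward: 1.0 only if the model's final answer matches the ground truth.

def math_reward(model_output: str, ground_truth: str) -> float:
    """Parse the final answer after an assumed 'Answer:' marker and compare to the reference."""
    answer = model_output.rsplit("Answer:", 1)[-1].strip()
    return 1.0 if answer == ground_truth.strip() else 0.0

print(math_reward("Let x=2, so x^2+3 = 7. Answer: 7", "7"))   # 1.0
print(math_reward("... Answer: 12", "7"))                     # 0.0
```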
Data Interaction Paradigms (Online vs Offline)
Online RL (On-Policy)
The agent learns from data collected by its current policy. Less sample-efficient, but stable.
- PPO (Proximal Policy Optimization): The default for RLHF. Constrains updates to stay close to the old policy (see the clipped-objective sketch after this list).
- GRPO (DeepSeek): Efficient on-policy variant without a critic network.
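A minimal sketch of the PPO clipped surrogate objective referenced above, assuming per-token log-probabilities and advantages are already available; the value loss, entropy bonus, and KL penalty to the reference model are omitted.

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps: float = 0.2):
    """PPO clipped surrogate: the probability ratio is clipped so updates stay near the old policy."""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()

# Usage with per-token log-probs and advantages (illustrative values).
loss = ppo_clip_loss(torch.tensor([-1.0, -2.0]), torch.tensor([-1.1, -1.8]),
                     torch.tensor([0.5, -0.3]))
```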
Online RL (Off-Policy)
Learns from a replay buffer of past experiences. More sample efficient but harder to tune.
- SAC (Soft Actor-Critic): Maximizes entropy for exploration.
- DQN (Deep Q-Network): Foundational value-based method.
Offline RL
Learns entirely from a fixed, static dataset without environment interaction. Safe, but prone to out-of-distribution (OOD) issues.
- CQL (Conservative Q-Learning): Penalizes Q-values for unseen actions.
- IQL (Implicit Q-Learning): Avoids querying unseen actions.
- Decision Transformer: RL as sequence modeling.
Exploration, Dynamics & Emerging Frontiers
Exploration Strategies
How to discover high-reward states in sparse environments.
- Intrinsic Motivation: Curiosity-driven learning (RND, ICM).
- Entropy Regularization: Encouraging diverse actions (SAC).
Model-Based RL
Learning a world model to simulate future states and plan. Dreamer and MuZero plan in latent spaces, achieving high sample efficiency.
Diffusion Policies
Representing the policy as a conditional diffusion process to model complex, multi-modal action distributions, overcoming limitations of Gaussian policies.
Trending Research: SFT vs. RL Dynamics
Recent studies (2025) have sparked intense debates on the distinct roles of Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). The consensus is shifting from viewing them as sequential steps to understanding their specific contributions to generalization, reasoning, and memorization.
Generalization vs. Memorization
SFT Memorizes, RL Generalizes (Chu et al., 2025)
Key Finding: SFT tends to memorize training data distributions, struggling with out-of-distribution scenarios. In contrast, RL (especially with outcome-based rewards) drives true generalization in both textual rule-following and visual reasoning. However, SFT remains essential for "formatting" the model to ensure stable RL training.
Retaining by Doing (2025)
Key Finding: Highlights the critical role of on-policy data generation (the "Doing" in RL) in mitigating catastrophic forgetting, suggesting that active exploration helps retain general capabilities better than passive SFT replay.
Reasoning Shapes & Harmonization
RL Squeezes, SFT Expands (Matsutani et al., 2025)
Key Finding: A comparative study on reasoning LLMs. SFT "Expands" the model's capabilities by exposing it to diverse reasoning patterns, whereas RL "Squeezes" the policy distribution, optimizing it to converge on specific, high-reward reasoning paths (improving accuracy but potentially reducing diversity).
CHORD: On-Policy RL Meets Off-Policy Experts (Zhang et al., 2025)
Key Finding: Proposes Dynamic Weighting to harmonize paradigms. Instead of a separate SFT stage, it treats expert data (SFT) as an off-policy auxiliary objective that is dynamically weighted during the on-policy RL process, balancing exploration with imitation to prevent overfitting to expert demonstrations.