Training Strategy

A comprehensive guide to the stages of LLM training: from Pre-training foundations through Fine-tuning adaptation to Reinforcement Learning alignment.

Pre-training

Foundational Knowledge Acquisition

Pre-training is the first and most computationally intensive stage of training an LLM. In this phase, the model learns to predict the next token over a massive corpus of text (e.g., CommonCrawl, The Pile). This process instills general world knowledge, grammar, and reasoning capabilities in the model.

Key Characteristics

  • Massive Datasets (Trillions of tokens)
  • Self-Supervised Learning (Next token prediction)
  • High Computational Cost (Thousands of GPUs)

Challenges

  • Training Stability (Loss spikes, divergence)
  • Efficient Distributed Training (3D Parallelism)

Key Concepts: Scaling Laws & Dynamics

Kaplan Scaling Laws (2020)

Performance (Loss) scales as a power-law with model size \(N\), dataset size \(D\), and compute \(C\). The relationship follows \(L(N) \propto N^{-\alpha}\), suggesting that larger models are significantly more sample-efficient.
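
For reference, the single-variable fits reported by Kaplan et al. take the power-law form below; the exponents are the paper's approximate fitted values for autoregressive language modeling and are best read as illustrative rather than universal:

\[
L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad \alpha_N \approx 0.076,\ \alpha_D \approx 0.095.
\]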

Chinchilla Scaling (Compute-Optimal)

Hoffmann et al. (2022) revised the scaling laws, proposing that model size and training tokens should scale equally (\(N \propto D\)) for optimal compute usage. A rule of thumb is ~20 training tokens per parameter.
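
A minimal sketch of the sizing arithmetic, assuming the widely used \(C \approx 6ND\) approximation for training FLOPs (not stated above) together with the ~20 tokens-per-parameter heuristic; the 70B / 1.4T-token example matches the Chinchilla configuration:

```python
def chinchilla_budget(n_params: float, tokens_per_param: float = 20.0):
    """Estimate compute-optimal training tokens and FLOPs for a given model size.

    Uses the ~20 tokens/parameter rule of thumb and the standard
    C ~= 6 * N * D approximation for training FLOPs.
    """
    n_tokens = tokens_per_param * n_params
    train_flops = 6.0 * n_params * n_tokens
    return n_tokens, train_flops

# Example: a 70B-parameter model (roughly the Chinchilla setting).
tokens, flops = chinchilla_budget(70e9)
print(f"tokens: {tokens:.2e}")   # ~1.4e12 (1.4T tokens)
print(f"FLOPs:  {flops:.2e}")    # ~5.9e23 training FLOPs
```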

Architecture Trends (MoE & Long Context)

Mixture of Experts (MoE): Models like Mixtral and DeepSeek-V3 activate only a subset of parameters per token, decoupling inference cost from model capacity. Long Context: Techniques like Ring Attention and RoPE scaling enable training with context windows of millions of tokens, which is crucial for RAG and reasoning.
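
To make the routing idea concrete, here is a toy top-k MoE layer in PyTorch; the layer sizes are hypothetical, and real systems such as Mixtral add load-balancing losses and expert-capacity limits on top of this:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoE(nn.Module):
    """Toy Mixture-of-Experts layer: each token is routed to its top_k experts,
    so only a fraction of the total parameters is active per token."""

    def __init__(self, d_model=64, d_ff=256, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                      # x: (tokens, d_model)
        logits = self.router(x)                # (tokens, n_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)   # renormalize over the selected experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e       # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * expert(x[mask])
        return out

tokens = torch.randn(10, 64)
print(ToyMoE()(tokens).shape)  # torch.Size([10, 64])
```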

Data Quality & Deduplication

Beyond raw quantity, data quality is paramount. Deduplication and strict filtering (e.g., filtering out low-quality web text) can shift the scaling curves, allowing smaller models to outperform larger ones trained on noisier data.
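
A minimal sketch of exact document-level deduplication via content hashing; production pipelines typically layer fuzzy methods such as MinHash/LSH on top of this:

```python
import hashlib

def dedup_exact(documents):
    """Drop exact duplicates by hashing a whitespace/case-normalized copy of each document."""
    seen, unique = set(), []
    for doc in documents:
        key = hashlib.sha256(" ".join(doc.lower().split()).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

docs = ["The cat sat.", "the  cat sat.", "A different sentence."]
print(dedup_exact(docs))  # ['The cat sat.', 'A different sentence.']
```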

Fine-tuning

Task Adaptation & Specialization

Fine-tuning adapts a pre-trained model to specific tasks or domains. This can involve Supervised Fine-Tuning (SFT) on instruction datasets to make the model follow user commands, or continued pre-training on domain-specific corpora (e.g., medical, legal, coding).
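
A common SFT detail is loss masking: the loss is computed only on response tokens, with prompt tokens excluded via the ignore index. A minimal sketch in plain PyTorch, using hypothetical token IDs (real pipelines would come from a tokenizer's chat template):

```python
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # cross_entropy's default ignore_index

def build_sft_labels(prompt_ids, response_ids):
    """Concatenate prompt and response; mask the prompt so only the response is supervised."""
    input_ids = torch.cat([prompt_ids, response_ids])
    labels = input_ids.clone()
    labels[: len(prompt_ids)] = IGNORE_INDEX
    return input_ids, labels

# Hypothetical token IDs standing in for a tokenized (instruction, answer) pair.
prompt_ids = torch.tensor([5, 17, 42])
response_ids = torch.tensor([7, 9, 2])
input_ids, labels = build_sft_labels(prompt_ids, response_ids)

# Next-token prediction: logits at positions 0..n-2 predict tokens 1..n-1.
vocab_size = 100
logits = torch.randn(len(input_ids), vocab_size)   # stand-in for model output
loss = F.cross_entropy(logits[:-1], labels[1:], ignore_index=IGNORE_INDEX)
print(loss)
```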

Techniques

  • Full Fine-tuning: Updating all model parameters.
  • PEFT (Parameter-Efficient Fine-Tuning): LoRA, QLoRA, Adapters.

Use Cases

  • Instruction Following (Chatbots)
  • Domain Adaptation (Medical, Legal, Code)

Key Concepts: Adaptation & Efficiency

Full Fine-tuning (FFT) & Efficiency
Concept & Challenges

Updating all model parameters (\(\Phi\)) for downstream tasks. While effective, it requires storing optimizer states (Adam keeps two extra states per parameter, roughly 2x the parameter memory), gradients, and activations, which often exceeds single-GPU memory (VRAM) for large models.
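
A back-of-the-envelope memory estimate, assuming the common mixed-precision Adam accounting of ~16 bytes per parameter (activation memory, which depends on batch size and sequence length, is excluded):

```python
def fft_memory_gb(n_params: float) -> dict:
    """Approximate training memory per component for mixed-precision Adam:
    BF16 weights (2 B) + BF16 gradients (2 B) + FP32 master weights (4 B)
    + Adam first/second moments (4 B + 4 B) = ~16 bytes per parameter."""
    bytes_per_param = {"weights_bf16": 2, "grads_bf16": 2,
                       "master_fp32": 4, "adam_m_fp32": 4, "adam_v_fp32": 4}
    return {k: v * n_params / 1e9 for k, v in bytes_per_param.items()}

breakdown = fft_memory_gb(7e9)   # a 7B-parameter model
print(breakdown)                 # each entry in GB
print(f"total ~ {sum(breakdown.values()):.0f} GB (excluding activations)")
# ~112 GB, already beyond a single 80 GB GPU before activations are counted.
```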

Key Optimization Techniques
  • ZeRO (Zero Redundancy Optimizer): Partitioning optimizer states, gradients, and parameters across GPUs (DeepSpeed/FSDP).
  • Gradient Checkpointing: Re-computing activations during backward pass to save memory (trade-off: increased computation).
  • Mixed Precision (BF16/FP16): Training with lower precision to reduce memory and increase throughput (a minimal sketch combining checkpointing and BF16 follows this list).
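
A minimal PyTorch sketch combining gradient checkpointing with BF16 autocast on an illustrative block; the ZeRO partitioning piece is handled by frameworks such as DeepSpeed or FSDP and is not shown here:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

device = "cuda" if torch.cuda.is_available() else "cpu"
block = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)).to(device)
x = torch.randn(8, 1024, device=device, requires_grad=True)

with torch.autocast(device_type=device, dtype=torch.bfloat16):
    # Activations inside the block are not kept; they are recomputed
    # during the backward pass (more compute, less memory).
    y = checkpoint(block, x, use_reentrant=False)

y.float().sum().backward()
print(x.grad.shape)  # torch.Size([8, 1024])
```
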
Recommended Analysis

"Scaling Laws for Transfer" (Hernandez et al., 2021) investigates the compute-optimality of transfer learning, suggesting that fine-tuning compute should scale with the pre-training compute budget.

Parameter-Efficient Fine-Tuning (PEFT)
Decomposition Methods (LoRA Family)

Injecting low-rank trainable matrices \(A\) and \(B\) into frozen layers (\(W' = W + BA\)); a minimal sketch follows the list below.

  • LoRA (Hu et al., 2021): Low-Rank Adaptation of Large Language Models. The gold standard for efficiency.
  • DoRA (Liu et al., 2024): Weight-Decomposed Low-Rank Adaptation. Decomposes weights into magnitude and direction, updating them separately for better learning capacity than LoRA.
  • QLoRA (Dettmers et al., 2023): 4-bit quantization + LoRA. Enables fine-tuning 65B models on a single 48GB GPU.
  • AdaLoRA (Zhang et al., 2023): Adaptive rank allocation based on importance scores.
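
A minimal LoRA linear layer following the \(W' = W + BA\) formulation above, with hypothetical rank and layer sizes; libraries such as peft wrap this same pattern for real models:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base weight W plus a trainable low-rank update B @ A, scaled by alpha/r."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in base.parameters():
            p.requires_grad_(False)                       # freeze W (and bias)
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero-init: start at W' = W
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(512, 512))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 8192 trainable parameters vs. 262,656 in the frozen base layer
```
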
Addition Methods (Adapters)

Inserting small bottleneck layers (down-projection -> non-linearity -> up-projection) within transformer blocks; see the code sketch after the list below.

  • Bottleneck Adapters (Houlsby et al., 2019): The original adapter architecture.
  • IA3 (Liu et al., 2022): Infused Adapter by Inhibiting and Amplifying Inner Activations. Rescaling activations with learned vectors.
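
A minimal bottleneck adapter in the Houlsby style, with hypothetical dimensions; in practice these modules are inserted after the attention and feed-forward sublayers:

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Down-project -> non-linearity -> up-project, with a residual connection."""

    def __init__(self, d_model: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        nn.init.zeros_(self.up.weight)   # near-identity behavior at initialization
        nn.init.zeros_(self.up.bias)
        self.act = nn.GELU()

    def forward(self, hidden_states):
        return hidden_states + self.up(self.act(self.down(hidden_states)))

h = torch.randn(2, 16, 768)               # (batch, seq, d_model)
print(BottleneckAdapter()(h).shape)        # torch.Size([2, 16, 768])
```
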
Prompt-based Methods
  • Prefix Tuning (Li & Liang, 2021): Prepending trainable continuous vectors to Attention Keys and Values.
  • Prompt Tuning (Lester et al., 2021): Optimizing continuous prompt embeddings at the input layer.

Reinforcement Learning

Alignment & Preference Learning

Reinforcement Learning from Human Feedback (RLHF) is the stage where the model is aligned with human values and preferences. It helps reduce toxicity and hallucinations and keeps the model helpful and harmless. Modern approaches also include direct preference optimization without an explicit reward model.

RLHF Process

  1. Collect comparison data (Human Preference).
  2. Train a Reward Model (RM) on the comparisons (see the loss sketch after this list).
  3. Optimize the Policy using PPO against the RM.
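
A minimal sketch of step 2's pairwise (Bradley-Terry style) reward-model loss, assuming the RM has already produced scalar scores for the chosen and rejected responses:

```python
import torch
import torch.nn.functional as F

def reward_model_loss(chosen_rewards: torch.Tensor, rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Pairwise preference loss: push r(chosen) above r(rejected).
    loss = -log sigmoid(r_chosen - r_rejected), averaged over the batch."""
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Scalar rewards the RM assigned to each (prompt, response) pair in a batch.
chosen = torch.tensor([1.2, 0.3, 0.9])
rejected = torch.tensor([0.4, 0.5, -0.2])
print(reward_model_loss(chosen, rejected))  # small when chosen consistently outscores rejected
```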

Algorithms

  • PPO (Proximal Policy Optimization)
  • DPO (Direct Preference Optimization)
  • KTO (Kahneman-Tversky Optimization)

Key Concepts: Modern RL Paradigms

Alignment & Verification (RLHF/RLVR)
Preference Learning (RLHF/DPO)

Aligning models with human intent: RLHF optimizes the policy with PPO against a learned reward model; DPO optimizes preference pairs directly without an explicit reward model; KTO handles unpaired binary (desirable/undesirable) feedback.
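
A minimal sketch of the DPO objective, assuming per-sequence log-probabilities have already been computed under both the policy and the frozen reference model:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta: float = 0.1):
    """DPO: -log sigmoid(beta * [(chosen log-ratio) - (rejected log-ratio)]),
    where each log-ratio is log pi_theta(y|x) - log pi_ref(y|x)."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Summed token log-probs for a batch of (chosen, rejected) completions.
loss = dpo_loss(torch.tensor([-12.0, -20.0]), torch.tensor([-15.0, -18.0]),
                torch.tensor([-13.0, -19.0]), torch.tensor([-14.0, -19.0]))
print(loss)
```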

Group-Based Optimization

GRPO (Group Relative Policy Optimization): Optimizes groups of sampled outputs relative to each other, removing the need for a value network (critic) and saving significant memory (e.g., DeepSeek-R1). GSPO (Group Sequence Policy Optimization): Applies importance ratios and clipping at the sequence level rather than per token for greater training stability.
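
A minimal sketch of GRPO's group-relative advantage: sample several completions per prompt, then normalize each reward against its own group instead of a learned value baseline:

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (n_prompts, group_size) scores for sampled completions.
    Advantage = (reward - group mean) / (group std + eps); no critic network needed."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# 2 prompts, 4 sampled completions each (e.g., 1 = correct answer, 0 = incorrect).
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [1.0, 1.0, 1.0, 0.0]])
print(group_relative_advantages(rewards))
```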

Verifiable Rewards (RLVR)

For reasoning tasks (Math, Code) with objective ground truths, verifiable outcomes (unit tests, answers) provide robust reward signals for "System 2" reasoning.
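
A minimal sketch of a verifiable reward for math-style answers (exact match after light normalization); code tasks would substitute unit-test execution for the string comparison:

```python
def verifiable_reward(model_answer: str, ground_truth: str) -> float:
    """Binary outcome reward: 1.0 if the final answer matches the reference, else 0.0."""
    normalize = lambda s: s.strip().lower().replace(",", "")
    return 1.0 if normalize(model_answer) == normalize(ground_truth) else 0.0

print(verifiable_reward("1,024", "1024"))   # 1.0
print(verifiable_reward("1000", "1024"))    # 0.0
```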

Data Interaction Paradigms (Online vs Offline)
Online RL (On-Policy)

The agent learns from data collected by its current policy: high sample complexity, but comparatively stable.

  • PPO (Proximal Policy Optimization): The default for RLHF. Constrains updates to stay close to old policy.
  • GRPO (DeepSeek): Efficient on-policy variant without a critic network.
Online RL (Off-Policy)

Learns from a replay buffer of past experiences. More sample efficient but harder to tune.

  • SAC (Soft Actor-Critic): Maximizes entropy for exploration.
  • DQN (Deep Q-Network): Foundational value-based method.
Offline RL

Learns entirely from a fixed, static dataset without environment interaction. Safe, but prone to out-of-distribution (OOD) issues.

  • CQL (Conservative Q-Learning): Penalizes Q-values for unseen actions.
  • IQL (Implicit Q-Learning): Avoids querying unseen actions.
  • Decision Transformer: RL as sequence modeling.
Exploration, Dynamics & Emerging Frontiers
Exploration Strategies

How to discover high-reward states in sparse environments.

  • Intrinsic Motivation: Curiosity-driven learning (RND, ICM).
  • Entropy Regularization: Encouraging diverse actions (SAC).
Model-Based RL

Learning a world model to simulate future states and plan. Dreamer and MuZero plan in latent spaces, achieving high sample efficiency.

Diffusion Policies

Representing the policy as a conditional diffusion process to model complex, multi-modal action distributions, overcoming limitations of Gaussian policies.