Precision & Quantization

A comprehensive guide to numerical precision in Deep Learning: understanding the trade-offs between FP32, BF16, FP8, INT8, and emerging FP4 formats.

High Precision: FP32 & TF32

The Standard for Stability

Single-precision floating-point (FP32) is the "gold standard" for deep learning training, offering a wide dynamic range and high precision. TF32 (TensorFloat-32) is an NVIDIA-specific mode on Ampere and newer GPUs that keeps FP32's 8-bit exponent (and therefore its range) but truncates the mantissa to 10 bits, matching FP16's precision and accelerating matrix multiplications on Tensor Cores.
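
TF32 is a compute mode rather than a storage format: tensors stay in FP32 and only the internal Tensor Core math is rounded. A minimal sketch of how it is typically enabled in PyTorch (requires a CUDA GPU; the matrix sizes are illustrative):

```python
import torch

# TF32 only affects CUDA matmuls/convolutions on Ampere or newer GPUs.
# Tensors remain torch.float32; only the internal Tensor Core math changes.
torch.backends.cuda.matmul.allow_tf32 = True   # matmuls may use TF32
torch.backends.cudnn.allow_tf32 = True         # cuDNN convolutions may use TF32

# Equivalent higher-level switch: "highest" = strict FP32, "high" = TF32 allowed.
torch.set_float32_matmul_precision("high")

a = torch.randn(4096, 4096, device="cuda")
b = torch.randn(4096, 4096, device="cuda")
c = a @ b   # runs on Tensor Cores with TF32 internally; the result dtype is still float32
```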

Characteristics

  • FP32: 1 sign, 8 exponent, 23 mantissa. High precision (~7 decimal digits).
  • TF32: 1 sign, 8 exponent, 10 mantissa. Same range as FP32, lower precision.

Issues & Trade-offs

  • Memory Consumption: 4 bytes per parameter, so a 7B-parameter model needs ~28GB just for weights (see the quick calculation after this list).
  • Throughput: FP32 math is significantly slower than lower-precision formats on Tensor Cores.
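
A back-of-the-envelope check of the 28GB figure above, and of how weight memory shrinks at lower precisions. Note this counts weights only; training adds gradients, optimizer states, and activations on top:

```python
# Weight memory = parameter count x bytes per parameter (weights only).
params = 7e9
bytes_per_param = {"FP32": 4, "BF16/FP16": 2, "FP8/INT8": 1, "FP4/INT4": 0.5}

for fmt, nbytes in bytes_per_param.items():
    print(f"{fmt:>9}: {params * nbytes / 1e9:.1f} GB")
# FP32: 28.0 GB, BF16/FP16: 14.0 GB, FP8/INT8: 7.0 GB, FP4/INT4: 3.5 GB
```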

Half Precision: FP16 & BF16

Mixed Precision Training

Half-precision formats halve memory usage relative to FP32 and roughly double Tensor Core throughput. BF16 (Brain Float 16) has become the default for LLM training because it preserves the dynamic range of FP32, avoiding the overflow/underflow issues common with standard FP16.
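
In PyTorch, BF16 mixed precision is typically applied with an autocast context: parameters and optimizer state stay in FP32 (the "master" copy) while matmul-heavy ops run in BF16. A minimal sketch with a toy model (the model, shapes, and hyperparameters are placeholders):

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()              # parameters stay in FP32
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

x = torch.randn(32, 1024, device="cuda")
target = torch.randn(32, 1024, device="cuda")

# Ops inside autocast run in BF16 where it is numerically safe.
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    out = model(x)
    loss = torch.nn.functional.mse_loss(out, target)

loss.backward()        # gradients are produced in FP32 for the FP32 master weights
optimizer.step()
optimizer.zero_grad()
# No loss scaling needed: BF16 shares FP32's exponent range, so gradients rarely underflow.
```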

Comparison

  • FP16: 1 sign, 5 exponent, 10 mantissa. Limited range \([6 \times 10^{-5}, 65504]\).
  • BF16: 1 sign, 8 exponent, 7 mantissa. Same range as FP32 (\(10^{-38}\) to \(10^{38}\)).

Optimization Issues

  • Loss Scaling (FP16): Required to prevent gradient underflow (gradients flushing to 0); see the GradScaler sketch after this list. BF16 typically does not need this.
  • Precision (BF16): Only 7 mantissa bits versus FP16's 10, so gradient accumulation and optimizer states are usually kept in FP32 (or stochastic rounding is used).
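
A minimal sketch of FP16 loss scaling with PyTorch's GradScaler: the loss is multiplied by a large factor before backward so that small gradients survive FP16, then the gradients are unscaled before the optimizer step (model, optimizer, and data are placeholders):

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
scaler = torch.cuda.amp.GradScaler()          # dynamic loss scaling for FP16

x = torch.randn(32, 1024, device="cuda")
target = torch.randn(32, 1024, device="cuda")

with torch.autocast(device_type="cuda", dtype=torch.float16):
    loss = torch.nn.functional.mse_loss(model(x), target)

scaler.scale(loss).backward()   # backward on the scaled loss -> scaled gradients
scaler.step(optimizer)          # unscales gradients; skips the step if inf/NaN is found
scaler.update()                 # grows/shrinks the scale factor for the next iteration
optimizer.zero_grad()
```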

8-bit Precision: INT8 & FP8

Efficiency & FP8 Era

8-bit formats are revolutionizing both training and inference. INT8 is widely used for post-training quantization. FP8 (supported on NVIDIA H100/Ada) introduces two formats: E4M3 for weights/activations and E5M2 for gradients, enabling 8-bit training with minimal accuracy loss.
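
Recent PyTorch builds expose both layouts as storage dtypes (torch.float8_e4m3fn and torch.float8_e5m2), which makes the range/precision trade-off easy to inspect even without FP8-capable hardware. A small sketch (dtype availability depends on your PyTorch version; real FP8 training additionally wraps matmuls with per-tensor scaling, e.g. via NVIDIA Transformer Engine):

```python
import torch

# Numeric limits of the two FP8 layouts (the "fn" E4M3 variant has no infinities).
for dtype in (torch.float8_e4m3fn, torch.float8_e5m2):
    info = torch.finfo(dtype)
    print(dtype, "max:", info.max, "smallest normal:", info.tiny)
# E4M3 tops out around 448; E5M2 reaches 57344 but with one fewer mantissa bit.

# Round-trip a tensor through FP8 to see the quantization error.
x = torch.randn(8)
for dtype in (torch.float8_e4m3fn, torch.float8_e5m2):
    x8 = x.to(dtype).to(torch.float32)        # cast down, then back up
    print(dtype, "max abs round-trip error:", (x - x8).abs().max().item())
```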

Formats

  • FP8 E4M3: Higher precision, lower range (max ≈ 448). Best for the forward pass (weights/activations).
  • FP8 E5M2: Wider range (same 5-bit exponent as FP16, max ≈ 57344), lower precision. Best for gradients.

Issues

  • Quantization Noise: Rounding errors can accumulate. Requires careful scaling, e.g., the row-wise/block-wise quantization sketched after this list.
  • Hardware Support: True FP8 acceleration requires Hopper- or Ada-class GPUs (e.g., H100, RTX 4090).
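
A sketch of the row-wise scaling mentioned above, using plain PyTorch and symmetric INT8 quantization: each row gets its own scale, so a single outlier no longer destroys the resolution of the whole tensor (function names and shapes are illustrative):

```python
import torch

def quantize_rowwise_int8(w: torch.Tensor):
    """Symmetric per-row INT8 quantization: one scale per row."""
    scale = w.abs().amax(dim=1, keepdim=True) / 127.0      # row-wise max -> scale
    q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor):
    return q.to(torch.float32) * scale

w = torch.randn(4, 1024)
w[0, 0] = 50.0                         # an outlier in row 0 only affects row 0's scale
q, scale = quantize_rowwise_int8(w)
w_hat = dequantize(q, scale)
print("mean abs error per row:", (w - w_hat).abs().mean(dim=1))
```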

Low Precision: FP4 & INT4

The Frontier of Compression

4-bit precision is the current frontier for extreme efficiency, primarily used in inference and fine-tuning (QLoRA). NVIDIA's Blackwell architecture introduces native FP4 support, promising 2x throughput over FP8. Research like BitNet is pushing even further, towards ternary (1.58-bit) and binary (1-bit) weights.
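
In practice, most 4-bit fine-tuning today follows the QLoRA recipe: the pretrained weights are stored in 4-bit NF4 while computation and the small trainable LoRA adapters run in BF16. A sketch using Hugging Face transformers with bitsandbytes (the model ID is a placeholder and exact argument names can vary across library versions):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store base weights in 4 bits
    bnb_4bit_quant_type="nf4",              # NormalFloat4, the data type used by QLoRA
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls are computed in BF16
)

model = AutoModelForCausalLM.from_pretrained(
    "your-org/your-7b-model",               # placeholder model ID
    quantization_config=bnb_config,
    device_map="auto",
)
# Fine-tuning then attaches LoRA adapters (trained in BF16) on top of the frozen 4-bit
# base weights, so gradients never have to be represented in 4-bit precision.
```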

FP4 Characteristics

  • Limited States: Only 16 code points (and +0/-0 share a value); they are enumerated in the sketch after this list.
  • E2M1 vs E3M0: E2M1 (2 exponent bits, 1 mantissa bit) is the common layout; E3M0 drops the mantissa bit entirely, trading precision for extra range.
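
A small sketch that enumerates every FP4 E2M1 code point, assuming the usual convention of exponent bias 1 with no infinities or NaNs (as in the OCP microscaling FP4 format); it makes the coarseness of 4-bit precision concrete:

```python
def decode_e2m1(code: int) -> float:
    """Decode a 4-bit E2M1 value: 1 sign bit, 2 exponent bits, 1 mantissa bit (bias = 1)."""
    sign = -1.0 if (code >> 3) & 1 else 1.0
    exp = (code >> 1) & 0b11
    man = code & 0b1
    if exp == 0:                               # subnormal: 0.m * 2^(1 - bias)
        return sign * 0.5 * man
    return sign * (1.0 + 0.5 * man) * 2.0 ** (exp - 1)

values = sorted(decode_e2m1(c) for c in range(16))
print(values)
# Magnitudes: 0, 0.5, 1, 1.5, 2, 3, 4, 6 (plus their negatives).
# Note the gap between consecutive values is already 2.0 at the top of the range.
```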

Major Challenges

  • Accuracy Drop: Significant information loss. Requires "Quantization-Aware Training" (QAT) or advanced calibration.
  • Gradient Precision: Gradients usually need higher precision (FP16/BF16) to update weights effectively.