Precision & Quantization
A comprehensive guide to numerical precision in Deep Learning: understanding the trade-offs between FP32, BF16, FP8, INT8, and emerging FP4 formats.
High Precision: FP32 & TF32
The Standard for Stability
Single-precision floating-point (FP32) is the "gold standard" for deep learning training, offering a wide dynamic range and high precision. TF32 (TensorFloat-32) is an NVIDIA-specific optimization on Ampere+ GPUs that maintains the range of FP32 but reduces precision to match FP16, accelerating matrix multiplications.
Characteristics
- FP32: 1 sign, 8 exponent, 23 mantissa. High precision (~7 decimal digits).
- TF32: 1 sign, 8 exponent, 10 mantissa. Same range as FP32, lower precision.
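The PyTorch sketch below (an illustrative snippet, not taken from any cited work) shows the standard switches for letting FP32 matmuls run as TF32 on Ampere-or-newer GPUs:

```python
import torch

# TF32 only affects CUDA matmuls/convolutions on Ampere+ GPUs (compute capability >= 8.0).
# "high" lets float32 matmuls use TF32 Tensor Cores; "highest" keeps full FP32 precision.
torch.set_float32_matmul_precision("high")

# Equivalent lower-level flags:
torch.backends.cuda.matmul.allow_tf32 = True   # TF32 for matrix multiplications
torch.backends.cudnn.allow_tf32 = True         # TF32 for cuDNN convolutions

a = torch.randn(4096, 4096, device="cuda")
b = torch.randn(4096, 4096, device="cuda")
c = a @ b  # executed in TF32: FP32 exponent range, 10-bit mantissa
```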
Issues & Trade-offs
- Memory Consumption: 4 bytes per parameter. A 7B model requires ~28GB just for weights.
- Throughput: FP32 matrix multiplications run significantly slower than lower-precision formats on Tensor Cores, so large models quickly become compute-bound.
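To make the memory figures concrete, a quick back-of-the-envelope calculation (weights only; optimizer states and activations add substantially more):

```python
# Weight memory of a 7B-parameter model at different precisions (weights only).
params = 7e9
for fmt, bytes_per_param in [("FP32", 4), ("FP16/BF16", 2), ("FP8/INT8", 1), ("FP4/INT4", 0.5)]:
    print(f"{fmt:>10}: {params * bytes_per_param / 1e9:5.1f} GB")
# FP32: 28.0 GB, FP16/BF16: 14.0 GB, FP8/INT8: 7.0 GB, FP4/INT4: 3.5 GB
```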
Half Precision: FP16 & BF16
Mixed Precision Training
Half-precision formats reduce memory usage by 50% and double throughput. BF16 (Brain Float 16) has become the default for LLM training because it preserves the dynamic range of FP32, avoiding the overflow/underflow issues common with standard FP16.
Comparison
- FP16: 1 sign, 5 exponent, 10 mantissa. Limited range: normal values span roughly \([6 \times 10^{-5}, 65504]\).
- BF16: 1 sign, 8 exponent, 7 mantissa. Same range as FP32 (\(\approx 10^{-38}\) to \(\approx 3.4 \times 10^{38}\)).
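These trade-offs can be checked directly with PyTorch's `torch.finfo`:

```python
import torch

for dtype in (torch.float32, torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    # max: largest finite value; tiny: smallest positive normal; eps: gap between 1.0 and the next value
    print(f"{str(dtype):>14}  max={info.max:.3e}  tiny={info.tiny:.3e}  eps={info.eps:.3e}")

# torch.float16:  max=6.550e+04  tiny=6.104e-05  eps=9.766e-04
# torch.bfloat16: max=3.390e+38  tiny=1.175e-38  eps=7.813e-03  (FP32-like range, coarser precision)
```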
Optimization Issues
- Loss Scaling (FP16): Required to prevent gradient underflow (gradients becoming 0). BF16 typically does not need this.
- Precision (BF16): Lower precision than FP16. Gradients might need stochastic rounding or accumulation in FP32.
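A minimal mixed-precision training step sketching where loss scaling fits in; the tiny model and loss here are placeholders, while `autocast` and `GradScaler` are the standard `torch.amp` APIs in recent PyTorch releases:

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()         # placeholder model
optimizer = torch.optim.AdamW(model.parameters())  # FP32 master weights and optimizer states
scaler = torch.amp.GradScaler("cuda")              # use torch.cuda.amp.GradScaler() on older PyTorch

use_bf16 = True                                    # BF16: FP32-like range, no loss scaling needed
amp_dtype = torch.bfloat16 if use_bf16 else torch.float16

for step in range(10):
    x = torch.randn(32, 1024, device="cuda")
    with torch.autocast(device_type="cuda", dtype=amp_dtype):
        loss = model(x).pow(2).mean()              # forward (and loss) computed in half precision
    if use_bf16:
        loss.backward()
        optimizer.step()
    else:
        scaler.scale(loss).backward()              # FP16: scale the loss to avoid gradient underflow
        scaler.step(optimizer)                     # unscales gradients, skips step on inf/NaN
        scaler.update()
    optimizer.zero_grad(set_to_none=True)
```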
8-bit Precision: INT8 & FP8
Efficiency & FP8 Era
8-bit formats are revolutionizing both training and inference. INT8 is widely used for post-training quantization. FP8 (supported on NVIDIA H100/Ada) introduces two formats: E4M3 for weights/activations and E5M2 for gradients, enabling 8-bit training with minimal accuracy loss.
Formats
- FP8 E4M3: Higher precision, smaller range (max ±448). Best for the forward pass (weights/activations).
- FP8 E5M2: Wider range (same 5-bit exponent as FP16, max ±57344), lower precision. Best for gradients.
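Recent PyTorch builds (2.1+) expose these two formats as `torch.float8_e4m3fn` and `torch.float8_e5m2`. The per-tensor scaling below is a deliberately simplified illustration, not a production recipe:

```python
import torch

print(torch.finfo(torch.float8_e4m3fn).max)  # 448.0   -- more mantissa bits, smaller range
print(torch.finfo(torch.float8_e5m2).max)    # 57344.0 -- FP16-like exponent range, 2 mantissa bits

def to_fp8(x: torch.Tensor, dtype=torch.float8_e4m3fn):
    """Cast to FP8 using a simple per-tensor absmax scale (illustrative only)."""
    scale = torch.finfo(dtype).max / x.abs().max().clamp(min=1e-12)
    return (x * scale).to(dtype), scale          # keep the scale to dequantize later

x = torch.randn(4, 4)
x_fp8, scale = to_fp8(x)
print((x - x_fp8.float() / scale).abs().max())   # round-trip quantization error
```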
Issues
- Quantization Noise: Rounding errors can accumulate. Requires sophisticated scaling (e.g., row-wise/block-wise quantization).
- Hardware Support: True FP8 acceleration requires newer GPUs (H100, RTX 4090).
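The sketch below implements the row-wise absmax INT8 scheme referred to above, in the spirit of LLM.int8()'s vector-wise quantization but without its outlier decomposition:

```python
import torch

def quantize_int8_rowwise(w: torch.Tensor):
    """Symmetric per-row absmax quantization to INT8 (one scale per row)."""
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor):
    return q.float() * scale

w = torch.randn(8, 16)
q, scale = quantize_int8_rowwise(w)
print((w - dequantize_int8(q, scale)).abs().mean())  # mean absolute quantization error
```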
Key Papers
INT vs. FP: A Comprehensive Study of Fine-Grained Low-bit Quantization Formats (Chen et al., 2025)
Compares FP and INT formats, revealing that for 8-bit fine-grained quantization, MXINT8 is superior to its FP counterpart in both accuracy and efficiency.
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale (Dettmers et al., 2022)
Introduces vector-wise quantization and mixed-precision decomposition to handle outlier features in LLMs.
FP8 Formats for Deep Learning (Micikevicius et al., 2022)
Standardizing the E4M3 and E5M2 formats for deep learning training.
Low Precision: FP4 & INT4
The Frontier of Compression
4-bit precision is the current frontier for extreme efficiency, primarily used in inference and fine-tuning (QLoRA). NVIDIA's Blackwell architecture introduces native FP4 support, promising 2x throughput over FP8. Research like BitNet is even pushing towards 1-bit variants.
FP4 Characteristics
- Limited States: Only 16 representable values.
- E2M1 vs. E3M0: candidate bit layouts trading mantissa precision for exponent range; E2M1 (1 sign, 2 exponent, 1 mantissa) is the layout used by the MXFP4 and NVFP4 element formats (see the sketch below).
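Because a 4-bit float has only 16 bit patterns, the whole code-book can be enumerated by hand. This sketch assumes the usual E2M1 convention (exponent bias 1, subnormals, no inf/NaN encodings), matching the MXFP4/NVFP4 element format:

```python
def e2m1_values():
    """All representable E2M1 (FP4) values: 1 sign, 2 exponent (bias 1), 1 mantissa bit."""
    bias, vals = 1, []
    for sign in (+1, -1):
        for exp in range(4):          # 2 exponent bits
            for man in range(2):      # 1 mantissa bit
                if exp == 0:          # subnormal encodings: 0.0 and 0.5
                    mag = (man / 2) * 2 ** (1 - bias)
                else:                 # normal encodings: implicit leading 1
                    mag = (1 + man / 2) * 2 ** (exp - bias)
                vals.append(sign * mag)
    return sorted(set(vals))          # +0 and -0 collapse, leaving 15 distinct values

print(e2m1_values())
# [-6.0, -4.0, -3.0, -2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
```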
Major Challenges
- Accuracy Drop: Significant information loss. Requires "Quantization-Aware Training" (QAT) or advanced calibration.
- Gradient Precision: Gradients usually need higher precision (FP16/BF16) to update weights effectively.
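A generic illustration of the fake-quantization + straight-through-estimator (STE) trick that underlies most QAT recipes: the forward pass sees 4-bit-rounded weights, while gradients flow unchanged to the full-precision master copy. This is a sketch of the general technique, not the method of any specific paper below:

```python
import torch

class FakeQuant4Bit(torch.autograd.Function):
    """Forward: symmetric INT4-style quantize/dequantize. Backward: straight-through (identity)."""

    @staticmethod
    def forward(ctx, w):
        scale = w.abs().amax().clamp(min=1e-8) / 7.0          # symmetric 4-bit levels: -7..7
        return torch.clamp((w / scale).round(), -7, 7) * scale

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output                                     # gradient passes straight through

w = torch.randn(64, 64, requires_grad=True)                    # full-precision master weights
loss = FakeQuant4Bit.apply(w).sum()
loss.backward()                                                # w.grad is populated despite round()/clamp()
print(w.grad.shape)
```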
Key Papers
Pretraining Large Language Models with NVFP4 (NVIDIA et al., 2025)
Introduces a method for stable LLM pretraining in NVFP4 (using Random Hadamard transforms and 2D quantization), achieving performance comparable to FP8 on a 12B model.
QLoRA: Efficient Finetuning of Quantized LLMs (Dettmers et al., NeurIPS 2023)
Introduces "NormalFloat4" (NF4), an information-theoretically optimal 4-bit data type for normally distributed weights, enabling fine-tuning of 65B models on a single GPU.
BitNet: Scaling 1-bit Transformers for Large Language Models (Wang et al., 2023)
Proposes a 1-bit Transformer with binary weights (extended to ternary in the follow-up BitNet b1.58), showing that ultra-low-precision training is possible with targeted architectural changes (BitLinear).