Precision & Quantization
A comprehensive guide to numerical precision in Deep Learning: understanding the trade-offs between FP32, BF16, FP8, INT8, and emerging FP4 formats.
High Precision: FP32 & TF32
The Standard for Stability
Single-precision floating-point (FP32) is the "gold standard" for deep learning training, offering a wide dynamic range and high precision. TF32 (TensorFloat-32) is an NVIDIA-specific optimization on Ampere+ GPUs that maintains the range of FP32 but reduces precision to match FP16, accelerating matrix multiplications.
Characteristics
- FP32: 1 sign, 8 exponent, 23 mantissa. High precision (~7 decimal digits).
- TF32: 1 sign, 8 exponent, 10 mantissa. Same range as FP32, lower precision.
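The PyTorch sketch below (an illustrative snippet, not taken from any cited work) shows the standard switches for letting FP32 matmuls run as TF32 on Ampere-or-newer GPUs:

```python
import torch

# TF32 only affects CUDA matmuls/convolutions on Ampere+ GPUs (compute capability >= 8.0).
# "high" lets float32 matmuls use TF32 Tensor Cores; "highest" keeps full FP32 precision.
torch.set_float32_matmul_precision("high")

# Equivalent lower-level flags:
torch.backends.cuda.matmul.allow_tf32 = True   # TF32 for matrix multiplications
torch.backends.cudnn.allow_tf32 = True         # TF32 for cuDNN convolutions

a = torch.randn(4096, 4096, device="cuda")
b = torch.randn(4096, 4096, device="cuda")
c = a @ b  # executed in TF32: FP32 exponent range, 10-bit mantissa
```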
Issues & Trade-offs
- Memory Consumption: 4 bytes per parameter. A 7B model requires ~28GB just for weights.
- Throughput: FP32 matrix multiplications run significantly slower than lower-precision formats on Tensor Cores, so large models quickly become compute-bound.
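To make the memory figures concrete, a quick back-of-the-envelope calculation (weights only; optimizer states and activations add substantially more):

```python
# Weight memory of a 7B-parameter model at different precisions (weights only).
params = 7e9
for fmt, bytes_per_param in [("FP32", 4), ("FP16/BF16", 2), ("FP8/INT8", 1), ("FP4/INT4", 0.5)]:
    print(f"{fmt:>10}: {params * bytes_per_param / 1e9:5.1f} GB")
# FP32: 28.0 GB, FP16/BF16: 14.0 GB, FP8/INT8: 7.0 GB, FP4/INT4: 3.5 GB
```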
Half Precision: FP16 & BF16
Mixed Precision Training
Half-precision formats reduce memory usage by 50% and double throughput. BF16 (Brain Float 16) has become the default for LLM training because it preserves the dynamic range of FP32, avoiding the overflow/underflow issues common with standard FP16.
Comparison
- FP16: 1 sign, 5 exponent, 10 mantissa. Limited range: normal values span roughly \([6 \times 10^{-5}, 65504]\).
- BF16: 1 sign, 8 exponent, 7 mantissa. Same range as FP32 (\(\approx 10^{-38}\) to \(\approx 3.4 \times 10^{38}\)).
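These trade-offs can be checked directly with PyTorch's `torch.finfo`:

```python
import torch

for dtype in (torch.float32, torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    # max: largest finite value; tiny: smallest positive normal; eps: gap between 1.0 and the next value
    print(f"{str(dtype):>14}  max={info.max:.3e}  tiny={info.tiny:.3e}  eps={info.eps:.3e}")

# torch.float16:  max=6.550e+04  tiny=6.104e-05  eps=9.766e-04
# torch.bfloat16: max=3.390e+38  tiny=1.175e-38  eps=7.813e-03  (FP32-like range, coarser precision)
```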
Optimization Issues
- Loss Scaling (FP16): Required to prevent gradient underflow (gradients becoming 0). BF16 typically does not need this.
- Precision (BF16): Lower precision than FP16. Gradients might need stochastic rounding or accumulation in FP32.
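A minimal mixed-precision training step sketching where loss scaling fits in; the tiny model and loss here are placeholders, while `autocast` and `GradScaler` are the standard `torch.amp` APIs in recent PyTorch releases:

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()         # placeholder model
optimizer = torch.optim.AdamW(model.parameters())  # FP32 master weights and optimizer states
scaler = torch.amp.GradScaler("cuda")              # use torch.cuda.amp.GradScaler() on older PyTorch

use_bf16 = True                                    # BF16: FP32-like range, no loss scaling needed
amp_dtype = torch.bfloat16 if use_bf16 else torch.float16

for step in range(10):
    x = torch.randn(32, 1024, device="cuda")
    with torch.autocast(device_type="cuda", dtype=amp_dtype):
        loss = model(x).pow(2).mean()              # forward (and loss) computed in half precision
    if use_bf16:
        loss.backward()
        optimizer.step()
    else:
        scaler.scale(loss).backward()              # FP16: scale the loss to avoid gradient underflow
        scaler.step(optimizer)                     # unscales gradients, skips step on inf/NaN
        scaler.update()
    optimizer.zero_grad(set_to_none=True)
```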
8-bit Precision: INT8 & FP8
Efficiency & FP8 Era
8-bit formats are revolutionizing both training and inference. INT8 is widely used for post-training quantization. FP8 (supported on NVIDIA H100/Ada) introduces two formats: E4M3 for weights/activations and E5M2 for gradients, enabling 8-bit training with minimal accuracy loss.
Formats
- FP8 E4M3: Higher precision, smaller range (max ±448). Best for the forward pass (weights/activations).
- FP8 E5M2: Wider range (same 5-bit exponent as FP16, max ±57344), lower precision. Best for gradients.
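Recent PyTorch builds (2.1+) expose these two formats as `torch.float8_e4m3fn` and `torch.float8_e5m2`. The per-tensor scaling below is a deliberately simplified illustration, not a production recipe:

```python
import torch

print(torch.finfo(torch.float8_e4m3fn).max)  # 448.0   -- more mantissa bits, smaller range
print(torch.finfo(torch.float8_e5m2).max)    # 57344.0 -- FP16-like exponent range, 2 mantissa bits

def to_fp8(x: torch.Tensor, dtype=torch.float8_e4m3fn):
    """Cast to FP8 using a simple per-tensor absmax scale (illustrative only)."""
    scale = torch.finfo(dtype).max / x.abs().max().clamp(min=1e-12)
    return (x * scale).to(dtype), scale          # keep the scale to dequantize later

x = torch.randn(4, 4)
x_fp8, scale = to_fp8(x)
print((x - x_fp8.float() / scale).abs().max())   # round-trip quantization error
```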
Issues
- Quantization Noise: Rounding errors can accumulate. Requires sophisticated scaling (e.g., row-wise/block-wise quantization).
- Hardware Support: True FP8 acceleration requires newer GPUs (H100, RTX 4090).
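The sketch below implements the row-wise absmax INT8 scheme referred to above, in the spirit of LLM.int8()'s vector-wise quantization but without its outlier decomposition:

```python
import torch

def quantize_int8_rowwise(w: torch.Tensor):
    """Symmetric per-row absmax quantization to INT8 (one scale per row)."""
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor):
    return q.float() * scale

w = torch.randn(8, 16)
q, scale = quantize_int8_rowwise(w)
print((w - dequantize_int8(q, scale)).abs().mean())  # mean absolute quantization error
```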
Key Papers
INT vs. FP: A Comprehensive Study of Fine-Grained Low-bit Quantization Formats (Chen et al., 2025)
Compares FP and INT formats, revealing that for 8-bit fine-grained quantization, MXINT8 is superior to its FP counterpart in both accuracy and efficiency.
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale (Dettmers et al., 2022)
Introduces vector-wise quantization and mixed-precision decomposition to handle outlier features in LLMs.
FP8 Formats for Deep Learning (Micikevicius et al., 2022)
Standardizing the E4M3 and E5M2 formats for deep learning training.
Low Precision: FP4 & INT4
The Frontier of Compression
4-bit precision is the current frontier for extreme efficiency, primarily used in inference and fine-tuning (QLoRA). NVIDIA's Blackwell architecture introduces native FP4 support, promising 2x throughput over FP8. Research like BitNet is even pushing towards 1-bit variants.
FP4 Characteristics
- Limited States: Only 16 representable values.
- E2M1 vs. E3M0: candidate bit layouts trading mantissa precision for exponent range; E2M1 (1 sign, 2 exponent, 1 mantissa) is the layout used by the MXFP4 and NVFP4 element formats (see the sketch below).
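Because a 4-bit float has only 16 bit patterns, the whole code-book can be enumerated by hand. This sketch assumes the usual E2M1 convention (exponent bias 1, subnormals, no inf/NaN encodings), matching the MXFP4/NVFP4 element format:

```python
def e2m1_values():
    """All representable E2M1 (FP4) values: 1 sign, 2 exponent (bias 1), 1 mantissa bit."""
    bias, vals = 1, []
    for sign in (+1, -1):
        for exp in range(4):          # 2 exponent bits
            for man in range(2):      # 1 mantissa bit
                if exp == 0:          # subnormal encodings: 0.0 and 0.5
                    mag = (man / 2) * 2 ** (1 - bias)
                else:                 # normal encodings: implicit leading 1
                    mag = (1 + man / 2) * 2 ** (exp - bias)
                vals.append(sign * mag)
    return sorted(set(vals))          # +0 and -0 collapse, leaving 15 distinct values

print(e2m1_values())
# [-6.0, -4.0, -3.0, -2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
```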
Major Challenges
- Accuracy Drop: Significant information loss. Requires "Quantization-Aware Training" (QAT) or advanced calibration.
- Gradient Precision: Gradients usually need higher precision (FP16/BF16) to update weights effectively.
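A generic illustration of the fake-quantization + straight-through-estimator (STE) trick that underlies most QAT recipes: the forward pass sees 4-bit-rounded weights, while gradients flow unchanged to the full-precision master copy. This is a sketch of the general technique, not the method of any specific paper below:

```python
import torch

class FakeQuant4Bit(torch.autograd.Function):
    """Forward: symmetric INT4-style quantize/dequantize. Backward: straight-through (identity)."""

    @staticmethod
    def forward(ctx, w):
        scale = w.abs().amax().clamp(min=1e-8) / 7.0          # symmetric 4-bit levels: -7..7
        return torch.clamp((w / scale).round(), -7, 7) * scale

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output                                     # gradient passes straight through

w = torch.randn(64, 64, requires_grad=True)                    # full-precision master weights
loss = FakeQuant4Bit.apply(w).sum()
loss.backward()                                                # w.grad is populated despite round()/clamp()
print(w.grad.shape)
```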
Key Papers
Pretraining Large Language Models with NVFP4 (NVIDIA et al., 2025)
Introduces a method for stable LLM pretraining in NVFP4 (using Random Hadamard transforms and 2D quantization), achieving performance comparable to FP8 on a 12B model.
QLoRA: Efficient Finetuning of Quantized LLMs (Dettmers et al., NeurIPS 2023)
Introduces "NormalFloat4" (NF4), an information-theoretically optimal 4-bit data type for normally distributed weights, enabling fine-tuning of 65B models on a single GPU.
BitNet: Scaling 1-bit Transformers for Large Language Models (Wang et al., 2023)
Proposes a 1-bit Transformer with binary weights (extended to ternary in the follow-up BitNet b1.58), showing that ultra-low-precision training is possible with targeted architectural changes (BitLinear).