Mixed precision training (fp16/bf16) randomly produces NaN losses that destroy training runs with no warning
You train a model in bf16 (bfloat16) to save memory and speed up computation. Training progresses normally for 12 hours. At step 8,000, the loss suddenly jumps to NaN (Not a Number). Every subsequent step is NaN and the model weights are corrupted. You roll back to the last checkpoint (step 6,000, three hours earlier) and restart. You add gradient clipping (max_grad_norm=1.0); training runs for 20 hours and hits NaN again. You lower the learning rate. That helps: training runs for 50 hours, but then NaN appears on a specific batch containing unusually long sequences.

So what? Mixed precision training is standard practice (it saves 30-50% of memory and speeds training up 1.5-2x) but introduces numerical instability. A single NaN in any gradient propagates through the entire model in one optimizer step, corrupting all weights irreversibly. The causes are numerous: loss-scaling overflow, gradient explosion on specific batches, underflow in the attention softmax, division by near-zero in layer norm. Each cause has a different fix, and the NaN itself does not tell you which one triggered it. Debugging typically means adding hooks to every layer to detect where the NaN first appears, which adds 20-30% overhead.

Why does this persist? Low-precision formats trade range or precision for speed. fp16 has a small dynamic range (maximum value about 65,504), so large intermediate values overflow to infinity and turn into NaN in subsequent calculations. bf16 keeps fp32's 8-bit exponent, so its range extends to about 3.39×10^38, but it has only 8 bits of significand versus fp32's 24, so it loses precision in accumulations and can blow up in operations such as dividing by a near-zero variance in layer norm. Loss scaling in AMP partially addresses the fp16 problems but is a heuristic, not a guarantee. The fundamental tension: fp16/bf16 is faster and cheaper but inherently less numerically stable, and there is no way to predict in advance which training runs will hit NaN. The sketches below illustrate the precision gap and the standard mitigations.
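To make the range and precision gap concrete, torch.finfo reports each format's limits directly. A small illustrative snippet (printed values are approximate):

import torch

# Compare dynamic range and precision of the three formats.
for dtype in (torch.float32, torch.bfloat16, torch.float16):
    info = torch.finfo(dtype)
    print(f"{str(dtype):15s} max={info.max:.3e}  smallest normal={info.tiny:.3e}  eps={info.eps:.3e}")

# Approximate output:
# torch.float32   max=3.403e+38  smallest normal=1.175e-38  eps=1.192e-07
# torch.bfloat16  max=3.390e+38  smallest normal=1.175e-38  eps=7.812e-03
# torch.float16   max=6.550e+04  smallest normal=6.104e-05  eps=9.766e-04

bf16 matches fp32's range but its machine epsilon is roughly 65,000 times coarser, while fp16 keeps more precision but tops out at 65,504.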
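And a minimal sketch of one guarded training step, assuming placeholder names (model, criterion, optimizer, loader): fp16 autocast with GradScaler loss scaling, gradient clipping applied to unscaled gradients, and a skipped update when the loss or gradient norm is non-finite. This is an illustrative pattern under those assumptions, not a guaranteed fix; bf16 autocast typically omits the scaler.

import torch

scaler = torch.cuda.amp.GradScaler()  # loss scaling, needed for fp16 autocast

for step, (inputs, targets) in enumerate(loader):
    optimizer.zero_grad(set_to_none=True)

    # Forward pass and loss in reduced precision.
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = criterion(model(inputs), targets)

    scaler.scale(loss).backward()

    # Unscale first so clipping sees the true gradient magnitudes.
    scaler.unscale_(optimizer)
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

    # Skip the update on a non-finite loss or gradient norm; scaler.step()
    # also skips internally if it found inf/nan during unscaling.
    if torch.isfinite(loss) and torch.isfinite(grad_norm):
        scaler.step(optimizer)
    else:
        print(f"step {step}: non-finite loss or gradients, skipping update")
    scaler.update()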
Evidence
PyTorch's AMP documentation explicitly warns about NaN losses with mixed precision. bf16 has an 8-bit exponent (same as fp32) but only an 8-bit significand versus fp32's 24 bits, roughly 2-3 decimal digits of precision versus 7. NaN debugging requires per-layer hooks that slow training by 20-30%. Loss scaling in AMP is the primary mitigation for fp16 but does not prevent every NaN source. Multiple r/MachineLearning threads on NaN debugging show it is a universal problem.
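A minimal sketch of that hook-based debugging, assuming a placeholder model: register forward and backward hooks on every module so the first layer producing a non-finite activation or gradient is reported by name. The per-tensor checks on every step are what cost the 20-30%.

import torch

def make_check(name, where):
    # Raise as soon as any output (or gradient) tensor contains NaN/Inf.
    def hook(module, inputs, outputs):
        tensors = outputs if isinstance(outputs, (tuple, list)) else (outputs,)
        for t in tensors:
            if torch.is_tensor(t) and not torch.isfinite(t).all():
                raise RuntimeError(f"non-finite values in {where} of {name}")
    return hook

for name, module in model.named_modules():
    module.register_forward_hook(make_check(name, "activations"))
    module.register_full_backward_hook(make_check(name, "gradients"))

# torch.autograd.set_detect_anomaly(True) is an alternative that pinpoints the
# op producing NaN in the backward pass, at a comparable or larger slowdown.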