GPU drivers crash during long training runs and there is no automatic recovery — you lose hours of compute
You start a 72-hour training run on 8 A100 GPUs. At hour 47, one GPU throws a 'GPU has fallen off the bus' error and the NVIDIA driver resets it. The training framework (PyTorch DDP) does not handle this gracefully: the process on GPU 0 hangs waiting for GPU 3, which is resetting, and eventually every rank times out. Without a recent checkpoint, all 47 hours of compute are wasted (47 hours × 8 GPUs × $2.50/GPU-hour = $940). With a checkpoint from 3 hours ago, you restart and lose 3 hours of work. This happens once every 2-5 days on most multi-GPU training clusters.

So what? GPU hardware failures during training are not exceptional; they are expected. At scale (1,000+ GPUs), at least one GPU fails every few hours. Google has published that its TPU pods experience hardware faults every 2-3 hours during large training runs. But training frameworks (PyTorch, JAX) have minimal built-in fault tolerance: if one GPU fails, the entire distributed training job fails. The standard practice is 'checkpoint frequently and restart' (sketched below), which means accepting roughly 5-15% compute waste from re-doing work between checkpoints. For a $10M training run, that is $500K-1.5M spent on re-computation.

Why does this persist? Fault-tolerant distributed training is a hard systems problem: you need to detect the failure, remove the failed GPU from the topology, redistribute the data, re-shard the model, and resume, all without losing optimizer state. Research prototypes exist (Bamboo, Oobleck, Varuna), but none are production-grade. PyTorch's elastic training (TorchElastic) handles node-level failures, not single-GPU failures. The ML community has accepted 'checkpoint and restart' as the norm because nobody has built production-ready fault-tolerant training.
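The 'checkpoint frequently and restart' workflow is easy to describe but easy to get subtly wrong, so here is a minimal sketch of it for a DDP job launched with torchrun. Everything concrete in it is an assumption for illustration: the checkpoint path, the save interval, the toy model, and the synthetic batches are placeholders, and the finite NCCL timeout only converts an indefinite hang into a crash that a restart can catch; it does not recover the lost work.

```python
# Sketch only: paths, intervals, the toy model, and the synthetic batches
# below are illustrative assumptions, not details from the post.
import os
from datetime import timedelta

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

CKPT = "/shared/ckpt/latest.pt"   # assumed shared filesystem visible to all nodes
SAVE_EVERY = 500                  # steps between checkpoints
TOTAL_STEPS = 100_000

def main():
    local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun
    torch.cuda.set_device(local_rank)
    # A finite timeout turns a rank stuck waiting on a dead GPU into an error
    # that kills the job, instead of an indefinite hang. (Recent PyTorch also
    # exposes TORCH_NCCL_ASYNC_ERROR_HANDLING to abort hung collectives.)
    dist.init_process_group("nccl", timeout=timedelta(minutes=10))

    model = DDP(torch.nn.Linear(1024, 1024).cuda(), device_ids=[local_rank])
    optim = torch.optim.AdamW(model.parameters(), lr=1e-4)

    # Resume: every (re)start begins by reloading model, optimizer, and step.
    start_step = 0
    if os.path.exists(CKPT):
        ckpt = torch.load(CKPT, map_location=f"cuda:{local_rank}")
        model.module.load_state_dict(ckpt["model"])
        optim.load_state_dict(ckpt["optim"])
        start_step = ckpt["step"] + 1

    for step in range(start_step, TOTAL_STEPS):
        batch = torch.randn(32, 1024, device=f"cuda:{local_rank}")  # stand-in for real data
        loss = model(batch).pow(2).mean()
        loss.backward()
        optim.step()
        optim.zero_grad()

        if step % SAVE_EVERY == 0 and dist.get_rank() == 0:
            torch.save({"model": model.module.state_dict(),
                        "optim": optim.state_dict(),
                        "step": step}, CKPT + ".tmp")
            os.replace(CKPT + ".tmp", CKPT)  # atomic rename: never leaves a torn file

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched as `torchrun --nproc_per_node=8 --max-restarts=3 train.py` (the file name is a placeholder), the elastic agent can relaunch all ranks after a crash and the script resumes from the last checkpoint. That is the node-level restart TorchElastic provides; it is not single-GPU recovery.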
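To put rough numbers on the 'accept 5-15% waste' claim, here is a back-of-envelope model of the restart overhead. The failure intervals, checkpoint intervals, and relaunch times are assumptions chosen to line up with the rates quoted above, and the model ignores the cost of writing the checkpoints themselves.

```python
# Illustrative only: all parameters are assumptions, not measurements.

def waste_fraction(mtbf_hours, ckpt_interval_hours, restart_hours):
    """Fraction of cluster time lost to redone work plus relaunching.

    A failure lands, on average, halfway between checkpoints, so about
    half a checkpoint interval of work is recomputed per failure.
    """
    lost_per_failure = ckpt_interval_hours / 2 + restart_hours
    return lost_per_failure / mtbf_hours

# Single 8-GPU node from the scenario: a failure every ~3 days,
# checkpoints every 3 hours, half an hour to notice and relaunch.
print(f"{waste_fraction(mtbf_hours=72, ckpt_interval_hours=3, restart_hours=0.5):.1%}")    # 2.8%

# 1,000+ GPU cluster where some GPU fails every ~3 hours, checkpoints
# every 30 minutes, 15 minutes to relaunch.
print(f"{waste_fraction(mtbf_hours=3, ckpt_interval_hours=0.5, restart_hours=0.25):.1%}")  # 16.7%
```

Shortening the checkpoint interval drives the recompute term down but raises the I/O cost of writing checkpoints; that trade-off is exactly what fault-tolerant training (drop the failed GPU, keep going) would avoid.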
Evidence
- Google TPU pod failure rate: documented in the PaLM and Gemini training reports (hardware faults every 2-3 hours at 6,000+ chip scale).
- NVIDIA GPU failure rate: estimated 1-5% per GPU per month under continuous load (anecdotal, from GPU cloud operators).
- PyTorch DDP: no built-in single-GPU fault tolerance; TorchElastic handles node-level failures only.
- Meta's OPT-175B training log documented 100+ restarts over 2 months due to hardware failures.