Distributed training across multiple machines hangs with no error message and nobody knows why
You set up training across 4 nodes with 8 GPUs each (32 GPUs total) using PyTorch FSDP (Fully Sharded Data Parallel). Training starts. At step 2,847, everything freezes: GPU utilization sits at 100% on 31 GPUs and 0% on 1 GPU. No error message, no exception, no log output. The training process is alive but not progressing. You wait 30 minutes to see if it recovers; it does not. You kill the job and restart from the last checkpoint (step 2,500, an hour of lost compute). It freezes again at a different step. You spend 3 days debugging: is it an NCCL timeout? A deadlock in the data loader? A GPU memory leak? A network switch dropping packets? A single slow GPU dragging down the synchronous all-reduce? You set NCCL_DEBUG=INFO and get 50MB/s of unreadable debug logs. After 3 days you find the cause: one node's InfiniBand cable had a slightly loose connection, causing intermittent 10ms latency spikes that pushed large all-reduce operations past the NCCL timeout.

So what?
Distributed training debugging is the hardest part of ML infrastructure. When something fails silently (no error, just a hang), the cause could be hardware (GPU, network, storage), software (NCCL, PyTorch, CUDA driver), configuration (environment variables, firewall rules), or data (a malformed batch causing one GPU to OOM while others wait). Narrowing down the cause requires expertise that most ML teams do not have: they are researchers, not distributed systems engineers.

Why does this persist?
Distributed training frameworks (FSDP, DeepSpeed, Megatron) are optimized for throughput, not debuggability; error reporting is an afterthought. NCCL hangs do not produce error messages because the protocol is waiting for a response that never comes. There is nothing to report until the timeout fires, and by then the useful diagnostic information is gone.
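The "no error, just a hang" behavior falls directly out of how collectives work: every rank must enter the same operation, and a rank that never arrives leaves the others spinning inside the NCCL kernel. A minimal sketch of that failure mode, assuming a small multi-GPU job launched with torchrun (the sleep stands in for whatever actually stalled the real rank; the 2-minute timeout is illustrative, the real default is far longer):

```python
import os
import time
from datetime import timedelta

import torch
import torch.distributed as dist


def main() -> None:
    rank = int(os.environ["RANK"])
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    # Shortened collective timeout so the demo fails fast. Depending on the
    # PyTorch version and async-error-handling settings, a real job either
    # aborts when this fires or simply hangs forever.
    dist.init_process_group(backend="nccl", timeout=timedelta(minutes=2))

    x = torch.ones(1, device="cuda")
    for step in range(10):
        if rank == 1 and step == 5:
            # Simulate a stuck rank (bad batch, OOM retry loop, dead link...).
            # It does not crash; it just stops calling the collective.
            time.sleep(3600)
        # On every other rank, the NCCL kernel spins at 100% GPU utilization
        # waiting for rank 1, and the host blocks at the next sync point.
        dist.all_reduce(x)
        torch.cuda.synchronize()
        if rank == 0:
            print(f"step {step} done", flush=True)


if __name__ == "__main__":
    main()
```

Launched with something like `torchrun --nproc_per_node=2 hang_demo.py`, every rank except rank 1 freezes at step 5 with no output at all, which is exactly the 31-GPUs-busy, 1-GPU-idle picture described above.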
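In the absence of built-in diagnosis, teams bolt on instrumentation by hand. A hedged sketch of the usual knobs, assuming they run before the process group is created (the helper name and the 600-second interval are illustrative choices, not from the original post): verbose NCCL logging, PyTorch's distributed debug mode, a shorter collective timeout, and periodic per-rank stack dumps so a frozen job at least says where each rank is stuck.

```python
import faulthandler
import os
import sys
from datetime import timedelta

import torch.distributed as dist


def enable_hang_diagnostics(dump_interval_s: int = 600) -> None:
    # NCCL and torch.distributed logging; these are read when the process
    # group (and NCCL itself) initializes, so set them before init.
    os.environ.setdefault("NCCL_DEBUG", "INFO")             # verbose NCCL logs
    os.environ.setdefault("NCCL_DEBUG_SUBSYS", "INIT,NET")  # limit to init + network
    os.environ.setdefault("TORCH_DISTRIBUTED_DEBUG", "DETAIL")

    # Dump every thread's Python stack to stderr on a timer. When the job
    # freezes, each rank's dump shows which collective or data-loader wait
    # it is sitting in, which usually identifies the lagging rank.
    faulthandler.dump_traceback_later(dump_interval_s, repeat=True, file=sys.stderr)


if __name__ == "__main__":
    enable_hang_diagnostics()
    # A shorter collective timeout turns a silent multi-hour hang into a
    # loud failure after ~10 minutes.
    dist.init_process_group(backend="nccl", timeout=timedelta(minutes=10))
    # ... FSDP setup and training loop go here ...
```

From outside the process, running `py-spy dump --pid <rank_pid>` against each rank gives a similar per-rank stack view without modifying the training script. None of this amounts to automated diagnosis; it only narrows the search.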
Evidence
The PyTorch FSDP documentation includes a troubleshooting section for hangs, confirming it is a common issue. The NCCL GitHub repository has 500+ issues related to hangs and timeouts; DeepSpeed has 100+ hang-related issues. Meta's RSC (Research SuperCluster) blog post mentioned 40% of training time lost to hardware/software failures during OPT-175B training. No tool provides automated diagnosis of distributed training hangs.