Multi-GPU training requires NCCL, which only works reliably on NVIDIA; there is no vendor-neutral GPU communication library

You want to train a model across 8 GPUs using data parallelism or tensor parallelism. On NVIDIA, you use NCCL (NVIDIA Collective Communications Library): it handles all-reduce, all-gather, and broadcast across GPUs with near-optimal bandwidth utilization, and it works out of the box. On AMD, you use RCCL (ROCm Communication Collectives Library), an NCCL port that is 10-30% slower and has known deadlock issues on certain topologies. On Intel, you use oneCCL, which supports only a limited set of collective operations. And if you want to train across mixed hardware (NVIDIA + AMD, or GPU + TPU), no communication library spans vendors at all.

So what? Multi-GPU training is effectively required for any model larger than about 7B parameters, and the communication library determines training efficiency: an all-reduce that is 10% slower across 256 GPUs wastes thousands of GPU-hours. NCCL's NVIDIA exclusivity means multi-GPU training only works well on NVIDIA, adding another lock-in layer on top of CUDA. A company that wants to use AMD GPUs for some workloads and NVIDIA for others cannot train efficiently across them.

Why does this persist? NCCL is optimized for NVIDIA's proprietary interconnects (NVLink, NVSwitch), which deliver up to 900 GB/s per GPU. AMD's Infinity Fabric provides 400-800 GB/s, but RCCL does not exploit it as efficiently. Intel's Gaudi uses a different network topology altogether. Each vendor's communication library is tuned to its own hardware topology and does not generalize. The MPI standard is vendor-neutral but 2-5x slower than vendor-optimized collectives for GPU workloads.
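To make the lock-in concrete, here is a minimal sketch of how the backend choice surfaces in PyTorch's torch.distributed. The only high-performance GPU backend PyTorch ships is named "nccl" (backed by NCCL on CUDA builds and, as far as ROCm builds go, by RCCL under the same name), with the vendor-neutral but CPU-mediated "gloo" as the fallback. The launch command, tensor size, and script name are illustrative assumptions, not details from this article.

```python
# Minimal sketch: vendor-specific backend choice for a data-parallel all-reduce.
# Assumes a single-node launch such as:
#   torchrun --nproc_per_node=8 allreduce_sketch.py
import os

import torch
import torch.distributed as dist


def main():
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    local_rank = int(os.environ["LOCAL_RANK"])

    # "nccl" binds to NCCL on NVIDIA (and to RCCL on ROCm builds of PyTorch);
    # "gloo" is vendor-neutral but routes through the CPU and is far slower
    # for large tensors.
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend=backend, rank=rank, world_size=world_size)

    device = (
        torch.device("cuda", local_rank)
        if torch.cuda.is_available()
        else torch.device("cpu")
    )

    # Data-parallel gradient averaging reduces to an all-reduce: every rank
    # contributes its local gradient and receives the sum.
    grad = torch.full((1024,), float(rank), device=device)
    dist.all_reduce(grad, op=dist.ReduceOp.SUM)
    grad /= world_size  # average across ranks

    if rank == 0:
        print(f"backend={backend}, averaged value={grad[0].item():.3f}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

The point of the sketch is that the framework API is already vendor-neutral; the lock-in sits entirely in which backend can actually be selected on a given machine.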

Evidence

NCCL: NVIDIA GPUs only, optimized for NVLink/NVSwitch (900 GB/s per GPU on DGX H100).
RCCL: 10-30% slower than NCCL on equivalent benchmarks (published MLPerf results).
oneCCL: Intel-only, limited set of collective operations.
Gloo: vendor-neutral but CPU-mediated (10x slower than NCCL for large all-reduces).
No vendor-neutral GPU collective communication library exists that matches NCCL performance.
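The bandwidth gaps above are usually quoted as all-reduce "bus bandwidth" (the convention used by the nccl-tests benchmarks). The sketch below shows how such a figure is typically measured under the same torchrun launch assumptions as the earlier example; it illustrates the measurement, it does not reproduce the cited results, and the tensor size and iteration counts are arbitrary.

```python
# Rough all-reduce bus-bandwidth measurement sketch (illustrative only).
import os
import time

import torch
import torch.distributed as dist


def time_allreduce(numel: int, iters: int = 20) -> float:
    device = (
        torch.device("cuda", int(os.environ["LOCAL_RANK"]))
        if torch.cuda.is_available()
        else torch.device("cpu")
    )
    x = torch.ones(numel, device=device)

    # Warm-up so connection setup is not counted.
    for _ in range(5):
        dist.all_reduce(x)
    if device.type == "cuda":
        torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(iters):
        dist.all_reduce(x)
    if device.type == "cuda":
        torch.cuda.synchronize()
    elapsed = (time.perf_counter() - start) / iters

    # A ring all-reduce moves roughly 2 * (n - 1) / n * bytes per rank;
    # dividing by the elapsed time gives the commonly quoted bus bandwidth.
    n = dist.get_world_size()
    bytes_moved = 2 * (n - 1) / n * numel * 4  # fp32 elements
    return bytes_moved / elapsed / 1e9  # GB/s


if __name__ == "__main__":
    dist.init_process_group("nccl" if torch.cuda.is_available() else "gloo")
    bw = time_allreduce(64 * 1024 * 1024)  # 256 MB fp32 tensor
    if dist.get_rank() == 0:
        print(f"approx bus bandwidth: {bw:.1f} GB/s")
    dist.destroy_process_group()
```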
