GPU cloud providers oversell capacity — you reserve an instance and it is 'unavailable' when you need it

You sign up for a reserved H100 instance on AWS (p5.48xlarge). You commit to 1 year at $30/hour (vs $98/hour on-demand). You plan a training run for next Monday. Monday morning: 'InsufficientInstanceCapacity: We currently do not have sufficient p5.48xlarge capacity in the requested Availability Zone.' Your reserved instance is not available. You try other AZs. Not available. You try other regions. Capacity is available in us-east-1c, but your data is in us-west-2. Transferring 5TB of training data cross-region takes 8 hours and costs $450 in data transfer fees. You lose a full day.

So what? Cloud GPU capacity is sold on a model borrowed from airlines: overbook and hope not everyone shows up at once. But unlike airline seats, GPU workloads cannot be flexibly rescheduled. A training run that needs 8 H100s for 3 days cannot be split across time or geography. When the instance is not available, the entire workload is blocked. Teams build 30-50% schedule slack into timelines to account for GPU availability failures. A research paper submission deadline does not move because AWS overbooked GPUs.

Why does this persist? Cloud providers do not publicly disclose their overbooking ratios. Reserved Instances guarantee pricing, not availability (this is in the fine print). Guaranteed capacity requires Dedicated Hosts or Capacity Reservations, which cost 30-50% more. The cloud GPU market has structural demand exceeding supply, and providers optimize revenue by selling the same physical GPUs to multiple customers who statistically will not use them simultaneously. When they all do (e.g., at a major conference deadline), nobody gets served.
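The overbooking logic can be made concrete with a small back-of-the-envelope model. The sketch below is only an illustration, not any provider's actual method: it assumes each customer independently wants their GPUs with some probability, and every number in it (100 customers, 400 physical GPUs, the activity probabilities) is hypothetical, since real overbooking ratios are not disclosed. The function name `shortfall_probability` is likewise made up for this example.

```python
from math import comb


def shortfall_probability(customers: int, gpus_per_customer: int,
                          physical_gpus: int, p_active: float) -> float:
    """Probability that demand exceeds supply under a simple binomial model.

    Assumption (hypothetical): each customer independently requests all of
    their reserved GPUs with probability p_active.
    """
    # Number of customers that can be served at the same time.
    max_served = physical_gpus // gpus_per_customer
    if max_served >= customers:
        return 0.0  # No overbooking: everyone can always be served.
    # Binomial tail: P(X > max_served) for X ~ Binomial(customers, p_active).
    return sum(
        comb(customers, k) * p_active**k * (1 - p_active) ** (customers - k)
        for k in range(max_served + 1, customers + 1)
    )


if __name__ == "__main__":
    # Hypothetical scenario: 100 customers each sold 8 GPUs (800 sold),
    # backed by only 400 physical GPUs, i.e. a 2x overbooking ratio.
    for label, p in [("ordinary week", 0.3), ("deadline week", 0.6)]:
        risk = shortfall_probability(100, 8, 400, p)
        print(f"{label}: p_active={p}, shortfall probability={risk:.6f}")
```

Under these assumed numbers the shortfall risk is negligible when about 30% of customers are active and close to certain when about 60% are. The independence assumption is exactly what fails around a shared deadline, when demand becomes correlated, so a model like this understates the risk in the situations that matter most.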

Evidence

AWS documentation states explicitly that Reserved Instances provide a 'billing discount', not a 'capacity reservation'. AWS On-Demand Capacity Reservations (ODCRs) cost an additional premium. Multiple HN threads and r/MachineLearning posts document GPU availability failures. CoreWeave and Lambda Labs market 'guaranteed availability' as a differentiator against the hyperscalers. NVIDIA DGX Cloud attempts to address this but costs $36,999/month per H100 node.
