Training a custom LLM can require $1-10M in GPU compute, but there is no way to estimate the cost before starting

devtools
Your company wants to fine-tune a 70B-parameter model on your proprietary data. You ask your ML team how much it will cost. They say: "It depends: on the dataset size, number of epochs, hyperparameter search, batch size, and whether we need to restart runs that diverge." Their estimate: $50K-500K, a 10x range. You approve $200K. Three weeks in, the first run diverged at step 40,000 ($80K in compute wasted). The second run's learning rate was too high ($40K wasted). The third run looks good but needs 2x more epochs ($160K more). Total actual cost: $380K, which is 1.9x the approved budget and 7.6x the low estimate. Nobody is fired, because this is normal in ML.

So what? ML training costs are fundamentally unpredictable because:

- hyperparameter search is trial-and-error,
- training runs fail silently (loss plateaus, gradients explode) after consuming significant compute,
- evaluation is subjective (when is the model "good enough"?), and
- data quality issues are discovered during training, requiring preprocessing changes and restarts.

Unlike software engineering, where you can estimate scope, ML training is experimental: each dollar buys a lottery ticket on whether this run will work.

Why does this persist? There are no reliable cost estimation tools for ML training. Cloud providers bill per GPU-hour with no per-project budget caps. ML experiment tracking tools (Weights & Biases, MLflow) track what you spent but do not predict what you will spend. No tool answers "how much will it cost to fine-tune Llama 70B on 100K documents to achieve X quality?" because the answer depends on factors that are unknowable up front (data quality, hyperparameter sensitivity, random-seed luck).
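A cost estimate is at least boundable with a back-of-envelope model, even if no tool ships one. The sketch below (every parameter value is a hypothetical assumption, not a measured figure) uses the common ~6 FLOPs-per-parameter-per-token rule of thumb for training compute, then inflates the raw GPU bill by a trial count for hyperparameter search and a geometric retry factor for failed runs:

```python
# Back-of-envelope fine-tuning cost model. All inputs are assumptions
# chosen for illustration.

def training_cost_usd(
    params: float,                # model size, e.g. 70e9
    dataset_tokens: float,        # tokens in the fine-tuning corpus
    epochs: int,
    gpu_tflops_effective: float,  # sustained TFLOP/s per GPU (well below peak)
    gpu_cost_per_hour: float,
    n_trial_runs: int,            # hyperparameter-search attempts
    failure_rate: float,          # fraction of runs that diverge or plateau
) -> float:
    # ~6 FLOPs per parameter per token is the standard training rule of thumb.
    flops_per_run = 6 * params * dataset_tokens * epochs
    gpu_hours_per_run = flops_per_run / (gpu_tflops_effective * 1e12) / 3600
    # With failure probability p, each successful run costs 1/(1-p) attempts
    # in expectation (geometric distribution).
    retry_factor = 1 / (1 - failure_rate)
    return gpu_hours_per_run * gpu_cost_per_hour * n_trial_runs * retry_factor

# Hypothetical scenario: 70B model, 100K docs at ~2K tokens each, 3 epochs,
# 100 TFLOP/s sustained per GPU, $3/GPU-hour, 20 search runs, 50% failure rate.
print(f"${training_cost_usd(70e9, 2e8, 3, 100, 3.0, 20, 0.5):,.0f}")
```

Note that the dominant uncertainty is not the FLOPs arithmetic (which is mechanical) but the last two inputs, trial count and failure rate, which is exactly why point estimates span an order of magnitude.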

Evidence

- The Mosaic/Databricks training cost calculator provides rough estimates but assumes optimal hyperparameters.
- Weights & Biases reports an average experiment failure rate of 30-50% (runs that produce unusable models).
- Lambda Labs estimates fine-tuning a 70B model at $50K-500K depending on data and approach.
- OpenAI reportedly spent $100M+ training GPT-4 (per various sources).
- No ML framework provides pre-training cost estimation.
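The 30-50% failure rate compounds: with failure probability p, a budget sized for one run underestimates expected spend by a factor of 1/(1-p), and the 90th-percentile spend is worse. A small Monte Carlo sketch (the per-run cost and failure rate below are hypothetical, with the rate taken from the middle of the reported range):

```python
import random

def simulate_spend(run_cost: float, p_fail: float,
                   n_sims: int = 100_000, seed: int = 0):
    """Total spend when each attempt costs run_cost and fails with
    probability p_fail, retrying until one run succeeds."""
    rng = random.Random(seed)
    totals = []
    for _ in range(n_sims):
        attempts = 1
        while rng.random() < p_fail:  # failed run: pay for another attempt
            attempts += 1
        totals.append(attempts * run_cost)
    totals.sort()
    mean = sum(totals) / n_sims
    p90 = totals[int(0.9 * n_sims)]  # 90th-percentile total spend
    return mean, p90

# Hypothetical $100K per run at a 40% failure rate:
mean, p90 = simulate_spend(100_000, 0.40)
print(f"expected ${mean:,.0f}, 90th percentile ${p90:,.0f}")
```

With these numbers the expected spend is run_cost / (1 - p) ≈ $167K, but one project in ten burns three full runs' worth of compute, which matches the shape of the $380K anecdote above.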
