Hyperparameter tuning is trial-and-error that burns $10K-100K in compute before you find settings that work
You are fine-tuning a 7B model for medical Q&A. You need to choose: learning rate (1e-5? 3e-5? 1e-4?), batch size (4? 8? 16?), number of epochs (1? 3? 5?), LoRA rank (8? 16? 64?), warmup steps (100? 500?), weight decay (0? 0.01? 0.1?). That is six hyperparameters with 3-5 options each, or 3^6 = 729 to 5^6 = 15,625 possible combinations. You cannot predict which combination will work without running the experiment.

You run 20 exploratory experiments, each costing $200-500 in GPU compute: $4,000-10,000. Seventeen produce garbage models; three look promising. You run the three promising configs through full training at $2,000 each: $6,000. One gives good results. Total spend to find good hyperparameters: $10,000-16,000. The single training run that actually produced the model: $2,000. You spent 4-7x more on the search than on the final training.

So what?
Hyperparameter search is the dirty secret of ML: most of the cost and time goes not into training the model but into figuring out what settings to train it with. Automated search methods (Bayesian optimization via tools like Optuna or Ray Tune) reduce the number of trials from hundreds to 20-50 but do not eliminate the trial-and-error nature of the process; see the sketch below. Transferring hyperparameters (if lr=3e-5 worked on model X, it probably works on model Y) is common wisdom but not reliable: a new dataset or a different model size invalidates prior settings.

Why does this persist?
The relationship between hyperparameters and model quality is non-convex, non-linear, and dataset-dependent. There is no closed-form solution, and the loss landscape changes with every combination. ML research publishes the final hyperparameters but not the 50 failed experiments that preceded them, creating a survivorship bias where every paper looks like the authors chose the right settings on the first try.
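To make the "fewer trials, same trial-and-error" point concrete, here is a minimal sketch of that kind of automated search using Optuna. Everything specific in it is an assumption for illustration: the search ranges mirror the options listed above, the 20-trial budget matches the scenario, and `train_and_evaluate` is a hypothetical stand-in for a real fine-tuning run, replaced here by a toy formula so the script executes end to end.

```python
import math

import optuna


def train_and_evaluate(config: dict) -> float:
    """Stand-in for the expensive part: a real fine-tuning run costing
    $200-500 in GPU time. Returns a synthetic validation loss so this
    sketch runs; a real objective would launch training and evaluate
    on a held-out medical Q&A set."""
    # Toy landscape with a fake sweet spot near lr=3e-5, rank=16.
    # Real landscapes are non-convex and far less forgiving.
    return (abs(math.log10(config["learning_rate"]) + 4.5)
            + 0.05 * abs(config["lora_rank"] - 16) / 16
            + 0.1 * config["weight_decay"])


def objective(trial: optuna.Trial) -> float:
    # Sample one point from the same six-dimensional space described above.
    config = {
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 1e-4, log=True),
        "batch_size": trial.suggest_categorical("batch_size", [4, 8, 16]),
        "num_epochs": trial.suggest_int("num_epochs", 1, 5),
        "lora_rank": trial.suggest_categorical("lora_rank", [8, 16, 64]),
        "warmup_steps": trial.suggest_categorical("warmup_steps", [100, 500]),
        "weight_decay": trial.suggest_categorical("weight_decay", [0.0, 0.01, 0.1]),
    }
    # Every call here is a paid experiment. The sampler only chooses
    # which experiments to pay for; it cannot skip them.
    return train_and_evaluate(config)


study = optuna.create_study(direction="minimize")  # TPE sampler by default
study.optimize(objective, n_trials=20)             # 20 paid runs, not 729-15,625
print(study.best_params, study.best_value)
```

The sampler concentrates trials in promising regions instead of sweeping the full grid, which is the hundreds-to-20-50 reduction described above. But each trial is still a full, paid training run, so the cost floor never drops to zero.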
Evidence
Google's NAS (Neural Architecture Search) paper spent $150K in compute on architecture search alone. Optuna and Ray Tune reduce the number of trials, but typical fine-tuning still requires 15-30 experiments. No tool predicts optimal hyperparameters for a new dataset. Sweep data from Weights & Biases shows that 30-50% of runs produce unusable models. ML papers rarely publish counts of failed experiments.