You cannot resume someone else's training run because they did not save the optimizer state, RNG state, or data loader position
A colleague trained a model for 50K steps and shared the weights with you. You want to continue training from step 50K on new data. You load the weights and start training, and the loss immediately spikes: the optimizer (Adam) had momentum and variance estimates built up over 50K steps, and you just reset them to zero by creating a fresh optimizer. The model takes 5K steps just to recover to where it was. You also cannot reproduce their exact run, because the random seed, data shuffling order, and dropout masks are all lost.

So what?

Model weights are the 'product' of training, but they are only one of three pieces of a complete training checkpoint. The optimizer state and the training metadata (data loader position, RNG state, scheduler state) matter just as much for reproducibility and continuation. Most people share only the weights because optimizer states are large (for Adam, at least as large as the weights themselves) and training metadata is poorly documented. The result is that resuming training, reproducing results, or debugging training failures after the fact becomes impossible.

Why does this persist?

There is no standard checkpoint format that includes all necessary state. PyTorch checkpoints can include everything, but the convention is to save only the model state_dict. HuggingFace Hub hosts model weights but not optimizer states (which would at least double storage costs). Research papers publish model weights but not training checkpoints. The ML community treats training as a black box that produces weights, not as a reproducible process.
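For concreteness, here is a minimal sketch of what saving the full training state could look like in PyTorch. The function name and the epoch/batch-index bookkeeping are illustrative rather than a standard API, since PyTorch's DataLoader has no built-in resumable cursor.

```python
import random
import numpy as np
import torch

def save_full_checkpoint(path, model, optimizer, scheduler,
                         step, loader_epoch, loader_batch_idx):
    """Hypothetical sketch: everything needed to resume training, not just weights."""
    checkpoint = {
        # 1. The weights: the only piece most people actually share.
        "model": model.state_dict(),
        # 2. Adam's momentum/variance estimates; without these, resuming
        #    starts from a fresh optimizer and the loss spikes.
        "optimizer": optimizer.state_dict(),
        # 3. LR scheduler position (e.g. where we are in a warmup/decay curve).
        "scheduler": scheduler.state_dict(),
        # 4. Training metadata: step counter plus a stand-in for data loader
        #    position (epoch + batch index), since the DataLoader itself
        #    cannot be checkpointed out of the box.
        "step": step,
        "loader_epoch": loader_epoch,
        "loader_batch_idx": loader_batch_idx,
        # 5. RNG states, so dropout masks and shuffling order can be replayed.
        "rng": {
            "python": random.getstate(),
            "numpy": np.random.get_state(),
            "torch_cpu": torch.get_rng_state(),
            "torch_cuda": torch.cuda.get_rng_state_all()
                          if torch.cuda.is_available() else None,
        },
    }
    torch.save(checkpoint, path)
```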
Evidence
HuggingFace Hub: models are uploaded as weights only (pytorch_model.bin or safetensors); the optimizer state for a 70B model is at least as large as the 140 GB of 16-bit weights, since Adam keeps two moment tensors per parameter, often stored in fp32.
PyTorch: the docs show how to save the full training state, but most tutorials save only model.state_dict().
Reproducibility crisis in ML: a 2022 Nature survey found that more than half of ML papers cannot be reproduced even with published code.
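The resume side, sketched under the same assumptions as the hypothetical save function above: loading the optimizer state is what avoids the post-resume loss spike, and restoring the RNG states is what makes dropout and shuffling reproducible.

```python
import random
import numpy as np
import torch

def load_full_checkpoint(path, model, optimizer, scheduler):
    """Hypothetical counterpart to save_full_checkpoint above."""
    # weights_only=False because the checkpoint holds RNG states and counters,
    # not just tensors.
    ckpt = torch.load(path, map_location="cpu", weights_only=False)
    model.load_state_dict(ckpt["model"])
    # Restoring Adam's moment estimates avoids re-warming the optimizer.
    optimizer.load_state_dict(ckpt["optimizer"])
    scheduler.load_state_dict(ckpt["scheduler"])
    # Restore RNG so dropout masks and shuffling continue where they left off.
    random.setstate(ckpt["rng"]["python"])
    np.random.set_state(ckpt["rng"]["numpy"])
    torch.set_rng_state(ckpt["rng"]["torch_cpu"])
    if ckpt["rng"]["torch_cuda"] is not None and torch.cuda.is_available():
        torch.cuda.set_rng_state_all(ckpt["rng"]["torch_cuda"])
    # Caller resumes the training loop from this step and data position.
    return ckpt["step"], ckpt["loader_epoch"], ckpt["loader_batch_idx"]
```

None of this helps, of course, if the person who trained the model only uploaded the weights file in the first place.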