Training runs fail silently: the loss looks fine for 10 hours, then the model outputs garbage

You start a fine-tuning run on a 13B model. Loss decreases nicely for 10 hours. You go to sleep. In the morning, loss is still decreasing: 0.8 after 20 hours, down from 1.2 at the start. Then you evaluate the model: it outputs coherent-sounding but factually wrong answers, repeats itself in loops, and has memorized training examples verbatim instead of generalizing. The loss number looked healthy, but the model overfit or mode-collapsed. The 20-hour run ($400 in compute) produced a useless model, and nothing in the standard training metrics warned you.

So what?

Loss is the only metric universally tracked during training, but loss does not measure what you care about: whether the model actually performs well on your task. A model can have low loss and terrible task performance (overfitting). It can have moderate loss and excellent task performance (good generalization). This disconnect between training loss and actual quality means you cannot detect failure during training, only after it, when you evaluate. By then, the compute is spent.

Why does this persist?

Good evaluation requires human judgment or task-specific benchmarks. Running evaluation every 100 steps would add 20-50% to training time and cost, so teams evaluate infrequently (every few thousand steps, or only at the end), creating long blind spots where training can go wrong undetected. Online evaluation during training is an active research area, but no production-grade tool exists that can detect mode collapse, memorization, or quality degradation in real time without expensive human evaluation.
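Some of these failure modes do have cheap automated proxies that need no human judgment. Below is a minimal sketch in plain Python; every function name and threshold is an illustrative assumption, not an established tool or API. It runs two heuristics over a handful of generated samples: an n-gram repetition ratio as a mode-collapse signal, and verbatim n-gram overlap with the training corpus as a memorization signal.

```python
from collections import Counter

def repetition_ratio(text: str, n: int = 4) -> float:
    """Fraction of n-grams that are repeats; values near 1.0 suggest looping output."""
    tokens = text.split()
    if len(tokens) < n:
        return 0.0
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    counts = Counter(ngrams)
    repeated = sum(c - 1 for c in counts.values())
    return repeated / len(ngrams)

def build_train_ngrams(corpus: list[str], n: int = 8) -> set:
    """Index the training corpus once so overlap checks are cheap per sample."""
    grams = set()
    for doc in corpus:
        tokens = doc.split()
        grams.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return grams

def memorization_overlap(sample: str, train_ngrams: set, n: int = 8) -> float:
    """Fraction of the sample's n-grams copied verbatim from training data."""
    tokens = sample.split()
    if len(tokens) < n:
        return 0.0
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return sum(g in train_ngrams for g in grams) / len(grams)

def health_check(samples: list[str], train_ngrams: set) -> list[str]:
    """Return human-readable warnings; an empty list means no red flags."""
    if not samples:
        return []
    rep = sum(repetition_ratio(s) for s in samples) / len(samples)
    mem = sum(memorization_overlap(s, train_ngrams) for s in samples) / len(samples)
    warnings = []
    if rep > 0.3:   # threshold is a guess; calibrate on a known-good run
        warnings.append(f"mean repetition ratio {rep:.2f}: possible mode collapse")
    if mem > 0.5:   # ditto
        warnings.append(f"mean training overlap {mem:.2f}: possible memorization")
    return warnings
```

Running health_check on, say, 16 sampled generations every few hundred steps costs seconds, not the 20-50% overhead of a full evaluation, and heuristics like these could plausibly have flagged both the looping output and the verbatim memorization in the scenario above.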

Evidence

Mode collapse in language-model fine-tuning is documented in the RLHF literature (the InstructGPT paper acknowledges it). Detecting overfitting during training requires validation evaluation, which most fine-tuning scripts do not run continuously. Weights & Biases data shows that training loss is the most-logged metric, with task-specific evals logged roughly 10x less often. No tool provides real-time quality alerts during training.
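The validation side, at least, is mechanically simple even without a full benchmark harness. Here is a minimal sketch of a divergence alarm over periodic held-out loss, assuming you can afford a small validation pass every few hundred steps; the function name, the patience value, and the synthetic numbers are all illustrative.

```python
def val_diverging(history: list, patience: int = 3) -> bool:
    """True if validation loss rose for `patience` consecutive evals while
    training loss kept falling: the classic overfitting signature."""
    if len(history) <= patience:
        return False
    recent = history[-(patience + 1):]
    val_rising = all(b[1] > a[1] for a, b in zip(recent, recent[1:]))
    train_falling = recent[-1][0] < recent[0][0]
    return val_rising and train_falling

if __name__ == "__main__":
    history = []
    # Synthetic (train_loss, val_loss) pairs: train keeps falling,
    # val bottoms out and climbs back up.
    for step, point in enumerate([(1.2, 1.25), (1.0, 1.10), (0.9, 1.05),
                                  (0.85, 1.08), (0.8, 1.12), (0.75, 1.18)]):
        history.append(point)
        if val_diverging(history):
            print(f"eval {step}: val loss rising while train loss falls; likely overfitting")
```

The alarm fires on the synthetic curve where training loss keeps falling while validation loss climbs, which is the pattern the overnight run above could have surfaced had any validation pass been running.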
