Synthetic data from LLMs looks great in training, but models trained on it perform 10-20% worse on real-world inputs

You need 100K training examples for a customer intent classification model. Collecting real customer messages is slow (you have 5K), so you prompt GPT-4 to generate 95K synthetic examples: 'Generate a customer message expressing frustration about a late delivery.' The synthetic data looks perfect: grammatically clean, well structured, topically diverse. You train your classifier on all 100K examples (5K real + 95K synthetic). Accuracy on your test set (also synthetic): 94%. Accuracy on actual customer messages from production: 78%. The 16-point gap exists because real customers write 'wtf my package isnt here???', not 'I am frustrated because my delivery has been delayed beyond the expected timeframe.'

So what? Synthetic data is the most popular shortcut for insufficient training data, but it introduces a distribution mismatch: synthetic text is cleaner, more grammatical, more structured, and less diverse than real text. Models trained on it learn the patterns of LLM-generated text, not the patterns of real human text. They fail on misspellings, slang, code-switching, incomplete sentences, and the messy reality of how people actually communicate. The 10-20 point accuracy gap between synthetic-test and real-world performance is consistent across studies.

Why does this persist? Generating synthetic data is roughly 100x cheaper and faster than collecting and labeling real data. The quality looks good on inspection: individual examples are plausible. The distribution mismatch only becomes visible at scale, when you measure aggregate performance on real inputs. And there is no standard tool that warns you 'your synthetic data distribution differs from real data in these specific ways' before you waste compute training on it.
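You can approximate that warning yourself with a classifier two-sample test: if a simple model can reliably separate your synthetic examples from your real ones, the distributions measurably differ, and the discriminative features tell you where. Below is a minimal sketch using scikit-learn; the toy corpora are hypothetical stand-ins for your own data.

```python
# Classifier two-sample test: if a simple model can reliably separate
# synthetic text from real text, the two distributions measurably differ.
# Minimal sketch; replace the toy corpora with your own data.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import roc_auc_score

real = [
    "wtf my package isnt here???",
    "u said 2 days its been a week",
    "cancel it. now.",
    "hola, mi pedido no llego, where is it",
]
synthetic = [
    "I am frustrated because my delivery has been delayed beyond the expected timeframe.",
    "I would like to inquire about the status of my recent order.",
    "Unfortunately, my package has not arrived as scheduled.",
    "Could you please provide an update on my shipment?",
]

texts = real + synthetic
y = np.array([0] * len(real) + [1] * len(synthetic))  # 1 = synthetic

# Character n-grams catch misspellings and punctuation habits, not just topics.
X = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)).fit_transform(texts)

clf = LogisticRegression(max_iter=1000)
probs = cross_val_predict(clf, X, y, cv=2, method="predict_proba")[:, 1]
print(f"real-vs-synthetic AUC: {roc_auc_score(y, probs):.2f}")
```

An AUC near 0.5 means the corpora are hard to tell apart; an AUC near 1.0 means your downstream model can latch onto the same giveaways. Fitting the classifier on the full set and inspecting its largest coefficients shows which n-grams (misspellings, punctuation habits, formal phrasing) carry the difference.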

Evidence

Anthropic's Constitutional AI paper uses synthetic data, but with heavy filtering. Microsoft's Phi-2 was trained on synthetic data with careful curation. Stanford's Alpaca (fine-tuned on synthetic data from GPT-3.5) showed significant quality gaps versus fine-tuning on real data. 'The Curse of Recursion' (Shumailov et al., 2023) shows model collapse when models are trained iteratively on synthetic data. No standard tool detects synthetic-vs-real distribution gaps.
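The recursion result reproduces in miniature. Here is a toy numerical sketch of the effect Shumailov et al. describe, with a one-dimensional Gaussian standing in for the LLM: each generation is fit only to the previous generation's samples, and estimation error compounds until the variance collapses. This is an illustrative analogue, not the paper's actual experiment.

```python
# Toy model collapse: repeatedly fit a Gaussian to its own samples.
# A miniature analogue of recursive training on generated data
# (Shumailov et al. 2023), not their actual experiment.
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(0.0, 1.0, size=20)  # generation 0: "real" data

for gen in range(1, 101):
    mu, sigma = samples.mean(), samples.std()  # "train" on current data
    samples = rng.normal(mu, sigma, size=20)   # next generation sees only synthetic
    if gen % 10 == 0:
        print(f"gen {gen:3d}: sigma = {sigma:.4f}")
```

The fitted sigma drifts toward zero as the generations lose the tails of the original distribution. In the paper's setting, retaining a share of the original real data at each generation mitigates the collapse, which is consistent with the heavy filtering and curation in the Anthropic and Phi-2 work above.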
