There is no way to know if your training data has been contaminated by test set data — your benchmark scores might be meaningless
You train a model on a large web-crawl dataset (The Pile, RedPajama, or FineWeb). You evaluate it on MMLU, HumanEval, and GSM8K. The scores look great: 70%+ on MMLU, 50%+ on HumanEval. But wait: did your training data contain MMLU questions and answers? If the model memorized the test set during training, your benchmark scores are inflated and do not reflect actual capability. And you cannot easily check: your training dataset is 15TB of text, and searching for 14,000 MMLU questions across 15TB is both computationally expensive and imprecise, since rephrased questions evade exact-match detection.

So what? Data contamination invalidates benchmark results. If a model trained on data containing HumanEval solutions scores 60% on HumanEval, you do not know whether it can actually code or whether it memorized the answers. This undermines the entire model evaluation ecosystem: companies claim benchmark improvements that may be partially or entirely due to contamination, and users choose models based on benchmark scores that do not reflect real-world performance.

Why does this persist? Web crawls contain everything on the internet, including benchmark datasets, their solutions, and discussions about them. Deduplication at scale is imperfect: exact-match dedup catches verbatim copies but not paraphrased versions. GPT-4's technical report acknowledged potential MMLU contamination but could not quantify it. And the incentive structure rewards contamination: models with higher benchmark scores get more attention, funding, and adoption, regardless of whether the scores are legitimate.
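To make the detection problem concrete: the usual approximate approach, used in several model reports, is n-gram overlap — build an index of all n-grams in the benchmark, then flag any training document whose n-grams mostly appear in that index. The sketch below is illustrative (the names, the n-gram size, and the toy data are mine, not from any standard tool); note how it would still miss a reworded question.

```python
def ngrams(text, n=8):
    """Lowercased word n-grams; a common unit for contamination checks."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def build_benchmark_index(benchmark_questions, n=8):
    """Union of n-grams across every benchmark question."""
    index = set()
    for q in benchmark_questions:
        index |= ngrams(q, n)
    return index

def contamination_score(doc, index, n=8):
    """Fraction of a training document's n-grams found in the benchmark index."""
    grams = ngrams(doc, n)
    if not grams:
        return 0.0
    return len(grams & index) / len(grams)

# Toy example (hypothetical benchmark question, short n for demonstration).
benchmark = ["What is the capital of France? Paris is the capital of France."]
index = build_benchmark_index(benchmark, n=4)

doc = "Trivia dump: what is the capital of France? Paris is the capital of France."
score = contamination_score(doc, index, n=4)  # most of the doc's 4-grams match
```

In practice the index must be sharded across machines, and the threshold for "contaminated" is a judgment call — but the deeper limitation is visible already: change a few words in the question and every overlapping n-gram breaks.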
Evidence
The GPT-4 technical report acknowledged potential data contamination on benchmarks. The Llama 2 paper ran a contamination analysis and found overlap with some benchmarks. Stanford's HELM benchmark found evidence of contamination in multiple models. No standard tool exists for contamination detection at web-crawl scale. The arXiv paper 'Investigating Data Contamination in Modern Benchmarks for Large Language Models' documents widespread contamination.
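Since exact n-gram overlap misses paraphrases, any tool that did exist at web-crawl scale would need fuzzy matching. One generic technique from near-duplicate detection is MinHash over character shingles, which estimates Jaccard similarity cheaply enough to run at scale. A minimal sketch, assuming nothing beyond the standard library (the shingle size, hash count, and example strings are illustrative, not from any particular paper's pipeline):

```python
import hashlib

def shingles(text, k=5):
    """Character k-shingles over normalized text; robust to small wording changes."""
    t = " ".join(text.lower().split())
    return {t[i:i + k] for i in range(len(t) - k + 1)}

def minhash_signature(items, num_hashes=64):
    """One min-hash per seeded hash function; compact fingerprint of the set."""
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16) for s in items)
        for seed in range(num_hashes)
    ]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

# A lightly reworded copy stays similar; unrelated text does not.
orig = "The quick brown fox jumps over the lazy dog near the river bank"
para = "The quick brown fox leaps over the lazy dog near the river bank"
unrel = "Completely different content about machine learning benchmarks today"

sig_orig = minhash_signature(shingles(orig))
sig_para = minhash_signature(shingles(para))
sig_unrel = minhash_signature(shingles(unrel))
```

The point of the signature is that it can be precomputed once per training document and bucketed (e.g. via locality-sensitive hashing), so a 15TB corpus is scanned once rather than compared pairwise against every benchmark item — but even this only catches light rewording, not genuine restatements, which is part of why no standard tool has solved the problem.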