Web-crawled datasets contain copyrighted material, personal data, and harmful content that nobody can fully audit at scale

You download The Pile (800GB) or RedPajama (1.2T tokens) to train a language model. Buried in those tokens are complete copyrighted works (New York Times articles, chapters from Harry Potter books), personal information (social security numbers, home addresses from leaked databases), CSAM (child sexual abuse material that was on the public web), malware code, and every form of hateful, violent, and illegal content that exists on the internet. You did not put this content in your dataset; it came with the web crawl. But you trained on it, and your model learned from it.

So what? Every company training on web-crawl data is unknowingly training on copyrighted, private, and harmful content. The New York Times sued OpenAI for copyright infringement. Artists sued Stability AI for training on their work. The legal liability is real and growing. Yet auditing a multi-terabyte dataset for problematic content is practically impossible: you would need to classify every document, which requires the very ML models you are trying to build. Random sampling audits catch obvious, frequent problems but miss long-tail harmful content.

Why does this persist? Web crawling is the only practical way to get enough text for LLM pre-training (trillions of tokens); curating a dataset manually would take decades and cost billions. Basic filtering (blocklists, keyword matching, language detection) removes some harmful content but is trivially evaded by obfuscation. The legal framework is unsettled: is training on copyrighted data fair use? Courts are still deciding (NYT v. OpenAI, Authors Guild v. OpenAI). The economic incentive is to train now and deal with the legal consequences later.
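To make the limitation concrete, here is a minimal sketch of the kind of sampling audit and keyword filtering described above. The blocklist, the SSN-style regex, and the sample size are illustrative assumptions, not anyone's production pipeline; the closing comment shows how trivial obfuscation slips past both checks.

```python
import random
import re

# Hypothetical blocklist and PII pattern; real pipelines use far larger
# lists plus trained classifiers, and still miss long-tail content.
BLOCKLIST = {"how to make a bomb", "credit card dump"}
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")  # US SSN-like format

def flag_document(text: str) -> list[str]:
    """Return the reasons a document looks problematic under naive filtering."""
    reasons = []
    lowered = text.lower()
    if any(phrase in lowered for phrase in BLOCKLIST):
        reasons.append("blocklisted phrase")
    if SSN_PATTERN.search(text):
        reasons.append("possible SSN")
    return reasons

def sample_audit(documents: list[str], sample_size: int = 1000, seed: int = 0):
    """Randomly sample documents and report how many trip the naive filters.

    This catches frequent, obvious problems; content that appears in only a
    handful of documents out of billions is unlikely to show up in the sample.
    """
    rng = random.Random(seed)
    sample = rng.sample(documents, min(sample_size, len(documents)))
    flagged = [(doc, flag_document(doc)) for doc in sample]
    flagged = [(doc, reasons) for doc, reasons in flagged if reasons]
    print(f"{len(flagged)}/{len(sample)} sampled documents flagged")
    return flagged

# Obfuscation trivially evades keyword matching: "h0w to m4ke a b0mb" or
# "123 45 6789" pass the checks above untouched.
```

The point of the sketch is not the specific patterns but the shape of the problem: every cheap check is easy to evade, and every robust check requires classifying every document at a scale that only the models being trained could handle.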

Evidence

The Pile audit (Biderman et al.) found copyrighted books, personal data, and toxic content. NYT v. OpenAI (filed Dec 2023) alleges training on copyrighted articles. Getty Images sued Stability AI for training on copyrighted photos. An audit of the C4 dataset found 7% of its content came from sites whose robots.txt blocked crawlers. The LAION-5B dataset was taken down after CSAM was discovered in it (Stanford Internet Observatory, Dec 2023).