AI companies scraped billions of personal data records from the public internet for model training without consent, triggering $50M+ in settlements and lawsuits alleging violations of privacy rights for hundreds of millions of people
Major AI companies systematically scraped personal information, including photos, social media posts, private messages, and biometric data, from the public internet to train large language models and facial recognition systems, without obtaining consent from the hundreds of millions of individuals whose data was collected. Clearview AI scraped over 30 billion facial images from social media platforms. OpenAI faces class action allegations of using 'stolen private information, including personally identifiable information, from hundreds of millions of internet users, including children.' LinkedIn is accused of harvesting private messages to train AI models.
Why it matters: individuals who posted content on social media never consented to having their data used for AI training, yet their personal information, writing styles, faces, and private communications are now embedded in commercial AI models. No practical mechanism exists to remove one individual's data once it has been incorporated into model weights, so the right to be forgotten is effectively impossible to exercise against a trained model. As a result, people are forced to choose between participating in online life and having their personal data conscripted into commercial AI products.
The structural root cause is that existing privacy laws were written before large-scale AI training existed and do not clearly define whether publicly accessible data may be scraped and used for machine learning. This creates a legal gray zone that AI companies exploit by arguing that public availability implies consent to any use.
Evidence
Clearview AI reached a class-action settlement valued at about $50 million (June 2024) after scraping billions of facial images from Facebook, Venmo, and other sites. P.M. v. OpenAI LP (filed June 2023 in San Francisco federal court) alleges that OpenAI used personal data from hundreds of millions of users, including children. A class-action lawsuit against LinkedIn (2025) alleges that the company harvested private messages for AI training. Reddit sued Perplexity AI, SerpApi, Oxylabs, and AWMProxy (October 2025) for industrial-scale scraping using false identities and proxy techniques. Meta (Facebook) settled with Texas for $1.4 billion (July 2024) over non-consensual biometric data collection through its Tag Suggestions feature. Canadian news outlets sued OpenAI (November 2024) in Ontario Superior Court. CNIL (France) issued guidance on web scraping for AI training data collection.
Sources: Troutman Pepper Locke, American Bar Association, The Lyon Firm, California Law Review, CNIL.