Medical and legal training data cannot be shared or used because of privacy regulations — even anonymized data is re-identifiable
A hospital has 10 million clinical notes that would be perfect for training a medical LLM. It cannot share them: HIPAA requires de-identification of 18 categories of identifiers. Suppose it runs a de-identification tool that removes names, dates, and locations. The remaining text still contains enough information to re-identify patients: 'The 67-year-old male patient from rural Vermont with a rare form of sarcoidosis who was previously treated at the Mayo Clinic in 2019' is identifiable even without a name. Sweeney (2000) showed that 87% of the US population can be uniquely identified by ZIP code + birth date + gender alone.

So what?

The most valuable training data (medical records, legal case files, financial transactions, therapy transcripts) is locked behind privacy regulations that prevent its use for ML. This creates a paradox: the domains where AI could help the most (healthcare diagnosis, legal research, financial fraud detection) are the domains with the least training data available. Models trained on publicly available medical text (PubMed, textbooks) perform 20-30% worse on clinical tasks than models that could train on actual clinical notes.

Why does this persist?

De-identification is provably insufficient: any sufficiently detailed text about a unique individual is re-identifiable. Differential privacy (adding calibrated noise to queries or model updates) works for tabular data but degrades text quality to the point of uselessness. Federated learning (training on data without moving it) exists but is slow, complex, and not supported by most ML frameworks. Synthetic medical data generation requires the original data to train the generator, a circular dependency. And the privacy regulations (HIPAA, GDPR) were written for databases, not for training corpora; a legal framework for ML on private data does not yet exist. The sketches below illustrate several of these failure modes in miniature.
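To make the redaction gap concrete, here is a toy sketch in Python. Everything in it is hypothetical (the note text, the regex patterns, the bracketed placeholders); production tools use NER models rather than regexes, but the surviving quasi-identifiers are the same, because Safe Harbor permits years and state-level geography.

```python
# Toy illustration (not a real de-identification pipeline) of why stripping
# HIPAA's explicit identifiers still leaves re-identifying quasi-identifiers.
# The note text and patterns below are fabricated for illustration.
import re

note = ("John Smith, seen 03/14/2019 at Mayo Clinic, Rochester MN. "
        "67-year-old male from rural Vermont with a rare form of "
        "sarcoidosis, previously treated at Mayo Clinic in 2019.")

redacted = note
redacted = re.sub(r"John Smith", "[NAME]", redacted)                # names
redacted = re.sub(r"\d{2}/\d{2}/\d{4}", "[DATE]", redacted)         # full dates
redacted = re.sub(r"Mayo Clinic(, Rochester MN)?", "[FACILITY]", redacted)

print(redacted)
# [NAME], seen [DATE] at [FACILITY]. 67-year-old male from rural Vermont
# with a rare form of sarcoidosis, previously treated at [FACILITY] in 2019.
# Age, state, year, and a rare diagnosis all survive (Safe Harbor allows
# them), and together they can single out one patient.
```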
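The 87% figure is a measurable property of any dataset: partition records by the (ZIP, birth date, sex) triple and count how many fall in an equivalence class of size one. A minimal sketch with fabricated records:

```python
# Sweeney-style uniqueness check: a record is re-identifiable when it is
# the only member of its (ZIP, birth date, sex) equivalence class.
# The records are fabricated for illustration.
from collections import Counter

records = [
    ("05301", "1952-03-14", "M"),
    ("05301", "1952-03-14", "M"),   # shares a class with the line above
    ("05672", "1948-07-02", "F"),   # singleton: uniquely identifiable
    ("05461", "1956-11-30", "M"),   # singleton: uniquely identifiable
]

class_sizes = Counter(records)
unique = sum(1 for r in records if class_sizes[r] == 1)
print(f"{unique / len(records):.0%} uniquely identifiable")
# 50% in this toy sample; Sweeney (2000) measured ~87% for the US population.
```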
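Why does differential privacy work for tables but not for notes? For a count query, one person changes the answer by at most 1, so a small amount of Laplace noise hides any individual at almost no cost in utility. A minimal sketch (epsilon and the count are illustrative); free text has no analogous low-sensitivity unit, so noise strong enough to hide a patient also destroys the prose.

```python
# Laplace mechanism for an epsilon-DP count query: a count has sensitivity 1
# (one patient changes it by at most 1), so Laplace(scale=1/epsilon) noise
# suffices. The epsilon and count values are illustrative.
import numpy as np

rng = np.random.default_rng(0)

def dp_count(true_count: int, epsilon: float) -> float:
    return true_count + rng.laplace(scale=1.0 / epsilon)

print(dp_count(412, epsilon=1.0))   # e.g. ~411.6 -- still a usable statistic
```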
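Federated averaging itself fits on one screen. A minimal FedAvg sketch, assuming four hypothetical hospital sites and a toy linear model: raw data stays local, only weight vectors cross the network, and the per-round synchronization barrier is where the reported 5-10x slowdown comes from.

```python
# FedAvg sketch: each site takes a local gradient step on its private data,
# and the server averages the resulting weights. Sites and data are toy
# stand-ins; real deployments add secure aggregation and many local steps.
import numpy as np

rng = np.random.default_rng(0)
w = np.zeros(3)                                    # shared global weights
sites = [(rng.normal(size=(20, 3)), rng.normal(size=20)) for _ in range(4)]

for _ in range(50):                                # one communication round
    local_models = []
    for X, y in sites:                             # runs inside each hospital
        grad = 2 * X.T @ (X @ w - y) / len(y)      # gradient on private data
        local_models.append(w - 0.1 * grad)        # one local SGD step
    w = np.mean(local_models, axis=0)              # server sees weights only

print(w)   # trained without any site's raw records leaving the site
```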
Evidence
Sweeney (2000): 87% of the US population uniquely identifiable by ZIP + DOB + gender.
HIPAA Safe Harbor requires removal of 18 identifier types but does not guarantee anonymity.
Google Health's medical LLM (Med-PaLM 2) trained on PubMed, not clinical notes.
Federated learning for LLMs is 5-10x slower than centralized training.
No hospital system has publicly released de-identified clinical notes for LLM training at scale.