Data versioning for ML is unsolved — nobody can tell you exactly which data a specific model was trained on 6 months ago
Your company trained Model v2.3 six months ago. A customer reports it gives wrong answers about a specific topic. You want to investigate: was the problematic topic in the training data? Which version of the training data was used? Were there data quality issues in that batch? You check your training logs: they say 'dataset: customer_support_v7.' You look for customer_support_v7. It no longer exists: someone created v8 and v9 and did not keep v7, and the S3 bucket was cleaned up to save costs. The data pipeline that created v7 pulled from a database that has since been updated with new records. You cannot reproduce the exact dataset that trained Model v2.3.

So what? In software engineering, every deployment can be traced to a specific git commit. In ML, most teams cannot trace a deployed model to the exact dataset that trained it. Data versioning tools exist (DVC, LakeFS, Delta Lake), but adoption is an estimated 10-20% of ML teams. Most teams version code meticulously and treat data as ephemeral: regenerated from pipelines rather than preserved as artifacts. When something goes wrong with a model, the investigation dead-ends at 'we do not know what data it saw.'

Why does this persist? Training datasets are 10-1000x larger than code (gigabytes to terabytes), so storing every version is expensive. Integrating DVC or LakeFS into existing pipelines costs 2-4 weeks of engineering time that ML teams do not prioritize, because 'we can always regenerate the data.' Until they cannot: the source data changed, the pipeline code changed, or the API they scraped changed its format.
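Even without adopting a full versioning tool, the dead-end above is partly avoidable if every training job logs a content hash of the data it actually read. Here is a minimal sketch in Python; the directory layout, manifest schema, and file paths are illustrative assumptions, not any particular tool's format.

```python
# Minimal sketch: fingerprint a training dataset so the hash can be logged
# next to the model checkpoint. Paths and the manifest schema below are
# illustrative assumptions, not any specific tool's API.
import hashlib
import json
from pathlib import Path

def file_sha256(path: Path) -> str:
    """Hash one file in 1 MB chunks so large files never sit in memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def dataset_fingerprint(data_dir: Path) -> str:
    """SHA-256 over sorted (relative path, file hash) pairs, so the result
    is stable regardless of filesystem iteration order."""
    entries = [
        (str(p.relative_to(data_dir)), file_sha256(p))
        for p in sorted(data_dir.rglob("*"))
        if p.is_file()
    ]
    return hashlib.sha256(json.dumps(entries).encode()).hexdigest()

def write_manifest(data_dir: Path, out_path: Path) -> None:
    """Record the fingerprint with enough context to audit six months later."""
    manifest = {
        "dataset_dir": str(data_dir),
        "num_files": sum(1 for p in data_dir.rglob("*") if p.is_file()),
        "fingerprint": dataset_fingerprint(data_dir),
    }
    out_path.write_text(json.dumps(manifest, indent=2))

if __name__ == "__main__":
    # Hypothetical paths matching the scenario above.
    write_manifest(Path("data/customer_support_v7"),
                   Path("model_artifacts/v2.3_data_manifest.json"))
```

A hash does not stop anyone from deleting v7, but it turns 'we do not know what data it saw' into a checkable claim: if any candidate copy of the dataset turns up later, you can verify it byte-for-byte against the model's manifest.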
Evidence
DVC (Data Version Control) has 13K+ GitHub stars but industry adoption is estimated at 10-20% of ML teams (MLOps surveys). LakeFS and Delta Lake are more common in data engineering than ML. Weights & Biases tracks experiment configs but not full dataset snapshots. Google's ML Test Score paper lists data versioning as a common gap. No major ML platform (SageMaker, Vertex AI, Azure ML) mandates data versioning.
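For the minority of teams that do adopt one of these tools, the payoff is that a training run can pin an exact data revision the way a deployment pins a git commit. A sketch using DVC's Python API (dvc.api.open); the repo URL, file path, and tag are hypothetical placeholders, assuming the dataset is tracked by DVC in a git repo.

```python
import dvc.api

# Open the dataset exactly as it existed at the git tag recorded at
# training time. Repo URL, path, and tag below are hypothetical.
with dvc.api.open(
    "data/customer_support.csv",
    repo="https://github.com/example-org/ml-data",
    rev="model-v2.3",
) as f:
    first_rows = [f.readline() for _ in range(5)]
```

The point is not this particular tool: what matters is that 'dataset: customer_support_v7' in a training log becomes a revision that can still be checked out six months later.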