Database Migration Rollback Failures Due to Destructive Schema Changes
Database migration tools (Rails migrations, Flyway, Alembic, Prisma) generate or accept 'down' migrations that fail in production because the 'up' migration was destructive: it dropped a column, changed a type with data loss, or removed an index that took hours to build. Rollback then becomes impossible without data loss or extended downtime.

So what? When a deployment with a bad migration needs rollback, the application code rolls back but the database cannot, leaving the running code and the schema on mismatched versions, which causes runtime errors. So what? Engineers must write emergency forward-fix migrations under incident pressure, a high-stakes operation on a production database with little time for testing or review, which sharply raises the risk of making the situation worse. So what? Teams adopt a 'never roll back' policy, so every deployment becomes a one-way door and the safety net that rollback provides is gone. So what? Riskier deployments lead to less frequent releases, larger batch sizes, and longer code review cycles, which reduces engineering velocity and increases the blast radius of each release. So what? Reduced deployment frequency means bug fixes and features sit in staging for days or weeks, delaying customer value and making it harder to bisect which change caused a production issue.

The structural root cause is that migration tools treat schema changes as reversible by default, but many real-world schema operations (column drops, type narrowing, data backfills) cannot be reversed once the original data is gone, and the tooling draws no distinction between safe-to-rollback and destructive migrations at authoring time.
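The missing authoring-time distinction can be approximated with a simple check. The following is a minimal sketch, not a feature of any of the tools named above: the pattern list, the reasons, and the function name find_destructive_statements are all hypothetical, and the naive split on ";" would need a real SQL parser for anything beyond simple migrations.

```python
import re
from typing import List, Tuple

# Hypothetical patterns for statements whose effects cannot be undone
# by a 'down' migration alone (the original data or index is gone).
DESTRUCTIVE_PATTERNS = [
    (re.compile(r"\bDROP\s+TABLE\b", re.IGNORECASE), "drops a table"),
    (re.compile(r"\bDROP\s+COLUMN\b", re.IGNORECASE), "drops a column"),
    (re.compile(r"\bTRUNCATE\b", re.IGNORECASE), "truncates a table"),
    (re.compile(r"\bALTER\s+COLUMN\b.*\bTYPE\b", re.IGNORECASE | re.DOTALL),
     "changes a column type (may lose data)"),
    (re.compile(r"\bDROP\s+INDEX\b", re.IGNORECASE),
     "drops an index (rebuild may be slow)"),
]


def find_destructive_statements(migration_sql: str) -> List[Tuple[str, str]]:
    """Return (statement, reason) pairs for statements that look destructive.

    Assumption: statements are separated by ';' and contain no ';' inside
    string literals. This is a heuristic, not a SQL parser.
    """
    findings = []
    for statement in migration_sql.split(";"):
        statement = statement.strip()
        if not statement:
            continue
        for pattern, reason in DESTRUCTIVE_PATTERNS:
            if pattern.search(statement):
                findings.append((statement, reason))
                break  # one reason per statement is enough to flag it
    return findings


if __name__ == "__main__":
    sql = """
        ALTER TABLE users ADD COLUMN nickname TEXT;
        ALTER TABLE users DROP COLUMN legacy_name;
    """
    for statement, reason in find_destructive_statements(sql):
        print(f"DESTRUCTIVE ({reason}): {statement}")
```

A check like this could run in code review or CI so that a flagged migration requires an explicit sign-off and a documented recovery plan instead of relying on an untested 'down' step.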
Evidence
GitHub's engineering blog describes their 'expand and contract' migration pattern specifically to avoid irreversible schema changes. The gh-ost tool for MySQL and pg_repack for PostgreSQL exist because running ALTER TABLE or a full table rewrite directly against a large production table can hold locks long enough to cause downtime. Thoughtbot's guidelines explicitly warn against writing down migrations for destructive changes. Incident postmortems from companies like GitLab and Spotify highlight failed recovery paths as a recurring cause of extended outages; GitLab's 2017 database incident, in which production data was deleted during replication troubleshooting and the backups could not be restored as expected, is an example of an untested recovery path rather than of a failed migration rollback specifically.
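The 'expand and contract' pattern mentioned above can be sketched as three separately deployed steps. This is an illustrative example under stated assumptions, not GitHub's implementation: it uses Python's built-in sqlite3 module, and the users table, the fullname and display_name columns, and the function names are hypothetical. The contract step rebuilds the table instead of using DROP COLUMN so that the sketch also runs on older SQLite versions.

```python
import sqlite3


def expand(conn: sqlite3.Connection) -> None:
    # Additive step: old application code that only knows `fullname`
    # keeps working, so rolling the code back is still safe.
    conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")


def backfill(conn: sqlite3.Connection) -> None:
    # Copy existing data into the new column; the old column is untouched.
    conn.execute(
        "UPDATE users SET display_name = fullname WHERE display_name IS NULL"
    )
    conn.commit()


def contract(conn: sqlite3.Connection) -> None:
    # Destructive step: run only in a later deployment, once no deployed
    # code reads `fullname`. After this point rollback needs a backup.
    conn.executescript(
        """
        CREATE TABLE users_new (id INTEGER PRIMARY KEY, display_name TEXT);
        INSERT INTO users_new (id, display_name)
            SELECT id, display_name FROM users;
        DROP TABLE users;
        ALTER TABLE users_new RENAME TO users;
        """
    )


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, fullname TEXT)")
    conn.execute("INSERT INTO users (fullname) VALUES ('Ada Lovelace')")
    expand(conn)
    backfill(conn)
    # Both columns exist here, so old and new code can run side by side.
    print(conn.execute("SELECT fullname, display_name FROM users").fetchall())
    contract(conn)
    print([row[1] for row in conn.execute("PRAGMA table_info(users)")])
```

The design point is that every step before contract is additive, so the application can be rolled back at any time without touching the schema; only the final step is a one-way door, and it can be delayed until the new code has proven itself.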