Migration terminal output showing error in foreign key constraint violation

The Zero-Downtime Migration Myth

Zero-downtime database migrations are theoretically achievable. In practice, they require such careful coordination across application code, database state, and deployment tooling that most teams experience at least partial downtime on migrations they believed were safe. Here's what usually goes wrong.

The classic dual-write failure

The standard zero-downtime approach for column renames is dual-write: add the new column, write to both old and new, backfill historical data, switch reads to the new column, drop the old column. It works in theory. In practice, the backfill step is where it fails. Backfilling 2 billion rows while the application is running generates IO pressure that slows down the primary workload. Lock contention can emerge on large tables during the ALTER phase even with “safe” operations. The deployment coordination between the database state and application deployments is error-prone, especially with multiple service instances rolling.

The foreign key constraint trap

Adding a NOT VALID foreign key, validating it as a separate step, and then adding constraints seems safe. But VALIDATE CONSTRAINT on a large table still takes a lock in some databases and Postgres versions. Teams that planned for zero downtime on a constraint addition find themselves in a 45-minute table lock at 2am.

What actually achieves low downtime

The honest answer is that zero-downtime migrations require consistent, rehearsed practices: expand-contract migrations where you never remove before you're certain nothing reads it, automated lock-timeout settings that abort DDL rather than block the workload, migration tools that understand which operations require exclusive locks, and staging environments that actually mirror production table sizes. The teams that achieve true zero downtime have automated all of this. The teams that claim it have usually just gotten lucky.