
And to add to this, one of the brutal anti-rollback gotchas I’ve seen on smaller projects (not a distributed-systems failure, but a code bug that puts you in a painful spot): not having transactional DDL for your schema changes.

The typical scenario I’ve had to help with: a MySQL database that’s been running in prod for a while. Because of lax validation checks early in the project, weird data ends up in a column for just a couple of rows. Later on you make a change to that column in a migration, or use that column’s data as part of a migration. The migration blows up in prod (it worked fine in dev and staging because that particular odd data wasn’t present there), and due to the lack of transactional DDL, your migration is half applied and has to be either manually rolled back or rolled forward.



In my experience this is rare, but I suppose it depends what you mean by "Migration blows up in prod". Do you mean the ALTER TABLE is interrupted (query killed, or server shutdown unexpectedly)? Or do you mean you're bundling multiple schema changes together, and one of them failed, which is causing issues due to lack of ability to do multiple DDL in a single transaction?

The former case is infrequent in real life. Users with smaller tables don't encounter it, because DDL is fast with small tables. Meanwhile users with larger tables tend to use online schema change tools (e.g. Percona's pt-osc, GitHub's gh-ost, Facebook's fb-osc). These tools can be interrupted without harming the original table.

Additionally, DDL in MySQL 8.0 is now atomic. No risk of partial application if the server is shut down in the middle of a schema change.

The latter case (multiple schema changes bundled together) tends to be problematic only if you're using foreign keys, or if your application is immediately using new columns without a separate code push occurring. It's avoidable with operational practice. I do see your point regarding it being painful on smaller projects though.


Generally in my experience it's been bundled schema changes. As a (dumb) example in SQL-like pseudocode (I've been doing EE work recently; apologies for imperfect syntax):

  ALTER TABLE user ADD COLUMN country varchar(64); -- oops
  CREATE TABLE user_email (user_id int, email varchar(64));
  -- process here to move data from the user.email column to the user_email
  -- table. This blows up because someone has a 256-byte-long email address
  -- and user_email.email is a smaller varchar()
  ALTER TABLE user ADD COLUMN state_province varchar(64);
Yes, this is sloppy and should be cleaner. The net result is still the same though: you have the user_email table and country column created, and due to MySQL's implicit commit on DDL, they're persisted even though the data copy process failed. The state_province column does not exist, and now if you want to re-run this migration after fixing the problem, you first have to drop the user_email table and country column.

With e.g. Postgres, you wrap the whole thing in a transaction and be done with it. It gets committed if it succeeds, or gets rolled back if it fails.
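The same behavior is easy to demonstrate outside Postgres, since SQLite also has transactional DDL. A minimal sketch (the table and column names mirror the example above; the forced error stands in for the failing data-copy step):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.isolation_level = None  # take manual control of transactions
cur = conn.cursor()
cur.execute("CREATE TABLE user (id INTEGER PRIMARY KEY, email TEXT)")

# Run the whole migration in one transaction, as you would in Postgres.
try:
    cur.execute("BEGIN")
    cur.execute("ALTER TABLE user ADD COLUMN country TEXT")
    cur.execute("CREATE TABLE user_email (user_id INTEGER, email VARCHAR(64))")
    # Simulate the data-copy step blowing up partway through:
    raise RuntimeError("email longer than user_email.email allows")
except Exception:
    cur.execute("ROLLBACK")

# Both schema changes were rolled back along with everything else:
tables = {r[0] for r in cur.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")}
cols = [r[1] for r in cur.execute("PRAGMA table_info(user)")]
print("user_email" in tables, "country" in cols)  # False False
```

Run the same sequence against MySQL and both the user_email table and the country column survive the failure, because each DDL statement implicitly commits.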


Makes sense, thanks. fwiw, generally this won't matter in larger-scale environments, as it eventually becomes impractical to bundle DDL and DML in a single system/process/transaction. In other words, the schema management system and the bulk row data copy/migration system tend to be separated after a certain point.

Otherwise, if the table is quite large, the DML step to copy row data would result in a huge long-running transaction. This tends to be painful in all MVCC databases (MySQL/InnoDB, and iiuc Postgres even more so) if there are other concurrent UPDATE operations: old row versions accumulate, and then even simple reads slow down considerably.

A common solution is to have the application double-write to the old and new locations, while a backfill process copies row data in chunks, one transaction per chunk. But this inherently means the DDL and DML are separate, and cannot be atomically rolled back anyway.
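A sketch of the backfill half of that pattern, using sqlite3 as a stand-in and the hypothetical user/user_email tables from earlier (the application's double-write is assumed to already be in place). The key points are keyset pagination on the primary key and one commit per chunk:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE user (id INTEGER PRIMARY KEY, email TEXT)")
cur.executemany("INSERT INTO user (email) VALUES (?)",
                [(f"u{i}@example.com",) for i in range(10)])
cur.execute("CREATE TABLE user_email (user_id INTEGER PRIMARY KEY, email TEXT)")
conn.commit()

CHUNK = 3  # real backfills use much larger chunks
last_id = 0
while True:
    # Keyset pagination: resume from the last copied primary key.
    rows = cur.execute(
        "SELECT id, email FROM user WHERE id > ? ORDER BY id LIMIT ?",
        (last_id, CHUNK)).fetchall()
    if not rows:
        break
    cur.executemany(
        "INSERT INTO user_email (user_id, email) VALUES (?, ?)", rows)
    conn.commit()  # one transaction per chunk keeps MVCC history short
    last_id = rows[-1][0]

print(cur.execute("SELECT COUNT(*) FROM user_email").fetchone()[0])  # 10
```

Because each chunk commits independently, an interrupted backfill can resume from the last copied id instead of rolling back hours of work.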


Yeah, definitely for larger scale systems the process gets quite a bit different, although "large scale" will remain undefined :).

It's admittedly been a long time since I've used MySQL for anything significant, but I feel like teams in the past have run into issues where DDL operations on their own have succeeded in dev/staging and failed in prod, even though they're running on the same schema. Simply due to there being "weird" data in the rows. I don't know that for sure though. If I'm remembering right, one of those had something to do with the default "latin1_swedish_ci" vs utf8 thing...


Ideally, if you're not sure any of your environments matches prod, you clone prod first, apply your schema migration there, then install your applications and run their tests.
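To make the idea concrete, here is a sketch using sqlite3 and its backup API as a stand-in for restoring a prod dump into a scratch database (the table, column, and data are placeholders):

```python
import sqlite3

# "Prod" database with a little data in it.
prod = sqlite3.connect(":memory:")
prod.execute("CREATE TABLE user (id INTEGER PRIMARY KEY, email TEXT)")
prod.execute("INSERT INTO user (email) VALUES ('someone@example.com')")
prod.commit()

# Clone prod into a scratch database (stand-in for restoring a dump).
scratch = sqlite3.connect(":memory:")
prod.backup(scratch)

# Apply the migration to the clone; prod is untouched either way.
try:
    scratch.execute("ALTER TABLE user ADD COLUMN country TEXT")
except sqlite3.OperationalError as exc:
    print("migration failed on clone:", exc)
else:
    print("migration looks safe to run on prod")
```

The clone carries prod's real (possibly weird) data, so a migration that would blow up in prod blows up here first, where failure is cheap.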



