
Move Fast and Migrate Things: How We Automated Migrations in Postgres - sajithw
https://benchling.engineering/move-fast-and-migrate-things-how-we-automated-migrations-in-postgres-d60aba0fc3d4?
======
Daishiman
I am convinced that data migration is definitely one of the hardest problems
in data management and systems engineering.

There are basically no solutions today that satisfy fundamental requirements
such as minimizing downtime and guaranteeing correctness.

It is _such_ a huge problem that most inexperienced developers see kicking the
problem down the line with NoSQL document storage as a viable alternative (it
isn't; you'll be either dealing with migrating all data forever and special-
casing every old version of your documents, or writing even more convoluted
migration logic).

It's also clear that even the most modern ORMs and query builders were not
designed with the issues that arise in migrating data in mind.

It would be a refreshing thing to see more research devoted to this problem.
Unfortunately, migrations end up being so different from each other with such
heterogeneous requirements that we'll probably be working on this for a really
long time.

~~~
dcosson
I think DBs could definitely do more to expose what the costs of various
operations are. It would be great if you could "explain" a migration before
you run it, like you can with a query, and have it calculate a rough cost: how
many rows need to be touched, what resources need to be locked, even how
likely the required locks are to cause contention with other frequently taken
locks based on system statistics, etc.

But I was thinking about this recently and I feel like there's some low-hanging
fruit in the migration frameworks themselves, which, at least as far as
I'm aware, all just completely punt on this problem. Rails, Alembic, and every
other framework I've used will let you write a migration like adding a new
non-nullable column, or renaming an existing column: things that can be really
slow on a big or frequently written table and/or will cause problems during
the rollout if an old version of the code is still running. It doesn't seem
like it would be that hard to add a safe mode where the framework blocks
at least the most common variations of these unsafe migrations. Maybe
it's harder than I realize, or maybe it's just a matter of someone opening up
some PRs and actually implementing it.
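
In Alembic, for instance, a rough sketch of such a guard could hang off the
process_revision_directives hook (the hook is real; the policy below is just an
illustration and far from exhaustive):

    # env.py -- hypothetical "safe mode" check for autogenerated migrations
    from alembic.operations import ops

    def _iter_ops(container):
        # UpgradeOps may nest column-level ops inside ModifyTableOps containers
        for op in container.ops:
            if hasattr(op, "ops"):
                yield from _iter_ops(op)
            else:
                yield op

    def forbid_unsafe_ops(context, revision, directives):
        script = directives[0]
        for op in _iter_ops(script.upgrade_ops):
            if isinstance(op, ops.AddColumnOp):
                col = op.column
                if not col.nullable and col.server_default is None:
                    raise RuntimeError(
                        f"Unsafe migration: {op.table_name}.{col.name} is NOT NULL "
                        "with no server default; add it nullable and backfill instead."
                    )

    # in env.py: context.configure(..., process_revision_directives=forbid_unsafe_ops)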

~~~
tomnipotent
> do more to expose what the cost of various operations are

Both Postgres & SQL Server support EXPLAIN on DML without running the
statement (with varying levels of depth).

    
    
        CREATE TEMPORARY TABLE tmp_blah (num INT);
        EXPLAIN INSERT INTO tmp_blah SELECT * FROM generate_series(1, 100000) ORDER BY random();
    
        QUERY PLAN
        Insert on tmp_blah  (cost=62.33..74.83 rows=1000 width=4)
            ->  Subquery Scan on "*SELECT*"  (cost=62.33..74.83 rows=1000 width=4)
                ->  Sort  (cost=62.33..64.83 rows=1000 width=12)
                    Sort Key: (random())
                    ->  Function Scan on generate_series  (cost=0.00..12.50 rows=1000 width=12)

~~~
jclulow
One challenge with that is that the query plan often depends on the current
table composition statistics. If a value is relatively common (or uncommon),
or the size of the table has grown or shrunk, you may end up with a materially
different (and conceivably substantially worse) plan.

~~~
tomnipotent
I can't tell you how many times over 20 years I've heard a DBA tell me "the
statistics weren't updated" after an incident.

~~~
ovi256
I've written cron jobs to update MySQL statistics to prevent it from choosing
bad query plans. It's as terrible as it sounds.

~~~
pas
How come it's not automatic and built-in? Why is a script terrible? (Does it
have to enumerate all tables and piecewise run the stat update?)

~~~
ovi256
Yes, the stats update is per table. We only updated it for a few tables we
really cared about.
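
Roughly this, run nightly from cron (table names here are made up):

    -- refresh optimizer statistics for the handful of hot tables
    ANALYZE TABLE orders, order_items, users;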

------
existencebox
This is an amazing writeup. I'm currently solving the "migrations" problem for
a side project of my own, and have basically resigned myself, in the short
term, to accepting short downtime for the sake of keeping migrations somewhat
trivial.

And honestly? I hate this answer. As a solo dev it's pragmatic, but the
solutions described in this article are _SO NICE_ that I'd love to leverage
them.

If there's any way that those deprecate_column and rename functionalities
could make their way into OSS/upstream support, I'd have a field day. (Those
who know more about PG than I do and can perhaps suggest another workaround,
feel free; I'm very much learning this space as I go.)

If nothing else, thanks to the Benchling team for taking the time to write
such a clear yet technical exposé. This really hit the sweet spot of
"explanations without unnecessary verbosity, technical without being
impenetrable, and giving sufficient explanation of motivations and
tradeoffs/pitfalls", and will give me a north star for where I aim my own DB
work.

~~~
vineetg
Original author here. Thanks for the kind words!

We would love to open source some of the work we did; there are still a few
edge cases to work out with deprecated_column and renamed_to before I'd be
comfortable doing that, but I definitely agree it could be generally useful.

~~~
BenMorganIO
If it's just edge cases, you can open source it and document them. The
community will likely fork it and help with those edge cases.

------
ThePhysicist
In our startup we moved away from Alembic to plain SQL files for
migrations, which (in our experience) is more robust and allows more control
over the actual migration process. We wrote a simple migration manager class
that loads a YAML config and a series of SQL files from a directory. Each
migration is defined as two files of the form
"[number]_[up/down]_[description].sql" and is tied to a DB version; the YAML
config specifies the name of a version table that holds the current version
in the database. The manager then reads the current version from the table,
compares it to the requested one, and executes the necessary SQL files.
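
A minimal sketch of that scheme (illustrative, not our actual code; the
psycopg2 usage and config key are assumptions):

    # migrate.py -- illustrative only
    import re
    from pathlib import Path

    import psycopg2  # assuming Postgres
    import yaml

    def load_migrations(directory):
        """Map version -> {"up": path, "down": path} from the SQL file names."""
        migrations = {}
        for path in Path(directory).glob("*.sql"):
            m = re.match(r"(\d+)_(up|down)_(.+)\.sql$", path.name)
            if not m:
                continue
            number, direction, _ = m.groups()
            migrations.setdefault(int(number), {})[direction] = path
        return migrations

    def migrate(conn, config_path, directory, target):
        config = yaml.safe_load(Path(config_path).read_text())
        table = config["version_table"]          # name of the version table
        migrations = load_migrations(directory)
        with conn.cursor() as cur:
            cur.execute(f"SELECT version FROM {table}")
            current = cur.fetchone()[0]
            if target > current:                 # apply "up" files in ascending order
                for v in range(current + 1, target + 1):
                    cur.execute(migrations[v]["up"].read_text())
            else:                                # apply "down" files in descending order
                for v in range(current, target, -1):
                    cur.execute(migrations[v]["down"].read_text())
            cur.execute(f"UPDATE {table} SET version = %s", (target,))
        conn.commit()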

Alembic is great for many simple use cases but we found that for a production
system it often isn't easy to maintain compatibility between two different DB
systems like Postgres and SQLite anyway, as that would mean either adding a
lot of feature switches and custom logic to our code or not using most of the
interesting native Postgres features. Therefore Alembic offered very little
benefit over a plain SQL file in terms of functionality and in addition made
it harder to generate correct migrations in some cases, as the auto-generation
process does not work very reliably in our experience and some things are
buggy/inconsistent, e.g. the creation and deletion of enum types. In addition,
we found that it's much easier to write complex migration logic (e.g. create a
new column and populate it with data from a complex SELECT statement) directly
in SQL. Last point is that we can of course execute these migrations using any
programming language / tool we like (for example we also wrote a small Go
library to handle the migrations), which is a nice bonus.
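
For example, the kind of backfill mentioned above (new column populated via a
SELECT) might look like this in plain SQL (the schema here is invented):

    -- 0007_up_add_display_name.sql (illustrative)
    ALTER TABLE users ADD COLUMN display_name text;

    UPDATE users u
       SET display_name = trim(coalesce(p.first_name || ' ', '') || coalesce(p.last_name, ''))
      FROM profiles p
     WHERE p.user_id = u.id;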

That said we also heavily use SQLAlchemy in our Python backend code and like
it a lot.

------
theptip
This covers a lot of ground that I've recently had to learn the hard way.

One item I've been considering: under Downtime, one reason given for flaky
migrations is "long running transactions".

I've seen this too, and wonder if the correct fix is actually to forbid long-
running transactions. Typically, the naive long-running transaction does
something like:

    
    
        with transaction.atomic():
            for user in User.objects.all():
                user.do_expensive_thing_to_related_objects()
    
    

You can often recast that migration to something more like

    
    
        for user in User.objects.all():
            with transaction.atomic():
                user = User.objects.select_for_update().get(id=user.id)  # re-read and lock the row (SELECT ... FOR UPDATE)
                user.do_expensive_thing_to_related_objects()
    

This example is somewhat trivial, but in most cases I've seen, you can fetch
your objects outside of the transaction, compute your expensive thing, and
then lock the row for the individual item you're working on (with a sanity
check that your calculation inputs haven't changed, e.g. check that the
last_modified timestamp is the same, or better, that the values you're using
are the same).

I've considered simply configuring the DB connection with a very short
statement timeout (something like 5 seconds) to prevent anyone from writing a
query that performs badly enough to interfere with other tables' locks.
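
In Postgres that would be something along these lines (the role name is just an
example):

    -- applies to every connection made by this role
    ALTER ROLE app_rw SET statement_timeout = '5s';                     -- kill slow statements
    ALTER ROLE app_rw SET lock_timeout = '2s';                          -- give up on lock waits quickly
    ALTER ROLE app_rw SET idle_in_transaction_session_timeout = '30s';  -- kill forgotten open transactions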

Anyone tried and failed/succeeded in making this approach work?

The other subject that's woefully underdeveloped is writing tests for
migrations; ideally I want to (in staging) migrate the DB forwards, run all
the e2es and smoke tests with the pre-migration application code, migrate back
(to test the down-migration), run the e2es again, and then really migrate
forwards again. That would cover the "subtly broken deleted field" migration
problem.
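
A rough sketch of that round trip as a staging test (Django flavoured; the app
and migration names are made up, and the "run against the pre-migration app
code" deployment step is glossed over):

    from django.core.management import call_command

    def run_e2e_and_smoke_tests():
        ...  # placeholder: kick off the e2e/smoke suites against staging

    def test_migration_round_trip(db):
        call_command("migrate", "accounts", "0042_add_nickname")  # forwards
        run_e2e_and_smoke_tests()
        call_command("migrate", "accounts", "0041")               # down-migration
        run_e2e_and_smoke_tests()
        call_command("migrate", "accounts", "0042_add_nickname")  # forwards again, for real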

But how do we test that our migrations behave correctly in the face of long-
running transactions? I.e. what's the failing test case for that bug?

~~~
zepolen
No,

    
    
      with transaction.atomic():
        for user in db.users.all():
          user.do()
    

is _not the same_ as

    
    
      for user in db.users.all():
        with transaction.atomic():
          user.do()
    

If the first fails, everything rolls back; if the second fails, half your
db might be left in an inconsistent state.

~~~
theptip
Yes, of course, they are not the same. I did not claim they were. I said:

> You can often recast that migration to something more like

By which I mean, in my experience you can usually write your migrations so
that your code can work with the old AND the new version of the data, in which
case you don't need to have a transaction around the whole operation.

This takes more work but is safer:

[https://martinfowler.com/bliki/ParallelChange.html](https://martinfowler.com/bliki/ParallelChange.html)

[https://www.amazon.com/Refactoring-Databases-Evolutionary-pa...](https://www.amazon.com/Refactoring-Databases-Evolutionary-paperback-Addison-Wesley/dp/0321774515)

------
ngrilly
Great post. I agree that we don't need automated post-deploy migrations. We
just need automated pre-deploy migrations. Post-deploy migrations, for example
to delete an unused column, can be implemented as pre-deploy migrations in a
subsequent commit.

------
Ixiaus
You can also automatically generate some migrations with a proof search in
linear logic, the way the beam[1] project does.

[1] [https://tathougies.github.io/beam/schema-guide/migrations/#a...](https://tathougies.github.io/beam/schema-guide/migrations/#automatic-migration-generation)

------
benmmurphy
In PostgreSQL, if you are using prepared statements and doing a 'select
star', then dropping or adding a column will make the prepared statement start
failing. This is kind of bad when you are doing transactions because the bad
statement will taint your transaction and you will need to restart from the
beginning. Select star is incompatible with prepared statements in PostgreSQL,
which might also explain why SQLAlchemy explicitly names columns in select
statements. Rails will do 'select star', so prepared statements are the first
thing I turn off in Rails/PostgreSQL projects. [Maybe they have a fix now?]
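
You can reproduce it in psql (table and statement names made up); the typical
error is something like:

    PREPARE get_users AS SELECT * FROM users;
    EXECUTE get_users;                           -- works
    ALTER TABLE users ADD COLUMN nickname text;
    EXECUTE get_users;
    -- ERROR:  cached plan must not change result type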

~~~
pas
> This is kind of bad when you are doing transactions because the bad
> statement will taint your transaction and you will need to restart from the
> beginning.

Why is this bad? Can you handle this on the application side somehow? Even if
it just means restarting Rails when the migration has finished.

Or is the problem that you want a migration with zero downtime and zero UX
impact?

~~~
benmmurphy
It is bad that you have to write a bunch of extra code in your application to
retry on this failure. I guess you could potentially modify the Rails
framework to automatically retry transactions in this situation and
aggressively purge the prepared statement cache, but this is really a Rails
change and not an application change. I don't think it is possible to hook the
transaction logic in Rails to retry transactions safely from application code.

As you guessed, restarting Rails does fix the problem, but the issue is that
there is then no zero-downtime path for migrations that should be trivially
safe, i.e. adding a column.

------
wwweston
How did it come to be that some portion of the industry uses the term
"migration" to describe changes/updates to a database?

As far as I can tell, it's a really poor fit. It creates the expectation of
moving an existing schema (and maybe data) from one host to another or one
environment to another. What's usually happening instead is essentially a
schema diff / mutation.

~~~
Ixiaus
Because when you change the representation of data at rest, you need to
"migrate" that data to the new representation.

I agree that schema changes are not migration, but I think the author
correctly uses the word "migration" to mean migrating their data in the
database to some new schema representation.

~~~
alecbenzer
This is true, but it's a bit of an implementation detail. All I'm doing is a
schema change; a migration is what the database does under the hood to
facilitate it.

~~~
Ixiaus
The database doesn't know how to perform a data migration for you, however.
Sometimes it's possible to do it in SQL as part of the schema upgrade but that
isn't always possible, in which case you need to implement it in the
application. But regardless of how you implement it, it's still "data
migration" because you're migrating data from one representation to another
and your database won't always know how to do that "under the hood" for you.

------
danielbigham
These are the types of challenges I've been thinking about for the last year
-- nice to be able to read the approach taken by others.

------
bayesian_horse
I'm totally in love with Django's way of migrating databases.

It's not for every project, certainly, and you sometimes need to work around
limitations of the ORM. And of course some people don't like ORMs in the first
place.

------
seanwilson
Anyone else hold back on releasing side projects because having to do data
migrations on stored user data keeps you from aggressively refactoring your
code? Is there a good compromise for this?

~~~
pas
If the side project is "small", then the migration can be done in a few hours
tops, so user impact is probably negligible. Have you encountered a more
severe problem maybe?

~~~
seanwilson
I just meant I notice I feel a lot less freedom after initial release. When I
haven't released something yet, I can wipe and recreate the database at any
time, move, rename etc. as I want to try things out. Once I've released
though, changes become significantly more painful and risky. Perhaps a
migration is a few hours tops, but that's compared to a few minutes when you
don't have active users yet, and when you're a solo developer each hour is a
significant chunk of your productivity.

~~~
pas
> I just meant I notice I feel a lot less freedom after initial release.

Well, there's no arguing with that. And it usually doesn't matter if you have
2 or 200000 users, the migration has to be perfect - and automated - anyway :)

