
Zero-downtime Postgres migrations – the hard parts - Sinjo
https://gocardless.com/blog/zero-downtime-postgres-migrations-the-hard-parts/
======
Thaxll
For those using MySQL: [https://www.percona.com/doc/percona-toolkit/2.2/pt-
online-sc...](https://www.percona.com/doc/percona-toolkit/2.2/pt-online-
schema-change.html)

~~~
brlewis
"The technique of atomically renaming the original and new tables does not
work when foreign keys refer to the table."

I think generally mysql shops that use pt-online-schema-change don't create
foreign key constraints in production, sidestepping the whole problem this
article is about.

~~~
spotman
Which is most of them that have enough traffic/data to need this tool in the
first place. MySQL generally tends to perform badly at massive scale with
foreign keys. (i.e., when your doing 30k queries/second)

------
xzlzx
Did you ever think of testing these migrations against at least a subset of
your production data? It sounds like you just wrote the migrations and let
them go into the wild without thought.

~~~
ploxiln
According to the post, the pause/lockup could only happen under production
queries, and was not directly related to production data.

They did test against the full production data, afterwards, and could not
reproduce. Which was what they expected, since the migrations were on empty
tables, which just happened to have foreign key constraints against large
tables, but no rows for those constraints to actually apply to.

So for them to "fully" test this migration before applying it to production,
they would need to be replaying all production queries against the testing
database as well, and maybe even test the migration multiple times to get a
statistical sense of the possible latencies.

(I've actually done something like that before, but it's not something you do
for every little change)

~~~
Sinjo
Spot on. We ran the migrations against production-sized data beforehand, and
everything was fine. We did it again after the downtime, and got the same
results.

Most of our queries are fast (tens of milliseconds or lower), which is how we
got away without knowing about this for so long. Unsurprisingly, we've been
making a bigger effort to eliminate any slow queries we do find lately. ;)

~~~
spotman
Curious, is the:

    
    
      SELECT DISTINCT(email) FROM parent;
    

The actual query you run in production, or similar enough? That could be a
very expensive query with a very large amount of rows, even if indexed.
(Large, being in the multi millions, depending on hardware). Might make more
sense to run something like that on a reporting slave.

~~~
Sinjo
That was a fabricated example, rather than something plucked from our
codebase. It's the sort of query we avoid running on our production database.

I agree with your approach though. Our data analysis runs elsewhere.

------
akurilin
Timely article for me. Just now starting to deal with migrating Postgres
tables where adding a new non null column is taking 4+ hours. Love how the
process has to be completely different at this kind of scale.

~~~
Dylan16807
Does that lock the table for 4 hours? I was going to talk about how it should
be straightforward to add support for almost-non-blocking alters but it looks
like all the pieces are already there:

* It supports adding a null column immediately.

* It supports slowly filling in a possibly-default value on all records.

* It supports setting the column to non-null when every row has a value.

What pain points are you hitting, and is my analysis totally off-base?

Though a quick search shows that step three might decide to ignore indexes and
be slow; that would be disappointing.

~~~
Sinjo
You're right on the approach. One thing to be careful of is that there's no
option to add a not-null constraint with the `NOT VALID` flag. That means
it'll hold the AccessExclusive lock while it checks every row in the table
meets the constraint.

You can work around this by adding a `CHECK` constraint which does the same
thing, but it's a little inconvenient. There was some work being done to add
this feature to not-null constraints ([http://www.postgresql.org/message-
id/20140517155857.GD7857@e...](http://www.postgresql.org/message-
id/20140517155857.GD7857@eldon.alvh.no-ip.org)), but unfortunately it looks
like it's gone stale.

------
buckbova
I didn't see a specific answer for the current setup referenced in the article
that would have netted zero downtime.

It's true the table wasn't available for 15 seconds, but current connections
weren't dropped/lost, so a robust interface with the API should survive a 15
second delay.

Shouldn't a system like this be built to handle long delays?

~~~
Sinjo
We can't count on a user (either in the browser or using the API) waiting
around that long. Even some of our own services time out more aggressively
than that.

Those situations can be smoothed over with retries, but we try pretty hard to
avoid delays like this in the first place.

------
Joeri
Does anyone know of a similar article for doing zero downtime oracle
migrations?

~~~
polsoul
Not a particular link but openworld presentation of how Bank of America do it
using Golden Gate, there was also Pythian blog post about it, I believe.

