
Zero-downtime database migrations - fredsters_s
https://blog.rainforestqa.com/2014-06-27-zero-downtime-database-migrations/
======
dlubarov
> We also use Postgres which doesn't create the obstacles that some other
> database systems have, for example with locking tables on schema changes.

Actually, schema changes still require exclusive table locks; they're just
held very briefly, since Postgres doesn't need to rewrite each tuple. (With
some exceptions, like adding a NOT NULL column or reducing the size of a
VARCHAR.)

I don't mean to be pedantic; this can be important because the exclusive lock
request can be blocked by other long-running queries. Postgres will then grant
the ALTER TABLE's exclusive lock ahead of other conflicting locks, so normal
reads and writes will be locked out until the long query completes or one of
the queries is killed.
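
One mitigation is to put a short lock_timeout on the ALTER so it fails fast
instead of sitting in the lock queue ahead of normal traffic. A minimal sketch
in plain SQL (Postgres 9.3+; table and column names here are hypothetical):

    -- Give up after 2 seconds rather than blocking reads/writes
    -- behind the queued ACCESS EXCLUSIVE lock; retry later on failure.
    BEGIN;
    SET LOCAL lock_timeout = '2s';
    ALTER TABLE accounts ADD COLUMN plan_id integer;
    COMMIT;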

~~~
jeffdavis
"Downtime" is a under-defined in this context. The author seems to use the
definition "won't cause errors or disconnections" while you are also
considering the response time.

Everyone cares about the former, while the latter is of more interest to
heavily-loaded sites.

The case you mention of a long-running query is important, so everyone needs
to have some idea of their workload at the time they run the migration.

------
dangrossman
pt-online-schema-change, part of the Percona Toolkit, is my go-to tool for
making production changes in MySQL.

Improvely and W3Counter both have tons of multi-gigabyte tables. A regular
ALTER TABLE statement could take hours to run, and would lock the table the
entire time, essentially taking down the service.

Percona's tool makes a copy of the table, runs the ALTER TABLE statements on
the copy, then sets up triggers on the original to temporarily mirror
data-altering queries to the copy. It then intelligently batches the data
migration
from the old table to the new one to avoid overloading the server, while
printing progress and ETA on the console. When the copying's done, the old
table is dropped and the modified one renamed in a single transaction.

It takes just one command and has always worked flawlessly.

[http://www.percona.com/doc/percona-toolkit/2.2/pt-online-schema-change.html](http://www.percona.com/doc/percona-toolkit/2.2/pt-online-schema-change.html)
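
For a sense of what that one command looks like, a typical invocation is
something like this (the database, table, and column names here are made up):

    pt-online-schema-change --alter "ADD COLUMN last_seen DATETIME" \
      D=mydb,t=events --execute

The --dry-run flag is also worth knowing about before pointing it at
production.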

~~~
dperfect
For ActiveRecord (or other Ruby-based) migrations with MySQL, I've had good
luck with Large Hadron Migrator[1]. It's worked flawlessly for some of our own
tables with several hundred million rows (and no downtime).

[1] [https://github.com/soundcloud/lhm](https://github.com/soundcloud/lhm)
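
For the curious, the API is a small DSL; a sketch along the lines of the LHM
README (table and column names here are hypothetical):

    require 'lhm'
    
    # Builds a copy of `users`, replays live writes via triggers while
    # backfilling in chunks, then atomically renames the copy into place.
    Lhm.change_table :users do |m|
      m.add_column :last_seen, "DATETIME"
      m.add_index  [:last_seen, :id]
    end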

~~~
hamburglar
This looks interesting. Do you know how it actually works? The readme simply
says, "The basic idea is to perform the migration online while the system is
live, without locking the table. In contrast to OAK and the facebook tool, we
only use a copy table and triggers." What's the technique used to achieve
that?

~~~
dperfect
From what I understand, it creates a new table with the same structure as the
existing one, runs your migrations on it, and adds triggers so that live
changes to the original table are mirrored to the copy while the existing
records are copied over in batches. Once everything is copied and up to date,
it swaps the new table in for the old one in an atomic operation.
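
In raw MySQL terms, the moving parts are roughly these (a simplified sketch
with hypothetical names; the real tools also handle batching, retries, and
cleanup):

    CREATE TABLE users_new LIKE users;
    ALTER TABLE users_new ADD COLUMN last_seen DATETIME;
    
    -- Mirror live writes into the copy while the backfill runs.
    CREATE TRIGGER users_mirror_ins AFTER INSERT ON users FOR EACH ROW
      REPLACE INTO users_new (id, email) VALUES (NEW.id, NEW.email);
    
    -- Backfill existing rows in batches (one batch shown).
    INSERT IGNORE INTO users_new (id, email)
      SELECT id, email FROM users WHERE id BETWEEN 1 AND 10000;
    
    -- Atomic swap.
    RENAME TABLE users TO users_old, users_new TO users;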

------
jpollock
Phase 2 of the column change has a sneaky bug in it.

In the order presented:

1) Run migration

2) Apply trigger

There will be a point in time where the migration has been done and the
trigger hasn't been applied.

Even sneakier, if the trigger is applied in code and not in the database,
there will be a period of time where some of the code is running the trigger
and some isn't.

The trigger needs to be in production and executing prior to the backfill
running.
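
In other words, the safe order is: create the trigger, then backfill. A rough
Postgres sketch with hypothetical names:

    -- 1. Install the sync trigger first, so no write slips through.
    CREATE FUNCTION copy_bar() RETURNS trigger AS $$
    BEGIN
      NEW.bar_new := NEW.bar;
      RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;
    
    CREATE TRIGGER copy_bar_trg BEFORE INSERT OR UPDATE ON foos
      FOR EACH ROW EXECUTE PROCEDURE copy_bar();
    
    -- 2. Only then backfill the rows that existed beforehand.
    UPDATE foos SET bar_new = bar WHERE bar_new IS DISTINCT FROM bar;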

I'm also loath to rely on "triggers" in application model logic. I've seen
too many platforms where the operations engineers interact directly with the
database, resulting in required logic not executing.

~~~
jeffdavis
Great point.

When tricky concurrency is involved, using the built-in tools offered by the
database is the safest approach.

Often, it's also the easiest and most performant.

------
oelmekki
Good idea with the fake-delete column. I've always seen non-rollback-able
column removal migrations as something we just have to live with; this totally
solves the problem.

As for when the right time is to actually remove the column, I think this
meshes pretty well with my current thinking about migrations.

A new developer joined my company (we use Rails as well). When setting things
up, he tried to run `rake db:migrate` (my fault: I mentioned it in my doc
instead of `rake db:schema:load`, having written the doc as I was setting up
the very first lines of the project). There were several years of migrations,
some of them no longer working since they used code that no longer exists
(like `#add_hstore_index` from activerecord-postgres-hstore, which has been
replaced by native Postgres support).

This made me think that there is no need to keep migrations in the repository
if we can't use them anymore, and thus that migrations should be flushed from
time to time. That flush is the perfect time to actually delete columns (well,
it should be done just before). That way, we ensure our migrations can always
be rolled back without breaking anything.

Edit: refining that, we probably don't want to flush all migrations, only the
older ones. So how do we insert the migration that actually removes the data?
We can't just put it between the old migrations and the ones we'll keep, since
it would be considered already migrated. We can't run it directly in a SQL
console either, or we would have to do it on every machine, which would cause
desync among developers' machines.

I think the best way to do it would be to append a new non-reversible (but
non-breaking on reverse) migration that uses `table_exists?` and
`column_exists?` to ensure it only runs on databases that contain the
deprecated columns/tables. Something like:

    
    
        class RemoveDeprecated < ActiveRecord::Migration
          def up
            # Guarded so this is a no-op on databases where the
            # deprecated column is already gone.
            remove_column :foos, :bar_deprecated if column_exists? :foos, :bar_deprecated
          end
    
          # Intentionally empty: not reversible, but safe to "reverse".
          def down
          end
        end

~~~
jipiboily
I never flushed migrations; maybe I should! Thanks for the comment. Food for
thought :)

------
jipiboily
JP, author of the post here - let me know if you have any additions /
questions!

~~~
planckscnst
This is great. I've been wondering how to do this, and this is exactly what I
had come up with. I'm glad to have confirmation that it's the normal way
people do zero-downtime migrations.

I might add one change. You say add a column migration, ship it, then add the
feature that uses the column and ship again. I was planning on shipping the
code at the same time as (or even before) the migration, but having the
feature disabled with a feature-flipper until the migration is done.
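
A minimal sketch of that guard (the `Feature` helper here is hypothetical;
real projects would use something like the rollout or flipper gems):

    # The feature ships dark; the flag is flipped on only after the
    # column migration has run everywhere.
    class Feature
      FLAGS = { uses_new_column: false }
    
      def self.enabled?(name)
        FLAGS.fetch(name, false)
      end
    end
    
    if Feature.enabled?(:uses_new_column)
      puts "code path that touches the new column"
    end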

~~~
jipiboily
Depending on your framework and ORM and how its column cache works, that
sounds like a good solution too!

------
msherry
When I saw this, I first thought it was an extension of
[https://news.ycombinator.com/item?id=5326159](https://news.ycombinator.com/item?id=5326159)
;)

~~~
waffle_ss
If I'm understanding correctly, you take all frontends out of HAProxy's
config, so that HAProxy basically queues requests in-memory until you put a
frontend back into circulation? That seems like a much simpler solution for
when you have migrations that do not take very long (and enough memory for
HAProxy to spool those requests).

Would you be willing to add a minimal example HAProxy config to your blog post
that demonstrates your setup? I've been meaning to try out HAProxy but the
documentation is so dense I've had trouble getting started. Your use case
sounds like a really good reason to give it another go.

------
junto
For .NET developers I can recommend Fluent Migrator. Works brilliantly with
Octopus Deploy.

------
wikwocket
I prefer our approach:

 _> Case #1: I want to add a new column_

Issue an ALTER TABLE ADD COLUMN on the production database. Try not to make it
NOT NULL.

 _> Case #2: I want to change a column type_

Add another column (see Case #1) of the same name, with a suffix like "_2" or
"_NUM" so you can tell them apart.

 _> Case #3: I want to rename a column_

Don't risk it. Keep using the original column name. Very few of the column
names reflect the reality of what they contain anyway. It's little bits of
insider knowledge like this ("Product name is actually stored in
ITEMNAME_LONG; ITEMNAME is only a CHAR(15), so we keep UPC in there") that
contribute to our fast-paced, agile work environment.

