
Managing database schema changes without downtime - jbaviat
https://samsaffron.com/archive/2018/03/22/managing-db-schema-changes-without-downtime
======
drinchev
Having no downtime when doing migrations is not an easy topic.

> These defer drops will happen at least 30 minutes after the particular
> migration referenced ran (in the next migration cycle), giving us peace of
> mind that the new application code is in place.

Pretty much that's what I've seen before: you do a series of deployments plus
migration scripts.

Usually the sequence looks like this (sketched in code below):

\- Migrate with backward-compatible modifications to the schema.

\- Release application logic that works with both the migrated and the to-be-
deprecated tables.

\- Possibly run an extra migration to sync the new and old tables.

\- Release application logic that stops using the to-be-deprecated tables.

\- Migrate with destructive modifications to remove the old tables.
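
As a rough sketch, the first and last steps might look like this as Rails-style
migrations (the table and column names here are hypothetical):

    # Step 1: backward-compatible expansion. Old code keeps working,
    # new code can start using the new column.
    class AddNewEmailToUsers < ActiveRecord::Migration[5.1]
      def change
        add_column :users, :new_email, :string
      end
    end

    # Step 5: destructive cleanup, run only once no deployed code reads
    # the old column any more.
    class RemoveLegacyEmailFromUsers < ActiveRecord::Migration[5.1]
      def change
        remove_column :users, :legacy_email, :string
      end
    end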

Usually that's a huge, complicated process which might otherwise be replaced
with one migration and a few minutes of downtime. Sadly, some companies can't
afford that downtime, so they make such changes with weeks of planning.

I usually vote for having some planned non-working-hours downtime.

~~~
zombieprocesses
> Having no downtime when doing migrations is not an easy topic.

Depending on the nature of the migration, the type of data, the type of
service, and how coupled everything is, it's impossible. Especially when you
are dealing with financial data and transactions where you can't afford to
screw up even a single one.

> I usually vote for having some planned non-working-hours downtime.

Yep. No other way around it. You tell your clients weeks or months ahead of
the planned downtime. They'll understand, and you migrate in the middle of the
night over a holiday weekend.

But this is in regard to major structural migrations, not simple schema
updates or changes.

~~~
greenleafjacob
Are you saying no downtime is impossible or merely expensive?

Erlang/OTP with relups is an example of a system that provides online upgrades
and rollbacks with zero downtime.

~~~
zombieprocesses
I'm talking about serious systematic migrations, not upgrades. In these
instances, it is impossible to avoid downtime (whether it be a few seconds,
minutes, or even hours). It's one of the reasons why SLAs do not offer 100%
uptime.

~~~
theptip
I'd be interested in a concrete example of where you see a migration being
impossible without downtime; I'm not aware of Amazon or Google taking their
APIs down for maintenance, which suggests that you can solve these problems
with enough engineering spend.

> depending on... how coupled everything is

If you're just saying here that some systems aren't currently able to migrate
in a hitless way, rather than claiming that it would never be possible for
some systems, then I fully agree. Hitless migrations require a lot of work,
and probably aren't worth the cost for many applications/systems.

~~~
stickfigure
Neither Amazon nor Google run off of an RDBMS.

Relational databases are awesome, but it's much harder to achieve 100% uptime
with them compared to schemaless, distributed, replicated data stores. You can
easily take down your system with one innocuous-looking DDL migration that
works great in test but grinds the machine to a halt on a production dataset.

~~~
theptip
That's a good general point, but YouTube does use MySQL:

[https://www.mysql.com/customers/view/?id=750](https://www.mysql.com/customers/view/?id=750)
[http://highscalability.com/youtube-architecture](http://highscalability.com/youtube-architecture)

------
YorickPeterse
At GitLab we do the following:

We have two migration directories: db/migrate and db/post_migrate. If a
migration adds something (e.g. a table or a column) or migrates data that
existing code can deal with (e.g. populating a new table) then it goes in
db/migrate. If a migration removes or updates something that first requires a
code deploy, it goes in db/post_migrate. We combine this with a variety of
helpers in various places to allow for zero downtime upgrades (if you're using
PostgreSQL). For example, to remove a column we take these steps:

1\. Deploy a new version of the code that ignores the column we will drop
(this is a matter of adding `ignore_column :column_name` in the right class).

2\. Run a post-deployment migration that removes said column (steps 1 and 2
are sketched below).

3\. In the next release we can remove the `ignore_column` line (we require
that users upgrade at most one minor version for online upgrades).
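
A rough sketch of steps 1 and 2 (the column name is hypothetical, and
`ignore_column` is GitLab's own helper; plain Rails 5+ offers
`self.ignored_columns` for the same purpose):

    # Step 1: ship application code that no longer touches the column.
    class User < ActiveRecord::Base
      ignore_column :legacy_token
    end

    # Step 2: a post-deployment migration in db/post_migrate drops it.
    class RemoveUsersLegacyToken < ActiveRecord::Migration[5.1]
      def change
        remove_column :users, :legacy_token, :string
      end
    end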

The migrations in db/post_migrate are executed by default, but you can opt out
by setting an environment variable. In the case of GitLab.com this results in
the following deployment procedure:

1\. Deploy code with this variable set (so we don't run the post-deployment
migrations)

2\. Re-run `rake db:migrate` on a particular host without setting this
environment variable.

For big data migrations we use Sidekiq to run them in the background. This
removes the need for a deployment procedure taking hours, though it comes with
some additional complexity.
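
As a sketch of that idea (a hypothetical worker, not GitLab's actual
implementation, which is described in the third link below): slice the rows
into id ranges and let Sidekiq work through them outside the deployment
window.

    class BackfillEventPayloads
      include Sidekiq::Worker

      # Each job covers one id range, so no single job runs for hours and
      # a failed range can be retried in isolation.
      def perform(start_id, stop_id)
        Event.where(id: start_id..stop_id).find_each do |event|
          event.update_column(:payload, event.raw_data) # hypothetical columns
        end
      end
    end

    # Enqueue one job per 10,000 rows:
    Event.in_batches(of: 10_000) do |batch|
      BackfillEventPayloads.perform_async(batch.minimum(:id), batch.maximum(:id))
    end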

More information about this can be found at the following places:

1\. [https://docs.gitlab.com/ee/update/README.html#upgrading-without-downtime](https://docs.gitlab.com/ee/update/README.html#upgrading-without-downtime)

2\. [https://docs.gitlab.com/ee/development/what_requires_downtime.html#doc-nav](https://docs.gitlab.com/ee/development/what_requires_downtime.html#doc-nav)

3\. [https://docs.gitlab.com/ee/development/background_migrations.html](https://docs.gitlab.com/ee/development/background_migrations.html)

~~~
lojack
It's worth noting that there are some edge cases where this could lead to
potential problems. It's enough of an edge case that many apps can ignore the
possibility. Take your example, with the added fact that the field you're
removing has a unique database constraint. With your deploy process this would
look something like:

1\. Run a pre-deploy migration in db/migrate to remove the unique constraint
(sketched below)

2\. Deploy the new version of the code that ignores the column

3\. Run the post-deploy migration that removes the column

4\. Remove the part of the code that ignores the column
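
Step 1 might look something like this (table, column, and index names
hypothetical):

    # Pre-deploy migration in db/migrate: drop the unique index first, so
    # inserts from new code that ignores the column can't trip over it.
    class RemoveUniqueIndexFromUsersEmail < ActiveRecord::Migration[5.1]
      def up
        remove_index :users, :email
      end

      def down
        add_index :users, :email, unique: true
      end
    end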

Failing to do step 1 would result in multiple rows being created with null
values, which would cause errors for all but one (or zero) of the inserts. The
edge case, however, is a little more subtle: between steps 1 and 2 there's a
period where the database constraint no longer exists, but the old code still
expects it. Validating uniqueness in the application code can mitigate this,
but doesn't ensure consistency.

~~~
YorickPeterse
This is not how we do it. Constraints, columns, etc. are removed using a post-
deployment migration. This means you can't end up with NULL values and the
like.

~~~
lojack
This helps ensure database consistency, but might not actually avoid errors.
In this scenario, during the period between the app code being deployed and
the post-deploy migration running, there could be application-level errors,
because the app is now ignoring the deleted field while the database still
expects unique values (or non-null values, or whatever constraints you have).

You can solve these issues with some creative migration paths at both the app
level and the database level, but those changes aren't always trivial, and may
not necessarily fit the pre-deploy/post-deploy migration strategy.

This may also not be an issue for you: you may be willing to accept a brief
window where an error is unlikely but technically possible. Personally, most
of the time this is acceptable to me, but I always find it important to think
through these scenarios in case something does happen.

------
michaeldejong
Researcher/author of a tool [0,1] also attempting to tackle this problem here.

Unfortunately, zero-downtime schema changes are even more complex than
suggested here. Although the expand-contract method described in the post is a
good approach to the problem, the mere act of altering a database table that
is in active use is dangerous in itself. I've found that even trivial
operations, such as adding a new column to an existing table, can block
database clients from reading from that table through full table locks held
for the duration of the schema operation [2].

In many cases it's safer to create a new table, copy the data over from the
old table to the new one, and switch clients over. However, this introduces a
whole new set of problems: keeping data in sync between the tables, "fixing"
foreign key constraints, etc.
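
A condensed sketch of that approach (PostgreSQL syntax, hypothetical table;
real tooling also has to keep concurrent writes in sync while the copy runs,
e.g. via triggers, and recreate foreign keys):

    class RewriteUsersTable < ActiveRecord::Migration[5.1]
      def up
        # Build the replacement table and copy the data across.
        execute <<-SQL
          CREATE TABLE users_new (LIKE users INCLUDING ALL);
          ALTER TABLE users_new ADD COLUMN email varchar;
          INSERT INTO users_new SELECT *, NULL FROM users;
        SQL

        # The switch itself is fast, but still takes a brief exclusive lock.
        execute <<-SQL
          ALTER TABLE users RENAME TO users_old;
          ALTER TABLE users_new RENAME TO users;
        SQL
      end
    end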

If there are others researching/building tooling for this problem, I'd love to
hear from you.

[0] [http://github.com/quantumdb/quantumdb](http://github.com/quantumdb/quantumdb)

[1] [https://speakerdeck.com/michaeldejong/icse-17-zero-downtime-sql-database-schema-evolution-for-continuous-deployment-1](https://speakerdeck.com/michaeldejong/icse-17-zero-downtime-sql-database-schema-evolution-for-continuous-deployment-1)

[2] [http://blog.minicom.nl/blog/2015/04/03/revisiting-profiling-ddl-statements-mysqls-return/](http://blog.minicom.nl/blog/2015/04/03/revisiting-profiling-ddl-statements-mysqls-return/)

~~~
ewjordan
Significantly altering an active table without downtime is a problem that will
not, and should not, ever be solved by a library. It's too fraught with peril
and special-case fun for any competent devops person to leave it up to
external code.

~~~
nahname
The work that goes into such "external code" is of higher quality than most
companies can achieve in-house. The cost of developing a similar tool may also
be beyond what the company can afford to invest.

If your business is online schema changes, then inventing it yourself makes
sense. Otherwise you are likely throwing money away to create an inferior
product.

------
smoyer
We're using FlywayDB [0] and many of the ideas in Martin Fowler's
_Evolutionary DB_ article [1]. When database changes are breaking, we use a
pre-deploy and post-deploy "pair" to create an intermediate DB state that both
application versions will run against.

[0] [https://flywaydb.org/documentation/migrations](https://flywaydb.org/documentation/migrations)

[1] [https://www.martinfowler.com/articles/evodb.html](https://www.martinfowler.com/articles/evodb.html)

~~~
einrealist
Liquibase [0] is an alternative to FlywayDB. Unfortunately, FlywayDB has
changed the feature scope of its free version.

[0] [https://www.liquibase.org/](https://www.liquibase.org/)

~~~
smoyer
I haven't run into a feature that I need from either of the paid plans. What
features were removed that you find compelling?

As an aside, Liquibase effectively has a paid tier too - it's just marketed on
a different website than the OSS version
([http://www.datical.com/liquibase/](http://www.datical.com/liquibase/)).

~~~
einrealist
Dry runs and rollbacks (Flyway's "undos"). Rollbacks especially are useful
during development.

~~~
smoyer
I don't remember rollbacks ever being in the OSS project either. We always do
potentially destructive operations in two separate migrations, since you'll
never get data back once you delete it. We tend to favor soft-deletes anyway,
which makes life a lot easier but can bloat your DB.

------
danbruc
There is also anchor modeling [1], with quite a few publications [2] and even
an online modeling tool [3] to play around with. It is essentially a method
that yields a highly normalized bitemporal database model, so that every prior
database state, both schema and data, is a subset of the current database
state.

I have personally not been able to use it yet, but I really like the idea
behind it. It is probably not the ideal choice in general, due to its append-
only nature, which means the consumed storage space only ever grows, and due
to its high degree of normalization, which may or may not hurt performance
when implemented on top of a relational database, depending on the nature of
your queries.

But if you do not have to deal with excessive amounts of data, if you expect a
lot of schema evolution, or if you need traceability of all changes for
auditing purposes, then anchor modeling might be worth a shot.
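
As a toy sketch of the shape this takes (my own simplification, not taken from
the publications below): one anchor table carrying identity only, plus one
append-only, historized table per attribute.

    class CreateActorTables < ActiveRecord::Migration[5.1]
      def change
        # The anchor: nothing but identity.
        create_table :actor_anchor, primary_key: :actor_id do |t|
        end

        # One attribute table, historized: rows are appended, never
        # updated, so every prior state remains queryable.
        create_table :actor_name_attribute, id: false do |t|
          t.integer  :actor_id,   null: false
          t.string   :name,       null: false
          t.datetime :valid_from, null: false
        end
        add_index :actor_name_attribute, [:actor_id, :valid_from], unique: true
      end
    end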

[1] [https://en.wikipedia.org/wiki/Anchor_modeling](https://en.wikipedia.org/wiki/Anchor_modeling)

[2] [http://www.anchormodeling.com/?page_id=360](http://www.anchormodeling.com/?page_id=360)

[3] [http://www.anchormodeling.com/modeler/latest/](http://www.anchormodeling.com/modeler/latest/)

------
pjungwir
This looks like a helpful approach to a tricky topic! I'm curious how the
post-deploy migrations work when you run all your migrations at once from the
beginning, e.g. in CircleCI or when onboarding a new developer?

Also, do you do anything to protect yourself against migrations written at
time _t0_ and then run much later at time _tn_? I've seen a lot of problems
there when migrations use application code (e.g. ActiveRecord models) and that
code later changes. (My solution is to never call model code from migrations,
but there are other ways.) This isn't really specific to your article, I
guess, but does your approach make managing that harder? Easier?

Does having that details table help at all when you have migrations on a long-
lived branch that is merged after other migrations have been added? Or is the
solution there just to rebase the branch and test before merging it in?

EDIT: I think this blog post by the Google SRE folks is great reading for
people thinking about migrations and deployment:

[https://cloudplatform.googleblog.com/2017/03/reliable-releases-and-rollbacks-CRE-life-lessons.html](https://cloudplatform.googleblog.com/2017/03/reliable-releases-and-rollbacks-CRE-life-lessons.html)

I've never been comfortable with rolling back migrations in production, but
their plan changed my mind that it can be done safely. Is your approach
compatible with theirs?

------
jerkstate
This is cool, but I've started reducing my need for migrations by serializing
most of the structured pieces of a record into a text string (JSON, for
example), and only having separate columns for things that need to be joined
on (usually an id column) or selected on/indexed. I find this approach means I
need migrations far less often. If you find that you really do need to index
on a new field, adding a column is the easiest kind of migration to do.
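
A sketch of what such a table might look like (names hypothetical):

    class CreateOrders < ActiveRecord::Migration[5.1]
      def change
        create_table :orders do |t|
          t.integer :customer_id, index: true # joined on, so a real column
          t.string  :status,      index: true # selected on, so a real column
          t.text    :payload                  # everything else, as JSON
        end
      end
    end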

Of course, sometimes you still do need to migrate, which is where strategies
like this come in.

------
theptip
>Migrate database to new schema... spin up new instance

Sadly this is more of a pain in Django, since it does not write default values
to the database. To be able to safely run the migration first, one must
manually rewrite the SQL for the migration. [1]

[1]: [http://pankrat.github.io/2015/django-migrations-without-downtimes/](http://pankrat.github.io/2015/django-migrations-without-downtimes/)

------
stuff4ben
Sigh, I wish there were some magic sauce for this. As it stands, I'll have to
sit through 14 HOURS of downtime while my team upgrades our SonarQube DB to
6.7.

------
zie
Just put your database schema in a VCS and get on with life:
[http://www.liquibase.org/](http://www.liquibase.org/)

------
PunchTornado
Is it really 0 downtime as the title suggests, or minimal downtime?

~~~
theptip
If you do it right, it's zero downtime (i.e. no DB locks held for long enough
to cause timeouts to clients).

~~~
way0utwest
In most cases. However, some changes, such as adding a column and then needing
to manipulate data (populating nulls, transforms, etc.), can still cause
locking.

This really comes down to knowing the boundaries of what changes could cause
timeouts and working around them as best you can (one common workaround is
sketched below). However, some edge cases could still cause downtime.
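
One common way to stay inside those boundaries (a sketch, with hypothetical
names): add the column without a default, backfill it in small batches so no
single statement holds a long lock, and only then tighten the constraint.

    class AddAndBackfillUsersStatus < ActiveRecord::Migration[5.1]
      # Run outside one wrapping transaction so each batch commits on its
      # own and locks are held only briefly.
      disable_ddl_transaction!

      # A throwaway model so the migration doesn't depend on app code.
      class User < ActiveRecord::Base
        self.table_name = 'users'
      end

      def up
        add_column :users, :status, :string # no default: no table rewrite
        User.in_batches(of: 1_000) do |batch|
          batch.update_all(status: 'active') # each UPDATE locks few rows
        end
        change_column_null :users, :status, false
      end

      def down
        remove_column :users, :status
      end
    end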

------
xstartup
I wonder how this is done in DynamoDB or Mongo.

------
lndolent
I’d love to see an article about letting your users fucking deal with it,
because no one really needs this much uptime anymore.

~~~
stickfigure
This is probably not going to get a lot of love but it's really not a terrible
answer for many, many apps - especially anything business related where you
can assume that most users work single-country office hours.

This exact same compatibility problem exists at other boundaries of your app -
in particular, the web/backend interface.

