
Database versioning best practices - vkhorikov
http://enterprisecraftsmanship.com/2015/08/10/database-versioning-best-practices/
======
pjungwir
I agree. Rails database migrations get so many of these things right: every
change is a script, the database is versioned (via schema_migrations). I
really like that the database is the sum of a list of deltas. I've worked with
people who argued it'd be better if you could just write a declarative schema
description by hand (like schema.rb), and have Rails automatically adjust your
database to meet that description. But if you already have data you need to
migrate from the old to new structures (e.g. changing a 1-to-many to a many-
to-many), that isn't going to work.

I'm thrilled he mentioned that once you deploy a migration script, you can't
ever change it. I wrote the same thing here [1]. But I see it happen all the
time, and it guarantees that your schema gets out of sync with others'. If
I've already run that script, Rails doesn't know it needs to run it again.

One thing he left out is that you should write migrations so they work not
just on the current version of your code, but on future versions. I see this
cause problems on projects again and again. For instance, a feature branch
might carry a day-one migration containing `Permission.create!(name: "foo",
value: "bar")`; a week later the developer renames the Permission class to
Setting, and by the time the branch is merged into master, the migration is
broken. My solution for this is that a database migration should never depend
on the application code, but should only use direct SQL.

[1] [http://illuminatedcomputing.com/posts/2013/03/rules-for-
rail...](http://illuminatedcomputing.com/posts/2013/03/rules-for-rails-
migrations/)
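
To make that concrete, here's a minimal sketch of the style I mean (a Rails
migration using only direct SQL; the table and column names are assumed to
match the Permission example above):

    class AddFooPermission < ActiveRecord::Migration
      def up
        # Direct SQL: renaming the Permission class later can't break this.
        execute <<-SQL
          INSERT INTO permissions (name, value, created_at, updated_at)
          VALUES ('foo', 'bar', now(), now());
        SQL
      end

      def down
        execute "DELETE FROM permissions WHERE name = 'foo';"
      end
    end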

~~~
dperfect
I've arrived at a similar conclusion, with a few exceptions:

Never changing past migration scripts should be a guideline, not necessarily a
hard rule. With Rails, I generally prefer ActiveRecord over raw SQL in
migration scripts to keep things more readable (again, a guideline rather than
a hard rule). As you mentioned, changes in application code can sometimes
break past migrations, hence the need to occasionally alter a past migration
(e.g., placing parts of it in a conditional). With practice, this actually
encourages you to write relatively "future-proof" migrations, anticipating
possible future code changes and creating each migration with as few
assumptions as possible.
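
For example (a sketch; the table and column names are hypothetical), a guard
like this lets a migration run cleanly whether or not a parallel branch has
already made the same change:

    class AddTimezoneToUsers < ActiveRecord::Migration
      def up
        # Conditional guard: a no-op if the column already exists.
        unless column_exists?(:users, :timezone)
          add_column :users, :timezone, :string, default: "UTC"
        end
      end
    end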

At times, as in your example, a migration may alter data (not just structure),
and sometimes old migrations alter data in an obsolete way that isn't really
helpful anymore. Occasionally (especially if the migration in question is
time-consuming), I'll revise an older migration, but I always keep it around
(so the version still exists) and add comments to note what is being removed
and why.

One thing that really helps is regularly running all migrations in sequence.
My test suite routinely runs them all, and I never use schema.rb to load a
schema directly. Vagrant dev environments are always built by running all past
migrations. Doing that regularly helps ensure that the current database schema
is always the result of repeatable deltas.

One more thing: I avoid migration rollbacks whenever possible. I've seen them
cause more chaos than relief in already-stressful situations. In practice,
this means (1) making sure the codebase before and after a migration can
handle the database state before and after the migration, (2) writing
migrations that allow you to easily restore the database to a consistent state
if they fail midway through, and (3) whenever possible, fixing a migration and
redeploying rather than attempting to roll back changes. A rollback is, after
all, a migration itself, prone to its own set of potential bugs and failures.

~~~
sanderjd
One solution to the problem of ActiveRecord code in migrations going stale,
which I usually like more than either only using raw SQL (though often it's
the _right_ tool for the job) or editing migrations later, is to define a
version of the classes you want to use locally, inside the migration class.
This is usually as simple as, e.g.:

    class DoSomethingWithPermission < ActiveRecord::Migration
      class Permission < ActiveRecord::Base; end

      # ...
    end

It protects you from all sorts of tomfoolery – changes to the class name, its
relationships, its validations, etc. – and doesn't require any change in
style.

I also love the idea of migrations being forward-only, and deployed on their
own (i.e. without accompanying code changes). If you can pull it off, those
constraints give rise to some nice properties: schema changes must work with
existing deployed code, code changes must work with the existing deployed
schema, deploys can be done with zero downtime, botched schema changes can be
fixed without rolling back code changes, etc. It is harder to do than the
traditional style, requiring schema changes to be made in phases and often
involving triggers and/or views to keep things consistent, so its cost/benefit
isn't necessarily a clear win.
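
As a sketch of what "in phases" can look like for a column rename (Postgres
assumed, all names invented, and this is just one possible phasing):

    # Phase 1 ships alone: old code still writes `mail`; a trigger keeps
    # the new `email` column in sync in the meantime.
    class AddEmailToCustomers < ActiveRecord::Migration
      def up
        add_column :customers, :email, :string
        execute <<-SQL
          UPDATE customers SET email = mail;
          CREATE FUNCTION copy_mail_to_email() RETURNS trigger AS $$
          BEGIN
            NEW.email := NEW.mail;
            RETURN NEW;
          END;
          $$ LANGUAGE plpgsql;
          CREATE TRIGGER customers_email_sync
            BEFORE INSERT OR UPDATE ON customers
            FOR EACH ROW EXECUTE PROCEDURE copy_mail_to_email();
        SQL
      end
    end
    # Phase 2: deploy code that writes both columns (the trigger becomes a
    # no-op), then a migration dropping the trigger, then code that uses
    # only `email`, and finally a migration dropping `mail`.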

~~~
claar
Thanks for this -- defining a local copy of the class is brilliant!

------
autarch
I mostly agree with this but with one (and a half) modifications and one
addition.

Instead of storing a version as an integer, I strongly prefer _naming_ each
migration and storing the applied migrations in a database table.

Rather than relying on a single integer, I can simply write code that applies
all the migrations which haven't yet been applied. This makes it much easier
for the database to be modified in multiple concurrent branches without any
merge pain.

At work we follow this practice and name our migrations something like "1-add-
foo-table". If another dev makes a branch with a migration named "1-add-bar-
table" then there's no conflict.

We also store each migration as a directory rather than a single script. This
lets us split things up into multiple files if needed. We also allow both SQL
and Perl (our main language) migration scripts in the directory, which is
handy.

Finally, all migrations must be idempotent. In theory no migration should ever
be run against the same database twice, but making our migrations idempotent
is a little extra insurance.
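
In Rails/SQL terms (we're a Perl shop, but the idea translates; Postgres
flavor, names invented), idempotence mostly means guarding every statement:

    class AddFooTable < ActiveRecord::Migration
      def up
        # Each statement checks before acting, so a second run is a no-op.
        execute "CREATE TABLE IF NOT EXISTS foo (id serial PRIMARY KEY, name text)"
        execute <<-SQL
          INSERT INTO foo (name)
          SELECT 'seed'
          WHERE NOT EXISTS (SELECT 1 FROM foo WHERE name = 'seed');
        SQL
      end
    end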

For the curious, I wrote some Perl modules that help manage a system of
migrations like what I just described:

* [https://metacpan.org/release/Database-Migrator](https://metacpan.org/release/Database-Migrator)

* [https://metacpan.org/release/Database-Migrator-mysql](https://metacpan.org/release/Database-Migrator-mysql)

* [https://metacpan.org/release/Database-Migrator-Pg](https://metacpan.org/release/Database-Migrator-Pg)

The system is designed to be extensible so you can build on top of it, rather
than being a fully encapsulated tool.

There's also Sqitch ([http://sqitch.org/](http://sqitch.org/)). It's written
in Perl but is entirely language-agnostic to the best of my knowledge.

~~~
ars
> Finally, all migrations must be idempotent.

That must be tough. Are you doing it by simply checking an "I've already done
this" flag?

~~~
ansible
You'd have a database table whose contents record what's been done to the
database schema and the data migrations.

Then you make inserting a row named '3-add_customers_table' part of the
transaction for the migration. If that row already exists, the insert will
fail, aborting the entire transaction.

Or something like that...
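
Roughly, a sketch (assuming Postgres, where DDL is transactional, and a
hypothetical applied_migrations table with a primary key on name):

    conn = ActiveRecord::Base.connection
    conn.transaction do
      # A duplicate name violates the primary key, so a second run fails
      # here and everything below is rolled back, DDL included.
      conn.execute("INSERT INTO applied_migrations (name) VALUES ('3-add_customers_table')")
      conn.execute("CREATE TABLE customers (id serial PRIMARY KEY, name text)")
    end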

------
stevecalifornia
I had a project from about 6 years ago that used this method of DB versioning.
The project was quite large and took 4 years to build; it had a very large
database schema and required a lot of default data.

One day recently I pulled that project off the shelf and decided to see if I
could get it running for nostalgia's sake:

    SGS Server starting...
    Expected database version: 57.
    Found database version: 0.
    Applying DB change script: 1...success.
    Applying DB change script: 2...success.
    (...)
    Database updated.
    SGS Server running.

After seeing my old project set itself up automatically with no grief, I felt
an enormous amount of pride. I only wish my day job would implement a similar
system.

------
Nexialist
This can be a major pain for larger software projects. At work we have to
contend with:

* A database so large that even a minimal, empty one cannot be created from scratch in less than 10-15 minutes. This creates problems for CI, integration testing, etc.

* The developers spent a good chunk of the late 90s/2000s writing Oracle PL/SQL code; hundreds of packages and thousands of stored procedures with oodles of business logic.

* We store reports, PDF attachments, and other documents all in the DB as well.

* Since we put so much stuff on the database, small problems and schema fudges tend to creep in over the years, which makes every customer database a little bit different.

* Oracle licensing can be very unkind, and upper management mandates that we can't use Oracle XE even in development/testing.

We ended up using a combination of Flyway for schema changes, hand-rolled
scripts to apply stored procedures and packages, and we had to roll a database
provisioning pool as-a-service for developers, and it's still a massively
janky and fragile setup. We really need better tooling for this.

~~~
jpollock
Last I looked, Oracle was free for development and test use. Is that no longer
the case?

~~~
richmarr
You see, the problem there is that development and test are gateway
environments. They frequently lead to other environments, such as staging and
production, at which point you find yourself dependent on Oracle's latest
stuff, and it can be time-consuming and expensive to get yourself clean.

Next time you find yourself Jonesing for the latest features and someone
offers you dev & test for free, just say no.

------
slantyyz
Liquibase is pretty good for managing database versions

[http://www.liquibase.org/](http://www.liquibase.org/)

~~~
anton_gogolev
Oh, good old "programming in XML".

~~~
spand
"Supports XML, YAML, JSON and SQL formats" \- from the front page of
[http://www.liquibase.org/](http://www.liquibase.org/). What more do you want
?

~~~
anton_gogolev
As if programming in YAML or JSON is any better. And programming in
platform-specific SQL, for all non-trivial purposes, ties you to a specific
flavor of DBMS.

What I want is a platform-agnostic DSL: terse, typechecked, possibly
"compiled" to some IR, with basic semantic analysis and with support for (or
at least awareness of) the quirks of the specific underlying platform.

------
benmos
I think this post only gives half of the story. You need to manage not just
your schema (and reference data) but also your DB code - i.e. stored
procedures, etc.

For schema, the approach recommended by the OP seems sensible - check diff
scripts into VC and consider them immutable (see also
[http://www.depesz.com/2010/08/22/versioning/](http://www.depesz.com/2010/08/22/versioning/)
for a lightweight Postgres approach).

For stored procedures etc., I think you need an approach much more akin to
normal code - you want to be able to leverage your VCS just as you do with
normal code - so you want to consider them mutable.

One of the best presentations I've seen on this topic is:

[http://www.slideshare.net/OleksiiKliukin/pgconf-
us-2015-alte...](http://www.slideshare.net/OleksiiKliukin/pgconf-
us-2015-alter-database-add-more-sanity)

~~~
eterm
Stored procedures are usually considered part of the schema. They can be
diffed. The Visual Studio comparison tool, for example, can diff databases and
also generates code to check affected stored procedures for errors after a
table schema change.

I would still recommend changing stored procedures through update scripts and
keeping those change scripts immutable. (Ideally not just immutable but also
idempotent, with forward and backward scripts to undo the change.)

~~~
benmos
Curious why you'd recommend that. It seems to me that it costs you something
(convenience of standard VC practice on code) and gains you little.

I'd recommend checking out the linked presentation above - for things like
stored procs you have the option of installing multiple versions
simultaneously (e.g. under different names or 'Schemas') - obviously you can't
do that for the main schema itself. This is why I think a hybrid approach
makes more sense.
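
A sketch of the multiple-versions idea in Postgres terms (names invented; the
linked slides go into much more detail):

    # Applied by a deploy script rather than a migration, so it can be
    # re-applied freely as the function evolves under version control.
    ActiveRecord::Base.connection.execute(<<-SQL)
      CREATE SCHEMA IF NOT EXISTS api_v2;
      CREATE OR REPLACE FUNCTION api_v2.customer_count() RETURNS bigint AS $$
        SELECT count(*) FROM customers;
      $$ LANGUAGE sql;
      -- clients opt in per session: SET search_path = api_v2, public;
    SQL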

~~~
eterm
It's something I've seen work in a few places, but I'll take your
recommendation and review the slides for a better approach. You're right that
this approach loses some of the benefits of version control, but then stored
procedure changes are typically very closely linked to table schema changes
anyway.

------
anton_gogolev
Back in the day, I wrote my very own database migration toolkit [1] with its
special DSL and cool things like automatic rollback script generation and
(limited) type inference. It never gained any traction, though. But I still
use it in every project with great success.

[1]: [https://code.google.com/p/octalforty-
wizardby/](https://code.google.com/p/octalforty-wizardby/)

------
azov
_> Don’t use complex notations like “x.y.z” for the version number, just use a
single integer._

I prefer to use two integers: major & minor version, semver style. Minor
version gets incremented if old code can still use new DB (e.g. when adding a
column). Major gets incremented when the change is not backwards-compatible
(e.g. when renaming a column).

When you have to roll back a release, not having to downgrade the DB can be a
major time saver.
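
A sketch of the startup check this enables (the helper and version numbers are
invented):

    # The app was built against DB version 3.2.
    APP_MAJOR, APP_MINOR = 3, 2
    db_major, db_minor = read_db_version  # hypothetical: reads a version table
    if db_major != APP_MAJOR
      abort "incompatible DB schema (#{db_major}.#{db_minor})"
    elsif db_minor < APP_MINOR
      abort "DB schema too old, run migrations"
    end
    # Same major and db_minor >= APP_MINOR is fine, which is exactly what
    # lets you roll back the app without downgrading the DB after a
    # backwards-compatible (minor) schema change.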

~~~
agopaul
Why not timestamps + description? This way, even if you create migrations on
different branches, you don't have version number overlap. Rails does that if
I remember correctly.

e.g.:

201509102039-add-this-to-table.sql

201510220840-add-that-to-table.sql

~~~
azov
Because version numbers can be handled automatically, descriptions can't.

Say you have app v.20150701 running in production. Today you deploy a new
version and it runs a migration script against the production DB. Two days
later you discover a critical issue in the new version and are forced to roll
back to the old app. Now your 20150701 app opens the database and notices that
it has the 20150811-oh-we-did-something-to-the-db.sql migration applied. Can
it use the database, or should it fail/downgrade the DB? How do you tell, if
all you have is a timestamp and a description?

IMO this is more important than solving branch merges because it helps me when
my production system is down and there's pressure to bring it back online
ASAP. Merging branches can be done offline in the cozy comfort of my
development environment.

PS. Of course, the above scenario might be more or less relevant depending on
your development and deployment workflows.

~~~
agopaul
Interesting point, though I've never thought about this situation because I've
never worked with software versioning. If I need to roll back, I have to check
which migrations are present in the SCM repository at the specific revision
I'm rolling back to.

------
Aqueous
For our Scala web applications this is pretty much a solved problem, since
Play Framework keeps track of database versions in an 'evolutions' table and
runs evolutions in order to bring the database up to date. We've also modelled
our test snapshot runner after the evolutions format, so each test fixture
contains queries for both setup and teardown, on top of the empty vanilla
database described by the evolutions, of course.

The trouble we encounter is extending outside software to meet our needs,
specifically WordPress. WordPress is a better-than-average CMS, but like many
CMSes it is very developer-unfriendly. I need to be able to snapshot the
database from a single WordPress installation and version it along with an
installation of the several WordPress plug-ins I am developing for a specific
client. This way I can keep a repository reflecting the current state of a
single site (with a specific set of plug-ins activated, and specific
configurations for each), an exact copy of which I eventually want to deploy
in a production environment. I've seen plug-ins for this, but in at least one
specific case (VersionPress) it seems to be available only through an overly
expensive 'Early Access Program' - and it's been in Early Access for many
months now, which makes it seem vaporware-ish. Does anyone know of a better,
automated solution, other than dumping the MySQL database every time I make a
change?

~~~
thoman23
I just really wish the Play Framework evolutions feature allowed you to give a
logical name to your scripts. It would be nice to be able to tell at a glance
what the high-level purpose of each script is.

~~~
muyfine
+1. This bugged me about the excellent Schema Evolution Manager. Ended up
porting it and extending it to take an optional name to add after the date.

------
austinhyde
Sure, I'll throw my hat in the ring here with a shameless self-plug.

I help maintain a database versioning/migration tool that my team has been
successfully using for years now:
[https://github.com/nkiraly/DBSteward](https://github.com/nkiraly/DBSteward)

The idea is that instead of managing a schema plus several migrations, you
just store the current schema in XML, then generate a single migration script
between any two versions of the database (or just build the whole thing from
scratch). The ideal use case during deployment is to check out the existing
deployed schema into a different directory, then diff against the current one
and apply the upgrade script.

I've always found most migration solutions wanting, and while this approach
has its downsides (renames can be hairy, and support for RDBMS features has to
be built in explicitly), I do like it a lot more than the standard
sum-of-changes approach.

The tool is still very much a work in progress, written in terrible
10-year-old PHP and desperately in need of a proper rewrite and modernization,
but it is definitely stable, safe, and production-ready.

------
axelfontaine
Flyway does exactly this and enjoys widespread adoption
[http://flywaydb.org/](http://flywaydb.org/)

~~~
endymi0n
Another happy user of Flyway here; it essentially forces all of the described
best practices onto your DB. If only it supported transactionless migrations,
we'd be totally happy campers :)

~~~
axelfontaine
Coming very soon :-)

------
pendext
We use [http://www.dbmaintain.org](http://www.dbmaintain.org) at the Java shop
where I work. It enforces all of the best practices in the article.

Something else we do is keep the schema/mandatory data in a separate git
project. This has the added benefit of producing a deployable artifact
separate from the application itself.

------
u02sgb
We're using Delphix (off Oracle) for our "every developer should have a
database". For anyone not familiar with Delphix, it provides individual DB
instances that are "forked" from the source DB. I'm assuming it's some kind of
copy-on-write.

So far it's working OK, although we've not automated snapshot/update/refresh
off the master yet. The master is planned to be periodically refreshed from
ProdCopy.

Would be interested if anyone else is using Delphix and can comment or link to
anything on it. I've been meaning to poke around the REST interface but
haven't had time yet.

~~~
shortimer
We're using Delphix too. It complements an approach like this; it doesn't
replace it. We use Liquibase to support the schema-as-code concept; it's
enforced by allowing no one outside senior DBAs to make such changes directly
in anything other than dev (lower-than-QA) environments.

Delphix is terrific for having production-like data in your test environments
("like," because you'd better be masking sensitive information from prod
before it gets written in test).

The issue we've had with Delphix is performance. It absolutely pounds the
storage system and tends to suck up all the bandwidth and CPU made available
to it. If you resource it properly, it's amazing.

------
gghh
It's all well and good, but there's an additional problem the article doesn't
mention: how do you handle "exceptional" (unplanned) cases in your database
version control? Sometimes you discover things don't work as you expected, or
there's a last-minute schema change you want to promote quickly to the
production DB. I find it hard to cover these situations by modeling changes as
forward/backward migrations (database schema diffs, like Rails does).

At my workplace we operate a largish web service in our own datacenters
(airline reservations). We're very happy with the way we version code: we have
one single branch where changes arrive around the clock; every second day this
branch is deployed to a "pre-production" platform. Every week we take one of
those versions, fork a release branch from it, and make all the necessary
adjustments, i.e. a bunch of cherrypicks (changes that arrived late but are
needed quickly) and possibly backouts (we found a bug and have no time for a
fix):

    a--b--c--d--e   <--- what's in pre-prod
           \
            b*--e'  <--- what I want in prod

    b* = backout, e' = cherrypick

Now, I really, really would like to do the same for our DDL (database schema
changes). But backouts and cherrypicks just don't work if you're versioning
migrations, because you have to deal with what's already in the prod DB
(conversely, when you deploy software you just trash the old binary altogether
and replace it with the new one):

Backouts, like b* above: you would hand your DBA the sequence of patches
a, b, c, b*, e', which makes no sense because you're asking to apply migration
b and its reverse b* (you should be able to say "just skip b", but version
control doesn't work like that).

Cherrypicks, like e' above: good luck remembering, two weeks from now, that
you've already applied e' and don't need to apply e.

At the end of the day, we -do- version our changes with forward/backward
migrations, but we always end up managing backouts and cherrypicks by sticking
notes on the wall (or on the team wiki), which smells bad. I admit I've never
tried any of the popular solutions for database version control (Liquibase,
Flyway) and I don't know how they work. But I feel the ideal solution would be
to version the whole schema (not schema diffs) and have tools that can, at any
given moment, compare the schema in my VCS with what is actually on the
platform and generate the necessary "migration" on the fly.

EDIT: formatting

------
rburhum
Django's South does this.
[http://south.readthedocs.org/en/latest/tutorial/part1.html](http://south.readthedocs.org/en/latest/tutorial/part1.html)
And now, the next version of that is built into Django core.
[https://docs.djangoproject.com/en/1.8/topics/migrations/](https://docs.djangoproject.com/en/1.8/topics/migrations/)

------
jmickey
What about versioning data? For most projects we have SQL tables of
preconfigured data that is essential for the project to work (predefined
templates, initial configuration settings). How do you go about producing an
SQL difference script for your data, getting it under version control, and
then applying it at the push of a button?

Currently we are using the Toad for SQL tool to produce and apply both schema
and data diffs between our development environments, but it's a Windows-only
tool.

~~~
IanCal
Two things that may be relevant that I've done before:

1. The simplest: set up scripts for doing full backups and restores to/from
S3. This is helpful for many reasons, but it should also help your case.
Diffs are an optimisation for when your data is large enough, but I think many
cases are easily small enough to just push and pull as a whole. Until you're
at a significant number of gigs of data, I'd recommend trying this first.

2. Record your data in an append-only format (this doesn't require a
different database, just a strict way of doing things). This means you can
always get a full history of any bit of your data. I do this for some scraping
work (recording lots of noisy/unreliable values, so not losing old versions is
important). If your data is stored like this, pushing around diffs should be
easy (grab everything with a created_at time after a certain point).
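
A minimal sketch of that delta pull (table and column names invented):

    conn = ActiveRecord::Base.connection
    last_sync = "2015-09-01 00:00:00"
    # Append-only means this is the complete diff: no updates or deletes
    # to reconcile, just rows created since the last sync point.
    delta = conn.exec_query(
      "SELECT * FROM readings WHERE created_at > #{conn.quote(last_sync)}"
    )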

------
Joeri
We're using this approach with a large team and DB, but we're struggling: with
multiple major version branches in play at the same time, backporting
functionality and bugfixes breaks things by changing the order in which
changes get applied. At some point the problem remains that you're migrating
the data in a live system, and that means the order of changes matters.

------
feld
Archiveopteryx (aox.org) handles database versioning in a very clean way. You
can actually upgrade and downgrade your application, and everything required
to change your schema and dataset is contained within Postgres.

[http://archiveopteryx.org/aox/upgradeschema](http://archiveopteryx.org/aox/upgradeschema)

------
jgrowl
I would like to add that it should be easy to blow away your db and recreate
the structure with enough sample data so that the application works in a
dev/test environment without having to pull in a db dump from some other
machine. Depending on dumps drives me crazy.

------
languagehacker
This touches on some very common issues with migrations. It's interesting to
see that, after all this time, handling database schema changes effectively
can still be a challenge. I wrote about my experience with this topic several
years back; some of these observations might still be useful to folks just
digging into the problem:
[http://robertelwell.info/blog/future-proof-db-changelog/](http://robertelwell.info/blog/future-proof-db-changelog/)

------
zkhalique
Wow, I came in wondering what we would learn. It turns out we already do all
of these points in our platform!

[http://platform.qbix.com/guide/database](http://platform.qbix.com/guide/database)

[http://platform.qbix.com/guide/scripts](http://platform.qbix.com/guide/scripts)

We use the actual database schema as the primary source of truth and generate
the model code from it. We have scripts that upgrade the database schema for
each plugin and app.

------
agopaul
I have a similar approach, but with timestamps instead of an incremental int.

I'd use a tool for this, but then I would need to change how I deploy apps,
given that it's usually best to run the migrations first and then update the
codebase, to prevent exceptions.

When there are new columns, the best sequence is: run the migrations (creating
any new columns as nullable), deploy the code, fill the new columns via SQL or
code, and, if the new columns must be NOT NULL, change the table definition to
make them NOT NULL.

Given the complexity, I find it better to do this by hand.
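
That sequence, sketched as two migrations with a code deploy in between (the
names are invented; I do it by hand, but the shape is the same):

    class AddStatusToOrders < ActiveRecord::Migration
      def up
        add_column :orders, :status, :string  # nullable, old code keeps working
      end
    end

    # ...deploy code that writes :status, backfill old rows, then:

    class MakeOrderStatusNotNull < ActiveRecord::Migration
      def up
        execute "UPDATE orders SET status = 'unknown' WHERE status IS NULL"
        change_column_null :orders, :status, false
      end
    end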

~~~
sanderjd
Yeah, timestamps seem to work much better, especially with distributed version
control; otherwise you have to coordinate who gets the next number in the
sequence. There's still a race condition between two branches changing the
schema in incompatible ways that nonetheless merge cleanly, but in my
experience code review, tests, and developers just running migrations
day-to-day mitigate that issue.

------
stephenmm
I am sort of new to this stuff, but I believe Django essentially does this
correctly as long as you have your site under version control. Having a DB
instance for each developer seems a little bit tricky to me, though that may
just be my inexperience.

------
k__
What do you think about versioning with an ORM?

Doesn't this imply automatic versioning along with the rest of the code?

For example, I use Sequelize and create my tables via model classes, which I
write like any other code in my app.

~~~
haukur
That works until a method you reference in the migration changes or no longer
exists. For instance, a model might no longer expose some of the methods the
migration uses, or a library used in the migration has been modified.
Obviously this is something people should try to avoid, but at a past project
I once found myself having to check out specific commits in the history just
to be able to run migrations. In the end I repeated the process once too often
and created a gem for it:
[https://github.com/hph/farfugl](https://github.com/hph/farfugl). It's for
Rails migrations, but the concept is the same in any library or language: you
just have to go to the commit where the migration originates.

As a side note, I've been using Sequelize for a year now and it can often be
pretty annoying. Both the query interface and the migrations are subpar when
compared to Rails. For instance:

    Model.find({ id: 1 }) // or maybe just Model.find(1)!

vs. (the Sequelize syntax):

    Model.find({ where: { id: 1 } })

Forgetting the "where" above doesn't even raise an error, it just silently
fails.

~~~
k__
For IDs you can use:

    Model.find(1)

------
sinatra
"every SQL script file must be immutable after it is deployed to production or
staging environment."

git doesn't support locking files, so what is the best way to meet this
immutability requirement?

~~~
joshka
A baseball bat

------
muyfine
Schema Evolution Manager is excellent for managing Postgres databases in a
language-, framework-agnostic manner!

------
WorldWideWayne
SSDT is fantastic for this. I wish it worked for something other than SQL
Server though!

~~~
pc86
Well I mean it has "SQL Server" in the name, so..

