

Database Versioning: The ignored aspect of version control - jlemoine
http://www.nimkar.net/index.php/9-release-management/3-database-versioning-the-ignored-aspect-of-version-control

======
danbruc
While all of this sounds nice in theory, in practice it just fails. If you are
only concerned with empty databases during development, there is not much to be
gained by keeping deltas of different database schema versions - just having
one evolving database create script will do the job. Then you just grab the
version belonging to your application version and you are good to go. Once you
have a non-empty database and have to provide data migration scripts,
everything becomes a one-way street in all but the most trivial cases, because
updates are destructive.

My favorite toy example is firstname and lastname. If you decide to merge
them into a single name column (without a separator between firstname and
lastname), you cannot automatically go back, because there is ambiguity
whenever either firstname or lastname is missing, or when at least one of them
consists of two or more words. If you do the opposite - split the
name into firstname and lastname - and somehow manage to come up with a
satisfying heuristic for how to perform the split, it is easy to go back by
just merging firstname and lastname again. But then going forward again will
not return you to the initial state in all the cases where your heuristic fails.
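
A minimal SQL sketch of the forward step (table and column names are
assumptions for illustration):

    -- Forward: merge into a single column.
    ALTER TABLE users ADD COLUMN name TEXT;
    UPDATE users SET name =
      TRIM(COALESCE(firstname, '') || ' ' || COALESCE(lastname, ''));
    ALTER TABLE users DROP COLUMN firstname;
    ALTER TABLE users DROP COLUMN lastname;
    -- No automatic reverse exists: 'Anna Maria Smith' could have been
    -- ('Anna Maria', 'Smith') or ('Anna', 'Maria Smith'), and a single
    -- word could have been a lone firstname or a lone lastname.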

Currently I think the way to go may be a highly normalized database
schema - like 5th or 6th normal form - with only non-destructive schema
evolution. I am not yet sure how this will turn out in practice, both in terms
of the limitations it imposes on schema evolution and how bad it may become
because of abandoned database schema artifacts and the data they contain.
Anyway, I think it looks promising, and after doing some more research,
thinking and prototyping I will most likely give it a try in an upcoming
project.

~~~
fendale
I agree that rollbacks are hard, but if you think them through, there is
usually a way.

For your example, to merge first and last name, you would create a backup
first, i.e. create a table that contains the current version of first and last
name, and if you need to go back, you restore. Now, if you have an application
that cannot have any downtime during DB upgrades, you could be in trouble -
after your backup, more records could come into the original table. Probably
what you need to do is modify the application to start inserting both
full_name and first_name/last_name, and then gradually move all the old
data over to the new format - i.e. a gradual migration. Every application and
example is different, but the problem is usually solvable, often with some
amount of effort.
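
A rough sketch of the backup-then-restore idea (table names and the
PostgreSQL-style UPDATE ... FROM syntax are assumptions):

    -- Before the merge: keep a copy of the split names.
    CREATE TABLE users_name_backup AS
      SELECT id, firstname, lastname FROM users;
    -- ...apply the merge migration...
    -- Rollback (after re-adding the firstname/lastname columns):
    UPDATE users u
       SET firstname = b.firstname, lastname = b.lastname
      FROM users_name_backup b
     WHERE u.id = b.id;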

The master schema create script is an idea I've seen in the past, and it works
ok-ish up to a point. The problem is that you need to update the master script
and also produce scripts to migrate production, so it doubles the effort.
Among enterprise database developers, I've seen plenty of development
approaches that boil down to:

* Create all changes on Dev using GUI tools.
* Use TOAD to diff against production and produce a delta script.
* Apply the delta to prod.

It gets the job done, mostly, so long as someone doesn't drop the dev
database, which is the only place with the new code - I've seen it happen. Then
you have to go and beg the DBA to do a full restore just to get the release
back on track!

~~~
herge
I think you misunderstood what the OP was talking about. Have you ever looked
at migration systems like South for Django or the migrations in Ruby on Rails?

He was saying that the reverse of a migration is nigh impossible, because what
happens if you add a new user after applying that firstname + lastname
migration, for whom you don't have the split name?

~~~
couradical
That's why you apply the approach he was stating - you move slowly:

Step 1. Add a name column and populate it from fn/ln.
Step 2. Change the code to point to the name column while still populating fn/ln.
Step 3. Remove the code pointing to fn/ln.
Step 4. Remove the fn/ln columns.
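
A minimal SQL sketch of the schema half of these steps (table and column
names assumed):

    -- Step 1: add the new column and backfill it; non-destructive.
    ALTER TABLE users ADD COLUMN name TEXT;
    UPDATE users SET name = TRIM(fn || ' ' || ln);
    -- Steps 2 and 3 are application releases, not schema changes.
    -- Step 4: only once the name-only code is validated in production.
    ALTER TABLE users DROP COLUMN fn;
    ALTER TABLE users DROP COLUMN ln;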

No loss of data, and instant reversion at any point, because each step is non-
destructive. It's just longer and harder to do in reality, but definitely the
"right" way to do it if you are 100% concerned with data validity.

~~~
stonemetal
At step 3 or later, reversion becomes impossible. A new record can be added to
the DB without a first name and last name, making rollback impossible.

Also, how is step 4 non-destructive?

~~~
couradical
But at step 3, you've already validated the code change. You know that the
name-only code works as intended, and you're just cleaning up old code at that
point. I do agree that there is a potential for failure there, but I'd think
it to be reasonably mitigated compared to making the change and then needing
to roll back.

Step 4 is non-destructive from a data perspective, as those columns would be
"zombie" data: they'd exist, but nothing would be referencing them at that
point, and the data would simply exist in a different form.

If you're talking about representing the data differently, yes, that would
happen, and in a case like the above you'd be much better off preserving
normalization and using a view to combine into a "name" entity.
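
For instance, a sketch of such a view (names assumed):

    -- Keep fn/ln normalized; expose the combined form read-only.
    CREATE VIEW users_with_name AS
      SELECT id, fn, ln, TRIM(fn || ' ' || ln) AS name
        FROM users;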

------
pornel
Schema rollback doesn't really work, because once you have data in the
database, DROP TABLE is not the inverse of CREATE TABLE (going back and forth
deletes the data).

I've never been in a situation where losing production data is acceptable.
I've stopped bothering with rollback statements that nobody ever cares to test
and nobody will ever use in production. Schema changes are forward-only.
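
A tiny illustration of the asymmetry (table name assumed):

    CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT);  -- forward
    INSERT INTO events VALUES (1, 'row written after the deploy');
    DROP TABLE events;  -- the "rollback": schema restored, data gone forever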

~~~
toxicFork
Assuming you have lots of storage space, instead of dropping a table you could
just copy it across to an archive. The same goes for dropped fields' previous
values, and for created fields' current values in the case of a rollback. But
then you will also need the logic to create those fields for new rows.
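
A sketch of the copy-to-archive idea (table names assumed; CREATE TABLE AS
syntax varies by database):

    -- Instead of a plain DROP, keep a full copy first.
    CREATE TABLE users_archive_v3 AS SELECT * FROM users;
    DROP TABLE users;
    -- A rollback then recreates the table from the archive copy:
    -- CREATE TABLE users AS SELECT * FROM users_archive_v3;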

It's just easier to have forwards-only changes.

------
flurdy
I have to admit I use simpler tools such as DbDeploy, Flyway or Play!'s
evolutions. They do the trick for smaller databases, but are difficult to
retrofit, as the OP mentions.

A great tip I once received was to split destructive migrations across
releases. I.e. if a column is to be removed: first release code that doesn't
use that column; if all is okay, rename the column in the next release's db
migration; and if still okay, actually delete the column in a later release.
If this was all done in one release and didn't work, your rollback would be
difficult.
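
A sketch of that release sequence for one column (names assumed; RENAME
COLUMN syntax varies by database):

    -- Release N: ship code that no longer uses the column.
    -- Release N+1: rename rather than drop; rollback is renaming it back.
    ALTER TABLE orders RENAME COLUMN legacy_code TO legacy_code_unused;
    -- Release N+2: actually delete it.
    ALTER TABLE orders DROP COLUMN legacy_code_unused;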

Another defensive action, e.g. when deleting rows of data, is to copy the
affected data to a temporary table before deleting it from the original
table(s), and only delete the temporary table in a later release, once it is
proven that you did not need to roll back.
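
For example (the table and predicate are assumptions):

    -- Copy the affected rows aside before deleting them.
    CREATE TABLE orders_deleted_r42 AS
      SELECT * FROM orders WHERE status = 'obsolete';
    DELETE FROM orders WHERE status = 'obsolete';
    -- Drop orders_deleted_r42 in a later release, once rollback is ruled out.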

And if for some reason you don't have any automated database versioning, at
least create a versions table to manually keep track of changes and dates, for
some sort of sane auditing.
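
A minimal sketch of such a table (this schema is an assumption, not a
standard):

    CREATE TABLE schema_versions (
      version     INTEGER PRIMARY KEY,
      description TEXT NOT NULL,
      applied_at  TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
      applied_by  TEXT
    );
    INSERT INTO schema_versions (version, description, applied_by)
    VALUES (42, 'rename orders.legacy_code before dropping it', 'dba');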

------
fendale
This is an area I have experimented with in large enterprise environments,
usually with Oracle databases.

In one job I created a process where we used Rails-style migrations with some
Ruby code I put together. As well as schema migrations, I had it handle
applying stored procedures to the database too.

Since then I created a little application called dbgeni (database generic
installer) and have been using it successfully to create Dev and Test
environments in another job. It changes the schema with migrations and applies
changed stored procedures too.

One difference between dbgeni and other tools is that dbgeni uses plain SQL
files and applies them using the database CLI - i.e. sqlplus for Oracle, the
mysql command for MySQL and isql for Sybase. This means that if you take the
tool away, the SQL scripts are still useful - that allows me to use the tool
for applying changes to dev and test environments, but since the production
DBAs don't want to use it, I can just bundle up the same scripts and pass them
over.

The project is on github
[https://github.com/sodonnel/dbgeni](https://github.com/sodonnel/dbgeni) and
it installs as a Ruby gem. I have a supporting website
[http://dbgeni.appsintheopen.com](http://dbgeni.appsintheopen.com) with more
details.

------
tiquorsj
Everyone who does DB updates should understand idempotence and how to undo
their changes. I like the approach the author provides, but calling it version
control is inaccurate. This is change management, and it's a completely
different discipline with its own body of knowledge.

~~~
bsaul
I fail to understand how an undo script would work for big schema
modifications (changing 1-n to n-n or adding intermediate tables). To me, the
safest way is definitely a backup/restore if things go wrong (and you should
definitely have backups anyway), but I'm not an expert. Could you elaborate?

Also, the post mentions ORMs as a solution. Well, clearly I would never want
to let anything automatically generate and perform migration scripts, apart
from simple "new nullable column" types of change.

~~~
zorlem
You do it by breaking up the potentially destructive changes so you can roll
back your deployment to the previous working version. Typically all
destructive modification of the schema is done in several passes, e.g. one per
release/deploy:

1. Copy the data into a new table and add code to write it to both tables.
2. Change the code to reference only the new data, but still store it in both tables.
3. Rename the old table.
4. Remove the old table.
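
A sketch of the schema side of those passes (table names assumed):

    -- Pass 1 (release N): copy into the new table; the app writes to both.
    CREATE TABLE users_v2 AS SELECT * FROM users;
    -- Pass 2 (release N+1) is code-only: read from users_v2, still write both.
    -- Pass 3 (release N+2): rename the old table; rollback is renaming it back.
    ALTER TABLE users RENAME TO users_old;
    -- Pass 4 (release N+3): remove it once nothing can roll back past pass 3.
    DROP TABLE users_old;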

On top of that, you have changes that are supposed to be applied before the
new code is deployed (e.g. creating a table, renaming a table) and changes
applied after the code is deployed (e.g. dropping unused columns).

As somebody else mentions in the comments, this is more about Change Control
than Versioning, but the two go together to ensure that you have as many
idempotent and reversible changes as possible. Of course, if you have the
resources you can always adopt a "never delete data" policy, so you're always
copying instead of modifying the data, making sure you have a way to retract
the changes. Clean-up can be done at regular, distant intervals.

Using an ORM doesn't prevent you from reviewing the changes that are about to
be applied. ActiveRecord, for example, can provide you with a SQL file
representing your schema after the ORM is done. Usually the ORM schema changes
are executed automatically by your CI upon deployment on Dev, QA and possibly
Staging environments (you do have at least two of those, don't you? :), while
on production they need manual intervention (maybe to control performance
degradation of your live env - locks, increased disk I/O, etc.). This way you
get plenty of testing of any DB changes, and you should be able to spot any
problems as early as possible.

------
ibotty
Every article about db version control should mention sqitch. It's a great
tool that works similarly to, and in conjunction with, git.

[http://sqitch.org/](http://sqitch.org/)

~~~
chrisfarms
sqitch is the _only_ approach to version controlled migrations I've found that
actually works long term. I wish more people would invest time into learning
it so that it would become more commonplace.

------
emmelaich
Here you go. It's easy with a bit of Category Theory. Just implement something
along the lines of David Spivak's "On The Relational Foundations Of Functorial
Data Migration".

    
    
    http://arxiv.org/abs/1212.5303
    

Original motivation, from v1 of the paper:

    
    
       I asked whether the difficulty in changing data types to
       allow for longer integers was simply a storage issue or 
       something more fundamental. They said that it was more 
       fundamental, and in fact that my question showed how
       little I knew about the horrors of using databases in
       the real world!

------
yoodenvranx
> Every script must start with a integer which must be the next available
> integer.

Based on experience, you should not do 01_script_a.py, 02_script_b.py, ...,
but rather 010_, 020_, 030_, ... This way you can squeeze an initially
unaccounted-for script into the correct location without renaming the others.

~~~
Negitivefrags
It's like BASIC all over again.

~~~
anon4
Ahaha! Glad I'm not the only one who thought that!

------
catmanjan
Microsoft's MVC database migration system has worked well for me.

The package manager handles generating the "code behind" the database changes
(which is committed to the VCS), and you can revert and update as need be.

~~~
qntmfred
for .NET work I prefer
[https://github.com/schambers/fluentmigrator/wiki](https://github.com/schambers/fluentmigrator/wiki)

------
sethreno
I prefer to think of db version control and db migrations as two separate
things.

Version control is achieved by keeping the SQL scripts to create the database
in your VCS, along with the code that depends on it.

Migration scripts are created later, as a pre-deployment step. I prefer not to
store these scripts in VCS because, in my experience, they are typically run
once and never used again. There are several tools that can generate the
migration scripts by comparing two schemas, e.g. dev-->prod.

I wrote an open source command line tool that can be used to create VCS-
friendly create scripts from a SQL Server database. It's on github at
[https://github.com/sethreno/schemazen#schema-zen---script-and-create-sql-server-objects-quickly](https://github.com/sethreno/schemazen#schema-zen---script-and-create-sql-server-objects-quickly)

Also, here are a few tools that I've used to generate migration scripts:

* [http://opendbiff.codeplex.com/](http://opendbiff.codeplex.com/)
* [http://www.sqldbtools.com/Tools.aspx?ProductId=1](http://www.sqldbtools.com/Tools.aspx?ProductId=1)
* [http://www.red-gate.com/products/sql-development/sql-compare/](http://www.red-gate.com/products/sql-development/sql-compare/)

------
binarymax
As others have said and the author mentioned with the 3Tb example - it is the
data that is the tricky part. Rather, the tricky part is _state_. Code (and
schema) versioning is fundamentally easy because it is internally stateless.
Once you add external state to the mix all bets are off. I really wish there
was a silver bullet but there is not. If you can solve the issue of versioned
persistence of large amounts of state then you will solve a lot of problems in
computer science.

~~~
aragot
Exactly: stateless!

It's sad that we still can't dump a database to a single, canonical code file.

Maybe one way would be for the deployment script to say "we're missing columns
X and Y" by comparing a description of the db to the actual one. If you deploy
an older version, you still get the message, but you can choose not to apply
the change if it's destructive.

------
sandGorgon
Does anyone know what the production-worthy mechanism is for working with
database versioning in Django, Node and Go?

It always feels strange to me (coming from Rails) that other frameworks treat
versioning as optional - I mean, Django is only now getting migrations (not as
South, but as part of the default toolkit)?

Does that mean there is another way to collaborate and work on database
changes?

~~~
stdbrouw
South has existed and has been in wide use for a very long time, so whether or
not it's built into Django doesn't really matter that much.

As for Node and Go, I don't see how they could have One System To Rule Them
All – this feels like something you take care of at the ORM or framework
level.

Now, as to your "is there another way?" - during development of an app that
hasn't yet reached production, I still prefer extensively using fixtures and
just nuking the database on every change, then loading the fixtures back in.
(In Django this is a one-liner on the shell; I suppose it's probably similar
in other frameworks.) It's faster and really helps with unit testing too.

~~~
ollysb
How do you do deployments to production?

------
andrewaylett
Very interested to see the link to Liquibase[0] in that article. Must check it
out. Does anyone have experience with that sort of tool that they'd like to
share, either good or bad?

[0]: [http://www.liquibase.org/](http://www.liquibase.org/)

~~~
gizmogwai
I do use it, primarily because it is more widespread in the Java world than
its competitors, and its syntax is easier to get started with. You might also
be interested in DB-Main [0].

But database migration has little to do with version control of code. With
code, when you switch from one version to another, either you have a bug^H^H
feature or you don't. And you can repeat that as many times as you want.

Data is another kind of beast. You cannot simply "DROP" a table or even a
column back and forth and rely on your backups (if any...). You no longer want
a table? Stop using it, but leave it there. You want to migrate your data?
Make sure you do not lose information. And if you do, duplicate it somewhere
in case you have to "rollback", or at least choose sensible defaults that can
be applied.

Data is there to stay, as complete and detailed as it was originally put in
your datastore.

[0]: [http://www.db-main.eu](http://www.db-main.eu)

~~~
collyw
Yes, a database is more like a living, constantly evolving thing. I don't
think I have ever rolled back a schema change, and I don't think it would be
useful to in most cases. It's far better to work out where you went wrong and
repair the data. Logging each statement is more likely to be useful than the
ability to roll back.

------
derekp7
I've been struggling with how to get the contents of an Oracle database under
Git control. Our developers use an older rapid development environment, in
which the front-end tool stores its form layouts, triggers, and source code
all in the database. It doesn't provide an option to develop out of regular
text files. So there are a couple of approaches I could take:

1) Use triggers so that every time a Git command is run, it either imports or
exports the given database table. But this takes simple, fast operations and
makes them expensive.

2) Write a Fuse plugin that can expose a database schema as a filesystem
object, and have a git checkout trigger mount it in the appropriate place.

Has anyone already done either of these two?

------
share_and_enjoy
I've used the manual migration script approaches before (Active Record,
DBDeploy), and they are really reliable provided you use them religiously for
all changes. However, it's a lot of work, and it's easy to accidentally make a
change directly to the local dev DB and then either forget to commit a change
script or commit a change that isn't quite what you did.

At the other end, the automated comparison/migration tools are much more
convenient. However, they just can't work in 100% of cases - for example, they
may think a rename is a drop and a create, or be unable to split data.

I've always ended up with a hybrid approach in the past.

------
jdc0589
This is something EVERYONE struggles with at some point, and everyone has a
different solution that works.

It's still alpha (I'm introducing breaking changes soon), but I have been
working on Mite
([https://github.com/jdc0589/mite-node](https://github.com/jdc0589/mite-node))
for a few weeks now; it's a re-implementation of an internal tool we used in
the past. The documentation is incomplete/out-of-date though; it will be
getting some love soon.

Mite's goal is similar to sqitch, but it aims to be a little simpler: very
little configuration, and migrations are plain old SQL.
------
dgrnbrg
Another interesting solution is to use a functional database, like Datomic.
Since the database looks like a giant time series, you have implicit
versioning built into the database.

------
bni
I didn't like the article. It has your typical handwaving about "using an ORM"
that will resolve all your complexities. Object-relational mapping is a
problem completely orthogonal to database migrations.

I like flywaydb. For "rolling back" you need to use your regular disaster
recovery, as migrations will modify data as well as the schema. Use
transaction logging.

------
languagehacker
I wrote about this idea a few years ago, from a non-ORM perspective:
[http://robertelwell.info/blog/future-proof-db-changelog/](http://robertelwell.info/blog/future-proof-db-changelog/)

------
nercury
Solved by the Doctrine2 project a while ago, using much better timestamp-based
versioning. I actually miss its schema diff and migration tools when I go back
to languages like C# or C++...

~~~
e1g
Agreed. Doctrine Migrations[1] is excellent for managing SQL schema
differences and migrations across all major RDBMSs
(MySQL/PGSQL/MSSQL/Oracle/etc). Both upgrading and downgrading are supported,
as well as custom scripts and adjustments. Conceptually there is no reason why
it couldn't be integrated into non-PHP environments and used purely as an SQL
versioning tool.

[1]
[https://github.com/doctrine/migrations](https://github.com/doctrine/migrations)

------
pjmlp
If I remember correctly, there were a few vendors that used to sell such
tooling in the '90s.

------
dsugarman
I would love to see GitHub come out with some decent Postgres versioning.

------
mukundmr
flywaydb works fine for Java.

~~~
gizmogwai
flywaydb is not even as good as pure SQL scripts. It is mainly designed to be
a forward-only tool. I honestly doubt anyone with a database background would
recommend it.

~~~
ecopoesis
Well, I have a DB background. I've designed the schema for a moderately large
OTA (tingo.com) where I've used Flyway, and I would recommend and use Flyway.
Forward-only migrations aren't bad, and work pretty well in the real world. As
someone else pointed out, there are situations (like dropping a table) that
cannot be reversed.

As long as each release is backwards compatible with the previous one, you
won't ever need to roll back.

~~~
mercurial
We use it at work, and it works decently. For our development model we don't
need anything more complex, and the possibility of having code-based
migrations is important (I don't know if you can get that with Liquibase).

