Database Versioning: The ignored aspect of version control (nimkar.net)
72 points by jlemoine on Mar 14, 2014 | 59 comments

All of this sounds nice in theory, but in practice it just fails. If you are only concerned with empty databases during development, there is not much to be gained by keeping deltas of different database schema versions - just having one evolving database create script will do the job. Then you grab the version belonging to your application version and you are good to go. But once you have a non-empty database and have to provide data migration scripts, everything becomes a one-way street in all but the most trivial cases, because updates are destructive.

My favorite toy example is firstname and lastname. If you decide to merge them into a single name column (without a separator between firstname and lastname), you cannot automatically go back, because there is ambiguity whenever either firstname or lastname is missing, or when at least one of them consists of two or more words. If you do the opposite - split the name into firstname and lastname - and somehow manage to come up with a satisfying heuristic for how to perform the split, it is easy to go back by just merging firstname and lastname again. But then going forward again will not return you to the initial state in all the cases where your heuristic fails.
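The toy example is easy to make concrete. This is just a sketch against a hypothetical `users` table (SQLite in memory): once the names are merged, the obvious reverse migration - split on the first space - fails for any multi-word firstname.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (firstname TEXT, lastname TEXT)")
conn.execute("INSERT INTO users VALUES ('Anna Maria', 'Schmidt')")

# Forward migration: merge into a single name column (destructive).
conn.execute("ALTER TABLE users ADD COLUMN name TEXT")
conn.execute("UPDATE users SET name = firstname || ' ' || lastname")

# Naive reverse migration: split on the first space.
first, last = conn.execute(
    "SELECT substr(name, 1, instr(name, ' ') - 1),"
    "       substr(name, instr(name, ' ') + 1) FROM users"
).fetchone()

# The round trip does not restore the original data:
print(first, "|", last)  # Anna | Maria Schmidt
```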

Currently I think the way to go may be a highly normalized database schema - 5th or 6th normal form - combined with only non-destructive schema evolution. I am not yet sure how this will turn out in practice with respect to the limitations it imposes on schema evolution, or how bad the accumulation of abandoned schema artifacts and the data they contain may become. Anyway, it looks promising, and after some more research, thinking and prototyping I will most likely give it a try in an upcoming project.

I agree that rollbacks are hard, but if you think them through, there is usually a way.

For your example, to merge first and last name, you would create a backup first, i.e. create a table that contains the current version of first and last name, and if you need to go back, you restore. Now, if you have an application that cannot have any downtime during DB upgrades you could be in trouble - after your backup, more records could come into the original table. Probably what you need to do is modify the application to start inserting both full_name and first_name/last_name, and then gradually move all the old data over to the new format - i.e. a gradual migration. Every application and example is different, but the problem is usually solvable, often with some amount of effort.

The master schema create script is an idea I've seen in the past, and it works ok-ish up to a point. The problem is you need to update the master script, and also produce scripts to migrate production, so it doubles the effort. For enterprise database developers, I've seen plenty of development approaches that boil down to:

* Create all changes on Dev using GUI tools.
* Use TOAD to diff against production and produce a delta script.
* Apply the delta to prod.

It gets the job done, mostly, so long as someone doesn't drop the dev database, which is the only place with the new code - I've seen it happen. Then you have to go and beg the DBA to do a full restore just to get the release back on track!

TL;DR It is fundamentally impossible to find a solution that allows one to freely migrate any database schema and the data contained in that database forward and backward without any constraints. In practice you can push those constraints around a bit, and most of the time you can hide them in dark corners where nobody looks and nobody gets hurt, but there is no silver bullet and there will never be one. (The last paragraph might be interesting, too.)

I completely agree with you but I was a bit unclear about what I wanted to say. What I really wanted to say is that the problem is fundamentally unsolvable in the general case no matter how hard you try. The pathological case is that from some point in time on you capture less information than you did before - only the full name instead of first and last name.

This is admittedly a relatively rare case in practice, but nonetheless possible. There is fundamentally no way to migrate data captured past that point back to before that point, because you just don't have the necessary information and cannot make it up out of thin air. As you mentioned, it is not a big problem to migrate data captured before that point forward and backward, by just keeping the additional information around and thereby making the migration step non-destructive.

In practice it should be relatively rare that you have to migrate data forward and backward repeatedly, so the relevance of all this is not too high. Most of the time you start capturing additional information or stop capturing no longer relevant information, and these cases behave much more nicely under schema evolution. But whenever you change the way you capture some information, you will from time to time get into trouble.

It is instructive to think of data migration in terms of mathematical functions. The function to migrate the data forward is always defined otherwise you could not perform the migration in the first place. But not every function has an inverse function and so you may not be able to migrate the data backwards. You can always make up a function that returns a tuple of the function value together with all arguments and therefore make inverting it trivial - this is exactly what also keeping first and last name around when you actually want to merge them into full name means - but as mentioned above this fails for new information where you only capture the function value and no longer the function arguments.
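That point can be illustrated in a few lines (names made up for the sketch): the forward merge is a well-defined function, but two distinct inputs collide on the same output, so no inverse exists unless you also keep the arguments around.

```python
def merge(first, last):
    """Forward migration: always defined."""
    return f"{first} {last}"

def merge_invertible(first, last):
    """Keep the arguments alongside the value, so inversion is trivial."""
    return (f"{first} {last}", first, last)

# Two distinct inputs produce the same output -- no inverse exists.
assert merge("Anna Maria", "Schmidt") == merge("Anna", "Maria Schmidt")

# The non-destructive variant remains invertible.
full, first, last = merge_invertible("Anna Maria", "Schmidt")
assert (first, last) == ("Anna Maria", "Schmidt")
```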

I think you misunderstood what the OP was talking about. Have you ever looked at migration systems like South for django or the migrations in ruby on rails?

He was saying that the reverse of a migration is nigh impossible, because what happens if you add a new user after applying that first name + last name migration, for which you don't have the split name?

That's why you apply the approach he was stating - you move slowly. Step 1: add the name column and populate it from fn/ln. Step 2: change the code to point to the name column as well as populate fn/ln. Step 3: remove the code pointing to fn/ln. Step 4: remove the fn/ln columns.

No loss of data, instant reversion at any point, because each step is non-destructive. It's just longer, harder to do in reality, but definitely the "right" way to do it if you are 100% concerned with data validity.
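The four steps, sketched against a hypothetical users table (SQLite; note SQLite only gained DROP COLUMN in 3.35, so older versions would need a table rebuild for step 4):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (firstname TEXT, lastname TEXT)")
conn.execute("INSERT INTO users VALUES ('Ada', 'Lovelace')")

# Step 1: add the name column and backfill it from fn/ln (non-destructive).
conn.execute("ALTER TABLE users ADD COLUMN name TEXT")
conn.execute("UPDATE users SET name = firstname || ' ' || lastname")

# Step 2 is an application change: new writes populate both forms.
conn.execute("INSERT INTO users VALUES ('Grace', 'Hopper', 'Grace Hopper')")

# Step 3 is also a code change: stop reading/writing fn/ln.

# Step 4: drop the old columns -- the only destructive step.
conn.execute("ALTER TABLE users DROP COLUMN firstname")
conn.execute("ALTER TABLE users DROP COLUMN lastname")

print(conn.execute("SELECT name FROM users ORDER BY name").fetchall())
```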

At step 3 or later, reversion becomes impossible. A new record will be added to the DB without a first name and last name, making rollback impossible.

Also how is step 4 non destructive?

But at step 3, you've already validated the code change. You know that the name-only code works as intended, and you're just cleaning up old code at that point. I do agree that there is a potential for failure at that point, but I'd think it to be reasonably mitigated in comparison to making the change and then needing to roll back.

Step 4 is non-destructive from a data perspective, as those columns would be "zombie" data. They'd exist, but nothing would be referencing them at that point, and the data would simply exist in a different form.

If you're talking about representing the data differently, yes, that would happen, and in a case like the above you'd be much better off preserving normalization and using a view to combine into a "name" entity.

Or you could just use a schema-less database like MongoDB </irony>

And introduce yourself to a new set of problems? Like the possibility of some parts of the program saving first name and last name as separate fields and other parts as one field.

hence the </irony> at the end of my comment.

woops :)

Schema rollback doesn't really work, because once you have data in the database DROP TABLE is not the inverse of CREATE TABLE (going back and forth deletes the data).

I've never been in a situation where losing production data is acceptable. I've stopped bothering with rollback statements that nobody ever cares to test and nobody will ever use in production. Schema changes are forward-only.

Assuming you have lots of storage space, instead of drop you could just copy the table across to the archive. Similarly with dropped fields' previous versions, and created fields' current versions in the case of a rollback. But then you will also need to have the logic to create those fields for new rows.
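The archive-instead-of-drop idea in miniature (table names invented for the sketch): a rename preserves the data, so the "rollback" is just the inverse rename.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, total REAL)")
conn.execute("INSERT INTO orders VALUES (1, 9.99)")

# Instead of DROP TABLE orders, move it aside so a rollback can restore it.
conn.execute("ALTER TABLE orders RENAME TO archived_orders")

# Rollback is the inverse rename -- no data was destroyed.
conn.execute("ALTER TABLE archived_orders RENAME TO orders")
print(conn.execute("SELECT total FROM orders").fetchone())  # (9.99,)
```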

It's just easier to have forwards-only changes.

Agree with the production use case. For me, rollback scripts are more useful in development and branching than for production rollbacks. Say I'm working on one feature in one branch and I've split a column, then I need to work on a hotfix on another branch which doesn't have that split column. It's really useful being able to switch to that branch and quickly get my database rolled back into a matching state.

That's true, but I still test the rollback on all my liquibase change sets. This helps to ensure that I have a straightforward option to back out of my changes solely in the case that something goes badly wrong immediately during deployment into the live environment.

But I agree that any regression after that point requires independent consideration, for the reasons you give.

I don't do rollback statements for non reversible operations either, but for some things like adding/renaming columns or adding/dropping indices it usually works fine.

Have to admit I use simpler tools such as DbDeploy, Flyway or Play!'s evolutions. They do the trick for smaller databases, but are difficult to retroactively fit in as the OP mentions.

A great tip I also received once was to split destructive migrations across releases. I.e. if a column is to be removed, first release code that doesn't use that column; if all is okay, rename the column in the next release's DB migration; and if still okay, actually delete the column in a later release. If this was all done in one release and didn't work, your rollback would be difficult.

Another defensive action, e.g. when deleting rows of data, is to copy the affected data to a temporary table before deleting it in the original table(s), and only delete the temporary table in a later release, once it's proven you did not need to roll back.

And if for some reason you don't have any automated database versioning, at least create a versions table to manually keep track of changes and dates for some sort of sane auditing.
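Even the minimal versions table buys useful auditing. One plausible shape (the schema here is just an illustration, not a standard):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE schema_versions (
        version     INTEGER PRIMARY KEY,
        description TEXT NOT NULL,
        applied_at  TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
    )
""")

# Record each manually applied change as it happens.
conn.execute(
    "INSERT INTO schema_versions (version, description) VALUES (?, ?)",
    (1, "add name column to users"),
)

latest, = conn.execute("SELECT MAX(version) FROM schema_versions").fetchone()
print(latest)  # 1
```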

This is an area I have experimented with in large enterprise environments, usually with Oracle databases.

In one job I created a process where we used Rails style migrations with some Ruby code I put together. As well as schema migrations, I had it handle applying stored procedures to the database too.

Since then I created a little application called dbgeni (database generic installer) and have been using it successfully to create Dev and Test environments in another job. It changes the schema with migrations and applies changed stored procedures too.

One difference between dbgeni and other tools is that dbgeni uses plain SQL files and applies them using the database CLI - i.e. sqlplus for Oracle, the mysql command for MySQL and isql for Sybase. This means that if you take the tool away, the SQL scripts are still useful - that allows me to use the tool for applying changes to dev and test environments, but since the production DBAs don't want to use it, I can just bundle up the same scripts and pass them over.

The project is on github https://github.com/sodonnel/dbgeni and it installs as a Ruby gem. I have supporting website http://dbgeni.appsintheopen.com with more details.

Everyone who does DB update should understand idempotence and how to undo their changes. I like the approach the author provides, but calling it version control is inaccurate. This is change management and it's a completely different discipline with its own body of knowledge.

I fail to understand how an undo script would work for big schema modifications (changing 1-n to n-n or adding intermediate tables). To me, the safest way is definitely a backup/restore if things go wrong (and you should definitely have backups anyway), but I'm not an expert. Could you elaborate?

Also, the post mentions ORMs as a solution. Well, clearly I would never want to let anything automatically generate and perform migration scripts, apart from simple "new nullable column" types of change.

You do it by breaking up the potentially destructive changes so you can roll back your deployment to the previous working version. Typically all destructive modification of the schema is done in several passes, e.g. in each release/deploy:

1. Copy the data into a new table and add code to write it to both tables.

2. Change the code to reference only the new data, but still store it in both tables.

3. Rename the old table.

4. Remove the old table.

On top of that you have changes that are supposed to be applied before the new code is deployed (eg. creating a table, renaming a table) and after the code is deployed (eg. dropping unused columns).
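Spread across releases, those passes might look like this (SQLite, table names invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Release 1: copy the data into the new table; app writes to both.
conn.execute("CREATE TABLE customers_old (id INTEGER, name TEXT)")
conn.execute("INSERT INTO customers_old VALUES (1, 'Ada')")
conn.execute("CREATE TABLE customers AS SELECT * FROM customers_old")

# Release 2: code now reads only from customers (no schema change needed).

# Release 3: rename the old table out of the way, keeping it restorable.
conn.execute("ALTER TABLE customers_old RENAME TO customers_old_retired")

# Release 4: finally drop it, once a rollback is no longer plausible.
conn.execute("DROP TABLE customers_old_retired")

print(conn.execute("SELECT name FROM customers").fetchall())  # [('Ada',)]
```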

As somebody else mentions in the comments, this is more about Change Control than Versioning, but the two things go together to ensure that you have as many idempotent and reversible changes as possible. Of course, if you have the resources you can always adopt a "never delete data" policy, so you're always copying instead of modifying the data, making sure you have a way to retract the changes. Clean-up can be done at regular, distant intervals.

Using an ORM doesn't prevent you from reviewing the changes that are about to be applied. ActiveRecord, for example, can provide you with a SQL file representing your schema after the ORM is done. Usually the ORM schema changes are executed automatically by your CI upon deployment to Dev, QA and possibly Staging environments (you do have at least two of those, don't you? :) and need manual intervention on production (maybe to control performance degradation of your live env - locks, increased disk I/O, etc). This way you get plenty of testing of any DB changes and you should be able to spot any problems as early as possible.

Some ORMs do a decent job generating a correct sequence of DB queries for quite complex schema changes, including relationship changes. However, the process is a bit more manual than you might think. For example, to change from 1-n to n-n you would first generate the n-n relationship in a first migration, then write a manual migration to move the data, and then remove the 1-n relationship in a final migration. The first and last can be auto-generated by making changes to entities, including the reverse migration.

Of course, manual review is absolutely necessary and we write queries ourselves if they involve moving the data.

Every article about DB version control should mention sqitch. It's a great tool that works similarly to, and in conjunction with, git.


sqitch is the only approach to version controlled migrations I've found that actually works long term. I wish more people would invest time into learning it so that it would become more commonplace.

Oh wow, that's really neat (although their main website won't load for me, I'm checking out the github repo). It's exactly what I've been looking for, cheers!

This is why I checked these comments. Thank you.

Here you go. It's easy with a bit of Category Theory. Just implement something along the lines of David Spivak's "On The Relational Foundations Of Functorial Data Migration".

Original motivation, from v1 of the paper:

   I asked whether the difficulty in changing data types to
   allow for longer integers was simply a storage issue or 
   something more fundamental. They said that it was more 
   fundamental, and in fact that my question showed how
   little I knew about the horrors of using databases in
   the real world!

> Every script must start with a integer which must be the next available integer.

Based on experience, you should not do 01_script_a.py, 02_script_b.py, ... but rather 010_, 020_, 030_, ... . This way you can squeeze an initially unaccounted-for script into the correct location without renaming the others.

It's like BASIC all over again.

Ahaha! Glad I'm not the only one who thought that!

Better yet, use timestamps. With a sequential identifier, you'll get problems in a distributed team real quick. Rails switched because of this.
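A timestamp prefix is trivial to generate and collision-resistant across branches (filename is just an example of the Rails-style convention):

```python
from datetime import datetime, timezone

# e.g. 20140314120501_add_name_column.py -- second-resolution UTC stamp
stamp = datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S")
filename = f"{stamp}_add_name_column.py"
print(filename)
```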

That would seem like a good idea at first. However, if your product is deployed to hundreds of customers, each at various versions of the DB, what happens? E.g. customer A is at v10, customer B at v20 and customer C at v30. Then you decide for some reason that patch v11 needs to be deployed. What happens to customers B and C? Do you create two new patches, i.e. v11 and v40, which have similar changes?

Check out alembic's neat solution to this: http://alembic.readthedocs.org/en/latest/tutorial.html#worki...

Basically you can solve the problem by fixing the order manually after merging code. And it won't run until you resolve the ordering.

Microsoft's MVC database migration system has worked well for me.

The package manager handles making the "code behind" the database changes (which is committed to the VCS), and you can revert and update as needs be.

As others have said and the author mentioned with the 3Tb example - it is the data that is the tricky part. Rather, the tricky part is state. Code (and schema) versioning is fundamentally easy because it is internally stateless. Once you add external state to the mix all bets are off. I really wish there was a silver bullet but there is not. If you can solve the issue of versioned persistence of large amounts of state then you will solve a lot of problems in computer science.

Exact: Stateless!

Sad that we still can't dump a database to a code file, unique and canonical.

Maybe one way would be if the deployment script could say "We're missing columns X and Y" by comparing a description of the DB to the actual one. If you deploy an older version, you still get the message but can choose not to apply the change if it's destructive.
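A minimal sketch of that check (SQLite here; the expected-column set would come from your versioned schema description):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")

expected = {"id", "name", "email"}  # columns the deployed code expects
actual = {row[1] for row in conn.execute("PRAGMA table_info(users)")}

missing = expected - actual
print("We're missing column(s):", ", ".join(sorted(missing)))  # email
```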

I prefer to think of db version control and db migrations as two separate things.

Version control is achieved by keeping the SQL scripts that create the database in your VCS, along with the code that depends on it.

Migration scripts are created later as a pre-deployment step. I prefer not to store these scripts in VCS because in my experience they are typically run once and never used again. There are several tools that can generate the migration scripts by comparing two schemas e.g. dev-->prod.

I wrote an open source command line tool that can be used to create VCS friendly create scripts from a SQL Server database. It's on github at https://github.com/sethreno/schemazen#schema-zen---script-an...

Also, here are a few tools that I've used to generate migration scripts:

* http://opendbiff.codeplex.com/
* http://www.sqldbtools.com/Tools.aspx?ProductId=1
* http://www.red-gate.com/products/sql-development/sql-compare...

Does anyone know what the production-worthy mechanism for database versioning is in Django, Node and Go?

It always feels strange to me (coming from Rails) that other frameworks treat versioning as optional - I mean Django is getting migrations (not as South, but as part of the default toolkit) only now ?

Does that mean there is another way to collaborate and work on database changes ?

South has existed and has been in wide use for a very long time, so whether or not it's built into Django doesn't really matter that much.

As for Node and Go, I don't see how they could have One System To Rule Them All – this feels like something you take care of at the ORM or framework level.

Now, as to your "is there another way?" – during development of an app that hasn't yet reached production, I still prefer extensively using fixtures and just nuking the database on every change, then loading the fixtures back in. (In Django this is a one-liner on the shell, I suppose it's probably similar in other frameworks.) Faster and really helps with unit testing too.

How do you do deployments to production?

Very interested to see the link to Liquibase[0] in that article. Must check it out. Does anyone have experience with that sort of tool that they'd like to share, either good or bad?

[0]: http://www.liquibase.org/

At this point I can barely imagine working with a relational database without using liquibase. It does take some getting used to, and there's a few ways you can get yourself a little bit stuck during development, like when you realize a change you've just applied is wrong, and you dive into the changelog or source files to fix it BEFORE you've asked liquibase to roll back the bad change. Then, when you finally remember to roll back the bad change, liquibase refuses because you've edited the change's definition, and it no longer corresponds to what was actually applied to the database. It's not a very big deal, and if you have more presence of mind than me you might learn not to trip yourself up in that particular way in the first place.

In my job I really struggle with getting meaningful test data, and so having the ability to clone the live database and let liquibase roll forward with new changes from a dev branch, try things out, roll back, tweak things and roll forward again is very handy.

Also if you're maintaining a large number of stored procedures, it's very handy to keep each of them in a separate file and reference them from your changelog something like: <sqlFile path="..." runOnChange="true"/> - the runOnChange property means that whenever the contents of the file is changed liquibase will automatically reload the procedure. (Though for example, postgres will complain if the change alters the type signature of a function.)

I'd recommend using the XML config format (though JSON and YAML are also supported), and use an editor that understands XML schemas. Probably any of the free Jetbrains ones will, certainly PyCharm does. Load the XML schema definitions into the project settings, and you'll get syntax autocompletion and validation for your liquibase changelog files.

I do use it, primarily because it is more widespread in the Java world than its competitors, and its syntax is easier to start with. You might also be interested in DB-Main [0].

But database migration has little to do with version control of code. With code, when you switch from one version to another, either you have a bug^H^H feature, or you don't. And you can repeat that as many times as you want.

Data is another kind of beast. You cannot simply "DROP" a table or even a column back and forth and rely on your backups (if any...). You no longer want a table? Stop using it, but leave it there. You want to migrate your data? Make sure you do not lose information. And if you do, duplicate it somewhere in case you have to "rollback", or at least choose sensible defaults that can be applied.

Data is there to stay, as complete and detailed as it was originally put in your datastore.

[0]: http://www.db-main.eu

Yes, a database is more like a living, constantly evolving thing. I don't think I have ever rolled back a schema change, and I don't think it would be useful in most cases. It's far better to work out where you went wrong and repair the data. Logging each statement is more likely to be useful than the ability to roll back.

I've been struggling with how to get an Oracle database contents under Git control. Our developers use an older rapid development environment, in which the front end tool stores its form layout, triggers, and source code all in the database. It doesn't provide an option to develop out of regular text files. So there are a couple approaches I could take:

1) Use triggers so that every time a GIT command is run, it either imports or exports the given database table. But this takes simple, fast operations and makes them expensive.

2) Write a Fuse plugin that can expose a database schema as a filesystem object, and have a git checkout trigger that mount it in the appropriate place.

Has anyone already done either of these two?

I've used the manual migration script approaches before (Active Record, DBDeploy), which are really reliable providing you use them religiously to perform all changes. However it's a lot of work and it's easy to accidentally make a change directly to the local dev DB and then either forget to commit a change script or commit a change that isn't quite what you did.

At the other end the automated comparison/migration tools are much more convenient. However they just can't work in 100% of cases, for example thinking a rename is a drop and a create, or splitting data.

I've always ended up with a hybrid approach in the past.

This is something EVERYONE struggles with at some point, and everyone has a different solution that works.

It's still alpha (I'm introducing breaking changes soon), but I have been working on Mite (https://github.com/jdc0589/mite-node) for a few weeks now; it's a re-implementation of an internal tool we used in the past. The documentation is incomplete/out-of-date though; it will be getting some love soon.

Mite's goal is kind of similar to sqitch, but it aims to be a little simpler: very little configuration, and migrations are plain old SQL.

Another interesting solution is to use a functional database, like Datomic. Since the database looks like a giant timeseries, you have an implicit versioning built into the database.

Didn't like the article. It has your typical handwaving about "using an ORM" that will resolve all your complexities. Object-relational mapping is a problem completely orthogonal to database migrations.

I like flywaydb. For "rolling back" you need to use your regular disaster recovery, as migrations modify data as well as the schema. Use transaction logging.

I wrote about this idea a few years ago, from a non-ORM perspective: http://robertelwell.info/blog/future-proof-db-changelog/

The Doctrine2 project solved this a while ago, using much better timestamp versioning. I actually miss its schema diff and migration tools when I go back to languages like C# or C++...

Agreed. Doctrine Migrations[1] is excellent for managing SQL schema differences and migrations across all major RDBMSs (MySQL/PgSQL/MSSQL/Oracle/etc). Both upgrading and downgrading are supported, as well as custom scripts and adjustments. Conceptually, there is no reason why it couldn't be integrated into non-PHP environments and used purely as an SQL versioning tool.

[1] https://github.com/doctrine/migrations

If I remember correctly there were a few vendors that used to sell such tooling in the 90's.

I would love to see GitHub come out with some decent Postgres versioning.

flywaydb works fine for Java.

flywaydb is not even as good as pure SQL scripts. It is mainly designed to be a forward-only tool. I honestly doubt anyone with some database background would recommend it.

Well, I have a DB background. I designed the schema for a moderately large OTA (tingo.com) where I used Flyway, and I would recommend and use Flyway. Forward-only migrations aren't bad, and work pretty well in the real world. As someone else pointed out, there are situations (like drop table) that cannot be reversed.

As long as each release is backwards compatible with the previous one, you won't ever need to roll back.

We use it at work, and it works decently. For our development model we don't need something more complex, and the possibility of code-based migrations is important (I don't know if you can get that with Liquibase).
