My favorite toy example is firstname and lastname. If you decide to merge them into a single name column (without a separator between firstname and lastname), you cannot automatically go back, because there is ambiguity whenever either firstname or lastname is missing, or when at least one of them consists of two or more words. If you do the opposite - split the name into firstname and lastname - and somehow manage to come up with a satisfying heuristic for how to perform the split, it is easy to go back by just merging firstname and lastname again. But then going forward again will not return you to the initial state in all the cases where your heuristic fails.
Currently I think the way to go may be a highly normalized database schema - like 5th or 6th normal form - combined with performing only non-destructive schema evolution. I am not yet sure how this will turn out in practice, with respect to the limitations it imposes on how the schema can evolve, and how bad it may get because of abandoned schema artifacts and the data they contain. Anyway, I think it looks promising, and after doing some more research, thinking and prototyping I will most likely give it a try in an upcoming project.
For your example, to merge first and last name, you would create a backup first, i.e., create a table that contains the current version of first and last name, and if you need to go back, you restore. Now, if you have an application that cannot have any downtime during DB upgrades you could be in trouble - after your backup, more records could come into the original table. Probably what you need to do is modify the application to start inserting both full_name and first_name/last_name, and then gradually move all the old data over to the new format - i.e. a gradual migration. Every application and example is different, but the problem is usually solvable, often with some amount of effort.
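A rough sketch of that gradual approach, assuming a hypothetical users table with first_name/last_name columns (Postgres-flavoured SQL, names made up):

    -- Step 1: add the new column alongside the old ones (non-destructive);
    -- the application starts writing both representations from now on.
    ALTER TABLE users ADD COLUMN full_name VARCHAR(200);

    -- Step 2: backfill rows that were inserted before the code change.
    UPDATE users
       SET full_name = first_name || ' ' || last_name
     WHERE full_name IS NULL;

    -- Step 3 (a release or two later, once nothing reads the old columns):
    -- ALTER TABLE users DROP COLUMN first_name;
    -- ALTER TABLE users DROP COLUMN last_name;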
The master schema create script is an idea I've seen in the past, and it works ok-ish up to a point. The problem is you need to update the master script, and also produce scripts to migrate production, so it doubles the effort. For enterprise database developers, I've seen plenty of development approaches that boil down to:
* Create all changes on Dev using GUI tools.
* Use TOAD to diff against production and produce a delta script.
* Apply delta to prod.
It gets the job done, mostly, so long as someone doesn't drop the dev database, which is the only place with the new code - I've seen it happen. Then you have to go and beg the DBA to do a full restore just to get the release back on track!
I completely agree with you but I was a bit unclear about what I wanted to say. What I really wanted to say is that the problem is fundamentally unsolvable in the general case no matter how hard you try. The pathological case is that from some point in time on you capture less information than you did before - only the full name instead of first and last name.
This is admittedly a relatively rare case in practice, but nonetheless possible. Now there is fundamentally no way to migrate data captured past that point back to before that point, because you just don't have the necessary information and cannot make it up out of thin air. As you mentioned, it is not a big problem to migrate data captured before that point forward and backward, by just keeping the additional information around and therefore making the migration step non-destructive.
In practice it should be a relatively rare case that you have to migrate data forward and backward repeatedly, so the relevance of all that is not too high. Most of the time you start capturing additional information or stop capturing no-longer-relevant information, and these cases behave much more nicely under schema evolution. But especially whenever you change the way you capture some information, you will get into trouble from time to time.
It is instructive to think of data migration in terms of mathematical functions. The function to migrate the data forward is always defined, otherwise you could not perform the migration in the first place. But not every function has an inverse, and so you may not be able to migrate the data backwards. You can always make up a function that returns a tuple of the function value together with all its arguments, making it trivial to invert - this is exactly what also keeping first and last name around when you actually want to merge them into full name amounts to - but as mentioned above this fails for new data where you only capture the function value and no longer the function arguments.
He was saying that the reverse of a migration is nigh impossible, because what happens if you add a new user after applying that first name + last name migration, for whom you don't have the split name?
No loss of data, instant reversion at any point, because each step is non-destructive. It's just longer, harder to do in reality, but definitely the "right" way to do it if you are 100% concerned with data validity.
Also, how is step 4 non-destructive?
Step 4 is non-destructive from a data perspective, as those columns would be "zombie" data. They'd exist, but nothing would be referencing them at that point, and the data would simply exist in a different form.
If you're talking about representing the data differently, yes that would happen, and in a case like the above, you'd be much better off preserving normalization and using a view to combine into a "name" entity.
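For instance (just a sketch, with made-up table and column names), the base table stays normalized and the merged name only exists as a derived column:

    -- The stored form stays first/last name; the view can be dropped or
    -- redefined at any time without touching the underlying data.
    CREATE VIEW person_names AS
    SELECT id,
           first_name,
           last_name,
           first_name || ' ' || last_name AS full_name
      FROM people;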
I've never been in a situation where losing production data is acceptable. I've stopped bothering with rollback statements that nobody ever cares to test and nobody will ever use in production. Schema changes are forward-only.
It's just easier to have forwards-only changes.
But I agree that any regression after that point requires independent consideration, for the reasons you give.
A great tip I also received once was to split migrations across releases when they are destructive, i.e. if a column is to be removed, first release code that doesn't use that column; if all is okay, rename the column in the next release's db migration; and if still okay, actually delete the column in a release later on. If this was all done in one release and didn't work, your rollback would be difficult.
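Sketched out with a hypothetical column, the three releases might look something like this (rename syntax varies a bit per database):

    -- Release 1: deploy code that no longer reads legacy_flag; no schema change.

    -- Release 2: rename it, so any forgotten reference fails loudly
    -- while the data itself is still there.
    ALTER TABLE orders RENAME COLUMN legacy_flag TO legacy_flag_deprecated;

    -- Release 3: only now drop it for good.
    ALTER TABLE orders DROP COLUMN legacy_flag_deprecated;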
Other defensive actions: e.g. when deleting rows of data, copy the affected data to a temporary table first before deleting it in the original table(s), and only in a later release, when it's proven you did not need to roll back, should you delete the temporary table.
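Something along these lines (table names and the predicate are made up):

    -- Park the rows we are about to delete in a side table.
    CREATE TABLE customers_purged_backup AS
    SELECT * FROM customers WHERE status = 'obsolete';

    DELETE FROM customers WHERE status = 'obsolete';

    -- A release or two later, once it's clear no rollback is needed:
    -- DROP TABLE customers_purged_backup;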
And if for some reason you don't have any automated database versioning, at least create a versions table to manually keep track of changes and dates for some sort of sane auditing.
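Even something as simple as this (a sketch, the columns are just a suggestion) is better than nothing:

    CREATE TABLE schema_versions (
        version     VARCHAR(50)  NOT NULL PRIMARY KEY,
        description VARCHAR(200),
        applied_by  VARCHAR(100),
        applied_at  TIMESTAMP    NOT NULL DEFAULT CURRENT_TIMESTAMP
    );

    -- One insert per change script, run together with the change itself.
    INSERT INTO schema_versions (version, description, applied_by)
    VALUES ('020', 'add full_name to users', 'release team');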
In one job I created a process where we used Rails style migrations with some Ruby code I put together. As well as schema migrations, I had it handle applying stored procedures to the database too.
Since then I created a little application called dbgeni (database generic installer) and have been using it successfully to create Dev and Test environments in another job. It changes the schema with migrations and applies changed stored procedures too.
One difference between dbgeni and other tools is that dbgeni uses plain SQL files and applies them using the database CLI - i.e. sqlplus for Oracle, the mysql command for MySQL and isql for Sybase. This means that if you take the tool away, the SQL scripts are still useful - that allows me to use the tool for applying changes to dev and test environments, but since the production DBAs don't want to use it, I can just bundle up the same scripts and pass them over.
The project is on github https://github.com/sodonnel/dbgeni and it installs as a Ruby gem. I have a supporting website http://dbgeni.appsintheopen.com with more details.
Also, the post mentions ORMs as a solution. Well, clearly I would never want to let anything automatically generate and perform migration scripts, apart from simple "new nullable column" types of change.
# copy the data into a new table and add code to place it in both tables
# change the code to reference only the new data but still store it in both tables
# rename the old table
# remove the old table
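In SQL terms the four steps could look roughly like this (hypothetical table names; the dual-write part lives in application code):

    -- 1. copy the data into the new table; app code now writes to both
    CREATE TABLE customer_v2 AS SELECT * FROM customer;

    -- 2. (code-only change) reads move to customer_v2, writes still go to both

    -- 3. park the old table under a name nothing references
    ALTER TABLE customer RENAME TO customer_retired;

    -- 4. drop it once you are sure you won't roll back
    DROP TABLE customer_retired;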
On top of that you have changes that are supposed to be applied before the new code is deployed (e.g. creating a table, renaming a table) and changes applied after the code is deployed (e.g. dropping unused columns).
As somebody else mentions in the comments this is more about Change Control than Versioning, but the two things go together to ensure that your changes are as idempotent and reversible as possible. Of course, if you have the resources you can always adopt a "never delete data" policy, so you're always copying instead of modifying the data, making sure you have a way to retract the changes. Clean-up can be done at regular, distant intervals.
Using an ORM doesn't prevent you from doing a review of the changes that are about to be implemented. ActiveRecord, for example, can provide you with a SQL file representing your schema after the ORM is done. Usually the ORM schema changes are executed automatically by your CI upon deployment to Dev, QA and possibly Staging environments (you do have at least two of those, don't you? :), while the live environment needs manual intervention (maybe to control performance degradation - locks, increased disk I/O, etc). This way you get plenty of testing of any DB changes and you should be able to spot any problems as early as possible.
Of course, manual review is absolutely necessary and we write queries ourselves if they involve moving the data.
I asked whether the difficulty in changing data types to allow for longer integers was simply a storage issue or something more fundamental. They said that it was more fundamental, and in fact that my question showed how little I knew about the horrors of using databases in the real world!
Based on experience, you should not do 01_script_a.py, 02_script_b.py, ... but rather 010_, 020_, 030_, ... This way you can squeeze in some initially unaccounted-for script at the correct location without renaming the others.
Basically you can solve the problem by fixing the order manually after merging code. And it won't run until you resolve the ordering.
The package manager handles making the "code behind" the database changes (which is committed to the CVS), and you can revert and update as need be.
Sad that we still can't dump a database to a code file that is unique and canonical.
Maybe one way would be if the deployment script could say "We're missing columns X and Y" by comparing a description of the db to the actual one. If you deploy an older version, you still get the message but can choose not to apply the change if it's destructive.
Version control achieved by keeping the SQL scripts to create the database in your VCS along with the code that depends on it.
Migration scripts are created later as a pre-deployment step. I prefer not to store these scripts in VCS because in my experience they are typically run once and never used again. There are several tools that can generate the migration scripts by comparing two schemas e.g. dev-->prod.
I wrote an open source command line tool that can be used to create VCS friendly create scripts from a SQL Server database. It's on github at https://github.com/sethreno/schemazen#schema-zen---script-an...
Also, here are a few tools that I've used to generate migration scripts:
It always feels strange to me (coming from Rails) that other frameworks treat versioning as optional - I mean, Django is getting migrations (not as South, but as part of the default toolkit) only now?
Does that mean there is another way to collaborate and work on database changes?
As for Node and Go, I don't see how they could have One System To Rule Them All – this feels like something you take care of at the ORM or framework level.
Now, as to your "is there another way?" – during development of an app that hasn't yet reached production, I still prefer extensively using fixtures and just nuking the database on every change, then loading the fixtures back in. (In Django this is a one-liner on the shell, I suppose it's probably similar in other frameworks.) Faster and really helps with unit testing too.
In my job I really struggle with getting meaningful test data, and so having the ability to clone the live database and let liquibase roll forward with new changes from a dev branch, try things out, roll back, tweak things and roll forward again is very handy.
Also if you're maintaining a large number of stored procedures, it's very handy to keep each of them in a separate file and reference them from your changelog, something like: <sqlFile path="..." runOnChange="true"/> - the runOnChange property means that whenever the contents of the file change, liquibase will automatically reload the procedure. (Though, for example, postgres will complain if the change alters the type signature of a function.)
I'd recommend using the XML config format (though JSON and YAML are also supported), and use an editor that understands XML schemas. Probably any of the free Jetbrains ones will, certainly PyCharm does. Load the XML schema definitions into the project settings, and you'll get syntax autocompletion and validation for your liquibase changelog files.
But database migration has little to do with version control of code. With code, when you switch from one version to another, either you have a bug^H^H feature, or you don't. And you can repeat that as many times as you want.
Data is another kind of beast. You cannot simply "DROP" a table or even a column back and forth and rely on your backups (if any...). You no longer want a table? Do not use it anymore, but leave it there. You want to migrate your data? Make sure you do not lose information. And if you do, duplicate it somewhere in case you have to "rollback", or at least choose sensible defaults that can be applied.
Data is there to stay, as complete and detailed as it was originally put in your datastore.
1) Use triggers so that every time a GIT command is run, it either imports or exports the given database table. But this takes simple, fast operations and makes them expensive.
2) Write a Fuse plugin that can expose a database schema as a filesystem object, and have a git checkout trigger that mounts it in the appropriate place.
Has anyone already done either of these two?
At the other end, the automated comparison/migration tools are much more convenient. However, they just can't work in 100% of cases - for example, thinking a rename is a drop and a create, or splitting data.
I've always ended up with a hybrid approach in the past.
It's still alpha (I'm introducing breaking changes soon), but I have been working on Mite (https://github.com/jdc0589/mite-node) for a few weeks now; it's a re-implementation of an internal tool we used in the past. The documentation is incomplete/out-of-date though; it will be getting some love soon.
Mite's goal is kind of similar to sqitch, but it aims to be a little simpler: very little configuration, and migrations are plain old SQL.
I like flywaydb. For "rolling back" you need to use your regular disaster recovery, as migrations will modify data as well as the schema. Use transaction logging.
As long as each release is backwards compatible with the previous one, you won't ever need to roll back.