I've been really happy with how my current company[0] has been doing migrations. I've seen a couple of others do it the same way, but it seems like it should be more widespread.
Database Schema as Code
Instead of writing up and down migrations, you define what the end state should look like, and the computer figures out how to get there. This is just how the industry started managing server configurations (Puppet) and infrastructure (Terraform).
We use protocol buffers, so it was pretty straightforward to have a definition of what our tables should look like. We have a script that figures out the delta between two states (either proto files or a db) and can calculate the schema migration SQL (e.g. CREATE TABLE, etc).
From there, we run it through a safety check. Any unsafe migration (whether due to potential data loss or performance issues, e.g. DROP TABLE) requires an extra approval file.
There's no real difference between an up migration and a down migration (except that one tends to result in an unsafe migration). It's calculable at CI time, so we can give devs a chance to look at what it's going to do and approve any unsafe migrations. API compatibility checks enforce that you need to deprecate before you can drop.
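To make that concrete, here's a toy sketch of the kind of diff output I mean (table and column names invented, not our actual schema):

    -- desired end state, derived from the proto definition:
    --   CREATE TABLE members (
    --     id      BIGINT PRIMARY KEY,
    --     email   TEXT NOT NULL,
    --     plan_id BIGINT
    --   );
    -- the live database has members without plan_id, so the diff emits:
    ALTER TABLE members ADD COLUMN plan_id BIGINT;

    -- reverting the repo to the previous commit produces the mirror image,
    -- which the safety check flags as unsafe and holds for approval:
    ALTER TABLE members DROP COLUMN plan_id;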
DML, that is data changes, is handled via a standard checked-in SQL file; CI will run it after the schema migration and before the code deploy.
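For example, a checked-in DML file might be nothing more than a backfill (hypothetical file and column names, continuing the sketch above):

    -- dml/backfill_plan_id.sql (hypothetical)
    -- CI runs this after the schema migration adds plan_id, before the code deploy.
    UPDATE members
    SET plan_id = 1
    WHERE plan_id IS NULL;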
Alembic is the one other place I've seen this concept (a couple of others mentioned it too), so it's not new, but I'm surprised I haven't seen it in more places.
[0] Shameless plug: We're hiring if you're interested in changing how healthcare is paid for, delivered, and experienced. https://www.devoted.com/about/tech-jobs/
I call this declarative schema management, since the repo declares the desired state, and the tooling knows how to reach this state. This concept is finally catching on lately, although some huge companies have already been doing it this way for quite some time. Facebook is a key example; they've managed their schema changes in a pure-SQL declarative fashion, company-wide, for nearly a decade.
I'm developing a suite of tools [1] to provide declarative schema management and "schema change by pull request" functionality, initially targeting MySQL and MariaDB. A few large companies have built pipelines using one of my tools -- including Twilio SendGrid, who wrote about their process in-depth recently [2].
This is good to know. As someone who didn't do much with databases before, I was frankly worried that so few others seemed to be taking this approach when it made so much sense (we did have the advantage of having a defined schema, which I know isn't always available). Seems like I just didn't know what to search for.
Git would never have worked if it required devs to write the up/down patches - why should we have to write the up/down migrations for our schema?
Excited to see more tooling around declarative schema!
I'm surprised this isn't more of a thing. It seems like the natural evolution of "[X] as code". I've always been a little turned off by migrations (though they were certainly an improvement over the previous situation, which was basically just nondeterministic changes made on the fly).
My thoughts exactly. But it's a major paradigm shift for those coming from the world of Rails/Django/etc migrations, and that unfamiliarity understandably leads to some initial resistance and skepticism.
fwiw, other declarative tools are starting to pop up -- besides my tool Skeema, some others I've seen recently are Migra [1] and sqldef [2]. And meanwhile a bunch of enterprise tools for MS SQL Server have operated in the declarative fashion for quite a long time, although usually with GUIs instead of being git / pull-request-driven. So I think/hope it's just a matter of time before this concept becomes more widely known.
It's definitely a thing, e.g. SQL Server Data Tools has this as a default - Schema Compare and Data Compare - and you can just use declarative approaches to define your final state and let the tool take care of it.
That being said - if you want to do this, the downside is usually that it's slow as hell, and the non-migration approaches can cost you downtime.
Generic solutions for reaching specific states often mean copying all the data somewhere else so you can modify the table and then put it back in a useful fashion - a migration often allows more piecemeal approaches.
Excellent question! The short answer is Skeema doesn't directly support renames yet. Renames are inherently more imperative than declarative, so they don't fit in well with the model. I've thought about handling them via tracking their history/state, but it would be hacky.
Two workarounds exist in Skeema currently:
* You can do the rename "out of band" (e.g. manually, rather than via `skeema push`), and then update your schema repo via `skeema pull`. This isn't ideal, but then again, table/col renames typically involve nasty code-deploy-order complexities to begin with (regardless of migration system): there's no way to rename something at the same exact instant that your new code goes live, and it's difficult/annoying to write code that can correctly interact with both names.
* For new empty tables, happily a rename is equivalent to drop-then-re-add. So this case is trivial, and Skeema can be configured to allow destructive changes only on empty tables.
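To illustrate that second point with a throwaway example (names invented): for a table with no rows yet, the drop-and-recreate that a declarative diff produces loses nothing compared to a rename:

    -- imperative version: RENAME TABLE signup_events TO audit_events;
    -- declarative diff instead emits a drop + create, which is fine
    -- only because the table is still empty:
    DROP TABLE signup_events;
    CREATE TABLE audit_events (
      id      BIGINT PRIMARY KEY,
      payload TEXT
    );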
If you use MS SQL Server SSDT, you use refactor/rename and it finds all references and changes them; then when you go to deploy, it generates an sp_rename - 100% killer feature right there :)
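For anyone who hasn't seen it, the generated deploy script ends up containing calls along these lines (object names are just examples):

    -- rename a table
    EXEC sp_rename 'dbo.Customer', 'Client';
    -- rename a column on that table
    EXEC sp_rename 'dbo.Client.FullName', 'DisplayName', 'COLUMN';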
Data migrations? Denormalizing columns from one table to one or more child tables, possibly more than one relation away? Switching one set of fields in a single table to be in a different table via a relation, converting something from 1:1 to 1:n?
The concept appeals to me, but it only seems to work for trivial migrations.
I’ll openly admit that we don’t have everything figured out. You’re absolutely right that, currently, we constrain ourselves to admittedly simple migrations.
I think there’s a whole set of problems to be solved in this space, and frankly, I’m as surprised as anyone that, given how declarative SQL is, we use procedural code to do migrations. Part of my post was hoping people would tell me what tool I should be using, or how this approach fails over time. So your examples are really helpful for me as I think through whether each one is possible to do automatically, work around, or get by without.
It seems to me that we just lack a way to express these transitions mathematically that could help us perform them. And of those, there’s probably only a subset that are possible to do without taking downtime.
In particular, the class of migrations that you outlined is a combination of DDL and DML changes and also has quite a bit of code complexity to do without downtime. It’s definitely a current weakness.
A totally valid point, but I'd argue those should be handled by a separate tool or process. Data migrations tend to be fully programmatic; tools and frameworks can help reduce the code required, but cannot handle every possible case. (having performed numerous multi-billion-row data migrations, I learned this painfully first-hand...)
For simpler cases, where it may make sense to run a data migration immediately after a schema change, a good generic middle-ground may be configurable hook scripts. A declarative schema management system can then pass relevant info to the hook (which tables were changed, for example) and then the script can run any arbitrary row data diff/apply/migrate type of logic.
I do understand your point though; for relatively straightforward data migrations, an imperative system can capture these much more cleanly by just coupling them with the corresponding schema migration code.
I honestly like the way Rails does it: both capturing the imperative deltas and dumping the final schema which gets checked in. Not a big fan of down migrations, usually a waste of time.
Otherwise I like Percona's OSC, particularly how it can tune down table rewrites when there's competing work, or replication is lagging too much. We're just at the point where we need to automate the OSC tool rather than using it as a point solution for migrating our bigger tenants.
I am guessing that you are probably not using Python/Django... but is this any different than what Django offers?
Django allows you to define your models (schema) and then run a command that will generate the migrations. If you don't like the migration that was generated, you can modify it. You can customize up and down operations.
There are also tools that will take an existing database and generate the Django models for you.
All of these operations can also export the exact SQL that will run on each supported RDBMS platform in case you want to be extra sure on what exactly will be executed.
Django migrations can be problematic because they're meant to be sequential and have interdependencies. I've had problems merging multiple feature branches because of this, even though there are no code conflicts.
A system like Saltstack or Puppet for databases would not have checked-in migrations; these would be generated on the fly at deploy time.
So you could very well have multiple state changes in a single run, by comparing actual DB state and desired DB state, then creating the SQL code as needed for that deployment.
Honestly not having to fiddle with the migrations table on a live server seems pretty nice ;-)
This could very well turn out to be Django's next gen migration tool...
> Django migrations can be problematic because they're meant to be sequential and have interdependencies. I've had problems merging multiple feature branches because of this, even though there are no code conflicts.
They're actually a directed graph; this means a conflict that should have been handled on the branches wasn't, and it would probably have been a problem regardless.
I've rarely encountered logical merge conflicts with migrations, but I could see it happening.
I used to be on the SQL Server team at Microsoft and had some exposure to the customer support teams. So data integrity and eliminating any potential for errors was huge.
So while I love the idea of migrations being generated on the fly from the actual state of Production-System-5 to the desired state of commit 27a73e, I'm skeptical of it working that well in practice. Certain cases come to mind where there might be intermediate migrations from [full name] -> ([first name] [last name]) -> ([first initial] [last name]). The system would have to be smart enough to know that A -> C may require A -> B -> C, or prompt the engineering team for a better DML migration script.
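Sketching that out in MySQL-flavored SQL (hypothetical table and columns) makes the problem obvious: the intermediate DML is where the information lives, and a pure schema diff from A to C never sees it:

    -- A -> B: split full_name into first/last
    ALTER TABLE people ADD COLUMN first_name TEXT, ADD COLUMN last_name TEXT;
    UPDATE people
    SET first_name = SUBSTRING_INDEX(full_name, ' ', 1),
        last_name  = SUBSTRING_INDEX(full_name, ' ', -1);
    ALTER TABLE people DROP COLUMN full_name;

    -- B -> C: reduce first_name to just an initial
    UPDATE people SET first_name = LEFT(first_name, 1);

    -- a direct A -> C schema diff only sees "drop full_name, add first_name/last_name";
    -- it has no way to derive the two UPDATE steps above.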
Also, you will want there to be documentation about what was performed whether that is a migrations table that points to a .py file... or a .json output... or a log file.
Yeah. I’d love to see the academic paper with formalizations that help me understand the true scope of this problem. Your example is a great one that prompts many questions. Is it possible to go directly to the target commit, O(1), or will the code have to calculate the diff of each commit and apply them one at a time, O(n)? And how much definition and dependency mapping would humans need to do to make it work correctly?
The closest I can think of is trying to define a set of CRDT-compatible operations that are expressive enough to describe your database schema, starting from an empty database. Then, the migration you need to perform is whatever the CRDT merge operator says you need to do.
That's great initially, but problems definitely crop up at scale:
* What happens when your company creates new systems that aren't Python/Django? You can either still shoehorn all migrations into Django models, or have multiple separate schema change processes/pipelines... neither option is good.
* If someone makes an out-of-band schema change manually (either by accident or to do a rapid hotfix), you're no longer on a known version. Not sure about Django, but when this happens, most traditional migration systems cannot operate without additional manual cleanups... whereas declarative tools can inherently transition any state to any other state.
* Depending on the DBMS, with large tables and certain types of ALTERs, using an external online schema change tool is greatly preferable to just running an ALTER directly.
* Does Django support sharding / the notion that a change to a sharded model must be made to multiple shards?
* I see your point on not standardizing on one framework. Generally when that has happened for me, it turns into a new service and it has its own database/tables/migration management. It does get quite annoying, for sure.
* I've seen enough things go wrong that on my teams I do not allow DDL to be executed outside of a controlled process that comes from code. But yeah, if that were to happen, it would be annoying to figure out what was done and then try to re-model.
* With Django you can specify exact SQL to run, so you can break up operations into multiple smaller steps. The canonical example is building a new column based on an old column: you first add the column as NULLable, then you populate it in batches of ~10k records, then you add the constraints/indexes (rough sketch after this list).
* I haven't used Django with sharding. It appears there are some posts about it, but it all appears to be community generated content and not part of the official docs.
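A rough sketch of the batched-backfill pattern from the third bullet, with made-up table/column names (MySQL-flavored):

    -- 1. add the new column as nullable so the ALTER stays cheap
    ALTER TABLE orders ADD COLUMN total_cents BIGINT NULL;

    -- 2. backfill in batches of ~10k rows; re-run until no rows are updated
    UPDATE orders
    SET total_cents = ROUND(total_dollars * 100)
    WHERE total_cents IS NULL
    LIMIT 10000;

    -- 3. only then tighten constraints and add indexes
    ALTER TABLE orders MODIFY COLUMN total_cents BIGINT NOT NULL;
    CREATE INDEX idx_orders_total_cents ON orders (total_cents);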
All in all, I could see that at large scale, with very mature engineering organizations, lots of activity, and complex operations, something like Django could fall short and a home-grown system like this may be beneficial, assuming it were reliable enough.
This sounds nice! One question: you said that DML changes are handled via a "standard checked-in SQL file". Does this simply mean a new SQL file for each migration? And how are DML changes connected to DDL changes? For example, if some code is two versions behind and gets updated to the current schema, wouldn't this mean that the DDL is updated in one step to the current state, but the DML potentially in two steps, breaking the update?
That's correct. The DML changes as part of CI are somewhat new so we haven't ironed it all out yet.
Here's the scenario that I think you're laying out:
1. Commit A creates column foo
2. Commit B has DML that references column foo
3. Commit C removes column foo
This works fine if our CI deployer does each commit individually. First roll out any schema changes, then run any DML SQL.
However, our deployer might pick up all of those changes at once. Since we roll out the schema migrations first (in this case the create + drop collapse into a NOP) and then run the DML (which will error), the rollup is an issue.
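Spelled out with a hypothetical table, the rolled-up deploy looks like this:

    -- schema diff across the rollup: the CREATE from commit A and the DROP
    -- from commit C cancel out, so no DDL runs and column foo never exists.
    -- the DML checked in with commit B then runs and fails:
    UPDATE widgets SET foo = 'default';  -- error: unknown column 'foo'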
In practice, we have yet to see this case (most of the time, the dev who writes the DML is close enough to the code to know if the column is going to be dropped soon, and we don't drop that many columns - in part because we know that there be dragons), but truthfully, I haven't thought about it much and need to think through what the impact is beyond this example. Thanks for helping me refine my thinking - I'll have something to ponder this weekend!
I’ll openly admit that we don’t have everything ironed out. In fact, my next big project is to tackle derived columns (a rename is just a derived column where the transformation is the identity function).
It requires a bit more finesse and integration into our code base, as it takes multiple deploys with code that knows how to handle both columns.
Not sure about the state of the world currently after living in BigCo filter bubble for the past few years, but do you even need custom tools to calculate the delta between the schema as checked into VCS vs the database's actual state?
Spanner (https://cloud.google.com/spanner/) I think can auto-compute the diff between its current state and a given schema, generate appropriate SQL statements to perform a migration and get user confirmation for destructive schema changes.