Things to keep in mind include:
* If migrating data to a new column, keep the old data for a while
* After migrating the data (with new code writing old/new data), you may want to do an audit query to validate everything was migrated correctly
* If you need to mutate data inline in the column, add a new column and just copy the (pre-migrated) data to it as a backup (so worst case you have a pristine copy from before), then drop that column some time later
* Add a runtime config flag in your app that lets you treat the new column/truth source as authoritative for a decision, and flip the flag back to false to revert. This could include a percentage, so that you slowly creep load onto a new service and build trust that it functions correctly (see the sketch after this list).
* If the migrated data is quite complicated, make it easy to switch between old/new data sources for your own manual review in your app's code paths at runtime (e.g. render these analytics from Source A or Source B via a runtime flag, query string parameter, etc.)
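For the percentage idea, a minimal sketch of what a deterministic rollout check can look like; the function names, helpers, and bucketing scheme here are invented for illustration, not from the list above:

    import hashlib

    def fetch_report_from_old_source(user_id: str) -> dict:
        ...  # read from the old column/table (placeholder)

    def fetch_report_from_new_source(user_id: str) -> dict:
        ...  # read from the new service/table (placeholder)

    def use_new_source(user_id: str, rollout_pct: int) -> bool:
        # Deterministic bucketing: the same user keeps hitting the same source
        # for a given flag value, which makes discrepancies easier to debug.
        bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
        return bucket < rollout_pct

    def fetch_report(user_id: str, rollout_pct: int) -> dict:
        # rollout_pct comes from runtime config: setting it to 0 reverts entirely to the old source.
        if use_new_source(user_id, rollout_pct):
            return fetch_report_from_new_source(user_id)
        return fetch_report_from_old_source(user_id)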
When the time came to do the cut-over in production, we had a war room, ran the program, and it was all over in seconds. Everything worked perfectly, and I honestly felt a little dejected at how boring the most major change we had ever made as a team turned out to be. I had hyped myself up for stressful debugging, hair pulling, restoring things from backups... but it was all over and done with, perfectly, in a few seconds. I didn't really know what to do with the rest of the day. It worked! Major new feature! Back to work.
I try to make all future "major" migrations equally boring. The time you spend debugging things in development is time you don't spend with production being down. Worth it every time.
There is one thing worth noting, though: a lot of these transformations end with more information than you started with, so you can usually recover the old state if you ever need to (or write some compatibility shim for whatever reason).
I had a whole part about rollbacks (and an intuitive proof of why each step was needed and why none could be skipped) in my notes for this, but I decided to leave it out because I couldn't do it justice just yet.
There are so many little details to cover in this space, and honestly the vocabulary and tooling are a bit poor, so it's hard to be succinct.
I think we need to be a bit more explicit in this part.
Database columns are your system of record. If the code treats them as the source of truth, you need to break that first. If your code is factored properly, you only need a toggle in the retrieval code. If it's not factored properly, now you understand why some of your peers are pushy about separating data retrieval from data use: it concentrates any mutation to a point in time before first use, where you can easily find it when surprises happen and where there can be no concurrency issues. If that's too daunting, you can settle for the lesser solution of replacing each use with a function call.
Once everything is using the new source of truth, you can put a toggle there. And after the migration you can decide whether you want to keep them separate or fold them back together.
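As a rough sketch of what "a toggle in the retrieval code" can look like (the names below are invented, and the flag would normally come from runtime config rather than a constant):

    NEW_SOURCE_IS_AUTHORITATIVE = False  # flip to cut over, flip back to revert

    def legacy_read(account_id: int) -> dict:
        ...  # query the old column/table (placeholder)

    def new_read(account_id: int) -> dict:
        ...  # query the new source of truth (placeholder)

    def load_account(account_id: int) -> dict:
        # Every caller goes through this one function, so the old/new decision
        # (and any later revert) lives in exactly one place.
        if NEW_SOURCE_IS_AUTHORITATIVE:
            return new_read(account_id)
        return legacy_read(account_id)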
Applications are developed continuously, and different versions of the application need to run in test, dev, and staging environments, each of which has different data and a different schema. This makes for a difficult set of challenges in the setup of each environment and the reliable promotion of software from one environment to the next. Staging environments need copies of production data in order to test things properly. They also need all the new data and schema changes required to make the new version of the software work.
So now you're moving data and schema changes back and forth, up and down the pipeline. But unlike with software source code, you do not have Git to clone, branch, and merge your changes. As applications become more data-centric and designers create design systems with all kinds of amazing reactive components for devs to use, each of these components has fields, each of those fields has data, that data needs to be stored in a database, and that database needs to be updated every time new components are added, modified, or used in new ways.
We should manage the revision of data the way we manage the revision of our source code. We should merge our latest changes into the production database. We should simply clone our production database for staging; we should just push changes from a developer's local machine to a test environment!
Creating tools to produce realistic test data and also using fuzzing tools can be just as good and sometimes even better than using actual production data.
Your production database may have medically sensitive PII (or, for something like SOC-2 compliance, any PII at all) that cannot be shared with any human (other than the original user) without prior approval.
Even for non-externally mandated reasons, companies may (and often do) wish to restrict access to production data by developers to minimize concerns around data exfiltration and snooping on user data by company employees.
I doubt this, for two reasons. The first is that the development database usually shares the same schema as the production database, so that's not an issue.
The second is that fake data convincingly takes care of various issues surrounding de-anonymization of data using correlations among bits of data that ostensibly have had their PII-sensitive bits removed.
If protection of user data is a priority, there are far fewer headaches associated with creating entirely fake data to populate the same schema than trying to figure out post-hoc censoring of production data.
That's not to say there aren't valid use cases of the latter. You often will want to do post-hoc censoring/aggregation if you wish to track e.g. usage metrics. This is in fact often a component of ETLs. However, those are removed from everyday development tasks.
Personally, I think we rely on prod access as a crutch because it's easier to expect that than it is to build a sufficient infrastructure. Cloning a prod database or allowing ad-hoc r/w access is on my list of strictly forbidden operations.
Which I think is reasonable if you are not in a strictly controlled space.
Not everyone is able to spend weeks or months on instrumentation efforts.
I believe PlanetScale is trying to do this, introducing branches to the database.
I think some new features of e.g. Postgres allow for something like this, but all SQL databases share the same basic design flaw: no inherent version control. SQL databases are so big and complex that they haven't evolved much in 30 years. That's why there are so many "new" databases that are either a wrapper around Postgres or trying to re-create parts of a relational database from scratch.
It would be great if the next iteration of the SQL standard included version control semantics.
And as long as you take care of backwards compatibility in your abstraction layer (i.e. sensible defaults for new input columns, hide output columns in the view), you can run both the old and the new app side-by-side.
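As a rough illustration of that kind of abstraction layer, assuming Postgres and invented table/column names: the new input column gets a sensible default, and the old app keeps reading a view that hides it, so both versions can run side by side:

    import psycopg2  # assumes a Postgres target

    conn = psycopg2.connect("dbname=app")
    with conn, conn.cursor() as cur:
        # Sensible default, so old code that never sends the new column keeps working.
        cur.execute(
            "ALTER TABLE orders ADD COLUMN priority text NOT NULL DEFAULT 'normal'")
        # The old app reads a view that simply doesn't expose the new column.
        cur.execute("""
            CREATE VIEW orders_v1 AS
            SELECT id, customer_id, total
            FROM orders
        """)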
It’s because we don’t like half of our application logic to be in random stored procedures that may or may not be there, and may or may not be the same across environments.
This way I can almost painlessly revert back, and restore data.
There are also versioned databases like TerminusDB and Dolt that try to do something similar in the database itself.
Production is production. Everywhere else, the database is temporary. It is blown away and brought up to a known, source-code-defined state when the application is deployed. You never migrate any database except production. You always rebuild from scratch. Think of the state of the database as being more like the binary than the source code.
The state of a database is defined as the content of a specific base dump from production, plus some number of SQL scripts to run on top of it.
Base dumps need to be put somewhere accessible. When I worked with Oracle, they lived on the server. You run a command to create a dump on the server, then get a sysadmin to scp the dump file to your dev/test database server. With PostgreSQL, you can dump to a local file, so set up a file server or something where dumps can go. It could even be in source control, using large file support. Base dump files are named with a timestamp, GUID, etc.
In the app codebase, there is some config giving the name of the base dump, and those additional SQL scripts. There's some tool, script, build command, etc, which uses those to rebuild the database - delete (or rename!) database, restore base dump, run SQL.
You might not want to include everything in the dump. When I was doing e-commerce work, we would dump the schema and the catalogue data, but not user data (profiles, orders, etc.), because of data protection. If your database is very big, obviously you will have to be very selective about data here.
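A rough sketch of such a rebuild tool, assuming the standard PostgreSQL command-line utilities are on PATH; the database names, dump filename, excluded tables, and script list are placeholders for whatever your config actually specifies:

    import subprocess

    BASE_DUMP = "base_2021-06-01.dump"                      # named in the app's config
    SCRIPTS = ["001_add_priority.sql", "002_backfill.sql"]  # pending SQL scripts, also from config

    def take_base_dump() -> None:
        # Dump schema plus catalogue data, but skip the tables holding user data.
        subprocess.run(
            ["pg_dump", "-Fc",
             "--exclude-table-data=profiles", "--exclude-table-data=orders",
             "-f", BASE_DUMP, "prod"],
            check=True)

    def rebuild_dev_db() -> None:
        subprocess.run(["dropdb", "--if-exists", "app_dev"], check=True)  # or rename it instead
        subprocess.run(["createdb", "app_dev"], check=True)
        subprocess.run(["pg_restore", "-d", "app_dev", BASE_DUMP], check=True)
        for script in SCRIPTS:  # replay the pending scripts on top of the base dump
            subprocess.run(
                ["psql", "-d", "app_dev", "-v", "ON_ERROR_STOP=1", "-f", script],
                check=True)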
On one project, the process of rebuilding a database was quite slow, because there was a lot of stuff to add on top of the dump (not just SQL, some of it used some weird data loader tool). To make that faster, we did caching: after a rebuild, we would dump the state of the database, then when doing a rebuild, check if there was already a suitable dump, and if so, just load that. The dump was named as a hash of everything that went into it. Easy to get that wrong! For example, the hash should include the version of the data loader tool.
When you finally release to production, then as long as the dump is similar to what is live, you just run the SQL scripts; it's like a dev deploy but without deleting the old database and loading a dump (deleting prod and loading a dump of prod peephole optimises to doing nothing!). 'Similar' means that the schema is the same and the data, if any, is reasonably fresh. If the dump is behind production somehow, you have some dev work to do where you update to a fresh dump, and amend the SQL scripts as needed.
Some time after a release to production, take a fresh dump, and propagate it into the source code: update the dump filename in config, and remove any SQL scripts which have now been released. You don't have to do this right away, because the fresh dump should really correspond to the old dump plus the scripts anyway.
If there is additional data or schema you need in some lower environments, that's fine. Include that as some extra SQL scripts which will never be run on production.
That gives you at least a base level of sanity checks that your migrations work. Of course, this is no guarantee that your production migration will also work - you may, for example, add an FK constraint while there are violating rows in prod - but for that you can wrap your migration in a transaction. Upon failure, the migration is rolled back and, of course, you should then also roll back the deployment. Then you can start investigating why the migration went wrong.
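For example, a minimal sketch of a transaction-wrapped migration using psycopg2 against Postgres (which supports transactional DDL); the table and constraint names are invented:

    import psycopg2

    conn = psycopg2.connect("dbname=app")
    try:
        with conn, conn.cursor() as cur:
            # If any existing row violates the new constraint, this raises and the
            # whole transaction is rolled back, leaving the schema untouched.
            cur.execute("""
                ALTER TABLE orders
                ADD CONSTRAINT orders_customer_fk
                FOREIGN KEY (customer_id) REFERENCES customers (id)
            """)
    except psycopg2.Error as exc:
        print(f"migration failed and was rolled back: {exc}")  # also roll back the deployment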
Also, creating dumps from prod and storing them locally is something I don't recommend (and it's also disallowed at the company I work for) unless you take care to purge/mask all sensitive data (all PII and everything that includes trade secrets of some sort) before it reaches your local machine.
The process of loading a base dump and then running SQL scripts is precisely to mimic what will happen when you release to production. As you say, this is a good test bed!
What you shouldn't do is try to move between existing intermediate, non-production-like states of the database by migrating. And really, what I object to are down migrations, because you won't ever use those in production. Migrating up is not so bad. But since running individual up migrations is something you will never do in production, why do it in dev?
Indeed, always running your whole queue of up migrations on top of a fresh base dump is a way of catching the kind of problem you hit in production, where existing data violates a new constraint.
The nuke-and-pave rebuild approach also avoids the problem of moving between branches and reverting pushed but unreleased schema changes. Those become trivial.
As you say, and as I said in my initial comment, there will be data in production that you can't or don't want to use in dev for some reason. But that's just another development task - you need to take your base dumps in such a way that that data is not included. You can exclude certain tables. You can dump only certain rows, corresponding to some kind of testing user, etc. (one bank I worked with had a "shadow branch" defined in their production systems, a made-up branch with made-up customers, that they could use for final testing in the actual production environment). You can dump into an intermediate database in production, anonymise and delete any data you need to, and then take the base dump. I'm sure there are loads of ways to do this.
I feel like this kind of thing isn't really taught. I mean, it was a long time ago that I went to uni, so I don't know for sure. But it seems to me like a lot of developers either learn from someone else because they're lucky enough to get a position somewhere where it's already known, or they have to invent it again for themselves. But it seems important enough that it should be an expected part of professional training.
Also, the fact the author felt it necessary to call it "pedantic" speaks volumes. Imagine an electrician calling it pedantic to check that the mains has been disconnected or a capacitor has been discharged. There's nothing pedantic about good, formal practice.
That chapter lists many techniques for live upgrades.
Oh, the users can't take it, they'll go away, yadda yadda. That's usually BS. I've witnessed giant sites with massive traffic that go offline every night. Apple's own web store goes offline before updates every time (most valuable brand, looks like this doesn't affect them, huh?). Every stock exchange also goes offline every night and every weekend (except the fancy cryptocoin ones, I guess).
My overall point is, keep things simple. Planned downtime can vastly simplify your update deployment, and planned downtime can result in less unplanned downtime. So don't give it up for cargo cult reasons. You're not Google Search until proven otherwise.
Also you can have planned downtime piecemeal. If your data is sharded in some way (for example by user) you can update the user when they log out and you'll seemingly have no global downtime.
But downtime is inherent to well-functioning systems. You sleep every night; if you don't, things go haywire.
Agreed, and in a lot of cases the planned downtime is a matter of seconds which will only be noticeable if someone happens to be loading your page at the exact point in time where your web server is offline.
In most cases it doesn't really take that long to change a column in a table with 15,000 rows on it. For most apps I'd much rather go the route of making this a dead simple operation with 5 seconds of downtime vs a multi-migration 8 step approach.
It's 2021. We may not have flying cars, but we do have zero downtime deployments, sometimes. Let's not go back to the dark ages please.
The only thing you said back which I can trace back to an objective problem is that you don't want to stay late.
It doesn't tell you whether the steps are "precarious" or "predictable", whether or not you have a quick way of recovering, or how small or big the update is.
You can just as easily screw up your application state without downtime.
You gotta be careful thinking purely by association. It's natural and intuitive, but often causes you to group unrelated characteristics together.
Maybe my update is as simple as adding a column to a table. If I have millions of rows, that won't be instant; it can take, say, half an hour. In InnoDB, for example, it'll do a full table copy to add a column. But it's not precarious, fragile, or dangerous. It's just what it says on the can. It'll either do the copy or it won't.
- engineers/CS staff tend to want to work during working hours
=> it is hard to plan for downtime during normal working hours.
If you want to deploy multiple times a day, and you want to be able to merge new code into your main branch without gating merges behind "deploy windows" with downtime, a great solution is to just do the extra work to avoid downtime on simple stuff.
I think the irony is that you really need to already be a certain size to do downtime-centric workflows reasonably without asking a bunch of people to do stuff at weird hours.
The irony you point out isn't a huge irony. When your business is small, downtime is fine. As you grow and it becomes "not fine", you have people around the world who can handle it during hours that are off-peak for the users but normal dev working hours for them.
By asking for changes to be done in a highly-available way, you reduce the coordination effort needed for a release. "Our main branch is always deployable" is a great default! And while actual DB schema changes don't happen multiple times a day, even on a relatively small team simple DB changes come into the pipeline maybe once a week. Most aren't backwards-incompatible, but the beauty of it is that we don't even have to think about that much, because of this strategy.
Maybe the "no downtime" stuff is unreasonable. In my case, we are a relatively small team servicing a good amount of customers (who do time-sensitive work), so the effort here is worth it. But if you can get away with it (with downtime windows being big enough to recover from issues), then it's definitely gonna be easier.
- Read from new representation
- If not found:
  - read X from old representation
  - migrate X to new representation (complex business logic here)
  - write migrated datum to new representation
  - return migrated datum
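A minimal sketch of that lazy, read-through migration, with plain dicts standing in for the two representations and a placeholder migrate() for the business logic:

    old_store: dict = {}  # stand-in for the old representation
    new_store: dict = {}  # stand-in for the new representation

    def migrate(old_record):
        ...  # the complex business logic: transform the old representation into the new one

    def read_record(key):
        record = new_store.get(key)
        if record is not None:
            return record

        old_record = old_store.get(key)
        if old_record is None:
            return None            # never existed in either representation

        record = migrate(old_record)
        new_store[key] = record    # write through so the next read skips all of this
        return record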
However, if you can get away with some downtime, you can instead notify your users of a planned outage, shutdown, update everything, and startup again. This eliminates several steps of the process, which has the added benefit of reducing the number of places where something can go wrong.
To put this in context: say you guarantee 99.9% uptime - that's still 43.2 min / month of downtime available to you! You can update quite a lot of machines / database rows in even 10-20 min, making this viable in many cases. (Obviously there is a scale threshold past which this is no longer viable, but if you're at that threshold you can probably also spare the engineering effort to go the zero-downtime route!)
You can decouple migrations from the release, and rest easy knowing your app is working without praying that a massive migration happens successfully and promptly.
Also, since your UI or admin interface likely lets you inspect your data in either representation, you can more naturally verify that up/down migrations work in staging.
It actually also removes some advantages. The author fails to point this out, but the advantage of these small steps is that you can roll back to the previous version at any time without losing writes. If you find a production bug that does not impact data consistency, you can go back one version without losing the writes that happened while the new version was up.
> To put this in context: say you guarantee 99.9% uptime - that's still 43.2 min / month of downtime available to you!
On the other hand, there might be a fatal bug at some point and you will need 3 hours of leeway to get up & running again. Shutting down is surely the easier method, but the available downtime is usually not free real estate for updates.
Looking through the comments on this post, I see a lot of conflation between "not doing zero-downtime deployments" and "seat-of-your-pants engineering", as though fully orchestrated / containerized blue-green zero-downtime deployments were the One True Way for all teams / organizations / projects, and completely inseparable from other parts of good engineering practice.
If you have a reasonable deployment strategy, then the chance that someone will talk to your application in the exact moment that the app is in some invalid state, might be quite small.
One thing I liked was that the chosen example is changing Booleans to enums. This almost always happens: your field is almost never really a Boolean. This goes for code too. We've all read code like:
foo(false, false, true, null);
Or, my favorite:
So yeah, my tip for the day is just make any Boolean an enumeration from the outset.
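A small, made-up sketch of the idea (the enum values and function are invented for illustration):

    from enum import Enum

    class Visibility(Enum):
        PRIVATE = "private"
        PUBLIC = "public"
        UNLISTED = "unlisted"  # the third state that "couldn't happen" while it was a bool

    def publish(post_id: int, visibility: Visibility) -> None:
        ...  # placeholder body

    # The call site now explains itself, and adding another state later means
    # adding an enum member rather than rewriting every boolean column and argument.
    publish(42, Visibility.UNLISTED)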
Here's the counterpoint: an ordered argument list will break if the order is rearranged in the function definition, while named arguments won't.
Counter-counterpoint: once you add support for named arguments, you cannot do either, as you don't know which convention the caller uses.
Day to day I use Hack (FB's PHP fork), which has a type system and shapes, which are essentially anonymous objects in the JS/TS sense but a little more awkward syntactically:
    function foo(shape(
      'a' => int,
      'b' => int,
      ?'c' => int,
    ) $args): void { /* ... */ }

    foo(shape('a' => 1, 'b' => 2));
Is it as simple as keeping the data around, but e.g. renaming columns to "columnname_deprecated_yyyy_MM_dd" and then in the unlikely event that something starts failing, you can at least rename it back to the old name while putting out that fire?
Another one might be to log queries and check for use of that old column.
Two caveats: (1) ensure this logging doesn't cause a performance bottleneck, and (2) it might not catch queries that use globs (i.e. "select * from ..."). For (1), you can enable the logging on only some of the instances in your DB cluster; for (2), try to always use explicit column references.
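As a rough sketch of the rename-and-watch approach, assuming Postgres with the pg_stat_statements extension enabled; the table and column names are invented:

    import psycopg2

    conn = psycopg2.connect("dbname=app")
    with conn, conn.cursor() as cur:
        cur.execute(
            "ALTER TABLE users RENAME COLUMN fax_number "
            "TO fax_number_deprecated_2021_06_01")

    with conn, conn.cursor() as cur:
        # Anything still hitting the renamed column shows up here; note that
        # "select * from ..." queries won't, per the caveat above.
        cur.execute(
            "SELECT query, calls FROM pg_stat_statements WHERE query ILIKE %s",
            ("%fax_number_deprecated%",))
        for query, calls in cur.fetchall():
            print(calls, query)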
In that case, you'd want to add a step between steps 2 and 3 where you double-read, either inline or within a shadow workload, and add telemetry to alert you if there are discrepancies between the two versions.
Only when you're confident that the two data models are producing acceptably equivalent results would you cut over the primary reads to the new model.
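A minimal sketch of such a double-read with mismatch telemetry; the two model objects are placeholders for whatever the old and new data access layers are:

    import logging

    log = logging.getLogger("dual_read")

    def get_balance(account_id, old_model, new_model):
        old_value = old_model.get_balance(account_id)      # still authoritative
        try:
            new_value = new_model.get_balance(account_id)  # shadow read of the new data model
            if new_value != old_value:
                log.warning("balance mismatch for %s: old=%r new=%r",
                            account_id, old_value, new_value)
        except Exception:
            log.exception("shadow read failed for %s", account_id)
        return old_value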
This pattern often comes to play when you break off a piece of state from a monolithic database into a dedicated service.
You might find some software frameworks that make this easier than others but expecting to find a technology that replaces human methodology is like reading a book on Scrum and saying “can’t we make a software system that does this for us?”
If you had a Django plugin that, like, let you say “oh the previous version of the app that was deployed had this model shape, so during this migration I need to do X/Y/Z” using the git info, you could do more.
But that would be a leaky abstraction and would run into a sort of coordination problem mid-deploy (what states are other machines in?). There might be some restrictions you could place to make this work though
But I think the article doesn't talk about full data migrations or integrations of two very distinct models, but rather about incremental, on the fly migrations on production databases, so the above doesn't apply well I think.
The SAP Cloud Application Programming Model (CAP) lets you derive your DB schema and UI from one semantic data model. The runtime framework interprets this and provides (deep) CRUD functionality out of the box. Business logic can then be inserted using different request handlers. Therefore you usually have (depending on additional business logic) only one place to define your model.
a) Reliably determine which step you are on - this includes both migration steps and application deployments
b) Fully automate your deployments from a list of staged code changes ("normal" CD will have issues with this)
c) Fully automate migrations _after_ all deployments are complete
Those are all totally possible steps, but they are all going to vary based on your setup, making a drop-in solution for this process unlikely to be easy.
Additionally, these are only the steps if everything works! What happens if there's a bug halfway through? You probably want to stop the deployment process and either revert or quickly hotfix. A human-based system can handle this quite easily, but an automated system will need to be able to measure errors and reliably determine if an issue exists.
So it's not that this can't be automated - in fact, I know that it has been for several large organizations - but the hurdle to get to the full level of automation is rather high and fraught with secondary issues to consider.
They will churn until all the steps are complete and validated. If a step fails, the rest of the steps won't happen. You can also write rollback steps that depend on yet another value written if a step or test fails. And if you're paranoid, you can force the steps to wait for another value to be flipped which you would flip manually, so you can inspect everything before allowing the next step to fire.
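A toy sketch of that kind of step runner, persisting step state to a local file; none of this is from a real framework, and the Step fields are invented:

    import json
    from dataclasses import dataclass, field
    from typing import Callable, Optional

    STATE_FILE = "migration_state.json"

    @dataclass
    class Step:
        name: str
        run: Callable[[], None]
        validate: Callable[[], None]           # raise if the step's outcome looks wrong
        depends_on: list = field(default_factory=list)
        rollback: Optional[Callable[[], None]] = None
        manual_gate: Optional[str] = None      # a value a human flips to "approved"

    def load_state() -> dict:
        try:
            with open(STATE_FILE) as f:
                return json.load(f)
        except FileNotFoundError:
            return {}

    def save_state(state: dict) -> None:
        with open(STATE_FILE, "w") as f:
            json.dump(state, f)

    def run_steps(steps: list) -> None:
        state = load_state()
        for step in steps:
            if state.get(step.name) == "done":
                continue                                            # finished on an earlier run
            if any(state.get(dep) != "done" for dep in step.depends_on):
                return                                              # wait and churn again later
            if step.manual_gate and state.get(step.manual_gate) != "approved":
                return                                              # a human approves before this fires
            try:
                step.run()
                step.validate()
                state[step.name] = "done"
            except Exception:
                state[step.name] = "failed"                         # later steps never fire
                if step.rollback:
                    step.rollback()
                save_state(state)
                raise
            save_state(state)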
All of this is complicated, so, try to only write backwards compatible code where the data model never changes, or only ever adds fields, or use feature flags, or some other strategy. There's a book on this stuff somewhere...