Hacker News new | past | comments | ask | show | jobs | submit login
The pedantic checklist for changing your data model in a web application (rtpg.co)
310 points by rtpg 7 months ago | hide | past | favorite | 98 comments

My advice whenever you have large, complex, or slow to perform (eg hours/days) migrations, that will take some time to process, don't put yourself into a position of "we can't go back", until the very last moment, after you have been assured as much as possible things are fine.

Things to keep in mind include:

* If migrating data to a new column, keep the old data for a while

* After migrating the data (with new code writing old/new data), you may want to do an audit query to validate everything was migrated correctly

* If you need to mutate data inline in the column, add a new column and just copy the (pre-migrated) data to it as a backup (so worst case you have a pristine copy from before), then drop that column some time later

* Runtime config flag in your app that lets you read the new column/truth source as authoritative for a decision, or put the flag back to false to revert it. This could include a percentage so that you slowly creep load onto a new service and build trust in it functioning correctly.

* If the data migrated is quite complicated, help yourself to easily switch between old/new data sources for your own manual review in your apps code paths at runtime (eg, render this analytics from Source A or Source B via a runtime flag, query string parameter etc)

I strongly agree with this. Relatively recently, I did a major migration to clean up some weird account states that we didn't want to support anymore, so we could launch a new feature. I spent some time writing all the SQL queries to do the migration, went into staging, ran them... and found that there were a lot of edge cases that weren't handled correctly. It was terrible. I decided to abort the SQL statement approach, and just wrote a somewhat-complicated program with extensive unit tests to handle the migration. The tests exercised the edge cases and ensured that they were accounted for correctly in the migration. (And we had time to come up with a real plan for each of them.)

When the time came to do the cut-over in production, we had a war room, ran the program, and it was all over in seconds. Everything worked perfectly and I honestly felt a little dejected how boring the most major change we had ever made as a team went. I had hyped myself up for some stressful debugging, hair pulling, restoring things from backups... but it was all over and done with perfectly in a few seconds. I didn't really know what to do with the rest of the day. It worked! Major new feature! Back to work.

I try to make all future "major" migrations equally boring. The time you spend debugging things in development is time you don't spend with production being down. Worth it every time.

Original author here. I do think it’s good to think about rollback strategies (especially given that your change might go out with something else that might need rollbacks).

There is one thing though, and that is that a lot of these transformations end with more information, so you can usually recover the old state if you ever needed to (or write some compatibility shim for whatever reason)

I had a whole part about rollbacks (and an intuitive proof about why each step was needed and not being able to skip any) in my notes for this, but I decided to leave it out cuz I couldn’t do it justice just yet.

There are so many little details to cover in this space, and honestly the vocabulary and tooling are a bit poor so it’s hard to be succinct

> new column/truth source

I think we need to be a bit more explicit in this part.

Database columns are your system of record. If the code treats them as the source of truth, you need to break that first. If your code is factored properly you only need a toggle in the retrieval code. If it's not factored properly, now you understand why some of your peers are pushy about separating data retrieval from data use - it concentrates any mutation to a point in time before first use where you can easily find it when surprises happen and where there can be no concurrency issues. If that's too daunting you can settle for the lesser solution of replacing use with a function call.

Once everything is using the new source of truth, you can put a toggle there. And after the migration you can decide if you want to keep them separate or fold them back together.

I am wondering how the long running migrations are done. Are they running as a scheduled job outside the app? I can imagine it can't be part of the app itself as that would block deploying a new version of the app (/docker image)

Managing Databases in a CI/CD pipeline is frankly difficult.

Applications are developed continuously, different versions of the application need to run in test, dev and staging environments, each of which has different data and a different schema. This makes for a difficult set of challenges in the set up of each environment, and the reliable promotion of software from one environment to the next. Staging environments need copies of production data in order to test things properly. They also need all the new data and schema changes that are required to make the new version of the software work.

So now you’re moving data and schema changes back and forth, up and down the pipeline. But unlike with software source code, you do not have Git to clone, branch and merge your changes. As applications become more data-centric and designers create design systems with all kinds of amazing reactive components for devs to use in their applications, each of these components has fields and each of these fields have data, and this data needs to be stored in a database, and this database needs to updated every time new components are added, modified, or used in new ways.

We should manage the revision of data the way we manage the revision of our source code. We should merge our latest changes into the production database. We should simply clone our production database for staging, we should just push changes from a developer’s local machines to a test environment!

Sometimes you can't use production data in your staging environment, like when prevented by privacy laws. Often times developers who would not have access to production data would have permissions to view data in a staging environment...

Creating tools to produce realistic test data and also using fuzzing tools can be just as good and sometimes even better than using actual production data.

I'm having a hard time visualizing this - are we talking outside devs (consultants) or folks in a different jurisdiction who don't have rights under law, or just a company administrative policy? Agree that fake data can work well, but if you can clone the production DB, that seems a preferable and easier approach.

There's a variety of reasons this may happen that stem from either legal, certification, or company reasons.

Your production database may have medically-sensitive PII (or for something like SOC-2 compliance any PII at all) that cannot be shared any human (other than the original user) unless with prior approval.

Even for non-externally mandated reasons, companies may (and often do) wish to restrict access to production data by developers to minimize concerns around data exfiltration and snooping on user data by company employees.

that makes sense - I suppose using my model you could mask some of the data in a versioned graph or a collection contained in the database that can be surfaced up to other users who can then clone the collection that excludes PII. You could run the main collection and the PII free collection in the same data product. This might be an easier approach than creating fake data & fake schema.

> This might be an easier approach than creating fake data & fake schema.

I doubt this. This is for two reasons: the first is that the development database usually shares the same schema as the production database so that's not an issue.

The second is that fake data convincingly takes care of various issues surrounding de-anonymization of data using correlations among bits of data that ostensibly have had their PII-sensitive bits removed.

If protection of user data is a priority, there are far fewer headaches associated with creating entirely fake data to populate the same schema than trying to figure out post-hoc censoring of production data.

That's not to say there aren't valid use cases of the latter. You often will want to do post-hoc censoring/aggregation if you wish to track e.g. usage metrics. This is in fact often a component of ETLs. However, those are removed from everyday development tasks.

I am struggling to think of any time it would be appropriate to clone a production database and use it for staging. That almost inevitably translates into any developer who ever deploys database-related changes having access to the contents of the production database, or at least every part of it that might ever be affected by code changes they are making. There are numerous legal, regulatory and ethical issues with granting such broad access, unless you're small enough that the few people you obviously do need to have the ability to grant full DB access happen to include anyone who is likely to change code that talks to the database. And even then it still seems like bad practice to actually grant anyone that access when it's not strictly necessary!

Legally. I've worked in healthcare and any direct access to production is strictly forbidden. If you can't figure it out through your logs and monitoring, and you can't reproduce in earlier environments, then the problem isn't that you don't have prod access, the problem is that your instrumentation has a gap.

Personally, I think we rely on prod access as a crutch because it's easier to expect that than it is to build a sufficient infrastructure. Cloning a prod database or allowing ad-hoc r/w access is on my list of strictly forbidden operations.

> Personally, I think we rely on prod access as a crutch because it's easier to expect that than it is to build a sufficient infrastructure.

Which I think is reasonable if you are not in a strictly controlled space.

Not everyone is able to spend weeks or months on instrumentation efforts.

For example, we, as a company have decided on some rules around PII (personally identifyable information) that means, amongst other things, that we don’t have any of it outside of production. We therefore cannot just copy data from production to staging/dev.

Alternatively, my company tends to rely on “testing in production” to avoid having different environments. If we need to try something new, we use feature gates/flags to only give select users access to experimental code paths. It’s worked well for us, but it does take some getting used to

> We should manage the revision of data the way we manage the revision of our source code

I believe planetscale is trying to do this, introducing branches to the database https://docs.planetscale.com/concepts/nonblocking-schema-cha...

Many frameworks and libraries (ActiveRecord, flyway, etc.) allow you to write migration scripts that get automatically applied during deployment. In that sense, you can version changes to your DB. Of course, it's true that you don't have a strict guarantee that your prod DB matches these scripts, but as long as you don't change the schema from outside such scripts and don't manually rollback/delete them etc., it should work out well in practice—at least, I've never had an issue with that specific part of migrations (of course, there are other issues you can run into with migrations).

This fails as soon as you have multiple instances of your app running (for load balancing or HA reason). Server one might still be running version A while server two is running the migrations done by version B - leading A to crash or corrupt data. If shutting down for the length of the update is not an option anymore, you will need to use the process the author pointed out.

That's not really related to the topic of DB versioning though, I feel. What you said is true, however it only matters if the DB operations are backwards-incompatible (something which you can avoid by using the approaches from the article).

> We should manage the revision of data the way we manage the revision of our source code.

I think some new features of e.g. Postgres allow for something like this, but all SQL databases have the same basic design flaw: no inherent version control. SQL Databases are so big and complex that they haven't evolved much in 30 years. That's why there's so many "new" databases that are either a wrapper around Postgres, or trying to re-create parts of a relational database from scratch.

It would be great if the next iteration of the SQL standard included version control semantics.

It's really not that difficult if you do the necessary legwork beforehand. That basically means, do not access the tables directly. Databases offer a read abstraction via views, and a write abstraction via stored procedures. Use those in your app, and suddenly you can atomically change your database schema without your app even having to go offline.

And as long as you take care of backwards compatibility in your abstraction layer (i.e. sensible defaults for new input columns, hide output columns in the view), you can run both the old and the new app side-by-side.

That requires the developers to be very familiar with databases, which they usually are not. And I've never heard a single developer in my entire career say "Hey let's talk to a DBA before we design and implement this feature".

> Hey let's talk to a DBA before we design and implement this feature

It’s because we don’t like half of our application logic to be in random stored procedures that may or may not be there, and may or may not be the same across environments.

I use .Net Core, and often make changes to data base through the application and the application updates the database in both staging/production. This creates a migration history and allows me to revert back to a previous database state. However, in some instances data is still lost, in that scenario I would just do a back up.

This way I can almost painlessly revert back, and restore data.

I've not looked into it a ton, but would something like Pachyderm address this?

Yes - pachyderm is good for managing and running distributed data pipelines on flat files

There are also versioned databases like TerminusDB and Dolt that try to do similar in the database

This stuff is not trivial, but it's also not rocket science.

Production is production. Everywhere else, the database is temporary. It is be blown away and brought up to a known, source-code-defined state when the application is deployed. You never migrate any database except production. You always rebuild from scratch. Think of the state of the database as being more like the binary than the source code.

The state of a database is defined as the content of a specific base dump from production, plus some number of SQL scripts to run on top of it.

Base dumps need to be put somewhere accessible. When i worked with Oracle, they lived on the server. You run a command to create a dump on the server, then get a sysadmin to scp the dump file to your dev/test database server. With PostgreSQL, you can dump to a local file, so set up a file server or something where dumps can go. Could even be in source control, using large file support. Base dump files are named with a timestamp, GUID, etc.

In the app codebase, there is some config giving the name of the base dump, and those additional SQL scripts. There's some tool, script, build command, etc, which uses those to rebuild the database - delete (or rename!) database, restore base dump, run SQL.

You might not want to include everything in the dump. When i was doing e-commerce work, we would dump the schema and the catalogue data, but not user data (profiles, orders, etc), because of data protection. If your database is very big, obviously you will have to be very selective about data here.

On one project, the process of rebuilding a database was quite slow, because there was a lot of stuff to add on top of the dump (not just SQL, some of it used some weird data loader tool). To make that faster, we did caching: after a rebuild, we would dump the state of the database, then when doing a rebuild, check if there was already a suitable dump, and if so, just load that. The dump was named as a hash of everything that went into it. Easy to get that wrong! For example, the hash should include the version of the data loader tool.

When you finally release to production, then as long as the dump is similar to what is live, you just run the SQL scripts; it's like a dev deploy but without deleting the old database and loading a dump (deleting prod and loading a dump of prod peephole optimises to doing nothing!). 'Similar' means that the schema is the same and the data, if any, is reasonably fresh. If the dump is behind production somehow, you have some dev work to do where you update to a fresh dump, and amend the SQL scripts as needed.

Some time after a release to production, take a fresh dump, and propagate it into the source code: update the dump filename in config, and remove any SQL scripts which have now been released. You don't have to do this right away, because the fresh dump should really correspond to the old dump plus the scripts anyway.

If there is additional data or schema you need in some lower environments, that's fine. Include that as some extra SQL scripts which will never be run on production.

I don't understand why you shouldn't migrate non-production databases. I agree that there should not be an expectation that they are stable (although it can still be useful to keep data around in test environments), but if you constantly apply migrations upon every deployment (as well as locally), you have a good test bed for the production migration.

That gives you at least a base level of sanity checks that your migrations work. Of course, this is no guarantee that your production migration will also work - you may e.g. add a FK constraint but there are violating rows in prod -, but for that you can wrap your migration in a transaction. Upon failure, the migration is rolled back and, of course, you should then also rollback the deployment. Then you can start investigating why the migration went wrong.

Also, creating dumps from prod and storing them locally is something I don't recommend (and it's also disallowed at the company I work for) unless you take care to purge/mask all sensitive data (all PII and everything that includes trade secrets of some sort) before it reaches your local machine.

I wasn't very clear in my statement about migrations, sorry. I'll put it another way: the only migrations you should run in dev are ones you intend to eventually run in production.

The process of loading a base dump and then running SQL scripts is precisely to mimic what will happen when you release to production. As you say, this is a good test bed!

What you shouldn't do is try to move between existing intermediate, non-production-like, states of the database by migrating. And really, what i object to are down migrations, because you won't ever use those in production. Migrating up is not so bad. But since running individual up migrations is something you will never do in production, why do it in dev?

Indeed, always running your whole queue of up migrations on top of a fresh base dump is a way of catching the kind of problem you hit in production, where existing data violates a new constraint.

The nuke-and-pave rebuild approach also avoids the problem of moving between branches and reverting pushed but unreleased schema changes. Those become trivial.

As you say, and as i said in my initial comment, there will be data in production that you can't or don't want to use in dev for some reason. But that's just another development task - you need to take your base dumps in such a way that that data is not included. You can exclude certain tables. You can only dump certain rows, corresponding to some kind of testing user etc (one bank i worked with had a "shadow branch" defined in their production systems, a made-up branch with made-up customers, that they could use for final testing in the actual production environment). You can dump into an intermediate database in production, anonymise and delete any data you need to, and then take the base dump. I'm sure there are loads of ways to do this.

Ah yes, I generally agree with what you said (although I think it's fine to occasionally do destructive and weird things to your local DB and roll them back if you're working iteratively; you can either recreate from scratch if you want to make sure you didn't break anything and otherwise you might still have some pre-production stage where you would notice it if the migrations didn't work).

I think this only really works as long as your dumps fit on one machine.

This is actually really useful to see written down. Perhaps a more complicated example would be useful, like changing the relationships between two or more models.

I feel like this kind of thing isn't really taught. I mean, it was a long time ago that I went to uni, so I don't know for sure. But it seems to me like a lot of developers either learn from someone else because they're lucky enough to get a position somewhere where it's already known, or they have to invent it again for themselves. But it seems important enough that it should be an expected part of professional training.

Also, the fact the author felt it necessary to call it "pedantic" speaks volumes. Imagine an electrician calling it pedantic to check that the mains has been disconnected or a capacitor has been discharged. There's nothing pedantic about good, formal practice.

Our whole field is both constantly changing and (AFAIK) has no formal trade organization like electricians/plumbers. We also aren't licensed like electricians and plumbers. It's just a bunch of academics, and a bunch of haggard people who've been in the trenches, and a lot of random books with great lessons that nobody reads.

I agree. On a similar vein I was trying to figure out how best to model something complex recently, what kind of data structures to use, etc, and Google searches don't really help. Where are the books on this kind of stuff for our field? In the end I just invented something decent, but it seems like a missed opportunity.

This is called The McHenry Technique in the book The Practice of Cloud System Administration by Limoncelli/Chalup/Hogan.

That chapter lists many techniques for live upgrades.

This comment is the top Google result for this term.

The fact the author was kidding by writing "you can't have (planned) downtime" in itself shows everything has downtime. So if your app can have downtime, why not have some planned downtime as well?

Oh the users can't take it, they'll go away, yadda yadda. That's usually BS. I've witnessed giant sites with massive traffic that go offline every night. Apple's own web store goes offline before updates every time (most valuable brand, looks like this doesn't affect them huh?). Every marketing exchange also goes offline every day and weekend (except the fancy cryptocoin ones I guess).

My overall point is, keep things simple. Planned downtime can vastly simplify your update deployment, and planned downtime can result in less unplanned downtime. So don't give it up for cargo cult reasons. You're not Google Search until proven otherwise.

Also you can have planned downtime piecemeal. If your data is sharded in some way (for example by user) you can update the user when they log out and you'll seemingly have no global downtime.

But downtime is inherent to well working systems. You sleep every night. If you don't things go haywire.

> So don't give it up for cargo cult reasons. You're not Google Search until proven otherwise.

Agreed, and in a lot of cases the planned downtime is a matter of seconds which will only be noticeable if someone happens to be loading your page at the exact point in time where your web server is offline.

In most cases it doesn't really take that long to change a column in a table with 15,000 rows on it. For most apps I'd much rather go the route of making this a dead simple operation with 5 seconds of downtime vs a multi-migration 8 step approach.

I don't want to have to stay up late to do it. I don't want to have to deal with a rollback. I, and the developer, want to be able to make changes when we feel like it, not based on a schedule. The developer does their work at their pace, I go about my work without thinking about their deployment. Apps with planned downtime are often poorly designed and have other problems, usually due to nobody taking the time to make them work better or fix tech debt. Downtime deployments can lead to huge changes which, if not completed successfully, can lead to much longer downtimes in the event of a rollback. And, yeah, I don't want my customer to see any downtime or errors. Call me fussy, but I want our product to work better than everyone else's.

It's 2021. We may not have flying cars, but we do have zero downtime deployments, sometimes. Let's not go back to the dark ages please.

Unfortunately "it's CURRENT_YEAR" and "downtime means you're poorly designed" are very poor, circular arguments.

The only thing you said back which I can trace back to an objective problem is that you don't want to stay late.

Would you rather a change 1) depend on two people doing a series of precarious steps and recovery which could possibly lead to a large amount of downtime, or 2) depend on one person doing a series of predictable small steps through automation which won't lead to any downtime? I prefer #2. Also I don't want to stay late.

Planned downtime means only one thing: you have an atomic update, that needs some downtime.

It doesn't inform you that the steps are "precarious" vs. "predictable" or that you have have or don't have quick way of recovery, or how small or big the update is.

You can just as easily screw up your application state without downtime.

You gotta be careful thinking purely by association. It's natural and intuitive, but often causes you to group unrelated characteristics together.

Maybe my update is as simple as adding a column to a table. If I have millions of rows, that won't be instant, it can take say half an hour. In InnoDB for example it'll do a full table copy to add a column. But it's not precarious or fragile, or dangerous. It's just what it says on the can. It'll either do the copy or it won't.

- people tend to want to use many systems during working hours

- engineers/CS staff tend to want to work during working hours

=> it is hard to plan for downtime during normal working hours.

If you want to deploy multiple times a day, and if you want to be able to merge in new code into your main branch, and you don't want to have to gate in merges etc for "deploy windows" with downtime, a great solution is to just do the extra work to avoid downtime on simple stuff.

I think the irony is that you really need to already be a certain size to do downtime-centric workflows reasonably without asking a bunch of people to do stuff at weird hours.

I have to point out, it seems, that you don't change your database schema "multiple times a day". Nor is every database schema linked to downtime (especially if you follow service/module orientation with distinct aggregates, and don't throw everything in one shared database like it's the 90s). Most updates don't need downtime. Few rare ones, do, unless you want major complications.

The irony you point out, isn't a huge irony. When your business is small, downtime is fine. As you grow and it becomes "not fine" you have people around the world that can handle it off-peak hours for the users during dev work hours.

Sorry, I guess the rationale here is more that DB schema changes (at least for most "basic" web applications) go through the same codebase as normal application changes, so it's in the same pipeline of changes as other changes.

By asking for changes to be done in a highly-available way, you reduce coordination efforts needed for release. "Our main branch is always deployable" is a great default! And while actualy DB schema changes aren't multiple times a day, even on a relatively small team simple DB changes come into the pipeline once a week? Most aren't backwards-incompatible, but the beauty of it is that we don't even have to consider that much, because of this strategy.

Maybe the "no downtime" stuff is unreasonable. In my case, we are a relatively small team servicing a good amount of customers (who do time-sensitive work), so the effort here is worth it. But if you can get away with it (with downtime windows being big enough to recover from issues), then it's definitely gonna be easier.

With complex migrations, like spanning multiple tables or depending on external APIs, I sometimes use a variant of this. Where instead of migrating the underlying database in "Migration 2", my version 3 would migrate on read:

- Read from new representation

- If not found:

  - read X from old representation

  - migrate X to new representation (complex business logic here)

  - write migrated datum to new representation

  - return migrated datum
- Either wait long enough until all data has been read at least once or systematically call your API for all records you want migrate (depending on the logic, you might want to throttle this)

As stated in the post: these steps make complete sense if you're aiming for a nice, smooth zero-downtime transition from the old data model to the new.

However, if you can get away with some downtime, you can instead notify your users of a planned outage, shutdown, update everything, and startup again. This eliminates several steps of the process, which has the added benefit of reducing the number of places where something can go wrong.

To put this in context: say you guarantee 99.9% uptime - that's still 43.2 min / month of downtime available to you! You can update quite a lot of machines / database rows in even 10-20 min, making this viable in many cases. (Obviously there is a scale threshold past which this is no longer viable, but if you're at that threshold you can probably also spare the engineering effort to go the zero-downtime route!)

One upside to writing an intermediate version that supports both the old and new representation: your data migrations are not an urgent race against the clock, or mission critical to the release.

You can decouple migrations from the release, and rest easy knowing your app is working without praying that a massive migration happens successfully and promptly.

Also since your UI or admin Interface likely lets you inspect your data in either representation, you can more naturally verify up/down migrations work in staging.

> This eliminates several steps of the process, which has the added benefit of reducing the number of places where something can go wrong.

It actually also removes some advantages. The author fails to point that out, but the advantage of these small steps is that you can roll back to the previous version at any time without loosing writes. If you find a production bug that does not impact data consistency, you can go back one version without loosing the writes that happened while the new version was up.

> To put this in context: say you guarantee 99.9% uptime - that's still 43.2 min / month of downtime available to you!

On the other hand, there might be a fatal bug at some point and you will need 3 hours of leeway to get up & running again. Shutting down is surely the easier method, but the available downtime is usually not free real estate for updates.

To be clear: data model migrations should still be tested and scripted to the extent possible; there should still be backups in place; there should still be a way to roll back application code and data schema. Just because a team / organization decides not to do zero-downtime updates - which I'm arguing is a completely reasonable choice in a lot of cases - doesn't mean they should abandon other software engineering practices.

Looking through the comments on this post, I see a lot of conflation between "not doing zero-downtime deployments" and "seat-of-your-pants engineering", as though fully orchestrated / containerized blue-green zero-downtime deployments were the One True Way for all teams / organizations / projects, and completely inseparable from other parts of good engineering practice.

There's also some DB updates that, even if strictly speaking non-backwards compatible, run in just a matter of seconds. Additionally, in postgres you can set statement_timeout, which should abort any query that takes too long.

If you have a reasonable deployment strategy, then the chance that someone will talk to your application in the exact moment that the app is in some invalid state, might be quite small.

One thing I’d add to this is that your database schema and your UI data model and/or API are decoupled so you can update your front end to the new representation before you do any database changes.

One thing I liked was the chosen example is changing Booleans to enums. This almost always happens. Your field is almost never a Boolean. This goes for code too. We’ve all read code like:

    foo(false, false, true, null);
This leads to coding standards that all such parameters must be commented and breakages when the parameter list is changed. It also often results in hacks to squeeze a third value into your Boolean (eg Optional<Boolean>).

Or, my favorite:

    @nullable Optional<Boolean>
As I like to say: for when three values for your Boolean just aren’t enough.

So yeah, my tip for the day is just make any Boolean an enumeration from the outset.

    enum Bool 

Don’t ever make True be the first item in an enumeration if there are any languages in your stack that initialize them by default. Make the default item be InvalidIForgotToInitializeThis.

One of my favourites features in Python are named parameters. I even wish Java could be more verbose in that manner.

You'd be surprised how many people are against this feature. I thought it'd be a no-brainer when it was proposed in the D community, but there was a lot of pushback against the feature (it's not implemented yet but it was accepted for the language). The main argument was that named arguments will cause your code to break when the arguments are renamed.

> The main argument was that named arguments will cause your code to break when the arguments are renamed.

Here's the counter point : ordered argument list will break if the order is re arranged in the defined function. While for named arguments it doesn't.

> Here's the counter point : ordered argument list will break if the order is re arranged in the defined function. While for named arguments it doesn't.

Counter-counterpoint: Once you add support for named arguments, your cannot do either, as you don't know which convention the caller uses.

in Python you can (and should) define your keyword arguments as keyword-only, for that very reason.

Seems acceptable since names are now part if your interface, and you could object that now you can relax ordering. Whats more interesting is that python has position only and name only arguments, meaning that there are uses for both.

I go back and forth on Python in this regard, specifically the named vs positional arguments in that an argument can be one or the other.

Part of me prefers just using an anonymous object in Javascript. Typescript is also cleaner than that.

Day to day I use Hack (FB's PHP fork), which has a type system and shapes, which are essentially anonymous objects in the JS/TS sense but a little more awkward syntactically:

    function foo(shape(
      'a' => int,
      'b' => int,
      ?'c' => int,
    )): void;

    foo(shape('a' => 1, 'b' => 2));
But with Python you sort of have this awkward mix of args and kwargs.

You can mark args as keyword only or positional only in the recent versions of Python (starting 3.7, maybe?)

If the project permits it, Kotlin fills that gap in Java well. Named parameters + defaults + nullability is a nice combination to have when it comes to evolving functions or types with backwards compatibility in mind.

This is good advice since some DBs (like MySQL) store booleans as 8 bit ints anyways.

Before dropping the old data, what is the simplest way of making sure it's not used? As the article says "this is hard to roll back".

Is it as simple as keeping the data around, but e.g. renaming columns to "columnname_deprecated_yyyy_MM_dd" and then in the unlikely event that something starts failing, you can at least rename it back to the old name while putting out that fire?

That's a good precaution.

Another one might be to log queries and check for use of that old column.

2 caveats: (1) ensure this logging doesn't cause performance bottleneck and (2) this might not catch queries that use globs (ie. "select * from ..."). For (1) you can do this one only some of your instances of your DB cluster and (2), try to always use explicit column references.

Are there any databases that support per-column security so you can fail reads (including globs)?

This works for API field deprecation, too.

For whatever it's worth, a pretty good book that covers this in a lot of detail is https://databaserefactoring.com/. One of the authors had several notes on his website about this prior to putting them into a book: http://www.ambysoft.com/books/agileDatabaseTechniques.html

I believe the checklist is missing adding any optional constraints on the new field. This should be done after deploying version 2 (writing to new field) and backfilling existing records.

This is a very useful guide. I'd add one additional step that can sometimes be useful if you're doing a more complex change than adding a field to an object. For example, sometimes you may want to read from an entirely new schema or data source, while ensuring that the user-facing behavior stays the same.

In that case, you'd want to add a step between steps 2 and 3 where you double-read, either inline or within a shadow workload, and add telemetry to alert you if there are discrepancies between the two versions.

Only when you're confident that the two data models are producing acceptably equivalent results would you cut over the primary reads to the new model.

This pattern often comes to play when you break off a piece of state from a monolithic database into a dedicated service.

Instead of an error prone human checklist, has anyone made frameworks with these sequenced structural migrations in mind? I’ve worked with both Rails migrations and Alembic, but they seem a layer of abstraction too low to only allow safe migrations.

This is what the humans need to do. It works with any framework or database.

You might find some software frameworks that make this easier than others but expecting to find a technology that replaces human methodology is like reading a book on Scrum and saying “can’t we make a software system that does this for us?”

No. Then we shouldn’t be using programming languages. I have learned to trust humans to not get migrations right. The system can have guardrails for certain structures of migrations.

Par of the problem is that tools don’t really have a notion of “behavior of system 2 releases ago”.

If you had a Django plugin that, like, let you say “oh the previous version of the app that was deployed had this model shape, so during this migration I need to do X/Y/Z” using the git info, you could do more.

But that would be a leaky abstraction and would run into a sort of coordination problem mid-deploy (what states are other machines in?). There might be some restrictions you could place to make this work though

Not a framework, but schemaless key-value stores allow you to play fast and loose and avoid a lot of the reason for migrations, with obvious downsides. Monorepos also allow you to change and deploy all of the code at once without having to coordinate versions of both dependencies and deployments.

Schema-less data does not make it unstructured data, it's just that it's now unknown/hidden and that you're perpetually in the middle of a migration with no obvious way to know you've finished migrating.

Even a monorepo won't help if you run multiple servers with your application code - there will still be a few seconds (or even a few minutes) during your deploys when some of those servers are running an old version of the software and some are running the latest release.

Also using a key value store or some other no-sql non-graph storage can sometimes be put between two structured migration endpoints. The advantage is that you have a clear way to first pull things apart before you put them together again and an intermediate step that is durable and easy to analyze.

But I think the article doesn't talk about full data migrations or integrations of two very distinct models, but rather about incremental, on the fly migrations on production databases, so the above doesn't apply well I think.

The lack of such tooling is one reason why devops is so hard and so hard to hire for.

It’s a very good topic. I currently use Ansible for deployment. I typically run the equivalent of a db migration as a custom application based upon a proven core, that, after extensive testing on copies of the production db, eventually runs against the production dB, as part of an orchestrated deployment. Verifying individual app nodes behaviors prior to moving onto the next is important, but also knowing the difference between data migrations that break old code vs ones that are agnostic, ak just make new data structures and slowly prune old ones, slowly meaning one or two deployments in the future, but before forgotten

(Disclaimer: I work here)

The SAP Cloud Application Programming Model (CAP) let's you derive your DB schema and UI from one semantic data model. The runtime framework interprets this and provides (deep) CRUD functionality out of the box. Business logic can then be inserted using different request handlers. Therefore you have (depending on additional business logic) usually only one place to define your model.


You're clearly not a spammer so I want to go easy on you, but nobody on HN is looking for a SAP sales pitch. I can promise you that.

I'm sorry, it wasn't meant to be a sales pitch (it's free to use anyway), I just wanted to show another way to deal with the "you change your database model, now you have to change many other places too" problem.

I didn't think you were shilling, but I was a little disappointed not to get a deep link to the docs for that specific capability :)

It's all built around that feature. You define your model using the language CDS (core data services), after compilation, you'll get a JSON representing the model (used by the generic runtime, either Node.js or Java), table/view SQL statements for the databases SQLite/HANA and OData metadata used by generic UIs. That means you semantically define your model and get all the artifacts needed to run the application. With generic support for most "enterprise qualities" like authorization, localization, draft support, extensibility etc. https://cap.cloud.sap/docs/get-started/in-a-nutshell gives a first glimpse. But again, I don't want to advertise. Like everything else, this approach has pros and cons, just wanted to give some insights how we dealt with that problem.

Does it provide tooling for migrating from one version of the data model to another? How does it handle the issues described in the article?

There is some migration tooling (HANA), but it won't be able to solve all issues (no framework can), as it strongly depends on how your migration looks like. However, it's purely declarative (for the generic CRUD operations), which helps you "experimenting" with different approaches more easily. It supports projections and allows CRUD operations on them (if the projections are not too complicated). For some migrations this is sufficient, but of course not if your additional business logic depends on it. The real power is more regarding the server/UI layer as you will be able to change the server code without needing to change the UI (except if you have custom UI code), as the UI is dynamically constructed based on the model metadata coming from the server. In your migration, this is not a big problem as you can pin newer versions of your UI to newer versions of your server, but here you just don't have to, which is neat.

Wow such spam, and SAP wow... What trash

Can't this be automated? It seems like such a basic transformation, that a computer should be able to do it.

Sure, but automating this requires a lot more tooling. In the case described with multiple, widely spread instances, you'll need to be able to

a) Reliably determine which step you are on - this includes both migration steps and application deployments b) Fully automate your deployments from a list of staged code changes ("normal" CD will have issue with this) c) Fully automate migrations _after_ all deployments are complete

Those are all totally possible steps, but they are all going to vary based on your setup, making a drop in solution for this process unlikely to be easy.

Additionally, these are only the steps if everything works! What happens if there's a bug halfway through? You probably want to stop the deployment process and either revert or quickly hotfix. A human-based system can handle this quite easily, but an automated system will need to be able to measure errors and reliably determine if an issue exists.

So its not that this can't be automated, in fact, I know that it has for several large organizations, but the hurdle to get to the full level of automation is rather high and fraught with secondary issues to consider.

Yes. For each step, write code that makes the change, and at the end of the change, updates a record (somewhere; different table, let's say) to note it is compete. Write a test that check if the record is updated, validates the change, and upon validation, updates the record a second time. Make each step depend on the validated value of the previous step. Put them all into cron jobs.

They will churn until all the steps are complete and validated. If a step fails, the rest of the steps won't happen. You can also write rollback steps that depend on yet another value written if a step or test fails. And if you're paranoid, you can force the steps to wait for another value to be flipped which you would flip manually, so you can inspect everything before allowing the next step to fire.

All of this is complicated, so, try to only write backwards compatible code where the data model never changes, or only ever adds fields, or use feature flags, or some other strategy. There's a book on this stuff somewhere...

Systematic database migrations do feel complicated, but how would you automate the process in the general case? You need to coordinate changes to two systems: your application code and your database. You can be doing anything with either. You can't update all relevant instances/records instantaneously, so you need to allow for intermediate states where both old and new code or data are in use while the update process itself is running. I think the article here does a pretty good job of describing this and a systematic way of dealing with it. At least you only need that many steps if you're making a breaking change to your existing data model rather than just extending it, which still requires care but is quite a bit simpler IME.

I hope the next blog article would discuss how to support multiple versions of APIs like Stripe does :)

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact