Hacker News new | past | comments | ask | show | jobs | submit login
Database schema changes are hard (2017) (djrobstep.com)
149 points by djrobstep 18 days ago | hide | past | web | favorite | 89 comments

So much of what is bad about tooling comes from this one assumption: Database schema changes are hard.

If changing the schema is hard, then you come up with silly rules about when the schema can change and who can do it. You make migration tools in other languages to avoid writing the line of SQL that would change the schema. You use 3rd party tools to compare two databases and spit out change scripts automatically (and keep two databases versions up to date just for that purpose). You adopt entire schemaless databases so that you never need to change the schema.

But that's silly. Because Database Schema Changes are Not Hard.

You do the thing the author is scared of: SQL Change Scripts. Insert column, massage the data, flip on the null constraint, add relationships. It's all really basic stuff, and if you don't know the SQL syntax for it you can just ask the db tool you're using to make the change. It'll have a little "Script This Out" button next to the Save button.

If you do that, then you get to live in a world where Database Schema Changes are Easy. You get to have a build that just runs new change scripts in order rather than involving Ruby or some wacky 3rd party tool.

And you can move as fast as you like.

That's assuming changing the schema is just about changing the layout of the database, which is definitly not true.

You have your code depending on it, documentation, constrainsts, replication, caching, db load, deployement systems, different envs and versions...

The SQL is the easy part and barely registers as an issue.

I see the same problem with SQL lovers rejecting ORMs as a dumb way to avoid using the right tool for the job.

But using an ORM is not about avoiding to write SQL. It's about all the rest: code introspection, data validation, documententation, having a common api, tooling, etc

ORM are frameworks that exist mainly as a way to avoid writing and maintaining thousands of lines of boilerplate.

Like most frameworks (i.e. packages that force you to write your code to conform to their patterns) a really good one confers a massive boost in productivity while a really bad one is horrifying to work - it fails in bizarre unexplainable ways and straitjackets your development.

IME, passionate rejection of ORMs predominantly comes from people who have been scarred by experience with really bad ORMs.

Blaming it all in “really bad ORMs” without naming any sounds suspect to me. Can you name some good ones?

For example JPOX was an ORM that almost put me off the entire idea. It was horrendous.

Recently I've used Django ORM and SQLAlchemy - both are warty but generally decent and better than not using any ORM.

This is somewhat language dependent - certain language qualities inhibit the creation of decent ORMs (e.g. it's not really possible to create a decent ORM in golang due to its design).

Django's ORM has always served me well, it handles migrations and I've had almost zero SQL-hacks that needed to be done.

I've had bad luck with ORMs in general, mainly because they fall apart whenever I need to do something complicated. The abstraction is just too leaky.

However, I've had really good luck with Doobie, a purely functional SQL library in Scala. The big difference is that queries can be manipulated as ordinary pure values, which really lets you do some cool stuff. I actually used these features to significantly speed up a schema change in a large database I was working on.


Given your experience (which is replicated all over the world), why not just use ORMs for what they're good at and skip the rest?

That's something I don't understand as well. There is no need to go 100% SQL or 100% ORM. You can use raw SQL with the ORM, and the good ones have actually features to make it easy and productive. You don't even need to use the OOP nature of ORM for querying, good ones will have a functional layer for those as well.

Indeed, I often use all 3 in the same file, let alone the same app.

Even if you do know SQL syntax using the tool to make the change script for you can save quite a bit of time, especially if you've made a bunch of changes which can happen if you've added new functionality.

DISCLAIMER: I worked at Redgate, who make a tool called SQL Compare that works with SQL Server, for nearly a decade. Since this article is about PostgreSQL you should check out Postgres Compare, built by my friend Neil: https://www.postgrescompare.com/. He also used to work at Redgate.

Thanks Bart!

I know for PostgresCompare a lot of its utility is in finding the differences. Providing clarity into what has changed while one was building a feature.

A lot of users still write the SQL themselves. They let the tool figure out the easy stuff, get the dependencies correct, and they make a few edits here or there.

I don't really understand what is it about that makes RDBMS schema changes particularly harder in people's minds. All schema changes to your data can be hard or easy depending on what you do, regardless of whether that schema is implicit in the application or enforced by a database.

If the schema change is additive, it's usually easy. On the other hand, if you drop data items that older versions of applications expect, or change the semantics of existing data, you will need to proceed carefully. Whether you're using a relational database or something else doesn't really matter.

The tool you use to execute those changes doesn't really matter either if your data model changes at the whim of the developers without actual planning behind it to deal with the impact on upgrades it will have.

In an RDMS, a change in schema is hard because changing one table can affect many others. In a NOSQL database, there are far fewer connections between tables (or "objects"). There is less to wrap your head around.

A nosql database schema change would force you to update every document requiring the change.

I would say rdms schemas are straight forward. Remove constrants, alter tables/move data, add constrants.

Changing a nosql schema going forward is simple. Adding new columns is equally simple. Deleting columns: rdms is much easier. Updating rdms is simplier.

With RDBMS the integrity constraints can make gradually changes to large data sets more difficult. It often means a lot of downtime or the code must tolerate old and new structures for a while. With NoSQL the constraints are probably already in the application layer.

I usually remove db constraints and implement constrains at the application level for both types.

When you remove constraints from an RDBMS you slow it down, sometimes catastrophically.

The query planner/compiler in all modern RDBMS use foreign key constraints, unique constraints, and check constraints to optimize their execution code.

For instance, having a foreign key constraint allows the query planner to omit any code that checks if a value is in the child table but not the parent during a join. It can sometimes avoid accessing one table or the other entirely (so-called “join elimination”).

That's just up to how you model your data. there's nothing that prevents you from using a single table with a JSON blob in an RDBMS either, but it's throwing away most of the benefits.

A NoSQL database might seem to make migrations easier, but I'm not convinced that it actually does.

> That's just up to how you model your data.

Well, a NoSQL database forces you to reduce or eliminate relations, whereas an RDMS allows and encourages them. Sure, theoretically, you can make an RDMS database without relations, but it goes against the standard design methodology. You can be sure that if you implement an RDMS in a large organization, relationships between tables will seep into your schema. With a NoSQL database, that's pretty much impossible.

> But that's silly. Because Database Schema Changes are Not Hard.

Live database schema changes are hard. Usually people just let server down a few seconds if schema changes and data are not huge.

Why is it hard? Most of changes are atomical and either in background or instant. What change require downtime? I would argue that software change without downtime is harder and basically not solved problem for most environments unless you're using cluster of machines.

Live changes are hard because they requires precise planning, and must often be done in several steps.

Imagine you're extracting some data to a new table in order to support 1:N instead of 1:1 of something. For example, introducing an address table, instead of storing the address directly in the customer table.

If you can have downtime, and have a small database, this can be done with one change and one deploy - that's not very hard.

If you have a big database (100s of GB), and can't have downtime, you can't grab a full table lock to rewrite all the records in one go. You'll have to do something like

  1) create the new table (easy)
  2) create a marker column on the old table saying 'data moved to new  table', set to false on all records
  3) deploy version of application that can use both schemas
  4) migrate all data to the new table, flipping the marker as you go.
  5) nce all records are migrated, deploy new version of application that only uses new schema
  6) drop marker and old columns
(there are multiple ways to do this, of course - this is one approach)

Now imagine that you have 4 or 5 applications using this database. (Some reporting application, some integration with an accounting system, some integration with a marketing tool, etc). And you must do a live migration.

That is _hard_.

I've been wanting a 12factor-type methodological doctrine for data for a long time. Ideally, one that can fully account for how, when, and why to start introducing different kinds of databases and when you need to start exploring those options.

But I'd be happy with a lexicon and fundamental strategy for managing data that allows you do relax and rely on the fundamental primitives, like 12 factor does of overall application architecture.

Queued jobs which can auto-retry will help, of they can detect incompatible changes in the data structure. Of course that'll produce some false positives in the error logs. But it's a small price to pay IMO

The author's point is not that schema changes are hard - but rather that versioning schema changes are hard.

I'm not sure you read the talk, which is literally about creating SQL change scripts?

I did. Its main piece of advice is to stop writing sequential change scripts, and to instead only keep "empty", "dev", "live" schema scripts or similar, and to rely on a tool to sort out changes between them.

... which introduces all the problems that everybody here is talking about. Thus my comment above advocating not doing that.

I have no idea what document you just read, but the linked talk is literally about preparing sequential change scripts.

So how do you deal with it when you actually need to "migrate" data?

Getting a correct schema is only part of the migration process. Many times a refactoring requires a new table or field to be populated with data from the old table or field.

For example, a single table has a one to one relationship that we must select from to insert data into a newly created table so we can have a one to many relationship.

If only there was some way to query the data from the old tables and insert them into the new ones...

On, dev, local, staging, live, and where else?

If you have to manually access all of the databases and run queries on them, then that defeats the purpose of a ci/cd migration tool.

Of course, the migrations that this method is purporting to replace handles that.

I'm confused. Wouldn't this type of statement cover those cases?

INSERT INTO tbl_temp2 (fld_id) SELECT tbl_temp1.fld_order_id FROM tbl_temp1 WHERE tbl_temp1.fld_order_id > 100;

Yes, but on how many different systems and dev setups do you have to run that?

All of them. It would be part of your migration process.

Let's say you want users to have multiple emails, your scripts would be as follows:


  CREATE TABLE user_emails (
     user_id INT NOT NULL REFERENCES (users.id),
  INSERT INTO user_emails SELECT id, email FROM users;


  INSERT INTO users SELECT id, email FROM users;
  DROP TABLE user_emails;

Program: code, dockerfile, Kubernetees config

Persistent data: databases tables, disk snapshots

program != persistent data

But a particular version of a program expects a certain schema in the persistent data store. So when the program gets upgraded, the persistent data store must be migrated, unless you're able to lose all your data. In that case, you don't need to care about migrations, and it's much easier.


The reason you keep a migration chain is so you have a clear path from anything in the wild to the latest. Unless I'm misunderstanding, throwing this away means you need to migrate some unknown permutation to latest.

The hardest part of the migration is migrating the data and as far as I can tell that is glossed over.

Why is throwing away the iterative migration version useful exactly?

Moreover, Django already has facilities for compacting migration files together: squashing.


I agree but I also have migrations that don't correspond to any release of the product. You need to run a migration to update your database so you can test your changes. A single product release might contain hundreds of migration files; maybe some that simply undo the affect of some previous migration. The pain of that is real but your point is valid.

I like this product but I agree that it doesn't solve all migration issues. I do think we need better migration tools.

In theory yes, but I've barely looked at those old migration files even the latest one. Barely rollback, just patch with a new one. Some people say a rollback is just another migration "up"

In many (most?) cases people have a single production environment.

If you have multiple versions in the wild to support, there's nothing stopping you from supporting multiple upgrade paths, or even continuing to chain migration files, if you want to.

Moving data about is always going to be a manual process. This approach just helps you test it better.

You just got done telling me that migrations are bad, but also the solution to some non-trivial deployment scenario is to just continue using migrations?

> Moving data about is always going to be a manual process.

No, it's not always.

Did you even read it? No idea where you got "migrations are bad" from.

If moving data around within a migration isn't always a manual process, please show me the tool which automatically generates the correct statements.

I don't believe a tool can always generate the correct statements. But I do believe those same statements can be used on all deployments, which I consider automation.

I don't buy this at all. It may be true for projects with only a single developer, but consider environments with a larger team.

First, every developer has a development environment. If someone goes on holiday for a few weeks they will come back to a personal environment that's likely several steps behind everyone else.

Smart engineering orgs have one or (hopefully) more QA environments which differ from production by design - they're where you preview and test upcoming features that may involve schema changes.

At a certain scale individual teams may have their own dedicated QA environments, for testing in-development features without disrupting the work of other teams.

Then there are environments for running integration tests, hooking up to CI systems, load testing etc.

There may be only one production environment but there could be dozens or even hundreds of other environments that need to be able to reliably apply migrations up to a specific point.

The Django approach to migrations handles this really well.

? None of those are production environments.

They're "production" in as much as they need to work. They still need to have migrations applied to them in an automated and repeatable way.

"An environment that needs to work" is not what "production environment" means.

I would say this is a pretty limited view. If you're working on a large team with many non production environments, this style is far more cumbersome than the upgrade chain method.

If we're talking about deployed sqlite dbs running on client machines, expecting an upgrade to be a manual process is simply not acceptable.

What makes it more cumbersome?

If I'm understanding correctly, the suggestion is to run the migration by hand and handle issues as they arise. Not nearly as nice as an automated migration.

The talk is more about creation migration scripts than deploying them, but definitely not suggesting doing anything by hand. The idea is to ensure correctness so you can automate the deployments with confidence.

What about apps running on users’ devices, with their own internal databases for local storage?

When a user on random old version finally updates their app to latest, that latest code better be able to handle a migration correctly.

> In many (most?) cases people have a single production environment.

No idea why this was downvoted, because this is a key truth.

You need to be able to migrate the production database to the latest version without losing any data.

You need to be able to replace any non-production database with the same schema as production. Ideally, with a (sanitised, subsetted, etc) copy of its data.

You don't need to be able to do anything else.

You're only thinking about SaaSy things there.

What about on-premises or just open-source software, where you any number of any version of your software may be installed around the universe, and those users need to upgrade to a newer version?

See my comment here about organizations with many engineers and teams: https://news.ycombinator.com/item?id=19289778


The problem with database migrations are the rows, not the columns. I think I've written similar "column" tool like this for just about every DB project I've started, or write one for projects I've inherited/stepped into...

The rows (i.e. the actual data) is where the problem comes in. Just a few examples:

* Splitting overloaded fields ("5qt" => 5, "qt")

* Datatype changes ("ID" => ID)

* PK size/type change (NUMBER(10) => NUMBER(15))

* "Just" add a column (PK,Col1 => PK, Col1, Col2)... easy until the identifying column is no longer enough!

* Alter a PK (PK => PK1, PK2) - now cascade that through the Referential Integrity chain(s)

* Change the "Type" column for every row based on the output of some external API call

* "Rollback" part of the schema, with examples of the above applied

* Implement a 'vertical split' on a very wide table, and associated RI cascades/updates

* The infamous "was_migrated" flag, or "row_version" solutions that seemed like a good idea at the time

* etc.

And the next level of complexity comes with large-scale systems; doing these types of changes while the DB is online is even more complicated, and sometimes not possible or not worth the development effort.

These are not intractable problems, and a tool to help with these kinds of issues would be quite valuable; that's what I was hoping to see here.

[Edited; formatting]

Interesting tool, but anecdotally the difficult part about database migrations has very little to do with the actual schema changes. Rather, it has to do with:

1. Ensuring the app works properly with the old schema as well as the new one -- not hard by itself, but requires rigor.

2. Minimizing locking during the migration itself. That right there actually is the much trickier part of migrations, because sometimes you really end up needing to rewrite an enormous table and you want to minimize downtime while doing so.

^^^^ This. A thousand times, this. ^^^^^

As a supporting example of the arguments, I recently had to replace a subtly corrupted table in production with one rebuilt from trusted sources. The table was about 160k rows and core to the operation of the production system I work on.

The initial "easy" approach to replacing this table took many hours to execute, and locked the table for much of that time. It took a day or two of effort, but the final series of SQL operations pushed the total run time down to less than four minutes, was recoverable from any intermediate state, and pushed the locking time down to three seconds.

The complicated part was not the final state of the database, but finding a path to reach that state that fit within the operational constraints.

In the MySQL family there are several tools which help: pt-online-schema-change, gh-ost, etc. PostgreSQL and others have their own techniques and tools.

Most of my migrations with Django are column renames, column adds, deletions, etc., which seem to fit nicely with this tool, but there are some cases where I want to do some more complex transformation which require running a function using Django's `RunPython()`[0]. For example, combining `firstName` and `lastName` columns into a single `name` column.

How would these changes be accomplished?

[0]: https://docs.djangoproject.com/en/2.1/ref/migration-operatio...

Just create a migration script that mixes (autogenerated) sql and the python you need.

I thought that was the point of migrations in Ruby for Rails, PHP for Laravel, etc. Because raw SQL often lacks the tools to express and coordinate the changes.

I did an Ask HN[0] on a related topic almost a year ago. I think the OP is right that thinking clearly about data relationships helps in the design of an app, but things do change. Which is why people give up and use Mongo.

[0]: https://news.ycombinator.com/item?id=16871789

So how would you deal with continuous deployment to test but less frequent deployment to production? In test you might make multiple sequential schema changes before deploying to production. Don't you just end up in the same place as with migrations?

How about multiple engineers making schema changes simultaneously?

How is a diff really much different from a migration?

Author here. Where I've used this in a CI environment, the migration step goes from:

- check schema version number, if newer version exists, apply migration files


- check schema on this branch matches production schema exactly, if it doesn't, look for (one or more) pending migration files that would bring production to target state, apply those files

So you can still have multiple migration files as needed. The differences:

- no version numbers

- no long migration chain (you can clean up old pending files once they've hit prod)

- migrations are tested directly, and only get applied if a correct outcome will result

How would you handle a situation where the underlying data needs to be transformed? For example, converting a text column to an int column?

In that particular example, migra will detect that change and generate the right `alter column` statement with a `using` clause.

It's impossible to handle every case, of course, such as when renames happen or data needs to be moved/inserted.

Migration scripts always need reviewing. Tools can get you most of the way there automatically, and help you test, but not all of the way.

The reason files with numbers with different changes are kept is because when there are more than a handful of devs working on the same schema, who has the right magic 1 step "Dev schema"?

Your migration tool is bad and you should feel bad.

Schema changes aren't hard. It's the mismatch of devs knowing how big a table is in production + the downtime caused by a data migration that is the hard problem. Data retention, partitioning, and other data archiving things are rarely the top of mind for apps until it's usually too late and "hard".

> who has the right magic 1 step "Dev schema"?

Hint: It's right there in the name. The developers.

> Schema changes aren't hard.

I'm glad you enjoy tedious manual work. I'd rather automate it.

How to do data migrations this way? Let's say you have a field for statuses (int) and want to add a new option in the middle of array [(1,a),(2,b),(3,d)] => [(1,a),(2,b),(3,c),(4,d)]. Where do you do Model.objects(status=3).update(status=4)?

Also, at least for Django, I can't see real harm with schema versioning via files with ids. You can pretty easily merge everything to one file once you have 100 or whatever migrations in any app.

You can take the multiple migrations approach and for a short time keep two status fields while you change over, eventually dropping the first and renaming the second.

Or, as most people would do it, within a transaction add the new value to your enum and run an update. Of course, this can't be reversed but if it's a simple `set status = 3 where status = 4` I don't see the problem.

> Fundamentally, a schema is a guarantee that your data conforms to the structure you intend.

My opinion :

The reality is that you never have enough information about how the data will be queried and connected together in the long run. A table structure is a model you built from a snapshot of your knowledge of the reality, but your knowledge changes over time and so should the model. As new information goes, some assumptions made before become obsolete, while new assumptions need to be made.

Even if a SQL relational database works OK for most of the business cases, they are not good at frequently evolving over time with new information or needs. They are also not that good if you have different models of the data that must exist in parallel.

I could see database schema as a specialized form of organizing knowledge, but it shouldn't be considered the best and only one.

Fundamentally, if you think you can fit the complexity of the reality in a bunch of relational tables, you are starting with the wrong assumptions already

This is interesting, though it really shouldn't be specific to one RDBMS (and I like PostgreSQL).

To be honest, I don't find the status quo particularly difficult to deal with. A directory with lots of little files doesn't take up much storage, and nobody cares about the space anyway. That said, improvements are always welcome.

I like how it looks at the diff and figures out simple schema changes. However, in my experience, the simple changes are no big deal. The more interesting challenge in migrations is when semantics change and you need to generate data from existing data. Simple diff cannot determine that, that needs to be provided by some set of queries, which easily fits in a migration file.

Is there a way to integrate more complex changes and dataset manipulations when needed?

A very common problem I have seen in distributed applications is that the database is treated as an internal part of the application. As soon as you have a separate server running a database (basically most modern setups that don't use SQLite) you have a distributed system, and the database is an external dependency -- just like, say, Facebook/Google API you use for SSO. It becomes even more clear if your system has several independent services.

Once you accept this, the database schema becomes just yet another API contract to accept and respect, with all the same mechanisms to ensure compatibility -- versioning, integration tests etc.

Thanks for the post. I agree that schema migrations could be improved, and am also working on a tool to address this space. Why is your tool specific to PostGres? Does it perform migrations or just generate the scripts? Am I understanding correctly that you compare 2 live databases together? I.e. prod compared to dev, and a diff script will be created with a create table because my dev database has additional table "foo" in it that doesn't exist in prod?

Alembic is an amazing db migration tool imo. I only use it in small-scale (80(?) tables, 2 developers), but it's so amazingly hackable by design.


I generally like Django migrations. A few rough spots. This seems like it could help some of them, but would take s lot of work to wire it in.

Because this works directly at the database level, it shouldn't need any "wiring in" to django itself, to use it with django.

I don't use django, but all that is required is for django to be able to initialize an empty database based on its model definition.

Then you can just use that initialized database independently as a target from which to generate migration scripts.

When starting to use mongo it felt like a silver bullet. Reading this reminds me that it still is. I kind of get the point of schemas etc. but if your product can work with mongo, you just might avoid all these pains.

Does Mongo support ACID nowadays? And if data is mostly relational it still seems like overkill to me.

Thanks for saying mongo sucks. I'll be quoting that.

Brilliant tool. I totally get it.

Huh. This is essentially what I have been doing, though not so succenitly. I thought I was just being lazy to be honest as this method seemed easier than anything else. Good to know I'm not the only one!

Bullshit, DDD is a way, database should be just plugin.

OT: Wondering if DB admins are the highest paid contracters among tech professinals. Discussions around DB tech are always the most emotional and heated up ones, like if they lost something significant if their world view on DBs is wrong.

Applications are open for YC Summer 2019

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact