If changing the schema is hard, then you come up with silly rules about when the schema can change and who can do it. You make migration tools in other languages to avoid writing the line of SQL that would change the schema. You use 3rd party tools to compare two databases and spit out change scripts automatically (and keep two databases versions up to date just for that purpose). You adopt entire schemaless databases so that you never need to change the schema.
But that's silly. Because Database Schema Changes are Not Hard.
You do the thing the author is scared of: SQL Change Scripts. Insert column, massage the data, flip on the null constraint, add relationships. It's all really basic stuff, and if you don't know the SQL syntax for it you can just ask the db tool you're using to make the change. It'll have a little "Script This Out" button next to the Save button.
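To make that concrete, such a change script can be as small as this (table and column names are invented for illustration):

-- add the column nullable, massage the data, then flip on the constraints
ALTER TABLE orders ADD COLUMN status VARCHAR;
UPDATE orders SET status = CASE WHEN shipped_at IS NULL THEN 'pending' ELSE 'complete' END;
ALTER TABLE orders ALTER COLUMN status SET NOT NULL;
ALTER TABLE orders ADD CONSTRAINT orders_customer_fk FOREIGN KEY (customer_id) REFERENCES customers (id);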
If you do that, then you get to live in a world where Database Schema Changes are Easy. You get to have a build that just runs new change scripts in order rather than involving Ruby or some wacky 3rd party tool.
And you can move as fast as you like.
You have your code depending on it, documentation, constraints, replication, caching, db load, deployment systems, different envs and versions...
The SQL is the easy part and barely registers as an issue.
I see the same problem with SQL lovers rejecting ORMs as a dumb way to avoid using the right tool for the job.
But using an ORM is not about avoiding writing SQL. It's about all the rest: code introspection, data validation, documentation, having a common API, tooling, etc.
Like most frameworks (i.e. packages that force you to write your code to conform to their patterns), a really good one confers a massive boost in productivity while a really bad one is horrifying to work with - it fails in bizarre, unexplainable ways and straitjackets your development.
IME, passionate rejection of ORMs predominantly comes from people who have been scarred by experience with really bad ORMs.
Recently I've used Django ORM and SQLAlchemy - both are warty but generally decent and better than not using any ORM.
This is somewhat language dependent - certain language qualities inhibit the creation of decent ORMs (e.g. it's not really possible to create a decent ORM in golang due to its design).
However, I've had really good luck with Doobie, a purely functional SQL library in Scala. The big difference is that queries can be manipulated as ordinary pure values, which really lets you do some cool stuff. I actually used these features to significantly speed up a schema change in a large database I was working on.
DISCLAIMER: I worked at Redgate, who make a tool called SQL Compare that works with SQL Server, for nearly a decade. Since this article is about PostgreSQL you should check out Postgres Compare, built by my friend Neil: https://www.postgrescompare.com/. He also used to work at Redgate.
I know that for PostgresCompare a lot of its utility is in finding the differences: providing clarity into what has changed while one was building a feature.
A lot of users still write the SQL themselves. They let the tool figure out the easy stuff, get the dependencies correct, and they make a few edits here or there.
If the schema change is additive, it's usually easy. On the other hand, if you drop data items that older versions of applications expect, or change the semantics of existing data, you will need to proceed carefully. Whether you're using a relational database or something else doesn't really matter.
The tool you use to execute those changes doesn't really matter either if your data model changes at the whim of the developers, without actual planning for the impact it will have on upgrades.
I would say RDBMS schema changes are straightforward: remove constraints, alter tables/move data, add constraints.
Changing a NoSQL schema going forward is simple. Adding new columns is equally simple. Deleting columns: RDBMS is much easier. Updating an RDBMS is simpler.
The query planner/compiler in all modern RDBMSs uses foreign key constraints, unique constraints, and check constraints to optimize its execution code.
For instance, having a foreign key constraint allows the query planner to omit any code that checks if a value is in the child table but not the parent during a join. It can sometimes avoid accessing one table or the other entirely (so-called “join elimination”).
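As a rough illustration (table names assumed), with a trusted foreign key from orders.customer_id to customers.id, a query like this never actually needs the customers table:

-- nothing is selected from customers, and the FK guarantees a match exists
-- (assuming orders.customer_id is declared NOT NULL), so a planner that
-- supports join elimination can skip the join entirely
SELECT o.id, o.total
FROM orders o
JOIN customers c ON c.id = o.customer_id;

How far a given engine takes this varies, so it's worth checking the plan rather than assuming.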
A NoSQL database might seem to make migrations easier, but I'm not convinced that it actually does.
Well, a NoSQL database forces you to reduce or eliminate relations, whereas an RDBMS allows and encourages them. Sure, theoretically, you can make an RDBMS database without relations, but it goes against the standard design methodology. You can be sure that if you implement an RDBMS in a large organization, relationships between tables will seep into your schema. With a NoSQL database, that's pretty much impossible.
Live database schema changes are hard. Usually people just take the server down for a few seconds if the schema changes and data are not huge.
Imagine you're extracting some data to a new table in order to support 1:N instead of 1:1 of something. For example, introducing an address table, instead of storing the address directly in the customer table.
If you can have downtime, and have a small database, this can be done with one change and one deploy - that's not very hard.
If you have a big database (100s of GB), and can't have downtime, you can't grab a full table lock to rewrite all the records in one go. You'll have to do something like the following (a rough SQL sketch follows the list):
1) create the new table (easy)
2) create a marker column on the old table saying 'data moved to new table', set to false on all records
3) deploy version of application that can use both schemas
4) migrate all data to the new table, flipping the marker as you go.
5) once all records are migrated, deploy new version of application that only uses new schema
6) drop marker and old columns
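A rough PostgreSQL sketch of the schema side of those steps (names invented; batch size and details depend on the workload):

-- 1) create the new table
CREATE TABLE addresses (
    id          BIGSERIAL PRIMARY KEY,
    customer_id BIGINT NOT NULL REFERENCES customers (id),
    address     TEXT NOT NULL
);
-- 2) marker column on the old table (metadata-only change on PostgreSQL 11+)
ALTER TABLE customers ADD COLUMN address_migrated BOOLEAN NOT NULL DEFAULT false;
-- 4) backfill in small batches; run repeatedly until no rows are left
WITH batch AS (
    SELECT id, address FROM customers
    WHERE NOT address_migrated
    LIMIT 1000
    FOR UPDATE SKIP LOCKED
), moved AS (
    INSERT INTO addresses (customer_id, address)
    SELECT id, address FROM batch
)
UPDATE customers SET address_migrated = true WHERE id IN (SELECT id FROM batch);
-- 6) once the new application version is deployed everywhere
ALTER TABLE customers DROP COLUMN address, DROP COLUMN address_migrated;

Steps 3 and 5 are application deploys, which is exactly why this takes coordination rather than clever SQL.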
Now imagine that you have 4 or 5 applications using this database. (Some reporting application, some integration with an accounting system, some integration with a marketing tool, etc).
And you must do a live migration.
That is _hard_.
But I'd be happy with a lexicon and fundamental strategy for managing data that allows you to relax and rely on the fundamental primitives, like 12 factor does for overall application architecture.
... which introduces all the problems that everybody here is talking about. Thus my comment above advocating not doing that.
Getting a correct schema is only part of the migration process. Many times a refactoring requires a new table or field to be populated with data from the old table or field.
For example, a single table has a one-to-one relationship that we must select from to insert data into a newly created table so we can have a one-to-many relationship.
If you have to manually access all of the databases and run queries on them, then that defeats the purpose of a CI/CD migration tool.
Of course, the migrations that this method is purporting to replace handle that.
INSERT INTO tbl_temp2 (fld_id)
SELECT tbl_temp1.fld_order_id
FROM tbl_temp1 WHERE tbl_temp1.fld_order_id > 100;
Let's say you want users to have multiple emails, your scripts would be as follows:
-- forward migration
CREATE TABLE user_emails (
    user_id INT NOT NULL REFERENCES users (id),
    email VARCHAR NOT NULL
);
INSERT INTO user_emails (user_id, email) SELECT id, email FROM users;
ALTER TABLE users DROP COLUMN email;
-- rollback (keeps one email per user if several exist by then)
ALTER TABLE users ADD COLUMN email VARCHAR;
UPDATE users SET email = user_emails.email FROM user_emails WHERE user_emails.user_id = users.id;
DROP TABLE user_emails;
Persistent data: database tables, disk snapshots
program != persistent data
The hardest part of the migration is migrating the data and as far as I can tell that is glossed over.
Why is throwing away the iterative migration version useful exactly?
I like this product but I agree that it doesn't solve all migration issues. I do think we need better migration tools.
If you have multiple versions in the wild to support, there's nothing stopping you from supporting multiple upgrade paths, or even continuing to chain migration files, if you want to.
Moving data about is always going to be a manual process. This approach just helps you test it better.
> Moving data about is always going to be a manual process.
No, it's not always.
If moving data around within a migration isn't always a manual process, please show me the tool which automatically generates the correct statements.
First, every developer has a development environment. If someone goes on holiday for a few weeks they will come back to a personal environment that's likely several steps behind everyone else's.
Smart engineering orgs have one or (hopefully) more QA environments which differ from production by design - they're where you preview and test upcoming features that may involve schema changes.
At a certain scale individual teams may have their own dedicated QA environments, for testing in-development features without disrupting the work of other teams.
Then there are environments for running integration tests, hooking up to CI systems, load testing etc.
There may be only one production environment but there could be dozens or even hundreds of other environments that need to be able to reliably apply migrations up to a specific point.
The Django approach to migrations handles this really well.
If we're talking about deployed sqlite dbs running on client machines, expecting an upgrade to be a manual process is simply not acceptable.
When a user on random old version finally updates their app to latest, that latest code better be able to handle a migration correctly.
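For SQLite specifically, one common pattern (sketched here with made-up names, not tied to any particular app) is to key the upgrade off PRAGMA user_version at startup:

PRAGMA user_version;   -- an old install might return 3
-- the app runs every migration step above that number, in order,
-- bumping the counter after each one, e.g. step 4:
ALTER TABLE notes ADD COLUMN archived INTEGER NOT NULL DEFAULT 0;
PRAGMA user_version = 4;

That way a client jumping from any old version to the latest replays exactly the steps it's missing.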
No idea why this was downvoted, because this is a key truth.
You need to be able to migrate the production database to the latest version without losing any data.
You need to be able to replace any non-production database with the same schema as production. Ideally, with a (sanitised, subsetted, etc) copy of its data.
You don't need to be able to do anything else.
What about on-premises or just open-source software, where any number of copies of any version of your software may be installed around the universe, and those users need to upgrade to a newer version?
The rows (i.e. the actual data) are where the problem comes in. Just a few examples:
* Splitting overloaded fields ("5qt" => 5, "qt") - see the sketch after this list
* Datatype changes ("ID" => ID)
* PK size/type change (NUMBER(10) => NUMBER(15))
* "Just" add a column (PK,Col1 => PK, Col1, Col2)... easy until the identifying column is no longer enough!
* Alter a PK (PK => PK1, PK2) - now cascade that through the Referential Integrity chain(s)
* Change the "Type" column for every row based on the output of some external API call
* "Rollback" part of the schema, with examples of the above applied
* Implement a 'vertical split' on a very wide table, and associated RI cascades/updates
* The infamous "was_migrated" flag, or "row_version" solutions that seemed like a good idea at the time
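To make the first item concrete, here's a PostgreSQL sketch of splitting an overloaded text field (table and column names invented):

-- add the two destination columns
ALTER TABLE inventory
    ADD COLUMN quantity_value INTEGER,
    ADD COLUMN quantity_unit  TEXT;
-- split values like '5qt' into (5, 'qt'); rows that don't match stay NULL for manual review
UPDATE inventory
SET quantity_value = (regexp_match(quantity, '^(\d+)'))[1]::int,
    quantity_unit  = (regexp_match(quantity, '^\d+(\D+)$'))[1]
WHERE quantity ~ '^\d+\D+$';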
And the next level of complexity comes with large-scale systems; doing these types of changes while the DB is online is even more complicated, and sometimes not possible or not worth the development effort.
These are not intractable problems, and a tool to help with these kinds of issues would be quite valuable; that's what I was hoping to see here.
1. Ensuring the app works properly with the old schema as well as the new one -- not hard by itself, but requires rigor.
2. Minimizing locking during the migration itself. That right there actually is the much trickier part of migrations, because sometimes you really end up needing to rewrite an enormous table and you want to minimize downtime while doing so.
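On point 2, one widely used PostgreSQL pattern (a sketch with invented names, not the only way to do it) is to split the change into steps that each take only brief locks:

-- adding a nullable column is a metadata-only change
ALTER TABLE events ADD COLUMN payload_v2 JSONB;
-- backfill in batches from a script, repeating over id ranges
UPDATE events SET payload_v2 = payload::jsonb
WHERE id BETWEEN 1 AND 10000 AND payload_v2 IS NULL;
-- enforce the invariant without a long exclusive lock
ALTER TABLE events ADD CONSTRAINT payload_v2_not_null CHECK (payload_v2 IS NOT NULL) NOT VALID;
ALTER TABLE events VALIDATE CONSTRAINT payload_v2_not_null;
-- build any new index without blocking writes
CREATE INDEX CONCURRENTLY events_payload_v2_idx ON events USING gin (payload_v2);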
As a supporting example of the arguments, I recently had to replace a subtly corrupted table in production with one rebuilt from trusted sources. The table was about 160k rows and core to the operation of the production system I work on.
The initial "easy" approach to replacing this table took many hours to execute, and locked the table for much of that time. It took a day or two of effort, but the final series of SQL operations pushed the total run time down to less than four minutes, was recoverable from any intermediate state, and pushed the locking time down to three seconds.
The complicated part was not the final state of the database, but finding a path to reach that state that fit within the operational constraints.
How would these changes be accomplished?
How about multiple engineers making schema changes simultaneously?
How is a diff really much different from a migration?
- check schema version number, if newer version exists, apply migration files
- check that the schema on this branch matches the production schema exactly; if it doesn't, look for (one or more) pending migration files that would bring production to the target state, and apply those files
So you can still have multiple migration files as needed. The differences:
- no version numbers
- no long migration chain (you can clean up old pending files once they've hit prod)
- migrations are tested directly, and only get applied if a correct outcome will result
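For contrast, the traditional version-number approach mentioned first usually boils down to a tracking table, something like this minimal sketch (names assumed):

CREATE TABLE IF NOT EXISTS schema_migrations (
    version    TEXT PRIMARY KEY,
    applied_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
-- the runner applies, in filename order, any migration whose version isn't recorded yet,
-- then records it
INSERT INTO schema_migrations (version) VALUES ('0042_add_user_emails');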
It's impossible to handle every case, of course, such as when renames happen or data needs to be moved/inserted.
Migration scripts always need reviewing. Tools can get you most of the way there automatically, and help you test, but not all of the way.
Your migration tool is bad and you should feel bad.
Schema changes aren't hard. It's the mismatch between what devs think a table's size is in production and reality, plus the downtime caused by a data migration, that is the hard problem. Data retention, partitioning, and other data archiving concerns are rarely top of mind for apps until it's usually too late and "hard".
Hint: It's right there in the name. The developers.
> Schema changes aren't hard.
I'm glad you enjoy tedious manual work. I'd rather automate it.
Also, at least for Django, I can't see real harm with schema versioning via files with ids. You can pretty easily merge everything into one file once you have 100 or whatever migrations in any app.
Or, as most people would do it, within a transaction add the new value to your enum and run an update. Of course, this can't be reversed but if it's a simple `set status = 3 where status = 4` I don't see the problem.
My opinion:
The reality is that you never have enough information about how the data will be queried and connected together in the long run. A table structure is a model you built from a snapshot of your knowledge of the reality, but your knowledge changes over time and so should the model. As new information goes, some assumptions made before become obsolete, while new assumptions need to be made.
Even if a SQL relational database works OK for most business cases, it is not good at frequently evolving over time with new information or needs. It is also not that good if you have different models of the data that must exist in parallel.
I could see database schema as a specialized form of organizing knowledge, but it shouldn't be considered the best and only one.
Fundamentally, if you think you can fit the complexity of reality in a bunch of relational tables, you are starting with the wrong assumptions already.
To be honest, I don't find the status quo particularly difficult to deal with. A directory with lots of little files doesn't take up much storage, and nobody cares about the space anyway. That said, improvements are always welcome.
I like how it looks at the diff and figures out simple schema changes. However, in my experience, the simple changes are no big deal. The more interesting challenge in migrations is when semantics change and you need to generate data from existing data. A simple diff cannot determine that; it needs to be provided by some set of queries, which easily fits in a migration file.
Is there a way to integrate more complex changes and dataset manipulations when needed?
Once you accept this, the database schema becomes just yet another API contract to accept and respect, with all the same mechanisms to ensure compatibility -- versioning, integration tests etc.
I generally like Django migrations. A few rough spots. This seems like it could help some of them, but would take a lot of work to wire it in.
I don't use Django, but all that is required is for Django to be able to initialize an empty database based on its model definition.
Then you can just use that initialized database independently as a target from which to generate migration scripts.
Brilliant tool. I totally get it.