
Database schema changes are hard (2017) - djrobstep
https://djrobstep.com/talks/your-migrations-are-bad-and-you-should-feel-bad
======
jasonkester
So much of what is bad about tooling comes from this one assumption: Database
schema changes are hard.

If changing the schema is hard, then you come up with silly rules about when
the schema can change and who can do it. You make migration tools in other
languages to avoid writing the line of SQL that would change the schema. You
use 3rd party tools to compare two databases and spit out change scripts
automatically (and keep two database versions up to date just for that
purpose). You adopt entire schemaless databases so that you never need to
change the schema.

But that's silly. Because Database Schema Changes are _Not Hard._

You do the thing the author is scared of: SQL Change Scripts. Insert column,
massage the data, flip on the null constraint, add relationships. It's all
really basic stuff, and if you don't know the SQL syntax for it you can just
ask the db tool you're using to make the change. It'll have a little "Script
This Out" button next to the Save button.
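For a sense of how small such a change script usually is, here's a minimal sketch using Python's sqlite3 as a stand-in database (the table and column names are invented; the null-constraint step is shown as Postgres syntax in a comment, since SQLite can't add that constraint in place):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT);
    INSERT INTO users (email) VALUES ('a@example.com'), ('b@example.com');
""")

# The change script: insert column, massage the data, flip on the constraint.
conn.executescript("""
    ALTER TABLE users ADD COLUMN display_name TEXT;
    UPDATE users SET display_name = substr(email, 1, instr(email, '@') - 1);
    -- On Postgres you would now run:
    -- ALTER TABLE users ALTER COLUMN display_name SET NOT NULL;
""")

rows = conn.execute("SELECT display_name FROM users ORDER BY id").fetchall()
print(rows)  # [('a',), ('b',)]
```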

If you do that, then you get to live in a world where Database Schema Changes
are Easy. You get to have a build that just runs new change scripts in order
rather than involving Ruby or some wacky 3rd party tool.

And you can move as fast as you like.

~~~
sametmax
That's assuming changing the schema is just about changing the layout of the
database, which is definitely not true.

You have your code depending on it, documentation, constraints, replication,
caching, db load, deployment systems, different envs and versions...

The SQL is the easy part and barely registers as an issue.

I see the same problem with SQL lovers rejecting ORMs as a dumb way to avoid
using the right tool for the job.

But using an ORM is not about avoiding writing SQL. It's about all the rest:
code introspection, data validation, documentation, having a common API,
tooling, etc.

~~~
pytester
ORMs are frameworks that exist mainly as a way to avoid writing and maintaining
thousands of lines of boilerplate.

Like most frameworks (i.e. packages that force you to write your code to
conform to their patterns), a really good one confers a _massive_ boost in
productivity while a really bad one is horrifying to work with - it fails in
bizarre, unexplainable ways and straitjackets your development.

IME, passionate rejection of ORMs predominantly comes from people who have
been scarred by experience with really bad ORMs.

~~~
ken
Blaming it all on “really bad ORMs” without naming any sounds suspect to me.
Can you name some good ones?

~~~
pytester
For example JPOX was an ORM that almost put me off the entire idea. It was
horrendous.

Recently I've used Django ORM and SQLAlchemy - both are warty but generally
decent and better than not using any ORM.

This is somewhat language dependent - certain language qualities inhibit the
creation of decent ORMs (e.g. it's not really possible to create a decent ORM
in golang due to its design).

------
jjeaff
So how do you deal with it when you actually need to "migrate" data?

Getting a correct schema is only part of the migration process. Many times a
refactoring requires a new table or field to be populated with data from the
old table or field.

For example, a single table holds a one-to-one relationship, and we must
select from it to insert data into a newly created table so we can have a
one-to-many relationship.
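That one-to-one-into-one-to-many refactor boils down to a plain insert-select. A minimal sketch with sqlite3 standing in for the real server (table and column names are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, address TEXT);
    INSERT INTO users VALUES (1, 'Ann', '1 Main St'), (2, 'Bob', '2 Elm St');

    -- The newly created table for the one-to-many side.
    CREATE TABLE addresses (
        id INTEGER PRIMARY KEY,
        user_id INTEGER REFERENCES users(id),
        address TEXT
    );
""")

# Populate the new table from the old column...
conn.execute("""
    INSERT INTO addresses (user_id, address)
    SELECT id, address FROM users WHERE address IS NOT NULL
""")
# ...after which the old column can be dropped
# (Postgres: ALTER TABLE users DROP COLUMN address).

pairs = conn.execute(
    "SELECT user_id, address FROM addresses ORDER BY user_id").fetchall()
print(pairs)  # [(1, '1 Main St'), (2, '2 Elm St')]
```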

~~~
zepolen
If only there was some way to query the data from the old tables and insert
them into the new ones...

~~~
jjeaff
On dev, local, staging, live, and where else?

If you have to manually access all of the databases and run queries on them,
then that defeats the purpose of a ci/cd migration tool.

Of course, the migrations that this method purports to replace handle that.

------
jayd16
The reason you keep a migration chain is so you have a clear path from
anything in the wild to the latest. Unless I'm misunderstanding, throwing this
away means you need to migrate some unknown permutation to latest.

The hardest part of the migration is migrating the data and as far as I can
tell that is glossed over.

Why is throwing away the iterative migration version useful exactly?

~~~
djrobstep
In many (most?) cases people have a single production environment.

If you have multiple versions in the wild to support, there's nothing stopping
you from supporting multiple upgrade paths, or even continuing to chain
migration files, if you want to.

Moving data about is always going to be a manual process. This approach just
helps you test it better.

~~~
twic
> In many (most?) cases people have a single production environment.

No idea why this was downvoted, because this is a key truth.

You need to be able to migrate the production database to the latest version
without losing any data.

You need to be able to replace any non-production database with the same
schema as production. Ideally, with a (sanitised, subsetted, etc) copy of its
data.

You don't need to be able to do anything else.

~~~
akx
You're only thinking about SaaSy things there.

What about on-premises or just open-source software, where any number of
versions of your software may be installed around the universe, and those
users need to upgrade to a newer version?

------
Jupe
The problem with database migrations is the rows, not the columns. I think
I've written a similar "column" tool for just about every DB project
I've started, or written one for projects I've inherited/stepped into...

The rows (i.e. the actual data) are where the problem comes in. Just a few
examples:

* Splitting overloaded fields ("5qt" => 5, "qt")

* Datatype changes ("ID" => ID)

* PK size/type change (NUMBER(10) => NUMBER(15))

* "Just" add a column (PK,Col1 => PK, Col1, Col2)... easy until the identifying column is no longer enough!

* Alter a PK (PK => PK1, PK2) - now cascade that through the Referential Integrity chain(s)

* Change the "Type" column for every row based on the output of some external API call

* "Rollback" part of the schema, with examples of the above applied

* Implement a 'vertical split' on a very wide table, and associated RI cascades/updates

* The infamous "was_migrated" flag, or "row_version" solutions that seemed like a good idea at the time

* etc.
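The first bullet above ("5qt" => 5, "qt") shows why the rows hurt more than the columns: the split needs application code to parse every value, not just DDL. A minimal sketch with sqlite3 and an invented table:

```python
import re
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE items (id INTEGER PRIMARY KEY, quantity TEXT);
    INSERT INTO items (quantity) VALUES ('5qt'), ('12oz');
    ALTER TABLE items ADD COLUMN amount INTEGER;
    ALTER TABLE items ADD COLUMN unit TEXT;
""")

# Row-by-row massage: parse the overloaded field and write the two halves.
for row_id, quantity in conn.execute("SELECT id, quantity FROM items").fetchall():
    m = re.match(r"(\d+)([a-z]+)", quantity)
    conn.execute("UPDATE items SET amount = ?, unit = ? WHERE id = ?",
                 (int(m.group(1)), m.group(2), row_id))

split_rows = conn.execute("SELECT amount, unit FROM items ORDER BY id").fetchall()
print(split_rows)  # [(5, 'qt'), (12, 'oz')]
```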

And the next level of complexity comes with large-scale systems; doing these
types of changes while the DB is online is even more complicated, and
sometimes not possible or not worth the development effort.

These are not intractable problems, and a tool to help with these kinds of
issues would be quite valuable; that's what I was hoping to see here.

[Edited; formatting]

------
ddebernardy
Interesting tool, but anecdotally the difficult part about database migrations
has very little to do with the actual schema changes. Rather, it has to do
with:

1. Ensuring the app works properly with the old schema as well as the new one
-- not hard by itself, but requires rigor.

2. Minimizing locking during the migration itself. That right there actually
is the _much_ trickier part of migrations, because sometimes you really end up
needing to rewrite an enormous table and you want to minimize downtime while
doing so.
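One standard trick for that locking problem (a general sketch, not a claim about any particular system) is to rewrite the big table in small keyed batches, committing between batches so no single lock lasts long. Illustrated with sqlite3:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany("INSERT INTO events (payload) VALUES (?)",
                 [(f"row{i}",) for i in range(1000)])
conn.commit()

# Rewrite the table in batches of 100 rows keyed by primary key,
# instead of one long table-wide UPDATE.
BATCH = 100
last_id = 0
while True:
    cur = conn.execute(
        "UPDATE events SET payload = upper(payload) "
        "WHERE id > ? AND id <= ?", (last_id, last_id + BATCH))
    conn.commit()  # release locks between batches
    if cur.rowcount == 0:
        break
    last_id += BATCH

first = conn.execute("SELECT payload FROM events WHERE id = 1").fetchone()
print(first)  # ('ROW0',)
```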

~~~
jyounker
^^^^ This. A thousand times, this. ^^^^^

As a supporting example of the arguments, I recently had to replace a subtly
corrupted table in production with one rebuilt from trusted sources. The table
was about 160k rows and core to the operation of the production system I work
on.

The initial "easy" approach to replacing this table took many hours to
execute, and locked the table for much of that time. It took a day or two of
effort, but the final series of SQL operations pushed the total run time down
to less than four minutes, was recoverable from any intermediate state, and
pushed the locking time down to three seconds.

The complicated part was not the final state of the database, but finding a
path to reach that state that fit within the operational constraints.
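The usual shape of that kind of fix (my reconstruction, not the poster's actual script) is to build the replacement table offline from the trusted source, then swap it in with a rename, so the only locking step is near-instant. Sketched with sqlite3 and invented tables:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER);
    INSERT INTO accounts VALUES (1, 100), (2, -999);  -- row 2 is corrupt

    -- Step 1: rebuild from a trusted source; no locks on the live table.
    CREATE TABLE accounts_new (id INTEGER PRIMARY KEY, balance INTEGER);
    INSERT INTO accounts_new VALUES (1, 100), (2, 250);

    -- Step 2: the only locking step is the near-instant rename swap.
    ALTER TABLE accounts RENAME TO accounts_old;
    ALTER TABLE accounts_new RENAME TO accounts;
""")

fixed = conn.execute("SELECT balance FROM accounts WHERE id = 2").fetchone()
print(fixed)  # (250,)
```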

------
emptysea
Most of my migrations with Django are column renames, column adds, deletions,
etc., which seem to fit nicely with this tool, but there are some cases where
I want to do a more complex transformation which requires running a function
using Django's `RunPython()`[0]. For example, combining `firstName` and
`lastName` columns into a single `name` column.

How would these changes be accomplished?

[0]: https://docs.djangoproject.com/en/2.1/ref/migration-operations/#django.db.migrations.operations.RunPython
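For that particular firstName/lastName case, plain SQL can do the combination without `RunPython()`. A minimal sqlite3 sketch (invented table):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE people (id INTEGER PRIMARY KEY, firstName TEXT, lastName TEXT);
    INSERT INTO people VALUES (1, 'Ada', 'Lovelace');

    -- Combine the two columns into one with string concatenation.
    ALTER TABLE people ADD COLUMN name TEXT;
    UPDATE people SET name = firstName || ' ' || lastName;
""")

full_name = conn.execute("SELECT name FROM people").fetchone()
print(full_name)  # ('Ada Lovelace',)
```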

~~~
djrobstep
Just create a migration script that mixes (autogenerated) sql and the python
you need.

~~~
paulryanrogers
I thought that was the point of migrations in Ruby for Rails, PHP for Laravel,
etc. Because raw SQL often lacks the tools to express and coordinate the
changes.

------
bagels
So how would you deal with continuous deployment to test but less frequent
deployment to production? In test you might make multiple sequential schema
changes before deploying to production. Don't you just end up in the same
place as with migrations?

How about multiple engineers making schema changes simultaneously?

How is a diff really much different from a migration?

~~~
djrobstep
Author here. Where I've used this in a CI environment, the migration step goes
from:

- check schema version number, if newer version exists, apply migration files

to:

- check schema on this branch matches production schema exactly; if it
doesn't, look for (one or more) pending migration files that would bring
production to the target state, and apply those files

So you can still have multiple migration files as needed. The differences:

- no version numbers

- no long migration chain (you can clean up old pending files once they've
hit prod)

- migrations are tested directly, and only get applied if a correct outcome
will result

~~~
fbonetti
How would you handle a situation where the underlying data needs to be
transformed? For example, converting a text column to an int column?

~~~
djrobstep
In that particular example, migra will detect that change and generate the
right `alter column` statement with a `using` clause.
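For reference, the generated Postgres statement would look something like `ALTER TABLE t ALTER COLUMN price TYPE integer USING price::integer;` (my guess at the shape, not migra's literal output). SQLite lacks a `USING` clause, so this sketch emulates the same data transformation with `CAST`:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (id INTEGER PRIMARY KEY, price TEXT);
    INSERT INTO t (price) VALUES ('42'), ('7');
""")

# Emulate ALTER COLUMN ... USING: add an int column and cast the text into it.
conn.execute("ALTER TABLE t ADD COLUMN price_int INTEGER")
conn.execute("UPDATE t SET price_int = CAST(price AS INTEGER)")

prices = conn.execute("SELECT price_int FROM t ORDER BY id").fetchall()
print(prices)  # [(42,), (7,)]
```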

It's impossible to handle every case, of course, such as when renames happen
or data needs to be moved/inserted.

Migration scripts always need reviewing. Tools can get you most of the way
there automatically, and help you test, but not all of the way.

------
turdnagel
I did an Ask HN[0] on a related topic almost a year ago. I think the OP is
right that thinking clearly about data relationships helps in the design of an
app, but things do change. Which is why people give up and use Mongo.

[0]:
[https://news.ycombinator.com/item?id=16871789](https://news.ycombinator.com/item?id=16871789)

------
dwheeler
This is interesting, though it really shouldn't be specific to one RDBMS (and
I like PostgreSQL).

To be honest, I don't find the status quo particularly difficult to deal with.
A directory with lots of little files doesn't take up much storage, and nobody
cares about the space anyway. That said, improvements are always welcome.

I like how it looks at the diff and figures out simple schema changes.
However, in my experience, the simple changes are no big deal. The more
interesting challenge in migrations is when semantics change and you need to
generate data from existing data. A simple diff cannot determine that; it
needs to be provided by some set of queries, which easily fits in a migration
file.

Is there a way to integrate more complex changes and dataset manipulations
when needed?

------
bfung
The reason numbered files with different changes are kept is that when there
are more than a handful of devs working on the same schema, who has the
right magic one-step "dev schema"?

Your migration tool is bad and you should feel bad.

Schema changes aren't hard. It's devs not knowing how big a table is in
production, plus the downtime caused by a data migration, that is the hard
problem. Data retention, partitioning, and other data archiving concerns are
rarely top of mind for apps until it's too late and "hard".

~~~
djrobstep
> who has the right magic 1 step "Dev schema"?

Hint: It's right there in the name. The developers.

> Schema changes aren't hard.

I'm glad you enjoy tedious manual work. I'd rather automate it.

------
robben1234
How to do data migrations this way? Let's say you have a field for statuses
(int) and want to add a new option in the middle of array [(1,a),(2,b),(3,d)]
=> [(1,a),(2,b),(3,c),(4,d)]. Where do you do
Model.objects(status=3).update(status=4)?

Also, at least for Django, I can't see real harm with schema versioning via
files with ids. You can pretty easily merge everything to one file once you
have 100 or whatever migrations in any app.

~~~
petepete
You can take the multiple migrations approach and for a short time keep two
status fields while you change over, eventually dropping the first and
renaming the second.

Or, as most people would do it, within a transaction add the new value to your
enum and run an update. Of course, this can't be reversed but if it's a simple
`set status = 3 where status = 4` I don't see the problem.
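The single-transaction variant can be sketched with sqlite3 (invented table and values): shift the old value out of the way so the new option can take its slot, with the whole update committed atomically.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tickets (id INTEGER PRIMARY KEY, status INTEGER)")
conn.executemany("INSERT INTO tickets (status) VALUES (?)", [(1,), (3,), (3,)])

# Single transaction: move old status 3 to 4 so the new option c can be 3.
with conn:
    conn.execute("UPDATE tickets SET status = 4 WHERE status = 3")

statuses = conn.execute("SELECT status FROM tickets ORDER BY id").fetchall()
print(statuses)  # [(1,), (4,), (4,)]
```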

------
antpls
> Fundamentally, a schema is a guarantee that your data conforms to the
> structure you intend.

My opinion:

The reality is that you never have enough information about how the data will
be queried and connected together in the long run. A table structure is a
model you built from a snapshot of your knowledge of reality, but your
knowledge changes over time and so should the model. As new information
arrives, some assumptions made before become obsolete, while new assumptions
need to be made.

Even if SQL relational databases work OK for most business cases,
they are not good at frequently evolving over time with new information or
needs. They are also not that good if you have different models of the data
that must exist in parallel.

I could see a database schema as a specialized form of organizing knowledge,
but it shouldn't be considered the best and only one.

Fundamentally, if you think you can fit the complexity of reality in a
bunch of relational tables, you are starting with the wrong assumptions
already.

------
BerislavLopac
A very common problem I have seen in distributed applications is that the
database is treated as an internal part of the application. As soon as you
have a separate server running a database (basically most modern setups that
don't use SQLite) you have a distributed system, and the database is an
external dependency -- just like, say, Facebook/Google API you use for SSO. It
becomes even more clear if your system has several independent services.

Once you accept this, the database schema becomes just yet another API
contract to accept and respect, with all the same mechanisms to ensure
compatibility -- versioning, integration tests etc.

------
lukevp
Thanks for the post. I agree that schema migrations could be improved, and am
also working on a tool to address this space. Why is your tool specific to
Postgres? Does it perform migrations or just generate the scripts? Am I
understanding correctly that you compare 2 live databases together? I.e. prod
compared to dev, and a diff script will be created with a create table because
my dev database has additional table "foo" in it that doesn't exist in prod?

------
michael_j_ward
Link to the talk:
[https://www.youtube.com/watch?v=xr498W8oMRo](https://www.youtube.com/watch?v=xr498W8oMRo)

------
zimablue
Alembic is an amazing db migration tool imo. I only use it in small-scale
(80(?) tables, 2 developers), but it's so amazingly hackable by design.

------
brianwawok
Neat.

I generally like Django migrations. A few rough spots. This seems like it
could help with some of them, but would take a lot of work to wire it in.

~~~
djrobstep
Because this works directly at the database level, it shouldn't need any
"wiring in" to django itself, to use it with django.

I don't use django, but all that is required is for django to be able to
initialize an empty database based on its model definition.

Then you can just use that initialized database independently as a target from
which to generate migration scripts.

------
tommilukkarinen
When I started using mongo, it felt like a silver bullet. Reading this reminds
me that it still is. I kind of get the point of schemas etc., but if your
product can work with mongo, you just might avoid all these pains.

~~~
paulryanrogers
Does Mongo support ACID nowadays? And if data is mostly relational it still
seems like overkill to me.

------
xrd
Thanks for saying mongo sucks. I'll be quoting that.

Brilliant tool. I totally get it.

------
cjjp
Huh. This is essentially what I have been doing, though not so succinctly. I
thought I was just being lazy, to be honest, as this method seemed easier than
anything else. Good to know I'm not the only one!

------
marfusios
Bullshit, DDD is the way; the database should be just a plugin.

------
fromalex
OT: Wondering if DB admins are the highest paid contractors among tech
professionals. Discussions around DB tech are always the most emotional and
heated ones, as if they'd lose something significant if their world view on
DBs were wrong.

