
One year of automatic DB migrations from Git - awinter-py
https://abe-winter.github.io/2020/08/03/yr-of-git.html
======
kanobo
The author's reason for flat-out not considering SQLAlchemy's Alembic seems
incorrect? Don't most people simply use a local testing DB, generate
migrations against that test DB, and then perform the upgrade on prod if it
looks good, instead of generating the migration directly on prod?

~~~
ThePhysicist
Yes, that's how you'd do it. That said, while I really love SQLAlchemy and
worked with Alembic for a long time, I switched to manually written SQL-based
migrations in all of my projects, as there are simply too many corner cases
that are quite hard to capture with tools like Alembic. Writing plain SQL
files gives you much more expressive power and also makes it quite easy to
build complex migrations that e.g. involve first some schema changes, then a
data migration, and then some more schema changes, which is hard to do with
most migration tools.
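
A minimal sketch of what such a hand-written file can look like (table and column names are made up):

    -- 1. schema change: add the new column, nullable for now
    ALTER TABLE users ADD COLUMN full_name text;

    -- 2. data migration: backfill it from the existing columns
    UPDATE users SET full_name = first_name || ' ' || last_name;

    -- 3. schema change: tighten constraints, drop the old columns
    ALTER TABLE users ALTER COLUMN full_name SET NOT NULL;
    ALTER TABLE users DROP COLUMN first_name;
    ALTER TABLE users DROP COLUMN last_name;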

I also found that the use case most of these migration tools optimize for
(being somewhat independent of specific SQL dialects) rarely works in
practice, except maybe if you stick to the lowest common denominator of all
SQL dialects. And that again is wasteful IMO, as not using a lot of the
functionality that's present e.g. in Postgres can make your life a lot harder.
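
For example, Postgres-only DDL like the following (hypothetical tables) is exactly what a lowest-common-denominator tool tends not to emit:

    -- partial index, built without locking out writes
    CREATE INDEX CONCURRENTLY idx_orders_open
        ON orders (created_at)
        WHERE status = 'open';

    -- exclusion constraint preventing overlapping bookings
    -- (needs the btree_gist extension; 'during' is a tsrange)
    ALTER TABLE bookings ADD CONSTRAINT no_overlap
        EXCLUDE USING gist (room_id WITH =, during WITH &&);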

~~~
GordonS
I'm working with dotnet rather than Python, and worked with Entity Framework's
automated migrations for years. I say "worked with", but all too often it was
"battled with" - just as you mention, there were simply too many corner cases
for anything other than the simplest of databases.

A few years back I switched to plain SQL based migrations (using a lightweight
library called DbUp to execute them) - I couldn't be happier with it!

What better tool to express database changes with than SQL? I also like the
transparency it brings you - the files are right there for you to read. And
I'm a stickler for formatting and convention, so it's great being able to
write SQL files that include comments and use my preferred formatting.

------
qxmat
This is what SSDT dacpacs offer (it's supposed to be an open standard too).
Unfortunately they suffer from similar issues to the ones you describe (most
data migrations are destructive).

I settled on Flyway after migrating to Postgres a few years ago and haven't
looked back. The "this is great" moment came after we automated snapshot and
restore of per-dev environments in the release pipelines - this let each of us
roll out the last released version of the product, with either a clean
database (fast) or a snapshot (slow), to our own individual database in only a
few minutes. As a result we could test Flyway migrations without the grief
caused by breaking (or reverting) a shared 'dev' database.

Since moving on from that team/company and running head first into dacpacs
again, I've found it increasingly difficult to sell my past approach. Few
people are genuine data experts, fewer still are DBAs, and almost no one is
responsible for cross-cutting concerns etc. I can't help but think empowering
devs to automate migrations, even if it's via a hackity SQL parser, is part of
the solution, and I'd love to see this take off.

------
morgo
I really like the workflow GitHub uses for this:
[https://github.blog/2020-02-14-automating-mysql-schema-migra...](https://github.blog/2020-02-14-automating-mysql-schema-migrations-with-github-actions-and-more/)

(It makes use of skeema, which allows you to track your schema in a
declarative way vs. ALTER TABLE statements.)

~~~
awinter-py
skeema is similar -- if I understand it correctly, it keeps 'truth' as a SQL
file and diffs it against a live mysql DB

automig diffs the SQL files directly without looking at a live DB

pros & cons to both approaches

~~~
evanelias
Skeema author here -- you're correct that a live DB is involved, but the
notion of "truth" in Skeema actually depends on what operation you're running
:)

In Skeema, you have directories of *.sql files, where each directory defines
the desired state of a single "logical schema". A config file in each dir
allows you to map that logical schema to one or more live databases, possibly
in different environments (prod, stage, etc) and/or possibly different shards
in the same environment.

Most commonly, users will want to "push" the filesystem state to the
databases, e.g. generate and run DDL that brings the database up to the
desired state expressed by the filesystem. But users can also "pull" from a
live database, which does the reverse: modifies the filesystem to look like
the current state of a live database, essentially doing a schema dump. So the
source of truth depends on the direction of the operation.

Skeema also uses the live database as a SQL parser. Instead of needing to
parse every type of CREATE statement accurately across many different
versions/dialects of MySQL, Percona Server, MariaDB (and maybe someday Aurora
etc), Skeema runs the statements in a temporary location and then introspects
the result from information_schema. This avoids an entire class of possible
bugs around parser inaccuracies.
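
The general trick (a rough sketch of the idea, not Skeema's actual internals) looks something like this:

    -- run the CREATE statement in a scratch schema...
    CREATE DATABASE _scratch;
    CREATE TABLE _scratch.users (
      id   bigint unsigned NOT NULL AUTO_INCREMENT PRIMARY KEY,
      name varchar(100) NOT NULL
    );

    -- ...then ask the server what it actually parsed
    SELECT column_name, column_type, is_nullable
    FROM information_schema.columns
    WHERE table_schema = '_scratch' AND table_name = 'users';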

That all said, I do wholeheartedly agree that generating DDL from git alone is
both useful and really cool! I actually built a "database CI" product around
this concept as a GitHub app last June:
[https://www.skeema.io/ci/](https://www.skeema.io/ci/)

~~~
awinter-py
whoa neat

and agree re parsing -- turned out to be a huge pain. I think someone bundled
the postgres parser for python; I wish this were true for every dialect but
even so, there are versions to consider.

------
djrobstep
This is definitely better than the traditional rails/django approach to
migrations, but the post seems to imply that generating changes against a live
DB is a pointless annoyance - I disagree.

If you care about testability, the best way to test your migrations is to run
them directly against the schema they'll actually be applied to.

What if your tool is not the only party making schema changes to the database?
What if there was a bug in a previous migration and the database is in a
different state than you're assuming? The only way to check for problems like
this is to check your actual production state.

------
luhn
> and because in general there’s no guarantee that the migrations produce
> something equivalent to your ‘schema.sql’ or ORM definition.

I think the solution to this is to not have a schema.sql. Build your
development/testing database by running the same migrations you have/will run
in production.

There's still the issue of ORM parity, which admittedly is a bit awkward. I
try to limit my ORM definitions to the minimum necessary for the ORM to run.
If I want a true picture of what a table looks like, I look at the database
directly.

~~~
djrobstep
To me it seems a bizarre approach to have no actual goal state and have your
source of truth be a chain of scripts.

Why not:

- Have an ORM definition/schema.sql explicitly defined

- Continually test that production_schema + pending_changes = goal_schema

This way, no chain is required: The only scripts you need to keep hanging
around are any pending changes yet to be applied. And you can mostly
autogenerate any changes required, so the need to cobble together scripts
manually is much reduced.

~~~
ThePhysicist
You can have both. I usually have numbered SQL scripts that represent up &
down migrations (example:
[https://github.com/algoneer/algonaut/tree/master/algonaut/mo...](https://github.com/algoneer/algonaut/tree/master/algonaut/models/migrations)).
The first migration simply contains the initial data schema; the following
migrations contain amendments made during development. As you said, it can be
confusing to have to look at a series of files to figure out how the schema
would look in practice. To alleviate this problem we regularly "compact" our
changes by generating a new initial migration representing the current state
of the database (this is trivial with most database systems, e.g. using the
"pg_dump" command for Postgres), replacing all migration files, and resetting
the schema version in the production DB to 1.
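
A hypothetical numbered up/down pair, to make the layout concrete:

    -- migrations/0002.up.sql
    ALTER TABLE datasets ADD COLUMN description text;

    -- migrations/0002.down.sql
    ALTER TABLE datasets DROP COLUMN description;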

In practice it's hard to avoid having incremental migrations, as you need a
way to evolve your production schema. Your approach would be awesome of
course, but I don't know of any tool that can robustly come up with a
migration by comparing a database schema and a schema specification. Often
it's not even possible to derive the necessary changes using such an approach:
if you e.g. change the name of a column in your schema.sql, how can the tool
know whether you want to remove the original column and add a new one, or
simply rename and possibly type-cast it? If you have multiple changes like
that it only gets worse. So I'm not very optimistic about that approach.

~~~
djrobstep
I wrote a tool that does exactly this (migra, mentioned in the post). You
obviously have to review the output to catch things like renames, but most of
the time, for typical operations like adding a column/updating a
view/whatever, it's going to give you what you want.
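
To give a flavor (hypothetical output, not from a real run): for a schema.sql that gained a column and changed a view, such a tool emits DDL along these lines for you to review:

    ALTER TABLE "account" ADD COLUMN "last_login" timestamptz;

    CREATE OR REPLACE VIEW "active_accounts" AS
    SELECT id, email, last_login
    FROM account
    WHERE deleted_at IS NULL;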

------
ishcheklein
Looks interesting - GitOps for databases. Can I roll back to a specific
version?

I think Rails has a combination - migration files plus a schema file that
serves as a source of truth:
[https://edgeguides.rubyonrails.org/active_record_migrations....](https://edgeguides.rubyonrails.org/active_record_migrations.html#what-are-schema-files-for-questionmark).
It solves the cold-start problem, but I'm not sure Rails can generate a
migration from changes to it.

~~~
bdcravens
Rails does some basic tracking in the schema_migrations table, which it uses
to enable rollbacks via its CLI.

------
evanelias
Minor nit, but GitHub's gh-ost tool is for performing schema changes, not data
migrations. You may be thinking of Shopify's Ghostferry?

Anyway though, your broader point is absolutely correct: data migrations
typically involve complex custom/company-specific tooling at scale. And it's
also difficult to map data migrations onto a declarative tool in a way that
feels intuitive.

------
fanf2
I wonder how automig compares to loading the schema definitions from git into
scratch databases and feeding them to migra
[https://djrobstep.com/docs/migra](https://djrobstep.com/docs/migra)

------
o1lab
>> I haven’t written a migration in 1+ years

This.

The above should be the goal of every database-backed application framework. A
really well-balanced post (the author is super modest).

>> Dialect support is no picnic.

Are you doing SQL statement parsing here to diff Postgres? That can lead to a
lot of work if you want to add other databases.

@awinter-py: have a look at our approach to schema migrations - we hand devs a
GUI DB application and it generates schema migrations as you change the
database. The goal is still the same - backend devs should never write schema
migrations.

------
OJFord
I really want schema definition to be declarative; then ask me for a data
migration if applying it (to a given db) is going to result in data loss (e.g.
the declaration doesn't include a column that currently exists in the db), and
make it optional otherwise.

Even when state-aware migrations are needed, they could be better expressed,
e.g. 'on creating this column, value = other column * 2', rather than as a
step in an ordered sequence.
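
i.e. the tool would expand the declarative change into something like this (made-up table/column names):

    -- the declaration gained a NOT NULL column, so the tool
    -- asks for a backfill rule before it will apply the change
    ALTER TABLE measurements ADD COLUMN doubled integer;
    UPDATE measurements SET doubled = value * 2;
    ALTER TABLE measurements ALTER COLUMN doubled SET NOT NULL;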

------
jayd16
Do people often find they're doing schema migrations without needing to run
data migrations as well?

I find writing the ALTER TABLE commands fairly trivial compared to the data
migration that comes along with them.

------
smoe
Sorry if this comes off as looking down on people; it is not meant that way.
But I'm honestly a bit baffled that, to this day, I meet many people who are
still discovering schema migration tools as something new/special instead of
having learned them as one of the basic tools of a developer.

I haven't written schema migrations by hand on a regular basis in 12+ years,
since the very start of my career, so they were probably not even new back
then. Am I just living in a bubble where these tools, while certainly not
perfect, are heavily relied upon, whereas others prefer to do it
(semi-)manually to have full control? What is the state of these tools in
different stacks/environments?

~~~
chousuke
Perhaps it's just that in some cases the tool is overkill? In one case, I
wrote a super simple shell script to track applied migrations for a database
that had until then been managed entirely manually; I didn't feel it was worth
the trouble to learn something more complicated just for that use case, since
it took me about 15 minutes to write a script that worked well enough: it
applied the migration scripts transactionally (transactional DDL rules), in
order, and idempotently.
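
The core of that pattern is only a few lines of SQL per migration (a sketch with made-up names; the real script just loops over files and skips versions already recorded):

    CREATE TABLE IF NOT EXISTS schema_migrations (
        version    integer PRIMARY KEY,
        applied_at timestamptz NOT NULL DEFAULT now()
    );

    -- per migration file: the DDL and the bookkeeping row
    -- commit together, or roll back together
    BEGIN;
    ALTER TABLE invoices ADD COLUMN due_date date;
    INSERT INTO schema_migrations (version) VALUES (7);
    COMMIT;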

Granted, I'm not a software developer by trade; just a sysadmin who
occasionally uses programming to solve problems. I imagine if I had to deal
with actual software development I'd end up learning one of those tools
eventually.

------
techntoke
We need a file system API for databases, much like /proc.

~~~
rswail
There is - it's called INFORMATION_SCHEMA, and it's defined by the SQL
standard.

[https://www.postgresql.org/docs/current/information-schema.h...](https://www.postgresql.org/docs/current/information-schema.html)

Postgres also has its own catalog schema that is basically manipulated
directly by the C runtime:

[https://www.postgresql.org/docs/current/catalogs.html](https://www.postgresql.org/docs/current/catalogs.html)
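
e.g. the moral equivalent of "ls" for a schema:

    -- list every column in the 'public' schema, in standard SQL
    SELECT table_name, column_name, data_type, is_nullable
    FROM information_schema.columns
    WHERE table_schema = 'public'
    ORDER BY table_name, ordinal_position;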

~~~
techntoke
So I can cat the file system API directly from Linux?

