One year of automatic DB migrations from Git (abe-winter.github.io)
50 points by awinter-py on Aug 3, 2020 | 39 comments



The author's reason for not even considering SQLAlchemy's Alembic seems incorrect? Don't most people simply use a local testing db, generate migrations against that test db, and then perform the upgrade on prod if it looks good, rather than generating the migration directly on prod?


Yes, that's how you'd do it. That said, while I really love SQLAlchemy and worked with Alembic for a long time, I switched to hand-written SQL migrations in all of my projects, as there are simply too many corner cases that are hard to capture with tools like Alembic. Writing plain SQL files gives you much more expressive power and also makes it quite easy to build complex migrations that e.g. involve some schema changes first, then a data migration, and then some more schema changes, which is hard to do with most migration tools.
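For illustration, a migration of that shape might look something like this as a plain SQL file (table and column names are made up):

    -- 0042_split_full_name.sql
    BEGIN;
    ALTER TABLE users ADD COLUMN first_name text, ADD COLUMN last_name text;
    -- data migration sandwiched between the two schema steps
    UPDATE users
       SET first_name = split_part(full_name, ' ', 1),
           last_name  = split_part(full_name, ' ', 2);
    ALTER TABLE users DROP COLUMN full_name;
    COMMIT;

Autogeneration tools tend to see a change like this as a drop plus an add, which is exactly the kind of corner case that pushes people toward hand-written files.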

I also found that the use case most of these migration tools optimize for (being somewhat independent of specific SQL dialects) rarely works in practice, except maybe if you stick to the lowest common denominator of all SQL dialects. And that again is wasteful IMO, as not using a lot of the functionality that's present e.g. in Postgres can make your life a lot harder.


I'm working with .NET rather than Python, and I worked with Entity Framework's automated migrations for years. I say "worked with", but all too often it was "battled with" - just as you mention, there were simply too many corner cases for anything other than the simplest of databases.

A few years back I switched to plain SQL based migrations (using a lightweight library called DbUp to execute them) - I couldn't be happier with it!

What better tool to express database changes with than SQL? I also like the transparency it brings you - the files are right there for you to read. And I'm a stickler for formatting and convention, so it's great being able to write SQL files that include comments and use my preferred formatting.


Yes, that is exactly how you use SQLAlchemy's Alembic: you connect it to your development db, generate the migration, run it locally first, and then on staging/prod.
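Roughly (assuming sqlalchemy.url in alembic.ini points at the dev database; the revision message is made up):

    alembic revision --autogenerate -m "add invoices table"  # diff models against the dev db
    alembic upgrade head                                      # apply locally
    # review/commit the generated version file, then run
    # `alembic upgrade head` against staging and prod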

This has both the pro and con of your app being quite tightly coupled with the database.


> your app being quite tightly coupled with the database

I mean, this is the case any time your app talks directly to a database


I think I was using the wrong terminology. What I meant is that the ORM code in your app needs to reflect your database pretty much exactly, otherwise you can get unwanted migrations.

Which is in most cases desirable, but if you want to use database features that SQLAlchemy/Alembic don't (yet) support, for example, it can get a bit messy and/or require some manual plumbing to keep getting the migrations you want.


Only if your DB is purely a way of keeping your objects warm. Otherwise you can design your DB with the business in mind, the app with the app's domain in mind, and put an adapter between the two. It's an extra layer of abstraction, but it might help BAs work against the database without being tripped up by application-specific things that affected the design. Depends on the system, of course!


Then your "app" is the adapter, and the coupling still applies.


why would you ever add a layer between you and the db?


Some DBs have inherent limitations and you need to design something to get around them. Sometimes you have many apps and you don't want to reinvent the wheel to support what are otherwise standard ways of accessing a database. Sometimes it's a "special" database, which makes this more convoluted, or perhaps you need to implement things like gradual back-off retries to avoid swamping the backend. Sometimes you don't want to have to patch every app that uses the database when a database change is breaking. Sometimes you want to keep developers from abusing your database, provide more fine-grained access control, prevent poorly researched queries that can kill performance, etc. Sometimes you're talking to more than one db.

Personally, I would start writing the abstraction as soon as I plan to have more than two apps.


Yeah, that seems off to me too. I don't know of anyone who just goes straight to prod in that way, and I wouldn't take them seriously if they did.


I think you're right -- I've never managed SQLAlchemy on a production site. I'm going to edit this paragraph.


This is what SSDT dacpacs offer (it's supposed to be an open standard, too). Unfortunately they suffer from issues similar to the ones you describe (most data migrations are destructive).

I settled on flyway after migrating to postgres a few years ago and haven't looked back. The "this is great" moment came after we automated snapshot and restore of per-dev environments in the release pipelines - this allowed us to roll out the last released version of the product with either a clean database (fast) or a snapshot (slow) to our own individual database in only a few minutes. As a result we could test flyway migrations without the grief caused by breaking (or reverting) a shared 'dev' database.
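For reference, a typical Flyway run in such a pipeline looks something like this (connection details are illustrative):

    flyway -url=jdbc:postgresql://localhost:5432/app -user=app \
           -locations=filesystem:sql migrate
    flyway -url=jdbc:postgresql://localhost:5432/app -user=app info   # list applied/pending versions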

Since moving on from that team/company and running head first into dacpacs again, I've found it increasingly difficult to sell my past approach. Few people are genuine data experts, fewer still are DBAs, and almost no one is responsible for cross-cutting concerns etc. I can't help but think that empowering devs to automate migrations, even if it's with a hackity SQL parser, is part of the solution, and I'd love to see this take off.


I really like the workflow GitHub uses for this: https://github.blog/2020-02-14-automating-mysql-schema-migra...

(It makes use of skeema, which allows you to track your schema in a declarative way vs. ALTER TABLE statements.)


skeema is similar -- if I understand it correctly, it keeps 'truth' as a SQL file and diffs it against a live mysql DB

automig diffs the SQL files directly without looking at a live DB

pros & cons to both approaches


Skeema author here -- you're correct that a live DB is involved, but the notion of "truth" in Skeema actually depends on what operation you're running :)

In Skeema, you have directories of *.sql files, where each directory defines the desired state of a single "logical schema". A config file in each dir allows you to map that logical schema to one or more live databases, possibly in different environments (prod, stage, etc) and/or possibly different shards in the same environment.

Most commonly, users will want to "push" the filesystem state to the databases, e.g. generate and run DDL that brings the database up to the desired state expressed by the filesystem. But users can also "pull" from a live database, which does the reverse: modifies the filesystem to look like the current state of a live database, essentially doing a schema dump. So the source of truth depends on the direction of the operation.
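In CLI terms that's roughly (the environment name comes from the .skeema config and is illustrative here):

    skeema diff production    # show the DDL that a push would run
    skeema push production    # make the live db match the *.sql files
    skeema pull production    # make the *.sql files match the live db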

Skeema also uses the live database as a SQL parser. Instead of needing to parse every type of CREATE statement accurately across many different versions/dialects of MySQL, Percona Server, MariaDB, (and maybe someday Aurora etc), Skeema runs the statements in a temporary location and then introspects the result from information_schema. This avoids an entire possible class of bugs around parser inaccuracies.

That all said, I do wholeheartedly agree that generating DDL from git alone is both useful and really cool! I actually built a "database CI" product around this concept as a GitHub app last June: https://www.skeema.io/ci/


whoa neat

and agree re parsing -- turned out to be a huge pain. I think someone bundled the postgres parser for python; I wish this were true for every dialect but even so, there are versions to consider.


This is definitely better than the traditional rails/django approach to migrations, but the post seems to imply that generating changes against a live DB is a pointless annoyance - I disagree.

If you care about testability - the best way to test your migrations is by testing them directly against the schema they'll actually be applied to.

What if your tool is not the only party making schema changes to the database? What if there was a bug in a previous migration and the database is in a different state than you're assuming? The only way to check for problems like this is to check your actual production state.


> and because in general there’s no guarantee that the migrations produce something equivalent to your ‘schema.sql’ or ORM definition.

I think the solution to this is to not have a schema.sql. Build your development/testing database by running the same migrations you have/will run in production.

There's still the issue of ORM parity, which admittedly is a bit awkward. I try to limit my ORM definitions to the minimum necessary for the ORM to run. If I want a true picture of what a table looks like, I look at the database directly.


To me it seems a bizarre approach to have no actual goal state and have your source of truth be a chain of scripts.

Why not:

- Have an ORM definition/schema.sql explicitly defined

- Continually test that production_schema+pending_changes=goal_schema

This way, no chain is required: The only scripts you need to keep hanging around are any pending changes yet to be applied. And you can mostly autogenerate any changes required, so the need to cobble together scripts manually is much reduced.


You can have both. I usually have numbered SQL scripts that represent up & down migrations (example: https://github.com/algoneer/algonaut/tree/master/algonaut/mo...). The first migration simply contains the initial data schema, the following migrations contain amendments made during development.

As you said, it can be confusing to have to look at a series of files to figure out how the schema would look in practice. To alleviate this problem we regularly "compact" our changes by generating a new initial migration representing the current state of the database (this is trivial with most database systems, e.g. using the "pg_dump" command for Postgres), replacing all migration files and resetting the schema version in the production DB to 1.
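A sketch of that compaction step for Postgres (the version-table name depends on your migration tooling and is made up here):

    pg_dump --schema-only --no-owner --no-privileges "$PROD_DB" > migrations/0001_initial.sql
    # delete the old incremental files, keeping only the new baseline,
    # then point production's version tracking back at the baseline:
    psql "$PROD_DB" -c "UPDATE schema_version SET version = 1"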

In practice it's hard to avoid having incremental migrations, as you need a way to evolve your production schema. Your approach would be awesome of course, but I don't know of any tool that can robustly come up with a migration by comparing a database schema and a schema specification. Often it's not even possible to derive the necessary changes with such an approach: if you e.g. change the name of a column in your schema.sql, how can the tool know whether you want to remove the original column and add a new one, or simply rename and possibly type-cast it? If you have multiple changes like that it only gets worse. So I'm not very optimistic about that approach.


I wrote a tool that does exactly this (migra, mentioned in the post). You obviously have to review the output to catch things like renames, but most of the time, for typical operations like adding a column/updating a view/whatever, it's going to give you what you want.
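Basic usage is something like this (--unsafe lets it emit drop statements; the connection URLs are illustrative):

    migra --unsafe postgresql:///app_current postgresql:///app_target > pending.sql
    # review pending.sql (e.g. turn an unintended drop+add back into a rename), then apply it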


You usually keep the full chain because you don't track the state of every environment (like local dev environments).


Your dev environment is invariably just set to a goal schema. Why is a chain required for that?


Ah I see what you mean. It doesn't work for data migrations though.


Looks interesting - GitOps for databases. Can I roll back to a specific version?

I think Rails has a combination - migration files plus a schema file that serves as a source of truth https://edgeguides.rubyonrails.org/active_record_migrations..... It solves the cold-start problem, but I'm not sure it can generate a migration from changes to the schema file.


Rails does some basic tracking in the schema_migrations table, which it uses to enable rollbacks via its CLI.
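E.g.:

    rails db:migrate            # apply pending migrations, recorded in schema_migrations
    rails db:rollback STEP=1    # revert the most recent migration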


in theory rollback works -- the tool will try to generate a migration between any two git refs

in practice the rollback use case isn't tested super well and, for data integrity reasons, it doesn't drop tables -- you'll run into trouble if you try to roll back across an added table


Minor nit, but GitHub's gh-ost tool is for performing schema changes, not data migrations. You may be thinking of Shopify's Ghostferry?

Anyway though, your broader point is absolutely correct, re: data migrations typically involve complex custom/company-specific tooling at scale. And it's also difficult to map data migrations to a declarative tool in a way that feels intuitive.


I wonder how automig compares to loading the schema definitions from git into scratch databases and feeding them to migra https://djrobstep.com/docs/migra
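i.e. something along these lines (refs and database names are made up):

    createdb scratch_old && createdb scratch_new
    git show v1.0:schema.sql | psql scratch_old
    git show HEAD:schema.sql | psql scratch_new
    migra postgresql:///scratch_old postgresql:///scratch_new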


Sorry if this comes off as looking down on people, it's not meant that way, but I'm honestly a bit baffled that to this day I meet many people who are discovering schema migration tools as something new/special instead of having learned them as one of the basic tools of a developer.

I haven't written schema migrations by hand on a regular basis in 12+ years, since the very start of my career, so the tools were probably not even new back then. Am I just living in a bubble where those tools, while certainly not perfect, are heavily relied upon, whereas others prefer to do it (semi-)manually to have full control? What is the state of these tools in different stacks/environments?


Perhaps it's just that in some cases the tool is overkill? In one case, I wrote a super simple shell script to track applied migrations for a database that had until then been managed entirely manually; I didn't feel it was worth the trouble to learn something more complicated just for that use case, since it took me about 15 minutes to write a script that worked well enough: it applied the migration scripts transactionally (transactional DDL rules), in order, and idempotently.
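A minimal sketch of that kind of runner, assuming Postgres, psql, and a tracking table (all names are illustrative):

    #!/usr/bin/env bash
    set -euo pipefail
    DB="${DATABASE_URL:?set DATABASE_URL}"
    psql "$DB" -q -c "CREATE TABLE IF NOT EXISTS schema_migrations (
                        filename text PRIMARY KEY,
                        applied_at timestamptz NOT NULL DEFAULT now())"
    for f in migrations/*.sql; do
      name=$(basename "$f")
      already=$(psql "$DB" -tA -c "SELECT 1 FROM schema_migrations WHERE filename = '$name'")
      if [ -z "$already" ]; then
        echo "applying $name"
        # run the file and record it in one transaction (transactional DDL)
        psql "$DB" -q --single-transaction -v ON_ERROR_STOP=1 \
             -f "$f" -c "INSERT INTO schema_migrations (filename) VALUES ('$name')"
      fi
    done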

Granted, I'm not a software developer by trade; just a sysadmin who occasionally uses programming to solve problems. I imagine if I had to deal with actual software development I'd end up learning one of those tools eventually.


I started working with Django professionally in 2013. At that time, migrations weren't built into Django, but there was a widely used plugin (South), which was subsequently integrated into Django core. I can't really imagine managing a professional web app without a migration tool. Sounds like a complete nightmare.


>> I haven’t written a migration in 1+ years

This.

The above should be the goal of every database-backed application framework. A really well-balanced post (the author is super modest).

>> Dialect support is no picnic.

Are you doing SQL statement parsing here to diff Postgres? That can turn into a lot of work if you want to add other databases.

@awinter-py: have a look at our approach to schema migrations - we hand a GUI DB application to devs and it generates schema migrations as you change the database. The goal is still the same - backend devs should never write schema migrations.


I really want schema definition to be declarative; then ask me for a data migration if applying it (to a given db) is going to result in data loss (e.g. the declaration doesn't include a column that currently exists in the db), and make it optional otherwise.

Even when state-aware migrations are needed they could be better, e.g. 'on creating this column, value = other column * 2' rather than a step in an ordered sequence.


Do people often find they're doing schema migrations without having the need to run data migrations as well?

I find writing the alter table commands fairly trivial compared to the data migration that comes along with it.


We need a file system API for databases, much like /proc.


There is; it's called INFORMATION_SCHEMA and is defined by the SQL standard.

https://www.postgresql.org/docs/current/information-schema.h...

Postgres also has its own schema that is basically manipulated directly by the C runtime:

https://www.postgresql.org/docs/current/catalogs.html
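For example, listing columns the way a migration tool would introspect them:

    SELECT table_name, column_name, data_type, is_nullable
      FROM information_schema.columns
     WHERE table_schema = 'public'
     ORDER BY table_name, ordinal_position;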


So I can cat the file system API directly from Linux?



