Hacker News
Ask HN: How do you organize and manage database migrations?
81 points by anonfunction 5 months ago | 60 comments
When building services that rely on relational databases (in my case Postgres), what are some best practices and tools to help manage schema changes?

We've been using migrations and doing it all manually, but it's become a bottleneck and a bit of a nightmare with multiple consumers of the database needing to make schema changes.

Another concern is multiple environments, from local development to staging and production. We're using docker-compose for local development which runs the entire "full" schema and then I'm manually applying the migration files to staging and production before we deploy.

I've looked at some projects like flywaydb[1] and liquibase[2] but both are not completely free and seem proprietary. Does anyone know of another open source system that could help manage database schema versioning and migrations?

Thanks so much HN, this is something that I have been struggling with.

1: https://flywaydb.org/

2: https://www.liquibase.org/




Honestly? I don't know of any tool that handles database migrations as well or as easily as Ruby on Rails and ActiveRecord.

I have literally spun up a barebones Rails app just to manage my schema and migrations for a completely separate Python/PostgreSQL project before.

https://guides.rubyonrails.org/v5.2/active_record_migrations...


Django's migration handling is excellent as well [0].

[0] https://docs.djangoproject.com/en/2.2/topics/migrations/


I second this - migrations in Rails are fantastic. Even if you aren't using Rails you can see the style and source code here https://edgeguides.rubyonrails.org/active_record_migrations....

The source code of large open source projects such as GitLab is also a treasure trove for discovering migration practices.

If you have millions of users with high-traffic tables, you may want to check the migration guides for avoiding downtime, data migrations, and batch processing here https://docs.gitlab.com/ee/development/migration_style_guide...

If you want to run DDL operations concurrently, you may benefit from https://gitlab.com/gitlab-org/gitlab/blob/master/lib%2Fgitla...


That gitlab link was a good read, thanks!

One of the points was that migrations should always be reversible. I've struggled with this in the past when it comes to migrations that change or drop data. How do you write a downgrade script that restores the previous data?

I've given up on downgrades now. I make sure to take a DB backup or snapshot before running an upgrade. I've never had to test this solution in a tense situation though...


One way is to keep a polymorphic (Postgres unlogged) `backups` table with a BSON/JSON field that can store any table's row. Use it before and after migrations, and back up and truncate it frequently.
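A toy sketch of the backups-table idea above, using sqlite3 and plain JSON as a stand-in for Postgres (where the table would be declared UNLOGGED); the table and column names are invented for the example:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'a@example.com')")

# One polymorphic backups table: source table name plus the row as JSON.
conn.execute("""
    CREATE TABLE backups (
        src_table TEXT,
        row_json  TEXT,
        taken_at  TEXT DEFAULT CURRENT_TIMESTAMP
    )
""")

def snapshot(conn, table):
    """Copy every row of `table` into backups as a JSON document."""
    cols = [c[1] for c in conn.execute(f"PRAGMA table_info({table})")]
    for row in conn.execute(f"SELECT * FROM {table}"):
        conn.execute(
            "INSERT INTO backups (src_table, row_json) VALUES (?, ?)",
            (table, json.dumps(dict(zip(cols, row)))),
        )

snapshot(conn, "users")  # before a destructive migration
saved = conn.execute("SELECT row_json FROM backups").fetchall()
print(saved[0][0])  # {"id": 1, "email": "a@example.com"}
```

Truncating `backups` periodically (as the comment suggests) keeps it from growing without bound, since each snapshot copies whole tables.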


I wholeheartedly agree.

Why do you think no other toolchain has come close to this for other languages or frameworks? It's a very common problem that needs a solution no matter what you are building and what you are building it on.


Ever tried Entity Framework in .NET? It's amazingly easy, and there are two ways:

- Database-First

- Code-First


+1 for code-first migrations. It's a bit of a learning curve, but the effort is worth it IMO.


I prefer DB-first.

Create an SQL project that defines the schema. It automatically compares that schema with your current DB schema and auto-generates a migration script for you.

Then update the entities from the DB.

A few steps involved, but it avoids writing and maintaining migrations yourself.


I would like to suggest code-first, at least try it once.

I don't want to go back to any sort of migrations outside of this.


Code-first does this too, but based on a schema you define in your code and not in your DB.


We use this for migrations daily:

https://pypi.org/project/yoyo-migrations/


I'm the author of an open-source schema management tool, Skeema [1], which uses a declarative infrastructure-as-code approach: you track your table definitions (and procs, funcs, etc) in a Git repo of *.sql files, typically one CREATE statement per file.

Instead of writing migrations, developers simply add / remove / modify these CREATE statements, going through the same pull request and code review process as for code. The tool knows how to diff the desired state (expressed in the repo) vs the actual state of any database environment (prod / stage / dev, etc) to generate the appropriate DDL.

It's a bit of a paradigm shift relative to traditional MySQL and Postgres migration tools, but it's an approach that has been used successfully by Facebook internally for nearly a decade. It's also a common approach in the SQL Server world. I've written a blog post [2] describing some of the advantages of the declarative approach.

Skeema currently only supports MySQL and MariaDB, but some declarative tools for Postgres include sqldef [3] and migra [4].

[1] https://www.skeema.io

[2] https://www.skeema.io/blog/2019/01/18/declarative/

[3] https://github.com/k0kubun/sqldef/

[4] https://github.com/djrobstep/migra
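For illustration, the core of the declarative diff idea can be sketched in a few lines of Python against sqlite3. This is a toy, nothing like Skeema's actual implementation: it only diffs whole tables by name, not columns, indexes, or other objects, and all names are invented for the example.

```python
import sqlite3

# Desired state: the repo's CREATE statements, one per table.
desired = {
    "users":  "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)",
    "orders": "CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER)",
}

# Actual state: what the live database currently contains.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.execute("CREATE TABLE legacy (id INTEGER)")
actual = {
    name for (name,) in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"
    )
}

# The "migration" is whatever DDL closes the gap between the two states.
ddl = [desired[t] for t in sorted(desired.keys() - actual)]          # missing
ddl += [f"DROP TABLE {t}" for t in sorted(actual - desired.keys())]  # extra
print(ddl)
```

A real tool must also decide what a modified CREATE statement means (ALTER vs drop-and-recreate), which is exactly where the rename limitation discussed below comes from.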


I was going to ask about rename support, but the lack of it is documented over here along with a couple of other things. https://github.com/skeema/skeema/blob/master/doc/requirement...

Interesting ideas though. Microsoft does something similar with a DACPAC where they record the schema and diff, but they also record a list of rename actions etc. and keep a table to track those “migrations” in addition to the diff process. https://docs.microsoft.com/en-us/sql/relational-databases/da... and/or https://docs.microsoft.com/en-us/sql/ssdt/extract-publish-an... — for rename and other SSDT details, see https://docs.google.com/presentation/d/1DvC2gzCucjHFbGiBLa0R... (it’s a “RefactorLog” if searching)

Having seen all that, the Rails ActiveRecord approach strikes a decent balance between specifying migrations as imperative actions and keeping a declarative schema checked in. A comparison between the two approaches is at https://blog.raph.ws/2019/01/migrations-vs-dacpacs/ but I find DACPAC useful yet over-complicated for developers, and not well enough supported on non-Windows hosts yet. (It basically requires VS for Windows or SSDT right now...)

Also, it’s likely a migration system needs some kind of code review checks and automation around which actions you’d allow in your migrations if you don’t have that already...

The next question after “how do you migrate?” is probably “how do you backup and how long would it take to restore after a bad migration and/or bad rollback?” These days your answer is probably either “no time at all” for smaller apps with SSDs or “it’s all outsourced to the cloud,” for the new cloud distributed data stores or for the smart DIY ones, “we only deploy one server at a time and can either afford the downtime or have copies in multiple AZs, perhaps eventual consistency...”


We built it ourselves.

* We have migrations written in SQL files named like 45_add_column_x_to_view_y.sql

* We track the migrations in git and we do merge requests and reviews

* We include a comment in each migration containing the original DDL of the object (1. it makes it easier for the reviewer to see what's changing, 2. it gives us a way to rollback, though it never happened so far)

* We track the applied migrations in a table migrations looking like "2019-10-30T10:49:00" ! "45_add_column_x_to_view_y.sql"

* We have a CLI in the same repo that allows us to apply the migrations. It's 50 LOC of Python that basically 1. gets the DB host/user/pwd from an .env, 2. checks the last applied migration in the migrations table, 3. finds the new migrations to apply, 4. applies them following the order of the ids, e.g. 45 before 46, 5. updates the migrations table

* We have 3 DBs (dev / staging / prod) in AWS RDS and I have an .env for each one of them

* We have a CI pipeline on the repo to automatically apply the merged migrations in staging and check if nothing goes wrong

* When staging or dev is trashed, I delete it and recreate it from a snapshot of prod (it takes 5 min) https://stackoverflow.com/a/49878477/652669
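For readers curious what such a ~50 LOC CLI looks like, here is a minimal sketch of the apply loop, shrunk further and run against sqlite3 for self-containment (the setup described above targets Postgres on RDS, with credentials read from an .env file; all names here are illustrative):

```python
import pathlib
import sqlite3

def migrate(conn, migrations_dir):
    """Apply any .sql files not yet recorded in the migrations table."""
    conn.execute("""
        CREATE TABLE IF NOT EXISTS migrations (
            applied_at TEXT DEFAULT CURRENT_TIMESTAMP,
            filename   TEXT UNIQUE
        )
    """)
    done = {f for (f,) in conn.execute("SELECT filename FROM migrations")}
    # Files named like 45_add_column_x_to_view_y.sql, ordered by numeric id.
    files = sorted(pathlib.Path(migrations_dir).glob("*.sql"),
                   key=lambda p: int(p.name.split("_", 1)[0]))
    for path in files:
        if path.name in done:
            continue
        conn.executescript(path.read_text())  # apply the new migration
        conn.execute("INSERT INTO migrations (filename) VALUES (?)",
                     (path.name,))
    conn.commit()
```

Running it twice is harmless: already-recorded filenames are skipped, which is the whole contract of the tracking table.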


This is almost exactly what I built at my company. It felt like a dirty hack at the time, and I assumed we'd transition to something "real" over time, but it's been working flawlessly for quite a few years now and I've yet to see any reason to use something different.


We have something very similar, written in Java. Our migration definitions specify whether the migration is run pre-release or post-release, in order to facilitate no-downtime releases. The pre-release migration is responsible for not breaking the running code. The released code is responsible for detecting (if necessary) the pre- or post-release state and behaving appropriately.

Sequence is to run pre-release migrations, deploy the release, then run post-release migrations. Sometimes days apart, in case things need to be stabilized. Testing environment runs only pre-release migrations, until some time before the release, then runs post.

We also have a handful of keywords to run operations in ways that don't lock large, frequently-used tables, or do other environment- or application-specific operations in standard ways.


Which existing technologies did you look at and why didn't they fit your needs?


We looked at Alembic and our main issues were that

1. we don't have the entire database modeled into SQLAlchemy and we don't plan to

2. we have A LOT of views, triggers and functions and they are not very well supported by SQLAlchemy/Alembic (? I may be wrong)


Point 1 is not an issue IMO, you definitely don’t need to model anything schema-wise into SQLAlchemy ORM to make Alembic useful, since Alembic operations translate pretty much directly to SQL. I’ve just built a project last month that uses Alembic but otherwise connects to the database and issues SQL directly.

Point 2 is indeed a big factor though. Alembic doesn't have much first-class operation support beyond data storage constructs, and if most of what you need is op.execute() (which runs plain SQL) one might just as well build their own tool like you did.


How does the system handle rolling back a migration?


Flyway is free open source software (licensed under the Apache License 2.0).

There are pro and enterprise licenses in case you have additional requirements. Those licenses, as I understand it, are proprietary and hence non-free, but the community edition absolutely is free software.

I’ve been using the community edition in multiple projects for several years now and it works great, particularly in the context of Spring Boot applications.

It might not be just as polished and well-integrated as ActiveRecord is for Ruby on Rails but keep in mind Flyway is supposed to be framework-independent so it can’t be too tightly integrated.

That said, its integration with Spring Boot is really good. Just add it as a dependency and configuration and migrations during application startup will be taken care of automatically.


Been using Flyway for years with no problems, and it's definitely open source; I've contributed documentation changes in the past.

Repeatable migrations are great especially when combined with upserts in Postgres, if you keep config in your database as well.


With Flyway you need to pay for a license to get down migrations. You can always write the down migrations anyway and apply them manually when needed.


I've been using Dbmate for a few years now across our entire stack. It does the job perfectly.

https://github.com/amacneil/dbmate


I can second Dbmate.

The process of SQL migration management is pretty simple and, as other comments note, it's not hard to write something yourself that works well. I just prefer to use something a little more battle-tested in the deployment process.

Also, it's a Go app, so you don't need to pull down a stupid amount of dependencies for it to work: just grab the binary from its GitHub releases and mark it as executable.


Our stack heavily uses Django, so we use Django's migration system. I've worked with Alembic and some other Python-based migration systems before, but nothing comes close to Django's simplicity.

We're heavy Postgres users and take full advantage of some of its features which are not very well supported by Django. Quite a number of our migrations involve `migrations.RunSQL` operations. That's fine. We've also developed a package [1] that extends Django with a lot of Postgres specific features. For example, recently we added support for partitioned tables. We can now set those up as normal Django models.

In order to reduce the number of f'ups, we also built a Github bot that inspects PR's for potentially disastrous migrations. For example, the Github bot will yell at you if you try to rename a field or blindly drop a table. The Github bot also makes sure your migration doesn't conflict with another migration that was just merged.

[1] https://github.com/SectorLabs/django-postgres-extra


Honestly, build it yourself. A good library of small shell scripts along with some well-commented .sql files will be easier to maintain and understand than any off the shelf migration utility.


We have built projects where we used Python's Alembic or Ruby's ActiveRecord migrations to manage the DB.

The actual application was in Go or Node.js.

I daresay ActiveRecord and Alembic outclass Liquibase in a lot of ways.


Another vote for Alembic, especially if you use SQLAlchemy in your app. I can't say enough positive things about the 1-2 punch of SQLAlchemy and Alembic if you are dealing with relational databases.

Some technical benefits to Alembic:

- It will give you the structure for upgrades and downgrades.

- Has a clean interface for DDL operations

- Supports every (?) database that SA does

- You can use it in "Offline" mode if you don't want to have Python and all the dependencies on the server or have to hand the migration off to someone else that has access.

- The branch feature is really nifty if you are in advanced situations.

Some non-technical benefits with Alembic

- It is open source

- zzzeek, the author, is pretty active on here and has built both SQLAlchemy and Alembic so there is a lot of cohesion in styles.

- The issue tracker is active and responsive

- The code is stable (something you want in a migration tool) and is unlikely to go anywhere.

Highly recommend.

Edit: Formatting


I've previously used some ORM-based migration tools, like Django ORM migrations (and when it was a separate tool, South), looked at others like Alembic. What I found is that they automate simplest cases, in turn making more complex cases even more complex.

So when I need to do a complex migration on a big database, I usually open a Postgres shell, open a transaction, and develop the migration like code, in a REPL. With these automatic tools, after doing that I'd have to go read their documentation and port the SQL to their syntax.

If I have junior members on the team who don't know SQL well yet, the easier tasks are automated by the tool, while for the more complex tasks they completely lack the skills to develop a migration and have a much steeper wall to climb.

Another thing is that many tools have downgrade migrations. Downgrade migrations are a lie! How could I revert a migration that drops a NOT NULL column with some data? In case I really need to revert a migration I will write another forward migration. Which I did exactly zero times in more than ten years. So, writing downgrade migrations is a waste of time.

Another minor point is that sometimes migrations are running for a loooooong time. Not a second or two - it can be many hours. So I don't want this migration to start automatically during my deploy process. But I don't want to have it as a separate script, so on local dev installations these migrations just take the same `make migrate` route. It's much more convenient to take a part of a commented SQL and run it separately, than to take a part of a migration script written in Python and run it separately.

I'm using the Nomad tool https://pypi.org/project/nomad/ which is written in Python. I think any tool that supports plain SQL migrations, has dependencies between migrations, and doesn't require writing downgrades would be acceptable.


We have three applications that access the same database (ugly, we're working on changing that), so we have a fourth component that's solely responsible for the schema.

We use CI/CD for all components, including for the schema. The deployment playbook for the schema basically does a `flyway migrate` after installing the schema / migration files.

We have (again, historical reasons) a home-built schema diff tool for our subset of mysql that we use. For Postgres I'd look into apgdiff. We use the diff tool to generate initial versions of the migrations, potentially modify them by hand, and then `git add` them in the schema project.

If you don't like flyway, you could check out https://sqitch.org/ which is fully Open Source.


I have used sqitch http://sqitch.org/ in the past and found it good for postgres. It supports revert and verify too.


+1 for Sqitch. If you include a full deploy -> verify -> revert -> deploy flow in your automated testing suite you can pretty safely introduce changes that can be rolled back without issue. This has served my team well on several complicated databases over the past two years.



Entity Framework in .Net is very easy.

---

The logic for migrations is surprisingly simple:

- A new migration will normally create a file with Up and Down methods, named "timestamp" + "description of the purpose of the migration" (e.g. 20191031_080603532_add_tags_to_contacts.cs)

- Every migration, when run on the database, will create a new entry in e.g. the _migrations table. Some tools also store the previous context, which is a hash of the current schema before the migration.

If the application has a live check on startup, it will hash the current DB schema, compare it to the entry in the database, and check whether that matches the latest migration.

The latest is the newest migration file with Up and Down methods. All newer ones are executed in timestamp order, until the latest one has been run.

Tools also normally offer a "TargetMigration" for development purposes, or for reverting a database back to a previous state.

Every Up or Down migration is run inside a transaction. If it fails, nothing is stored.

There is also normally a way to seed the database after the migration is run, though this could also be added in the Up or Down method.
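A rough sketch of that startup live check, with sqlite3 standing in for the real database; the `_migrations` and `schema_hash` names are illustrative, not Entity Framework's actual table layout:

```python
import hashlib
import sqlite3

def schema_hash(conn):
    """Hash the DDL of everything in the schema, in a stable order."""
    ddl = "\n".join(
        sql for (sql,) in conn.execute(
            "SELECT sql FROM sqlite_master WHERE sql IS NOT NULL ORDER BY name"
        )
    )
    return hashlib.sha256(ddl.encode()).hexdigest()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contacts (id INTEGER PRIMARY KEY, tags TEXT)")
conn.execute("CREATE TABLE _migrations (name TEXT, schema_hash TEXT)")
# Each applied migration records the hash of the schema it produced.
conn.execute("INSERT INTO _migrations VALUES (?, ?)",
             ("20191031_080603532_add_tags_to_contacts", schema_hash(conn)))

# On startup: compare the live hash with the one stored by the last migration.
(stored,) = conn.execute(
    "SELECT schema_hash FROM _migrations ORDER BY rowid DESC LIMIT 1"
).fetchone()
print(stored == schema_hash(conn))  # True: schema matches the last migration
```

If someone hand-edits the schema, the hashes stop matching and the application can refuse to start instead of running against an unexpected schema.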

PS. All other methods until now were bad.


I use good old fashioned numbered SQL scripts. 01_foo.sql 02_bar.sql 03_baz.sql

In my local dev environment, I have a DB backup representing the current state of production. This is the starting point SQL scripts must be applied to. A deployment script restores this DB, then applies the SQL scripts against it. The entire deployment process takes about 2 seconds even with many gigs of data in the DB backup.

Sometimes I'll include an extra "dummy data" script applied at the end.

My local environment, Test, and Production are deployed in almost exactly the same way. A few extra actions are done for Production, of course. And Production is not restored, as that would lose data. Local dev is optimized for a speedy restore/deploy.

Developers always want to treat the DB as if it's code. But it's not. It's a blob of valuable state that must be protected from corruption. You apply patches to that blob of state. The order patches occur matters. There's no "re-build" like in pure code.


Django is my favorite backend because its database migration tool is easy to use and can be customized to fit my needs. That being said, if you want database-first schema migration, the only way to do that (that I'm aware of) is to generate models for your existing schema using Django's inspectdb command, then run a fake migration.

I like how WordPress does simple database-first migration: just call the `dbDelta()` function with your new table schema in SQL format (CREATE TABLE ...) and it'll figure out a way to update your table to fit the new schema. I wonder if there is any similar DB migration tool in Python that works like this; it seems perfect for small projects that don't want to use an ORM.

https://codex.wordpress.org/Creating_Tables_with_Plugins#Cre...
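A toy dbDelta-style helper is easy to sketch in Python (the `db_delta` function name is hypothetical and sqlite3 stands in for MySQL; unlike the real `dbDelta()` it takes the desired columns directly instead of parsing a CREATE TABLE statement, and it only ever adds columns, never drops or retypes them):

```python
import sqlite3

def db_delta(conn, table, desired_cols):
    """Add whatever columns the live table is missing; leave the rest alone."""
    actual = {c[1] for c in conn.execute(f"PRAGMA table_info({table})")}
    for name, coltype in desired_cols.items():
        if name not in actual:
            conn.execute(f"ALTER TABLE {table} ADD COLUMN {name} {coltype}")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT)")
db_delta(conn, "posts", {"id": "INTEGER", "title": "TEXT", "body": "TEXT"})
cols = [c[1] for c in conn.execute("PRAGMA table_info(posts)")]
print(cols)  # ['id', 'title', 'body']
```

Being add-only is also why this approach stays safe for small projects: it can never destroy data, at the cost of never cleaning up obsolete columns.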


We're trying out a few things right now.

1) Use a migration tool (Flyway, Dbmate, etc.)

2) We try to keep scripts idempotent to minimize migration-script explosion. It also helps to keep changes to similar assets located within the same file: it's easier to see the progress of a given asset as it changes throughout the lifetime of your app.

3) We have a dedicated repository for each database/schema. A CI/CD process triggers on a push to a given branch (dev, qa, stage, prod, etc.). The CI/CD process runs the migration script (in this case AWS Code Pipeline). Having the schema in its own repository decouples our databases from the service(s) that use them.

4) We try our best to separate schema changes from feature changes: a) push database changes first, maintaining backwards compatibility, b) then push feature changes, c) then remove the backwards compatibility. This minimizes the need for rollback scripts.

Local development is the same as yours: docker-compose.
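Point 2 above in miniature, using sqlite3 as a stand-in (Postgres additionally offers `ADD COLUMN IF NOT EXISTS`): an idempotent migration guards each change so re-running the whole file is a no-op rather than an error, letting one file per asset grow over time.

```python
import sqlite3

def apply_idempotent(conn):
    """One asset's migration file, safe to run any number of times."""
    conn.execute("CREATE TABLE IF NOT EXISTS accounts (id INTEGER PRIMARY KEY)")
    existing = {c[1] for c in conn.execute("PRAGMA table_info(accounts)")}
    if "status" not in existing:  # guard makes the ALTER re-runnable
        conn.execute("ALTER TABLE accounts ADD COLUMN status TEXT")

conn = sqlite3.connect(":memory:")
apply_idempotent(conn)
apply_idempotent(conn)  # second run changes nothing and raises nothing
print([c[1] for c in conn.execute("PRAGMA table_info(accounts)")])
```

The trade-off versus numbered one-shot migrations: the file shows the asset's current shape at a glance, but the history of how it got there lives in version control rather than in the file sequence.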


For database migrations on an RDBMS like MySQL or Postgres, we use Knex or Laravel's database migration tools.

For BigQuery, we wrote this nifty tool https://github.com/SplitmediaLabsLimited/supermigration


I've developed a simple migration tool called mgrt [1] to handle database migrations. It takes plain SQL files, and runs them in order, keeping a hash of the contents of the file in the database to ensure that it cannot be modified once run for the first time. I've posted about it here once before [2]. Check it out if you like.

Right now it only supports SQLite, MySQL, and PostgreSQL.

[1] - https://github.com/andrewpillar/mgrt

[2] - https://news.ycombinator.com/item?id=19517001


> but both are not completely free and seem proprietary

I don't mean to be "that guy", but unless your whole business's "thing" is no-proprietary-software, please try to pay for commercial licenses of good software. That money pays for bug fixes, security fixes, general maintenance, feature development, testing, support, and a host of other crap you probably don't want to have to do by yourself. If they provide a free tier, you're also supporting the free users.

Flyway is kind of the gold standard, I recommend it.


We use flyway for the biggest monolith database in our company. We also use alembic for some microservices.

Alembic is extra convenient if you're already invested in sqlalchemy.

I've tried yoyo-migrations[1] once, a while ago.

With yoyo, like Flyway, migrations are written as SQL queries. However, they're expressed as SQL steps in Python files. Might be worth a look if you're using Python.

1. https://pypi.org/project/yoyo-migrations/


Like half the people commenting here, I wrote my own (https://bitbucket.org/koalephant/mallard/) in shell, after using another similar tool that’s written in python (and contributing some fixes to it).

Same general concept as most others use, apply sql files by date, track what’s applied in the db itself, deploy the sql files with your code.


I've found MyBatis Migrations [0], to be very useful. I like the fact that you can do a DB dump of your current database state and set that as the 0th level of migration and then continue from there even if you've never done migrations before.

It's also extremely straightforward and simple as it doesn't try to be part of any ORM.

Highly recommended.

[0]: http://mybatis.org/migrations/


I don't have a say in what we do at work, but in all my side projects (which I write in Go), I've used darwin[1], run in main before the server starts up.

1: https://github.com/GuiaBolso/darwin


Having dealt with this before, if I could go back I would integrate an ORM to do the schema changes. Something like SQLAlchemy/Alembic where you can store your schema changes like git commits and roll forward/back as needed.


FWIW, a product I know uses

https://github.com/vmware/upgrade-framework

which is a little better than plain sequential numbering (which breaks down on teams larger than one).


I'm currently very much into http://knexjs.org/ since it is just a query builder rather than an ORM. It has just enough abstraction to allow it to be DB agnostic.


I've used Alembic extensively for Python projects and sequelize somewhat less so for Node projects. It's not a huge sample space, but Alembic's design is much better IMO because it can detect and manage conflicts across merges.


We use native Rails DB migrations.


I know it's payware, but shout-out to Redgate SQL Compare (SQL Server, for sure). I managed schema comparison from dev to staging to deployment for multiple clients; I would have gone nuts without it.



In JavaScript land I use Knex, which has built-in migrations and a CLI. However, it only has basic features compared to ActiveRecord.


I have been using flyway for over half a dozen projects and more than a few years. Works very well.


If you are using Ruby but don’t like ActiveRecord, you can use Sequel, which is really great


Also check out MyBatis migrations tool. Similar to raw-sql approach, very simple.


Liquibase is open source.

You only get support and templates by paying for it.


We use Sequelize on Postgres/Node.js.


I wrote a tool for that at our company (in Java, but it could be written in any language) about 10 years ago, and it's still in heavy use today in multiple projects, migrating tens of thousands of (SQL Server and PostgreSQL) databases each year. Its key feature is probably its use of a schema hash, which seems to be pretty unique among DB migration tools (at least the common open source equivalents to what I wrote don't seem to do this). It works like this:

- The tool can hash the current schema found in a live database that it's connected to

- It also knows hashes for all schemas ever created for the application (these are versioned and usually originate largely from an ORM generator, but can also mix generated parts with manually-added ones)

- In addition to knowing these "full schemas" with their respective versions, it knows forward (and optionally backward) migrations between the versions (these are entirely created by developers). Migrations usually perform DDL stuff, but may optionally also convert data either with pure SQL or procedurally (as migrations are effectively stored as Java code, they can do practically anything)

- When run against a database, it can either initialize it to any version of the software using one of the full schemas, or migrate it from any known schema version to any other version to which it can find a migration path (pathfinding is a cheap Dijkstra implementation on the migration graph). It automatically detects at which version a schema is by hashing it and comparing the hash with all known schema versions and their hashes (with the ability to fall back to metadata of past migrations also stored in the database in case two versions result in a hash collision). Before and after every migration step, the hash is built from the current state of the database and checked against the stored values for before and after the migration step in question, to ensure the step finds everything as expected and resulted in the correct final state.

- If a backward migration path can be built, migrating "down" is also possible (but I've never seen this actually used in practice, hence most teams today just skip writing backwards migration steps).

- The hashes are also used to prevent accidents that might happen if someone runs a migration against a database on which some person manually modified anything schema-relevant. In such case, the hash won't match any version, and the migration will not even start. Whatever manual modification has been done must be undone first in order to guarantee that the migration steps can find the exact state that they were written for and thus yield the expected results (this actually saved our asses multiple times).

- Finally, a variant of the tool was integrated in the CI process, which means that all migration steps currently checked in are tested against an actual database with each build and can fail the build if not working. Since the current schema can be generated from an object model, the tests also generate this automatically and check its hash against whatever the currently highest versioned migration would produce, failing the test if these are not equal. Effectively this results in developers being immediately notified if someone did change the database schema (usually indirectly via an object model change) but did not check in the necessary migrations to get old versions migrated upwards.
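The pathfinding step described above is the unusual part; with unit-cost migration edges, Dijkstra degenerates to BFS. A sketch with invented version names (in the real tool each edge would carry its migration code):

```python
from collections import deque

def migration_path(edges, src, dst):
    """Shortest chain of schema versions from src to dst, or None."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, []).append(b)
    queue, prev = deque([src]), {src: None}
    while queue:
        node = queue.popleft()
        if node == dst:  # reconstruct the path by walking predecessors
            path = []
            while node is not None:
                path.append(node)
                node = prev[node]
            return path[::-1]
        for nxt in graph.get(node, []):
            if nxt not in prev:
                prev[nxt] = node
                queue.append(nxt)
    return None  # no known migration path

# Forward migrations only; v1 -> v3 is a direct shortcut edge.
edges = [("v1", "v2"), ("v2", "v3"), ("v1", "v3"), ("v3", "v4")]
print(migration_path(edges, "v1", "v4"))  # ['v1', 'v3', 'v4']
```

Backward migrations, where written, would simply add reversed edges to the same graph, which is why skipping them only removes "down" paths rather than breaking anything.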



