Better Database Migrations in Postgres (craigkerstiens.com)
316 points by jbreihan 93 days ago | 85 comments



I really just want a deterministic (or sync-based) migration tool. The only two I'm aware of are Innovartis DBGhost and RedGate SQL Compare (though RedGate requires a license everywhere it runs, whereas DBGhost only requires a license to compile the package).

This stems from my work years ago with databases in on-premise products. Customers would modify the database schema, causing migration "up" scripts to fail, and it would be very difficult (and manually intensive) to baseline everything again and get the database back to a working state. Even though modifying the schema was clearly laid out as "not supported", it would still be something we'd have to fix, because ultimately they (and their account reps, etc) still need the product to work.

We used DBGhost, and then had custom pre- and post-deployment scripts to do certain types of changes, such as renaming a column (`if exists old_column { add new column; copy old to new; drop old_column; }`), or adding expensive indexes. One of the best parts is all it stores in source control is all the CREATE scripts for database objects (and the custom pre/post deployment scripts). Pull requests would let you trivially see a new column or index being added.
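Translated into Postgres terms (DBGhost itself targets SQL Server), that column-rename pattern might look like the following pre-deployment script. Table and column names here are made up for illustration:

```sql
-- Hypothetical pre-deployment script: migrate "old_column" to "new_column"
-- on table "accounts", but only if the old column still exists.
DO $$
BEGIN
  IF EXISTS (SELECT 1 FROM information_schema.columns
             WHERE table_name = 'accounts'
               AND column_name = 'old_column') THEN
    ALTER TABLE accounts ADD COLUMN IF NOT EXISTS new_column text;
    UPDATE accounts SET new_column = old_column;
    ALTER TABLE accounts DROP COLUMN old_column;
  END IF;
END $$;
```

Because the whole block is guarded by the existence check, running it a second time is a no-op, which is exactly what makes this style of script safe to re-run against a database in an unknown state.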

Compared to the pain of creating and maintaining up/down scripts and the long-term problem where deploying a new instance takes thousands of migration steps (or risking the initial CREATE scripts not matching what happens after all migrations), doing a schema sync was significantly simpler in nearly every respect.

I've been looking for something similar for Postgres and MySQL/MariaDB without any luck, and it really surprises me there's not more interest in doing migrations this way.


At Facebook we discovered this is especially important for large sharded environments. As you add more shards, the chances that a migration will fail on at least one of them (simply due to hardware failure) increases. A declarative approach, where you have a repo of CREATE statements that your tools can diff/sync automatically, is much easier to manage. You can have automation just retry in a loop until the shard converges on the desired final state.

> I've been looking for something similar for Postgres and MySQL/MariaDB without any luck

For MySQL, I wrote an open source tool to do this: http://github.com/skeema/skeema -- it gives you a repo-of-CREATE-statements approach to schema management, like you describe.

Skeema's CLI is inspired by Git (in terms of paradigm for subcommands) and the MySQL client (in terms of option-handling), so if you know how to use those tools, it's a cinch to learn. Basically you `skeema init` to initially populate your CREATE statements (one file per table) from a db instance. You can then change those files locally and run `skeema diff` to view auto-generated DDL, or `skeema push` to actually execute it. Or if you make "out of band" changes directly to the db -- such as renames, or just someone doing something manually for whatever reason -- then you can `skeema pull` to update the files accordingly.

Skeema's configuration supports online schema change tools, service discovery, sharding, various safety options, etc. I also hope to add integration with GitHub API at some point. That can provide a really nice "self service" model for schema management at scale.


(We should coin a term for this. I propose: "idempotent database updates".)

I'm also a strong proponent of idempotent database updates, and prefer those over classic migrations wherever possible.

Some notes from PostgreSQL (drawn from several years of experience in various applications):

While this approach works pretty well for idempotent schema changes such as "add column if not exists", it is trickier when a migration changes data content. Although rare, this alone justified classic migrations, which I always had to use in addition to idempotent upgrades. But I try to keep that part as small as possible.

However, that issue can often be mitigated by disciplined naming: never reuse or "clean up" column names, table names, index names, view names, or function names.

A nice fit into idempotent upgrades is "create or replace function" for database functions. However, there is a caveat that you can't replace it if you change the return type. (Changing the argument types is mostly safe, because then it is a different function for PostgreSQL.) You might be tempted to solve this via "drop function if exists" followed by "create function", but then you need "drop ... cascade", which destroys all views (and perhaps indexes!) that depend on it. Again, the correct solution here is to create a new function with a different name. (And drop old one only at the very end, when everything else is switched to the new one.)
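A minimal sketch of that new-name-instead-of-cascade pattern (function and table names are illustrative):

```sql
-- The return type needs to change, so instead of
-- DROP FUNCTION ... CASCADE (which would destroy dependent views),
-- create the new version under a new name.
CREATE OR REPLACE FUNCTION account_balance_v2(account_id int)
RETURNS numeric AS $$
  SELECT coalesce(sum(amount), 0) FROM ledger WHERE ledger.account_id = $1;
$$ LANGUAGE sql STABLE;

-- Only at the very end, once every view/index/caller has been switched
-- over to account_balance_v2:
-- DROP FUNCTION IF EXISTS account_balance(int);
```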

One final note: Always put each migration into a database transaction. And for idempotent updates, put the whole thing into a huge transaction. So when anything goes wrong, nothing happened. You can fix your script and simply try again, without having to cleanup any intermediate mess. This is obviously important on production systems, but also very, very handy during development. For the same reason, while writing a classic migration, always put a "ROLLBACK" at the end. Remove it only when you are fully satisfied with the results.

PostgreSQL is especially strong here, because all DDL actions (alter table, etc.) are transaction safe and can easily be rolled back.
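A minimal sketch of that development workflow (table and column names are made up):

```sql
-- A classic migration, wrapped so that on any failure nothing happened.
BEGIN;

ALTER TABLE orders ADD COLUMN shipped_at timestamptz;
UPDATE orders SET shipped_at = updated_at WHERE status = 'shipped';

-- While developing, leave ROLLBACK here and inspect the results above;
-- change it to COMMIT only when fully satisfied.
ROLLBACK;
```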


One caveat regarding "Always put each migration into a database transaction": I've found that for very large database tables on a live system, it becomes impractical to e.g. create a new index inside a transaction, because the entire table would need to be locked for the duration of the operation.
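Postgres does offer an escape hatch for exactly this case, with the caveat that it cannot run inside a transaction block, so it has to live outside the big migration transaction (index and table names illustrative):

```sql
-- Builds the index without taking a long exclusive lock on the table.
-- CREATE INDEX CONCURRENTLY cannot be used inside BEGIN ... COMMIT.
CREATE INDEX CONCURRENTLY IF NOT EXISTS orders_customer_id_idx
  ON orders (customer_id);
```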


That's because this entire approach is broken: if one cannot afford to bring the database down to do an ALTER TABLE migration, one should always do a lazy migration instead, either via the application stack (write to the new column only; read from new, falling back to old on a miss) followed by a backfill, or via triggers.


Is this not what PostgreSQL's MVCC was created to help mitigate?

https://www.postgresql.org/docs/current/static/mvcc-intro.ht...


I've found Liquibase [1] to be really good for this, although the docs don't give much of a hint about how to structure your change sets.

If you get it right, it's really easy to manage the DB schema structure and commits to your source repository are very legible (because there's one file per entity being modified), much better than interleaving all changes in a single stream IMO.

Particular highlights:

- Their annotated SQL format allows you to create a file per table that contains a first "changeset" to create the table and subsequent changesets for each alteration

- The "runOnChange" option allows idempotent, non-data-destroying parts of your schema such as view and function definitions to be kept as individual files and modified in place (Liquibase uses a hash of the definition to decide whether to rerun it when migrating)

- The "includeAll" tag lets you collect together the scripts in directories ("/views", "/tables", "/functions", etc.) and have them automatically picked up and run from a single "schema.xml" file in the root.

[1] http://www.liquibase.org
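In Liquibase's annotated ("formatted") SQL, a per-table file with an initial changeset plus later alterations, and a `runOnChange` view file, might look like this (author names and object names are made up):

```sql
--liquibase formatted sql

--changeset alice:create-users-table
CREATE TABLE users (id bigint PRIMARY KEY, email text NOT NULL);

--changeset alice:add-users-created-at
ALTER TABLE users ADD COLUMN created_at timestamptz DEFAULT now();

--changeset alice:users-active-view runOnChange:true
-- Edited in place; Liquibase reruns it whenever the checksum changes.
CREATE OR REPLACE VIEW active_users AS
  SELECT id, email FROM users;
```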


> I've been looking for something similar for Postgres and MySQL/MariaDB without any luck, and it really surprises me there's not more interest in doing migrations this way.

skeema (suggested above by evanelias) and square/shift are two management tools for MySQL/MariaDB.

On the lower level, I'm authoring GitHub's gh-ost, an online schema migration tool, and we have Ruby on Rails + scripts automation around that. Described in https://www.percona.com/live/17/sessions/automating-schema-c... .

I suspect no matter what solution you'd use, you will always need to break migrations into parts that are not fully automated. e.g. when dropping a column you'd first deploy code that ignores said column, only then run the migration to actually drop the column.


Strongly agree that the sync-based approach to schema migrations is much better.

I presented on this exact topic last week at PostgresOpen 2017, and have written a schema diff tool for Postgres that supports this approach (https://github.com/djrobstep/migra).

I'd be curious to know if this is what you are getting at, and if it would solve your problem?


Mysql Workbench has had a GUI ability to do data modeling, forward and reverse engineer it with a live database, and sync your model with your database (showing you deltas and letting you make changes to the model or the database).

I also have used (with mysql) liquibase with preconditions to script db migrations with conditional logic so that deployment can deal with variations in target db environments, conditionally apply ddl, fail midscript while allowing restart and rollback gracefully.


We used this at a previous job. I loved it. https://github.com/dbsteward/dbsteward

You define your schema and it figures out what to run based on your current schema.


Can it handle stored procs / UDFs? Was unable to find this in the docs...


Yes. I know I've done it for triggers, let me see if I can find that project.


https://github.com/dbsteward/dbsteward/blob/master/xml/somea... is an example with a function and trigger.


SQL Server with Management Studio can do something like this with the dacpac format. It calculates the differences and what needs to be done.


We have a project at work using this dacpac stuff and tbh it's a nightmare to work with, since the dacpacs themselves have to be manually generated and distributed.

It also appears to make the projects (in visual studio) unbelievably slow.


The way I did it was to add an MSBuild script that re-generates the dacpacs for each supported MSSQL version when you compile using a special configuration in Visual Studio (so the dropdown is Debug / Release / RebuildDacpacs).

Then the dacpacs get shipped with the application, along with a zipped SqlPackage.exe (and accompanying libraries) to apply them.

It's been a lot of work and we've had to handle a lot of corner cases through pre-deployment scripts and SqlPackage CLI options, but I haven't seen a 'why does this customer have an [Address] column that's the wrong length and set to nullable???' or 'this report is super slow -> index is missing' ticket since.


I'm with you that it makes the solution slower, but you can just selectively not load it, or use the new light loading feature.

You most certainly do not have to make them manually; that can be delegated to a build server. You can also use the dacpac to automatically run integration tests, and it can run some rudimentary upgrade tests against previous versions of the dacpac.


I think Navicat will do that as well


The very quick exclusive lock that postgres needs for things that are normally thought of as "free" like `ALTER TABLE ADD COLUMN` without a default can still cause problems on tables that are under extremely heavy contention. If you have hundreds of clients running long transactions against a table, even this fast metadata-only lock can cause an outage due to head-of-line blocking. Our team has had good luck with this lock-polling approach for migrations against very busy tables [1]. The idea is that you combine `lock_timeout` and attempt to `LOCK TABLE IN ACCESS EXCLUSIVE MODE` in a loop prior to running the DDL. Note that this is only safe for migrations that are actually instantaneous once the lock is acquired.

[1] http://www.databasesoup.com/2015/08/lock-polling-script-for-...
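One iteration of that lock-polling pattern might look like the following (table and column names are made up; the outer retry loop lives in a driver script that re-runs this on failure):

```sql
-- Give up on the lock quickly instead of queueing behind long
-- transactions and blocking everyone behind us.
SET lock_timeout = '2s';

BEGIN;
-- Fails fast if the table is contended; the wrapper script just retries.
LOCK TABLE busy_table IN ACCESS EXCLUSIVE MODE;
-- Safe only because this DDL is instantaneous once the lock is held.
ALTER TABLE busy_table ADD COLUMN flags int;
COMMIT;
```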


> Gradually backfill the default value

Might I add a little warning from experience.

The "gradually" part is important to get right. Since in Postgres updates actually create new tuples and leave dead tuples behind, if you update too much at once two things can happen: you increase disk usage fast (worth checking how much headroom you have, and adjusting beforehand if needed), and you increase the number of dead tuples fast.

I have experienced that having too many dead tuples on a table can affect the query planner, and it may stop using some index completely before autovacuum does its job. Depending on the circumstances this can severely impact performance, if an important query suddenly starts Seq Scanning a huge table.

This blog post has great practical details to consider before running such updates: https://www.citusdata.com/blog/2016/11/04/autovacuum-not-the... In particular, at the time we had to adjust "autovacuum_vacuum_scale_factor" for the table so that the autovacuum process would kick in much sooner. There is also a valuable query there to inspect how many dead tuples you have.
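A sketch of both pieces of advice, with made-up table/column names. The UPDATE is run repeatedly from the client (or a driver loop) until it reports `UPDATE 0`, so each batch is its own short transaction:

```sql
-- Backfill in small batches so dead tuples accumulate slowly enough
-- for autovacuum to keep up.
UPDATE items
SET    status = 'active'
WHERE  id IN (SELECT id FROM items WHERE status IS NULL LIMIT 1000);

-- Make autovacuum kick in much sooner on this one table
-- (the value is illustrative; default is 0.2, i.e. 20% dead tuples).
ALTER TABLE items SET (autovacuum_vacuum_scale_factor = 0.01);
```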


Actually, if you don't change any indexed columns I believe it can update in place (a HOT update) under certain circumstances. I'll have to find a link.



Yes, I was just about to post https://github.com/postgres/postgres/blob/master/src/backend... I misremembered it a little in my original post.

(I'm not able to open your link for some reason :-\)


In Rails, for concurrent index creation, the add_index method supports the "algorithm" option.

    add_index(table, fields_to_index, algorithm: :concurrently)


OT: Postgres is an incredible piece of open source software, but the official admin UI pgAdmin is in a state of complete chaos. Any suggestions for an OSS replacement?


Although I stick to shell tools like pg_ctl and co, one of my colleagues uses Postico [https://eggerapps.at/postico/] and it seems to do the job pretty well. And yes, they made such a mess with this UI; it looks like they tried to rebuild an OS inside it, with that painful window management, notification popups, etc. Crazy stuff.


I have taken to using DBeaver. The overall experience is really nice.

I don't like that you have to toggle your "Active database" rather than just opening a new window with a new connection to a different database. If that annoys me much more, I plan to try SquirrelSQL and then retry Postage.

I never understood the hate that pgAdmin3 received, I liked it a lot. V4 though is a mess.


DBeaver is really excellent for so many things.

I use it to organize my scripts which I need regularly. I can also put them on a network/shared drive so that others can access it.

I don't know if pgAdmin allows it, but in DBeaver I can switch between Grid and Text view to copy-paste data into an email in a nicely tabulated format, besides, of course, being able to export a result set to CSV etc.

https://dbeaver.jkiss.org/


I'm one of those who disliked pgAdmin3, at least on Mac. It crashed reliably if the database didn't disconnect cleanly and you tried to continue using the program. It crashed randomly when using SSH tunneling. It crashed randomly when resuming from sleep. It crashed reliably when you clicked on objects that no longer exist (ie something outside of pgAdmin deleted a table, then you click on that table in pgAdmin). The window would be lost in the nether if I moved it to a secondary monitor and then unplugged said monitor. The query editor was pretty mediocre. That's just the ones I can remember.


Yeah, I liked pgadmin3 but don't like pgadmin4. I like a GUI for traversing and understanding table structure, without having to do \d table. I use cli for everything else.

Will give dbeaver a try.


I've used Aqua Data Studio for ages. Very solid, many features. But DBeaver looks interesting (especially considering the cost of ADS).


There is some movement in that space fortunately. Depending on whether you rather want to lean towards analytical or DB management functionality I'd check out PopSQL (https://popsql.io/), Tableplus (https://tableplus.io/) which are not open source but freemium or OmniDB (https://omnidb.org/index.php/en/) which is OSS.


Pgadmin is awful, especially when you try using an ssh tunnel. If it is going to crash the process, why put it there?

A decent alternative I use is SQL workbench/j (not related to MySQL) which leverages jdbc to connect to any database and has a long feature list. In my experience, it is on-par with dbeaver.


Not OSS nor free, but I've always been happy with Navicat: https://www.navicat.com/en/products/navicat-for-postgresql


Postage – A fast replacement for pgAdmin | https://news.ycombinator.com/item?id=14884183 (Jul 2017)

> skrause: BigSQL maintains an LTS release that stays compatible with the most recent Postgres versions: https://www.openscg.com/bigsql/pgadmin3/

The discussion mentions a number of additional alternatives. Also: Postage itself is looking for a new maintainer.


All of the GUI tools I've tried, such as Postico, pgAdmin, etc., have very poor, flaky, crash-prone query editors. So I use Sublime Text: it has a Postgres-specific SQL syntax highlighter and powerful search and replace including regex. Blazingly fast, rock solid, runs any query, and can even handle result sets of millions of rows that would make most editors crash. Postico is pretty, but last time I tried it, it didn't even show the line number of the code causing an error, so it was completely useless to me.


Not sure if this fits your needs, but I've really liked pgweb. https://github.com/sosedoff/pgweb


Not a UI, but I love pgcli (https://github.com/dbcli/pgcli); it's like psql but 10 times better. It works pretty well.


Not OSS, but JetBrains' DataGrip is pretty great. Easily worth the money.


Soon (hopefully) there will be an excellent answer.


On this topic, here's a great reference for some common operations on high volume SQL databases:

https://www.braintreepayments.com/blog/safe-operations-for-h...

Has anyone been collating large-scale SQL best practices into a book or similar? I've found high-level overviews such as this https://github.com/donnemartin/system-design-primer, but lacks specifics to scaling SQL...I've only seen disparate blog posts like the one above.


Sqitch is a pretty useful migration tool:

http://sqitch.org


I like the whole sqitch approach much better than that provided by most migration tools (if I'm in migration land, I use https://flywaydb.org/). From the version control perspective, though, you still get the sprawl of changes across files like you do with migrations.

I find that, for tables, if I write anonymous "DO" blocks in a single script to manage both the initial creation of the table and any later deltas, I get a very satisfactory version control history. This is a bit different than the recommended approach (if I'm not mistaken).

I just wish it wasn't written in Perl with 10,000 CPAN dependencies, which may or may not build in a given circumstance.


Another option is Liquibase [0]. I use the diff [1] tool to create a changeset and then convert it to SQL before applying it to my db. Liquibase keeps a log and lock table in your db, so you can always review the changesets you applied at a later time.

[0] http://www.liquibase.org/

[1] http://www.liquibase.org/documentation/diff.html


Liquibase is definitely a good recommendation. We use it to update Oracle, PostgreSQL, and MariaDB, as well as to export the current schema as an hsqldb script. Some migrations are only executed for some of our customers.


Also a big fan of Liquibase here, on SQL Server. I've used Redgate in the past, which is even graphical etc, so simpler to use for slightly less technical DBAs.


strong_migrations looks like a really useful tool. Are there similar ones for languages/frameworks besides Ruby/Rails?


Alembic is a standalone tool written in Python. Having spent years on Rails migrations and then using Alembic, I can say that Alembic is really brilliant. It behaves like git: it has a branching and merge model for migrations (in case multiple people work on them simultaneously).


But it seems like the USP of strong_migrations is to recognise 'dangerous' operations in a migration and warn you about them so that you can rewrite it as a safer migration.

Does Alembic have anything similar? This would be useful for Python projects.


Alembic is amazing, but AFAIK only works with its author's (world-beating) ORM, SQLAlchemy. I didn't think it was a general-purpose database change tool.


You could potentially use SQLA's ability to reflect your db to create the models, either to get you started or maybe in an automated fashion. I'm sure there are loads of corner cases where it wouldn't work.

We're happy users of alembic. In general we use it to do the initial grunt work and then manually edit the migrations / split into phases ourselves. For some stuff that Alembic doesn't handle very well (like check constraints) we've extended Alembic to handle those automatically in the way we like.


We use it standalone for our nodejs code. It does pull in SQLAlchemy, but who cares. I think what you are really asking is this: we don't autogenerate migrations from models, we write them explicitly.

There is nothing even close enough in the nodejs world and we wanted to manage our dB properly.

Incidentally, three other nodejs startups that I told this to, also started using Alembic for their migrations.


I don't know Alembic well enough to know what features of SQLAlchemy it requires, but I do know that SQLAlchemy consists of quite a bit more than just the ORM features. Perhaps Alembic is using it at a lower level, and can be used with any (or no) ORM layer?


For JVM languages, Flyway is quite popular (flywaydb.org). Django (Python) has a built-in migration tool.


Original author here, thanks for sharing it. Will update the post to add it to the list of good tools.


Another x-plat CLI option is Piggy - https://github.com/datalust/piggy - a side project still in the early days of development, but it may be interesting for its lack of (e.g. JVM) dependencies. It's a self-contained binary for Win/Mac/Linux.

DbUp, upon which Piggy is based, is a popular option in the .NET space: https://dbup.github.io/.

(Neither attempt the protections that I think strong_migrations is offering - TIL!)


It works well as a command line tool, so not even restricted to JVM languages.

One strong point of Flyway, besides being "just SQL", is the repeatable migrations, you can update your stored procedures, views and table documentation in the same version controlled source files without making a new copy in a new migration file for each change.


Oooh I use Flyway but did not know about this feature! It solves a big problem for me, thank you for mentioning it! :)



Sqitch[0] is probably one of the best migration tools I've used. It expresses dependencies explicitly and doesn't rely on ordering. It will run change verification scripts for you and roll everything back if it doesn't check out. Plus you can tag your history and bundle everything up into a release for you. It feels a lot like using Git for managing database state.

[0] http://sqitch.org


Dbmate - http://dbmate.readthedocs.io/en/latest/ is a great standalone dependency-free migration tool. It is quite robust and flexible.


For Postgres or any SQL database, I've been relying on MyBatis Migrations for years.

http://www.mybatis.org/migrations/


For GitLab we have a somewhat special approach to allow us to perform zero downtime migrations. Basically we have two types of migrations:

1. Regular migrations (located in db/migrate)

2. Post-deployment migrations (located in db/post_migrate)

Post-deployment migrations are mostly used for removing things (e.g. columns) and correcting data (e.g. bad data caused by a bug for which we first need to deploy the code fix). The code for this is also super simple:

    # Just dump this in config/initializers
    unless ENV['SKIP_POST_DEPLOYMENT_MIGRATIONS']
      path = Rails.root.join('db', 'post_migrate').to_s

      Rails.application.config.paths['db/migrate'] << path
      ActiveRecord::Migrator.migrations_paths << path
    end
By default Rails will include both directories so running `rake db:migrate` will result in all migrations being executed as usual. By setting an environment flag you can opt-out of the post-deployment migrations. This allows us to deploy GitLab.com essentially as follows:

1. Deploy code to a deployment host (not used for requests and such)

2. Run migrations excluding the post-deployment ones

3. Deploy code everywhere

4. Re-run migrations, this time including the post-deployment migrations

We don't use anything like strong migrations, instead we just documented what requires downtime or not, how to work around that (and what methods to use), etc. Some more info on this can be found here:

* https://docs.gitlab.com/ee/development/what_requires_downtim...

* https://docs.gitlab.com/ee/update/README.html#upgrading-with...

* https://docs.gitlab.com/ee/development/post_deployment_migra...


DB design and implementation has always fascinated me. How can they build a concurrent index when tens of thousands of records are coming in and perhaps thousands are being deleted? (Although in all seriousness, deleting a row is always a bad idea.)



Please correct me if I'm wrong, but aren't migrations a solved problem in databases that support triggers?

* create the new schema as table.new

* install update triggers on table.old to write the same content into table.new

* backfill table.new from table.old

* swap table.new and table.old
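A simplified Postgres sketch of those steps, handling only inserts and updates (a real version also needs DELETE handling, and the table/column names here are made up):

```sql
-- Keep new_accounts in sync with old_accounts while backfilling.
CREATE FUNCTION copy_to_new() RETURNS trigger AS $$
BEGIN
  INSERT INTO new_accounts (id, email)
  VALUES (NEW.id, NEW.email)
  ON CONFLICT (id) DO UPDATE SET email = EXCLUDED.email;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER accounts_sync
AFTER INSERT OR UPDATE ON old_accounts
FOR EACH ROW EXECUTE PROCEDURE copy_to_new();

-- After the backfill completes, swap the tables in one transaction:
-- BEGIN;
--   ALTER TABLE old_accounts RENAME TO retired_accounts;
--   ALTER TABLE new_accounts RENAME TO accounts;
-- COMMIT;
```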


No. https://githubengineering.com/gh-ost-github-s-online-migrati... has an excellent section on why triggers fail (at least in the MySQL world. I imagine the locking story is similar in postgresql).

I don't work for GitHub but we had the same problems using pt-online-schema-change independently of their issues (lots of locking contention affecting the app negatively). We're finally moving to gh-ost for our large/risky migrations and so far it's amazing.


And now say you have a 1 TB table. Do you still want to go this route?


Yes, the bigger the table the safer this route is for online migrations.


That would highly depend on the nature of migration :)


What's everyone favorite library for doing Postgres migrations using node? I'm using knex.js and still doing migrations mostly by hand.


I use the same tool across whatever language I'm in. It requires a directory of `TIMESTAMP-LABEL.DIRECTION.sql` files and then sorts them to know the order. It compares that to a database table of completed migrations to know which to run. It is extremely stupid, and throws its hands up if an error occurs (so the user can step in to fix it). It doesn't wrap things in transactions for you (not all DDL is transactional in Postgres) so you have to manually add BEGIN; and COMMIT; if you want those things.

It is glorious. I write a nice DSL called SQL for my database and there are never any surprises. If I want to write brittle migrations, idempotent migrations, transactional migrations, or data migrations, those are all on me. Anything more than this leads to needless debugging the tools and has never resulted in more clear or concise code.


Why not use knex migrations then? I've used it on several projects and it seems to work just fine


We have written our own migration workflow on top of pg-promise because none of the alternatives worked out for us. We don't use an ORM for several reasons, and we favor SQL over SQL abstractions. We use Bookshelf/knex in a quite large project and had some trouble with the latter, so we decided to go with "raw" pg-promise.

So what we basically have is a script that runs migration SQL files in order within a transaction and then uses COPY to seed the database from a folder of CSV files. Triggered by a shell script which itself can be run via yarn/npm.


I use Sequelize. I write migrations in Javascript using Promise chains. It does take some trial and error to figure out how to add join columns or create the join table so Sequelize knows how to work with it.


We use umzug [1], a migrations framework which uses sequelize [2]. We use umzug/sequelize with TypeScript and it's quite pleasant to work with.

[1] https://github.com/sequelize/umzug

[2] https://github.com/sequelize/sequelize


I use TypeORM -- it supports migrations, though not to the same degree as what's discussed in the OP.

https://typeorm.github.io


I use db-migrate but I have corresponding ...-{up,down}.sql files for each migration and write the migrations in SQL.


We use RedGate for MSSQL. It's not the best.


I was hoping this was about updating your version of Postgres. Whenever I do that, my database doesn't work; this works fine with MySQL.


What's the case where adding a JSON column causes downtime with Postgres?


What do you mean?


I think they're referring to the last item in the list at https://github.com/ankane/strong_migrations#dangerous-operat... which was linked from the article.


It looks like that can affect `SELECT DISTINCT` queries. From https://github.com/ankane/strong_migrations#adding-a-json-co...



