
Better Database Migrations in Postgres - jbreihan
http://www.craigkerstiens.com/2017/09/10/better-postgres-migrations/
======
gregmac
I really just want a deterministic (or sync-based) migration tool. The only
two I'm aware of are Innovartis DBGhost and RedGate SQL Compare (though
RedGate requires a license everywhere it runs, whereas DBGhost only requires a
license to compile the package).

This stems from my work years ago with databases in on-premise products.
Customers would modify the database schema, causing migration "up" scripts to
fail, and it would be very difficult (and manually intensive) to baseline
everything again and get the database back to a working state. Even though
modifying the schema was clearly laid out as "not supported", it would still
be something we'd have to fix, because ultimately they (and their account
reps, etc) still need the product to work.

We used DBGhost, and then had custom pre- and post-deployment scripts to do
certain types of changes, such as renaming a column (`if exists old_column {
add new column; copy old to new; drop old_column; }`), or adding expensive
indexes. One of the best parts is that all it stores in source control is the
CREATE scripts for database objects (and the custom pre/post-deployment
scripts). Pull requests would let you trivially see a new column or index
being added.
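
A minimal sketch of what such a pre-deployment rename script can look like in
Postgres (the table and column names here are hypothetical, not from our
actual DBGhost setup):

    -- Only acts if the old column is still present, so it is safe to re-run
    -- against databases in any intermediate state.
    DO $$
    BEGIN
        IF EXISTS (SELECT 1 FROM information_schema.columns
                   WHERE table_name = 'users' AND column_name = 'old_column') THEN
            ALTER TABLE users ADD COLUMN IF NOT EXISTS new_column text;
            UPDATE users SET new_column = old_column WHERE new_column IS NULL;
            ALTER TABLE users DROP COLUMN old_column;
        END IF;
    END
    $$;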

Compared to the pain of creating and maintaining up/down scripts and the long-
term problem where deploying a new instance takes thousands of migration steps
(or the risk of the initial CREATE scripts not matching what you end up with
after all migrations), doing a schema sync was significantly simpler in nearly
every respect.

I've been looking for something similar for Postgres and MySQL/MariaDB without
any luck, and it really surprises me there's not more interest in doing
migrations this way.

~~~
vog
(We should coin a term for this. I propose: "idempotent database updates".)

I'm also a strong proponent of idempotent database updates, and prefer those
over classic migrations wherever possible.

Some experience from PostgreSQL (gathered over several years in various
applications):

While this approach works pretty well for idempotent changes such as "add
column if not exists", it is trickier when data content is changed by a
migration. Although seldom needed, this alone justified classic migrations,
which I always had to use _in addition_ to idempotent upgrades. But I try to
keep that part as small as possible.
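
For the "add column if not exists" part, a minimal one-liner sketch
(hypothetical table and column; ADD COLUMN IF NOT EXISTS needs PostgreSQL 9.6
or newer, older versions need a DO block checking information_schema instead):

    -- Safe to re-run: does nothing if the column already exists.
    ALTER TABLE orders ADD COLUMN IF NOT EXISTS shipped_at timestamptz;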

However, that issue can often be avoided by disciplined usage of names. That
is, never reuse or "clean up" column names, table names, index names, view
names, or function names.

A nice fit into idempotent upgrades is "create or replace function" for
database functions. However, there is a caveat that you can't replace it if
you change the return type. (Changing the argument types is mostly safe,
because then it is a different function for PostgreSQL.) You might be tempted
to solve this via "drop function if exists" followed by "create function", but
then you need "drop ... cascade", which destroys all views (and perhaps
indexes!) that depend on it. Again, the correct solution here is to create a
new function with a different name. (And drop old one only at the very end,
when everything else is switched to the new one.)
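
A minimal sketch of that rename-instead-of-drop pattern (function, table, and
column names are hypothetical):

    -- The return type changes, so instead of replacing total_price() we
    -- introduce total_price_v2() and migrate views/callers over to it.
    CREATE OR REPLACE FUNCTION total_price_v2(p_order_id integer)
    RETURNS numeric AS $$
        SELECT sum(quantity * unit_price)
        FROM order_items
        WHERE order_id = p_order_id;
    $$ LANGUAGE sql STABLE;

    -- Only after everything references total_price_v2():
    -- DROP FUNCTION IF EXISTS total_price(integer);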

One final note: always put each migration into a database transaction. And for
idempotent updates, put the whole thing into one big transaction. That way, if
anything goes wrong, nothing has happened: you can fix your script and simply
try again, without having to clean up any intermediate mess. This is obviously
important on production systems, but also very, very handy during development.
For the same reason, while writing a classic migration, always put a
"ROLLBACK" at the end. Remove it only when you are fully satisfied with the
results.

PostgreSQL is especially strong here, because nearly all DDL actions (ALTER
TABLE, etc.) are transactional and can easily be rolled back.
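
A minimal sketch of that workflow (the DDL here is a made-up example):

    -- One transaction around the whole upgrade; keep ROLLBACK at the end
    -- while developing, so nothing sticks until you swap it for COMMIT.
    BEGIN;

    ALTER TABLE invoices ADD COLUMN IF NOT EXISTS due_date date;
    UPDATE invoices SET due_date = created_at + interval '30 days'
        WHERE due_date IS NULL;

    -- Replace with COMMIT once you are satisfied with the result.
    ROLLBACK;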

~~~
keithlfrost
One caveat regarding "Always put each migration into a database transaction":
I've found that for very large tables on a live system, it becomes impractical
to e.g. create a new index inside a transaction, because the table would be
locked against writes for the duration of the operation (and CREATE INDEX
CONCURRENTLY can't run inside a transaction block).
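
A minimal sketch of the usual workaround (hypothetical table and column):
build the index outside any transaction, accepting that a failed attempt
leaves an INVALID index behind that has to be dropped and retried:

    -- Does not block concurrent writes, but must run outside BEGIN/COMMIT.
    CREATE INDEX CONCURRENTLY idx_payments_customer_id
        ON payments (customer_id);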

~~~
notyourday
That's because this entire approach is broken: if one cannot afford to bring
the database down to do an ALTER TABLE migration, one should always do a lazy
migration instead, either via the application stack (write to the new schema
only; read from the new one and fall back to the old on a miss) followed by a
backfill, or by using triggers.

------
semiquaver
The very quick exclusive lock that postgres needs for things that are normally
thought of as "free" like `ALTER TABLE ADD COLUMN` without a default can still
cause problems on tables that are under extremely heavy contention. If you
have hundreds of clients running long transactions against a table, even this
fast metadata-only lock can cause an outage due to head-of-line blocking. Our
team has had good luck with this lock-polling approach for migrations against
very busy tables [1]. The idea is that you combine `lock_timeout` and attempt
to `LOCK TABLE IN ACCESS EXCLUSIVE MODE` in a loop prior to running the DDL.
Note that this is only safe for migrations that are actually instantaneous
once the lock is acquired.

[1] [http://www.databasesoup.com/2015/08/lock-polling-script-
for-...](http://www.databasesoup.com/2015/08/lock-polling-script-for-alter-
table.html)
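
A minimal sketch of the idea (hypothetical table and timeout; the retry loop
itself lives in a shell script or migration runner that re-executes the block
until it succeeds):

    -- Give up quickly instead of queueing behind long transactions;
    -- re-run until the lock is acquired and the DDL goes through.
    SET lock_timeout = '2s';
    BEGIN;
    LOCK TABLE accounts IN ACCESS EXCLUSIVE MODE;
    ALTER TABLE accounts ADD COLUMN flags integer;
    COMMIT;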

------
gldalmaso
> _Gradually backfill the default value_

Might I add a little warning from experience.

The "gradually" part is important to get right, since in postgres updates
actually create new tuples and leave dead tuples behind, if you update too
much at once two things might happen: you increase disk usage fast (worth
considering how much room you have to work and adjust before if needed) and
increase the amount of dead tuples fast.
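
A minimal sketch of a batched backfill (table, column, and batch size are
hypothetical); re-run it until it reports 0 rows updated, so autovacuum can
keep up between passes:

    -- Each pass touches at most 1000 rows, limiting the dead tuples
    -- created per iteration.
    UPDATE events
    SET    status = 'pending'
    WHERE  id IN (SELECT id FROM events WHERE status IS NULL LIMIT 1000);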

I have experienced that having too many dead tuples on a table can affect the
query planner, which may stop using some index completely before autovacuum
does its job. Depending on the circumstances this can severely impact
performance if an important query suddenly starts Seq Scanning a huge table.

This blog post has great practical details to consider before running such
updates: [https://www.citusdata.com/blog/2016/11/04/autovacuum-not-
the...](https://www.citusdata.com/blog/2016/11/04/autovacuum-not-the-enemy/)
In particular, at the time we had to adjust "autovacuum_vacuum_scale_factor"
for the table in order for the autovacuum process to kick in much sooner.
There is also a valuable query there to inspect how many dead tuples you have.
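
For reference, a rough equivalent of that kind of check, plus the per-table
autovacuum override, looks like this (the table name is hypothetical):

    -- How many dead vs. live tuples each table is carrying right now.
    SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
    FROM pg_stat_user_tables
    ORDER BY n_dead_tup DESC;

    -- Make autovacuum kick in much sooner on one hot table.
    ALTER TABLE big_table SET (autovacuum_vacuum_scale_factor = 0.01);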

~~~
jimktrains2
Actually, if you don't change any keys I believe it can update in place under
certain circumstances. I'll have to find a link.

~~~
ddorian43
You mean heap-only-tuples:
[https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f...](https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/access/heap/README.HOT;hb=HEAD)

~~~
jimktrains2
Yes, I was just about to post
[https://github.com/postgres/postgres/blob/master/src/backend...](https://github.com/postgres/postgres/blob/master/src/backend/access/heap/README.HOT)
I misremembered it a little in my original post.

(I'm not able to open your link for some reason :-\\)

------
SingAlong
In Rails, for concurrent index creation, the add_index method supports the
"algorithm" option.

    
    
        add_index(table, fields_to_index, algorithm: :concurrently)

------
danmaz74
OT: Postgres is an incredible piece of open source software, but the official
admin UI pgAdmin is in a state of complete chaos. Any suggestions for an OSS
replacement?

~~~
dt_rsp
I have taken to using DBeaver. The overall experience is really nice.

I don't like that you have to toggle your "Active database" rather than just
opening a new window with a new connection to a different database. If that
annoys me much more, I plan to try SquirrelSQL and then retry Postage.

I never understood the hate that pgAdmin3 received; I liked it a lot. V4,
though, is a mess.

~~~
thunderbong
DBeaver is really excellent for so many things.

I use it to organize the scripts I need regularly. I can also put them on a
network/shared drive so that others can access them.

I don't know if pgAdmin allows it, but in DBeaver I can switch between Grid
and Text view to copy-paste data into an email in a nicely tabulated format,
besides, of course, being able to export a result set to CSV etc.

[https://dbeaver.jkiss.org/](https://dbeaver.jkiss.org/)

------
wyc
On this topic, here's a great reference for some common operations on high
volume SQL databases:

[https://www.braintreepayments.com/blog/safe-operations-
for-h...](https://www.braintreepayments.com/blog/safe-operations-for-high-
volume-postgresql/)

Has anyone been collating large-scale SQL best practices into a book or
similar? I've found high-level overviews such as this
[https://github.com/donnemartin/system-design-
primer](https://github.com/donnemartin/system-design-primer), but it lacks
specifics on scaling SQL... I've only seen disparate blog posts like the one
above.

------
Cieplak
Sqitch is a pretty useful migration tool:

[http://sqitch.org](http://sqitch.org)

~~~
sbuttgereit
I like the whole sqitch approach much better than that provided by most
migration tools (if I'm in migration land, I use
[https://flywaydb.org/](https://flywaydb.org/)). From the version control
perspective, though, you still get the sprawl of changes across files like you
do with migrations.

I find that, for tables, if I write anonymous "DO" blocks in a single script
to manage both the initial creation of the table and any later deltas, I get a
very satisfactory version control history. This is a bit different than the
recommended approach (if I'm not mistaken).
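
A minimal sketch of that style (hypothetical table; as noted, not sqitch's
recommended layout):

    -- One script owns the table: initial creation plus later idempotent deltas.
    DO $$
    BEGIN
        IF NOT EXISTS (SELECT 1 FROM information_schema.tables
                       WHERE table_name = 'customers') THEN
            CREATE TABLE customers (
                id   bigserial PRIMARY KEY,
                name text NOT NULL
            );
        END IF;

        -- Later delta, added to the same file:
        ALTER TABLE customers ADD COLUMN IF NOT EXISTS email text;
    END
    $$;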

I just wish it wasn't written in Perl with 10,000 CPAN dependencies which may
or may not build in a given environment.

~~~
oneweekwonder
Another option is Liquibase[0]. I use the diff[1] tool to create a changeset
and then convert it to SQL before applying it to my DB. Liquibase keeps a log
and lock table in your DB, so you can always review the changesets you applied
at a later time.

[0] [http://www.liquibase.org/](http://www.liquibase.org/)

[1]
[http://www.liquibase.org/documentation/diff.html](http://www.liquibase.org/documentation/diff.html)

~~~
trampi
Liquibase is definitely a good recommendation. We use it to update Oracle,
PostgreSQL, and MariaDB, as well as to export the current schema as an HSQLDB
script. Some migrations are only executed for some of our customers.

~~~
robertlagrant
Also a big fan of Liquibase here, on SQL Server. I've used Redgate in the
past, which even has a GUI, so it's simpler to use for slightly less technical
DBAs.

------
misframer
strong_migrations looks like a really useful tool. Are there similar ones for
languages/frameworks besides Ruby/Rails?

~~~
sandGorgon
Alembic is a standalone tool written in Python. Having spent years on Rails
migrations and then using Alembic, I can say that Alembic is really brilliant.
It behaves like Git: it has a branching and merging model for migrations (in
case multiple people work on them simultaneously).

~~~
robertlagrant
Alembic is amazing, but AFAIK only works with its author's (world-beating)
ORM, SQLAlchemy. I didn't think it was a general-purpose database change tool.

~~~
aidos
You could potentially use SQLAlchemy's ability to reflect your DB to create
the models, either to get you started or maybe in an automated fashion. I'm
sure there are loads of corner cases where it wouldn't work.

We're happy users of alembic. In general we use it to do the initial grunt
work and then manually edit the migrations / split into phases ourselves. For
some stuff that Alembic doesn't handle very well (like check constraints)
we've extended Alembic to handle those automatically in the way we like.

------
YorickPeterse
For GitLab we have a somewhat special approach that allows us to perform zero-
downtime migrations. Basically we have two types of migrations:

1. Regular migrations (located in db/migrate)

2. Post-deployment migrations (located in db/post_migrate)

Post-deployment migrations are mostly used for removing things (e.g. columns)
and correcting data (e.g. bad data caused by a bug for which we first need to
deploy the code fix). The code for this is also super simple:

    
    
        # Just dump this in config/initializers
        unless ENV['SKIP_POST_DEPLOYMENT_MIGRATIONS']
          path = Rails.root.join('db', 'post_migrate').to_s
    
          Rails.application.config.paths['db/migrate'] << path
          ActiveRecord::Migrator.migrations_paths << path
        end

By default Rails will include both directories, so running `rake db:migrate`
will result in all migrations being executed as usual. By setting an
environment flag you can opt out of the post-deployment migrations. This
allows us to deploy GitLab.com essentially as follows:

1. Deploy code to a deployment host (not used for requests and such)

2. Run migrations excluding the post-deployment ones

3. Deploy code everywhere

4. Re-run migrations, this time including the post-deployment migrations

We don't use anything like strong_migrations; instead we just documented what
does and doesn't require downtime, how to work around that (and what methods
to use), etc. Some more info on this can be found here:

* [https://docs.gitlab.com/ee/development/what_requires_downtim...](https://docs.gitlab.com/ee/development/what_requires_downtime.html#what-requires-downtime)

* [https://docs.gitlab.com/ee/update/README.html#upgrading-with...](https://docs.gitlab.com/ee/update/README.html#upgrading-without-downtime)

* [https://docs.gitlab.com/ee/development/post_deployment_migra...](https://docs.gitlab.com/ee/development/post_deployment_migrations.html#post-deployment-migrations)

------
agentultra
Sqitch[0] is probably one of the best migration tools I've used. It expresses
dependencies explicitly and doesn't rely on ordering. It will run change
verification scripts for you and roll everything back if it doesn't check out.
Plus you can tag your history and bundle everything up into a release for you.
It feels a lot like using Git for managing database state.

[0] [http://sqitch.org](http://sqitch.org)

------
ruby2elixir
Dbmate -
[http://dbmate.readthedocs.io/en/latest/](http://dbmate.readthedocs.io/en/latest/)
is a great standalone dependency-free migration tool. It is quite robust and
flexible.

------
thunderbong
For Postgres or any SQL database, I've been relying on MyBatis Migrations for
years.

[http://www.mybatis.org/migrations/](http://www.mybatis.org/migrations/)

------
yeukhon
DB design and implementation has always fascinated me. How can they build an
index concurrently when tens of thousands of records are coming in and perhaps
thousands are being thrown out (although, in all seriousness, deleting a row
is always a bad idea)?

~~~
ddorian43
You can read the explanation: [https://www.postgresql.org/docs/9.5/static/sql-
createindex.h...](https://www.postgresql.org/docs/9.5/static/sql-
createindex.html#SQL-CREATEINDEX-CONCURRENTLY)

------
notyourday
Please correct me if I'm wrong, but aren't migrations a solved problem in
databases that support triggers?

* create a new table, table.new, with the new schema

* install update triggers on table.old to write the same content into table.new

* backfill table.new from table.old

* swap table.new and table.old (rough sketch below)
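
A minimal sketch of those steps (all names and columns are hypothetical; the
real new schema would of course differ from the old one):

    -- 1. New table with the new schema.
    CREATE TABLE accounts_new (
        id    bigint PRIMARY KEY,
        email text NOT NULL
    );

    -- 2. Mirror writes from the old table while the backfill runs.
    CREATE OR REPLACE FUNCTION mirror_account_writes() RETURNS trigger AS $$
    BEGIN
        INSERT INTO accounts_new (id, email)
        VALUES (NEW.id, NEW.email)
        ON CONFLICT (id) DO UPDATE SET email = EXCLUDED.email;
        RETURN NEW;
    END
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER accounts_mirror
    AFTER INSERT OR UPDATE ON accounts
    FOR EACH ROW EXECUTE PROCEDURE mirror_account_writes();

    -- 3. Backfill accounts_new from accounts in batches, then:
    -- BEGIN;
    --   ALTER TABLE accounts RENAME TO accounts_old;
    --   ALTER TABLE accounts_new RENAME TO accounts;
    -- COMMIT;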

~~~
qaq
And now say you have a 1TB table. Do you still want to go this route?

~~~
notyourday
Yes, the bigger the table the safer this route is for online migrations.

~~~
qaq
That would highly depend on the nature of the migration :)

------
rawrmaan
What's everyone's favorite library for doing Postgres migrations with Node?
I'm using knex.js and still doing migrations mostly by hand.

~~~
emidln
I use the same tool across whatever language I'm in. It requires a directory
of `TIMESTAMP-LABEL.DIRECTION.sql` files and then sorts them to know the
order. It compares that to a database table of completed migrations to know
which to run. It is extremely stupid, and throws its hands up if an error
occurs (so the user can step in to fix it). It doesn't wrap things in
transactions for you (not all DDL is transactional in Postgres) so you have to
manually add BEGIN; and COMMIT; if you want those things.
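
For instance, one of those files might look like this (the timestamp, label,
and "up" direction are made-up examples of the convention, not files from that
tool):

    -- 20170910120000-add-widgets-sku.up.sql
    -- Transaction added by hand, since the tool doesn't wrap it for you.
    BEGIN;

    ALTER TABLE widgets ADD COLUMN sku text;

    COMMIT;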

It is glorious. I write a nice DSL called SQL for my database and there are
never any surprises. If I want to write brittle migrations, idempotent
migrations, transactional migrations, or data migrations, those are all on me.
Anything more than this leads to needless debugging the tools and has never
resulted in more clear or concise code.

------
l8nite
We use RedGate for MSSQL. It's not the best.

------
ape4
I was hoping this was about upgrading your version of Postgres. Whenever I do
that, my database doesn't work, whereas it just works with MySQL.

------
erichurkman
What's the case where adding a JSON column causes downtime with Postgres?

~~~
icebraining
What do you mean?

~~~
tclancy
I think they're referring to the last item in the list at
[https://github.com/ankane/strong_migrations#dangerous-
operat...](https://github.com/ankane/strong_migrations#dangerous-operations)
which was linked from the article.

~~~
benmanns
It looks like that can affect `SELECT DISTINCT` queries. From
[https://github.com/ankane/strong_migrations#adding-a-json-
co...](https://github.com/ankane/strong_migrations#adding-a-json-column-
postgres)

