
PostgreSQL Data Migration Tips - throughnothing
http://engineering.tilt.com/7-postgresql-data-migration-hacks/
======
xtrumanx
It was bugging me that the "rosser" link was not pointing to the comment where
he shared his bulk update technique, so I dug it up for anyone else
interested:

[https://news.ycombinator.com/item?id=9018756](https://news.ycombinator.com/item?id=9018756)

~~~
aanari
Author here. Great point, I just updated this link to point directly to his
comment rather than his HN profile.

------
bhahn

      FOR UPDATE NOWAIT immediately locks the rows being retrieved in the first step (as if they were to be updated)
    

Technically "FOR UPDATE" only makes an attempt to lock rows, and the "NOWAIT"
instructs postgres, in the case that another transaction already has a lock on
the row, to raise an error immediately instead of the default behavior of
waiting for the lock to become available.

[http://www.postgresql.org/docs/9.4/static/sql-select.html](http://www.postgresql.org/docs/9.4/static/sql-select.html)
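
A minimal sketch of that behavior, assuming a hypothetical `users` table:

    -- Session 1: take a row lock and hold the transaction open.
    BEGIN;
    SELECT * FROM users WHERE id = 42 FOR UPDATE;

    -- Session 2: NOWAIT raises an error (SQLSTATE 55P03) right away,
    -- instead of waiting for session 1 to commit or roll back.
    SELECT * FROM users WHERE id = 42 FOR UPDATE NOWAIT;
    -- ERROR:  could not obtain lock on row in relation "users"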

~~~
aanari
That's a good clarification. I think it's better for the data migration to
error out when it can't acquire the lock on the rows and then retry, rather
than waiting indefinitely, but YMMV.
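
One way to sketch that retry loop in plpgsql (the table, column, and batch
predicate here are all hypothetical); the EXCEPTION clause runs each attempt
in a subtransaction, so a failed lock attempt rolls back cleanly before the
next try:

    DO $$
    DECLARE
      attempts int := 0;
    BEGIN
      LOOP
        BEGIN
          -- Try to lock the whole batch; errors out if any row is taken.
          PERFORM 1 FROM users WHERE id BETWEEN 1 AND 1000 FOR UPDATE NOWAIT;
          EXIT;  -- locks acquired, fall through to the actual work
        EXCEPTION WHEN lock_not_available THEN
          attempts := attempts + 1;
          IF attempts >= 5 THEN
            RAISE;  -- give up after five tries
          END IF;
          PERFORM pg_sleep(1);  -- back off before retrying
        END;
      END LOOP;
      UPDATE users SET migrated = true WHERE id BETWEEN 1 AND 1000;
    END $$;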

------
bremac
I'm not sure I understand the purpose of the loop in the last example. AFAIK
top-level plpgsql statements (including DO blocks run in psql) execute in a
single transaction, so it seems like you end up slowly locking the entire
table, as the transaction won't commit until the loop completes. (I learned
this the hard way by trying to "batch"-update a table with tens of millions of
rows in production.)

The normal way to handle batch updates is to perform the loop outside of
postgresql, so that each batch is in its own transaction.
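
For example, the client can re-run one self-limiting statement until it
reports zero rows, letting each execution auto-commit as its own transaction
(table and column names are hypothetical):

    -- Run repeatedly from application code or a shell loop; stop once
    -- the reported row count hits 0. Each run commits on its own.
    UPDATE users
    SET    migrated = true
    WHERE  id IN (
        SELECT id
        FROM   users
        WHERE  migrated = false
        ORDER  BY id
        LIMIT  1000
    );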

~~~
bhahn

      AFAIK top-level plpgsql statements (including DO blocks run in psql) execute in a single transaction
    

This is true according to the docs. Anonymous code blocks are "transient
anonymous functions", and functions are executed within a transaction.

      it seems like you end up slowly locking the entire table

The selected rows would be locked against UPDATE, DELETE, and SELECT FOR
UPDATE, but not against regular reads. Perhaps his users table is used
primarily for reads, which made this command run with negligible consequences?

[http://www.postgresql.org/docs/9.4/static/sql-do.html](http://www.postgresql.org/docs/9.4/static/sql-do.html)

[http://www.postgresql.org/docs/9.4/static/plpgsql-structure.html](http://www.postgresql.org/docs/9.4/static/plpgsql-structure.html)
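
A quick two-session sketch of that lock scope (table name hypothetical):

    -- Session 1: lock one row and keep the transaction open.
    BEGIN;
    SELECT * FROM users WHERE id = 1 FOR UPDATE;

    -- Session 2, concurrently:
    SELECT * FROM users WHERE id = 1;          -- plain read returns immediately
    UPDATE users SET name = 'x' WHERE id = 1;  -- blocks until session 1 ends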

~~~
aanari
Author here. That's a great point bhahn, I just updated my gist to properly
handle the case that you just outlined:

[https://gist.github.com/aanari/349c7d97ed50c6f69930#file-batch_at_will_commander-sql](https://gist.github.com/aanari/349c7d97ed50c6f69930#file-batch_at_will_commander-sql)

By creating a separate function for the locking and updating of rows, we
ensure that the `BEGIN/END` transaction is handled per iteration rather than
at the very end, so we only lock rows while they are being processed. Since
Postgres does not support nested transaction blocks, calling a defined
function from within an anonymous function block seemed to be the easiest and
clearest path to achieve this.

~~~
bhahn
[https://gist.github.com/aanari/349c7d97ed50c6f69930#file-batch_at_will_commander-sql-L22-L28](https://gist.github.com/aanari/349c7d97ed50c6f69930#file-batch_at_will_commander-sql-L22-L28)

The anonymous block is implicitly called in a transaction, so all the calls to
batch_at_will will be executed in the parent transaction; there's no way
around this (except for using dblink, but that's pretty smelly imo).

The only way to batch update while only locking rows in the batch is to run
the loop outside of postgres like OP suggested.
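
To make that concrete: plpgsql's BEGIN/END is a block delimiter, not a
transaction boundary, so a driver loop like this one holds every batch's locks
until the whole DO statement finishes (the call signature of the gist's
batch_at_will is assumed here):

    DO $$
    BEGIN
      FOR i IN 1..100 LOOP
        -- Each call's row locks pile up in the surrounding implicit
        -- transaction; nothing commits until the DO statement returns.
        PERFORM batch_at_will(i);
      END LOOP;
    END $$;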

------
NDizzle
I like how SQL is becoming cool again.

~~~
ngoel36
Did it ever really go out of style?

~~~
gdulli
Not for me. I've used ORMs for the simplest of CRUD operations from web apps
but couldn't imagine ever working without SQL for everything else.

------
steventhedev
I have a bone to pick with the CTE that archives users. I'd rather see it
insert them into the archive table first, then delete them. Not much of a
difference, but my instinct is to cover all possible failures, and to be
especially careful around deleting rows.

It also means that if, god forbid, it did die halfway through and PG isn't
smart enough to pick up where it left off safely, you won't lose any data; at
worst you'd end up with a duplicate archived row (easy enough to catch with
some maintenance scripts and origin ids).
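
For reference, the two orderings under discussion, reconstructed from context
with hypothetical table names (as a single statement, either form is atomic):

    -- Delete-first, roughly as in the article's CTE:
    WITH moved AS (
        DELETE FROM users WHERE deactivated RETURNING *
    )
    INSERT INTO users_archive SELECT * FROM moved;

    -- Insert-first, as suggested above:
    WITH moved AS (
        INSERT INTO users_archive
        SELECT * FROM users WHERE deactivated
        RETURNING id
    )
    DELETE FROM users WHERE id IN (SELECT id FROM moved);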

~~~
jpitz
Should be a transaction. I bet if you could find a way to corrupt data on safe
hardware with this query and a power-plug test, the pg developers would treat
it as a high-priority bug.

~~~
aanari
That's correct. As jpitz mentioned, if the code is run inside a transaction
block, then we don't have to worry about a failure partway through leaving
rows deleted but not archived.

~~~
jpitz
The transaction is implicit - in fact, I don't know of a way to do this
outside the scope of a transaction. Is there?

~~~
steventhedev
My instinct here is to write the query in such a way that even if pg changes
in the future, or if I migrate to an almost compatible alternative tomorrow,
it should work the same way. Perhaps more importantly, it sends the message to
future devs (including me) that rows should only be deleted after they've
already been archived. It also covers that 0.000000001% chance that everything
else went wrong and the intern did it manually and managed to screw it up.

In the end, queries are code, and code is our way to communicate our intent to
the next developer, so it's better to do the delete after the insert.

~~~
dragonwriter
> My instinct here is to write the query in such a way that even if pg changes
> in the future, or if I migrate to an almost compatible alternative tomorrow,
> it should work the same way.

Transaction semantics are a fundamental feature of pg, of SQL, and of ACID
databases generally. Anything that doesn't preserve them is not an "almost
compatible alternative".

> Perhaps more importantly, it sends the message to future devs (including me)
> that rows should only be deleted after they've already been archived.

Using an _explicit_ transaction, even if it is not necessary, communicates to
future devs the actual intent, which isn't "rows deleted after they are
archived" but "deletion and archive should occur in a single atomic step, and
either both happen or neither happen." The ordered approach is just a way to
express the least harm alternative in an environment that doesn't provide
atomicity guarantees, but that's inappropriate when working in an environment
that actually provides atomic transactions.
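
A sketch of that explicit form (table names hypothetical):

    BEGIN;
    INSERT INTO users_archive SELECT * FROM users WHERE deactivated;
    DELETE FROM users WHERE deactivated;
    COMMIT;  -- both changes land atomically, or neither does
    -- (under READ COMMITTED, a row deactivated between the two statements
    -- could be deleted without being archived; the single-statement CTE
    -- form sidesteps that)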

One of the biggest problems that programmers with a stronger background
outside of databases often have when writing SQL DB code is that they keep
reaching for ingrained workarounds for the fact that many programming
environments lack convenient support for grouping operations into atomic
units, even though SQL supports such grouping directly.

------
mizerable
Nice! Thank you for this.

~~~
aanari
No problem mizerable! Glad you enjoyed it.

