
Postgres Job Queues and Failure by MVCC - craigkerstiens
https://brandur.org/postgres-queues
======
rosser
Disclaimer: IAAPDBA (I am a PostgreSQL DBA) I have been employed thusly for a
decade now, and have worked in environments ranging from tens of concurrent
users to hundreds of thousands. Previously, I was a developer.

A few jobs ago, I encountered a similar situation. That stack was Python-
based. The engineers had apparently not realized that the DBAPI's default is
to silently wrap all database interactions in an explicit transaction. (That
is, before it submits your query to the db, it transmits a "BEGIN", and
expects you to "COMMIT" or "ROLLBACK" as appropriate. I, personally, think
that's a _terrible_ choice, but whatever.)

Everywhere else in the stack, this wasn't an issue — or at least not a
particularly troublesome one. In the job queue, it was. The reason is simple:
the job workers were long-lived, holding those (implicitly opened) explicit
transactions for the lifetime of the connection. In the rest of the stack, the
app code would connect, interact, and disconnect cleanly, including COMMIT-
ing. How folks realized the need to do that in one place, but not the other,
no-one could adequately explain to me.

When I started, they were uncomplainingly habituated to shutting down their
entire jobs processing infrastructure (which had customer-visible impact) for
hours, every week, to do a "VACUUM FULL"/"REINDEX" cycle on the queue table.
That table's disk footprint followed a perfect "sawtooth" pattern, growing
linearly "up and to the right", well into the gibibytes (for, on average, a
few hundred to a few thousand relatively narrow, "live" rows), until the
maintenance event, where it would drop off vertically. Lather, rinse, repeat.
They had been running this way for _years_.

It took a _one-line_ code change to disable that behavior, and hours of
downtime per week were eliminated.
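
The gist of the fix, as a sketch (assuming psycopg2 here; the actual stack
and all names differed):

    import psycopg2

    conn = psycopg2.connect("dbname=jobs")  # hypothetical DSN

    # The one-line fix. Without it, psycopg2 (like most DBAPI drivers)
    # silently sends BEGIN before your first statement and holds that
    # transaction open until an explicit COMMIT/ROLLBACK -- so a
    # long-lived worker that never commits pins the MVCC horizon.
    conn.autocommit = True

    cur = conn.cursor()
    cur.execute("SELECT id FROM job_queue LIMIT 1")  # hypothetical table
    print(cur.fetchone())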

IMO, this is even less PostgreSQL's "fault" than the "write amplification"
issue that Uber flounced over last year. They Did It Wrong, and it bit them.

TL;DR, it's the poor craftsman who blames his tools, and when you use a tool
incorrectly, that tends to be consequential.

EDIT: phrasing.

~~~
k__
Would you say the DBA job is easy mostly because non-DBAs simply don't bother
with getting to know their DB?

~~~
rosser
No, I would say that makes it harder.

People have all kinds of crazy, naïve assumptions about how databases work. I
often don't see the result of those assumptions until late in the process,
when correcting them is significantly more difficult.

EDIT: It does create a lot of teachable moments, however. Over time, many of
those assumptions get dispelled.

------
smilliken
Bloat due to long running transactions and batch writes is a significant
operational pain in PostgreSQL. For extreme cases where vacuum isn't doing its
job, check out pg_repack.

There is another approach to queuing in postgresql that doesn't suffer this
problem though:

* Give each of your workers an id from 0..n.

* Open a connection to read from the queue, ordered by your priority index, and partitioned by your worker id (e.g. primary key % n = worker id). Make it a cursor.

* Read from the cursor in batches. You'll never see the same dead row twice because the cursor is stateful and is doing a single scan. You generally won't see dead rows anyway since this worker is the only one assigned to update these queue items.

* Perform your writes in a separate postgresql connection to avoid invalidating the cursor. Think about it like how a Unix process has a stdin and stdout pipe, except your pipes are postgresql connections.

This approach is great for throughput because you request queued items in
batch, and there's no contention between workers. It's admittedly not as good
for latency, because a new queue item won't be dequeued until the worker it's
assigned to finishes its run and restarts with a new cursor. For low-latency
queues, use FOR UPDATE SKIP LOCKED.
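
Sketched out, the read/write split looks something like this (a psycopg2
sketch; table and column names are invented):

    import psycopg2

    N_WORKERS = 4
    WORKER_ID = 0  # this worker's slot, 0..N_WORKERS-1

    # Connection 1: a server-side ("named") cursor doing one ordered
    # scan over this worker's partition of the queue.
    read_conn = psycopg2.connect("dbname=jobs")  # hypothetical DSN
    read_cur = read_conn.cursor(name="queue_scan")
    read_cur.execute(
        "SELECT id, payload FROM queue"
        " WHERE id %% %s = %s AND NOT done"
        " ORDER BY priority",
        (N_WORKERS, WORKER_ID))

    # Connection 2: all writes go here, so committing them never
    # disturbs the open cursor on the read connection.
    write_conn = psycopg2.connect("dbname=jobs")
    write_cur = write_conn.cursor()

    while True:
        batch = read_cur.fetchmany(100)
        if not batch:
            break  # scan exhausted; reopen the cursor for a new pass
        for job_id, payload in batch:
            # ... do the actual work here ...
            write_cur.execute(
                "UPDATE queue SET done = true WHERE id = %s", (job_id,))
        write_conn.commit()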

------
evadne
I noticed that this article was written in 2015, and I wonder whether SKIP
LOCKED in Postgres 9.5 could have changed the entire situation:

https://blog.2ndquadrant.com/what-is-select-skip-locked-for-in-postgresql-9-5

http://www.cybertec.at/skip-locked-one-of-my-favorite-9-5-features

Also, duplicate of: https://news.ycombinator.com/item?id=9576864

~~~
rosser
Pre-9.5, you could use "SELECT ... FOR UPDATE NOWAIT" to achieve the same end.

~~~
malisper
Uh... NOWAIT will error if the query encounters a locked row, and roll back
the transaction. SKIP LOCKED will make locked rows invisible to the current
transaction, but will not cause the transaction to be rolled back. SKIP LOCKED
effectively provides behavior equivalent to the recursive CTE given in the
post.
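
Concretely, the difference looks like this (a sketch assuming psycopg2 >= 2.8
for the errors module; "jobs" is a made-up table):

    import psycopg2

    conn = psycopg2.connect("dbname=jobs")  # hypothetical DSN
    cur = conn.cursor()

    # NOWAIT: errors the moment the query tries to lock a row another
    # transaction already holds, aborting the whole transaction.
    try:
        cur.execute(
            "SELECT id FROM jobs ORDER BY id LIMIT 1 FOR UPDATE NOWAIT")
    except psycopg2.errors.LockNotAvailable:
        conn.rollback()  # the transaction is dead at this point

    # SKIP LOCKED: locked rows are silently filtered out instead; the
    # transaction stays healthy and sees whatever rows were free.
    cur.execute(
        "SELECT id FROM jobs ORDER BY id LIMIT 1 FOR UPDATE SKIP LOCKED")
    row = cur.fetchone()  # a lockable row, or None if all were locked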

~~~
rosser
> _Pre-9.5..._

~~~
malisper
I don't see how NOWAIT gives you what you want here. NOWAIT and SKIP LOCKED
are completely different.

~~~
rosser
The clause "SKIP LOCKED" did not exist _until_ 9.5.

Before that version, the behavior of silently skipping rows that had
outstanding locks against them was specified by using "NOWAIT".

The decision of which specific clause to use is driven by which version of
PostgreSQL you're using.

How much more plainly can I put that?

~~~
malisper
AFAICT the behavior of NOWAIT has always been the same. The earliest mention
of NOWAIT I could find was in Postgres 8.1[0]:

> To prevent the operation from waiting for other transactions to commit, use
> the NOWAIT option. SELECT FOR UPDATE NOWAIT reports an error, rather than
> waiting, if a selected row cannot be locked immediately.

[0] https://www.postgresql.org/docs/8.1/static/sql-select.html

~~~
rosser
You're right. With a super-contrived test case, I got the described behavior.

Somehow, in all the hundreds of millions (guesstimate) of rows I've processed
with SELECT FOR UPDATE NOWAIT, I've never actually gotten the error. At a
guess, that's because I've almost always also used a LIMIT clause, and there
have always been enough _unlocked_ rows available to satisfy it.

Apologies for the noise.

------
random_comment
I misunderstood what the heck the article was about at first. I hope this
quirky analogy may help:

"I am implementing a database for a bank.

I open a transaction in a special active_customers table when the customer
first joins the bank, and commit the transaction when they close their account
or die. In this way, I use postgres transactions to keep track of which
customers are active customers.

I am having problems with long-living customers forcing us to wait up to 120
years between autovacuums of the DB."

Clearly, that's not going to be a good way to do things.

rosser's comment (https://news.ycombinator.com/item?id=14731214) is very
helpful.

------
random_comment
> As dead tuples continue to accumulate in the index, Postgres enters a hot
> loop as it searches the B-tree, comes up with an invisible tuple, and
> repeats the process again and again, surfacing empty-handed every time. By
> the end of the experiment illustrated in the charts above, every worker
> trying to lock a job would cycle through this loop 100,000 times.

1. I am curious whether the author tried tuning autovacuum more appropriately
for a table that is in continuous use at 50 new rows/second.

This article
(http://okigiveup.net/postgresql-vacuuming-an-introduction-for-busy-devs/)
suggests:

    ALTER TABLE person SET (autovacuum_vacuum_scale_factor = 0.0);
    ALTER TABLE person SET (autovacuum_vacuum_threshold = 4000);

This should result in entries being removed from the index and table
regularly.

2. If I understand correctly, the other part of the problem is that there are
lots of rows to be individually tested to see whether they can be locked,
based on the way the author has implemented their queue. While lock attempts
are being made on a set of rows, those rows aren't being deleted, since
they're 'in use'. This non-deletion causes things to get out of control.

As far as I know, 9.5's SKIP LOCKED is meant to help address that
(https://blog.2ndquadrant.com/what-is-select-skip-locked-for-in-postgresql-9-5/),
though it appears to be doing something kinda similar under the hood.

If the author of the original article (@ brandur.org) is following this
discussion I would be grateful if they could evaluate these two options in the
context of their test implementation.

~~~
rosser
The problem with the first point is that autovacuum can't mark "dead" tuples
as "dead" if there are open transactions that are "older" (in MVCC terms) than
those tuples, no matter how aggressively it's tuned.

"SKIP LOCKED" (or, pre-9.5, "NOWAIT") will definitely help with the locking
issue, but not with the bloat.
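
Spotting the offender is straightforward, for what it's worth (a sketch;
assumes a reasonably modern pg_stat_activity with a "query" column):

    import psycopg2

    conn = psycopg2.connect("dbname=jobs")  # hypothetical DSN
    cur = conn.cursor()

    # Open transactions, oldest first. Anything ancient at the top is
    # pinning every tuple newer than it, no matter how aggressively
    # autovacuum is tuned.
    cur.execute("""
        SELECT pid, state, now() - xact_start AS xact_age, query
        FROM pg_stat_activity
        WHERE xact_start IS NOT NULL
        ORDER BY xact_start
    """)
    for pid, state, age, query in cur.fetchall():
        print(pid, state, age, query)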

~~~
random_comment
Thank you for a good summary.

The real problem, in my understanding, was that I wasn't seeing that the
article's author REALLY IS using long-running PG transactions as a way to
enforce external queue items running to completion (or not). I know they say
it in the article — indeed, it's the point of the article — but it seems so
strange.

Hence he's getting these long-running transactions in PG in the first place.
The phrase 'sledgehammer to crack a marshmallow' comes to mind.

At 50 items/second I guess I'm totally baffled over why you wouldn't simply
use e.g. an exclusive table lock each time you connect to the DB to
add/take/remove a task from a task table, with a timestamp to allow aborting
and rescheduling of tasks.
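
By that I mean something like the following (a sketch; the "tasks" schema is
invented):

    import psycopg2

    conn = psycopg2.connect("dbname=jobs")  # hypothetical DSN
    cur = conn.cursor()

    # psycopg2 opens the transaction implicitly; the table lock is held
    # only for this short claim transaction, not while the task runs.
    cur.execute("LOCK TABLE tasks IN ACCESS EXCLUSIVE MODE")
    cur.execute("""
        UPDATE tasks
        SET taken_at = now()
        WHERE id = (SELECT id FROM tasks
                    WHERE taken_at IS NULL
                    ORDER BY id LIMIT 1)
        RETURNING id, payload
    """)
    task = cur.fetchone()  # None when the queue is empty
    conn.commit()  # releases the lock immediately

    # A reaper can later reschedule anything whose taken_at is too old,
    # covering workers that died mid-task.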

I just ran a test to check the performance of exclusive table locking for this
purpose. Despite everything dumping to the console while running, the slowness
of setting up a completely new psql session for each connection (i.e. no
pgpool etc.), using a bash script, and running on a crappy 6-year-old Mac, I
got over 100 TPS.

------
kevinburke
This is a really great summary of the problems with queues in Postgres.

I ended up working around this by moving jobs to an "archived" table when they
were complete, so the "hot" table never gets too large, and by avoiding
long-running transactions: we update jobs to "in-progress", and then the
worker is responsible for marking them as "complete" or "failed". If the
worker crashes, a "stuck jobs" background worker will mark jobs as failed
after a timeout.

https://github.com/Shyp/rickover
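
This isn't rickover's actual code, but the two background pieces look roughly
like this (a sketch; the schema and the 10-minute timeout are invented):

    import psycopg2

    conn = psycopg2.connect("dbname=jobs")  # hypothetical DSN
    cur = conn.cursor()

    # "Stuck jobs" sweeper: anything in-progress for too long is assumed
    # to belong to a crashed worker and is marked failed.
    cur.execute("""
        UPDATE jobs
        SET status = 'failed'
        WHERE status = 'in-progress'
          AND started_at < now() - interval '10 minutes'
    """)

    # Archiver: completed jobs move out of the hot table so it stays
    # small and cheap to scan.
    cur.execute("""
        WITH done AS (
            DELETE FROM jobs WHERE status = 'complete' RETURNING *
        )
        INSERT INTO archived_jobs SELECT * FROM done
    """)
    conn.commit()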

~~~
codesnik
In one of my projects I was going to use something like that: an "UPDATE ...
SKIP LOCKED" ran in an isolated, short transaction, which would just mark the
job with a "taken" flag and a timestamp and commit immediately. The "stuck
jobs" worker was meant to fire as rarely as possible, only for cases of
unhandled failures, but it was still there.
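
The claim query would look something like this (a sketch; schema invented):

    import psycopg2

    conn = psycopg2.connect("dbname=jobs")  # hypothetical DSN
    cur = conn.cursor()

    # Short transaction: claim exactly one free job and commit right
    # away, so nothing stays open while the job itself runs.
    cur.execute("""
        UPDATE jobs
        SET taken = true, taken_at = now()
        WHERE id = (SELECT id FROM jobs
                    WHERE NOT taken
                    ORDER BY id
                    LIMIT 1
                    FOR UPDATE SKIP LOCKED)
        RETURNING id, payload
    """)
    job = cur.fetchone()  # None means the queue is empty right now
    conn.commit()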

------
lobster_johnson
Whenever this comes up, I argue for moving the queue orchestration out of
Postgres. Everyone will be happier that way.

By this I mean that you keep your jobs in Postgres, but instead move the
ordering -- the physical queue data structure -- out into something like
RabbitMQ or Redis.

To schedule a job, create a job row, then push its ID to Rabbit or whatever.
To process the next job, grab the next ID, look up the job row, then do the
processing.

That gives you the best of both worlds, without ever running into issues with
locking and dead tuples.

(There is of course a race condition here where you may create a job but not
publish it (due to Rabbit temporarily failing, for example). But there are
ways around that.)
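
A sketch of the flow (psycopg2 plus redis-py here; RabbitMQ would be
analogous, and all names are invented):

    import psycopg2
    import redis

    conn = psycopg2.connect("dbname=jobs")  # hypothetical DSN
    r = redis.Redis()

    # Producer: the row in Postgres is the source of truth; Redis only
    # carries the ordering. Commit first, then publish the ID.
    cur = conn.cursor()
    cur.execute("INSERT INTO jobs (payload) VALUES (%s) RETURNING id",
                ("send_welcome_email",))
    job_id = cur.fetchone()[0]
    conn.commit()
    r.lpush("job_ids", job_id)

    # Consumer: pop an ID and load the row. A missing row means the
    # producing transaction never committed (the race mentioned above),
    # so the ID is simply discarded.
    _key, raw_id = r.brpop("job_ids")
    cur.execute("SELECT payload FROM jobs WHERE id = %s", (int(raw_id),))
    row = cur.fetchone()
    if row is not None:
        pass  # ...do the work, then mark the job done...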

~~~
andrewvc
I would argue that is not a pragmatic approach. The article makes a good point
about the huge benefits of keeping the queue in the database for relatively
small workloads. Transactional guarantees are hugely useful! Being able to
roll back something like a user create _including_ a queued job to send out a
welcome email is incredibly nice.

IMHO, HN conversations tend too much toward solving problems at Google scale.
We often forget here that there are tons of high-value, business-critical
systems that generate mere tens, hundreds, or a few thousand background jobs a
day. Given the numbers quoted in the article, DB-backed queue systems scale
quite far. I've successfully used them to handle tens of thousands of events
per day with no problem, over a timespan of years. It's clear that this is a
great approach for _some_ situations, and that one size fits all doesn't make
sense here.

~~~
lozenge
> Transactional guarantees are hugely useful! Being able to roll back
> something like a user create including a queued job to send out a welcome
> email is incredibly nice.

That would work in this case - the job is added to the postgres table in a
transaction, which eventually rolls back. The ID of the job was put into a
separate queue. When the queue consumer tries to get the job from the database
by ID, it doesn't see it because the transaction never committed. It can
discard that job ID from its queue.

~~~
andrewvc
Yes but then you need to support a whole additional piece of infrastructure
and you have to write and support more code. For a lot of use cases that's a
waste of time and money. This is a case of YAGNI for the use cases I
mentioned.

------
memracom
When will people stop trying to build queuing systems on top of a database and
just use a queuing service? Amazon's SQS, RabbitMQ, and dozens of other tools
are specialized for message queuing. A job queue is just a message queue in
which the messages hold a job.

ZeroMQ is the thing to use to roll your own queuing service, not a database
server, whether SQL or NoSQL. Databases are not queues, and database servers
are not designed with queues as any kind of important priority when features
are implemented.

~~~
hvidgaard
Until you hit the roadblock that is operations. Sometimes you cannot just
install new tools willy-nilly, and using an existing tool is preferred. There
is nothing wrong with using a database for this, and under the hood, that is
more or less what a message queue with persistence does anyway.

------
ris
This turns into a very interesting dissection of Postgres' internal workings,
but it's hard to get over the beginning's massive caveat about using an RDBMS
as a queue.

------
CodeWriter23
> storing jobs in a database allows a program to take advantage of its
> transactional consistency; when an operation fails and rolls back, an
> injected job rolls back with it

This is a false premise for using a database as a queue. The same goal can be
simply achieved with a different queuing system, say SQS, by only enqueuing
jobs AFTER successful transaction commit.

~~~
ruslan_talpa
How do you queue that job only after the transaction committed?

~~~
sorkin2
The same way the new built-in logical replication works: logical decoding.
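
For illustration, with the built-in test_decoding plugin (a sketch; a real
setup would use a proper output plugin or a streaming client like
pg_recvlogical, and the server must run with wal_level = logical):

    import psycopg2

    conn = psycopg2.connect("dbname=jobs")  # hypothetical DSN
    conn.autocommit = True
    cur = conn.cursor()

    # One-time setup: a logical replication slot.
    cur.execute("SELECT pg_create_logical_replication_slot"
                "('job_feed', 'test_decoding')")

    # Poll the slot. Only changes from *committed* transactions ever
    # appear here, so enqueueing from this stream is, by construction,
    # enqueue-after-commit.
    cur.execute("SELECT lsn, xid, data FROM"
                " pg_logical_slot_get_changes('job_feed', NULL, NULL)")
    for lsn, xid, data in cur.fetchall():
        if data.startswith("table public.jobs: INSERT"):
            pass  # push the job ID to SQS/RabbitMQ here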

~~~
ruslan_talpa
This means it does not work on RDS.

~~~
sorkin2
Tough.

~~~
rosser
This is the kind of comment that makes me wish I could burn some of my own
karma to downvote more than once.

