
Postgres Job Queues and Failure by MVCC - craigkerstiens
https://brandur.org/postgres-queues
======
chanks
(I'm the author of Que, the job queue discussed in the post most extensively)

This isn't terribly surprising to me, since I have an appreciation for what
long-running transactions will do to a system, and I try to keep transactions
as short-lived as possible on OLTP systems. I realize this should be
explicitly mentioned in the docs, though; I'll fix that.

I'll also note that since the beginning Que has gone out of its way to use
session-level locks, not transaction-level ones, to ensure that you can
execute long-running jobs without the need to hold open a transaction while
they work. So I don't see this so much as a flaw inherent in the library as
something that people should keep in mind when they use it.
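
For illustration, the session-level flavor looks roughly like this with the
`pg` gem. This is a sketch of the technique, not Que's actual query (which is
considerably more careful), and `run_job` is a stand-in:

    require "pg"

    conn = PG.connect(dbname: "myapp")

    # pg_try_advisory_lock takes a session-level lock: it survives COMMIT
    # and is held until explicitly released or the session ends, so no
    # transaction needs to stay open while the job runs.
    job = conn.exec(
      "SELECT id FROM jobs WHERE pg_try_advisory_lock(id) LIMIT 1"
    ).first

    if job
      begin
        run_job(job["id"])  # long-running work, outside any transaction
      ensure
        conn.exec_params("SELECT pg_advisory_unlock($1)", [job["id"]])
      end
    end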

(It's also something that I expect will be much less of an issue in version
1.0, which is set up to use LISTEN/NOTIFY rather than a polling query to
distribute most jobs. That said, 1.0 has been a relatively low priority for
much of the last year, due to a lack of free time on my part and since I've
never had any complaints with the locking performance before. I hope I'll be
able to get it out in the next few months.)
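
(The mechanism, roughly, again with the `pg` gem -- a sketch of the general
technique rather than 1.0's actual code; `lock_and_work_job` is a stand-in:)

    require "pg"

    # Worker: block on notifications instead of issuing a polling query.
    conn = PG.connect(dbname: "myapp")
    conn.exec("LISTEN jobs")

    loop do
      # Blocks (up to 60 seconds here) until a NOTIFY arrives.
      conn.wait_for_notify(60) do |_channel, _pid, payload|
        lock_and_work_job(payload)  # payload might carry the job's id
      end
    end

    # Enqueuer side, in the same transaction that inserts the job:
    #   INSERT INTO jobs ...;
    #   NOTIFY jobs, '<job id>';
    # The notification is only delivered if that transaction commits.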

~~~
brandur
> I'll also note that since the beginning Que has gone out of its way to use
> session-level locks, not transaction-level ones, to ensure that you can
> execute long-running jobs without the need to hold open a transaction while
> they work. So I don't see this so much as a flaw inherent in the library as
> something that people should keep in mind when they use it.

+1! I tried to clarify in the "Lessons Learnt" section that this isn't so much
a problem with Que, but something that should be kept in mind for any kind of
"hot" Postgres table (where "hot" means lots of deletions and lots of index
lookups). (Although many queues are more vulnerable due to the nature of their
locking mechanisms.)

But anyway, thanks for all the hard work on Que. The performance boost upon
moving over from QC was nice, but I'd say that the major win was that I could
eliminate 90% of the code where I was reaching into QC internal APIs to add
metrics, logging, and other missing features.

~~~
chanks
Thank you!

------
LukaAl
Thanks for the article, very interesting. Just one point on the solution. To
my understanding, you prefer a queue implemented on a DB because transactions
guarantee that a task cannot start unless the task before it has succeeded
and committed the required data to the DB. In our environment we run tasks on
RabbitMQ/Celery. One of the nice features, which I believe also exists in
Sidekiq, is that it allows you to create chains (or more complex structures)
of tasks, and the worker itself takes care of synchronizing them, removing
your problem (when a task finishes successfully it commits to the DB and
triggers the next step).

The only problems we had were at entry points, where we create the task
chains and fire them before committing the transaction. One solution was
manually committing before firing the task, but that was somewhat difficult.
What we implemented instead was a Python decorator that fires the tasks after
the function completes (thus after the transaction commits). In Go we achieve
the same result more simply using a defer statement. In my experience all
these solutions are local to the process that fires the task, so there is
less risk of interaction with other processes, and they are easy to implement
and more robust than the alternatives.

~~~
brandur
> Thanks for the article, very interesting.

Thanks!

> What we have implemented was a Python decorator to have the tasks actually
> fired after the function completion (thus after the transaction commit). In
> Go we achieve the same result in a more simple way using the defer
> statement. In my experience, all these solutions are local to the process
> that fires the task, so there is less risk of interaction with other
> process, easy to implement and more robust compared to other solution.

Oh yes, totally. I've seen this same pattern in Ruby before whereby a job
enqueue is put on something like an ActiveRecord `after_commit` hook.
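
Something like this (model and job names are hypothetical):

    class User < ActiveRecord::Base
      # Fires only after the surrounding transaction commits, so a
      # rollback never leaves an orphaned job sitting in the queue.
      after_commit :enqueue_invite_email, on: :create

      private

      def enqueue_invite_email
        InviteEmailJob.enqueue(id)  # whatever queue client you use
      end
    end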

One (overly) pedantic observation is that this still leaves you with the
possibility of having your transaction and data safely committed in your
database, but with your job not enqueued if something happens to your process
between the time of commit and time of enqueue. Admittedly though, this
probably doesn't happen all that often in real life.

Probably the best answer I have is that we take this approach for the sheer
convenience. We can do things like this:

    def create_user(email, password)
      User.transaction do
        user = User.new(email: email)
        user.set_password_hash(password)

        check_for_abuse!

        # make an account ID and queue a job to create a record in the
        # billing system
        user.billing_account_id = SecureRandom.uuid
        async_create_account_in_billing_system(user)

        # queue a job to send an e-mail invite
        async_send_invite_email(user)

        create_auditing_record

        # ...
      end
    end

You can build out very explicit chains of actions that may enqueue jobs or
call into other chains that may enqueue their own jobs, and all the while
never have to worry about any kind of ordering problem. There are no hidden
callbacks anywhere, but you still get perfect correctness: if anything fails
at any point, the whole system rolls back as if it never happened.
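
(To make the mechanics concrete: each `async_*` helper above can be as simple
as an insert into the jobs table, which is exactly what makes the rollback
property fall out for free. A sketch, with a hypothetical `Job` model:)

    def async_send_invite_email(user)
      # Just an INSERT: it joins whatever transaction is already open,
      # and rolls back along with everything else if anything fails.
      Job.create!(job_class: "SendInviteEmail", args: { user_id: user.id })
    end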

~~~
LukaAl
The observation on failure modes is correct. When you process thousands of
tasks every day, even low-probability problems happen. What bugs me about the
transaction approach is that you lose all the information (except, probably,
for logs). In your example, if I sign up for your service and for any reason
the process fails (maybe after I've received confirmation that everything is
ok), I will end up with no record stored for my account. This is problematic
for post-mortem diagnosis, for customer support (although it's easy to ask
the customer to redo the signup), and so on. Imagine you are handling
payments with an external provider (e.g. PayPal): I could end up being billed
without any trace on your system that I've paid. I'm not saying that my
approach is correct. I'm saying that assuming transactions will solve all
your problems discounts the fact that your system, to be useful, has side
effects (not using the term in a strict sense) on your customers and possibly
on systems outside your organization. I prefer to plan for actively managing
the possible failure modes, and to detect and correct anomalies quickly. For
this reason, inconsistent data is sometimes better than no data.

Another point on the database approach that I hadn't thought of before: in
the past we designed a system that stored and updated user timelines in
MySQL. It was always a nightmare. SQL databases are not designed with a high
ratio of write operations in mind: indexes quickly fragment, and even
deleting entries doesn't immediately reduce disk consumption. I don't see
this as a problem immediately applicable to the queue use case, but as your
service grows there's a risk of hitting scalability problems. Obviously you
can react like we did: use a bigger database, use better disks, create
cleanup jobs. But that's bad software engineering, and you're just buying
more time.

------
atombender
Good article!

I am currently implementing a project in which we use Postgres to track job
state (e.g., run status, failures, timings, resource usage, related log
entries), but Kafka as the actual queueing mechanism -- thus bypassing the
challenges mentioned in the article but still getting the best of Postgres.

This way we have complete, introspectable, measurable history about every
queue item. It greatly simplifies the Postgres part of it (state updates are
always appends, no locking) and, thanks to Kafka, increases performance and
scalability.

It also adds a measure of safety: We can detect "lost" jobs that disappear
because of data failure, bugs in Kafka, failing clients etc. We know that if a
job was never logged as "complete", it probably died.

The job log also functions as an audit log, and we also intend to use it for
certain jobs that benefit from being incremental and from being able to
continue from where they last left off (for example, feed processing).
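
(A rough sketch of the append-only shape, in Rails-flavored Ruby with
hypothetical names:)

    # Every state change is a new row; nothing is updated or locked.
    JobEvent.create!(job_id: job_id, state: "started")
    do_work(job_id)
    JobEvent.create!(job_id: job_id, state: "complete")

    # "Lost" jobs: started a while ago but never logged as complete.
    lost = JobEvent.where(state: "started")
                   .where("created_at < ?", 1.hour.ago)
                   .where.not(job_id: JobEvent.where(state: "complete")
                                              .select(:job_id))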

~~~
brandur
Interesting approach!

I'd be curious to hear about the mechanic that you came up with for division
of labor among workers — since every client is essentially reading the same
stream, I guess you'd have to distribute jobs based on job_id modulo
worker_number or something like that?

~~~
atombender
I suppose one could use a round-robin sharding approach like you mention, but
it goes against Kafka's design, and it's not necessary.

Kafka divides a queue into partitions. Each partition is a completely
independent silo. When you publish messages, Kafka distributes them across
partitions. When you read messages, you always read from a partition.

This means partitions are also the unit of parallelism: You don't want
multiple workers on a single partition (because of the labour division problem
you mention). Rather, Kafka expects you to have one partition per worker.

This is more elegant than it sounds if you're coming from something like
RabbitMQ. Partitions (i.e., queues) in Kafka are append-only and strictly
linear; unlike RabbitMQ, you can never "nack" a message in a way that results
in the message ending up at the back of the queue and thus violating the
original message order. Rather, Kafka expects each consumer to maintain its
"read position" in the queue. Failure handling, then, is simply a matter of
winding back the read position. And unlike RabbitMQ, there's less need for
complicated routing, dead-letter exchanges and so on, because rather than move
messages around, you're just moving a cursor.
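
(In code, the consumer loop looks roughly like this -- `client`, `fetch`, and
the offset persistence are a hypothetical minimal API, not any real gem's:)

    offset = load_offset(partition)  # last saved read position

    loop do
      message = client.fetch(partition, offset)
      begin
        process(message)
        offset += 1
        save_offset(partition, offset)  # advance only after success
      rescue => e
        # Failure handling: don't advance (or wind back) the cursor;
        # the same message is simply re-read on the next iteration.
        log_error(e)
      end
    end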

Of course, message order is only preserved within a single partition; if you
publish messages A, B and C and you have 3 partitions and 3 workers, then in
the real world messages may be processed in the order C, B, A. That sounds
bad,
but then other queue solutions such as Que or RabbitMQ suffer from the exact
same problem: If you run 3 workers against one queue, your queue may supply
each worker with messages in the right order, but there's no guarantee that
they will be _processed_ in that order. The only way to guarantee ordering is
to have just one worker per queue, using some kind of locking (RabbitMQ does
support "exclusive" consumers). But then you don't get any parallelism at all.
So I think Kafka's solution is quite sane, even if it's more low-level and
less developer-friendly than AMQP.

------
brandur
I'd be curious to hear what the general community thinks of putting a job
queue in a database and if there are a lot of other QC/Que users out there.

FWIW, the transactional properties of a Postgres-backed queue were so
convenient that we took advantage of them for a long time (and still do)
despite a few caveats (e.g. the degraded performance outlined in the post).
More recently, though, there's been a bit of a shift towards Sidekiq,
probably because it's generally very problem-free and has some pretty nice
monitoring tools.

(Disclaimer: I authored this article.)

~~~
apinstein
We are using Postgres as our queue backing store. I tried switching to Sidekiq
but ran into issues (read here:
https://github.com/mperham/sidekiq/pull/624).
Fortunately our job throughput is small enough to not hit any scaling issues
with Postgres, so I stuck with that because of my confidence and experience
w/Postgres over the years. The issues I ran into on Sidekiq just made me
skeptical of their architecture/code maturity, though that was several years
ago and it may be much improved by now.

We use JQJobs (which we authored) to manage queueing and it's architected such
that it could be ported to Redis or some other better backing store, or
potentially even to QC/Que, which I wasn't aware of until your article (so
thanks for that!).

~~~
brandur
Ah, nice, thank-you!

> Fortunately our job throughput is small enough to not hit any scaling issues
> with Postgres, so I stuck with that because of my confidence and experience
> w/Postgres over the years.

I think we're in a pretty similar situation. For what it's worth, I think that
a queue in PG can scale up about as well as Postgres can as long as you keep
an eye on the whole system (watch out for long-lived transactions and the
like).

------
wpeterson
This was a well written article with an interesting investigation.

However, storing small, ephemeral messages like jobs in a queue within
Postgres is a bad idea and the pain far outweighs the benefits of
transactional rollback for jobs.

Instead, a much simpler solution is to plan for jobs to run at least once, use
a more appropriate datastore like Redis or RabbitMQ, and build in idempotency
and error handling at the job layer.
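
For example, an at-least-once job made safe to re-run by an idempotency check
(model and API names are hypothetical):

    class ChargeOrderJob
      def perform(order_id)
        order = Order.find(order_id)

        # Idempotency: if an earlier (possibly crashed) run already
        # charged this order, running it again is a harmless no-op.
        return if order.charged?

        charge = BillingApi.charge(order.amount_cents)
        order.update!(charged: true, charge_id: charge.id)
      end
    end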

Postgres used as a system of record shouldn't be used for ephemeral message
queues.

~~~
anarazel
Well, it's not necessarily that simple. It can be very interesting to be able
to directly enter jobs into a queue in a transactional manner, with very low
latency. Say from a trigger.

Edit: typo
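
(For instance, a trigger that enqueues a job row whenever a user is inserted.
Table and function names are hypothetical; written as a Rails migration for
concreteness:)

    class AddUserJobTrigger < ActiveRecord::Migration
      def up
        execute <<-SQL
          CREATE FUNCTION enqueue_user_job() RETURNS trigger AS $$
          BEGIN
            -- Runs inside the same transaction as the INSERT that
            -- fired it, so the job commits or rolls back with it.
            INSERT INTO jobs (job_class, args)
              VALUES ('ProcessNewUser', NEW.id::text);
            RETURN NEW;
          END;
          $$ LANGUAGE plpgsql;

          CREATE TRIGGER users_enqueue_job
            AFTER INSERT ON users
            FOR EACH ROW EXECUTE PROCEDURE enqueue_user_job();
        SQL
      end
    end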

~~~
wpeterson
For the sake of error handling on rollback, you're usually better off
delaying job enqueueing to after-commit hooks if you're concerned about
failures resulting in a rollback.

~~~
adamtj
Doing anything after committing misses the point. If you can do that, you
don't need postgres.

For example, suppose you mark an account closed, commit, and then enqueue an
event to issue a refund from another system. It's possible that your process
may crash or be killed at just the wrong time leaving you with a closed
account but no refund.

So what if you enqueue the event before you commit? In that case, you might
crash before committing, and the transaction automatically rolls back; but
the event was already enqueued, so now you've issued a refund on an account
that was never closed.

Transactions make it trivial to guarantee that either both happen or neither
do. There are other ways to get that guarantee, but they require more work and
are more error prone.
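
Concretely, with the queue living in the same database (hypothetical models):

    Account.transaction do
      account.update!(status: "closed")

      # The refund job is just another row in the same database, so it
      # commits or rolls back atomically with the account change.
      Job.create!(job_class: "IssueRefund",
                  args: { account_id: account.id })
    end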

------
haavikko
Were you using the SERIALIZABLE isolation level in your application, as one
code example in your article seems to show? Would using the READ COMMITTED
level have made a difference?

------
bmm6o
The long-lived transaction that keeps the deleted jobs from being reclaimed -
is that process working with the job queue table, or is it working with other
tables in the database?

~~~
nierman
Even if the long-running transaction hasn't accessed the job queue table yet,
it might do so before it completes. Postgres needs to keep the "dead" tuples
accessible as long as there are active transactions that started before the
deletion.
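
(For what it's worth, the offending transactions are easy to spot in
pg_stat_activity -- e.g., with the `pg` gem:)

    # Transactions open for more than five minutes: these are what pin
    # dead tuples and keep VACUUM from reclaiming them.
    conn = PG.connect(dbname: "myapp")
    conn.exec(<<-SQL).each { |r| puts r.inspect }
      SELECT pid, xact_start, query
        FROM pg_stat_activity
       WHERE xact_start < now() - interval '5 minutes'
    SQL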

~~~
bmm6o
Right, that's what I thought. Is there value in giving the job queue its own
database, then, or are they taking advantage of the fact that job
modifications can be included in transactions that involve other tables?

