
Turning PostgreSQL into a queue serving 10k jobs per second (2013) - faizshah
https://gist.github.com/chanks/7585810
======
Diggsey
I have also found the lack of transactional guarantees in typical job queues
to be very problematic.

One problem with using PostgreSQL in this way (using either advisory locks or
SELECT ... FOR UPDATE) is that it requires you to keep an open connection to
the database whilst the job is being worked on.

For a MySQL database, this would be just fine, but PostgreSQL uses a process-
per-connection model which caps the number of active connections to the
database at a relatively low number (on the order of 1000x fewer connections
than a similarly sized MySQL instance), and tools like PgBouncer do nothing
to help with this.

As a result, if your jobs take more than a few milliseconds to execute (let's
say you make external HTTP requests as part of your job) this is not a good
approach to take.

I use a similar approach which avoids this problem, but it only works because
I have relatively low throughput requirements. I essentially implement in-
database advisory locks using a separate table - before taking a job, workers
create a row in the table, and the primary key of this table is used as a
worker ID. Jobs are "taken" by assigning them a worker ID. Each row in the
worker table has an expiry date, so if workers die, the corresponding row will
be deleted and any linked jobs released back into the queue.
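
A minimal sketch of that pattern (the table and column names here are my own,
not the commenter's):

```sql
-- Workers register themselves and periodically refresh expires_at as a
-- heartbeat; the row's primary key doubles as the worker ID.
CREATE TABLE workers (
  id         bigserial PRIMARY KEY,
  expires_at timestamptz NOT NULL
);

CREATE TABLE jobs (
  id        bigserial PRIMARY KEY,
  payload   jsonb NOT NULL,
  worker_id bigint REFERENCES workers (id) ON DELETE SET NULL
);

-- Take one unclaimed job for worker 42. The outer worker_id IS NULL check
-- prevents double assignment when two workers race for the same row.
UPDATE jobs
SET worker_id = 42
WHERE worker_id IS NULL
  AND id = (SELECT id FROM jobs WHERE worker_id IS NULL ORDER BY id LIMIT 1)
RETURNING id, payload;

-- A periodic reaper deletes expired workers; ON DELETE SET NULL releases
-- their jobs back into the queue.
DELETE FROM workers WHERE expires_at < now();
```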

As well as transactional guarantees, using a database as a job queue gives you
a lot of power over how jobs are executed: for example, our service for
delivering webhooks has a separate queue per customer, and we can ensure that
within a single queue jobs are processed strictly in order. Meanwhile, our
service for search indexing supports different priority levels, so that newly
created records are indexed with a higher priority.
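
To illustrate (again, my own sketch rather than the commenter's schema), both
policies come down to the ORDER BY on the dequeue query:

```sql
-- Strict in-order delivery within a single customer's queue:
SELECT id, payload FROM jobs
WHERE customer_id = $1 AND worker_id IS NULL
ORDER BY id
LIMIT 1;

-- Priority levels, e.g. indexing newly created records first:
SELECT id, payload FROM jobs
WHERE worker_id IS NULL
ORDER BY priority DESC, id
LIMIT 1;
```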

~~~
bgentry
I elaborated a bit on this elsewhere
([https://news.ycombinator.com/item?id=21537414](https://news.ycombinator.com/item?id=21537414))
but Que’s design has changed significantly and no longer holds open a
connection or transaction for the duration of working a job. It holds one
database connection per worker process. Each worker process handles all job
locking and assignment to the individual worker threads in that process, each
of which uses only connections that it decides to open itself.

Otherwise, I completely agree on the benefits of transactional job enqueueing.
This lets you push off so much complexity until you actually need it (when
you’ve scaled such that a single database doesn’t handle your needs well). At
that point, you get to solve the same challenges you will have been solving
all along to deal with jobs that might run which depend on transactions that
may not have committed (yet or ever). I believe this model is almost always
the right starting point for a web application, barring some unusual job
requirements or massive initial scale.
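
For concreteness, a minimal sketch of transactional enqueueing, assuming
hypothetical users and jobs tables (not Que's actual schema):

```sql
-- The job row becomes visible to workers only if the surrounding
-- transaction commits, so a worker can never pick up a job whose
-- associated data doesn't exist yet (or never will).
BEGIN;
INSERT INTO users (email) VALUES ('alice@example.com');
INSERT INTO jobs (job_class, args)
VALUES ('SendWelcomeEmail', '{"email": "alice@example.com"}');
COMMIT;
```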

~~~
ianai
Is this what you’re talking about?

[https://github.com/que-rb/que](https://github.com/que-rb/que)

~~~
s0l1dsnak3123
Yep!

------
bgentry
The author of this post, Chris Hanks, created the Que queueing library for
Ruby: [https://github.com/que-rb/que](https://github.com/que-rb/que)

It’s changed significantly since this post: the 1.x betas use a very
different structure which should actually be more efficient, use fewer
Postgres connections, cause less lock contention, and cause less table bloat.

Not sure if the benchmarks have been run recently or not but I’m definitely
curious how things stack up to this post from 6 years ago :)

~~~
aldoushuxley001
Wish I could use that in a Django app. There doesn't seem to be a viable
Python queueing library that allows using PostgreSQL.

~~~
subleq
I wrote django-postgres-queue for this purpose. It uses postgres transactions
to keep queue and application state in sync. It also uses SKIP LOCKED to avoid
some of the typical issues with using a database as a queue.

[https://github.com/gavinwahl/django-postgres-queue](https://github.com/gavinwahl/django-postgres-queue)

~~~
elnygren
Thanks for making this, I've been using it to help me send emails, push
notifications etc. to 20-30k users in a Django project :)

------
bloody-crow
One of the interesting unforeseen downsides of RDBMS-based queues is explored
here: [https://brandur.org/postgres-queues](https://brandur.org/postgres-queues)

TL;DR: the time to lock a job grows with the number of dead tuples in the
table, and dead tuples naturally accumulate while long-running transactions
are open.

~~~
Exuma
Is that fixed by the VACUUM process?

~~~
bloody-crow
No, because VACUUM can't remove those dead tuples while the transaction is
still running. Think of it this way...

When you open a transaction, you are guaranteed to be able to see rows as
they existed at the moment the transaction started. Your job queue is
chugging along and processes, let's say, 1,000 jobs per minute. Processing a
job involves deleting its row from the queue, but since you have a transaction
running, Postgres only marks the row as deleted and keeps it around in case
the transaction wants to access it at some point. Each time you need to
process a job, Postgres needs to lock a row, and the way this mechanism works
involves iterating over rows until you find one you can lock. If your
transaction has been running for 30 minutes, each job has to iterate through
30,000 dead rows (deleted, but still around for the sake of the transaction).
Slower lock times degrade the overall performance of the job queue, which
leads to jobs being added faster than they're processed, which further
exacerbates the problem.
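
You can actually watch this happen; a sketch, assuming the queue table is
named jobs:

```sql
-- Dead tuples that VACUUM hasn't been able to reclaim yet:
SELECT n_live_tup, n_dead_tup
FROM pg_stat_user_tables
WHERE relname = 'jobs';

-- The oldest open transaction pins the xmin horizon, which is what
-- prevents VACUUM from removing those rows:
SELECT pid, xact_start, backend_xmin
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY xact_start
LIMIT 1;
```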

~~~
biggestdecision
I wonder if this could be solved by moving in-progress jobs to a separate
table...

Guess you lose job atomicity that way.
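
A sketch of that idea, assuming hypothetical jobs and jobs_in_progress
tables: the move itself can at least be a single atomic statement.

```sql
-- Claim one job and move it out of the hot queue table in one statement;
-- SKIP LOCKED keeps concurrent consumers from grabbing the same row.
WITH taken AS (
  DELETE FROM jobs
  WHERE id = (
    SELECT id FROM jobs
    ORDER BY id
    LIMIT 1
    FOR UPDATE SKIP LOCKED
  )
  RETURNING id, payload
)
INSERT INTO jobs_in_progress (id, payload)
SELECT id, payload FROM taken;
```

Finishing the job and deleting it from jobs_in_progress still happens in a
later transaction, which is the atomicity gap mentioned above.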

------
ukd1
This post is pretty out of date; Que and QueueClassic moved to SKIP LOCKED at
some point, instead of using advisory locks / lock_head-style methods. It's
much faster, but only supported in Postgres >= 9.5.
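
The core of the SKIP LOCKED approach, sketched against a hypothetical jobs
table (not any particular library's schema):

```sql
-- Each worker grabs the first job nobody else has locked; locked rows are
-- skipped rather than blocked on, so workers never queue up behind each
-- other. The row stays locked until COMMIT.
BEGIN;
SELECT id, payload
FROM jobs
ORDER BY id
LIMIT 1
FOR UPDATE SKIP LOCKED;

-- ... perform the work ...

DELETE FROM jobs WHERE id = $1;  -- the id returned above
COMMIT;
```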

~~~
ukd1
[https://www.2ndquadrant.com/en/blog/what-is-select-skip-lock...](https://www.2ndquadrant.com/en/blog/what-is-select-skip-locked-for-in-postgresql-9-5/)
is actually a pretty good post explaining the before (aka this gist) and
after (aka SKIP LOCKED). Another (now pointless) way of doing it was using
[http://www.cs.tau.ac.il/~shanir/nir-pubs-web/Papers/Lock_Fre...](http://www.cs.tau.ac.il/~shanir/nir-pubs-web/Papers/Lock_Free.pdf)
(you can see an implementation here:
[https://github.com/QueueClassic/queue_classic/blob/v3.2.0.RC...](https://github.com/QueueClassic/queue_classic/blob/v3.2.0.RC1/sql/ddl.sql#L6))

------
oftenwrong
Why use a RDBMS as a queue? Because you already have it, and it works well.
One day you may need a purpose-built component, but today YAGNI.

[http://boringtechnology.club/](http://boringtechnology.club/)

~~~
pas
Breaking up the monolith is hard. Yes, violating YAGNI is terrible: over-
engineering makes everything brittle and hard to test, and so on. But at the
same time, I have burnt myself with spaghetti obelisks more often than with
the too-many-Docker-containers approach.

That said, using a common persistence store initially makes sense, but trying
to compartmentalize the job-queue/batch-processing stuff never hurts.

------
bloody-crow
This is an article from 2013. I suggest reflecting this in the HN title,
because it's not immediately obvious.

------
ukd1
There is also QueueClassic
([https://github.com/QueueClassic/queue_classic](https://github.com/QueueClassic/queue_classic)),
another Ruby-based project. Recently I benchmarked some improvements we made
to it vs Que (before and after); it's now faster, which is kinda cool:
[https://github.com/QueueClassic/queue_classic/pull/303#issue...](https://github.com/QueueClassic/queue_classic/pull/303#issuecomment-512970655)

------
kraih
The Minion job queue is another implementation of this idea, but using the
even more efficient and safe FOR UPDATE SKIP LOCKED:
[https://github.com/mojolicious/minion](https://github.com/mojolicious/minion)

------
voldacar
I'm not very knowledgeable about db internals, so sorry if this comes off as
ignorant, but in an era where CPUs execute billions of instructions per
second _per core_, is 10,000 jobs per second supposed to be impressive? Is
this kind of problem bottlenecked by memory?

~~~
qaq
By IOPS: you fsync each transaction to the write-ahead log (WAL) on commit.

~~~
Noumenon72
Is this supposed to explain why 10000 per second is impressive? I don't
follow.

~~~
qaq
This is supposed to explain that it's orthogonal to CPU performance: each
commit has to be fsynced to disk, so throughput is bounded by storage IOPS,
not by instructions per second.
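
To make that concrete: every COMMIT forces an fsync of the write-ahead log,
so the ceiling is set by how many syncs the storage can do per second, not by
the CPU. A sketch of the relevant (real) Postgres settings, for illustration
only:

```sql
SHOW wal_sync_method;      -- how Postgres forces the WAL to stable storage

-- Trading durability for throughput: don't wait for the WAL flush at
-- commit. A crash can lose the last few transactions, so use with care.
SET synchronous_commit = off;
```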

------
golergka
Seems related to this one from a week ago:
[https://layerci.com/blog/postgres-is-the-answer/](https://layerci.com/blog/postgres-is-the-answer/)

~~~
faizshah
Yup, the reason I posted this was the great discussion from last week:
[https://news.ycombinator.com/item?id=21484215](https://news.ycombinator.com/item?id=21484215)

------
cwalv
Another potential benefit of using an RDBMS as a queue is that it can make it
much simpler to express priorities in cases where it's not simple FIFO, i.e.,
when the next job a consumer should take depends on more than just the time
the job was added to the queue.

One place this has come up for me is when the next job that's picked depends
on currently running jobs, e.g., each job is associated with a user, and if a
single user already has N tasks running you may want to prioritize another
user's tasks for the N+1 slot.
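
A hedged sketch of that policy, assuming a hypothetical
jobs(id, user_id, state) table and a cap of 3 running jobs per user:

```sql
-- Pick the oldest pending job whose user has fewer than 3 jobs running;
-- SKIP LOCKED keeps concurrent consumers from fighting over the same row.
SELECT id, user_id
FROM jobs j
WHERE j.state = 'pending'
  AND (SELECT count(*)
       FROM jobs r
       WHERE r.user_id = j.user_id
         AND r.state = 'running') < 3
ORDER BY j.id
LIMIT 1
FOR UPDATE SKIP LOCKED;
```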

------
bufferoverflow
Now compare to OpenAMQ, ZeroMQ, RabbitMQ, NSQ, Kafka.

I have seen benchmarks reaching millions of messages per second.

~~~
madhadron
The article is discussing job queues, not messaging. The MQ systems plus Kafka
that you are referring to are message transport systems.

~~~
bufferoverflow
What's the actual difference between a job queue and a message queue filled
with job IDs?

~~~
mlyle
In a decent job queue you know when jobs get completed, whether the worker
died, and whether they're otherwise stuck.

------
rhacker
About 22 years back I got in trouble for recommending we replace Tuxedo's
queueing with just a database table, since we could get all the same
transactional options for free, without all the headaches and operational
issues involved in connecting all our code to the C++ middleware. This was
especially true since Java and PHP were in the mix, Tuxedo was kind of in its
own world, and it was pretty much the only non-free software we were using,
and we were using it in a horribly idiotic way.

------
sbov
Depends on the project. We have some that use PostgreSQL-backed queues, and
some using Redis. As you get more traffic, RDBMS resources become precious,
and wasting them on queues starts to become a poor tradeoff. Redis is so
generally useful that we tend to use it on most projects anyway, so it
doesn't add another moving part. We tend to be really conservative about
adding new things to our production environment, and despite that, Redis has
found a home in our stack.

~~~
mikeklaas
You're allowed to use more than one PostgreSQL instance for your queue if
RDBMS resources are precious <g>

------
dpedu
How does this compare to Redis? Seems like Redis would handily beat it.

~~~
dickeytk
The author addresses Redis:

> So, many developers have started going straight to Redis-backed queues
> (Resque, Sidekiq) or dedicated queues (beanstalkd, ZeroMQ...), but I see
> these as suboptimal solutions - they're each another moving part that can
> fail, and the jobs that you queue with them aren't protected by the same
> transactions and atomic backups that are keeping your precious relational
> data safe and consistent. Inevitably, a machine is going to fail, and you or
> someone else is going to be manually picking through your data, trying to
> figure out what it should look like.

I disagree, though. BLPOP is far easier to grok than any Postgres solution,
and Redis is rock-solid. Either the new Redis Streams or a good queueing
library will guarantee you don't miss any jobs, and you won't have a need for
transactions across both systems either.

I would also be very hesitant to add work to my database. That’s often a
sensitive part of systems.

~~~
shandor
I was also wondering about the "safe and consistent" part. Doesn't Redis
Streams with persistence solve those pretty well?

Edit: oh, someone mentioned this being from 2013. No Redis Streams back then.

~~~
ukd1
Ya, it doesn't matter re: Streams. Even if it's consistent in isolation,
you're using it alongside Postgres: how do you make actions in one consistent
with actions in the other? (TL;DR: you won't/don't.)

------
mianos
This reminds me of pgqueue
([https://github.com/markokr/skytools](https://github.com/markokr/skytools))
as used at Skype. It works, and it works well, but it needs a plugin, so it's
maybe not so good for RDS.

------
maxpert
Maybe I am being ignorant/arrogant, but I have seen similar toy systems crash
and burn in prod environments. And if you go through
[https://www.2ndquadrant.com/en/blog/what-is-select-skip-lock...](https://www.2ndquadrant.com/en/blog/what-is-select-skip-locked-for-in-postgresql-9-5/),
it clearly states:

> A queue implemented in the RDBMS will never match the performance of a fast
> dedicated queueing system, even one that makes the same atomicity and
> durability guarantees as PostgreSQL. Using SKIP LOCKED is better than
> existing in-database approaches, but you’ll still go faster using a
> dedicated and highly optimised external queueing engine.

People only heed these warnings once they have built a toy system that gets
rolled out to a production product that takes off. By that time you realize
"ohhh, I don't need all the ACID guarantees for _every job_ in my system",
and you want to run your workers as lambdas/elastically scaled workers (you
need more connections). That is when companies end up spending their best
engineers' efforts on moving away from Postgres-as-queue. Which comes down to
the question: why do it in the first place? What makes it cheaper (other than
local dev) to deploy a system that is bound to fail in the future? Don't get
me wrong, I love Postgres; I just don't believe it's the right tool for doing
this job.

------
ptrwis
I would now love to see PostgreSQL put more work into improving the
developer experience: job queues, a job scheduler, packages.

------
tuldia
PostgreSQL never ceases to amaze me.

The great thing is that it is a mature tool and you don't need to bring
another beast to the zoo.

------
Exuma
This is awesome, I love Que

------
29athrowaway
What is the problem with Kafka?

------
danmg
I always thought that using a database server for RPC was considered an anti-
pattern.

------
Thaxll
Bad idea to use this, because if the worker crashes the event is lost. Don't
use PG for that.

~~~
bloody-crow
Wait, what do you mean here? The job is a row in a database. You can crash as
much as you want; the row is not going anywhere until you explicitly mark it
as "worked" and delete it.

There are different performance constraints to this approach, but its data
integrity and robustness are unmatched, really.

