
System design hack: Postgres is a great pub/sub and job server - colinchartier
https://layerci.com/blog/postgres-is-the-answer/
======
nostrademons
Another neat hack is to use Postgres as a quick & dirty replacement for
Hadoop/MapReduce if you have a job that has big (100T+) input data but small
(~1G) output data. A lot of common tasks fall into this category: generating
aggregate statistics from large log files, searching Common Crawl for relevant
webpages, identifying abusive users or transactions, etc.

The architecture is to stick a list of your input shards in a Postgres table,
have a state flag that goes PENDING->WORKING->FINISHED->(ERROR?), and then
spin up a bunch of worker processes as EC2 spot instances that check for the
next PENDING task, mark it as WORKING, pull it, process it, mark it as
FINISHED, and repeat. They write their output back to the DB in a transaction;
there's an assumption that aggregation can happen in-process and then get
merged in a relatively cheap transaction. If the worker fails or gets pre-
empted, it retries (or marks as ERROR) any shards it was previously working
on.

Postgres basically functions as the MapReduce Master & Reducer, the worker
functions as the Mapper and Combiner, and there's no need for a shuffle phase
because output <<< input. Almost all the actual complexity in MapReduce/Hadoop
is in the shuffle, so if you don't need that, the remaining stuff takes < 1
hour to implement and can be done without any frameworks.
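A minimal sketch of that loop (a pure-Python simulation of the state machine; the `shards` table, column names, and `CLAIM_SQL` are illustrative assumptions, not the actual code — in the real setup the claim step is one atomic Postgres UPDATE):

```python
# Sketch of the PENDING -> WORKING -> FINISHED/(ERROR) shard state machine.
# A dict stands in for the Postgres table so the logic is easy to follow.

PENDING, WORKING, FINISHED, ERROR = "PENDING", "WORKING", "FINISHED", "ERROR"

# Hypothetical claim query: grab one PENDING shard without racing other workers.
CLAIM_SQL = """
UPDATE shards SET state = 'WORKING', worker = %(worker)s
WHERE id = (SELECT id FROM shards WHERE state = 'PENDING'
            ORDER BY id FOR UPDATE SKIP LOCKED LIMIT 1)
RETURNING id;
"""

def claim_next(shards, worker):
    """Return the id of the next PENDING shard, marking it WORKING."""
    for shard_id in sorted(shards):
        if shards[shard_id]["state"] == PENDING:
            shards[shard_id].update(state=WORKING, worker=worker)
            return shard_id
    return None  # nothing left to do

def finish(shards, shard_id, ok=True):
    """Record the outcome; failed shards go to ERROR for retry/triage."""
    shards[shard_id]["state"] = FINISHED if ok else ERROR

shards = {i: {"state": PENDING, "worker": None} for i in range(3)}
a = claim_next(shards, "w1")  # worker 1 claims shard 0
b = claim_next(shards, "w2")  # worker 2 claims shard 1, never shard 0
finish(shards, a)
```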

~~~
ones_and_zeros
If a worker fails or gets pre-empted, how does it retry anything? It's gone at
that point, no?

Sounds like you'd end up with a bunch of dangling shards orphaned in WORKING
state. And now you need timeouts and health checks and something to coordinate
all that.

~~~
ris
I used to do something a little like this with postgres advisory locks. Worker
takes out a lock that signifies it's working on a task, if the worker fails or
loses its connection in any way the lock gets dropped along with the
transaction being rolled back.

Example:
[https://gist.github.com/risicle/f4807bd706c9862f69aa](https://gist.github.com/risicle/f4807bd706c9862f69aa)
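For illustration, the shape of the pattern (a hedged sketch of the same idea, not the gist's actual code; the `tasks` table and claim query are assumptions): only locks held by live sessions block other workers, so a crashed worker's task becomes claimable again with no timeout machinery.

```python
# Advisory-lock task claiming: the lock, not a status column, marks
# "in progress". If the session dies, Postgres drops the lock and the
# task is claimable again automatically.

# Hypothetical claim query (per the pattern, not the gist verbatim):
CLAIM_SQL = """
SELECT id FROM tasks
WHERE done = false AND pg_try_advisory_lock(id)
LIMIT 1;
"""

def claim(tasks, held_locks, session):
    """Simulate the query: first not-done task whose lock is free."""
    for task_id in sorted(tasks):
        if not tasks[task_id]["done"] and task_id not in held_locks:
            held_locks[task_id] = session  # pg_try_advisory_lock succeeded
            return task_id
    return None

def session_died(held_locks, session):
    """Postgres releases all advisory locks owned by a dead session."""
    for task_id in [t for t, s in held_locks.items() if s == session]:
        del held_locks[task_id]

tasks = {1: {"done": False}, 2: {"done": False}}
locks = {}
t1 = claim(tasks, locks, "sess_a")  # sess_a locks task 1
t2 = claim(tasks, locks, "sess_b")  # sess_b must skip to task 2
session_died(locks, "sess_a")       # crash: task 1 is claimable again
t3 = claim(tasks, locks, "sess_c")
```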

~~~
rem7
Yeah, in MySQL you can also do this with a named lock. Really like this
workflow.

------
davidw
> It's rarely a mistake to start with Postgres and then switch out the most
> performance critical parts of your system when the time comes.

This is pretty good advice in general.

~~~
Thaxll
Adding postgres here and there where it's clearly the wrong tool is bad
advice.

~~~
xmprt
I probably would have gone with something like Kafka in this case, but it
seems like Postgres serves them pretty well even if it's the "wrong" tool. If
it works, fits your current needs, leads to faster development time, and isn't
needlessly slow, then I say go for it.

~~~
rumanator
Picking Postgres for alternative applications such as pub/sub services also
simplifies both deployment and development. You'll already be using Postgres
for persistence, so why waste time deploying another set of services and
developing another set of clients just to get the same functionality?

------
pjungwir
I've done this before with good results.

I was pleased to see they are using `SELECT FOR UPDATE SKIP LOCKED`. That is
what this 2nd Quadrant article recommends, which I think is required reading
if you want to implement this yourself:

[https://www.2ndquadrant.com/en/blog/what-is-select-skip-lock...](https://www.2ndquadrant.com/en/blog/what-is-select-skip-locked-for-in-postgresql-9-5/)

It goes into more detail about wrong ways to implement a queue and what the
downsides are for its preferred approach.

------
jordic
We use this kind of tooling a lot.. say, you need to check 20k URLs and you
want to rate limit them.. add them to a Pg table (with state and result
fields). A single-threaded worker just takes a row (marks it as pending)
and later updates it. With SELECT FOR UPDATE and SKIP LOCKED tricks you can
scale it horizontally to the number of workers you need.

I've also seen it used for software that sends mass mail (in our case around
100k/day).. its state is a postgres queue.

We also use Pg for transactional mail. We insert it into a table. (There is a
process that sends the row's mails).. the really nice part is that the mail
joins the DB transaction for free.. (all or nothing)
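That "mail joins the transaction" property is essentially the transactional-outbox pattern. A toy simulation (table names are illustrative; a real version would be two INSERTs in one Postgres transaction):

```python
# Transactional outbox: the email row is INSERTed in the same transaction
# as the business change, so "all or nothing" comes for free -- if the
# order rolls back, no mail is ever sent.

def place_order_with_email(db, order, fail=False):
    """Simulated transaction: both writes commit together or not at all."""
    staged = {"orders": db["orders"] + [order],
              "outbound_mail": db["outbound_mail"] + [f"confirm:{order}"]}
    if fail:
        return db      # rollback: original state untouched
    return staged      # commit: both rows land atomically

db = {"orders": [], "outbound_mail": []}
db = place_order_with_email(db, "o1")             # commits both rows
db = place_order_with_email(db, "o2", fail=True)  # rolls back both
```

A separate sender process then drains `outbound_mail`, so a crash between commit and send can at worst re-send, never lose mail.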

------
evv
Excellent design hack. If anybody in the Node/TypeScript ecosystem is looking
for this capability in a neat and supported library, it looks like the
graphile folks have you covered:

[https://github.com/graphile/worker](https://github.com/graphile/worker)

~~~
moltar
There’s also pg-boss that’s really full featured

------
zitterbewegung
Postgres is an acceptable relational database / nosql database / pub/sub / job
server / blockchain.

~~~
brian_herman
You forgot lisp.

~~~
leolas1995
and emacs too

------
zrail
If you're working with Ruby I have had good experiences with Que[1], which
implements a pattern similar to the OP using advisory locks.

[1]: [https://github.com/que-rb/que](https://github.com/que-rb/que)

~~~
theptip
Is there a similar library in Python?

I really like the look of this approach, but don't want to build it myself if
I can avoid it.

~~~
zrail
[https://pypi.org/project/pq/](https://pypi.org/project/pq/) looks similar,
although it looks like they use `SKIP LOCKED` instead of advisory locks. I'm
not sure what the tradeoff is.

~~~
fjp
SKIP LOCKED is much simpler and doesn't really have any gotchas.

We had an internal library that used advisory locks which had all sorts of
strange behavior we couldn't figure out, until we just moved to SKIP LOCKED.

------
soumyadeb
Great post! At Rudder (open-source Segment), we used Postgres as our streaming
layer in a similar fashion. It has been super stable in production, very easy
to set up, and we easily get > 10K events/sec insertion performance.

The code is open-sourced here in case anyone wants to re-use it:

[https://github.com/rudderlabs/rudder-server/blob/master/jobs...](https://github.com/rudderlabs/rudder-server/blob/master/jobsdb/jobsdb.go)

We had to build additional logic to clean up old jobs (similar to level merges
in comparable queuing systems).

------
fizwhiz
Isn't hijacking a DB as a "distributed" message queue a pretty well trodden
path? Enterprises have been doing this for decades.

~~~
colinchartier
This pattern falls down if you need to poll the database, because if you have
3 queues and 100 workers you're making 300 queries per poll interval.

The feature of postgres that makes this viable in comparison to most other
databases is the "channel"

~~~
MuffinFlavored
What if your listeners crashed/were down at the time of the `PUB` message?
Does this mean the message falls into oblivion (since it will never receive a
reply/ACK/get worked on)?

~~~
colinchartier
The pattern I follow is that workers immediately poll once when they start -
this means that when your workers restart after crashing, they'll pick up any
missing jobs.
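The invariant behind that pattern is that every job is seen at least once, by either the startup poll or a later NOTIFY, with duplicates run once. A sketch (the function and its names are illustrative, not from the article):

```python
# "Poll once on startup, then LISTEN": jobs enqueued while the worker was
# down are found by the initial poll; jobs enqueued afterwards arrive as
# notifications. Deduplicate so a job seen by both paths runs only once.

def jobs_to_run(startup_backlog, notified_ids, already_done=()):
    """Merge the startup poll with NOTIFY payloads, oldest first, no dupes."""
    seen, out = set(already_done), []
    for job_id in list(startup_backlog) + list(notified_ids):
        if job_id not in seen:
            seen.add(job_id)
            out.append(job_id)
    return out

# Worker restarts: jobs 1-2 were enqueued during the outage (caught by the
# initial poll), job 2's NOTIFY also races in, and job 3 arrives later.
run_order = jobs_to_run(startup_backlog=[1, 2], notified_ids=[2, 3])
```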

~~~
pm90
I'm not seeing how this recovers from the failure scenario where a worker
grabbed an event and then immediately crashed.

~~~
Rapzid
The comment you are replying to wasn't addressing that scenario.

Typically you would implement visibility timeouts and other such stuff.
Depending on the use case you could make specific optimizations or keep it
generic and have SQS like semantics or something.

~~~
disiplus
at this point why not use something like rabbitmq and not reinvent it in
postgres.

~~~
stickfigure
Because it's trivially easy. The dequeue operation updates the last_lease
time; the query makes sure you don't dequeue something that was recently
leased. It's far, far easier than setting up a separate job queue, especially
if you are already using an RDBMS.
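A sketch of that lease logic (the `last_lease` column is from the comment above; the table name, 60-second lease, and SQL are assumptions): a row is eligible if it was never leased or its lease has expired, and dequeuing just bumps the timestamp.

```python
import datetime as dt

# Lease-based dequeue: "locked" means "last_lease is recent". A crashed
# worker never renews its lease, so the job becomes visible again after
# LEASE seconds -- no separate health checks needed.

# Hypothetical SQL for the same operation:
DEQUEUE_SQL = """
UPDATE jobs SET last_lease = now()
WHERE id = (SELECT id FROM jobs
            WHERE state = 'PENDING'
              AND (last_lease IS NULL
                   OR last_lease < now() - interval '60 seconds')
            ORDER BY id FOR UPDATE SKIP LOCKED LIMIT 1)
RETURNING id;
"""

LEASE = dt.timedelta(seconds=60)

def dequeue(jobs, now):
    """Return the next job whose lease is absent or expired; renew its lease."""
    for job_id in sorted(jobs):
        lease = jobs[job_id]["last_lease"]
        if jobs[job_id]["state"] == "PENDING" and (
                lease is None or now - lease > LEASE):
            jobs[job_id]["last_lease"] = now
            return job_id
    return None

t0 = dt.datetime(2020, 1, 1)
jobs = {1: {"state": "PENDING", "last_lease": None}}
first = dequeue(jobs, t0)                             # leased at t0
second = dequeue(jobs, t0 + dt.timedelta(seconds=5))  # still leased: None
third = dequeue(jobs, t0 + dt.timedelta(seconds=120)) # expired: redelivered
```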

------
silasdavis
The postgres module for node is quite unreliable when holding open a
connection to listen on a channel. This helped:
[https://github.com/andywer/pg-listen/blob/master/README.md](https://github.com/andywer/pg-listen/blob/master/README.md)

But we still see issues.

------
cle
At the scale I operate at, I wouldn't consider this a viable option. What's
the backpressure like on NOTIFY/LISTEN? (Docs mention a maximum backlog of 8GB
on the message queue, is that configurable? Monitorable?) Tons of constant
churn on a table means we have to worry about vacuuming right? Now I have to
monitor that too to make sure it's keeping up. Not to mention all the usual
operational issues with running relational databases.

No thanks, I'll stick with GCP PubSub or AWS SQS, which are explicitly
designed for this use case, and for which I have to setup _no_ infrastructure.

~~~
colinchartier
- postgres doesn't maintain a queue for notify/listen, it's purely pubsub.
I'm not quite sure how the backpressure could make it to the database

- A lot of use cases involve not dropping messages after they are processed
(like CI jobs, in this example), so you don't have to vacuum the rows

- If you're comfortable with SQS there's no really big reason to switch, but
it makes it so that your project can _only_ run on amazon cloud servers, which
is annoying.

~~~
cle
[https://www.postgresql.org/docs/current/sql-notify.html](https://www.postgresql.org/docs/current/sql-notify.html)

> There is a queue that holds notifications that have been sent but not yet
> processed by all listening sessions. If this queue becomes full,
> transactions calling NOTIFY will fail at commit. The queue is quite large
> (8GB in a standard installation) and should be sufficiently sized for almost
> every use case.

My understanding of MVCC (correct me if I'm wrong), is every time you do an
update, dead tuples are left behind that need to eventually be vacuumed. If
vacuum isn't running or can't keep up, you'll run out of space or have a tx id
wrap around.

Work queues are a simple paradigm and swapping out one queue for another is a
trivial exercise that won't change the architecture of a system, and SQS or
PubSub could be easily used, regardless of where the rest of the project is
hosted. I'd rather pay that one-time cost than the ongoing cost of maintaining
my own job queue service and infrastructure.

~~~
dfox
The LISTEN/NOTIFY queue has nothing to do with MVCC. The only reason it exists
is that postgresql synchronizes the notification events with transaction
boundaries (ie. you get notifications only when you don't have an active
transaction). Given this, the only case where you would care about the depth
of the notification queue is when your application is exceptionally misbehaved,
and in that case you will run into more critical problems (stale locks, MVCC
bloat...) well before this begins to be an issue.

~~~
cle
I mentioned MVCC with respect to vacuuming, not the queue.

------
stephenr
I'm not sure I quite follow this statement:

> In the list above, I skipped things similar to pub/sub servers called "job
> queues" \- they only let one "subscriber" watch for new "events" at a time,
> and keep a queue of unprocessed events:

If your job queue only allows one single worker (even per named queue), I'd
argue it's a shit job queue.

~~~
ibejoeb
The article mentions Celery and Gearman.

I'm not here to defend celery generally, but it has no such limitation.

~~~
stephenr
Right, I spent some time earlier this year researching a number of job queue
systems for a client and never came across one that had this limitation, which
kind of makes me wonder what exactly the article is trying to claim.

------
fyp
Can someone share their experience with scaling pg's NOTIFY and LISTEN?

The use case I have in mind has a lot of logically separate queues. Is it
better for each queue to have its own channel (so subscribers can listen to
only the queue they need) or have all queues notify a global channel (and have
subscribers filter for messages relevant to them). I am mainly confused about
whether I need a dedicated db connection per LISTEN query and also how many
channels is too much.

~~~
osigurdson
Based on what I have read and some experimentation, I concluded that it will
not scale very well beyond ~500 listener processes because there needs to be
one persistent connection per listener. Furthermore, my understanding is using
pgbouncer will not help since it needs to use session pooling in order to
support LISTEN.

------
londons_explore
Hacks like this work at first, but long running transactions and postgres
don't do well together.

After a few weeks of running on a multi-TB table, you'll find the dead tuples
massively outnumber the live tuples, and database performance is dropping
faster than the Vacuum can keep up. Vacuum is inherently single-threaded,
while your processes making dead tuples as part of queries are multi-threaded,
so it's obviously the vacuum that fails first if most queries are long running
and touch most rows. Your statistics will get old because they're also
generated by the vacuum process, making everything yet slower.

Even if you can live with gradually dropping performance, eventually the whole
thing will fail when your txids wrap around and the whole database goes read-
only.

~~~
ansible
Long running transactions? I didn't think the article implied doing something
like that. The job state changes shouldn't take long to execute.

------
directionless
Postgres generally has a fairly low maximum connection limit. If you're
running your own servers, you can adjust this, but in the cloud you may not be
able to. For example, Google CloudSQL maxes out at 1000, Heroku at 500.

At that point, people usually start looking at connection pooling tools.
Depending on how much work you need from the DB, connection pools can be a
win. Anyone know how connection pooling works with listeners?

~~~
1996
Can't say about listeners. But connection poolers introduced more problems
than I cared to fix. So now I routinely use 3000 connections, and have 9000
set up for peaks. It eats a little more ram but it is more stable.

~~~
takeda
Connections aren't free [1]. After reducing the number of connections from 200
to 60, my commit throughput doubled.

[1] [https://brandur.org/postgres-connections](https://brandur.org/postgres-connections)

------
z3t4
With today's hardware I argue you most likely do not need a pub/sub service.
Considering the extra work needed in a distributed system, you could save a
lot of time by keeping it simple.

------
luord
Postgres is now pretty much the ultimate nearly-all purpose backend, it seems.
At this point, I won't need to use anything else.

And I'm more than perfectly fine with that.

------
zyngaro
Keep in mind that a notify performed when nobody is listening is lost. The
workers then need to catch up.

~~~
gunnarmorling
This is where the Debezium connector for Postgres [1] comes in: it will
retrieve change events from the TX log and push them to brokers such as Apache
Kafka or Pulsar, or directly to your application. After being offline, any
consumer will continue to read from where it left off (applying "at least
once" semantics).

Disclaimer: I work on Debezium

[1]
[https://debezium.io/documentation/reference/0.10/connectors/...](https://debezium.io/documentation/reference/0.10/connectors/postgresql.html)

~~~
makkesk8
I fail to see how Debezium solves this: it's just as likely that Debezium
itself isn't watching the WAL at a given moment as it is that a minimal set of
always-listening workers went down for some reason.

Could you please elaborate?

~~~
gunnarmorling
Debezium stores the WAL position it has read ("offset"). So while it's down,
of course it won't be able to read any events. But once it's restarted, it
will continue to read the WAL from the last offset. So no events will be
missed also when the connector isn't running for some time (WAL segments are
retained by the DB until acknowledged by a logical decoding client such as
Debezium).

------
anonu
How do you performance tune PostgreSQL on AWS and still keep it running under
a reasonable cost?

~~~
dkoston
In my experience, you don’t. With tons of IOPS you need EBS or crazy expensive
instances.

Instead you use a cloud like google cloud where you can add NVMe SSDs to
whatever instance type you need and configure custom RAM and CPU instead of
picking from the super expensive AWS instances with no configurable options
and almost always the wrong resource allocations for your workload.

Source: testing my infrastructure that requires 60,000 iops on both google
cloud and AWS and it being 1/4 the cost and higher performance on Google. Of
note: this was a very high throughput streaming data application. YMMV for
other applications.

------
Iv
I'll soon have to do a pub/sub for an application that's close to a
multiplayer video game.

Most advice I have seen says that I'll probably want to code it myself, but I
was wondering about the latency of that solution? I'll likely have a SQL store,
and that would be a good argument to use postgres...

~~~
codesushi42
Use Redis pubsub instead.

~~~
Iv
Looking into it, it seems to have something like 2ms of latency which is good.
Do you think it would be a good idea to use for exchanging, say, realtime
position of players in a virtual world?

~~~
codesushi42
Not with pubsub. You could use it as shared memory store for player states
though, I guess.

------
2bitencryption
I'm curious if the same holds true if you drop in Sqlite/MS Sql Server/Mysql.

I.e. is this good advice because Postgres in particular is a great
implementation of sql, or because sql in general is good enough to solve this
problem, or a mix of the two?

~~~
colinchartier
Postgres in particular:

- has strong performance vs, say, sqlite

- has "channel" and "trigger" support so you can avoid polling (which either
slows down your jobs or limits your number of workers)

- is actually OSS (versus, say, mysql)

~~~
Twirrim
How is MySQL not OSS?

~~~
cogman10
It is somewhat split. Not all MySQL features are open source (enterprise
edition feature). Postgres is completely open source.

~~~
jabl
Well, yeah, but then there's stuff like enterprisedb which is postgresql +
some proprietary stuff on top, and the company employs some of the postgresql
core developers. So in some sense it's "PostgreSQL Enterprise Edition" in all
but name.

I believe Citus DB(?) is something similar.

~~~
e12e
> I believe Citus DB(?) is something similar.

"Citus Unforks From PostgreSQL, Goes Open Source"

[https://www.citusdata.com/blog/2016/03/24/citus-unforks-goes...](https://www.citusdata.com/blog/2016/03/24/citus-unforks-goes-open-source/)

[https://news.ycombinator.com/item?id=11353322](https://news.ycombinator.com/item?id=11353322)

As for mysql - I don't know why anyone wouldn't prefer mariadb - but I suppose
it's conceivable mysql has features that makes it worth dealing with oracle
licensing. But I doubt it.

------
namanyayg
I've simply been using MySQL/Maria everywhere and am meaning to switch, but
I'm not sure what makes Postgres this much better.

Can this hack not be achieved by a mariadb table too?

~~~
z92
The key PG features used here seem to be SKIP LOCKED and the PG notification
system, which informs all clients directly connected to the database that
"this table has changed" so they can restart their work.

The notification system comes with a cost: you can't scale up the number of
connections to a PG database instance.

Therefore you need to go pooling.

And for SKIP LOCKED, I just checked and mysql/maria do have it.

------
aarbor989
I did something very similar with MySQL, since that was the DB we already had
set up for data (not my choice). Basically any database that has atomic
operations can do this. It definitely is way more convenient and cheaper to
use your existing infra for pub/sub and then only scale out to other services
once performance becomes subpar. Although if you already have a messaging
service up and running, it's probably better to use that.

------
klagermkii
Thanks, seeing that atomic fetch in action is very useful.

------
webscalist
if you use Postgres on AWS, you'll easily exhaust IOPS with this and hell
breaks

------
DevKoala
I agree. I have prototyped this in the past, but our current pub/sub was not
painful enough for us to go full steam ahead with PG.

However, my design was more bare bones. I was picking jobs by chaining CTE's
that did the status update as they returned the first element of the queue.

------
tapirl
Should the "ci_job_status" in the line be "ci_jobs" instead?

    
    
         INSERT INTO ci_job_status(repositor ...
    

BTW, it looks like Go has become so popular that many tutorials are using it
for examples. ;D

------
rco8786
Interesting enough, but I’m not sure why this is called a hack. It’s a fully
supported feature of Pg

------
xmly
Kinesis is based on DynamoDB...

Key-value stores could do a lot of things theoretically.

~~~
solidasparagus
Kinesis is based on DynamoDB? Can you share where you saw that?

------
rantwasp
[http://mikehadlow.blogspot.com/2012/04/database-as-queue-ant...](http://mikehadlow.blogspot.com/2012/04/database-as-queue-anti-pattern.html)

~~~
mplewis
The concerns in this post have already been answered in OP's post.

------
fouc
Another neat hack would be to use Postgres as a graph db

~~~
retox
Check out AgensGraph, it's a graph DB engine built on postgres. I haven't used
it for much more than toy projects at this point though.

[https://github.com/bitnine-oss/agensgraph](https://github.com/bitnine-oss/agensgraph)

------
foou
How do you manage the bloat?

~~~
existencebox
Typically (at least when I've implemented variations of this pattern) a batch
job run on a schedule performing something like "expire stale/completed jobs
past a threshold."

If you need the existence of completed jobs to compute future jobs, use a
supplementary data structure that stores spans or other heuristics, updated in
a transaction with job completion.
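Such a cleanup pass might look like this (a sketch; the table/column names, 7-day retention window, and SQL are assumptions):

```python
import datetime as dt

# Scheduled cleanup for a Postgres-backed queue: delete FINISHED jobs
# older than a retention window so the table (and its bloat) stays bounded.
# Hypothetical SQL equivalent:
#   DELETE FROM jobs WHERE state = 'FINISHED'
#     AND finished_at < now() - interval '7 days';

RETENTION = dt.timedelta(days=7)

def expire_stale(jobs, now):
    """Drop completed jobs past the retention threshold; return count removed."""
    stale = [j for j, row in jobs.items()
             if row["state"] == "FINISHED"
             and now - row["finished_at"] > RETENTION]
    for job_id in stale:
        del jobs[job_id]
    return len(stale)

now = dt.datetime(2020, 1, 15)
jobs = {
    1: {"state": "FINISHED", "finished_at": dt.datetime(2020, 1, 1)},   # stale
    2: {"state": "FINISHED", "finished_at": dt.datetime(2020, 1, 14)},  # keep
    3: {"state": "WORKING",  "finished_at": None},                      # keep
}
removed = expire_stale(jobs, now)
```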

------
ubu7737
Used the "FOR UPDATE SKIP LOCKED LIMIT 1" trick to implement a job server in
PG a few years ago for the first time.

It's a great solution.

~~~
alberth
Why the “LIMIT 1”?

~~~
ubu7737
It makes the job-fetching more atomic, as the article points out. If you take
>1, you have to sort out the situation where 1 or more jobs fail, which adds
an unreasonable amount of complexity.

On my first go I was taking batches, on the reasoning that it optimized for
query performance. It was a complete disaster and I quickly realized that I
was engaging in premature optimization.

Batching is out, stream-processing is in. If you design your job table
correctly, PG will perform very well as an advanced stream processor.

