
What is ‘skip locked’ for in PostgreSQL 9.5? (2016) - fanf2
https://blog.2ndquadrant.com/what-is-select-skip-locked-for-in-postgresql-9-5/
======
brianwawok
Why use a database as a queue? It's well known that this neither scales nor
works particularly well. If you need big scale, you can use something like
RabbitMQ or Kafka. If you want to avoid server setup, use SQS or Cloud
Pub/Sub. If you want it to be lighter weight, use Redis.

This is kind of like an article talking about how most people use a hammer to
put in screws wrong. Which is cool, and good for learning why using a hammer
to put in screws is a bad idea. But the outcome of all this should be "Use a
proper tool for the job", not "try to come up with a neat trick to make this
work in the wrong system".

~~~
dom0
> Why use a database as a queue?

Already have a central, configured and monitored server and need "just a small
queue" for something. This is not per se a bad decision. For the same reason
it doesn't have to be a bad idea to cache things in the main database, instead
of using a dedicated cache like Redis.

~~~
BurningFrog
In my experience, the cost of adding and maintaining another storage subsystem
to a project is often hugely underestimated. It's easy to see the benefits and
ignore the costs.

If I can solve a problem reasonably well by adding a table to my Postgres DB,
that will always beat out adding the specialized ScrewdriverDB that does it
perfectly.

~~~
mbell
Depends on what you're adding. Running Redis is dead simple and easy to have
visibility into. RabbitMQ / Kafka are much larger undertakings.

~~~
derefr
If you already had a need for a _durable_ database, and so you properly
implemented Postgres streaming-archiving + disaster recovery, at much personal
effort for your tiny startup... and you now need a _durable_ queue as well...
then "just installing Redis" won't get you there. You'll need to do that whole
ops setup all over again for Redis. Whereas, if your queue is _in_ Postgres,
the ops problem is already solved.

If you have a full-time ops staff, throwing another marginal piece of
infrastructure on the pile isn't much of an issue. If you're your own "ops
staff of one", each moving part of your infrastructure—that you need to
_ensure_ works, to guarantee you won't lose client data—is another thing
slowing down your iteration speed.

~~~
mbell
You're making a lot of assumptions that aren't terribly valid for most use
cases. Not to say that strict consistency requirements don't exist, they
certainly do, but they are the exception rather than the norm. A one person
startup doesn't usually care too much about losing a bit of client data.

> If you have a full-time ops staff, throwing another marginal piece of
> infrastructure on the pile isn't much of an issue.

That's a rather insulting description of Redis, care to elaborate?

~~~
cwyers
It's nothing like insulting. Think of it like absolute and marginal costs. You
have absolute infrastructure -- the sum total of all your infrastructure --
and you have marginal infrastructure -- bits and pieces that can be added and
removed. If you have no Redis and then you have Redis,
NewAbsoluteInfrastructure = OldAbsoluteInfrastructure + Redis, and Redis is
your marginal infrastructure.

~~~
mbell
I mis-interpreted your use of 'marginal', sorry about that.

------
misframer
MySQL also has SKIP LOCKED from 8.0.1 [0].

[0] http://mysqlserverteam.com/mysql-8-0-1-using-skip-locked-and-nowait-to-handle-hot-rows/

------
asah
(possibly dumb q, pls be gentle)

Hmmm... What happens if the application crashes immediately after removing the
work item but before it can do anything else? Doesn't this break exactly-once
semantics... ?

i.e. wouldn't a complete implementation include a second table with "who's
working on what" and a transaction that moves the records from one table to
the other... and come to think of it, why not store both tables in the same
place? I.e. don't delete records, just mark them as taken and include a
timestamp so we can timeout and give the work to another worker?

UPDATE queue_table SET taken_by = :myId, time_taken = now() WHERE id = (
SELECT id FROM queue_table WHERE taken_by IS NULL LIMIT 1 FOR UPDATE)

Note: completion can be signalled by setting taken_by to NULL, or by adding
another column e.g. completion time, which then enables computing stats on
completion times.

For high volume systems, we eventually want to garbage collect but that's easy
since we have timestamps, i.e. put a partial index on timestamp (WHERE
time_taken IS NOT NULL) and scan the oldest ones...
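The timeout and garbage-collection parts of that scheme might look roughly
like this (a sketch only — column names follow the UPDATE above, and the
10-minute timeout is an assumed value):

```sql
-- Reclaim timed-out work: clear the claim on rows that were taken
-- too long ago, so another worker's dequeue UPDATE can pick them up.
UPDATE queue_table
   SET taken_by = NULL, time_taken = NULL
 WHERE taken_by IS NOT NULL
   AND time_taken < now() - interval '10 minutes';

-- Partial index so timeout/GC scans only touch rows that were taken,
-- not the whole queue.
CREATE INDEX queue_table_time_taken_idx
    ON queue_table (time_taken)
 WHERE time_taken IS NOT NULL;
```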

~~~
mbell
> What happens if the application crashes immediately after removing the work
> item but before it can do anything else? Doesn't this break exactly-once
> semantics... ?

If the connection is broken the transaction would be aborted and the lock
released. If the worker hit an infinite loop or something like that you'd use
something like `idle_in_transaction_session_timeout` to set a transaction
timeout and/or have a worker monitoring system in place to kill long running
jobs.
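As a concrete sketch of that safety net (the timeout value is an assumed
example; the setting itself requires PostgreSQL 9.6+):

```sql
-- Abort any session that sits idle inside an open transaction for more
-- than 60 seconds. Aborting the transaction releases its row locks, so
-- a stuck worker's task becomes visible to other workers again.
SET idle_in_transaction_session_timeout = '60s';
```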

The important thing to note is that a single DB instance and a client still
comprise a distributed system. You still have almost all the same problems
with 'exactly-once' semantics you would with a distributed queue. You should
make all processing jobs idempotent and support retry regardless of the tech
backing the queue if you want a system that provides effectively exactly-once
semantics.

~~~
asah
ah! you're assuming the work is performed inside the same transaction as the
dequeue operation, and locks held for the duration ?

If so...

While I suppose row level locking technically solves contention, it still
feels like we're "asking for trouble" in holding database locks while clients
perform arbitrarily long work operations. There's also practical issues when
the work itself is distributed and the original client can't itself keep state
around, i.e. it has to end the low level transaction.

Hence my poor-man's question/proposal using worker IDs and timeouts...

~~~
mbell
> ah! you're assuming the work is performed inside the same transaction as the
> dequeue operation, and locks held for the duration ?

Yes that is the model the linked post is proposing, see the Example.

> While I suppose row level locking technically solves contention, it still
> feels like we're "asking for trouble" in holding database locks while
> clients perform arbitrarily long work operations. There's also practical
> issues when the work itself is distributed and the original client can't
> itself keep state around, i.e. it has to end the low level transaction.

Not that I recommend using PG as a queue, but you have most/all of those
problems with any queuing backend. A problem you may have that is PG specific
is that the # of open connections/transactions could become quite large with a
lot of workers, and PG doesn't play well with a lot of connections, since it
uses a process-per-connection model.

------
spotman
database rule #1, do not use your database as a work queue.

this article does a great job discussing why not, and finishes with a sane
implementation that would work, but would fall over and shudder to a halt with
a moderate amount of load. ( which is mentioned in the article too )

doing queue workloads in rdbms is a recipe for index contention, bugs, or
both.

~~~
ngriffith
I've found that co-locating my queue with my application data enables a very
powerful pattern around persisting actions and queuing up callbacks --
especially useful for smaller applications where scale isn't as much of a
factor. It means I can persist both my data and my background jobs in a single
transaction, and if something goes wrong during that transaction they both
rollback (before the jobs are picked up by any workers). No need to coordinate
two different systems (which can be full of pitfalls).

~~~
koolba
Yes, for infrequent yet high value tasks where consistency is paramount, DB
queues are king. The alternative of using an externalized queue is either non-
transactional without failure handling ( _yet everybody pretends it is_ ),
transactional with 2PC where recovery was never thought through ( _or tested_
), or devolves to having the equivalent of a DB queue anyway to track work,
with re-insertion of work into the primary queue.

------
w23j
Locking the task row and doing the work in the same transaction has the nice
property that the task becomes available again if the worker fails and the
transaction is rolled back.

Unfortunately this may cause problems when the tasks take considerable time to
complete and long running transactions are the result: "Postgres Job Queues &
Failure By MVCC": https://brandur.org/postgres-queues
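A minimal sketch of that lock-and-work-in-one-transaction model (table and
column names are assumed, not from the article):

```sql
BEGIN;

-- Grab one unclaimed item, skipping any rows other workers currently
-- hold locked, instead of blocking on them.
SELECT id, payload
  FROM queue
 ORDER BY id
 LIMIT 1
   FOR UPDATE SKIP LOCKED;

-- ... perform the work here, while the row lock is held ...

-- Remove the item only on success; a crash or ROLLBACK before COMMIT
-- releases the lock and the task becomes available to other workers.
DELETE FROM queue WHERE id = :id;

COMMIT;
```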

------
based2
Why not use scheduled, timestamped checkpoint synchronization? Then you get
loose coupling.

------
zzzcpan
I noticed long ago that RDBMS users don't actually care about consistency, so
it's always kind of wrong at some level, and when things break it's not that
big of a deal for them. It's just annoying to see people claiming that
transactions are easy, when they are only easy to do incorrectly.

