
// One problem with using PostgreSQL in this way (using either advisory locks or LOCK FOR UPDATE) is that it requires you to keep an open connection to the database whilst the job is being worked on. //

Not necessarily. You lock the row only for an instant, to update a field (for example a "status" column) to "running", and then disconnect from the database within milliseconds. Finish your job, taking as much time as you want, and then connect again to change the row's status to "finished".

This is how it's always designed, in my experience. The locking is only needed when querying rows where status = "waiting" and instantly changing them.

It's not to "keep the record locked, and DB connection up, until I finish my batch job". That would be a bad design.



What happens when the worker processing the job dies and never updates the status?


An easy way to implement this is to have a "heartbeat" column that the worker updates every N seconds in a thread. A periodic cron-like job reaps jobs that missed M heartbeats. It's possible a worker was alive but unable to update the heartbeat due to, e.g., a temporary network partition. It's also possible the worker's job execution thread crashed while the heartbeat thread kept chugging, causing the job to remain unfinished indefinitely. You can minimize the probability of these failures with client-side logic, though: if the worker can't update the heartbeat due to network errors, it cancels itself, and you can structure workers so that stuck/crashed job threads are detected and the job is put in a "failed" state, or the worker just exits and allows the job to be reaped.
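A rough sketch of that, assuming hypothetical heartbeat_at and status columns and picking 30 seconds as the missed-heartbeat threshold (N * M):

  -- Worker heartbeat thread, every N seconds while the job runs
  UPDATE jobs SET heartbeat_at = now()
   WHERE id = :job_id AND status = 'running';

  -- Reaper, run periodically: requeue jobs that missed M heartbeats
  UPDATE jobs
     SET status = 'waiting', heartbeat_at = NULL
   WHERE status = 'running'
     AND heartbeat_at < now() - interval '30 seconds';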

But come to think of it, these two classes of problems also occur with systems that hold a DB lock for the duration of the job. If the worker loses its connection, it needs to somehow cancel itself, unless the work is idempotent and wasted computation doesn't matter. And if the job crashes, you need to make sure you release the lock.

Btw to add another related point, databases do have a lock timeout that you have to worry about if you hold a lock for the duration of the job. Your job execution time cannot exceed the lock timeout.


It helps to reference Que's schema [1] and source code to explain this further. But I'm also going from memory so it's possible I will miss some details :)

* If the entire worker process dies, it loses its Postgres connection, which is holding an advisory lock on the jobs being worked. That releases those jobs to be picked up by another worker. I don't recall how the built-in retry & back-off mechanisms work in this scenario. The advisory lock is indeed held for the entire time the jobs are being worked on, but only from a single supervisor connection (rather than one connection per job).

* If the job thread crashes, the worker supervisor catches this and the job is marked for a retry at a later time.

[1]: https://github.com/que-rb/que/blob/master/lib/que/migrations...
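For illustration only (this is not Que's actual query, just the general advisory-lock shape), claiming a job might look roughly like this; the lock is tied to the session, so it vanishes automatically if the worker's connection dies:

  -- Try to claim a runnable job without blocking
  SELECT id
    FROM jobs
   WHERE run_at <= now()
     AND pg_try_advisory_lock(id)
   LIMIT 1;

  -- Release explicitly when the job finishes (or fails)
  SELECT pg_advisory_unlock(:job_id);

Real implementations are more careful than this, since pg_try_advisory_lock in a WHERE clause can end up locking rows that the query then discards.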


You can include a locked_at field and have your update query target not_started rows, plus started rows where locked_at is older than the job timeout.
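Something like the following, assuming hypothetical status and locked_at columns and a 60-second job timeout:

  -- Claim either a fresh job or one whose worker appears dead
  UPDATE jobs
     SET status = 'started', locked_at = now()
   WHERE id = (
         SELECT id FROM jobs
          WHERE status = 'not_started'
             OR (status = 'started' AND locked_at < now() - interval '60 seconds')
          ORDER BY id
          LIMIT 1
          FOR UPDATE SKIP LOCKED
   )
   RETURNING id;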


A global job timeout might be unreasonable with high variance in workload, e.g. some jobs taking 0.5 seconds and others 30 seconds. You might set a global timeout of, say, 60s, but it sucks to wait 59.5s to reap a short job whose worker crashed. A better system is to make workers update a timestamp on an interval and reap any jobs that haven't been updated in N seconds.


It's a trade-off between updates per second and latency.

Maybe simply using a timeout per job type is a better way. (That of course trades off simplicity.)
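One way to express that, assuming a hypothetical timeout_seconds column on the jobs table (or joined in from a job-type table):

  -- Requeue stuck jobs using each job's own timeout
  UPDATE jobs
     SET status = 'waiting'
   WHERE status = 'running'
     AND locked_at < now() - timeout_seconds * interval '1 second';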


I agree. Frequency of updates also becomes more of an issue as you add workers. Say you have 1000 workers, each updating every 2 seconds: that's ~500 timestamp update statements per second, which is not trivial in terms of added load on the DB.


Some of these systems also check whether the worker running the job is still alive and, if it's not, kill the transaction; the job then restarts. You can also do this by time: record when the job was locked and how long it has to complete before you nuke it. Of course there are edge cases, but you have to choose how to deal with them; just leaving it means handling failures manually.



