
// One problem with using PostgreSQL in this way (using either advisory locks or LOCK FOR UPDATE) is that it requires you to keep an open connection to the database whilst the job is being worked on. //

Not necessarily. You lock the row only for an instant, to update a field (for example a "status" column) to "running", and then disconnect from the database within milliseconds. Finish your job, taking as much time as you want, and then connect again to change the row's status to "finished".

This is how it's always designed, in my experience. The locking is only needed when querying rows where status = "waiting" and instantly changing them.

It's not to "keep the record locked, and DB connection up, until I finish my batch job". That would be a bad design.



What happens when the worker processing the job dies and never updates the status?


An easy way to implement this is to have a "heartbeat" column that the worker updates every N seconds in a thread. A periodic cron-like job reaps jobs that missed M heartbeats. It's possible a worker was alive but unable to update the heartbeat due to, e.g., a temporary network partition. It's also possible the worker's job execution thread crashed while the heartbeat thread kept chugging, causing the job to remain unfinished indefinitely. You can minimize the probability of these failures with client-side logic, though: if the worker can't update the heartbeat due to network errors, it cancels itself, and you can structure workers so that stuck/crashed job threads are detected and the job is put in a "failed" state, or the worker just exits and allows the job to be reaped.
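A rough sketch of that, assuming hypothetical heartbeat_at and status columns and picking 30 seconds as the missed-heartbeat threshold (N * M):

  -- Worker heartbeat thread, every N seconds while the job runs
  UPDATE jobs SET heartbeat_at = now()
   WHERE id = :job_id AND status = 'running';

  -- Reaper, run periodically: requeue jobs that missed M heartbeats
  UPDATE jobs
     SET status = 'waiting', heartbeat_at = NULL
   WHERE status = 'running'
     AND heartbeat_at < now() - interval '30 seconds';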

But come to think of it, these two classes of problems also occur with systems that hold a DB lock for the duration of the job. If the worker loses its connection, it needs to somehow cancel itself, unless the work is idempotent and wasted computation doesn't matter. And if the job crashes, you need to make sure you release the lock.

Btw to add another related point, databases do have a lock timeout that you have to worry about if you hold a lock for the duration of the job. Your job execution time cannot exceed the lock timeout.


It helps to reference Que's schema [1] and source code to explain this further. But I'm also going from memory so it's possible I will miss some details :)

* If the entire worker process dies, it loses its Postgres connection, which is holding an advisory lock on the jobs being worked. That releases those jobs to be picked up by another worker. I don't recall how the built-in retry & back-off mechanisms work in this scenario. The advisory lock is indeed held for the entire time the jobs are being worked on, but only from a single supervisor connection (rather than one connection per job).

* If the job thread crashes, the worker supervisor catches this and the job is marked for a retry at a later time.

[1]: https://github.com/que-rb/que/blob/master/lib/que/migrations...
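For illustration only (this is not Que's actual query, just the general advisory-lock shape), claiming a job might look roughly like this; the lock is tied to the session, so it vanishes automatically if the worker's connection dies:

  -- Try to claim a runnable job without blocking
  SELECT id
    FROM jobs
   WHERE run_at <= now()
     AND pg_try_advisory_lock(id)
   LIMIT 1;

  -- Release explicitly when the job finishes (or fails)
  SELECT pg_advisory_unlock(:job_id);

Real implementations are more careful than this, since pg_try_advisory_lock in a WHERE clause can end up locking rows that the query then discards.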


You can include a locked_at field and have your update query target not_started rows, plus started rows where locked_at is older than the job timeout.
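Something like the following, assuming hypothetical status and locked_at columns and a 60-second job timeout:

  -- Claim either a fresh job or one whose worker appears dead
  UPDATE jobs
     SET status = 'started', locked_at = now()
   WHERE id = (
         SELECT id FROM jobs
          WHERE status = 'not_started'
             OR (status = 'started' AND locked_at < now() - interval '60 seconds')
          ORDER BY id
          LIMIT 1
          FOR UPDATE SKIP LOCKED
   )
   RETURNING id;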


A global job timeout might be unreasonable with high variance in workload, e.g. some jobs taking 0.5 seconds and others 30 seconds. You might set a global timeout of, say, 60s, but it sucks to wait 59.5s to reap a short job whose worker crashed. A better system is to make workers update a timestamp on an interval and reap any jobs that haven't been updated in N seconds.


It's a trade-off between updates per second and latency.

Maybe simply using a timeout per job type is a better way. (That of course trades off simplicity.)
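One way to express that, assuming a hypothetical timeout_seconds column on the jobs table (or joined in from a job-type table):

  -- Requeue stuck jobs using each job's own timeout
  UPDATE jobs
     SET status = 'waiting'
   WHERE status = 'running'
     AND locked_at < now() - timeout_seconds * interval '1 second';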


I agree. Frequency of updates also becomes more of an issue as you add workers. Say you have 1000 workers, each updating every 2 seconds: that's ~500 timestamp update statements per second, which is not trivial in terms of added load on the DB.


Some of these systems also check whether the worker running the job is still alive and, if it's not, kill the transaction; the job then restarts. You can also do this by time: record when the job was locked and how long it has to complete before you nuke it. Of course there are edge cases, but you have to choose how to deal with them; just leaving it means handling failures manually.



