Postgres Job Queues and Failure by MVCC (brandur.org)
62 points by craigkerstiens on May 20, 2015 | 27 comments



(I'm the author of Que, the job queue discussed in the post most extensively)

This isn't terribly surprising to me, since I have an appreciation for what long-running transactions will do to a system, and I try to design systems to use transactions that are as short-lived as possible on OLTP systems. I realize that this should be explicitly mentioned in the docs, though; I'll fix that.

I'll also note that since the beginning Que has gone out of its way to use session-level locks, not transaction-level ones, to ensure that you can execute long-running jobs without the need to hold open a transaction while they work. So I don't see this so much as a flaw inherent in the library as something that people should keep in mind when they use it.

(It's also something that I expect will be much less of an issue in version 1.0, which is set up to use LISTEN/NOTIFY rather than a polling query to distribute most jobs. That said, 1.0 has been a relatively low priority for much of the last year, partly due to a lack of free time on my part and partly because I've never had any complaints about the locking performance before. I hope I'll be able to get it out in the next few months.)
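
For anyone who hasn't used LISTEN/NOTIFY before, a rough sketch of the idea with the pg gem looks like this; it's only an illustration, not Que's actual implementation, and the channel name and `work_job` helper are made up:

  require "pg"

  conn = PG.connect(dbname: "myapp")   # hypothetical connection settings
  conn.exec("LISTEN new_job")          # subscribe to a notification channel

  loop do
    # Block until another session runs: NOTIFY new_job, '<job id>'
    conn.wait_for_notify do |channel, pid, payload|
      work_job(payload)                # hypothetical: look up and run the job
    end
  end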


> I'll also note that since the beginning Que has gone out of its way to use session-level locks, not transaction-level ones, to ensure that you can execute long-running jobs without the need to hold open a transaction while they work. So I don't see this so much as a flaw inherent in the library as something that people should keep in mind when they use it.

+1! I tried to clarify in the "Lessons Learnt" section that this isn't so much a problem with Que, but something that should be kept in mind for any kind of "hot" Postgres table (where "hot" means lots of deletions and lots of index lookups). (Although many queues are more vulnerable due to the nature of their locking mechanisms.)

But anyway, thanks for all the hard work on Que. The performance boost upon moving over from QC was nice, but I'd say that the major win was that I could eliminate 90% of the code where I was reaching into QC internal APIs to add metrics, logging, and other missing features.


Thank you!


Isn't LISTEN/NOTIFY basically useless for queues, since NOTIFY will wake up every single consumer, causing a polling stampede?


Thanks for the article, very interesting. Just one point on the solution. To my understanding, you prefer a queue implemented on a DB because transactions guarantee that a task cannot start unless the task before it has succeeded and committed the required data to the DB.

In our environment we run tasks on RabbitMQ/Celery. One of the nice features, which I believe also exists in Sidekiq, is that it allows you to create chains (or more complex structures) of tasks, and the worker itself takes care of synchronizing them, removing your problem (when a task finishes successfully it commits on the DB and triggers the next step).

The only problems we had were at the entry points, where we create the task chains and fire them before committing the transaction. One solution was manually committing before firing the task, but that was somewhat awkward. What we implemented instead was a Python decorator that fires the tasks only after the function completes (thus after the transaction commits). In Go we achieve the same result more simply with the defer statement. In my experience, all these solutions are local to the process that fires the task, so there is less risk of interaction with other processes; they're easy to implement and more robust than the alternatives.


> Thanks for the article, very interesting.

Thanks!

> What we have implemented was a Python decorator to have the tasks actually fired after the function completion (thus after the transaction commit). In Go we achieve the same result in a more simple way using the defer statement. In my experience, all these solutions are local to the process that fires the task, so there is less risk of interaction with other process, easy to implement and more robust compared to other solution.

Oh yes, totally. I've seen this same pattern in Ruby before whereby a job enqueue is put on something like an ActiveRecord `after_commit` hook.
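
A minimal sketch of that pattern (the model and job names here are just illustrative):

  class User < ActiveRecord::Base
    # Enqueue only once the surrounding transaction has committed, so a
    # rollback never leaves an orphaned job behind.
    after_commit :enqueue_invite_email, on: :create

    private

    def enqueue_invite_email
      SendInviteEmailJob.enqueue(id)   # hypothetical job class
    end
  end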

One (overly) pedantic observation is that this still leaves you with the possibility of having your transaction and data safely committed in your database, but with your job not enqueued if something happens to your process between the time of commit and time of enqueue. Admittedly though, this probably doesn't happen all that often in real life.

Probably the best answer I have is that we take this approach for the sheer convenience. We can do things like this:

  def create_user(email, password)
    User.transaction do
      user = User.new email: email
      user.set_password_hash(password)

      check_for_abuse!

      # assign an account ID and queue a job to create the record in the billing system
      user.billing_account_id = uuid()
      async_create_account_in_billing_system(user)

      # queue job to send an e-mail invite
      async_send_invite_email(user)

      create_auditing_record

      ...    
    end
  end

You can build out very explicit chains of actions that may enqueue jobs or call into other chains that may enqueue their own jobs, and all the while never have to worry about any kind of ordering problem while writing any of it. There are no hidden callbacks anywhere, but you still get to keep perfect correctness: if anything fails at any point, the whole system rolls back like it never happened.


The observation on failure modes is correct. When you process thousands of tasks every day, even low-probability problems happen. What bugs me about the transaction approach is that you lose all the information (except probably for logs). In your example, if I sign up for your service and for any reason the process fails (maybe after I've received confirmation that everything is OK), I will end up having no record stored for my account. This is problematic for post-mortem diagnosis, for customer support (although it's easy to ask the customer to redo the signup), and so on. Imagine you are handling payment with an external provider (e.g. PayPal): I could end up being billed without a trace on your system that I've paid. I'm not saying that my approach is correct. I'm saying that assuming transactions will solve all your problems discounts the fact that your system, to be useful, has side effects (not using the term in a strict way) on your customers and possibly on systems outside your organization. I prefer to plan to actively manage the possible failure modes and to quickly detect and correct anomalies. For this reason, inconsistent data is sometimes better than no data.

Another point on the database approach that I hadn't thought of before: in the past we designed a system that stored and updated user timelines on MySQL. It has always been a nightmare. SQL databases are not designed with a high ratio of write operations in mind: the indexes quickly get fragmented, even deleting entries doesn't immediately shrink disk consumption, etc. I don't see this being an immediate problem for the queue use case, but as your service grows there's a risk of hitting scalability problems. Obviously you could react like we did: use a bigger database, use better disks, create cleanup jobs. But that's bad software engineering, and you're just buying more time.


I actually wrote about this problem recently [1]. (Note that in that comment I'm not describing a job system, but rather a system for keeping one or more secondary stores in sync with a master database.)

If you're using an external queue such as RabbitMQ, there's basically no way to guarantee a queue message being fired without a "two-phase" transactional approach.

Let's say your use case is creating a user and sending an email verification email. You need to:

1. Start transaction.

2. Create the user.

3. Insert a tuple into an "events to fire" table.

4. Commit.

5. Fire the event.

6. Delete the tuple.

This ensures that if the transaction fails, nothing is fired; if the transaction succeeds, we guarantee that the user now exists, and that the command to fire the event also exists. Thus, if step 5 fails, we still have the "events to fire" tuple, and we can have a cron job running every minute that picks up stragglers. (In practice, you'll need to run steps 5-6 in a transaction that exclusively locks the tuple so that a concurrent cron job doesn't do it.)
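
A rough sketch of those steps in Ruby, assuming a hypothetical `EventToFire` model and a hypothetical `publish_to_rabbitmq` helper:

  def create_user_with_event(email)
    event_id = nil

    # Steps 1-4: the user row and the "event to fire" row commit together,
    # or not at all.
    User.transaction do
      user = User.create!(email: email)
      event_id = EventToFire.create!(
        kind:    "user.created",
        payload: { user_id: user.id }
      ).id
    end

    # Steps 5-6: fire the event, then delete the tuple. The row lock keeps a
    # concurrent cron sweeper from firing the same event twice.
    EventToFire.transaction do
      event = EventToFire.lock.find_by(id: event_id)
      next if event.nil?               # already handled by the sweeper
      publish_to_rabbitmq(event)       # hypothetical publisher
      event.destroy
    end
  end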

This way, you're risking duplicate events, but you'll never lose the event, unless RabbitMQ falls over. Duplicate events are of course unfortunate and undesirable, but missing events are worse.

Of course, with the above solution, it could be argued that your "events to fire" table is a queue itself. The good news is that RabbitMQ is much better at scaling a queue than Postgres, so the duplication of work isn't actually that much of a problem. The "events to fire" table can be optimized in various ways; Postgres supports "unlogged" tables, for example.

Another approach which will be available soon is Bottled Water [2], which gives you pretty much the above, except without needing to maintain the "events to fire" table. But it's definitely more complicated.

[1] https://news.ycombinator.com/item?id=9490583

[2] http://blog.confluent.io/2015/04/23/bottled-water-real-time-...


Unlogged tables aren't durable and might lose events even after a successful commit. If you're willing to make that trade off, then you don't need Postgres in the first place.


Sure, which is why I said it would be an optimization. But if your PostgreSQL server dies in the middle of that transaction, I would be more worried about RabbitMQ.


Good article!

I am currently implementing a project in which we use Postgres to track job state (eg., run status, failures, timings, resource usage, related log entries), but Kafka as the actual queueing mechanism -- thus bypassing the challenges mentioned in the article but still getting the best of Postgres.

This way we have complete, introspectable, measurable history about every queue item. It greatly simplifies the Postgres part of it (state updates are always appends, no locking) and thanks to Kafka, increases performance and scalability.

It also adds a measure of safety: We can detect "lost" jobs that disappear because of data failure, bugs in Kafka, failing clients etc. We know that if a job was never logged as "complete", it probably died.

The job log also functions as an audit log, and we also intend to use it for certain jobs that benefit from being incremental and from being able to continue from where they last left off (for example, feed processing).
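
A very small sketch of the append-only idea (the table, model, and helper names here are invented, not our actual system):

  # Every state change is a new row, so there are no UPDATEs to contend on.
  class JobEvent < ActiveRecord::Base
    # columns: job_id, state ("enqueued" / "started" / "complete" / "failed"),
    #          recorded_at, details
  end

  def record_state(job_id, state, details = {})
    JobEvent.create!(job_id: job_id, state: state,
                     recorded_at: Time.now.utc, details: details)
  end

  # Anything "started" before the cutoff with no terminal row is a candidate
  # "lost" job.
  def lost_job_ids(cutoff)
    started  = JobEvent.where(state: "started")
                       .where("recorded_at < ?", cutoff).pluck(:job_id)
    finished = JobEvent.where(state: %w[complete failed]).pluck(:job_id)
    started - finished
  end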


Interesting approach!

I'd be curious to hear about the mechanic that you came up with for division of labor among workers — since every client is essentially reading the same stream, I guess you'd have to distribute jobs based on job_id modulo worker_number or something like that?


I suppose one could use a round-robin sharding approach like you mention, but it goes against Kafka's design, and it's not necessary.

Kafka divides a queue into partitions. Each partition is a completely independent silo. When you publish messages, Kafka distributes them across partitions. When you read messages, you always read from a partition.

This means partitions are also the unit of parallelism: You don't want multiple workers on a single partition (because of the labour division problem you mention). Rather, Kafka expects you to have one partition per worker.

This is more elegant than it sounds if you're coming from something like RabbitMQ. Partitions (ie., queues) in Kafka are append-only and strictly linear; unlike RabbitMQ, you can never "nack" a message in a way that results in the message ending up at the back of the queue and thus violating the original message order. Rather, Kafka expects each consumer to maintain its "read position" in the queue. Failure handling, then, is simply a matter of winding back the read position. And unlike RabbitMQ, there's less need for complicated routing, dead-letter exchanges and so on, because rather than move messages around, you're just moving a cursor.
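
In pseudocode-ish Ruby, the cursor idea looks roughly like this (the consumer interface is purely hypothetical, not a real Kafka client API):

  offset = load_saved_offset(partition)   # hypothetical persistence helper
  loop do
    message = partition.read(offset)      # hypothetical read-at-offset call
    break if message.nil?
    begin
      process(message)
      offset += 1                         # advance the cursor only on success
      save_offset(partition, offset)
    rescue => e
      # Failure handling is just not advancing (or winding back) the cursor.
      log_error(e)
      sleep 1
    end
  end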

Of course, message order is only preserved within a single partition; if you publish messages A, B and C and you have 3 partitions and 3 workers, then in the real world, messages may be processed in the order C, B, A. That sounds bad, but then other queue solutions such as Que or RabbitMQ suffer from the exact same problem: If you run 3 workers against one queue, your queue may supply each worker with messages in the right order, but there's no guarantee that they will be processed in that order. The only way to guarantee ordering is to have just one worker per queue, using some kind of locking (RabbitMQ does support "exclusive" consumers). But then you don't get any parallelism at all. So I think Kafka's solution is quite sane, even if it's more low-level and less developer-friendly than AMQP.


I'd be curious to hear what the general community thinks of putting a job queue in a database and if there are a lot of other QC/Que users out there.

FWIW, the transactional properties of a Postgres-backed queue were so convenient that we took advantage of them for a long time (and still do) despite the fact that they have a few caveats (e.g. the degraded performance outlined in the post), but more recently there's been a bit of a shift towards Sidekiq (probably because it's generally very problem-free and has some pretty nice monitoring tools).

(Disclaimer: I authored this article.)


Great article. We've just moved a large portion of our async jobs over from Resque to Que because of the huge benefits you get from transactional enqueuing and processing. Performance seems good so far, and if it really becomes an issue, running two queuing systems side by side (one transactional, one high-throughput) seems viable.

We're super cautious about long-running transactions anyway, as they cause a load of other issues (e.g. http://www.databasesoup.com/2013/11/alter-table-and-downtime... - full blog post coming soon!)


We are using Postgres as our queue backing store. I tried switching to Sidekiq but ran into issues (read here https://github.com/mperham/sidekiq/pull/624). Fortunately our job throughput is small enough to not hit any scaling issues with Postgres, so I stuck with that because of my confidence and experience w/Postgres over the years. The issues I ran into on Sidekiq just made me skeptical of their architecture/code maturity, though that was several years ago and it may be much improved by now.

We use JQJobs (which we authored) to manage queueing and it's architected such that it could be ported to Redis or some other better backing store, or potentially even to QC/Que, which I wasn't aware of until your article (so thanks for that!).


Ah, nice, thank-you!

> Fortunately our job throughput is small enough to not hit any scaling issues with Postgres, so I stuck with that because of my confidence and experience w/Postgres over the years.

I think we're in a pretty similar situation. For what it's worth, I think that a queue in PG can scale up about as well as Postgres can, as long as you keep an eye on the whole system (watch out for long-lived transactions and the like).
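
For example, one cheap way to keep an eye on long-lived transactions is to poll pg_stat_activity (column names as of Postgres 9.2+):

  rows = ActiveRecord::Base.connection.select_all(<<-SQL)
    SELECT pid, state, now() - xact_start AS xact_age, query
    FROM pg_stat_activity
    WHERE xact_start IS NOT NULL
    ORDER BY xact_age DESC
  SQL

  rows.each do |r|
    # Alert on anything that's been inside a transaction for more than a few
    # minutes.
    puts "#{r['pid']} #{r['state']} #{r['xact_age']}"
  end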


I use RQ these days. It's a really simple Python / Redis solution. We had been using Celery / RabbitMQ, but it's really bad with small numbers of long-running jobs (each worker will take a task and reserve another even though it can't start on it yet). For us that was a killer, since we had jobs that could take 10 minutes to complete.

RQ has been good to us so far. There's a simple dashboard for it that works well enough. After messing around trying to find my data in rabbitmq it was a real relief to be able to query a simple set of redis keys.


You may be interested in my comment here: https://news.ycombinator.com/item?id=9578787.


This was a well written article with an interesting investigation.

However, storing small, ephemeral messages like jobs in a queue within Postgres is a bad idea and the pain far outweighs the benefits of transactional rollback for jobs.

Instead, a much simpler solution is to plan for jobs to run at least once, use a more appropriate datastore like Redis or RabbitMQ, and build in idempotency and error handling at the job layer.

Postgres used as a system of record shouldn't be used for ephemeral message queues.


Well, it's not necessarily that simple. It can be very interesting to be able to directly enter jobs into a queue in a transactional manner, with very low latency. Say from a trigger.
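
For instance, a sketch like this (the jobs table and its columns are hypothetical, not Que's actual schema) lets an INSERT on orders enqueue work from within the same transaction:

  ActiveRecord::Base.connection.execute(<<-SQL)
    CREATE OR REPLACE FUNCTION enqueue_order_job() RETURNS trigger AS $$
    BEGIN
      -- The job row commits or rolls back together with the order row.
      INSERT INTO jobs (job_class, args) VALUES ('ProcessOrder', to_json(NEW.id));
      RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER orders_enqueue_job
      AFTER INSERT ON orders
      FOR EACH ROW EXECUTE PROCEDURE enqueue_order_job();
  SQL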



If you're concerned about failures resulting in a rollback, you're usually better off delaying job enqueueing to after-commit hooks.


Doing anything after committing misses the point. If you can do that, you don't need Postgres.

For example, suppose you mark an account closed, commit, and then enqueue an event to issue a refund from another system. It's possible that your process may crash or be killed at just the wrong time leaving you with a closed account but no refund.

So what if you enqueue the event before you commit? In that case, you might crash before committing which will automatically rollback. Now you've done a refund on a non-closed account.

Transactions make it trivial to guarantee that either both happen or neither do. There are other ways to get that guarantee, but they require more work and are more error prone.
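
A small sketch of the transactional version, using a Que-style job class (the model and job names here are made up):

  class IssueRefundJob < Que::Job
    def run(account_id)
      # Talk to the external refund provider here; the queue retries on failure.
    end
  end

  def close_account_with_refund(account_id)
    Account.transaction do
      account = Account.lock.find(account_id)
      account.update!(status: "closed")
      # The job row is inserted in the same transaction as the status change,
      # so either both happen or neither does.
      IssueRefundJob.enqueue(account.id)
    end
  end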


Were you using SERIALIZABLE isolation level in your application, as one code example in your article seems to show? Would using READ COMMITTED level have made a difference?


The long-lived transaction that keeps the deleted jobs from being reclaimed - is that process working with the job queue table, or is it working with other tables in the database?


Even if the long-running transaction hasn't accessed the job queue table yet, it might do so before it completes. Postgres needs to keep the "dead" tuples accessible as long as there are active transactions that started before the deletion.
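
You can watch this happening through the statistics views, for example:

  # n_dead_tup shows how many dead tuples Postgres is still carrying for each
  # table; with a long-running transaction open elsewhere, a hot queue table
  # will show this number climbing even while autovacuum runs.
  rows = ActiveRecord::Base.connection.select_all(<<-SQL)
    SELECT relname, n_live_tup, n_dead_tup
    FROM pg_stat_user_tables
    ORDER BY n_dead_tup DESC
  SQL

  rows.each { |r| puts "#{r['relname']}: #{r['n_dead_tup']} dead tuples" }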


Right, that's what I thought. Is there value in giving the job queue its own database, then, or are they taking advantage of the fact that job modifications can be included in transactions that involve other tables?



