Asynchronous Processing in Web Apps, Part 1: A Database Is Not a Queue (gomiso.com)
59 points by nesquena on Nov 15, 2012 | hide | past | favorite | 27 comments


With a traditional database this typically means a service that is constantly querying for new processing tasks or messages.

Traditionally, but not necessarily; PostgreSQL supports the LISTEN and NOTIFY commands for asynchronous notifications, without polling.


I agree, and if polling were the only consideration then this would mitigate the issue. Admittedly I didn't cover this as much in my article, but there's also a lot of other flexibility afforded by a good message queue around delivery strategies, consumption strategies, error handling, etc., that you would have to build much more manually and painfully trying to shoehorn it into a database (even a good one). I am not saying PostgreSQL's notify commands are not useful for certain tasks, just encouraging people to understand the tools available and pick the right one for the job.
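To make the polling-vs-notification distinction concrete, here is a small stdlib-Python analogy (not actual LISTEN/NOTIFY wire traffic): the consumer blocks until a message arrives instead of repeatedly querying the store for new work.

```python
import threading
import queue

# A minimal analogy for LISTEN/NOTIFY vs. polling, using a stdlib
# queue: the consumer blocks efficiently until woken, rather than
# running a sleep-and-SELECT loop against the database.
messages = queue.Queue()
results = []

def consumer():
    msg = messages.get()   # analogous to LISTEN + wait; no busy loop
    results.append(msg)

t = threading.Thread(target=consumer)
t.start()
messages.put("task-42")    # analogous to NOTIFY waking the listener
t.join()
print(results)             # ['task-42']
```

The real Postgres feature works the same way in spirit: the backend wakes blocked listeners on NOTIFY, so no polling interval needs tuning.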


Also, Axiom, my favorite little Python object layer over SQLite, has native scheduler support. Whenever loading a store (well, starting the store's scheduler service), all things-to-happen are scheduled to run using Twisted's event loop. And, well, event loops are very good at efficiently waiting for things to be done in the future :)


queue_classic is a postgresql job queue that can use listen/notify. http://www.postgresql.org/docs/9.1/static/sql-notify.html

Although queue_classic doesn't use listen/notify by default, and I'm not sure why. I think it should, it seems to perform better.


1) The article claims that a queue is a replacement for a database and thus solves all the problems of the DB solution. In reality, if you need reliability and data integrity guarantees, then a message queue system will be using a DB under the hood (in the best case an off-the-shelf DB, in the worst something homegrown). All the same problems and considerations apply: you run into the same DB problems, the need to manually optimize queue tables, etc., because message queues rarely do a good job there. And of course, all the commit issues discussed above apply too.

2) The issue with polling is easily solved by using modern DBs (e.g. Postgres with listen/notify) or just plain old triggers/stored procedures. Yes, it will be a custom solution, but it will be simpler from an operational perspective than a "generic" do-it-all solution.

3) Lastly, it all depends on the task. At WePay we use gearmand backed by a DB. We have seen it work really well, helping us handle 1000x spikes over normal load w/o a single problem. However, we did quite a few modifications to the "stock" code to customize it to our use case and have 200% data integrity and reliability guarantees.


I understood the point as "Don't use a database as a replacement for a queue". Use the right tool for the job.

The fact that gearmand is backed by a DB is not at all the same as using a DB as a queue directly. Gearmand just uses the database as a backup from which it can reload tasks when it gets restarted.


Yes. However, you have the same issues with DB tables receiving tons of inserts/deletes/selects. This is the worst possible DB load :)


DBs are designed for this purpose.


I dunno. I've seen companies go through 4 or 5 different message queue systems and find that none of them work quite right.

12 years or so ago I developed a few systems that used qmail as a message queue and I was pretty happy with that.

Asynchronous processing is a necessary evil, but I think a lot of people underestimate the difficulty. It's one thing to compress a video in the background, but if you have one asynchronous task that spawns a bunch of asynchronous tasks, and they spawn asynchronous tasks, and someday they all come together... well, maybe they come together someday. There's a definite "complexity barrier" you hit where asynchronous applications rapidly become harder to maintain.

There are ways around this, but I've frequently seen MQ-based systems that never get "done".
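The fan-out/fan-in shape described above can be sketched in a few lines of asyncio (names and structure are purely illustrative): one task spawns children, the children spawn grandchildren, and everything must eventually be joined. Even this toy version hints at how fast the coordination surface grows when each level is a real job on a real queue.

```python
import asyncio

async def grandchild(n):
    await asyncio.sleep(0)   # stand-in for real background work
    return n

async def child(n):
    # Each child fans out to two grandchildren and joins them.
    parts = await asyncio.gather(grandchild(n), grandchild(n + 1))
    return sum(parts)

async def parent():
    # The parent fans out to three children, then fans back in.
    return sum(await asyncio.gather(child(0), child(2), child(4)))

total = asyncio.run(parent())
print(total)  # 0+1 + 2+3 + 4+5 = 15
```

In-process, the event loop does the joining for you; spread the same shape across queue workers and you have to track completion, retries, and partial failure yourself, which is where the complexity barrier bites.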


I believe this is the problem that Storm was supposed to solve ( https://github.com/nathanmarz/storm ). Or Amazon's SWF ( http://aws.amazon.com/swf/ ). They work in the case where you can define the flow between all of the tasks.


I think the article is good, however, I feel compelled to weigh in from the countervailing general direction:

I am a message queue skeptic. Not that they should never be used, but rather I have a general feeling that complex, dedicated message queue software is often deployed for engineering problems two or three orders of magnitude too small for it to deliver value. And, for most projects, queue replacement later is not too difficult, especially if one's use of a database-backed queue is relatively naive.

As such, I will -- with an open mind -- suggest that general purpose database management software masquerading as a queue is not an outright antipattern, and that the situations where using it this way makes sense are probably more common than the opposite, where a dedicated queue delivers clear value.

Here are my main reasons for thinking that, which almost entirely have something to do with being able to address one's queue and one's other data in the same transaction:

* Performance: a pedantically unsound but basically reasonable rule of thumb (as experienced by me) would suggest that one needs to be processing somewhere between hundreds and thousands of messages per second, with at least fifty (and maybe up to a hundred or two) parallel executors processing or emitting messages, before the lower constants of a dedicated queuing package become attractive. Below this kind of throughput, a dedicated queue starts to bring more pain than gain.

* Correctness: Most database + queue integrations do a lousy job of what is effectively two-phase commit between different data storage instances (so that would include two+ RDBMSes, even if they are of the same kind, e.g. 2xPostgreSQL), and frequently the queue has to be counted among these (exception: when the queue contents can be lost/can be rebuilt/is idempotent at all times). Systems do a lousy job of making this work because it's pretty finicky to do a good job in many situations, i.e. expensive and time consuming.

* Constants, when dealing with other systems: When one does do a good job and has interesting requirements in the 'correctness' case, it often means doing forms of two-phase commit, whether explicitly supported by the system (e.g. PREPARE TRANSACTION) or a spiritual equivalent via carefully thought out state machines. In principle these could be relatively cheap, but typically to avoid complexity more expensive approaches are employed, such as a couple extra UPDATE requests to poke at some home-grown state machine.

Also, my experience indicates that as systems evolve, there will be inevitable bugs in these state machines that, by nature, span systems. Be vigilant and make sure you get more value than pain, and try to avoid having too many of them.

* HA is still hard: clustering is generally possible in principle, but make sure you read the fine print. For example, many people use Redis as a queue, but it is really not unlike any other monolithic database -- its main draw as a vanilla queue is good execution-time constants. The same could be said of Apache ActiveMQ in its least byzantine configuration. One might think that one would get a lot of leverage 'for free' given the simpler semantics of queues vs the diversity of access methods in most general purpose databases, but so far I have not seen that to be the case, for the very good reason that a lot of people expect a lot of reliability out of their queues (no less than the transactional guarantees of some databases), and delivering that is either most natural in a monolithic system, or else slow, complicated, or both in a multi-master distributed system.

All in all, if you think you need dedicated queuing software to send a few dozen emails a second (that's a lot of email for most people!), think twice: it might still be a good idea, but brace yourself for these pitfalls or convince yourself that they probably mostly don't apply to you.
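The correctness point above -- being able to touch the queue and the rest of one's data in a single transaction -- is easy to show with stdlib SQLite (table and column names here are hypothetical). If the job insert fails, the order row rolls back with it; there is no two-phase commit between separate systems to get wrong.

```python
import sqlite3

# Sketch of "queue and data in the same transaction": both writes
# commit or roll back together, with no cross-system coordination.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, kind TEXT, order_id INTEGER)")

with conn:  # one transaction covers both the data and the enqueue
    cur = conn.execute("INSERT INTO orders (total) VALUES (?)", (19.99,))
    conn.execute(
        "INSERT INTO jobs (kind, order_id) VALUES (?, ?)",
        ("send_receipt", cur.lastrowid),
    )

print(conn.execute("SELECT kind FROM jobs").fetchone()[0])  # send_receipt
```

With a separate broker, the equivalent guarantee requires two-phase commit or an outbox-style state machine -- exactly the finicky, bug-prone machinery the parent comment describes.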


Initially this might seem different from the topic at-hand, but bear with me please:

Several years ago (probably first or second year of high school) I wrote and distributed amongst my friends a small desktop game. The program e-mailed me their new high scores and internal game diagnostics. I found that this was too often, so I rewrote the game to send these data daily. This way of retrieving data sucked: often, I would get empty or unimportant messages.

Eventually I had the realisation that time-oriented polling was the wrong philosophy. Instead, there should be a "this event has happened now" trigger (e.g. 500kb of diagnostics generated) that calls the polling routine/invokes the 'stuff to be done traverser'. If I understand the problem correctly, in the mass e-mail example you could keep a simple counter as part of the HTTP request. When this counter hits a threshold you determine (e.g. there are 500 jobs to do), call the traverse/processing code:

  if jobs_to_do > thresh then invoke some async processing

The takeaway from this post is that, in my opinion, time-based (cron job-ish) polling algorithms are inefficient and can be replaced with rudimentary event-driven ones.
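The counter-plus-threshold idea above can be sketched in a few lines (all names are illustrative; a real implementation would persist the counter in the DB or a cache rather than in process memory): processing fires the moment the backlog crosses the threshold, not on a timer.

```python
# Event-driven batching: each request bumps a counter, and when the
# backlog crosses THRESHOLD the batch is processed immediately --
# no cron tick, no polling interval to tune.
THRESHOLD = 500

def record_job(pending, process):
    pending.append("job")
    if len(pending) >= THRESHOLD:   # the "this has happened now" event
        process(pending[:])         # hand off a snapshot of the batch
        pending.clear()

processed = []
pending = []
for _ in range(1200):               # simulate 1200 incoming requests
    record_job(pending, processed.extend)

print(len(processed), len(pending))  # 1000 200: two full batches fired
```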

And just BTW, 'countervailing' - I'm deifying this word. Holy s*. I'm going to use it all the time.


Thanks for your feedback, I appreciate the thoughtful response. I actually agree that generalized message queues can often be complex and perhaps even unnecessary when dealing with asynchronous processing at a small scale depending on your needs.

I think the important thing is to understand your requirements, the volume of jobs, etc. In my series, I also plan to introduce much simpler lighter work queues that are a perfect medium between a 'heavy duty' generalized message queue and trying to wedge a queue into a database.

But as with everything, people should evaluate the available options for themselves. My goal is just to provide people with a framework for understanding the tradeoffs.


Interestingly, that's why Microsoft created SQL Server Service Broker. They had the infrastructure for reliable message queuing and transactional support, so they exposed it as Service Broker.

Not sure it ever took off though...


Unfortunately it didn't really take off, which is a shame because it's a good, polished and complete implementation that allows for some advanced scale-out topologies (it can also be used as a foundation for data-dependent routing and map/reduce scenarios).

<rant>I guess it's not much used because people like to reinvent the wheel every time (by manually implementing queues using tables with all the traditional concurrency problems) instead of learning something a bit more complex.</rant>

Anyway, it's not going to be thrown away anytime soon as it's used in other parts of the engine (e.g. SMTP mail integration, Server Events and Query Notifications).


Nice article. One thing you should talk about, in my opinion, is systems like Redis. Redis can be used as a generalized key-value store, but it can also be used as a messaging platform. In fact many systems like Storm (which would be another great topic) have easy integration with Redis pub/sub. While Redis and solutions like it are probably not a good fit for all your data, they are great for mixed supporting data, caching and messaging. There are also really nice integrations if you like Java; these days you can make Spring message-driven beans to consume messages from a Redis pub/sub very easily with minimal configuration. ZeroMQ is probably another technology that is worth discussing, either using it with a layer like Storm on top or by itself. Not every messaging system has to be heavyweight and cumbersome like the good ole' days.


Definitely, thanks for your feedback. Will be bringing up redis in the next post. Redis has some messaging functionality baked right in and can be a good solution as seen with https://github.com/defunkt/resque in Ruby as well. ZeroMQ is also an important technology to understand in the context of this discussion.


Look forward to reading the future articles. Right now I'm working on a side project that has a need for some asynchronous tasks. I was planning on using beanstalkd but the one thing that concerns me is that if the queue goes down the outstanding jobs are not persisted. Any recommendations on the best way around this?


Yes, beanstalkd has solid persistence support in later versions. You can use the "-b" option, and beanstalkd will write all jobs to a binlog. If the power goes out, you can restart beanstalkd with the same option and it will recover the contents of the log.


beanstalkd -b should work. I haven't tried this.


Nice clear introduction.

You can write a simple Async DB system without running into deadlocks. Deadlocks tend to occur when you have 2 processes trying to lock 2 different resources in differing orders. Not a case you'll run into here.

Also, you could do something like (correct me if I'm wrong, but I think it should be safe):

  update queue set owner = '1_time_use_rand_key' where owner is null order by id limit 1;
Not that I'm advocating this, you should definitely use a proper system to manage these types of tasks! :)
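The claim-with-a-token pattern above can be sketched with stdlib SQLite (the schema is hypothetical). Note that `UPDATE ... ORDER BY ... LIMIT` as written in the parent comment is MySQL-specific; PostgreSQL and SQLite need a subquery form like the one below.

```python
import sqlite3
import uuid

# One worker claims the oldest unowned row by writing a one-time
# random token into it, then reads back the row it owns by token.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE queue (id INTEGER PRIMARY KEY, owner TEXT)")
conn.executemany("INSERT INTO queue (owner) VALUES (?)", [(None,)] * 3)

token = uuid.uuid4().hex  # the '1_time_use_rand_key'
with conn:
    conn.execute(
        "UPDATE queue SET owner = ? WHERE id = "
        "(SELECT id FROM queue WHERE owner IS NULL ORDER BY id LIMIT 1)",
        (token,),
    )

claimed = conn.execute("SELECT id FROM queue WHERE owner = ?", (token,)).fetchone()
print(claimed[0])  # 1 -- the oldest job, now owned by this worker
```

The single UPDATE is atomic, which is what makes the claim safe against concurrent workers; under real concurrency you would still want to think about lock contention and crashed workers that never release their token.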


If you would consider using NoSQL, then MongoDB might be a good fit. It has a message queue of sorts: capped collections with tailable cursors (http://www.mongodb.org/display/DOCS/Tailable+Cursors ). It's persistent, you don't need polling, and you don't need to remove processed messages.


disclaimer: co-author of NSQ [1] here

Agreed. Message queues play an important role for us (bitly) in being a layer of fault tolerance, buffering, and a means to perform various operational tasks.

They're so important to us that we decided to build something that worked exactly like we wanted.

NSQ is a realtime distributed message processing system where we've taken the approach of focussing on making it ops friendly and easy to get started.

IMO, solutions that make it easy to develop on and administrate are most important... because things break. NSQ is straightforward to deploy (limit dependencies), simple to configure (runtime discovery), and client libraries provide a lot of functionality important for handling failures (like backoff, deferred messages, etc.) for a variety of use cases.

We've written an in-depth introductory blog post [2] about NSQ that has more details.

[1]: https://github.com/bitly/nsq [2]: http://word.bitly.com/post/33232969144/nsq


Interesting, thanks for sharing the links. I think building your own message queue is sometimes the only way to get things working exactly as you might want. For people looking for a highly configurable framework for building a tailored MQ, be sure to check out http://www.zeromq.org/ as a basis.


An in-process (or out-of-process) async driver exposing an API with future semantics can provide DB-specific queueing to some extent. MQs and MOMs are very useful, but not always necessary, and the additional latency from extra node hops is sometimes not acceptable.
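A minimal sketch of the future-semantics idea above, using only the stdlib (the query and pool size are illustrative): DB work is submitted to a pool and the caller gets a future back, giving asynchronous DB access with no broker and no extra network hop.

```python
from concurrent.futures import ThreadPoolExecutor
import sqlite3

# An "async driver" in miniature: callers submit DB work and receive
# a future, blocking only when they actually need the result.
def run_query(sql):
    conn = sqlite3.connect(":memory:")
    return conn.execute(sql).fetchone()[0]

pool = ThreadPoolExecutor(max_workers=4)
future = pool.submit(run_query, "SELECT 1 + 1")

# ... the caller can do other work here ...
print(future.result())  # 2
pool.shutdown()
```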


Great post! This is what I wish I had as a starting point when I was learning how to get web apps to do work smarter, rather than diving straight into the Celery documentation because someone told me that would be the way to go!


Thanks! That's exactly why I am writing this series. I plan to take people slowly through everything about asynchronous processing, message queues, handling job processing, etc. Message queues are underused in modern web apps, and I think also not understood as well as they should be. Planning to include plenty of diagrams along the way to help with that.



