
Ask HN: What's a good, open source, distributed, worker queue? - lzw
I'm working on a big data application, and need to schedule and distribute jobs around a cluster.  I'm looking for an open source solution to avoid writing my own.<p>This will be running on a small cluster of machines. These machines will already be running CouchDB, and so I have map/reduce set up to process the data as it comes into the database.  This queue is to schedule other tasks, such as crawling the web, going to third party APIs and pulling down some data, etc.<p>I'd like the queue to be transactional so that if a worker dies before finishing the task, the task remains on the queue and another worker will pick it up.  It is OK with me if I have to be responsible for making sure the tasks are idempotent.<p>I'd like the queue to be replicated in the cluster so that if one node happens to go down, we don't lose a chunk of the tasks from the queue.  I'd rather the occasional task be done twice than any task ever be forgotten.<p>Each node and each process will be both adding things to the queue and taking things off to process.<p>If these things come set up to assume a specific language for the worker processes, then JavaScript or Python are the preferred ones.<p>Would prefer it to be relatively lightweight, and require nearly zero administration.<p>Part of the reason I'm asking is that I think I might be using the wrong terminology to try to find this. I found Disco (http://discoproject.org/). I could probably make Disco do what I want, but I have map and reduce covered in CouchDB already. I need "go FTP this zip file, uncompress it, and then run it through this Python script" kinds of tasks to be scheduled.<p>Thanks in advance for advice!
======
bjpirt
I'm a big fan of RabbitMQ ( <http://www.rabbitmq.com> ). It's an AMQP message
queue and has a bunch of very nice features like:

\- Persistent queues

\- Clustering

\- Topic based queues for an extremely flexible syndication model

\- Transactional (acknowledged) queues

\- Lots of adapters in different languages

\- Fast, handles thousands of messages/sec

It's written in Erlang, so if you're already running CouchDB you can reuse any
Erlang-specific operational knowledge.
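The acknowledged-queue model above can be sketched in Python; this is a minimal, hedged example assuming the pika client (queue name and worker function are made up). A message leaves the queue only when the worker acks it, so a worker that dies mid-task leaves the job to be redelivered:

```python
# Sketch of RabbitMQ's acknowledged-queue model, assuming the pika client.
# A message is removed from the queue only once the worker acks it.

def handle_delivery(channel, delivery_tag, body, do_work):
    """Run one job; ack on success, otherwise requeue it for another worker."""
    try:
        do_work(body)
    except Exception:
        # Return the message to the queue so another worker can pick it up.
        channel.basic_nack(delivery_tag=delivery_tag, requeue=True)
    else:
        channel.basic_ack(delivery_tag=delivery_tag)

if __name__ == "__main__":
    import pika  # guarded so the helper above is importable without pika
    conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    ch = conn.channel()
    ch.queue_declare(queue="tasks", durable=True)  # persistent queue
    ch.basic_consume(
        queue="tasks",
        on_message_callback=lambda c, method, props, body: handle_delivery(
            c, method.delivery_tag, body, print))
    ch.start_consuming()
```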

Replicated queues are tricky to set up with any queuing system, but if you're
looking to a) not lose any messages and b) have high availability then the
simplest thing is just to run 2 independent but identically configured queues
in parallel which are configured to be persistent and configure the workers to
pull jobs off each queue in turn. If one of them goes down you can restart it
and recover the lost jobs whilst the other queue continues.

Alternatively you can write to both queues and deduplicate the messages when
you pull them off. Beetle ( <http://xing.github.com/beetle/> ) takes this
approach and uses Redis to manage the deduplication.
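The write-to-both-queues approach hinges on a shared deduplication step. A minimal sketch, with a plain in-memory set standing in for the Redis store Beetle actually uses:

```python
# Beetle-style deduplication when the same job is published to two redundant
# queues: each message carries a unique id, and a shared store records ids
# already handled. Beetle keeps this state in Redis (with expiry); a plain
# set stands in here for illustration.

class Deduplicator:
    def __init__(self):
        self.seen = set()

    def should_process(self, message_id):
        """Return True the first time an id is seen, False for duplicates."""
        if message_id in self.seen:
            return False
        self.seen.add(message_id)
        return True
```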

~~~
ifesdjeen
Agreed! When using RabbitMQ, 0MQ, or even Redis as a queue, it's possible to
build great infrastructure around them, and you're not tied to any particular
worker library or launcher stack. You simply have a queue, and you can
distribute it and get it working the way you want.

------
jasonkester
You might want to check out Amazon SQS for this. It has all the advantages
you're looking for (transactional, guaranteed not to lose messages, zero
administration, crazy-simple API with solid client libraries in every
conceivable language), with the only possible downside being that it's hosted
externally.

It's priced so cheaply as to be essentially free, though you'll need to give
it a credit card so that it can bill you eleven cents a month or whatever.

<http://aws.amazon.com/sqs/>

I've been using it for a few years now with good success.
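A sketch of the receive/delete loop SQS expects, written against the modern boto3 client (an assumption; the queue URL and handler are placeholders). Deleting only after the handler succeeds means a crashed worker's message reappears once its visibility timeout expires:

```python
import json

def poll_once(sqs, queue_url, handler):
    """Receive a batch of SQS messages, run the handler, and delete each
    message only after it is processed, so failures get redelivered."""
    resp = sqs.receive_message(QueueUrl=queue_url,
                               MaxNumberOfMessages=10,
                               WaitTimeSeconds=20)  # long polling
    handled = 0
    for msg in resp.get("Messages", []):
        handler(json.loads(msg["Body"]))
        sqs.delete_message(QueueUrl=queue_url,
                           ReceiptHandle=msg["ReceiptHandle"])
        handled += 1
    return handled

if __name__ == "__main__":
    import boto3  # guarded so the helper stays importable without boto3
    client = boto3.client("sqs")
    poll_once(client, "https://queue.example/jobs", print)  # placeholder URL
```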

~~~
cloudkj
+1 for SQS. It pretty much matches all your criteria to a tee. The replication
guarantees messages won't be lost, and will occasionally result in duplicate
messages.

I think another potential downside is that the messages get purged after 14
days.

------
huwshimi
You could have a look at <http://celeryproject.org/>

I've not used it so I'm not entirely sure it'll cover your needs, but I keep
hearing good things about it.

~~~
matclayton
We use rabbitmq and celery, both awesome projects. And both teams are always
on irc if you need help.

------
chrismsnz
I've worked mostly with Gearman (made by Danga who also created memcached,
mogilefs and other awesomesauce).

It's stable, fast and has API libraries for most languages which can create
clients and workers. You may have to do some configuration/customisation to
get it to do exactly what you're after but it's a great place to start.

<http://gearman.org/>

~~~
lzw
Maybe I'm missing something, but it looks like each Job Server is a separate
queue? Not sure what happens if one of the job servers dies then.

It does handle load balancing in a nice way.

~~~
mtai
If you've been "backgrounding" jobs (async tasks, for example, where the
client wants to fire and forget), Gearman has the option of a persistent
queue. There are a few options for persistence...

1) You can use a local SQLite file. Fast and fine for jobs you don't mind
losing once if your entire box goes down. (Cache busting comes to mind)

2) You can use MySQL. If a job server dies, you can restart it and point it at
the same MySQL instance. If a job server dies and the entire MACHINE is down,
you can spin up another gearmand instance on another machine and point it at
the right place.

If you are submitting "foregrounded" tasks, meaning your client requires a
response, Gearman's way of handling failure is pretty simple. When gearmand
(the server) dies, the client will see you lost a socket connection. It is
then up to the client to determine what to do in that failure scenario. It
sounds like in your case, you just want to resubmit it. This should be pretty
easy to do.
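That resubmit-on-failure pattern can be sketched as a small wrapper; `submit` here stands in for whatever call your client library provides (e.g. python-gearman's `submit_job`, name assumed), and the retry policy is made up:

```python
import time

def submit_with_retry(submit, job, attempts=3, delay=1.0):
    """Resubmit a foreground job when the connection to gearmand is lost.
    The job should be idempotent, since the server may have partially
    handled it before the socket dropped."""
    for attempt in range(attempts):
        try:
            return submit(job)
        except ConnectionError:
            if attempt == attempts - 1:
                raise
            time.sleep(delay)
```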

As an FYI, I'm currently the maintainer of the python-gearman 2.x series API.
We (derwiki and I) have been using Gearman in production for the past few
months now and it's worked out pretty well for us. Implementation's a snap and
running the daemon's pretty trivial.

------
deepu_256
I haven't tested a lot of queues, but one solution that I played with recently
and liked is ZooKeeper.

It is distributed, battle tested, and has a small, simple API. You can build
some good distributed data structures (queues, for example) on top of it
with minimal work.

For python examples - [http://www.cloudera.com/blog/2009/05/building-a-
distributed-...](http://www.cloudera.com/blog/2009/05/building-a-distributed-
concurrent-queue-with-apache-zookeeper/)

and <http://github.com/twitter/twitcher>
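The queue recipe in that Cloudera post works by creating sequential znodes under a queue path and taking the lowest-numbered child. The sketch below models that algorithm against a plain dict rather than a live ensemble; with a real cluster you'd go through a ZooKeeper client library instead:

```python
# Model of a ZooKeeper-style queue: producers create sequential children
# like item-0000000001, consumers claim the lowest-numbered child. In real
# ZooKeeper the server assigns sequence numbers atomically and a delete()
# either succeeds (we own the item) or fails because another worker won.

import itertools

class ZnodeQueue:
    def __init__(self):
        self.children = {}            # znode path -> payload
        self.seq = itertools.count()  # stands in for ZooKeeper's sequencing

    def put(self, payload):
        path = "item-%010d" % next(self.seq)
        self.children[path] = payload

    def take(self):
        if not self.children:
            return None
        # Zero-padded names sort lexicographically in creation order.
        path = min(self.children)
        return self.children.pop(path)
```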

------
hsuresh
I am currently using Resque (<http://github.com/defunkt/resque>) for a similar
requirement. You might want to check it out.

edit: This is in Ruby though.

~~~
rantfoil
Delayed Job is also a battle-tested queue system, with better retry / error
rescuing (ironically enough). It is slower due to MySQL locking, though;
Resque is built on top of Redis and is lightning fast.

We use DJ for jobs that MUST succeed (e.g. autopost) whereas we use Resque for
more frequent but less essential jobs.

~~~
mceachen
We use Delayed Job at AdGrok both for async jobs kicked off from user
activity and for nightly maintenance jobs.

If you need support for "job prerequisites" (one job not starting before
other jobs have finished successfully), we'll be adding that to our fork from
collectiveidea soon. The DJ codebase is pretty nice to look at, too. Note also
that there are capistrano dj deployment recipes in the git source.

------
eugenejen
MySQL has a storage engine, Q4M, which is transactional and persistent.
Performance depends on the server, but I had no problem achieving 10k
messages/sec of queue throughput on a 2GHz Core 2 Duo CPU. You can use the
queue with a standard MySQL client.

Check out <http://q4m.31tools.com/> and
[http://www.slideshare.net/kazuho/q4m-a-highperformance-
messa...](http://www.slideshare.net/kazuho/q4m-a-highperformance-message-
queue-for-mysql)

------
jokull
I've used beanstalkd before. It has transactions, timeouts, and sleep features.
What I don't like about it is that it adds complexity to server setup for such
a simple thing. I've seen queue libraries built around redis which is a win
since I'm already using that for other things (check out hotqueue
<http://richardhenry.github.com/hotqueue/>).

------
gabbott
I'm kind of a fan of using something like this (
[http://lethain.com/entry/2010/sep/05/python-
datastructures-b...](http://lethain.com/entry/2010/sep/05/python-
datastructures-backed-by-redis/) ) and rolling your own.
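A sketch of what "rolling your own" on Redis can look like: a tiny queue wrapping list pushes and pops, with the client injectable so it can be tested without a server. Method names follow the redis-py client (assumed); key naming is made up:

```python
# Minimal Redis-backed job queue sketch. LPUSH at the head plus RPOP at the
# tail gives FIFO ordering; tasks are serialized as JSON strings.

import json

class RedisQueue:
    def __init__(self, client, name):
        self.client = client          # e.g. redis.Redis() with redis-py
        self.key = "queue:" + name

    def put(self, task):
        self.client.lpush(self.key, json.dumps(task))

    def get(self):
        item = self.client.rpop(self.key)
        return json.loads(item) if item is not None else None
```

Note this simple version is not transactional: a worker that dies after `get()` loses the task. Redis's RPOPLPUSH into a per-worker "in progress" list is the usual fix for that.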

------
Vargas
If you are running Java or compatible, I recommend JPPF:
<http://www.jppf.org/>

------
mman
You can try condor <http://www.cs.wisc.edu/condor/>

------
krisneuharth
We are using this: <http://github.com/robey/kestrel>

It is written in Scala for use at Twitter. It is super simple to get running
and has been robust so far. We are using it to pass JSON messages around our
system for background processing tasks.

