

Celery: A Distributed Task Queue for Django - Jasber
http://ask.github.com/celery/introduction.html

======
anuraggoel
Looks interesting. But shouldn't a library like celery work outside the
context of a web framework? I don't see a reason to call this a distributed
task queue 'for Django' specifically, except for the dependencies on Django's
ORM and settings definitions. Swapping out Django's ORM with SQLAlchemy (or
DB-API) would make this project much more useful.

See pp (<http://www.parallelpython.com/>) for something similar, without the
django dependency. More parallel processing goodies at
<http://wiki.python.org/moin/ParallelProcessing>.

~~~
gaborcselle
The way I understand this, Celery _is_ that binding glue between Django, a
message queue (RabbitMQ), and cronjobs.

I.e. you could swap out Django for CGI, and the ORM with SQLAlchemy, but then
you might as well start from scratch.

~~~
anuraggoel
I might not want to swap out Django with CGI, but I might want to use
RabbitMQ-based distributed tasks in Pylons. Seems like a waste to write all
that code from scratch. For Django users (and I am one), Celery is clearly
useful. But a distributed task processing library, even if it is a thin
wrapper over RabbitMQ, should not depend on a web-framework.

~~~
ubernostrum
It's actually not terribly hard. See this article for some pointers on how
easy it is to work with AMQP from Python:

<http://playgroundblues.com/posts/2009/may/20/working-django-and-rabbitmq/>
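The core of such a setup is small: serialize a task call and hand it to an AMQP channel. Here is a rough sketch of that shape; the `FakeChannel` stand-in, the task payload format, and the exchange/queue names are all invented for illustration (a real setup would use an actual py-amqplib connection and channel, as the linked article shows):

```python
import json

class FakeChannel:
    """Stand-in for an amqplib channel, so this sketch runs without a broker."""
    def __init__(self):
        self.published = []

    def basic_publish(self, msg, exchange, routing_key):
        self.published.append((msg, exchange, routing_key))

def publish_task(channel, task_name, args, exchange="tasks", routing_key="tasks"):
    """Serialize a task call and publish it over an AMQP-style channel.

    With py-amqplib you would wrap the body in amqp.Message and set
    delivery_mode=2 for persistent messages; here the JSON body is sent as-is.
    """
    body = json.dumps({"task": task_name, "args": args})
    channel.basic_publish(body, exchange=exchange, routing_key=routing_key)
    return body

chan = FakeChannel()
publish_task(chan, "fetch_page", ["http://example.com/"])
```

A worker on any machine would then consume from the same queue, decode the JSON, and look up the task by name.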

Meanwhile, I think there's nothing wrong with people distributing tools which
integrate queuing solutions into specific libraries/frameworks; such things
can often be quite useful and end up offering more natural interfaces for the
task at hand.

~~~
asksol
It's not about communicating with AMQP. It uses carrot+pyamqplib for that.

~~~
ubernostrum
I was simply replying to the seeming assertion that it's wrong to develop a
queueing solution which integrates with Django. Writing queueing solutions in
Python is easy, and integrating with popular tools should be fine.

------
pie
Having just hacked together an ugly threaded task queue for scraping and
multi-stage data processing in Django, this looks like a breath of fresh air.
I need to work my way out of the self-inflicted mess I've created.

Does anyone have experience with this library or anything similar?

~~~
conesus
Yeah, I also have experience hacking together a multi-threaded task queue
with ugly results. Try getting messaging in multiple daemon threads to
communicate with the web client (which is spawned off early in the process,
only to be reunited later), and you'll see how much of a bear this is.

It's not too hard to perform only one of the use cases that celery handles,
but to get all of them? I'm installing celery this weekend and will see how it
goes. Maybe if it goes well, I can give a before-and-after blog post and
submit it to HN.

------
tdavis
beanstalkd (<http://xph.us/software/beanstalkd/>) also has similarities to
this, and for non-Django / simpler needs, it may be better. It's basically
memcached repurposed into a queue server.

A "task" would be equivalent to a script which only looks for jobs in a
certain bucket (or "tube" as they're called). You can run as many clients on
as many machines as you like. Obviously, since it is memory-based, you'll lose
the queue in the event of a system crash.
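The bucket/"tube" idea is easy to picture in miniature. This in-memory sketch is purely illustrative (the tiny `put`/`reserve` API and tube names are made up; real beanstalkd speaks a text protocol over TCP), but it shows how each worker script watches only its own tube:

```python
from collections import defaultdict, deque

class TinyQueueServer:
    """In-memory sketch of beanstalkd-style named 'tubes'."""
    def __init__(self):
        self.tubes = defaultdict(deque)

    def put(self, tube, job):
        # Producers drop jobs into a named tube.
        self.tubes[tube].append(job)

    def reserve(self, tube):
        # Workers take the oldest job from their tube, or None if empty.
        return self.tubes[tube].popleft() if self.tubes[tube] else None

server = TinyQueueServer()
server.put("emails", {"to": "user@example.com"})
server.put("resize", {"image": "photo.jpg"})

# A worker script only looks at its own tube:
job = server.reserve("emails")
```

Since the real server holds all of this in memory, the tubes vanish on a crash, which is the trade-off mentioned above.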

That being said, as a rabid Django user, this is definitely going into my
bookmarks!

~~~
henriklied
I would suggest looking at redis (<http://code.google.com/p/redis/>). In my
own experience, redis is really fast, and you have persistent storage of the
tasks.

~~~
asksol
Please note that RabbitMQ does support persistence. See
<http://www.rabbitmq.com>

------
diN0bot
I've been reading through the documentation on the celery github page. I
haven't been able to figure out the appropriate task breakdown. That is, I'm
trying to do some crawling and ingestion, and I'm wondering if I should be
pushing a dozen small tasks onto the queue every second, or push larger tasks
(possibly with subtasks broken out like it suggests) every minute or hour.

This sounds like a dumb question to my own ears, but I just don't have the
familiarity to know the proper use case. I essentially want continuous
crawling and ingestion with the potential to spread the load across multiple
servers one day.

(presumably the ingestors would be populating local databases, with a query
getting farmed out to each server+database, but I haven't figured that part
out, either....ummm, sounds like a task I could put into the queue, as well.
Are these things really nails?)

I'd be grateful if anyone can point me to some examples or provide a bit of
context.

~~~
asksol
There isn't a single good answer to this; it all depends on what storage you
use, the work you need to do, etc. But in general you want tasks to be as
granular as is sensible, so you can spread the work across many servers. The
best thing you can do is try out, stress-test and benchmark the different
ways to do this. From your description, I'm not even sure if celery is the
right tool for the job, but you could join #celery on irc.freenode.net to get
more information.
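The reason granular tasks spread well is that any number of workers can pull from the same queue. A rough stdlib-only sketch of that effect (the URLs and worker count are invented, and `queue.Queue` plus threads stand in for what Celery workers would do across machines):

```python
import queue
import threading

# A dozen small tasks -- one per URL -- rather than one big crawl job.
urls = ["http://example.com/%d" % i for i in range(12)]
tasks = queue.Queue()
for url in urls:
    tasks.put(url)

done = []
lock = threading.Lock()

def worker():
    # Each worker pulls whatever task is next; no worker cares which.
    while True:
        try:
            url = tasks.get_nowait()
        except queue.Empty:
            return
        # ... fetch and ingest url here ...
        with lock:
            done.append(url)

workers = [threading.Thread(target=worker) for _ in range(3)]
for w in workers:
    w.start()
for w in workers:
    w.join()
```

With one big task per hour, a slow crawl blocks everything behind it; with many small tasks, adding a server just means adding another consumer of the same queue.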

------
amix
I have often wondered why not use a MySQL table as a "queue" (or more tables
if needed). Basically, you get great performance (MySQL is really fast), you
get great language support (a LOT of languages can add tasks via simple SQL),
and you get things like easy backups and replication.
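A minimal sketch of that idea, with SQLite standing in for MySQL so it runs anywhere (the table layout is an assumption). The one subtle part is claiming a row atomically so two workers can't grab the same task; here that's done with a guarded `UPDATE ... WHERE status = 'new'`, where with MySQL you might instead use `SELECT ... FOR UPDATE`:

```python
import sqlite3

# Table-as-queue sketch: SQLite in place of MySQL so it's self-contained.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY, "
           "payload TEXT, status TEXT DEFAULT 'new')")
db.execute("INSERT INTO tasks (payload) VALUES ('resize photo.jpg')")
db.execute("INSERT INTO tasks (payload) VALUES ('send welcome mail')")
db.commit()

def claim_next(db):
    """Claim the oldest unclaimed row; return (id, payload) or None."""
    row = db.execute("SELECT id, payload FROM tasks "
                     "WHERE status = 'new' ORDER BY id LIMIT 1").fetchone()
    if row is None:
        return None
    task_id, payload = row
    cur = db.execute("UPDATE tasks SET status = 'taken' "
                     "WHERE id = ? AND status = 'new'", (task_id,))
    if cur.rowcount == 0:
        # Another worker claimed it between our SELECT and UPDATE; retry.
        return claim_next(db)
    return task_id, payload

first = claim_next(db)  # (1, 'resize photo.jpg')
```

Whether polling a table like this beats a real broker is exactly the point of contention in this subthread: it works, but workers must poll rather than be pushed to, and the claim query becomes a hot spot under load.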

~~~
asksol
See Alexis Richardson's talk, Databases Sucks for Messaging:
<http://oxford.geeknights.net/2009/may-27th/talks/keynote-AlexisRichardson.pdf>

~~~
amix
Messaging != A queue (at least the queue that Celery represents...)

------
mshafrir
Google App Engine needs something like this.

~~~
ropiku
They said they are working on a queuing system for offline processing that
uses HTTP POST for doing the actual work. See the session at Google I/O:
<http://code.google.com/events/io/sessions/OfflineProcessingAppEngine.html>

