A Celery-like Python Task Queue in 55 Lines of Code (jeffknupp.com)
60 points by jknupp on Feb 11, 2014 | 44 comments



For an alternative, check out RQ: http://python-rq.org/

We use it in production and it's been rock-solid. The documentation is sparse but the source is easy to follow.


This is the first thing I thought of when I saw this post. I use RQ extensively and it's great.


It's great, but as far as I can remember it doesn't support Python 3. Also, it'd be nice to use it with Mongo instead of Redis.


Actually, Py3 support landed 6 months ago: https://github.com/nvie/rq/pull/239


I'd recommend looking at alternative serialization formats. Pickle is a security risk that programmers writing distributed systems in Python should be educated about.


I understand the risk is basically because you're evaling when unpickling. What formats are safe then?


Pickle doesn't really use eval, but there is still the potential for users to execute arbitrary code[1]. JSON, YAML, MessagePack, etc are safe in this respect (assuming a well-implemented parsing library) because all the parser does is convert the data into simple data structures.

[1] http://lincolnloop.com/blog/playing-pickle-security/
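
To make that concrete, here's a minimal sketch of how unpickling attacker-controlled bytes runs arbitrary code (the class and command are made up for illustration):

    import pickle

    class Exploit(object):
        # Pickle calls __reduce__ to learn how to "reconstruct" the object,
        # so unpickling this runs os.system("echo pwned").
        def __reduce__(self):
            import os
            return (os.system, ("echo pwned",))

    payload = pickle.dumps(Exploit())

    # A worker that blindly unpickles bytes off the wire executes the command:
    pickle.loads(payload)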


I was just using "eval" loosely. The eval I particularly meant is that __init__ is run for classes with a __getinitargs__ method defined [1]. And I guess JSON et al. are the reasonable answer I should have expected. I was hoping for something that mimicked the functionality of pickle but maybe signed the information so that it would be safe to use across a network.

[1] http://docs.python.org/2/library/pickle.html#object.__getini...


Python's YAML implementation isn't safe by default. You have to use yaml.safe_load() because the standard load() function can execute arbitrary code.
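
A quick sketch of the difference (the tag below is one of the Python-object tags the unsafe loader accepts):

    import yaml

    untrusted = "!!python/object/apply:os.system ['echo pwned']"

    # safe_load only builds plain data (dicts, lists, strings, numbers)
    # and rejects the python/object tag with a ConstructorError:
    try:
        yaml.safe_load(untrusted)
    except yaml.YAMLError as exc:
        print("rejected:", exc)

    # The full (unsafe) loader would construct the object, which here
    # means actually running the shell command:
    # yaml.load(untrusted, Loader=yaml.Loader)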


Pickle (or any use of eval) is a security risk only if you're using it in the context of untrusted code. Basically any distributed task queue is going to have that risk if it can execute arbitrary code.


I thought the risk was if the data came from an untrusted source, as it might contain code?


I think shooting it over the network is considered untrusted. Man-in-the-middle attacks become a problem.


I use JSON and protobuf.


    Having a way to pickle code objects and their dependencies is a huge win, 
    and I'm angry I hadn't heard of PiCloud earlier.
That's a nice use of the cloud library, without using the PiCloud service. Unfortunately, the PiCloud service itself is shutting down on February 25th (or thereabouts).
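
For anyone who hasn't seen it, this is roughly what that buys you; the sketch below uses the cloudpickle serializer that came out of PiCloud's cloud library (function names are made up):

    import pickle
    import cloudpickle

    def make_scaler(factor):
        # A closure over `factor`: the standard pickle module refuses to
        # serialize this (it only pickles module-level functions by name),
        # but cloudpickle captures the code object and the closed-over value.
        return lambda x: x * factor

    payload = cloudpickle.dumps(make_scaler(3))

    # The receiving side only needs plain pickle to load and call it:
    fn = pickle.loads(payload)
    print(fn(14))  # 42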


Ah, I'm sorry to hear that. Hopefully, they make their library code more readily available before closing their doors.


Looks like the PiCloud team is joining Dropbox, but according to their blog post "The PiCloud Platform will continue as an open source project operated by an independent service, Multyvac (http://www.multyvac.com/)."

Source: http://blog.picloud.com/2013/11/17/picloud-has-joined-dropbo...


Yes, the Multyvac launch has been delayed until February 19.

They are supporting much of the PiCloud functionality, but not function publishing, which I used quite a lot. (That was a way you could "publish" a function to PiCloud, and then call it from a RESTful interface. It was a nice way to decouple my computational code from my website, which has very different dependencies.)

I fear it is more oriented toward the use case of long-running scientific jobs rather than short, Celery-like jobs. I hope for the best.


Although Celery can use it, why is Amazon SQS treated as a second-class citizen in Python background worker systems?

I've yet to find/see a background worker pool that played nicely (properly) with SQS.


Thanks Jeff. As someone else mentioned, I love these little projects that demonstrate the basics of what the big projects actually do. Makes it much easier to understand the big picture.


Absolutely. I'm always pleased when documentation includes some pseudocode for what the system generally does, without the overhead of configuration, exceptional control flow, etc. It's not always possible with large systems, but it makes it a lot easier to see the forest rather than the trees, even in mid-sized code bases.


http://docs.python.org/2/library/multiprocessing.html#sharin...

Why doesn't anyone build a Celery/Redis alternative using this?
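
For context, the machinery behind that link can already serve a plain queue over TCP; a minimal sketch (address and authkey are made up):

    # server.py -- share an ordinary Queue over TCP via multiprocessing managers
    import queue
    from multiprocessing.managers import BaseManager

    tasks = queue.Queue()

    class QueueManager(BaseManager):
        pass

    QueueManager.register('get_tasks', callable=lambda: tasks)

    if __name__ == '__main__':
        manager = QueueManager(address=('0.0.0.0', 50000), authkey=b'not-a-secret')
        manager.get_server().serve_forever()

    # worker.py -- run anywhere that can reach the server:
    #
    #   QueueManager.register('get_tasks')
    #   manager = QueueManager(address=('server-host', 50000), authkey=b'not-a-secret')
    #   manager.connect()
    #   manager.get_tasks().get()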


There have been a few times I did things in Python that ended up being a terrible PITA, and multiprocessing was one of them. Basically, after fork you should call exec, but multiprocessing doesn't. So many things worked just fine on Linux and then completely fell apart on FreeBSD, OS X, and Windows. I think a lot of this has been fixed since then by using a Manager and a Pool.


Celery uses `billiard`, which is a fork of multiprocessing.

That doesn't help with communicating with a distributed worker pool, though.


I can't speak for Celery, as I've not used it very much.

It may be straightforward to write a functional Redis alternative in pure Python using this library, but I would certainly have questions/concerns about performance.


I scratched an itch in this space and created, in Python, a webhook task queue. I wrote it up here: http://ntorque.com -- I'd love to know if the rationale makes sense...


Are there any non-distributed task queues for Python? I need something like this for a tiny web application that just needs a persisted queue for background tasks, so tasks can resume if the application crashes or restarts. Installing Redis or even ZeroMQ seems kind of excessive to me, given that the application runs on a Raspberry Pi and serves a maximum of 5 users at a time.


My suggestion is to just use RabbitMQ. It's written in Erlang and uses a BerkeleyDB-like backend for message storage. It's non-distributed and "durable", with optional persistent or non-persistent messages, and it has a web interface for examining messages. My second suggestion is to use JSON for your messaging format, although for basic tasks it's possible to put all the info you need in the headers.
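
To give a flavour of it, a minimal publish with the pika client, declaring a durable queue and marking the message persistent (queue name and payload are made up):

    import json
    import pika

    connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
    channel = connection.channel()

    # Durable queue: the queue definition survives a broker restart.
    channel.queue_declare(queue='tasks', durable=True)

    # delivery_mode=2 marks this particular message as persistent.
    channel.basic_publish(
        exchange='',
        routing_key='tasks',
        body=json.dumps({'task': 'resize_image', 'id': 42}),
        properties=pika.BasicProperties(delivery_mode=2),
    )
    connection.close()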


I have personally found RabbitMQ one of the most bonkers, overengineered and painful to manage services I've ever dealt with.


I wrote https://pypi.python.org/pypi/pq, a transactional task queue for PostgreSQL that depends only on psycopg2.

It does about 1,000 ops per second, depending on your database configuration, network, etc.
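
Basic usage is roughly this (a sketch; the queue name and payload are just examples, see the README for the details):

    import psycopg2
    from pq import PQ

    conn = psycopg2.connect('dbname=example')
    pq = PQ(conn)
    pq.create()          # one-time setup: creates the queue table

    queue = pq['tasks']
    queue.put({'kind': 'send_email', 'to': 'user@example.com'})

    task = queue.get()   # returns None if the queue is empty
    if task is not None:
        print(task.data)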


That's pretty cool. What are the benefits of using this over Celery?


Celery is typically used with RabbitMQ, which is an advanced message broker.

If you don't need the complexity (most don't), do need persistence and transactional guarantees, and already have PostgreSQL in your stack, then pq is probably a good option.

It's not really for a distributed setup, but meant to operate within a cluster (local area network, same data center). Most people don't have distributed systems ;-).


I wouldn't really classify this example as a "task queue"; it resembles the RPC pattern more (and I would guess that this example does not persist the jobs in any way). Celery does have a very experimental filesystem-based transport that could be used for your use case. I don't know if it fits on a Raspberry Pi, but Celery does not have very high memory/space requirements (cf. other Python libraries).
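
From memory, the configuration for that transport is roughly along these lines; treat the option names as approximate and check the Kombu/Celery docs for your version (paths are made up):

    from celery import Celery

    app = Celery('tiny_app')
    app.conf.update(
        BROKER_URL='filesystem://',
        BROKER_TRANSPORT_OPTIONS={
            # messages are written to / picked up from these directories
            'data_folder_in': '/var/spool/celery/broker',
            'data_folder_out': '/var/spool/celery/broker',
        },
    )

    @app.task
    def add(x, y):
        return x + y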


ZeroMQ takes care of the queueing. Though I didn't delve into it in depth in this example, you can create quite sophisticated broker-less distributed systems pretty easily with ZeroMQ.
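
For a flavour of what that looks like, a bare-bones PUSH/PULL pair with pyzmq (port and payload are arbitrary; the two sockets would normally live in separate processes):

    import zmq

    ctx = zmq.Context()

    # Producer side: a PUSH socket fans work out to connected workers.
    sender = ctx.socket(zmq.PUSH)
    sender.bind('tcp://127.0.0.1:5557')

    # Worker side: a PULL socket receives work items.
    receiver = ctx.socket(zmq.PULL)
    receiver.connect('tcp://127.0.0.1:5557')

    sender.send_json({'task': 'add', 'args': [2, 3]})
    print(receiver.recv_json())  # {'task': 'add', 'args': [2, 3]}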


Right, I understand that ZeroMQ is used to send and receive messages; what I mean is that there's no persistence involved, so the tasks will not survive a system restart.


No, I mean ZeroMQ takes care of the underlying queue for free (there is an underlying set of queues, which can be persisted if necessary)


No, it doesn't. It might be possible to sort of emulate it by setting ZMQ_HWM to 1 and enabling ZMQ_SWAP, but I wouldn't bet on it.

The best you can hope for is to use the Titanic Service Protocol and just throw data into some sort of disk store. I've looked into doing this, but I settled on using RabbitMQ instead for persistence. Unless you're dealing with more than 10k messages a second, it's just as easy as ZeroMQ.

After a few months of experimenting, I've come to the conclusion that some combination of ZeroMQ and RabbitMQ is currently the easiest way to get both low-overhead distributed messaging and broker-assisted persistent messaging.


That must be a very recent development for 0mq then; do you have any source for this new feature?


Are you thinking of RabbitMQ perhaps?


In a similar situation, I've used inotify to watch things just get dumped on the disk. It's a hack, because of race conditions, but sometimes that's OK.


You could use mutex locking on the file system to avoid the race conditions.
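
For instance, an exclusive flock held while claiming a file (a POSIX-only sketch; the lock path and the helper are made up):

    import fcntl

    def process_next_file():
        # placeholder for whatever actually consumes the dumped file
        print('processing...')

    with open('/tmp/spool.lock', 'w') as lockfile:
        # Exclusive advisory lock: a second worker blocks here until
        # we're done, so two processes can't claim the same file.
        fcntl.flock(lockfile, fcntl.LOCK_EX)
        try:
            process_next_file()
        finally:
            fcntl.flock(lockfile, fcntl.LOCK_UN)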



With PostgreSQL, you can use advisory locks and other little tricks to get better performance, e.g.

https://github.com/malthe/pq/blob/master/pq/__init__.py#L224
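
The basic trick, in isolation, looks something like this (a psycopg2 sketch; the lock key is made up): try to claim a task with a non-blocking advisory lock and skip anything another worker already holds.

    import psycopg2

    conn = psycopg2.connect('dbname=example')
    cur = conn.cursor()

    # Try to claim task 42 without blocking; if another worker already
    # holds the advisory lock this returns False and we just move on.
    cur.execute("SELECT pg_try_advisory_lock(%s)", (42,))
    (locked,) = cur.fetchone()

    if locked:
        try:
            pass  # do the work for task 42 here
        finally:
            cur.execute("SELECT pg_advisory_unlock(%s)", (42,))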


I like these one-off projects that Jeff is doing, but it would be particularly instructive to see one, or a combination, make it to 'real' status.


Check out sandman: www.sandman.io or www.github.com/jeffknupp/sandman



