I'd recommend looking at alternative serialization formats. Pickle is a security risk that programmers writing distributed systems in Python should be educated about.
Pickle doesn't really use eval, but there is still the potential for users to execute arbitrary code[1]. JSON, MessagePack, and the like are safe in this respect (as is YAML, provided you use a safe loader such as PyYAML's safe_load), because all the parser does is convert the data into simple data structures.
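For example, here's the classic sketch of the problem using __reduce__ (this runs a shell command on whatever machine unpickles it; never load pickles you don't trust):

    import os
    import pickle

    class Payload(object):
        # pickle calls __reduce__ to learn how to rebuild the object:
        # it returns a callable and its arguments, and the unpickler
        # happily invokes them -- i.e. arbitrary code execution.
        def __reduce__(self):
            return (os.system, ("echo pwned",))

    blob = pickle.dumps(Payload())
    pickle.loads(blob)  # runs `echo pwned` on the loader's machine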
I was just using eval loosely. What I particularly meant is that __init__ is run for classes with a .__getinitargs__ method defined [1]. And I guess JSON et al. are the reasonable answer I should have expected. I was hoping for something that mimicked the functionality of pickle but signed the data so that it would be safe to use across a network.
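Roughly what I have in mind is an HMAC-signing wrapper like this (just a sketch; the shared key and helper names are made up):

    import hashlib
    import hmac
    import pickle

    SECRET = b"shared-secret-distributed-out-of-band"  # hypothetical key

    def dumps_signed(obj):
        payload = pickle.dumps(obj)
        sig = hmac.new(SECRET, payload, hashlib.sha256).digest()
        return sig + payload

    def loads_signed(blob):
        sig, payload = blob[:32], blob[32:]
        expected = hmac.new(SECRET, payload, hashlib.sha256).digest()
        if not hmac.compare_digest(sig, expected):
            raise ValueError("bad signature; refusing to unpickle")
        return pickle.loads(payload)

That only protects against tampering in transit, of course -- anyone who holds the key can still make you execute arbitrary code.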
Pickle (or any use of eval) is a security risk only if you're feeding it untrusted input. Basically any distributed task queue is going to have that risk if it can execute arbitrary code.
Having a way to pickle code objects and their dependencies is a huge win, and I'm angry I hadn't heard of PiCloud earlier.
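For anyone who hasn't seen it, the code-pickling part lives on as the standalone cloudpickle package (extracted from PiCloud's cloud library); a rough sketch of what it buys you:

    import pickle
    import cloudpickle  # pip install cloudpickle

    # plain pickle refuses lambdas and closures; cloudpickle serializes
    # the code object and its captured values by value
    multiplier = 3
    blob = cloudpickle.dumps(lambda x: x * multiplier)

    fn = pickle.loads(blob)  # could just as well happen on a worker machine
    print(fn(14))            # -> 42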
That's a nice use of the cloud library, without using the PiCloud service. Unfortunately, the PiCloud service itself is shutting down on February 25th (or thereabouts).
Looks like the PiCloud team is joining Dropbox, but according to their blog post, "The PiCloud Platform will continue as an open source project operated by an independent service, Multyvac (http://www.multyvac.com/)."
Yes, the Multyvac launch has been delayed until February 19.
They are supporting much of the PiCloud functionality, but not function publishing, which I used quite a lot. (That was a way you could "publish" a function to PiCloud, and then call it from a RESTful interface. It was a nice way to decouple my computational code from my website, which has very different dependencies.)
I fear it is more oriented toward the use case of long-running scientific jobs, rather than short, Celery-like jobs. I hope for the best.
Thanks Jeff. As someone else mentioned, I love these little projects that demonstrate the basics of what the big projects actually do. Makes it much easier to understand the big picture.
Absolutely. I'm always pleased when documentation includes some pseudocode for what the system generally does, without the overhead of configuration, exceptional control flow, etc. It's not always possible with large systems, but makes it a lot easier to see the forest, not the trees, in even mid-sized code bases.
There have been a few times I did things in Python that ended up being a terrible PITA, and multiprocessing was one of them. Basically, after fork you should call exec, but multiprocessing doesn't. So many things worked just fine on Linux and then completely fell apart on FreeBSD, OS X, and Windows. I think a lot of this has been fixed since then by using a Manager and a Pool.
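On Python 3.4+ the rough equivalent of "fork then exec" is the spawn start method; a minimal sketch:

    import multiprocessing as mp

    def work(n):
        return n * n

    if __name__ == "__main__":
        # "spawn" starts a fresh interpreter, so the child doesn't inherit
        # threads, locks, or file descriptors the way a bare fork() does
        ctx = mp.get_context("spawn")
        with ctx.Pool(4) as pool:
            print(pool.map(work, range(8)))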
I can't speak for Celery, as I've not used it very much.
It may be straightforward to write a functional Redis alternative in pure Python using this library, but I would certainly have questions/concerns about performance.
I scratched an itch in this space and built a web hook task queue in Python. I wrote it up at http://ntorque.com -- I'd love to know whether the rationale makes sense...
Are there any non-distributed task queues for Python? I need something like this for a tiny web application that just needs a queue for background tasks that is persisted, so tasks can resume in case the application crashes/restarts.
Installing Redis or even ZeroMQ seems kind of excessive to me, given that the application runs on a Raspberry Pi and serves maximum 5 users at a time.
My suggestion is to just use RabbitMQ. It's written in Erlang and uses a BerkeleyDB-like backend for message storage. It's non-distributed and durable, with optionally persistent or non-persistent messages, and it has a web interface for examining messages. My second suggestion is to use JSON for your message format, although for basic tasks you can put all the info you need in the headers.
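Something along these lines with the pika client (a rough sketch; the queue name and payload are made up):

    import json

    import pika  # assuming the pika AMQP client

    conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    ch = conn.channel()
    ch.queue_declare(queue="tasks", durable=True)  # queue survives broker restarts
    ch.basic_publish(
        exchange="",
        routing_key="tasks",
        body=json.dumps({"task": "resize_image", "id": 42}),
        properties=pika.BasicProperties(delivery_mode=2),  # persistent message
    )
    conn.close()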
Celery is typically used with RabbitMQ, which is an advanced message broker.
If you don't need the complexity (most don't), but do need persistence and transactional guarantees, and you already have PostgreSQL in your stack, then the pq library is probably a good option.
It's not really for a distributed setup, but meant to operate within a cluster (local area network, same data center). Most people don't have distributed systems ;-).
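A rough sketch of what using pq looks like (queue name and payload are made up; check the pq README for the exact API):

    import psycopg2
    from pq import PQ

    conn = psycopg2.connect("dbname=myapp")  # hypothetical database
    pq = PQ(conn)
    pq.create()          # one-time setup: creates the queue table
    queue = pq["tasks"]

    # producer
    queue.put({"action": "send_email", "to": "user@example.com"})

    # consumer; get() returns None when the queue is empty
    task = queue.get()
    if task is not None:
        print(task.data)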
I wouldn't really classify this example as a "task queue"; it resembles the RPC pattern more (and I would guess that this example does not persist the jobs in any way). Celery does have a very experimental filesystem-based transport that could be used for your use case. I don't know if it fits on a Raspberry Pi, but Celery does not have very high memory/space requirements (cf. other Python libraries).
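Configuration for that transport looks roughly like this (a sketch with made-up paths; setting names may differ between Celery versions):

    from celery import Celery

    app = Celery("tasks")
    app.conf.update(
        broker_url="filesystem://",
        broker_transport_options={
            # messages are exchanged as files in these directories
            "data_folder_in": "/var/spool/celery",
            "data_folder_out": "/var/spool/celery",
        },
    )

    @app.task
    def add(x, y):
        return x + y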
ZeroMQ takes care of the queueing. Though I didn't delve into it in depth in this example, you can create quite sophisticated broker-less distributed systems pretty easily with ZeroMQ.
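For example, a bare-bones PUSH/PULL pipeline with pyzmq looks roughly like this (the two ends would normally live in separate processes or on separate machines):

    import zmq

    ctx = zmq.Context()

    # worker side: pull tasks off the pipeline
    receiver = ctx.socket(zmq.PULL)
    receiver.bind("tcp://*:5557")

    # producer side: fan tasks out to any connected workers
    sender = ctx.socket(zmq.PUSH)
    sender.connect("tcp://localhost:5557")
    sender.send_json({"task": "resize_image", "id": 1})

    print(receiver.recv_json())  # blocks until the message arrives

    sender.close()
    receiver.close()
    ctx.term()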
Right, I understand that ZeroMQ is used to send and receive messages; what I mean is that there's no persistence involved, so the tasks will not survive a system restart.
No, it doesn't, although it might be possible to sort of emulate it by setting ZMQ_HWM to 1 and enabling ZMQ_SWAP, but I wouldn't bet on it.
The best you can hope for is to use the Titanic Service Protocol and just throw data into some sort of disk store. I've looked into doing this, but I settled on using RabbitMQ instead for persistence. Unless you're dealing with more than 10k messages a second, it's just as easy as ZeroMQ.
After a few months of experimenting, I've come to the conclusion that some combination of ZeroMQ and RabbitMQ is likely the easiest solution currently for a combination of low-overhead distributed messaging and broker-assisted persistent messaging.
We use it in production and it's been rock-solid. The documentation is sparse but the source is easy to follow.