1. We have problem, lets use Celery
2. Now we have one more problem.
Three days in, I can tell you that it does work, but it does take a lot of searching through the docs to optimize. It's very hard to run with class objects too, so we just created long-scripted functions for the worker.
Even now I'm trying to figure out why the worker is unable to refresh access tokens after 60 minutes, and tempted to just have it run as root.
- If you're using AMQP/RabbitMQ as your result back end it will create a lot of dead queues to store results in. This can easily overwhelm your RabbitMQ server if you don't clear these out frequently. Newer releases of Celery will do this daily I think - but it's worth keeping in mind if your RMQ instance falls over in prod.
- Use chaining to build up "sequential" tasks that need doing instead of calling one after another in the same task (or worse, doing a big mouthful of work) in one task as Celery can prioritise many tasks better than synchronously calling several tasks in a row from one "master" task.
- Try to keep a consistent module import pattern for celery tasks, or explicitly name them, as Celery does a lot of magic in the background so task spawning is seamless to the developer. This is very important as you should never mix relative and absolute importing when you are dealing with tasks. from foo import mytask may be picked up differently than "import foo" followed by "foo.mytask" would resulting in some tasks not being picked up by Celery(!)
- Never pass database objects, as OP says, is true; but go one step further and don't pass complex objects at all if you can avoid it. I vaguely remember some of the urllib/httplib exceptions in Python not being serializable and causing very cryptic errors if you didn't capture the exception and sanitise it or re-raise your own.
- Use proper configuration management to set up and configure Celery plus what ever messaging broker/backend. There's nothing more frustrating than spending your time trying to replicate somebody's half-assed Celery/Rabbit configuration that they didn't nail down and test properly in a clean-room environment.
If task_C returns a value which no other task cares about, it will insert the value into the queue, and never gets consumed. This is why dead queues (also known as "tombstones") happen.
Always remember to set ignore_result=True for tasks which don't return any consumed value.
EDIT: "Tombstones", not gravestones
Do note that if you use the chord pattern (http://celery.readthedocs.org/en/latest/userguide/canvas.htm...) anywhere, you must set ignore_result=False
With that experience, we wrote a task queue using Redis & gevent that puts visibility & tooling first: http://github.com/pricingassistant/mrq
Would love to have some feedback on that!
I'm not very happy with the community either. What with the dispersed, incomplete documentation, multiple discussion forums, and snide responses, I'm really getting ready to wash my hands of it.
However it is a complete rewrite because we felt we couldn't add gevent support and other features to provide extreme visibility without major changes. If you don't need those 2 things, you may want to check out RQ instead for now, it's still a very good piece of software.
I see now that mrq supports concurrency why python-rq does not (at least in a stable fashion).
I'll try mrq for the gevent integration. It's great that you guys are actively working on improving it. Python-rq is great too, but it hasn't been updated in a while and I don't think concurrency is on the radar.
And don't get me started on RabbitMQ.
I asked about a progress bar, for a long running web request on Stack Overflow, and Celery seemed to be the accepted way to do that.
I manged to get it set up eventually. Realized a month or so back that it hasn't been running, and it has taken me about 3 times as long to get it up again as it did on the first try. I am sure there must be an easier way.
1. Use task specific logging if you have a bunch of task: http://blog.mapado.com/task-specific-logging-in-celery/
2.Use statsd counters to keep track of basic statistics (counts + timers) for each task
3. Use supervisor + monit to restart workers after lack of activity (I have seen this happen a few times, but never been able to track down why it happens, but this is an easy fix)
I will say though Celery is probably overkill for a lot of tasks people think to use it for, in my case it was mandated to support scaling for a startup that never launched, partly because they kept looking at new technologies for problems they didn't have yet.
Consider how background jobs are typically managed with RabbitMQ, Redis, etc. They are usually created in an "after commit" hook from whatever gets persisted to your relational database. In this scenario, there is a gap between the database transaction being committed and the job being sent to and persisted by RabbitMQ or Redis; during this gap the only record of that task is being held in a process's memory.
If this process gets killed suddenly during this gap, that background job will be lost forever. It sounds unlikely, but if RabbitMQ or Redis is down and the process has to sit and retry, waiting for them to come back online, the gap can be sizable.
The systems that use these kinds of tools are usually not structured in a way that they need to wait for something in the database to be stored. By nature they are async tasks and they should be able to run whenever and return sometime in the future, and they will most likely produce some kind of result in the database, so there is no reason to store the job information itself in the database.
Jobs are usually not created as hooks after a database commit, so jobs being persisted with database transactions is not quite relevant and Celery has failure mechanism and ways to recover if it was not able to send a task to the broker (ie. RabbitMQ was down).
Redis and RabbitMQ do have a mechanism of persisting jobs onto disk as well so they don't get lost when the process is restarted. So there is no way that a job get's lost forever as you say, if you handle all these cases correctly.
One more thing, Python's database drivers don't work quite as you've described. Namely they don't (by design) make use of the autocommit feature of the database engine, rather they wrap every sql statement in a transaction, so either way each statement get's executed separately in it's own transaction. This would not guarantee, let's say a db record being added and the job being saved as well. You would have to use explicit atomic blocks (something a kin to what Django >= 1.6 has) to get both things or none to be persisted.
I agree with your point about polling being bad, however as someone pointed out below it's not an issue with Postgres's LISTEN/NOTIFY (and I added a note to the queue_classic gem which makes this easy to take advantage of in MRI Ruby).
Obviously I'm aware that Redis and RabbitMQ persist jobs. That's not what I was talking about at all.
I think we're on different wavelengths here so I'll let it be. :-)
I'm not aware of how ruby does it other then hearing about 2 most popular solutions that are used with rails.
As for Postgres's PUB/SUB I've replied to the reply that mentioned that and you can't really use that. Would be great if we cold update the broker driver and see if we could get support though.
Not according to Jim Gray. See "THESIS: Queues are Databases"
2- (pdf) http://research.microsoft.com/pubs/69641/tr-95-56.pdf
Honestly I think that's the ideal way of doing things, however that's not often how you see it done.
Also, take a look at #5 for a great monitoring stats solution.
You must be storing enough representation of the job to re-send it though, no? You couldn't do this with a simple flag for all job types I imagine.
> in most cases it's just the absence of the desired effect that can be a giveaway
Sure, but not all jobs create side effects you can easily look for (e.g. sending an email). And secondly, you'd have to create a "sweeper" process for every type of job you create then.
With the psycopg2 module, you can use this mechanism together with select(), so your worker thread(s) don't have to poll at all. They even have an example in the documentation.
Just slides though. Haven't gotten around to writing a post about it yet.
Unfortunately if you're using JRuby you can't benefit from this, as the Postgres JDBC driver does polling.
Of course, logging_tree is a great tool as well!
Rq is a lot smaller, more than 10x by line count. So if it works just as well, I'd go with the simpler implementation.
That is why I decided to use Rq, it is better to know limitations of something simple then know possibilities but not able to make choice.
celery is like a .50 caliber machine gun, industrial strength, lots of options, used for a variety of completely different use cases.
For simple stuff, use rq, but celery + rabbitmq work better if you have dozens and dozens plus worker nodes (ie: different servers), whereas with rq, you use redis, which could potentially be a SPOF, even with redis sentinel.
It always depends on your use-case but generally you want your application to behave correctly, which means it has to have correct/fresh data...you can't sacrifice correctness because of an inability to scale your database.
If you are processing a lot of data in Celery, you really want to try to avoid performing any database queries. This might mean re-architecting the system. You might for example have insert-only tables (immutable objects) to address this type of concern.
It’s a sadly under-rated ingredient! The flavor is subtle but unmistakable.
- use gevent + gunicorn, or Tornado, in order to keep a socket open while the worker is processing the task?
- use polling? (less efficient)
- use websockets (but then the implementation is perhaps a bit more complex)
can you do this simply using Flask?
If your ajax request requires long task processing and requires you to wait for it than this is not a background task any more, it's done in one of the web server threads, and even if the thread outsources the task to another process it's still waiting on that proces to finish before returning the ajax response. This is bad.
I'm not entirely convinced about websocket solutions in Python yet, but I've been told flask-websockets is awesome. Nevertheless this doesn't solve the problem for you. Cause the request is just keeping an open line and waiting for a respone....blocking is bad.
The most simplest advise I would have is to have the ajax request trigger a background task and return immediately. The background task will then have some kind of side effect (ie. write some result to a database somewhere) which the ajax request can the look for with some kind of polling mechanism (on some other endpoint). Of course you can complicate this a lot, depending on your needs, but this seemed like the most straightforward solution.
"The most simplest advise I would have is to have the ajax request trigger a background task and return immediately. The background task will then have some kind of side effect (ie. write some result to a database somewhere) which the ajax request can the look for with some kind of polling mechanism (on some other endpoint)."
Wow, overkill much? Polling is bad, and is exactly the kind of bad solution that a lot of these libraries are in place to prevent developers from needing to do.
Websockets were made to solve the long-polling and poll-spamming that was prevalent. Now all you have to do is keep a light, open web-socket connection to the server. And the server, being async/evented, will respond when the task is good and ready. Nice and clean.
"Wow, overkill much? Polling is bad, and is exactly the kind of bad solution that a lot of these libraries are in place to prevent developers from needing to do." - Polling is not bad if you have a good use case. You just cannot do non-blocking stuff with Django for instance, or it's very very hard and tricky. Websockets also limit you with the number of connections you can have open at once.
I was thinking whether using something like gevent or Tornado, a bit like nodejs, would enable the webserver to keep the socket open without blocking while the computation is made in a worker, then return the result simply to the socket, thus avoiding having to write a more complex websocket-based or polling-based system, but rather using AJAX transparently :)
Polling seems to be the best way to do it, as it doesn't leave sockets open, and doesn't require a websocket enabled browser.
The implementation I'm working on involves keeping the task metadata in the DB, and polling against that lookup (it makes it easier to do things like restrict task results to specific users as well).
I was also thinking that another way to do it could be to write the result in its final format to a /ajax_output/ directory with a randomly generated name. Then your polling would depend entirely on nginx, which could end up being much more efficient than running through your application framework. Just make sure you regularly clean unused files if you have privacy concerns.
a > Redis uses less memory
b > Redis is easier to setup
Comments, feedback and questions welcome :-)
Distributed just means that you can have your task processing spread out across multiple machines.
A specific example would be, let's say, after your user registers on your website for the first time you wan't to get a list of all his facebook/twitter friends. This action will take a long time and is not vital to the whole registration/login process so you set a task to do that later, and let the user proceed to the site and not make him look at the spinner the whole time, and when the friend list becomes available it will show up on the website (on his profile or whatever). Makes sense?
Eg, execute other tasks only if there are no pending important tasks.
AMPQ = Advanced Message Queuing Protocol so it's wrong to say that a message broker is "an AMQP". Also, give Redis a try - it's much easier to set up and uses fewer resources.
We should probably talk about the elephant in the room when addressing newbies: the Celery daemon needs to be restarted each time new tasks are added or existing ones are modified. I got past that with the ugly hack of having only one generic task but people new to Celery need to know what they're getting into.