A customer of mine has two projects. One running on their own hardware, Django + Celery. The other one running on AWS EC2, Django alone.
In the first one we use Celery to run jobs that last from a few seconds to several minutes. In the other one we spin up a new VM for each job, let it run the job, and have it destroy itself when the job finishes. Communication happens over a shared database and SQS queues.
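The self-destroying-VM pattern can be sketched roughly like this (a minimal sketch, not the customer's actual code; on EC2 it assumes the instance's shutdown behaviour is set to "terminate", so powering off from inside the guest destroys the instance):

```python
import subprocess

def run_job_then_self_destruct(job, dry_run: bool = False) -> list[str]:
    """Run a job, then power off the VM from the inside.

    Sketch of the VM-per-job pattern: on EC2, with instance-initiated
    shutdown behaviour set to 'terminate', shutting down from inside the
    guest destroys the instance. Names here are hypothetical.
    """
    shutdown_cmd = ["sudo", "shutdown", "-h", "now"]
    try:
        job()  # do the work; results are reported via the shared DB / SQS
    finally:
        if not dry_run:
            # Actually power off; in dry_run mode just report the command.
            subprocess.run(shutdown_cmd, check=False)
    return shutdown_cmd
```

The `dry_run` flag is only there so the sketch can be exercised without shutting down the machine running it.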
We have periodic problems with Celery: workers losing their connection to RabbitMQ, Celery itself getting stuck, and gevent issues possibly caused by C libraries, though we can't be sure (we use prefork for some workers, but not for everything).
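For what it's worth, these are the Celery settings most often tuned when workers keep dropping their broker connection or getting stuck; the values below are illustrative, not a recommendation for this specific setup:

```python
# Celery configuration commonly tuned for flaky RabbitMQ connections
# and stuck workers (values are illustrative).
broker_heartbeat = 30                      # detect dead TCP connections sooner (default 120)
broker_connection_retry_on_startup = True  # keep retrying the broker at worker startup
broker_connection_max_retries = None       # retry forever instead of giving up
task_acks_late = True                      # re-deliver a task if its worker dies mid-run
worker_prefetch_multiplier = 1             # don't let a wedged worker hoard queued tasks
worker_max_tasks_per_child = 100           # recycle prefork children to limit leak buildup
```

`task_acks_late` plus a low prefetch multiplier means a hung or crashed worker loses at most the task it was executing, at the cost of requiring tasks to be idempotent.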
We've had no problems with the EC2 VMs. By the way, we use VirtualBox to simulate EC2 locally: a Python class encapsulates the API to start the VMs, using boto3 in production and VBoxManage in development.
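That launcher abstraction might look roughly like this (class and method names are my own invention, not the customer's code; the production branch would call boto3, while the development branch shells out to VBoxManage):

```python
import subprocess

class VmLauncher:
    """Hypothetical sketch of a VM launcher backed by EC2 in
    production and VirtualBox in development."""

    def __init__(self, env: str):
        self.env = env  # "production" or "development"

    def start(self, name: str) -> list[str]:
        if self.env == "production":
            # In production this would go through boto3, roughly:
            #   import boto3
            #   ec2 = boto3.client("ec2")
            #   ec2.run_instances(ImageId="ami-...", MinCount=1, MaxCount=1)
            raise NotImplementedError("requires boto3 and AWS credentials")
        # In development, shell out to VBoxManage instead.
        cmd = ["VBoxManage", "startvm", name, "--type", "headless"]
        # subprocess.run(cmd, check=True)  # disabled here: needs VirtualBox installed
        return cmd

launcher = VmLauncher("development")
print(launcher.start("job-vm"))  # → ['VBoxManage', 'startvm', 'job-vm', '--type', 'headless']
```

The point of the class is that the Django code only ever calls `launcher.start(...)` and doesn't care which backend is behind it.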
What I don't understand is: it's always Linux, amd64, and RabbitMQ, yet my other customer, using Rails and Sidekiq, has no problems, and they run many more jobs. Something in the concurrency stack inside Celery is too fragile.
I can share the sentiment. I had to work with Celery years ago, and the maintenance burden and footguns exceeded expectations. The codebase and docs are also a bit messy; it's a huge project used and contributed to by many, so it's understandable, I guess.
Anyway: Argo if you are on K8s, something else if you aren't. And if you are a startup and need speed, just go with something like procrastinate.
Migrated from Celery to Argo Workflows. No wisdom to share, as it was straightforward. You lose a lot of startup speed though, so it's not a drop-in replacement and is only a good choice for long-running workflows. Celery was easier than Argo Workflows; Celery is really easy to get started with. I like Airflow the best, but it's closer to Argo Workflows in targeting long-lived workflows. I hope to try Hatchet soon. I've read Temporal is even harder to manage.
Has anyone here done the migration off of Celery to something else? Any wisdom?