
Ask HN: How would you build a website that runs long background jobs? - ryeguy_24
I am building a simple website whereby the user uploads data and the backend kicks off a long-running process (data computation and analytics). My preliminary approach is to use Python Flask with Celery and RabbitMQ for job execution. Is this a decent approach? Can anyone recommend alternative/better approaches?
======
atmosx
Your stack is as good as any. Do yourself a favour if you're building a
business... Don't look at it as an engineer. If you're proficient in python,
use Flask and Celery. If you're proficient in Ruby use Rails/Sinatra and
Sidekiq. These tools are battle-tested, solid. You can build a business around
them easily. If you ever hit any limits, then and only then look around for
more.

My advice to you is to master these tools instead of learning a new
language/stack/tool.

I see people talking about Elixir, Erlang, Phoenix, etc. It doesn't matter;
what matters is for you to deliver. Python is an excellent choice, the tools
you're talking about are solid... the only thing that matters now is for you
to deliver.

------
nanoscopic
I actually wrote a solution for this exact problem for SUSE Hack Week 2018. I
made a small job tracking / state tracking system using Perl and nanomsg. See
[https://github.com/nanoscopic/galear/blob/master/client/srv/...](https://github.com/nanoscopic/galear/blob/master/client/srv/www/galclient/lib/Server/NanoState.pm)

Essentially it is just a small server that listens on a nanomsg queue for new
tasks. Workers can then be created that periodically ping the server to grab a
task off the queue. Optimally a nanomsg queue would also be used to queue the
tasks out to the workers; I just built it this way since I could implement the
entire thing within hours and continue on with my hack week project.

The benefit of using nanomsg over several of the other message queues
suggested here is that nanomsg is a brokerless message queue, meaning that
there need not be any central queue tracking everything.

It is somewhat ironic in that sense that I essentially built a message broker
using it, but it demonstrates the simplicity of using nanomsg.

While the code there is written in Perl, it could easily be ported to any of
the other many languages nanomsg has libraries for.

In summary, to handle long-running background jobs:

1. Create as many "worker" processes as you need to be able to simultaneously process multiple long-running background jobs.

2. Have a way to feed tasks into those workers (in my case a central task tracker).

3. Have a way to feed the status of the workers back to something central so you can tell what is going on.
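The three steps above can be sketched with nothing but the Python standard library (threads stand in for the worker processes here, and the tuple-based job format is purely illustrative):

```python
# Worker pool + task queue + status feedback, stdlib only.
import queue
import threading

def worker(tasks, status):
    # Pull jobs until the shutdown signal (None) arrives.
    for job_id, payload in iter(tasks.get, None):
        status.put((job_id, "started"))
        result = sum(payload)                # stand-in for the long-running job
        status.put((job_id, "done", result))

def run_jobs(jobs, n_workers=2):
    tasks, status = queue.Queue(), queue.Queue()
    workers = [threading.Thread(target=worker, args=(tasks, status))
               for _ in range(n_workers)]
    for w in workers:
        w.start()
    for job in jobs:                         # step 2: feed tasks in
        tasks.put(job)
    for _ in workers:                        # one shutdown signal per worker
        tasks.put(None)
    for w in workers:
        w.join()
    # step 3: drain the central status feed
    return [status.get() for _ in range(status.qsize())]
```

A real deployment would replace the in-process queues with a transport like nanomsg (as above) or RabbitMQ, but the shape is the same.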

------
nathan_long
Elixir runs on the Erlang VM, which lets you spin up processes at will. Here's
an article of Phoenix tips - skip down to "AVOID TASK.ASYNC IF YOU DON’T PLAN
TO TASK.AWAIT" to see how you'd spin off a background job.

[https://dockyard.com/blog/2016/05/02/phoenix-tips-and-tricks](https://dockyard.com/blog/2016/05/02/phoenix-tips-and-tricks)

~~~
cutety
Would second this; Elixir/Erlang are probably the best tools for this kind of
job. The Erlang VM along with OTP was built to reliably run a ton of
(potentially long-running) processes. Elixir/Phoenix is especially great for
this kind of task if you need a web front end: you can use all the Elixir OTP
machinery directly, and with Phoenix's awesome channels (websockets) you can
publish job progress and results to the channel straight from those processes.
The best part is that you get all of this without having to set up an
intermediary queue/pub-sub (Redis, RabbitMQ); it's all built in and done in
the Erlang VM.

The downside is that you have to spend time really learning the actor
concurrency model and OTP concepts (things like GenServer and supervision
trees) to harness this power. That's not impossible, but it sounds like OP is
coming from Python, and while Elixir's syntax is closer to Python (or other
higher-level scripting languages), working with it is conceptually very
different from working in Python.

A generalized solution: break the task (one big long-running job) into a bunch
of small jobs (or processes) that are kicked off by a main job. Small jobs
execute quickly and, more importantly, are easier to keep idempotent (small
jobs with little to no state are easier to restart and debug than big jobs
with lots of state). Before kicking off the main job, record that it started
somewhere (database, Redis, GenServer, etc.). Then kick off the main job; if
you need to report progress, have the child jobs update it wherever you stored
the start info, and retrieve it either by polling some endpoint or via pub/sub
and websockets.

When the final job ends, mark the job finished and store the results (or a
reference to them) so clients can be notified through that same endpoint. If
you need to keep state across the jobs, use a database or something like Redis
as the central store for the global job data (which to choose depends on
whether you need the speed of a k-v store or the transactions/locking of a
database); if you go the Elixir route, a GenServer would do this instead.

I also couldn’t recommend redis enough: you can use it as a queue as well as a
pub/sub, reducing the number of dependencies if you go the Python/Ruby route.
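In Python terms, the fan-out-with-progress pattern described above might look like this (the plain dict stands in for Redis, a database table, or a GenServer; the chunking scheme and the toy work function are assumptions):

```python
# Main job fans out small idempotent chunks and records progress centrally.
import threading
from concurrent.futures import ThreadPoolExecutor

def run_tracked_job(job_id, data, store, chunk_size=2):
    """Split `data` into small chunks, process them, track progress in `store`."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    store[job_id] = {"state": "running", "done": 0, "total": len(chunks)}
    lock = threading.Lock()

    def process(chunk):
        partial = sum(chunk)                 # small, idempotent unit of work
        with lock:                           # progress a client can poll
            store[job_id]["done"] += 1
        return partial

    with ThreadPoolExecutor(max_workers=4) as pool:
        partials = list(pool.map(process, chunks))

    # final step: mark finished and store the result for later retrieval
    store[job_id].update(state="finished", result=sum(partials))
    return store[job_id]
```

A status endpoint would simply read `store[job_id]` and report `done/total`; with Redis you would swap the dict for `HSET`/`HGETALL` on a per-job hash.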

------
mindcrime
I generally do something similar, where I push a message onto a queue to
trigger a long-running job. I mostly work in Java, so I usually use a JMS
provider of some sort, like ActiveMQ or HornetQ. Depending on what is supposed
to happen on the receive side, I might run the job in a Java thread, or I
might use ProcessBuilder to spawn a native process.
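In Python, the "spawn a native process" branch of that receive side maps roughly from Java's ProcessBuilder onto `subprocess.run` (the message shape and the command are illustrative only):

```python
# Receive side of a queue consumer: run the queued job as a native process.
import subprocess

def handle_message(message):
    """Handle one queued job by spawning the command it names."""
    proc = subprocess.run(message["cmd"], capture_output=True, text=True)
    return proc.returncode, proc.stdout.strip()
```

Whether to run the job in-process (a thread) or out-of-process like this mostly comes down to isolation: a crashing native process can't take the consumer down with it.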

------
bdcravens
Celery (or in Ruby, Sidekiq) is a pretty simple approach. (I'd start with
Redis for the ephemeral storage if you want to keep it simple.) Capture the
request in your database and update it when the job finishes. If you need
client-side notification when the job finishes, use something like PubNub.
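The "capture the request in your database, update it when the job finishes" part might be sketched with stdlib sqlite3 (the table and column names here are assumptions; any database works the same way):

```python
# Job lifecycle as database rows: pending -> done.
import sqlite3

def init(db):
    db.execute("CREATE TABLE IF NOT EXISTS jobs "
               "(id INTEGER PRIMARY KEY, state TEXT, result REAL)")

def create_job(db):
    # capture the request before kicking off the worker
    cur = db.execute("INSERT INTO jobs (state) VALUES ('pending')")
    return cur.lastrowid

def finish_job(db, job_id, result):
    # called by the worker when the long job completes
    db.execute("UPDATE jobs SET state = 'done', result = ? WHERE id = ?",
               (result, job_id))

def job_status(db, job_id):
    # what a polling endpoint (or a PubNub publish) would report
    row = db.execute("SELECT state, result FROM jobs WHERE id = ?",
                     (job_id,)).fetchone()
    return {"state": row[0], "result": row[1]}
```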

~~~
ryeguy_24
What do you recommend for the actual job execution for production environment?
Is Celery sturdy enough?

~~~
shoo
> Is Celery sturdy enough?

It probably depends on what your exact requirements are, but celery is likely
fine. My last project had celery doing batch job processing for a
line-of-business enterprise web app. It was fine and flexible enough to do what we
needed (thousands of jobs a week, job scheduling, rabbitmq broker & postgres
result store, in use for years).

One thing to be aware of: if you're not running on Windows, celery worker
processes are forked, and by default Python's process abstractions (subprocess
etc.) will fight you when you try to launch processes from a celery worker.
This can be worked around, but it's a bit irritating.

This is probably obvious, but you want to ensure your celery worker processes
are run as proper services (e.g. under systemd) with good monitoring, and
configured to automatically restart if they crash (due to defects in
application-specific task logic, say).
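As one concrete (assumed) example of running the workers as auto-restarting services, a minimal systemd unit might look like this; the paths, user, and app module name are placeholders for your own layout:

```ini
# /etc/systemd/system/celery-worker.service — minimal sketch
[Unit]
Description=Celery worker
After=network.target

[Service]
User=celery
WorkingDirectory=/srv/myapp
ExecStart=/srv/myapp/venv/bin/celery -A myapp worker --loglevel=INFO
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
```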

Some tips / past discussions:

[https://khashtamov.com/en/celery-best-practices-practical-approach/](https://khashtamov.com/en/celery-best-practices-practical-approach/)

[https://denibertovic.com/posts/celery-best-practices/](https://denibertovic.com/posts/celery-best-practices/)

[https://news.ycombinator.com/item?id=7909201](https://news.ycombinator.com/item?id=7909201)

Sadly, the main issue I've seen with celery in the past is that it's a popular
open-source project with no income stream to fund development, so at times
swathes of reported bugs have been closed with "won't fix; we don't have the
resources".

------
87
Sounds good if you're already familiar with RabbitMQ. If not, I'd question how
much sense it makes for low-frequency, long-running jobs, given the learning
and complexity overhead.

------
osiutino
It depends, but usually polling a database in a loop isn't that bad. You don't
really need a message queue just to kick off something new.
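A sketch of that polling approach against stdlib sqlite3 (any database works the same way; the schema, the claim query, and the toy work are assumptions):

```python
# Worker that polls a jobs table instead of listening on a message queue.
import sqlite3
import time

def claim_next_job(db):
    row = db.execute(
        "SELECT id, payload FROM jobs WHERE state = 'pending' LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    # mark it running so other pollers skip it
    db.execute("UPDATE jobs SET state = 'running' WHERE id = ?", (row[0],))
    return row

def poll_once(db):
    job = claim_next_job(db)
    if job is None:
        return False                       # nothing to do this pass
    job_id, payload = job
    result = float(payload) * 2            # stand-in for the real work
    db.execute("UPDATE jobs SET state = 'done', result = ? WHERE id = ?",
               (result, job_id))
    return True

def poll_forever(db, interval=5.0):
    # the "loop checking the database" itself; sleep only when idle
    while True:
        if not poll_once(db):
            time.sleep(interval)
```

With multiple concurrent pollers you'd want the claim to be atomic (e.g. `UPDATE ... WHERE state = 'pending'` and check the affected row count), but for a single worker this is all there is to it.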

------
whb07
AWS Lambda for me. Just send the data out and fetch it later when it's ready.
Run a simple Python function to do your processing on the AWS side.

So far I'm pretty happy with it. It minimizes the need for maintaining Redis +
workers, etc.

~~~
kohanz
Lambda is not intended for long-running jobs; it would be costly to use it
that way.

I'm using Lambda to kick off an EC2 worker running a Docker image with AWS
Fargate. The EC2 instance only runs for the duration of the job (which lasts
from 5 to 20 minutes).

~~~
scprodigy
Hyper.sh is faster, launching your Docker image in 5 seconds.

------
knowsmorsecode
I use the Quartz scheduler to do background jobs.

