
Ask HN: Best distributed job processing system in 2019? - sharmi
I have around 10 million network I/O related jobs that I would like to do in a short period.

So I hope to use a job queue to run them distributed across several servers.

I have used Celery in the past, but it is not quite reliable; workers often hang.

Features most important to me are multiple retries, restarting workers that are not responding, and the ability to monitor the status of the queue and workers. Nice-to-have features: cron scheduling, task chaining, high throughput.

Which is the most stable, reliable job queue out there? It would be preferable to support workers in multiple languages. The ones I would prefer are Python and Go.
======
gervu
Workers hanging is a thing that happens everywhere. (You might not have enough
resources if it's happening often, though.)

You should design the workers so that what needs to happen still happens in
the event of expected failures, or so that it at least fails gracefully and
with a useful paper trail. Failures happen, good engineering anticipates and
plans around them.

For example, you could schedule up to three attempts spaced at least five
minutes apart, set a timeout on jobs so they don't stay open indefinitely
(appearing to hang), have jobs that still fail get routed to a dead queue, and
make sure worker code behaves appropriately in response to internal errors and
improper input data (such as getting an HTTP error or unexpected MIME type)
while logging any unexpected states for later review. Most of the point of a
library like Celery is that it makes common strategies like these easier to
implement.
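Independent of any particular queue library, the strategy above can be sketched in a few lines of Python. This is illustrative only, not Celery's API: the function name, the list-based dead queue, and the constants are all assumptions.

```python
import time

MAX_ATTEMPTS = 3       # schedule up to three attempts
RETRY_DELAY = 5 * 60   # spaced at least five minutes apart, in seconds

def run_with_retries(job, attempt_fn, dead_queue, sleep=time.sleep):
    """Try attempt_fn(job) up to MAX_ATTEMPTS times.

    Jobs that still fail after the last attempt are routed to
    dead_queue for later inspection rather than retried forever.
    Returns the result on success, or None if the job was dead-lettered.
    """
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            return attempt_fn(job)
        except Exception as exc:
            # Log the unexpected state (HTTP error, bad MIME type, ...)
            # with enough context to leave a useful paper trail.
            print(f"job {job!r}: attempt {attempt}/{MAX_ATTEMPTS} failed: {exc}")
            if attempt < MAX_ATTEMPTS:
                sleep(RETRY_DELAY)
    dead_queue.append(job)
    return None
```

Per-attempt timeouts are left to whatever runs the worker (a signal, an HTTP client timeout, or the queue's own job-timeout setting), which is usually where they belong.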

You mentioned in a reply that the jobs are requests to external websites. The
rate of errors from that is going to be like a thousand times all other
sources of jobs not completing as expected unless something is hella weird
with your setup.

~~~
sharmi
Thank you for taking the time to respond. That is only part of the problem. I
am quite aware that there are a number of external issues that can affect a
job. Data extraction is something I have been working in for more than a
decade.

I have a few other pet peeves with Celery. I run scheduled tasks using
Celery Beat, but those tasks cannot be tracked from Flower. Signals don't work.
I would also like something language-agnostic so I can write memory- or
processing-intensive tasks in something more performant (Go or Rust).

------
shoo
Personally, I've cobbled something together using this:
[https://www.2ndquadrant.com/en/blog/what-is-select-skip-lock...](https://www.2ndquadrant.com/en/blog/what-is-select-skip-locked-for-in-postgresql-9-5/)

Storing the queue state and task results in postgres makes it easy to
integrate with workers in different languages, but you need to write the
library code to query and lock a free task to process.
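That claiming logic is small, though. Here's a rough sketch of what a worker runs, assuming a hypothetical `tasks` table (`id`, `payload`, `state`, `result` columns) and a psycopg2-style DB-API cursor with `%s` placeholders; none of this is from the linked post verbatim:

```python
# Claim exactly one free task. SKIP LOCKED means concurrent workers
# never block on each other: rows locked by another transaction are
# simply passed over.
CLAIM_SQL = """
    UPDATE tasks
       SET state = 'running'
     WHERE id = (
           SELECT id FROM tasks
            WHERE state = 'queued'
            ORDER BY id
            FOR UPDATE SKIP LOCKED
            LIMIT 1
           )
    RETURNING id, payload;
"""

def claim_task(cur):
    """Atomically claim one free task; return (id, payload) or None if empty."""
    cur.execute(CLAIM_SQL)
    return cur.fetchone()

def complete_task(cur, task_id, result):
    """Record the task result so other languages can read it from the table."""
    cur.execute(
        "UPDATE tasks SET state = 'done', result = %s WHERE id = %s",
        (result, task_id),
    )
```

Since the queue is just a table, a Go worker only needs the same two statements via database/sql, which is the language-agnostic part.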

I'm not sure how well this would scale for 10 million tasks in "a short
period". It works fine for me running the database and multiple workers on a
single machine with around 100k tasks that are scheduled and processed every
week or two.

> Features most important to me are multiple retries, restarting workers that
> are not responding, ability to monitor status of the queue and workers.

Some of these concerns might not be the responsibility of the job processing
system: you might just need to set up some monitoring and health checks to
restart services or machines if they stop responding.
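The monitoring side can be as simple as a heartbeat check that lives outside the queue entirely. A minimal sketch, with illustrative names (workers write a timestamp somewhere shared, a supervisor flags the stale ones for restart):

```python
import time

HEARTBEAT_TIMEOUT = 60  # seconds without a heartbeat before we restart

def stale_workers(heartbeats, now=None, timeout=HEARTBEAT_TIMEOUT):
    """Given {worker_id: last_heartbeat_epoch_seconds}, return the
    workers whose heartbeat is older than `timeout` and should be
    restarted by whatever supervises them (systemd, k8s, ...)."""
    now = time.time() if now is None else now
    return [w for w, last in heartbeats.items() if now - last > timeout]
```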

------
stephenr
I've been using Qless for a client recently.

The core logic is Lua that runs inside Redis itself, but each language
generally needs a client library to bridge between its native idioms and
the Lua core. I can't comment on the availability or quality of Python or Go
client libraries.

It's not perfect, but it's workable.

------
miraculixx
Interesting - I have had good experience with Celery, so I'm interested to
hear more about the problems you're encountering. In particular, Celery
provides all the features you are looking for, so it would be great to know
more about your specific issues.

Can you elaborate on your set-up?

------
shoo
Some ideas from prior hn discussion:
[https://news.ycombinator.com/item?id=15985103](https://news.ycombinator.com/item?id=15985103)

------
suff
Unfortunately queue doesn't always mean FIFO, as you might expect. Are you
submitting them all at once? Does response order matter?

~~~
sharmi
No, response order does not matter. These are individual requests to separate
websites. For me, the queue is just a way to distribute work among workers.

------
dcolkitt
SLURM is pretty solid, and designed to scale to supercomputer sized workloads.

------
deathtrader666
BEAM - The Erlang Virtual Machine

------
dlahoda
Have you looked into actor frameworks?

------
streetcat1
kubernetes.

~~~
shoo
Can you go into more detail? I understand k8s might be a fairly reasonable way
to start and supervise a large number of services, and I've seen it used to
execute one-shot batch jobs.

But I can't quite join the dots to see how you could have a distributed job
processing system with just k8s.

Would you need to use some other system (perhaps also running in k8s) to track
the queue of tasks and store task status & task results? Or can you get k8s
itself to act as the task queue?

~~~
pookeh
Have a look at Argo if you are interested in leveraging k8s infrastructure:
[https://github.com/argoproj/argo/blob/master/README.md#what-...](https://github.com/argoproj/argo/blob/master/README.md#what-is-argo-workflows)

For us, we settled on Netflix Conductor as it scaled pretty well and allowed
us to have fairly complex workflows, error paths, and retry logic. It's also
an independent, standalone piece of tech.

