
Ask HN: How would you queue and process 10K+ long running jobs - bballer
Hey guys, I wanted to ask what technologies and methodologies you would architect together if you needed to constantly queue up 10K+ jobs, distribute the work out, and then report that it was completed. It would be required that you can never schedule a duplicate job while one is queued/running, and you need to ensure that each job gets picked up by only one worker and run once. These jobs could last anywhere from 30 seconds to 1 hour (in run time).

I've tried googling but my fu is failing me. Would love to hear the thoughts of people who have solved similar problems.

Thanks!
======
sethammons
You have an interesting requirement: each job only gets picked up by 1 worker
and run once (with up to an hour long job).

I'll contend that you can't do that. You can get at-least-once or at-most-once
delivery. Let's say that you go with at-least-once.

Just use a queue like RabbitMQ. Workers connect, request work, ack that they
are done, and you should be good to go. Done. You can set this up today if you
want.

If you need more thorough duplicate detection, you could sprinkle in some
redis to store job state (in progress / complete). Using atomic operations
like INCR/DECR on your key, you could pull a job from Rabbit (or your queue of
choice), hit redis to ensure that the job is not in progress or already
complete due to a network error between Rabbit and your workers, and then
proceed appropriately.

The key problem here is that the network could drop requests. You could pull
from your queue, complete the work, and think you ack'd, but the queue never
gets the ack, so it hands out the work again to a new worker after the lock
expires on it. So you could mitigate that with an additional layer of a
distributed KV store. But that could have the same problem.
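To make the redis idea concrete, here is a minimal in-process sketch (my own illustration, not the parent's actual system; class and method names are invented): `putIfAbsent` plays the role of Redis's atomic `SETNX`, so only the first worker to claim a job ID proceeds.

```java
import java.util.concurrent.ConcurrentHashMap;

// In-process analogue of the Redis check (SET jobId "in-progress" NX).
// putIfAbsent is atomic, so only one caller can win the claim for a job.
public class JobStateStore {
    private final ConcurrentHashMap<String, String> state = new ConcurrentHashMap<>();

    /** Returns true if this caller won the claim; false if the job is
     *  already in progress or complete. */
    public boolean tryClaim(String jobId) {
        return state.putIfAbsent(jobId, "in-progress") == null;
    }

    public void markComplete(String jobId) {
        state.put(jobId, "complete");
    }

    public String stateOf(String jobId) {
        return state.get(jobId);
    }
}
```

In the real setup the map would be Redis, reachable by every worker, and the entries would carry a TTL so a crashed worker's claim eventually expires.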

I run a system that processes billions of events a day and we use a system
very similar to what I described above (though we have a custom queue solution
and a pool of redis nodes that we have some custom quorum logic around). We
hardly seem to duplicate any jobs (maybe a handful a week).

If you use kafka, and only use the java clients, they say you can get exactly
once delivery. See [https://www.confluent.io/blog/exactly-once-semantics-are-possible-heres-how-apache-kafka-does-it/](https://www.confluent.io/blog/exactly-once-semantics-are-possible-heres-how-apache-kafka-does-it/).

The way they do it is by controlling both the client and the server, sprinkling
in some write-ahead logs with logical clocks, and using a formal consensus
protocol under the hood. Even with all that, I'm skeptical of the
exactly-once claim.

------
kwillets
I did a lot of troubleshooting on a system like this a year or two ago, and
most of it came down to making sure that global state transitions are atomic,
and making communications as robust as possible.

We had the basics of execute-once using a leasing pattern, but we had a number
of bugs related to multiple instances of a task existing in different threads
(the executor would load the task object and then fork, leaving two instances
in possibly stale states; I also found failure paths that left multiple
instances running). We also saw a number of daily double-executions related to
the lease-renewal process freezing, or to non-transactional state transitions.

We added a lot of state-transition auditing, including a pid/thread ID to find
out where updates were coming from.

IIRC I eventually settled on having the executor (queue listener) do every
possible check prior to execution (checking resource limits, process count
limits, etc.) without loading the task instance itself (just the ID from the
queue message). After the fork the child loads the task and does a single
transaction that deletes the queue message and creates the execution record
(the one-and-only-one run, basically). Every failure up to that point will
requeue, but once the run is created, the queue message has to be deleted. We
then transition to leasing the execution, and mark it failed if the lease
expires.
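A toy model of that one-and-only-one transition (my own sketch, not the parent's code; a `synchronized` block stands in for the single database transaction that deletes the queue message and creates the execution record):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model of the one-and-only-one run: a single atomic step consumes the
// queue message and creates the execution record, both or neither. In a real
// system this would be one database transaction; synchronized stands in here.
public class ExecutionLedger {
    private final Set<String> queueMessages = new HashSet<>();
    private final Map<String, String> executions = new HashMap<>(); // taskId -> state

    public synchronized void enqueue(String taskId) {
        queueMessages.add(taskId);
    }

    /** Atomically delete the queue message and create the execution.
     *  Returns false if the message is gone or a run already exists, so a
     *  redelivered message cannot start a second run. */
    public synchronized boolean beginExecution(String taskId) {
        if (!queueMessages.contains(taskId) || executions.containsKey(taskId)) {
            return false;
        }
        queueMessages.remove(taskId);
        executions.put(taskId, "running");
        return true;
    }
}
```

Every failure before `beginExecution` succeeds just requeues; once the execution record exists, a duplicate delivery bounces off the `containsKey` check.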

We also created a centralized service to renew the leases on the execution
objects after we found that to be a failure point. Long-running processes just
have a lot of problems keeping connections open, etc.

------
saluki
Are you choosing a tech stack to do this?

Laravel has this built in with queues.

[https://laravel.com/docs/5.7/queues](https://laravel.com/docs/5.7/queues)

You can run multiple workers, it will intelligently distribute the jobs, and
there is a Laravel Horizon package that can handle monitoring of the
queues/jobs.

I expect Rails would have something similar, but I haven't used queues in
Rails.

~~~
bballer
We run all Java so that is a no-go, and at the scale I'm talking about I don't
think I would trust those solutions. Thanks for the input though!

------
Sahhaese
Any message queue or similar could help with this.

Popular solutions:

* RabbitMQ

* Service Bus

* Kafka

RabbitMQ would be well suited, you define a producer and can then spin up as
many consumers as you would like, each consuming from the same queue.

Preventing duplicate queuing should be done on the producer before it is
queued.

It depends on the nature of the scaling and how much durability you want,
though; you may wish to simply maintain an atomic queue of work to be done, in
which case any thread-safe list would suffice as long as changes were made
atomically.
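As a sketch of the "thread-safe queue plus N consumers" shape (an in-process toy under my own names, not a RabbitMQ client): a `BlockingQueue` hands each job to exactly one worker thread, which is the same single-delivery property the broker gives you across machines.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

// One producer fills a thread-safe queue; N worker threads drain it.
// Each poll() hands a given job to exactly one worker.
public class WorkerPool {
    public static int runAll(int jobCount, int workers) throws InterruptedException {
        BlockingQueue<Integer> queue = new LinkedBlockingQueue<>();
        for (int i = 0; i < jobCount; i++) queue.add(i);

        AtomicInteger processed = new AtomicInteger();
        Thread[] pool = new Thread[workers];
        for (int w = 0; w < workers; w++) {
            pool[w] = new Thread(() -> {
                Integer job;
                while ((job = queue.poll()) != null) {
                    processed.incrementAndGet(); // do the real work here
                }
            });
            pool[w].start();
        }
        for (Thread t : pool) t.join();
        return processed.get();
    }
}
```

This only works inside one process, of course; the moment workers live on different boxes you are back to a broker or a database playing the role of the queue.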

10k isn't _that_ much in the grand scheme of things, a simple database could
easily store such a queue if you didn't have that many nodes trying to consume
the same table at once.

You would need to write stored procedures for transactional read-and-delete
and inserts to prevent duplication of jobs.

Alternatively, something like Redis might be good; it can also act as pub/sub
and be used for messaging.

Would you look to scale up more consumers as the queue lengthened, or would
there still be a fixed number of consumers?

~~~
bballer
Thanks for your comment! Don't have enough time to fully respond to it right
now but will get back to you in the morning.

------
lfx
This could easily be done with AWS SQS. You can put as many elements on the
queue as you want (some limits apply, but they may be easily lifted) and then
remove items from the queue when the job is done; by tuning the visibility
timeout you can make it work for your requirements. You can use Lambdas (too
short-lived for your case), EC2, or Fargate as your workers, and it can scale
up or down depending on load. What's cool is that you can create multiple
queues if you can predict how long jobs will take, so some could be done in
Lambda and others on EC2, thus reducing costs.
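To illustrate the visibility-timeout mechanics (a toy in-memory model with invented names, not the SQS API; time is passed in explicitly so the sketch stays deterministic):

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of an SQS-style visibility timeout: a received message becomes
// invisible until the timeout passes; if it is not deleted by then, it is
// redelivered to the next receiver.
public class VisibilityQueue {
    private final Map<String, Long> invisibleUntil = new HashMap<>();
    private final long timeoutMillis;

    public VisibilityQueue(long timeoutMillis) { this.timeoutMillis = timeoutMillis; }

    public void send(String messageId) { invisibleUntil.put(messageId, 0L); }

    /** Claims the message if it is currently visible. */
    public boolean receive(String messageId, long nowMillis) {
        Long until = invisibleUntil.get(messageId);
        if (until == null || nowMillis < until) return false;
        invisibleUntil.put(messageId, nowMillis + timeoutMillis);
        return true;
    }

    /** The worker calls this once the job is done. */
    public void delete(String messageId) { invisibleUntil.remove(messageId); }
}
```

This is exactly why the timeout must exceed the worst-case job length (up to an hour here), or a slow-but-healthy worker's job gets redelivered mid-run.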

~~~
bballer
Thanks for your insight. This is currently close to what we have in production
for one of our systems and we are building a brand new system that shares a
lot of the same requirements, just fishing to make sure we aren't missing
anything :]

------
mtmail
Look for job scheduling software plus the name of your programming language of
choice. Or background processing. [https://sidekiq.org/](https://sidekiq.org/)
is one for Ruby for example,
[https://aws.amazon.com/sqs/](https://aws.amazon.com/sqs/) one that runs in
the cloud. Those pages should give you more words to search for as "job
scheduling" and "queue" gives too many non-software related results.

~~~
wallflower
Amazon SQS is not an ideal choice because it is not designed to be a “forever”
queue. Messages will expire in two weeks.

~~~
rubenhak
He doesn't seem to need a forever queue. Processing might take a while, but he
just needs to get each job processed once (and exactly once) and move on. If
he wants to do event sourcing, then yes, SQS would not suffice, but I don't
think he needs that. SQS seems like a great choice.

~~~
bballer
Yeah we are already using SQS for a similar system that's currently in
production. We don't need a forever queue as part of the goal is to keep the
queue as empty as possible at all times. What is really important is making
sure that only one job (with a distinct set of params) can be scheduled at one
time and run by only one worker at a time.

------
shoo
it might be relevant to say how much throughput you need, or other factors
such as whether your problem is inherently concurrent (reacting to events
outside your control) or whether you actually just want to do parallel
processing.

for example, "10k+ jobs" sounds like a large number, but depending on
throughput perhaps it is trivial.

i have a hobby project to fetch data from external sources and store the
results in a database. this has about 70k different jobs defined. each job is
scheduled to be run at some frequency. i run everything on a single physical
box with 4gb of ram and a low energy CPU. The worker processes are python
scripts, the job queue state is stored in the same postgres database i use to
store results. My throughput is low; i only need to process a job every few
seconds. The workers run on the same box as the database as I am too lazy to
maintain more machines and too cheap to rent cloud servers. Running costs are
about $15 / year for energy.

The queue implementation is based on this: [https://blog.2ndquadrant.com/what-
is-select-skip-locked-for-...](https://blog.2ndquadrant.com/what-is-select-
skip-locked-for-in-postgresql-9-5/)

from memory i think i am using primitives offered by the database to prevent
multiple workers from acquiring the same job (transactions, transaction
isolation). this might not be very scalable, but i only have two worker
processes and each job takes seconds to process.
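an in-process analogue of the `SELECT ... FOR UPDATE SKIP LOCKED` pattern from that post (illustrative only, with made-up names; real workers would run the SQL): each worker takes the first row it can claim and skips rows another worker already holds, instead of blocking on them.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

// In-process analogue of SELECT ... FOR UPDATE SKIP LOCKED: scan the job
// table, take the first unclaimed row, skip rows another worker holds.
public class SkipLockedTable {
    public static class JobRow {
        final AtomicBoolean claimed = new AtomicBoolean(false);
        public final String id;
        public JobRow(String id) { this.id = id; }
    }

    /** Returns the first unclaimed job, or null if every row is taken. */
    public static JobRow claimNext(List<JobRow> table) {
        for (JobRow row : table) {
            // compareAndSet is the atomic "lock this row" step
            if (row.claimed.compareAndSet(false, true)) {
                return row;
            }
        }
        return null;
    }
}
```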

do you want to do what I'm doing? Probably not. but perhaps what you are doing
is easy.

------
ecesena
There are great comments on the technology itself, including
SQS/PubSub/RabbitMQ, or Celery.

However, if you need 1) management, i.e. an easy way to look at what went
wrong and retry, or 2) dependencies between jobs, you should look into
something like Airflow (instead of building your own):
[https://airflow.apache.org](https://airflow.apache.org)

It's also a very good example of architecture, in case you decide it's not
good for you and you really want to build your own.

------
superasn
We did something similar on our site, but the max duration of each job was 5
minutes, so this solution may not be totally relevant to you. What we're doing
is we've created an AWS Lambda function that is triggered when a file is
written to S3.

So instead of traditional SQS we just write unique files to S3 with the job
data, and that triggers the Lambda function to process the job and notify a
URL upon completion.

~~~
bballer
Yeah, Lambda functions won't suffice for the kind of jobs we run. Plus we are
a full Java shop and we don't want to pay for the cold start times on Lambda
for the jobs that do end up taking < 1 minute.

Thanks for your thoughts!

------
linksnapzz
Have you considered...a traditional batch system, like OpenPBS, SLURM, or
Gridengine? That sorta sounds like the tasks they were meant to solve...

------
iAm25626
The following come to mind.

Python based:

[http://www.celeryproject.org/](http://www.celeryproject.org/) <-- async task
queue

[https://github.com/spotify/luigi](https://github.com/spotify/luigi) <-- more
pipeline-centric

There are many like them; RQ is more bare-bones.

------
rubenhak
You should provide more info regarding your environment. If you're running
this in a public cloud, tell us which one. Every provider has several native
queue services for different needs, which makes things easier to work with and
means you worry less about setting things up.

~~~
bballer
For workers we run only Java and run everything on AWS. We have an existing
production system that uses a combination of SQS, DynamoDB, Postgres, and EC2
to achieve something very similar to this. Just want to check all the
boxes before we dive into building out something for a new system coming into
production that shares many of the same requirements.

~~~
rubenhak
SQS FIFO should let you process each task once and only once. Just make sure
you configure the timing parameters correctly.
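For intuition, here is a toy model of SQS FIFO deduplication (not the real API, and the names are mine; real SQS uses a fixed 5-minute window, made a parameter here so the sketch stays testable): a second send with the same deduplication ID inside the window is dropped.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of FIFO-queue deduplication: a second send with the same
// deduplication ID inside the window is rejected as a duplicate.
public class FifoDedup {
    private final Map<String, Long> lastSeen = new HashMap<>();
    private final long windowMillis;

    public FifoDedup(long windowMillis) { this.windowMillis = windowMillis; }

    /** Returns true if the message is accepted, false if deduplicated. */
    public boolean send(String dedupId, long nowMillis) {
        Long prev = lastSeen.get(dedupId);
        if (prev != null && nowMillis - prev < windowMillis) {
            return false;
        }
        lastSeen.put(dedupId, nowMillis);
        return true;
    }
}
```

Note the window only covers duplicate *enqueues*; since his jobs can run up to an hour, duplicate scheduling beyond the window still needs a job-state check on the producer side.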

DynamoDB has triggers that fire upon changes. That would strongly help with an
eventually consistent implementation (which I strongly recommend). But for
this you would have to write Lambdas; check how well Java is supported on
Lambda.

Are you sure you want to use EC2 directly? Why not use ECS? That would let you
focus more on the business and less on the infrastructure.

------
atomashpolskiy
I've done this for running long-lived subscriptions in a trading platform
back-end service. The load is similar. Basically, you need a cluster software
with sharding option. I personally used Akka Cluster Sharding (a Scala
library, which also happens to have Java bindings).

It starts a network node in each instance of the service and binds all nodes
into a cluster (discovery of other nodes is left for the service developer;
simplest solution is to have a static list of node addresses). The sharding
mechanism allows you to distribute arbitrary data objects among the nodes
according to some rules (e.g. based on the value "object's hash modulo number
of nodes in cluster", which produces a typical hashring). Data objects may
originate on any node (e.g. on schedule or on some external event). They also
need to be serializable to be passed between nodes and, obviously, the binary
representation should not be too big. So, depending on the nature of jobs in
your case, you may want to pass only the job ID as the data object and store
the actual job definition and/or arguments in a separate place (e.g. a
database).
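The "hash modulo number of nodes" placement rule can be sketched in a few lines (names are mine, purely illustrative; real Akka sharding also handles rebalancing and node failure):

```java
// Every node computes the same owner for a given job ID, so exactly one
// node in the cluster considers itself responsible for running the job.
public class ShardRouter {
    public static int ownerOf(String jobId, int clusterSize) {
        // floorMod keeps the result non-negative even if hashCode() is negative
        return Math.floorMod(jobId.hashCode(), clusterSize);
    }
}
```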

Now, Akka will guarantee that the job will be run by only 1 node and run once
(until completion, that is; the job may be started more than once, see the
tips below). But if many jobs are coming in, you risk overwhelming the
cluster (because passing messages between nodes takes time, jobs themselves
take time, etc.) So to defend your cluster against load, which it might not be
able to handle, you may put a message queue in front of it. This will let you
set up a max number of jobs that concurrently run in the cluster, and nodes
will take new jobs from the queue only when some of the currently running jobs
complete. Most MQs have persistence, so the jobs will be safe even if they
need to wait for a while sitting in the queue.

A few tips:

1) Akka persists a node's data, so the jobs that are taken from the queue are
going to be safe in case the node they are located on fails

2) If a node leaves the cluster, all its jobs will be moved to other nodes
(according to the sharding rules mentioned above) and started over, so you may
need to introduce some kind of transactions (the same goes for a cluster
restart)

3) If the whole cluster is restarted, unfinished jobs will be distributed
among the nodes according to the sharding rules, so it might be a good idea to
make these rules "sticky", so that each individual job is always assigned to
the same node (and each node will just have to load its jobs from its own
persistent store). Otherwise there might be a lot of message passing, which
will slow down the startup.

------
aprdm
The visual effects industry has been doing this forever to render frames.

Have a look at Tractor, Qube, Rush, or Flamenco (by Blender)

------
dmarlow
Can you elaborate on what the "job" is or does? Can things be batched?

~~~
bballer
I won't fully elaborate on what a job is, but I'll give you a couple of
examples:

Updating anywhere from 500 to 5M items over APIs that are rate limited.

Dumping datasets that have to be normalized and massaged into files and
dropped off at third-party servers, anywhere from every 15 minutes to once a
month. These files could contain anywhere from 500 lines to 5M lines.

Ingesting datasets just as large as described above, but massaged and saved
into our caches and DB.

------
blcArmadillo
Depending on what a job is, it seems like this could all be done with Jenkins.

~~~
bballer
Sorry but Jenkins doesn't achieve anything related to my question.

~~~
aprdm
How so? Jenkins is just a job scheduler; it has lots of options for
scheduling and dispatching jobs.

