
When a Resource Scheduler Is Not Enough for GPUs - jamesblonde
https://www.logicalclocks.com/optimizing-gpu-utilization-in-hops-with-sparks-dynamic-executors/
======
KaiserPro
We have a similar problem, though we are not using Spark for machine learning
(we use it further down the pipeline).

The basis of our system is AWS Batch, which, with some wrappers, is a
reasonable scheduler (we needed to build a monitoring framework and a simple
way of specifying jobs and the relationships between them). From there we have
a number of compute resources: if your job needs lots of CPU, into the CPU
queue it goes, and depending on how much CPU/memory you ask for, it will land
on a machine alongside other jobs.

If you need a GPU, you select the GPU queue. When you've finished with the
GPU, your job should die.

It's really that simple.
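
Roughly what that looks like through the Batch API with boto3 (a sketch: the
queue and job definition names here are made up):

    # Sketch only: queue names and job definitions are invented.
    import boto3

    batch = boto3.client("batch")

    # CPU-bound job: Batch bin-packs it onto a machine with other
    # jobs based on the vCPUs/memory it asks for.
    batch.submit_job(
        jobName="transcode-0042",
        jobQueue="cpu-queue",
        jobDefinition="transcode-job",
        containerOverrides={
            "resourceRequirements": [
                {"type": "VCPU", "value": "8"},
                {"type": "MEMORY", "value": "16384"},  # MiB
            ]
        },
    )

    # GPU job: goes to the GPU queue and holds the GPU only while
    # the container runs; the job dying frees it for the next one.
    batch.submit_job(
        jobName="train-0042",
        jobQueue="gpu-queue",
        jobDefinition="train-job",
        containerOverrides={
            "resourceRequirements": [{"type": "GPU", "value": "1"}]
        },
    )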

The problem with the new crop of (non-VFX) schedulers is that they handle
arbitrary placement constraints very badly. For the longest while, making sure
that disk-heavy jobs didn't land on a node hosting a DB in K8s was a massive
pain (it might be solved by now).

In managers like Alfred from Pixar, which must be pushing 20 years old, it is
fairly simple to attach arbitrary tags to nodes to limit which job types land
on a given box.

tl;dr:

AWS can provide this functionality out of the box, but it's really not that
full-featured. These features, and arbitrary tags, already exist in VFX
schedulers and have for many years. New-wave managers are lacking compared to
their VFX brethren.

~~~
dankohn1
See inter-pod anti-affinity on Kubernetes to ensure "disk-heavy jobs didn't
land on a node hosting a DB":

https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#affinity-and-anti-affinity
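
A minimal sketch of a pod spec that keeps a disk-heavy job off any node
already running a pod labelled app=db (the labels and names are illustrative):

    # Illustrative only: label keys/values and names are made up.
    apiVersion: v1
    kind: Pod
    metadata:
      name: disk-heavy-job
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values: ["db"]
              topologyKey: kubernetes.io/hostname
      containers:
        - name: worker
          image: busybox
          command: ["sh", "-c", "echo placeholder-for-disk-heavy-work"]

The required... form is a hard constraint; the
preferredDuringSchedulingIgnoredDuringExecution variant makes it best-effort
instead.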

