
Show HN: A simple scheduler for running jobs on GPUs - ExpectationMax
https://github.com/ExpectationMax/simple_gpu_scheduler
======
ole_tange
Examples simulated with GNU Parallel:

    
    
        simple_gpu_scheduler --gpus 0 1 2 < gpu_commands.txt
        parallel -j3 --shuf CUDA_VISIBLE_DEVICES='{=1 $_=slot()-1 =} {=uq;=}' < gpu_commands.txt
    
        simple_hypersearch "python3 train_dnn.py --lr {lr} --batch_size {bs}" -p lr 0.001 0.0005 0.0001 -p bs 32 64 128 | simple_gpu_scheduler --gpus 0,1,2
        parallel --header : --shuf -j3 -v CUDA_VISIBLE_DEVICES='{=1 $_=slot()-1 =}' python3 train_dnn.py --lr {lr} --batch_size {bs} ::: lr 0.001 0.0005 0.0001 ::: bs 32 64 128
    
        simple_hypersearch "python3 train_dnn.py --lr {lr} --batch_size {bs}" --n-samples 5 -p lr 0.001 0.0005 0.0001 -p bs 32 64 128 | simple_gpu_scheduler --gpus 0,1,2
        parallel --header : --shuf CUDA_VISIBLE_DEVICES='{=1 $_=slot()-1; seq() > 5 and skip() =}' python3 train_dnn.py --lr {lr} --batch_size {bs} ::: lr 0.001 0.0005 0.0001 ::: bs 32 64 128
    
        touch gpu.queue
        tail -f -n 0 gpu.queue | simple_gpu_scheduler --gpus 0,1,2 &
        echo "my_command_with | and stuff > logfile" >> gpu.queue
    
        touch gpu.queue
        tail -f -n 0 gpu.queue | parallel -j3 CUDA_VISIBLE_DEVICES='{=1 $_=slot()-1 =} {=uq;=}' &
        # Needed to fill job slots once
        seq 3 | parallel echo true >> gpu.queue
        # Add jobs
        echo "my_command_with | and stuff > logfile" >> gpu.queue
        # Needed to flush output from completed jobs 
        seq 3 | parallel echo true >> gpu.queue

------
heroic
Interesting! Is there a way to use this, or some other library, to spin up a
GPU instance on AWS or GCE if no GPUs are available, and discard the instances
when no tasks are left? Maybe also allow some "overtime", in case a new task
gets registered in the window between an instance going idle and being shut
down?

------
inetknght

        The package can simply be installed from pypi
        
        $ pip install simple_gpu_scheduler
    

When I run this, I get:

    
    
        ERROR: Could not find a version that satisfies the requirement simple_gpu_scheduler (from versions: none)
        ERROR: No matching distribution found for simple_gpu_scheduler
    

on Ubuntu 16.04. Admittedly I don't know much about Python or pypi so perhaps
the issue is on my end. Nonetheless, I'm not 100% sure how to resolve this.

~~~
sp332
Try with hyphens instead of underscores?
[https://pypi.org/project/simple-gpu-scheduler/](https://pypi.org/project/simple-gpu-scheduler/)
has it both ways which is very confusing.

~~~
ExpectationMax
Actually, both should work. The problem in this case is probably that the
package requires Python 3.6 and thus cannot be found by a Python 2 pip. I
changed the README for consistency, though.
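A quick way to check which interpreter your pip is bound to (a generic sketch,
not specific to this package) is:

```shell
# Show which Python the bare "pip" is bound to; the scheduler needs >= 3.6.
# Output like "pip 9.0.1 from ... (python 2.7)" means it's the wrong pip.
pip --version || true

# Run pip through an explicit Python 3 interpreter instead:
python3 -m pip --version
# python3 -m pip install simple_gpu_scheduler
```

Using `python3 -m pip` sidesteps the question of which interpreter the `pip`
shim on `$PATH` happens to point at.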

------
TuringNYC
Curious if anyone has used a GPU-enabled Kubernetes cluster for something
similar? I recently set one up. The only downside was the inability to assign
fractional GPUs to tasks, so concurrent tasks were necessarily limited to the
discrete number of GPUs available to k8s.
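For context, the NVIDIA device plugin exposes GPUs to Kubernetes as the
extended resource `nvidia.com/gpu`, which can only be requested in whole
units; a minimal, illustrative pod spec (names and image are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-job              # hypothetical name
spec:
  restartPolicy: Never
  containers:
  - name: train
    image: nvidia/cuda:10.1-base   # any CUDA-enabled image
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1    # integers only; fractional values are rejected
```

Extended resources are always advertised and consumed in integer amounts,
which is exactly the whole-GPU limitation described above.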

~~~
Denzel
This is one of the problems we solve at Algorithmia. We have a senior engineer
set to do a lightning talk at Kubecon next month, on the topic:
[https://events19.linuxfoundation.org/events/kubecon-cloudnat...](https://events19.linuxfoundation.org/events/kubecon-cloudnativecon-north-america-2019/schedule/).
If you’re interested in working on the problem, reach out to me.

------
p1esk
We used to have a google doc where everyone would put name, machine, gpu(s),
and estimated time to run, whenever they launched a simulation (longer than an
hour). Low tech, but worked ok as long as people didn’t forget to do so.

------
EricE
Reminds me of how BOINC started out over a decade ago :)

------
moonbug
A month in the lab saves a day in the library.

~~~
semi-extrinsic
I also have the feeling that this should be "easy" with a traditional
scheduler. But could you (or someone else) point me in the direction of
something that's easy to set up for a quad-GPU machine? I spent two days
trying to get SLURM to work, at one point. Currently I'm running with GNU
Parallel, but it's much less flexible of course.

~~~
bonoboTP
Slurm works well even for a single machine. It does require some configuration
but you only need to do it once and then it's very robust. The packages are
available in standard repositories (like Ubuntu's) and there's a web-based
configuration generator available. And if you ever want to add a second
machine or more, it will be very easy to expand.

What stopped you from using Slurm?
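For a single quad-GPU node, the GPU-relevant part of the configuration boils
down to a few lines; the hostname, CPU/memory figures, and device paths below
are illustrative, and the web-based generator fills in the rest:

```
# gres.conf -- map the gpu GRES to the device files
Name=gpu File=/dev/nvidia[0-3]

# slurm.conf -- declare the node's GPUs and enable GRES scheduling
GresTypes=gpu
NodeName=workstation Gres=gpu:4 CPUs=16 RealMemory=64000 State=UNKNOWN
PartitionName=main Nodes=workstation Default=YES State=UP
```

Jobs then request GPUs with `sbatch --gres=gpu:1 ...`, and Slurm sets
`CUDA_VISIBLE_DEVICES` for each job automatically.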

~~~
ExpectationMax
Installing Slurm requires root access to the server/workstation, which is not
always available. In academia, the management of computational resources is
often the responsibility of a separate IT unit that is not part of the
research group itself. Further, the IT unit might not be willing to implement
desired changes to the server infrastructure, or might take _extremely_ long
to do so. Setting up a scheduler is often also significantly more complicated
in this context, for example due to authentication via a university LDAP or
oddities of the server setup.

~~~
bonoboTP
Luckily, in our academic research group we administer our own workstations and
servers ourselves. That has downsides as well, of course, but at least we can
fix or change anything we want, hardware- or software-wise.

I can imagine it would be excruciating to explain our requirements to an
external IT department and then go back and forth clarifying what we want,
make them learn stuff they'd otherwise not need, convince them that we really
need it, wait until they have time...

The downside is that things are not "professionally set up" (we're computer
scientists but not pro admins), but at least if there's a deadline and
something isn't working, we'll surely fix it because we have skin in the game.

