
Ask HN: What tools do you use to do GPU pooling - thd-ai
At my work we have three servers with four GPUs each at the moment, and we're soon getting a fourth one. The problem we have now is that when we want to do experiments, we choose a server, download the datasets that are needed and then run the experiments. Recently we've run into problems: some people have downloaded multiple GBs of data onto one server, and by the time they're ready to run their experiments all the GPUs on that machine are already in use, while GPUs on the other servers sit idle.

What we do now is copy all the data from one server to another (which can take days) and then hope a GPU is free. Clearly this is not scalable, and we'd like to know if there are any ways to build a GPU pool where you can launch a job and the system also takes care of data management. Any links to tutorials are much appreciated.
======
ktpsns
You have a small cluster and you need cluster software for that. Look into
Slurm ([https://slurm.schedmd.com/](https://slurm.schedmd.com/)), which is
pretty standard in the HPC world. It lets you manage resources (like your
individual GPUs), schedule jobs, copy code and data to the particular machine
when the job starts, and much more. It is pretty customizable.
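
To give a rough idea of what that looks like in practice, here is a minimal sketch of a Slurm batch script; the job name, resource numbers, script and data paths are all made up for illustration, not something your setup prescribes. You submit it with sbatch and Slurm queues it until a GPU is free somewhere in the pool (squeue shows where it landed):

    #!/bin/bash
    #SBATCH --job-name=gpu-experiment    # name is just for illustration
    #SBATCH --gres=gpu:1                 # request one GPU on any node in the pool
    #SBATCH --cpus-per-task=4            # CPU/memory/time numbers are placeholders
    #SBATCH --mem=32G
    #SBATCH --time=12:00:00
    # Slurm picks a node with a free GPU and runs this script there;
    # the training script and dataset path below are hypothetical.
    srun python train.py --data /mnt/datasets/my_dataset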

Having said that, of course a shared filesystem (NFS, or better something like
[https://www.beegfs.io/](https://www.beegfs.io/) or
[https://ceph.com/ceph-storage/file-system/](https://ceph.com/ceph-storage/file-system/)),
together with a good network interconnect (InfiniBand cards are expensive but
worth the money if you transfer a lot of data), can save a lot of headaches.
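As a concrete (made-up) illustration of the shared-filesystem idea with plain NFS: every GPU node mounts the same export, for example via an /etc/fstab entry like the one below, so a job can run on whichever machine has a free GPU without copying datasets around first. The server name and paths are hypothetical:

    # /etc/fstab entry on every GPU node -- hostname and paths are made up
    storage01:/export/datasets   /mnt/datasets   nfs   defaults,_netdev,ro   0 0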

~~~
thd-ai
Hi thanks for your reply! Will definitely take a look at the links you've
provided!

