
Ask HN: Multi-Tenancy GPU Solution with GPU Quota - apli
Several teams from different startups want to share three GPU nodes (each with 8 GPUs) that we have locally available here for deep learning tasks.

What are the best open source tools to achieve this?

If possible, the solution should allow:

(1) Access only to one's own data

(2) GPU quotas per team

(3) Stats about GPU usage

(4) Shared volume access for multiple containers of one user (to share datasets)

(5) Security, e.g. not being able to escape the container through local volume mounts

We looked into Rancher 2.0 (Kubernetes, no GPU quota support) and Mesosphere (no GPU quota support). Do you have any ideas how to achieve this with the least effort?
======
SEJeff
So Kubernetes and Mesos (Mesosphere is a company, not a product; Mesos is the
software) use control groups and Linux namespaces, colloquially known as
"containers", to manage things. In the upstream Linux kernel there is no real
concept of a "container"; the term is mostly marketing.

There is currently no support in the NVIDIA kernel module or in upstream Linux
for cgroup control of GPUs, although NVIDIA is working upstream with Mesosphere
on exactly this. The way a lot of HPC firms handle it is with a normal job
scheduler (Torque/Maui, SLURM, Univa Grid Engine, etc.) that tracks a GPU
resource. Users then consume that GPU resource more or less on the honor
system, and it works provided you don't have malicious users.
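A minimal sketch of that pattern with SLURM, for illustration (the node names,
device paths, and script name here are assumptions, not anything from the
original post):

```shell
# slurm.conf (illustrative): declare the 8 GPUs on each node as a
# generic resource; "gpu-node[1-3]" is a hypothetical node range.
GresTypes=gpu
NodeName=gpu-node[1-3] Gres=gpu:8 State=UNKNOWN

# gres.conf on each GPU node: map the gpu resource to the NVIDIA devices.
Name=gpu File=/dev/nvidia[0-7]

# A user then requests two GPUs for a job; with the gpu gres plugin,
# SLURM exports CUDA_VISIBLE_DEVICES so the job only sees the devices
# it was allocated (on the honor system, as noted above).
sbatch --gres=gpu:2 train.sh
```

This gives you per-job GPU accounting and queueing, but not hard isolation:
nothing at the kernel level stops a process from opening a GPU it wasn't
allocated.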

Nvidia is working on making this less of a hack, but currently it is very much
a hack:

[https://github.com/NVIDIA/nvidia-container-runtime](https://github.com/NVIDIA/nvidia-container-runtime)

[https://github.com/NVIDIA/nvidia-docker](https://github.com/NVIDIA/nvidia-docker)

[https://github.com/NVIDIA/libnvidia-container](https://github.com/NVIDIA/libnvidia-container)
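For what it's worth, the nvidia-docker route looks roughly like this today
(a sketch assuming the nvidia-docker2 runtime is installed; the image tag is
just an example):

```shell
# Run a container that only sees GPUs 0 and 1: the nvidia runtime
# injects the driver libraries and exposes only the listed devices.
docker run --runtime=nvidia \
    -e NVIDIA_VISIBLE_DEVICES=0,1 \
    nvidia/cuda:9.0-base nvidia-smi
```

That gets you device-level visibility control per container, but quotas and
usage accounting across teams still have to come from whatever scheduler sits
on top.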

good luck!

