
Putting GPUs to work with Kubernetes - marcoceppi
https://medium.com/@samnco/how-we-commoditized-gpus-for-kubernetes-7131f3e9231f
======
Seanny123
I keep seeing Kubernetes appear on HackerNews. Is there a quick thing I can
read to explain why everyone's so excited about it? I know it's container
orchestration, but I'm not sure what people are using it for or what pain
point it is relieving.

~~~
moondev
[https://vishh.github.io/docs/concepts/overview/what-is-kubernetes/](https://vishh.github.io/docs/concepts/overview/what-is-kubernetes/)

~~~
Seanny123
If you'd like to engage with me further, how does a company know it needs
Kubernetes? If I'm Soylent and I'm processing a few orders a minute, I'm
probably safe with a few redundant monoliths. Do I have to be Uber? What's the
middle-ground between Soylent and Uber that would still need this?

Is the answer the same as for the question "who needs a microservice
architecture"?

~~~
samnco
There are a few killer features that benefit you at any size, and that I
really love:

* Self-healing: when you create a deployment/replica set, it will be maintained at all costs, so if the app has a memory leak or anything else goes wrong, the failure is contained and the app is kept up and running

* Rolling updates: even when you run just 5 frontends, it is a pain to use Capistrano or other tools to roll out an update from a git repo. It is literally a one-liner in Kubernetes (see the sketch after this list). If you use CI/CD, the setup is just a few lines in any Jenkins/GitLab/Travis...

* Service discovery: the combination of environment variables and predictable DNS endpoints is just awesome

* Ecosystem: PaaS, serverless... Much of the new-generation infrastructure is built on K8s, so it is a door to the next generation, whether you know you will use it or not.
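
To make the rolling-update point concrete, here is a minimal sketch (the
deployment name, image, and registry are illustrative, not from the post;
older clusters used extensions/v1beta1 instead of apps/v1):

    # frontend.yaml: a 5-replica deployment with a rolling update strategy
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: frontend
    spec:
      replicas: 5
      selector:
        matchLabels:
          app: frontend
      strategy:
        type: RollingUpdate
        rollingUpdate:
          maxUnavailable: 1        # swap pods out one at a time
      template:
        metadata:
          labels:
            app: frontend
        spec:
          containers:
          - name: frontend
            image: registry.example.com/frontend:v1   # hypothetical image
    # The one-liner update, rolling every replica to v2 with no downtime:
    # kubectl set image deployment/frontend frontend=registry.example.com/frontend:v2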

As for microservice architecture, just starting with the web frontend and a
couple of lightly dockerized middleware services makes it so simple that you
instantly want to get more out of it.

As the overhead of running K8s vs. a plain set of servers is relatively low,
especially at small scale, it is definitely worth looking at. Happy to do a
run-through with you and show you how the deployment of a tiered app works as
a demo; ping me at @SaMnCo_23 if interested.

~~~
tokenizerrr
When running a Kubernetes cluster on your own hardware, what do you use for
storage?

~~~
samnco
You have several options:

* Run Ceph on separate nodes and connect it to the cluster. With Juju, you can do that from the bundle, as Ceph is also a supported workload. This gives you scale for storage.

* Run Ceph within the cluster with a Helm chart. We see that with openstack-helm, for example. This also gives you scale, but the lack of device discovery makes it less interesting in my opinion.

* Run an NFS server: plain easy, but not very scalable.

* Use hostPath, which is the default but doesn't get you scale (see the sketch below).
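
For the hostPath option, a minimal sketch of what a volume definition looks
like (name, size, and path are illustrative):

    # A PersistentVolume backed by a directory on a single node; simple,
    # but the data is pinned to that node, which is why it doesn't scale.
    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: local-pv
    spec:
      capacity:
        storage: 10Gi
      accessModes:
      - ReadWriteOnce
      hostPath:
        path: /mnt/data           # hypothetical directory on the node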

~~~
tokenizerrr
So Ceph is the preferred storage provider? I've noticed there is a huge list,
including GlusterFS. Do you have experience with any of the other ones?

~~~
ferrantim
Ceph is not a good solution at all for databases. Your performance is going to
be terrible.

Ceph is an object-based SDS solution designed to take servers with local
drives and create a SAN out of them. In order to do this, it takes each LUN
(Ceph volume) and scatters the data across all nodes in the cluster. It does
not assume that applications will run on these servers themselves; it assumes
compute is elsewhere, like with a traditional SAN. The goal is to replace a
SAN with servers, not to create a converged platform. Also, Ceph was designed
in an age when an Intel server did NOT have tier-1 capacity (8-20 TB), which
is why it shards a volume across so many servers.

This causes a problem for modern applications like Cassandra, Mongo, Kafka,
etc., which like to scale out themselves and want a converged system where
data is not scattered but sits on the node where an instance of that cluster
runs. Ceph also disrupts (undoes) the HA capabilities that these scale-out
applications have (for example, a Cassandra instance's data will not be on
the node on which it thinks it is).

~~~
tokenizerrr
Do you happen to have any suggestions for alternatives to look into?

~~~
ferrantim
You could look at Portworx (disclosure: I work there, so I am biased, but you
can test it yourself for free).

------
Tossrock
Unprofitable for ETH mining maybe, but it seems like a natural fit to rent
time on it to deep learning people with slow-training models. Although that
could still be unprofitable after the cost of electricity; I guess it's a
question of market size/demand. A lot of deep learning already happens at the
big infrastructure players, who wouldn't need the service, leaving academics
and smaller companies. But maybe some people would find a reliable, scalable
GPU cluster valuable.

~~~
samnco
Ahah, good point. Really, the ETH stuff was "because I can". But in the same
charts repository you will find a TensorFlow chart; my previous series of
blog posts [0] was about exactly that. A nice addition for compute-intensive
workloads is the use of LXD [1].

Another use case is media transcoding. It is not a trivial job to orchestrate
transcoding at scale, and Kubernetes, with or without GPUs, is an excellent
solution for that, as it is trivial to set up a completely automated job
queue (a sketch follows).
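
A minimal sketch of such a job queue (the image and arguments are assumptions
for illustration, not from the charts):

    # 20 transcoding tasks, at most 4 running at a time; Kubernetes
    # schedules the queue and restarts failed pods automatically.
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: transcode
    spec:
      completions: 20
      parallelism: 4
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: ffmpeg
            image: registry.example.com/ffmpeg        # hypothetical image
            args: ["-i", "/work/input.mp4", "/work/output.webm"]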

Another interesting field will eventually be HPC, but there are some
compute-scheduling boxes that K8s does not tick at this point in time. There
is a pluggable scheduler in the works, I think, and that will eventually
help. The LXD example is a nice optimization, but it would not replace the
scheduler in any way.

[0]: [https://medium.com/intuitionmachine/gpus-kubernetes-for-deep-learning-part-3-3-automating-tensorflow-deployment-5dc4d5472e91](https://medium.com/intuitionmachine/gpus-kubernetes-for-deep-learning-part-3-3-automating-tensorflow-deployment-5dc4d5472e91)

[1]: [https://hackernoon.com/job-concurrency-in-kubernetes-lxd-and-cpu-pinning-to-the-rescue-b9fb7b44f99d](https://hackernoon.com/job-concurrency-in-kubernetes-lxd-and-cpu-pinning-to-the-rescue-b9fb7b44f99d)

------
aub3bhat
Great read. On a smaller scale, I have found nvidia-docker with
nvidia-docker-compose to be a great solution for deploying Docker containers
on AWS P2 machines with 8 GPUs.

------
nrki
"1060GTX at home but on consumer grade Intel NUC"

A bit OT, but I'd like to see how this works...

Ah, very cool - [https://www.youtube.com/watch?v=wyY-lTmgb8c](https://www.youtube.com/watch?v=wyY-lTmgb8c)

~~~
samnco
Actually, it was a fun DIY project I did a while ago. You can read about it
here: [https://hackernoon.com/installing-a-diy-bare-metal-gpu-cluster-for-kubernetes-364200254187](https://hackernoon.com/installing-a-diy-bare-metal-gpu-cluster-for-kubernetes-364200254187)

It works, but the GPUs aren't very stable at 4x vs. a normal 16x.

~~~
jacquesm
That's one problem; another is the size of the power supply. And maybe that's
the _only_ problem: I don't see why a GPU would become unstable when using
fewer lanes, all it should do is get slower.

~~~
samnco
I don't know. Maybe the make of the extenders isn't very good; I saw other
people with similar issues.

The PSU is the Corsair AX1500i (1500W), with 10 lines for GPUs. It's robust
on paper, and I didn't have any problem with just 4 plugged in.

But I must say... the T630s are very noisy compared to these, but so much
more powerful #NotGoingBack

~~~
jacquesm
I just bought a GTX 1080 Ti + a similar Corsair as an upgrade for my
3-year-old Dell; it works like a charm.

If you have a PSU that big then that probably isn't the problem. I thought you
might be using the PSU that comes with those extender boxes and they usually
are very puny (250 W or so).

Do you use it for gaming or for CUDA?

Do you run the 4 GPUs in the extender?

~~~
samnco
Yes, each GPU has a 4x -> 16x and a 4x -> 4x extender, in addition to the
M.2 -> PCIe 4x adapter.

So many potential failure points in there. The sole use case is CUDA.
Essentially I wanted a portable cluster with GPUs, and that did the job for a
couple of months. Now it's getting more serious, so the switch to the T630
makes sense, and I repurposed the NUCs into the control plane of the K8s
cluster.

~~~
jacquesm
I built this long ago:

[https://clustercompute.com/images/image4.jpg](https://clustercompute.com/images/image4.jpg)

Which was a lot of fun.

Do you have all the GPUs internal to the T630?

Any chance of a picture (of the guts)?

I'm seriously thinking of duplicating your effort.

~~~
samnco
Here you go :)
[https://drive.google.com/file/d/0B1CCk51NQ4koSmkxSmxWb1E5Y0E...](https://drive.google.com/file/d/0B1CCk51NQ4koSmkxSmxWb1E5Y0E/view?usp=sharing)

Replicating it is not very hard. You need a lightweight x86 machine for MAAS,
which takes ~20 min to install, one VLAN for the iDRAC (IPMI), another for
networking that can connect to the internet, and off you go. You can also
enable KVM power management in MAAS to run the Juju control plane in VMs and
save a box if you're limited in compute power.

[https://maas.io](https://maas.io)
[https://jujucharms.com/docs](https://jujucharms.com/docs)
[https://www.ubuntu.com/containers/kubernetes](https://www.ubuntu.com/containers/kubernetes)
for all the goodies.

If you run into problems, I am SaMnCo on #juju on Freenode.

~~~
jacquesm
Ok, so 2 GPUs in there. Have you tried 4 or is that not possible for some
reason?

I have plenty of other hardware floating around here so no problem on hooking
it all up.

Thank you for the image.

~~~
samnco
I have not, but it is technically possible. The PSU is the double 1100W with
the GPU enablement kit: up to 4x PCIe 16x at full speed, up to 1.5TB of RAM,
and 8x 3.5" or 16x 2.5" HDDs. I didn't go that far though ($$$...)

------
shaklee3
Is the author of this working on official support or just testing? I know
there's a gpu roadmap for k8s, but I can't tell from this blog if this was
part of it.

~~~
samnco
Canonical will officially support GPUs when they land GA upstream. The flag
is beta as of now in the Canonical Distribution of Kubernetes. Paying
customers, whether on the managed or the supported solution, get best-effort
support for GPUs, and the feature is enabled by default.

~~~
puzzle
What is the requirement for privileged containers? The post never explains it.

~~~
samnco
Privileged containers are required for the GPU to be shared with the
containers.

By default, the bundle comes with an "auto" tag, which will activate
privileged containers only when GPUs are detected.

You can enforce "false" to remove that, but then you won't be able to run GPU
workloads.

Or you can enforce "yes" and have them activated all the time.
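
For reference, a minimal sketch of what a privileged pod looks like (the
names and image are illustrative, not the charm's generated config):

    # "Privileged" at the pod level: the container gets broad access to
    # the host, including its device nodes.
    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-test
    spec:
      containers:
      - name: cuda
        image: nvidia/cuda        # illustrative image
        securityContext:
          privileged: true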

Does that answer the question? Not sure if I understood it right.

~~~
puzzle
The Kubernetes docs don't say anything about having to use privileged
containers for GPU support. Privileged containers are given tens of Linux
capabilities; which of those are actually needed in your setup? Or,
conversely, which specific step would fail for an unprivileged container?

Just because I want to use a GPU shouldn't require the power to change the
clock, switch UIDs, chown files, mess with logs, reboot the machine, etc.

~~~
marcoceppi
Since the GPU libraries are hosted on the node, the privileged flag is
typically required to make that possible. I'm sure there will be improvements
so privileged isn't required, but today it's mostly a requirement to get
anything useful out of containers tapping into the GPU.

That said, if you set the allow-privileged flag to false, GPU drivers will
still be installed, but you may not be able to make use of the CUDA cores.
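
As a sketch of the library-sharing approach (the paths are assumptions that
vary by driver version, not the exact chart contents):

    # Sharing the node's NVIDIA libraries with the container via hostPath
    apiVersion: v1
    kind: Pod
    metadata:
      name: cuda-pod
    spec:
      containers:
      - name: cuda
        image: nvidia/cuda                  # illustrative image
        volumeMounts:
        - name: nvidia-libs
          mountPath: /usr/local/nvidia/lib64
      volumes:
      - name: nvidia-libs
        hostPath:
          path: /usr/lib/nvidia-375         # hypothetical driver path on the node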

~~~
puzzle
That's weird, because all the times I tried the experimental support, it
didn't need privileged containers. From the YAML files, it looks like it's
using hostPath directories, but those don't require special privileges unless
you need to write to them:

[https://kubernetes.io/docs/concepts/storage/volumes/#hostpat...](https://kubernetes.io/docs/concepts/storage/volumes/#hostpath)

I suspect that there is a bug somewhere.

~~~
puzzle
Ah, wait:

[https://github.com/madeden/blogposts/blob/master/k8s-gpu-cloud/src/nvidia-smi.yaml](https://github.com/madeden/blogposts/blob/master/k8s-gpu-cloud/src/nvidia-smi.yaml)

You don't need to mount the /dev entries into the container at all. The
experimental support creates them automatically for you when you are using
GPU resources. Perhaps it's the device nodes, not the libraries, that
required privileges?
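
A minimal sketch of what should be enough (using the alpha resource name from
the experimental support of that era; the image is illustrative):

    # Requesting a GPU through the (alpha) resource API; no /dev mounts needed
    apiVersion: v1
    kind: Pod
    metadata:
      name: nvidia-smi
    spec:
      restartPolicy: Never
      containers:
      - name: nvidia-smi
        image: nvidia/cuda        # illustrative image
        command: ["nvidia-smi"]
        resources:
          limits:
            alpha.kubernetes.io/nvidia-gpu: 1   # device nodes injected automatically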

~~~
samnco
Hello,

OK, I gave it a try and you are absolutely right. I could run nvidia-smi
without mounting /dev/nvidia0, which is cool.

I was also able to run it unprivileged. I guess my mistake was to trust the
example from the docs and not test without it.

Thanks for sharing that; I'll update my charts and the post accordingly.

~~~
puzzle
Awesome! Happy to hear that more containers will run without unneeded
privileges.

