
Benchmarking Modern GPUs for Maximum Cloud Cost Efficiency in Deep Learning - minimaxir
http://minimaxir.com/2017/11/benchmark-gpus/
======
dsacco
_> As with the original benchmark, I set up a Docker container containing the
deep learning frameworks (based on cuDNN 6, the latest version of cuDNN
natively supported by the frameworks) that can be used to train each model
independently. The Keras benchmark scripts run on the containers are based off
of real world use cases of deep learning._

Can someone who is knowledgeable about Docker (and container performance in
general) comment on how using a Docker container impacts benchmark performance?

I understand containers are not virtual machines, but my preference would be
for a benchmark run on bare metal without containerization involved. In
particular, it’s not clear to me whether a container can take advantage of the
same CPU and GPU instruction optimizations that a binary or compiled-from-
source version can. For example, I’m curious if containers package redundant
versions of software for convenience that are underoptimized when compared to
versions packaged with the operating system. Is that a realistic concern?

In other words, is there generally a meaningful downside to benchmarking with
containers, and if so what is it? And equally importantly, are containers the
way most people use deep learning frameworks? I have never installed
Tensorflow or Keras using a container.

~~~
simcop2387
> it’s not clear to me whether a container can take advantage of the same CPU
> and GPU instruction optimizations that a binary or compiled-from-source
> version can.

Since the container is running on the bare metal, with the same kernel and
everything, just isolated, it absolutely can take advantage of these things.
Compiling with any special performance instructions takes about the same
amount of work as it would on bare metal.
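
To make the point concrete, here's a minimal sketch of mine (not from the article): on Linux, /proc/cpuinfo comes from the shared kernel, so a process inside a container reports the same CPU feature flags as one on the host.

```python
import os

def cpu_flags(cpuinfo_text):
    """Parse the 'flags' line of /proc/cpuinfo into a set of feature names."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            return set(line.split(":", 1)[1].split())
    return set()

# Run this on the host and again inside a container on the same machine:
# the flag sets match, so code compiled for e.g. AVX2 works identically.
if os.path.exists("/proc/cpuinfo"):
    with open("/proc/cpuinfo") as f:
        flags = cpu_flags(f.read())
    print(sorted(name for name in flags if name.startswith("avx")))
```
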

> For example, I’m curious if containers package redundant versions of
> software for convenience that are underoptimized when compared to versions
> packaged with the operating system. Is that a realistic concern?

Maybe. It'll mean there's extra data on disk, and maybe in RAM from loading
other shared libraries, which adds some overhead; but unless there's some
specific set of instructions or optimizations that would make a difference
for the workload, it should be basically the same.

~~~
derefr
> unless there's some specific set of instructions or optimizations that would
> make a difference for the workload

I think this was the parent's worry. Most HPC deployments have versions of
libc, libm, etc. compiled with tuning flags specific to the microarchitecture
of the system, to allow workloads deployed on them to squeeze as much
performance as possible out of the system. I would worry that the versions of
libraries in the container are just the generic packages that ship with the
OS, compiled with -march=i686 or whatever the modern equivalent is, which
don't try to take advantage of e.g. AVX-512 instructions even when they're
available.
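
One way to check for this from inside a container, assuming numpy is installed there (a sketch of mine, not from the thread): ask the library which BLAS it was built against.

```python
import io
import contextlib
import numpy as np

# np.show_config() prints the build-time BLAS/LAPACK configuration; capture it.
buf = io.StringIO()
with contextlib.redirect_stdout(buf):
    np.show_config()
config = buf.getvalue().lower()

# A generic distro package often links the reference BLAS; a tuned build
# typically reports MKL or OpenBLAS here.
print("optimized BLAS:", "mkl" in config or "openblas" in config)
```
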

~~~
malux85
Yes, you’re 100% correct, but if you’re using a well-tuned GPU deep learning
pipeline, the bottleneck is GPU compute and the CPU is always waiting on the
GPU.

It can be a bit tricky to optimise this, though. Some deep learning users do
things like resizing images on the fly, or on-the-fly vectorising/scaling of
input data; in these cases the CPU can be the bottleneck and the
optimisations above can help. But this is largely a symptom of a poor
pipeline configuration in newbie shops; most serious places have optimised
this away, and the GPU is the bottleneck.
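
As a toy illustration of that pipeline point (made-up shapes, numpy standing in for the real preprocessing): move per-batch work out of the training loop so the CPU only slices precomputed arrays.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.random((1000, 64)).astype(np.float32)  # fake raw inputs
mean, std = data.mean(axis=0), data.std(axis=0)

# Slow pattern: normalize inside the loop. The CPU repeats this work every
# epoch, and a fast GPU can end up idle waiting on it.
def batches_on_the_fly(raw, batch_size=100):
    for i in range(0, len(raw), batch_size):
        yield (raw[i:i + batch_size] - mean) / std

# Better: normalize once up front; the loop then only slices arrays.
normalized = (data - mean) / std
def batches_precomputed(prepped, batch_size=100):
    for i in range(0, len(prepped), batch_size):
        yield prepped[i:i + batch_size]

a = np.concatenate(list(batches_on_the_fly(data)))
b = np.concatenate(list(batches_precomputed(normalized)))
print(np.allclose(a, b))  # same batches, much less per-step CPU work
```
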

------
DTE
Great write up! I would be remiss to not mention Paperspace
[https://www.paperspace.com](https://www.paperspace.com) where we also offer
cloud GPU infrastructure that is less expensive and more powerful than most of
the larger clouds.

We also offer a suite of tools that makes cloud AI pipelines a bit easier to
set up and manage.

If this is of interest, here's a $5 promo code to try us out: HNGPU5

full disclosure: I'm one of the co-founders :)

------
dantiberian
I would be interested to see how the new preemptible GPU instances fare in
this comparison, perhaps on the next revision?
[https://cloud.google.com/compute/docs/instances/preemptible#...](https://cloud.google.com/compute/docs/instances/preemptible#preemptible_with_gpu)

~~~
minimaxir
Dammit. When did Google announce these?

It looks like preemptible GPUs are exactly half the price of normal GPUs (for
both K80s and P100s; $0.22/hr and $0.73/hr respectively), so they're about
double the cost efficiency (when factoring in the cost of the base preemptible
instance), which would put them squarely ahead of CPUs in all cases. (And
since the CPU instances used here were also preemptible, it's apples-to-
apples.)
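
To spell the arithmetic out (the preemptible prices are the ones quoted above; the $0.45/hr and $1.46/hr on-demand figures are my recollection of GCP's list prices at the time, so treat them as assumptions):

```python
# Cost efficiency = work per dollar. The same hardware at half the hourly
# price does the same work for half the money, i.e. double the efficiency.
prices = {
    "K80":  {"on_demand": 0.45, "preemptible": 0.22},
    "P100": {"on_demand": 1.46, "preemptible": 0.73},
}

for gpu, p in prices.items():
    gain = p["on_demand"] / p["preemptible"]
    print("%s: %.2fx cost-efficiency gain" % (gpu, gain))
```
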

~~~
dantiberian
Quite recently, I think. You can’t be blamed for missing it; the documentation
is inconsistent about whether it is supported
([https://mobile.twitter.com/danielwithmusic/status/9421780263...](https://mobile.twitter.com/danielwithmusic/status/942178026309140480/photo/1)).

------
houqp
Shameless plug: we also have a benchmark post for CPU/K80/V100 focusing on
image tasks: [https://blog.floydhub.com/benchmarking-floydhub-instances](https://blog.floydhub.com/benchmarking-floydhub-instances).
In this case, the V100 can be a lot more cost-effective compared to the CPU
and K80. Our TensorFlow environments are built from source with optimizations
targeting our CPU instances.

------
oh-kumudo
Would be better if Volta-based P3 instances were included in the comparison.

Spoiler Alert: It is a game changer.

~~~
minimaxir
I talked about Volta at the end of the post, but it’ll still be a bit before
Volta/cuDNN 7 support is baked into the native Tensorflow/CNTK distributions.

Even if Volta has the speed advantages touted, I doubt P3s will be as cost-
effective as a K80.

~~~
oh-kumudo
It is actually more interesting than that. My experience is that a P3 instance
can significantly accelerate training, by as much as 2x-3x, so to train the
same model you won't need to keep the instance up nearly as long as you would
a K80 instance.
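
A back-of-envelope check of that trade-off: a pricier instance wins on total job cost only if its speedup exceeds its price ratio. The hourly prices below are illustrative 2017 us-east-1 on-demand figures from memory (p2.xlarge with one K80, p3.2xlarge with one V100), not numbers from the article.

```python
k80_price = 0.90    # p2.xlarge (1x K80), USD/hr, assumed
v100_price = 3.06   # p3.2xlarge (1x V100), USD/hr, assumed

def total_cost(hours_on_k80, speedup, price):
    """Cost of a job that takes hours_on_k80 on the K80, run `speedup`x faster."""
    return (hours_on_k80 / speedup) * price

job_hours = 10.0
for speedup in (1.0, 2.0, 3.0, 4.0):
    cost = total_cost(job_hours, speedup, v100_price)
    print("%.0fx faster: $%.2f vs $%.2f on K80" % (speedup, cost, job_hours * k80_price))

# Break-even speedup is just the price ratio:
print("break-even: %.2fx" % (v100_price / k80_price))  # 3.40x
```

At a 2x-3x speedup and these assumed prices, the K80 still edges it out on raw cost; the V100 wins on wall-clock time, and on cost only past the price ratio.
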

------
brianchu
1\. I wouldn't take much away from the LSTM benchmark. It's more a benchmark
of Keras, since Keras only supports CuDNN's LSTM via Tensorflow right now.
AFAIK CNTK does support CuDNN LSTM, but not through Keras. Keras actually
implements its own LSTM in terms of the base math operations (it doesn't call
the Tensorflow or CNTK LSTM operations, which are in some cases optimized in
C++ etc.), so on the CPU you could probably get better performance if you
were using the Tensorflow or CNTK functions directly.

2\. Compiling Tensorflow from source on CPUs is a bit of a hassle but I have
seen nice performance gains (10-20%) for LSTM tasks. I bet you would get even
higher gains for CNNs since they're more parallelizable. (Note: I've never
gotten the latest TF to work with Intel MKL).

3\. I haven't fully tested this myself, but with the P100s you also have full
support for half precision floats, which supposedly offer a huge speedup.

4\. Also would have liked to see benchmarks of other frameworks like PyTorch,
etc. I haven't used them myself but everything I've heard indicates that
Tensorflow is often slower.
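
On point 3, the storage half of half precision is easy to demonstrate on any machine with numpy (a sketch of mine; the math-throughput half needs the GPU itself):

```python
import numpy as np

x32 = np.ones((1024, 1024), dtype=np.float32)
x16 = x32.astype(np.float16)

# Same element count, half the bytes: half the memory traffic, which is a
# big part of where the fp16 speedup potential comes from.
print(x32.nbytes, x16.nbytes)  # 4194304 2097152

# The trade-off is range/precision: fp16 overflows just past 65504, so large
# activations or loss values need care (e.g. loss scaling).
print(np.float16(70000))  # inf
```
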

------
lma21
Great write up! Even though I'm very far from understanding deep learning and
the various frameworks you're testing here, benchmarking GPUs has always
fascinated me.

I'm interested to see the utilization of the underlying GPU devices when you
run the MLP or CNN benchmarks (monitored with `nvidia-smi`). The speed-up
factors between the different benchmarks don't seem to be in line with the
K80-to-P100 speed-up shown on the cuDNN page[1]. I'm wondering if the P100
device is under-utilized when used with Tensorflow or CNTK.

[1] [https://developer.nvidia.com/cudnn](https://developer.nvidia.com/cudnn)
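
For anyone wanting to do this, here's a small sketch (mine, not from the article) of logging utilization via nvidia-smi's CSV query output; the sample string stands in for real output so the parsing can be shown without a GPU.

```python
import subprocess

# nvidia-smi query for per-GPU compute and memory utilization, one CSV
# line per device, e.g. "87, 54".
QUERY = ["nvidia-smi",
         "--query-gpu=utilization.gpu,utilization.memory",
         "--format=csv,noheader,nounits"]

def parse_utilization(csv_line):
    """Turn one CSV line like '87, 54' into (gpu_pct, mem_pct)."""
    gpu, mem = (int(v.strip()) for v in csv_line.split(","))
    return gpu, mem

def sample_gpus():
    """Poll nvidia-smi once; returns a list of (gpu_pct, mem_pct) per device."""
    out = subprocess.check_output(QUERY, text=True)
    return [parse_utilization(line) for line in out.strip().splitlines()]

# Without a GPU handy, exercise the parser on representative output:
print(parse_utilization("87, 54"))  # (87, 54)
```

Calling `sample_gpus()` in a loop during a training run would show whether the P100 actually sits near 100% compute utilization.
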

------
WhitneyLand
At least for personal experiments, how much worse is it to just build a system
with a 1080 Ti, leave it in a corner where you live, and remote in as needed
from a laptop, school, wherever?

Separately, it seems that with ML, price/performance is sometimes less
important than how much money you have to spend.

For example with other development work, I’d probably never buy an 18 core cpu
because for most projects it wouldn’t speed up my iterative work much.

However, ML is more like VFX work, where it’s common that nothing may be too
fast. In other words, I would more often be willing to ignore price/perf and,
for example, pay 100% more for only a 50% perf gain, if it were within my
means to do so.

------
bhouston
This is a great deal, 4x 1080 for $1700 CAD/month if you need it for the whole
month or longer:

[https://www.ovh.com/ca/en/dedicated-servers/gpu/1801gpu06.xml](https://www.ovh.com/ca/en/dedicated-servers/gpu/1801gpu06.xml)

But the best deal with regards to GPUs is to buy your own and put it in your
office.

~~~
ctlaltdefeat
Hetzner has a single 1080 for 99 EUR a month:
[https://www.hetzner.de/dedicated-rootserver/ex51-ssd-gpu](https://www.hetzner.de/dedicated-rootserver/ex51-ssd-gpu)

~~~
asah
I've used Hetzner; it works great at a tiny fraction of the cost. But you have
to manage nodes yourself, and bandwidth isn't great.

~~~
bhouston
With Kubernetes and Docker, it has never been easier to run it yourself.

------
log_base_login
> Indeed, the P100 is twice as fast as the K80, but due to the huge cost
> premium, it’s not cost effective for this simple task.

Am I reading the graph wrong, or is this statement not true?

Looks like the P100 is running the code in around half the time of the K80,
while its cost is only 150% that of the K80.

0.5 × 1.5 = 0.75 < 1, so the P100 does the same work for less total money.
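
A quick sanity check of that ratio in code (normalized units, matching the numbers above):

```python
# K80 as the baseline: 1 unit of time at 1 unit of cost per hour.
k80_time, k80_rate = 1.0, 1.0
# P100 per the graph: half the time at 150% of the hourly rate.
p100_time, p100_rate = 0.5, 1.5

k80_total = k80_time * k80_rate      # 1.0
p100_total = p100_time * p100_rate   # 0.75
print(p100_total < k80_total)        # True: the P100 does the job for less
```
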

------
trollian
I'd be interested in hearing how this compares to the TPU.

