
Benchmarking TensorFlow on Cloud CPUs: Cheaper Deep Learning Than Cloud GPUs - myth_drannon
http://minimaxir.com/2017/07/cpu-or-gpu/
======
boulos
Disclosure: I work on Google Cloud (and launched Preemptible VMs).

Thanks for the write-up, Max! I want to clarify something though: how do you
handle and account for preemption? As we document online, we've oscillated
between 5% and 15% preemption rates (on average, varying from zone to zone and
day to day), but those are also going to be higher for the largest instances
(like highcpu-64). If you need to train longer than our 24-hour limit, or
you're getting preempted too often, that's a real drawback. (Note: I'm all for
using preemptible for development and/or all batch-ey things, but only if
you're ready for the trade-off.)

While we don't support preemptible with GPUs yet, it's mostly because the team
wanted to see some usage history. We didn't launch Preemptible until about 18
months after GCE itself went GA, and even then it involved a lot of
handwringing over cannibalization and economics. We've looked at it on and
off, but the first priority for the team is to get K80s to General
Availability.

Again, Disclosure: I work on Google Cloud (and love when people love
preemptible).

~~~
minimaxir
> how do you handle and account for preemption?

I do most of my experiments with Jupyter Notebooks and Keras on top of
TensorFlow. Keras has a ModelCheckpoint callback
([https://keras.io/callbacks/#modelcheckpoint](https://keras.io/callbacks/#modelcheckpoint))
which saves the model to disk after each epoch. It's super easy to implement
(1 LOC) and a good idea even if I weren't training on a preemptible instance.
In the event of an unexpected preemption, I can just retransform the data
(easy with a Jupyter-organized workflow), load the last-saved model (1 LOC),
and resume training.
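
Roughly, the checkpoint-and-resume pattern looks like this (a minimal sketch;
the filename and the model/data variables are placeholders from my notebook
workflow):

    from keras.callbacks import ModelCheckpoint
    from keras.models import load_model

    # Save the full model (architecture + weights + optimizer state) after
    # every epoch -- this is the "1 LOC" part.
    checkpoint = ModelCheckpoint('model.h5')
    model.fit(X_train, y_train, epochs=20, callbacks=[checkpoint])

    # After a preemption: retransform the data, reload, and keep training.
    model = load_model('model.h5')
    model.fit(X_train, y_train, epochs=20, callbacks=[checkpoint])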

The drawback there is if the epochs are long, in which case a preemption could
cost more progress than I'd like.

~~~
rryan
It's really odd that the Keras API's checkpoint interval is measured in epochs
(which is a different wall-clock interval for every model/dataset/hardware
configuration). It's much more common to checkpoint based on a time interval.
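
A time-based checkpoint can be written as a small custom callback, something
like this (a hypothetical sketch; TimedCheckpoint, the filepath, and the
10-minute default are made up for illustration):

    import time
    from keras.callbacks import Callback

    class TimedCheckpoint(Callback):
        """Save the model every `interval` seconds instead of every epoch."""
        def __init__(self, filepath, interval=600):
            super(TimedCheckpoint, self).__init__()
            self.filepath = filepath
            self.interval = interval
            self.last_save = time.time()

        def on_batch_end(self, batch, logs=None):
            if time.time() - self.last_save >= self.interval:
                self.model.save(self.filepath)  # Keras sets self.model
                self.last_save = time.time()

You'd pass it to model.fit(..., callbacks=[TimedCheckpoint('model.h5')]) just
like any other callback.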

~~~
gcr
Oh interesting, I've never seen checkpointing on a time interval. Most Torch
examples just dump the model to disk after the epoch finishes.

One reason to use epoch checkpointing is that it ensures all samples of the
training data have been seen the same number of times. If your data is large
and diverse, with heavy enough augmentation it might not matter very much.

------
paulsutter
Shoutout to Hetzner's 99 euro/month server with a GTX 1080, much better than
the pseudo-K80s that Google Cloud provides for $520/month. The Google K80s are
half or a quarter the speed of a real K80, which is part of the reason they
show so badly in the comparison.

[https://www.hetzner.com/dedicated-rootserver/ex51-ssd-gpu?country=ot](https://www.hetzner.com/dedicated-rootserver/ex51-ssd-gpu?country=ot)

~~~
shmageggy
Wish HN had a 'save' feature so I can remember this comment when I need a GPU
box.

~~~
detaro
click on the timestamp, click "favorite"

~~~
inovica
New one for me also. Thanks!

------
dkobran
One of the interesting variables in calculating ML training costs is developer
time. The hourly cost of a Data Scientist (or similar role) will outweigh the
most expensive compute resource by several orders of magnitude. When you
factor in that time, the GPU immediately becomes more attractive. Other
industries with heavy, time-consuming computational workloads, like CGI
rendering, have understood this for decades. It's difficult to attach a dollar
figure to the value of speeding something up because it's not only about the
time saved itself but also about the way we work: waiting around for results
limits our ability to work iteratively, scheduling jobs becomes a project of
its own, the process becomes less predictable, etc.

Disclaimer: Paperspace team.

~~~
0xbear
For training, that's likely to be true. For large-scale inference it's not
possible to beat CPUs right now if cost is a factor. You might be able to beat
them once you can buy TPU access in the cloud, depending on how steep a
premium Google attaches to it.

------
shusson
While the author's article is relevant if you are stuck on GCP, you won't
reach the same conclusion on AWS. That's because AWS has GPU spot instances
(P2), which can be found for ~80% cheaper depending on your region [1].
Hopefully one day soon GCP will support preemptible GPU instances.

[1]
[https://aws.amazon.com/ec2/spot/pricing/](https://aws.amazon.com/ec2/spot/pricing/)

~~~
likelynew
When I started it was even cheaper (~10% of the reserved cost), but even now
it's pretty cheap.

------
joeblau
I would love to see these results put up against Google's new TPUs [1]. While
TPUs are still in alpha, my guess is that customized hardware that understands
TensorFlow's APIs would be a lot more cost-effective.

[1] - [https://cloud.google.com/tpu/](https://cloud.google.com/tpu/)

------
cobookman
I've been amazed that more people don't make use of Google's preemptibles. Not
only are they great for background batch compute, you can also use them to cut
your stateless webserver compute costs. I've seen some people use k8s with a
cluster of preemptibles and non-preemptibles.

~~~
ohstopitu
Something I've always been curious about (and it would be great if a Google
Cloud engineer could clear this up) is why we should not (as in, why doesn't
everyone) use preemptible nodes, apart from maybe the 3 or 5 master nodes.

My question specifically: if I configure a k8s cluster to have all my slaves
as preemptible nodes, would GCP automatically add new nodes as my old nodes
are deleted (from what I understand, preemptible nodes are assigned to you for
a max of 24 hrs)?

Considering the pricing of preemptible nodes plus the discounts GCP gives you
for sustained use, it makes the cloud insanely cheap for an early-stage
startup.

~~~
thesandlord
Google Cloud Developer Advocate here.

Go for it as long as you understand the downside. It's possible that all
instances get preempted at once (especially at the 24hr mark), that there
isn't capacity to spin up new preemptible nodes in the selected zone once the
old instance is deleted, etc. New VMs also take time to boot and join the
cluster.

If you are just doing dev/test stuff, I'd recommend using a namespace in your
production cluster or spinning up and down test clusters on demand (which can
be preemptible).

If you have long running tasks (like a database) or are serving production
traffic, using 100% preemptible nodes is not a good idea.

Preemptible can be great for burst traffic and batch jobs, or you can do a mix
of preemptible and standard nodes to get the right balance of stability and
cost.

~~~
swampthinker
If you don't mind me asking, what exactly is the role of a developer advocate?

~~~
fhoffa
[https://medium.com/google-cloud/a-day-in-the-life-of-a-developer-advocate-for-google-cloud-platform-fe681c8645cf](https://medium.com/google-cloud/a-day-in-the-life-of-a-developer-advocate-for-google-cloud-platform-fe681c8645cf)

Not me or OP, but same team :)

~~~
swampthinker
Cheers!

------
visarga
For research and experimentation, what you need is your own DL box. It will
pay for itself in a few months. You'll feel better having your own reliable
hardware that you don't share or pay for by the minute, and that will affect
the kinds of ideas you're willing to try.

Then you scale up to the cloud to do hyperparameter search.

~~~
alexcnwy
Do you have any advice on getting your own box set up in a data center? I'm
constantly traveling...

------
AndrewKemendo
Excellent write-up, kudos on going through all of that, Max. Too bad Google
will deprecate the preemptible instances as a result :P.

 _There is a notable CPU-specific TensorFlow behavior; if you install from pip
(as the official instructions and tutorials recommend) and begin training a
model in TensorFlow, you’ll see these warnings in the console:_

FWIW I get the same console warnings with the tensorflow-gpu installation from
pip, and I verified that it was actually using the GPU.
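
For anyone who wants to double-check the same thing, one quick way (a sketch
using the TF 1.x session API; the toy ops are arbitrary) is to log device
placement and look for "/device:GPU:0" in the console output:

    import tensorflow as tf

    # Log which device each op lands on; with tensorflow-gpu installed and
    # a visible GPU, these ops should be placed on /device:GPU:0.
    with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
        a = tf.constant([1.0, 2.0, 3.0], name='a')
        b = tf.constant([4.0, 5.0, 6.0], name='b')
        print(sess.run(a + b))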

------
marze
A question for those who've used TensorFlow on NVIDIA GPUs:

What range of GPU performance do you see? As in, if the card does 10 TFLOPS
peak, does TensorFlow manage to reach that peak, or is it at 5% or 20% or some
other percent of peak typically?

And are there expectations for Google's new-generation TPU? What range of peak
performance do people expect to get?

------
LeaderGPU
Thank you for the benchmarks! It would be interesting to include Inception-v4
and Inception-ResNet in your research, and to try Nvidia 1080 / 1080 Ti cards
as well.

Our benchmarks for processing 1,000,000 images with ResNet-50:

- 8x Tesla K80: 43 min 3 sec

- 8x Nvidia 1080: 17 min 32 sec (0.09 euro / minute)

We can provide you with resources for free for research.

Disclosure: I'm the founder of LeaderGPU.

------
data4science
Paperspace has dedicated GPU instances for $0.40/hr; I'll have to compare with
Hetzner...

~~~
icelancer
Been very impressed with their customer service and their bandwidth
availability. I'm getting 500/500 on speed tests to their west coast servers.
I don't use their GPU instances, but I do use their high-power CPU instances
for video rendering and V3D work.

------
automatapr
Neat article. I think it's worth pointing out that this guy is an active
commenter in the Hackathon Hackers Facebook group, if you want to see more of
his content. He can be pretty pretentious sometimes, but good content
nonetheless.

~~~
minimaxir
You don't need to make a throwaway to call me pretentious. :P

------
jeremynixon
Fascinating. Wish he could have shown benchmarks on a larger image dataset
(ImageNet or CIFAR-100), as MNIST is extremely easy to train on. Great to
know, especially the LSTM benchmarking.

------
zbjornson
As for the 64 vCPU finding, that's quite possibly because it's crossing NUMA
nodes. GCE's virtualization unfortunately hides NUMA information (at least as
far as I've ever seen), so there's no way to even handle this in software.

Would be interesting to see these benchmarks on Haswell/Broadwell vs Skylake.

------
chris_st
Quick question for those with a deep understanding of these things... I have
not been able to get GPU TensorFlow (on AWS) to run faster for the networks
I'm using.

This is with a small(ish) network of perhaps a few hundred nodes... should I
expect a speedup for this case, or are GPUs only relevant for large CNNs,
etc.?

~~~
david-gpu
Correct. GPUs are not efficient with very small networks.
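
A crude way to see this for yourself (a hypothetical sketch in the TF 1.x API;
the layer size, batch size, and iteration count are arbitrary) is to time the
same small graph on the CPU and the GPU:

    import time
    import numpy as np
    import tensorflow as tf

    def time_device(device, size=256, iters=1000):
        # Build a tiny dense layer pinned to the given device and time it.
        tf.reset_default_graph()
        with tf.device(device):
            x = tf.placeholder(tf.float32, [None, size])
            w = tf.Variable(tf.random_normal([size, size]))
            y = tf.matmul(x, w)
        data = np.random.rand(64, size).astype(np.float32)
        config = tf.ConfigProto(allow_soft_placement=True)
        with tf.Session(config=config) as sess:
            sess.run(tf.global_variables_initializer())
            start = time.time()
            for _ in range(iters):
                sess.run(y, feed_dict={x: data})
            return time.time() - start

    print('CPU:', time_device('/cpu:0'))
    print('GPU:', time_device('/gpu:0'))  # often no faster for tiny layers

For networks this small, per-step launch and host-to-device transfer overhead
tends to swamp any parallelism the GPU offers.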

~~~
chris_st
Any useful rules of thumb on when to use GPU?

------
user5994461
> [slower on 32 and 64 core systems]

The library doesn't handle NUMA hardware?

------
vzn
Would be interesting to see the benefit of MKL optimizations on the same
examples.

[https://software.intel.com/en-us/articles/tensorflow-optimizations-on-modern-intel-architecture](https://software.intel.com/en-us/articles/tensorflow-optimizations-on-modern-intel-architecture)

------
yahyaheee
No spot instances?

~~~
minimaxir
I kept it the analysis to GCE only for simiplicity. (both because the costs of
spot instances are highly variable, and costs are not prorated on Amazon
meaning you have to pay for the full hour; an additional concern if you just
want to run a small ad-hoc training)

~~~
ranman
If a spot instance terminates in the first hour you're not charged for it. You
can grab spot blocks as well for specific duration workloads.

------
0xbear
FYI, y'all: cloud "cores" are actually hyperthreads. Cloud GPUs are single
dies on a multi-die card. If you use GPUs 24x7, just buy a few 1080 Ti cards
and forgo the cloud entirely. If you must use TF in the cloud on CPUs, compile
it yourself with AVX2 and FMA support; stock TF is compiled for the lowest
common denominator.

~~~
tgtweak
This is very important if you're running any CPU-intensive workload at scale.
We custom-compiled x264 and then compiled that into ffmpeg to get everything
out of our CPUs for an encoding cluster. AMD CPUs seem to really shine here.

You'd be surprised at the difference it makes. It was one of the reasons I
liked Gentoo: emerge would always build from source with your target CPU
flags, instead of using the package manager's "one size fits all" build. Those
5-10% gains really compound when you add them up across all dependencies.

Kudos to tutorials and guides that explain how to build from source.

The same is every bit as true today for your containers, assuming you have a
homogeneous target to run them on (yes, I know, containers are supposed to be
supremely portable, but private ones can be purpose-built).

~~~
0xbear
This is especially important if most of your workload is matrix
multiplication. Those workloads benefit heavily from vectorization. It might
also help to enable Intel MKL, because Eigen, which TF uses by default, is not
the fastest thing out there, just the most convenient to work with
cross-platform.

~~~
edwinksl
Would hyperthreading be helpful or harmful?

~~~
0xbear
Hyperthreading is not harmful per se. It lets your CPU make forward progress
when it would otherwise be stalled waiting for something. My issue is that
they call hyperthreads "vCPUs," which makes it seem like you're getting a full
core, while in reality you're getting 60% of a core at most.

~~~
beagle3
Hyperthreading often _is_ harmful when you use it, because while it does let
your CPU make forward progress, it does so at the expense of, e.g., cache that
gets evicted.

It obviously depends on your workload, but on my highly parallel "standard"
workloads, my experience is that you get at most 15% more with hyperthreading
on (e.g. 4 cores/8 threads) compared to off (4 cores/4 threads), whereas on
cache-intensive loads I get 20-30% LESS with hyperthreading on.

~~~
0xbear
I have never encountered such an abnormal workload. This is also less likely
to happen on Broadwell Xeons and up, where the last-level cache can be
partitioned. And it's less likely still on Google Cloud in particular, because
Google uses high-end CPUs with tons of cache.

~~~
beagle3
If both of a core's threads are memory- (and cache-) intensive, then each
effectively gets half the cache size and half the memory bandwidth.
Partitioning may make eviction less random, but the cache size is still
halved, regardless of how much "tons of cache" you start with.

~~~
0xbear
Increasing cache has the net effect of increasing the hit ratio, sometimes
substantially. With 20 MB per die, this may change the calculation of where
things drop off. I've found that I can't reliably predict how a chip will
perform, so I just wrote a bunch of benchmarks, and it takes me about half an
hour to see whether a chip performs better or worse than I expected. Google's
Broadwell VMs perform very well.

