
Cost Comparison of Deep Learning Hardware: Google TPUv2 vs. Nvidia Tesla V100 - pul
https://medium.com/bigdatarepublic/cost-comparison-of-deep-learning-hardware-google-tpuv2-vs-nvidia-tesla-v100-3c63fe56c20f
======
zak
Here is a larger-scale comparison of Cloud TPU and Google Cloud GPU
performance and cost (focused on Cloud TPU Pods):
[https://cloud.google.com/blog/products/ai-machine-learning/n...](https://cloud.google.com/blog/products/ai-machine-learning/now-you-can-train-ml-models-faster-and-lower-cost-cloud-tpu-pods)

All the code used in that comparison is open source, and there is a detailed
methodology page with instructions that you can follow if you want to
reproduce the results:
[https://github.com/tensorflow/tpu/blob/master/benchmarks/Res...](https://github.com/tensorflow/tpu/blob/master/benchmarks/ResNet-50_v1.5_Performance_Comparison_TensorFlow_1.12_GCP.md)

Also, Cloud TPUs are available to everyone for free via Colab. Here is a
sample Colab that shows how to train a Keras model on the Fashion MNIST
dataset using the Adam optimizer:
[https://colab.research.google.com/github/tensorflow/tpu/blob...](https://colab.research.google.com/github/tensorflow/tpu/blob/master/tools/colab/fashion_mnist.ipynb)
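
For a sense of what that looks like in code, here is a rough sketch of the
same idea using the current `tf.distribute.TPUStrategy` API (TF 2.x); the
notebook itself may differ in details, and the model and batch size below are
illustrative:

    import tensorflow as tf

    # Attach to the Colab-provided TPU and build a distribution strategy.
    resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='')
    tf.config.experimental_connect_to_cluster(resolver)
    tf.tpu.experimental.initialize_tpu_system(resolver)
    strategy = tf.distribute.TPUStrategy(resolver)

    (x_train, y_train), _ = tf.keras.datasets.fashion_mnist.load_data()

    # Variables created inside the scope are replicated across the TPU cores.
    with strategy.scope():
        model = tf.keras.Sequential([
            tf.keras.layers.Flatten(input_shape=(28, 28)),
            tf.keras.layers.Dense(128, activation='relu'),
            tf.keras.layers.Dense(10, activation='softmax'),
        ])
        model.compile(optimizer='adam',
                      loss='sparse_categorical_crossentropy',
                      metrics=['accuracy'])

    model.fit(x_train / 255.0, y_train, epochs=5, batch_size=1024)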

(I work on Cloud TPUs)

~~~
twtw
Are there plans for preemptible TPU pods?

As is, it looks like a p3.16xlarge spot instance (or probably a preemptible
GCP 8xV100 instance) is still by far the most cost-effective option. You have
to do a bit of extra work to tolerate preemption, but it's worth the ~80%
savings.
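
(For anyone curious, the "bit extra" is mostly periodic checkpointing. A
minimal Keras sketch, with a toy model and a local path standing in for the
cloud bucket you'd actually use:)

    import numpy as np
    import tensorflow as tf

    # Toy setup; in practice this is your real model and input pipeline.
    model = tf.keras.Sequential([tf.keras.layers.Dense(10, input_shape=(4,))])
    model.compile(optimizer='sgd', loss='mse')

    # Save weights every epoch so a preempted run can resume.
    ckpt = tf.keras.callbacks.ModelCheckpoint(
        'weights-{epoch:02d}.h5',  # in practice, a gs:// or s3:// path
        save_weights_only=True)

    x, y = np.random.rand(256, 4), np.random.rand(256, 10)
    model.fit(x, y, epochs=3, callbacks=[ckpt])
    # After preemption: rebuild the model, call model.load_weights() on the
    # newest file, and resume fit() with initial_epoch set accordingly.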

Also, the "TPU pod is 200x faster than single v100" comparison is a little
goofy. Might as well say Summit or Titan is faster than my desktop.

~~~
zak
At present, preemptible Cloud TPU v2 and v3 devices are widely available, and
they are likely to be the most cost-effective option for training any of the
models listed here, often by a wide margin:
[https://cloud.google.com/tpu/docs/tutorials](https://cloud.google.com/tpu/docs/tutorials)

Definitely appreciate your point about comparing Summit to your desktop -
however, the difference is that you can't rent Summit via any public cloud,
whereas you _can_ rent Cloud TPU Pods.

You might find the comparison between 8 x V100 GPUs on GCP and a full Cloud
TPU Pod more relevant - in that case, as of the time the Google Cloud blog
post linked above was published, a full Cloud TPU Pod delivered a 27X speedup
at 38% lower cost for a large-scale ResNet-50 training run, all without
requiring any code changes to scale beyond a single device.
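
(To put those two figures together, a quick back-of-the-envelope, derived
from the numbers above rather than from the price list:)

    # 27x faster with a 38% lower total job cost implies the Pod's hourly
    # rate is roughly 27 * (1 - 0.38) ~= 16.7x the 8 x V100 hourly rate.
    speedup = 27.0
    job_cost_ratio = 1 - 0.38        # Pod job cost / GPU job cost
    print(speedup * job_cost_ratio)  # -> 16.74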

~~~
twtw
Thanks for your response.

I'm interested particularly in preemptible _pods_. The pricing page at
[https://cloud.google.com/tpu/docs/pricing](https://cloud.google.com/tpu/docs/pricing)
lists preemptible pricing for single devices, but there is no indication that
preemptible pods are available.

------
trishume
This doesn't seem like a very informative benchmark to me. They don't mention
how, or whether, they tuned the learning rates and batch sizes for each
device. As they themselves note, they also use a very small network that isn't
something you need the power of a TPU to train quickly, and that may scale
differently from a large network.

They also don't post their code, so I can't check whether their problems with
Adam are due to using L2 regularization, which
[https://arxiv.org/abs/1711.05101](https://arxiv.org/abs/1711.05101) shows
leads to worse performance than SGD; decoupled weight decay should be used
instead.
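
For reference, the distinction the paper draws, sketched as single update
steps in NumPy (hyperparameter values are illustrative):

    import numpy as np

    def adam_l2_step(w, g, m, v, t, lr=1e-3, lam=1e-4,
                     b1=0.9, b2=0.999, eps=1e-8):
        # Adam + L2: the penalty gradient lam*w is folded into g, so it
        # gets rescaled by Adam's per-parameter adaptive denominator.
        g = g + lam * w
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g ** 2
        mhat, vhat = m / (1 - b1 ** t), v / (1 - b2 ** t)
        return w - lr * mhat / (np.sqrt(vhat) + eps), m, v

    def adamw_step(w, g, m, v, t, lr=1e-3, wd=1e-4,
                   b1=0.9, b2=0.999, eps=1e-8):
        # AdamW: the decay is applied directly to w, decoupled from the
        # adaptive update (Loshchilov & Hutter, arXiv:1711.05101).
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g ** 2
        mhat, vhat = m / (1 - b1 ** t), v / (1 - b2 ** t)
        return w - lr * (mhat / (np.sqrt(vhat) + eps) + wd * w), m, v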

~~~
bigdatarepublic
> They don't mention how/whether they tuned the learning rates and batch sizes
> to optimize for each different device.

All networks were trained with the same hyperparameters. Only the batch size
was increased with the degree of parallelism (so it was increased 8-fold for
the TPU and for distributed GPU).

> Like they mention, they also use a very small network that isn't something
> you need the power of a TPU to train quickly and may scale differently than
> a large network.

I agree completely here. Unfortunately, we didn't have access to the TPUs
long enough to create more useful benchmarks on networks like ResNet-50 and
with bigger datasets.

> They also don't post their code so I can't check that their problems with
> ADAM aren't due to using L2 regularization, which
> [https://arxiv.org/abs/1711.05101](https://arxiv.org/abs/1711.05101) shows
> leads to worse performance than SGD and you should use weight decay instead.

The code is the same for all devices, and in the single-GPU and CPU
benchmarks we see Adam performing better than SGD. We did not use any
regularization besides a dropout layer, so I don't think this explains the
bad Adam performance on TPU.

I'll make some effort to clean up the code so it can be shared.

~~~
trishume
The fact that you didn't change the learning rates for different batch sizes,
even by following a linear scaling rule, let alone tuning for each batch
size, is pretty important. IMO, comparisons across batch sizes without tuning
the learning rates are pretty meaningless, because changing the batch size
changes the effective learning rate, which can make a big difference in
training.
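
Concretely, the linear scaling rule (Goyal et al., arXiv:1706.02677) for the
8-fold batch increase mentioned above would look like this; the base values
are illustrative:

    # Grow the learning rate in proportion to the batch size (with a
    # warmup period in practice).
    base_lr, base_batch = 0.001, 128      # single-device settings
    tpu_batch = base_batch * 8            # the 8-fold increase above
    tpu_lr = base_lr * (tpu_batch / base_batch)   # 0.008, not 0.001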

~~~
bigdatarepublic
Good catch! I'll try to rerun the experiment. Hopefully the Google Colab TPUs
give similar results to the Google Cloud ones so I can keep experimenting.

Still, since Adam performs worse even on a non-distributed TPU (where the
batch size and learning rate are the same as for the other devices), this
wouldn't explain what we're seeing. Anyway, I'll try to find some time
tomorrow to post the code so you can all have a look at it and hopefully see
what's up.

------
twtw
Looks like RiseML has shut down and taken their comparison post down. I was
hoping to compare the results.

~~~
pul
It's still in Google's cache:
[https://webcache.googleusercontent.com/search?q=cache:zZwHCS...](https://webcache.googleusercontent.com/search?q=cache:zZwHCS9zhzAJ:https://blog.riseml.com/comparing-google-tpuv2-against-nvidia-v100-on-resnet-50-c2bbb6a51e5e+)
It's not on archive.org, though.

------
deepnotderp
Hmm, interesting. Is it possible that the batch size for the TPU is larger?
I'm guessing they might be using large batches to keep the TPU's giant GEMM
(matrix multiply) units filled.

------
andrewtbham
My understanding is that Teslas offer double-precision (FP64) accuracy for
scientific computing that is not needed for deep learning... my machine has
Nvidia 1080s.

~~~
Nursie
Tesla V100 has tensor cores which are supposed to massively accelerate deep
learning compared to the Pascal architecture of a 1080, which lacks them.

I believe they are present in the Turing products from nVidia too.
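
(Tangentially: tensor cores only engage when the matmuls actually run in
half precision. In current Keras that's a one-line policy switch, TF 2.4+; a
sketch, not something from the article:)

    import tensorflow as tf

    # Compute in float16 (tensor-core eligible); variables stay float32.
    tf.keras.mixed_precision.set_global_policy('mixed_float16')

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation='relu', input_shape=(64,)),
        # Keep the final softmax in float32 for numerical stability.
        tf.keras.layers.Dense(10, activation='softmax', dtype='float32'),
    ])
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')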

