
Comparing Google’s TPUv2 against Nvidia’s V100 on ResNet-50 - henningpeters
https://blog.riseml.com/comparing-google-tpuv2-against-nvidia-v100-on-resnet-50-c2bbb6a51e5e
======
jacksmith21006
Thanks for sharing, very insightful. I guess the TPUs are the real deal:
about 1/2 the cost for similar performance.

I would assume Google is able to do that because the TPUs require less power.

I am actually more curious to get a paper on the new speech NN Google is
using. It is supposed to push 16k samples a second through a NN; it is hard to
imagine how they did that and were able to roll it out, as you would think the
cost would be prohibitive.

You are ultimately competing with a much less compute-heavy solution.

[https://cloudplatform.googleblog.com/2018/03/introducing-Clo...](https://cloudplatform.googleblog.com/2018/03/introducing-Cloud-Text-to-Speech-powered-by-Deepmind-WaveNet-technology.html)

Suspect this was only possible because of the TPUs.
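
For a sense of why that is hard: WaveNet-style models generate audio one
sample at a time, with stacked dilated causal convolutions giving each output
an exponentially large receptive field. A minimal sketch of that
receptive-field arithmetic (dilations doubling, as in the WaveNet paper):

```python
# Receptive field of stacked dilated causal convolutions (WaveNet-style).
kernel_size = 2
dilations = [2 ** i for i in range(10)]  # 1, 2, 4, ..., 512
receptive_field = 1 + (kernel_size - 1) * sum(dilations)
print(receptive_field)  # 1024 samples of context per generated sample

# At 16k samples/second, naive autoregressive sampling means 16,000
# sequential network evaluations per second of generated audio.
```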

I can't think of anything else where controlling the entire stack, including
the silicon, would be more important than it is for AI applications.

~~~
nojvek
Half the cost? Where are you reading that? Yeah, on-demand rental on AWS is
expensive, but long-term rental and buying a V100 yourself are both
significantly cheaper. Cloud companies have pretty fat margins on on-demand
rentals.

You can’t buy a TPU; it’s a cloud-only thing. They also show there isn’t a
huge difference in either performance or time to converge (albeit on only one
architecture).

I would say kudos to the V100, and to this benchmark for breaking the TPU hype.

~~~
jacksmith21006
The chart has $6.7 per hour for 3,186 images/second on Google and $12.2 per
hour for 3,128 images/second on AWS.

Or maybe I am reading it wrong?

That is close to half as much to use Google, is it not?

BTW, the TPUs are also about twice as fast per dollar.
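
Back-of-the-envelope, using those chart numbers (a sketch; hourly prices and
throughput as read off the article's chart):

```python
# Images per dollar from the chart's hourly prices and throughput.
gcp_price, gcp_ips = 6.7, 3186   # $/hour, images/second (Cloud TPU v2-8)
aws_price, aws_ips = 12.2, 3128  # $/hour, images/second (4x V100, p3.8xlarge)

def images_per_dollar(price_per_hour, images_per_second):
    return images_per_second * 3600 / price_per_hour

print(f"GCP: {images_per_dollar(gcp_price, gcp_ips):,.0f} images/$")  # ~1,711,881
print(f"AWS: {images_per_dollar(aws_price, aws_ips):,.0f} images/$")  # ~923,016
```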

Sounds like Google is pretty far ahead of Nvidia. Which really just makes
sense, as Google does the entire stack and is just going to have the data to
optimize the silicon.

About half the cost is hype?

I want it in the cloud so I do not have to deal with updating, etc. I would
think most people are the same for anything of any scale. I could not imagine
building up rigs and dealing with all the issues any longer. Plus it is much
harder to scale.

~~~
nightski
It's more a comparison of AWS vs. Google Cloud pricing than Nvidia vs. TPUv2.

~~~
jacksmith21006
Strongly disagree. If Google is able to offer it at about 1/2 the cost using
their own silicon, versus AWS using Nvidia, that is all about the silicon
difference.

But we also have the V1 TPU paper and can see the TPUs use fewer joules per
inference compared to an older Nvidia architecture. It was not even close. It
just makes sense that Google's V2 TPUs would do the same.

I hope Google does a V3 TPU and then shares a V2 TPU paper, like they did for
V1 of the TPUs.

What is far more impressive about the TPUs is this:

[https://cloudplatform.googleblog.com/2018/03/introducing-Clo...](https://cloudplatform.googleblog.com/2018/03/introducing-Cloud-Text-to-Speech-powered-by-Deepmind-WaveNet-technology.html)

If it is really pushing 16k samples a second through a NN, at a price where
you can offer it generally, that is incredible. I want this paper even more.

~~~
shaklee3
Maybe, maybe not. They have the advantage that they make the hardware, so
they're not paying the retail price Nvidia charges for its cards.
I don't think there's any way you can say the TPU is cheaper compared to
buying your own system. If Google decides to release it to the public, that's
a different story. Also, keep in mind that Google allows you to mix and match
the CPU core count to GPU, whereas AWS doesn't. It's possible that the Google
cloud price with fewer CPU cores will be much cheaper than the AWS instance.

~~~
jacksmith21006
That is true. But running the chips is where most of the cost is, not so much
in making them.

Yes I can say it is a lot cheaper. That is what this article is all about.

You can do about twice the images per dollar using the TPUs with GCP versus
using Nvidia with AWS.

Or what am I missing?

BTW, Google has released them to the general public. What are you talking about?

"Google’s AI chips are now open for public use"

[https://venturebeat.com/2018/02/12/googles-ai-chips-are-now-...](https://venturebeat.com/2018/02/12/googles-ai-chips-are-now-open-for-public-use/)

~~~
shaklee3
You misunderstood. They released them to the public _on GCP only_. Nvidia's
cards are released to the public as a hardware device that you can customize
around. Big difference.

~~~
jacksmith21006
Yes, in the cloud, as you would expect in 2018. Available to the general public.

~~~
shaklee3
They announced in 2016 that they had TPUs. So no, I would not expect that 2
full years later they're just now becoming available in the public cloud.
not new products to them; they likely just don't want to deal with supporting
them in different configurations.

~~~
ndesaulniers
> they likely just don't want to deal with supporting them in different
> configurations.

It is a lot of work. But the main reason is that TPUv1 only did inference,
while TPUv2 does training+inference.

~~~
shaklee3
Exactly my point. It's a lot of work. That's why Nvidia has such a large team
doing it, and also why they spent 3 billion dollars building the V100 ASIC.

~~~
jacksmith21006
The big difference is that Google does the entire stack and also has the scale
to, conceptually, be able to create a better solution. But that is theoretical.

Here we can see the results, where the Google TPU gets almost twice the result
per dollar over Nvidia. And then Google should be able to iterate more quickly.

Take the move from using CNNs to using capsule networks. The idea for capsules
came from Hinton, and Google is going to be the first to optimize for them in
hardware. This is the benefit of playing in all layers of the stack.

Or take using a NN for text-to-speech and offering it at scale. Google just
has inherent advantages over Nvidia, and now we get to see some more concrete
results. But I hope we get a lot more benchmarks like this, to see if the
Google advantage holds up.

------
elmarhaussmann
Hi, author here. The motivation for this article came out of the HN discussion
on a previous post
([https://news.ycombinator.com/item?id=16447096](https://news.ycombinator.com/item?id=16447096)).
There was a lot of valuable feedback - thanks for that.

Happy to answer questions!

~~~
MrBuddyCasino
I found it interesting that they are so close together in performance - I mean
what are the odds that they end up within 2% of each other?

~~~
jacksmith21006
The TPUs are doing almost 2x the images for the same cost.

That is not all that close, is it?

------
zmarty
Slower alternative: "fastai with @pytorch on @awscloud is currently the
fastest to train Imagenet on GPU, fastest on a single machine (faster than
Intel-caffe on 64 machines!), and fastest on public infrastructure (faster
than @TensorFlow on a TPU!) Big thanks to our students that helped with this."
\- [https://twitter.com/jeremyphoward/status/988852083796291584](https://twitter.com/jeremyphoward/status/988852083796291584)

~~~
jorgemf
One machine with 8 V100 GPUs. If you consider one TPU pod a single machine,
the TPU is faster. Those numbers also show that 8 GPUs are slower than 8 TPUs
(so, same conclusion as the article).

------
dimitry12
An important hidden cost here is coding a model which can take advantage of
mixed-precision training. It is not trivial: you have to empirically discover
scaling factors for loss functions, at the very least.
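
As a rough illustration of what that entails, here is a minimal static
loss-scaling loop (a sketch with a made-up toy model; a real setup would also
keep fp32 master weights):

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 10).cuda().half()            # fp16 weights
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

loss_scale = 128.0  # the empirically discovered factor mentioned above

for step in range(10):
    x = torch.randn(32, 128, device="cuda").half()
    y = torch.randint(0, 10, (32,), device="cuda")
    loss = criterion(model(x).float(), y)           # compute the loss in fp32
    optimizer.zero_grad()
    (loss * loss_scale).backward()                  # scale up so tiny grads survive fp16
    for p in model.parameters():
        p.grad.data.div_(loss_scale)                # unscale before the weight update
    optimizer.step()
```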

It's great that there is now wider choice of (pre-trained?) models formulated
for mixed-precision training.

When I was comparing the Titan V (~V100) and the 1080 Ti 5 months ago, I was
only able to get a 90% increase in forward-pass speed for the Titan V (same
batch size), even with mixed precision. And that was for an attention-heavy
model, where I expected the Titan V to show its best. Admittedly, I was able
to use almost double the batch size on the Titan V when doing mixed precision.
And the Titan V draws half the power of a 1080 Ti too :)

In the end my conclusion was: I am not a researcher, I am a practitioner - I
want to do transfer learning or just use existing pre-trained models, without
tweaking them. For that, tensor cores give no benefit.

~~~
bitL
How did you get your hands on Titan V 5 months ago? I still can't find it
anywhere in retail in EU...

~~~
dimitry12
It was in stock on and off and I was able to order it directly from Nvidia US.

After 59 days of playing with it, I sent it back (I initiated the return on
the 30th day, after I had already figured out it doesn't live up to the hype,
and then had another 30 days to actually send it back).

With $3,000 I can buy four 1080 Tis, while only two are necessary to beat the
Titan V (at the Titan V's best game). I only bought one, though.
NowInStock.net helped with buying the 1080 Ti directly from Nvidia.

------
Nokinside
Nvidia is currently in a cashing-out phase. They have a monopoly and money
flows in effortlessly. The cost/performance ratio reflects this.

AMD will enter the game soon, once they get their software working, and Intel
will follow.

I suspect that Nvidia will respond with its own specialized machine learning
and inference chips to match the cost/performance ratio. As long as Nvidia can
maintain high manufacturing volumes and a small performance edge, they can
still make good profits.

~~~
jacksmith21006
"The cost performance ratio reflects this."

But the TPUs are half the cost per this article?

Plus Google does the entire stack and can better optimize the hardware versus
Nvidia. So it seems Google can improve faster, I would think.

If there ever was a huge advantage doing the entire stack it is with neural
networks.

A perfect example is Google's new speech synthesis doing 16k samples a second
with a NN.

[https://cloudplatform.googleblog.com/2018/03/introducing-Clo...](https://cloudplatform.googleblog.com/2018/03/introducing-Cloud-Text-to-Speech-powered-by-Deepmind-WaveNet-technology.html)

I do not think Google could offer this service at a competitive cost without
the TPUs.

This new method replaces one that was far less compute-intensive, so offering
it at a competitive price requires lowering the compute cost, which I suspect
is only possible with the TPUs.

~~~
shaklee3
I'm not sure what you mean by "Google does the entire stack". Nvidia writes
all of the major CUDA libraries used behind the scenes in the NN libraries,
such as cuDNN, cuBLAS, etc. Nvidia can likely improve their hardware
significantly faster/more efficiently than Google can, because their entire
business depends on it. Google has an incentive to improve their TPU for
internal use, but they don't make any money by selling TPU time on GCP yet.

~~~
ndesaulniers
> I'm not sure what you mean by "Google does the entire stack".

Consider that Google has some of the best machine learning researchers,
compiler engineers, hardware engineers, and infrastructure in the business
working on this.

~~~
shaklee3
Huh? Machine learning and infrastructure engineers, yes. Compiler and hardware
engineers? No. What gives you reason to believe they have a lead in either of
those departments, other than that they have a lot of money? They're forced to
use the same foundry as Nvidia, and their hardware team is likely
significantly smaller.

~~~
jacksmith21006
Google has been buying up AI resources since well before anyone else and has
the strongest and deepest team at this point.

It is why so many of the breakthroughs have come from Google. A great example
is winning at Go almost a decade earlier than anyone thought possible.

They probably have two of the strongest teams in the Brain team and the
DeepMind team. But all the other engineers and infrastructure at Google are
first-rate as well.

Really, at this point I do not think the $100B in cash is all that important,
as Google has already built the team, and experienced people are now far more
difficult to get.

The other advantage for Google is their ability to attract the top engineers.

Google just got started a lot earlier on all of this.

~~~
shaklee3
Google got started a lot earlier on this? Did you read what you are saying?
Nvidia has been making hardware longer than Google has been a company. No,
Google does not have a better hardware team. Google has the luxury of making a
device that is used for a single purpose that they control. Nvidia made a
device that can be used for far more and works on commodity hardware. By the
way, DeepMind/AlphaGo uses Nvidia GPUs, so that was an extremely bad example.

~~~
jacksmith21006
Their hardware is optimized for NNs; Nvidia's dominant focus had been
graphics. That is a big difference, and we can see the results in this article.

Plus there are benefits to not having the baggage that Nvidia has.

But you are never going to be able to use a TPU for graphics.

In the end it is about results.

~~~
shaklee3
Tensor cores are hardware optimized for NNs. You call it baggage; Nvidia calls
it extra revenue, because some people need double precision, and those people
are willing to pay a lot of money. So the V100 continues to be the cheapest
way to train and do inference on NNs, because you can actually amortize the
server cost over time. With a TPU, you pay the hourly price forever. TPUs are
better only for NN jobs that are short in length, or if you don't have the
capital to buy a server. For anything longer, you can buy a Titan V and come
out far ahead.

By the way, the Tesla cards have no graphics output, so I'm not sure why you'd
say they have graphics baggage.

~~~
jacksmith21006
The problem for Nvidia is they do NOT do the entire stack. So Google has the
ability to better optimize, and here we are seeing those results, with TPUs
costing about 1/2 the price of using Nvidia hardware.

Baggage is a company thing. Google really has been an AI company since the
late 90s, when Larry Page was asked about using AI to improve search and
replied that he was using search to make AI happen.

Ha! When you amortize you are still spending money, and your saying this
really bothers me; it is such a problem.

Too many people look at things the way you do, and that is why companies get
into trouble. Capitalizing is not magic.

BTW, Google is also going to be able to iterate much more quickly as the AI
breakthroughs happen, and come out with new versions that should stay well
ahead of Nvidia.

The dynamics of the chip business have changed. It used to be that companies
bought chips from someone, put them into servers, and sold the servers.

The problem is that the company making the chips is NOT running the chips, and
does not have any skin in the game or the data needed to improve.

Now we have companies like Google making the chips and also running the chips,
which is why we see the power footprint being far more of a focus than in the
past.

We will see all the big operations, including Amazon, make their own chips
more and more.

A perfect example is capsule networks replacing some uses of CNNs. Google,
with Hinton, developed the capsule network approach and will be supporting it
far faster than you will see from Nvidia.

Then there is TF being the canonical framework for AI.

All of these were theoretical advantages for Google, and now we get to see
that they appear to be real, with the pricing of the TPUs being about half the
cost of using Nvidia.

~~~
shaklee3
You still haven't given a single example of what you mean by "doing the entire
stack". I'm assuming that's because you don't have one?

You seem to have completely missed why Nvidia's stock has gone up 17x in 4
years while Google's has gone up only 3x. The dynamics of the chip business
have not changed; you are focusing on a single market, DNNs, which is a small
piece of the entire science/engineering community. Google made a chip that
accelerates DNNs. They also chose not to make an API to use that hardware
outside of TF. So if you could buy a TPU and put it in your own server, it
would beat the V100 in performance/watt. You can't do that, so Nvidia wins,
because I can buy a V100, and in 51 days the price I bought it for ($8K) has
already been burned through on GCP. If you need me to do the math to help you
realize that your only recurring cost on the V100 (power) is now more than
100x less than the TPU, I can do that for you. But hopefully you understand
now that the TPU is for a niche market outside of Google, and it will never be
a large source of revenue for them at $6.50/hour.
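
The break-even arithmetic behind that 51-day figure (using the $8K card price
and $6.50/hour TPU rate quoted in this thread):

```python
# Hours of Cloud TPU rental that equal the purchase price of one V100.
v100_price = 8000.0   # USD, figure quoted above
tpu_rate = 6.50       # USD per hour, Cloud TPU v2-8 on-demand
hours = v100_price / tpu_rate
print(f"{hours:.0f} hours (~{hours / 24:.0f} days)")  # 1231 hours (~51 days)
```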

TF is not exclusive to Google. Nvidia has engineers working on TF.

Your capsule example is again extremely poor. You think Google can respin an
ASIC quicker than Nvidia? Not only does history say the exact opposite, but
they both use TSMC.

~~~
ndesaulniers
> Nvidia's stock has gone up 17x in 4 years while Google's has gone up only 3x

Not sure the market cap or the P/E are apples to apples there.

Also:

> [https://www.cnbc.com/2018/02/23/secretive-chinese-bitcoin-mi...](https://www.cnbc.com/2018/02/23/secretive-chinese-bitcoin-mining-company-may-have-made-as-much-money-as-nvidia-last-year.html)

------
samfisher83
>For GPUs, there are further interesting options to consider next to buying.
For example, Cirrascale offers monthly rentals of a server with four V100 GPUs
for around $7.5k (~$10.3 per hour). However, further benchmarks are required
to allow a direct comparison since the hardware differs from that on AWS (type
of CPU, memory, NVLink support etc.).

Can't you just buy some 1080s for cheaper than this? I understand there are
electricity and hosting costs, but cloud computing seems expensive compared to
buying equipment.

~~~
dgacmu
Yes, you can. The problem starts when "you" are a large company -- Nvidia
restricts "datacenter" use of consumer GPUs (see the previous HN discussion of
that one:
[https://news.ycombinator.com/item?id=15983587](https://news.ycombinator.com/item?id=15983587)
). A single Titan V is somewhere in the 90% range of a V100 at less than 1/3
the cost, and a 1080 Ti, if you can find one, likely offers a slightly better
price/performance spot. 4-GPU training may suffer due to the lack of NVLink,
but not enough for it to matter too much. As you scale, though, the lack of
NVLink will hurt more. And, of course, all of these things come with a capex
vs. opex tradeoff, and a sysadmin vs. cloud tradeoff, that will appeal
differently in different situations.
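
To put that price/performance point in numbers (a sketch; the ~$3K Titan V and
~$8K V100 prices are assumptions based on figures quoted elsewhere in this
thread):

```python
# Rough relative performance per dollar from the comment's figures.
cards = {
    "Titan V": {"rel_perf": 0.9, "price": 3000},  # ~90% of a V100
    "V100":    {"rel_perf": 1.0, "price": 8000},
}
for name, c in cards.items():
    print(f"{name}: {c['rel_perf'] / c['price'] * 1000:.2f} perf per $1K")
# Titan V ~0.30 vs V100 ~0.12: roughly 2.4x the performance per dollar
```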

~~~
baybal2
Hire people to buy 1080s at retail. This problem is easily solvable.

~~~
dgacmu
It's not about getting the cards (supplies are limited because of
cryptocurrency mining, but you could buy Titan Vs off the shelf in batches of
2). It's about whether or not you're a big enough target for Nvidia's lawyers
if you violate the agreement and actually build out a datacenter with them.

------
bitL
Excellent! Thanks for these numbers; I wanted to see exactly this kind of
benchmark! Do you plan to try the same setup on different problems, like
semantic segmentation, DenseNet, LSTM training performance, etc. as well?

~~~
elmarhaussmann
Happy to hear the benchmark is useful to you! We'd love to try different
setups and further models/networks. On the other hand, such benchmarks are a
LOT of effort (which we underestimated initially), so we'll have to see.

------
kyloon
Excellent work. Do you have plans to open source the scripts/implementation
details used to reproduce the results? It would be great if others could also
validate and repeat the experiment for future software updates (e.g.
TensorFlow 1.8), as I expect there will be some performance gains for both TPU
and GPU from CUDA and TensorFlow optimizations.

Sidenote: Love the illustrations that accompany most of your blog posts, are
they drawn by an in-house artist/designer?

~~~
elmarhaussmann
Happy you like the post! The implementations we used are open source (we
reference the specific revisions), so reproducing results is possible right
now. We haven't thought about publishing our small scripts around that
(there's not much to it), but it's a good idea. There's also work towards
benchmarking suites like DAWNBench
([https://dawn.cs.stanford.edu/benchmark/](https://dawn.cs.stanford.edu/benchmark/)).

The illustrations are from an artist/designer we contract from time to time. I
agree, his work is awesome!

~~~
ndesaulniers
> The illustrations are from an artist/designer we contract from time to time.
> I agree, his work is awesome!

Kudos to them; they are awesome!

------
scottlegrand2
What they're not saying is that one can't use all NVLink bandwidth for
gradient reduction on a DGX-1V with only 4 GPUs, because NVLink is composed of
2 8-node rings. And given the data-parallel nature of this benchmark, I'm very
interested in where time was spent on each architecture.
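
For context, "gradient reduction" here is typically a ring all-reduce over the
GPUs, which is why the ring topology matters. A toy simulation of the pattern
(illustrative only, with numpy arrays standing in for GPUs):

```python
import numpy as np

def ring_allreduce(grads):
    """Sum gradients across n workers via reduce-scatter + all-gather."""
    n = len(grads)
    chunks = [np.array_split(g.astype(float), n) for g in grads]

    # Reduce-scatter: after n-1 steps, worker i owns the full sum of one chunk.
    for step in range(n - 1):
        for i in range(n):
            c = (i - step) % n                      # chunk worker i passes on
            chunks[(i + 1) % n][c] += chunks[i][c]

    # All-gather: circulate the fully reduced chunks around the ring.
    for step in range(n - 1):
        for i in range(n):
            c = (i + 1 - step) % n
            chunks[(i + 1) % n][c] = chunks[i][c]

    return [np.concatenate(w) for w in chunks]

# Four "GPUs", each holding its own gradient vector; all end up with the sum.
gpus = [np.full(8, fill_value=i + 1) for i in range(4)]
print(ring_allreduce(gpus)[0])  # [10. 10. ...] on every worker
```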

That said, they fixed this with NVSwitch, so it's just another HW hiccup, like
int8 was on Pascal.

~~~
elmarhaussmann
For this benchmark, NVLink and gradient reduction aren't the bottleneck.
Performance scales almost perfectly linearly from one GPU to four.

------
drej
Thanks for this, just a minor thing:

You have price per hour and performance in images per second, so their ratio
needs scaling by 3,600 seconds/hour. Also, the resulting metric is not "images
per second per $" but just "images per $".

~~~
elmarhaussmann
Thanks for catching this!

------
wyldfire
How much detail do we know about the TPUs' design? Does Google disclose a
block-diagram level? ISA details? Do they release a toolchain for low-level
programming or only higher-level functions like TensorFlow?

EDIT: I found [1] which describes "tensor cores", "vector/matrix units" and
HBM interfaces. The design sounds similar in concept to GPUs. Maybe they don't
have or need interpolation hw or other GPU features?

[1] [https://cloud.google.com/tpu/docs/system-architecture](https://cloud.google.com/tpu/docs/system-architecture)

~~~
jacksmith21006
There is a great paper on the generation 1 TPU, but Google has not shared many
details on gen 2 and in some ways has kind of hidden information.

I suspect we will need a gen 3 before we get a paper on gen 2.

Here is the gen 1 paper, which I highly recommend. Pretty interesting: it uses
a 256x256 systolic array of 65,536 very simple MAC units.

[https://arxiv.org/ftp/arxiv/papers/1704/1704.04760.pdf](https://arxiv.org/ftp/arxiv/papers/1704/1704.04760.pdf)
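
The headline number from that paper follows directly from the array size
(figures from the gen 1 paper):

```python
# Peak throughput of the gen-1 TPU's 256x256 systolic MAC array.
macs = 256 * 256           # 65,536 8-bit multiply-accumulate units
clock_hz = 700e6           # 700 MHz
ops = macs * 2 * clock_hz  # multiply + add count as 2 ops each cycle
print(f"{ops / 1e12:.0f} TOPS")  # ~92 TOPS peak, as reported in the paper
```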

------
twtw
Great work, RiseML. This benchmark is sincerely appreciated.

I wonder whether NVLink would make any difference for ResNet-50. Does anyone
know whether these implementations require any inter-GPU communication?

~~~
elmarhaussmann
They don't require it but some of the ResNet-50 implementations can make use
of it (e.g., the ones in the Docker containers on the Nvidia GPU Cloud). But
even the ones without seem to scale to 4 GPUs pretty well. This may be a
different story for 8 GPUs and larger/deeper networks, e.g., ResNet-152.

------
threeseed
Was this running the AWS Deep Learning AMI, or did you build your own?

I ask because Intel was involved in its development and made a number of
tweaks to improve performance.

I'd be curious whether that actually was significant or not.

~~~
elmarhaussmann
On AWS this was using nvidia-docker with the TensorFlow Docker images. The AWS
Deep Learning AMI probably gives very similar performance (with the same
versions of CUDA, TensorFlow, etc.). There's only so much you can tweak if the
GPU itself is the bottleneck...

------
Tenoke
>For the V100 experiments, we used a p3.8xlarge instance ( _Xeon
E5-2686@2.30GHz 16 cores, 244 GB memory_ , Ubuntu 16.04) on AWS with four V100
GPUs (16 GB of memory each). For the TPU experiments, we used a small
n1-standard-4 instance as host ( _Xeon@2.3GHz two cores, 15 GB memory_ ,
Debian 9) for which we provisioned a Cloud TPU (v2-8) consisting of four TPUv2
chips (16 GB of memory each).

A bit odd that the TPUs are provisioned on such a weak machine compared to the
V100s, especially when there were comparisons that included augmentation and
other processing outside of the TPU.

~~~
elmarhaussmann
All of the computation, including pre-processing, is offloaded to the TPU. The
weak host machine is really just idling. A bigger one would only cost more
money and have no measurable effect on performance.

~~~
KaoruAoiShiho
What is the cost difference between the CPUs on Google Cloud vs. AWS? How
would adjusting for it affect the cost/images ratio?

