
Benchmarking Google’s new TPUv2 - henningpeters
https://blog.riseml.com/benchmarking-googles-new-tpuv2-121c03b71384
======
boulos
Disclosure: I work on Google Cloud.

While not perfect, I want to commend the RiseML folks for doing not only a
“just out of the box” run in both regular and fp16 mode (for the V100), but
also adding their own LSTM experiment to the mix. We need third-party
benchmarks whenever new hardware or software is being sold by vendors
(reminder: I benefit from you buying Google Cloud!).

I hope the authors are able to collect some of the feedback here and update
their benchmark and blog post. The question about batch size comparisons is
probably the most direct, but like others, I’d encourage a run on 1, 2, 4 and
8 V100s as well.

~~~
joe_the_user
So this is a chip that no one outside of Google is going to be able to get a
physical copy of ever?

That makes any benchmarks Google Cloud benchmarks, right?

Edit: I am complaining a bit about the lack of availability, but there's also
a real point here. If there's no source for TPUs outside of Google, Google
Cloud competes only with other cloud providers and with owning physical GPUs.
Long term, it has no incentive to be anything more than a little bit more
efficient than those, however much its cost of producing TPUs declines.

~~~
dgacmu
It's going to be a very exciting multi-company arms race -- at minimum,
Google, Intel, Nvidia. Microsoft has their FPGAs, Amazon has their rumors. And
there are several startups trying to enter the space. I don't think we're
looking at stagnation; very much the opposite. It's going to be fantastic for
the field.

(I'm saying this with my CMU hat, not my Google hat.)

~~~
joe_the_user
I don't see things stagnating either, but it seems like there's a potential
for the individual to get cut out of this excitement if each of these entities
is keeping its chips close to its chest.

The era of the mainframe, with each provider competing with a custom chip,
wasn't necessarily beneficial for individuals buying computing power.

~~~
ucaetano
Industries go through cycles of innovation and concentration. During
innovation cycles, many new non-standard products appear with innovative
solutions, and the entire pie grows really fast. As growth eventually
stabilizes, standards become more relevant and consolidation happens, leading
to a stagnation that makes the industry ripe for disruption and change again.

If you look at processors, you see this with the early custom processors,
followed by some standardization and copying around the IBM S/360, followed by
more proprietary innovation in the PC era, finally resulting in x86 dominance,
eventually disrupted by mobile chips, which then consolidated around ARM, and
so on.

~~~
jxy
I second this.

With cloud computing, we are essentially going back to the era of time-shared
mainframes with remote access.

------
angrygoat
Google claims 29x better performance-per-Watt with TPUs than contemporary
GPUs[0]. It's interesting to contrast that with the images-per-$ figure in
this post, which is more like 2x.

I assume there's a high capital cost for this new hardware, but as they scale
it up I wonder if the TPU-to-GPU cost ratio will trend towards the ratio of
performance-per-Watt between the platforms? Seems like a natural limit, even
if it never quite gets there.

[0] https://cloud.google.com/blog/big-data/2017/05/an-in-depth-look-at-googles-first-tensor-processing-unit-tpu

~~~
aje403
Maybe Google will pivot from high tech into crypto mining

------
jrk
[Edited] The top-line results focus on comparing four TPU chips in a rack node
(which marketing cleverly named “one cloud TPU”), running ~16-bit mixed
precision, to one GPU (out of 8 in a rack node), also capable of 16-bit or
mixed precision but handicapped to 32-bit IEEE 754. That is a misleading
comparison. Images/$ are obviously more directly comparable, but again the
emphasized comparisons are at different precisions. Very different batch sizes
make this more misleading still. Images/$ also only tells us that Google has
chosen to look at the competition and set a competitive price; the per-die or
per-package comparison is much more relevant for understanding any intrinsic
architectural advantage, since these are all large dies on roughly comparable
process nodes.

~~~
BLanen
The number of devices is completely irrelevant.

It's all about performance per dollar.

~~~
boulos
Disclosure: I work on Google Cloud.

Not necessarily. The DGX-1, for example, has pretty poor perf/$$ but reduces
the time a data scientist spends waiting. For some organizations, their
people's time is so valuable that what matters is “what gets me my answers
back faster”, because that employee easily costs $100/hr+.

That’s actually why the 8xV100 with NVLink is so attractive (and why the TPUs
_also_ have board-to-board networking, not just chip-to-chip).

------
gok
The bar graph seems a little whacky. It groups the TPU (which can only do
FP16) with the FP32 results from the GPUs, then puts the FP16 GPU results off
to the side even though that's much closer to what the TPU is doing.

Impressive results regardless, though; quite a bit faster relative to the V100
than the paper specs would suggest.

~~~
ekelsen
It also seems like the price comparison should compare with the fp16 numbers
on both platforms, not the fp32 numbers.

------
slashcom
Wait, but the batch size is 8x bigger for the TPU? That's not a fair
comparison; increasing batch size always speeds things up...

~~~
elmarhaussmann
Author here.

Note that the TPU supports larger batch sizes because it has more RAM. We
tested multiple batch sizes for GPUs and reported the fastest one. We'll try
increasing the batch sizes as far as possible and report back. The overall
comparison will likely not change by much - we saw speed increases of around
5% when doubling the batch size from 64 to 128.
(https://www.tensorflow.org/performance/benchmarks also reports numbers for
batch sizes of 32 and 64 on the P100.)
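
For anyone who wants to reproduce the sweep, it boils down to something like
the following (a rough sketch only; the flag names are those of the
tf_cnn_benchmarks script, and the exact models and batch sizes we tested may
differ):

    import subprocess

    # Hypothetical sweep over batch sizes with the tf_cnn_benchmarks script;
    # the reported number per device is then the best-performing batch size.
    for batch_size in (32, 64, 128, 256):
        subprocess.run(
            ["python", "tf_cnn_benchmarks.py",
             "--model=resnet50",
             "--num_gpus=1",
             "--use_fp16=true",
             "--batch_size=%d" % batch_size],
            check=True)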

~~~
boulos
Disclosure: I work on Google Cloud.

Oh! You should definitely say that. It's semi-reasonable then to choose the
batch size that is optimal for the part. It'd be good to make sure this
_isn't_ why your LSTM didn't converge though...

~~~
elmarhaussmann
I tested many different batch sizes for the LSTM, so I am pretty confident
it's not the reason.

------
dkobran
Just to clarify, is this benchmark leveraging mixed-precision mode on the
Volta V100? The major innovation of the Volta generation is mixed precision,
which NVIDIA claims gives a huge performance increase over the Pascal
generation (the P100, in the case of your benchmark).

Link to NVIDIA documentation on mixed-precision TensorCores:
https://devblogs.nvidia.com/inside-volta/

~~~
elmarhaussmann
Where "fp16" is specified, the V100 benchmarks use the code from
https://github.com/tensorflow/benchmarks/tree/master/scripts/tf_cnn_benchmarks
with the flag --use_fp16=true, which enables fp16 for some, but not all,
tensors.
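
Conceptually, "fp16 for some but not all tensors" means something like the
following (a minimal TF 1.x-style sketch, not the exact graph the benchmark
script builds; the full mixed-precision recipe also involves loss scaling):

    import tensorflow as tf

    # Do the expensive matmul in fp16, keep the numerically sensitive
    # softmax/loss computation in fp32.
    x = tf.placeholder(tf.float32, [None, 1024])
    labels = tf.placeholder(tf.int32, [None])

    w = tf.Variable(tf.random_normal([1024, 10]))   # fp32 master weights
    logits16 = tf.matmul(tf.cast(x, tf.float16), tf.cast(w, tf.float16))

    # Cast back up before the loss so the reduction happens in fp32.
    loss = tf.losses.sparse_softmax_cross_entropy(
        labels=labels, logits=tf.cast(logits16, tf.float32))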

~~~
dkobran
It's my understanding that fp16 (available on the previous-generation P100)
and mixed precision (the major innovation of the V100) are different things,
and that the speedup of TensorCores is entirely missing from this benchmark.
Unlike the general-purpose P100, the TPU is a heavily optimized chip built for
Deep Learning, hence its performance increase. However, the V100 is also
heavily optimized for Deep Learning (arguably NVIDIA's first chip that isn't
purely a GPU). I'm in no position to defend NVIDIA here haha, but it seems
like the benchmark misses the point if this is indeed the case.

~~~
elmarhaussmann
It was my understanding that the TensorFlow benchmarks do make use of
TensorCores on the V100. We'll verify and update accordingly.

------
Nokinside
Specialization brings speedups.

TPUv2 is specially optimized for deep learning.

Nvidia's Volta microarchitecture is a graphics processor with additional
tensor units. It's a general-purpose (GPGPU) chip designed with graphics and
other scientific computing tasks in mind. Nvidia has enjoyed monopoly power in
the market, and a single microarchitecture has been enough in every
high-performance category.

The next logical step for Nvidia is to develop a specialized deep learning
processor to compete with TPUv2 and others.

~~~
deepnotderp
The Volta V100 already has "tensor cores", which are basically little matrix
multiplication ASICs.

~~~
Nokinside
That's what I said.

The microarchitecture has many unnecessary things, and it's not optimized as a
whole for deep learning.

~~~
deepnotderp
I believe it was either the last MICRO* or the one before that when Dally
addressed this point. The specialized hardware for graphics ends up comprising
such a small portion of the overall chip that it wasn't worth it to remove it.
The "GPUs were made for graphics thus aren't good for DL" argument really
doesn't hold a lot of water IMO.

* It might've been a different conference, now that I think of it.

------
alexnewman
The entire idea that people are going to gain some huge advantage over Nvidia
with hardware softmax seems dubious. I do think it will buy them some time,
but eventually it seems as though Nvidia will win this one.

------
ysleepy
I'd be interested in how the superior perf/Watt claim holds up in Google's
practical setup. The additional networking gear, power supply losses, and so
on might make the difference smaller.

I'm also not sure how we can take Google's word for the numbers, since they
might well be eating a less-than-ideal power cost to promote their platform.
Any upfront cost will probably be offset by locked-in customers later on.

I might just be a bit cynical though.

------
twtw
IIRC, TPUv2 uses 16-bit floating point in some format with higher dynamic
range and lower precision than standard fp16. Can someone confirm?

If that is right, is the "TensorFlow-optimized" ResNet-50 using 16-bit floats
when running on TPUv2?

~~~
deepnotderp
Re: fp16 dynamic range: yes.
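
In numbers, assuming the format in question is bfloat16 (8 exponent bits,
7 mantissa bits), the dynamic range works out roughly like this (illustrative
sketch):

    import numpy as np

    # IEEE fp16: 5 exponent bits, 10 mantissa bits.
    fp16_max = float(np.finfo(np.float16).max)   # ~6.55e4

    # bfloat16 (assumed TPU format): 8 exponent bits, 7 mantissa bits,
    # i.e. roughly fp32's range with far less precision.
    bf16_max = (2 - 2 ** -7) * 2.0 ** 127        # ~3.39e38
    fp32_max = float(np.finfo(np.float32).max)   # ~3.40e38

    print(fp16_max, bf16_max, fp32_max)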

------
PaulHoule
Does this take into account the fact that you might need fewer epochs if you
reduce the batch size? (as is done for the CPU?)

~~~
elmarhaussmann
Author here.

No, this is really only comparing throughput on the devices. A thorough
comparison should focus on time to reach a certain quality - including all of
the tricks available for a given architecture.

------
neves
Would I be able to buy one of these for my home, or are they just in the
cloud? If I could buy one, how much would it cost?

~~~
elmarhaussmann
Author here.

These are only available on the Google Cloud right now. I don't think there
are plans to sell them anytime soon.

------
amelius
> In order to efficiently use TPUs, your code should build on the high-level
> Estimator abstraction.

Does this mean it's inference-only? (I only quickly scanned the article)

~~~
jlebar
No, this whole blog post is about training models.
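
The Estimator requirement is about how you structure your training code, not
about inference. A minimal, hypothetical sketch of the TF 1.x-era TPUEstimator
API (my_input_fn and the TPU address are placeholders):

    import tensorflow as tf

    def model_fn(features, labels, mode, params):
        logits = tf.layers.dense(features, 10)
        loss = tf.losses.sparse_softmax_cross_entropy(labels=labels,
                                                      logits=logits)
        optimizer = tf.train.GradientDescentOptimizer(0.01)
        # Wrapping the optimizer is what lets the same model_fn train on a TPU.
        optimizer = tf.contrib.tpu.CrossShardOptimizer(optimizer)
        train_op = optimizer.minimize(
            loss, global_step=tf.train.get_global_step())
        return tf.contrib.tpu.TPUEstimatorSpec(mode=mode, loss=loss,
                                               train_op=train_op)

    estimator = tf.contrib.tpu.TPUEstimator(
        model_fn=model_fn,
        use_tpu=True,
        train_batch_size=1024,
        config=tf.contrib.tpu.RunConfig(master="grpc://<tpu-address>"))

    # my_input_fn: placeholder for a standard Estimator input_fn that
    # returns a tf.data.Dataset of (features, labels) batches.
    estimator.train(input_fn=my_input_fn, max_steps=1000)  # training, not inference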

------
chapill
I wonder if Chinese companies will use (or be allowed to use) TPUs. It seems
like a pretty obvious way to have the NSA scoop up any Chinese AI advancements
China may want to keep secret.

~~~
NelsonMinar
I wonder which Chinese companies are developing their own processors like
TPUs.

~~~
chapill
Well, they do have the fastest supercomputer in the world currently, and it's
made with homegrown chips. No Intel ME backdoors there. Smaller Chinese
companies could, for a little more money, get similar performance by buying 8x
V100 machines from Nvidia. I don't think they want to share their advancements
in AI fighter pilots with the USA. They have a big lead.

~~~
olfactory
What is the hardest thing to accomplish with something like a TPU? Is it the
IP or the fabrication?

How does the TPU design offer improved performance? By leveraging IP or
fabrication improvements?

~~~
deepnotderp
Neither; it's a matrix-multiply systolic array ASIC, which was done decades
ago.

There are a host of Chinese companies developing similar processors.
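
For intuition, the core idea is just a grid of multiply-accumulate cells that
operands are pumped through; here's a toy cycle-by-cycle simulation (pure
Python, purely illustrative, not how the TPU is actually laid out):

    def systolic_matmul(A, B):
        """Output-stationary systolic simulation of C = A @ B (lists of lists)."""
        n, k, m = len(A), len(A[0]), len(B[0])
        C = [[0.0] * m for _ in range(n)]
        a_reg = [[0.0] * m for _ in range(n)]   # operand held by each cell
        b_reg = [[0.0] * m for _ in range(n)]
        for t in range(n + m + k - 2):          # one wavefront per "cycle"
            # Operands move one cell right (A) and down (B) per cycle.
            for i in range(n):
                for j in range(m - 1, 0, -1):
                    a_reg[i][j] = a_reg[i][j - 1]
            for j in range(m):
                for i in range(n - 1, 0, -1):
                    b_reg[i][j] = b_reg[i - 1][j]
            # Feed the skewed edges: row i / column j start one cycle later each.
            for i in range(n):
                a_reg[i][0] = A[i][t - i] if 0 <= t - i < k else 0.0
            for j in range(m):
                b_reg[0][j] = B[t - j][j] if 0 <= t - j < k else 0.0
            # Every cell multiply-accumulates the pair of operands it holds.
            for i in range(n):
                for j in range(m):
                    C[i][j] += a_reg[i][j] * b_reg[i][j]
        return C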

~~~
olfactory
Why is Google investing in its own?

~~~
deepnotderp
Cost advantage.

------
bhouston
It is hard for Google to make money on these TPUs, as the whole engineering
cost has to be made back from its pricing on Google Cloud, whereas NVIDIA can
pay back its engineering costs via multiple mature channels (games,
supercomputers, and multiple cloud providers).

I wonder which is higher: the cost of creating the TPUs in terms of
engineering and manufacturing, or the cost differential in terms of usage
compared to NVIDIA's latest?

I worry about Google long term here. I am surprised the TPU doesn't kick the
ass of the NVIDIA chips.

~~~
boulos
Disclosure: I work on Google Cloud.

By the logic above, you would conclude that TPUv1 (the inference-only chip)
might have been a mistake, but we’ve been very public about how it “saved us
from building lots of datacenters”.

That wasn’t ever sold as part of Cloud, so the benefit there is all from the
second bit you mentioned: cheaper and more efficient than GPUs _at the time_.
The paper also goes into more detail, but the size of that initial engineering
team and time to market were both quite small.

For training, before Volta (and kind of Pascal), GPUs were the best option but
not particularly efficient. Volta does the same “we should have a single
instruction that does lots of math in one shot” by cleverly reusing the
existing functional units. That the V100 is a great chip is a good outcome
for the whole industry. But GPUs aren’t (and shouldn’t be) just focused on ML.
My bet is that there’s still a decent amount of runway left in specialized
chips for ML, just as GPUs carved out their own niche versus CPUs.

But again, the “even just for Google” benefit is really enormous, so I
wouldn’t assume that Cloud has to pay for the entire effort. Could GPU
manufacturers improve the cost:performance ratio of ML workloads enough that
Google doesn’t have to build TPUs anymore? Perhaps, but like the V100
improvements, that would be a great outcome!

~~~
puzzle
Is there going to be an updated paper on performance per Watt, now that TPUv2
is public and V100 has been preannounced on the Google blog?

