
Cloud TPU Pods Break AI Training Records - jonbaer
https://cloud.google.com/blog/products/ai-machine-learning/cloud-tpu-pods-break-ai-training-records
======
rrss
It's disappointing there is no attempt at normalizing results. It's not
perf/$, or perf/W, or perf/chip, or anything that might be useful - it seems
to just be perf/(the largest machine google/nvidia could afford to put
together for a given benchmark).

Seriously.

Transformer: 1024 TPUs are twice as fast as 480 GPUs.

ResNet-50: 1536 GPUs are about as fast as 1024 TPUs.

SSD: 1024 TPUs are twice as fast as 240 GPUs.

Great.
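
If you do the naive per-chip normalization yourself (assuming linear scaling
within each submission, which is generous at these scales), the per-chip
picture is roughly:

    # Naive per-chip normalization of the headline results above.
    # Assumes throughput scales linearly with chip count, which is generous.
    results = {
        # benchmark: (tpu_speedup_vs_gpu, tpu_chips, gpu_chips)
        "Transformer": (2.0, 1024, 480),
        "ResNet-50":   (1.0, 1024, 1536),
        "SSD":         (2.0, 1024, 240),
    }

    for name, (speedup, tpu_chips, gpu_chips) in results.items():
        per_chip = speedup * gpu_chips / tpu_chips  # TPU-vs-GPU ratio per chip
        print(f"{name}: ~{per_chip:.2f}x per TPU chip vs per GPU")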

~~~
zak
Author of the blog post here.

Cloud TPUs are designed to maximize performance-per-dollar, so you are right
that pure performance comparisons at maximum scale don't tell the whole story.

The most straightforward performance-per-dollar comparisons would be among
several different hardware configurations across the major public clouds.
However, we haven't yet seen any other public cloud MLPerf submissions at
scales comparable to Cloud TPU Pods, so there isn't currently a strong
baseline available for comparison. It's also not clear whether public cloud
networking will ultimately be able to match the performance of the network
hardware that was used to produce the largest-scale on-premise MLPerf
submissions.

~~~
yaroslavvb
Can you elaborate on the performance-per-dollar part? I.e., I'm seeing GCP
provide a V100 at $2.48/hour and a TPU v3 at $8.00/hour.

~~~
zak
Sure. The $2.48/hour per V100 GPU on GCP does not include the price of the CPU
host; that is purely the price to rent a single accelerator. By contrast, a
network-attached Cloud TPU v3 device includes both a CPU host and four
connected TPU v3 chips that collectively deliver up to 420 teraflops.
Furthermore, each individual V100 GPU on GCP has 16 GB of memory, whereas the
Cloud TPU v3 device has 128 GB of HBM.
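
As a very rough peak-throughput-per-dollar sketch (the prices are the
on-demand rates mentioned above; the ~125 peak mixed-precision teraflops per
V100 is an assumed spec on my part, and peak FLOPS ignore utilization, host
cost, and memory capacity):

    # Back-of-the-envelope peak TFLOPS per dollar, using the prices above.
    # The V100 peak of ~125 TFLOPS (mixed precision) is an assumed spec;
    # peak FLOPS ignore utilization, host cost, and memory capacity.
    v100_price_per_hour = 2.48     # USD, single GPU, excludes CPU host
    v100_peak_tflops = 125.0       # assumed mixed-precision peak

    tpu_v3_price_per_hour = 8.00   # USD, CPU host + 4 TPU v3 chips
    tpu_v3_peak_tflops = 420.0     # quoted peak for the whole device

    print(f"V100:   {v100_peak_tflops / v100_price_per_hour:.1f} peak TFLOPS per $/hour")
    print(f"TPU v3: {tpu_v3_peak_tflops / tpu_v3_price_per_hour:.1f} peak TFLOPS per $/hour")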

The best apples-to-apples performance-per-dollar comparison we have publicly
available was published last fall, and it compared the performance and cost of
using various Cloud TPU v2 Pod slice sizes with the performance and cost of
using various numbers of V100 GPUs attached to a single GCP host:

https://cloud.google.com/blog/products/ai-machine-learning/now-you-can-train-ml-models-faster-and-lower-cost-cloud-tpu-pods

We went to great lengths to ensure that we trained exactly the same version of
ResNet-50 to the same accuracy in the same way across all hardware
configurations. The methodology predated MLPerf and is documented in full
here:

https://github.com/tensorflow/tpu/blob/master/benchmarks/ResNet-50_v1.5_Performance_Comparison_TensorFlow_1.12_GCP.md

If you were going to do a similar performance-per-dollar comparison today, the
simplest approach might be to try to get the code from NVIDIA's MLPerf 0.6
submissions running at scale on one or more major public clouds using the
fastest-available networking technology that each cloud provides:

https://github.com/mlperf/training_results_v0.6/tree/master/NVIDIA/benchmarks

It would be very interesting to see how distributed training performance using
large-scale GPU clusters in public clouds compares with the published on-
premise MLPerf performance numbers using exactly the same MLPerf code and
methodology. With these measurements in hand, it would then be straightforward
to make performance-per-dollar comparisons with Cloud TPU v3 Pod slices of
various sizes.

------
ArthurBrussee
Is it just me, or are those results somewhat underwhelming? Dedicated
hardware for a 2x speedup at best, a toss-up for most results, and it only
competes in some categories. Not to come across as just an NVIDIA fan here,
and surely there is value in dedicated training hardware, but it's surprising
that the benefit isn't bigger!

~~~
totoglazer
I think it’s probably because the benchmark isn’t optimized for TPU Pods.
Check out the BERT in 76 minutes paper for how you need to rethink the
training regime to take advantage of pods.

~~~
zak
Yes, Cloud TPU Pods are designed to train much larger models on much larger
datasets. And, as you mention, if you are willing to adjust your model
architectures and training algorithms to take full advantage of the hardware,
you can sometimes achieve substantial gains.

------
gok
I'm curious how (absolutely) efficient the Transformer training is even on the
TPUs. The results from self-attention models are really impressive but
unfortunately their topology makes them very difficult to implement
efficiently in silicon. It often becomes purely a question of memory
bandwidth, because you're not doing much math per weight on each iteration. I
wonder if the speedup is from the use of on-chip HBM in the TPUs.
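
Back-of-the-envelope, for a single weight matrix the math per byte of weights
moved works out to roughly the per-chip batch size; a purely illustrative
sketch with made-up numbers, nothing measured from these submissions:

    # Rough arithmetic-intensity sketch for one dense layer, illustrating
    # why small per-chip batches make the workload bandwidth-bound.
    # Numbers are illustrative, not measured from any MLPerf submission.
    d = 1024               # layer width
    b = 16                 # per-chip batch size (rows processed at once)
    bytes_per_value = 2    # fp16/bf16

    flops = 2 * b * d * d                    # multiply-adds for y = x @ W
    weight_bytes = d * d * bytes_per_value   # W is read once per step

    print(f"arithmetic intensity ≈ {flops / weight_bytes:.0f} FLOPs per weight byte")
    # If the chip's peak-FLOPS / HBM-bandwidth ratio is higher than this,
    # the layer spends most of its time waiting on memory, not doing math.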

~~~
rrss
You mean this speedup?

> 1024 TPUs are twice as fast as 480 GPUs

Might it be because there are twice as many?

~~~
zak
Author of the blog post here.

We submitted multiple results using various Cloud TPU v3 Pod slice sizes to
show the current achievable Transformer training efficiency at several scales:

https://mlperf.org/training-results-0-6

We're actively improving the whole TPU software stack, so training efficiency
is likely to continue to increase over time.

~~~
gok
What I'm really asking is: how much effective compute throughput are you able
to get during Transformer training relative to the amount of theoretical raw
compute available?

~~~
zak
That's a great question. I don't have that analysis handy, but it would
definitely be worth doing.
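
The calculation itself would be straightforward, roughly along these lines
(every figure below is a placeholder, not a measurement from our submissions):

    # Sketch of estimating effective vs. theoretical compute utilization.
    # Every number here is a placeholder, not a measured result.
    model_flops_per_step = 2.0e15   # FLOPs per training step, counted analytically
    steps_per_second = 10.0         # measured training throughput
    num_chips = 1024
    peak_flops_per_chip = 105e12    # 420 TFLOPS per device / 4 chips

    achieved = model_flops_per_step * steps_per_second
    peak = num_chips * peak_flops_per_chip
    print(f"utilization ≈ {100 * achieved / peak:.1f}%")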

~~~
gok
What about energy? (watt-hours?)

------
daxfohl
What's in the picture? Can't figure out the scale. Are those like server racks
or breadboards or ...?

~~~
morphle
All you see are 8 server racks with colored network cables, switches, and
power supplies. The TPU ASICs themselves make up just a tiny part of this
datacenter; you also have the printed circuit boards, cooling fins, 8x48 metal
boxes, power and network cables, DC/DC or AC/DC converters at the bottom, fans
or water tubes for cooling, and air gaps.

My startup is trying to develop wafer scale integration, where you collapse 2
racks' worth of network, metal boxes, power, and cooling into a 300 mm wafer
immersed in a 100 mm x 400 mm box with a few fibers and three power cables
coming out. That can save around 90% of the capital cost and orders of
magnitude of power (especially if you put the box in a building where the
waste heat is not wasted but used to heat water for showering and space
heating).

As it is now, these datacenter customers and hyperscalers don't seem to care
about the enormous waste and cost of paying for inefficient hardware.
Considering the enormous cost and carbon emission savings wafer scale
integration would bring (and the competitive advantage), you would be
surprised how hard it is to get funding from them to develop it.

I suspect this is also the main reason they don't care to publish normalized
performance-per-dollar-per-joule benchmarks, as it would demonstrate how
wasteful it all is.

~~~
rrss
> you would be surprised how hard it is to get funding from them to develop it.

That's probably because there has been no evidence that wafer scale
integration can actually work.

~~~
morphle
I agree there is no _complete_ evidence of a full WSI yet. There are several
recent papers on Wafer Scale Integration (WSI) and some WSI devices built
(large sensors) with reasonable yields. There are many papers on partial
problem solutions that, if combined in one project, would yield a full working
WSI with an existing 7 nm standard CMOS process. There is silicon interconnect
fabric (SiIF), which is one step removed from a full WSI. There are many
commercial chips 1/70th the size of a WSI already, and some unpublished WSI
results.

