
Under the Hood of Google’s TPU2 Machine Learning Clusters
https://www.nextplatform.com/2017/05/22/hood-googles-tpu2-machine-learning-clusters/
======
jacquesm
Some serious tea-leaf reading going on here.

Regarding the power consumption of the TPU2 racks and whether or not Google
will continue to use them: as long as they're more power efficient than GPUs,
of course they will continue to be used. That is most likely the only reason
the TPU exists in the first place. Running the same workload on GPUs would
consume a few megawatts where the TPUs consume 500 kW, and Google has lots of
workloads like that, so the savings must be enormous.

But GPUs are a moving target and even if for now the TPU seems to have the
edge on power efficiency that could easily change.

Another thing: if you were to build systolic array processors with some fixed
size limit, it would make very good sense to figure out a way to daisy-chain
them so that multiple units can be combined into larger arrays.

That may very well be the function of that chip in the center, some kind of
crossbar to allow for easy linkage of single TPUs into larger fabrics.
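
In software terms that composition is just block matrix multiplication with a
routing-and-reduce step. A toy numpy sketch of the idea (illustrative only:
`systolic_unit_matmul` stands in for one fixed-size hardware array, and the
accumulation over k plays the crossbar's role):

```python
import numpy as np

TILE = 128  # hypothetical edge length of one systolic array

def systolic_unit_matmul(a_tile, b_tile):
    """Stand-in for a single fixed-size systolic array: it can only
    multiply TILE x TILE operands."""
    assert a_tile.shape == b_tile.shape == (TILE, TILE)
    return a_tile @ b_tile

def chained_matmul(a, b):
    """Combine fixed-size units into a larger logical array via block
    decomposition: C[i,j] = sum_k A[i,k] @ B[k,j]. Routing and summing
    the partial products is the crossbar's job."""
    n = a.shape[0]
    assert a.shape == b.shape == (n, n) and n % TILE == 0
    t = n // TILE
    c = np.zeros((n, n), dtype=np.float32)
    for i in range(t):
        for j in range(t):
            for k in range(t):
                c[i*TILE:(i+1)*TILE, j*TILE:(j+1)*TILE] += systolic_unit_matmul(
                    a[i*TILE:(i+1)*TILE, k*TILE:(k+1)*TILE],
                    b[k*TILE:(k+1)*TILE, j*TILE:(j+1)*TILE])
    return c

a = np.random.rand(512, 512).astype(np.float32)
b = np.random.rand(512, 512).astype(np.float32)
assert np.allclose(chained_matmul(a, b), a @ b, rtol=1e-4)
```

The same decomposition is why fixed-size units can scale to arbitrarily large
matrices: only the routing fabric has to grow.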

~~~
philipkglass
The other reason (maybe the dominant reason) to prefer a TPU is to avoid
paying for all the GPU functionality that is irrelevant to these workloads
(FP32, FP64, graphics-specific functionality, costs of writing graphics
drivers that get bundled into hardware unit prices...)

If I look at a GTX 1080 Ti and optimistically assume that I can keep it
completely busy 24/7, operating at the 250 watt TDP, it would take 4.5 years
for the retail electricity cost to match the initial purchase cost of $699. (I
pay a bit under 7 cents/kWh). Now I _do_ live in one of the cheapest-
electricity regions of the US, but I would also assume that big data center
operators are building facilities where electricity is cheap. And big
customers get lower rates than households. I wouldn't be surprised if a GPU
cluster's hardware is considered obsolete before its electricity costs match
its initial hardware costs.
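
For anyone who wants to check the arithmetic:

```python
# Back-of-the-envelope payback period for the figures above.
tdp_watts = 250            # GTX 1080 Ti TDP
price_usd = 699            # retail purchase price
rate_usd_per_kwh = 0.07    # ~7 cents/kWh retail electricity

kwh_per_year = tdp_watts / 1000 * 24 * 365               # ~2190 kWh
electricity_per_year = kwh_per_year * rate_usd_per_kwh   # ~$153
print(price_usd / electricity_per_year)                  # ~4.6 years
```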

~~~
friedman23
>The other reason (maybe the dominant reason) to prefer a TPU is to avoid
paying for all the GPU functionality that is irrelevant to these workloads
(FP32, FP64, graphics-specific functionality, costs of writing graphics
drivers that get bundled into hardware unit prices...)

Often bundling decreases cost. Bundling also has the advantage of producing
more demand when selling excess capacity, a la AWS.

~~~
MR4D
Not in this case.

Google wants a die that is as efficient as possible. All of those features
take up die space, creating a larger processor. That increases the cost of
defects, as well as overall manufacturing costs.

For Amazon you may be right, but Google only wants a specific function.

------
aresant
When the author made the mental connection from the matching label colors on
the TPU2 front-panel connections, the nerd in me did a little jig. I LOVE the
level of detail and authority this article conveys.

Despite my general server / hardware comprehension being on the low end of
intermediate, I found this article extremely digestible; it helped close some
gaps for me about Google's strategy and how/where to think about deploying
cloud ML in the future. Love seeing content like this on HN.

------
jloughry
That article reads just like a SIGINT intelligence report on some foreign
country's new radar system, inducing capabilities, limitations, and
specifications from clues visible in photographs.

~~~
marmaduke
I think you mean deduce or infer

~~~
jloughry
You make a good point. I meant inductive in the sense that the author of this
article really _is_ operating like an intelligence analyst, working from
available but limited information and drawing conclusions like `the white
cables are probably for management' (judging from the number of them and the
type of connectors seen on the ends). It's a statement of probability, based
on what we can see, what we know from experience ought to be present, and what
else we don't see. More interesting is the author's prediction that maps Xeon
cores 1:1 with TPU chips (again, by counting boards in the CPU racks on the
ends); that one is a combination of inductive _and_ deductive logic.

The article is a great piece of analysis.

------
maldeh
Another potential source of friction I've wondered about in getting research
workloads running on TPU2 is the heavy reliance of most popular DL frameworks
on CUDA and cuDNN; I don't understand how one could port those libraries to a
drastically different architecture such as this. Some DL frameworks have made
overtures toward OpenCL, but it has typically not been a top priority [0].

Could this instead imply that Google is working on its own library of
low-level math primitives just for TensorFlow, and if so, how long would it
take before it's competitive performance-wise? At any rate, having to support
a different computing platform / API would be another blocker to general
adoption.

[0]
[https://github.com/tensorflow/tensorflow/issues/22](https://github.com/tensorflow/tensorflow/issues/22)

~~~
polishTar
Would this not be solved by TensorFlow XLA?

[https://www.tensorflow.org/performance/xla/](https://www.tensorflow.org/performance/xla/)

~~~
maldeh
Per my understanding of XLA, it provides the right high-level abstraction for
compiling the TensorFlow computation graph across a range of architectures
(CPU, NVIDIA GPU, mobile, etc.), but it still delegates to a lower-level
domain-specific API like CUDA.

This undertaking isn't something to be taken lightly; cuBLAS has some pretty
cutting-edge architecture-specific optimizations for batched matrix
multiplication that came out of several years of research, and it is arguably
a massive competitive advantage of NVIDIA over AMD. Depending on the
development state of such an API internally at Google, it could mean that the
Cloud TPU isn't going to be ready for widespread commercial use for a good
while and is very much still in the research-and-development phase (which
could explain why they're only opening it up to the research community right
now).
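
For reference, turning on XLA's JIT in TensorFlow 1.x is just a session-config
switch (a minimal sketch assuming the 1.x `ConfigProto` API; the switch only
changes who compiles the graph, a code-generation backend for the target
device still has to exist):

```python
import numpy as np
import tensorflow as tf  # TensorFlow 1.x

# Turn on XLA JIT compilation for the whole session.
config = tf.ConfigProto()
config.graph_options.optimizer_options.global_jit_level = (
    tf.OptimizerOptions.ON_1)

x = tf.placeholder(tf.float32, [None, 256])
y = tf.layers.dense(x, 128)  # ops XLA may fuse into one compiled kernel

with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())
    out = sess.run(y, feed_dict={x: np.ones((4, 256), np.float32)})
    print(out.shape)  # (4, 128)
```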

~~~
maksimum
From what I gather from AMD's recent AMA, the key is getting hardware and
software people [working
together](https://lists.freedesktop.org/archives/dri-devel/2016-December/126808.html).
For AMD the issue is that they can't afford to have hardware engineers
dedicated to helping optimize the software, since they need to build the next
and next-next generation of hardware. Google, on the other hand, surely has
the pockets to get the man-hours and knowledge to (a) figure out what kind of
software is needed and (b) design from the hardware level up to meet their
software needs.

And if you caught any of their talks on [TensorFlow at Google
I/O](https://www.youtube.com/watch?v=5DknTFbcGVM), it seems like their goal is
to provide ultra-high-level APIs like POST/GET, plus high-level Python ones
like TensorFlow's pre-built, pre-trained models. At that point Google controls
the engineering, and as long as you run on their cloud platform you only have
to worry about the high level. I'm definitely not a fan of vendor lock-in, but
it seems like an interesting product, and I'm curious to see what Google does
in this space.

------
scottlegrand2
So Google still won't reveal what kind of floating point the TPU2 actually
runs. That is really, really strange. I can't help but suspect it's some sort
of non-standard floating point, but that's easy to clear up, no?

~~~
deepnotderp
Pretty sure it's a minifloat of some sort.

~~~
monocasa
Back in the day it was fairly common to not be compliant with IEEE 754 in
order to cut down gate count: giving wrong results on denormals, being fixed
to one rounding mode, etc. I bet they're doing something like that.
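
Something like that is cheap in hardware. A purely illustrative numpy sketch
(emphatically not a claim about the TPU2's actual format) of a cut-down float
that keeps float32's 8-bit exponent but truncates the mantissa, i.e. a single
fixed round-toward-zero mode:

```python
import numpy as np

def truncate_mantissa(x, mantissa_bits=7):
    """Zero out the low mantissa bits of float32 values, keeping the 8-bit
    exponent intact. Truncation is a fixed round-toward-zero mode, one of
    the IEEE-754 shortcuts described above."""
    bits = np.asarray(x, dtype=np.float32).copy().view(np.uint32)
    bits &= ~np.uint32((1 << (23 - mantissa_bits)) - 1)
    return bits.view(np.float32)

print(truncate_mantissa([3.14159]))  # [3.140625]
```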

~~~
Houshalter
Stochastic rounding is also important for running these algorithms at very
low precision. Without it, even very small rounding errors accumulate enough
to destroy the accuracy.
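
A minimal sketch of the idea (illustrative, not any particular chip's
scheme): round to a coarse grid, going up with probability equal to the
fractional remainder, so the result is correct in expectation.

```python
import numpy as np

def stochastic_round(x, step=1.0):
    """Round x to a multiple of `step`, rounding up with probability equal
    to the fractional remainder, so that E[rounded] == x."""
    lower = np.floor(x / step) * step
    p_up = (x - lower) / step              # in [0, 1)
    return lower + step * (np.random.rand() < p_up)

trials = [stochastic_round(0.3) for _ in range(100_000)]
print(np.mean(trials))  # ~0.3; round-to-nearest would always give 0.0
```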

~~~
deepnotderp
Stochastic rounding shouldn't be entirely necessary for floating point though,
no?

~~~
Houshalter
How does floating point change anything? You still only have so many bits of
precision, and the lower bits need to be rounded. In fact floating point has a
lot less precision, since many of the bits are needed to store the exponent.

~~~
deepnotderp
The range of the exponent. Stochastic rounding is intended to prevent the
bias toward zero that paralyzes learning, but with floating point this is
less of an issue, since you only round the mantissa.

~~~
Houshalter
The issue is when you add a very small number to a much larger one. This
occurs in neural networks when accumulating gradients during learning. Many
of the gradients are very small, and when added repeatedly they get rounded
away. There was a paper studying the effect; as I recall, below about 14 bits
of precision the accuracy fell off sharply, but with stochastic rounding they
could get down to very few bits.
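
The stall is easy to reproduce, with float16 standing in for whatever reduced
format is in play:

```python
import numpy as np

# 10,000 gradient-sized updates of 0.001. Once the float16 accumulator
# reaches 4.0 its ulp is ~0.0039; half an ulp then exceeds 0.001, so under
# round-to-nearest every further addition vanishes and the sum stalls.
acc16, acc32 = np.float16(0.0), np.float32(0.0)
for _ in range(10_000):
    acc16 += np.float16(0.001)
    acc32 += np.float32(0.001)
print(acc16, acc32)  # 4.0 (stalled) vs ~10.0
```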

~~~
scottlegrand2
All of this is fantastic if you are a deep learning Jedi Master. But there's a
very small supply of those sorts.

I really worry about all these reduced floating-point representations when
they're used by people who mostly understand deep learning through tinkering
with existing TensorFlow tutorials.

FP32 seems like a relative sweet spot with sufficient dynamic range to let
most amateurs avoid getting trapped in the weeds. I could probably be
persuaded to believe that FP24 is sufficient as well.

But I suspect that once you get down to 16 bits or so, you have to do all
sorts of stochastic rounding / dithering that are beyond the skill set of
most data scientists, because I suspect many of them don't really get dynamic
range.

Which then leads me to believe that what we really need here is the equivalent
of R for machine learning.

------
wmf
Some details of this analysis seem off to me. The pink QSFP cables are more
likely to be PCIe than Omni-Path. The orange/yellow/green cables are likely to
be some kind of network (torus? hypercube?) but I doubt Google would use IBM
IP when they could design their own.

~~~
jldugger
Certain Google employees won't shut up about Clos networking, but I dunno if
it applies here:
[https://en.wikipedia.org/wiki/Clos_network](https://en.wikipedia.org/wiki/Clos_network)

------
mtgx
Their first (more in-depth) post on the TPU2:

[https://www.nextplatform.com/2017/05/17/first-depth-look-googles-new-second-generation-tpu/](https://www.nextplatform.com/2017/05/17/first-depth-look-googles-new-second-generation-tpu/)

------
peter303
If google builds 90 of these, then they sort of have the first Exoflop
supercomputer. The supercomputer rankings are based on how fast they can run
some standard matrix math tests. TPU may not be general purpose enough to run
these tests.
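
The arithmetic, assuming the ~11.5 petaflops per pod that Google quoted (and
note those are reduced-precision deep-learning flops, not the FP64 Linpack
kind the rankings measure):

```python
pods = 90
pflops_per_pod = 11.5          # Google's quoted figure per TPU2 pod
print(pods * pflops_per_pod)   # 1035.0 petaflops, i.e. just over an exaflop
```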

------
snorrah
If anyone from the web team at Next Platform is reading this: please, please
consider fixing the weirdness of browsing your site on an iPad (this may
affect other tablets too).

In portrait mode, the text scrolls off the side of the screen so I have to
continually pan back and forth to read an article. In landscape mode, a very
thin column of text is used. It's as if the landscape res media query is being
applied to portrait view and vice versa.

Edit: oh, hmm, it looks like manually zooming out a bit in portrait fits the
whole article. I was sure that wasn't the case earlier. Maybe it got fixed!

