
Quantifying the performance of the TPU, our first machine learning chip - fhoffa
https://cloudplatform.googleblog.com/2017/04/quantifying-the-performance-of-the-TPU-our-first-machine-learning-chip.html
======
joe_the_user
So since "sharing the benefits with everyone" could involve just allowing
people to rent time on the Google cloud, we can still ask when/if the chips
themselves will ever be available for purchase?

------
iandanforth
"This first generation of TPUs targeted inference ..."

Makes me wonder if there are more recent generations that target training.

------
wyldfire
From the paper:

> if the TPU were revised to have the same memory system as the K80 GPU, it
> would be about 30X - 50X faster than the GPU and CPU.

Is it "hard" to interface with GDDR5/HBM? Layout challenges? Or do they need
the capacity more than the speed? Why _wouldn't_ they have used faster memory
than DDR3?

~~~
dom0
Memory controllers are not so simple to design, and fast MCs also eat quite a
bit of power. So a simple reason they didn't do it could be one of:

a) they did not want to license a more expensive, faster design

b) while it would be faster, it would decrease efficiency to a point that did
not meet their goals (for data centers, efficiency > absolute performance,
within reasonable boundaries)

c) like b) just with cost of memory

d) GDDR5 and DDR3/4 have different design trade-offs. The former is optimized
for sequential bandwidth (and low capacity; GDDR always was a point-to-point
memory bus just to achieve the clock speeds), while DDR3/4 takes random
read/write workloads into account (eh... to the extent possible with DRAM...)

\--

HBM requires a silicon interposer, which is basically another complete chip
(just without the FEOL parts, "just" metallization) that has to be
significantly larger than _all the chips combined_. So unless you
really need that performance or have a volume product it's unlikely to be a
good deal.
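
To put rough numbers on the DDR3-vs-GDDR5 gap, here is a back-of-envelope
sketch in Python. The transfer rates, bus widths, and channel counts below
are typical ballpark figures I'm assuming for this class of part, not values
taken from the TPU paper:

    # peak bandwidth = transfers/s * bytes per transfer * channels
    def peak_gb_per_s(transfer_rate_mtps, bus_width_bits, channels=1):
        return transfer_rate_mtps * 1e6 * (bus_width_bits / 8) * channels / 1e9

    # ~34 GB/s: two 64-bit DDR3-2133 channels (roughly TPU-board-like)
    ddr3 = peak_gb_per_s(2133, 64, channels=2)

    # ~240 GB/s: a 384-bit GDDR5 interface at 5 GT/s (roughly K80-class, per die)
    gddr5 = peak_gb_per_s(5000, 384)

    print(f"DDR3 ~ {ddr3:.0f} GB/s, GDDR5 ~ {gddr5:.0f} GB/s")

So the raw bandwidth gap is large, but it comes with the controller power,
licensing, and cost trade-offs listed above.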

~~~
dom0
Looking at their block diagrams: They have two large "cache-like" structures:
The unified buffer (24 MiB) and the accumulators (4 MiB). Bandwidth between
these and the matrix multiply unit is high (167 GiB/s), bandwidth out of that
complex is low. So it would seem that they just don't need a very large
bandwidth out of that function complex.
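
A toy sketch of why that can work out (made-up matrix sizes, purely to
illustrate the arithmetic-intensity argument, nothing from the paper): a
matrix multiply does O(n^3) work on O(n^2) data, so the more of that data the
unified buffer and accumulators keep on chip, the fewer DRAM bytes the
function complex needs per operation.

    # ops per byte of off-chip traffic for an n x n matrix multiply,
    # assuming each operand is read from DRAM once and the result is
    # written once (i.e. perfect on-chip reuse)
    def macs_per_byte(n, bytes_per_elem=1):    # 8-bit operands, TPU-style
        macs = n ** 3                          # multiply-accumulates
        traffic = 3 * n * n * bytes_per_elem   # read A and B, write C
        return macs / traffic

    for n in (256, 1024, 4096):
        print(f"n={n}: {macs_per_byte(n):.0f} MACs per DRAM byte")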

------
dicroce
Is this device optimized for forward passes or backward passes or both?

It seems to me that Google engineers could use Teslas or other high-end GPUs
for training and development, but then deploy those models on hardware
optimized for forward passes...

~~~
pcmonk
If "forward passes" means inference (as opposed to training), then the post
says the first generation targets that. I don't think they say anything about
any future generations (other than they're working on them).

~~~
alfalfasprout
I mean, there's no real reason they shouldn't be able to do a backwards pass
assuming they're using trivially differentiable activation functions.
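
A minimal numpy sketch (nothing TPU-specific, just the textbook dense-layer
math) of why: the backward pass is built out of the same kind of matrix
multiplies as the forward pass, plus the derivative of the activation.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal((32, 128))   # batch of inputs
    W = rng.standard_normal((128, 64))   # layer weights

    # forward: y = relu(x @ W)
    z = x @ W
    y = np.maximum(z, 0)

    # backward, given some upstream gradient dL/dy
    dy = rng.standard_normal(y.shape)
    dz = dy * (z > 0)     # derivative of ReLU
    dW = x.T @ dz         # gradient w.r.t. weights -- another matmul
    dx = dz @ W.T         # gradient w.r.t. inputs  -- another matmul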

------
pc2g4d
Maybe it's just me misunderstanding, but to me "inference" and "training" are
one and the same. But the article defined it thus:

> This first generation of TPUs targeted inference (the use of an already
> trained model, as opposed to the training phase of a model, which has
> somewhat different characteristics)

This Nvidia article treats them differently, too:
[https://blogs.nvidia.com/blog/2016/08/22/difference-deep-
lea...](https://blogs.nvidia.com/blog/2016/08/22/difference-deep-learning-
training-inference-ai/)

But the definition of "statistical inference" on Wikipedia says "Statistical
inference is the process of deducing properties of an underlying distribution
by analysis of data" which seems exactly like training.

~~~
mjn
Some parts of stats (esp. classically) do use "inference" for the whole
process of going from data -> result, especially when doing descriptive rather
than predictive statistics. In most of ML the process is split into two
phases, training a model on a data set, followed by using the trained model to
predict labels (or whatever else it's predicting) for new data. Training is an
implementation of statistical induction, from data to model, while model "use"
or "evaluation" is an implementation of (probabilistic) deduction, from model
+ query to result. In ML, "inference" is usually a synonym for the "deduction"
or "evaluation" portion. It makes sense to me intuitively if you think of it
as learning things vs. inferring things from learned knowledge.

But you do find constructions using it as a synonym for training too, as in
phrases like "model inference" (which means inferring models from data, aka
model induction or training). Inferring things from other things is a pretty
general concept, so it can be slippery without context...
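
A minimal sketch of the two phases as ML terminology uses them (an
illustrative least-squares example with made-up data, nothing to do with the
TPU itself):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.standard_normal((100, 3))
    y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.standard_normal(100)

    # "training" / induction: estimate model parameters from data
    w, *_ = np.linalg.lstsq(X, y, rcond=None)

    # "inference" (in the ML sense) / deduction: apply the fixed model to new data
    X_new = rng.standard_normal((5, 3))
    predictions = X_new @ w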

------
wangqufei
This is a very, very bad idea. So-called AI is still changing, far from
stable. Software can change; hardware cannot.

------
bsamuels
so basically they're ASICs?

Would love some tech details, but it seems the paper won't be published
until 5pm today.

~~~
dom0
Accelerator-esque chips are often considered some kind of ASIC. CPUs are
not ASICs because ??? ... never heard a good reason there. It's a fuzzy (but
not fluffy) term.

~~~
Cyph0n
Because CPUs are not application-specific? ASICs are typically designed from
the ground-up to do a specific thing really well.

~~~
dom0
I'll bite: in what way is a computation engine (CPU [not ASIC], GPU [ASIC],
Xeon Phi [ASIC]) more general-purpose than a computation engine for AI [ASIC],
a GPU [ASIC], or a Xeon Phi [ASIC, even though some of these are usable as a
host processor]?

For other parts it's even less clear: Flash or hard drive controllers are
pretty normal micros with some dedicated hardware bolted on - clearly ASIC,
but most of it was not designed for the "AS" part, and you could just ignore
the flash and SATA interfaces and use it as a regular micro.

So the distinction, if any, can't be about volume (since a lot of them are
large-volume parts), nor about functionality, but must be some fuzzy
distinction based on how narrow the intended use of the part is (but then
again, GPUs).

And how does it apply to other domains of chips? Is a TDA7000 an ASIC?

:)

~~~
Cyph0n
Firstly, on what basis are the Xeon Phi and GPUs ASICs? Secondly, I'd argue
that a GPU is definitely more general-purpose than the TPU. Just look at the
block diagram shown in the linked paper: a GPU is orders of magnitude more
complex than that! The added complexity is a result of a GPU having to support
a wider variety of workloads. As for Flash controllers etc., if they
incorporate an MCU or CPU, then they are simply not ASICs?

I believe that the term ASIC itself is quite overloaded. From what I've seen,
many people use the term ASIC to refer to any IC that is not reconfigurable
(i.e., FPGA). Going by that definition, an ASIC is any circuit that is custom
designed and fabbed on a wafer. Naturally, this would include a CPU, a GPU,
and whatever else you can think of.

The way I like to think of it is that an ASIC is a circuit designed to perform
a specific task _as efficiently as possible_. Note that I used the word
circuit; in other words, an ASIC could be part of some larger design.

Some examples off the top of my head:

\- Digital camera CMOS sensor

\- Video decryption chip (e.g., in a cable box)

\- Active noise cancellation chip (if custom and not a DSP)

\- Full-custom TPM

\- Full-custom RSA-2048 engine

\- High-performance Ethernet switch controller

\- CPU cache controller (ASIC that is part of a CPU)

~~~
dom0
Counterpoint to complexity: Simple CPUs (scalar, in-order) are likely less
complex than the TPU. E.g. here's a super-scalar in-order core:
[https://patentimages.storage.googleapis.com/US6311261B1/US06...](https://patentimages.storage.googleapis.com/US6311261B1/US06311261-20011030-D00011.png)

> I believe that the term ASIC itself is quite overloaded.

I think we're arguing the same thing, just from different angles :)

