
Google TPU Performance Analysis - kartD
http://www.anandtech.com/show/11749/hot-chips-google-tpu-performance-analysis-live-blog-3pm-pt-10pm-utc
======
alexnewman
So many details that people gloss over. I have used TensorFlow (TF) and it is
true that GPUs suck at inference with it. But it's not always the GPU's fault:

- TF can't do anything quantized on GPUs. It just falls back to the CPU/TPU.

- TF gets relatively poor utilization of the GPU and tends to not be careful
with memory use.

- I was able to do certain types of classification hundreds of times faster by
seeing what TF was doing and hand-writing it in OpenCL, using
[https://docs.rs/ocl/0.14.1/ocl/](https://docs.rs/ocl/0.14.1/ocl/). It's a
super cool library for Rust (minimal sketch below).

Also, users should check out TensorRT:
[https://github.com/NVIDIA/gpu-rest-engine/tree/master/tensorrt](https://github.com/NVIDIA/gpu-rest-engine/tree/master/tensorrt).
It's not super well supported and may go away, but it is fast.
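
For flavor, here's a minimal sketch of the kind of hand-rolled OpenCL-via-Rust
code I mean, following the ProQue pattern from the ocl crate's README. The
kernel is a toy scalar-add, not real classification math, and the builder
method names have shifted a bit across ocl versions (0.14 differs slightly):

    // Toy end-to-end ocl example: build a program/queue, run a kernel,
    // read the results back. The real speedup came from writing the actual
    // classification kernels by hand instead of letting TF schedule them.
    extern crate ocl;
    use ocl::ProQue;

    fn main() -> ocl::Result<()> {
        let src = r#"
            __kernel void add(__global float* buffer, float scalar) {
                buffer[get_global_id(0)] += scalar;
            }
        "#;

        // Program, queue, and default work dims in one builder.
        let pro_que = ProQue::builder().src(src).dims(1 << 20).build()?;

        let buffer = pro_que.create_buffer::<f32>()?;
        let kernel = pro_que.kernel_builder("add")
            .arg(&buffer)
            .arg(10.0f32)
            .build()?;

        // unsafe because the OpenCL kernel source is unchecked by Rust.
        unsafe { kernel.enq()?; }

        // Copy device results back to the host.
        let mut vec = vec![0.0f32; buffer.len()];
        buffer.read(&mut vec).enq()?;
        println!("buffer[1234] = {}", vec[1234]); // 10.0, buffer zero-filled
        Ok(())
    }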

~~~
gcp
Question about TensorRT: it takes in Caffe-trained models. Is this BVLC Caffe
or NV-Caffe? In my experience the models aren't compatible between the two.

~~~
alexnewman
Either, but I've never tried.

------
jcbeard
Seems very much "back to the future." Systolic array processors were used to
accelerate neural networks in the 1980s, and they're great for matrix math too
(ref:
[http://repository.cmu.edu/cgi/viewcontent.cgi?article=2939&context=compsci](http://repository.cmu.edu/cgi/viewcontent.cgi?article=2939&context=compsci)).
These aren't quite the systolic array processors of old, but they're too close
to be considered a new arch/micro-arch. The formula is simple: take
low-precision matrix multiplies as the thing to accelerate, drop in a
matrix-multiply unit the computation can be blocked for, add high-bandwidth
memory to feed it, and let it go (see the sketch below). I'm waiting for more
new takes on old architectures. As fabbing chips becomes more economical, I
hope to see more retro chips, especially things that didn't quite make the
jump from research to production because of scaling (or other reasons) but
might now make sense.
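
In software terms, that formula looks roughly like this: a low-precision
(8-bit) matrix multiply accumulating into 32 bits, blocked into tiles so each
working set stays in fast local memory (the job the TPU's MAC array does in
hardware). A minimal sketch; the tile size is an arbitrary assumption:

    // Blocked i8 x i8 -> i32 matrix multiply over square n x n matrices
    // in row-major order. C must be zero-initialized by the caller.
    const TILE: usize = 64; // hypothetical tile edge, tuned per cache size

    fn matmul_i8(a: &[i8], b: &[i8], c: &mut [i32], n: usize) {
        for ib in (0..n).step_by(TILE) {
            for kb in (0..n).step_by(TILE) {
                for jb in (0..n).step_by(TILE) {
                    // One tile-sized block of the classic triple loop.
                    for i in ib..(ib + TILE).min(n) {
                        for k in kb..(kb + TILE).min(n) {
                            let aik = a[i * n + k] as i32; // widen once
                            for j in jb..(jb + TILE).min(n) {
                                c[i * n + j] += aik * (b[k * n + j] as i32);
                            }
                        }
                    }
                }
            }
        }
    }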

------
baybal2
Back in the early noughties, I remember there was a company developing an
accelerator chip for seismic data analysis for oil exploration companies. I
can't remember the name now. Can anybody remember?

They were proposing a chip that did nothing but a limited set of linear
algebra operations at gigabit rates. They were former Transmeta people.

~~~
moonbug22
ClearSpeed? The HPC history books are littered with the bankrupt corpses of
special-purpose hardware.

~~~
semi-extrinsic
I remember an ASIC that was supposed to accelerate multigrid preconditioners,
out of some big German university. They were never able to get stuff to market
fast enough to beat Intel and Moore's law.

Perhaps the biggest recent success story in this field is Anton.

[https://en.wikipedia.org/wiki/Anton_(computer)](https://en.wikipedia.org/wiki/Anton_\(computer\))

------
mooneater
Looks to be all about TPU1, which is inference-only? AFAIK TPU2 allows for
training as well; I'm much more interested in that. Last line: "There was a
TPU2 talk earlier that I missed that I need to look through the slides of and
write up later."

~~~
Symmetry
The Hot Chips talks will eventually make their way onto YouTube...

------
nhaehnle
I really don't get how they came up with those numbers comparing CPUs to GPUs.

They claim to have 3.5x as much on-chip memory as a GPU, but the R9 Fury X has
16.7 MiB of _register_ memory compared to their 28 MiB. And then of course
there are caches on top of that (which, funnily, add up to less than the
register memory, I believe).

I also don't get how they come up with those MAC numbers. An RX Vega 64 can do
27 TFlop/s of half-precision arithmetic, which is _way_ more than 1/25 of the
92 TOp/s they claim for the TPU: 1/25 of 92 would be under 4 TOp/s, and 27
TFlop/s is roughly a third of the TPU's throughput. In fact, it makes the GPU
look pretty damn good, considering the TPU only does 8-bit ops.

Of course I'd expect the TPU to beat a GPU in terms of perf/watt, but that's
not what they're comparing on that particular slide.

There's the whole question of how you manage latency in inference, but then
I'd expect them to talk about the utilization of the GPU resources relative to
the theoretical peak.

~~~
puzzle
They compared TPU v1 with server hardware available at the time (2015, i.e.
K80 and Haswell).

Are the cards you mention from 2015? Are they for gaming or servers? Do they
use ECC? If not, that rules them out right away.

You can find the paper with the methodology, theoretical peaks and latency
management at
[https://arxiv.org/abs/1704.04760](https://arxiv.org/abs/1704.04760)

~~~
buildbot
I would argue ECC is completely pointless during neural network inference; the
amount of change a single bit flip, even a sign change, is likely to cause is
minimal.

~~~
vomjom
An exponent bit flip, especially at later layers, would completely break
inference.
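
To put rough numbers on that, here's a tiny sketch of what a single bit flip
does to an f32 (the value 0.5 is just an illustrative activation; any float
behaves the same way):

    // Flip one bit of an f32 via its raw bit pattern.
    fn flip_bit(x: f32, bit: u32) -> f32 {
        f32::from_bits(x.to_bits() ^ (1 << bit))
    }

    fn main() {
        let x = 0.5f32;
        // Bit 30, the top exponent bit: 0.5 becomes 2^127 ~ 1.7e38.
        println!("{}", flip_bit(x, 30));
        // Bit 23, the lowest exponent bit: 0.5 becomes 1.0 (doubled).
        println!("{}", flip_bit(x, 23));
        // Bit 31, the sign: 0.5 becomes -0.5 (the relatively benign case).
        println!("{}", flip_bit(x, 31));
    }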

~~~
puzzle
I'd be concerned about code, too. It's not unheard of for a single bit flip to
make petabytes go away (true story).

------
shaklee3
This article just seems odd. They're still quoting numbers from their
comparison two years ago against Kepler GPUs. Unless they have a new TPU out,
these numbers are worse than the V100 GPU out today, so it's strange that in a
field moving this fast they're constantly quoting old data. It doesn't matter
anymore that you had the fastest chip in 2015; if you haven't iterated since
then, you are probably losing.

~~~
jorgemf
The link is about TPUv1, but Google is already using TPUv2 (or maybe TPUv3;
they don't talk much about these things).

