
An In-Depth Look at Google's Tensor Processing Unit Architecture - Katydid
https://www.nextplatform.com/2017/04/05/first-depth-look-googles-tpu-architecture/
======
struct
Interesting points I took from the paper[1]:

* They actually started deploying them in 2015, so they're probably already hard at work on a new version!

* The TPU only operates on 8-bit integers (and 16-bit at half speed), whereas the CPU/GPU baselines use 32-bit floating point. They point out in the discussion section that they did have an 8-bit CPU version of one of the benchmarks, and the TPU was ~3.5x faster.

* Used via TensorFlow.

* It seems like the TPU suffers whenever a model has a really large number of weights and layers to handle, but they don't break out the hardware-vs-hardware performance on each model individually, so it's hard to see whether the TPU offers an advantage over the GPU for arbitrary networks.

[1]
[https://drive.google.com/file/d/0Bx4hafXDDq2EMzRNcy1vSUxtcEk...](https://drive.google.com/file/d/0Bx4hafXDDq2EMzRNcy1vSUxtcEk/view)

~~~
make3
"The TPU only operates on 8-bit integers" The 8 bit part is fine, but
integers? what the hell. that's a new one for me

~~~
dlubarov
Anecdotally, it seems most models can be quantized to 8 bits without much loss
of accuracy, and fixed point arithmetic requires much less hardware. Training
is still done with floating point though.
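
To make that concrete, here's a minimal sketch of symmetric per-tensor 8-bit quantization in numpy (the simplest flavor; real toolchains add per-channel scales, calibration, etc., and the function names here are just for illustration):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor quantization: map float weights onto 255 signed 8-bit levels."""
    scale = np.abs(w).max() / 127.0            # size of one quantization step
    q = np.round(w / scale).astype(np.int8)    # values now lie in [-127, 127]
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(1000).astype(np.float32)
q, scale = quantize_int8(w)
print(np.abs(w - dequantize(q, scale)).max())  # worst-case error is half a step, scale / 2
```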

~~~
ChuckMcM
This. When you get right down to it, a lot of models do fine with only 256
unique weight values.

~~~
alttab
Agreed - however, as we progress I expect a comment like this to become akin to
Bill Gates's 640K comment.

~~~
mannigfaltig
The brain appears to spend about 4.7 bits per synapse (26 discernible states,
given the noisy computational environment of the brain), so that seems to be
plenty for general intelligence. This could, of course, merely be a biological
limit, and on silicon more fine-grained weights might be optimal.
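
For reference, the 4.7 figure is just the information content of 26 distinguishable levels:

```python
import math
print(math.log2(26))  # ≈ 4.70 bits of information, if a synapse has 26 discernible states
```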

Here is another paper demonstrating very good results with just 6 bit
gradients:
[https://arxiv.org/abs/1606.06160](https://arxiv.org/abs/1606.06160)

------
mooneater
"This first generation of TPUs targeted inference" from [1]

So they are telling us about inference hardware. I'm much more curious about
training hardware.

[1] [https://cloudplatform.googleblog.com/2017/04/quantifying-the...](https://cloudplatform.googleblog.com/2017/04/quantifying-the-performance-of-the-TPU-our-first-machine-learning-chip.html)

~~~
sgk284
Using approaches like the one in OpenAI's recent evolution strategies paper
would remove the need for backprop, likely allowing these TPUs to be used for
training without any changes.
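
Roughly, the ES update in that paper needs only forward passes: perturb the weights with Gaussian noise, score each perturbation, and nudge the weights toward the better-scoring ones. A minimal numpy sketch, where f is whatever forward-pass score you care about and the hyperparameters are made up:

```python
import numpy as np

def es_step(theta, f, npop=50, sigma=0.1, alpha=0.01):
    """One evolution-strategies update: no gradients, just npop forward passes."""
    eps = np.random.randn(npop, theta.size)           # Gaussian perturbations
    rewards = np.array([f(theta + sigma * e) for e in eps])
    advantage = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    grad_est = eps.T @ advantage / (npop * sigma)     # estimated gradient of E[f]
    return theta + alpha * grad_est

# Toy usage: maximize -||theta - 3||^2, so theta drifts toward 3.0.
theta = np.zeros(5)
for _ in range(200):
    theta = es_step(theta, lambda t: -np.sum((t - 3.0) ** 2))
```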

~~~
p1esk
The evolution strategies method is used in reinforcement learning. How are you
planning to use it for supervised learning?

~~~
mdda
People have known that training NNs (for any purpose) using evolution works
well since the 1990s. The rise of NN frameworks has made differentiation much
easier than it used to be (and having gradient hints is intuitively a good
idea). But for OpenAI to allow their PR people to declare this as a novel
advance is ... surprising.

~~~
p1esk
Citation for training a NN on an image-classification task where evolution
works well?

Let's say you want to use a genetic algorithm to find a good set of weights:
you generate, mutate, combine and select many random networks, and repeat this
process many times. How many networks, and how many times? That depends on the
length of your chromosome and the complexity of the task. Networks that work
well for image classification need at least a million weights, and the entire
set of weights is a single chromosome. Do you realize how computationally
intractable this is on modern hardware?
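
As a back-of-the-envelope illustration of that point (the population size and generation count below are made-up but modest numbers):

```python
params = 1_000_000        # weights in a modest image-classification net (one "chromosome")
pop, gens = 1_000, 10_000 # hypothetical population size and generation count
bytes_per_weight = 4      # float32

print(pop * params * bytes_per_weight / 1e9, "GB just to hold one generation")  # 4.0 GB
print(pop * gens, "fitness evaluations, each a forward pass over the data")     # 10,000,000
```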

~~~
mdda
> NN for image classification task

You've created your own straw man here.

> "You realize now how computationally intractable this task is on modern
> hardware?"

Here are the people who prove it isn't computationally intractable:
[https://blog.openai.com/evolution-strategies/](https://blog.openai.com/evolution-strategies/) - but to say
they've discovered a new breakthrough method is over-selling the result.

~~~
p1esk
You said: "training NNs (for any purpose) using evolution works well". I gave
you an example of a purpose where it does not work well. So, let me ask you
again: can you give an example of evolutionary methods that work well when
applied to training NNs, other than this breakthrough by OpenAI, which only
works for RL?

------
slizard
It's a pity they omitted a comparison against Maxwell-generation GPUs like the
M40/M4. Those were already out in late 2015 and are also on 28nm.

Perhaps the reason is simply that they don't have them in their servers, but
we'll see if Jeff Dean replies on G+ [1].

[1]
[https://plus.google.com/+JeffDean/posts/4n3rBF5utFQ?cfem=1](https://plus.google.com/+JeffDean/posts/4n3rBF5utFQ?cfem=1)

~~~
dgacmu
Neither Google Cloud nor Amazon Web Services offers Maxwell-series GPUs. Both
jumped, or, to be more precise, are in the process of jumping, from the
K-series to the P100 series.

When I google around a bit, I see several results talking about the software
licensing cost model for the M-series GPUs.

~~~
keltor
There are no datacenter-class Maxwell-series GPUs. Nvidia never released a
version with ECC-protected memory, so Amazon and Google never used them in
production.

Part of the fault lies with GDDR5's limitations, which required trickery to
make ECC work on the Kepler series.

Pascal comes with ECC because HBM2 has ECC built in.

------
MichaelBurge
It's interesting that they focus on inference. I suppose training needs more
computational power, but inference is what the end-user sees, so it has
stricter latency requirements.

Most of us are probably better off building a few workstations at home with
high-end cards. The hardware will be more efficient for the money. But if
you're considering hiring someone to manage all your machines, power-
efficiency and stability become more important than the performance/upfront $
ratio.

There are also FPGAs, but they tend to be much lower quality than the chips
Intel or Nvidia put out, so unless you know why you'd want one, you don't need
one.

~~~
throwaway71958
They're also not very interested in making it easier for you to train models
at home. Not that it would be a big risk for them if you could: you don't have
the data, and your models are only as good as your data. But they'd rather you
came to their cloud and paid $2/hr per die for an outdated Tesla K80. Which,
to their credit, they've made very easy to hook up to your VM: you just tell
them how many you need and your VM starts with that many GPUs attached. Super
slick.

~~~
thesandlord
P100s are coming soon!

(I work on GCP)

------
zitterbewegung
Looking at the analysis in the article, one of the big gains is the busy power
usage of 384W, which is lower than the other servers' while performance stays
competitive with the other approaches (although restricted to inference).

~~~
maga
I was wondering how it compares to other solutions in terms of
performance/watt, luckily they address it in the paper[1]:

> The TPU server has 17 to 34 times better total-performance/Watt than
> Haswell, which makes the TPU server 14 to 16 times the performance/Watt of
> the K80 server. The relative incremental-performance/Watt—which was our
> company’s justification for a custom ASIC—is 41 to 83 for the TPU, which
> lifts the TPU to 25 to 29 times the performance/Watt of the GPU.

[1]
[https://drive.google.com/file/d/0Bx4hafXDDq2EMzRNcy1vSUxtcEk...](https://drive.google.com/file/d/0Bx4hafXDDq2EMzRNcy1vSUxtcEk/view)

------
zackmorris
While this is interesting for TensorFlow, I think that it will not result in
more than an evolutionary step forward in AI. The reason being that the single
greatest performance boost for computing in recent memory was the data
locality metaphor used by MapReduce. It lets us get around CPU manufacturers
sitting on their hands and the fact that memory just isn’t going to get
substantially faster.

I'd much rather see a general-purpose CPU that uses something like an array of
many hundreds or thousands of fixed-point ALUs, with local high-speed RAM for
each core on-chip. Then program it in a parallel/matrix language like Octave,
or as a hybrid with the actor model from Erlang/Go. Basically, give the
developer full control over instructions and let the compiler and hardware
perform those operations on many pieces of data at once - like SIMD or VLIW
without the pedantry and limitations of those instruction sets. If the
developer wants to have a thousand realtime Linuxes running Python, the
hardware will only stand in the way if it can't do that, and then we'll be
left relying on academics to advance the state of the art. We shouldn't
exclude the many millions of developers who are interested in this stuff by
forcing them to use notation that doesn't build on their existing contextual
experience.

I think an environment where the developer doesn’t have to worry about
counting cores or optimizing interconnect/state transfer, and can run
arbitrary programs, is the only way that we’ll move forward. Nothing should
stop us from devoting half the chip to gradient descent and the other half to
genetic algorithms, or simply experiment with agents running as adversarial
networks or cooperating in ant colony optimization. We should be able to start
up and tear down algorithms borrowed from others to solve any problem at hand.

But not being able to have that freedom - in effect being stuck with the DSP
approach taken by GPUs - is going to send us down yet another road to
specialization and proprietary solutions that result in vendor lock-in. I've
said this many times before, and I'll continue to say it as long as we aren't
seeing real improvement in general-purpose computing.

------
saosebastiao
Are people really using models so big and complex that the parameter space
couldn't fit into an on-die cache? A fairly simple 8MB cache can give you
1,000,000 doubles for your parameter space, and it would allow you to get rid
of an entire DRAM interface. It's a serious question, as I've never done any
real deep learning...but coming from a world where I once scoffed at a random
forest model with 80 parameters, it just seems absurd.

~~~
deepnotderp
hahhahahahhaahah

The SOTA networks are around 300MB+...
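
For a rough sense of where numbers like that come from: a model's footprint is roughly parameter count times bytes per weight. Using commonly quoted approximate parameter counts for the well-known ImageNet models of that era (illustrative, not figures from the TPU paper):

```python
# Commonly quoted approximate parameter counts.
models = {"AlexNet": 61e6, "VGG-16": 138e6}
for name, n in models.items():
    print(f"{name}: ~{n * 4 / 1e6:.0f} MB as float32, ~{n / 1e6:.0f} MB quantized to int8")
```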

~~~
saosebastiao
Not sure if you meant to laugh at a serious question. I am fully aware of my
ignorance of the space.

Since it appears you're in the deep learning hardware business, what would be
the impediment to using eDRAM or similar? eDRAM is too costly at those sizes
for general purpose processors, but I imagine the reduced latency and
increased bandwidth would be a _huge_ win for a ridiculously parallel deep
learning processor, and would definitely be a tradeoff worth making.

~~~
deepnotderp
Sorry, that was more of a laugh at the state of deep learning model sizes than
anything.

Okay, so about eDRAM. There are two types of eDRAM: on-die and on-package.
On-die eDRAM means manufacturing DRAM cells on the logic die, which would be a
big boon in terms of density, since eDRAM cells can be almost 3x as dense as
SRAM. The problem, however, is that on-die eDRAM has been impossible to scale
beyond 40nm, which negates any advantage you would gain from using it.

On-package eDRAM is more interesting, but the primary cost in memory access is
the physical transportation of the data, which is a physical limit and can't
be circumvented. You can call it all sorts of fancy names such as "eDRAM", but
the fact of the matter is that you're still moving data. For reference, the
projected cost of moving a 64-bit word on-chip at 10nm, according to Lawrence
Livermore National Laboratory, is ~1pJ, while a 64-bit FLOP is also estimated
at ~1pJ. So even a single on-chip hop already costs as much as the arithmetic
itself, and moving data off the die costs far more; data movement, not
computation, dominates the energy budget.
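
To put rough numbers on that (the ~1pJ figures are the ones cited above; the off-package figure is an assumed ballpark, not a measured value):

```python
# Energy per operation in picojoules. Treat these as order-of-magnitude
# assumptions for illustration only.
flop_pj         = 1.0     # 64-bit FLOP at ~10 nm (figure cited above)
onchip_hop_pj   = 1.0     # move a 64-bit word one hop on-chip (figure cited above)
offchip_read_pj = 1000.0  # assumed ballpark for fetching 64 bits from off-package DRAM

# If every multiply-accumulate pulled its operand from off-package memory,
# data movement would cost ~1000x the arithmetic under these assumptions.
print(offchip_read_pj / flop_pj)
```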

You gain a lot compared to DRAM, of course, but HBM can offer the same
efficiency gains.

Didn't mean to be rude with the first response. Let me know if you have any
other questions; I'd be happy to answer them :)

------
mdale
Interesting stuff; it really points to the complexity of measuring technical
progress against Moore's law. It's really a more fundamental question of how
institutions can leverage information technologies and organize work and
computation towards goals that are valued in society.

------
cr0sh
This appears to be a "scaled up" (as in number of cells in the array) and
"scaled down" (as in die size) version of the old systolic array processors
(which go back quite a ways - to the 1980s and probably further).

As an example, the ALVINN self-driving vehicle used several such arrays for
its on-board processing.

I'm not absolutely certain that this is the same thing, but it has the "smell"
of it.
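
For anyone who hasn't met systolic arrays: the idea is a grid of multiply-accumulate cells through which data is pumped in lockstep, so the weights stay put and no cell ever waits on a memory fetch. Here's a toy cycle-by-cycle sketch of a weight-stationary array computing X @ W (an illustration of the general technique, not the TPU's actual microarchitecture):

```python
import numpy as np

def systolic_matmul(X, W):
    """Toy weight-stationary systolic array computing Y = X @ W.

    PE (k, n) permanently holds W[k, n]. Each cycle it multiplies the
    activation arriving from its left neighbour by its weight, adds the
    partial sum arriving from above, then passes the activation right and
    the partial sum down. Finished dot products drip out of the bottom.
    """
    M, K = X.shape
    K2, N = W.shape
    assert K == K2

    act = np.zeros((K, N))    # activation register inside each PE
    psum = np.zeros((K, N))   # partial-sum register inside each PE
    Y = np.zeros((M, N))

    for t in range(M + N + K):
        new_act = np.zeros_like(act)
        new_psum = np.zeros_like(psum)
        for k in range(K):
            for n in range(N):
                # Row m of X enters row k of the array at cycle m + k (skewed),
                # so the feeder for row k supplies X[t - k, k] this cycle.
                if n == 0:
                    a_in = X[t - k, k] if 0 <= t - k < M else 0.0
                else:
                    a_in = act[k, n - 1]
                p_in = psum[k - 1, n] if k > 0 else 0.0
                new_act[k, n] = a_in
                new_psum[k, n] = p_in + W[k, n] * a_in
        act, psum = new_act, new_psum

        # Y[m, n] leaves the bottom of column n at the end of cycle m + n + K - 1.
        for n in range(N):
            m = t - n - (K - 1)
            if 0 <= m < M:
                Y[m, n] = psum[K - 1, n]

    return Y

# Sanity check against a plain matmul.
rng = np.random.default_rng(0)
X = rng.integers(-128, 127, (4, 6)).astype(float)
W = rng.integers(-128, 127, (6, 3)).astype(float)
assert np.allclose(systolic_matmul(X, W), X @ W)
```

(The TPU paper's matrix unit is a 256x256 grid of 8-bit MACs driven in essentially this fashion, just at much larger scale.)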

------
sgt101
Does anyone have a view on how useful deep kernels might be for riding to the
rescue for the rest of us?

[https://arxiv.org/abs/1611.00336](https://arxiv.org/abs/1611.00336)

------
amelius
Are they using it in feedforward mode only? Or also for learning?

~~~
agravier
It's mostly designed for inference.

------
andrepd
They're comparing against 5-year-old Kepler GPUs. I wonder how it would have
fared against the latest Pascal cards, since they're several times more
efficient than Kepler.

~~~
modeless
5-year-old Kepler GPUs are the best you can get in the cloud right now, and
that's Nvidia's fault. So it's relevant to compare against them.

~~~
p1esk
There are several providers that have been offering P100 GPUs for a while now:
[https://www.nimbix.net/nimbix-cloud-demand-pricing/](https://www.nimbix.net/nimbix-cloud-demand-pricing/)

