
Google supercharges machine learning tasks with TPU custom chip - hurrycane
https://cloudplatform.googleblog.com/2016/05/Google-supercharges-machine-learning-tasks-with-custom-chip.html
======
luu
I'm happy to hear that this is finally public so I can actually talk about the
work I did when I was at Google :-).

I'm a bit surprised they announced this, though. When I was there, there was
this pervasive attitude that if "we" had some kind of advantage over the
outside world, we shouldn't talk about it lest other people get the same idea.
To be clear, I think that's pretty bad for the world and I really wished that
they'd change, but it was the prevailing attitude. Currently, if you look at
what's being hyped up at a couple of large companies that could conceivably
build a competing chip, it's all FPGAs all the time, so announcing that we
built an ASIC could change what other companies do, which is exactly what
Google was trying to avoid back when I was there.

If this signals that Google is going to be less secretive about
infrastructure, that's great news.

When I joined Microsoft, I tried to gently bring up the possibility of doing
either GPUs or ASICs and was told, very confidently by multiple people,
that it's impossible to deploy GPUs at scale, let alone ASICs. Since I
couldn't point to actual work I'd done elsewhere, it seemed impossible to
convince folks, and since my job was in another area, I gave up on it, but I
imagine someone is having that discussion again right now.

Just as an aside, I'm being fast and loose with language when I use the word
impossible. It's more that my feeling was that you have a limited number of
influence points, and I was spending mine on things like convincing my team to
use version control instead of mailing zip files around.

~~~
blue11
OK, so what is it? The announcement neither says what a TPU actually is nor
what it can do. It's a magic black box. No specs. No price.

~~~
Florin_Andrei
Well, it's a first announcement on a blog. They say it accelerates TensorFlow
by 10x. They say it fits in an HDD slot. And the whole announcement must stay
within a page or two.

It's a "more details to follow" type of thing. Pretty standard actually.

~~~
lightcatcher
> They say it accelerates TensorFlow by 10x

They say 10x performance / watt, nothing about performance per unit time.

~~~
DigitalJack
You can make some assumptions though. If the power consumption were equal,
the performance would be 10x.

The speed at which an ASIC will run is constrained by temperature (power
dissipation) and logic timing, which itself has a dependency on
temperature.

So we could call that vertical scaling, to some power ceiling which may not
take us all the way to 10x, but it's not impossible.

Then there is horizontal, which I assume is applicable to these problems...
running more in parallel.

In both cases, I think it's safe to assume they are getting a performance
increase in the instantaneous sense.

~~~
zenlikethat
> You can make some assumptions though. If the power consumption was equal,
> the performance is 10x.

While I agree some performance increase per unit time is likely, how does a
direct 10x increase based on power savings follow? Less power usage does not
mean that the chip can run through more flops in the same amount of time,
right?

~~~
DigitalJack
It does if power was the limiting factor in clock speed.

~~~
archgoon
The relationship between clock speed and power consumption is nonlinear.

[http://electronics.stackexchange.com/questions/122050/what-l...](http://electronics.stackexchange.com/questions/122050/what-
limits-cpu-speed)

(see graph in the first answer)

Also, it's not known whether the TPU has a way to increase the clock speed
arbitrarily, nor whether its architecture is capable of ensuring correctness
at arbitrary clock frequencies. Some architectures make assumptions like "the
time for this gate to reach saturation is very small compared to the clock
period, so we'll pretend that it's instantaneous."

------
bd
So now open sourcing of "crown jewels" AI software makes sense.

Competitive advantage is protected by custom hardware (and huge proprietary
datasets).

Everything else can be shared. In fact, it is now advantageous to share as
much as you can; the bottleneck is the number of people who know how to use
the new tech.

~~~
tptacek
[http://www.joelonsoftware.com/articles/StrategyLetterV.html](http://www.joelonsoftware.com/articles/StrategyLetterV.html)

~~~
kristianp
Cool how he foreshadows the end of Sun (takeover by Oracle in 2010) in that
article from 2002:

"Sun's two strategies are (a) make software a commodity by promoting and
developing free software (Star Office, Linux, Apache, Gnome, etc), and (b)
make hardware a commodity by promoting Java, with its bytecode architecture
and WORA. OK, Sun, pop quiz: when the music stops, where are you going to sit
down? Without proprietary advantages in hardware or software, you're going to
have to take the commodity price, which barely covers the cost of cheap
factories in Guadalajara, not your cushy offices in Silicon Valley."

~~~
incepted
Predicting that a company will fail without giving a date is not
foreshadowing, it's stating the obvious.

~~~
jmoiron
What? He also predicted how and why it would fail. Sun was a big enough
player then that it could have survived in plenty of other ways. The Apple of
today looks nothing like the company of 2000, but Sun got caught out more or
less exactly as described and never adapted.

------
abritishguy
I think this shows a fundamental difference between Amazon (AWS) and Google
Cloud.

AWS's offerings seem fairly vanilla and boring. Google are offering more and
more really useful stuff:

\- cloud machine learning

\- custom hardware

\- live migration of hosts without downtime

\- cold storage with access in seconds

\- bigquery

\- dataflow

~~~
gcr
Vanilla? Boring?

I read "Vanilla" and "Boring" as "Horray, I don't have to spend time rewriting
all this complicated code I already have!"

If I'm just dipping my toes into (say) Caffe or Theano, I don't have to
rewrite it from scratch.

That is a huge advantage---not a disadvantage!---of AWS over Google.

~~~
vgt
Your point is valid, but I think what the OP was saying is that Google is
offering all this stuff IN ADDITION to the boring stuff.

Google does boring stuff very well too... and one can argue much better than
AWS as well. Take a look at Quizlet's story: [https://quizlet.com/blog/whats-
the-best-cloud-probably-gcp](https://quizlet.com/blog/whats-the-best-cloud-
probably-gcp)

(shamelessly biased Googler)

~~~
jbooth
If I recall correctly, it took Google a while to actually offer the boring
stuff. For a while, you could get a Google Compute Engine instance but you
couldn't just get a dang VM image, because Google knew better than you and
you should do things their way. They've fixed it now, but lost a lot of
potential market share for that conceit.

~~~
boulos
"So"?

If you're evaluating something today, how does it change your decision that we
were late to market with Compute Engine (and in this specific case "bring-
your-own-kernel")?

If it's about future boring stuff, I think the list of boring stuff isn't too
long ;).

Disclosure: I work on Compute Engine.

~~~
engizeer
All that given, the fact that Google itself doesn't extensively use GCE is
kind of a red flag (I know quite a few Googlers from search infrastructure and
none of them said their teams used GCE internally).

A solid guarantee with AWS is that if AWS goes down, a multitude of Amazon's
services also go down (ex-Amazonian myself), so it gives me confidence that
AWS's uptime is at least as important to Amazon itself as it is to external
customers.

~~~
boulos
Search Infra (and Ads for that matter) is an extreme case. Google Search might
be one of the world's most highly tuned infrastructure projects: a marriage of
code and hardware design to maximize performance, scoring, relevance and
ultimately ROI.

Before we had custom machine types (November 2015 GA), we wouldn't have been
remotely close to what they needed. I'm not even sure we've had anyone
evaluate the amount of overhead KVM adds in either latency or throughput.

tl;dr: Don't let Search be your "not until they do it". We've got folks in
Chrome, Android, VR, and more building on top of Cloud (as well as much of our
internal tooling being on App Engine specifically).

------
manav
Interesting. Plenty of work has been done with FPGAs, and a few groups have
developed ASICs like DaDianNao in China [1]. Google, though, actually has the
resources to deploy them in its datacenters.

Microsoft explored something similar to accelerate search with FPGAs [2]. The
results show that the Arria 10 (Altera's latest, at 20nm) had about 1/4th the
processing ability at 10% of the power usage of the Nvidia Tesla K40 (25W vs
235W). Nvidia Pascal has something like 2-3x the performance with a similar
power profile, which really bridges the gap for performance/watt. All of that
also doesn't take into account the ease of working with CUDA versus the
complicated development, toolchains, and cost of FPGAs.

However, the ~50x+ efficiency increase of an ASIC could be worthwhile in the
long run. The only problem I see is that there might be limitations on model
size because of the limited embedded memory of the ASIC.

Does anyone have more information or a whitepaper? I wonder if they are using
eASIC.

[1]:
[http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=701142...](http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=7011421&url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D7011421)

[2]:
[http://research.microsoft.com/pubs/240715/CNN%20Whitepaper.p...](http://research.microsoft.com/pubs/240715/CNN%20Whitepaper.pdf)

~~~
hedgehog
You can build an ASIC with fast external memory; it adds to the cost, but then
you can handle larger models, similar to a GPU. Software support is an issue,
but for deep learning applications there's no reason in principle you couldn't
add support to TensorFlow etc. for new hardware, making it simple for
application developers to adopt. Movidius has announced that they're doing
this, and it's likely that other ML chip vendors will do the same.

~~~
emcq
External memory kills your power budget. Efficient processors, whether they
are GPUs, CPUs, or ASICs, all gain efficiency by keeping data local to the
computation. A chip that mostly shuttles off-chip data is difficult to
optimize for power unless your problems are compute-bound.
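
Rough published figures illustrate the gap; these are approximate 45nm
numbers along the lines of Horowitz's ISSCC 2014 keynote, quoted from memory,
so treat them as order-of-magnitude only:

```latex
% Approximate energy per 32-bit operation, 45nm process:
E_{\mathrm{FP\,mult}} \approx 4\,\mathrm{pJ}, \qquad
E_{\mathrm{SRAM\,read}} \approx 5\,\mathrm{pJ}, \qquad
E_{\mathrm{DRAM\,read}} \approx 640\,\mathrm{pJ}
% An off-chip DRAM access costs on the order of 100x a multiply, so
% keeping weights and activations on-die is where the perf/watt win
% actually comes from.
```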

------
semisight
This is huge. If they really do offer such a perf/watt advantage, they spell
serious trouble for NVIDIA. Google is one of only a handful of companies with
the upfront cash to make a move like this.

I hope we can at least see some white papers soon about the architecture--I
wonder how programmable it is.

~~~
nickysielicki
There's no way Google lets this leave their datacenters. Chip fabrication is a
race to the bottom at this point. [1]

Google is doubling down on hosting as a source of future revenue, and they're
doing that by building an ecosystem around Tensorflow.

What I think is interesting is how weak Apple looks. Amazon has the talent
and money to compete with Google on this playing field. Microsoft is late,
but it can compete, too.

Where's Apple? In the corner dreaming about mythical self-driving luxury cars?

[1]: [http://spectrum.ieee.org/semiconductors/design/the-death-
of-...](http://spectrum.ieee.org/semiconductors/design/the-death-of-moores-
law-will-spur-innovation)

~~~
kuschku
> There's no way Google lets this leave their datacenters. Chip fabrication is
> a race to the bottom at this point. [1]

I’d hope someone somewhere steals the blueprints and posts all of them
publicly online.

The whole point of patents was that companies would publish everything, but
get 20 years of protection.

But these days, companies like Google in particular don't do so anymore – and
everyone loses out.

EDIT: I’ll add the standard disclaimer: If you downvote, please comment why –
so an actual discussion can appear, which usually is a lot more useful to
everyone.

~~~
vidarh
There's little need for anyone to steal the blueprints. It's unlikely there's
anything particularly "special" there other than identifying operations in
TensorFlow that take long enough, are carried out often enough, and are simple
enough to be worth turning into an ASIC. If there's a market for it,
there will be other people designing chips for it too.

------
mrpippy
Bah, SGI made a Tensor Processing Unit XIO card 15 years ago.

Evidence suggests they were mostly for defense customers:

[http://forums.nekochan.net/viewtopic.php?t=16728751](http://forums.nekochan.net/viewtopic.php?t=16728751)
[http://manx.classiccmp.org/mirror/techpubs.sgi.com/library/m...](http://manx.classiccmp.org/mirror/techpubs.sgi.com/library/manuals/4000/007-4222-002/pdf/007-4222-002.pdf)

~~~
Keyframe
Damn, what a riveting read it would be to find out what IT toys defense has
now! Even informed speculation would be nice to read.

------
jhartmann
3 generations ahead of Moore's law??? I really wonder how they are
accomplishing this beyond implementing the kernels in hardware. I suspect
they are using specialized memory and an extremely wide architecture.

Sounds like they also used this for AlphaGo. I wonder how badly we were off on
AlphaGo's power estimates. It seems everyone assumed they were using GPUs;
sounds like they were not, at least not entirely. I would really LOVE for them
to market these for general use.

~~~
reitzensteinm
But isn't 3 generations ahead just 8x? That doesn't sound at all unreasonable
for custom hardware.

~~~
dnautics
This is about right! 64-bit IEEE fp -> 16-bit IEEE-style fp[0] is a 4x
reduction in bit width, and multiplication is O(n^2) in silicon transistor
count.

[0] If Google is smart, they'd ditch +/\- infinity, and if they were ballsy,
they'd ditch zero in their FP implementation.
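
Spelling out the arithmetic behind that (an idealized count that ignores
exponent hardware and wiring):

```latex
% Multiplier transistor count scales roughly as n^2 in operand width n:
\frac{64^2}{16^2} = \frac{4096}{256} = 16 \approx 2^4
% So going from 64-bit to 16-bit multiplies buys roughly 16x the
% multipliers per unit area, i.e. about four process generations'
% worth of density at ~2x per generation -- the same ballpark as the
% "3 generations ahead" claim.
```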

~~~
Symmetry
Generally speaking GPUs are already very good at running with float32s,
usually much better than they are at using float64s in fact. The big
advantages of using an ASIC are mostly on the storage side, but they also
allow you to get away with non-IEEE floating point numbers that don't
necessarily implement subnormals, NaN, etc.

------
asimuvPR
Now _this_ is really interesting. I've been asking myself why this hadn't
happened before. It's been all software, software, software for the last decade
or so. But now I get it. We are at a point in time where it makes sense to
adjust the hardware to the software. Funny how things work. It used to be the
other way around.

~~~
kens
This is known as the Wheel of Reincarnation. Functionality moves to special-
purpose hardware, then back to software, and the cycle repeats. (The computing
term dates from 1968, so this has been happening for a long time.)

[http://www.catb.org/jargon/html/W/wheel-of-
reincarnation.htm...](http://www.catb.org/jargon/html/W/wheel-of-
reincarnation.html)

~~~
aab0
I've been thinking for a while that with the end of silicon process shrinks,
we may start seeing the final cycles of the wheel, with a final stop at mostly
specialized hardware for greater power efficiency.

~~~
asimuvPR
Not only power efficiency but software efficiency. Custom chips combined with
DSLs are a powerful combination. At the expense of segmentation, of course.

------
breatheoften
A podcast I listen to posted an interview with an expert last week who said
he perceived that much of the interest in custom hardware for machine
learning tasks died when people realized how effective GPUs were at the
(still-evolving) set of tasks.

[http://www.thetalkingmachines.com/blog/2016/5/5/sparse-
codin...](http://www.thetalkingmachines.com/blog/2016/5/5/sparse-coding-and-
madbits)

I wonder how general the gains from these ASICs are and whether the
performance/power efficiency wins will keep up with the pace of
software/algorithm-du-jour advancements.

~~~
hyperopt
I listen to the Talking Machines as well. Great podcast. Another question
would be whether the gains are worth the cost of an ML-specific ASIC. GPUs
have the entire, massive gaming industry driving the cost down. I suppose
that as
adoption of gradient-descent-based neural networks increases, it may be worth
the cost in a similar way that GPUs are worth the cost. Then again, I have
never implemented SGD on a GPU so I'm not aware if there are any bottlenecks
that could be solved with an ML-specific ASIC. Can anyone else shed some light
on this?

~~~
indolering
> massive gaming industry driving the cost down.

Per-unit manufacturing cost scales logarithmically. Even a single batch of
custom silicon on yesterday's technology is only $30K. This is one of the
reasons there is so much interest in RISC-V; hardware costs are not the
barrier-to-entry that they used to be.

So yeah, the gaming market pushes the per-unit price of GPUs down, but even an
additional 2x reduction in rackspace and power will pay for itself at the
right scale.
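
A toy break-even calculation, with every number below made up purely for
illustration (only the $30K NRE figure comes from the comment above):

```python
# Toy break-even estimate for custom silicon vs. commodity GPUs.
nre_cost = 30_000.0           # one-time mask/NRE cost ($), older node
watts_saved_per_unit = 100.0  # hypothetical power saved vs. a GPU
dollars_per_watt_year = 1.0   # hypothetical cost of power + cooling

def units_to_break_even(years: float) -> float:
    """Number of deployed chips at which power savings cover the NRE."""
    savings_per_unit = watts_saved_per_unit * dollars_per_watt_year * years
    return nre_cost / savings_per_unit

print(units_to_break_even(years=1.0))  # -> 300.0 units in one year
```

At Google's deployment scale, even pessimistic versions of these numbers
clear the bar easily.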

------
RIMR
Somewhat off topic, but if you look at the lower-left corner of the heatsink
in the first image, there are two red lines and some sort of image artifact.

[https://2.bp.blogspot.com/-z1ynWkQlBc8/VzzPToH362I/AAAAAAAAC...](https://2.bp.blogspot.com/-z1ynWkQlBc8/VzzPToH362I/AAAAAAAACp0/2QBREGUEikoHrML1nh9h3SEKQVzm8NV7QCLcB/s1600/tpu-2.png)

They probably didn't mean to use this version of the image for their blog -
but I wonder what they were trying to indicate/measure there.

~~~
cm3
Did the image change? I can't seem to find what you're describing.

~~~
davidlakata
It's still there: the red lines outline the lower left hand corner of the
heatsink (the big metallic structure).

~~~
cm3
Got it. That looks like a bad digital image stitching job. Especially the
misaligned fins. But the red lines are odd indeed.

------
danielvf
For the curious, that's a plaque on the side of the rack showing the Go board
at the end of AlphaGo vs. Lee Sedol Game 3, at the moment Lee Sedol resigned
and AlphaGo won the tournament (of five games).

~~~
visarga
By the way, all the more merit to Lee Sedol now that we know what he was playing against.

------
nkw
I guess this explains why Google Cloud Compute hasn't offered GPU instances.

~~~
hyperopt
That's what I'm thinking. I was anticipating the release of GPU instances, but
now I'm thinking that they will simply leapfrog over GPU instances straight to
this.

------
fiatmoney
I'm guessing that the performance / watt claims are heavily predicated on
relatively low throughput, kind of similar to ARM vs Intel CPUs - particularly
because they're only powering it & supplying bandwidth via what looks like a
1X PCIE slot.

IOW, taking their claims at face value, an Nvidia card or Xeon Phi would be
expected to smoke one of these, although you might be able to run N of them in
the same power envelope.

But those bandwidth & throughput / card limitations would make certain classes
of algorithms not really worthwhile to run on these.
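
For scale, assuming the slot in the photo really is x1, here are the standard
per-direction PCIe bandwidth figures:

```python
# Approximate usable PCIe bandwidth per direction, in GB/s (spec values).
pcie_gb_s = {
    ("gen2", 1): 0.5,     # x1 link, ~500 MB/s
    ("gen3", 1): 0.985,   # x1 link, ~1 GB/s
    ("gen3", 16): 15.75,  # x16 link, what a typical GPU gets
}
# On an x1 link the TPU would see ~1/16th of a GPU's host bandwidth,
# so it only makes sense for workloads where the model stays resident
# on the card and little data crosses the bus.
```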

~~~
arcanus
> I'm guessing that the performance / watt claims are heavily predicated on
> relatively low throughput, kind of similar to ARM vs Intel CPUs -
> particularly because they're only powering it & supplying bandwidth via what
> looks like a 1X PCIE slot.

Agreed. It also tells you that they don't need to communicate with the CPU
much, given that it only has a single PCIe lane. Reminds me of Knights Ferry
in this respect.

> a Nvidia card or Xeon Phi would be expected to smoke one of these

Will be very interesting to see some head-to-head benchmarks between these
guys (on TensorFlow and other libraries) in the next few months. Especially as
Knights Landing starts to appear, and the new Nvidia card.

~~~
visarga
I am so happy that Nvidia has a real competitor in the ML hardware market.
Maybe now they will be twice as creative.

------
bravo22
Given the insane mask costs at smaller geometries, the ASIC is most likely a
Xilinx EasyPath or Altera HardCopy. Otherwise the amortization of the mask and
dev costs -- even for a structured-cell ASIC -- over 1K units wouldn't make
much sense versus the extra cooling/power costs of a GPU.

~~~
nickpsecurity
Don't forget shuttle runs. Adapteva used those and otherwise good engineering
practice to develop two products, the latest in 65nm, with no more than $2mil.
This one might be simpler and cheaper given its requirements.

[http://www.adapteva.com/andreas-blog/a-lean-fabless-
semicond...](http://www.adapteva.com/andreas-blog/a-lean-fabless-
semiconductor-startup-model/)

~~~
bravo22
True. That's another possibility.

I'd imagine that since they'd want to squeeze every bit of performance per
watt out of the chip, they'd want to go for the smallest node possible.
Virtex7 EasyPath is 16nm! It is pin-to-pin compatible with the FPGA version --
because they just change the mask layer and you get it in about 6 weeks. Hard
to beat that.

~~~
nickpsecurity
I didn't know they were still doing EasyPath. And I surely didn't know they
did pin-for-pin on 16nm. Holy crap! :)

~~~
bravo22
Yes. It is very popular for high-end FPGA use cases. Essentially instead of
relying on the built-in routing matrix which makes FPGAs what they are, they
modify the metal layer and connect the chip per your design as an ASIC. You
get much lower power consumption and much faster clock rates. You also get it
in about 6-8 weeks, and it is guaranteed to match your original design in
functionality. It is 100% pin-compatible because it is the same base silicon
and packaging.

~~~
nickpsecurity
Cool stuff. Previously, I was looking at eASIC or Triad if I needed this,
because I thought the FPGA people had cancelled S-ASICs. Good to know there's
a high-end one from Xilinx. Here are the others in case you didn't know about
them:

[http://www.easic.com/products/28-nm-easic-
nextreme-3/](http://www.easic.com/products/28-nm-easic-nextreme-3/)

[https://www.triadsemi.com/reconfigurable-full-custom-
asic/](https://www.triadsemi.com/reconfigurable-full-custom-asic/)

eASIC has a maskless capability where they straight-up print your silicon for
prototyping/testing. Triad brought S-ASICs to analog/mixed-signal. They're
top players. eASIC's basic prototyping was $50k for 50 chips on older ones.
Idk now. Triad I heard is $400k flat. Need a price quote to be sure. ;)

~~~
bravo22
That's awesome. I had heard about them before but never used them. Crazy how
low that price is.

~~~
nickpsecurity
Yeah, Triad is still in startup mode and picky. eASIC has been around quite a
while. They also have ezCopy or something to produce ASICs from their
S-ASICs. A side benefit is there's lots of pre-tested IP, including the
Gaisler OSS CPU.

So, worth considering. I need to get numbers on Xilinx, though, in terms of
pricing and royalties. Especially if they have something for 28nm, 45nm, or
65nm that will be significantly cheaper than the others.

------
Coding_Cat
I wonder if we will be seeing more of this in the (near) future. I expect so,
and from more people than just Google. Why? Look at the problems the fabs have
had with the latest generation of chips; as transistors grow smaller, the
problems will probably multiply. We are already close to the physical limit of
transistor size. So, it is fair to assume that Moore's law will (hopefully)
not outlive me.

So what then? I certainly hope the tech sector will not just leave it at
that. If you want to continue to improve performance (per watt), there is only
one way to go: improve the design at the ASIC level. ASIC design will probably
stay relatively hard, although there will probably be some technological
solutions to make it easier with time; and if fabrication stalls at a certain
nm level, production costs will probably start to drop with time as well.

I've been thinking about this quite a bit recently because I hope to start my
PhD in ~1 year, and I'm torn between HPC and Computer Architecture. This seems
to be quite a pro for Comp. Arch ;).

------
phsilva
I wonder if this architecture is the same Lanai architecture that Google
recently contributed to LLVM. [http://lists.llvm.org/pipermail/llvm-
dev/2016-February/09511...](http://lists.llvm.org/pipermail/llvm-
dev/2016-February/095118.html)

~~~
startling
No, this is an ASIC. It's not general-purpose.

------
taliesinb
I don't know much about this sort of thing, but I wonder if the ultimate
performance would come from co-locating specialized compute with memory, so
that the spatial layout of the computation on silicon ends up mirroring the
abstract dataflow DAG, with fairly low-bandwidth, energy-efficient links
between static register arrays that represent individual weight and grad
tensors. Minimize the need for caches and power-hungry high-bandwidth lanes;
ideally the only data moving around is your minibatch data going one way and
your grads going the other way.

I wonder if they're doing that, and to what degree.

------
harigov
How is this different from - say - the synthetic neurons that IBM is working
on, or what Nvidia is building?

~~~
DannyBee
Without sounding crass:

1\. It works already (IE it's _already in use_ )

2\. It works really well (or else they wouldn't be using it so broadly)

3\. Considering how long this was said to be in development, it also likely
means they are working on the next big improvement before these guys have even
gotten the current one working.

------
Bromskloss
What are the capabilities that a piece of hardware like this needs to have to
be suitable for machine learning in general (and not just one specific machine
learning problem)?

~~~
wmf
AFAIK 16-bit "half-precision" floating point.

~~~
HappyTypist
8-bit is enough, and I suspect it's what the TPU is using:
[https://petewarden.com/2016/05/03/how-to-quantize-neural-
net...](https://petewarden.com/2016/05/03/how-to-quantize-neural-networks-
with-tensorflow/)
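
For the curious, the scheme in that post is basically min/max linear
quantization. A minimal sketch (my paraphrase, not TensorFlow's actual code):

```python
import numpy as np

def quantize_8bit(w: np.ndarray):
    """Min/max linear quantization: map floats onto the 0..255 range."""
    lo, hi = float(w.min()), float(w.max())
    scale = (hi - lo) / 255.0 or 1.0  # guard against constant arrays
    q = np.round((w - lo) / scale).astype(np.uint8)
    return q, lo, scale

def dequantize_8bit(q: np.ndarray, lo: float, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale + lo

w = np.random.randn(4, 4).astype(np.float32)
q, lo, scale = quantize_8bit(w)
print(np.abs(w - dequantize_8bit(q, lo, scale)).max())  # <= scale / 2
```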

~~~
visarga
It is interesting how malleable neural networks are.

\- you can drop half the connections during training (dropout; see the sketch
at the end of this comment) and it still works; in fact it works even better

\- you can represent the weights on as little as one bit, but still use real
numbers for computing activations

\- you can insert layers and extend the network

\- you can "distill" a network into a smaller, almost as efficient network or
an ensemble of heavy networks into a single one with higher accuracy

\- you can add a fixed weights random layer and sometimes it works even better

\- you can enforce sparsity of activations and then precompute a hash function
to only activate those neurons that will respond to the input signal, thus
making the network much faster

It seems the neural network is a malleable entity with great potential for
speedups on the algorithmic side. They got the 10x mainly by exploiting a few
of these ideas, rather than by making the hardware 10x faster. Otherwise, they
wouldn't have made it the size of an HDD - they would need much more
ventilation to dissipate the heat. It's just specialized hardware taking
advantage of the latest algorithmic optimizations.
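
The first item on that list (dropout) is easy to show in miniature; a minimal
sketch, assuming the standard "inverted dropout" formulation:

```python
import numpy as np

def dropout(activations: np.ndarray, p_drop: float = 0.5,
            train: bool = True) -> np.ndarray:
    """Inverted dropout: randomly zero units during training, scaling
    the survivors so the expected activation stays unchanged."""
    if not train:
        return activations  # no-op at inference time
    mask = np.random.rand(*activations.shape) >= p_drop
    return activations * mask / (1.0 - p_drop)
```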

------
nathan_f77
I'm thinking that this has the potential to change the context of many
debates about the "technological singularity", or AI taking over the world,
because those debates all seem to be based on FUD.

While reading this article, one of my first reactions was "holy shit, Google
might actually build a general AI with these, and they've probably already
been working on it for years".

But really, nothing about these chips is unknown or scary. They use algorithms
that are carefully engineered and understood. They can be scaled up
horizontally to crunch numbers, and they have a very specific purpose. They
improve search results and maps.

What I'm trying to say is that general artificial intelligence is such a
lofty goal that we're going to have to understand every single piece of the
puzzle before we get anywhere close, including building custom ASICs and
writing all of the software by hand. We're not going to accidentally leave any
loopholes open where AI secretly becomes conscious and decides to take over
the world.

~~~
Houshalter
Just because you understand a machine doesn't mean it can't be dangerous. I
could completely understand every aspect of a nuclear bomb, and I could still
make a mistake and cause quite a bit of damage with a real one. Complex
systems are notorious for having all sorts of unexpected problems, and
mistakes happen all the time. How much complex software is entirely bug free?

The danger of AI is more than just a random bug though. It's that an
intelligent AI is not inherently good. If you give it a goal, like making as
many paperclips as possible, it will do everything in its power to convert the
world to paperclips. If you give it the goal of self-preservation, it will try
to destroy anything that has a 0.0001% chance of hurting it, and make as many
redundant copies of itself as possible. Etc.

Very, very few goals actually result in an AI that wants to do _exactly_ what
you want it to do. And if the AI is incredibly powerful, that will be a very
bad outcome for humanity.

~~~
visarga
You can condition the AI on the well-being and freedom of the human
population. It's hard to define precisely what that means, but it can be
approximated with indirect measures. This is just what Asimov thought of in
his novels.

Another way to protect against catastrophe would be to launch multiple AI
agents that optimize for the goal of nurturing humanity. They can keep each
other in check.

Also, humans will evolve as well. Genetics is advancing very fast. We will be
able to design bigger/better brains for ourselves, perhaps also with the help
of AI. Human learning could be assisted by AIs to achieve much higher levels
than today.

We will also be able to link directly to computers and become part of their
ecosystem, thus creating an incentive for it to keep us around. Taking this
path would enable uploading and immortality for humans as well.

My guess is that we will all become united with the AI. We already are united
by the internet and we spend a lot of time querying the search engine (AI),
learning its quirks and, by feedback, helping improve it. This trend will
continue up to the point where humans and AI become one thing. Killing humans
would be for the AI like cutting out a part of your brain. Maybe it will want
a biological brain of its own and come over to the other side, of biological
intelligence.

~~~
Houshalter
We don't know how to "condition" an AI to respect the well-being and freedom
of humanity. It's an extremely complicated problem with no simple solutions.
Making an AI that wants to destroy humanity, however, is quite
straightforward. Guess which one will most likely be built first?

Building multiple AIs doesn't solve anything. They can just as easily
cooperate to destroy humanity as to help it.

Uploading humans won't be possible until we can already simulate intelligence
in computers. We can't have uploads _before_ AI.

------
cschmidt
This seems very similar to the "Fathom Neural Compute Stick" from Movidius:

[http://www.movidius.com/solutions/machine-vision-
algorithms/...](http://www.movidius.com/solutions/machine-vision-
algorithms/machine-learning)

TensorFlow on a chip....

~~~
azinman2
Although that stuff seems to be more about on-the-fly reasoning at low
wattage, so you can embed a drone with neural nets... This is more for
servers.

------
isseu
Tensor Processing Unit (TPU)

They've been using it for over a year? Wow.

------
revelation
There is not a single number in this article.

Now, these heatsinks can be deceiving for boards that are meant to sit in a
server rack unit with massive fans throwing a hurricane over them, but even
then, that is not very much power we're looking at.

------
hyperopt
The Cloud Machine Learning service is one that I'm highly anticipating.
Setting up arbitrary cloud machines for training models is a mess right now. I
think if Google sets it up correctly, it could be a game changer for ML
research for the rest of us. Especially if they can undercut AWS's GPU
instances on cost per unit of performance through specialized hardware. I
don't think the coinciding releases/announcements of TensorFlow, Cloud ML, and
now this are an accident. There is something brewing and I think it's going to
be big.

~~~
visarga
I'd love to have a place to experiment cheaply without being required to
invest $2-3K up front.

~~~
hyperopt
It doesn't just need to be a place to experiment cheaply. Many companies
building software around ML techniques still rent time on EC2. Unless you are
training models 24/7 and have your machines located somewhere very cost-
efficient in terms of power/cooling, it's probably better for your training to
be done in the cloud. I think very few use cases fall into that category.

------
saganus
Is that a Go board stuck to the side of the rack?

Maybe they play one move every time someone goes there to fix something? Or
could it be just a way of numbering the racks, or something eccentric like
that?

~~~
triangleman
It appears to be a commemorative plaque from when they defeated world
champion Lee Sedol.

~~~
saganus
Ah. That makes sense.

------
hristov
It is interesting that they would make this into an ASIC, given how
notoriously high the development costs for ASICs are. Are those costs coming
down? If so, life will get very hard for the FPGA makers of the world soon.

It would be interesting to see what the economics of this project are, i.e.,
what the development costs and cost per chip were. Of course, it is very
doubtful I will ever get to see those numbers, but it would be interesting.

~~~
zhemao
It's mainly a question of volume. ASICs have a big economy of scale, so the
cost-per-chip goes down considerably once you go over a certain number of
chips. Plus, there are all the NRE costs of an ASIC design over an FPGA
design.
Google probably figured they could use enough chips to make the cost of
manufacturing ASICs worthwhile.

I don't think FPGAs are going to be beaten out by ASICs for low-volume
applications anytime soon.

~~~
indolering
But even a custom run of silicon (on yesterday's technology) will only set you
back $30K. That's one of the reasons there is so much interest in RISC-V.

------
protomok
I'd be interested to know more technical details. I wonder if they're using
8-bit multipliers, how many MACs run in parallel, what the power consumption
is, etc.

------
j-dr
This is great, but can Google stop putting "tensor" in the name of everything
when nothing they do really has anything to do with tensors?

~~~
Difwif
I'm unsure if TensorFlow actually uses them, but I know that in the
literature some models use tensor decompositions for learning.

------
__jal
My favorite part is what looks like flush-head sheet metal screws holding the
heat sink on.

No wondering where you left the Torx drivers with this one.

------
j1vms
I wouldn't be surprised if Google is looking to build (or has done so
already) a highly dense and parallel analog computer with limited-precision
ADCs/DACs. I mean, that's simplifying things quite a bit, but it would
probably map pretty well to the TensorFlow application.

~~~
skaevola
What are the advantages of analog computing for this application?

~~~
read_only
If you have an application that can tolerate error (like classification), then
analog computing can give enormous gains in terms of speed _and_ power
efficiency. Essentially, the savings come from using physics to perform the
math (see Kirchhoff's current law) vs. using discrete time steps vs. fully-
unrolling the logic. Google may not be using analog processing for this
version, but I read the page of an analog neural network researcher who said
he moved to Google last year. (Sorry, I can't find the page again, but I think
he was from the UK.)
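
To make the "physics performs the math" point concrete, here is the textbook
example (not necessarily what Google does): feed N input voltages through
conductances into a summing node, and Kirchhoff's current law computes a dot
product in one step:

```latex
% Currents into a summing node add by Kirchhoff's current law:
I_{\mathrm{out}} = \sum_{i=1}^{N} G_i V_i
% With conductances G_i encoding weights and voltages V_i encoding
% inputs, the node current is exactly the weighted sum a neuron needs,
% with no clocked multiply-accumulate steps at all.
```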

~~~
dharma1
What do you think about [http://optalysys.com/](http://optalysys.com/) or
[http://lighton.io](http://lighton.io)?

~~~
read_only
I think Optalysys looks interesting!

For the curious, Optalysys has built a general-purpose optics-based
correlation/pattern-matching machine. From some of their predecessor company's
marketing material: The correlator performs pattern matching on large data
sets such as high-resolution images, providing a measure of similarity and
relative position between objects within the input scene. This allows large
images [and general data converted to images] to be analysed far faster than
electronic equivalents.

Going back to the topic of NN-based computing, I found this talk to be
intriguing:
[https://www.youtube.com/watch?v=dkIuIIp6bl0](https://www.youtube.com/watch?v=dkIuIIp6bl0).
The main argument is that because Moore's law may no longer be in effect, it
will become increasingly important to explore alternate computing solutions.
(Google's TPU could be supporting evidence for this argument.) The speaker
also co-authored a paper which I liked "General-Purpose Code Acceleration with
Limited-Precision Analog Computation".

~~~
skaevola
This video was very cool. Are there any ICs on the market now that can
perform analog computing for neural networks? I'm picturing something like an
FPGA but with a bunch of op-amps that you can connect into summers or
amplifiers.

If not, how would one practically implement an analog computer for neural
network programming (without several tables full of op-amps)?

~~~
read_only
Glad you liked the video!

You can implement an analog neural network yourself using a Field Programmable
Analog Array. (I've never done it, but you'll see academics online writing
papers about it.)

Another thing that is sort of related is Lyric Semiconductor; they built these
cool application-specific probabilistic processors; they were purchased by
Analog Devices a while back.

------
aaronsnoswell
I'm curious to know: is this announcement something that an expert in these
sorts of areas could have predicted (or did predict) months or years ago,
given Google's recent jumps forward in machine learning products? Can someone
with more knowledge about this comment?

~~~
hellameta
Sure. I don't have anything to link on the spot, but this was/is/has been
foreseeable for some time. Although it's all very cool and shiny, most
business applications of machine learning remain squarely in the territory of
classic algos like GLMs and forests (random forests, boosted trees, etc.). As
a fun
note, advances like these highlight that data scientists etc. will not be
beaten by more complex automated methods, but simply by speed. Much like the
filing system that 'runs' whatever you're using to see these words
([https://www.youtube.com/watch?v=EKWGGDXe5MA](https://www.youtube.com/watch?v=EKWGGDXe5MA)).

Edit: to elaborate... single model training runs can be done quite fast now,
but knowing how to tune hyperparameters remains the 'voodoo' of the field. But
the best hyperparameters can also be discovered through brute force: try every
combination you can! Today, you can use various heuristics to improve this
process, but either way, being able to train X times faster just means we can
search hyperparameter space that much faster. The robots are coming :)
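
A minimal sketch of the brute-force version (plain grid search; the parameter
names and scoring function are hypothetical stand-ins):

```python
from itertools import product

# Hypothetical search space; real ones are bigger and often sampled
# randomly rather than enumerated exhaustively.
grid = {
    "learning_rate": [1e-2, 1e-3, 1e-4],
    "batch_size": [32, 128],
    "hidden_units": [256, 1024],
}

def train_and_eval(learning_rate, batch_size, hidden_units):
    """Stand-in for a real training run returning validation accuracy.
    Here it's a fake score, just so the loop runs end to end."""
    return -abs(learning_rate - 1e-3) + hidden_units * 1e-6

best = max(
    (dict(zip(grid, vals)) for vals in product(*grid.values())),
    key=lambda hp: train_and_eval(**hp),
)
print(best)  # the combination with the highest (fake) score
# Training 10x faster simply means 10x more combinations per day.
```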

~~~
visarga
They could run 10x more experiments for the same cost and experiment with
many more configurations, but soon enough there will probably be an algorithm
that can do the same on a single computer. I am waiting for the moment neural
networks become as good as people at designing neural networks.

------
eggy
Pretty quick implementation.

On the energy savings and space savings front, this type of implementation
coupled with the space-saving, energy-saving claims of going to unums vs.
floats should get it to the next order of magnitude. Come on, Google, make
unums happen!

------
paulsutter
> Our goal is to lead the industry on machine learning and make that
> innovation available to our customers.

Are they saying Google Cloud customers will get access to TPUs eventually? Or
that general users will see service improvements?

------
nxzero
Is there any way to detect what hardware is being used by the cloud service
if you're using it? (Yes, I realize this question is a bit of a paradox, but I
figured I'd ask.)

------
mistobaan
Another point is that they will be able to provide much higher computing
capabilities at a much lower price point than any competitor. I really like
the direction that the company is taking.

------
swalsh
I wonder if opening this up as a cloud offering is a way to keep a whole
bunch of excess capacity around (in case it's needed for something big) but
have it paid for.

------
dharma1
Hasn't made a dent in Nvidia's share price yet.

------
amelius
One question: what has this got to do with tensors?

~~~
vintermann
That they're used a lot in machine learning? If you're processing video for
instance, you might have a 5-dimensional tensor: x, y, color channel, time
index and batch index.
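
In NumPy terms (the axis order is arbitrary; this is just one common
convention):

```python
import numpy as np

# A minibatch of 8 video clips: 16 frames of 64x64 RGB each.
# Axes: (batch index, time index, color channel, y, x)
video_batch = np.zeros((8, 16, 3, 64, 64), dtype=np.float32)
print(video_batch.ndim)  # 5 -- a rank-5 "tensor" in the ML sense
```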

------
eggy
I think the confluence of new technologies and the re-emergence/rediscovery
of older technologies is going to be the best combination. Whether it goes
that way is not certain, since the best technology doesn't always win out.
Here, though, the money should win, since all of these would greatly reduce
the time and energy spent in mining and validating:

* Vector processing computers - not von Neumann machines [1].

* Array languages, new ones or existing ones like J, K, or Q in the APL family [2,3]

* The replacement of floating point units with unum processors [4]

Neural networks are inherently arrays or matrices, and would do better on a
purpose-designed vector array machine, not a re-purposed GPU, or even the TPU
in the article sitting in a standard von Neumann machine. Maybe a non-von
Neumann architecture like the old Lisp Machines, but for arrays, not lists
(and no, this is not a modern GPU; the data has to stay on the processor, not
be offloaded to external memory).

I started with neural networks in the late 80s and early 1990s, and I was
mainly programming in C: matrices and FOR loops. Unfortunately, I only found
J, the array language, many years later. Businesses have been making enough
money off the advantage of the array processing language A+, then K, that the
per-seat cost of KDB+/Q (database/language) is easily justifiable. Other
software like RiakTS is looking to get in the game using Spark/Shark and other
pieces of kit, but a K4 query is 230 times faster than Spark/Shark, and uses
0.2GB of memory vs. 50GB. Similar technologies just don't fit the problem
space as well as a vector language. I am partial to J as a more mathematically
pure array language in that it is based on arrays. K4 (soon to be K5/K6) is
list-based at the lower level, and is honed for tick data or time series data.
J is a bit more general-purpose or academic in my opinion.

Unums are theoretically more energy-efficient and compact than floating
point, and take away the error-guessing game. They are being tested with
several different language implementations to validate their creator's claims
and their practicality. The Mathematica notebook that John Gustafson modeled
his work on is available to download for free from the book publisher's site.
People have already done some exploratory investigations in Python, Julia and
even J. I believe the J one is a 4-bit implementation based on unums 1.0. John
Gustafson just presented unums 2.0 in February 2016.

[1] [http://conceptualorigami.blogspot.co.id/2010/12/vector-
proce...](http://conceptualorigami.blogspot.co.id/2010/12/vector-processing-
languages-future-of.html)

[2] jsoftware.com

[3] [http://kxcommunity.com/an-introduction-to-neural-networks-
wi...](http://kxcommunity.com/an-introduction-to-neural-networks-with-kdb.php)

[4] [https://www.crcpress.com/The-End-of-Error-Unum-
Computing/Gus...](https://www.crcpress.com/The-End-of-Error-Unum-
Computing/Gustafson/p/book/9781482239867)

~~~
pwang
At this point in time, I think that the Python/NumPy stack offers the best
performance, productivity, and expressiveness trade-off. With the
[Numba](http://numba.pydata.org) just-in-time compiler, you can now easily
bounce between numeric SIMD code that leverages tuned BLAS/MKL, then go into
more explicit loop-oriented constructs that perform equivalently to hand-coded
C, while still being Python. If I were starting anew, it would be hard to
justify investing in a big J/K/Q code base or team, despite the potential
performance benefits.

I agree with your overall point that we're seeing a confluence of factors. The
advances in compiler technology, combined with the vectorial nature of the
problems that are interesting to solve in an era of big data, mean that we can
achieve a great deal of productivity by using high-level vector-capable
languages.

~~~
eggy
You may be right. I can't argue with Python's ubiquity; I have even steered my
son in that direction, but with a hitch: I still had him learn some J.

The creator of Pandas, Wes McKinney, had a link up a few years back
mentioning he was looking for people who were familiar with APL, J or K. It
seems he was working on a new project/startup (could this have been the
shuttered DataPad?). The links are dead now, but I will double-check.

If the creator of Pandas is/was eyeing the older APL, and its newer brethren,
I'd say it's a safe bet to keep J or K or Q on your radar because they fit.
They're vector/array based; they are fast and iterative with a REPL; there is
a lot of mathematical formalism in their origins and usage throughout the
years, yet they are more beginner-friendly than say Haskell IMHO. I like
Haskell too!

------
camkego
Does anyone have links to the talk or the graphs?

~~~
honkhonkpants
Looks to be approximately here
[https://youtu.be/862r3XS2YB0?t=7320](https://youtu.be/862r3XS2YB0?t=7320)

------
ungzd
Does it use approximate computing technology?

------
niels_olson
I like that the images are mislabeled :)

------
LogicFailsMe
Perf/W, the official metric of slow but efficient processors. How many times
must we go down this road?

Let's see this sucker train AlexNet...

~~~
dgacmu
Wearing my CMU hat for a moment (but keeping in mind Google's paying me this
year):

Google's always been cautious about the balance of speed and efficiency, out
of concerns about programmer productivity, parallelization, and generality.
See, for example, Urs's article in response to my and a few other people's
crazy-academic research on using "Wimpy" nodes:
[http://static.googleusercontent.com/media/research.google.co...](http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36448.pdf)
\- vs [http://www.cs.cmu.edu/~fawnproj/](http://www.cs.cmu.edu/~fawnproj/)

There's a big difference between just cranking down the GHz and going for ASIC
specialization. GPUs, for example, already represent a point on this spectrum
-- it's true that they run at reduced GHz compared to high-end CPUs, but
they're arithmetic monsters. The blog post notes, in fact, that the use of
TPUs in AlphaGo let them do _more_ searching. So why would you assume
automatically that they're slow?

~~~
LogicFailsMe
Because there's an absolute sippy straw of bandwidth to the thing if that's a
1x PCIe connection.

And if it were delivering performance on par with a $1000 Maxwell-class GPU,
why wouldn't you guys crow about it? That would be a really big deal, wouldn't
it? A TitanX for 20W? That'd be awesome.

And having suffered through multiple pitches for us to buy various FPGAs and
boutique processors, I have yet to see someone who produced perf-per-watt
numbers first subsequently produce an impressive performance number. In fact,
it took nearly yelling at one vendor to get them to finally admit perf was
abysmal.

Finally, training does not equal inference. Training requires strong scaling,
but inference need only scale weakly. So I suspect that Urs had to bite his
tongue and buy a bunch of GPUs for training networks.

Am I missing something?

~~~
dgacmu
That wasn't really my point - I'm simply noting that Urs is one of the last
people I'd think to hop on the wimpy crazy train. His published articles
suggest that he's got a very good grasp of the tradeoffs involved in "real"
TCO -- i.e., taking a fairly global view of the human, capital, and
operating expenses involved in a technology decision such as using wimpies
(no) or fabricating a custom ASIC for machine learning (yes).

That doesn't mean a TPU is faster or slower than anything in particular; it
just means that it's quite likely good for some machine learning tasks that
Google cares enough about to spend however many dollars it cost to make the
thing.

The WSJ article has a few more quotes from Norm Jouppi, btw.:
[http://www.wsj.com/articles/google-isnt-playing-games-
with-n...](http://www.wsj.com/articles/google-isnt-playing-games-with-new-
chip-1463597820) (Sorry if that gets paywalled. Googling "wall street journal
google tensor processing unit" got me there.)

~~~
LogicFailsMe
There's a huge gap between wimpy and GPU, no? I'd guess this chip was right-
sized to run inference on a 10 Gbit/s feed, optimized for perf/W.

------
rando3826
Why use an ANKY in the title? Using an ANKY (acronym no one knows yet) is bad
writing, makes readers feel dumb, etc. Google JUST NOW invented that acronym;
sticking it in the title like just another word we should understand is
absolutely ridiculous.

~~~
placeybordeaux
I honestly didn't find it that hard to understand as they recently released a
machine learning library that started with a T.

------
simunaga
In what sense is this great news? Yes, it's progress, so what? After all, you
- programmers - earn money from your jobs, and pretty soon you might not have
one, because of these kinds of "great news" -- "Whayyy, this is really
interesting, AI, machine learning. Aaaaa!".

"I'll get fired, won't have money for living and AI will take my place, but
the world will be better! Yes! Progress!"

Who will benefit from this? Surely not you. Why are you so ecstatic then?

~~~
visarga
I thought long and hard about the future in which 95% of jobs as they exist
today will be taken by robots. Initially, it would seem that people would be
reduced to beggars, depending on BHI or other forms of welfare to subsist. But
then it struck me:

You don't need jobs as long as you have land, renewable energy sources and
robots (and 3D printers). You can live in a community that is self-sufficient.
You will be employed by your land, as people always were up until 100 years
ago. We will also have robots, maybe not the latest generation, but we don't
need to go back to 19th-century agriculture.

It is you who will benefit in the end, if you can use AI to improve your life.
As long as AI doesn't remain locked in the hands of one entity and we all
share into the benefits, it will work out ok. In the short run we need some
sort of social welfare though, and to invest in renewables and self-
sufficiency technologies.

How self-sufficient could a country, city, village or small farm be? There is
a lot of potential to migrate back to a small-community agrarian economy with
robotics, 3D printing and solar panels.

<speculation>People could trade using a different currency than that used for
robot-produced goods. This currency would have to enforce differentiation of
economic agents (diversity) and integration (low barriers to entry): a
currency that automatically prevents the accumulation of power in a few hands
and works for humans. We have to build an economy that functions more like the
brain. In the brain there is no master neuron; they all share in the activity.
So should an enlightened human society.</speculation>

~~~
simunaga
Yes, could be. But it won't necessarily be so, and it's not necessarily true
that you and other people of the middle or lower class will be given the
luxury of having a robot and not working.

Maybe the lower and middle classes will have to serve people from the upper
class, who will have the robots.

