
Cloud TPUs in Beta - saeta
https://cloudplatform.googleblog.com/2018/02/Cloud-TPU-machine-learning-accelerators-now-available-in-beta.html
======
boulos
Disclosure: I work on Google Cloud.

I want to highlight this paragraph from the post:

> Here at Google Cloud, we want to provide customers with the best cloud for
> every ML workload and will offer a variety of high-performance CPUs
> (including Intel Skylake) and GPUs (including NVIDIA’s Tesla V100) alongside
> Cloud TPUs.

We fundamentally want Google Cloud to be the best place to do computing. That
includes AI/ML, and so you’ll see us both invest in our own hardware and
provide the latest CPUs, GPUs, and so on. Don’t take this announcement as
“Google is going to start excluding GPUs”, but rather that we’re adding an
option that we’ve found internally to be an excellent balance of
time-to-trained-model and cost. We’re still happily buying GPUs to offer to our Cloud
customers, and as I said elsewhere the V100 is a great chip. All of this
competition in hardware is great for folks who want to see ML progress in the
years to come.

~~~
sabalaba
Any plans to support AMD GPUs and the Radeon Open Compute project? The AI/ML
community really needs viable alternatives to NVIDIA, otherwise they will
continue to flex pricing power. Google, via TensorFlow, is in a phenomenal
position to promote open source alternatives to the proprietary Deep Learning
software ecosystem that we see today with CUDA/cuDNN.

~~~
londons_explore
Google would happily accept patches to enable support for it.

AMD hopefully has a team writing such patches now. It makes business sense for
them to do so.

Google is being price-gouged by Nvidia even more than the general public is,
and has even more incentive to level the playing field.

~~~
socceroos
Or the opposite: they're getting nice savings in return for not actively
developing or encouraging CUDA/cuDNN alternatives.

------
minimaxir
That $6.50/hr rate might be the big deal here. Amazon does offer instances
with a V100 GPU ([https://aws.amazon.com/ec2/pricing/on-demand/](https://aws.amazon.com/ec2/pricing/on-demand/),
the P3 instances), but if you're training something like ImageNet, you'll want
the biggest instance size (p3.16xlarge) at _$24.48/hr_.

Attaching a VM of similar power to a TPU on Google Compute Engine is much
cheaper
([https://cloud.google.com/compute/pricing](https://cloud.google.com/compute/pricing),
n1-highmem-64, +$3.78/hr to the TPU cost for $10.28/hr total).
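
A quick sanity check of that arithmetic (a sketch in Python; prices as quoted
above, on-demand, and subject to change):

    # Back-of-envelope hourly cost, using the prices quoted in this thread.
    tpu_hourly = 6.50    # Cloud TPU, $/hr
    vm_hourly = 3.78     # n1-highmem-64, $/hr
    p3_hourly = 24.48    # AWS p3.16xlarge (8x V100), $/hr

    gcp_total = tpu_hourly + vm_hourly                                # $10.28/hr
    print("GCP TPU + VM: $%.2f/hr" % gcp_total)
    print("AWS p3.16xlarge / GCP: %.2fx" % (p3_hourly / gcp_total))   # ~2.38x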

Per recent benchmarks for training ImageNet
([https://dawn.cs.stanford.edu/benchmark/](https://dawn.cs.stanford.edu/benchmark/)),
training on a p3.16xlarge cost $358, while this post claims it'll cost
less than $200. (EDIT: never mind; the benchmark uses ResNet-152, and Google
compares TPU performance on ResNet-50.) Interesting.

~~~
borramakot
Back of the envelope, a TPU costs a little more than 2x as much as a Volta on
AWS P3, and delivers a little less than 2x the performance (180 TOPs for the
TPU, 100 for Volta). On a raw performance/$ metric, I'm not sure the TPU is
that interesting.
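
Spelled out (a sketch; assumes ~$3.06/hr for a lone V100 via AWS's p3.2xlarge
on-demand price at the time, and peak numbers only):

    # Rough peak-Tops per dollar, per the envelope above.
    tpu_tops, tpu_cost = 180.0, 6.50    # Cloud TPU
    v100_tops, v100_cost = 100.0, 3.06  # single V100 (p3.2xlarge, assumed)

    print("TPU:  %.1f Tops/$" % (tpu_tops / tpu_cost))    # ~27.7
    print("V100: %.1f Tops/$" % (v100_tops / v100_cost))  # ~32.7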

It might be worth it if I were willing to pay a huge amount to get results
back from an experiment faster by using lots of TPUs; distributed training
on GPUs doesn't seem easy yet.

~~~
boulos
Disclosure: I work on Google Cloud.

Peak ops/second isn’t the only thing that matters, though. You have to be able
to feed the units. The V100 does lots of finer-grained matrix multiplies, which
can make it harder to keep them fed.

Don’t get me wrong, the V100 is a great chip. And we’re all looking forward to
more (preferably third-party) benchmark results, to tease out when one is the
better choice for a workload. But don’t just compare ops/second or any other
architectural number.

~~~
deepnotderp
This makes no sense; the V100 has _more_ memory bandwidth than both the TPUv1
and the TPUv2.

~~~
jlebar
V100 has 900 GB/s of memory bandwidth [0].

TPUv2 has 600 GB/s _per chip_ x 4 chips, so 2400 GB/s [1].

As we've discussed elsewhere [2], comparing TPUv2 to V100 on a _per chip_
basis doesn't make much sense. Who cares how many chips are on the board? If
Google announced tomorrow that TPUv3 is coming out, which is identical to
TPUv2 but the four chips are glued together, nobody would care.

The questions that we should instead be asking are, how fast can I train my
model and how much does it cost?

Per elsewhere in thread [3], on Volta you have 900 GB/s per 100 Tops/s = 9
GB/s per Top/s, whereas on TPUv2 you have 2400 GB/s of memory bandwidth over
180 Tops/s = ~13.3 GB/s per Top/s. This means that TPUv2's
memory-bandwidth-to-compute ratio is 13.3/9 = ~1.5x higher than Volta's.

We can do a similar comparison for memory capacity. V100 has 16 GB per
100 Tops; TPUv2 has 64 GB per 180 Tops. So the memory-to-compute ratio for
Volta is 16/100 = 0.16 GB per Top/s while for TPUv2 it's 64/180 = ~0.36 GB per
Top/s, a ratio of 0.36/0.16 = ~2.2x higher on TPUv2.

Does any of this matter? Does it translate into faster and/or cheaper
training? Do models actually need and benefit from this additional memory and
memory bandwidth? My guess from working on GPUs is yes, at least insofar as
bandwidth is concerned, but it's just a guess. I'm excited to find out for
real.
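
For concreteness, a sketch recomputing those ratios from the spec-sheet numbers
above (peak figures, not measured performance):

    # "Shape" comparison: bandwidth and memory per unit of peak compute.
    v100_bw, v100_tops, v100_mem = 900.0, 100.0, 16.0   # GB/s, Tops/s, GB
    tpu_bw, tpu_tops, tpu_mem = 4 * 600.0, 180.0, 64.0  # 4 chips per Cloud TPU

    bw_ratio = (tpu_bw / tpu_tops) / (v100_bw / v100_tops)     # ~1.48
    mem_ratio = (tpu_mem / tpu_tops) / (v100_mem / v100_tops)  # ~2.22
    print("TPUv2 vs V100, bandwidth-to-compute: %.2fx" % bw_ratio)
    print("TPUv2 vs V100, memory-to-compute:    %.2fx" % mem_ratio)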

(Disclaimer: I work at Google on XLA, and used to work on TPUs.)

[0]
[https://images.nvidia.com/content/technologies/volta/pdf/437...](https://images.nvidia.com/content/technologies/volta/pdf/437317-Volta-V100-DS-NV-US-WEB.pdf)
[1]
[https://supercomputersfordl2017.github.io/Presentations/Imag...](https://supercomputersfordl2017.github.io/Presentations/ImageNetNewMNIST.pdf)
[2]
[https://news.ycombinator.com/item?id=16360212](https://news.ycombinator.com/item?id=16360212)
[3]
[https://news.ycombinator.com/item?id=16359531](https://news.ycombinator.com/item?id=16359531)

~~~
twtw
I responded to your other comment to disagree, and I'll do so again here.

Nobody is comparing a DGX-1V to a single TPUv2 chip, because it doesn't make
any sense to do so; they are totally different kinds of machines. But for some
reason everyone is comparing a cluster of 4 TPUv2 chips to a single V100 chip.

It only makes sense to compare 4xTPUv2 to 1xV100 if they are equivalent in
some meaningful metric, like total die size, power, etc.

In lieu of any available data, I'm going to continue to assume that each TPUv2
chip is roughly comparable in terms of power & die size to each V100 chip. If
this was grossly wrong, I would expect that all four would be condensed into a
single chip, which would dramatically increase the performance of the
interconnects.

We could resolve this rapidly if there were any data available about the
TPUv2's die size, TDP, or anything else.

~~~
jlebar
> But for some reason everyone is comparing a cluster of 4 TPUv2 chips to a
> single V100 chip.

I agree that some people are doing that. Marketing, I suppose. But that
comparison is explicitly _not_ the point of my parent post. I'm comparing the
"shapes" of the chips -- specifically, the compute/memory and compute/memory-
bandwidth ratios. These ratios stay the same regardless of whether you
multiply the chips by 4 or by 400.

The point I was trying to make is that V100 has a higher
peak-compute-to-memory(-bandwidth) ratio than TPUv2. This much seems clear from the
arithmetic. Whether this matters in practice, I don't know, but I think it is
relevant if one believes (as I do, based on the evidence I have as an author
of an ML compiler targeting the V100) that the V100 is starved for memory
bandwidth.

> In lieu of any available data, I'm going to continue to assume that each
> TPUv2 chip is roughly comparable in terms of power & die size to each V100
> chip. If this was grossly wrong, I would expect that all four would be
> condensed into a single chip, which would dramatically increase the
> performance of the interconnects.

I'm sure Google's hardware engineers operate under a lot of constraints that
I'm not aware of; I'm not about to make assumptions. But more to the point, as
we've said, things like die size and TDP don't directly affect consumers. The
questions we have to ask are, how fast can you train your model, and at what
cost?

Just as you don't like it when people (incorrectly, I agree) insist on
comparing one V100 to four TPUs, because that's totally arbitrary (why not
compare one V100 to 128 TPUs?), I don't like it when people insist on
comparing TPUv2 to V100 on arbitrary metrics like die size, or peak
flops/chip, or whatever. So I disagree that we could resolve anything if we
had more info about the TPUv2 chip itself. None of that matters.

~~~
deepnotderp
Well, if you ignore power consumption because "it doesn't matter to the end
user", you're talking about economic comparisons, not technical comparisons.

BTW, I absolutely agree that memory bandwidth is the bottleneck; I've built my
company around that assertion, and the data for it exists (Mitra's
publications come to mind).

------
tejasmanohar
This is exciting. There are lots of specific reasons to choose Google Cloud
over AWS (and vice versa), but proprietary hardware is surely an advantage
that is going to be hard to replicate or compete with. If TPUs hold up to the
hype, GCloud may become the _de facto_ choice for ML/AI startups.

~~~
danjoc
> If TPUs hold up to the hype, GCloud may become the de facto choice for ML/AI
> startups.

Don't startups want to win a big exit though? Google won't need to buy the
startup for billions, because the TOS already grants them permission to use
all the models and training data for free. Seems like a Faustian bargain to
me.

~~~
shafyy
Regardless of whether the TOS says that or not (I haven't read them), I can
think of at least two reasons why your statement doesn't hold:

1) AI startups usually don't have a lot of value to potential acquirers
based on their data, but based on other things (e.g., talent, customers,
business model, platform, brand). That's like saying you shouldn't use AWS
because Amazon can just steal and commercialize all your data.

2) There are other companies than Google that acquire startups.

Having said that, I highly doubt that Google can just use all the training
data on GCloud to launch their own products. They can surely look at it and
maybe do stuff with it internally, but I am pretty sure that they can't use it
commercially.

~~~
danjoc
> They can surely look at it and maybe do stuff with it internally, but I am
> pretty sure that they can't use it commercially.

How would you ever know if they did? People who worked at Google have been
accused, by Google, of stealing the entire self-driving car program and taking
it to a competitor.

~~~
shafyy
First of all, that's wrong (as another comment pointed out). Of course, the
probability of them stealing your stuff is non-zero, but it's very rare. Even
if you use all your own hardware and software, people can still steal your
stuff :-)

~~~
danjoc
I can be hacked by malware which can leak secrets from air-gapped,
Faraday-caged machines. Therefore, I should put my billion-dollar idea on the
public cloud and just trust Google.

I shiggy diggy.

------
gcp
Interestingly, GCP now appears to be available to individuals in Europe. It
wasn't like that before; no idea when that policy changed. Until now, GCP
wasn't even a consideration compared to AWS (which has always accepted
individuals).

~~~
kyrra
More details: [https://cloud.google.com/billing/docs/resources/vat-
overview](https://cloud.google.com/billing/docs/resources/vat-overview)

~~~
gcp
"You can’t change the tax status of your Google Cloud Platform billing
account."

I think this is what tripped me up before. I closed my business years ago, but
it was completely impossible to get Google to fix this. Now it has fixed
itself.

Just a warning to everyone before signing up with your main Google account :-)

------
twtw
Some things:

A "single TPU" is 4 ASICs. It is not clear if it makes sense to compare a
"single TPU" to a "single GPU."

As a point of reference, NVIDIA's numbers are 6 hours for ResNet-50 on
ImageNet when training with 8xV100. From a naive extrapolation, 4xV100 would
probably take ~12 hours and 1xV100 about two days.
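
That extrapolation, spelled out (a sketch assuming perfectly linear scaling,
which makes the single-GPU estimate an upper bound):

    # Naive linear extrapolation from NVIDIA's 8x V100 figure.
    hours_8x = 6.0
    for n in (8, 4, 1):
        print("%dx V100: ~%.0f hours" % (n, hours_8x * 8 / n))
    # 8x: 6h, 4x: 12h, 1x: 48h (about two days)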

Google has previously only compared TPUs to K80, so it will be interesting to
see some benchmarks that compare TPUs to more recent GPUs. K80 was released in
2014, and the Kepler architecture was introduced in 2012.

~~~
jlebar
> A "single TPU" is 4 ASICs. It is not clear if it makes sense to compare a
> "single TPU" to a "single GPU."

Why does the number of chips matter?

Put another way, suppose Google tomorrow announced Cloud TPU v3, which was one
ASIC identical in all ways to four v2 ASICs glued together. Would that be
notable in any way? It seems like it would be a no-op to me.

I think what matters is, how fast can you train a model, and at what cost?
Doesn't really matter if it's one chip or 10,000 behind the scenes.

~~~
twtw
It doesn't matter in the ways you are considering. The ultimate comparisons
are going to be time, cost, and power to complete some benchmark, just as you
say.

I only mention the number of chips because loads of people are comparing the
"single TPU" to a single V100 with the assumption that it is meaningful. I
don't know the TDP, die size, etc. of the TPUv2 chip, so it may well make more
sense for ballpark comparisons to compare "single TPU" to 4xV100.

For example, a "single TPU" has 64 GB of memory, whereas a "single GPU" has 16
GB (V100). Is this meaningful? I don't know.

It just seems like something worth noting. I could buy a DGX1-V with 8xV100,
rebrand it as the TWTW TPU, and then go around and tell everyone how my TPU is
8x faster than GPUs. It appears that everyone is normalizing by marketing unit
until benchmarks come out, which is potentially flawed.

------
jakozaur
That may be Google Cloud's competitive edge for AI startups, both in terms of
development cycle and cost efficiency.

It's hard for competitors (AWS and Azure) to replicate.

------
Talyen42
How does this compare to Nvidia GPUs on AWS, price/performance-wise?

The article makes it sound like this is a new thing...

~~~
bloudermilk
Google claims[0] the TPU is many times faster for the workloads they've
designed it for.

> On our production AI workloads that utilize neural network inference, the
> TPU is 15x to 30x faster than contemporary GPUs and CPUs.

As far as I know, this will be the first opportunity for the public to verify
those claims, as until now TPUs haven't been available on GCP. I don't mean to
sound skeptical; I'm quite confident they're not exaggerating.

[0]: [https://cloudplatform.googleblog.com/2017/04/quantifying-
the...](https://cloudplatform.googleblog.com/2017/04/quantifying-the-
performance-of-the-TPU-our-first-machine-learning-chip.html)

~~~
vomjom
Keep in mind that what you linked refers to TPUv1, which was built for
quantized 8-bit inference. TPUv2, which was announced in this blog post, is
for general-purpose training and uses 32-bit weights, activations, and
gradients.

It will have very different performance characteristics.
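
For intuition about the distinction, here is a toy sketch of symmetric 8-bit
weight quantization (illustrative only; not Google's actual scheme):

    import numpy as np

    w = np.random.randn(4, 4).astype(np.float32)   # float32 weights
    scale = np.abs(w).max() / 127.0                # one scale for the tensor
    w_int8 = np.round(w / scale).astype(np.int8)   # what int8 inference stores
    w_approx = w_int8.astype(np.float32) * scale   # dequantized approximation

    print("max quantization error:", np.abs(w - w_approx).max())

Training, by contrast, keeps everything in floating point.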

~~~
bloudermilk
Thanks for pointing that out!

------
a_imho
tensor processing unit

[https://en.wikipedia.org/wiki/Tensor_processing_unit](https://en.wikipedia.org/wiki/Tensor_processing_unit)

------
bcheung
This seems a bit pricey compared to other offerings. Wouldn't an ASIC make
things more economical?

In terms of cost per performance, both AWS P3 spot instances and Paperspace
V100 offerings seem more economical.

Are these prices expected to become more competitive once this is out of beta?

~~~
make3
Isn't the TPU kind of a deep learning ASIC?

------
tveita
Is this just go-faster juice for TensorFlow code, or does it have other
implications? If you train on TPUs, can you still run the model efficiently
elsewhere?

------
mobileexpert
I assume Azure and AWS are buddying up with Intel/Nervana and Nvidia on some
counterstroke to Google TPUs. I can’t quite imagine what it will be, though.

~~~
jacksmith21006
Amazon announced today that they are working on their own TPU-type chips.

~~~
thousandx
Do you have a link for that?

~~~
jacksmith21006
"Amazon is reportedly following Apple and Google by designing custom AI chips
for Alexa"

[https://www.theverge.com/2018/2/12/17004734/amazon-custom-
al...](https://www.theverge.com/2018/2/12/17004734/amazon-custom-alexa-echo-
ai-chips-smart-speaker)

------
yazr
What are the chances of TensorFlow code gradually being optimized for TPUs
over GPUs?!

(Yes, TF is OSS, but realistically Google is putting far more resources into
it than anyone else.)

~~~
dgacmu
Very low. A lot of the performance on GPUs comes from Nvidia's optimizations
in cuDNN -- it's mostly a matter of making sure TensorFlow feeds the right
formats/etc. to cuDNN for core NN ops. TF should run well on CPUs, GPUs, TPUs,
and likely future embedded accelerators (via TensorFlow Lite, which already
supports the Android Neural Networks API).
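
As a concrete example of the "right formats" point, tensor layout is one such
knob (a sketch using the TF 1.x API; NHWC is TF's default, while NCHW is the
layout cuDNN has historically preferred on NVIDIA GPUs):

    import tensorflow as tf  # TF 1.x

    x = tf.random_normal([32, 3, 224, 224])  # batch in NCHW layout
    w = tf.random_normal([3, 3, 3, 64])      # HWIO filter
    y = tf.nn.conv2d(x, w, strides=[1, 1, 1, 1], padding="SAME",
                     data_format="NCHW")     # let cuDNN use its fast path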

(I'm part time on Brain, but, of course, this isn't some kind of Official
Statement(tm)).

------
tempay
Is there any way to use these for applications other than TensorFlow/machine
learning?

------
otterley
I'm puzzled by the phrase "differentiated performance per dollar."

Is it more performant, or less?

If it's less performant, why mention it at all?

If it's more performant, why not simply say "better performance per dollar"?

~~~
surajrmal
It is more performant both overall and per dollar.

------
bufferoverflow
We really need a standard, easy-to-run benchmark.

------
polskibus
When is an off-the-shelf edition coming?

~~~
DannyBee
My guess: Never

~~~
polskibus
I hope that's not true, for the sake of progress. Today's clouds wouldn't have
happened if AMD and Intel had restricted cloud use of their processors.

~~~
DannyBee
Among other things, it would be expensive (in a ton of ways), a digression,
and would require providing direct end-user support in a way they aren't good
at.

It also would face significant export restrictions: neural-network ASICs are
very tightly export-controlled:

[https://www.bis.doc.gov/index.php/forms-
documents/pdfs/1245-...](https://www.bis.doc.gov/index.php/forms-
documents/pdfs/1245-category-3/file)

(search for neural network)

My 2c: It would be an expensive waste of time for Google :)

Though certainly, not gonna disagree it would be cool for the sake of
progress.

------
aw4y
Has anyone thought about cryptocurrency mining?

------
utopcell
Game-changer.

------
whataretensors
I don't like it. Google is mixing too many things: no way to buy a TPU, no
competition from other cloud providers, and proprietary hardware with vendor
lock-in.

~~~
puzzle
This is really TensorFlow as a service. You get an IP address and a port you
send gRPC requests to:

[https://github.com/tensorflow/tpu/blob/master/tools/diagnost...](https://github.com/tensorflow/tpu/blob/master/tools/diagnostics/diagnostics.py#L103)

Presumably, there's a whole server behind that address that has all the right
drivers and libraries: details you don't need to care about.
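
Concretely, the client side is just a TF session pointed at that address (a
minimal sketch with a made-up IP, using the TF 1.x contrib API):

    import tensorflow as tf  # TF 1.x

    # Hypothetical endpoint; a Cloud TPU surfaces as a gRPC target like this.
    with tf.Session("grpc://10.240.1.2:8470") as sess:
        sess.run(tf.contrib.tpu.initialize_system())
        # ... build and run a TPU-compiled graph here ...
        sess.run(tf.contrib.tpu.shutdown_system())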

The only partial lock-in is that not all ops are supported, and you need to
figure out whether any parts of the graph on the critical path will run on the
CPU instead. There's a tool for that:

[https://cloud.google.com/tpu/docs/cloud-tpu-
tools#tpu_compat...](https://cloud.google.com/tpu/docs/cloud-tpu-
tools#tpu_compatibility_checker)

Competitors could launch something similar that uses GPUs tomorrow. Now, if
you don't already use TF and don't want to switch, that's another story.

~~~
whataretensors
That's my point. Competitors are largely moated out by high costs of TPU
production and proprietary drivers.

~~~
puzzle
Why are proprietary drivers a blocker? As long as you expose the same gRPC
interface, your customers don't need to know what happens behind the scenes.
You could have an FPGA or a Beowulf cluster of Raspberry Pis hiding back
there.

~~~
whataretensors
I should clarify: I like all the individual pieces (hardware, cloud services,
gRPC interface); I just wish you could opt into them independently.

------
ramshanker
Someone at Dell/HPE headquarters: when can we start selling "Integrated TPU"
machines? ;)

Google is aspiring to be the leader in cloud machine learning. Let's do
on-premise.

------
danjoc
Reading the TOS, it seems like this is a really great deal for Google:

"When you upload, submit, store, send or receive content to or through our
Services, you give Google (and those we work with) a worldwide license to use,
host, store, reproduce, modify, create derivative works (such as those
resulting from translations, adaptations or other changes we make so that your
content works better with our Services), communicate, publish, publicly
perform, publicly display and distribute such content. The rights you grant in
this license are for the limited purpose of operating, promoting, and
improving our Services, and to develop new ones."

All your training data are belong to us.

We can use your models to improve ours.

The terms will prevent me from using it. I can't grant Google permission to
redistribute HIPAA PHI.

~~~
barrus
Cloud TPU product manager here.

The TOS you are quoting only refers to the information you provide in the
survey. Here are the Google Cloud TOS:
[https://cloud.google.com/terms/](https://cloud.google.com/terms/), if you're
interested in what Cloud does with customers' data.

5.2 Use of Customer Data. Google will not access or use Customer Data, except
as necessary to provide the Services to Customer.

Your training data and models are secure.

~~~
danjoc
That URL isn't on the TPU beta signup page; the general Google TOS is. Perhaps
you can see the confusion? I would also be reluctant to trust a random
37-karma guy on a Hacker News message board on this particularly important
consideration.

