
Google’s Cloud TPU Pods are now publicly available in beta - cokernel_hacker
https://cloud.google.com/blog/products/ai-machine-learning/googles-scalable-supercomputers-for-machine-learning-cloud-tpu-pods-are-now-publicly-available-in-beta
======
pd0wm
I wonder if they will apply the same terms of service as with their Cloud
Machine Learning offerings (Auto ML, Cloud Vision, etc).

A snippet from [https://cloud.google.com/terms/service-terms#12-google-
cloud...](https://cloud.google.com/terms/service-terms#12-google-cloud-
platform-machine-learning-group-and-google-cloud-machine-learning-engine):

    
    
      Customer will not, and will not allow third parties to: (i) use these Services
      to create, train, or improve (directly or indirectly) a similar or competing
      product or service or (ii) integrate these Services with any applications for
      any embedded devices such as cars, TVs, appliances, or speakers without Google's
      prior written permission. These Services can only be integrated with
      applications for the following personal computing devices: smartphones, tablets,
      laptops, and desktops. In addition to any other available remedies, Google may
      immediately suspend or terminate Customer's use of these Services based on any
      suspected violation of these terms, and violation of these terms is deemed
      violation of Google's Intellectual Property Rights. Customer will provide Google
      with any assistance Google requests to reasonably confirm compliance with these
      terms (including interviews with Customer employees and inspection of Customer
      source code, model training data, and engineering documentation). These terms
      will survive termination or expiration of the Agreement.

~~~
grej
Sincerely thank you for the public service of posting that. You’ve probably
saved a lot of folks on here from a big mistake.

I had no idea GCP had such terms. I had been considering alternative cloud
hosting for a ML SaaS but will definitely not consider GCP.

~~~
dodobirdlord
You'll probably find that all ML SaaS EULAs look about the same.

------
rryan
Cloud TPU pods are seriously amazing. I'm a researcher at Google working on
speech synthesis, and they allow me to flexibly trade off resource usage vs.
time to results with nearly linear scaling due to the insanely fast
interconnect. TPUs are already fast even outside pods (8 TPU cores are 10x
faster for my task than 8 V100s), but pods open up possibilities I couldn't
easily build with GPUs. As a silly example, I can train on a batch size of 16k
(the typical batch size on one GPU is 32) by using one of the larger pod
slices, and it's about as fast per step as my usual batch size as long as the
batch size per TPU core stays constant. Getting TPU pod quota was easily the
single biggest productivity speedup my team has ever had.
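
To make the arithmetic concrete: the global batch is just the per-core batch
times the number of cores, so 16k is, e.g., 32 per core on a 512-core slice. A
minimal sketch (the slice sizes below are illustrative, not necessarily the
ones I used):

    # Global batch scales linearly with slice size at a fixed per-core batch.
    # Slice sizes are illustrative Cloud TPU v3 shapes, not a quota statement.
    per_core_batch = 32
    for num_cores in (8, 32, 128, 512):
        print(f"{num_cores:4d} cores -> global batch {per_core_batch * num_cores:5d}")
    # 512 cores -> global batch 16384 (~16k), at roughly the same step time.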

~~~
sytelus
Are TPUs a drop-in replacement for CUDA if you're using TF? Can I simply
change the device from CUDA to TPU and run any TF code? Last I heard, TPUs
still had a long way to go towards making this happen...

~~~
nl
If you are using TF, you can look at your model graph in TensorBoard and it
will show you any incompatible operations.

These days it's pretty good. Not perfect, but you can do RNNs on it now. The
FAQ has decent docs:
[https://cloud.google.com/tpu/docs/faq](https://cloud.google.com/tpu/docs/faq)
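
And to give a feel for how little code the happy path takes these days, here
is a minimal sketch using the TF 2.x distribution-strategy APIs (the TPU name
is a placeholder for your own TPU or grpc:// address):

    import tensorflow as tf

    # "my-tpu" is a placeholder; pass your TPU name or grpc:// address.
    resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="my-tpu")
    tf.config.experimental_connect_to_cluster(resolver)
    tf.tpu.experimental.initialize_tpu_system(resolver)

    # Models built under the strategy scope are replicated across TPU cores.
    strategy = tf.distribute.experimental.TPUStrategy(resolver)
    with strategy.scope():
        model = tf.keras.Sequential([
            tf.keras.layers.Dense(128, activation="relu"),
            tf.keras.layers.Dense(10),
        ])
        model.compile(
            optimizer="adam",
            loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        )
    # Ops that can't be lowered to TPU will fail at compile time here, which
    # is the same incompatibility TensorBoard flags in the graph view.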

------
64738
If you want to view their header image at larger size but right-clicking
doesn't give you the option to "Open image in new tab", the direct link is
below. Not a big deal, but it might save a few clicks for some:

[https://storage.googleapis.com/gweb-cloudblog-
publish/origin...](https://storage.googleapis.com/gweb-cloudblog-
publish/original_images/TPUpod.png)

------
oooshha
Is this pretty bad news for Nvidia?

~~~
p1esk
Nvidia could have released a DL specific chip a long time ago, if they wanted
to. I’m not sure why they haven’t (market not big enough?), but they probably
will at some point.

~~~
jlebar
(I work at Google on compilers for ml, including compilers for Nvidia gpus.)

Devices like the v100 and t4 _are_ ml-specific chips. You can do graphics
stuff on them, but that doesn't mean that Nvidia is leaving a ton of ml
performance on the table by including that capability. Indeed there may be
economies of scale for them in having fewer architectures to support.

They aren't dumb. :)

~~~
p1esk
V100 has 640 tensor cores and 5k general FP32/64 cores. Most of the DL
computation is done by the tensor cores. Can you imagine how much faster it
would get if they released a chip with, say, 10k tensor cores?

~~~
jlebar
> Can you imagine how much faster it would get if they released a chip with
> say 10k tensor cores?

I can, actually. :) Adding 10k tensor cores to a GPU would not make it run
much faster, and would be prohibitive in terms of die space. Moreover _getting
rid of the 5,000 FP32 cores_ would slow down DL workloads significantly.

The 640 Tensor Cores vs 5,000 F32 cores comparison is misleading, because they
are not measuring the same thing.

An "FP32/64 core" corresponds to a functional unit on the GPU streaming
multiprocessor (SM) which is capable of doing one scalar FP32 operation. One
FLOP, or maybe two if you are doing an FMA. V100 has 5120 FP32 units and 2560
FP64 units.

In contrast, a "Tensor Core" corresponds to a functional unit on the SM which
is capable of doing _64_ FMAs per clock. That is, a Tensor Core does 64 times
as much work as an FP32 core. Integrated circuits aren't magic: if you're
doing 64 times as much work, you need more die space.
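
To put numbers on the difference, here is the peak-throughput arithmetic; the
~1.53 GHz boost clock is my assumption from published V100 specs, the unit
counts are the ones above:

    # Peak throughput: 640 Tensor Cores vs 5120 FP32 units on a V100.
    clock_hz = 1.53e9  # assumed boost clock

    fp32_flops = 5120 * 2 * clock_hz        # 2 FLOPs/clock per FP32 unit (FMA)
    tensor_flops = 640 * 64 * 2 * clock_hz  # 64 FMAs/clock = 128 FLOPs/clock

    print(f"FP32 peak:        {fp32_flops / 1e12:.1f} TFLOPS")    # ~15.7
    print(f"Tensor core peak: {tensor_flops / 1e12:.1f} TFLOPS")  # ~125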

Moreover, there is nothing to say that nvidia isn't able to use some of the
same circuits for both the fp32 and tensor core operations. If they are able
to do this (I expect they are) then reducing one does not necessarily make
space for the other.

Increasing the number of tensor cores by a factor of ~16 (640 -> 10,000)
would not make the GPU 16x faster, probably not even 2x faster, because you
would quickly run into GPU memory bandwidth limitations. This is not a simple
problem to solve; nvidia GPUs are already pushing what is possible with HBM.
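
A rough roofline check shows why; the ~900 GB/s HBM2 bandwidth figure is my
assumption from published V100 specs:

    # Arithmetic intensity needed to be compute-bound rather than memory-bound.
    hbm_bw = 900e9          # bytes/s, assumed V100 HBM2 bandwidth
    tensor_flops = 125e12   # peak tensor-core FLOPS from above

    print(f"needed intensity: {tensor_flops / hbm_bw:.0f} FLOPs/byte")  # ~139

    # With ~16x the tensor cores at the same bandwidth you'd need ~2200
    # FLOPs/byte to keep them fed; few real kernels come close, so the extra
    # cores would mostly sit idle.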

Lastly, although you're correct that, in terms of number of flops, most DL
computation is done by tensor cores (if you've written your application in
fp16), that doesn't mean we could get rid of the f32 compute units, or even
that significantly reducing their number would have minimal effect on our
models. Recall Amdahl's law. We usually think about it in terms of speedups,
but it applies equally well in terms of slowdowns. If even 10% of our time is
spent doing f32 compute, and we make it 10x slower...well, you can do the
math.
[https://en.wikipedia.org/wiki/Amdahl%27s_law](https://en.wikipedia.org/wiki/Amdahl%27s_law)
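
Doing that math explicitly, with the 10% / 10x figures from the hypothetical
above:

    # Amdahl's law, applied as a slowdown: a fraction p of runtime is fp32
    # work, and we make that fraction s times slower.
    p, s = 0.10, 10.0

    new_time = (1 - p) + p * s  # original runtime normalized to 1
    print(f"overall slowdown: {new_time:.1f}x")  # 1.9x, from fp32 alone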

Indeed, I was just looking at an fp16 tensor-core cudnn kernel yesterday, and
even _it_ did a significant amount of fp32 compute.

The implicit argument I read in the parent post is that nvidia could build a
significantly better DL chip "simply" by changing the quantities of different
functional units on the GPU. This is predicated on nvidia being quite bad at
their core competency of designing hardware, despite their being the market
leader in DL hardware. It's kind of staggering to me how quickly nonexperts
jump to this conclusion.

Here's a talk I gave at cppcon about much of this (note that it's pre Volta).
[https://www.youtube.com/watch?v=KHa-
OSrZPGo&t=1s](https://www.youtube.com/watch?v=KHa-OSrZPGo&t=1s)

~~~
p1esk
Thank you for the detailed answer.

I think your main point is that memory bandwidth would prevent the performance
speedup. Are V100s memory bound when executing F16 ops on tensor cores?

Second, do we really need dedicated FP32 cores for DL? Tensor cores accumulate
in FP32 (is that what you meant when you said they did a significant amount of
FP32 compute?), and recent papers indicate we’re moving towards 8 bit training
[1]. Besides, do TPUs use dedicated FP32 hw?

Finally, if the memory bandwidth is indeed the bottleneck, perhaps all that
die area from FP32 and especially FP64 cores could be used for massive amount
of cache.

[1] [https://arxiv.org/abs/1805.11046](https://arxiv.org/abs/1805.11046)

~~~
jlebar
V100s are often memory bound when using tensor cores, yes. But I guess my
point is broader than that. There is a "right shape" for hardware that wants
to excel at a particular workload, depending on the arithmetic intensity,
degree of temporal locality, and so on. The point is that you usually can't
just turn one dimension up to eleven; it's rarely that simple.

For example, massively increasing the GPU's last-level cache would not do
much for effective memory bandwidth on most workloads, because a cache only
helps when you have temporal locality, and gpus like to stream through many
GB of data.

This is covered in Hennessy and Patterson if you're curious to learn more. I
also talk about it some in the video I linked above.

(Also I doubt that getting rid of f64 support would be a significant die size
win. I notice that v100 has, in their marketing speak, twice the fp32 cores as
fp64 cores. What do you think are the chances that Nvidia decided a priori
this is the optimal ratio? What if instead they are sharing resources between
these functional units, at a ratio of two to one?)

To the question of whether you really need fp32 cores: I am not aware of any
widely deployed GPU model today that does not do significant fp32 work.
Perhaps there is research suggesting this isn't necessary! But that is a
different question from the one we were discussing here: whether Nvidia could
somehow make a much better chip _for the things people are doing today_.

I don't want to speak to the question of whether TPUs have f32 hardware,
because I'm afraid of saying something that might not be public. But I think
the answer to your question can easily be found by some searching and is
probably even in the public docs.

------
harigov
Can anyone comment on the reliability/availability of these pods?

~~~
reilly3000
They have been at this internally for multiple years; I have to imagine they
are battle-tested.

------
m0zg
Unfortunately for Google, NVIDIA's offerings are very strong, and TPUs are a
pain in the rear to use. They require TensorFlow, which is itself a pain to
use, making it doubly painful, to the extent that using their offering
requires a significant degree of desperation or not knowing any better.

~~~
nl
Well you'll be glad to know that PyTorch is available (in development form) on
the TPU: [https://github.com/pytorch/xla](https://github.com/pytorch/xla)
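
For a feel of what that looks like, here is a minimal single-device sketch
using the torch_xla API from that repo (the model and data are placeholders,
and the API was still in development at the time):

    import torch
    import torch_xla.core.xla_model as xm

    # The XLA (TPU) device, analogous to torch.device("cuda").
    device = xm.xla_device()

    model = torch.nn.Linear(784, 10).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    # Placeholder batch; a real loop would pull from a DataLoader.
    x = torch.randn(32, 784, device=device)
    y = torch.randint(0, 10, (32,), device=device)

    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss.backward()
    # barrier=True forces XLA to compile and execute the accumulated graph.
    xm.optimizer_step(optimizer, barrier=True)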

~~~
m0zg
If it's "available" but doesn't do anything useful, it's IMO not really
"available". It's a good start though. I hope they don't throttle it in favor
of TF because that'd only prolong TF's agony.

~~~
nl
I think this is pretty unfair.

I know a bunch of PyTorch users who have switched back to TF 2.0, and that's
ignoring things like the Swift for TensorFlow implementation.

The deployment story around TF is a lot better too.

