
Train TensorFlow models faster and at lower cost on Cloud TPU Pods - frankchn
https://cloud.google.com/blog/products/ai-machine-learning/now-you-can-train-ml-models-faster-and-lower-cost-cloud-tpu-pods
======
etaioinshrdlu
Does this require distributed training?

In my understanding and experience, it is not always trivial to get linear
training speedups with additional machines.

It can sometimes be hard to get any speedup at all.

The difference with gradient descent can be described simply as:

* Single-machine training: take more steps

* Distributed training: take fewer but more confident and accurate steps. This can also let you take bigger steps (a higher learning rate), but there is a limit to that too.

They are not equivalent processes, and how to get an equivalent result with
distributed training is an area of active research.
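
For concreteness, here is a minimal sketch of synchronous data-parallel SGD
on a toy least-squares problem (the shard count, learning rate, and data are
all made up). Averaging the per-worker gradients is mathematically a single
large-batch step, which is why data parallelism gives you fewer, lower-variance
updates rather than more of them:

    import numpy as np

    rng = np.random.default_rng(0)
    X, y = rng.normal(size=(1024, 8)), rng.normal(size=1024)
    w = np.zeros(8)

    def grad(X_shard, y_shard, w):
        # Gradient of the mean squared error on one worker's shard.
        return 2.0 * X_shard.T @ (X_shard @ w - y_shard) / len(y_shard)

    num_workers = 4
    shards = list(zip(np.array_split(X, num_workers),
                      np.array_split(y, num_workers)))

    lr = 0.1
    for step in range(100):
        # Synchronous data parallelism: each worker computes a gradient on
        # its shard, the gradients are averaged, and one shared update is
        # applied. That is one big-batch step, not num_workers separate steps.
        g = np.mean([grad(Xs, ys, w) for Xs, ys in shards], axis=0)
        w -= lr * g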

~~~
antognini
I worked on a research project that set out to answer that question:

[https://arxiv.org/abs/1811.03600](https://arxiv.org/abs/1811.03600)

We looked at a bunch of different model architectures and datasets and found
that you can get speedups from larger batch sizes up until a certain point,
but that point is different for different datasets and architectures. The
range of good hyperparameters is also narrower for larger batch sizes, which
makes tuning harder.
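
A rough way to picture the headline result (the numbers below are invented,
not from the paper): steps-to-target falls in proportion to batch size up to a
model- and dataset-dependent critical batch size, and is flat beyond it.

    def steps_to_target(batch_size, examples_needed=1_000_000,
                        steps_floor=1_000):
        # Invented illustration of the scaling regimes the paper measures:
        # below the critical batch size, doubling the batch halves the steps
        # needed ("perfect scaling"); beyond it, extra data parallelism buys
        # nothing ("maximal data parallelism").
        return max(steps_floor, examples_needed // batch_size)

    for b in (64, 256, 1024, 4096):
        print(b, steps_to_target(b))  # 15625, 3906, then flat at 1000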

~~~
gdahl
To clarify, we found that the range shrinks if one trains for a fixed number
of epochs and expands if one trains for a fixed number of steps. So tuning can
be easier or harder depending on the budget.
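
A toy calculation of why the budget matters (the dataset size is made up): at
a fixed epoch budget, a larger batch means far fewer optimizer steps, so the
hyperparameters have less room for error; at a fixed step budget, a larger
batch simply sees more data per step.

    dataset_size = 1_000_000  # made-up size, for illustration only

    def num_steps(batch_size, epochs):
        # At a fixed epoch budget, the number of optimizer steps shrinks
        # as the batch size grows.
        return epochs * dataset_size // batch_size

    print(num_steps(256, epochs=10))   # 39062 steps
    print(num_steps(8192, epochs=10))  # 1220 steps: far less slack for tuning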

------
ethikal
Full MLPerf results can be found here:
[https://mlperf.org/results/](https://mlperf.org/results/)

------
KenoFischer
Julia runs on these ;). Well, it will once multicore access for non-TF
frameworks is public.

~~~
zak
We're on it! =D

------
polskibus
What a pity that Google TPUs are locked into Google Cloud; many would like to
use them on their own, in servers they already own. Restricting access to the
product like this slows ML progress.

~~~
kajecounterhack
How these chips work is pretty reliant on Google's specific infrastructure,
and there's a stark difference between the work needed to offer a service and
the work needed to sell chips. It isn't a matter of ill will or of trying to
slow ML progress, but of what's practical.

~~~
ori_b
> How these chips work is pretty reliant on Google's specific infrastructure.

I find that hard to believe. Can you be more specific?

~~~
puzzle
I haven't been at Google for years and never worked on TPUs, but off the top
of my head:

They control the system attached to the chips. That includes the kernel and
userland (glibc, libstdc++, etc.).

They only have a very limited number of configurations (CPU, RAM, network,
motherboard, any RDMA use) to qualify.

Monitoring, profiling, diagnostics, firmware, networking, security and
automation follow the internal Google standards.

They can tune cooling to accommodate Google's motherboards and racks, whether
it's air (v2) or liquid cooling (v3).

You can bet that they talk directly to the GCS backends through Stubby/gRPC,
rather than sending HTTP traffic through the outside network and traversing
GFEs.

Then there's all the other "mundane" stuff they don't need to worry about:
packaging, manuals, warranties and user-facing RMA, multi-tenancy, etc.

~~~
ori_b
As far as I'm aware, they're still on OCP, which standardizes the bulk of
this. Basically, it comes down to not wanting to be bothered with testing on
diverse hardware.

Which, fair enough, is a pain.

~~~
kajecounterhack
Google isn't on OCP afaik. Their hardware layer is pretty custom. Maybe you
were thinking of Facebook?

Google's hardware layer is one thing; as btian mentioned, the TPU is also
deeply integrated with Google's software infrastructure, e.g. Borg:
[https://ai.google/research/pubs/pub43438](https://ai.google/research/pubs/pub43438)

~~~
puzzle
Years ago, Google announced they'd join OCP. I'd be surprised if even half the
fleet made the transition, given how much mileage they like to get out of old
hardware.

Besides, for special toys like TPU pods it's not clear that they would value
adhering to OCP standards above anything else.

------
riku_iki
Curious what Nvidia's answer will be.

~~~
david-gpu
This is what NVidia says [0].

Notice how Google did not submit benchmark results for several tests.

[0] [https://blogs.nvidia.com/blog/2018/12/12/record-breaking-mlp...](https://blogs.nvidia.com/blog/2018/12/12/record-breaking-mlperf-ai-benchmarks/)

~~~
riku_iki
NVidia won the ResNet-50 benchmark with 80 DGX-1 nodes. Curious whether that
is accessible in the cloud, and how much it would cost to rent.

------
AlexCoventry
I thought TPUs were mostly used for inference, because they're low-precision.
Has that changed?

~~~
gradys
Yep, that's changed. Only the first generation was inference-only; subsequent
generations can do both inference and training.
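
For context on the low-precision point: TPU v2/v3 do their matrix multiplies
in bfloat16, which keeps float32's exponent range while dropping mantissa
bits, so training stays numerically stable as long as master weights remain in
float32. A minimal TensorFlow sketch of that pattern (the shapes are invented
for illustration):

    import tensorflow as tf

    x = tf.random.normal([8, 128])                # activations, float32
    w = tf.Variable(tf.random.normal([128, 64]))  # master weights, float32

    # Cast to bfloat16 for the expensive matmul (what the TPU's matrix unit
    # runs natively), then cast back to float32 for the rest of the graph.
    y = tf.cast(tf.matmul(tf.cast(x, tf.bfloat16), tf.cast(w, tf.bfloat16)),
                tf.float32)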

------
fredguth
TF only? Any reason why, besides Google tech lock-in?

