
GradientFlow: Training ImageNet in 1.5 Minutes on 512 GPUs - gavinuhma
https://arxiv.org/abs/1902.06855
======
frankchn
The authors trained ResNet-50 in 7.3 minutes at 75.3% accuracy.

As a comparison, a Google TPUv3 pod with 1024 chips got to 75.2% accuracy with
ResNet-50 in 1.8 minutes, and 76.2% accuracy in 2.2 minutes with an optimizer
change and distributed batch normalization [1].

[1]: [https://arxiv.org/abs/1811.06992](https://arxiv.org/abs/1811.06992)

~~~
gpm
A TPUv3 pod is ~107 petaflops (Google's number from your paper). 512 Volta GPUs
are ~64 petaflops (Nvidia's number from [1]).

v3 pods don't seem to be publicly available. A 256-chip, 11.5-petaflop v2 pod
is $384 per hour, or $3.366 million per year. [2]

Meanwhile, Google Cloud Volta GPU prices (which are probably inflated relative
to building your own cluster, but are hopefully close enough to a reasonable
ballpark) are $1.736 per hour per GPU, which would be $7.791 million per year
for 512. [3]

Unless Google's GPU prices are really inflated, self-built clusters are
legitimately much cheaper than cloud GPUs, or these researchers did a poor job,
this seems like a good advertisement for TPUs.

[1] [https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/)

[2] Pod availability / performance / pricing information here:
[https://cloud.google.com/tpu/](https://cloud.google.com/tpu/)

[3] GPU pricing info:
[https://cloud.google.com/gpu/](https://cloud.google.com/gpu/)
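
For anyone checking the arithmetic, here is a quick sketch that reproduces
those yearly figures from the hourly prices above (assuming a 365.25-day year
and the 2019 list prices quoted in this comment):

    # Reproduce the yearly cost figures from the hourly prices quoted above.
    # Prices are 2019 list prices from this comment and may well be stale.
    HOURS_PER_YEAR = 24 * 365.25  # ~8766 hours

    tpu_v2_pod_hourly = 384.0   # 256-chip TPUv2 pod, $/hour
    gpu_hourly = 1.736          # one Volta GPU on Google Cloud, $/hour
    num_gpus = 512

    tpu_pod_per_year = tpu_v2_pod_hourly * HOURS_PER_YEAR
    gpu_cluster_per_year = gpu_hourly * HOURS_PER_YEAR * num_gpus

    print(f"TPUv2 pod (256 chips): ${tpu_pod_per_year / 1e6:.2f}M per year")     # ~$3.37M
    print(f"512 cloud Volta GPUs:  ${gpu_cluster_per_year / 1e6:.2f}M per year")  # ~$7.79M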

------
gambler
512 GPUs on a 56 Gbps network? I'd rather see researchers exploring potentially
more efficient alternatives to traditional neural nets, like XNOR-Nets, or
different architectures like ngForests or probabilistic and logistic circuits,
or maybe listen to Vapnik and invest in general statistical learning
efficiency.

~~~
ribalda
I have realised that the age of AI is just a movement of big corporations
towards a profitable monopoly.

They are teaching us how to solve problems with hardware instead of with
algorithms, and we, as individuals, will never have access to their computing
power.

~~~
nl
This is completely wrong.

I work in the field, and the way it works is like this, in this order:

1) Someone works out how to do something

2) Someone works out how to improve the accuracy

3) The accuracy maxes out

4) People improve training efficiency.

We see this over and over again. Take a look at the fast.ai results on ImageNet
training speed, for example.

------
nl
For all those complaining about the cost:

fast.ai trained ResNet-50 to 93% accuracy in 18 minutes for $48 [1], using the
same code, which can be run on your own GPU machine.

If you want to do it cheaper and faster, you can do the same in 9 minutes for
$12 on Google's (publicly available) TPUv2s.

This isn't a monopolization of AI, it is the opposite.

[1]
[https://dawn.cs.stanford.edu/benchmark/](https://dawn.cs.stanford.edu/benchmark/)
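
For the cost-minded, a minimal sketch of the effective hourly cluster cost
those two runs imply (using only the dollar and minute figures quoted above):

    # Effective hourly cluster cost implied by the figures above.
    def hourly_rate(total_cost_usd: float, minutes: float) -> float:
        """Dollars per hour of cluster time."""
        return total_cost_usd / (minutes / 60.0)

    print(hourly_rate(48, 18))  # 160.0 -> ~$160/hour for the 18-minute GPU run
    print(hourly_rate(12, 9))   # 80.0  -> ~$80/hour for the 9-minute TPUv2 run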

~~~
naveen99
Do you have a link to their code?

~~~
nl
Click the "source" link on the page to go directly to the tagged Git code that
got that performance.

~~~
naveen99
Thanks. Blew right past the source link first time through before your
comment.

------
bitL
Oh well, this is the death of democratic AI and an end of independent
researchers :-( There goes any hope of a single Titan RTX producing meaningful
commercial models.

~~~
caenorst
Sorry but even tho the big companies produce a lot of interesting research, I
challenge you to not find any interesting model trained on a single GPU from
recent publications (the majority coming from academia). Actually it's very
rare to find a paper where largely distributed training is necessary (i.e: the
training would fail or would be unreasonably too long). Yes having more money
help you to scale your experiments, it's nothing new and it's not something
specific to AI.

~~~
bitL
A trivial example is BERT_large: it won't fit into 12/16 GB and takes ~a year
to train from scratch on a single 15 TFLOPS machine. It's now a base model for
transfer learning in NLP.

~~~
caenorst
I'm not saying there are no very big models, just that they're a minority of
publications; for any trivial example of a big model I can show you 10x as many
trivial examples of relevant non-big models.

Also, you are talking about a model which is specifically designed for TPUs
(the dimensionality of the network is especially fine-tuned).

And even so, BERT_large still fits in the memory of a single GPU (with a very
small batch); there is a PyTorch implementation. I don't understand: are people
complaining that deep learning actually scales (reasonably)? Isn't that good
news?

~~~
bitL
You need to study the state of the art a bit more. You can't reproduce the
results BERT_large's authors achieved with TPUs on a Titan V/Tesla P100,
because getting there requires substantially larger batch sizes than will fit
into 12/16 GB. If you get a V100/Titan RTX, it would fit, but you'd wait ~1
year for a single training session (40 epochs) to finish.

MS has already published another model based on BERT that is even better. It's
unlikely that memory × #GPUs will go down in the foreseeable future; it's more
likely that everybody will train models as large as their infrastructure
allows, if they find something that improves their target metrics.
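
As a rough sanity check, here is what the "~1 year on a 15 TFLOPS machine"
estimate implies in total pretraining compute, assuming (optimistically, and
purely as my own assumption) that the machine sustains its full peak rate:

    # What "~1 year at a sustained 15 TFLOPS" implies in total training compute.
    # ASSUMPTION: the machine sustains its full 15 TFLOPS the whole time, which
    # is optimistic; real utilization would be lower and the wall time longer.
    sustained_flops = 15e12             # FLOP/s
    seconds_per_year = 365 * 24 * 3600  # 31,536,000 s
    total_flops = sustained_flops * seconds_per_year
    print(f"~{total_flops:.1e} FLOPs")  # ~4.7e+20 FLOPs of pretraining compute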

------
IshKebab
I thought we were optimising for cost now?

------
RhysU
Why? What deep problems have been solved? How will this make our children
better?

------
thro_awayz_days
Silicon is inferior to chemical energy. A human is upwards of thousands of
orders of magnitude more efficient than today's best GPUs. However, speed !=
efficiency. Classifying ImageNet with a human would take ~1000 hours.
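
For what it's worth, a figure around 1000 hours falls out of simple arithmetic
if you assume a human spends roughly 3 seconds per image over ImageNet-1k's
~1.28M training images (the per-image time is my assumption, not from the
comment):

    # Rough arithmetic behind "a human would take ~1000 hours".
    # ASSUMPTION: ~2.8 seconds per image; ImageNet-1k has ~1.28M training images.
    num_images = 1_281_167
    seconds_per_image = 2.8
    hours = num_images * seconds_per_image / 3600
    print(f"~{hours:.0f} hours")  # prints "~996 hours"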

~~~
the8472
On the other hand GPUs are made for computing and will crunch those numbers
for you until they die. If you want a human to classify things you have to
consider the lifetime cost of making said human and keeping it entertained.

They also need idle periods every day, during which they don't even shut down!

You can't power them with PV cells either, instead they rely on carbohydrates
produced via a horribly inefficient chemical photosynthesis process.

And if you intend to let your human classifier run for 8 hours a day you
better buy at least three of those for error correction.

And I must say this comparison is still quite lenient towards the humans, since
we're not even comparing them to purpose-made silicon entities but to
generalists.

~~~
trhway
The Matrix begs to differ.

