
Facebook Trains ImageNet in 1 Hour - jonbaer
https://news.developer.nvidia.com/facebook-trains-imagenet-in-1-hour/
======
tanilama
A practically interesting paper. Some insights:

1. Larger batches require larger learning rates. This paper shows that the
learning rate can even scale linearly with the batch size, which leads to
extremely large learning rates and batch sizes.

2. A large batch makes the initial phase of learning difficult, so the paper
proposes a warm-up period: during the initial epochs, the learning rate grows
gradually from a smaller value to the target value (a minimal sketch follows
below).
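
For intuition, here is a minimal sketch combining both ideas. The reference
values (base LR 0.1 at batch size 256, a 5-epoch warm-up) follow the paper's
recipe, but the function itself is only an illustration, not Facebook's code:

```python
# Hedged sketch of linear LR scaling plus gradual warmup. base_lr=0.1 at
# batch size 256 and a 5-epoch warmup follow the paper's recipe; the
# function itself is only an illustration, not Facebook's code.
def learning_rate(epoch, batch_size, base_lr=0.1, base_batch=256, warmup_epochs=5):
    target_lr = base_lr * batch_size / base_batch      # linear scaling rule
    if epoch < warmup_epochs:
        # ramp gradually from the small base LR up to the scaled target LR
        return base_lr + (target_lr - base_lr) * epoch / warmup_epochs
    return target_lr

# e.g. batch size 8192: starts at 0.1, reaches 3.2 after the 5-epoch warmup
print(learning_rate(0, 8192), learning_rate(5, 8192))   # 0.1 3.2
```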

But if you are not Google/Facebook/Amazon/Microsoft, this experimental setup
is unrealistic for you. Even the best AWS instances don't come with 50 Gbit
networking. For now, the rest of us will stick to at most 8 GPUs on a single
node, even if our souls scream for distributed :/

~~~
jamesblonde
I disagree that this is not an architecture for others. We are a research lab,
and we're building a distributed cluster. Cheap InfiniBand. Lots of GTX 1080
Tis. It's not that expensive to have a 40-GPU cluster with ten 4U servers
(about 100K Euro).

~~~
gm-conspiracy
What mobo/CPU are you using? How much RAM in each server?

Can you provide some more detailed specs?

Thanks!

~~~
jamesblonde
The CPU is not that important. A motherboard with PCIe 3.0 and 7 PCIe x16
slots is fine for at least 3 GPUs; each GPU typically takes two slots. Then a
>1400 W PSU. Here is a relatively cheap box that also has 8 disk slots (for
Hadoop):
[https://exxactcorp.com/index.php/solution/solu_detail/320](https://exxactcorp.com/index.php/solution/solu_detail/320)

~~~
Retric
Some people use water cooling setups to fit 7 GPUs on a 7-PCI-slot
motherboard. You need to cut off the DVI port that would otherwise occupy
another slot, but overall it's not that time consuming and it reduces your
networking needs.

EX:
[https://www.youtube.com/watch?v=9hsQmcSwGv0](https://www.youtube.com/watch?v=9hsQmcSwGv0)

------
Seanny123
tl;dr they found a clever way to spread the training across 256 GPUs by
synchronising the stochastic gradient descent

~~~
randyrand
This seems like the trivial, most obvious way to parallelize training across
GPUs. Not, IMO, clever.

The important bit here is that they've shown that large minibatch sizes can
still maintain accuracy if you adjust the learning rate appropriately.

~~~
opportune
Just because it's simple to explain at a high level doesn't make it trivial.

Plenty of theoretically trivial solutions to problems are absolute pains to
implement. I mean there are entire companies that at their core solve
relatively "trivial" problems but employ huge numbers of engineers. Just
because the core concept is simple to explain doesn't mean it's easy.

~~~
randyrand
Someone asked the other day how to parallelize GPU training. I had never
thought about the problem before, but I still came up with and gave this as
the most obvious approach.

[https://news.ycombinator.com/item?id=14510146](https://news.ycombinator.com/item?id=14510146)

------
antirez
Trivia: the Pieter in the paper is the one of Redis fame.

~~~
pietern
Hi Salvatore! :D

And there's even a tiny Redis dependency (optional though) in the code to
generate these results. In particular the collective communication library
needs a rendezvous phase where all nodes connect to their peers. Using Redis
for this is one of the options. See:
[https://github.com/facebookincubator/gloo/tree/master/gloo/r...](https://github.com/facebookincubator/gloo/tree/master/gloo/rendezvous)
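
Not gloo's actual API, just a toy sketch of the rendezvous idea (the key
names and polling loop are made up for illustration): each node publishes its
address under a shared prefix in Redis, then polls until every peer is
visible.

```python
import time
import redis

# Toy sketch of a Redis-backed rendezvous (not gloo's real API; key names
# and polling are illustrative only): each node announces its address, then
# waits until all world_size peers are visible so everyone can connect.
def rendezvous(rank, world_size, my_addr, host="localhost", port=6379):
    r = redis.Redis(host=host, port=port)
    r.set(f"rendezvous/{rank}", my_addr)          # announce ourselves
    peers = {}
    while len(peers) < world_size:
        for i in range(world_size):
            addr = r.get(f"rendezvous/{i}")
            if addr is not None:
                peers[i] = addr.decode()
        time.sleep(0.1)                           # poll until everyone shows up
    return peers
```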

~~~
antirez
Hey Pieter! Wow cool :-) Thanks for the info. See you soon!

------
jamesblonde
Some observations:

* for synchronous, model-based distributed training to scale linearly, the time required to broadcast the model must be much smaller than the time required for a worker (GPU) to process a batch

* it's not strictly synchronous training: when gradients are computed at a worker, they are transmitted to all workers, so the driver doesn't have to send the model to all 32 workers at the same time (8 GPUs per worker makes 256 GPUs in total)

* the batch size is extremely large (8192)

* it's a good network (50 Gbit Ethernet, albeit not InfiniBand)

So the relative amount of time spent training at each worker is much higher
than the time spent broadcasting the model (which is quite small, ~100 MB I
think) to the workers on each iteration. For larger models with smaller batch
sizes, this relationship would break down. The interesting contribution here
is that you can have massive batch sizes, and Facebook provides a heuristic
for adjusting the learning rate so that training converges with such massive
batch sizes.
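
A rough back-of-envelope version of that argument (the ~100 MB figure is from
above; the per-iteration compute time is inferred from the paper's headline
numbers and is only approximate):

```python
# Back-of-envelope for the communication-vs-compute argument above.
# ~100 MB of gradients and 50 Gbit Ethernet are from the thread/paper; the
# per-iteration compute time is inferred from "90 epochs of ImageNet in 1
# hour" and is approximate.
model_bytes = 100e6                        # ResNet-50, ~25M params * 4 bytes
link_bits_per_s = 50e9                     # 50 Gbit Ethernet between servers
transfer_s = model_bytes * 8 / link_bits_per_s
print(f"one full-model transfer: {transfer_s * 1e3:.0f} ms")        # ~16 ms

iterations = 90 * 1.28e6 / 8192            # 90 epochs, minibatch of 8192
compute_s = 3600 / iterations              # ~0.26 s per iteration in a 1h run
print(f"communication / compute: {transfer_s / compute_s:.0%}")     # a few percent
```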

~~~
pietern
Re: your second point, it is strictly synchronous, though since there are 8
GPUs per process (i.e. 1 process per machine) the gradient reduction is done
in 3 phases: first the gradients are reduced within each process, then across
processes/machines, and then broadcast back within each process (a toy sketch
follows).
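
A toy numpy sketch of those three phases (the 2-machine x 4-GPU shape is made
up; in the real system phase 2 is gloo's allreduce over the 50 Gbit network):

```python
import numpy as np

# Toy sketch of the 3-phase reduction described above, with plain numpy
# arrays standing in for per-GPU gradients (2 machines x 4 GPUs is made up).
n_machines, gpus_per_machine = 2, 4
grads = [[np.random.randn(10) for _ in range(gpus_per_machine)]
         for _ in range(n_machines)]

# Phase 1: reduce gradients across the GPUs inside each machine/process.
local_sums = [np.sum(g, axis=0) for g in grads]

# Phase 2: allreduce across machines (here a simple sum; in the real system
# this is gloo's allreduce over 50 Gbit Ethernet).
global_sum = np.sum(local_sums, axis=0)

# Phase 3: broadcast the fully reduced gradient back to every GPU.
reduced = [[global_sum.copy() for _ in range(gpus_per_machine)]
           for _ in range(n_machines)]
```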

~~~
jamesblonde
I misphrased that point, agreed. It's not classic driver-driven synchronous
training, as you would do in TensorFlow. It's using all-reduce (not available
in TensorFlow yet, I think).

------
quadrature
relevant paper from facebook
[https://research.fb.com/publications/ImageNet1kIn1h/](https://research.fb.com/publications/ImageNet1kIn1h/)

~~~
boulos
This is the URL that jonbaer originally submitted (sadly at an awkward time).
I meant to send an email about it (there's no "Really! I vouch for this!" for
low-point stories that languish), but I see it worked out anyway.

------
breatheoften
I'd always conceptualized decreasing batch size as a performance/memory
optimization to deal with the fact that datasets don't all fit into memory and
to reduce overall training time. You look at batch_size samples and compute
the sum of the gradient of the errors to update the network weights so as to
reduce the error -- shouldn't a larger batch_size inherently provide more
information about the optimal direction of the update?

It seems, to my naive view, like it should be "nice" from an accuracy
perspective to look at more samples before making an adjustment to the
network weights...?

In general, does changing the batch_size hyperparameter make a lot of
difference on different problems ...? Does the right value for batch size tend
to be problem specific and/or network architecture specific?

~~~
IanCal
> shouldn't a larger batch_size inherently provide more information about the
> optimal direction of the update?

Not necessarily, since the batch gradient output (as I understand it, and at
least how I used to code it) all gets averaged together.

Consider standing in a valley with two equal hills on either side of you. If
you were to try one direction and see that climbing that way helps, you'd
take a step that way. Then the next step would keep taking you up that hill.

Now, if you batched together two direction tests, what would happen? You'd
average together your left and right and end up moving _nowhere_. Having both
at the same time doesn't give you better information about how to move if you
only see the result after averaging.
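
A tiny numeric version of that picture (values made up):

```python
# Numeric version of the two-hills picture (values made up): two equal and
# opposite per-example gradients average to zero, so the batch update moves
# nowhere even though each direction alone would be informative.
g_left, g_right = -1.0, +1.0
batch_gradient = (g_left + g_right) / 2
print(batch_gradient)   # 0.0 -> no movement
```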

This interestingly maps to something we see in humans, though I'm struggling
to find a decent paper on it (from the PRISM lab in Birmingham, UK, if anyone
else has any luck; I think the person doing the research might have been
called Chris). Simple adaptation tasks, in this case learning to control a
joystick that has a clockwise/anticlockwise force applied to it, don't work
well if you try to learn both one thing and its opposite straight away.
However, sleeping in between learning each leaves you able to do both well.
Perhaps these were early results, though.

Batch tradeoffs:

[https://stats.stackexchange.com/questions/164876/tradeoff-ba...](https://stats.stackexchange.com/questions/164876/tradeoff-batch-size-vs-number-of-iterations-to-train-a-neural-network)

[https://arxiv.org/abs/1609.04836](https://arxiv.org/abs/1609.04836)

~~~
jononor
If this is the case, could one get improved learning by mixing large and small
batches?

------
qeternity
I can't seem to find it anywhere, but what is the interconnect being used
between servers? NVLink is used internally for GPU-to-GPU communication
within a single box... correct? But this sounds like it takes a cluster of 32
of their 8-GPU Big Basin boxes.

~~~
pietern
Interconnect between the servers is 50 Gbit Ethernet (see section 4 of the
paper).

------
EvgeniyZh
Here is another paper on large batches:
[https://arxiv.org/abs/1705.08741](https://arxiv.org/abs/1705.08741). It is
better written and has more information, IMHO.

------
rlv-dan
Does anyone know if there are any "consumer grade" image training kits out
there? I'm thinking of software that you can train on your own images to sort
them into categories.

~~~
spuz
Yes, you can use Tensorflow and Google's "inception" image recognition model
to do this. The model by default is trained on the Imagenet database of
images/categories, but Tensorflow allows you to retrain the last layer of the
model on your own images to produce your own categorisation. Since you are
only retraining the last layer of the model, you can easily do it within about
20 minutes on a laptop. See the tutorial here:
[https://www.tensorflow.org/tutorials/image_retraining](https://www.tensorflow.org/tutorials/image_retraining)
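
A rough modern tf.keras sketch of the same "retrain only the last layer" idea
(not the exact retrain.py script from the tutorial; the image size, class
count, and directory layout are assumptions):

```python
import tensorflow as tf

# Rough sketch of "retrain only the last layer" via tf.keras transfer
# learning (not the tutorial's retrain.py script; the class count and the
# my_photos/ directory layout are assumptions for illustration).
base = tf.keras.applications.InceptionV3(include_top=False,
                                         weights="imagenet", pooling="avg")
base.trainable = False                      # keep the ImageNet features frozen

num_classes = 5                             # e.g. your own 5 photo categories
model = tf.keras.Sequential([
    tf.keras.layers.Rescaling(1.0 / 127.5, offset=-1),  # Inception expects [-1, 1]
    base,
    tf.keras.layers.Dense(num_classes, activation="softmax"),  # new last layer
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])

# Assumes one sub-folder per category under my_photos/.
train_ds = tf.keras.utils.image_dataset_from_directory(
    "my_photos", image_size=(299, 299), batch_size=32)
model.fit(train_ds, epochs=5)
```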

------
personjerry
How much time would it have taken using Torch instead of Caffe2? (I still
don't understand which I should use...)

~~~
technics256
It depends on your application. For most general purposes Caffe2 will be
fine. For research and pushing the limits, PyTorch is your best bet.

------
horsecaptin
Looks like Nvidia is doing a proper content marketing push to compete against
AMD.

~~~
Eridrus
Interestingly enough I think the people who gain the most from this paper are
cloud providers. Very few orgs will buy 256 GPUs, but with linear scaling,
renting them makes a lot of sense.

~~~
shezi
There are "only" 32 GPUs in the cluster, with 8 workers per GPU.

~~~
dgacmu
They had 32 _servers_ in the cluster, each with 8 P100 GPUs. Each GPU was one
"worker" in their parlance.

("How to train ResNet-50 in one hour on two million dollars of hardware." :-)

