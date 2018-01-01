1. Larger batches require a larger learning rate. This paper shows that the learning rate can even scale linearly with the batch size, which leads to extremely large learning rates and batch sizes.
2. Larger batches make the initial phase of training difficult, so this paper proposes a warm-up period: during the first few epochs, the learning rate grows gradually from a small value to the large one.
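A minimal sketch of those two points, assuming a reference learning rate of 0.1 at batch size 256 and a 5-epoch linear warm-up. Those numbers are what I recall from the paper, so treat them (and the function names, which are mine) as assumptions:

```python
def scaled_lr(batch_size, base_lr=0.1, base_batch=256):
    """Linear scaling rule: a k-times larger batch gets a k-times larger rate."""
    return base_lr * batch_size / base_batch


def warmup_lr(epoch, batch_size, warmup_epochs=5, base_lr=0.1, base_batch=256):
    """Ramp linearly from base_lr up to the scaled rate, then hold it."""
    target = scaled_lr(batch_size, base_lr, base_batch)
    if epoch >= warmup_epochs:
        return target
    return base_lr + (target - base_lr) * epoch / warmup_epochs


if __name__ == "__main__":
    for epoch in range(7):
        print(epoch, round(warmup_lr(epoch, batch_size=8192), 3))
```

This only covers the warm-up and the plateau; any later decay schedule would be layered on top.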
But if you are not Google/Facebook/Amazon/Microsoft, the experimental setup is unrealistic for you. Even the best AWS instances don't come with a 50 Gbit network. For now, the rest of us will stick to at most 8 GPUs on a single node, even if your soul screams distributed :/
That's a consumer/gaming motherboard (ASRock SuperCarrier), which doesn't have enough PCIe bandwidth to support the cards, but part of what we're researching is ways to reduce synchronization bandwidth. I wouldn't recommend that route as a general approach, though: it's not flexible enough for future uses. The 8x 1080 Ti Supermicro build posted a few days ago is probably a better choice: https://news.ycombinator.com/item?id=14508928
The problem is that one student can easily tie up the entire cluster for half the duration of her Ph.D. Machine learning people have voracious appetites for compute. :)
The important bit here is that they've shown that large minibatch sizes can still maintain accuracy if you adjust the learning rate to match (scale it with the batch size, and ramp it up slowly at the start).
Plenty of theoretically trivial solutions to problems are absolute pains to implement. I mean there are entire companies that at their core solve relatively "trivial" problems but employ huge numbers of engineers. Just because the core concept is simple to explain doesn't mean it's easy.
And there's even a tiny Redis dependency (optional though) in the code to generate these results. In particular the collective communication library needs a rendezvous phase where all nodes connect to their peers. Using Redis for this is one of the options. See: https://github.com/facebookincubator/gloo/tree/master/gloo/r...
* for synchronous model-based distributed training to scale linearly, the time required to broadcast the model must be much smaller than the time required for a worker (GPU) to process a batch
* it's not strict synchronous training, as when gradients are computed at a worker, they are transmitted to all workers - so the driver doesn't have to send models to all 32 workers at the same time (8 GPUs per worker makes 256 GPUs in total).
* there are extremely large batch sizes (8192)
* it's a good network (50 Gb Ethernet, albeit not InfiniBand)
So the time each worker spends training on a batch is much higher than the time spent broadcasting the model (which is quite small, ~100 MB I think) to the workers on each iteration. For larger models with smaller batch sizes, this relationship would break down. The interesting contribution here is that you can have massive batch sizes, and Facebook provided a heuristic for adjusting the learning rate so that training still converges with them.
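A back-of-the-envelope version of that argument, as a rough sketch. It models communication as one model-sized transfer per iteration with no overlap with compute, which is cruder than a real allreduce, and the per-iteration compute times are made-up values for illustration:

```python
def transfer_seconds(model_mb, link_gbit):
    """Seconds to send model_mb megabytes over a link_gbit gigabit/s link."""
    return (model_mb * 8) / (link_gbit * 1000)


def comm_fraction(model_mb, link_gbit, compute_seconds):
    """Share of one iteration spent communicating, assuming no overlap."""
    t = transfer_seconds(model_mb, link_gbit)
    return t / (t + compute_seconds)


if __name__ == "__main__":
    gpus = 32 * 8  # 32 workers x 8 GPUs
    print("images per GPU for an 8k minibatch:", 8192 // gpus)
    print("transfer time (s):", transfer_seconds(100, 50))
    # Hypothetical compute times per iteration, for illustration only.
    for compute in (0.2, 0.02):
        print(compute, round(comm_fraction(100, 50, compute), 2))
```

Shrinking the compute time per iteration (smaller batches) or growing the model pushes the communication share up, which is the breakdown described above.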
It seems to my naive view like it should be "nice" from an accuracy perspective to look at more samples before making an adjustment to the network weights ...?
In general, does changing the batch_size hyperparameter make a lot of difference on different problems ...? Does the right value for batch size tend to be problem specific and/or network architecture specific?
Not necessarily, since the gradients in a batch (as I understand it, and at least as I used to code it) all get averaged together.
Consider standing in a valley with two equal hills either side of you. If you were to try one direction and see that climbing that way helps, you'd take a step that way. Then the next step would keep taking you up that hill.
Now, if you batched together two direction tests, what would happen? You'd average together your left and right and end up with moving nowhere. Having both at the same time doesn't give you better information about how you move if you only see the result after averaging.
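The two-hills picture in a few lines of toy code. This is only the analogy, not real SGD: the height function, the probe rule and the step size are all made up for illustration.

```python
def height(x):
    """Valley at x = 0 with equal hills on either side."""
    return abs(x)


def probe(x, direction, step=0.1):
    """Move suggested by one test: step that way if it climbs, else stay put."""
    gain = height(x + direction * step) - height(x)
    return direction * step if gain > 0 else 0.0


def sequential(directions, x=0.0):
    """Apply each test as soon as it comes in."""
    for d in directions:
        x += probe(x, d)
    return x


def batched(directions, x=0.0):
    """Run all tests from the same spot, then apply the averaged move."""
    moves = [probe(x, d) for d in directions]
    return x + sum(moves) / len(moves)


if __name__ == "__main__":
    print("one at a time:", sequential([-1, +1]))
    print("batched:", batched([-1, +1]))
```

Taken one at a time, the first test commits you to the left hill and the second no longer looks like an improvement from there. Batched, the two suggested moves cancel and you stay in the valley.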
This interestingly maps to something we see in humans, though I'm struggling to find a decent paper on it (it's from the PRISM lab in Birmingham, UK if anyone else has any luck; I think the person doing the research might have been called Chris). Simple adaptation tasks, in this case learning to control a joystick that has a clockwise/anticlockwise force applied to it, don't work well if you try to learn both one thing and its opposite straight away. However, sleeping in between learning each left you able to do both well. Perhaps these were early results, though.
Batch tradeoffs:
No, batches also help you escape local minima.
Your comment didn't make sense to me at first, but I think I get it now. Even if you were able to fit the entire dataset into memory, batches are still a good idea, because optimizing on the entire dataset is non-convex and will likely lead you into a local minimum. However, what is a local minimum for one batch may not be a local minimum for the next batch, which helps you escape.
This explains why optimization gets harder with very large batch sizes - the gradients for different batches become more similar (as they resemble the "global" gradient more closely), so you become more susceptible to local minima. I think this also explains why the learning rate scaling helps - it increases the variance across gradients, and helps you escape local minima.
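A toy illustration of the "gradients become more similar" part, assuming per-example gradients are just a true value plus independent Gaussian noise (a big simplification of a real network). The last printed column multiplies the spread by a linearly scaled learning rate to show how the noise in the actual step behaves under that assumption:

```python
import random
import statistics


def minibatch_gradient_std(batch_size, trials=500, seed=0):
    """Spread of the averaged gradient across many random minibatches."""
    rng = random.Random(seed)
    means = []
    for _ in range(trials):
        # Toy per-example gradients: true gradient 1.0 plus unit Gaussian noise.
        batch = [rng.gauss(1.0, 1.0) for _ in range(batch_size)]
        means.append(statistics.fmean(batch))
    return statistics.pstdev(means)


if __name__ == "__main__":
    for b in (1, 16, 256):
        std = minibatch_gradient_std(b)
        lr = 0.1 * b  # linear scaling, relative to batch size 1
        print(b, round(std, 3), round(lr * std, 3))
```

In this model the spread of the averaged gradient falls roughly as one over the square root of the batch size, so a learning rate that grows linearly with the batch makes the noise in each step larger, not smaller.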
I wonder if rather than computing a single gradient for a large batch you could simultaneously compute a gradient for the batch and for several subsets of the batch -- then pick or combine the gradient subset(s) that most differ from the full batch result. Not sure if that would work out to a computational efficiency gain.
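One way that idea could be sketched, purely as an assumption about what it might look like. It takes per-example gradients as plain lists (which real frameworks don't usually hand you cheaply), splits them into equal contiguous subsets, and picks the subset whose mean gradient is furthest from the full-batch mean. The function names are made up:

```python
def mean_vec(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def most_divergent_subset(per_example_grads, num_subsets):
    """Return (full-batch gradient, subset gradient furthest from it)."""
    full = mean_vec(per_example_grads)
    size = len(per_example_grads) // num_subsets
    best, best_dist = None, -1.0
    for i in range(num_subsets):
        chunk = per_example_grads[i * size:(i + 1) * size]
        g = mean_vec(chunk)
        # Squared Euclidean distance from the full-batch gradient.
        dist = sum((a - b) ** 2 for a, b in zip(g, full))
        if dist > best_dist:
            best, best_dist = g, dist
    return full, best


if __name__ == "__main__":
    grads = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0],
             [1.0, 0.0], [-2.0, 0.0], [-2.0, 0.0]]
    print(most_divergent_subset(grads, num_subsets=3))
```

Note that with only two equal subsets both sit at the same distance from the full-batch mean, so this needs at least three to say anything. Whether the extra bookkeeping pays for itself is exactly the open question.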
Are there any optimizers that dynamically scale the batch size up/down based on an online metric?
("How to train ResNet-50 in one hour on two million dollars of hardware." :-)
