
How AI Training Scales - giacaglia
https://blog.openai.com/science-of-ai/
======
taliesinb
This is exciting work.

It would be interesting to compute the gradient noise scale individually for
each weight/bias. I'm curious how it scales as a function of layer depth, for
example -- it could well be that different layers require very different
treatment in terms of learning rate. Supporting this, gradient magnitudes are
typically very different for different layers, even though they evolve
similarly over time. Here's something I just whipped up in Mathematica:

[https://imgur.com/a/HhCnlDL](https://imgur.com/a/HhCnlDL)
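
For anyone who'd rather poke at this in Python than Mathematica, here's a
rough sketch of the same measurement in PyTorch -- the tiny MLP and random
data are placeholders, not anything from the post:

    import torch
    import torch.nn as nn

    # Toy model and data, purely illustrative -- swap in a real task.
    model = nn.Sequential(
        nn.Linear(32, 64), nn.ReLU(),
        nn.Linear(64, 64), nn.ReLU(),
        nn.Linear(64, 10),
    )
    loss_fn = nn.CrossEntropyLoss()
    opt = torch.optim.SGD(model.parameters(), lr=0.1)

    for step in range(100):
        x = torch.randn(128, 32)
        y = torch.randint(0, 10, (128,))
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        if step % 20 == 0:
            # Per-parameter gradient norms: compare scales across layers
            # and watch how they evolve over training.
            for name, p in model.named_parameters():
                print(step, name, p.grad.norm().item())
        opt.step()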

~~~
samsamoa
Author here -- yes, this is a great question! I think there's a lot more
interesting work to be done here. It would be great if we could understand
e.g. why layer-wise learning rates help with large-batch training of ImageNet
([https://arxiv.org/abs/1708.03888](https://arxiv.org/abs/1708.03888)), and
maybe the per-component noise scale has something to do with it.
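
For readers who haven't seen that paper: LARS rescales each layer's step by a
"trust ratio" built from the layer's weight and gradient norms. A minimal
sketch of the rule as I read it (names are mine, and the paper's momentum
term is omitted):

    import torch

    def lars_local_lr(w: torch.Tensor, g: torch.Tensor,
                      eta: float = 1e-3, beta: float = 5e-4) -> float:
        """Layer-wise trust ratio from arXiv:1708.03888 (my reading):
        local lr = eta * ||w|| / (||g|| + beta * ||w||), so layers whose
        gradients are large relative to their weights take smaller steps."""
        w_norm, g_norm = w.norm().item(), g.norm().item()
        if w_norm == 0.0 or g_norm == 0.0:
            return 1.0  # fall back to the global rate for degenerate layers
        return eta * w_norm / (g_norm + beta * w_norm)

    # Each layer's update then becomes (beta doubling as weight decay):
    #   w <- w - global_lr * lars_local_lr(w, g) * (g + beta * w)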

------
yaroslavvb
This kind of scaling was rigorously shown for a related metric called
"gradient diversity" in
[https://arxiv.org/abs/1706.05699](https://arxiv.org/abs/1706.05699)
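
For reference, gradient diversity measures how far per-example gradients are
from all pointing the same way. A quick numpy sketch of the definition (the
function name and layout are mine):

    import numpy as np

    def gradient_diversity(per_example_grads: np.ndarray) -> float:
        """From arXiv:1706.05699: sum_i ||g_i||^2 / ||sum_i g_i||^2,
        where row i of `per_example_grads` is the flattened gradient
        of example i."""
        num = float((per_example_grads ** 2).sum())
        den = float(np.linalg.norm(per_example_grads.sum(axis=0)) ** 2)
        return num / den

Identical gradients give 1/n and mutually orthogonal ones give 1; if I recall
the paper correctly, batch sizes up to n times this quantity cost little in
convergence.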

~~~
samsamoa
One of the authors here. Thanks for the comment! Yes, we mention this work
and a number of others in the blog post and paper. This isn't the first (or
the last) paper on the topic, but I think we've clarified the large-batch
training situation significantly by connecting gradient noise directly to the
speed of training, and by measuring it systematically on a bunch of ML tasks
and characterizing its behavior.

~~~
TTPrograms
Similarly, I think the theory developed in the Three Factors paper predicts
this scaling law; it might be worth citing:
[https://arxiv.org/abs/1711.04623](https://arxiv.org/abs/1711.04623)
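
Roughly, that paper treats SGD as a discretization of a stochastic
differential equation whose noise term is controlled by the ratio of learning
rate to batch size -- in loose notation (my paraphrase, not the paper's exact
formulation):

    dw = -\nabla L(w)\,dt + \sqrt{\eta / B}\;\Sigma(w)^{1/2}\,dW

so the learning rate eta and the batch size B enter the continuous-time
dynamics only through eta/B, which is where a scaling relation between them
falls out.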

------
jamesblonde
It looks like "gradient noise scale" will be a useful hyperparameter for
distributed training. However, how do you efficiently compute it without
training a network? Do you need to first train the network to compute it to
find out the scaling factor for the network? Is there a more efficient way to
compute it from a subset of a large dataset?

~~~
jareddk
Author here -- thanks for your interest! So far we do not know of a way to
estimate the noise scale without any training. You can generally get a rough
picture of the noise scale from a small fraction of a training run, but the
noise scale tends to increase over time, so to get a fully accurate
measurement you do need to do at least one full training run. You also need
the hyperparameters (primarily the learning rate) to be reasonably well
chosen, but the choice of batch size isn’t important — and the batch size is
what you’re trying to determine.

We also show that it's possible to compute the noise scale as you train
(without any extra overhead), and this can be used to adjust the batch size
in real time, so that in theory you can get it right the first time and in a
way
that adapts over the training run. However, those experiments are still
preliminary (see appendix D of the paper).
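
For the curious, the Appendix A trick rests on the identity E[|G_B|^2] =
|G|^2 + tr(Sigma)/B, so measuring the squared gradient norm at two batch
sizes pins down both terms. A rough numpy sketch (variable names are mine;
as I understand it, the paper smooths the two estimates over training with
exponential moving averages and takes their ratio, since the per-step ratio
is biased):

    import numpy as np

    def noise_scale_terms(g_small, g_big, b_small, b_big):
        """Unbiased estimates of |G|^2 and tr(Sigma) from two gradient
        estimates -- e.g. one device's gradient (batch b_small) and the
        all-reduced gradient (batch b_big). Uses the identity
        E[|G_B|^2] = |G|^2 + tr(Sigma) / B."""
        gs2 = float(np.dot(g_small, g_small))  # |G_{b_small}|^2
        gb2 = float(np.dot(g_big, g_big))      # |G_{b_big}|^2
        g2 = (b_big * gb2 - b_small * gs2) / (b_big - b_small)
        trace = (gs2 - gb2) / (1.0 / b_small - 1.0 / b_big)
        return g2, trace

    # B_simple is then estimated as a ratio of smoothed estimates:
    #   b_simple ~= ema(trace) / ema(g2)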

~~~
samsamoa
Another author here -- I'll add to Jared's comment above that for long-running
experiments (like the ones our Dota team runs), it can be useful to track this
statistic in real time to see whether or not it would be useful to scale up
the experiment.

~~~
taliesinb
What's a cheap and unobtrusive way to estimate the B_simple version of the
noise scale in real time? Piggyback on Adam's moving mean and variance
estimates? Edit: I see that Appendix A has a method for the multi-device
training setting, but I'm thinking of single-device training.

------
sjroot
This was a great read made even better by the simple yet high-quality
visualizations. OpenAI does a great job of using UX to make their research
more accessible, and I'd love to see more researchers follow suit.

------
cracker_jacks
So this blog post posits that there is a growing upper bound on the effective
batch size. Is there a similar analysis of whether there is also a growing
lower bound? Or does training with batch size 1 work for all these tasks?

~~~
staticfloat
Batch size 1 should always work. The tradeoff is always "bigger batch
size/more efficient" versus "smaller batch size/more learning per sample". By
using larger batches you are able to reduce overhead, take better advantage
of memory caches, etc., which can give very large speedups. But mathematically
it can slow down your training convergence per sample (check out the plots
showing that smaller batches converge after seeing fewer training examples,
even though they take more optimizer steps). This is because when you average
over an entire batch, you throw away a little bit of information that could
otherwise be used to better fit your model. For intuition, imagine you are
doing homework and are forced to finish 1000 practice problems before being
allowed to look at the answers; if you had looked earlier, you might have
caught your mistake sooner and not gotten all 1000 wrong. However, it would
have taken you more time to get through those 1000 problems, because you
would have been looking up the answers and mentally adjusting your model
after each one.
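
To make that concrete, here's a toy numpy experiment (entirely my own
construction, not from the post): SGD on a noisy quadratic. Bigger batches
shave off some optimizer steps, but past a point they mostly just burn more
samples:

    import numpy as np

    def run_sgd(batch_size, lr=0.1, dim=10, noise=1.0,
                target=0.5, seed=0, max_steps=100_000):
        """SGD on the quadratic loss 0.5 * ||w||^2; averaging a batch
        shrinks the gradient noise by 1/sqrt(batch_size)."""
        rng = np.random.default_rng(seed)
        w = np.ones(dim)
        for step in range(1, max_steps + 1):
            grad = w + noise * rng.standard_normal(dim) / np.sqrt(batch_size)
            w -= lr * grad
            if 0.5 * np.dot(w, w) < target:
                return step, step * batch_size  # (steps, samples seen)
        return max_steps, max_steps * batch_size

    for b in (1, 8, 64):
        steps, samples = run_sgd(b)
        print(f"batch={b:3d}  steps={steps:6d}  samples={samples:8d}")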

~~~
tarlinian
I think you're confused about the benefits/drawbacks of increasing batch size.
The training progress per optimizer step increases with batch size (otherwise
the smallest batch would be always optimal). Increasing the batch size
improves the quality of a single optimizer step but does not scale linearly.
(i.e., larger batch training requires more samples to converge than small
batch training even though fewer steps are required). Depending on the problem
scale the computational efficiency of larger batches makes this tradeoff worth
it because even though you need to process more samples to converge, it will
take less wall clock time due to improved efficiency.
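
The paper behind the post quantifies exactly this. If S and E are the
optimizer steps and training examples needed to reach a fixed loss, and
S_min and E_min are their minima (approached at very large and very small
batch size respectively), the fitted tradeoff is

    \left(\frac{S}{S_{\min}} - 1\right)\left(\frac{E}{E_{\min}} - 1\right) = 1,
    \qquad B_{\mathrm{crit}} \equiv \frac{E_{\min}}{S_{\min}}

so running at the critical batch size B_crit costs about twice the minimum in
both steps and examples, and batches much beyond B_crit buy almost no further
reduction in steps.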

~~~
staticfloat
Yes, we're saying the same thing. With larger batch sizes the optimizer takes
longer to converge in terms of the number of samples pushed through the
model, but less wall-clock time, thanks to the increased efficiency.

