
A Bayesian Perspective on Generalization and Stochastic Gradient Descent - godelmachine
https://ai.google/research/pubs/pub46697
======
evrydayhustling
This result is an amazing twofer. Not only do they give a formal probabilistic
perspective on how deep nets fall into local minima, they also propose a way
to reduce two "voodoo parameters" - learning rate and batch size - to one, by
showing how to pick one given the other in a way that is optimal for expected
generalization.
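
If I read the paper right, the key quantity is the gradient-noise scale g ≈ εN/B (or εN/(B(1-m)) with momentum m), so once you fix the learning rate you can solve for the batch size that hits an empirically good noise level. A minimal sketch of that rearrangement (the target noise value below is a made-up placeholder, not a number from the paper):

```python
def batch_size_for_noise(learning_rate, train_size, target_noise, momentum=0.0):
    """Solve g ~ eps * N / (B * (1 - m)) for B.

    target_noise (g) would be found empirically; this just inverts the
    scaling rule so learning rate and batch size collapse to one knob.
    """
    return learning_rate * train_size / (target_noise * (1.0 - momentum))

# e.g. eps = 0.1, N = 50,000 training examples, target noise scale g = 10:
b = batch_size_for_noise(0.1, 50_000, 10)   # -> 500.0
```

The point is that B scales linearly with ε and N, which matches the common "linear scaling rule" folklore for large-batch training.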

~~~
mlthoughts2018
I suspect there is actually some connection between optimal batch size and the
proofs in this paper about how deep NNs will provably converge to zero
training loss with gradient descent ( _not_ stochastic gradient descent):

[https://arxiv.org/abs/1811.03804](https://arxiv.org/abs/1811.03804)

------
itissid
This development has parallels to tuning parameters in the Monte Carlo world:
with HMC+NUTS one does not have to worry about finding the
complete distribution, and you get warned when the distribution is
misbehaved[0].

HMC+NUTS not only explores the posterior better but can also help you
select things like step sizes, acceptance rates, and simulation time (see the
NUTS sampler[1]) much more easily.

[0] [http://mc-stan.org/misc/warnings.html#runtime-warnings](http://mc-stan.org/misc/warnings.html#runtime-warnings)

[1] [http://mc-stan.org/workshops/ASA2016/day-1.pdf](http://mc-stan.org/workshops/ASA2016/day-1.pdf)
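
To make the analogy concrete, here is a bare-bones 1-D HMC sampler on a standard normal target, stdlib only. This is plain HMC, not NUTS: NUTS additionally adapts the step size and picks the trajectory length automatically, which is exactly the tuning burden being referred to above.

```python
import math
import random

def hmc_sample(logp, grad, x0, n_samples, step_size=0.1, n_leapfrog=20, seed=0):
    """Minimal 1-D Hamiltonian Monte Carlo (sketch; NUTS would tune
    step_size and n_leapfrog for you instead of taking them as arguments)."""
    rng = random.Random(seed)
    x, samples = x0, []
    for _ in range(n_samples):
        p = rng.gauss(0.0, 1.0)                     # resample momentum
        x_new, p_new = x, p
        # Leapfrog integration of the Hamiltonian dynamics
        p_new += 0.5 * step_size * grad(x_new)
        for i in range(n_leapfrog):
            x_new += step_size * p_new
            if i < n_leapfrog - 1:
                p_new += step_size * grad(x_new)
        p_new += 0.5 * step_size * grad(x_new)
        # Metropolis accept/reject on the total energy H = -log p + p^2/2
        h_old = -logp(x) + 0.5 * p * p
        h_new = -logp(x_new) + 0.5 * p_new * p_new
        if rng.random() < math.exp(min(0.0, h_old - h_new)):
            x = x_new
        samples.append(x)
    return samples

# Standard normal target: log p(x) = -x^2/2 (up to a constant)
draws = hmc_sample(lambda x: -0.5 * x * x, lambda x: -x, x0=0.0, n_samples=2000)
```

Stan's warnings (divergences, max treedepth, etc.) are essentially automated diagnostics of when this kind of integrator is failing.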

------
joe_the_user
I am increasingly amazed by the use of the term "generalization" - _without
modifiers or contexts_ - in describing what current deep learning does.
Certainly, deep learning systems do generalize in certain ways when presented
with certain large datasets. But how can this just be "generic"? Is this just
"data you find in the world" without other considerations?

Shouldn't they be discussing "generalizing what qualities?" Wouldn't image
generalization be different from chat-script generalization, and so forth?

~~~
mlthoughts2018
Generalization in this context is a specific technical term: it means that,
after training on a training set, the fitted model continues to perform
well on previously unseen examples (a test set) drawn from the
same distribution (data-generating process) that Nature used to produce the
training set.

You can define other notions of generalization that try to be more than this,
like generalizing to multiple tasks or an algorithm capable of playing any
turn-based perfect information game, etc., but that is not usually what people
are talking about when they talk about the formal statistics concept of
generalization error.
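
A toy illustration of that narrow, formal sense of the term, using synthetic data and a closed-form least-squares fit (all names and numbers here are made up for the example):

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b, 1-D closed form."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return a, my - a * mx

def mse(model, xs, ys):
    a, b = model
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

rng = random.Random(0)

def make(n):
    """One fixed data-generating process: y = 2x + 1 plus Gaussian noise."""
    return [(x, 2 * x + 1 + rng.gauss(0, 0.1))
            for x in (rng.uniform(0, 1) for _ in range(n))]

train, test = make(200), make(100)       # same distribution, disjoint draws
model = fit_line(*zip(*train))
gen_error = mse(model, *zip(*test))      # "generalization error" = test error
```

The claim is only about new draws from the *same* process; nothing here says the fitted line would do well on data generated some other way.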

~~~
TeMPOraL
I think what 'joe_the_user is saying is one should specify what exactly is the
"distribution" that "Nature used to produce the training set". E.g. "images of
cat faces taken indoors and cropped to square" is a pretty specific set.

~~~
mlthoughts2018
This is almost always specified completely in research papers that
address this issue. I work professionally in computer vision and image
processing and have never once encountered a mainstream research paper where
the definition of the data-generating process under study was not fully and
unambiguously clear.

Can you point to examples where this was problematically under-specified?

~~~
sgt101
Of course you are right, but these are the parts of papers not read or
understood by journalists, CEOs, or the public. This is leading to
inappropriate and unsafe applications, and will lead to a backlash and damage.
We should be at least as clear as the medics: promising lab results may
deliver in the future, but until field tests we can't be sure.

Unfortunately, lab-to-field practice in ML is underspecified.

~~~
mlthoughts2018
Yes, but this specific technical research paper doesn’t suffer from these
issues, so the original comment seemed wrongly placed here.

------
letitgo12345
The openreview thread --
[https://openreview.net/forum?id=BJij4yg0Z](https://openreview.net/forum?id=BJij4yg0Z)

