
Mathematics of Deep Learning [pdf] - magoghm
https://arxiv.org/abs/1712.04741
======
msmith10101
Newton's method and linear algebra. High school math, basically.

~~~
santaclaus
> Newton's method

Where are you seeing Newton's method? I didn't think second order information
was available for typical systems in statistical machine learning.

~~~
make3
My understanding is that the issue is that the full Hessian of the loss is too
expensive to compute at each step relative to the speedup in learning it would
buy.

~~~
_0ffh
Yeah I think that's why quasi-Newton methods like BFGS have been developed.
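
A minimal sketch of the cost argument (my own toy example, not from the
paper): on a quadratic loss, a plain gradient step only needs a matrix-vector
product, while a full Newton step needs the n x n Hessian and an O(n^3) solve;
quasi-Newton methods like (L-)BFGS instead build a cheap Hessian approximation
from gradient differences.

    # Toy comparison on a quadratic loss 0.5*w'Aw - b'w (A plays the Hessian).
    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(0)
    n = 500
    A = rng.standard_normal((n, n))
    A = A @ A.T + n * np.eye(n)              # symmetric positive-definite Hessian
    b = rng.standard_normal(n)
    w = np.zeros(n)

    g = A @ w - b                            # gradient: one matrix-vector product
    w_gd = w - 1e-3 * g                      # gradient step: cheap
    w_newton = w - np.linalg.solve(A, g)     # Newton step: full O(n^3) Hessian solve

    # Quasi-Newton (here L-BFGS via scipy) only ever sees the gradient:
    loss = lambda x: 0.5 * x @ A @ x - b @ x
    res = minimize(loss, w, jac=lambda x: A @ x - b, method="L-BFGS-B")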

------
leecarraher
I think this is a work in progress? It seems to be in IEEE format, where pages
are gold, and one would almost never waste an entire page on references
(outside of surveys) and leave another mostly blank. I'll wait for the
peer-reviewed version.

------
banachtarski
This is a horribly titled paper, and I don't necessarily even agree with the
premise. Since when did we become convinced that deep learning achieves
"global optimality," as they put it?

~~~
nabla9
When the network is deep or big enough, local minima tend to have loss close
to the global optimum. There is theoretical and empirical evidence for this.
The global minimum usually corresponds to overfitting, so what is needed is
getting close enough.

In practice, algorithms like stochastic gradient descent have trouble
distinguishing between saddle points and local minima. There is a hypothesis
that in many (or most) cases, the "local minima" where algorithms get stuck
are actually saddle points with long plateaus.
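
A minimal illustration of that plateau effect (my own toy example, not from
the linked papers): plain gradient descent started near the saddle of
f(x, y) = x^2 - y^2 sees a near-zero gradient for hundreds of steps, looking
exactly like a local minimum, before it escapes along the negative-curvature
direction.

    import numpy as np

    def grad(p):
        x, y = p
        return np.array([2 * x, -2 * y])   # gradient of f(x, y) = x^2 - y^2

    p = np.array([1e-3, 1e-6])             # start very close to the saddle at the origin
    lr = 0.01
    for step in range(1000):
        p = p - lr * grad(p)
        if step % 200 == 0:
            print(step, p, np.linalg.norm(grad(p)))
    # The gradient norm stays tiny for roughly the first 400 steps (the "long
    # plateau"); only late in the run does |y| grow and the escape show up.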

~~~
viewtransform
This mathematical hand-waving plagued the field of genetic algorithms and its
variants in the '90s until the 'No free lunch' theorem came out of the Santa
Fe Institute. It essentially said (my take on it) that if you had no
information about the landscape you were searching, then you couldn't say much
about the algorithm you were selling.

[https://en.wikipedia.org/wiki/No_free_lunch_in_search_and_op...](https://en.wikipedia.org/wiki/No_free_lunch_in_search_and_optimization)
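
For concreteness, here is a tiny demonstration of that intuition (my own toy
example, not from the article): averaged over every possible objective on a
small finite domain, any two fixed, non-repeating query orders find equally
good values equally fast.

    # Average, over all 2^4 objectives f: {0,1,2,3} -> {0,1}, of the best
    # value seen after k queries, for two different fixed search orders.
    from itertools import product

    domain = 4
    orders = {"left_to_right": [0, 1, 2, 3], "right_to_left": [3, 2, 1, 0]}
    fns = list(product([0, 1], repeat=domain))

    for name, order in orders.items():
        totals = [0.0] * domain
        for f in fns:
            best = 0
            for k, x in enumerate(order):
                best = max(best, f[x])
                totals[k] += best
        print(name, [t / len(fns) for t in totals])
    # Both orders print identical averages (0.5, 0.75, 0.875, 0.9375): with no
    # information about the landscape, no query order does better than another.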

~~~
nabla9
You are using too general an argument.

We are talking about continuous multidimensional optimization. We have already
selected our bias and the subset of problems we want to solve. Now we have
figured out that in this domain there are theoretical reasons that explain why
gradient descent works so well across a large number of problems.

~~~
viewtransform
Agreed. NFL is general and we need to discuss a specific landscape.

Deep learning is continuous <nonlinear> multidimensional optimization - yes.

What defines the subset of problems? A general nonlinear mapping from R^n to
R^m with n > m? Or are you limiting it to image classification or speech
recognition, which would be a subset?

We have empirical evidence that deep learning works, but I'm not confident
that we have the mathematical tools to understand why.

~~~
aoeusnth1
[https://arxiv.org/abs/1710.05468](https://arxiv.org/abs/1710.05468) was an
interesting paper that came out recently.

It showed that large CNN models, which have far more capacity than the data
they are shown (and _could_ have memorized it), still tend to learn minima
that generalize very well.

See proposition 1:

(i) For any model class F whose model complexity is large enough to memorize
any dataset and which includes f∗ possibly at an arbitrarily sharp minimum,
there exists (A, Sm) such that the generalization gap is at most epsilon, and

(ii) For any dataset Sm, there exist arbitrarily unstable and arbitrarily non-
robust algorithms A such that the generalization gap of f_A(Sm) is at most
epsilon.
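
For readers who haven't seen the term: the generalization gap is the
difference between a model's risk on unseen data and its empirical risk on the
training set S_m; in practice it is estimated as test error minus training
error. A minimal sketch, with a hypothetical predict() standing in for the
trained model f_A(S_m):

    import numpy as np

    def generalization_gap(predict, X_train, y_train, X_test, y_test):
        # 0-1 loss; the test error serves as a proxy for the expected risk
        train_err = np.mean(predict(X_train) != y_train)
        test_err = np.mean(predict(X_test) != y_test)
        return test_err - train_err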

