
> "why does sgd even work lol"

I find this hand a little overplayed.

It depends on the degree of fidelity we demand of the answer and how deep we want to go in questioning the layers of answers. However, if one is happy with lolcats-level fidelity, which suffices in many cases, we do have a good enough understanding of SGD: change the parameters slightly in the direction that makes the system work a little bit better, rinse and repeat.
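
At that level of fidelity the whole procedure fits in a few lines. A minimal sketch in numpy (the toy objective, step size, and epoch count are arbitrary stand-ins):

    import numpy as np

    def sgd(loss_grad, theta, data, lr=0.01, epochs=10):
        for _ in range(epochs):
            np.random.shuffle(data)  # stochastic: visit examples in random order
            for x in data:
                theta -= lr * loss_grad(theta, x)  # small step "downhill"
        return theta

    # toy usage: minimize the average of (theta - x)^2 over the samples
    data = np.random.randn(100) + 3.0
    grad = lambda theta, x: 2.0 * (theta - x)
    print(sgd(grad, theta=0.0, data=data))  # ends up near 3.0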

No one would be astonished that such a procedure leads to parameter settings that are better than one's starting point, or at least not significantly worse.

It's only when we ask more, and deeper, questions that we get to "we do not understand why SGD works so astonishingly well".




Yeah, I didn't mean to imply that "why does SGD result in lower training loss than the initial weights" is an open question. But I don't think even lolcatz would call that a sufficient explanation. After all, if the only criterion is "improves on the initial training loss", you could just try random weights and pick the best one. The non-convexity already makes SGD pretty mysterious, and that is without even getting into the generalization performance, which seems to imply that SGD is somehow implicitly regularizing.
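
For contrast, that strawman baseline is one loop. A sketch, where `loss`, the sampling scale, and the try budget are all placeholders:

    import numpy as np

    def random_search(loss, dim, tries=1000, scale=1.0):
        # blindly sample weight vectors and keep the best one; this also
        # "improves on the initial training loss", yet explains nothing
        # and is hopeless in high dimensions
        best = scale * np.random.randn(dim)
        for _ in range(tries):
            cand = scale * np.random.randn(dim)
            if loss(cand) < loss(best):
                best = cand
        return best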


With over-parameterized neural networks, the problem essentially becomes convex, and even linear [1], and in many contexts training provably converges to a global minimum [2], [3].
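
Concretely, the result in [1] says that in the infinite-width limit the trained network stays close to its first-order Taylor expansion around the initialization θ₀:

    f(x; θ) ≈ f(x; θ₀) + ∇_θ f(x; θ₀) · (θ − θ₀)

The right-hand side is linear in θ, so any loss that is convex in the network output is convex in the parameters.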

The question then becomes: why does this generalize [4], given that the classical theory of Vapnik and others [5] becomes vacuous, no longer guaranteeing a lack of over-fitting?

This is less well understood, although there is recent theoretical work here too.

[1] Lee et al. (2019). Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent. https://proceedings.neurips.cc/paper/2019/hash/0d1a9651497a3...

[2] Allen-Zhu et al. (2019). A Convergence Theory for Deep Learning via Over-Parameterization. https://proceedings.mlr.press/v97/allen-zhu19a.html

[3] Du et al. (2019). Gradient Descent Finds Global Minima of Deep Neural Networks. http://proceedings.mlr.press/v97/du19c.html

[4] Zhang et al. (2016). Understanding Deep Learning Requires Rethinking Generalization. https://arxiv.org/abs/1611.03530

[5] Vapnik (1999). The Nature of Statistical Learning Theory. Springer.


I don't disagree, except perhaps with the lolcatz's demand for rigour. "Improve with small, simple steps till you can't" is not a bad idea, after all.

BTW, your randomized algorithm is, with a minor tweak, surprisingly (unbelievably) effective: randomize the weights of the hidden layers, then do gradient descent on just the final layer. Note that the loss is even convex in the last-layer weights if a matching/canonical activation function is used. In fact you don't even have to try different random choices, though of course that would help. The "random kitchen sinks" line of results is a more recent heir to this line of work.
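
A sketch of that tweak in numpy (the ReLU, the width, and the ridge penalty are arbitrary choices; I use squared loss here, where convexity in the last-layer weights is immediate and the fit even has a closed form):

    import numpy as np

    def random_feature_fit(X, y, width=500, reg=1e-3, seed=0):
        # hidden layer: random, frozen weights -- never trained
        rng = np.random.default_rng(seed)
        W = rng.standard_normal((X.shape[1], width))
        b = rng.standard_normal(width)
        H = np.maximum(X @ W + b, 0.0)  # random ReLU features, shape (n, width)
        # last layer: ridge regression, a convex problem solved in closed form
        w = np.linalg.solve(H.T @ H + reg * np.eye(width), H.T @ y)
        return W, b, w

    def rf_predict(X, W, b, w):
        return np.maximum(X @ W + b, 0.0) @ w

    # usage: W, b, w = random_feature_fit(X_train, y_train)
    #        yhat = rf_predict(X_test, W, b, w)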

I suspect you already know this, and also that the noise in SGD does indeed regularize; the way it does so for convex functions has been well understood since the 70s. So I am leaving this tidbit for others who are new to this area.


Why are there so few local minima, you mean?

I think it'd have to be related to the huge number of dimensions it works in. But I have no idea how I'd even begin to prove that.


It's not even certain that there are few. What's rather unsettling is that, with SGD's purely local moves, the parameters settle on a good enough local minimum, in spite of the fact that we know many local minima exist with zero or near-zero training loss. There are glimmers of insight here and there, but the thing is yet to be fully understood.



