
Yeah, I didn't mean to imply that "why does SGD result in lower training loss than the initial weights" is an open question. But I don't think even lolcatz would call that a sufficient explanation. After all, if the only criterion is "improves on the initial training loss", you could just try random weights and pick the best one. The non-convexity already makes SGD pretty mysterious, and that is before even getting into generalization performance, which seems to imply that SGD is somehow implicitly regularizing.
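To make that concrete, here is a minimal sketch of the pure guess-and-check baseline (the toy two-layer network and data are made up for illustration). It "improves on the initial training loss" by construction, which is exactly why that criterion explains nothing about SGD:

    # Random-weight search: sample parameter vectors, keep the one with the
    # lowest training loss. It trivially beats the initial weights, yet says
    # nothing about why SGD works so well in practice.
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 10))           # toy inputs (made up)
    y = np.sin(X @ rng.normal(size=10))      # toy regression targets

    def loss(W1, w2):
        h = np.tanh(X @ W1)                  # one hidden layer, tanh activation
        return np.mean((h @ w2 - y) ** 2)    # squared error on the training set

    best = np.inf
    for _ in range(1000):                    # pure guess-and-check, no gradients
        W1 = rng.normal(size=(10, 50))
        w2 = rng.normal(size=50) / np.sqrt(50)
        best = min(best, loss(W1, w2))
    print("best random-search training loss:", best)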



With over-parameterized neural networks, the training problem essentially becomes convex, even linear in the parameters [1], and in many settings gradient descent provably converges to a global minimum [2], [3].
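Roughly, the "linear" behaviour in [1] comes from a first-order Taylor expansion of the network around its random initialization, which becomes exact as the width grows:

    f(x; \theta) \approx f(x; \theta_0) + \nabla_\theta f(x; \theta_0)^\top (\theta - \theta_0)

In that regime, gradient descent on the weights behaves like training a linear model in the fixed features \nabla_\theta f(x; \theta_0) (the neural tangent kernel features), so with a convex loss the training problem in \theta is convex.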

The question then becomes: why does this generalize [4], given that the classical theory of Vapnik and others [5] becomes vacuous in this regime and no longer rules out over-fitting?

This is less well understood, although there is recent theoretical work here too.

[1] Lee et al. (2019). Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent. https://proceedings.neurips.cc/paper/2019/hash/0d1a9651497a3...

[2] Allen-Zhu et al. (2019). A Convergence Theory for Deep Learning via Over-Parameterization. https://proceedings.mlr.press/v97/allen-zhu19a.html

[3] Du et al. (2019). Gradient Descent Finds Global Minima of Deep Neural Networks. http://proceedings.mlr.press/v97/du19c.html

[4] Zhang et al. (2016). Understanding Deep Learning Requires Rethinking Generalization. https://arxiv.org/abs/1611.03530

[5] Vapnik (1999). The Nature of Statistical Learning Theory. Springer.


I don't disagree, except perhaps with lolcatz's demand for rigour. "Improve with small and simple steps until you can't" is not a bad idea, after all.

BTW, your randomized algorithm with a minor tweak is surprisingly (almost unbelievably) effective: randomize the weights of the hidden layers, then do gradient descent on just the final layer. Note that the loss is even convex in the last-layer weights if a matching/canonical activation function is used. In fact, you don't even have to try different random choices, although of course that would help. The random kitchen sinks line of results is a more recent heir to this line of work.
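Here is a minimal sketch of that recipe with random Fourier features (toy data made up for illustration). Only the output layer is fit, via a closed-form ridge solve, since the squared loss is convex, in fact quadratic, in the last-layer weights:

    # Random hidden layer ("random kitchen sink" style): hidden weights are
    # drawn once at random and frozen; only the linear output layer is trained.
    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(500, 10))              # toy inputs (made up)
    y = np.sin(X @ rng.normal(size=10))         # toy regression targets

    W = rng.normal(size=(10, 300))              # frozen random hidden weights
    b = rng.uniform(0, 2 * np.pi, size=300)     # frozen random biases
    H = np.cos(X @ W + b)                       # random features of the inputs

    # The last-layer problem is a quadratic in w_out, so a ridge-regularized
    # least-squares solve finds its global optimum in closed form.
    lam = 1e-2
    w_out = np.linalg.solve(H.T @ H + lam * np.eye(300), H.T @ y)
    print("training MSE:", np.mean((H @ w_out - y) ** 2))

Swapping the closed-form solve for (stochastic) gradient descent on the same last-layer weights would also reach the global optimum of this convex problem, just more slowly.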

I suspect you already know this, and also that the noise in SGD does indeed regularize, and that the way it does so for convex functions has been well understood since the 70s, so I am leaving this tidbit for others who are new to the area.



