Hacker News new | comments | ask | show | jobs | submit login

In addition to "We need a cluster for deep learning", the second most popular mostly untrue thing I hear is "We have no idea how neural networks learn".

However, there are many papers that explore various ways to make a network learn, and they keep improving on performance, suggesting they're on to something. There are also many papers that discuss possible theoretical implications of experimental results.

But what does Knowing Why The Network Works mean exactly? "It works because universal approximation and gradient descent", but that's not a very satisfactory answer. "It works because it starts at a general solution and, over the course of many iterations, takes many small steps in an ever changing direction defined by a gradient approximation generated by looking at the difference between an average error and a target output (which should trend towards 0)".

What would a satisfactory "why" even look like exactly? As in, what form might it take compared to some other scientific discipline where we do know what's going on?

Personally, I think the whole thing is a red herring -- people in the field have some idea of how neural nets work, and there are many disciplines considered by many to be mature sciences that are far from settled on a grand theoretical scale.

That said, the theory I'm most interested in is recent attempts to connect a memory module to neural networks so they can "learn" to store important/complex/distributed information that can be recalled with high accuracy later. That will make it easier to do things like ask a neural network to remember your name, or where you left your keys, or whatever.

> What would a satisfactory "why" even look like exactly? As in, what form might it take compared to some other scientific discipline where we do know what's going on?

To me this is obvious: a proof is an adequate answer. All of the explanations given in this area are heuristic reasons. They're nice, but we can't know if they're correct or if we're being fooled by our intuition without a proof.

The why is largely why does gradient descent converge to a good answer instead of getting stuck in a local minima.

Because our intuition about local minima is wrong in extremely high dimensional spaces. In two and three dimensions, local minima are common. In a million dimensions, local minima are rare. The intuitive explanation is that for a local minimum to exist, the function must be curving up (first derivative = 0, second derivative >= 0) simultaneously in every dimension. It makes sense that as you add more dimensions this becomes less and less likely, and for a million dimensions it's vanishingly unlikely.

What you get instead of local minima are saddle points, where some dimensions are curving up and some are curving down. Saddle points can also be problematic for optimization but they can be dealt with using fancy optimization techniques. For a more rigorous explanation, see http://arxiv.org/abs/1406.2572

This intuition is not correct. For a random point in a high-dimensional space--yes--I totally agree it's highly unlikely all the eigenvalues will have the same sign (as is needed for a local minimum). The problem with Surya's claim is that the set needs to be restricted to just the critical points (where the derivative is zero). It's very hard to describe the spectral properties generally for this set, and there has been results for only 2 and 3 dimensions as far as I'm aware. Maybe the measure is the same, at least for Deep Nets, and thus a critical point is as unlikely to be a local minima as a random point is. But no work has shown that. Furthermore, we know the number of critical points to be exponential for broad classes of functions so the subset cannot be ignored.

ELI5 version: One can ask, what's the probability that a random person drawn from Earth's population (points in high dimensions) owns a Bugatti (is a local minimum)? It's very small, obviously. But that doesn't tell us anything about the probability of Bugatti ownership among select subsets of people (critical points).

Wouldn't it still be likely to settle on a local minima when the deciding factors for the existence of a local minima are limited to those that contribute to the likelihood of a single output category, and not whether all functions are curving up?

An example I can think of would be an absurd million input neural network, where one of the inputs only has a pronounced effect on one of the outputs. It seems like it would be possible for the path of the input to output to be dragged downhill in the context of all outputs, but uphill in the context of the single output it affects.

Is what I've described not likely, or am I just completely off base?

An interesting question. A million input neural network isn't necessarily absurd. Images are very high dimensional. One could easily imagine a one megapixel input to a neural net. But in natural images no single pixel is indicative of any single image characteristic by itself. I think this isn't a coincidence but a common characteristic of "natural" high dimensional data, on which neural nets tend to work well. So yes, I'd say what you've described is not likely for a large category of "natural" high dimensional data which probably includes most of the data we care about in the real world.

By that argument gradient descent should work for any classifier with a sufficiently high number of independent parameters.

Do you have any good examples of where it is likely to fail?

My thinking is that if this line of reasoning is true it suggests that we may as well be using kernel methods or other feature generation systems with high dimensionality with better algebraic properties that be might be able to train quicker or better interpret. My understanding is that those other techniques haven't been especially competitive with deep neural nets on large data sets these days, so that implies that the high dimension argument for gradient descent is insufficient to explain the success of deep neural nets.

Take any smooth function f and consider the multi-dimensional g(x,y,z...) = f(x) + f(y) + f(z) + ... If f had m local minima, then the n-dimensional version of g has m^n local minima.

You are correct in that there will be m^n minima, but since your surface is highly symmetric, local minima are trivially easy to find. Are you simplifying the statement by skipping over the importance of intervals?

If the intervals spanned by x,y,z do not overlap, then your statement does not hold unless local minima are evenly distributed over the entire region, which defeats the purpose of searching for minima in the first place. If the intervals spanned by x,y,z are identical, then gradient descent can be reduced to just searching over f(x) by symmetry. You are right that there are m^n minima, but they are found in (nm)^1 steps. If the intervals spanned by x,y,z partially overlap, you can further reduce the search space to just f(x) and the union of the intervals.

I don't see how your example shows that gradient descent is likely to fail for any classifier with a sufficiently high number of independent parameters. Am I missing something?

I was just tossing out an example of a higher dimensional function and counting the local minima. The reasoning that critical points tend to be saddle points, not minima, seems irrelevant if critical points are increasing exponentially for a typical higher dimension function.

This is a very cool insight for the uninitiated. Thanks.

Mainly because in high-dimensional space saddle points are much more common and many local minima have very close fitness. Look for example http://arxiv.org/abs/1405.4604 and following papers.

Because the solution space is convex if you've chosen your representation well.

This is definitely not the case for general neural networks, though.

If gradient descent is working reliably, the problem is convex. See the sibling comments for the intuition for large dimensional spaces.

"The problem is convex" and "the algorithm is unlikely to get stuck in a local minimum on realistic problems" are very different things.

Right. Hence the word 'reliably'.

> What would a satisfactory "why" even look like exactly?

Before deep learning happened, neural networks used to be popular for regression and interpolation problems. For a long time, our understanding of how they worked wasn't much better than our current understanding of how deep learning works.

In 1994, Radford Neal showed [1] that weight-decay neural networks were in fact an approximation to Bayesian inference on a Gaussian process prior over the space of possible functions. Amongst other benefits, this allowed better approximations to be utilised, and essentially (along with the advent of SVMs) marked the end the first neural network era.

Something like that is what I would consider a satisfactory "why"

[1] http://www.cs.toronto.edu/~radford/pin.abstract.html

But what does Knowing Why The Network Works mean exactly?

I would say it means: Why do these algorithms seem to be doing "well" despite their well-known theoretical intractability? Is it something about the instances? Or is it simply a matter of scale (so the local optima they are getting stuck in are not as obvious)? Or is there something "deep" we do not understand yet about these algorithms (or the theory)?

Either way, there is something going on here for sure.

Heck, I want that for the neural net in my head.

Applications are open for YC Summer 2019

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact