Why deep learning works even though it shouldn’t (moultano.wordpress.com)
356 points by r4um on Oct 20, 2020 | 142 comments

Setting aside the primary subject, this is an excellent observation:

> What I find however is that there are a base of unspoken intuitions that underlie expert understanding of a field, that are never directly stated in the literature, because they can’t be easily proved with the rigor that the literature demands. And as a result, the insights exist only in conversation and subtext, which make them inaccessible to the casual reader.

This is very true.

This is why conferences are important. A lot of knowledge is sociological in nature. Failures and tricks of the trade are discussed at the bar after 5. I've experienced this first hand.

Academic publications and lectures, in their (rightful) pursuit of rigor, aren't usually the right space for conversations around hunches and experiences. Reputations are at stake, and most people are doing impression management. When experienced readers read a journal publication, they read it with the implicit understanding that it is a highly curated view of the messiness behind the scenes.

It seems toxic to me that there's no accepted public venue for that stuff, though. The fact that people are too afraid to relay certain useful information until they're tipsy; the emphasis on "doing impression management".

Maybe it should remain separate from the rigorous stuff, but where's the "Op-Ed section" of academic publishing?

I feel the word "toxic" is maybe a bit too pejorative for what it is? Another framing is that it is "guild knowledge". The incentives in academia are complex and it affects how open some folks are or can afford to be with such knowledge.

That said, there are certain open avenues for making such knowledge public. MathOverflow is one. Some academics document their guild knowledge in "technical papers" which they put up on their website. The ML community (which I'm not part of, but that I'm able to observe as an outsider) seems to be particularly open when it comes to publishing blog posts -- sometimes to gain reputational points?

In some journals, arguments over publications are carried out in the Letters to the Editor section. Sometimes this leads to public feuds however, and some academic communities are small enough that if you make too many enemies your publications may be visited upon with disfavor when it comes time for peer review. It's not worth getting into public tiffs unless there's a principle at stake.

> The incentives in academia are complex and it affects how open some folks are or can afford to be with such knowledge.

> Sometimes this leads to public feuds however, and some academic communities are small enough that if you make too many enemies your publications may be visited upon with disfavor when it comes time for peer review. It's not worth getting into public tiffs unless there's a principle at stake.

This is more the kind of thing I was using the word "toxic" to describe. Of course I know this is a widespread and deep-seated problem and not one that could be fixed overnight; I was just commenting on it.

That's why publishing and presenting outside the formal academic realm is important. Presentations that are one level of indirection away from the original creators often provide much better intuition, and the presenter isn't afraid to share that they don't know certain things, or they present content in a more creative, funny way.

Personally I've wanted a 1-page IEEE publication for a while that accepts smaller contributions, where people can share these kinds of insights. Just a "we tried this, this is what happened" or "we were not able to repeat this" or "we found this interesting, but we need more data".

You could argue it's kind of like a long abstract, but a long abstract really indicates you intend to probe it further, but in actual fact you might just want to indicate that there might be something there for somebody else.

People have been floating the idea of doing this for the life sciences too, where the cost of performing an experiment can be very high (animals, enzymes, local permitting and legal considerations etc. on top of the man hours). It would be very useful to see the results of an experiment you want to do (or something resembling it) that someone ran but didn't follow up on for whatever reason, and it would definitely accelerate research. It would probably also help with the weird incentive structure and secrecy issues academia is pointlessly built around.

When I worked in crypto 20 years ago there was a journal like this which took papers that were only a few paragraphs long for quick results. I think it was called Electronics Letters.

I'm not sure that would be a net benefit over research blogs or something like a technical report directly published by a lab/working group to be quite honest. In a few domains I've recently looked at these notes to the editor / commentary sections seem to only be pseudo-reviewed and I'd say the likelihood of an IEEE rubberstamped one pager on perpetual motion would be non-zero.

> I'm not sure that would be a net benefit over research blogs or something like a technical report directly published by a lab/working group to be quite honest.

I think this is kind of the problem, half of this stuff sits on a webpage somewhere completely unread and not really reference-able.

> I'd say the likelihood of an IEEE rubberstamped one pager on perpetual motion would be non-zero.

I would hope that each single page would be reviewed with the same integrity as a six pager. Of course, it's not impossible crap leaks through to any conference/journal.

It’s the stuff you 60-80% believe but you don’t want to put your name on it because once you do people assume you believe it 99% or more. That’s unfortunately how publishing works.

There is a venue for that stuff and it's in blog posts. The main problem however is that for a hype topic like this there are so many blog posts that it's hard to find the gems with valuable insights between all of the badly regurgitated common knowledge.

I'm curious which category you put this one in. :D

Definitely not the latter ;) I think it's too new of an article to already qualify as a gem, but it's certainly made me think about a few things with a new perspective, so there is that :)

All of this stuff already exists. By design it excludes people who would likely disparage the lack of rigor. And sometimes other people get caught in that net.

That's a good way. It's called knowing your audience. No one wants to be savaged on HN or Reddit by a bunch of people.

Require people to publish their code.

Good tricks or intuitions don't stay hidden for long. Everything good is quickly published, can easily be found in high quality implementations, and is discussed on GH issues, Pytorch forums, r/MachineLearning, or Twitter.

That just kicks the can down the road a step, though: that just makes the folklore knowing where to find the good GitHub issues.

The point is, the experts do freely share this knowledge. Making it accessible to "casual reader" would be low on my priority list.

Other fields have their popular message boards. It's not just programmers.

What are the popular message boards for practicing physicists and mathematicians?

The Mathematics Stack Exchange has consistently surprised me (in a good way) in their helpfulness and rigor.

There are blogs

But, then again, lots of academics will outright ignore blogs - if they're ever brought up in "serious" discussion.

> Reputations are at stake, and most people are doing impression management

What's the saying about academia? The politics are so vicious because stakes are so small? It's pretty ironic because the stakes are larger for a lot of decisions made by engineers in industry, but we seem content with swapping best practices.

While most people think this "knowledge" should be organized and even shared, I strongly disagree. For context, I have worked in large research labs, ML engineering organizations and startups and have encountered many people across the engineer and research spectrum.

These intuitions are often wrong and arise due to the lack of vocabulary in correctly describing the mechanisms that occur.

From a researcher's standpoint, learning these is counterproductive if the goal is to study and understand the underlying mechanisms from first principles.

From a beginner engineer perspective, these intuitions may be effective "functional truths" but there's the danger of perceiving these handwavy intuitions as truths. This leads to inflexibility in light of empirical evidence that contradicts these intuitions and even worse - not debugging enough since the pattern seems to match roughly the intuition. The latter results in flawed institutional knowledge being accrued over time. An engineer might say: "Engineer X tried Y and it didn't work because of ML intuition Z, since this is a related problem, we should not prioritize Y due to the precedent."

I think it's much better for a beginner engineer to learn the methods from first principles and develop an appreciation for them. They can then learn the distinction between what's true and the intuitive language people use to describe a phenomenon they don't completely understand but can pattern match to. This will help them avoid the mistakes made by people who rely on this intuitive language too much, mistaking it for ML theory.

This is a very good point. Especially in ML, there have been many cases where very smart people were wrong with their intuitions (the original explanation of batchnorm comes to mind), or maybe their intuitions were correct, but attempts to explain the intuition (especially when there's a race to publish) led to wrong conclusions. I still think intuitions should be discussed and shared, but with a clearly stated caveat like: "that's just my guess, we don't know what's really happening there". This is how I usually explain to others (and to myself) my experimental results.

> I think it's much better for a beginner engineer to learn the methods from first principles and develop an appreciation for them. They can then learn the distinction between what's true and the intuitive language people use to describe a phenomenon they don't completely understand but can pattern match to.

This seems super meta, because I am a beginner engineer and your comment is totally pattern matching to distillation, but for humans instead of models. Maybe beginners should start with those “functional” (read: distilled) truths and train from there :)

If you want to go more meta, I can point out that your comment is pattern matching to training models and applying them to humans vs. thinking from first principles and realizing that humans and models learn differently. That's exactly the issue with applying functional truths as truths.

This is true in a lot of other fields. This is also the reason why experienced people make a lot more money than inexperienced ones. They know where the rakes are buried under the leaves.

It is NOT TRUE that it's NOT SHARED, especially in learning. Just browse Quora on any deep-learning-related field: it's a bunch of guys sharing their (more or less relevant and exact) hunches. Most ML popularization articles focus on that too.

Not sure it's true for the other fields tho.

There is another reason why training deep neural networks is not as difficult as it sounds: the landscape of the loss function seems to be made of broad "U"-shaped valleys that gently descend towards a small loss region. At initialization, the network is likely close to such a valley, and once it gets there the rest of training is just a leisurely stroll.

Formally, people have studied the spectrum of the Hessian and found that most of its eigenvalues are quite small, with only a few much larger ones. It all started with [1], with several recent extensions.

[1] https://arxiv.org/pdf/1611.07476.pdf
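To make "mostly tiny eigenvalues plus a few large ones" concrete, here is a minimal sketch. It is not the experiment from [1]: as a stand-in it uses an overparameterized linear least-squares model, where the Hessian can be written down exactly. The sizes and random data are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_params = 20, 200            # far more parameters than data points
X = rng.normal(size=(n_samples, n_params))

# For the loss L(w) = ||X w - y||^2 / (2 n), the Hessian is X^T X / n,
# independent of w and y.
hessian = X.T @ X / n_samples
eigvals = np.linalg.eigvalsh(hessian)    # returned in ascending order

near_zero = int(np.sum(eigvals < 1e-8))
print("near-zero eigenvalues:", near_zero, "of", n_params)
print("largest eigenvalue:", eigvals[-1])
```

In this toy case the flat directions come purely from having more parameters than samples; whether the same mechanism explains the spectra seen in deep networks is a separate question.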

If anything for large models I thought the idea is everything is a saddle point. Your link looks at a relatively small dense network.

My intuition is this:

(1) decision surfaces are always linearly separable with enough dimensions

(2) NNs have enough dimensions

(3) NNs' linear boundaries are coarse

(4) coarse boundaries in high dimensions are likely to approximate the low-loss true boundary (ie., given 1).

The idea behind (4) is just the linear regression idea: by (1) noise is gaussian and a straight line is a good approximation. With a coarse line, we do not fit to noise, and hence probably have a good approximation.

The phrase "neural network" disguises the obviousness of this reasoning: a NN is just high-dimensional piece-wise linear regression.

The only thing to be explained is why, in high dimensions, datasets end up nearly piece-wise linear.

That isn't so hard to explain.

As dimensionality increases the likelihood of a local minimum decreases because there is almost always a dimension where the curvature is in the other direction.

Doesn't the loss function landscape depend a lot on what you're trying to get the neural network to learn (what problem you're trying to solve)?

Of course, but people seem to generally find it’s “egg-carton shaped” rather than “mountain range shaped”.

This is just a bunch of simulations, not a derivation.

True, but it has been confirmed experimentally several times. Unfortunately theoretical understanding in deep learning is quite lagging behind practical progress.

This paper [1] seems to be getting closer to what you are looking for.

[1] https://arxiv.org/abs/1910.02875

Explain in layman's terms?

In late 90s/early 2000s the mainstream thought around numerical optimization was that it was easy-ish when it was a linear problem, and if you had to rely on nonlinear optimization you were basically lost. People did EM (an earlier subgenre of what is now called Bayesian learning) but knew that it was sensitive to initialization and that they probably didn't hit a good enough maximum. Late 90s neural networks were basically a parlor trick - you could make it do little tricks but almost everything we have now including lots of compute, good initialization, regularization techniques, and pretraining, was absent in the late 90s.

Then in the mid and later 2000s the mainstream method was convex optimization and you had a proof that there was one global optimum and a wide range of optimization methods were guaranteed to reach it from most initialization points. Simultaneously, the theory underlying SVMs and CRFs was developed - that you could actually do a large variety of things and still use these easy, dependable optimization techniques. And people hammered home the need for regularization techniques.

In the late 2000s to early 2010s, several things again came together - one being the discovery of DropOut as a regularization technique - and the understanding that it was one, the other being the development of good initializers that made it possible to use deeper networks. Add to that improved compute power - including the development of CUDA which started out as a way to speed up texture computation but then led to general purpose GPU computing as we know it today. All this enabled a rediscovery of NN learning which could take off where linear learning methods (SVMs, CRFs) had plateaued before. And often you had a DNN that did what the linear classifier before did but could learn features in addition to that - and could be seen as finding a solution that was strictly better.

But the lack of global optimum means that - even with good initializers and regularization packaged into the NN modules we use in modern DNN software implementations - the whole thing is way more finicky than CRFs ever were. (It would be wrong to say that CRFs are trivial to implement or never finicky at all, just as many well-understood NN architectures have a good out-of-the-box experience with TF/PyTorch etc. - so take this as a general statement that may not hold for all cases).

Deep learning is a form of optimization. Optimization involves moving along a high-dimensional surface to find the lowest point. In principle this can be nearly impossible because the surface might be covered in dramatic peaks/valleys/saddles obscuring the route to the lowest point. Some simulations have implied that this is not what the surfaces corresponding to deep networks look like, and that they instead look like a big gentle slope down to the minimum, with only small bumps along the way.

Just like in life, the imagined obstacles are scarier than the real ones.

Picture linear regression. If you have a bunch of data points you want to fit a line to, for any given line you can add up the vertical distance between all your data points and the line and come up with a measure of how inaccurate the line is. This is called your "loss". If you keep trying different values for the slope and intercept, you would find that this function is shaped like a big bowl, or valley. Regression is the process of repeatedly taking a step downhill until you can't go anywhere but up, and that must be the optimal line.
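A minimal sketch of that "keep stepping downhill" picture, under a couple of assumptions: it uses squared vertical distance as the loss (the usual choice, and the one that gives a smooth bowl), and the data, learning rate, and step count are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=100)
y = 3.0 * x + 0.5 + rng.normal(scale=0.1, size=100)   # made-up data

slope, intercept = 0.0, 0.0
learning_rate = 0.1
for _ in range(2000):
    residual = slope * x + intercept - y
    # Gradient of the mean squared error with respect to each parameter.
    grad_slope = 2.0 * np.mean(residual * x)
    grad_intercept = 2.0 * np.mean(residual)
    slope -= learning_rate * grad_slope
    intercept -= learning_rate * grad_intercept

closed_form = np.polyfit(x, y, deg=1)   # [slope, intercept] from least squares
print(slope, intercept, closed_form)
```

Because the bowl has a single bottom, the downhill walk and the closed-form least-squares answer should agree.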

Neural networks train in a similar way. You have a "loss" function that adds up how wrong your predictions are compared to the training data. You try different values for the weights in your neural network to see which ones send you downhill the fastest, step them in that direction, and repeat. Since the loss function in this case is more complex, it's not a single valley, but potentially many valleys. You can end up at a decent local minimum.

...hmm, that was counter to my understanding (limited though it may be...) which was partially formed by this paper: https://arxiv.org/abs/1712.09913

TLDR - loss landscapes are nasty, but you can tame them with skip connections.

These two papers are not necessarily contradicting each other, but perhaps my description was a bit sloppy.

Sagun et al. (and derivative works) only focus on the Hessian on the trajectory followed by gradient descent, while Li et al. give a broader look at the loss surface as a whole.

I don't think we can say for sure that early stopping is the main reason deep networks generalize. Double descent [1] shows that models continue to improve even once they've "interpolated" the training data (fit every point perfectly), and critical periods [2] suggest that the early part of training is responsible for most of the generalization performance even though much of the numerical improvement happens later.

Overall it looks like gradient descent is a strong regularizer -- we know it tends to prefer small and low-variance weights, for example. So part of deep generalization has to do with how SGD is able to pick "good" features early, and then optimization pushes the unimportant weights to zero later (hence lottery tickets).

[1] https://openai.com/blog/deep-double-descent/ and other papers. [2] https://arxiv.org/abs/1711.08856 and others.
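One place where "gradient descent prefers small weights" can be checked exactly is plain linear least squares with more unknowns than equations. This is only a linear analogue, not a claim about deep nets: started from zero, gradient descent should land on the interpolating solution with the smallest norm. The sizes, step size, and iteration count below are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 20))     # 5 equations, 20 unknowns: many exact fits
y = rng.normal(size=5)

w = np.zeros(20)                              # start at the origin
step = 1.0 / np.linalg.norm(X, ord=2) ** 2    # 1 / (largest singular value)^2
for _ in range(5000):
    w -= step * X.T @ (X @ w - y)             # gradient of 0.5 * ||X w - y||^2

w_min_norm = np.linalg.pinv(X) @ y            # minimum-norm interpolating solution
print(np.linalg.norm(w - w_min_norm))
```

Nothing in the loss asks for a small norm here; the preference comes entirely from the starting point and the update rule.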

I think you misunderstood the point of deep double descent. The x-axis is not number of training epochs, it is model capacity.

I think you'd be interested in https://arxiv.org/abs/1611.03530. It discusses how SGD is an implicit regularizer. We also actually want high variance weights for symmetry breaking.

To add some more context, here's a rather readable summary:


Certainly a much shorter way to say this: if you have enough lines you can approximate any curve within a margin. This is what large neural networks are doing.
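The "enough lines" claim is easy to check on a toy curve. The sketch below is not a neural network: it just connects evenly spaced points on sin(x) with straight segments via `np.interp` and measures the worst-case gap. The target function and segment counts are arbitrary choices.

```python
import numpy as np

def max_error(num_segments):
    """Largest gap between sin(x) and its piecewise-linear interpolant."""
    knots = np.linspace(0.0, 2.0 * np.pi, num_segments + 1)
    dense = np.linspace(0.0, 2.0 * np.pi, 10_001)
    approx = np.interp(dense, knots, np.sin(knots))   # straight lines between knots
    return float(np.max(np.abs(approx - np.sin(dense))))

errors = {n: max_error(n) for n in (4, 16, 64)}
print(errors)
```

The error shrinks as segments are added, which is the approximation half of the story; it says nothing yet about generalizing from noisy samples.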

Another way to look at it: most neural nets are just a bunch of polynomials stitched together. You can see this from the popularity of the ReLU activation function. When the ReLU's input is negative, that poly is always zero in that area. When positive, it's some poly multiplied by a const, which is another poly.

For nets that use other activation fns, they try to be linear in the area of most active input. So again they approximate a const * a poly.

> if you have enough lines you can approximate any curve within a margin. This is what large neural networks are doing.

No, this is actually exactly the opposite of what the article is saying.

"If you have enough lines you can approximate any curve within a margin" is usually a bad thing because your approximations lose meaning as the number of lines increase.

The surprising thing about machine learning is that increasing the "number of lines", if you will, increases the meaning, too, and there are some weird and subtle properties of mathematics in higher dimensions that make this work.

You are missing the point. The surprising thing about deep learning is that it can generalize to unseen data so well. Polynomial regression cannot.

It doesn't necessarily seem that surprising to me.

If I can be really hand-wavy about it -- A lot of deep neural networks achieve their nonlinearity in a very constrained way. They stack linear models on top of each other, and the nonlinearity comes from using a relatively simple nonlinear function such as logistic or tanh to scale the models' outputs before feeding them into the next one. (Without that step, you'd just have a linear combination of linear functions, which would itself be linear.)
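The parenthetical is quick to verify numerically. A minimal sketch with made-up layer sizes and random weights (bias terms left out for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(8, 4))
W2 = rng.normal(size=(3, 8))
x = rng.normal(size=4)

stacked_linear = W2 @ (W1 @ x)          # two linear layers, no activation
single_linear = (W2 @ W1) @ x           # one equivalent linear layer
with_tanh = W2 @ np.tanh(W1 @ x)        # nonlinearity between the layers

print(np.allclose(stacked_linear, single_linear))
print(np.allclose(with_tanh, stacked_linear))
```

Without the activation the two layers collapse into the single matrix `W2 @ W1`; with tanh in between they no longer do.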

That's a pretty constrained form of nonlinearity compared to polynomial regression, which tries to directly fit some high-order polynomial. I don't have anything like the math chops to prove this, but I believe that means that the neural network is going to tend to favor a relatively smoother decision boundary, whereas polynomial regression is a naturally high variance sort of affair.

I agree this is one of the reasons for the success of neural networks. But it was not obvious at first, and it still is quite hard to formalize and explain in mathematical terms. That's what I meant with "surprising".

No, it's not constrained at all. In fact, even single hidden layer networks with nearly arbitrary activation functions are universal approximators (Hornik et al. 1989). Polynomials are also universal approximators (Weierstrass).

I'm not trying to say that neural networks are inherently constrained. I'm saying that, in typical usage, they tend to be used a certain way that I believe introduces some useful constraints. You can use a single hidden layer and an arbitrary activation function, but, in practice, it's a heck of a lot more common to use multiple hidden layers and tanh.

It's worth noting that neural networks didn't take off with Hornik et al. style simple-topology-complex-activation-function universal approximators. They took off a decade or so later, with LeCun-style complex-topology-simple-activation-function networks.

That arguably suggests that the paper is of more theoretical than practical interest. It's also worth noting that one of the practical challenges with a single hidden layer and a complex activation function is that it's susceptible to variance. Just like polynomial regression.

This kind of stuff is called inductive bias and is a sexy topic nowadays.

That's a strong unsubstantiated claim. On the other hand, there's been some nice theoretical work arguing that deep learning is a form of polynomial regression.


They are both universal approximators. So are support vector machines, gaussian processes and gradient boosted trees. Yet the performance of neural networks is unrivaled in certain tasks, as has been proven over and over again.

As a whole, that paper is quite bad (and still unpublished, probably blocked by peer review) because (1) it only considers fully connected networks (which are a minority of models used nowadays) and (2) the experimental validation was done on tasks where neural networks are not very strong.

Show me examples of competitive polynomial regression models in language translation, image segmentation and Go playing and I will be convinced.

To be fair, most of the machine learning literature had no or a poor excuse for peer review. And many deep learning layers can be described by dense layers. For example, convolution.
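For the convolution remark, here is a small 1-D sketch: a sliding-window "convolution" (strictly a cross-correlation, as in most deep learning libraries) written out as one dense matrix whose rows are shifted copies of the kernel. The input and kernel values are made up.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
kernel = np.array([1.0, 0.0, -1.0])

# Build a dense weight matrix whose rows are shifted copies of the kernel.
n_out = len(x) - len(kernel) + 1
dense = np.zeros((n_out, len(x)))
for i in range(n_out):
    dense[i, i:i + len(kernel)] = kernel

as_dense_layer = dense @ x
as_convolution = np.correlate(x, kernel, mode="valid")  # sliding dot product
print(as_dense_layer, as_convolution)
```

The dense matrix is mostly zeros with tied weights, so the equivalence holds for the function computed, not for the parameter count.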

You almost certainly can frame Go playing as polynomial regression but there would undoubtedly be numerical precision & other gradient issues. Deep learning as a practice is remarkably effective, no disagreement there.

Let us know when polynomial regression succeeds at any machine learning task.

A lot of people publish results that say deep learning is "just" something else, where the something else doesn't work.

How about 80% accuracy on CIFAR-10 with unsupervised training?

=> logistic regression + Kmeans


How's that different from regular interpolation?

If there is a pattern in data, such as it fitting on a curve, and you approximate that curve, then that should generalize to unseen data. What's surprising about that? A single polynomial regression wouldn't be able to do it, because some curves cannot be expressed as a polynomial, but superposition of multiple polynomials is apparently good enough.

> then that should generalize to unseen data.

It's not guaranteed at all. Overcomplicated models will "overfit" the training data and generalize very poorly.

> some curves cannot be expressed as a polynomial

You can approximate any (continuous and blah blah) curve arbitrarily well with polynomials, e.g. Taylor expansions when the function is analytic.

In fact, polynomials are one of the most common examples to demonstrate overfitting. See figure 2 on wikipedia (https://en.wikipedia.org/wiki/Overfitting)
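A small version of that textbook demonstration, as a sketch under made-up assumptions: the "true" relationship is a straight line, there are ten noisy training points, and a degree-9 polynomial (one coefficient per point) is compared against a degree-1 fit.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def truth(x):
    return 2.0 * x + 1.0                      # made-up ground truth

x_train = np.linspace(-1.0, 1.0, 10)
y_train = truth(x_train) + rng.normal(scale=0.3, size=10)
x_test = np.linspace(-1.0, 1.0, 500)

def mse(model, x, y):
    return float(np.mean((model(x) - y) ** 2))

low = Polynomial.fit(x_train, y_train, deg=1)
high = Polynomial.fit(x_train, y_train, deg=9)   # as many coefficients as points

print("train:", mse(low, x_train, y_train), mse(high, x_train, y_train))
print("test: ", mse(low, x_test, truth(x_test)), mse(high, x_test, truth(x_test)))
```

The high-degree fit passes through every noisy training point and pays for it between them, which is the behaviour the thread is contrasting with large neural networks.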

If that worked, people would use it, but it doesn't, so they don't.

Perhaps you mean: "With four parameters I can fit an elephant, and with five I can make him wiggle his trunk."?


Side note: You can fit an elephant with just 1 parameter. See https://twitter.com/exobenelson/status/1001473539789213697 and the paper linked therein.

> if you have enough lines you can approximate any curve within a margin

How do you avoid overfitting the training data?

You're right - a LOT of work goes into this (eg GANs) because these large nets just start "remembering" the training data.

This article is about optimization (finding good parameters), not the approximation power of neural networks (which is well-known through the universal approximation theorem).

I know. What we're both saying is: if you have enough lines, you can find the params easily.

ReLU networks are piecewise linear.
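A quick check of the piecewise-linear claim on a toy network, assuming a single hidden ReLU layer with a scalar input and random made-up weights: between two neighbouring "kinks" the output should be an exact straight line.

```python
import numpy as np

rng = np.random.default_rng(0)
w1, b1 = rng.normal(size=6), rng.normal(size=6)   # hidden layer, scalar input
w2, b2 = rng.normal(size=6), rng.normal()         # output layer

def net(x):
    hidden = np.maximum(0.0, np.outer(x, w1) + b1)   # ReLU
    return hidden @ w2 + b2

# Each hidden unit switches on or off at x = -b1 / w1; those are the only kinks.
kinks = np.sort(-b1 / w1)

# Between two neighbouring kinks the network should be exactly a straight line,
# so the midpoint value equals the average of the endpoint values.
left, right = kinks[2], kinks[3]
mid = 0.5 * (left + right)
f_left, f_mid, f_right = net(np.array([left, mid, right]))
print(f_mid, 0.5 * (f_left + f_right))
```

Deeper ReLU stacks stay piecewise linear too, just with many more pieces; this sketch only checks the one-hidden-layer case.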

Have a look at some published nn architectures - you'll typically see a few nonlinear units/transforms and a LOT of ReLU ones.

Now that's fascinating.

I'd thought of machine learning as a form of optimization. Things like support vector machines really were hill climbing for some kind of local optimum point. But at a billion dimensions, you're doing something else entirely. I once went through Andrew Ng's old machine learning course on video, and he was definitely doing optimization.

The last time I actually had to do numerical optimization using gradients, I was trying to solve nonlinear differential equations for a physics engine in about a 20-dimensional space of joint angles that was very "stiff". That is, some dimensions might be many orders of magnitude steeper than another. It's like walking on a narrow zigzagging mountain ridge without falling off.

So deep learning is not at all like either of those. Hm.

I loved this post, thanks for writing it. I get the argument why one shouldn't expect local minima in very high dimensions. But then, what's wrong with the informal argument that there has to be a minimum because, well, the expected loss cannot be negative?

With squared loss, where it's easy for the loss to be zero, then yes, it will have lots of global minima, all at a loss of zero. Losses that asymptote, like logloss, may have no minima.
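The logloss case can be seen on a tiny example. This sketch assumes a one-parameter logistic model on made-up, perfectly separable data: scaling the weight up always lowers the loss, so the loss approaches zero without ever reaching a minimum.

```python
import numpy as np

# Made-up, perfectly separable 1-D data with labels in {-1, +1}.
x = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([-1.0, -1.0, 1.0, 1.0])

def log_loss(w):
    margins = y * w * x
    return float(np.mean(np.logaddexp(0.0, -margins)))   # log(1 + exp(-margin))

weights = [1.0, 2.0, 4.0, 8.0, 16.0]
losses = [log_loss(w) for w in weights]
print(losses)
```

In practice weight decay or early stopping is what keeps the weights from growing without bound in this situation.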

Thanks. I guess my worry was that, once you are doing extremely well and your loss is very low, gradients are no longer independent, and will tend to go mostly up. Is this wrong?

I’m an ML newb but I think this would be true only of a converged model. Your model could always technically diverge in another epoch if learning rate is high enough and you process a batch of extreme outliers
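The learning-rate half of that is easy to see even on the simplest possible bowl. A minimal sketch on f(x) = x**2, with made-up learning rates and no batches or outliers involved:

```python
def run_gradient_descent(learning_rate, steps=50, start=1.0):
    """Minimise f(x) = x**2, whose gradient is 2 * x."""
    x = start
    for _ in range(steps):
        x -= learning_rate * 2.0 * x
    return x

print(run_gradient_descent(0.1))   # shrinks toward the minimum at 0
print(run_gradient_descent(1.1))   # each step overshoots and grows
```

With the larger rate every update jumps past the minimum and lands farther away than it started, so even a nearly converged point gets thrown back out.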

Even then, you may have still converged on a local optimum which was the take away I got from the article

I liked that part at the beginning where the author made clear they were going to discuss intuitions that, while they aren't proven, would be useful to make explicit for a more general audience. Good candor.

Hey @moultano, in response to your argument about walls and nets not being in a minimum: it's my understanding that nets always live on high dimensional saddle points, and that's commonly referred to in the literature. Even when you're optimizing you're just moving towards ever lower cost saddles that are closer to the optimum but almost never a local optimum (for the reasons spelled out in your post).

Thank you. Several people have pointed that out, and I'm probably not reading the right papers. Is it common when people introduce a new flavor of adaptive SGD to address how it handles saddles specifically? It is probably just a matter of what manages to bubble up to me rather than what work is actually getting done, but I felt like the non-convergence of ADAM got talked about a lot, but haven't seen people talking as much about how optimizers behave differently on the landscapes we actually observe.

Saddles are a way of conceptualizing high dimensional optimization problems. If you have a 3 dimensional surface you can imagine a saddle as an isocurve that follows a minimum in at least one dimension.

Another way to conceptualize these is to think of being at the minimum of a parabola in 2 dimensions, but then seeing you're not at a minimum in a 3rd dimension. Any time you're at a minimum in at least 1 dimension but not in all of them, you're on a saddle.
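The smallest concrete case of this is the textbook saddle f(x, y) = x**2 - y**2, used here purely as an illustration: the gradient vanishes at the origin, yet the curvature is positive along one axis and negative along the other.

```python
import numpy as np

def f(x, y):
    return x ** 2 - y ** 2

# Gradient (2x, -2y) vanishes at the origin; the Hessian there is constant.
hessian = np.array([[2.0, 0.0],
                    [0.0, -2.0]])
curvatures = np.linalg.eigvalsh(hessian)
print(curvatures)                      # one negative and one positive curvature
print(f(0.1, 0.0) > f(0.0, 0.0))       # uphill along x
print(f(0.0, 0.1) < f(0.0, 0.0))       # downhill along y
```

A point is a true local minimum only when every eigenvalue of the Hessian is non-negative; a single negative one leaves a way down.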

You can extend this concept to a neural net which lives in millions of dimensions, undergoing SGD. When beginning an optimization run SGD moves in some direction to minimize a bundled cost, inevitably stumbling into minima in (usually) many dimensions. Subsequent iterations will shift some dimensions out of minima and other dimensions into minima; the net is always living on a saddle during this process.

There are many papers that discuss the process in these terms and others that implicitly use it. I wouldn't say it's a "hot area of research" but more of a tool for thinking about these processes and sometimes gaining some insight into why things get stuck during training.

This paper makes the point that it's the saddles and not local minima that are the problem: https://arxiv.org/abs/1406.2572 It was the basis for adding 'momentum' to optimizers - so that you could skate across the saddles.

It looks like the article was deleted ("Oops! That page can’t be found"). Here is the Google Cache: https://webcache.googleusercontent.com/search?q=cache:5HMZ_Z...

> High dimensional spaces are unlikely to have local optima, and probably don’t have any optima at all.

Can someone who knows more about DL than I do help me understand this a little better?

The article uses the analogy of walls:

> Just recall what is necessary for a set of parameters to be at a optimum. All the gradients need to be zero, and the hessian needs to be positive semidefinite. In other words, you need to be surrounded by walls. In 4 dimensions, you can walk through walls. GPT3 has 175 billion parameters. In 175 billion dimensions, walls are so far beneath your notice that if you observe them at all it is like God looking down upon individual protons.

I'm struggling to understand what this really means in 4+ dimensions. But when I try to envision it going from 1 or 2 to 3 dimensions, it doesn't seem obvious at all that a 3D space should have fewer local optima than a 2D space.

In fact, having a "universal function" like a deep network seems like it should have more local optima. What am I missing?

The “physicist” explanation that I heard (meaning non-rigorous but good for building intuition) is that at every point where the derivative vanishes, for suitably random functions, every direction you move in will either be a direction where you increase or decrease at about 50-50 odds. In D dimensions there are 2D independent directions to rise or fall in (e.g. in D=2 dimensions, there’s north south east west), so there’s about 1/2^(2D) odds that any given critical point is actually a minimum if those probabilities are independent. That gets small really fast at large D.

Obviously this is not rigorous, though.
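
For anyone who wants to see how fast that shrinks, here is a small sketch of the coin-flip heuristic exactly as stated above. It assumes 2D independent 50-50 directions per critical point, which is the comment's simplification and not a property of real loss surfaces; the function names are made up for illustration.

```python
import random


def heuristic_min_probability(d: int) -> float:
    """Odds that all 2*d independent 50-50 directions go uphill."""
    return 0.5 ** (2 * d)


def simulate_min_fraction(d: int, trials: int, seed: int = 0) -> float:
    """Monte Carlo version: flip 2*d fair coins per simulated critical point."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        if all(rng.random() < 0.5 for _ in range(2 * d)):
            hits += 1
    return hits / trials


if __name__ == "__main__":
    for d in (1, 2, 3, 5, 10):
        print(d, heuristic_min_probability(d), simulate_min_fraction(d, 100_000))
```

Already at D=10 the heuristic odds are below one in a million, so the simulation will almost always report zero minima there.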

But you have to be careful about that word "independent".

There's a reason that things like 3D protein structure estimation, for example, are still very difficult problems, because none of the coordinates are even approximately independent of the others.

So you're back to a standard "minimization is really difficult" even in ultra-high dimensional spaces.

Yeah I was thinking about that as I was writing and trying to convey why I feel like deep models are different. I think one way of thinking about it is that protein structure, even though it has lots of parameters, is all happening within the confines of 3D space. A protein that could move in lots of dimensions at once could probably fold much more reliably, and it would be easy to find this structure.

That makes sense and explains why it's unlikely to have local minima, but I don't get the "no minimal at all" argument. Why no global minima at all, just because of high dimensionality?

It’s a heuristic argument that critical points are extremely unlikely to be local minima (ie positive definite second derivative). Loss surfaces of DNNs do typically have a global minimum (zero if they fit the training data exactly).

Arguably, a DNN seems likely to have many global minimums - given the level of (over)parametrization commonly used, a set of parameters that gets the lowest possible loss won't be unique, there will be huge sets of parameters that give exactly identical results.

Due to symmetry, at least, there are many global minima, but with the same minimum value.

> but I don't get the "no minimal at all" argument.

That's just wrong, no need to think about it. Only unbounded loss functions don't have minima, but using such a loss function would not make sense.

This isn't true. A bounded loss function that only approaches its lower bound asymptotically also has no minimum, and commonly used loss functions behave this way.

Doesn't the fact the network is discrete (floats have a maximum precision) mean this isn't actually the case? There's a finite number of states the net can be in, and one (or more) is best.

Your intuition is off, but only slightly. As the dimensionality increases, the number of stationary points (zero gradient) also increases. But they become overwhelmingly likely to be saddle points rather than minima.

You can read more here [1].

[1] http://ganguli-gang.stanford.edu/pdf/14.SaddlePoint.NIPS.pdf

Think of the model as a point in high-d space that you're trying to trap inside a cage. (Corresponding to it being at a local minimum and surrounded by higher points.) So you're trying to build a cube in n dimensions, which has 2n sides. Now imagine that each of those sides will randomly be there or not, probabilistically, with probability p. To actually trap a model like gpt3, p^175,000,000,000 has to be high enough that you observe a case during training where you roll a 1.

My intuition here is that there is always a fair bit of noise in the (data -> label) pairs. Models tend to err on the side of being _too_ expressive to compensate (just add another layer right?). Assuming our model is too expressive, we must turn off training before we get too far down and start memorizing training datasets. Put another way, one input pair will want to go down one path, another will want to go down a totally different path. Assuming the model is too expressive, we actually don't want to reach a minimum, because that essentially guarantees we've overfit.

What the author is saying is that very quickly optimization becomes a maze, and can quickly turn into a combinatorial game. Each input starts down its own set of corridors (parameters at certain values) and it can take an _extremely_ long time for this maze to end. Any noise in the (data, label) pairs can make this maze have no end at all if the model is too small. If the model is too big, it's a moot point because it will have overfit at this point.

Consider trying to classify shapes into either square or circle. The model outputs the probability of circle. The absolute best you can do is to completely learn the training set. Assign 1 to all the circles and 0 to all the squares.

It is typical to squish the set of all real numbers into the interval 0 to 1. Any finite value will be squished to a value less than 1. So the model tries to make the number go higher and higher. No matter which model you have, you can always go a bit higher. Thus there is no optimal model.
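
A tiny numeric sketch of that point, assuming the usual sigmoid plus cross-entropy setup (my assumption, the comment doesn't name a specific loss): for a training example labelled "circle", the loss keeps dropping as the pre-squish value (the logit) grows, but never reaches zero. The function name is made up for illustration.

```python
import math


def loss_for_circle(logit: float) -> float:
    """Cross-entropy loss for one example labelled "circle" (target 1).

    Equals -log(sigmoid(logit)), written as log(1 + exp(-logit)) for stability.
    """
    return math.log1p(math.exp(-logit))


if __name__ == "__main__":
    for logit in (0, 1, 5, 10, 20, 30):
        print(f"logit={logit:>2}  loss={loss_for_circle(logit):.3e}")
```

Every larger logit gives a strictly smaller, still positive loss, which is the "you can always go a bit higher" situation described above.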

The author mentions regularization, but strangely he then proceeds as if it didn't exist. Because with regularization, you can prove that there is a minimum. Basically (don't worry if you don't understand, I just include it for other readers): loss goes to infinity as the parameter norm goes to infinity, and loss has a lower bound, so it has a greatest lower bound. Take a sequence of points in parameter space with loss converging to this bound. Since the loss blows up at infinity, this sequence is bounded, so by Bolzano-Weierstrass it has a convergent subsequence. Loss is a continuous function of the parameters, so the loss of the limit point is the limit of the loss. I.e. it's a minimum.
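
Here is a one-parameter sketch of the same argument, under assumptions of my own: a single "circle" example whose logit is the weight w itself, cross-entropy loss, and an L2 penalty of strength lam. With the penalty the loss is convex and grows at infinity, so a simple ternary search finds a finite minimizer. All names and the search interval are illustrative.

```python
import math


def regularized_loss(w: float, lam: float) -> float:
    """Log loss for one positive example with logit w, plus an L2 penalty."""
    return math.log1p(math.exp(-w)) + 0.5 * lam * w * w


def minimize_convex(f, lo: float, hi: float, iters: int = 200) -> float:
    """Ternary search for the minimizer of a convex function on [lo, hi]."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if f(m1) < f(m2):
            hi = m2  # the minimizer cannot lie to the right of m2
        else:
            lo = m1  # the minimizer cannot lie to the left of m1
    return (lo + hi) / 2


def best_weight(lam: float) -> float:
    """Finite minimizer of the regularized loss, searched on [0, 100]."""
    return minimize_convex(lambda w: regularized_loss(w, lam), 0.0, 100.0)


if __name__ == "__main__":
    for lam in (1.0, 0.1, 0.01):
        print(lam, best_weight(lam))
```

Weaker regularization pushes the minimizer further out, and only in the limit lam -> 0 does it run off to infinity.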

I think the point being made is that a deep network (GPT-3 was the example with 175B parameters) will (by virtue of its size) not have any 'local optima' in the sense that there is no traditional 'local' for these high dimensional places. This is because as the # of dimensions increases it is easier to move away from or towards both better or worse parameter sets. Thus optimization algorithms don't have to be concerned about being trapped in a 'local optimum'. Also, because there are many good parameter sets, not all parameters even need be used to get good results, thus processes like distillation can work.

The concept of optima is heavily dependent upon constraints on dimensionality.

If you're surrounded by walls at a 3D coordinate, but you can arbitrarily "jump" 4-dimensionally, and there exists a point in the fourth-dimensional space where your 3-dimensional space is no longer constrained from moving in the lower 3 dimensions again, you've essentially "avoided" a local optimum, because your optimization function can continue to shift to find something better. (Walls in this sense are usually something akin to an a priori constraint on step size, which itself is the upper and lower bounds of the imposed delta introduced to the current step to try to find a new direction to go in numerical gradient descent.)

At least this holds if we're talking gradient descent, where the definition of an optimum for a function is a point wherein any numerical deviation within your error tolerance always ends up converging to the same point.

If you take that same technique and apply it to higher dimensional spaces, the more dimensions you have, the less likely (in theory) your model is to get stuck.

I know for a fact this doesn't always hold though, as almost every GPT2/3 model I come across I can still manage to get it snarled in predictive loops, where it does nothing but suggest the same thing over and over and over and over and over and over again, indicating a local optimum for the predictor.

One of my favorite ways to trip them up is generally some variant of "I once heard a story from a man who heard it from a man, who heard it from a man,..." Usually it sets it up for the loop. Sometimes you need to massage it a bit, but it's generally pretty easy to lead the predictor into a loop.

If you really want to blow up the theory that there are no higher dimensionality optima though, just look at other people. If there were no higher dimensional optima, why do bad habits exist, and get converged on so readily that we actively have to discourage, label, or avoid them?

The fact is that for a general function simulator, the trick isn't not falling victim to higher dimensional optima, but learning to recognize what and when you can safely tolerate some, and when you can't, because you really can't avoid the damn things in a resource or physically constrained problem space.

Autocomplete in GPT-3 isn't optimization at all, though? The optimization was in the training.

I've read that the loops you're talking about come from taking too many high-probability choices compared to real-world text that has some low-probability words.
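
A toy sketch of that explanation, with a hand-written next-word table standing in for a language model (the table and its probabilities are entirely made up, nothing here is GPT's actual behaviour): always taking the most likely next word cycles forever, while sampling occasionally takes a low-probability word and breaks out.

```python
import random

# Hypothetical next-word probabilities; words missing from the table end the text.
NEXT = {
    "a": {"man": 0.9, "story": 0.1},
    "man": {"who": 0.8, ".": 0.2},
    "who": {"heard": 0.9, "left": 0.1},
    "heard": {"it": 0.9, "nothing": 0.1},
    "it": {"from": 0.8, ".": 0.2},
    "from": {"a": 0.9, "nobody": 0.1},
}


def generate(start: str, steps: int, greedy: bool, seed: int = 0) -> list[str]:
    """Generate up to `steps` more words, greedily or by sampling."""
    rng = random.Random(seed)
    words = [start]
    for _ in range(steps):
        dist = NEXT.get(words[-1])
        if dist is None:  # reached a word with no continuation
            break
        if greedy:
            nxt = max(dist, key=dist.get)  # always the highest-probability word
        else:
            nxt = rng.choices(list(dist), weights=list(dist.values()))[0]
        words.append(nxt)
    return words


if __name__ == "__main__":
    print(" ".join(generate("a", 18, greedy=True)))
    print(" ".join(generate("a", 200, greedy=False)))
```

The greedy run repeats "a man who heard it from" until it hits the step limit; the sampled run ends as soon as one of the rarer words is drawn.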

TL;DR: For high-dimensional models (say, with millions to billions of parameters), there's always a good set of parameters nearby, and when we start descending towards it, we are highly unlikely to get stuck, because almost always there's at least one path down along at least one of all those dimensions -- i.e., there are no local optima. Once we've stumbled upon a good set of parameters, as measured by validation, we can stop.

These intuitions are consistent with my experience... but I think there's more to deep learning.

For instance, these intuitions fail to explain "weird" phenomena, such as "double descent" and "interpolation thresholds":

* https://openai.com/blog/deep-double-descent/

* https://arxiv.org/abs/1809.09349

* https://arxiv.org/abs/1812.11118

* See also: http://www.stat.cmu.edu/~ryantibs/papers/lsinter.pdf

We still don't fully understand why stochastic gradient descent works so well in so many domains.

But it doesn't. Researchers have been saying for several years now that computer vision is more accurate than human vision, and face recognition was one of the first problems "solved." And yet when the pandemic hit, Apple had to scramble to adjust its unlock mechanism in iOS 13.5 because Face ID cannot recognize people wearing masks [1]. Humans have no trouble identifying people wearing masks. We are now almost a year into the pandemic, iOS 14 has been released, Face ID has not been fixed, and now we hear that Apple is bringing Touch ID back [2].

So sure, you've developed a methodology that can overfit nicely not only the train data but even the test data. But it still fails miserably when you apply your model in the field.

[1] https://www.theverge.com/2020/5/20/21265019/apple-ios-13-5-o...

[2] https://appleinsider.com/articles/20/10/16/under-display-tou...

Intuitively, it seems like voice signatures, body language (like walking habits), and height/weight would play a larger role in helping humans identify a masked person than exposed facial features.

Because the network behind Apple's facial recognition software cannot have access to this kind of data (well, maybe voice, but that doesn't seem secure), I'm not sure this is a fair comparison.

Would love to be refuted, however.

Even when just going by face/head, my personal anecdotal data runs counter to what GP wrote.

I've been growing my hair since shortly before the pandemic hit (from a few mm buzzcut which I had for years previously), and combining that with a mask, I'm apparently unrecognizable. I've run into people in the street that would recognize me in a heartbeat otherwise, but with the additional change of hair, people only recognize me when I pull my mask down. So humans aren't that amazing at recognition either.

I don't know that I'm looking to refute you per se but...

A better example than face masks maybe is the recent controversy over Twitter's AI and Obama images (https://www.theverge.com/2020/9/20/21447998/twitter-photo-pr...).

A lot was made of racial issues, which is fine, but the broader issue is why subtle changes in photos, like cropping, should confuse things so completely.

The target piece (the focus of this HN thread) sums itself up this way:

"There’s a good set of params somewhere nearby. When we start walking to it, we can’t ever get stuck along the way, because there are no local optima. Once we’ve stumbled upon a good set of parameters, we’ll know it and we can just stop."

I think there are some useful insights there, but this is in many ways the definition of local optima. What I might argue is that because there are so many locations in high-dimensional space that will satisfy some classification goal, it's "easy" to find one that works with regard to some population that defines the model development space (training + test). However, that model development space/population is itself implicitly defined by a certain set of constraints -- it's a subpopulation of some broader population. The population you want to generalize to, and against which overfitting should be defined, is broader. You can still not overfit to your model development space, but be overfitting with regard to some broader set of possible inputs.

Whether or not the constraints of the model development population/space are important and reasonable considerations -- e.g., in your argument, not having access to things like body language etc -- is maybe a little variable. In some cases the implicit defining characteristics of the model development population are meaningful, but in other cases they're hidden.

In Twitter's case, you end up finding out later that there's weird things that probably defined the space of their images that they didn't intend. It's only in the adversarial case that you learn about this.

In classical statistics, you talk about generalization and overfitting, but there's an implicit population you're sampling from that defines both of those things. That is, you have a training/fitting/initial sample, and you ask yourself how well your model would perform on a test/validation sample. But implicit in that is some assumption about what it means to be a random sample from the same population.

I think lots of times with DL, the cross-validation/test sample is also implicitly defined as coming from some population. But the population isn't some model, it's some source. Some image dataset, something like that. There will be things about that source that are "of interest", but other things that are idiosyncratic about it, and unrepresentative of the "real" population of interest. In this way, I'm not sure that held-out samples from some source are really the right way to think of generalization and overfitting -- I think the adversarial setting is the generalization setting.


Along the way from the classical to the DL regime I think there have been some overlooked issues about what it means to generalize, what you're really sampling from, and what your "population" actually is. It parallels tensions about theory versus experimentation because having a population in the classical case that you're sampling from requires a certain data-generating theory, which is often lacking in DL. The closest thing in the classical regime to DL generalization theory is maybe a sort of ultra-high-dimensional bootstrapping with random effects: showing that your bootstrap samples are close to your observed sample isn't the same thing as showing they're close to the population, or to other samples drawn from that population, especially in the presence of random effects.

It's too bad you're being downvoted. I had a similar reaction, and I think you're on to something important.

Many adversarial cases are good examples of this: a DL model being completely thrown off by something very incidental, that a human would instantly recognize as not being within a class. Not just something a human would instantly recognize as not being within a class, but something a human would be perplexed by as an adversarial case.

The point isn't that humans are better or worse, it's that the models do often seem to be overfitting, but overfitting in a way that isn't evident until the inputs are generalized beyond whatever is in the development samples. Put another way, they might be learning something about your development datasets more so than the actual features of interest, which is the whole idea of overfitting. It's just that what it means to "generalize" is much broader.

It's a really interesting piece but I think there's lots more to the story.

I don't think this is quite fair. Of course a machine learning model doesn't work on things it wasn't trained for. Computer vision is not more accurate than human vision in every possible field. And are you really sure humans have no trouble identifying people wearing masks? Lots of people look pretty similar, especially if all you can see is their face.

A lot of people talk about minima because that's the language we have for analyzing optimization techniques. Deep Learning is still new enough that there is lots of low hanging fruit to explore, including empirical approaches and applying existing theoretical tools to try and explain DNNs. The community is slowly moving towards developing new tools specifically for deep learning, to properly analyze these networks and prove stuff (bounds, convergence etc) about them.

This quote:

> there are a base of unspoken intuitions that underlie expert understanding of a field, that are never directly stated in the literature, because they can’t be easily proved with the rigor that the literature demands. And as a result, the insights exist only in conversation and subtext, which make them inaccessible to the casual reader.

I’d love to hear these intuitions from every field. Anyone got some?

All sensor data, the closer you get to the analog side of things, is bullshit. It's just about smoothing over the bullshit enough to make the tolerances workable for real world applications.

We call this bullshit smoothing "calibration". If you're doing work on sensor data and don't have every calibration parameter, whether from configuration or magic factory numbers and statistical tolerances, someone, somewhere is pulling the wool over the eyes of the software guy downstream that works with the final data.

Ever looked at weather data from two separate apps and had the values vary by multiple degrees? Two different pipelines each sprinkled their own interpretation on top of raw sewage data.

> Ever looked at weather data from two separate apps and have values vary by multiple degrees?

Actually this is because surface temperatures really do have this kind of variation.

There is a trick in physics I am reminded of. In infinite dimensions there is no way to have a Gaussian measure on just an infinite dimensional Hilbert space. It needs to be embedded inside a bigger infinite dimensional space and then you can have some relative measure.

So you do not look at just a Gaussian integral, you look at a quotient of Gaussian integrals.

Perhaps there is a similar idea. Perhaps there is some sort of renormalisation that would make neural networks work better. Even if your neural network is infinite dimensional it still makes sense to talk about some surface relatively.

I find the article style unreadable. Could someone please say whether the author explains why deep learning shouldn't work?

There is a bit at the start about how people in statistical learning throw their hands up at deep learning etc, but none of that makes sense to me. Neural nets are an idea as old as AI - even older, in fact. The need for deeper networks was well understood by the 1980s. There are well known results about feedforward neural nets with arbitrary hidden units being universal function approximators. Why shouldn't deep learning work?

People coming from the perspectives of general optimization believed that they were impossible to train, due to being very non-convex. People coming from the perspective of classical statistics believed that they couldn't generalize due to needing large numbers of parameters. Both of those turned out to be very wrong, and this post is trying to explain why.

Thank you for the summarisation. Does the article give any examples of such arguments, or is it something stated as a commonly known fact? Coming from an artificial intelligence background I am not aware of such opinions. I know that deep neural nets were considered difficult to train until the re-discovery of backpropagation, but not because of anything to do with the shape of the error function.

However, as usual there is confusion about what "generalisation" means. For example, I was in a summer school at Oxford a couple of years ago and one of the lectures made a similar point, about the surprising generalisation ability of deep neural nets. I approached the lecturer after the lecture and asked what they meant because the way I knew it, neural nets can't generalise, and they explained that they meant that they generalise surprisingly well on the test set but not on unseen data, or out-of-distribution data, i.e. not on any data that was not available to the researcher during training (as training, validation or test set).

In other words, neural nets are great at "generalisation" in the sense of interpolation, but are almost completely incapable of "generalisation" in the form of extrapolation.

I like to quote Francois Chollet of Keras on this:

> This stands in sharp contrast with what deep nets do, which I would call "local generalization": the mapping from inputs to outputs performed by deep nets quickly stops making sense if new inputs differ even slightly from what they saw at training time. Consider, for instance, the problem of learning the appropriate launch parameters to get a rocket to land on the moon. If you were to use a deep net for this task, whether training using supervised learning or reinforcement learning, you would need to feed it with thousands or even millions of launch trials, i.e. you would need to expose it to a dense sampling of the input space, in order to learn a reliable mapping from input space to output space. By contrast, humans can use their power of abstraction to come up with physical models—rocket science—and derive an exact solution that will get the rocket on the moon in just one or few trials. Similarly, if you developed a deep net controlling a human body, and wanted it to learn to safely navigate a city without getting hit by cars, the net would have to die many thousands of times in various situations until it could infer that cars are dangerous, and develop appropriate avoidance behaviors. Dropped into a new city, the net would have to relearn most of what it knows. On the other hand, humans are able to learn safe behaviors without having to die even once—again, thanks to their power of abstract modeling of hypothetical situations.


In short, if the point of the article is that neural networks "work" because they generalise in the sense of extrapolation, then that's not right.

It seems to me all of these arguments apply just as well when the "deep" network has only one hidden layer.

That's true. I left out saying that deeper networks represent a wider variety of functions more easily, because that seems generally intuitive to everyone. But the arguments about how easy it is to optimize them should apply equally well to wide and shallow networks as to very deep ones.

Thanks for the response. Hmm, it's still pretty mysterious to me. Why should a deep network with the same number of parameters as a wide network represent a wider variety of functions? In some sense they represent the "same" number of functions, in the sense that the manifolds of functions given by two network architectures with the same number of parameters have the same dimension, even if one is wide while the other is deep.

I think a deeper network has fewer degrees of freedom in which to move, or rather, in which to move usefully, because parameters are more interdependent. That means in order to generate a useful function, it has to learn more abstract features than a shallow and wide network. This is because any adjustment to irrelevant features that are unique to a small number of examples has a larger negative impact on the rest of the data than it would in a shallower net (except in the later layers). Over time, abstract changes are stochastically rewarded and specific changes are penalised, at least for the earlier layers, and the later layers then have to work with this more abstract information so they simply can't overfit that much.

Would be interested in OP's take on this though.

Another interesting explanation for deep learning's success in the physical world:

Why does deep and cheap learning work so well?

Max Tegmark et al.,


Great article, though I never understood why people would think deep learning shouldn't work.

Blank page? :(

The article was gone?

The author argues that deep-learning has abandoned statistics. I could not disagree more! Too much of the field was concerned with detailed proofs and mathematical formalism that were somewhat disconnected from probability theory. Modern machine learning (or AI or whatever) still has strong roots in probability and statistics. Loss functions are still based on concepts such as the log-likelihood function.

Formal proofs and mathematics are essential, but can become a distraction from the end goal. It is like playing chess by going after your opponent's pawns instead of their king. I would say modern machine learning has become tantamount to experimental physics and this article is written from the perspective of a string theorist.

That isn't what I'm arguing at all.

I think this is a way to really piss off statisticians:

  Stop talking about minima. Stop talking about how your optimization algorithm behaves around a minimum. Nobody ever trains their model remotely close to convergence. ADAM doesn’t even provably converge. All real models diverge! You are nowhere close to a minimum! Stop talking about minima already goddamnit! Why even think about minima?! Minima are a myth! Everybody proves their results for minima of a convex function.
Actually arguing for divergent models. I mean, I love it, just because it's so very very cruel.

How do we define 'works'?

Are you not familiar with how awful machine translation was ten years ago and how eerily good it is today?

I think voice recognition would be a better example. It went from 'toy' to 'everyday use' because the word error rate dropped an order of magnitude.

Is that due to ML or all the massive crowdsourcing and more keyword/search driven approach? e.g. have dictation apps like Dragon improved substantially?

Some of it is due to bigger data, but the majority is definitely ML. For constant data, the error rate dropped dramatically due to much improved algorithms.

15 years ago, voice recognition was all hidden Markov model based. The data sets were limited, in part due to the cost of collection, but mostly because larger datasets didn't significantly improve accuracy.

As algorithms improved, larger datasets became more important as more data did actually improve accuracy.

Both funnily enough. ML has been around for years, the main reason it's getting better is that it's easier than ever to collect massive amounts of data to train good models.

Japanese and Chinese translation to English is still absolute garbage outside of some obviously targeted sites (Wikipedia, I believe). Translating a German news site is still riddled with key errors because it doesn't understand how words and phrases get used in different contexts.

It's improved a bit, but it's still miles and miles away from any half-decent human translation.

i would say that's more a function of corpus than ML

plus is it really that much better? i still get weirdness out of google translate

Nope, we have loads of experimental evidence (and I mean, it's something one can verify at home for smaller datasets, it does not take that much compute) that neural MT gets significantly better results than what we could (and still can) achieve with "pre-neural" SMT methods on the exact same corpus. A general benchmark for comparing the effectiveness of different MT approaches are the WMT conference series (e.g. http://www.statmt.org/wmt20) shared tasks where the systems are trained on the same corpus.

They're still not perfect, and sure, you get weirdness, but it has become significantly better according to all metrics including human comparisons of different translation aspects, which are expensive/rare to do but have been done quite a few times; there's a clear consensus that deep learning "works" for MT.

There are certain niches where other methods may still be better (IIRC languages with very little data, and translation of specific 'controlled language' domains), but for mainstream MT I think that nowadays nobody would decide to build a non-neural system.

It still seems pretty awful today for a lot of stuff I try to translate with it


I'm definitely not familiar with that. From my perspective it was mediocre but workable 10 years ago, and it's marginally better now if the languages are closely related.

There exist more objective measurements of translation quality which, if you take them to be relevant, show that machine translation quality has gotten better between most languages. https://en.wikipedia.org/wiki/Evaluation_of_machine_translat...

Here's a blog post by Google showing scores creeping up since 2009 https://ai.googleblog.com/2020/06/recent-advances-in-google-...

Also I agree that I don't recall it being awful like the parent poster suggests but maybe only because I reserve 'awful' for the type of rule based machine translation used before statistical approaches came onto the scene.
