> What I find however is that there are a base of unspoken intuitions that underlie expert understanding of a field, that are never directly stated in the literature, because they can’t be easily proved with the rigor that the literature demands. And as a result, the insights exist only in conversation and subtext, which make them inaccessible to the casual reader.
This is why conferences are important. A lot of knowledge is sociological in nature. Failures and tricks of the trade are discussed at the bar after 5. I've experienced this first hand.
Academic publications and lectures, in their (rightful) pursuit of rigor, aren't usually the right space for conversations around hunches and experiences. Reputations are at stake, and most people are doing impression management. When experienced readers read a journal publication, they read it with the implicit understanding that it is a highly curated view of the messiness behind the scenes.
Maybe it should remain separate from the rigorous stuff, but where's the "Op-Ed section" of academic publishing?
That said, there are certain open avenues for making such knowledge public. MathOverflow is one. Some academics document their guild knowledge in "technical papers" which they put up on their website. The ML community (which I'm not part of, but that I'm able to observe as an outsider) seems to be particularly open when it comes to publishing blog posts -- sometimes to gain reputational points?
In some journals, arguments over publications are carried out in the Letters to the Editor section. Sometimes this leads to public feuds however, and some academic communities are small enough that if you make too many enemies your publications may be visited upon with disfavor when it comes time for peer review. It's not worth getting into public tiffs unless there's a principle at stake.
> Sometimes this leads to public feuds however, and some academic communities are small enough that if you make too many enemies your publications may be visited upon with disfavor when it comes time for peer review. It's not worth getting into public tiffs unless there's a principle at stake.
This is more the kind of thing I was using the word "toxic" to describe. Of course I know this is a widespread and deep-seated problem, and not one that could be fixed overnight; I was just commenting on it.
You could argue it's kind of like a long abstract, but a long abstract implies you intend to probe the topic further, when in actual fact you might just want to signal that there might be something there for somebody else.
> blogs or something like a technical report directly
> published by a lab/working group to be quite honest.
I think this is kind of the problem: half of this stuff sits on a webpage somewhere, completely unread and not really referenceable.
> I'd say the likelihood of an IEEE rubberstamped one pager
> on perpetual motion would be non-zero.
I would hope that each single page would be reviewed with the same integrity as a six pager. Of course, it's not impossible that crap leaks through to any conference/journal.
That's a good way. It's called knowing your audience. No one wants to be savaged on HN or Reddit by a bunch of people.
But, then again, lots of academics will outright ignore blogs - if they're ever brought up in "serious" discussion.
What's the saying about academia? The politics are so vicious because the stakes are so small? It's pretty ironic, because the stakes are larger for a lot of decisions made by engineers in industry, but we seem content with swapping best practices.
These intuitions are often wrong and arise due to the lack of vocabulary in correctly describing the mechanisms that occur.
From a researcher's standpoint, learning these is counterproductive if the goal is to study and understand the underlying mechanisms from first principles.
From a beginner engineer's perspective, these intuitions may be effective "functional truths", but there's the danger of perceiving these handwavy intuitions as truths. This leads to inflexibility in light of empirical evidence that contradicts these intuitions and, even worse, to not debugging enough because the pattern roughly matches the intuition. The latter results in flawed institutional knowledge being accrued over time. An engineer might say: "Engineer X tried Y and it didn't work because of ML intuition Z; since this is a related problem, we should not prioritize Y due to the precedent."
I think it's much better for a beginner engineer to learn the methods from first principles and develop an appreciation for them. They can then learn the distinction between what's true and the intuitive language people use to describe a phenomenon they don't completely understand but can pattern match to. This will help them avoid the mistakes made by people who rely too much on this intuitive language, mistaking it for ML theory.
This seems super meta, because I am a beginner engineer and your comment is totally pattern matching to distillation, but for humans instead of models. Maybe beginners should start with those "functional" (read: distilled) truths and train from there :)
Not sure it's true for the other fields tho.
Formally, people have studied the spectrum of the Hessian and found that most of its eigenvalues are quite small, with only a few much larger ones. It all started with , with several recent extensions.
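As a toy analogue (my own sketch, not the experiments from those papers): for an overparameterized linear model, the Hessian of the squared loss is X^T X / n, which can have at most n nonzero eigenvalues when there are p > n parameters - so the bulk of the spectrum sits at zero, with a few larger outliers.

```python
import numpy as np

# Toy analogue (not the papers' experiments): the squared loss of a linear
# model, f(w) = ||X w - y||^2 / (2n), has Hessian H = X^T X / n.
rng = np.random.default_rng(0)
n, p = 50, 200                      # more parameters than data points
X = rng.standard_normal((n, p))
H = X.T @ X / n                     # p x p Hessian, rank at most n

eigvals = np.linalg.eigvalsh(H)     # ascending order
print((eigvals < 1e-10).sum())      # ~150 eigenvalues are (numerically) zero
print(eigvals[-5:])                 # a handful of much larger ones
```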
(1) decision surfaces are always linearly separable with enough dimensions
(2) NNs have enough dimensions
(3) NNs' linear boundaries are coarse
(4) coarse boundaries in high dimensions are likely to approximate the low-loss true boundary (i.e., given (1)).
The idea behind (4) is just the linear regression idea: by (1), noise is Gaussian and a straight line is a good approximation. With a coarse line, we do not fit to noise, and hence probably have a good approximation.
The phrase "neural network" disguises the obviousness of this reasoning: a NN is just high-dimensional piece-wise linear regression.
The only thing to be explained is why, in high dimensions, datasets end up nearly piece-wise linear.
That isn't so hard to explain.
This paper  seems to be getting closer to what you are looking for.
Then in the mid-to-late 2000s, the mainstream method was convex optimization, and you had a proof that there was one global optimum and a wide range of optimization methods guaranteed to reach it from most initialization points. Simultaneously, the theory underlying SVMs and CRFs was developed: that you could actually do a large variety of things and still use these easy, dependable optimization techniques. And people hammered home the need for regularization techniques.
In the late 2000s to early 2010s, several things again came together - one being the discovery of DropOut as a regularization technique - and the understanding that it was one, the other being the development of good initializers that made it possible to use deeper networks. Add to that improved compute power - including the development of CUDA which started out as a way to speed up texture computation but then led to general purpose GPU computing as we know it today.
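For readers who haven't seen it, train-time dropout is only a few lines (a minimal sketch using the common "inverted dropout" scaling; the function and its defaults are mine, not any framework's API):

```python
import numpy as np

def dropout(h, p=0.5, train=True, rng=np.random.default_rng()):
    """Inverted dropout: zero each activation with probability p at train
    time and scale the survivors by 1/(1-p), so the expected activation
    is unchanged and test time needs no correction."""
    if not train or p == 0.0:
        return h
    mask = rng.random(h.shape) >= p   # keep each unit with prob 1 - p
    return h * mask / (1.0 - p)

h = np.ones(10)
print(dropout(h))  # roughly half zeros, survivors scaled up to 2.0
```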
All this enabled a rediscovery of NN learning which could take off where linear learning methods (SVMs, CRFs) had plateaued before. And often you had a DNN that did what the linear classifier before did but could learn features in addition to that - and could be seen as finding a solution that was strictly better.
But the lack of global optimum means that - even with good initializers and regularization packaged into the NN modules we use in modern DNN software implementations - the whole thing is way more finicky than CRFs ever were. (It would be wrong to say that CRFs are trivial to implement or never finicky at all, just as many well-understood NN architectures have a good out-of-the-box experience with TF/PyTorch etc. - so take this as a general statement that may not hold for all cases).
Neural networks train in a similar way. You have a "loss" function that adds up how wrong your predictions are compared to the training data. You try different values for the weights in your neural network to see which ones send you downhill the fastest, step them in that direction, and repeat. Since the loss function in this case is more complex, it's not a single valley, but potentially many valleys. You can end up at a decent local minimum.
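In code, the loop is almost embarrassingly short (a bare-bones sketch on a made-up two-valley loss, with a numerical gradient so it stays self-contained):

```python
import numpy as np

# Toy loss with two "valleys" along the first axis; the gradient is
# estimated numerically just to keep the sketch self-contained.
def loss(w):
    return (w[0] ** 2 - 1.0) ** 2 + w[1] ** 2

def grad(w, eps=1e-6):
    g = np.zeros_like(w)
    for i in range(len(w)):
        step = np.zeros_like(w); step[i] = eps
        g[i] = (loss(w + step) - loss(w - step)) / (2 * eps)
    return g

w = np.array([0.1, 2.0])       # initial weights
lr = 0.05                      # step size
for _ in range(200):
    w -= lr * grad(w)          # step downhill, repeat
print(w, loss(w))              # settles in one of the two valleys
```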
TLDR - loss landscapes are nasty, but you can tame them with skip connections.
Sagun et al. (and derivative works) only focus on the Hessian on the trajectory followed by gradient descent, while Li et al. give a broader look at the loss surface as a whole.
Overall it looks like gradient descent is a strong regularizer -- we know it tends to prefer small and low-variance weights, for example. So part of deep generalization has to do with how SGD is able to pick "good" features early, and then optimization pushes the unimportant weights to zero later (hence lottery tickets).
 https://openai.com/blog/deep-double-descent/ and other papers.
 https://arxiv.org/abs/1711.08856 and others.
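To make "optimization pushes the unimportant weights to zero" concrete, here is the magnitude-pruning step at the heart of lottery-ticket experiments (a sketch of the pruning criterion only, not the full iterative train-prune-rewind procedure from the paper):

```python
import numpy as np

def magnitude_prune(weights, frac=0.8):
    """Zero out the smallest-magnitude fraction of weights, keeping the
    survivors as the candidate 'winning ticket' mask."""
    threshold = np.quantile(np.abs(weights).ravel(), frac)
    mask = np.abs(weights) >= threshold   # True where the weight survives
    return weights * mask, mask

rng = np.random.default_rng(2)
W = rng.standard_normal((4, 4))
pruned, mask = magnitude_prune(W, frac=0.75)
print(mask.sum(), "of", W.size, "weights survive")
```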
I think you'd be interested in https://arxiv.org/abs/1611.03530. It discusses how SGD is an implicit regularizer. We also actually want high variance weights for symmetry breaking.
Another way to look at it: most neural nets are just a bunch of polynomials stitched together. You can see this from the popularity of the ReLU activation function: when the ReLU is negative, that poly is always zero in that area. When positive, it's some poly multiplied by a constant, i.e. another poly.
For nets that use other activation fns, they try to be linear in the area of most active input. So again they approximate a const * a poly.
No, this is actually exactly the opposite of what the article is saying.
"If you have enough lines you can approximate any curve within a margin" is usually a bad thing because your approximations lose meaning as the number of lines increase.
The surprising thing about machine learning is that increasing the "number of lines", if you will, increases the meaning too, and there are some weird and subtle properties of mathematics in higher dimensions that make this work.
If I can be really hand-wavy about it -- A lot of deep neural networks achieve their nonlinearity in a very constrained way. They stack linear models on top of each other, and the nonlinearity comes from using a relatively simple nonlinear function such as logistic or tanh to scale the models' outputs before feeding them into the next one. (Without that step, you'd just have a linear combination of linear functions, which would itself be linear.)
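That parenthetical is easy to check numerically (my own sketch): two stacked linear layers collapse into one linear map, and a tanh in between is what prevents the collapse.

```python
import numpy as np

rng = np.random.default_rng(3)
W1 = rng.standard_normal((5, 3))
W2 = rng.standard_normal((2, 5))
x = rng.standard_normal(3)

# Without a nonlinearity, two layers collapse into one linear map:
print(np.allclose(W2 @ (W1 @ x), (W2 @ W1) @ x))        # True

# With tanh in between, no single matrix reproduces the map:
print(np.allclose(W2 @ np.tanh(W1 @ x), (W2 @ W1) @ x)) # False (generically)
```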
That's a pretty constrained form of nonlinearity compared to polynomial regression, which tries to directly fit some high-order polynomial. I don't have anything like the math chops to prove this, but I believe that means the neural network will tend to favor a relatively smoother decision boundary, whereas polynomial regression is a naturally high-variance sort of affair.
It's worth noting that neural networks didn't take off with Hornik et al. style simple-topology-complex-activation-function universal approximators. They took off a decade or so later, with LeCun-style complex-topology-simple-activation-function networks.
That arguably suggests that the paper is of more theoretical than practical interest. It's also worth noting that one of the practical challenges with a single hidden layer and a complex activation function is that it's susceptible to variance. Just like polynomial regression.
As a whole, that paper is quite bad (and still unpublished, probably blocked by peer review) because (1) it only considers fully connected networks (which are a minority of models used nowadays) and (2) the experimental validation was done on tasks where neural networks are not very strong.
Show me examples of competitive polynomial regression models in language translation, image segmentation and Go playing and I will be convinced.
You almost certainly can frame Go playing as polynomial regression, but there would undoubtedly be numerical precision and other gradient issues. Deep learning as a practice is remarkably effective, no disagreement there.
A lot of people publish results that say deep learning is "just" something else, where the something else doesn't work.
=> logistic regression + Kmeans
It's not guaranteed at all. Overcomplicated models will "overfit" the training data and generalize very poorly.
> some curves cannot be expressed as a polynomial
You can approximate any (continuous and blah blah) curve arbitrarily well with polynomials, e.g. Taylor expansions when the curve is smooth enough.
In fact, polynomials are one of the most common examples used to demonstrate overfitting. See figure 2 on Wikipedia (https://en.wikipedia.org/wiki/Overfitting).
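The failure mode fits in a few lines (a sketch with made-up noisy data, using numpy's polynomial fit):

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(0, 1, 12)
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(12)  # noisy samples

for degree in (3, 10):
    coeffs = np.polyfit(x, y, degree)
    x_test = np.linspace(0, 1, 200)
    err = np.mean((np.polyval(coeffs, x_test) - np.sin(2 * np.pi * x_test)) ** 2)
    # The high-degree fit chases the noise and typically tests much worse.
    print(degree, round(err, 3))
```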
How do you avoid overfitting the training data?
I'd thought of machine learning as a form of optimization.
Things like support vector machines really were hill climbing for some kind of local optimum point. But at a billion dimensions, you're doing something else entirely. I once went through Andrew Ng's old machine learning course on video, and he was definitely doing optimization.
The last time I actually had to do numerical optimization using gradients, I was trying to solve nonlinear differential equations for a physics engine in about a 20-dimensional space of joint angles that was very "stiff". That is, some dimensions might be many orders of magnitude steeper than another. It's like walking on a narrow zigzagging mountain ridge without falling off.
So deep learning is not at all like either of those. Hm.
Even then, you may still have converged on a local optimum, which was the takeaway I got from the article.
Another way to conceptualize these is to think of being at the minimum of a parabola in 2 dimensions, but then seeing you're not at a minimum in a 3rd dimension. Any time you're at a minimum in some dimensions but not all of them, you're on a saddle.
You can extend this concept to a neural net which lives in millions of dimensions, undergoing SGD. When beginning an optimization run, SGD moves in some direction to minimize a bundled cost, inevitably stumbling into minima in (usually) many dimensions. Subsequent iterations will shift some dimensions out of minima and other dimensions into minima; the net is always living on a saddle during this process.
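The smallest possible version of that picture (my own toy example): at the origin, f(x, y) = x^2 - y^2 is a minimum along x and a maximum along y, and the mixed signs in the Hessian spectrum are exactly what "saddle" means.

```python
import numpy as np

# Hessian of f(x, y) = x**2 - y**2 at the critical point (0, 0):
H = np.array([[2.0, 0.0],
              [0.0, -2.0]])
print(np.linalg.eigvalsh(H))  # [-2., 2.]: a minimum in one direction,
                              # a maximum in the other -> a saddle point
```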
There are many papers that discuss the process in these terms and others that implicitly use it. I wouldn't say it's a "hot area of research" but more of a tool for thinking about these processes and sometimes gaining some insight into why things get stuck during training.
Can someone who knows more about DL than I do help me understand this a little better?
The article uses the analogy of walls:
> Just recall what is necessary for a set of parameters to be at an optimum. All the gradients need to be zero, and the Hessian needs to be positive semidefinite. In other words, you need to be surrounded by walls. In 4 dimensions, you can walk through walls. GPT3 has 175 billion parameters. In 175 billion dimensions, walls are so far beneath your notice that if you observe them at all it is like God looking down upon individual protons.
I'm struggling to understand what this really means in 4+ dimensions. But when I try to envision it going from 1 or 2 to 3 dimensions, it doesn't seem obvious at all that a 3D space should have fewer local optima than a 2D space.
In fact, having a "universal function" like a deep network seems like it should have more local optima. What am I missing?
Obviously this is not rigorous, though.
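One way to put a rough number on the hand-waving (my own sketch, leaning on the usual random-matrix heuristic): a local minimum needs every Hessian eigenvalue positive, i.e. walls in every direction, and for random symmetric matrices that becomes vanishingly rare as the dimension grows.

```python
import numpy as np

rng = np.random.default_rng(5)

def frac_positive_definite(dim, trials=2000):
    """Fraction of random symmetric (GOE-like) matrices with all
    eigenvalues positive -- i.e., 'walls in every direction'."""
    count = 0
    for _ in range(trials):
        A = rng.standard_normal((dim, dim))
        H = (A + A.T) / 2
        if np.linalg.eigvalsh(H)[0] > 0:  # smallest eigenvalue
            count += 1
    return count / trials

for dim in (1, 2, 4, 8):
    print(dim, frac_positive_definite(dim))
# The fraction collapses fast (roughly exp(-c * dim**2)); by dim 8 it is
# essentially never -- almost every critical point has an escape direction.
```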
There's a reason that things like 3D protein structure estimation, for example, are still very difficult problems, because none of the coordinates are even approximately independent of the others.
So you're back to a standard "minimization is really difficult" even in ultra-high dimensional spaces.
That's just wrong, no need to think about it. Only loss functions that are unbounded below, or that never attain their infimum, lack minima, and using such a loss function would not make sense.
You can read more here.
What the author is saying is that very quickly optimization becomes a maze, and can quickly turn into a combinatorial game. Each input starts down its own set of corridors (parameters at certain values) and it can take an _extremely_ long time for this maze to end. Any noise in the (data, label) pairs can make this maze have no end at all if the model is too small. If the model is too big, the point is moot because it will have overfit by then.
It is typical to squish the set of all real numbers into the interval 0 to 1. Any finite value will be squished to a value less than 1. So the model tries to make the number go higher and higher. No matter which model you have, you can always go a bit higher. Thus there is no optimal model.
The author mentions regularization, but strangely he then proceeds as if it didn't exist, because with regularization you can prove that there is a minimum. Basically (don't worry if you don't understand, I just include this for other readers): the loss goes to infinity as the parameter norm goes to infinity, and the loss has a lower bound, so it has a greatest lower bound. Take a sequence of points in parameter space with loss converging to this bound. By Bolzano-Weierstrass it has a convergent subsequence. The loss is a continuous function of the parameters, so the loss of the limit point is the limit of the loss, i.e. it's a minimum.
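You can watch this happen numerically (a sketch with made-up separable data; the setup is mine, not the article's): without regularization the logistic loss keeps improving as the weight grows without bound, while an L2 term makes the loss turn around at a finite weight.

```python
import numpy as np

# One-dimensional logistic regression on separable data: x > 0 -> label 1.
x = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([0.0, 0.0, 1.0, 1.0])

def nll(w, lam=0.0):
    p = 1.0 / (1.0 + np.exp(-w * x))   # sigmoid squishes R into (0, 1)
    eps = 1e-12
    data = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    return data + lam * w ** 2          # optional L2 regularizer

for w in (1.0, 5.0, 20.0):
    print(w, nll(w), nll(w, lam=0.01))
# Unregularized loss keeps shrinking toward 0 as w grows: no minimizer.
# The regularized loss turns around and has a finite minimum.
```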
Suppose you're surrounded by walls at a 3D coordinate ("walls" here usually being something akin to an a priori constraint on step size, i.e. the upper and lower bounds of the delta introduced at the current step to find a new direction in numerical gradient descent). If you can arbitrarily "jump" 4-dimensionally, and there exists a point in the 4-dimensional space where your 3-dimensional position is no longer constrained from moving in the lower 3 dimensions, you've essentially "avoided" a local optimum, because your optimization function can continue to shift to find something better.
At least this holds if we're talking gradient descent, where an optimum of a function is defined as a point where any numerical deviation within your error tolerance always ends up converging back to the same point.
If you take that same technique and apply it to higher-dimensional spaces, the more dimensions you have, the less likely (in theory) your model is to get stuck.
I know for a fact this doesn't always hold, though, as with almost every GPT2/3 model I come across I can still manage to get it snarled in predictive loops, where it does nothing but suggest the same thing over and over and over again, indicating a local optimum for the predictor.
One of my favorite ways to trip them up is generally some variant of "I once heard a story from a man who heard it from a man, who heard it from a man, ...". Usually this sets it up for the loop. Sometimes you need to massage it a bit, but it's generally pretty easy to lead the predictor into a loop.
If you really want to blow up the theory that there are no higher-dimensional optima, though, just look at other people. If there were no higher-dimensional optima, why do bad habits exist, and why do we converge on them so readily that we actively have to discourage, label, or avoid them?
The fact is that for a general function simulator, the trick isn't never falling victim to higher-dimensional optima, but learning to recognize what and when you can safely tolerate some, and when you can't, because you really can't avoid the damn things in a resource-constrained or physically constrained problem space.
I've read that the loops you're talking about come from taking too many high-probability choices compared to real-world text that has some low-probability words.
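That matches the usual mitigation: sample from the distribution instead of always taking the argmax. Here is the generic sampling step with a temperature knob (a sketch, not any specific GPT implementation):

```python
import numpy as np

def sample_token(logits, temperature=1.0, rng=np.random.default_rng()):
    """Greedy decoding (temperature -> 0) always repeats the top choice
    and invites loops; temperature 1.0 also lets low-probability tokens
    through, the way real-world text does."""
    if temperature <= 0:
        return int(np.argmax(logits))          # pure greedy: loop-prone
    z = logits / temperature
    p = np.exp(z - z.max())                    # numerically stable softmax
    p /= p.sum()
    return int(rng.choice(len(logits), p=p))

logits = np.array([2.0, 1.5, 0.3, -1.0])       # made-up next-token scores
print([sample_token(logits, 0.0) for _ in range(5)])   # always token 0
print([sample_token(logits, 1.0) for _ in range(5)])   # varied
```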
These intuitions are consistent with my experience... but I think there's more to deep learning.
For instance, these intuitions fail to explain "weird" phenomena, such as "double descent" and "interpolation thresholds":
* See also: http://www.stat.cmu.edu/~ryantibs/papers/lsinter.pdf
We still don't fully understand why stochastic gradient descent works so well in so many domains.
So sure, you've developed a methodology that can overfit nicely not only the train data but even the test data. But it still fails miserably when you apply your model in the field.
Because the network behind Apple's facial recognition software cannot have access to this kind of data (well, maybe voice, but that doesn't seem secure), I'm not sure this is a fair comparison.
Would love to be refuted, however.
I've been growing my hair since shortly before the pandemic hit (from a few mm buzzcut which I had for years previously), and combining that with a mask, I'm apparently unrecognizable. I've run into people in the street that would recognize me in a heartbeat otherwise, but with the additional change of hair, people only recognize me when I pull my mask down. So humans aren't that amazing at recognition either.
A better example than face masks maybe is the recent controversy over Twitter's AI and Obama images (https://www.theverge.com/2020/9/20/21447998/twitter-photo-pr...).
A lot was made of racial issues, which is fine, but the broader issue is why subtle changes in photos, like cropping, should confuse things so completely.
The target piece (the focus of this HN thread) sums itself up this way:
"There’s a good set of params somewhere nearby.
When we start walking to it, we can’t ever get stuck along the way, because there are no local optima.
Once we’ve stumbled upon a good set of parameters, we’ll know it and we can just stop."
I think there's some useful insight there, but this is in many ways the definition of local optima. What I might argue is that because there are so many locations in high-dimensional space that will satisfy some classification goal, it's "easy" to find one that works with regard to some population that defines the model development space (training + test). However, that model development space/population is itself implicitly defined by a certain set of constraints - it's a subpopulation of some broader population. What you want to generalize to, in order to define overfitting, is broader. You can still not overfit to your model development space, but be overfitting with regard to some broader set of possible inputs.
Whether or not the constraints of the model development population/space are important and reasonable considerations -- e.g., in your argument, not having access to things like body language etc -- is maybe a little variable. In some cases the implicit defining characteristics of the model development population are meaningful, but in other cases they're hidden.
In Twitter's case, you end up finding out later that there are weird things that probably defined the space of their images that they didn't intend. It's only in the adversarial case that you learn about this.
In classical statistics, you talk about generalization and overfitting, but there's an implicit population you're sampling from that defines both of those things. That is, you have a training/fitting/initial sample, and you ask yourself how well your model would perform on a test/validation sample. But implicit in that is some assumption about what it means to be a random sample from the same population.
I think lots of times with DL, the cross-validation/test sample is also implicitly defined as coming from some population. But the population isn't some model, it's some source. Some image dataset, something like that. There will be things about that source that are "of interest", but other things that are idiosyncratic about it, and unrepresentative of the "real" population of interest. In this way, I'm not sure that held-out samples from some source are really the right way to think of generalization and overfitting -- I think the adversarial setting is the generalization setting.
Along the way from the classical to the DL regime I think there have been some overlooked issues about what it means to generalize, what you're really sampling from, and what your "population" actually is. It parallels tensions about theory versus experimentation because having a population in the classical case that you're sampling from requires a certain data-generating theory, which is often lacking in DL. The closest thing in the classical regime to DL generalization theory is maybe a sort of ultra-high-dimensional bootstrapping with random effects: showing that your bootstrap samples are close to your observed sample isn't the same thing as showing they're close to the population, or to other samples drawn from that population, especially in the presence of random effects.
Many adversarial cases are good examples of this: a DL model being completely thrown off by something very incidental - not just something a human would instantly recognize as not being within a class, but something a human would be perplexed to see work as an adversarial case at all.
The point isn't that humans are better or worse, it's that the models do often seem to be overfitting, but overfitting in a way that isn't evident until the inputs are generalized beyond whatever is in the development samples. Put another way, they might be learning something about your development datasets more so than the actual features of interest, which is the whole idea of overfitting. It's just that what it means to "generalize" is much broader.
It's a really interesting piece but I think there's lots more to the story.
> there are a base of unspoken intuitions that underlie expert understanding of a field, that are never directly stated in the literature, because they can’t be easily proved with the rigor that the literature demands. And as a result, the insights exist only in conversation and subtext, which make them inaccessible to the casual reader.
I’d love to hear these intuitions from every field. Anyone got some?
We call this bullshit smoothing "calibration". If you're doing work on sensor data and don't have every calibration parameter, whether from configuration or magic factory numbers and statistical tolerances, someone, somewhere is pulling the wool over the eyes of the software guy downstream that works with the final data.
Ever looked at weather data from two separate apps and seen values vary by multiple degrees? Two different pipelines just sprinkled their own interpretations on top of raw sewage data.
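To ground the jargon: "calibration" often reduces to something as mundane as this (a generic sketch; the gain and offset values are invented for illustration):

```python
def calibrate(raw_reading, gain=1.003, offset=-0.42):
    """Map a raw sensor count to engineering units. The gain and offset
    here are made up; in practice they come from a config file or
    per-device factory constants -- the 'magic numbers' the software
    guy downstream never gets to see."""
    return gain * raw_reading + offset

print(calibrate(21.7))  # two pipelines with different constants disagree
```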
Actually this is because surface temperatures really do have this kind of variation.
So you do not look at just a Gaussian integral, you look at a quotient of Gaussian integrals.
Perhaps there is a similar idea. Perhaps there is some sort of renormalisation that would make neural networks work better. Even if your neural network is infinite-dimensional, it still makes sense to talk about things relative to some surface.
There is a bit at the start about how people in statistical learning throw their hands up at deep learning etc., but none of that makes sense to me. Neural nets are an idea as old as AI - even older, in fact. The need for deeper networks was well understood by the 1980s. There are well-known results about feedforward neural nets with arbitrary hidden units being universal function approximators. Why shouldn't deep learning work?
However, as usual there is confusion about what "generalisation" means. For example, I was at a summer school at Oxford a couple of years ago, and one of the lecturers made a similar point about the surprising generalisation ability of deep neural nets. I approached the lecturer after the lecture and asked what they meant, because the way I knew it, neural nets can't generalise. They explained that they meant neural nets generalise surprisingly well on the test set, but not on unseen or out-of-distribution data, i.e. not on any data that was not available to the researcher during training (as training, validation or test set).
In other words, neural nets are great at "generalisation" in the sense of interpolation, but are almost completely incapable of "generalisation" in the form of extrapolation.
I like to quote Francois Chollet of Keras on this:
This stands in sharp contrast with what deep nets do, which I would call "local generalization": the mapping from inputs to outputs performed by deep nets quickly stops making sense if new inputs differ even slightly from what they saw at training time. Consider, for instance, the problem of learning the appropriate launch parameters to get a rocket to land on the moon. If you were to use a deep net for this task, whether training using supervised learning or reinforcement learning, you would need to feed it with thousands or even millions of launch trials, i.e. you would need to expose it to a dense sampling of the input space, in order to learn a reliable mapping from input space to output space. By contrast, humans can use their power of abstraction to come up with physical models—rocket science—and derive an exact solution that will get the rocket on the moon in just one or few trials. Similarly, if you developed a deep net controlling a human body, and wanted it to learn to safely navigate a city without getting hit by cars, the net would have to die many thousands of times in various situations until it could infer that cars are dangerous, and develop appropriate avoidance behaviors. Dropped into a new city, the net would have to relearn most of what it knows. On the other hand, humans are able to learn safe behaviors without having to die even once—again, thanks to their power of abstract modeling of hypothetical situations.
In short, if the point of the article is that neural networks "work" because they generalise in the sense of extrapolation, then that's not right.
Would be interested in OP's take on this though.
"Why does deep and cheap learning work so well?" by Max Tegmark et al.
Formal proofs and mathematics are essential, but can become a distraction from the end goal. It is like playing chess by going after your opponent's pawns instead of their king. I would say modern machine learning has become tantamount to experimental physics, and this article is written from the perspective of a string theorist.
Stop talking about minima. Stop talking about how your optimization algorithm behaves around a minimum. Nobody ever trains their model remotely close to convergence. ADAM doesn’t even provably converge. All real models diverge! You are nowhere close to a minimum! Stop talking about minima already goddamnit! Why even think about minima?! Minima are a myth! Everybody proves their results for minima of a convex function.
15 years ago, voice recognition was all hidden Markov model based. The data sets were limited, in part due to the cost of collection, but mostly because larger datasets didn't significantly improve accuracy.
As algorithms improved, larger datasets became more important as more data did actually improve accuracy.
It's improved a bit, but it's still miles and miles away from any half-decent human translation.
Plus, is it really that much better? I still get weirdness out of Google Translate.
They're still not perfect, and sure, you get weirdness, but it has become significantly better according to all metrics, including human comparisons of different translation aspects, which are expensive/rare to do but have been done quite a few times; there's a clear consensus that deep learning "works" for MT.
There are certain niches where other methods may still be better (IIRC languages with very little data, and translation of specific 'controlled language' domains), but for mainstream MT I think that nowadays nobody would decide to build a non-neural system.
Here's a blogpost by google showing scores creeping up since 2009 https://ai.googleblog.com/2020/06/recent-advances-in-google-...
Also I agree that I don't recall it being awful like the parent poster suggests but maybe only because I reserve 'awful' for the type of rule based machine translation used before statistical approaches came onto the scene.