I would guess that somewhere among the hundreds or more parameters to be adjusted, there would be at least two that would determine a bowl such as I described.
But, but, okay, I can guess: The neural network has so many parameters that not doing well on a few of them doesn't make much difference.
So, "it's in there", in the pot, garlic, tomato paste, olive oil, crushed tomatoes, oregano, basil, rosemary, a tin of anchovies, along with the daily stochastic mystery ingredient!
Check out this paper if you are interested: https://arxiv.org/abs/1806.06949
Disclaimer: I am an author and that version is somewhat out of date.
I typed in a few points and got a nice fit; the polynomial went exactly through the points. Hmm ... I couldn't be the first to think of that! So, I typed in points something like a square wave and thought, "Let's see a polynomial fit that!". Well, it did fit -- a polynomial fits a square wave -- gotta be kidding.
So, I plotted out the whole thing. Below, spoiler!!!!
Sure, the polynomial went through the points but between the points it shot off toward either positive infinity or negative infinity and made a lot of progress before turning around.
Then there are spline fits, least squares spline fits, multivariate spline fits, and ... ironically, neural network fits.
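For anyone who wants to reproduce the polynomial blow-up, here is a minimal sketch of the experiment (the sample values below are made up; the exact numbers don't matter), with a smoothing spline on the same points for comparison:

    import numpy as np
    from scipy.interpolate import UnivariateSpline

    # ten equally spaced points tracing a crude "square wave"
    x = np.arange(10, dtype=float)
    y = np.array([1, 1, 1, 1, 1, -1, -1, -1, -1, -1], dtype=float)

    # degree-9 polynomial through 10 points: an exact fit
    p = np.poly1d(np.polyfit(x, y, deg=len(x) - 1))

    # least squares smoothing spline on the same data, for comparison
    s = UnivariateSpline(x, y, k=3, s=0.5)

    xx = np.linspace(0.0, 9.0, 901)
    print(np.max(np.abs(p(x) - y)))  # ~0: the polynomial hits every point
    print(np.max(np.abs(p(xx))))     # well above 1: big swings between the points
    print(np.max(np.abs(s(xx))))     # stays roughly on the order of 1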
Neural networks are usually smooth because they must be differentiable for gradient-based training. Stochastic binary neurons and other techniques take us into a larger space of possible mappings, discontinuous ones.
But yeah, ultimately the model we solve for will affect what kind of interpolation the solution ends up doing.
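As a tiny illustration of the smooth-versus-discontinuous point (toy, untrained weights I made up; nothing here comes from a real model): with tanh hidden units the output varies smoothly with the input, while hard-thresholding the same hidden units gives a piecewise constant, i.e. discontinuous, mapping.

    import numpy as np

    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(size=(8, 1)), rng.normal(size=8)
    W2, b2 = rng.normal(size=(1, 8)), rng.normal(size=1)

    def smooth_net(x):
        # one hidden layer with tanh: differentiable everywhere
        return (W2 @ np.tanh(W1 @ x + b1) + b2).item()

    def binary_net(x):
        # same weights, hard-thresholded hidden units: output jumps in steps
        return (W2 @ np.sign(W1 @ x + b1) + b2).item()

    for v in np.linspace(-2.0, 2.0, 9):
        x = np.array([v])
        print(round(smooth_net(x), 3), round(binary_net(x), 3))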
The biggest question is what kind of interpolating techniques the brain uses. The search for the magic model....
I recall seeing some research on this topic but I got the impression (perhaps wrongly) it has to make so many simplifying assumptions to get a result that it’s not clear how much carries over to real systems where not all those assumptions hold.
The 'bowl' you described corresponds to these saddle points, and this 2014 paper was pretty big for tackling them. But, as others have pointed out, optimizers such as Adam are used more often than steepest descent.
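For concreteness, the textbook toy case of such a point (the standard illustration, not something from the 2014 paper): f(x, y) = x^2 - y^2 has zero gradient at the origin, but its Hessian has one positive and one negative eigenvalue, so the origin is neither a minimum nor a maximum.

    import numpy as np

    # Hessian of f(x, y) = x**2 - y**2 at the critical point (0, 0)
    hessian = np.array([[2.0, 0.0],
                        [0.0, -2.0]])
    print(np.linalg.eigvalsh(hessian))  # [-2.  2.]: mixed signs, so a saddle point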
There's a nice characterization of min, max, and saddle points in W. Fleming, Functions of Several Variables. It was a question I got asked once on an oral exam in optimization.
It's an old book; the results are still true! Likely most good research libraries have a copy. Photocopy a few pages and you'll have that little subject in good shape for a long time.
Fleming was long in the Division of Applied Math at Brown. He also wrote more advanced books, e.g., on optimal control.
He also does a lot on convexity, covers the inverse and implicit function theorems, and applies them to Lagrange multipliers. If you get very far in optimization, especially with constraints, then you may like Lagrange multipliers.
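A minimal worked example of the technique (my own toy problem, not something out of Fleming): maximize xy subject to x + y = 1. Setting the gradient of xy equal to lambda times the gradient of x + y gives x = y = lambda, and the constraint then forces x = y = 1/2 with lambda = 1/2. A quick numerical sanity check with SciPy's constrained minimizer:

    from scipy.optimize import minimize

    # maximize x*y subject to x + y = 1, i.e. minimize -x*y
    res = minimize(lambda p: -(p[0] * p[1]),
                   x0=[0.3, 0.7],
                   constraints={"type": "eq", "fun": lambda p: p[0] + p[1] - 1.0})
    print(res.x)  # approximately [0.5, 0.5], matching the hand calculation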
See also, elsewhere, the simpler but shockingly often useful Lagrange multiplier approach of Hugh Everett (right, THAT Everett, the quantum mechanics many-worlds guy, who got his Ph.D. in physics, IIRC under J. Wheeler at Princeton).
He did the Lagrange multiplier work on optimum allocation of resources questions in, e.g., military situations near DC. He started Lambda Corporation to do that work; lambda is commonly the symbol used for Lagrange multipliers.
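If I remember the gist of Everett's version correctly, it amounts to this (a toy sketch with invented payoff/cost numbers): for a fixed lambda, each activity independently picks the option that maximizes payoff minus lambda times resource cost; whatever total resource that allocation happens to consume, the allocation is optimal for that resource level, so you sweep lambda until the total fits the budget.

    # toy options per activity: (payoff, resource cost); the numbers are made up
    options = {
        "A": [(0.0, 0.0), (5.0, 2.0), (8.0, 5.0)],
        "B": [(0.0, 0.0), (4.0, 1.0), (6.0, 4.0)],
    }

    def allocate(lam):
        # each activity independently maximizes payoff - lam * cost
        picks = {k: max(opts, key=lambda pc: pc[0] - lam * pc[1])
                 for k, opts in options.items()}
        payoff = sum(p for p, _ in picks.values())
        cost = sum(c for _, c in picks.values())
        return picks, payoff, cost

    for lam in (0.5, 1.0, 2.0):
        print(lam, allocate(lam))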
At times I've guessed that there might be a good use of Lagrange multipliers for taxi ridesharing, where you want to maximize revenue from sharing but also use a path that is a solution to the relevant traveling salesman problem.
That took me a while to grok. At first glance it seems like an oxymoron.
That's an approximation, but it's arguably a second-order method, since it acknowledges the nonlinearity coming from the squared norm of f.
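If "that" refers to a Gauss-Newton style approximation (my reading; it isn't spelled out above), the point can be made explicit. For a least-squares objective,

    L(\theta) = \tfrac{1}{2}\,\lVert f(\theta)\rVert^{2}, \qquad
    \nabla L = J^{\top} f, \qquad
    \nabla^{2} L = J^{\top} J + \sum_i f_i \, \nabla^{2} f_i ,

and the approximation keeps only the J^T J term: it uses nothing beyond first derivatives of f, yet it retains the curvature that the squared norm induces, which is why one can argue it sits somewhere between a first-order and a second-order method.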
But reading the title, I was like "well... sure?". Bad title for this article.
> We propose Adam, a method for efficient stochastic optimization that only requires first-order gradients with little memory requirement. The method computes individual adaptive learning rates for different parameters from estimates of first and second moments of the gradients.
Most of the popular variants of SGD use approximations of the Hessian in one way or another.
Furthermore, I don't think that by "moments of the gradients" they actually mean second derivatives.
Also from the paper: "We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions..."
It's written right in the abstract that the authors consider it a first-order method.
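To make that concrete, here is a minimal sketch of the published Adam update (standard default hyperparameters; g is the stochastic gradient at step t). Both moments are statistics of the gradient itself, so nothing beyond first derivatives is ever computed:

    import numpy as np

    def adam_step(theta, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
        m = b1 * m + (1 - b1) * g        # first moment: running mean of the gradient
        v = b2 * v + (1 - b2) * g * g    # second moment: running mean of the squared
                                         # gradient (not second derivatives)
        m_hat = m / (1 - b1 ** t)        # bias correction for the zero initialization
        v_hat = v / (1 - b2 ** t)
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
        return theta, m, v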
But good theories are still few and far between.
'Backpropagation Is Just Steepest Descent'