
How to Escape Saddle Points Efficiently - mfagio
http://bair.berkeley.edu/blog/2017/08/31/saddle-efficiency/
======
cool_username
> Our sharp rate depends on a key observation — although we don’t know the
> shape of the stuck region, we know it is very thin.

Oh... really? :)

(After 12 years I finally get an excuse to show a fun side project I
coauthored during my PhD...)

[http://graemebell.net/pubs/taros05-bl-embedded-preprint.pdf](http://graemebell.net/pubs/taros05-bl-embedded-preprint.pdf)

Check out Figure 5 / Section 3.4

The rest of the paper is an introduction to why saddle points can be
surprisingly problematic for people using potential fields (neural nets,
game/robot navigation). Hope someone finds it interesting.

~~~
Eridrus
These results seem important for nonconvex optimization in general, but for ML
applications where we usually use stochastic/batch gradient descent, I wonder
if the stochasticity adds enough perturbation for this to not really be
that useful.

~~~
Cacti
Well, in general, no: there is no perturbation method large (or good) enough
to get out of all saddle points, including via stochastic gradient descent. It
might work on a particular problem, but not in general.

~~~
Eridrus
Fine, but I mostly care about ML applications, where I'm wondering if this is
expected to help at all.

~~~
Cacti
ML is far too broad a category. Like I said, it will depend on the
problem/function.

~~~
robrenaud
How does what you say agree or disagree with the authors' paper?

The authors bound the first and second derivatives (the Lipschitz
assumptions), which excludes the particularly nasty functions?

------
kazinator
Put a shell programmer in the saddle point, and give him or her half a dozen
backslashes; he or she will figure out how to escape in no time.

------
deepnotderp
It's interesting: I once tried selectively injecting perturbations for a short
period when the norm of the gradient was near zero, and it gave me a
consistent improvement.
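Roughly something like this toy sketch (illustrative only, on f(x, y) = x^2 -
y^2, which has a saddle at the origin; the hyperparameters are made up):

```python
import numpy as np

def grad(x):
    # gradient of f(x, y) = x^2 - y^2, which has a saddle point at the origin
    return np.array([2.0 * x[0], -2.0 * x[1]])

def perturbed_gd(x, lr=0.1, tol=1e-3, noise=1e-2, steps=200, seed=0):
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            # near a critical point: inject a small isotropic perturbation
            x = x + noise * rng.standard_normal(x.shape)
        else:
            x = x - lr * g
    return x

# started near the saddle, plain descent would stall; the kick escapes it
x = perturbed_gd(np.array([1e-4, 0.0]))
```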

~~~
AndrewOMartin
How do you select the average size/standard deviation of your perturbations?
Too small and you get no benefit, or very little speed-up; too large and you
lose the ability to sensibly optimise; and if you attempt to make an adaptive
function then you'll find yourself with another postdoc.

~~~
LolWolf
You could probably let the schedule of the perturbations be a sequence ε→0
such that ∑ε = ∞ and ∑ηε < ∞ with probability 1, where η is i.i.d. Bernoulli
on {-1, 1} (e.g. a square-summable sequence will do). I suspect this would be
a fairly good start.

Note that it should then, in general, be independent of the chosen parameter,
since the added noise converges in probability to a constant whenever the
distribution is close to uniform, but can sum to any real value with
consistent bias.
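For instance (a toy sketch of such a schedule: ε_t = 0.1/t is square-summable,
the signs are i.i.d. Rademacher, and the saddle function is just for
illustration):

```python
import numpy as np

def grad(x):
    # gradient of f(x, y) = x^2 - y^2 (saddle at the origin)
    return np.array([2.0 * x[0], -2.0 * x[1]])

rng = np.random.default_rng(1)
x = np.array([1e-6, 0.0])  # start almost exactly on the saddle
lr = 0.05
for t in range(1, 301):
    eps = 0.1 / t                                # eps -> 0, sum(eps) = inf, sum(eps^2) < inf
    eta = rng.choice([-1.0, 1.0], size=x.shape)  # i.i.d. signs
    x = x - lr * grad(x) + eps * eta             # descent step plus decaying kick
```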

------
petters
Many nonconvex problems are solved with more sophisticated methods, like
L-BFGS. Are perturbations still a good thing?

~~~
andbberger
Such methods are usually not practical for deep learning

~~~
traviscj
Jorge Nocedal begs to differ:
[http://users.iems.northwestern.edu/~nocedal/publications.htm...](http://users.iems.northwestern.edu/~nocedal/publications.html)

> This leads to a discussion about the next generation of optimization methods
> for large-scale machine learning, including an investigation of two main
> streams of research on techniques that diminish noise in the stochastic
> directions and methods that make use of second-order derivative
> approximations.

Disclosure: I was his student for a while and in his lab for my whole PhD. My
takeaway from studying under him was that even very very approximate hessian
information (either the Hessian itself or the solves) is enough for pretty
amazing convergence rates, in terms of flops / time.
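A toy illustration of why even crude curvature information helps (not any of
the methods in those papers, just the core fact): at a saddle the gradient is
exactly zero, but a negative Hessian eigenvalue still hands you a downhill
direction:

```python
import numpy as np

def f(x):
    return x[0] ** 2 - x[1] ** 2   # saddle at the origin

x = np.array([0.0, 0.0])           # exactly at the saddle: the gradient vanishes
H = np.array([[2.0, 0.0],          # Hessian of f (constant for this quadratic)
              [0.0, -2.0]])
w, V = np.linalg.eigh(H)           # eigenvalues in ascending order: w[0] = -2
if w[0] < 0:                       # negative curvature detected
    x = x + 0.5 * V[:, 0]          # step along its eigenvector: f strictly decreases
```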

------
T_D_K
I'm getting stuck trying to understand the equations in assumptions 1 and 2.
Can anyone point me to a resource that explains the idea behind them?
Wikipedia is a bit terse, and I'm not having any luck googling for "gradient-
lipschitz and hessian-lipschitz" and variations.

On the notation side, am I correct in thinking that ∇f(x_n) is the partial
derivative w.r.t. x_n? And that the elements of the vector x are the
parameters against which a "cost function" (f) is computed? But that doesn't
seem right. Maybe x_n is a point in R^N, and therefore ∇f(x_n) is the
derivative at that point?

~~~
andars
∇f(x_1) is the gradient of f evaluated at x_1, a point in R^N.

The first equation indicates that for any two points in R^N, the maximum norm
of the difference in gradient is less than a constant times the distance
between the points.

The keyword to google for is just "Lipschitz".

~~~
T_D_K
Ok. I stared at it long enough, and I think I understand. Being Lipschitz-
continuous means (in a non-rigorous way?) that the gradient / slope of a
function has an upper bound. And Hessian-Lipschitz means the same, but for the
second derivative / Hessian.

So, f(x) = x^2 is not Lipschitz-continuous (because the slope gets arbitrarily
large), but something like f(x) = sin(x) is Lipschitz-continuous because the
slope never exceeds some upper bound.
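A quick numerical sanity check of that intuition (sampled secant slopes over a
grid; hand-rolled, not rigorous):

```python
import numpy as np

xs = np.linspace(-10, 10, 2001)  # grid with spacing 0.01

def max_slope(f):
    # largest |f(b) - f(a)| / |b - a| over adjacent grid points
    ys = f(xs)
    return np.max(np.abs(np.diff(ys) / np.diff(xs)))

max_slope(np.sin)            # just under 1, no matter how wide the interval
max_slope(lambda x: x ** 2)  # about 20 here, and it grows with the interval
```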

Funny how trying to write down the question gives the brain the kick it needs
sometimes :)

~~~
Choco31415
That's correct.

In some situations, it's enough to prove a function is Lipschitz continuous
on a range. For example, y = x^2 is Lipschitz continuous on x ∈ [0, 1].

------
dnautics
Has anyone experimented with other optimization techniques like particle swarm
optimization? Usually this technique is inefficient but one could adaptively
kick off extra particles into the swarm as the gradient starts levelling off,
and the direction of more optimal particles could be used to inform the
gradient, moving forward.

------
Demiurge
I'm not an AI expert, but I'm decent with computers. Could anyone explain what
this is about?

~~~
sumitgt
I'm not an AI expert either, but let me give this a try.

I assume you are vaguely familiar with gradient descent. In gradient descent,
we are basically trying to find the sweet spot where the value of a function
is minimized. We do this by calculating the derivative of the function at a
certain point and then using it to take small steps in the direction where we
believe the function will have a lower value.

Gradient descent usually suffers from a problem where the algorithm gets stuck
in local minima if the function is not convex in shape.

However, when people use gradient descent to optimize functions with a very
large number of parameters (as is the case in Deep Learning), another problem
surfaces: saddle points. Imagine a 3-dimensional plot of the function at
different values of its parameters (in reality the plot will be higher-
dimensional). On this plot, there will be many regions where the derivatives
of the components defining the surface become zero. This messes with our plan
to use derivatives to find the direction in which to move. So we need to come
up with strategies to escape saddle points during the gradient descent
process.
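A tiny illustration of that last point (my own toy example, not from the
post): on f(x, y) = x^2 - y^2, plain gradient descent started on the ridge
y = 0 walks straight into the saddle at the origin and stays there, even
though moving in y would keep decreasing f:

```python
import numpy as np

def grad(x):
    # gradient of f(x, y) = x^2 - y^2
    return np.array([2.0 * x[0], -2.0 * x[1]])

x = np.array([1.0, 0.0])   # y = 0: the saddle's escape direction is never touched
for _ in range(100):
    x = x - 0.1 * grad(x)  # x[0] -> 0, x[1] stays exactly 0: stuck at the saddle
```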

~~~
whatidonteven
How can a non-linear function even be convex in shape? I assume you mean the
whole volume below or above the function and not just the function's surface
itself?

Also, what about the case where the function isn't continuous or where it's
not defined everywhere (the surface has holes)?

~~~
wenc
> How can a non-linear function even be convex in shape?

I'm not sure what you mean. Apart from the linear case (which is weakly
convex), most convex functions are non-linear. So yes, it is not only
possible, it is the norm (in a colloquial sense). Refer to this for a
mathematical definition of convexity:
[https://en.wikipedia.org/wiki/Convex_function](https://en.wikipedia.org/wiki/Convex_function)

> Also, what about the case where the function isn't continuous or where it's
> not defined everywhere (the surface has holes)?

There are two different cases:

1) Discontinuous functions: these are by definition nonconvex e.g. step
functions. Gradient-descent methods cannot handle these directly; typically
they are modeled as mixed-integer problems.

2) Non-smooth functions: these may be convex but do not have derivatives
defined everywhere, e.g. abs(x). Gradient-descent methods don't work well on
these types of functions; they typically require subgradient/bundle methods,
or can be modeled as discontinuous functions.
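A minimal subgradient-descent sketch on abs(x), with the standard diminishing
step size (illustrative only):

```python
def subgrad(x):
    # a valid subgradient of |x|; at x = 0 anything in [-1, 1] works, we pick 0
    return 1.0 if x > 0 else (-1.0 if x < 0 else 0.0)

x = 5.0
for t in range(1, 1001):
    x = x - (1.0 / t) * subgrad(x)  # step sizes sum to infinity but shrink to 0
# x ends up oscillating ever closer to the minimizer at 0
```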

~~~
whatidonteven
Ah, gotcha, so it's not the graph of the function that convexity refers to but
the volume above the graph of the function.

~~~
wenc
Well, no, in this scenario, it is actually the _function_ (in your words, the
graph of the function) that is convex. A 2D example would be y = x^2 (a
parabola), which is a convex function. A 3D example would be a paraboloid
function, which is also a convex function.

The "volume" (or "area" in the 2D case) above the graph is called an
_epigraph_.

One property of convex _functions_ is that their _epigraphs_ are convex _sets_
(note the word "sets" this time).
[https://en.wikipedia.org/wiki/Epigraph_(mathematics)](https://en.wikipedia.org/wiki/Epigraph_\(mathematics\))

Convex sets are more abstract in meaning, but in general it means you can draw
a straight line between any two points in the region without going outside of
the region.
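That chord condition can even be checked numerically on samples (a sampled
test, not a proof; the functions and grids here are just for illustration):

```python
import itertools

def looks_convex(f, points, ts):
    # chord test: f(t*a + (1-t)*b) <= t*f(a) + (1-t)*f(b) for all sampled a, b, t
    return all(
        f(t * a + (1 - t) * b) <= t * f(a) + (1 - t) * f(b) + 1e-12
        for a, b in itertools.product(points, points)
        for t in ts
    )

pts = [v / 10 for v in range(-30, 31)]   # grid on [-3, 3]
ts = [k / 10 for k in range(11)]         # t in {0, 0.1, ..., 1}
looks_convex(lambda x: x * x, pts, ts)   # True: the parabola is convex
looks_convex(lambda x: x ** 3, pts, ts)  # False: the cubic violates the chord test
```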

Perhaps your notion of convexity comes from a mental image of the shapes of
convex and concave lenses? Those are good visualizations, but in mathematics
convexity has a subtler, more rigorous meaning. With this rigorous meaning
come many nice mathematical properties that make convex functions easier to
optimize than nonconvex ones.

