
The Zen of Gradient Descent - ot
http://mrtz.org/blog/the-zen-of-gradient-descent/
======
thearn4
The author mentions Trefethen-Bau, “Numerical Linear Algebra” as a good
reference. I couldn't agree more, it's a great resource.

~~~
rrrrtttt
Yes, but the second half dealing with iterative algorithms is pretty much
useless, especially the chapter on conjugate gradients which is the one most
relevant here. There isn't really a book that offers as easy an intro to
iterative algorithms as Trefthen does for direct ones. In this case I would
recommend to fall back on trusty old Golub and Van Loan (recently released in
a new edition).

------
ibuildthings
Throwing in my personal goto reference on gradient descent :
[http://www.cs.cmu.edu/~quake-papers/painless-conjugate-
gradi...](http://www.cs.cmu.edu/~quake-papers/painless-conjugate-gradient.pdf)
"An Introduction to the Conjugate Gradient Method Without the Agonizing Pain"
\- Jonathan Richard Shewchuk.

~~~
sitkack
This Shewchuk dude is one acceptable Canadian!

------
yosyp
I must be confused, but isn't stochastic gradient descent a bit of an
oxymoron? A stochastic search space implies independent points in the search
space, so the Lipschitz condition doesn't hold.

Does gradient descent apply to problems with many local extrema, or is it
restricted to purely globally convex problems?

~~~
michaelochurch
Stochastic gradient descent is a "crappy" method that works. Theoretically, it
ought to be slow to converge and often involves annoying tuning parameters
because you want to slow learning with each iteration. In practice, though, it
works quite well. Quite a large number of successful neural-net/deep-learning
implementations use SGD rather than second-order (quasi-Newton) or even
momentum methods.

Pointwise SGD tends to be uncommon and too noisy. Mini-batch sampling is more
common. Each iteration, you train on a bootstrap subset (say, 10,000 points
chosen with or without replacement from the full dataset). If you have less
than 10k data points, you generally don't need to use stochastic algorithms at
all.

 _Does gradient descent apply to problems with many local extrema, or is it
restricted to purely globally convex problems?_

There are a variety of approaches:

* Multiple starting points. Often gets the job done and finds a good local minimum. The idea is that, usually, better (lower) local minimums will have larger drainage regions, a sort of "deep is usually broad" hypothesis that is often but not always true.

* Momentum. Instead of moving with a velocity proportionate to the gradient, it's proportionate to a smoothed average of previous gradients. This makes it far less likely that you end up at a bad local minimum or some other stationary point (at a local maximum or saddlepoint, the gradient is zero).

* Other stochastic devices such as simulated annealing or (for neural nets) drop-out.

* Dimensionality reduction and regularization (weight decay) don't necessarily solve the "multiple local minima" problem but often make the search space smaller and the problem, therefore, more tractable. (The main purpose of DR and regularization, however, is to improve generalization performance.)

~~~
vegabook
Any favourite books/resources on the above?

------
vegabook
ah but descent is such a luxury! Zen comes only from knowing when it is time
to _ascend_ the gradient, there to find new descending vistas that made the
old valley look like a mountain!

would that all problems were so conveniently convex.

~~~
noelwelsh
As a wise woman once said to me, "You can approximately solve the exact
problem, or exactly solve an approximation of the problem." Truly these were
wise words, and wiser yet are those who can choose correctly between these two
(meta-)paths.

