Yes, but the second half dealing with iterative algorithms is pretty much useless, especially the chapter on conjugate gradients, which is the one most relevant here. There isn't really a book that offers as easy an intro to iterative algorithms as Trefethen does for direct ones. In this case I would recommend falling back on trusty old Golub and Van Loan (recently released in a new edition).
I must be confused, but isn't stochastic gradient descent a bit of an oxymoron? A stochastic search space implies independent points in the search space, so the Lipschitz condition doesn't hold.
Does gradient descent apply to problems with many local extrema, or is it restricted to purely globally convex problems?
Stochastic in this case refers to how the gradient is computed. Instead of computing the gradient over the entire data set and making a single update per pass over the data, in SGD you compute the gradient and update the parameters for each data point. This means noisier (hence stochastic) estimates of the gradient but generally faster convergence.
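To make that concrete, here's a minimal sketch of per-point SGD on a simple least-squares objective (the setup and names are just for illustration, not from any particular library):

```python
import numpy as np

def sgd(X, y, lr=0.01, epochs=10):
    """Per-point SGD on 0.5 * (x_i . w - y_i)^2 (illustrative sketch)."""
    rng = np.random.default_rng(0)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(X)):      # one noisy update per data point
            grad = (X[i] @ w - y[i]) * X[i]    # single-point gradient estimate
            w -= lr * grad
    return w
```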
Gradient descent is a general purpose method. There are generally two approaches to problems with many local minima:
- modify the algorithm to try to get out of minima (e.g. simulated annealing, evolutionary methods, etc.; see the sketch after this list); or
- make some simplifying assumptions and solve an approximate problem that is convex.
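For the first flavor, here's a toy sketch of a simulated-annealing-style search (the objective f, the proposal scale, and the cooling schedule are all placeholder choices):

```python
import math
import random

def anneal(f, x0, temp=1.0, cooling=0.95, steps=1000, scale=0.1):
    """Toy simulated annealing: occasionally accept uphill moves to escape local minima."""
    x, fx = x0, f(x0)
    for _ in range(steps):
        cand = x + random.gauss(0, scale)      # random proposal near x
        fc = f(cand)
        # always accept downhill moves; accept uphill ones with prob exp(-delta/temp)
        if fc < fx or random.random() < math.exp(-(fc - fx) / temp):
            x, fx = cand, fc
        temp *= cooling                        # cool down over time
    return x, fx
```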
Stochastic gradient descent is a "crappy" method that works. In theory it ought to be slow to converge, and it often involves annoying tuning parameters because you typically want to slow learning as the iterations proceed. In practice, though, it works quite well. Quite a large number of successful neural-net/deep-learning implementations use SGD rather than second-order (quasi-Newton) or even momentum methods.
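To illustrate the tuning annoyance, here's a hedged sketch of one common way to slow learning over iterations, a 1/t-style decay (lr0 and decay are exactly the knobs you end up fiddling with):

```python
def decayed_lr(t, lr0=0.1, decay=0.01):
    """1/t-style learning-rate schedule: start at lr0, shrink as iterations grow."""
    return lr0 / (1.0 + decay * t)

# e.g. the step size used at iteration t:
#   w -= decayed_lr(t) * grad
```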
Pointwise SGD tends to be uncommon and too noisy. Mini-batch sampling is more common: each iteration, you train on a bootstrap-style subset (say, 10,000 points chosen with or without replacement from the full dataset). If you have fewer than 10k data points, you generally don't need a stochastic algorithm at all.
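A rough sketch of the mini-batch variant, assuming the same least-squares setup as the per-point example above (the batch size is arbitrary):

```python
import numpy as np

def minibatch_sgd(X, y, lr=0.01, batch_size=10_000, steps=100):
    """Mini-batch SGD: average the gradient over a random subset each step."""
    rng = np.random.default_rng(0)
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        idx = rng.choice(len(X), size=min(batch_size, len(X)), replace=False)
        Xb, yb = X[idx], y[idx]
        grad = Xb.T @ (Xb @ w - yb) / len(idx)   # averaged least-squares gradient
        w -= lr * grad
    return w
```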
> Does gradient descent apply to problems with many local extrema, or is it restricted to purely globally convex problems?
There are a variety of approaches:
* Multiple starting points. Often gets the job done and finds a good local minimum. The idea is that, usually, better (lower) local minima will have larger drainage regions, a sort of "deep is usually broad" hypothesis that is often but not always true.
* Momentum. Instead of moving with a velocity proportional to the gradient, you move with a velocity proportional to a smoothed average of previous gradients (see the sketch after this list). This makes it far less likely that you end up at a bad local minimum or some other stationary point (at a local maximum or saddle point, the gradient is zero).
* Other stochastic devices such as simulated annealing or (for neural nets) dropout.
* Dimensionality reduction and regularization (weight decay) don't necessarily solve the "multiple local minima" problem, but they often make the search space smaller and the problem, therefore, more tractable. (The main purpose of DR and regularization, however, is to improve generalization performance.)
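As referenced in the momentum bullet above, here's a minimal sketch of the classical momentum update (beta and the function names are illustrative, not from any particular library):

```python
import numpy as np

def momentum_step(w, v, grad, lr=0.01, beta=0.9):
    """Classical momentum: velocity is an exponentially smoothed average of past gradients."""
    v = beta * v + grad      # accumulate smoothed gradient history
    w = w - lr * v           # move along the smoothed direction
    return w, v

# Usage: carry v across iterations, starting from zeros:
#   v = np.zeros_like(w)
#   for each step: w, v = momentum_step(w, v, compute_grad(w))
```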
Ah, but descent is such a luxury! Zen comes only from knowing when it is time to ascend the gradient, there to find new descending vistas that make the old valley look like a mountain!
Would that all problems were so conveniently convex.
As a wise woman once said to me, "You can approximately solve the exact problem, or exactly solve an approximation of the problem." Truly these were wise words, and wiser yet are those who can choose correctly between these two (meta-)paths.
downvoter(s) -- this is a subtle, poignant, and (IMHO) poetically written joke about the prevalence of non-convexity in real-world optimization problems.