Does gradient descent apply to problems with many local extrema, or is it restricted to purely globally convex problems?
Gradient descent is a general purpose method. There are generally two approaches to problems with many local minima:
- modify the algorithm to try to get out of minima (e.g. simulated annealing, evolutionary methods, etc.); or
- make some simplifying assumptions and solve an approximate problem that is convex.
Pointwise SGD tends to be uncommon and too noisy. Mini-batch sampling is more common. Each iteration, you train on a bootstrap subset (say, 10,000 points chosen with or without replacement from the full dataset). If you have less than 10k data points, you generally don't need to use stochastic algorithms at all.
There are a variety of approaches:
* Multiple starting points. Often gets the job done and finds a good local minimum. The idea is that, usually, better (lower) local minimums will have larger drainage regions, a sort of "deep is usually broad" hypothesis that is often but not always true.
* Momentum. Instead of moving with a velocity proportionate to the gradient, it's proportionate to a smoothed average of previous gradients. This makes it far less likely that you end up at a bad local minimum or some other stationary point (at a local maximum or saddlepoint, the gradient is zero).
* Other stochastic devices such as simulated annealing or (for neural nets) drop-out.
* Dimensionality reduction and regularization (weight decay) don't necessarily solve the "multiple local minima" problem but often make the search space smaller and the problem, therefore, more tractable. (The main purpose of DR and regularization, however, is to improve generalization performance.)
would that all problems were so conveniently convex.