My thinking is that if this line of reasoning is true, it suggests we may as well be using kernel methods or other high-dimensional feature-generation systems with better algebraic properties, which we might be able to train faster or interpret more easily. My understanding is that those techniques haven't been especially competitive with deep neural nets on large data sets recently, which implies that the high-dimensionality argument for gradient descent is insufficient to explain the success of deep neural nets.
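To make the "better algebraic properties" point concrete: kernel ridge regression, for instance, has a convex training objective with a closed-form solution, so there are no local minima to worry about at all. A minimal sketch in plain numpy (the kernel, data, and regularization constant here are all arbitrary illustrative choices, not anything from the thread):

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

# Toy regression data (arbitrary).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)

# Kernel ridge regression: the objective is convex, so the global optimum is
# the unique solution of a single linear system -- no gradient descent, no
# local minima.
lam = 1e-2
K = rbf_kernel(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)

X_test = np.linspace(-3, 3, 5)[:, None]
print(rbf_kernel(X_test, X) @ alpha)
```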
Take any smooth function f and consider the multi-dimensional g(x, y, z, ...) = f(x) + f(y) + f(z) + ... If f has m local minima, then the n-dimensional version of g has m^n local minima.
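A quick numerical illustration of the counting argument (the particular f below is an arbitrary assumption, chosen only because it has several local minima on the interval):

```python
import numpy as np

# Arbitrary 1-D smooth function with several local minima (illustrative choice).
f = lambda x: np.sin(3 * x) + 0.1 * x ** 2

# Count local minima of f on a grid over [-5, 5].
xs = np.linspace(-5, 5, 10001)
ys = f(xs)
is_min = (ys[1:-1] < ys[:-2]) & (ys[1:-1] < ys[2:])
m = int(is_min.sum())

# For the separable sum g(x1, ..., xn) = f(x1) + ... + f(xn), a point is a
# local minimum exactly when every coordinate sits at a local minimum of f,
# so the count is m**n.
for n in (1, 2, 3, 5):
    print(f"n={n}: {m}^{n} = {m ** n} local minima of g")
```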
You are correct that there will be m^n minima, but since your surface is highly symmetric, local minima are trivially easy to find. Are you simplifying the statement by glossing over the role of the intervals?
If the intervals spanned by x, y, z do not overlap, then your statement does not hold unless local minima are evenly distributed over the entire region, which defeats the purpose of searching for minima in the first place. If the intervals spanned by x, y, z are identical, then gradient descent reduces by symmetry to just searching over f(x). You are right that there are m^n minima, but they are all found in (nm)^1 steps: one independent 1-D search over f per coordinate. If the intervals spanned by x, y, z partially overlap, you can still reduce the search space to just f(x) over the union of the intervals.
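A sketch of that decomposition (the f, step size, and step count are arbitrary assumptions): because the gradient of the separable g is just (f'(x1), ..., f'(xn)), gradient descent on g updates each coordinate exactly as an independent 1-D descent on f would, so nothing new happens as n grows.

```python
import numpy as np

f  = lambda x: np.sin(3 * x) + 0.1 * x ** 2   # arbitrary illustrative f
df = lambda x: 3 * np.cos(3 * x) + 0.2 * x    # its derivative

def descend(x0, lr=0.01, steps=2000):
    """Plain gradient descent; operates elementwise on an array of coordinates."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = x - lr * df(x)   # grad of g(x1,...,xn) is (f'(x1), ..., f'(xn))
    return x

# Descending the n-dimensional g from a random start is identical to running
# n independent 1-D descents on f -- the coordinates never interact.
rng = np.random.default_rng(0)
x0 = rng.uniform(-5, 5, size=4)
print(descend(x0))                          # n-dimensional run
print([float(descend(xi)) for xi in x0])    # same result, coordinate by coordinate
```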
I don't see how your example shows that gradient descent is likely to fail for any classifier with a sufficiently high number of independent parameters. Am I missing something?
I was just tossing out an example of a higher-dimensional function and counting the local minima. The argument that critical points tend to be saddle points rather than minima seems irrelevant if the number of critical points grows exponentially for a typical higher-dimensional function.
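For what it's worth, in the separable example the two observations are compatible: the minima and the saddles both multiply. If f has c critical points of which m are minima, then g has c^n critical points of which only m^n are minima, so the fraction of critical points that are minima, (m/c)^n, still vanishes exponentially. A rough sketch (same arbitrary f as above; g's Hessian is diagonal, so classifying a critical point reduces to the signs of f'' at each coordinate):

```python
import numpy as np

f   = lambda x: np.sin(3 * x) + 0.1 * x ** 2   # arbitrary illustrative f
df  = lambda x: 3 * np.cos(3 * x) + 0.2 * x    # f'
d2f = lambda x: -9 * np.sin(3 * x) + 0.2       # f''

# Locate f's critical points on [-5, 5] via sign changes of f'.
xs = np.linspace(-5, 5, 100001)
crit = xs[np.nonzero(np.diff(np.sign(df(xs))))[0]]
minima = crit[d2f(crit) > 0]   # f'' > 0 at a critical point -> 1-D minimum

c, m = len(crit), len(minima)
for n in (2, 5, 10, 20):
    # g's Hessian is diag(f''(x1), ..., f''(xn)), so a critical point of g is a
    # minimum iff every coordinate is a 1-D minimum of f: m**n out of c**n.
    print(f"n={n}: fraction of critical points that are minima = {(m / c) ** n:.2e}")
```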