However, there are many papers that explore various ways to make a network learn, and they keep improving on performance, suggesting they're on to something. There are also many papers that discuss possible theoretical implications of experimental results.
But what does Knowing Why The Network Works mean exactly? One answer is "it works because of universal approximation and gradient descent", but that's not very satisfactory. A longer one is "it works because it starts at a general solution and, over the course of many iterations, takes many small steps in an ever-changing direction defined by an approximate gradient, which is computed from the average error between the network's outputs and the target outputs (an error that should trend towards 0)".
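To make the longer description concrete, here is a minimal sketch of that loop on a toy problem: fitting a line with mini-batch gradient descent. The function name, the data, and the hyperparameters are all made up for illustration, and a real network differs in scale and in the shape of the error surface, not in the basic loop.

```python
import random

def sgd_linear_fit(xs, ys, lr=0.05, steps=2000, batch_size=8, seed=0):
    """Fit y ~ w*x + b by mini-batch stochastic gradient descent (toy sketch)."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0  # start at a generic solution
    n = len(xs)
    for _ in range(steps):
        # A random mini-batch gives only an approximation of the true gradient,
        # so the step direction changes from iteration to iteration.
        batch = [rng.randrange(n) for _ in range(batch_size)]
        grad_w = 0.0
        grad_b = 0.0
        for i in batch:
            err = (w * xs[i] + b) - ys[i]  # output minus target
            grad_w += 2.0 * err * xs[i]    # d(err^2)/dw
            grad_b += 2.0 * err            # d(err^2)/db
        # Average over the batch, then take one small step downhill.
        w -= lr * grad_w / batch_size
        b -= lr * grad_b / batch_size
    return w, b

# Hypothetical noiseless data generated from y = 3x + 1.
xs = [i / 10.0 for i in range(-20, 21)]
ys = [3.0 * x + 1.0 for x in xs]
w, b = sgd_linear_fit(xs, ys)
print(w, b)
```

On this convex toy problem the loop is guaranteed to get close to the true slope and intercept. The open question in the thread is why the same loop does well when the error surface is not convex.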
What would a satisfactory "why" even look like exactly? As in, what form might it take compared to some other scientific discipline where we do know what's going on?
Personally, I think the whole thing is a red herring -- people in the field have some idea of how neural nets work, and there are many disciplines considered by many to be mature sciences that are far from settled on a grand theoretical scale.
That said, the theory I'm most interested in is recent attempts to connect a memory module to neural networks so they can "learn" to store important/complex/distributed information that can be recalled with high accuracy later. That will make it easier to do things like ask a neural network to remember your name, or where you left your keys, or whatever.
To me this is obvious: a proof is an adequate answer. All of the explanations given in this area are heuristic reasons. They're nice, but we can't know if they're correct or if we're being fooled by our intuition without a proof.
What you get instead of local minima are saddle points, where some dimensions are curving up and some are curving down. Saddle points can also be problematic for optimization but they can be dealt with using fancy optimization techniques. For a more rigorous explanation, see http://arxiv.org/abs/1406.2572
ELI5 version: One can ask, what's the probability that a random person drawn from Earth's population (points in high dimensions) owns a Bugatti (is a local minimum)? It's very small, obviously. But that doesn't tell us anything about the probability of Bugatti ownership among select subsets of people (critical points).
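For anyone who wants to poke at the "some dimensions curving up, some curving down" picture, here is a small sketch that classifies a critical point of a two-variable function by the signs of its Hessian eigenvalues. The helper names, step size, and tolerance are my own choices for illustration. This is the textbook second-derivative test, not the method from the linked paper.

```python
import math

def hessian_2d(f, x, y, h=1e-3):
    """Central finite-difference estimate of the Hessian entries of f at (x, y)."""
    fxx = (f(x + h, y) - 2.0 * f(x, y) + f(x - h, y)) / (h * h)
    fyy = (f(x, y + h) - 2.0 * f(x, y) + f(x, y - h)) / (h * h)
    fxy = (f(x + h, y + h) - f(x + h, y - h)
           - f(x - h, y + h) + f(x - h, y - h)) / (4.0 * h * h)
    return fxx, fxy, fyy

def classify_critical_point(f, x, y, tol=1e-6):
    """Label a critical point using the signs of the two Hessian eigenvalues."""
    fxx, fxy, fyy = hessian_2d(f, x, y)
    # Closed-form eigenvalues of the symmetric matrix [[fxx, fxy], [fxy, fyy]].
    mean = 0.5 * (fxx + fyy)
    radius = math.hypot(0.5 * (fxx - fyy), fxy)
    lo, hi = mean - radius, mean + radius
    if lo > tol:
        return "local minimum"   # curving up in every direction
    if hi < -tol:
        return "local maximum"   # curving down in every direction
    if lo < -tol and hi > tol:
        return "saddle point"    # up in one direction, down in another
    return "degenerate"          # at least one eigenvalue is near zero

print(classify_critical_point(lambda x, y: x * x - y * y, 0.0, 0.0))  # saddle point
print(classify_critical_point(lambda x, y: x * x + y * y, 0.0, 0.0))  # local minimum
```

With n parameters there are n eigenvalues, and a critical point is a local minimum only if all n of them are positive, which is the intuition for why minima get rarer among critical points as n grows.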
An example I can think of would be an absurd million-input neural network, where one of the inputs only has a pronounced effect on one of the outputs. It seems like it would be possible for the path from that input to the output to be dragged downhill in the context of all outputs, but uphill in the context of the single output it affects.
Is what I've described not likely, or am I just completely off base?
If the intervals spanned by x,y,z do not overlap, then your statement does not hold unless local minima are evenly distributed over the entire region, which defeats the purpose of searching for minima in the first place. If the intervals spanned by x,y,z are identical, then gradient descent can be reduced to just searching over f(x) by symmetry. You are right that there are m^n minima, but they are found in (nm)^1 steps. If the intervals spanned by x,y,z partially overlap, you can further reduce the search space to just f(x) and the union of the intervals.
I don't see how your example shows that gradient descent is likely to fail for any classifier with a sufficiently high number of independent parameters. Am I missing something?
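Here is a rough sketch of the identical-intervals case, assuming the example under discussion is separable, i.e. F(x, y, z) = f(x) + f(y) + f(z) with the same f and the same interval for each variable. That assumption is mine, and so are the particular f and the grid. Under it, every combination of one-dimensional minima is a local minimum of F, so there are m^n of them, but the best one only needs a one-dimensional search.

```python
import math

def f(t):
    # Hypothetical 1-D function with a few local minima on [-3, 3].
    return math.sin(3.0 * t) + 0.1 * t * t

def local_minima_1d(values):
    """Indices of strict interior local minima in a list of samples."""
    return [i for i in range(1, len(values) - 1)
            if values[i] < values[i - 1] and values[i] < values[i + 1]]

grid = [i / 100.0 for i in range(-300, 301)]
vals = [f(t) for t in grid]

minima = local_minima_1d(vals)
m = len(minima)  # number of local minima along one coordinate
n = 3            # x, y, z

# Assumed separable F(x, y, z) = f(x) + f(y) + f(z): m**n local minima in total.
num_minima_nd = m ** n

# By symmetry, the global minimum of F repeats the best 1-D minimum n times,
# so a single scan over the 1-D grid is enough to locate it.
best_index = min(minima, key=lambda i: vals[i])
best_point = tuple(grid[best_index] for _ in range(n))

print(m, num_minima_nd, best_point)
```

None of this carries over to functions whose variables interact, which is the case that matters for real networks.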
Before deep learning happened, neural networks used to be popular for regression and interpolation problems. For a long time, our understanding of how they worked wasn't much better than our current understanding of how deep learning works.
In 1994, Radford Neal showed that weight-decay neural networks were in fact an approximation to Bayesian inference on a Gaussian process prior over the space of possible functions. Amongst other benefits, this allowed better approximations to be utilised, and essentially (along with the advent of SVMs) marked the end of the first neural network era.
Something like that is what I would consider a satisfactory "why".
I would say it means: Why do these algorithms seem to be doing "well" despite their well-known theoretical intractability? Is it something about the instances? Or is it simply a matter of scale (so the local optima they are getting stuck in are not as obvious)? Or is there something "deep" we do not understand yet about these algorithms (or the theory)?
Either way, there is something going on here for sure.