Similarly, Bayesian statistics (a.k.a. probabilistic programming, graphical models, Bayes networks, etc.) has a rock-solid mathematical foundation – both the statistics behind it and the sampling algorithms (MCMC).
Of course, many mathematical statistical methods started off as hacks too. The invention of the L2 loss function, I imagine, must've gone a little like "alright, let's try to find the pattern here, let's draw the line through this series of points that minimizes the sum of the distances from each point to the line... hm, I don't like how this looks, let's try squaring the distances instead and then minimize that sum so we punish large deviations from our line a bit more... oh, that looks better."
We later realized this worked so well because it's the maximum-likelihood solution for the model y = ax + b + e, where e is Gaussian noise. But that came much later.
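A minimal sketch of that equivalence (synthetic data, made up for illustration): minimizing the sum of squared residuals and maximizing the Gaussian log-likelihood of y = ax + b + e land on the same parameters, since the log-likelihood is just a negated, scaled sum of squared residuals plus a constant.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 200)
y = 2.0 * x + 1.0 + rng.normal(scale=0.1, size=x.size)

# Least squares: minimize sum((y - (a*x + b))**2) via the normal equations.
A = np.column_stack([x, np.ones_like(x)])
a_ls, b_ls = np.linalg.lstsq(A, y, rcond=None)[0]

# Gaussian MLE: for y = a*x + b + e with e ~ N(0, s^2), the log-likelihood is
# -n/2 * log(2*pi*s^2) - sum((y - a*x - b)**2) / (2*s^2), so for any fixed s
# maximizing it is the same as minimizing the squared residuals.
def nll(params):
    a, b = params
    return 0.5 * np.sum((y - a * x - b) ** 2)

a_ml, b_ml = minimize(nll, x0=[0.0, 0.0]).x
# a_ls ≈ a_ml and b_ls ≈ b_ml, up to numerical tolerance.
```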
Then again, "calculus works nicely" and "it's simple" probably are good guides toward useful models anyway.
|y - wx| isn't differentiable, but it is subdifferentiable. The derivative is defined at every point except where y = wx, and you're more or less fine if you just pretend that the derivative is zero at that point.
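A minimal sketch of that convention (function name hypothetical): one valid subgradient of L(w) = |y - wx|, returning zero at the kink where y = wx.

```python
def abs_subgradient(w, x, y):
    """One subgradient of L(w) = |y - w*x| with respect to w."""
    r = y - w * x
    if r > 0:
        return -x  # loss decreases as w*x grows toward y
    if r < 0:
        return x   # loss decreases as w*x shrinks toward y
    # At the kink the subdifferential is the whole interval [-|x|, |x|];
    # picking 0 is the "pretend the derivative is zero" choice.
    return 0.0
```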
One reason why squaring is preferred is that the derivative has a nice closed form, which can be used to find closed form solutions.
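For instance, in the one-parameter model y ≈ wx, setting the derivative of the squared loss to zero gives the closed form immediately. A minimal NumPy sketch (synthetic data for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=50)
y = 3.0 * x + rng.normal(scale=0.2, size=50)

# d/dw sum((y - w*x)**2) = -2 * sum(x * (y - w*x)) = 0
# solves in closed form to w = sum(x*y) / sum(x*x).
w = np.dot(x, y) / np.dot(x, x)
```

No iterative optimization needed; the same idea in matrix form gives the normal equations for multivariate regression.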
Another is that people like to fit their models with gradient descent, and gradient descent has better convergence guarantees for smooth, strongly convex loss functions. It also works better in practice. Intuitively: if you try to minimize x^2 by gradient descent, you have the largest gradients when you're far from the minimum, and they shrink as you approach it. If you try to minimize |x|, then your gradient is always either +1 or -1, making it harder to converge around the minimum. See the Huber loss for a smooth relaxation of the absolute value function.
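A minimal sketch of that intuition (function name hypothetical): the gradient of the Huber loss shrinks toward zero near the minimum like x^2, but stays bounded in the tails like |x|.

```python
def huber_grad(r, delta=1.0):
    """Gradient of the Huber loss at residual r."""
    if abs(r) <= delta:
        return r  # quadratic region: gradient shrinks near the minimum
    # linear region: gradient has constant magnitude delta, like |x|
    return delta if r > 0 else -delta
```

Gradient descent on this loss doesn't oscillate around the minimum the way it does on |x|, while large residuals still contribute only a bounded gradient.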
In 'machine learning', statistical learning theory by Vapnik and co (the theory behind SVMs) has a beautiful and strong mathematical underpinning. But in general, it is true we don't really have a good understanding of why learning algorithms work. The very notion of generalization to unobserved data is not well understood (I like D. Wolpert's papers on that topic).
It has been a while since I read those papers; there may be better references.
Gal and Ghahramani "We show that a neural network with arbitrary depth and non-linearities, with dropout applied before every weight layer, is mathematically equivalent to an approximation to the probabilistic deep Gaussian process" http://arxiv.org/pdf/1506.02142.pdf
There have been many hierarchical networks in Bayesian statistics. Hierarchical Dirichlet Process, hierarchical beta process, Pitman-Yor process, Gaussian process, nested, dependent, translated, generalized versions of the Chinese Restaurant process, or the Indian Buffet one. The deep part isn't new.
People seem to talk about auto-encoders a lot. Their very definition introduces hidden variables about which very little is specified. It corresponds to some kind of prior on what representation should be used by the network. This is studied in Bayesian compressive sensing. http://machinelearning.wustl.edu/mlpapers/paper_files/icml20...
4.) Layered training.
Layers are trained one by one. Doesn't really ring a bell. Maybe it corresponds to some MCMC method I'm not aware of. There are often inner and outer loops in MCMC, and I guess something fundamental can be said about the right moments to switch from inner to outer. However, that does not correspond to learning low-level features first.
More like so we can differentiate easily. Then if anyone asks, you wave your hands and mumble something along these lines :)