It would be interesting to see what the purists arrive at

 Well, a good example would be linear regression. We know pretty much all there is to know about its asymptotic behavior, and can prove that its results are "maximum likelihood estimates."Similarly, Bayesian statistics (a.k.a. probabilistic programming, graphical models, Bayes networks etc.) has a rock solid mathematical foundation – both the statistics behind it and the sampling algorithms (MCMC).Of course, many mathematical statistical methods started off as hacks too. The invention of the L2 loss function, I imagine, must've gone a little like "alright, let's try to find the pattern here, let's draw the line through this series of points that minimizes the sum of the distance of each point to the line... hm, I don't like how this looks, let's try squaring the distances instead and then minimize that sum so we punish large deviations from our line a bit more... oh, that looks better."
 Actually the squaring was for a more practical reason. Lets try to minimize the sum of the distance, oops |y-ax| is not differentiable so we can't use calculus, lets square it instead.We later realized this worked so well because it's max likelihood of the model y=ax+b+e where e is a gaussian error function. But that came much later.
 In grad school I asked that so many times trying to figure out why it was the "right" solution and this historical cheat caused so much cognitive dissonance.Then again, on the other hand, "calculus works nicely" and "it's simple" probably are good spooky directors toward useful models anyway.
 It's not really a cheat, as it produces the MLE (i.e. most likely) estimate under certain assumptions (e.g. errors are normally distributed, which occurs naturally if the errors are large sums of many unrelated measurements errors).
 Oh, I'm aware. It's just hard to always justify it and especially hard to do so in historical context.
 I don't think that this is quite the right explanation.|y - wx| isn't differentiable but it is subdifferentiable. So the derivative is defined at every point except where y = wx, and in that case you're more or less fine if you just pretend that the derivative is zero at that point.One reason why squaring is preferred is that the derivative has a nice closed form, which can be used to find closed form solutions.Another is that people like to fit their models with gradient descent, and gradient descent has better guarantees for strongly convex loss functions. It also works better in practice. Intuitively: if you try to minimize x^2 by gradient descent, you have the largest gradients when you're far from the minimum. If you try to minimize |x|, then your gradient is always either +1 or -1, making it harder to converge around the minimum. See the Huber loss for a strongly convex relaxation of the absolute value function.
 Regarding L2, I think it is just the most convenient way to do the maths from a set of observations (differentiability). Later one, the link with maximum likelihood-based methods was made, by Gauss.In 'machine learning', structured learning by Vapnik and co (theory behind SVM), has a beautiful and strong mathematical underpinning. But in general, it is true we don't really have a good understanding of why learning algorithms work. The very notion of generalization to unobserved data is not well understood (I like D. Wolpert papers on that topic).
 >The very notion of generalization to unobserved data is not well understood (I like D. Wolpert papers on that topic).Which papers?
 IIRC, that one was a decent overview, though he has worked on similar issues in older papers: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.99.1...It has been a while since I read those papers, there may better references.
 1.) Dropout.Gal and Ghahramani "We show that a neural network with arbitrary depth and non-linearities, with dropout applied before every weight layer, is mathematically equivalent to an approximation to the probabilistic deep Gaussian process" http://arxiv.org/pdf/1506.02142.pdf2.) Deep.There have been many hierarchical networks in Bayesian statistics. Hierarchical Dirichlet Process, hierarchical beta process, Pitman-Yor process, Gaussian process, nested, dependent, translated, generalized versions of the Chinese Restaurant process, or the Indian Buffet one. The deep part isn't new.3.) Auto-encoders.People seem to talk about auto-encoders a lot. Its very definition introduces hidden variables about which very little is defined. It corresponds with some kind of prior on what representation should be used by the network. This is studied in Bayesian compressive sensing. http://machinelearning.wustl.edu/mlpapers/paper_files/icml20...4.) Layered training.Layers are trained one by one. Doesn't really ring a bell. Maybe it corresponds to some MCMC method I'm not aware of. There are often inner and outer loops in MCMC and I guess something fundamental can be told about the right moments to switch from inner to outer. However, that does not correspond to learning low-level features first.
 > let's try squaring the distances instead and then minimize that sum so we punish large deviations from our line a bit moreMore like so we can differentiate easy. Then if anyone asks, you wave your hands and mumble something along these lines :)

Search: