Hacker News

Nice writeup, but comparing non-linear SVMs to decision trees and logistic regression is a bit disingenuous. Try comparing the performance to random forests and neural networks; the differences will be much less stark. Also, in real-world data mining tasks, I think most people find that random forests generalize much better.

Really! That's not my experience or my understanding of what should happen. SVMs can be optimized to choose a minimal number of support vectors to define the decision surface on the training set, which means they suffer much less from structural risk than something like random forests or AdaBoost, so I think they are less prone to overfitting.

Boosted ensembles and random forests are easy to learn in parallel, so they're well suited to classifier discovery on data sets generated by a theory with a long tail.

Have I gone crazy?

You are incorrect. The ability to control for structural risk in random forests and boosting is comparable to that of SVMs.

(For the audience, "structural risk" = "model complexity". Structural risk can cause overfitting. Hence, Occam's razor.)

You control for structural risk in random forests through hyperparameters like the maximum depth.
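For concreteness, here's a minimal sketch of that depth cap in scikit-learn (the library choice, data set, and depth values are my own illustrative assumptions, not tuned settings):

```python
# Sketch: limiting structural risk in a random forest via max_depth.
# Assumes scikit-learn; the synthetic data and depth values are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, random_state=0)

# A depth cap keeps each tree simple (lower structural risk);
# max_depth=None lets trees grow until they fit the training data.
shallow = RandomForestClassifier(max_depth=3, random_state=0).fit(X, y)
deep = RandomForestClassifier(max_depth=None, random_state=0).fit(X, y)

print(shallow.estimators_[0].get_depth())  # no deeper than 3
print(shallow.score(X, y), deep.score(X, y))
```

The unconstrained forest will typically score higher on the training data, which is exactly the complexity you're trading away for better generalization.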

You control for structural risk in AdaBoost through early stopping. In extensions of AdaBoost, there might also be a learning rate.
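A quick sketch of what early stopping looks like there, using scikit-learn's staged predictions (again, library and parameter values are my assumptions for illustration):

```python
# Sketch: early stopping for AdaBoost by scoring the ensemble after
# each boosting round on held-out data. Assumes scikit-learn.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

# learning_rate shrinks each model's contribution; another complexity knob.
clf = AdaBoostClassifier(n_estimators=100, learning_rate=0.5, random_state=0)
clf.fit(X_tr, y_tr)

# staged_predict yields the ensemble's predictions after 1, 2, ... rounds,
# so we can pick the round with the best validation accuracy.
val_scores = [accuracy_score(y_val, p) for p in clf.staged_predict(X_val)]
best_round = val_scores.index(max(val_scores)) + 1
print(best_round)
```

Stopping at `best_round` rather than running all 100 rounds is the early-stopping control on complexity.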

In practice, I find it of comparable difficulty to control for overfitting in SVMs and random forests.

You are also incorrect that boosting can be easily parallelized. Each model update causes the weight of each example to be updated, and this weight vector must be shared. Hence, it is not trivial to parallelize boosting.

I should know this, but how do you minimize model complexity with neural nets? Fewer layers?

In addition to what bravura said, Geoff Hinton's group at the University of Toronto has recently introduced a new approach to training neural networks that they call "dropout". You can read about it in their paper (http://arxiv.org/pdf/1207.0580.pdf) or watch Hinton describe it in a recent talk he gave at Google (http://www.youtube.com/watch?v=DleXA5ADG78).

Roughly speaking, dropout training provides a strong regularizing effect through a sort of model averaging that is conceptually related to the well-known bagging approach from which random forests derive their power and flexibility. Dropout training has already produced state-of-the art results on several time-worn standard benchmarks and helped Hinton's group win a recent Kaggle competition (for an overview of their approach, see: http://blog.kaggle.com/2012/11/01/deep-learning-how-i-did-it...).

I've played around with this a bit over the last few weeks, and have a Matlab implementation publicly available from my Github at: https://github.com/Philip-Bachman/NN-Dropout.
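The core trick is small enough to sketch in a few lines of numpy. This is the "inverted dropout" formulation, a common variant of what the paper describes (the scaling-at-train-time detail is my choice; the paper instead scales weights at test time):

```python
# Sketch: inverted dropout on a layer of hidden activations.
# Pure numpy; the drop probability and shapes are illustrative.
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, p_drop=0.5, train=True):
    """At train time, zero each unit with probability p_drop and rescale
    the survivors so the expected activation matches test time."""
    if not train:
        return activations  # test time: use the full network, no scaling
    mask = rng.random(activations.shape) >= p_drop
    return activations * mask / (1.0 - p_drop)

h = np.ones((4, 10))            # a batch of hidden activations
h_train = dropout(h)            # roughly half the units zeroed, rest scaled to 2.0
h_test = dropout(h, train=False)
```

Each training case effectively samples a different thinned network, which is where the bagging-like model-averaging intuition comes from.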

The number of layers is one of the factors controlling model complexity.

Interestingly, the number of units in the hidden layers isn't as important as the size of the hidden units' weights. This is a famous result from the 1990s.

Hence, a principled way of controlling overfitting in NNs is: Pick a large enough number of weights that you don't underfit, and apply l2-regularization to the hidden unit weights. This is superior to fiddling with the number of hidden units in an unregularized net.
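In update-rule terms, l2 regularization just adds a weight-decay term to each gradient step. A minimal sketch (the step itself is standard; `grad_loss` stands in for whatever backprop returns, and the rates are illustrative):

```python
# Sketch: one SGD step with an l2 penalty (weight decay).
import numpy as np

def sgd_step_l2(w, grad_loss, lr=0.1, l2=0.01):
    # Minimizing loss + (l2/2) * ||w||^2 adds an l2*w term to the gradient,
    # which shrinks every weight toward zero on each step.
    return w - lr * (grad_loss + l2 * w)

w = np.array([1.0, -2.0])
# With a zero loss gradient, the step is pure decay: w shrinks by lr*l2.
w_new = sgd_step_l2(w, grad_loss=np.zeros(2))
print(w_new)
```

This is why large weights, not large unit counts, are what the penalty actually fights.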

A related result is that you can control model complexity by imposing sparsity on the activations of the hidden units.
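One common way to impose that is an l1 penalty on the activations themselves, added to the training loss. A tiny sketch (the penalty strength and the toy activations are illustrative assumptions):

```python
# Sketch: an l1 sparsity penalty on a batch of hidden activations.
import numpy as np

def sparsity_penalty(h, strength=0.01):
    # l1 on activations pushes many units to exactly zero, limiting how
    # much of the network any single input can recruit.
    return strength * np.abs(h).sum()

h = np.array([[0.0, 2.0, -1.0]])   # stand-in hidden activations
penalty = sparsity_penalty(h)      # 0.01 * (0 + 2 + 1)
```

The penalty is added to the ordinary loss before backprop, so gradients flow through it like any other term.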

Thanks, very interesting! How do I know if I have enough regularization?

Both underfitting and overfitting cause poor generalization performance. You can use cross validation to search for the parameters that give the lowest validation error.
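In practice that search is often just a grid over the regularization strength. A sketch with scikit-learn, where `alpha` is the l2 strength for its small-network classifier (library, grid values, and network size are my illustrative assumptions):

```python
# Sketch: cross-validated search over an l2 regularization strength.
# Assumes scikit-learn; the alpha grid and net size are illustrative.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=200, random_state=0)

search = GridSearchCV(
    MLPClassifier(hidden_layer_sizes=(20,), max_iter=500, random_state=0),
    param_grid={"alpha": [1e-4, 1e-2, 1.0]},  # alpha = l2 penalty strength
    cv=3,  # 3-fold cross validation
)
search.fit(X, y)
best_alpha = search.best_params_["alpha"]
```

Too little alpha and validation error rises from overfitting; too much and it rises from underfitting. The grid point with the lowest validation error sits between the two.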

Do you have a reference?

Is L2 better than L1 in this regard? My experience is that L1 significantly outperforms L2 whenever overfitting (rather than noise / bad measurements) is the problem you are addressing.
