Hacker News new | past | comments | ask | show | jobs | submit login
Why use SVM (yaksis.com)
147 points by cpleppert on Dec 26, 2012 | hide | past | web | favorite | 33 comments

First: great post! Don't let my comments deter you--I'm only sharing what I think is some constructive criticism.

I think this should be titled "why use kernels," as the gain here comes from using an RBF kernel (and not just a SVM). Kernel-based L2/Tikhonov regression (or kernel ridge regression) could perform just as well (although it might require more memory to train).

Other thoughts:

* your training set is 66% of the entire set, not 80%

* you use "degree=0.5" when creating the sklearn.svm.SVC, but since you don't specify the kernel type it defaults to RBF, which doesn't accept a degree arg. See [0] for more.

* you should motivate the kernel transformation more thoroughly; by mapping the data into a higher-dimension space, you hope to find a separating boundary that isn't present in the natural space. The mapping is performed such that we only need the inner product of each vector, which lets us map to ANY space (even those of infinite dimension!!) as long as we can compute inner products.

(There are some properties of the higher-dimension spaces, such as the fact they're Hilbert spaces, that make the above possible, but it's late and I can't remember details off the cuff).

[0] http://scikit-learn.org/dev/modules/generated/sklearn.svm.SV...

So, this is a good post and everyone should be aware of these. But kernels ("non-linear" svms) are not a panacea. If you use kernels, then your model will need to remember some non-trivial number of the data points (the "support vectors" that define the decision surface). SVMs are nice in that they need a relatively small number of points compared to the size of the dataset. Moreover, classification is now linear in the number of points you kept around, and training is potentially quadratic. If you use a straight-forward linear classifier, you only need a single weights vector.

Second, SVMs---especially with a non-linear kernel---are much trickier to tune than logistic. SVMs are very sensitive to choice of hyper-parameters (user specified tuning weights), which means repeatedly retraining. And if you use too powerful of a kernel, you will overfit your data very easily.

For these reasons, my advice is to start with logistic, and then if you're not satisfied, switch to SVMs.

(As an aside, it's totally possible to kernelize logistic regression. Without suitable regularization, it'll be even worse with regards to the number of datapoints you need though.)

Nice writeup, but comparing non-linear SVMs to decision trees and logistic regression is a bit disingenuous. Try comparing the performance to random forests and neural networks-- the differences will be much less stark. Also, in real-world data mining tasks, I think most people find that random forests generalize much better.

Really! That's not my experience or my understanding of what should happen... SVM's can be optimized to choose a minimal number of support vectors to define the decision surface in the training set, and this means that they suffer much less from structural risk than something like random decision forests or Adaboost - so I think that they are less prone to overfit.

Boosters and random decision forests are easy to learn in parallel, so suited to classifier discovery on data sets generated by a theory with a long tail.

Have I gone crazy?

You are incorrect. The ability to control for structural risk in random forests and boosting is comparable to that of SVMs.

(For the audience, "structural risk" = "model complexity". Structural risk can cause overfitting. Hence, Occam's razor.)

You control for structural risk in random forests through hyperparameters like the maximum depth.

You control for structural risk in Adaboost through early stopping. In extensions of Adaboost, there might also be a learning rate.

In practice, I find it of comparable difficulty to control for overfitting in SVMs and random forests.

You are also incorrect that boosting can be easily parallelized. Each model update causes the weight of each example to be updated, and this weight vector must be shared. Hence, it is not trivial to parallelize boosting.

I should know this, but how do you minimize model complexity with neural nets? Fewer layers?

In addition to what bravura said, Geoff Hinton's group at UT has recently introduced a new approach to training neural networks that they call "dropout". You can read about it in their paper (http://arxiv.org/pdf/1207.0580.pdf) or watch Hinton describe it in a recent talk he gave at Google (http://www.youtube.com/watch?v=DleXA5ADG78).

Roughly speaking, dropout training provides a strong regularizing effect through a sort of model averaging that is conceptually related to the well-known bagging approach from which random forests derive their power and flexibility. Dropout training has already produced state-of-the art results on several time-worn standard benchmarks and helped Hinton's group win a recent Kaggle competition (for an overview of their approach, see: http://blog.kaggle.com/2012/11/01/deep-learning-how-i-did-it...).

I've played around with this a bit over the last few weeks, and have a Matlab implementation publicly available from my Github at: https://github.com/Philip-Bachman/NN-Dropout.

The number of layers is one of the factors controlling model complexity.

Interesting, the number of units in the hidden layers isn't as important as the size of the weights of the hidden units. This is a famous result from the 1990's.

Hence, a principled way of controlling overfitting in NNs is: Pick a large enough number of weights that you don't underfit, and apply l2-regularization to the hidden unit weights. This is superior to fiddling with the number of hidden units in an unregularized net.

A related result is that you can control model complexity by imposing sparsity on the activations of the hidden units.

Thanks, very interesting! How do I know if I have enough regularization?

Both underfitting and overfitting cause poor generalization performance. You can use cross validation to search for the parameters that give the lowest validation error.

do you have a reference?

Is L2 better than L1 in this regard? My experience is that L1 significantly outperforms L2 whenever overfitting (rather than noise / bad measurements) is the problem you are addressing.

Just adding this, for learning purposes, libsvm works just fine, but for most real problems I've run into I've had to hand code my own SVM libs. All open source solutions I've run into use libsvm or some derivation there of which does not scale well. It single cpu usage on an algo that can easily thread out (or even better run on distributed clusters using Hadoop). It also blows through absurd amounts of memory when the number of support vectors and or the vector features are large.

The libsvm train function and some of the associated tools grid.py do a fairly mediocre job at refining the coef (as someone else mentioned) when doing nonlinear SVMs. It getts really tricky to select good coef then. Also using the eigen matrix to cut down the number of important vector features ends up being a pretty big deal if you end up black box using SVMs on massive input data sets.

If you work in a space where you data set isn't a 50/50 mix of classification data, finding coefficients to maximize accuracy and cut incorrect classification, becomes a mess.

Maybe you can contribute some of your ideas to libsvm?

Working on it :-) to be posted on github. I'm about done into a re-write in perl (I just like that lang, don't just judge me :-P) that threads it out to all cores and cuts the memory usage in half for data large sets.

A kernel is basically a function which transforms data from one dimension into another. The mentioned kernel trick is the computational reduction from O(d^2) to O(d) through the fact that one doesn't need to calculate the dot products of all the extended basis vectors but only a kernel matrix (so called Gram matrix) which is made from the dot products of the original vectors which is only linear in cost. It seems almost everybody talking about SVMs and kernels is getting this wrong.

Actually, you get it wrong.

A Kernel function k(x, y) is a function that calculates phi(x)^T phi(y) for some phi. That means, it calculcates the dot product in a higher dimensional space of two data points; it does not do the transformation.

That implies that Kernels do not have to work on the gram matrix. Kernels can be sth completely different, e.g. Fisher Kernels.

(What I wrote is based on Bishop's book and others.)

This article may give a false impression that more complex models are better than less complex, and the only issue with complex models is that they require more computations. This is misleading and there should really be emphasis on bias-variance tradeoff and underfitting/overfitting.

This is only tangentially related to the article, but if you'd like to see a nice interactive visualization similar to the ones it includes, http://cs.stanford.edu/~karpathy/svmjs/demo/ is a great one. There are similar visualizations for neural networks and random forests linked there.

I'm a bit disappointed that such an unreflective post makes it to hacker news. Many of the scikit-learn staff have excellent blogs, I never saw any of those featured.

Btw, the sklearn docs have an example that I think explains different classifiers way better and also hints at a very important point: non-linear decision boundaries are mostly important for low-dimensional dense data and not helpful for text data for example.

> Many of the scikit-learn staff have excellent blogs, I never saw any of those featured.

maybe you should submit more of their posts :)

I've seen them on here. They're excellent technical posts that probably don't interest many of the main-page viewers here.

Here's Jake Vanderplas on SVD algorithms for sparse matrices:


Excellent content.

Nice introduction to SVM but, on my past experience with SVM and especially for large data-set, SVM algorithms are often quite intensive on memory consumption. Not sure if all implementations of SVM behave regarding memory consumption. I don't know if someone did some works on comparing the implementation and especially the appropriate kernel functions to match minimal memory requirements.

Looking at this article, I feel quite lost. I don't have the necessary background in machine learning.

Could someone point me to some reference materials (books, articles, videos, Coursera, etc) that will get me up to speed on the stuff I need to know in order to understand this article?

Andrew Ng, who teaches the CS 229 Machine Learning course at Stanford, has his lecture notes online: http://cs229.stanford.edu/materials.html . I have found these useful in the past.

He also teaches the Coursera machine learning course: https://www.coursera.org/course/ml

I'd just like to say, as a self-taught hacker, that I would pay $500 for a "_why's poignant guide to machine learning", and I'm sure I'm not alone.

What exactly do you want? Do you want a resource that "clicks" with you that you can use to learn about machine learning, or do you want a witty piece of art like "_why's poignant guide to ruby"?

If its the former you are looking for, I have several recommendations. There are a plethora of online resources to learn about machine learning from. In video form, my favorite resource is Yaser Abu-Mostafa's (Caltech prof) video lectures, available here: http://work.caltech.edu/teaching.html

You can also actually enroll in the next version of CS 156 (Abu-Mostafa's class) and do the class online with problem sets very similar to that of Caltech students actually enrolled in the class. (It starts January 8th).

Coursera/Udacity contain several classes that involve machine learning; Andrew Ng's class is a good place to start.

If you (like me) prefer text to video, you can buy Abu-Mostafa's book (I hear it follows the course very closely, and the book costs way less than $500 :D ), read Andrew Ng's online lecture notes, or read one of several freely available ML books (such as "Elements of Statistical Learning" or "Bayesian Reasoning and Machine Learning", but I've found these to take more mathematical maturity that the other resources recommended).

Note: One reason that I've highly recommended Abu-Mostafa's resources (besides having taken the class at Caltech and loving it) is that his class is more focused on the theory of machine learning (ie hammering in over-arching principles like avoiding overfitting by regularizing) rather than just covering as many algorithms as possible. Also, I believe Mitchell of CMU has good online resources for his machine learning course.

Definitely, something that 'clicks'.

_why's book for Ruby, pg/norvig's books for Lisp ... they're so enjoyable to read that you don't need to exert yourself for the subject to 'click', which is pretty impressive considering what you're learning.

I don't want to go to school to learn machine learning, but I'd be willing to pay money (maybe quite a lot of money) for a book that made it enjoyable.

For kernel-based methods in particular, Shawe-Taylor and Cristianini's "Kernel Methods for Pattern Analysis" (http://www.kernel-methods.net) is the textbook to use.

How about "Learn machine learning the hard way?"

I don't think that would be a good fit for teaching these things. Zed Shaw's series is primarily meant to teach concrete skills through rote memorization.

Machine learning isn't a "tool" you use to instantly "solve" a concrete problem you're having. Instead, you have to build up a large body of intuition about which tools you can use to work with your data. It can't be memorized.

What is it exactly that you'd like to know?

Registration is open for Startup School 2019. Classes start July 22nd.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact