
Why use SVM - cpleppert
http://www.yaksis.com/posts/why-use-svm.html
======
achompas
First: great post! Don't let my comments deter you--I'm only sharing what I
think is some constructive criticism.

I think this should be titled "why use kernels," as the gain here comes from
using an RBF kernel (and not just an SVM). Kernel-based L2/Tikhonov regression
(or kernel ridge regression) could perform just as well (although it might
require more memory to train).

Other thoughts:

* your training set is 66% of the entire set, not 80%

* you use "degree=0.5" when creating the sklearn.svm.SVC, but since you don't specify the kernel type it defaults to RBF, which ignores the degree argument (degree only applies to the polynomial kernel). See [0] for more, and the snippet at the end of this comment.

* you should motivate the kernel transformation more thoroughly; by mapping the data into a higher-dimensional space, you hope to find a separating boundary that isn't present in the natural space. The mapping is performed such that we only need the inner product of each pair of vectors, which lets us map to ANY space (even those of infinite dimension!!) as long as we can compute inner products.

(There are some properties of the higher-dimensional spaces, such as the fact
that they're Hilbert spaces, that make the above possible, but it's late and I
can't remember details off the cuff.)

[0] http://scikit-learn.org/dev/modules/generated/sklearn.svm.SVC.html
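
For instance, a minimal sketch of the fix (assuming a recent scikit-learn;
the toy dataset is just for illustration):

    # With kernel='rbf', gamma is the hyperparameter that matters;
    # degree is silently ignored (it only applies to kernel='poly').
    from sklearn.datasets import make_circles
    from sklearn.svm import SVC

    X, y = make_circles(n_samples=200, noise=0.1, factor=0.4, random_state=0)

    clf = SVC(kernel='rbf', gamma=1.0, C=1.0)
    clf.fit(X, y)
    print(clf.score(X, y))  # the RBF kernel separates the circles easily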

------
dlwh
So, this is a good post and everyone should be aware of these points. But
kernels ("non-linear" SVMs) are not a panacea. If you use kernels, then your
model will need to remember some non-trivial number of the data points (the
"support vectors" that define the decision surface). SVMs are nice in that
they need a relatively small number of points compared to the size of the
dataset. Still, classification is now linear in the number of points you kept
around, and training is potentially quadratic. If you use a straightforward
linear classifier, you only need a single weight vector.

Second, SVMs---especially with a non-linear kernel---are much trickier to tune
than logistic regression. SVMs are very sensitive to the choice of
hyper-parameters (user-specified tuning knobs), which means repeatedly
retraining. And if you use too powerful a kernel, you will overfit your data
very easily.

For these reasons, my advice is to start with logistic regression, and then if
you're not satisfied, switch to SVMs.

(As an aside, it's totally possible to kernelize logistic regression. Without
suitable regularization, though, it'll be even worse with regard to the number
of datapoints you need.)
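
To make the memory point concrete, here's a rough sketch (scikit-learn
assumed; the dataset is synthetic): a linear model keeps a single weight
vector, while a kernel SVM has to keep its support vectors around.

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

    linear = LogisticRegression(max_iter=1000).fit(X, y)
    kernel = SVC(kernel='rbf').fit(X, y)

    print(linear.coef_.shape)             # (1, 20): one weight vector, period
    print(kernel.support_vectors_.shape)  # (n_SV, 20): stored training points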

------
tansey
Nice writeup, but comparing non-linear SVMs to decision trees and logistic
regression is a bit disingenuous. Try comparing the performance to random
forests and neural networks-- the differences will be much less stark. Also,
in real-world data mining tasks, I think most people find that random forests
generalize much better.

~~~
sgt101
Really! That's not my experience or my understanding of what should happen...
SVMs can be optimized to choose a minimal number of support vectors to define
the decision surface in the training set, and this means that they suffer much
less from structural risk than something like random decision forests or
Adaboost - so I think that they are less prone to overfit.

Boosters and random decision forests are easy to learn in parallel, so they're
suited to classifier discovery on data sets generated by a theory with a long
tail.

Have I gone crazy?

~~~
bravura
You are incorrect. The ability to control for structural risk in random
forests and boosting is comparable to that of SVMs.

(For the audience, "structural risk" = "model complexity". Structural risk can
cause overfitting. Hence, Occam's razor.)

You control for structural risk in random forests through hyperparameters like
the maximum depth.

You control for structural risk in Adaboost through early stopping. In
extensions of Adaboost, there might also be a learning rate.

In practice, I find it of comparable difficulty to control for overfitting in
SVMs and random forests.

You are also incorrect that boosting can be easily parallelized. Each model
update causes the weight of each example to be updated, and this weight vector
must be shared. Hence, it is not trivial to parallelize boosting.
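
As a sketch of the first point (scikit-learn assumed; the numbers are
arbitrary), max_depth is the knob:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=500, random_state=0)

    for depth in (2, 5, None):  # None = fully grown trees, highest complexity
        rf = RandomForestClassifier(max_depth=depth, n_estimators=100,
                                    random_state=0)
        print(depth, cross_val_score(rf, X, y, cv=5).mean())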

~~~
achompas
I should know this, but how do you minimize model complexity with neural nets?
Fewer layers?

~~~
bravura
The number of layers is one of the factors controlling model complexity.

Interestingly, the number of _units_ in the hidden layers isn't as important
as the size of the _weights_ of the hidden units. This is a famous result from
the 1990s.

Hence, a principled way of controlling overfitting in NNs is: pick a large
enough number of hidden units that you don't _underfit_, and apply L2
regularization to the hidden unit weights. This is superior to fiddling with
the number of hidden units in an unregularized net.

A related result is that you can control model complexity by imposing sparsity
on the activations of the hidden units.
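
As a sketch (using scikit-learn's MLPClassifier for concreteness; the numbers
are arbitrary): a generously sized hidden layer, with the alpha parameter
providing the L2 penalty on the weights.

    from sklearn.datasets import make_moons
    from sklearn.neural_network import MLPClassifier

    X, y = make_moons(n_samples=300, noise=0.2, random_state=0)

    net = MLPClassifier(hidden_layer_sizes=(100,),  # enough units to avoid underfitting
                        alpha=1e-2,                 # L2 penalty on the weights
                        max_iter=2000, random_state=0)
    net.fit(X, y)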

~~~
im3w1l
Thanks, very interesting! How do I know if I have enough regularization?

~~~
rm999
Both underfitting and overfitting cause poor generalization performance. You
can use cross validation to search for the parameters that give the lowest
validation error.
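
Sketched with scikit-learn's GridSearchCV (the grid values here are arbitrary
starting points):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=300, random_state=0)

    # try every (C, gamma) pair, scored by 5-fold cross validation
    param_grid = {'C': [0.1, 1, 10, 100], 'gamma': [0.001, 0.01, 0.1, 1]}
    search = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5)
    search.fit(X, y)
    print(search.best_params_, search.best_score_)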

------
megaframe
Just adding this: for learning purposes, libsvm works just fine, but for most
real problems I've run into I've had to hand-code my own SVM libs. All open
source solutions I've run into use libsvm or some derivation thereof, which
does not scale well. It sticks to a single CPU for an algorithm that can
easily be threaded out (or, even better, run on distributed clusters using
Hadoop). It also blows through absurd amounts of memory when the number of
support vectors and/or the number of vector features is large.

The libsvm train function and some of the associated tools (grid.py) do a
fairly mediocre job at refining the coefficients (as someone else mentioned)
when doing nonlinear SVMs. It gets really tricky to select good coefficients
then. Also, using an eigendecomposition to cut down the number of important
vector features ends up being a pretty big deal if you end up using SVMs as a
black box on massive input data sets.

If you work in a space where your data set isn't a 50/50 class mix, finding
coefficients that maximize accuracy and cut incorrect classifications becomes
a mess.
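
For what it's worth, here is roughly what that looks like in scikit-learn (a
sketch, with arbitrary numbers): PCA for the eigendecomposition step, and
class_weight for the skewed class mix.

    from sklearn.datasets import make_classification
    from sklearn.decomposition import PCA
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import SVC

    # synthetic stand-in: 90/10 class mix, 200 raw features
    X, y = make_classification(n_samples=1000, n_features=200,
                               weights=[0.9, 0.1], random_state=0)

    clf = make_pipeline(PCA(n_components=50),  # keep the top 50 components
                        SVC(kernel='rbf', class_weight='balanced'))
    clf.fit(X, y)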

~~~
jderick
Maybe you can contribute some of your ideas to libsvm?

~~~
megaframe
Working on it :-) to be posted on GitHub. I'm about done with a rewrite in
Perl (I just like that lang, don't judge me :-P) that threads it out to all
cores and cuts the memory usage in half for large data sets.

------
W0lf
A kernel is basically a function which transforms data from one dimension into
another. The mentioned kernel trick is the computational reduction from O(d^2)
to O(d), through the fact that one doesn't need to calculate the dot products
of all the extended basis vectors, but only a kernel matrix (the so-called
Gram matrix) made from the dot products of the original vectors, which is only
linear in cost. It seems almost everybody talking about SVMs and kernels is
getting this wrong.

~~~
wookietrader
Actually, you get it wrong.

A kernel function k(x, y) is a function that calculates phi(x)^T phi(y) for
some phi. That means it calculates the _dot product of two data points in a
higher-dimensional space_; it does not do the transformation.

That implies that kernels do not have to work on the Gram matrix. Kernels can
be something completely different, e.g. Fisher kernels.

(What I wrote is based on Bishop's book and others.)
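
A toy check of that definition: for the degree-2 polynomial kernel
k(x, y) = (x . y)^2 the map phi is explicit, and k computes phi(x)^T phi(y)
without ever forming phi.

    import numpy as np

    def phi(v):  # explicit feature map R^2 -> R^3
        return np.array([v[0]**2, np.sqrt(2) * v[0] * v[1], v[1]**2])

    def k(x, y):  # the kernel: same dot product, computed in the original space
        return np.dot(x, y) ** 2

    x, y = np.array([1.0, 2.0]), np.array([3.0, 4.0])
    print(np.dot(phi(x), phi(y)), k(x, y))  # both print 121.0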

------
kmike84
This article may give a false impression that more complex models are better
than less complex ones, and that the only issue with complex models is that
they require more computation. This is misleading, and there should really be
an emphasis on the bias-variance tradeoff and underfitting/overfitting.

------
davidtgoldblatt
This is only tangentially related to the article, but if you'd like to see a
nice interactive visualization similar to the ones it includes,
<http://cs.stanford.edu/~karpathy/svmjs/demo/> is a great one. There are
similar visualizations for neural networks and random forests linked there.

------
t3kcit
I'm a bit disappointed that such an unreflective post makes it to Hacker News.
Many of the scikit-learn staff have excellent blogs, I never saw any of those
featured.

Btw, the sklearn docs have an example that I think explains the different
classifiers way better and also hints at a very important point: non-linear
decision boundaries are mostly important for low-dimensional, dense data and
not helpful for text data, for example.

~~~
riffraff
> Many of the scikit-learn staff have excellent blogs, I never saw any of
> those featured.

maybe you should submit more of their posts :)

~~~
achompas
I've seen them on here. They're excellent technical posts that probably don't
interest many of the main-page viewers here.

Here's Jake Vanderplas on SVD algorithms for sparse matrices:

http://jakevdp.github.com/blog/2012/12/19/sparse-svds-in-python/

Excellent content.

~~~
ogrisel
Other blogs by sklearners:

<http://peekaboo-vision.blogspot.fr/>

<http://www.mblondel.org/journal/category/machine-learning/>

<http://gael-varoquaux.info/blog/?cat=20>

<http://fseoane.net/blog/>

<http://blog.vene.ro/>

------
adulau
Nice introduction to SVMs, but in my past experience with SVMs, especially on
large data sets, the algorithms are often quite intensive in memory
consumption. I'm not sure whether all implementations of SVMs behave the same
way regarding memory. I don't know if someone has done work comparing the
implementations, and especially the appropriate kernel functions, to match
minimal memory requirements.

------
winter_blue
Looking at this article, I feel quite lost. I don't have the necessary
background in machine learning.

Could someone point me to some reference materials (books, articles, videos,
Coursera, etc) that will get me up to speed on the stuff I need to know in
order to understand this article?

~~~
ajdecon
Andrew Ng, who teaches the CS 229 Machine Learning course at Stanford, has his
lecture notes online: <http://cs229.stanford.edu/materials.html>. I have
found these useful in the past.

He also teaches the Coursera machine learning course:
<https://www.coursera.org/course/ml>

------
mr_luc
I'd just like to say, as a self-taught hacker, that I would pay $500 for a
"_why's poignant guide to machine learning", and I'm sure I'm not alone.

~~~
lightcatcher
What exactly do you want? Do you want a resource that "clicks" with you that
you can use to learn about machine learning, or do you want a witty piece of
art like "_why's poignant guide to ruby"?

If it's the former you are looking for, I have several recommendations. There
are a plethora of online resources to learn about machine learning from. In
video form, my favorite resource is Yaser Abu-Mostafa's (Caltech prof) video
lectures, available here: <http://work.caltech.edu/teaching.html>

You can also actually enroll in the next version of CS 156 (Abu-Mostafa's
class) and do the class online, with problem sets very similar to those of
Caltech students actually enrolled in the class. (It starts January 8th.)

Coursera/Udacity contain several classes that involve machine learning; Andrew
Ng's class is a good place to start.

If you (like me) prefer text to video, you can buy Abu-Mostafa's book (I hear
it follows the course very closely, and the book costs way less than $500 :D
), read Andrew Ng's online lecture notes, or read one of several freely
available ML books (such as "Elements of Statistical Learning" or "Bayesian
Reasoning and Machine Learning", but I've found these to take more
mathematical maturity than the other resources recommended).

Note: one reason that I've highly recommended Abu-Mostafa's resources (besides
having taken the class at Caltech and loving it) is that his class is more
focused on the theory of machine learning (i.e. hammering in over-arching
principles like avoiding overfitting by regularizing) rather than just
covering as many algorithms as possible. Also, I believe Mitchell of CMU has
good online resources for his machine learning course.

~~~
mr_luc
Definitely, something that 'clicks'.

_why's book for Ruby, pg/norvig's books for Lisp ... they're so enjoyable to
read that you don't need to exert yourself for the subject to 'click', which
is pretty impressive considering what you're learning.

I don't want to go to school to learn machine learning, but I'd be willing to
pay money (maybe quite a lot of money) for a book that made it enjoyable.

