

Machine Learning: Neural Network vs Support Vector Machine - wslh
http://stackoverflow.com/questions/11632516/machine-learning-neural-network-vs-support-vector-machine

======
bravura
SVMs are good if you want high accuracy without much fiddling and don't have
many training examples. It is pretty simple to get off-the-shelf results from
SVMs. However, SVM training is quadratic in the number of examples, and you
have to get really hacky to train on more than 10K examples.

Neural networks are good if you have many training examples and don't mind
doing hyperparameter tuning. I have trained neural networks on over 1B
examples on a single core. (It took a month.) However, you have to tune your
learning rate
and regularization, and there don't yet exist good packages to do this
automatically.
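
A sketch of the kind of tuning loop this implies, using scikit-learn's
SGDClassifier (my choice of tool; the grid values below are made up, not
recommendations):

    # Hand-rolled tuning of learning rate and L2 regularization for an
    # SGD-trained linear model; the grid values are illustrative guesses.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import SGDClassifier
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

    best = (None, -np.inf)
    for eta0 in [0.001, 0.01, 0.1]:        # initial learning rate
        for alpha in [1e-6, 1e-4, 1e-2]:   # L2 regularization strength
            clf = SGDClassifier(learning_rate="constant", eta0=eta0,
                                alpha=alpha)
            score = cross_val_score(clf, X, y, cv=3).mean()
            if score > best[1]:
                best = ((eta0, alpha), score)
    print(best)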

It is also much simpler with neural networks to learn in custom settings,
e.g. mixing supervised and unsupervised learning (labeled and unlabeled
examples), transfer learning, etc., because you can change your training
criterion and minimize it directly.
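
As a toy illustration of changing the criterion and minimizing it directly,
here is a sketch mixing a supervised logistic loss on labeled points with an
unsupervised entropy penalty on unlabeled ones; the entropy term and the 0.1
weight are my own illustrative choices, not something from the thread:

    # Toy semi-supervised criterion, minimized directly.
    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(0)
    X_lab = rng.normal(size=(20, 3))     # small labeled set
    y_lab = np.sign(X_lab[:, 0])         # labels in {-1, +1}
    X_unl = rng.normal(size=(500, 3))    # large unlabeled pool

    def criterion(w):
        # supervised part: logistic loss on the labeled examples
        sup = np.mean(np.logaddexp(0.0, -y_lab * (X_lab @ w)))
        # unsupervised part: entropy penalty that pushes unlabeled points
        # away from the decision boundary (one classic choice)
        p = 1.0 / (1.0 + np.exp(-X_unl @ w))
        ent = -np.mean(p * np.log(p + 1e-12) +
                       (1 - p) * np.log(1 - p + 1e-12))
        return sup + 0.1 * ent

    w_opt = minimize(criterion, np.zeros(3)).x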

If you want to do deep learning, we have a much better understanding of how
to train neural networks, particularly because we can train them on such
large datasets.

~~~
levesque
Just wanted to add that hyperparameter tuning is vital for SVMs as well.
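
In scikit-learn terms, that tuning usually looks something like this (a
sketch; the C/gamma grid below is made up, not a recommendation):

    # SVM hyperparameter tuning with a grid search over C and gamma.
    from sklearn.datasets import load_digits
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    X, y = load_digits(return_X_y=True)
    grid = {"C": [0.1, 1, 10, 100], "gamma": [1e-4, 1e-3, 1e-2]}
    search = GridSearchCV(SVC(kernel="rbf"), grid, cv=5).fit(X, y)
    print(search.best_params_, search.best_score_)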

------
michaelochurch
SVMs require a linearly separable dataset. That would make them seem quite
useless, but often that's not a huge problem, and when it is, you can add
basis functions to make the data separable, as with this data:

    x  label
    1      1
    2      1
    3     -1
    4     -1
    5      1
    6      1
    7     -1

This data set becomes linearly separable with the basis expansion {X^2, X^3}
because there's a cubic function that matches the data perfectly in sign
(i.e. f(x) > 0 on all the x's mapped to 1, and f(x) < 0 on all the -1's).
This doubles the size of your data matrix (4 columns, for 1, X, X^2, and X^3)
in comparison to the typical {1, X} basis.

The problem is that if your data are noisy, you can end up needing a lot of
basis expansions to get separation, and then you have the typical problems of
complexity.

As far as I understand it, SVMs are good when you know there are no
misclassifications in the data and believe the separating surfaces (which
will separate the data perfectly) will be simple, so you're not going to need
to add a lot of basis expansions. The method seems to expect a certain
orderliness in the data.

Neural nets are extremely flexible on account of their large number of degrees
of freedom, but rarely achieve "perfect" classification and it's often not
desirable that they do (overfitting). What they seem to be strong at is
finding a very good (but not perfect) answer, and unlike SVMs, which don't
work well in a noisy world (they can't achieve separation), neural nets
handle noise well.
Neural nets, in my experience, will learn something in a lot of different
types of environments. The major negative of neural nets is that they take a
very long time to converge, and if you use stochastic gradient descent (which
becomes necessary on a large data set) they never fully converge; the
iterates keep bouncing around the optimum. Also, neural net problems often
require extensive cross-validation.

~~~
psb217
Your description of SVMs is a bit misleading. It matches what are typically
called "hard margin" SVMs, which do require linearly separable data.
However, people talking about SVMs are typically talking about "soft margin"
SVMs, which don't require linearly separable data. Soft margin SVMs are the
kind implemented in practically any off-the-shelf machine learning library you
might pick up online.

A concise description of the "mode of action" for soft margin SVMs would be:
project the training data into an alternate space and then perform
L2-regularized regression with hinge loss in that space. The trick is that, by
using (valid Mercer) kernels, the solution to the regularized regression in
the alternate space can be represented by a weighted sum of kernel functions
evaluated at points in the training set (i.e. the support vectors). Thus, the
solution can be learned without having to explicitly represent the points in
the alternate space, which permits the use of very high, or even "infinite"
dimensional spaces for the alternate representation.
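
A small sketch of that fact, assuming scikit-learn's SVC with an RBF kernel
(my example): the fitted decision function can be rebuilt by hand as a
weighted sum of kernel evaluations at the support vectors.

    # f(x) = sum_i a_i * k(sv_i, x) + b, rebuilt from the fitted model.
    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)  # not linearly separable

    gamma = 0.5
    clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)

    x_new = np.array([[0.3, -0.2]])
    k = np.exp(-gamma * ((clf.support_vectors_ - x_new) ** 2).sum(axis=1))
    f_manual = clf.dual_coef_ @ k + clf.intercept_
    assert np.allclose(f_manual, clf.decision_function(x_new))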

In the context of the alternate space, the use of strong L2 regularization
(formally effected by an L2 constraint on the implicitly learned parameters)
dramatically reduces the risk of overfitting. Additionally, the combination of
L2 regularization and hinge loss leads to a convex optimization problem which,
to some people, constitutes one of the key advantages of SVMs.

I tried to avoid too much jargon, though the term "L2-regularized regression
with hinge loss" may merit further expansion (if you're interested).
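
For reference, in standard notation (my rendering, not a quote), that
objective is:

    \min_{w,\,b}\ \tfrac{1}{2}\|w\|^2
        + C \sum_{i=1}^{n} \max\big(0,\ 1 - y_i\,(w^\top \phi(x_i) + b)\big)

where the sum is the hinge loss over training pairs (x_i, y_i), \phi is the
implicit feature map, and C trades data fit against the L2 penalty on w.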

~~~
michaelochurch
Thanks. That's really cool. I would be interested in getting some pointers
regarding the soft-margin SVM.

I've used neural nets a fair bit, but I've never built an SVM, although I'll
probably get to them soon in my ML study (I'm using Bishop's book and
Hastie's, as well as the online course videos).

~~~
Evbn
Read Andrew Ng's lecture notes (really more book-level quality) for the
version of CS229 he wrote _before_ he created the simplified, less
mathematical Coursera version. They are floating around online.

Also, Elements of Statistical Learning is available online for free (a
previous edition, maybe?), which covers a lot more standard/traditional
statistical curve/surface-fitting topics as well, all with high mathematical
rigor.

~~~
psb217
Elements of Statistical Learning is Hastie's book. Between it and Bishop's
book (i.e. Pattern Recognition and Machine Learning), I prefer Hastie's for
clarity of exposition. In particular, I find that ESL better conveys the sort
of intuitive understanding of _why_ a method works that facilitates practical
applications and extensions to novel contexts. Though there are topics for
which the higher equations-to-explanations ratio of Bishop's book is useful.

I've TAed my university's graduate ML course for the past couple of years, so
I've read most chapters of these books in some detail and have hands-on
experience using them to help people who are looking closely at these topics
for the first time. Interestingly, SVMs are actually a good example of when
I'd suggest both books.

