
An Idiot’s guide to Support vector machines (2003) [pdf] - bladecatcher
http://web.mit.edu/6.034/wwwbob/svm-notes-long-08.pdf
======
quantombone
Back in around 2008, SVMs were all the rage in computer vision. We would use
hand-designed visual features and then a linear SVM on top. That was how
object detectors were built (remember DPM?).

Funny how SVMs are just max-margin loss functions and we just took for granted
that you needed domain expertise to craft features like HOG/SIFT by hand.

By 2018, we use ConvNets to learn BOTH the features and the classifier. In
fact, it’s hard to separate where the features end and the classifier begins
(in a modern CNN).

~~~
a-dub
So the pitch is that you don't have to do feature engineering... but then
instead it seems people do network structure engineering with featurish things
like convolutions.

The performance is still better in most cases but I often have to wonder, are
people just doing feature engineering once removed and is the better
performance just the result of having WAY more parameters in the model?

~~~
a-dub
I guess one upshot to the SVM approach is that there's math for quantifying
how well a given model will generalize, subject to some assumptions.

Is there anything like that in the ANN world?

~~~
computerex
In short, no. Not for the large models used in practice for common tasks like
image classification or speech-to-text.

------
abhgh
If you want something closer to an ELI5 version, I recommend this [1].

Disclaimer: written by me.

[1] [https://blog.statsbot.co/support-vector-machines-tutorial-c1...](https://blog.statsbot.co/support-vector-machines-tutorial-c1618e635e93)

~~~
jameslk
Great article! I did not understand the part with kernels before, so I
appreciate the simplified explanation. The interactive demo on
[https://www.csie.ntu.edu.tw/~cjlin/libsvm/](https://www.csie.ntu.edu.tw/~cjlin/libsvm/)
is really cool.

~~~
abhgh
Thanks! Yes, it's a pretty good demo; it should be more popular IMO.

------
cultus
I notice this doesn't mention hinge loss, which is by far the simpler way of
arriving at the SVM. Hinge loss is just max(0, 1 - t*y), where y is the output
of the linear model and t = ±1 is the label. Thus, it takes the common-sense
approach of not penalizing points that are correctly classified and far enough
from the decision boundary, and penalizing linearly after that.
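
A minimal sketch of that loss in NumPy (purely illustrative):

    import numpy as np

    def hinge_loss(y, t):
        # y: raw output of the linear model (w.x + b); t: label in {-1, +1}.
        # Zero loss once t*y >= 1 (outside the margin), linear penalty otherwise.
        return np.maximum(0.0, 1.0 - t * y)

    print(hinge_loss(np.array([2.0, 0.5, -1.0]), np.array([1, 1, 1])))  # -> [0.  0.5 2. ]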

In primal form, an SVM is literally just a linear model with hinge loss
instead of log loss (logistic regression) or squared loss (ordinary linear
regression). For apparently historical reasons, it is usually derived from the
"hard-margin" SVM in dual form, motivated by maximizing the margin. This is
complicated and not very intuitive.

This also causes people to conflate the kernel trick and the dual form, while
in fact they have nothing to do with each other. You can use the kernel trick
in the primal SVM just fine.

Stochastic gradient descent can also be used for primal methods, while it
doesn't work in the dual. That makes the primal much faster than the dual for
large problems.
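
A hedged sketch of that equivalence with scikit-learn (parameter values are
illustrative): SGDClassifier trained in the primal is a linear SVM when given
the hinge loss, and logistic regression when given the log loss.

    from sklearn.datasets import make_classification
    from sklearn.linear_model import SGDClassifier

    # Toy data standing in for a large problem.
    X, y = make_classification(n_samples=10_000, n_features=50, random_state=0)

    # Primal linear SVM: linear model + hinge loss, fit by stochastic gradient descent.
    svm = SGDClassifier(loss="hinge", alpha=1e-4, random_state=0).fit(X, y)

    # Same model family with log loss instead: logistic regression.
    # (The loss is named "log" in older scikit-learn versions.)
    logreg = SGDClassifier(loss="log_loss", alpha=1e-4, random_state=0).fit(X, y)

    print(svm.score(X, y), logreg.score(X, y))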

~~~
quantombone
The hinge loss and the primal form of the SVM objective are really easy to
understand, yet every ML 101 class would jump into the dual formulation and
talk about kernels, RKHS, and all the fancy stuff.

Once you realize that a linear SVM isn’t very different from logistic
regression, it starts to all make sense (at least it did for me).

Key insight of the hinge-loss: once something is classified correctly beyond
the margin, it incurs a loss of zero.

Now, something fun to think about. Draw the hinge loss. Now draw the ReLU
(which is found all over the place in CNNs). Now think about L1-regularization
(which was used to induce sparsity in compressed sensing). They are more
similar in form than you would think.
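
A tiny sketch of that comparison (illustrative only):

    import numpy as np

    z = np.linspace(-3, 3, 601)

    hinge = np.maximum(0, 1 - z)  # hinge loss as a function of the margin score t*y
    relu  = np.maximum(0, z)      # ReLU activation
    l1    = np.abs(z)             # L1 penalty on a single weight

    # All three are piecewise linear with one kink: the hinge is a shifted,
    # flipped ReLU, and |z| = relu(z) + relu(-z).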

~~~
a-dub
Wasn't the reason everyone cared about SVMs specifically the nonlinear kernel
stuff that would push performance up a smidge and produce new bests on
existing benchmark datasets?

The QP-dual formulation never seemed like something that could scale, and
linear SVMs never seemed all that much better than just lasso/elasticnet
regression. (hmmmmm :) )

~~~
cultus
The nonlinear kernel SVM works at least as well in the primal, using just the
representer theorem. Since it is unconstrained, all you need to do is create a
kernel matrix and solve your system with your favorite convex optimizer like
Newton's method (which can also work in lower dimensions).

Second-order methods like Newton's method converge better to the exact
solution than SGD, although they usually don't reach a "pretty good" solution
as fast. Coordinate descent methods in the dual also get very close to the
exact solution, but Newton's method and friends are usually faster. With
(quasi-)Newton methods in the primal, everything just comes down to solving
linear systems, which is a much more well-studied problem.

I've even experimented successfully with kernelized models with millions of
examples in low dimensions using the Fast Gauss Transform. That's impossible
in the dual.

You can also generate low-rank kernel approximations [0] using the Nystroem
method or the Fastfood transform, which can then be used in a linear SVM. For
example, if I have a problem with n=10^6, I can make a low-rank approximation
of the kernel matrix (say d=1000) and feed that into a fast SGD optimizer.

This often works really well, and is usually pretty close to the exact kernel
solution if the problem is of lower intrinsic dimensionality, which is usually
true if the dual SVM is sparse in its basis vectors. This largely negates the
sparsity advantage of the dual SVM. If a kernel approximation isn't good, then
the dual SVM wouldn't be meaningfully sparse anyway, so there is still no
advantage to the dual. Best to just solve the kernelized system in the primal,
and use a second-order optimizer if needed.
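
A minimal sketch of that low-rank recipe with scikit-learn (component count
and other parameters are just placeholders): a Nystroem approximation of the
RBF kernel feeding a primal hinge-loss model trained by SGD.

    from sklearn.datasets import make_classification
    from sklearn.kernel_approximation import Nystroem
    from sklearn.linear_model import SGDClassifier
    from sklearn.pipeline import make_pipeline

    # Toy data standing in for a large problem (n ~ 10^5 here, not 10^6).
    X, y = make_classification(n_samples=100_000, n_features=50, random_state=0)

    # Low-rank (d=1000) approximation of the kernel matrix, then a fast linear SVM.
    model = make_pipeline(
        Nystroem(kernel="rbf", gamma=0.1, n_components=1000, random_state=0),
        SGDClassifier(loss="hinge", alpha=1e-5, random_state=0),
    )
    model.fit(X, y)
    print(model.score(X, y))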

[0] [https://scikit-learn.org/stable/modules/kernel_approximation...](https://scikit-learn.org/stable/modules/kernel_approximation.html)

------
cultus
There's been some work on variational Bayesian formulations of SVMs in the
last few years. These can give actual uncertainty estimates and do automatic
hyperparameter tuning. This one in particular is very cool:

[https://arxiv.org/pdf/1707.05532.pdf](https://arxiv.org/pdf/1707.05532.pdf)

------
usgroup
Clearly idiots are not what they used to be in my day ...

~~~
commandlinefan
I had trouble following it myself, and then it struck me - that must be
because I’m not an idiot!

------
rusbus
It's interesting how quickly support vector machines went from the hot new
thing to classify images to an afterthought after deep learning started having
great results.

~~~
rdtsc
Noticed that too. It feels like it was just a few years and all of a sudden
everything is "deep" now.

The same thing happened with data storage. As soon as big data appeared
everyone stopped doing just data and started doing "big data". Now the term is
kind of a joke even.

I predict that in a few years the term "deep learning" will mostly be used in
an ironic sense as well.

~~~
username223
> I predict that in a few years the term "deep learning" will mostly be used
> in an ironic sense as well.

I may be a bit behind the times, but I'm also mystified by "deep learning's"
popularity. Both giant neural nets and kernel methods have overfitting
problems: torture a billion-parameter model long enough, and it will tell you
what you want to hear.

SVMs address this by finding a large margin for error, which will hopefully
improve generalization. DNNs (I think) do this by throwing more ("big") data
at the problem and hoping that the training set covers all possible inputs.
Work on adversarial learning suggests that DNNs go completely off the rails
when presented with anything slightly unexpected.

~~~
soraki_soladead
My other comment addresses some of this, but you're overstating things a bit.
Throwing more data at the model is one solution; it's just not the only, or
even the best, approach. Properly measured performance on good holdouts and
the application of regularization avoid the worst of overfitting. This is
standard practice in most of machine learning, not just deep learning.

Deep learning gets a lot of hype because for many applications deep models
perform better and scale better, without a lot of the tricks and extensions
that are now possible with SVMs. You can even use a large-margin loss with
deep models to get some of the benefits of SVMs.
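
For instance, here is a hedged sketch in PyTorch (layer sizes and data are
placeholders); torch.nn.MultiMarginLoss is a multi-class hinge loss.

    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
    criterion = nn.MultiMarginLoss(margin=1.0)  # multi-class hinge (max-margin) loss
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    x = torch.randn(32, 128)          # dummy feature batch
    y = torch.randint(0, 10, (32,))   # dummy class labels

    loss = criterion(model(x), y)     # only scores inside the margin are penalized
    loss.backward()
    optimizer.step()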

Adversarial examples are way overblown. First, SVMs are not immune to them
either. Second, very few applications are threatened by things like
adversarial examples.

------
simonw
Bullet point on page 2: "Optimal hyperplane for linearly separable patterns"

I think the author may be working from a very different definition of the word
"idiot".

~~~
simonw
Seconding the recommendation for [https://blog.statsbot.co/support-vector-machines-tutorial-c1...](https://blog.statsbot.co/support-vector-machines-tutorial-c1618e635e93) - after reading that, "Optimal hyperplane for linearly separable patterns" actually made sense to me.

------
iamwil
I have a question!

In the PDF, it says that the optimization problem in SVMs has a nice property:
it is quadratic (convex), which means there's a single global minimum to go
towards, and not lots of local minima like in NNs. That means, it seems, SVMs
won't get stuck at a suboptimal solution.

Is that not a problem in DNNs now? Or is it that it's such high dimensionality
that local minima don't stop the optimizer, because there's always another way
around the local minimum?

~~~
olooney
People who know more about deep learning than I do tend to argue that there is
empirical evidence that non-convexity is a non-issue because the performance
of local minima will be close to the performance of the global minimum, given
a sufficient number of nodes[1][2]. One such quote:

    I once ran a small neural net 100 times on simple three-dimensional data,
    reselecting the initial weights to be small and random on each run. I found
    32 distinct minima, each of which gave a different picture, and having
    about equal test set error.

        – Leo Breiman

[1]: [https://projecteuclid.org/download/pdf_1/euclid.ss/100921372...](https://projecteuclid.org/download/pdf_1/euclid.ss/1009213726)

[2]: [https://stats.stackexchange.com/questions/203288/understandi...](https://stats.stackexchange.com/questions/203288/understanding-almost-all-local-minimum-have-very-similar-function-value-to-the)

Note also that "optimal" in each case means "relative to other models in the
same family." But a large neural net and an SVM do not exist in the same
hypothesis space. As an analogy, recall that Ordinary Least Squares is the
BLUE (Best Linear Unbiased Estimator). But that's only relative to other
unbiased (which in particular means no regularization: no
ridge/lasso/elasticnet) linear models over the same feature set. Just because
OLS provably gives the best performance in this family doesn't mean it is
going to do well compared to models outside that family. In the same way, just
because SVM offers a convergence guarantee doesn't mean its performance will
always be as good as or better than an NN's.

The real problem with SVMs is that when you use a kernel (which is the whole
point; linear SVMs are just logistic regression with hinge loss) you introduce
one landmark (feature) for every data point. So if your original data set was
a manageable 1e6x1e3, then the SVM will view this data set as 1e6x1e6. It
doesn't actually have to store that in memory thanks to the kernel trick [3],
but training time still scales as O(N^2), where N is the number of
observations (rows) in the original data. In practice, even with a highly
optimized library like LIBSVM [4], you're not going to get good performance
past N=1e6. (I've personally had the most success with SVMs on small datasets
of only a few thousand rows.) NNs, meanwhile, can very easily accommodate much
larger training sets, with training time only growing as O(N), more or less.
You can always sample a manageable training set from a larger data set, but if
you're only using 1% of your data to train a model, that's a problem.

Deep NN is also a far more modular approach: if you know your data has a 2D
structure for example, you can add some CNN layers and get translational
invariance. Deep NN comes with a large toolbox for specializing the model to
the data, while the options for tuning an SVM are much more limited: you can
change out the kernel function (not that anything is likely to beat RBF) and
play around with the regularization parameter, and that's it. Once you've
tuned those hyper-parameters with a small grid search there's not much else
left to try with SVMs. If they work, they work, otherwise you have to change
approaches.
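
A hedged sketch of that kind of small grid search with scikit-learn (grid
values are arbitrary):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=2_000, n_features=20, random_state=0)

    # The whole SVM tuning story: kernel choice, regularization parameter C,
    # and (for RBF) gamma, searched with cross-validation.
    grid = GridSearchCV(
        SVC(),
        param_grid={
            "kernel": ["linear", "rbf"],
            "C": [0.1, 1, 10, 100],
            "gamma": ["scale", 0.01, 0.1],
        },
        cv=5,
    )
    grid.fit(X, y)
    print(grid.best_params_, grid.best_score_)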

[3]:
[https://en.wikipedia.org/wiki/Kernel_method](https://en.wikipedia.org/wiki/Kernel_method)
[4]:
[https://www.csie.ntu.edu.tw/~cjlin/libsvm/](https://www.csie.ntu.edu.tw/~cjlin/libsvm/)

~~~
abhgh
I think you're exactly right about the modularity aspect of DL; in fact I made
a similar comment on this page, albeit speaking in terms of basis functions.

I have a minor nitpick regarding this point you make: _not that anything is
likely to beat RBF_. Depending on the data, specialized kernels can help
immensely. An easy example is sequence classification where something like a
string kernel might work really well. Or image classification, where histogram
based kernels might prove superior.

Note that sometimes you might want to measure how good a kernel is for a
problem not by its prediction accuracy alone but also by the number of support
vectors it needs - if the final model retains ~100% of the training data as
support vectors, it is not a great model in _some (subjective) sense_ since it
is memorizing a lot. Depending on the data, you might "beat" the RBF kernel on
this aspect too.
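
A hedged sketch of checking that with scikit-learn's SVC (data and parameters
are illustrative); n_support_ reports the number of support vectors per class.

    from sklearn.datasets import make_classification
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=1_000, n_features=20, random_state=0)

    for gamma in [0.01, 1.0, 100.0]:
        clf = SVC(kernel="rbf", C=1.0, gamma=gamma).fit(X, y)
        n_sv = clf.n_support_.sum()
        # Retaining nearly all training points as support vectors suggests
        # memorization; fewer support vectors at similar accuracy is preferable.
        print(f"gamma={gamma}: acc={clf.score(X, y):.3f}, SVs={n_sv}/{len(X)}")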

Regarding the training time, there are some interesting tricks I've come
across (but not tried out yet) - [1], [2].

[1] Ensemble SVMs
[http://www.jmlr.org/papers/volume15/claesen14a/claesen14a.pd...](http://www.jmlr.org/papers/volume15/claesen14a/claesen14a.pdf)

[2] SVMPath - algorithm to fit the entire path of SVM solutions for every
value of the cost parameter, with essentially the same computational cost as
fitting one SVM model.
[http://www.jmlr.org/papers/volume5/hastie04a/hastie04a.pdf](http://www.jmlr.org/papers/volume5/hastie04a/hastie04a.pdf)

------
strikelaserclaw
Implementing an SVM was my senior project in college. Brings back nightmares.

------
mistrial9
Compare to Tzotsos 2006, "A SUPPORT VECTOR MACHINE APPROACH FOR OBJECT BASED
IMAGE ANALYSIS".

