
An Introduction to Support Vector Machines - feconroses
https://monkeylearn.com/blog/introduction-to-support-vector-machines-svm/
======
currymj
Since neural nets are winning at the moment, it's easy to see SVMs as an
underdog, being ignored due to deep learning hype and PR. This is kind of
true, but it's worth noting that 10-15 years ago we had the exact opposite
situation. Neural nets were a once promising technique that had stagnated/hit
their limits, while SVMs were the new state of the art.

People were coming up with dozens of unnecessary variations on them, everybody
in the world was trying to shoehorn the word "kernel" into their paper titles,
and using some kind of kernel method was a surefire way to get published.

I wish machine learning research didn't respond so strongly to trends and
hype, and I also wish the economics of academic research didn't force people
into cliques fighting over scarce resources.

I'm still wondering what, if anything, is going to supplant deep learning.
It's probably an existing technique that will suddenly become much more usable
due to some small improvement.

~~~
rm999
This is true, but only in the academic research world. SVMs had relatively
little success on practical problems and in industry, so they never built up
the kind of standing that neural networks did. Even in 2003-2005 - arguably
the peak time for SVMs - neural networks were much better known to almost
everyone (industry practitioners, researchers, and laypeople) than SVMs.

What frustrates me is that people who are starting out in machine learning
often never learn that linear/logistic regression dominates the practical
applications of ML. I've spoken to people who know the ins and outs of various
deep network architectures who don't even know how to start with building a
baseline logistic regression model.

~~~
tnecniv
What prevented SVMs from catching on in industry?

~~~
rcar
They tend to create difficult-to-interpret models that don't perform as well
as other "black box" modeling methods (GBMs, neural nets, etc.).

~~~
closed
Was this true, or perceived as true in 2003? My understanding was that people
did not see them as performing worse than NN back then.

~~~
rm999
Definitely true. I worked for a company that was generating millions of
dollars a year from neural networks in the mid 90s (edit: to be clear I didn't
work there in the 90s, I joined years after their initial buildouts). The
Unreasonable Effectiveness™ of neural networks has been true for a long time.
When I worked there I tried switching out some models with SVMs and they were
less accurate and took 1-2 orders of magnitude more time to train.

~~~
closed
Really useful to hear, thanks! I know psychology was making a lot of headway
with NN models in the 90s, but I had little sense of what was going on in
industry.

------
deepnotderp
I'd just like to note that instead of creating additional animosity between
SVMs and deep nets, you could use both together. SVMs with hinge loss can be
Yet-another-layer (tm) in your deep net, to be used when it provides better
performance.
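
A minimal sketch of that, assuming PyTorch (the layer sizes and the dummy batch
below are placeholders, not anything from the article): swap the usual
cross-entropy head for a linear layer trained with multi-class hinge loss.

      import torch
      import torch.nn as nn

      # Placeholder feature extractor; any conv/FC stack would do here.
      backbone = nn.Sequential(nn.Flatten(), nn.Linear(784, 128), nn.ReLU())
      svm_head = nn.Linear(128, 10)    # linear classifier on the learned features
      hinge = nn.MultiMarginLoss()     # multi-class hinge loss instead of cross-entropy

      x = torch.randn(32, 1, 28, 28)   # dummy batch of MNIST-sized images
      y = torch.randint(0, 10, (32,))
      loss = hinge(svm_head(backbone(x)), y)
      loss.backward()                  # backbone and margin head train jointly

For a proper max-margin objective you would also put weight decay on the head's
weights, which plays the role of the SVM's regularization term.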

~~~
strebler
That's a great point. Fundamentally, if you look at something like a CNN, what
it's really doing is producing a feature descriptor based on the input image.
One can easily use that feature descriptor in a classic SVM, alongside (or
instead of) SoftMax.
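
A rough sketch of that pipeline, assuming torchvision and scikit-learn (the
random tensors and labels are stand-ins for real images): take a pretrained
CNN, drop its softmax head, and fit a classic SVM on the resulting descriptors.

      import torch
      from torchvision import models
      from sklearn.svm import SVC

      # ImageNet-pretrained ResNet used as a fixed feature extractor.
      net = models.resnet18(weights="IMAGENET1K_V1")
      net.fc = torch.nn.Identity()     # drop the classification head, keep 512-d features
      net.eval()

      with torch.no_grad():
          feats = net(torch.randn(8, 3, 224, 224)).numpy()   # stand-in images
      labels = [0, 1, 0, 1, 0, 1, 0, 1]                      # stand-in labels

      clf = SVC(kernel="rbf").fit(feats, labels)             # classic SVM on CNN descriptors
      print(clf.predict(feats[:2]))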

~~~
deepnotderp
Yup, in fact, the universal feature extraction is what allows imagenet
pretraining to work well on lung cancer images.

One nitpick though, ConvNets can absolutely be used to do "thinking" and more
than just feature extraction. For example, fully convolutional networks can be
extremely competitive with FC-layer based nets.

------
idrism
For anyone interested in SVMs (and other introductory Machine Learning
concepts), Udacity's intro course is really good:
[https://www.udacity.com/course/intro-to-machine-learning--ud120](https://www.udacity.com/course/intro-to-machine-learning--ud120)

~~~
yamaneko
This class from MIT taught by Patrick Winston is also a great resource:
[https://www.youtube.com/watch?v=_PwhiWxHK8o](https://www.youtube.com/watch?v=_PwhiWxHK8o)

At the end of the class, he also gives some historical perspectives, like how
Vapnik came up with SVMs.

------
Asdfbla
I remember that only a few years ago, in a computational statistics class I
took, the lecturer mentioned how SVMs (and Random Forests) had largely
replaced neural networks. How things can change so quickly...

I always liked SVMs for the elegance of the kernel trick, but I guess choosing
the right kernel functions and parameters for them wasn't that much easier
than training a neural net either.

~~~
joe_the_user
(All this from my rough, amateur understanding.) SVMs are more or less
equivalent to linear regression in a "feature space", and also equivalent to a
shallow neural network (~2-3 layers). This means their size more or less
increases with the amount of data they are attempting to approximate, and this
means they don't scale well to truly huge data sets.

Deep nets pulled ahead of SVMs at the point where people figured out how to
train them on truly huge data sets using GPUs and gradient descent (and an
ever-increasing arsenal of further tricks - all the schemes together are
mind-boggling to read about).

This was basically because the depth of a deep neural net means that its size
isn't as prone to increase with the size of the data.

I don't really know why SVMs haven't been able to scale to a multi-layer
approach, though I know people have tried (someone has tried just about
everything these days).

Part of the situation is that leveraging simple code with GPUs may still be the
most effective approach.

~~~
genericpseudo
Close but not quite. The difference between a (soft-margin) SVM and a
kernelized linear regression is the choice of loss function; the SVM minimizes
hinge loss, linear regression minimizes squared loss.

(Choice of different loss functions will also give you Elastic Net, LASSO,
logistic regression. From an engineering point of view I tend to think of the
entire class as being different flavors of "stochastic gradient descent", in
the spirit of Vowpal Wabbit etc.)
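
As a sketch of that view with scikit-learn's SGDClassifier (the parameter
choices here are only illustrative), the same SGD loop turns into differently
named models as you swap the loss and penalty:

      from sklearn.datasets import make_classification
      from sklearn.linear_model import SGDClassifier

      X, y = make_classification(n_samples=500, n_features=20, random_state=0)

      # Same optimizer; different loss/penalty gives differently named models.
      linear_svm = SGDClassifier(loss="hinge", penalty="l2")        # soft-margin linear SVM
      logreg = SGDClassifier(loss="log_loss", penalty="l2")         # logistic regression
      elastic = SGDClassifier(loss="log_loss", penalty="elasticnet", l1_ratio=0.5)

      for name, clf in [("svm", linear_svm), ("logreg", logreg), ("elastic net", elastic)]:
          print(name, clf.fit(X, y).score(X, y))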

------
chestervonwinch
With SVMs, you often must perform a rather large grid search over kernels and
kernel parameters. It seems like no matter the model, we can't avoid the
hyperparameter problem -- although boosting and bagging meta-methods come
close.
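
A typical version of that search, sketched with scikit-learn (the kernels and
parameter ranges below are just illustrative):

      from sklearn.datasets import load_digits
      from sklearn.model_selection import GridSearchCV
      from sklearn.svm import SVC

      X, y = load_digits(return_X_y=True)

      # Cross-validated grid over kernels and their hyperparameters.
      param_grid = [
          {"kernel": ["linear"], "C": [0.1, 1, 10]},
          {"kernel": ["rbf"], "C": [0.1, 1, 10], "gamma": [1e-3, 1e-2, 1e-1]},
      ]
      search = GridSearchCV(SVC(), param_grid, cv=5).fit(X, y)
      print(search.best_params_, search.best_score_)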

It would be nice if we could quantify the complexity of a dataset and match
this to a model with similar complexity. I imagine that it's hard (or
impossible) to decouple these two complexity quantifiers, however.

------
stared
For SVMs I really like this intro:
[https://generalabstractnonsense.com/2017/03/A-quick-look-at-Support-Vector-Machines/](https://generalabstractnonsense.com/2017/03/A-quick-look-at-Support-Vector-Machines/) (with hand drawings!)

------
aaron-lebo
For a recent practical example of their usefulness:

 _This paper presents the Militarized Interstate Dispute (MID) 4.0 research
design for updating the database from 2002-2010. By using global search
parameters and fifteen international news sources, we collected a set of over
1.74 million documents from LexisNexis. Care was taken to create an all-
inclusive set of search parameters as well as a sufficient and unbiased list
of news sources. We classify these documents with two types of support vector
machines (SVMs). Using inductive SVMs and a single training set, we remove
90.2% of documents from our initial set. Then, using year-specific training
sets and transductive SVMs, we further reduce the number of human-coded
stories by an additional 21.6%. The resulting classifications contain anywhere
from 10,215 to 19,834 documents per year._

[http://steventlandis.weebly.com/uploads/1/2/1/4/12144932/dor...](http://steventlandis.weebly.com/uploads/1/2/1/4/12144932/dorazio_et_al_2012_here.pdf)

~~~
ice109
5 years isn't a recent example. my impression is that deep nets ate everyone's
lunch (including svm).

~~~
dmreedy
Deep nets ate everyone's hype. The lunch is still there. SVMs have many
advantages over ANNs that recommend themselves to practical applications
still.

~~~
rwallace
What advantages do SVMs have at this stage?

~~~
dmreedy
training speed, training data requirements, and runtime resource requirements
are the big ones. And on domains that are not raw image processing, SVMs are
still often quite competitive when it comes to accuracy.

------
chrischen
Can someone explain this part:

    
    
      Imagine the new space we want:
      z = x² + y²
    
      Figure out what the dot product in that space looks like:
      a · b = x_a·x_b  +  y_a·y_b  +  z_a·z_b
      a · b = x_a·x_b  +  y_a·y_b  +  (x_a² + y_a²)·(x_b² + y_b²)

~~~
tgeery
I have the same problem. Where did a & b come from? Which two vectors are we
taking the dot product of? And how is this less expensive?

~~~
Longwelwind
In the decision function of an SVM, you compute the scalar products of the
support vectors (points that are on the margin of your hyperplane, or more
precisely, the points that constrain your hyperplane) and your new sample
point:

    
    
      x · sv
    

The "z" the article defines is a new component that will be taken into account
in the scalar product. A more mathematical way of seeing that is that you
define a function phi that takes an original sample of your dataset, and
transform it into a new vector. In our case, we simply add a new dimension
(x3) based on the two original dimensions (x1, x2) that we add as a third
component in our vector:

    
    
      phi(x) = [x1, x2, x1² + x2²]
    

The scalar product we will have to compute in our decision function can then
be expressed as (this is the a and b in the article, i.e. the sample and the
support vector in our new space):

    
    
      phi(x) · phi(sv)
    

The SVM doesn't need phi(x) or phi(sv) themselves, only the scalar product of
those two vectors. The kernel trick is to find a function k that satisfies

    
    
      k(x, sv) = phi(x) · phi(sv)
    

and that satisfies Mercer's condition (I'll let Google explain what that is).

Your SVM will compute this (simpler) k function instead of the full scalar
product. There are multiple "common" kernel functions used (Wikipedia has
examples of them[1]), and choosing one is a parameter of your model (ideally,
you would then set up a testing protocol to find the best one).

[1] [https://en.wikipedia.org/wiki/Positive-definite_kernel#Examples_of_p.d._kernels](https://en.wikipedia.org/wiki/Positive-definite_kernel#Examples_of_p.d._kernels)
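
To make that concrete, here is a small numeric check (not from the article):
for the phi above, the matching kernel works out to k(a, b) = a·b + ||a||²·||b||²,
so you get the scalar product in the new space without ever building phi(x).

      import numpy as np

      def phi(v):
          # explicit mapping from the article: [x1, x2, x1^2 + x2^2]
          return np.array([v[0], v[1], v[0] ** 2 + v[1] ** 2])

      def k(a, b):
          # kernel giving the same scalar product without computing phi
          return a @ b + (a @ a) * (b @ b)

      a = np.array([1.0, 2.0])
      b = np.array([3.0, -1.0])
      print(phi(a) @ phi(b), k(a, b))   # both print 51.0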

~~~
tgeery
Thank you. This was an amazing explanation. I am new to SVMs but did not make
the connection that margin points (observations along the margin of the
hyperplane) become your support vectors. This makes a lot more sense.

And if I am following correctly, it would make sense that the final step would
then be:

We would maximize the dot product of a new observation with the support
vectors to determine its classification (red or blue).

~~~
Longwelwind
During the learning phase of the SVM, you try to find a hyperplane that
maximizes the margin.

The decision function of an SVM can be written as:

    
    
      f(x) = sign(sum alpha_sv y_sv k(x, sv))
    

Where the sum runs over all support vectors "sv", "y_sv" represents the class
of the support vector (red=1, blue=-1, for example), and "alpha_sv" is the
result of the optimization during the learning phase (it is equal to zero for
a point that is not a support vector, and is positive otherwise).

The decision function is a sum over all support vectors weighted by the "k"
function (which can thus be seen as a similarity function between two points
under your kernel); the y_sv makes each term positive or negative depending on
the class of the support vector. You take the sign of this sum (1 -> red,
-1 -> blue, in our example), and that gives you the predicted class of your
sample.
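
A small check of that formula against scikit-learn (an illustrative sketch;
note that sklearn also adds a bias term b to the sum, and stores
alpha_sv * y_sv in dual_coef_):

      import numpy as np
      from sklearn.svm import SVC

      # Toy red/blue data: two Gaussian blobs.
      rng = np.random.default_rng(0)
      X = np.vstack([rng.normal(-2, 1, (20, 2)), rng.normal(2, 1, (20, 2))])
      y = np.array([-1] * 20 + [1] * 20)

      clf = SVC(kernel="rbf", gamma=0.5).fit(X, y)

      def k(a, b, gamma=0.5):
          return np.exp(-gamma * np.sum((a - b) ** 2))

      # f(x) = sign( sum_sv alpha_sv * y_sv * k(x, sv) + b )
      x_new = np.array([1.5, 1.5])
      score = sum(c * k(x_new, sv) for c, sv in
                  zip(clf.dual_coef_[0], clf.support_vectors_)) + clf.intercept_[0]
      print(np.sign(score), clf.predict([x_new]))   # same predicted class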

