
An Idiot’s guide to Support vector machines (2003) [pdf] - headalgorithm
http://web.mit.edu/6.034/wwwbob/svm-notes-long-08.pdf
======
nathell
I wish people would stop using disparaging terms, especially in titles.

I skimmed through the presentation and initially felt overwhelmed by the
amount of heavy linear algebra going on in there. Now, that's not a criticism
of the document itself – there's probably no way around that – but if I were
to give up because of failing to understand the content, the corollary would
be "I'm not even an idiot", and that really doesn't help anyone.

If the title were "A guide to SVMs", it would be just as descriptive, while
avoiding calling people names.

~~~
therealcamino
I had the same reaction, and then I remembered that series of books "The
Complete Idiot's Guide To..." that was popular at the time. It's probably a
reference to that, and intended as a joke given the background needed to
understand the presentation.

[https://apnews.com/dd0d4d10c6ce8f698cb325a72ee82b97](https://apnews.com/dd0d4d10c6ce8f698cb325a72ee82b97)

------
zmmmmm
I've been through SVM explanations before and I always get to the part where
they start talking about kernels, and the logic seems circular. I.e.: if your
data isn't linearly separable, let's choose a kernel that matches the shape of
the data, and like magic it's now separable. But guess what, the reason I'm
doing machine learning is that I don't know the shape of my data. If I knew
that, and if there were actually a simple transformation, I would just
transform the data in the first place and use a linear model. So it seems like
a bit of a bait and switch. Is there some part of this that can automatically
select a transformation, in a similar manner to tuning hyperparameters in a
neural net?

Would love it if anybody would explain this to me :-)

~~~
blackbear_
The RBF kernel makes any dataset linearly separable, as long as the bandwidth
is small enough (but then you might be overfitting). And you certainly can
select the kernel automatically; it is just a hyperparameter, after all: try
different kernels and see what works best. You can even create a composite
kernel as a weighted sum of several simpler kernels, and find the best weights
while fitting the model.
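
As a rough sketch of the "kernel is just a hyperparameter" idea, assuming
scikit-learn (the toy dataset and the parameter grid below are arbitrary
choices):

    # Sketch: treat the kernel as a hyperparameter and pick it by cross-validation.
    from sklearn.datasets import make_moons
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

    param_grid = {
        "kernel": ["linear", "poly", "rbf"],  # candidate kernels
        "C": [0.1, 1, 10],                    # margin/penalty trade-off
        "gamma": [0.01, 0.1, 1, 10],          # bandwidth for rbf/poly (ignored by linear)
    }

    search = GridSearchCV(SVC(), param_grid, cv=5)
    search.fit(X, y)
    print(search.best_params_, search.best_score_)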

~~~
ethelward
> The RBF kernel makes any dataset linearly separable, as long as the
> bandwidth is small enough

That's very interesting, would you happen to have the paper proving that
somewhere?

~~~
blackbear_
> would you happen to have the paper proving that somewhere?

Actually no, I don't, but here's the intuition. Consider what happens in the
limit as the bandwidth goes to zero: the kernel collapses to a delta function,
i.e. K(x_i, x_j) = 1 when i = j and 0 otherwise, so the kernel matrix
approaches the identity. In the quadratic program each training point then
interacts only with itself, and essentially every point becomes a support
vector. The resulting SVM predicts (close to) zero almost everywhere, except
in a smaller and smaller neighborhood of each training point, where the
prediction matches that point's label.
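
A quick numerical illustration of that limit (arbitrary points and bandwidths,
not a proof):

    # Sketch: as the RBF bandwidth shrinks, the kernel matrix of distinct
    # points approaches the identity (1 on the diagonal, ~0 elsewhere).
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 2))  # a handful of arbitrary distinct points

    def rbf_kernel_matrix(X, bandwidth):
        sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-sq_dists / (2 * bandwidth ** 2))

    for bw in [1.0, 0.1, 0.01]:
        K = rbf_kernel_matrix(X, bw)
        print(f"bandwidth={bw}: largest off-diagonal entry = {(K - np.eye(len(X))).max():.2e}")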

------
dijksterhuis
I've always used [https://www.svm-tutorial.com/](https://www.svm-tutorial.com/)
as my go-to suggestion for reading up on SVMs, as it starts easy and
progressively adds complexity as needed.

> This tutorial series is intended to give you all the necessary tools to
> really understand the math behind SVM. It starts softly and then get more
> complicated. But my goal here is to keep everybody on board, _especially
> people who do not have a strong mathematical background_.

Emphasis mine.

~~~
rubyfan
I appreciate the practical usage focus here with software tutorials.

------
cageface
Are support vector machines still used much or have they been supplanted by
deep learning methods?

~~~
fnbr
They’re really useful (and arguably state of the art) in situations with small
amounts of data. Deep learning is really hard to productionize, so classical
techniques like SVMs and random forests are widely used in production. Deep
learning is too, but not as much as you’d think.

~~~
Ozzie_osman
I graduated in 2006 (undergrad CS degree), and at the time, we were told SVMs
were a lot more practical than something like neural nets. Neural nets were
framed as "we will teach you this thing because it's fun to code back-prop and
it kind of works in a way we think your brain does too, but no one really uses
them in real life, except for classifying digits".

Funny how times have changed.

~~~
frequentnapper
Even in 2014 (grad CS degree), my data mining professor said SVMs were
advantageous over neural nets due to the local minima problem, so we never
learned neural nets in class.

~~~
hikarudo
That's a bit weird. Sure, it's an advantage if you can optimize the objective
function more easily. But the end goal is generalization, i.e. performance on
new data. The objective function is only a proxy for that.

------
frumiousirc
This gives a very helpful geometrical description, which finally made SVMs
make sense to me. The weights form a vector normal to a family of planes, and
the optimization finds the pair of parallel planes that most widely separates
the two categories of data.

Solving the optimization is done in terms of inner products of data vectors.
These inner products can be replaced by a kernel function, which implicitly
maps data that may otherwise overlap into a space where a separating plane may
be found.
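
A small sketch of that second point, assuming scikit-learn (the RBF kernel,
gamma value and toy data are arbitrary): after fitting, the decision function
touches new data only through kernel evaluations against the support vectors.

    # Sketch: f(x) = sum_i a_i * K(sv_i, x) + b, i.e. prediction depends on
    # the data only through kernel evaluations with the support vectors.
    import numpy as np
    from sklearn.datasets import make_moons
    from sklearn.svm import SVC

    X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
    gamma = 0.5
    clf = SVC(kernel="rbf", gamma=gamma).fit(X, y)

    def manual_decision(x):
        sq_dists = ((clf.support_vectors_ - x) ** 2).sum(axis=1)
        k = np.exp(-gamma * sq_dists)  # RBF kernel against each support vector
        return k @ clf.dual_coef_[0] + clf.intercept_[0]

    x_new = np.array([0.25, 0.5])
    print(manual_decision(x_new), clf.decision_function([x_new])[0])  # should match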

------
ejanus
Looking for a simple implementation of the primal and dual algorithms.

------
ajflores1604
This video on SVMs has helped me grok them more than any other resource to
date. Actually, the whole channel is full of great breakdowns:

[https://youtu.be/efR1C6CvhmE](https://youtu.be/efR1C6CvhmE)

------
dang
See also from 2018:
[https://news.ycombinator.com/item?id=18794545](https://news.ycombinator.com/item?id=18794545)

