An Idiot’s guide to Support vector machines (2003) [pdf] (web.mit.edu)
145 points by headalgorithm 39 days ago | 31 comments



I wish people would stop using disparaging terms, especially in titles.

I skimmed through the presentation and initially felt overwhelmed by the amount of heavy linear algebra going on in there. Now, that's not a criticism of the document itself – there's probably no way around that – but if I were to give up because of failing to understand the content, the corollary would be "I'm not even an idiot", and that really doesn't help anyone.

If the title were "A guide to SVMs", it would be just as descriptive, while avoiding calling people names.


I had the same reaction, and then I remembered that series of books "The Complete Idiot's Guide To..." that was popular at the time. It's probably a reference to that, and intended as a joke given the background needed to understand the presentation.

https://apnews.com/dd0d4d10c6ce8f698cb325a72ee82b97


I've never liked the "idiot's guide" or "for dummies" thing, either.

But, FWIW, keep in mind that this is a slide deck taken out of context. In context, it was probably being presented to an audience where a certain level of background knowledge could be more safely assumed. And, in context, it was probably being presented by a real live human. That means that there was a whole lot of more detailed explanation that is not available if you're just reading the deck. And also a human face to make any tongue-in-cheek bits (I'm guessing the title was meant to be at least a little bit irreverent) more apparent.


OTOH, these kinds of titles actually attract me to articles in areas I have very little clue about. They give me the impression that the authors have introduced the subject matter in the most accessible, ELI5 style.


I've been through SVM explanations before and I always get to the part where they start talking about kernels and the logic seems recursive. i.e.: if your data isn't linearly separable, let's choose a kernel that matches the shape of the data, and like magic it's now separable. But guess what, the reason I'm doing machine learning is I don't know the shape of my data. If I knew that, and if there was actually a simple transformation, I would just transform it in the first place and use a linear model. So it seems like a bit of a bait and switch. Is there some part of this that can automatically select a transformation in a similar manner to tuning hyperparameters in a neural net?

Would love it if anybody would explain this to me :-)


The RBF kernel makes any dataset linearly separable, as long as the bandwidth is small enough (but then you might be overfitting). And certainly you can automatically select the kernel; it is just a hyperparameter after all: try different kernels and see what works best. You can even create a composite kernel as a weighted sum of several simpler kernels, and find the best weights while fitting the model.
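
A minimal sketch of that kernel-as-hyperparameter idea, assuming scikit-learn and an illustrative toy dataset (the grid values are just examples, not recommendations):

    from sklearn.datasets import make_moons
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    # Toy dataset; in practice use your own data.
    X, y = make_moons(n_samples=300, noise=0.2, random_state=0)

    param_grid = {
        "kernel": ["linear", "poly", "rbf", "sigmoid"],
        "C": [0.1, 1, 10],
        "gamma": ["scale", 0.1, 1, 10],  # bandwidth-like parameter (ignored by the linear kernel)
    }

    # Cross-validated search over kernels and their parameters.
    search = GridSearchCV(SVC(), param_grid, cv=5)
    search.fit(X, y)
    print(search.best_params_, search.best_score_)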


> The RBF kernel makes any dataset linearly separable, as long as the bandwidth is small enough

That's very interesting, would you happen to have the paper proving that somewhere?


> would you happen to have the paper proving that somewhere?

Actually no, I don't, but here's the intuition. Consider what happens in the limit when the bandwidth goes to zero: the kernel collapses to a delta function, i.e. K(x_i, x_j) = 1 when i = j and 0 otherwise. The kernel matrix approaches the identity, and in the limit every training point becomes its own support vector. The SVM predicts zero almost everywhere except in a smaller and smaller neighborhood of each training point, where the prediction equals the label of that point.
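
If it helps, here's a small numerical sketch of that intuition, assuming scikit-learn (gamma acts as an inverse bandwidth here; the values and data are illustrative):

    import numpy as np
    from sklearn.metrics.pairwise import rbf_kernel
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 2))      # 20 distinct random points
    y = rng.integers(0, 2, size=20)   # arbitrary labels

    # As gamma grows (bandwidth shrinks), the Gram matrix of distinct points
    # approaches the identity and the SVM can interpolate even random labels.
    for gamma in (0.1, 10.0, 1000.0):
        K = rbf_kernel(X, gamma=gamma)
        off_diag = np.abs(K - np.eye(len(X))).max()   # distance from the identity
        acc = SVC(kernel="rbf", gamma=gamma, C=1e6).fit(X, y).score(X, y)
        print(f"gamma={gamma}: max off-diagonal={off_diag:.3f}, train accuracy={acc:.2f}")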


The RBF kernel implicitly maps vectors into an infinite-dimensional space, and that makes the dataset linearly separable. Increasing the number of dimensions improves separability, and going to infinity makes it always achievable. The representation is never computed explicitly; it is only implicitly infinite.

For example, when you use a polynomial kernel you do not add extra dimensions to the input vectors explicitly, but you could instead use a linear kernel, add the quadratic terms to the input vectors yourself, and get the same separability.


From someone who used some SVMs in computer vision, before CNNs took over:

* If you don't know the "shape" of your data, you can at least try the "kitchen sink" approach; just try out a bunch of different transformations and/or kernels, possibly in combination with each other, and see what works. This was a relatively popular approach in practice, but usually done with more rigor than I'm making it sound like. And there was even a popular method called "random kitchen sinks"! [1]

* You might not have a good idea of what a good transformation might be, but maybe you have some idea of how to measure similarity between data points? This could be a kernel, which is in some sense equivalent for SVMs.

* There is actually a fair amount of literature on learning transformations and kernels, or at least optimizing the parameters of kernels somehow. What you can do also depends on what kinds of labels (supervision) you have, if any.

* And finally, yeah, if you're at the point of trying to learn kernels or transformations, you're pretty close to what neural networks are already doing, and fairly successfully at the moment. At least in computer vision, one of the main reasons CNNs have largely replaced SVMs is that they can learn task-specific features given the right architecture/training method/hyperparameters and enough data. (What is the right architecture/training method/hyperparameters? Well... back to the "kitchen sink" method ;))

[1] (https://people.eecs.berkeley.edu/~brecht/papers/08.rah.rec.n...)
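
As a side note on the "random kitchen sinks" idea in [1]: a rough sketch of random Fourier features approximating an RBF kernel (sizes and gamma are illustrative):

    import numpy as np

    def random_fourier_features(X, D=2000, gamma=0.5, seed=0):
        # Rows of W ~ N(0, 2*gamma*I), b ~ Uniform(0, 2*pi); then
        # z(x) = sqrt(2/D) * cos(x W + b) satisfies z(x).z(x') ~= exp(-gamma*|x-x'|^2).
        rng = np.random.default_rng(seed)
        W = rng.normal(scale=np.sqrt(2 * gamma), size=(X.shape[1], D))
        b = rng.uniform(0, 2 * np.pi, size=D)
        return np.sqrt(2.0 / D) * np.cos(X @ W + b)

    X = np.random.default_rng(1).normal(size=(5, 3))
    Z = random_fourier_features(X)
    approx = Z @ Z.T                                                 # approximate RBF Gram matrix
    exact = np.exp(-0.5 * ((X[:, None] - X[None, :]) ** 2).sum(-1))  # gamma = 0.5
    print(np.abs(approx - exact).max())                              # small; shrinks as D grows

The explicit random features let you use a fast linear solver while approximating what the kernel machine would do.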


I'll take a stab at it (sorry if I'm not getting to the heart of your question and wrote about things you're already familiar with). I think there are two key levels of understanding about the kernel trick (the observation that often you only ever use a "kernel function" k(x,x') which roughly measures similarity between samples, rather than all the features of the samples themselves).

> "... if there was actually a simple transformation, I would just transform it in the first place and use a linear model"

The first level of usefulness of the kernel trick is that it allows us to bypass this. Even when we know the transformation, bypassing the explicit transformation can be vastly more computationally efficient.

Say our data comes as x = (x_1, ..., x_n), but we suspect that better features would be the second-degree terms: T(x) = (x_i x_j)_{1 <= i, j <= n}. So T maps R^n to R^{n^2}, and our data matrix could go from 10^3 columns wide to 10^6 wide (terrible!). And then we want to compute inner products of two samples mapped into this higher dimensional space R^{n^2}, which is an O(n^2) operation.

Alternatively, if we focus in on the fact that we really only need the inner product of the transformed samples (not actually the transformed samples themselves), we see that what we want is:

Sum_{1 <= i, j <= n} (x_i x_j) (x'_i x'_j) = [Sum_{1 <= i <= n} x_i x'_i ]^2 = <x,x'>^2

where <x,x'> is the normal inner product in R^n. So we can define k(x,x') = <x,x'>^2, and computing k(x,x') like this is only an O(n) operation (whereas not using the kernel trick and going the explicit transformation route leads to O(n^2) operations for computing inner products in the higher dimensional space).

So we've seen that, even when the transformation to be applied is simple and known, avoiding it with the kernel trick can vastly improve the speed and memory usage of the model.
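
A quick numpy sanity check of that identity, with illustrative sizes:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 1000
    x, x_prime = rng.normal(size=n), rng.normal(size=n)

    # Explicit route: build all n^2 products, then take an inner product in R^{n^2}.
    T = lambda v: np.outer(v, v).ravel()
    explicit = T(x) @ T(x_prime)

    # Kernel trick: one O(n) inner product in R^n, then square it.
    kernel = (x @ x_prime) ** 2

    print(np.allclose(explicit, kernel))  # True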

The second level of understanding the kernel trick is observing that a kernel simply measures similarity between two samples in some way. We can conjure kernel functions that create a notion of similarity we want to try out (or suspect would be good for our data), without ever having to think about what kind of transformation of the data would lead to an inner product in a higher dimensional space that corresponds to that similarity.

Let's make one right now. Say we want two samples x and x' to be similar if they are close (in R^n) and not similar if they are not close, but we really want to exaggerate this. We may imagine there's some threshold: two samples 1 unit away from each other are quite similar, but samples 3 units away aren't 1/3rd as similar, they're far, far less similar; we really want to "peak" similarity in a tight radius. Then we could use k(x,x') = exp(- |x-x'|^2), since this only has a value near 1 if x and x' are quite close and drops off rapidly to 0 as x and x' get further apart. How rapidly should the similarity drop off as they get further apart? That's probably a parameter we want to experiment with, so let's go with k(x,x') = exp(- gamma * |x-x'|^2) instead. We've just invented the Radial Basis Function (RBF) kernel (or Gaussian kernel)! Do we have any idea what explicit transformation we would apply to our data to get an inner product in a higher dimensional space that equals this k(x,x')? Nope. Regardless, do we have a notion of similarity that may be very useful for our data? Yup.
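
Not anything from the original slides, just a sketch of plugging such a hand-rolled similarity into scikit-learn's SVC, which accepts a callable that returns the Gram matrix (the toy data and gamma=0.5 are illustrative; the built-in kernel="rbf" computes the same thing):

    import numpy as np
    from sklearn.svm import SVC

    def my_rbf(X, Y, gamma=0.5):
        # Gram matrix K[i, j] = exp(-gamma * |X[i] - Y[j]|^2)
        sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-gamma * sq_dists)

    X = np.array([[0.0, 0.0], [0.1, 0.1], [3.0, 3.0], [3.1, 2.9]])
    y = np.array([0, 0, 1, 1])

    clf = SVC(kernel=my_rbf).fit(X, y)             # equivalent to kernel="rbf", gamma=0.5
    print(clf.predict([[0.05, 0.0], [2.9, 3.0]]))  # [0 1]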

So the kernel trick transforms the harder problem of thinking up a transformation to a higher dimensional space where the data can be easily separated into the easier problem of thinking up good notions of similarity between samples. But you're right - you still need some understanding of your data to intuit what a good kernel function will be for your problem. That's part of the art (unfortunately, less of a science) of being good at training SVMs. If you have no idea at all, most people will go with a Gaussian kernel and just see how that goes. Knowing all the common kernels and when to use which is basically the SVM equivalent of hyperparameter tuning in NNs - the model doesn't learn by itself which ones are good, though there are some common-wisdom defaults, and you can squeeze out some extra performance by knowing how to select the good ones from experience (or by brute-force searching all options). I need some practice in being more concise, but hopefully some of this helps.


Thanks so much for taking the time to write such a detailed answer. And yes, your explanations help a lot!


I've always used https://www.svm-tutorial.com/ as my go-to suggestion for reading up on SVMs as it starts easy and progressively adds the complexity as needed.

> This tutorial series is intended to give you all the necessary tools to really understand the math behind SVM. It starts softly and then get more complicated. But my goal here is to keep everybody on board, especially people who do not have a strong mathematical background.

Emphasis mine.


I appreciate the practical-usage focus here with software tutorials.


Are support vector machines still used much or have they been supplanted by deep learning methods?


They’re really useful (and arguably state of the art) in situations with small amounts of data. Deep learning is really hard to productionize, so classical techniques like SVMs and random forests are widely used in production. Deep learning is too, but not as much as you’d think.


I graduated in 2006 (undergrad CS degree), and at the time, we were told SVMs were a lot more practical than something like neural nets. Neural nets were framed as "we will teach you this thing because it's fun to code back-prop and it kind of works in a way we think your brain does too, but no one really uses them in real-life, except for classifying digits".

Funny how times have changed.


Funny how that happens. In 2006 I was taking a grad-level Intro to ML course. After the first (and only) lecture on neural nets, I asked the prof for recommendations on learning more - since I had mostly a cogsci background, I was pretty interested because of the "neural" part. The prof essentially said the same as yours (I don't blame him -- it was a common sentiment at the time). And of course, these days he's doing deep learning!


Even in 2014 (grad CS degree), my data mining professor said SVMs were advantageous over neural nets due to the local maxima problem, so we never learned neural nets in class.


That's a bit weird. Sure, it's an advantage if you can optimize the objective function more easily. But the end goal is generalization, i.e. performance on new data. The objective function is only a proxy for that.


I still use them. I had to build a few binary and multi-class classification models in NLP, and while Transformer models gave nice results, it was impossible for me to use them long term.

Why? Because the data were confidential, I could not use the cloud to get a nice GPU, and I needed almost one model per enterprise. Training DistilBERT for 3 epochs took around 12 hours, and it was a domain where data/concept drift happens often, which meant re-training was mandatory on a weekly basis, at least.

After some feature engineering, I was able to get almost the same performance (<1% difference) with a model that trained in less than 5 minutes and could be used in production easily.

I love the fact that SVMs, at least in scikit-learn, have a built-in early-stopping mechanism. Really handy.


For large data, yes. For small data sets, feature engineering is often more important than the machine learning model used. (Kaggle contests are frequently won by random forests + feature engineering.)

Another important thing to consider is that a lot of the actual work in setting up a machine learning pipeline has nothing to do with machine learning (making sure that data is clean, exceptions are handled, etc). On a small data set an SVM can achieve good performance out of the box. Starting with an SVM can be really helpful for validating the basic approach and understanding the nature of the problem before you sink a bunch of time/money into collecting data and training a complicated model.


I used them at work as part of an NLP system to extract valid relations between named entities in the context of politics. Since I didn't have labeled data beforehand, I annotated a few hundred examples myself, and because there was not much labeled data, the SVM outperformed the other models I compared it with (ANN, Gaussian Process, KNN, Naive Bayes, and others I forget). The best kernel for this approach was a linear one.

The system is composed of a set of machine learning models and parsers that extract, from the government's official public news (http://in.gov.br), documents with the following structured info:

- who was/will be hired or fired (PERSON entity)

- which job role they will/did have (JOB entity)

- when it will happen¹ (DATE entity)

Each entity is extracted individually using a custom-trained NER, and each sentence is passed to the Relation Extraction system, which is built using an SVM. Features are concatenated word vectors² composed of slices of the text in the form (entity1, entity2, before, between, after).

The system has been live for almost two years. It produces great results. I only needed to retrain the entity recognizer twice in all that time (it is built using spaCy, which uses an averaged perceptron).

The SVM part (Relation Extraction) has not been retrained since the day it was first deployed and it still works gracefully :D

¹ This info is in the text as natural language and is sometimes different from the post date.

² A custom gensim.Word2Vec model trained on this corpus.


I would like to understand how you solved your feature engineering issues.


In general I used a custom word vector model with a fixed dimension of n=100. As you may know, this feature transformer only maps words to vectors, but I have a more complex structure than just words to work on. The model works sentence-wise and the text is pre-parsed. Let's say I have the following sentence:

"Hire Bill Gates as Front-End Developer at 05/05/2020"

Suppose the NER extracts the following entities from it:

- Bill Gates (PERSON)

- Front-End Developer (JOB)

- 05/05/2020 (DATE)

In my domain problem, I need the relation PERSON-JOB-DATE, which can be decomposed into two binary relations: PERSON-JOB and JOB-DATE. Each binary relation is a model by itself with the following class outcomes: invalid, hiring, firing. If two binary relations have the same job and outcome classes, I build the triple PERSON-JOB-DATE.

At the feature engineering level, which is what you asked for, I build a tuple of local attention from the entities' perspective based on slices of the text: (entity1, entity2, before, between, after). Each part is transformed into a vector by averaging word vectors, and finally the parts are concatenated into a final vector that is used by the SVM. Since the word vector dimension I chose in my experiments was n=100, my model has 5*100 = 500 features.

An example of the slices for PERSON-JOB would be:

("Bill Gates", "Front-End Developer", "Hire", "as", "at 05/05/2020")

1. Each element of the tuple is tokenized

2. With the tokens available, use the word vector model to transform each one.

3. For each element of the tuple take the average of the vectors.

4. Concatenate each averaged vector into a final vector.

Using that structure, the model is strongly biased by how the sentence is written and by the words around the entities. For my domain, where the sentences are very regular, it worked well. I am not sure if it will work in a more general domain, like social media. There's a rough sketch of the featurization below.
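
A rough sketch of that featurization in numpy, with the word-vector lookup stubbed out by a plain dict as a stand-in for the custom Word2Vec model (names here are illustrative, not the actual system):

    import numpy as np

    DIM = 100
    word_vectors = {}  # token -> np.ndarray of shape (DIM,); stand-in for the Word2Vec model

    def avg_vector(text):
        # Tokenize, look up each token's vector, and average (zeros if nothing is found).
        tokens = text.lower().split()
        vecs = [word_vectors[t] for t in tokens if t in word_vectors]
        return np.mean(vecs, axis=0) if vecs else np.zeros(DIM)

    def featurize(entity1, entity2, before, between, after):
        # Concatenate the five averaged vectors -> 5 * DIM = 500 features for the SVM.
        parts = (entity1, entity2, before, between, after)
        return np.concatenate([avg_vector(p) for p in parts])

    x = featurize("Bill Gates", "Front-End Developer", "Hire", "as", "at 05/05/2020")
    print(x.shape)  # (500,)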

I hope this clarifies your question to some extent.


Thanks so much


Yeah, they're still used. When approaching a new problem I reach for linear methods first if it seems like they could work. DNNs are amazing, but they can be very heavy-handed and use significantly more resources, which isn't desirable unless needed or one is working with a complex problem.

I've even used DNNs when I couldn't make SVMs work, but then someone more experienced came along and showed me a kernel method that did the trick with an SVM.

In short, it's a great tool if it works.


This gives a very helpful geometrical description, which finally made SVMs make sense to me. The weights form a vector normal to a family of planes, and the optimization finds the two parallel planes that best separate the two categories of data.

The optimization is solved in terms of inner products of data vectors. These inner products can be replaced by a kernel function, implicitly transforming data that may otherwise overlap into a space where a separating plane can be found.
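
For reference, the standard hard-margin primal and its kernelized dual being described (the soft-margin version adds C times a sum of slack variables):

    \min_{w,b} \; \tfrac{1}{2}\lVert w \rVert^2
        \quad \text{s.t.} \quad y_i (w^\top x_i + b) \ge 1 \;\; \forall i

    \max_{\alpha_i \ge 0,\; \sum_i \alpha_i y_i = 0} \;
        \sum_i \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, k(x_i, x_j)

    f(x) = \operatorname{sign}\!\Big( \sum_i \alpha_i y_i \, k(x_i, x) + b \Big)

With k(x_i, x_j) = <x_i, x_j> this is the plain linear SVM; any other kernel swaps in the implicit feature space.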


Looking for a simple implementation of the primal and dual algorithms.


This video on SVMs has helped me grok them more than any other resource to date. Actually, the whole channel is full of great breakdowns:

https://youtu.be/efR1C6CvhmE




