An Introduction to Support Vector Machines (monkeylearn.com)
292 points by feconroses on June 22, 2017 | 60 comments

Since neural nets are winning at the moment, it's easy to see SVMs as an underdog, ignored due to deep learning hype and PR. This is kind of true, but it's worth noting that 10-15 years ago we had the exact opposite situation: neural nets were a once-promising technique that had stagnated and hit its limits, while SVMs were the new state of the art.

People were coming up with dozens of unnecessary variations on them; everybody in the world was trying to shoehorn the word "kernel" into their paper titles, and using some kind of kernel method was a surefire way to get published.

I wish machine learning research didn't respond so strongly to trends and hype, and I also wish the economics of academic research didn't force people into cliques fighting over scarce resources.

I'm still wondering what, if anything, is going to supplant deep learning. It's probably an existing technique that will suddenly become much more usable due to some small improvement.

This is true, but only in the academic research world. SVMs had relatively little success on practical problems and in industry, so they never built up the kind of standing that neural networks did. Even in 2003-2005 - arguably the peak time for SVMs - neural networks were much better known to almost everyone (industry practitioners, researchers, and laypeople) than SVMs.

What frustrates me is that people who are starting out in machine learning often never learn that linear/logistic regression dominates the practical applications of ML. I've spoken to people who know the ins and outs of various deep network architectures but don't even know how to start building a baseline logistic regression model.
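For anyone in that position, a baseline really is this small. Here's a toy from-scratch sketch of one-feature logistic regression trained by gradient descent (my own illustration, not from the article; in practice you'd reach for an off-the-shelf implementation such as scikit-learn's LogisticRegression):

```python
import math

def train_logistic(xs, ys, lr=0.1, epochs=1000):
    """Fit a one-feature logistic regression by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        gw = gb = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))  # sigmoid
            gw += (p - y) * x  # gradient of the log loss w.r.t. w
            gb += (p - y)      # gradient of the log loss w.r.t. b
        w -= lr * gw / n
        b -= lr * gb / n
    return w, b

def predict(w, b, x):
    # Classify as 1 when the predicted probability crosses 0.5
    return 1 if 1.0 / (1.0 + math.exp(-(w * x + b))) >= 0.5 else 0

# Made-up toy data: class 1 when x > 0
xs = [-3.0, -2.0, -1.0, 1.0, 2.0, 3.0]
ys = [0, 0, 0, 1, 1, 1]
w, b = train_logistic(xs, ys)
```

Once a baseline like this is in place, anything fancier has a number to beat.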

What prevented SVMs from catching on in industry?

The tl;dr answer: SVMs are theoretically powerful but, in practice, either impractical or pointless.

The longer answer:

SVMs come with two theoretical benefits: 1. A guaranteed globally optimal solution, since training is a convex optimization problem (though you also get this with simpler techniques like logistic regression). 2. The ability to use non-linear kernels, which can offer much more power than logistic regression (on par with neural networks).

So, at first, it seemed like SVMs were the best of all worlds, and a lot of people got excited. In practice, though, non-linear kernels slowed training down to the point of being impractical. Linear kernels were fast enough, but removed the second benefit, so most people would prefer to use established linear techniques like logistic regression.

Have been in the industry for quite a while, and I think one of the often-overlooked non-technical reasons is that they are hard to get an intuitive feel for, compared to ANNs. This might seem irrelevant, but think about it: someone who doesn't have a degree or rigorous training in ML or an ML-heavy program (and this population is big in the industry, people looking to get into ML from, say, analytics or software dev) will try out things they can identify with. ANNs are positioned exactly right for this - they're sufficiently sophisticated, you can quickly get a high-level idea, and there is the attractive comparison to how our mind works. So that's what people try out early amongst advanced algos. The industry doesn't give you a lot of time to explore, so once you have invested time to pick up ANNs, you tend to hold on to that knowledge.

I've also conducted multiple training sessions/discussions on ML with small groups, and they support what I've said. SVMs are hard to explain to a general crowd; with ANNs I have enough visual cues to get them started. Sure, they might not get the math right away, but they understand enough to be comfortable using a library.

As an aside, a comment here mentions that logistic/linear regression dominates the industry. I think it's for a similar reason: they're simple to understand and try out. That doesn't make them good models, in my experience, on a bunch of real-world problems.

Now if you ask me about the technical cons of SVMs, I'd say: the scalability of non-linear kernels, and the fact that I have to cherry-pick kernels. Linear and RBF kernels work well on most problems, but even then, for a bunch of problems where RBF seems to work well, the number of support vectors stored by the model can be massive. Even if I weren't pedantic about the kernel seeming to "memorize" more than "learn", this is still a beast to run in real time. nu-SVMs address this issue to an extent, but then we are back to picking the right kernel for the task. This is one thing I love about ANNs: the kernel (or what is essentially the kernel) is learned.

Also, hyperparameter optimization is a bit painful. Zoubin Ghahramani of Cambridge champions Gaussian process models, which are as flexible as SVMs/SVCs but take a more disciplined approach: well-chosen priors (Bayesian structure) and single-step hyperparameter optimization.

1. SVMs are not interpretable, just like DL.

2. Hard to parallelize if you're using a kernel other than the linear one.

3. So-so performance.

They tend to create difficult to interpret models that don't perform as well as other "black box" modeling methods (GBMs, neural nets, etc.)

This is not really true. Aside from ensemble models, they tend to perform pretty much on par or better. Here's an extensive comparison by Rich Caruana [1]

[1] https://www.cs.cornell.edu/~caruana/ctp/ct.papers/caruana.ic...

Was this true, or perceived as true in 2003? My understanding was that people did not see them as performing worse than NN back then.

Definitely true. I worked for a company that was generating millions of dollars a year from neural networks in the mid 90s (edit: to be clear I didn't work there in the 90s, I joined years after their initial buildouts). The Unreasonable Effectiveness™ of neural networks has been true for a long time. When I worked there I tried switching out some models with SVMs and they were less accurate and took 1-2 orders of magnitude more time to train.

Really useful to hear, thanks! I know psychology was gaining a lot of headway with NN models in the 90s, but had little sense for what was going on in industry.

Weren't SVMs used for the NetFlix recommendation engine? (i.e. the NetFlix prize?)

I think it was Gradient Boosted Decision Trees that were used to combine the different models.

No, it was an ensemble of collaborative filtering using matrix factorization (SVD) and an RBM.

err.. I think it is an evolution. Deep architectures allow for a more efficient function approximation from fewer examples than shallow architectures do.

Deep (neural) networks are really just a generalization of machine learning (on graphs). The key is that we learn the similarity function to discriminate from examples instead of specifying it a priori. You can also build classifiers other ways: linear/logistic regression based on feature vectors or by providing some similarity metric (SVM). But in these cases you are providing the discriminator function.

For example, in SVMs you have to provide a similarity measure, as the "kernel" maps your feature vector into a higher-dimensional space where the examples can be separated.

In deep neural networks we don't really know (or care to some extent) what the optimal feature vectors are or what the correct similarity metric is. We only care that at the end we've encoded it correctly (e.g. having chosen enough parameters/layers etc...) after training. Again the NN is some general function that applies a soft-max relationship from the inputs to outputs for each layer.

Yann LeCun has a great paper (2007) explaining this:


Hell, you can (efficiently) solve a lot of problems with KNN that people are throwing dedicated hardware + TensorFlow at. But that's not cool. I've learned that doing the "cool stuff" in the face of practicality is rampant because it's part of the self-fulfilling cycle of hiring people who do the cool stuff and people doing the cool stuff to get hired.

There's a whole world beyond neural networks and it seems like it's mostly all on the backburner now. Which I understand, the resurgence in NN algorithms, approaches and hardware in the last 10 years has been exciting but it does feel like tunnel vision sometimes.

Fashion statements come and go in academia. There are academics doing good research in unfashionable areas and doing just fine. Plus, it's hard to see where the latest trends will go. Who thought Software Defined Networks would be a thing in the 1980s?

> I wish machine learning research didn't respond so strongly to trends and hype,

It's really because nobody actually understands what's going on inside a ML algorithm. When you give it a ginormous dataset, what data is it really using to make its determination of

[0.0000999192346 , .91128756789 , 0 , .62819364 , 32.8172]

Because what I do for ML is do a supervised fit, then use a test set to confirm fitness, then unleash it on unseen data and check and see. But I have no real understanding of what those numbers actually represent. I mean, does .91128756789 represent the curve around the nose, or is it skin color, or is it a facial encoding of 3d shape?

> I'm still wondering what, if anything, is going to supplant deep learning.

I think it'll be a slow climb to actual understanding. Right now, we have object identifiers in NN. They work, after TB's of images and PFLOPS of cpu/gpu time. It's only brute force with 'magic black boxes' - and that provides results but no understanding. The next steps are actually deciphering what the understanding is, or making straight-up algorithms that can differentiate between things.

Shouldn't it be possible to backpropagate those categorical outputs all the way back to the inputs/features (NOT the weights) after a forward pass, to localize their sensitivity with respect to the actual pixels for a prediction? I imagine that would have to give at least some insight.

Beyond that, the repeated convolution/max-pool steps could be understood as applying something akin to a multi-level wavelet decomposition, which is pretty well understood. It's how classical matched filtering, Haar cascading, and a wide variety of preceding image classification methods operated at their first steps too.
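To make that analogy concrete: one level of a Haar decomposition is just a fixed 2-tap filter followed by downsampling by 2, structurally the same convolve-then-pool pattern. A toy 1-D sketch (my own illustration, not how a CNN is actually implemented):

```python
def haar_step(signal):
    """One level of a Haar decomposition: pairwise averages (low-pass)
    and pairwise differences (high-pass), each downsampled by 2 --
    structurally a stride-2 convolution with fixed weights."""
    approx = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    return approx, detail

# Made-up toy signal of length 8
sig = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
approx, detail = haar_step(sig)
# Applying haar_step again to `approx` would give the next decomposition level,
# just as a CNN stacks conv/pool layers -- except a CNN learns its filter weights.
```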

CNNs/deep learning really don't seem like a black box at all when examined in sequence. To me, at least, randomized ensemble methods (random forest, etc.) are actually more mysterious in their out-of-the-box performance with little tuning.

I'm in no way a researcher or even an enthusiast of machine learning, but I'm pretty sure that I came across a paper posted on HN a few days ago that did exactly what you and the parent poster are describing, figuring out what pixels contributed most to some machine learning algorithm. I'll try and see if I can find it.

Edit: yep, found it.

SmoothGrad: removing noise by adding noise, https://arxiv.org/abs/1706.03825

Web page with explanations and examples


I couldn't find the HN thread, but there was no discussion as far as I remember.

Bagging and bootstrap ensemble methods aren't really that confusing. Just think of it as stochastic gradient descent on a much larger hypothetical data set.

The effect is the same one that occurs when you get a group of people together to estimate the number of jelly beans in a jar. All the estimators are biased, but if that bias is drawn from a zero-mean distribution, the deviation of the averaged estimate goes down as the number of estimators increases.
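A quick simulation of the jelly-bean effect (my own toy numbers: a jar of 500 and guesser biases drawn from a zero-mean Gaussian):

```python
import random
import statistics

random.seed(0)
TRUE_COUNT = 500  # actual number of jelly beans in the jar

def one_estimator():
    # Each guesser's estimate is the true count plus a bias
    # drawn from a zero-mean distribution.
    return TRUE_COUNT + random.gauss(0, 100)

def ensemble_estimate(n):
    # Average n independent biased guesses.
    return sum(one_estimator() for _ in range(n)) / n

# Compare the spread of single guesses with the spread of 100-guesser averages.
singles = [one_estimator() for _ in range(200)]
averages = [ensemble_estimate(100) for _ in range(200)]
spread_single = statistics.stdev(singles)   # roughly 100
spread_average = statistics.stdev(averages) # roughly 100 / sqrt(100) = 10
```

The averaged estimator's deviation shrinks like 1/sqrt(n), which is the same variance-reduction mechanism bagging exploits.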

I think you might be on to something, but the big problem here is that the input is hundreds of GBs or TBs. It's hard to understand what a feature is, or even why it's selected.

I can certainly observe what's being selected once the state machine is generated, but I have no clue how it was constructed to make the features. To determine that, I'd have to watch the state of the machine as it "grows" toward the final result.

>People were coming up with dozens of unnecessary variations on them, everybody in the world was trying to shoehorn the word "kernel" into their paper titles, using some kind of kernel method was a surefire way to get published.

I took Andrej Karpathy's tutorial on neural nets and learned that an SVM is basically one neuron.


Would you say machine learning teens to overfit?

Ugh too late to edit, but that was supposed to be tends.

I'd just like to note that instead of creating additional animosity between SVMs and deep nets, you could use both together. SVMs with hinge loss can be Yet-another-layer (tm) in your deep net, to be used when it provides better performance.

That's a great point. Fundamentally, if you look at something like a CNN, what it's really doing is producing a feature descriptor based on the input image. One can easily use that feature descriptor in a classic SVM, alongside (or instead of) SoftMax.

Yup, in fact, the universal feature extraction is what allows imagenet pretraining to work well on lung cancer images.

One nitpick though, ConvNets can absolutely be used to do "thinking" and more than just feature extraction. For example, fully convolutional networks can be extremely competitive with FC-layer based nets.

Could you explain in a bit more detail how you would integrate an SVM layer into a DNN? The kernel matrix depends on all samples, while at training time you would only have access to those in the minibatch.

The simplest is to pop it on top: run your DNN to reduce your input down to a nicer, cleaner, smaller-dimensional output, then plop an SVM on top for classification.

Seems like in that case you would train both models separately on different cost functions. By phrasing it as a layer I was expecting both the SVM and the DNN could be trained simultaneously.

Unless things have changed, one of the key benefits of DNNs was that you trained them layer by layer.

You also want to be able to train the DNN on your unlabelled data and the SVM on your much smaller labelled set.

For anyone interested in SVMs (and other introductory Machine Learning concepts), Udacity's intro course is really good: https://www.udacity.com/course/intro-to-machine-learning--ud...

This class from MIT taught by Patrick Winston is also a great resource: https://www.youtube.com/watch?v=_PwhiWxHK8o

At the end of the class, he also gives some historical perspectives, like how Vapnik came up with SVMs.

I remember that only a few years ago, in a computational statistics class I took the lecturer mentioned how SVMs (and Random Forests) have largely replaced neural networks. How things can change so quickly...

I always liked SVMs for the elegance of the kernel trick, but I guess choosing the right kernel functions and parameters for them wasn't that much easier than training a neural net either.

(All this from my rough, amateur understanding.) SVMs are more or less equivalent to linear regression in a "feature space", and also equivalent to a shallow neural network (~2-3 layers). This means their size more or less increases with the amount of data they are attempting to approximate, and that means they don't scale well to truly huge data sets.

Deep nets pulled ahead of SVMs at the point people figured out how to train them on truly huge data sets using GPUs, gradient descent (and an ever increasing arsenal of further tricks - all the schemes together are mindboggling to read about).

This was basically because the deepness of a deep neural net means that its size isn't as prone to increase with the size of the data.

I don't really know why SVMs haven't been able to scale to a multi-layer approach though I know people have tried (someone has tried just about everything these days).

Part of the situation is leveraging simple code with GPUs still may be the most effective approach.

Close but not quite. The difference between a (soft-margin) SVM and a kernel linear classifier is the choice of loss function: an SVM minimizes hinge loss, linear regression minimizes squared loss.

(Choosing different loss functions will also give you Elastic Net, LASSO, or logistic regression. From an engineering point of view I tend to think of the entire class as different flavors of "stochastic gradient descent", in the spirit of Vowpal Wabbit etc.)
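To make the "same family, different loss" point concrete, here's a sketch of the three losses as functions of the margin m = y·f(x) (my own illustration; the squared loss is written in its margin form (1 - m)², as used in least-squares classification):

```python
import math

def hinge_loss(m):
    """SVM: zero penalty beyond the margin, linear penalty inside it."""
    return max(0.0, 1.0 - m)

def squared_loss(m):
    """Least-squares classification: penalizes even confident correct answers."""
    return (1.0 - m) ** 2

def logistic_loss(m):
    """Logistic regression: smooth, strictly positive, decays exponentially."""
    return math.log(1.0 + math.exp(-m))

# A confidently correct prediction (m = 2): hinge gives exactly zero,
# squared punishes the over-confidence, logistic gives a small residual.
losses_at_2 = (hinge_loss(2.0), squared_loss(2.0), logistic_loss(2.0))
```

Swapping one of these functions into the same gradient-descent loop is what moves you between the models in this family.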

If you like SVMs, you should check out Gaussian processes (GPs). They work with covariance kernels similar to SVMs, but the result is fully Bayesian. With most modern GP packages you can even set priors on your kernel and mean functions, then use either optimization or Markov chain Monte Carlo to select optimal values.

The only downside to GPs is that they are O(N³) in time, so not applicable to big data. There are stochastic GPs that approximate using batch learning, but they're not as polished.

Interesting, thanks for the tip! I will check it out.

With SVMs, you often must perform a rather large grid search over kernels and kernel parameters. It seems like no matter the model, we can't avoid the hyperparameter problem -- although boosting and bagging meta-methods come close.

It would be nice if we could quantify the complexity of a dataset and match this to a model with similar complexity. I imagine that it's hard (or impossible) to decouple these two complexity quantifiers, however.

For SVMs I really like this intro: https://generalabstractnonsense.com/2017/03/A-quick-look-at-... (with hand drawings!)

For a recent practical example of their usefulness:

This paper presents the Militarized Interstate Dispute (MID) 4.0 research design for updating the database from 2002-2010. By using global search parameters and fifteen international news sources, we collected a set of over 1.74 million documents from LexisNexis. Care was taken to create an all-inclusive set of search parameters as well as a sufficient and unbiased list of news sources. We classify these documents with two types of support vector machines (SVMs). Using inductive SVMs and a single training set, we remove 90.2% of documents from our initial set. Then, using year-specific training sets and transductive SVMs, we further reduce the number of human-coded stories by an additional 21.6%. The resulting classifications contain anywhere from 10,215 to 19,834 documents per year.


5 years ago isn't a recent example. My impression is that deep nets ate everyone's lunch (including SVMs').

On the research frontier, this is pretty much true. But, for what it's worth, I'm spearheading some machine learning efforts at my current company, and most of my initial production models have not been deep networks but rather classical approaches like boosted trees or SVMs. Actually, gradient boosted trees in particular are one of the most powerful general-purpose models out there, and there are some really fantastic distributed implementations available now that routinely win Kaggle competitions (xgboost, Spark MLLib's version).

I will note though that the problems I'm tackling do not involve any image processing or recognition. Convolutional networks really have completely dominated that area both in research and practice in the last few years.

Deep nets ate everyone's hype. The lunch is still there. SVMs have many advantages over ANNs that recommend themselves to practical applications still.

What advantages do SVMs have at this stage?

Training speed, training data requirements, and runtime resource requirements are the big ones. And on domains that aren't raw image processing, SVMs are still often quite competitive on accuracy.

Not at all; deep nets are difficult to train, and they need lots of processing before they learn the same kind of classifying features that e.g. an SVM has. So yes, you can simulate an SVM with a deep network, but usually it's not a very good solution.

An SVM can also be used as part of a neural network, e.g. as the classifying layer.

They really aren't appropriate for the same set of problems, even if there is some overlap.

There is a lot of buzz around deep learning right now, but the SNR isn't great.

Can someone explain this part:

  Imagine the new space we want:
  z = x² + y²

  Figure out what the dot product in that space looks like:
  a · b = xa · xb  +  ya · yb  +  za · zb
  a · b = xa · xb  +  ya · yb +  (xa² + ya²) · (xb² + yb²)

I don't know if you are more familiar with matrix notation

  phi(x) = [x1, x2, x1² + x2²]
  phi(x) · phi(y) = [x1, x2, x1² + x2²] · [y1, y2, y1² + y2²]
  phi(x) · phi(y) = x1·y1 + x2·y2 + (x1² + x2²)·(y1² + y2²)

Take the 2nd line, and drop in the definition of "z" from the first line. You get the 3rd line as a result.
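You can also check the identity numerically. A toy snippet (my own, using the same phi as above and made-up points a and b):

```python
def phi(p):
    # Lift a 2-D point (x, y) into 3-D by appending z = x² + y²
    x, y = p
    return (x, y, x * x + y * y)

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def expanded_dot(a, b):
    # The expanded form from the article:
    #   a·b = xa·xb + ya·yb + (xa² + ya²)·(xb² + yb²)
    (xa, ya), (xb, yb) = a, b
    return xa * xb + ya * yb + (xa ** 2 + ya ** 2) * (xb ** 2 + yb ** 2)

a, b = (1.0, 2.0), (3.0, -1.0)
lhs = dot(phi(a), phi(b))  # dot product computed in the lifted 3-D space
rhs = expanded_dot(a, b)   # same value, without ever materializing phi
```

The second form only ever touches the original 2-D coordinates, which is the point of the kernel trick.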

I have the same problem. Where did a & b come from? Which two vectors are we taking the dot product of? And how is this less expensive?

In the decision function of an SVM, you compute the scalar products of the support vectors (points that are on the margin of your hyperplane, or more precisely, the points that constrain your hyperplane) and your new sample point:

  x· sv
The "z" the article defines is a new component that will be taken into account in the scalar product. A more mathematical way of seeing it is that you define a function phi that takes an original sample of your dataset and transforms it into a new vector. In our case, we simply add a new dimension (x3), computed from the two original dimensions (x1, x2), as a third component of our vector:

  phi(x) = [x1, x2, x1² + x2²]
The scalar product we will have to compute in our decision function can then be expressed as (this is the a and b in the article, i.e. the sample and the support vector in our new space):

  phi(x)· phi(sv)
The SVM doesn't need phi(x) or phi(sv) themselves, only the scalar product of those two vectors. The kernel trick is to find a function k that satisfies

  k(x, sv) = phi(x)· phi(sv)
and that satisfies Mercer's condition (I'll let Google explain what that is).

Your SVM will compute this (simpler) k function instead of the full scalar product. There are multiple "common" kernel functions (Wikipedia has examples of them[1]), and choosing one is a parameter of your model (ideally, you would then set up a testing protocol to find the best one).

[1] https://en.wikipedia.org/wiki/Positive-definite_kernel#Examp...

Thank you, this was an amazing explanation. I am new to SVMs, but had not made the connection that margin points (observations along the margin of the hyperplane) become your support vectors. This makes a lot more sense.

And if I am following correctly, it would make sense that the final step would then be:

We would maximize the dot product of a new observation with the support vectors to determine its classification (red or blue)

During the learning phase of the SVM, you try to find a hyperplane that maximizes the margin.

The decision function of an SVM can be written as:

  f(x) = sign(sum alpha_sv y_sv k(x, sv))
Where the sum runs over all support vectors "sv", "y_sv" is the class of the support vector (red=1, blue=-1, for example), and "alpha_sv" is the result of the optimization during the learning phase (it is zero for a point that is not a support vector, and positive otherwise).

The decision function is a sum over all support vectors weighted by the "k" function (which can thus be seen as a similarity function between two points), and y_sv makes each term positive or negative depending on the class of the support vector. You take the sign of this sum (1 -> red, -1 -> blue, in our example), and that gives you the predicted class of your sample.
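A toy sketch of that decision function (my own illustration: the support vectors and alphas below are made up rather than the output of a real training run, and I've picked an RBF kernel as the k):

```python
import math

def rbf_kernel(u, v, gamma=1.0):
    # k(x, sv) = exp(-gamma * ||x - sv||²), one common choice of kernel
    d2 = sum((ui - vi) ** 2 for ui, vi in zip(u, v))
    return math.exp(-gamma * d2)

def svm_decide(x, support_vectors, gamma=1.0):
    # f(x) = sign(sum over sv of alpha_sv * y_sv * k(x, sv))
    s = sum(alpha * y * rbf_kernel(x, sv, gamma)
            for sv, y, alpha in support_vectors)
    return 1 if s >= 0 else -1

# Hypothetical support vectors as (point, y_sv, alpha_sv) triples; in a
# real SVM the alphas come out of the training optimization.
svs = [((0.0, 0.0), -1, 1.0),   # blue cluster
       ((0.0, 1.0), -1, 1.0),
       ((3.0, 3.0),  1, 1.0),   # red cluster
       ((3.0, 4.0),  1, 1.0)]

blue_side = svm_decide((0.2, 0.3), svs)  # sample near the blue cluster
red_side = svm_decide((3.1, 3.5), svs)   # sample near the red cluster
```

Note that the model never computes phi(x): every term goes through k, which is why the number of stored support vectors (not the dimension of the lifted space) drives the runtime cost.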

Thanks for bringing some saner notation in here. I feel like blog posts and journal articles that abuse notation like this one just make people more allergic to math.
