People were coming up with dozens of unnecessary variations on them; everybody in the world was trying to shoehorn the word "kernel" into their paper titles; using some kind of kernel method was a surefire way to get published.
I wish machine learning research didn't respond so strongly to trends and hype, and I also wish the economics of academic research didn't force people into cliques fighting over scarce resources.
I'm still wondering what, if anything, is going to supplant deep learning. It's probably an existing technique that will suddenly become much more usable due to some small improvement.
What frustrates me is that people who are starting out in machine learning often never learn that linear/logistic regression dominates the practical applications of ML. I've spoken to people who know the ins and outs of various deep network architectures but don't even know how to start building a baseline logistic regression model.
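For what it's worth, a baseline like that is only a handful of lines with something like scikit-learn. A sketch on synthetic data, just to show the shape of it:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    # synthetic stand-in for whatever features/labels you actually have
    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    clf = LogisticRegression(max_iter=1000)  # plain L2-regularized baseline
    clf.fit(X_train, y_train)
    print("baseline accuracy:", accuracy_score(y_test, clf.predict(X_test)))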
The longer answer:
SVMs come with two theoretical benefits:
1. A guaranteed optimal solution. You also get this with simpler techniques like logistic regression.
2. The ability to use non-linear kernels, which can offer much more power than logistic regression (on par with neural networks).
So, at first, it seemed like SVMs were the best of all worlds, and a lot of people got excited. In practice, though, non-linear kernels slowed training down to the point of being impractical. Linear kernels were fast enough, but removed the second benefit, so most people would prefer to use established linear techniques like logistic regression.
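If you want to see the practical gap for yourself, a rough sketch like this (scikit-learn, synthetic data; the exact timings depend entirely on the dataset) makes it obvious:

    import time
    from sklearn.datasets import make_classification
    from sklearn.svm import LinearSVC, SVC
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=20000, n_features=50, random_state=0)

    # the kernel SVM's fit time blows up with sample count; the linear models don't
    for name, clf in [("logistic regression", LogisticRegression(max_iter=1000)),
                      ("linear SVM", LinearSVC()),
                      ("RBF-kernel SVM", SVC(kernel="rbf"))]:
        t0 = time.time()
        clf.fit(X, y)
        print(name, "fit in", round(time.time() - t0, 1), "s")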
I've also conducted multiple training sessions/discussions on ML with small groups, and that experience supports what I've said. SVMs are hard to explain to a general crowd; with ANNs I have enough visual cues to get them started. Sure, they mightn't get the math right away, but they understand enough to be comfortable using a library.
As an aside, a comment here mentions that logistic/linear regression dominates the industry. I think it's for a similar reason: they're simple to understand and try out. That doesn't make them good models, in my experience, on a bunch of real-world problems.
Now if you ask me about the technical cons of SVMs, I'd say: scalability of non-linear kernels and the fact that I have to cherry-pick kernels. Linear and RBF kernels work well on most problems, but even then, for a bunch of problems where RBF seems to work well, the number of support vectors stored by the model can be massive. Even if I weren't pedantic about it and excused the fact that the kernel seems to be "memorizing" more than "learning", this is still a beast to run in real time. nu-SVMs address this issue to an extent, but then we are back to picking the right kernel for the task. This is one thing I love about ANNs - the kernel (or what is essentially the kernel) is learned.
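If you want to see the blow-up concretely, something like this sketch (scikit-learn, synthetic data, default hyperparameters) shows how many support vectors an RBF model ends up storing:

    from sklearn.datasets import make_classification
    from sklearn.svm import SVC, NuSVC

    X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

    rbf = SVC(kernel="rbf").fit(X, y)
    nu_rbf = NuSVC(nu=0.2, kernel="rbf").fit(X, y)  # nu trades margin errors against support vectors

    print("SVC   support vectors:", rbf.n_support_.sum(), "of", len(X))
    print("NuSVC support vectors:", nu_rbf.n_support_.sum(), "of", len(X))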
2. Hard to parallelize if you're using kernels other than the linear one.
Deep (neural) networks are really just a generalization of machine learning (on graphs). The key is that we learn the similarity function to discriminate from examples instead of specifying it a priori. You can also build classifiers other ways: linear/logistic regression based on feature vectors or by providing some similarity metric (SVM). But in these cases you are providing the discriminator function.
For example, in SVMs you have to provide a similarity measure as the "kernel" maps your feature vector into a higher dimensional space where the examples can be separated.
In deep neural networks we don't really know (or care to some extent) what the optimal feature vectors are or what the correct similarity metric is. We only care that at the end we've encoded it correctly (e.g. having chosen enough parameters/layers etc...) after training. Again the NN is some general function that applies a soft-max relationship from the inputs to outputs for each layer.
Yann LeCun has a great paper (2007) explaining this:
There's a whole world beyond neural networks and it seems like it's mostly all on the backburner now. Which I understand, the resurgence in NN algorithms, approaches and hardware in the last 10 years has been exciting but it does feel like tunnel vision sometimes.
It's really because nobody actually understands what's going on inside an ML algorithm. When you give it a ginormous dataset, what data is it really using to make its determination of
[0.0000999192346 , .91128756789 , 0 , .62819364 , 32.8172]
Because what I do for ML is do a supervised fit, then use a test set to confirm fitness, then unleash it on untrained data and check and see. But I have no real understanding of what those numbers actually represent. I mean, does .91128756789 represent the curve around the nose, or is it skin color, or is it a facial encoding of 3d shape?
> I'm still wondering what, if anything, is going to supplant deep learning.
I think it'll be a slow climb to actual understanding. Right now, we have object identifiers in NN. They work, after TB's of images and PFLOPS of cpu/gpu time. It's only brute force with 'magic black boxes' - and that provides results but no understanding. The next steps are actually deciphering what the understanding is, or making straight-up algorithms that can differentiate between things.
Beyond that, the convolution/max pool repeated steps could be understood to be applying something akin to a multi-level wavelet decomposition, which is pretty well understood. It's how classical matched filtering, Haar cascading, and a wide variety of preceding image classification methods operated at their first steps too.
CNNs/deep learning really don't seem like a black box at all when examined in sequence. But randomized ensemble methods (random forests, etc.) are actually a bit more mysterious to me in their out-of-the-box performance with little tuning.
Edit: yep, found it.
SmoothGrad: removing noise by adding noise, https://arxiv.org/abs/1706.03825
Web page with explanations and examples
I couldn't find the HN thread, but there was no discussion as far as I remember.
The effect is the same one that occurs when you get a group of people together to estimate the number of jelly beans in a jar. All the estimators are biased, but if that bias is drawn from a zero-mean distribution, the deviation of the average bias goes down as the number of estimators increases.
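A quick simulation shows the effect; the numbers are made up, the only point is the shrinking spread of the group average (roughly 1/sqrt(N)):

    import numpy as np

    rng = np.random.default_rng(0)
    true_count = 500  # jelly beans actually in the jar

    for n_estimators in (1, 10, 100, 1000):
        # each "person" guesses the true count plus zero-mean noise
        guesses = true_count + rng.normal(0, 100, size=(10000, n_estimators))
        group_average = guesses.mean(axis=1)  # one averaged guess per simulated group
        print(n_estimators, "estimators -> std of the group average:", round(group_average.std(), 1))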
I can certainly observe what's being selected once the state machine is generated, but I have no clue how it was constructed to make the features. To determine that, I have to watch the state of the machine as it "grows" to the final result.
I took Andrej Karpathy's tutorial on neural nets and learned that an SVM is basically one neuron.
One nitpick though, ConvNets can absolutely be used to do "thinking" and more than just feature extraction. For example, fully convolutional networks can be extremely competitive with FC-layer based nets.
You also want to be able to train the DNN on your unlabelled data and the SVM on your much smaller labelled set.
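Roughly, that setup looks like this sketch (PyTorch + scikit-learn, all shapes and data made up): pretrain an encoder on the unlabelled pool with a reconstruction loss, then fit the SVM on the encoded features of the small labelled set.

    import torch
    import torch.nn as nn
    from sklearn.svm import SVC

    unlabelled = torch.randn(10000, 64)              # large unlabelled pool (made up)
    labelled_x = torch.randn(200, 64)                # small labelled set (made up)
    labelled_y = torch.randint(0, 2, (200,)).numpy()

    encoder = nn.Sequential(nn.Linear(64, 16), nn.ReLU())
    decoder = nn.Linear(16, 64)
    opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

    for _ in range(100):                             # unsupervised pretraining: reconstruction loss
        opt.zero_grad()
        loss = nn.functional.mse_loss(decoder(encoder(unlabelled)), unlabelled)
        loss.backward()
        opt.step()

    with torch.no_grad():                            # SVM trained on the learned features
        features = encoder(labelled_x).numpy()
    svm = SVC(kernel="rbf").fit(features, labelled_y)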
At the end of the class, he also gives some historical perspectives, like how Vapnik came up with SVMs.
I always liked SVMs for the elegance of the kernel trick, but I guess choosing the right kernel functions and parameters for them wasn't that much easier than training a neural net either.
Deep nets pulled ahead of SVMs at the point people figured out how to train them on truly huge data sets using GPUs, gradient descent (and an ever increasing arsenal of further tricks - all the schemes together are mindboggling to read about).
This was basically because the deepness of a deep neural net means that its size isn't as prone to increase with the size of the data.
I don't really know why SVMs haven't been able to scale to a multi-layer approach though I know people have tried (someone has tried just about everything these days).
Part of the situation is that leveraging simple code on GPUs may still be the most effective approach.
(Choice of different loss functions will also give you Elastic Net, LASSO, logistic regression. From an engineering point of view I tend to think of the entire class as being different flavors of "stochastic gradient descent", in the spirit of Vowpal Wabbit etc.)
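In scikit-learn terms that whole family really is more or less one estimator with different loss/penalty settings. A sketch, defaults everywhere (loss/penalty names as in recent scikit-learn versions):

    from sklearn.datasets import make_regression, make_classification
    from sklearn.linear_model import SGDRegressor, SGDClassifier

    Xr, yr = make_regression(n_samples=1000, n_features=20, random_state=0)
    Xc, yc = make_classification(n_samples=1000, n_features=20, random_state=0)

    lasso_like  = SGDRegressor(loss="squared_error", penalty="l1").fit(Xr, yr)
    enet_like   = SGDRegressor(loss="squared_error", penalty="elasticnet", l1_ratio=0.5).fit(Xr, yr)
    logreg_like = SGDClassifier(loss="log_loss", penalty="l2").fit(Xc, yc)
    linsvm_like = SGDClassifier(loss="hinge", penalty="l2").fit(Xc, yc)  # linear-SVM flavor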
The only downside to GPs is that they are O(N^3) in time, so not applicable to big data. There are stochastic GPs that approximate using batch learning, but they're not as polished.
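The cubic cost is easy to see in a bare-bones GP regression sketch - the N x N kernel matrix has to be factorized (numpy, made-up 1-D data):

    import numpy as np

    def rbf(a, b, lengthscale=1.0):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * d2 / lengthscale ** 2)

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 1))                    # N training points
    y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=500)

    K = rbf(X, X) + 1e-2 * np.eye(len(X))            # N x N covariance matrix
    L = np.linalg.cholesky(K)                        # <-- the O(N^3) step
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))

    X_new = np.linspace(-3, 3, 100)[:, None]
    posterior_mean = rbf(X_new, X) @ alpha           # predictions at new points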
It would be nice if we could quantify the complexity of a dataset and match this to a model with similar complexity. I imagine that it's hard (or impossible) to decouple these two complexity quantifiers, however.
This paper presents the Militarized Interstate Dispute (MID) 4.0 research design for updating the database from 2002-2010. By using global search parameters and fifteen international news sources, we collected a set of over 1.74 million documents from LexisNexis. Care was taken to create an all-inclusive set of search parameters as well as a sufficient and unbiased list of news sources. We classify these documents with two types of support vector machines (SVMs). Using inductive SVMs and a single training set, we remove 90.2% of documents from our initial set. Then, using year-specific training sets and transductive SVMs, we further reduce the number of human-coded stories by an additional 21.6%. The resulting classifications contain anywhere from 10,215 to 19,834 documents per year.
I will note though that the problems I'm tackling do not involve any image processing or recognition. Convolutional networks really have completely dominated that area both in research and practice in the last few years.
An SVM can also be used as part of a neural network, for example as the classifying layer.
There is a lot of buzz around deep learning right now, but the SNR isn't great.
Imagine the new space we want:
z = x² + y²
Figure out what the dot product in that space looks like:
a · b = xa · xb + ya · yb + za · zb
a · b = xa · xb + ya · yb + (xa² + ya²) · (xb² + yb²)
phi(x) = [x1, x2, x1² + x2²]
phi(x)· phi(y) = [x1, x2, x1² + x2²]*[y1, y2, y1² + y2²]
phi(x)· phi(y) = x1 y1 + x2 y2 + (x1² + x2²)(y1² + y2²)
phi(x) = [x1, x2, x1² + x2²]
k(x, sv) = phi(x)· phi(sv)
Your SVM will compute this (simpler) k function, instead of the full scalar product. There are multiple "common" kernel functions used (Wikipedia has examples of them), and choosing one is a parameter of your model (ideally, you would then setup a testing protocol to find the best one).
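A quick numeric sanity check of the example above - the kernel shortcut and the explicit phi mapping give the same number:

    import numpy as np

    def phi(p):
        # the explicit mapping from the example: [x1, x2, x1^2 + x2^2]
        return np.array([p[0], p[1], p[0] ** 2 + p[1] ** 2])

    def k(a, b):
        # same dot product, computed directly from the 2-D points
        return a[0] * b[0] + a[1] * b[1] + (a[0] ** 2 + a[1] ** 2) * (b[0] ** 2 + b[1] ** 2)

    x, sv = np.array([1.5, -0.5]), np.array([0.2, 2.0])
    print(phi(x) @ phi(sv))   # explicit mapping into 3-D
    print(k(x, sv))           # kernel shortcut, same value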
And if I am following correctly, it would make sense that the final step would then be:
We would maximize the dot product of a new observation with the support vectors to determine its classification (red or blue)
The decision function of an SVM can be written as:
f(x) = sign(sum alpha_sv y_sv k(x, sv))
The decision function is a sum over all support vectors, weighted by the "k" function (which can thus be seen as a similarity function between two points in your kernel space); the y_sv makes each term positive or negative depending on the class of the support vector. You take the sign of this sum (1 -> red, -1 -> blue, in our example), and it gives you the predicted class of your sample.
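A toy version in code, with made-up support vectors and alphas (a fitted scikit-learn SVC exposes the same pieces as support_vectors_, dual_coef_ = alpha*y, and intercept_):

    import numpy as np

    def k(a, b):
        return np.exp(-np.sum((a - b) ** 2))       # RBF kernel as the similarity function

    support_vectors = np.array([[0.0, 1.0], [2.0, 2.0], [3.0, 0.5]])  # made up
    alphas = np.array([0.7, 0.4, 0.9])             # made-up Lagrange multipliers
    labels = np.array([1, -1, 1])                  # class of each support vector

    def predict(x):
        # f(x) = sign( sum_sv alpha_sv * y_sv * k(x, sv) )
        s = sum(a * y * k(x, sv) for a, y, sv in zip(alphas, labels, support_vectors))
        return 1 if s > 0 else -1                  # 1 -> red, -1 -> blue

    print(predict(np.array([0.5, 1.0])))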