Funny how SVMs are just max-margin loss functions and we just took for granted that you needed domain expertise to craft features like HOG/SIFT by hand.
By 2018, we use ConvNets to learn BOTH the features and the classifier. In fact, it’s hard to separate where the features end and the classifier begins (in a modern CNN).
The performance is still better in most cases but I often have to wonder, are people just doing feature engineering once removed and is the better performance just the result of having WAY more parameters in the model?
Is there anything like that in the ANN world?
Not really, or sort of, depending on how you think.
A deep neural network does work - at least to some extent - because of the large number of parameters. However, it is practical because it can be trained in a reasonable amount of time.
Things like ResNets are useful because they allow us to train deeper networks.
You can create a SVM with the same number of parameters, and in theory it could be as accurate (this is basically the no free lunch theorem). But you won't be able to train it to the same accuracy.
 Of course there are practical concerns about what you do for features, since hand created features just aren't as good as neural network ones. One thing people do now is use the lower layers of a deep neural network as a feature extractor and then put a SVM on top of them as the classifier. This works quite well, and is reasonably fast to train.
I'm not sure I agree with this bit in theory. A Neural Network is a stack of basis functions; and this stack can also be seen as a bunch of basis functions. And basis functions are what kernels represent. Trivially, you could then "copy" the weights that a ANN would learn into a kernel and obtain the same accuracy.
The reason this doesn't work in practice is, in SVMs, you tend not to learn kernels from scratch but use (possibly a combination of) standard parameterized kernels - . The learning step in the SVM adapts this standard kernel to your dataset as much as the parameters allow, but this would be sub-optimal compared to learning a kernel (or the corresponding basis functions) from scratch that's built just for your data. With a well trained ANN the latter is what you get.
 there has been a fair amount of work on learning kernels too, but its not as mainstream as using standard kernels.
I'm not sure I agree with this bit in theory.
No one really agrees with it in theory - I'm not aware of a good theoretical explanation as to why some deep networks are easier to train. And yet there is a growing body of real, generalized practical hints which work pretty reliably.
This is pretty exciting! There is undiscovered ideas here. But it is unsatisfactory from the theoretical sense at the moment.
See slide 7:
Shallow feed-forward networks are "universal function approximators"  when the number of hidden neurons is finite but unbounded. Of course, the width of that layer grows exponentially in the depth of the deep network that you might wish to approximate .
The statement that "[i]t's just that deep neural networks are a lot more practical to train" (emphasis mine) sounds somewhat reductive; it's not only that depth is a nice trick or hack for training speed, but that depth makes the success of deep networks in the past decade at all possible. We live in a world with bounded computing resources and bounded training data. You cannot subsume all deep networks into shallow networks, and shallow networks into SVMs in the real world. So I am pretty sure of how important that distinction is.
And what's more, depth extracts a hierarchy of interpret-able features at multiple scales, and a decision surface embedded within that feature space, rather than a brittle decision surface in an extremely high dimensional space with little semantic meaning. One of these approaches generalizes better than the other to unseen data.
This is important. Flippantly said,discarding learnability and speed of convergence, you can get the power of any neural network by the following algorithm:
1. Randomly generates a sufficiently wide bit pattern
2. Interprets it as a program and run it on the test set
3. discard results until the desired accuracy is reached
I ask because when we're training human to understand things, there are a variety of benefits to separate feature-understanding from the classifiers. In particular, you get gains in flexibility, extendability, and debuggability.
I get why people are happy to take the ConvNet gains and run with them for now. But have you seen any interesting work to get the benefits of separation in the new paradigm? (Or, alternately, is there a reason why those concerns are outmoded?)
This was because we didn't understand how to train a deep model end-to-end until later. When we learned how to make that end-to-end training work it tended to perform better because the learned features were task specific.
You can still learn general features in a bunch of ways, in addition to the older method using autoencoders. For one example, multiple supervised heads with auxiliary losses can learn more generalize features.
Otherwise, why we would we need a gazillion architectures for different problems (or even the same exact problem)?
Disclaimer: written by me.
An SVM is literally just a linear model with hinge loss instead of log loss (logistic regression) or squared loss (ordinary linear regression) in primal form. For apparently historical reasons, it is usually derived from the "hard-margin" SVM in dual form, motivating with trying to maximize the margin. This is complicated and not very intuitive.
This also causes people to conflate the kernel trick and the dual form, while in fact they have nothing to do with each other. You can use the kernel trick in primal svm just fine.
Stochastic gradient descent can also be used for primal methods, while it doesn't work in the dual. That makes it much faster for large problems than the dual.
Once you realize that a linear SVM isn’t very different from logistic regression, it starts to all make sense (at least it did for me).
Key insight of the hinge-loss: once something is classified correctly beyond the margin, it incurs a loss of zero.
Now, Something fun to think about. Draw the hinge loss. Now draw the ReLU (which is found all over the place in CNNs). Now thing about L1-regularization (which was used to induce sparsity in compressed sensing). They are more similar in form than you would think.
Some people have had good luck with hinge or multi-hinge loss for neural networks instead of the almost universal log loss, since of course the hinge loss can be used in things other than linear models. It doesn't care how you get the y output.
The QP-dual formulation never seemed like something that could scale, and linear SVMs never seemed all that much better than just lasso/elasticnet regression. (hmmmmm :) )
Second-order methods like Newton's method converge better to the exact solution that SGD, although they don't reach a "pretty good" solution as fast usually. Coordinate descent methods in the dual also get very close to the exact solution, but Newton's method and friends are usually faster. With (quasi-) Newton methods in the primal, everything just comes down to solving linear systems, which is a much more well-studied problem.
I've even experimented successfully with kernelized models with millions of examples in low dimensions using the Fast Gauss Transform. That's impossible in the dual.
You can also generate low-rank kernel approximations  using the Nystroem method or the Fastfood transform that can then be used in a linear SVM. For example, if I had a problem with n=10^6, I can make a low-rank approximation of the kernel matrix (say d=1000) and feed that into a fast SGD optimizer.
This often works really well, and is usually pretty close to the exact kernel solution if the problem is of lower intrinsic dimensionality, which is usually true if the dual SVM is sparse in its basis vectors. This largely negates the sparsity advantage of the primal SVM. If a kernel approximation isn't good, then the dual SVM wouldn't be meaningfully sparse anyway, so there is still no advantage of dual. Best just solve the kernelized system in the primal, and use a second-order optimizer if needed.
The same thing happened with data storage. As soon as big data appeared everyone stopped doing just data and started doing "big data". Now the term is kind of a joke even.
I predict in a few years "deep learning" term will become mostly used in an ironic sense as well.
I may be a bit behind the times, but I'm also mystified by "deep learning's" popularity. Both giant neural nets and kernel methods have overfitting problems: torture a billion-parameter model long enough, and it will tell you what you want to hear.
SVMs address this by finding a large margin for error, which will hopefully improve generalization. DNNs (I think) do this by throwing more ("big") data at the problem and hoping that the training set covers all possible inputs. Work on adversarial learning suggests that DNNs go completely off the rails when presented with anything slightly unexpected.
Deep learning gets a lot of hype because for many applications they perform better and scale better without a lot of tricks and extensions which are now possible with SVMs. You can even use a large margin loss with deep models to get some of the benefits of SVMs.
Adversarial examples are way overblown. First, SVMs are not immune to them either. Second, very few applications are threatened by things like adversarial examples.
I don't think your scenario is likely to occur unless something else starts outperforming deep learning (in the broadest sense) _and_ there's an approachable alternative to solve the same problems at least as well.
SVMs have been in active use since the early nineties and have been formulated much before that
I think the author may be working from a very different definition of the word "idiot".
In the pdf, it said that the optimization problem in SVMs have a nice property in that it was quadratic, which means that there's a nice global minimum to go towards, and not lots of local minimum like in NN. That means, it seems SVMs won't get stuck at a suboptimal solution.
Is that not a problem in DNNs now? Or is it that it's such high dimensionality that local minima don't stop the optimizer, because there's always another way around the local minimum?
With deep learning, the problem is non-convex, so the results are locally optimal but not necessarily globally optimal. We get around this with different techniques such as randomized initial weights, but with few exceptions, non-convex problems are NP-complete. Therefore there is no polynomial-time algorithm for finding the global optimum in deep learning unless P==NP.
I occasionally hear people claim deep learning finds the globally optimal solution, but that’s just not correct (or they mean they are doing an exhaustive search e.g. branch and bound, which has exponential worst case run time and isn’t practical).
That said, locally optimal solutions are often “good enough” for practical problems.
With DNNs, people learn everything (features + decision boundary), and this problem is not convex. Surprisingly DNNs work quite well in practice even though we were taught to be afraid of non-convex problems in grad school around 2005.
If back in early 2000s, we stopped worrying about theoretical issues and explored more approaches like ConvNets, we might have had the deep learning revolution 10 years earlier.
I once ran a small neural net 100 times on simple three-dimensional data re-
selecting the initial weights to be small and random on each run. I found 32
distinct minima, each of which gave a different picture, and having about equal
test set error.
– Leo Breiman
Note also that "optimal" in each case means "relative to other models in the same family." But a large neural net and an SVM do not exist in the same hypothesis space. As an analogy, recall that Ordinary Least Squares is the BLUE (Best Linear Unbiased Estimator.) But that's only relative to other unbiased (which in particular means no regularization: not ridge/lasso/elasticnet) linear models over the same feature set. Just because OLS provably gives the best performance in this family doesn't mean it is going to do well compared to models outside that family. In the same way, just because SVM offers a convergence guarantee doesn't mean its performance will always be the same or better than an NN.
The real problem with SVMs is that when you use a kernel (which is the whole point, linear SVMs are just logistic regression with hinge loss) then you introduce one landmark (feature) for every data point. So if your original data set was a manageable 1e6x1e3 then the SVM will view this data set as 1e6x1e6. It doesn't actually have to store that in memory thanks to the kernel trick but training time still scales as O(N^2) where N is the number of observations (rows) in the original data. In practice, even with a highly optimized library like LIBSVM you're not going to get good performance past N=1e6. (I've personally had the most success with SVMs on small datasets of only a few thousand rows.) While NN can very easily accommodate much later training sets with training time only growing as O(N), more or less. You can always sample a manageable training set from a larger data set but if your only using 1% of your data to train a model that's a problem.
Deep NN is also a far more modular approach: if you know your data has a 2D structure for example, you can add some CNN layers and get translational invariance. Deep NN comes with a large toolbox for specializing the model to the data, while the options for tuning an SVM are much more limited: you can change out the kernel function (not that anything is likely to beat RBF) and play around with the regularization parameter, and that's it. Once you've tuned those hyper-parameters with a small grid search there's not much else left to try with SVMs. If they work, they work, otherwise you have to change approaches.
I have a minor nitpick regarding this point you make: not that anything is likely to beat RBF. Depending on the data, specialized kernels can help immensely. An easy example is sequence classification where something like a string kernel might work really well. Or image classification, where histogram based kernels might prove superior.
Note that sometimes you might want to measure how good a kernel is for a problem not by its prediction accuracy alone but also by the number of support vectors it needs - if the final model retains ~100% of the training data as support vectors, it is not a great model in some (subjective) sense since it is memorizing a lot. Depending on the data, you might "beat" the RBF kernel on this aspect too.
Regarding the training time, there are some interesting tricks I've come across (but not tried them out yet) -, .
 Ensemble SVMs http://www.jmlr.org/papers/volume15/claesen14a/claesen14a.pd...
 SVMPath - algorithm to fit the entire path of SVM solutions for every value of the cost parameter, with
essentially the same computational cost as fitting one SVM model. http://www.jmlr.org/papers/volume5/hastie04a/hastie04a.pdf