I think this should be titled "why use kernels," as the gain here comes from using an RBF kernel (and not just a SVM). Kernel-based L2/Tikhonov regression (or kernel ridge regression) could perform just as well (although it might require more memory to train).
* your training set is 66% of the entire set, not 80%
* you pass degree=0.5 when creating the sklearn.svm.SVC, but since you don't specify the kernel type it defaults to RBF, which ignores the degree argument. See  for more.
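To make the point concrete, here's a hedged sketch (not the article's code; parameter values are illustrative): with the default RBF kernel, sklearn's SVC only looks at gamma, while degree is meaningful for the polynomial kernel.

```python
# Illustrative sketch: `degree` is only used by kernel="poly";
# the default RBF kernel is controlled by `gamma` instead.
from sklearn.svm import SVC

clf_rbf = SVC(kernel="rbf", gamma=0.5)   # gamma is the knob that matters for RBF
clf_poly = SVC(kernel="poly", degree=3)  # degree only applies to the poly kernel
```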
* you should motivate the kernel transformation more thoroughly; by mapping the data into a higher-dimension space, you hope to find a separating boundary that isn't present in the natural space. The mapping is performed such that we only need the inner product of each vector, which lets us map to ANY space (even those of infinite dimension!!) as long as we can compute inner products.
(There are some properties of the higher-dimension spaces, such as the fact they're Hilbert spaces, that make the above possible, but it's late and I can't remember details off the cuff).
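A toy example of the motivation above (my own construction, not from the article): points inside vs. outside a circle are not linearly separable in 2-D, but an explicit map into 3-D makes them separable by a plane.

```python
import numpy as np

# Labels depend on the radius, so no line in the (x1, x2) plane separates them.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 0.5).astype(int)

# Explicit feature map phi(x) = (x1, x2, x1^2 + x2^2): in this 3-D space the
# plane z = 0.5 separates the classes perfectly.
Z = np.column_stack([X, (X ** 2).sum(axis=1)])
pred = (Z[:, 2] > 0.5).astype(int)
print((pred == y).mean())  # 1.0 -- perfectly separable after the mapping
```

The kernel trick lets an SVM behave as if it had done such a mapping without ever materializing the higher-dimensional coordinates.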
Second, SVMs---especially with a non-linear kernel---are much trickier to tune than logistic regression. SVMs are very sensitive to the choice of hyper-parameters (user-specified tuning weights), which means repeatedly retraining. And if you use too powerful a kernel, you will overfit your data very easily.
For these reasons, my advice is to start with logistic, and then if you're not satisfied, switch to SVMs.
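The "repeatedly retraining" point above can be sketched concretely (my own toy setup; the parameter grid is illustrative): even a tiny grid over C and gamma multiplies the number of fits.

```python
# Hedged sketch: grid search over the RBF-SVM's two main hyper-parameters.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)
grid = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]},
    cv=3,
)
grid.fit(X, y)  # 9 parameter combos x 3 folds = 27 trainings for this tiny grid
print(grid.best_params_)
```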
(As an aside, it's totally possible to kernelize logistic regression. Without suitable regularization, it'll be even worse with regards to the number of datapoints you need though.)
Boosted ensembles and random decision forests are easy to learn in parallel, so they're suited to classifier discovery on data sets generated by a theory with a long tail.
Have I gone crazy?
(For the audience, "structural risk" = "model complexity". Structural risk can cause overfitting. Hence, Occam's razor.)
You control for structural risk in random forests through hyperparameters like the maximum depth.
You control for structural risk in Adaboost through early stopping. In extensions of Adaboost, there might also be a learning rate.
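The two knobs described above can be sketched as follows (my own illustrative parameter values, not from the thread):

```python
# Hedged sketch of controlling structural risk in both ensemble methods.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier

X, y = make_classification(n_samples=200, random_state=0)

# Random forest: cap tree depth to limit model complexity.
rf = RandomForestClassifier(max_depth=3, n_estimators=50, random_state=0).fit(X, y)

# AdaBoost: cap the number of boosting rounds (the knob behind early stopping)
# and shrink each update with a learning rate.
ada = AdaBoostClassifier(n_estimators=25, learning_rate=0.5, random_state=0).fit(X, y)
```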
In practice, I find it of comparable difficulty to control for overfitting in SVMs and random forests.
You are also incorrect that boosting can be easily parallelized. Each model update causes the weight of each example to be updated, and this weight vector must be shared. Hence, it is not trivial to parallelize boosting.
Roughly speaking, dropout training provides a strong regularizing effect through a sort of model averaging that is conceptually related to the well-known bagging approach from which random forests derive their power and flexibility. Dropout training has already produced state-of-the-art results on several time-worn standard benchmarks and helped Hinton's group win a recent Kaggle competition (for an overview of their approach, see: http://blog.kaggle.com/2012/11/01/deep-learning-how-i-did-it...).
I've played around with this a bit over the last few weeks, and have a Matlab implementation publicly available from my Github at: https://github.com/Philip-Bachman/NN-Dropout.
Interestingly, the number of units in the hidden layers isn't as important as the size of the weights of the hidden units. This is a famous result from the 1990s.
Hence, a principled way of controlling overfitting in NNs is: pick a large enough number of weights that you don't underfit, and apply L2 regularization to the hidden unit weights. This is superior to fiddling with the number of hidden units in an unregularized net.
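As a hedged sketch of that recipe (sizes and penalty strength are illustrative, not from the comment): in scikit-learn, MLPClassifier's alpha is an L2 penalty on the weights, so you can make the hidden layer generously wide and let regularization control complexity.

```python
# Illustrative sketch: wide hidden layer + L2 weight penalty (alpha).
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=300, random_state=0)
net = MLPClassifier(hidden_layer_sizes=(100,), alpha=1.0,
                    max_iter=500, random_state=0).fit(X, y)
```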
A related result is that you can control model complexity by imposing sparsity on the activations of the hidden units.
Is L2 better than L1 in this regard? My experience is that L1 significantly outperforms L2 whenever overfitting (rather than noise / bad measurements) is the problem you are addressing.
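One way to see the qualitative difference (my own toy setup): with many irrelevant features, L1 drives most coefficients exactly to zero, while L2 only shrinks them.

```python
# Hedged sketch comparing L1 (Lasso) and L2 (Ridge) penalties.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))
y = X[:, 0] * 3.0 + rng.normal(scale=0.1, size=100)  # only feature 0 matters

l1 = Lasso(alpha=0.1).fit(X, y)
l2 = Ridge(alpha=0.1).fit(X, y)
print((np.abs(l1.coef_) < 1e-8).sum())  # most L1 coefficients are exactly zero
print((np.abs(l2.coef_) < 1e-8).sum())  # L2 coefficients are small but nonzero
```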
The libsvm train function and some of the associated tools (like grid.py) do a fairly mediocre job of refining the coefficients (as someone else mentioned) when doing nonlinear SVMs. It gets really tricky to select good coefficients then. Also, using the eigenmatrix to cut down the number of important vector features ends up being a pretty big deal if you end up black-boxing SVMs on massive input data sets.
If you work in a space where your data set isn't a 50/50 mix of classes, finding coefficients that maximize accuracy and cut incorrect classifications becomes a mess.
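One common mitigation for that imbalance problem (a hedged sketch; the setup and values here are my own illustration): sklearn's SVC can reweight its C penalty per class so the minority class isn't ignored.

```python
# Illustrative sketch: class_weight="balanced" reweights C inversely to
# class frequency on an imbalanced (roughly 90/10) toy data set.
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, weights=[0.9, 0.1], random_state=0)
clf = SVC(kernel="rbf", class_weight="balanced").fit(X, y)
```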
A kernel function k(x, y) is a function that calculates phi(x)^T phi(y) for some phi. That means it calculates the dot product of two data points in a higher-dimensional space; it does not perform the transformation itself.
That implies that kernels do not have to work on the Gram matrix. Kernels can be something completely different, e.g. Fisher kernels.
(What I wrote is based on Bishop's book and others.)
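The definition above can be checked with a quick sketch (my own example): for the homogeneous degree-2 polynomial kernel k(x, y) = (x . y)^2 on 2-D inputs, the explicit feature map is phi(x) = (x1^2, sqrt(2) x1 x2, x2^2), and the kernel reproduces the 3-D dot product without ever computing phi.

```python
import numpy as np

# Explicit feature map for the degree-2 homogeneous polynomial kernel.
def phi(v):
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

x = np.array([1.0, 2.0])
y = np.array([3.0, 4.0])

k = np.dot(x, y) ** 2              # kernel value: never leaves 2-D
explicit = np.dot(phi(x), phi(y))  # same value via the explicit 3-D map
print(k, explicit)  # both 121.0
```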
Btw, the sklearn docs have an example that I think explains different classifiers way better and also hints at a very important point: non-linear decision boundaries are mostly important for low-dimensional dense data and not helpful for text data for example.
maybe you should submit more of their posts :)
Here's Jake Vanderplas on SVD algorithms for sparse matrices:
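(Not the linked talk; just a minimal sketch of the topic using scipy, with illustrative sizes.) Truncated SVD of a sparse matrix can be computed without ever densifying it:

```python
# Hedged sketch: top-k singular triplets of a sparse matrix via scipy.
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import svds

A = sparse_random(100, 50, density=0.05, random_state=0)
U, s, Vt = svds(A, k=5)  # top-5 singular triplets, A stays sparse throughout
print(U.shape, s.shape, Vt.shape)  # (100, 5) (5,) (5, 50)
```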
Could someone point me to some reference materials (books, articles, videos, Coursera, etc) that will get me up to speed on the stuff I need to know in order to understand this article?
He also teaches the Coursera machine learning course: https://www.coursera.org/course/ml
If it's the former you are looking for, I have several recommendations. There are a plethora of online resources for learning machine learning. In video form, my favorite resource is Yaser Abu-Mostafa's (Caltech prof) video lectures, available here: http://work.caltech.edu/teaching.html
You can also actually enroll in the next version of CS 156 (Abu-Mostafa's class) and do the class online with problem sets very similar to that of Caltech students actually enrolled in the class. (It starts January 8th).
Coursera/Udacity contain several classes that involve machine learning; Andrew Ng's class is a good place to start.
If you (like me) prefer text to video, you can buy Abu-Mostafa's book (I hear it follows the course very closely, and the book costs way less than $500 :D ), read Andrew Ng's online lecture notes, or read one of several freely available ML books (such as "Elements of Statistical Learning" or "Bayesian Reasoning and Machine Learning", but I've found these to take more mathematical maturity than the other resources recommended).
Note: One reason that I've highly recommended Abu-Mostafa's resources (besides having taken the class at Caltech and loving it) is that his class is more focused on the theory of machine learning (i.e. hammering in over-arching principles like avoiding overfitting by regularizing) rather than just covering as many algorithms as possible. Also, I believe Mitchell of CMU has good online resources for his machine learning course.
_why's book for Ruby, pg/norvig's books for Lisp ... they're so enjoyable to read that you don't need to exert yourself for the subject to 'click', which is pretty impressive considering what you're learning.
I don't want to go to school to learn machine learning, but I'd be willing to pay money (maybe quite a lot of money) for a book that made it enjoyable.
Machine learning isn't a "tool" you use to instantly "solve" a concrete problem you're having. Instead, you have to build up a large body of intuition about which tools you can use to work with your data. It can't be memorized.