

Which machine learning classifiers are fast enough for medium-sized data? - ogrisel
http://blog.explainmydata.com/2012/06/ntrain-24853-ntest-25147-ncorrupt.html

======
ogrisel
As noted on the scikit-learn mailing list, the poor results for liblinear
can be caused by a loose convergence tolerance, but also by the fact that the
internal memory layout used by liblinear is not optimized for dense input data
arrays, as it is for scikit-learn's SGDClassifier.
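To see why the tolerance alone can matter, here is a minimal sketch with plain NumPy. This is not liblinear's actual dual solver; it is gradient descent on an illustrative least-squares objective, with the same kind of stop-when-gradient-is-small criterion: a loose tolerance stops early and leaves a visibly worse fit.

```python
import numpy as np

def gd_fit(X, y, tol, lr=0.01, max_iter=10_000):
    """Gradient descent on a least-squares objective; stop once the
    gradient norm drops below `tol`."""
    w = np.zeros(X.shape[1])
    for it in range(max_iter):
        grad = X.T @ (X @ w - y) / len(y)
        if np.linalg.norm(grad) < tol:
            return w, it
        w -= lr * grad
    return w, max_iter

rng = np.random.RandomState(0)
X = rng.randn(200, 5)
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ w_true

# Loose tolerance: stops after far fewer iterations, larger error.
w_loose, n_loose = gd_fit(X, y, tol=0.5)
w_tight, n_tight = gd_fit(X, y, tol=1e-6)
```

The loose run stops much earlier and its weights are noticeably further from the true ones, which is the kind of gap a too-permissive default tolerance can hide.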

~~~
iskander
Any idea what the preferred algorithm would be for linear SVMs with dense
data? (other than SGD, of course)

~~~
ogrisel
SGD with averaging [1] :)

More seriously, SGD is pretty hard to beat for fitting linear models (SVM,
logistic regression and other l1-penalized models with various loss functions)
when the number of samples gets large.

[1] <http://leon.bottou.org/projects/sgd>
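For concreteness, a minimal NumPy sketch of the idea: SGD on the L2-regularized hinge loss with Polyak-Ruppert averaging of the iterates. The step-size constants and data are illustrative, not the tuned implementation from Bottou's page.

```python
import numpy as np

def asgd_svm(X, y, alpha=0.01, epochs=5, seed=0):
    """SGD on the L2-regularized hinge loss, returning the running
    average of the iterates ('SGD with averaging')."""
    n, d = X.shape
    w = np.zeros(d)
    w_avg = np.zeros(d)
    rng = np.random.RandomState(seed)
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(n):
            t += 1
            eta = 1.0 / (alpha * (t + 1.0 / alpha))  # decaying 1/t step size
            if y[i] * (X[i] @ w) < 1:                # inside the hinge margin
                w = (1 - eta * alpha) * w + eta * y[i] * X[i]
            else:
                w = (1 - eta * alpha) * w            # only the L2 shrinkage
            w_avg += (w - w_avg) / t                 # running average
    return w_avg

# Two well-separated Gaussian blobs with labels +1 / -1.
rng = np.random.RandomState(1)
X = rng.randn(200, 2)
y = np.where(np.arange(200) < 100, 1.0, -1.0)
X += 2.0 * y[:, None]

w = asgd_svm(X, y)
acc = np.mean(np.sign(X @ w) == y)
```

Averaging smooths out the noise of the individual SGD iterates, which is what gives ASGD its better asymptotic behavior.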

------
benhamner
Interesting comparison, but using only generated mixtures of Gaussians as
training data severely limits any conclusions that can be drawn from this.
Naturally, the method with the same assumptions as the generating process had
the best performance.

It is important to note that, in most cases, both the performance of the
machine learning algorithm (in terms of the error metric) and its runtime
depend heavily on the source data.

~~~
robrenaud
I was impressed that random forests did so well on the irrelevant-feature
detection, given that they know nothing about Gaussians. Though IIRC, you've
used them to win Kaggle competitions, so maybe you already know their power.

------
fuzzmeister
Has anyone seen a similar comparison for medium-sized document classification
tasks? I'd imagine LibLINEAR would perform far better for document
classification than it does in these results.

~~~
ogrisel
Have a look at the RCV1 benchmarks on this page:
<http://leon.bottou.org/projects/sgd>

SGD is still slightly faster, but liblinear behaves well enough in that
case.

~~~
ogrisel
One unmentioned caveat of SGD is how to configure the learning rate schedule.
scikit-learn uses Bottou's tricks, which seem to work reasonably well in
practice, but it might be even better to implement the online estimate of the
optimal learning rate schedule from this NIPS 2012 pre-print:
<http://arxiv.org/abs/1206.1106> (No More Pesky Learning Rates).
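The Bottou-style schedule referred to here is roughly the following; the `t0` constant is illustrative (in practice it is set by a small calibration heuristic rather than fixed by hand):

```python
def bottou_eta(t, alpha, t0=100.0):
    """Bottou-style schedule: eta_t = 1 / (alpha * (t0 + t)).
    The step size decays like 1/t, which matches the asymptotics for
    strongly convex objectives (here, strong convexity comes from the
    L2 penalty with strength alpha)."""
    return 1.0 / (alpha * (t0 + t))

# Step sizes shrink monotonically as training progresses.
etas = [bottou_eta(t, alpha=0.01) for t in range(0, 1000, 100)]
```

The pre-print's contribution is estimating this decay online from observed gradients instead of committing to one fixed `t0` and decay rate up front.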

------
ahuibers
If anyone wants to do some very interesting contract work in machine learning
(SVMs) for a YC company, please mail me. [edit: email address now in my
profile oops]

~~~
jberryman
I'm interested, but your email's not in your profile. I'm at
brandon.m.simmons@gmail.com

------
theatraine
I wonder how a simple pre-processing operation would affect the accuracies? In
my experience, standardization (z-scoring) can really improve SVM accuracy.
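The operation being suggested is a small one; a sketch in NumPy, with the usual caveat that the statistics must come from the training split only:

```python
import numpy as np

def standardize(X_train, X_test):
    """Z-score both sets using statistics from the training set only,
    so no test-set information leaks into the preprocessing."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    sigma[sigma == 0.0] = 1.0  # leave constant features unscaled
    return (X_train - mu) / sigma, (X_test - mu) / sigma

rng = np.random.RandomState(0)
scales = np.array([1.0, 10.0, 100.0])       # wildly different feature scales
X_train = rng.randn(100, 3) * scales + 5.0
X_test = rng.randn(20, 3) * scales + 5.0
Z_train, Z_test = standardize(X_train, X_test)
```

This matters most for scale-sensitive methods like RBF-kernel SVMs, where one large-variance feature can dominate the distance computation.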

~~~
ogrisel
I think that in this case the data was generated in a uniform manner (centered
and with isotropic variance), so standardization might not greatly impact the
performance. Also, ~50% error is about the best you can get on a dataset that
is completely non-linearly-separable.

