Update on scikit-learn: recent developments for machine learning in Python

cf · on May 9, 2012

Since it isn't likely to be said otherwise: I really think this is the best library for doing machine learning out there right now.

This assessment isn't based on the breadth of algorithms supported, since R beats it there. It has nothing to do with documentation either, even though its documentation is the best. scikit-learn is fantastic because it has consistent interfaces. The creators of this library have thought very hard about what interfaces classifiers should have. This greatly reduces the learning curve and makes it a piece of cake to compare classifiers.

The clean interfaces make it easier to perform cross-validation and lead to fewer surprises. The largest problem with most machine learning code out there is that, while it works, it never gets this kind of software engineering attention.
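
To make the interface point concrete, here is a minimal sketch in plain Python. Everything in it is made up for illustration: the two toy classes, the `cross_val_accuracy` helper, and the synthetic data are assumptions, not scikit-learn code. They only mimic the `fit`/`predict` convention that scikit-learn estimators follow, to show why one cross-validation routine can then compare any classifier.

```python
import random
from collections import Counter


class MajorityClassifier:
    """Toy baseline (hypothetical): always predicts the most common training label."""

    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


class NearestNeighborClassifier:
    """Toy 1-nearest-neighbor classifier (hypothetical), squared Euclidean distance."""

    def fit(self, X, y):
        self.X_, self.y_ = list(X), list(y)
        return self

    def predict(self, X):
        def closest(x):
            dists = [sum((a - b) ** 2 for a, b in zip(x, t)) for t in self.X_]
            return self.y_[dists.index(min(dists))]

        return [closest(x) for x in X]


def cross_val_accuracy(make_model, X, y, k=5):
    """Mean accuracy over k interleaved folds, for any object with fit/predict."""
    scores = []
    for fold in range(k):
        test = list(range(fold, len(X), k))
        held_out = set(test)
        train = [i for i in range(len(X)) if i not in held_out]
        model = make_model().fit([X[i] for i in train], [y[i] for i in train])
        preds = model.predict([X[i] for i in test])
        truth = [y[i] for i in test]
        scores.append(sum(p == t for p, t in zip(preds, truth)) / len(truth))
    return sum(scores) / k


# Synthetic two-cluster data, invented purely for this example.
rng = random.Random(0)
X = [[rng.gauss(c, 0.5), rng.gauss(c, 0.5)] for c in (0, 3) for _ in range(50)]
y = [c for c in (0, 1) for _ in range(50)]

# The same evaluation code runs unchanged for both classifiers.
for make_model in (MajorityClassifier, NearestNeighborClassifier):
    print(make_model.__name__, cross_val_accuracy(make_model, X, y))
```

The evaluation loop never needs to know which classifier it is handed, which is the whole benefit: swapping models is a one-line change.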

amueller · on May 9, 2012

Thanks for this great feedback :)

What algorithms do you think scikit-learn is still missing compared to R?

cf · on May 9, 2012

Largely, what I need is integration with JAGS http://mcmc-jags.sourceforge.net/ which is a niche that pymc fills. I am working on a PR to get in the other stuff I need.

I bring up R more to make the point that there is more to a library than supporting lots of algorithms. R wins by that metric: http://www.cran.r-project.org/web/packages/available_package...

mblondel_ml · on May 9, 2012

The comparison doesn't really make sense: R is a language (and environment), scikit-learn is a library.

amueller · on May 9, 2012

I just want us to win on all metrics ;) At least if there is a benefit for the users in it.

natekupp · on May 9, 2012

It'd be really great if you guys had an implementation of Friedman's Multivariate Adaptive Regression Splines[1]. For me, this is the only R package left for which I don't have a Python alternative.

[1] http://www.salford-systems.com/doc/MARS.pdf

ogrisel · on May 9, 2012

I did not know about MARS. Do you have any practical benchmark datasets or sample tasks where MARS is state of the art? What kind of regression problems do you use it for yourself? What is the typical dimension of the problems it can address efficiently (n_samples and n_features)?

GaelVaroquaux · on May 9, 2012

I'd actually be very interested in such a model. It's a major amount of work, and in an area in which I am not an expert, so I know that I will not be working on it any time soon, but I agree that it would be a great addition to the scikit.