
A foundation for scikit-learn at Inria - fermigier
http://gael-varoquaux.info/programming/a-foundation-for-scikit-learn-at-inria.html
======
hetspookjee
Scikit-learn is a very nicely written library and I can use plenty of
superlatives to describe the wondrous API of scikit-learn.

One thing I can't recommend enough is to extend their transformer base
classes by implementing the fit and transform methods. A simple
example can be viewed here:
[https://gitlab.com/timelord/sklearn_transformers](https://gitlab.com/timelord/sklearn_transformers)

which allows you to put your transformers into the scikit-learn Pipelines and
GridSearchCV (and more). The way scikit-learn leverages multiple cores is by
using joblib and Dask extends this implementation to effortlessly scale the
scikit-learn pipelines onto a cluster of servers.
[https://distributed.readthedocs.io/en/latest/joblib.html](https://distributed.readthedocs.io/en/latest/joblib.html)
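Roughly, the joblib backend is pluggable, which is how Dask hooks in: the same context manager with `"dask"` (plus a running `dask.distributed` Client) ships the cross-validation fits to a cluster. A minimal sketch using the built-in `"threading"` backend so it runs self-contained:

```python
# Sketch of joblib's pluggable backend. Swapping "threading" for "dask"
# (with a dask.distributed Client connected) dispatches the same work
# to a cluster of servers instead of local threads.
import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, random_state=0)
search = GridSearchCV(LogisticRegression(max_iter=1000),
                      {"C": [0.1, 1.0, 10.0]}, cv=3, n_jobs=2)

with joblib.parallel_backend("threading"):
    search.fit(X, y)  # the CV fits are dispatched through the backend
```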

By writing your own data transformations in the transformer format you can, by
extension, leverage this great ecosystem.
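A minimal sketch of the pattern: subclass BaseEstimator and TransformerMixin, implement fit and transform, and the result drops straight into a Pipeline and GridSearchCV (the `ClipOutliers` class and its `limit` parameter below are made up for illustration):

```python
# Hypothetical custom transformer that clips features at +/- limit
# standard deviations, then runs inside a Pipeline and GridSearchCV.
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

class ClipOutliers(BaseEstimator, TransformerMixin):
    def __init__(self, limit=3.0):
        self.limit = limit  # stored under the same name, per the contract

    def fit(self, X, y=None):
        self.mean_ = X.mean(axis=0)
        self.std_ = X.std(axis=0)
        return self  # fit must return self

    def transform(self, X):
        lo = self.mean_ - self.limit * self.std_
        hi = self.mean_ + self.limit * self.std_
        return np.clip(X, lo, hi)

pipe = Pipeline([("clip", ClipOutliers()),
                 ("clf", LogisticRegression(max_iter=1000))])
X, y = make_classification(n_samples=200, random_state=0)
grid = GridSearchCV(pipe, {"clip__limit": [2.0, 3.0]}, cv=3).fit(X, y)
```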

I think it's a great time to be a data scientist / engineer now.

------
zeec123
Unfortunately scikit-learn is a mess without an alternative.

There is so much wrong with the API design of sklearn (how can one think
"predict_proba" is a good function name?). I can understand this, since most
of it was probably written by PhD students without the time and expertise to
come up with a proper API, many of them without a CS background. Compare this
to e.g. the API of google/guava.

For example
[https://www.reddit.com/r/statistics/comments/8de54s/is_r_bet...](https://www.reddit.com/r/statistics/comments/8de54s/is_r_better_than_python_at_anything_i_started/dxmnaef/)

    
    
       Case in point, sklearn doesn't have a bootstrap crossvalidator despite the bootstrap being one of the most
       important statistical tools of the last two decades. In fact, they used to, but it was removed. 
       Weird right?
       ...
       > We don't remove the sklearn.cross_validation.Bootstrap class because few people are using it, 
       > but because too many people are using something that is non-standard (I made it up) and very very 
       > likely not what they expect if they just read its name. 
       > At best it is causing confusion when our users read the docstring and/or its source code. 
       > At worse it causes silent modeling errors in our users code base.
       ...
       Oh man, I thought of another great example. I bet you had no idea that 
       sklearn.linear_model.LogisticRegression is L2 penalized by default. 
       "But if that's the case, why didn't they make this explicit by calling it RidgeClassifier instead?" 
       Maybe because sklearn has a Ridge object already, but it exclusively performs regression? 
       Who knows (also... why L2 instead of L1? Yeesh). Anyway, if you want to just do unpenalized 
       logistic regression, you have to set the C argument to an arbitrarily high value, 
       which can cause problems. Is this discussed in the documentation? 
       Nope, not at all. Just on stackoverflow and github. 
       Is this opaque and unnecessarily convoluted for such a basic and crucial technique? Yup.
    

Or the following:
[https://www.reddit.com/r/haskell/comments/7brsuu/machine_lea...](https://www.reddit.com/r/haskell/comments/7brsuu/machine_learning_in_functional_programming/dppck1x/)
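The penalty default in the quote above is easy to check; a minimal sketch (newer scikit-learn versions also accept an explicit no-penalty option, which this does not rely on):

```python
# LogisticRegression applies an L2 penalty by default; a near-
# unpenalized fit is obtained by making the inverse-regularization
# parameter C very large, as the quoted comment describes.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=0)

default = LogisticRegression(max_iter=1000).fit(X, y)
print(default.penalty)  # 'l2'

near_unpenalized = LogisticRegression(C=1e9, max_iter=1000).fit(X, y)
```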

~~~
bob_bob_bob
It's true that scikit-learn was started and originally written mostly by PhD
students (most were in fact CS PhDs), and the API they designed is amazing! A
lot of the Python ML ecosystem has adopted it and uses it - fit, predict,
transform. I don't think any other language has something comparable.

4 years ago they removed a misleading class - and even at the time the
documentation was clear about what it was doing. I'm not sure how this reveals
some huge flaw about scikit-learn. At best it shows that the contributors can
realize their mistakes and solve them, without even needing people to point it
out? That's great!

Also, pointing to a bad implementation from 4 years ago, in a project which has
since then had far more funding for engineering time, and whose use has
exploded, seems a bit misleading.

~~~
zeec123
See the second link I posted. Even the three most basic functionalities are
badly designed. If X is your input space and Y your output space, then fit
should (after each call) return a function X -> Y and not modify some
internal state.
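The functional style being argued for here can be sketched as a thin wrapper; `fit_fn` is made up for illustration, not a scikit-learn API:

```python
# A fit that returns a prediction function X -> Y instead of mutating
# the estimator in place: each call clones the estimator, fits the
# clone, and hands back its bound predict method.
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def fit_fn(estimator, X, y):
    return clone(estimator).fit(X, y).predict  # fresh fit, no shared state

X, y = make_classification(n_samples=100, random_state=0)
predict = fit_fn(LogisticRegression(max_iter=1000), X, y)
y_hat = predict(X)
```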

Have you ever looked at pipeline cross validation, where you have to
pass the function a dict of parameters with underscore-prefixed names for each
stage in the pipeline? Do that once and you'll never call the API design
amazing again.
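For reference, the naming convention in question addresses each parameter as "step name, double underscore, parameter name":

```python
# Grid-searching over a pipeline: parameters are keyed by
# "<step name>__<param name>", the convention complained about above.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([("scale", StandardScaler()),
                 ("clf", LogisticRegression(max_iter=1000))])
param_grid = {
    "scale__with_mean": [True, False],  # step "scale", param "with_mean"
    "clf__C": [0.1, 1.0],               # step "clf", param "C"
}
X, y = make_classification(n_samples=150, random_state=0)
search = GridSearchCV(pipe, param_grid, cv=3).fit(X, y)
```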

There are examples of other bad design choices as well.

You are right, there is no alternative at the moment. Maybe Julia will do a
better job; we will see.

------
xvilka
I thought Inria used OCaml everywhere and would have chosen Owl[1] (an OCaml
library for numerical scientific computing and machine learning) as the
project for this kind of foundation.

[1] [https://github.com/owlbarn/owl](https://github.com/owlbarn/owl)

~~~
pyrale
Inria is a public institution dedicated to research. There are many labs and
people with separate goals. They are no more dedicated to OCaml than MIT is
dedicated to Emacs.

~~~
globberz
An even better comparison would be with, say, the NSF. I am sure that this or
that technology has been developed by NSF-funded researchers, but it would be
absurd to assume that NSF-funded researchers at MIT use and promote the same
things as NSF-funded researchers at Caltech because they're both affiliated
with the NSF.

------
11235813213455
I'd love this same kind of library in nodejs

