
Scikit-Learn’s Defaults Are Wrong - dsr12
https://ryxcommar.com/2019/08/30/scikit-learns-defaults-are-wrong
======
missosoup
If you're not using regularisation for logistic regression, you're most likely
doing it wrong. If you don't know to normalise your data prior to that, then
you need to go back to the basics.

The default makes sense.

It's also clearly visible: the penalty is the very first parameter in the
constructor, and it defaults to L2. The docs also state in BOLD: "Note that
regularization is applied by default."
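
For what it's worth, a minimal sketch of how to opt out (hedged:
penalty='none' only exists in scikit-learn >= 0.22; on older versions the
best you can do is make the penalty negligible):

    from sklearn.linear_model import LogisticRegression

    # A bare call silently gives you an L2 penalty with C=1.0:
    clf = LogisticRegression()

    # A plain, unregularized logistic regression instead:
    clf = LogisticRegression(penalty='none')  # scikit-learn >= 0.22
    clf = LogisticRegression(C=1e9)           # older versions: weaken the penalty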

This post is just a case of PEBKAC.

~~~
Macuyiko
I agree. The only thing I believe the article gets right, however, is that
the naming doesn't expose this default behavior, which might cause problems
for people coming from other software.

Still, any good data scientist I know reads the documentation carefully and
has learnt that regularisation is the default.

~~~
paulgb
One thing that concerns me about the "you should read the documentation"
argument is that even if the person _writing_ the code did that for every
function they used, anyone _reading_ the code is likely to make (reasonable)
assumptions. So a bug like accidental regularization could slip past code
review.

~~~
ad404b8a372f2b9
It's a fair point that defaults should be obvious in standard software
engineering libraries, but I don't think it's possible to hold scientific
libraries to the same standard. They require expert knowledge, and the
assumptions you make about the models being used should be very carefully
checked. There are no obvious defaults in that case, because we're not
writing software, we're building scientific models.

For example, look at the neural network classifier model in scikit-learn:
[https://scikit-learn.org/stable/modules/generated/sklearn.ne...](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html)
None of these defaults can be assumed. If you showed me a function call with
no optional arguments, neither I nor anybody else who hasn't read the
documentation could tell you what activation is being used, what optimizer,
what learning rate, and so on. In fact, when I come across these kinds of
models during code review, the nature of these parameters is the first thing
I ask about.
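
To make that concrete, a rough sketch of what a bare call hides (the
spelled-out values below match the scikit-learn docs at the time of writing,
but check your version):

    from sklearn.neural_network import MLPClassifier

    clf = MLPClassifier()  # every modelling choice is implicit

    # The same model with a few of its defaults written out:
    clf = MLPClassifier(
        hidden_layer_sizes=(100,),
        activation='relu',
        solver='adam',
        alpha=0.0001,              # note: an L2 penalty, here too
        learning_rate_init=0.001,
    )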

~~~
zenexer
Then there shouldn’t be any defaults, right? All of these issues could easily
be avoided by forcing users to provide explicit parameters. It only makes
sense to offer defaults if they’re intuitive.

------
phonebucket
Sure, I can see that L2 regularization being the norm is a point to be argued
about. I am in the camp that believes there should be no regularization by
default.

But I am confused by the outrage this has caused.

Scikit-Learn's documentation is great. I do not believe that many competent
scientists and devs are making this mistake. I prefer to believe that they
always knew about the defaults, disagreed with them, overrode them, and went
on their merry way.

I also think these sorts of devs/scientists are less likely to chime in on
this debate, because they don't see what the fuss is about.

Edit: genuinely no pun intended about 'L2 regularization being the norm'.
Fancy that.

~~~
6gvONxR4sf7o
Eh, when I first encountered it I thought it was a really dumb default. It
doesn't cause trouble, except maybe when I was first coming over from stats
libraries, where every logistic regression function means whatever the
textbook tells you it is. But it's no real problem, so whatever. I'm not
going to file an issue or submit a PR. It's dumb, but not a big deal.

Saying it's a stupid default right now isn't outrage, it's just bikeshedding
that would otherwise be silly if it weren't specifically the topic of
discussion.

------
oli5679
Statsmodels has a nice implementation of logistic regression that is better
suited to statistical analysis.

It has no default regularisation, a nicely labelled coefficient table, and
model evaluation metrics.

[https://www.statsmodels.org/dev/generated/statsmodels.discre...](https://www.statsmodels.org/dev/generated/statsmodels.discrete.discrete_model.Logit.html)
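
A minimal sketch of that interface (toy data invented here for illustration):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    x = rng.normal(size=200)
    y = (x + rng.normal(scale=0.5, size=200) > 0).astype(int)

    # statsmodels makes you add the intercept yourself:
    X = sm.add_constant(x)

    result = sm.Logit(y, X).fit()  # no regularisation by default
    print(result.summary())        # labelled coefficients, std errors, CIs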

This reflects the difference in intended uses. I view sklearn as a collection
of tools to help forecast a target, whilst statsmodels is a collection of
tools to examine a model you have of the data-generating process.

If you want to do the former, sensible defaults are helpful. For example,
random forest has many tunable parameters, and I am glad the maintainers
haven't shrugged and left them all unspecified, because it means I can
quickly fit a reasonable first-cut model. For the latter, you are better off
using statsmodels. In a way this is part of the culture difference between
statistics and machine learning.

[http://brenocon.com/blog/2008/12/statistics-vs-machine-learn...](http://brenocon.com/blog/2008/12/statistics-vs-machine-learning-fight/)

One common pain point with sklearn is that it uses numpy arrays as its
representation of data series, whilst much of the ecosystem does its data
cleaning and preparation in pandas dataframes.

------
jononor
The article makes some valid points, but it is full of unnecessary snark. "If
you’re a smartypants, or someone whose brain was ruined by machine learning,
..." - come on. This kind of tone does not help communication, or motivate
those who volunteer their time to maintaining scikit-learn.

~~~
bart_spoon
I think the tone is in response to similar hostility from the other crowd.
This isn't a new issue; the first time I read a debate about the default
regularization settings in scikit was a couple of years ago. Back then the
initial responses of many, including the scikit developers, could be quite
condescending and strident at times. It's only been in the last year or so
that the tide seems to have shifted against the current defaults, and the
devs have begun to acquiesce in the form of more explicit documentation, etc.

------
reepoc
I agree with the author that it would be better to have unregularized
logistic regression as the default. This has tripped me up in the past, but
luckily the sklearn docs are good, so I could figure out what was going on.

Random forest defaults are clearly wrong. n_estimators is set to 10, which
just doesn't work as well as a reasonable default like 100.

I just checked the docs and it looks like they are updating it in a future
version.
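
If you want to be robust to that change, pinning it explicitly is cheap (a
sketch; the new default of 100 lands in a later release):

    from sklearn.ensemble import RandomForestClassifier

    # Don't rely on the version-dependent default:
    clf = RandomForestClassifier(n_estimators=100, random_state=0)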

This wouldn't normally be a problem, but there is so much hype around data
science, with boot-camp grads who have only a high-level understanding of ML,
etc., that it probably makes sense to make the library as idiot-proof as
possible with the default params.

Either way, I don't care much; it's still an awesome library for free, and
nuances like this don't affect me.

------
wei_jok
If it works so well for the (who knows how many) people using the defaults,
who is to say that they are “wrong”?

~~~
darawk
I think he makes a pretty good case for why they're wrong in the article. He
also specifically addresses that question:

> I don’t know if it’s true that a plurality of people doing logistic
> regressions are using L2 regularization and lambda = 1, but the point is
> that it doesn’t matter. Unregularized logistic regression is the most
> obvious interpretation of a bare bones logistic regression, so it should be
> the default

> Why is this a problem? Because one might expect that the most basic version
> of a function should broadly work for most cases. Except that’s not actually
> what happens for LogisticRegression. Scikit-learn requires you to either
> preprocess your data or specify options that let you work with data that has
> not been preprocessed in this specific way.

Most people running an unparameterized logistic regression would expect that
regression to work just fine on un-normalized data. However, a regularized LR
is not going to work well at all unless your data has been normalized. This
is deeply counter-intuitive and has probably tripped up thousands of people
who have no idea it even happened.
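
A toy demonstration of that scale-sensitivity (data made up here; not from
the article):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 2))
    X[:, 1] *= 1000  # second feature measured in much bigger units
    y = (X[:, 0] + X[:, 1] / 1000 > 0).astype(int)

    # The default L2 penalty shrinks raw coefficient values, so the fit
    # depends on the units of your columns:
    print(LogisticRegression().fit(X, y).coef_)

    # Standardizing first puts the features on equal footing:
    Xs = StandardScaler().fit_transform(X)
    print(LogisticRegression().fit(Xs, y).coef_)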

------
dxbydt
I'm quite dumbfounded by the comments on this thread. If I make a Python
library where I take three points as input and spit out the area of the
triangle you get when you join those points with straight lines, I have to
ensure that said area is correct. Even a high school kid knows that. To argue
that I will always assume that the three points come from a right triangle,
so I will compute three distances and return half the product of the two
smaller ones, that's just... what's the polite thing to say in HN circles
these days... retarded on my part. It means I'm completely braindead. It
means I'm dense. Obtuse. It basically means I need more coffee.

I don't get to argue that it's a right-triangle library, so I will only
entertain points from a right triangle. Because once you put the library out
there, anybody might and will use it. Who do you think uses Excel? The vast
majority are secretaries, housewives, florists, dog walkers... not just the
quants sitting in a bank. If some middle school kid doing his homework calls
my library with three collinear points, I must throw an exception. Or at
least spit out an area of zero. Imagine how you'd feel as a parent or a
teacher if your kid got a big fat zero because I assumed two of the three
lines were perpendicular when it's trivial to verify that they aren't. As the
library author, I must do the right thing, simply because, hey, it's a nice
planet, so let's be nice to fellow citizens and library callers and so forth.
Ethics, etc.

Let me ask you this. It's fucking 2019. You have a + bx on your RHS. Why the
fuck are you setting g(p) = log(p/(1-p)) as your link? You say, oh, it has a
nice interpretation, it's log odds... ok, so are you in epi? You care about
lives and deaths, cures and placebos, that sort of thing? Nope, you are doing
ML for some FAANG. Ok, then why logit? You can pick literally any CDF. Every
continuous CDF is a smooth sigmoid. They are all cadlag. The finance
community generally picks the Gaussian CDF, so that's called probit, and all
their papers are full of probit regression. Whereas the healthcare guys don't
have the analytical chops, so they just settle on logistic regression. But
it's 2019! You can stick any CDF in there! You know, GLM used to be a
graduate statistics subject. Nowadays sophomores are happily using GLM with a
custom link function. 16-17 year olds. They craft a nice-looking smooth CDF,
get its closed form, and use that in the GLM link function. Everything fits
so nicely; you get tight CIs, good inference. When these kids graduate, if
they know you are still debating probit vs logit with a default L2 penalty,
they are going to laugh in your face and kick you to the curb.

You guys are elite programmers. GLMs should be your favorite playpen. Craft
some custom link functions and step up to the plate. Logistic with default
L2?! Jesus.
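
And swapping the link really is a one-liner these days; a sketch with
statsmodels (toy data invented here):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    X = sm.add_constant(rng.normal(size=(300, 2)))
    y = rng.binomial(1, 1 / (1 + np.exp(-X @ [0.5, 1.0, -1.0])))

    # Same GLM, different inverse-CDF links: logit, probit, whatever.
    logit = sm.GLM(y, X, family=sm.families.Binomial(sm.families.links.Logit())).fit()
    probit = sm.GLM(y, X, family=sm.families.Binomial(sm.families.links.Probit())).fit()
    print(logit.params, probit.params)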

------
gosub
TLDR: scikit-learn's 'LogisticRegression' function violates the principle of
least astonishment by automagically using L2 regularization.

------
6gvONxR4sf7o
Scikit-learn has separate functions for sklearn.linear_model.LinearRegression
and sklearn.linear_model.Ridge. LinearRegression isn't regularized. From that,
I'd assume LogisticRegression isn't either.
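
The naming really is inconsistent in exactly that way (a quick illustration,
nothing more):

    from sklearn.linear_model import LinearRegression, Ridge, LogisticRegression

    LinearRegression()    # ordinary least squares, no penalty
    Ridge(alpha=1.0)      # L2-penalized linear regression, named as such
    LogisticRegression()  # L2-penalized by default, despite the plain name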

The sane ways to do it seem to be to make linear regression match logistic,
or to make logistic match linear.

It's not bad to have options, but the defaults should match what most users
expect. It's an ML library, but people coming to ML from fields that teach
stats (most quantitative fields) are aligned around a different meaning of
"logistic regression." Is it so crazy to default to that? When you tune the
model, you're going to change the default anyway.

------
fooser
This article is wrong. Regularization is always preferred for logistic
regression and should be the default for any real use case. Suppose your data
is perfectly linearly separable: when logistic regression sees that it is
assigning 96% probability to a given data point, and it could assign 99%
(without misclassifying any other points), it will. Then it will try to
assign 99.5%, and so on, until you have infinitely large coefficients (in
high dimensions this is very possible). Statsmodels' defaults assume the
least, yes, but statsmodels also won't automatically do things like add an
intercept to OLS, because it is mostly a backend for other libraries'
interfaces.
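
A toy illustration of the separable case (made up here; newer statsmodels
versions may warn instead of raising):

    import numpy as np
    import statsmodels.api as sm

    # Perfectly separable data: y is 1 exactly when x > 0.
    x = np.linspace(-1, 1, 20)
    y = (x > 0).astype(int)
    X = sm.add_constant(x)

    # Without regularization the MLE does not exist -- the likelihood keeps
    # improving as the coefficients grow without bound:
    try:
        sm.Logit(y, X).fit()
    except Exception as e:
        print(type(e).__name__, e)  # e.g. PerfectSeparationError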

EDIT: oops, funklute said this an hour ago and apparently got downvoted. He
is correct.

~~~
mattkrause
“Always” is a strong word.

You’re both right that it often makes sense for prediction, but logistic
regression can be used for inference too, and it’s much more debatable there.

------
treydey
What's the difference between the 'l1_ratio' and the 'C' parameters in the
package?

------
RocketSyntax
Show results on a test dataset to prove the point.

------
xiaodai
I remember reading in that thread that the developer was saying something
like "this is an ML library, not a statistics library". The subtext he is
trying to convey is: "I do ML! I am a higher being. If you want statistically
accurate function names then I don't care about you, because you are likely a
statistician, a lower being than me! I am an ML practitioner, a HIGHER being.
I don't take shit from lower beings like you." What an arsehole.

~~~
_Wintermute
I read it as the opposite: lots of people looking down at the sklearn devs
and poking fun, thinking they've made some mistake because they don't
understand the basics of statistics, when really the developers explained
their decision that the default parameters made the most sense for sklearn's
use-case.

~~~
xiaodai
The article already explained that there is no way in the world the L2
hyperparameter should be set to 1. The only way to get the correct parameter
is to use cross-validation.
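
For completeness, scikit-learn does ship a variant that picks the penalty
strength by cross-validation instead of silently fixing C=1.0 (a sketch):

    from sklearn.linear_model import LogisticRegressionCV

    # Searches a grid of 10 C values with 5-fold CV:
    clf = LogisticRegressionCV(Cs=10, cv=5)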

