Hacker News
Scikit-Learn’s Defaults Are Wrong (ryxcommar.com)
78 points by dsr12 on Nov 1, 2019 | 38 comments

If you're not using regularisation for logistic regression, you're most likely doing it wrong. If you don't know to normalise your data prior to that, then you need to go back to the basics.

The default makes sense.

It's also clearly stated as the very first parameter in the constructor which defaults to L2. The docs also state in BOLD: "Note that regularization is applied by default."

This post is just a case of PEBKAC.

> If you're not using regularisation for logistic regression, you're most likely doing it wrong

Why's that? Logistic regressions are perfectly reasonable to run for exploratory purposes. There are tons of situations where regularization is not what you want.

> If you don't know to normalise your data prior to that

It's not a question of not knowing. In an unregularized regression, you don't need to standardize your data beforehand. People who know this but don't realize regularization is being applied by default can easily make this mistake.

Logistic regression without regularization blows up if your data is linearly separable. This is an issue you don't really encounter with linear regression.

EDIT: Apparently this is getting downvoted. To explain a bit more clearly: if you know nothing about the data, then your default should be to run with regularization, since you won't get a useful result if the data turns out to be linearly separable. In the case of linear regression, it's often sensible to start with non-regularized fitting, since that won't mess up your coefficients. But logistic regression is quite simply a different beast altogether, and this is covered in any intro on logistic regression you care to consult.
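A minimal sketch of why (my own numpy toy, not sklearn's implementation): on linearly separable data, plain gradient ascent on the log-likelihood keeps inflating the coefficient without bound, while even a modest L2 penalty keeps it finite.

```python
import numpy as np

# Perfectly separable 1-D data: x < 0 -> class 0, x > 0 -> class 1.
x = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([0.0, 0.0, 1.0, 1.0])

def fit(l2=0.0, steps=20000, lr=0.1):
    """Gradient ascent on the (optionally L2-penalized) log-likelihood."""
    w = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-w * x))     # predicted P(y = 1)
        grad = np.sum((y - p) * x) - l2 * w  # score minus penalty gradient
        w += lr * grad
    return w

w_unpen = fit(l2=0.0)  # grows without bound; limited only by the step count
w_pen = fit(l2=1.0)    # converges to a finite optimum
print(w_unpen, w_pen)
```

The unpenalized coefficient grows (roughly logarithmically) for as long as you let the optimizer run; the penalized one settles near a finite value.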

I think you're taking a perspective which is too focused on ML, ignoring the widespread statistical use of logistic regression in lower dimensional settings with reasonable a-priori covariate structures.

> Logistic regression without regularization blows up if your data is linearly separable. This is an issue you don't really encounter with linear regression.

Finding that your data is linearly separable is usually quite important - I do see students mess this up though, to be fair.

If you want frequentist inference, non-penalized logistic regression is the way to go.

(disclaimer, I'm predominantly Bayesian, so with proper priors the point is actually moot for me.)

> I think you're taking a perspective which is too focused on ML

scikit-learn is an ML library. It's explicitly not a stats library. The assumptions and defaults are correct for ML use cases. GP comment is entirely correct.

If you want to do widespread statistics, use StatsModels which is designed for that.

>scikit-learn is an ML library. It's explicitly not a stats library.

Yeah, but logistic regression is from stats. If you're going to implement ClassicalThingFromNearbyField in an ML library, people are right to complain when the definition doesn't match. This shouldn't surprise anyone; there's a lot of overlap between stats and ML.

If I implemented a "SimpleFraction" model in sklearn, it might be useful for it to compute "[# this class] / ([# all classes] + epsilon)", using epsilon to regularize, but then the name would be crazy. Or at least it would be crazy with any default other than epsilon=0.

I am a stats guy, and sklearn did confuse the hell out of me when I tried to use it as such, but I have to agree with others here. The issue is not how regression works, the issue is that sklearn does not do stats at all!

The use case you're thinking about is not what sklearn does. You can't even use it that way, because it returns none of the model metrics you need for a specified model structure.

No one uses sklearn for stats, and logistic regression here is really not stats: it is not a statistically estimated conditional expectation, it is simply a consequence of an ML loss function. And in that sense, regularization is useful.

People expecting to get a stats model out of sklearn will fail immediately, since it does not even return errors, covariances or statistics.

The post I was replying to was attempting to make a much more general point - I understand the default as applied to Scikit Learn.

If you know nothing about your data, you shouldn’t be fitting either kind of model; you should be learning about it instead.

Are you seriously arguing that R's and statsmodels' implementations are doing it wrong? What if I had a model called "OrdinaryLeastSquares" in sklearn and had it regularize by default? After all, it's sklearn, so people probably wanted a penalized least squares model. That would be bad, because it's not what that term means in most other libraries and textbooks where users have encountered it.

Came here to say this. I can think of a lot of use cases where the feature data numbers will be dramatically different scales and the regularization helps to prevent that “noise” from dramatically affecting your model.

I agree. The only thing I believe the article gets right, however, is that the naming doesn't expose this default behavior, which might cause problems for people coming from other software.

Still, any good data scientist I know actually knows to read the documentation carefully and has learnt regularisation is the default.

One thing that concerns me about the "you should read the documentation" argument is that even if the person writing the code did that for every function they used, anyone reading the code is likely to make (reasonable) assumptions. So bugs like accidentally regularizing, for example, could slip past code review.

It's a fair point for standard software engineering libraries that the defaults should be obvious, but I don't think it's possible to hold scientific libraries to the same standards. They require expert knowledge, and the assumptions you make about the models being used should be very carefully checked. There are no obvious defaults in that case, because we're not just writing software, we're building scientific models.

For example, look at the neural network classifier model in scikit-learn: https://scikit-learn.org/stable/modules/generated/sklearn.ne... None of these defaults can be assumed. If you showed me a function call with no optional arguments, I couldn't tell you, nor could anybody who hasn't read the documentation, what activation function is being used, what optimizer, learning rate, etc. In fact, when coming across these kinds of models during code reviews, the nature of these parameters is the first thing I ask about.

Then there shouldn’t be any defaults, right? All of these issues could easily be avoided by forcing users to provide explicit parameters. It only makes sense to offer defaults if they’re intuitive.

Agree. It's naive to assume your use case is the basic one when using these libraries or that you know the underlying implementation and all of the parameters, since in ML the implementations vary enough to affect the outcome and will have different controls.

On the point about exploratory analysis raised in some parent comments, I prefer statsmodels for that purpose. It's not quite up to where the similarly purposed tools are in other languages, but for most of my work where I care about interpretation, it hits the right spot between usability and providing the standard statistical outputs.

The call signature does expose it though. Am I a minority in terms of using jupyter notebook and an IDE that shows call signatures and docstrings?

The docs only state that because of this pushback. If I remember correctly, that is a relatively recent addition (within the last year, I believe) that was added due to the community's dislike of both the defaults and the previous documentation.

Completely agree. If you can't read a library's documentation, you shouldn't be doing data science/ML in any case.

Sure, I can see that L2 regularization being the norm is a point to be argued about. I am in the camp that believes there should be no regularization by default.

But I am confused by the outrage this has caused.

Scikit-Learn's documentation is great. I do not believe that many competent scientists and devs are making this mistake. I prefer to believe that they always knew about the defaults, disagreed with them, overrode them, and went on their merry way.

I also think these sorts of devs/scientists are less likely to chime into this debate, because they don't see what the fuss is about.

Edit: genuinely no pun intended about 'L2 regularization being the norm'. Fancy that.

Eh, when I first encountered it I thought it was a really dumb default. It didn't cause trouble, except maybe when I was first coming over from stats libraries, where every logistic regression function means whatever the textbook tells you it means. But it's no real problem, so whatever. I'm not going to file an issue or submit a PR. It's dumb, but not a big deal.

Saying it's a stupid default right now isn't outrage; it's just bikeshedding that would otherwise be silly if it weren't specifically the topic of discussion.

Statsmodels has a nice implementation of logistic regression more suited for statistical analysis.

It has no default regularisation, nice labelled coefficients table, and model evaluation metrics.


This reflects the difference in intended uses. I view sklearn as a collection of tools to help forecast a target, whilst statsmodels is a collection of tools to examine a model you have of the data-generating process.

If you want to do the former, sensible defaults are helpful: for example, random forest has many tunable parameters, and I am glad the maintainers haven't shrugged and left them all unspecified, because I know I can quickly fit a reasonable first-cut model. For the latter, you are better off using statsmodels. In a way this is part of the culture difference between statistics and machine learning.


One common pain point with sklearn is that it uses numpy arrays as its representation of data series, whilst a lot of the ecosystem does its data cleaning and preparation in pandas dataframes.

The article makes some valid points. But it is full of unnecessary snark. "If you’re a smartypants, or someone whose brain was ruined by machine learning, ..." - come on. This kind of tone is not helping to communicate, or to motivate those who volunteer their time to maintain scikit-learn.

I think the tone is in response to similar hostility from the other crowd. This isn't a new issue; the first time I read debate about the default regularization settings in scikit was a couple years ago. Back then the initial responses of many, including the scikit developers, could be very condescending and strong at times. It's only been in the last year or so that the tide seems to have shifted against the current defaults, and the devs have begun to acquiesce in the form of more explicit documentation, etc.

I agree with the author that it would be better to have unregularized logistic regression as the default. This has tripped me up in the past, but luckily sklearn docs are good so I could figure out what’s going on.

Random forest defaults are clearly wrong. n_estimators is set to 10, which just doesn't work as well as a reasonable default like 100.

I just checked the docs and it looks like they are updating it in a future version.
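Spelling it out explicitly is cheap insurance either way; a quick sketch (n_estimators=100 became the default in scikit-learn 0.22, and the toy data here is made up):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, random_state=0)

# Spell out the forest size rather than relying on the version's default.
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(len(clf.estimators_))  # 100 fitted trees
```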

I think this wouldn’t be a problem, but there is so much hype around data science, boot camp grads with only a high-level understanding of ML, etc., that it probably makes sense to make the library as idiot-proof as possible with the default params.

Either way I don’t care much; it’s still an awesome lib for free, and nuances like this don’t affect me.

If it works so well for the (who knows how many) people using the defaults, who is to say that they are “wrong”?

I think he makes a pretty good case for why they're wrong in the article. He also specifically addresses that question:

> I don’t know if it’s true that a plurality of people doing logistic regressions are using L2 regularization and lambda = 1, but the point is that it doesn’t matter. Unregularized logistic regression is the most obvious interpretation of a bare bones logistic regression, so it should be the default

> Why is this a problem? Because one might expect that the most basic version of a function should broadly work for most cases. Except that’s not actually what happens for LogisticRegression. Scikit-learn requires you to either preprocess your data or specify options that let you work with data that has not been preprocessed in this specific way.

Most people running an unparameterized logistic regression would expect that regression to work just fine on un-normalized data. However, a regularized LR is not going to work at all unless your data has been normalized. This is deeply counter-intuitive and has probably tripped thousands of people up who have no idea it even happened.
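A hedged sketch of the two workflows in scikit-learn, on made-up data with wildly different feature scales. To keep it version-portable I approximate "no regularization" with a huge C (C is the inverse regularization strength) rather than penalty=None, whose spelling has changed across releases:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2)) * [1.0, 1000.0]  # very different feature scales
y = (X[:, 0] + X[:, 1] / 1000.0 + rng.normal(size=500) > 0).astype(int)

# Default: L2 penalty with C=1, whose effect depends on the feature scales.
default_lr = LogisticRegression().fit(X, y)

# Stats-style alternatives: standardize first, or effectively disable the penalty.
scaled = make_pipeline(StandardScaler(), LogisticRegression()).fit(X, y)
unpenalized = LogisticRegression(C=1e12, max_iter=1000).fit(X, y)

print(default_lr.coef_, unpenalized.coef_)
```

The point is that the default fit silently depends on how the features happen to be scaled, which is exactly the surprise the article complains about.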

I’m quite dumbfounded by the comments on this thread. If I make a Python library where I take three points as input and spit out the area of the triangle you get when you join those points with straight lines, I have to ensure that said area is correct. Even a high school kid knows that. To argue that I will always assume the three points come from a right triangle, so I will compute three distances and return half the product of the two smaller ones, that’s just... what’s the polite thing to say in HN circles these days... retarded on my part. It means I’m completely braindead. It means I’m dense. Obtuse. It basically means I need more coffee.

I don’t get to argue that it’s a right-triangle library, so I will only entertain points from a right triangle. Because once you put the library out there, anybody might and will use it. Who do you think uses Excel? The vast majority are secretaries, housewives, florists, dog walkers... not just the quants sitting in a bank. If some middle school kid doing his homework calls my library with three collinear points, I must throw an exception. Or at least spit out an area of zero. Imagine how you’d feel as a parent or a teacher if your kid got a big fat zero because I assumed two of the three lines are perpendicular when it’s trivial to verify that they aren’t. As the library author, I must do the right thing, simply because, hey, it’s a nice planet, so let’s be nice to fellow citizens and library callers and so forth. Ethics, etc.

Let me ask you this. It’s fucking 2019. You have a+bx on your RHS. Why the fuck are you setting g(p) = log(p/(1-p)) as your link? You say oh, it has a nice interpretation, it’s log odds... OK, so are you in epi? You care about lives and deaths, cures and placebos, that sort of thing? Nope, you are doing ML for some FAANG. OK, then why logit? You can pick literally any CDF. Every continuous CDF is a smooth sigmoid. They are all càdlàg. The finance community generally picks the Gaussian CDF, so that’s called probit, and all their papers are full of probit regression. Whereas the healthcare guys don’t have the analytical chops, so they just settle on logistic regression. But it’s 2019! You can stick any CDF in there!!! You know, GLM used to be a graduate statistics subject. Nowadays sophomores are happily using GLM with a custom link function. 16-17 year olds. They craft a nice-looking smooth CDF, get its closed form, and use that in the GLM link function. Everything fits so nicely: you get tight CIs, good inference. When these kids graduate, if they learn you are still debating probit vs logit with a default L2 penalty, they are going to laugh in your face and kick you to the curb.

You guys are elite programmers. glms must be your favorite playpen. Craft some custom link functions and step up to the plate. logistic with default l2?! jesus.

TLDR: scikit-learn's 'LogisticRegression' function violates the principle of least astonishment by automagically using L2 regularization.

Scikit-learn has separate functions for sklearn.linear_model.LinearRegression and sklearn.linear_model.Ridge. LinearRegression isn't regularized. From that, I'd assume LogisticRegression isn't either.
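That naming split is easy to check in a few lines (a sketch on made-up data; Ridge's alpha is the penalty strength, and LinearRegression has no such knob at all):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

ols = LinearRegression().fit(X, y)   # no regularization at all
ridge = Ridge(alpha=10.0).fit(X, y)  # explicit L2 penalty

# Ridge shrinks coefficients toward zero relative to OLS.
print(np.abs(ols.coef_).sum(), np.abs(ridge.coef_).sum())
```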

The sane ways to do it seem to be to make linear regression match logistic, or to make logistic match linear.

It's not bad to have options, but the defaults should match what most users expect. It's an ML library, but people coming to ML from fields that teach stats (most quantitative fields) are aligned around a different meaning of "logistic regression." Is it so crazy to default to that? When you tune the model, you're going to change the defaults anyway.

This article is wrong. Regularization is always preferred for logistic regression and should be the default for any real use case. Suppose your data is perfectly linearly separated: when logistic regression sees that it is assigning 96% probability to a given data point, and it could assign 99% (without misclassifying any other points), it will. Then it will try to assign 99.5%, and so on, until you have infinitely large coefficients (in high dimensions this is very possible). Statsmodels' defaults assume the least, yes, but statsmodels also won't automatically do things like adding an intercept to OLS, because it is mostly a backend for other libraries' interfaces.

EDIT: oops, funklute said this an hour ago and apparently got downvoted. He is correct

“Always“ is a strong word.

You’re both right that it often makes sense for prediction, but logistic regression can be used for inference too, and it’s much more debatable there.

What's the difference between the 'l1_ratio' and the 'C' parameters in the package?

show results on test dataset to prove points

I remember reading that thread where the developer was saying something like "this is an ML library, not a statistics library". The subtext he is trying to convey is: "I do ML! I am a higher being. If you want statistically accurate function names then I don't care about you. Because you are likely a statistician, a lower being than me! I am an ML practitioner, a HIGHER being. I don't take shit from lower beings like you." What an arsehole.

Haha, I read it the exact opposite way. This is ML, not statistics - we’re not publishing, aren’t obsessed with rigor, we just want to scrape some sites and spit out a house price predictor in a few hours :)

I read it as the opposite: lots of people looking down at the sklearn devs and poking fun, thinking they've made some mistake because they don't understand the basics of statistics. When really the developers explained their decision that the default parameters made the most sense for sklearn's use case.

The article already explained that there is no way in the world the L2 hyperparameter should just be set to 1. The only way to get the correct parameter is to use CV.
