The default makes sense.
It's also clearly stated as the very first parameter in the constructor, which defaults to L2. The docs also state in BOLD: "Note that regularization is applied by default."
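For reference, here's a minimal sketch of what that default looks like spelled out (signature abbreviated; exact defaults vary by version):

```python
from sklearn.linear_model import LogisticRegression

# penalty="l2" and C=1.0 (inverse regularization strength) are the
# documented defaults; all other arguments are omitted here.
clf = LogisticRegression(penalty="l2", C=1.0)
```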
This post is just a case of PEBKAC.
Why's that? Logistic regressions are perfectly reasonable to run for exploratory purposes. There are tons of situations where regularization is not what you want.
> If you don't know to normalise your data prior to that
It's not a question of not knowing. In an unregularized regression, you don't need to standardize your data beforehand. People who know this but don't realize regularization is being applied by default can easily make this mistake.
EDIT: Apparently this is getting downvoted. To explain a bit more clearly: if you know nothing about the data, then your default should be to run with regularization, since you won't get a useful result if the data turns out to be linearly separable. In the case of linear regression, it's often sensible to start with non-regularized fitting, since that won't mess up your coefficients. But logistic regression is quite simply a different beast altogether, and this is covered in any intro on logistic regression you care to consult.
> Logistic regression without regularization blows up if your data is linearly separable. This is an issue you don't really encounter with linear regression.
Finding that your data is linearly separable is usually quite important - I do see students mess this up though, to be fair.
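To make the blow-up concrete, here's a minimal sketch on perfectly separable toy data (penalty=None needs a recent scikit-learn; older versions spell it penalty="none"):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Perfectly separable toy data: y = 1 exactly when x > 0
x = np.linspace(-1, 1, 20).reshape(-1, 1)
y = (x[:, 0] > 0).astype(int)

# Without a penalty the MLE does not exist: the coefficient grows
# without bound, capped only by the iteration limit and tolerance.
unreg = LogisticRegression(penalty=None, max_iter=10_000).fit(x, y)
print(unreg.coef_)  # very large

# The default L2 penalty keeps the coefficient finite
reg = LogisticRegression().fit(x, y)
print(reg.coef_)  # modest
```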
If you want frequentist inference, non-penalized logistic regression is the way to go.
(disclaimer, I'm predominantly Bayesian, so with proper priors the point is actually moot for me.)
scikit-learn is an ML library. It's explicitly not a stats library. The assumptions and defaults are correct for ML use cases. GP comment is entirely correct.
If you want to do classical statistics, use StatsModels, which is designed for that.
Yeah, but logistic regression is from stats. If you're going to implement ClassicalThingFromNearbyField in an ML library, people are right to complain when the definition doesn't match. That shouldn't surprise anyone; there's a lot of overlap between stats and ML.
If I implemented a "SimpleFraction" model in sklearn, it might be useful for it to compute "[# this class] / ([# all classes] + epsilon)", using epsilon to regularize, but "SimpleFraction" would be a crazy name for it. Or at least it would be crazy if I used a default other than epsilon=0.
The use case you're thinking of is not what sklearn serves. You can't even use it for that, because it returns none of the model metrics you'd need for a specified model structure.
No one uses sklearn for stats, and logistic regression here really isn't stats: it is not an estimated conditional expectation, it is simply a consequence of an ML loss function.
And in that sense, regularization is useful.
People expecting to get a stats model out of sklearn will fail immediately, since it does not even return standard errors, covariances or test statistics.
Still, any good data scientist I know actually knows to read the documentation carefully and has learnt that regularisation is the default.
For example, look at the neural network classifier model in scikit-learn:
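Presumably something like the following; the defaults below are paraphrased from the docs and vary by version, so treat them as illustrative:

```python
from sklearn.neural_network import MLPClassifier

# A bare call hides every modelling decision behind defaults.
clf = MLPClassifier()

# Roughly the same model with those defaults spelled out
# (one hidden layer of 100 units, ReLU, Adam, an L2 penalty here too):
clf_explicit = MLPClassifier(
    hidden_layer_sizes=(100,),
    activation="relu",
    solver="adam",
    alpha=0.0001,              # L2 regularization strength
    learning_rate_init=0.001,
    max_iter=200,
)
```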
None of these defaults can be assumed. If you showed me a function call with no optional arguments, neither I nor anybody else who hasn't read the documentation could tell you what activation is being used, what optimizer, what learning rate, etc.
In fact, when coming across these kinds of models during code reviews, the nature of these parameters is the first thing I ask about.
To the point about exploratory analysis in some of the parent comments: I prefer statsmodels for that purpose. It's not quite up to where the similarly purposed tools in other languages are, but for most of my work, where I care about interpretation, it hits the right spot between usability and providing the standard statistical outputs.
But I am confused by the outrage this has caused.
Scikit-Learn's documentation is great. I do not believe that many competent scientists and devs are making this mistake. I prefer to believe that they always knew about the defaults, disagreed with them, overrode them, and went on their merry way.
I also think these sorts of devs/scientists are less likely to chime into this debate, because they don't see what the fuss is about.
Edit: genuinely no pun intended about 'L2 regularization being the norm'. Fancy that.
Saying it's a stupid default right now isn't outrage, it's just bikeshedding that would otherwise be silly if it weren't specifically the topic of discussion.
statsmodels has no default regularisation, a nice labelled coefficients table, and model evaluation metrics.
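A minimal sketch of that workflow (hypothetical data, standard statsmodels API):

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data, just to illustrate the interface
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X @ [1.5, -0.5] + rng.normal(size=200) > 0).astype(int)

# Unpenalized logit, fit by maximum likelihood
model = sm.Logit(y, sm.add_constant(X)).fit()
print(model.summary())  # coefficients, std errors, p-values, CIs
```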
This reflects the difference in intended uses. I view Sklearn as a collection of tools to help forecast a target, whilst statsmodels is a collection of tools to examine a model you have of the data-generating process.
If you want to do the former, sensible defaults are helpful. For example, random forest has many tunable parameters, and I am glad the maintainers haven't shrugged and left them all unspecified, because I know I can quickly fit a reasonable first-cut model. For the latter, you are better off using statsmodels. In a way this is part of the culture difference between statistics and machine learning.
One common pain point with sklearn is that it uses numpy arrays to represent data series, whilst much of the ecosystem does its data cleaning and preparation with pandas dataframes.
Random forest defaults are clearly wrong: n_estimators is set to 10, which just doesn't work as well as a reasonable default like 100.
I just checked the docs and it looks like they are updating it in a future version.
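In the meantime, a minimal sketch of pinning it yourself (n_estimators is the actual parameter name):

```python
from sklearn.ensemble import RandomForestClassifier

# Don't rely on the old default of 10 trees; set a sane value explicitly.
rf = RandomForestClassifier(n_estimators=100, random_state=0)
```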
I think this wouldn't be a problem, but with so much hype around data science, boot-camp grads with only a high-level understanding of ML, etc., it probably makes sense to make the library as idiot-proof as possible with the default params.
Either way I don’t care much, it’s still an awesome lib for free and nuances like this don’t affect me
> I don’t know if it’s true that a plurality of people doing logistic regressions are using L2 regularization and lambda = 1, but the point is that it doesn’t matter. Unregularized logistic regression is the most obvious interpretation of a bare bones logistic regression, so it should be the default
> Why is this a problem? Because one might expect that the most basic version of a function should broadly work for most cases. Except that’s not actually what happens for LogisticRegression. Scikit-learn requires you to either preprocess your data or specify options that let you work with data that has not been preprocessed in this specific way.
Most people running an unparameterized logistic regression would expect it to work just fine on un-normalized data. However, a regularized LR penalizes features on different scales unevenly, so it will quietly misbehave unless your data has been normalized. This is deeply counter-intuitive and has probably tripped up thousands of people who have no idea it even happened.
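A minimal sketch of the effect (hypothetical data; without a penalty, rescaling one feature would leave the other coefficients untouched):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = (X @ [2.0, -1.0] + rng.normal(size=500) > 0).astype(int)

# Same data, but feature 0 rescaled (e.g. metres -> millimetres)
X_scaled = X.copy()
X_scaled[:, 0] *= 1000.0

# Under the default L2 penalty, feature 1's coefficient shifts when
# feature 0 is rescaled; unpenalized, it would stay exactly the same.
print(LogisticRegression(max_iter=5000).fit(X, y).coef_[0, 1])
print(LogisticRegression(max_iter=5000).fit(X_scaled, y).coef_[0, 1])

# Standardizing first makes the penalty treat features comparably
pipe = make_pipeline(StandardScaler(), LogisticRegression())
print(pipe.fit(X, y).score(X, y))
```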
I don't get to argue that it's a right-triangle library, so I will only entertain points from a right triangle. Because once you put the library out there, anybody might and will use it. Who do you think uses Excel? The vast majority are secretaries, housewives, florists, dog walkers... not just the quants sitting in a bank. If some middle-school kid doing his homework calls my library with three collinear points, I must throw an exception. Or at least spit out an area of zero. Imagine how you'd feel as a parent or a teacher if your kid got a big fat zero because I assumed two of the three lines are perpendicular when it's trivial to verify that they aren't. As the library author, I must do the right thing, simply because, hey, it's a nice planet & so let's be nice to fellow citizens and library callers & so forth. Ethics, etc.
Let me ask you this. It's fucking 2019. You have a + bx on your RHS. Why the fuck are you setting g(p) = log(p/(1-p)) as your link? You say oh, it has a nice interpretation, it's log odds... ok, so are you in epi? You care about lives and deaths, cures & placebos, that sort of thing? Nope, you are doing ML for some FAANG. Ok, then why logit? You can pick literally any CDF. Every continuous CDF is a smooth sigmoid. They are all càdlàg. The finance community generally picks the Gaussian CDF, so that's called probit, and all their papers are full of probit regression. Whereas the healthcare guys don't have the analytical chops, so they just settle on logistic regression. But it's 2019! You can stick any CDF in there!!! You know, GLM used to be a graduate statistics subject. Nowadays sophomores are happily using GLM with a custom link function. 16-17 year olds. They craft a nice-looking smooth CDF, get its closed form, use that in the GLM link function. Everything fits so nicely, you get tight CIs, good inference. When these kids graduate, if they know you are still debating probit vs logit with a default L2 penalty, they are going to laugh in your face & kick you to the curb.
You guys are elite programmers. GLMs must be your favorite playpen. Craft some custom link functions and step up to the plate. Logistic with a default L2?! Jesus.
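For what it's worth, a minimal sketch of swapping links in a binomial GLM with statsmodels (hypothetical data; recent versions use CamelCase link classes, older ones spell them logit/probit/cloglog):

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data, just to show the link is a free choice
rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(300, 2)))
y = rng.binomial(1, 1 / (1 + np.exp(-X @ [0.5, 1.0, -1.0])))

# Same binomial GLM, three different links
for link in (sm.families.links.Logit(),
             sm.families.links.Probit(),
             sm.families.links.CLogLog()):
    fit = sm.GLM(y, X, family=sm.families.Binomial(link=link)).fit()
    print(type(link).__name__, fit.params.round(3))
```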
The sane ways to do it seem to be either to make linear regression match logistic, or to make logistic match linear.
It's not bad to have options, but the defaults should match what most users expect. It's an ML library, but people coming to ML from fields that teach stats (most quantitative fields) are aligned around a different meaning of "logistic regression." Is it so crazy to default to that? When you tune the model, you're going to change the defaults anyway.
EDIT: oops, funklute said this an hour ago and apparently got downvoted. He is correct
You’re both right that it often makes sense for prediction, but logistic regression can be used for inference too, and it’s much more debatable there.