
Logistic regression from scratch - pmuens
https://philippmuens.com/logistic-regression-from-scratch/
======
curiousgal
Very interesting to see people in the comments debate what is a very basic
thing taught in any stats/econometrics class.

The idea behind binary regression (Y is 0 or 1) is that you use a latent
variable Y* = beta X + epsilon.

X is the matrix of independent variables, beta is the vector of coefficients,
and epsilon is an error term that captures whatever X can't explain.

Y thus becomes 1 if Y* > 0 and 0 otherwise.

Seeing how Y is binary, we can model it using a Bernoulli distribution with a
success probability P(Y=1) = P(Y* > 0) = 1 - P(Y* <= 0) = 1 - P(epsilon <=
-beta X) = 1 - CDF_epsilon(-beta X).

Technically you can use any function that maps R to [0, 1] as a CDF. If its
density is symmetric then you can write the above probability directly as
CDF(beta X). The two usual choices are the normal CDF, which gives the
Probit model, or the logistic function (sigmoid), which gives the Logit model.
With the CDF known you can calculate the likelihood and use it to estimate the
coefficients.

People prefer the Logit model because its coefficients are interpretable in
terms of log-odds, and the function has some nice numerical properties.

That's all there is to it really.
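
A minimal sketch of the above, assuming NumPy/SciPy and made-up data (all
names here are illustrative): fit the same latent-variable model by maximum
likelihood with the logistic CDF (Logit) and the normal CDF (Probit).

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(500), rng.normal(size=500)])  # design matrix with intercept
y = (X @ np.array([0.5, 2.0]) + rng.logistic(size=500) > 0).astype(float)  # Y = 1 iff Y* > 0

def neg_log_lik(beta, cdf):
    p = np.clip(cdf(X @ beta), 1e-9, 1 - 1e-9)  # P(Y = 1) = CDF(beta X)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

sigmoid = lambda z: 1 / (1 + np.exp(-z))
logit_fit = minimize(neg_log_lik, np.zeros(2), args=(sigmoid,))    # Logit model
probit_fit = minimize(neg_log_lik, np.zeros(2), args=(norm.cdf,))  # Probit model
```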

~~~
reactspa
I've always found it a little simplistic that the default cut-off, in most
statistical software, for whether something should be 0 or 1, is 0.5.

(i.e. > 0.5 equals 1, and < 0.5 equals 0).

This seems to be a "rarely-questioned assumption".

Is there a reason why this is considered reasonable? And is there a name for
the cut-off (i.e., if I were to want to change the cut-off, what keyword
should I search for inside the software's manual?)?

~~~
curiousgal
From a stats perspective, the cutoff is included in the coefficients. If you
use a design matrix (add a column of 1s to your variables), then in non-matrix
notation you get beta_0 * 1 + beta_1 * X_1 + ..., so the threshold can be
considered beta_0.

In the software, you can get classification models to output class
probabilities instead of class labels. You can then use whatever threshold you
like to transform those probabilities into labels.

You may see it referred to as the "discrimination threshold". Varying that
threshold is how ROC curves are constructed.
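
For example, with scikit-learn (a sketch on made-up data; names are
illustrative), you can take the class probabilities and apply any threshold
yourself:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                        # made-up features
y = (X[:, 0] + rng.normal(size=200) > 0).astype(int)

model = LogisticRegression().fit(X, y)
probs = model.predict_proba(X)[:, 1]                 # P(Y = 1) for each row
labels = (probs >= 0.3).astype(int)                  # any discrimination threshold, e.g. 0.3
```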

~~~
cuchoi
Would the threshold be beta_0 in every case, or only when you have subtracted
the mean from your data?

~~~
vhhn
You don't want to demean your dependent (response) binary variable. So you
almost always want to keep beta_0 to control for any imbalance in your
dependent variable.

~~~
cuchoi
I meant demeaning the independent variables. My understanding is that beta_0
will have the meaning curiousgal attaches to it only if you demean your
independent variables.

~~~
vhhn
I see. But I think after demeaning X, beta_0 will just have a special
meaning: the log-odds of the average case. Nothing more.
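
A small sketch of that point with made-up data (names and numbers are
illustrative; the large C just switches off scikit-learn's default
regularization):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(loc=3.0, size=(1000, 1))               # feature with nonzero mean
p_true = 1 / (1 + np.exp(-(X[:, 0] - 3.0)))           # true P(Y = 1)
y = (rng.random(1000) < p_true).astype(int)

Xc = X - X.mean(axis=0)                               # demean the independent variable
m = LogisticRegression(C=1e9).fit(Xc, y)
# m.intercept_ is now the fitted log-odds of Y = 1 at the average x
```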

------
IdiocyInAction
My understanding of Logistic Regression is that it's linear regression on the
log-odds, which are then converted to probabilities with the sigmoid/softmax
function. This formulation lets one model probabilities with a linear
predictor, without the unpleasant side effects of just using a linear model
as-is (such as predicted probabilities outside [0, 1]). A mathematical
justification for doing this is given by the generalized linear model
formulation.
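
A tiny sketch of that correspondence in plain NumPy (values illustrative):
the linear predictor lives on the log-odds scale, and the sigmoid maps it
back to a probability.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def logit(p):                       # inverse of the sigmoid: probability -> log-odds
    return np.log(p / (1 - p))

z = 0.7                             # a linear predictor value, beta . x
p = sigmoid(z)                      # the corresponding probability, ~0.668
assert np.isclose(logit(p), z)      # the log-odds recover the linear predictor
```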

~~~
nerdponx
It's better to think of linear regression and logistic regression as special
cases of the Generalized Linear Model (GLM).

In that framework, they are literally the same model with different "settings"
- Gaussian vs. Bernoulli distribution.
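
For instance, with statsmodels (a sketch with made-up data; all names here
are illustrative), the only "setting" that changes between the two models is
the family:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(200, 2)))        # design matrix with intercept
y_cont = X @ [1.0, 2.0, -1.0] + rng.normal(size=200)  # continuous response
y_bin = (rng.random(200) < 1 / (1 + np.exp(-(X @ [0.0, 2.0, -1.0])))).astype(float)

linear = sm.GLM(y_cont, X, family=sm.families.Gaussian()).fit()  # linear regression
logit = sm.GLM(y_bin, X, family=sm.families.Binomial()).fit()    # logistic regression
```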

~~~
markkvdb
I have to disagree with you. While assuming Gaussian disturbance terms results
in a linear regression, the linear regression framework is more general. It
makes no assumptions about the distribution of the disturbance terms. Instead,
it merely restricts the variance to be constant over all values of the
response variable.

~~~
nerdponx
Both things can be true.

Linear regression is extra-special because it's a special case of several
different frameworks and model classes.

I should have written that it's better (in my opinion) to think of logistic
regression in the context of GLMs, at least while you're learning.

Edit: yes, logistic regression is a special case of regression with a
different loss function. But it's not nearly "as special" as linear regression.

~~~
zwaps
As above, I would strongly agree with you. Both linear and logistic regression
can be special cases of frameworks that are more general and far less
parametric than GLM. But they also have very intuitive or hands-on
explanations, especially logistic regression, which GLM doesn't have.

------
thomasahle
Logistic regression can learn some quite amazing things. I trained a linear
function to play chess:
[https://github.com/thomasahle/fastchess](https://github.com/thomasahle/fastchess)
and it manages to predict the next moves of top engine games with 27%
accuracy.

A benefit of logistic regression is that the resulting model is really fast.
Furthermore, it's linear, so you can do incremental updates to your
prediction: if you have `n` classes and `b` input features change, you can
recompute in `bn` time rather than doing a full matrix multiplication, which
can be a huge time saver.
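
A sketch of that update trick (shapes and numbers are illustrative, not
fastchess's actual code): with logits z = W @ x, changing `b` entries of x
only touches `b` columns of W, so the refresh costs O(b * n).

```python
import numpy as np

n, d = 1800, 768                           # e.g. ~1800 possible moves, d board features
rng = np.random.default_rng(0)
W = rng.normal(size=(n, d))
x = rng.normal(size=d)
z = W @ x                                  # full O(n * d) product, done once

idx = np.array([3, 42, 100])               # the b features that changed
x_new = x.copy()
x_new[idx] += 1.0
z = z + W[:, idx] @ (x_new[idx] - x[idx])  # O(b * n) incremental update

assert np.allclose(z, W @ x_new)           # matches the full recomputation
```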

~~~
mhh__
Isn't 27% worse than flipping a coin?

~~~
thomasahle
A typical chess position has 20-40 legal moves. The complete space of moves
for the model to predict from has about 1800 moves.

For comparison, Leela Zero gets around 60% accuracy predicting its own next
move.

With this sort of accuracy you can reduce the search part of the algorithm to
an effective branching factor of 2-4 rather than 40, nearly for free, which is
a pretty big win.

~~~
nuclearnice1
I don’t understand the comment about Leela. Why isn’t own move prediction
deterministic?

~~~
thomasahle
Because Leela (like fastchess mentioned above) has two parts: A neural network
predicting good moves, and a tree search exploring the moves suggested and
evaluating the resulting positions (with a second net).

If the prediction (policy) net had a 100% accuracy, you wouldn't need the tree
search part at all.

~~~
nuclearnice1
Got it. Thanks for clarifying. Let me restate.

Part one of Leela ranks several chess moves. Part two picks among those.

60% of the time part 2 chooses the #1 ranked move.

~~~
thomasahle
That works :-)

One addition: The second part can be run for an arbitrary amount of time,
gradually improving the quality of the returned move.

The 60% figure comes from the training games, which are played very quickly
and so don't allow much time for refining, which increases this
self-prediction accuracy.

In real games (tcec-chess.com) this "self accuracy" would probably be a bit
lower.

------
oli5679
Weight of evidence binning can be a helpful feature-engineering strategy for
logistic regression.

Often this is a good 'first cut' model for a binary classifier on tabular
data. If feature interactions don't have a major impact on your target, this
can actually be a tough benchmark to beat (a rough sketch follows the links
below).

[https://github.com/oli5679/WeightOfEvidenceDemo](https://github.com/oli5679/WeightOfEvidenceDemo)

[https://www.listendata.com/2015/03/weight-of-evidence-woe-and-information.html](https://www.listendata.com/2015/03/weight-of-evidence-woe-and-information.html?m=1)
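
A hedged sketch of the idea (made-up data and names; WoE sign conventions
vary by source, and empty bins would need smoothing in practice):

```python
import numpy as np
import pandas as pd

def woe_encode(x, y, n_bins=5):
    """Bin a numeric feature and replace each bin with ln(%events / %non-events)."""
    df = pd.DataFrame({"x": x, "y": y})
    df["bin"] = pd.qcut(df["x"], q=n_bins, duplicates="drop")
    grouped = df.groupby("bin", observed=True)["y"]
    events = grouped.sum()                        # count of y == 1 per bin
    non_events = grouped.count() - events         # count of y == 0 per bin
    woe = np.log((events / events.sum()) / (non_events / non_events.sum()))
    return df["bin"].map(woe).to_numpy()          # feature encoded as its bin's WoE

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = (x + rng.normal(size=1000) > 0).astype(int)
x_woe = woe_encode(x, y)   # transformed feature to feed into the logit model
```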

------
colinmhayes
How do we differentiate between econometrics and machine learning? Logistic
regression seems to me like it fits into econometrics better than machine
learning. There's no regularization. I guess there's gradient descent, which
can be seen as more machine learning. In the end it's semantics of course,
but it's still an interesting distinction.

~~~
eVoLInTHRo
Econometrics is the application of statistical techniques on economics-related
problems, typically to understand relationships between economic phenomena
(e.g. income) and things that might be associated with it (e.g. education).

Machine learning is typically defined as a way to enable computers to learn
from data to accomplish tasks, without explicitly telling them how.

Both fields can use logistic regression, regularization, and gradient descent
to accomplish their goals, so in that sense there's no distinction.

But IMO there is a difference in their primary intention: econometrics
typically focuses on inference about relationships, machine learning typically
focuses on predictive accuracy. That's not to say that econometrics doesn't
consider predictive accuracy, or that machine learning doesn't consider
inference, but it's usually not their primary concern.

~~~
colinmhayes
So you're going with the only difference being who's building the model.
Interesting take; I can't say I disagree much. Although I would say that
regularization in econometric models is a bit rare, because it distorts the
coefficients, whose interpretation, as you pointed out, is the primary goal
of econometrics.

~~~
mr_toad
Econometric models tend to be hand-fit and focus more on
explanation/hypothesis testing than prediction, so automated variable
selection is less common (and sometimes frowned upon).

------
clircle
Ah, the old "regression from scratch" post that is mandatory for all blogs

~~~
melling
It looks like he’s working through a lot of algorithms:

[https://github.com/pmuens/lab](https://github.com/pmuens/lab)

