
Logistic Regression from Bayes’ Theorem - Homunculiheaded
https://www.countbayesie.com/blog/2019/6/12/logistic-regression-from-bayes-theorem
======
imbusy111
If you fit a linear model for the coffee making problem and one of the
parameters is temperature and the coefficient for the temperature in the
linear model is positive, does that mean if you keep increasing the
temperature without limit, the probability of making a good cup of coffee
increases as well without limit?

In reality the temperature is required to fall within a certain range around
an ideal value.

~~~
blackbear_
Yes, you are correct. In other words, the relationship between temperature and
quality is not linear, so directly using the temperature in a linear model
gives wrong results. (To be pedantic, the probability will approach 1 as the
temperature goes to infinity.)
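
For instance, a minimal numpy sketch (coefficients made up, not from the
post) shows the problem:

    import numpy as np

    temps = np.array([150.0, 190.0, 230.0, 500.0])  # brew temperature, F
    beta0, beta1 = -9.5, 0.05   # hypothetical intercept and (positive) slope
    log_odds = beta0 + beta1 * temps
    p_good = 1 / (1 + np.exp(-log_odds))
    # p_good rises monotonically with temperature and approaches 1,
    # even though 500F coffee is undrinkable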

~~~
imbusy111
The example problem seems to be completely inappropriate then for the
article/model.

~~~
frankc
I think it's a little more complicated than that. A variable might not be
linear in general but may be approximately linear within a certain range of
values. You might fit the model on values only within that linear range and
thus get a good fit. The model may be very useful inside the range of fitted
values but garbage at extrapolation. As long as you understand the limitations
it can still be a useful model.

~~~
jfarmer
While you’re right, the original post is meant to be pedagogical. Someone who
doesn’t understand the fundamentals of model selection might learn the wrong
lesson(s).

You kinda have to expect a student to use the examples you give.

There’s really no upside to using that example.

~~~
zwaps
If I were the author, I would drop things like temperature unless I am willing
to discuss nonlinear transformations thereof.

Instead, why not simplify it down to D = freshness of coffee beans?

------
debbiedowner
It would be nice to hear about the optimization method, with convergence
guarantees etc. Introducing the model is nice, but you also need to show the
quality and ease of the fit. You could maybe do this beforehand, since you
rely on the idea of learning the parameters somehow to motivate the model.
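
For example, a minimal numpy sketch (not from the post) of plain gradient
descent on the cross-entropy loss, which is convex, so it converges to the
global optimum for a small enough step size:

    import numpy as np

    def sigmoid(z):
        return 1 / (1 + np.exp(-z))

    def fit_logistic(X, y, lr=0.1, steps=10000):
        """Gradient descent on the mean negative log-likelihood (convex)."""
        X = np.column_stack([np.ones(len(X)), X])   # intercept column
        beta = np.zeros(X.shape[1])
        for _ in range(steps):
            p = sigmoid(X @ beta)
            beta -= lr * X.T @ (p - y) / len(y)     # gradient of the loss
        return beta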

You can relate it to NNs for free, since it is a linear layer with a sigmoid
activation.

You can stress it is linear in that your decision boundary is linear.

I don't like how capitalized letters are not random variables but are
observations.

You can give some examples of what conditional PDFs P(H=1 | D) look like and
what you can model. In your case, if the ideal temp for coffee is 190F and
the coffee is bad whenever it is off by 10 or more, then you hope that
(temp - 190)^2 is a feature input.
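
Something like this sketch (coefficients made up):

    import numpy as np

    temp = np.array([160.0, 180.0, 190.0, 200.0, 220.0])
    d = (temp - 190.0) ** 2       # squared distance from the ideal temp
    beta0, beta1 = 3.0, -0.05     # hypothetical; beta1 < 0 penalizes distance
    p_good = 1 / (1 + np.exp(-(beta0 + beta1 * d)))
    # p_good peaks at 190F and falls off symmetrically on either side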

Congrats on the book deal!

------
jules
Very nice! How about this, for more than 2 classes:

Let p_k be the probability of being in class k. We assume log p_k = f_k(x) +
C(x) where x is the feature vector and C(x) is normalisation to make the
probabilities sum to 1.

Equivalently, p_k is proportional to exp(f_k(x)), so p_k = exp(f_k(x)) / sum_j
exp(f_j(x)).

Because of the normalisation we may assume without loss of generality that
f_0(x) = 0. Then if we have 2 classes and f_1(x) is linear, we get logistic
regression.
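
A quick numpy sketch of that construction (function names are mine, not from
the post):

    import numpy as np

    def softmax(scores):
        """p_k = exp(f_k(x)) / sum_j exp(f_j(x)); shift for stability."""
        e = np.exp(scores - scores.max())
        return e / e.sum()

    # Two classes with f_0(x) = 0 and f_1(x) = b0 + b1*x recover the sigmoid:
    b0, b1, x = -1.0, 2.0, 0.8
    f = np.array([0.0, b0 + b1 * x])
    assert np.isclose(softmax(f)[1], 1 / (1 + np.exp(-(b0 + b1 * x))))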

~~~
mateo411
It's called multinomial logistic regression if there are more than 2 classes.
It works exactly how you described.

------
doomrobo
This was a really neat exposition! I have a few questions:

1. Is D a binary random variable? If so, what exactly does it mean to say
beta*D + beta_0 is an approximation for log odds? Doesn't this formula only
take on 2 possible values?

2. Could you provide intuition for why a linear function of D would be a good
approximation for the log odds mentioned?

~~~
zwaps
D can be binary, in which case the model predicts two alternative values for
the log-odds: the linear part is shifted by beta whenever D=1.

As for your second point, there is no prior reason in this case why a linear
function of D would be a good approximation. Indeed, in the current case we
would probably prefer to write at least beta1*D + beta2*D^2 + beta0, which is
still linear in a transformation of D.

That being said, however, there is a notion of why a linear function may be a
good approach. If you are interested in the direction of change around the
average values of the variables involved, then a linear function gives you
such a "linear approximation" of the slope. This of course quickly breaks
down if the function is not really linear, and in particular it breaks down
if you are interested in predicting an observation that is not "average". But
often one may be interested in very qualitative statements such as: on
average, is the coffee improved by fresher beans, yes or no? In that case,
such a linear model may give an answer.

Note that the above is absolutely not formally correct.

Finally, logistic regression can also be motivated differently.

- It arises from minimizing certain entropy losses in Machine Learning

- One may assume that the binary variable we observe is really just based on
a "latent" variable (here something like coffee quality), which is determined
by such a linear model (see the simulation sketch after this list)

- Finally, in economics and reinforcement learning, we assume agents make a
decision (here whether the coffee is good or bad) by judging the inputs plus
some random "error" or "taste" parameter which has an extreme value
distribution. Since only the differences between these utilities matter (cf.
odds ratios), and the actual values are meaningless, logistic regression also
arises.
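
Here is a small simulation of that latent-variable story (hypothetical
coefficients; sklearn just for the fit):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    x = rng.normal(size=(20000, 1))
    beta0, beta1 = -0.5, 2.0                   # hypothetical true parameters
    # Latent "coffee quality": linear model plus logistic noise; we only
    # observe whether it crosses zero.
    latent = beta0 + beta1 * x[:, 0] + rng.logistic(size=20000)
    y = (latent > 0).astype(int)

    fit = LogisticRegression(C=1e6).fit(x, y)  # large C ~ no regularization
    print(fit.intercept_, fit.coef_)           # roughly recovers beta0, beta1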

------
blackbear_
NB: this post uses D for the input x and H for the output y. This confused me
quite a bit since usually in ML we use D for the data (pairs of x and y) and H
for the model (in most cases the parameters, the betas in this example).

------
PopularBoard
I'm a little confused: how technical is this approach? I can't understand the
meaning of P(D), for example. Does it make sense in strict mathematics?

~~~
vidarh
P(D) is just the probability of D. This is common notation when talking about
probabilities.

~~~
PopularBoard
How can we talk about the probability of the data (D)?

~~~
vidarh
D in this case refers to a specific set of variables that go into brewing
coffee. P(D) then refers to the probability of a given set of values for that
vector of variables, out of all the possible values.

Don't take it too literally - P(..) here is not some well-defined function;
it's effectively just part of the name, a convention for naming
probabilities. I find it confusing too.

As the article points out, that set is for all intents and purposes infinite
in this case, but this doesn't matter, as you can sidestep it by comparing to
complementary hypotheses (which makes P(D) cancel out). This is all covered in
the article.
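
Concretely, in odds form, with made-up numbers:

    # Hypothetical numbers: H = "good cup", D = "fresh beans observed"
    p_h, p_d_given_h, p_d_given_not_h = 0.7, 0.8, 0.3

    prior_odds = p_h / (1 - p_h)                        # 7/3
    likelihood_ratio = p_d_given_h / p_d_given_not_h    # 8/3
    posterior_odds = likelihood_ratio * prior_odds      # P(D) cancels out
    p_h_given_d = posterior_odds / (1 + posterior_odds) # ~0.86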

The only maths worth reading up on to understand this article is a basic
introduction to Bayes' theorem - the Wikipedia page is quite decent.

------
s_Hogg
I realise this is pedantry but it's definitely "Bayes' theorem" not "Baye's
theorem" dammit.

Sorry about that.

~~~
BlackFly
Well, if you want to be a pedant, it should arguably be Bayes's theorem:
[https://www.bartleby.com/141/strunk.html](https://www.bartleby.com/141/strunk.html).

That "Drop the trailing s to form possessives" exception needs to die.

~~~
WhompingWindows
Why does it need to die? What is the rationale behind that?

~~~
cwilkes
What’s the rationale for keeping it? English should try to be at least a
little bit easier to parse for a newbie.

~~~
darkpuma
One man trying to dictate the direction of a language used by a billion
people... that's just not going to happen.

Not that I don't sympathize. I think that "it's" should be the possessive form
of 'it', not a contraction of 'it is'. Maybe if we band together we can
force-feed both our changes at once to the rest of the world?

~~~
cameronbrown
Languages by definition are fluid. If a certain way of spelling a word or
grammatical structure spreads to become popular (possibly, yes, from one
person), then it becomes the new normal.

~~~
darkpuma
English is fluid like the tides are fluid. The chance that somebody can
dictate the direction of that tide is very slim.

