If you fit a linear model for the coffee-making problem, and one of the parameters is temperature with a positive coefficient, does that mean that if you keep increasing the temperature without limit, the probability of making a good cup of coffee also increases without limit?
In reality the temperature is required to be close to a specific value, within a certain range.
But this is just a criticism of extrapolation beyond the support, not of any particular class of model. The extrapolation would be equally nonsensical with a linear probability model or any sigmoid transformation, because in reality the problem is only defined within a support (or else is a hinge problem, where everything outside the support has a fixed value). This doesn't make the model useless; it makes it useless for particular out-of-support extrapolations. This is why inference is both a quantitative and a qualitative problem.
Yes, you are correct. In other words, the relationship between temperature and quality is not linear, so directly using the temperature in a linear model gives wrong results. (To be pedantic, the probability will approach 1 as the temperature goes to infinity.)
Well, in the real world you can only boil water so the highest temperature of (uncontained) water at nominal pressures is ~100C no matter how long you wait or how much heat you apply. Even in an espresso maker the max pressure sets the max temp. So it might be that the maximum temperature is the right answer (not infinity) for a good cup.
I think it's a little more complicated than that. A variable might not be linear in general but may be approximately linear within a certain range of values. You might fit the model on values only within that linear range and thus get a good fit. The model may be very useful inside the range of fitted values but garbage at extrapolation. As long as you understand the limitations it can still be a useful model.
While you’re right, the original post is meant to be pedagogical. Someone who doesn’t understand the fundamentals of model selection might learn the wrong lesson(s).
You kinda have to expect a student to use the examples you give.
It would be nice to hear about the optimization method, with convergence guarantees etc. Introducing the model is nice, but you need to show the quality and ease of the fit. You could maybe do this earlier, since you rely on the idea of learning the parameters somehow to motivate the model.
You can relate it to NNs for free, since it is a linear layer with a sigmoid activation.
You can stress it is linear in that your decision boundary is linear.
I don't like how capitalized letters are not random variables but are observations.
You can give some examples of what conditional PDFs P(H=1 | D) look like and what you can model. In your case, if the ideal temp for coffee is 190F and moving +/- 10 or more away from it makes the coffee bad, then you hope that (temp - 190)^2 is a feature input (see the sketch below).
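Here is a minimal sketch of that last suggestion (my own toy code, not from the article): logistic regression written as a linear layer plus a sigmoid, fit by plain gradient descent on the log-loss, with the hand-crafted (temp - 190)^2 feature. The synthetic dataset and the 190F ideal temperature are made-up assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    temp = rng.uniform(150, 220, size=500)                  # brew temperatures (F)
    good = (np.abs(temp - 190) + rng.normal(0, 3, 500) < 10).astype(float)

    # Hand-crafted feature: squared distance from the assumed ideal temperature,
    # rescaled so plain gradient descent behaves well.
    X = np.column_stack([np.ones_like(temp), ((temp - 190) / 10) ** 2])

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    beta = np.zeros(X.shape[1])
    for _ in range(5000):
        p = sigmoid(X @ beta)                               # predicted P(H=1 | D)
        grad = X.T @ (p - good) / len(good)                 # gradient of the mean log-loss
        beta -= 0.05 * grad

    print(beta)  # expect a clearly negative weight on the quadratic feature

Since the log-loss is convex, swapping plain gradient descent for Newton's method (IRLS) is the standard way to get the fast, reliable convergence asked about above.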
Very nice! How about this, for more than 2 classes:
Let p_k be the probability of being in class k. We assume log p_k = f_k(x) + C(x) where x is the feature vector and C(x) is normalisation to make the probabilities sum to 1.
Equivalently, p_k is proportional to exp(f_k(x)), so p_k = exp(f_k(x)) / sum_j exp(f_j(x)).
Because of the normalisation we may assume without loss of generality that f_0(x) = 0. Then if we have 2 classes and f_1(x) is linear, we get logistic regression.
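A quick numerical check of that reduction (my own sketch, not from the article): a softmax over the scores [0, f_1(x)] collapses to the usual sigmoid of f_1(x).

    import numpy as np

    def softmax(scores):
        e = np.exp(scores - np.max(scores))   # shift for numerical stability
        return e / e.sum()

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # Two classes, with f_0(x) fixed to 0 and f_1(x) = z:
    for z in [-3.0, -0.5, 0.0, 1.2, 4.0]:
        p = softmax(np.array([0.0, z]))
        assert np.isclose(p[1], sigmoid(z))   # class-1 probability equals sigmoid(z)
        print(z, p[1])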
This was a really neat exposition! I have a few questions:
1. Is D a binary random variable? If so, what exactly does it mean to say beta*D + beta_0 is an approximation for log odds? Doesn't this formula only take on 2 possible values?
2. Could you provide intuition for why a linear function of D would be a good approximation for the log odds mentioned?
D can be binary, in which case the model predicts two alternative values for the log-odds: the linear part is simply shifted by beta whenever D=1.
As for your second point, there is no prior reason in this case why a linear function of D would be a good approximation. Indeed in the current case, we would probably prefer to at least write
beta_1*D + beta_2*D^2 + beta_0
which is still linear in a transformation of D.
That being said, however, there is some intuition for why a linear function may be a good approach.
If you are interested in the direction of change around the average values of the variables involved, then a linear function gives you such a "linear approximation" of the slope. This of course quickly breaks down if the function is not really linear, and in particular it breaks down if you are interested in predicting an observation that is not "average".
But often, one may be interested in such very qualitative statements as: on average, the coffee is improved by fresher beans - yes or no?
In that case, such a linear model may give an answer.
Note that the above is absolutely not formally correct.
Finally, logistic regression can also be motivated differently.
- It arises from minimizing certain entropy losses in Machine Learning
- One may assume that the binary variable we observe is really just based on a "latent" variable (here something like coffee quality), which is determined by such a linear model
- Finally, in economics and reinforcement learning, we assume agents make a decision (here, whether the coffee is good or bad) by judging the inputs plus some random "error" or "taste" parameter which has an extreme-value distribution. Since only the differences between these utilities matter (cf. odds ratios), and the actual values are meaningless, logistic regression also arises (see the small simulation below).
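A small simulation of that last (random-utility) story, with made-up coefficients: each option gets a linear score plus independent extreme-value (Gumbel) noise, the agent picks the larger, and the resulting choice frequencies match the logistic curve.

    import numpy as np

    rng = np.random.default_rng(1)
    beta, beta0 = 1.5, -0.5    # made-up coefficients for the "coffee is good" utility
    n = 200_000

    for x in [-2.0, 0.0, 1.0, 3.0]:
        u_good = beta * x + beta0 + rng.gumbel(size=n)   # utility of "good"
        u_bad = rng.gumbel(size=n)                       # utility of "bad"
        p_sim = (u_good > u_bad).mean()                  # simulated choice frequency
        p_logit = 1.0 / (1.0 + np.exp(-(beta * x + beta0)))
        print(x, round(p_sim, 3), round(p_logit, 3))     # the two columns should agree

The same construction with normal instead of Gumbel errors gives probit rather than logistic regression.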
D is a vector of input data. If it's more than a single number, then beta D needs to be interpreted as a dot product.
There's no guarantee that in any specific case a linear function of D will be a good approximation to the log odds. (In the present instance, where D is the temperature, it won't be -- there'll be a narrow range of good temperatures and the further away you get from that range, the worse the coffee is likely to be.)
But a linear approximation is at least simple, and log odds (unlike e.g. probability) can take any value from -oo to +oo. Sometimes you get lucky.
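A concrete illustration of that last point (my own sketch): the log-odds transform stretches (0, 1) onto the whole real line and the sigmoid inverts it, so a linear predictor on the log-odds scale never runs into a hard boundary.

    import numpy as np

    def log_odds(p):
        return np.log(p / (1 - p))

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    for p in [0.001, 0.1, 0.5, 0.9, 0.999]:
        z = log_odds(p)
        print(p, round(z, 3), round(sigmoid(z), 3))   # sigmoid undoes log_odds
    # Probabilities near 0 or 1 map to very large negative/positive log-odds,
    # so there is no hard range for the linear part beta*D + beta_0 to violate.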
NB: this post uses D for the input x and H for the output y. This confused me quite a bit since usually in ML we use D for the data (pairs of x and y) and H for the model (in most cases the parameters, the betas in this example).
I'm a little confused: how technical is this approach? I can't understand the meaning of P(D), for example. Does it make sense in strict mathematics?
The approach is formally correct. You always have to make sure these values actually exist, but otherwise it goes through.
The example is probably not the very best, but P(D) may make more sense if you think of the following:
If D equals the amount of coffee I put in the grinder, then D has a certain random component. Sometimes I put in more, sometimes less - even though I aim at a specific level. This is why it is important to have a concept of P(D) in Bayes' equation.
The idea here is that the one case where I inadvertently put in a lot of coffee should not be taken as "strong evidence".
D in this case refers to a specific set of variables that go into brewing coffee. P(D) then refers to the probability of observing a given set of values for that vector of variables, out of all the possible values.
Don't take it too literally - P(..) here is not some well defined function, it's effectively just part of the name, as a convention for naming probabilities. I find it confusing too.
As the article points out, that set is for all intents and purposes infinite in this case, but this doesn't matter, as you can sidestep it by comparing to complementary hypotheses (which makes P(D) cancel out). This is all covered in the article.
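In toy numbers (made up here just to show the cancellation), the same posterior comes out whether you compute P(D) explicitly or work with the odds of the two complementary hypotheses:

    # Made-up numbers: prior and the two likelihoods for H=1 ("good coffee") vs H=0.
    p_h1 = 0.3
    p_d_given_h1 = 0.8
    p_d_given_h0 = 0.2

    # Full Bayes: P(D) is itself just a sum over the complementary hypotheses.
    p_d = p_d_given_h1 * p_h1 + p_d_given_h0 * (1 - p_h1)
    posterior = p_d_given_h1 * p_h1 / p_d

    # Odds form: P(D) cancels; only the likelihood ratio and prior odds remain.
    odds = (p_d_given_h1 / p_d_given_h0) * (p_h1 / (1 - p_h1))
    print(posterior, odds / (1 + odds))   # both print ~0.632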
The only maths worth reading up on to understand this article is a basic introduction to Bayes theorem - the wikipedia page is quite decent.
If you read the article, you will find out that you are only supposed to use the exception on historical names. Most people use it incorrectly and drop the trailing s any time the word ends in an s. Arguably for Bayes, he is historical enough that it could be appropriate, but that makes the rule even worse: at what point in time does a personality become historical enough that you must drop the s? I personally consider "historical" to mean: about the time of Archimedes, when a lot of people's names seemed to end in s.
In my opinion it is just better to keep the s as that matches how people will pronounce it and regularization of the language is better. Plus the misuse of the rule makes it even worse.
One man trying to dictate the direction of a language used by a billion people... that's just not going to happen.
Not that I don't sympathize. I think that "it's" should be the possessive form of 'it', not a contraction of 'it is'. Maybe if we band together we can force-feed both our changes at once to the rest of the world?
Languages by definition are fluid. If a certain way of spelling a word or grammatical structure spreads to become popular (possibly, yes, from one person), then it becomes the new normal.
Because it reflects a misunderstanding. People say "Boris's bike" not "Boris' bike". They typically omit the possessive s when they mistake the existing s for a possessive. Like in Bayes.
Well, to be even more pedantic, there is a rule for leaving out the final s. You leave it out in possessives built from names of famous historical personalities and mythical creatures: Socrates' dialogues, Zeus' wife, Pegasus' wings, Jesus' teachings. In every other case you have to add the s: Bayes's Theorem, Jones's wife.
At least that's how I've learned it as a non-native speaker.