Asking why the logistic/sigmoid function is used in logistic regression is like asking why the quadratic x^2 function is used in quadratic regression or why The Beatles are the band whose recording is on the White Album.
A regression fits data to a curve. A logistic regression fits binary data to the logistic curve. Asking where these functions come from is interesting, but that's another question entirely.
I had always looked at the logistic function as having an easy source, the ODE:
y' = y(1-y)
As for how that relates to probability (p' = p[1-p]) I leave as an exercise to the reader.
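If you want to make that concrete, here's a throwaway numerical check (just a sketch, all names mine): integrating y' = y(1 - y) from y(-6) = sigmoid(-6) reproduces the sigmoid curve.

```python
# Minimal sketch: integrate the logistic ODE y' = y(1 - y) and compare against
# the sigmoid 1 / (1 + e^-t), which is its solution through y(0) = 1/2.
import numpy as np
from scipy.integrate import solve_ivp

t = np.linspace(-6, 6, 200)
sol = solve_ivp(lambda t, y: y * (1 - y), (t[0], t[-1]), [1 / (1 + np.exp(6.0))],
                t_eval=t, rtol=1e-9, atol=1e-12)
sigmoid = 1 / (1 + np.exp(-t))
print(np.max(np.abs(sol.y[0] - sigmoid)))  # tiny: the ODE solution is (numerically) the sigmoid
```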
>Note: the log of the odds function is often called "the logistic" function.
I was taught that this (the inverse of the logistic function) is the logit function.
My brain is admittedly a little fuzzier than it used to be these days due to medical issues, so if anyone can clarify for me, I'd appreciate it.
The reason I upvoted this post was the hope that comments like this would be made. I'm still solidifying my grasp of ML/math topics, so when these posts come up I always hope there are commenters with more experience/education in these topics to further enlighten me.
I think you're asking too much of upvotes. Since "upvote" and "save" are the same thing on HN, a lot of people upvote anything that meets the bar of "based on the title, it's plausible this is something I'd like to read eventually". That's really about all you can expect of upvotes.
In my opinion, the best justification for the logit is that it is an easily-optimizable member of the family of surrogate losses:
The papers cited in the article pretty cleanly explain why the logit and other surrogate losses are the 'natural choice' for classification. This also explains why other non-generative models like the SVM perform similarly well to the logit in practice.
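To make the surrogate-loss point concrete, here's a tiny sketch (my own illustration, not from the papers): the 0-1 loss is a step function of the margin and can't be descended, while the logistic and hinge losses are convex upper bounds of it that can.

```python
# Sketch: compare the 0-1 loss with two convex surrogates as functions of the
# margin m = y * f(x). The logistic loss is scaled so it passes through (0, 1).
import numpy as np

m = np.linspace(-3, 3, 7)                     # margins y * f(x)
zero_one = (m <= 0).astype(float)             # 0-1 loss: 1 if misclassified (or on the boundary)
logistic = np.log1p(np.exp(-m)) / np.log(2)   # logistic loss, rescaled to equal 1 at m = 0
hinge = np.maximum(0.0, 1.0 - m)              # hinge loss (SVM)

for row in zip(m, zero_one, logistic, hinge):
    print("m=% .1f  0/1=%.1f  logistic=%.2f  hinge=%.2f" % row)
```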
Here's another motivation for why logits are practically useful: Bayes' Theorem!
You've probably seen Bayes' theorem in the depressing form:
Pr(C | E) = Pr(E | C) Pr(C) / Pr(E).
If you've gotten a little further with it, you may have seen that (if you work out a lot of mathematics) you can tack on a | O to all of these probabilities, where the O stands for old-evidence about the world, so that you can make this expression recursive:
Pr(C | E O) = Pr(E | C O) Pr(C | O) / Pr(E | O).
And if you've gotten a bit further you may have seen that in practice we often will use the fact that Pr(E | O) = Pr(E | C O) Pr(C | O) + Pr(E | !C O) Pr(!C | O), where ! is of course set-complement or the prefix "not-". This allows us to write:
Pr(C | E O) = Pr(E | C O) Pr(C | O) / { Pr(E | C O) Pr(C | O) + Pr(E | !C O) [1 − Pr(C | O)] }.
But we can actually rewrite this last expression using the odds ratio,
Odds(A) = Pr(A) / [1 - Pr(A)]
Odds(C | E O) = [Pr(E | C O) / Pr(E | !C O)] * Odds(C | O).
On the assumption that a bunch of pieces of evidence are independent, each factor decouples from the others and just becomes some factor W_i. The logit expression looks simply like:
log(Odds(C | E1 E2 ... En)) = log(Odds(C)) + log(W1) + log(W2) + ... + log(Wn).
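A toy sketch of what that buys you in code (all numbers made up for illustration): with independent evidence you just add log likelihood ratios to the prior log-odds, and squash back through the sigmoid at the end.

```python
# Toy illustration: Bayesian updating in log-odds space with conditionally
# independent evidence. Each piece of evidence contributes an additive term.
import math

def logit(p):
    return math.log(p / (1 - p))

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

prior = 0.2                                   # Pr(C | O), made up
likelihood_ratios = [3.0, 0.5, 4.0]           # W_i = Pr(E_i | C O) / Pr(E_i | !C O), made up

log_odds = logit(prior) + sum(math.log(w) for w in likelihood_ratios)
print(sigmoid(log_odds))                      # posterior Pr(C | E1 E2 E3, O) ~= 0.60
```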
The choice of the sigmoid function does not stem from applying MLE to the model: it is an a priori and arbitrary modeling decision, made before one even starts thinking about how to estimate the parameters of the model.
The intuitive justification with respect to the choice of this link function given in this blog post seems quite standard to me. This text-book chapter gives a similar intuition: http://www.stat.cmu.edu/~cshalizi/uADA/12/lectures/ch12.pdf (section 12.2).
Also, in practice we don't use pure MLE but l2-penalized (or l1-penalized) MLE to fit a logistic regression model, as otherwise it can be very sensitive to noisy data when the number of features is very large (compared to the number of training samples).
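For instance (a minimal sketch with synthetic data, assuming scikit-learn's LogisticRegression): with far more features than samples, weakening the l2 penalty lets the coefficients blow up.

```python
# Sketch: 200 features, 50 samples. C is the inverse penalty strength, so larger C
# means weaker regularization and (typically) larger, noisier coefficients.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 200))
y = (X[:, 0] + 0.5 * rng.normal(size=50) > 0).astype(int)

for C in (0.1, 1.0, 1000.0):
    clf = LogisticRegression(penalty="l2", C=C, max_iter=5000).fit(X, y)
    print(C, np.abs(clf.coef_).max())   # coefficient magnitudes grow as the penalty weakens
```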
Thanks for pointing out the relation to the Bernoulli distribution.
The "logistic regression" has been invented / reinvented independently with different derivations 10 times or so.
Which is to say that the starting point is that of someone figuring out what is going on behind the scenes of a working software package rather than building up from first principles.
Curious if there is a link to a better explanation from a similar starting point.
Maybe a better analogy is cryptography, where the practical implications of entropy-pool design are relatively esoteric despite the vastly less accessible mathematics of reliable one-way encryption algorithms.
It would be interesting to try step functions run through low-pass filters with different cutoffs to see which work best for logistic regression. Anything that produces a monotonically increasing value ought to converge. The only question is how fast.
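Here's a rough sketch of that experiment (my own illustration): low-pass filtering a step with a Gaussian kernel gives you the normal CDF, i.e. the probit link, so the filter cutoff just rescales the linear predictor.

```python
# Sketch: smooth a step function with Gaussian kernels of different widths and
# compare against the normal CDF with matching scale.
import numpy as np
from scipy.ndimage import gaussian_filter1d
from scipy.stats import norm

x = np.linspace(-10, 10, 2001)
step = (x > 0).astype(float)

for sigma_units in (0.5, 1.0, 2.0):
    sigma_samples = sigma_units / (x[1] - x[0])          # kernel width in samples
    smoothed = gaussian_filter1d(step, sigma_samples)
    err = np.max(np.abs(smoothed - norm.cdf(x / sigma_units)))
    print(sigma_units, err)                              # small: differences are just discretization
```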
cough Rectified Linear Units cough
I actually started down a similar path to self-learn AI, but ended up going the conventional Masters degree route. I even setup a skeleton of a blog to document the journey.
The home page has some courses you may want to add to your resource list. I've taken quite a few courses on Udacity and have enjoyed almost all of them.
Good luck on your adventure!
Usual recs are Strang or Axler Linear Algebra, Ross or Tsitsiklis for Probability, Spivak Calculus, Boyd Convex Optimization, and other folks will recommend real analysis, diff eq's, topology, etc.
This interpretation is also nice as it shows the natural relationship between logistic and probit regression. Probit regression arises when the unobserved utility is modeled with a Gaussian distribution.
More detail: http://fisher.osu.edu/~schroeder.9/AMIS900/ch5.pdf
There are other non-linearities that people use to map onto (0, 1); for example, probit regression uses the normal CDF. In fact you can use the CDF of any distribution supported on the whole real line, and the sigmoid is an example of this -- it's the CDF of a standard logistic distribution.
There's a nice interpretation for this using an extra latent variable: for probit regression, you take your linear predictor, add a standard normal noise term, and the response is determined by the sign of the result. For logistic regression, same thing except make it a standard logistic instead.
This then extends nicely to ordinal regression too.
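A quick simulation of that latent-variable story (illustrative only): threshold a linear predictor plus noise at zero, and the implied success probability is the noise distribution's CDF.

```python
# Sketch: add logistic or Gaussian noise to a fixed linear predictor, threshold at
# zero, and compare the empirical success rate against the corresponding CDF.
import numpy as np

rng = np.random.default_rng(0)
eta = 0.8                                    # value of the linear predictor
n = 200_000

logistic_noise = rng.logistic(size=n)
normal_noise = rng.normal(size=n)

print((eta + logistic_noise > 0).mean())     # ~ 1 / (1 + exp(-0.8)) ~= 0.69 (logistic regression)
print((eta + normal_noise > 0).mean())       # ~ Phi(0.8)            ~= 0.79 (probit regression)
```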
It's certainly not the only option though, and not always the best fit.
I suppose the logistic having heavier tails than the normal is probably the main consideration in motivating one or the other as the better model for a given situation.
The logistic, being heavier-tailed, is potentially more robust to outliers. In terms of binary data, that means it might be a better choice in cases where an unexpected outcome is possible even in the most clear-cut cases. Probit regression, with its lighter normal tails, might be a better fit in cases where the response is expected to be pretty much deterministic in clear-cut cases, and where quite severe inferences can be drawn from unexpected outcomes in those cases. Sound fair?
Probit might lead to more efficient inferences in cases where the mechanism is known to become deterministic relatively quickly as the linear predictor gets big.
You could go further in either direction too (more or less robust) by using other link functions.
Use linear regression when a 1 unit increase in x increases y by a.
Use logistic regression when a 1 unit increase in x increases the log odds by a.
Use Poisson regression when a 1 unit increase in x multiplies y by e^a (see also negative binomial regression).
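A compact way to see those three interpretations side by side (a sketch with made-up data; I'm assuming statsmodels here):

```python
# Sketch: fit the three models on synthetic data and read off how a one-unit
# increase in x moves the prediction on each model's natural scale.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=500)
X = sm.add_constant(x)

y_lin = 2.0 * x + rng.normal(size=500)                                     # true slope a = 2
y_bin = (rng.uniform(size=500) < 1 / (1 + np.exp(-2.0 * x))).astype(int)   # true log-odds slope a = 2
y_poi = rng.poisson(np.exp(0.5 * x))                                       # true Poisson coefficient a = 0.5

print(sm.OLS(y_lin, X).fit().params[1])                                       # ~ 2.0: y shifts by a
print(sm.GLM(y_bin, X, family=sm.families.Binomial()).fit().params[1])        # ~ 2.0: log odds shift by a
print(np.exp(sm.GLM(y_poi, X, family=sm.families.Poisson()).fit().params[1])) # ~ 1.65: y is multiplied by e^a
```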
If you want to do Bayesian updating of your model upon seeing new data, and you want to express this as a linear function, a sum of log-odds values is the way to do it.
The best way to understand this is to learn about "naive bayes" first. If you start with naive bayes, but replace the values that are based on observed frequencies with values that you can just optimize for your objective, you get logistic regression.
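Rough sketch of that correspondence (illustrative, binary features assumed): naive Bayes gets its linear weights from smoothed counts, while logistic regression keeps the same linear form but fits the weights by optimization.

```python
# Sketch: per-feature weights of Bernoulli naive Bayes, computed from smoothed
# frequencies. The decision function is linear in x, just like logistic regression.
import numpy as np

def nb_weights(X, y, alpha=1.0):
    """Per-feature log likelihood ratios from smoothed counts (the naive Bayes 'weights')."""
    p1 = (X[y == 1].sum(axis=0) + alpha) / (len(X[y == 1]) + 2 * alpha)
    p0 = (X[y == 0].sum(axis=0) + alpha) / (len(X[y == 0]) + 2 * alpha)
    return np.log(p1 / p0) - np.log((1 - p1) / (1 - p0))

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)
feature_probs = np.where(y[:, None] == 1, [0.8, 0.3, 0.5], [0.2, 0.6, 0.5])  # made-up class-conditional rates
X = (rng.uniform(size=(1000, 3)) < feature_probs).astype(int)

print(nb_weights(X, y))   # roughly [+2.8, -1.3, 0]
```

Logistic regression's decision function has exactly this w·x + b form; the difference is that w is chosen to maximize the (penalized) likelihood rather than read off from counts.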
The need to publish this stuff is debatable. I don't think promoting it -- without clear caveats -- is particularly helpful. Self-submitting it to HN borders on naive arrogance.
Frankly you don't know what you don't know and the amount of cultural damage you can do in hopes of finding the one person who'll risk downvotes to say "nope, that's sorta wrong" doesn't seem like a great use of everyone's time. I'd prefer we reserve that for bigger, broader, riskier, innovative concepts. Not e^-x.
To me, the risk to Hacker News is that comments inconsistent with HN's ethos of civility dissuade people from the vulnerability that comes with posting original work. And I'd prefer people put a bit more effort into writing thoughtful and reflective comments.
Coincidentally I've driven Feynman's van. In fact it's the quality of Feynman's highly accessible writing that has me so disappointed in half-assed articles such as this one.
A young, knowledge-hungry, enthusiastic crowd such as HN deserves BETTER input, not junk like this. Surely we can retain the positivity while culling the crap.
If tpacek downvotes a crypto post it probably shouldn't sit on the homepage all day. But that's not the structure here.
Better would be a post that explains what's wrong with the post, or a better submission.
I, for example, haven't driven Feynman's van and my math is quite rusty, but I sure enjoyed the post and the following discussion.
Last week's idiotic neural net post involved data compression. It's now a top Google result atop research papers from the people who actually figured it out 30 years ago. I'm eager to see next week's thriller... see ya in the comments. If I get in early enough I'll fix it with love; if I get in late again I may or may not take the time to question why we keep doing things this way. And if I do a half-assed job with either, I'll certainly appreciate people telling me so. Might not work for everyone, but seems to open doors for me--including driver's side on the van.
Even assuming you're 100% right, hauling in barrels of garbage and dumping them in the commons is not a good reaction.