The Sigmoid Function in Logistic Regression (karlrosaen.com)
97 points by krosaen on May 17, 2016 | 57 comments

Perhaps I was educated wrong, or perhaps I've forgotten almost everything from my math degrees, but I can't wrap my head around this blog post, let alone how it was upvoted so highly.

Asking why the logistic/sigmoid function is used in logistic regression is like asking why the quadratic x^2 function is used in quadratic regression or why The Beatles are the band whose recording is on the White Album.

A regression fits discrete data to a curve. A logistic regression fits discrete data to the logistic curve. Asking where these functions come from is interesting, but that's another question entirely.

I had always looked at the logistic function as having an easy source, the ODE:

y' = y(1-y)

As for how that relates to probability (p' = p[1-p]) I leave as an exercise to the reader.
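For anyone who wants to check the ODE claim numerically, here is a minimal Python sketch. It assumes the initial condition y(0) = 1/2 (not stated above) and uses plain forward-Euler integration; the function names are made up for illustration.

```python
import math

def logistic(t):
    """Closed-form candidate solution of y' = y(1 - y) with y(0) = 1/2."""
    return 1.0 / (1.0 + math.exp(-t))

def integrate_ode(t_end, steps=100_000):
    """Forward-Euler integration of y' = y(1 - y), starting from y(0) = 1/2."""
    y = 0.5
    dt = t_end / steps
    for _ in range(steps):
        y += dt * y * (1.0 - y)
    return y

numeric = integrate_ode(3.0)
exact = logistic(3.0)
print(numeric, exact)  # the two values should agree to several decimal places
```

Other initial conditions in (0, 1) just shift the curve along the t axis.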

>Note: the log of the odds function is often called "the logistic" function.

I was taught that this (the inverse of the logistic function) is the logit function.

My brain is admittedly a little fuzzier than it used to be these days due to medical issues, so if anyone can clarify for me, I'd appreciate it.

>let alone how it was upvoted so highly.

The reason I upvoted this post was with the hope that comments like this would be made. I'm still solidifying my grasp of ML/math topics so when these posts come up I always hope there are commenters with more experience/education in these topics to further enlighten me.

Yikes. Unfortunately this is why (IMO) I see a lot of mediocre ML articles upvoted very highly on HN that have either very little discussion, or generally poor quality discussion, especially around hot topics like neural networks. The discussion I've seen here is honestly the exception to the norm (probably due to the statistical nature of the topic). Personally this is one of the few areas where Reddit does better.

>Unfortunately this is why (IMO) I see a lot of mediocre ML articles upvoted very highly on HN that have either very little discussion, or generally poor quality discussion, especially around hot topics like neural networks

I think you're asking too much of upvotes. Since "upvote" and "save" are the same thing on HN, a lot of people upvote anything that meets the bar of "based on the title, it's plausible this is something I'd like to read eventually". That's really about all you can expect of upvotes.

I realize that's what many people do, but I don't agree with it because it's an implicit signal of interest/worthiness, due to affecting rankings. It gives people on HN the wrong perception of what is popular/worth learning and risks leading autodidacts astray.

I actually agree and I would love to see HN adopt a separate "save" flag aside from "upvote" (like Reddit does), but... it is what is, for the moment anyway.

Any suggestions which subreddits I should be looking at?

/r/machinelearning is pretty high quality, very technical. You'll see a lot of well-known researchers posting there sometimes (whereas here you only see karpathy every once in a blue moon). I'm only a random guy who did ml research as an undergrad.

There are good reasons other than `it models the data well' to use a logit. Of course, the perspective that much like least-squares, logit regression does best when your data is generated from a logistic GLM is true, but not the only story.

In my opinion, the best justification for the logit is that it is an easily optimizable member of the family of surrogate losses: http://fa.bianp.net/blog/2014/surrogate-loss-functions-in-ma...

The papers cited in the article pretty cleanly explain why the logit and other surrogate losses are the 'natural choice' for classification. This also explains why other non-generative models like the SVM perform similarly well to the logit in practice.

Right, log(p/(1-p)) is the logit function; its inverse is the logistic function.
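A quick Python sketch of the two functions, in case it helps keep the names straight. The function names are just illustrative; the round-trip check shows they are inverses of each other.

```python
import math

def logit(p):
    """Log-odds: maps a probability in (0, 1) to the whole real line."""
    return math.log(p / (1.0 - p))

def logistic(x):
    """Inverse of logit: maps any real number back to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

for p in (0.1, 0.5, 0.9):
    round_trip = logistic(logit(p))
    print(p, logit(p), round_trip)  # round_trip should match p
```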

Here's another motivation for why logits are practically useful: Bayes' Theorem!

You've probably seen Bayes' theorem in the depressing form:

    Pr(C | E) = Pr(E | C) Pr(C) / Pr(E).
Here E is evidence, C is a claim, and Pr() is the probability function mapping sets of possibilities to their probability. It says that the probability of a claim, given some evidence, turns out to be the probability of the evidence given the claim, times the prior probability of the claim, divided by the prior probability of the evidence.

If you've gotten a little further with it, you may have seen that (if you work out a lot of mathematics) you can tack on a | O to all of these probabilities, where the O stands for old-evidence about the world, so that you can make this expression recursive:

        Pr(C | E O) = Pr(E | C O) Pr(C | O) / Pr(E | O).
(In case it's not totally clear, when I juxtapose sets I intend to mean their set-intersection, which is usually analogized to a sort of "set product". So `E O` above means "both E and O": given both the old evidence O and the new evidence E.)

And if you've gotten a bit further you may have seen that in practice we often will use the fact that Pr(E | O) = Pr(E | C O) Pr(C | O) + Pr(E | !C O) Pr(!C | O), where ! is of course set-complement or the prefix "not-". This allows us to write:

                        Pr(E | C O) Pr(C | O)
    Pr(C | E O) = ----------------------------------
                  Pr(E | C O) Pr(C | O) + Pr(E | !C O) [1 − Pr(C | O)]
Same theorem, but a lot more complicated because it's a lot more real-world!

But we can actually rewrite this last expression using the odds ratio,

    Odds(A) = Pr(A) / [1 - Pr(A)]
in a much simpler form as:

                      Pr(E | C O)
    Odds(C | E O) = --------------- * Odds(C | O)
                      Pr(E | !C O)
The recursive formula therefore looks like a chain of multiplications of factors due to newer and newer evidence.

On the assumption that a bunch of pieces of evidence are independent, each factor decouples from the others and just becomes some factor W_i. The logit expression looks simply like:

    log(Odds(C | E1 E2 ... En)) = log(Odds(C)) + log(W1) + log(W2) + ... + log(Wn).
The logistic regression is therefore assuming a linear approximation: each continuous factor F_i that you're studying corresponds to some independent piece of evidence E_i which contributes some factor W_i = Pr(E_i | C) / Pr(E_i | !C), and we're assuming that log(W_i), the logarithm of this conditional-probability ratio, can be roughly approximated as a_i + b_i F_i. We collect log(Odds(C)) + Sum_i a_i together as some leading coefficient A and then we get our fit as n+1 variables (A, {b_i}).
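To make the odds form concrete, here is a small Python sketch comparing the expanded Bayes formula with the log-odds sum. The prior and the likelihood values are made-up numbers, and the evidence is assumed independent given the claim, as in the argument above.

```python
import math

def odds(p):
    return p / (1.0 - p)

def prob_from_odds(o):
    return o / (1.0 + o)

def posterior_direct(prior, p_e_given_c, p_e_given_not_c):
    """Expanded Bayes' theorem for a single piece of evidence."""
    numerator = p_e_given_c * prior
    return numerator / (numerator + p_e_given_not_c * (1.0 - prior))

def posterior_log_odds(prior, likelihood_pairs):
    """Same update in log-odds space: add one log likelihood ratio per evidence."""
    total = math.log(odds(prior))
    for p_e_given_c, p_e_given_not_c in likelihood_pairs:
        total += math.log(p_e_given_c / p_e_given_not_c)
    return prob_from_odds(math.exp(total))

# Hypothetical numbers: (Pr(E_i | C), Pr(E_i | !C)) for two pieces of evidence.
evidence = [(0.9, 0.1), (0.7, 0.4)]
prior = 0.01

sequential = prior
for p_c, p_not_c in evidence:
    sequential = posterior_direct(sequential, p_c, p_not_c)

print(sequential, posterior_log_odds(prior, evidence))  # should agree
```

Applying the expanded formula once per piece of evidence gives the same answer as summing log likelihood ratios and converting back at the end.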

I'm sorry, that seems like the most backwards explanation of logistic regression I've ever seen. It uses the sigmoid function because it's logistic regression. The statistical model is a Bernoulli distribution. Write down the likelihood, maximize it, and you get the sigmoid.

The statistical model is a Bernoulli distribution where the expected value p is parameterized by the logistic sigmoid function applied to a linear model of the input variables.

The choice of the sigmoid function does not stem from applying MLE to the model: instead it is an a priori and arbitrary modeling decision that is made before even starting to think about the estimation of the parameters of the model.

The intuitive justification with respect to the choice of this link function given in this blog post seems quite standard to me. This text-book chapter gives a similar intuition: http://www.stat.cmu.edu/~cshalizi/uADA/12/lectures/ch12.pdf (section 12.2).

Also in practice we don't use pure MLE but l2-penalized (or l1-penalized) MLE to fit a logistic regression model, as otherwise it might be very sensitive to noisy data if the number of features is very large (compared to the number of training samples).
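As a rough illustration of the penalized fit, here is a minimal pure-Python sketch of l2-penalized maximum likelihood for a one-feature model, fitted by plain gradient descent. The toy data, the penalty strength `lam`, the learning rate and the step count are all assumptions chosen for the example; real libraries use better optimizers.

```python
import math

def sigmoid(z):
    """Numerically stable logistic sigmoid."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def penalized_nll(w, b, xs, ys, lam):
    """Bernoulli negative log-likelihood plus an l2 penalty on the slope w."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total + 0.5 * lam * w * w

def fit(xs, ys, lam=0.1, lr=0.1, steps=2000):
    """Gradient descent on the penalized objective (averaged over samples)."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_w, grad_b = lam * w, 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y  # derivative of the NLL w.r.t. the linear predictor
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b

# Toy, perfectly separable data: without a penalty the slope would grow without bound.
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 0, 1, 1, 1]
w, b = fit(xs, ys)
print(w, b)
```

On separable data like this the unpenalized likelihood has no finite maximizer, which is one concrete way to see why the penalty matters.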

Yep, you're right. I was playing too fast and loose there. The Bernoulli likelihood definitely suggests the odds as a natural space, but it's not required.

Thanks for pointing out the similarity in intuitive justification in Shalizi's text, I have that book on my to-read list already, fun to see I wasn't far off.

Yes, it uses a sigmoid function because it's logistic regression and therefore you are using the inverse of the logistic function, the sigmoid, as the notebook explains. But I think it's worth running through that and exploring why it's useful to use a logistic function in the first place (maps linear combo to (-1, 1) range). It was useful to me anyways, perhaps obvious to others.

Thanks for pointing out the relation to the Bernoulli distribution.

The logistic function maps to (0,1). The hyperbolic tangent tanh(x)=2*logistic(2x)-1 maps to (-1,1).
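The tanh identity is easy to confirm numerically; a short Python sketch (function names are illustrative):

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh_via_logistic(x):
    """tanh written as a rescaled, shifted logistic: 2*logistic(2x) - 1."""
    return 2.0 * logistic(2.0 * x) - 1.0

for x in (-3.0, -0.5, 0.0, 0.5, 3.0):
    print(x, math.tanh(x), tanh_via_logistic(x))  # the two columns should match
```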

If you look at the history of statistics, you'll find plenty of practically motivated ad-hoc problem solving like "hmm, how do we get R^n to map onto [0,1]?". Log odds were used long before logistic regression and GLMs came along, and it certainly doesn't sound backwards to me to try and explain how people came to use a technique that initially looks very arbitrary (why not model probabilities directly?).

I think this is a good example of the problems with autodidactism. For all its flaws, structured education is what makes common knowledge common. When you study on your own, you don't know what you don't know, and there's no one there to point out your obvious-with-the-right-knowledge lapses.

To me it's an example of what Hacker News offers people interested in learning [the human, not machine type]. This very article shows how it is a place where there are no long term consequences equivalent to grades: just a wee bit of standard internet asshattery and those complaints about the youth of today that have been in the Western canon since Socrates was drinking hemlock.

You can actually do it in either direction, going from the link function and reasoning back to the distribution it implies, or starting at the distribution and seeing what maximizes the likelihood.

The "logistic regression" has been invented / reinvented independently with different derivations 10 times or so.

The reverse nature might be due to approaching the problem from the practical application of machine learning technology rather than from the higher-level abstractions of mathematics/statistics.

Which is to say that the starting point is that of someone figuring out what is going on behind the scenes of a working software package rather than building up from first principles.

Curious if there is a link to a better explanation from a similar starting point.

The real explanation comes from the theory of generalized linear models: https://en.wikipedia.org/wiki/Generalized_linear_model

My apologies for not being clear, I was wondering if there was something closer toward the Randall Monroe [edit] "thing explainer" end of the spectrum. The reason I was wondering is because machine learning is becoming something that people incorporate into a project as a library. The analogy I would draw is to something like TCP/IP where there are abstractions over routing and congestion control and where routing and congestion control are abstractions over the mathematics of scheduling and graph theory.

Maybe a better analogy might be cryptography where practical implications of entropy pool design are relatively esoteric despite the vastly less accessible mathematical nature of reliable one way encryption algorithms.

Randall Monroe?

The training set for "randall" sigmoid in my brain becomes apparent.

Sigmoid functions also come up in neural nets. But that's because you need something that's differentiable. If you just used a step function, backpropagation wouldn't work, because small changes in the tuning parameters wouldn't do anything. You can think of a sigmoid as a step function pushed through a low-pass filter.

It would be interesting to try step functions run through low-pass filters with different cutoffs to see which work best for logistic regression. Anything that produces a monotonically increasing value ought to converge. The only question is how fast.

I was going to chime in about neural nets too, since I thought that was what the article was going to mention instead of just defining it for logistic regression. It is used because it is continuous and differentiable as you say, and it keeps the bounds neat between 0 and 1, although most modern deep neural nets use a ReLU, or rectifier function of some sort. The softplus is a smooth approximation of the ReLU that has the nice property of its derivative being the sigmoid function.
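The softplus/sigmoid relationship can be checked with a finite difference. A minimal Python sketch, taking softplus(x) = log(1 + e^x) as the definition and an arbitrarily chosen step size `h`:

```python
import math

def softplus(x):
    """Smooth rectifier: log(1 + exp(x))."""
    return math.log1p(math.exp(x))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def numerical_derivative(f, x, h=1e-6):
    """Central finite difference; h is an arbitrary small step."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

for x in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(x, numerical_derivative(softplus, x), sigmoid(x))  # should be close
```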

To expand on what gcp is saying, that's not true. Rectified linear units do better than sigmoids but are not differentiable (but are sub-differentiable), and ReLUs have the problem where small changes don't do anything (in the negative regime).

>backpropagation wouldn't work, because small changes in the tuning parameters wouldn't do anything

cough Rectified Linear Units cough

Your learning sabbatical seems very well thought out and I'm glad you're documenting your journey publicly so others can learn with you.

I actually started down a similar path to self-learn AI, but ended up going the conventional Masters degree route. I even set up a skeleton of a blog to document the journey.

The home page has some courses you may want to add to your resource list. I've taken quite a few courses on Udacity and have enjoyed almost all of them.


Good luck on your adventure!

Thanks! That graphic at http://cole-maclean.github.io/ is pretty neat, I take it that was your WIP on your self-study before you opted for the masters program? Good luck to you as well.

There's lots of preparatory ML reading lists, e.g. metacademy.org roadmap, Goodfellow et al. DL book, Shalev-Shwartz/Ben-David ML text (both books freely available content), review materials on Stanford course syllabuses by Socher, Ng and Karpathy.

Usual recs are Strang or Axler Linear Algebra, Ross or Tsitsiklis for Probability, Spivak Calculus, Boyd Convex Optimization, and other folks will recommend real analysis, diff eq's, topology, etc.

There is also the interesting interpretation of logistic regression as a latent choice/utility model, where unobserved variation in utility is modeled using the logistic distribution. In this interpretation, the sigmoid function arises as the CDF of the logistic distribution.

This interpretation is also nice as it shows the natural relationship between logistic and probit regression. Probit regression arises when the unobserved utility is modeled with a Gaussian distribution.

In this interpretation we expect decision-makers to be utility-maximizers (for example, deciding whether or not I buy Thing X), so we expect Gumbel (an extreme value distribution) errors. But we're looking for Utility(1) minus Utility(0) (buying vs. not buying) and the difference between two Gumbels is a logistic.

More detail: http://fisher.osu.edu/~schroeder.9/AMIS900/ch5.pdf
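The Gumbel claim can be sanity-checked by simulation. Below is a rough Python sketch that draws pairs of independent standard Gumbel variates, takes their difference, and compares the empirical CDF with the logistic sigmoid. The sample size and seed are arbitrary choices, so expect agreement only to a couple of decimal places.

```python
import math
import random

def gumbel(rng):
    """Draw one standard Gumbel variate by inverse-CDF sampling."""
    u = rng.random()
    while u == 0.0:  # random() can return exactly 0.0, which would hit log(0)
        u = rng.random()
    return -math.log(-math.log(u))

def logistic_cdf(x):
    return 1.0 / (1.0 + math.exp(-x))

def empirical_cdf_of_difference(x, n=100_000, seed=0):
    """Fraction of simulated Utility(1) - Utility(0) noise differences that are <= x."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n) if gumbel(rng) - gumbel(rng) <= x)
    return hits / n

for x in (-2.0, 0.0, 1.0):
    print(x, empirical_cdf_of_difference(x), logistic_cdf(x))
```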

Ah, right you are! Thanks for correcting my mistake :)

Their answer is pretty much 'because it's based on the log-odds', which to me is still only very mild motivation.

There are other non-linearities which people use to map onto (0, 1), for example probit regression [0] uses the Normal CDF. In fact you can use the CDF of any distribution supported on the whole real line, and the sigmoid is an example of this -- it's the CDF of a standard logistic distribution [1].

There's a nice interpretation for this using an extra latent variable: for probit regression, you take your linear predictor, add a standard normal noise term, and the response is determined by the sign of the result. For logistic regression, same thing except make it a standard logistic instead.

This then extends nicely to ordinal regression too.

[0] https://en.wikipedia.org/wiki/Probit_model

[1] https://en.wikipedia.org/wiki/Logistic_distribution

There are other nice properties. For example, because the logit link is canonical for the binomial GLM, inference about unknown parameters using it is based on sufficient statistics.

It's certainly not the only option though, and not always the best fit.

Ah yep, I forgot it's the canonical link. That's more of a small computational convenience though, right, at least when fitting a straightforward GLM -- it should be very cheap to fit regardless.

I suppose the logistic having heavier tails than the normal is probably the main consideration in motivating one or the other as the better model for a given situation.

Logistic, being heavier-tailed, is potentially more robust to outliers. In terms of binary data, that means it might be a better choice in cases where an unexpected outcome is possible even in the most clear-cut cases. Probit regression, with its lighter normal tails, might be a better fit in cases where the response is expected to be pretty much deterministic in clear-cut cases, and where quite severe inferences can be drawn from unexpected outcomes in those cases. Sound fair?
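To put rough numbers on the tail comparison, here is a small Python sketch. It rescales the logistic distribution to unit variance (scale = sqrt(3)/pi) so the comparison with the standard normal is like-for-like; that rescaling is my assumption, not something stated above.

```python
import math

def normal_tail(x):
    """P(Z > x) for a standard normal Z."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

# Scale that gives the logistic distribution variance 1 (its variance is scale^2 * pi^2 / 3).
LOGISTIC_SCALE = math.sqrt(3.0) / math.pi

def logistic_tail(x, scale=LOGISTIC_SCALE):
    """P(L > x) for a zero-mean logistic L with the given scale."""
    return 1.0 / (1.0 + math.exp(x / scale))

for x in (3.0, 4.0, 5.0):
    print(x, normal_tail(x), logistic_tail(x))  # the logistic tail is the larger one
```

Far from the center the logistic tail decays exponentially while the normal tail decays like exp(-x^2/2), which is the sense in which the logistic is heavier-tailed.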

Is there a natural justification for the logistic distribution though?

See the other replies above, but: the logistic has heavier tails than the normal, so might do better in cases where we need robustness, where unexpected outcomes remain possible even in cases where the linear predictor is relatively big, and we want to avoid drawing extreme inferences from them.

Probit might lead to more efficient inferences in cases where the mechanism is known to become deterministic relatively quickly as the linear predictor gets big.

You could go further in either direction too (more or less robust) by using other link functions.

y = a0 + Σ(ai×xi)

Use linear regression when 1 unit increase in x increases y by a.

Use logistic regression when 1 unit increase in x increases log odds by a.

Use Poisson regression when 1 unit increase in x multiplies y by e^a (see also negative binomial regression)

Important that these are on average.
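A tiny Python sketch of the three readings of the coefficient, using made-up values for a0 and a and a single predictor x. Each function returns the model's mean response, which is the "on average" part.

```python
import math

A0, A = -1.0, 0.7  # hypothetical intercept and slope

def linear_mean(x):
    """Linear regression: the mean of y is the linear predictor itself."""
    return A0 + A * x

def logistic_mean(x):
    """Logistic regression: the mean of y is a probability; the log odds are linear."""
    return 1.0 / (1.0 + math.exp(-(A0 + A * x)))

def poisson_mean(x):
    """Poisson regression with log link: the log of the mean is linear."""
    return math.exp(A0 + A * x)

def log_odds(p):
    return math.log(p / (1.0 - p))

x = 2.0
print(linear_mean(x + 1) - linear_mean(x))                        # additive change in the mean: a
print(log_odds(logistic_mean(x + 1)) - log_odds(logistic_mean(x)))  # additive change in log odds: a
print(poisson_mean(x + 1) / poisson_mean(x))                      # multiplicative change: e^a
```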


Binomial and Poisson regression can both be performed with non-canonical link functions, which sometimes provide a better fit or more interpretable results (see: canonical link for gamma regression).

There are much better motivations for log-odds and the logistic function than just ensuring the predicted values are bounded probabilities.

If you want to do bayesian updating of your model upon seeing new data, and you want to express this as a linear function, a sum of log-odds values is the way to do it.

The best way to understand this is to learn about "naive bayes" first. If you start with naive bayes, but replace the values that are based on observed frequencies with values that you can just optimize for your objective, you get logistic regression.
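Here is a minimal Python sketch of the naive Bayes connection for binary features. The prior and the per-feature probabilities are made up; the point is only that the naive Bayes posterior can be rewritten exactly as a sigmoid of a linear function of the features, which is the functional form logistic regression then optimizes directly.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical naive Bayes parameters for three binary features.
PRIOR = 0.3                       # Pr(C)
P_GIVEN_C = [0.8, 0.6, 0.1]       # Pr(x_i = 1 | C)
P_GIVEN_NOT_C = [0.3, 0.5, 0.4]   # Pr(x_i = 1 | !C)

def naive_bayes_posterior(x):
    """Pr(C | x) computed by multiplying per-feature likelihoods."""
    like_c, like_not_c = PRIOR, 1.0 - PRIOR
    for xi, p, q in zip(x, P_GIVEN_C, P_GIVEN_NOT_C):
        like_c *= p if xi else 1.0 - p
        like_not_c *= q if xi else 1.0 - q
    return like_c / (like_c + like_not_c)

# The same model rewritten as bias + sum_i w_i * x_i in log-odds space.
WEIGHTS = [math.log(p / q) - math.log((1.0 - p) / (1.0 - q))
           for p, q in zip(P_GIVEN_C, P_GIVEN_NOT_C)]
BIAS = math.log(PRIOR / (1.0 - PRIOR)) + sum(
    math.log((1.0 - p) / (1.0 - q)) for p, q in zip(P_GIVEN_C, P_GIVEN_NOT_C))

def logistic_form_posterior(x):
    return sigmoid(BIAS + sum(w * xi for w, xi in zip(WEIGHTS, x)))

x = (1, 0, 1)
print(naive_bayes_posterior(x), logistic_form_posterior(x))  # should agree
```

Logistic regression keeps this form but fits the weights and bias to the objective instead of deriving them from observed frequencies.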

What regression do you use when modeling total views collected by a YouTube video over time?

Probably some sort of growth model.

Agreed. This was painful to read until I clicked "About" and realized the author was on some sort of... self-exploration mission.

The need to publish this stuff is debatable. I don't think promoting it -- without clear caveats -- is particularly helpful. Self-submitting it to HN borders on naive arrogance.

Frankly you don't know what you don't know and the amount of cultural damage you can do in hopes of finding the one person who'll risk downvotes to say "nope, that's sorta wrong" doesn't seem like a great use of everyone's time. I'd prefer we reserve that for bigger, broader, riskier, innovative concepts. Not e^-x.

We detached this subthread from https://news.ycombinator.com/item?id=11713289 and marked it off-topic.

For me, one of the great things about Hacker News is that every person on the site is a newcomer to some of the topics presented. Perhaps more importantly, for any technical topic there is a high probability that there is a user for whom the article is so basic as to be technically uninformative...I mean there is a reasonable probability that one or more of the people who use HN actually rode in Feynman's van at some point.

To me, the risk to Hacker News is that comments inconsistent with HN's ethos of civility dissuade people from the vulnerability that comes when one posts original work. And I'd prefer people put a bit more effort into writing thoughtful and reflective comments.

Agreed. However I draw the line at obvious amateurs submitting their own articles. I fail to see why we need to celebrate every dope making a lounge chair out of packing crates.

Coincidentally I've driven Feynman's van. In fact it's the quality of Feynman's highly accessible writing that has me so disappointed in half-assed articles such as this one.

A young, knowledge-hungry, enthusiastic crowd such as HN deserves BETTER input, not junk like this. Surely we can retain the positivity while culling the crap.

In my opinion, comments lacking civility and passing relevance to the discussion at hand trump technical error as a bane of Hacker News. Charitably, I realize that some people might be newcomers to the relevant skills just as others are to technical knowledge.

My observation is that community moderation seems to be working for one of those cases but not the other. Therefore an increase of meta-discussion on goofy posts might be expected.

If tpacek downvotes a crypto post it probably shouldn't sit on the homepage all day. But that's not the structure here.

Meta discussion about the quality of a post is not great, but rude dismissive meta discussion about how shit a post is is definitely suboptimal.

Better would be a post that explains what's wrong with the post, or a better submission.

So instead of calling someone's work "half-assed", how about cutting your elitist crap and putting it in larger context with good explanation, like other commenters did? Then you would (1) help others to learn something new and (2) spare OP's feelings. Both outcomes are good and highly desirable. You achieved the opposite.

I, for example, haven't driven Feynman's van and my math is quite rusty, but I sure enjoyed the post and the following discussion.

If the point of Hacker News is to be a sewing circle where we slowly reveal first principles of introductory textbooks, after someone makes the top of the homepage with a fundamentally wacky and incorrect analysis -- just so you can enjoy the comments -- then everything you propose makes perfect sense. Same process also works great in kindergarten classrooms. But at some point you have to stop rewarding people for drawing a yellow sun in crayon with a happy face on it or they never grow up.

Last week's idiotic neural net post involved data compression. It's now a top Google result atop research papers from the people who actually figured it out 30 years ago. I'm eager to see next week's thriller... see ya in the comments. If I get in early enough I'll fix it with love; if I get in late again I may or may not take the time to question why we keep doing things this way. And if I do a half-assed job with either, I'll certainly appreciate people telling me so. Might not work for everyone, but seems to open doors for me--including driver's side on the van.

Please stop making the thread worse. Other people already pointed out problems with the article; there was no need to pile on by calling names—which the HN guidelines ask you not to do—let alone take the thread so tediously off topic.

Even assuming you're 100% right, hauling in barrels of garbage and dumping them in the commons is not a good reaction.

If there are better articles, how about submitting some and hopefully they'll be upvoted more.
