
The Sigmoid Function in Logistic Regression - krosaen
http://karlrosaen.com/ml/notebooks/logistic-regression-why-sigmoid/
======
pflats
Perhaps I was educated wrong, or perhaps I've forgotten almost everything from
my math degrees, but I can't wrap my head around this blog post, let alone how
it was upvoted so highly.

Asking why the logistic/sigmoid function is used in logistic regression is
like asking why the quadratic x^2 function is used in quadratic regression or
why The Beatles are the band whose recording is on the White Album.

A regression fits discrete data to a curve. A logistic regression fits
discrete data to the logistic curve. Asking where these functions come from is
interesting, but that's another question entirely.

I had always looked at the logistic function as having an easy source, the
ODE:

y' = y(1-y)

As for how that relates to probability (p' = p[1-p]) I leave as an exercise to
the reader.
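
For reference, separating variables recovers the logistic curve from that ODE:

    dy / (y(1-y)) = dx
    log(y / (1-y)) = x + C
    y = 1 / (1 + e^-(x+C))

which is the sigmoid up to a horizontal shift.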

>Note: the log of the odds function is often called "the logistic" function.

I was taught that this (the inverse of the logistic function) is the logit
function.

My brain is admittedly a little fuzzier than it used to be these days due to
medical issues, so if anyone can clarify for me, I'd appreciate it.

~~~
TDL
"let alone how it was upvoted so highly.

The reason I upvoted this post was the hope that comments like this would be
made. I'm still solidifying my grasp of ML/math topics, so when these posts
come up I always hope there are commenters with more experience/education in
these topics to further enlighten me.

~~~
argonaut
Yikes. Unfortunately this is why (IMO) I see a lot of mediocre ML articles
upvoted very highly on HN that have either very little discussion, or
generally poor quality discussion, especially around hot topics like neural
networks. The discussion I've seen here is honestly the exception to the norm
(probably due to the statistical nature of the topic). Personally this is one
of the few areas where Reddit does better.

~~~
mindcrime
_Unfortunately this is why (IMO) I see a lot of mediocre ML articles upvoted
very highly on HN that have either very little discussion, or generally poor
quality discussion, especially around hot topics like neural networks_

I think you're asking too much of upvotes. Since "upvote" and "save" are the
same thing on HN, a lot of people upvote anything that meets the bar of "based
on the title, it's plausible this is something I'd like to read eventually".
That's really about all you can expect of upvotes.

~~~
argonaut
I realize that's what many people do, but I don't agree with it because it's
an implicit signal of interest/worthiness, due to affecting rankings. It gives
people on HN the wrong perception of what is popular/worth learning and risks
leading autodidacts astray.

~~~
mindcrime
I actually agree and I would love to see HN adopt a separate "save" flag aside
from "upvote" (like Reddit does), but... it is what it is, for the moment
anyway.

------
oddthink
I'm sorry, that seems like the most backwards explanation of logistic
regression I've ever seen. It uses the sigmoid function because it's _logistic
regression_. The statistical model is a Bernoulli distribution. Write down the
likelihood, maximize it, and you get the sigmoid.

~~~
ogrisel
The statistical model is a Bernoulli distribution where the expected value p
is parameterized by the logistic sigmoid function applied to a linear model of
the input variables.

The choice of the sigmoid function does not stem from applying MLE to the
model: it is an a priori, and somewhat arbitrary, modeling decision, made
before one even starts to think about estimating the parameters of the model.

The intuitive justification with respect to the choice of this link function
given in this blog post seems quite standard to me. This text-book chapter
gives a similar intuition:
[http://www.stat.cmu.edu/~cshalizi/uADA/12/lectures/ch12.pdf](http://www.stat.cmu.edu/~cshalizi/uADA/12/lectures/ch12.pdf)
(section 12.2).

Also, in practice we don't use pure MLE but l2-penalized (or l1-penalized) MLE
to fit a logistic regression model, as otherwise it might be very sensitive to
noisy data if the number of features is very large (compared to the number of
training samples).
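
For the curious, a minimal sketch of what that looks like in practice with
scikit-learn, whose LogisticRegression applies l2 penalization by default (the
dataset here is synthetic and just for illustration):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    # Toy problem with many more features than samples, where an
    # unpenalized fit would be very sensitive to noise.
    X, y = make_classification(n_samples=100, n_features=500,
                               n_informative=10, random_state=0)

    # l2-penalized MLE; C is the inverse of the regularization strength,
    # so smaller C means stronger penalization.
    clf = LogisticRegression(penalty="l2", C=1.0)
    clf.fit(X, y)
    print(clf.score(X, y))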

~~~
oddthink
Yep, you're right. I was playing too fast and loose there. The Bernoulli
likelihood definitely suggests the odds as a natural space, but it's not
required.

------
Animats
Sigmoid functions also come up in neural nets. But that's because you need
something that's differentiable. If you just used a step function,
backpropagation wouldn't work, because small changes in the tuning parameters
wouldn't do anything. You can think of a sigmoid as a step function pushed
through a low-pass filter.

It would be interesting to try step functions run through low-pass filters
with different cutoffs to see which work best for logistic regression.
Anything that produces a monotonically increasing value ought to converge. The
only question is how fast.
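
A quick numerical sketch of that point (the grid of points is arbitrary):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def step(x):
        return (x > 0).astype(float)

    x = np.linspace(-4, 4, 9)

    # The sigmoid's derivative is nonzero everywhere, so a small change
    # in the input always nudges the output.
    print(sigmoid(x) * (1 - sigmoid(x)))

    # A central-difference "derivative" of the step function is zero at
    # every point except the jump at 0, so backpropagation gets no
    # gradient signal almost everywhere.
    eps = 1e-6
    print((step(x + eps) - step(x - eps)) / (2 * eps))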

~~~
eggy
I was going to chime in about neural nets too, since I thought that was what
the article was going to mention instead of just defining it for logistic
regression. It is used because it is continuous and differentiable as you say,
and it keeps the bounds neat between 0 and 1, although most modern deep neural
nets use an ReLU, or rectifier function of some sort. The softplus is an ReLU,
that has the nice property of its derivative being the sigmoid function.
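
For reference, that derivative relationship is a one-line computation:

    d/dx log(1 + e^x) = e^x / (1 + e^x) = 1 / (1 + e^-x) = sigmoid(x)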

------
ponderingHplus
Your learning sabbatical seems very well thought out and I'm glad you're
documenting your journey publicly so others can learn with you.

I actually started down a similar path to self-learn AI, but ended up going
the conventional Masters degree route. I even set up a skeleton of a blog to
document the journey.

The home page has some courses you may want to add to your resource list. I've
taken quite a few courses on Udacity and have enjoyed almost all of them.

[http://cole-maclean.github.io/About/](http://cole-maclean.github.io/About/)

Good luck on your adventure!

~~~
krosaen
Thanks! That graphic at
[http://cole-maclean.github.io/](http://cole-maclean.github.io/) is pretty
neat; I take it that was your WIP on your self-study before you opted for the
masters program? Good luck to you as well.

~~~
gtani
There are lots of preparatory ML reading lists, e.g. the metacademy.org
roadmap, the Goodfellow et al. DL book, the Shalev-Shwartz/Ben-David ML text
(both books have freely available content), and review materials on Stanford
course syllabuses by Socher, Ng and Karpathy.

Usual recs are Strang or Axler Linear Algebra, Ross or Tsitsiklis for
Probability, Spivak Calculus, Boyd Convex Optimization, and other folks will
recommend real analysis, diff eq's, topology, etc.

------
MidsizeBlowfish
There is also the interesting interpretation of logistic regression as a
latent choice/utility model, where unobserved variation in utility is modeled
using the logistic distribution. In this interpretation, the sigmoid function
arises as the CDF of the logistic distribution.

This interpretation is also nice as it shows the natural relationship between
logistic and probit regression. Probit regression arises when the unobserved
utility is modeled with a Gaussian distribution.
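
A quick Monte Carlo sketch of that interpretation (the value of the linear
predictor eta is arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)

    # Latent-variable view: the response is 1 when the linear predictor
    # plus standard logistic noise is positive.
    eta = 0.7
    noise = rng.logistic(size=1_000_000)
    p_simulated = np.mean(eta + noise > 0)

    # The implied P(y=1) is the logistic CDF, i.e. the sigmoid.
    p_sigmoid = 1.0 / (1.0 + np.exp(-eta))
    print(p_simulated, p_sigmoid)  # agree up to Monte Carlo error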

~~~
thanatropism
In this interpretation we expect decision-makers to be utility maximizers -
for example, whether or not I buy Thing X - so we expect Gumbel (an extreme
value distribution) errors. But we're looking at Utility(1) minus Utility(0)
(buying vs. not buying), and the difference between two Gumbels is a logistic.

More detail:
[http://fisher.osu.edu/~schroeder.9/AMIS900/ch5.pdf](http://fisher.osu.edu/~schroeder.9/AMIS900/ch5.pdf)
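
That difference-of-Gumbels fact is easy to check numerically (a sketch; the
sample size and test points are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)

    # Difference of two independent standard Gumbel draws...
    diff = rng.gumbel(size=1_000_000) - rng.gumbel(size=1_000_000)

    # ...follows a standard logistic distribution: compare the empirical
    # CDF with the logistic CDF at a few points.
    for t in (-2.0, 0.0, 1.5):
        print(t, np.mean(diff <= t), 1.0 / (1.0 + np.exp(-t)))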

~~~
MidsizeBlowfish
Ah, right you are! Thanks for correcting my mistake :)

------
mjw
Their answer is pretty much 'because it's based on the log-odds', which to me
is still only very mild motivation.

There are other non-linearities which people use to map onto (0, 1); for
example, probit regression uses the Normal CDF [0]. In fact you can use the
CDF of any distribution supported on the whole real line, and the sigmoid is
an example of this -- it's the CDF of a standard logistic distribution [1].

There's a nice interpretation for this using an extra latent variable: for
probit regression, you take your linear predictor, add a standard normal noise
term, and the response is determined by the sign of the result. For logistic
regression, same thing except make it a standard logistic instead.

This then extends nicely to ordinal regression too.

[0]
[https://en.wikipedia.org/wiki/Probit_model](https://en.wikipedia.org/wiki/Probit_model)
[1]
[https://en.wikipedia.org/wiki/Logistic_distribution](https://en.wikipedia.org/wiki/Logistic_distribution)
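
A minimal sketch of the probit version of that latent-variable story (eta is
an arbitrary illustrative value):

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)

    # Linear predictor plus standard normal noise; the response is
    # determined by the sign, so P(y=1) is the normal CDF.
    eta = 0.7
    noise = rng.normal(size=1_000_000)
    print(np.mean(eta + noise > 0), norm.cdf(eta))

    # Swapping rng.normal for rng.logistic (and norm.cdf for the
    # sigmoid) gives the logistic-regression version.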

~~~
gbrown
There are other nice properties. For example, because the logit link is
canonical for the binomial GLM, inference about unknown parameters using it is
based on sufficient statistics.

It's certainly not the only option though, and not always the best fit.

~~~
mjw
Ah yep, I forgot it's the canonical link. That's more of a small computational
convenience though, right, at least when fitting a straightforward GLM -- it
should be very cheap to fit regardless.

I suppose the logistic having heavier tails than the normal is probably the
main consideration in motivating one or the other as the better model for a
given situation.

The logistic, being heavier-tailed, is potentially more robust to outliers.
In terms of binary data, that means it might be a better choice in cases
where an unexpected outcome is possible even in the most clear-cut cases.
Probit regression, with its lighter normal tails, might be a better fit in
cases where the response is expected to be pretty much deterministic in clear-
cut cases, and where quite severe inferences can be drawn from unexpected
outcomes in those cases. Sound fair?
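
A small numerical illustration of that tail difference (the eta values are
arbitrary):

    from scipy.stats import logistic, norm

    # Probability of an "unexpected" outcome against a strong linear
    # predictor eta, under each latent-noise model.
    for eta in (2.0, 4.0, 6.0):
        print(eta, logistic.sf(eta), norm.sf(eta))

    # The logistic model leaves far more probability mass for surprises
    # in clear-cut cases, which is the robustness point above.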

------
nabla9
y = a0 + Σ(ai×xi)

Use linear regression when a 1-unit increase in xi increases y by ai.

Use logistic regression when a 1-unit increase in xi increases the log odds by
ai.

Use Poisson regression when a 1-unit increase in xi multiplies y by e^ai (see
also negative binomial regression)
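
A quick sketch of that Poisson interpretation with statsmodels (the
coefficients 0.5 and 0.3 are made up):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    x = rng.normal(size=10_000)
    X = sm.add_constant(x)

    # Simulate counts with a0 = 0.5, a1 = 0.3: a 1-unit increase in x
    # multiplies the expected count by e^0.3.
    y = rng.poisson(np.exp(0.5 + 0.3 * x))

    fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()
    print(fit.params)             # roughly [0.5, 0.3]
    print(np.exp(fit.params[1]))  # roughly e^0.3 ≈ 1.35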

~~~
nimish
It's important to note that these relationships hold on average.

[https://en.wikipedia.org/wiki/Generalized_linear_model](https://en.wikipedia.org/wiki/Generalized_linear_model)

------
moultano
There are much better motivations for log-odds and the logistic function than
just ensuring the predicted values are bounded probabilities.

If you want to do bayesian updating of your model upon seeing new data, and
you want to express this as a linear function, a sum of log-odds values is the
way to do it.

The best way to understand this is to learn about "naive bayes" first. If you
start with naive bayes, but replace the values that are based on observed
frequencies with values that you can just optimize for your objective, you get
logistic regression.
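
A minimal sketch of that correspondence for binary features (all the
probabilities below are made-up numbers):

    import numpy as np

    # Naive Bayes per-feature likelihoods, P(feature=1 | class)
    p_given_1 = np.array([0.8, 0.3, 0.6])
    p_given_0 = np.array([0.2, 0.5, 0.4])
    prior_log_odds = 0.0

    # Folding the frequencies into weights shows the posterior log odds
    # is linear in the features -- exactly the form logistic regression
    # optimizes directly.
    w = np.log(p_given_1 / p_given_0) - np.log((1 - p_given_1) / (1 - p_given_0))
    b = prior_log_odds + np.sum(np.log((1 - p_given_1) / (1 - p_given_0)))

    x = np.array([1, 0, 1])
    log_odds = b + w @ x
    print(log_odds, 1.0 / (1.0 + np.exp(-log_odds)))  # log odds, P(class 1)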

------
hammock
What regression do you use when modeling total views collected by a YouTube
video over time?

~~~
gbrown
Probably some sort of growth model.
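
For example, one simple option is a logistic growth curve (the sigmoid yet
again), fit here to hypothetical cumulative view counts (all numbers made up):

    import numpy as np
    from scipy.optimize import curve_fit

    # Hypothetical cumulative views on days 0..9
    t = np.arange(10, dtype=float)
    views = np.array([120, 400, 1100, 2900, 6200, 9800,
                      12500, 13900, 14500, 14800], dtype=float)

    # Logistic growth: carrying capacity K, growth rate r, midpoint t0
    def logistic_growth(t, K, r, t0):
        return K / (1.0 + np.exp(-r * (t - t0)))

    params, _ = curve_fit(logistic_growth, t, views, p0=(15000, 1.0, 4.0))
    print(params)  # fitted K, r, t0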

------
tacos
Agreed. This was painful to read until I clicked "About" and realized the
author was on some sort of... self-exploration mission.

The need to publish this stuff is debatable. I don't think promoting it --
without clear caveats -- is particularly helpful. Self-submitting it to HN
borders on naive arrogance.

Frankly you don't know what you don't know and the amount of cultural damage
you can do in hopes of finding the one person who'll risk downvotes to say
"nope, that's sorta wrong" doesn't seem like a great use of everyone's time.
I'd prefer we reserve that for bigger, broader, riskier, innovative concepts.
Not e^-x.

~~~
brudgers
For me, one of the great things about Hacker News is that every person on the
site is a newcomer to some of the topics presented. Perhaps more importantly,
for any technical topic there is a high probability that there is a user for
whom the article is so basic as to be technically uninformative...I mean there
is a reasonable probability that one or more of the people who use HN actually
rode in Feynman's van at some point.

To me, the risk to Hacker News is that comments inconsistent with HN's ethos
of civility dissuade people from the vulnerability that comes with posting
original work. And I'd prefer people put a bit more effort into writing
thoughtful and reflective comments.

~~~
tacos
Agreed. However I draw the line at obvious amateurs submitting their own
articles. I fail to see why we need to celebrate every dope making a lounge
chair out of packing crates.

Coincidentally I've _driven_ Feynman's van. In fact it's the quality of
Feynman's highly accessible writing that has me so disappointed in half-assed
articles such as this one.

A young, knowledge-hungry, enthusiastic crowd such as HN deserves BETTER
input, not junk like this. Surely we can retain the positivity while culling
the crap.

~~~
brudgers
In my opinion, comments lacking civility and even passing relevance to the
discussion at hand trump technical error as a bane of Hacker News. Charitably,
I realize that some people might be newcomers to the relevant skills, just as
others are to technical knowledge.

~~~
tacos
My observation is that community moderation seems to be working for one of
those cases but not the other. Therefore an increase in meta-discussion on
goofy posts might be expected.

If tptacek downvotes a crypto post it probably shouldn't sit on the homepage
all day. But that's not the structure here.

~~~
DanBC
Meta discussion about the quality of a post is not great, but rude, dismissive
meta discussion about how shit a post is, is definitely suboptimal.

Better would be a post that explains what's wrong with the post, or a better
submission.

