
Visualizing the Math Behind Logistic Regression and Newton's Method - seanharr11
http://thelaziestprogrammer.com/sharrington/math-of-machine-learning/solving-logreg-newtons-method
======
davedx
What is the difference between Newton's Method and Gradient Descent?

Edit: Found an answer: [https://www.quora.com/In-optimization-why-is-Newtons-method-...](https://www.quora.com/In-optimization-why-is-Newtons-method-much-faster-than-gradient-descent)

~~~
rm999
The calculus answer is that gradient descent approximates the shape of your
optimization space with the first-order Taylor expansion at your current
point (first-derivative information), while Newton's method approximates it
with the second-order Taylor expansion (first- and second-derivative
information).
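
In symbols (a sketch of the standard update rules for minimizing a function
$f$; the step size $\alpha$ is a tuning knob, not from the article):

    $x_{k+1} = x_k - \alpha \, \nabla f(x_k)$             (gradient descent)

    $x_{k+1} = x_k - H_f(x_k)^{-1} \, \nabla f(x_k)$      (Newton's method)

where $H_f$ is the Hessian, the matrix of second derivatives.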

The geometric answer is that gradient descent approximates the shape of your
optimization space around the current point with a plane tangent to that
point, whereas Newton's method approximates it with a quadratic "bowl" using
the local curvature. In gradient descent you have to take short steps down
the path, because you know your approximation cannot be globally true (you
will not always be able to lower your error by moving in the same direction,
if you assume there is some minimum error). In Newton's method you can assume
that your bowl approximation is correct and go straight to the bottom of that
bowl. If your space really does look like a bowl, you will get to the bottom
super quickly.

The ELI5 answer: you are a ship captain looking for the lowest point in a part
of the ocean. You cannot afford to send a submersible down, so your approach
is to send an anchor down to check the depth and take a panoramic photo a few
feet around the anchor. After analyzing the results of each anchor drop you
can move your ship anywhere and try again. In gradient descent you look for
the lowest point on the edge of the camera's vision and move in that
direction. You can move as far as you'd like, but you will not go too far
because you know the drop will not go on forever. In Newton's method you will
look at the curvature in the pano and extrapolate it out assuming it is a
"bowl" shape. For example maybe your anchor is on an underwater J-shaped cliff
getting less steep as it goes north, so you map it out as a U shape that
starts going up again if you go a mile north. So, you move your ship a mile
north and try again there. In gradient descent you would have moved maybe 100
feet north then checked again. Newton's method can intuitively get you to the
low point in very few anchor readings, but crunching the numbers of the bowl
can be very time-consuming (this becomes relevant in higher dimensions). In
this analogy, the long and lat coordinates in the ocean are your two
parameters to be optimized, and the height of the ocean floor is the error in
your optimization. You want to find the parameters that give you the lowest
error.
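
If it helps to see it run, here's a minimal Python sketch of both updates on
a made-up quadratic "ocean floor" (the matrix A, the starting point, and the
step size are all invented for illustration):

    import numpy as np

    # Toy "ocean floor": a quadratic bowl f(p) = 0.5 * p^T A p with its
    # minimum at (0, 0). The two entries of p play the role of longitude
    # and latitude; f(p) is the depth reading (the error to minimize).
    A = np.array([[3.0, 0.5],
                  [0.5, 1.0]])

    def grad(p):
        return A @ p   # gradient of f

    def hess(p):
        return A       # Hessian of f (constant, since f is quadratic)

    p_gd = np.array([5.0, 5.0])  # gradient descent's ship position
    p_nt = np.array([5.0, 5.0])  # Newton's method's ship position
    alpha = 0.1                  # step size: "move 100 feet, not a mile"

    for _ in range(20):
        p_gd = p_gd - alpha * grad(p_gd)                       # short step downhill
        p_nt = p_nt - np.linalg.solve(hess(p_nt), grad(p_nt))  # jump to bowl bottom

    print("gradient descent after 20 steps:", p_gd)
    print("Newton's method after 20 steps: ", p_nt)  # hits (0, 0) on step 1

On a true bowl, Newton's method lands on the minimum in a single step, while
gradient descent is still creeping toward it after 20.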

~~~
haskellandchill
Great explanation, and the ship example could make for a fun math game! I'm in
NYC also and interested in creating spaces for mathematics, will be at the
Museum of Mathematics hackathon next week. Drop me a line if you're interested
in chatting, sandy.vanderbleek@gmail.com

------
JadeNB
Since the author is reading, a few small typos, followed by one slightly more
substantial comment: 'simgoid' should be 'sigmoid' (S-shaped); `x y = log(x) +
log(y)` should be `log(x y) = log(x) + log(y)`; 'guarentee' should be
'guarantee'; 'recipricol' should be 'reciprocal'.

I would like to see some mention of the fact that the division by the gradient
is a meaningless, purely formal _motivation_ for the correct step (inverting
the Hessian) that follows.

~~~
seanharr11
Thanks for the feedback, a co-worker just jabbed me regarding the log
property mistake also...

As to the motivation for the correct step: can you point me to a resource that
explains this? Not sure I follow...

~~~
JadeNB
> As to the motivation for the correct step: can you point me to a resource
> that explains this? Not sure I follow...

You write an equation involving division by the gradient. This is an illegal
operation (one cannot divide by a vector), and your final recipe doesn't do
it. As far as I can tell, you are writing down the incorrect, illegally-
vector-inverting formula as motivation for the correct formula involving the
(inverse of the) Hessian. All I am suggesting is that you say explicitly
something like "Of course, this formula as written is not literally correct;
one cannot actually divide by a vector. The correct procedure is explained
below."

(Incidentally, speaking of inverses, another poster
([https://news.ycombinator.com/item?id=14881265](https://news.ycombinator.com/item?id=14881265))
has mentioned that it may be a bit confusing to speak of the inverse of a
matrix rather than the reciprocal, since (as I interpret that other poster's
point) the reciprocal of a matrix _is_ just its inverse. I might prefer to say
something like "We write $H_{\ell(\theta)}^{-1}\nabla\ell(\theta)$ rather than
$\frac{\nabla\ell(\theta)}{H_\ell(\theta)}$ to emphasise that we are inverting
a matrix, not a scalar, so that the order of multiplication matters.")
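
(To spell the correspondence out, sketching the standard derivation rather
than anything article-specific: in one dimension, Newton's method applied to
the equation $\ell'(\theta) = 0$ gives

    $\theta_{k+1} = \theta_k - \ell'(\theta_k) / \ell''(\theta_k)$,

and the $n$-dimensional analogue replaces $\ell'$ with the gradient
$\nabla\ell$ and division by $\ell''$ with multiplication by the inverse
Hessian:

    $\theta_{k+1} = \theta_k - H_\ell(\theta_k)^{-1} \nabla\ell(\theta_k)$.

No division by a vector ever occurs.)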

~~~
seanharr11
Ahh I got it. Understood, definitely worth clarifying, will update. Thanks.

------
theoh
Dialing up the complexity a bit from Newton's method, it would be interesting
to know whether there are now better explanations of the conjugate gradient
method online than this classic (or at least high-profile) intro:
[https://www.cs.cmu.edu/~quake-papers/painless-conjugate-grad...](https://www.cs.cmu.edu/~quake-papers/painless-conjugate-gradient.pdf)

------
mcphage
For the graphs of the home price / bathroom data set, what does the vertical
axis represent? I don't see it labeled or discussed anywhere.

~~~
seanharr11
I just took the index of the data point (i.e. 1, 250) as the y-axis with the
intention of "stretching out" the data set along the y-axis. Otherwise the
data would be illegibly compressed on a 1-D number line.

In retrospect, I could have better represented the data with 2 overlaid
histograms, but this (somewhat) captures the intent of showing that "more
expensive houses tend to have more than 2 bathrooms".
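
For what it's worth, a quick matplotlib sketch of the overlaid-histogram idea
(the bathroom counts and the price split below are made up, not the post's
actual data):

    import matplotlib.pyplot as plt

    # Made-up bathroom counts, split by an arbitrary price threshold.
    below_threshold = [1, 1, 1, 2, 2, 2, 2, 3]  # cheaper houses
    above_threshold = [2, 2, 3, 3, 3, 4, 4, 5]  # pricier houses

    bins = range(0, 7)
    plt.hist(below_threshold, bins=bins, alpha=0.5, label="below price threshold")
    plt.hist(above_threshold, bins=bins, alpha=0.5, label="above price threshold")
    plt.xlabel("number of bathrooms")
    plt.ylabel("number of houses")
    plt.legend()
    plt.show()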

~~~
joehan
I agree with you that 2 histograms would be better. Check out the grouped bar
chart at this link:

[http://nvd3.org/examples/multiBar.html](http://nvd3.org/examples/multiBar.html)

------
craigching
This is a really nice introduction to logistic regression, well done! My one
quibble with the OP is the jump into Newton's method. Maybe a derivation to
explain the five steps would help. Thanks!

~~~
seanharr11
Thank you. I erred on the side of caution to (try to) reduce information
overload, but I also didn't want to assume too much prior knowledge. Delicate
balance...

Maybe a linked post that dives deeper into the 5 steps of Newton's Method?
Would that be more approachable?

~~~
craigching
Yeah, that would work. Honestly, when I studied logistic regression, we used
gradient descent, so seeing it with Newton's method was the draw for me. I can
go find the resources myself of course, but since I was reading the article I
thought it would be convenient there :)

------
bigtoine123
I searched everywhere for the difference between Newton's Method and Gradient
Descent, but I couldn't find anything that useful. Can you suggest any
website/article where I can learn the difference?

~~~
seanharr11
The comment right below has 2 links:

Gradient Descent explored similarly to this post:
[http://thelaziestprogrammer.com/sharrington/math-of-machine-...](http://thelaziestprogrammer.com/sharrington/math-of-machine-learning/the-gradient-a-visual-descent)

The difference between the two: [https://www.quora.com/In-optimization-why-is-Newtons-method-...](https://www.quora.com/In-optimization-why-is-Newtons-method-much-faster-than-gradient-descent)

------
xU1ppskunDmy6oz
The author has far too little mathematical understanding to be teaching
anybody (that’s my impression at least). If you don’t understand Newton’s
method before reading this, you won’t understand it afterwards. “A method for
finding the roots of a polynomial”. Why polynomials? Does it work, always? Is
it fast? Why would following the tangent repeatedly be a good idea? “We take
the inverse instead of the reciprocal because it’s a matrix”. Not impressed.

~~~
seanharr11
I am a programmer trying to learn math, as are my intended audience members.

That said, I should include facts related to convergence, and maybe even speed
compared to SGD.

As to the reciprocal -> inverse generalization, do you have any resources you
could point me towards to better understand this?

Additionally, a concrete answer to "Why would following the tangent repeatedly
be a good idea?" has been hard to come by for me. I am able to visualize this,
but if you have resources that explain this well please share.

~~~
xU1ppskunDmy6oz
In general, it’s not a good idea. And in general, Newton’s method won’t
converge.

Newton’s method boils down to replacing your function by a first-order
approximation. For a differentiable function, in a small neighbourhood(!),
that’s a good approximation (by definition), though, and the zero of the model
function will be very close to the zero of the original function (if it lies
in that neighbourhood).
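
Spelled out (the standard root-finding picture, sketched): the model function
at the current iterate $x_k$ is the tangent line

    $m(x) = f(x_k) + f'(x_k)(x - x_k)$,

and solving $m(x) = 0$ for $x$ gives the familiar iteration

    $x_{k+1} = x_k - f(x_k) / f'(x_k)$.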

PS: I did not expect the poster and author to be the same person, otherwise I
would’ve phrased my criticism differently. A Show HN would have helped.

PPS: basically the whole reciprocal/inverse confusion only arises because you
start the multidimensional case from your iteration formula. If you go back
to its derivation, and start again from there, you can avoid that.
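
That is, for the optimization case you expand to second order (sketching the
standard derivation):

    $\ell(\theta + \delta) \approx \ell(\theta) + \nabla\ell(\theta)^{\top}\delta + \tfrac{1}{2}\,\delta^{\top} H_\ell(\theta)\,\delta$.

Setting the derivative with respect to $\delta$ to zero gives
$\nabla\ell(\theta) + H_\ell(\theta)\,\delta = 0$, hence the step
$\delta = -H_\ell(\theta)^{-1}\nabla\ell(\theta)$. The inverse Hessian appears
naturally, and no "division by a vector" is ever written down.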

~~~
craigching
> In general, it’s not a good idea. And in general, Newton’s method won’t
> converge.

Right, but this blog post isn't about the general case of using Newton's
method to find roots, it's about using Newton's method for solving logistic
regression for which it is perfectly suited, though there are better methods
as well, of course.
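
For the record, the reason it's well suited (sketching a standard fact, not
something from the post): for logistic regression the Hessian of the
log-likelihood is

    $H_\ell(\theta) = -X^{\top} S X$, with $S = \mathrm{diag}\big(\sigma_i (1 - \sigma_i)\big)$ and $\sigma_i = \sigma(x_i^{\top}\theta)$,

which is negative semi-definite for every $\theta$. So the log-likelihood is
concave, and Newton's method (iteratively reweighted least squares, in this
setting) has no spurious stationary points to get stuck on.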

