
The Matrix Calculus You Need for Deep Learning - prostoalex
http://explained.ai/matrix-calculus/index.html
======
hypersoar
Does anyone know of good resources for studying machine learning or data
science given a strong mathematical background? I'm transitioning careers from
pure math research into industry. I know very little about machine learning,
but I know the crap out of linear algebra and real analysis (and other, less
relevant fields). It'd be great to have some sources that leverage that
without assuming much prior CS knowledge.

~~~
thanatropism
There are many new methods cropping up that most people in the data science
hype train will be full-on unable to access, including methods on manifolds
(even kernel methods on manifolds) and algebraic-topological methods
(persistent homology) with enough maths to give Kagglers the screaming
meemies.

I'm using some of those for $(redacted, the plan is to make money). Don't
follow the crowd.

~~~
mjw
When I started out in ML I was really keen to learn about the most 'mathsy'
approaches out there.

I think, with hindsight, it's great to have a broad spectrum of methods
available to you, but if you focus too much on the hard-math end of the
spectrum just for the sake of an intellectual challenge, you can end up
fixated on an exotic solution in search of a problem while the rest of the
field moves on, rather than doing useful engineering that people care about.

Maybe you find a niche where something exotic really helps, maybe you don't --
maybe for research this is a risk worth taking. But just something to keep in
mind.

IMO: breadth is good. Mathematical maturity helps. If one sticks around, one
finds uses for interesting maths eventually, but it's not worth trying to
force it.

Another avenue for people who want to use some hardcore math: try to use it
to develop good theory about why the things that work well, work well. Not an
easy task either, by any means.

------
luk32
If someone likes a more lecture-style explanation, I can recommend
3blue1brown's material on YouTube. He explains things in a pretty good and
accessible way, imho.

I didn't learn artificial neural network stuff from there. I knew those
concepts, but I didn't know how the matrix formalism applied to them. So it
was really nice to understand why GPUs are good for this. Math-wise it was a
really nice watch.
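To make the "matrix formalism" concrete, here is a tiny numpy sketch (my own
example, not from the videos): a fully connected layer evaluated over a whole
batch of inputs collapses into a single matrix multiply, which is exactly the
kind of operation GPUs are built to do fast.

```python
import numpy as np

# One fully connected layer: hidden = ReLU(x @ W + b).
# Stacking 32 inputs into a matrix turns the whole batch
# into one big matrix multiply -- the GPU-friendly part.
rng = np.random.default_rng(0)
batch = rng.standard_normal((32, 784))      # 32 flattened 28x28 images
W = rng.standard_normal((784, 128)) * 0.01  # layer weights
b = np.zeros(128)                           # layer biases

hidden = np.maximum(0.0, batch @ W + b)     # activations for all 32 inputs at once
```

The shapes are arbitrary here; the point is that the per-neuron sums the
videos animate are just the rows and columns of one matrix product.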

~~~
tw1010
It's amazing what a rich-get-richer effect there is, in comment sections
around the web, for products and content that really manage to solve problems
in a high-quality way. (E.g. 3B1B.)

~~~
Twisol
I'm not really sure why you're being downvoted. You're right that 3B1B really
does manage to solve a problem in a high-quality way, and it's amazing how
much of an effect that really has on people, especially considering the
relatively niche topics that 3B1B goes into. (You'd think that his "Essence
of" series would be more popular, but the one-off problem analyses have
ridiculously more views in general.)

In some ways, it _is_ a "rich-get-richer" effect. But creators like 3B1B
expend a lot of time and resources to do what they do, and the word of mouth
he gets is an acknowledgement that the work he does is worth the money and
views we provide.

------
roel_v
So I have a question somewhat related to this that I never knew where, or
whom, to ask. (Well, actually I asked a few mathematicians at a university I
work with, but I couldn't understand their answers - they were almost as
impenetrable as the Wikipedia page - and some engineering scientists who I
thought would be more into 'applied math', but they didn't know. So I'm hoping
some data science people reading this will better understand where I'm coming
from and be able to explain at my primitive math-skills level.)

So, some time ago I contracted out writing some code for fitting a logistic
regression to a given set of observations. There were some specific
requirements, but I think I should have been able to piece something together
myself using mainstream LA libraries; some of them even hint at 'you could
fit a logistic regression using these functions', but give no complete
examples. I didn't understand it well enough, though, so I contracted it out.

The woman who ended up writing the code used a 'Hessian' matrix to do so (she
actually wrote two functions doing the same thing; one used this Hessian
approach - I think the idea was that it would be faster, but there wasn't a
lot of time and it never got tweaked enough to make a difference).

So my question - is there a layman's explanation for what a Hessian matrix is,
and how it applies to fitting a logistic model? Also (with an eye to the
future of my project), does it have applications for non-linear regressions?

Alternatively, are there any books where this is covered? I have most standard
stats/applied stats/operations research books, as well as a few like the No
Bullshit Guide to Linear Algebra, but none cover this specific issue - or even
how to fit a logistic regression at all on a practical level (so not just
'conceptually you do xyz; implementation is left as an exercise for the
reader').

~~~
TJSomething
Those Wikipedia pages are kind of awful for pedagogy, but they have the right
equations, so I won't cover those.

Say we have a curve that corresponds to how good of a fit your model is. We
want to try to find the maximum on that curve. However, calculating every
point of the curve is too expensive, so we want to minimize the number of
points we have to check. So, we start with a guess as to the highest point on
the curve and take the first and second derivatives of the curve at that
point. This gives us enough to fit a parabola that approximates the curve in
the neighborhood of our initial guess point. Then, it's pretty easy to solve
for the highest point on the parabola. That's our new guess. Repeat that a few
times until the guesses stop shifting much. If the curve is nicely shaped
(i.e. smooth everywhere, only has one maximum), the guesses will converge on
the highest point.

This is often faster than a similar method, gradient ascent, which relies on
taking only the first derivative. That yields a line in the vicinity of our
guess, and then we just move our guess a little bit up the line. This is
pretty slow, since it can't jump straight to a guess of the top, and if you
move too fast, you'll blow right past the maximum.
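In code, the one-dimensional version of the iteration described above looks
like this (a hypothetical example function, using Newton's update
x_new = x - f'(x)/f''(x), i.e. jumping to the vertex of the fitted parabola):

```python
def newton_maximize(f_prime, f_double_prime, x0, iters=20, tol=1e-10):
    """Repeatedly fit a parabola via f' and f'' and jump to its vertex."""
    x = x0
    for _ in range(iters):
        step = f_prime(x) / f_double_prime(x)
        x -= step                      # Newton's update
        if abs(step) < tol:            # guesses stopped shifting much
            break
    return x

# Example: maximize f(x) = -(x - 3)^2, whose peak is at x = 3.
x_star = newton_maximize(lambda x: -2.0 * (x - 3.0), lambda x: -2.0, x0=0.0)
# Since f is exactly a parabola here, a single jump lands on the maximum.
```

On a function that is only approximately parabolic, the same loop takes a few
iterations instead of one, which is the "repeat until the guesses stop
shifting" part.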

The Hessian matrix is the higher-dimensional equivalent of the second
derivative there, and the gradient is the equivalent of the first derivative.
For example, if we have a two-dimensional surface in 3D, the Hessian will be
2x2 and will capture the curvature of the approximating 3D paraboloid in the
vicinity of the guess. As you go up in dimensionality, the approximating
surfaces are called quadric hypersurfaces.

When you're fitting a logistic regression, your hypersurface is the logarithm
of the likelihood that your data fits the logistic curve with the parameters
at that point. The logarithm makes the hypersurface better behaved and makes
the calculus easier. You just need the gradient and the Hessian: evaluate
them at your initial guess, fit a quadric hypersurface there, pop up to its
top, repeat a few times, and you've got your model.

~~~
ghaff
Unfortunately, mathematics is one of the areas where Wikipedia is pretty awful
in general. The articles seem mostly written for people who already understand
the topic in question. Of course, you always have to assume _some_ knowledge
base, but the stereotypically jargon-filled Wikipedia approach is particularly
off-putting in this area.

~~~
Myrmornis
I understand what you’re trying to say, but Wikipedia is a fantastic resource
for mathematics. “Pretty awful” is not a correct choice of words. But yes,
much of it is written at beyond-undergrad-math level. And undergrad math is
already advanced! And no I’m not someone with a math PhD talking down! I’m
struggling through teaching myself undergrad math.

~~~
pvg
The only way "pretty awful" is incorrect is that it is too polite and
reserved. Reams upon reams of pages are written completely at odds with
Wikipedia's own style guidelines and with common-sense expectations of what
one might find in an encyclopedia. Unlike some famously dense mathematical
texts, Wikipedia's maths pages don't even come with any of the benefits of
brevity or focus. It's like a giant joke competition over who can describe
every trivial thing in the most abstract and abstruse way, except it got out
of hand and the participants forgot it was supposed to be a joke. MathWorld
and similar sites will help you much more with undergrad maths.

~~~
Myrmornis
We’re saying much the same thing; it’s just that I find your and GP’s use of
the absolute “pretty awful” to be hyperbolic and something of a loss of
perspective. Remember what we have here: a free, actively maintained,
accurate, comprehensive and advanced corpus of expository writing on
mathematics. The adjectives missing from that list are “intuition-rich” and
“helpful for undergraduates”. I do understand if you are disappointed with it.
As noted, I (undergrad level) don’t approach it with the expectation that it
will be my favorite reading on a topic.

~~~
pvg
_We’re saying much the same thing_

No.

 _something of a loss of perspective._

You'd have to provide some alternative perspective or argument that goes
beyond 'pretty awful sounds kinda mean'.

 _Remember what we have here: a free, actively maintained, accurate,
comprehensive and advanced corpus of expository writing on mathematics_

We already have a few of those. As I mentioned, MathWorld is far better at
this, and it's been around longer than Wikipedia.

------
Koncopd
The absolute best guide (for me) on these things is
[http://www.psi.toronto.edu/~andrew/papers/matrix_calculus_fo...](http://www.psi.toronto.edu/~andrew/papers/matrix_calculus_for_learning.pdf)
Also "Matrix Algebra" by Magnus and Abadir.

------
jonnydubowsky
This thread is a glowing reminder of how effective and friendly the HN
community can be when good will is extended in both the question and the
answer. Thank you to all who contributed; the comments have provided me (and
many others, I'm sure) with a clear and concise map of matrix calculus, full
of helpful resources. My 4th of July beach reading is now complete.

------
cimmanom
The explanatory style here is excellent. In three paragraphs, they explain
clearly what I was unable to get a good handle on in a semester of college
math (due to awful teaching styles and textbooks): namely, why partial
derivatives are calculated the way they are.

------
bag531
This is really useful. Matrix calculus was one of the hardest parts of my
machine learning course at university simply because it had never been covered
by any of the many math classes I'd taken.

------
andrepd
Too extensive, and not focused enough on the core concepts. Teaching
mathematics is about making the underlying structure as clear as possible.
That's a very hard thing to do effectively.

------
qiqitori
This book had a really gentle explanation of the required calculus:
[https://www.amazon.com/Make-Your-Own-Neural-Network-
ebook/dp...](https://www.amazon.com/Make-Your-Own-Neural-Network-
ebook/dp/B01EER4Z4G)

It also builds some really simple Python code to create a simple (non-"deep",
i.e. with just one hidden layer) neural network capable of recognizing
human-drawn digits with good accuracy.
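For flavor, here's a minimal numpy sketch of a network of that shape (my own
toy version, not the book's code), trained on XOR instead of digits so it
stays self-contained:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class TinyNet:
    """One hidden layer, sigmoid activations, plain gradient-descent backprop."""
    def __init__(self, n_in, n_hidden, n_out, lr=0.5, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.standard_normal((n_hidden, n_in)) * 0.5
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.standard_normal((n_out, n_hidden)) * 0.5
        self.b2 = np.zeros(n_out)
        self.lr = lr

    def forward(self, x):
        self.h = sigmoid(self.W1 @ x + self.b1)    # hidden activations
        return sigmoid(self.W2 @ self.h + self.b2)

    def train_step(self, x, target):
        o = self.forward(x)
        d_o = (o - target) * o * (1 - o)                 # output-layer delta
        d_h = (self.W2.T @ d_o) * self.h * (1 - self.h)  # backpropagated delta
        self.W2 -= self.lr * np.outer(d_o, self.h)
        self.b2 -= self.lr * d_o
        self.W1 -= self.lr * np.outer(d_h, x)
        self.b1 -= self.lr * d_h
        return float(((o - target) ** 2).sum())          # squared error

xor = [(np.array([a, b], float), np.array([float(a != b)]))
       for a in (0, 1) for b in (0, 1)]
net = TinyNet(2, 8, 1)
for _ in range(2000):                  # a few thousand passes over the data
    for x, t in xor:
        net.train_step(x, t)
```

Swapping the XOR data for flattened 28x28 images and widening the layers gets
you the digit recognizer the book builds; the matrix shapes are the only
thing that changes.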

------
sdan
Truly amazing work. I hoped for some nice drawings, but LaTeX did it!

------
calebh
There's a tool now that can do matrix calculus:
[http://www.matrixcalculus.org/](http://www.matrixcalculus.org/)

------
stablemap
Previous discussion:

[https://news.ycombinator.com/item?id=16267178](https://news.ycombinator.com/item?id=16267178)

~~~
dang
Hmm that definitely qualifies this one as a dupe, but the discussion is
unusually substantive so I guess we'll leave it up.

------
dm7
An accessible and concise refresher on the material.

