Matrix Calculus for Deep Learning (usfca.edu)
573 points by jph00 on Jan 30, 2018 | 81 comments



Jeremy here. Happy to answer any questions or comments you have.

But more importantly - I need to mention that Terence Parr did nearly all the work on this. He shared my passion for making something that anyone could read on any device to such an extent that he ended up creating a new tool for generating fast, mobile-friendly math-heavy texts: https://github.com/parrt/bookish . (We tried Katex, Mathjax, and pretty much everything else but nothing rendered everything properly).

I've never found anything that introduces the necessary matrix calculus for deep learning clearly, correctly, and accessibly - so I'm happy that this now exists.


Terence here. Jeremy's role was critical in shaping the direction and content of the article. Who better than he to describe the math needed for deep learning? :)


Glad to see you in this space, Terence! Been a long time since the traveling parser revival and beer tasting festival!


Question: when I learned vector calculus back in college, we used Marsden & Tromba as our textbook, where they equate the derivative of a function from R^n -> R^m with the Jacobian. Is matrix calculus the same thing, just in slightly different notation?


Typographic advice: the body text has very long lines in a desktop browser, which makes it a bit slow and tiring to read. I’d say the ideal is somewhere between 1/2 and 2/3 this length. I’d recommend keeping the same width on screen but bumping the font size up by 30%.

As an extra minor nit, italicizing functions like sin, etc. is also somewhat unconventional in mathematical typesetting.


I agree that the font should be bigger. I need to learn more CSS in order to switch between font sizes per platform. The font size of the text is easy to change, but all of the images were generated from LaTeX at a specific font size. I need to scale the in-line equation images as the font size bumps up.


The magic incantation here is probably a media query:

    @media (max-width: 768px) {
        p {
            font-size: 1rem;
        }
    }


We have to also adjust the image sizes for the in-line equations. That’s what I need to figure out :)


It's a variable name for a polymorphic function, Taylor series, Euler's formula, etc., depending on context. That's a significant difference from singular types, but being a variable name for an abstract concept is in principle no different from typesetting x. This goes neatly with "everything is an object reference" and might be more of a programmer's perspective.


In mathematical notation it gets confusing because several italic letters in a row are otherwise interpreted as separate variables.


Thanks for the advice!


This is great! My graduate advisor made a really great matrix calculus study sheet for me a long time ago that was absolutely invaluable in learning ML. (I mean it: this is great not just for DL but for all sorts of reasoning in ML.)


We originally had that generic ML target in mind but figured a DL bent would make it a wee bit more interesting.


Wondering if you still have that with you? Would be super useful for folks here.


What do you think about the index notation physicists use for tensor calculus?


I don't find it very accessible, myself - but I'm not a physicist, so the materials using or about that notation aren't aimed at me.

The only tensor notation I've been happy with is that used in J (http://www.jsoftware.com), which is simple, flexible, and concise.

There's also some nice-enough modern notation used in this excellent review: http://www.cs.cmu.edu/~christos/courses/826-resources/PAPERS...


The notation is very simple: you write tensor expressions with indices, and repeated indices are implicitly summed over. For example, matrix multiplication:

    A_ij = B_ik C_kj

Differentiating this with respect to variable l:

    ∂_l A_ij = ∂_l (B_ik C_kj) = (∂_l B_ik) C_kj + B_ik (∂_l C_kj)

By writing out the indices you can just use the rules for scalar derivatives.
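That summation convention maps directly onto NumPy's `einsum`; a minimal sketch (the shapes here are arbitrary, chosen only for illustration):

```python
import numpy as np

# A_ij = B_ik C_kj: the repeated index k is implicitly summed over
B = np.arange(6.0).reshape(2, 3)
C = np.arange(12.0).reshape(3, 4)

A = np.einsum('ik,kj->ij', B, C)

# Same result as ordinary matrix multiplication
assert np.allclose(A, B @ C)
```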


Oh in that case I'm just confused - that's what I know as Einstein notation. Modern physics papers seemed to use much more complex notation, but I probably just misunderstood.

If we're talking about Einstein notation, then I'm a fan - `np.einsum()` is often a great way to create fast tensor computations with minimal code.


Index notation isn't useful for calculation; it's useful for algebra, i.e. if you want a closed-form solution to tensor equations. My distinct impression is that in ML no one cares about that, because eventually everything gets a numerical treatment.


There are discrepancies between the HTML and pdf version.

Eg. In the first line in 'References'

HTML: "When looking for resources on the web, search for “matrix calculus” not “vector calculus.” "

Latex/pdf: "When looking for resources on the web, search for “elements” not “elements” "


Whoops, thanks. Translator error. I'll fix it.


Hi Jeremy, thanks for creating so many great learning resources. A fellow ML enthusiast and I are starting a deep learning meetup here in Phoenix next month. Do you have any advice for creating a welcoming environment for people to learn/teach? Thanks!


Probably better asking that at http://forums.fast.ai , since a lot of participants there have set up local meetups, and some of them have gotten pretty big. I haven't created a meetup myself - my focus has been on the online course and community, frankly.


In the article, you say the gradient of 2x + y^8 is [1, 7y^8]. Shouldn't it be [2, 7y^8]?


Now you are just replacing one error with another, you mean [2, 8y^7]. ;)
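For anyone double-checking, the corrected gradient [2, 8y^7] can be verified with a quick SymPy sketch:

```python
import sympy as sp

x, y = sp.symbols('x y')
g = 2*x + y**8

# Gradient: partial derivatives with respect to x and y
grad = [sp.diff(g, v) for v in (x, y)]
print(grad)  # [2, 8*y**7]
```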


Oops. Yeah, thanks.


ERRATA: In the start of Matrix Calculus section, shouldn't the first value in gradient of "g" be 2 instead of 1 i.e., [2,8y^7]


This is really great, thanks.


If you're looking at this with the intention of getting started in Deep Learning and feeling overwhelmed by the math then Andrew Ng offers a great course on Coursera that goes over all of the formulas needed to calculate the forward propagation, loss computation, backward propagation, and gradient descent. Highly recommend it for anyone interested in breaking into the field of machine learning.
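Those four steps can be sketched in a few lines of NumPy. This is an illustrative toy (1-D linear regression, not the course's code):

```python
import numpy as np

# Noise-free toy data: y = 3x + 1
rng = np.random.default_rng(0)
X = rng.normal(size=100)
y = 3.0 * X + 1.0

w, b, lr = 0.0, 0.0, 0.1
for _ in range(500):
    pred = w * X + b                       # forward propagation
    loss = np.mean((pred - y) ** 2)        # loss computation (MSE)
    grad_w = 2 * np.mean((pred - y) * X)   # backward propagation (chain rule)
    grad_b = 2 * np.mean(pred - y)
    w -= lr * grad_w                       # gradient descent update
    b -= lr * grad_b

# w ≈ 3.0, b ≈ 1.0 after convergence
```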


It is also all free on youtube: https://www.youtube.com/watch?v=UzxYlbK2c7E


They also have it all on Stanford's site with some other information and course materials.

https://see.stanford.edu/Course/CS229


This is great. But I find the advantage of Coursera is that it incorporates quizzes and programming homework into the lectures, reinforcing the learning. Also, the material is updated and more relevant to today's ML problems than his 2008 lectures on YouTube.




If you want to get started with machine learning, you should just start with Keras and do some theory later.


Thanks for this. I was taking Andrew Ng's course, but the way he glosses over the calculus and then expects the student to understand the implications at the end of the lecture was a turn-off, so I dropped it. I hated the feeling that I wasn't learning, just memorizing solutions.


You might prefer the approach at http://course.fast.ai - all the concepts are taught with code, instead of math, and understanding is developed by running experiments.


I did both and found fast.ai so much easier to understand for someone without a background in math, like me.


I've taken 3/5 of his Deep Learning MOOC and thought the calculus, at least its implementation, was very well explained. Were you dissatisfied with the lack of depth or lack of explanation? Just curious.


Fortunately there is a website now capable of doing matrix calculus! http://www.matrixcalculus.org

Mathematica doesn't seem to be able to do matrix calculus, which surprised me quite a bit.


Indeed, most tools surprisingly lack this ability. I was shocked when I needed to break my calculations down to a piece-wise form when doing matrix calc with SymPy.
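A minimal sketch of that piece-wise workaround in SymPy, using x'Ax as the example: fix the size, differentiate component by component, and compare against the known identity d(x'Ax)/dx = (A + A')x.

```python
import sympy as sp

x1, x2 = sp.symbols('x1 x2')
x = sp.Matrix([x1, x2])
A = sp.Matrix([[1, 2], [3, 4]])

f = (x.T * A * x)[0]                              # scalar x'Ax
grad = sp.Matrix([sp.diff(f, v) for v in (x1, x2)])

# Known matrix-calculus identity: gradient of x'Ax is (A + A')x
assert sp.expand(grad - (A + A.T) * x) == sp.zeros(2, 1)
```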


Mathematica can absolutely do all of this! D[matrix, x] ... or things like this:

    f[x_, y_] := x^2 + Sin[y]
    vars = {x, y};
    Table[D[f[x, y], var1, var2], {var1, vars}, {var2, vars}] // MatrixForm


But how do you compute the derivative of x'Ax in Mathematica (x being a vector and A being a matrix)? What you have pointed out is only scalar derivatives, if I am not mistaken here.


Like this, perhaps?

    A = {{1, 2}, {3, 4}}
    vec = {x^2, x^3}
    D[vec.A.vec, x]

Or perhaps like this, again the table of derivatives:

    xvec = {x1, x2}
    Table[D[xvec.A.xvec, x], {x, xvec}]

(all untested... one typo caught...)


Mathematica can definitely compute the derivatives if you fix the size of the matrix. This isn't very useful if you're trying to compute the derivative of an expression with arbitrary sized matrices.


You can do general matrices too; what do you have in mind?

    aa[x_] = {{1, 2}, {3, 4}} x
    bb[x_] = {x^2, x^3}
    D[a[x].b[x], x, x] (* for any suitable tensors *)
    % /. {a -> aa, b -> bb}



Added Link to Wolfram Alpha...


Note: This website presumes denominator layout, which is different from what is used in the guide.


Does the layout matter as long as you're consistent? Do the deep learning libraries presume that you're using a certain layout?


The deep learning libraries largely hide all the calculus - it's all automated.
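As a toy illustration of that automation, here is a minimal forward-mode autodiff sketch using dual numbers. This is a hypothetical teaching example, not how any particular library is actually implemented:

```python
class Dual:
    """Dual number (val, dot): carries a value and its derivative."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)
    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)
    __rmul__ = __mul__

def derivative(f, x):
    """Evaluate f at (x, 1) and read off the derivative part."""
    return f(Dual(x, 1.0)).dot

# d/dx (x^2 + 3x) = 2x + 3, so at x = 2 the derivative is 7
print(derivative(lambda x: x * x + 3 * x, 2.0))  # 7.0
```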


Wow! Great little calculator. Thanks for pointing us at it.


You should add a link in the resources section


Done. Added a link.


Wow, this is really a great resource. I wish it had been available a few years back when I took the free online version of CS231n. The hardest part (for me, anyway) was the long-forgotten Calculus needed for backprop. Especially as applied to matrices. I struggled at the time to find accessible explanations of many of the matrix operations, and you seem to have it all laid out here. Thank you.


Thanks so much for this. I have no interest in deep learning (at the moment) but I was working through some papers about the Lucas Kanade tracker and this paper explains some of the underlying math in just the right amount of detail. The authors usually show the beginning and end point and just say something like "using the chain rule" we arrive at ... It took me a while to understand what they were saying and this paper helps a lot.

The math is super easy, but keeping all the notations and conventions in my head is hard. I've never seen it laid out this nicely before. Thanks!


Hiya. That's funny, because that's exactly what caused us to write this article. Jeremy and I were working on an automatic differentiation tool and couldn't find any description of the appropriate matrix calculus that explained the steps. Everything just provides the solution without the intervening steps. We decided to write it down so we never have to figure out the notation again. haha


Matrix calculus is a bit screwy when you realize that there are two possible notations for matrix derivatives (numerator vs. denominator layout; numerator layout is used in this guide). Plus, the notation is not very self-explanatory for doing calculations unless you commit some basic results to memory, which is why, as a physicist, I would recommend working in tensor calculus notation during calculations and translating back to matrix notation when writing up the results.


I was also surprised when I saw that there was no standard notation for Jacobian matrices. We use the numerator notation in the article, but point out that there are papers that use the denominator notation. I think I remember from engineering school that we used numerator notation so we stuck with that.


This is what index notation is good for, and I encourage everyone to learn it. Jacobians are dx_a/dx_b, two indices, and clearly b belongs to the derivative. Whether it's rows or columns is an implementation detail of how you're storing these numbers.

Index notation also seems natural for programming: an element A[i,j] or a slice Z[3,4,:] are precisely this.
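To make that concrete, a small numerical sketch (hypothetical helper) where J[a, b] holds df_a/dx_b, with the index b clearly belonging to the derivative:

```python
import numpy as np

def jacobian(f, x, eps=1e-6):
    """Numerical Jacobian J[a, b] = d f_a / d x_b via central differences."""
    fx = f(x)
    J = np.zeros((fx.size, x.size))
    for b in range(x.size):
        dx = np.zeros_like(x)
        dx[b] = eps
        J[:, b] = (f(x + dx) - f(x - dx)) / (2 * eps)
    return J

# Sanity check: f(x) = A x has Jacobian exactly A
A = np.array([[1.0, 2.0], [3.0, 4.0]])
J = jacobian(lambda v: A @ v, np.array([0.5, -1.0]))
assert np.allclose(J, A, atol=1e-6)
```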


I agree. If your Matrix calculus involves more complicated use of derivative operators you can't treat it like linear algebra anymore. Better to break it down into something like tensor notation first and back to matrices at the end. https://en.wikipedia.org/wiki/Del. Specifically I was always confused by the material derivative of a vector field when presented as either Matrix or vector calculus. If you represent it in tensor notation ( or explicitly break it out as operations on basis vectors ) it works out nicely.


The book "Matrix Differential Calculus with Applications in Statistics and Econometrics, 3rd Edition" by Magnus and Neudecker, is available at http://www.janmagnus.nl/misc/mdc2007-3rdedition (468 pages).


So are we turning machine learning into a Euclidean distance calculation [with] many dimensions, with different weights for each dimension?

That’s... not that sexy. But at least it makes sense to anyone with an undergrad degree in CS or math, which is something neural networks never accomplished.


> For functions of a single parameter, the partial derivative operator ∂/∂x is equivalent to the full derivative operator d/dx (for sufficiently smooth functions).

Are you pulling my leg here, or do I need to scrap my understanding of calculus?

If they exist, how could they not be equivalent?


In school, I didn't make it much past basic calculus/algebra. As a self-taught programmer (my highest level of education is a high-school diploma), I seriously wish I could go back and put more effort into math. I love looking at these types of topics, but I have absolutely no clue what I'm looking at.

If anyone can recommend any books, courses, or any other material that starts from high-school level math, and gradually increases in complexity, I would love to look at it.

Cheers :)


I would start by watching the Linear Algebra and Calculus videos by 3Blue1Brown. This will give you an intuitive understanding of both.

https://www.youtube.com/playlist?list=PLZHQObOWTQDPD3MizzM2x...

https://www.youtube.com/playlist?list=PLZHQObOWTQDMsr9K-rj53...


Thank you!


I am writing a book that I hope can serve such a purpose. Would you be interested in taking a look? If so, shoot me an email at mathintersectprogramming@gmail.com

Fair warning, I have shown it to a programmer who claimed some level of "math phobia," and they said the first chapter was too difficult. I rewrote that chapter since, and I think it is better, but I could use some feedback :)


Well... this paper is really designed to be accessible with just high school math, if you take your time (a few weeks or months) and follow the references. Any time it relies on some concept, it includes a link to learning more about that concept, and also has a link to a forum where you can ask questions if you get stuck. There's also a table of all notation used.

If you give it a go and find you're not successful, I'd be interested to hear the first point where you got stuck and couldn't get unstuck, since that would suggest a need for us to improve our paper!


Hi Jeremy,

I greatly appreciate your response! I will take a long look at this paper again and attempt to digest it.

Thanks again!


Try "Methods of Mathematics Applied to Calculus, Probability, and Statistics" by Richard Hamming


Thanks for this great contribution.

I would like to be able to read the math in DL papers. (Sorry, I'm asking for something that's too broad.)

1) How much does this document cover the notation used in those papers? 2) When I read a paper and am not sure what the math means, does that mean I haven't grokked the subject yet, or that the math in the paper goes beyond what's covered in this Matrix Calculus document (assuming I've studied it well)?


I would say try to get as much of the intuition behind the math as you can. Knowing what an equation means, rather than fully understanding the Greek notation, is what matters for practice. If you want to substantially contribute to the theoretical CS literature, you will need a good handle on the notation (for obvious reasons).

Note: I come from math and Econ, so the split between practitioner and theorists might be different for CS/ML.


While matrix derivatives are important, there is also a lot of other math in DL papers. In particular, a lot of the probability side concerns expectations, KL divergences, entropy, etc., which are all defined in terms of integrals or sums. You need undergraduate-level probability background.


The first 5 chapters of the Goodfellow deep learning book are a great resource for understanding the probability, linear algebra, optimization, and information theory you need to digest deep learning papers.


Like, did we ever get this much math on a regular basis back a few years ago on HN? It's exciting to see how math is seeping into engineering culture.


Fantastic content, thank you.

If I could make one request it would be a bit of margin/padding on the left of the body text. Would make it more readable on mobile.


Bookmarked, this is a very nice reference I will need to read in greater depth.


Reading this shows me just how little I know about advanced mathematics.


So great! Thanks Terence and Jeremy!


Beautifully straightforward - going to be referring to this for quick refreshers



