Matrix Calculus for Deep Learning (usfca.edu)
573 points by jph00 on Jan 30, 2018 | 81 comments



Jeremy here. Happy to answer any questions or comments you have.

But more importantly - I need to mention that Terence Parr did nearly all the work on this. He shared my passion for making something that anyone could read on any device to such an extent that he ended up creating a new tool for generating fast, mobile-friendly math-heavy texts: https://github.com/parrt/bookish . (We tried Katex, Mathjax, and pretty much everything else but nothing rendered everything properly).

I've never found anything that introduces the necessary matrix calculus for deep learning clearly, correctly, and accessibly - so I'm happy that this now exists.


Terence here. Jeremy's role was critical in shaping the direction and content of the article. Who better than he to describe the math needed for deep learning? :)


Glad to see you in this space, Terence! Been a long time since the traveling parser revival and beer tasting festival!


Question: when I learned vector calculus back in college, we used Marsden & Tromba as our textbook, where they equate the derivative of a function from R^n -> R^m with the Jacobian. Is matrix calculus the same thing, just in slightly different notation?


Typographic advice: the body text has very long lines in a desktop browser, which makes it a bit slow and tiring to read. I’d say the ideal is somewhere between 1/2 and 2/3 this length. I’d recommend keeping the same width on screen but bumping the font size up by 30%.

As an extra minor nit, italicizing functions like sin, etc. is also somewhat unconventional in mathematical typesetting.


I agree that the font should be bigger. I need to learn more CSS in order to switch between font sizes per platform. The font size of the text is easy to change, but all of the images were generated from LaTeX at a specific font size. I need to scale the in-line equation images as the font size bumps up.


The magic incantation here is probably a media query:

    @media (max-width: 768px) {
        p {
            font-size: 1rem;
        }
    }


We have to also adjust the image sizes for the in-line equations. That’s what I need to figure out :)


It's a variable name for a polymorphic function, Taylor series, Euler's formula, etc., depending on context. That's a significant difference from singular types, but being a variable name for an abstract concept is in principle no different from typesetting x. This goes neatly with "everything is an object reference" and might be more of a programmer's perspective.


In mathematical notation it gets confusing because several italic letters in a row are otherwise interpreted as separate variables.


Thanks for the advice!


This is great! My graduate advisor made a really great matrix calculus study sheet for me a long time ago that was absolutely invaluable in learning ML. (I mean it: this is great not just for DL but for all sorts of reasoning in ML.)


We originally had that generic ML target in mind but figured a DL bent would make it a wee bit more interesting.


Wondering if you still have that with you? Would be super useful for folks here.


What do you think about the index notation physicists use for tensor calculus?


I don't find it very accessible, myself - but I'm not a physicist, so the materials using or about that notation aren't aimed at me.

The only tensor notation I've been happy with is that used in J (http://www.jsoftware.com), which is simple, flexible, and concise.

There's also some nice-enough modern notation used in this excellent review: http://www.cs.cmu.edu/~christos/courses/826-resources/PAPERS...


The notation is very simple: you write tensor expressions with indices, and repeated indices are implicitly summed over. For example, matrix multiplication:

    A_ij = B_ik C_kj

Differentiating this with respect to variable l:

    ∂_l A_ij = ∂_l (B_ik C_kj) = (∂_l B_ik) C_kj + B_ik (∂_l C_kj)

By writing out the indices you can just use the rules for scalar derivatives.
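That summation convention maps directly onto NumPy's `einsum`; a minimal sketch (the shapes here are arbitrary, chosen only for illustration):

```python
import numpy as np

# A_ij = B_ik C_kj: the repeated index k is implicitly summed over
B = np.arange(6.0).reshape(2, 3)
C = np.arange(12.0).reshape(3, 4)

A = np.einsum('ik,kj->ij', B, C)

# Same result as ordinary matrix multiplication
assert np.allclose(A, B @ C)
```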


Oh in that case I'm just confused - that's what I know as Einstein notation. Modern physics papers seemed to use much more complex notation, but I probably just misunderstood.

If we're talking about Einstein notation, then I'm a fan - `np.einsum()` is often a great way to create fast tensor computations with minimal code.


Index notation isn't useful for calculation; it's useful for algebra, i.e. if you want a closed-form solution to tensor equations. My distinct impression is that in ML no one cares about that, because eventually everything gets a numerical treatment.


There are discrepancies between the HTML and pdf version.

Eg. In the first line in 'References'

HTML: "When looking for resources on the web, search for “matrix calculus” not “vector calculus.” "

Latex/pdf: "When looking for resources on the web, search for “elements” not “elements” "


Whoops, thanks. Translator error. I'll fix it.


Hi Jeremy, thanks for creating so many great learning resources. A fellow ML enthusiast and I are starting a deep learning meetup here in Phoenix next month. Do you have any advice for creating a welcoming environment for people to learn/teach? Thanks!


Probably better asking that at http://forums.fast.ai , since a lot of participants there have set up local meetups, and some of them have gotten pretty big. I haven't created a meetup myself - my focus has been on the online course and community, frankly.


In the article, you say the gradient of 2x + y^8 is [1, 7y^8]. Shouldn't it be [2, 7y^8]?


Now you are just replacing one error with another, you mean [2, 8y^7]. ;)
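For anyone double-checking, the corrected gradient [2, 8y^7] can be verified with a quick SymPy sketch:

```python
import sympy as sp

x, y = sp.symbols('x y')
g = 2*x + y**8

# Gradient: partial derivatives with respect to x and y
grad = [sp.diff(g, v) for v in (x, y)]
print(grad)  # [2, 8*y**7]
```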


Oops. Yeah, thanks.


ERRATA: In the start of Matrix Calculus section, shouldn't the first value in gradient of "g" be 2 instead of 1 i.e., [2,8y^7]


This is really great, thanks.


If you're looking at this with the intention of getting started in Deep Learning and feeling overwhelmed by the math then Andrew Ng offers a great course on Coursera that goes over all of the formulas needed to calculate the forward propagation, loss computation, backward propagation, and gradient descent. Highly recommend it for anyone interested in breaking into the field of machine learning.
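Those four steps can be sketched in a few lines of NumPy. This is an illustrative toy (1-D linear regression, not the course's code):

```python
import numpy as np

# Noise-free toy data: y = 3x + 1
rng = np.random.default_rng(0)
X = rng.normal(size=100)
y = 3.0 * X + 1.0

w, b, lr = 0.0, 0.0, 0.1
for _ in range(500):
    pred = w * X + b                       # forward propagation
    loss = np.mean((pred - y) ** 2)        # loss computation (MSE)
    grad_w = 2 * np.mean((pred - y) * X)   # backward propagation (chain rule)
    grad_b = 2 * np.mean(pred - y)
    w -= lr * grad_w                       # gradient descent update
    b -= lr * grad_b

# w ≈ 3.0, b ≈ 1.0 after convergence
```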


It is also all free on youtube: https://www.youtube.com/watch?v=UzxYlbK2c7E


They also have it all on Stanford's site with some other information and course materials.

https://see.stanford.edu/Course/CS229


This is great. But I find the advantage of Coursera is that it incorporates quizzes and programming homework into the lectures, reinforcing the learning. Also, the material is updated and more relevant to today's ML problems than his 2008 lectures on YouTube.




If you want to get started with machine learning, you should just start with Keras and do some theory later.


Thanks for this. I was taking Andrew Ng's course, but the way he glosses over the calculus and then expects the student to understand the implications at the end of the lecture was a turn-off, so I dropped it. I hated the feeling that I wasn't learning, just memorizing solutions.


You might prefer the approach at http://course.fast.ai - all the concepts are taught with code, instead of math, and understanding is developed by running experiments.


I did both and found fast.ai so much easier to understand for someone without a background in math, like me.


I've taken 3/5 of his Deep Learning MOOC and thought the calculus, at least its implementation, was very well explained. Were you dissatisfied with the lack of depth or lack of explanation? Just curious.


Fortunately there is a website now capable of doing matrix calculus! http://www.matrixcalculus.org

Mathematica doesn't seem to be able to do matrix calculus, which surprised me quite a bit.


Indeed, most tools surprisingly lack this ability. I was shocked when I needed to break my calculations down to a piece-wise form when doing matrix calc with SymPy.
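A minimal sketch of that piece-wise workaround in SymPy, using x'Ax as the example: fix the size, differentiate component by component, and compare against the known identity d(x'Ax)/dx = (A + A')x.

```python
import sympy as sp

x1, x2 = sp.symbols('x1 x2')
x = sp.Matrix([x1, x2])
A = sp.Matrix([[1, 2], [3, 4]])

f = (x.T * A * x)[0]                              # scalar x'Ax
grad = sp.Matrix([sp.diff(f, v) for v in (x1, x2)])

# Known matrix-calculus identity: gradient of x'Ax is (A + A')x
assert sp.expand(grad - (A + A.T) * x) == sp.zeros(2, 1)
```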


Mathematica can absolutely do all of this! D[matrix, x] ... or things like this:

    f[x_, y_] := x^2 + Sin[y]
    vars = {x, y};
    Table[D[f[x, y], var1, var2], {var1, vars}, {var2, vars}] // MatrixForm


But how do you compute the derivative of x'Ax in Mathematica (x being a vector and A being a matrix)? What you have pointed out is only scalar derivatives, if I am not mistaken here.


Like this, perhaps?

    A = {{1, 2}, {3, 4}}
    vec = {x^2, x^3}
    D[vec.A.vec, x]

Or perhaps like this, again the table of derivatives:

    xvec = {x1, x2}
    Table[D[xvec.A.xvec, x], {x, xvec}]

(all untested... one typo caught...)


Mathematica can definitely compute the derivatives if you fix the size of the matrix. This isn't very useful if you're trying to compute the derivative of an expression with arbitrary sized matrices.


You can do general matrices too; what do you have in mind?

    aa[x_] = {{1, 2}, {3, 4}} x
    bb[x_] = {x^2, x^3}
    D[a[x].b[x], x, x] (* for any suitable tensors *)
    % /. {a -> aa, b -> bb}



Added Link to Wolfram Alpha...


Note: This website presumes denominator layout, which is different from what is used in the guide.


Does the layout matter as long as you're consistent? Do the deep learning libraries presume that you're using a certain layout?


The deep learning libraries largely hide all the calculus - it's all automated.
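As a toy illustration of that automation, here is a minimal forward-mode autodiff sketch using dual numbers. This is a hypothetical teaching example, not how any particular library is actually implemented:

```python
class Dual:
    """Dual number (val, dot): carries a value and its derivative."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)
    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)
    __rmul__ = __mul__

def derivative(f, x):
    """Evaluate f at (x, 1) and read off the derivative part."""
    return f(Dual(x, 1.0)).dot

# d/dx (x^2 + 3x) = 2x + 3, so at x = 2 the derivative is 7
print(derivative(lambda x: x * x + 3 * x, 2.0))  # 7.0
```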


Wow! Great little calculator. Thanks for pointing us at it.


You should add a link in the resources section


Done. Added a link.


Wow, this is really a great resource. I wish it had been available a few years back when I took the free online version of CS231n. The hardest part (for me, anyway) was the long-forgotten Calculus needed for backprop. Especially as applied to matrices. I struggled at the time to find accessible explanations of many of the matrix operations, and you seem to have it all laid out here. Thank you.


Thanks so much for this. I have no interest in deep learning (at the moment) but I was working through some papers about the Lucas Kanade tracker and this paper explains some of the underlying math in just the right amount of detail. The authors usually show the beginning and end point and just say something like "using the chain rule" we arrive at ... It took me a while to understand what they were saying and this paper helps a lot.

The math is super easy, but keeping all the notations and conventions in my head is hard. I've never seen it laid out this nicely before. Thanks!


Hiya. That's funny, because that's exactly what caused us to write this article. Jeremy and I were working on an automatic differentiation tool and couldn't find any description of the appropriate matrix calculus that explained the steps. Everything just provides the solution without the intervening steps. We decided to write it down so we never have to figure out the notation again. haha


Matrix calculus is a bit screwy when you realize that there are two possible notations for matrix derivatives (numerator vs. denominator layout; numerator layout is used in this guide). Plus, the notation is not very self-explanatory for doing calculations unless you commit some basic results to memory, which is why, as a physicist, I would recommend working in tensor calculus notation during calculations and translating back to matrix notation when writing up the results.


I was also surprised when I saw that there was no standard notation for Jacobian matrices. We use the numerator notation in the article, but point out that there are papers that use the denominator notation. I think I remember from engineering school that we used numerator notation so we stuck with that.


This is what index notation is good for, and I encourage everyone to learn it. Jacobians are dx_a/dx_b, two indices, and clearly b belongs to the derivative. Whether it's rows or columns is an implementation detail of how you're storing these numbers.

Index notation also seems natural for programming: an element A[i,j] or a slice Z[3,4,:] are precisely this.
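To make that concrete, a small numerical sketch (hypothetical helper) where J[a, b] holds df_a/dx_b, with the index b clearly belonging to the derivative:

```python
import numpy as np

def jacobian(f, x, eps=1e-6):
    """Numerical Jacobian J[a, b] = d f_a / d x_b via central differences."""
    fx = f(x)
    J = np.zeros((fx.size, x.size))
    for b in range(x.size):
        dx = np.zeros_like(x)
        dx[b] = eps
        J[:, b] = (f(x + dx) - f(x - dx)) / (2 * eps)
    return J

# Sanity check: f(x) = A x has Jacobian exactly A
A = np.array([[1.0, 2.0], [3.0, 4.0]])
J = jacobian(lambda v: A @ v, np.array([0.5, -1.0]))
assert np.allclose(J, A, atol=1e-6)
```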


I agree. If your Matrix calculus involves more complicated use of derivative operators you can't treat it like linear algebra anymore. Better to break it down into something like tensor notation first and back to matrices at the end. https://en.wikipedia.org/wiki/Del. Specifically I was always confused by the material derivative of a vector field when presented as either Matrix or vector calculus. If you represent it in tensor notation ( or explicitly break it out as operations on basis vectors ) it works out nicely.


The book "Matrix Differential Calculus with Applications in Statistics and Econometrics, 3rd Edition" by Magnus and Neudecker, is available at http://www.janmagnus.nl/misc/mdc2007-3rdedition (468 pages).


So are we turning machine learning into a Euclidean distance calculation [with] many dimensions, with different weights for each dimension?

That’s... not that sexy. But at least it makes sense to anyone with an undergrad degree in CS or math, which is something neural networks never accomplished.


> For functions of a single parameter, the partial derivative operator ∂/∂x is equivalent to the full derivative operator d/dx (for sufficiently smooth functions).

Are you pulling my leg here, or do I need to scrap my understanding of calculus?

If they exist, how could they not be equivalent?


In school, I didn't make it much past basic calculus/algebra. As a self-taught programmer (my highest level of education is a high-school diploma), I seriously wish I could go back and put more effort into math. I love looking at these types of topics, but I have absolutely no clue what I'm looking at.

If anyone can recommend any books, courses, or any other material that starts from high-school level math, and gradually increases in complexity, I would love to look at it.

Cheers :)


I would start by watching the Linear Algebra and Calculus videos by 3Blue1Brown. This will give you an intuitive understanding of both.

https://www.youtube.com/playlist?list=PLZHQObOWTQDPD3MizzM2x...

https://www.youtube.com/playlist?list=PLZHQObOWTQDMsr9K-rj53...


Thank you!


I am writing a book that I hope can serve such a purpose. Would you be interested in taking a look? If so, shoot me an email at mathintersectprogramming@gmail.com

Fair warning, I have shown it to a programmer who claimed some level of "math phobia," and they said the first chapter was too difficult. I rewrote that chapter since, and I think it is better, but I could use some feedback :)


Well... this paper is really designed to be accessible with just high school math, if you take your time (a few weeks or months) and follow the references. Any time it relies on some concept, it includes a link to learning more about that concept, and also has a link to a forum where you can ask questions if you get stuck. There's also a table of all notation used.

If you give it a go and find you're not successful, I'd be interested to hear the first point where you got stuck and couldn't get unstuck, since that would suggest a need for us to improve our paper!


Hi Jeremy,

I greatly appreciate your response! I will take a long look at this paper again and attempt to digest it.

Thanks again!


Try "Methods of Mathematics Applied to Calculus, Probability, and Statistics" by Richard Hamming


Thanks for this great contribution.

I would like to be able to read the math in DL papers. (Sorry, I'm asking for something that's too broad.)

1) How much does this document cover the notation used in those papers? 2) When I read a paper and am not sure what the math means, does that mean I haven't grokked the subject yet, or that the math in the paper goes beyond what's covered in this Matrix Calculus document (assuming I've studied it well)?


I would say try to get as much of the intuition behind the math as you can. Knowing what an equation means, rather than fully understanding the Greek notation, is what matters for practice. If you want to substantially contribute to the theoretical CS literature, you will need a good handle on the notation (for obvious reasons).

Note: I come from math and Econ, so the split between practitioner and theorists might be different for CS/ML.


While matrix derivatives are important, there is also a lot of other math in DL papers. In particular, a lot of the probability side concerns expectations, KL divergences, entropy, etc., which are all defined in terms of integrals or sums. You need undergraduate-level probability background.


The first 5 chapters of the Goodfellow deep learning book are a great resource for understanding the probability, linear algebra, optimization, and information theory you need to digest deep learning papers.


Like, did we ever get this much math on a regular basis back a few years ago on HN? It's exciting to see how math is seeping into engineering culture.


Fantastic content, thank you.

If I could make one request it would be a bit of margin/padding on the left of the body text. Would make it more readable on mobile.


Bookmarked, this is a very nice reference I will need to read in greater depth.


Reading this shows me just how little I know about advanced mathematics.


So great! Thanks Terence and Jeremy!


Beautifully straightforward - going to be referring to this for quick refreshers



