 The matrix calculus you need for deep learning (2018) 168 points by Anon84 on April 2, 2021 | hide | past | favorite | 40 comments If you know single variable calculus and basic linear algebra, then you can often muddle through by just remembering that the derivative of,(1) f: R^n -> R is a vector(2) f: R -> R^n is a vector(3) f: R^m -> R^n is an n x m matrix (the Jacobian)(*) f(x)=x^tAx is f'(x)=(A + A^t)x (this is an example of (1))(**) and that the derivative (gradient) of (1) gives you (3) with m=n, and in this case the derivative of (3) will be symmetric (the Hessian).then just do your best to apply the single variable rules of differentiation (product rule, chain rule, etc.), and then mess around a bit with the result until all the dimensions match up and such that your result matches the appropriate case (1)-(3).For any more complicated functions you encounter such as the determinant, you can just look up its derivative when needed. I think it conceptually helps a lot to distinguish between row and column vectors, even if they are coalesce in any production code you might write.(0a) An (nxm) matrix A represents a linear transformation f(x)=Ax from R^m -> R^n(0b) A linear transformation f(x) = Ax is its own derivative, f'(x) = Ax(3) The derivative of a function f : R^m -> R^n is a linear transformation with the same "type" R^m -> R^n as the original function. This means it is an (nxm) matrix.The other two rules can be derived from these rules:(1) The derivative of a function f : R^m -> R is a (1xm) matrix, aka a row vector.(2) The derivative of a function f : R^n -> R is an (nx1) matrix, aka a column vector.This perspective is helpful because, aided by the chain rule, we can easily derive all the other matrix algebra rules on the fly. Distinguishing row / column vectors helps catch conceptual type errors in places where it's inappropriate to add vectors and covectors together. In (2) I think you meant to write "f : R -> R^n" You're right, thanks! > (0b) A linear transformation f(x) = Ax is its own derivative, f'(x) = Axf'(x) is just A , not Ax,f'(x) != f(x) (I vouched for this comment since it exposes a common misconception people have about derivatives, and I'd rather reply than have it buried)It's unfortunate that the connection between matrices and linear transformations, and thereby between derivatives and linear transformations, is not adequately emphasized in the math courses that most people take.The derivative (or differential, or total derivative) of a function f : R^m -> R^n at a point x ∈ R^m is defined, formally, to be a linear transformation A : R^n -> R^m such that`````` || f(x+h) - (f(x) + A(h)) || lim ------------------------ = 0 h->0 ||h|| `````` Where ||.|| indicates a vector norm that measures the magnitude of its argument. Unpacking a bit, notice that> h is a vector in R^m> f(x+h) is the exact value of f at x+h> f(x)+A(h) is a linear approximation to f(x+h), using x as the base point and A as the linear approximatorSo this just a formal way of saying that the dervative A should be a good linear approximation to f.Every linear transformation has a corresponding matrix, and every matrix corresponds to a linear transformation (with respect to a choice of basis). So, when we say "the derivative of f at x is the matrix A", we really mean that "the derivative of f at x is the linear transformation represented by the matrix A".EDIT: In response to the parent poster, who has dug their heels in by posting a now-dead rant in reply to this post, I encourage you to refer to a copy of Rudin, "Principles of Mathematical Analysis", for a rigorous treatment of derivatives. See also: representation theory and group actions. I think that one of the reasons people look for a good exposition is so that they do not _have_ to muddle through.I know I look for a simple example or explanation that makes complicated cases clear.This 'explanation' might be correct, but I found the details that would help someone else follow it are missing. Especially someone with a computer science background who considers a m x 1 matrix different from a 1 x m matrix.I see a sister comment has explained this, but I want to point out that there are plenty of details... (like sister's statement 0b which looks to me more like an exponential function than a linear function) that's why one goes to a carefully written source.This paper is a carefully written explanation of the basics, ending with many applications of the chain rule. The "matrix product rule" is not mentioned, which is good in my opinion. I admit it's a blue collar approach, but it's served me well when my goal is optimization or some such and I just need to get on with it. I suggest this reference  for anyone needing to look something up. I struggling in DL when we started having to do tensor multiplications on the homework. Good with 2D matrices but the higher dimensional math gave me a lot of trouble when it should have been straight forward.Any tips there? If you squint a bit tensor multiplications are still basically just matrix multiplications. At each coordinate in the output you'll place the sum of a bunch of element-wise products. Here are a few examples, some of which you've seen before and some of which you might not have:- Vector inner product (dot product) -- n-element vectors V and W are combined to create the single element (V_1 x W_1 + V_2 x W_2 + ...). Let's reframe this as a type of tensor contraction (multiplication). You have an (n,) tensor and another (n,) tensor (both "1D"), and you're getting rid of the first dimension in each of them, so the product is an (,) tensor ("0D"), and each element is computed by summing up n element-wise multiplications.- Vector outer product -- we have the same vectors V and W as before, but the result is a "2D" (n,n) matrix M where M_a,b = V_a x W_b. Reframing this as a type of tensor contraction, you aren't getting rid of, summing over, or contracting any dimensions, so the product includes every dimension from both inputs -- it's (n,n). Each element is computed by summing up 1 element-wise multiplication.- Matrix product -- Consider A an (i,j) matrix and B a (j,k) matrix. Standard matrix multiplication yields a result M (i,k). As a tensor multiplication, you're contracting (summing over) the "j" entries, so each output element is the sum of j element-wise multiplications. In particular, M_a,b = (A_a,1 x B_1,b + A_a,2 x B_2,b + ...).- Kronecker product (block matrix product) -- Consider the same matrices A and B as before. We aren't going to contract over any indices, so the result is a "4D" object M (i,j,j,k). To compute M_a,b,c,d you find all elements from A and B with those indices fixed, pair and multiply them, and sum them together. In particular, M_a,b,c,d = A_a,b x B_c,d.- Something higher dimensional -- Consider an input tensor A with shape (i,j,k,m,n) and another B with shape (m,n,w,y,z). As I'm sure you've noticed, even for objects as simple as vectors there are several choices of tensor multiplication (contraction) available. As with any other operation, the problem you're describing will govern which one you use (just like how you'd normally have a very clear reason for choosing an inner vs outer product in a vector problem). It's still interesting to consider what our options are though. The general rule is that sizes have to line up on both sides, so for this problem we could contract over {}, {m}, {n}, or {m,n}. The outputs for each of those products would have shapes, respectively, of (i,j,k,m,n,m,n,w,y,z) [everything], (i,j,k,n,n,w,y,z) [no m], (i,j,k,m,m,w,y,z) [no n], or (i,j,k,w,y,z) [no m or n]. I'm going to skip over the first three because they're enough like things you've seen before (kronecker product for the first, matrix product for the other two) that I don't think they're worth the comment space, but the last one is novel in some sense. Our methodology for multiplication is still straightforward though. Suppose the result is called M and we want to compute its value at some coordinate M_a,b,c,d,e,f. We'll pair up elements that look like A_a,b,c,?,? with elements that look like B_?,?,d,e,f (note that the inner, contracted coordinates need to match), multiply them together, and add them up. In particular, we get something like (A_a,b,c,1,1 x B_1,1,d,e,f + A_a,b,c,1,2 x B_1,2,d,e,f + A_a,b,c,2,1 x B_2,1,d,e,f + A_a,b,c,2,2 x B_2,2,d,e,f + ...). There are m x n element-wise products being added up to achieve _each_ coordinate of the output. If curious, past threads:Matrix calculus for deep learning part 2 - https://news.ycombinator.com/item?id=23358761 - May 2020 (6 comments)Matrix Calculus for Deep Learning - https://news.ycombinator.com/item?id=21661545 - Nov 2019 (47 comments)The Matrix Calculus You Need for Deep Learning - https://news.ycombinator.com/item?id=17422770 - June 2018 (77 comments)Matrix Calculus for Deep Learning - https://news.ycombinator.com/item?id=16267178 - Jan 2018 (81 comments)Others? Do people do it this way? Isn't using index notation + Einstein summation convention way easier and more powerful? You only need to remember two rules:1. dx_i/dx_k = [i==j] where [i==j] is 0 if i != j and 1 if i == j2. (AB)_ij = A_ik B_kjYou don't even need the second rule if the function you want to differentiate is in index notation in the first place.The example of the other comment:f(x) = x^T A x = x_i A_ij x_jdf/dx_k = d/dx_k (x_i A_ij x_j) = [i==k] A_ij x_j + x_i A_ij [j==k] = A_kj x_j + x_i A_ik.You can also skip the intermediate [i==k] step and immediately set the indices of other factors in a term equal. For example, differentiating with respect to A is immediate, even though the method from the article can't even handle it directly:df/A_kl = x_k x_l I think you got the indices "k" and "j" mixed up in rule 1. Yep thanks! I cannot edit any more.. You do not need any matrix calculus for deep learning and micrograd (https://github.com/karpathy/micrograd), which implements backpropagation for neural nets in 100 lines of code, is a proof. Everything else is just vectorization. Yes, you'll probably need more information-theory and statistics than linear algebra. The focus seems to be in the wrong place. The idea is great but the many notational typos are going to confuse the hell out of people who don't already know this math. (e.g. References to vector variables are often not bolded even as they are introducing bold as the notation for vectors, reusing x excessively as a variable in multiple contexts in the same expression, etc.)I went to graduate school for physics so I know all of this math cold but reading this paper caused me much confusion until I figured out they had a ton of typos in it.Caveat reader.Upon closer inspection, I realized they are not so much typos as poor UI implementation of the rendering. The visual difference between bold and italic when this page renders in either Chrome or Firefox is so subtle, it's very hard to see. Many of the equations and in-line variable references are rendered with svg so one cannot even inspect the page source to determine the intended font weight. Regardless, it's very confusing to read because of that. Not exactly what you want when trying to explain vector calculus. :)P.S. I'll try to submit some comments to the authors to see if they can get this cleared up before it causes too much confusion. > We assume no math knowledge beyond what you learned in calculus 1Anyone got recommendations for self-studying and testing Calculus 1? Calculus Made Easy - 1910. It is even entertaining."What One Fool Can Do, Another Can. (Ancient Simian Proverb.)" They are probably terrible for self-study, but I want to mention them anyway. I am a huge fan of David Bressoud’s Calculus (and analysis!) books. His first one doesn’t have exercises and his second one has a lot of physics.They are all heavy on written narrative and interlaced with history. I find all four of them absolutely fascinating.https://www.amazon.com/Second-Year-Calculus-Undergraduate-Ma...Analysis:https://www.amazon.com/Approach-Analysis-Mathematical-Associ... Better explained articles that explain the “what” part, such as the meaning of “e”, from the first principles.I have a reasonable handle on Calculus-1 but those articles really helped me connect the dots. After reading them I realized that all I knew was “how” without having a clue about “what” 1) go into knowing that the terminology is the hardest part to learn. The arithmetic is fairly easy, but the word salads can get daunting.2) khan academy, coursera, MIT open courseware, whatever books your library has. Piece together multiple sources, b/c you don't have a professor to ask questions to, you'll need all the different sources and how each source explains the same idea differently for everything to 'click'. I refreshed on linear algebra and multi variable calculus through Khan Academy before I took a theory-based machine learning class. For me, it was a really good experience and I can’t overstate my appreciation for Sal Khan’s teaching style. I found the book "Calculus: a complete course" very nice. It's very cheap and extremely unpretentious while also giving you a subtle taste of a bunch of interesting applications (Deep learning... bleh, give me electromagnetism!) Nothing beats taking a good university class on it - https://ocw.mit.edu/courses/mathematics/18-01-single-variabl... To begin, try to get a really high level explanation for differential and integral calculus. Look at videos, illustrations, etc.Then, once you have a mental image of what you will be doing, get into the math from the ground up. When I was like 14 years old one one my math teachers explained to us what calculus was using zero equations. He made everything look really easy.At college, it was the complete opposite. My professor started with limits and convergence, didn't even bother to explain why we were studying the subject.Kudos for people who actually explain things. My high school physics teacher explained calculus to me by plotting a velocity curve, then saying the tangential line at a point (derivative) is acceleration and the area from 0 to that point under the curve (integral) was the distance travelled. Made all of calculus very easy for me to grok going forward. I missed the day where the word tangent must have been explained, then failed basic trigonometry and felt like an idiot for the rest of high school. I never got a high level explanation for what calculus was because the business model of my school involved having a passing rate below 30%. If the passing rate was too high, teachers got fired no matter what the reason was. I found outlier.org to be fantastic. This helped me a lot in understanding the math behind gradient descent pretty well. The only problem I faced was when I tried to derive the equations for a vectorised implementation of a neural network. Certain derivatives, like the derivatives of matrix multiplication, are not obvious from the from the math presented in the paper. Had to dig through a bit to bridge that gap between non vectorised and vectorised implementations.By vectorised implementation of gradient descent, I mean one where gradient descent looks at more that one training samplenat a time when updating weights Side note: Terence Parr, one of the authors of this paper, also created AntLR. is there an online course for maths? like undergrad maths? I’ve seen “The Infinite Napkin” passed around in math circles. I can’t vouch for it and I haven’t read much of it: Er, +1000Man, if I had forty hours instead of forty minutes, I bet I could actually have explained this all.This book is my attempt at those forty hours.This project has evolved to more than just forty hours. https://tutorial.math.lamar.edu/ are pretty good, although they're mostly focused on the applied part. Paul's notes are great! Carried my butt through Calc 3 and PDEs, especially when the math for these classes get a little more formulaic (that is there is a pretty small set of reasonable questions you could possibly ask). His stuff is significantly more well done than any text book I had assigned for college math. MIT OCW has lectures for most undergrad CS/Engineering Math. Search: