 A Programmer’s Intuition for Matrix Multiplication 299 points by sebg on Oct 22, 2020 | hide | past | favorite | 90 comments This comes up occasionally. I don't find it to be particularly helpful, even though I do think betterexplained has been a strong source of gaining intuition into various subjects.Recognizing though that method of understanding something is pretty personal, I will say that what really helped me refresh was the 3Blue1Brown Essence of Linear Algebra series on Youtube. If you're trying to better grasp the subject, do yourself a favor and check out those videos. 3Blue1Brown is really great for learning/refreshing math. He has great visualizations.I especially enjoyed this video (not exactly a math tutorial more of a cool explanation): https://www.youtube.com/watch?v=OkmNXy7er84 I can wholeheartedly recommend his channel.In particular his recent "Lockdown Math" series taught me a lot.I knew and used trigonometric functions and exponentials before but they never really "clicked".Grant is a tremendous educator. True, but I don't agree with what he said here @ 8:00https://youtu.be/ZxYOEwM6Wbk?t=481Is this really the case? I checked Wikipedia and MathWorld and nobody makes a distinction between e^x and exp(x).Even if it's a "white lie" for didactic reasons, I don't buy it, it will be much more confusing down the line for students.Math is about finding structures governed by some rules and then generalizing them. To me the e^x defined as repeated multiplication conceptually is the same thing as the exponential function, just over broader domain. Why? You can interpolate between integer-valued X's using geometric mean. e^3 = sqrt(e^2 * e^4). What stops you from interpolating this recursively to achieve in the limit the exponential function over real numbers? I think the way he said it is a bit weird (and definitely nobody would think of e^x and exp(x) ss being two different things). But I would agree that it's probably best to define e^x through its power series, especially because it directly generalises to complex arguments. (Another good definition is as the function f whose derivative equals itself, and which satisfies f(0)=1, but that's not very constructive.)Of course, you can also go another route: you can define the number e (e.g. as the limit of (1+1/n)^n), then you can straightforwardly define natural number powers of e by repeated multiplication, then integer and rational powers through reciprocals and roots (you have to prove n-th roots exist, but that's doable), and then you can define real number exponents via limits (maybe this is what you mean by "interpolating recursively"). Now, when we come to complex numbers, you can use Euler's formula as a definition instead of a theorem, and define exp(a+bi):=e^a * (cos b + i sin b). Of course, at the end of this whole exercise you can prove that this definition is exactly equal to the power series definition.Another benefit of the power series definition is that it also generalises e.g. to the matrix exponential (exp(A), when A is a matrix). Everyone's seen/knows them already I haven't. I'm glad notsuoh posted the link. notsuoh on Oct 22, 2020 [–] Perhaps not the people who have not seen the OP’s link? adenadel on Oct 22, 2020 [–] Well, even if "everyone" knows it, at some point you have to come across it for the first time. Everyone's seen that XKCD already.  I'm thinking now we go here https://xkcd.com/1270/ I've always found the bipartite graph conceptualization of matrix multiplication the most intuitive, and especially so if you're familiar with neural networks: https://www.math3ma.com/blog/matrices-probability-graphs That is genuinely a really fun way to look at it, thank you for linking this! Especially this fits nicely with Markov matrices where you have N input nodes and N output nodes and the sum of all of the probabilities coming out of one of the nodes needs to equal 1.What I might find a little more difficult to teach to people through this lens is the phenomenon of eigenvectors, but I suppose that's to be expected—nothing will work well for all purposes. >What I might find a little more difficult to teach to people through this lens is the phenomenon of eigenvectorsWhy? Eigenvectors are simply inputs to the network where the output keeps its shape, that is, at most it gets rescaled, as if you had applied a uniform gain to the components, but otherwise it will be the same as the input. So for example let me ask for your knee-jerk opinions based on this idea:- does a matrix have the same left-eigenvectors as its right-eigenvectors?- what is the relationship between the left-eigenvalues and right-eigenvalues?- is there always an eigenvector? when is there a complete set? how do you generalize your notion of eigenvectors so that matrices always have a complete set of them?I am not sure any of these are intuitive here. If A is a square matrix, then a left eigenvector v is a vector such that vA = \lambda_v v for some \lambda_v. Likewise, if u is a right eigenvector of A, Au = \lambda_u u.Notice that u and v cannot be equal, because they are not the same shape. However, if v is a right eigenvector with eigenvalue \lambda, then v^T is a left eigenvector with eigenvalue \lambda, as well. More or less what this means is that we tend to just ignore left eigenvalues and only use the right eigenvalues, because the math is exactly the same up to a transpose operation.A matrix does not necessarily have nontrivial eigenvectors. Think about the 0 matrix here. But, if A is nonzero, then it must have at least one nonzero eigenvalue, hence one nontrivial eigenvector. This is because a nonzero matrix must have at least one nonzero row and column; however, if you construct the other rows (columns) to be multiples of the first column, you end up with a matrix with only one nontrivial eigenvalue. This also illustrates the conditions necessary for an nxn matrix A to have n linearly independent eigenvectors: the rows of A must be linearly independent.All of this is covered in a decent undergrad linear algebra course. I would suggest either finding a video course, or getting a good book and working through it, if you want to understand these things better. > However, if v is a right eigenvector with eigenvalue \lambda, then v^T is a left eigenvector with eigenvalue \lambda, as well.Consider the matrix M = [ 0.50 0.50 ] [ 0.25 0.75 ] Clearly [1; 1] is a right eigenvector with eigenvalue 1. But [1 1] M = [0.75 1.25].> A matrix does not necessarily have nontrivial eigenvectors. Think about the 0 matrix here.The zero matrix has _all_ the nontrivial vectors as eigenvectors.> But, if A is nonzero, then it must have at least one nonzero eigenvalue, hence one nontrivial eigenvector.Consider N = [ 0 1 ] [ 0 0 ] This has no non-zero eigenvalues, but it is not the zero matrix. It does have a nontrivial eigenvector, [1; 0], because every square complex matrix has an eigenvector. (This is in contradiction to another statement you said, that a matrix does not have nontrivial eigenvectors—that is technically true but only in a limited sense that your space might not be ℂ^n or something that can be quickly generalized to it like ℝ^n. But like we’re in a computing forum and I might find that pedantic.)I don't know why you consider the eigenvalue zero to somehow not count as an eigenvalue. Very suspicious.> This also illustrates the conditions necessary for an nxn matrix A to have n linearly independent eigenvectors: the rows of A must be linearly independent.Consider P = [ 1 1 ] [ 0 1 ] The rows are linearly independent, but this does not have a second eigenvector.> All of this is covered in a decent undergrad linear algebra course. I would suggest either finding a video course, or getting a good book and working through it, if you want to understand these things better.sigh ... See this is why I sometimes feel bad about commenting here. Like, this comment thread was supposed to be a celebration of this different way of thinking about linear algebra, and then I have to deal with this stuff. Like I totally don’t think you meant to come across as condescending, but given that I have taught crash courses for struggling friends on linear algebra concepts they missed in their undergrad to get through our graduate work in physics, you know, it kind of does come across that way. Yeah. In my experience, specialized subreddits (like /r/math or /r/haskell) have less such silliness than HN.For the first matrix, maybe a simpler example is M = [1 1; 0 0]. Then [1; 0] is a right eigenvector with eigenvalue 1, but [1 0] M = [1 1]. Tainnor on Oct 23, 2020 [–] > I don't know why you consider the eigenvalue zero to somehow not count as an eigenvalue. Very suspicious.Zero can of course be a proper eigenvalue. Probably they confused it with the fact that the zero vector typically doesn't count as an eigenvector. philip-b on Oct 23, 2020 [–] >But, if A is nonzero, then it must have at least one nonzero eigenvalue, hence one nontrivial eigenvector.How about this matrix? [[0, -1], [[1, 0]] >However, if v is a right eigenvector with eigenvalue \lambda, then v^T is a left eigenvector with eigenvalue \lambda, as well.A snippet of code producing a counterexample: import numpy as np import scipy.linalg as spla A = np.random.randn(3, 3) right_eigenvector = spla.eig(A)[:,0] right_eigenvalue = ((A @ right_eigenvector) / right_eigenvector) potential_left_eigenvector = right_eigenvector[np.newaxis, :] # if all components of the following are not the same, then it's not # a left eigenvector print((potential_left_eigenvector @ A) / potential_left_eigenvector) # prints [[-1.1327836 -0.j -0.14850693-0.j -1.84397691+0.j]] Your first matrix doesn't count, but the theorem is bogus: see my counterexample above.The correct theorem is that every complex square matrix has an eigenvector. If we interpret your matrix as a complex matrix, then it does have two eigenvectors, namely [1; ±i]. Hence why I’d say it kind of “doesn’t count.” The way to see it is to realise that the eigenvalues are the roots of the characteristic polynomial (which is not hard to prove). But by the fundamental theorem of algebra, every complex polynomial has a root (it does actually have n roots, where nxn is the dimension of the matrix, but those roots can all be the same). This also shows why the theorem is, in general, false if you restrict yourself to real numbers.(I also think that the counterexample exhibited "counts" in the way that it's intuitively very clear why it can't have a real eigenvector. The R^2 plane can be easily visualised, as opposed to C^2, and the matrix exhibited acts as a rotation of R^2, so of course there can't be an eigenvector. As you said, once you allow complex numbers, that's not true anymore, but at that point the visual intuition breaks down somewhat.) Oh. Oh my. I have studied linear algebra theory some. I have used it a ton. I have used eigenvectors to solve problems. I have never grokked them.This description changed that. Thank you! ssivark on Oct 22, 2020 [–] An eigenvector would correspond to the "stationary distribution" of the Markov transition matrix represented by the graph. (think "page rank") Was just coming in to point to this. While the geometric intuition in the post being discussed is quite visual, thinking of matrix elements as directed arrows generalizes easily to high-dimensional vector spaces, and also makes it very easy to understand the behavior of sparse transformation matrices, and motivates a nice correspondence between linear algebra and graph algorithms (ref. graphblas) This seems more accurate than the original post, which seems quite tied to this code/data analogy.OTOH, it is just vectors in a many-dimensional space. We're neigh-supernatural machines for describing vectors, that's how we throw things better than any other animal, I don't really know why we need analogies here. Why do we want x'x? It maps to a distance... Thanks for linking to this. I'd prefer stating that matrices actually don't mean anything in isolation. They are simply representations of linear maps. What happens to the elements of a matrix before the elements are formed is irrelevant to the matrix itself...And to be clear, I am not a purist. Purist test: aren't vectors just Nx1 matrices?Joking aside, matrices are linear maps in the context of multiplication. They can absolutely have meanings assigned outside the usual "when multiplied by a vector" type of calculations. For example the "adjacency matrix" representation of a graph is very often never multiplied by a vector even in academic algorithm descriptions. YMMV if you call it a (lookup) table in that context but the term "matrix" seems fairly well established there. > They can absolutely have meanings assigned outside the usual "when multiplied by a vector" type of calculations. For example the "adjacency matrix" representation of a graph is very often never multiplied by a vector even in academic algorithm descriptions.Not disputing your basic point, but I'm not sure "very often never" is entirely accurate. In the GraphBLAS API vector/matrix and matrix/vector multiplication is key for almost all algorithms that depend on breadth first search, this is because matrix/vector multiplication is a single step in a BFS which is the core operation of the GraphBLAS. The vector is typically used to hold the current frontier (and mask). See the "Shortest Path Lengths" example here:https://github.com/michelp/pygraphblas/blob/master/pygraphbl...(disclosure, I am the pygraphblas author).> Purist test: aren't vectors just Nx1 matrices?I like it. :) As an interesting point, the GraphBLAS defines both Matrix and Vector types, but internally in the SuiteSparse implementation, they are all just unified into one type and yes, vectors are Nx1 or 1xN. Wow thanks for that example. I don't think I've seen that trick before; now I want do do more with graphs again. I implemented flood fill for Adobe as an adapter on top of BFS (from the BGL), which was, itself, an adapter on top of BLAS (written in FORTRAN). It was somewhat faster than the handcoded assembly BFS they were using at the time. The compile times were killer, though. > the "adjacency matrix" representation of a graph is very often never multiplied by a vector even in academic algorithm descriptions.Not true. There is a whole field (graph signal processing) that studies graphs in terms of their description in matrix form (adjacency, connectivity and more). For example, you can find connected components by looking at the eigenvectors of a certain matrix. pmiller2 on Oct 22, 2020 [–] > For example the "adjacency matrix" representation of a graph is very often never multiplied by a vector even in academic algorithm descriptions.Unless you're doing spectral graph theory, which looks at properties of the eigenvalues and eigenvectors of the graph's adjacency matrix. Even then, you don't usually multiply the adjacency matrix by a vector explicitly. Today I learned! I'm fairly familiar with "regular" spectral analysis techniques, ie FFTs and DCTs, with some CWT waveletty goodness thrown in here and there but I never heard about spectral graph analysis. That's what you get for staying in EE land too long I guess. Fourier & Friends are excellent analytical tools for the nature of patterns, regardless of where they occur! In EE-land we think too much about real and complex numbers, where Fourier analysis is more concrete. But spectral graph theory makes a lot of sense when you think of Fourier analysis as a way of decomposing patterns into their constituent parts, it just looks a bit funny when those patterns aren't numbers. Indeed. By using tools of linear algebra on the adjacency matrices of a graph, you can tell a lot about the structure of the graph.For instance, if A is the adjacency matrix of a graph G, with n vertices (we say 'vertices' in the math department almost exclusively -- rarely do we say 'nodes'), and k edges, denoting the trace of A by Tr(A), we can find the number of triangles in A by computing Tr(A^3)/6. We can also tell if G is connected by computing A^n. OTOH, the spectrum doesn't tell you nearly everything you'd want to know about a graph. There are many examples of non-isomorphic graphs that are nonetheless cospectral. For instance, one of my professors from grad school wrote a landmark paper in spectral graph theory in 1973, showing that almost all trees are cospectral. ---: Of course, you'd probably never actually do this, because it's horribly inefficient, but here's how it goes.First, you prove easily by induction that A^c_{i,j} is the number of paths of length c starting at vertex i and ending at vertex j. With that in mind, we see that if there is a row of all nonzero numbers in any matrix A^c for c <= n, then G is connected. With a little more work, you can even extract all the connected components of G, by looking at submatrices of A^c.: Unsurprisingly, the paper is titled "Almost all trees are cospectral". See: https://www.semanticscholar.org/paper/Almost-all-trees-are-c... dnautics on Oct 22, 2020 [–] > Purist testThe short answer is. Yes, but if your PL treats it that way you will wind up in trouble .The long answer is: https://www.youtube.com/watch?v=C2RO34b_oPM (but seriously take the time to watch this, it's fantastic). Separately C- and often C-derived languages/frameworks like numpy also treat them as 1xN matrices, but that's a nearly tabs-vs-spaces level religious war, with actual consequences. I think a lot of people got confused when translating andrew ng's early octave-based ML coursera course (Nx1) to python (1xN), and also in the back and forth between math ML papers which use Nx1 and python. > numpy also treat them [vectors] as 1xN matricesI may be misunderstanding something, but in numpy at least a vector, a 1×N array ("row vector"), and a N×1 array ("column vector") behave differently. vec = numpy.array([1,2,3]) assert vec.shape == (3,) assert vec.ndim == 1 row = np.array([[1,2,3]]) assert row.shape == (1,3) assert row.ndim == 2 col = np.array([, , ]) assert col.shape == (3,1) assert col.ndim == 2 And a vector, unlike a 1×N or N×1 array, can be multiplied to either side of any matrix with compatible cardinality: assert all(vec @ numpy.eye(3) == numpy.array([1, 2, 3])) assert all(numpy.eye(3) @ vec == numpy.array([1, 2, 3])) Matlab / Octave, in contrast, forces the user to pick a 1×N or N×1 matrix to represent a vector. Its arrays are always ≥ 2D. So you're probably still right that people get confused translating this to Python. Just wanted to clarify the numpy side of things.The JuliaCon talk you linked does a great job of explaining what different fields mean by "vector"; thanks for linking it. I just tried this, I stand corrected, but I remember this (the last clause) not working in numpy. It's possible my memory is flawed: >>> import numpy as np >>> vec = np.array([1, 2, 3]) >>> mtx = np.eye(3) >>> np.dot(vec, mtx) array([1., 2., 3.]) >>> np.dot(mtx, vec) array([1., 2., 3.]) It's possible though that I was remembering not that it doesn't execute, but rather that if you do mess up the multiplication order and do it in the wrong direction with anything besides trivial diagonal matrices, you won't get a type error, you will get an incorrect but valid result, which is arguably worse than a type error, and way more frustrating, especially if start by testing eye, which is the easiest thing to do. I remember a problem similar to the one you describe when using a dot product both to substitute for a fast loop and for its normal mathematical purpose, in the same operation: points = np.array([[0, 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 0]]) M = np.array([[0.866, 0.5, 0], [-0.5, 0.866, 0], [0, 0, 1]]) # rotation points @ M # works M @ points # raises ValueError, also a different transform Regarding messing up the multiplication order, numpy by itself doesn't have the information to raise a relevant type error, sadly. As you probably know, based on your other comments, the important thing is whether the adjacent basis vectors (dimensions) match, not how they are written. That is, given M = M_ij (a_i ⊗ b_j), a vector v_i a_i must be multiplied on the left, a vector v_i b_i must be multiplied on the right, and a vector v_i c_i (with basis vector c) shouldn't be used at all. Numpy just does array math; it doesn't keep track of dimensions. Hopefully the code has accurate comments, or you're using a language with a decent type checker.However, I think you can track dimensional compatibility using xarray: import xarray as xr import numpy as np points = xr.DataArray([[0, 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 0]], dims=("idx", "a")) M = xr.DataArray([[0.154, 0.988, 0], [-0.988, 0.154, 0], [0, 0, 1]], dims=("a", "b")) # rotation points.dot(M) # works M.dot(points) # also works # M's a-dim is matched to point's a-dim, so both # results have same values regardless of # multiplication order, just transposed. assert np.all(points.dot(M).T == points.dot(M)) I only just tried xarray in response to your post, so not sure if there are pitfalls in practice, or if this is the best way to use it. Well I was doing graphics transformations on affine vectors, so they are all 4x4 and 4x1 vectors... You can't "know you are in the wrong order by checking dimensions" in that case. Besides needing to change the order of multiplication (y = xM vs y = Mx) and use the right transposition of the matrix, do any other actual consequences come from the choice of "row vector" vs "column vector"? Mathematically speaking, no. They are isomorphic duals. But ergonomics are important to prevent errors. I personally think Matrix-in-front is nice because it evokes the idea that the matrix is a function acting on the vector.In the real world, When you start doing matrix-vector, if you aren't careful with row-major vs column-major storage you can introduce performance regressions. Let's say the architecture can do 256 bit memory move into 4xf64 mmx registers, you have to think carefully about what is going where before proceeding. Some architectures might be able to do an arbitrary stride so that those f64s can be nonadjacent in a simultaneous move. I suspect not handling this correctly is part of what holds back really good fp8, fp16 performance, and I suspect for example the TPUs struggle with this.Of course optimizing for matrix-vector can be different for optimizing for matrix-matrix, since the column vs row memory access preference for the front matrix is exactly the opposite of the back matrix. scubbo on Oct 22, 2020 [–] > aren't vectors just Nx1 matrices?It's been a while since I formally studied maths, but - can you describe a situation in which this is _not_ true? (modulo transposition) In an N dimensional vector space, yes, you can always choose a basis and write vectors down that way, but it turns out to often be a bad way to work. For example, picture a surface. At any point on the surface, there is a plane of vectors tangent to that surface. It's unnatural to describe the vectors there in terms of a 2x1 matrix, since that requires the choice of a basis of the tangent space. And if you try to choose a basis of the tangent space, you'd probably want it to vary continuously from one point on the surface to the next, but that's not possible to do on many surfaces, such as a sphere, thanks to the Hairy Ball Theorem. crdrost on Oct 22, 2020 [–] So, I would say that there is a canonical isomorphism between them but that they are not quite the same.So, to take a step back, I'd say that a covector is a linear function from a vector input to a scalar output (those terms themselves being somewhat fungible, the space of "vectors" needs to probably form a module over the "scalars" which must themselves be a ring—the issue is that I would as a physicist include scalar fields but you can have two nonzero fields multiply to give a zero field, so they do not obey the field axioms).Then I would say that an [m n]-tensor is technically a multilinear function from m covectors and n vectors to a scalar. This means that a [0 1]-tensor is really exactly a covector, but a [1 0]-tensor is not really exactly a vector. One can manifestly be constructed from any vector v (take any covector c and produce the scalar c(v)) and we generally have some criteria that I forget which ensures that this is not just an injection but also a bijection implying some inverse, so that these [1 0] "co-covectors" are indeed isomorphic to the vectors. And then of course this becomes much easier in a metric space where we identify a canonical bijection between vectors and covectors called the "metric", this gives a much more direct way by which co-covectors are isomorphic to vectors.Hopefully that all makes sense. The situations in which this is not true are non-metric spaces where linear maps from (linear maps from vectors to scalars) to scalars are just a subset of the "apply to this vector" instances, but other linear maps are maybe also included.Then there is a more dramatic statement which I think requires the space to be Hausdorff-paracompact or something, which is the idea that of course we have the outer product which naturally combines an [a b]-tensor and a [c d] tensor into an [a+c, b+d]-tensor ("take these vectors and covectors, give them to the first tensor, take those vectors and covectors, give them to the second tensor, then take the two scalar outputs and multiply them together), and we claim that actually an [m n]-tensor can always be realized as a finite sum of outer products of m [1 0]-tensors with n [0 1]-tensors, which allows you to do arbitrary index contractions in a geometric way. Well if you think of matrices as linear maps then saying a vector is just an Nx1 matrix is like saying that a vector is just a linear map from a vector space of dimension 1 to a vector space of dimension N. It isn’t really a very useful thing to say Sure it's useful, if not earth-shattering - another way of saying it is that a vector maps scalars to vectors on the same line. joppy on Oct 23, 2020 [–] It’s more of a perspective and opinion thing. For example I would not say that an integer is its twos-complement representation, but I am very happy to hold onto that representation as a kind of key for the “real thing”.Similarly, in a vector space there can be abstract vectors, which exist no matter of what basis you choose to write them down. (Choosing a basis = choosing a way to encode them as a 1xN or Nx1 matrix).Often in the theory, general theorems can be proven very efficiently using these abstract vectors, but most of the time when one wants to actually do things to those vectors, picking a convenient basis and working with N coordinates is the best choice. bonoboTP on Oct 22, 2020 [–] You can do many linear algebra operations on the adjacency matrix. It's a proper matrix, in the full sense, not just a table of numbers. I think if in any context one uses the usual matrix algebra, it is the same linear map, just in a more abstract setting.If you tell me storing a lookup table is matrix, I think that is sort of out of context. But then again, I could be labelled a purist here, and I'm trying really hard to not fall into the trap. tobmlt on Oct 22, 2020 [–] Along with what others have mentioned, another application of adjacency matrix multiplication is in doing discrete calculus on meshes.Pushing one forms to two forms via exterior differentiation, for example. The chief operation is a matrix multiplication with said adjacency matrix. Joker_vD on Oct 22, 2020 [–] Yes, they pretty much are unless you're in the business of linear codes—then they're actually 1xN matrices. A pretty annoying convention for someone coming in from the general algebra: takes a time to stop mentally transposing everything. Vectors are Nx1 matrices only once you choose an ordered basis! alentist on Oct 23, 2020 [–] The nth power of an adjacency matrix yields the number of length-n walks from any vertex to any other vertex. andi999 on Oct 22, 2020 [–] And covectors 1xN matrices? Agreed. I think that the distinction between the objects and their possible representations should be made clear from the beginning, simply because it will make things easier to understand.IMO, just a couple of chapters from an introductory pure mathematics book is actually the most "to the point" place to start even for people who are not huge fans of math. Just the precise definition of a vector space over a field, a linear operator, a vector basis, the derivation of matrix elements, the usual related examples and some "gymnastics" on paper can make stuff "click" and be a great start for any intended application. When I took a course in linear algebra, we learned about linear transformations and vector spaces in the abstract first. Then when we got to matrices it was viewed as a way to represent linear transformations. Also it led to a natural derivation for matrix multiplication. If anyone is curious I've uploaded the half-page derivation found in Chapter 3 of Linear Algebra Done Right here. > However, sometimes the matrix being operated on is not a linear operation, but a set of vectors or data points.In Jeremy Kun's book  he argues that in fact data matrices _can_ be viewed as linear transformations. See e.g. section 10.9 on the SVD> That is, we’re saying the input is R3 and the basis vectors are people:...> By representing the ratings this way, we’re imposing the hypothesis that the _process_ of rating movies is linear in nature. ... I wholehearted endorse 3B1B, The linear algebra series is excellent, the rest of his work is too! In some applications, e.g. modelling something defined by PDE, you might need to solve a very large linear system of equations Ax=b, where the vector b has 100,000,000 elements, and A is a 100,000,000 by 100,000,000 element matrix. So for a naive dense matrix representation of A using 1 byte per entry you'd need around 9 petabytes of storage. In practise, for many interesting systems of equations that encode "local" properties, A will be a sparse matrix where the overwhelming majority of matrix entries are zero. There are various sparse matrix formats designed to avoid storing zeroes: the simplest is COO where you store triples of the form (i, j, A_ij) for all nonzero entries A_ij of A. More efficient formats are CSR or CSC ( compressed sparse row & column respectively) that use even less memory by avoiding storing repeated indices where there are runs of nonzero elements.To solve one of these huge linear systems Ax = b , we hardly ever want  to go to the expense of computing and storing an explicit matrix representation of the inverse A^{-1} . Instead, we merely need to have some function f that is able to map an input vector b to some output vector f(b)=x that happens to satisfy Ax=b .If the matrix A has sufficiently nice properties  we can use an efficient iterative method such as the conjugate gradient method to map the input vector b to an output vector x by solving Ax=b. To evaluate the conjugate gradient method, we don't need to necessarily have an explicit dense or sparse representation of the matrix A, we merely need a function that encodes how A acts on a vector by matrix multiplication. E.g. some function g that maps an input vector x to the output vector g(x) = Ax.So if we've got some appropriately defined conjugate gradient method algorithm cg(g, b), we can define the function f(b) := cg(g, b)Calling f(b) will return x that satisfies g(x) = b, i.e. f is the inverse of g. So now we're expressing linear algebra without having any explicit matrices anywhere: but our functions g and f are of course both linear operators that encode the action of matrix multiplication by A and A^{-1} respectively. Why not? It's a lot more effort to compute A^{-1} which could be used to evaluate A^{-1}b for any given b, but often we only want to know A^{-1}b for a specific b. Also, if the matrix A encodes some local property of a physical system , then A^{-1} will encode a global property -- so it will need to be a dense matrix. Sufficiently nice properties: if it's symmetric positive definite, which arises naturally in a bunch of situations. See eg https://en.m.wikipedia.org/wiki/Conjugate_gradient_method For the more machine learning-minded folks, this happens almost always in doing inference with Exact Gaussian Processes (GP), where because of the non-parametric nature of a GP model, the covariance matrix grows with the size of data points. The inference routine is cubic in the number of data points. Hence, sparsity in the posited covariance matrix is _extremely_ important for fast inference. And what happens if indeed the huge matrix elements are all non-zero? Like, let's say, satellite scan data for a given country when you want to spy on their underground systems (think North Korea facilities)? Wouldn't storing that data as COO would actually triple the amount of memory? Presumable the designer(s) of such a system will be able to know in advance whether the resulting matrix will be sparse or not and choose their encoding appropriately. FWIW, for a lot of practical applications the raw sensor data would be non-sparse but it would be transformed/filtered almost immediately into a more space-efficient representation. For example in the case you mentioned (unless you think the entire country is completely underdug by tunnels) most of the raw data can be deleted immediately with no loss of signal as there is no underground system underneath that particular spot of land. That's my point. In order to analyze you need to preserve entire data. One system is the image acquisition. Another is the analysis, on more advanced systems. The advanced system needs all data. Dismissing it at entry point defeats its purpose. No matter how you flip it, there are practical applications where huge matrices that have non-zero elements exists.Here is another example - cryptography analysis. Usually password length is small, usually below 1k characters. But if you increase the password length then simple old ciphers, like Vigenere, become harder to crack. Bump the key length to over 1 million bytes and suddenly your century old methods, like index of coincidence and Kasiski examination become a problem of having huge matrices to feed to analyzer. 1 million key length suddenly generates a matrix with 256M x 256M for analyzer. And all elements in that matrix are non zero. I was expecting this to be more about writing fast matrix multiplies. Surprisingly, it ends up being a lot more than just a triple loop and there's a reason that everyone uses existing libraries like OpenBLAS and MKL . Matrices are a fun way to stretch the mind. They are easy to grasp yet very dissimilar to numbers. Sometimes they can be used to represent numbers and we can use our matrix intuition to answer questions about regular numbers.You can define functions on matrices just like you define them on numbers. Multiplication is a function. The determinant is a function. You can even take derivatives of these functions. The derivative of the determinant function is written in terms of the trace of the matrix. Yet another matrix function. Who would have guessed that the derivative and trace are related in this way? There's really a lot of depth here. Don't forget the matrix exponential which might certainly blow the mind of a lot of people here. Matrices as exponent - who would have thought that possible! det(e^A) = e^(tr(A))Pure magic. Matrix vector multiply takes an n string and maps it to a k string via a matrix automata. Matrix matrix multiply maps p strings instead of one. We should be parsing on a GPU. If we're talking about 3D rendering then a matrix multiplication with a vector are just projections: A 3x3 matrix that transforms a vector is nothing else than that vector being projected onto the 3 axis which are inside the 3x3 matrix. This is very easy to see visually.A matrix times matrix multiplication (eg 3x3 times 3x3) is just projecting the axis of one matrix onto the other: expressing the coordinate system in terms of a different coordinate system.The 4x3 (or 4x4) matrix is needed for translation and 3d projection for a 3d triangle onto a 2d plane (screen pixels).For translation: it's just a hack since if you write it out it's a nice and a fast hack to simulate "movement".Then for projection you just hack the numbers in the matrix in such a way that you shrink things further apart from the origin.TLDR: for 3d graphics matrices are used because you can hack them to represent any kind of a transformation. The reason they are used is because graphics cards are fast at performing dot products. Plus matrix multiplication doesn't require trigonometry or division so it doesn't require the slower transcendental GPU instructions (sin, cos,...).For other uses their interpretations is different.I always hated the dull mathematic definition of "a matrix represents a linear map" because what the matrix actually represents is completely context dependent. The projection viewpoint for a rotation matrix using dot products is totally valid, but I think the column viewpoint is more intuitive.A 3x3 matrix can be viewed as something that acts on the standard unit basis vectors. The action on the standard unit basis vectors can be read out by looking at the columns. (A matrix times the ith standard basis vector is just the ith column.) Since the transform is linear, all the other points in space are carried along with the action as well. This is the "linear map" viewpoint of a rotation matrix.I show an example here https://biro.ai/we-learned-the-wrong-way-to-matrix-multiply/ where this viewpoint lets you instantly see what a rotation matrix does by looking at the columns. > I always hated the dull mathematic definition of "a matrix represents a linear map" because what the matrix actually represents is completely context dependent.The projection and translation matrices you mentioned represent restricted sets of linear maps in some vector space, but the meaning of the restriction is also captured in other mathematical structures. See affine spaces and projective spaces. Sure - remember those noisy dot matrix printers?.. Does anyone mind suggesting a solid intro to matrices in the context of what they are and how the affine transformations work? I know of these concepts but I think it's time to learn. OMG Thank you! I'm going through tons of linear algebra for a ML algorithm I need to understand and matrix multiplication is ... weird to say the least. I strongly recommend 3blue1brown's intro to linear algebra for quick intuitions, followed by Gilbert Strangs MIT Open Course for (much) deeper detail. Strang is a genius, but Sanderson's graphics and intuitions can be helpful early on. thanks!! I skimmed Strangs video just now and it looks super good, what a great suggestion! Matrix multiplication is like convolution. To be sure there isn't a perspective here that I may be unaware of, is this a reference to circulant matrices, which bear the Toeplitz sparsity structure? You might find something interesting in https://cr.yp.to/lineartime/multapps-20080515.pdf Maybe convolution on a discrete basis can be represented as a matrix multiplication with a circulant matrix? Can you write up a physical intuition for the usage of matrix? A rotation in space by an angle can be written as a matrix. What is some good intuition around matrix inverses? The matrix inverse “undoes whatever the matrix did” is pretty much the most straightforward interpretation, if you’re thinking of a matrix as performing some kind of rotation/shear/scaling of space.There are some other interpretations, for example if you think of a linear system Ax=b as a kind of “checking machine” (does x satisfy Ax=b? Yes/no, try again), then the inverse of A can produce solutions, since x=(A inverse)b. An inverse function? Search: