 Why is it important for a matrix to be square? (2018) 171 points by luu 4 months ago | hide | past | favorite | 60 comments An extension of spectral decomposition on a square matrix is called the singular value decomposition where any matrix can be decomposed. The trick is to diagonalize the matrix A A^T and A^T A, in this manner we can find the left and right singular values and have an equation which is very similar to the spectral decomposition. It has a very wide range of applicibility--most of which is also highly valuable.There's also the Moore-Penrose pseudoinverse which also has plenty of applicaitons and works on non-square matrices. I would not call the SVD an "extension" of the eigen decomposition. The SVD is also defined for square matrices and is not equal to the eigen decomposition except for the case of symmetric matrices.This leads to the question: for a non-symmetric square diagonalizable matrix, what is the difference between the eigenvalues and the singular values? A good heuristic for applications is that the eigenvalues capture the asymptotic behavior of repeating the map, i.e. the behavior of A^k as k goes to infinity, while the singular values capture the transient behavior of applying the map once.For example, consider the matrix`````` 0.9 x 0 0.9. `````` As x goes to infinity, its largest singular value becomes infinity, while its eigenvalues remain smaller than one. This means that there exists a vector (0 1) whose image under A is huge, but whose image under A^k still approaches the zero vector for large k.A good book on these topics is "Spectra and Pseudospectra" by Trefethen and Embree. Good point, while I don't need this clarification others might. I shouldn't have implied that they are the same for square matrices implicitly, hence my failed attempt at using "similar".However, they're still an extension in the sense that singular values do correspond to (square of the) eigenvalues of the square matrix A A^T and A^T A. Yeah I guess "extension" is not a precisely defined word. I wanted to make sure other readers didn't interpret like "generalization", meaning that the original narrow concept is precisely a special case of the extension. Yeah, I wouldn't have used extension for a math audience, probably just "related concept", or "application of". In the software context it usually means the opposite--something specific that makes use of some existing functionality. A digression is that what multiplying A and A^T really means, in the sense that A represents a map of spaces, but A^T represents a map in the opposite direction of the dual spaces. In finite dimension or in topological vector spaces, a space is isomorphic with its dual, but not naturalliy so. It is naturally isomorphic to the double dual! as in Mac Lane's CWM. So the multiplication is doable with an apparent excess good luck. On the other hand, in , linear adjoint operators bewteen two topological vector spaces are shown equivalent, up to an scalar, to adjoint functor pairs showing a Galois connection between the posetal categories being the lattice of subspaces of the original two spaces. This story should continue. Yeah. I don't know why LA courses focus so much on eigenvalues, and don't talk much about SVD. It seems like an obvious next step once you know how to compute eigenvalues, but SVDs are always covered in passing and superficially (at least in courses I've taken).I find it fascinating that A = U∑V^T works on ANY matrix A, with U and V being made of orthogonal vectors (the left and right singular vectors of A)... it's like saying that all matrices are scalings and projections when viewed with appropriate bases for the input and output spaces (U and V).> we can find the left and right singular valuesInterestingly the (nonzero subset) of the left and right singular values happen to be the same... I don't have a useful intuition about how to explain this. Does anyone know why A A^T and A^T A have the same eigenvalues? > Yeah. I don't know why LA courses focus so much on eigenvalues, and don't talk much about SVD.It makes sense to talk about the eigenvalues of an operator on any vector space. But in order for singular values to make sense the vector space needs to have an inner product.So eigenvalues are more general because they make sense even when you haven't chosen an inner product on your vector space.From an HN perspective this kind of consideration is relevant when working on a data set in which there are some arbitrary units. You don't want your model predicting house prices to work differently depending on whether you choose to measure floor size in m^2 or ft^2. Could you elaborate on why SVD requires an inner product structure (choice of scale) while eigenvalues do not? Do you just mean that the input and output vector spaces might generically have different units, and so one would need to use some sensible "normalized" units? The SVD breaks a linear transformation into 3 parts:(1) A rotation in the source space, (2) an axis-aligned non-uniform scaling (if necessary adding/removing dimensions to match the destination space), and (3) another rotation in the destination space.Rotation is only a meaningful concept when you have a metrical structure (think of the full geometry of Euclid’s Elements including circles, perpendicularity, lengths, angles), so if you don’t have a metrical structure the SVD is likewise not really meaningful.Often spaces we deal with using linear algebra only have an affine structure (parallelism is well defined, and lengths can be compared when they are along parallel lines), but do not have a metrical structure (so there is no meaningful way to compare lengths pointed in different directions or measure angles). For example: there is no meaningful concept of the angle between the directions of 5 miles/gallon and 10 miles/gallon, and if you arbitrarily defined one, it would change when you switched from gallons to milliliters or from miles to meters.In practice people still often arbitrarily impose a metrical structure, sometimes based on some heuristic analysis of the data involved, and then use tools like the SVD based on that. There's probably more than one perspective on this. Here's one: a standard way to approach SVD is to start with a matrix A and consider A^TA and AA^T. Since the latter is symmetric (or hermitian if on a complex vector space, where transpose is replaced by adjoint), by the spectral theorem it has real eigenvalues and orthonormal eigenvectors, from which the SVD can be deduced.But the definition of A^T or A^* depends on a particular choice of basis, and is not a coordinate-invariant concept. If one had an inner product <,>, the adjoint can be defined in a coordinate-independent way using that inner product: A^* is the operator defined by = for all vectors u and v. (One can check that A^ is uniquely defined.) For a different choice of inner product, one gets a different A^.Note that the operator A^TA comes up naturally when one considers what the action of A does to the length of vectors:|u|^2 = |Au|^2 = = Geometrically, the SVD tells us how a linear transformation dilates or contracts space in different directions, e.g., think about what a linear transformation does to a sphere. But all these concepts -- length of vectors, spheres, ellipsoids (images of spheres under linear transformations) -- depend on a choice of inner product. Singular values of a transform are defined using the "operator norm" of the transform.The norm is defined in terms of the inner product.On the other hand, eigenvalues are defined with using only the transform and scalar multiplication, which does not require an inner product. I'm not sure you need to do that (define singular values via the operator norm), but one thing I don't see a solution for is getting rid of the requirement of the inner product for the definition of the adjoint. The really short answer is that (one off) the definition(s) of SVDs uses the transpose, which requires a inner product. Any two matrices AB and BA have same eigen values. Say v is an eigen vector of AB with eigen value lambda : you have ABv = lambda v. Now BA(Bv) = B(ABv) = B(lambda v) = lambda*(Bv). So, Bv is an eigen vector of BA with eigen value lambda. Trefethen and Bau's Numerical Linear Algebra (http://people.maths.ox.ac.uk/trefethen/text.html) introduces SVD up-front, before discussing how to compute anything. SVDs are the entry point to topic models, e.g. NNMF. Super interesting and useful for modeling large datasets efficiently. Gilbert Strang covers a this well in his "18.065 - Matrix Methods in Data Analysis, Signal Processing, and Machine Learning" course . I can highly recommend the lecture series to anyone interested in such topics. curious, what are some cool application of SVD? Not the OP, but I've got some links for you:1. https://en.wikipedia.org/wiki/Singular_value_decomposition#A... of which the "Low-rank matrix approximation" is the most important one (it's like looking inside the matrix, seeing its significant components, and zeroing out the remaining ones to save space). See also PCA in statistics.2. 1976 video about SVD https://www.youtube.com/watch?v=R9UoFyqJca8 that shows a visualization for an algorithm for how to compute it.3. Good two-part blog post series https://jeremykun.com/2016/04/18/singular-value-decompositio... https://jeremykun.com/2016/05/16/singular-value-decompositio... Not an application, but this my favorite explanation of SVD:http://gregorygundersen.com/blog/2018/12/10/svd/It's such an important operation that I'd say understanding it changes how you understand a lot of linear algebra. That Kun post is also good. Latent semantic indexing uses SVD to identify relationships between words in unstructured text.You can use it to search for words and find related texts even though those texts do not contain the actual words you searched for. Or you can use it to find similar texts, even though important words may differ.Not sure how relevant LSI is these days, not my field at all, but mapping words to vector spaces and using SVD like this kinda blew my mind a bit when I stumbled upon it many years ago. I’ve used to on neural spike data before. If your responses are not too “messy”, you can estimate the part of the response that was due to a stimulus instead of just noise, pretty well. Very useful for working with MIMO (multi input multi output) control systems. Off the top of my head,- Principal component analysis- Fitting a plane to a set of points- Linear least squares I work with scientific data but have to say I have a very shallow math background and also forgotten most what I learnt at school/uni. However whenever I click open the source for scikit-learn / high level code published in papers I see SVD. A lot of the scenarios in biology have been abstracted to be matrix manipulation which is fascinating and I really need to learn. All of these examples are equivalent to least squares :) Yeah, for the most part all of engineering is equivalent to least squares I'd say, there's always some non-linear optimization procedure that uses a norm^2 metric since it's so well studied and solved already. It's disappointing as a mathematician, but the rest of me is fine with it :) How is PCA the same as least squares? I've always understood it as the eigendecomposition of the covariance matrix. If you're looking for a more intuitive understanding of linear algebra, I highly recommend 3blue1brown's YouTube series, "Essence of Linear Algebra." Absolutely this. If I remember right, it's roughly 13 videos of 20-30 minutes each. By the end I had a better understanding of what linear algebra was for and (conceptually) how the math actually worked than I got from a full semester at college. Another evidence that linear algebra concepts are terribly confusing when layed out without geometric background.This had been criticised for decades. Competently and popularly presented criticism goes back to at least as far as 1957 (Artin's Geometric Algebra; see discussion of determinants somewhere near the beginning) but linear algebra is still often presented decoupled from geometry.I wonder though if there's purely algebraic approach to matrices that explains as much (or more) as geometric one. Maybe approaching algebra of matrices consistently as an example of category algebra could be illuminating. I learned abstract linear algebra first, and didn't learn geometrical interpretations until I taught myself many years later so I could write an asteroids clone in svg.I don't think it was a pedagogically a problem, except I couldn't bring myself to care about matrices when I was learning them... It was very easy to take my abstract knowledge and apply it, and for me it might have been harder the other way around.In retrospect a hybrid applied/theoretical topic (like reed Solomon encoding and recovery) might have perked me up. But I might have also been a strange case. If by “abstract” linear algebra you mean “course that starts with the definition of vector space”, then it is geometric enough in the sense Artin talks about.It is pedagogically a problem for people who ask questions like the one in topic. But it can also be a problem when one encounters bilinear form and linear operator in practice but can't distingish between the two. I can't think of a specific example but I was asked once about some problem (in electrical engineering, iirc) where the source of confusion was this; some transformations of a (square) matrix were natural while others were not.Some people feel strongly about the topic—mostly those with “pure math” inclinations. I think the answer here is simple and not as profound as people might think. Square matrices are linear operators which map an input vector into a space with the same dimensionality. We can characterize a few interesting symmetries when a matrix happens to be square.As such, we have more rules and theorems around them, making them suitable for practical usage. For example, physics deeply utilizes Hermitian matrices, which are a special subclass of square matrices. Spectral theory is incredibly important in all sorts of applications. Square matrices are also a natural description for graphs (e.g. adjacency matrices), which are fundamental objects for pretty much all CS applications across all layers of your tech stack. Square matrices are great for representing pair-wise interactions between nodes. In the context of computer science, matrices are used as a representation construct with different possible formal interpretations which frequently can be recognized depending on the supported operations. For example, some functions of NumPy assume that matrices represent elements of linear algebra while other functions treat a matrix as a collection of rows or columns. In the latter case, a square matrix is probably an exception because rows and columns have completely different semantics. The question as asked in the title is incorrect. It is not important for a matrix to be square. Matrices are defined to solve specific problems, and the specific problem is what determines the geometry, not some goal of having square matrices.However the person actually asked why it is important for matrices to be square for most of the theorems they are learning. The actual answers to this are interesting, but the presence of this question likely implies that they are being taught abstract theory first prior to building much intuition.I was taught this way as well in my advanced pure math course. It was all super abstract until I was studying for the final exam and then had this eureka moment where suddenly everything made sense (a matrix is just a numerical way of describing a linear transform! And computing eigenvectors is like factoring!). Sadly, this happened again the next term where we derived SVD - except this time it never made sense to me until a later on course where we needed to use SVD for some application. The reason the matrix A has to be square for det(A-XI_n) is a) because we cannot subtract a matrix A from another matrix B if their dimensions aren’t the same and since XI_n is a nxn matrix A has to be an nxn matrix as well and b) because the determinant is only defined for square matrices. V.I. Arnold said: "The determinant of a matrix is an (oriented) volume of the parallelepiped whose edges are its columns. If the students are told this secret (which is carefully hidden in the purified algebraic education), then the whole theory of determinants becomes a clear chapter of the theory of poly-linear forms. If determinants are defined otherwise, then any sensible person will forever hate all the determinants, Jacobians and the implicit function theorem." TLDR: the author is asking an honest question, not trying to say that non square matrices have little use in linear algebra.As linear transformations of space into itself, a very frequent operation, are described by a square matrix those matrices do show up more frequently. But map X into Y of different dimensions and you get a non-square matrix I think the key is that linear algebra textbooks do not tell students straight away that matrices are encodings of linear maps. Hence a square matrix is an encoding of a map that preserves dimensions.Axler, and a few others, say this straight away in the intro. This makes things much simpler to understand instead of developing linear algebra historically from systems of linear equations. I happened to be the other way, I learned first that matrices were linear transformations and, being thick-headed, it took me a really long time to understand that matrices could also be used as "dumb excel spreadsheets".there's a fantastic historiography of linear algebra and why we're stuck in this situation here: https://www.youtube.com/watch?v=C2RO34b_oPM A worse situation is that of statistics, where many textbooks insist on using approximations to the normal which obscure concepts and are not needed in the age of computer-based inference .Calculus is also a bit confusing as for historical reasons we use a lot of notation from Newton-Leibniz informal infinitesimal approach, but we try to teach Bolzano-Weirstrass formal analysis, ending up with a confusing potpurri. I didn't mind the issue in calculus. The first thing we did in analysis after being introduced the Bolzano weierstrass form was to prove that it was equivalent to epsilon delta proofs. > But map X into Y of different dimensions and you get a non-square matrixThis is true, but it's also something that is arguably impossible on the one hand and "unwise" on the other.You can easily map a higher-dimensional space into a lower-dimensional space, but you will irretrievably lose a lot of information when you do so.And in the other direction, you can't actually map a lower-dimensional space into a higher-dimensional space with a matrix. The image of X can never have more dimensions than X does -- the choice to represent it with Y > X dimensions is just that, a representational choice. This idea is only meaningful in terms of the semantics behind the representation. > This is true, but it's also something that is arguably impossible on the one hand and "unwise" on the other.If I understand "unwise" to refer to a linear map into a lower-dimensional space: of course that's something you'll often want to do! Suppose, for example, that most of the interesting structure of your data is close to lying on a n-dimensional subspace of R^m, with n And in the other direction, you can't actually map a lower-dimensional space into a higher-dimensional space with a matrix. The image of X can never have more dimensions than X does -- the choice to represent it with Y > X dimensions is just that, a representational choice.Any matrix involves a choice (of basis), so that complaint is moot. For sure you can have a linear map whose codomain is higher-dimensional. You are correct that the image can't be higher-dimensional, but that doesn't prevent the existence of such a map. Sometimes losing that information is very useful however, to arrive at a representation that has a specific meaning.For example, an x-ray computed tomography (CT) image volume in 3D may be projected into various 2D synthetic planar x-ray projections (digitally reconstructed radiographs).There are countless situations where projections are very useful. In fact, the 3D CT image itself is reconstructed from projections! See https://en.wikipedia.org/wiki/Tomographic_reconstruction Yes, but this is a case where you have three dimensions in the input and also three dimensions in the output. (You're assembling the 3D image from different 2D slices with different depth coordinates.)The third dimension is discretized while the other two are continuous; the reconstruction consists of smoothing out the third dimension. In the real world, all of the dimensions are discretized. The input dimensions are theta (gantry angle), n (detector channel), and z (table position). The output dimensions are x, y, z. For a fixed z, the plot of projection intensity as a function of theta and n is called a "sinogram" (which is indeed a 2D space) which gets reconstructed to form an image (also 2D). It is true that there are three dimensions of raw data and three dimensions of reconstructed image. However, due to various tricks with reconstruction models, the total number of samples does not have to be the same in the raw data and in the output. As a result you can see recon models employing nonsquare matrices. For more information, you can read about methods like iterative reconstruction, compressed sensing, and differentiated backprojection. This description is adequate for axial tomography, whereas other geometries like cone beam tomography are more complicated (see FDK method). I don't disagree, but I think my comment helps to illustrate a reason why square matrices really are more special than other matrices. It s not more important! And most of these theorems can be extended to non square matrix anyway. Projections are examples of Rn to Rm applications that are very importants. Coordinates or probability measures too. And it make little sense to try to reduce them to endomorphism. Matrix can represent applications but they are objects worth studying for what they are. The most important example of a non-square matrix is a (column) vector. An mxn matrix can be thought of as a mapping or function from one subspace Rn and another Rm. If n=m and the matrix is square, it is possible to have an inverse mapping, (inverse matrix). Well, because matrices are "weird" mathematical objects. They're a composition of vectors (which make more sense than matrices) but it seems the square matrices are where you want to end up and non-square ones are kinda like special cases.If you're solving a linear system of equations for example: you need a square matrix. If you don't either your system is underdefined or you have redundancies or contradictions into it.(And one could say algebra with numbers is a special case of a 1x1 matrix, but anyway...)So worry about square matrices, about 1xN and Nx1 matrices (which are actually vectores) and the weird shapes are weird shapes What? They're a convenient device for writing down a linear map between finite-dimensional vector spaces. Nothing more, nothing less.Your explanation is bound to confuse people far more than help, as it seems to mystify something that is very mundane. Yes they're mundane, but the "linear map between different vector spaces" case is pretty much the only thing they can do.Think of it as type casting in programming languages, you only do it when you need, but the real work is done processing elements of the same type.If that wasn't the case, then non-square matrices would have all the "cool" properties of the square ones.Yes I might be picking at straws here, and I agree it might confuse people even more, but this weirdness of matrices hasn't escaped me since I've learned about it. And then when you get to vectors it all makes sense. Hum... > Yes they're mundane, but the "linear map between different vector spaces" case is pretty much the only thing they can do.Of course it's all they "can do". They're merely a way of writing down linear maps for a given choice of bases. Yes. Then they're a "different" object than a square matrix or a vector.My point is, the sub-algebra of all square matrices supports more operations than mixing matrices of different sizes. Search: