I wish there was a Missing Semester of Linear Algebra course to help people go from "Okay I have a course or two in linear algebra, I know what span, vectors, basis and dimension mean, the formal definition of an inner product space, and I can do Gauss-Jordan elimination, determinants and eigenvalues for small matrices with paper and pencil" to "I have a 100x100 matrix of noisy data from sensors and this research paper I found tells me I can do some fantastic stuff if I compute such-and-such involving eigenvalues or inverses or whatnot. Or maybe I have a process with 1000 states where I know the probability objects move from state i to state j for each pair (i, j) and I want to find the steady state. How do I wrangle numpy into doing what I need?"
MIT has a course called The Missing Semester of Your CS Education [1]. It tells you about practical stuff that you need to know but isn't really taught in classes (shells, version control, build systems, package managers, VM's).
There needs to be something similar for linear algebra; it seems like there's a lot of folk knowledge and a big gap between what typical undergrad courses train you to do and what you encounter in actual practical problems.
(And don't get me started on all the weird linear algebra stuff they have going on in e.g. quantum physics.)
Stanford has ENGR108 [1] based on the freely available book "Introduction to Applied Linear Algebra – Vectors, Matrices, and Least Squares" [2] by Stephen Boyd, with video lectures [3] available. EE263 [4] is sort of a continuation of this at a more advanced level; it was also originally developed by Boyd and also has video lectures available [5].
I don’t know your age but I expect you’re a bit older! This has changed dramatically in the past 20-30 years and has been a contentious issue for the past few in the computer science education community. Many programs, if they even offer numerical analysis coursework, provide it to supplement regular analysis coursework (rather than computer science coursework).
Do you have any more context on the contentiousness of including numerical analysis in a traditional cs curriculum? I’m curious to know what the arguments are on both sides.
I’ve also recently noticed that programs and students outside North America seem to take the content much more seriously.
I do know that this course is still offered widely across the US, but maybe not within CS, and maybe not as a mandatory course.
But, bottom line: if you're interested, it's very likely you can take it - you may have to check your school of engineering or applied math department.
check out prof. steve brunton’s youtube channel[0] — it’s light on the numpy side of things, but he’s got a lot of material going from first principles to dealing with huge matrices, compressed sensing, dynamical systems, ML, etc
Would the libraries take care of most of the algorithms mentioned in Horn's book? I was wondering if there's something in between: it goes beyond basic linear algebra, but it uses the numerical libraries to process the large matrices to solve complex problems.
It'd be nice to see some meat on the bones and a few ripping yarns about the application end of applied math techniques .. eg: forming an enhanced image from tens (or hundreds) of thousands of multichannel spectral samples using a sensitivity adjusted SVD, and then removing the most common expected background to highlight the anomalies.
It's dry stuff in Linear Algebra, somewhat more exciting when searching for nuclear weapons in a forest or gold in a desert.
Why is Newton’s Method not used more often in ML? I know Newton’s method requires the 2nd derivative while the Gradient Descent family of algorithms only requires the first, but shouldn’t Newton’s Method be just as straightforward when using autodifferentiation, since the process is just a computational graph transform using a lookup table?
With Newton's method, you'll be solving Hx=g (H = Hessian of f, g = gradient of f) at each iteration. For a large number of variables N, building H is of order N^2 and solving Hx = g is of order N^3 with a standard solver. N^2 and N^3 are really large for an already large N. I believe the reason is as simple as that. It isn't that it is tedious and difficult to write down the formulas or the code to compute H. It's just too costly, computationally speaking. There is also an increased memory cost (possibly having to store H).
When people do have ways to go around this problem, they do use Newton's method for large scale problems.
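To make the cost concrete, here's a minimal numpy sketch (a random SPD matrix stands in for a real Hessian, step sizes are arbitrary): the gradient step only ever touches g, while the Newton step has to store the N x N matrix H and run a dense O(N^3) solve.

    import numpy as np

    N = 2000
    rng = np.random.default_rng(0)
    H = rng.standard_normal((N, N))
    H = H @ H.T + N * np.eye(N)        # symmetric positive definite "Hessian": O(N^2) storage
    g = rng.standard_normal(N)         # "gradient"

    gd_step = 1e-3 * g                    # gradient descent step: O(N) work
    newton_step = np.linalg.solve(H, g)   # Newton step: O(N^3) dense solve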
Adding to the answers already given (complexity related to the computation of full Hessians). First order methods, i.e., methods that only require gradients, are the methods of choice when extreme solution accuracy is not required, which is the case in many practical settings (including ML).
One of Newton's method's major selling points is that once it's close to a local minimum, under the right smoothness assumptions it essentially converges faster and faster to an exact optimizer - in practice you get "one more correct significant digit each iteration" once you are close enough. This is the region of quadratic convergence of Newton's method, see [0] Theorem 14.1, p.3
Adding to the sibling post from phao: optimizers like Adam essentially do a diagonal approximation to Newton's method, which gives some, though certainly not all, of the benefit of Newton.
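For what it's worth, here's a stripped-down sketch of that view (bias correction omitted, hyperparameters just the usual defaults): the per-coordinate 1/sqrt(v) scaling is what plays the role of a diagonal inverse-Hessian estimate.

    import numpy as np

    def adam_like_step(theta, grad, m, v, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
        m = b1 * m + (1 - b1) * grad        # running mean of gradients
        v = b2 * v + (1 - b2) * grad**2     # running mean of squared gradients (diagonal curvature proxy)
        theta = theta - lr * m / (np.sqrt(v) + eps)   # diagonally preconditioned step
        return theta, m, v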
There is the seminal paper "Deep Learning via Hessian-free Optimization" [1] that applies Newton's method (Hessian free because of a clever trick of solving Hx=-g iteratively without explicitly constructing the Hessian, and getting an approximation without full O(n^3)) - but n^2 or n^3 is just too large in practice when the diagonal approximations work just fine.
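Roughly, the trick looks like this (sketch only: grad_f is a hypothetical gradient function, and a finite-difference Hessian-vector product stands in for the exact autodiff product the paper uses):

    import numpy as np
    from scipy.sparse.linalg import LinearOperator, cg

    def hessian_free_step(grad_f, w, eps=1e-6):
        """Approximately solve H x = -g using only Hessian-vector products."""
        g = grad_f(w)

        def hvp(v):
            v = np.ravel(v)
            # H v ~= (grad(w + eps*v) - grad(w)) / eps  (finite-difference stand-in)
            return (grad_f(w + eps * v) - g) / eps

        H = LinearOperator((w.size, w.size), matvec=hvp)
        step, info = cg(H, -g)   # conjugate gradients: H is never formed explicitly
        return step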
The other obvious reason to avoid inverses is that they're only defined for square matrices, whereas LU decomposition works on general rectangular matrices (or rather PLU decomposition, same basic idea).
Inverting can also destroy structure you might want to keep around. For example, the LU factorization of a banded matrix will still be banded (LAPACK will do partial pivoting, which increases the bandwidth, but it'll only roughly double the number of super-diagonals in U), while the inverse is a full matrix.
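Easy to see on a toy tridiagonal matrix (quick scipy sketch, not from the article):

    import numpy as np
    from scipy.linalg import lu

    n = 8
    A = np.diag(2.0 * np.ones(n)) + np.diag(-np.ones(n - 1), 1) + np.diag(-np.ones(n - 1), -1)

    P, L, U = lu(A)
    print(np.count_nonzero(L), np.count_nonzero(U))   # O(n) nonzeros each: factors stay banded
    print(np.count_nonzero(np.linalg.inv(A)))          # n^2 nonzeros: structure destroyed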
E.g. Goodfellow et al committed an even worse version of this sin in the Deep Learning book when they claimed the condition number for a square (but not necessarily normal) matrix is defined in terms of eigenvalues. This is false, but nevertheless see 4.2 in https://www.deeplearningbook.org/contents/numerical.html . When I've raised this with people in real life, I typically get some reflexive response that it should be a useful approximation, but as this blog points out, that isn't true either.
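A two-line counterexample for anyone who wants to check it themselves: a non-normal matrix whose eigenvalue ratio is exactly 1 but whose actual condition number (ratio of singular values) is enormous.

    import numpy as np

    A = np.array([[1.0, 1e6],
                  [0.0, 1.0]])

    eigs = np.linalg.eigvals(A)
    print(np.abs(eigs).max() / np.abs(eigs).min())   # 1.0 -- eigenvalues say "well conditioned"
    print(np.linalg.cond(A))                          # ~1e12 -- singular values say otherwise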
Not sure I 100% agree with (2) forming A^TA.
In many real-world use cases A^TA + Cholesky is going to be considerably faster than QR on A, and come with few numerical consequences. Even in the numerically challenging cases, pivoted LDL^T on the saddle point system is still going to be faster than the suggested solution of doing QR on A. (Essentially no optimization solver I've ever seen uses QR.)
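For the tall-skinny least-squares case this describes, a rough sketch (with the known caveat that forming A^TA squares the condition number):

    import numpy as np
    from scipy.linalg import cho_factor, cho_solve

    rng = np.random.default_rng(0)
    A = rng.standard_normal((10000, 200))   # tall-skinny
    b = rng.standard_normal(10000)

    # Normal equations + Cholesky: one O(m n^2) product, one O(n^3) factorization.
    c, low = cho_factor(A.T @ A)
    x_chol = cho_solve((c, low), A.T @ b)

    # Orthogonal-factorization-style reference (what the article recommends instead).
    x_ref, *_ = np.linalg.lstsq(A, b, rcond=None)
    print(np.linalg.norm(x_chol - x_ref))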
Love the post. I'll take this opportunity to link to a favorite classic linear algebra paper in a similar vein: "Nineteen Dubious Ways to Compute the Exponential of a Matrix" [1]
Possibly only a partial sin since I use the Moore-Penrose pseudoinverse and L2 ridge regularization. This sin is being committed as a consequence of committing the next sin.
> 2. Forming the Cross-Product Matrix A^TA
Yes, yes, this is potentially numerically bad. However, in practice, as long as you are careful with scaling it's perfectly fine and enables better parallel computation.
> 7. Using Eigenvalues to Estimate Conditioning
This sin is the only way I actually knew, so after reading this I realize I need to read more on this.
The last of the 4, which obviously is an avoidable sin.
> 3. Evaluating Matrix Products in an Inefficient Order
I definitely have code that should be changed. It’s also on my todo list now to audit a specific routine that I suspect can be fixed. This one is just stupid; it can easily be avoided.
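The canonical instance of this one, in case anyone wants to see the difference for themselves:

    import numpy as np

    # Sin 3 in its simplest form: (A @ B) @ v is O(n^3), A @ (B @ v) is O(n^2).
    n = 3000
    rng = np.random.default_rng(0)
    A = rng.standard_normal((n, n))
    B = rng.standard_normal((n, n))
    v = rng.standard_normal(n)

    slow = (A @ B) @ v     # materializes an n x n product first
    fast = A @ (B @ v)     # two matrix-vector products
    print(np.allclose(slow, fast))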
Putting this out to invoke Cunningham's Law... my intuition says that while the article may be right about matrices in the real numbers, using the eigenvalues to check for closeness to singularity may be more valid on the floats, because probably what you're testing for isn't "closeness to singularity" but how close you are to having floating point failures, and that seems at least likely to be heavily correlated.
I now sit back and wait for someone to explain why this is wrong while I act like this was an entirely unanticipated result of my post.
Numerical algorithms have always been a fascination of mine and I spent quite a bit of time studying them. Linear algebra has always seemed to provide some of the richest content (followed by differential equations, imo). To me linear algebra was so dry when done on paper, but suddenly a new world opened up when I could use computers.
nit: in "5. Not Exploiting Structure in the Matrix", the author says circulant matrices can be solved in O(n log_2 n) operations, where log_2 means "base 2" log. This notation is unnecessary since different bases of logs (as long as the bases are constant and independent of n) only differ in a constant factor, the _2 is insignificant in the big O notation, so it's just O(n log n)
IMO 9 is the fault of the language rather than the programmer. Julia solves this one by having types that can represent conjugates, transposes, and adjoints of arbitrary matrices lazily.
I think they are talking about explicitly transposing it in memory here. A "transpose operation flag" is a pretty standard feature for linear algebra libraries.
If someone is writing a higher level object-oriented linear algebra environment, having the matrices carry around a "transposed" flag that they can pass to BLAS seems like a reasonable thing to do. Since, other than in Julia and Matlab, matrix stuff is usually an add-on, we can't really blame the language for this decision IMO.
A downside could be -- usually your underlying tuned library will be BLAS and LAPACK, which don't accept a 'transposed' flag for every single operation. So, from the user point of view, it could be kind of confusing -- "when I transpose a matrix and then go on to multiply, the transpose is free. But if I transpose a matrix and then go on to hit it with QR, it for some reason incurs this weird extra cost -- not where I do the operation, but later, in the QR."
and in python/numpy, transpose is an O(1) operation: it doesn't actually move any data, just changes the strides used to access data points in the underlying buffer.
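You can see it directly: the transpose shares the buffer and just swaps the strides.

    import numpy as np

    a = np.arange(6.0).reshape(2, 3)
    at = a.T
    print(np.shares_memory(a, at))   # True: no data was copied
    print(a.strides, at.strides)     # (24, 8) vs (8, 24): only the strides swapped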
... if you only use the transpose once. If instead it is going to be used multiple times, explicitly computing the transpose can be a huge performance boost.
For dense matrices, it is typically used to exploit memory locality (i.e. to be prefetch- and cache-friendly).
For sparse matrices (your point 8), the advantage can be even more pronounced, sometimes the difference between being able to exploit sparsity, or not.
> 9. Transposing a matrix (Like the inverse A^{-1}, the explicit transpose A^t is often not needed)
Most matrix libraries should make transpose, conjugate, and conjugate transpose just a matter of twiddling a bit in the internal representation--BLAS routines have a parameter saying whether the input matrix should be transposed and/or conjugated before doing an operation.
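e.g. through scipy's BLAS wrappers (sketch: dgemm computes alpha * op(A) @ op(B), and trans_a/trans_b pick op without materializing anything):

    import numpy as np
    from scipy.linalg import blas

    A = np.asfortranarray(np.random.rand(3, 5))
    B = np.asfortranarray(np.random.rand(3, 4))

    C = blas.dgemm(alpha=1.0, a=A, b=B, trans_a=True)   # computes A^T @ B with no explicit transpose
    print(np.allclose(C, A.T @ B))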
Doesn’t memory layout still matter? Assumedly matrix-vector products like Av are fastest if the rows of A are contiguous in memory, and vector-matrix products vA are fastest when the columns are contiguous in memory. So sometimes one would really want to actually transpose the matrix, rather than reinterpreting the indexing scheme.
Materializing the transpose of a matrix is one of the most common and useful operations at the start of a large-scale data processing system.
People who say "transpose is an O(1) operation because it just creates a view" aren't including the important detail of caches and access patterns impact on performance.
Per the first sin, is there any alternative for computing the view matrix from a camera matrix in 3d graphics? This is a case where inversion "feels" appropriate to me.
He is mostly talking about large-scale computational linear algebra problems: the "computational intensity" comes from having really large matrices (kxk for k in the 100s, 1,000s, 10,000s, 100,000s, 1,000,000s, ...).
In computer graphics, the situation is often different. Usually, you have small matrices (kxk for k=2,3,4); a huge number of vectors; and you want to apply your matrix to all of those vectors. Very often, these matrices have very well known forms and also known well behaved inverses. There isn't really a significant computational cost in computing the inverse (you'll very often write down its formula by hand), and conditioning is usually not an issue (consider a rotation matrix for example or undoing translations with homogeneous coordinates).
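Concretely, for the camera/view question above: if the camera-to-world matrix is a rigid transform (rotation plus translation, no scale or shear - an assumption, not always true of every camera matrix), the view matrix has a closed form and never needs a general-purpose inverse. A rough sketch:

    import numpy as np

    def view_from_camera(cam_to_world):
        """Invert a rigid 4x4 transform [R | t] as [R^T | -R^T t] (assumes no scale/shear)."""
        R = cam_to_world[:3, :3]
        t = cam_to_world[:3, 3]
        view = np.eye(4)
        view[:3, :3] = R.T
        view[:3, 3] = -R.T @ t
        return view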
> Indeed one would not solve the scalar (n = 1) system 7x = 21 by computing x = 7^-1 × 21 but rather would carry out a division x = 21 / 7
I remember learning in algebra to solve this equation exactly the way he described and said we don't use. You multiply both sides by 1 / 7 which cancels the 7 on the left side, because 1 / 7 is the inverse of 7.
Now I just implicitly divide both sides by 7, but I'm still solving the equation by using the inverse of 7...
I think the difference is that you did the inversion symbolically, not numerically. This is about numeric computation not applying algebraic manipulations.
Image manipulation, game engines, graphics, finite element analysis, fluid dynamics, signal processing, weather models, etc. It's used everywhere where you have a system of equations to solve numerically.
Most of these industries are dominated by subject matter experts who write code.
Significantly used in finance as well, risk estimation is often approximately fitting an MVN to sparse data. I’ve used ideas from this blog many times.
Linear algebra is basically the foundation of most serious computation effort. We rank supercomputers essentially by their sustained floating-point rate solving large linear systems--a substantial fraction of HPC workloads boil down to "math on large matrices." At large scale, basically any simulation (whether it be weather modeling or analysis of the stress on a physical object) is solving linear equations. Graphics tends to boil down to doing lots and lots of small matrix multiplications, as does signal or image processing. Optimization problems also tend to boil down to linear algebra (via the simplex method)--so things like operations research, place & route on FPGAs, or even optimal factory ratios in a game like Factorio turn out to need linear algebra.
The cool thing about numerical linear algebra is, as far as numerical analysis is concerned, it's probably one of the most accessible fields to the would-be citizen scientist.
It's been a long time since I've taken linear algebra but if you can find an OCW course, or a really good book I'd start there. You'll want some rigor under your belt. Once you do that, any good numerical analysis text will usually cover the basics (you'll need them) before going into the fascinating world of algorithms as they relate to linear algebra.
Numerical Linear Algebra by Trefethen and Bau is the resource I've seen recommended most often. I've only read the first few chapters but they are good and useful.
head to ulaff.net, they have textbooks, a set of computer exercises, a code repository, youtube lectures, the whole nine yards really starting with linear algebra 101 and culminating with numerical linear algebra
their treatment of numerical linear algebra is not as comprehensive as say the one you'd find in Watkins' Fundamentals of Matrix Computations but it's free
[1] https://missing.csail.mit.edu/about/