Seven sins of numerical linear algebra (nhigham.com)
333 points by jeffreyrogers on Oct 12, 2022 | 72 comments



I wish there were a Missing Semester of Linear Algebra course to help people go from "Okay, I have a course or two in linear algebra, I know what span, vectors, basis, and dimension mean, I know the formal definition of an inner product space, and I can do Gauss-Jordan elimination, determinants, and eigenvalues for small matrices with paper and pencil" to "I have a 100x100 matrix of noisy data from sensors, and this research paper I found tells me I can do some fantastic stuff if I compute such-and-such involving eigenvalues or inverses or whatnot. Or maybe I have a process with 1000 states where I know the probability that objects move from state i to state j for each pair (i, j), and I want to find the steady state. How do I wrangle numpy into doing what I need?"
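
For the Markov chain part, here's one minimal numpy sketch of a way to do it (the transition matrix P and the random data are stand-ins I made up): the steady state is a left eigenvector of a row-stochastic P for eigenvalue 1.

    import numpy as np

    # Hypothetical example: steady state of a 1000-state Markov chain with a
    # row-stochastic transition matrix P (each row sums to 1).
    rng = np.random.default_rng(0)
    P = rng.random((1000, 1000))
    P /= P.sum(axis=1, keepdims=True)   # normalize rows so P is stochastic

    # The stationary distribution pi satisfies pi @ P = pi, i.e. it is a left
    # eigenvector of P with eigenvalue 1, equivalently a right eigenvector of P.T.
    vals, vecs = np.linalg.eig(P.T)
    pi = np.real(vecs[:, np.argmin(np.abs(vals - 1))])
    pi /= pi.sum()                      # normalize to a probability distribution

    print(np.allclose(pi @ P, pi))      # True (up to floating point)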

MIT has a course called The Missing Semester of Your CS Education [1]. It covers practical stuff that you need to know but that isn't really taught in classes (shells, version control, build systems, package managers, VMs).

There needs to be something similar for linear algebra; it seems like there's a lot of folk knowledge and a big gap between what typical undergrad courses train you to do and what you encounter in actual practical problems.

(And don't get me started on all the weird linear algebra stuff they have going on in e.g. quantum physics.)

[1] https://missing.csail.mit.edu/about/


Stanford has ENGR108 [1], based on the freely available book "Introduction to Applied Linear Algebra – Vectors, Matrices, and Least Squares" [2] by Stephen Boyd, with video lectures [3] available. EE263 [4] is sort of a continuation of this at a more advanced level; it was also originally developed by Boyd and also has video lectures available [5].

[1] https://stanford.edu/class/engr108/

[2] https://web.stanford.edu/~boyd/vmls/

[3] https://www.youtube.com/watch?v=oR6G1MUMveE

[4] https://ee263.stanford.edu/

[5] https://www.youtube.com/playlist?list=PL06960BA52D0DB32B


Sounds like a vanilla numerical math course?


I don’t know your age, but I expect you’re a bit older! This has changed dramatically in the past 20-30 years and has been a contentious issue in the computer science education community for the past few years. Many programs, if they even offer numerical analysis coursework, provide it to supplement regular analysis coursework (rather than computer science coursework).


Do you have any more context on the contentiousness of including numerical analysis in a traditional cs curriculum? I’m curious to know what the arguments are on both sides.

I’ve also recently noticed that programs and students outside North America seem to take the content much more seriously.


I had a computer numerics course even though I was studying pure mathematics.

That was in Germany in the early 2000s.


Maybe ;-)

I do know that this course is still offered widely across the US, but maybe not within CS, and maybe not as a mandatory course.

But, bottom line: if you're interested, it's very likely you can take it - you may have to check your school of engineering or applied math department.


Sounds like EE263 at Stanford. Stephen Boyd’s lectures are on YouTube and so are most of the slides and psets.

https://ee263.stanford.edu/archive/


Check out Prof. Steve Brunton's YouTube channel [0] — it's light on the numpy side of things, but he's got a lot of material going from first principles to dealing with huge matrices, compressed sensing, dynamical systems, ML, etc.

[0] https://www.youtube.com/c/Eigensteve


I don't exactly know what you mean, but at least most points in the blog post were covered in my undergrad numerics courses.


It's a master's degree course called Matrix Analysis (e.g., Horn and Johnson as the text).


Would the libraries take care of most of the algorithms mentioned in Horn's book? I was wondering if there's something in between: it goes beyond basic linear algebra, but it uses the numerical libraries to process the large matrices to solve complex problems.


Is it close to 18.065 from Prof Strang?


Prof Higham has quite the list of papers on the finer points of numerical algorithms going back to his MSc in 1983:

http://www.ma.man.ac.uk/~higham/papers/bibbase.php

including a brief note comparing the "Top 10 Algorithms in Applied Mathematics" lists between 2000 and 2016 that may interest some:

https://nhigham.com/2016/03/29/the-top-10-algorithms-in-appl...

It'd be nice to see some meat on the bones and a few ripping yarns about the application end of applied math techniques, e.g. forming an enhanced image from tens (or hundreds) of thousands of multichannel spectral samples using a sensitivity-adjusted SVD, and then removing the most common expected background to highlight the anomalies.

It's dry stuff in Linear Algebra, somewhat more exciting when searching for nuclear weapons in a forest or gold in a desert.


He also wrote "Accuracy and Stability of Numerical Algorithms" which is part of the canon of the field.


Why is Newton’s Method not used more often in ML? I know Newton’s method requires the 2nd derivative while the Gradient Descent family of algorithms only requires the first, but shouldn’t Newton’s Method be just as straightforward when using autodifferentiation, since the process is just a computational graph transform using a lookup table?


With Newton's method, you'll be solving Hx = g (H = Hessian of f, g = gradient of f) at each iteration. For a large number of variables N, building H is of order N^2 and solving Hx = g is of order N^3 with a standard solver. N^2 and N^3 are really large for an already large N. I believe the reason is as simple as that. It isn't that it is tedious and difficult to write down the formulas or the code to compute H; it's just too costly, computationally speaking. There is also an increased memory cost (possibly having to store H).
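
To make the cost concrete, here's a rough numpy sketch of a single damped Newton step (grad, hess, and the damping term are stand-ins of my own): the O(N^3) work is the solve of Hp = -g, which at least avoids forming H^-1 explicitly.

    import numpy as np

    def newton_step(grad, hess, x, damping=1e-6):
        """One Newton step: solve H p = -g instead of forming inv(H).

        grad and hess are callables returning the gradient (N,) and the
        Hessian (N, N) at x; damping adds a small multiple of I for safety.
        """
        g = grad(x)
        H = hess(x)                        # O(N^2) storage just to hold H
        H = H + damping * np.eye(len(x))   # crude regularization
        p = np.linalg.solve(H, -g)         # O(N^3) with a dense factorization
        return x + p

    # Toy quadratic f(x) = 0.5 x^T A x - b^T x, minimized in a single step.
    A = np.array([[3.0, 1.0], [1.0, 2.0]])
    b = np.array([1.0, 1.0])
    x = newton_step(lambda x: A @ x - b, lambda x: A, np.zeros(2))
    print(np.allclose(A @ x, b))           # True: the minimizer solves Ax = b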

When people do have ways to go around this problem, they do use Newton's method for large scale problems.


Adding to the answers already given (the complexity of computing full Hessians): first-order methods, i.e., methods that only require gradients, are the methods of choice when extreme solution accuracy is not required, which is the case in many practical settings (including ML).

One of Newton's method's major selling points is that once it's close to a local minimum, under the right smoothness assumptions it essentially converges faster and faster to an exact optimizer; in practice you get "one more correct significant digit each iteration" once you are close enough. This is Newton's method's region of quadratic convergence; see [0], Theorem 14.1, p. 3.

[0] https://www.stat.cmu.edu/~ryantibs/convexopt-F16/scribes/new...


Adding to the sibling post from phao: optimizers like Adam essentially use a diagonal approximation to Newton's method, which gives some, though certainly not all, of the benefit of Newton.

There is the seminal paper "Deep Learning via Hessian-free Optimization" [1] that applies Newton's method (Hessian-free because of a clever trick: solving Hx = -g iteratively without explicitly constructing the Hessian, getting an approximate solution without the full O(n^3) cost), but n^2 or n^3 is just too large in practice when the diagonal approximations work just fine.

[1] https://www.cs.toronto.edu/~jmartens/docs/Deep_HessianFree.p...
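
For anyone curious, here's a bare-bones sketch of that idea (my own toy code, not the paper's implementation): conjugate gradient on Hx = -g where H is only touched through Hessian-vector products. The hvp below is an explicit matrix product standing in for an autodiff Hessian-vector product.

    import numpy as np

    def cg_solve(hvp, g, iters=50, tol=1e-10):
        """Solve H x = -g using only Hessian-vector products hvp(v) = H v.

        The 'Hessian-free' idea: never materialize H, so each iteration costs
        one Hvp (roughly a couple of gradient evaluations with autodiff)
        rather than O(n^2) storage or O(n^3) factorization work.
        """
        x = np.zeros_like(g)
        r = -g - hvp(x)                 # residual of H x = -g
        p = r.copy()
        rs = r @ r
        for _ in range(iters):
            Hp = hvp(p)
            alpha = rs / (p @ Hp)
            x += alpha * p
            r -= alpha * Hp
            rs_new = r @ r
            if rs_new < tol:
                break
            p = r + (rs_new / rs) * p
            rs = rs_new
        return x

    # Stand-in Hvp: an explicit SPD matrix. In practice hvp(v) would come from
    # autodiff (a Jacobian-vector product of the gradient) without forming H.
    rng = np.random.default_rng(1)
    M = rng.standard_normal((100, 100))
    H = M @ M.T + 100 * np.eye(100)     # symmetric positive definite
    g = rng.standard_normal(100)
    x = cg_solve(lambda v: H @ v, g)
    print(np.allclose(H @ x, -g, atol=1e-4))   # True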


Re: the first sin, I was told to not invert matrices but never given a satisfying reason. So I wrote a blog post about it:

http://gregorygundersen.com/blog/2020/12/09/matrix-inversion...


The other obvious reason to avoid inverses is that they're only defined for square matrices, whereas LU decomposition works on general rectangular matrices (or rather PLU decomposition, same basic idea).


Inversion can also destroy structure you might want to keep around; for example, the LU factorization of a banded matrix will still be banded (LAPACK will do partial pivoting, which increases the bandwidth, but it only doubles the number of super-diagonals in L), while the inverse is a full matrix.
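
A quick way to see this (a toy sketch of my own, using the 1-D second-difference matrix as the banded example):

    import numpy as np
    from scipy.linalg import lu

    # Tridiagonal (banded) test matrix: the 1-D second-difference matrix.
    n = 200
    A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)

    P, L, U = lu(A)                 # LU with partial pivoting
    inv_A = np.linalg.inv(A)

    nnz = lambda M: np.count_nonzero(np.abs(M) > 1e-12)

    # L and U stay banded (one sub-/super-diagonal each here; pivoting can at
    # most double U's bandwidth), while the inverse is completely dense.
    print(nnz(L), nnz(U), nnz(inv_A))   # roughly 2n, 2n, and n^2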


Err, the number of super-diagonals in U obviously, haha, oops.


Most people should pay extra attention to #7.

E.g., Goodfellow et al. did even worse than this sin in the Deep Learning book when they claimed the condition number for a square (but not necessarily normal) matrix is defined in terms of eigenvalues. This is false, but nevertheless see section 4.2 in https://www.deeplearningbook.org/contents/numerical.html . When I've raised this with people in real life, I typically get some reflexive response that it should be a useful approximation, but as this blog points out, that isn't true either.
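
For anyone who wants to see the gap, here's a tiny example of my own: a nonnormal matrix whose eigenvalues are both 1, so the eigenvalue-based "condition number" is 1, yet the matrix is terribly conditioned.

    import numpy as np

    A = np.array([[1.0, 1e8],
                  [0.0, 1.0]])

    eigs = np.linalg.eigvals(A)
    print(max(abs(eigs)) / min(abs(eigs)))   # 1.0: eigenvalue ratio says "fine"
    print(np.linalg.cond(A))                 # ~1e16: singular values say "terrible"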


Good catch!


Also, please check out the other posts on Professor Nick Higham's blog, especially his wonderful and accessible "What is ..." series of articles [0].

[0] https://nhigham.com/index-of-what-is-articles/


Not sure I 100% agree with (2), forming A^TA. In many real-world use cases A^TA + Cholesky is going to be considerably faster than QR on A, and come with few numerical consequences. Even in the numerically challenging cases, pivoted LDL^T on the saddle-point system is still going to be faster than the suggested solution of doing QR on A. (Essentially no optimization solver I've ever seen uses QR.)
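
For what it's worth, a quick sketch of the two routes on a tall, reasonably conditioned problem (sizes and data are made up); when A is well scaled they agree to many digits, which is the point being made here.

    import numpy as np
    from scipy.linalg import cho_factor, cho_solve

    rng = np.random.default_rng(0)
    A = rng.standard_normal((10_000, 50))    # tall and reasonably conditioned
    b = rng.standard_normal(10_000)

    # Normal equations + Cholesky: one tall-skinny gemm plus an O(k^3) factorization.
    x_chol = cho_solve(cho_factor(A.T @ A), A.T @ b)

    # SVD-based least squares via lstsq; a QR-based solve behaves similarly here.
    x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

    print(np.linalg.norm(x_chol - x_lstsq))  # tiny when A is well conditioned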


Love the post. I'll take this opportunity to link to a favorite classic linear algebra paper in a similar vein: "Nineteen Dubious Ways to Compute the Exponential of a Matrix" [1]

[1]: https://www.math.purdue.edu/~yipn/543/matrixExp19-I.pdf


Great list, thank you. Unfortunately I commit 4 of these on a regular basis.


Why, and which 4?


> 1. Inverting a Matrix

Possibly only a partial sin, since I use the Moore-Penrose pseudoinverse and L2 ridge regularization. This sin is being committed as a consequence of committing the next sin.

> 2. Forming the Cross-Product Matrix A^TA

Yes, this is potentially numerically bad. However, in practice, as long as you are careful with scaling it's perfectly fine and enables better parallel computation.

> 7. Using Eigenvalues to Estimate Conditioning

This is the only way I actually know, so after reading this I realized I need to read more on the topic.

The last of the 4, which is obviously an avoidable sin:

> 3. Evaluating Matrix Products in an Inefficient Order

I definitely have code that should be changed. It's also on my todo list now to audit a specific routine that I suspect can be fixed. This one is just stupid; it can easily be avoided.
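
A toy timing sketch of why the order matters (sizes are arbitrary):

    import time
    import numpy as np

    rng = np.random.default_rng(0)
    n = 2000
    A = rng.standard_normal((n, n))
    B = rng.standard_normal((n, n))
    v = rng.standard_normal(n)

    t0 = time.perf_counter()
    y1 = (A @ B) @ v        # O(n^3): forms an n x n product first
    t1 = time.perf_counter()
    y2 = A @ (B @ v)        # O(n^2): two matrix-vector products
    t2 = time.perf_counter()

    print(t1 - t0, t2 - t1)    # the second ordering is orders of magnitude faster
    print(np.allclose(y1, y2))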


"7. Using Eigenvalues to Estimate Conditioning"

Putting this out to invoke Cunningham's Law... my intuition says that while the article may be right about matrices in the real numbers, using the eigenvalues to check for closeness to singularity may be more valid on the floats, because probably what you're testing for isn't "closeness to singularity" but how close you are to having floating point failures, and that seems at least likely to be heavily correlated.

I now sit back and wait for someone to explain why this is wrong while I act like this was an entirely unanticipated result of my post.


Why would this be correlated more with eigenvalues than singular values?


... same ... :-/


Numerical algorithms have always been a fascination of mine and I've spent quite a bit of time studying them. Linear algebra has always seemed to provide some of the richest content (followed by differential equations, imo). To me linear algebra was so dry when done on paper, but suddenly a new world opened up when I could use computers.


I recommend Numerical Linear Algebra by Trefethen and Bau.

Thanks for this excellent link.


An amazing text indeed; this was our textbook in some graduate courses I took in numerical computation. Machine epsilon still haunts me at night.


Trefethen's video lectures are also superb: https://podcasts.ox.ac.uk/series/scientific-computing-dphil-...


The PDFs for Higham's "What is" series are available on Github: https://github.com/higham/what-is


nit: in "5. Not Exploiting Structure in the Matrix", the author says circulant matrices can be solved in O(n log_2 n) operations, where log_2 means "base 2" log. This notation is unnecessary since different bases of logs (as long as the bases are constant and independent of n) only differ in a constant factor, the _2 is insignificant in the big O notation, so it's just O(n log n)


Learned a lot of these lessons when studying statistics as a non-coder, trying to implement my own techniques and other people's research.


Nice list. I would add

8. [edit: oops that's already no 5] Not taking advantage of matrix structure (symmetric, sparse, banded, Toeplitz, ...)

9. Transposing a matrix (Like the inverse A^{-1}, the explicit transpose A^t is often not needed)


IMO 9 is the fault of the language rather than the programmer. Julia solves this one by having types that can represent conjugates, transposes, and adjoints of arbitrary matrices lazily.


I think they are talking about explicitly transposing it in memory here. A "transpose operation flag" is a pretty standard feature for linear algebra libraries.

If someone is writing a higher level object-oriented linear algebra environment, having the matrices carry around a "transposed" flag that they can pass to BLAS seems like a reasonable thing to do. Since, other than in Julia and Matlab, matrix stuff is usually an add-on, we can't really blame the language for this decision IMO.

A downside could be -- usually your underlying tuned library will be BLAS and LAPACK, which don't accept a 'transposed' flag for every single operation. So, from the user point of view, it could be kind of confusing -- "when I transpose a matrix and then go on to multiply, the transpose is free. But if I transpose a matrix and then go on to hit it with QR, it for some reason incurs this weird extra cost -- not where I do the operation, but later, in the QR."


And in Python/numpy, transpose is an O(1) operation: it doesn't actually move any data, it just changes the strides used to access data points in the underlying buffer.
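
Concretely (a tiny illustration):

    import numpy as np

    A = np.arange(6, dtype=float).reshape(2, 3)
    B = A.T                        # no data is copied, only metadata changes

    print(A.strides, B.strides)    # (24, 8) vs (8, 24) for float64
    print(B.base is A)             # True: B is a view onto A's buffer
    B[0, 0] = 99.0
    print(A[0, 0])                 # 99.0, since both share the same memory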


> 9. Transposing a matrix

... if you only use the transpose once. If instead it is going to be used multiple times, explicitly computing the transpose can be a huge performance boost.

For dense matrices, it is typically used to exploit memory locality (i.e. to be prefetch- and cache-friendly).

For sparse matrices (your point 8), the advantage can be even more pronounced, sometimes the difference between being able to exploit sparsity, or not.


> 8. Not taking advantage of matrix structure (symmetric, sparse, banded, Toeplitz, ...)

This is the sin no 5 of the article: "5. Not Exploiting Structure in the Matrix"


> 9. Transposing a matrix (Like the inverse A^{-1}, the explicit transpose A^t is often not needed)

Most matrix libraries should make transpose, conjugate, and conjugate transpose just twiddle a bit in their internal representation; BLAS routines should have a parameter saying whether the input matrix needs to be transposed and/or conjugated before doing an operation.


Doesn’t memory layout still matter? Presumably matrix-vector products like Av are fastest if the rows of A are contiguous in memory, and vector-matrix products vA are fastest when the columns are contiguous in memory. So sometimes one would really want to actually transpose the matrix, rather than just reinterpreting the indexing scheme.


Materializing the transpose of a matrix is one of the most common and useful operations at the start of a large-scale data processing system.

People who say "transpose is an O(1) operation because it just creates a view" aren't including the important detail of how caches and access patterns impact performance.


Per the first sin, is there any alternative for computing the view matrix from a camera matrix in 3d graphics? This is a case where inversion "feels" appropriate to me.


He is mostly talking about computational linear algebra problems at a large scale: the "computational intensity" comes from having really large matrices (k x k for k in the hundreds, thousands, tens of thousands, hundreds of thousands, millions, ...).

In computer graphics, the situation is often different. Usually, you have small matrices (k x k for k = 2, 3, 4), a huge number of vectors, and you want to apply your matrix to all of those vectors. Very often these matrices have well-known forms and well-behaved, known inverses. There isn't really a significant computational cost in computing the inverse (you'll very often write down its formula by hand), and conditioning is usually not an issue (consider a rotation matrix, for example, or undoing translations with homogeneous coordinates).


Thank you for the clarification. I didn't realize we were talking about that kind of scale.


You may need an inverse, but you shouldn’t call .inverse() or inverse(A).

Since transformation matrices have simple structure, you can invert them much much faster.

Ex: inverse(R, u) is (R^T, -R^T * u)
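
A minimal numpy sketch of that formula for a 4x4 homogeneous transform (the helper name and the example transform are mine):

    import numpy as np

    def invert_rigid(T):
        """Invert a rigid transform [R u; 0 1] as [R^T  -R^T u; 0 1]."""
        R, u = T[:3, :3], T[:3, 3]
        Ti = np.eye(4)
        Ti[:3, :3] = R.T
        Ti[:3, 3] = -R.T @ u
        return Ti

    # Example: rotation about z by 30 degrees plus a translation.
    a = np.deg2rad(30)
    T = np.eye(4)
    T[:3, :3] = [[np.cos(a), -np.sin(a), 0],
                 [np.sin(a),  np.cos(a), 0],
                 [0,          0,         1]]
    T[:3, 3] = [1.0, 2.0, 3.0]

    print(np.allclose(invert_rigid(T) @ T, np.eye(4)))   # True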


> Indeed one would not solve the scalar (n = 1) system 7x = 21 by computing x = 7^-1 × 21 but rather would carry out a division x = 21 / 7

I remember learning in algebra to solve this equation exactly the way he described and said we don't use. You multiply both sides by 1 / 7 which cancels the 7 on the left side, because 1 / 7 is the inverse of 7.

Now I just implicitly divide both sides by 7, but I'm still solving the equation by using the inverse of 7...


But that is sort of the point. In theory, it's the same. But that's not how you should actually compute it.

The cheap way is to compute 21/7.

The hard way is to compute 1/7 (one floating point operation), and then multiply it by 21 (another floating point operation).

With matrices, the discrepancy in work and possibly precision between "solve Ax=b" and "compute x = A^-1 b" can be very large.
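
A quick numpy illustration of that discrepancy (random data, toy sizes of my own): the solve route does one LU factorization plus triangular solves, while the inverse route costs roughly three times the flops and usually leaves a larger residual.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 2000
    A = rng.standard_normal((n, n))
    b = rng.standard_normal(n)

    x_solve = np.linalg.solve(A, b)    # LU factorization + triangular solves
    x_inv = np.linalg.inv(A) @ b       # ~3x the flops, typically less accurate

    print(np.linalg.norm(A @ x_solve - b))   # small residual
    print(np.linalg.norm(A @ x_inv - b))     # usually noticeably larger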


I think the difference is that you did the inversion symbolically, not numerically. This is about numeric computation not applying algebraic manipulations.


In linear algebra finding a general inverse of a matrix is harder than solving a specific equation.


ok so what is sqrt(7)x = 21?

well you just 1/sqrt(7) * sqrt(7)x = 1/sqrt(7) * 21

so x = sqrt(7)^2 * 3 / sqrt(7) = sqrt(7) * 3

but you didn't compute the inverse did you, and neither did i. you factored 21 and used the cancellation law (ax = ay => x = y)


I love this topic. Are there any industry fields or subfields that make heavy use of this content?


Image manipulation, game engines, graphics, finite element analysis, fluid dynamics, signal processing, weather models, etc. It's used everywhere where you have a system of equations to solve numerically.

Most of the industries are dominated by subject matter experts who write code.


Significantly used in finance as well, risk estimation is often approximately fitting an MVN to sparse data. I’ve used ideas from this blog many times.


Comes up quite a bit in the AI space as well.


Linear algebra is basically the foundation of most serious computation effort. We rank supercomputers essentially by their sustained floating-point computation rate when solving large linear equations, and a substantial fraction of HPC workloads boil down to "math on large matrices." At large scale, basically any simulation (whether it be weather modeling or analysis of the stress on a physical object) is solving linear equations. Graphics tends to boil down to doing lots and lots of small matrix multiplications, as does signal or image processing. Optimization problems tend to boil down to linear algebra as well (via simplex), so things like operations research, place and route on FPGAs, or even optimal factory ratios in a game like Factorio turn out to need linear algebra.


Control theory, optimization


What is a good resource to self-study numerical linear algebra?


There are many good resources! A few, depending on your inclination:

- For intuition: https://www.youtube.com/watch?v=fNk_zzaMoSs

- For rigor: https://ocw.mit.edu/courses/18-06-linear-algebra-spring-2010...

- For code: https://codingthematrix.com/

- For numerical/algorithmic details: https://people.maths.ox.ac.uk/trefethen/text.html


Rigor and Strang are not two words I'd put in the same sentence. Neil Strickland's notes https://neilstrickland.github.io/linear_maths/ (for the matrix point of view) or Jim Hefferon's book https://joshua.smcvt.edu/linearalgebra/index.html (for a vector-spacey treatment) are what comes to my mind when I think "rigor" (along with all sorts of older textbooks like Hoffman/Kunze); Jean Gallier's long set of notes https://www.cis.upenn.edu/~jean/math-deep.pdf goes even further in that direction.


The cool thing about numerical linear algebra is, as far as numerical analysis is concerned, it's probably one of the most accessible fields to the would-be citizen scientist.

It's been a long time since I took linear algebra, but if you can find an OCW course or a really good book, I'd start there. You'll want some rigor under your belt. Once you do that, any good numerical analysis text will usually cover the basics (you'll need them) before going into the fascinating world of algorithms as they relate to linear algebra.


Numerical Linear Algebra by Trefethen and Bau is the resource I've seen recommended most often. I've only read the first few chapters but they are good and useful.


Head to ulaff.net; they have textbooks, a set of computer exercises, a code repository, YouTube lectures, the whole nine yards really, starting with linear algebra 101 and culminating with numerical linear algebra.

Their treatment of numerical linear algebra is not as comprehensive as, say, the one you'd find in Watkins' Fundamentals of Matrix Computations, but it's free.


fantastic post!





