I feel like iterative Krylov subspace methods should be in there somewhere, but to be honest I'm not sure what the application space for linear inverse problems looks like in the deep learning domain. So maybe they wouldn't quite fit (despite my finding them pretty cool, and despite these methods sort of eating the lunch of most other inverse methods).
I'd maybe round out the course with a read-through of Trefethen & Bau, “Numerical Linear Algebra”, for those new to the topic area.
A cool thing is that this means you can do linear algebra operations with a matrix (including solving linear systems) without ever explicitly constructing that matrix. You just need a recipe describing the action of multiplying by the matrix in order to do more advanced things with it. This is why these methods are so popular for sparse matrices, but their suitability goes well beyond that.
Projecting problems into Krylov spaces also tends to have a noticeable regularizing effect.
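To make that concrete, here's a minimal sketch with SciPy's LinearOperator. The operator is a made-up toy; the point is just that GMRES only ever sees the matvec recipe:

```python
# A minimal matrix-free sketch: we never form A, only describe its action.
import numpy as np
from scipy.sparse.linalg import LinearOperator, gmres

n = 1000

def matvec(v):
    # Action of A = I + 0.5 * (circular shift). A itself is never built.
    return v + 0.5 * np.roll(v, 1)

A = LinearOperator((n, n), matvec=matvec)
b = np.ones(n)

x, info = gmres(A, b)
print(info, np.linalg.norm(matvec(x) - b))  # info == 0 means it converged
```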
More generally: subspace methods find a solution x as a linear combination of a few column vectors that form a matrix V. So x = Vy, where y is a very small vector. Only if you're lucky will you find AVy = b; in general, AVy ≠ b. So instead you require the residual to be orthogonal to the columns of V (a Galerkin condition): V'(AVy - b) = 0. In the end you solve a very small problem, (V'AV)y = V'b, for y.
GMRES, for instance, solves AVy = b in the least-squares sense: (AV)'AVy = (AV)'b.
There's a lot more to it of course.
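If it helps, here's a toy NumPy version of both projections above. A real Krylov solver builds V incrementally and stably via Arnoldi; this just QRs the raw Krylov vectors, and the matrix is random filler:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 200, 10
A = np.diag(np.linspace(1, 10, n)) + 0.1 * rng.standard_normal((n, n))
b = rng.standard_normal(n)

# Krylov basis: orthonormalize [b, Ab, A^2 b, ...]
K = np.column_stack([np.linalg.matrix_power(A, j) @ b for j in range(k)])
V, _ = np.linalg.qr(K)

# Galerkin condition V'(AVy - b) = 0  ->  (V'AV) y = V'b
y = np.linalg.solve(V.T @ A @ V, V.T @ b)
x_galerkin = V @ y

# GMRES-style: minimize ||AVy - b|| (here via a dense least-squares solve)
y_ls, *_ = np.linalg.lstsq(A @ V, b, rcond=None)
x_gmres = V @ y_ls

print(np.linalg.norm(A @ x_galerkin - b), np.linalg.norm(A @ x_gmres - b))
```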
Krylov subspaces are a (the?) common framework for computing eigenvalues and singular values, and they work for sparse matrices as well. IOW, they are extremely useful for most problems that can be expressed as a matrix factorization (e.g. recommendation via collaborative filtering, early PageRank-like algos).
The matrices themselves are a representation of a Markov chain.
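As a sketch of the sparse case: something like this pulls out the dominant eigenpair of a Markov-chain-like matrix with SciPy's `eigs`, which is ARPACK's Arnoldi (Krylov) iteration under the hood. The matrix here is random filler rather than a real link graph:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigs

rng = np.random.default_rng(0)
n = 2000
P = sp.random(n, n, density=0.01, random_state=rng, format='csr')

# Row-normalize into a (sub)stochastic matrix, i.e. a Markov chain.
rowsum = np.asarray(P.sum(axis=1)).ravel()
rowsum[rowsum == 0] = 1.0                # leave dangling (empty) rows as-is
P = sp.diags(1.0 / rowsum) @ P

vals, vecs = eigs(P.T, k=1, which='LM')  # dominant left eigenvector
pi = np.abs(vecs[:, 0].real)
pi /= pi.sum()                           # ~ stationary distribution
print(vals[0].real)                      # ~ 1 for a stochastic matrix
```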
I will definitely give this course a try and spread the word.
One possibility that I haven't seen mentioned: if you have a neural network graph with a small subgraph that is not 100% explicit (i.e. nodes with cyclic dependencies that have to be solved collectively via a Newton-type method), the gradient across this subgraph can be obtained at the converged state by solving a standard matrix-vector linear inverse problem applied to the gradients, without needing the AD engine to follow every operation through the iterates of the nonlinear Newton solver.
For sparsity and conditioning reasons, in my current work we do something like this for graph-based specifications of engineering systems, using GMRES to find the gradients across these subsystems. We're not doing deep learning, but the underlying machinery is basically the same.
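Here's a self-contained toy of the trick, with all names and the residual function invented for illustration: one adjoint linear solve (the GMRES call) at the converged state replaces differentiating through every Newton iterate.

```python
import numpy as np
from scipy.optimize import fsolve
from scipy.sparse.linalg import LinearOperator, gmres

rng = np.random.default_rng(0)
n = 50
A = 3 * np.eye(n) + 0.1 * rng.standard_normal((n, n))
b = rng.standard_normal(n)

def F(x, theta):            # residual of the implicit subgraph (toy)
    return A @ x + theta * np.tanh(x) - b

def dF_dx(x, theta):        # Jacobian of F with respect to the state x
    return A + theta * np.diag(1.0 - np.tanh(x) ** 2)

theta = 0.7
x_star = fsolve(lambda x: F(x, theta), np.zeros(n))  # "Newton-type" solve

# Loss L(x) = 0.5||x||^2, so dL/dx = x. Adjoint system: (dF/dx)' lam = dL/dx.
J = dF_dx(x_star, theta)
Jt = LinearOperator((n, n), matvec=lambda v: J.T @ v)
lam, info = gmres(Jt, x_star)

dL_dtheta = -lam @ np.tanh(x_star)   # chain rule, since dF/dtheta = tanh(x)

# Sanity check: finite difference through the whole nonlinear solve.
eps = 1e-6
x_eps = fsolve(lambda x: F(x, theta + eps), x_star)
print(dL_dtheta, (0.5 * x_eps @ x_eps - 0.5 * x_star @ x_star) / eps)
```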
However, more recent theoretical and empirical results show that these kinds of approaches don't deal well with the abundance of saddle points in DL loss functions. But momentum-based techniques work great.
Now, if we're just doing a stochastic steepest-descent algorithm, there are not really any linear systems to solve, so this doesn't come up. If we want to use a stochastic Newton algorithm, it does. It comes up here specifically because forming the full Hessian may be too expensive in computation or memory, so if we're smart and use a Krylov method that only requires Hessian-vector products, we can exploit second-order information without paying those computation and memory penalties. Yes, there is a loss here, since we only use a piece of the spectrum of the Hessian, and this is where the art and science of preconditioning come into play to ensure that we use this effectively. However, we can use stochastic algorithms here. In fact, the entire field of stochastic PDE-constrained optimization deals precisely with these tools and does so in a surprisingly effective manner.
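A bare-bones illustration of that Hessian-free idea on a toy quadratic, with a finite-difference Hessian-vector product standing in for whatever your AD engine would provide:

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

rng = np.random.default_rng(0)
n = 100
M = rng.standard_normal((n, n))
Q = M @ M.T + n * np.eye(n)     # SPD, so CG's minimization principle applies

def grad(x):                    # gradient of f(x) = 0.5 x'Qx - sum(x)
    return Q @ x - 1.0

def hvp(x, v, eps=1e-6):        # Hessian-vector product; H is never stored
    return (grad(x + eps * v) - grad(x - eps * v)) / (2 * eps)

x = np.zeros(n)
H = LinearOperator((n, n), matvec=lambda v: hvp(x, v))
step, info = cg(H, -grad(x))    # inexact Newton step
x = x + step
print(np.linalg.norm(grad(x)))  # ~ 0 after one step on a quadratic
```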
The advantage that Krylov methods have over a direct solve is that we retain the ability to use the direct solve if we want, in the context of a preconditioner. However, different Krylov methods have different properties that we can use to our advantage. For example, conjugate gradient works on a minimization principle based on a quadratic form. This integrates well with trust-region methods that require us to minimize a quadratic model. GMRES minimizes the residual of the linear system, which gives us a monotonically decreasing residual at each iteration; that can be important for different algorithms. Which Krylov method to use in any particular situation depends on the properties desired and the systems involved. And, I can't stress this enough, we don't lose out on direct solves. Direct solves can be fantastic preconditioners, and if we really can compute an inverse exactly, our Krylov iteration converges in one step. As such, we lose a little bit of computation and memory, but not that much.
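For instance, sketching the factorization-as-preconditioner point with SciPy's incomplete LU on a toy sparse system:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import LinearOperator, gmres, spilu

n = 1000
A = sp.diags([-1.0, 2.1, -1.0], [-1, 0, 1], shape=(n, n), format='csc')
b = np.ones(n)

ilu = spilu(A)                                 # approximate LU factorization
M = LinearOperator((n, n), matvec=ilu.solve)   # preconditioner action

x, info = gmres(A, b, M=M)
print(info, np.linalg.norm(A @ x - b))
# With an *exact* factorization as M, GMRES would converge in one step.
```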
Anyway, sorry for coming on a little bit hard, but I wanted to try and nip some amount of confusion in the bud quickly. Stochastic methods are naturally and fully integrated into inexact optimization schemes that rely on Krylov solves. They are the standard tool for many fields that use optimization. They may not be widely used in ML, but that has little to do with their computational efficacy.
I need to nitpick here... Big O notation is a way to describe growth rates of functions. You can count data movements (or anything else) with Big O.
So the correct way would be to translate memory accesses into time, but that means modelling the memory hierarchy, modelling caches in general, and finally perhaps modelling the specific caching strategies in use.
Should algorithm analyses always model caches, given that caches are never going away? When determining computational complexity, every operation and memory access is treated as costing the same unit of time, but real memory doesn't and never will behave that way.
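A quick way to see the gap between operation counts and reality (numbers are machine-dependent, but the effect is robust):

```python
# Same O(n^2) additions, very different wall-clock time, purely because of
# the memory access pattern.
import time
import numpy as np

n = 4000
a = np.random.rand(n, n)   # C order: rows are contiguous in memory

t0 = time.perf_counter()
rows = sum(a[i, :].sum() for i in range(n))   # contiguous, cache-friendly
t1 = time.perf_counter()
cols = sum(a[:, j].sum() for j in range(n))   # strided, cache-hostile
t2 = time.perf_counter()

print(f"rows: {t1 - t0:.3f}s  cols: {t2 - t1:.3f}s")  # same total, up to rounding
```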
If you want a bottom-up approach, go look at Ng's Coursera course.
I've just started lecture 1, but I've already run into some minor frustrations:
- One of the links in the first lecture is to a notebook introducing convolutions, but that notebook is just a big code dump.
- After executing the exercises, you lose the expected answer. It might be better if the answers were included as a comment in the code fragment.
- Sometimes the given answers are not actually the answer, just a computation performed on the way to the answer. E.g., for the matrix-matrix products section in lecture 1, the suggested answer is just the matrix resulting from the product, but according to the question in the text the answer should be the actual cheapest shop.
- Is this a USF course or a fast.ai course?
I don't know if the author is planning on improving the material, because right now it feels a bit like a beta version.
Thanks for your hard work, Rachel! Really curious what you two will get up to next.
> Jeremy and I developed this material for a numerical linear algebra course we taught in the University of San Francisco’s Masters of Analytics program, and it is the first ever numerical linear algebra course, to our knowledge, to be completely centered around practical applications and to use cutting edge algorithms and tools,
Also, sometimes algorithms can be invented and empirically shown to have good complexity properties before there are proofs.
- Computational as in 'using programming and computers'
- Computational as in 'solving a math problem' (as opposed to 'theoretical': definitions/theorems/proofs/lemmas).
My guess is that the posted link means the first kind; hence almost all linear algebra texts are non-computational, i.e., you can become an expert in linear algebra without knowing how to program, and without knowing a single programming language.
For the second kind, most of beginner and intermediate linear algebra is computational, but not all. There's plenty of theory of linear algebra, and it has connections with representation theory, abstract algebra, as well as analysis, topology, and geometry. Study of infinite-dimensional vector spaces is purely non-computational.