0. The problem you're really looking at is quickly computing
a'B
for many different vectors a (of length D) which are conceptually random.
1. To do that, you approximate AB, where A is very tall and thin (NxD) and B is DxM. (Note: computed naively, this has complexity O(NDM), or O(N^3) if N=D=M; smarter algorithms reduce the exponent 3 to about 2.8 (Strassen) or even further, though from what I gather those aren't practical. And note that at minimum you're looking at (N+M)D time, because you need to look at every element, unless you approximate.)
2. You approximate AB by f(g(A), h(B)), such that the error (in "relative Frobenius norm", i.e., the norm of the error divided by the norm of the actual product) is "small" with "high probability".
3. You're given B and a "typical" Ã (i.e., training data), and then have two steps:
3.1. Do some training mumbo jumbo (fast?), and
3.2. now you can finally compute the approximate AB (or a'B) for "new" A (or a'), presumably quickly?
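To make steps 3.1 and 3.2 concrete, here is a minimal NumPy sketch of the train-once, apply-many shape of the problem, including the relative Frobenius error from step 2. The internals (a PCA projection learned from the training matrix) are a deliberately simple stand-in, not the paper's actual algorithm, which learns hash-based lookup tables instead:

    import numpy as np

    rng = np.random.default_rng(0)
    N, D, M, d = 10_000, 512, 64, 32

    # Rows of A are "random" but share a distribution -- here, a rank-d
    # latent structure. Having a training set from that distribution is
    # what the offline step exploits.
    W = rng.normal(size=(d, D))
    A_train = rng.normal(size=(N, d)) @ W      # the "typical" A-tilde
    A_new = rng.normal(size=(100, d)) @ W      # fresh rows, same distribution
    B = rng.normal(size=(D, M))

    # Step 3.1 (offline, slow): learn a projection P from the training data.
    _, _, Vt = np.linalg.svd(A_train, full_matrices=False)
    P = Vt[:d].T                               # D x d

    # Step 3.2 (online, fast): f(g(A), h(B)) with g(A)=A@P, h(B)=P.T@B,
    # so every product only touches the smaller inner dimension d.
    approx = (A_new @ P) @ (P.T @ B)

    # The error metric from step 2: relative Frobenius norm.
    exact = A_new @ B
    rel_err = np.linalg.norm(approx - exact) / np.linalg.norm(exact)
    print(f"relative Frobenius error: {rel_err:.2e}")  # ~0 for in-distribution rows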
Quite cool that there are new techniques (and apparently fast ones: the article claims 100x faster than exact multiplication and 10x faster than existing approximations) for such a classic problem!
But, if I understand correctly, it could happen with a certain probability that the actual error is quite high? Maybe if your a' is atypical?
ETA: From what I understand, many ML applications use 32-bit or even 16-bit floats, so high accuracy is not as important as speed anyway?
Yes, basically correct. A couple notes/clarifications for other readers:
- The rows a of A are "random," but in the sense of being drawn from some distribution for which we have a training set--not in the sense of, e.g., "assume everything is Gaussian" or some other idealized case.
- The training takes a few seconds to maybe a couple minutes, depending on your training set size. But it does have to be done ahead of time--you can't train after receiving the matrices and still get a speedup. Unless maybe the matrices are unrealistically large.
- The error in the approximation is provably not much larger than the error on the training set with high probability (under standard ML assumptions: finite variance, and training and test distributions that match), but there's no hard limit on it in general. (A generic bound of this flavor is sketched just after this list.) If you know the range of values in your individual vectors, you could obtain a hard limit, although it would be much looser than what you see in practice.
- Correct that many ML applications tolerate large losses in precision. But the caveat is that we won't get the right answer to a full 16 bits in each element (that would be ~99.999% accuracy); it's more like 90-99.9%, depending on the speed-quality tradeoff chosen. So we can't claim it will be sufficiently accurate in all cases.
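For readers who want the generic flavor of that high-probability statement: for a fixed, already-trained approximator whose per-example error lies in [0, 1], evaluated on n held-out i.i.d. examples, Hoeffding's inequality gives (a textbook illustration, not the paper's actual bound):

    \Pr\left[\, \mathbb{E}[\mathrm{err}] \;\le\; \overline{\mathrm{err}}_n + \sqrt{\ln(1/\delta)\,/\,(2n)} \,\right] \;\ge\; 1 - \delta

That is, the true expected error exceeds the observed average by more than the square-root term with probability at most delta.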
I understand there are K prototypes (centroids) in each subspace, and there are C disjoint subspaces. But how are the subspaces chosen? Do we naively split the vector space?
For the softmax example: if the original 512-element vector is x = (x0, x1, ..., x511) and we want to compress the vector to 4 bytes (with K=16, i.e. 4 bits per prototype index), there should be 8 subspaces. Does that mean we break the 512-element vector into eight 64-element sub-vectors, like (x0,...,x63), (x64,...,x127), (x128,...,x191), ..., (x448,...,x511), and each sub-vector, e.g. (x0,...,x63), is compressed into 4 bits? It looks like a lot of information is compressed away and lost.
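For intuition, here is a minimal NumPy sketch of exactly that split-and-encode step: plain product quantization with random stand-in prototypes (the paper learns its prototypes from training data and replaces the nearest-prototype search with cheap hashing). Note that nothing is zeroed out: each 64-element sub-vector is replaced by the 4-bit index of its nearest prototype, so what's kept is "which of 16 typical shapes does this sub-vector resemble":

    import numpy as np

    rng = np.random.default_rng(0)
    D, C, K = 512, 8, 16      # vector length, subspaces, prototypes per subspace
    S = D // C                # 64 contiguous elements per subspace

    # Stand-in prototypes; the real method learns these from training data.
    prototypes = rng.normal(size=(C, K, S))

    def encode(x):
        # Split x into C contiguous sub-vectors; for each, store only the
        # index of the nearest prototype: C * log2(K) = 8 * 4 bits = 4 bytes.
        subs = x.reshape(C, S)
        dists = ((subs[:, None, :] - prototypes) ** 2).sum(axis=2)  # C x K
        return dists.argmin(axis=1)                                 # C codes in [0, K)

    def decode(codes):
        # Lossy reconstruction: each sub-vector becomes its chosen prototype.
        return prototypes[np.arange(C), codes].reshape(D)

    x = rng.normal(size=D)
    codes = encode(x)                        # 8 four-bit indices
    x_hat = decode(codes)
    print(codes, np.linalg.norm(x - x_hat) / np.linalg.norm(x))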
I also wonder, in the image filtering field (such as Sobel and Gaussian filters), how one would choose a compression ratio in order to get reasonably good image quality.
One of our graduate students is thrilled about this paper, although he tends to do that with any new CS advance that seems sensational and that we barely understand (we do bioinformatics). He said that it stood to reason that if it can mmult 100GB/s/core, then we could matrix multiply 12TB in a minute!
Could you translate into practitioner-level language the practical limitations of this method: specifically, what error rates the approximation would induce under some practical scenarios, when it would and wouldn't make sense to use it, etc.? There is a complex equation in the paper describing the theoretical error bounds, but I have no idea whether, in some practical scenario like multiplying normally distributed variables, that would mean a 0.1%, 1%, 5%, or 10% error.
Personally, I think it only makes sense to use this kind of method in some real-time algorithm where speed is of the essence and the downstream results of the mmult are themselves used in some other approximation (as in many ML applications), and emphatically not to make the process of drawing biological conclusions from painstakingly derived data a few minutes faster for the analyst.
I fear that you have made an impressive, but dangerous, tool to people who don't know what they're doing.
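One practical way to get a feel for the error-rate question above: whatever approximation you have in hand, measure its relative Frobenius error directly on data that resembles yours. A minimal sketch on normally distributed matrices, using rank-k SVD truncation as a stand-in approximator (not the paper's method; the measurement is the point):

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.normal(size=(2048, 512))
    B = rng.normal(size=(512, 128))
    exact = A @ B

    # Stand-in approximator: replace B with its best rank-k version.
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    for k in (16, 64, 128):
        B_k = (U[:, :k] * s[:k]) @ Vt[:k]
        err = np.linalg.norm(A @ B_k - exact) / np.linalg.norm(exact)
        print(f"rank {k:3d}: relative error {100 * err:.1f}%")

On i.i.d. Gaussian inputs there is no structure to exploit, so the error stays large until k approaches full rank; methods like the paper's are premised on real data having structure that a training set can capture.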
I think we only claim to be able to preprocess a matrix at "up to" 100GB/s/core. The overall matrix product will take longer and depend on the matrix shapes.
To simplify Section 1.1, we help when:
1) You need to perform a matrix product more quickly and can tolerate approximation error
2) You have a training set for the larger matrix
3) The smaller matrix is either a) fixed or b) skinny relative to how tall the larger matrix is.
Re: "an impressive, but dangerous, tool to people who don't know what they're doing."
I believe you are overestimating the usability of my code :). But more seriously, I suspect that people attempting to use our method in contexts it wasn't designed for will quickly discover that they either can't actually call the API the way they wanted to, or that the method is no faster for their purposes. We also characterize our method at least as thoroughly as any approximate matrix multiplication algorithm I'm aware of, and have a variety of (admittedly loose) theoretical guarantees. So I hope that at least those who thoroughly read the paper will have a clear idea of what it can do. Overall, I guess my current thinking is that 1) I'm not sure how to introduce a method any more responsibly, but 2) if I can be of help in ensuring that it gets used well, feel free to reach out.
Would the embeds from OpenAI's CLIP be suitably "dense" for this to work? The usage requires calculating the cosine similarity between the embeds, and there are many instances where a lookup-table approach is already being applied, e.g. github.com/rom1504/clip-retrieval.
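For context, cosine similarity against a fixed bank of embeddings is exactly the a'B pattern from above once rows are normalized, which is what makes retrieval setups like this a plausible fit (shapes below are made up; whether CLIP embeds behave well enough in practice is an empirical question):

    import numpy as np

    rng = np.random.default_rng(0)
    E = rng.normal(size=(100_000, 512))   # stored embeds: the fixed matrix
    q = rng.normal(size=(64, 512))        # incoming query embeds

    # After row normalization, the inner product IS the cosine similarity,
    # so scoring every query against every stored embed is one matmul.
    E /= np.linalg.norm(E, axis=1, keepdims=True)
    q /= np.linalg.norm(q, axis=1, keepdims=True)

    sims = q @ E.T                        # 64 x 100k similarity matrix
    nearest = sims.argmax(axis=1)         # best match per query

Here E is fixed (condition 3a in the list above) and past queries provide the training set (condition 2).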
Also, feel free to email me at the address in the paper if you're interested in talking about it in more detail. E.g., I've already heard from some hardware folks looking at expanding on this work.