Author here. Ask me anything--happy to answer questions.
Also, if you like this kind of work, you might like what I've been building for the past year: Composer [1]. It speeds up neural net training by a lot (e.g., 7x faster for ResNet-50) [2] and, in contrast to Bolt/MADDNESS, is polished, documented code you can get working in <5min.
Thank you for your efforts. I came across your paper/code and posted it here. I was looking for a technique to cost-optimise transformer-based question answering. Presently I am using a CPU, and getting a GPU on AWS is too costly.
Since I work with high-level code, I don't understand the maths completely. However, I was wondering whether your techniques can be beneficial on CPUs?
If I were to use this to improve a transformer-based architecture, what should my approach be?
It should be possible to get large speedups on CPUs, but the trick will be gradually approximating each of the layers in the model (see my reply to the sibling comment). It's not conceptually difficult, but it will require a fair amount of C++ work to port the code to GPUs for training; and it will probably go slower than dense ops on modern GPUs because tensor cores don't support our memory layout.
I think of this paper as the first in a two-part series, where the next one takes these fast ops and gets them working in full neural nets. (If anyone wants to do this project, happy to coadvise you / talk about it whenever; I won't have bandwidth to do it myself for the foreseeable future).
Are you aware of any code/projects that convert something like a fully trained resnet50 model into a maddness optimized, approximate model?
An approximate but speedier resnet inference model that runs on a CPU would be useful even if it’s not quite as fast/accurate as a GPU inference model, since currently the cost to run a GPU is typically higher than CPUs.
This master's thesis sort of does it for individual layers, but it doesn't have any fine-tuning yet so it completely wrecks the accuracy: https://github.com/joennlae/halutmatmul.
If someone worked on contributing this functionality to Composer [1] I'd be down to help out. I can't justify building it all on my own right now since we're 100% focused on training speedup, but I could definitely meet and talk through it, help code tricky parts, review PRs, etc.
Not to distract from the clearly good technical work you've done here, but why name it Bolt when there's a fintech startup with the same name, among other brand-confusion issues (like the Chevy Bolt)?
I named it this in 2017 and was only worried about name collisions with other GitHub repos and ML algorithms. Also it's a backronym for Based On Lookup Tables + sounds at least somewhat evocative of going fast, so it was the best name for an algorithm I could come up with.
I see it noted that this could speed up machine learning inference, but any hope of this being extended to also speed up training? I imagine with 100x speedup in matmuls, albeit approximate matmuls, one could plausibly train on a CPU.
Yes. It's another research project to make this happen, but I think it would be fairly straightforward. The issue is that you can't backprop through the assignment step, so you get no gradient with respect to the input. This mandates a progressive layer freezing strategy. I don't think it would be too hard to get working though; you'd likely just need to train for longer, or start with a pretrained model and fine-tune it as you freeze + approximate the layers.
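To sketch that freeze-and-approximate schedule concretely (a very rough PyTorch sketch; ApproxLinear here is just a hypothetical frozen stand-in for an approximated layer, not something that exists in the repo):

    import torch.nn as nn

    class ApproxLinear(nn.Module):
        # Hypothetical stand-in for a MADDNESS-approximated layer. Here it just
        # wraps a frozen copy of the original; a real version would swap the matmul
        # for hashing + table lookups, which is why no gradient flows through it.
        def __init__(self, linear: nn.Linear):
            super().__init__()
            self.frozen = linear
            for p in self.frozen.parameters():
                p.requires_grad_(False)

        def forward(self, x):
            return self.frozen(x)

    def progressively_approximate(model: nn.Sequential, finetune):
        # Freeze + approximate one layer at a time, fine-tuning the rest in between.
        for i, layer in enumerate(model):
            if isinstance(layer, nn.Linear):
                model[i] = ApproxLinear(layer)
                finetune(model)   # plug in your own (short) fine-tuning loop here
        return model

    model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 10))
    model = progressively_approximate(model, finetune=lambda m: None)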
It's incredible that there's actually this much room to improve. How does this compare to GPU implementations?
Also, it looks like the optimization is related to running operations on a compressed representation. Regarding the 10x vs 100x speedup: is there a tradeoff between speed and accuracy, or is that extra order of magnitude just from bringing SIMD into the picture?
From what I can tell, this is a machine learning based approximation to matrix multiplication by a particular matrix (which it was trained on). It trades accuracy for speed. If you need to multiply many (many!) vectors by a static matrix and you have loose enough error tolerance, this can provide up to 100x speedup.
There's definitely a tradeoff between speed and accuracy. We characterize this for various problems in the paper (https://arxiv.org/pdf/2106.10860.pdf), but tl;dr is that it speeds things up more at a given level of error when there's more redundancy in your matrices.
Back-of-the-envelope calculation suggests that this won't beat tensor cores on NVIDIA GPUs. This is basically because ~half the die is an ASIC for dense (and 2:4 sparse) matmuls, with no support for the sparsity structure we induce. If 1:16 sparsity were supported or there were a batched warp_shuffle instruction, we'd get similar speedups for GPUs as we do on CPUs.
> If you have a large collection of mostly-dense vectors and can tolerate lossy compression, Bolt can probably save you 10-200x space and compute time.
Space. It can save space.
The main limitation of fast ML models nowadays is how many parameters you can load into your GPU memory, and these are usually matrices.
200x would allow me to run GPT-3 on my old GTX 1050.
- Do some ML frameworks implement it already?
- It promises up to 200x compression; is it reasonable to expect it to let us run GPT-3 on smaller mainstream GPUs?
No ML frameworks implement it yet, though I'd be happy to work with people from the PyTorch/TF/JAX/CUDNN/CUTLASS/etc. teams (or volunteers) if anyone wants to make this happen.
Also, while you can get 200x compression, I do want to emphasize that there's a speed vs quality tradeoff and the results will vary by problem. We have much more careful statements in the paper about the exact problem setup, tradeoffs, etc. Also, as I've mentioned in other comments, it probably won't help too much on modern GPUs due to their acceleration of dense GEMMs but not shuffles. CPU inference is the killer app here.
This sounds and looks impressive, but this part struck me:
"If you ... and can tolerate lossy compression"
What does this mean? I wouldn't have thought that matrix operations could be lossy. Does anybody know to what extent they are lossy, and where this would be acceptable?
The matrix itself is the operation: it's a function from R^n to R^m, and that function is what matrix compression approximates.
If you are familiar with PCA or SVD, you are already close to understanding a basic form of compression. The (thin) SVD breaks an m x n matrix into an m x n matrix with orthonormal columns, an n x n diagonal scaling matrix, and an n x n rotation matrix. The factors are ordered so that the directions describing the most variability come first. So if you keep only the first, say, 5 of the n dimensions, you can reconstruct an approximation of the original matrix from only about 5*(m+n) stored numbers instead of m*n.
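Here's a tiny numpy illustration of that rank-k truncation (the shapes and numbers are made up for the example):

    import numpy as np

    m, n, k = 1000, 100, 5
    A = np.random.randn(m, 20) @ np.random.randn(20, n)   # a matrix with lots of redundancy

    U, s, Vt = np.linalg.svd(A, full_matrices=False)      # thin SVD
    A_k = U[:, :k] * s[:k] @ Vt[:k]                       # keep only the top-k directions

    stored = k * (m + n) + k                              # numbers kept, vs. m*n originally
    rel_err = np.linalg.norm(A - A_k) / np.linalg.norm(A)
    print(stored / (m * n), rel_err)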
PCA is often used in machine learning too; there is a deep connection between compression and learning that is hard to make explicit or formalize.
The implementation linked here uses vector quantization instead.
The furthest I have come was applying an SVD to a 2D matrix to find the new axes of a skewed ellipse, but I think I could follow most of it. Thanks.
Yeah, that's enough to visualize the idea. If you have an m x 2 matrix and you take each row as a point in 2D space, the compression means projecting all the points onto the long axis of the ellipse. If all the points lie along a line, then the matrix is perfectly compressible.
In almost all practical uses of matrix multiplication, we already have rounding errors. For example, in 3D it is hard to exactly reverse a rotation and get the exact initial position back.
I don't know what amount of loss we are talking about here, but in deep learning many operations don't require crazy precision, which has led to some lightweight float formats (bfloat16 being the most common, but there are also 8-bit floats for extreme cases).
If this really gives a 10-100x speed increase at the cost of a bit of accuracy, I am sure machine learning folks will love it.
Neural networks generally tolerate lossy compression, so they'd be fine here, provided the authors demonstrate that networks built from the compressed ops still reach comparable accuracy.
Unfortunately, AFAICT, the experiments are not very comprehensive. I wouldn't be surprised if a good lossy compression scheme exists for this use case, but I'm not sure this is it.
This looks good. Why do the vectors have to be dense? Is it just that the speed gain relative to the overhead is lowest otherwise? I'm asking whether you could use it universally for all operations when you don't know the density in advance.
If your data is represented as sparse vectors, that sparse representation is already compressed. You wouldn’t want to decompress it to a dense representation just to apply a different, less effective, lossy compression algorithm. That would be like taking a screenshot of a paragraph of text so you can post a JPEG of it to Twitter.
I guess the naive approach, if we wanted a quick lossy matrix multiply, would be to take the truncated SVD and use that. How does this library compare to that boring strategy, I wonder?
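For reference, the boring strategy I mean would look something like this (just a numpy sketch): factor the fixed matrix B once, then every product becomes two skinny matmuls.

    import numpy as np

    d, m, k = 512, 512, 32
    B = np.random.randn(d, 40) @ np.random.randn(40, m)   # toy fixed matrix with redundancy

    # one-time setup: rank-k factorization of B
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    W, H = U[:, :k] * s[:k], Vt[:k]                        # B ~= W @ H

    A = np.random.randn(10_000, d)
    approx = (A @ W) @ H                                   # O(ndk + nkm) instead of O(ndm)
    exact = A @ B
    print(np.linalg.norm(approx - exact) / np.linalg.norm(exact))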
We found sparse, truncated PCA to be the most competitive baseline. We beat it by a lot (see the paper linked above), but its other big drawback is that trading off rank vs. sparsity is an ugly hyperparameter tuning problem. By ugly, I mean the results were really sensitive to getting this right, it wasn't easy to set a priori, and it took a while to iterate on because the sparse PCA trained much more slowly than any practical alternative.
There are situations where PCA/SVD is the right approach though. Namely, if you need really little error, our method often can't do that, whereas throwing away dims that explain almost no variance can. Also it's just easier to implement.
SVD performs poorly for compression in terms of accuracy per compression ratio. As the Bolt paper says, product quantization is the way to go for that. However, the accuracy trade-off is still pretty big, which is why it is mostly used for coarse similarity recall (as in faiss or scann).
Definitely. On CPUs, you could make this 2x faster pretty easily with just another execution port for vpshufb / vtbl and a 4-bit low/high unpack instruction.
Though the real speedup would be allowing dense matmul ASICs to operate on 16-byte tables and 4-bit indices as operands. The reason Bolt and MADDNESS end up so fast is that they produce "sparse" representations that are still contiguous, strided arrays in memory. So the kernels and access patterns are just like those of dense GEMMs (and therefore vectorize-able, etc), but with lookup-adds instead of multiply-adds.
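To make "lookup-adds instead of multiply-adds" concrete, here's a rough product-quantization-style numpy sketch. It uses a naive nearest-prototype encoder with randomly sampled prototypes instead of our hashing trees, and a real kernel does the table summation with vectorized byte shuffles rather than Python loops, so treat it as a picture of the data flow, not the actual algorithm.

    import numpy as np

    def pq_matmul_sketch(A, B, n_codebooks=4, K=16, seed=0):
        # Approximate A @ B using table lookups and adds.
        rng = np.random.default_rng(seed)
        n, d = A.shape
        sub = d // n_codebooks

        codes = np.empty((n, n_codebooks), dtype=np.uint8)    # 4-bit codes since K=16
        tables = np.empty((n_codebooks, K, B.shape[1]))
        for c in range(n_codebooks):
            Ac = A[:, c * sub:(c + 1) * sub]                  # one subspace of A
            protos = Ac[rng.choice(n, K, replace=False)]      # crude codebook: K sampled rows
            dists = ((Ac[:, None, :] - protos[None]) ** 2).sum(-1)
            codes[:, c] = dists.argmin(1)                     # encode: nearest prototype
            tables[c] = protos @ B[c * sub:(c + 1) * sub, :]  # precompute prototype . B slices

        out = np.zeros((n, B.shape[1]))                       # the "matmul": just lookups + adds
        for c in range(n_codebooks):
            out += tables[c, codes[:, c], :]
        return out

    A, B = np.random.randn(2000, 64), np.random.randn(64, 32)
    err = np.linalg.norm(pq_matmul_sketch(A, B) - A @ B) / np.linalg.norm(A @ B)
    print(err)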
> (In the common case that one matrix is known ahead of time,) our method also has the interesting property that it requires zero multiply-adds. These results suggest that a mixture of hashing, averaging, and byte shuffling (the core operations of our method) could be a more promising building block for machine learning than the sparsified, factorized, and/or scalar quantized matrix products that have recently been the focus of substantial research and hardware investment.
This is not at all what modern GPUs are optimized for.
IMO it would be super cool and I hope someone does it. There are a lot of interesting tradeoffs around which techniques to use for which matrix sizes and under which assumptions about read vs write ratios, what you have a training set for, whether you can fuse compression into previous ops, etc.
Hm... so maybe a better place would be one of these toolkits like JAX, where the entire computation is known at optimization time, whereas a BLAS would potentially have to do some heroic heuristics to try to fully optimize underneath the BLAS interface.
[1] https://github.com/mosaicml/composer
[2] https://www.mosaicml.com/blog/mosaic-resnet