I am surprised that they do not mention comparing against quantized matrix multiplication, because their "encoding" appears to be something like a quantization step with unevenly sized buckets. And their approximate multiplication step looks to me like multiplying a quantized input vector against a 1-bit quantized matrix.
But overall this is an extremely exciting development because it shows how one could convert an NN into an efficient hardware implementation. And because they work only on quantized data with LUTs, one can even embed low-dimensional matrices directly in the silicon.
My prediction is that this will develop to the point where we can soon buy $1 hardware accelerators for things like word embedding, grammar, and general language understanding, and then you only need those expensive GPUs for the last few layers of your LLM, thereby massively reducing deployment costs.
EDIT: Reading the actual paper, I saw that this work is also related to LoRA, because they convert high-dimensional input vectors to a quantized value based on a lower-dimensional embedding which they call "prototypes". So it's a bit like doing LoRA with 1-bit quantization, but instead of representing it as 8x 1-bit flags you represent it as 1x 8-bit integer.
Thank you for the feedback :-)
A lot of the work regarding the comparison with „simple“ approximate matrix multiplication has been done in the preceding paper: https://arxiv.org/abs/2106.10860
While I share your enthusiasm regarding the potential, we have to be careful about the limiting factors. Our main contribution on the algorithmic side is the reformulation of Maddness such that it is differentiable (autogradable) and can be used in e2e DNN training, as decision trees themselves are not differentiable.
We are still in the process of understanding how to optimise the training. As a next step, we want to look into transformers; for now, we have only looked into ResNets for easy comparability.
If you are a student at ETH Zurich and want to work on this -> reach out to me
Thanks for pointing that out :) When I first read the paper, I thought that 4. DIFFERENTIABLE MADDNESS was still part of the 3. BACKGROUND section.
Also, I have to admit that I don't quite understand that section, even after trying a 2nd time. The text implies that Sc would be 15x4 and Hc would be 16x15, but in the illustration it looks like 3x2 and 4x3. I guess I'll have to read Zhang [37] first, because as it stands I'm not sure what the selection matrix and description matrix do here. That said, (8) and the following equations are easy to understand again: you use the softmax to create an approximately correct gradient but use the hard maximum for calculating the forward-pass values.
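If I read (8) correctly, that is essentially a straight-through estimator. A minimal PyTorch sketch of the forward-hard / backward-soft pattern (my own illustration of the idea, not the paper's code):

```python
import torch
import torch.nn.functional as F

def hard_select(scores, tau=1.0):
    # Soft assignment, used only to provide gradients (differentiable).
    soft = F.softmax(scores / tau, dim=-1)
    # Hard one-hot assignment, used for the actual forward value.
    hard = F.one_hot(scores.argmax(dim=-1), scores.shape[-1]).float()
    # Straight-through: forward value is `hard`, gradient flows through `soft`.
    return hard + soft - soft.detach()

scores = torch.randn(4, 16, requires_grad=True)  # e.g. scores over 16 prototypes
onehot = hard_select(scores)
onehot.sum().backward()                          # gradients arrive via the softmax
```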
> My prediction is that this will develop to the point where we can soon buy $1 hardware accelerators for things like word embedding, grammar, and general language understanding, and then you only need those expensive GPUs for the last few layers of your LLM, thereby massively reducing deployment costs.
You'd still need a lot of RAM for storing these weights, wouldn't you? I mean, obviously, a $1 accelerator is a great improvement over $x,000 GPUs, but it doesn't mean we all get LLMs working on our phones just yet.
That's the beauty of their method: if you can replace an 8192x8192 matrix multiplication with an 8192x256 decision tree and then a 256x8192 lookup table, your memory requirements go from 67,108,864 down to about 2,162,688 parameters. (I assumed that their decision tree for encoding is perfectly balanced and only uses log2(256) = 8 parameters per row.)
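Back-of-the-envelope, under exactly those assumptions (perfectly balanced tree with 8 parameters per row, 256 prototypes), the numbers work out like this:

```python
import math

d_in, d_out, prototypes = 8192, 8192, 256

dense = d_in * d_out                          # 67,108,864 weights for a plain matmul
tree  = d_in * int(math.log2(prototypes))     # 8 tree parameters per row -> 65,536
lut   = prototypes * d_out                    # 256 x 8192 lookup table  -> 2,097,152

print(dense, tree + lut, round(dense / (tree + lut)))  # roughly 31x fewer parameters
```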
EDIT: And given that this work is centered around energy-efficiency and was sponsored by Huawei, I would guess that LLMs on your phone are precisely the goal here.
EDIT2: The process node that they did their calculations with appears to match Google's TPUv3, which has 0.56 TOPS/W, while the paper claims 161 TOPS/W. That would be a ~280x improvement in energy efficiency over the AI chips in Pixel phones.
Mind blown. Sounds almost too good to be true except the human brain runs on 20W and this brings us to the same ballpark. This was hard scifi a year ago!
Can an approach like this be integrated into stuff like llama.cpp so I could have a 200B model hashed down to 7B to run on civilian hardware or even a CPU?
I'd expect that on a regular CPU, the RAM access latency will destroy any performance improvements. This work is much better suited for FPGAs or ASICs.
We have to be careful with the comparisons we make.
The TPUv3 is a training and datacenter chip and not an Edge/Inference chip. They optimise for a different tradeoff, so while the comparison looks good, it is unfair.
Ok, I get excited by seeing the numbers, but can someone please explain in a single sentence where this can be used and how big the overall impact would be?
This is a dumb question but I guess this means that you can't make something like a LoRA in software, right? Because the network is physically hardcoded?
Based on the first figure in the paper, it seems that this scheme effectively turns 8 input values into a 4-bit number, thus giving an effective 0.5-bit quantization.
Considering that current aggressive quantization for LLM transformers uses 4 bits, does such a 0.5-bit quantization produce an effective neural network?
Does the scheme stay competitive if it is changed to use 4-bit quantization instead of 0.5-bit?
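Just to make the arithmetic behind that explicit (the group size of 8 is my reading of the figure, so it may be off):

```python
values_per_group = 8    # input values handled by one encoder (assumed from the figure)
bits_per_code = 4       # 16 prototypes -> a 4-bit index
print(bits_per_code / values_per_group)  # 0.5 effective bits per input value
```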
This is product quantization (a vector is chopped up into sub-vectors where each sub-vector is quantized using vector quantization (VQ)), not scalar quantization (which is what you're comparing it to here).
Also most scalar quantization methods use uniform quantization (e.g., divide the range between the scalar lower bound L and scalar upper bound H into N different regions where N is usually 2^bit_width), whereas PQ (and VQ) is learned quantization via k-means on some training vector set, so they're not really directly comparable.
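For intuition, here is a bare-bones product-quantization sketch (numpy + scikit-learn; the function names and sizes are my own, not from the paper): each 64-dim vector is split into 8 sub-vectors, and each sub-vector is replaced by the index of its nearest learned centroid.

```python
import numpy as np
from sklearn.cluster import KMeans

def pq_train(X, n_subvectors=8, n_centroids=16):
    # Learn one codebook per sub-space via k-means on the training vectors.
    subs = np.split(X, n_subvectors, axis=1)
    return [KMeans(n_clusters=n_centroids, n_init=4).fit(s).cluster_centers_ for s in subs]

def pq_encode(X, codebooks):
    # Replace each sub-vector by the index of its nearest centroid (4 bits for 16 centroids).
    subs = np.split(X, len(codebooks), axis=1)
    return np.stack([np.argmin(((s[:, None] - c[None]) ** 2).sum(-1), axis=1)
                     for s, c in zip(subs, codebooks)], axis=1).astype(np.uint8)

X = np.random.randn(1000, 64).astype(np.float32)
codebooks = pq_train(X)            # 8 codebooks of 16 centroids each
codes = pq_encode(X, codebooks)    # shape (1000, 8): 64 floats -> 8 x 4-bit codes
```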
You'd be surprised how far that takes you. I mean I was truly astonished when I saw that a GptNeoX LLM quantized down to 1.5 bits per value at 90% sparsity was still producing acceptable predictions. But the size went from multiple GBs to less than 1 MB of (compressed) parameters.
Nothing public, sorry. I do consulting on how to convert AIs from CUDA to C++ to save money. With good quantization, you can sometimes replace a $19k A100 with a $0.5k EPYC. And especially for apps and/or WebGL inference, you want small models.
Anyway, if you quantize to -1, 0, or +1 and then use arithmetic coding, you come out at around 1.58 bits per parameter. And then by skewing the distribution with forced sparsity, you have something like 5% x -1, 90% x 0, 5% x +1 which comes out at about 0.6 bits per parameter after arithmetic coding.
I used that on "gpt_neox.layers.*.mlp.dense_h_to_4h.weight" (HuggingFace PyTorch implementation), for example. But for other layers you need more bits. For example, I could never get gpt_neox.embed_in.weight to less than 2% -2, 8% -1, 80% 0, 8% +1, 2% +2 which comes out at around 1.1 bits per parameter [1]. And then stuff like gpt_neox.layers.0.attention.query_key_value.weight will drive up your overall bits per parameter because those are very difficult to quantize or sparsify. That 1.5 was the average over the entire model and some layers compress even better while others compress worse.
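Those bits-per-parameter numbers follow directly from the Shannon entropy of the weight distributions, which is the bound arithmetic coding approaches; a quick sanity check in Python:

```python
import math

def bits_per_weight(probs):
    # Shannon entropy = lower bound that arithmetic coding approaches.
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(bits_per_weight([1/3, 1/3, 1/3]))                 # ~1.58 bits: uniform {-1, 0, +1}
print(bits_per_weight([0.05, 0.90, 0.05]))              # ~0.57 bits: 90%-sparse ternary
print(bits_per_weight([0.02, 0.08, 0.80, 0.08, 0.02]))  # ~1.07 bits: the 5-level embed_in case
```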
Is it possible to get good performance in computation when encoding the data this way, or is there a lot of cycles lost to packing and unpacking these bits?
It's actually much faster if you're limited by RAM bandwidth, because instead of doing a float x float mul, which requires 8 bytes of load and 4 bytes of store, you do an int8 x int8 mul with 2 bytes in and 1 byte out. And typically for a quantized LLM like this, you'd only do packing and unpacking before or after a matmul on the low-dimensional vectors, so that you can use the quantized weights directly.
E.g. you quantize a 512-float activation to 512-int8, then matmul with 512x4096, Gelu, 4096x512 all in int8, then de-quantize to 512-float. That means no quantization overhead on those 4,194,304 parameters in your Dense layers.
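A rough numpy sketch of that quantize-once / compute-in-int8 / dequantize-once pattern (per-tensor scales, a ReLU stand-in for the GELU, toy values throughout):

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder int8 weights for the two dense layers (512x4096 and 4096x512).
w1 = rng.integers(-127, 128, size=(512, 4096)).astype(np.int8)
w2 = rng.integers(-127, 128, size=(4096, 512)).astype(np.int8)

def quantize(x, scale):
    # float -> int8 with a single per-tensor scale
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)

x = rng.standard_normal(512).astype(np.float32)
xq = quantize(x, scale=0.02)                     # quantize once, at the boundary

# int8 x int8 matmuls; products are accumulated in int32 to avoid overflow.
h = xq.astype(np.int32) @ w1.astype(np.int32)
h = np.maximum(h, 0)                             # activation (ReLU stand-in for the GELU)
hq = quantize(h.astype(np.float32), scale=h.max() / 127)   # requantize between layers
y_int = hq.astype(np.int32) @ w2.astype(np.int32)

y = y_int.astype(np.float32) * 1e-4              # dequantize once at the end (toy scale)
```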
* This accelerator is for an Edge/Inference case, so there is no training on this chip.
* We introduce a differentiable form of Maddness, allowing it to be used in e2e training, and present an application -> ResNet.
* We are still in the process of understanding how this will translate to transformers.
* The goal was to show that Maddness is feasible with a good codesign of the hardware.
* Compared to other extreme quantisation (BNN/TNN) and pruning schemes, this is more general as it replaces the matmul with an approximate matmul.
* The model architecture is not fixed in hardware. It is „just“ a matmul unit.
I hope this helps :-)