The title misses that it's "weights and activations", as in (from my skim) they are doing all the math with mostly 4-bit values. Normally quantized weights are converted back to 32 bits (edit: or, more generally, to a type for which there are native machine instructions to multiply/add, as mentioned below) as they are applied during a forward pass, which saves memory but incurs extra processing. Here they keep everything in (mostly) 4 bits to make it faster.
It's a nontrivial problem to do arithmetic directly on quantized values (which are not just rounded to fit in 4 bits, but are usually block-quantized with one or more scaling parameters stored at full precision) faster than simply converting them to floats that the CPU/GPU is already optimized to multiply and add. That is the problem this project is proposing a solution to.
If you have a 4-bit signal multiplied by a 4-bit weight, you can just use a lookup table for any operation, with any output type you want. A 256-entry table of 32-bit floats fits in 1 KB, for example, which fits in cache with plenty of room left over. This is independent of the quantization scheme as long as it isn't dynamic.
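To make that concrete, here's a minimal sketch of the 256-entry table idea in NumPy. The two's-complement int4 encoding and per-tensor scales are assumptions for illustration, not any particular project's scheme; the point is just that each "multiply" becomes an index into a 1 KB table:

    import numpy as np

    def build_int4_product_lut():
        """256 float32 entries: lut[(a & 0xF) << 4 | (b & 0xF)] = a * b for signed 4-bit a, b."""
        nibbles = np.arange(16)
        signed = np.where(nibbles < 8, nibbles, nibbles - 16)   # decode two's-complement int4
        return (signed[:, None] * signed[None, :]).astype(np.float32).reshape(256)

    def lut_dot(a_int4, b_int4, scale_a, scale_b, lut):
        """Dot product of two int4 vectors using only table lookups and adds."""
        a_idx = a_int4.astype(np.int64) & 0xF                   # raw nibble of each value
        b_idx = b_int4.astype(np.int64) & 0xF
        products = lut[(a_idx << 4) | b_idx]                    # one lookup per multiply
        return products.sum() * scale_a * scale_b               # apply the stored scales once

    lut = build_int4_product_lut()                              # 256 * 4 bytes = 1 KB
    a = np.array([3, -2, 7, -8], dtype=np.int8)
    b = np.array([1, 5, -4, 2], dtype=np.int8)
    print(lut_dot(a, b, scale_a=0.1, scale_b=0.05, lut=lut))    # == 0.1 * 0.05 * (a . b)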
It's quite unfavorable on modern hardware. A Sapphire Rapids core can do 2 separate 32-element half-precision FMAs (vfmadd132ph, [1]) per clock, i.e. 2 instructions x 32 values x 2 ops (multiply + add) = 128 FLOPs/cycle. It is not possible to achieve that kind of throughput with an 8-bit LUT and accumulation; even just a shuffle with vpshufb is too slow.
That's absolutely wild. If you really only needed vpshufb, the throughput would be the same in terms of values, because there are twice as many values per register even though you only get to retire half as many instructions per cycle; the problem is that it takes a bunch more instructions than that to combine the two inputs and apply a 256-entry LUT :(
Fair point. It might help if the system is DRAM-bandwidth limited, since reducing the data size pays off even though individual operations take multiple instructions. But that is not the situation with today's hardware.
Experts: if we bring the precision of operations down to 4 bits, would it not impact the output quality?
Intuition says yes. I'd appreciate a practitioner/theorist on the subject weighing in on what the impact of lower precision is on the accuracy/output of a model.
You can make up for it by learning more parameters, albeit with each parameter at a lower resolution. The tradeoff evidently works out favorably down to 4 bits; I'm basing this on the results reported in the k-bit inference scaling laws paper by Tim Dettmers and Luke Zettlemoyer, see Figure 1 here [1].
I tried to find the article I saw that did exactly that and couldn't, but empirically, if you take a look at the LLM leaderboard (https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderb...) and add the "precision" column to the data, you can see the GPTQ/4-bit/8-bit quants still handily beat the smaller models at full precision. The downside is there's no 3-bit option on the submission page, so we can't easily gauge how those are doing, but all my anecdotal personal experience with 3-bit has been extremely disappointing. Exllamav2 might have bridged that gap a bit. Again, I wish I could find that article for you; it laid all this out and showed a huge perplexity dropoff below 4-bit.
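For context on why that's even a fair comparison, the back-of-the-envelope weights-only memory math (illustrative model sizes, ignoring the small overhead of block scales and the activations) looks like this:

    def weight_gb(n_params, bits_per_weight):
        # weights-only footprint in GB: params * bits / 8 bits-per-byte
        return n_params * bits_per_weight / 8 / 1e9

    for name, n, bits in [("7B fp16", 7e9, 16), ("13B 4-bit", 13e9, 4), ("7B 4-bit", 7e9, 4)]:
        print(f"{name}: {weight_gb(n, bits):.1f} GB")
    # 7B fp16: 14.0 GB, 13B 4-bit: 6.5 GB, 7B 4-bit: 3.5 GB

So a 13B model quantized to 4 bits fits in roughly half the memory of a 7B model at fp16, which is why the larger quantized models are the ones to beat at a given footprint.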
The short answer is yes, it has a big impact on output quality, all other things being equal.
There are techniques like "quantization-aware training" [1] which aim to reduce the impact.
However, if your benchmark is "output quality for a fixed amount of compute" (e.g. running real-time voice recognition on a smartphone), a larger model that's been quantized might perform better than a smaller model that hasn't.
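To give a flavour of what "quantization-aware training" means mechanically, here's a minimal sketch of the usual fake-quantization / straight-through-estimator trick in PyTorch. The toy layer, the 4-bit width, and the symmetric per-tensor scaling are illustrative assumptions, not the API of any specific QAT library:

    import torch
    import torch.nn as nn

    def fake_quant(w, bits=4):
        """Forward pass sees quantized weights; gradients flow through unchanged (straight-through estimator)."""
        qmax = 2 ** (bits - 1) - 1                              # 7 for int4
        scale = w.detach().abs().max().clamp(min=1e-8) / qmax
        w_q = (w / scale).round().clamp(-qmax - 1, qmax) * scale
        return w + (w_q - w).detach()

    class QATLinear(nn.Linear):
        """Linear layer that trains against the 4-bit-rounded version of its own weights."""
        def forward(self, x):
            return nn.functional.linear(x, fake_quant(self.weight, bits=4), self.bias)

    # Toy training step: the model learns weights that still work after 4-bit rounding.
    layer = QATLinear(16, 8)
    opt = torch.optim.SGD(layer.parameters(), lr=1e-2)
    x, target = torch.randn(32, 16), torch.randn(32, 8)
    loss = nn.functional.mse_loss(layer(x), target)
    loss.backward()
    opt.step()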
This is a bit off-topic, but is there research into not having to evaluate the entire LLM for each output token (at perhaps some cost to output quality), thereby making it possible to run these models in a more compute- and memory-efficient manner?
Astonishing work. I was muttering to myself last week about quantization and thought useful 4-bit might never happen. IIUC there's a loss of 6-16%, which is totally reasonable, and it means my 4.8 GB Mistral model looks more like 2.4 GB once this trickles through to MLC. That means on-device GPT-3.25ish at 30 tokens/sec...
One thing I've been wondering about with quantization... peft seems to require a GPU. Are there any current quantization approaches that make fine-tuning more efficient purely on a CPU?