Q: Doesn't 4bit have worse output performance than 8bit or 16bit?
A: GPTQ doesn't quantize with naive linear rounding. While RTN 8bit does reduce output quality, GPTQ 4bit loses effectively no output quality compared to the uncompressed fp16 baseline.
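For intuition, here is a minimal NumPy sketch of plain round-to-nearest (RTN) quantization, the naive baseline mentioned in the answer. The function name and the per-tensor asymmetric scaling scheme are just illustrative assumptions; this is not the GPTQ algorithm itself, which adds Hessian-based error compensation on top of the rounding.

```python
# Minimal sketch of round-to-nearest (RTN) weight quantization, for intuition only.
# NOT GPTQ: GPTQ compensates rounding error column by column using second-order
# information, which is why it holds up much better at 4 bits than plain RTN.
import numpy as np

def rtn_quantize(w: np.ndarray, bits: int = 4) -> np.ndarray:
    """Asymmetric per-tensor RTN quantization to `bits` bits, then dequantize back."""
    qmax = 2 ** bits - 1
    w_min, w_max = w.min(), w.max()
    scale = (w_max - w_min) / qmax             # step size between quantization levels
    zero_point = np.round(-w_min / scale)      # integer offset so w_min maps to level 0
    q = np.clip(np.round(w / scale + zero_point), 0, qmax)   # quantize to integers
    return (q - zero_point) * scale            # dequantize (what the model actually "sees")

# Rough error comparison at different bit widths on random weights
w = np.random.randn(1024, 1024).astype(np.float32) * 0.02
for bits in (8, 4, 3):
    err = np.abs(rtn_quantize(w, bits) - w).mean()
    print(f"{bits}-bit RTN mean abs error: {err:.6f}")
```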
This is really interesting, thank you for the reference!
Having worked more with image-based NNs than language models before, I wonder: are LLMs inherently more suited to aggressive quantisation due to their very large size? I see people here suggesting that 4-bit is pretty good and that 3-bit should be the target.
I remember ResNets etc. can of course also be quantized, and down to 8-6 bits you get pretty good results with very little effort and only mild degradation in performance. Going down to 4 bits is more challenging, though this paper claims that with quantisation-aware training 4-bit is indeed possible; that means a lot of dedicated training compute is needed to get to 4 bits (not just post-training finetuning): https://arxiv.org/abs/2105.03536
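For reference, here is a rough PyTorch sketch of the "fake quantization" step that quantisation-aware training relies on: weights are rounded in the forward pass while gradients flow through unchanged (straight-through estimator). The function and parameter names are illustrative assumptions, not code from the linked paper.

```python
# Minimal sketch of the fake-quantization step used in quantization-aware training (QAT).
# Forward pass sees 4-bit-rounded weights; backward pass treats the rounding as identity
# (straight-through estimator), so the network can learn weights that tolerate the rounding.
import torch

def fake_quantize(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
    qmax = 2 ** (bits - 1) - 1                       # symmetric signed range, e.g. [-7, 7] for 4-bit
    scale = w.abs().max() / qmax
    w_q = torch.clamp(torch.round(w / scale), -qmax, qmax) * scale
    # Straight-through estimator: forward value is w_q, gradient w.r.t. w is identity
    return w + (w_q - w).detach()

# During QAT a layer would use fake_quantize(self.weight) in its forward pass.
w = torch.randn(128, 128, requires_grad=True)
loss = fake_quantize(w, bits=4).pow(2).sum()
loss.backward()                                      # gradients reach w despite the rounding
print(w.grad.shape)
```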
https://i.imgur.com/xmaNNDd.png