
See: https://arxiv.org/abs/2210.17323

Q: Doesn't 4-bit give worse output quality than 8-bit or 16-bit? A: GPTQ doesn't quantize with naive round-to-nearest. While RTN 8-bit does reduce output quality, GPTQ 4-bit shows negligible output quality loss compared to the uncompressed fp16 baseline.

https://i.imgur.com/xmaNNDd.png
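
For anyone wondering what the baseline in that plot is: round-to-nearest (RTN) just snaps each weight to the nearest point on a uniform grid, while GPTQ adjusts the remaining weights to compensate for the error each rounding step introduces. A minimal sketch of the RTN side only (PyTorch, per-row scale/zero-point; the function names are mine, not from the paper):

    import torch

    def rtn_quantize(w: torch.Tensor, bits: int = 4):
        """Naive round-to-nearest quantization, one scale/zero-point per output row.
        This is the RTN baseline GPTQ is compared against, not GPTQ itself."""
        qmax = 2 ** bits - 1
        wmin = w.min(dim=1, keepdim=True).values
        wmax = w.max(dim=1, keepdim=True).values
        scale = (wmax - wmin).clamp(min=1e-8) / qmax          # step size per row
        zero = torch.round(-wmin / scale)                      # integer zero-point
        q = torch.clamp(torch.round(w / scale) + zero, 0, qmax)  # low-bit codes
        return q, scale, zero

    def rtn_dequantize(q, scale, zero):
        return (q - zero) * scale   # reconstruct fp weights for inference

    # toy usage: quantize a random "layer" and look at the reconstruction error
    w = torch.randn(64, 64)
    q, s, z = rtn_quantize(w, bits=4)
    print((rtn_dequantize(q, s, z) - w).abs().mean())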

This is really interesting, thank you for the reference!

Having worked more with image-based NNs than language models before, I wonder: are LLMs inherently more suited to aggressive quantisation because of their very large size? I see people here suggesting that 4-bit is pretty good and 3-bit should be the target.

I remember ResNets etc. can of course also be quantized, and down to 8-6 bits you get pretty good results with very little effort and fairly low degradation in performance. Going down to 4 bits is more challenging, though this paper claims it is indeed possible with quantisation-aware training; that means a lot of dedicated training compute is needed to get to 4 bits (not just post-training finetuning): https://arxiv.org/abs/2105.03536
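
For reference, the quantisation-aware training idea is roughly: fake-quantize the weights in the forward pass and let gradients flow through the rounding via a straight-through estimator, so the network learns to cope with the low-bit grid during training. A rough sketch of that mechanism (a generic uniform quantizer with STE; my simplification, not the paper's exact scheme):

    import torch

    class FakeQuant(torch.autograd.Function):
        """Quantize in the forward pass, pass gradients straight through (STE)."""
        @staticmethod
        def forward(ctx, w, bits):
            qmax = 2 ** (bits - 1) - 1
            scale = w.abs().max().clamp(min=1e-8) / qmax
            return torch.round(w / scale).clamp(-qmax - 1, qmax) * scale
        @staticmethod
        def backward(ctx, grad_out):
            return grad_out, None   # straight-through: treat rounding as identity

    class QATLinear(torch.nn.Linear):
        """Linear layer that always 'sees' fake-quantized weights during training."""
        def __init__(self, in_f, out_f, bits=4):
            super().__init__(in_f, out_f)
            self.bits = bits
        def forward(self, x):
            w_q = FakeQuant.apply(self.weight, self.bits)
            return torch.nn.functional.linear(x, w_q, self.bias)

    # trains like a normal Linear, but gradients account for 4-bit rounding
    layer = QATLinear(16, 8, bits=4)
    out = layer(torch.randn(2, 16))
    out.sum().backward()   # gradients reach layer.weight via the STE

The point being: this has to happen throughout training (or a long finetune), which is why QAT at 4 bits costs so much more compute than the post-training approach GPTQ takes.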
