"Specifically, GPTQ can quantize GPT models with 175 billion parameters in approximately four GPU hours, reducing the bitwidth down to 3 or 4 bits per weight, with negligible accuracy degradation relative to the uncompressed baseline."
This would be 175 billion 3-bit weights instead of 175 billion 16-bit (or 32-bit!) weights. It massively reduces the size of the model and makes loading it into RAM on consumer computers feasible. The number of parameters stays the same.
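Rough back-of-the-envelope numbers for the weights alone (ignoring per-group scales/zero-points, activations, and KV cache, so the real footprint is a bit higher):

```python
params = 175e9  # parameter count of the 175B model

for bits in (32, 16, 4, 3):
    gb = params * bits / 8 / 1e9  # bytes -> GB
    print(f"{bits:>2}-bit weights: ~{gb:,.0f} GB")

# 32-bit: ~700 GB, 16-bit: ~350 GB, 4-bit: ~88 GB, 3-bit: ~66 GB
```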
I've read the paper and, to be honest, I'm not sure what to make of it. Their headline benchmark is perplexity on WikiText2, which wouldn't be particularly relevant to most users. If you look at the tables in Appendix A.4 with some more relevant benchmarks, you'll sometimes find that straight RTN 4-bit quantisation beats both GPTQ and even the full 16-bit original! No explanation for this is given in the paper.
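For anyone unfamiliar, RTN ("round-to-nearest") just snaps each weight to the nearest level of a uniform grid, with no calibration data. A minimal sketch of an asymmetric per-row variant (the exact grouping and scale choice varies by implementation, so treat this as illustrative only):

```python
import numpy as np

def rtn_quantize(w, bits=4):
    """Round-to-nearest quantization with one (scale, min) pair per row."""
    qmax = 2**bits - 1                                  # e.g. 15 levels for 4-bit
    w_min = w.min(axis=1, keepdims=True)
    w_max = w.max(axis=1, keepdims=True)
    scale = np.maximum((w_max - w_min) / qmax, 1e-8)    # step size per row
    q = np.round((w - w_min) / scale)                   # integer codes 0..qmax
    return q * scale + w_min                            # dequantized weights

# Example: quantize a random "layer" and check the worst-case rounding error
w = np.random.randn(4, 8).astype(np.float32)
w_q = rtn_quantize(w, bits=4)
print(np.abs(w - w_q).max())
```

GPTQ differs in that it adjusts the remaining weights to compensate for each rounding error (using second-order information from calibration data) rather than rounding every weight independently.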
Some of those benchmarks have a pretty small sample size IIRC; it might just be coincidence that the noise introduced by RTN happens to slightly improve them.
GPTQ beats RTN on almost every benchmark at almost every size, though.