
Cool. The way current hardware handles very low precision is quite inefficient.



Indeed, reading the QLoRA paper, the 4-bit quantised data is converted to 16-bit floats (usually BFloat16) for calculations, then converted back again. I suppose this standard would allow smaller data types to be supported directly by hardware instructions.
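
For a rough picture of what that dequantize-then-compute path looks like, here's a minimal PyTorch sketch. The blockwise absmax scheme and the function names are illustrative assumptions, not the paper's actual NF4 code; the point is just that the 4-bit codes get expanded to bfloat16 before the matmul because the hardware has no native 4-bit compute:

    import torch

    def quantize_4bit(w, block=64):
        # Split into blocks and store one fp scale (absmax) per block.
        w = w.reshape(-1, block)
        scale = w.abs().amax(dim=1, keepdim=True) / 7.0   # symmetric int4 range -7..7
        q = torch.clamp(torch.round(w / scale), -7, 7).to(torch.int8)
        return q, scale

    def dequantize_to_bf16(q, scale, shape):
        # The matmul runs in bf16, so expand the 4-bit codes back up first.
        return (q.to(torch.bfloat16) * scale.to(torch.bfloat16)).reshape(shape)

    W = torch.randn(4096, 4096)
    q, s = quantize_4bit(W)
    x = torch.randn(1, 4096, dtype=torch.bfloat16)
    y = x @ dequantize_to_bf16(q, s, W.shape).T   # compute happens in bf16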


Exactly!


Stupid question, but will people start pushing for more precision once the cost of memory and compute falls further? Or, by the nature of these data types, will there never really be a need for more bits?


Different types of models seem to have different "sweet spots."

For example, current transformer LLMs seem to like 4-6 bits with smart quantization, and still perform well at 3-4 bits with extremely aggressive quantization methods (like good use of sparsity and profiling inference on representative data).

The Stable Diffusion UNet doesn't like 8-bit without some changes, and the VAE barely even tolerates fp16.

So to answer your question: some quantization is "free" and there's no reason not to use it, but sometimes it's very lossy and a serious compromise that wouldn't be made if more compute/RAM were available.

Also, sometimes the dequantization adds compute overhead that makes quantized inference/training slower. Other times the reduced weight size makes passes faster because memory bandwidth is the bottleneck. It just depends.
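
To make the "sometimes free, sometimes lossy" point concrete, here's a toy sketch that quantizes a random weight matrix at a few bit widths and reports the relative reconstruction error. The per-channel absmax scheme and the synthetic weights are assumptions for illustration; real methods and real model weights shift the sweet spots, which is the poster's point:

    import torch

    def quant_error(w, bits):
        qmax = 2 ** (bits - 1) - 1                      # symmetric integer range
        scale = w.abs().amax(dim=1, keepdim=True) / qmax
        q = torch.clamp(torch.round(w / scale), -qmax, qmax)
        w_hat = q * scale
        return ((w - w_hat).norm() / w.norm()).item()   # relative error

    W = torch.randn(4096, 4096)
    for bits in (8, 6, 4, 3):
        print(bits, "bits ->", f"{quant_error(W, bits):.4f} relative error")

The error grows quickly below 4 bits with a naive scheme like this, which is why the aggressive 3-4 bit methods lean on smarter tricks (sparsity, calibration data) to stay usable.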


If memory and compute prices fall, people will just train bigger models.


Both. You could either go for more precision or go for bigger models.



