
Cool. The way current hardware handles very low precision is quite inefficient.



Indeed, reading the QLoRA paper, the 4-bit quantised data is converted to 16-bit floats (usually BFloat16) for calculations, then converted back again. I suppose this standard would allow smaller data types to be supported directly by hardware instructions.
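
For a rough picture of what that dequantize-then-compute path looks like, here's a minimal PyTorch sketch. The blockwise absmax scheme and the function names are illustrative assumptions, not the paper's actual NF4 code; the point is just that the 4-bit codes get expanded to bfloat16 before the matmul because the hardware has no native 4-bit compute:

    import torch

    def quantize_4bit(w, block=64):
        # Split into blocks and store one fp scale (absmax) per block.
        w = w.reshape(-1, block)
        scale = w.abs().amax(dim=1, keepdim=True) / 7.0   # symmetric int4 range -7..7
        q = torch.clamp(torch.round(w / scale), -7, 7).to(torch.int8)
        return q, scale

    def dequantize_to_bf16(q, scale, shape):
        # The matmul runs in bf16, so expand the 4-bit codes back up first.
        return (q.to(torch.bfloat16) * scale.to(torch.bfloat16)).reshape(shape)

    W = torch.randn(4096, 4096)
    q, s = quantize_4bit(W)
    x = torch.randn(1, 4096, dtype=torch.bfloat16)
    y = x @ dequantize_to_bf16(q, s, W.shape).T   # compute happens in bf16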


Exactly!


Stupid question, but will people start pushing for more precision once the cost of memory and compute falls further? Or, by the nature of these data types, will there never really be a need for more bits?


Different types of models seem to have different "sweet spots."

For example, current transformer LLMs seem to like 4-6 bits with smart quantization, and still perform well at 3-4 bits with extremely aggressive quantization methods (like good use of sparsity and profiling inference on representative data).

The Stable Diffusion UNet doesn't like 8-bit without some changes, and the VAE barely even tolerates fp16.

So to answer your question: some quantization is "free" and there's no reason not to use it, but sometimes it's very lossy and a serious compromise that wouldn't be made if more compute/RAM were available.

Also, sometimes the dequantization adds compute overhead that makes quantized inference/training slower. Other times the reduced weight size makes passes faster because memory bandwidth is the bottleneck. It just depends.
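
To make the "sometimes free, sometimes lossy" point concrete, here's a toy sketch that quantizes a random weight matrix at a few bit widths and reports the relative reconstruction error. The per-channel absmax scheme and the synthetic weights are assumptions for illustration; real methods and real model weights shift the sweet spots, which is the poster's point:

    import torch

    def quant_error(w, bits):
        qmax = 2 ** (bits - 1) - 1                      # symmetric integer range
        scale = w.abs().amax(dim=1, keepdim=True) / qmax
        q = torch.clamp(torch.round(w / scale), -qmax, qmax)
        w_hat = q * scale
        return ((w - w_hat).norm() / w.norm()).item()   # relative error

    W = torch.randn(4096, 4096)
    for bits in (8, 6, 4, 3):
        print(bits, "bits ->", f"{quant_error(W, bits):.4f} relative error")

The error grows quickly below 4 bits with a naive scheme like this, which is why the aggressive 3-4 bit methods lean on smarter tricks (sparsity, calibration data) to stay usable.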


If memory and compute prices fall, people will just train bigger models.


Both. You could either go for more precision or go for bigger models.



