The goal of this type of quantization is to move the multiplication by the fp32 rescale factor outside of the dot-product accumulation.

So the multiplications and additions are done in fp8/int8/int4/whatever (when the hardware supports those operations, of course) and accumulated in fp32 or similar, and only the final accumulator is multiplied by the rescale factor in fp32.
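
To make that concrete, here is a rough NumPy sketch, not taken from any particular library. It assumes symmetric per-tensor int8 quantization with a single fp32 scale per vector, and it accumulates in int32 (one possible reading of "or similar"). The function names `quantize_int8`, `dot_rescale_inside` and `dot_rescale_outside` are made up for illustration.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor quantization (assumed scheme): x ~= scale * q."""
    max_abs = float(np.max(np.abs(x)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, np.float32(scale)

def dot_rescale_inside(qa, sa, qb, sb):
    """Naive version: every element is rescaled in fp32 before the multiply."""
    acc = np.float32(0.0)
    for x, y in zip(qa, qb):
        acc += (sa * np.float32(x)) * (sb * np.float32(y))
    return acc

def dot_rescale_outside(qa, sa, qb, sb):
    """Rescale hoisted out: integer multiply-accumulate, one fp32 multiply at the end."""
    # Widen to int32 so the accumulation does not overflow int8.
    acc = np.dot(qa.astype(np.int32), qb.astype(np.int32))
    return np.float32(acc) * (sa * sb)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.standard_normal(256).astype(np.float32)
    b = rng.standard_normal(256).astype(np.float32)
    qa, sa = quantize_int8(a)
    qb, sb = quantize_int8(b)
    print("fp32 reference :", float(np.dot(a, b)))
    print("rescale inside :", float(dot_rescale_inside(qa, sa, qb, sb)))
    print("rescale outside:", float(dot_rescale_outside(qa, sa, qb, sb)))
```

Both quantized versions compute the same quantity up to fp32 rounding, since sum((sa*x)*(sb*y)) = (sa*sb) * sum(x*y). The second one just keeps the inner loop in integer arithmetic, which is the part the hardware can do cheaply.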