I'm still amazed that quantization works at all, coming out as a mild degradation in quality rather than radical dysfunction. Not that I've thought it through that much. Does quantization work with most neural networks?



> Does quantization work with most neural networks?

Yes. It works pretty well for CNN-based vision models. Or rather, I'd claim it works even better: with post-training quantization you can make most models run with minimal precision loss entirely in int8 (fixed point), that is, the computation is done over int8/int32 with no floating point at all, as opposed to the weight-only approach discussed here.
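
A minimal sketch of what I mean, in plain NumPy (toy code of my own, not any framework's API): quantize weights and activations symmetrically to int8, do the matmul with an int32 accumulator, and dequantize once at the end. The per-tensor symmetric scheme is an illustrative assumption.

    import numpy as np

    def quantize_sym(x, bits=8):
        # Symmetric per-tensor quantization: map floats to signed integers.
        qmax = 2 ** (bits - 1) - 1          # 127 for int8
        scale = np.max(np.abs(x)) / qmax    # float scale factor
        q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
        return q, scale

    rng = np.random.default_rng(0)
    W = rng.normal(size=(64, 128)).astype(np.float32)   # "weights"
    x = rng.normal(size=(128,)).astype(np.float32)      # "activations"

    Wq, w_scale = quantize_sym(W)
    xq, x_scale = quantize_sym(x)

    # Integer compute: int8 operands, int32 accumulator, one dequantize step.
    acc = Wq.astype(np.int32) @ xq.astype(np.int32)
    y_int8 = acc * (w_scale * x_scale)

    y_fp32 = W @ x
    print("max abs error:", np.max(np.abs(y_int8 - y_fp32)))

The error stays small relative to the float32 result, which is the "minimal precision loss" part.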

If you do QAT, something down to 2-bit weights and 4-bit activations can work.
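
For concreteness, here is a toy sketch (again NumPy, my own illustration) of the "fake quantization" forward pass used in QAT at 2-bit weights / 4-bit activations. In real QAT the rounding step is bypassed in the backward pass (straight-through estimator) so gradients still flow; only the forward side is shown here.

    import numpy as np

    def fake_quant(x, bits):
        # Quantize to a signed grid and immediately dequantize, so the
        # forward pass "sees" low-precision values while staying in float.
        qmax = 2 ** (bits - 1) - 1
        scale = np.max(np.abs(x)) / qmax
        return np.clip(np.round(x / scale), -qmax, qmax) * scale

    rng = np.random.default_rng(0)
    W = rng.normal(size=(16, 32)).astype(np.float32)
    x = rng.normal(size=(32,)).astype(np.float32)

    y = fake_quant(W, bits=2) @ fake_quant(x, bits=4)   # W2A4 forward pass
    print(y[:4])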

People weren't interested in weight-only quantization back then because CNNs are in general "denser", i.e. the bottleneck was compute, not memory.


thanks!


Intuitively, the output space is much smaller than the latent space. So during training you need the higher precision so that the latent space converges. But during inference, you just need enough precision that the mapping into the much smaller output space is preserved.
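
A toy illustration of that point (NumPy, made-up classifier head and inputs): quantization perturbs the logits slightly, but the argmax over a small output space, i.e. the predicted class, usually doesn't change.

    import numpy as np

    def quantize_sym(x, bits=8):
        # Quantize symmetrically and dequantize, returning the rounded floats.
        qmax = 2 ** (bits - 1) - 1
        scale = np.max(np.abs(x)) / qmax
        return np.clip(np.round(x / scale), -qmax, qmax) * scale

    rng = np.random.default_rng(1)
    W = rng.normal(size=(10, 256)).astype(np.float32)   # 10-class "classifier head"
    X = rng.normal(size=(1000, 256)).astype(np.float32) # 1000 random inputs

    pred_fp = np.argmax(X @ W.T, axis=1)
    pred_q8 = np.argmax(X @ quantize_sym(W).T, axis=1)
    print("fraction of predictions unchanged:", np.mean(pred_fp == pred_q8))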



