
Betting on the 3090 was always misguided without native fp8 support.


Performance for text generation is memory-bandwidth-bound, so the lack of native fp8 support doesn't matter. You have more than enough compute left over to do the math in whichever floating-point format you fancy.
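To put rough numbers on that, here's a back-of-the-envelope sketch (the 8B parameter count is an illustrative assumption, 936 GB/s is the 3090's spec-sheet bandwidth, and KV-cache traffic and batching are ignored):

    # Upper bound on single-stream decode speed: every generated token streams
    # all the weights through the memory bus once, so
    # tokens/s <= memory_bandwidth / total_weight_bytes.
    def decode_tokens_per_s(params_billion, bytes_per_param, bandwidth_gb_s):
        weight_bytes = params_billion * 1e9 * bytes_per_param
        return bandwidth_gb_s * 1e9 / weight_bytes

    BW_3090 = 936.0  # GB/s
    for fmt, bytes_per_param in [("FP16", 2.0), ("FP8/INT8", 1.0)]:
        print(fmt, round(decode_tokens_per_s(8, bytes_per_param, BW_3090), 1), "tok/s")
    # FP16 ~58.5 tok/s, FP8/INT8 ~117 tok/s -- halving the bytes per weight roughly
    # doubles decode throughput, regardless of which format the arithmetic itself uses.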


How so? Nvidia isn't exactly some also-ran platform: people have long since written custom kernels that fuse dequantization and GEMV/GEMM into a single kernel.
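For anyone unfamiliar with what those fused kernels do, here's an unfused numpy reference of the same weight-only matvec (the function name and per-output-channel scale layout are my own illustrative choices); the optimized kernels perform the dequantization in registers right before the multiply-accumulate instead of ever materializing FP16 weights in memory:

    import numpy as np

    def w8a16_gemv_reference(w_q, scales, x):
        # w_q:    int8 weights, shape (out_features, in_features)
        # scales: fp16 per-output-channel scales, shape (out_features,)
        # x:      fp16 activations, shape (in_features,)
        # Computes y = (w_q * scales[:, None]) @ x without storing a
        # dequantized copy of the weight matrix.
        acc = w_q.astype(np.float32) @ x.astype(np.float32)
        return (acc * scales.astype(np.float32)).astype(np.float16)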


Performance is good enough for non-reasoning models even when the weights are quantized to FP8 or FP4. Check the Phoronix article: the difference between the 3090 and 4090 is rather small.

There's weight-only FP8 in vLLM on Nvidia Ampere: https://docs.vllm.ai/en/latest/features/quantization/fp8.htm...
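If I'm remembering the docs right, asking for FP8 at load time is all it takes; on Ampere it runs as weight-only FP8, since the hardware has no FP8 tensor cores and activations stay FP16. The model name here is just a placeholder:

    from vllm import LLM, SamplingParams

    # Online FP8 quantization of an FP16 checkpoint at load time.
    llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", quantization="fp8")
    out = llm.generate(["Why is LLM decoding memory-bandwidth-bound?"],
                       SamplingParams(max_tokens=64))
    print(out[0].outputs[0].text)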


What? There are highly optimized Marlin kernels for W8A16 (FP8 or INT8 weights, FP16 activations) that work very well on a 3090.
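For concreteness, the weight side of W8A16 is just 8-bit values plus a higher-precision scale; a minimal sketch of the per-output-channel INT8 variant (shapes are illustrative, and Marlin additionally repacks the weights into a tensor-core-friendly layout):

    import numpy as np

    def quantize_w8(w):
        # Symmetric per-output-channel INT8 quantization of an FP16 weight matrix.
        scales = np.abs(w).max(axis=1, keepdims=True) / 127.0
        w_q = np.clip(np.round(w / scales), -127, 127).astype(np.int8)
        return w_q, scales.astype(np.float16)

    w = np.random.randn(4096, 4096).astype(np.float16)
    w_q, scales = quantize_w8(w)
    err = np.abs(w_q * scales.astype(np.float32) - w.astype(np.float32)).max()
    print("max reconstruction error:", err)  # small next to typical weight magnitudes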



