
Betting on the 3090 was always misguided without native fp8 support.


Performance for text generation is memory-bandwidth-bound, so the lack of native fp8 support doesn't matter. You have more than enough compute left over to do the math in whichever floating-point format you fancy.
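To put rough numbers on that, here's a back-of-the-envelope sketch (the 8B parameter count is an illustrative assumption, 936 GB/s is the 3090's spec-sheet bandwidth, and KV-cache traffic and batching are ignored):

    # Upper bound on single-stream decode speed: every generated token streams
    # all the weights through the memory bus once, so
    # tokens/s <= memory_bandwidth / total_weight_bytes.
    def decode_tokens_per_s(params_billion, bytes_per_param, bandwidth_gb_s):
        weight_bytes = params_billion * 1e9 * bytes_per_param
        return bandwidth_gb_s * 1e9 / weight_bytes

    BW_3090 = 936.0  # GB/s
    for fmt, bytes_per_param in [("FP16", 2.0), ("FP8/INT8", 1.0)]:
        print(fmt, round(decode_tokens_per_s(8, bytes_per_param, BW_3090), 1), "tok/s")
    # FP16 ~58.5 tok/s, FP8/INT8 ~117 tok/s -- halving the bytes per weight roughly
    # doubles decode throughput, regardless of which format the arithmetic itself uses.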


How so? Nvidia isn't exactly some also-ran platform: people have long since written custom kernels that fuse dequantization and GEMV/GEMM into a single kernel.
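For anyone unfamiliar with what those fused kernels do, here's an unfused numpy reference of the same weight-only matvec (the function name and per-output-channel scale layout are my own illustrative choices); the optimized kernels perform the dequantization in registers right before the multiply-accumulate instead of ever materializing FP16 weights in memory:

    import numpy as np

    def w8a16_gemv_reference(w_q, scales, x):
        # w_q:    int8 weights, shape (out_features, in_features)
        # scales: fp16 per-output-channel scales, shape (out_features,)
        # x:      fp16 activations, shape (in_features,)
        # Computes y = (w_q * scales[:, None]) @ x without storing a
        # dequantized copy of the weight matrix.
        acc = w_q.astype(np.float32) @ x.astype(np.float32)
        return (acc * scales.astype(np.float32)).astype(np.float16)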


Performance is good enough for non-reasoning models even when the weights are quantized to FP8 or FP4. Check the Phoronix article: the difference between the 3090 and 4090 is rather small.

There's weight-only FP8 in vLLM on Nvidia Ampere: https://docs.vllm.ai/en/latest/features/quantization/fp8.htm...
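If I'm remembering the docs right, asking for FP8 at load time is all it takes; on Ampere it runs as weight-only FP8, since the hardware has no FP8 tensor cores and activations stay FP16. The model name here is just a placeholder:

    from vllm import LLM, SamplingParams

    # Online FP8 quantization of an FP16 checkpoint at load time.
    llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", quantization="fp8")
    out = llm.generate(["Why is LLM decoding memory-bandwidth-bound?"],
                       SamplingParams(max_tokens=64))
    print(out[0].outputs[0].text)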


What? There are highly optimized Marlin kernels for W8A16 (FP8 or INT8 weights, FP16 activations) that work very well on a 3090.
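For concreteness, the weight side of W8A16 is just 8-bit values plus a higher-precision scale; a minimal sketch of the per-output-channel INT8 variant (shapes are illustrative, and Marlin additionally repacks the weights into a tensor-core-friendly layout):

    import numpy as np

    def quantize_w8(w):
        # Symmetric per-output-channel INT8 quantization of an FP16 weight matrix.
        scales = np.abs(w).max(axis=1, keepdims=True) / 127.0
        w_q = np.clip(np.round(w / scales), -127, 127).astype(np.int8)
        return w_q, scales.astype(np.float16)

    w = np.random.randn(4096, 4096).astype(np.float16)
    w_q, scales = quantize_w8(w)
    err = np.abs(w_q * scales.astype(np.float32) - w.astype(np.float32)).max()
    print("max reconstruction error:", err)  # small next to typical weight magnitudes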



