Performance for text generation is memory-bandwidth-limited, so the lack of native FP8 support doesn't matter: there is more than enough compute left over to do the math in whichever floating-point format you fancy.
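Rough back-of-envelope for why decode is bandwidth-bound, with illustrative assumed numbers (7B model, 1 byte/weight, ~1 TB/s VRAM):

```python
# Decode speed is bounded by how fast the weights stream out of VRAM,
# not by FLOPs. All numbers here are illustrative assumptions.
model_params = 7e9        # assumed 7B-parameter model
bytes_per_param = 1.0     # FP8/INT8: one byte per weight
bandwidth = 1.0e12        # assumed ~1 TB/s VRAM bandwidth

# Every weight is read once per generated token during decode
bytes_per_token = model_params * bytes_per_param
tokens_per_sec = bandwidth / bytes_per_token
print(f"~{tokens_per_sec:.0f} tokens/s upper bound")
```

The matching FLOP count (~2 per weight per token) is tiny next to what any recent GPU can do, so the ALUs sit mostly idle regardless of which format they compute in.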
How so? It's not as if Nvidia were some also-ran company for which nobody built custom kernels that fuse dequantization and GEMV/GEMM into a single pass.
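The idea behind those fused kernels, sketched in numpy (the per-row int8 scheme and names here are illustrative, not any specific library's kernel):

```python
import numpy as np

# Sketch of a fused dequant + GEMV: weights stay quantized in memory
# (int8 with one FP32 scale per output row) and are dequantized on the
# fly inside the matvec, so only ~1 byte/weight crosses the memory bus.
rng = np.random.default_rng(0)
out_dim, in_dim = 16, 32
w = rng.standard_normal((out_dim, in_dim)).astype(np.float32)

# Quantize: per-row absmax scaling into int8
scales = np.abs(w).max(axis=1, keepdims=True) / 127.0
q = np.round(w / scales).astype(np.int8)

x = rng.standard_normal(in_dim).astype(np.float32)

def fused_dequant_gemv(q, scales, x):
    # A real CUDA kernel would load an int8 tile, dequantize in
    # registers, and accumulate; this one line stands in for that loop.
    return (q.astype(np.float32) @ x) * scales[:, 0]

y = fused_dequant_gemv(q, scales, x)
y_ref = (q.astype(np.float32) * scales) @ x  # dequantize-then-multiply
assert np.allclose(y, y_ref, atol=1e-4)
```

Since the dequant happens in registers, the compute format (FP16, FP32, whatever) is a free choice; the bus only ever sees the quantized bytes.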
Performance is good enough for non-reasoning models even when they're FP8 or FP4. Check the Phoronix article: the difference between the 3090 and the 4090 is rather small.
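Which is exactly what the bandwidth picture predicts. Taking the published spec bandwidths (assumed from the datasheets, not measured):

```python
# If decode is bandwidth-bound, the expected speedup between two cards
# is roughly their VRAM bandwidth ratio. Spec GB/s values assumed from
# the public datasheets.
bw_3090 = 936
bw_4090 = 1008
ratio = bw_4090 / bw_3090
print(f"expected decode speedup: {ratio:.2f}x")
```

Under 10% apart, despite the 4090 having several times the compute, which is why the benchmark gap stays small.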