
Performance is good enough for non-reasoning models even if they're quantized to FP8 or FP4. Check the Phoronix article: the difference between the 3090 and the 4090 is rather small.

There's weight-only FP8 in vLLM on NVIDIA Ampere: https://docs.vllm.ai/en/latest/features/quantization/fp8.htm...
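For anyone unsure what "weight-only" means here, a rough sketch in plain Python: the weights are stored as FP8 (E4M3) values plus a scale and expanded back to higher precision before the matmul, so the GPU needs no native FP8 math. This is only an illustration of the idea, not vLLM's actual kernels; the per-tensor scaling and the helper names (`round_to_e4m3`, `quantize_weights`, `dequantize`) are my own assumptions.

```python
import math

E4M3_MAX = 448.0                 # largest finite FP8 E4M3 value
E4M3_MIN_NORMAL = 2.0 ** -6      # smallest normal E4M3 value
E4M3_SUBNORMAL_STEP = 2.0 ** -9  # spacing of E4M3 subnormals


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest E4M3-representable value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), E4M3_MAX)
    if a < E4M3_MIN_NORMAL:
        step = E4M3_SUBNORMAL_STEP
    else:
        _, e = math.frexp(a)     # a = m * 2**e with 0.5 <= m < 1
        step = 2.0 ** (e - 4)    # 3 mantissa bits: 8 steps per power of two
    return sign * min(round(a / step) * step, E4M3_MAX)


def quantize_weights(weights):
    """Hypothetical per-tensor scheme: returns (fp8_values, scale)."""
    amax = max(abs(w) for w in weights)
    scale = amax / E4M3_MAX if amax > 0 else 1.0
    return [round_to_e4m3(w / scale) for w in weights], scale


def dequantize(q, scale):
    """Expand stored FP8 values back to ordinary floats before compute."""
    return [v * scale for v in q]


weights = [0.5, -2.0, 0.1]
q, scale = quantize_weights(weights)
restored = dequantize(q, scale)
```

With only 3 mantissa bits, each restored weight is within about 6% of the original as long as it falls in the normal range; activations and accumulation stay in the higher-precision type.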


