
TL;DR

- We explore how the inference performance of Llama 3.1 405B varies on 8x AMD MI300X GPUs across vLLM and TGI backends in different use cases.

- TGI handles medium to high workloads efficiently. In our tests on 8x AMD MI300X GPUs, we define a medium workload as 2 to 4 requests per second (RPS). In that range, TGI delivers a faster time to first token (TTFT) and higher throughput than vLLM.

- Conversely, vLLM works well with lower RPS but struggles to scale, making it less ideal for more demanding workloads.

- TGI's edge comes from its continuous batching algorithm, which dynamically adjusts batch sizes to keep the GPUs busy.
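
To make the continuous batching point concrete, here is a toy model of the general idea. It is only a sketch and not TGI's actual scheduler: the function names, the fixed `max_batch` slot count, and the "one token per request per step" cost model are all assumptions made for illustration. It compares how many decode steps a queue of requests needs when each batch runs until its longest request finishes versus when freed slots are refilled at every step.

```python
from collections import deque


def static_batching_steps(lengths, max_batch):
    """Decode steps when each batch runs until its longest request finishes."""
    total = 0
    for i in range(0, len(lengths), max_batch):
        total += max(lengths[i:i + max_batch])
    return total


def continuous_batching_steps(lengths, max_batch):
    """Decode steps when finished requests are replaced at every step.

    Assumes every entry in `lengths` is a positive number of output tokens.
    """
    queue = deque(lengths)
    active = []  # remaining tokens for each in-flight request
    steps = 0
    while queue or active:
        # Admit waiting requests into any free slots before the next step.
        while queue and len(active) < max_batch:
            active.append(queue.popleft())
        # One decode step emits one token per active request;
        # requests with nothing left to generate leave the batch.
        active = [r - 1 for r in active if r - 1 > 0]
        steps += 1
    return steps


if __name__ == "__main__":
    lengths = [8, 2, 2, 2]  # hypothetical output lengths in tokens
    print(static_batching_steps(lengths, max_batch=2))      # 10
    print(continuous_batching_steps(lengths, max_batch=2))  # 8
```

In this toy case the short requests no longer wait for the 8-token request to finish, so the same work completes in fewer steps. Real schedulers also have to account for prefill cost and KV cache memory, which this sketch ignores.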

If you have feedback, or want to help improve the benchmark, please let me know.




Thanks for the detailed analysis. I would be curious to see an FP8 comparison too, given that vLLM has some custom kernels.


Unfortunately, neither vLLM nor TGI supports FP8 on AMD yet. Once they do, we will look into it.



