I’m not an expert on at-scale inference, but they surely can’t have been running at a batch size of more than 1 if they were getting performance that bad on 4xH100… and I’m not even sure how they were getting performance that low even at batch size 1. Batching is essential to serving large token volumes at scale.

As the comments on reddit said, those numbers don’t make sense.
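For a rough sense of what batched serving across four GPUs normally looks like, here's a minimal sketch with vLLM (the model name, tensor_parallel_size, prompt count and sampling settings are placeholder assumptions, not whatever the original benchmark ran):

    # Minimal sketch of batched generation with vLLM; everything below is illustrative.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="meta-llama/Meta-Llama-3-70B-Instruct",  # placeholder model
        tensor_parallel_size=4,                        # shard across the 4 GPUs
    )
    params = SamplingParams(max_tokens=256, temperature=0.7)

    # Hundreds of prompts submitted together: the engine batches them, so the
    # aggregate tokens/s is far higher than looping over them one at a time.
    prompts = [f"Summarize document {i}" for i in range(512)]
    outputs = llm.generate(prompts, params)
    print(sum(len(o.outputs[0].token_ids) for o in outputs), "tokens generated")

That kind of request-level batching is what "serving large token volumes at scale" means in practice; single-request, sequential decoding leaves most of the hardware idle.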

> I’m not an expert on at-scale inference, but they surely can’t have been running at a batch size of more than 1 if they were getting performance that bad on 4xH100… and I’m not even sure how they were getting performance that low even at batch size 1. Batching is essential to serving large token volumes at scale.

That was my first thought as well, but from a quick search it looks like llama.cpp has a fairly high default batch size (something like 256 or 512, I don't remember exactly), which I find surprising for something that's mostly used by local users, so that shouldn't be the issue.
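For reference, here's roughly where that knob shows up when you drive llama.cpp from Python (a sketch with llama-cpp-python; the model path is a placeholder and the default values have moved around between versions, so treat the numbers as illustrative):

    # Sketch of llama.cpp's batch-size knob via llama-cpp-python (illustrative only).
    from llama_cpp import Llama

    llm = Llama(
        model_path="./model.gguf",  # placeholder path
        n_ctx=4096,
        n_batch=512,       # how many prompt tokens go through one forward pass
        n_gpu_layers=-1,   # offload all layers to the GPU
    )

    # n_batch chunks prompt processing for a single sequence; it doesn't by itself
    # mean the server is decoding many users' requests in parallel.
    out = llm("Explain batching in one sentence.", max_tokens=64)
    print(out["choices"][0]["text"])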

> As the comments on reddit said, those numbers don’t make sense.

Absolutely, hence my question!


Sure, but that default batch size would only matter if the person in question was actually generating and measuring parallel requests, not just measuring the straight-line performance of sequential requests... and I have no confidence they were.
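Concretely, the two measurements differ roughly like this (a sketch, not the original methodology; the endpoint, model name and request counts are placeholders, and it assumes an OpenAI-compatible server such as llama.cpp's llama-server or vLLM is already running):

    # Sequential vs. concurrent request throughput against an OpenAI-compatible
    # endpoint; only the concurrent run exercises server-side batching.
    import time
    from concurrent.futures import ThreadPoolExecutor

    import requests

    URL = "http://localhost:8080/v1/completions"  # placeholder endpoint
    BODY = {"model": "local", "prompt": "Write a haiku about GPUs.", "max_tokens": 64}

    def one_request(_):
        return requests.post(URL, json=BODY, timeout=120).json()

    def run(n_requests, concurrency):
        start = time.time()
        with ThreadPoolExecutor(max_workers=concurrency) as pool:
            list(pool.map(one_request, range(n_requests)))
        return n_requests / (time.time() - start)

    print("sequential req/s:", run(16, concurrency=1))   # straight-line performance
    print("parallel   req/s:", run(16, concurrency=16))  # where batching actually helps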


