Great benchmark, very interesting. However, I am not sure about extrapolating the H200 numbers from the Lambda benchmark. From my understanding, Lambda's benchmark and theirs used different models (Llama 405B vs. Mistral 123B), with different benchmarking setups and inference libraries. Since the study focuses on memory-hungry scenarios, I am really curious why they used the H100 instead of the H200.
Yes, it's a different model and backend, and obviously the extrapolation will never be as good as experimental values.
But:
1. We only used the multiplier value of 3.4, not the exact throughput from Lambda's experiment (rough sketch of the calculation after this list).
2. We also used the same input/output sequence lengths as Lambda's experiment.
3. Our extrapolated value is also in line with the H200's specs when compared to the MI300X.
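To make the extrapolation concrete, here is a minimal sketch of the calculation. Only the 3.4 multiplier comes from Lambda's experiment; the H100 throughput below is a placeholder, not a measured value from either benchmark.

```python
# Rough sketch of the H200 extrapolation (illustrative numbers only).
H100_MEASURED_TOK_PER_S = 1000.0   # placeholder: throughput measured on the H100 in our setup
LAMBDA_H200_MULTIPLIER = 3.4       # H100 -> H200 speedup ratio taken from Lambda's experiment

# Scale our own H100 measurement by Lambda's ratio, keeping the same
# input/output sequence lengths as Lambda's run.
h200_extrapolated = H100_MEASURED_TOK_PER_S * LAMBDA_H200_MULTIPLIER
print(f"Extrapolated H200 throughput: {h200_extrapolated:.0f} tokens/s")
```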