
30x is the kind of number that, when you see it claimed as a generational improvement, you should dismiss as marketing fluff.



As I understood it, they optimized the entire stack, from CUDA down to the networking interconnects, specifically for data centers, meaning you get 30x more inference per dollar at datacenter scale. That's probably not fluff, but it's only relevant for a very specific use case, i.e. enterprises with the money to buy the whole stack to serve thousands of users with LLMs.

It doesn't matter for anyone who isn't Microsoft, AWS, OpenAI, or similar.


It's a weird graph... The y-axis is specifically tokens per GPU, but the x-axis is "interactivity per second", so the y-axis gain folds in Blackwell being twice the size as well as the jump from FP8 to FP4. Note that the FP4 jump effectively gets counted more than once, since half as much data needs to move over the network as well.
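A back-of-the-envelope sketch of that compounding (the multipliers below are made-up assumptions for illustration, not NVIDIA's actual breakdown; the point is just that independent 2x factors multiply on a tokens-per-GPU axis):

    # Illustrative multipliers only -- assumptions, not NVIDIA's numbers.
    die_size     = 2.0  # Blackwell is roughly two reticle-sized dies
    fp8_to_fp4   = 2.0  # halving precision doubles math throughput
    network_gain = 2.0  # FP4 also halves the bytes moved over the interconnect
    other_system = 2.0  # assumed bucket: NVLink, kernels, scheduling, etc.

    speedup = die_size * fp8_to_fp4 * network_gain * other_system
    print(f"compounded speedup: {speedup:.0f}x")  # -> 16x from four 2x factors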


The 30x they showed was for FP4. Who is using FP4 in practice?


But maybe you should be. Once the software stack is ready for it, there'll be more takers, since the performance gains are so massive.


It would depend highly on the model, though. Some models will tolerate FP4 better than others.
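For intuition, here's a minimal NumPy sketch of simulated E2M1 FP4 quantization (the per-tensor scaling and the quantize_fp4 helper are simplifying assumptions; real deployments use finer-grained block scales and hardware rounding):

    import numpy as np

    # The eight non-negative magnitudes representable in E2M1 FP4.
    FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

    def quantize_fp4(x):
        # Scale so the largest magnitude maps to 6.0, then snap to the grid.
        scale = np.abs(x).max() / FP4_GRID[-1]
        if scale == 0:
            return x.copy()
        idx = np.abs(np.abs(x)[..., None] / scale - FP4_GRID).argmin(axis=-1)
        return np.sign(x) * FP4_GRID[idx] * scale

    # Heavy-tailed weights lose more precision than Gaussian ones, which is
    # one reason some models tolerate FP4 better than others.
    rng = np.random.default_rng(0)
    for name, w in [("gaussian", rng.normal(size=10_000)),
                    ("heavy-tailed", rng.standard_t(df=2, size=10_000))]:
        err = np.abs(w - quantize_fp4(w)).mean() / np.abs(w).mean()
        print(f"{name}: mean relative error ~ {err:.3f}")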



