
30x is the kind of number that, when you see it claimed as a generational improvement, you should dismiss as marketing fluff.



As I understood it, they optimized the entire stack, from CUDA down to the networking interconnects, specifically for data centers, meaning you get 30x more inference per dollar at datacenter scale. That's probably not fluff, but it's only relevant for a very specific use case, i.e. enterprises with the money to buy the whole stack to serve thousands of users with LLMs.

It doesn't matter for anyone who isn't Microsoft, AWS, OpenAI, or similar.


It's a weird graph... The y-axis is specifically tokens per GPU, but the x-axis is "interactivity per second", so the y-axis gain folds in Blackwell being twice the size as well as the jump from FP8 to FP4. Note that the FP4 jump effectively gets counted more than once, since half as much data needs to move over the network as well.
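A back-of-the-envelope sketch of that compounding (the multipliers below are made-up assumptions for illustration, not NVIDIA's actual breakdown; the point is just that independent 2x factors multiply on a tokens-per-GPU axis):

    # Illustrative multipliers only -- assumptions, not NVIDIA's numbers.
    die_size     = 2.0  # Blackwell is roughly two reticle-sized dies
    fp8_to_fp4   = 2.0  # halving precision doubles math throughput
    network_gain = 2.0  # FP4 also halves the bytes moved over the interconnect
    other_system = 2.0  # assumed bucket: NVLink, kernels, scheduling, etc.

    speedup = die_size * fp8_to_fp4 * network_gain * other_system
    print(f"compounded speedup: {speedup:.0f}x")  # -> 16x from four 2x factors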


The 30x they showed was for FP4. Who is using FP4 in practice?


But maybe you should be. Once the software stack is ready for it, there'll be more takers, since the performance gains are so massive.


It would depend highly on the model, though. Some models will tolerate FP4 better than others.
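For intuition, here's a minimal NumPy sketch of simulated E2M1 FP4 quantization (the per-tensor scaling and the quantize_fp4 helper are simplifying assumptions; real deployments use finer-grained block scales and hardware rounding):

    import numpy as np

    # The eight non-negative magnitudes representable in E2M1 FP4.
    FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

    def quantize_fp4(x):
        # Scale so the largest magnitude maps to 6.0, then snap to the grid.
        scale = np.abs(x).max() / FP4_GRID[-1]
        if scale == 0:
            return x.copy()
        idx = np.abs(np.abs(x)[..., None] / scale - FP4_GRID).argmin(axis=-1)
        return np.sign(x) * FP4_GRID[idx] * scale

    # Heavy-tailed weights lose more precision than Gaussian ones, which is
    # one reason some models tolerate FP4 better than others.
    rng = np.random.default_rng(0)
    for name, w in [("gaussian", rng.normal(size=10_000)),
                    ("heavy-tailed", rng.standard_t(df=2, size=10_000))]:
        err = np.abs(w - quantize_fp4(w)).mean() / np.abs(w).mean()
        print(f"{name}: mean relative error ~ {err:.3f}")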



