Jensen later revealed that the 30x LLM inference gain is due to architectural improvements, which is massive. I don't know whether it's a latency improvement or just a 2-3x raw performance boost combined with serving 30x more customers on the same chip. Either way, 30x is massive.
The other big announcement here is NIM - Nvidia Inference Microservice.
It's basically TensorRT-LLM + Triton Inference Server + models pre-built into TensorRT-LLM engines + packaging + what appears to be an OpenAI-compatible API router in front of all of it + other "enterprise" management and deployment tools.
This software stack is extremely performant and very flexible; I've noted here before that it's what many large-scale hosted inference providers are already using (Amazon, Cloudflare, Mistral, etc.).
From the article:
'Nvidia will work with AI companies like Microsoft or Hugging Face to ensure their AI models are tuned to run on all compatible Nvidia chips. Then, using a NIM, developers can efficiently run the model on their own servers or cloud-based Nvidia servers without a lengthy configuration process.
“In my code, where I was calling into OpenAI, I will replace one line of code to point it to this NIM that I got from Nvidia instead,” Das said.'
The dead giveaway is "I changed one line of code in my OpenAI code", which means "I pointed the OpenAI API base URL at an OpenAI-compatible API proxy that likely interfaces with Triton on the backend via its gRPC protocol".
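To make the "one line" concrete: it's presumably just the API base URL. Here's a minimal sketch using only the Python standard library; the local endpoint URL and model name are my assumptions for illustration, not anything Nvidia has published:

```python
import json
import urllib.request

# Assumption: the NIM proxy exposes an OpenAI-compatible
# /v1/chat/completions route. This URL is illustrative.
BASE_URL = "http://localhost:8000/v1"  # was: "https://api.openai.com/v1"

payload = {
    "model": "local-model",  # illustrative model name
    "messages": [{"role": "user", "content": "Hello"}],
}
req = urllib.request.Request(
    f"{BASE_URL}/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# urllib.request.urlopen(req) would send it; the request body is
# byte-for-byte what you'd send to OpenAI - only the host changed.
```

That's the whole migration story from the client's point of view: the request and response shapes stay the same, and the proxy handles translation to Triton behind the scenes.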
I have a lot of experience with TensorRT-LLM + Triton and have been working on a highly performant, Rust-based open-source project for the OpenAI-compatible API and routing portion[0].
On this hardware (FP4), with this software package, 30x over other solutions on Hopper (compared to who knows what - base Transformers?) seems possible. TensorRT-LLM and Triton can already do FP8 on Hopper, and as noted, the performance is impressive.
He always does that. They stack up a bunch of special case features like sparsity that most people don't use in practice to get these unrealistic numbers. It'll be faster, certainly, but 30x will only be achievable in very special cases I'm sure.
The kind of sparsity that the hardware supports is not fully general. I'm not aware of any large models trained using it. Maybe they are all leaving 2x perf on the table for no reason, but maybe not. I don't think sparsity is really proven to be "almost always a win" for training.
To train well with it, I think you still need to store all the optimizer state (derivatives and momentum or whatever), if not all the weights (as in RigL), so maybe not nearly as much memory bandwidth advantage as you get in inference?
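For context, the hardware sparsity being discussed is Nvidia's 2:4 structured pattern: at most two nonzeros in every contiguous group of four weights. That's why it's not fully general - arbitrary unstructured sparsity fails the constraint even at the same overall density. A quick sketch of the check (the weight values are made-up examples):

```python
def is_2_4_sparse(weights):
    """True if every contiguous group of 4 weights has at most 2 nonzeros -
    the structured pattern Nvidia's sparse tensor cores accept."""
    return all(
        sum(1 for w in weights[i:i + 4] if w != 0.0) <= 2
        for i in range(0, len(weights), 4)
    )

dense  = [0.5, -0.2, 0.1, 0.3, 1.0, 0.7, -0.4, 0.2]   # ordinary dense weights
pruned = [0.5,  0.0, 0.0, 0.3, 1.0, 0.7,  0.0, 0.0]   # pruned to the 2:4 pattern
# A model pruned to exactly 50% sparsity can still fail this check if the
# zeros aren't distributed two-per-group, which is the training headache.
```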
From how I understood it, they optimized the entire stack, from CUDA down to the networking interconnects, specifically for data centers, meaning you get 30x more inference per dollar at datacenter scale. This is probably not fluff, but it's only relevant for a very specific use case, i.e. enterprises with the money to buy a full stack to serve thousands of users with LLMs.
It doesn't matter for anyone who's not Microsoft, AWS, OpenAI, or similar.
It's a weird graph... The y-axis is specifically tokens per GPU, but the x-axis is "interactivity per second", so the y-axis is including both Blackwell being twice the size and the increase from FP8 -> FP4. Note the FP4 gain gets counted multiple times, since only half as much data needs to go through the network as well.
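To put rough numbers on how those factors compound (illustrative arithmetic only - Nvidia hasn't published this breakdown, and the per-factor values are assumptions from this thread):

```python
die_factor = 2   # Blackwell packages two dies per GPU (assumption: counted as one GPU)
fp4_factor = 2   # FP4 nominally doubles math throughput vs Hopper FP8

per_chip = die_factor * fp4_factor   # 4x from those two factors alone

# Whatever remains of the claimed 30x has to come from batching,
# interconnect, and reduced data movement (FP4 also halves network traffic).
remaining = 30 / per_chip  # ~7.5x unaccounted for by chip size + precision
```

Which is exactly why the headline number isn't a general-purpose performance figure: most of it lives in the system-level factors, not the raw silicon.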
Yeah, and the 30x is largely due to increases in factors like packaging and throughput. It's not indicative of general-purpose performance, which is what I was talking about.
Again, I do think the throughput and energy-efficiency gains are impressive, but the raw performance gain is lower than I'd have expected for such a massive leap in node size, etc.
This is also the only place Nvidia is getting competitive pressure - from the likes of Groq (and likely, though less publicized, Cerebras), with higher inference T/s and better concurrency utilization/batching [1] - so if this proves true, the case for big-chip systems (on today's specs) will be harder to make.
Because it involved scaling the chip area devoted to FP8. The AI community realized a few years back that FP8 training is possible, so the transistor budget for FP8 was scaled up. Overall, I think transistors grow only ~50% per generation, so most of the gains come from reclaiming the FP32/FP64 share, which was dominant 10 years ago - but that can only go so far.
Though it seems most of the progress has been on memory throughput and power use, which is still very impressive.
I wonder how this will trickle down to the consumer segment.