Why is this called a whitepaper, when it's more of a documentation and architecture overview of the cluster? Wow, a Clos topology for networking, very innovative.
Details on NVLink would be great. For example, the needs addressed and problems solved by the custom cables that NVLink seemingly requires would themselves be worth a whitepaper.
Don't get me wrong, it's still great that the general public can get a glimpse into Grace Hopper. And they do a good job of simplifying while throwing around mind-boggling numbers (the NVLink bandwidth is insane, though not a word on latency, which is crucial for remote memory access).
To be fair, NVIDIA used to publish much more detailed white papers for their GPUs, e.g. [1], and architecture textbooks like H&P [2] draw a lot of details from them. This less detailed "whitepaper" still carries a scent of that old tradition.
I was always taught that “whitepapers” were this sort of thing and were distinct from academic papers. However this seems to be industry or ecosystem specific because the cryptocurrency ecosystem uses “whitepaper” to mean their academic papers, or at least their approximation of them.
NVDA has spent too much time surrounded by cryptocurrency hacks that published “whitepapers” left and right with zero technical information or innovation. As they say, never get high on your own supply.
What's funny is that even though the DGX GH200 is some of the most powerful hardware available, there's such a voracious demand that it's not gonna be enough to quench it. In fact, this is one of those cases where I think the demand will always outpace supply. Exciting stuff ahead.
I heard Elon say something interesting during the discussion/launch of xAI: "My prediction is that we will go from an extreme silicon shortage today, to probably a voltage-transformer shortage in about a year, and then an electricity shortage in about a year, two years."
I'm not sure about the timeline, but it's an intriguing idea that soon the rate limiting resource will be electricity. I wonder how true that is and if we're prepared for that.
He’s just plain wrong about the electricity usage going up because of AI compute.
To a first approximation, the amount of silicon wafers going through fabs globally is constant. We won’t suddenly increase chip manufacturing a hundredfold! There aren’t enough fabs or “tools” like the ASML EUV machines for that.
Electricity is used for lots of things, not just compute, and within compute the AI fraction is tiny. We’re ramping up a rounding error to a slightly larger rounding error.
What will increase is global energy demand for overall economic activity, as manufacturing and industry are accelerated by AIs.
Anyone who’s played games like Factorio would know intuitively that the only two real inputs to the economy are raw materials and energy. Increases to manufacturing speed need matching increases to energy supply!
I bet you're right. Even if you take into account that a data center is a monster consumer of energy, in the grand scheme of things it's not that big. Some back of the envelope math:
Global electrical production in 2022 was ~30,000 TWh.[1]
If we overestimate and say that a hyperscale data center draws about 100 MW continuously, that works out to around 876 GWh per year.[2]
Let's overestimate again and say that 1,000 new data centers of that size spring up; every year they would consume 876 TWh.
That is 2.92% of total electricity production. Given that I overestimated the energy consumption by more than an order of magnitude, I'd say the term "rounding error" is accurate.
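If you want to poke at the numbers, here's the same back-of-the-envelope calculation as a few lines of Python (the 100 MW per data center and the 1,000 new data centers are the deliberate overestimates above, not measured figures):

    # Back-of-the-envelope check on the data-center electricity estimate above.
    # All inputs are the deliberately generous assumptions from this comment.
    HOURS_PER_YEAR = 8760

    global_production_twh = 30_000   # ~2022 world electricity production [1]
    datacenter_power_mw = 100        # assumed continuous draw per hyperscale DC
    new_datacenters = 1_000          # assumed new data centers

    per_dc_gwh = datacenter_power_mw * HOURS_PER_YEAR / 1_000   # MWh -> GWh
    total_twh = per_dc_gwh * new_datacenters / 1_000            # GWh -> TWh
    share_pct = 100 * total_twh / global_production_twh

    print(f"{per_dc_gwh:.0f} GWh per data center per year")      # 876
    print(f"{total_twh:.0f} TWh for {new_datacenters} of them")  # 876
    print(f"{share_pct:.2f}% of global production")              # 2.92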
I think the main limiting factor in the near term is going to be chip production capacity. Fabs take so long to spin up that it's going to be a while before we can even consider electricity production as a limiting factor.
Elon is speaking with all the Eliezer-esque "foom" in mind, wherein AI will explode and either kill us or help us take over the Universe (and destroy everything in our way).
Let's assume yield is 100% to make things easier. The rated max power of the A16 is about 250 W, while the H100 is quoted at 700 W. Thus, a wafer of A16s works out to about 25-30 kW of power, while a wafer of H100s is about 21 kW.
Edit: Just clarifying, this is not about the Apple A16, but the Nvidia A16. The mobile process used by the Apple chips is built for much lower performance and power, so I can't imagine the two chips being anywhere near comparable - they fill two completely different roles.
Demand right now is not shifting from mobile to datacenter, demand is shifting from "normal" datacenter compute to AI datacenter compute.
I think if you had said "AMD Epyc" rather than a mobile chip, that would be a much more apt comparison. The AI chips are somewhat more power intensive per box, but fairly similar on power/area. It turns out that these silicon processes are fairly uniform in terms of the power/area that they can sustain for any kind of workload.
Mobile chips are designed for <10% utilization and "rush-to-idle" workloads, and they are not remotely comparable to datacenter silicon (of any kind).
An H100 uses up to 350 watts, while an A16 has a TDP of only 8 W. But the A16 is a much smaller chip (about 108 mm² vs. the H100's 814 mm²), so you can fit more of them on a wafer. Since a wafer is 300 mm in diameter, its area is about 70,685 mm², which would yield 86 H100s or 654 A16s. [1][2]
However, that ignores the waste around the edges of the circular wafer, as well as the chip yield, both of which will likely be worse for the larger chip [3]. But assuming a generous 70% yield by area [4], one wafer's worth of H100s all packaged into GPUs and running full blast will use maybe 20 kilowatts, while the same wafer of A16s might use 3.6 kilowatts. Although in practice, the A16s will spend most of their time conserving battery power in your pocket, and even the H100s will spend some of their time idle.
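For reference, here's that dies-per-wafer and wafer-power arithmetic as a small script (using the die areas, power figures, and 70% yield above; real dies-per-wafer counts would be a bit lower since rectangular dies don't tile a round wafer perfectly):

    import math

    # Approximate figures quoted above.
    WAFER_DIAMETER_MM = 300
    H100_AREA_MM2, H100_WATTS = 814, 350   # H100 (PCIe-class power figure)
    A16_AREA_MM2, A16_WATTS = 108, 8       # Apple A16 Bionic
    YIELD = 0.70                           # generous assumed yield by area

    wafer_area = math.pi * (WAFER_DIAMETER_MM / 2) ** 2   # ~70,685 mm^2

    def wafer_power_kw(die_area_mm2, die_watts):
        dies = int(wafer_area // die_area_mm2)   # ignores edge waste
        good = int(dies * YIELD)
        return dies, good, good * die_watts / 1000

    for name, area, watts in [("H100", H100_AREA_MM2, H100_WATTS),
                              ("A16", A16_AREA_MM2, A16_WATTS)]:
        dies, good, kw = wafer_power_kw(area, watts)
        print(f"{name}: {dies} dies/wafer, ~{good} good, ~{kw:.1f} kW/wafer")
    # H100: 86 dies/wafer, ~60 good, ~21.0 kW/wafer
    # A16: 654 dies/wafer, ~457 good, ~3.7 kW/wafer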
TSMC is now producing over 14 million wafers per year. At most 1.2 million of those are on the 3nm node, and not all of that production goes to GPUs. But as an upper bound, if we imagine that all of TSMC's wafers could be filled up with nothing but H100 chips, and if all of those H100 chips were immediately put to use running AI 24/7, how much additional load could it put on the power grid every year?
The answer is around 280 gigawatts, or, if they were running 24/7 for a year, about 2,500 terawatt-hours. That's about 10% of current world electricity consumption! So it's not completely implausible to imagine that a huge ramp-up in AI usage might have an effect on the electric grid.
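And the scale-up, with the same round numbers (every TSMC wafer imagined as H100s at ~20 kW per yielded wafer, running 24/7; ~25,000 TWh is roughly the world consumption figure the 10% implies):

    # Upper bound from the paragraph above: every TSMC wafer filled with H100s,
    # all packaged and running 24/7. Inputs are the comment's round numbers.
    wafers_per_year = 14_000_000    # total TSMC wafer output, all nodes
    kw_per_wafer = 20               # ~20 kW of H100s per yielded wafer (see above)
    hours_per_year = 8760
    world_consumption_twh = 25_000  # rough current world electricity consumption

    total_gw = wafers_per_year * kw_per_wafer / 1_000_000   # kW -> GW
    total_twh = total_gw * hours_per_year / 1_000           # GWh -> TWh

    print(f"~{total_gw:.0f} GW of load")                                   # ~280
    print(f"~{total_twh:.0f} TWh per year")                                # ~2453
    print(f"~{100 * total_twh / world_consumption_twh:.0f}% of world use") # ~10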
*edit: This assumes we're talking about the Apple A16 (i.e. the difference between phone chips and GPU chips). If we're talking about the Nvidia A16 (i.e. the difference between current GPU chips and the previous node's GPU chips), see pclmulqdq's comment.
It seems unlikely that anyone could afford the number of A100s needed to create an electricity shortage.
If there is an electricity shortage, far more likely that ageing infrastructure and rising demand for air conditioning and electric car charging are to blame.
Elon's timeline predictions in both of those industries for his own companies have been consistently wrong for years. (FSD when??)
Given we're talking about hardware for software, let's at least look at his track record in the software industry... glances at Twitter ah, yeah, not great either.
A voltage-transformer or electricity shortage caused by AI growth straight up doesn't make sense; it's dumber than the stuff he was spouting while "deep-diving" his Twitter misacquisition.
The memory and bandwidth numbers are mind-blowing. Going to be very hard to catch Nvidia. It's as if competitors are going through the motions for participation prizes.
AMD has been shipping 128 lanes of PCIe 5.0 on chip. That's about 0.5 TBps per direction. Getting up to 0.9 TBps isn't that crazy, but having a big enough fabric and switches to attach to is a huge feat.
I have hope though. CXL switching is going to give the whole industry a very fresh look at interconnect fabrics, as a simpler-to-manage, faster, more direct alternative to PCIe. Should be good.
Personally I worry it's flogging a dead horse and has too many constraints, but maybe Ethernet could be rumbling into action again too. The hyperscalers and others created a new Linux Foundation group, the Ultra Ethernet Consortium, to scale it up much faster. Still, even at 1 Tbps, you'd need a bunch of links (7x) of that ultra Ethernet to get to NVLink's 0.9 TBps GPU interconnect. More radical breaks with Ethernet are needed than line-speed bumps, things that can make switches easier to scale out big, if this realm of tech is to be a good systems fabric. https://www.linuxfoundation.org/press/announcing-ultra-ether...
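For what it's worth, the rough lane math behind those figures, treating them as per-direction bandwidth the way I am here:

    # Rough lane math behind the bandwidth figures in this subthread.
    # PCIe numbers are per lane, per direction; the 0.9 TBps NVLink figure is
    # NVIDIA's headline per-GPU number, used here the way the comments use it.
    pcie5_gt_per_s = 32                              # PCIe 5.0 raw rate per lane
    pcie5_gb_per_s = pcie5_gt_per_s * 128 / 130 / 8  # 128b/130b encoding -> ~3.94 GB/s
    lanes = 128

    print(f"128 lanes of PCIe 5.0: ~{pcie5_gb_per_s * lanes:.0f} GB/s")  # ~504 (~0.5 TBps)

    nvlink_gb_per_s = 900           # per-GPU NVLink bandwidth to the NVSwitch
    ultra_eth_gb_per_s = 1000 / 8   # a 1 Tbps Ethernet link = 125 GB/s
    print(f"1 Tbps Ethernet links to match NVLink: {nvlink_gb_per_s / ultra_eth_gb_per_s:.1f}")  # 7.2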
One note on the DGX GH200 architecture that is super interesting to me is that it inverts the usual connectivity relationship. Typically a system would have the NIC and GPU hanging off the processor bus, and interconnect traffic would go over that bus (maybe optimizing with P2P DMA to skip going through main memory, if it's fancy). But here? The GPUs have a 0.9 TBps connection to the NVSwitch. If the CPU wants to talk to the cluster, it uses NVLink-C2C to send the data to the GPU, which then uses its NVLink connection to the NVSwitch to send it out. Interesting reversal, interesting flourish, and gee it sure makes sense to me; the GPU is the thing!
Also, past 256 GPUs, there are BlueField-3 devices for Ethernet or InfiniBand connectivity on the DGX nodes, which is a good but also pretty boring/standard SmartNIC-based scale-out strategy.
Gaudi2 was competitive with the A100 on paper but was borderline vaporware.
Agree for now, but how long do we think this will last, though?
There really hasn’t been that great of a financial incentive to compete on DL. Nvidia themselves only recently made this a major priority.
However, now that heaps of money are being thrown at massive training runs, I expect we'll see more competition popping up. Particularly if Intel pulls off IFS and catches up on the next node, increasing availability.
I wonder how much this thing will cost. The best I've been able to find so far is a 'low 8 digits' estimate in an AnandTech article, but nothing more specific than that.
Does sparse mean anything other than that we cannot actually do as many FP8 operations per second as just claimed? To me it sounds like they can do X matrix operations per second on sparse matrices using Y FP8 operations per second, but instead of just saying what Y is, they tell us how many FP8 operations would be required if the matrices were not sparse. Is this pure marketing bullshit or is there some logic to it? How sparse do those matrices have to be? Or am I misunderstanding the claim?
It means a very specific sparsity pattern, 2:4, where at most 2 out of every 4 consecutive values are nonzero. It's not pure bullshit, because a matrix with 2:4 sparsity may represent more "information" than a matrix that is 50% smaller.
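To make the 2:4 pattern concrete, here's a minimal sketch of how a dense weight matrix is typically pruned to it: in each group of 4 consecutive values, keep the 2 largest-magnitude entries and zero the other 2 (magnitude-based pruning is the common recipe; this is just an illustration, not NVIDIA's exact tooling):

    import numpy as np

    def prune_2_4(w: np.ndarray) -> np.ndarray:
        """Force 2:4 structured sparsity: in every group of 4 consecutive
        values, keep the 2 largest-magnitude ones and zero the rest.
        Assumes the total number of elements is a multiple of 4."""
        groups = w.reshape(-1, 4)
        # indices of the two smallest-magnitude entries in each group of four
        drop = np.argsort(np.abs(groups), axis=1)[:, :2]
        pruned = groups.copy()
        np.put_along_axis(pruned, drop, 0.0, axis=1)
        return pruned.reshape(w.shape)

    w = np.random.randn(4, 8).astype(np.float32)
    w_sparse = prune_2_4(w)
    # Every group of 4 now has at most 2 nonzeros, so the hardware can skip
    # half the multiplications and store only the nonzeros plus position bits.
    assert ((w_sparse.reshape(-1, 4) != 0).sum(axis=1) <= 2).all()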
Okay, yes, there is a bit more information than in a matrix with half the number of entries, namely the positions of the zeros. But when it comes to counting floating-point operations, doubling the number seems at least somewhat questionable to me; they are not actually performing that many multiplications. On the other hand, it would probably be hard if not impossible to achieve the same performance by manually exploiting this sparsity to avoid the multiplications, so from that angle maybe it is not too unreasonable.
But this also made me wonder: how does one use this in practice? If the matrices are not tiny, then they would probably have to be incredibly sparse in order to always have at least two out of every four entries be zero. So does this just set some entries to zero if there are not enough zeros in each group of four? Or does one have to ensure the pattern oneself, reordering rows and columns and introducing zeros where required and acceptable?
Unfortunate that they don't mention the running times for any of the applications they benchmark (e.g., PageRank). Does anyone in the know have some idea how long this takes?
They claim 1.1x to 7x, depending on what you're doing. The 10% to 50% is for the ~10k GPU LLM training, where the main bottleneck tends to be networking:
> DGX GH200 enables more efficient parallel mapping and alleviates the networking communication bottleneck. As a result, up to 1.5x faster training time can be achieved over a DGX H100-based solution for LLM training at scale.
It's been on the roadmap for a few years although there were no performance numbers. I assume GH200 is more expensive so the price/performance advantage may not be overwhelming. Worst case you order GH200s and then scalp your H100s on the used market.