I generally expect Grace to show up as underpowered compared to modern AMD and Intel offerings (Skylake is old), but if they do end up with any wins when they run that benchmark, I would assume it would be due to the memory bandwidth. LPDDR5 on a server chip with 2x the bandwidth of the DDR5-based competitors is really good to see.
I feel like x86 offerings often fall flat in this area. Even when the theoretical memory bandwidth is there, other significant limitations often prevent you from actually using it. I've run into this personally with AMD Zen cores, where it is pretty easy to saturate the Infinity Fabric bandwidth well before the DRAM channels become the bottleneck.
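If anyone wants to see where their own box tops out, a minimal STREAM-style triad is usually enough to hit whichever ceiling comes first, DRAM channels or fabric. A rough sketch (array size and thread pinning are just placeholders):

    /* Minimal STREAM-style triad; compile with e.g. gcc -O3 -fopenmp.
       Array size is a guess: pick something several times larger than the LLC. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    #define N (1L << 26)                     /* 64M doubles = 512 MiB per array */

    int main(void) {
        double *a = malloc(N * sizeof *a);
        double *b = malloc(N * sizeof *b);
        double *c = malloc(N * sizeof *c);
        #pragma omp parallel for
        for (long i = 0; i < N; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

        double t0 = omp_get_wtime();
        #pragma omp parallel for
        for (long i = 0; i < N; i++)
            a[i] = b[i] + 3.0 * c[i];        /* triad: 2 reads + 1 write per element */
        double t = omp_get_wtime() - t0;

        printf("%.1f GB/s\n", 3.0 * N * sizeof(double) / t / 1e9);
        return 0;
    }

Pin the threads to a single CCD and then spread them across the socket; the gap between the two numbers makes the fabric limit I'm complaining about visible.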
Do you know why they would outfit it with LPDDR5 instead of regular DDR5? I don't have a good feel for the difference here. At first blush I would assume low-power DDR would run slower, not faster, but that doesn't seem to be the case anywhere I see it deployed.
LPDDR5 has lower total latency (or can have, and I assume it very much does in this case). Essentially, the "standard motherboard layout" that normal DDR5 is designed around is physically too big to target lower latencies.
HBM has such high access latency that the Xeon Max can't actually use all of the available bandwidth except in a caching mode (where a lot of that bandwidth is spent on cache-line spills and fills): the cores' line fill buffers fill up and cause backpressure and stalls in the CPU. And this assumes that every core is trying to load all the time.
You can see line fill buffer issues with 8-socket machines today, for example, where they are the cause of stalls due to the high latency of cache coherency checks. However, since they are part of the core and fine for 99% of users, the size of the line fill buffer is kept relatively small.
That limitation puts the HBM Xeon Max sort of on par with Grace for actual usable memory bandwidth, not so far above it.
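Back-of-the-envelope with Little's law (round numbers that I'm assuming, not quoting from any datasheet): roughly 10 line fill buffers per core x 64 B lines / ~130 ns of effective latency ≈ 5 GB/s of demand-load bandwidth per core. Multiply by the core count and you can see how any latency increase eats directly into how much of a huge peak figure is actually reachable, no matter how wide the HBM interface is.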
That's a great paper. I've tangled with the line fill buffer issue before, along with other hidden limits such as the Infinity Fabric limits on AMD systems. It's one of the first big disappointments when you start doing high-performance, bandwidth-limited compute work. It's also one that's almost entirely undocumented, which is frustrating.
Interesting paper. I focus more on latency than bandwidth. The paper gets a few things wrong: a DDR5 DIMM is not a single 64-bit channel but 2 x 32-bit subchannels, so the normal Xeon has 16 x 32-bit channels, not 8 x 64-bit.
He quotes cache misses as 60 ns, which glosses over the fact that roughly half of that is spent missing through L1/L2/L3 before you even enter the queue at the memory controller for the channel you need. As a result you only get about half the bandwidth if you have just a single request pending per channel.
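A dependent pointer chase is the easy way to measure the full number on your own machine: by construction there is exactly one miss in flight, so what you time is the complete L1-to-DRAM round trip, queueing included, rather than the headline DRAM latency. A rough sketch, with the array size and iteration count picked arbitrarily:

    /* Dependent pointer chase: one outstanding miss at a time, so what you
       measure is the full miss latency rather than any bandwidth figure. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N     (1L << 24)                 /* 16M pointers = 128 MiB, well past LLC */
    #define ITERS 10000000L

    int main(void) {
        size_t *chain = malloc(N * sizeof *chain);
        size_t *perm  = malloc(N * sizeof *perm);
        for (size_t i = 0; i < N; i++) perm[i] = i;
        for (size_t i = N - 1; i > 0; i--) { /* shuffle so prefetchers can't follow */
            size_t j = rand() % (i + 1), t = perm[i];
            perm[i] = perm[j]; perm[j] = t;
        }
        for (size_t i = 0; i < N; i++)       /* link the permutation into one cycle */
            chain[perm[i]] = perm[(i + 1) % N];

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        size_t p = 0;
        for (long i = 0; i < ITERS; i++)
            p = chain[p];                    /* each load depends on the previous one */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("%.1f ns per dependent load (p=%zu)\n", ns / ITERS, p);
        return 0;
    }

Run it again while something else hammers the same channels and you can watch the queueing share of that number grow.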
The interesting part of the article isn't the processor but rather the LPDDR5 memory, which apparently has ECC support. I'm not aware of any vendors offering such memory - does anyone know which exact memory they're using?
I wish everyone would offer in-band ECC. I would generally be happy to pay the memory cost to get ECC. At least AMD allows you to use ECC memory with desktop parts; Intel still uses it for market segmentation.
You should rather wish that the DRAM manufacturers made memory with the appropriate widths.
In-band ECC is a horrible workaround, with a real hardware cost much higher than traditional ECC: additional die area, increased energy consumption, and much lower performance.
Intel has published some information about their in-band ECC controller, while NVIDIA, as usual, is much more secretive.
The performance of in-band ECC would be unacceptable without adding a special, rather big cache and a great deal of complexity to the memory controller, which is needed to implement clever caching of the check bits. No matter how clever the caching algorithm is, there will still be workloads where performance is much lower than with standard ECC.
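A toy model of why, just to make the cost concrete (the layout below is my assumption, not Intel's or NVIDIA's actual mapping): every protected line needs its check bits fetched from somewhere else in the same DRAM, so every miss in the controller's ECC cache turns one memory transaction into two.

    /* Toy in-band ECC address mapping, purely illustrative: carve the tail of a
       region out for check bits, 4 ECC bytes per 64-byte data line. */
    #include <stdint.h>
    #include <stdio.h>

    #define REGION    (1ULL << 30)           /* 1 GiB of physical DRAM          */
    #define DATA_PART (REGION / 16 * 15)     /* 15/16 visible to the OS as data */
    #define LINE      64ULL

    static uint64_t ecc_addr(uint64_t data_addr) {
        /* On an ECC-cache miss this second address must also be read
           (and on stores, read-modified-written). */
        return DATA_PART + (data_addr / LINE) * 4;
    }

    int main(void) {
        uint64_t a = 0x2345640;              /* arbitrary data address */
        printf("data line %#llx -> check bits at %#llx\n",
               (unsigned long long)a, (unsigned long long)ecc_addr(a));
        return 0;
    }

The dedicated cache and the clever algorithms exist purely to hide that second access; side-band ECC just makes the channel wider and fetches data and check bits in the same burst.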
No one has verified an ECC setup with AMD desktop parts; see e.g. this thread[1]. People in all the forums said ASRock supports ECC, but even their support denied it. While ECC memory will work in most motherboards, no one has any info on whether ECC is actually active on them.
Plenty of people including myself have verified that ECC works with the desktop parts where AMD specifies that ECC is supported, like my Ryzen 9 5900X (on an ASUS Pro WS X570-ACE).
Your link is only about the current ASRock desktop MBs, not about AM5 MBs in general, not even about ASRock AM5 MBs in general. Previously all ASRock desktop motherboards supported ECC, but their current generation of desktop AM5 motherboards no longer supports ECC.
On the other hand, ASRock Rack makes some full-featured server boards with the AM5 socket and ECC support for desktop Ryzen 7000 CPUs, which have a much better performance per dollar than any alternative with CPUs that are sold as "server" CPUs.
For normal desktops or workstations, ASUS remains the most accessible choice for motherboards that support ECC. They also have many motherboards that do not support ECC, so the MB specifications must be checked before buying (a good ECC ASUS MB is PRIME X670E-PRO WiFi).
While the hardware ECC of Ryzens appears to be OK, their software support is much worse than Intel's, as is unfortunately usual for AMD.
Their Linux EDAC driver for Ryzen CPUs had not been updated for many years, from the Bulldozer days until about a year ago. A couple of years ago many features, like testing by injecting errors (without having to overclock the memory), no longer worked due to mismatches with the hardware.
During the last year, the AMD EDAC driver for Ryzen CPUs has been updated multiple times, so perhaps now all its functions work fine, but I have not verified this.
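For anyone who wants to at least sanity-check their own box: the counters live in the standard EDAC sysfs tree, so something like the sketch below tells you whether a memory-controller driver bound at all (whether a given board actually reports errors into it is, of course, exactly the open question from upthread):

    /* Report the Linux EDAC error counters for the first memory controller. */
    #include <stdio.h>

    static long read_counter(const char *path) {
        FILE *f = fopen(path, "r");
        long v = -1;
        if (f) { if (fscanf(f, "%ld", &v) != 1) v = -1; fclose(f); }
        return v;
    }

    int main(void) {
        long ce = read_counter("/sys/devices/system/edac/mc/mc0/ce_count");
        long ue = read_counter("/sys/devices/system/edac/mc/mc0/ue_count");
        if (ce < 0 && ue < 0) {
            puts("no EDAC memory controller registered (driver missing or ECC disabled)");
            return 1;
        }
        printf("corrected: %ld, uncorrected: %ld\n", ce, ue);
        return 0;
    }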
It's not clear IMO. What you're saying would make sense given that the CPU only has access to 960GB of RAM despite there being at least a TB of RAM physically available. The article blamed the discrepancy on yield, which sounded ridiculous to me.
You lose data space to the syndrome bits. In practice this is no different from keeping the data space fixed and adding chips for the syndrome bits, but it shows up differently in the specs.
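The arithmetic fits, too: if roughly 1/16 of each device is set aside for the check bits (my guess at the split; NVIDIA hasn't published the layout), 1024 GB x 15/16 = 960 GB. A classic 72-bit ECC DIMM carries its extra 1/8 on separate chips, so the full data capacity still shows up in the spec sheet.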
Nvidia's target market is GPU-heavy workloads where the CPU is mostly there for coordinating the GPU-heavy work, moving data from IO to the GPUs, etc. I've often seen in large-scale ML training that the CPU is mostly idling, idle to the point where I've wondered if Presto/Spark workloads should be co-located on the idle cores for large $$$ savings.
I suspect that the $ of a CPU is sometimes hard to measure in an HPC world, especially in a way that could meaningfully compare to smaller machines.
What I would like to see is some W and Wh figures, which usually aren't subject to special bulk pricing, fluctuations due to demand and availability, etc.
It's worse than that. There's literally nothing similar about these systems. One is a system designed for the supercomputer MareNostrum 4 and the other for MareNostrum 5 (a completely different system)... So an old CPU, but also different network cards, topology, memory (capacity and speed), storage system, operating system (SuSE from 2016 vs Ubuntu 22), and so on. For example, they went from 10Gb Ethernet to 200Gb InfiniBand.
And then they took all of the performance improvements that each of these contribute... and attributed them to the Nvidia CPU.
This is a misrepresentation. Included in the analyses are single-node runs, which don't care about network cards etc. This is a platform comparison, not a CPU showdown; among the questions here is whether Grace-based nodes are feasible at all for production HPC. The answer is a tentative yes, although I still have concerns about cooling at this density in a general-use (i.e. highly fluctuating) workload.
But mostly, these numbers are for their users, who are aware the system contract has been awarded but want to know what to expect when their workloads hit the new system.
Incidentally, MareNostrum 4 has a 100Gbit Omni-Path fabric. I'm sure they'd love to test against the latest Omni-Path, but Intel dumped the tech, so our choices these days are 200/400Gbit Ethernet or similar-throughput InfiniBand.
Bear in mind that at least one of those notes the code wasn't optimised for ARM, while all the meaningful HPC code in existence has been painstakingly optimised for Intel for decades.
Right, the Arm stuff is probably in the "it runs" camp. Largely because it's SVE, which is barely available, and the code written to utilize it has probably been tuned for the A64FX, or maybe the Neoverse V1-based Gravitons.
Both of which have considerably different memory and vector size/issue characteristics. So that's three different SVE variations now, and the previous two show significant uplift when given custom tuning (e.g. gcc -mtune=neoverse-512tvb vs. the custom A64FX compiler benchmarks). Arm put a bunch of effort into creating an instruction set that is microarchitecture-agnostic, but it hasn't exactly worked the first couple of tries. Maybe that will be fixed with V2 and all SVE cores going forward.
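To make it concrete why vector-length-agnostic isn't tuning-agnostic: the same ACLE source runs unmodified on the A64FX (512-bit SVE), Graviton3 (256-bit) and Grace's Neoverse V2 (128-bit SVE2), but the loop below does 4x as much work per iteration on the first as on the last, so the right unroll and prefetch choices move with it. A minimal daxpy sketch of my own, not something from the paper:

    /* Vector-length-agnostic daxpy with SVE ACLE intrinsics: the same source
       runs on 128-, 256- or 512-bit SVE, but per-iteration work (svcntd())
       and the best unrolling differ on each implementation.
       Compile with e.g. gcc -O3 -march=armv8-a+sve (or -mcpu=neoverse-v2). */
    #include <arm_sve.h>
    #include <stdint.h>

    void daxpy(double a, const double *x, double *y, int64_t n) {
        for (int64_t i = 0; i < n; i += svcntd()) {
            svbool_t pg = svwhilelt_b64(i, n);                    /* tail predication */
            svfloat64_t vx = svld1(pg, &x[i]);
            svfloat64_t vy = svld1(pg, &y[i]);
            svst1(pg, &y[i], svmla_x(pg, vy, vx, svdup_f64(a)));  /* y += a*x */
        }
    }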
Indeed. Right now there is about 0% HPC code tuned to Grace and Grace Hopper.
I'd love it if Nvidia made reasonably priced Grace and Grace Hopper ATX boards (or a stylish Nvidia Studio desktop, priced like a Mac mini) that developers could buy, so that we could do our best to optimize code for Grace for free in our spare time.
Same goes for AMD and their MI300 family, in case AMD is listening. There is less to be gained, as the x86 side is pretty well cared for ATM, but, still, I'd love to see such a beast.
They already awarded the contract. The question users will be asking is "how much faster is the new computer compared to the one we've been using?" These are the answers.
If you scroll down to the Stony Brook results, they compare it to more modern CPUs.
I've had access to one of these (interested in it for its massive amounts of IO bandwidth for the power budget), and it's stunningly fast. And yes, it runs FreeBSD.
> I've had access to one of these (interested in it for its massive amounts of IO bandwidth for the power budget), and it's stunningly fast. And yes, it runs FreeBSD.
Are we going to get "Serving Netflix Video Traffic at 1600Gb/s and Beyond" anytime soon? :)
I think Apple are really missing a trick here if they don't consider offering a server machine-learning CPU/GPU. They could probably develop a server business as big as Nvidia's if they got the software support right, but it's never been their focus, so they will probably miss out despite having all the ability to make a competing ecosystem.
The same GPU could be repurposed for the Mac Pro too, making that product actually cost-effective to build rather than essentially impossible, as it is right now. There are lots of reasons this makes sense.
The H100 is currently $30,000-40,000 because there is no other game in town thanks to CUDA/software compatibility, and the upcoming H200 will be even more expensive. These cards "only" have 80GB of RAM, and there are other cards with more RAM for training LLMs that I couldn't find a price for. So people are already paying a huge premium for "ecosystem". Apple could easily do the same for machine-learning-driven server products.
I find it interesting that you think Apple products are more expensive because of the branding. I'm willing to pay more because they work much better than other systems for me, and if you subtract the much higher resale value they are actually the same value as other products.
> I find it interesting that you think Apple products are more expensive because of the branding, I'm willing to pay more because they work much better than other competing systems and also if you subtract the much higher resale value they are actually the same value as other products.
You're assuming that resale value matters for most people, when it probably doesn't (many will use hardware until it's dead, or will pass it on to siblings, which renders the resale value moot).
And I fundamentally disagree that Apple products work better, especially enough to merit the price difference. A fun case is RAM/disk upgrades. You really cannot say that Apple's premiums for 16GB or more of RAM have anything to do with their hardware/software being better (and again, it really isn't).
If you're curious why I think Apple hardware/software isn't better, in no particular order:
* hardware is pretty good, but with limited options (I want a matte screen ffs, the reflections of glossy screens kill my eyes after a few hours) and brutal premiums for trivial things like going beyond 8GB of RAM.
* UX is actually pretty bad. As someone who grew up on Windows and has used various Linux distros with GNOME or KDE extensively, I find macOS a nightmare for a few reasons:
* same as with hardware, very limited options. Want to not follow Apple's special way of doing things? You cannot without installing a bunch of extra software, which is sometimes paid (I mean things that any other OS has had for years, like key remapping or window management). Do you want a different scroll direction for the touchpad and the mouse wheel? Well, there are two settings in two different menus, one for mouse and one for touchpad, but they toggle each other... This is horrendous UX that even Microsoft wouldn't ship.
* there are massive numbers of hidden tricks, which are sometimes useful, sometimes not, but you cannot (usually) disable them. I do not care for Apple Music and have never used it, yet it will always pop up if macOS considers I don't have anything else to play when I tap the play button on my keyboard. I do not care for Apple Notes, but it will always pop up if I click at the bottom right of my screen by mistake. Discovery of those tricks is also hard.
* The OSes absolutely suck at providing any sort of feedback to the user. "Something went wrong", now go fuck yourself. I've had that with screen mirroring to an iPad, installing apps on an iPad, a USB device misbehaving on macOS, the Apple TV+ app on iPad failing to download stuff for offline use... there's no information whatsoever to help you understand what is happening and how you can fix it.
So no, I don't think Apple products really work that well, or are worth the massive premiums asked for them. There is a decent ecosystem, but it locks you in, so IMO most people are paying for the brand and because that's all they know (hammer, nail, etc.).
If you want the same memory bandwidth and latency in your Windows laptop, you would similarly need soldered LPDDR5 memory and would pay the premium for it.
You pay a premium for GDDR or HBM because they're actually faster. LPDDR5 isn't; Apple just used the equivalent of a larger number of memory channels to increase the memory bandwidth (x86 servers do the same thing) and then charges a premium for the ordinary memory chips because they can get away with it.
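The arithmetic is just bus width x transfer rate (the numbers below are the commonly quoted ones, not mine): a 512-bit LPDDR5-6400 package interface gives 64 B x 6400 MT/s ≈ 410 GB/s, a dual-channel DDR5-5600 desktop gives 16 B x 5600 MT/s ≈ 90 GB/s, and a 12-channel DDR5-4800 server socket gives 96 B x 4800 MT/s ≈ 460 GB/s. Same commodity DRAM dies, just more of them in parallel.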
Moreover, there is an obvious way to solve this without soldering the fast memory to the system board: you solder it to the CPU package. This is not only lower latency (traces don't have to go through the system board at all) and lower cost (no extra pins between the CPU and the system board), it also lets you upgrade the memory by upgrading the CPU. And it isn't incompatible with still having dedicated slots on the system board for additional, lower-bandwidth memory. That is typically what you want anyway, because if you're going to solder something to the processor, you're better off actually using HBM or GDDR (which is even faster), but then you don't want all of the system memory to be that: it costs more, many applications don't need it, and cache hierarchies are a thing for a reason.
Apple's whole advantage is owning the entire stack top to bottom. To achieve that with servers they'd need to develop their own server OS and perhaps app ecosystem on top of it too - that really isn't their focus. Or they could stop at "here's a machine, run what software you want", but that's completely antithetical to everything they've done so far.
> despite having all the ability to make a competing ecosystem.
Eh, citation needed? Nvidia has a 30-year head start making GPUs and has been working on CUDA for 15. Meanwhile AMD, competing in the same space, is not even close to Nvidia's offering. And I don't see how Apple's experience in developing consumer products helps in developing HPC hardware/software (OK, the M1/M2 are a first step in the right direction).