PCIe vs. CXL for Memory and Storage (synopsys.com)
82 points by teleforce on Nov 5, 2023 | 20 comments


CXL and its coherency mechanisms will be interesting to watch as the requirements of LLMs and related applications that need large memory pools continue to grow. The same goes for some HPC workloads.

One of the use cases I have seen recently is driving down the total cost of DRAM in large-scale deployments of systems at AWS, Azure, Meta, etc.

Pond [1], a CXL-based memory pooling system, is one example: it claims to achieve the desired performance while lowering costs.

I think looking at the overall, bigger picture is important: for example, how a system combines multiple GPUs, memory systems, and other accelerators to meet the demands of applications. Interconnects like NVLink [3] are worth considering here too.

For those interested, I have left a previous comment about experimenting with CXL on a local setup [0].

[0] https://news.ycombinator.com/item?id=37944691#37948761

[1] Pond: CXL-Based Memory Pooling Systems for Cloud Platforms https://arxiv.org/abs/2203.00241

[2] Intel Reveals the "What" and "Why" of CXL Interconnect, its Answer to NVLink https://www.techpowerup.com/254462/intel-reveals-the-what-an...

[3] NVLink and NVSwitch; The building blocks of advanced multi-GPU communication—within and between servers. https://www.nvidia.com/en-us/data-center/nvlink/


Well for one thing CXL 3.0 is built on top of PCI-Express 6.0.

I also think it is pretty much targeting HPC and server use cases, so it may never arrive in consumer PCs or smartphones.

And in case anyone is wondering, PCIe 7.0 is estimated to be finalized in 2025 and could arrive on the market around 2027. At that point we are looking at SSDs on four PCIe lanes with up to 64 GB/s.
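Back-of-the-envelope math for that figure, ignoring encoding and protocol overhead (so raw per-direction numbers, not measured throughput):

    # Raw PCIe bandwidth per direction; real-world throughput is lower.
    rates_gt_per_s = {"PCIe 5.0": 32, "PCIe 6.0": 64, "PCIe 7.0": 128}
    lanes = 4  # a typical NVMe SSD link width
    for gen, rate in rates_gt_per_s.items():
        print(f"{gen} x{lanes}: ~{rate * lanes / 8:.0f} GB/s")
    # PCIe 5.0 x4: ~16 GB/s, PCIe 6.0 x4: ~32 GB/s, PCIe 7.0 x4: ~64 GB/s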


Manufacturers could never have anticipated that they could charge $1,500 for an accelerator on the one hand, and $15,000 for the same accelerator on the other. They're the same thing!

CXL doesn't really make sense. Does even NVLink? And yet, we have no more two-slot GPUs, and we lost NVLink on the 4090. There are no AM5 motherboards with four double-spaced PCIe x16 slots. Why are consumer chips even limited on IO?

It's just a bunch of bullshit to segment the market and extract more rent. We're going backwards. That said, end users are also to blame: for every one person building a PC to do something innovative there are 19 who buy whatever has the brightest RGB.

So it’s up to the market really. Individual consumers might spend an amortized $12k/y on their cars. It’s not outside the realm of reason that hardware manufacturers can take way more of the surplus they’ve been giving to end users.

Whether you think this is good or bad is subjective. I personally don't think much innovation comes out of NVIDIA; they have somehow tricked the world into thinking that thousands of very simple processors, tiled neatly, are as sophisticated a thing as a general-purpose CPU, or that CUDA is especially innovative on its own rather than riding on a huge amount of inertia. The surplus should go to the people taking the most risks, basically the people writing the software on top of CUDA, which was once CG rendering and physics simulation and is now people trying to make ML models more efficient.


Manufacturers do segment the market, but a 4090 is also a totally different, cheaper piece of silicon than an H100. It's not really just a price increase; HBM and the requisite packaging are very, very expensive.

Why does CXL not make sense? Coherency is great!

NVLink is great for, say, 4+ GPUs. With two 4090s, the effective speedup is the same as with two NVLinked A100s in my testing, so it's really not a big deal that it was pulled out.

Consumers really, really don't build dual-GPU systems anymore. Gaming stopped using SLI/CrossFire over ten years ago, so AM5 and Intel's LGA 1700 socket are designed around that reality, and that price point. Why is socket SP3 so insanely massive? All those PCIe lanes and the extra memory bandwidth take pins. More costly boards, more costly socket, etc.

Also, how are you going to cool four 4090s on air when they are right next to each other? That's 1800 W of heat!! How are you going to power it? That's more than your average American 15 A, 120 V circuit can safely deliver continuously!
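Quick sanity check on those numbers (assuming roughly 450 W per stock 4090 and the usual 80% continuous-load rule for US circuits; both figures are assumptions on my part):

    gpu_board_power_w = 450                      # rough stock 4090 figure (assumption)
    gpus = 4
    circuit_limit_w = 15 * 120                   # 1800 W breaker rating
    continuous_limit_w = 0.8 * circuit_limit_w   # ~1440 W for continuous loads
    print(gpus * gpu_board_power_w, circuit_limit_w, continuous_limit_w)
    # -> 1800 1800 1440.0: the GPUs alone max out the circuit before the CPU,
    #    fans, or PSU losses are even counted.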

I just built a dual 4090 system. While I was waiting for my waterblocks to come in, I was testing everything on air. The second card, getting fed hot air by the first, eventually had to downclock to half the clock of the first card in order to stay under 90C at 100% fan speed. The waterblocks are single-slot and go into a server-class SP3 Epyc board with four single-slot-spaced PCIe slots. I have a 1600 W PSU with a special plug, and it sits on the dedicated single-outlet circuit that's meant for an air conditioner.

Nvidia is hugely innovative, and a 4090 core is really not simple at all. The cores were kinda simple back in the 8800 GTX days; now you have pretty significant scheduling resources, tensor cores, different kinds of ALUs, and a bonkers massive register file. CUDA is their secret sauce, way easier to use than OpenCL.


There are blower-style 4090s. If noise is not an issue, you could just run two of those together with some solid case fans.


There are?


A 4090 limited to 280 W will perform similarly to its unlimited configuration on CUDA workloads. As another commenter said, blower-style cards are fine for this. The segmentation around NVLink is what makes this worthless. And anyway, they could have equipped it with 48 GB of RAM, like the L40 part. It's all bullshit.
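If anyone wants to try that 280 W cap themselves, a minimal sketch with the NVML Python bindings (nvidia-ml-py) looks roughly like this; it needs root, and I'm assuming the 4090 is device 0. The nvidia-smi -pl 280 command does the same thing from the shell.

    import pynvml  # pip install nvidia-ml-py

    pynvml.nvmlInit()
    gpu = pynvml.nvmlDeviceGetHandleByIndex(0)  # assumption: the 4090 is device 0
    print("current limit (mW):", pynvml.nvmlDeviceGetPowerManagementLimit(gpu))
    pynvml.nvmlDeviceSetPowerManagementLimit(gpu, 280_000)  # NVML takes milliwatts
    pynvml.nvmlShutdown()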


Just tried this, it's 30% slower.

24 GB of GDDR6X is $$$$, especially if they are double-density chips to keep the RAM on a single side, which is important for cooling!

Again, NVLink is not required for 99% of people.


> There are no AM5 motherboards with four double-spaced PCIe x16 slots. Why are consumer chips even limited on IO?

Because PCIe slots, especially Gen 5, are a good chunk of the complexity and cost of a consumer motherboard. They are difficult to handle, with quite hard limits on timing and signal integrity. Hitting those requirements across more full-width slots is harder, and for most use cases just not feasible from a cost-to-usability perspective (how many expansion cards can even saturate an x8 Gen 5 link?). This is also why PCIe risers for newer PCIe generations are getting very expensive.


I think the real-world latency numbers will be interesting to see with PCIe 6.0/7.0, since the adoption of PAM4 modulation is what enables the greater bandwidth.

[0] Why Did PCIe 6.0 Adopt PAM4? There Are Many Reasons. https://blog.samtec.com/post/why-did-pcie-6-0-adopt-pam4-the...


> So it may never arrive in consumer PCs

(2019) Intel Hints Towards An Xe ‘Coherent Multi-GPU’ Future With CXL Interconnect

https://wccftech.com/intel-xe-coherent-multi-gpu-cxl/

"If CXL can seamlessly scale GPUs, then the economics of the market would also change completely. People would be able to buy a cheaper GPU first and then simply add another one if they want more power. It would add much more flexibility in buying decisions and even alleviate buyers remorse to some extent for the gaming class. If CXL mode trickles down to the consumer level anytime soon, then we might even see motherboard designs change drastically as multiple sockets and multiple GPUs become a feasible option. Needless to say, it looks like things are going to get pretty exciting in a few years."


CXL seems to be good for attaching a large number of memory modules.

It doesn't seem like a match for consumer PCs or smartphones, which typically have one or two memory sticks, or at most four.


Afaik, CXL 3.0 will run on top of UCIe as well. Let's hope for an open standard that enables SoCs to be composed from many compatible chiplets. Cache coherency and fine granularity below page size are key ingredients.


When I originally (many moons ago!) read about CXL -- I conceptualized it solely as a means to attach additional memory -- to a single CPU...

Now that I read this article, I now also conceptualize CXL -- as a bus, that is, as a potential future PCIe bus replacement...

That is, assuming that PCIe is patent-encumbered, and CXL is not -- and I don't know the current legal/patent statuses of either...

Anyway, a very interesting article!

There's definitely more tiers in the memory/storage (latency) hierarchy these days, and potentially room for even more...

Also...

>"With CXL 2.0 and CXL 3.0 which include switching, a host can access memory from one or more devices that form a pool. It’s important to note that in this kind of pooled configuration, only the resources themselves and not the contents of the memory are shared among the hosts: each region of memory can only belong to a single coherency domain. Memory sharing, which has been added to the CXL 3.0 specification, actually allows individual memory regions within pooled resources to be shared between multiple hosts."

Historically, this reminds me of some of the ideas present in the Cray-1:

https://en.wikipedia.org/wiki/Cray-1

I'm guessing (but not knowing!) that in the future, "homebrew" supercomputers using CXL to link multiple CPUs (and GPUs/TPUs/NPUs...) to multiple nearby banks of gargantuan shared memory (HAL 9000: "I'm sorry Dave, I can't let you do that..." <g>) -- will be able to be created by people with less-than-Supercomputing-Center-budgets -- but all of that is highly speculative for the time being!


How old is this article? The "new types of persistent memory like Intel's Optane Technology" it mentions have been dead since July 2022.


First copy in the Wayback Machine is Nov 2022: https://web.archive.org/web/20221130194031/https://www.synop...


Well it says "Limited adoption so far; CXL expected to accelerate through 2022 and 2023" (which didn't happen), so presumably 2021.


Again an article trying to sell a product/technology without bothering to explain upfront what it is, or what the acronym stands for. I can gather bits and pieces from how CXL compares to existing tech, but why keep me guessing?


It's happily unambiguous:

https://www.google.com/search?q=cxl


They’re selling to an audience that already knows what it is.



