
That's probably the way to think about it. PCIe devices communicate point-to-point over dual-simplex serial connections. If you have something like a Raspberry Pi CM4, which only has a single PCIe lane, then to attach more than one device there's going to be a PCIe switch.



Interesting, thank you!


The other neat thing is that peer-to-peer PCIe transactions do not have to go through the root complex (the Pi in this case).

Say you have a GPU and an NVMe drive or video capture card plugged into a PCIe switch, which is then plugged into the Pi. If your software supports having these devices communicate directly, they can do so as fast as their PCIe links allow, even though there's only the tiny link to the Pi.

Peer-to-peer transfers are only limited by the smallest link on the path between two devices on the tree, rather than the smallest link in the tree.
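
To make "if your software supports it" a bit more concrete: on the CUDA side, the first step is literally just asking the driver whether two devices can map each other's memory. A minimal sketch (assumes an NVIDIA stack with at least two GPUs visible; nothing about it is Pi-specific):

    // p2p_check.cu -- build with: nvcc p2p_check.cu -o p2p_check
    // Ask the CUDA driver whether each pair of GPUs can map each other's
    // memory directly (peer-to-peer over PCIe or NVLink).
    #include <stdio.h>
    #include <cuda_runtime.h>

    int main() {
        int count = 0;
        cudaGetDeviceCount(&count);
        if (count < 2) {
            printf("Need at least two GPUs to test peer-to-peer.\n");
            return 0;
        }
        for (int a = 0; a < count; ++a) {
            for (int b = 0; b < count; ++b) {
                if (a == b) continue;
                int can = 0;
                cudaDeviceCanAccessPeer(&can, a, b);  // 1 if device a can map b's memory
                printf("GPU %d -> GPU %d: peer access %s\n",
                       a, b, can ? "supported" : "not supported");
            }
        }
        return 0;
    }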

A real world example is some of the big GPU compute servers. Some of the 4U chassis can house 20 or more cards. Many of them have all of the GPUs (320 or more PCIe lanes) plugged into an array of PCIe switches, which may constrain them to 16 or 32 lanes to the CPU. The GPUs can all talk to each other at full speed, but only a few can go full speed to the CPU at a time.


Don't you mean the CPU can only full-duplex with a few GPUs at a time?

And are you sure that GPUs are actually talking with each other?

Forgive me if I'm woefully out of date/having a massive brainfart, but last I heard, GPUs got CPU-less DMA a few years ago, allowing access to system memory, but programming/setting up the pipelines/cleanup/repriming/output consolidation was still solely a CPU-centric work stream.

So like, yeah, your GPU does its thing and dumps the result in a buffer in its memory space, and might even be smart enough to shlup a buffer back to system RAM to get the result forwarded somewhere else when the CPU gets around to telling the storage controller to do it...

But did I completely sleep through something absolutely amazing? Are GPUs passing messages/shared-memory-space aware now? Like,

"Oh, my jobs done, shlup this buffer off to pcie device->compute_card_2 for the next stage of processing."

Because if they're that smart now, holy crap, I have reading to do.


Yes, GPUs are able to directly pass messages and bulk data to each other these days. It's completely supported in CUDA, OpenCL, Vulkan, DirectX 12, etc. DirectX calls it "Explicit Multi GPU".

It's been around since GPUs transitioned to PCIe; it's just that AMD and NVIDIA locked it to the "professional" GPUs, or only used it under the hood of some other feature. AMD GPUs have used PCIe peer-to-peer copies for CrossFireX since the GCN Radeon GPUs (R9 290X and friends), which is why they don't need an external bridge.
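
If you want to poke at it from CUDA, the API surface is tiny. A rough sketch, assuming two visible NVIDIA GPUs (device IDs 0 and 1 are just illustrative), that enables peer access in both directions and then does a direct GPU-to-GPU copy:

    // p2p_copy.cu -- build with: nvcc p2p_copy.cu -o p2p_copy
    // Enable peer access between GPU 0 and GPU 1, then copy a buffer
    // directly between their memories.
    #include <stdio.h>
    #include <cuda_runtime.h>

    #define CHECK(call) do { \
        cudaError_t err_ = (call); \
        if (err_ != cudaSuccess) { \
            printf("%s:%d: %s\n", __FILE__, __LINE__, cudaGetErrorString(err_)); \
            return 1; \
        } \
    } while (0)

    int main() {
        const size_t bytes = 64 << 20;  // 64 MiB, arbitrary test size
        void *src = nullptr, *dst = nullptr;

        // One buffer on each GPU.
        CHECK(cudaSetDevice(0));
        CHECK(cudaMalloc(&src, bytes));
        CHECK(cudaSetDevice(1));
        CHECK(cudaMalloc(&dst, bytes));

        // Map each GPU's memory into the other's address space
        // (check cudaDeviceCanAccessPeer first if you want a clean error).
        CHECK(cudaSetDevice(0));
        CHECK(cudaDeviceEnablePeerAccess(1, 0));
        CHECK(cudaSetDevice(1));
        CHECK(cudaDeviceEnablePeerAccess(0, 0));

        // Direct device-to-device copy: src lives on GPU 0, dst on GPU 1.
        CHECK(cudaMemcpyPeer(dst, 1, src, 0, bytes));
        CHECK(cudaDeviceSynchronize());
        printf("Copied %zu bytes GPU0 -> GPU1 peer-to-peer.\n", bytes);

        CHECK(cudaSetDevice(0));
        CHECK(cudaFree(src));
        CHECK(cudaSetDevice(1));
        CHECK(cudaFree(dst));
        return 0;
    }

Without the enable calls, cudaMemcpyPeer still works but may stage the data through system RAM; with peer access enabled the copy can go straight across the switch.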

There's DirectGMA (AMD) and GPUDirect (NVIDIA), which allow drivers for other PCIe devices to copy directly to and from GPU VRAM.
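
One concrete flavor of GPUDirect is GPUDirect RDMA, where a NIC's driver DMAs straight to and from VRAM. A heavily hedged sketch (assumes an NVIDIA GPU, an RDMA-capable NIC with libibverbs, and the nvidia-peermem kernel module loaded; it blindly grabs the first RDMA device it finds):

    // gdr_register.cu -- build (roughly): nvcc gdr_register.cu -libverbs -o gdr_register
    // Register a VRAM buffer with an RDMA NIC so the NIC can DMA into it
    // directly. Without nvidia-peermem, ibv_reg_mr on a device pointer
    // will simply fail.
    #include <stdio.h>
    #include <cuda_runtime.h>
    #include <infiniband/verbs.h>

    int main() {
        // 1. A buffer in GPU VRAM.
        void *gpu_buf = nullptr;
        size_t bytes = 1 << 20;  // 1 MiB
        if (cudaMalloc(&gpu_buf, bytes) != cudaSuccess) {
            puts("cudaMalloc failed");
            return 1;
        }

        // 2. Open the first RDMA device and allocate a protection domain.
        int num = 0;
        ibv_device **devs = ibv_get_device_list(&num);
        if (!devs || num == 0) { puts("no RDMA devices found"); return 1; }
        ibv_context *ctx = ibv_open_device(devs[0]);
        if (!ctx) { puts("ibv_open_device failed"); return 1; }
        ibv_pd *pd = ibv_alloc_pd(ctx);
        if (!pd) { puts("ibv_alloc_pd failed"); return 1; }

        // 3. Register the GPU pointer as an RDMA memory region. With
        //    GPUDirect RDMA, the NIC driver resolves the VRAM pages to PCIe
        //    bus addresses and can then DMA to/from them without touching
        //    system RAM.
        ibv_mr *mr = ibv_reg_mr(pd, gpu_buf, bytes,
                                IBV_ACCESS_LOCAL_WRITE |
                                IBV_ACCESS_REMOTE_READ |
                                IBV_ACCESS_REMOTE_WRITE);
        printf("GPU buffer registration: %s\n", mr ? "ok" : "failed");

        if (mr) ibv_dereg_mr(mr);
        ibv_dealloc_pd(pd);
        ibv_close_device(ctx);
        ibv_free_device_list(devs);
        cudaFree(gpu_buf);
        return 0;
    }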

Other than driver support, nothing prevents PCs from having GPUs communicate directly with storage devices. The GPU would need to expose some of its VRAM over PCIe (via a BAR window). The OS and the GPU driver would need to create some NVMe submission and completion queues, either in system RAM or in VRAM, that the GPU can be considered to own. You'd then issue NVMe operations whose backing pages point into the VRAM. The resizable BAR support on current Radeon and GeForce GPUs is a major step in this direction, although, again, resizable BARs have been in the PCIe specification for well over a decade and have been available on server hardware since ~2008.
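
To make the queue-and-BAR idea concrete, here's a toy that only fills in a simplified NVMe read command whose data pointer is aimed into a GPU BAR window. The bus address and namespace ID are made up, nothing here touches real hardware, and a real driver would get the address by mapping VRAM through the BAR and translating it for DMA:

    // nvme_vram_sketch.cu -- host-only, purely illustrative; no real device I/O.
    // Shows the shape of an NVMe read command whose data pointer (PRP1) is a
    // PCIe bus address inside a GPU's BAR window, so the SSD would DMA the
    // data straight into VRAM instead of system RAM.
    #include <stdint.h>
    #include <stdio.h>

    // Simplified 64-byte NVMe submission queue entry (NVMe "common command
    // format"); only the fields we care about are named.
    struct nvme_sqe {
        uint8_t  opcode;      // 0x02 = Read (NVM command set)
        uint8_t  flags;
        uint16_t command_id;
        uint32_t nsid;        // namespace ID
        uint64_t reserved;
        uint64_t metadata;
        uint64_t prp1;        // physical/bus address of the data buffer
        uint64_t prp2;        // second PRP entry or PRP list, if needed
        uint32_t cdw10;       // starting LBA (low 32 bits)
        uint32_t cdw11;       // starting LBA (high 32 bits)
        uint32_t cdw12;       // number of logical blocks - 1
        uint32_t cdw13, cdw14, cdw15;
    };

    int main() {
        // HYPOTHETICAL: bus address of a page inside the GPU's resizable BAR.
        uint64_t vram_bus_addr = 0x3800000000ULL;

        nvme_sqe cmd = {};
        cmd.opcode = 0x02;            // Read
        cmd.command_id = 1;
        cmd.nsid = 1;                 // hypothetical namespace
        cmd.prp1 = vram_bus_addr;     // destination: GPU VRAM, not system RAM
        cmd.cdw10 = 0;                // start at LBA 0
        cmd.cdw12 = 7;                // 8 blocks (field is 0-based)

        printf("NVMe read: opcode=0x%02x nsid=%u LBA=%u blocks=%u -> PRP1=0x%llx\n",
               cmd.opcode, cmd.nsid, cmd.cdw10, cmd.cdw12 + 1,
               (unsigned long long)cmd.prp1);
        // A real implementation would write this entry into a submission queue
        // (which, as noted above, could itself live in VRAM), ring the
        // doorbell, and poll the matching completion queue.
        return 0;
    }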


Well... Shit.

There goes my free time for the next 6 months. Need to brush up on intra-system component networking and figure out how to make graphics cards dance.

Must... Become... Massively... Parallel...



