Hacker News

APUs for HPC are going to be a wild ride. Accelerated computing in shared memory. CPU-focused folks will actually get access to some high-throughput compute on the sort of timescales that we can actually reason about (the GPU is so far away).



APUs are very cool for GPU programming in general. Explicitly copying data to/from GPUs is a definite nuisance. I'm hopeful that the MI300A will have a positive knock-on effect on the low-power APUs in laptops and similar.


>Explicitly copying data to/from GPUs is a definite nuisance.

CXL allows fine-grained shared memory, but people look at the shiny high-bandwidth NVLink and talk about how much better it is for... AI.


All video game consoles use APUs, and that does make memory-related operations potentially faster, but at least for video games memory traffic isn't the bottleneck. I suppose for HPC it might have more significance.


If you're doing simulations, or continuously poking at big matrices on CPUs, you can saturate the memory controller pretty easily. If you know what you're doing, your FPU or vector units are saturated at the same time, so the "whole system" becomes the bottleneck while it tries to keep itself cool.

Games move that kind of data at the beginning and don't stream much new data after the initial texture and model load. If you are doing HPC on GPUs, you may need to constantly stream new data to the GPU while streaming out the results. This is why datacenter/compute GPUs have multiple independent DMA engines.


Afaik those unified memory architectures are mostly neither cache coherent, nor do they support virtual addresses efficiently (you have to trap into privileged code to pin/unpin the mappings), which means that the relative cost is lower than for a dedicated GPU accessed via PCIe, but still too high. Only the "boring" old Bobcat-based AMD APUs supported accessing unpinned virtual memory through the L3 (aka system-level) cache, and nobody bothered to port code to them.


> Afaik those unified memory architectures are mostly neither cache coherent, nor do they support virtual addresses efficiently (you have to trap into privileged code to pin/unpin the mappings), which means that the relative cost is lower than for a dedicated GPU accessed via PCIe, but still too high. Only the "boring" old Bobcat-based AMD APUs supported accessing unpinned virtual memory through the L3 (aka system-level) cache, and nobody bothered to port code to them.

Other way around: Bobcat was the era of the "onion bus"/"garlic bus", and today things like Apple silicon don't need to be explicitly accessed in certain ways, afaik.

https://www.realworldtech.com/fusion-llano/3/

https://www.anandtech.com/show/16226/apple-silicon-m1-a14-de...




