I'm always nervous about companies like Intel doing both hardware and software. If you only do hardware, you have to keep things simple or software people won't use your stuff. If you're playing both sides, you can throw engineers at the problem and make it as complicated as you want. In this case they can just push complex AVX512 code into projects like glibc and bypass the whole "is this really worth the complexity?" discussion.
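To be concrete, the pattern looks something like the sketch below (my own simplified illustration, not glibc's actual code, which does this through IFUNC resolvers): detect the CPU feature at runtime and dispatch to a hand-written AVX512 path, with a scalar fallback that then has to be maintained alongside it forever.

    /* Simplified sketch of runtime CPU-feature dispatch (not glibc's code).
     * Build with GCC or Clang; the AVX-512 body is gated per-function via
     * a target attribute, so no global -mavx512f flag is needed. */
    #include <immintrin.h>
    #include <stddef.h>

    /* Plain fallback copy. */
    static void copy_scalar(void *dst, const void *src, size_t n) {
        unsigned char *d = dst;
        const unsigned char *s = src;
        for (size_t i = 0; i < n; i++)
            d[i] = s[i];
    }

    /* AVX-512 variant: 64-byte vector moves plus a scalar tail. */
    __attribute__((target("avx512f")))
    static void copy_avx512(void *dst, const void *src, size_t n) {
        unsigned char *d = dst;
        const unsigned char *s = src;
        size_t i = 0;
        for (; i + 64 <= n; i += 64) {
            __m512i v = _mm512_loadu_si512((const void *)(s + i));
            _mm512_storeu_si512((void *)(d + i), v);
        }
        for (; i < n; i++)
            d[i] = s[i];
    }

    /* Dispatcher: pick an implementation once, cache it in a pointer.
     * (Lazy init, not thread-safe; fine for illustration.) */
    static void (*copy_impl)(void *, const void *, size_t);

    static void copy_dispatch(void *dst, const void *src, size_t n) {
        if (!copy_impl)
            copy_impl = __builtin_cpu_supports("avx512f") ? copy_avx512
                                                          : copy_scalar;
        copy_impl(dst, src, n);
    }

Multiply that by every string function and every new ISA extension and you get the maintenance surface I'm worried about.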
One related example: Intel CPUs have good cache-DMA integration, but it doesn't work across NUMA nodes. Ideally Intel would improve the feature to be NUMA-aware and everybody would be happy. Instead, it seems they decided it would be cheaper to send forth an army of software engineers to put NUMA-scheduling kludges into a few key projects like Kubernetes and then call it a day.
/rant!
> good cache-DMA integration, but it doesn't work across NUMA nodes. Ideally Intel would improve the feature to be NUMA-aware
How do you propose they do this? You can't magically move the physical devices to the other socket, and transferring the data between sockets is what those "NUMA-scheduling kludges" are trying to avoid. The only solution is to have software put the data close to the device.
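Concretely, "put the data close to the device" looks roughly like the sketch below: ask sysfs which node the NIC sits on, then keep the processing thread and its packet buffers on that node. (A minimal illustration using libnuma; the eth0 path is just an example and error handling is skimped.)

    /* Minimal sketch: keep a thread and its buffers on the NIC's NUMA node.
     * Requires libnuma (link with -lnuma). "eth0" is a placeholder device. */
    #include <numa.h>
    #include <stdio.h>

    int main(void) {
        if (numa_available() < 0) {
            fprintf(stderr, "no NUMA support on this system\n");
            return 1;
        }

        /* Ask the kernel which NUMA node the NIC hangs off (-1 = unknown). */
        int node = 0;
        FILE *f = fopen("/sys/class/net/eth0/device/numa_node", "r");
        if (f) {
            if (fscanf(f, "%d", &node) != 1 || node < 0)
                node = 0;
            fclose(f);
        }

        /* Run this thread on that node and allocate buffers there, so DMA
         * writes and the subsequent CPU loads stay socket-local. */
        numa_run_on_node(node);
        void *buffers = numa_alloc_onnode(1 << 20, node);
        if (!buffers) {
            perror("numa_alloc_onnode");
            return 1;
        }

        printf("pinned to node %d, buffers at %p\n", node, buffers);
        numa_free(buffers, 1 << 20);
        return 0;
    }

That manual placement is exactly what the "kludges" in schedulers and runtimes automate.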
My take is that they should use the memory subsystem to route the DMA to the appropriate L3 cache.
I don't think the main issue is conserving QPI bandwidth: it's minimizing latency when a core loads the data into L2. That latency often causes ~30% performance drag on e.g. packet processing applications.