I'm always nervous about companies like Intel doing both hardware and software. If you only do hardware, you have to keep things simple or software people won't use your stuff. If you're playing both sides, you can throw engineers at the problem and make it as complicated as you want. In this case they can just push complex AVX512 code into projects like glibc and bypass the whole "is this really worth the complexity?" discussion.
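To be concrete, the pattern looks something like the sketch below (my own simplified illustration, not glibc's actual code, which does this through IFUNC resolvers): detect the CPU feature at runtime and dispatch to a hand-written AVX512 path, with a scalar fallback that then has to be maintained alongside it forever.

    /* Simplified sketch of runtime CPU-feature dispatch (not glibc's code).
     * Build with GCC or Clang; the AVX-512 body is gated per-function via
     * a target attribute, so no global -mavx512f flag is needed. */
    #include <immintrin.h>
    #include <stddef.h>

    /* Plain fallback copy. */
    static void copy_scalar(void *dst, const void *src, size_t n) {
        unsigned char *d = dst;
        const unsigned char *s = src;
        for (size_t i = 0; i < n; i++)
            d[i] = s[i];
    }

    /* AVX-512 variant: 64-byte vector moves plus a scalar tail. */
    __attribute__((target("avx512f")))
    static void copy_avx512(void *dst, const void *src, size_t n) {
        unsigned char *d = dst;
        const unsigned char *s = src;
        size_t i = 0;
        for (; i + 64 <= n; i += 64) {
            __m512i v = _mm512_loadu_si512((const void *)(s + i));
            _mm512_storeu_si512((void *)(d + i), v);
        }
        for (; i < n; i++)
            d[i] = s[i];
    }

    /* Dispatcher: pick an implementation once, cache it in a pointer.
     * (Lazy init, not thread-safe; fine for illustration.) */
    static void (*copy_impl)(void *, const void *, size_t);

    static void copy_dispatch(void *dst, const void *src, size_t n) {
        if (!copy_impl)
            copy_impl = __builtin_cpu_supports("avx512f") ? copy_avx512
                                                          : copy_scalar;
        copy_impl(dst, src, n);
    }

Multiply that by every string function and every new ISA extension and you get the maintenance surface I'm worried about.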
One related example: Intel CPUs have good cache-DMA integration, but it doesn't work across NUMA nodes. Ideally Intel would improve the feature to be NUMA-aware and everybody would be happy. Instead, it seems they decided it would be cheaper to send forth an army of software engineers to put NUMA-scheduling kludges into a few key projects like Kubernetes and then call it a day.
/rant!
> good cache-DMA integration, but it doesn't work across NUMA nodes. Ideally Intel would improve the feature to be NUMA-aware
How do you propose they do this? You can't magically move the physical devices to the other socket, and transferring the data between sockets is what those "NUMA-scheduling kludges" are trying to avoid. The only solution is to have software put the data close to the device.
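Concretely, "put the data close to the device" looks roughly like the sketch below: ask sysfs which node the NIC sits on, then keep the processing thread and its packet buffers on that node. (A minimal illustration using libnuma; the eth0 path is just an example and error handling is skimped.)

    /* Minimal sketch: keep a thread and its buffers on the NIC's NUMA node.
     * Requires libnuma (link with -lnuma). "eth0" is a placeholder device. */
    #include <numa.h>
    #include <stdio.h>

    int main(void) {
        if (numa_available() < 0) {
            fprintf(stderr, "no NUMA support on this system\n");
            return 1;
        }

        /* Ask the kernel which NUMA node the NIC hangs off (-1 = unknown). */
        int node = 0;
        FILE *f = fopen("/sys/class/net/eth0/device/numa_node", "r");
        if (f) {
            if (fscanf(f, "%d", &node) != 1 || node < 0)
                node = 0;
            fclose(f);
        }

        /* Run this thread on that node and allocate buffers there, so DMA
         * writes and the subsequent CPU loads stay socket-local. */
        numa_run_on_node(node);
        void *buffers = numa_alloc_onnode(1 << 20, node);
        if (!buffers) {
            perror("numa_alloc_onnode");
            return 1;
        }

        printf("pinned to node %d, buffers at %p\n", node, buffers);
        numa_free(buffers, 1 << 20);
        return 0;
    }

That manual placement is exactly what the "kludges" in schedulers and runtimes automate.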
My take is that they should use the memory subsystem to route the DMA to the appropriate L3 cache.
I don't think the main issue is conserving QPI bandwidth: it's minimizing latency when a core loads the data into L2. That latency often causes ~30% performance drag on e.g. packet processing applications.