
UC writes are very inefficient/slow. They are synchronous and not batched into cache lines. Old-school GPUs used a UC-adjacent mode called WC (write combining), which was almost identical to UC except that adjacent writes were batched into full-cache-line writes. I think newer GPUs have dedicated DMA engines for copying to GPU memory that isn't mapped into the host address space, but I'm not really familiar with those drivers.

For anything that isn't MMIO, you would prefer ordinary WB caching combined with non-temporal stores (to avoid populating the cache) over UC mode.
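To illustrate the WB-plus-non-temporal-stores approach: here's a minimal sketch of a cache-bypassing copy using SSE2 streaming stores. It assumes an x86-64 target; the function name and alignment requirements are my own for illustration, not anything from the thread.

```c
// Copy a buffer with non-temporal (streaming) stores so the destination
// does not pollute the cache. The memory stays in normal WB mode; the
// NT stores get write-combined into full-line transactions by the CPU.
#include <emmintrin.h>  // _mm_loadu_si128, _mm_stream_si128, _mm_sfence
#include <stddef.h>
#include <stdint.h>

// Sketch only: assumes dst is 16-byte aligned and n is a multiple of 16.
void copy_nontemporal(void *dst, const void *src, size_t n) {
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    for (size_t i = 0; i < n / 16; i++) {
        __m128i v = _mm_loadu_si128(&s[i]);
        _mm_stream_si128(&d[i], v);  // NT store: bypasses the cache
    }
    _mm_sfence();  // make the NT stores globally visible before continuing
}
```

The `_mm_sfence()` at the end matters: non-temporal stores are weakly ordered, so without a fence later code can't assume the data has been published.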



I believe that UC writes are indeed slow, but are not actually fully synchronous. PCI "write posting" is a thing:

> Writes to MMIO space allow the CPU to continue before the transaction reaches the PCI device. HW weenies call this “Write Posting” because the write completion is “posted” to the CPU before the transaction has reached its destination.

https://www.kernel.org/doc/html/v5.6/PCI/pci.html


Yeah, PCI writes can be posted. It's still very, very slow and ends up using an inefficiently small transaction size on the PCIe side of things. And posting doesn't help with reads, of course. PCI devices that are cache coherent, can snoop the cache, and otherwise work with WB cache mode can be written to a lot faster. UC writes are something like sequentially consistent (program-order reordering with other operations is not allowed), which limits the OoO/superscalar magic the CPU can otherwise do to make code fast.
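The classic driver idiom for dealing with posted writes is a read-back: since PCI ordering rules say reads cannot pass writes, reading any register on the device forces all prior posted writes to complete. A hedged sketch, with a made-up register layout standing in for a real ioremap'ed BAR (real kernel code would use writel()/readl()):

```c
// Illustrative only: fake_regs stands in for device MMIO registers.
#include <stdint.h>

struct fake_regs {
    volatile uint32_t ctrl;
    volatile uint32_t status;
};

void kick_device(struct fake_regs *regs) {
    regs->ctrl = 1;       // posted write: CPU continues before it lands
    (void)regs->status;   // read-back: reads can't pass writes, so this
                          // stalls until the posted write has reached
                          // the device
}
```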

I like this series as an intro: https://xillybus.com/tutorials/pci-express-tlp-pcie-primer-t...


> I think newer GPUs have dedicated DMA engines for copying to GPU memory that isn't mapped into the host address space but I'm not really familiar with these drivers.

Not even newer; it's been a pretty common GPU feature for the past couple of decades.


Does anyone still rely on CPU writes with WC? My impression is that it's kind of obsolete these days.


And it's also heavily used in non-OSS drivers.

DMA blocks can only do so much when the (often older but still well-used) APIs don't map well to the synchronization required: they either allow the user to immediately free and/or reuse the buffer used to pass in data (which likely requires a synchronous CPU copy to a staging area before the API function returns), or they offer direct memory mapping of resources. Putting either of those in the cache is often wasteful, as it's unlikely any line will be re-used before being flushed anyway, then passed over to the GPU DMA block to do whatever asynchronously.
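The first case above can be sketched roughly like this: the API copies the caller's data into a staging buffer (which a real driver would map WC, or fill with non-temporal stores) before returning, so the caller can reuse its buffer immediately while the GPU DMA happens later. All names here are hypothetical:

```c
// Sketch of the synchronous-copy-to-staging pattern. The staging array
// stands in for a WC-mapped staging buffer; dma_submitted stands in for
// actually queueing an async DMA from staging to GPU memory.
#include <stdint.h>
#include <string.h>

#define STAGING_SIZE 4096

struct upload {
    uint8_t staging[STAGING_SIZE];
    size_t len;
    int dma_submitted;
};

// Returns once the caller's data is safely copied; the caller may free
// or reuse `data` immediately, per the API contract described above.
int upload_buffer(struct upload *u, const void *data, size_t len) {
    if (len > STAGING_SIZE)
        return -1;
    memcpy(u->staging, data, len);  // synchronous CPU copy (ideally WC/NT)
    u->len = len;
    u->dma_submitted = 1;           // real driver: enqueue async DMA here
    return 0;
}
```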

And there are also non-device-driver use cases - I've seen image processing libraries intentionally skip the cache when they know they won't be touching the data again for some time and the data set itself is large enough. I assume other users exist; I just see those because I work on GPUs, and images are a big source of large data sets.

WC allows these to avoid clobbering the cache while having at least some chance of using the memory/PCIe bus effectively.


Yes, it's still used in the Intel 3D drivers in Mesa (at least).



