Understanding x86_64 Paging (zolutal.github.io)
166 points by signa11 on Jan 16, 2024 | 25 comments


Easy on paper, hard to get right when trying to do this in C or Rust!


It diverges more from x86 than I would have expected!


I’m not deeply familiar with either, though I have written a toy x86_64 bootloader once. There are more levels in x86_64 (which makes sense to me: a larger address space benefits from a sparser encoding). What else is significantly different?


x86 in some modes (PSE but not PAE) has 4MB superpages, whereas in PAE mode or in x86_64 they're 2MB (first level). That's the only one I remember off-hand.


There are also 1 GiB pages in long mode, but support depends on the processor and CPUID's output. Almost all modern x86 processors in the last several years should have it, I think.


Some also technically support it, but it doesn't gain you a whole lot as the TLB will split it down to a 2MB TLB entry. Could save a bit of L2 for the page table entries under the right conditions, but you've already lost if invoking your page walk hardware is in your fast path to begin with.


Very cool. I've actually been trying to learn about this area recently. I know this was aside from the main thrust of the article, but this bit made me think:

> Page Cache Disabled (PCD) – pages descendant of this PGD entry should not enter the CPU’s cache hierarchy, sometimes also called the ‘Uncacheable’ (UC) bit.

I guess this is some optimisation where it's decided the page isn't worth taking up valuable CPU cache space. And now I'm wondering what algorithm decides that...


My guess is that this is useful when interacting with memory-mapped IO devices - individual accesses can have meaning at the device level, so you don't want the caches getting in the way (and e.g. removing / combining accesses or providing stale data).


Also useful for memory you know you won't be touching multiple times, which would just thrash the cache.

For example, GPU drivers and texture copying: even on integrated GPUs that are cache-coherent with the CPU (so you don't "need" to manually flush it), there's no point filling the CPU cache with data you're only touching once in what is effectively a memcpy() on the CPU; you'll just evict more useful data that you'll probably need to read back afterwards anyway.


UC writes are very inefficient/slow. They are synchronous and not batched into cache lines. Old-school GPUs used a UC-adjacent mode called WC (write combining), which was almost identical to UC except that adjacent writes were batched into full cache-line writes. I think newer GPUs have dedicated DMA engines for copying to GPU memory that isn't mapped into the host address space, but I'm not really familiar with those drivers.

For anything that isn't MMIO, you would prefer using ordinary WB caching and non-temporal stores to avoid populating the cache instead of UC mode.


I believe that UC writes are indeed slow, but are not actually fully synchronous. PCI "write posting" is a thing:

> Writes to MMIO space allow the CPU to continue before the transaction reaches the PCI device. HW weenies call this “Write Posting” because the write completion is “posted” to the CPU before the transaction has reached its destination.

https://www.kernel.org/doc/html/v5.6/PCI/pci.html


Yeah, PCI writes can be posted. It's still very very slow and will end up using an inefficient small transaction size on the PCIe side of things. And posting doesn't help with reads, of course. PCI devices that are cache coherent and can cache snoop and otherwise work with WB cache mode can be written a lot faster. UC writes are something like sequentially consistent (program order reordering with other operations is not allowed) which limits the OoO / superscalar magic the CPU can otherwise do to make code fast.

I like this series as an intro: https://xillybus.com/tutorials/pci-express-tlp-pcie-primer-t...


> I think newer GPUs have dedicated DMA engines for copying to GPU memory that isn't mapped into the host address space but I'm not really familiar with these drivers.

Not even newer, but instead it's a pretty common feature for GPUs for the past couple decades or so.


Does anyone still rely on CPU writes with WC? My impression is that it's kind of obsolete these days.


And also heavily used in non-oss drivers.

DMA blocks can only do so much when the (often older but still well-used) APIs don't map well to the synchronization required: either they allow the user to immediately free and/or reuse the buffer used to pass in data (which likely requires a synchronous CPU copy to a staging area before the API function returns), or they allow direct memory mapping of resources. Putting either of those in the cache is often wasteful, as it's unlikely any line will be re-used before being flushed anyway, then handed over to the GPU DMA block to do whatever asynchronously.

And there's also non- device driver use cases - I've seen image processing libraries intentionally skip the cache if they know they're not going to be touching the data again for some time, and the data set itself is large enough. I assume other users exist, I just see those as I work on GPUs, and images are a big source of large data sets.

WC allows these to avoid clobbering the cache while having at least some chance of using the memory/pcie bus effectively.


Yes, it's still used in the Intel 3D drivers in Mesa (at least).


I don't think it's an optimization, for that you would typically use non-temporal instructions which avoid polluting cache with one-off accesses.


UC is for MMIO.


It's for mapped I/O primarily.


Is this useful for memory mapped IO?


This is great. I find x86_64 paging to be very similar to RISC-V's, which I was exposed to while poring through xv6 code.


There's definitely both "established best practices" and many other legacy ways to do things on x86.

RISC-V is quite boring, mostly opting for well-proven approaches. And a lot of its technical value comes from doing just that.

Yet x86-64 and ARM would have a hard time justifying their complexity, when RISC-V achieves the same with simplicity, and matches or beats them on objective metrics[0].

0. https://dl.acm.org/doi/pdf/10.1145/3624062.3624233


Really enjoyed this deep dive into x86_64 paging.


[flagged]


Why should we care about your apathy?


> "I don't know and I don't care"

That is ignorance, and what is apathy? Get out of here?



