Software-Refilled TLBs (1995-2007) (yarchive.net)
44 points by luu on Feb 26, 2022 | 19 comments



Somewhat related: if I remember right, grsecurity emulated the NX bit by abusing the fact that the iTLB and dTLB are separate.

On a TLB miss it hooked the fault (a trap on the page table walk?) and tried to work out whether it was an instruction or a data miss. If it was an instruction fetch, it performed an additional check for the "software" NX bit on the page.
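
To illustrate the idea only (this is not grsecurity's code; the structure and names are invented), the instruction-vs-data decision in such a fault handler boils down to comparing the faulting address with the saved program counter:

    #include <stdbool.h>
    #include <stdint.h>

    /* Minimal register snapshot passed to a hypothetical fault handler. */
    struct fault_regs {
        uintptr_t ip;   /* program counter at the time of the fault */
    };

    static bool is_instruction_fetch(const struct fault_regs *regs,
                                     uintptr_t fault_addr)
    {
        /* An instruction fetch faults at the address being executed, so the
         * fault address matches the saved program counter (ignoring
         * instructions that straddle a page boundary). */
        return fault_addr == regs->ip;
    }

    /* Sketch of the handler policy: if the page carries the software NX bit
     * and the access was a fetch, kill it; otherwise prime only the data-side
     * translation and resume, relying on the separate iTLB/dTLB. */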


RISC-V actually calls out the possibility of a software-filled TLB in its spec. It's somewhere on my list of things I want to go try and do: create a small RV32IC core capable of running user, supervisor and machine mode, with a software TLB for paging.


The RISC-V Privileged Architecture spec (V20211203) [1] advises against a software-managed TLB (section 4.3), since such TLBs are a "performance bottleneck on high-performance systems, and [are] especially troublesome with decoupled specialized coprocessors", but this is definitely a topic worth investigating IMO.

An interesting related document is Gernot Heiser's "Inside L4/MIPS" [2], which details the design decisions and the implementation of software TLB management for an early version of the L4 microkernel on 64-bit MIPS systems.

[1] https://github.com/riscv/riscv-isa-manual/releases/download/...

[2] https://trustworthy.systems/publications/papers/Heiser%3AIL4...


More importantly, there are no RISC-V "reserved for kernel use" registers that can be trashed at any time - the MIPS register convention leaves 2 of the 31 registers for the kernel to use in exceptions (including TLB misses).

To do this for RISC-V you would have to build special compilers and kernels.


BTW one of the coolest things is the way that MIPS (and other) systems do 2-level page tables - essentially they map the page tables into their own virtual address space (mapped by TLBs just like anything else), and then when they get a miss on a TLB fetch from a PTE they take a recursive trap and fill the TLB entry that matches the upper-level PDE.
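
To make the arithmetic concrete, here's a rough C sketch of the virtual linear page table idea (the base address and constants are made up, not any particular kernel's):

    #include <stdint.h>

    #define PAGE_SHIFT 12
    #define PTE_SIZE   8                     /* assuming 64-bit PTEs */
    #define VLPT_BASE  0xFFFFFF8000000000ULL /* hypothetical base of the linear PT mapping */

    /* Virtual address of the PTE that maps 'vaddr'.  Because this address
     * itself lies in the TLB-mapped linear region, loading from it can miss
     * again; that nested miss is the recursive trap that fills the TLB from
     * the upper-level PDE. */
    static inline uint64_t vlpt_pte_addr(uint64_t vaddr)
    {
        return VLPT_BASE + (vaddr >> PAGE_SHIFT) * PTE_SIZE;
    }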

A long time ago I worked on an x86 clone built on top of a 64-bit RISC core (pre-amd64); we used this trick (and more) to do segmentation as well (to get around the then somewhat bogus Intel paging patents).


Virtual linear page table. Neat trick. I think ia64 had an option to do that as well. But I've never seen a really good study of them on a decent range of interesting workloads.

Primary TLB misses have no dependent memory accesses (just the PTE), but at the cost of another TLB entry. Could be a win in some cases. Nobody really does it anymore though; it could be that there is not much appetite for the risk of trying out different virtual memory schemes these days.


On RISC-V, I've seen the trick of swapping a register with the m/sscratch CSR in order to obtain one freely usable register. But this, of course, is still more overhead than on MIPS.


Yup, and a swap is more difficult because (depending on the design) it is more likely to generate pipe bubbles. MIPS also has (quite simple) magic hardware to convert the TLB miss address into a PTE address, so you can create the address you need in 1 clock. You could do some magic like that, where the CSR swap saves the reg in scratch and returns the PTE address, and then the "write tlb entry" would also swap from scratch - you could do something like (pseudo code):

  tlb_fault:
    movcsr  r, pte_addr, r   # swap r into scratch, get the faulting PTE's address back
    ldd     r, (r)           # load the PTE (faults to the PDE vector on a nested TLB miss)
    movcsr  r, tlb_data, r   # write the TLB entry and swap the saved r back out of scratch
    mret

(The TLB entry would use the saved fault virtual address and the result from the ldd - the "ldd" would fault to a special PDE vector if there's no TLB entry, and the "movcsr r, tlb_data, r" would fault if you try to write an invalid PTE.)

BTW I think this only makes sense for a simple RISC system - a massive super-scalar OOO system (like mine) would suffer massively if you had to toss all the (100+) instructions in the pipeline when you took a TLB miss.


You could also expand a TLB miss "ld r1, (r2)" inline to:

    movcsr  r, pte_addr, r
    ldd     r, (r)
    movcsr  r, tlb_data, r
    ld      r1, (r2)         # replay the original load now that the TLB entry is in place

With appropriate traps as described above (that's essentially how I handle vectored CLIC interrupts on my RISC-V).


(In other words it's a bit of an architectural dead end, like explicit branch delay slots, if you want to grow a design past the simple RISC stage.)


It's interesting to see the shift in perspective. The first post was well in the "RISC era", several months before the P6 (PPro) came out and surprised everyone. I would've loved to see what the engineers at Intel thought of this TLB discussion --- especially those who worked on the 386, and thus made the decision to go with a hardware TLB.

The idea of software-refilled TLBs always seemed strange to me; a TLB is a cache, yet I don't know of any widely used CPU whose data/instruction caches need to be explicitly managed in software. ARM, the most common RISC now, uses a hardware-filled TLB.


TLBs do need to be explicitly managed by software; I don't know of any CPU architecture where this is not the case.

It would actually probably be quite feasible with an architecture that has a simple page table, like a single physically contiguous hash table, but x86-style arbitrary radix page tables seem like they would be prohibitive to snoop.
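
As a rough illustration of the kind of structure meant here (all names and sizes are invented), a single physically contiguous hashed page table is simple enough that dedicated hardware could plausibly walk or snoop it:

    #include <stdint.h>

    struct hpte {
        uint64_t vpn;   /* virtual page number this slot maps (0 = empty) */
        uint64_t pte;   /* translation and permission bits */
    };

    /* Look up 'vpn' in a power-of-two sized table with a toy hash and a short
     * bounded probe; returns 0 on a miss.  Hardware would only need the base,
     * the size and the hash function. */
    static uint64_t hpt_lookup(const struct hpte *table, uint64_t nslots,
                               uint64_t vpn)
    {
        uint64_t h = (vpn ^ (vpn >> 13)) & (nslots - 1);
        for (uint64_t i = 0; i < 8; i++) {
            const struct hpte *e = &table[(h + i) & (nslots - 1)];
            if (e->vpn == vpn)
                return e->pte;
        }
        return 0;   /* miss: fall back to a software refill path */
    }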


IMO software-filled TLBs are the only thing that makes sense, because they're much more flexible than any HW page table walking algorithms. The latter get baked in, or at best microcoded, while the former are infinitely customizable to fit the needs of the day.

About the only improvement I can think of for software-filled TLBs is to have features that allow said software to always be hot, loaded into fast on-die memory, not subject to being pushed out of a cache.


The big problem with software-filled TLBs is that taking an interrupt serializes the pipeline and is an inherently single-threaded operation. Interrupts themselves are very heavyweight on high-performance cores too. You could hand-wave that away somewhat, but not the serial nature of it. Software fill just doesn't cut it for high-performance cores.

If an ISA defined a very small, contained set of operations for a TLB miss to program page table walking / TLB refill state machines, allowing the OS to define that behavior without taking an architectural interrupt, that would be very cool. I suspect the hardware cost and complexity is not considered worthwhile, though.
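
Purely as a thought experiment (nothing like this exists in any current ISA; all fields are invented), the OS-visible state for such a programmable walker might be little more than a per-level descriptor table:

    #include <stdint.h>

    /* One entry per page-table level; the hardware walker would follow these
     * on a TLB miss, with no architectural interrupt taken. */
    struct walk_level {
        uint8_t index_shift;     /* bit position of this level's VA index field */
        uint8_t index_bits;      /* width of that index field */
        uint8_t entry_log2size;  /* log2(bytes per table entry) */
        uint8_t leaf_allowed;    /* may this level hold a leaf (large page)? */
    };

    struct walker_config {
        uint64_t          root_pa;      /* physical address of the root table */
        uint8_t           levels;       /* number of levels to walk */
        struct walk_level level[5];
        uint64_t          valid_mask;   /* bits that must be set in a valid entry */
        uint64_t          ppn_mask;     /* where the physical page number lives */
    };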


ARM also provides software-managed high-speed memories, which they call Tightly Coupled Memory (TCM); the common term for these is scratchpad memory (SPM) [1].

There is still a lot of recent research focusing on the design of SPM hardware and management approaches in the embedded and real-time/predictable systems communities.

[1] https://dl.acm.org/doi/10.1145/774789.774805
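
As a small compiler-specific example of what keeping refill code permanently hot can look like (the section name and the linker-script region behind it are assumptions; only the section attribute itself is standard GCC/Clang usage), a routine can be pinned into a TCM-backed section:

    __attribute__((section(".itcm.text"), noinline))
    void tlb_refill_handler(void)
    {
        /* ...refill logic that never leaves on-die SRAM... */
    }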


This is a great thread. I'm so glad you posted it!


The good old times of Usenet, when you could have discussions on deep technical topics with real experts even if you were just a lowly undergraduate student... how I miss this. I learned so much from Usenet as a student in the nineties.


If you’re lucky, you can get some of that here…


There are a lot of chip designers and similar figures on Stack Overflow; you can still do it, though it's not as conversational.



