
MMU gang wars: the TLB drive-by shootdown - matt_d
https://bitcharmer.blogspot.com/2020/05/t_84.html
======
eqvinox
There is a very glaring omission (or call it a gloss-over) in this article -
whether and when a free() even causes the allocator to actually release memory
to the OS. JVMs are kinda famous for their "om nom nom" attitude on this, but
you also can't carry conclusions about ptmalloc (glibc, whose man page is cited
regarding the >=128k mmap threshold) over to tcmalloc or jemalloc.

Also, on Linux, an allocator may use madvise(MADV_FREE) instead of munmap();
this tells the kernel that the data in a page is no longer needed, so the page
can be reclaimed at the kernel's discretion, possibly even at different times
for different threads (e.g. when doing a task switch for an unrelated reason).
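
For concreteness, a minimal sketch of the call pattern being described (the
region size and flags are chosen arbitrarily for the example; this is not code
from the article):

    #include <stdio.h>
    #include <sys/mman.h>

    int main(void) {
        size_t len = 1 << 20;  /* 1 MiB, large enough to live in its own mapping */

        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        /* ... allocator hands this region out, application uses and "frees" it ... */

        /* Instead of munmap(p, len): the mapping stays valid, but the kernel may
         * reclaim the pages lazily (e.g. under memory pressure), and a later write
         * to a reclaimed page simply gets a fresh zero-filled page.
         * MADV_FREE needs Linux >= 4.5 and a private anonymous mapping. */
        if (madvise(p, len, MADV_FREE) != 0) { perror("madvise"); return 1; }

        return 0;
    }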

[EDIT: this doesn't actually work, see below. Sorry.]

~~~
kentonv
IIRC MADV_FREE doesn't actually let the kernel avoid shootdowns. From the man
page (http://man7.org/linux/man-pages/man2/madvise.2.html):

    
    
        MADV_FREE (since Linux 4.5)
           The application no longer requires the pages in the range
           specified by addr and len.  The kernel can thus free these
           pages, but the freeing could be delayed until memory pressure
           occurs.  For each of the pages that has been marked to be
           freed but has not yet been freed, the free operation will be
           canceled if the caller writes into the page. (...)
    

The problem is with this bit about a write to the page canceling the free...
This side effect needs to be implemented in a page fault. But if the page is
still in the TLB, then writes to it won't fault. So... you need a TLB
shootdown. :(

I do wonder why there isn't an API for "lazy munmap()"... it would behave
exactly like munmap(), except that the pages might remain accessible in other
threads until the end of their timeslices, when the kernel can apply queued
TLB flushes. It seems to me likely that 99.9% of uses of munmap() could be
switched to this lazy operation without introducing any bugs. I'm not a kernel
programmer, though. Maybe there's some reason this doesn't work, or maybe the
performance improvements aren't worth the effort?

~~~
cyphar
> I do wonder why there isn't an API for "lazy munmap()"... it would behave
> exactly like munmap(), except that the pages might remain accessible in
> other threads until the end of their timeslices, when the kernel can apply
> queued TLB flushes.

That does exist: it's called MADV_DONTNEED, and most operating systems
implement it. However, MADV_DONTNEED on Linux was implemented incorrectly and
(from memory) would always result in the pages being discarded immediately --
making it roughly equivalent to MADV_FREE. To quote the man page:

> All of the advice values listed here have analogs in the POSIX-specified
> posix_madvise(3) function, and the values have the same meanings, with the
> exception of MADV_DONTNEED.

~~~
kentonv
AFAICT the POSIX behavior for MADV_DONTNEED is just a hint that the memory
won't be accessed soon.

https://pubs.opengroup.org/onlinepubs/009695399/functions/posix_madvise.html
says:

    
    
        POSIX_MADV_DONTNEED
            Specifies that the application expects that it will not access the specified
            range in the near future.
    

https://www.freebsd.org/cgi/man.cgi?query=madvise&sektion=2 says:

    
    
         MADV_DONTNEED    Allows the VM system to decrease the in-memory priority
            of pages in the specified address range. Consequently,
            future references to this address range are more likely
            to incur a page fault.
    

Neither of these suggests that the contents of the memory can be discarded (as
Linux does), only that it can be swapped out.

So it doesn't appear that this provides the "lazy unmap to avoid shootdown"
behavior on any OS.

------
joe_the_user
_Every once in a while I get involuntarily dragged into heated debates about
whether reusing memory is better for performance than freeing it._

I can't comment on every instance the article talks about, but this way of
asking the question seems to me to hide the real problem. It seems simpler to
ask "what memory allocation algorithm should you use?" Which is to say, "does
your knowledge of your application's memory needs and memory performance trump
all the effort and knowledge that went into the memory allocator of the
operating system you're using?" And from there you get into the massive number
of technical considerations the article and others might raise.

Memory allocation is a weird thing: it's an algorithm, but it's often taken as
a given in programming languages and in discussions of algorithms.

~~~
bitcharmer
Hi, author here. I'm not sure I follow your argument. This article doesn't
touch on allocators at all (i.e. SLUB vs. SLAB). It focuses solely on the cost
of _freeing_ memory, of which TLB shootdowns are a notable part.

I even mention it at the beginning:

> Regardless of the method by which your program acquired memory there are
> side effects of freeing/reclaiming it. This post focuses on the impact of so
> called TLB-shootdowns.

Hope this helps.

~~~
joe_the_user
As I understand things, allocating and freeing memory pretty much form a
single system. In particular, if I "manually" allocate 10 MB for my own use,
never free it, and use an internal method to mark the memory free or used, I
will still have issues with caching and virtual memory based on how I use that
memory. I.e., reusing memory effectively amounts to rolling your own free and
allocate functions.

And in general, how contiguously you allocate memory plays a big part in
whether freed memory can be easily discarded from the cache. If you get the
heap to be exactly like a stack, then the cache shouldn't have problems. But
I'll admit I'm not an expert and I could be missing something.

~~~
rocqua
This article isn't so much about freeing as about unmapping memory. It could
well be that you have an allocator that decides not to unmap freed memory so
that it can quickly be reused later.

That said, as per https://linux.die.net/man/3/malloc (and the article), the
default implementation of free will (in some cases) unmap memory.

It is this unmapping of memory that causes other threads to be affected,
because those threads should get a segfault if they try to access that memory
after the unmapping.
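
To make that path concrete, a small sketch (assuming the default glibc malloc
and its documented 128 KiB mmap threshold; the size here is arbitrary):

    #include <stdlib.h>
    #include <string.h>

    int main(void) {
        /* 1 MiB is above glibc's default M_MMAP_THRESHOLD (128 KiB), so this
         * allocation is typically satisfied with a private anonymous mmap(). */
        size_t len = 1 << 20;
        char *p = malloc(len);
        if (!p) return 1;
        memset(p, 0xAB, len);

        /* ...and this free() typically munmap()s the region. In a multi-threaded
         * process, that unmap is what forces TLB shootdowns on the other CPUs the
         * process has run on, so that other threads fault instead of silently
         * using the stale mapping. */
        free(p);
        return 0;
    }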

------
brandmeyer
This whole rigmarole is necessary for a single reason: TLBs don't participate
in the cache coherency system.

Uh, why is that? If they did participate, then the mere act of writing to the
cache line(s) that hold the mapping would implicitly invalidate all of the
associated entries in all of the system's TLBs. (Handwave, handwave), maybe
you still end up needing a barrier similar to the instruction barrier needed
when altering the contents of executable pages.

What's the downside? Is it just power? Or is there something more fundamental
about the TLB structure that makes it impractical?

~~~
eqvinox
I'm guessing what you mean is for the TLB to get a notification when the
physical memory that contains the page mapping it was loaded from is changed.

(As opposed to, some weird contortion the other way around where you touch the
cache for the mapping that got changed.)

It'd probably require holding the cacheline for the TLB entry in the actual
cache in at least MESI "S" state. For 5 levels of page tables. And you can't
do non-flushing changes (e.g. dirty bit) anymore. I'm no CPU designer but it
sounds complicated and bug prone...

~~~
brandmeyer
> I'm guessing what you mean is for the TLB to get a notification when the
> physical memory that contains the page mapping it was loaded from is
> changed.

Yes, that's right.

Let us conceive of an inclusive cache architecture, just for the sake of
argument. L2 already maintains a directory listing of all the lines which are
present in L1I and L1D, and forwards cache coherency traffic to those caches
based on the messages it receives. Expanding this to the ITLB and DTLB would
be exactly the same circuits. The hard parts are already done.

But I think you're onto something with the hardware-managed dirty bit. It's
much less general than the S->E->M transition, and doesn't need the Exclusive
state at all.

I've done a bit more digging in the background. It turns out that ARMv8's TLB
invalidate instructions are broadcast operations; you don't need IPIs to
execute them on every core. So processors do have _some_ snoopy and/or
directory-management hardware interface for managing TLB shootdown. It just
isn't as general-purpose as the rest of the cache maintenance hardware.

~~~
eqvinox
A separate broadcast makes much more sense on the complexity front... if it
were tied to the cache, the TLB wouldn't really care about anything other than
specific transitions. It doesn't need the cacheline contents, and one
cacheline would generally hold 8 or 16 PTEs (depending on PTE and cacheline
size). You'd also necessarily be binding to the PTE's physical address, since
page table walks occur on physical addresses.

With a broadcast, on the other hand, you get the specific transition event you
need, and you can choose whether to broadcast the PTE's physical address, the
data's physical address, or the virtual (+ASID) address, or some combination.
You can also have an acknowledgement returned if needed.

------
cryptonector
Nice write-up. I've long believed that malloc(4192) == mmap() is a very bad
idea for this reason -- in fact, let your heap get giant pages, damn it. Also,
fork() turns out to be pretty evil. fork() is OK when you're forking worker
processes very early in a daemon's life, but for most other uses it's just a
very bad idea -- use vfork() or posix_spawn() instead.

There are just too many problems with fork(). That paper about vfork() being
harmful had it exactly backwards: it is fork() that is harmful. It has safety
issues that make using it correctly hard enough that it's not worth it in many
cases, but the real killer is fork()'s copy semantics, which just kill
performance.

(A colleague of mine has used fork() in signal handlers to call abort() on the
child side as a way of getting core dumps from live processes without killing
them. That's pretty neat, and one of the very few uses of fork() I would
endorse.)
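
Roughly, the trick looks like this (a sketch outside a signal-handler context
for simplicity; the function name is made up and error handling is omitted):

    #include <stdlib.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    /* Dump a core image of the current process state without killing it: the
     * child is a copy-on-write snapshot, so its core reflects the parent's
     * memory at the moment of the fork(), while the parent keeps running. */
    static void dump_core_snapshot(void) {
        pid_t pid = fork();
        if (pid == 0)
            abort();                /* child dies with SIGABRT and writes the core */
        else if (pid > 0)
            waitpid(pid, NULL, 0);  /* reap the child; parent continues */
    }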

------
forrestthewoods
The anti-windows nipple rubbing was... weird. And not helpful.

As a primarily Windows guy I’d rather just know what, if any, differences
there are. =/

~~~
idoncarfb
Is this the only thing from the article you found worth pointing out? Also, no
one here had a problem with that except you. What does that tell you?

~~~
forrestthewoods
That everyone here uses Linux and MacBooks.

------
ncmncm
All of this is moot if you have only the one thread. If you need parallelism,
and can afford its complexity, separate processes sharing only what must be
shared can eliminate your TLB exposure. Single-writer mappings eliminate cache
storms.

All of this is moot if you allocate all your memory upfront.

All of this is moot if you can identify junctures when a stall is free, and
cluster your shootdowns at such times.

~~~
adwn
> _If you need parallelism, and can afford its complexity, separate processes
> sharing only what must be shared can eliminate your TLB exposure._

Then you need to ensure that all shared data structures are allocated
exclusively from a dedicated memory pool that is mapped into all processes.
Oh, and you must make sure that you either don't use any pointers within those
data structures, or that the shared memory pool is mapped at the same address
in all processes. And no pointer inside the pool may point to memory outside
it.
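
For concreteness, the "no raw pointers" part typically means storing links as
offsets from the region base (a hedged sketch, not from the article or either
comment):

    #include <stdint.h>

    /* A node in a shared-memory region. Links are byte offsets from the base of
     * the region, so the structure stays valid no matter where each process has
     * the region mapped. Offset 0 plays the role of NULL here. */
    struct shm_node {
        uint64_t next_off;   /* offset of the next node, 0 if none */
        uint64_t payload;
    };

    static inline struct shm_node *shm_ptr(void *base, uint64_t off) {
        return off ? (struct shm_node *)((char *)base + off) : 0;
    }

    static inline uint64_t shm_off(void *base, struct shm_node *p) {
        return p ? (uint64_t)((char *)p - (char *)base) : 0;
    }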

In theory, it is possible. In practice, this would preclude any kind of non-
trivial parallelism.

~~~
ncmncm
I see that you just haven't done it yet.

"Allocation" in this context just means open(), mmap(). A "shared memory pool"
is a thing I would have no use for.

But I can assure you that you can have excellent, non-trivial parallelism with
separate processes and chosen shared memory pages -- much more so than with
threads that must battle one another for access to locks, queues, and "pools".
I routinely get order-of-magnitude performance improvements by reorganizing
this way.

The single-writer ring buffer is the component that makes it all work. The
environment might seem stark, but in exchange you can have exactly 0%
concurrency overhead. Not wasting 90% on pool management and thread contention
means other optimizations become meaningful. And, you can start and stop the
processes independently.
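
As a flavor of that component, a bare-bones single-writer/single-reader ring
buffer over a shared mapping (a sketch only: the names and size are invented,
the struct is assumed to sit at the start of a MAP_SHARED region visible to
both processes, and a production version needs more care):

    #include <stdatomic.h>
    #include <stdint.h>

    #define RING_SIZE 4096  /* power of two, chosen arbitrarily for the sketch */

    /* Exactly one process writes and exactly one reads, so no locks are needed;
     * acquire/release ordering on the two counters is the only synchronization. */
    struct ring {
        _Atomic uint64_t head;          /* advanced only by the producer */
        _Atomic uint64_t tail;          /* advanced only by the consumer */
        unsigned char    data[RING_SIZE];
    };

    /* Producer side: returns 0 if the buffer is full. */
    static int ring_put(struct ring *r, unsigned char byte) {
        uint64_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
        uint64_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
        if (head - tail == RING_SIZE)
            return 0;
        r->data[head & (RING_SIZE - 1)] = byte;
        atomic_store_explicit(&r->head, head + 1, memory_order_release);
        return 1;
    }

    /* Consumer side: returns 0 if the buffer is empty. */
    static int ring_get(struct ring *r, unsigned char *out) {
        uint64_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
        uint64_t head = atomic_load_explicit(&r->head, memory_order_acquire);
        if (head == tail)
            return 0;
        *out = r->data[tail & (RING_SIZE - 1)];
        atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
        return 1;
    }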

------
rawoke083600
Awesome article... Extra points for posting on blogger.com :)

------
7532yahoogmail
A hell of a good write up. I learned tons here.

------
lonelappde
The title is absolutely obnoxious.

~~~
idoncarfb
That's only if you're devoid of a sense of humor.

