
A cache invalidation bug in Linux memory management
https://googleprojectzero.blogspot.com/2018/09/a-cache-invalidation-bug-in-linux.html
======
tom_
4 billion was a lot when I first got into programming, but it just isn't that
much any more. Always start out by making your sequence numbers 64 bits: 2^64
nanoseconds = ~585 years. (I have in the past stolen a few bits for other
uses, but I'm less certain of the wisdom of that these days.)

32 bits is suitable for things that can only happen once a frame - (2^32/60)
seconds = 2.26 years - but not much more than that.

But there's a flip side to computers being able to wrap a uint32_t so quickly:
if you've got 32 bits'-worth of combinations, then you might actually be able
to do an exhaustive test. 2^32 microseconds = 71 minutes. Work in C++, run on
all threads, and you probably won't even need to wait that long.

Good example: https://randomascii.wordpress.com/2014/01/27/theres-only-four-billion-floatsso-test-them-all/ - runs in 90 seconds.

I did something similar a few years ago, prototyping a 16x16 multiply routine
for an 8-bit CPU. It took about 5 minutes running on both threads on my 2.5GHz
dual core laptop, and found some interesting cases I'd got completely wrong.
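
Not my actual routine, but here's a minimal C++ sketch of the shape of such a
sweep: an invented mul16() with a deliberate bug, checked against the native
multiply across all 2^32 input pairs, striped over threads.

    #include <atomic>
    #include <cstdint>
    #include <cstdio>
    #include <thread>
    #include <vector>

    // Invented stand-in for the routine being prototyped; deliberately
    // truncates the product so the exhaustive sweep has something to find.
    static uint32_t mul16(uint16_t a, uint16_t b) {
        return (uint16_t)((uint32_t)a * b);  // bug: drops the upper 16 bits
    }

    int main() {
        unsigned nthreads = std::thread::hardware_concurrency();
        if (nthreads == 0) nthreads = 2;  // hardware_concurrency may return 0
        std::atomic<uint64_t> mismatches{0};
        std::vector<std::thread> pool;
        for (unsigned t = 0; t < nthreads; ++t) {
            pool.emplace_back([&, t] {
                uint64_t bad = 0;
                // Each thread takes a stripe of the 2^16 values of 'a';
                // the inner loop covers all 2^16 values of 'b'.
                for (uint32_t a = t; a < 0x10000; a += nthreads)
                    for (uint32_t b = 0; b < 0x10000; ++b)
                        if (mul16((uint16_t)a, (uint16_t)b) != a * b)
                            ++bad;
                mismatches += bad;
            });
        }
        for (auto& th : pool) th.join();
        printf("%llu mismatches out of 2^32 cases\n",
               (unsigned long long)mismatches.load());
    }

Striping by 'a' keeps each thread's work independent; the only shared state is
the final tally.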

~~~
mmphosis
65536 was a lot when I first got into programming. There was also floating
point, and then the realization that we don't need to be constrained to 4
bits, 8 bits, 16 bits, 24 bits, 32 bits, 64 bits, 128 bits, two-digit years,
640K of RAM, small stack frames, broken memory models and all the other
ridiculous limits. Give it a decade and I bet the ~585 year time frame is
reduced to a small enough interval where bugs appear. Bet on lots of 64-bit
bugs appearing, and a programmer in the future commenting that 128-bits should
have been used, and 128-bits will break in the future of that future, and so
on.

For me, the biggest bug in the modern x86 chip is the ME: a full system that
can be turned on, take over everything, and eat up valuable core space.

~~~
dagenix
> Give it a decade and I bet the ~585 year time frame is reduced to a small
> enough interval where bugs appear.

I'm skeptical that much changes in the next decade regarding where 64-bit
integers are useful. It's been a long time since there have been meaningful
clock frequency increases in commodity CPUs. Even if we had a 100GHz CPU and
had it increment a register once per clock cycle, we're still talking ~6
years for it to overflow. And that's assuming a 20+ times increase in clock
frequency when clock frequencies have been mostly flat for the last 10 years.
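
The back-of-the-envelope, as a runnable sanity check (the 100GHz figure is the
hypothetical above, not real hardware):

    #include <cstdio>

    int main() {
        const double hz = 100e9;            // assumed 100GHz, one increment per cycle
        const double year = 365.25 * 86400; // seconds per year
        printf("64-bit wrap: %.1f years\n", 18446744073709551616.0 / hz / year);
        printf("32-bit wrap: %.1f ms\n", 4294967296.0 / hz * 1e3);
    }

That prints ~5.8 years for 64 bits and ~43 milliseconds for 32 bits.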

> a programmer in the future commenting that 128-bits should have been used,
> and 128-bits will break in the future of that future, and so on.

I disagree with this statement. With 128 bits, it becomes difficult (though
not impossible) even to come up with things numerous enough that you couldn't
count them with that number of bits. That was never the case for 64-bit or
smaller integers.

------
steelframe
Note this is by Jann Horn, the same Google security researcher who discovered
Spectre/Meltdown. Serious props to this guy, and hopefully he stays in a
whitehat role where he can find and deal with these issues through responsible
disclosure.

~~~
ronnier
Wow he’s only 22
https://www.bloomberg.com/news/articles/2018-01-17/how-a-22-year-old-discovered-the-worst-chip-flaws-in-history

~~~
romed
Perhaps people new to the codebase (or the industry, or to life in general)
are less likely to view existing constructs as obviously correct, and
therefore more likely to point out flaws.

~~~
quotemstr
That's one reason big tech companies should allow easy inter-team transfers.

~~~
romed
Also why you should give all the critical projects to interns.

------
jacquesm
"This fits in with the kernel's policy of attempting to keep the system
running as much as possible by default; if an accidental use-after-free bug
occurs here for some reason, the kernel can probably heuristically mitigate
its effects and keep the process working."

That's a pretty bad design decision. I'd much rather have a kernel panic than
a kernel that continues to run with known-bad data structures. That way the
bugs will never really shake out; silent failures like that are super
dangerous because the system is essentially running in an undefined state past
that point.
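
(For what it's worth, the fail-stop preference is configurable; mainline Linux
has long had a panic_on_oops sysctl that turns recoverable oopses into
immediate panics:)

    # Opt into fail-stop: panic instead of attempting to continue after an oops.
    sysctl -w kernel.panic_on_oops=1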

~~~
tonysdg
It's terrible from a correctness perspective, sure. But from a business
perspective, that could mean $50 million of revenue not lost to a flaw you (as
an owner/operator) can't do much about. Sure, it could lead to an exploit like
this, but I'd wager a cost-benefit analysis for almost any organization
(except maybe CIA/NSA types) would support this design choice.

That's not to say it's a _good_ design choice, but it's certainly a defensible
one IMO. You can have the most secure OS in the world, but if no one wants to
use it, all you've got is a very secure waste of hard drive space.

~~~
jacquesm
If a kernel panic can cost you $50 million, you have other problems. Really,
in an organization where downtime due to a server rebooting would be that
expensive, you'd hope they would be able to deal with that gracefully, and
that they would be deploying the rough equivalent of Chaos Monkey to ensure
that their stuff is protected against such errors.

After all, a hard drive or a CPU could die just the same.

~~~
sangnoir
Your implicit assumption is that the server will only reboot once. I think if
the Linux kernel is upfront about its security posture and your threat model
is incompatible with it, it is OK to go with OpenBSD.

~~~
jacquesm
The more frequent the reboots the quicker the problem will get diagnosed!

Especially if the system is still stable enough to write a log entry to disk.

------
afwaller
There are only two hard problems in computer science.

Cache invalidation, and naming things. And off by one errors.

~~~
progval
And off by 2^32 errors

~~~
taneq
Don't think I've ever heard of off-by-0 errors before. Doesn't sound that
bad...

------
quotemstr
I've been taking a pretty close look at Linux's mm code recently, and I've
been wondering why it needs to be as complex as it is. There's a brain-
exploding amount of complexity, most of it open-coded, that seems to my eye to
be unnecessary, or at least amenable to abstraction.

For example:

1) Why do we have entirely different management structures for anonymous
memory (anon_vma and friends) and shared memory (inodes) when they really end
up doing the same job? (Anonymous memory can be shared, so both paths end up
needing to do the same kind of thing.) ISTM we can just use shared objects in
all cases, tweaking the semantics as needed for shared memory. Anonymous
memory is just swap-backed shared memory, after all. It should use the same
logic.

(Yes, you need to work without a swapfile. No, that doesn't change the
conceptual model.)

2) Do we really need open-coded page table walking? Why duplicate essentially
the same logic four times? Sure, you occasionally want to do slightly
different things at each page table level, but you can still unify most of the
logic. (If you really want, you can encode the per-level differences in code
generated via C++ templates; see the sketch at the end of this comment.)

3) Do micro-optimizations of the sort mentioned in the article really help? If
these sequence numbers had been 64 bits long, they wouldn't have overflowed.
If the VMA tree had been implemented as a splay tree (like on NT) instead of
an RB tree with a per-thread cache, the per-thread cache might not have been
needed. (Splay trees automatically cache recently-accessed nodes.) How much do
these little tricks actually help? Is their _cumulative_ effect positive?

4) Why is so much of the vm internal logic spread throughout the kernel? Both
NT and the BSDs have a well-defined API that separates the MM subsystem from
the rest of ring 0, but on Linux, ISTM that lots of code needs to care
directly about low-level mm structures and locks. Why should code very far
from the mm core (like the i915 driver) take mmap_sem directly? It feels like
more abstraction would be possible here.

I get arguments based around ruthless pursuit of performance, but with
function calls taking nanoseconds and page faults taking orders of magnitude
more time, is anyone saving anything significant?
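
To make point 2 concrete, a toy sketch of what a templated walker might look
like. Everything here (level count, entry layout, names) is invented for
illustration; it's the shape of the abstraction, not the kernel's actual
structures:

    #include <cstdint>
    #include <cstdio>

    // Toy 3-level table: each level is an array of 512 entries, and a
    // non-leaf entry points at the next level down.
    constexpr int kEntries = 512;
    struct Node { void* slot[kEntries] = {}; };

    // One generic walker; the per-level differences live in the visitor,
    // which receives the level number, so callers can special-case
    // individual levels without duplicating the walk itself.
    template <int Level, typename Visitor>
    void walk(Node* table, uint64_t addr, Visitor&& visit) {
        int shift = 12 + 9 * (Level - 1);  // 9 index bits per level, 4K pages
        int idx = (int)((addr >> shift) & (kEntries - 1));
        visit(Level, idx);                 // per-level hook
        if constexpr (Level > 1) {
            Node* next = static_cast<Node*>(table->slot[idx]);
            if (next) walk<Level - 1>(next, addr, visit);
        }
    }

    int main() {
        Node root, mid, leaf;
        uint64_t addr = 0x7f1234567000ULL;
        // Wire up the one path through the toy table that addr takes.
        root.slot[(addr >> 30) & (kEntries - 1)] = &mid;
        mid.slot[(addr >> 21) & (kEntries - 1)] = &leaf;
        walk<3>(&root, addr, [](int level, int idx) {
            printf("level %d -> index %d\n", level, idx);
        });
    }

The per-level special cases then live in one visitor rather than four
hand-unrolled walk functions.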

~~~
jacquesm
If you strip out all performance optimizations from modern computers they will
be a lot more secure but they will run slow as molasses. Starting with the
hardware caches and ending with optimizations like these we are not talking
'minor gains' but likely orders of magnitude difference between optimized and
non-optimized. And all of those have security trade-offs.

It's a tough call. The mantra used to be 'get it right, then make it fast',
but in practice whoever ships the fastest stuff will win the race, even if it
is incorrect. As long as it is correct long enough to run the benchmarks, I
guess.

This is yet another reason why I love microkernels (real ones), they are a lot
easier to reason about and to get right to the point that you can really rely
on them not to suddenly exhibit some weird behavior due to the complexity of
their execution context.

~~~
quotemstr
> If you strip out all performance optimizations from modern computers they
> will be a lot more secure but they will run slow as molasses

Each optimization needs to be evaluated on its own. Not every optimization
actually helps, and due to the effects of path length, icache pressure, and
code complexity, the cumulative effect of many micro-optimizations may end up
being negative. (It's why -funroll-all-loops is not in fact very fun.)

There's no forced tradeoff between complexity and speed! The fastest code is
the code that doesn't run at all. Often, the best way to optimize a program is
to strip it down to the rafters and then re-add only the optimizations that
actually help. Simpler is often faster.

~~~
jacquesm
That is mostly true, but the fact is that until Spectre and Meltdown everybody
believed that the security implications of hardware caches were theoretical.
It took decades from when we first started using them for the practical
manifestations of issues that were warned about long ago to appear.

For pure software optimizations I would be more than happy to end up with a
machine that is half as fast but bulletproof. But I am not sure that I am in
the majority there.

And even though you are right that there is no forced tradeoff between
complexity and speed, in practice there often is one, so often that we have
institutionalized it: we generally accept that in order to make things faster
we will have to make them more complex. If that weren't so, the most naive
version of the code would in general be the fastest, and that just isn't the
case.

So in the end it all boils down to where you draw an arbitrary line and say
'to the right there is too much complexity, to the left we are too slow'.
Everybody will want to draw that line somewhere else, so we try to have our
cake and eat it too: we optimize as much as we can, make stuff more and more
complex, and then we try to shoot holes in it. Sometimes successfully!

------
misiti3780
Serious question: where would you start if you wanted to learn how to
find/diagnose these types of issues, and maybe one day work for Project Zero?
Every time I see these posts it blows my mind, maybe because I have never done
any kernel development, but I tend to get the feeling these guys/girls must be
the best of the best.

------
zydeco
I can see why they haven't come up with a catchy name for it yet.

~~~
db48x
It should obviously be The Mummy's Curse, because of all the wrappings.

------
kolderman
Sounds like a classic case of "not thinking things through properly at the
beginning".

I wonder how FreeBSD engineered this... Linux appears to grow organically at
times.

~~~
markjdb
You could say that about most bugs...

FreeBSD doesn't have a per-thread cache, but does similarly use a generation
number to allow operations to detect stale state upon reacquiring the vm map
lock. I don't see why it isn't prone to the same kinds of bugs.
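
(The generation-number pattern in the abstract, with invented names; this is
the general shape, not FreeBSD's actual code:)

    #include <cstdint>
    #include <mutex>

    struct VmMap {
        std::mutex lock;
        uint64_t generation = 0;  // bumped under the lock on every modification
    };

    // A lookup caches its result tagged with the generation it saw; after
    // dropping and reacquiring the lock, a changed generation means the map
    // was modified underneath us and the cached result must be discarded.
    // The Linux bug was, roughly, this check with a 32-bit counter whose
    // wraparound handling was flawed.
    bool cache_still_valid(VmMap& m, uint64_t cached_gen) {
        std::lock_guard<std::mutex> g(m.lock);
        return m.generation == cached_gen;
    }

    int main() {
        VmMap m;
        uint64_t tag = m.generation;  // taken while holding the lock, in real code
        m.generation++;               // someone modifies the map
        return cache_still_valid(m, tag) ? 1 : 0;  // 0: cache correctly invalidated
    }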

~~~
kolderman
I entirely disagree. You can have the most well-thought-out, architected,
designed system in the world with:

    if (i=0) {
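
(A runnable version of the classic, for anyone who hasn't been bitten; GCC and
Clang both flag the assignment-used-as-condition via -Wparentheses, which
-Wall enables:)

    #include <cstdio>

    int main() {
        int i = 1;
        if (i = 0) {                 // assignment, not comparison: always false
            puts("never reached");
        }
        printf("i is now %d\n", i);  // and the condition clobbered i
        return 0;
    }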

~~~
markjdb
You could argue that a system written in a language that permits such errors
is not well-thought out. The terms you're using are not well-defined, so it's
easy to disagree endlessly without arriving at a useful insight.

------
hi41
This is such fabulous work! Sometimes I look at such work with a tinge of
jealousy. How does one get so good at programming? I try, but I cannot get to
be this good. I especially find it hard to understand someone else's code.

------
alain94040
_The bug was fixed by changing the sequence numbers to 64 bits, thereby making
an overflow infeasible, and removing the overflow handling logic._

So the bug is still there then? It will just take longer to hit.

Until someone else decides to change the encoding of that field for some
reason, and not initialize it to 0; it instead starts with a random value (for
safety reasons). Good luck figuring out that random crash that happens once in
a blue moon.

~~~
jacquesm
> So the bug is still there then? It will just take longer to hit.

It will take so long that the hardware will have died many times over before
the bug triggers, assuming society as we know it is still around.

~~~
alain94040
Like I said in my original comment, not if a future kernel change randomizes
that value to make it unpredictable to attackers. Then the bug will be back.

~~~
winstonewert
The bug doesn't trigger when the counter resets to zero; the bug triggers when
the counter reuses a previous value.

So:

A) There is no reason whatsoever to randomize the counter.

B) Even if they did, it wouldn't be a problem, because it would still take an
absurdly long time to loop around to the starting position.
