
The mysterious case of the Linux Page Table Isolation patches - KirinDave
http://pythonsweetness.tumblr.com/post/169166980422/the-mysterious-case-of-the-linux-page-table
======
justincormack
Looks like it is speculative execution based, and does not affect AMD

[https://lkml.org/lkml/2017/12/27/2](https://lkml.org/lkml/2017/12/27/2)

AMD processors are not subject to the types of attacks that the kernel page
table isolation feature protects against. The AMD microarchitecture does not
allow memory references, including speculative references, that access higher
privileged data when running in a lesser privileged mode when that access
would result in a page fault.

Disable page table isolation by default on AMD processors by not setting the
X86_BUG_CPU_INSECURE feature, which controls whether X86_FEATURE_PTI is set.
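
As a rough illustration of the decision the patch description implies, here is a toy Python model. It is not kernel code: the `Cpu` class and both function names are made up for this sketch, and only the `X86_BUG_CPU_INSECURE` name comes from the quoted text.

```python
from dataclasses import dataclass, field


@dataclass
class Cpu:
    vendor: str                      # e.g. "Intel" or "AMD"
    bugs: set = field(default_factory=set)


def detect_bugs(cpu: Cpu) -> None:
    # Assumption drawn from the patch description: every x86 CPU
    # except AMD gets the "insecure" bug flag.
    if cpu.vendor != "AMD":
        cpu.bugs.add("X86_BUG_CPU_INSECURE")


def pti_enabled_by_default(cpu: Cpu) -> bool:
    # Page table isolation defaults to on only when the bug flag is set.
    return "X86_BUG_CPU_INSECURE" in cpu.bugs


for vendor in ("Intel", "AMD"):
    cpu = Cpu(vendor)
    detect_bugs(cpu)
    print(vendor, pti_enabled_by_default(cpu))
```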

~~~
wilun
I guess Intel decided to speculate data accesses regardless of the privilege
level of the target address, on the theory that speculated data can't be
accessed anyway before the permissions are really checked, and somebody found
a bug (or, given that all Intel processors are tagged as insecure, maybe a
quasi-architectural hole) that lets you read the speculated data or a
significant subset or trace of it.

My wild guess is that you can read a good portion (if not all) of the memory
(or a significant subset or trace of it) of the whole computer from
unprivileged userspace programs.

~~~
rst
One possible vector suggested by Matt Tait (pwnallthethings) on Twitter: if
speculative operations can influence what the processor does with the cache,
the results can be observed with cache timing attacks. If the branch predictor
reads the results of speculative operations, it's _real_ easy, as he suggests
here:

[https://twitter.com/pwnallthethings/status/94797892728438374...](https://twitter.com/pwnallthethings/status/947978927284383744)

but (as he notes elsewhere) there are plenty of other channels in an Intel
processor for information to leak...
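
To make the cache-timing idea concrete, here is a toy simulation in the spirit of a flush-and-reload measurement. Everything in it is an assumption for illustration: the `ToyCache` class, the latencies, and the "speculative" access are simulated in plain Python, and nothing here touches or exploits a real processor.

```python
import random

LINE = 64             # assumed cache line size in bytes
HIT, MISS = 10, 200   # made-up access latencies, in "cycles"


class ToyCache:
    """Tracks which lines are cached and reports a fake latency."""

    def __init__(self):
        self.lines = set()

    def flush(self):
        self.lines.clear()

    def access(self, addr: int) -> int:
        line = addr // LINE
        cost = HIT if line in self.lines else MISS
        self.lines.add(line)
        return cost


def speculative_victim(cache: ToyCache, secret_byte: int, probe_base: int) -> None:
    # Stand-in for a squashed speculative load: the architectural result
    # is thrown away, but the touched cache line stays cached.
    cache.access(probe_base + secret_byte * LINE)


def recover_byte(cache: ToyCache, probe_base: int) -> int:
    # Time one access per candidate value; the fastest line reveals the byte.
    timings = [cache.access(probe_base + v * LINE) for v in range(256)]
    return min(range(256), key=timings.__getitem__)


cache = ToyCache()
secret = random.randrange(256)
cache.flush()
speculative_victim(cache, secret, probe_base=0x10000)
print(recover_byte(cache, 0x10000) == secret)  # True
```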

~~~
Taniwha
my guess is that the cache tags or tlb entries loaded on failed speculative
accesses are wrong (maybe the valid bit is set but the address wasn't changed,
or the user/supervisor protections are munged), that could leave you with a
cache line or page tagged as user accessible but really protected kernel data

------
joncrane
If true, this is pretty huge. As if the AWS "You blew through your budget"
emails right around midnight of New Years were only an appetizer.

Edit: AWS Spurious Budget Email Barrage:
[https://www.reddit.com/r/aws/comments/7ndvli/anybody_get_spu...](https://www.reddit.com/r/aws/comments/7ndvli/anybody_get_spurious_aws_budgets_alarms_early_on/)

~~~
ProAm
I missed what happened with AWS. Any details?

~~~
jerrysievert
not sure on exact details, but I received one as well, on a free-tier account
I had sitting around with an empty dynamodb table that was showing very high
projected usage. it was enough that I logged in immediately thinking that
account had been hacked.

nope, still empty table, deleted it and went to bed. glad I wasn't the only
one who got that.

~~~
dghughes
Same here. Mine was on standby so I terminated it.

I wasn't sure what was going on, so I nuked everything, cancelled my account,
and will hope for the best.

~~~
dghughes
Amazon sent me an email this morning (Jan 3, 2018) to say I have a bill but I
had already cancelled my free account. I can't see what I owe (hopefully zero
dollars) because my account is now non-existent.

One tiny Ubuntu instance I barely used and then put on standby. Then
terminated.

Ugh :(

------
vitaliyf
This may or may not be related, but there is a Xen advisory embargoed until
Thursday (see [https://xenbits.xen.org/xsa/](https://xenbits.xen.org/xsa/))
and I am aware of at least one VM provider who scheduled emergency VM reboots
across their entire fleet this week because the issue cannot be addressed
through hot-patching.

~~~
snom380
I got around 10% of our Xen instances scheduled for reboot on Jan 4th, mostly
long running instances. It's the first time I've seen that many instances
scheduled at once.

~~~
VectorLock
Also got a large number of instances with scheduled reboot on Jan 4th in older
AZs.

------
userbinator
IMHO, with RowHammer, the hardware is broken and it will continue to be broken
until users complain enough --- maybe to the point of absolutely refusing to
buy --- that the manufacturers and designers stop thinking "works
99.9999999999% of the time" is good enough:
[https://news.ycombinator.com/item?id=12410274](https://news.ycombinator.com/item?id=12410274)

~~~
hujji
The current miniaturization of DRAM circuitry doesn't really allow for a
hardware fix for the RowHammer attack. During DRAM manufacturing a test
similar to the RowHammer attack exists. This test has certain bounds for
passing. If the bounds were tightened up to the level of perfection to prevent
the attack it would drop the yield a considerable amount.

~~~
userbinator
_If the bounds were tightened up to the level of perfection to prevent the
attack it would drop the yield a considerable amount._

The fact that DRAM older than a few years is effectively immune to RH suggests
it is possible to manufacture such. Yes, it will cost more, but I think many
would be willing to pay for it like they used to, for none other than the
assurance of having more reliable memory.

~~~
fiddlerwoaroof
Isn’t the problem related to the capacity of current memory modules? So you
can have, say, a 32GB module that’s vulnerable or a 8GB module that isn’t?
(Assuming the 8GB module uses lower density DRAM chips)

~~~
HelloNurse
A new 8GB DIMM module has about 1/4 the silicon area of a new 32GB module and
the same density (often 1/4 the number of identical chips); only an _old_ 8GB
module, made with an entirely different process, would have larger and less
dense safer DRAM cells.

~~~
fiddlerwoaroof
Yeah, that's why I included the qualification

------
Aissen
Guess: it is not directly related to RowHammer/DRAM, and is purely a CPU
issue. Maybe PageFault timing info leaks or cache timing, like this pwnie
award winner: [http://www.cs.vu.nl/~giuffrida/papers/anc-ndss-2017.pdf](http://www.cs.vu.nl/~giuffrida/papers/anc-ndss-2017.pdf)

If this last one can defeat ASLR, imagine leaking bit by bit from a co-hosted
VM, to extract secrets of other cloud customers… This is the reason anyone
serious about cloud security will reserve instances so that they won't share
physical hardware with other customers (think EC2 dedicated instances).

~~~
Aissen
Addendum: I know in most modern CPUs the memory controller is on-die, so my
comment is partially wrong (RowHammer is definitely a SoC issue).

Also, if you're interested in this type of things: _Armv8.4-A adds a flag …
indicating that you want the execution time of instructions to be independent
of the data._

[https://twitter.com/agl__/status/927929410321244160](https://twitter.com/agl__/status/927929410321244160)

Now the primary source seems to have been edited (why?)… But the web archive
still has it:

 _Data Independent Timing

CPU implementations of the Arm Architecture do not have to make guarantees
about the length of time instructions take to execute. In particular, the same
instructions can take different lengths of time, dependent upon the values
that need to be operated on. For example, performing the arithmetic operation
‘1 x 1’ may be quicker than ‘2546483 x 245303’, even though they are both the
same instruction (multiply).

This sensitivity to the data being processed can cause issues when developing
cryptographic algorithms. Here, you want the routine to execute in the same
amount of time no matter what you are processing – so that you don’t
inadvertently leak information to an attacker. To help with this, Armv8.4-A
adds a flag to the processor state, indicating that you want the execution
time of instructions to be independent of the data operated on. This flag does
not apply to every instruction (for example loads and stores may still take
different amounts of time to execute, depending on the memory being accessed),
but it will make development of secure cryptographic routines simpler._

[https://web.archive.org/web/20171107164628/https://community...](https://web.archive.org/web/20171107164628/https://community.arm.com/processors/b/blog/posts/introducing-2017s-extensions-to-the-arm-architecture)

The scope seems limited to ALU, so not really related to the TLB thing we have
here. Also, it's still very far away, I'm not sure its predecessor Armv8.3-A
is even shipping to customers yet.
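
For a software-level analogue of the concern in the quoted text, here is a minimal sketch. It does not model the Armv8.4-A flag itself. The helper `leaky_compare_steps` is hypothetical and just counts the work done by an early-exit comparison, while `hmac.compare_digest` is the Python standard library routine intended to avoid that kind of data-dependent behaviour.

```python
import hmac


def leaky_compare_steps(a: bytes, b: bytes) -> int:
    """Count the byte comparisons an early-exit equality check performs."""
    steps = 0
    for x, y in zip(a, b):
        steps += 1
        if x != y:
            break  # early exit: the amount of work depends on the data
    return steps


secret = b"hunter2!"
print(leaky_compare_steps(secret, b"Xunter2!"))  # 1 (mismatch at first byte)
print(leaky_compare_steps(secret, b"hunter2?"))  # 8 (mismatch at last byte)

# The standard library comparison is designed not to exit early like this.
print(hmac.compare_digest(secret, b"hunter2?"))  # False
```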

------
twoodfin
I'm confused about the TLB impact. The pythonsweetness link claims these
patches now require TLB flushes when crossing the kernel/user boundary, but
the description of KAISER @ lwn[1] suggests that these flushes are unnecessary
with "more recent" processors supporting PCIDs. How recent is "more recent",
and is the PCID support likely to be ported back to earlier kernels along with
KPTI?

TLB flushes for syscalls would be absolutely brutal for many performance-
critical applications.

[1] [https://lwn.net/Articles/738975/](https://lwn.net/Articles/738975/)
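
On the "how recent" part: on Linux, one way to see whether a given machine advertises PCID is to look at the `flags` line of `/proc/cpuinfo`. A minimal sketch follows; the `cpu_flags` helper and the sample text are made up for illustration, and on a real system you would pass in the contents of `/proc/cpuinfo` instead.

```python
def cpu_flags(cpuinfo_text: str) -> set:
    """Return the feature flags from the first 'flags' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


# Made-up sample, shaped like /proc/cpuinfo output.
sample = "processor\t: 0\nvendor_id\t: GenuineIntel\nflags\t\t: fpu vme sse2 pcid invpcid\n"

flags = cpu_flags(sample)
print("pcid" in flags)     # True
print("invpcid" in flags)  # True
```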

~~~
pja
If the problem is row-hammer style attacks on the TLB that let you map
userspace writable pages into the kernel address space then any kernel entries
remaining in the TLB when userspace is running are going to be a security
hole. The problem won’t be a process writing to the kernel entry (that would
be forbidden by existing code / hardware) but a process updating its own TLB
entries in ways that corrupt adjacent kernel ones. PCID doesn't help you here
- indeed it hurts, because it means there are more TLB entries from the
hypervisor or other virtual machines remaining in the TLB to be corrupted!

(Unless I have entirely the wrong end of the stick about this?)

~~~
kijiki
I don't think rowhammer style attacks are possible on TLBs, since they are
SRAMs (CAMs, to be precise), not DRAMs.

~~~
sbierwagen
I took OP to mean "rowhammer style" in the sense of a chip operation having
unexpected physical effects on nearby transistors; not an attack literally
identical to rowhammer.

~~~
pja
Yes, that was my intention. Probably could have been clearer.

------
rwmj
This is probably a better link:

[https://lwn.net/SubscriberLink/741878/eb6c9d3913d7cb2b/](https://lwn.net/SubscriberLink/741878/eb6c9d3913d7cb2b/)
([https://news.ycombinator.com/item?id=16001476](https://news.ycombinator.com/item?id=16001476))

------
aeleos
Usually I go into these things with a fair amount of skepticism, but given the
Linux kernel's usual pace of development and the nature of undisclosed bugs we
have seen in the past, a large hypervisor bug seems like it could be the
reality. It must be pretty bad if it's the kind of bug that they can't really
fix easily and have to push an entirely new feature into something as old and
important as the paging code.

------
jaccarmac
The kernels for Gentoo have been all over the place for the past few weeks.
I'm running 4.12 at the moment, then the repos updated to 4.14, which wouldn't
build for me, so I waited a week for genkernel to modernize. When I came back
4.14 had been marked unstable and 4.12 was masked, making 4.9 the latest
supported kernel. Seems that whatever is happening is a Big Deal.

~~~
revelation
That's because Gentoo decided to switch on a new compile flag, then didn't
bother to test that the kernel still boots:

[https://lkml.org/lkml/2017/12/29/449](https://lkml.org/lkml/2017/12/29/449)

~~~
amluto
And, for reasons that are entirely unknown, the issue got worse due to one of
the PTI patches (written by, and hence tentatively blamed on, yours truly).
Presumably it caused some minor change in code generation causing GCC to go
nuts.

FWIW, the compile flag that Gentoo enabled activates a seriously busted GCC
feature, and I'm a bit surprised that Gentoo gets away with it in user code.

~~~
caf
Is there anywhere to read up on the bustedness of the stack probing feature?
(apart from the obvious incompatibility with trying to do that for kernel
code).

~~~
JdeBP
Probing more than a page size below the current stack pointer is wrong.
Probing more than a page size further when one's saved frame area, save area,
locals area, and (maximum) calling parameters area do not amount to a page in
total is also wrong.

For more on the considerations that underpin stack probing, see
[http://jdebp.eu./FGA/function-perilogues.html#StackProbes](http://jdebp.eu./FGA/function-perilogues.html#StackProbes) for starters.

~~~
amluto
Also, the probe does an unlocked RMW (or 0) instead of a read, which is slower
but also wrong in a multithreaded application. The latter is what broke Go.

~~~
caf
Is that because Go crams the thread stacks closely together and it probed into
another thread's stack?

Regardless, it seems like RMW (or even just a store) should be fine as long as
the stack pointer is adjusted _before_ the probe.

------
jbfoo
Shouldn't cloud-grade computers be immune to rowhammer (or at least rowhammer
should be much less efficient), as they typically use ECC RAM? Flipping bits in
ECC RAM in a way that also modifies the checksum in a deterministic way is
(was?) not practical?

~~~
dmm
ECC doesn't protect you from all rowhammer problems because they can flip
more than two bits at a time, which is the limit of what ECC can detect.

"Tests show that simple ECC solutions, providing single-error correction and
double-error detection (SECDED) capabilities, are not able to correct or
detect all observed disturbance errors because some of them include more than
two flipped bits per memory word"

[https://en.wikipedia.org/wiki/Row_hammer#Mitigation](https://en.wikipedia.org/wiki/Row_hammer#Mitigation)
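
The limit described in that quote can be seen with a toy SECDED code. This sketch uses an extended Hamming(8,4) code protecting 4 data bits, which is an assumption made to keep it short (real ECC DIMMs protect much wider words), and the `encode`, `decode`, and `flip` helpers are hypothetical names for this example only.

```python
def encode(data: int) -> list:
    """Encode 4 data bits as an 8-bit extended Hamming (SECDED) word."""
    word = [0] * 8
    word[3], word[5], word[6], word[7] = [(data >> i) & 1 for i in range(4)]
    word[1] = word[3] ^ word[5] ^ word[7]   # parity over positions 1,3,5,7
    word[2] = word[3] ^ word[6] ^ word[7]   # parity over positions 2,3,6,7
    word[4] = word[5] ^ word[6] ^ word[7]   # parity over positions 4,5,6,7
    word[0] = sum(word[1:]) % 2             # overall parity bit
    return word


def decode(word: list):
    """Return (status, data); data is None when the error is uncorrectable."""
    word = list(word)
    syndrome = 0
    for pos in range(1, 8):
        if word[pos]:
            syndrome ^= pos
    if sum(word) % 2 == 1:
        # Odd parity: assume exactly one flip, at the position in the syndrome
        # (syndrome 0 means the overall parity bit itself).
        word[syndrome] ^= 1
        status = "corrected"
    elif syndrome != 0:
        return "uncorrectable", None        # even parity, nonzero syndrome
    else:
        status = "ok"
    data = word[3] | word[5] << 1 | word[6] << 2 | word[7] << 3
    return status, data


def flip(word: list, *positions: int) -> list:
    out = list(word)
    for p in positions:
        out[p] ^= 1
    return out


w = encode(0b1011)
print(decode(flip(w, 5)))        # ('corrected', 11): one flip is repaired
print(decode(flip(w, 2, 5)))     # ('uncorrectable', None): two flips are detected
print(decode(flip(w, 3, 5, 6)))  # ('corrected', 12): three flips, silently wrong
```

The last line is the case the quote is about: with three flipped bits the decoder believes it fixed a single error and hands back wrong data.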

~~~
hathawsh
OTOH, a rowhammer attack on ECC memory will likely flip 1 bit before it flips
2, making attacks theoretically detectable. Without ECC, there's no clear way
to detect an attack.

~~~
lolc
I'd assume that parity is checked on access, which may give enough time to
flip more than one bit before it's detected.

~~~
Rebellos
ECC memory controller performs memory scrubbing periodically, in the
background, during which it checks parity and corrects any bitflips. Otherwise
ECC would not work nearly as well as it does.

~~~
Dylan16807
Parity isn't checked during every single row refresh?

~~~
cesarb
AFAIK, row refresh is done within each memory chip, while the ECC bits are
normally on a separate chip (for instance, where a non-ECC module has 8 chips,
an ECC module has 9 chips), so ECC scrubbing has to be done in the memory
controller.

------
phigcch
Interesting to note that [https://en.wikipedia.org/wiki/Kernel_page-table_isolation](https://en.wikipedia.org/wiki/Kernel_page-table_isolation)
was created on December 29th.

------
TheSwordsman
One question I have around this is whether the patches made to the Windows
kernel in November exhibit the same performance hits. Does anyone know?

I'm due to refresh my gaming PC, and I was going to go with Intel again as
they've not been a problem. However, if Intel chips are going to incur the
same 5% - 50% performance hit on Windows, I might end up investing in AMD
hardware instead.

~~~
JJJollyjim
I'm certainly no expert, but I would guess that the context-switch time (where
I believe the new overhead is added) in games is fairly small compared to the
raw number-crunching in-process, so the effect would be minimal in any case

~~~
snuxoll
Every draw call will need to transition to kernel space to send data over the
PCIe bus to the GPU. Modern games execute something on the order of 1000+
draws per frame, so assuming 60fps that's going to be at least 60,000*2
context switches into the kernel and back per second, more if you're doing
high refresh rates.

How big the impact will be, I don't know - but I wouldn't be surprised if it
was a couple percent (effectively ruining the single-threaded performance
boost Intel has in gaming over AMD before accounting for overclocking).
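
Taking the numbers in this comment at face value (1000 draws per frame, 60fps, one kernel entry and one exit per draw), the arithmetic can be sketched as below. The extra cost per transition is a pure placeholder, not a measurement, and the function names are made up for the example.

```python
def transitions_per_second(draws_per_frame: int, fps: int) -> int:
    # One entry into the kernel and one return per draw call.
    return draws_per_frame * fps * 2


def overhead_fraction(transitions: int, extra_ns_per_transition: float) -> float:
    # Fraction of each second spent on the added per-transition cost.
    return transitions * extra_ns_per_transition / 1e9


n = transitions_per_second(draws_per_frame=1000, fps=60)
print(n)  # 120000

# Hypothetical extra costs per transition, in nanoseconds.
for extra_ns in (100, 200, 500):
    print(extra_ns, f"{overhead_fraction(n, extra_ns):.1%}")
```

With these placeholder costs the result lands between roughly 1% and 6% of one core's time, so the estimate is only as good as the per-transition figure plugged in.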

~~~
TheSwordsman
I was also worried about network I/O too, which could be an absolute pain for
games where you care about latency.

~~~
user5994461
Network IO is negligible in game. The source engine for instance is hard
limited to using 30 kB/s of bandwidth.

------
nobody987
AWS had scheduled maintenance requiring reboot of all EC2 instances in
December. Maybe this is somehow related.

~~~
chillydawg
It wasn't all instances. Only one of mine needed a reboot.

~~~
nobody987
In my case the reboot was scheduled for most (if not all) EC2 and RDS
instances.

------
Asdfbla
Just curious: Even after an attacker goes through all the effort of finding
out the physical address of the memory location they want to manipulate, how
would someone make sure to get an adjacent memory location to even attempt to
execute the Rowhammer attack? And even then, the smallest memory units
allocated are basically pages within page frames, right? So if your target
memory row is within a physical page frame, does the RowHammer attack even
work? (Since there's no adjacent row an attacker has access to then.)

~~~
seibelj
If it’s something that can be triggered from browser JavaScript, maybe it can
be attempted N times per second, and there is a mathematical probability of
gaining full privileges within a set amount of attempts.
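
That probability can be written down under one big assumption, namely that each attempt succeeds independently with the same probability p. Then the chance of at least one success in N attempts is 1 - (1 - p)^N. A small sketch, with made-up function names and a made-up p:

```python
import math


def success_probability(p: float, attempts: int) -> float:
    """Chance of at least one success in `attempts` independent tries."""
    return 1 - (1 - p) ** attempts


def attempts_needed(p: float, target: float) -> int:
    """Smallest number of tries whose success chance reaches `target`."""
    return math.ceil(math.log(1 - target) / math.log(1 - p))


p = 0.001  # hypothetical per-attempt success rate
print(attempts_needed(p, 0.5))                 # 693
print(round(success_probability(p, 693), 3))   # 0.5
```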

------
qume
The bare metal cloud providers will be rubbing their hands together

~~~
saas_co_de
I never understand why people don't go bare metal. It is just as easy to
automate, just as cheap (or cheaper) if you plan well, and more secure.

A decade ago everyone knew that shared hosting was for hobby sites and stuff
that didn't really matter.

Maybe some more people will learn that lesson.

~~~
foobarbazetc
As someone who shares your skepticism of the cloud, I can say that people
don’t switch from bare metal hosting (something like SoftLayer) to AWS/GCP for
the cost.

If you do the math like “we have 1000 cores and 2048Gb of RAM and 10Tb of
RAID’ed SSD” and then plug that in to the GCP calculator... it’s going to be
at minimum 1.5-2x your bare metal cost.

That’s not even including bandwidth which is pretty much free at bare metal
hosts unless you’re doing a lot of egress.

The calculus changes when you realize that you’re over-provisioned on the bare
metal side for a variety of reasons: high availability, “what if”, future
growth that’s more medium term than short, etc.

Then you scale back the numbers you’re plugging into the calculator and things
are still expensive but now within reason.

Couple that with things like global anycast region aware load balancer,
firewalls (an in-line 10GigE highly available firewall costs _a lot_ of
money), ability to spin up hundreds of cores in 5 seconds and the value
proposition becomes clearer.

It still depends on your work load, but there’s a lot more to consider than
just straight up monthly cost.

~~~
saas_co_de
Totally agree. Cloud makes tons of sense if your workload is really dynamic.
Lots of small players are running static workloads though because actually
setting up dynamic workloads is pretty complex.

I use GCE for DNS, Storage, CDN (for fronting storage backed files), dynamic
workloads that can run on preemptible instances, and scalable instances to
serve published static content, but I use dedicated servers for databases,
elasticsearch, redis, and application servers fronting those things.

~~~
foobarbazetc
Yeah we’re medium size but still bare metal at IBM/SoftLayer.

We keep looking at GCP waiting for the pricing to make sense and still trying
to figure out how people run low latency Postgres on there. :)

~~~
jsolson
Have you run into latency issues with Postgres?

(I work on GCE)

------
john_moscow
>public NT kernels from as early as November have begun to implement the same
technique.

Does the author refer to ReactOS, or has Microsoft really open-sourced parts
of the NT kernel?

~~~
neerajsi
Reverse engineers pretty much know how everything in NT works. Msft publishes
enough symbols that it's even possible to automatically decompile much of the
code. Something like page table splitting would be obvious.

~~~
bhouston
And the source of an older version of NT leaked a while back.

~~~
nikanj
I think that was NT 4. I don’t think there’s much of that 20-year old code
left in the kernel.

~~~
PyComfy
The source code of the kernel of Windows Server 2003 and Windows XP Pro x64
was available to universities for education purposes [1]. Someone leaked the
code on the internet years ago, and it is now everywhere on GitHub (search
WRK-v1.2). The code doesn't include the NTFS module.

[1] [https://web.archive.org/web/20120412091908/http://www.facult...](https://web.archive.org/web/20120412091908/http://www.facultyresourcecenter.com:80/curriculum/pfv.aspx?ID=7366&c1=en-us&c2=0)

------
taHAETY
This might seem like too easy a theory, but if you look into the article
posted by jedisct1, there is the somewhat unrestricted access to the L1 cache,
and with the multiple mentions of Rowhammer, could it just be rowhammering
from whatever the L1 cache accesses?

------
rasur
So, I googled quickly but couldn't see anything obvious.. this does or doesn't
affect Linux running on IBM Z Series mainframes then? I haven't seen much
about if Power CPUs are affected by the same flaws

------
cybercognitio
[https://newsroom.intel.com/news/intel-responds-to-security-r...](https://newsroom.intel.com/news/intel-responds-to-security-research-findings/)

------
hnaparst
Where is the patch for the Linux kernel, and what kernel options have to be
set? Also, are there any compilation options that need to be set?

------
hnaparst
You need to set CONFIG_PAGE_TABLE_ISOLATION=y in 4.14.11 or greater if you
have an Intel CPU.
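
To check whether a given kernel build has that option set, one can scan its config file for the line. A minimal sketch: the `config_enabled` helper and the sample text are made up here, and where the real config lives (for example a `config-*` file under `/boot`, or `/proc/config.gz`) depends on the distribution.

```python
def config_enabled(config_text: str, option: str) -> bool:
    """True if the kernel config text contains '<option>=y' on its own line."""
    wanted = f"{option}=y"
    return any(line.strip() == wanted for line in config_text.splitlines())


# Made-up sample, shaped like a kernel .config file.
sample = (
    "CONFIG_X86_64=y\n"
    "# CONFIG_EXAMPLE_OPTION is not set\n"
    "CONFIG_PAGE_TABLE_ISOLATION=y\n"
)

print(config_enabled(sample, "CONFIG_PAGE_TABLE_ISOLATION"))  # True
print(config_enabled(sample, "CONFIG_EXAMPLE_OPTION"))        # False
```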

------
tambre
Non-AMP link: [http://pythonsweetness.tumblr.com/post/169166980422/the-myst...](http://pythonsweetness.tumblr.com/post/169166980422/the-mysterious-case-of-the-linux-page-table)

~~~
ce4
Alternatively, use Firefox (on your mobile) to skip Amp and other sillinesses.
Greatly improved my mobile browsing experience, haven't looked back (ublock
origin, hint hint).

~~~
eat_veggies
Firefox is pretty slow on my phone and sometimes when I try to search
something, nothing happens. I want to like it, but chrome is just a lot
smoother :'(

I do love how Google tries to "downgrade" its experience on Firefox mobile,
but all it really does is cut out all the javascript and material design
bullshit.

~~~
vesinisa
When is the last time you tried Firefox on mobile? Since quantum landed on the
Android version, it's become really smooth - comparable to Chrome on my phone
(Pixel) at least.

------
dingo_bat
Even though I understood less than 50% of that I am still very excited about
reading more about whatever the real issue is. If somebody can pwn aws from a
random instance that would be highly amusing to me :D

------
tbrownaw
"Hey, I think I noticed a horrible horrible embargoed security bug. I know, I
should do my best to poke holes in the embargo early!"

~~~
cjbprime
They weren't really trying to uncover the exploit such that they can reproduce
it. They were trying to learn who the exploit affects and what the impact is.
I don't think there's anything wrong with that. If you're an AWS customer who
depends on hypervisor isolation for critical security guarantees, it helps you
to know that this is threatened and perhaps exploitable.

Please don't buy into the idea that embargoes and coordinated disclosure are
sacred. They tend to just reinforce existing power structures, sometimes in an
unethical (or at least unfair) way.

~~~
madez
The CCC also stated that they observed companies taking a more reactive
rather than proactive stance regarding their IT security because they believe
that they will be notified of vulnerabilities prior to public disclosure or
attacks. This may justify not following embargoes and coordinated disclosure.

~~~
tbrownaw
Do you have a link?

I'd expect the incentives to be a bit more complicated than that, and I'm also
a bit skeptical that _either_ is all that good of a solution. I'd also like to
see how exactly "proactive" and "reactive" are being used here, is it about
push vs pull for vulnerability notifications, or about hiring their own
security researchers, or... ?

------
revelation
No one's hiding anything; this patchset was developed in the open for many, many
months. The hysteria and intrigue in this random tumblr blog is completely
superfluous. It's a hardware bug anyway.

Here is a good hint to when something is not being embargoed: there is a paper
and a public demonstration.

~~~
saas_co_de
It sounds like Intel, Google and Amazon are hiding something. Wouldn't want
customers thinking that cloud computing is fundamentally insecure now would
we?

~~~
revelation
Well Intel for one manufactures the insecure CPUs..

This will be merged for 4.16, when there is no 4.15 release yet. No idea what
your cloud computing companies run but it's not 4.15-dirty, and backporting
this monster is a great recipe for a nightly emergency when it goes OOPS.

edit: it isn't even merged yet.

~~~
justincormack
It is being backported to 4.14, and presumably earlier kernels too.

~~~
revelation
Sure, with the appropriate baking time. But I don't see a cloud company taking
an intermediate version of this patchset, backporting it themselves and then
sending it out to all their customers in a hurry.

~~~
tonfa
Why not? Isn't that what they should do in case of a kernel security issue?

------
lordmajster
Isn't this related to the recent Linus rant about a security patch from the
Google Pixel team?
[https://www.theregister.co.uk/2017/11/20/security_people_are...](https://www.theregister.co.uk/2017/11/20/security_people_are_morons_says_linus_torvalds/)

~~~
caf
No.

------
acoye
Ok so all I hear is Intel and no trace of AMD? So X86_64 ISA wise the only
diff between modern CPUs is AVX512.

I bet you within the foundations of AVX512 lies a nasty one that can't be
patched with microcode update.

~~~
sd8dgf8ds8g8dsg
I'm confused, what makes you think the AMD and Intel cpu internals are the
same?

~~~
acoye
It is not, but if like it was the case for rowhammer, it was linked to a
specific instruction (clflush).

This time it could be an AVX512 instruction (Intel only) that leaks kernel
addresses in one way or another.

I was talking from an ISA perspective. For example, clflush may be implemented
differently on Intel and AMD, but it has the same effect on system RAM, hence
a shared exploit.

~~~
cesarb
> It is not, but if like it was the case for rowhammer, it was linked to a
> specific instruction (clflush).

No, rowhammer does not need clflush. All rowhammer needs is to be able to
write to the same physical memory locations repeatedly. Normally the cache
would get in the way, so the attacker needs to bypass it. Flushing the cache
(clflush) is one way, but there are others; AFAIK, rowhammer has been
demonstrated from within a JavaScript VM, which has no access to clflush.

~~~
acoye
Yes, I knew you could do rowhammer on ARM too, where clflush does not exist. So
rowhammer is not the correct example. Yet at some point it was believed that
it was necessary on x86 for the attack to work.
[https://en.wikipedia.org/wiki/Row_hammer#Exploits](https://en.wikipedia.org/wiki/Row_hammer#Exploits)

I just made a bet that I could guess something from the ISA alone, going macro
to describe what the issue may be. I'm just doing guesswork here.

I was not implying that Intel has the same implementation as AMD, nor was I
making a case for "this is like rowhammer".

