
Multiple OS Vendors Release Security Patches After Misinterpreting Intel Docs - ingve
https://www.bleepingcomputer.com/news/security/multiple-os-vendors-release-security-patches-after-misinterpreting-intel-docs/
======
LeifCarrotson
At first I thought this was an error in implementing new docs related to
Meltdown/Spectre. But in the original research paper:

[https://everdox.net/popss.pdf](https://everdox.net/popss.pdf)

the researcher wrote:

> _Somewhere around the release of the 8086, Intel decided to add a special
> caveat to instructions loading the SS register...where loading SS with [`pop
> ss` or `mov ss`] would force the processor to disable external interrupts,
> NMIs, and pending debug exceptions._

So it's a really, really old piece of documentation, dating from around 1980.

To call it a 'misinterpretation' rather than a vulnerability is extremely
generous, given that most Intel engineers spent entire careers in the presence
of code vulnerable to this 'misinterpretation' without calling the OS vendors
out on their error.

~~~
ianopolous
On this particular thing I think the Intel docs are clear. This is where I
implemented the same in JPC:

[https://github.com/ianopolous/JPC/blob/master/src/org/jpc/em...](https://github.com/ianopolous/JPC/blob/master/src/org/jpc/emulator/execution/decoder/Disassembler.java#L351)

~~~
caf
The interrupt shadow itself is clear.

The specific implication of that for a mov ss ; syscall pair with a hardware
breakpoint set on the first instruction is a lot more subtle.

~~~
ianopolous
Agreed, but that's just saying that combining two or more simple things
results in a more complex thing. All platforms I know of describe each
individual instruction and its consequences, and leave you to deduce the
consequences of combining them.

------
hyperman1
I'm trying to understand this one, even if most of my ASM knowledge is from
the 8086 era. My guess:

* When an interrupt, debug exception, ... occurs, the CPU pushes stuff on the stack as part of the task switch.

* The stack is managed by 2 registers: SS and (e/r)SP. To change your stack, you have to change both registers. If an interrupt happens and you've changed only 1, stuff gets pushed on an invalid stack and you're toast.

* To fix this, the CPU has a wild card: When you change SS, you get exactly 1 instruction that will not be interrupted. The idea is you use that instruction to change (e/r)SP and make the stack valid again. If there is a need for an interrupt, it will be delayed for 1 instruction.

* Now for the security problem: what happens if you use this second instruction to switch to kernel mode? It turns out the delayed interrupt fires before the first kernel-mode instruction, but already in the kernel.

* And you can trigger the right kind of interrupt with debug exceptions and single stepping.

* And if you do this, the kernel tells the debugger not about the debugged program but about the kernel. Oops.

So to fix this, I suppose the kernel checks the debug exception info from the
CPU, and if it is debugging the kernel it fixes things up so you go back 1
instruction.
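
A toy simulation may make the timing in the list above easier to follow. This is only a sketch under simplifying assumptions, not real x86 semantics: the `run` function, the mnemonic strings, and the rule that a shadowed SS load does not extend the shadow are all made up for illustration.

```python
SS_LOADS = {"mov ss", "pop ss"}          # instructions that start the shadow
KERNEL_ENTRIES = {"syscall", "int n"}    # instructions that switch to CPL 0


def run(program):
    """Toy model, not real x86 semantics.

    program is a list of (mnemonic, hits_data_breakpoint) pairs.
    Returns a list of (mnemonic_the_db_fired_after, cpl_at_delivery).
    """
    cpl = 3            # start in user mode
    shadow = False     # True while event delivery is inhibited
    pending = False    # a #DB that has been raised but not delivered yet
    delivered = []
    for insn, hits_bp in program:
        was_shadowed = shadow
        # The shadow lasts for one instruction; an SS load that is itself
        # shadowed does not extend it (an assumption of this model).
        shadow = insn in SS_LOADS and not was_shadowed
        if insn in KERNEL_ENTRIES:
            cpl = 0
        if hits_bp:
            pending = True
        # At the instruction boundary, deliver unless the shadow is active.
        if pending and not shadow:
            delivered.append((insn, cpl))
            pending = False
    return delivered


# Intended use: the #DB waits until SP has been loaded, still in user mode.
print(run([("mov ss", True), ("mov sp", False)]))
# Problem case: the #DB waits until after the kernel entry.
print(run([("mov ss", True), ("syscall", False)]))
```

In this model the first program reports the #DB at CPL 3, after `mov sp`, while the second reports it at CPL 0, after `syscall`, which is the surprising case.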

~~~
codedokode
> To fix this, the CPU has a wild card: When you change SS, you get exactly 1
> instruction that will not be interrupted. The idea is you use that
> instruction to change (e/r)SP and make the stack valid again. If there is a
> need for an interrupt, it will be delayed for 1 instruction.

I wonder why they couldn't make a single instruction to change both SS and
SP?

~~~
Someone
They could, but attackers would still use the approach that works for them.

And they can’t really remove the old instructions because of backwards
compatibility.

Whitelisting a limited set of instructions that can follow setting SS and
making all others trap might be an option, though. It still would break
backwards compatibility, but if the effective impact would be negligible, they
could deem it acceptable.

~~~
Someone
Follow-up: not only could they, they did.
[https://software.intel.com/sites/default/files/managed/7c/f1...](https://software.intel.com/sites/default/files/managed/7c/f1/253667-sdm-vol-2b.pdf), page 4-385:

“Loading the SS register with a POP instruction suppresses or inhibits some
debug exceptions and inhibits interrupts on the following instruction
boundary. (The inhibition ends after delivery of an exception or the execution
of the next instruction.) This behavior allows a stack pointer to be loaded
into the ESP register with the next instruction (POP ESP) before an event can
be delivered. See Section 6.8.3, “Masking Exceptions and Interrupts When
Switching Stacks,” in Intel® 64 and IA-32 Architectures Software Developer’s
Manual, Volume 3A. _Intel recommends that software use the LSS instruction to
load the SS register and ESP together_.”

------
Tobba_
As far as I understand, what's happening is:

* There's an old feature which causes POP SS/MOV SS instructions to delay all interrupts until the next instruction has executed, to safely allow changing both SS and SP without an interrupt firing in between on a bad stack.

* If such an instruction itself causes an interrupt (by triggering a memory breakpoint through the debug registers), it is delayed (as intended).

* The delayed interrupt will fire after the second instruction _even if the second instruction disabled interrupts_.

* By means of the above, a MOV SS instruction triggering a #DB followed by an INT n instruction will cause the #DB exception to fire before the first instruction of the interrupt handler, even though this should be impossible (as entering the handlers sets IF=0, disabling interrupts).

* The OS #DB handler assumes GS has already been fixed up by the previous interrupt handler, but GS is in fact still under user control.
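
The last point can be sketched as a small model. Everything here is hypothetical: `naive_db_handler`, the two GS labels, and the rule "swap GS only if the interrupted code was at CPL 3" are assumptions chosen to illustrate the failure mode, not any particular kernel's code.

```python
USER_GS = "user-controlled GS base"
KERNEL_GS = "kernel per-CPU GS base"


def swapgs(gs_base):
    """Model of swapgs: exchange the active GS base with the other one."""
    return KERNEL_GS if gs_base == USER_GS else USER_GS


def naive_db_handler(saved_cpl, gs_base):
    """Hypothetical #DB handler; returns the GS base it ends up using.

    It swaps GS only when the interrupted code ran at CPL 3, and otherwise
    assumes an earlier kernel entry already did the swap.
    """
    if saved_cpl == 3:
        return swapgs(gs_base)
    return gs_base


scenarios = {
    "#DB in user code": (3, USER_GS),
    "#DB in kernel code after entry finished": (0, KERNEL_GS),
    "delayed #DB on first instruction of the entry path": (0, USER_GS),
}
for name, (saved_cpl, gs_base) in scenarios.items():
    print(f"{name}: handler runs with {naive_db_handler(saved_cpl, gs_base)}")
```

Only the third scenario leaves the handler running on the user-controlled GS base: the saved privilege level already says "kernel", but the entry path has not yet had a chance to swap GS.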

------
ysleepy
The x86 ISA and its implementations are now in the spotlight of the whole
security research community. There is probably a lot more to come since it
accumulated a lot of cruft in the name of backwards compatibility.

I hope we learn a lot, and take the time to record the experience, for coming
platforms like RISC-V and others.

Why is there no big CAVEATs document from Intel detailing weird quirks? I
strongly assume the Intel arch engineers are well aware of many of those
counter-intuitive behaviours in their products.

~~~
adrianratnapala
> I strongly assume the Intel arch engineers are well aware of many of those
> counter-intuitive behaviours in their products.

But it is likely just kind of distributed, organic knowledge that is hard to
condense into a single document. Writing and maintaining such a thing would be
a significant project, and (I am speculating here) not the kind of thing that
significantly burnishes anyone's performance review.

That said, the whole community of assembly-hackers has even broader knowledge
of the topic, and could start such a document out in the open. And Intel
engineers might well contribute their own two cents. (Unless lawyers forbid
it).

~~~
ysleepy
I stumbled on a blogpost by bunnie huang which describes a liability angle,
which I found to be plausible:
[https://www.bunniestudios.com/blog/?p=5127](https://www.bunniestudios.com/blog/?p=5127)
\- worth a read.

------
erric
Wow, the article shows that many vendors mis-read the docs: Apple, Microsoft,
FreeBSD, Red Hat, Ubuntu, SUSE Linux, and other Linux distros...as well as
VMware and Xen.

This is going to be a busy day!

~~~
hannob
> Apple, Microsoft, FreeBSD, Red Hat, Ubuntu, SUSE Linux, and other Linux
> distros.

Just to clarify, this is kernel code. Listing 3 different (+ "other") Linux
distros as affected is kinda bogus, it's not that they all made the same
mistake, they just all use the same kernel.

~~~
confounded
Many of them use different versions of the same underlying Linux kernel,
sometimes put together in different ways.

~~~
tedunangst
It seems improbable that there are multiple ways of putting together the
kernel's handling of mov ss.

~~~
acdha
As an example, Red Hat doesn’t ship major kernel upgrades except with major
releases. If you’re running RHEL 6, you’re still on a 2.6 kernel and the fact
that someone patched 4.x probably doesn’t help you all that much unless you
have the time to backport the change and confirm that it doesn’t break
something else.

------
feikname
"Both Peterson and the CERT/CC team blamed the 'unclear and perhaps even
incomplete documentation'"

Yet the article's title makes it seem like it was the OS implementors' fault
instead of Intel's.

I wonder for what reasons the website tries to shift/soften Intel's fault on
this?

Seems like there's a trend in news articles to have incoherent titles in
relation to content lately, it's really annoying...

------
hyperman1
I wonder what happens if you execute multiple POP SS instructions. In fact,
you could set up a 64K v86 mode segment containing only copies of the POP SS
instruction. jmp far into it. When IP reaches the last instruction it wraps
around and starts again. Will it ever be interrupted by anything? If the stack
usage bothers it, just do MOV SS,AX

~~~
dooglius
Tested it: you get an interrupt after only skipping one instruction
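
That result can be mimicked with a toy model. As far as I recall, Intel's manual only guarantees the inhibition for the first of several consecutive SS loads; the sketch below simply assumes that an SS load executed inside the shadow does not start a new one. `delivery_index` and the mnemonic strings are invented for illustration.

```python
SS_LOADS = {"pop ss", "mov ss"}


def delivery_index(program, interrupt_arrives_at=0):
    """Index of the instruction after which a pending interrupt is taken.

    Toy model: an SS load opens a one-instruction shadow, but an SS load
    that is itself shadowed does not open another one (assumption).
    Returns None if the program ends before the interrupt can be taken.
    """
    shadow = False
    for i, insn in enumerate(program):
        was_shadowed = shadow
        shadow = insn in SS_LOADS and not was_shadowed
        if i >= interrupt_arrives_at and not shadow:
            return i
    return None


# A 64K segment filled with the one-byte POP SS instruction.
segment = ["pop ss"] * 65536
print(delivery_index(segment))
```

Under that assumption the interrupt is taken after the second `pop ss`, so only one instruction boundary is skipped no matter how long the run is.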

~~~
hyperman1
Thanks, both dooglius and Someone. One learns something every day.

------
jgtrosh
> Fixing the bug and having synchronized patches out by yesterday was an
> industry-wide effort, one that deserves praises, compared to the jumbled
> Meltdown and Spectre patching process.

Is this a fair comparison? I feel like the patching techniques must have been
easier to develop than for Meltdown/Spectre. Furthermore, if this affected the
same kind of people in this community, maybe this time around they benefitted
from the communication channels of the previous exercises.

Maybe this isn't a comparison to try and badmouth the previous iteration, and
instead just tried to show a general improvement in the industry—I just find
it a bit unfair.

~~~
gregkh
This is one of the first times that I know of that the Linux kernel and
Windows kernel developers discussed a security issue together directly. So
while the fix was much simpler than Meltdown/Spectre was (Linux was fixed with
a patch that was written in 2015) overall, the communication between different
OS kernel developers right now is very good.

And yes, it is all due to the horrible Meltdown/Spectre problem and how that
was handled. We were not allowed to work together for that problem, and we do
not want that to happen again.

~~~
pritambaral
> We were not allowed to work together for that problem

How do you mean?

~~~
SmellyGeekBoy
It was covered by NDA.

------
ams6110
Interesting that some of the BSDs (Dragonfly, FreeBSD) are listed as Affected
and others (NetBSD and OpenBSD) are listed as Not Affected.

~~~
barkingcat
Dragonfly descends from FreeBSD so it makes sense that it's affected.

The other BSDs have different kernels and vastly different development
histories as well.

~~~
amluto
Some BSDs never allowed debug register writes in the first place, so they were
immune.

------
lallysingh
It's been so long I had to look it up: SS is the stack segment.

~~~
0x0
And furthermore, the current stack address is determined by the combination of
two registers: ss for the segment and rsp/esp/sp for the stack pointer within
the segment. I guess the strange behavior around modifying ss comes from the
fact that you need to also modify sp immediately afterwards, because otherwise
you are running with a wild stack address pointing to random memory. You also
can't modify sp before modifying ss because then you are also running with a
wild stack, and an interrupt could come in at any time and push things onto
random memory.
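
To put numbers on that, here is a small sketch using 8086 real-mode addressing, where the address is the segment shifted left by four bits plus the 16-bit offset. The concrete stack values are made up, and protected and long mode resolve segments differently, so treat this as the 8086-era picture only.

```python
def linear(segment, offset):
    """8086 real-mode address: segment * 16 + offset, wrapped to 20 bits."""
    return ((segment << 4) + (offset & 0xFFFF)) & 0xFFFFF


old_ss, old_sp = 0x1000, 0xFFFE   # hypothetical current stack
new_ss, new_sp = 0x2000, 0x0100   # hypothetical stack we want to switch to

print(hex(linear(old_ss, old_sp)))  # valid: top of the old stack
print(hex(linear(new_ss, old_sp)))  # only SS changed: points into unrelated memory
print(hex(linear(old_ss, new_sp)))  # only SP changed: also points somewhere unintended
print(hex(linear(new_ss, new_sp)))  # valid: top of the new stack
```

An interrupt arriving in either intermediate state would push its return frame at an address that belongs to neither stack.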

------
_bxg1
I felt a great disturbance in the Force, as if thousands of voices suddenly
cried out, _"Oops..."_

------
dwighttk
The vulnerability notes[1] say Apple patched this on May 8, but my last
security update was May 3 and I don't currently show any available updates...
I wonder if the May 3 patch fixed this, or if my computer might not be
affected.

[1][https://www.kb.cert.org/vuls/byvendor?searchview&Query=FIELD...](https://www.kb.cert.org/vuls/byvendor?searchview&Query=FIELD+Reference=631579&SearchOrder=4)

~~~
amluto
The May 3 patch fixed it. The nature of the fix was such that Linux and Mac OS
were able to patch it early without revealing much.

In fact, the Linux fix was a patch I wrote in 2015 (for unrelated reasons) and
just never got around to upstreaming.

~~~
jey
Link to referenced patch:
[https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d8ba61ba58c88d5207c1ba2f7d9a2280e7d03be9)

------
twunde
I'm super happy to see that the BSDs were contacted (with the exception of
HardenedBSD).

~~~
sattoshi
HardenedBSD would just receive it downstream from OpenBSD, wouldn't it? It's
like contacting the Linux Mint group while contacting Ubuntu is sufficient.

~~~
twunde
HardenedBSD is a separate OS forked from FreeBSD with its own kernel
development. While the BSDs may share some code, they're essentially all
different OSes. There is no upstream like Linux has. As a side note, after
Spectre/Meltdown, Shawn Webb complained in a NYCBUG thread about not being
able to get access to these embargoed vulnerabilities.

------
ddtaylor
Is there a PoC exploit source?

~~~
amluto
[https://lkml.kernel.org/r/67e08b69817171da8026e0eb3af0214b06...](https://lkml.kernel.org/r/67e08b69817171da8026e0eb3af0214b06b4d74f.1525800455.git.luto@kernel.org/67e08b69817171da8026e0eb3af0214b06b4d74f.1525800455.git.luto@kernel.org)

------
emily-c
A bit tangentially related, but I've always wondered why the syscall
instruction doesn't use the TSS for stack switching like int does. I guess it
does give you more flexibility to load rsp from gs during a cpl 3 -> cpl 0
transition rather than consulting the TSS to switch it automatically. Can
anyone weigh in on this?

------
tmd83
Can someone explain the risk factor? It cannot be exploited remotely or through
a browser, if I'm reading it right. But a malicious program with user-level
access can get kernel access. So malware running on a limited account can
escalate to higher access?

------
ComodoHacker
Archived version: [http://archive.is/DxUwA](http://archive.is/DxUwA)

------
std_throwaway
Is this also affecting code in 64 bit mode?

~~~
dfox
As the code in the paper is written for amd64, I would assume that 64-bit mode
is affected.

On the other hand, it is somewhat surprising, because loading anything into SS
is a mostly pointless operation.

~~~
std_throwaway
This doesn't even make sense. Why would they keep a behavior that could be
considered a bug when creating a new instruction set?

------
floatboth
Was illumos not affected?

------
darkerside
I'm sure the initial reaction here is going to be lamentation about the state
of documentation. People will correctly point out that, if multiple entities
misread the documentation, it must have been unclear. And they are right. But
that doesn't make this Intel's fault alone. Clear or unclear, the
documentation described behavior that was understood at the Intel
organization, and the shipped product worked as described.

Where was the security testing at the OS level? Why can't there be automated
test suites that catch unauthorized access issues before ship (if not before
merge commit)? If your vendor delivers an insecure product and you don't
discover it, how much blame do you share?

~~~
sillysaurus3
_Why can't there be automated test suites that catch unauthorized access
issues before ship (if not before merge commit)?_

Usually the search space is too large.

~~~
rocqua
Isn't that what fuzzing is for?

~~~
hedora
Concolic testing would probably catch it, but only if the person that
implemented the hardware model for the theorem prover understood the Intel
documentation, which seems unlikely.

Basic fuzzing probably wouldn’t catch this; as the other comments point out,
the search space is probably too large, and the set of vulnerable executions
is probably too small for an undirected random search.

