

Nested Kernel Operating System Architecture - emaste
http://nestedkernel.org/

======
comex
Cute but extremely risky. The kernel has access to _all_ memory-mapped devices
and _all_ weird privileged mode instructions, and the nested kernel has to
ensure that the IOMMU is safe and that no instructions that can turn off
paging, modify the page tables, etc. appear at any offset within the code,
including unaligned ones. This also means that even if working around
coincidental unaligned instances of such instructions is easy in practice, as
stated in the paper, there's always a chance that some (code, compiler)
version combination will randomly fail to compile and (even if this is
hypothetically made to work automatically most of the time) possibly require
manual intervention...

So, have they missed any such instructions? To be fair, they are assuming the
original kernel modules are trustworthy but have been compromised at runtime,
so these would have to appear either for another reason or by chance in the
kernel, but ... off the top of my head:

\- Hardware breakpoints/watchpoints in the inner kernel or at the CR0 writing
instruction.

\- - Or hoping/somehow ensuring the CPU gets an IRQ or something right after
executing the PG0 write. Problematic, but not out of the question.

\- - This could be partially negated by ensuring that like the PG0 enabler,
the fault handlers are mapped at an address that corresponds to a physical
address containing a trap. But I don't think this is sufficient, 'cause... now
that paging is disabled and so you can write to code, you should be able to
set the stack pointer in the TSS to overwrite the handler!

\- Hardware VM support. I checked the Intel manual: the VMCS structure points
to some addresses that map guest physical addresses to real physical
addresses, so this really needs to be disabled. (Also, the VMCLEAR
instruction, which requires VMXON first, writes directly to physical
addresses.) There's also whatever AMD does.

And with a bit more manual reading - but I could be wrong on these:

\- Switching to 32-bit mode and confusing the inner kernel that way.

\- Switching to 32-bit mode and then using hardware task switching to load
CR3.

Those two only require the iret instruction!

Based on scanner-objdump.py, they seem to only be checking for movs to
cr{0,2,3} and wrmsr, so all of the above should work. I'm not an x86 expert,
so (a) some of the above may be wrong, but (b) there are probably more
potential avenues I don't know about. :)
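
To make the "any offset, including unaligned" concern concrete, here is a minimal sketch of a byte-level scan. It is not the project's scanner-objdump.py; the function name `find_privileged` and the sample bytes are made up for illustration, and it assumes only the two-byte opcodes for `mov` to a control register (`0F 22 /r`) and `wrmsr` (`0F 30`), ignoring prefixes such as REX.

```python
# Hypothetical byte-level scanner: flags privileged opcodes at every byte
# offset, not only at the instruction boundaries a disassembler reports.

MOV_TO_CR = b"\x0f\x22"   # MOV CRn, reg: 0F 22 /r (ModRM.reg selects the CR)
WRMSR = b"\x0f\x30"       # WRMSR: 0F 30
WATCHED_CRS = {0, 2, 3}   # the control registers named in the comment above


def find_privileged(code: bytes) -> list[tuple[int, str]]:
    """Return (offset, name) for each match found at any byte offset."""
    hits = []
    for off in range(len(code) - 1):
        pair = code[off:off + 2]
        if pair == WRMSR:
            hits.append((off, "wrmsr"))
        elif pair == MOV_TO_CR and off + 2 < len(code):
            cr = (code[off + 2] >> 3) & 0b111  # ModRM.reg field
            if cr in WATCHED_CRS:
                hits.append((off, f"mov cr{cr}"))
    return hits


# "mov eax, 0xc0220f90" is harmless when decoded from offset 0, but its
# immediate embeds 0F 22 C0 (mov cr0, rax) starting at offset 2.
sample = bytes.fromhex("b8900f22c0")
print(find_privileged(sample))
```

A scan like this only covers the opcodes it is told about, which is the point of the list above: anything not in the pattern set (VMX instructions, `iret`-based mode switches, debug registers) passes unnoticed.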

~~~
bcg1
I'm not being sarcastic; this is a serious question, because the specifics of
this are over my head.

Are you saying that the "nested kernel" architecture technically increases the
surface area for attack, or is the problem that it gives a false sense of
security compared to what it actually provides?

~~~
comex
The latter. There's no obvious reason it would decrease security, but if the
main kernel is compromised, the inner kernel's security model is really hard
to enforce correctly against an opponent which is also running in ring 0.

------
emaste
Also, their GitHub repo is here:
[https://github.com/nestedkernel/PerspicuOS](https://github.com/nestedkernel/PerspicuOS)

------
agumonkey
necessary cached version
[http://webcache.googleusercontent.com/search?q=cache:8JznPXU...](http://webcache.googleusercontent.com/search?q=cache:8JznPXUhTZgJ:nestedkernel.org/&hl=en&strip=1)

------
Animats
See Multics "rings of protection", circa 1969.

~~~
zvrba
You sound like a jealous and/or competing university professor. I myself have
received a number of paper reviews in a tone like yours.

Snarky, superficial dismissals from "I think I know everything" maleficent
people like you are a real problem in any community.

So I'll take the liberty of being condescending myself and explain the
contribution of this paper.

Multics had hardware facilities to help in implementing multi-ring privileges.
In x86-64 there are 4 rings, but segment-level protection is largely gone,
thus effectively collapsing 4 ring levels into just 2: supervisor and user.

Retrofitting nested kernels in a secure way [0] on top of HW where each
process's address space is flat and there are only page-based protections with
just 2 levels is the real research contribution.

[0] They use code scanning (disassembly) to prevent loading of subversive
modules. Given the complexity of x86-64 instruction encodings (for one,
they're not unique), I have some doubts about the robustness of this approach.
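
As a small illustration of the non-uniqueness point, the sketch below decodes two different byte sequences that both mean `mov eax, ebx`. The helper `decode_mov_rr` is hypothetical and handles only the register-to-register forms of opcodes `89` and `8B`; it is not a general x86 decoder.

```python
# Two encodings, one instruction: opcode 89 (MOV r/m32, r32) and opcode 8B
# (MOV r32, r/m32) can both express a register-to-register move.

REGS = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]


def decode_mov_rr(code: bytes) -> str:
    """Decode a 2-byte register-to-register MOV (illustrative only)."""
    opcode, modrm = code[0], code[1]
    if modrm >> 6 != 0b11:
        raise ValueError("only register-direct ModRM is handled")
    reg = (modrm >> 3) & 0b111
    rm = modrm & 0b111
    if opcode == 0x89:      # destination is the r/m field
        dst, src = rm, reg
    elif opcode == 0x8B:    # destination is the reg field
        dst, src = reg, rm
    else:
        raise ValueError("not a MOV handled by this sketch")
    return f"mov {REGS[dst]}, {REGS[src]}"


print(decode_mov_rr(bytes.fromhex("89d8")))
print(decode_mov_rr(bytes.fromhex("8bc3")))
```

A scanner that matches on one byte pattern per instruction therefore has to enumerate every equivalent encoding, or work on decoded instructions instead of raw bytes.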

If you want to blame anybody for the current situation, blame Intel and their
CPU design, don't throw stones at people who work within constraints of the
available HW.

Maybe you're tempted to denigrate this as "engineering", but the paper solves
a real problem in a novel, practical and rather performant way. It qualifies
as research.

~~~
Animats
The frustrating thing about rings of protection is that there have been
machines which had the right hardware, but no OS used it. DEC's VAX line had
all the hardware for that. Nobody used it. IA-32 has lots of machinery such as
call gates and segment-level memory protection for fine-grained control over
memory access. Nobody used that. AMD left all that stuff out of AMD-64 because
nobody used it in IA-32. (I once asked the designer of AMD-64 about that, when
he spoke at Stanford.) C and UNIX/Linux want a big flat address space and a
vanilla CPU.

Protecting the OS's code and the MMU's state is nice, but it prevents only a
few classes of attacks. The "untrusted" kernel still has read and write access
to most of the kernel's data structures. Attacks can still mess with
networking, files, login, etc.

One can go further with the protection hardware on AMD-64. See the
KeyKOS->EROS->Coyotos development [1], which continued until 2008. That
project appears to be dead. The original KeyKOS system was quite successful,
with machines running for decades. Good concept, killed by partnering with the
wrong hardware vendors. (Omron? Anybody remember Omron? No?)

[1] [http://coyotos.org/](http://coyotos.org/)

~~~
zvrba
> The frustrating thing about rings of protection is that there have been
> machines which had the right hardware, but no OS used it. [...] C and
> UNIX/Linux want a big flat address space and a vanilla CPU.

C per se does not want a flat address space: it is, for example, UB to
subtract pointers that do not point within the same object. IOW, "far
pointers" from the DOS era and segmented 286-style pointers would absolutely
be within the bounds of the C standard, if they hadn't used special syntax.
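
A rough model of why that rule leaves room for segmented pointers, sketched in Python rather than C. The `FarPointer` class is hypothetical: it treats each object as living in its own segment and makes the cross-object difference an explicit error, where real C simply leaves it undefined.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FarPointer:
    """Illustrative (segment, offset) pointer; one segment per object."""
    segment: int
    offset: int

    def __sub__(self, other: "FarPointer") -> int:
        # Within one object only the offsets matter, so the result is well
        # defined. Across objects there is no meaningful distance; C calls
        # this undefined behavior, and this model raises instead.
        if self.segment != other.segment:
            raise ValueError("pointers into different objects")
        return self.offset - other.offset


a = FarPointer(segment=0x10, offset=12)
b = FarPointer(segment=0x10, offset=4)
print(a - b)
```

Because the standard never promises a result for the cross-object case, an implementation is free to compare offsets only and ignore the segment part.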

With 386 segment-level protection you could define one segment per object,
with byte-level granularity limit checks for segments smaller than 1MB. Each
string could have had its own segment with precise length. No more buffer
overruns.
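
The one-segment-per-object idea can be sketched as follows. This is a toy model under stated assumptions, not real descriptor handling: `Segment` and `SegmentLimitFault` are invented names, and `limit` is taken to be the highest valid offset, as in a byte-granular expand-up data segment.

```python
from dataclasses import dataclass


class SegmentLimitFault(Exception):
    """Stands in for the fault the CPU would raise on a limit violation."""


@dataclass(frozen=True)
class Segment:
    base: int    # linear address where the object starts
    limit: int   # highest valid offset within the object

    def translate(self, offset: int, size: int = 1) -> int:
        # Every access is checked against the limit before the linear
        # address is formed.
        if offset < 0 or offset + size - 1 > self.limit:
            raise SegmentLimitFault(f"offset {offset} exceeds limit {self.limit}")
        return self.base + offset


# A 16-byte string buffer gets its own segment with limit 15.
buf = Segment(base=0x5000, limit=15)
print(hex(buf.translate(15)))   # last valid byte
try:
    buf.translate(16)           # one past the end: the overrun is caught
except SegmentLimitFault as exc:
    print("fault:", exc)
```

The check happens on every access, so an overrun faults at the first out-of-bounds byte instead of silently corrupting a neighboring object.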

I think we ended where we are now because of several factors:

1\. early hardware -- 286 -- was too limiting. Each segment could be at most
64k long, and

2\. programmers (naturally) needed arrays larger than 64k even back then,

3\. OSes, UNIX and Win32, but not VMS, targeted the lowest common HW
denominator, which is a flat address space with only U/S page protections,

4\. rather bad tooling.

The situation is changing slowly, though. Intel's MPX extension offers much of
the old segment-level protection, but is opt-in for software and needs tooling
support (compiler, linker, loader, etc.). This is being worked on, e.g., in
gcc:
[http://gcc.gnu.org/wiki/Intel%20MPX%20support%20in%20the%20G...](http://gcc.gnu.org/wiki/Intel%20MPX%20support%20in%20the%20GCC%20compiler)

