
Writing an OS in Rust: Returning from Exceptions - phil-opp
http://os.phil-opp.com/returning-from-exceptions.html
======
adamnemecek
I think that now that we have a language suitable for writing kernels, we also
need to have a discussion about what we really want from a new OS. Over the
years there have been quite a few OSs, and some of them were more advanced
than the currently popular ones. E.g., AS/400 didn't distinguish between
memory and disk, which seems to me like a good idea. BeOS had a fully async
API, which resulted in a much better user experience and better CPU
utilization. Tandem had Erlang-like processes. None of the popular OSs have
these things. I'm not sure I want another UNIX implementation.

Furthermore, why are all the APIs so diverse? Why aren't there reactive
operating systems (as in, OSs with reactive APIs)? All of these ideas can be
explored in Rust, but on some level I'm not sure what the feature set of the
OS of the future should be.

The current driver models aren't that great either.

~~~
BinaryIdiot
> The current driver models aren't that great either.

Driver development is _soooo_ far out of my field, but it's always fascinated
me. I'm curious why hardware companies didn't go with something like an
interface for products instead of these highly complex drivers. Wouldn't it
be possible, for each type of device, for a group to come together and decide
on an interface, so that any time you plug a device in it just works? (And if
a specialized driver could do better, you could install that separately, but
the goal is making everything magically work, immediately.)

But since no one has done it, I feel like my idea is either horrible or
short-sighted. But I do want to look into how it all works one day.

~~~
wott
During the DOS days, standards, or more often de-facto standards, had
appeared. So chip makers or board makers would make their chip or board
compatible with another existing model, and you would just need one driver
per kind of device to get most of the expected functions. I am thinking of:

- sound cards being compatible with AdLib and/or SoundBlaster and/or
SoundBlaster Pro;

- network cards being compatible with NE2000;

- video cards being compatible with VESA standards.

Then Windows took over, buses changed, the shitfest started and never stopped.
And now the shitfest comes with NDAs too.

~~~
pcwalton
Well, we do have things like USB, Bluetooth, Intel HD Audio, ACPI, and so
forth, all of which are standards. VESA is still good enough to get an
unaccelerated framebuffer set up and going. Network adapters are usually at
least supported by Linux, so there's some open-source reference implementation
somewhere. (NDISwrapper is mostly a thing of the past, thank goodness...)

_Accelerated_ graphics is probably the biggest remaining disaster, honestly.
At least with Vulkan/DX12 we might be getting thinner drivers...

------
kibwen
The amount of effort that Phil puts into these posts is really fantastic. Not
only are they a great example of how to leverage Rust's abstractions to
provide some level of safety in an unsafe domain, but they're also (IMO)
approachable enough to appeal to people who have never worked at such a low
level before, growing the population of systems programmers in the process.
Keep it up! :)

~~~
phil-opp
Thank you so much!

------
amluto
One minor sort-of-error:

> The iretq instruction is the one and only way to return from exceptions and
> is specifically designed for this purpose.

Not quite true. STI; LRET works too, and it's faster for stupid reasons.

Also, the AMD architects blew it badly here. That quote from the manual:

> IRET “must be used to terminate the exception or interrupt handler
> associated with the exception”.

Indicates that the architects didn't think about how multitasking works.
Consider:

1. User process A goes to sleep using a system call (select, nanosleep,
whatever) that uses the SYSCALL instruction.

2. The kernel does a context switch to process B.

3. B's time slice runs out. The kernel finds out about this due to an
interrupt. The kernel switches back to process A.

4. The kernel returns to process A's user code using SYSRET.

This is an entirely ordinary sequence of events. But think about it from the
CPU's perspective: the CPU entered the kernel in step 3 via an interrupt and
returned in step 4 using SYSRET, which is not the same thing as IRETQ. Oh no!

It turns out that this actually causes a problem on AMD CPUs: SYSRET will
screw up the hidden part of the SS descriptor, causing bizarre crashes. Go
AMD.

Intel, fortunately, implemented SYSRET a bit differently and it works fine.
Linux has a specific workaround for this design failure -- search for
SYSRET_SS_ATTRS in the kernel source. I don't know how other kernels deal with
it.

Of course, Intel made other absurd errors in their IA-32e design, but that's
another story.

~~~
phil-opp
> STI; LRET works too, and it's faster for stupid reasons.

Interesting, didn't know that.

> This is an entirely ordinary sequence of events. [...] It turns out that
> this actually causes a problem on AMD CPUs

Sometimes I think that the hardware designers intentionally made kernel
development complicated :D. Thanks for the heads-up!

~~~
amluto
> > STI; LRET works too, and it's faster for stupid reasons.
>
> Interesting, didn't know that.

In case you're curious, here's an implementation for Linux:

[https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/...](https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/log/?h=x86/lret-to-userspace)

There are a couple of gotchas. RF and TF won't work right with the LRET hack.
You need to make sure not to clear IF until the STI, as otherwise you lose the
magic one-instruction no-interrupts window. And it's unclear in the spec
whether NMIs or MCEs honor that window, so if you want to be robust and your
kernel can recover from NMI or MCE, you should detect if this happens, rewind
one instruction, and clear IF again before returning.

Other than that, it appears to work perfectly. :)
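For the curious, the STI; LRET trick boils down to building, by hand, the frame that an inter-privilege far return pops. A sketch of the sequence (64-bit; the `user_*` locations are hypothetical placeholders for wherever the kernel saved the user state):

```asm
; Build the frame that an inter-privilege lretq pops, in reverse order:
push qword [user_ss]     ; SS   (popped last)
push qword [user_rsp]    ; RSP
push qword [user_cs]     ; CS
push qword [user_rip]    ; RIP  (popped first)
sti                      ; set IF; interrupts stay masked for exactly one
                         ; more instruction -- the magic no-interrupts window
lretq                    ; pop RIP, CS, RSP, SS; back to ring 3
; Unlike iretq, this never reloads RFLAGS, which is why RF and TF
; don't work right with this hack.
```

The STI placement is the whole trick: it must be the instruction immediately before `lretq`, so that no interrupt can arrive while the kernel is still on its own stack with IF set.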

------
haberman
A very interesting article. One thing that stood out to me:

> Unfortunately, Rust does not support [a save-all-registers calling
> convention]. It was proposed once, but did not get accepted for various
> reasons. The primary reason was that such calling conventions can be
> simulated by writing a naked wrapper function.

Followed by:

> However, auto-vectorization causes a problem for us: Most of the multimedia
> registers are caller-saved. [...] We don’t use any multimedia registers
> explicitly, but the Rust compiler might auto-vectorize our code (including
> the exception handlers).

This seems like a pretty convincing argument in favor of supporting this
calling convention explicitly: only Rust knows what registers it is actually
using. The current approach devolves into preserving every register that Rust
might possibly use.

AVX-512 has 2 KB of register state alone (32 ZMM registers of 64 bytes each)!
That's a lot of junk to save to the stack on the off-chance that Rust decides
to super-auto-vectorize something.
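For reference, the naked-wrapper workaround that the article points to looks roughly like this. This is a sketch, not the article's exact code: the register list is abbreviated (a real wrapper must also save the SSE/AVX state, which is exactly the cost being discussed), `handler` is a hypothetical ordinary Rust function, and naked-function syntax has changed across Rust versions.

```rust
#[unsafe(naked)]
extern "C" fn exception_wrapper() {
    core::arch::naked_asm!(
        // save the caller-saved integer registers
        "push rax",
        "push rcx",
        "push rdx",
        "push rsi",
        "push rdi",
        "push r8",
        "push r9",
        "push r10",
        "push r11",
        // ...a real wrapper also saves the multimedia registers here,
        // since the compiler may auto-vectorize the handler
        "call {handler}",
        "pop r11",
        "pop r10",
        "pop r9",
        "pop r8",
        "pop rdi",
        "pop rsi",
        "pop rdx",
        "pop rcx",
        "pop rax",
        "iretq",
        handler = sym handler,
    )
}

// An ordinary Rust function; the wrapper above shields it from the
// interrupt context, so it can use the normal calling convention.
extern "C" fn handler() { /* actual exception handling */ }
```

The wrapper has to assume the worst about every register the handler (or anything it calls) might touch, which is the "preserving every register that Rust might possibly use" point above.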

~~~
pcwalton
> That's a lot of junk to save to the stack on the off-chance that Rust
> decides to super-auto-vectorize something.

Note that LLVM loves to use the XMM registers to do memcpys. This is something
that kernels definitely do. So it's definitely a tradeoff.

~~~
haberman
Right but my point is that saving registers that are not actually used is
needless waste.

If a "save all registers" calling convention was natively supported by Rust,
you would only pay the cost for registers that are actually used.

~~~
Manishearth
Not exactly; you would still have to save all registers all the time.

For caller-saved/scratch registers in interrupt handlers, not only do you
have to avoid stomping on registers yourself, anything you call has to avoid
it too. You have three options here:

- Save all the registers in each interrupt.

- Only call "save all registers" functions from your handler, and enforce
this somehow. Since things like page-fault handlers can get pretty involved,
you probably don't want to do this.

- Just compile your kernel with most extra registers disabled.

There is a fourth option, which involves whole-program taint analysis or
something similar to track the registers stomped on by all transitive calls
from the exception handler. That requires special compiler support, though.

~~~
haberman
You certainly wouldn't have to save all registers in the case of a leaf
function.

And yes, optimizing this would require special compiler support. That is the
point!

The compiler is the only component that is in a position to possibly do
something smarter than spilling everything. Even if the compiler doesn't
actually do this, letting users say what they mean is better than making them
write something that will _definitely_ be sub-optimal. It at least leaves open
the possibility that the compiler _could_ do something smarter.

------
Animats
Hm. Is that code using the user's stack to handle an exception or interrupt?
That's unsafe. If there's not enough user stack space (something the user can
force), the kernel will get a double fault, usually a kernel panic condition.
Normally, OSs above the DOS level switch to a kernel stack on an exception or
interrupt.

There's hardware support to help with this; see "task state segment" (16- and
32-bit x86 only; amd64 is different).

~~~
amluto
No. In fact, that hardware support is mandatory. The SP0 field is used
unconditionally on a cross-privilege exception.

Sadly, AMD64 came up with a terrible design for SYSCALL, and an exception
right after SYSCALL will not automatically switch stacks. The result is a big
mess.
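For concreteness, the hardware mechanism in question is the 64-bit TSS. A minimal sketch of its layout (field names are my own; the layout follows the AMD64 manual):

```rust
// On a cross-privilege exception the CPU unconditionally loads RSP0
// from this structure; the IST slots provide known-good stacks for
// things like double faults and NMIs.
#[repr(C, packed)]
#[allow(dead_code)]
struct TaskStateSegment {
    reserved_1: u32,
    privilege_stack_table: [u64; 3], // RSP0..RSP2
    reserved_2: u64,
    interrupt_stack_table: [u64; 7], // IST1..IST7
    reserved_3: u64,
    reserved_4: u16,
    iomap_base: u16,
}

fn main() {
    // 4 + 3*8 + 8 + 7*8 + 8 + 2 + 2 = 104 bytes, matching the manual
    println!("TSS size: {} bytes", std::mem::size_of::<TaskStateSegment>());
}
```

A kernel fills in `privilege_stack_table[0]` (RSP0) with its per-CPU kernel stack, and typically points an IST entry at a dedicated stack for the double-fault handler so that even the SYSCALL gap can't leave it stackless.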

------
mtnygard
This series is wonderful. Somehow, Phil is able to introduce concepts from CPU
state through compiler wonkiness to kernel design -- all while showing how
Rust helps handle them -- and keep it all accessible.

One other thing the series does is show how much we dink around due to x86
backward compatibility:

- GDT still required by the CPU but not needed by programs.

- Ridiculous structure of IDT pointers due to multiple generations of
bit-width extension.

- Boot sequence.

Compared to the low-level setup for an ARM chip, it's night and day. ARM is
what Intel was in the days of the 8088: load your program at an address, CPU
jumps to that address, end of story!

