
Evolution of the x86 context switch in Linux (2018) - chowyuncat
http://www.maizure.org/projects/evolution_x86_context_switch_linux/
======
bogomipz
Can someone say why using the TSS is still mandatory with software-based task
switching? Is this a requirement imposed by x86?

In looking at the OS dev wiki I see the following:

>"The TSS is primarily suited for hardware multitasking, where each individual
process has its own TSS. In Software multitasking, one or two TSS's are also
generally used, as they allow for entering Ring 0 code after an interrupt."

Would you not be able to enter Ring 0 after an interrupt without a TSS entry? Is
this why it is still required?

~~~
monocasa
The interrupt stack pointer comes from the TSS. Without that you're still
running on the untrusted user stack with no way of bootstrapping a kernel
context without corrupting the user state.
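
To make that concrete, here is a rough sketch of the 32-bit TSS layout as the Intel manuals describe it, plus a helper that a software-switching kernel might call on every context switch. The names `tss32` and `tss_set_kernel_stack` are made up for illustration and are not what Linux uses.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Sketch of the 32-bit TSS (104 bytes). Field names are illustrative.
 * Selector fields use only the low 16 bits of their 32-bit slot.
 */
struct tss32 {
    uint32_t prev_task_link;
    uint32_t esp0;          /* stack pointer the CPU loads on a ring 3 -> ring 0 entry */
    uint32_t ss0;           /* stack segment the CPU loads on a ring 3 -> ring 0 entry */
    uint32_t esp1, ss1;
    uint32_t esp2, ss2;
    uint32_t cr3;
    uint32_t eip, eflags;
    uint32_t eax, ecx, edx, ebx, esp, ebp, esi, edi;
    uint32_t es, cs, ss, ds, fs, gs;
    uint32_t ldt_selector;
    uint16_t trap;          /* bit 0 is the debug trap flag */
    uint16_t iomap_base;
};

/*
 * Hypothetical helper: with software task switching only these two
 * fields need updating so the next interrupt lands on the new task's
 * kernel stack.
 */
static void tss_set_kernel_stack(struct tss32 *tss,
                                 uint32_t kernel_esp, uint16_t kernel_ss)
{
    tss->esp0 = kernel_esp;
    tss->ss0 = kernel_ss;
}
```

The rest of the structure only matters for hardware task switching, which is why a single TSS per CPU is enough once the kernel switches tasks in software.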

~~~
bogomipz
Oh right, without a mapping it's a "chicken and egg" situation. Cheers.

------
dekhn
I recall reading a paper comparing Linux and Solaris context switch times in
~98 and Linux was 10-100X faster. Solaris did something incredibly slow and
safe.

~~~
shereadsthenews
Real context switches on Solaris were very slow which is why they had LWP. But
Linux process context switches were also faster than Sun's LWP switches.

------
filereaper
Enjoyed this article, anybody know the significance of adding the do..while(0)
loop within the macro starting Linux 1.3?

Was curious if it guards against some C pre-processor issues.

    
    
      /** include/asm-i386/system.h */
      #define switch_to(tsk) do { \
        [...] \
      } while (0)

~~~
akuma73
Good answer here:
https://stackoverflow.com/questions/257418/do-while-0-what-is-it-good-for

~~~
berti
Just to add more context, this is a very common cpp (c pre-processor) idiom.
You'll find it in most non-trivial C projects somewhere.
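
A minimal sketch of what the idiom buys you, using a made-up macro (`BUMP_TWICE` and `bump_if` are hypothetical, not from the kernel): wrapping the body in do { } while (0) makes a multi-statement macro behave like a single statement, so it can sit in an unbraced if/else and still take a trailing semicolon.

```c
/* Hypothetical two-statement macro wrapped in the do/while(0) idiom. */
#define BUMP_TWICE(x) do { (x)++; (x)++; } while (0)

/*
 * If the macro were written with plain braces instead:
 *     #define BUMP_TWICE(x) { (x)++; (x)++; }
 * then "BUMP_TWICE(value);" would expand to "{ ... };" and the stray
 * ';' would end the if statement, leaving the else without a
 * matching if (a compile error).
 */
static int bump_if(int enabled, int value)
{
    if (enabled)
        BUMP_TWICE(value);  /* expands to exactly one statement */
    else
        value = -1;
    return value;
}
```

Calling bump_if(1, 5) takes the macro branch and bump_if(0, 5) takes the else branch, and both compile without adding braces at the call site.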

~~~
pantalaimon
Abbreviating the C Preprocessor as cpp is very confusing imho.

~~~
monocasa
It's a common abbreviation older than C++. You even used to be able to run the
C preprocessor on arbitrary non-C files by using the cpp command.

~~~
berti
You still can.

------
souprock
An addition to "Linux 2.2 (1999)" is: introduced the Meltdown vulnerability.
That was the then-unknown cost of software context switching.

Later, with Red Hat's 4g4g kernels that Linus rejected, the problem would go
away for people who installed Red Hat's version of the OS on systems with many
gigabytes of memory.

~~~
bogomipz
Can you elaborate? How does relying on pure TSS for context switching prevent
meltdown?

What were the 4g4g kernels? Might you have any literature on those?

~~~
DSingularity
Separate address space for kernel and user. Hardware will use TSS to switch
address space as needed for syscalls.

~~~
bogomipz
Do you know the reason why Linus rejected this idea?

~~~
souprock
The main reason seemed to be the relatively bad performance of the hardware
task switch (loading segment registers and the page table base) that would be
required for any system call.

Of course, that turns out to be the fix for meltdown, unless you have the
process-context identifiers (PCID) available on Haswell chips and newer. The
meltdown fix for older CPUs, such as the Pentium III and Intel Core, is
roughly the same as the 4g4g kernel changes.

BTW, the 4g4g kernels were created for a different reason. The kernel needed
more virtual address space for itself, and thus couldn't share with user code.
This was for a time when people were trying to run 32-bit kernels on systems
with 32 gigabytes of RAM.
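
A back-of-the-envelope sketch of the address space pressure, under loudly stated assumptions: 4 KiB pages, a 1 GiB kernel window (the usual 3G/1G split), and 32 bytes of per-page bookkeeping. The 32-byte figure is an assumption for illustration and varied between kernel versions and configs. The function name is made up.

```c
#include <stdint.h>

#define EX_PAGE_SIZE        4096ULL         /* 4 KiB pages */
#define EX_KERNEL_WINDOW    (1ULL << 30)    /* 1 GiB kernel virtual space (3G/1G split) */
#define EX_PER_PAGE_BYTES   32ULL           /* ASSUMED bookkeeping cost per physical page */

/* Bytes of kernel virtual space used just to track every page of RAM. */
static uint64_t page_tracking_bytes(uint64_t ram_bytes)
{
    uint64_t pages = ram_bytes / EX_PAGE_SIZE;
    return pages * EX_PER_PAGE_BYTES;
}
```

Under those assumptions, 32 GiB of RAM is 8,388,608 pages and 256 MiB of bookkeeping, a quarter of the 1 GiB kernel window before the kernel has mapped anything else. Giving the kernel its own full 4 GiB address space removes that squeeze.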

~~~
bogomipz
>"Of course, that turns out to be the fix for meltdown, unless you have the
process-context identifiers (PCID) available on Haswell chips and newer."

Using TSS based switching is incompatible with PCIDs? Or is it incompatible
with separate address spaces for user space and kernel space?

PCIDs are process ID tags on cache lines, correct?

~~~
souprock
TSS switching is incompatible with running in 64-bit mode. It is very slow.
Doing most of the same actions (reload segment registers and the page table
base) in software is also very slow. Prior to the meltdown issue, Linux system
calls had avoided most reloads of the segment registers and page table base.

PCIDs are incompatible with older hardware. They are modestly slow. I think
the PCID state includes the TLB.

That pretty much means the kernel must support both methods. PCIDs are used
when possible. When the hardware doesn't support PCIDs, Linux must instead
reload segment registers and the page table base, either step-by-step in
software or via a TSS switch.

BTW, I had to implement x86 hardware task switching for an x86 emulator. The
complexity is insane. See my "Who is Hiring?" post if that sounds fun for you.

~~~
amluto
I think you may be a bit confused here. PCID is a feature that lets the kernel
avoid a TLB flush when CR3 is written. With or without PCID, CR3 gets written. The
segment registers have nothing whatsoever to do with this on a 64-bit system.
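
As a rough sketch of what that looks like at the register level: with CR4.PCIDE set, the low 12 bits of the value written to CR3 carry the PCID, and setting bit 63 asks the CPU not to invalidate TLB entries tagged with that PCID. The macro and function names below are made up for illustration, and the code only builds the value (actually writing CR3 needs ring 0).

```c
#include <stdbool.h>
#include <stdint.h>

#define EX_CR3_PCID_MASK  0xFFFULL       /* bits 11:0 hold the PCID when CR4.PCIDE = 1 */
#define EX_CR3_NOFLUSH    (1ULL << 63)   /* bit 63: keep TLB entries for this PCID */

/*
 * Hypothetical helper: compose the value to load into CR3 from the
 * page-aligned physical address of the top-level page table, a PCID,
 * and a flag saying whether cached translations may be kept.
 */
static uint64_t build_cr3(uint64_t pgd_phys, uint16_t pcid, bool noflush)
{
    uint64_t cr3 = (pgd_phys & ~EX_CR3_PCID_MASK) | (pcid & EX_CR3_PCID_MASK);

    if (noflush)
        cr3 |= EX_CR3_NOFLUSH;
    return cr3;
}
```

Either way the write to CR3 still happens; the PCID only changes whether the TLB contents survive it.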

~~~
souprock
The explanation is a bit awkward because I'm trying to describe numerous
situations. (pre-2.2 kernels, post-meltdown kernels, in-between kernels, 32-bit
builds, 64-bit builds, using PCID hardware, not using PCID hardware...) It
looks like 7 of the 12 possibilities are valid, more or less.

Some segment registers are reloaded on a 64-bit system. That includes CS, DS,
SS, and GS. The 32-bit systems must additionally reload ES. All segment
registers are loaded for Linux 2.0 and older, via hardware task switching.

CR3 does not get written for system calls when running on a normal Linux from
version 2.2 until the meltdown workaround hit.

------
amluto
Nice article!

If you want to have lots of fun, you could look at switch_mm() on a modern
kernel :)

------
Upvoter33
This is really well done, bravo.

------
glonq
Very thorough. Nice job.

------
MichaelMoser123
Does anyone know what Ingo Molnar is doing these days?

~~~
moosingin3space
A ton of stuff related to eBPF, last I saw.

------
heinrichhartman
I really like how good the article looks when printed. I enjoy reading long,
in-depth articles much more when I can read them in print. Unfortunately many
blog posts need a lot of tweaking until I can get an acceptable print result.
This one looks good enough without any trickery.

Thanks to the author, for caring about the paper people. : )

~~~
Groxx
Out of curiosity, since it changes how the page prints and I haven't
experimented much: have you tried printing while in the browser's "reading"
mode? Or does that tend to be worse?

~~~
ezconnect
Not OP, but if it looks good in "reading" mode it prints nicely. Plus you can
adjust font size and print width to suit your taste.

