
It’s Time for a Modern Synthesis Kernel - ingve
https://blog.regehr.org/archives/1676
======
cogman10
Honestly, seems like a big waste of time to me. Generating the code wouldn't
be free, so you pay for that. IO, generally, is fairly expensive, so you are
now weighing the costs of IO against the savings in instructions/reduced
branches. So where are the savings? OS level mutexes?

Basically all of the modern CPUs are going to do a really good job at
optimizing away most of the checks cooked into the code. Modern CPUs are
really good at seeing "99% of the time, you take this branch" and "these 8
instructions can be run at the same time".

What they can't optimize away is accessing data off CPU. The speed of light is
a killer here. In the best case, you are talking about ~1000 cycles if you hit
main memory and way more if you talk to the disk (even an SSD). You can run
down a ton of branches and run a bunch of code in the period of time you sit
around waiting for data to arrive.

This wasn't true in 1992 when CPUs ran in the MHz range. But we are long past
that phase.

About the only place this would seem to be applicable is in embedded systems
looking to squeeze out performance while decreasing power usage. Otherwise,
it feels like a waste for anything with a Raspberry Pi CPU.

~~~
vidarh
CPU limitations relative to IO became a much bigger issue again with 10Gbps+
networking and NVMe. But even ignoring that, there is the entirely separate
issue of IPC and anything else that causes lots of traversals of the
userspace/kernel boundary at a high rate relative to the amount of time spent
on external IO, which can be absolutely brutal on throughput.

The biggest speedups I've made on systems I've worked on over the last 20
years have usually come down to spotting patterns people don't think about
that cause unnecessary context switches, because context switches are far
more expensive than people realize.

Now, doing that is not going to be fixed by "just" JIT'ing the write syscall,
but you have to start somewhere, and I think the more promising approach there
would be to look at what parts of the kernel code you could actually JIT into
user-space.

E.g. you obviously can't elide any security-related checks on userspace, but
consider when the kernel knows that you're writing to a pipe that only a
single other process is reading from, for example. Could you turn the
write()/read() pairs into a buffer mapped directly into the reading process,
replace the read() syscalls with operations purely on the userspace buffer
when there is data there, and only ever require a context switch when one
side outpaces the other? (e.g. using futexes when the buffer is potentially
full or empty)

Another case is things like user-space networking. If the user has sufficient
privilege to write directly to the network anyway, why not make their
networking code dynamically run in user space when it makes sense?

Yes, it won't be free, but the point of targeting IO-intensive code for this
kind of thing is that a lot of the code paths are quite simple but tend to
get executed at high frequency over and over again, so you can in fact often
afford to do a lot of work even for relatively modest speedups. Even more so
in modern systems where distributing the work over enough cores is often a
challenge - any time you have a core idling you have time to spend trying to
JIT existing code paths with more expensive optimizations.

It's worth at least exploring.

~~~
JoeSamoa
Any must reads for getting into kernel optimization? I'd like to be able to
diagnose performance issues better and you seem to know the flow very well.
Any advice on material to read or places to hang around would be appreciated!

~~~
vidarh
Specifically kernel optimization isn't really my area. I've mostly approached
it from the userspace side in that I've worked a lot on optimizing apps that
often turn out to abuse syscalls.

From _that_ side, I'd say looking at tracing tools like systrace is
worthwhile. But also plain strace (system calls) and ltrace (dynamic
library calls). "strace -c -p [pid]", wait a bit and then ctrl-c, or just
"strace -p [pid]" for a running process on Linux (or drop the "-p" and add the
command you want to trace) is hugely illuminating and will often tell you a
lot about the code you're running.

You can usually do the same with "proper" profilers (like e.g. gprof), but
strace has the benefit of focusing specifically on the syscall boundary, and
being very easy to pull out. You can use it both for profiling (with -c) or to
see the flow of syscalls (without -c). The latter is useful for code that
gives opaque error messages or otherwise to find out e.g. which files an
application tries to open, and is invaluable for trouble-shooting.

Specifically for performance issues, there is one thing that is very valuable
to remember: syscalls are expensive. Very expensive. Code that does lots
of small read() calls, for example, is almost certainly doing something stupid
(like not doing userspace buffering). But in general, a brief run of strace -c
and figuring out _why_ the top calls occur so often and/or take so long very
often leads to obvious fixes.

E.g. a far too common pattern, though it's getting better, is read()'ing a
length indicator from an incoming packet of data, followed by a second read()
for the contents, instead of using a non-blocking socket and doing larger
reads into a buffer. When the message flow is slow, it doesn't matter. When
it's fast, the difference in throughput can be massive.

------
monocasa
Neat! My main side project takes exactly the same base idea, that we need a
synthesis v2.

The big question is "what have we learned in the past thirty years of OS
design that's applicable?" The big one is that capabilities are amazing in
this context; if you squint, Synthesis quajects and capability objects line
up pretty well in semantics. I'd argue that xok would be a lot cleaner if it
had a central VM concept instead of the three or so separate kernel VMs.

------
kazinator
> _The idea is that at file open time, the OS has a lot of information that it
> can use to generate highly specialized code, eliding code paths that are
> provably not going to execute._

So when you see the bytes surrounding the instruction pointer in your kernel
OOPS message, good luck looking for them in the _vmlinux_ disassembly.

~~~
saagarjha
I mean, this is a problem that existing JITs have to face all the time.

------
amluto
One problem here is that in-kernel self-modifying code is very expensive on
current CPUs due to the synchronization and possible flushing involved.
There's flushing because the various levels of instruction caching are, in
general, not coherent in hardware, and there's flushing because you can't turn
writable memory into read-only executable memory without flushing the writable
TLB entries.

------
kragen
This is a wonderful idea, and I hope many people start working on it right
away.

Although Massalin has never published her code, according to my memory of her
thesis, Synthesis's runtime code generation was _mostly_ extremely simple,
more like linking than what we think of as "code generation" — it copied a
template method into the appropriate slot in the newly-generated quaject, then
overwrote specific bytes in the generated code with pointers to the relevant
callout (or, in some cases, the initial value of an instance variable for that
quaject). Parts of the code that did not benefit from being specialized in
this way were factored into ordinary functions the quaject method would just
call.

This meant that only a small amount of code was generated for each quaject,
and the runtime code generation was very nearly as fast as memcpy(), which
meant that it was reasonable to use it on every quaject instantiation.

Massalin _also_ talked about applying some optimizations to the generated
code, such as the constant-folding and dead-code removal John mentions, but I
have the intuition that only a minority of quaject instantiations involved
such more aggressive optimizations. Since she never published Synthesis, it's
impossible to know for sure. (I'm not questioning her integrity or claiming
that the impressive benchmarks reported in her dissertation are faked; I'm
saying that we unfortunately can't see the exact mixture of interesting things
you need to do to get those kickass benchmarks; so, like an insecure Intel
CPU, I'm reduced to speculation.)

Later implementations inspired by Massalin's approach included Engler's VCODE
(which, to my knowledge, has also never been published; Engler's PLDI paper
cites Massalin in the second sentence of the abstract), which was used to
implement Engler's `C, and GNU Lightning (inspired by Engler's published
papers _about_ VCODE), used in a number of modern JIT compilers.

I suspect that, by contrast, John's idea of using LLVM is inevitably going to
have much higher overhead — if only from the CPU cache devastation brought
about by any LLVM invocation — so will only be a win for much-longer-lived
objects, where the large instantiation overhead can be amortized over a much
larger number of invocations. An intermediate approach like Engler's `C might
be more broadly applicable.

John suggests this early on in his "for deployment" comment, but I think that
it's probably necessary for prototyping too, since the objective of the whole
exercise would be to get an order-of-magnitude speedup, and the objective of
the prototype would be to find out if that's a plausible result. A prototype
that makes all your programs run slower due to LLVM wouldn't provide any
useful evidence about that.

I asked Perry what he thought about the above, and he replied with this gem:

 _So you’re correct that the code generation was mostly “template
instantiation”. I think that was key to having calls like open() function in
reasonable time. I also suspect LLVM is a blunt instrument for this work. That
said, it would have been difficult for anyone but Massalin to work with the
approach in Synthesis. It was very much the product of a person who was both
intensely brilliant and completely comfortable with writing “weird code” in
the instruction set they were working in._

 _So there’s then the question of how one can take the ideas from Synthesis
and make them a practical thing that ordinary programmers could build and
contribute to. And that is almost certainly going to involve compiler tooling.
As a prototype, making this work by using LLVM is probably a good approach.
Ultimately, I think that one is going to have to do the magic at kernel build
time and have something fairly lightweight happen at runtime. But to figure
out what that is, one needs to play. And the easiest tools right now for
playing involve LLVM. If, for example, you can use LLVM successfully to
specialize a write call’s instruction path down an order of magnitude or more,
or to do similar things in the networking code, one can then ask how to do
this better._

 _There are, of course, a number of interesting paths towards playing with
this. I suspect that none of them end anywhere near where they start. But the
only way to see what might be possible with much better tooling is to start,
and you have to start somewhere._

 _BTW, I think the time is right, or even over-right, for this. Single
processor core performance is stalled out, and while in 1992 one could just
say “well, we’ll have another factor of ten performance improvement in a few
years, who needs the trouble”, that’s no longer the case. Note that this
argument also applies, to a considerable extent, to other parts of the modern
software ecosystem. When you can’t just say “we could spend a couple of months
optimizing this, but the next generation of processors will be out by then”,
things change._

 _Anyway, not sure if this answers your call for comments, but if you are
interested in specific areas around this, I’ve no shortage of opinions. Many
would say I have far too many opinions..._

 _You can quote any subset or the entire thing. So long as you don’t distort
my intent I have no problem with people using my words that way._

~~~
gmueckl
I also got the impression when reading the article that a full-blown
high-level language JIT is overkill for a kernel. A lighter template
instantiation approach is much better. The key is deriving the template
fragments, constant patch tables, etc. from the fully featured source code.
Doing this in a fully automatic fashion seems computationally hard and likely
requires finely tuned heuristics to figure out sections that can likely be
made optional depending on parameters. Humans could annotate the code, but
that essentially creates a new and quite weird programming language. The
compiler would effectively consider all possible combinations of compile-time
configuration switches at once and emit instructions on how to instantiate a
certain configuration from machine code templates.

All of this has devils and monsters lurking in the details, of course. I have
seen systems where this kind of runtime code specialization would indeed be
helpful even in user space. There is a lot of code where the innermost loop
has conditionals based on some user input or program setting that cannot be
determined at compile time. Code like this could profit a bunch from runtime
specialization, too. Mapped into a programming language, this would probably
look something like generics, but with parameters that need to be passed to a
constructor at runtime to get something executable.

~~~
kragen
I don't know if it's such a hard problem as you suggest; it's pretty close to
what a linker does, really. Have you looked at `C?

~~~
AstralStorm
It gets harder when you want to do link time optimization. Without it, the
approach is inferior to ahead of time optimization with some programmer
input... (Degeneralization and devirtualization rather than specialization.)

~~~
kragen
It's fine if it's inferior to ahead-of-time optimization with programmer
input, because you don't have time for a programmer to optimize the kernel
code anew every time you spawn a process or open a pipe or a socket. It just
needs to be faster than what the kernel is doing now when you call read() or
write() or whatever, without being too much slower when you open the file
(etc.)

Like, ahead-of-time optimization with programmer input might take a few hours
to a few days, and that's not really an acceptable execution time for the
open() system call.

------
obl
Nice article.

Of course if you get some of the benefits of run time specialization, you also
get some of the downside, i.e., overspecialization.

The heuristics for knowing when to stop specializing are a really hard part
of these kinds of systems.

You could have a program that does a single write() on 100k different files.
You could have thousands of processes each writing to a single file and
bloating the I$ with useless copies of write.

~~~
trurl
This isn't exactly a new issue in runtime code generation, for example, the
HotSpot Java JIT was introduced over 20 years ago.

------
lachlan-sneff
This is a cool idea, but I bet that modern cpu cache effects would counteract,
to some extent, the gains from a synthesis kernel.

~~~
trurl
Probably fifteen years ago I discussed runtime code generation with some
computer architects and they did indeed raise concerns about caching at the
time. But now I am not as convinced this is as much of a worry as people seem
to think. Sure, if you were generating new code every time you opened a file
and did a write, perhaps you would blow out the instruction cache with
excessive amounts of freshly generated code. But if you waited to generate an
optimized version of write until you see it is a "hot spot", it seems like
you would probably be okay.

I can only speculate, so maybe someone who knows better can correct me, but I
would be surprised if many kernel syscall implementations would actually stay
in cache after a few context switches. Especially with recent timing channel
mitigations. But I'm just guessing.

And even if there are architectural issues, with processors seeing less
benefit from Moore's Law I would expect computer architects to be looking for
ways to eke out more performance such that they'd find solutions.

------
oconnor663
Novice question here: Applications that care about performance already do
things like buffered reads, to minimize the overhead of system calls. Is there
much value to optimizing read() in that case? Are there other system calls
that are harder for applications to amortize?

~~~
kragen
Part of the value of reducing system call overhead is that it reduces the need
to buffer reads; one of Massalin's key use cases was building a real-time
audio processing system (on I think a 40-MHz dual-processor 68020 workstation)
out of shell pipelines, for example to simulate what her voice would sound
like after her gender transition, and real-time audio processing is terrible
when there's a lot of buffering introducing latency. Only up to a few
milliseconds is acceptable over the whole pipeline. On my Linux laptop the
read() overhead is only about 300 ns, but lmbench says the pipe latency is
19000 ns, which adds up fast.

There's also the overall complexity cost: by pushing complexity into dozens of
libraries and thousands of applications to compensate for unnecessary
inefficiency in the kernel, the system can become more complex — harder to
debug and harder to optimize — than if you just bite the bullet and optimize
the kernel. And even then, you don't get back all the performance you lost —
on Synthesis, unbuffered reads were actually _faster_ than buffered reads on
other operating systems, because you could avoid the extra copy through the
userland buffer!

Assuming it's possible to optimize the kernel that much, of course. On a
Netburst Pentium IV, you're fucked, and it still remains to be seen how bad
the Intel speculative execution disaster ends up being.

------
kmod
I'm working on something that could be applicable to this! There seem to be
many areas that would benefit from a pluggable JIT engine, so I started
building one:
[https://github.com/nitrousjit/nitrous/](https://github.com/nitrousjit/nitrous/)

It's designed to analyze and optimize existing codebases. So far it's produced
some pretty decent speedups on the Python interpreter, though with a big
drawback of very high compilation times due to using LLVM.

Let me know what you think or if you have performance-sensitive programs that
you think would benefit from a low-investment JIT.

~~~
ihnorton
Are you compiling to LLVM IR with a modified Clang that is intended to be
drop-in, and then interpreting+JITing? Or instrumenting?

------
muricula
Chrome's v8 Javascript JIT gets attacked all the time. I can only imagine the
fun to be had attacking a kernel which uses JIT extensively.

~~~
saagarjha
Yeah, I’m surprised that security wasn’t mentioned at all. The Linux kernel’s
BPF JIT is incredibly restrictive on what it can interpret and produce:
presumably a JIT like this would be much less constrained.

~~~
gpvos
Security was mentioned in passing:

 _> In production we’d need to think carefully whether we wanted to include a
big chunk of non-hardened code in the kernel._

------
maxdamantus
I suspect the main advantage of a system like this (basically, userspace code
is run in a software VM) would not be code synthesis (though I do think that
is also useful), but the fact that there's no longer an expensive boundary
between "user" and "kernel" code. A system call would simply be a regular
function call; there would be no use of interrupt handling or
manipulating/switching page tables or ultimately any use of an MMU.

Once this boundary is removed, we won't need things like userspace buffering,
which exists to circumvent this cost at the expense of some odd user
behaviour:

    $ cat | grep a | cat

Ultimately, I do think that OSes will go in this direction. As mentioned in
the article, there are currently things like eBPF in Linux which take
advantage of this. I don't see why we couldn't have a kernel which supports
both running VM code in "ring 0" but also supports a Linux-like VM mechanism
for running computational code which would be compiled sub-optimally by the
software VM. Presumably this functionality could be implemented directly into
an existing OS, so it wouldn't necessitate some major switch.

~~~
kragen
Running userspace code in a kernel VM isn't what John is suggesting, and the
experimental OSes that tried this tack have been hit rather badly by the Intel
speculation security problems.

------
lachlan-sneff
Webassembly would probably be an ideal executable format for a synthesis
kernel. Safe to execute, easy to generate these days from all sorts of
programming languages.

Plus, you can actually inline syscalls and whatnot into your compiled
application at runtime.

~~~
intea
inline syscalls? How would that work?

------
cbsmith
x86 code is effectively JIT'd inside modern CPUs.

...and then of course there is ACPI...

