
EBPF is turning the Linux kernel into a microkernel - yoquan
https://docs.google.com/presentation/d/1AcB4x7JCWET0ysDr0gsX-EIdQSTyBtmi6OAW7bE0jm0/edit#slide=id.g704abb5039_2_106
======
fefe23
I don't think that word means what you think it means.

Microkernel = move all the code OUT OF the kernel.

These slides are about moving all the code INTO the kernel.

Putting your application logic into the kernel would be more like a unikernel
I guess?

~~~
naasking
> Microkernel = move all the code OUT OF the kernel.

Move it out of the kernel to isolate and protect. If you can isolate and
protect code within the kernel's memory protection boundary, I'm not sure that
that should disqualify it as a microkernel.

In other words, I'm not sure that the microkernel design _depends on_ memory
protection boundaries specifically, it's a more general philosophy, akin to,
"a microkernel is an operating system design which runs the minimum amount of
code needed for an OS with full trust".

~~~
MisterTea
This is what I understand a microkernel to be: it handles the bare minimum of system
duties: process creation/management, memory management, IPC mechanism, I/O
access. This is where you implement your security layer in order to isolate
and protect hardware, processes, and memory. You can even implement
namespacing and isolation to make each process look like it's running in a
container or VM so to speak.

All of your drivers and user programs live in user space and are secured via the
kernel. E.g. a USB host controller is a process which talks to the hardware
and provides some sort of interface which USB device drivers can talk to to
implement their respective device interfaces. A file system on a disk is
handled by a file system process speaking e.g. ext4 to a disk and then a
socket is provided to mount and access the files.

It's a very nice setup because if your IPC mechanism is how everything talks
then you can think of the kernel as a microservice host and IPC router which
your processes talk through. They can provide or consume resources. Now if you
can push that IPC mechanism over the network transparently then you have a
distributed system. Then you eliminate a lot of code and protocol nonsense
talking over networks.

~~~
shawnz
OP's definition is the same as yours, except instead of "user space" it's "a
virtualized sandbox, but still running in ring 0".

What's the difference besides technicalities of the implementation? Everything
else you are saying about using IPC interfaces and isolating the kernel code
still applies in both cases.

------
MaxBarraclough
I'm not seeing how this helps solve the API stability problem faced by
ordinary kernel modules. There must be some difference between this project,
and a project that simply creates a more stable wrapper/subset of the APIs
available to kernel modules, but it's not clear to me what it is.

Also, why use JIT rather than offline verification and ahead-of-time
compilation?

Aside: the idea that the web delivers on the requirement of _Programmability
must be provided with minimal overhead_ is pretty laughable. Think Microsoft
Teams (a chat application) would consume 600MB of memory if it were built with
C++ rather than Electron? I realise not every JIT-powered technology needs to
be as bloated as the web, but it seems a poor example.

~~~
barrkel
How can the kernel trust your offline verification? At best, what you're
arguing for sounds like signed binary blobs.

How do you dynamically instrument things? How do you write programs which
decide, at run time, to move compute closer to the hardware?

~~~
MaxBarraclough
> How can the kernel trust your offline verification?

My thinking was that _I_ would trust the offline verification, and this would
be enough for me (as superuser) to load the precompiled module into the
kernel. I believe LLVM does something vaguely comparable, where it can verify
that bitcode modules are well-formed, to protect against certain classes of
compiler bugs. (Java of course does its class-verification at runtime.)

I don't think this idea is all that different though. If the JIT implements
caching of its generated native-code (assuming it can do this securely) then
we'd get the best of both worlds: I don't need to be a superuser, and we avoid
needless recompilation.

> How do you write programs which decide, at run time, to move compute closer
> to the hardware?

When would this make sense? If you've got a working kernel implementation,
which is robust and trusted, why would you _not_ use it?

~~~
barrkel
If you're writing a router, or high speed trading system where you want to
respond to packets on the wire with lower latency.

The point of a safely programmable kernel is that the user gets to inject
third party code into their kernel without needing to know if it's safe,
because the kernel will take care of it.

~~~
MaxBarraclough
I think you misread my question. I asked why you _wouldn't_ move the code to
run in the kernel, if you have that ability.

Anyway, if I'm understanding things correctly, the point of using JIT is to
handle the compilation in a trusted context rather than having it run as the
user.

~~~
barrkel
You wouldn't write the code in the kernel because writing safe code is almost
impossible for humans without a lot of tooling help, and that tooling looks a
lot like a sandbox.

I think you're fundamentally not understanding why we have virtual machines in
language implementations, or process boundaries in operating systems, and why
these things are good and useful. Because if you did, you'd see that a sandbox
in the kernel is a hybrid of the two ideas.

~~~
MaxBarraclough
> I think you're fundamentally not understanding why we have virtual machines
> in language implementations, or process boundaries in operating systems, and
> why these things are good and useful.

No. My understanding is fine.

> You wouldn't write the code in the kernel because writing safe code is
> almost impossible for humans without a lot of tooling help, and that tooling
> looks a lot like a sandbox.

Right, but with EBPF, you have exactly that.

My question was in response to your _How do you write programs which decide,
at run time, to move compute closer to the hardware?_

To rephrase, my question was this: If you have the ability to move code into
the kernel without concerns of stability or security, why would you wait until
runtime to decide whether to do it? Why wouldn't you just do it
unconditionally?

> You wouldn't write the code in the kernel because writing safe code is
> almost impossible for humans without a lot of tooling help, and that tooling
> looks a lot like a sandbox.

Of course. My question there was about the use of JIT rather than ahead-of-
time compilation. As we've now both said, the answer is that EBPF is able to
move the compilation out of the hands of the user, avoiding having to trust
the user. It may also be helpful that the input to the JIT can be built up at
runtime, as with the routing example you mentioned, but this could be done
even if we trusted the user to handle the compilation.

This doesn't mean my suggestion is unworkable. You _could_ entrust the user
with the compilation process, and you'd still get the robustness guarantees,
but, well, you'd have to trust the user. Better to have the kernel handle the
compilation (and ideally caching).

------
dathinab
While the slides are interesting, and Linux is gaining some functionality known
mostly from microkernels, it's not really turning Linux into a microkernel at
all.

It just provides a new _additional_ extension mechanism which is sandboxed and
much nicer to use.

But to turn the Linux kernel into a microkernel, eBPF would need the
capability to replace _all_ existing kernel modules, including file system
drivers and graphics drivers. That is not something it's capable of, and at
least currently it's only meant for new kernel functionality on top of the
"core" which we have.

This maybe could change at some point in the (not very close by) future. But
for now it doesn't yet turn Linux into a microkernel.

~~~
riffraff
I appreciate this comment and agree with it but there are so many
typos/autocorrectisms that it's painful to read.

Dathinab, maybe do an "edit" pass? :)

EDIT: fixed my own mess, thanks :)

~~~
ghostpepper
> there are some many typos

I think you meant "so many"?

~~~
wolfgang42
Muphry's law strikes again.

~~~
hinkley
That damn Muphry, always messing with people.

------
justinsaccount
The link should be changed to

[https://docs.google.com/presentation/d/1AcB4x7JCWET0ysDr0gsX...](https://docs.google.com/presentation/d/1AcB4x7JCWET0ysDr0gsX-EIdQSTyBtmi6OAW7bE0jm0/preview)

currently it links to the 2nd to last slide and not the beginning.

------
jdub
eBPF is turning Linux into a microkernel like drinking Gatorade is turning me
into a Super Bowl quarterback.

(I tried to localise this for a predominantly US audience.)

~~~
numlock86
Are you telling me that when I use Axe deodorant my house won't be flooded by
hundreds of nearby - allegedly beautiful - women in their early 20s within a
couple of seconds? Outrageous!

~~~
choeger
No. That one is a hard fact. You just have to find the right variant of axe
for your particular neighborhood. Imagine my face when I accidentally stumbled
upon that variant.

~~~
Y_Y
I use a hatchet.

------
aey
EBPF is ridiculously awesome. It’s safe enough to jit in ring-0!

We built a rust tool chain that can output ebpf elfs :).
[https://github.com/solana-labs/rust-bpf-builder](https://github.com/solana-labs/rust-bpf-builder)

------
ThePhysicist
EBPF is a super interesting technology but it’s so painfully hard to use it
for application development. There are some tools based on LLVM to compile
EBPF programs using C as a source language (which is much easier to reason
about than the low-level code), but there is a lot of room for improving the
developer workflow.

~~~
rhinoceraptor
bpftrace is getting pretty good lately, they've added support for stack
arguments, so you can do things like trace golang function calls, and get
arguments with a one-liner.
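A one-liner of the kind described above can be composed and launched from a small script. This is a hedged sketch: the binary path and function name below are made-up examples, and Go symbol names and argument layouts vary by toolchain and version.

```python
import subprocess

def bpftrace_uprobe_cmd(binary, func):
    """Build a bpftrace one-liner that fires on each call to `func`
    in `binary` and prints the probe name and first argument."""
    # `probe` and `arg0` are bpftrace builtins; %ld prints arg0 as an integer.
    script = 'uprobe:%s:%s { printf("%%s arg0=%%ld\\n", probe, arg0); }' % (binary, func)
    return ["bpftrace", "-e", script]

# Hypothetical Go server and handler function:
cmd = bpftrace_uprobe_cmd("/usr/local/bin/myserver", "main.handleRequest")
print(" ".join(cmd))
# Actually attaching the probe requires root: subprocess.run(cmd)
```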

------
brendangregg
I don't see anyone sharing it, but the video for this talk is here:
[https://www.infoq.com/presentations/facebook-google-bpf-linu...](https://www.infoq.com/presentations/facebook-google-bpf-linux-kernel/)

------
stefan_
eBPF are vendor kernel modules on steroids: now instead of getting compile
failures trying to build your out-of-tree module, your stuff just blows up at
runtime.

~~~
bitcharmer
eBPF has been invaluable in my field (low-latency linux applications) and it
changed a lot.

If you had problems working with kernel modules before, you probably should
expect struggling with writing correct code for eBPF too. It's not for
everyone.

~~~
bibabaloo
Can you share the sort of things you've been doing with it?

~~~
bitcharmer
Most recently I used ebpf to track which other threads were stealing cpu time
(and how much) from my latency sensitive cpu-pinned thread. You can do almost
anything, the level of introspection into the kernel internals is amazing.
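A hedged sketch of that kind of measurement, using the bcc Python frontend: count which threads get switched in on the pinned core, i.e. who is competing with the latency-sensitive thread for CPU time. The tracepoint and field names follow bcc's documented `sched:sched_switch` format; the CPU number is a made-up example, and actually loading the program requires root and bcc installed.

```python
import time

def build_bpf_text(cpu):
    """BPF C program counting how often each thread is switched in
    on `cpu` - a rough picture of who is stealing time there."""
    return r"""
BPF_HASH(switch_count, u32, u64);

TRACEPOINT_PROBE(sched, sched_switch) {
    // Only count context switches happening on the pinned core.
    if (bpf_get_smp_processor_id() != %d)
        return 0;
    u32 pid = args->next_pid;
    switch_count.increment(pid);
    return 0;
}
""" % cpu

if __name__ == "__main__":
    from bcc import BPF                    # requires root + bcc installed
    b = BPF(text=build_bpf_text(3))        # CPU 3: hypothetical pinned core
    time.sleep(5)                          # sample for a few seconds
    for pid, count in sorted(b["switch_count"].items(),
                             key=lambda kv: -kv[1].value):
        print(pid.value, count.value)
```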

------
jfkebwjsbx
Everything would be a microkernel if adding some kind of VM or interpreter is
enough to get that name, no?

With that logic, could we argue loadable kernel modules (perhaps with proper
memory separation) are a sign of a microkernel architecture?

~~~
rantwasp
yes. the author of that deck is playing it pretty loose when it comes to the
definition of a microkernel.

normally a microkernel means the minimum needed primitives to implement the
OS, and after that everything is built on top of that, not pluggable modules.

For all intents and purposes the Linux kernel is a monolithic one, and the
eBPF capability makes it more extensible / less of a pain to do certain
things, but it definitely does not turn it into a microkernel.

~~~
naasking
> normally the microkernel means the minimum needed primitives to implement
> the OS and after that everything is build on top of that

Sure, the minimum amount of full trust code. In this case, the full trust code
is the eBPF VM which enforces protection boundaries instead of the MMU as in a
classic microkernel. I'm not sure a microkernel classification ought to depend
on the MMU specifically, it's a general system design philosophy.

~~~
rantwasp
it’s not just the memory protection. it’s the scheduling, IPC, etc.

the eBPF vm uses the capabilities of the kernel, it is not the kernel. No
kernel, no nothing.

also, following your train of thought I could say that containers make this a
microkernel. it would be a claim that would get you laughed out of a room.

~~~
naasking
A kernel provides trusted runtime services for an operating system.

A microkernel provides a _minimal_ set of trusted runtime services for an
operating system, and relies on some protection mechanism for isolating
subsystems to avoid corrupting the trusted core. Preemptive scheduling is not
_necessarily_ part of it; depends whether your system requires "time" to be a
protected resource.

eBPF is a kernel service, just like processes, scheduling, IPC. If eBPF can
isolate subsystems and supports safe collaboration of eBPF programs despite
all running at ring 0, then the eBPF VM in the Linux kernel could qualify as a
microkernel once you remove everything else.

> also, following your train og thought I could say that containers make this
> a microkernel.

If you could run all of the device drivers in containers such that they
couldn't corrupt the kernel's data, then sure, you could run it as a
microkernel because you wouldn't have anything left in the kernel except
essential services like threading, IPC and containers.

------
monocasa
It's turning it into an exokernel.

Check out xok, it had three in kernel virtual machines.

[https://github.com/monocasa/exopc/tree/master/sys](https://github.com/monocasa/exopc/tree/master/sys)

------
rjsw
Sun did some experiments with building a JVM into their kernel so that you
could write device drivers in Java.

------
snvzz
Running even more code in supervisor mode != turning into a microkernel.

------
RMPR
I was thinking of eBPF as a way to get into Linux kernel development with a
modern language, but I'm kinda confused by what I read in the comments: it's
not quite a thing?

~~~
rhinoceraptor
eBPF is just an in-kernel VM. You can do a lot of things with it, which makes
it hard to figure out what to do with it.

Original BPF is in most Unix kernels, it was just a way of writing simple
packet filtering programs that run in-kernel. For example, tcpdump is
effectively just a frontend that emits BPF bytecode.

eBPF expands the capabilities of the VM, but it still has tight restrictions
on what can run: no unbounded loops, arbitrary memory access, etc. I would
recommend trying out bpftrace as a first step:

[https://github.com/iovisor/bpftrace](https://github.com/iovisor/bpftrace)
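The "no unbounded loops" restriction can be made concrete with a toy sketch. This is not the real verifier (which simulates register state and, in newer kernels, accepts provably bounded loops); it only illustrates the classic rule that rejecting backward jumps guarantees every program terminates.

```python
def verify_no_backward_jumps(program):
    """Toy check: reject any jump to the same or an earlier instruction,
    or past the end of the program. Instructions are (opcode, offset)
    pairs; the offset is relative to the next instruction, as in BPF."""
    for pc, (op, offset) in enumerate(program):
        if op == "jmp":
            target = pc + 1 + offset
            if target <= pc:            # backward or self jump: possible loop
                return False
            if target > len(program):   # jump out of bounds
                return False
    return True

ok_prog   = [("ld", 0), ("jmp", 1), ("ret", 0), ("ret", 0)]
loop_prog = [("ld", 0), ("jmp", -2), ("ret", 0)]

print(verify_no_backward_jumps(ok_prog))    # True: forward jumps only
print(verify_no_backward_jumps(loop_prog))  # False: jumps back to "ld"
```

Since every jump moves strictly forward, execution visits each instruction at most once, which is why classic BPF programs always terminate.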

------
peter_d_sherman
"A thorough introduction to eBPF"

[https://lwn.net/Articles/740157/](https://lwn.net/Articles/740157/)

Excerpts:

"While eBPF was originally used for network packet filtering, it turns out
that running user-space code inside a sanity-checking virtual machine is a
powerful tool for kernel developers and production engineers."

[...]

"The eBPF virtual machine more closely resembles contemporary processors,
allowing eBPF instructions to be mapped more closely to the hardware ISA for
improved performance."

[...]

"Originally, eBPF was only used internally by the kernel and cBPF programs
were translated seamlessly under the hood. But with commit daedfb22451d in
2014, the eBPF virtual machine was exposed directly to user space."

[...]

"What can you do with eBPF?

An eBPF program is "attached" to a designated code path in the kernel. When
the code path is traversed, any attached eBPF programs are executed. Given its
origin, eBPF is especially suited to writing network programs and it's
possible to write programs that attach to a network socket to filter traffic,
to classify traffic, and to run network classifier actions. It's even possible
to modify the settings of an established network socket with an eBPF program.
The XDP project, in particular, uses eBPF to do high-performance packet
processing by running eBPF programs at the lowest level of the network stack,
immediately after a packet is received.

Another type of filtering performed by the kernel is restricting which system
calls a process can use. This is done with seccomp BPF.

eBPF is also useful for debugging the kernel and carrying out performance
analysis; programs can be attached to tracepoints, kprobes, and perf events.
Because eBPF programs can access kernel data structures, developers can write
and test new debugging code without having to recompile the kernel. The
implications are obvious for busy engineers debugging issues on live, running
systems. It's even possible to use eBPF to debug user-space programs by using
Userland Statically Defined Tracepoints."

There, now you understand eBPF.

It is not a Microkernel.

It is an in-kernel Virtual Machine, with access to all of the kernel, whose
programs can register for, receive, filter, and optionally act upon or act to
moderate, kernel events.

Quite the powerful tool indeed -- but not a Microkernel...

------
smitty1e
Tanenbaum lives!

~~~
rhinoceraptor
Technically, Linux is just a guest OS, running on top of Minix :)

------
musicale
More like an exokernel.

~~~
seangrogg
This is perhaps the most apt description available.

------
perlgeek
Can device drivers be written in EBPF?

------
exabrial
I saw a link on HN a few months back that was going to do the same thing with
WASM.

------
dingo_bat
> Rebooting 20,000 servers takes a very long time without risking extensive
> downtime.

With eBPF, hot-patching servers will take a very short time to start the
extensive downtime, plus the consequent reboot of 20,000 servers.

------
gazsp
It's not.

------
layoutIfNeeded
As always, worse is better™!

