EBPF is turning the Linux kernel into a microkernel (docs.google.com)
194 points by yoquan on April 23, 2020 | 89 comments



I don't think that word means what you think it means.

Microkernel = move all the code OUT OF the kernel.

These slides are about moving all the code INTO the kernel.

Putting your application logic into the kernel would be more like a unikernel I guess?


It seems to me this pattern recurs over and over. The pragmatics roll up their sleeves and get the job done today, in some messy format (real kernels). The academics decry them as making a mess and doing terrible things, and declare that some solution is obviously correct and the practicals should be doing this instead (microkernels).

Ten to thirty years later, what happens is that both of them turn out to be correct in ways that neither of them could have foreseen at the beginning. It is true that what the practicals were doing was messy and it did blow up, and they did end up taking academic ideas into their code, but not quite as the academics would have thought at the beginning either.

Yeah, this is not a "real" microkernel. But if this process continues as described, you'll have a small hardcoded inner kernel, a whole bunch of sandboxed kernel code running at the kernel level of privilege but still essentially like "microkernel" components, suitably modified for the modern world, and a sane and safe sandbox to add more. And it'll be better than the original concept of "microkernels" was, and might as well co-opt the name, because we won't be going back to those ideas. The difference is that if we're getting to the point we can write sandboxes that actually work, the microkernel idea should be updated to account for that. The original microkernel ideas didn't expect that was a reasonable possibility, because at the time it wasn't. (Whether it is yet, we'll have to see, but we're at least getting closer.)


Ever heard the phrase, "don't let perfect be the enemy of good?" I tend to think of Linux as "good" and GNU Hurd as "perfect".


I believe that 2020 is the year of windows server, linux desktop and GNU Hurd 1.0


I believe it's only a matter of time before we figure out how to get Intel ME to directly access the UEFI framebuffer and expose it as an X server, thus ushering in the Year of the Minix Desktop instantaneously on all PCs with Intel CPUs manufactured in the last decade.


Considering how 2020 has gone so far...


Hurd and Mach did the microkernel thing wrong. L4 and its colleagues did it right.

No receiver mailboxes. Either the receiver is ready to pick up the message, or it is lost. Just like with signals. Huge mistake.

Though I do know other permissive real-time kernels with mailboxes, which do work fine in industry, like in automobiles and airplanes/rockets. Or even Darwin. So maybe there's a second big problem in the Hurd design I'm not aware of.


Also known as "worse is better" software. Create a program that is as small as possible, fast, simple in implementation, and reliable. Then build more on top of that. Even if the interfaces to this program have some rough edges, it will knock the "perfect design" or even the "good enough" version out of the park.


It’s the “reliable” part in your list that’s been problematic with monolithic kernels. Cramming the entire OS into the kernel was fast and simple, but exposed attack vectors and enabled bugs in drivers and other stuff to freeze the kernel. Not reliable.


Or you could think of Minix as perfect. Or SEL4 as perfect. Or QNX as perfect.

No need to cherry pick Hurd for your example when there are shipping microkernels.


I tend to think of Hurd as "bad."


> Microkernel = move all the code OUT OF the kernel.

Move it out of the kernel to isolate and protect. If you can isolate and protect code within the kernel's memory protection boundary, I'm not sure that that should disqualify it as a microkernel.

In other words, I'm not sure that the microkernel design depends on memory protection boundaries specifically, it's a more general philosophy, akin to, "a microkernel is an operating system design which runs the minimum amount of code needed for an OS with full trust".


This is how I understand a microkernel: it handles the bare minimum of system duties: process creation/management, memory management, IPC mechanism, I/O access. This is where you implement your security layer in order to isolate and protect hardware, processes, and memory. You can even implement namespacing and isolation to make each process look like it's running in a container or VM, so to speak.

All of your drivers and user programs live in user space and are secured via the kernel. E.g. a USB host controller is a process which talks to the hardware and provides some sort of interface which USB device drivers can talk to in order to implement their respective device interfaces. A file system on a disk is handled by a file system process speaking e.g. ext4 to a disk, and then a socket is provided to mount and access the files.

It's a very nice setup because if your IPC mechanism is how everything talks then you can think of the kernel as a microservice host and IPC router which your processes talk through. They can provide or consume resources. Now if you can push that IPC mechanism over the network transparently then you have a distributed system. Then you eliminate a lot of code and protocol nonsense talking over networks.
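To make the "kernel as IPC router" picture concrete, here is a minimal Python sketch. Everything in it (`ToyKernel`, the `disk` and `fs` services, the message shapes) is hypothetical and only models the idea: the "kernel" registers services and routes messages, and a crashing service fails its own request without taking the router down.

```python
class ToyKernel:
    """Hypothetical model: the kernel only registers services and routes messages."""

    def __init__(self):
        self.services = {}  # service name -> handler callable

    def register(self, name, handler):
        self.services[name] = handler

    def send(self, target, message):
        handler = self.services.get(target)
        if handler is None:
            return ("error", f"no such service: {target}")
        try:
            return ("ok", handler(self, message))
        except Exception as exc:
            # A crashing service fails its own request; the router survives.
            return ("error", f"{target} crashed: {exc}")


def disk_driver(kernel, msg):
    # Stand-in for a driver process that talks to hardware.
    blocks = {0: b"hello"}
    return blocks[msg["block"]]


def filesystem(kernel, msg):
    # The file system is just another service that consumes the disk service.
    status, data = kernel.send("disk", {"block": msg["inode"]})
    if status != "ok":
        raise IOError(data)
    return data.decode()


kernel = ToyKernel()
kernel.register("disk", disk_driver)
kernel.register("fs", filesystem)

result_ok = kernel.send("fs", {"inode": 0})    # routed fs -> disk
result_bad = kernel.send("fs", {"inode": 7})   # disk fails, fs fails, router keeps going
result_none = kernel.send("usb", {})           # unknown service
```

Swapping the in-process dictionary lookup for a network transport is the step that would turn this into the distributed system described above; the sketch does not attempt that.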


OP's definition is the same as yours, except instead of "user space" it's "a virtualized sandbox, but still running in ring 0".

What's the difference besides technicalities of the implementation? Everything else you are saying about using IPC interfaces and isolating the kernel code still applies in both cases.


What you're talking about (I think) is a "hybrid" design, akin to Windows NT: like a microkernel, there are a bunch of different services ostensibly isolated from one another, but like a monolithic kernel, those services are all in kernel-space and interacting via function calls or somesuch rather than a full-blown communications protocol.


Linux is now doing what SPIN did 25 years ago, and the creators of SPIN called it an "extensible microkernel" [1].

Although I think the SPIN base system was a lot smaller than the Linux kernel - for example [2]:

> The Web server application, as well as the file system interface, entire network protocol stack, and device infrastructure are all linked into the system after it boots.

[1] http://www.cs.cornell.edu/people/egs/papers/spin-tr94-03-03....

[2] http://www-spin.cs.washington.edu/


I don't think that's the right way to think about it. We believe our code to be functionally decomposed when it's in separate functions that we can reason about and test in isolation, but when compiled, the functions may be inlined, may have constant propagation, may even be evaluated at compile time.

If the kernel, at its core, is just an execution environment for sandboxed programs that are sent into it, the source-level decomposition is solid, and the kernel is micro, relatively speaking, to the total amount of functionality. Drivers etc. can still be developed outside the kernel and need only interact with the sandbox API. Drivers may be upgraded live if the sandboxed programs can be replaced. The point of sandboxing is to ensure (ideally prove) that there can be no crashes due to violations of memory safety, or busy loops.


Microkernels are about reducing the amount of code run in kernel space.

Is this project moving things from user space to kernel space? If so it's the opposite of a microkernel.

E.g. if the Linux kernel becomes aware of the application layer, that really does sound like stuff that used to run in user space is now running in kernel space.


Depends what you call kernel space

For example, nebulet https://github.com/nebulet/nebulet wanted to run everything in ring 0, but "userspace" was WebAssembly code that had been compiled by "kernelspace" to run sandboxed in ring 0

If a CPU architecture was implemented to only offer ring 0, would a microkernel be impossible? Or would we accept this concept of kernelspace/userspace being implemented in software?


I don't quite agree. I think they're about reducing the size of your kernel, and moving the complexity of operating system service details into separate processes, where they can crash, restart, be upgraded, etc. in isolation, on their own terms.

The analogy is like microservices vs monolith, but with kernels, not big applications.


To me, it sounds like it brings code synthesis into the kernel, sort of a realization of Massalin's Synthesis kernel ideas.


No, it's similar to what Microsoft did. Put all the attack vectors into the kernel, because it's so much faster, and we'd rather redo it again than take an existing proven and secure solution. Just play the Rust game and call it secure. People believe everything if you constantly repeat it.


It's disingenuous to call this the same as putting "attack vectors into the kernel", as BPF programs are sandboxed, unlike Windows kernel components. I don't know of any existing proven and secure solutions to this besides BPF, by the way.


As we saw with CPUs and VMs, those sandboxing schemes are never secure. E.g. eBPF arrays can be abused for cache attacks. The white paper and security guarantees never thought of that. The secure solution is to disable it, as well as hyperthreading. And use a secure, non-backdoored CPU.


I'm not seeing how this helps solve the API stability problem faced by ordinary kernel modules. There must be some difference between this project, and a project that simply creates a more stable wrapper/subset of the APIs available to kernel modules, but it's not clear to me what it is.

Also, why use JIT rather than offline verification and ahead-of-time compilation?

Aside: the idea that the web delivers on the requirement that "Programmability must be provided with minimal overhead" is pretty laughable. Think Microsoft Teams (a chat application) would consume 600MB of memory if it were built with C++ rather than Electron? I realise not every JIT-powered technology needs to be as bloated as the web, but it seems a poor example.


> I'm not seeing how this helps solve the API stability problem faced by ordinary kernel modules.

I'm not entirely certain, but my impression is that EBPF has more limited capabilities, and so the API can be kept stable more easily. Of course that also means that you cannot do everything in EBPF that you can do in ordinary C modules.

Hence my question elsewhere in the discussion about whether you could write device drivers in EBPF. If yes, that might make things much easier once the toolchain eventually matures. If not, much is explained.


How can the kernel trust your offline verification? At best, what you're arguing for sounds like signed binary blobs.

How do you dynamically instrument things? How do you write programs which decide, at run time, to move compute closer to the hardware?


> How can the kernel trust your offline verification?

My thinking was that I would trust the offline verification, and this would be enough for me (as superuser) to load the precompiled module into the kernel. I believe LLVM does something vaguely comparable, where it can verify that bitcode modules are well-formed, to protect against certain classes of compiler bugs. (Java of course does its class-verification at runtime.)

I don't think this idea is all that different though. If the JIT implements caching of its generated native-code (assuming it can do this securely) then we'd get the best of both worlds: I don't need to be a superuser, and we avoid needless recompilation.

> How do you write programs which decide, at run time, to move compute closer to the hardware?

When would this make sense? If you've got a working kernel implementation, which is robust and trusted, why would you not use it?


If you're writing a router, or high speed trading system where you want to respond to packets on the wire with lower latency.

The point of a safely programmable kernel is that the user gets to inject third party code into their kernel without needing to know if it's safe, because the kernel will take care of it.


I think you misread my question. I asked why you wouldn't move the code to run in the kernel, if you have that ability.

Anyway, if I'm understanding things correctly, the point of using JIT is to handle the compilation in a trusted context rather than having it run as the user.


You wouldn't write the code in the kernel because writing safe code is almost impossible for humans without a lot of tooling help, and that tooling looks a lot like a sandbox.

I think you're fundamentally not understanding why we have virtual machines in language implementations, or process boundaries in operating systems, and why these things are good and useful. Because if you did, you'd see that a sandbox in the kernel is a hybrid of the two ideas.


> I think you're fundamentally not understanding why we have virtual machines in language implementations, or process boundaries in operating systems, and why these things are good and useful.

No. My understanding is fine.

> You wouldn't write the code in the kernel because writing safe code is almost impossible for humans without a lot of tooling help, and that tooling looks a lot like a sandbox.

Right, but with EBPF, you have exactly that.

My question was in response to your "How do you write programs which decide, at run time, to move compute closer to the hardware?"

To rephrase, my question was this: If you have the ability to move code into the kernel without concerns of stability or security, why would you wait until runtime to decide whether to do it? Why wouldn't you just do it unconditionally?

> You wouldn't write the code in the kernel because writing safe code is almost impossible for humans without a lot of tooling help, and that tooling looks a lot like a sandbox.

Of course. My question there was about the use of JIT rather than ahead-of-time compilation. As we've now both said, the answer is that EBPF is able to move the compilation out of the hands of the user, avoiding having to trust the user. It may also be helpful that the input to the JIT can be built up at runtime, as with the routing example you mentioned, but this could be done even if we trusted the user to handle the compilation.

This doesn't mean my suggestion is unworkable. You could entrust the user with the compilation process, and you'd still get the robustness guarantees, but, well, you'd have to trust the user. Better to have the kernel handle the compilation (and ideally caching).


> How can the kernel trust your offline verification?

You can use proof-carrying code. There is a residual "online" verification of course, but it ought to be quick and efficient.


You're right, but you're way ahead of me. I'd misunderstood the emphasis of the project, and was thinking I'd be a superuser, trusted by the kernel.


Well, you would still need "superuser" privileges for things like adding new capabilities to the proof verifier. Of course this might open you up to security problems if you're relying on incorrect assumptions while doing that. But then, this project also has trusted components of its own, such as the JIT. A proof verifier can be a lot simpler than a JIT.


> you would still need "superuser" privileges for things like adding new capabilities to the proof verifier.

You mean to upgrade EBPF itself? Well of course. Same as any kernel upgrade.

> Of course this might open you up to security problems if you're relying on incorrect assumptions while doing that.

I don't follow. It's giving the system a full proof of safety. What assumptions are there? It seems very similar to Java's class verification, which doesn't suffer from issues with ungrounded assumptions.

> this project also has trusted components of its own, such as the JIT. A proof verifier can be a lot simpler than a JIT.

Interesting point. It might reduce the total amount of highly-trusted kernel code to approach things that way.


While the sites are interesting and Linux gets some functionalities known mostly from micro kennels it's not really turning Linux into a micro kennel at all.

It just provided a new _additional_ extension mechanism which is sandboxed and much nicer to use.

But to make the Linux kennel into a micro kennel eBPF would need to have the capability to replace _all_ existing kernel modules. Including file system drivers, and graphic drivers. Which is not something it's cable of sand at least currently it's only meant for new kennel functionality in to of the "core" which we have.

This maybe could change at some point in the (not very close by) future. But for now it doesn't yet turn Linux into a micro kennel.


I appreciate this comment and agree with it but there are so many typos/autocorrectisms that it's painful to read.

Dathinab, maybe do an "edit" pass? :)

EDIT: fixed my own mess, thanks :)


> there are some many typos

I think you meant "so many"?


Muphry's law strikes again.


That damn Muphry, always messing with people.


"Micro kennel" has a nice ring to it.


As does "cable of sand".


The link should be changed to

https://docs.google.com/presentation/d/1AcB4x7JCWET0ysDr0gsX...

currently it links to the 2nd to last slide and not the beginning.


eBPF is turning Linux into a microkernel like drinking Gatorade is turning me into a Super Bowl quarterback.

(I tried to localise this for a predominantly US audience.)


Are you telling me that when I use Axe deodorant my house won't be flooded by hundreds of nearby - allegedly beautiful - women in their early 20s within a couple of seconds? Outrageous!


No. That one is a hard fact. You just have to find the right variant of axe for your particular neighborhood. Imagine my face when I accidentally stumbled upon that variant.


I use a hatchet.


True, this should have a disclaimer: "* For a very flexible definition of a microkernel"


"given a sufficiently large value of 'micro'"


> eBPF is turning Linux into a microkernel like drinking Gatorade is turning me into a Super Bowl quarterback.

To be fair, you probably aren't any better or worse than Dilfer with or without the gatorade.

> (I tried to localise this for a predominantly US audience.)

localize.


The Internet is not the US.


> The Internet is not the US.

Well, and HN is not the internet. What is your point?


EBPF is ridiculously awesome. It’s safe enough to jit in ring-0!

We built a rust tool chain that can output ebpf elfs :). https://github.com/solana-labs/rust-bpf-builder


EBPF is a super interesting technology but it’s so painfully hard to use it for application development. There are some tools based on LLVM to compile EBPF programs using C as a source language (which is much easier to reason in than the low-level code), but there is a lot of room for improving the developer workflow.


bpftrace is getting pretty good lately, they've added support for stack arguments, so you can do things like trace golang function calls, and get arguments with a one-liner.


I don't see anyone sharing it, but the video for this talk is here: https://www.infoq.com/presentations/facebook-google-bpf-linu...


eBPF programs are vendor kernel modules on steroids: now instead of getting compile failures trying to build your out-of-tree module, your stuff just blows up at runtime.


eBPF has been invaluable in my field (low-latency linux applications) and it changed a lot.

If you had problems working with kernel modules before, you should probably expect to struggle with writing correct code for eBPF too. It's not for everyone.


Can you share the sort of things you've been doing with it?


Most recently I used ebpf to track which other threads were stealing cpu time (and how much) from my latency sensitive cpu-pinned thread. You can do almost anything, the level of introspection into the kernel internals is amazing.


But, you have the huge advantage that if they crash, they don't bring down your system.


This seems to be around the wrong way.

For both traditional kernel modules and eBPF programs, you compile the code ahead of time. For kernel modules, if you have a bug, you load it into the kernel and the kernel hard crashes at runtime. For eBPF programs, the kernel will reject the program before you inject it.

In practice to deploy eBPF programs, you end up adding the kernel verification step into part of your CI/dev workflow so that by the time you ship your programs, you know that they will safely load and safely run in real environments.
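The "rejected before it is injected" step can be illustrated with a toy load-time check in Python. This is a sketch under loud assumptions: the tuple instruction format, `CTX_SIZE`, and the rules are invented for illustration and are far simpler than what the real kernel verifier does. It only shows the principle that a program is statically checked (forward-only jumps so it must terminate, bounded reads) before it is ever run.

```python
CTX_SIZE = 64  # assumed size of the readable context buffer, in bytes (hypothetical)


class VerifierError(Exception):
    pass


def verify(program):
    """Statically reject a toy program that could loop forever or read out of bounds."""
    if not program or program[-1][0] != "exit":
        raise VerifierError("program must end with exit")
    for pc, insn in enumerate(program):
        op = insn[0]
        if op in ("jmp", "jeq"):
            offset = insn[-1]          # relative jump, counted from the next instruction
            target = pc + 1 + offset
            if offset < 0:
                raise VerifierError(f"insn {pc}: backward jump (possible unbounded loop)")
            if target >= len(program):
                raise VerifierError(f"insn {pc}: jump out of program")
        elif op == "load":
            _, _dst, off = insn
            if not 0 <= off < CTX_SIZE:
                raise VerifierError(f"insn {pc}: out-of-bounds read at offset {off}")
        elif op not in ("mov", "exit"):
            raise VerifierError(f"insn {pc}: unknown opcode {op!r}")
    return True


def is_accepted(program):
    """Convenience wrapper for a CI-style check: True if the program would load."""
    try:
        return verify(program)
    except VerifierError:
        return False


good = [("load", 0, 12), ("jeq", 0, 0x800, 1), ("mov", 0, 0), ("exit",)]
looping = [("mov", 0, 1), ("jmp", -2), ("exit",)]
out_of_bounds = [("load", 0, 4096), ("exit",)]
```

A CI job in this spirit would call something like `is_accepted` on every program and fail the build on rejection, which mirrors the workflow described above without claiming to reproduce the kernel's actual checks.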


Pfft, had that with 1987 Amiga 1.3, took Linux another 26 years to get there.


The Tioga editor in Xerox's Cedar already had a native structural active document capability back in 1987, but the most successful commercial Microsoft Office applications, with billions of dollars in budget, still do not have this capability. Your point, exactly?


Everything would be a microkernel if adding some kind of VM or interpreter is enough to get that name, no?

With that logic, could we argue loadable kernel modules (perhaps with proper memory separation) are a sign of a microkernel architecture?


yes. the author of that deck is playing it pretty loose when it comes to the definition of a microkernel.

normally the microkernel means the minimum needed primitives to implement the OS and after that everything is built on top of that, not pluggable modules.

For all intents and purposes the Linux kernel is a monolithic one and the eBPF capability makes it more extensible / less of a pain to do certain things but definitely does not turn it into a microkernel.


> normally the microkernel means the minimum needed primitives to implement the OS and after that everything is built on top of that

Sure, the minimum amount of full trust code. In this case, the full trust code is the eBPF VM which enforces protection boundaries instead of the MMU as in a classic microkernel. I'm not sure a microkernel classification ought to depend on the MMU specifically, it's a general system design philosophy.


it’s not just the memory protection. it’s the scheduling, IPC, etc.

the eBPF vm uses the capabilities of the kernel, it is not the kernel. No kernel, no nothing.

also, following your train of thought I could say that containers make this a microkernel. it would be a claim that would get you laughed out of a room.


A kernel provides trusted runtime services for an operating system.

A microkernel provides a minimal set of trusted runtime services for an operating system, and relies on some protection mechanism for isolating subsystems to avoid corrupting the trusted core. Preemptive scheduling is not necessarily part of it; depends whether your system requires "time" to be a protected resource.

eBPF is a kernel service, just like processes, scheduling, IPC. If eBPF can isolate subsystems and supports safe collaboration of eBPF programs despite all running at ring 0, then the eBPF VM in the Linux kernel could qualify as a microkernel once you remove everything else.

> also, following your train of thought I could say that containers make this a microkernel.

If you could run all of the device drivers in containers such that they couldn't corrupt the kernel's data, then sure, you could run it as a microkernel because you wouldn't have anything left in the kernel except essential services like threading, IPC and containers.


No, a microkernel is only 'the real thing' when 'kernel modules' are simply called 'user processes'.


Turn your linux into a microkernel with this one weird trick: run fuse!


It's turning it into an exokernel.

Check out xok, it had three in kernel virtual machines.

https://github.com/monocasa/exopc/tree/master/sys


Sun did some experiments with building a JVM into their kernel so that you could write device drivers in Java.


Running even more code in supervisor mode != turning into a microkernel.


I was thinking of EBPF as a way to get into Linux kernel development with a modern language, but I'm kinda confused by what I read in the comments. Is it not quite a thing?


eBPF is just an in-kernel VM. You can do a lot of things with it, which makes it hard to figure out what to do with it.

Original BPF is in most Unix kernels, it was just a way of writing simple packet filtering programs that run in-kernel. For example, tcpdump is effectively just a frontend that emits BPF bytecode.
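As a rough illustration of what such a filter program looks like, here is a toy Python interpreter for a four-instruction filter modeled on what `tcpdump -d ip` prints for an Ethernet link (load the EtherType, compare with 0x0800, accept or reject). The tuple encoding and the `run_filter` helper are simplifications invented for this sketch, not the kernel's actual instruction format.

```python
import struct

# (opcode, operand k, jt, jf): a simplified, hypothetical encoding of classic BPF
IP_FILTER = [
    ("ldh", 12, 0, 0),       # A = 16-bit big-endian value at offset 12 (EtherType)
    ("jeq", 0x0800, 0, 1),   # if A == 0x0800 (IPv4) skip 0 insns, else skip 1
    ("ret", 262144, 0, 0),   # accept: keep up to 262144 bytes of the packet
    ("ret", 0, 0, 0),        # reject: keep 0 bytes
]


def run_filter(program, packet):
    """Run a toy filter over raw packet bytes; the return value is the snap length."""
    acc = 0
    pc = 0
    while pc < len(program):
        op, k, jt, jf = program[pc]
        pc += 1
        if op == "ldh":
            if k + 2 > len(packet):
                return 0  # a load past the end of the packet is treated as reject here
            (acc,) = struct.unpack_from(">H", packet, k)
        elif op == "jeq":
            pc += jt if acc == k else jf
        elif op == "ret":
            return k
        else:
            raise ValueError(f"unknown opcode {op!r}")
    return 0


# Fake frames: 12 bytes of MAC addresses, a 2-byte EtherType, then a zeroed payload.
ipv4_frame = bytes(12) + b"\x08\x00" + bytes(20)
arp_frame = bytes(12) + b"\x08\x06" + bytes(28)

ipv4_verdict = run_filter(IP_FILTER, ipv4_frame)
arp_verdict = run_filter(IP_FILTER, arp_frame)
```

Note that every jump in the filter goes forward, so the loop above always terminates; that property is what made it reasonable to run such programs inside the kernel in the first place.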

eBPF expands the capabilities of the VM, but it still has tight restrictions on what can run: no unbounded loops, no arbitrary memory access, etc. I would recommend trying out bpftrace as a first step:

https://github.com/iovisor/bpftrace


"A thorough introduction to eBPF"

https://lwn.net/Articles/740157/

Excerpts:

"While eBPF was originally used for network packet filtering, it turns out that running user-space code inside a sanity-checking virtual machine is a powerful tool for kernel developers and production engineers."

[...]

"The eBPF virtual machine more closely resembles contemporary processors, allowing eBPF instructions to be mapped more closely to the hardware ISA for improved performance."

[...]

"Originally, eBPF was only used internally by the kernel and cBPF programs were translated seamlessly under the hood. But with commit daedfb22451d in 2014, the eBPF virtual machine was exposed directly to user space."

[...]

"What can you do with eBPF?

An eBPF program is "attached" to a designated code path in the kernel. When the code path is traversed, any attached eBPF programs are executed. Given its origin, eBPF is especially suited to writing network programs and it's possible to write programs that attach to a network socket to filter traffic, to classify traffic, and to run network classifier actions. It's even possible to modify the settings of an established network socket with an eBPF program. The XDP project, in particular, uses eBPF to do high-performance packet processing by running eBPF programs at the lowest level of the network stack, immediately after a packet is received.

Another type of filtering performed by the kernel is restricting which system calls a process can use. This is done with seccomp BPF.

eBPF is also useful for debugging the kernel and carrying out performance analysis; programs can be attached to tracepoints, kprobes, and perf events. Because eBPF programs can access kernel data structures, developers can write and test new debugging code without having to recompile the kernel. The implications are obvious for busy engineers debugging issues on live, running systems. It's even possible to use eBPF to debug user-space programs by using Userland Statically Defined Tracepoints."

There, now you understand eBPF.

It is not a Microkernel.

It is an in-kernel Virtual Machine, with access to all of the kernel, whose programs can register for, receive, filter, and optionally act upon or act to moderate, kernel events.

Quite the powerful tool indeed -- but not a Microkernel...
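The "attach to a code path, run when it is traversed" model from the excerpt can be sketched in a few lines of Python. The names here (`ToyHooks`, the `"xdp"` hook label, the `"pass"`/`"drop"` verdicts, the event dictionaries) are hypothetical stand-ins; real attach points and return codes differ per program type.

```python
from collections import defaultdict


class ToyHooks:
    """Hypothetical model of attach points: programs run when a hook fires."""

    def __init__(self):
        self.attached = defaultdict(list)  # hook name -> list of programs

    def attach(self, hook, prog):
        self.attached[hook].append(prog)

    def fire(self, hook, event):
        """Run every program attached to `hook`; any "drop" verdict wins."""
        verdict = "pass"
        for prog in self.attached[hook]:
            if prog(event) == "drop":
                verdict = "drop"
        return verdict


counts = {"packets": 0}


def count_packets(event):
    # Observability-style program: only records, never changes the verdict.
    counts["packets"] += 1
    return "pass"


def block_telnet(event):
    # Filtering-style program: acts on the event.
    return "drop" if event.get("dport") == 23 else "pass"


hooks = ToyHooks()
hooks.attach("xdp", count_packets)
hooks.attach("xdp", block_telnet)

telnet_verdict = hooks.fire("xdp", {"dport": 23})
https_verdict = hooks.fire("xdp", {"dport": 443})
unhooked_verdict = hooks.fire("kprobe:some_function", {})
```

The surrounding "kernel" code stays unchanged while programs are attached and run at the hook, which is the extension-mechanism point being made above, as opposed to moving existing kernel subsystems out into isolated components.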


Tanenbaum lives!


Technically, Linux is just a guest OS, running on top of Minix :)


More like an exokernel.


This is perhaps the most apt description available.


Can device drivers be written in EBPF?


I saw a link on HN a few months back that was going to do the same thing with WASM.


> Rebooting 20,000 servers takes a very long time without risking extensive downtime.

With eBPF, hot-patching servers will take a very short time to start the extensive downtime, plus the consequent reboot of 20,000 servers.


It's not.


As always, worse is better™!



