Good job. This is where microkernels and exokernels work much better. Putting complex subsystems such as encryption and filesystems into the kernel is a terrible idea and this is just one of the reasons why.
While what you said may be theoretically true, does anyone run microkernels or exokernels on servers in production? I know microkernels like QNX are extremely popular and effective in the embedded market, and as far as I can tell exokernels are more research oriented. But if I want to use a server with dozens of cores, hundreds of gigs of memory, dozens of drives, and several NICs, what microkernel could I use? Currently I would use FreeBSD for such a role. What microkernel or exokernel provides all the features of FreeBSD on the server, like a COW filesystem, visibility (DTrace, vmstat, iostat, etc.), OS-level virtualization (jails), a hypervisor (bhyve), a performant network stack, and Linux binary emulation, and is open source? Unless people write a microkernel or exokernel that can replace Linux or FreeBSD on the server, people will continue to put monolithic kernels into production.
L4 runs on every phone in the world (to a first approximation).
Xen (derived from the original exokernel work) runs AWS on a bazillion servers.
Linux, FreeBSD, and C are examples of how a heroic amount of work can make up for poor design, but it's clear that with a better architecture and the same number of developers they would be much more featureful.
It's also significant, I think, that all the important distributed filesystems (e.g. Gluster) are implemented as userspace servers, not as kernel drivers. Once you go beyond a certain level of complexity, implementing a kernel filesystem isn't really feasible even with heroic effort.
> Xen (derived from the original exokernel work) runs AWS on a bazillion servers.
So there's an interesting point there: what's the practical difference between a microkernel architecture and a microservices architecture?
Local privilege escalation on Linux has been entirely too easy. But privilege escalation on Xen (from one domain to another) has been relatively hard, and AWS works on that assumption. So you treat each Linux image as if it's within one privilege domain, and exploits like the one in this article don't affect you. You can't get from one VM to another, because each VM has its own filesystem server.
The only practical difference, I think, is that microservices exist in the real world, and microkernels have been tried many times and have (with a few exceptions) failed. But the security benefits are about the same.
It's pretty similar from a security standpoint. The point of a microkernel is to get process-isolation guarantees at a much finer-grained level than traditional monolithic kernels allow: not only do you guarantee that processes can't stomp all over each other, but you also guarantee that they can't stomp on kernel data structures like filesystems or drivers. Microservices give this as well, since if there's only one process running on the box, there's not much for it to stomp on (except, well, all the other OS cruft like shells and debugging tools that typically goes on a VM but could be used to turn a pwned box into a spam farm).
The big downside of microservices is the complexity they introduce, both in terms of the networking & IPC overhead (which isn't free, and in many cases can consume a majority of the CPU time of the service) and in terms of programming & devops time. Hence some of the interest in unikernels, which both eliminate the other OS cruft that's attackable and eliminate much of the CPU overhead of communicating between microservices. The tooling doesn't really exist for unikernels yet, though - there's no pragmatic, drop-in solution where you can just write a binary and be able to trace & debug performance problems on a production box.
Yeah, you're right about that. But my point still stands: almost all of AWS is being used to host monolithic kernels, because the monolithic kernels (Linux, FreeBSD) have the features people want for running their business/organization.
> But my point still stands: almost all of AWS is being used to host monolithic kernels, because the monolithic kernels (Linux, FreeBSD) have the features people want
The implication there is that they have the features because they are monolithic kernels, which is presuming the very thing being discussed. The only thing you can say without a lot more proof is that people use Linux or FreeBSD because they have features people want. Whether being monolithic is the cause of those features, or merely correlated with them, is not established (in this discussion) yet.
> The implication there is that they have the features because they are monolithic kernels, which is presuming the very thing being discussed. The only thing you can say without a lot more proof is that people use Linux or FreeBSD because they have features people want. Whether being monolithic is the cause of those features, or merely correlated with them, is not established (in this discussion) yet.
I'm not saying FreeBSD and Linux have those features because they are monolithic; I'm saying people use monolithic kernels because they are more feature-complete at this point in time. More specifically, if I want to run a big web app in production, there are no suitable microkernels or exokernels to do so now.
> More specifically, if I want to run a big web app in production, there are no suitable microkernels or exokernels to do so now.
Windows NT is a hybrid kernel[1], so there's somewhat conflicting evidence.
Really, I think I could make an argument that the reason a microkernel hasn't gained popularity is that there's been far too much focus on performance, to the detriment of security and stability, in much of the last few decades of program development, let alone OS development. People aren't good at correctly estimating risk and reward for future events that are more than a few years out, and they're worse when the risk is abstract in nature.
Would everything be running slower if we used microkernels for the majority of systems? Probably (but who knows by how much? If there were a lot more work focused on that problem, we might have ways to alleviate much of it by now). Would our systems be more secure and less prone to software errors? I think so. Do I think it's possible we as a culture (species?) could choose stability and security over performance? Not without enforcing negative consequences when developers supply buggy and/or exploitable code, and consequences for the companies that choose to run that code even when they are carefully advised of the possible consequences. So, not likely, and that's a huge discussion in itself.
I'd call the original Xbox and the Xbox 360's OS an exokernel. Threads, timers, encryption, and interrupt dispatching were handled in kernel, but USB, Ethernet, audio, and GPU drivers ran in the application image.
I don't think that this would have been solved by keeping encryption and filesystems outside of the kernel. A stack overflow is a stack overflow, wherever it is. If you split out all the relevant code out of the kernel but changed nothing else, you'd still be able to escalate privileges to run code in the context of the unencrypted filesystem. At that point, the fact that you weren't able to get to the microkernel is sort of academic—everything of value is in the unencrypted filesystem.
This might be an argument for greater stack protections (say, guard pages) on sensitive code, but you can do that just fine on a monolithic kernel.
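For what it's worth, here's a rough userspace sketch of what I mean by a guard page (sizes are arbitrary and error handling is trimmed; this isn't any particular kernel's mechanism, just the idea):

    /* Rough sketch of a guard page: map a region, then make its lowest page
     * inaccessible so anything that runs off the bottom faults immediately
     * instead of silently scribbling on adjacent memory. */
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        long page = sysconf(_SC_PAGESIZE);
        size_t size = 64 * 1024;                 /* pretend this is a stack */

        char *base = mmap(NULL, size, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (base == MAP_FAILED)
            return 1;

        /* Stacks grow down, so the lowest page becomes the guard. */
        if (mprotect(base, page, PROT_NONE) != 0)
            return 1;

        base[size - 1] = 'x';                    /* fine: inside the usable area */
        base[0] = 'x';                           /* SIGSEGV: lands on the guard page */
        puts("never reached");
        return 0;
    }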
In both a microkernel and an exokernel, the userspace servers (micro) or the shared libraries (exo) have effectively unlimited stack. If you do exceed the stack limits then the process will crash, which for a microkernel means the service is restarted afresh while for an exokernel means the user process dies.
Whereas in Linux you have 16K stack (or only 8K up to a few years ago) and if you exceed it you can overwrite bits of kernel memory, including a crucial struct which was used to exploit the whole kernel as in this exploit. Having a guard page is technically difficult in Linux since there's no virtual memory in the kernel, so you can't have an unmapped page, and even having an extra page is difficult because kernel stacks have to be allocated contiguously (this point might not apply to other monolithic kernels however).
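A quick userspace illustration of the first half of that contrast (frame size and depth below are arbitrary): blow through the stack and the process just dies at the guard page, with nothing else corrupted.

    /* Deliberately overflow the userspace stack.  With a default ~8 MiB stack
     * this recursion hits the guard page and the process is killed; it never
     * gets to overwrite anything belonging to another component. */
    #include <stdio.h>
    #include <string.h>

    static long deep(long n)
    {
        char frame[16 * 1024];                   /* each call burns 16 KiB */
        memset(frame, (int)n, sizeof frame);     /* keep the frame from being elided */
        if (n == 0)
            return frame[0];
        return deep(n - 1) + frame[1];
    }

    int main(void)
    {
        printf("%ld\n", deep(1000 * 1000));      /* crashes long before returning */
        return 0;
    }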
So I believe there is a real difference.
However my main point is that userspace code is more flexible, so:
- It's a lot easier to write than kernel code.
- You can write it in different languages, perhaps ones which by design don't have buffer overflows at all.
- Even if you write your servers in C, you can enable many more exploit protection features in userspace code.
- You can use more complex algorithms, including recursive ones (think trees), so your code might be faster.
- You can use garbage collection which means you can manage memory without going crazy, and should have fewer memory leaks.
- The barrier to entry is lower, so I'd expect to see a lot more features (eg. strange and exotic filesystems), even if some of those features might be a bit half-baked.
You absolutely have virtual memory in the kernel—that's how the kernel gets to live at a fixed high address in every process, and that's how it gets to read things from userspace when it needs to. When a process makes a system call, the virtual memory map stays exactly the same; the CPU just switches to supervisor mode and starts using a different segment that enables accessing kernel memory.
What you don't easily have, at least on Linux, is the ability to page things out (which is sometimes colloquially called "virtual memory" thanks to some customer-facing marketing in the '90s), but that's not an inherent requirement of monolithic kernels. NT, for instance, allows paging out certain parts of kernel memory (though obviously not the parts that are needed to page things back in), at the cost of a bit of mental overhead from driver authors. And this isn't necessarily solved in microkernels or exokernels: your swap device is going to be supported by at least a few userspace drivers. If you're using a machine with full-disk encryption, your disk encryption server isn't allowed to swap.
What you also don't have on Linux is different mappings for different kernel threads (all of kernel memory is mapped in every thread, accessible by kernel code only) and memory protection between kernel threads, but you don't really need that, and that wouldn't have helped solve this problem.
> kernel stacks have to be allocated contiguously
There's no particular reason that segmented stacks (like old Go or old Rust had) couldn't be used in the kernel. They're just complicated, and both Go and Rust found that they were bad for performance.
> It's a lot easier to write than kernel code.
I'm not convinced by this. Apart from the experimental evidence that monolithic kernels have been far more featureful, there's the inherent problem that you don't really know what assumptions you get to make. You're only in mostly-userspace code; there's the example I mentioned above of the full-disk-encryption driver, which doesn't get to swap because it would be implementing its own swap device. If you want your OS to support swap files, filesystem drivers can't swap. If you have syslog over local sockets, your socket implementation can't syslog. And so forth.
What is easier (as I advocated in another comment) is to keep the monolithic kernel model, but pretend that privsep within a single kernel doesn't happen, and run it all within VMs.
> There's no particular reason that segmented stacks (like old Go or old Rust had) couldn't be used in the kernel.
A very good reason is that Linux happens to be written in C, which just doesn't support this. So the patch you linked uses VM mappings instead.
> What is easier (as I advocated in another comment) is to keep the monolithic kernel model, but pretend that privsep within a single kernel doesn't happen, and run it all within VMs.
That's the exokernel model, with hypervisor being the kernel and Linux being a private "libOS" of your httpd or whatever.
> A very good reason is that Linux happens to be written in C, which just doesn't support this.
This is a tangent, but I thought that C supported this just fine as long as there was compiler and runtime support? The compiler needs to insert a check at each stack allocation (beginning of each function, or each call to alloca or equivalent), and if it switches your stack pointer, it needs a trampoline as the last frame on the new stack, to un-switch the stack pointer. And you need to teach everything that traces stacks how to keep tracing across segments. But it doesn't seem impossible.
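Something like this toy model is what I have in mind. To be clear, the names (stack_limit, morestack) and the mechanics are made up for illustration, not any real compiler's ABI, and the "switch to a new segment" part is only stubbed out:

    /* Toy model of a compiler-inserted segmented-stack prologue: before using
     * its frame, each function checks whether the current segment has room,
     * and calls out to a runtime routine if not.  In a real implementation
     * morestack() would allocate a new segment, switch the stack pointer, and
     * install a trampoline to switch back on return; here it only reports. */
    #include <stdio.h>
    #include <stdint.h>

    static __thread uintptr_t stack_limit;       /* lowest address this segment may use */

    static void morestack(size_t frame_size)
    {
        printf("would switch to a new segment for a %zu-byte frame\n", frame_size);
    }

    static void some_function(void)
    {
        enum { FRAME_SIZE = 64 * 1024 };

        /* The check the compiler would emit at function entry. */
        if ((uintptr_t)__builtin_frame_address(0) - FRAME_SIZE < stack_limit)
            morestack(FRAME_SIZE);

        /* ... the function's actual body and locals would go here ... */
    }

    int main(void)
    {
        /* Pretend this thread's current segment has only 16 KiB left. */
        stack_limit = (uintptr_t)__builtin_frame_address(0) - 16 * 1024;
        some_function();                         /* a 64 KiB frame won't fit -> morestack() */
        return 0;
    }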
ecryptfs shouldn't be running kernel_read() from the page fault handler anyway - it should pass the read off to a work queue and put the process to sleep until it completes.
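Roughly this pattern, I mean (pure sketch, not the real eCryptfs code; the struct layout and the kernel_read() signature are illustrative and vary by kernel version):

    /* Sketch: hand the lower read off to a workqueue and sleep until it
     * completes, so the deep I/O path runs on a kworker's fresh stack rather
     * than on the faulting task's small fixed kernel stack. */
    #include <linux/kernel.h>
    #include <linux/workqueue.h>
    #include <linux/completion.h>
    #include <linux/fs.h>

    struct deferred_read {
        struct work_struct work;
        struct file *lower_file;
        void *buf;
        size_t count;
        loff_t pos;
        ssize_t ret;
        struct completion done;
    };

    static void deferred_read_fn(struct work_struct *work)
    {
        struct deferred_read *req = container_of(work, struct deferred_read, work);

        req->ret = kernel_read(req->lower_file, req->buf, req->count, &req->pos);
        complete(&req->done);
    }

    static ssize_t read_lower_deferred(struct file *lower_file, void *buf,
                                       size_t count, loff_t pos)
    {
        struct deferred_read req = {
            .lower_file = lower_file,
            .buf = buf,
            .count = count,
            .pos = pos,
        };

        INIT_WORK_ONSTACK(&req.work, deferred_read_fn);
        init_completion(&req.done);
        schedule_work(&req.work);
        wait_for_completion(&req.done);   /* caller sleeps until the kworker is done */
        destroy_work_on_stack(&req.work);

        return req.ret;
    }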