To that, I add this article, which is sufficient to destroy that idea.
However, stuff like WebAssembly and other sandboxing methods can be used to put two processes into the same address space. Your filesystem driver then simply lives as a module in that address space while still being a normal process. The IPC turns into a simple jump using a pointer value provided by the kernel (which, depending on the trust level, can provide parameter validation or can be a plain pointer to the correct function).
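A minimal sketch of that idea, with entirely made-up names: the kernel (or its loader) hands the co-located module a table of entry points, so a call into the filesystem driver is an indirect jump rather than a trap and a context switch.

    /* Hypothetical sketch: the kernel fills in this table when it maps the
     * driver module into the shared address space. All names are invented. */
    struct fs_driver_ops {
        int  (*open)(const char *path, int flags);
        long (*read)(int fd, void *buf, unsigned long len);
    };

    /* Provided by the (hypothetical) kernel loader at module load time. */
    extern const struct fs_driver_ops *fs_ops;

    static long read_file(const char *path, void *buf, unsigned long len)
    {
        int fd = fs_ops->open(path, 0);   /* plain indirect call, no trap */
        if (fd < 0)
            return fd;
        return fs_ops->read(fd, buf, len);
    }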
The cost of switching processes is not a problem of address space; the MMU makes sure of that, since shared memory is a reasonable tradeoff for certain domains where performance is necessary. Normally you'd rather use messages, but that is a different topic.
In either case, the cost of a process switch is handling the registers and cache - that will not go away no matter how you do it, which is why a multicore implementation with messages can actually turn out to be faster. Less switching and more locality.
The cost of switching processes is significantly reduced when you don't need to switch privilege levels; not having to invalidate the TLB, or to change the address space at all, makes a context switch not significantly more expensive than a function call.
You get compile-time isolation and can still take advantage of the MMU when needed.
And using such an implementation does not prevent you from implementing your driver so that it can run on each core and take advantage of that, or even pass messages. An ethernet driver could still, for example, pass a message when the "send data to tcp socket" function is called, while allowing another program to use the same function without any message passing, depending on what is better for your use case.
If a driver runs in user mode, an exploit needs to exploit the hardware as well - and that is for all intents and purposes something that we see very rarely.
If the same driver runs in "software user mode" but executes as supervisor (basically inside a VM environment), we need constant security checks in software, and an exploit now has the VM code to further exploit; if successful, that will automatically grant it supervisor access.
In both cases it's assumed that neither implementation has access to more interfaces than necessary for it to do its work. For instance, a driver for a mouse does not need access to the disks.
From my experience, a lot of hardware is terribly insecure against exploits. Not necessarily the CPU but stuff like your GPU or HBAs, ethernet cards, etc.
With software containment, the advantage is that you can set it up so that drivers need to declare their interfaces and privileges beforehand. In an ELF or WASM binary you have to declare imported and linked functions; it should not be difficult to leverage that to determine what a driver can effectively do. With WASM you get the added benefit that doing anything other than using the declared interfaces results in a compile-time error.
A driver can be written so that a minimal, audited interface exists to talk to the hardware almost directly, with some security checks, while the WASM part handles the larger logic and provides the actual functionality.
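Roughly what "declare your interfaces beforehand" could look like, with purely illustrative names: the driver ships a static manifest of the host interfaces it needs, and the loader links only those, refusing anything not listed.

    /* Hypothetical sketch - none of these names come from a real system. */
    enum host_iface { IFACE_PCI_CONFIG, IFACE_DMA_ALLOC, IFACE_IRQ_REGISTER };

    struct driver_manifest {
        const char            *name;
        const enum host_iface *imports;
        unsigned               n_imports;
    };

    static const enum host_iface mouse_imports[] = { IFACE_IRQ_REGISTER };

    /* The loader checks the module's actual import list against this and
     * refuses to link anything else - no disk or network access requested,
     * none granted. */
    const struct driver_manifest manifest = {
        .name      = "usb-mouse",
        .imports   = mouse_imports,
        .n_imports = 1,
    };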
WASM isn't a supervisor, so exploits on VM code aren't that relevant. Exploiting the WASM compiler/interpreter/JIT is more interesting, but those are exposed to the daily internet exploit shitstorm, so I think they are fairly safe.
> it should not be difficult to ...
Famous last words.
So the trusted core part of the OS can run without any spectre prevention, though you can still enable the various hardware protections available in the chicken bits.
And if it's necessary to protect against spectre attacks, you can use shim layers or even isolation into ring3 to take preventative measures. This allows leveraging performance where it's important and security where it's necessary.
If it's in webassembly, you can even run two versions of a driver: one with spectre mitigations compiled in and one without, sharing one memory space; the kernel can choose to invoke either one depending on the call chain.
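A sketch of what that dispatch could look like (hypothetical names; the two entry points would be the same driver source compiled with and without speculation barriers):

    /* Same driver code built twice; the kernel picks the variant per call. */
    typedef int (*drv_ioctl_fn)(unsigned cmd, void *arg);

    extern int drv_ioctl_fast(unsigned cmd, void *arg);      /* no mitigations       */
    extern int drv_ioctl_hardened(unsigned cmd, void *arg);  /* barriers compiled in */

    static int dispatch_ioctl(int caller_is_trusted, unsigned cmd, void *arg)
    {
        drv_ioctl_fn fn = caller_is_trusted ? drv_ioctl_fast : drv_ioctl_hardened;
        return fn(cmd, arg);
    }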
As far as I remember, they weren't able to completely defend against side-channel attacks within the same process and decided to rely on process isolation instead, estimating that it would be too much work to address all known spectre-class vulnerabilities in their existing compilers and too hard to ensure the defenses wouldn't be broken later by compiler developers.
Drivers tend to have high bug density, and supervisor mode means high potential for damage.
Much can be mitigated by running drivers in userspace, especially with some IOMMU help.
I think you can run large parts of the filesystem and network stack in user-land now.
And even if Apple now talks about user-mode drivers - do we even know what percentage they aim for?
They are following a two-release process: in release N, the user-space drivers for a specific class get introduced and the respective kernel APIs are automatically deprecated; in release N + 1, those deprecated APIs are removed.
I saw a talk a couple of years ago by Tanenbaum where he said he would be ok with Minix being 20% slower than a monolithic kernel like Linux, indicating that it was currently slower than that.
Granted, Minix has not seen the type of optimizations that popular monolithic kernels have, due to lack of manpower.
So, I really look forward to seeing benchmarks made between monolithic kernels and new micro kernels like Google's Zircon, and Redox once they've had sufficient time to mature.
It doesn't indicate anything of the sort. He was making the point that the reliability and security benefits of microkernels are simply more important than performance, in his mind.
What are the significant advances in microkernels that do not apply to monolithic kernels?
Also they are not only VERY old, they seem intentionally vague when it comes to actual data about the systems they are comparing against. In short, I welcome the new micro kernels so that we can see a comparison between modern monolithic and modern micro kernels and actually get a good representation of what the performance difference is.
Because if not for performance, there is no reason not to use a micro kernel.
I don't remember the OS4000 context switching time but it was fast in comparison with other systems of the 1980s, and it was very fast in actual use (running real-time and interactive processes). The performance of L4Linux is quoted within the typical margin of speculative execution mitigations relative to Linux. However, it's a strange idea to me that speed is all that matters for an OS, and not reliability, trustworthiness, etc.
- paging is a very efficient way to copy a block of data from one process to another (see the sketch after this list).
- the perceived speed of an interactive system has everything to do with responsiveness and very little with actual throughput. And responsiveness is associated with near real-time operation which happens to be something micro kernel based systems excel at.
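To make the paging point concrete, here is a small sketch using ordinary Linux primitives (memfd plus mmap); "copying" the block to the other process really just means letting it map the same pages, e.g. after passing the fd over a unix socket:

    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <unistd.h>

    /* Sender: put the payload into pages backed by an anonymous file. */
    static int make_block(size_t len)
    {
        int fd = memfd_create("payload", 0);
        if (fd < 0 || ftruncate(fd, len) < 0)
            return -1;
        return fd;   /* hand this fd to the peer, e.g. via SCM_RIGHTS */
    }

    /* Receiver: map the same physical pages instead of copying the bytes. */
    static void *map_block(int fd, size_t len)
    {
        return mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    }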
The deal with user-facing libraries like this is that I'd rather they generalize this and expose the existing network drivers through a uring interface, and let the user processes take care of packet decap the way they want.
Of course, it's helpful to have the stub to map the PCI BARs to userspace, and hopefully without any message-signalled interrupts.
These two alone may be hard to do with all the existing network drivers out there. Engineering feats like these are good but not helpful to most people unless they are simple and generic enough to work on most devices.
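For reference, roughly what "a uring interface" looks like today for an ordinary socket, using liburing (error handling trimmed); the ask above is for the same shape sitting directly on top of the NIC drivers:

    #include <liburing.h>

    /* Receive one message from a socket through an io_uring queue. */
    static int recv_one(int sockfd, void *buf, size_t len)
    {
        struct io_uring ring;
        struct io_uring_sqe *sqe;
        struct io_uring_cqe *cqe;
        int res;

        io_uring_queue_init(8, &ring, 0);

        sqe = io_uring_get_sqe(&ring);
        io_uring_prep_recv(sqe, sockfd, buf, len, 0);
        io_uring_submit(&ring);

        io_uring_wait_cqe(&ring, &cqe);
        res = cqe->res;                   /* bytes received or -errno */
        io_uring_cqe_seen(&ring, cqe);
        io_uring_queue_exit(&ring);
        return res;
    }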
I hope the guys who wrote this take note and eventually layer out and open source this library.
Are there many systems using microkernels in production environments? I mean at least at this kind, or a somewhat similar, scale?
3 years seems like a long time; up until this moment microkernels seemed fairly niche to me and something reserved for more experimental systems.
The only one I can think of off the top of my head is Fuchsia.
Edit: The Wikipedia entry has a little less fluff than the official homepage: https://en.wikipedia.org/wiki/QNX
Is this just difficult to design well or are people genuinely okay with socket(AF_UNIX, SOCK_SEQPACKET, 0)?
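For what it's worth, this is what being "genuinely okay" with it amounts to: a connected AF_UNIX/SOCK_SEQPACKET pair already gives you reliable, message-boundary-preserving IPC with nothing but POSIX calls.

    #include <sys/socket.h>

    static int seqpacket_demo(void)
    {
        int sv[2];
        char buf[64];

        /* Connected pair of SOCK_SEQPACKET sockets: reliable, ordered,
         * and record boundaries are preserved. */
        if (socketpair(AF_UNIX, SOCK_SEQPACKET, 0, sv) < 0)
            return -1;

        send(sv[0], "hello", 5, 0);                   /* one message...           */
        return (int)recv(sv[1], buf, sizeof buf, 0);  /* ...arrives as one record */
    }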
If at the end people continue building their own IPC mechanisms on top of TCP/IP or unix domain sockets then the in-kernel mechanism will just be another thing to maintain.
There have been some endeavours to bring new IPC mechanisms into the Linux kernel (AF_BUS, KDBUS, BUS1). I think those failed for similar reasons (although I'm not sure where BUS1 stands now - the others are definitely discontinued).
You can go all the way from the lowest level kernel uses to very high level application constructs with that. Re-inventing existing wheels badly is something the software world excels at.
Not quite. Those functions need established connections, so you need ConnectAttach and ConnectDetach as well. But none of that is useful unless you can identify clients so you also need ConnectClientInfo.
This isn’t a dig; having worked on a custom IPC system, I found the QNX approach to be the best of all worlds.
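For anyone who hasn't seen the QNX shape, a rough sketch from memory of the Neutrino docs (check <sys/neutrino.h> for the exact signatures and flags):

    #include <sys/neutrino.h>
    #include <sys/types.h>

    /* Server side: create a channel, block for a message, reply to unblock
     * that specific sender. */
    static void server_once(void)
    {
        char msg[256];
        int chid  = ChannelCreate(0);
        int rcvid = MsgReceive(chid, msg, sizeof msg, NULL);
        MsgReply(rcvid, 0, "ok", 3);
    }

    /* Client side: the connection has to be established first, hence
     * ConnectAttach/ConnectDetach. MsgSend blocks until the reply arrives. */
    static void client_once(pid_t server_pid, int chid)
    {
        char reply[16];
        int coid = ConnectAttach(0, server_pid, chid, _NTO_SIDE_CHANNEL, 0);
        MsgSend(coid, "ping", 5, reply, sizeof reply);
        ConnectDetach(coid);
    }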
Every form of IPC can be implemented on top of asynchronous message passing. Interface is not the problem. The problem is high performance designs with all the batching, memory mapped buffers, no syscalls, etc.
Sure, if you want to introduce inherent DoS vulnerabilities into your IPC subsystem, not to mention slow down IPC so much that it's practically unusable. Many early microkernels were asynchronous, and synchronous microkernels like L4 beat them easily every time.
Furthermore, synchronous IPC can be made immune to the DoS that is inherent to async IPC: https://www.researchgate.net/publication/4015956_Vulnerabili...
Fuchsia certainly is async, but it’s not of the L4 family.
I know only of async notifications, which I believe require no allocation of storage and so don't open up DoS opportunities.
This is nonsense. There could be DoS vulnerabilities in implementations, but they are not inherent to async message passing.
1. If they're booked to the receiver, then clients can easily DoS receivers by flooding them with messages.
2. If they're booked to the sender, then receivers can easily DoS senders by blocking indefinitely.
3. If they're booked to the kernel (which is most common for true async message passing, unfortunately), then senders or receivers can DoS the whole system by the above two mechanisms.
And that's only the most basic analysis. I suggest you read the paper I linked and its references if you want a more in-depth analysis of IPC vulnerabilities and performance properties.
In a high-performance scenario it would be more like this: a process shares fixed ring-like buffers with the kernel where it can put messages; messages it can't put there it either accumulates locally until it can, or just drops; and there would be some kind of polling or event notification mechanism to know when it can put more messages into, and get more messages from, the shared buffers.
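A minimal sketch of that shape (illustrative only; in practice the indices and slots would live in memory shared with the kernel or peer): the producer never blocks, it just reports "full" and lets the caller accumulate or drop.

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <string.h>

    #define RING_SLOTS 256
    #define MSG_SIZE   64

    /* Single-producer/single-consumer fixed ring. */
    struct ring {
        _Atomic unsigned head;               /* advanced by the producer */
        _Atomic unsigned tail;               /* advanced by the consumer */
        char slots[RING_SLOTS][MSG_SIZE];
    };

    static bool ring_put(struct ring *r, const void *msg, size_t len)
    {
        unsigned head = atomic_load_explicit(&r->head, memory_order_relaxed);
        unsigned tail = atomic_load_explicit(&r->tail, memory_order_acquire);

        if (head - tail == RING_SLOTS)       /* full: caller accumulates or drops */
            return false;

        memcpy(r->slots[head % RING_SLOTS], msg, len < MSG_SIZE ? len : MSG_SIZE);
        atomic_store_explicit(&r->head, head + 1, memory_order_release);
        return true;
    }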
P.S. I can't access the paper, but presumably they are making the same faulty assumptions if they claim the same things you did.
> process shares fixed ring-like buffers with the kernel where it can put messages
I am making a claim like "Turing machines can't solve the Halting problem", and you are saying, "If you put a limit on the number of computation steps, then the Halting problem is decidable". But such a system is no longer a Turing machine.
What you are describing is not asynchronous IPC. With async IPC, you ought to be able to send a message at any time without blocking. That's what async IPC means.
If you must sometimes block or throttle before you can successfully send a message, even if only in principle, then it's no longer async IPC. It is instead a mixed sync/async system, which invariably becomes necessary in order to address the inherent limitations of async IPC.
> messages it can't put there it either accumulates locally until it can or just drops
So DoS against the sender, like I said. Try assuming a less liberal threat model and see how far async IPC takes you.
The need to handle back pressure is exactly why it's not pure async IPC.
> And still no DoS, it's completely up to the application to decide what to do with the messages it generates too fast.
And if the program can't discard messages, then it's a DoS. If the program can instead rely on the receiver to keep up so it doesn't need to make this choice, then there's a trust assumption between these processes.
There's no escaping this tradeoff with async IPC.
It doesn't work like that, and this is getting into hypothetical, non-real-world systems again. Trust is especially interesting in this context, because if you don't trust other processes, you absolutely have to be able to discard their messages. They can misbehave, crash, or stop responding at any time.
But say you somehow can't discard messages and don't trust them. It's still only about backpressure handling. For example, the kernel can simply refuse to send messages to a particular recipient it knows is not consuming its incoming messages, and instead return those messages to the senders, into their incoming buffers or rejected buffers or whatever. Senders can decide what to do with that information: wait for the recipient to become ready again (waiting is not DoS), stop generating messages for it, accumulate them, or just drop them. Minimal cooperation is required, of course, but not trust; if senders don't cooperate they hurt no one but themselves. It's all still pure asynchronous stuff. And in fact, all high-performance real-world asynchronous communication deals with backpressure without DoS and without trust - although it can also discard messages.
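A tiny sketch of that cooperation, with made-up names (try_send and its return codes are hypothetical): the transport reports that the recipient isn't draining its queue, and the sender alone decides how to react, so nothing blocks and nothing is starved.

    /* Hypothetical transport call: never blocks, just reports the outcome. */
    enum send_result { SEND_OK, SEND_PEER_BUSY };
    extern enum send_result try_send(int dest, const void *msg, unsigned len);

    static void send_with_backpressure(int dest, const void *msg, unsigned len)
    {
        if (try_send(dest, msg, len) == SEND_PEER_BUSY) {
            /* Sender's choice: drop here; it could equally queue the message
             * locally or retry after a readiness notification. */
        }
    }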
They make use of Android IPC to communicate among themselves and with the kernel.
Now, given the role of Linux on Android and what is accessible to userspace, maybe we shouldn't consider it a *NIX system anyway.
> Do they pay people to do this?
Yes you fucking bet they do.
Makes my point