I remember seeing the PCI userspace option in the Linux kernel menuconfig and wondered why anyone would do that, and then a few years ago at Kiwicon I saw my first use case. A presenter was trying to hack a Cisco router.
Older Cisco routers ran IOS directly on proprietary hardware. At some point, Cisco decided to switch to Intel hardware but didn't port their kernel. They used a Linux kernel and ran IOS as a huge 50MB+ binary. The guy giving the talk got shell access and found only one Ethernet device when running ifconfig. The actual switching hardware was being handled in userspace by the large binary.
I'm guessing they probably just wrote some shim layers to connect their PCI drivers up to the userspace PCI Linux API.
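For what it's worth, Linux makes that kind of shim fairly easy: sysfs exposes each PCI device's BARs as resourceN files that userspace can mmap directly. Here's a minimal sketch of the idea (the device address 0000:03:00.0 and the register offset are made-up placeholders, not anything Cisco-specific):

    /* Sketch: map BAR0 of a PCI device from userspace via sysfs.
     * The BDF and register offset below are placeholders. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        const char *bar0 = "/sys/bus/pci/devices/0000:03:00.0/resource0";

        int fd = open(bar0, O_RDWR | O_SYNC);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

        /* Map the whole BAR; loads/stores go straight to device MMIO. */
        volatile uint32_t *regs = mmap(NULL, st.st_size,
                                       PROT_READ | PROT_WRITE,
                                       MAP_SHARED, fd, 0);
        if (regs == MAP_FAILED) { perror("mmap"); return 1; }

        /* Read a (hypothetical) 32-bit register at offset 0x10. */
        printf("reg[0x10] = 0x%08x\n", regs[0x10 / 4]);

        munmap((void *)regs, st.st_size);
        close(fd);
        return 0;
    }

A vendor driver ported this way would basically swap its kernel readl/writel calls for accesses through a mapping like that, plus something like UIO or VFIO for interrupts.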
Well, most of the magic of hardware routers also comes in the form of hardware acceleration of the actual data plane - ie. L3 switching - on the silicon itself. That's what makes them fast and so expensive. That sort of mechanism doesn't map nicely into the Linux interface paradigm (unless you do things like exporting kernel routes into the hardware, but that's borderline absurd).
I think even if the driver were to be implemented in kernelspace, it would still probably not expose any of its physical interfaces to userspace as plain ethernet devices, maybe apart from virtual/mgmt ones to run SSH on, and perhaps one so that the kernel can handle packets that the router doesn't have flows programmed for (like in OpenFlow).
> That sort of mechanism doesn't map nicely into the Linux interface paradigm (unless you do things like exporting kernel routes into the hardware, but that's borderline absurd).
Not absurd at all. Cumulus (which I cofounded) does exactly that. There are >1000 customers, including several of the largest cloud operators in the world.
It works really well in practice, since you can just fall back to the kernel for non-fast-path stuff like ARP. IOS/NXOS implement ARP (and everything else) themselves. We can just use the kernel's implementation.
The idea is essentially to use the lightning fast forwarding ASIC as a hardware accelerator for the networking functionality the kernel already has.
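To make the idea concrete (this is just a sketch, not Cumulus's actual mechanism): a daemon can subscribe to the kernel's IPv4 routing updates over rtnetlink and mirror each new route into the ASIC. The asic_install_route() call below is a placeholder for whatever the switch vendor's SDK provides, and a real daemon would also dump the existing table at startup:

    /* Sketch: mirror kernel IPv4 route updates into a hypothetical ASIC SDK. */
    #include <arpa/inet.h>
    #include <linux/rtnetlink.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    static void asic_install_route(struct in_addr dst, int prefixlen, int oif)
    {
        /* Placeholder for the vendor SDK call. */
        printf("program ASIC: %s/%d via ifindex %d\n",
               inet_ntoa(dst), prefixlen, oif);
    }

    int main(void)
    {
        int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
        struct sockaddr_nl sa = { .nl_family = AF_NETLINK,
                                  .nl_groups = RTMGRP_IPV4_ROUTE };
        bind(fd, (struct sockaddr *)&sa, sizeof(sa));

        char buf[8192];
        for (;;) {
            int len = recv(fd, buf, sizeof(buf), 0);
            if (len <= 0)
                break;

            for (struct nlmsghdr *nh = (struct nlmsghdr *)buf;
                 NLMSG_OK(nh, len); nh = NLMSG_NEXT(nh, len)) {
                if (nh->nlmsg_type != RTM_NEWROUTE)
                    continue;

                struct rtmsg *rtm = NLMSG_DATA(nh);
                struct in_addr dst = { 0 };
                int oif = -1;

                int attrlen = RTM_PAYLOAD(nh);
                for (struct rtattr *rta = RTM_RTA(rtm); RTA_OK(rta, attrlen);
                     rta = RTA_NEXT(rta, attrlen)) {
                    if (rta->rta_type == RTA_DST)
                        memcpy(&dst, RTA_DATA(rta), sizeof(dst));
                    else if (rta->rta_type == RTA_OIF)
                        oif = *(int *)RTA_DATA(rta);
                }
                asic_install_route(dst, rtm->rtm_dst_len, oif);
            }
        }
        close(fd);
        return 0;
    }

The kernel stays the source of truth for the FIB; the ASIC is just a cache of it, which is why punting exceptions (ARP, unknown destinations) back to the kernel works so naturally.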
> I think even if the driver were to be implemented in kernelspace, it would still probably not expose any of its physical interfaces to userspace as plain ethernet devices, maybe apart from virtual/mgmt ones to run SSH on, and perhaps one so that the kernel can handle packets that the router doesn't have flows programmed for (like in OpenFlow).
That's basically how switch development works in a nutshell; look at Broadcom's OpenNSL.
For years and years, the X server was effectively a userspace device driver. It would map the configuration registers and the framebuffer and do everything outside the kernel. And it worked fine, for the most part.
Once GPUs arrived, the ability to do latency-critical management of the device state became important and the register management moved into the kernel. But for traditional framebuffers the device setup was for the most part done once, and there's no particular need for that to be managed outside userspace.
Also, for a long time there was no need to do any kind of fine-grained synchronisation with the graphics hardware apart from the usual IO wait states, and the whole thing could be accessed as a few memory mappings without any kind of interrupt handling (even to the extent of Sun's proprietary UPA slot not supporting interrupts at all in its low-cost graphics-only incarnation).
From what I know, the reason they moved device setup to the kernel was to avoid flickering when the system switches from the boot screen to the login manager.
One important one is that accessing the PCI config space via IO ports 0xCF8/0xCFC is racy with the kernel, since a read or write requires writing the BDF address to 0xCF8 and then reading/writing the data at 0xCFC. If the kernel tries to do this dance while the X server is doing it as well, one of them is going to read or write the wrong address.
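For reference, the two-step dance looks roughly like this (needs iopl(3) and therefore root; a modern userspace program would go through /sys/bus/pci/.../config or libpciaccess instead):

    /* Sketch of legacy PCI config space access via ports 0xCF8/0xCFC,
     * shown only to illustrate why two independent users of these
     * ports race with each other. */
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/io.h>

    static uint32_t pci_cfg_read32(unsigned bus, unsigned dev,
                                   unsigned fn, unsigned off)
    {
        uint32_t addr = 0x80000000u | (bus << 16) | (dev << 11) |
                        (fn << 8) | (off & 0xFC);

        outl(addr, 0xCF8);   /* step 1: select bus/device/function/offset */
        return inl(0xCFC);   /* step 2: read the data port */
        /* If anyone else writes 0xCF8 between these two steps, the read
         * above returns data for *their* address, not ours. */
    }

    int main(void)
    {
        if (iopl(3) < 0) { perror("iopl"); return 1; }

        /* Vendor/device ID of bus 0, device 0, function 0 (host bridge). */
        printf("00:00.0 ID = 0x%08x\n", pci_cfg_read32(0, 0, 0, 0));
        return 0;
    }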
Interestingly, this design required the X server to run under binary translation in VMware's monitor, even though it was userspace code, because it had to elevate its IOPL to be able to read/write the IO ports. CSRSS.EXE on Windows also ran in BT, since it too was driving the graphics card before NT4. After NT4 moved the graphics code into the kernel, no one remembered to take out the IOPL elevation code, so at least until XP (and probably later) CSRSS.EXE ran with elevated privileges that it didn't need.
After the graphics driver moved out of it and into the kernel, it probably no longer needs the ability to turn off interrupts and read and write legacy IO ports.
Yes. Most of the interesting parts of modern graphics drivers are loaded into the X client process these days, anyway. Whoever wants to render something talks directly to a thin kernel interface with the X server out of the way except for some high level management stuff.
It's far more than that in the end, but yes: mode switching is a spot where you need whole-system management of the resources. The XFree86 binary couldn't easily make assumptions about what someone else was doing.
It may be equal parts GPL avoidance. Broadcom switch ASIC PDKs are a kind of hybrid kernel-and-userland application with no legacy reason for it, so I assume it's simply about working around a restrictive license.
I really like NetBSD. It supports Xen[0], it has the awesome multithreaded NPF[1], the CHFS filesystem tailored for SSDs[2], and support for rump kernels[3].
I heartily recommend this talk. It and the corresponding proof-of-concept Intel userland packet processing driver[1] went a long way for me in removing a lot of the magic from low-level network packet handling and device management in Linux.
NVMe and networking are now used from user space to reduce latency in storage systems. I'm doing that in my current job. There are libraries to help you with it, such as DPDK and SPDK, which are good starting points.
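For a flavor of what the data path looks like, here's a stripped-down RX poll loop roughly along the lines of DPDK's basic forwarding skeleton (not production code; setup is minimal, error handling is mostly omitted, and exact API details vary a bit between DPDK releases):

    /* Sketch of a DPDK-style busy-poll receive loop. */
    #include <string.h>
    #include <rte_eal.h>
    #include <rte_ethdev.h>
    #include <rte_lcore.h>
    #include <rte_mbuf.h>

    #define RING_SIZE  1024
    #define NUM_MBUFS  8191
    #define BURST_SIZE 32

    int main(int argc, char **argv)
    {
        uint16_t port = 0;

        if (rte_eal_init(argc, argv) < 0)
            return -1;

        struct rte_mempool *pool = rte_pktmbuf_pool_create(
            "MBUF_POOL", NUM_MBUFS, 250, 0,
            RTE_MBUF_DEFAULT_BUF_SIZE, rte_socket_id());

        struct rte_eth_conf port_conf;
        memset(&port_conf, 0, sizeof(port_conf));
        rte_eth_dev_configure(port, 1, 1, &port_conf);   /* 1 RX, 1 TX queue */
        rte_eth_rx_queue_setup(port, 0, RING_SIZE,
                               rte_eth_dev_socket_id(port), NULL, pool);
        rte_eth_tx_queue_setup(port, 0, RING_SIZE,
                               rte_eth_dev_socket_id(port), NULL);
        rte_eth_dev_start(port);

        /* Busy-poll the NIC: no interrupts, no syscalls on the data path. */
        for (;;) {
            struct rte_mbuf *bufs[BURST_SIZE];
            uint16_t nb = rte_eth_rx_burst(port, 0, bufs, BURST_SIZE);

            for (uint16_t i = 0; i < nb; i++) {
                /* ... process the packet payload here ... */
                rte_pktmbuf_free(bufs[i]);
            }
        }
        return 0;
    }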
The main benefit is reliability. Driver code is usually lower-quality than other code that runs in kernels. The hardware itself can act weird in ways that mess the drivers up. The infamous Blue Screen of Death on Windows was usually driver errors. Isolating them in their own address space prevents errors from taking the system down. One might also use safe coding, static analysis, model-checking, etc. when developing the drivers themselves. Microsoft eliminated most of their blue screens with the SLAM toolkit for model-checking drivers. Of the two, isolation with restarts is the easiest, given you can use it on unmodified or lightly-modified drivers in many cases.
Far as security, it really depends on the design of the system and hardware. The basic isolation mechanisms like MMUs might restrict a rogue driver enough if the attack just lets it go for other memory addresses. If it uses DMA, then it might control the DMA to indirectly hit other memory or even go for peripheral firmware. If the DMA is restricted, then maybe not. It all depends, as I said, on what the hardware offers you plus how the system uses it.
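On Linux, VFIO is the concrete form of "restricted DMA": the device can only reach IOVA windows that userspace explicitly maps through the IOMMU. Condensed from the example in the kernel's VFIO documentation (the group number and device address are placeholders, and error handling is omitted):

    /* Give a device access to exactly one 1 MiB buffer through the
     * IOMMU, and nothing else. Group "22" and the BDF are placeholders. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <linux/vfio.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>

    int main(void)
    {
        int container = open("/dev/vfio/vfio", O_RDWR);
        int group = open("/dev/vfio/22", O_RDWR);

        struct vfio_group_status status = { .argsz = sizeof(status) };
        ioctl(group, VFIO_GROUP_GET_STATUS, &status);
        if (!(status.flags & VFIO_GROUP_FLAGS_VIABLE))
            return 1;  /* not all devices in the group are bound to vfio-pci */

        ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);
        ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU);

        /* The only memory this device will ever be able to DMA to/from: */
        void *buf = mmap(NULL, 1 << 20, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        struct vfio_iommu_type1_dma_map map = {
            .argsz = sizeof(map),
            .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
            .vaddr = (uintptr_t)buf,
            .iova  = 0,
            .size  = 1 << 20,
        };
        ioctl(container, VFIO_IOMMU_MAP_DMA, &map);

        int device = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, "0000:06:0d.0");
        (void)device;  /* hand this fd to the userspace driver proper */
        return 0;
    }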
All these possibilities are why high-assurance security pushed in the 1980s-1990s to have formal specifications of every component, hardware and software, that map every interaction of state or flow of information. That didn't happen for most mainstream stuff. Without precise models, there are probably more attacks to come involving drivers interacting with complex underlying hardware. It's why I recommend simple RISC CPUs with verified drivers for high-security applications. Quite a few folks from the old guard even use 8-16-bit microcontrollers with no DMA specifically to reduce these risks.
Far as verifying drivers, here's a sample of approaches I've seen that weren't as heavy as something like seL4:
It’s nice if drivers are not running in the kernel, but even if your graphics drivers are running in userspace, if they crash you can’t use your PC anymore.
The main advantage is that you don’t have to deal with all the limitations of kernel mode programming.
You forget that if the graphics driver crashes in userspace, it can be automatically restarted, while if it crashes in kernel space, there's nothing you can do except a hard reboot.
On top of wean_irdeh's comment, I'll add that you might have apps running in the background that can still do work (esp. networked) or just shut down cleanly. There can even be a keyboard sequence for that.
there can be serious performance benefits for i/o heavy workloads:
- removal of copies mandated by the user/kernel boundary
- lower control transfer overhead, up to and including going completely polled-mode. the path from interrupt, into the kernel service thread, through the kernel stack, into epoll, and finally into a user thread takes some time
- use of device specific features without having to plumb them through all the various kernel interfaces
- native asynch removing overheads associated with i/o thread pools
- exploitation of workload specific optimizations that would be defeated by the kernel scheduler, memory management, buffer cache, and other machinery
of course you lose all device independence from your interface, and any inter-process resource sharing provided by the kernel mechanisms. you have to deal with all the error recovery and safety issues yourself. but on some occasions it's really worth it.
Reminds me that Intel didn't include PCI on their Atom variants aimed at mobile devices, supposedly because it was too power hungry.
This in turn led to Microsoft balking at supporting said hardware, as Windows is deeply reliant on PCI (even the ARM SoCs powering the Windows RT products support PCI).
In turn, Intel developed Moblin, which later merged efforts with Nokia's Maemo to become MeeGo, which was later still foisted onto the Linux Foundation.
Most real-life implementations of "microkernels" end up being hybrids. NT started out as a micro, but Microsoft has been moving things (the graphics subsystem in particular) in and out of kernel space in the hunt for the optimal tradeoff between stability and performance.
Similarly, I think the Mach kernel powering Apple's OSes is a "fat micro" where various things that should be in userspace, if one followed the microkernel orthodoxy, reside in kernel space.
Perhaps the only orthodox microkernel OS out there is QNX, these days languishing in the bowels of Blackberry's holdings.
Mac OS/XNU is in fact a derivative of DEC's OSF/1 (later called Tru64 Unix). It has a very weird hybrid design where essentially anything that would be in a monolithic kernel runs as one big Mach process.
Edit: it is somewhat ironic that Alpha's memory protection model is designed in such a way that the natural way to implement any OS would be to write your own microkernel as OS-specific PALcode (something between firmware and microcode, written in an extended Alpha ISA and the only thing that the CPU hardware sees as privileged code), but none of the Alpha OSes is implemented this way. In OSF/1 you thus get a limited microkernel-ish thing that runs two process-ish things, one of which is the Mach kernel and the other the currently running Mach task, which in turn is either the essentially monolithic Unix kernel or a Unix userspace process.
From memory, they did swap out Mach from the version 2.5 used in NS/OS for version 3 from MkLinux (which was from OSF). This was at the time of the Rhapsody to OS X transition. My memory is hazy on the BSD kernel mode component history.
Personally I feel like hybrids are the best implementation variant. Simply sticking to either monolithic or micro just ends up with kernels that are impractical or consist of a thousand moving parts that can crash independently when one goes down.
There are also modular kernels, which are also neat when implemented right (Linux is basically a modular kernel at this point)
Good to see the redheaded stepchild of the BSD world finally getting the limelight. FreeBSD gets all the attention, while OpenBSD gets all the praise.