Userland PCI drivers (netbsd.org)
105 points by wean_irdeh 28 days ago | 54 comments

I remember seeing the PCI userspace option in the Linux kernel menuconfig and wondering why anyone would do that, and then a few years ago at Kiwicon I saw my first use case. A presenter was trying to hack a Cisco router.

Older Cisco routers ran IOS directly on proprietary hardware. At some point, Cisco decided to switch to Intel hardware but didn't port their kernel. They used a Linux kernel and ran IOS as a huge 50MB+ binary. The guy doing the talk got shell access and only found one ethernet device when running ifconfig. The actual switching hardware was being handled in userspace by the large binary.

I'm guessing they probably just wrote some shim layers to connect their PCI drivers up to the userspace PCI Linux API.

Well, most of the magic of hardware routers also comes in the form of hardware acceleration of the actual data plane - i.e. L3 switching - on the silicon itself. That's what makes them fast and so expensive. That sort of mechanism doesn't map nicely into the Linux interface paradigm (unless you do things like exporting kernel routes into the hardware, but that's borderline absurd).

I think even if the driver were to be implemented in kernelspace, it would still probably not expose any of its physical interfaces to userspace as plain ethernet devices, maybe apart from virtual/mgmt ones to run SSH on, and perhaps one so that the kernel can handle packets that the router doesn't have flows programmed for (like in OpenFlow).

> That sort of mechanism doesn't map nicely into the Linux interface paradigm (unless you do things like exporting kernel routes into the hardware, but that's borderline absurd).

Not absurd at all. Cumulus (which I cofounded) does exactly that. There are >1000 customers, including several of the largest cloud operators in the world.

It works really well in practice, since you can just fall back to the kernel for non-fast-path stuff like ARP. IOS/NXOS implement ARP (and everything else) themselves. We can just use the kernel's implementation.

The idea is essentially to use the lightning fast forwarding ASIC as a hardware accelerator for the networking functionality the kernel already has.

> I think even if the driver were to be implemented in kernelspace, it would still probably not expose any of its physical interfaces to userspace as plain ethernet devices, maybe apart from virtual/mgmt ones to run SSH on, and perhaps one so that the kernel can handle packets that the router doesn't have flows programmed for (like in OpenFlow).

That's basically how switch development works in a nutshell; look at Broadcom's OpenNSL.

Isn't switchdev supposed to provide a way to make an interface to in-silicon forwarding engines?

For years and years, the X server was effectively a userspace device driver. It would map the configuration registers and the framebuffer and do everything outside the kernel. And it worked fine, for the most part.

Once GPUs arrived, the ability to do latency-critical management of the device state became important and the register management moved into the kernel. But for traditional framebuffers the device setup was for the most part done once, and there's no particular need for that to be managed outside userspace.

Also, for a long time there was no need to do any kind of fine-grained synchronisation with the graphics hardware apart from the usual IO wait states, and the whole thing could be accessed as a few memory mappings without any kind of interrupt handling (even to the extent of Sun's proprietary UPA slot not even supporting interrupts in its low-cost graphics-only incarnation).

From what I know, the reason they moved device setup to the kernel was to avoid flickering when the system switches from the boot screen to the login manager.

There were a bunch of reasons.

One important one is that accessing the PCI config space via IO ports 0xCF8/0xCFC is racy with the kernel, since a read or write requires writing the BDF address to 0xCF8, and then reading/writing the data from 0xCFC. If the kernel tries to do this dance while the X server is doing it as well one of them is going to read or write the wrong address.
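To make the race concrete, here is a minimal Python sketch rather than real port I/O. `config_address` follows the standard configuration mechanism #1 layout for the value written to 0xCF8; `FakeConfigPorts` and `racy_interleaving` are hypothetical stand-ins that model the single shared address latch and one unlucky interleaving. The bus/device numbers and register contents are made up for illustration.

```python
def config_address(bus, device, function, offset):
    """Build the 32-bit value written to port 0xCF8 (config mechanism #1)."""
    if not (0 <= bus < 256 and 0 <= device < 32 and 0 <= function < 8):
        raise ValueError("bus/device/function out of range")
    if not 0 <= offset < 256:
        raise ValueError("register offset out of range")
    # Bit 31 enables the cycle; the low two bits of the offset are dropped
    # because the data window at 0xCFC is dword-aligned.
    return (0x80000000 | (bus << 16) | (device << 11)
            | (function << 8) | (offset & 0xFC))


class FakeConfigPorts:
    """Toy model of the port pair: one shared address latch, one data window."""

    def __init__(self, registers):
        self.registers = registers  # address value -> 32-bit register contents
        self.latch = 0

    def write_address(self, value):  # stands in for a write to 0xCF8
        self.latch = value

    def read_data(self):  # stands in for a read from 0xCFC
        return self.registers.get(self.latch, 0xFFFFFFFF)


def racy_interleaving():
    """Two unsynchronised users of the latch; returns (x_value, kernel_value)."""
    x_addr = config_address(1, 0, 0, 0x10)  # register the X server wants
    k_addr = config_address(0, 2, 0, 0x00)  # register the kernel wants
    ports = FakeConfigPorts({x_addr: 0xE0000000, k_addr: 0x12348086})

    ports.write_address(x_addr)       # X server, step 1
    ports.write_address(k_addr)       # kernel runs in between, step 1
    kernel_value = ports.read_data()  # kernel, step 2
    x_value = ports.read_data()       # X server, step 2: latch was overwritten
    return x_value, kernel_value
```

In this interleaving the X server gets back the kernel's register rather than its own, which is the wrong-address read described above. Serialising the two-step sequence in one place (the kernel) removes the problem.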

Interestingly, this design required the X server to run under binary translation in VMware's monitor, even though it was userspace code, because it had to elevate its IOPL to be able to read/write the IO ports. CSRSS.EXE in Windows also ran in BT, since it too was driving the graphics card before NT4. After NT4 moved the graphics code into the kernel, no one remembered to take out the IOPL elevation code, so at least until XP (and probably later) CSRSS.EXE ran with elevated privileges that it didn't need.

Csrss.exe is the userspace part of the win32 personality. It controls all win32 processes, which on a Windows system is just about every process.

It is not very useful to limit its privileges.

After the graphics driver moved out of it and into the kernel, it probably no longer needs the ability to turn off interrupts and read and write legacy IO ports.

Yes. Most of the interesting parts of modern graphics drivers are loaded into the X client process these days, anyway. Whoever wants to render something talks directly to a thin kernel interface with the X server out of the way except for some high level management stuff.

It's far more than that in the end, but yes: mode switching is a spot where you need whole-system management of the resources. The XFree86 binary couldn't easily make assumptions about what someone else was doing.

It may be equal parts GPL avoidance. Broadcom switch ASIC PDKs are a kind of hybrid kernel and userland application with no legacy reason, so I assume it is just arbitrarily about working around a restrictive license.

For those who are confused, this is not Apple's IOS, but Cisco's OS that was once (not sure about now) called IOS.

Still called IOS (well, there's also IOS-XE, IOS-XR, NX-OS... but that's a different story). Why would they change the name?

True, because Apple's OS is called iOS.

I really like NetBSD. It supports Xen[0], it has the awesome multithreaded NPF[1], the CHFS filesystem tailored for SSDs[2], and support for rump kernels[3].

[0]: https://wiki.netbsd.org/ports/xen/

[1]: https://en.wikipedia.org/wiki/NPF_(firewall)

[2]: https://en.wikipedia.org/wiki/CHFS

[3]: http://rumpkernel.org/

I heartily recommend this talk. It and the corresponding proof-of-concept Intel userland packet processing driver[1] went a long way for me in removing a lot of the magic from low-level network packet handling and device management in Linux.

[1] https://github.com/emmericp/ixy
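For anyone who wants to see how little is involved in the basic register access, here is a rough Python sketch of the approach, assuming Linux's sysfs `resourceN` files, which can be mmap'ed to reach a device's BAR. The helper names `map_bar`, `read_reg32` and `write_reg32` are my own, and the device address in the usage note is a placeholder. A real driver would do this in C with volatile 32-bit accesses; Python gives no guarantee that each access becomes a single 32-bit load or store, so treat this as illustration only.

```python
import mmap
import os
import struct


def map_bar(resource_path):
    """Map a whole file (e.g. a sysfs PCI resource file) read/write."""
    fd = os.open(resource_path, os.O_RDWR)
    try:
        size = os.fstat(fd).st_size  # for sysfs resource files, the BAR size
        return mmap.mmap(fd, size, mmap.MAP_SHARED,
                         mmap.PROT_READ | mmap.PROT_WRITE)
    finally:
        os.close(fd)  # the mapping stays valid after the descriptor is closed


def read_reg32(bar, offset):
    """Read a little-endian 32-bit register at a byte offset into the mapping."""
    return struct.unpack_from("<I", bar, offset)[0]


def write_reg32(bar, offset, value):
    """Write a little-endian 32-bit register at a byte offset into the mapping."""
    struct.pack_into("<I", bar, offset, value & 0xFFFFFFFF)
```

Usage would look something like `bar = map_bar("/sys/bus/pci/devices/0000:03:00.0/resource0")` as root, with the kernel driver unbound from the device first. The register offsets themselves come from the device's datasheet.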

NVMe and networking are now used from user space to reduce latency in storage systems. I'm doing that in my current job. There are libraries to help you with it such as DPDK and SPDK which are good starting points.

Liedtke was right.

If it can be done outside the kernel, it shouldn't be in the kernel.

That’s a great idea if you don’t care about performance.

Common misconception. I suggest this article.


Being in userspace doesn't mean compromised performance, as QNX has shown for decades.

IRIX had a userland interface for PCI access: pciba(7)

I don’t know of anything that used it, but I’m sure there were custom PCI cards for data acquisition, hardware control, etc etc. that used it.


Isn't the advantage of running hardware drivers in userspace to limit the attack surface of a driver being exploited?

See Tanenbaum et al.'s paper on reliability/security mechanisms for a nice intro:


The main benefit is reliability. Driver code is usually lower-quality than other code that runs in kernels. The hardware itself can act weird in a way that messes the drivers up. The infamous Blue Screen of Death on Windows was usually caused by driver errors. Isolating drivers in their own address space prevents their errors from taking the system down. One might also use safe coding, static analysis, model-checking, etc. when developing the drivers themselves. Microsoft eliminated most of their blue screens with the SLAM toolkit for model-checking drivers. Of the two, isolation with restarts is the easiest, given you can use it on unmodified or lightly-modified drivers in many cases.

As far as security goes, it really depends on the design of the system and hardware. The basic isolation mechanisms like MMUs might restrict the rogue driver enough if the attack just lets them go for other memory addresses. If it uses DMA, then they might control the DMA to indirectly hit other memory or even go for peripheral firmware. If the DMA is restricted (for example by an IOMMU), then maybe not. It all depends, as I said, on what the hardware offers you plus how the system uses it.

All these possibilities are why high-assurance security pushed in the 1980's-1990's to have formal specifications of every component, hardware and software, that map every interaction of state or flow of information. That didn't happen for most mainstream stuff. Without precise models, there are probably more attacks to come involving drivers interacting with complex underlying hardware. It's why I recommend simple RISC CPUs with verified drivers for high-security applications. Quite a few folks from the old guard even use 8-16-bit microcontrollers with no DMA specifically to reduce these risks.

As far as verifying drivers goes, here's a sample of approaches I've seen that weren't as heavy as something like seL4:






Note: Including that last one specifically for the I/O verification part.

It’s nice if drivers are not running in the kernel, but even if your graphics drivers are running in userspace, if they crash you can’t use your PC anymore.

The main advantage is that you don’t have to deal with all the limitations of kernel mode programming.

You forget that if the graphics driver crashes in userspace, it can be automatically restarted, while if it crashes in kernel space, there is nothing you can do except hard reboot.
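The restart idea is nothing exotic once the driver is an ordinary process. A minimal Python sketch, assuming the driver is just a command line that exits non-zero (or dies on a signal) when it crashes; `supervise` and `max_restarts` are hypothetical names, and real systems also have to reinitialise the hardware and recover client state:

```python
import subprocess
import sys


def supervise(argv, max_restarts=3):
    """Run argv and start it again whenever it exits abnormally.

    Returns the number of restarts that were needed before a clean exit,
    or raises RuntimeError once max_restarts has been used up.
    """
    restarts = 0
    while True:
        result = subprocess.run(argv)
        if result.returncode == 0:
            return restarts  # clean shutdown, nothing more to do
        if restarts == max_restarts:
            raise RuntimeError(
                f"{argv[0]} still failing after {max_restarts} restarts")
        restarts += 1
        print(f"driver exited with {result.returncode}, restarting",
              file=sys.stderr)
```

The supervisor survives because the crash is confined to the child's address space, which is exactly what a kernel-mode driver cannot offer.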

On top of wean_irdeh's comment, I'll add that you might have apps running in the background that can still do work (esp. networked) or just shut down cleanly. There can even be a keyboard sequence for that.

Note that Windows Vista onwards has UMDF too (user-mode driver framework). NT6.0 was a very big step.

UMDF was available for XP too, and printer drivers had been user space for a long long time.

iirc you can have kernel-mode GDI printer drivers, and the printer port drivers are in the kernel as well

I see, so there was a backport.

Somewhat related, Minix3 finally fixed the release blocker for 3.4.0.

Expect a release soon, for the first time in years. And it's a major one.

Where can I hear more about this one? Link please.

Wikipedia article[1] has a list of changes, based on the rc. See the column at the right.

[1] https://en.wikipedia.org/wiki/MINIX_3#History

Thanks! Where's the announcement about the fixed release blocker for 3.4.0?

See https://github.com/Stichting-MINIX-Research-Foundation/minix...

It "only" delayed Minix 3.4 for 2 years.

Unless you are a Cisco dev and do everything possible to screw it up

Edit: Cisco hardware was mentioned in this thread

that's one advantage. another is portability.

there can be serious performance benefits for i/o heavy workloads:

   - removal of copies mandated by the user/kernel boundary

   - lower control transfer overhead, up to and including becoming completely polled mode. taking the interrupt, getting into the kernel service thread from the interrupt, going through the kernel stack, into epoll, and into a user thread takes some time

   - use of device specific features without having to plumb them through all the various kernel interfaces

   - native async, removing overheads associated with i/o thread pools

   - exploitation of workload specific optimizations that would be defeated by the kernel scheduler, memory management, buffer cache, and other machinery

of course you lose all device independence from your interface, and any intra-process resource sharing provided by the kernel mechanisms. you have to deal with all the error recovery and safety issues yourself. but on some occasions it's really worth it.
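The polled-mode point can be sketched with a simulated receive ring in Python. This is not any real NIC's descriptor format: `RxDescriptor`, `RxRing`, `device_deliver` and the `budget` parameter are hypothetical names, and the "device" is just a method call standing in for DMA. The shape is the part that matters: the driver checks a done flag in memory it already has mapped, so no interrupt or syscall sits on the receive path.

```python
from dataclasses import dataclass


@dataclass
class RxDescriptor:
    done: bool = False   # set by the (simulated) device once a packet landed
    payload: bytes = b""


class RxRing:
    def __init__(self, size):
        self.descriptors = [RxDescriptor() for _ in range(size)]
        self.next_to_clean = 0  # the driver's read position in the ring

    def device_deliver(self, slot, payload):
        """Stand-in for DMA: the device fills a slot and sets its done flag."""
        descriptor = self.descriptors[slot]
        descriptor.payload = payload
        descriptor.done = True

    def poll(self, budget):
        """Collect up to `budget` completed packets without ever blocking."""
        packets = []
        while len(packets) < budget:
            descriptor = self.descriptors[self.next_to_clean]
            if not descriptor.done:
                break  # nothing more has arrived; return what we have
            packets.append(descriptor.payload)
            # Hand the slot back to the device and advance, wrapping around.
            descriptor.done = False
            descriptor.payload = b""
            self.next_to_clean = (self.next_to_clean + 1) % len(self.descriptors)
        return packets
```

An application thread would call `poll` in a tight loop, typically pinned to a core. The cost is a core burned spinning even when no traffic arrives, which is part of the trade-off described above.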

Reminds me that Intel didn't include PCI on their Atom variants aimed at mobile devices, supposedly because it was too power hungry.

This in turn led to Microsoft balking at supporting said hardware, as Windows is deeply reliant on PCI (even the ARM SoCs powering the Windows RT products support PCI).

In turn Intel developed Moblin, which later merged efforts with Nokia's Maemo to become MeeGo. Later still it was foisted onto the Linux Foundation.

I guess monolithic kernels have gone full-circle now.

Most real life implementations of "microkernels" end up being hybrids. NT started out as a micro, but Microsoft have been moving things (the graphics subsystem in particular) in and out of kernel space in the hunt for the optimal tradeoff between stability and performance.

Similarly, I think the Mach kernel powering Apple's OSes is a "fat micro" where various things that should be in userspace, if one followed the microkernel orthodoxy, reside in kernel space.

Perhaps the only orthodox microkernel OS out there is QNX, these days languishing in the bowels of Blackberry's holdings.

Mac OS/XNU is in fact a derivative of DEC's OSF/1 (later called Tru64 Unix). It has a very weird hybrid design where essentially anything that would be in a monolithic kernel runs as one big Mach process.

Edit: it is somewhat ironic that Alpha's memory protection model is designed in such a way that the natural way to implement any OS would be to write your own microkernel as OS-specific PALcode (something between firmware and microcode, written in an extended Alpha ISA and the only thing that the CPU hardware sees as privileged code), but none of the Alpha OSes is implemented this way. In OSF/1 you thus get a limited microkernel-ish thing that runs two process-ish things, one of which is the Mach kernel and the other the currently running Mach task, which in turn is either the essentially monolithic Unix kernel or a Unix userspace process.

Sorry, but MacOS/OSX is a derivative of NeXTStep, not Tru64.

There is no relationship to Tru64 except that HP did also support OpenSTEP at one point.

From memory, they did swap out Mach from the version 2.5 used in NS/OS for version 3 from MkLinux (which was from OSF). This was at the time of the Rhapsody to OS X transition. My memory is hazy on the BSD kernel mode component history.

There is Minix running in most Intel CPUs.

L4 running on most GSM radio chips.

Many embedded RTOSes targeted at critical systems are microkernels as well. For example, the offerings from Green Hills.

Personally I feel like hybrids are the best implementation variant. Simply sticking to either monolithic or micro just ends up with kernels that are impractical or consist of a thousand moving parts that can crash independently when one goes down.

There are also modular kernels, which are neat when implemented right (Linux is basically a modular kernel at this point).

If you are interested in this or other projects, GSoC is now and the deadline for student applications is 27 March (tomorrow).

Good to see the redheaded step child of the BSD world finally getting the limelight. FreeBSD gets all the attention, while OpenBSD gets all the praise.

The "redheaded step child" quote reminded me of this:

"BSD is Dying"


As I recall, in that presentation, the redheaded stepchild is OSX.
