Linus Torvalds on userspace filesystems (spinics.net)
135 points by sasvari on June 29, 2011 | 98 comments

An approach for this kind of problem: suppose you have a very simple language composed of simple, verifiable instructions, like: block read, block write, read-offset, write-offset, simple arithmetic, simple branching; you could compose programs for various file-system operations and pass them over to the kernel for execution en masse, rather than needing a chatty interface.
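A minimal sketch of what such a language could look like, in Python rather than anything a kernel would actually run. The opcode names, the register count, the block size and the forward-only jump rule are all assumptions made for illustration; the comment above does not specify them. The idea shown is that a whole operation is checked once by `verify` and then executed in one go by `run`, instead of one call per block access.

```python
BLOCK_SIZE = 16  # bytes per block (illustrative)
NUM_REGS = 4     # integer registers (illustrative)

# Hypothetical opcodes and the kind of each operand.
OPS = {
    "const": ("reg", "imm"),            # regs[a] = b
    "add": ("reg", "reg"),              # regs[a] += regs[b]
    "read_block": ("reg",),             # buf = disk[regs[a]]
    "write_block": ("reg",),            # disk[regs[a]] = buf
    "read_offset": ("reg", "reg"),      # regs[a] = buf[regs[b]]
    "write_offset": ("reg", "reg"),     # buf[regs[a]] = low byte of regs[b]
    "jump_if_zero": ("reg", "target"),  # if regs[a] == 0: pc = b
}


def verify(program):
    """Static checks done once, before anything executes."""
    for pc, instr in enumerate(program):
        op, *args = instr
        kinds = OPS.get(op)
        if kinds is None or len(args) != len(kinds):
            raise ValueError(f"bad instruction at {pc}: {instr!r}")
        for kind, arg in zip(kinds, args):
            if not isinstance(arg, int):
                raise ValueError(f"non-integer operand at {pc}")
            if kind == "reg" and not 0 <= arg < NUM_REGS:
                raise ValueError(f"bad register at {pc}")
            # Forward-only jumps mean every program terminates.
            if kind == "target" and not pc < arg <= len(program):
                raise ValueError(f"bad jump target at {pc}")


def run(program, disk):
    """Interpret a verified program against a list of bytearray blocks."""
    verify(program)
    regs = [0] * NUM_REGS
    buf = bytearray(BLOCK_SIZE)
    pc = 0
    while pc < len(program):
        op, *args = program[pc]
        pc += 1
        if op == "const":
            regs[args[0]] = args[1]
        elif op == "add":
            regs[args[0]] += regs[args[1]]
        elif op in ("read_block", "write_block"):
            block = regs[args[0]]
            if not 0 <= block < len(disk):  # dynamic bounds check
                raise ValueError("block out of range")
            if op == "read_block":
                buf[:] = disk[block]
            else:
                disk[block][:] = buf
        elif op == "read_offset":
            offset = regs[args[1]]
            if not 0 <= offset < BLOCK_SIZE:
                raise ValueError("offset out of range")
            regs[args[0]] = buf[offset]
        elif op == "write_offset":
            offset = regs[args[0]]
            if not 0 <= offset < BLOCK_SIZE:
                raise ValueError("offset out of range")
            buf[offset] = regs[args[1]] & 0xFF
        elif op == "jump_if_zero":
            if regs[args[0]] == 0:
                pc = args[1]
    return regs


# One "request": increment the byte at offset 2 of block 1.
program = [
    ("const", 0, 1), ("read_block", 0),
    ("const", 1, 2), ("read_offset", 2, 1),
    ("const", 3, 1), ("add", 2, 3),
    ("write_offset", 1, 2), ("write_block", 0),
]
disk = [bytearray(BLOCK_SIZE) for _ in range(4)]
disk[1][2] = 41
regs = run(program, disk)
```

Restricting jumps to forward targets is one cheap way to make termination checkable; it rules out loops, so a real design would probably need something richer.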

This basic idea is something I've applied to a file format that stored objects, but where the cost of serializing and deserializing a whole object graph was prohibitive for a request/response round-trip. For any given object path, foo.bar[3].baz, a "program" could be compiled that could be interpreted over the file contents and the answer retrieved. All object paths were available for compilation ahead of time (it was a custom language, long story), so this approach could be far more efficient than any serialization story.
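A rough sketch of how a path like foo.bar[3].baz might compile into a small cursor program, assuming a made-up layout where every field or array slot is 4 bytes and holds either a u32 value or the file offset of another object. The schema encoding, the op names (`add`, `deref`, `load`) and the layout are hypothetical, not the format described above.

```python
import re
import struct

# Hypothetical schema: a dict is a struct (field -> (slot, type)), a
# one-element list is an array of that type, and "u32" is a 4-byte integer.
BAZ = {"baz": (0, "u32")}
FOO = {"bar": (0, [BAZ])}
ROOT = {"foo": (0, FOO)}
SLOT = 4  # every slot is 4 bytes: a u32 value or a file offset to an object


def compile_path(root_type, path):
    """Turn 'foo.bar[3].baz' into a flat list of cursor operations."""
    program, current = [], root_type
    for name, index in re.findall(r"(\w+)|\[(\d+)\]", path):
        if name:
            if not isinstance(current, dict) or name not in current:
                raise TypeError(f"no field {name!r} here")
            slot, current = current[name]
        else:
            if not isinstance(current, list):
                raise TypeError("indexing a non-array")
            slot, current = int(index), current[0]
        if slot:
            program.append(("add", slot * SLOT))
        if current != "u32":
            program.append(("deref",))  # follow the stored file offset
    if current != "u32":
        raise TypeError("path must end at a u32")
    program.append(("load",))
    return program


def run_path(program, data):
    """Interpret a compiled path over raw bytes; no object graph is built."""
    cursor = 0  # the root object lives at offset 0 in this sketch
    for op, *args in program:
        if op == "add":
            cursor += args[0]
        elif op == "deref":
            (cursor,) = struct.unpack_from("<I", data, cursor)
        elif op == "load":
            return struct.unpack_from("<I", data, cursor)[0]


# File image: root -> Foo -> array of 4 refs -> four Baz objects (10..40).
data = struct.pack("<10I", 4, 8, 24, 28, 32, 36, 10, 20, 30, 40)
program = compile_path(ROOT, "foo.bar[3].baz")
value = run_path(program, data)
```

The lookup touches four 4-byte slots and never materializes foo or bar. This sketch stores no array lengths, so indexes are not bounds-checked.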

Sounds similar to the Untrusted Deterministic Functions used in exokernels to let userspace do god-knows-what to underlying hardware resources in an efficient but provably safe way, using a correct-by-construction language [1,2].

Another tool in this space is Native Client; any chance it could be adapted to allow an even smaller, safer x86 subset, so that you could give the kernel pre-compiled 'callbacks' that it could execute on, e.g., receiving a network packet, with 0 context switches? Probably not without a lot of work, since NaCl relies a lot on virtual memory for memory isolation, but it's an idea.

[1] http://portal.acm.org/citation.cfm?id=266644 [2] http://citeseerx.ist.psu.edu/viewdoc/download?doi=

Nice links. The UDFs look like a very similar idea to mine, taken further and generalized; I make no claim to originality, it seemed obvious to my compiler-oriented eye.

>> I make no claim to originality

Oh, I didn't interpret it as such either; just some interesting related work :)

Another peripherally-related idea is Active Messages [1,2]. I see it as the analog to UDFs or other rich executable messages for efficient node-to-node communication in large clusters. It's sort of like RPC if you squint hard enough; you send little executable messages that run on a remote node, and can send you back some data if you want. I guess the point is that having rich executable messages allows you to get around long physical latencies between sender and receiver, as well as more mundane intra-node delays due to userspace/kernelspace context switches.

[1] http://www.cs.cornell.edu/home/tve/thesis/ (particularly chapter 4, it's a bit wordy but has lots of detail). [2] http://www.dtic.mil/cgi-bin/GetTRDoc?AD=ADA204210&Locati... (there's lots of other published work on J-Machine from Dally's group, too)

The same system as the anti-serialization mechanism I described also used an RPC-like mechanism between client web pages and the server. From the JS perspective, it queued a list of commands to execute, with a return value closure passed along to each command. AJAX (pre-JSON days) roundtrips were minimized, because all the commands could be executed sequentially on the server, all results batched up, and returned.

The anti-serialization mechanism also acted as a transactional store - either all instructions succeeded, or the server effectively had no state change (DB transactions etc. were included into a distributed transaction as necessary). No in-memory state. It was a neat architecture (still is, I guess it's probably still in production).
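For illustration, a minimal in-memory sketch of the two ideas combined: commands queued on the client with a result callback each, sent in one round trip, and applied all-or-nothing on the server. The class names, the `set`/`incr` commands and the copy-then-commit trick are assumptions for the example; the real system described above kept no in-memory state and used distributed transactions instead.

```python
import copy


def cmd_set(state, key, value):
    state[key] = value
    return value


def cmd_incr(state, key):
    state[key] += 1  # raises KeyError if the key is missing
    return state[key]


COMMANDS = {"set": cmd_set, "incr": cmd_incr}  # hypothetical command table


class Server:
    def __init__(self):
        self.state = {}
        self.round_trips = 0

    def execute_batch(self, batch):
        """Run every command on a copy; commit only if all of them succeed."""
        self.round_trips += 1
        working = copy.deepcopy(self.state)
        results = [COMMANDS[name](working, *args) for name, args in batch]
        self.state = working
        return results


class CommandQueue:
    """Client side: queue commands, then send them in a single round trip."""

    def __init__(self, server):
        self.server = server
        self.pending = []

    def call(self, name, args, on_result):
        self.pending.append((name, args, on_result))

    def flush(self):
        pending, self.pending = self.pending, []
        results = self.server.execute_batch([(n, a) for n, a, _ in pending])
        for (_, _, on_result), result in zip(pending, results):
            on_result(result)


server = Server()
queue = CommandQueue(server)
seen = []
queue.call("set", ("x", 1), seen.append)
queue.call("incr", ("x",), seen.append)
queue.flush()  # one round trip, two results

queue.call("set", ("y", 5), seen.append)
queue.call("incr", ("missing",), seen.append)
try:
    queue.flush()  # second command fails, so "y" is never committed
except KeyError:
    pass
```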

That reminds me of Singularity OS, an experimental OS from Microsoft Research, which has no distinction between kernel and user space, thanks to writing everything in a variant of .NET bytecode.


There have been Java OSes too, with the idea that verifiable bytecode does away with the need for "expensive" memory protection etc. Running everything in one memory space is error-prone, but it's the default way of the monolithic kernel. Having services delegated out to processes with separate memory spaces is architecturally more stable; a single bug in an obscure driver in a less actively maintained corner of the kernel shouldn't lead to a system-wide vulnerability. The approach I mention preserves the ability to keep things separate (with appropriate controls and verification at the boundary); the Singularity / Java OS approach seems rather different.

Snoracle says JavaOS itself is obsolete, FWIW.

Yet another variant on the idea was the AS/400. It had no memory protection and didn't need it, since there was no way to address anything that you weren't supposed to.

Given the impressive stability and security record of the AS/400, they had a point.

Yes, when you have typed pointers at the hardware level, it can be even better than page protection.

Only because Snoracle hasn't ported Oracle to it, yet.

I wonder how they really achieve that considering that the x86 CPU in protected mode definitely makes the distinction (rings 0 and 3, respectively)?

I'm pretty sure Lisp OSes (you know, on the bytecoded Lisp Machines from companies such as Symbolics and TI) worked like that, and I know the CISC variants of the AS/400 architecture worked like that as well. (I don't know the RISC AS/400 or iSeries that well.)

An interesting aspect of that is what it does to your security model: In the AS/400 world, where applications are compiled to bytecode and then to machine code, the software that compiles to machine code is trusted to never emit dangerous machine code, as there are no further checks on application behavior. In the Lisp Machine world, anyone who can load microcode is God. In Singularity OS, the .Net runtime is effectively the microcode and the same remarks about Lisp Machines apply.

"That's like saying you should do a microkernel - it may sound nice on paper, but it's a damn stupid idea for people who care more about some idea than they care about reality."

He just had to sneak in a dig at Tanenbaum.

Actually, there are microkernels running in a vast number of mobile phones nowadays. They typically don't run on the main CPU, though, but on various controller chips.

Microkernels are practical -- they just have different trade-offs. Microkernels are preferable if (a) the whole Linux kernel would be too large, or (b) you consider security more important than performance. It's probably also easier to provide realtime services with microkernels, but I don't know enough about Linux's realtime API to say whether microkernels would be much better here.

Given that most phones doing more than just providing calling and texting are running on monolithic or hybrid kernels, I wonder where you get this statistic. In addition, the most important part of secure and reliable systems is not what distributed safeguards you build into them, it's the amount of attack surface. If the monolithic kernel + drivers is smaller than the microkernel, it's probably more likely that the monolithic kernel is more secure.

I think he means the mobile baseband chips run microkernels.

Yes, most smartphones actually have two CPUs - the one running the OS and doing web browsing and stuff, which runs some monolithic kernel, and the one driving the radio chip, which runs a microkernel. They don't have much to do with each other.

I think the main reason for the separation is the realtime constraints needed by radio codecs - if your web browsing slows down because you received an email, it's a minor annoyance, but if you stop sending radio packets because you received an email, you probably lose the connection or worse.

Many kernels can give the necessary hard real time guarantees today. Honestly, I think that baseband chips run their own operating systems because it's illegal to sell open software radios in a lot of countries, and a cellphone where you could tweak the code running the baseband chip would essentially be one.

The radio CPU can be optimized for its job, so it may have extra instructions for error detection, decryption, etc. It may also be on a separate bus with the antenna peripherals, freeing up a lot of bandwidth for the main CPU.

It should be possible to do all this on a single CPU (real-time is no problem - just use an RTOS), but it would be expensive and eat a lot of power.

The other reason they split up the baseband and the main CPU is for regulatory reasons - the baseband is the only part talking to the radio, so it's the only part which needs to be thoroughly tested for regulatory compliance. This then lets them upgrade other parts more frequently without having to go through as much testing and obtaining approvals.

> real-time is no problem - just use an RTOS

In theory, yes, but the main smartphone OSes aren't RTOSes, and making them such requires nontrivial reengineering.

The result is my Android kernel crashes, and is rebooting, while I'm still on a voice call. Not necessarily the intended result.

I'd say that's like arguing that "My language is better than yours at doing 'Hello, world!.'" This isn't a hard problem and even poor solutions can solve easy problems.

How is more privileged code more secure? I'm not even going to try to reason with you. The OKL4 and QNX microkernels are used in millions of devices, from smartphones to planes.

So is VxWorks. Your point?

The funny thing is that at the time Linus made his famous remarks about microkernels, QNX was thoroughly kicking Linux's butt in nearly every area Linus thought was a problem for microkernels.

Personally, I love how confrontational he is. You don't see very many people who are that passionately practical. It's oddly inspiring...

He was replying to someone named Andrew. I'm pretty sure it's a reflex by now.

.. if performance is your only criterion. Some people (probably a minority) might prefer software that is highly resilient, even if that means it's 10-20% slower.

Snarky as this sentence might be, it is also my biggest takeaway from the thread: his relentless emphasis on practicality rather than on ideas might be one of the reasons he gets things done. I wish I was this practical.

And, intended or not, at Stallman

GNU HURD was hamstrung mainly by politics, not so much by technical issues. There was a ton of code written, but it kept getting ripped up and thrown out because someone didn't like it.

That sounds like a juicy story. Who didn't like it and threw away the code?

Note that I'm not claiming that some competent people could fix HURD right now if the political environment were better. It's more that the politics moved the project into the least tenable position. I don't know if there's a complete history somewhere, but just some things I managed to piece together from Wikipedia articles:

Berkeley wouldn't cooperate with development on the 4.4BSD-Lite modified kernel, so in 1987 HURD decided to go with the Mach microkernel. But then they waited 3 years for licensing issues to clear up before investing any real effort into it. CMU stopped work on Mach in 1994, so HURD switched to Utah Mach. Utah stopped working on it in 1996. GNU kept working on that one under the name GNU Mach. And then (from Wikipedia): "In 2002, Roland McGrath branched the OSKit-Mach branch from GNU Mach 1.2, intending to replace all the device drivers and some of the hardware support with code from OSKit. After the release of GNU Mach 1.3, this branch was intended to become the GNU Mach 2.0 main line; however, as of 2006, OSKit-Mach is not being developed.

As of 2007, development continues on the GNU Mach 1.x branch, and is working towards a 1.4 release."

In 2004, an effort was started to move to a more "modern" microkernel. L4 was the first and it died almost immediately. Work started toward the Coyotos microkernel, but between 2007 and 2009, focus shifted to Viengoos. But then "As of 2011, development on Viengoos is paused due to Walfield lacking time to work on it. In the meantime, others have continued working on the Mach variant of Hurd."

I have it on good authority that Stallman now agrees that going with a microkernel was a mistake.

Is there any public statement or other public source for that statement? In particular, it would be very interesting to read about the concrete reasons which led him to this conclusion.

Sorry, I got that from private communication with Thomas Bushnell, who was the principal architect on the project for a long time. His claim is that they had two choices: go with the microkernel, or start with the BSD code that was legally in the clear and rewrite the parts that were under a legal cloud at the time.

Thomas preferred the latter, Stallman the former. As events proved, the BSD approach would have been fine (particularly since the legal issues eventually got cleared up), while the microkernel approach ran into much larger unexpected roadblocks than anticipated.

In his own words: "Finally, I take full responsibility for the technical decision to develop the GNU kernel based on Mach, a decision which seems to have been responsible for the slowness of the development. I thought using Mach would speed the work by saving us a large part of the job, but I was wrong."


That's not him saying using a microkernel was a mistake, it's him saying that he was mistaken about how long it would take, and the impact Mach would have on development time.

One of the smartest guys of all time, but this was unnecessary. Plus, Windows is _generally_ a modified microkernel (some violations, but mostly right) - and it runs on more than a billion machines.

In fact, I doubt the kernel is responsible for any real performance issues anymore. A bloated userspace (especially on Windows, but most Linux distributions apply here too) is responsible.

Lolwut? Windows is at best a 'hybrid' kernel, and with that it's probably 95% monolithic and 5% microkernel-istic in its design. How does that translate to '_generally_ a modified microkernel'?? You need to seriously read up on Monolithic/Hybrid/Microkernels and the differences between kernel and user-space. I'll be waiting.


"Windows NT's kernel mode code further distinguishes between the "kernel", whose primary purpose is to implement processor and architecture dependent functions, and the "executive". This was designed as a modified microkernel, as the Windows NT kernel does not meet all of the criteria of a pure microkernel."

Ahh, wikipedia. Then, I raise you one, please make sure you read the NT description:


I guess it's all just semantics at this point - my point was that you'd be more correct in saying that NT was a microkernel with some changes, than a monolithic kernel with some changes (I suppose this was your point with pulling out the Hybrid term).

FYI, most drivers including video, have been moved back to a split between User Mode and Kernel Mode, rather than being completely kernel mode as they used to be (the Wikipedia article links to something that is very out of date). IIRC, Application IPC is still Kernel-Mode.


Oops, wrong copy paste - http://msdn.microsoft.com/en-us/library/aa480220.aspx

(about WDDM, which moves most of the display driver functionality back to user-mode)

This is amusing given the topic at hand, and what the Plan 9 people thought of Tanenbaum:


I don't care what he says, I'm still waiting for an inclusion of some kind of union mount functionality in mainline. I don't need anything fancy, just one ro dir and one rw dir. If he can't even get me that in-kernel, I'll stick with unionfs in fuse, thank you very much.

Don't tell me about aufs, UnionMount, or overlayfs: aufs isn't up to date because it's not mainlined (for no good reason I can find), and UnionMount and overlayfs are too far out of mainline and too much in development to find reasonable packages.

Linus: when you can back up your claims with working code, I'll start listening. ;-)

aufs isn't mainline, and will probably never be mainline, because it's ~27,000 lines of poorly documented and often buggy code, written by a single maintainer in a style that doesn't really mesh with the rest of the kernel.

Speaking from experience, there are a lot of edge-case bugs remaining in aufs. It's usable enough for livecds and whatnot, but trying to use it on a server is a recipe for disaster.

Hey, question: What's a convincing use case for a union filesystem?

Also, I don't think union filesystems are a valid argument. They're not real filesystems in the way NTFS/FAT/extfs/reiserfs or whatever are; there's a lot less work to be done, and I could see them falling under the "toy" category that Linus was talking about...

squashfs plus unionmount

commonly used for livecds, I compress /usr, saves me about 5GB and some I/O and therefore some battery

I'd love to use them for VM image building, something I'm doing quite a lot of at the moment. Having an editable FS overlay on top of a pristine root image without having to futz with qcow2 and nbd would make the process a lot simpler.
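To make the "editable overlay on a pristine image" semantics concrete, here is a toy user-space model in Python: one read-only lower directory, one writable upper directory, lookups that prefer the upper layer, and copy-up on first write. The `UnionDir` class is hypothetical and deliberately flat (no subdirectories, no whiteouts for deletions); it is not how unionfs, aufs or overlayfs are implemented.

```python
import os
import shutil
import tempfile


class UnionDir:
    """Toy union of a read-only lower dir and a writable upper dir."""

    def __init__(self, lower, upper):
        self.lower, self.upper = lower, upper

    def resolve(self, name):
        # The upper layer shadows the lower one.
        for layer in (self.upper, self.lower):
            path = os.path.join(layer, name)
            if os.path.exists(path):
                return path
        raise FileNotFoundError(name)

    def read(self, name):
        with open(self.resolve(name), "rb") as f:
            return f.read()

    def append(self, name, data):
        upper_path = os.path.join(self.upper, name)
        lower_path = os.path.join(self.lower, name)
        # Copy-up: the first write moves the file into the writable layer.
        if not os.path.exists(upper_path) and os.path.exists(lower_path):
            shutil.copy2(lower_path, upper_path)
        with open(upper_path, "ab") as f:
            f.write(data)

    def listdir(self):
        return sorted(set(os.listdir(self.lower)) | set(os.listdir(self.upper)))


with tempfile.TemporaryDirectory() as root:
    lower = os.path.join(root, "lower")
    upper = os.path.join(root, "upper")
    os.mkdir(lower)
    os.mkdir(upper)
    with open(os.path.join(lower, "motd"), "wb") as f:
        f.write(b"pristine\n")

    union = UnionDir(lower, upper)
    union.append("motd", b"edited\n")
    merged = union.read("motd")
    names = union.listdir()
    with open(os.path.join(lower, "motd"), "rb") as f:
        untouched = f.read()
```

The hard parts the thread alludes to (deletions, renames, directories, hard links, races) are exactly what this sketch leaves out.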

The current status of this seems to be summed up in this LWN article: https://lwn.net/Articles/447650/

Isn't it down to you to prove him wrong rather than vice versa?

you misunderstand me. I have a working solution, using fuse. if he thinks that's bad, it's on him to provide a correct in-kernel solution, then show me it's faster

Read the whole thread (or even just the immediate parent email, from Andrew Morton). Linus wasn't saying "don't use fuse for this", he was pushing back against a suggestion that the existence of the fuse implementation was a reason not to merge a kernel implementation.

I didn't read the parent, thanks.

Union mount functionality is not something that needs to be done in userspace. Plan 9 has union mounts, but that is one of the few things it does in the kernel; for details see: http://man.cat-v.org/plan_9/2/bind

The key thing is to have simple, clean and well defined semantics; this is much harder when your VFS is polluted by stuff like symlinks and other weird kinds of pseudofiles.

I know Al Viro has wanted to have proper union mounts in Linux for many years, but getting proper private namespaces was hard enough, and now nobody uses them, which is sad (but more the fault of the suid-centric userspace environment than of the kernel; if you have to be root to create a new namespace, it is rather pointless).

I'm beginning to suspect that people report cases of Linus insulting people regardless of whether the discussion was interesting or not.

Linus insults someone on a forum -> first page on HN !!

Just a thought.

Yes, if the discussion has some fascinating relevance for the general HN readership, I can't figure it out. Linus doesn't come off very well, trying to settle a discussion with common sense and sarcasm while the other people on the thread use facts and examples to paint a more complicated picture.

> People who think that userspace filesystems are realistic for anything but toys are just misguided.

I use sshfs all the time. Much of the software I use every day only needs to meet "toy" standards to be useful to me. What is Linus on about here?

Yes, but would you rely on it as the filesystem for a public server?

His point isn't that userspace filesystems are useless, but they are much slower, so the existence of a Fuse filesystem is not an argument for not including a kernel space filesystem in the kernel.

Linus wasn't very clear, but the issue has more to do with the cost of context switches between the kernel and a userland process to support a userland filesystem. These switches are relatively expensive; a good example is the relative throughput of a LUKS/DM-CRYPT encrypted volume compared to a Truecrypt volume with equivalent cipher selections.

LUKS' throughput is much higher, simply because it does not add at least two more kernel<->userspace transitions on every read and write.

Oh, I remember back in the good-ole days when it'd be hitting the front page of Slashdot. Not a terribly new phenomenon.


It was never news, for just about any definition of 'news'.

this reply is a bit more sane - http://www.spinics.net/lists/linux-fsdevel/msg46080.html

glad i don't work with torvalds.

Is there a website or book with the collected ideas/opinions of Linus and the design of linux?

"that's like saying you should do a microkernel - it may sound nice on paper, but it's a damn stupid idea for people who care more about some idea than they care about reality."

So I look up this microkernel thing and find this old debate: http://www.dina.dk/~abraham/Linus_vs_Tanenbaum.html

It would be great to have a resource that explains the what and the why behind the linux architecture. Also arguments, errr I mean debates, between high caliber smart people would be cool.

There is a book "Just for Fun: The Story of an Accidental Revolutionary" that I ordered, but I have a feeling it doesn't go into the technical depth I would like. I could be wrong, just ordered it, will find out.

Anecdotally, a project I worked on involved evaluating some encrypted file systems on linux, and encFS (on FUSE) performed an order of magnitude worse than the kernel solutions.

Not to say it could never happen, but the tradeoff on performance is just nowhere near worth it right now.

I'd suspected performance might not be the primary purpose when I came across the Python packages for implementing filesystems.

Still, it seems like a lot of the interesting filesystem ideas have to do with non-root users authenticating to something else over some network connection. This type of thing is simply a more natural fit for user space.

Keeping the high-level stateful protocol stuff out of the kernel is usually a good idea except when performance is all that matters (i.e. there will be full-time developers tuning it and cleaning up the inevitable security and crash bugs).

The main thing I use fuse for is sshfs, where trying to optimize inside the kernel is unlikely to give noticeable gains relative to network latency.

And Linus acknowledges:

> fuse works fine if the thing being exported is some random low-use

> interface to a fundamentally slow device

It's a bit amusing to see Linus, all these years after his famous dispute with Tanenbaum, still fighting microkernels. Personally, I think that the Minix source code shows that microkernels are quite viable, and that Linus's disdain is misguided.

I'm not an OS expert, so perhaps I'm just being an idiot, but doesn't that argument go both ways (and probably more to Linux)? I mean, couldn't one say that the Linux source code proves that microkernels aren't really necessary?

Microkernels and Linux style kernels solve different problems. If reliability is your concern, then Microkernels are the way to go because a buggy driver (which will exist no matter how careful you are) can't take down your entire system. For desktop operating systems, where the user really is fine with pushing the reset button when something goes wrong, it is not worth the development effort to write a microkernel.

In the end, the answer is always to pick the right tool for the job. Comparing Minix and Linux is about as meaningful as comparing a phillips screwdriver with a flathead. They're used for different purposes.

On real-world hardware, a buggy driver can take down your entire system regardless of whether you're using a microkernel or not. Program a bad DMA, blit, or scatter/gather and Minix won't save you.

I agree with your last sentence though.

> If reliability is your concern, then Microkernels are the way to go because a buggy driver (which will exist no matter how careful you are) can't take down your entire system.

If reliability is that much of a concern, why isn't the system being designed in a way that makes the distinction irrelevant?

> about as meaningful as comparing a phillips screwdriver with a flathead. They're used for different purposes.

Is this just that they work with different kinds of screws, or are there cases where it's actually preferable to use a flathead screw+screwdriver instead of Phillips (or Robertson, which for some reason we don't have around here)?

I agree with your post, but this:

> ...as meaningful as comparing a phillips screwdriver with a flathead. They're used for different purposes.

is perhaps not the best analogy. They are both for turning screws, and some people do, in fact, have strong opinions on which is better.

Unless you meant to say that a phillips is for turning screws and a flathead is for prying stuff open. :-D

It's not really about the 'development effort', it's about the performance. Message passing will always be a lot slower than communicating through shared memory. Just like user-space filesystems will always be slower than kernel-space filesystems.

This means that in areas where performance is paramount, the system stability benefits of a microkernel design just aren't worth the loss in speed. In other areas they might be, but as the current OS landscape shows, it's not in huge demand. Part of this is obviously that monolithic/hybrid kernels really aren't that prone to crashing (bugs get fixed), which means that in practice these kernel designs offer BOTH performance and stability.

I hear Hurd is making some progress...

I guess no thread on Microkernels is complete without bringing up Hurd :P

There are at least a few microkernel implementations besides Minix that are in use, more or less stable, or show some viability (e.g. QNX, the L4 family and to some extent the EROS family).

They have received nowhere near as much attention as something like Linux. So the usability is not really up there.

Is that because of usability or network effects, though?

QNX is proprietary and is very expensive (with occasional free demos).

The only useful thing you can run on L4 is Linux, but that's pointless.

EROS/CapROS/Coyotos never had any useful userspace.

How does the Minix source code prove this? Minix existed long before Linux, and yet Linux made enormous gains in popularity while Minix is still minor and very rarely used.

Has the Minix kernel shown significant performance results? (I am not being sarcastic btw, I am honestly asking.)

I know OSX uses something like a micro-kernel and it is a perfectly fine desktop OS, but performance suffers once you get to thousands of threads.

I don't think XNU is a microkernel in any real sense. Certainly not in the sense that the Minix kernel is or was.

The filesystem, device drivers, IPC, network stack, etc. are all in kernel space. It's true that it's built around the Mach microkernel, but as in most systems built around it, Mach is just a core layer in the kernel onion, rather than a microkernel in and of itself with the other major components implemented as user-space servers.

Ummm, IPC pretty much has to be in the kernel, by definition...

A low-level IPC mechanism needs to exist in the microkernel. Higher-level IPC mechanisms that applications will actually care about and use certainly don't need to; see, for example, pipes in Minix.
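As a toy illustration of that layering (not Minix's actual design, where a server process handles pipes), the sketch below gives the "kernel" only a small fixed-size message primitive and builds a byte-stream pipe on top of it in ordinary code. `MicroKernel`, `Pipe` and the 8-byte message limit are invented for the example.

```python
from collections import deque

MAX_MSG = 8  # illustrative limit on the payload of one kernel message


class MicroKernel:
    """Only the low-level primitive: bounded messages sent to ports."""

    def __init__(self):
        self.ports = {}

    def create_port(self):
        port = len(self.ports)
        self.ports[port] = deque()
        return port

    def send(self, port, payload):
        if len(payload) > MAX_MSG:
            raise ValueError("message too large")
        self.ports[port].append(bytes(payload))

    def receive(self, port):
        queue = self.ports[port]
        return queue.popleft() if queue else None


class Pipe:
    """Higher-level byte stream implemented outside the kernel."""

    def __init__(self, kernel):
        self.kernel = kernel
        self.port = kernel.create_port()
        self.leftover = b""

    def write(self, data):
        # Chunk the stream into kernel-sized messages.
        for i in range(0, len(data), MAX_MSG):
            self.kernel.send(self.port, data[i:i + MAX_MSG])

    def read(self, n):
        # Reassemble messages until n bytes are available or the port is empty.
        while len(self.leftover) < n:
            msg = self.kernel.receive(self.port)
            if msg is None:
                break
            self.leftover += msg
        out, self.leftover = self.leftover[:n], self.leftover[n:]
        return out


kernel = MicroKernel()
pipe = Pipe(kernel)
pipe.write(b"hello, microkernel")
first = pipe.read(5)
rest = pipe.read(100)
```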

minix was not free for many years. that may also have affected its popularity.

> Has the Minix kernel shown significant performance results?

"If the automobile had followed the same development as the computer, a Rolls-Royce would today cost $100, get a million miles per gallon, and explode once a year killing everyone inside."

> I know OSX uses something like a micro-kernel and it is a perfectly fine desktop OS,

OSX doesn't use anything like a micro-kernel. It runs Mach (which even microkernel fans ridicule; it is bigger than many monolithic kernels, like the Plan 9 kernel) with a monolithic BSD-style kernel directly on top. When your microkernel has (being generous) two components, both huge, I don't think it is like what most people consider a microkernel at all.

It is also fun to remember that people also claimed Windows NT was a microkernel, and then they added the graphics system right into the kernel.

Minix is still just a toy. It doesn't implement virtual memory, which is one of the hardest things modern operating systems do.

If you've ever implemented real-world code that solves a problem you'll find that there are some people (mostly who have not implemented much practical code but read a lot about it) who want to tell you how you should have done it. I think Linus got sick of hearing these suggestions and began defending his implementation decisions based on his experience and the running code.

Correct me if I'm wrong, but this is nothing new. We've known about these limitations on FUSE for quite some time, haven't we?

sshfs isn't a toy

Linus is increasingly becoming a crazy old man. I find his diatribes largely irrelevant.

Interesting in light of Go:

  - New languages are needed for writing distributed/parallel applications
  `Needed', no.  `Helpful', perhaps.  The jury's still out.

Note that the Plan 9 team at the time were already working on Alef ( http://doc.cat-v.org/plan_9/2nd_edition/papers/alef/ ) and would later develop Limbo ( http://doc.cat-v.org/inferno/4th_edition/limbo_language/ ) as the language for the Inferno distributed system ( http://doc.cat-v.org/inferno/ )

It's sad Linus doesn't want to learn.
