The Hairy State of Linux Filesystems

blasdel · on Feb 12, 2009

His argurment is completely bogus -- they're less slim because they share common code? Did he even look at the symbols he's complaining about?

The only filesystem I have built as a module on my machine is the brand-new btrfs, it has 223 unresolved symbols, and guess what? They're all basic, neccessary shit that should obviously be shared: mallocs, libc functions, concurrency primitives, block IO, inode management, VFS api, generic datastructures, zlib, etc.

This is more simplicity, not less, and an example of something that fustrates interlopers about the Linux kernel development culture -- your module does not make it into vanilla unless you're working at the same level as your siblings. If you have a systemic improvement to make, you make it to a lower layer, you don't get to keep your own toys.

This way all the filesystems are improved by the innovations of one -- it took years for the production-ready XFS code released by SGI to make it into Linus's tree because it effectively implemented its own VFS tools. Eventually all its improvements were merged into the VFS layer, the refactored XFS was merged in, and all users got a performance boost.

LogicHoleFlaw · on Feb 11, 2009

I look forward to seeing what happens to filesystems as solid-state drives with intelligent controllers become more common.

Optimizing file systems for disc rotational latency is so last century.

wmf · on Feb 12, 2009

Optimizing file systems for disc rotational latency is so last century.

It's not clear that modern file systems are doing a very good job of that anyway; WAFL, ZFS, and Btrfs all deliberately fragment data to some extent. Besides 4K partition alignment, it's not clear that modern SSDs with pure page-mapping FTL benefit from any filesystem optimizations.

blasdel · on Feb 12, 2009

But flash memory (even with intelligent controllers) still has physical constraints to optimize for.

For starters: there's the much larger (and variable) block sizes, the erasure constraints around blocks (you have to rewrite a whole block to modify a bit), and the much more atomic power-management possibilities (can do better than just idle 'spin-down').

neilc · on Feb 12, 2009

you have to rewrite a whole block to modify a bit

Technically, you can change bits from 1 => 0; you need to erase a whole block to go from 0 => 1.

blasdel · on Feb 12, 2009

All the harder to optimize for -- you could make the bitmasks in the on-disk format inverted, but it wouldn't help any given that you're almost always going to modify an integer at the same time. I suppose you could use unary representations :)

Another related (and potentially conflicting) optimization would be for powersaving -- in solid state logic, boolean values are represented as 'low' and 'high', and since 'high' costs more power (and nand gates plentiful), many circuits get implemented in negative logic. That could mean that an on-disk format and it's inversion would have different power consumption!

jacquesm · on Feb 12, 2009

On regular harddrives you have to rewrite a whole sector to modify one bit.

blasdel · on Feb 12, 2009

But there's no optimization to be made, because the read head is going to pass over the whole sector anyway.

jacquesm · on Feb 12, 2009

To write one bit you have to read the whole sector, modify the bit and then write the whole sector back again, so the head (not the read head, there is no such thing) will have to see the whole sector pass by twice in order to modify one bit.

blasdel · on Feb 12, 2009

Yes, but the sectors are small, and the sector will pass beneath the head again on the next disk rotation. You would have already needed a full rotation, now you just need two -- it's still O(n).

With flash the blocks are much larger, and you don't need to seek over the whole thing to read one bit. The R/W disparity is far larger.

Plus with Flash there's a rich ongoing history of using raw controllerless devices that don't present a uniform interface to the OS (like MTDs), where you implement a lot of the controller logic in the filesystem, something that was never totally true for hard disks.

jacquesm · on Feb 12, 2009

Not 'totally true', but it was a lot more true in the past than it is today, logical block addressing on harddrives (especially PC drives, scsi and other non-consumer targeted drive architectures much less so) is a relatively new thing.

In the past the host did a lot more of the work that has now been offloaded to the drives, since IDE basically the whole controller moved off host onto the drive. The reasons are simple, tighter integration and lower costs as well as a uniform interface to the drive, no matter what goes on physically inside it. Nobody really cares! (well, drive manufacturers obviously do, but consumers will treat it like a black box, it either works or it doesn't, no much point in figuring out how it works).

So, flash in this sense is roughly where harddrives were in the times of the early winchester, you present a pretty rough device to the outside and the host has to know a lot about how it works to make it function.

SSDs are already changing this, the controller is built in now and the interface presented is very close to the one used by a normal harddrive, in fact most are indistinguishable. Even most usb sticks and the larger cards do a good job of hiding the gory details.

That full rotation business is not quite the whole story by the way, plenty of drives nowadays do not need to see the 'whole' track in order to do a simple read or write of a sector, they can optimize for that to the point that the sector will be just about ready to be read from or written to by the time the drive head arrives above the track.

Most drives try to avoid that situation as long as they can though, and most file systems do too by picking a block size that is a fairly large multiple of a sector.

mchadwick · on Feb 12, 2009

Obviously a bit off topic, but his graphic representation is astounding. Though not quite a Tufte archetype, I gleaned more insight from the image than the article. Very well done.

_csoo · on Feb 12, 2009

Asking the wrong question, looking at the wrong answer.

Who cares about this if the fundamental design of the file-system is flawed...

jff · on Feb 12, 2009

Slashdot was reporting yesterday that OSes are sliming down.

Don't worry, I got my proton pack right here for that Slimer.