On-disk format robustness requirements for new filesystems (lwn.net)
192 points by chmaynard 58 days ago | 58 comments

Somehow reminds me of this conversation [0]:

  > Al Viro asked if there is a plan to allow mounting hand-crafted XFS or ext4
  > filesystem images. That is an easy way for an attacker to run their own code
  > in ring 0, he said. The filesystems are not written to expect that kind of
  > (ab)use. When asked if it really was that easy to crash the kernel with a
  > hand-crafted filesystem image, Viro said: "is water wet?" 
[0] https://lwn.net/Articles/718639/

"The filesystems are not written to expect that kind of (ab)use. When asked if it really was that easy to crash the kernel with a hand-crafted filesystem image, Viro said: "is water wet?""

This is why an rsync.net account that is enabled to allow zfs send/recv is actually inside a VM and the customer is given their own zpool and their own root login.

It's really resource intensive to do it this way and there are other, much simpler and scalable ways to provide the ability to zfs send into cloud storage ...

However, there is universal agreement among the ZFS coding community[1] that allowing someone to 'zfs send' an arbitrary datastream (in this case, a snapshot) is tremendously dangerous.

In the best case, the malicious actor can crash the kernel and deny service. In the worst case, the malicious actor could destroy the underlying zpool.

[1] Please consider attending the OpenZFS developer Summit in November if you have any interest in this ...

I agree fully with the desire for resilience throughout the kernel, especially in today's world. It was very different 20 years ago when Linux was still young.

Otoh it's one of the things that I think made Linux succeed in the beginning. Everyone could upstream a half arsed driver for something and it would get fixed while people use it and encounter bugs. Now that Linux is used professionally everywhere that just isn't feasible anymore.

On another note, I remember that some time ago there was a talk about Linux file system fuzzing given at some conference and ext4 fared the best by far, which is why I'm still using that exclusively, although some of the features of btrfs would come in handy at times.

I think the main reason why btrfs has not fully replaced the ext family by now is its bugs. It's understandable that they don't want to add more buggy filesystems.

Personally, I’ve never noticed bugs, but the performance characteristics are truly weird. If you fill up a large partition, expect to spend a few days running various incantations to free space, even after removing plenty of files.

I’m completely amazed at how fast snapshots can be done, though.

Yeah, I'd classify that bad performance with nearly full drives as a bug. Compare that to one of my ext4 partitions, which has been > 90% full through years of active use without major issues. Occasionally I get errors that it can't write stuff any more due to capacity; I just delete some unneeded stuff and go on with life.

It's not really a bug, though, taken in the context of how copy-on-write file systems work. You need somewhere to write, so if you are low on free space and have the drive heads seeking...

It really makes it seem like handling free space was an afterthought in btrfs.

Infinite memory and infinite storage are great abstractions, easy to use models and a lot of software runs on those models... but in this case it seems like ENOSPC should have been designed in from the start.

I have noticed lots of things break when the resource horizon shrinks. Weirdness happens as one crosses into 1/2 (for things that like to double, like amortized data structures) and then again at <5% of free space/cpu/network/etc. Often the best mitigation is to have excess reserve and remove the false blocker (file on disk, optional process eating ram, etc).

Just more anecdotes: I accidentally filled up a 2tb btrfs filesystem on ssd with a buggy backup script. I deleted the stuff that shouldn't be there, ran a few recommended incantations, and was back to normal in about ten minutes. It was much less of a big deal than I expected based on what I'd read.

Btrfs has had its fair share of bugs, but rarely do they result in the inability to at least mount the file system read-only and get your data out. What has further been exposed by Btrfs are device bugs, in particular in firmware. When write ordering doesn't happen the way the file system expects, it's a big problem, and Btrfs does not tolerate it: it should abruptly go read-only in order to prevent amplifying file system confusion. This is separate from silent data corruption, which Btrfs also detects and complains about, but which usually just results in EIO, as Btrfs will not propagate data it thinks is corrupt due to csum mismatches. But even in that case you can get your data out.

Yep. I personally won't use btrfs for this reason. The last thing I want to deal with is filesystem corruption. If I needed the advanced features, I'd just use zfs.

"just use zfs" is its own can of worms.

Maybe Canonical will solve that, we will see how distributing zfs together with kernel will go through.

> "just use zfs" is its own can of worms.

it's been fine for some of us for, what, nearly 10 years now?

If you consider it normal to self-compile kernel modules on your production machines (or to have a compiler there, or to use a mutable system at all), well, then yes.

Otherwise, it is a massive pita. I do have one machine with zfs, and I made the mistake of placing the root into a subvolume. Won't do that again.

A few distros include ZoL drivers in their repos.

There's also FreeBSD as well as the various platforms derived from the, now defunct, OpenSolaris.

I've never had to compile ZFS myself (though I have, on occasion, chosen to because I wanted to try features before they hit the repos).

I'd rather do that than use a buggy filesystem!

ext4 and xfs are perfectly fine, you don't have to choose only between zfs or btrfs. In their case, being the boring ones, is an important feature.

And they are supported by any linux distro out of the box.

Oh sure. I generally stick to ext4. But IF I needed the advanced features of btrfs/zfs, then I'd choose zfs.

More than 10 years in my case. I've run ZFS on production systems on Solaris, OpenSolaris, Nexenta, FreeBSD (vanilla, not FreeNAS) and, more recently, Ubuntu. Never had an issue.

That said, I did once run into an issue running ZFS on ArchLinux which caused data loss. That was a highly experimental set up though and was before ZoL really took off (incidentally I've also run Btrfs on ArchLinux and that also caused me data loss).

Hopefully I'm not jinxing things by saying this, but ZFS has saved me from excessive downtime on a number of occasions. It has even recovered from corrupted superblock failures (when a RAID controller was faulty and randomly dropping devices during heavy load).

After an experience where btrfs decided to break my combined filesystem I vow to never use it again.

Could you explain/describe what happened and why and how to prevent it?

Sure, I was trying to remove a disk from my array (1TB, with 50GB used; the other disk was 4TB and had ~600GB used). I tried doing `btrfs device remove /dev/disk1 /mnt` and it refused, claiming there was no free space. No amount of arcane commands worked. Eventually I just copied all the files to somewhere else and nuked the filesystem entirely.

Due to the way btrfs allocates in "collections" of extents called block groups, it's possible all the space was allocated by mostly empty block groups, which could make ENOSPC possible. But that's a rather old set of bugs that I haven't seen in a very long time and predates all the modern space handling code. It must have been pre-4.0 kernels. And I did run into it myself on purpose many times while trying to help improve the behavior.

A non-obvious but very straightforward way around such a wedged file system is to add a 3rd device. It could even be a USB stick; back then I was using small 4GiB sticks and it would work. That was enough to allocate a couple of metadata-only block groups to the stick, to write out the file system changes necessary to back out the second device. And once that completes, a brief filtered balance (e.g. btrfs balance start -dusage=10 is usually sufficient) frees enough space on the 1st device to back out the 3rd (the USB stick).
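To make the block group point concrete, here is a toy model in Python. It is not btrfs code: `ToyDevice`, `GROUP_SIZE`, and `balance_data` are made-up names, and the packing strategy is a simplification I chose for illustration. It only shows how a device can report ENOSPC for a new metadata block group while very little space is actually used, and how compacting mostly empty data groups hands space back.

```python
import errno

GROUP_SIZE = 1024  # MiB per block group; a made-up round number for this toy


class ToyDevice:
    """Toy allocator: space is handed out in whole block groups."""

    def __init__(self, size_mib):
        self.unallocated = size_mib
        self.groups = []  # each group: {"kind": "data" or "metadata", "used": MiB}

    def allocate_group(self, kind):
        # A new group needs unallocated space, no matter how empty existing groups are.
        if self.unallocated < GROUP_SIZE:
            raise OSError(errno.ENOSPC, "no unallocated space for a " + kind + " block group")
        self.unallocated -= GROUP_SIZE
        group = {"kind": kind, "used": 0}
        self.groups.append(group)
        return group

    def used(self):
        return sum(g["used"] for g in self.groups)

    def balance_data(self, usage_percent):
        """Move the contents of data groups at or below the usage threshold
        into other data groups, then return the emptied groups to the pool."""
        limit = GROUP_SIZE * usage_percent // 100
        data_groups = [g for g in self.groups if g["kind"] == "data"]
        for donor in sorted(data_groups, key=lambda g: g["used"]):
            if donor["used"] > limit:
                continue  # too full to be worth relocating
            targets = [g for g in self.groups if g["kind"] == "data" and g is not donor]
            if sum(GROUP_SIZE - t["used"] for t in targets) < donor["used"]:
                continue  # nowhere to put this group's contents
            for target in sorted(targets, key=lambda g: -g["used"]):
                moved = min(donor["used"], GROUP_SIZE - target["used"])
                target["used"] += moved
                donor["used"] -= moved
            self.groups = [g for g in self.groups if g is not donor]
            self.unallocated += GROUP_SIZE
```

With a 4096 MiB device carved into four data groups holding 50 MiB each, `allocate_group("metadata")` fails even though only 200 MiB is in use. After `balance_data(10)` the same call succeeds, because three of the four groups have been returned to the unallocated pool.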

The non-obvious thing about any COW file system is that deletion always requires free space. There is no such thing as deleting a file with COW unless the fs can write that deletion change to all the affected trees into free space. Once the entire set of metadata changes is committed, then the data and metadata extents for those deleted files can be freed.
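A minimal sketch of that idea, again as a toy model rather than anything resembling a real filesystem: `ToyCowFs` and its `reserve` parameter are hypothetical, and metadata is reduced to "one free block is needed to record a deletion". The reserve is my own addition, to show why holding some space back from data writes keeps deletion possible.

```python
import errno


class ToyCowFs:
    """Toy copy-on-write store where recording a deletion needs one free block."""

    def __init__(self, total_blocks, reserve=0):
        self.total_blocks = total_blocks
        self.reserve = reserve  # blocks that data writes may not consume
        self.files = {}         # name -> number of data blocks

    def free_blocks(self):
        return self.total_blocks - sum(self.files.values())

    def write(self, name, blocks):
        # Data writes stop short of the reserve.
        if self.free_blocks() - self.reserve < blocks:
            raise OSError(errno.ENOSPC, "no space for data")
        self.files[name] = blocks

    def delete(self, name):
        # The updated metadata has to land in free space first; only after that
        # commit can the deleted file's blocks be released.
        if self.free_blocks() < 1:
            raise OSError(errno.ENOSPC, "no space to record the deletion")
        del self.files[name]
```

With `reserve=0` the store can be filled to the last block, after which `delete` itself fails with ENOSPC. With `reserve=1` the final block can never be taken by data, so deletion always has somewhere to write.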

Anyway, a lot has changed even in one year in Btrfs, let alone the past five years. It's thousands of line changes per kernel release.

I'm wondering if there is a flash-memory-friendly filesystem that does not overwrite existing blocks until the card is full, but instead writes any changes to new memory cells. This is not a problem for devices like cameras, since they only create new files, so wear is distributed evenly among all cells (assuming you remove all files when the card is full, before taking new photos), but it's a real problem for devices like the Raspberry Pi running Raspbian, which do a lot of updates (log files, ...). And yes, I understand the root partition could be set to read-only and logs stored on tmpfs, but I'm still curious.

PS: I still struggle to understand why I'm getting downvotes, so please be so kind and write why, so I can eventually delete this comment.

It's called a log-structured file system. LWN has a good overview:


SSDs implement them internally, and some Android devices use the F2FS filesystem:


Log-structured is conceptually robust, but SSDs/NVMes have failure modes that can do things like "in this extent of 64MB, all bytes have their 6th bit erased". So it's an illusion to think that in case of a crash, the things you did not touch remained unharmed.

here's a paper that will shatter some illusions: https://www.usenix.org/system/files/conference/fast13/fast13...

Considering how the hardware doesn't provide the serializability guarantees that it claims, why do we pay the high software performance and complexity hit to try to get the same?

Lockfiles, O_SYNC, flush(), etc. all become unnecessary if we just assume that all data is at risk in case of improper poweroff. libeatmydata does this, and dramatically increases performance for some workloads.
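For reference, this is roughly the ceremony being discussed, as a sketch in Python. `atomic_write` and its `durable` flag are names I made up; the calls themselves (`os.fsync`, `os.replace`, `tempfile.mkstemp`) are standard library. Passing `durable=False` skips the syncs, which is approximately the effect libeatmydata has on a program by turning `fsync` and friends into no-ops. The directory sync assumes a POSIX system.

```python
import os
import tempfile


def atomic_write(path, data, durable=True):
    """Replace the file at path with data using write-to-temp-then-rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()  # push Python's buffer to the kernel
            if durable:
                os.fsync(tmp.fileno())  # ask the kernel to push it to the device
        os.replace(tmp_path, path)  # atomic rename over the old file
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    if durable:
        # Sync the directory so the rename itself is recorded.
        dir_fd = os.open(directory, os.O_RDONLY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)
```

Whether the device honours those sync requests is exactly what the paper linked above calls into question; the code can only ask.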

I/O has always been a happy-path-only adventure in mainstream software and hardware.

Attempting consistent I/O (kernel and hardware will try their best to thwart any attempt) etc. may help in some cases, but ultimately there are hardly any guarantees when it comes to power loss, and your data may be gone or corrupted no matter what you did.

I thought most flash devices did their own block mapping to implement this. Or is that just SSDs?

Only tiny tiny embedded devices don't use a wear-leveling controller of some kind.

AFAIU, you are correct, but apparently, "most flash devices" does not include MicroSD cards. At least that is how I remember it.

No, they've definitely got a Flash translation layer in there, and a microcontroller to run it. It's pretty much unavoidable for making a working flash device with reasonable performance on traditional filesystems.

But SD cards, CF cards, eMMCs, USB sticks - unless enterprise graded and claim otherwise, all have either very primitive wear leveling in their FTL or none at all, it's only SSDs that have full blown log structured algorithms.
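For anyone wondering what a translation layer does in principle, here is a toy sketch. `ToyFtl` is hypothetical and far simpler than real firmware: there are no erase blocks and no garbage collection, and it assumes at least one spare physical page. Each write of a logical page goes to the least-worn free physical page instead of overwriting in place.

```python
import errno


class ToyFtl:
    """Toy flash translation layer: out-of-place writes with wear counters."""

    def __init__(self, physical_pages):
        self.data = [None] * physical_pages
        self.writes = [0] * physical_pages  # wear counter per physical page
        self.mapping = {}                   # logical page -> physical page

    def write(self, logical, payload):
        in_use = set(self.mapping.values())
        free = [p for p in range(len(self.data)) if p not in in_use]
        if not free:
            raise OSError(errno.ENOSPC, "no spare physical page")
        target = min(free, key=lambda p: self.writes[p])  # least-worn free page
        self.data[target] = payload
        self.writes[target] += 1
        self.mapping[logical] = target  # the previously mapped page is now stale

    def read(self, logical):
        return self.data[self.mapping[logical]]
```

Rewriting the same logical page eight times on a four-page device leaves every physical page with two writes, rather than one page with eight.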

It feels surprising to me that a Linux system can be bricked or rooted by a maliciously constructed filesystem and this is not considered a major bug. Surely this is an obvious attack vector? (Dropped USB sticks, etc.)

It's a pretty good argument for FUSE-style userspace filesystems. Then the code reading the filesystem need not have any more permissions than the user mounting it.

That might prevent privilege escalation but I'm not sure it prevents a kernel wedge.

If you can wedge the kernel with a FUSE filesystem, that is certainly a severe bug, because it means an unprivileged user is able to DOS the system for everyone.

It's the reason libguestfs exists.

... or block storage systems in container environments, where the user somehow has write access to the block device.

Aren't those the usual way to provision some databases?

One argument is that allowing non-root to mount(2) is already a big security problem.

Fuzzing was suggested in the past to shake out some corrupt filesystem image bugs: https://lwn.net/Articles/685182/

User Mode Linux is fresh in my mind due to a recent HN post on it – wouldn't it be easier to fuzz the kernel as a user-mode program instead of futzing around with a `/dev/afl` device?

Being so pluggable, a file system should even be fuzzable inside of a highly sandboxed environment.

1) compile fs code into wasm

2) generate in-memory disk images

3) run fs code over in-memory disk image

4) use a neural net to search the fuzz space

At that point one could couple it with something like profile-guided optimization, but with branch-predicted adversarial input differentiation. This would automatically find patterns between on-disk data structures and the code that is executed from changes in those structures.

Zero syscall, zero vm exit file system fuzzing all in user space. One could easily get thousands of cores working on this problem in short order.
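Steps 2 and 3 can be sketched in a few lines of Python. Everything here is hypothetical: the "TOYF" image format, `make_image`, `parse_image`, and `fuzz` are invented for illustration, the parser has a deliberately planted bug, and plain random byte mutation stands in for both the wasm sandbox and the neural-net search.

```python
import random

MAGIC = b"TOYF"
IMAGE_SIZE = 16


def make_image(name):
    """Build a toy image: magic, one length byte, the name, zero padding."""
    return (MAGIC + bytes([len(name)]) + name).ljust(IMAGE_SIZE, b"\0")


def parse_image(image):
    """Return the stored name, raising ValueError for images it rejects."""
    if image[:4] != MAGIC:
        raise ValueError("bad magic")
    name_len = image[4]
    # Planted bug: name_len is trusted, so this index can run past the image.
    if image[5 + name_len] != 0:
        raise ValueError("missing terminator")
    return image[5:5 + name_len]


def fuzz(seed_image, iterations, seed=0):
    """Mutate one byte per iteration and collect inputs that crash the parser."""
    rng = random.Random(seed)
    crashes = []
    for _ in range(iterations):
        mutated = bytearray(seed_image)
        mutated[rng.randrange(len(mutated))] = rng.randrange(256)
        try:
            parse_image(bytes(mutated))
        except ValueError:
            pass  # a clean rejection is what a robust parser should do
        except Exception as exc:
            crashes.append((bytes(mutated), type(exc).__name__))
    return crashes
```

Calling `fuzz(make_image(b"root"), 500)` returns the mutated images that escaped the parser's validation, each paired with the exception type. In a kernel the equivalent of that uncaught `IndexError` is an out-of-bounds read.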

NetBSD has been sort of doing this; they have rump kernels to run kernel pieces in userspace, and testing/fuzzing is a significant use.

Noob question - what is the advantage for Huawei in getting this moved out of staging? All of their devices are already using it despite it being in staging. Is it to reduce maintenance burden on themselves because it's now a shared responsibility?

Filesystems are fairly pluggable, so them keeping it private isn't that big a burden.

They would be expected to maintain it either way.

The biggest benefit would probably be if they hope to get Google to make it a standard part of Android, or hope for other manufacturers to start using it (and therefore sharing the work of feature development and maintenance).


But also to make sure that it isn't kicked out of the kernel. Staging is not meant to be a permanent location for any kernel code.

Desktop Linux systems often feature some kind of automount capability so you can just plug in an external drive and have it work. Windows and macOS provide similar facilities. Per the article, is this actually insecure? Will plugging in a corrupt/maliciously modified ext4-formatted drive on an automount Linux system enable kernel compromise? If so, why is it that this is generally OK on Windows/macOS (barring silly things like autostart viruses)?

As an aside, if I were to plug an ext4-formatted memory stick in, and the system automounts it, and I've placed a setuid binary on there, will it work? Or does the automounter predict that and mount with nosuid?
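One way to check on a given machine is to look at the options the stick was actually mounted with. The sketch below parses text in the `/proc/mounts` format (device, mount point, type, options, two numbers); `mount_options` and `allows_setuid` are hypothetical helper names, and the mount point in the usage note is made up. It does not handle mount points containing escaped spaces.

```python
def mount_options(mounts_text, mount_point):
    """Return the set of mount options for mount_point, or None if not mounted."""
    found = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == mount_point:
            found = set(fields[3].split(","))  # the last matching entry wins
    return found


def allows_setuid(mounts_text, mount_point):
    """True if the mount lacks nosuid, i.e. setuid bits would be honoured."""
    options = mount_options(mounts_text, mount_point)
    if options is None:
        raise LookupError(mount_point + " is not mounted")
    return "nosuid" not in options
```

Usage would be something like reading `/proc/mounts` and calling `allows_setuid(text, "/media/user/stick")`. If `nosuid` is in the options, the kernel ignores the setuid bit on binaries from that filesystem.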

On a side note, how does one go about adding data to a read-only file system such as EROFS?

This is common in the embedded world with squashfs - unpack it somewhere, add/modify the filesystem there, re-pack it.

If it's a working environment you could add an overlayfs and repack it after working on it. I used to do that when I ran Gentoo in RAM. Emerge and squash.

At least I think that's what I did... Man time flies

SquashFS also supports append-only modifications in-place.
