
On-disk format robustness requirements for new filesystems - chmaynard
https://lwn.net/SubscriberLink/796687/a7a91ffcc9b7d52a/
======
senozhatsky
Somehow reminds me of this conversation [0]:

    
    
      > Al Viro asked if there is a plan to allow mounting hand-crafted XFS or ext4
      > filesystem images. That is an easy way for an attacker to run their own code
      > in ring 0, he said. The filesystems are not written to expect that kind of
      > (ab)use. When asked if it really was that easy to crash the kernel with a
      > hand-crafted filesystem image, Viro said: "is water wet?" 
    

[0] [https://lwn.net/Articles/718639/](https://lwn.net/Articles/718639/)

~~~
rsync
"The filesystems are not written to expect that kind of (ab)use. When asked if
it really was that easy to crash the kernel with a hand-crafted filesystem
image, Viro said: "is water wet?""

This is why an rsync.net account that is enabled to allow zfs send/recv is
_actually_ inside a VM and the customer is given their own zpool and their own
root login.

It's really resource-intensive to do it this way, and there are other, much
simpler and more scalable ways to provide the ability to zfs send into cloud
storage ...

However, there is universal agreement among the ZFS coding community[1] that
allowing someone to 'zfs send' an arbitrary datastream (in this case, a
snapshot) is _tremendously dangerous_.

In the best case, the malicious actor can crash the kernel and deny service.
In the worst case, the malicious actor could destroy the underlying zpool.

[1] Please consider attending the OpenZFS developer Summit in November if you
have any interest in this ...

------
iforgotpassword
I agree fully with the desire for resilience throughout the kernel, especially
in today's world. It was very different 20 years ago when Linux was still
young.

Otoh it's one of the things that I think made Linux succeed in the beginning.
Everyone could upstream a half-arsed driver for something and it would get
fixed while people used it and encountered bugs. Now that Linux is used
professionally everywhere, that just isn't feasible anymore.

On another note, I remember that some time ago there was a conference talk
about Linux filesystem fuzzing in which ext4 fared the best by far, which is
why I'm still using it exclusively, although some of the features of btrfs
would come in handy at times.

~~~
est31
I think the main reason why btrfs has not fully replaced the ext family by now
is its bugs. It's understandable that they don't want to add more buggy
filesystems.

~~~
pletnes
Personally, I’ve never noticed bugs, but the performance characteristics are
truly weird. If you fill up a large partition, expect to spend a few days
running various incantations to free up space, even after removing plenty of
files.

I’m completely amazed at how fast snapshots can be done, though.

~~~
est31
Yeah, I'd classify that bad performance on nearly full drives as a bug.
Compare that to one of my ext4 partitions, which has been > 90% full for years
of active use without major issues. Occasionally I get errors that it can't
write any more due to lack of capacity; I just delete some unneeded stuff and
go on with life.

~~~
seized
It's not really a bug, though, taken in the context of how these filesystems
work with copy-on-write. You need somewhere to write, so if you're low on free
space and have the drive heads seeking...

------
finchisko
Wondering if there is a flash-memory-friendly filesystem that does not
overwrite existing blocks until the card is full, but instead writes any
changes to new memory cells. This is not a problem for devices like cameras,
since they only create new files, so wear is distributed evenly among all
cells (assuming you remove all files when the card is full, before taking new
photos), but it's a real problem for devices like a Raspberry Pi running
Raspbian, which do a lot of small updates (log files, ...). And yes, I
understand the root partition could be set to read-only and logs stored on
tmpfs, but I'm still curious.

PS: I still struggle to understand why I'm getting downvotes, so please be so
kind and write why, so I can eventually delete this comment.

~~~
mrob
It's called a log-structured file system. LWN has a good overview:

[https://lwn.net/Articles/353411/](https://lwn.net/Articles/353411/)

SSDs implement them internally, and some Android devices use the F2FS
filesystem:

[https://en.wikipedia.org/wiki/F2FS](https://en.wikipedia.org/wiki/F2FS)
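The append-only idea behind log-structured filesystems can be sketched in a few lines. This is a toy illustration under my own assumptions, not F2FS's actual on-disk layout; `ToyLog` and its record format are made up:

```python
# Toy sketch of the log-structured idea: every update is appended to the
# end of the log, and an index maps each key to the offset of its latest
# record. Old data is never overwritten in place, which is exactly what
# wear-sensitive flash prefers.
import io
import json

class ToyLog:
    def __init__(self):
        self.log = io.BytesIO()   # stands in for the flash device
        self.index = {}           # key -> offset of the latest record

    def put(self, key, value):
        offset = self.log.seek(0, io.SEEK_END)
        record = json.dumps({"key": key, "value": value}).encode() + b"\n"
        self.log.write(record)    # append only; never rewrites old cells
        self.index[key] = offset

    def get(self, key):
        self.log.seek(self.index[key])
        return json.loads(self.log.readline())["value"]

log = ToyLog()
log.put("config", "v1")
log.put("config", "v2")           # supersedes v1 but leaves it in the log
print(log.get("config"))          # -> v2
```

A real implementation additionally needs garbage collection to reclaim the superseded records, which is where most of the complexity (and the nearly-full-device pain) lives.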

~~~
toolslive
Log-structured is conceptually robust, but SSDs/NVMes have failure modes that
can do things like "in this extent of 64MB, all bytes have their 6th bit
erased". So it's an illusion to think that in case of a crash, the things you
did not touch remained unharmed.

~~~
toolslive
Here's a paper that will shatter some illusions:
[https://www.usenix.org/system/files/conference/fast13/fast13...](https://www.usenix.org/system/files/conference/fast13/fast13-final80.pdf)

~~~
londons_explore
Considering how the hardware doesn't provide the serializability guarantees
that it claims, why do we pay the high software performance and complexity hit
to try to get the same?

Lockfiles, O_SYNC, flush(), etc. all become unnecessary if we just assume that
all data is at risk in case of improper poweroff. libeatmydata does this, and
_dramatically_ increases performance for some workloads.
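The trade-off can be seen in a minimal sketch, assuming ordinary POSIX semantics; the function names here are made up:

```python
# Sketch of the durability/performance trade-off described above.
# os.fsync() blocks until the kernel claims the data reached stable
# storage; skipping it (which is effectively what libeatmydata does by
# turning fsync into a no-op) leaves the data at the mercy of an unclean
# poweroff, in exchange for much faster writes.
import os
import tempfile

def write_durably(path, data):
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        os.write(fd, data)
        os.fsync(fd)          # force data (and metadata) to the device
    finally:
        os.close(fd)

def write_fast(path, data):
    with open(path, "wb") as f:
        f.write(data)         # may sit in the page cache indefinitely

with tempfile.TemporaryDirectory() as d:
    write_durably(os.path.join(d, "a"), b"should survive a poweroff")
    write_fast(os.path.join(d, "b"), b"gone if the power blinks")
```

The parent's point is that if the hardware can corrupt even the fsync'd copy on power loss, the durable path buys less than its cost suggests.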

~~~
blattimwind
I/O has always been a happy-path-only adventure in mainstream software and
hardware.

Attempting consistent I/O (kernel and hardware will try their best to thwart
any attempt) etc. may help in some cases, but ultimately there are hardly any
guarantees when it comes to power loss, and your data may be gone or corrupted
no matter what you did.

------
JulianMorrison
It feels surprising to me that a Linux system can be bricked or rooted by a
maliciously constructed filesystem and this is not considered a major bug.
Surely this is an obvious attack vector? (Dropped USB sticks, etc.)

~~~
londons_explore
It's a pretty good argument for FUSE-style userspace filesystems. Then the
code reading the filesystem need not have any more permissions than the user
mounting it.

~~~
JulianMorrison
That might prevent privilege escalation but I'm not sure it prevents a kernel
wedge.

~~~
catern
If you can wedge the kernel with a FUSE filesystem, that is certainly a severe
bug, because it means an unprivileged user is able to DoS the system for
everyone.

------
the8472
Fuzzing was suggested in the past to shake out some corrupt filesystem image
bugs: [https://lwn.net/Articles/685182/](https://lwn.net/Articles/685182/)

~~~
akx
User Mode Linux is fresh in my mind due to a recent HN post on it – wouldn't
it be easier to fuzz the kernel as a user-mode program instead of futzing
around with a `/dev/afl` device?

~~~
sitkack
Being so pluggable, a file system should even be fuzzable inside of a highly
sandboxed environment.

1) compile fs code into wasm

2) generate in-memory disk images

3) run fs code over in-memory disk image

4) use a neural net to search the fuzz space

At that point one could couple something like profile-guided optimization with
branch-predicted adversarial input differentiation. This would automatically
find patterns between the on-disk data structures and the code that is
executed when those structures change.

Zero syscall, zero vm exit file system fuzzing all in user space. One could
easily get thousands of cores working on this problem in short order.
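Steps 2 and 3 above can be sketched in pure userspace. This is a toy harness under my own assumptions, with a made-up superblock format standing in for real filesystem code:

```python
# Minimal userspace fuzz loop: generate random in-memory "disk images"
# and feed them to a parser, recording any input that raises something
# other than a clean rejection. The parser is a toy superblock reader.
import random
import struct

def parse_toy_superblock(image: bytes):
    # layout: magic (4 bytes), block_size (u32), block_count (u32)
    if len(image) < 12:
        raise ValueError("truncated image")
    magic, block_size, block_count = struct.unpack_from("<4sII", image)
    if magic != b"TOYF":
        raise ValueError("bad magic")
    if block_size == 0 or len(image) < block_size * block_count:
        raise ValueError("inconsistent geometry")
    return block_size, block_count

def fuzz(rounds=1000, seed=0):
    rng = random.Random(seed)
    crashes = []
    for _ in range(rounds):
        image = bytes(rng.randrange(256) for _ in range(rng.randrange(64)))
        try:
            parse_toy_superblock(image)
        except ValueError:
            pass                      # rejected cleanly -- good
        except Exception as exc:      # anything else counts as a "crash"
            crashes.append((image, exc))
    return crashes

print(len(fuzz()))  # -> 0: a robust parser rejects every bad image cleanly
```

Real kernel filesystem code run this way (via wasm, UML, or a library build) would need mutation of valid images rather than pure random bytes to reach interesting code paths, but the harness shape is the same.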

------
nindalf
Noob question - what is the advantage for Huawei in getting this moved out of
staging? All of their devices are already using it despite it being in
staging. Is it to reduce maintenance burden on themselves because it's now a
shared responsibility?

~~~
londons_explore
Filesystems are _fairly_ pluggable, so them keeping it private isn't that big
a burden.

They would be expected to maintain it either way.

The biggest benefit would probably be if they hope to get Google to make it a
standard part of Android, or hope for other manufacturers to start using it
(and therefore share the work of feature development and maintenance).

------
nneonneo
Desktop Linux systems often feature some kind of automount capability so you
can just plug in an external drive and have it work. Windows and macOS provide
similar facilities. Per the article, is this actually insecure? Will plugging
in a corrupt/maliciously modified ext4-formatted drive on an automount Linux
system enable kernel compromise? If so, why is it that this is generally OK on
Windows/macOS (barring silly things like autostart viruses)?

------
amaccuish
As an aside, if I were to plug an ext4-formatted memory stick in, and the
system automounts it, and I've placed a setuid binary on there, will it work?
Or does the automounter anticipate that and mount with nosuid?
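As far as I know, desktop automounters like udisks2 do add nosuid (and nodev) for removable media, but you can check empirically on any given system by reading /proc/self/mounts, which lists the options each filesystem was mounted with. A sketch, with a made-up helper and sample line:

```python
# Parse mounts(5)/fstab-style lines: device, mountpoint, fstype,
# comma-separated options, ... On a live Linux system you would pass in
# open("/proc/self/mounts").read() and your stick's mountpoint.
def mount_options(mounts_text: str, mountpoint: str):
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == mountpoint:
            return fields[3].split(",")
    return None  # mountpoint not found

sample = "/dev/sdb1 /media/usb0 ext4 rw,nosuid,nodev,relatime 0 0"
print("nosuid" in mount_options(sample, "/media/usb0"))  # -> True
```

If nosuid shows up in the options, the setuid bit on your planted binary is simply ignored.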

------
rumanator
On a side note, how does one go about adding data to a read-only filesystem
such as EROFS?

~~~
mschuster91
This is common in the embedded world with squashfs: unpack it somewhere,
add/modify files there, re-pack it.

~~~
BlackLotus89
If it's a working environment, you could add an overlayfs and repack it after
working on it. I used to do that when I ran Gentoo in RAM. Emerge and squash.

At least I think that's what I did... Man, time flies

