
Filesystem error handling - Veelox
https://danluu.com/filesystem-errors/
======
kev009
One trope I frequently encounter as a FreeBSD professional is "Linux is used
by so many people, it doesn't have vast and sweeping bugs anymore" -- the
many-eyes fallacy. In reality, you get the bystander effect: everyone wants
Red Hat or IBM to do all the hard work while they reap the rewards.

I find my field of systems pretty interesting: there's so much to do and not
a lot of people rushing in to do it as the heavy hitters retire or move to
entrepreneurship etc. It's probably much cheaper to harden a *BSD or Illumos
system in these domains, and much easier to get the results integrated into
mainline. The financial rewards and fame are much greater on Linux, though.

~~~
Bromskloss
> the many eyes fallacy

What do you mean by this? Do more eyes not find and point out more problems?
Is the effect offset by some counteracting mechanism?

~~~
aw1621107
I think it's more along the lines of "Eh, I probably don't need to look, since
someone else is surely doing that".

~~~
convolvatron
once it gets annoying enough -- followed by "let's try a newer kernel and see
if it goes away"

------
magnat
> This could happen if there’s disk corruption, a transient read failure, or a
> transient write failure caused bad data to be written.

Wouldn't a transient I/O error be corrected by the block driver itself, by
retrying the request a few times? Does a filesystem driver assume that any
error reported by the storage device is permanent?

> The 2017 tests above used an 8k file where the first block that contained
> file data either returned an error at the block device level or was
> corrupted, depending on the test.

Are there any tests on how filesystems deal with metadata corruption or I/O
errors? Apart from the NTFS "self-healing" introduced in Windows 7/2008, do
modern filesystem drivers attempt to fully correct broken metadata on the
kernel side, or do they just give up to avoid making things worse?

~~~
binarycrusader
I believe apfs (apple's new filesystem), btrfs, and zfs all checksum metadata
and will attempt to correct it.

~~~
tscs37
I recommend the OP article since it specifically mentions apfs.

~~~
binarycrusader
The OP article mentions apfs does not checksum _data_, but I assure you that
it does checksum _metadata_:

[https://en.wikipedia.org/wiki/Apple_File_System#Data_integrity](https://en.wikipedia.org/wiki/Apple_File_System#Data_integrity)

~~~
tscs37
Even ext4 has checksumming of metadata; it's a fairly boring feature by now.

~~~
binarycrusader
ext4 only received checksumming of metadata in Linux 3.16, which was released
in August 2014 -- not that long ago really. So I wouldn't call it a fairly
boring feature, since it's only been a few years since it was integrated and I
imagine some Linux distributions might not even have it yet.

Furthermore, it's unclear to me whether ext4 just uses those checksums for
integrity checks or whether it actually has any "self-healing" capabilities.

My understanding is apfs, btrfs, and zfs all have self-healing capabilities.

~~~
tscs37
Checksums do not provide "self-healing". Ever.

They can only tell you if a piece of data is wrong or not.

In the case of btrfs it's less that the data is "healed" and more that the
metadata is replicated across the disk, so a corrupted copy can be replaced
from a good one.

I believe ZFS does the same, but I'm not sure of this. I'm certain that ZFS
does this if it has several disks to replicate from.

I'm not aware of APFS being self-healing on a single disk.
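
To make that distinction concrete, here is a toy sketch (nothing like any
real filesystem's on-disk layout; the checksum and names are made up): the
checksum only detects the bad copy, and recovery is possible only because a
replica exists.

    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>
    
    /* Toy checksum, purely for illustration (real filesystems use CRC32C etc.). */
    static uint32_t toy_checksum(const uint8_t *b, size_t n) {
        uint32_t sum = 0;
        for (size_t i = 0; i < n; i++)
            sum = sum * 31 + b[i];
        return sum;
    }
    
    /* "Self-healing" = detection (a checksum) plus redundancy (a replica).
       With only the checksum, the best we could do is report the error. */
    int read_block(uint8_t *out, const uint8_t *copy_a, const uint8_t *copy_b,
                   size_t n, uint32_t expected) {
        if (toy_checksum(copy_a, n) == expected) {
            memcpy(out, copy_a, n);
            return 0;
        }
        if (toy_checksum(copy_b, n) == expected) {
            memcpy(out, copy_b, n);  /* "healed": a good replica existed */
            return 0;
        }
        return -1;  /* corruption detected, but no good copy left */
    }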

~~~
jorangreef
Yes, ZFS also replicates metadata on a single disk.

~~~
binarycrusader
You can even tell ZFS to do the same thing with user data via the ditto
blocks feature (the copies property, e.g. zfs set copies=2 pool/dataset).

------
snvzz
While neat, it's limited to Linux. It'd be nice to see how the BSDs behave on
their filesystems (such as HAMMER2). ZFS was also notably absent.

~~~
X86BSD
As usual, another example of the Linux echo chamber. It's just one constant
inward gaze with that crowd. They learn nothing from anyone, only trial and
error.

~~~
wyldfire
I don't know that I'd agree. As much as llvm+clang inspired gcc to improve,
ZFS seems like a big part of the inspiration for btrfs. I'm not sure that it
hits the whole ZFS feature set, but it's probably a step in that direction.

------
blattimwind
Something to keep in mind is that by default writes go through the page
cache. By the time I/O is actually initiated for a write(2), the application
process may already have exited.

~~~
simcop2387
This is why calling fsync(fd) before closing the file and exiting is a good
idea if you need that kind of error to be handled: if the failure happens
after the write, you should get it back as the return value of fsync.
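
A minimal sketch of that pattern (assumptions: Linux/POSIX, and the file name
is just an example):

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>
    
    int main(void) {
        int fd = open("data.bin", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); return 1; }
    
        /* A successful write() only means the data reached the page cache. */
        if (write(fd, "hello", 5) < 0) { perror("write"); return 1; }
    
        /* fsync() forces the data out to stable storage; an I/O error that
           happened during deferred write-out is reported here. */
        if (fsync(fd) < 0) { perror("fsync"); return 1; }
    
        if (close(fd) < 0) { perror("close"); return 1; }
        return 0;
    }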

~~~
blattimwind
O_[D]SYNC is better than a separate call to fsync, since it is not supposed to
suffer from the race condition inherent to fsync. Arguably pedantic.

~~~
jorangreef
I agree with using O_DSYNC to surface the error to the write call, rather than
waiting until the fsync call, which is often not checked by the user.

I did some testing recently [1] with O_DIRECT + O_DSYNC and found some
surprising performance results: on Linux, for hard drives, it can be similar
to O_DIRECT + fsync() after every write. But as soon as you are doing grouped
writes, performance is almost always better with O_DIRECT + fsync() at the end
of the group.

For SSD drives though, O_DIRECT + O_DSYNC can be faster than O_DIRECT +
fsync() after the end of the group, if you are pipelining your IO, e.g. you
encrypt and checksum the next batch of sectors while you wait for the previous
batch of checksummed and encrypted sectors to be written out. Because SSDs are
so much faster, you can actually afford to slow down the write a little more
by using O_DSYNC, so that your write is not faster than the related CPU work.

[1] [https://github.com/ronomon/direct-io](https://github.com/ronomon/direct-io)
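
For reference, the basic shape of the O_DIRECT + O_DSYNC combination looks
roughly like this on Linux (a sketch: the 4096-byte alignment is a common
logical block size, but the real requirement is device-dependent):

    #define _GNU_SOURCE   /* O_DIRECT needs this on Linux */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    
    int main(void) {
        /* With O_DSYNC, each write() returns only once the data is on stable
           storage, so a write error surfaces at the write() call itself. */
        int fd = open("data.bin", O_WRONLY | O_CREAT | O_DIRECT | O_DSYNC, 0644);
        if (fd < 0) { perror("open"); return 1; }
    
        /* O_DIRECT requires the buffer, offset and length to be aligned. */
        void *buf;
        if (posix_memalign(&buf, 4096, 4096) != 0) {
            fprintf(stderr, "posix_memalign failed\n");
            return 1;
        }
        memset(buf, 0x42, 4096);
    
        if (write(fd, buf, 4096) < 0) { perror("write"); return 1; }
    
        free(buf);
        return close(fd) < 0 ? 1 : 0;
    }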

~~~
blattimwind
A more advanced (and somewhat easy to get wrong) option would be
sync_file_range combined with fdatasync, which lets you roughly emulate
O_DSYNC without blocking synchronously on I/O.

~~~
jorangreef
sync_file_range is different from fsync, fdatasync and O_DSYNC in that it
does not flush the disk write cache (whereas the latter explicitly do on newer
kernels):

[https://linux.die.net/man/2/sync_file_range](https://linux.die.net/man/2/sync_file_range):

    This system call does not flush disk write caches and thus does not
    provide any data integrity on systems with volatile disk write caches.

~~~
blattimwind
Hence "sync_file_range combined with fdatasync". sfr is only useful to get the
kernel to start the write-out _now_.
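
The combined pattern would look something like this (a sketch; the helper
names are mine, not from any API):

    #define _GNU_SOURCE   /* sync_file_range() is Linux-specific */
    #include <fcntl.h>
    #include <unistd.h>
    
    /* Hypothetical helper: write one batch and ask the kernel to start
       write-out immediately, without waiting for it to complete. */
    int write_batch(int fd, const void *buf, size_t len, off_t off) {
        if (pwrite(fd, buf, len, off) != (ssize_t)len)
            return -1;  /* short writes folded into the error case for brevity */
        /* Begins write-out now; does NOT flush volatile disk caches. */
        return sync_file_range(fd, off, len, SYNC_FILE_RANGE_WRITE);
    }
    
    /* Hypothetical helper: make everything durable. fdatasync() waits for
       outstanding write-out and also flushes the disk write cache. */
    int finish_batches(int fd) {
        return fdatasync(fd);
    }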

------
mjw1007
I can't see anywhere in the article where it says which version of Linux this
was tested on (beyond "2017"), which is a shame.

Is this the state of affairs after the improvements described at
[https://lwn.net/Articles/724307/](https://lwn.net/Articles/724307/) have gone
in?

~~~
mhei
second to last paragraph:

> All tests were run on both ubuntu 17.04, 4.10.0-37, as well as on arch,
> 4.12.8-2. We got the same results on both machines.

~~~
mjw1007
That wasn't there half an hour ago :-).

That LWN article was from June 2017 and 4.10 was released in February, so I
think that means these tests predate any of the changes it discusses.

~~~
edmccard
Arch linux 4.12.8-2 came out in August.

------
wyldfire
> It’s a common conception that SSDs are less likely to return bad data than
> rotational disks, but when Google studied this across their drives, they
> found:

>> The annual replacement rates of hard disk drives have previously been
>> reported to be 2-9% [19,20], which is high compared to the 4-10% of flash
>> drives we see being replaced in a 4 year period. However, flash drives are
>> less attractive when it comes to their error rates. More than 20% of flash
>> drives develop uncorrectable errors in a four year period, 30-80% develop
>> bad blocks and 2-7% of them develop bad chips. In comparison, previous work
>> [1] on HDDs reports that only 3.5% of disks in a large population developed
>> bad sectors in a 32 months period – a low number when taking into account
>> that the number of sectors on a hard disk is orders of magnitudes larger
>> than the number of either blocks or chips on a solid state drive, and that
>> sectors are smaller than blocks, so a failure is less severe.

Unfortunately "return bad data" is slightly ambiguous. IIRC the reported
failures are ones where the drive itself claims to have discovered invalid
blocks via its error-detection algorithms. The rate at which drives silently
"return bad data" is likely only discoverable by a well-controlled test that
writes specific data or data patterns to specific locations and then checks
whether that data was preserved; that could tell us about data that was
corrupted and escaped detection. Google's aggregate disk usage stats are a
dump of the devices and their service activity under application-specific
load, not the well-controlled test required to detect the "returned bad data"
symptom.

It's possible that you could scour application logs and try to attribute
application failures to corrupt data from a disk, but it would be difficult to
isolate.

~~~
Sinjo
Or any of the layers they run above the block device could be designed to
detect it, or even correct it.

~~~
wyldfire
Some of them do precisely that. btrfs is designed to correct small errors
introduced by the layers below it (this concept is tested in the article).

------
Upvoter33
Pretty cool to see an update on the old analysis -- and nice to hear that
file systems have improved!

------
rurban
Well, this certainly changed my opinion on btrfs. Now I can tolerate its
slightly slower performance, knowing that it is the only fs correctly handling
errors.

~~~
bartvk
That only goes for small files.

From the article: "Since the file was tiny, btrfs packed the file into the
metadata and the file was duplicated along with the metadata, allowing the
filesystem to fix the error when one block either returned bad data or
reported a failure."

