Filesystem error handling (danluu.com)
158 points by Veelox on Oct 23, 2017 | 46 comments

One trope I frequently encounter as a FreeBSD professional is "Linux is used by so many people, it doesn't have vast and sweeping bugs anymore".. the many eyes fallacy. In reality, you get bystander paradox.. everyone wants RedHat or IBM to do all the hard work and they reap all the rewards.

I find my field of systems pretty interesting.. there's so much to do and not a lot of people rushing in to do it as the heavy hitters retire or move to entrepreneurship etc. It's probably much cheaper to harden a *BSD or Illumos system in these domains, and much easier to get the results integrated into mainline. The financial rewards and fame are much greater in doing it on Linux though.

> the many eyes fallacy

What do you mean by this? Do more eyes not find and point out more problems? Is the effect offset by some counteracting mechanism?

The many eyes idea postulates many eyes looking at the code, not many eyes finding bugs. (Proprietary software has as many or more users finding bugs as open source!) In reality, the number of developers actually reading code to find bugs may not be significantly larger for complex open source like Linux, particularly in esoteric areas like filesystems or device drivers.


The results of "many eyes" seem to be basically moot (i.e. certainly not a 10x magnitude and more likely well below 2x) in the face of the cultural mores of a project. For example, the Linux kernel treats userland ABI compatibility quite seriously and there haven't been many faults there. OTOH security and KPI stability are not taken seriously, and are in constant flux. I can dig up papers on the former if needed, but it is the most glaring failure of "many eyes".

What's particularly noteworthy from my perspective is the bandwagon effect toward one kernel is causing the thing that is supposed to be the ultimate best that keeps getting better (Linux kernel) to lose focus, investment (new talent and money), maybe even quality due to decreased competition. Systems software is starting to look a lot more like Electrical Engineering as a field than the greater software ecosystem. Mostly done by large, central organizations. That makes me sad.

One big problem is how many users actually translate into extra eyes contributing to the codebase. Linux has a huge number of users, but how many actually do more than find a workaround for a bug, much less contribute a patch back?

It definitely happens but in my experience a lot of people think that’s too much work and assume someone else will do it even if they don’t. It’s been far more common for someone to try hoping it doesn’t happen again, disabling features blindly, etc.

I think it's more along the lines of "Eh, I probably don't need to look, since someone else is surely doing that".

once it gets annoying enough - followed by "let's try a newer kernel and see if it goes away"

There's a psychological principle called Diffusion of Responsibility that I think applies here. But this is my rusty memory from undergrad around a decade ago.

The fallacy is in the assumption that there actually are many eyes taking a critical look - e.g. at the level demonstrated here.

When it comes to security, there may be more eyes, in which case the fallacy is to assume that the preponderance of them are well-intentioned.

Here we have empirical evidence of problems being overlooked, yet you would prefer to put faith in a comforting aphorism?

> This could happen if there’s disk corruption, a transient read failure, or a transient write failure caused bad data to be written.

Wouldn't a transient I/O error get corrected by the block driver itself by repeating the request a few times? Does a filesystem driver assume any error reported by the storage device is permanent?

> The 2017 tests above used an 8k file where the first block that contained file data either returned an error at the block device level or was corrupted, depending on the test.

Are there any tests on how filesystems deal with metadata corruption or I/O error? Apart from NTFS "self-healing" introduced in Windows 7/2008, do modern filesystem drivers attempt to correct broken metadata completely on the kernel side, or just give up to avoid making things worse?

I believe apfs (apple's new filesystem), btrfs, and zfs all checksum metadata and will attempt to correct it.

I recommend the OP article since it specifically mentions apfs.

The OP article mentions apfs does not checksum data, but I assure you that it does checksum metadata:


Even ext4 has checksumming of metadata, it's a fairly boring feature by now.

ext4 only received checksumming of metadata in Linux 3.16, which was released in March 2016 -- not that long ago really. So I wouldn't call it a fairly boring feature since it's barely been two years since it was integrated and I imagine some Linux distributions might not even have it yet.

Furthermore, it's unclear to me whether ext4 just uses those checksums for integrity checks or whether it actually has any "self-healing" capabilities.

My understanding is apfs, btrfs, and zfs all have self-healing capabilities.

Checksums do not provide "self-healing". Ever.

They can only tell you if a piece of data is wrong or not.

In the case of BTRFS it's less a case of self-healing: the data itself is not healed; rather, the metadata is replicated across the disk.

I believe ZFS does the same but I'm not sure of this. I'm certain that ZFS does this if it has several disks to replicate from.

I'm not aware of APFS being self-healing on a single disk.

Yes, ZFS also replicates metadata on a single disk.

You can even tell ZFS to do the same thing with user data with the ditto blocks feature.

Yes, you obviously need additional pieces of information for reconstruction/repair, sorry for the confusing wording.
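
To make the detect-versus-repair distinction concrete, here is a minimal Python sketch. It is not how ZFS or btrfs actually lay out data: the `Block` type and the `store` and `read_with_repair` helpers are hypothetical names invented for this illustration, and CRC32 only stands in for whatever checksum a real filesystem uses. The point is that a checksum alone can flag a bad copy, and repair becomes possible only once a second copy exists.

```python
import zlib
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    """One stored copy of a block plus the checksum recorded at write time."""
    data: bytearray
    checksum: int

    def is_valid(self) -> bool:
        # A checksum can only answer "does the data still match?"
        return zlib.crc32(self.data) == self.checksum


def store(payload: bytes, copies: int = 2) -> List[Block]:
    """Keep several independent copies, loosely like replicated metadata."""
    return [Block(bytearray(payload), zlib.crc32(payload)) for _ in range(copies)]


def read_with_repair(replicas: List[Block]) -> bytes:
    """Return data from the first valid copy and rewrite any invalid copies."""
    good = next((b for b in replicas if b.is_valid()), None)
    if good is None:
        # Detection without usable redundancy: all we can do is report it.
        raise IOError("all copies failed checksum verification")
    for b in replicas:
        if not b.is_valid():
            b.data = bytearray(good.data)  # repair from the good replica
            b.checksum = good.checksum
    return bytes(good.data)
```

With `copies=1` a corrupted block can only be reported; with `copies=2` the same corruption is both detected and repaired on read.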

> Wouldn't transient I/O error get corrected by block driver itself by repeating the request a few times?

I am surprised at the seeming assumption here that errors are always transient and you shouldn't worry because the retry will fix it. Sure that probably won't hurt, but it won't address the case where retries fail.
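
A rough sketch of what "retry, but still handle the case where retries fail" could look like in Python. The `read_with_retry` helper and its `read_block` callable are hypothetical, and retrying only on `EIO` is an assumption made for the example, not a description of what any block driver or filesystem actually does.

```python
import errno
from typing import Callable


def read_with_retry(read_block: Callable[[], bytes], attempts: int = 3) -> bytes:
    """Retry a read on EIO a bounded number of times, then surface the error."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return read_block()
        except OSError as exc:
            if exc.errno != errno.EIO:
                raise  # not an error this sketch knows how to retry
            last_error = exc
    # Retries exhausted: treat the failure as persistent and report it upward.
    raise last_error
```

The retry loop is the easy part; what matters for the discussion here is the final `raise`, since the caller still needs a code path for a failure that does not go away.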

Depends on the OS. According to this comment on the ZoL GitHub [0], Solaris passes through errors from the hardware while Linux retries.

[0] https://github.com/zfsonlinux/zfs/issues/1256#issuecomment-1...

While neat, it's limited to Linux. It'd be nice to see how the BSDs behave, on their filesystems (such as HAMMER2). ZFS was also notably absent.

As usual, another example of the Linux echo chamber. It's just one constant inward gaze with that crowd. They learn nothing from anyone, only trial and error.

I don't know that I'd agree. As much as llvm+clang inspired gcc to improve, ZFS seems like a big part of the inspiration for btrfs. I'm not sure that it hits the whole ZFS feature set, but it's probably a step in that direction.

Including the fragmentation experience of the UNIX wars glory days.

Something to keep in mind is that by default writes go through the page cache. By the time I/O is initiated for a write(2), the application process can have exited.

This is why calling fsync(fd) before closing the file and exiting is a good idea if you need that kind of error to be handled. You should get it as a return of fsync if it happens after the write.
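
For illustration, a minimal Python sketch of that pattern on a POSIX system; `write_and_sync` is a hypothetical helper name. `os.fsync` raises `OSError` if the kernel reports a write-back failure, so the error reaches the process while it is still alive to handle it.

```python
import os


def write_and_sync(path: str, data: bytes) -> None:
    """Write data, then fsync so a deferred write-back error is raised here."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            # write() may accept fewer bytes than requested
            written = os.write(fd, view)
            view = view[written:]
        # Without this, the data may still be only in the page cache when
        # the process exits; fsync raises OSError if write-back failed.
        os.fsync(fd)
    finally:
        os.close(fd)
```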

O_[D]SYNC is better than a separate call to fsync, since it is not supposed to suffer from the race condition inherent to fsync. Arguably pedantic.

Definitely agree there. But it's also not always a good idea if you're going to do a lot of writes and rewrites to the same area of a file. If you're doing something more write once or a log append type pattern it won't make a difference usually. But if you're changing data a lot before closing/finishing the file then you might not want the dramatic performance change that O_SYNC can bring, and the race between fsync and close might still be worth it (esp if you're doing an fsync on all the directories involved to ensure metadata is committed too).
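
The directory fsync mentioned in passing is easy to forget, so here is a small Python sketch of it, assuming a POSIX system where a directory can be opened read-only and passed to `fsync`. Both helper names are hypothetical.

```python
import os


def fsync_parent_dir(path: str) -> None:
    """fsync the directory containing path so the new entry is durable."""
    parent = os.path.dirname(os.path.abspath(path))
    dir_fd = os.open(parent, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)


def create_durably(path: str, data: bytes) -> None:
    """Create a file, sync its contents, then sync its directory entry."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()             # push Python's buffer to the kernel
        os.fsync(f.fileno())  # push the kernel's page cache to the device
    fsync_parent_dir(path)
```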

I agree with using O_DSYNC to surface the error to the write call, rather than waiting until the fsync call, which is often not checked by the user.
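
A sketch of what that looks like from Python on a POSIX system. `write_synchronously` is a hypothetical helper, and falling back to `O_SYNC` where `O_DSYNC` is not exposed is an assumption of this example rather than something discussed above.

```python
import os

# O_DSYNC is POSIX; fall back to O_SYNC where the os module does not expose it
# (an assumption about the platform made for this sketch).
SYNC_FLAG = getattr(os, "O_DSYNC", None) or os.O_SYNC


def write_synchronously(path: str, data: bytes) -> int:
    """Each write() returns only after the kernel has pushed the data out."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC | SYNC_FLAG, 0o644)
    try:
        total = 0
        view = memoryview(data)
        while total < len(data):
            # An I/O error is raised from this call as OSError rather than
            # being deferred to a later fsync() or lost at close().
            total += os.write(fd, view[total:])
        return total
    finally:
        os.close(fd)
```

The error handling then lives next to the write that caused it, at the cost of every write waiting on the device.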

I did some testing recently [1] with O_DIRECT + O_DSYNC and found some surprising performance results: on Linux it can be similar to O_DIRECT + fsync() after every write for hard drives. But as soon as you are doing grouped writes, performance is almost always better by using O_DIRECT + fsync() after the end of the group.

For SSD drives though, O_DIRECT + O_DSYNC can be faster than O_DIRECT + fsync() after the end of the group, if you are pipelining your IO, e.g. you encrypt and checksum the next batch of sectors while you wait for the previous batch of checksummed and encrypted sectors to be written out. Because SSDs are so much faster, you can actually afford to slow down the write a little more by using O_DSYNC, so that your write is not faster than the related CPU work.

[1] https://github.com/ronomon/direct-io

A more advanced (and somewhat easy to get wrong) option would be sync_file_range combined with fdatasync, which allows you to roughly emulate O_DSYNC overall but without blocking synchronously for IO.

sync_file_range is different to fsync, fdatasync and O_DSYNC in that it does not flush the disk write cache (whereas the latter explicitly do on newer kernels):


  This system call does not flush disk write caches and thus does not
  provide any data integrity on systems with volatile disk write caches.

Hence "sync_file_range combined with fdatasync". sfr is only useful to get the kernel to start the write-out now.
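
A rough, Linux-only sketch of that combination in Python. The standard library has no wrapper for `sync_file_range`, so this goes through `ctypes`; the helper names are hypothetical, and treating any failure other than `EIO` as "hint unavailable" is an assumption of this example. `sync_file_range` is used purely to start write-out early, and the single `fdatasync` at the end is what provides the integrity guarantee.

```python
import ctypes
import errno
import os

SYNC_FILE_RANGE_WRITE = 2  # value from <linux/fs.h>


def _load_sync_file_range():
    """Look up the Linux-only sync_file_range(2) wrapper in libc, if present."""
    try:
        fn = ctypes.CDLL(None, use_errno=True).sync_file_range
    except (OSError, AttributeError):
        return None
    fn.argtypes = [ctypes.c_int, ctypes.c_int64, ctypes.c_int64, ctypes.c_uint]
    fn.restype = ctypes.c_int
    return fn


_sync_file_range = _load_sync_file_range()


def start_writeback(fd: int, offset: int, nbytes: int) -> bool:
    """Hint the kernel to begin write-out of a range without waiting for it."""
    if _sync_file_range is None:
        return False
    if _sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE) == 0:
        return True
    err = ctypes.get_errno()
    if err == errno.EIO:
        raise OSError(err, os.strerror(err))
    return False  # hint unavailable here; fdatasync still covers integrity


def write_chunks(path: str, chunks) -> int:
    """Write chunks, kicking off write-out early, then fdatasync once."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        offset = 0
        for chunk in chunks:
            view = memoryview(chunk)
            start = offset
            while view:
                n = os.write(fd, view)
                view = view[n:]
                offset += n
            start_writeback(fd, start, offset - start)
        # sync_file_range does not flush the disk write cache, so the
        # integrity guarantee comes from this call alone.
        os.fdatasync(fd)
        return offset
    finally:
        os.close(fd)
```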

Does the article do this?

I can't see anywhere in the article where it says which version of linux this was tested on (beyond "2017"), which is a shame.

Is this the state of affairs after the improvements described at https://lwn.net/Articles/724307/ have gone in?

second to last paragraph:

> All tests were run on both ubuntu 17.04, 4.10.0-37, as well as on arch, 4.12.8-2. We got the same results on both machines.

That wasn't there half an hour ago :-).

That LWN article was from June 2017 and 4.10 was released in February, so I think that means these tests predate any of the changes it discusses.

Arch linux 4.12.8-2 came out in August.

> It’s a common conception that SSDs are less likely to return bad data than rotational disks, but when Google studied this across their drives, they found:

>> The annual replacement rates of hard disk drives have previously been reported to be 2-9% [19,20], which is high compared to the 4-10% of flash drives we see being replaced in a 4 year period. However, flash drives are less attractive when it comes to their error rates. More than 20% of flash drives develop uncorrectable errors in a four year period, 30-80% develop bad blocks and 2-7% of them develop bad chips. In comparison, previous work [1] on HDDs reports that only 3.5% of disks in a large population developed bad sectors in a 32 months period – a low number when taking into account that the number of sectors on a hard disk is orders of magnitudes larger than the number of either blocks or chips on a solid state drive, and that sectors are smaller than blocks, so a failure is less severe.

Unfortunately "return bad data" is slightly ambiguous. IIRC these failures that are reported are ones where the drive itself claims that it discovered invalid blocks which were detected by error detection algorithms. The rate at which drives "return bad data" is likely only to be discovered by a well-controlled test that writes specific data to specific locations, or data patterns, then looks to see whether that data was preserved. This could tell us about data that was corrupted and escaped detection. Google's aggregate disk usage stats are a dump of the devices and their service activity with application-specific load, not the well-controlled test required for detection of the "returned bad data" symptom.

It's possible that you could scour application logs and try to attribute application failures to corrupt data from a disk, but it would be difficult to isolate.
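
The kind of controlled test described above could be sketched roughly like this in Python: write a known, position-dependent pattern to every block, then read it back and list the blocks that differ. The function names and the 4096-byte `BLOCK_SIZE` are assumptions made for the illustration.

```python
import hashlib
import os

BLOCK_SIZE = 4096  # assumed block size for this sketch


def pattern_for(block_index: int) -> bytes:
    """Deterministic, position-dependent contents for one block."""
    seed = hashlib.sha256(block_index.to_bytes(8, "little")).digest()
    return (seed * (BLOCK_SIZE // len(seed)))[:BLOCK_SIZE]


def write_pattern(path: str, num_blocks: int) -> None:
    """Fill the file with the known pattern and force it out to the device."""
    with open(path, "wb") as f:
        for i in range(num_blocks):
            f.write(pattern_for(i))
        f.flush()
        os.fsync(f.fileno())


def find_corrupt_blocks(path: str, num_blocks: int) -> list:
    """Return indices of blocks whose contents differ from what was written."""
    bad = []
    with open(path, "rb") as f:
        for i in range(num_blocks):
            if f.read(BLOCK_SIZE) != pattern_for(i):
                bad.append(i)
    return bad
```

A read-back done right after the write is likely to be served from the page cache, so a real test would also need to bypass or drop the cache before verifying. Because each block's contents depend on its index, a block that comes back with another block's data shows up as a mismatch too.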

We see a lot of SSD errors that turn out to be problems with RAID controllers. The lost drive rate dropped quite a bit when we started using OS-level RAID functionality rather than whatever the heck is in the firmware of Promise / LSI / Adaptec / etc. cards, and we were able to return a bunch of drives to service after we started using those cards in a simple pass through mode.

Raise your hand if you've lost data that was RAIDed eight ways to Sunday because of a RAID controller failure that turned out to be unrecoverable because of crappy decisions in the firmware, or rebuild times equal to the number of days it took to order, rack and configure a competitor's product.

Or any of the layers they run above the block device could be designed to detect it, or even correct it.

Some of them do precisely that. btrfs is designed to correct small errors introduced by the layers below it (this concept is tested in the article).

This is correct. I checked the claim against the paper, and the paper does seem to be talking about errors that were caught by the SSD itself: http://0b4af6cdc2f0c5998459-c0245c5c937c5dedcca3f1764ecc9b2f...

It could have been that the paper was about errors detected by checksums at the level of Google's distributed storage engine, but that's not the case.

This means that Apple could in fact be using entirely standard SSDs while being correct in their claim that they don't need checksums. Their SSDs may sometimes bubble "uncorrectable errors" up to them, but like the article says, redundancy doesn't help you avoid those on SSDs.

Pretty cool to see an update on the old analysis - and nice to hear that file systems have improved!

Well, this certainly changed my opinion on btrfs. Now I can tolerate its slightly slower performance, knowing that it is the only fs correctly handling errors.

That only goes for small files.

From the article: "Since the file was tiny, btrfs packed the file into the metadata and the file was duplicated along with the metadata, allowing the filesystem to fix the error when one block either returned bad data or reported a failure."
