I find my field of systems pretty interesting... there's so much to do and not a lot of people rushing in to do it as the heavy hitters retire, move to entrepreneurship, etc. It's probably much cheaper to harden a *BSD or Illumos system in these domains, and much easier to get the results integrated into mainline. The financial rewards and fame are much greater doing it on Linux, though.
What do you mean by this? Do more eyes not find and point out more problems? Is the effect offset by some counteracting mechanism?
The effect of "many eyes" seems to be basically moot (i.e. certainly not a 10x magnitude and more likely well below 2x) in the face of the cultural mores of a project. For example, the Linux kernel treats userland ABI compatibility quite seriously and there haven't been many faults there. OTOH security and KPI stability are not taken seriously, and are in constant flux. I can dig up papers on the former if needed, but it is the most glaring failure of "many eyes".
What's particularly noteworthy from my perspective is that the bandwagon effect toward one kernel is causing the thing that is supposed to be the ultimate, ever-improving best (the Linux kernel) to lose focus, investment (new talent and money), and maybe even quality due to decreased competition. Systems software is starting to look a lot more like Electrical Engineering as a field than like the greater software ecosystem: mostly done by large, central organizations. That makes me sad.
It definitely happens, but in my experience a lot of people think that's too much work and assume someone else will do it even if they don't. It's been far more common to see someone just hope it doesn't happen again, disable features blindly, etc.
When it comes to security, there may be more eyes, in which case the fallacy is to assume that the preponderance of them are well-intentioned.
Here we have empirical evidence of problems being overlooked, yet you would prefer to put faith in a comforting aphorism?
Wouldn't a transient I/O error get corrected by the block driver itself by retrying the request a few times? Does a filesystem driver assume any error reported by the storage device is permanent?
> The 2017 tests above used an 8k file where the first block that contained file data either returned an error at the block device level or was corrupted, depending on the test.
Are there any tests of how filesystems deal with metadata corruption or I/O errors? Apart from the NTFS "self-healing" introduced in Windows 7/2008, do modern filesystem drivers attempt to correct broken metadata entirely on the kernel side, or do they just give up to avoid making things worse?
Furthermore, it's unclear to me whether ext4 just uses those checksums for integrity checks or whether it actually has any "self-healing" capabilities.
My understanding is apfs, btrfs, and zfs all have self-healing capabilities.
They can only tell you if a piece of data is wrong or not.
In the case of BTRFS it's less a case of self-healing; the data is not healed, rather the metadata is replicated across the disk.
I believe ZFS does the same but I'm not sure of this. I'm certain that ZFS does this if it has several disks to replicate from.
I'm not aware of APFS being self-healing on a single disk.
I am surprised at the seeming assumption here that errors are always transient and you shouldn't worry because the retry will fix it. Sure, that probably won't hurt, but it won't address the case where retries fail.
I did some testing recently with O_DIRECT + O_DSYNC and found some surprising performance results: on Linux, for hard drives, it can perform similarly to O_DIRECT + fsync() after every write. But as soon as you are doing grouped writes, performance is almost always better with O_DIRECT + fsync() at the end of the group.
For SSD drives though, O_DIRECT + O_DSYNC can be faster than O_DIRECT + fsync() after the end of the group, if you are pipelining your IO, e.g. you encrypt and checksum the next batch of sectors while you wait for the previous batch of checksummed and encrypted sectors to be written out. Because SSDs are so much faster, you can actually afford to slow down the write a little more by using O_DSYNC, so that your write is not faster than the related CPU work.
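For concreteness, here's a minimal C sketch of the two strategies being compared. The file name, block size, and group size are invented for illustration, and error handling is kept to a bare minimum; it's a sketch, not the exact benchmark I ran.

```c
/* Two durability strategies for direct I/O:
 *   A: O_DIRECT | O_DSYNC  -> every write is durable before write() returns
 *   B: O_DIRECT, then one fsync() after the whole group of writes
 * "testfile", BLOCK, and NBLOCKS are placeholders for illustration. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLOCK   4096   /* must be a multiple of the device's logical block size */
#define NBLOCKS 256    /* one "group" of writes */

static void write_group(int fd, const char *buf)
{
    for (int i = 0; i < NBLOCKS; i++)
        if (pwrite(fd, buf, BLOCK, (off_t)i * BLOCK) != BLOCK)
            _exit(1);
}

int main(void)
{
    char *buf;
    /* O_DIRECT requires block-aligned buffers, offsets, and lengths. */
    if (posix_memalign((void **)&buf, BLOCK, BLOCK))
        return 1;
    memset(buf, 'x', BLOCK);

    /* Strategy A: per-write durability via O_DSYNC. */
    int fd = open("testfile", O_WRONLY | O_CREAT | O_DIRECT | O_DSYNC, 0644);
    if (fd < 0) return 1;
    write_group(fd, buf);
    close(fd);

    /* Strategy B: plain O_DIRECT, one fsync() after the whole group. */
    fd = open("testfile", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) return 1;
    write_group(fd, buf);
    if (fsync(fd) != 0) return 1;
    close(fd);

    free(buf);
    return 0;
}
```

On spinning disks the grouped-fsync() pattern was almost always faster in my tests; on SSDs the per-write O_DSYNC version can keep up if you pipeline the CPU work (checksumming, encryption) against the writes in flight, as described above.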
> This system call does not flush disk write caches and thus does not provide any data integrity on systems with volatile disk write caches.
Is this the state of affairs after the improvements described at https://lwn.net/Articles/724307/ have gone in?
> All tests were run on both ubuntu 17.04, 4.10.0-37, as well as on arch, 4.12.8-2. We got the same results on both machines.
That LWN article was from June 2017 and 4.10 was released in February, so I think that means these tests predate any of the changes it discusses.
>> The annual replacement rates of hard disk drives have previously been reported to be 2-9% [19,20], which is high compared to the 4-10% of flash drives we see being replaced in a 4 year period. However, flash drives are less attractive when it comes to their error rates. More than 20% of flash drives develop uncorrectable errors in a four year period, 30-80% develop bad blocks and 2-7% of them develop bad chips. In comparison, previous work  on HDDs reports that only 3.5% of disks in a large population developed bad sectors in a 32 months period – a low number when taking into account that the number of sectors on a hard disk is orders of magnitudes larger than the number of either blocks or chips on a solid state drive, and that sectors are smaller than blocks, so a failure is less severe.
Unfortunately "return bad data" is slightly ambiguous. IIRC these failures that are reported are ones where the drive itself claims that it discovered invalid blocks which were detected by error detection algorithms. The rate at which drives "return bad data" is likely only to be discovered by a well-controlled test that writes specific data to specific locations, or data patterns, then looks to see whether that data was preserved. This could tell us about data that was corrupted and escaped detection. Google's aggregate disk usage stats are a dump of the devices and their service activity with application-specific load, not the well-controlled test required for detection of the "returned bad data" symptom.
It's possible that you could scour application logs and try to attribute application failures to corrupt data from a disk, but it would be difficult to isolate.
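As a rough illustration of what such a controlled test could look like, here's a small C sketch that writes a deterministic pattern to fixed offsets on a device and later reads it back, flagging any block that reads successfully but contains the wrong bytes. The device path, offsets, and pattern are all invented for illustration (and writing to a raw device like this is destructive); a real study would cover far more of the device and run over a long period.

```c
/* Controlled write/verify test for silent corruption ("returned bad data").
 * /dev/sdX is a placeholder device under test -- running this destroys data. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLOCK 4096

/* Deterministic pattern derived from the block's offset, so the expected
 * contents can be recomputed at verification time without storing them. */
static void fill_pattern(uint64_t off, unsigned char *buf)
{
    for (size_t i = 0; i < BLOCK; i++)
        buf[i] = (unsigned char)((off >> 12) + i * 131);
}

int main(void)
{
    const char *dev = "/dev/sdX";   /* placeholder device under test */
    unsigned char *buf;
    if (posix_memalign((void **)&buf, BLOCK, BLOCK))  /* O_DIRECT alignment */
        return 1;

    int fd = open(dev, O_RDWR | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    /* Phase 1: write the known pattern to a handful of fixed offsets. */
    for (uint64_t off = 0; off < 16 * BLOCK; off += BLOCK) {
        fill_pattern(off, buf);
        if (pwrite(fd, buf, BLOCK, off) != BLOCK) { perror("pwrite"); return 1; }
    }
    fsync(fd);

    /* Phase 2 (possibly much later): read back and compare. A read that
     * succeeds but returns different bytes is exactly the corruption the
     * drive never reports on its own. */
    unsigned char expect[BLOCK];
    for (uint64_t off = 0; off < 16 * BLOCK; off += BLOCK) {
        if (pread(fd, buf, BLOCK, off) != BLOCK) { perror("pread"); continue; }
        fill_pattern(off, expect);
        if (memcmp(buf, expect, BLOCK) != 0)
            fprintf(stderr, "silent corruption at offset %llu\n",
                    (unsigned long long)off);
    }

    close(fd);
    free(buf);
    return 0;
}
```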
Raise your hand if you've lost data that was RAIDed eight ways to Sunday because of a RAID controller failure that turned out to be unrecoverable because of crappy decisions in the firmware, or rebuild times equal to the number of days it took to order, rack and configure a competitor's product.
It could have been that the paper was about errors detected by checksums at the level of Google's distributed storage engine, but that's not the case.
This means that Apple could in fact be using entirely standard SSDs while being correct in their claim that they don't need checksums. Their SSDs may sometimes bubble "uncorrectable errors" up to them, but like the article says, redundancy doesn't help you avoid those on SSDs.
From the article: "Since the file was tiny, btrfs packed the file into the metadata and the file was duplicated along with the metadata, allowing the filesystem to fix the error when one block either returned bad data or reported a failure."