
Silent Data Corruption Is Real - Ianvdl
http://changelog.complete.org/archives/9769-silent-data-corruption-is-real
======
elfchief
It really bugs me (and has for a while) that there is still no mainstream
linux filesystem that supports data block checksumming. Silent corruption is
not exactly new, and the odds of running into it have grown significantly as
drives have gotten bigger. It's a bit maddening that nobody seems to care (or
maybe I'm just looking in the wrong places).

(...sure, you could call zfs or btrfs "mainstream", I suppose, but when I say
"mainstream" I mean something along the lines of "officially supported by
RedHat". zfs isn't, and RH considers btrfs to still be "experimental".)

~~~
dom0
btrfs only uses CRC32c which is weakish. ZFS is great but not exactly
portable. I started to use Borg now for archiving purposes as well, not just
backup. For me (low access concurrency, i.e. single or at most "a few" users)
that works very well. Portable + strong checksumming + strong crypto +
mountable + reasonable speed (with prospect of more) is a good package. It
doesn't solve error correction, though.

~~~
ak217
crc32c is not weakish, and was chosen for a reason: crc32c has widespread
hardware acceleration support that remains faster than any hash, and crc32c
can be computed in parallel (unlike a hash, it has no hidden state, so you can
sum independently computed block checksums to get the overall blob checksum).
Bitrot detection doesn't need a cryptographic hash. You may want a hash for
other purposes (like if you somehow trust your metadata volume more than your
data volume), but that's a separate and slower use case.

~~~
bascule
_Bitrot detection doesn't need a cryptographic hash._

Not only is a cryptographic hash unnecessary, under certain circumstances it
will actually do a _worse_ job.

Cryptographic hashes operate under a different set of constraints than error
detecting code. With an error detecting code, it's desirable to _guarantee_ a
different checksum in the event of a bitflip.

With a cryptographic random oracle, this is not the case: we want all outcomes
to have equal probability, even potentially producing the same digest in the
event of a bitflip. As an example of a system which failed in this way: Enigma
was specifically designed so the ciphertext of a given letter was always
different from its plaintext. This left a statistical signature on the
ciphertext which in turn was exploited by cryptanalysts to recover plaintexts.
(Note: a comparison to block ciphers is apt as most hash functions are, at
their core, similar to block ciphers)

Though the digests of cryptographic hash functions are so large it's
statistically improbable for a single bitflip to result in the same digest as
the original, it is not a guarantee the same way it is with the CRC family.

Cryptographic hash functions are not designed to be error detecting codes.
They are designed to be random oracles. Outside a security context, using a
CRC family function will not only be faster, but will actually provide
guarantees cryptographic hash functions _can't_.
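That guarantee can be checked exhaustively on a small message. A quick sketch with Python's stdlib CRC-32 (the same property holds for CRC32C; the message is just an example):

```python
import zlib

# Flip every single bit of a small message and confirm the CRC always
# changes. For CRCs this is a design guarantee (any polynomial with more
# than one term detects all single-bit errors), not a probability.
msg = bytearray(b"some block of data on disk, 32B!")
original = zlib.crc32(msg)

for byte in range(len(msg)):
    for bit in range(8):
        msg[byte] ^= 1 << bit          # flip one bit
        assert zlib.crc32(msg) != original
        msg[byte] ^= 1 << bit          # flip it back

print("all", len(msg) * 8, "single-bit flips detected")
```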

~~~
mrb
" _Not only is a cryptographic hash unnecessary, under certain circumstances
it will actually do a worse job._ "

A cryptographic hash is unnecessary, but as you point out in the next to last
paragraph, it is statistically improbable that it will do a worse job. Because
collisions are statistically improbable.

~~~
jepler
Specifically, I wrote a program to search for single-bit-flip collisions in
sha1 truncated to 16 bits. The program didn't need to search for long before
finding two messages with the same 16-truncated sha1 with a single bit flip at
bit 1 of byte 171 of a 256-byte message. 376 1 171
be44b935e7ecfc81d1fe2cddcd7c1d7e04338fd83fa994cd6a877732ca5d8db83346bd9ccbfc4c8770682bd307c782421a512a80a106be87825d5c13f3156e23ffaacdfc1651f88f775507d1175542def2ccf084271ebd4ead175c8a448be0d50b26f59d970301ebc5a7f672d3ea870d9a1e02f8f5fd01c38297b8aa264a3f07fec32f9a91aa359784d2d9ce0e4649465c705f50feed23dcbefc0a726cfadb5e47ee577ed45203f90d6e2e650d42ddb10cba49d06bd4cdad4e6eaf5cfcb062de2539fc847ce0c104f2e667369080eaaab5934ae5f7f1ba733c3d1bfbda87bfa72ef12475b9ff0edc4deb99e6a5cf387c7f6b9c71ea62b4db4bb67c92d36460dd
be44b935e7ecfc81d1fe2cddcd7c1d7e04338fd83fa994cd6a877732ca5d8db83346bd9ccbfc4c8770682bd307c782421a512a80a106be87825d5c13f3156e23ffaacdfc1651f88f775507d1175542def2ccf084271ebd4ead175c8a448be0d50b26f59d970301ebc5a7f672d3ea870d9a1e02f8f5fd01c38297b8aa264a3f07fec32f9a91aa359784d2d9ce0e4649465c705f50feed23dcbefc0a726cfadb5e47ee577ed45203f90d6e2e640d42ddb10cba49d06bd4cdad4e6eaf5cfcb062de2539fc847ce0c104f2e667369080eaaab5934ae5f7f1ba733c3d1bfbda87bfa72ef12475b9ff0edc4deb99e6a5cf387c7f6b9c71ea62b4db4bb67c92d36460dd

[https://gist.github.com/jepler/96d1e779dc95b8941b208887e10a8...](https://gist.github.com/jepler/96d1e779dc95b8941b208887e10a8084)

On the other hand, any CRC will detect all such errors; a well-chosen one such
as CRC32C will detect all messages with up to 5 bits flipped at this message
size.

This is quite appropriate for the error model in data transmission, of
uncorrelated bit errors.
[https://users.ece.cmu.edu/~koopman/networks/dsn02/dsn02_koop...](https://users.ece.cmu.edu/~koopman/networks/dsn02/dsn02_koopman.pdf)
is a pretty good paper, though there are probably better ones for readers
without an existing background in how CRC works.
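For the curious, a simplified re-creation of that search (not the code from the linked gist) fits in a few lines of Python:

```python
import hashlib
import random

# Truncate SHA-1 to its first 16 bits, then look for a message where some
# single-bit flip leaves the truncated digest unchanged. Each flip
# collides with probability 2**-16, so with 2048 bit positions per
# 256-byte message a hit turns up after a few dozen messages on average.
def sha1_16(data: bytes) -> bytes:
    return hashlib.sha1(data).digest()[:2]

rng = random.Random(0)
found = None
for trial in range(5000):
    msg = bytearray(rng.randbytes(256))
    target = sha1_16(bytes(msg))
    for i in range(len(msg) * 8):
        msg[i // 8] ^= 1 << (i % 8)   # flip bit i
        if sha1_16(bytes(msg)) == target:
            found = (trial, i)
            break
        msg[i // 8] ^= 1 << (i % 8)   # restore it
    if found:
        break

print("collision after a single bit flip:", found is not None)
```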

~~~
mrb
You are not testing a crypto hash. "Crypto hash" means it is cryptographically
strong, not truncated to 16 bits. For example ZFS with checksum=sha256 will
use the full 256-bit hash for detecting data corruption.

~~~
jepler
Yup, you're right. If you use a full-size cryptographic hash then the number
of undetected errors can be treated as 0 regardless of Hamming distance. On
the other hand, it has 8x the storage overhead of a 32-bit CRC.

------
kabdib
Oh, yes. Silent bit errors are tons of fun to track down.

I spent a day chasing what turned out to be a bad bit in the cache of a disk
drive; bits would get set to zero in random sectors, but always at a specific
sector offset. The drive firmware didn't bother doing any kind of memory test;
even a simple stuck-at test would have found this and preserved the customer's
data.

In another case, we had Merkle-tree integrity checking in a file system, to
prevent attackers from tampering with data. The unasked-for feature was that
it was a memory test, too, and we found a bunch of systems with bad RAM. ECC
would have made this a non-issue, but this was consumer-level hardware with
very small cost margins.

It's fun (well maybe "fun" isn't the right word) to watch the different ways
that large populations of systems fail. Crash reports from 50M machines will
shake your trust in anything more powerful than a pocket calculator.

~~~
blablabloe
Enterprise disks do have ECC cache as opposed to consumer drives. Was it a
consumer drive?

------
nisa
ZFS is also crazy good on surviving disks with bad sectors (as long as they
still respond fast). Check out this paper:
[https://research.cs.wisc.edu/wind/Publications/zfs-corruptio...](https://research.cs.wisc.edu/wind/Publications/zfs-corruption-fast10.pdf)

They even spread the metadata across the disk by default. I'm running on some
old WD-Greens with 1500+ bad sectors and it's cruising along with RAIDZ
just fine.

There is also failmode=continue where ZFS doesn't hang when it can't read
something. If you have a distributed layer above ZFS that also checksums (like
HDFS) you can go pretty far even without RAID and quite broken disks. There is
also copies=n. When ZFS broke, the disk usually stopped talking or died a few
days later. btrfs and ext4 just choke and remount read-only quite fast
(probably the best and correct course of action) but you can tell ZFS to just carry on! Great
piece of engineering!

~~~
comboy
Pretty fascinating. But just based on this comment, I reckon that these drives
with 1500+ bad sectors aren't worth your time. So, why? Is it just that you wanted
to play with all these options and don't really care about the data on these
drives, or do you actually believe it's reasonable bang for the buck?

~~~
nisa
I forgot the disclaimer: you should not do this, ever :)

We had a cluster for Hadoop experiments at uni and no resources to replace
all the faulty disks at that time (20-30% were faulty to some degree judging
from the SMART values - more than 150 disks). So this was kind of an
experiment. All the data we used was available and backed up outside of that
cluster. The problem was that with ext4, after running a job, certain disks
always switched to read-only, and this was a major hassle as the node had to
be touched by hand. HDFS is 3x replicated and checksummed, and the disks
usually worked fine for quite a while after the first bad sector. So we
switched to ZFS and ran weekly scrubs - only replacing disks that didn't
survive the scrub in reasonable time or with reasonable failure rates - and
bumped up the HDFS checksum reads so that everything is control-read once a
week. The working directory for the layer above (MapReduce and stuff like
that) got a dataset with copies=2 so that intermediate data stays intact
within reasonable limits. This was for learning and research purposes where
top speed or 100% integrity didn't matter and uptime and usability were more
important. Basically the metadata on disk had to be sound, and the data on a
single disk didn't matter that much. This was quite a ride and it's long been
replaced since then.

Just thought it's interesting how far you can push that. In the end it worked
but turned out there is no magic, disks die sooner or later and sometimes take
the whole node with them.

Don't go to eBay and buy broken disks believing that with ZFS they will work.
Some survive a while, most die fast, some exhibit strange behavior.

That RAIDZ is more or less for "let's see where this goes" purposes; backups
are in place and it's not a production system.

~~~
comboy
Hah, thanks for the story.

It seems that limited resources often lead to some interesting solutions (and
learning new things). A factor that is not very common in VC backed companies.

------
Malic
It's articles like this that reinforce my disappointment that Apple is
choosing to NOT implement checksums in their new file system, APFS.

[https://news.ycombinator.com/item?id=11934457](https://news.ycombinator.com/item?id=11934457)

~~~
JumpCrisscross
Can someone explain why one would checksum metadata but not user data? Is the
assumption everything's backed up on iCloud? If so, are system files
checksummed?

~~~
cmurf
Metadata writes can be considered atomic, so an integrated checksum is written
at the time the metadata block is written or overwritten. With data, you can't
atomically overwrite both the data and its checksum, so any kind of crash or
power failure will result in data that mismatches its checksum. Unless you
have something really clever to work around this, you need a copy-on-write
file system to do data checksums.

------
dredmorbius
"Data tends to corrupt. Absolute data tends to corrupt absolutely."

In both senses of the word.

Many moons ago, in one of my first professional assignments, I was tasked with
what was, for the organisation, myself, and the provisioned equipment, a
stupidly large data processing task. One of the problems encountered was a
failure of a critical hard drive -- this on a system with no concept of a
filesystem integrity check (think a particularly culpable damned operating
system, and yes, I said that everything about this was stupid). The process of
both tracking down, and then demonstrating convincingly to management (I
_said_ ...) the nature of the problem was infuriating.

And that was with hardware which was reliably and replicably bad. Transient
data corruption ... because cosmic rays ... gets to be one of those
particularly annoying failure modes.

Yes, checksums and redundancy, please.

------
swinglock
If I were to run ZFS on my laptop with a single disk and copies=1, and a file
becomes corrupted, can I recover it (partially)?

My assumption is the read will fail and the error logged but there is no
redundancy so it will stay unreadable.

Will ZFS attempt to read the file again, in case the error is transient? If
not, can I make ZFS retry reading? Can I "unlock" the file and read it even
though it is corrupted, or get a copy of the file? If I restore the file from
backup, can ZFS make sure the backup is good using the checksum it expects the
file to have?

Single-disk users seem to be unusual, so it's not obvious how to do this; all
documentation assumes a highly available installation rather than a laptop.
But I think there's value in ZFS even with a single disk - if only I
understood exactly how it fails and how to scavenge for pieces when it does.

~~~
mrb
It depends. Metadata is always redundant with 2 copies (even when using
copies=1). So if the file's metadata is corrupted, yes ZFS will fully recover
and rewrite a 2nd good copy of the metadata. But if the data is corrupted,
then ZFS can do nothing to recover (you may be able to partially read the
file, but the rest of it will return I/O errors.)

~~~
h2hn
OK, so ZFS for single-drive users doesn't fix data corruption.

Then I'm definitely going to use my solution. All the next-generation FS stuff
is cool (btrfs too, indeed) but for the _simplest_ use case people just need
safe data, and to fix it if the disk goes bad.

~~~
mrb
Why don't you use copies=2 with a single disk?

------
mrb
The exact same silent data corruption issues just happened to my 6 x 5TB ZFS
FreeBSD fileserver. But unlike what the poster concluded, mine were caused by
bad (ECC!) RAM. I kept meticulous notes, so here is my story...

I scrub on a weekly basis. One day ZFS started reporting silent errors on disk
ada3, just 4kB:

    
    
        pool: tank
       state: ONLINE
      status: One or more devices has experienced an unrecoverable error.  An
              attempt was made to correct the error.  Applications are unaffected.
      action: Determine if the device needs to be replaced, and clear the errors
              using 'zpool clear' or replace the device with 'zpool replace'.
         see: http://illumos.org/msg/ZFS-8000-9P
        scan: scrub repaired 4K in 21h05m with 0 errors on Mon Aug 29 20:52:45 2016
      config:
              NAME        STATE     READ WRITE CKSUM
              tank        ONLINE       0     0     0
                raidz2-0  ONLINE       0     0     0
                  ada3    ONLINE       0     0     2  <---
                  ada4    ONLINE       0     0     0
                  ada6    ONLINE       0     0     0
                  ada1    ONLINE       0     0     0
                  ada2    ONLINE       0     0     0
                  ada5    ONLINE       0     0     0
    

I monitored the situation. But every week, subsequent scrubs would continue to
find errors on ada3, and on more data (100-5000kB):

    
    
      2016-09-05: 1.7MB silently corrupted on ada3 (ST5000DM000-1FK178)
      2016-09-12: 5.2MB silently corrupted on ada3 (ST5000DM000-1FK178)
      2016-09-19: 300kB silently corrupted on ada3 (ST5000DM000-1FK178)
      2016-09-26: 1.8MB silently corrupted on ada3 (ST5000DM000-1FK178)
      2016-10-03: 3.1MB silently corrupted on ada3 (ST5000DM000-1FK178)
      2016-10-10: 84kB silently corrupted on ada3 (ST5000DM000-1FK178)
      2016-10-17: 204kB silently corrupted on ada3 (ST5000DM000-1FK178)
      2016-10-24: 388kB silently corrupted on ada3 (ST5000DM000-1FK178)
      2016-11-07: 3.9MB silently corrupted on ada3 (ST5000DM000-1FK178)
    

The next week, the server became unreachable during a scrub. I attempted to
access the console over IPMI but it just showed a blank screen and was
unresponsive. I rebooted it.

The next week the server again became unreachable during a scrub. I could
access the console over IPMI but the network seemed non-working even though
the link was up. I checked the IPMI event logs and saw multiple _correctable_
memory ECC errors:

    
    
      Correctable Memory ECC @ DIMM1A(CPU1) - Asserted
    

The kernel logs reported multiple Machine Check Architecture errors:

    
    
      MCA: Bank 4, Status 0xdc00400080080813
      MCA: Global Cap 0x0000000000000106, Status 0x0000000000000000
      MCA: Vendor "AuthenticAMD", ID 0x100f80, APIC ID 0
      MCA: CPU 0 COR OVER BUSLG Source RD Memory
      MCA: Address 0x5462930
      MCA: Misc 0xe00c0f2b01000000
    

At this point I could not even reboot the server remotely via IPMI. Also, I
theorized that in addition to _correctable_ memory ECC errors, maybe the DIMM
experienced _uncorrectable/undetected_ ones that were really messing up the
OS but also IPMI. So I physically removed the module in "DIMM1A", and the
server has been working perfectly well since then.

The reason these memory errors always happened on ada3 is not because of a bad
drive or bad cables, but likely due to the way FreeBSD allocates buffer memory
to cache drive data: the data for ada3 was probably located right on defective
physical memory page(s), and the kernel never moves that data around. So it's
always ada3 data that seems corrupted.

PS: the really nice combinatorial property of raidz2 with 6 drives is that
when silent corruption occurs, the kernel has 15 different ways to attempt to
rebuild the data ("6 choose 4 = 15").
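The arithmetic checks out:

```python
from math import comb

# raidz2 stores double parity: on a 6-drive vdev, any 4 of the 6 members
# suffice to reconstruct a stripe, giving comb(6, 4) candidate subsets.
print(comb(6, 4))  # 15
```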

~~~
blablabloe
When a double-parity error is detected, the operating system should halt.
Maybe that didn't happen properly. Triple-parity errors may go undetected,
but how likely is that?

I wonder what 'really' happened.

~~~
mrb
I saw dozens of ECC errors in the IPMI log, and dozens of MCAs in dmesg. So
the memory was failing so badly that there were probably 3+ bit errors.

------
RX14
I know for sure that btrfs scrub found 8 correctable errors on my home server
filesystem last July. This is obviously great news for me. Contrary to a lot
of people here I've personally found btrfs to be really stable (as long as you
don't use raid5/6).

------
kev009
People grossly under-intuit the channel error rate of SATA. At datacenter
scale it's alarmingly high:
[http://www.enterprisestorageforum.com/imagesvr_ce/8069/sas-s...](http://www.enterprisestorageforum.com/imagesvr_ce/8069/sas-sata-table2.jpg)

------
platosrepublic
I'm not a database expert, but this seems like something I should worry about,
at least a bit. Is this a problem if you store all your persistent data in a
database like MySQL?

~~~
Mister_Snuggles
If your database does ZFS-like checksumming on all of its data, including the
structures that it uses to find the data that you put in, and has the ability
to correct errors, then no.

Realistically though, I don't know if MySQL has this. You'd probably be better
off using a filesystem that gives these kind of guarantees and running your
database on that.

~~~
zzzcpan
Every ACID compliant database is supposed to calculate checksums.

~~~
copperx
Good to know.

------
pmarreck
Of course, my ZFS NAS backup is sound until a file that got bitrotted on my
non-ZFS computer is touched and then backed up to it :/

It's kind of (literally?) like immutability. If you allow even a little
mutability, it ruins it.

I think all filesystems should be able to add error-correction data to ensure
data integrity.

------
lasermike026
Shouldn't RAID 1,5,6 protect against data corruption because of disk errors?

~~~
rcthompson
Some disk errors, yes, but something as simple as a power failure can easily
corrupt your data:

[http://www.raid-recovery-guide.com/raid5-write-hole.aspx](http://www.raid-recovery-guide.com/raid5-write-hole.aspx)

[https://blogs.oracle.com/bonwick/entry/raid_z](https://blogs.oracle.com/bonwick/entry/raid_z)

~~~
rajachan
Hardware RAID does not suffer from the write-hole like MD-RAID does (thanks to
on-board supercap-backed non-volatile memory).

I can't remember if it was merged upstream, but some folks from Facebook
worked on a write-back cache for MD-RAID (4/5/6 personalities) in Linux which
essentially closes the write-hole too. It allows one to stage dirty RAID
stripe data in a non-volatile medium (NVDIMMs/flash) before submitting block
requests to the underlying array. On recovery, the cache is scanned for dirty
stripes, which are restored before actually using the P/Q parity to rebuild
user data. I worked on something similar in a prior project where we cached
dirty stripes in an NVDIMM, and also mirrored it on its controller-pair (in a
dual-controller server architecture) using NTB. It was a fun project, when
neither the PMEM nor the NTB driver subsystems were in the mainline kernel.

~~~
qb45
RAID journaling.

[https://lwn.net/Articles/665299/](https://lwn.net/Articles/665299/)

Haven't tried it but it seems to already be merged, at least the write-through
mode. They are now working on writeback, which IIRC isn't merged yet.

------
blablabloe
The story here is not how Silent Data Corruption is real. The story is that
somebody did a bad home brew server build and fucked up.

So ZFS protects against end-user mistakes.

I was really hoping for a story about some large-scale study on silent data
corruption, but no, just an anecdote.

Sad!

:D

~~~
jgoerzen
I didn't go into detail in the article, but the server in question is running
a Supermicro X10SLH-F-O motherboard, ECC RAM, and a Haswell CPU, in a Rosewill
RSV-L4411 4U chassis. Is there a hardware problem here? For sure. But you
can't write this off as being some dusty overclock mess bought at someone's
garage sale.

I have, incidentally, seen this in corporate environments on
traditionally-engineered server-class hardware as well. This is just a much
more easily-discussed case.

------
meesterdude
interesting find! I wonder what would be a good safeguard against this. I feel like
just backing up your data would offer something - but a file could silently
change and become corrupted in the backup too.

~~~
ralfd
Hm. You could make two backups, checksum each file, and save the checksums.
Then you could regularly compare file contents with the initial checksum; if
there is a mismatch, copy from the other backup.
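A minimal sketch of that scheme in Python (function and manifest names are illustrative; restoring from the other backup is left to the caller):

```python
import hashlib
import json
import os

def checksum(path: str) -> str:
    """SHA-256 of a file, read in chunks so large files needn't fit in RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest()

def make_manifest(root: str, manifest: str) -> None:
    """Record a checksum for every file under root (run when the backup is made)."""
    sums = {}
    for dirpath, _, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            sums[os.path.relpath(path, root)] = checksum(path)
    with open(manifest, "w") as f:
        json.dump(sums, f, indent=2)

def verify(root: str, manifest: str) -> list:
    """Return files whose current checksum mismatches the manifest -
    candidates to restore from the other backup."""
    with open(manifest) as f:
        sums = json.load(f)
    return [rel for rel, digest in sums.items()
            if checksum(os.path.join(root, rel)) != digest]
```

Run verify on a schedule; anything it returns should be copied back from the second backup before both copies rot.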

~~~
icebraining
Git-annex is nice for this; the fsck command will check the file against a
checksum and automatically request a new copy from another node if the check
fails.

------
ksec
Yes, it is VERY real. Because no one gives a damn. Most people ( consumers )
would just ignore that corrupted Jpeg.

I am in the minority group that gets very frustrated and paranoid when my
videos or photos get corrupted.

Synology has Btrfs on some range of their NAS. But most of them are expensive.

I really want a consumer NAS, or preferably even a Time Capsule (with two 2.5"
HDDs instead of one drive), with built-in ZFS and ECC memory, that scrubs its
drives weekly by default and alerts you when there is a problem.

And lastly, do any of the consumer cloud storage services (OneDrive, Dropbox,
Amazon, iCloud) have these protections in place? Because I would much rather
data corruption be someone else's problem than complexity at my end.

------
hawski
That gives me more reasons to experiment with DragonFly BSD by building a NAS
using HAMMER file system.

------
greenshackle
Uh. My ZFS-backed BSD NAS also has the hostname 'alexandria'.

------
danjoc
Closed source firmware on drives contain bugs that corrupt data. Are there any
drives, available anywhere, that have open source firmware?

~~~
dandelion_lover
There is a corresponding project: [http://www.openssd-project.org](http://www.openssd-project.org)

~~~
duskwuff
Note that it's essentially a dead project. The "NEW!!!" platform mentioned on
their homepage is from 2014.

------
_RPM
Just like data black markets.

------
h2hn
I started a really simple and effective project last month to be able to
recover from bitrot on Linux (macOS/Unix?). It's "almost done"; it just needs
more real testing and the systemd service. I've been pretty busy the last few
weeks so I've only been able to improve the bitrot performance.

[https://github.com/liloman/heal-bitrots](https://github.com/liloman/heal-bitrots)

Unfortunately, btrfs is not stable and ZFS needs a "super computer", or at
least as many GBs of ECC RAM as you can buy. This solution is designed for
any machine and any FS.

~~~
tnorgaard
Please stop spreading this misinformed statement. I assume you are referring
to the ZFS ARC (Adaptive Replacement Cache). It works in much the same way as
a regular Linux page cache. It does not take much more memory (if you disable
prefetch) and will only use what is available/idle. We use Linux with ZFS on
production systems with as low as 1GB memory. We stopped counting the times it
has saved the day. :-)

ECC is nice to have, but ZFS does not have special requirements over, say, a
regular page cache. The only difference is that ZFS will discover bit-flips
instead of just ignoring them as ext4 or xfs would.

~~~
3legcat
> ECC is nice to have.

Actually it seems ECC is important for ZFS filesystems; see:

[http://louwrentius.com/please-use-zfs-with-ecc-memory.html](http://louwrentius.com/please-use-zfs-with-ecc-memory.html)

~~~
solarengineer
To be clear, it is not ZFS that requires or even mandates ECC. Since ZFS uses
data as present in memory and has checks for everything post that, it is
prudent to have memory checks at the hardware level.

Thus, if one is using ZFS for data reliability, one ought to use ECC memory as
well.

