
Bitrot and atomic COWs: Inside “next-gen” filesystems - pedrocr
http://arstechnica.com/information-technology/2014/01/bitrot-and-atomic-cows-inside-next-gen-filesystems/
======
Freaky
A bit worrying not to see a single mention of ECC memory when discussing
protection from bitrot, especially with filesystems which depend on correctly
functioning memory to provide the protections people expect from them.

People sneer at me for being a stickler for it, but between 50GB of memory and
26TB of ZFS-protected storage, I see ECC corrections about as often as I see
disk checksum errors - maybe half a dozen of each in the past year or two.
Frankly, I think it's idiotic that it's not more common and better supported.

~~~
ChuckMcM
The article was, sadly, missing a lot of stuff. When it said _"(most arrays
don't check parity by default on every read)"_ I knew the author was not up to
the task of writing this article. FWIW, a RAID system _has_ to calculate
parity every time it reads a stripe so that it will know how to change it if
something in the stripe changes. There are of course filesystem errors that
are invisible to RAID: if your FS _writes_ corrupted data (as it might with
ECC failures), the RAID subsystem will happily compute the correct parity for
the stripe.

Granted, I spent nearly 10 years immersed in storage systems (5 at NetApp, 4 at
Google), but there is still some easily checked stuff missing from this article,
and so its point (that hardening in the filesystem is good) is lost.

~~~
dspillett
_> a RAID system has to calculate parity every time it reads a stripe so that
it will know how to change it if something in the stripe changes_

Nope. On a non-degraded array with parity (R5, 6, ...) it will only read the
block it needs from the drive it is on. For an array that mirrors without
parity (R1, ...) it will read the block from one of the drives. There is no
need to bother another drive with the read operation unless it needs to check
parity (because it has been told to do so on each read, as some controllers
can be configured to). You actually reduce the performance benefit of the
striping if you do, as you may be moving the heads of the other drive(s)
"unnecessarily" and potentially away from another block they were about to be
asked to read.

With a non-degraded array, unless explicitly told to check parity on read, the
controller will not touch the parity blocks until a write happens, at which
point it will read the other relevant data blocks in order to regenerate the
parity block. This is why the RAID 5 write penalty exists but there is no read
penalty (in fact there is a read bonus, due to striping over multiple devices).

Parity blocks will, unless checking parity on every read, only ever get read
if the array is in a degraded state, in which case you can only derive some
data blocks by reading the other blocks in that stripe (data and parity) and
working out from them what the missing block should be.
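The read-modify-write cycle behind that RAID 5 write penalty, and the degraded-mode reconstruction, can be sketched with plain XOR parity (a toy model, not any particular controller's implementation; block contents are made up):

```python
from functools import reduce

def xor_blocks(*blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, byts) for byts in zip(*blocks))

# A hypothetical 3-data + 1-parity RAID 5 stripe.
d0, d1, d2 = b"\x01\x02", b"\x10\x20", b"\x0a\x0b"
parity = xor_blocks(d0, d1, d2)

# Small write to d1: read the old data and old parity (the "write penalty"),
# then P_new = P_old XOR D_old XOR D_new -- no need to touch d0 or d2.
d1_new = b"\xff\x00"
parity_new = xor_blocks(parity, d1, d1_new)
assert parity_new == xor_blocks(d0, d1_new, d2)

# Degraded read: a missing block is the XOR of the rest of the stripe.
assert xor_blocks(parity_new, d0, d2) == d1_new
```

Note that in the non-degraded read path above, nothing forces a parity read at all, which is the behavior being described.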

~~~
ChuckMcM
I will take your word for it that some systems make this "optimization", but
any such system is fundamentally broken at the design level. Disk drives can,
and do, return corrupted sectors without any indication of failure; the RAID
system you describe would not catch those failures and would thus 'fail' in
terms of providing any sort of data reliability.

~~~
wmf
So let's say you read a full RAID-5 stripe including parity and the parity
does not match because of a silent error. How do you know which block contains
the error? Classic RAID does not have any checksums. AFAICT classic RAIDs are
screwed in the face of silent errors, so they might as well implement as many
optimizations as they can within the bounds of their (unrealistic) failure
model.
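The "detectable but not locatable" problem can be shown with toy XOR parity (illustrative only; the block values are made up): every suspect block yields an internally consistent "repair", so plain parity cannot say which block is the corrupt one.

```python
def xor_blocks(*blocks):
    """Byte-wise XOR of equal-length blocks."""
    out = bytearray(len(blocks[0]))
    for b in blocks:
        for i, byte in enumerate(b):
            out[i] ^= byte
    return bytes(out)

d0, d1 = b"\xaa", b"\x0f"
parity = xor_blocks(d0, d1)

d1_bad = b"\x1f"                          # silent single-bit error in d1
assert xor_blocks(d0, d1_bad) != parity   # mismatch detected...

# ...but both candidate "repairs" satisfy the parity equation:
d1_guess = xor_blocks(parity, d0)         # assume d0 is good -> recovers the real d1
d0_guess = xor_blocks(parity, d1_bad)     # assume d1_bad is good -> "recovers" a wrong d0
assert xor_blocks(d0, d1_guess) == parity
assert xor_blocks(d0_guess, d1_bad) == parity
```

Without a per-block checksum there is no way to prefer one repair over the other.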

~~~
ChuckMcM
Fibre Channel drives used the DIF part of the sector, reading 526 bytes per
sector rather than 512. SATA-based RAID systems will often use a separate
sector in the group to hold individual sector checksums: so 16 sectors, where
15 are 'data' and the 16th holds CRC data.
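The 15-data-plus-1-checksum grouping described could be sketched like this (a toy layout; the 15+1 split comes from the comment, everything else - the encoding, CRC-32 via `zlib`, little-endian packing - is invented for illustration):

```python
import zlib

SECTOR = 512
GROUP = 16  # 15 data sectors + 1 checksum sector

def make_group(data_sectors):
    """Pack 15 data sectors plus one sector holding per-sector CRC-32s."""
    assert len(data_sectors) == GROUP - 1
    crcs = b"".join(zlib.crc32(s).to_bytes(4, "little") for s in data_sectors)
    return list(data_sectors) + [crcs.ljust(SECTOR, b"\x00")]

def verify_group(group):
    """Return indices of data sectors whose stored CRC no longer matches."""
    *data, check = group
    bad = []
    for i, s in enumerate(data):
        stored = int.from_bytes(check[i * 4:(i + 1) * 4], "little")
        if zlib.crc32(s) != stored:
            bad.append(i)
    return bad

sectors = [bytes([i]) * SECTOR for i in range(15)]
group = make_group(sectors)
assert verify_group(group) == []
group[3] = b"\x00" * SECTOR           # simulate a silently corrupted sector
assert verify_group(group) == [3]
```

Unlike bare stripe parity, this localizes the bad sector, which is what makes repair from redundancy possible.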

~~~
zerd
Could you tell us which RAID controllers actually do this? Because I've never
seen one do checksumming at all.

Any that are actually available to "normal" people (which is relevant to the
discussion of the article), i.e. not enterprise SAN?

~~~
ChuckMcM
All Fibre Channel HBAs can generate errors when the DIF doesn't correlate;
it's part of the FC spec. I don't have access to the LSI controller firmware
source, so I couldn't say one way or another whether they do this with SATA
drives. It should be possible to test, though.

------
omh
As a Solaris admin, it's a little strange to hear ZFS called a "next-gen"
filesystem. (It's been around for at least 7-8 years!) But it's good to see
these ideas becoming usable in other operating systems, especially given the
current state of Solaris licensing.

~~~
MBCook
I'm using HFS+. It's a last-last gen filesystem if ever there was one.

It's the only FS I've ever had silent corruption on, and it used to happen
(10.4? 10.5? 10.6?) _all the time_.

I'd kill to have Apple just buy a license to NTFS.

~~~
groby_b
Uh, you might want to look at NTFS performance for your given use case. If,
say, you're a C++ programmer and your use case is reading lots of small files
quickly and closing them again so you can compile, NTFS is a horror cabinet.

HFS beats that quite handily (and gets _obliterated_ by ExtFS2 performance.)

~~~
gaadd33
So having slow compiles on NTFS is worse than having silent corruption on HFS?

~~~
groby_b
Depends on the use case, no?

Would I want to store a lot of important write-once data on HFS? Reluctantly,
if at all. High-throughput database? Yuck, no.

Would I trust it with a source-tree that is stored in a DVCS and immediately
restorable, while also giving me much faster compiles? Yep.

Would I choose either FS for a server? Nope. That was the point - there is no
single "best" file system, or the world would've long ago settled on it. Look
at what you need it for, and make your decision accordingly.

------
chadly
Is bitrot something the average hard drive user even needs to worry about? I
know the hard drives themselves at the hardware level implement checksums. Is
it really necessary to also have it at the filesystem level?

I am legitimately asking because I have a good 2TB of family photos on hard
drives and spooky stories about random bits flipping freak me out.

~~~
theatrus2
Running ZFS for many years has shown, yes, this happens, and will continue to
get worse as density goes up.

~~~
olavgg
I've been running ZFS on several servers with tens of TBs of data. I see
checksum errors every month, ranging from a single bit to several megabytes.

~~~
booi
That sounds like you have something else wrong. With ECC memory on server
hardware, we've seen 0 checksum errors in the last 6 months and I've seen only
2 ever. A typical server has 136TB of raw hdd space and we get about 71TiB
usable. It's about 80% full.

------
acd
I think the next-generation file system will have CoW, be based on erasure
codes, and be internet-distributed. For example, your home NAS has InternetFS;
now you can reach all your files wherever you are, and all files in the home
directory are always in sync. Failure of a hard drive does not matter. If your
house burns down, your friends or someone else on the internet has the pieces
you need to reconstruct your data. The next-gen filesystem also has built-in
versioning via the CoW, so you can always revert to an earlier version of a
file.

InternetFS (encrypted, distributed, always in sync, cheap snapshots, p2p)

Want to share photos or movies with your friends and family? No problem: just
right-click on the file and select the friend from a list; they see the file
and can view it on their computer.

This next-gen file system will be incredibly easy to use and cross-platform.
As simple to use as Facebook, Skype, or email.

~~~
xiaomai
Hey! It sounds like you're describing Ori [1]. Be sure to read the paper [2],
because the website is a bit sparse on info. It's still young, but the syncing,
cheap snapshots, auto backups, etc. have been really nice. (It's only
encrypted over transport, though; I wish it were encrypted locally too.)

1: [http://ori.scs.stanford.edu/](http://ori.scs.stanford.edu/) 2:
[http://dl.acm.org/ft_gateway.cfm?id=2522721&ftid=1403940&dwn...](http://dl.acm.org/ft_gateway.cfm?id=2522721&ftid=1403940&dwn=1)

~~~
foobarqux
Great project.

Needs encryption, access controls and some kind of public namespace for
publishing.

------
minikites
I have a small NAS4Free box at home with ZFS and automatic zpool scrubbing.
Barring a physical disaster (fire, flood, etc), I expect my data to be safe
indefinitely. It was pretty easy to set it up on an HP Microserver, I
recommend it.

~~~
scrumper
I have a Drobo in my Amazon cart, was about to pull the trigger until I read
this article.

Do you know if your NAS4Free solution is viable for a household full of Macs
that need a shared Time Machine destination? I'm tempted: it's cheaper, and
NAS4Free will keep evolving.

~~~
mynegation
Anything that can run a Samba server can be a target for Time Machine backups;
google "DIY Time Machine". I considered FreeNAS but figured it would take too
much time to set up and went with the Synology Intel box instead; couldn't be
happier with it. Synology has step-by-step instructions for setting it up as a
Time Machine backup target.

~~~
dmd
Be warned that network-based TM is _extremely_ unreliable. (Just google time
machine synology failure).

~~~
sitkack
I dumped Time Machine entirely for an rsync-based solution. Take a look at
`--link-dest`.

------
anton_gogolev
Listen to Belt and Suspenders [1] and Computational Skeuomorphism [2] episodes
of Hypercritical for an excellent discussion of filesystems: what the hell do
they do and how do they compare.

[1]: [http://5by5.tv/hypercritical/56](http://5by5.tv/hypercritical/56) [2]:
[http://5by5.tv/hypercritical/57](http://5by5.tv/hypercritical/57)

~~~
billyhoffman
+1 on Hypercritical's discussion of widely used file systems and their
problems. When I saw this article on Ars, I was surprised that John Siracusa
_wasn't_ the author. Somewhere he must be smiling.

~~~
anton_gogolev
My thoughts exactly. Listening to John, he somehow manages to bring up how
sucky HFS+ is in literally every other episode of whatever podcast he happens
to be on.

------
bensummers
ZFS is also wonderfully usable. It's a delight to admin. See
[http://rudd-o.com/linux-and-free-software/ways-in-which-
zfs-...](http://rudd-o.com/linux-and-free-software/ways-in-which-zfs-is-
better-than-btrfs)

~~~
pedrocr
That seems pretty out of date. I only follow btrfs loosely, but at least the
RAIDZ and send/receive points are wrong: current btrfs supports those.

------
higherpurpose
I want F2FS for my phones. It seems to make storage 50-100 percent faster
compared to ext4 from the benchmarks I've seen. Motorola started using it in
Moto X and Moto G, but I hope Google makes it the default for Android in the
next Android version.

~~~
pedrocr
F2FS sounds like a bit of a hack. It's basically a way to get performance out
of the storage-provided flash translation layer, which pretends the underlying
storage isn't flash. It also doesn't have the fancy ZFS/btrfs features. I
wonder if the COW these filesystems do can be tuned to work well with flash
devices.

------
AndrewDucker
Anyone know how the latest versions of NTFS stand up against these?

~~~
ghh
Microsoft has introduced ReFS [1] as a potential successor to NTFS in Windows
Server 2012. It has a few of the 'next-gen' filesystem features that the
article mentions, such as integrity checking, but no copy-on-write and it's
not feature-equivalent to NTFS yet. Also, you can't boot from it yet.

[1] [http://en.wikipedia.org/wiki/ReFS](http://en.wikipedia.org/wiki/ReFS)

~~~
masklinn
It seems to do COW and checksumming for metadata, but not for data. According
to [http://blogs.msdn.com/b/b8/archive/2012/01/16/building-
the-n...](http://blogs.msdn.com/b/b8/archive/2012/01/16/building-the-next-
generation-file-system-for-windows-refs.aspx) there's a feature, "integrity
streams", which is opt-in per-file (or per-subtree, inherited by all files)
checksumming. It doesn't seem to do COW for file data, but can be paired with
"storage spaces"[0], from which bitrotted files can be recovered.

[0]
[http://blogs.msdn.com/b/b8/archive/2012/01/05/virtualizing-s...](http://blogs.msdn.com/b/b8/archive/2012/01/05/virtualizing-
storage-for-scale-resiliency-and-efficiency.aspx)

~~~
Freaky
Integrity streams are enabled by default on mirrored spaces, as per your first
link:

> By default, when the /i switch is not specified, the behavior that the
> system chooses depends on whether the volume resides on a mirrored space. On
> a mirrored space, integrity is enabled because we expect the benefits to
> significantly outweigh the costs.

Plus:

> When this option, known as “integrity streams,” is enabled, ReFS always
> writes the file changes to a location different from the original one. This
> allocate-on-write technique ensures that pre-existing data is not lost due
> to the new write

------
wernerb
If you are really concerned about bitrot, and raid is apparently "not the
solution", generate your own parity files for important stuff:
[http://parchive.sourceforge.net/#clients](http://parchive.sourceforge.net/#clients)

~~~
lmm
That's very fiddly compared to using ZFS and getting it "for free".

~~~
e40
Unless ZFS isn't easily available on the OS you are using.

~~~
rbanffy
Linux has an easy-to-install port
([http://zfsonlinux.org/](http://zfsonlinux.org/)). I remember OS X as having
to use it via FUSE. I guess that leaves only Windows.

OS/2 had installable filesystems. Does Windows have something like that?

~~~
Freaky
OS X has a native port - [http://open-zfs.org/wiki/Distributions#ZFS-
OSX](http://open-zfs.org/wiki/Distributions#ZFS-OSX)

Windows has ReFS -
[http://en.wikipedia.org/wiki/ReFS](http://en.wikipedia.org/wiki/ReFS)

~~~
rbanffy
I didn't mean something like ZFS, but rather installable filesystems. It
should be possible, with some effort, to make ZFS run on Windows.

ReFS is beta-quality and lacks several features ZFS has had since its first
production-grade release. It's not really an apples-to-apples comparison.

------
visarga
These days it can take 10+ hours to stream all the data off a multi-TB disk.
A single disk can contain an amazing quantity of data, and thus can be very
valuable and sensitive.

I'd like a file system that duplicates sensitive data on the same drive. The
file system data should be duplicated too, and marked in some way so as to be
able to reconstruct the disk after a failure.

I don't need to save on space, I only need safety. And keep in mind that it
takes many hours just to dump the contents of a disk once - so it's
practically inaccessible as a whole over short periods of time.

For now I get by with Dropbox and Time Machine, but it's far from perfect. My
photo collection alone is 1TB, so no luck backing it up to the cloud.

~~~
bensummers
Use ZFS. Create a filesystem in your pool with copies=n to get n copies on the
same disk.

But still use mirrors or raidz to protect against the failure of a whole
drive.

------
kibwen
Stupid question: even if we use checksums and parity files and such to verify
the integrity of our data, how do we verify that these integrity measures
don't themselves become corrupted? Is it just "that's very unlikely to
happen"?

~~~
ChuckMcM
It's actually a reasonable question. With simple parity it was always possible
for two complementary bit errors to result in a successful parity computation
in the presence of corrupted data. A CRC function can detect multi-bit errors
because it encodes not only bit state but also bit sequence in the error
check. There is an excellent discussion of the tradeoffs in Richard Feynman's
Lectures on Computation.

Generally, when you design such a system, you can often say what would have to
be true for you to "miss" that something was corrupted. In our parity example,
an even number of bits would have to change state. In the CRC example, bit
changes would need to be correlated across a longer string of bits. Once you
know the ways in which you would fail to detect errors, you start breaking the
system apart to vary detection and correction. So, for example, at NetApp a
block (which was 4K bytes at the time I was there) on disk was 8 sectors, plus
an additional sector that included a CRC calculation over the available bytes
as well as information about which block it was and what 'generation' it was
(a monotonically increasing number indicating file generation). The host bus
adapter (HBA) would do its own CRC check on the data that came from the drive,
passed through it, and landed in memory. That would detect most bit flips that
occurred on the channel (SATA or Fibre Channel port) as data went through it.
ECC on memory would detect if written memory had its bits flipped. Software
would recompute the block parameters and compare them to the data in the check
sector.

So if the data on disk was bad, that check sector would not match; if the data
had been written correctly initially and gone bad, the RAID parity check would
catch it; if the data was corrupted crossing the disk/memory channel, the HBA
would catch it; if the data got to memory but memory corrupted it, the ECC
would catch it; and if memory somehow didn't see the corruption, the check
against the block check sector would catch it. All layers of interlocking
checks and re-checks, in order to decrease the likelihood that something
corrupted your data without you knowing it.
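The parity-vs-CRC distinction above can be demonstrated in a few lines of Python (a toy sketch: `even_parity` is a made-up helper, and CRC-32 via `zlib` stands in for the on-disk CRC). An even number of flipped bits sails past a single parity bit but is caught by the CRC:

```python
import zlib

def even_parity(data: bytes) -> int:
    """One parity bit over all bits of the data."""
    p = 0
    for byte in data:
        p ^= bin(byte).count("1") & 1
    return p

good = b"hello, storage"

# Flip two bits (an even number) -- the classic failure mode for parity.
bad = bytearray(good)
bad[0] ^= 0x01
bad[1] ^= 0x01
bad = bytes(bad)

assert even_parity(bad) == even_parity(good)   # parity: corruption missed
assert zlib.crc32(bad) != zlib.crc32(good)     # CRC-32: corruption caught
```

CRC-32 detects any two-bit error over spans far longer than a disk sector, which is why the check sector used a CRC rather than parity.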

~~~
kibwen
This is fascinating! Thanks everyone for answering.

------
transfire
Huge kudos to ZFS and btrfs, but I am very disappointed with one detail of
next-generation file systems: why, oh why, do we still not have a file-type
metadata field? We are still using silly file name extensions and magic
MIME-type detection. In the Age of Types (OOP and FP strong type systems) it
only makes sense for the file system to do the same.

~~~
chrismonsanto
I personally think the biggest problem with current file systems is the lack
of ACID semantics. It is a tragedy that you can't roll back a set of file
system changes if a shell script fails halfway. Why should I have to bring out
SQLite if I want transactions but don't need relations?

(Also, I agree - yes, MIME types in the filesystem, please.)

~~~
stouset
Even if such a thing were to exist, it would probably by necessity be built on
some lower-level system that doesn't provide such guarantees… like a
filesystem.

~~~
wmf
IIRC in the old Tandem OS transactions were a basic service, even lower level
than the file system. So their file system could use the transaction service
to perform arbitrary transactions without much complexity. These days all the
world's a VAX though.

------
curmudgeoned
Because the author doesn't seem to care about defining what COW means, because
"atomic cow" doesn't readily google to its proper meaning, and because it
doesn't seem to be otherwise mentioned here:

COW: Copy-On-Write

[http://en.wikipedia.org/wiki/Copy-on-
write](http://en.wikipedia.org/wiki/Copy-on-write)

------
cpncrunch
Given that hard disks already do a CRC check when reading data (as far as I am
aware), I don't see how this adds anything. The author is assuming that there
is no hardware-level checksum or CRC, which I believe is incorrect.

------
fsiefken
Just back up your data to Blu-ray (25 GB), or use PaperBack 1.0 for 1 or more
MB per A4 page, depending on compression.
[http://www.ollydbg.de/Paperbak/index.html](http://www.ollydbg.de/Paperbak/index.html)
My girlfriend is already printing out the family pictures, as she doesn't
trust me to keep backups of the jpegs.

~~~
masklinn
There's no guarantee that the data will reach your BR disc intact, or that a
writable BR will survive for years (let alone decades).

~~~
Dylan16807
It's pretty trivial to check that a write succeeded. Unless you mean getting
corrupted before putting the data into the redundant/backup system, which can
happen just as easily with ZFS or almost any setup.

------
touristtam
So, Btrfs+Linux or ZFS+BSD for a home server? I thought the former was not
production-ready. I am confused.

------
rodgerd
Nothing about Hammer? Pity.

