Bitrot and atomic COWs: Inside “next-gen” filesystems (arstechnica.com)
169 points by pedrocr 1260 days ago | 132 comments

Bit worrying not to see a single mention of ECC memory when discussing protection from bitrot, especially with filesystems which depend on correctly functioning memory to provide the protections people expect from them.

People sneer at me for being a stickler for it, but between 50GB of memory and 26TB of ZFS-protected storage, I see ECC corrections about as often as I see disk checksum errors - maybe half a dozen of each in the past year or two. Frankly I think it's idiotic it's not more common and better supported.

The article was, sadly, missing a lot of stuff. When it said "(most arrays don't check parity by default on every read)" I knew the author was not up to the task of writing this article. FWIW, a RAID system has to calculate parity every time it reads a stripe so that it will know how to change it if something in the stripe changes. There are of course file system errors that are invisible to RAID, if your FS writes corrupted data (as it might with ECC failures) the RAID subsystem will happily compute the correct parity for the stripe.

Granted I spent nearly 10 years immersed in storage systems (5 at NetApp, 4 at Google) but still there is some easily checked stuff missing from this article and so its point (which is hardening in the filesystems is good) is lost.

> a RAID system has to calculate parity every time it reads a stripe so that it will know how to change it if something in the stripe changes

Nope. On a non-degraded array with parity (R5, R6, ...) it will only read the block it needs from the drive it is on. For an array that mirrors without parity (R1, ...) it will read the block from one of the drives. There is no need to bother another drive with the read unless it needs to check parity (because it has been told to on each read, as some controllers can be). Doing so actually reduces the performance benefit of the striping, since you may be moving the heads of the other drive(s) "unnecessarily" and potentially away from another block they were about to be asked to read.

With an un-degraded array unless explicitly told to check parity on read the controller will not touch the parity blocks until a write happens at which point it will read the other relevant data blocks in order to regenerate the parity block. This is why the RAID 5 write penalty exists but there is no read penalty (in fact there is a read bonus due to striping over multiple devices).

Parity blocks will, unless checking parity on every read, only ever get read if the array is in a degraded state, in which case you can only derive some data blocks by reading the other blocks in that stripe (data and parity) and working out from them what the missing block should be.
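The parity math being described is easy to sketch (a toy model in a few lines, not any particular controller's implementation):

```python
# Toy model of RAID-5 parity math, not any particular controller's code.
# Parity is the XOR of the data blocks in a stripe; in a degraded array a
# missing block is recovered by XOR-ing the surviving blocks with parity.

def xor_blocks(*blocks: bytes) -> bytes:
    out = bytearray(len(blocks[0]))
    for b in blocks:
        for i, byte in enumerate(b):
            out[i] ^= byte
    return bytes(out)

d0, d1, d2 = b"AAAA", b"BBBB", b"CCCC"  # data blocks in one stripe
parity = xor_blocks(d0, d1, d2)         # written to the parity drive

# Degraded read: the drive holding d1 has failed, so derive d1 from the rest.
assert xor_blocks(d0, d2, parity) == d1
```

Note that on the healthy read path none of this runs: d1 comes straight off its own drive, which is exactly the behavior described above.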

I will take your word for it that some systems make this "optimization" but any such system is fundamentally broken at the design level. Disk drives can, and do, return corrupted sectors without any indication of failure; the RAID system you describe would not catch those failures and would thus 'fail' in terms of providing any sort of data reliability.

> I will take your word for it that some systems make this "optimization" but any such system is fundamentally broken at the design level.

I'm fairly sure some current systems (MD on Linux for example) have this sort of optimization. And that is because the designers assumed that the disk would be fine or just fail, and not be in some state in-between.

This assumption used to be truer than it is now. The mean time between read errors has not been increasing as fast as hard drive capacity has. In the old days, it was highly unlikely to get a silent read error. These days it is more common due to the massive increase in capacity.

"And that is because the designers assumed that the disk would be fine or just fail, and not be in some state in-between."

Interesting, if you make that assumption you will get burned. I have personally experienced disks that, through firmware errors, returned what was essentially a freed memory block from their cache as the sector data, returned success status on a write that never actually happened, and flipped bits in the data they actually returned (without error). The whole point of RAID for data reliability is to catch these things. (RAID for performance is different, and people tolerate errors in exchange for faster performance.)

We used to start new hire training at NetApp with the question, "How many people here think a disk drive is a storage system?" and then proceed to demolish that idea with real world data that showed just how crappy disks actually were. You don't see these things when you look at one drive, but when you have a few thousand to a couple of million out there spinning and reporting on their situation you see these things happen every day.

The issue of course is that margins in drives are razor thin and they are always trying to find ways to squeeze another penny out here and there. Generally this expresses itself as dips in drive reliability, on a manufacturing-cohort basis, of a few basis points. You can make a more reliable drive, you just have a hard time justifying it to someone who is essentially going to get only one in a million 'incorrect' operations.

This is the same reason ECC RAM is so hard to find on 'regular' PCs, the chance of you being screwed is low enough that you'll assume it was something else or not be willing to pay a premium for the extra chip you need on your DIMMs for ECC.

Firstly, I must say that I'm talking about consumer available RAID systems here, since that's what the article discusses.

>> "And that is because the designers assumed that the disk would be fine or just fail, and not be in some state in-between."

> Interesting, if you make that assumption you will get burned.

Probably why the designers of RAID have moved on to other storage systems. Yet, RAID is currently the only viable option for consumers to try to get "reliable" storage. Which is why the article was written.

> The whole point of RAID for data reliability is to catch these things. (RAID for performance is different, and people tolerate errors in exchange for faster performance.)

No, RAID is for withstanding disk failures, where the disk fails in a predicted manner. This is a common misunderstanding. Nothing in the RAID specifications says they should detect silent bit flipping. If you find a system that does this, it goes beyond the RAID specification (which is good). You seem to be under the impression that most systems do this, but that is not true. Most RAID systems that consumers can get their hands on won't detect a single bit flip. The author demonstrated it on software RAID. I am fairly certain the same thing will happen if you try it on a hardware RAID controller from any of LSI, 3ware, Areca, Promise, etc.

For instance, Google has multiple layers of checksumming to guard against bitrot, at the filesystem and application levels. If RAID protected against it, why don't they trust it?

Handling this particular case (bad data from disk, without an error indication from the disk) is pretty straightforward programming. If the device doesn't allow for additional check bits in the sectors, then reserving one sector out of 16, or one out of 9 (say, if you're doing 4K blocks), for check data on the other sectors will let you catch misbehaving drives and, if necessary, reconstruct bad data with the RAID parity.

If you use a 15/16ths scheme you only "lose" a bit more than 7% of your drive to check data, and you gain the ability to avoid really nasty silent corruption. I'm not sure why even a "consumer" RAID solution wouldn't do that (even the RAID-1 folks (mirroring) can use this for a bit more protection, although the mean-bit-error spec still bites them in trying to do a re-silver)
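A rough sketch of that kind of check-sector scheme (toy code with made-up layout choices; CRC32 stands in for whatever check data a real vendor would pick):

```python
import zlib

SECTOR = 512
GROUP = 15  # 15 data sectors share one check sector (the "15/16ths" layout)

def make_check_sector(data_sectors):
    # Pack one CRC32 per data sector into the 16th sector (4 bytes each).
    check = bytearray(SECTOR)
    for i, s in enumerate(data_sectors):
        check[i * 4:(i + 1) * 4] = zlib.crc32(s).to_bytes(4, "little")
    return bytes(check)

def verify(data_sectors, check):
    # Return indices of data sectors whose CRC no longer matches.
    bad = []
    for i, s in enumerate(data_sectors):
        stored = int.from_bytes(check[i * 4:(i + 1) * 4], "little")
        if zlib.crc32(s) != stored:
            bad.append(i)
    return bad

sectors = [bytes([i]) * SECTOR for i in range(GROUP)]
check = make_check_sector(sectors)
assert verify(sectors, check) == []

sectors[3] = b"\x00" * SECTOR      # a silently corrupted sector
assert verify(sectors, check) == [3]
```

Unlike bare parity, this identifies *which* sector went bad, which is what makes reconstruction from the RAID parity possible.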

As for Google: if you'd like to understand the choice they made (and may un-make) you have to look at the cost of adding an available disk. The 'magic' thing they figured out was that they were adding lots and lots of machines, and most of those machines had a small kernel, a few apps, some memory, and an unused IDE (later SATA) port or ports. So adding another disk to the pizza box was "free" (and if you look at the server in the Computer History Museum you will see they just wrapped a piece of velcro around the disk and stuck it down next to the motherboard on the 'pizza pan'). In that model an R3 (3-way replication) system, which takes no computation (it's just copying, no ECC computation), with data "chunks" (in the GFS sense) which had built-in check bits, gives you "free" storage. And that really is very cost effective if you have stuff for the CPUs to do; it breaks down when the amount of storage you need exceeds what you can acquire by either adding a drive to a machine with a spare port or replacing all the drives with their denser next-generation model. Steve Kleiman, the former CTO of NetApp, used to model storage with what he called a 'slot tax', which was the marginal cost of adding a drive to the network (so a fraction of a power supply, chassis, cabling, I/O card, and carrier if there was one). Installing in pre-existing machines at Google, the slot tax was as close to zero as you can reasonably make it.

That said, as Google's storage requirements exceeded their CPU requirements it became clear that the cost was going to be an issue. I left right about that time, but since then there has been some interesting work at Amazon and Facebook with so-called "cold" storage, which are ways to have the drive live in a datacenter but powered off most of the time.

I can't agree with the statement "No, RAID is for withstanding disk failures, where the disk fails in a predicted manner", mostly because disks have never failed in a "predicted manner"; that is why Garth Gibson invented RAID in the first place. He noted all the money DEC and IBM were spending on trying to make an inherently unreliable system reliable, and observed that if you were willing to give up some of the disk capacity (through redundancy, check bits) you could take inexpensive, unreliable drives and turn them into something as reliable as the expensive drives from those manufacturers. It is a very powerful concept, and it changed how storage was delivered. The paper is a great read, even today.

So let's say you read a full RAID-5 stripe including parity and the parity does not match because of a silent error. How do you know which block contains the error? Classic RAID does not have any checksums. AFAICT classic RAIDs are screwed in the face of silent errors, so they might as well implement as many optimizations as they can within the bounds of their (unrealistic) failure model.
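The ambiguity is easy to demonstrate with toy XOR parity: corruption shows up as a mismatch, but nothing in the mismatch identifies the guilty block:

```python
from functools import reduce

def parity(blocks):
    # XOR all blocks together, byte by byte.
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

data = [b"\x11" * 4, b"\x22" * 4, b"\x33" * 4]
p = parity(data)

data[1] = b"\x2a" * 4     # silent corruption in block 1
assert parity(data) != p  # the mismatch is detected...

data[1] = b"\x22" * 4     # restore block 1,
data[0] = b"\x7f" * 4     # corrupt block 0 instead:
assert parity(data) != p  # ...and it looks like exactly the same kind of
                          # "parity doesn't match" failure as before
```

With per-block checksums (as in ZFS, or the check-sector schemes described elsewhere in this thread) you can point at the bad block and then use parity to rebuild it.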

Fibre Channel drives used the DIF part of the sector, reading 520 bytes per sector rather than 512. SATA-based RAID systems will often use a separate sector in the group to hold individual sector checksums. So 16 sectors, where 15 are 'data' and the 16th is CRC data.

Could you tell us which RAID controllers actually do this? Because I've never seen one do checksumming at all.

Any that are actually available to "normal" people (which is relevant to the discussion of the article), i.e. not enterprise SAN?

All Fibre Channel HBAs can generate errors when the DIF doesn't match; it's part of the FC spec. I don't have access to the LSI controller firmware source so I couldn't say one way or another whether they do this with SATA drives. It should be possible to test, though.

> So let's say you read a full RAID-5 stripe including parity and the parity does not match because of a silent error.

If you have an "iffy" sector that is on the way out, you might get a decent read after a few attempts. You can then make sure that the blocks on the other drives are OK and (assuming all the failures are in the same device) drop the problem device when done.

In reality most drives these days do this for you: a certain number of sectors are reserved for reallocation in the case of small surface failures, so the drive itself will retry a few times to get a reliable read (each sector on a disk has checksums that allow error detection) and then the controller will remap that sector. This happening once or twice is considered normal wear and tear; this happening beyond a certain threshold or rate is a sign of imminent failure, and is measured by SMART indicators. So running the scan will not let md raid do much directly, but it will give the drive a chance to remap data that is in danger (and, if you have mdadm set up right, email you a warning if you should consider dropping the drive immediately and replacing it).

> but any such system is fundamentally broken

Um, isn't this the EXACT point of the article you are so happily criticizing? That filesystems don't do this, and they should?

fwiw, with linux md, you can (and should) scrub the array periodically.

    for raid in /sys/block/md*/md/sync_action; do
        echo "check" > ${raid}
    done
which will check and fix errors (or fail). i run this weekly as a cron job.

of course, it doesn't help with any data that are corrupted and read before it runs.

Wouldn't you get higher read throughput (2x for R1, less for R5) by reading the parity disks as well?

Not for R5, because in a non-degraded array the parity blocks do not contain anything you have not read from other blocks in that stripe, and unlike the data blocks they cannot be used as meaningful data without at least one other block in the stripe being read. You do get the speed benefit from the striping of course: RAID5 over three drives performs similarly to RAID0 over two for read operations (though be wary of write performance if your I/O pattern is write-heavy, particularly for many small randomly positioned writes, as RAID5 has issues here).

For R1 you can in theory use an elevator algorithm to choose which of the two mirrors serves a given read, depending on how close the block is to the most recently accessed blocks on each device, reducing average head movement distance. For bulk reads, or an I/O pattern including much writing, this will make little or no difference, but for significantly random read patterns you might see a measurable (perhaps even noticeable to a human user) latency improvement.
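The mirror-selection idea can be sketched as follows (a toy model of the concept only; this is not Linux md's actual algorithm):

```python
# Toy model of the idea: route each read to whichever mirror's head is
# currently nearest the requested block. Not Linux md's actual algorithm.

heads = [0, 0]  # last-accessed block number on each of two mirrors

def pick_mirror(block):
    m = min(range(len(heads)), key=lambda i: abs(heads[i] - block))
    heads[m] = block
    return m

assert pick_mirror(10) == 0   # tie at start; mirror 0 wins
assert pick_mirror(12) == 0   # mirror 0's head (at 10) is closest
assert pick_mirror(3) == 1    # mirror 1's head is still at 0, so it serves this
```

The payoff only appears under random reads, as noted above: sequential I/O keeps one head in the right place anyway.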

Last time I tested this myself I found that the Linux software RAID arrangement did not perform any such optimisation for RAID1. Results like https://raid.wiki.kernel.org/index.php/Performance agree with my findings, though they do suggest that with certain layout options the newer RAID10 single-level driver provides such optimisation for reads (at the expense of slower write performance).

Linux's md-raid "raid10" module is not your traditional RAID10 (a nested arrangement of RAID1 and RAID0, a stripe set of mirrors needing at least four drives): it is a single-level driver that offers greater layout flexibility with regard to the copies of data on each device, and even supports operation on three devices (this operates similarly to what IBM RAID controllers call RAID1E and some others call RAID0+1) or just two devices (in which case it behaves as RAID1 but with the extra layout options). See http://neil.brown.name/blog/20040827225440 amongst other references. IIRC you should avoid the "far" layout for your boot devices as boot loaders don't tend to understand it.

As commented already, this is actually incorrect. RAID cannot and will not detect data errors unless the drive itself reports the error and issues an "I can't retrieve this data" callback to the RAID controller. If the drive lies, RAID is fat, dumb, and happy. Sadly, drives lie all the time.

Now in reality in big enterprise SAN arrays, the vendors usually add checksum blocks onto the underlying disk (aka larger sectors in the Clariion, or checksums into the filesystem on NetApp) OR use special certified firmware on the drives that make sure the drives never lie themselves! So largely if you are on a big enterprise SAN, you probably don't seriously need to worry about bit rot.

But outside of that space, most PCI card RAID controllers, and I suspect at least a few "enterprise lite" arrays, probably don't do checksum calculations. So BTRFS and ZFS provide notable value. (and definitely on raw drives)

And we can lay blame squarely on Intel and others here to a certain extent.

Intel could easily offer ECC support in its consumer line of desktop (none) and laptop processors (only offers it on 3 of them: http://ark.intel.com/search/advanced/?s=t&FamilyText=4th%20G...), but doesn't.

To me, ECC support should be like SSL/TLS for modern applications; don't do it without it!

I agree and it's one of the reasons it's much easier to go with AMD for cheap home servers/NAS (HP Microserver line is great for example) as Intel tends to reserve ECC for very expensive parts.

What doesn't seem easy is to get ECC in laptops. I don't think even traditionally business-focused lines have it (e.g., Thinkpads).

Nowadays you can get ECC with intel for quite cheap: http://ark.intel.com/search/advanced/?s=t&MarketSegment=DT&E...

Unfortunately while the CPUs themselves are quite reasonable, motherboards tend to be far more expensive and less available. I wonder how much of a markup Intel has on the server chipsets vs. desktop ones; it'd be interesting to know whether Intel or motherboard manufacturers are making bank on the boards.

That's good to know. It seems the issue is actually motherboard support. Newegg shows a single Intel motherboard with ECC and no AMD ones. I suppose the low-end server hardware (e.g., HP Microservers) just tends to be AMD, so it's easier to get a cheap AMD server with ECC than an Intel one. I had assumed it was an Intel issue but apparently not.

It would be nice if the intel ultrabook standard started including ECC as well. One can dream...

On LGA1150 you need a C22x chipset for ECC support: http://www.intel.com/content/www/us/en/chipsets/server-chips...

Prices for these start about 3x higher than other boards, their choice is extremely limited, and availability even more so :/

As a Solaris admin, it's a little strange to hear ZFS called a "next-gen" filesystem. (It's been around for at least 7-8 years!) But it's good to see these ideas getting usable in other operating systems, especially with the current state of Solaris licensing.

I think this is why there's a distinction between "new" and "next-gen". ZFS still has a lot of features and concepts that make most "current generation" file systems feel like toys.

I'm using HFS+. It's a last-last gen filesystem if ever there was one.

It's the only FS I've ever had silent corruption on, and it used to happen (10.4? 10.5? 10.6?) all the time.

I'd kill to have Apple just buy a license to NTFS.

Uh, you might want to look at NTFS performance for your given use case. If, say, you're a C++ programmer and your use case is reading lots of small files quickly and closing them again so you can compile, NTFS is a horror cabinet.

HFS beats that quite handily (and gets obliterated by ExtFS2 performance.)

At this point, I'd just like a filesystem I don't have to worry about. I never had a problem with NTFS.

So having slow compiles on NTFS is worse than having silent corruption on HFS?

Depends on the use case, no?

Would I want to store a lot of important write-once data on HFS? Reluctantly, if at all. High-throughput database? Yuck, no.

Would I trust it with a source-tree that is stored in a DVCS and immediately restorable, while also giving me much faster compiles? Yep.

Would I choose either FS for a server? Nope. That was the point - there is no single "best" file system, or the world would've long ago settled on it. Look at what you need it for, and make your decision accordingly.

Also, Docker vs Zones. Linux still has a way to go. :-)

Is bitrot something the average hard drive user even needs to worry about? I know the hard drives themselves at the hardware level implement checksums. Is it really necessary to also have it at the filesystem level?

I am legitimately asking because I have a good 2TB of family photos on hard drives and spooky stories about random bits flipping freak me out.

Yes, both latent sector errors (data that is lost that you don't know is lost) and detectable and undetectable bit errors happen at rates high enough to affect 2TB of family photos. NetApp[1] and the Internet Archive[2] have published good data on this in the past.

Scrubbing (periodically reading the data and comparing it to checksums) is one way to get around this. It's very effective against small numbers of sector errors in backups (you should have more than one), and in detecting marginal data on drives. It's less effective against some other kinds of corruption. Another option is to store multiple copies (replication), or additional ECC information that can be used to recover the data if one copy is lost.

How much effort you put into this really depends on how much you need or want the data.

[1] http://www.cs.wisc.edu/adsl/Publications/latent-sigmetrics07... [2] http://citeseerx.ist.psu.edu/viewdoc/download?doi=

Super links, thanks.

Running ZFS for many years has shown, yes, this happens, and will continue to get worse as density goes up.

Density is exactly the problem. The chance of an unrecoverable error is on the order of 1 per 10^14 bits read! But oh wait, you have 16 trillion bits on that 2TB disk?
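Back-of-the-envelope, using the commonly quoted consumer-drive spec of one unrecoverable read error per 10^14 bits:

```python
import math

bits = 2e12 * 8      # a 2TB drive holds 1.6e13 bits
p_ure = 1e-14        # spec'd unrecoverable-read-error rate per bit read

# Probability of at least one error when reading the whole drive once
# (Poisson approximation):
p_fail = 1 - math.exp(-bits * p_ure)
print(f"{p_fail:.0%}")   # roughly 15%
```

So even at spec, a single full read of a 2TB drive has a double-digit chance of hitting at least one unrecoverable sector, which is why RAID rebuilds of big arrays are so fraught.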

I've been running ZFS on several servers with tens of TBs of data. I see checksum errors every month, from a single bit to several megabytes.

That sounds like you have something else wrong. With ECC memory on server hardware, we've seen 0 checksum errors in the last 6 months and I've seen only 2 ever. A typical server has 136TB of raw hdd space and we get about 71TiB usable. It's about 80% full.

Absolutely! Over the last ~10 years, with perhaps an average of 6-8 disks at any one time and relatively low intensity use, I've probably had 4 hard drives where the files had checksum errors, before complete failure.

But you don't need anything fancy; 2TB is small enough though that you can just buy another HDD and back it up manually.

Except that if you don't notice the corruption, you'll just back up the bad data.

A single bit can be corrected, but there's a threshold per block. As far as I know, the hard drive won't auto-scan blocks that have been sitting still, so it's probably not a bad idea to run a full SMART scan every once in a while.

> As far as I know, the hard drive won't auto-scan blocks that have been sitting still

Seagates at least report their error correction rates via SMART, and make very obvious curves that show them scanning the heads across the disk surface gradually when idle.

I think the next generation of file systems will have CoW, be based on erasure codes, and be internet-distributed. For example, your home NAS runs InternetFS: now you can reach all your files wherever you are, and all files in the home directory are always in sync. Failure of a hard drive does not matter. If your house burns down, your friends or someone else on the internet has the pieces you need to reconstruct your data. The next-gen filesystem also has built-in versioning via the CoW, so you can always revert to an earlier version of a file.

InternetFS(Encrypted, distributed, always in sync, cheap snapshots, p2p)

Want to share Photos or movies with your friends and family? No problem just right click on the file, select the friend from a list, they see the file and they can view it on their computer.

This next gen file system will be incredibly easy to use, cross platform. As simple to use as Facebook, Skype, Email.

Hey! It sounds like you're describing ori [1]. Be sure to read the paper [2] because the website is a bit sparse on info. It's still young but the syncing, cheap snapshots, auto backups, etc. have been really nice. (it's only encrypted over transport though, I wish it were encrypted locally too).

1: http://ori.scs.stanford.edu/ 2: http://dl.acm.org/ft_gateway.cfm?id=2522721&ftid=1403940&dwn...

Great project.

Needs encryption, access controls and some kind of public namespace for publishing.

Thanks for the ori link! :)

My SSD gives me ~ 900 Mbps, to several Gbps, of random write performance. My ISP gives me 2 Mbps of upload bandwidth and a data cap.

Differential backup. Checksumming and deduplication. If the file is from the base distribution of the OS, then chances are it is already in the cloud, so no need to back it up again.
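The dedup idea, sketched minimally (a toy in-memory content-addressed store; a real backup service would persist chunks and split large files):

```python
import hashlib

class DedupStore:
    # Minimal content-addressed store: identical chunks are kept once,
    # so data everyone already has (e.g. OS files) costs no re-upload.
    def __init__(self):
        self.chunks = {}

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        self.chunks.setdefault(key, data)  # skip storing if already present
        return key

    def get(self, key: str) -> bytes:
        return self.chunks[key]

store = DedupStore()
k1 = store.put(b"same OS file on every machine")
k2 = store.put(b"same OS file on every machine")
assert k1 == k2 and len(store.chunks) == 1
```

In practice the client only uploads chunks whose hashes the server doesn't already know, which is what makes backing up a stock OS image nearly free.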

Who is backing up their OS? I thought most people were only backing up their unique data, which by definition wouldn't exist in 'the cloud' already.

Lots of people do - some backup providers charge tiered amounts or 'unlimited' data, and the convenience of getting your OS back exactly the way it was is absolutely worth using an extra 10-20GB of storage.

Sounds to me like Tahoe-LAFS [0] on a global scale. Shall we build this?


You need to run special gateway servers, I don't think it uses transparent encryption and there isn't a global namespace (for sharing publicly or with non-static groups).

First, more reading: https://tahoe-lafs.org/trac/tahoe-lafs/browser/trunk/docs/ab...

> You need to run special gateway servers

You don't _need_ to, each peer can run its own gateway on localhost. Gateway servers are useful if you're on a computer where you can't install anything.

> I don't think it uses transparent encryption

I'm not sure what you mean by that, but if it means you need to manually encrypt things before storing them in Tahoe-LAFS, you're wrong. The whole point of Tahoe-LAFS is that you push your files to it and it automatically encrypts, erasure-encodes, and distributes them to the nodes in the swarm you're in. All you need to care about are the URIs returned by Tahoe-LAFS, which contain everything you need to use the files (no need for extra passwords or keys; in fact, they're contained in the URIs).

> there isn't a global namespace

Tahoe-LAFS swarms are defined by their introducers, the central piece that ties all storage nodes and potential clients together, kind of like a bittorrent tracker. There are no namespaces because swarms are not globally defined (although a node can take part in multiple swarms, this is completely invisible to Tahoe-LAFS users, exactly like bittorrent). My proposition was to build a world-level swarm, where anybody could register and participate. If you wonder about the reliability of a central introducer, there is work in the pipeline for creating decentralized introducers.

I was under the impression that Tahoe-LAFS doesn't work offline. Is that not the case?

Indeed, but Tahoe-LAFS is not a replacement for your filesystem, rather a virtual place to backup and share files. Common use would be to have all files local, and a copy on your Tahoe-LAFS swarm.

I think this is obvious and also painful that no one is working on such fundamental internet infrastructure.

Reliable storage is hard (thanks ChuckMcM and notacoward for educating us), distributed systems are hard, security is hard, so secure P2P erasure-coded storage is like hard^N. Would you also like it open source so it can't be monetized? Space Monkey looks like they're making good progress but I have some doubts about its scalability.

Yeah it's hard but fundamentally important. I think it's really difficult for such a system to be as valuable as closed source in terms of audit-ability (encryption is worthless if product is backdoored) and application development. I don't have the answer as to what would incentivize anyone to create such a system but it wouldn't be the first piece of important open source infrastructure that has been created.

Space Monkey looks really interesting but I agree that this isn't something that is going to perform well running on residential internet connections. It also doesn't look like there is anything like non-explicit sharing (i.e. public files) which I think is an essential part of a distributed data store because it would be a boon for small-time publishing (no more web servers!).

I feel like we won't see anything like that become mainstream until IPv6 becomes the norm.

I would love the net to be more decentralized but everyone living behind NATs means we need central servers to connect the users between them.

So freenet but less annoying to use?

I have a small NAS4Free box at home with ZFS and automatic zpool scrubbing. Barring a physical disaster (fire, flood, etc), I expect my data to be safe indefinitely. It was pretty easy to set it up on an HP Microserver, I recommend it.

I have a Drobo in my Amazon cart, was about to pull the trigger until I read this article.

Do you know if your NAS4Free solution is viable for a household full of Macs that need a shared Time Machine destination? I'm tempted: it's cheaper, and NAS4Free will keep evolving.

Anything that can run a Samba server can be a target for Time Machine backups; google DIY Time Machine. I considered FreeNAS but figured it would take too much time to set up and went with the Synology Intel box instead; could not be happier with it. Synology has step-by-step instructions for setting it up as a Time Machine backup target.

Be warned that network-based TM is extremely unreliable. (Just google time machine synology failure).

I dumped time machine entirely for an rsync based solution. Take a look @ `--link-dest`

I run FreeBSD on a microserver with 4x3TB in raidz1, and it was simple enough to set up, and NAS4Free is even easier. I believe it has supported Time Machine for a long time, though it'll be the annoying disk-image-based Time Machine (same as used on the Time Capsule). It'd be really nice if there were a way to do Time Machine backups and use ZFS's snapshots. I'd love to see rsync with this ability without needing to write a script to create the snapshot after the sync.

Unfortunately, I have no idea about Time Machine, the one Mac I have is running 10.4 (pre Time Machine).

SMB, SSH/SFTP (even set up key-based auth), regular FTP all work great. Transmission works great and has a decent web interface. UPS support seems to be okay, haven't really tested it.

DLNA/UPnP (fuppes) crashes all the time and I only tried it with the very buggy VLC for iOS, so I'm not sure how stable it actually is.

OK thanks - you've set me on a big research path here, and hopefully saved me some money (and a few bits).

I recently moved back to FreeNAS after iXsystems took over the project. FreeNAS includes Time Machine support: http://doc.freenas.org/index.php/Apple_%28AFP%29_Shares#Usin...

I don't know much about NAS4Free, but if it can install netatalk, then you can do time machine easily

Doesn't the lack of an fsck tool for ZFS worry you? I know it's supposed to never ever get corrupted, but HDDs do lie sometimes, bits get flipped other times and knowing there's nothing that can try to discover lost files is a deal-breaker for me.

Also, from what I remember, due to the fact ZFS keeps previous versions of files around, the effective data capacity you get is something like 1/2 or even 1/3 of the capacity of the HDD, no?

You seem pretty uninformed about ZFS so I'll try and help. As others have mentioned, there's the zpool scrub command, which checks the integrity of the pool.

Also, I'm not sure where you got the "ZFS keeps previous versions of files around" idea from; it only does that if you make a snapshot, and only modifications of the file cause data to be copied. This is what copy-on-write is. It means you can have the whole history of your file system with almost no space overhead except for changes. Apple's Time Machine does a similar thing, but in a different way: each directory that hasn't been changed since the last backup is hard-linked to the previous backup's version (which might itself be a hardlink). This makes it quite space efficient (and it's a pretty easy-to-understand hack to get versioning on a file system that doesn't natively support snapshots).

Answering as a novice:

The whole point of ZFS is that if a bit gets flipped, it's corrected the next time the file is accessed. There's a scrub command that manually hits every file to check for integrity and you can schedule it through cron.

And I assume you can control the size of the snapshots, I haven't looked into it but it would be pretty silly if you couldn't. I have four 3TB drives in RAID-Z and I have 9TB to play with, same as RAID-5.

The command "zpool scrub POOLNAME" will go through and check all the checksums for you, and, you should use either RAID1 or RAIDZ (similar to RAID5) so that if it does detect an error, then the other copy of the data can be used to correct it.

For snapshots - they only get made if you make them (or set up an automated script that makes them). You can list the snapshots and see how much space each uses, as well.

Flipped bits will be detected and corrected if you're running at least a mirror, which you should be. This is one of the main reasons you should use ZFS for data you want to keep, and one of the reasons why it's so much better than traditional RAID.

You can create snapshots of FS to keep old versions, but how many, and therefore how much space is used, is up to you.

Listen to Belt and Suspenders [1] and Computational Skeuomorphism [2] episodes of Hypercritical for an excellent discussion of filesystems: what the hell do they do and how do they compare.

[1]: http://5by5.tv/hypercritical/56 [2]: http://5by5.tv/hypercritical/57

+1 on Hypercritical's discussion on widely used file systems and their problems. When I saw this article on Ars I was surprised that John Siracusa wasn't the author. Somewhere he must be smiling.

My thoughts exactly. Listening to John, he somehow manages to refer to how sucky HFS+ is literally in every other episode of whatever podcast he happens to be on.

ZFS is also wonderfully usable. It's a delight to admin. See http://rudd-o.com/linux-and-free-software/ways-in-which-zfs-...

That seems pretty out of date. I only follow btrfs lightly but at least the RAIDZ and send/receive points are wrong. Current btrfs supports those.

I want F2FS for my phones. It seems to make storage 50-100 percent faster compared to ext4 from the benchmarks I've seen. Motorola started using it in Moto X and Moto G, but I hope Google makes it the default for Android in the next Android version.

F2FS sounds like a bit of a hack. It's basically a way to get performance out of the storage-provided flash translation layer that pretends the underlying storage isn't flash. It also doesn't have the fancy ZFS/btrfs features. I wonder if the COW that these do can be tuned to work well with flash devices.

Anyone know how the latest versions of NTFS stand up against these?

Microsoft has introduced ReFS [1] as a potential successor to NTFS in Windows Server 2012. It has a few of the 'next-gen' filesystem features that the article mentions, such as integrity checking, but no copy-on-write and it's not feature-equivalent to NTFS yet. Also, you can't boot from it yet.

[1] http://en.wikipedia.org/wiki/ReFS

It seems to do COW and checksumming of metadata, but not of data. According to http://blogs.msdn.com/b/b8/archive/2012/01/16/building-the-n... there's a feature "integrity stream" which is opt-in per-file (or per-subtree and inherited by all files) checksumming. It doesn't seem to do COW, but can be paired with "storage spaces"[0] from which bitrotted files can be recovered.

[0] http://blogs.msdn.com/b/b8/archive/2012/01/05/virtualizing-s...

Integrity streams are enabled by default on mirrored pools, as per your first link:

> By default, when the /i switch is not specified, the behavior that the system chooses depends on whether the volume resides on a mirrored space. On a mirrored space, integrity is enabled because we expect the benefits to significantly outweigh the costs.


> When this option, known as “integrity streams,” is enabled, ReFS always writes the file changes to a location different from the original one. This allocate-on-write technique ensures that pre-existing data is not lost due to the new write

You also can't store things like WSUS server's updates on it; I have no idea why. It rather mitigates the point, and frustrates me, that I can't use it for a lot of the things I'd like to be robust against such problems...

If you are really concerned about bitrot, and raid is apparently "not the solution", generate your own parity files for important stuff: http://parchive.sourceforge.net/#clients

The original intention for parchive was to split files into multiples for posting on usenet. The bonus - you get n extra files, so that if any of the sub-files are broken, you can reconstruct it.

I use it for my backups, both automated and manual. For my photos, I write a set of 10-15 DVD-sized tar archives containing files that haven't been backed up yet, using a tool I wrote. I then run parchive to generate two or three DVD-sized ECC files, so if any one DVD gets trashed I can recover it. Then it's just a matter of burning the DVDs and stacking them somewhere off-site.
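The detection half of this workflow is easy to sketch. Here's a minimal checksum-manifest approach in Python (note this only *detects* corruption; par2 additionally generates Reed-Solomon recovery blocks so you can actually repair it):

```python
import hashlib


def build_manifest(paths):
    """Map each file path to its SHA-256 digest, computed before archiving."""
    manifest = {}
    for path in paths:
        with open(path, "rb") as f:
            manifest[path] = hashlib.sha256(f.read()).hexdigest()
    return manifest


def verify(manifest):
    """Return the list of files whose current digest no longer matches."""
    bad = []
    for path, digest in manifest.items():
        with open(path, "rb") as f:
            if hashlib.sha256(f.read()).hexdigest() != digest:
                bad.append(path)
    return bad
```

Store the manifest alongside (and separately from) the archives, and a later `verify()` pass tells you which DVD to re-burn from the par2 recovery files.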

That's very fiddly compared to using ZFS and getting it "for free".

Depends on what you find fiddly. With par2 I can move the files to other filesystems without problems, not being bound to ZFS (bsd) means a lot to me. I will reconsider when Btrfs is stable.

Unless ZFS isn't easily available on the OS you are using.

Linux has an easy to install port (http://zfsonlinux.org/). I remember OSX as having to use it via FUSE. I guess that leaves only Windows.

OS/2 had installable filesystems. Does Windows have something like it?

I didn't mean like ZFS, but like installable filesystems. It should be possible, with some effort, make ZFS run on Windows.

ReFS is beta-quality and lacks several features ZFS has had since its first production-grade release. It's not really an apples-to-apples comparison.

And if you don't want to do it manually, use solutions that already do this along with other nice stuff: I'm thinking about Snapraid[0] for instance.

[0] http://snapraid.sourceforge.net/compare.html

These days, it can take 10+ hours to stream all the data off a multi-TB disk. A single disk can contain an amazing quantity of data, thus, it can be very valuable and sensitive.

I'd like a file system that duplicates sensitive data on the same drive. The file system data should be duplicated too, and marked in some way as to be able to reconstruct a disk after failure.

I don't need to save on space, I only need safety. And keep in mind that it takes many hours to dump the contents of a disk even once - so, it's practically inaccessible as a whole over short periods of time.

For now, I get by with DropBox and TimeMachine but it's far from perfect. My photo collection alone is 1TB, so, no luck in backing it up in the cloud.

Use ZFS. Create a filesystem in your pool with copies=n to get n copies on the same disc.

But still use mirrors or raidz to protect against failure of a drive.

>My photo collection alone is 1TB, so, no luck in backing it up in the cloud.

Amazon glacier would store that for $10/month, though the retrieval costs if you needed to restore the whole thing are a bit more complicated.

I use BackBlaze for several hundred GB of data - and that's mostly over a 3Mb/s ADSL connection. It's been running for several years now and I can vouch for it being absolutely rock solid.

Stupid question: even if we use checksums and parity files and such to verify the integrity of our data, how do we verify that these integrity measures don't themselves become corrupted? Is it just "that's very unlikely to happen"?

It's actually a reasonable question. For parity, it was always possible for two complementary bit errors to result in a successful parity computation in the presence of corrupted data. The CRC function can detect multi-bit errors because it encodes not only bit state but also bit sequence in the error check. There is an excellent discussion of the tradeoffs in the book with Richard Feynman's lectures on computation.

Generally when you design such a system you can often say what would have to be true for you to "miss" that something was corrupted. In our parity example, an even number of bits would have to change state. In the CRC example bit changes would need to be correlated across a longer string of bits. Once you have ways that you know you would not be able to detect errors, then you start breaking the system apart to change up detection and correction. So for example at NetApp a block (which was 4K bytes at the time I was there) on disk was 8 sectors, then there was an additional sector that included information about both a CRC calculation for the available bytes, as well as information about which block it was and what 'generation' it was (monotonically increasing number indicating file generation). The host bus adapter (HBA) would do its own CRC check on the data that came from the drive, passed through it, and landed in memory. That would detect most bit flips that occurred on the channel (SATA or Fibre Channel port) as data went through it. ECC on memory would detect if memory written had its bits flipped. Software would recompute the block parameters and compare them to the data in the check sector.

So if data on the disk was bad, that check sector would not work; if the data had been written correctly initially and gone bad, the RAID parity check would catch it; if the data was corrupted crossing the disk/memory channel, the HBA would catch it; if the data got to memory but memory corrupted it, the ECC would catch it; and if the memory somehow didn't see the corruption, the check against the block check sector would catch it. All layers of interlocking checks and re-checks, in order to decrease the likelihood that something corrupted your data without you knowing it.

This is fascinating! Thanks everyone for answering.

> how do we verify that these integrity measures don't themselves become corrupted?

In ZFS, a storage pool is a merkle tree[0], the checksums themselves are checksummed (as part of their ancestor blocks). The one risk is that the uberblock itself (the root of the tree holding a checksum for the whole thing) becomes corrupted, which is why IIRC zfs stores the last 128 revisions of the uberblock in 4 different physical locations[1].

[0] http://en.wikipedia.org/wiki/Merkle_tree

[1] sadly does not mean you can rollback to any of them, when a new uberblock is created (which is common) the previous's set of metadata (meta-object set MOS) becomes reclaimable, and once the MOS has been reclaimed/reused the corresponding uberblock is useless
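A toy sketch of the Merkle-tree idea (this is an illustration, not ZFS's actual on-disk format): every non-leaf checksum covers its children's checksums, so corrupting any block - or any checksum - changes the root.

```python
import hashlib


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(blocks):
    """Hash each data block, then repeatedly hash pairs of hashes
    until a single root hash remains."""
    level = [h(b) for b in blocks]
    while len(level) > 1:
        if len(level) % 2:  # duplicate the last hash on odd-sized levels
            level.append(level[-1])
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]
```

Flipping a single bit in any block gives a different root, which is why protecting just the (replicated) uberblock protects the integrity of the whole tree.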

Basically, yes. In the old days, integrity was ensured by using 1-bit-in-8 parity: for every eight bits there would be an extra bit, chosen so that the count of bits set to 1 was always even (or odd, depending). This meant that if there was a single bit flipped, you were guaranteed to detect the failure, but powerless to correct it. However, if you had two bits flipped, the parity would pass.

Nowadays, we use something more like a hash function, where we take the block of 4kB or so and produce a 64-bit hash of it. For corruption to go undetected, the corrupted data would have to hash to the same 64-bit value as the original, a 1 in 2^64 chance, which is very unlikely to happen.

Also, we have the possibility of using Reed-Solomon error correcting codes in some applications, which can not only detect errors but also correct them.
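The single-flip-versus-double-flip behaviour of simple parity is easy to demonstrate (a Python sketch):

```python
def parity(bits):
    """Even parity: 0 if the number of 1-bits is even, else 1."""
    return sum(bits) % 2


byte = [1, 0, 1, 1, 0, 0, 1, 0]
check = parity(byte)  # stored alongside the data

# A single flipped bit is detected...
one_flip = byte.copy()
one_flip[3] ^= 1
assert parity(one_flip) != check

# ...but two flipped bits cancel out and the check passes silently.
two_flips = one_flip.copy()
two_flips[5] ^= 1
assert parity(two_flips) == check
```

This is exactly the "even number of bits changed state" failure mode described above, and why longer checks like CRCs and hashes are used instead.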

I've only had one cup of coffee so far, but...

The problem would be if the checksum AND the data _both_ became corrupted in a way that the checksum were still valid. After all, if the checksum were to change, you'd get a failure, and then you could just verify that the data was not actually changed and/or lost. Not to mention that if you stored the checksum in two places, it'd be pretty easy to see that it was the checksum that itself changed.

Huge kudos to ZFS and btrfs, but I am very disappointed with one detail of next generations file systems: Why, oh why, do we still not have a file-type metadata field? We are still using silly file name extensions and magic mime-type detection. In the Age of Types (OOP and FP strong type systems) it only makes sense for the file system to do the same.

I think pretty much all modern file systems (including btrfs and ZFS) have the facility to support this but it's more a matter of applications and operating systems taking advantage of it.

On Linux for example you can set the user.mime_type xattr and it's possible to have Apache use that for its mime-type.
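A minimal sketch of that xattr approach in Python (Linux only, and the filesystem must be mounted with user xattr support - ext4, XFS, btrfs, and ZFS-on-Linux all qualify; the attribute name is just a convention, not enforced by the kernel):

```python
import os


def set_mime_type(path: str, mime: str) -> None:
    # Store the MIME type in the user.mime_type extended attribute.
    os.setxattr(path, "user.mime_type", mime.encode())


def get_mime_type(path: str) -> str:
    # Read it back; raises OSError if the attribute is absent
    # or the filesystem doesn't support user xattrs.
    return os.getxattr(path, "user.mime_type").decode()
```

The shell equivalents are `setfattr -n user.mime_type -v text/html file` and `getfattr -n user.mime_type file`.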

Not sure why the filesystems are getting blamed for this, many filesystems on Linux support xattrs for storing related data about the file. But the applications and the tools need to support storing and reading and doing something with that (meta)data.

I personally think the biggest problem with current file systems is lack of ACID semantics. It is a tragedy that you can't rollback a set of file system changes if a shell script fails halfway. Why should I have to bring out SQLite if I want transactions, but don't need relations?

(Also I agree--yes MIME types in filesystem please)

Even if such a thing were to exist, it would probably by necessity be built on some lower-level system that doesn't provide such guarantees… like a filesystem.

IIRC in the old Tandem OS transactions were a basic service, even lower level than the file system. So their file system could use the transaction service to perform arbitrary transactions without much complexity. These days all the world's a VAX though.

Depending on exactly what you're doing, ZFS (and probably btrfs?) provide at least the Consistency & Durability of ACID, and maybe Isolation as well. But yes, it seems like it would be relatively straightforward to build Atomicity guarantees on top of a COW-based filesystem. I guess it's unlikely that Oracle would add that to ZFS now though - they don't tend to like discouraging the use of relational databases!

I remember reading some blog post that talked about implementing an InnoDB storage engine using a zfs backend, and using zfs snapshot operations. Seemed to be more of an idea than an actual implementation though.

While ZFS doesn't provide ACID semantics directly, you can achieve similar functionality using zfs snapshots.

Take a snapshot before you make changes in a script, then run the script, if the script fails, just rollback!
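That snapshot/rollback pattern is simple enough to wrap. A hedged sketch (the dataset name is hypothetical, and the command runner is injectable purely so the control flow can be exercised without a real pool; note that `zfs rollback` only reverts to the most recent snapshot unless you pass `-r`):

```python
import subprocess


def with_snapshot_rollback(dataset, action, run=subprocess.check_call):
    """Poor man's transaction: snapshot first, roll back on failure.

    `dataset` is a ZFS dataset (e.g. "tank/home" -- a made-up name),
    `action` is a zero-argument callable, and `run` executes a command
    list (injectable for testing without a real pool).
    """
    snap = dataset + "@pre-script"
    run(["zfs", "snapshot", snap])
    try:
        action()
    except Exception:
        run(["zfs", "rollback", snap])  # undo everything since the snapshot
        raise
    else:
        run(["zfs", "destroy", snap])   # success: drop the safety net
```

It isn't true isolation - concurrent writers to the dataset get rolled back too - but for a single-user maintenance script it behaves like a transaction.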

"Why, oh why, do we still not have a file-type metadata field?"

Because you haven't added it. Extended attributes can store arbitrary data. Propose something, experiment, see how it works, and maybe you'll come to something we can standardize on.

This could easily be implemented on top of any existing filesystem, so why would it benefit from being implemented at the fs level?

Versioning makes sense for performance but adding metadata sounds like a huge can of worms with no obvious benefits for a low level implementation IMHO.

Besides your metadata wouldn't carry well to other filesystems (most USB drives still use FAT...) so it would be a bit of a headache to get right. Look at the mess that are file permissions and ACL already.

> oh why, do we still not have a file-type metadata field?

Because it's a bad idea. Apple tried it and it didn't work.

See: http://en.wikipedia.org/wiki/Creator_code and http://en.wikipedia.org/wiki/Type_code

Because the author doesn't seem to care about defining what COW means, because "atomic cow" doesn't readily google to its proper meaning, and because it doesn't seem to be otherwise mentioned here:

COW: Copy-On-Write


Given that hard disks already do a CRC check when reading data (as far as I am aware), I don't see how this adds anything. The author is assuming that there is no hardware-level checksum or CRC, which I believe is incorrect.

Just back up your data to Blu-ray (25 GB), or use PaperBack 1.0 for 1 MB or more per A4 page depending on compression: http://www.ollydbg.de/Paperbak/index.html My girlfriend is already printing out the family pictures as she doesn't trust me keeping the backups of the JPEGs.

I would expect Blu-ray to have a worse life expectancy than DVD since it is more dense. DVDs have a worse life expectancy than CDs. CDs have somewhere between poor and acceptable lifetimes. Good luck.


* http://en.wikipedia.org/wiki/CD-R#Lifespan

* Library of Congress "CD-R and DVD-R RW Longevity Research" http://www.loc.gov/preservation/scientists/projects/cd-r_dvd...

The actual study: http://www.loc.gov/preservation/resources/rt/NIST_LC_Optical...

* http://www.thexlab.com/faqs/opticalmedialongevity.html (references the NIST study above)

There's no guarantee that the data will reach your BR disk intact. Or that a writable BR will survive for years (let alone decades)

It's pretty trivial to check that a write succeeded. Unless you mean getting corrupted before putting the data into the redundant/backup system, which can happen just as easily with ZFS or almost any setup.

1) I would never trust my data to something that degrades as easily as a plastic disc

2) I would never trust my data to a medium that can get destroyed if touched with your finger the wrong way and you're required to handle it manually to read the data on it

Is there any other medium that would be suitable and provides more storage than engraving to metal or stone?

If you expect long term storage, you better be using M-Discs, not cheap Blurays. They've been tested to last 1000 years, which is far longer than other optical media.


> They've been tested to last 1000 years

With what? A time machine?

So, Btrfs+Linux or ZFS+BSD for a home server? I thought the former was not production-ready. I am confused.

Nothing about Hammer? Pity.
