
Never mind all the promised wondrous features of ZFS, what are the recovery / fsck tools like? If the data recovery tools aren't mature, reliable and useful, then I'd advise anyone to stay away from a filesystem, no matter how world-ready the underlying FS code is.

I'm probably biased; but I got screwed using ReiserFS. It was fast and great, but one day something went wrong and then I found that the reiserfsck program was practically useless. Very little work had been done on it, so any small inconsistencies in the FS meant your data was toast.

Sure, keep backups and all that, but just be aware how important a good fsck tool is.

My understanding is that ZFS does a lot of healing itself on the fly. When you access a block, the block is checked against a checksum stored elsewhere on disk. If the block is bad, it is restored from a replica (so it's advisable to have 2 disks) while the original is healed.
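To make the read path described above concrete, here is a minimal sketch in Python. It is not ZFS code: the `MirroredBlock` class, the two in-memory copies, and the use of SHA-256 are all assumptions made for illustration. It only shows the idea of checking each copy against a separately stored checksum and repairing the bad copy from the good one.

```python
import hashlib


def checksum(data: bytes) -> str:
    # SHA-256 is an assumption for this sketch, not a claim about ZFS defaults.
    return hashlib.sha256(data).hexdigest()


class MirroredBlock:
    """Hypothetical two-way mirror of a single block (illustration only)."""

    def __init__(self, data: bytes):
        self.copies = [bytearray(data), bytearray(data)]
        # The expected checksum is stored apart from the copies themselves.
        self.expected = checksum(data)

    def read(self) -> bytes:
        # Find the first copy whose contents still match the stored checksum.
        good = next(
            (bytes(c) for c in self.copies if checksum(bytes(c)) == self.expected),
            None,
        )
        if good is None:
            raise IOError("every copy fails its checksum; nothing to heal from")
        # Heal any copy that no longer matches by overwriting it with the good one.
        for i, c in enumerate(self.copies):
            if checksum(bytes(c)) != self.expected:
                self.copies[i] = bytearray(good)
        return good
```

With a single copy there is nothing to repair from, so the same check can only report the damage, which is why a second disk is advisable.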

On writes, ZFS also checks that each written block is valid before committing the new block to the filesystem. In the event of a bad write, the new block is not committed and the old block remains, because new blocks are originally written to a separate area of the disk.

Yes, and those are good features. But there's still plenty of room for errors. Bugs in the code, bad interactions with hardware, etc. Those are areas where ZFS is likely to be just as vulnerable as other file systems. More so, in fact, given the code size/complexity and its (relative) immaturity.

I suspect (but I have no data!) that most of the time users hit file system corruption and need to run fsck, the cause is bad software behaviour, not failing disks.

ZFS itself has been around for a while, was originally part of Solaris IIRC. It's using it on Linux that's new.

If you have a bad software write, ZFS should be able to detect that in the same way it would a hardware error, since the checksums won't match. Also worth noting that the checksums are themselves checksummed in a Merkle tree. It also has a tool called "scrub" which can check data integrity without unmounting the disk.
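For anyone unfamiliar with the Merkle tree idea, a rough Python sketch follows. It is a generic binary hash tree, not the actual ZFS on-disk layout; the `merkle_root` function, the SHA-256 choice, and the duplicate-last-node rule for odd levels are assumptions for illustration. The point is only that one small root value ends up depending on every block beneath it.

```python
import hashlib
from typing import Sequence


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(blocks: Sequence[bytes]) -> bytes:
    """Hash the blocks pairwise, level by level, until one root hash remains."""
    if not blocks:
        raise ValueError("need at least one block")
    level = [h(b) for b in blocks]
    while len(level) > 1:
        if len(level) % 2:
            # Odd number of nodes: duplicate the last one (a common convention).
            level.append(level[-1])
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


original = merkle_root([b"block-0", b"block-1", b"block-2"])
tampered = merkle_root([b"block-0", b"block-X", b"block-2"])
# Changing any single block changes the root, so the root vouches for all of them.
assert original != tampered
```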

Of course if your filesystem has buggy code for checksumming/repair then you are boned whatever.

Disks lie. A lot. Ask anybody who has the misfortune to work on a filesystem for any length of time.

One thing to keep in mind is that ZFS is more complex than your typical filesystem, but it's also one of the few filesystems that's effectively a cryptographically self-validating tree. You're a lot more likely to have a disk lie to you than ZFS flake out.

And the whole point of RAID is to combat that... But since there are no decent recovery tools for ZFS, your entire FS might be toast if the wrong bits are turned.

Yes ZFS is quite resilient against corruption but when it bites, which has happened plenty of times in the past, you are truly screwed.

The philosophy behind ZFS and file system recovery has always been that it is too inconvenient. It is much better to restore from tape backups so why make any effort on recovery?

Thing is, most consumers don't have tape backups...

Because of that, ZFS is in the vast majority of cases a bad choice for home users, especially under Linux, where it can hardly be considered mature.

Btrfs just isn't ready yet.

As a home user you can't even get a decent COW FS with decent snapshotting or checksums. That is the sad state we are in and will be in for many years to come.

"And the whole point of RAID is to combat that..."

Actually the whole point of RAID is to combat a MISSING drive, not a corrupt drive. Parity works when I'm lacking one bit ... not when any of my bits might be lying.

Half of the reason for ZFS is to protect against scenarios where a drive's internal checksum should catch a data error, but the drive returns bogus data and says 'oh, it's good'. RAID5 can't catch that (mathematically impossible)... RAID6 can, but most implementations don't check because it's expensive (extra I/O and calculations).
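The missing-versus-lying distinction can be sketched in a few lines of Python. This is a toy single-parity stripe, not any real RAID implementation; the helper names (`xor_blocks`, `rebuild`, `find_bad`) and the per-block SHA-256 sums are assumptions for illustration. With one XOR parity block you can rebuild a drive you know is gone, but when a drive silently returns wrong data, "repairing" any one of the drives makes the stripe consistent again, so parity alone cannot say which one lied. A separately stored checksum per block can.

```python
import hashlib
from functools import reduce


def xor_blocks(blocks):
    # Byte-wise XOR across equally sized blocks.
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))


def is_consistent(stripe):
    # A stripe (data blocks plus parity) is consistent when everything XORs to zero.
    return not any(xor_blocks(stripe))


def rebuild(stripe, index):
    # Recompute the block at `index` from all the other blocks in the stripe.
    return xor_blocks([b for i, b in enumerate(stripe) if i != index])


def block_sum(block):
    return hashlib.sha256(block).hexdigest()


def find_bad(stripe, sums):
    # Per-block checksums stored elsewhere pinpoint which block is lying.
    return [i for i, b in enumerate(stripe) if block_sum(b) != sums[i]]


data = [b"AAAA", b"BBBB", b"CCCC"]
stripe = data + [xor_blocks(data)]
sums = [block_sum(b) for b in stripe]

lying = list(stripe)
lying[1] = b"BxBB"  # drive 1 silently returns bad data

# Parity notices the stripe is inconsistent...
assert not is_consistent(lying)
# ...but rebuilding ANY single drive yields a consistent stripe, so it cannot pick the culprit.
for i in range(len(lying)):
    candidate = list(lying)
    candidate[i] = rebuild(lying, i)
    assert is_consistent(candidate)
# Checksums identify drive 1, and then parity can rebuild it correctly.
assert find_bad(lying, sums) == [1]
assert rebuild(lying, 1) == b"BBBB"
```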

This is the reason most of your high end SAN vendors don't just deploy RAID5/6/10 but also checksum the data themselves under the covers. They don't trust even high end SAS/FC drives.

Please tell me how my entire FS will be toast if the wrong bits are turned, especially when raidz2 is really the only valid solution that should be used.

I've been using ZFS for years now. I've had disks go bad and I've replaced them. ZFS has protected me from data loss better than anything else. In the past when I've had the disk lie to me, ZFS was able to tell me what files were corrupted and thus had invalid data.

I'm going to need to throw a [citation needed] on your post.

If this is news to you, you really haven't been paying attention...


I have been paying attention, and in a lot of cases where there is corruption an fsck wouldn't help anyway. I've had corruption on ZFS, UFS, ext3, ext4, HFS, HFS+ and the only one where I have lost almost no data or no data at all is ZFS (the only time I've lost data on ZFS is when it warned me that the disk was silently giving me bad data back, thanks to its end-to-end checksumming, and I didn't have a mirror).

UFS, ext3, ext4 would have continued on giving me bad data back. Are there going to be cases whereby a file system fails completely? Yes, but writing tools to attempt to fix those issues that happen once in a million times is almost impossible because of the difficulty of replicating those failure scenarios and then writing code to deal with the disks having that corruption.

I still trust ZFS with more data than any other file system.

As I've asked elsewhere in this thread: given the way ZFS works, you'll have to elaborate on what an fsck would do that ZFS doesn't automatically do already.

I'm getting "we need an fsck that does magic" vibes here.

I don't quite understand what corruption scenarios fsck on Ext4 would save you from that ZFS wouldn't?

Most data on consumer's computers is 1 disk failure away from loss anyway, hence the popularity of cloud syncing services.

There is no fsck for ZFS. Because the ZFS developers at Sun/Oracle believe that ZFS is robust enough to never corrupt (untrue) and in case it does you should have a backup.

I got screwed by ZFS, and I was able to recover the filesystem by learning its on-disk structure and fixing it directly on-disk. Something that a decent fsck tool could do. But no, ZFS never breaks. Go figure.

fsck doesn't make sense for ZFS. If you want to check for fs integrity, you can scrub the pool. If you want to get the fs back to a state before there was corruption, you use the transaction history. If you want to import degraded arrays, there are commands for that. There's no magic in ZFS. What exactly is missing that you want to see?

> If you want to get the fs back to a state before there was corruption, you use the transaction history.

How? ZFS refuses to mount/import a corrupt filesystem.

In my case, the latest superblock (or some internal bookkeeping structures that the superblock points to) was corrupted in a way that made ZFS give up completely. So what I ended up doing was to manually invalidate the latest superblocks until ZFS was able to mount the filesystem. I may have lost the changes written in the few minutes before the corruption, but that's still way better than losing everything.

Before I decided to poke around the raw disk with dd (to invalidate the superblocks by overwriting them with zeros), I googled around and I wasn't the only one with that problem. One other guy asked on the ZFS mailing list and the response was along the lines of 'Your filesystem is FUBAR, restore from backup'.

You may argue that ZFS itself should do what I did (dropping a few minutes of transaction history and roll back to the latest non-corrupt state) upon mounting. Fair enough. I don't really care if that functionality is built into ZFS or an external fsck binary. The fact is that ZFS wasn't able to recover from the situation. One that I would argue is very trivial to recover from if you know the internal ZFS on-disk structure.

"How? ZFS refuses to mount/import a corrupt filesystem."

zpool clear -F $POOLNAME


So, your argument then is that this is something an fsck would normally do?

You may have found a corner case in the fs and perhaps this sort of thing should be added to the import command, but I'm not sure simply having an "fsck" fixes this. I just think the import command appears to have a bug/needs a feature.

How's zdb different from fsck? Nobody cares whether the tool is called fsck, scrub or zdb. You just need to have a tool which can recover from corruption. And for a very long time the ZFS developers didn't think end users needed it. Maybe zdb was there all the time. But it wasn't advertised or documented. People were told their filesystem was FUBAR when it wasn't.

"How's zdb different from fsck?"

That's answered very well in the article connected to this other currently active HN discussion:


In short, fsck simply checks to see that the metadata makes sense, and that all inodes belong to files, and that all files belong to directories, and if it finds any that don't, it attaches them to a file with a number for a name in lost+found.

It's pretty crude compared to a filesystem debugger.

If you want to compare apples-to-apples, you'd be better off asking how zdb compares to debugfs (for ext2/3/4) as both are filesystem debuggers.

You could also ask "How's zpool scrub different from fsck?" and the answer to that would be: zpool scrub checks every bit of data and metadata against saved checksums to ensure the integrity of everything on disk. In comparison, fsck cannot detect data corruption at all, and can only detect metadata corruption when (for example) an inode points to an illegal disk region.
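To illustrate what "checks everything against saved checksums" means, here is a toy Python sketch of a scrub-style walk. None of this is ZFS code: `Block`, `Pointer`, `point_to` and `scrub` are hypothetical names, and for brevity each checksum covers only the child's data rather than its pointers as well. Every parent stores the expected checksum of each child, and the walk reports the path of every block that no longer matches.

```python
import hashlib
from dataclasses import dataclass, field
from typing import List


def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


@dataclass
class Pointer:
    """Hypothetical block pointer: the parent's record of a child and its checksum."""
    name: str
    expected: str
    target: "Block"


@dataclass
class Block:
    data: bytes
    children: List[Pointer] = field(default_factory=list)


def point_to(name: str, block: Block) -> Pointer:
    # Record the child's checksum in the parent at the time the pointer is made.
    return Pointer(name, digest(block.data), block)


def scrub(block: Block, path: str = "") -> List[str]:
    """Walk the whole tree and return the paths of blocks that fail their checksum."""
    damaged = []
    for ptr in block.children:
        child_path = f"{path}/{ptr.name}"
        if digest(ptr.target.data) != ptr.expected:
            damaged.append(child_path)
        damaged.extend(scrub(ptr.target, child_path))
    return damaged
```

A metadata-only check has no stored checksum for file contents to compare against, which is why it cannot notice the kind of damage this walk reports.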

Even that comparison shows fsck is crude when compared to scrub.

The tool to recover from corruption is a rollback: usage: clear [-nF] <pool> [device]

My employer has thousands of large machines with ZFS on them. We've seen corruption happen once, five years ago.

Maybe we've just been really lucky.

It's more likely you are underestimating your corruption because you are not monitoring it thoroughly. If you have thousands of machines, running for five years, you're gonna see corruption occasionally. No FS can truly protect you (but a well designed FS will reduce the probability that a corruption will become a user-visible event).

Of course there are problems with the disks -- corrupt sectors, phantom writes, misdirected reads, etc. The point is that ZFS handles those problems for us.

I won't discuss the nature of the business, but it's unlikely that actual corruption that isn't automatically repaired would go undetected for any amount of time.

The idea that you can have corruption and nothing to fix it with already sounds scary enough to me.

Especially for those of us that don't have thousands of machines and can therefore be badly screwed by one issue.

Given the way ZFS works, you'll have to elaborate on what an fsck would do that ZFS doesn't automatically do already.

What I find scary is silently corrupt data, something which is a problem for most other filesystems. I've seen ZFS catch and fix that error orders of magnitude more often than I've seen ZFS flake out. If we're talking risk analysis, I feel you're worried about a mouse in the corner, while a starved tiger is hungrily licking its chops while staring at you.

Just look at the recent KDE git disaster. A lot of things went wrong there, but fundamentally the issue was ext4 silently returning bad data.

The thing about fsck-like recovery tools is you need to have a failure mode in mind when you write them. ZFS can fix most of those types of errors on the fly thanks to the checksums and on-disk redundancy. Or at least tell you that something is going wrong and which files are affected.

I think it's unfortunate that Sun called it 'scrub' instead of 'fsck', because the two are largely equivalent in functionality (to the extent supported by the filesystem). Both fix filesystem corruption if they can. It's just that scrub can be run while the filesystem is mounted, whereas the traditional fsck must be run when the filesystem is offline.

However, scrub does not make ZFS perfect. There are still ways the filesystem can become corrupted without scrub noticing. Or corrupted in a way that ZFS fails to recover from, even though recovering would be dead simple.

The attitude of the ZFS developers only works in the enterprise market: Your data is safe (checksummed, scrubbed, replicated using RAID-Z), but if a bit flips in the superblock just restore from your backup, because we won't provide tools to recover from that.

While I concur with most of your points, I must point out that there are four uberblocks, not one; a flipped bit will not impair the pool.

Let's argue that bits flip in all four uberblocks though, then ZFS will use the previous valid commit, which also has four uberblocks (ZFS is a CoW FS). And so on backwards for quite a few transactions. All these uberblocks are spread out at the beginning and end of each disk.

ZFS has a policy that the more important the data (and the top of the tree is most important), the more copies there are, although a user can also define how many duplicates there should be at the leaves of the tree.

Basically, you'd need a very trashed disk to render an entire pool dead. You're not going to recover from that, regardless of filesystem.

Well, ZFS clearly didn't use the extra copies, nor did it try to use the previous uberblocks. Otherwise it would've been able to mount the filesystem, don't you think? And I disagree that the disk was very trashed, if all it took was to invalidate a few uberblocks to get the filesystem mounted again.

Did you use zpool import -F (are you sure)? I'm surprised that'd be necessary though, given my experiences. How long ago was this?

As others have stated, if the current state has corruption, you roll it back. As I posted in response to another post above, this is as easy as "zpool clear -F $POOLNAME".

The fact that the vast majority of other filesystems have no way to detect silent corruption of data (only metadata inconsistencies) is far more frightening to me.

Here is a nice article written by someone who discovered just how unreliable disks are, after switching to ZFS (because other filesystems couldn't detect the corruption). Quite an eye-opener. http://www.oracle.com/technetwork/systems/opensolaris/data-r...

If you use chrome and are getting the same error as I am, here's the google cache link (Firefox will load it): http://webcache.googleusercontent.com/search?q=cache:caEwhGD...

The article references sources of studies of hard disk corruption, if you'd want something with even more detail and statistics:



Ummm... last I used it, "zpool scrub tank" used to do the trick. Has that changed, or is it unavailable on Linux?

Nothing inspires confidence like a filesystem with a 0.61 version number.

zpool import -fFX, but sssh, don't tell anyone :)

Actually. It does.

Unless you assume that all hardware is perfect and that no bugs exist in ZFS (there have been plenty of those in the past that made the whole FS completely worthless).

But no, ZFS doesn't need fsck. Because, we don't want it to need one. Oh, and we don't want to spend the resources developing one.

The only reason ZFS doesn't have a fsck tool is because in the enterprise world it doesn't need one. When it is needed you just restore from tape instead. It is that simple.

Actually. It. Does. Not.

See Checking ZFS File System Integrity[1] in the ZFS administration guide.

[1] http://docs.oracle.com/cd/E19082-01/817-2271/gbbwa/

A useful link. But it is unclear just how much a 'zpool scrub' checks. Sure, it is double checking the block checksums to ensure that your file contents haven't become corrupted. But how much checking does it do to the ZFS structures themselves?

At first glance, it certainly seems to depend upon some high-level ZFS data in order to start. A command like 'zpool scrub pool-name' still needs to navigate the ZFS pool data on disk in order to locate the named pool.

But how much checking does it do to the ZFS structures themselves?

They're validated as well. Everything has a cryptographic checksum that's stored in the parent block, starting at the data and working all the way up to the top of the tree.

Furthermore, the higher up the tree you go, the more redundant copies there are. The top of the tree has four copies, if I recall. This ignores support for mirroring and striping, which further improves data redundancy.

But wait, there's more! Those four blocks aren't overwritten. ZFS is a copy-on-write filesystem (data and metadata) which behaves a lot like the persistent data structures that Clojure hackers are so fond of, so if the newest writes do not validate, it'll roll back to the newest valid commit of the tree.
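The "roll back to the newest valid commit" behaviour can be sketched roughly like this in Python. It is only an illustration of the idea: `Commit`, `make_commit`, `newest_valid` and the transaction-number field `txg` are hypothetical names for this sketch, not the real uberblock format. Because old commits are never overwritten, a reader can walk back from the newest one until a checksum validates.

```python
import hashlib
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Commit:
    """Hypothetical root record for one committed transaction group."""
    txg: int          # transaction number, higher is newer
    payload: bytes    # stands in for the root of the tree
    checksum: str     # checksum recorded when the commit was written


def make_commit(txg: int, payload: bytes) -> Commit:
    return Commit(txg, payload, hashlib.sha256(payload).hexdigest())


def newest_valid(commits: Iterable[Commit]) -> Optional[Commit]:
    # Try commits from newest to oldest and return the first one that validates.
    for c in sorted(commits, key=lambda c: c.txg, reverse=True):
        if hashlib.sha256(c.payload).hexdigest() == c.checksum:
            return c
    return None


history = [make_commit(1, b"tree-v1"), make_commit(2, b"tree-v2"), make_commit(3, b"tree-v3")]
history[2].payload = b"tree-v3 with flipped bits"  # the newest commit is damaged
# The reader falls back to transaction 2, losing only the most recent writes.
assert newest_valid(history).txg == 2
```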

That's a nice way of saying that you're guessing given your experience with other filesystems. ZFS was a genuinely revolutionary filesystem, and doesn't behave like other filesystems, the sole OSS exception being BTRFS. Read up on it a bit, you'll find something interesting. :)

The pool data is stored on disk, but when a pool is already available on the system, that information is cached in memory and on disk in a zpool.cache file. That data may be stale or wrong; in that case you can export the zpool and re-import it (or, if it is missing, re-import with -f to force it).

It then reads the data about what zpool a disk belongs to from disk. That data is itself stored in multiple locations, so it is unlikely that all of them are corrupted. After importing the disks you can run scrub.

If the pool metadata itself is corrupted, generally you can roll back to a previous point in time when the data was not corrupted.

This document: http://docs.oracle.com/cd/E19082-01/817-2271/6mhupg6qg/index... describes fairly well what all the options along the way are.

None of the failure modes described would be any better if there was an fsck tool available... in all file systems it is going to cause data loss.
