ZFS on Linux 0.6.1 released: Ready for wide scale deployment (groups.google.com)
209 points by iso8859-1 on Mar 29, 2013 | 101 comments

So the two things I don't see are a comparison with BTRFS in terms of speed and stability. I know ZFS is cool, but while we were waiting for it BTRFS got pretty darn good. I know ZFS has some additional features that BTRFS doesn't have, but if I have to use an unstable or slow filesystem to get those features, when I could use my perfectly stable and fast filesystem while I wait for a few nice features, I am going to say ZFS has missed its opportunity.

Btrfs is actually a sordid state of affairs. It's been "two years out" for five years, which seems very disingenuous looking back. Groundbreaking FS development follows a pretty regular formula. 10 years seems to be the magic number. Major functionality (i.e RAID 5) is still just landing and has glaring issues. It's just now a couple years out.

Meanwhile, ZFS has over 10 years of history, stably implements most of Btrfs's _planned_ features, and has battle-tested deployments. Granted, the SPL for Linux adds variables, but there are some big users of this particular project.

So I approach Btrfs with the exact opposite mindset. It's guilty until proven innocent, despite some FUD from the Linux camp early on that has settled down a bit since Oracle now has its hands on both.

We have deployed ZFS for the last few years on some large backup servers (Solaris and FreeBSD) and our experience has been pretty rotten - the command line is slow at managing a few hundred volumes, it becomes unusable during a rebuild, and it has / had a rotten bug where if you fill a volume up to 100%, you need to allocate more space before you can delete it.

It's been a multi-year mistake for us and we're busy changing these servers back to nice simple XFS volumes.

And if you use more than 80% space performance degrades like a dog. We still use XFS for really large volumes though and it has always been fast and never missed a beat.

This is why people think BTRFS isn't any good: someone waltzes up and says, yeah, well, it doesn't have XXX feature, which only _one_ other filesystem has, so it's clearly lacking and immature. The truth is it works great, it's sufficiently fast, and it has been stable for years. Yes, it doesn't have RAID 5 or RAID-Z if you would like, and I can't import and export pools the way I can in ZFS, but honestly, if you're jumping on a FS that has been marked stable for all of two days on Linux and your reasoning is that you can't do software RAID on the other one, then best of luck to you.

To be fair: FUD only flourishes in the absence of evidence. People will take btrfs more seriously when distros start shipping it by default and when "serious players" start blogging about how they are using it and how great it works.

Honestly the same can be said about ZFS on Linux (and, quite frankly, even ZFS on Solaris). When it becomes "just the standard filesystem that everyone knows works", then people will stop yelling about the software (though the license mess will be with us forever, sadly).

OpenSuse uses it, and I run it as my root partition. The snapshots are a wonderful system restore mechanism, and I have a cron script to snapshot root after a boot.

The only time I've ever had a problem with btrfs was after a power failure, but btrfsck worked to fix it (although since it isn't in fsck.btrfs, it doesn't automatically run on boot, so I had to use the repair partition).

I've been using it for a year. I have around 8 hard disks in my machine, running half a dozen filesystems from fat32 to ntfs to ext4 to xfs, but I like btrfs the most of the native linux FSes atm.

We use btrfs in production on thousands of nodes, and we make use of advanced features (nested snapshots, seed device snapshots, etc.) It has performed admirably so far, with only one problem attributable to btrfs itself (seed device incompatibility with device pools, seen in 3.2.0).

It may not have RAID 5/6, or time-proven stability, but it kicks ass and over time it will kick ZFS's ass not on technical grounds, but simply because ZFS has too much baggage and not enough community support.

What baggage does ZFS have?

Reiterating the GP's point, I don't see anything concrete about speed and stability. Let me throw in my 2e-2 USD. I've been running BTRFS on my laptop since Ubuntu 12.10 came out without any troubles. Installing Ubuntu on BTRFS was effortless and I even rigged up a script to do scheduled snapshots of my home directory (hourly, daily, etc.) and found this feature to be very handy a couple of times.

And I started using btrfs in 12.04 and reverted back to ext4 in 12.10 because it was awfully slow. At least noticeably slower than ext3 (or 4, not sure what I had earlier). May have changed today, but yes... It's not all great and nice for everyone ;)

i.e. == that is. e.g. = example. RAID5 is not major functionality, it should no longer exist.

Can you explain why you feel that way? ZFS has intelligently managed RAID that keeps the performance penalty minimal. I'd rather have multiple-redundancy parity than 2x mirroring, and it takes less space.

ZFS' raid-z is nothing but pain. Does it have standard raid-5 now?

There is RAID-Z2, which allows for 2 disks to fail, and RAID-Z3, which allows for 3. Problem is rebuild time is still slow. I wouldn't recommend using RAID-Z1.

People love to harp about the benefits and wonders of ZFS, but they often forget to mention its limitations: most features have an impact on performance, memory consumption is ginormous, IO performance is generally not great (lower than with "simpler" FS), and managing your pools isn't that easy once you get serious about it.

Performance seems ok to me. 6x500gb 7.2k sata raid10 + 96gb ssd, 8gb ram.

        Run began: Fri Mar 29 08:32:24 2013

        Include fsync in write timing
        Include close in write timing
        Command line used: iozone -ec -r 32 -s 16380m -l 2 -i 0 -i 1 -i 8
        Throughput test with 2 processes
        Each process writes a 16773120 Kbyte file in 32 Kbyte records

        Parent sees throughput for  2 initial writers   =  996073.19 KB/sec
        Avg throughput per process                      =  499898.70 KB/sec

        Parent sees throughput for  2 rewriters         =  216128.36 KB/sec
        Avg throughput per process                      =  108075.14 KB/sec

        Parent sees throughput for  2 readers           = 1232245.62 KB/sec
        Avg throughput per process                      =  616353.25 KB/sec

        Parent sees throughput for 2 re-readers         = 1240045.15 KB/sec
        Avg throughput per process                      =  620251.62 KB/sec

that's really not so impressive. you're just dumping extents down on an empty filesystem and re-reading them. you're really just measuring the fact that the FS isn't particularly getting in the way of using your hardware.

Note also that rewriting drops to 108M/sec, which is pretty unimpressive.

216M/sec - rewrite is suffering from a striding mismatch there with the COW, bs is 32k while stripe is 128k. Consumer uses 1M.
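The mismatch described above can be put into rough numbers. The following is only a sketch of the arithmetic, assuming that "stripe" here means the 128 KiB unit that a copy-on-write rewrite has to replace as a whole, that the 32 KiB records from the iozone run (`-r 32`) are aligned, and that nothing is cached. The function name is made up for illustration.

```python
def cow_rewrite_amplification(record_kib: int, block_kib: int) -> float:
    """Bytes physically written per byte logically rewritten.

    Assumption: a copy-on-write filesystem cannot patch part of a block in
    place, so every block touched by a rewrite is written out again in full.
    Records are assumed to start on a block boundary.
    """
    if record_kib <= 0 or block_kib <= 0:
        raise ValueError("sizes must be positive")
    # Ceiling division: how many whole blocks one record touches.
    blocks_touched = -(-record_kib // block_kib)
    return blocks_touched * block_kib / record_kib


if __name__ == "__main__":
    # iozone record size from the run above vs. the 128k unit mentioned in the comment.
    print(cow_rewrite_amplification(32, 128))
    # Matching the record size to the block size removes the extra work.
    print(cow_rewrite_amplification(128, 128))
```

Under those assumptions each 32 KiB rewrite costs a full 128 KiB block write (plus a read of the old block if it is not cached), which is one plausible reason the rewrite figures sit so far below the initial-write figures.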

I'm wondering which volume manager and fs under linux you'd expect better performance from in this configuration?

"memory consumption is ginormous"

Yes, it will use whatever memory you allow it to, but this is purely dynamic just as caching is in Linux. If you're not talking about the ARC, but about deduplication, of course it uses more memory - how would it not?

"IO performance is generally not great (lower than with "simpler" FS)"

This is sounding troll-ish, as that statement equates to: A filesystem that checksums all data and metadata and performs copy-on-write to protect the integrity of the on-disk state at all times is slower than a filesystem that does neither.

Well, of course.

But do those "simpler FS" have the ability to massively negate that effect by use of an SSD for the ZIL and L2ARC? There have been many articles showing large, slow 5400RPM drives combined with an SSD ZIL & L2ARC massively outperforming much faster enterprise drives.

"managing your pools isn't that easy once you get serious about it"

I'm fairly stunned by this statement, as I've yet to see an easier, more elegant solution for such a task. Before ZFS, I liked VXVM with VXFS, but I now consider it obsolete. Linux's LVM is downright painful in comparison. I've yet to play with btrfs, so I'll bite my tongue on what I've read so far on it.

The deep integration of volume and pool management directly with the filesystem, essentially making them one and the same, is simply beautiful. Having these things separate (md, LVM, fs) after years of using ZFS seems so archaic and awkward to me.

Disclosure: 100% of my ZFS experience has been on Solaris (10-11.1) and OpenSolaris/Nevada. I've not tried it on Linux, yet.

Yeah... what are you talking about?

Enormous memory consumption = deduplication? Otherwise it doesn't seem enormous to me.

Sure ZFS made a lot of departures from the standard toolset with all its own stuff... but that stuff is better once you learn it.

I can easily get 600MB+/sec reads and writes on my pools, which are pretty commodity stuff. One of my pools is 95% full, which is high, and I just tested it and got over ~100MB/sec write, ~150MB/sec read... good enough for me. (Oh, and I just checked, it was scrubbing.)

Here are my little ZFS peeves: 1. you can't grow raidZ vdevs. Reduces flexibility. 2. I wish there was a version of scrub that rebalanced data across vdevs.

By comparison BTRFS is nice, but nowhere close to ZFS. Last I saw a BTRFS scrub consists of cat'ing every file to null... which doesn't even cover the FS and metadata.

Not to mention the whole lack of an upstream.

But all that aside ZFS was great when it was first introduced a few years back, especially with Sun backing OpenSolaris, but we've been jerked around for a few years.

I ran a few ZFS pools on Linux a few builds ago and it was quite stable then. For something like a home storage system where I do not care about all the shortfalls vs. BTRFS I would reach for ZFS just because it's what I know. For something that needs to be enterprise I'd hire someone that knows much more about it than me :).

> we've been jerked around for a few years

Oracle's ZFS (and Solaris) fork is a deadish end. Most of the interesting action the past couple years has been happening among a collection of illumos (including critical ZFS members formerly from Sun), FreeBSD, OSX and even Linux hackers.

I don't follow their activities closely, except to observe that they've been anything but idle. A lot of work has happened in the meantime, and continues. Hopefully one shows up to comment about it.

Oracle's ZFS is now different from BSD/Linux ZFS. They have become incompatible file systems because Oracle has picked up their ball and taken it back home.

For OSS ZFS the only upstream is the 'collection of illumos'.

I was running OpenIndiana for a while at home before I just got tired of trying to hack together a bunch of patches. The problem with having a project that is pseudo-OSS is that you can never actually use it seriously; it's never reliable, and for the most part it's going to end in a slow, quiet death.

In my opinion there were only a handful of reasons to use Solaris: Oracle, SPARC, ZFS and dtrace. The only reason to use it nowadays is because you can't scale x86 and need big iron for Oracle et al.

Unrelated to the topic at hand, but I've been happy with SmartOS. It's firmly and solely an OSS server OS, it's easy to use, many of the uglier Solaris legacy bits have been sanded off, and it has a decent package system (pkgsrc).

My earlier experiences with Solaris and OpenSolaris were mixed. SmartOS has been uniformly positive. Perhaps give it a try, although be aware that it (like everything illumos) is picky about the hardware.

List of different versions of ZFS


Let me understand: You claim that ZFS is the unstable or slow filesystem, and that BTRFS, which is not even yet marked as stable by its developers is the perfectly stable and fast one? That seems an opposite way of thinking.

Not ZFS, but ZFS for Linux.

> I know ZFS is cool, but while we were waiting for it BTRFS got pretty darn good.

My own anecdotal evidence is that btrfs is not ready for prime time. I ran a couple btrfs file systems from kernel about 2.6.30 to 3.4 and finally determined that my system was so slow because of btrfs. I switched to xfs and an operation that used to take weeks (literally) now takes hours.

Perhaps it's gotten significantly better in the last few revisions but I'd use caution if you want to use btrfs and still have good performance.

I'm getting +20k IOPs and nearly 1GB/sec read/write out of a mirror, something like 40K 1.5GB/sec out of a mirror+stripe on ZFS under FreeBSD. Disks are either 2x256GB SSD or 4x128GB SSD. Meanwhile BTRFS was so slow on random IO I had to toss it out in favor of the less reliable JFS or EXT4 on Gentoo.

Don't use RAIDZ if you want performance with ZFS however, and you need to set it up twice before you get how to performance tune it.

I tried resizing btrfs recently, my file came out truncated :(

I think Oracle's decision to not keep Solaris open source, at least the ZFS portions, is enormously shortsighted.

The fact that there is a ZFS implementation (with so many of the original ZFS engineers behind it) still being developed in the open and used by multiple operating systems, but especially the Linux juggernaut, shows that Oracle really doesn't have any benefit to keeping it closed any longer.

It has forked.

We now have the Oracle ZFS with features and functionality that are not in the open source variant, and the open source ZFS that is apparently adding features and commands that are not in the closed source ZFS.

A huge selling point for the open source implementation is the fact that if you decide to change vendors or OSes, you can easily do so without having a huge data migration (zpool export on the old, zpool import on the new).

Except for Solaris 11+. You can't go back and forth between them, so once you're on Solaris 11, you're stuck.

Yes, this is most definitely FUD, but I can easily see this being used in the not-so-distant future once Linux vendors start supporting/advertising it for themselves.

I think that if Oracle were to open (and keep open) the newer releases (even if a few releases behind, like they were originally claiming they would), it would eliminate that argument completely.

Personally, I'd be absolutely thrilled with a cross-platform on-disk ZFS (I triple-boot Linux/OSX/Solaris on my notebook).

Professionally, I'd love to see a cross-platform on-disk ZFS simply to be able to throw the appropriate OS behind the data.

The big thing missing from the OSS implementations that Oracle Solaris 11 has got is FDE.

It's possible with dm-crypt and LUKS, but you void most of the reasons for using ZFS. eCryptfs is also a possibility, but it's both much less secure (an attacker can run ls) and also drops a few ZFS features like dedup.

Meanwhile, Solaris 11 has FDE that actually works, and there seems to be no serious FOSS effort either to reverse engineer their work, or to design something better.

In 2013, no software developer should be storing things without FDE unless they have unusual physical security measures. That makes FOSS ZFS a non-starter for most of the people who would use it.

Filesystem encryption is very handy for a notebook computer (I use it, myself), but in a datacenter its value is a bit more questionable.

Where do you store your encryption keys? If they're on a removable USB device (for example), you'd have to contact the datacenter personnel to plug them in in the event you had to reboot (which happens from time to time if you perform OS SRU upgrades). If the USB device is left in the server and someone gains privileged access to the machine, they've got the key as well as the data. If the USB device is not in the server and someone gains privileged access to the machine, they still have access to the data.

The only time disk encryption is valuable is when the machine is off or the disks are being transported.

> The only time disk encryption is valuable is when the machine is off or the disks are being transported.

I'm thinking about the case where you have a computer in the home or office environment, e.g. the burglary attack vector. It's not just notebooks that get stolen.

I'm also considering the corporate or government espionage vector, where you have a reasonably skilled on-site attacker trying to read or modify the disk contents. In this case you are typing the keys in on boot, and you either ignore the RAM extraction attacks (which are difficult to execute in practice for non-academic attackers) or you can mitigate them by storing in L1 cache and so on.

Datacenters have unusually high physical security, and in addition I don't generally store highly sensitive data there (my work doesn't involve PCI compliance etc.) Whereas I do have PCI-level data about myself on my own computers.

"It's not just notebooks that get stolen."

Very good point. I guess I've been working in environments with datacenters too long. :)

"I'm also considering the corporate or government espionage vector, where you have a reasonably skilled on-site attacker trying to read or modify the disk contents."

Again though, that only protects you if the disks are out or the machine is down. If you have someone with that level of skill within your organization, they'll likely gain access to the running computer where the data has already been made accessible in decrypted form.

FDE is exactly why my home FS is Solaris 11 (32gb / AES-NI capable xeon). Linux ZFS + their encryption layers combined with large file preallocations performs very poorly.

It's too bad, I really dislike Oracle, but it's the best tool for the job currently.

And for the person above, as long as you restrict the pool version to 28 when creating under solaris 11 you can still migrate between solaris/linux/bsd. Obviously you lose out on native FDE though.

If it's ready for wide scale deployment, I wonder why the team decided against signaling that by calling it 1.0 instead of 0.6.1. A version number < 1 makes me think the developers still think it's pre-release quality. That may not be the case here, but it makes me wonder.

I'm not part of zfs on linux, I just looked at their github milestones[1]. It looks like they're reserving the 1.0 version for when it's "Fully functional and feature complete native ZFS implementation for Linux." Right now it's just usable and stable. Their previous milestones are also interesting. v0.6.0 had several RCs before being released as stable.[2]

[1]: https://github.com/zfsonlinux/zfs/issues/milestones

[2]: https://github.com/zfsonlinux/zfs/issues/milestones?state=cl...

No, I can assure you we definitely do not think it's "pre-release" quality. I work closely with Brian Behlendorf, and I'm sure he would not have made the stable release if he thought it wasn't ready for prime time.

It's a little disheartening to see people pay so much attention to a fictitious number, which really has no regard to the actual state of the code/project. Whether v1.0 or v0.1 was used, the quality of the release would not differ any.

We can't easily know the state of the code, but we can easily read the number that you have assigned it. That number should reflect your own assessment of the code's stability.

It's also a matter of how much cognitive load you want to impose on your users. We use dozens of different open-source packages, each with its own version number. Can you really expect us to keep track of all of them? "Which Linux ZFS release was the first stable one?" "Uh, I think it was 0.6 something, or maybe 0.5.1?"

Don't do that to your users. Call it 1.0.0.

> We can't easily know the state of the code, but we can easily read the number that you have assigned it.

Exactly. To those who have been using 0.6.0_rc*, version 0.6.1 conveys the meaning "go ahead and upgrade, here's the stable release with no backward compatibility issue".

You will get your version 1.0 after several pre-1.0 RC releases. That's the proper release management.

> It's a little disheartening to see people pay so much attention to a fictitious number

So, rather than adapt to people everywhere and call it v1.0, let's be a little sad that calling it 0.6.1 doesn't have the same effect, and then leave it the way that it is, and complain about how irrational people are?


I know, let's make a new release of v0.6.1 called...wait for it...v1.0. And then we can move on to real issues, not fake controversy.

The adage "pick your battles" comes to mind....

My thoughts exactly. If you're going to declare something production ready, but don't have the nuts to call it 1.0, I don't believe you.

Maybe 0.6.1 just has a nice ring that 1.0 doesn't quite get.

Never mind all the promised wondrous features of ZFS, what are the recovery / fsck tools like? If the data recovery tools aren't mature, reliable and useful, then I'd advise anyone to stay away from a filesystem, no matter how world-ready the underlying FS code is.

I'm probably biased; but I got screwed using ReiserFS. It was fast and great, but one day something went wrong and then I found that the reiserfsck program was practically useless. Very little work had been done on it, so any small inconsistencies in the FS meant your data was toast.

Sure, keep backups and all that, but just be aware how important a good fsck tool is.

My understanding is that ZFS does a lot of healing itself on the fly. When you access a block, the block is checked against a checksum stored elsewhere on disk. If the block is bad, it is restored from a replica (so it's advisable to have 2 disks) while the original is healed.

On writes ZFS also checks that each written block is valid before committing the new block to the filesystem. In the event of a bad write the new block is not committed and the old block remains; new blocks are originally written to a separate area of the disk.
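The read path described above (check the block against a checksum stored elsewhere, fall back to a replica, repair the bad copy) can be sketched roughly as follows. This is a toy model, not ZFS code: each "disk" is a plain dict, SHA-256 stands in for whatever checksum the pool is configured with, and every name is hypothetical.

```python
import hashlib
from typing import Dict, List


def checksum(data: bytes) -> bytes:
    # Stand-in checksum; assumed to be stored apart from the block itself.
    return hashlib.sha256(data).digest()


def read_with_self_heal(replicas: List[Dict[int, bytes]], block_id: int,
                        expected: bytes) -> bytes:
    """Return a copy of the block that matches `expected`, repairing bad copies."""
    good = None
    for disk in replicas:
        data = disk.get(block_id)
        if data is not None and checksum(data) == expected:
            good = data
            break
    if good is None:
        # No replica matches: the error is reported, never silently returned.
        raise IOError(f"block {block_id}: no replica matches its checksum")
    # Overwrite every missing or mismatching copy with the verified one.
    for disk in replicas:
        if disk.get(block_id) != good:
            disk[block_id] = good
    return good
```

With a single disk the same check still detects the bad block, but there is nothing to heal from, which is why the comment above advises having at least two disks.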

Yes, and those are good features. But there's still plenty of room for errors. Bugs in the code, bad interactions with hardware, etc. Those are areas where ZFS is likely to be just as vulnerable as other file systems. More so, in fact, given the code size/complexity and its (relative) immaturity.

I suspect (but I have no data!) that the majority of times that users have file system corruption and need to run fsck is due to bad software behaviour, and not failing disks.

ZFS itself has been around for a while, was originally part of Solaris IIRC. It's using it on Linux that's new.

If you have a bad software write, ZFS should be able to detect that in the same way it would a hardware error since the checksums won't match. Also worth noting that the checksums are themselves checksummed in a merkle tree. It also has a tool called "scrub" which can check data integrity without unmounting the disk.
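To illustrate what "the checksums are themselves checksummed in a merkle tree" buys you, here is a minimal sketch. It assumes a plain binary hash tree over a list of data blocks with SHA-256, which is a simplification: in ZFS the checksums live in the block pointers of the parent blocks rather than in a separate binary tree. The function names are invented for the example.

```python
import hashlib
from typing import List


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(blocks: List[bytes]) -> bytes:
    """Fold per-block hashes pairwise until a single root hash remains."""
    if not blocks:
        raise ValueError("need at least one block")
    level = [_h(b) for b in blocks]
    while len(level) > 1:
        if len(level) % 2:
            # Odd count: pair the last hash with itself (a common convention).
            level.append(level[-1])
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


if __name__ == "__main__":
    blocks = [b"block-0", b"block-1", b"block-2", b"block-3"]
    root = merkle_root(blocks)
    # Flip one byte in one block: the root no longer matches.
    blocks[2] = b"block-X"
    print(merkle_root(blocks) != root)
```

Because every parent hash covers its children's hashes, a corrupted block or a corrupted checksum both show up as a mismatch somewhere on the path to the root, so trusting the root is enough to validate everything below it.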

Of course if your filesystem has buggy code for checksumming/repair then you are boned whatever.

Disks lie. A lot. Ask anybody who has the misfortune to work on a filesystem for any length of time.

One thing to keep in mind is that ZFS is more complex than your typical filesystem, but it's also one of the few filesystems that's effectively a cryptographically self-validating tree. You're a lot more likely to have a disk lie to you than ZFS flake out.

And the whole point of RAID is to combat that... But since there are no decent recovery tools for ZFS, your entire FS might be toast if the wrong bits are turned.

Yes ZFS is quite resilient against corruption but when it bites, which has happened plenty of times in the past, you are truly screwed.

The philosophy behind ZFS and file system recovery has always been that it is too inconvenient. It is much better to restore from tape backups so why make any effort on recovery?

Thing is, most consumers don't have tape backups...

Because of that, ZFS is in the very vast majority of cases a bad choice for home users, especially under linux where it can hardly be considered mature.

BTRFS just isn't ready yet.

As a home user you can't even get a decent COW FS with decent snapshotting or checksums. That is the sad state we are in and will be in for many years to come.

"And the whole point of RAID is to combat that..."

Actually the whole point of RAID is to combat a MISSING drive, not a corrupt drive. Parity works when I'm lacking one bit ... not when any of my bits might be lying.

Half of the reason for ZFS is to protect against the scenarios where a drive's internal checksum should catch a data error, but the drive returns bogus data and says 'oh, it's good'. RAID5 can't catch that (mathematically impossible)... RAID6 can, but most implementations don't check because it's expensive (extra IO and calcs).
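The difference between a missing drive and a lying drive can be shown with plain XOR parity. This is a toy sketch of single-parity behaviour, not any particular RAID implementation, and the helper names are invented. One reading of "mathematically impossible" is that a full-stripe check can at most show the stripe is inconsistent, not which member returned bad data.

```python
from functools import reduce
from typing import List


def xor_blocks(blocks: List[bytes]) -> bytes:
    # Byte-wise XOR across equally sized blocks.
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def reconstruct_missing(surviving: List[bytes], parity: bytes) -> bytes:
    # A known-missing block is the XOR of everything that is left.
    return xor_blocks(surviving + [parity])


def stripe_is_consistent(data: List[bytes], parity: bytes) -> bool:
    return xor_blocks(data) == parity


def candidate_repairs(data: List[bytes], parity: bytes) -> List[List[bytes]]:
    """Rebuild the stripe once per data drive, treating that drive as the bad one."""
    repairs = []
    for i in range(len(data)):
        others = data[:i] + data[i + 1:]
        repairs.append(data[:i] + [reconstruct_missing(others, parity)] + data[i + 1:])
    return repairs
```

If one block is silently corrupted, `stripe_is_consistent` returns False, but every entry from `candidate_repairs` is self-consistent with the parity (and the parity itself might be the bad member), so parity alone cannot pick the right one. A per-block checksum stored elsewhere is what breaks the tie.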

This is the reason most of your high end SAN vendors don't just deploy RAID5/6/10 but also checksum the data themselves under the covers. They don't trust even high end SAS/FC drives.

Please tell me how my entire FS will be toast if the wrong bits are turned, especially when raidz2 is really the only valid solution that should be used.

I've been using ZFS for years now, I've had disks go bad, I've replaced them. ZFS has protected me from data loss better than anything else. In the past when I've had the disk lie to me, ZFS was able to tell me what files were corrupted and thus had invalid data.

I'm going to need to throw a [citation needed] on your post.

If this is news to you you really haven't been paying attention...


I have been paying attention, and in a lot of cases where there is corruption a fsck wouldn't help anyway. I've had corruption on ZFS, UFS, ext3, ext4, HFS, HFS+ and the only one where I have lost almost no data or no data at all is ZFS (the only time I've lost data on ZFS is when it warned me that the disk was silently giving me bad data back, which it caught because of its end-to-end checksumming, and I didn't have a mirror).

UFS, ext3, ext4 would have continued on giving me bad data back. Are there going to be cases whereby a file system fails completely? Yes, but writing tools to attempt to fix those issues that happen once in a million times is almost impossible because of the difficulty of replicating those failure scenarios and then writing code to deal with the disks having that corruption.

I still trust ZFS with more data than any other file system.

As I've asked elsewhere in this thread: given the way ZFS works, you'll have to elaborate on what an fsck would do that ZFS doesn't automatically do already.

I'm getting "we need an fsck that does magic" vibes here.

I don't quite understand what corruption scenarios fsck on Ext4 would save you from that ZFS wouldn't?

Most data on consumer's computers is 1 disk failure away from loss anyway, hence the popularity of cloud syncing services.

There is no fsck for ZFS. Because the ZFS developers at Sun/Oracle believe that ZFS is robust enough to never corrupt (untrue) and in case it does you should have a backup.

I got screwed by ZFS, and I was able to recover the filesystem by learning its on-disk structure and fixing it directly on-disk. Something that a decent fsck tool could do. But no, ZFS never breaks. Go figure.

fsck doesn't make sense for ZFS. If you want to check for fs integrity, you can scrub the pool. If you want to get the fs back to a state before there was corruption, you use the transaction history. If you want to import degraded arrays, there are commands for that. There's no magic in ZFS. What exactly is missing that you want to see?

> If you want to get the fs back to a state before there was corruption, you use the transaction history.

How? ZFS refuses to mount/import a corrupt filesystem.

In my case, the latest superblock (or some internal bookkeeping structures that the superblock points to) was corrupted in a way that ZFS completely gave up. So what I ended up doing is to manually invalidate the latest superblocks until ZFS was able to mount the filesystem. I may have lost the changes written in the few minutes before the corruption, but that's still way better than losing everything.

Before I decided to poke around the raw disk with dd (to invalidate the superblocks by overwriting them with zeros), I googled around and I wasn't the only one with that problem. One other guy asked on the ZFS mailing list and the response was along the lines of 'Your filesystem is FUBAR, restore from backup'.

You may argue that ZFS itself should do what I did (dropping a few minutes of transaction history and roll back to the latest non-corrupt state) upon mounting. Fair enough. I don't really care if that functionality is built into ZFS or an external fsck binary. The fact is that ZFS wasn't able to recover from the situation. One that I would argue is very trivial to recover from if you know the internal ZFS on-disk structure.

"How? ZFS refuses to mount/import a corrupt filesystem."

zpool clear -F $POOLNAME


So, your argument then is that this is something a fsck would normally do?

You may have found a corner case in the fs and perhaps this sort of thing should be added to the import command, but I'm not sure simply having an "fsck" fixes this. I just think the import command appears to have a bug/needs a feature.

How's zdb different from fsck? Nobody cares whether the tool is called fsck, scrub or zdb. You just need to have a tool which can recover from corruption. And for very long the ZFS developers didn't think the end-users needed it. Maybe zdb was there all the time. But it wasn't advertised or documented. People were told their filesystem is FUBAR when it wasn't.

"How's zdb different from fsck?"

That's answered very well in the article connected to this other currently active HN discussion:


In short, fsck simply checks to see that the metadata makes sense, and that all inodes belong to files, and that all files belong to directories, and if it finds any that don't, it attaches them to a file with a number for a name in lost+found.

It's pretty crude compared to a filesystem debugger.

If you want to compare apples-to-apples, you'd be better off asking how zdb compares to debugfs (for ext2/3/4) as both are filesystem debuggers.

You could also ask "How's a ZFS scrub different from fsck?" and the answer to that would be: a scrub (zpool scrub) checks every bit of data and metadata against saved checksums to ensure integrity of everything on-disk. In comparison, fsck cannot detect data corruption at all, and can only detect metadata corruption when an inode points to an illegal disk region (for example).

Even that comparison shows fsck is crude when compared to scrub.

The tool to recover from corruption is a rollback. Usage: zpool clear [-nF] <pool> [device]

My employer has thousands of large machines with ZFS on them. We've seen corruption happen once, five years ago.

Maybe we've just been really lucky.

It's more likely you are underestimating your corruption because you are not monitoring it thoroughly. If you have thousands of machines, running for five years, you're gonna see corruption occasionally. No FS can truly protect you (but a well designed FS will reduce the probability that a corruption will become a user-visible event).

Of course there are problems with the disks -- corrupt sectors, phantom writes, misdirected reads, etc. The point is that ZFS handles those problems for us.

I won't discuss the nature of the business, but it's unlikely that actual corruption that isn't automatically repaired would go undetected for any amount of time.

The idea that you can have corruption and nothing to fix it with already sounds scary enough to me.

Especially for those of us that don't have thousands of machines and can therefore be badly screwed by one issue.

Given the way ZFS works, you'll have to elaborate on what an fsck would do that ZFS doesn't automatically do already.

What I find scary is silently corrupt data, something which is a problem for most other filesystems. I've seen ZFS catch and fix that error orders of magnitude more often than I've seen ZFS flake out. If we're talking risk analysis, I feel you're worried about a mouse in the corner, while a starved tiger is hungrily licking its chops while staring at you.

Just look at the recent KDE git disaster. A lot of things went wrong there, but fundamentally the issue was ext4 silently returning bad data.

The thing about fsck-like recovery tools is you need to have a failure mode in mind when you write them. ZFS can fix most of those types of errors thanks to the checksums and on-disk redundancy on the fly. Or at least tell you that something is now going wrong and which files are affected.

I think it's unfortunate that Sun called it 'scrub' instead of 'fsck', because the two are largely equivalent in functionality (to the extent supported by the filesystem). Both fix filesystem corruption if they can. The difference is that scrub can be run while the filesystem is mounted, whereas the traditional fsck must be run when the filesystem is offline.

However, scrub does not make ZFS perfect. There are still ways the filesystem can become corrupted without scrub noticing, or corrupted in a way that ZFS fails to recover from, even though recovering would be dead simple.

The attitude of the ZFS developers only works in the enterprise market: Your data is safe (checksummed, scrubbed, replicated using RAID-Z), but if a bit flips in the superblock just restore from your backup, because we won't provide tools to recover from that.

While I concur with most of your points, I must point out that there are four uberblocks, not one; a flipped bit will not impair the pool.

Let's argue that bits flip in all four uberblocks though, then ZFS will use the previous valid commit, which also has four uberblocks (ZFS is a CoW FS). And so on backwards for quite a few transactions. All these uberblocks are spread out at the beginning and end of each disk.

ZFS has a policy that the more important the data (and the top of the tree is most important), the more copies there are, although a user can also define how many duplicates there should be at the leaves of the tree.

Basically, you'd need a very trashed disk to render an entire pool dead. You're not going to recover from that, regardless of filesystem.

Well, ZFS clearly didn't use the extra copies, nor did it try to use the previous uberblocks. Otherwise it would've been able to mount the filesystem, don't you think? And I disagree that the disk was very trashed, if all it took was to invalidate a few uberblocks to get the filesystem mounted again.

Did you use zpool import -F (are you sure)? I'm surprised that'd be necessary though, given my experiences. How long ago was this?

As others have stated, if the current state has corruption, you roll it back. As I posted in response to another post above, this is as easy as "zpool clear -F $POOLNAME".

The fact that the vast majority of other filesystems have no way to detect silent corruption of data (only metadata inconsistencies) is far more frightening to me.

Here is a nice article written by someone who discovered just how unreliable disks are, after switching to ZFS (because other filesystems couldn't detect the corruption). Quite an eye-opener. http://www.oracle.com/technetwork/systems/opensolaris/data-r...

If you use Chrome and are getting the same error as I am, here's the Google cache link (Firefox will load it): http://webcache.googleusercontent.com/search?q=cache:caEwhGD...

The article references sources of studies of hard disk corruption, if you'd want something with even more detail and statistics:



Ummm... last I used it, "zpool scrub tank" used to do the trick. Has that changed, or is it unavailable on Linux?

Nothing inspires confidence like a filesystem with a 0.6.1 version number.

zpool import -fFX, but sssh, don't tell anyone :)

Actually. It does.

Unless you assume that all hardware is perfect and that no bugs exist in ZFS (there have been plenty of those in the past that just make the whole FS completely worthless).

But no, ZFS doesn't need fsck. Because, we don't want it to need one. Oh, and we don't want to spend the resources developing one.

The only reason ZFS doesn't have a fsck tool is because in the enterprise world it doesn't need one. When it is needed you just restore from tape instead. It is that simple.

Actually. It. Does. Not.

See Checking ZFS File System Integrity[1] in the ZFS administration guide.

[1] http://docs.oracle.com/cd/E19082-01/817-2271/gbbwa/

A useful link. But it is unclear just how much a 'zpool scrub' checks. Sure, it is double checking the block checksums to ensure that your file contents haven't become corrupted. But how much checking does it do to the ZFS structures themselves?

At first glance, it certainly seems to depend upon some high-level ZFS data in order to start. A command like 'zpool scrub pool-name' still needs to navigate the ZFS pool data on disk in order to locate the named pool.

But how much checking does it do to the ZFS structures themselves?

They're validated as well. Everything has a cryptographic checksum that's stored in the parent block, starting at the data and working all the way up to the top of the tree.
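A minimal sketch of that idea, in Python, assuming a toy tree rather than the real on-disk format: each parent keeps the checksums of its children, so a child is always validated against a value stored outside itself. The Block class, verify function and the path strings are hypothetical names for illustration, and SHA-256 is just a stand-in checksum.

```python
import hashlib

def checksum(data: bytes) -> bytes:
    # SHA-256 is an assumption, standing in for the filesystem's checksum.
    return hashlib.sha256(data).digest()

class Block:
    """A block with a payload and pointers to children.

    Each pointer carries the child's checksum, recorded when the parent
    was written, so the child is checked against data held by its parent.
    """
    def __init__(self, payload: bytes, children=()):
        self.payload = payload
        self.children = list(children)
        self.child_sums = [checksum(c.serialize()) for c in self.children]

    def serialize(self) -> bytes:
        return self.payload + b"".join(self.child_sums)

def verify(block, expected_sum, path="root"):
    """Walk the tree top-down; return the paths of blocks that fail."""
    bad = []
    if checksum(block.serialize()) != expected_sum:
        bad.append(path)
    for i, (child, s) in enumerate(zip(block.children, block.child_sums)):
        bad += verify(child, s, f"{path}/{i}")
    return bad

leaf_a = Block(b"file data A")
leaf_b = Block(b"file data B")
meta = Block(b"directory", [leaf_a, leaf_b])
root = Block(b"top", [meta])
root_sum = checksum(root.serialize())  # kept at the very top of the tree

clean = verify(root, root_sum)
leaf_b.payload = b"file data X"        # simulate silent corruption of one leaf
damaged = verify(root, root_sum)
```

The same walk covers metadata and data alike: a corrupted directory block would fail against the checksum held by the root in exactly the way the leaf fails here.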

Furthermore, the higher up the tree you go, the more redundant copies there are. The top of the tree has four copies, if I recall. This ignores support for mirroring and striping, which further improves data redundancy.

But wait, there's more! Those four blocks aren't overwritten. ZFS is a copy-on-write filesystem (data and metadata) which behaves a lot like the persistent data structures that Clojure hackers are so fond of, so if the newest writes do not validate, it'll roll back to the newest valid commit of the tree.
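As a rough illustration of "roll back to the newest valid commit", here is a toy sketch in Python. It assumes a simplified record with a transaction number, a root pointer and a checksum; the Uberblock fields, make, is_valid and pick_active are hypothetical names, and the real import logic checks far more than this.

```python
import hashlib
from dataclasses import dataclass

@dataclass
class Uberblock:
    txg: int         # transaction group number, higher means newer
    root: bytes      # stand-in for the pointer to the top of the tree
    checksum: bytes  # checksum recorded when this uberblock was written

def digest(txg: int, root: bytes) -> bytes:
    # SHA-256 is an assumption, standing in for the real checksum.
    return hashlib.sha256(txg.to_bytes(8, "big") + root).digest()

def make(txg: int, root: bytes) -> Uberblock:
    return Uberblock(txg, root, digest(txg, root))

def is_valid(ub: Uberblock) -> bool:
    return digest(ub.txg, ub.root) == ub.checksum

def pick_active(candidates):
    """Return the newest uberblock whose checksum validates, or None."""
    valid = [ub for ub in candidates if is_valid(ub)]
    return max(valid, key=lambda ub: ub.txg, default=None)

# Older commits are still on disk because copy-on-write never overwrites them.
history = [make(100, b"tree@100"), make(101, b"tree@101"), make(102, b"tree@102")]
history[2].root = b"garbage"  # the newest write got damaged

active = pick_active(history)
```

Because the older trees were never overwritten, falling back to transaction 101 gives a consistent view of the pool as of that commit; only the writes from the damaged transaction are lost.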

That's a nice way of saying that you're guessing given your experience with other filesystems. ZFS was a genuinely revolutionary filesystem, and doesn't behave like other filesystems, the sole OSS exception being BTRFS. Read up on it a bit, you'll find something interesting. :)

The pool data is stored on disk, but when a pool is already available on the system that information is cached in memory and on disk in a zpool.cache file. That data may be stale or wrong; in that case you can export the zpool and re-import it (or if it is missing, re-import with -f to force it).

It then reads the data about what zpool a disk belongs to from disk. That data is itself stored in multiple locations so that it is unlikely that all of those are corrupted. After importing the disks you can run scrub.

If the pool metadata itself is corrupted, generally you can roll back to a previous point in time when the data was not corrupted.

This document: http://docs.oracle.com/cd/E19082-01/817-2271/6mhupg6qg/index... describes fairly well what all the options along the way are.

None of the failure modes described would be any better if there was a fsck tool available... in all file systems it is going to cause data loss.

I've used both, and prefer btrfs in terms of implementation. For some reason the usage, maintenance and control feel more Linux to me. ZFS feels more Unix, which makes sense given the lineage.

The RAID arguments against btrfs are silly. RAID 5 is quite unnecessary, particularly with these kinds of file systems. I can only assume people making this argument have never maintained file systems of this type and clearly don't understand how to implement it correctly.

Both ZFS and btrfs are the future though. If admins aren't seriously considering one of the two for their server infrastructure then they probably shouldn't be admins.

Here is a study of the evolution of Linux File Systems (post from 5 days ago): https://news.ycombinator.com/item?id=5431413

While it does NOT contain ZFS it is a great read.

Why is Oracle investing in Btrfs (GPL), rather than dual-licensing ZFS as CDDL/GPL?

They started it before they acquired Sun.

Well, that kind of makes sense in a "we're incredibly lazy" way. But why not license ZFS as GPL, and see if that project makes better headway?

No real cost there, except some possible confusion. Move the developers to whichever project seems to be winning.

What about the license compatibility?

The thing with both GPL and CDDL is that they both put restrictions on how you are allowed to distribute the program/source. i.e. you can't legally distribute code that has both GPL and CDDL parts.

However, neither license puts restrictions on (end-user) use, which means that as long as you distribute the bits separately, running them together is perfectly fine. Which means (as their FAQ states) that while it can't ship as part of the kernel, there's no reason end-users wouldn't be allowed to load a separately distributed module into their kernel.

It looks like very similar issues on the use end to the nvidia binary driver. Can't ship it in Linux but nothing (legally) prevents end users from incorporating it.

My concern with the license isn't legal so much as pain of administration. Maybe I'm just old-fashioned or lazy, but I prefer not to have to check my kernel repo against my kernel module repo to make sure they play nice. It is so much easier when you can just get it as part of the kernel.

That is the reason why I avoid the Nvidia binary driver as well.

If you're on Ubuntu, there's an apt package available in the ZFS on Linux project's PPA. It downloads the source, builds it against your installed kernel and installs it. It rebuilds on any kernel update. Very handy.


Still unsolved. I guess they just mean it's technically ready for wide scale deployment.

I misread this article title as meaning "ZFS has been backported to Linux 0.6.1". :)

What happened - did NetApp waive their patents? Am I missing something? Wasn't the consensus that ZFS is encumbered, and Oracle might or might not defend it, but who knows?

If I remember correctly, Sun responded by throwing even more of their own patents at them and eventually they just settled.

I lost interest in ZFS when development pretty much stopped. I am looking forward to DragonFly's HAMMER2.
