Meanwhile, ZFS has over 10 years of history, stably implements most of Btrfs's _planned_ features, and has battle-tested deployments. Granted, the SPL for Linux adds variables, but there are some big users of this particular project.
So I approach Btrfs with the exact opposite mindset: it's guilty until proven innocent, despite some FUD from the Linux camp early on that has settled down a bit since Oracle now has its hands on both.
It's been a multi-year mistake for us and we're busy changing these servers back to nice simple XFS volumes.
Honestly the same can be said about ZFS on Linux (and, quite frankly, even ZFS on Solaris). When it becomes "just the standard filesystem that everyone knows works", then people will stop yelling about the software (though the license mess will be with us forever, sadly).
The only time I've ever had a problem with btrfs was after a power failure, but btrfsck worked to fix it (although since it isn't in fsck.btrfs, it doesn't automatically run on boot, so I had to use the repair partition).
I've been using it for a year. I have around 8 hard disks in my machine, running half a dozen filesystems from fat32 to ntfs to ext4 to xfs, but I like btrfs the most of the native linux FSes atm.
It may not have RAID 5/6, or time-proven stability, but it kicks ass and over time it will kick ZFS's ass not on technical grounds, but simply because ZFS has too much baggage and not enough community support.
Run began: Fri Mar 29 08:32:24 2013
Include fsync in write timing
Include close in write timing
Command line used: iozone -ec -r 32 -s 16380m -l 2 -i 0 -i 1 -i 8
Throughput test with 2 processes
Each process writes a 16773120 Kbyte file in 32 Kbyte records
Parent sees throughput for 2 initial writers = 996073.19 KB/sec
Avg throughput per process = 499898.70 KB/sec
Parent sees throughput for 2 rewriters = 216128.36 KB/sec
Avg throughput per process = 108075.14 KB/sec
Parent sees throughput for 2 readers = 1232245.62 KB/sec
Avg throughput per process = 616353.25 KB/sec
Parent sees throughput for 2 re-readers = 1240045.15 KB/sec
Avg throughput per process = 620251.62 KB/sec
Note also that rewriting drops to 108MB/sec, which is pretty unimpressive.
I'm wondering which volume manager and fs under linux you'd expect better performance from in this configuration?
Yes, it will use whatever memory you allow it to, but this is purely dynamic just as caching is in Linux. If you're not talking about the ARC, but about deduplication, of course it uses more memory - how would it not?
"IO performance is generally not great (lower than with "simpler" FS)"
This is sounding troll-ish, as that statement equates to: A filesystem that checksums all data and metadata and performs copy-on-write to protect the integrity of the on-disk state at all times is slower than a filesystem that does neither.
Well, of course.
But do those "simpler FS" have the ability to massively negate that effect by use of an SSD for the ZIL and L2ARC? There have been many articles showing higher throughput with large, slow 5400RPM drives combined with an SSD ZIL & L2ARC massively outperforming much faster enterprise drives.
"managing your pools isn't that easy once you get serious about it"
I'm fairly stunned by this statement, as I've yet to see an easier, more elegant solution for such a task. Before ZFS, I liked VxVM with VxFS, but I now consider it obsolete. Linux's LVM is downright painful in comparison. I've yet to play with btrfs, so I'll bite my tongue on what I've read so far on it.
The deep integration of volume and pool management directly with the filesystem, essentially making them one and the same, is simply beautiful. Having these things separate (md, LVM, fs) after years of using ZFS seems so archaic and awkward to me.
Disclosure: 100% of my ZFS experience has been on Solaris (10-11.1) and OpenSolaris/Nevada. I've not tried it on Linux, yet.
Enormous memory consumption = deduplication? Otherwise it doesn't seem enormous to me.
Sure ZFS made a lot of departures from the standard toolset with all its own stuff... but that stuff is better once you learn it.
I can easily get 600MB+/sec reads and writes on my pools, which are pretty commodity stuff. One of my pools is 95% full, which is high, and I just tested it and got ~100MB/sec write and ~150MB/sec read... good enough for me.
(Oh, and I just checked, it was scrubbing)
Here are my little ZFS peeves:
1. you can't grow raidZ vdevs. Reduces flexibility.
2. I wish there was a version of scrub that rebalanced data across vdevs.
By comparison BTRFS is nice, but nowhere close to ZFS. Last I saw, a BTRFS scrub consisted of cat'ing every file to /dev/null... which doesn't even cover the FS structures and metadata.
But all that aside ZFS was great when it was first introduced a few years back, especially with Sun backing OpenSolaris, but we've been jerked around for a few years.
I ran a few ZFS pools on Linux a few builds ago and it was quite stable then. For something like a home storage system, where I do not care about all the shortfalls vs. BTRFS, I would reach for ZFS just because it's what I know. For something that needs to be enterprise, I'd hire someone who knows much more about it than me :).
Oracle's ZFS (and Solaris) fork is a deadish end. Most of the interesting action the past couple years has been happening among a collection of illumos (including critical ZFS members formerly from Sun), FreeBSD, OSX and even Linux hackers.
I don't follow their activities closely, except to observe that they've been anything but idle. A lot of work has happened in the meantime, and continues. Hopefully one shows up to comment about it.
For OSS ZFS the only upstream is that 'collection of illumos' hackers.
In my opinion there were only a handful of reasons to use Solaris: Oracle, SPARC, ZFS and dtrace. The only reason to use it nowadays is because you can't scale x86 and need big iron for Oracle et al.
My earlier experiences with Solaris and OpenSolaris were mixed. SmartOS has been uniformly positive. Perhaps give it a try, although be aware that it (like everything illumos) is picky about the hardware.
My own anecdotal evidence is that btrfs is not ready for prime time. I ran a couple btrfs file systems from kernel about 2.6.30 to 3.4 and finally determined that my system was so slow because of btrfs. I switched to xfs and an operation that used to take weeks (literally) now takes hours.
Perhaps it's gotten significantly better in the last few revisions but I'd use caution if you want to use btrfs and still have good performance.
Don't use RAID-Z if you want performance with ZFS, however, and you'll need to set it up twice before you get the hang of performance tuning it.
The fact that there is a ZFS implementation (with so many of the original ZFS engineers behind it) still being developed in the open and used by multiple operating systems, but especially the Linux juggernaut, shows that Oracle really doesn't have any benefit to keeping it closed any longer.
It has forked.
We now have the Oracle ZFS with features and functionality that is not in the open source variant, and the open source ZFS that is apparently adding features and commands that are not in the closed source ZFS.
A huge selling point for the open source implementation is the fact that if you decide to change vendors or OSes, you can easily do so without having a huge data migration (zpool export on the old, zpool import on the new).
Except for Solaris 11+. You can't go back and forth between them, so once you're on Solaris 11, you're stuck.
Yes, this is most definitely FUD, but I can easily see this being used in the not-so-distant future once Linux vendors start supporting/advertising it for themselves.
I think that if Oracle were to open (and keep open) the newer releases (even if a few releases behind, like they were originally claiming they would), it would eliminate that argument completely.
Personally, I'd be absolutely thrilled with a cross-platform on-disk ZFS (I triple-boot Linux/OSX/Solaris on my notebook).
Professionally, I'd love to see a cross-platform on-disk ZFS simply to be able to throw the appropriate OS behind the data.
It's possible with dm-crypt and LUKS, but you void most of the reasons for using ZFS. eCryptfs is also a possibility, but it's both much less secure (an attacker can run ls) and also drops a few ZFS features like dedup.
Meanwhile, Solaris 11 has FDE that actually works, and there seems to be no serious FOSS effort either to reverse engineer their work, or to design something better.
In 2013, no software developer should be storing things without FDE unless they have unusual physical security measures. That makes FOSS ZFS a non-starter for most of the people who would use it.
Where do you store your encryption keys? If they're on a removable USB device (for example), you'd have to contact the datacenter personnel to plug them in in the event you had to reboot (which happens from time to time if you perform OS SRU upgrades). If the USB device is left in the server and someone gains privileged access to the machine, they've got the key as well as the data. If the USB device is not in the server and someone gains privileged access to the machine, they still have access to the data.
The only time disk encryption is valuable is when the machine is off or the disks are being transported.
I'm thinking about the case where you have a computer in the home or office environment, e.g. the burglary attack vector. It's not just notebooks that get stolen.
I'm also considering the corporate or government espionage vector, where you have a reasonably skilled on-site attacker trying to read or modify the disk contents. In this case you are typing the keys in on boot, and you either ignore the RAM extraction attacks (which are difficult to execute in practice for non-academic attackers) or you can mitigate them by storing in L1 cache and so on.
Datacenters have unusually high physical security, and in addition I don't generally store highly sensitive data there (my work doesn't involve PCI compliance etc.) Whereas I do have PCI-level data about myself on my own computers.
Very good point. I guess I've been working in environments with datacenters too long. :)
"I'm also considering the corporate or government espionage vector, where you have a reasonably skilled on-site attacker trying to read or modify the disk contents."
Again though, that only protects you if the disks are out or the machine is down. If you have someone with that level of skill within your organization, they'll likely gain access to the running computer where the data has already been made accessible in decrypted form.
It's too bad; I really dislike Oracle, but it's the best tool for the job currently.
And for the person above, as long as you restrict the pool version to 28 when creating under solaris 11 you can still migrate between solaris/linux/bsd. Obviously you lose out on native FDE though.
It's a little disheartening to see people pay so much attention to a fictitious number, which really has no bearing on the actual state of the code/project. Whether v1.0 or v0.1 were used, the quality of the release would not differ at all.
It's also a matter of how much cognitive load you want to impose on your users. We use dozens of different open-source packages, each with its own version number. Can you really expect us to keep track of all of them? "Which Linux ZFS release was the first stable one?" "Uh, I think it was 0.6 something, or maybe 0.5.1?"
Don't do that to your users. Call it 1.0.0.
Exactly. To those who have been using 0.6.0_rc*, version 0.6.1 conveys the meaning "go ahead and upgrade, here's the stable release with no backward compatibility issue".
You will get your version 1.0 after several pre-1.0 RC releases. That's the proper release management.
So, rather than adapt to people everywhere and call it v1.0, let's be a little sad that calling it 0.6.1 doesn't have the same effect, and then leave it the way that it is, and complain about how irrational people are?
I know, let's make a new release of v0.6.1 called...wait for it...v1.0. And then we can move on to real issues, not fake controversy.
The adage "pick your battles" comes to mind....
Maybe 0.6.1 just has a nice ring that 1.0 doesn't quite get.
I'm probably biased; but I got screwed using ReiserFS. It was fast and great, but one day something went wrong and then I found that the reiserfsck program was practically useless. Very little work had been done on it, so any small inconsistencies in the FS meant your data was toast.
Sure, keep backups and all that, but just be aware how important a good fsck tool is.
On writes, ZFS also checks that each written block is valid before committing it to the filesystem. In the event of a bad write, the new block is not committed and the old block remains; new blocks are always written to a separate area of the disk first.
I suspect (but I have no data!) that the majority of times that users have file system corruption and need to run fsck is due to bad software behaviour, and not failing disks.
If you have a bad software write, ZFS should be able to detect that in the same way it would a hardware error since the checksums won't match. Also worth noting that the checksums are themselves checksummed in a merkle tree. It also has a tool called "scrub" which can check data integrity without unmounting the disk.
Of course if your filesystem has buggy code for checksumming/repair then you are boned whatever.
One thing to keep in mind is that ZFS is more complex than your typical filesystem, but it's also one of the few filesystems that's effectively a cryptographically self-validating tree. You're a lot more likely to have a disk lie to you than ZFS flake out.
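To make the "self-validating tree" idea concrete, here is a toy Python sketch (my own illustration, not ZFS's actual on-disk format): each parent block stores the checksums of its children, so any silent corruption below is detected when you walk down from a trusted root.

```python
import hashlib

# Toy Merkle-style block tree (NOT ZFS's real layout): the parent holds
# checksums of its children, so a lying disk is caught on read.

def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def build_parent(children):
    return {"child_sums": [checksum(c) for c in children],
            "children": children}

def verify(parent) -> bool:
    # Every child's recomputed checksum must match the one the parent stored.
    return all(checksum(c) == s
               for c, s in zip(parent["children"], parent["child_sums"]))

blocks = [b"data block 0", b"data block 1"]
root = build_parent(blocks)
assert verify(root)

# A disk silently flipping a byte is detected, even though the disk's own
# firmware reported the read as successful:
root["children"][1] = b"data block !"
assert not verify(root)
```

In real ZFS the checksums live in the parent's block pointers and the chain runs all the way up to the uberblock, so trusting the root means you can validate everything beneath it.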
Yes ZFS is quite resilient against corruption but when it bites, which has happened plenty of times in the past, you are truly screwed.
The philosophy behind ZFS and file system recovery has always been that it is too inconvenient. It is much better to restore from tape backups so why make any effort on recovery?
Thing is, most consumers don't have tape backups...
Because of that, ZFS is in the very vast majority of cases a bad choice for home users, especially under linux where it can hardly be considered mature.
BTRFS just isn't ready yet.
As a home user you can't even get a decent COW FS with decent snapshotting or checksums. That is the sad state we are in and will be in for many years to come.
Actually the whole point of RAID is to combat a MISSING drive, not a corrupt drive. Parity works when I'm lacking one bit ... not when any of my bits might be lying.
Half of the reason for ZFS is to protect against the scenario where a drive's internal checksum should catch a data error, but the drive returns bogus data and says 'oh, it's good'. RAID5 can't catch that (mathematically impossible)... RAID6 can, but most implementations don't check because it's expensive (extra I/O and calculations).
This is the reason most of your high end SAN vendors don't just deploy RAID5/6/10 but also checksum the data themselves under the covers. They don't trust even high end SAS/FC drives.
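The math here is easy to demonstrate. A toy single-parity sketch (my own illustration, not any vendor's implementation): XOR parity can rebuild one member that is known to be missing, but a bare parity mismatch cannot identify which member silently returned bad data.

```python
from functools import reduce

# Toy RAID-5-style single parity over byte strings.

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

data = [b"AAAA", b"BBBB", b"CCCC"]
parity = reduce(xor, data)

# Case 1: drive 1 is MISSING entirely. We know which one is gone, so we
# can rebuild it from the survivors plus parity.
rebuilt = reduce(xor, [data[0], data[2], parity])
assert rebuilt == data[1]

# Case 2: drive 1 silently LIES. We can tell something is wrong...
corrupt = [data[0], b"BBBX", data[2]]
assert reduce(xor, corrupt) != parity
# ...but the exact same mismatch would appear if drive 0 or drive 2 had
# lied instead, so parity alone can neither locate nor fix the bad member.
```

Per-block checksums (as in ZFS) resolve the ambiguity: the checksum identifies which copy is bad, and redundancy supplies a good copy to repair it with.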
I've been using ZFS for years now, i've had disks go bad, I've replaced them. ZFS has protected me from data loss better than anything else. In the past when I've had the disk lie to me, ZFS was able to tell me what files were corrupted and thus had invalid data.
I'm going to need to throw a [citation needed] on your post.
UFS, ext3, ext4 would have continued giving me bad data back. Are there going to be cases where a file system fails completely? Yes, but writing tools to fix issues that happen one time in a million is almost impossible, because of the difficulty of replicating those failure scenarios and then writing code to deal with that corruption.
I still trust ZFS with more data than any other file system.
I'm getting "we need an fsck that does magic" vibes here.
Most data on consumer's computers is 1 disk failure away from loss anyway, hence the popularity of cloud syncing services.
I got screwed by ZFS, and I was able to recover the filesystem by learning its on-disk structure and fixing it directly on-disk. Something that a decent fsck tool could do. But no, ZFS never breaks. Go figure.
How? ZFS refuses to mount/import a corrupt filesystem.
In my case, the latest superblock (or some internal bookkeeping structures that the superblock points to) was corrupted in a way that made ZFS completely give up. So what I ended up doing was manually invalidating the latest superblocks until ZFS was able to mount the filesystem. I may have lost the changes written in the few minutes before the corruption, but that's still way better than losing everything.
Before I decided to poke around the raw disk with dd (to invalidate the superblocks by overwriting them with zeros), I googled around and I wasn't the only one with that problem. One other guy asked on the ZFS mailing list and the response was along the lines of 'Your filesystem is FUBAR, restore from backup'.
You may argue that ZFS itself should do what I did (dropping a few minutes of transaction history and roll back to the latest non-corrupt state) upon mounting. Fair enough. I don't really care if that functionality is built into ZFS or an external fsck binary. The fact is that ZFS wasn't able to recover from the situation. One that I would argue is very trivial to recover from if you know the internal ZFS on-disk structure.
zpool clear -F $POOLNAME
You may have found a corner case in the fs and perhaps this sort of thing should be added to the import command, but I'm not sure simply having an "fsck" fixes this. I just think the import command appears to have a bug/needs a feature.
That's answered very well in the article connected to this other currently active HN discussion:
In short, fsck simply checks to see that the metadata makes sense, and that all inodes belong to files, and that all files belong to directories, and if it finds any that don't, it attaches them to a file with a number for a name in lost+found.
It's pretty crude compared to a filesystem debugger.
If you want to compare apples-to-apples, you'd be better off asking how zdb compares to debugfs (for ext2/3/4) as both are filesystem debuggers.
You could also ask "How's zfs scrub different from fsck?" and the answer to that would be: zfs scrub checks every bit of data and metadata against saved checksums to ensure integrity of everything on-disk. In comparison, fsck cannot detect data corruption at all, and can only detect metadata corruption when an inode points to an illegal disk region (for example).
Even that comparison shows fsck is crude when compared to scrub.
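A toy sketch of the difference (my own illustration, with made-up structures for both sides): an fsck-style pass only checks that the metadata links make sense, while a scrub-style pass verifies every data block against its stored checksum, which is the only way to catch silent data rot.

```python
import hashlib

# Hypothetical in-memory "filesystem" for illustration only.
fs = {
    "dirs": {"/": ["a.txt"]},                       # directory entries
    "files": {"a.txt": b"hello", "orphan.txt": b"lost"},
    "sums": {"a.txt": hashlib.sha256(b"hello").hexdigest(),
             "orphan.txt": hashlib.sha256(b"lost").hexdigest()},
}

def fsck_style(fs):
    """Metadata-only check: files not reachable from any directory
    (the kind of thing fsck would reattach under lost+found)."""
    reachable = {f for entries in fs["dirs"].values() for f in entries}
    return [f for f in fs["files"] if f not in reachable]

def scrub_style(fs):
    """Data check: files whose contents no longer match their checksum."""
    return [f for f, data in fs["files"].items()
            if hashlib.sha256(data).hexdigest() != fs["sums"][f]]

assert fsck_style(fs) == ["orphan.txt"]  # structural problem: fsck's territory
fs["files"]["a.txt"] = b"hellX"          # silent bit rot in file DATA
assert scrub_style(fs) == ["a.txt"]      # fsck-style checks never notice this
```

The directory structure around a.txt is perfectly consistent after the corruption, which is exactly why a metadata-only fsck passes it while a checksum scrub flags it.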
The tool to recover from corruption is a rollback:
clear [-nF] <pool> [device]
Maybe we've just been really lucky.
I won't discuss the nature of the business, but it's unlikely that actual corruption that isn't automatically repaired would go undetected for any amount of time.
Especially for those of us who don't have thousands of machines and can therefore be badly screwed by one issue.
What I find scary is silently corrupt data, something which is a problem for most other filesystems. I've seen ZFS catch and fix that error orders of magnitude more often than I've seen ZFS flake out. If we're talking risk analysis, I feel you're worried about a mouse in the corner, while a starved tiger is hungrily licking its chops while staring at you.
The thing about fsck-like recovery tools is you need to have a failure mode in mind when you write them. ZFS can fix most of those types of errors thanks to the checksums and on-disk redundancy on the fly. Or at least tell you that something is now going wrong and which files are affected.
However, scrub does not make ZFS perfect. There are still ways the filesystem can become corrupted without scrub noticing, or corrupted in a way that ZFS fails to recover from, even though recovering would be dead simple.
The attitude of the ZFS developers only works in the enterprise market: Your data is safe (checksummed, scrubbed, replicated using RAID-Z), but if a bit flips in the superblock just restore from your backup, because we won't provide tools to recover from that.
But let's say bits flip in all four uberblocks: then ZFS will use the previous valid commit, which also has four uberblocks (ZFS is a CoW FS). And so on backwards for quite a few transactions. All these uberblocks are spread out at the beginning and end of each disk.
ZFS has a policy that the more important the data (and the top of the tree is most important), the more copies there are, although a user can also define how many duplicates there should be at the leaves of the tree.
Basically, you'd need a very trashed disk to render an entire pool dead. You're not going to recover from that, regardless of filesystem.
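The selection logic is roughly "newest commit that still validates". A toy sketch (my own illustration, loosely modeled on the idea rather than ZFS's real label format):

```python
import hashlib

# Toy uberblock selection: keep several root blocks, each tagged with a
# transaction group number (txg) and a checksum; mount from the newest
# one whose checksum still validates.

def checksum(payload: bytes) -> str:
    return hashlib.sha256(payload).hexdigest()

def make_uberblock(txg: int, payload: bytes) -> dict:
    return {"txg": txg, "payload": payload, "sum": checksum(payload)}

def newest_valid(uberblocks):
    for ub in sorted(uberblocks, key=lambda u: u["txg"], reverse=True):
        if checksum(ub["payload"]) == ub["sum"]:
            return ub
    return None  # every root is trashed: the pool really is gone

ubs = [make_uberblock(t, b"tree root for txg %d" % t) for t in (1, 2, 3)]
ubs[2]["payload"] = b"garbage"           # the newest commit got trashed
assert newest_valid(ubs)["txg"] == 2     # fall back to the previous commit
```

Because ZFS never overwrites the tree in place, the older commits referenced by the older uberblocks are still intact on disk, which is what makes this fallback safe.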
The fact that the vast majority of other filesystems have no way to detect silent corruption of data (only metadata inconsistencies) is far more frightening to me.
Here is a nice article written by someone who discovered just how unreliable disks are, after switching to ZFS (because other filesystems couldn't detect the corruption). Quite an eye-opener.
If you use chrome and are getting the same error as I am, here's the google cache link (Firefox will load it):
The article references sources of studies of hard disk corruption, if you'd want something with even more detail and statistics:
Unless you assume that all hardware is perfect and that no bugs exist in ZFS (there have been plenty of those in the past that rendered the whole FS completely worthless).
But no, ZFS doesn't need fsck. Because we don't want it to need one. Oh, and we don't want to spend the resources developing one.
The only reason ZFS doesn't have a fsck tool is because in the enterprise world it doesn't need one. When it is needed you just restore from tape instead. It is that simple.
At first glance, it certainly seems to depend upon some high-level ZFS data in order to start. A command like 'zpool scrub pool-name' still needs to navigate the ZFS pool data on disk in order to locate the named pool.
They're validated as well. Everything has a cryptographic checksum that's stored in the parent block, starting at the data and working all the way up to the top of the tree.
Furthermore, the higher up the tree you go, the more redundant copies there are. The top of the tree has four copies, if I recall. This ignores support for mirroring and striping, which further improves data redundancy.
But wait, there's more! Those four blocks aren't overwritten. ZFS is a copy-on-write filesystem (data and metadata) which behaves a lot like the persistent data structures that Clojure hackers are so fond of, so if the newest writes do not validate, it'll roll back to the newest valid commit of the tree.
That's a nice way of saying that you're guessing given your experience with other filesystems. ZFS was a genuinely revolutionary filesystem, and doesn't behave like other filesystems, the sole OSS exception being BTRFS. Read up on it a bit, you'll find something interesting. :)
It then reads from disk the data about which zpool a disk belongs to. That data is itself stored in multiple locations so that it is unlikely all of them are corrupted; after importing the disks you can run a scrub.
If the pool metadata itself is corrupted, generally you can roll-back to a previous time when the data is not corrupted.
This document: http://docs.oracle.com/cd/E19082-01/817-2271/6mhupg6qg/index... describes fairly well what all the options along the way are.
None of the failure modes described would be any better if there were an fsck tool available... in all file systems they are going to cause data loss.
The raid arguments against btrfs are silly. Raid 5 is quite unnecessary, particularly with these kinds of file systems. I can only assume people making this argument have never maintained file systems of this type and clearly don't understand how to implement it correctly.
Both ZFS and btrfs are the future though. If admins aren't seriously considering one of the two for their server infrastructure then they probably shouldn't be admins.
While it does NOT contain ZFS, it is a great read.
No real cost there, except some possible confusion. Move the developers to whichever project seems to be winning.
However, neither license puts restrictions on (end-user) use, which means that as long as you distribute the bits separately, running them together is perfectly fine. Which means (as their FAQ states) that while it can't ship as part of the kernel, there's no reason end-users wouldn't be allowed to load a separately distributed module into their kernel.
That is the reason why I avoid the Nvidia binary driver as well.