The fact that the caching algos are so good at keeping things in memory is why everyone gets hung up on using ECC RAM with it.
I'd love to run ECC RAM, but I'd have to buy a much more expensive processor, a much more expensive mainboard, and all of my RAM would have to be swapped for moderately more expensive replacements. And DDR3+ is not cheap.
I'm curious though, if ECC consists of a ninth parity bit, is there any reason why a memory controller can't be designed that would, worst case, use every other (identical) stick just to get parity bit(s), as a BIOS configurable option? Sure it'd halve your RAM, or you'd pay a lot more in RAM costs, but not having to buy Xeon processors and mainboards, and getting to reuse your existing RAM, would be worth buying an extra stick of RAM in my opinion.
It seems as processes keep getting smaller, and RAM sizes keep getting larger, that the effects of cosmic radiation are just going to keep getting worse. If we can't get desktop CPUs and mainboards to just switch to ECC, surely this would at least be a better-than-nothing option.
As for ECC in desktops, I am using whatever interactions that I have with vendors to push for it. Here is a public one:
It's cheaper than losing your data.
We spend 20% more on drives than we need for raw disk space, and we do it because we love our data. In most use cases, it's wise to trade some performance (a slower, single-socket setup with less memory) for reliability. Unless you need every bit of performance you can possibly get, a slightly slower but more reliable machine is a sound investment.
> if ECC consists of a ninth parity bit
A single parity bit will allow you to detect an error in a given unit, but not correct it. To correct a single-bit error, you'd need to pinpoint which bit got flipped, and that requires several check bits (roughly log2 of the word size, as in a Hamming code), not just one more parity bit.
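To make that concrete, here's a toy sketch in Python. It's purely illustrative (real ECC DIMMs use a SECDED code over 64-bit words, not this 4-bit Hamming(7,4) code), but it shows why locating the flipped bit takes log2-many check bits rather than a single extra parity bit:

```python
# Toy Hamming(7,4) code: 4 data bits protected by 3 check bits.
# A single parity bit over the same 4 bits could only say "something
# flipped"; the 3 check bits together pinpoint WHICH bit flipped.

def encode(d):
    """d: list of 4 data bits -> 7-bit codeword [p1, p2, d1, p3, d2, d3, d4]."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4   # covers codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4   # covers codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def correct(c):
    """Locate and fix a single flipped bit; return the 4 data bits."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3  # 1-based index of the bad bit; 0 = clean
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]

data = [1, 0, 1, 1]
word = encode(data)
word[4] ^= 1                     # simulate a cosmic-ray bitflip
assert correct(word) == data     # the code locates and repairs it
```

With one parity bit you could only have refused to use the word; with the syndrome you can repair it in place.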
If Intel were interested in allowing you to make use of any parity bits without buying a Xeon, they wouldn't be restricting ECC to Xeons in the first place.
Besides price, another issue against server hardware is that it's generally designed for a server case and therefore doesn't play well with enthusiast cases. The board I got is a little odd in that it uses a layout more typical of consumer boards, though the socket placement is still not ideal.
You choose based on power budget, thermal budget, and presence of on-die graphics. Regardless of how you choose, all processors in a given generation (back to at least the Phenom series and associated mobo chipsets) have ECC support, virtualization support, and whatever other fancy features were added in that generation.
If not, then this issue should go away when I publish ZFS support patches for syslinux later this year. syslinux is capable of generating initramfs archives on the fly, so adding ZFS support to it should largely eliminate the need for distribution-specific initramfs generators.
The biggest problem is actually grub: grub hasn't really been updated to handle feature-flags yet, so booting off a ZFS partition is tricky.
The easiest solution so far is just to keep /boot on a small ext4 partition, principally because ZFS changes seem to break grub a little too easily.
I remember a thread where RMS claimed to Bruno Haible that clisp was a derivative work of readline, since it had optional readline support.
I always thought that position was untenable, but since Haible was open to licensing clisp under GPL anyway, there wasn't a whole lot of pushback.
Note that I am not associated with the Debian project and therefore I was not involved in the discussion referenced here.
ROS (robot operating system) runs into this with nodelets, which are shared objects that are loaded into a nodelet manager. Is it a GPL violation to supply a launchfile which specs the loading of BSD and GPL nodelets into a single running process?
What I also got from the OP thread is that if you do this (let's just assume the two licenses are incompatible), then you are not the violator, since you haven't distributed these as one binary package. The users might be, though, but only if they go on to redistribute the pre-built combination of binaries/processes as one package, or even just upload them together to, say, a hosting provider.
My experiences with ZFS have been quite good, and with btrfs quite bad.
In the end I assume this is intentional, to keep the license status of dtrace and ZFS in a dubious state and thus prevent other distributions from shipping them. Meanwhile, Oracle can ship them because it isn't going to sue itself, and the kernel folks seem rather uninterested in lawsuits as well (and even if they weren't, Oracle could drag things out forever, considering that the company seems to have more lawyers than developers).
So far, I am not aware of anyone who has gotten a legally binding assurance from Oracle that shipping btrfs will not be a problem. I am also not aware of anyone in the btrfs community asking Oracle to do something about it. If btrfs takes off, the ZFS patent portfolio could ensure that Oracle is the only company that can legally distribute btrfs, and Oracle's legal department would likely have a field day with any other company distributing it. In the meantime, the competition would have to develop workarounds, and Oracle would be ahead, with a better filesystem than it would have had if it had been upfront about this issue.
What Oracle might or might not do with the ZFS patent portfolio in the future is speculation, but the fact is that Oracle appears to have reserved the option to use it in future lawsuits.
This whole post blows straight through "speculation" and into "unpaid marketing for Oracle" in the worst possible way.
I don't like Oracle, I think they're one of the most evil companies in existence. But that doesn't automatically make anyone who attempts to explain Oracle's behavior a shill or "unpaid marketing for Oracle".
I, for one, do want my FS to run on 32-bit systems, and I don't want it to require gigs of RAM to get its work done. (My information may be out of date, as I stopped looking at ZFS once it became clear that btrfs was stable enough for the tasks that I was putting it to.)
That aside, >99% of memory used by ZFS on many systems is used for cache. It does not need that to operate, but like all filesystems, it performs better with more cache.
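For the curious, ZoL exposes this split directly through kernel stats (path as of ZoL 0.6.x; field names may differ across versions):

```shell
# ARC current size vs. adaptive target vs. configured maximum.
# Almost all of "size" is reclaimable cache that shrinks under
# memory pressure, just like the page cache does.
grep -E '^(size|c|c_max) ' /proc/spl/kstat/zfs/arcstats
```

If you want to cap it regardless, the `zfs_arc_max` module parameter limits the ARC to a fixed byte count.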
Here's hoping Debian can develop something similar so that users can create and boot from a ZFS partition from within the installer.
> CCDL is an Open Source License that is DFSG compliant
I don't mean to nitpick, but if you're going to discuss the legalities of a license, at least spell it correctly. It's not CCDL, it's CDDL, or Common Development and Distribution License.
The only case where I suspect btrfs might significantly outperform ZFS involves getdents() when the cache is cold. This is because ZoL does not presently do readahead on directory lookups. This has a noticeable impact on the performance of `ls` on the first access of a directory on mechanical storage, but it is not a problem for subsequent reads or on solid state storage. Prefetch support in getdents() will likely be implemented in either the next release or the one that follows.
This is especially important because in the use cases where performance differences are significant (that is, not general desktop usage), the maturity of the FS is fundamental, and btrfs is discouraged right now.
CoreOS uses btrfs as its rootfs. I imagine that the CoreOS developers managed to avoid issues like ENOSPC by virtue of not writing to their rootfs very much. I did not have that luxury since I compiled a Gentoo GNU userland on top of it during the course of development. I encountered numerous ENOSPC errors on btrfs when developing the ZFS port to CoreOS and even hit ENOSPC errors when trying to correct the btrfs ENOSPC errors with `btrfs balance /`. I would not consider btrfs ready for production use, but your mileage will vary.
$ df -h .
Filesystem Size Used Avail Use% Mounted on
/dev/dm-1 42G 41G 280M 100% /home
$ btrfs fi df .
Data, single: total=40.48GiB, used=40.21GiB
System, single: total=4.00MiB, used=12.00KiB
Metadata, single: total=1.01GiB, used=649.18MiB
$ uname -r
CoreOS uses Linux 3.15.y.
Edit: Nevermind about the advertising, it's still intrusive, I'm just running adblock now.
As for Phoronix's ZFS benchmarks, the test hardware used drives that misreported themselves as having 512-byte sectors, which handicapped ZFS performance. Phoronix rejected all suggestions that it correct for this as end users had been doing. Phoronix refused to meet halfway by posting two sets of results (one with proper configuration and one without), and also refused the suggestion that it mention the existence of that problem in its test hardware. I eventually wrote code to identify drives known to misreport their sector sizes so that ZFS will automatically use the correct settings on them. That led to the Phoronix August 2013 benchmarks showing a remarkable improvement in ZFS performance in FIO. It was so great that it sparked a discussion among the btrfs developers:
Later that month, I publicly criticized Phoronix for posting misleading benchmarks:
Phoronix has not posted ZFS benchmarks since that time.
More recent derivatives of OpenSolaris were reported to do 1.6 million IOPS last year:
Your mileage will vary, but ZFS has always had strong performance.
Here's a benchmark from 2013 with most results showing that it has worse performance on Linux than XFS and EXT4:
As for the benchmarks you cite, there are several key problems:
1. You say that they show ZFS having worse performance than XFS and ext4, but XFS is not in those benchmarks and the FIO tester shows ZFS as outperforming its competition by a significant margin.
2. They use a single disk. No server does this, and while desktops and laptops do, it is not clear how the benchmarks are of any relevance there. Additionally, LZ4 compression is not in use, even though practically everyone deploying ZFSOnLinux would configure it to use LZ4.
3. ext4 manages to perform better than the theoretical limit of the SATA II interface and there is no discussion as to why.
4. ZFSOnLinux 0.6.2 is an old release. The most recent 0.6.3 release includes a new IO elevator and other improvements that enable it to outperform its predecessor by a significant margin in many workloads.
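On the LZ4 point above: enabling it is essentially a one-liner (a sketch; `tank` is a placeholder pool name, and the feature-flag line is only needed on pools created before lz4_compress existed):

```shell
# Enable the feature flag on older pools, then turn on LZ4 pool-wide.
# New writes are compressed; existing data stays as-is until rewritten.
zpool set feature@lz4_compress=enabled tank
zfs set compression=lz4 tank

# Verify the setting and the achieved ratio:
zfs get compression,compressratio tank
```

LZ4 is cheap enough on modern CPUs that it frequently improves throughput, since fewer bytes hit the disks.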
Would you post something constructive that you actually did yourself? I am beginning to think that you have never even used ZFS.
As with most things, this isn't a binary decision, but typically ZFS is a good (if not the best) solution for most storage arrays.
As for XFS, it is a single block device filesystem that relies on external shims to scale to multiple disks. ZFS can outscale it when various shims are put into place to allow multiple disks to be used. One user on freenode had difficulty getting good sequential performance from XFS + LVM + MD RAID 5 on 4 disks. He reported that he could not get better than 44MB/sec writes while ZFS managed 210MB/sec. I had a similar problem in 2011 with ext4 + LVM + MD RAID 6 on 6 disks. In that case, I could only manage 20MB/sec. It is why I am a contributor to the ZFSOnLinux project today. To make these anecdotes constructive, it would be nice if we had documentation on how to configure XFS + LVM + MD RAID 5/6 in a way that sequential performance does not suffer. In my case, my performance issue involved KVM/Xen guests. I never confirmed whether that user ran his tests on bare metal, but I suspect that he did.
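To make my own anecdote a bit more constructive, here is roughly what an aligned stack would look like. This is an untested sketch with placeholder device names, assuming a 6-disk RAID 6 with a 512 KiB chunk (i.e. 4 data disks); the key is carrying the stripe geometry through every layer, since mkfs.xfs cannot auto-detect it through LVM:

```shell
# 6-disk RAID 6, 512 KiB chunk: 4 data disks per stripe.
mdadm --create /dev/md0 --level=6 --raid-devices=6 --chunk=512 /dev/sd[b-g]

# Align LVM extents to the full stripe (4 data disks * 512 KiB = 2 MiB)
# so logical volumes start on stripe boundaries.
pvcreate --dataalignment 2m /dev/md0
vgcreate vg0 /dev/md0
lvcreate -l 100%FREE -n data vg0

# Tell XFS the geometry explicitly: su = chunk size, sw = data disks.
mkfs.xfs -d su=512k,sw=4 /dev/vg0/data
```

Misalignment at any one of these layers turns full-stripe writes into read-modify-write cycles, which is the usual cause of the dismal sequential numbers described above.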
XFS's inability to scale past a single block device is not its only issue. Until a disk format change adds checksums, it has none to protect itself against corruption. Even when it gains them, it will not be able to repair the corruption it detects (beyond keeping the kernel from panicking), and those checksums will do nothing to protect your data.
That said, ZFS has always focused on obtaining strong performance. This is why it has innovations like ZIL, ARC and L2ARC. The only instances where ZFS purposefully sacrifices performance is when getting a few extra percentage points means jeopardizing data integrity. There would have been no sense in developing ZFS as a replacement for existing filesystems if it did not keep data safe.
Usually I hear people complaining about slow file creates/deletes/renames with ZFS or slowness navigating or dealing with large directories. A quick Google confirms that there are lots of people that have seen these behaviours. But I don't use it myself so I don't have any specific data. I was just trying to say that ZFS has a reputation as being a better filesystem, but not a faster one, at least with the people I talk to about storage.
ARC is a technology from IBM. Intent logs go way back. Sure ZFS is doing some interesting things, but it's a bit ironic to talk about innovation given the fact that it was more or less based on Netapp's WAFL filesystem, and all of the lawsuits that followed on from that.
I am beginning to think that not only do you not use ZFS now, but that you have never used it. I suggest that you try running your own benchmarks and workloads on it. It is a joy to use. I think you would agree if you used it in a blindfolded comparison.
I don't believe that's what the linked post is saying. I may be missing additional context that's posted elsewhere, but at least what I read in this thread is: 1) the Debian ftpmasters rejected the binary ZFS module upload; and 2) the Debian ZFS-on-Linux team met at Debconf 14 and agreed on this summary/response, arguing why it should be accepted.
But has that response itself been accepted? Where is the mentioned vote? The only other post I see in the linked thread is from Lucas Nussbaum (Debian project leader), which sounds inconclusive,
> I think that adding an actual question to our legal counsel would help focus their work. ... I'll wait for comments or ACK from ftpmasters before forwarding your mail to SFLC.
 e.g. https://packages.debian.org/sid/nvidia-driver
Consider: you load data from ext4 into RAM and it gets corrupted. You change some bytes, then save it; the corrupted data is then written to disk. The damage could be anywhere inside the allocation unit, which is generally 4K, which is not insubstantial.
ZFS's worst case behavior is exactly the same: it can't protect you from corruption in RAM after checksum validation if you then ask it to save that data to disk.
There is no difference: if you do not have ECC RAM, you are vulnerable to bitflips corrupting your data, and sooner or later they will.
ZFS won't corrupt on-disk data that gets altered in memory by a bitflip but is only ever read: the flipped in-memory copy is never written back. It can't detect the in-memory flip, but even if checksum validation turns up a mismatch at some point, the restore is done from checksum-protected parity/mirror data. And if the restored block were itself corrupted, the checksum won't match and ZFS will rebuild the block correctly next time.
If you do not have ECC RAM, every filesystem will potentially corrupt your data. Even in that case, ZFS is more resilient than pretty much all of them.
This makes it far more risky to run ZFS with non server grade parts.
Always fun to post on HN and get downvoted ;o
It's toast if you were running mdadm or RAID. It's toast if a file gets deleted and recovered by ext3/4 (good luck figuring out which chunk you're looking at in lost+found).
The second article is simply stressing my original point: ZFS cannot protect you from bad RAM. No file system can. If data is corrupted in RAM, ZFS will checksum that and write it to disk. This is exactly what will happen with any other filesystem.
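That point can be simulated in a few lines. This is a sketch using a CRC as a stand-in for ZFS's fletcher/SHA checksums: if the flip happens before the checksum is computed, the checksum faithfully covers garbage, on ZFS or any other checksumming filesystem.

```python
import zlib

def write_block(data):
    """Model a filesystem write: checksum the buffer, store both."""
    return data, zlib.crc32(data)

def read_block(stored):
    """Model a read: verify the stored checksum before returning data."""
    data, checksum = stored
    if zlib.crc32(data) != checksum:
        raise IOError("on-disk corruption detected")
    return data

good = b"important data"

# Case 1: the flip happens on disk, AFTER the checksum was computed.
data, csum = write_block(good)
flipped_on_disk = bytes([data[0] ^ 0x01]) + data[1:]
try:
    read_block((flipped_on_disk, csum))
except IOError:
    print("disk-side flip: caught by the checksum")

# Case 2: the flip happens in RAM, BEFORE the checksum is computed.
# The checksum now faithfully covers garbage; no filesystem can tell.
flipped_in_ram = bytes([good[0] ^ 0x01]) + good[1:]
stored = write_block(flipped_in_ram)
assert read_block(stored) == flipped_in_ram  # validates, but it is wrong
print("RAM-side flip: checksum happily covers corrupted data")
```

The checksum machinery is downstream of RAM, so only ECC can close that window.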
ZFS storage pools can recover from fairly dramatic failures. You can lose a disk out of a non-redundant stripe and still have complete metadata (just incomplete data, but checksums tell you exactly what you lost). Unimportable pools are not something that easily happens without severe damage: even a corrupted uberblock can be recovered from by using `zpool import -T` to rewind to an earlier txg and get back a verifiably complete copy of your data.
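For anyone who ever needs it, the rewind procedure looks roughly like this (a sketch; the pool name, device, and txg number are placeholders, and -T should be treated as a last resort after taking the pool offline and imaging the disks):

```shell
# Inspect labels and recent uberblocks on a pool device to find
# candidate transaction groups to rewind to.
zdb -ul /dev/sdb1

# Import read-only at an earlier txg to verify the data is intact...
zpool import -o readonly=on -T 123456 tank

# ...then, if everything checks out, re-import read-write at that txg.
zpool export tank
zpool import -T 123456 tank
```

Everything after the chosen txg is discarded, which is exactly why you verify read-only first.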
This idea that the failure mode of other filesystems under faulty RAM is somehow better is destructive fiction. You have no idea if your data is valid on disk. You have no way to verify that the data was valid in the first place. And the degree and number of failures required to wipe out a pool would also wipe out any other storage system. The idea that recovery tools will save you in a catastrophic failure (of what kind?) is laughable. This can be trivially explored by trying to recover a deleted file in ext4. Sure, it can technically be done. Good luck doing it more than once.
But this also wouldn't be a random failure. ECC RAM with a consistent defect would give you the same issue. It would also proceed to destroy your regular filesystem/data/disk by messing up inode numbers, pointers, etc.
Your scenario would require an absurd number of precision defects: ZFS would have to always be handed the exact same block of memory, where it always stores only the data currently being compared, and then only ever uses that memory location to store the rebuilt data. And then also probably something to do with wiping out the parity blocks in a specific order.
This is a weird precision scenario. That's not a random defect (a bitflip, which is what ECC fixes); that's a systematic error. And I'd be amazed if the odds of it were higher than the odds of a hash collision, given the number of things you're requiring to go absolutely right to generate it.
EDIT: In fact I'm looking at the RAID-Z code right now. This scenario would be literally impossible because the code keeps everything read from the disk in memory in separate buffers - i.e. reconstructed data and bad data do not occupy or reuse the same memory space, and are concurrently allocated. The parity data is itself checksummed, as ZFS assumes it might be reading bad parity by default.
By the way, the above "insane" scenario comes from the FreeNAS forums and is used to motivate why ZFS should have ECC RAM. I am glad to hear that it is much less of an issue than it's made out to be.
The author there makes some crazy assumptions like this: "But, now things get messy. What happens when you read that data back? Let's pretend that the bad RAM location is moved relative to the file being loaded into RAM. Its off by 3 bits and move it to position 5."
Any good ideas on how one might fix this incorrect "common wisdom"?
The whole ECC recommendation exists because ZFS (unlike other filesystems) provides guarantees about data correctness. As you know, while ZFS can discover data corruption on disk thanks to checksums, it can't guarantee data correctness in RAM because, like any program, it is bound to trust it. That's why ECC RAM is highly recommended.
That being said, it is not a good idea to use hardware that lacks ECC RAM. No filesystem developer will claim that their kernel driver can operate properly without ECC RAM because it is not possible. Email the linux-fsdevel mailing list if you want confirmation.