Summary of ZFS on Linux for Debian (gmane.org)
141 points by ferrantim on Sept 9, 2014 | 96 comments

I use ZFS on Linux for my file server. One of the nicest things about it is that the caching algos actually work - they're resilient to scans, so I can, for instance, have an intensively used database file that remains in RAM, even while & after doing a linear scan over a large number of files (eg during rsync). With the standard Linux page replacement algos, linear reads will flush the stuff you're actually using out of the page cache.
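To illustrate what "resilient to scans" means, here is a minimal sketch comparing a plain LRU cache with a two-segment cache that only protects blocks seen more than once. This is a simplified stand-in and not the actual ARC algorithm; the class names, cache sizes and key names are all made up for the example.

```python
from collections import OrderedDict


class LRUCache:
    """Plain least-recently-used cache holding up to `capacity` keys."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()

    def access(self, key):
        if key in self.items:
            self.items.move_to_end(key)          # refresh recency on a hit
        else:
            self.items[key] = True
            if len(self.items) > self.capacity:
                self.items.popitem(last=False)   # evict the oldest entry

    def __contains__(self, key):
        return key in self.items


class SegmentedCache:
    """Two LRU segments: blocks seen once sit in `probation`; blocks seen
    again are promoted to `protected`, which one-off scans never touch."""

    def __init__(self, capacity):
        self.probation_cap = capacity // 2
        self.protected_cap = capacity - self.probation_cap
        self.probation = OrderedDict()
        self.protected = OrderedDict()

    def access(self, key):
        if key in self.protected:
            self.protected.move_to_end(key)
        elif key in self.probation:
            # Second access: promote to the protected segment.
            del self.probation[key]
            self.protected[key] = True
            if len(self.protected) > self.protected_cap:
                # Demote the coldest protected block instead of dropping it.
                demoted, _ = self.protected.popitem(last=False)
                self._add_probation(demoted)
        else:
            self._add_probation(key)

    def _add_probation(self, key):
        self.probation[key] = True
        if len(self.probation) > self.probation_cap:
            self.probation.popitem(last=False)

    def __contains__(self, key):
        return key in self.protected or key in self.probation


def hot_blocks_surviving_scan(cache):
    """Reuse four 'database' blocks, run one long linear scan, count survivors."""
    hot = [f"db-{i}" for i in range(4)]
    for _ in range(3):                 # the database blocks are reused
        for key in hot:
            cache.access(key)
    for i in range(1000):              # one linear pass, e.g. during rsync
        cache.access(f"scan-{i}")
    return sum(key in cache for key in hot)


lru_survivors = hot_blocks_surviving_scan(LRUCache(16))
segmented_survivors = hot_blocks_surviving_scan(SegmentedCache(16))
print(lru_survivors, segmented_survivors)
```

In this toy run the plain LRU cache keeps none of the four hot blocks after the scan, while the segmented cache keeps all four, because the scanned blocks are only ever seen once and never leave the probation segment.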

The fact that the caching algos are so good at keeping things in memory is why everyone gets hung up on using ECC RAM with it.

This is not the reason why people recommend ECC. ZFS is so good at detecting corruption (CKSUM errors) that when it happens it is hard to tell what caused it: faulty RAM corrupting data before it is written to disk or after it is read from disk, or data corrupted on disk? ECC simply helps eliminate a good chunk of the corruption errors caused by faulty RAM. That said, in my experience there are a few tell-tale signs that RAM is the issue: CKSUM errors pop up infrequently and are not consistently attributed to the same drive(s).

I haven't had any issues so far using four drives mirrored on ZFS, but the ECC issue certainly worries me.

I'd love to run ECC RAM, but I'd have to buy a much more expensive processor, a much more expensive mainboard, and all of my RAM would have to be swapped for moderately more expensive replacements. And DDR3+ is not cheap.

I'm curious though, if ECC consists of a ninth parity bit, is there any reason why a memory controller can't be designed that would, worst case, use every other (identical) stick just to get parity bit(s), as a BIOS configurable option? Sure it'd halve your RAM, or you'd pay a lot more in RAM costs, but not having to buy Xeon processors and mainboards, and getting to reuse your existing RAM, would be worth buying an extra stick of RAM in my opinion.

It seems that as processes keep getting smaller and RAM sizes keep getting larger, the effects of cosmic radiation are just going to keep getting worse. If we can't get desktop CPUs and mainboards to just switch to ECC, surely this would at least be a better-than-nothing option.

The "ECC issue" is relevant for all software and by extension, all filesystems. ZFS eliminates so many problems in the storage stack that things it cannot eliminate (like bit-flips from not having ECC) become more noticeable by virtue of fewer things going wrong. Here is some documentation on this:


As for ECC in desktops, I am using whatever interactions that I have with vendors to push for it. Here is a public one:


> I'd love to run ECC RAM, but I'd have to buy a much more expensive processor, a much more expensive mainboard, and all of my RAM would have to be swapped for moderately more expensive replacements. And DDR3+ is not cheap.

It's cheaper than losing your data.

We spend 20% more on drives than we have disk space and we do it because we love our data. In most use cases, it's wise to trade some performance (in the form of a slower, single-socket setup with less memory) for reliability. Unless you need all the performance you can possibly get, a slightly slower but more reliable machine is a sound investment.

> if ECC consists of a ninth parity bit

A single bit will allow you to detect an error in a given unit, but not correct it. For correcting a single-bit error, you'd need to pinpoint which bit got flipped and that would require a second parity bit.
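A small sketch of the difference between detecting and correcting, for anyone who wants to play with it. The first part uses one even-parity bit; the second uses the textbook Hamming(7,4) code, which spends three check bits on four data bits so that the failing position can be located. This is only an illustration: ECC DIMMs use wider codes (typically 8 check bits per 64 data bits), and the function names here are invented for the example.

```python
def parity(bits):
    """Even-parity bit: XOR of all bits."""
    p = 0
    for b in bits:
        p ^= b
    return p


def hamming74_encode(d):
    """Encode 4 data bits as 7 bits laid out as p1 p2 d1 p3 d2 d3 d4."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_correct(code):
    """Return (corrected codeword, 1-based position of the flipped bit or 0)."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]   # checks positions 1, 3, 5, 7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]   # checks positions 2, 3, 6, 7
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]   # checks positions 4, 5, 6, 7
    pos = s1 + 2 * s2 + 4 * s3       # the syndrome spells out the position
    if pos:
        c[pos - 1] ^= 1
    return c, pos


data = [1, 0, 1, 1]

# One parity bit: a mismatch shows that something flipped, not which bit.
stored_parity = parity(data)
damaged = list(data)
damaged[2] ^= 1
parity_detects_error = parity(damaged) != stored_parity

# Hamming(7,4): the syndrome identifies the flipped bit, so it can be fixed.
codeword = hamming74_encode(data)
received = list(codeword)
received[4] ^= 1                     # flip position 5
repaired, flipped_position = hamming74_correct(received)
print(parity_detects_error, flipped_position, repaired == codeword)
```

The parity check only reports that the word is bad, while the Hamming syndrome comes out as 5, the position that was flipped, and the repaired codeword matches the original.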

> is there any reason why a memory controller can't be designed that would, worst case, use every other (identical) stick just to get parity bit(s), as a BIOS configurable option?

If Intel were interested in allowing you to make use of any parity bits without buying a Xeon, they wouldn't be restricting ECC to Xeons in the first place.

I went ECC a couple months ago and managed to find some parts that didn't break the bank, mostly because I held back and stayed a generation behind: GA-6UASL3, E3-1230V2. There are other sacrifices: the motherboard is rather no-frills, and the memory bandwidth isn't that great.

Besides price, another issue against server hardware is that it's generally designed for a server case and therefore doesn't play well with enthusiast cases. The board I got is a little odd in that it uses a layout more typical of consumer boards, though the socket placement is still not ideal.

One of the reasons I went with AMD processors and mobo chipsets for my non-laptop system (and recommend them for anyone who's not concerned about performance per watt) is because AMD doesn't segment the market by processor features.

You choose based on power budget, thermal budget, and presence of on-die graphics. Regardless of how you choose, all processors in a given generation (back to at least the Phenom series and associated mobo chipsets) have ECC support, virtualization support, and whatever other fancy features were slapped in in that generation.

Most MB vendors (other than ASUS) disable ECC for non-server CPUs... Also, for those looking and unsure, it's Unbuffered ECC, not Registered ECC.

FYI: Mid-to-higher-end AMD CPUs on ASUS motherboards support unbuffered ECC RAM. The RAM costs about 2x as much and you won't go above 8/4 cores at modest speed, but it's a good option for homebrew solutions.

ECC might not cost as much as you think. Here is a barebones ECC setup with 4GB of ram for $230.


FYI, that mb seems to support unbuffered ecc, not the registered ecc stuff you put in the list... AMD + ASUS MB is probably a better option for under $600-800 price points.

Most modern CPUs have the memory controller integrated on-die, so you'd need a different class of CPU -- and then you're right back to buying a Xeon.

Not necessarily. Some Atoms (e.g. Centerton) are now shipping with ECC support; similarly AMD Kabini processors support ECC.

Installing / on ZFS is fairly easy on most major Linux distributions, but Debian has been the main exception due to its initramfs generator lacking ZFS support. I am cautiously optimistic that will change.

If not, then this issue should go away when I publish ZFS support patches for syslinux later this year. syslinux is capable of generating initramfs archives on the fly, so adding ZFS support to it should largely eliminate the need for distribution-specific initramfs generators.

I am not a particularly good Linux user (I have to look up most commands and where things are) and I had no problem getting ZFS set up on Debian. I'm worried now that you say this, because I feel I may have messed something up.

He's talking about the root partition being on ZFS, not just setting up an array.

Don't most people just use a /boot partition anyway?

If it works for you, then there is no need to worry. I only said that because there are others who had difficulty setting this up.

Do you boot from your ZFS partition? The initramfs limitation would only come into play in that situation, I believe.

Naa, I'm good then. Or at least in that regard.

zfs-initramfs is helpfully provided by the ZoL project and works great.

The biggest problem is actually grub: grub hasn't really been updated to handle feature-flags yet, so booting off a ZFS partition is tricky.

The easiest solution so far is just to keep /boot on a small ext4 partition, principally because ZFS changes seem to break grub a little too easily.

GRUB2 upstream has support for feature flags and is compatible with ZoL 0.6.3. If Debian's GRUB2 does not support it, it would be rather out of date. You could make a version 28 pool if you need to use the Debian GRUB2 version.

It's interesting that the Debian people feel that ZoL is not a derivative work.

I remember a thread where RMS claimed to Bruno Haible that clisp was a derivative work of readline, since it had optional readline support.

I always thought that position was untenable, but since Haible was open to licensing clisp under GPL anyway, there wasn't a whole lot of pushback.

In that case, readline was not a loadable module. A comparison that does use loadable modules is the FSF's GCC project. The FSF resisted implementing support for loadable modules in GCC for a long time under the belief that it would allow the use of GPL-incompatible modules. It was not until LLVM made it a moot point, because GCC itself could be replaced entirely by non-copyleft code, that GCC gained support for this. Linux kernel module support analogously permits loading modules that are under GPL-incompatible licenses.

Note that I am not associated with the Debian project and therefore I was not involved in the discussion referenced here.

Another one I've been wondering about recently is the inverse— loading at runtime a GPL module into an otherwise BSD codebase.

ROS (robot operating system) runs into this with nodelets, which are shared objects that are loaded into a nodelet manager. Is it a GPL violation to supply a launchfile which specs the loading of BSD and GPL nodelets into a single running process?

BSD and GPL are actually compatible and can be distributed together. It's my understanding that the advertising clause is the bit that makes them incompatible, and that regular BSD and GPL code can be bundled and distributed together.

What else I got from the OP thread is that if you do this (let's just assume the two licenses are incompatible) then you are not the violator, since you haven't distributed these as one binary package, but the users might be (only if they go on to redistribute the pre-built confabulation of binaries/processes as one package, or even just in uploading them together to, say, a hosting provider).

If you are looking at playing around with ZFS on Linux, be sure to check out Aaron Toponce's awesome series of articles, entitled "Install ZFS on Debian GNU/Linux" [1]. I have also done a two part screencast about using ZFS on Linux [2], part two will be released later today.

[1] https://pthree.org/2012/04/17/install-zfs-on-debian-gnulinux...

[2] https://sysadmincasts.com/episodes/35-zfs-on-linux-part-1-of...

Why doesn't Oracle just change ZFS to dual-license GPL/CDDL, and scrap btrfs?

My experiences with ZFS have been quite good, and with btrfs quite bad.

Because Oracle wants you to buy Solaris if you do serious business.

Oracle wants you to buy support contracts. They'd be perfectly happy to have those be for Oracle Linux rather than Solaris since they don't have to do all of the engineering that way.

or Oracle Linux. I don't know if they are shipping ZFS support with it yet. But they are already shipping dtrace with it which suffers from the same license issues.

I asked the Oracle representative at LinuxCon North America 2014 this question and explicitly mentioned that it was already shipping CDDL kernel code because of DTrace. He could not answer this question, but claimed that he would get back to me. So far, I have not heard back.

They are currently using a trick trying to circumvent the EXPORT_SYMBOL_GPL issue. But it is highly dubious whether this is legally sound:

  ktime_t dtrace_gethrtime(void)
  {
          return ktime_get();
  }
  EXPORT_SYMBOL(dtrace_gethrtime);


In the end I assume this is intentional to keep the license issue of dtrace and ZFS in a dubious state. Thus preventing other distributions from shipping it. Meanwhile Oracle can ship it because they aren't going to sue themselves and the kernel folks seem rather uninterested in lawsuits as well (and even if they did then Oracle could drag it out forever considering that the company seems to have more lawyers than developers).

I guess the question remains: why do they keep developing btrfs? (Do they?)

Sun received dozens if not hundreds of patents for techniques used in ZFS and since Oracle purchased Sun, it now has those patents. btrfs uses many of the ideas that ZFS uses, so it is highly unlikely that btrfs does not infringe on at least some of those patents. People who use ZFS have a patent grant through the CDDL, but people who use btrfs have no such protection from the GPLv2.

So far, I am not aware of anyone who has gotten any legally binding assurance from Oracle that shipping btrfs will not be a problem. I am also not aware of anyone in the btrfs community asking Oracle to do something about it. If btrfs takes off, the ZFS patent portfolio could ensure that Oracle is the only company that can legally distribute btrfs. Consequently, Oracle's legal department would likely be able to have a field day with any company distributing btrfs. In the meantime, their competition will have to develop workarounds and Oracle would be ahead because they will have a better filesystem than they would have had they been upfront about this issue.

What Oracle might or might not do with the ZFS patent portfolio in the future is speculation, but the fact is that Oracle appears to have reserved the option to use it in future lawsuits.

> What Oracle might or might not do with the ZFS patent portfolio in the future is speculation,

This whole post blows straight through "speculation" and into "unpaid marketing for Oracle" in the worst possible way.

I don't think that's fair at all.

I don't like Oracle, I think they're one of the most evil companies in existence. But that doesn't automatically make anyone who attempts to explain Oracle's behavior a shill or "unpaid marketing for Oracle".

They had the principal btrfs developer a few years back, but he's currently at Facebook, where they're apparently migrating to btrfs.

My experiences with btrfs over the past many years on my laptop and my desktop have been quite good.

I, for one, do want my FS to run on 32-bit systems, and don't want it to require gigs of RAM to get its work done. (My information may be out of date, as I stopped looking at ZFS once it became clear that btrfs was stable enough for the tasks that I was putting it to.)

ZFS does not require gigs of RAM to work, it's ARC that does. You can have ARC disabled, but then ZFS performance isn't that good.

ARC would be perfectly happy with small amounts of memory. ZFS' ARC implementation will free memory as required by the system (provided that the Linux kernel asks) and grow when memory is not in use up to zfs_arc_max. The limit is completely configurable and can be set to something small like 64MB.
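As a rough sketch of what setting that limit looks like: zfs_arc_max is a module parameter expressed in bytes, so a 64MB cap has to be converted first. The helper below just builds the options line; the helper name is made up, and it assumes the module options live in a modprobe configuration file such as /etc/modprobe.d/zfs.conf, so check where your distribution keeps them.

```python
def arc_max_option(megabytes):
    """Build a modprobe options line capping the ARC at `megabytes` MiB."""
    limit_bytes = megabytes * 1024 * 1024   # zfs_arc_max is given in bytes
    return f"options zfs zfs_arc_max={limit_bytes}"


line = arc_max_option(64)
print(line)
```

For 64 this produces `options zfs zfs_arc_max=67108864`.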

32-bit support is a sore spot caused by differences in how Solaris and Linux do memory allocations. Expect to see it fixed in the next 6 months.

That aside, >99% of memory used by ZFS on many systems is used for cache. It does not need that to operate, but like all filesystems, it performs better with more cache.

So it sounds like they are okay with binary kernel modules, just not built-in to the base kernel. FreeBSD manages to do ZFS as a kernel module (plus another module for Open Solaris abstractions) quite successfully. Although it has somewhat of an ugly ZFS-on-root shim loader for booting from a ZFS partition, it does certainly get the job done.

Here's hoping Debian can develop something similar so that users can create and boot from a ZFS partition during their installer.

> CCDL is an Open Source License that is DFSG compliant

I don't mean to nitpick, but if you're going to discuss the legalities of a license, at least spell it correctly. It's not CCDL, it's CDDL, or Common Development and Distribution License.

ZFS on Linux is very good!

But is it, performance-wise? How will ZFS compare to btrfs?

I would expect ZFS to outperform btrfs because of ARC and ZIL. ARC allows ZFS to increase the cache hit rate. ZIL permits ZFS to update its on-disk trees asynchronously. I am not aware of any analog to this in btrfs. Desktop users typically report that their systems become more responsive after switching to ZFS.

The only case where I suspect btrfs might significantly outperform ZFS involves getdents() when the cache is cold. This is because ZoL does not presently do readahead on directory lookups. This has a noticeable impact on the performance of `ls` on the first access of a directory on mechanical storage, but it is not a problem for subsequent reads or on solid state storage. Prefetch support in getdents() will likely be implemented in either the next release or the one that follows.

At this point in time, they still can't be compared, because btrfs is still not [considered] production-ready, and subject to significant changes (also performance-related).

This is especially important because in use cases where performance differences are significant, that is, not in general desktop usage, the maturity of the FS is fundamental, and btrfs is discouraged right now.

There are many situations where performance is important, and durability is... not. For example, ephemeral CoreOS cloud instances running Docker containers: they use lots of copy-on-write layers, but they don't actually need to persist any state across reboots (the layers may as well be stored in volatile memory.) Btrfs is perfectly "production-ready" for this particular use-case, so a [current] performance comparison would be pretty useful.

I am working on ZFS support for CoreOS. A snapshot of my WIP proof of concept was posted by my employer yesterday:


CoreOS uses btrfs as its rootfs. I imagine that the CoreOS developers managed to avoid issues like ENOSPC by virtue of not writing to their rootfs very much. I did not have that luxury since I compiled a Gentoo GNU userland on top of it during the course of development. I encountered numerous ENOSPC errors on btrfs when developing the ZFS port to CoreOS and even hit ENOSPC errors when trying to correct the btrfs ENOSPC errors with `btrfs balance /`. I would not consider btrfs ready for production use, but your mileage will vary.

When were you hitting these ENOSPC errors, and what kernel were you using when you hit them?

  $ df -h .
  Filesystem      Size  Used Avail Use% Mounted on
  /dev/dm-1        42G   41G  280M 100% /home
  $ btrfs fi df .
  Data, single: total=40.48GiB, used=40.21GiB
  System, single: total=4.00MiB, used=12.00KiB
  Metadata, single: total=1.01GiB, used=649.18MiB
  $ uname -r
As you can see, I've been working with a pretty much full btrfs volume. It used to be TERRIBLE to deal with btrfs in such a situation, but I haven't had an ENOSPC issue in ages.

The Gentoo Prefix bootstrap on a developer image:


CoreOS uses Linux 3.15.y.

You may be interested in the Phoronix[1] Linux review site. When I used to follow it through RSS more diligently, I routinely saw filesystem benchmark shootouts such as this[2]. I'm not sure if there's a more recent one than that, but there are more recent benchmarks of Btrfs and ZFS against prior versions of themselves so you can see improvements (or regressions).

1: http://www.phoronix.com

2: http://www.phoronix.com/scan.php?page=article&item=linux_313...

I would be really, really careful with Phoronix benchmarking. If I recall correctly, I've seen them 'benchmark' and compare a FreeBSD system with a recent (at the time) Linux release. The issue was, they weren't even running the tests on the same hardware. <sarcasm> Really?! You saw performance differences between an i5 and an i7?! Each running a different OS?! Get outta here!! </sarcasm>

I'd be interested in seeing that. It's been a while since I really followed the site, but I seem to recall he was at least upfront about what he was testing and why in the text around the benchmarks. I will admit that he seemed overly eager to get articles out, sometimes of ambiguous usefulness, probably due to his advertising model (at least at the time; when I was there earlier it seemed far less riddled with advertising).

Edit: Nevermind about the advertising, it's still intrusive, I'm just running adblock now.

Phoronix does try to be honest about its methodology, but the person running benchmarks gives no thought to what his results mean, if anything. It shows in the distinct lack of documentation on why a given benchmark matters.

As for Phoronix's ZFS benchmarks, the test hardware used drives that misreported themselves as having 512-byte sectors, which handicapped ZFS performance. Phoronix rejected all suggestions that it correct for this as end-users had been doing. Phoronix refused to meet half way by posting two results (one with proper configuration and one without), and also refused the suggestion that it mention the existence of that problem in its test hardware. I eventually wrote code to identify drives known to misreport their sector sizes so that ZFS will automatically use the correct settings on them. That led to the Phoronix August 2013 benchmarks showing a remarkable improvement in ZFS performance in FIO. It was so great that it sparked a discussion among the btrfs developers:


Later that month, I publicly criticized Phoronix for posting misleading benchmarks:


Phoronix has not posted ZFS benchmarks since that time.
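For readers wondering why a misreported sector size hurts so much, here is a minimal sketch of the arithmetic. ZFS expresses a vdev's alignment as ashift, the base-2 logarithm of the sector size, so a drive claiming 512-byte sectors gets 9 where a 4096-byte drive needs 12. The term ashift is not in the comment above, the function names are invented, and the 4096-byte physical sector is an assumption about the drives in question.

```python
def ashift_for(sector_bytes):
    """log2 of the sector size, which is how ZFS expresses alignment."""
    if sector_bytes <= 0 or sector_bytes & (sector_bytes - 1):
        raise ValueError("sector size must be a power of two")
    return sector_bytes.bit_length() - 1


def physical_sectors_touched(offset, length, physical=4096):
    """Number of physical sectors a write of `length` bytes at `offset` overlaps."""
    first = offset // physical
    last = (offset + length - 1) // physical
    return last - first + 1


reported = ashift_for(512)     # what a misreporting drive leads to
actual = ashift_for(4096)      # what the hardware really needs

# A 4 KiB write aligned to 512 bytes but not to 4096 straddles two physical
# sectors, and each partially covered sector needs a read-modify-write.
misaligned = physical_sectors_touched(offset=512, length=4096)
aligned = physical_sectors_touched(offset=4096, length=4096)
print(reported, actual, misaligned, aligned)
```

With 512-byte alignment a single 4 KiB write can land across two physical sectors instead of one, which is the kind of overhead that correct settings avoid.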

That's unfortunate, but doesn't surprise me all that much. Not because I think/thought of Larabel as likely to do something like that, but because I believe it's all too easy for organizations that are a single individual or organizations where a disproportionate amount of operation and decision making is really a single individual to make calls based on time and ability, and then fall back to defending that decision ever more vehemently long past the point where it should have been reassessed. Honest trusted feedback is priceless.

Anecdotally I see better performance using ZoL than I did when I was running the same zpool and hardware on FreeBSD.

This would suggest that the block device layer under ZFS and/or the VFS above it have room for improvement on FreeBSD.

ZFS supports better compression algorithms. Those can even speed up slow disks.

btrfs is almost painfully slow at times, so hopefully ZFS is a lot better.

Even on Solaris, ZFS doesn't have a great reputation for performance.

Can you substantiate that? ZFS on Solaris produced excellent numbers back in 2008:


More recent derivatives of OpenSolaris were reported to do 1.6 million IOPS last year:


Your mileage will vary, but ZFS has always had strong performance.

One of those is an 8 year old benchmark, the other used a 9.6TB flash array so hardly a realistic storage platform. I'm not saying that ZFS performance is awful, just that other features like data protection come first. If you're hoping for some salvation from btrfs you might be waiting a while. Of course they're both developed by the same company now!

Here's a benchmark from 2013 with most results showing that it has worse performance on Linux than XFS and EXT4:


Oracle has nothing to do with the Open ZFS community these days.

As for the benchmarks you cite, there are several key problems:

1. You say that they show ZFS having worse performance than XFS and ext4, but XFS is not in those benchmarks and the FIO tester shows ZFS as outperforming its competition by a significant margin.

2. They use a single disk. No server does this and while desktops and laptops do this, it is not clear how the benchmarks are of any relevance there. Additionally, LZ4 compression is not in use, when practically everyone deploying ZFSOnLinux would configure it to use LZ4.

3. ext4 manages to perform better than the theoretical limit of the SATA II interface and there is no discussion as to why.

4. ZFSOnLinux 0.6.3 supersedes the 0.6.2 release used there, which is old. The most recent 0.6.3 release includes a new IO elevator and other improvements that enable ZFSOnLinux 0.6.3 to outperform its predecessor by a significant margin in many workloads.
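On point 3, the ceiling is easy to work out. SATA II signals at 3 Gbit/s and uses 8b/10b encoding, so only 8 of every 10 bits on the wire are payload. The sketch below does that arithmetic; the 450 and 250 MB/s figures are made-up examples, not numbers from the benchmark. A result above the ceiling cannot have come from the disk alone, and one plausible explanation would be reads served from RAM.

```python
LINE_RATE_BITS_PER_SEC = 3_000_000_000   # SATA II signalling rate: 3 Gbit/s


def sata2_ceiling_mb_per_sec():
    """Payload ceiling in MB/s after 8b/10b encoding overhead."""
    payload_bits = LINE_RATE_BITS_PER_SEC * 8 // 10   # 10 line bits carry 8 data bits
    return payload_bits // 8 // 1_000_000


def exceeds_interface(measured_mb_per_sec):
    """True when a result is faster than the link could have delivered."""
    return measured_mb_per_sec > sata2_ceiling_mb_per_sec()


ceiling = sata2_ceiling_mb_per_sec()
print(ceiling, exceeds_interface(450), exceeds_interface(250))
```

The ceiling works out to 300 MB/s before any protocol overhead, so any sustained figure above that deserves an explanation.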

Would you post something constructive that you actually did yourself? I am beginning to think that you have never even used ZFS.

Compared to what exactly? It's consistently competitive against other similar FSes (XFS, JFS, etc.). When using caching, it's supposedly the best for database storage. And, feature-wise, it beats everything else out there.

As with most things, this isn't a binary decision, but typically ZFS is a good (if not the best) solution for most storage arrays.

Most people I know that are looking for performance on Linux stick with XFS, but then again people that are serious about Linux aren't running ZFS anyway. And of course there aren't any benchmarks for XFS vs ZFS on Solaris. Where I see consistent problems with ZFS performance is with metadata operations. Sure it has great features for managing disks and protecting data, but those are the primary feature set and performance has always taken a back seat.

Would you elaborate on what metadata operations are slow? If you had problems, I would like to know so that I could look into fixing them. Otherwise, I do not see anything actionable here. Your criticism so far contains zero verifiable claims.

As for XFS, it is a single block device filesystem that relies on external shims to scale to multiple disks. ZFS can outscale it when various shims are put into place to allow multiple disks to be used. One user on freenode had difficulty getting good sequential performance from XFS + LVM + MD RAID 5 on 4 disks. He reported that he could not get better than 44MB/sec writes while ZFS managed 210MB/sec. I had a similar problem in 2011 with ext4 + LVM + MD RAID 6 on 6 disks. In that case, I could only manage 20MB/sec. It is why I am a contributor to the ZFSOnLinux project today. To make these anecdotes constructive, it would be nice if we had documentation on how to configure XFS + LVM + MD RAID 5/6 in a way that sequential performance does not suffer. In my case, my performance issue involved KVM/Xen guests. I never confirmed whether that user ran his tests on bare metal, but I suspect that he did.

XFS' inability to scale past a single block device is not its only issue. Until a disk format change occurs that will add checksums, it has none to protect itself against corruption. When it gains them, it will not be able to do anything about that corruption when it detects it (aside from keeping the kernel from panicking) and it does nothing to protect your data.

That said, ZFS has always focused on obtaining strong performance. This is why it has innovations like ZIL, ARC and L2ARC. The only instances where ZFS purposefully sacrifices performance are those where getting a few extra percentage points means jeopardizing data integrity. There would have been no sense in developing ZFS as a replacement for existing filesystems if it did not keep data safe.

I would say that most places that are using XFS at scale are doing so with an external RAID array or SAN that handles all of the individual disks and presents a single logical LUN to the filesystem. ZFS is fairly unique in that it combines lower level disk and RAID management with a filesystem; typically these functions were handled in separate layers (like md) or in hardware.

Usually I hear people complaining about slow file creates/deletes/renames with ZFS or slowness navigating or dealing with large directories. A quick Google confirms that there are lots of people that have seen these behaviours. But I don't use it myself so I don't have any specific data. I was just trying to say that ZFS has a reputation as being a better filesystem, but not a faster one, at least with the people I talk to about storage.

ARC is a technology from IBM. Intent logs go way back. Sure ZFS is doing some interesting things, but it's a bit ironic to talk about innovation given the fact that it was more or less based on Netapp's WAFL filesystem, and all of the lawsuits that followed on from that.

I believe that the people with whom you spoke are referring to getdents() being slow when the cache is cold. That is because ZFS does not at this time implement directory prefetch logic. That will change in a future release of ZFSOnLinux, but it is a low priority because cold cache performance is not terribly important for production use.

I am beginning to think that not only do you not use ZFS, but that you have never used ZFS. I suggest that you try running your own benchmarks and workloads on it. It is a joy to use. I think you would agree if you were to use it in a blindfold comparison test.

> Debian maintainers vote to ship ZFSonLinux in Debian

I don't believe that's what the linked post is saying. I may be missing additional context that's posted elsewhere, but at least what I read in this thread is: 1) the Debian ftpmasters rejected the binary ZFS module upload; and 2) the Debian ZFS-on-Linux team met at Debconf 14 and agreed on this summary/response, arguing why it should be accepted.

But has that response itself been accepted? Where is the mentioned vote? The only other post I see in the linked thread is from Lucas Nussbaum (Debian project leader), which sounds inconclusive:

> I think that adding an actual question to our legal counsel would help focus their work. ... I'll wait for comments or ACK from ftpmasters before forwarding your mail to SFLC.

Also, their legal reasoning (based on EXPORT_SYMBOL_GPL) is one that has been repeatedly contested by various people including some Linux authors.

People do dispute that reasoning, but I think this statement is correct that if you don't accept that reasoning, Debian also needs to stop shipping proprietary drivers [1] that depend on the same rationale (Debian currently considers such drivers non-free, but not GPL-violating).

[1] e.g. https://packages.debian.org/sid/nvidia-driver

You are correct. There is no sign of a vote. That inaccuracy aside, the outline of the general understanding of the licensing situation is a step in the right direction. The email itself suggests that there was tentative agreement over this at DebConf 14, which is promising.

My read was that there was agreement at DebConf 14 among the ZFS-on-Linux maintainers specifically, not necessarily that they'd gotten buy-in from the wider Debian community. It's somewhat unclear though.

Sorry for introducing that inaccuracy when I posted the title. I mistakenly interpreted "As agreed at DebConf 14, Debian ZFS on Linux Maintainers have concluded..." as the equivalent of a vote.

I stopped using ZFS when I learned that using non-ECC memory was dangerous and could corrupt sane data.

The filesystem that you use has nothing to do with the opportunity for bit flips in non-ECC RAM to cause issues. If you are concerned about the effect of bit flips in non-ECC memory on filesystems, then the solution is to use only systems that have ECC memory. There is no other way to deal with that problem.

I'm confused by this statement. Using non-ECC memory is dangerous and does corrupt otherwise sane data. Using or not using ZFS doesn't change this danger.

Other filesystems won't try to correct data that is sane on the HDD but got corrupted when loaded into memory.

ZFS does not do this. Your assumptions about what other FSs do are inaccurate.

Consider: you load data from ext4 into RAM, it gets corrupted. You change some bytes, then save it. Corrupted data is then written to the disk. Could be anything inside the allocation unit size, which is 4K generally - not insubstantial.

ZFS's worst case behavior is exactly the same: it can't protect you from corruption in RAM after checksum validation if you then ask it to save that data to disk.

There is no difference: if you do not have ECC RAM, you are vulnerable to bitflips corrupting your data - and they will have been.

ZFS won't corrupt data which gets altered in memory due to a bitflip but is only ever read. It can't - because in-memory bitflips can't be detected. Even if at some point during checksum validation there was a mismatch, the restore is done from checksum protected parity/mirror image data. And if the restore block were corrupted, the checksum won't match and ZFS will rebuild the block correctly next time.
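A toy model of the read path described above may make this easier to follow: every copy is verified against a checksum kept separately from the data, and a copy that fails is rewritten from one that passes. This is a simplified sketch, not ZFS code. The class name is invented, two in-memory byte arrays stand in for mirrored disks, and SHA-256 is used only as a convenient stand-in checksum.

```python
import hashlib


def checksum(data):
    """Stand-in block checksum."""
    return hashlib.sha256(data).hexdigest()


class MirroredBlock:
    """One logical block stored on two 'disks', checksum kept separately."""

    def __init__(self, data):
        self.copies = [bytearray(data), bytearray(data)]
        self.expected = checksum(data)
        self.repairs = 0

    def read(self):
        for copy in self.copies:
            if checksum(bytes(copy)) == self.expected:
                good = bytes(copy)
                # Rewrite any copy that fails verification from the good one.
                for i, other in enumerate(self.copies):
                    if checksum(bytes(other)) != self.expected:
                        self.copies[i] = bytearray(good)
                        self.repairs += 1
                return good
        raise IOError("all copies fail checksum verification")


block = MirroredBlock(b"important database page")
block.copies[0][3] ^= 0x01            # simulate corruption of one copy
data = block.read()
print(data, block.repairs)
```

The read returns the intact data and repairs the damaged copy along the way. If both copies were damaged, the read would fail loudly instead of handing back bad data.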

If you do not have ECC RAM, every filesystem will potentially corrupt your data. ZFS is more resilient than pretty much all of them even in this case.

The worst case with ZFS and RAM corruption is that you can lose your ENTIRE zpool. As there are no ZFS recovery tools available, this means your data is as good as gone, or a minimum 15k spend to get it re-assembled.

This makes it far more risky to run ZFS with non server grade parts.

It is possible for this to happen with other filesystems too. The only thing is that when it happens with another filesystem, it is not news.

It becomes news because there are no recovery tools available for ZFS as there are for most other file systems, meaning that the possibility of losing ALL data due to RAM corruption becomes a real threat.

Yes it does or I don't understand that: http://louwrentius.com/please-use-zfs-with-ecc-memory.html https://pthree.org/2013/12/10/zfs-administration-appendix-c-...

Always fun to post on HN and get downvoted ;o

The first article doesn't contribute useful metrics - it simply infers that it would be bad. Here's the thing: have you run filesystem recovery tools lately? Because if you're at that point, your data is toast.

It's toast if you were running mdadm or RAID. It's toast if a file gets deleted and recovered by ext3/4 (good luck figuring out which chunk you're looking at in lost+found).

The second article is simply stressing my original point: ZFS cannot protect you from bad RAM. No file system can. If data is corrupted in RAM, ZFS will checksum that and write it to disk. This is exactly what will happen with any other filesystem.

ZFS storage pools can recover from fairly dramatic failures. You can lose a disk out of a non-redundant stripe and still have complete metadata (just incomplete data, but checksums tell you exactly what you lost). Unimportable pools are not something that easily happens without severe damage - a corrupted uberblock can be recovered by using zpool import -T to rewind a txg and get back a verifiably complete copy of your data.

This idea that somehow the failure mode of other filesystems to faulty RAM is better is destructive fiction. You have no idea if your data is valid on disk. You have no way to verify if the data is valid in the first place. And the degree and number of failures required to wipe out a pool would also wipe out any other storage system. The idea that recovery tools will save you in a catastrophic failure (of what kind?) is laughable. This can be trivially explored by trying to recover a deleted file in ext4. Sure, it can technically be done. Good luck with doing it more than once.

As far as I've understood, the gist of the "ZFS without ECC is scary" mantra is that one bad memory location (say a flaw that results in a word always getting zeroed) might corrupt all your on-disk data due to scrubbing ("read data, compute checksum, write data, repeat"). Is this in fact a non-issue with ZFS?

Read data -> compute checksum -> check against on-disk checksum -> rebuild from parity -> repeat.

But this also wouldn't be a random failure. ECC RAM with a consistent defect would give you the same issue. It would also proceed to destroy your regular filesystem/data/disk by messing up inode numbers, pointers, etc.

Your scenario would require an absurd number of precision defects: ZFS would have to always be handed the exact same block of memory, where it always stores only the data currently being compared, and then only ever uses that memory location to store the rebuilt data. And then also probably something to do with wiping out the parity blocks in a specific order.

This is a weird precision scenario. That's not a random defect (bit-flip, which is what ECC fixes) - that's a systematic error. And I'd be amazed if the odds of it were higher than the odds of a hash collision given the number of things you're requiring to go absolutely right to generate it.

EDIT: In fact I'm looking at the RAID-Z code right now. This scenario would be literally impossible because the code keeps everything read from the disk in memory in separate buffers - i.e. reconstructed data and bad data do not occupy or reuse the same memory space, and are concurrently allocated. The parity data is itself checksummed, as ZFS assumes it might be reading bad parity by default.
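For anyone who wants to poke at the read, checksum, compare, rebuild-from-parity loop described above, here is a toy model in Python. It is emphatically not the RAID-Z code: single XOR parity, SHA-256 as the checksum, and the `stuck_at_zero` and `load` helpers are all assumptions made up for this sketch. It models one bad memory location that zeroes part of whichever block gets read into it, with the rebuilt data living in its own buffer and being verified before anything is written.

```python
import hashlib
from functools import reduce

def checksum(block: bytes) -> bytes:
    # SHA-256 is a stand-in for the filesystem's checksum.
    return hashlib.sha256(block).digest()

def xor_blocks(blocks) -> bytes:
    # Byte-wise XOR across equally sized blocks (single-parity model).
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

def stuck_at_zero(block: bytes, offset: int = 8, width: int = 8) -> bytes:
    # Hypothetical defect: one 8-byte word in the buffer always reads as zero.
    buf = bytearray(block)
    buf[offset:offset + width] = bytes(width)
    return bytes(buf)

def scrub(data_blocks, parity, sums, load):
    """Toy scrub. load(i, raw) models reading on-disk block i into RAM."""
    repaired = []
    for i, raw in enumerate(data_blocks):
        buf = load(i, bytes(raw))
        if checksum(buf) == sums[i]:
            continue
        # Rebuild into a separate buffer from the other blocks plus parity.
        others = [load(j, bytes(b)) for j, b in enumerate(data_blocks) if j != i]
        rebuilt = xor_blocks(others + [bytes(parity)])
        # Only write the rebuilt block if it matches the stored checksum.
        if checksum(rebuilt) == sums[i]:
            data_blocks[i][:] = rebuilt
            repaired.append(i)
    return repaired

originals = [bytes([i + 1]) * 64 for i in range(3)]
data_blocks = [bytearray(b) for b in originals]    # on-disk data, all good
parity = bytearray(xor_blocks(originals))
sums = [checksum(b) for b in originals]

# Block 1 always lands in the bad memory location when it is read.
def load(i: int, raw: bytes) -> bytes:
    return stuck_at_zero(raw) if i == 1 else raw

repaired = scrub(data_blocks, parity, sums, load)
```

In this model the scrub "repairs" block 1 by rewriting it with content identical to what was already on disk, so nothing is damaged. A rebuilt block that fails the checksum is never written at all.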

Thank you! I was always a bit puzzled why the Sun engineers would not verify against the on disk checksum, but it made some sense that in a server setting ECC can be assumed.

By the way, the above "insane" scenario comes from the FreeNAS forums and is used to motivate why ZFS should have ECC RAM. I am glad to hear that it is much less of an issue than it's made out to be.

The mechanism that you describe should be impossible to achieve with one bit-flip. I have no idea how you came to think this.

There's a lot of misconception spread by FreeNAS forums. I suspect that's where he read it from:


The author there makes some crazy assumptions like this: "But, now things get messy. What happens when you read that data back? Let's pretend that the bad RAM location is moved relative to the file being loaded into RAM. Its off by 3 bits and move it to position 5."

This is indeed where I read it. When googling to find it again I found a similar story at:


Any good ideas on how one might fix this incorrect "common wisdom"?

Honestly no idea, except just pointing it out when someone mentions it.

The whole ECC recommendation is due to ZFS (unlike other filesystems) providing guarantees of data correctness. But as you know, while ZFS can discover data corruption on the disk thanks to checksums, it can't guarantee data correctness in RAM because, like any program, it is bound to trust it. That's why ECC RAM is highly recommended.

I think there is some confusion about the impact of bit flips on reads and on writes. Those are separate cases. That blog post is absolutely correct, but it talks about writes. People here are talking about reads.

The forum post and blog both state that data also gets corrupted even more during reads. They also use a strange assumption that data would be written to RAM ignoring byte boundaries (i.e. shifted by 4 bits).

You are describing what undefined behavior might manifest in response to a bit flip. It is undefined for the reason that anything can go wrong, although I have trouble seeing how the particular scenario you describe here would cause a problem. If the checksum verification fails on a read because of a bit flip in the data, ZFS will just load another copy.

That being said, it is not a good idea to use hardware that lacks ECC RAM. No filesystem developer will claim that their kernel driver can operate properly without ECC RAM because it is not possible. Email the linux-fsdevel mailing list if you want confirmation.
