Last year, when I wanted to build a 10-disk ZFS server in RAIDZ2 and researched the data-integrity and fault-tolerance aspects, I found this video of guys literally inducing hardware failure by running electricity into a motherboard attached to a RAIDZ2 array:
I think irradiating the system would be a better test, since it would induce random bitflips anywhere in the system while it keeps running and reading/writing data, instead of inducing a massive fault that will almost immediately stop any IO operations.
Worked at a major switch vendor as one of the 50 original staff. We put our gear through a ton of tests - 4 corners - low power high heat, high power low heat, etc. One of the most interesting was taking the device to a government lab here in NorCal in order to shoot radioactive particles into the board while passing packets. It was pretty cool. The goal was to test TCAM and switch OS RAM to see how they handled bit flips. It passed. The board itself, well, we could not have it back for a few weeks to allow it to "cool". The other thing of note: winning a US gov. contract required us to enable commands to override any heat-related auto shutdown. We were told "look, if the device is in a situation where this is an issue, burn it up but keep it working as long as you can". This was for navy ships.
It wouldn’t matter. An error would have to corrupt both the data and the checksum in such a way that the corrupted checksum is still valid for the errant block.
Not sure why the downvotes; this is in fact how it works and why ZFS doesn't need ECC RAM. It doesn't matter how much corruption you experience, it always comes down to: does the checksum of the block of data read from disk match the checksum stored when that block was written? If they don't match, the block is rebuilt from parity drives. If you're going to downvote, at least say why.
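To spell the mechanism out for anyone downvoting, here's a minimal toy sketch (my own illustration, nothing to do with ZFS internals; all names are made up) of "verify the data you read against the checksum stored at write time, and rebuild from a redundant copy on mismatch":

    # Toy self-healing read: the checksum was stored (in the parent block
    # pointer, in ZFS terms) when the block was written; any copy that
    # fails to match it is rebuilt from a copy that does.
    import hashlib

    def checksum(block: bytes) -> bytes:
        return hashlib.sha256(block).digest()   # stand-in for fletcher4/sha256

    def read_with_self_heal(copies: list[bytearray], stored: bytes) -> bytes:
        good = None
        for copy in copies:
            if checksum(bytes(copy)) == stored:
                good = bytes(copy)
                break
        if good is None:
            raise IOError("unrecoverable: no copy matches the stored checksum")
        for copy in copies:                     # heal bad copies in place
            if checksum(bytes(copy)) != stored:
                copy[:] = good
        return good

The point being: it never matters how the on-disk copy got corrupted, only whether it still matches the checksum recorded at write time.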
You still want ECC to make sure the data doesn't get corrupted on the way in or on the way out. ZFS will store what you send it, and it isn't responsible for your data once it hands it to you.
The best part about ECC RAM, IMO, isn't the correction, but the checking. I want to know when my RAM goes bad, and I'd prefer to know before ZFS detects a problem.
I'm so sick of this argument every time a ZFS article comes up.
This is by the inventor, Matthew Ahrens: “There’s nothing special about ZFS that requires/encourages the use of ECC RAM more so than any other filesystem.”
The comment from the inventor isn't supporting your position. ZFS doesn't need it any more than other filesystems, but if you care about data integrity, then you want ECC memory in your system. Yes, ZFS can take care of errors while the data is inside the ZFS API boundary, but the data outside of that boundary is at risk, unless you have some higher-level error correction protocol going on.
Don't make the mistake of thinking non-ECC RAM doesn't have built-in error protection; it's just not as robust. Random memory bit flips are pretty rare, but it takes multiple at the same time for ECC to be effective. Additionally, the thing that makes no sense about ECC RAM is that you have no parity checks when data is in transit to and from the CPU(s). So a little corrosion on a trace or a bad capacitor, and ECC RAM won't be able to do anything about it.
Furthermore, there is error correction at the chip level. I think a fitting analogy is this: a single-engine airplane might seem much more dangerous than a twin-engine airplane, but most single-engine airplanes have dual fuel pumps, dual ignition sources, and additional auxiliary devices sharing the same engine body. Will it ever be as safe as a twin-engine plane? No, but it's not as dangerous as having no failsafes.
Modern RAM is believed, with much justification, to be reliable, and error-detecting RAM has largely fallen out of use for non-critical applications. By the mid-1990s, most DRAM had dropped parity checking as manufacturers felt confident that it was no longer necessary.
"The SDRAM and DDR modules that replaced the earlier types are usually available either without error-checking or with ECC (full correction, not just parity)."
The thing about ECC vs. non-ECC: if you don't have ECC, you don't know silent bit corruption is happening, and when it's finally (if ever) discovered, there's nothing that can be done. ECC (SECDED) keeps data safer than it would be otherwise.
ECC is necessary for anything real or important based on the financial cost of losing data. For everything else, it can be optional.
Plus, don't forget the security issues of bitsquatting and other attacks that are the real results of silent bitflips.
It goes a lot further than this though, with unbuffered RAM and registered RAM in the mix affecting performance and such. Error detection and error correction are different and require different levels of parity codes. Standard RAM at one point (SIMMs in the 90s) detected errors but won't ever correct them. ECC can detect more error patterns, and correct error patterns up to so many bits.
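If it helps, here's a toy extended-Hamming SECDED code to show that distinction concretely. This is just an illustration: real ECC DIMMs protect 64 data bits with 8 check bits rather than 4 with 4, but the single-error-correct / double-error-detect behavior is the same.

    # Toy SECDED: extended Hamming(8,4). One flipped bit is corrected,
    # two flipped bits are detected but cannot be corrected.
    def encode(d):                 # d: four data bits
        p1 = d[0] ^ d[1] ^ d[3]
        p2 = d[0] ^ d[2] ^ d[3]
        p4 = d[1] ^ d[2] ^ d[3]
        code = [p1, p2, d[0], p4, d[1], d[2], d[3]]    # positions 1..7
        overall = 0
        for b in code:
            overall ^= b
        return [overall] + code                        # position 0: overall parity

    def decode(w):                 # w: eight received bits
        syndrome = 0
        for pos in range(1, 8):
            if w[pos]:
                syndrome ^= pos
        overall = 0
        for b in w:
            overall ^= b
        if syndrome == 0 and overall == 0:
            return "ok", w
        if overall == 1:                               # single-bit error
            fixed = w[:]
            fixed[syndrome] ^= 1                       # syndrome 0 = the parity bit itself
            return "corrected", fixed
        return "double error detected", w              # detectable, not correctable

    w = encode([1, 0, 1, 1])
    w[5] ^= 1
    print(decode(w)[0])            # corrected
    w[2] ^= 1
    print(decode(w)[0])            # double error detected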
The issues are more subtle beyond hardware though - your OS kernel has to understand the error reporting from the BIOS/EFI, and if the motherboard manufacturer decided not to wire in the checksum-fail signal, you will run blind to data corruption that's potentially really subtle on typical consumer hardware. With some BIOSes, RAM failures result in a hard panic and will hard-reboot the machine (the idea being that a crash is better than continuing with an error).
For me, ECC (both UDIMM and RDIMM) today is so inexpensive that it's a no-brainer for business builds.
Can't speak for the downvoters, but I think what beatgammit is saying is that, in cases where you care about data corruption in ZFS, you also care about data corruption in parts of the system that aren't ZFS. Like the memory maps of every piece of application software. Nothing about ZFS needs ECC, but the circumstances ZFS claims to protect against need ECC.
Exactly! You build a system with ZFS because you need better defense against corruption, but without ECC, RAM will be the weak point in the system. So if you want the same level of guarantees, you need ECC - not because of ZFS.
How so? You have a file from source A acquired via network, external storage, disc, etc., and you want to store it on a ZFS array. If the RAM corrupts the file before ZFS can store it, then the destination checksum will not match the source checksum. In reverse, if you load the file from ZFS and RAM corrupts the data before transit, then the transmitted file's checksum will not match the source's checksum. In either event, a retry will likely solve the issue, as it's unlikely for the same exact bytes of physical memory to be used over and over again.
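Concretely, something like this on the application side - and that's the catch: nothing below is provided by ZFS itself, and in practice the re-read may be served from the page cache, so it isn't a complete defense either (paths are made up for the example):

    import hashlib, shutil

    def sha256_of(path, chunk=1 << 20):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            while data := f.read(chunk):
                h.update(data)
        return h.hexdigest()

    def copy_verified(src, dst, retries=3):
        want = sha256_of(src)
        for _ in range(retries):
            shutil.copyfile(src, dst)
            if sha256_of(dst) == want:     # verify the landed copy, retry on mismatch
                return
        raise IOError(f"checksum mismatch copying {src} -> {dst}")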
> the destination checksum will not match the source checksum
There is no source checksum.
> transmitted file's checksum will not match the source's checksum
Most applications don't checksum their files when reading them.
> a retry will likely solve the issue
You can't retry; the only copy of the edited file is corrupt, and it's not possible to (automatically, in the general case) distinguish edits you made since loading the file out of ZFS from memory corruption that happened since loading the file out of ZFS.
No, ZFS never protects data in transit. It makes one promise: that the bytes written to disk will be the same bytes read from disk at a later point in time. ZFS can't be responsible for being given bad data. Corrupt data looks just as writable as normal data to a storage controller. The article I linked shows how even bad RAM isn't damaging to the integrity of a properly configured ZFS file system.
1. ZFS is heavy on memory, vastly increasing the odds that a failure will happen during filesystem operations. This is something many learnt while both encrypting their drives and overclocking their CPU: encrypting your drives makes the CPU spend orders of magnitude more time processing filesystem data, and if an error is about to happen, it is orders of magnitude more likely you will end up with corrupt data because of it. The same goes for memory.
2. If an error occurs on ZFS the consequences are worse, because there are very limited recovery-tools available for ZFS. The official checkdisk is "just restore from tape, it's quicker than fsck anyway". Very enterprise oriented.
Now you might be okay with the above. But the risk is greater and the potential consequences are worse. Not because of ZFS being special, but because of the way it makes use of your hardware.
Also keep in mind that a lot of people pick ZFS to reduce potential corruption so the addition of ECC is par for the course.
this is cool, but I have to wonder how much of this is ZFS and how much is the hardware. don’t get me wrong: it’s impressive, but to make these claims about ZFS you need a scientific approach that would involve a control group, various types of motherboards, memory, power supplies, etc., plus thorough reproducibility of the conditions.
Anecdote, but I once had a broken SATA controller that would send borked data to the disks. That was tricky because the disks were correctly writing the data sent to them, and from the OS side all the operations completed successfully. I realized it only because there was a continuous stream of checksum errors on all the disks of the pool.
Same. Specifically, the fault is injected in a way which may never even manifest at the ZFS level. Any relevant failures could be more reliably replicated by injecting read/write failures and by writing random bytes to the disk. What else would you expect ZFS to correct? Whether it worked or not can be read back from the error counters.
that is just a sensational science video. there’s no testimony given that normal raid5 wouldn’t have survived exactly as well. he needed to have mentioned what happened on reboot after running a scrub.
well not exactly true. we know that the raid5 write hole won’t tolerate random power off / memory removal etc.
as someone else said, an X-ray test would have been better.
I’m really confused about what the author is trying to accomplish here. He sets up a scenario that Btrfs openly says is known to cause issues. He doesn’t necessarily come to the conclusion that one needs to trash Btrfs, but I’m not sure why you would go through this exercise prior to deployment if the exercise is pitting an undeployable configuration against something that has already been heavily battle tested. Until Btrfs development marks the RAID5/6 write hole issue fixed, this test is pointless.
I’m a little disheartened to read all of the negative comments about Btrfs in this thread as well. I’ve spent a ton of time researching Btrfs in RAID 10 for deployment on my home lab (99% Linux environment) for when it’s time to expand storage and from everything I read it seemed like it was going to be a good idea. Now I’m back to wondering if I should research ZFS again.
I took it as: in an unsupported case this is what happens and these are the errors you get. It's useful information and makes the description easier to Google when you run across it.
I have been using btrfs in RAID 10 at home for several years with no issues. From what it sounds like, he didn't really have any issues with RAID 5, but btrfs scrub took longer than the ZFS equivalent.
I'm even using btrfs with zstd compression on my laptop since it only has a small ssd (64 GB) and it makes it a lot more usable.
I think you're best off using what you're familiar with, and keeping multiple backup copies of the important stuff. I've used Btrfs for 9-10 years and haven't ever had unplanned data loss. Planned, due to testing, including intentional sabotage, for bug reporting, yes. And in the vast majority of those cases, I could still get data off by mounting read-only. I use it for sysroot on all of my Linux computers, even the RPi, and for primary network storage and three backup copies. A fourth copy is ZFS.
If you've had a negative experience, it can leave a bad taste in the mouth. People do this with food too, "I once got violently sick off chicken soup, I'll never eat it again." I wouldn't be surprised if there's an xkcd to the effect of how filesystem data loss is like food poisoning.
There is a gotcha with Btrfs raid10: it does not really scale like a strict raid 1+0. In the traditional case, you specify drive pairs to be mirrors, sometimes with drives on different controllers, so if a whole controller dies you still have all the other mirrors on the other controller and the array lives on. You just can't lose two of any mirrored pair. Btrfs raid10 is not raid at the block level; it's done at the block group level. The only guarantee you have with any size Btrfs raid10 is surviving the loss of one drive.
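A toy model of why (not btrfs's actual allocator, just an illustration of block-group-level pairing): because each block group pairs devices at allocation time, over many block groups nearly every pair of devices ends up being a mirror pair somewhere, so nearly any two-device failure hits both copies of something:

    import itertools, random

    def fatal_two_drive_fraction(num_devices=6, num_block_groups=500):
        devices = list(range(num_devices))
        mirror_pairs = set()
        for _ in range(num_block_groups):
            shuffled = random.sample(devices, num_devices)
            for a, b in zip(shuffled[0::2], shuffled[1::2]):   # this group's mirrored pairs
                mirror_pairs.add(frozenset((a, b)))
        all_pairs = [frozenset(p) for p in itertools.combinations(devices, 2)]
        return sum(p in mirror_pairs for p in all_pairs) / len(all_pairs)

    print(fatal_two_drive_fraction())   # approaches 1.0, vs 0.2 for fixed pairs on 6 drives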
Really good article. I also agree with the author: ZFS is light years ahead of Btrfs. I currently use it in my home backup server (used to be a rpi with 2 hdd, moved to rockpro64 with 4 hdd and a sata controller) and it's just great: super easy to maintain and fix, even swapping disks is not a huge endeavor.
Is btrfs still actively developed? Red Hat ditched it and I never heard that btrfs got widely popular, so I take it as abandonware after all these years of slow progress.
And they use it to deliver a beadm-like solution in SuSE. It's integrated into their package management, so you can easily boot and roll back stuff from GRUB in case of a botched upgrade. It's even more interesting with the new transactional-update servers, in which the rootfs is read-only and upgrades are applied to a new clone which is promoted to the main file system during the next boot.
Reading through most comments and I still think a few points are worth mentioning:
* btrfs has a few features that are Really Nice and missing in ZFS: the ability to have a filesystem of mixed drives and to add and remove drives at will, with rebalancing. With ZFS, growing a filesystem is painful and shrinking impossible (? still?). There has been work on it recently though.
* ZFS has an IMO MUCH cleaner design and concepts (pool & filesystems), mirrored by a much cleaner and clearer set of commands. Working with btrfs still feels like an unfinished hack. As human error is still a major concern, this is not a trivial issue.
I _have_ lost data to MD, had scary issues with BTRFS, but never had issues with ZFS in 8+ years. (The fact that FreeNAS is FreeBSD based which I'm less inclined to mess with also means that I mostly leave my appliance alone.)
Device removal was added in the recent ZoL 0.8 release, so with that you can remove a vdev that is either a sole drive or a mirror. Currently it can't be used to remove a RAIDZx vdev though. It does carry a memory overhead for a remap table, but this shrinks as old allocations are retired back to the pool.
And the first alpha for RAIDZ expansion became available a week or two ago (for going from, say, a 6-disk RAIDZ2 vdev to a 7+ disk RAIDZ2 vdev). Just in case anyone decides to play with this feature: the on-disk format is not stable yet, so only use it on test pools.
I've been using btrfs for years with no problems. I currently use a btrfs volume for my backup drive. It mounts, accepts the backup, takes a snapshot and unmounts.
Has anyone seen how btrfs handles send/receive? It's pretty awesome - like rsync, but it only sends changed blocks.
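The rough idea, as a toy sketch of my own (btrfs actually walks the snapshot metadata by generation number rather than comparing data, but the effect is the same - only what changed since the parent snapshot goes over the wire):

    def incremental_send(parent_blocks, new_blocks):
        # yield (index, data) only for blocks that changed since the parent snapshot
        for i, (old, new) in enumerate(zip(parent_blocks, new_blocks)):
            if old != new:
                yield i, new
        for i in range(len(parent_blocks), len(new_blocks)):   # newly appended blocks
            yield i, new_blocks[i]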
I first started using zfs in 2005 when my hardware raid failed. Since then I’ve moved the disks to a new server in 2009 and replaced all the disks twice (one at a time, resilvering each). Finally I built a new server this year. This time I’m using zfs send/recv to copy data to the new disks. The old server is still working 10 years later & its latest disks have been in use 24x7 for over 5 years now. Zpool scrub on the old server takes days now (compared to one hour on the copied zpool on the new server).
Even back in 2009 I heard some Linux enthusiasts tell me how btrfs was going to be better than zfs!
> Even back in 2009 I heard some Linux enthusiasts tell me how btrfs was going to be better than zfs!
What's sad is that it should have been; the CDDL situation is really unfortunate. Honestly, even if BTRFS performance were worse, it would be worth it in order to have a fully-supported mainlined FS... but instead its reputation is for data loss, so it's dead (yes, I know it works if you're careful, but that's a terrible quality in a filesystem).
I run bcachefs on all my personal machines and it's been quite a joy. As fast as ext4 with most of the features of btrfs. I can't wait until it's mainlined.
Personally, I wouldn't trust it for actual production yet, though Kent does claim it is ready. I'd want it mainlined before deploying it on my servers.
But in my experience, bcachefs is incredibly stable. The only issues I had were when mismatching the tools and the kernel, leading to fsck being confused when it was outdated compared to the kernel. But even with that, I have never lost any data or had it so much as hiccup.
ZFS is the only filesystem I've ever had completely crap out on me without any indication of a hardware issue[1]. I don't recall the error message I got at this point, but asking around about it on the various ZFS IRC channels, the answer was invariably "I hope you have backups". This was probably a fluke, but it did sour me a bit.
1: Btrfs refused to mount at one point due to a bug; the helpful folks on #btrfs walked me through the process of downgrading my Linux kernel to get it into a working state again. At this point I switched away from btrfs.
I use a FreeNAS box at home, from iXsystems. Good stuff. I paid a bit more for ECC. Why? My family's history: all these pictures are kept on the NAS (backed up to Backblaze) and a local drive. In the last 8 years, in Photos I have encountered maybe 11 issues where the picture was screwed up. Each time, I looked at the NAS copy, snapshots, etc., and was able to recover the correct photo. The cost difference over time is a few cups of coffee. It is worth it. If you can afford the NAS, I do not understand how you cannot afford the ECC.
> Myth: ZFS requires tons of memory [...] The only situation in which ZFS requires lots of memory is if you specifically use de-duplication
It's also totally worth tons of memory when you use that feature with intent. If you use dedup in combination with automated snapshots, you get the most space-efficient, fast and reliable incremental backup solution in existence - yes, it will consume your whole server, that's the cost (it works best as a separate backup server).
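For a sense of scale, a back-of-the-envelope sketch of why dedup is so memory-hungry: the dedup table needs an entry per unique block and really wants to live in RAM (ARC) to keep writes fast. The ~320 bytes/entry figure below is the commonly quoted rough size of a DDT entry, so treat the result as an estimate only:

    def dedup_table_ram_gib(pool_tib, recordsize_kib=128, bytes_per_entry=320):
        unique_blocks = (pool_tib * 2**40) / (recordsize_kib * 2**10)
        return unique_blocks * bytes_per_entry / 2**30

    print(round(dedup_table_ram_gib(10), 1), "GiB")   # ~10 TiB of 128K records -> ~25 GiB of DDT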
I have ZFS on another box that failed badly but lost no data... That was with bad RAM and a motherboard that was on the "do not use" list for making ZFS NAS boxes. It always recovered the errors.
Currently using it in RAID mode to hold large data sets and CCTV footage for my homelab, on three drives that have SMART warnings for age, without any issues at all for the past two and a half years and two Ubuntu upgrades.
I mentioned this elsewhere, but as a counter-anecdote: ZFS is the only filesystem that has ever failed and been rendered unrecoverable on me in all my life. That being said, btrfs was quite janky, and I had to recover it on more than one occasion.
I've been using btrfs as my primary filesystem on all of my computers (other than my phone) since 2013.
Initially I had performance issues due to the default block size back then (4 KiB rather than 16 KiB; ended up just rebuilding the filesystem). There were some other issues back then regarding rebalancing and scrubbing stability, and sometimes I would have to run "btrfs-zero-log" before mounting, but I haven't had those sorts of issues for a while.
I've had multiple drive failures on my systems, and btrfs seemed to handle them as expected. I use "raid1" for metadata and "single" for data. As far as I can tell, all of the errors were due to bad blocks on physically failing drives, and btrfs was able to indicate which files were affected in all of those cases.
I've also used it to "fix" the Raspberry Pi SD card corruption issue by just running btrfs in "dup" mode—prior to that, the SD card would randomly end up with blocks being zeroed and the system would obviously start to crash, whereas btrfs just fixes the blocks up in place as it accesses the data the next time.
I've been using btrfs on my home NAS (a custom build) and on my Linux laptops with SSDs since 2012. I run Fedora on everything which I think helps because I've never been stuck on an old kernel with old, stupid bugs. Only the freshest bugs for me!
The NAS had two WD Red drives fail over the years with bad blocks. Not at the same time. They were detected and I replaced them with new bigger drives. Since 2012 I added more drives until now it is at 6 drives in RAID10: 4x6 TB and 2x4 TB.
I've had out of space errors on my laptops which I had to repair by adding more storage so I could successfully rebalance. Stealing the swap partition worked great for that.
Never lost any data or had any corrupt files.
Also, I always build with quality Gold standard PSUs, ECC RAM, and run my systems always plugged into an APC UPS. So I've never tried to recover a system that crashed after a lightning storm, built with WD Green drives in external USB enclosures plugged into a $2 power-strip and an HP "desktop" built out of an old laptop motherboard and the cheapest PSU HP could dig out of the trash pile.
Catastrophic: my BTRFS died on me twice after a hard reset, and the recovery tools didn't help. Luckily I had backups around, so switching to ZFS 0.8 with encryption was a breeze.
NAS usage over mdadm RAID1, for about 10 years (and 1 or 2 drive failures, 1 or 2 bad RAM dies): I didn't lose a file that I could definitely attribute to btrfs. I did lose a piece of a file I keep on it, but I cannot conclude it didn't arrive there damaged.
I used to hit out of space errors on rebalancing at high usage levels, but since around kernel 4.0 the only time I hit one was when I altered data duplication settings.
I don't understand why anyone would run btrfs with MD RAID. That configuration loses the ability to recover from corrupted blocks. All that MD can do is in response to actual reported drive errors. If one of the drives accepts the write but then later silently returns the old data, which is a failure mode I have actually seen, MD can't do anything reasonable to fix it.
Catastrophic. BTRFS failed twice with a large raid array when one of the drives started developing bad sectors (two different drives at two different times, same result, entire raid array unrecoverable).
After catastrophic failure number two I decided to never run BTRFS again.
Honestly not bad. About 7 years of use on a personal machine, single volume. It's simpler than ZFS and allows cheap incremental backups with send/receive.
One year ago I built a RAID56 array with 5 used 2TB drives from eBay. Risky move, but it has gone smoothly so far. It's only for home storage, and the important stuff is backed up off-site anyway.
One Seagate drive died with lots of bad sectors. The replace command took really long even with the "-r" flag (don't read from the replaced drive, in theory), so I ended up unplugging the drive and rebalancing from there.
I have high hopes for bcachefs. We have a real need for a modern FS with tiered caches. I backed the project but I don't have the skills or time to help.
Yes, I'm aware of that. Sorry, bad wording. I chose btrfs because it's simpler to set up. It is also lighter on memory for use on personal machines. I know ZFS can work with less memory, but at a performance disadvantage.
I found the need for plenty of RAM a little overblown for ZFS. My anecdotal experience: I repaired my home file server by replacing faulty RAM with a 2G stick (because that was the only spare piece I had) as a temporary solution. But as often happens with temporary solutions, it lasted longer - for slightly less than two years. The prefetch feature was automatically disabled, but there were no delays that my perception could register. I suppose the speed degradation would be visible on a heavy-lifting server, but for personal use specifically one doesn't need any special amount of RAM.
I thought I did things right and had setup regular scrubbing. The two external drives that were btrfs mirrors of each other got out of sync, I assume caused by a power outage (the host was a laptop and has a battery, a rare scenario I guess, so I can see how this happened). Due to not reading the man page correctly, I then messed things up further...
This was like a year ago and I still haven't cleaned it all up due to: (1) lack of a third drive to restore stuff to (I'd prefer to leave the damaged filesystems in read-only), (2) a lack of time, (3) there now seems to be a bug in the kernel driver for my particular drives (perhaps it wasn't a power outage after all?), and (4) because I now live far away from the physical location of the drives.
Instead of using Btrfs to fix problems, I am now looking for the simplest possible solution. I was thinking of just using plain old ext4 and relying on one local and one off-site backup. It will be more work to manually look at the state of my files when a failure happens, but with something like Restic I'm at least confident that any completed backups are sound (as well as secure: Restic is the first system that I've found to be efficient while also trusting the crypto enough to back up to untrusted locations). The only open question is what to do about bit rot on the main system, since any bit rot on ext4 would just be backed up as if it was the original data. So then... maybe I'll go with Btrfs after all, using only the checksumming feature should not cause any bugs right? I haven't decided yet. First up is trying to fix this driver issue somewhere next week.
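For the bit-rot question specifically, the stopgap I'd reach for on plain ext4 is just a checksum manifest - a minimal sketch of my own, nothing Restic- or Btrfs-specific (paths and the manifest name are made up):

    import hashlib, json, os

    def sha256_of(path, chunk=1 << 20):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            while data := f.read(chunk):
                h.update(data)
        return h.hexdigest()

    def build_manifest(root, manifest="checksums.json"):
        sums = {os.path.join(d, n): sha256_of(os.path.join(d, n))
                for d, _, files in os.walk(root) for n in files}
        with open(manifest, "w") as f:
            json.dump(sums, f, indent=1)

    def verify_manifest(manifest="checksums.json"):
        with open(manifest) as f:
            sums = json.load(f)
        return [p for p, want in sums.items() if sha256_of(p) != want]

Run the verify step before each backup and you at least know whether the bytes you're about to back up still match what you originally stored.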
For the years I ran btrfs, it was a great way to test my backup and restore processes; it also gave me a nostalgia trip back to the mid-to-late nineties Linux experience, where you had to carefully select kernels based on which combination of features and regressions you could live with.
My cold storage NAS with 12 old HDDs (400GB to 1TB) acts as my btrfs test-env. It's all old hardware: Disks are salvaged from the company [after being shredded], the system is an Athlon Dual Core 5000+ on some MoBo with 4 SATA ports; no ECC, recently replaced the LSI 8 disk SAS controller with a 3ware 9560 12 disk variant (both as JBOD).
No problems in the past few years, though it's only seeing a few hours uptime per week. It just acts as backup for data stored elsewhere, so a crash would be annoying, but not fatal - but I'm contemplating getting some new 4TB disks (now that I have a suitable controller) and putting a Plex on the thing.
I haven't lost any data on BTRFS, but on a couple drives set to do frequent snapshots they ended up getting slower and slower until hanging for up to many seconds at a time. And this was with only 30-40 snapshots existing at a time.
And then in another instance it got all confused and impossible to mount read-write. It kept trying to resume some kind of transaction that wouldn't complete even with dozens of hours of CPU time.
But on the plus side it deduplicates properly, without extreme overhead.
The performance differences found between ZFS and Btrfs are curious, as I've always had the reverse experience, with ZFS being slower by maybe 15%. Scrubbing on md raid does take a while: every block must be checked, since it has no idea which blocks are in use or not; although a write-intent bitmap would avoid a complete resync after an unclean shutdown.
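Rough arithmetic on the scrub difference, with made-up but plausible numbers (8 TB drive, ~150 MB/s sustained, pool 40% full):

    def scrub_hours(bytes_to_read, mb_per_s=150):
        return bytes_to_read / (mb_per_s * 1e6) / 3600

    device = 8e12            # whole 8 TB device
    used   = 0.4 * device    # only the allocated blocks
    print(f"md check : {scrub_hours(device):.1f} h")   # must read every sector
    print(f"zfs scrub: {scrub_hours(used):.1f} h")     # walks only allocated blocks

The gap obviously shrinks as the pool fills up; on a nearly full pool the two end up comparable.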
I have a joke NAS and I use ZFS for the disks. Unfortunately, at some point I started having problems where deleting a file will take half a minute (each file is only around a gigabyte). I have no idea why performance is killed like this, and nobody I asked on IRC seems to know why this is happening.
Oh, yeah: ZFS at least used to degrade catastrophically in terms of performance once you got above... I think 80%? Granted, that was literally Solaris, so I don't know if it's been fixed since then.
One of my long-time Btrfs raid1 backups is 99% full. Writes still go at full speed (the much reduced speed you expect to get writing to the interior tracks of spinning disks). But it does avoid that nastiest form of fragmentation, because it only receives snapshots from another filesystem, making it act like a tape backup until full. Deleting snapshots produces large contiguous holes of space for sequential writes.
This form of snapshotting also scales up well. I've had hundreds of snapshots with no performance reduction, and deleting them is fast too. In cases with many changes between snapshots, it results in much more complicated metadata, and then, while making a snapshot is still cheap and fast, deleting older ones starts to become more expensive because of the backref walk.
RAID5 has been excessively risky and obsolete for a long time - not enough parity, and too much risk of data loss from an unrecoverable read error during a multi-drive rebuild with massive drives (like an eight-drive 8TB RAID5). A better test for production use would be RAIDZ2.
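Back-of-the-envelope on the URE-during-rebuild risk. Vendor spec sheets commonly quote one unrecoverable read error per 1e14 or 1e15 bits read; real-world rates are debated, so treat this as a worst-case illustration rather than a prediction:

    import math

    def p_read_error(tb_to_read, ure_per_bit=1e-14):
        bits = tb_to_read * 1e12 * 8
        return 1 - math.exp(-bits * ure_per_bit)   # ~ 1 - (1 - p)^bits

    # Rebuilding one failed drive in an 8 x 8TB RAID5 means re-reading
    # the other seven drives in full:
    print(f"{p_read_error(7 * 8):.0%}")            # ~99% at 1e-14, ~36% at 1e-15

With RAIDZ2/RAID6, a single unrecoverable read during the rebuild can still be repaired from the second parity, which is the whole point.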
I agree about RAID5, but personally I don't think RAIDZ2 really works as an alternative. Everyone should use RAID10.
The reason for this actually has nothing to do with data protection (although not having to read your entire disc set to rebuild is nice). The reason is that it's hard to figure out who RAID5/6 is actually for. The enterprise is all on RAID10 (hence no one fixing the BTRFS write hole). So you'd think it would be for enthusiasts who don't want to purchase as many discs, right? Well, in my case at least, parity disc modes are useless because they mean I would have to buy discs of exactly the same size into the indefinite future!
I started with 4TB discs, then I added 8TB discs, and now I'm looking at the newer Western Digital 10TB helium discs, which actually draw much less power than the 8TB ones. But none of those 4TB or 8TB discs have failed! So while in theory, if you stick to the drive size you start out with, you can save money with RAIDZ by using parity discs and adding a drive when you need more storage (at the cost of a decreased parity %), practically speaking a lot of that money is wasted, since you either have to pass on larger, more efficient drives or replace working older drives when you want to upgrade.
https://www.youtube.com/watch?v=vxFNBZIAClc
and they could not cause any errors. This was pretty brutal. When I saw this video, I decided I never want to use any other filesystem than ZFS ever.