Many folks I know who manage storage don't make the boot volume RAID (redundant); instead, it's some rapidly duplicatable thing like NVMe flash containing the root filesystem, and there's a replacement handy. Then you can boot the box and bring the full power of userspace to bear on the RAID repair.
I just finished another late night "thanks" to the unknown person who decided to use this approach several years ago while setting up 17 servers that I inherited. The /boot partition of these RHEL machines was put on SD cards, which have been gradually dying. So I got to boot the machine (which is 6000 km away) into recovery mode from an ISO and re-create the /boot filesystem on the HDD (RAID).
Of the 17 servers, only three now remain where I haven't had to do this. Not all of these were actual SD card failures; some were done preventively. Still, there have been several SD card failures requiring emergency repair work at inconvenient times. There have been zero RAID failures requiring similar emergency work on the systems where /boot has been migrated to HDD-based RAID.
Of course there is no replacement SD card handy with a working /boot filesystem. Actually, there is no such thing as "handy" in my case. I can reconstruct the /boot partition on the HDD faster than anybody could go to the datacenter and replace the SD card. And if there were a replacement SD card, I would need to keep it up to date manually every time the kernel or initrd is updated.
I never want to see this kind of setup again, and frankly it felt insane the first time I saw it.
Of course I should have migrated all of the machines off the SD cards by now, but my excuse is that maintaining Linux on these machines is not really my responsibility (it is nobody's responsibility apparently, although many people are interested in keeping these systems online).
When I die I want an SD-card-shaped tombstone.
This was a trend for a minute back in the late 00's. Any system that:
1. Was HA
2. Had an OS that would get fully loaded into memory (think ESX)
The idea was that SD cards take less power and were cheap enough that you could have a whole gang of spares with images ready to just drop in place. I'm not saying it was a good idea, just that I've encountered this more than once.
I don't get why you need to do so. RAID is not a substitute for backups, and shouldn't be used that way. Meaning that of course you have a backup of the data that you can recover from.
RAID is used for two things:
1. improving the performance of slow disks (at least the read performance)
2. having one or two disks fail with your system remaining completely usable, until you replace the faulty disk (which you should do as soon as possible).
The second point is fundamental to me: there shouldn't be any disruption of service whatsoever, meaning that only the sysadmin should notice the fault (besides maybe reduced performance, since you have one less drive). Database transactions that were in progress when the disk broke shouldn't fail, writes/reads on the FS shouldn't fail; the only thing that should happen is an alarm triggered in the monitoring system to inform you that a disk needs to be changed as soon as possible.
Having a RAID with manual recovery... it means that you could end up spending a Saturday evening in front of a computer bringing a system back online, and still some data corruption may have happened.
Regardless of what you think, many people keep a separate boot and data disk and don't use RAID on the boot disk. I never said anything about it being used for backups, nor did I say anything about service outages. I think my point - about what other people do, since I don't use RAID - is more about the choices different people make in terms of trusting their boot volume's recoverability in times of crisis.
By default the btrfs metadata profile is DUP (i.e. 2 copies on one device); the same can be done for data, but this reduces the usable capacity. On a normal HDD or SSD this should not be needed, but having both data and metadata as DUP/DUP has been useful on a Raspberry Pi with micro SD card storage. It's not perfect, but it increases the chances of getting the data back if the card is partially damaged due to power spikes.
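(For reference, a rough sketch of setting those profiles; the device path is a placeholder, and an existing filesystem can be converted in place with a balance:)

    # format a single device with duplicated metadata and data
    mkfs.btrfs -m dup -d dup /dev/mmcblk0p2

    # or convert an already-mounted filesystem
    btrfs balance start -mconvert=dup -dconvert=dup /mnt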
I have wondered about this myself - does keeping 2 copies of the same data on a single flash-based drive actually increase reliability? Or is the flash controller going to end up combining the two writes into the same block?
That seems like a very hacky way to accomplish something that should just work. I mean, it's better than nothing, but it seems far from ideal.
If it's your own personal box, and you're okay with that fiddling, fine. But if people are supposed to get work done while the admin gets a new disk, it's not great.
There are pretty solid reasons why you might choose to have a very simple RAID1 set for a boot volume: because you want to be sure to boot even if the huge array is degraded, because you want faster I/O for the OS install, etc.
The way I used to do it (back then, getting grub/initrd RAID-aware wasn't really doable - not sure about now) with mdadm was to use RAID1 on the boot volume such that a disk from a broken mirror was still usable by itself. Other volumes were RAID 5 or 10, etc.
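(Roughly what that looks like; device names are placeholders. Metadata format 1.0 puts the md superblock at the end of the partition, so a lone mirror member still looks like a plain filesystem to a bootloader that doesn't understand md:)

    mdadm --create /dev/md0 --level=1 --raid-devices=2 --metadata=1.0 /dev/sda1 /dev/sdb1
    mkfs.ext4 /dev/md0     # /boot
    grub-install /dev/sda
    grub-install /dev/sdb  # bootloader on both disks, so either one can boot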
How would a mirrored boot work in practice? Normally the first bootable partition gets loaded; if it's corrupt, it would keep failing at the same bad sectors.
Would it randomly pick one of the bootable disks? That way you have an n-1/n chance of avoiding the bad disk?
Back then, grub and initrd didn't understand mdadm so would be set to boot off one of the boot vols as a raw disk. Once initrd handed over to the root vol, it would understand mdadm properly.
If a disk with a boot vol failed, it would have a 50/50 chance of still booting depending on which one failed. You'd create a backup grub entry to boot off the other one manually.
Not as transparent as hardware RAID, and maybe now grub is aware enough to auto handle mdadm? I don't know - I haven't done bare metal mdadm for a decade or so.
The same way it has worked for the past 30+ years: the firmware boots from the first designated boot device; if that device isn't bootable, it moves on to the next one. The next one boots since it's part of a mirror and has all the necessary data to do so, and in the rare case where the device is "half bootable", one would simply intervene and select the next good bootable device manually.
That's fair enough for a system that isn't critical, but you can almost guarantee that the disk failure will happen at the most inconvenient time. I'd rather have a couple of cheap disks and MD software RAID so that the system boots as long as one of the drives hasn't failed. Then you can replace/fix the failing disk at a convenient time.
I have 2 NixOS-based NASes that run ZFS. Each one has 3 equal-sized 256 GB SSDs in addition to the pile o' spinning rust.
Each SSD has a small UEFI boot partition, then the rest of the space is cut in half.
The root filesystem is a 3-way ZFS mirror of the first half of the SSDs. The second half of each SSD is another 3-way mirror, this time as the "special" / metadata device for the main hard-drive-backed zpool.
Any of the 3 SSDs can fail, and it will boot up and mount the storage perfectly fine. I could also easily upgrade to larger SSDs, in place and with minimal downtime (zero downtime, if my case had hot-swappable SSD bays).
It would work just as well with only 2 SSDs, but the incremental cost of a 3rd is small enough relative to the whole that I went for it.
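(A rough sketch of that layout; the pool names, partition names, and the shape of the HDD vdev are made up, since the comment doesn't specify them:)

    # root pool: 3-way mirror across the first SSD partitions
    zpool create rpool mirror sda2 sdb2 sdc2

    # main pool on the spinning rust, plus a 3-way mirrored special
    # (metadata) vdev on the second SSD partitions
    zpool create tank raidz2 sdd sde sdf sdg sdh sdi
    zpool add tank special mirror sda3 sdb3 sdc3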
Back when Solaris wasn't using ZFS yet, we were using the Live Upgrade utility to duplicate the main boot environment on a local disk every night.
Machines booting from SAN had only one local disk; others would have 3 local disks (small and cheap ones, because application data was on the SAN anyway).
The main RAID did the job for availability in terms of disk failure; the 3rd boot disk did the job in case of user error or data corruption on the main boot env. It saved our asses a few times, bringing apps back online quickly and saving us from reinstalling or fixing stuff from a temporary live environment.
The way I like to do it is with mirroring of the NVMe storage and then LVM snapshots on top of it. Done correctly, you can boot off either of them, but they're kept in sync automagically by the OS. Then you snapshot the boot volume regularly and copy those snapshots to some kind of archival storage for backups.
That lets you then even hotswap them if you need to while keeping the root filesystem workable.
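(Something along these lines for the snapshot-and-archive part, assuming a volume group called vg0 with a root LV; names, sizes, and the archive path are placeholders:)

    lvcreate --snapshot --size 10G --name root-snap /dev/vg0/root
    dd if=/dev/vg0/root-snap bs=4M | zstd > /backup/root-$(date +%F).img.zst
    lvremove -y /dev/vg0/root-snap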
I haven't used RAID for an OS or boot partition in years. We focus on provisioning and HA. Data goes on a RAID and that's it.
Servers go boom, refuse to boot, or even trigger a few alarms. The disk is replaced, the server is booted, and provisioning paves the entire thing faster than you'd be able to debug it.
Regarding resilvering, I am amazed by ZFS' capability of quickly bringing a stale mirror back into synchronization.
I have used the FUSE port of ZFS to write to only one member of a mirror set; then, upon mounting both members elsewhere, the stale mirror was very quickly resilvered, so ZFS was able to determine only the blocks needing to be refreshed.
In btrfs, I understand that this requires a rebalance, which will read/write every used portion of the filesystem.
Scrub, not balance. It's primarily a read operation. The on-disk format has all the information needed to do an automatic abbreviated scrub, but the feature isn't implemented yet.
No, I've heard that a scrub will not reallocate missing blocks on mirrors. A rebalance was required in the article below, and this situation really requires attention.
ZFS is far friendlier in a crisis.
"We'll even manually trigger a scrub—a procedure that storage admins generally understand to look for and automatically repair any data issues... even though we manually initiated a scrub and let it finish, our array is still inconsistent and even outright non-mountable, because it ran for a little while without a disk and then that disk was re-added. The command that we were supposed to run was btrfs balance—with both drives connected and a btrfs balance run, it does correct the missing blocks, and we can now mount degraded on only the other disk, /dev/vdc. But this was a very tortuous path, with a lot of potential for missteps and zero discoverability."
Scrub checks every data and metadata block and compares it to its checksum. If a block is missing, or has a wrong transid or wrong csum, btrfs will replace the block from a good copy.
There is a potential gotcha with parity being wrong following a crash or power failure, the so-called write hole. Scrub doesn't check parity, and parity isn't checksummed. The write hole, though, is arguably two parts: wrong parity and propagated wrong reconstruction.
In the usual write-hole case, both happen silently. Wrong parity could happen after a power fail, crash, misdirected or torn write - whether the array is functioning normally or degraded. However, wrong reconstruction only happens if there's a failure to read a data strip (bad sector or full device failure).
On btrfs, only the wrong parity being written is possible. Upon reconstruction from bad parity, the resulting data is compared to csum and will fail, thus propagation doesn't happen.
To cause parity to be recomputed and rewritten, yes, you need to do a full balance. But among all the block group profiles, that only applies to raid5 and raid6. Single, DUP, raid1, raid1c3, raid1c4, and raid10 aren't affected, and a scrub does reallocate missing blocks on mirrors.
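(For reference, the two operations being contrasted, run against a mounted filesystem:)

    btrfs scrub start -Bd /mnt               # verify blocks against checksums, repair from good copies
    btrfs balance start --full-balance /mnt  # rewrite every block group, which also recomputes parity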
Remember that DM is still there for you even if you use btrfs.
Every discussion about btrfs immediately escalates into the state of the btrfs RAID support. But btrfs is useful on its own without that mode.
Should you wish to put a btrfs file system on software RAID (DM), you can still do that. The btrfs RAID modes are something else, where you cut the DM driver out of the picture. That's not going to be as well tested as plain old DM, even if it works, and it doesn't offer any functionality beyond plain DM RAID.
If the use case is storing real data, you want the data storage layer to be as well tested as possible. There's also more to software RAID than the block device. You want a mature resilvering daemon on a well tested schedule, and you need working monitoring to alert you of issues. RAID without monitoring will only delay the inevitable.
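(That is, something along these lines, with plain md software RAID providing the redundancy and btrfs sitting on top as an ordinary single-device filesystem; device names are placeholders:)

    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
    mkfs.btrfs /dev/md0
    mount /dev/md0 /srv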
wikipedia:
The device mapper is a framework provided by the Linux kernel for mapping physical block devices onto higher-level virtual block devices. It forms the foundation of the logical volume manager (LVM), software RAIDs and dm-crypt disk encryption, and offers additional features such as file system snapshots.[1]
Device mapper works by passing data from a virtual block device, which is provided by the device mapper itself, to another block device. Data can be also modified in transition, which is performed, for example, in the case of device mapper providing disk encryption or simulation of unreliable hardware behavior.
Yeah but DM doesn't have the btrfs flexible block allocation scheme that allows me to expand capacity one drive at a time and with drives of different capacities. This is very important for a home user like me.
To expand a typical RAID setup, you essentially have to fail every single drive in the array and replace them one by one with new identical drives. Expanding an array of N drives requires N resilvers, which not only takes a long time but increases the chance of actually failing those drives due to the massive I/O load. It's easier to build a second array and copy the data over. This, combined with the fact that you have to buy all the hardware up front, makes it cost-prohibitive for home users. You can't buy drives one at a time and expand capacity as needed.
> It won't boot on a degraded array by default, requiring manual action to mount it
If you want it to behave like that, then add 'degraded' to fstab. That a device is missing can have unknown reasons; the user should know better and either resolve it or allow such a boot. It's not automatic, as there's no way to inform the user that it's in a degraded state.
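(That is, an fstab entry along these lines; the UUID and mount point are placeholders:)

    UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  /data  btrfs  defaults,degraded  0  0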
I don't quite understand the use case here. If I'm setting up RAID it's because I want the system to stay up. That's the only purpose for it.
If a device goes missing for "unknown reasons", then the machine should still work, and I'll figure out what happened when monitoring pokes me and says RAID is degraded.
The use case is: enough drives have failed that your RAID is degraded. Any more data you write is not replicated, and the failure may be due to a software/hardware issue that will kill more drives soon.
It's up to you to choose at that point - is availability more important for you (add degraded to fstab), or data consistency (deal with the array first).
That's not the only purpose for it. There are three reasons I can think of that you might set up a RAID array:
* You want better uptime. (your use case)
* You want to protect from data loss. (my assumption was that this is the most common use case, but I could be wrong. This also helps with uptime because there's nothing worse for uptime than having to restore lost data from a cold backup)
* You want better performance, data integrity be damned. (RAID 0)
Booting a RAID array with a failed disk is a bad idea if you care a lot about not losing data, because now you're only one more disk failure away from losing it.
Booting from a degraded array is only a fine idea in some circumstances, not all. That's why the kernel should not default to automatically doing so; but a distro or sysadmin that has better knowledge of the broader situation (eg. presence of hot spares or a working monitoring/alert system) can reasonably change that default when the risks of booting from a degraded array have been mitigated.
Backups cannot be perfectly real-time unless they are very nearly RAID. Any time you are generating/collecting important data, you will unavoidably have some amount of that important data in the state of not yet backed up.
It's reasonable to want to preserve all the data you currently have—some of which probably hasn't been backed up yet—and not accept new data to be written with the durability guarantees the array was originally configured for silently violated.
Since the kernel has no way of knowing which volumes may contain important data that didn't get the chance to be backed up, it should try its best to maintain the original durability standards the filesystem was configured for, until some mechanism outside the kernel authorizes relaxing those standards.
> It's reasonable to want to preserve all the data you currently have—some of which probably hasn't been backed up yet—and not accept new data to be written with the durability guarantees the array was originally configured for silently violated.
I.e. (by your logic) the system should stop writes as soon as the array becomes degraded.
But this is not what happens with btrfs: it will happily continue to write data to the array until reboot.
And then suddenly it's "oh my god array is degraded!!!111 you should not write to it1111".
To add to that: I have never seen a HW RAID card refuse to boot over a merely degraded array. Changes in the configuration of arrays, or loss of more drives than the redundancy can absorb - yes, that would halt the boot and require operator intervention. An array in a degraded state? Just spit warnings to the console and boot. Nobody has the time to walk to each server with a degraded array on every reboot.
Yeah, but monitoring is not something that comes with the filesystem. If you have to set up the system to be HA and configure monitoring, email notifications, whatever, making sure the filesystem is created with redundant profiles, then I'd expect that adding 'degraded' to the fstab is also part of the configuration.
Most distros have a udev rule in place that inhibits a mount attempt of a multiple device Btrfs until all devices are visible to the kernel. The degraded mount option won't even matter in this case, because mount isn't attempted.
If you remove this udev rule and then add degraded mount option to fstab, it's very risky because now even a small delay in drives appearing can result in a degraded mount. And it's even possible to get a split brain situation.
Btrfs needs automatic abbreviated scrub, akin to the mdadm write intent bitmap which significantly reduces the resync operation.
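(For comparison, the mdadm write-intent bitmap mentioned above is enabled on an existing array with something like:)

    mdadm --grow --bitmap=internal /dev/md0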
The quote is "At this stage of development, a disk failure should cause mount failure so you're alerted to the problem." and it's from over 9 years ago on an ancient kernel. That was just 3+ years since btrfs project got started.
Let's treat it as the archived historical content it is.
Automatic degraded mount at boot time isn't implemented by md either, it's implemented by a dracut module (or whatever creates the initramfs). The gist is that when mdadm assembly fails, the dracut module's script performs a 5 minute wait to see if all devices will appear, and then if not, tries to assemble degraded.
These days, with UEFI prevalent, a system won't boot without EFI system partitions replicated (and possibly sync'd) on every drive.
Hold on, people make this out to be as if a disk failure requires immediate intervention. However, if the system already booted, it stays up. Who reboots a production system without monitoring it? Also if you boot with a degraded disk are you not asking for massive trouble, because if another disk fails you might end up with data loss, which is IMO much worse on a production system than not being able to boot until you add another disk.
The risks are the same whether you keep running with a failed disk vs. rebooting/booting with a failed disk. Also, given a system in which the OS is not running and requires booting, how exactly are you going to resync a replacement drive if you don't boot the system with a degraded raid volume?
> Who reboots a production system without monitoring it?
Anyone.
> Also if you boot with a degraded disk are you not asking for massive trouble
> not being able to boot until you add another disk
Great.
I'm just a [sys]admin who was given a task to do something on $server.
I jump around the red tape, claw out a 15-minute downtime window because the task requires a reboot, and proceed with all that corporate dance of notification emails.
I do my thing, reboot the server, and it doesn't come back online.
Suddenly I broke the server, missed the maintenance window, the number of mails with CC and RE: in my mailbox grows in geometric progression, and - most important - now I need to find out who was responsible for the server, contact him and [kick his ass] ask him to diagnose what is going on.
Bonus points if:
* the server doesn't have a meaningful BMC/iLO/iDRAC with a KVM console
* it does have one, but it's broken for some reason, e.g. requires Java 6 on Vista
* the server was configured 10 years ago by a greybeard who is not only retired but has already died of old age
* the server is 6000 km away from any place with replacement disks, and the earliest you can send a replacement is next spring, when the ice breaks and thaws enough for the ships to move. Of course you can hire a helo to deliver it, which gets your ass chewed over an unplanned $50k expense
It sounds like you're hypothesizing a long chain of bad decisions, and then ridiculing btrfs for taking the choice that means your next bad decision only culminates in (predictable, preventable) downtime rather than data loss.
These are all things I encountered in my admin days.
Including a BL670 with incorrectly connected drives, so despite everything saying (and indicating) that the failed drive was in bay 2, it was actually in bay 1.
> then ridiculing btrfs
I ridicule btrfs for its RAID mode not being a RAID mode by default.
RAID is about availability of data.
If you're so hell-bent on data safety, then btrfs should kernel panic as soon as one drive degrades. THAT would make sure someone would come and investigate what happened, with no data loss. Right?
> If you're so hell-bent on data safety, then btrfs should kernel panic as soon as one drive degrades. THAT would make sure someone would come and investigate what happened, with no data loss.
> Right?
Panicking an already-running kernel would only serve to prevent userspace from handling the failure through mechanisms that are inherently beyond the scope and capabilities of the kernel alone (ie. stuff like alerting a sysadmin, activating a hot spare, or initiating a rebalance to restore redundancy among the remaining drives).
Perhaps the kernel should default to freezing a non-root filesystem when it becomes degraded, absent an explicit configuration permitting otherwise. But for the root filesystem, that would be counterproductive and prevent the failure from being handled gracefully.
Obviously, the tradeoffs are different for a system that is still trying to boot as opposed to one that is fully up and running.
As someone who has also had all of these happen at one point or another over the years: these are all process issues, not technical issues. No matter your RAID config or FS features, you're gonna have a bad time.
It's not a backup. It doesn't protect against mistakes, bugs, or the server failing in some other way. What it's good for is for ensuring that work keeps happening if a disk fails. And disk failures happen to be fairly frequent, since spinning rust is a rather delicate technology.
If somebody has to connect to the machine and fiddle with it by hand it means that the system has been down, possibly for hours, when that was the exact thing you were trying to prevent by setting up RAID on it.
You don't have to modify it after a failure. You can set it right now to explicitly say "when I end up in a degraded state, I want to boot anyway".
> What it's good for is for ensuring that work keeps happening if a disk fails.
Different services have different priorities. I want my work servers to keep running, but my home server under the desk to stop until I replace the failed drive. Btrfs gives you a choice.
This is why I want servers to always boot to an OS whenever possible. It's fine if production data is not available until someone fixes an array, but without the OS, it's a pain to even figure out what's happened, let alone fix it.
The list of bugs and counter-intuitive design decisions in BTRFS RAID should make anyone pause before ever considering it as a viable filesystem for anything.
It's like people justifying MySQL saying that it's okay for it to lose or corrupt data, and that transactional integrity doesn't matter as much as people say it does.
Yeah, maybe if your website is a blog. But anyone storing real data should run away screaming from systems like this.
Maybe MySQL is sort-of-okay now? I dunno. It's possible BTRFS RAID 5 won't eat your data or crash your server regularly now.
I've had more issues with my ZFS arrays than the BTRFS ones.
The Linux implementation is pretty awful, with it demanding the absolute `/dev/sdx` reference even if you try to build the array using serials or another unchanging reference.
Replace a disk, and then after the next restart it fails to bring up the RAID because `sdf` points to a different disk. At least BTRFS uses internal UUIDs, so it can bring up the arrays when the disk IDs change.
Fairly fundamental stuff, rather than complaining about an option not being enabled by default.
Uh, using /dev/disk/by-id to assemble a zfs array is pretty standard for zol. I never use device names and shuffling disks has never required any manual intervention on my part. Was this a very old ZoL?
At one time, it was the default, and you had to go out of your way to use /dev/disk/by-id (you had to create your pool, export it, and then import it by ID), only to be hit by another issue where, when preparing grub, it was looking for the IDs straight in /dev (that's how I learned that ZPOOL_VDEV_NAME_PATH=YES exists).
Yeah, putting / on zfs was not smart. Nowadays, when the distro doesn't support it out of the box and doesn't come with kernel+bootloader+zfs tested together (i.e. proxmox, ubuntu), I wouldn't do it.
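(For anyone stuck with a pool created against /dev/sdX names, the export/re-import dance mentioned above looks roughly like this; the pool name is a placeholder:)

    zpool export tank
    zpool import -d /dev/disk/by-id tank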
I don't know if you're talking about a veeeeery old implementation or haven't properly configured it, but you can use either the device reference or the partition UUID.
This is one of my systems using both identifications:
root@multivac[~]# zpool status
  pool: boot-pool
 state: ONLINE
  scan: scrub repaired 0B in 00:01:33 with 0 errors on Fri Dec 9 03:46:34 2022
config:

        NAME          STATE     READ WRITE CKSUM
        boot-pool     ONLINE       0     0     0
          sdg3        ONLINE       0     0     0

errors: No known data errors

  pool: multivac-slow
 state: ONLINE
  scan: scrub repaired 0B in 07:05:51 with 0 errors on Sun Nov 20 09:05:55 2022
config:

        NAME                                      STATE     READ WRITE CKSUM
        multivac-slow                             ONLINE       0     0     0
          mirror-0                                ONLINE       0     0     0
            ad3395b6-ac43-453c-9ba2-cf65542cb710  ONLINE       0     0     0
            2c423fe3-07a4-44cb-bc39-8b267c79d8c3  ONLINE       0     0     0
          mirror-1                                ONLINE       0     0     0
            8ca9d5a2-04cb-47a0-987a-8b3d27326975  ONLINE       0     0     0
            bec4c67f-fa41-4ab2-907f-0e5b1e052300  ONLINE       0     0     0
        cache
          bfef4d7f-c484-4c37-a0b7-81d59fa1e189    ONLINE       0     0     0
root@multivac[~]# blkid|grep zfs_member
/dev/sda2: LABEL="multivac-slow" UUID="4053361656876756561" UUID_SUB="363024978934966364" BLOCK_SIZE="4096" TYPE="zfs_member" PARTUUID="2c423fe3-07a4-44cb-bc39-8b267c79d8c3"
/dev/sdd2: LABEL="multivac-slow" UUID="4053361656876756561" UUID_SUB="10877444988141002268" BLOCK_SIZE="4096" TYPE="zfs_member" PARTUUID="8ca9d5a2-04cb-47a0-987a-8b3d27326975"
/dev/sdb2: LABEL="multivac-slow" UUID="4053361656876756561" UUID_SUB="9244362969528724588" BLOCK_SIZE="4096" TYPE="zfs_member" PARTUUID="ad3395b6-ac43-453c-9ba2-cf65542cb710"
/dev/sdc2: LABEL="multivac-slow" UUID="4053361656876756561" UUID_SUB="11658072268678819533" BLOCK_SIZE="4096" TYPE="zfs_member" PARTUUID="bec4c67f-fa41-4ab2-907f-0e5b1e052300"
/dev/sdg3: LABEL="boot-pool" UUID="13291833732043257716" UUID_SUB="15424416082895860251" BLOCK_SIZE="4096" TYPE="zfs_member" PARTUUID="b6c0b302-7599-4862-888b-be6f7a85b970"
I think the advice was to simply use /dev/disk/by-id/foo when originally creating a pool and one would never have a problem with /dev/sda being a different disk and this was true at least back to 2015ish.
I've never lost data with btrfs, but I have purposefully avoided anything other than basic disk configurations. I use it on my workstations because it's the default in Fedora these days and haven't had any complaints.
On the flip-side, I've done some of the most horrible things possible and screwed up my 30-disk ZFS array a number of times and I've never lost data. I doubt btrfs could recover from anything I've done to break my ZFS pool.
Overall I think it's a huge shame that it was determined that the CDDL was incompatible with the GPL, because it wasn't intentional, and we would be in a completely different spot for storage otherwise.
Btrfs is able to do one important thing that ZFS cannot: defragment.
I see XFS as the performance leader (appears on TPC.org the most often that I can see), btrfs as the fullest featured, and ZFS with the strongest reliability.
As far as that goes, Oracle prohibits the installation of their database on btrfs (note 2290489.1: "Oracle DB has specifically said that they do not support using BTRFS filesystems... BTRFS is optimized for non-database workloads.").
XFS for databases is what is used on tpc.org, but perhaps these new improvements may help.
Unfortunately, as best as I can tell, you can't shrink an XFS volume other than by completely rewriting it. (Also, mkfs.xfs formats with Y2038-susceptible 32-bit timestamps by default, but that can be overridden.)
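(The override, assuming a reasonably recent xfsprogs, is something like:)

    mkfs.xfs -m bigtime=1 /dev/sdX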
Never lost any data either in 10 years of use as my main driver.
I've ditched RAID 5 mostly because performance sucked so much, but I have never lost any data despite going through many changes of hard drives, changes of RAID mode, rebalancing, etc.
The closest I've come to losing data is probably rotten HDD sectors, but it's always been caught and remapped by scrubs.
Yes, there is the infamous write hole, but that's not a BTRFS only problem.
I feel like the data loss risk doesn't really make sense to be worried about. It doesn't matter what FS or storage system you use, you _need_ a backup if you actually care about the data. And if you have backups, you are fine.
As a random anecdote I've been using BTRFS for everything for over 7 years now and had no issues. Think the only FS I've ever had problems with was ExFAT.
I think it is important for two reasons. First, the primary purpose of RAID is to improve the availability of your data, such that you can continue operating in spite of a hardware failure. If the RAID feature of btrfs has serious data loss problems, then it is taking a feature that is supposed to increase availability and instead decreasing it. It is like a UPS that is more likely to cause a power outage than to protect you from one.
The other reason is that the atomic snapshot feature, and ability to easily transfer diffs of snapshots makes a wonderful foundation for incremental backups. But if I can't trust the filesystem to avoid corrupting my data, how can I trust it to avoid corrupting my snapshots? Hence I can't trust my backups. So I need to go back to a more independent backup processes like rsync.
With those two features gone, my whole motivation for using btrfs over simpler file systems like ext4 is gone, so why bother.
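(For context, the snapshot-diff workflow being referred to looks roughly like this; the paths and snapshot names are made up:)

    btrfs subvolume snapshot -r /data /data/.snapshots/2024-01-02
    btrfs send -p /data/.snapshots/2024-01-01 /data/.snapshots/2024-01-02 | btrfs receive /mnt/backup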
data corruption is slightly orthogonal to data loss.
If your filesystem silently corrupts some data in a way you don't notice, you might just replicate the corrupted data to your backup system over time. And in that case you will still have lost it.
Data loss - e.g. due to a disk failure - is a lot more obvious.
The only time I've lost data with ZFS was when I was doing something supremely stupid, mostly as a dare to see if it would work.
I had recently upgraded my pool and had a bunch of old disks lying around, so I fired them up in a new pool: 6x 4TB and 6x 8TB disks. I took the 4TB disks and set them up in a raid0 to get effectively 9x 8TB disks. Apparently doing this will work, but you MUST make sure you properly wipe all of the newly raided disks of their old ZFS data, or you can end up with ZFS trying to check that pool and breaking the mdadm superblocks.
After I properly zeroed the 4TB disks it's been working fine for weeks. This let me set up a pool in raidz3 with them all, so I can lose any of the much older 4TB disks or up to 3 of the 8TB ones. I don't use it for anything that's absolutely critical to store, just random bulk data that I don't want to worry that much about, but it's been reliable enough since then.
Is there a ZFS-native way to create raid0 disks now? On Linux I usually set up a linear mapping with dmsetup, and then point ZFS to that. But this hides the native disks from ZFS.
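(The linear mapping being described is roughly the following; device names are placeholders, and dm table sizes are in 512-byte sectors:)

    S1=$(blockdev --getsz /dev/sdb)
    S2=$(blockdev --getsz /dev/sdc)
    printf '0 %s linear /dev/sdb 0\n%s %s linear /dev/sdc 0\n' "$S1" "$S1" "$S2" | dmsetup create jbod0
    # then hand /dev/mapper/jbod0 to zpool create as a single vdev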
I've done that a couple times to make a bunch of mismatched disk fit into zfs's requirement that all are of the same size. The ability to work with a mix of disk sizes is one advantage of btrfs - if you remember to be extra careful on a disk failure and bad shutdown. Plus the ability to dynamically grow the array, but there is work to improve that in the zfs side as well.
Indeed, it is less about raid0 itself and more about creating a raid0 and then building a raidz on top of it.
Say you have `1, 1, 2, 2` sized devices. A raidz requires them all to be equally sized which can be done by first adding `1 + 1 = 2` and then creating a zpool with `[1+1], 2, 2`. Can this 1 + 1 step be done within zfs? Hypothetically this would be done by creating a raid0 pool and then a raidz on top via `zfs create tank raidz (raid0 a1 b1) c2 d2`, or first running `zfs create ab1 a1 ab` and use that in the next `zfs create ab1 raidz ab1 c2 d2`.
RAID0 could be striping which would require all devices to be the same size or it could be concatenation which leaves it to some other layer to balance load across drives. What I suggested allows the use of the full capacity of the drives with weighted data placement based on available space on the drives.
What I suggested is not really RAID0. raidz is not really RAID5. ZFS intentionally improves on these for performance and reliability reasons.
I believe you can accomplish this by partitioning the disks with sizes that match the smallest disk too, for ZFS. You still have all the same reliability issues, though, and there's probably some kind of performance hit too. Though you could then likely set copies=2 or something to add some redundancy against bad sectors at least, and get a little better there than the linear mapping with dm or lvm2 on Linux. You should then also be able to grow the array, but like the linear mapping it's going to be unbalanced as to where the free space is.
My personal preference in this kind of situation is to still just use lvm2 to JBOD the disks together if I'm fine with the risk of losing it all anyway, and just set up a good backup strategy. More because I'm familiar with it and all the tools than other setups. I think ZFS might give me more early warning signs of a disk with an issue, but the performance hit isn't always worth it to me there. I've got an 8TB RAID0 NVMe SSD array (4x2TB) set up to give me some blazing fast local storage for a few key VMs that I just back up daily to the ZFS storage. I can get 7.5GB/s reads and 6.7GB/s writes to that array doing that, using an lvm-thin pool so I can nicely snapshot during backups.
You can stripe disks (raid0) in ZFS but you should not be using any hardware raid solution, motherboard or pcie card. Host bus adapters are preferred, anything with direct access to the disks.
Glad to see Btrfs getting continual updates. It's my favorite filesystem for my personal machines (work and home PCs). The feature set is just awesome. I just hope it doesn't get completely abandoned as its development seems to have slowed significantly.
The only thing it's missing before I consider it full-featured is stable RAID 5/6. But it looks like that hasn't been forgotten.
The development speed has to balance the schedule of Linux kernel development (merge window, release candidates, 3-month cycle) and the demand to merge several distinct features or core changes. There are no formal deadlines, but we have to make sure that the new code is feasible to stabilize in the given time. Once a new feature is in the wild, some bugs or fixups are still needed, so this takes some time from new development and has to be accounted for.
My strategy for pulling new things is to have one big feature that has ideally been reviewed and iterated on the mailing list, or where a lot of testing has already been done. In addition, two smaller features can be merged, with limited scope, not affecting the default setup and ideally easy to debug/fix/revert if needed. Besides that there are cleanups or core updates going on, which should not touch the same code, to make testing less painful. With new features the test matrix grows, and code might need wide cleanups or generalizations before the actual feature code is merged. So this can indeed slow down development.
The raid56 is progressing but until the 6.2 pull from today there was not much to announce regarding stability/reliability. There were proposed fixes but as incompatible features, which means some changes on the user side and with backward compatibility issues. What's pending for 6.2 should fix one of the bad problems at least for raid5.
> The only thing it's missing before I consider it full-featured is stable RAID 5/6. But it looks like that hasn't been forgotten.
I've been running BTRFS RAID6 since 2016 and have only had one issue (arch kernel needed to be rolled back) and never suffered anything catastrophic. It's perfectly happy humming along with a 15x8TB raid array.
Are you using raid6 for both data and metadata? If yes, you might as well have been running raid0 and gotten more space and performance out of the same disks. Because of well-documented issues, metadata on raid6 is not yet safe in power/HW failure situations. And if you have no safety in those situations, what is RAID even for?
(It's perfectly safe to run metadata on raid1 and data on raid5/6.)
I'm running RAID6 for data and RAID1 for metadata. Wanted the metadata replicated. So far I've had no HW failure situations and I have suffered a few power outages while I was away from home even though I have a backup battery.
These drives have surpassed 50,000 hours of running time, so I am fairly certain the day of reckoning is coming, but I have backups and a plan to recover.
I may recover to something not BTRFS. Not sure just yet.
Metadata on raid1 doesn't guarantee the resilience to two-drive failures that raid6 is used for, but metadata on raid1c3 does give you that degree of resilience (with more space overhead, but that's seldom a problem for the metadata).
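(Converting an existing filesystem's metadata in place, assuming a kernel and btrfs-progs new enough for raid1c3, is a one-liner:)

    btrfs balance start -mconvert=raid1c3 /mnt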
Maybe I'm just cynical but I think the ship has sailed for BTRFS RAID 5/6. It's now part of the global mindshare that BTRFS RAID 5/6 == data loss, no one wants to be the guinea pig that proves it works.
Better to direct resources towards bcachefs or ZFS IMO.
Ideally we'd be working based on actual information rather than global mindshare. There will be guinea pigs to test it. There are already quite a few people running it despite the warnings.
As much as I want to see a wide use of bcachefs, it's still years away. As someone who actually wants to store data - why would you direct resources to bcachefs which is known experimental, rather than btrfs which plainly documented raid5 as not ready and now may decide to change it to ready... if it is? (I'm donating to bcachefs patreon, but commenting from the PoV of someone choosing a solution today)
Right but "actual information" here is "developers over and over fail to nail what ZFS and MDRAID did well decades ago". There is no reason to trust their next try will be "the one"
MDRAID had the write hole until 4.4 (2015, https://lwn.net/Articles/665299/), and the fix had been long awaited back then. ZFS, to my knowledge, deals with it via variable stripe length, which has its own problems, but yeah, it works. Btrfs changes the implementation of the stripe update while preserving the on-disk format (i.e. it can't do the same as ZFS without introducing an incompatible change); an intent log/bitmap has been proposed (which would be the MDRAID approach), but that's another incompatible change. So the 'next one' comes at the cost of performance, but with the same compatibility.
Did well in what context? RAID5? There were enough people with motivation to spend time on it rather than other things for mdadm/zfs. It's not like btrfs devs wouldn't know how to implement it if you paid them to work full time on it.
From widespread use, yes. Bcache was already stable at least 7 years ago. Bcachefs isn't even included in the upstream kernel yet, much less in any distro as an official option. There's only a plan to start submitting it soon™, and it will take months to complete.
Months? I thought it would just go through all the data and verify that the stored checksums match. Even if it fetches filedata more than once to ensure the parity bits are ok, how can it take longer than reading the entire disk a couple times?
ZFS is widely used by many users and organizations. Just because it is not currently included in the Linux kernel does not mean that it is useless. There is ongoing work to integrate ZFS into the Linux kernel, and in the meantime, many users continue to find value in using ZFS on their systems. As an alternative, you could try using FreeBSD, which includes ZFS as part of the kernel. This can make it easier to take advantage of the features and benefits of ZFS without having to install and configure it separately. Additionally, because ZFS is integrated into the kernel, it can take advantage of FreeBSD's advanced security and performance features.
Enlighten me on how a filesystem with an incompatible license is ever going to be mainlined. Having to use an out-of-tree patchset is an immediate no as far as I'm concerned. I also have no intention of switching to a hobbyist operating system, thank you.
Linux hasn’t really been a hobbyist operating system for years as anyone taking even a casual glance at a list of contributors would know but sure let’s all pretend FreeBSD is somehow relevant or properly maintained. That’s probably in the same universe where ZFS has a chance of being mainlined and isn’t heavily encumbered by Oracle owned patents.
Why people would contribute to this mess is beyond me but everyone is free to do what they want.
> Linux hasn’t really been a hobbyist operating system for years as anyone taking even a casual glance at a list of contributors would know but sure let’s all pretend FreeBSD is somehow relevant or properly maintained.
You haven't glanced at a list of FreeBSD's contributors, have you?
I know FreeBSD is used by Sony on its console, by Netflix and by WhatsApp. Everyone knows because they are pretty much the sole serious contributors to it. I don’t think it’s particularly relevant nor do I think it’s particularly good to be honest but we are getting far from the initial discussion about ZFS.
We had tried RAID-1 on btrfs, and the experience was not that great either. Our current systems, when they use btrfs, are on top of md (mdadm) RAID-1 arrays.
AFAIK they use it internally; there are articles on lwn.net about how, and the use cases are root filesystems and containers. I'm not sure I understand what you mean by the community sentiment; there are examples of code they developed internally first and sent upstream, and in all cases I remember there were no problems. What can happen in the community is, e.g., discussion about how the patches are organized or whether the changelogs are complete. It's of course easier to develop something internally; if it touches other subsystems, or if there's enough coverage just for the new code, the test/fix/deploy cycle is much more flexible. Once it's supposed to go through mailing lists, or requires convincing other maintainers to accept changes, it takes longer and must stick to the development cycle. This benefits both sides in the long run.
Meta and related companies have much better systems available for durability than RAID5. I would be moderately surprised if they commonly use any standard RAID level.
Using traditional RAID (or moderate improvements on it, such as provided by btrfs and zfs) to provide drive-level redundancy seems like a waste of time when you also have to worry about redundancy between servers, racks, datacenters and regions.
All of the arguments for why it's better to have something like zfs handle RAID rather than layering something on top of a traditional hardware RAID controller also work for explaining why you should prefer managing storage more globally rather than layering on top of something like zfs—if you have the resources to develop and maintain a true "full stack" storage solution. Which Facebook/Meta obviously does, when they can do things like publish their own spec documents that SSD vendors design around.
As someone who is in the middle of pulling apart a BTRFS volume by hand (read: writing code to interpret the data structures) to try and recover it, I think being burnt once is enough.
No indication of any hardware issue: No recent power loss (& it's on a UPS), no SMART issues, no memory test positives. But the block tree (at least, WIP) is f*cked across all the disks (looks like two competing writers went at it) and none of the available tools can deal with it.
It wasn't a super exotic setup either: RAID10 with 4 disks (2 stripes), fairly full and regular snapshots/cleanup, but that's it.
I already converted my root to ext4 because paranoia and I'm probably going to move bulk data (what can be recovered) to ZFS.
This just can't happen with btrfs, since the _old_ block tree should still be on the disk somewhere. I.e., corrupting the image so that btrfs shits itself is easy to do, but the _previous_ version of the data is still there by construction, and if you are already manipulating the data structures it should be easy enough to just point the superblock to it.
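(A sketch of the tooling that exercise usually involves, for anyone in the same boat; the device path is a placeholder and the bytenr is whatever candidate root btrfs-find-root reports:)

    btrfs-find-root /dev/sda                          # list candidate (older) tree roots
    btrfs restore -t <bytenr> /dev/sda /mnt/recovery  # copy out whatever is reachable from that root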
The worst issue I've ever had with btrfs simply required zeroing the journal (by hand -- both the kernel and the *fsck tools would crash when reading it). The first time, I reported the issue to the mailing list and the underlying problem was promptly fixed. However, it happened a second time, and I didn't bother reporting it. As many people say in this thread, once is too many when it comes to filesystems.
For what it's worth, about a year ago I had data being corrupted with ZFS (something about NVMe + raidz + dedup if I recall correctly) -- this was fortunately while I was testing deployments, and quickly confirmed as a bug (and fixed) by the ZFS team. But it left me equally burned.
Using xfs|ext4 + mdraid is just a lot simpler and much faster, and it addresses most of my use cases.
I've been managing a raid6 ext4 array with mdadm for 10 years. I started with 4 x 4TB disks and kept adding; it's up to 11 disks now. It works reliably and as designed. I've had a few disk failures and replaced them without issues. That's one of the nice things about mdadm vs ZFS: you can add and remove disks from the array as you see fit, rather than being forced to upgrade all disks if you want to increase the size of your array.
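(The grow-one-disk-at-a-time workflow being described goes roughly like this; device names and the target count are placeholders:)

    mdadm --add /dev/md0 /dev/sdl
    mdadm --grow /dev/md0 --raid-devices=12
    # once the reshape finishes, grow the filesystem
    resize2fs /dev/md0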
Why is this comment downvoted? ZFS's limited disk management, forcing you to use the size of the smallest disk and making it difficult to resize an array, is not a feature. It's a serious limitation when using it as a normal user and not in a professional environment. It's extremely annoying that most of the ZFS zealots are in denial.
I did that for a long time. I reached 15 disks. Then something went wrong, and I lost everything. I now run multiple 6 disk raidz2s, in part because I wanted to remove the option that led me down the bad path.
Reading these horror stories just makes me not want to rely on smart filesystems at all. If a bug can take out my whole array, or if the system is as inflexible and magical as ZFS, I'd rather not rely on it at all.
I've been managing a home NAS for years with SnapRAID and plain old ext4. Sure, it doesn't have the bells and whistles of something like ZFS, but it's simple to use and understand, scales with anything you throw at it, and there's no way to lose the whole array.
I'm currently transitioning to a multi-node setup, and will probably move to Ceph, otherwise I would stick with SnapRAID. Every so often I look into btrfs/ZFS/mdadm, but keep reaching the same conclusion that it's just not worth the risk.
> will probably move to Ceph ... Every so often I look into btrfs/ZFS/mdadm, but keep reaching the same conclusion that it's just not worth the risk.
The worst thing about my experiences with mdadm and ZFS is that I've had exactly one catastrophic failure in 15 years.
The best thing about my experiences with Ceph is that it's really reinforced the importance of a good backup strategy.
The combination of the two means I now run ZFS, and I have a comprehensive backup strategy that is regularly tested, but has never needed to be invoked.
I strongly recommend developing a strong backup story before you go down your Ceph journey.
Cars require regular maintenance and have known failure modes if you don't bother to do it; that's neither magical nor dangerous. SnapRAID + ext4 isn't particularly simpler than ZFS, just different. The fundamental tradeoff of space vs reliability seems to be mathematically fixed: it's impossible to do better in one dimension without sacrificing the other. Making it possible to lose SOME of your data in fact seems like the worst choice. ZFS makes it trivial to lose nothing: use enough hardware, do cheap replication regularly, and scrub periodically.
If anything, the fact that SnapRAID doesn't live-replicate every change to every disk in the array makes it impossible to achieve the reliability offered by ZFS. It's forever the inferior cousin, and in fact more complicated, for layering one dissimilar technology on top of another.
SnapRAID has its drawbacks, sure. As with any technology, deciding to use one solution over another is a balancing act of choosing the set of drawbacks that are acceptable for a particular use case.
In this case, _for my simple needs_ of running a home NAS, the fact SnapRAID doesn't run in real-time is not an issue. In fact, I prefer being in control of when it runs, and what it's doing exactly. Having a short time window where some new data is not replicated is a negligible drawback to me.
OTOH, while ZFS solves this particular issue, its drawbacks of being difficult to scale, and a huge black box that claims to "just work", when in fact my _entire_ array relies on it working perfectly, 100% of the time, are a tough pill to swallow. I'm sure that with the years of dedicated improvements and stability fixes, it's a battle-tested system where the likelihood of it failing is close to 0, even on Linux. But the fact that it's theoretically possible is a deal-breaker to me. In this sense, I much prefer SnapRAID's approach that makes this literally impossible. SnapRAID could stop working entirely or disappear tomorrow, and all my data is perfectly safe.
So, yes, SnapRAID + ext4 is radically simpler than ZFS, IMO.
Would I recommend this setup in a corporate environment, where company resources are on the line? Probably not. But I would still advocate against ZFS and mdadm, and probably suggest something like Ceph instead.
Sorry that you lost everything. Keep in mind that RAID is not backup. I also don't understand the correlation between reaching a certain number of disks vs any other scenario. Disaster can always strike, so it's best to be prepared.
We run BTRFS on our work laptops and have been for a few years now. Keep in mind that we're only 4 devs, so our sample size is small, but we haven't had any issues with it and it's been very pleasant to use so far!
We published our internal doc for how we install our Arch setup with fully encrypted BTRFS if anybody is curious. Happy to answer any questions too!
That is a pretty awesome article. I recently reinstalled Arch on a new PC and used archinstall to set up encryption, but it had a few bad defaults that I had to later correct... like using a sha512 hash algo, which slowed down opening the drives significantly, and somehow adding 2 different types of compression to the fstab. Once that was fixed, it has been working great.
I am running Fedora on encrypted btrfs, on top of a mirror RAID, and I have a lot of problems with the performance of the disk. I/O frequently hangs the GUI for a few seconds. Throughput is fine once it gets going, but random access to many files appears to be an issue. Searches point to btrfs on encryption being slow, but I think the RAID is also a factor.
I ran into a major problem with performance on btrfs when it was used on an extremely high-write volume that filled up. Between the fragmentation and the full disk, I couldn't even /copy/ data off the drive at a reasonable rate.
It was a perfect storm of me being an idiot and heavy use. But I dunno if they have remotely resolved the issues
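(For anyone else who hits this: the commonly suggested first aid, with no guarantees, is to check how the space is allocated and run a filtered balance to reclaim nearly-empty chunks:)

    btrfs filesystem usage /mnt
    btrfs balance start -dusage=10 /mnt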
I actually hit this for the first time in all my years of using btrfs a few weeks ago.
Every month or so I spin up external drives to write out monthly backups of my NAS and personal computers. I forgot I'd taken on the home video collection and overfilled the array. Deleting some old subvolumes and having the space reclaimed took hours (on a 99.9% full 5 TB RAID-1 volume). Part of that is that I use compress=zstd:10 on the external drives, but I wasn't expecting it. Write speeds tanked to tens of KB/s. Since these are those Seagate drives from Costco, they're probably also SMR drives...
Yeah. On my end it just slowed to a completely unusable level. So while it technically didn’t lose data it effectively did since I was looking at weeks to replicate any data off it. I let it run for a week and it was still crawling when I gave up (4TB drive)
I just double checked the drive and it is. I forgot about that aspect of it.
However, I have run this same drive out of space using ext4, ZFS, and XFS while playing with it (it's a drive I use for torrents etc.), and none of them crapped out like btrfs did on it.
I use BTRFS on my laptop, and in case anyone is interested in trying it, please be sure you know how to chroot into the system in case it won't start. Arch Linux had a problematic GRUB package pushed to stable a few months ago, and it bricked people's computers by booting directly to the BIOS. Whatever you do, make sure to take notes on the configuration (if it is DIY), because there are many differences between BTRFS and EXT4, so you can't use a normal chroot guide to fix your system.
They might update their kernel, just not the kernel version. They seem to be using SoCs in their devices and then the respective BSPs from the SoC vendor, which are notorious for not being updated to a new kernel.
I have one of the infamous Atom C2538-based Synology boxes here, and it is running kernel 3.10.108, built in October 2022. The device is a 2019 model (the SoC was introduced in 2013).
I, like many, have a BTRFS raid 5/6 story. I once worked at a place that had double-digit-TB databases. A few of the servers that predated me were set up with BTRFS raid 10 as their backup strategy: basically stop the server, snapshot the volume, back up from the snapshot. It worked well enough.
There came a point where we were going to have to move to a larger server, as there were no more slots for additional disks. New hardware was procured, provisioned, prepped, ready to roll, with the exception of the actual cutover. However, the c-suite got involved, as they tend to do from time to time, and kept kicking the can down the road.
Eventually disk space became a critical issue, and a junior sysadmin decided it was a good idea to tell the COO that btrfs can be converted to raid 5 on the fly. I told them over the phone that this would guarantee data loss and refused. They literally came to my office to force my hand. Some people just have to learn the hard way.
That story aside, I'm a big fan of BTRFS and have been using in various configurations on my desktops for years. I'm glad to see this progressing.
I have used btrfs for a low-memory-footprint raid 0 solution to span drives - mind you, for expendable datasets. It has worked really well; I've never had failures due to, shall we say, btrfs-rot.
Filesystem is light on requirements and works really well. Compression works great too.
The only thing I haven't played around with are reflinks.
For critical datasets I apply 3-2-1 backup strategy and use ZFS.
Yeah, looks like that one is actually designed to do what it does, not just "somehow implemented". The BTRFS bugs fixed in every kernel release (for how many years now?) are horrific and show that BTRFS is crawling with corner cases.
Are you saying that complex software getting bug fixes is an indication of something? I don't see how it's a bad thing that btrfs has been squashing bugs and corner cases.
I read all these horror stories here about RAID configuration and data availability/integrity, and I wonder: how do the cloud providers solve these issues at their scale?
Why do bare-metal admins have to spend countless man-hours troubleshooting them?
You can start with the Backblaze articles on their data storage tier; they will give you some hints.
Also note that the comments in this thread are heavily biased toward DIY setups and don't cover enterprise storage, which can, for instance, come as a black box with exported LUNs.
Having used both ZFS and mdadm plus a conventional filesystem, I find the vertical integration a lot nicer and smoother to work with. You don't waste time scrubbing empty space. If there's corruption, then you know specifically which files were affected, effortlessly.
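Concretely, that is the difference between an mdadm check (which can only tell you the mirror halves disagree) and something like this minimal sketch, assuming a pool named tank:

    zpool scrub tank        # only reads allocated blocks, verifying checksums
    zpool status -v tank    # if errors were found, -v lists the affected file paths by name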
In theory it should allow fine-grained control over RAID levels at the file/directory/subvolume level, so that depending on your specific needs you could set direct/raid1/raid10/raid5/raid6 without having to set up different arrays for each level you want to use and then manage the relative space between them. I can imagine uses for that, but the reliability of the more tested solutions is much more important.
The flexibility of software RAID is nice in that you can mix and match hard drive manufacturers and generally have zero issues. For hardware RAID, I've always been told to stick to one drive family from one manufacturer and not mix and match.
Generally every recommendation I've heard is don't use hardware RAID.
Some of the problems:
* RAID cards usually have underpowered CPUs and can easily be a bottleneck. Often the best performance with a RAID card comes with the RAID disabled.
* Even battery-backed RAM is often pretty slow and on the wrong end of a high-latency connection between CPU/RAM and disks. Generally, putting the money into RAM or an intent log/write log/SLOG is a better investment.
* Metadata is often undocumented, often needs to be backed up via obscure, device-dependent methods, and is required to recover from a RAID adapter failure.
* The firmware is often buggy, especially in the handling of the numerous failure modes.
* Recovery often requires the same card with the same firmware and a copy of the backed up metadata. Not all cards can recover all metadata from just the drives.
* Often they are "too" smart and won't export raw drives for use with more advanced filesystems like btrfs or ZFS. Some require ugly workarounds like exporting each disk as a single-disk RAID0.
* RAID cards often hide the SMART info from the OS, crippling the ability to predict drive failures, or hide the functionality behind a weird software stack that assumes an SMTP server and integrates poorly with whatever monitoring/alerting system you use for the operating system (see the sketch after this list).
* Some RAID cards require a network connection and run a buggy, insecure, out-of-date web stack on some undocumented and rarely (if ever) patched CPU that was obsolete the day it shipped.
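On the SMART point above: smartmontools can often reach through the controller, but you need the controller-specific addressing. A minimal sketch for a MegaRAID/LSI-style card (the device path and disk index are illustrative and vary per setup):

    # query SMART data for the disk at MegaRAID device id 0 behind the controller
    smartctl -a -d megaraid,0 /dev/sda
    # some HBAs/enclosures want SAT passthrough instead
    smartctl -a -d sat /dev/sdb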
My experience with hardware RAID cards (LSI, PERC, etc) is vastly different than the recommendations you heard.
I have been running production servers with LSI hardware RAID cards for the past 15 years and have not experienced the items you note. The only thing I did experience was a bad RAID card (solved by replacing it with a similar LSI model).
To contrast your "problem" list:
* Compared with MDADM (mirrors, RAID-5, RAID-6), the LSI RAID cards tend to be on par as far as performance (esp with BBU)
* A failed drive won't prevent the array from coming on line (in my experience)
* All my servers have battery backed RAM; never had an issue with slowness or high-latency. Can't say the same for ZFS.
* Never, ever had a problem with metadata on the disk. Not sure why this seems to be an issue
* LSI cards have pass-thru mode and can easily be used with BTRFS/ZFS. I have done lots of tests; never an issue
* LSI cards have their own "patrol read" mechanism to scan drives for bad sectors and send alerts when something seems wrong
On a positive note:
* Hard drive failures are truly plug-n-play. Send a tech out to the cabinet, spot the RED light, replace drive. Done. No OS work, no console, no crash cart needed.
* You can logically divide the drives using the RAID manager tools just like Linux MDADM, ZFS, BTRFS, etc. Easy.
* You don't need the specific card to replace a failed card. You just need a card that can read the metadata on the array to bring it back to life.
* Lots of monitoring scripts available in the wild to get RAID stats, rebuild times, etc.
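For example, on LSI/Broadcom cards most of those monitoring scripts boil down to parsing storcli output. A minimal sketch, assuming controller 0 (exact syntax varies a bit across storcli versions, so treat this as illustrative):

    storcli /c0 show              # controller summary: virtual drives, physical drives, BBU state
    storcli /c0/vall show         # state of every virtual drive (optimal/degraded/rebuilding)
    storcli /c0/eall/sall show    # per-slot physical drive status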
I am not saying hardware cards are indestructible. But, all in all, hardware RAID cards are solid devices that have been in many, many production servers all over the world. I certainly would not dismiss them in favor of ZFS RAID (which tends to be slow for many operations). Pick the best tool for the job.
Glad it's worked for you. I don't think I've had anything that conflicts with your experiences. I didn't mean to imply a degraded hardware RAID wouldn't mount, I think that's a BTRFS RAID5 issue. Generally I'd consider BTRFS a toy, doubly so for RAID5/6 on BTRFS.
Generally I (and colleagues) consider a hardware RAID failure a nightmare and a SAS HBA failure an annoyance. Doubly so if your storage design included cross connected servers and you just mount the storage from the other server. I've never tried similar with a hardware RAID. Can you cross mount and easily import/export RAID sets between controllers?
The main hardware RAID performance issues I see are with higher drive counts or NVMe drives combined with RAID6. Have you seen any hardware RAIDs that can manage a few GB/sec with 2 disks of redundancy? 3 disks? Even spending a few $k on hardware RAID seems to lose to a random 5-year-old server using ZFS or software RAID by a large factor. It seems like even pretty old x86-64 servers manage a few GB/sec per core.
Careful with the passthru: I had many generations of LSI cards that worked, and 3ware and Areca before that. However, I've been hearing that the new LSI RAID cards lack passthru/JBOD mode, and I've seen reviews and complaints about the lack of it. One case involved an LSI hardware RAID connected to a Dell JBOD array, but maybe it was specifically crippled by Dell.
I avoided ZFS for years, as it was generally slower at the relevant workloads (measured with IO logs collected via systemtap and representative loads created with fio). However, once I added cache and the benchmark included multiple streams of writes, ZFS was a huge win; in particular, 64 or more sequential write streams to 120 disks. In fact, ZFS did better with 3 disks of redundancy than other filesystems did with 2.
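For reproducibility, the multi-stream case is easy to approximate with fio. A minimal sketch (the directory, sizes, and job count are illustrative, not the exact workload described above):

    # 64 concurrent sequential write streams, 1 MiB blocks, aggregated reporting
    fio --name=seqwrite --directory=/tank/fio --rw=write --bs=1M \
        --size=4G --numjobs=64 --ioengine=psync --group_reporting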
In the past I had more luck with ZFS or MDADM for doing things like /dev/sd[ab]1 as a 32 GB RAID1 for boot, then /dev/sd[abcd]2 as RAID5 or RAIDZ2. Sounds like the hardware RAID tools are getting better.
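Roughly that layout as a sketch (device names are illustrative; the partitions would be created with your favourite partitioner first):

    # small RAID1 across the first partitions for /boot
    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
    # the big second partitions as RAID5...
    mdadm --create /dev/md1 --level=5 --raid-devices=4 /dev/sda2 /dev/sdb2 /dev/sdc2 /dev/sdd2
    # ...or as RAIDZ2 under ZFS instead
    zpool create tank raidz2 /dev/sda2 /dev/sdb2 /dev/sdc2 /dev/sdd2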
I've heard of some hardware RAID setups managing to recover hardware RAID metadata with a new card, and some not - even cases of successes and failures from the same company, and cases where support claimed it would work but then decided the card and/or firmware wasn't close enough. Sure, decent support can overnight a card, but that's nowhere near as nice as just being able to mount the drives on any Linux box.
Interesting, thanks for the perspective. Myself, I treat an MDADM/ZFS RAID failure as more of a headache than one on a RAID card. As I mentioned earlier, RAID cards allow for easy hot-swap without any OS intervention. I am always nervous when I have to replace drives in a ZFS pool - mainly because I forget the exact commands that need to be run (in order) for a successful swap. With hardware RAID - no commands :-)
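For the record (and for future me), the ZFS swap is usually only a couple of commands. A minimal sketch, assuming a pool named tank and a replacement disk at /dev/sdX (device names are illustrative; /dev/disk/by-id paths are nicer in practice):

    zpool status tank                # identify the FAULTED/UNAVAIL device
    zpool offline tank sdb           # optional if the disk is already dead
    # physically swap the drive, then:
    zpool replace tank sdb /dev/sdX  # resilver starts automatically
    zpool status tank                # watch resilver progress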
Thus far, we don't have any high-density NVMe servers so I can't really test high-end servers. My experience is limited to 8x SSDs or 16x spinning drives in our arrays.
Finally, I avoided ZFS for a long time as well. Early on (OpenZFS 0.6, 0.7, etc.), I spent an enormous amount of time trying to tweak/tune the system just to get on par with our HW RAID devices. No matter what I did, I simply could not get the server (a typical Linux NFS NAS) to deliver decent performance. In fact, I clearly remember HW RAID hitting 1.8 GB/sec on our SSDs while OpenZFS could only get around 400 MB/sec. Only around the OpenZFS 2.1 mark did I see any real performance gains.
...Speaking of ZFS... I recently worked on a project to replace XFS with ZFS for PostgreSQL servers. I learned a lot about OpenZFS - specifically the memory latency when using compression. I had a lengthy discussion with the OpenZFS "gang" (https://zfsonlinux.topicbox.com/groups/zfs-discuss/T5122ffd3...) to track down some large latency numbers. It turns out you need to disable ARC compression and ABD scatter, otherwise you will definitely hit some performance issues. Take a look at that thread - especially the very end, where I publish some tuning suggestions. It was a fun learning exercise :-)
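For anyone following along, those two knobs are OpenZFS module parameters; a minimal sketch of how they might be set, persistently or at runtime (whether they help is workload-dependent, so see the thread above before copying blindly):

    # /etc/modprobe.d/zfs.conf (persistent, applied at module load)
    options zfs zfs_compressed_arc_enabled=0 zfs_abd_scatter_enabled=0

    # or at runtime:
    echo 0 > /sys/module/zfs/parameters/zfs_compressed_arc_enabled
    echo 0 > /sys/module/zfs/parameters/zfs_abd_scatter_enabled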
> mainly because I forget the exact commands that need to be run (in order) for a successful swap. With hardware RAID - no commands :-)
I find it useful to have some sort of tag or wiki link added to the alert information (like on a failed drive in your case) - it makes it much easier for every team member to fix the issue without guessing or inventing their own way during an outage.
And thanks for the link you provided, quite interesting. I plan to redo my canary server with ZFS and MySQL (the primary draw for me is roughly 3x compression), as the current one is not very stable, most likely due to my attempts to use the latest ZFS with the Ubuntu HWE kernel on 20.04. This time I will use 22.04 with stock ZFS and go through your findings more closely.
Cool! Feel free to reach out if you have any questions (contact details in profile). I am by no means a ZFS expert, but I have been doing servers/storage for a long time and can probably help out if you run into an issue.
I run my two LSI cards in HBA mode. The initial minor annoyance was having to flash the cards, which coincidentally was also my first time ever doing something like that.
Do you have any experience with the hardware RAID provided by motherboard manufacturers like Asus? I have a spare dual-Xeon system I may use for my next NAS build, and I'll pair it with two more RAID cards that can be put in HBA mode.
The biggest problem with hardware RAID is controller compatibility. If the controller dies, chances are the whole array is dead if you can't find the exact same model.
"This should ..." doesn't belong in a fucking file system! Take a deep breath and try again. Maybe limit the scope to something within your understanding this time.
Though I've used XFS a lot over the years, mostly because the Debian installer gave 12-year-old me the choice between ext2, ext3, and XFS, so XFS it was, because it sounded cooler.
Come to think of it, the vast majority of my filesystem woes were with the ext3/4 family: I've seen a zeroed file, and I've also seen a file replaced by the contents of another file on unclean shutdowns.
It has gems such as:
* It won't boot on a degraded array by default, requiring manual action to mount it
* It won't complain if one of the disks is stale
* It won't resilver automatically if a disk is re-added to the array
I think the first is the killer. RAID is a High Availability measure. Your system is not Available if it fails to boot.
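Assuming this is about btrfs RAID1 (as elsewhere in the thread), the manual action in question looks roughly like this sketch, with device names and the devid purely illustrative:

    # one mirror leg is gone; btrfs refuses a normal mount, so mount explicitly degraded
    mount -o degraded /dev/sda2 /mnt
    # replace the missing device (devid 2 here) with a new disk, then verify checksums
    btrfs replace start 2 /dev/sdc2 /mnt
    btrfs scrub start /mnt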