Many folks I know who manage storage don't make the boot volume RAID (redundant); instead, it's some rapidly duplicatable thing like NVMe flash containing the root filesystem, and there's a replacement handy. Then you can boot the box and bring the full power of userspace to bear on the RAID repair.
I just finished another late night "thanks" to the unknown person who decided to use this approach several years ago while setting up 17 servers that I inherited. The /boot partition of these RHEL machines was put on SD cards, which have been gradually dying. So I got to boot the machine (which is 6000 km away) into recovery mode from an ISO and re-create the /boot filesystem on the HDD (RAID).
Of the 17 servers, only three now remain where I haven't had to do this. Not all of these were actual SD card failures; some were done preventively. Still, there have been several SD card failures requiring emergency repair work at inconvenient times. There have been zero RAID failures requiring similar emergency work on the systems where /boot has been migrated to HDD-based RAID.
Of course there is no replacement SD card handy with a working /boot filesystem. Actually, there is no such thing as "handy" in my case. I can reconstruct the /boot partition on the HDD faster than anybody could go to the datacenter and replace the SD card. And if there were a replacement SD card, I would need to keep it up to date manually every time the kernel or initrd is updated.
I never want to see this kind of setup again, and frankly it felt insane the first time I saw it.
Of course I should have migrated all of the machines off the SD cards by now, but my excuse is that maintaining Linux on these machines is not really my responsibility (it is nobody's responsibility apparently, although many people are interested in keeping these systems online).
When I die I want an SD-card-shaped tombstone.
This was a trend for a minute back in the late 00's. Any system that:
1. Was HA
2. Had an OS that would get fully loaded into memory (think ESX)
The idea was that SD cards take less power and were cheap enough that you could have a whole gang of spares with images ready to just drop in place. I'm not saying it was a good idea, just that I've encountered this more than once.
I don't get why you need to do so. RAID is not a substitute for backups, and shouldn't be used that way. Meaning that of course you have a backup of the data that you can recover from.
RAID is used for two things:
1. improving the performance of slow disks (at least the read performance)
2. having one or two disks fail with your system remaining completely usable, until you replace the faulty disk (which you should do as soon as possible).
The second point is fundamental to me: there shouldn't be any disruption of service whatsoever, meaning that only the sysadmin should notice the fault (besides maybe reduced performance, since you have one less drive). Database transactions that were in progress when the disk broke shouldn't fail, writes/reads on the FS shouldn't fail; the only thing that should happen is an alarm triggered in the monitoring system to inform you that a disk needs to be changed as soon as possible.
Having a RAID with manual recovery... it means that you could end up spending a Saturday evening in front of a computer bringing a system back online, and still some data corruption may have happened.
Regardless of what you think, many people keep a separate boot and data disk and don't use RAID on the boot disk. I never said anything about it being used for backups, nor did I say anything about service outages. I think my point - about what other people do, since I don't use RAID - is more about the choices different people make in terms of trusting their boot volume's recoverability in times of crisis.
By default the btrfs metadata profile is DUP (i.e. 2 copies on one device); the same can be done for data, but this reduces the usable capacity. On a normal HDD or SSD this should not be needed, but having both data and metadata as DUP/DUP has been useful on a Raspberry Pi with micro SD card storage. It's not perfect, but it increases the chances of getting the data back if the card is partially damaged due to power spikes.
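(For reference, a rough sketch of setting those profiles; the device path is a placeholder, and an existing filesystem can be converted in place with a balance:)

    # format a single device with duplicated metadata and data
    mkfs.btrfs -m dup -d dup /dev/mmcblk0p2

    # or convert an already-mounted filesystem
    btrfs balance start -mconvert=dup -dconvert=dup /mnt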
I have wondered about this myself - does keeping 2 copies of the same data on a single flash-based drive actually increase reliability? Or is the flash controller going to end up combining the two writes into the same block?
That seems like a very hacky way to accomplish something that should just work. I mean, it's better than nothing, but it seems far from ideal.
If it's your own personal box, and you're okay with that fiddling, fine. But if people are supposed to get work done while the admin gets a new disk, it's not great.
There are pretty solid reasons why you might choose to have a very simple RAID1 set for a boot volume: because you want to be sure to boot even if the huge array is degraded, because you want faster I/O for the OS install, etc.
The way I used to do it (back then, getting grub/initrd RAID-aware wasn't really doable - not sure about now) with mdadm was to use RAID1 on the boot volume such that a disk from a broken mirror was still usable by itself. Other volumes were RAID 5 or 10, etc.
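(Roughly what that looks like; device names are placeholders. Metadata format 1.0 puts the md superblock at the end of the partition, so a lone mirror member still looks like a plain filesystem to a bootloader that doesn't understand md:)

    mdadm --create /dev/md0 --level=1 --raid-devices=2 --metadata=1.0 /dev/sda1 /dev/sdb1
    mkfs.ext4 /dev/md0     # /boot
    grub-install /dev/sda
    grub-install /dev/sdb  # bootloader on both disks, so either one can boot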
How would a mirrored boot work in practice? Normally the first bootable partition gets loaded; if it's corrupt, it would keep failing at the same bad sectors.
Would it randomly pick one of the bootable disks? That way you have an n-1/n chance of avoiding the bad disk?
Back then, grub and initrd didn't understand mdadm so would be set to boot off one of the boot vols as a raw disk. Once initrd handed over to the root vol, it would understand mdadm properly.
If a disk with a boot vol failed, it would have a 50/50 chance of still booting depending on which one failed. You'd create a backup grub entry to boot off the other one manually.
Not as transparent as hardware RAID, and maybe now grub is aware enough to auto handle mdadm? I don't know - I haven't done bare metal mdadm for a decade or so.
The same way it has worked for the past 30+ years: the firmware boots from the first designated boot device; if that device isn't bootable, it moves on to the next one. The next one boots since it's part of a mirror and has all the necessary data to do so, and in the rare case where the device is "half bootable", one would simply intervene and select the next good bootable device manually.
That's fair enough for a system that isn't critical, but you can almost guarantee that the disk failure will happen at the most inconvenient time. I'd rather have a couple of cheap disks and MD software RAID so that the system boots as long as one of the drives hasn't failed. Then you can replace/fix the failing disk at a convenient time.
I have 2 NixOS-based NASes that run ZFS. Each one has 3 equal-sized 256 GB SSDs in addition to the pile o' spinning rust.
Each SSD has a small UEFI boot partition, then the rest of the space is cut in half.
The root filesystem is a 3-way ZFS mirror of the first half of the SSDs. The second half of each SSD is another 3-way mirror, this time as the "special" / metadata device for the main hard-drive-backed zpool.
Any of the 3 SSDs can fail, and it will boot up and mount the storage perfectly fine. I could also easily upgrade to larger SSDs, in place and with minimal downtime (zero downtime, if my case had hot-swappable SSD bays).
It would work just as well with only 2 SSDs, but the incremental cost of a 3rd is small enough relative to the whole that I went for it.
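(A rough sketch of that layout; the pool names, partition names, and the shape of the HDD vdev are made up, since the comment doesn't specify them:)

    # root pool: 3-way mirror across the first SSD partitions
    zpool create rpool mirror sda2 sdb2 sdc2

    # main pool on the spinning rust, plus a 3-way mirrored special
    # (metadata) vdev on the second SSD partitions
    zpool create tank raidz2 sdd sde sdf sdg sdh sdi
    zpool add tank special mirror sda3 sdb3 sdc3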
Back when Solaris wasn't using ZFS yet, we were using the Live Upgrade utility to duplicate the main boot environment on a local disk every night.
Machines booting from SAN had only one local disk; others would have 3 local disks (small and cheap ones, because application data was on the SAN anyway).
The main RAID did the job for availability in terms of disk failure; the 3rd boot disk did the job in case of user error or data corruption on the main boot env. It saved our asses a few times, bringing apps back online quickly and saving us from reinstalling or fixing stuff from a temporary live environment.
The way I like to do it is with mirroring of the NVMe storage and then LVM snapshots on top of it. Done correctly, you can boot off either of them, but they're kept in sync automagically by the OS. Then you snapshot the boot volume regularly and copy those snapshots to some kind of archival storage for backups.
That lets you then even hotswap them if you need to while keeping the root filesystem workable.
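(Something along these lines for the snapshot-and-archive part, assuming a volume group called vg0 with a root LV; names, sizes, and the archive path are placeholders:)

    lvcreate --snapshot --size 10G --name root-snap /dev/vg0/root
    dd if=/dev/vg0/root-snap bs=4M | zstd > /backup/root-$(date +%F).img.zst
    lvremove -y /dev/vg0/root-snap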
I haven't used RAID for an OS or boot partition in years. We focus on provisioning and HA. Data goes on a RAID and that's it.
Servers go boom, refuse to boot, or even trigger a few alarms. The disk is replaced, the server is booted, and provisioning paves the entire thing faster than you'd be able to debug it.
Regarding resilvering, I am amazed by ZFS' capability of quickly bringing a stale mirror back into synchronization.
I have used the FUSE port of ZFS to write to only one member of a mirror set; then, upon mounting both members elsewhere, the stale mirror was very quickly resilvered, so ZFS was able to determine only the blocks needing to be refreshed.
In btrfs, I understand that this requires a rebalance, which will read/write every used portion of the filesystem.
Scrub, not balance. It's primarily a read operation. The on-disk format has all the information needed to do an automatic abbreviated scrub, but the feature isn't implemented yet.
No, I've heard that a scrub will not reallocate missing blocks on mirrors. A rebalance was required in the article below, and this situation really requires attention.
ZFS is far friendlier in a crisis.
"We'll even manually trigger a scrub—a procedure that storage admins generally understand to look for and automatically repair any data issues... even though we manually initiated a scrub and let it finish, our array is still inconsistent and even outright non-mountable, because it ran for a little while without a disk and then that disk was re-added. The command that we were supposed to run was btrfs balance—with both drives connected and a btrfs balance run, it does correct the missing blocks, and we can now mount degraded on only the other disk, /dev/vdc. But this was a very tortuous path, with a lot of potential for missteps and zero discoverability."
Scrub checks every data and metadata block and compares it to its checksum. If a block is missing, or has a wrong transid or wrong csum, btrfs will replace the block from a good copy.
There is a potential gotcha with parity being wrong following a crash or power failure, the so-called write hole. Scrub doesn't check parity, and parity isn't checksummed. The write hole, though, is arguably two parts: wrong parity and propagated wrong reconstruction.
In the usual write-hole case, both happen silently. Wrong parity could happen after a power fail, crash, misdirected or torn write - whether the array is functioning normally or degraded. However, wrong reconstruction only happens if there's a failure to read a data strip (bad sector or full device failure).
On btrfs, only the wrong parity being written is possible. Upon reconstruction from bad parity, the resulting data is compared to csum and will fail, thus propagation doesn't happen.
To cause parity to be recomputed and rewritten, yes, you need to do a full balance. But among all the block group profiles, that only applies to raid5 and raid6. Single, DUP, raid1, raid1c3, raid1c4, and raid10 aren't affected, and a scrub does reallocate missing blocks on mirrors.
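(For reference, the two operations being contrasted, run against a mounted filesystem:)

    btrfs scrub start -Bd /mnt               # verify blocks against checksums, repair from good copies
    btrfs balance start --full-balance /mnt  # rewrite every block group, which also recomputes parity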
Remember that DM is still there for you even if you use btrfs.
Every discussion about btrfs immediately escalates into the state of the btrfs RAID support. But btrfs is useful on its own without that mode.
Should you wish to put a btrfs file system on software RAID (DM), you can still do that. The btrfs RAID modes are something else, where you cut the DM driver out of the picture. That's not going to be as well tested as plain old DM, even if it works, and it doesn't offer any functionality beyond plain DM RAID.
If the use case is storing real data, you want the data storage layer to be as well tested as possible. There's also more to software RAID than the block device. You want a mature resilvering daemon on a well tested schedule, and you need working monitoring to alert you of issues. RAID without monitoring will only delay the inevitable.
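(That is, something along these lines, with plain md software RAID providing the redundancy and btrfs sitting on top as an ordinary single-device filesystem; device names are placeholders:)

    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
    mkfs.btrfs /dev/md0
    mount /dev/md0 /srv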
wikipedia:
The device mapper is a framework provided by the Linux kernel for mapping physical block devices onto higher-level virtual block devices. It forms the foundation of the logical volume manager (LVM), software RAIDs and dm-crypt disk encryption, and offers additional features such as file system snapshots.[1]
Device mapper works by passing data from a virtual block device, which is provided by the device mapper itself, to another block device. Data can be also modified in transition, which is performed, for example, in the case of device mapper providing disk encryption or simulation of unreliable hardware behavior.
Yeah but DM doesn't have the btrfs flexible block allocation scheme that allows me to expand capacity one drive at a time and with drives of different capacities. This is very important for a home user like me.
To expand a typical RAID setup, you essentially have to fail every single drive in the array and replace them one by one with new identical drives. Expanding an array of N drives requires N resilvers, which not only takes a long time but increases the chance of actually failing those drives due to the massive I/O load. It's easier to build a second array and copy the data over. This, combined with the fact that you have to buy all the hardware up front, makes it cost-prohibitive for home users. You can't buy drives one at a time and expand capacity as needed.
> It won't boot on a degraded array by default, requiring manual action to mount it
If you want it to behave like that, then add 'degraded' to fstab. That a device is missing can have unknown reasons; the user should know better and either resolve it or allow such a boot. It's not automatic, as there's no way to inform the user that it's in a degraded state.
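(That is, an fstab entry along these lines; the UUID and mount point are placeholders:)

    UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  /data  btrfs  defaults,degraded  0  0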
I don't quite understand the use case here. If I'm setting up RAID it's because I want the system to stay up. That's the only purpose for it.
If a device goes missing for "unknown reasons", then the machine should still work, and I'll figure out what happened when monitoring pokes me and says RAID is degraded.
The use case is: enough drives have failed that your RAID is degraded. Any more data you write is not replicated, and the failure may be due to a software/hardware issue that will kill more drives soon.
It's up to you to choose at that point - is availability more important for you (add degraded to fstab), or data consistency (deal with the array first).
That's not the only purpose for it. There are three reasons I can think of that you might set up a RAID array:
* You want better uptime. (your use case)
* You want to protect from data loss. (my assumption was that this is the most common use case, but I could be wrong. This also helps with uptime because there's nothing worse for uptime than having to restore lost data from a cold backup)
* You want better performance, data integrity be damned. (RAID 0)
Booting a RAID array with a failed disk is a bad idea if you care a lot about not losing data, because now you're only one more disk failure away from losing it.
Booting from a degraded array is only a fine idea in some circumstances, not all. That's why the kernel should not default to automatically doing so; but a distro or sysadmin that has better knowledge of the broader situation (eg. presence of hot spares or a working monitoring/alert system) can reasonably change that default when the risks of booting from a degraded array have been mitigated.
Backups cannot be perfectly real-time unless they are very nearly RAID. Any time you are generating/collecting important data, you will unavoidably have some amount of that important data in the state of not yet backed up.
It's reasonable to want to preserve all the data you currently have—some of which probably hasn't been backed up yet—and not accept new data to be written with the durability guarantees the array was originally configured for silently violated.
Since the kernel has no way of knowing which volumes may contain important data that didn't get the chance to be backed up, it should try its best to maintain the original durability standards the filesystem was configured for, until some mechanism outside the kernel authorizes relaxing those standards.
> It's reasonable to want to preserve all the data you currently have—some of which probably hasn't been backed up yet—and not accept new data to be written with the durability guarantees the array was originally configured for silently violated.
I.e. (by your logic) the system should stop writes as soon as the array becomes degraded.
But this is not what happens with btrfs: it will happily continue to write data to the array until reboot.
And then suddenly it's "oh my god array is degraded!!!111 you should not write to it1111".
To add to that: I have never seen a HW RAID card refuse to boot over a merely degraded array. Changes in the configuration of arrays, or loss of more drives than the redundancy can absorb - yes, that would halt the boot and require operator intervention. An array in a degraded state? Just spit warnings to the console and boot. Nobody has the time to walk to each server with a degraded array on every reboot.
Yeah, but monitoring is not something that comes with the filesystem. If you have to set up the system to be HA and configure monitoring, email notifications, whatever, making sure the filesystem is created with redundant profiles, then I'd expect that adding 'degraded' to the fstab is also part of the configuration.
Most distros have a udev rule in place that inhibits a mount attempt of a multiple device Btrfs until all devices are visible to the kernel. The degraded mount option won't even matter in this case, because mount isn't attempted.
If you remove this udev rule and then add degraded mount option to fstab, it's very risky because now even a small delay in drives appearing can result in a degraded mount. And it's even possible to get a split brain situation.
Btrfs needs automatic abbreviated scrub, akin to the mdadm write intent bitmap which significantly reduces the resync operation.
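(For comparison, the mdadm write-intent bitmap mentioned above is enabled on an existing array with something like:)

    mdadm --grow --bitmap=internal /dev/md0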
The quote is "At this stage of development, a disk failure should cause mount failure so you're alerted to the problem." and it's from over 9 years ago on an ancient kernel. That was just 3+ years since btrfs project got started.
Let's treat it as the archived historical content it is.
Automatic degraded mount at boot time isn't implemented by md either, it's implemented by a dracut module (or whatever creates the initramfs). The gist is that when mdadm assembly fails, the dracut module's script performs a 5 minute wait to see if all devices will appear, and then if not, tries to assemble degraded.
These days, with UEFI prevalent, a system won't boot without EFI system partitions replicated (and possibly sync'd) on every drive.
Hold on, people make this out to be as if a disk failure requires immediate intervention. However, if the system already booted, it stays up. Who reboots a production system without monitoring it? Also if you boot with a degraded disk are you not asking for massive trouble, because if another disk fails you might end up with data loss, which is IMO much worse on a production system than not being able to boot until you add another disk.
The risks are the same whether you keep running with a failed disk vs. rebooting/booting with a failed disk. Also, given a system in which the OS is not running and requires booting, how exactly are you going to resync a replacement drive if you don't boot the system with a degraded raid volume?
> Who reboots a production system without monitoring it?
Anyone.
> Also if you boot with a degraded disk are you not asking for massive trouble
> not being able to boot until you add another disk
Great.
I'm just a [sys]admin who was given a task to do something on $server.
I jump around the red tape, claw out a 15-minute downtime window because the task requires a reboot, and proceed with all that corporate dance of notification emails.
I do my thing, reboot the server, and it doesn't come back online.
Suddenly I broke the server, missed the maintenance window, the number of mails with CC and RE: in my mailbox grows in geometric progression, and - most important - now I need to find out who was responsible for the server, contact him and [kick his ass] ask him to diagnose what is going on.
Bonus points if:
* the server doesn't have a meaningful BMC/iLO/iDRAC with a KVM console
* it does have one, but it's broken for some reason, e.g. requires Java 6 on Vista
* the server was configured 10 years ago by a greybeard who is not only retired but has already died of old age
* the server is 6000 km away from any place with replacement disks, and the earliest you can send a replacement is next spring, when the ice breaks and thaws enough for the ships to move. Of course you can hire a helo to deliver it, which gets your ass chewed over an unplanned $50k expense
It sounds like you're hypothesizing a long chain of bad decisions, and then ridiculing btrfs for taking the choice that means your next bad decision only culminates in (predictable, preventable) downtime rather than data loss.
These are all things I encountered in my admin days.
Including a BL670 with incorrectly connected drives, so despite everything saying (and indicating) that the failed drive was in bay 2, it was actually in bay 1.
> then ridiculing btrfs
I ridicule btrfs for its RAID mode not being a RAID mode by default.
RAID is about availability of data.
If you're so hell-bent on data safety, then btrfs should kernel panic as soon as one drive degrades. THAT would make sure someone would come and investigate what happened, with no data loss. Right?
> If you're so hell-bent on data safety, then btrfs should kernel panic as soon as one drive degrades. THAT would make sure someone would come and investigate what happened, with no data loss.
> Right?
Panicking an already-running kernel would only serve to prevent userspace from handling the failure through mechanisms that are inherently beyond the scope and capabilities of the kernel alone (ie. stuff like alerting a sysadmin, activating a hot spare, or initiating a rebalance to restore redundancy among the remaining drives).
Perhaps the kernel should default to freezing a non-root filesystem when it becomes degraded, absent an explicit configuration permitting otherwise. But for the root filesystem, that would be counterproductive and prevent the failure from being handled gracefully.
Obviously, the tradeoffs are different for a system that is still trying to boot as opposed to one that is fully up and running.
As someone who has also had all of these happen at one point or another over the years: these are all process issues, not technical issues. No matter your RAID config or FS features, you're gonna have a bad time.
It's not a backup. It doesn't protect against mistakes, bugs, or the server failing in some other way. What it's good for is for ensuring that work keeps happening if a disk fails. And disk failures happen to be fairly frequent, since spinning rust is a rather delicate technology.
If somebody has to connect to the machine and fiddle with it by hand it means that the system has been down, possibly for hours, when that was the exact thing you were trying to prevent by setting up RAID on it.
You don't have to modify it after a failure. You can set it right now to explicitly say "when I end up in a degraded state, I want to boot anyway".
> What it's good for is for ensuring that work keeps happening if a disk fails.
Different services have different priorities. I want my work servers to keep running, but my home server under the desk to stop until I replace the failed drive. Btrfs gives you a choice.
This is why I want servers to always boot to an OS whenever possible. It's fine if production data is not available until someone fixes an array, but without the OS, it's a pain to even figure out what's happened, let alone fix it.
The list of bugs and counter-intuitive design decisions in BTRFS RAID should make anyone pause before ever considering it as a viable filesystem for anything.
It's like people justifying MySQL saying that it's okay for it to lose or corrupt data, and that transactional integrity doesn't matter as much as people say it does.
Yeah, maybe if your website is a blog. But anyone storing real data should run away screaming from systems like this.
Maybe MySQL is sort-of-okay now? I dunno. It's possible BTRFS RAID 5 won't eat your data or crash your server regularly now.
I've had more issues with my ZFS arrays than the BTRFS ones.
The Linux implementation is pretty awful, with it demanding the absolute `/dev/sdx` reference even if you try to build the array using serials or another unchanging reference.
Replace a disk, and then after the next restart it fails to bring up the RAID because `sdf` points to a different disk. At least BTRFS uses internal UUIDs, so it can bring up the arrays when the disk IDs change.
Fairly fundamental stuff, rather than complaining about an option not being enabled by default.
Uh, using /dev/disk/by-id to assemble a zfs array is pretty standard for zol. I never use device names and shuffling disks has never required any manual intervention on my part. Was this a very old ZoL?
At one time, it was the default, and you had to go out of your way to use /dev/disk/by-id (you had to create your pool, export it, and then import it by ID), only to be hit by another issue where, when preparing grub, it was looking for the IDs straight in /dev (that's how I learned that ZPOOL_VDEV_NAME_PATH=YES exists).
Yeah, putting / on zfs was not smart. Nowadays, when the distro doesn't support it out of the box and doesn't come with kernel+bootloader+zfs tested together (i.e. proxmox, ubuntu), I wouldn't do it.
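(For anyone stuck with a pool created against /dev/sdX names, the export/re-import dance mentioned above looks roughly like this; the pool name is a placeholder:)

    zpool export tank
    zpool import -d /dev/disk/by-id tank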
I don't know if you're talking about a veeeeery old implementation or haven't properly configured it, but you can use either the device reference or the partition UUID.
This is one of my systems using both identifications:
root@multivac[~]# zpool status
  pool: boot-pool
 state: ONLINE
  scan: scrub repaired 0B in 00:01:33 with 0 errors on Fri Dec 9 03:46:34 2022
config:

        NAME          STATE     READ WRITE CKSUM
        boot-pool     ONLINE       0     0     0
          sdg3        ONLINE       0     0     0

errors: No known data errors

  pool: multivac-slow
 state: ONLINE
  scan: scrub repaired 0B in 07:05:51 with 0 errors on Sun Nov 20 09:05:55 2022
config:

        NAME                                      STATE     READ WRITE CKSUM
        multivac-slow                             ONLINE       0     0     0
          mirror-0                                ONLINE       0     0     0
            ad3395b6-ac43-453c-9ba2-cf65542cb710  ONLINE       0     0     0
            2c423fe3-07a4-44cb-bc39-8b267c79d8c3  ONLINE       0     0     0
          mirror-1                                ONLINE       0     0     0
            8ca9d5a2-04cb-47a0-987a-8b3d27326975  ONLINE       0     0     0
            bec4c67f-fa41-4ab2-907f-0e5b1e052300  ONLINE       0     0     0
        cache
          bfef4d7f-c484-4c37-a0b7-81d59fa1e189    ONLINE       0     0     0
root@multivac[~]# blkid|grep zfs_member
/dev/sda2: LABEL="multivac-slow" UUID="4053361656876756561" UUID_SUB="363024978934966364" BLOCK_SIZE="4096" TYPE="zfs_member" PARTUUID="2c423fe3-07a4-44cb-bc39-8b267c79d8c3"
/dev/sdd2: LABEL="multivac-slow" UUID="4053361656876756561" UUID_SUB="10877444988141002268" BLOCK_SIZE="4096" TYPE="zfs_member" PARTUUID="8ca9d5a2-04cb-47a0-987a-8b3d27326975"
/dev/sdb2: LABEL="multivac-slow" UUID="4053361656876756561" UUID_SUB="9244362969528724588" BLOCK_SIZE="4096" TYPE="zfs_member" PARTUUID="ad3395b6-ac43-453c-9ba2-cf65542cb710"
/dev/sdc2: LABEL="multivac-slow" UUID="4053361656876756561" UUID_SUB="11658072268678819533" BLOCK_SIZE="4096" TYPE="zfs_member" PARTUUID="bec4c67f-fa41-4ab2-907f-0e5b1e052300"
/dev/sdg3: LABEL="boot-pool" UUID="13291833732043257716" UUID_SUB="15424416082895860251" BLOCK_SIZE="4096" TYPE="zfs_member" PARTUUID="b6c0b302-7599-4862-888b-be6f7a85b970"
I think the advice was to simply use /dev/disk/by-id/foo when originally creating a pool and one would never have a problem with /dev/sda being a different disk and this was true at least back to 2015ish.
I've never lost data with btrfs, but I have purposefully avoided anything other than basic disk configurations. I use it on my workstations because it's the default in Fedora these days and haven't had any complaints.
On the flip-side, I've done some of the most horrible things possible and screwed up my 30-disk ZFS array a number of times and I've never lost data. I doubt btrfs could recover from anything I've done to break my ZFS pool.
Overall I think it's a huge shame that it was determined that the CDDL was incompatible with the GPL, because it wasn't intentional, and we would be in a completely different spot for storage otherwise.
Btrfs is able to do one important thing that ZFS cannot: defragment.
I see XFS as the performance leader (appears on TPC.org the most often that I can see), btrfs as the fullest featured, and ZFS with the strongest reliability.
As far as that goes, Oracle prohibits the installation of their database on btrfs (note 2290489.1: "Oracle DB has specifically said that they do not support using BTRFS filesystems... BTRFS is optimized for non-database workloads.").
XFS for databases is what is used on tpc.org, but perhaps these new improvements may help.
Unfortunately, as best as I can tell, you can't shrink an XFS volume other than by completely rewriting it. (Also, mkfs.xfs formats with Y2038-susceptible 32-bit timestamps by default, but that can be overridden.)
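(The override, assuming a reasonably recent xfsprogs, is something like:)

    mkfs.xfs -m bigtime=1 /dev/sdX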
Never lost any data either in 10 years of use as my main driver.
I've ditched RAID 5 mostly because performance sucked so much, but I have never lost any data despite going through many changes of hard drives, changes of RAID mode, rebalancing, etc.
The closest I've come to losing data is probably rotten HDD sectors, but it's always been caught and remapped by scrubs.
Yes, there is the infamous write hole, but that's not a BTRFS only problem.
I feel like the data loss risk doesn't really make sense to be worried about. It doesn't matter what FS or storage system you use, you _need_ a backup if you actually care about the data. And if you have backups, you are fine.
As a random anecdote I've been using BTRFS for everything for over 7 years now and had no issues. Think the only FS I've ever had problems with was ExFAT.
I think it is important for two reasons. First, the primary purpose of RAID is to improve the availability of your data, such that you can continue operating in spite of a hardware failure. If the RAID feature of btrfs has serious data loss problems, then it is taking a feature that is supposed to increase availability and instead decreasing it. It is like a UPS that is more likely to cause a power outage than to protect you from one.
The other reason is that the atomic snapshot feature, and ability to easily transfer diffs of snapshots makes a wonderful foundation for incremental backups. But if I can't trust the filesystem to avoid corrupting my data, how can I trust it to avoid corrupting my snapshots? Hence I can't trust my backups. So I need to go back to a more independent backup processes like rsync.
With those two features gone, my whole motivation for using btrfs over simpler file systems like ext4 is gone, so why bother.
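(For context, the snapshot-diff workflow being referred to looks roughly like this; the paths and snapshot names are made up:)

    btrfs subvolume snapshot -r /data /data/.snapshots/2024-01-02
    btrfs send -p /data/.snapshots/2024-01-01 /data/.snapshots/2024-01-02 | btrfs receive /mnt/backup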
data corruption is slightly orthogonal to data loss.
If your filesystem silently corrupts some data in a way you don't notice, you might just replicate the corrupted data to your backup system over time. And in that case you will still have lost it.
Data loss - e.g. due to a disk failure - is a lot more obvious.
The only time I've lost data with ZFS was when I was doing something supremely stupid, mostly as a dare to see if it would work.
I had recently upgraded my pool and had a bunch of old disks lying around, so I fired them up in a new pool: 6x 4TB and 6x 8TB disks. I took the 4TB disks and set them up in a raid0 to get effectively 9x 8TB disks. Apparently doing this will work, but you MUST make sure you properly wipe all of the newly raided disks of their old ZFS data, or you can end up with ZFS trying to check that pool and breaking the mdadm superblocks.
After I properly zeroed the 4TB disks it's been working fine for weeks. This let me set up a pool in raidz3 with them all, so I can lose any of the much older 4TB disks or up to 3 of the 8TB ones. I don't use it for anything that's absolutely critical to store, just random bulk data that I don't want to worry that much about, but it's been reliable enough since then.
Is there a ZFS-native way to create raid0 disks now? On Linux I usually set up a linear mapping with dmsetup, and then point ZFS to that. But this hides the native disks from ZFS.
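(The linear mapping being described is roughly the following; device names are placeholders, and dm table sizes are in 512-byte sectors:)

    S1=$(blockdev --getsz /dev/sdb)
    S2=$(blockdev --getsz /dev/sdc)
    printf '0 %s linear /dev/sdb 0\n%s %s linear /dev/sdc 0\n' "$S1" "$S1" "$S2" | dmsetup create jbod0
    # then hand /dev/mapper/jbod0 to zpool create as a single vdev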
I've done that a couple times to make a bunch of mismatched disk fit into zfs's requirement that all are of the same size. The ability to work with a mix of disk sizes is one advantage of btrfs - if you remember to be extra careful on a disk failure and bad shutdown. Plus the ability to dynamically grow the array, but there is work to improve that in the zfs side as well.
Indeed, it is less about raid0 itself and more about creating a raid0 and then building a raidz on top of it.
Say you have `1, 1, 2, 2` sized devices. A raidz requires them all to be equally sized which can be done by first adding `1 + 1 = 2` and then creating a zpool with `[1+1], 2, 2`. Can this 1 + 1 step be done within zfs? Hypothetically this would be done by creating a raid0 pool and then a raidz on top via `zfs create tank raidz (raid0 a1 b1) c2 d2`, or first running `zfs create ab1 a1 ab` and use that in the next `zfs create ab1 raidz ab1 c2 d2`.
RAID0 could be striping which would require all devices to be the same size or it could be concatenation which leaves it to some other layer to balance load across drives. What I suggested allows the use of the full capacity of the drives with weighted data placement based on available space on the drives.
What I suggested is not really RAID0. raidz is not really RAID5. ZFS intentionally improves on these for performance and reliability reasons.
I believe you can accomplish this by partitioning the disks with sizes that match the smallest disk too, for ZFS. You still have all the same reliability issues, though, and there's probably some kind of performance hit too. Though you could then likely set copies=2 or something to add some redundancy against bad sectors at least, and get a little better there than the linear mapping with dm or lvm2 on Linux. You should then also be able to grow the array, but like the linear mapping it's going to be unbalanced as to where the free space is.
My personal preference in this kind of situation is to still just use lvm2 to JBOD the disks together if I'm fine with the risk of losing it all anyway, and just set up a good backup strategy. More because I'm familiar with it and all the tools than other setups. I think ZFS might give me more early warning signs of a disk with an issue, but the performance hit isn't always worth it to me there. I've got an 8TB RAID0 NVMe SSD array (4x2TB) set up to give me some blazing fast local storage for a few key VMs that I just back up daily to the ZFS storage. I can get 7.5GB/s reads and 6.7GB/s writes to that array doing that, using an lvm-thin pool so I can nicely snapshot during backups.
You can stripe disks (raid0) in ZFS but you should not be using any hardware raid solution, motherboard or pcie card. Host bus adapters are preferred, anything with direct access to the disks.
Glad to see Btrfs getting continual updates. It's my favorite filesystem for my personal machines (work and home PCs). The feature set is just awesome. I just hope it doesn't get completely abandoned as its development seems to have slowed significantly.
The only thing it's missing before I consider it full-featured is stable RAID 5/6. But it looks like that hasn't been forgotten.
The development speed has to balance the schedule of Linux kernel development (merge window, release candidates, 3-month cycle) and the demand to merge several distinct features or core changes. There are no formal deadlines, but we have to make sure that the new code is feasible to stabilize in the given time. Once a new feature is in the wild, some bugs or fixups are still needed, so this takes some time from new development and has to be accounted for.
My strategy for pulling new things is to have one big feature that has ideally been reviewed and iterated on the mailing list, or where a lot of testing has already been done. In addition, two smaller features can be merged, with limited scope, not affecting the default setup and ideally easy to debug/fix/revert if needed. Besides that there are cleanups or core updates going on, which should not touch the same code, to make testing less painful. With new features the test matrix grows, and code might need wide cleanups or generalizations before the actual feature code is merged. So this can indeed slow down development.
The raid56 is progressing but until the 6.2 pull from today there was not much to announce regarding stability/reliability. There were proposed fixes but as incompatible features, which means some changes on the user side and with backward compatibility issues. What's pending for 6.2 should fix one of the bad problems at least for raid5.
> The only thing it's missing before I consider it full-featured is stable RAID 5/6. But it looks like that hasn't been forgotten.
I've been running BTRFS RAID6 since 2016 and have only had one issue (arch kernel needed to be rolled back) and never suffered anything catastrophic. It's perfectly happy humming along with a 15x8TB raid array.
Are you using raid6 for both data and metadata? If yes, you might as well have been running raid0 and gotten more space and performance out of the same disks. Because of well-documented issues, metadata on raid6 is not yet safe in power/HW failure situations. And if you have no safety in those situations, what is RAID even for?
(It's perfectly safe to run metadata on raid1 and data on raid5/6.)
I'm running RAID6 for data and RAID1 for metadata. Wanted the metadata replicated. So far I've had no HW failure situations and I have suffered a few power outages while I was away from home even though I have a backup battery.
These drives have surpassed 50,000 hours of running time, so I am fairly certain the day of reckoning is coming, but I have backups and a plan to recover.
I may recover to something not BTRFS. Not sure just yet.
Metadata on raid1 doesn't guarantee the resilience to two-drive failures that raid6 is used for, but metadata on raid1c3 does give you that degree of resilience (with more space overhead, but that's seldom a problem for the metadata).
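(Converting an existing filesystem's metadata in place, assuming a kernel and btrfs-progs new enough for raid1c3, is a one-liner:)

    btrfs balance start -mconvert=raid1c3 /mnt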
Maybe I'm just cynical but I think the ship has sailed for BTRFS RAID 5/6. It's now part of the global mindshare that BTRFS RAID 5/6 == data loss, no one wants to be the guinea pig that proves it works.
Better to direct resources towards bcachefs or ZFS IMO.
Ideally we'd be working based on actual information rather than global mindshare. There will be guinea pigs to test it. There are already quite a few people running it despite the warnings.
As much as I want to see a wide use of bcachefs, it's still years away. As someone who actually wants to store data - why would you direct resources to bcachefs which is known experimental, rather than btrfs which plainly documented raid5 as not ready and now may decide to change it to ready... if it is? (I'm donating to bcachefs patreon, but commenting from the PoV of someone choosing a solution today)
Right but "actual information" here is "developers over and over fail to nail what ZFS and MDRAID did well decades ago". There is no reason to trust their next try will be "the one"
MDRAID had the write hole until 4.4 (2015, https://lwn.net/Articles/665299/), and the fix had been long awaited back then. ZFS, to my knowledge, deals with it via variable stripe length, which has its own problems, but yeah, it works. Btrfs changes the implementation of the stripe update while preserving the on-disk format (i.e. it can't do the same as ZFS without introducing an incompatible change); an intent log/bitmap has been proposed (which would be the MDRAID approach), but that's another incompatible change. So the 'next one' comes at the cost of performance, but with the same compatibility.
Did well in what context? RAID5? There were enough people with motivation to spend time on it rather than other things for mdadm/zfs. It's not like btrfs devs wouldn't know how to implement it if you paid them to work full time on it.
From widespread use, yes. Bcache was already stable at least 7 years ago. Bcachefs isn't even included in the upstream kernel yet, much less in any distro as an official option. There's only a plan to start submitting it soon™, and it will take months to complete.
Months? I thought it would just go through all the data and verify that the stored checksums match. Even if it fetches filedata more than once to ensure the parity bits are ok, how can it take longer than reading the entire disk a couple times?
ZFS is widely used by many users and organizations. Just because it is not currently included in the Linux kernel does not mean that it is useless. There is ongoing work to integrate ZFS into the Linux kernel, and in the meantime, many users continue to find value in using ZFS on their systems. As an alternative, you could try using FreeBSD, which includes ZFS as part of the kernel. This can make it easier to take advantage of the features and benefits of ZFS without having to install and configure it separately. Additionally, because ZFS is integrated into the kernel, it can take advantage of FreeBSD's advanced security and performance features.
Enlighten me on how a filesystem with an incompatible license is ever going to be mainlined. Having to use an out-of-tree patchset is an immediate no as far as I'm concerned. I also have no intention of switching to a hobbyist operating system, thank you.
Linux hasn’t really been a hobbyist operating system for years as anyone taking even a casual glance at a list of contributors would know but sure let’s all pretend FreeBSD is somehow relevant or properly maintained. That’s probably in the same universe where ZFS has a chance of being mainlined and isn’t heavily encumbered by Oracle owned patents.
Why people would contribute to this mess is beyond me but everyone is free to do what they want.
> Linux hasn’t really been a hobbyist operating system for years as anyone taking even a casual glance at a list of contributors would know but sure let’s all pretend FreeBSD is somehow relevant or properly maintained.
You haven't glanced at a list of FreeBSD's contributors, have you?
I know FreeBSD is used by Sony on its console, by Netflix and by WhatsApp. Everyone knows because they are pretty much the sole serious contributors to it. I don’t think it’s particularly relevant nor do I think it’s particularly good to be honest but we are getting far from the initial discussion about ZFS.
We had tried RAID-1 on btrfs, and the experience was not that great either. Our current systems, when they use btrfs, are on top of md (mdadm) RAID-1 arrays.
AFAIK they use it internally; there are articles on lwn.net about how, and the use cases are root filesystems and containers. I'm not sure I understand what you mean by the community sentiment; there are examples of code they developed internally first and sent upstream, and in all cases I remember there were no problems. What can happen in the community is, e.g., discussion about how the patches are organized or whether the changelogs are complete. It's of course easier to develop something internally; if it touches other subsystems, or if there's enough coverage just for the new code, the test/fix/deploy cycle is much more flexible. Once it's supposed to go through mailing lists, or requires convincing other maintainers to accept changes, it takes longer and must stick to the development cycle. This benefits both sides in the long run.
Meta and related companies have much better systems available for durability than RAID5. I would be moderately surprised if they commonly use any standard RAID level.
Using traditional RAID (or moderate improvements on it, such as provided by btrfs and zfs) to provide drive-level redundancy seems like a waste of time when you also have to worry about redundancy between servers, racks, datacenters and regions.
All of the arguments for why it's better to have something like zfs handle RAID rather than layering something on top of a traditional hardware RAID controller also work for explaining why you should prefer managing storage more globally rather than layering on top of something like zfs—if you have the resources to develop and maintain a true "full stack" storage solution. Which Facebook/Meta obviously does, when they can do things like publish their own spec documents that SSD vendors design around.
As someone who is in the middle of pulling apart a BTRFS volume by hand (read: writing code to interpret the data structures) to try and recover it, I think being burnt once is enough.
No indication of any hardware issue: No recent power loss (& it's on a UPS), no SMART issues, no memory test positives. But the block tree (at least, WIP) is f*cked across all the disks (looks like two competing writers went at it) and none of the available tools can deal with it.
It wasn't a super exotic setup either: RAID10 with 4 disks (2 stripes), fairly full and regular snapshots/cleanup, but that's it.
I already converted my root to ext4 because paranoia and I'm probably going to move bulk data (what can be recovered) to ZFS.
This just can't happen with btrfs, since the _old_ block tree should still be on the disk somewhere. I.e., corrupting the image so that btrfs shits itself is easy to do, but the _previous_ version of the data is still there by construction, and if you are already manipulating the data structures it should be easy enough to just point the superblock to it.
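(A sketch of the tooling that exercise usually involves, for anyone in the same boat; the device path is a placeholder and the bytenr is whatever candidate root btrfs-find-root reports:)

    btrfs-find-root /dev/sda                          # list candidate (older) tree roots
    btrfs restore -t <bytenr> /dev/sda /mnt/recovery  # copy out whatever is reachable from that root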
The worst issue I've ever had with btrfs simply required zeroing the journal (by hand -- both the kernel and the *fsck tools would crash when reading it). The first time, I reported the issue to the mailing list and the underlying problem was promptly fixed. However, it happened a second time, and I didn't bother reporting it. As many people say in this thread, once is too many when it comes to filesystems.
For what it's worth, about a year ago I had data being corrupted with ZFS (something about NVMe + raidz + dedup if I recall correctly) -- this was fortunately while I was testing deployments, and quickly confirmed as a bug (and fixed) by the ZFS team. But it left me equally burned.
Using xfs|ext4 + mdraid is just a lot simpler and much faster, and it addresses most of my use cases.
I've been managing a raid6 ext4 array with mdadm for 10 years. I started with 4 x 4TB disks and kept adding; it's up to 11 disks now. It works reliably and as designed. I've had a few disk failures and replaced them without issues. That's one of the nice things about mdadm vs ZFS: you can add and remove disks from the array as you see fit, rather than being forced to upgrade all disks if you want to increase the size of your array.
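(The grow-one-disk-at-a-time workflow being described goes roughly like this; device names and the target count are placeholders:)

    mdadm --add /dev/md0 /dev/sdl
    mdadm --grow /dev/md0 --raid-devices=12
    # once the reshape finishes, grow the filesystem
    resize2fs /dev/md0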
Why is this comment downvoted? ZFS's limited disk management, forcing you to use the size of the smallest disk and making it difficult to resize an array, is not a feature. It's a serious limitation when using it as a normal user and not in a professional environment. It's extremely annoying that most of the ZFS zealots are in denial.
I did that for a long time. I reached 15 disks. Then something went wrong, and I lost everything. I now run multiple 6 disk raidz2s, in part because I wanted to remove the option that led me down the bad path.
Reading these horror stories just makes me not want to rely on smart filesystems at all. If a bug can take out my whole array, or if the system is as inflexible and magical as ZFS, I'd rather not rely on it at all.
I've been managing a home NAS for years with SnapRAID and plain old ext4. Sure, it doesn't have the bells and whistles of something like ZFS, but it's simple to use and understand, scales with anything you throw at it, and there's no way to lose the whole array.
I'm currently transitioning to a multi-node setup, and will probably move to Ceph, otherwise I would stick with SnapRAID. Every so often I look into btrfs/ZFS/mdadm, but keep reaching the same conclusion that it's just not worth the risk.
> will probably move to Ceph ... Every so often I look into btrfs/ZFS/mdadm, but keep reaching the same conclusion that it's just not worth the risk.
The worst thing about my experiences with mdadm and ZFS is that I've had exactly one catastrophic failure in 15 years.
The best thing about my experiences with Ceph is that it's really reinforced the importance of a good backup strategy.
The combination of the two means I now run ZFS, and I have a comprehensive backup strategy that is regularly tested, but has never needed to be invoked.
I strongly recommend developing a strong backup story before you go down your Ceph journey.
Cars require regular maintenance and have known failure modes if you don't bother to do it; that's neither magical nor dangerous. SnapRAID + ext4 isn't particularly simpler than ZFS, just different. The fundamental tradeoff of space vs reliability seems to be mathematically fixed: it's impossible to do better in one dimension without sacrificing the other. Making it possible to lose SOME of your data in fact seems like the worst choice. ZFS makes it trivial to lose nothing: use enough hardware, do cheap replication regularly, and scrub periodically.
If anything, the fact that SnapRAID doesn't live-replicate every change to every disk in the array makes it impossible to achieve the reliability offered by ZFS. It's forever the inferior cousin, and in fact more complicated, for layering one dissimilar technology on top of another.
SnapRAID has its drawbacks, sure. As with any technology, deciding to use one solution over another is a balancing act of choosing the set of drawbacks that are acceptable for a particular use case.
In this case, _for my simple needs_ of running a home NAS, the fact SnapRAID doesn't run in real-time is not an issue. In fact, I prefer being in control of when it runs, and what it's doing exactly. Having a short time window where some new data is not replicated is a negligible drawback to me.
OTOH, while ZFS solves this particular issue, its drawbacks of being difficult to scale, and a huge black box that claims to "just work", when in fact my _entire_ array relies on it working perfectly, 100% of the time, are a tough pill to swallow. I'm sure that with the years of dedicated improvements and stability fixes, it's a battle-tested system where the likelihood of it failing is close to 0, even on Linux. But the fact that it's theoretically possible is a deal-breaker to me. In this sense, I much prefer SnapRAID's approach that makes this literally impossible. SnapRAID could stop working entirely or disappear tomorrow, and all my data is perfectly safe.
So, yes, SnapRAID + ext4 is radically simpler than ZFS, IMO.
Would I recommend this setup in a corporate environment, where company resources are on the line? Probably not. But I would still advocate against ZFS and mdadm, and probably suggest something like Ceph instead.
Sorry that you lost everything. Keep in mind that RAID is not backup. I also don't understand the correlation between reaching a certain number of disks vs any other scenario. Disaster can always strike, so it's best to be prepared.
We run BTRFS on our work laptops and have been for a few years now. Keep in mind that we're only 4 devs, so our sample size is small, but we haven't had any issues with it and it's been very pleasant to use so far!
We published our internal doc for how we install our Arch setup with fully encrypted BTRFS if anybody is curious. Happy to answer any questions too!
That is a pretty awesome article. I recently reinstalled Arch on a new PC and used archinstall to set up encryption, but it had a few bad defaults that I had to later correct... like using a sha512 hash algo, which slowed down opening the drives significantly, and somehow adding 2 different types of compression to the fstab. Once that was fixed, it has been working great.
I am running Fedora on encrypted btrfs, on top of a mirror RAID, and I have a lot of problems with the performance of the disk. I/O frequently hangs the GUI for a few seconds. Throughput is fine once it gets going, but random access to many files appears to be an issue. Searches point to btrfs on encryption being slow, but I think the RAID is also a factor.
I ran into a major problem with performance on btrfs when it was used on an extremely high-write volume that filled up. Between the fragmentation and the full disk, I couldn't even /copy/ data off the drive at a reasonable rate.
It was a perfect storm of me being an idiot and heavy use. But I dunno if they have remotely resolved the issues
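(For anyone else who hits this: the commonly suggested first aid, with no guarantees, is to check how the space is allocated and run a filtered balance to reclaim nearly-empty chunks:)

    btrfs filesystem usage /mnt
    btrfs balance start -dusage=10 /mnt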
I actually hit this for the first time in all my years of using btrfs a few weeks ago.
Every month or so I spin up external drives to write out monthly backups of my NAS and personal computers. I forgot I'd taken on the home video collection and overfilled the array. Deleting some old subvolumes and having the space reclaimed took hours (on a 99.9% full 5 TB RAID-1 volume). Part of that is that I use compress=zstd:10 on the external drives, but I wasn't expecting it. Write speeds tanked to tens of KB/s. Since these are those Seagate drives from Costco, they're probably also SMR drives...
Yeah. On my end it just slowed to a completely unusable level. So while it technically didn’t lose data it effectively did since I was looking at weeks to replicate any data off it. I let it run for a week and it was still crawling when I gave up (4TB drive)
I just double checked the drive and it is. I forgot about that aspect of it.
However, I have run this same drive out of space using ext4, ZFS, and XFS while playing with it (it's a drive I use for torrents etc.), and none of them crapped out like btrfs did on it.
I use BTRFS on my laptop, and in case anyone is interested in trying it, please be sure you know how to chroot into the system in case it won't start. Arch Linux had a problematic GRUB package pushed to stable a few months ago, and it bricked people's computers by booting directly to the BIOS. Whatever you do, make sure to take notes on the configuration (if it is DIY), because there are many differences between BTRFS and EXT4, so you can't use a normal chroot guide to fix your system.
They might update their kernel, just not the kernel version. They seem to be using SoCs in their devices and then the respective BSPs from the SoC vendor, which are notorious for not being updated to a new kernel.
I have one of the infamous Atom C2538-based Synology boxes here, and it is running kernel 3.10.108, built in October 2022. The device is a 2019 model (the SoC was introduced in 2013).
I, like many, have a BTRFS raid 5/6 story. I once worked at a place that had double-digit-TB databases. A few of the servers that predated me were set up with BTRFS raid 10 as their backup strategy: basically stop the server, snapshot the volume, back up from the snapshot. It worked well enough.
There came a point where we were going to have to move to a larger server, as there were no more slots for additional disks. New hardware was procured, provisioned, prepped, ready to roll, with the exception of the actual cutover. However, the c-suite got involved, as they tend to do from time to time, and kept kicking the can down the road.
Eventually disk space became a critical issue, and a junior sysadmin decided it was a good idea to tell the COO that btrfs can be converted to raid 5 on the fly. I told them over the phone that this would guarantee data loss and refused. They literally came to my office to force my hand. Some people just have to learn the hard way.
That story aside, I'm a big fan of BTRFS and have been using in various configurations on my desktops for years. I'm glad to see this progressing.
I have used btrfs for a low-memory-footprint raid 0 solution to span drives - mind you, for expendable datasets. It has worked really well; I've never had failures due to, shall we say, btrfs-rot.
Filesystem is light on requirements and works really well. Compression works great too.
The only thing I haven't played around with are reflinks.
For critical datasets I apply 3-2-1 backup strategy and use ZFS.
Yeah, looks like that one is actually designed to do what it does, not just "somehow implemented". The BTRFS bugs fixed in every kernel release (for how many years now?) are horrific and show that BTRFS is crawling with corner cases.
Are you saying that complex software getting bug fixes is an indication of something? I don't see how it's a bad thing that btrfs has been squashing bugs and corner cases.
I read all these horror stories here about RAID configuration and data availability/integrity, and I wonder: how do the cloud providers solve these issues at their scale?
Why do bare-metal admins have to spend countless man-hours troubleshooting them?
You can start with the Backblaze articles on their data storage tier; they will give you some hints.
Also note that the comments in this thread are heavily biased toward DIY setups and don't cover enterprise storage, which can, for instance, come as a black box with exported LUNs.
Having used both ZFS and mdadm plus a conventional filesystem, I find the vertical integration a lot nicer and smoother to work with. You don't waste time scrubbing empty space. If there's corruption, then you know specifically which files were affected, effortlessly.
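Concretely, that is the difference between an mdadm check (which can only tell you the mirror halves disagree) and something like this minimal sketch, assuming a pool named tank:

    zpool scrub tank        # only reads allocated blocks, verifying checksums
    zpool status -v tank    # if errors were found, -v lists the affected file paths by name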
In theory it should allow fine-grained control over RAID levels at the file/directory/subvolume level, so that depending on your specific needs you could set direct/raid1/raid10/raid5/raid6 without having to set up different arrays for each level you want to use and then manage the relative space between them. I can imagine uses for that, but the reliability of the more tested solutions is much more important.
The flexibility of software RAID is nice in that you can mix and match hard drive manufacturers and generally have zero issues. For hardware RAID, I've always been told to stick to one drive family from one manufacturer and not mix and match.
Generally every recommendation I've heard is don't use hardware RAID.
Some of the problems:
* RAID cards usually have underpowered CPUs and can easily be a bottleneck. Often the best performance with a RAID card comes with the RAID disabled.
* Even battery-backed RAM is often pretty slow and on the wrong end of a high-latency connection between CPU/RAM and disks. Generally, putting the money into RAM or an intent log/write log/SLOG is a better investment.
* Metadata is often undocumented, often needs to be backed up via obscure, device-dependent methods, and is required to recover from a RAID adapter failure.
* The firmware is often buggy, especially in the handling of the numerous failure modes.
* Recovery often requires the same card with the same firmware and a copy of the backed up metadata. Not all cards can recover all metadata from just the drives.
* Often they are "too" smart and won't export raw drives for use with more advanced filesystems like btrfs or ZFS. Some require ugly workarounds like exporting each disk as a single-disk RAID0.
* RAID cards often hide the SMART info from the OS, crippling the ability to predict drive failures, or hide the functionality behind a weird software stack that assumes an SMTP server and integrates poorly with whatever monitoring/alerting system you use for the operating system (see the sketch after this list).
* Some RAID cards require a network connection and run a buggy, insecure, out-of-date web stack on some undocumented and rarely (if ever) patched CPU that was obsolete the day it shipped.
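On the SMART point above: smartmontools can often reach through the controller, but you need the controller-specific addressing. A minimal sketch for a MegaRAID/LSI-style card (the device path and disk index are illustrative and vary per setup):

    # query SMART data for the disk at MegaRAID device id 0 behind the controller
    smartctl -a -d megaraid,0 /dev/sda
    # some HBAs/enclosures want SAT passthrough instead
    smartctl -a -d sat /dev/sdb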
My experience with hardware RAID cards (LSI, PERC, etc) is vastly different than the recommendations you heard.
I have been running production servers with LSI hardware RAID cards for the past 15 years and have not experienced the items you note. The only thing I did experience was a bad RAID card (solved by replacing it with a similar LSI model).
To contrast your "problem" list:
* Compared with MDADM (mirrors, RAID-5, RAID-6), the LSI RAID cards tend to be on par as far as performance (esp with BBU)
* A failed drive won't prevent the array from coming on line (in my experience)
* All my servers have battery backed RAM; never had an issue with slowness or high-latency. Can't say the same for ZFS.
* Never, ever had a problem with metadata on the disk. Not sure why this seems to be an issue
* LSI cards have pass-thru mode and can easily be used with BTRFS/ZFS. I have done lots of tests; never an issue
* LSI cards have their own "patrol read" mechanism to scan drives for bad sectors and send alerts when something seems wrong
On a positive note:
* Hard drive failures are truly plug-n-play. Send a tech out to the cabinet, spot the RED light, replace drive. Done. No OS work, no console, no crash cart needed.
* You can logically divide the drives using the RAID manager tools just like Linux MDADM, ZFS, BTRFS, etc. Easy.
* You don't need the specific card to replace a failed card. You just need a card that can read the metadata on the array to bring it back to life.
* Lots of monitoring scripts available in the wild to get RAID stats, rebuild times, etc.
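For example, on LSI/Broadcom cards most of those monitoring scripts boil down to parsing storcli output. A minimal sketch, assuming controller 0 (exact syntax varies a bit across storcli versions, so treat this as illustrative):

    storcli /c0 show              # controller summary: virtual drives, physical drives, BBU state
    storcli /c0/vall show         # state of every virtual drive (optimal/degraded/rebuilding)
    storcli /c0/eall/sall show    # per-slot physical drive status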
I am not saying hardware cards are indestructible. But, all in all, hardware RAID cards are solid devices that have been in many, many production servers all over the world. I certainly would not dismiss them in favor of ZFS RAID (which tends to be slow for many operations). Pick the best tool for the job.
Glad it's worked for you. I don't think I've had anything that conflicts with your experiences. I didn't mean to imply a degraded hardware RAID wouldn't mount, I think that's a BTRFS RAID5 issue. Generally I'd consider BTRFS a toy, doubly so for RAID5/6 on BTRFS.
Generally I (and colleagues) consider a hardware RAID failure a nightmare and a SAS HBA failure an annoyance. Doubly so if your storage design included cross connected servers and you just mount the storage from the other server. I've never tried similar with a hardware RAID. Can you cross mount and easily import/export RAID sets between controllers?
The main hardware RAID performance issues I see are with higher drive counts or NVMe drives combined with RAID6. Have you seen any hardware RAIDs that can manage a few GB/sec with 2 disks of redundancy? 3 disks? Even spending a few $k on hardware RAID seems to lose to a random 5-year-old server using ZFS or software RAID by a large factor. It seems like even pretty old x86-64 servers manage a few GB/sec per core.
Careful with the passthru: I had many generations of LSI cards that worked, and 3ware and Areca before that. However, I've been hearing that the new LSI RAID cards lack passthru/JBOD mode, and I've seen reviews and complaints about the lack of it. One case involved an LSI hardware RAID connected to a Dell JBOD array, but maybe it was specifically crippled by Dell.
I avoided ZFS for years, as it was generally slower at the relevant workloads (measured with IO logs collected via systemtap and representative loads created with fio). However, once I added cache and the benchmark included multiple streams of writes, ZFS was a huge win; in particular, 64 or more sequential write streams to 120 disks. In fact, ZFS did better with 3 disks of redundancy than other filesystems did with 2.
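For reproducibility, the multi-stream case is easy to approximate with fio. A minimal sketch (the directory, sizes, and job count are illustrative, not the exact workload described above):

    # 64 concurrent sequential write streams, 1 MiB blocks, aggregated reporting
    fio --name=seqwrite --directory=/tank/fio --rw=write --bs=1M \
        --size=4G --numjobs=64 --ioengine=psync --group_reporting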
In the past I had more luck with ZFS or MDADM for doing things like /dev/sd[ab]1 as a 32 GB RAID1 for boot, then /dev/sd[abcd]2 as RAID5 or RAIDZ2. Sounds like the hardware RAID tools are getting better.
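Roughly that layout as a sketch (device names are illustrative; the partitions would be created with your favourite partitioner first):

    # small RAID1 across the first partitions for /boot
    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
    # the big second partitions as RAID5...
    mdadm --create /dev/md1 --level=5 --raid-devices=4 /dev/sda2 /dev/sdb2 /dev/sdc2 /dev/sdd2
    # ...or as RAIDZ2 under ZFS instead
    zpool create tank raidz2 /dev/sda2 /dev/sdb2 /dev/sdc2 /dev/sdd2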
I've heard of some hardware RAID setups managing to recover hardware RAID metadata with a new card, and some not - even cases of successes and failures from the same company, and cases where support claimed it would work but then decided the card and/or firmware wasn't close enough. Sure, decent support can overnight a card, but that's nowhere near as nice as just being able to mount the drives on any Linux box.
Interesting, thanks for the perspective. Myself, I treat an MDADM/ZFS RAID failure as more of a headache than one on a RAID card. As I mentioned earlier, RAID cards allow for easy hot-swap without any OS intervention. I am always nervous when I have to replace drives in a ZFS pool - mainly because I forget the exact commands that need to be run (in order) for a successful swap. With hardware RAID - no commands :-)
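For the record (and for future me), the ZFS swap is usually only a couple of commands. A minimal sketch, assuming a pool named tank and a replacement disk at /dev/sdX (device names are illustrative; /dev/disk/by-id paths are nicer in practice):

    zpool status tank                # identify the FAULTED/UNAVAIL device
    zpool offline tank sdb           # optional if the disk is already dead
    # physically swap the drive, then:
    zpool replace tank sdb /dev/sdX  # resilver starts automatically
    zpool status tank                # watch resilver progress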
Thus far, we don't have any high-density NVMe servers so I can't really test high-end servers. My experience is limited to 8x SSDs or 16x spinning drives in our arrays.
Finally, I avoided ZFS for a long time as well. Early on (OpenZFS 0.6, 0.7, etc.), I spent an enormous amount of time trying to tweak/tune the system just to get on par with our HW RAID devices. No matter what I did, I simply could not get the server (a typical Linux NFS NAS) to deliver decent performance. In fact, I clearly remember HW RAID hitting 1.8 GB/sec on our SSDs while OpenZFS could only get around 400 MB/sec. Only around the OpenZFS 2.1 mark did I see any real performance gains.
...Speaking of ZFS... I recently worked on a project to replace XFS with ZFS for PostgreSQL servers. I learned a lot about OpenZFS - specifically the memory latency when using compression. I had a lengthy discussion with the OpenZFS "gang" (https://zfsonlinux.topicbox.com/groups/zfs-discuss/T5122ffd3...) to track down some large latency numbers. It turns out you need to disable ARC compression and ABD scatter, otherwise you will definitely hit some performance issues. Take a look at that thread - especially the very end, where I publish some tuning suggestions. It was a fun learning exercise :-)
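For anyone following along, those two knobs are OpenZFS module parameters; a minimal sketch of how they might be set, persistently or at runtime (whether they help is workload-dependent, so see the thread above before copying blindly):

    # /etc/modprobe.d/zfs.conf (persistent, applied at module load)
    options zfs zfs_compressed_arc_enabled=0 zfs_abd_scatter_enabled=0

    # or at runtime:
    echo 0 > /sys/module/zfs/parameters/zfs_compressed_arc_enabled
    echo 0 > /sys/module/zfs/parameters/zfs_abd_scatter_enabled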
> mainly because I forget the exact commands that need to be run (in order) for a successful swap. With hardware RAID - no commands :-)
I find it useful to have some sort of tag or wiki link added to the alert information (like on a failed drive in your case) - it makes it much easier for every team member to fix the issue without guessing or inventing their own way during an outage.
And thanks for the link you provided, quite interesting. I plan to redo my canary server with ZFS and MySQL (the primary draw for me is roughly 3x compression), as the current one is not very stable, most likely due to my attempts to use the latest ZFS with the Ubuntu HWE kernel on 20.04. This time I will use 22.04 with stock ZFS and go through your findings more closely.
Cool! Feel free to reach out if you have any questions (contact details in profile). I am by no means a ZFS expert, but I have been doing servers/storage for a long time and can probably help out if you run into an issue.
I run my two LSI cards in HBA mode. The initial minor annoyance was having to flash the cards, which coincidentally was also my first time ever doing something like that.
Do you have any experience with the hardware RAID provided by motherboard manufacturers like Asus? I have a spare dual-Xeon system I may use for my next NAS build, and I'll pair it with two more RAID cards that can be put in HBA mode.
The biggest problem with hardware RAID is controller compatibility. If the controller dies, chances are the whole array is dead if you can't find the exact same model.
"This should ..." doesn't belong in a fucking file system! Take a deep breath and try again. Maybe limit the scope to something within your understanding this time.
Though I've used XFS a lot over the years, mostly because the Debian installer gave 12-year-old me the choice between ext2, ext3, and XFS, so XFS it was, because it sounded cooler.
Come to think of it, the vast majority of my filesystem woes were with the ext3/4 family: I've seen a zeroed file, and I've also seen a file replaced by the contents of another file on unclean shutdowns.
It has gems such as:
* It won't boot on a degraded array by default, requiring manual action to mount it
* It won't complain if one of the disks is stale
* It won't resilver automatically if a disk is re-added to the array
I think the first is the killer. RAID is a High Availability measure. Your system is not Available if it fails to boot.
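Assuming this is about btrfs RAID1 (as elsewhere in the thread), the manual action in question looks roughly like this sketch, with device names and the devid purely illustrative:

    # one mirror leg is gone; btrfs refuses a normal mount, so mount explicitly degraded
    mount -o degraded /dev/sda2 /mnt
    # replace the missing device (devid 2 here) with a new disk, then verify checksums
    btrfs replace start 2 /dev/sdc2 /mnt
    btrfs scrub start /mnt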