Hacker News
An Introduction to ZFS (servethehome.com)
374 points by arm 11 days ago | 250 comments

You know, these articles always come up in the context of fileservers but...

... for me using ZFS has changed the way I look at files, filesystems, data, and backups for general computing. I've been a Linux user for 13 years but never felt the need to have a fileserver. Now being able to plug a drive in and take a snapshot without rsyncing or thinking about what I'm snapshotting, having it be inherent to the filesystem, was a game changer.

Not to mention being able to snapshot important folders to the native drive in case I need to recover a file from a previous state. I run datasets for categories of data and I can choose categories that I want regular local snapshots of (zvol/crypt/Documents, zvol/crypt/scripts, zvol/crypt/Papers)

Essentially, ZFS manages my files for me. And it all comes with things I didn't know I needed, like filesystem compression. I know BTRFS also attempts to provide this, and there are the licensing issues with ZFS, but I also wanted MacOS compatibility, although that was an adventure of its own.

> Essentially, ZFS manages my files for me.

Yes! Managing our files is the whole point of file systems! It's amazing how bad at it most of them are. Linux is still catching up with btrfs...

It's extremely aggravating how most file systems can't create a pool of storage out of many drives. We end up having to manually keep track of which drives have which sets of files, something that the file system itself should be doing. Expanding storage capacity quickly results in a mess of many drives with many file systems...

Unlike traditional RAID and ZFS, btrfs allows adding arbitrary drives of any make, model and capacity to the pool and it's awesome... But there's still no proper RAID 5/6 style parity support.

> It's extremely aggravating how most file systems can't create a pool of storage out of many drives.

Some would argue that is the job of the volume manager, not the filesystem. On Linux it's LVM2, FreeBSD has vinum, for example.

I would argue that a volume manager is just one of many parts of a good file system. Managing physical devices at the file system level has advantages: the file system can balance its data automatically in the background. Linux LVM can't do that.

I think this gets into the more philosophical part of the debate like with systemd, where it seems like it could be argued that the separation is better—or not—depending on your needs.

And for my own needs, I typically favor the security of not throwing too much complexity on top of my storage system (if it would be impossible for me to recover something without relying on a ton of magic I can't fully understand, I'd rather not use it in production).

> Some would argue that is the job of the volume manager, not the filesystem.

ZFS is both at the same time.

That is why there is the zpool command and the zfs command. While tight-coupling is sometimes bad, in this instance it is useful:

* https://web.archive.org/web/20070508214221/http://blogs.sun....

And the Windows equivalent is Windows Storage Spaces, which sucks hard for parity volumes (max write speed 60MB/s, whereas the same set of drives in any other software RAID writes close to 1GB/s).

I would have thought this was a long solved problem. Kind of like chat applications.

It doesn't help that Microsoft has completely abandoned ReFS. Maybe there were good technical reasons why it was a dead-end, I dunno, but it's really no fun being stuck with NTFS.

It really is incredible. ZFS datasets are essentially data collections, and I can categorize my files according to how I want them managed. It's essentially what iOS dreams of but in a much more manageable, configurable, and open way.

One beef: Moving files between datasets is an unnecessarily expensive operation.

Mainly because the dataset is the ZFS version of a partition. In one pool, each dataset can have its own record size, compression settings (on/off or different levels, even different algorithms), be encrypted, have deduplication turned on... It's impossible for it to not be an "expensive operation".

> Linux is still catching up with btrfs...

ZFS is available on Linux if you want it, in fact FreeBSD is basing its support for ZFS on it.

For a mainline option, I'm personally holding out hope for bcachefs, rather than btrfs.

I think the way btrfs manages physical volumes is better. With btrfs, I can slowly buy drives of different capacities and add them to the file system one by one. I don't have to plan ahead of time like traditional RAID and ZFS setups. This is excellent for home data storage!

The only thing that holds it back is the lack of proper parity support. Does anyone have an update on that? The kernel wiki says it still has problems...

regarding raid5/6 on btrfs - in this recent message [1] Zygo Blaxell writes:

> Not much has changed for raid5/6 since 2014, other than the introduction of raid1c3 for metadata in 2019 to make filesystem with raid6 data usable. Almost all of the bugs from 2014 still exist today. Developers have been fixing more severe and less avoidable bugs in the meantime.

[1]: https://www.spinics.net/lists/linux-btrfs/msg107268.html

Also, worth mentioning another email from same author with guidelines for users running btrfs raid5 array: https://lore.kernel.org/linux-btrfs/20200627032414.GX10769@h...

You may wish to check out the new dRAID mechanism that recently got committed:

* https://www.youtube.com/watch?v=jdXOtEF6Fh0

* https://openzfs.github.io/openzfs-docs/Basic%20Concepts/dRAI...

It seems like btrfs will gain a lot of traction, considering that Facebook invests a lot in it and it's the default FS in Fedora now.

I first heard about Btrfs over ten years ago: what's taking it so long to 'gain traction'?

My guess is bugs. Early on they had some serious bugs that lost people data. So the project gained this reputation of not being production-ready which meant that only people interested in playing with a new filesystem were using it. And btrfs has always been brought up alongside zfs which has been stable and mature for decades. As far as I can tell the only advantage btrfs has over zfs is the license. Outside of that, what does it offer that zfs does not? (That’s a serious question, I haven’t kept up with it because I’m a happy zfs user.)

> what does btrfs offer that zfs does not?

As @matheusmoreira mentioned - you can use hard disks of unequal size. I.e. three disks of 4+3+3 TB in RAID1 will happily give you 5 TB of usable space. AFAIK, no other filesystem can do that.
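
For anyone wondering where the 5 TB figure comes from, here is a rough sketch in Python. It assumes a simplified model in which every block is stored on two different disks, so usable space is capped both by half the raw total and by how much the other disks can mirror for the largest one. The function name is made up for illustration, and real btrfs allocation has chunk and metadata overhead that this ignores.

```python
def raid1_usable(disks_tb):
    """Estimate usable capacity for two-copy mirroring across unequal disks.

    Simplified model (an assumption, not btrfs's actual allocator):
    every block needs two copies on two different disks.
    """
    total = sum(disks_tb)
    largest = max(disks_tb)
    # Each block is stored twice, so at most half of the raw space is usable.
    # The largest disk can only be mirrored onto the remaining disks.
    return min(total / 2, total - largest)


print(raid1_usable([4, 3, 3]))  # 5.0
print(raid1_usable([8, 1, 1]))  # 2 (most of the 8 TB disk has nothing to mirror against)
```

The second call shows the limit of the trick: mixed sizes work well only while no single disk is larger than all the others combined.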

Quoting @matheusmoreira further:

> I can slowly buy drives of different capacities and add them to the file system one by one. I don't have to plan ahead of time like traditional RAID and ZFS setups. This is excellent for home data storage!

The new dRAID mechanism in ZFS may help with this:

* https://www.youtube.com/watch?v=jdXOtEF6Fh0

How? I don't see how it makes ZFS more flexible. Users still need to have every storage device installed on the computer before creating the RAID configuration. They can't start with the drives they already have and then add more drives over time. At least not without essentially recreating the array from scratch.

Maybe. But I wonder if btrfs has been tainted as 'too unreliable' at this point, whether true or not.

It's been perfectly stable for years for me. Not using RAID 5/6 of course!

Are you using snapshots? qgroups (which you have to enable to see snapshot space utilization)? I've had bad experiences with them on btrfs.

not OP, but using btrfs for 0.5 year with snapshots, snapper (which creates snapshots every hour) on NVMe under Arch. Haven't noticed any issues yet - qgroups are disabled, and snapper config decreased to have no more than 20 snapshots.

Ya, I was using snapper but didn't restrict the number of snapshots so I probably had more than 20

I use zfs now and have many, many, more than 20 snapshots without issue.

I wasn't using a nvme drive with btrfs, it was a sata ssd. So perhaps the higher io ops or lower latency also helps.

There is one huge issue with ZFS: the distributions and various crappy software will catch on to the ease of use and abuse it to the maximum. I have already seen what a huge pile of noise Ubuntu makes, not to mention creating a snapshot for every update, and how docker looks when you enable it on ZFS (forget about `zfs list` without `| grep` or `| more`). I can hardly wait for electron to start creating snapshots on each run, ...

ZFS is great. What today's developers will make of it... is worrisome.

Can't you just solve this by not snapshotting everything? I only snapshot user folders (which I size limit), user program folders (where users write code), and long term data storage (will not snapshot user generated data). You really should control what is being snapshotted and what isn't.

Ugh, I hadn't thought about that, but you're absolutely right. If ubuntu can turn something as straightforward as mount tables into a complete dog's breakfast I tremble to imagine what they'll do with ZFS.

> electron snapshots on each run

...on a raspberry pi, for maximum trendiness, deployed in a check out aisle where you have to wait for it to painfully work through its issues after every interaction. Kill me now.

Folks at vmware will probably disagree but I'd love to see zfs natively supported in ESX. That gets me away from hardware raid controllers for my boot disks and lets me manage datastores across jbod.

Can you address the MacOS compatibility a bit more? What's your setup?

Does it support encryption that has no known backdoors?

ZFS on Linux supports encryption. Very few symmetric algorithms are known to be backdoored - the AES implementation available to ZoL is thought to be fine.

For years I’ve been running a home server, mostly for storage plus some side projects. I only had a little time for server maintenance and I ended up redoing the whole server every time a disk failed. I was using just regular desktop grade disks so that happened every 18 months or so.

This all changed when I started using ZFS. Not only does it have support for raidz and mirroring (which I could get with LVM too), but it is easily tweakable and tunable. Plus commands like _zpool status_ will give you a great overview of the health of the array in no time. It might seem like nothing but it makes all the difference for me.

I can recommend it to everybody running a (home) server. It will save you lots of time.

What kind of disks and brand do you use? Are you doing some kind of disk-intensive work?

I am asking because this sounds really short to me and I am wondering if I am extremely lucky never to have had a failed disk in more than 25 years, mostly with more than 3 computers at a time with only consumer grade disks of random brands. I only switch them for better performance/capacity, but some of them are almost 10 years old.

Use Backblaze's hard drive stats as a guide: https://www.backblaze.com/b2/hard-drive-test-data.html

I've yet to experience a failure of HGST drives, even after 10 years of operation in an LVM or ZFS array.

HGST remains my preference for "enterprise-y" stuff on home and small business setups. Slightly pricier but if you value your time it pays for itself easily.

I buy used 4TB HGST drives on eBay. They’ve all been solid and inexpensive, some with more wear and tear than the others.

I have a couple I'm looking to sell. Send me an email for the details.

Like my fellow commenter, I also do have a few drives with over 50k head hours on them that still work fine, but I also had probably more than half a dozen drives fail on me or break (not including Sandforce SSDs, that'd be unfair), including one that had all of my early projects on it. Really wish I had a backup of that.

Now I have a backup of everything and whenever I have to think about how much storage I need for something I always take that number and multiply it by three -- if it's only stored once or twice, it really can't be more important than /tmp.

Interesting for me, I’ve had two drive failures, both in the past year, but otherwise nothing for the past five+ years. Both were data center class drives (Seagate Exos) and died within the first 2 months of purchasing them. I have over 20 consumer grade drives with >50,000hrs of reported power on time, of which 3-4 are >70,000hrs. They constantly see writes, but I’m definitely not pushing max IOPS by any means on a regular basis. I’d say around 1-2% of the time they are under very heavy read/write activity, otherwise just constant appending of data at around 50Mb/sec across the entire RAID pool.

This is a pretty common pattern with disk failures - in my experience they either tend to fail in the first few months, or they make it for the duration.

It's even got a name: the bathtub curve[1].

[1]: http://www.applied-statistics.org/Glossary/BathTubCurve.html

Are you sure your data has stayed uncorrupted all this time? Another thing ZFS enables is detecting bitrot and corruption (happening due to unexpected power loss or otherwise) that otherwise may have gone undetected but can still cause issues over time.

You are lucky, or bought the right brand of drives at the right time.

I had five new Seagate 7200.11s which had an easy life but did not make it past 22 to 25k hours before they started failing en masse (and that's not even counting their firmware bug that got me). ZFS RAIDZ1 (RAID5) pool, which survived a second drive starting to fail while rebuilding from the first failure. It was that event that made me love ZFS forever.

Contrast to my WD Reds which are 52k+ hours without errors or issues (no failures in a set of 8). And some HGST refurb drives that are at 70k+ hours (some failures in a batch of 14, but they were refurb with wiped SMART data so the failures weren't unexpected).

https://www.backblaze.com/blog/backblaze-hard-drive-stats-q2... has the best statistics that I know of. Basically a 1% failure rate each year (worse for a few bad batches of disks), so yes I think you have gotten lucky so far. (1 - .99^(25*3)) = about 53% chance of at least one failure for three drives over 25 years.
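
To make that arithmetic easy to replay with other numbers, here is a small Python sketch. It assumes failures are independent and happen at a constant annual rate, which is a simplification (real drives follow something closer to a bathtub curve), and the function name is just for illustration.

```python
def p_at_least_one_failure(annual_failure_rate, drives, years):
    """Probability that at least one drive fails, assuming independent
    failures at a constant annual rate (a simplification)."""
    drive_years = drives * years
    # Chance that every drive survives every year, then take the complement.
    return 1 - (1 - annual_failure_rate) ** drive_years


p = p_at_least_one_failure(0.01, drives=3, years=25)
print(f"{p:.1%}")  # 52.9%
```

So with a 1% annual rate, running three drives for 25 years without a single failure is roughly a coin flip.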

Out of ~30 disks over my ~30 years of hard disks I can count at least 3 5.5" HDD failures and 5 or 6 2.5" HDD failures. Laptop drives seem to fail at a much higher rate especially for kids.

I was thinking the same thing. I still have two 2TB Ultrastars that I bought refurbished 5 years ago, and which have been running in various arrays for just about the entire time since then. knock on wood

I still have a 1.6GB IDE drive that is functional. I have drives with 70k hours that still run. I've had 2 drives fail in the last 30 years. It may be rare but when it happens you'll really wish you had a backup.

Agreed. I've been aware of and thinking to check out ZFS for over a decade because all these weird nerds keep talking about it but it seemed so involved and complicated.

One simple command each to:

* create a zpool

* create a dataset (persistently mounted across reboots, with optional encryption and compression etc)

While, yes, properly tweaking ZFS for performance and making proper decisions on things like recordsizes, L2ARC and SLOG requires a bit of deeper understanding of how it works, the CLI is very approachable and the man pages are straight-forward and easy to understand for a beginner.

Compared to the alternatives of cryptsetup/mdadm/lvm/lvcache, it's such a breath of fresh air and a lot easier to work with. It's unified and intuitive and things just make sense.

The big game-changer for me is its caching mechanisms. For workloads that already have efficient caching (mature databases, for example), it might not make a big difference, but for things doing redundant IO it can improve the performance by an order of magnitude in a way the Linux page cache just can't.

Put a fast NVMe as L2ARC in front of a mirror of slow but inexpensive disks and you can eat the cake and have it.

Oh, and native compression (currently lz4 but zstd is coming) that barely impacts performance (and sometimes improves it).

I'm one of those weird guys now and even my nerd friends don't get me anymore :')

Not to take away from the merits of ZFS, but this sounds like something that 2 disks with software RAID1 would prevent from happening.

Also, 18 months sounds kind of short, I usually get around 5 years of use from hard drives. What brand are you using? I highly recommend checking out the quarterly Backblaze HDD failure reports.

Source: been running a server (read: consumer headless desktop) for years without issues.

> Not to take away from the merits of ZFS, but this sounds like something that 2 disks with software RAID1 would prevent from happening.

You're not wrong. And if there's a mechanical issue with a drive and it dies / stops spinning, then you'll probably get an alert and can swap it.

But if there's any kind of bit rot or data corruption, and it only happens on one side of the (traditional) RAID1 mirror, how will you (a) know that it actually occurred, or (b) know which side is good bits and which side has bad bits?

With ZFS and checksums, you can be confident that your data is still healthy. Depending on the data, this may be an important consideration.
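
A toy Python illustration of that point, in case it helps. This is not how ZFS is implemented on disk; it only shows why a checksum stored separately from the data lets you pick the good side of a mirror. SHA-256 and the function names are illustrative choices.

```python
import hashlib


def checksum(data: bytes) -> str:
    # SHA-256 is used here for simplicity; it is an illustrative choice.
    return hashlib.sha256(data).hexdigest()


def read_mirrored(copy_a: bytes, copy_b: bytes, expected: str) -> bytes:
    """Return whichever copy matches the stored checksum.

    Toy model: a plain RAID1 has no `expected` value, so when the two
    copies differ it cannot tell which one is correct.
    """
    for copy in (copy_a, copy_b):
        if checksum(copy) == expected:
            return copy
    raise IOError("both copies fail verification; restore from backup")


good = b"important data"
stored = checksum(good)          # recorded when the block was written
rotted = b"important dat\x00"    # one side of the mirror has silently changed

assert read_mirrored(rotted, good, stored) == good
```

Without the stored checksum, all you know is that the two sides disagree.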

zfs send-recv is also very handy for doing incremental backups.

What ZFS does here is not enough for me. I need to be alerted to changes done also above the file system layer, like by malware or accidental deletion etc. Only way I have found to solve that is to have checksum tool above the file system. If something is wrong, I restore from backup, so zfs does not give me anything...

I don't personally do this, but if you're taking scheduled snapshots of your ZFS filesystems, you could also have a scheduled job (say, nightly) that emails you a "zfs diff" [0] between the current snapshot and the one from 24 hours ago. It won't tell you that you've been hit by malware, but an unexpected spike in changes could be something worth investigating.
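
A minimal Python sketch of such a job, under a few assumptions: the function names are invented, the snapshot names you pass in are your own, and emailing or scheduling (cron, systemd timer) is left out. It relies on `zfs diff -H`, which prints tab-separated lines starting with the change type (`+` added, `-` removed, `M` modified, `R` renamed).

```python
import subprocess
from collections import Counter


def zfs_diff_command(older, newer):
    # -H asks zfs diff for tab-separated output that is easier to parse.
    return ["zfs", "diff", "-H", older, newer]


def summarize_diff(output):
    """Count changes by type: '+' added, '-' removed, 'M' modified, 'R' renamed."""
    counts = Counter()
    for line in output.splitlines():
        if line.strip():
            counts[line.split("\t", 1)[0]] += 1
    return counts


def nightly_report(older, newer):
    # Needs a host with ZFS and permission to run `zfs diff`.
    result = subprocess.run(zfs_diff_command(older, newer),
                            capture_output=True, text=True, check=True)
    return summarize_diff(result.stdout)


# Parsing demo on made-up output, so it runs without ZFS installed.
sample = "M\t/tank/docs/\n+\t/tank/docs/new.txt\n-\t/tank/docs/old.txt\n"
print(summarize_diff(sample))
```

Comparing the counts against a typical night would be one crude way to spot the "unexpected spike" case.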

[0] - https://docs.oracle.com/cd/E36784_01/html/E36835/gkkqz.html

> I need to be alerted to changes done also above the file system layer [...]

As others have mentioned, would "zfs diff ..." be useful?

* https://www.thegeekdiary.com/how-to-identify-zfs-snapshot-di...

As the name suggests, "snapshots" are read-only and so cannot be altered. You could either copy/rsync the modified file/s to the live location, or do a rollback to a particular snapshot:

* https://www.thegeekdiary.com/solaris-zfs-how-to-create-renam...

If the machine is compromised in some way, you could reinstall and do a "zfs send-recv" of a pool from a remote system.

You can clone the snapshot (so the clone becomes writable, not the snapshot itself!) and you can even promote the clone to a "parent" filesystem (reversing the parent-child relationship).

You can do a zfs diff command on two snapshots or a snapshot and current dataset as zfs takes a just-in-time snapshot of the live dataset.

One pro of ZFS over RAID1 is that ZFS is file based and not volume based, so in case of recreating mirror disks (or general resilvering) you only copy data, not the whole volume.

More than half the time in my experience it’s a failing SATA cable or power supply if the hard drive continues to function on boot but eventually goes to degraded due to checksum/read errors. Sometimes the data is just fine but a read error occurs between the drive and controller, which ZFS interprets as a failing disk.

modern drives have onboard diagnostics which can report errors directly to the OS. I'd be very surprised if a modern file system was unaware of this. I'd also be surprised if users of alternative file systems didn't test a suspected failing drive before concluding it was faulty

Hard drives lie. SMART data is not thorough and it is not trustworthy. Across dozens of drive failures over the years, I've never seen SMART predict a failure before ZFS detected data loss or the OS reported 'drive is slow'. SMART is not worth the effort to monitor.

Filesystems don't read SMART data. You might have a separate daemon which monitors SMART.

ZFS checksumming is amazing for this. You know, without doubt, which file(s) are bad. You can still use failing or unknown quality drives because the checksumming will protect you from silent data corruption.

SMART marking drives as failed is usually super late. But if you monitor the raw values for the important parameters (unreadable, reallocated, etc sectors), I've had good luck with replacing drives before software notices. At least for spinning drives; SSD failures were way more rare, but resulted in the drives completely disappearing from the bus, with no reliable prefail indicators. I did have one SSD go through a big reallocation that tanked throughput, and the alerts from throughput and the alerts from SMART thresholds fired simultaneously.

Of course, that's great for a server farm; in home use monitoring is a lot less structured.

Expensive enterprise drives are generally better at smart stats. Of course your mileage varies by vendor. Desktop drives on the other hand are completely untrustworthy.

the great thing about zfs is you don't have to care, at all

as long as it writes and reads back (most of the time): ZFS will deal with it

The error is occurring between the drive and the SATA controller on the mobo. There’s also the whole issue of whether the OS or the drive is responsible for error handling. If the drives are SAS drives, typically the OS uses fire and forget methodologies for write commands. Other drives might not support that and require zfs to handle errors at the software/kernel level.

replacing the cable / testing the disk on an isolated system would confirm this

It would, but I once encountered a situation where, when I removed the drives from the array and tested them one by one, they all worked. It wasn't until all the drives were powered on and under max write stress that the power supply would under-volt the drives randomly, making it appear as if multiple drives were all failing.

Strange. I have some WD disks in my qnap running 24/7 for 8 years now, without a single problem. I'm planning on reusing them in my new NAS..

I have a WD Purple I bought in 2014 and it's been running continuously since. 18 months seems extremely short.

Isn't Purple for things like camera feeds? Good on writes, bad on reads.

This seems to be the conventional wisdom but I can't say I've experienced it in practice. They have less cache than Reds and the firmware may be tuned for writes but I think it's mostly down to different market positioning/branding & price. You can compare the two: https://documents.westerndigital.com/content/dam/doc-library... and https://documents.westerndigital.com/content/dam/doc-library...

Same here, I have over a dozen WD Red drives with over 70,000 power on hours and all pass their SMART tests.

> I was using just regular desktop grade disks so that happened every 18 months or so.

That seems unusually bad. Up until earlier this month my old raid array was mostly desktop drives (I just upgraded it to IronWolf NAS drives though). I did finally do the upgrade to new drives because one died, but that drive was manufactured in 2011, and the rest of the drives are from 2010, 2013, and 2015. To be fair I've not gone five years without a failure, I think the 2015 drive was added about 3 years ago (being my actual desktop drive before that), but otherwise I have had good luck. In fact, I think my most recent failure may have been accelerated by my server being put into a temporary case with poor air flow.

> so that happened every 18 months or so

Might be power supply issue.

I like to separate storage from the rest. Had various models of synology for nearly 10 years now and it’s the ideal solution for stopping to think and worry about it (combined with UPS). Running VMs, etc, I’d rather do that on a different box. You need to reboot frequently, make some experimental changes, modify the hardware, etc. Better to do that away from your data.

OpenZFS 2.0 will come with two awesome features that I am really looking forward to.

The first one is ZSTD compression, this will work great together with MySQL and PostgreSQL.

The second major feature is persistent L2ARC. I'm using LARGE SSDs as cache, and the warmup takes weeks, so rebooting has a major performance impact.

For the last 10 years, I have been using FreeBSD with ZFS. This has been working perfectly with good performance. But now I want to take advantage of even faster network speeds, with RoCE/RDMA. And FreeBSD support for iSER and NVMe-oF is non-existent, while Linux has excellent support for these technologies.

Support for iSER (only initiator side, sadly) has been merged years ago: https://www.freebsd.org/cgi/man.cgi?iser.

I've been using ZFS in production both in my home and at work since 2012. It's come a long way in FreeBSD, and I think is now quite clearly the best filesystem choice for nearly every workload. ZFS on Root is super usable and easy to install now, and is great with an SSD mirror.

Highly recommended as something to learn and use, it's obviously the best choice for a home built filer, but is also an excellent choice even for general purpose server use. My colo setup is running FreeBSD w/ ZFS which is very stable for backending VPN servers, web servers and app servers of all stripes, etc.

Tangentially related to ZFS, I’d like some advice.

I’ve got a couple of projects I’d like to do that require maybe 100TB of storage: some scientometrics against the sci-hub collection, as well as building a bajillion scala projects from github. I don’t really care about data redundancy.

The cheapest way I’ve been able to figure out to do this is just buy a case with 15 HDD bays, eg the Anidees AI crystal case, and just get a ridiculously beefy processor + 256gb RAM and do all the computation on a single box.

Does this sound right? All of the purchasable NAS cases all seem more expensive, but I’m out of my element.

I expect I’m going to want to figure out ZFS to make a single logical drive.

Does this seem right? Building eg a backblaze pod is outside of my budget, and my eyes glaze over whenever I try to read about NAS controllers.

Not sure if it will help you directly, but it may help indirectly.

This is 1080 TB (in 12 TB disks) in one server under ZFS:


For 100 TB you would get a smaller server, like a 2U with 12 slots filled with 10/12/14 TB disks for your 'usable' 100 TB space.

For more ZFS and/or FreeBSD storage options check this:


Hope that helps.


> 90 x Toshiba HDD MN07ACA12TE 12 TB (Data)

I'm surprised you went with one single HDD model. Don't people normally say "Go for different HDDs from different manufacturers, so that in case of manufacturing defects not all HDDs will break at the same time"?

This is what I did, I just snagged some cheap Rosewill case off of newegg with 12 bays, a PCIe sata card and packed the case with drives.

You can slice and dice the drives however you like. I found this guy's blog posts to be useful for running a homegrown NAS:

- https://louwrentius.com/should-i-use-zfs-for-my-home-nas.htm...

- https://louwrentius.com/the-hidden-cost-of-using-zfs-for-you...

The only real beef I have is this:

"It seems though that the one URE in 10^14 bits (an error every 12.5 TB of data read) is a worst-case specification. In real life, drives are way more reliable than this specification."
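
For reference, the 12.5 TB figure in that quote is just a unit conversion, sketched here in Python. The second function assumes the spec is read as an average rate rather than a worst-case bound, which is exactly the interpretation under dispute, so treat its result as illustrative only. Function names are made up.

```python
BITS_PER_BYTE = 8


def tb_per_error(bits_per_error):
    """Convert 'one unrecoverable read error per N bits read' into decimal TB."""
    return bits_per_error / BITS_PER_BYTE / 1e12


def expected_errors(tb_read, bits_per_error=1e14):
    # Expected URE count if the spec were an average rate rather than a bound.
    return tb_read * 1e12 * BITS_PER_BYTE / bits_per_error


print(tb_per_error(1e14))     # 12.5
print(expected_errors(100))   # 8.0
```

Read that way, one full pass over a 100 TB array would hit about eight UREs, which is the kind of prediction that field data could confirm or refute.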

I don't believe that statement without data backing it up.

Disk drives are right at the engineering edge. There is a reason why consumer class drives have one MTBF and enterprise drives have a better MTBF. If a disk drive manufacturer could somehow cite a better MTBF, they absolutely would (see the whole GB vs GiB marketing stupidity, for example).

Oh, cool. Being routed away from ZFS is interesting. Happy to do this as simply as possible

Another +1 for the Rosewill 12-bay/15-bay cases. If you're going to run this server anywhere near you, stay away from server cases like supermicro, dell, etc. since you will suffer hearing loss if you're around them for long periods of time.

If you plan on using ZFS, make sure to account for the overhead and parity drives. For example, I have 12x 16TB (192TB raw) drives in raidz2; including overhead and parity I only have ~125TiB usable.
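
A rough Python sketch of the first part of that calculation, assuming all 12 drives sit in a single raidz2 vdev (the layout is not stated above, so that is a guess). It only subtracts the parity drives and converts decimal TB to binary TiB; padding, metadata and reserved space are not modeled, so it gives an upper bound rather than the real usable number.

```python
TIB = 2 ** 40


def raidz_upper_bound_tib(drives, drive_tb, parity):
    """Upper bound on usable space for a single raidz vdev, in TiB.

    Only subtracts parity drives and converts decimal TB to binary TiB;
    padding, metadata and reserved space are not modeled.
    """
    data_drives = drives - parity
    return data_drives * drive_tb * 1e12 / TIB


bound = raidz_upper_bound_tib(drives=12, drive_tb=16, parity=2)
print(f"{bound:.1f} TiB")  # 145.5 TiB
```

The gap between this ~145.5 TiB bound and the ~125 TiB reported is the overhead, so budget for noticeably less than the naive figure.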

You might want to consider at least a minimum of redundancy, like a RAID 5 (lets you lose 1 drive without killing the whole setup). It would be a real pain to set everything up, discover one of your drives is a dud, and have to start over from scratch. Especially with that many HDDs involved, the risk might be somewhat high.

Look into iStarUSA cases. Even my tiny 2U is big enough for Noctua 80mm fans that keep it quiet as well as cool.

This sounds about right. Not very reliable but cheap and easy to set up. If you make a software raid (e.g. with lvm) you could as well use any other file system.

Yep. If you're looking for more specific recommendations, the community at /r/datahoarder is a great source. With shucked USB drives, and just the biggest ATX case you can find, your costs should be pretty low.

Thanks for the input, both of y’all. Roger that about shucking drives. It is strange to me that ~10 TB external drives end up being the cheapest solution, but happy to capitalize on it

So has anyone here used zfs and btrfs and would like to comment on the differences? I've been on a heavy zfs kick lately, but the performance loss is hard to stomach and the only research I found points to btrfs being faster (though of course they both take a hit).

Basically the reasons I'm drawn to zfs are:

- checksumming & self-healing

- ergonomics & flexibility of managing pools with zfs

- copy on write for cheap local copy/experimentation (i.e. just clone your DB folder and you have a new DB)

- zfs send/recv for very efficient incremental backups

From what I can find it seems like btrfs does all that, and faster[0]. In addition to being faster, it also is in-kernel[1], and more flexible for the user in various ways, for example allowing resizing[2]. Looking around btrfs may not be blessed as stable but there are a lot of big orgs using it.

All that said, there are articles like this one[3] which are somewhat dated but paint ZFS really positively from a maintenance point of view. Very hard to pick between these two.

I'd really like to use ZFS -- the community seems very welcoming and amazing but I'm a little worried about picking the wrong tool for the job.

[EDIT] - there's also this old comparison from phoronix[4] which is confusing. I'm still leaning towards ZFS but sure would like to hear some strong opinions if anyone has em.

[0]: https://www.diva-portal.org/smash/get/diva2:822493/FULLTEXT0...

[1]: https://btrfs.wiki.kernel.org/index.php/FAQ#Is_btrfs_stable....

[2]: https://markmcb.com/2020/01/07/five-years-of-btrfs/

[3]: https://rudd-o.com/linux-and-free-software/ways-in-which-zfs...

[4]: https://www.phoronix.com/scan.php?page=article&item=freebsd-...

I've used both, the reasons I favour zfs over btrfs:

- I've corrupted btrfs filesystems with compression just through normal use, on hardware that is fine. This may have been fixed, but it was on relatively recent (post 5.x) kernels

- zfs's logical volume layer is rather more flexible than btrfs's - you can make a multi-device set of mixed disks. For example, my backup pool is 9x4TB and 7x3TB. These are individual raidz sets that are combined together into one pool. In ZFS this is all in one place and it means the fs is aware of the logical disk layout and where data is stored. To do this on btrfs I'd need to use lvm and I'd in theory lose a bunch of the self-healing ability

- btrfs's snapshotting seems excessively complicated - it requires you to create a non-trivial logical layout in the fs, and it's very easy to accidentally expose these snapshots to the system's view of the fs. It's more flexible, but more annoying for my usecase. zfs's on the other hand is really very simple, and much much easier to use

With that said, ZFS on linux is slightly awkward as it's out of tree, and most distros build the module with DKMS. I don't entirely trust this for using for / (so my / is just a md raid1) - I use zfs for bulk data instead. btrfs is in-kernel, so there's no real disadvantage with using it for /.

Other advantages:

-Unlike btrfs subvolumes, zfs datasets can be mounted with different zfs-specific mount options, such as compression algorithms or recordsizes.

-zfs can take atomic recursive snapshots of nested datasets, whereas btrfs snapshots of a subvolume do not include nested subvolumes.

Overall, zfs treats datasets as first-class citizens, whereas the only purpose of btrfs subvolumes seems to be to exclude folders from snapshots.

> With that said, ZFS on linux is slightly awkward as it's out of tree, and most distros build the module with DKMS. I don't entirely trust this for using for / (so my / is just a md raid1) - I use zfs for bulk data instead. btrfs is in-kernel, so there's no real disadvantage with using it for /.

Is that true? RHEL/CentOS have both DKMS and kmod versions; in Ubuntu ZFS is a supported package. That's not most distros, certainly, but it is the ones most people use.

Based on using CentOS + DKMS, Debian + DKMS, and Arch + DKMS :) Some of these have prebuilt modules too, but as community projects they usually lag behind kernel releases to the main distro, and that can be a problem. So I always end up using DKMS.

Not used Ubuntu ZFS, admittedly. Haven't used Ubuntu since 2010 or earlier.

> zfs's logical volume layer is rather more flexible than btrfs's - you can make a multi-device set of mixed disks. For example, my backup pool is 9x4TB and 7x3TB.

You may wish to check out ZFS dRAID, which recently got committed:

* https://www.youtube.com/watch?v=jdXOtEF6Fh0

> - zfs's logical volume layer is rather more flexible than btrfs's - you can make a multi-device set of mixed disks

Can you please elaborate on this? Working on mixed disks was always a killer feature of btrfs for me!

I do a pool called backup with two raidz vdevs - 7x3T in one raidz vdev, and 9x4T in one raidz vdev. Data is striped on each of the two raidz and I lose one 3T and one 4T's worth of capacity. This allows me to lose any one drive at a given time without data loss, and up to two drives (as long as they come out of separate vdevs).
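For anyone who wants to check the arithmetic, here is a rough sketch of that capacity calculation in Python. It assumes single-parity raidz (raidz1), which is what "lose one 3T and one 4T" implies, and it ignores metadata, padding and the TB/TiB difference. The function names are made up for illustration.

```python
def raidz_usable_tb(disks, disk_tb, parity=1):
    """Approximate usable capacity of one raidz vdev built from equal-sized disks."""
    return (disks - parity) * disk_tb


def pool_usable_tb(vdevs):
    """A pool stripes across its vdevs, so their usable capacities add up."""
    return sum(raidz_usable_tb(n, size, parity) for n, size, parity in vdevs)


# (number of disks, TB per disk, parity disks) for each vdev in the backup pool
backup_pool = [(7, 3, 1), (9, 4, 1)]
print(pool_usable_tb(backup_pool))  # 6*3 + 8*4 = 50
```

Raw capacity is 57T, so the 7T difference is exactly the one 3T and one 4T disk given up to parity.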

As far as I could work out from the btrfs documentation, this isn't currently possible? Plus, RAID5/6 in btrfs is still of questionable stability?

Thanks for clarification!

Indeed, btrfs doesn't allow such configuration. However, it allows using disks of unequal sizes within single filesystem - like, 3+3+4T in RAID1 mode gives you 5T of usable space (imagine 4T disk being split in half, and each half duplicated to a different 3T disk; remaining space of 3T disks duplicated to each other). But as I understand, it's possible to achieve the same with manual partitioning and vdev allocation on ZFS, too.
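The 3+3+4T example can be reproduced with a small estimate. This is only a sketch, assuming two copies of every block on two different disks and ignoring metadata and chunk-size rounding; the function name is hypothetical.

```python
def btrfs_raid1_usable(disk_sizes):
    """Estimate usable space when every block is stored on two different disks."""
    total = sum(disk_sizes)
    largest = max(disk_sizes)
    # Usable space is half the raw space, unless the largest disk is bigger
    # than all the others combined; then the others cap what can be mirrored.
    return min(total / 2, total - largest)


print(btrfs_raid1_usable([3, 3, 4]))  # 5.0
print(btrfs_raid1_usable([1, 1, 8]))  # 2 (the 8T disk is mostly unusable)
```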

RAID5/6 is indeed a danger zone - in another thread I've already mentioned a nice write-up of "guidelines for users running btrfs raid5 arrays to survive single-disk failures without losing all the data" by Zygo Blaxell: https://lore.kernel.org/linux-btrfs/20200627032414.GX10769@h...

I've had a few BTRFS attempts a few years ago, and I ended up twice with a suddenly unbootable system. I stress that this was a few years ago though.

ZFS needs serious tweaking if you have performance-critical workloads. I experienced this on databases, when comparing against ext4 - at the very least, one needs to move the ZIL to a separate disk. Also, up to a short time ago, ZFS had performance problems with encrypted volumes, due to (let's call them) formal issues with the kernel - as a matter of fact, it was slow on a laptop of mine.

All in all, I don't use BTRFS because of trust issues. BTRFS is in a sort of "never stable" camp, which is not a good indicator of engineering practices. The nail in the coffin for me was that in the official FAQ, at least up to some time ago, there was the tragicomic cop-out statement that the concept of stability in software is just a matter of labeling, because all software has bugs.

Three months ago I converted an old Core 2 Duo desktop machine to a server and also deliberated whether to use btrfs or ZFS. What put me off btrfs were the rough edges, e.g. some RAID setups are considered beta. In ZFS those have been stable for a long time. The fact that btrfs has been in development for quite a long time and still has such issues is not very assuring to me. That said, if you only use the features big orgs are using, you are probably fine.

Performance has been great so far even on my underpowered machine, even with just 4GB. I don't use deduplication though which makes a huge difference.

I chose Ubuntu 20.04 as OS since they are pushing ZFS support. Did consider FreeBSD as well where I had a positive experience in the past but since they are switching to OpenZFS anyway I stuck with Ubuntu since I'm running Debian derivatives on all my servers.

Thanks for sharing this insight!

> Performance has been great so far even on my underpowered machine, even with just 4GB. I don't use deduplication though which makes a huge difference.

So almost every OpenZFS community video/talk I've seen recently has included a line like "friends don't let friends dedup" or something to that effect... I think dedup is considered unnecessary/dangerous these days with how good compression is. Not sure exactly what the dangers were, but I know that I wouldn't even turn it on.

> I chose Ubuntu 20.04 as OS since they are pushing ZFS support. Did consider FreeBSD as well where I had a positive experience in the past but since they are switching to OpenZFS anyway I stuck with Ubuntu since I'm running Debian derivatives on all my servers.

Same, I used to run lots of different OSes but I've settled down on Ubuntu for everything now, and OpenZFS having good support is what makes it possible there.

> Not sure exactly what the dangers were

No expert either but from what I've gathered the main issue is that the dedup tables require a _lot_ of memory that scales with the size of the pool, and if they can't fit in RAM performance tanks. However the real issue is that any blocks written while dedup was enabled will demand this dedup overhead, even if the dedup option is turned off. That is, it's a "sticky feature".

So if you use dedup, notice performance tanks because dedup tables are too big, well you're screwed because turning it off won't give you back the performance. Only way to recover is to send/receive the whole shebang to a separate pool.

In addition, they don't actually bring that much space saving on common workloads. Most people don't have a lot of truly identical blocks of data.

> In addition, they don't actually bring that much space saving on common workloads. Most people don't have a lot of truly identical blocks of data.

VM backing storage seems to be the biggest worthwhile use case, but that depends on whether snapshots and clones are used extensively. Installing 100 copies of Debian on empty VMs will likely get deduped quite a bit. But it's faster and provides almost the same benefits to install one VM, snapshot it, and produce the rest of the VMs by cloning from the snapshot.

The only other case I could imagine dedup being good for is storing a lot of genomic data: https://techtransfer.universityofcalifornia.edu/NCD/25080.ht...

But if the use-case is narrow enough for custom deduplication then it will probably be much more efficient than ZFS's block-based dedup.

My experience is that btrfs has been quite stable over the last couple of years.

Things to consider:

While a volume can span multiple devices (physical), historically it hasn't been the most stable and many users just stick with md, so real world testing is probably limited. Details are in the wiki.

When a volume starts getting full (let's say above 98% or something), performance suffers. This is also documented behaviour. Take monitoring seriously, even more than usual.

Pools and subvolumes can be a bit confusing at first, even if you've had experience with other volume managers. Read the documentation and make sure you know what you're doing.

Thanks for this -- I really haven't read all the btrfs wiki and documentation just yet (since zfs was my first foray into this), will keep these points in mind. Don't think I've seen an "md" mentioned before...

md is the normal tool for doing RAID in software on Linux. It exposes a number of physical devices as one block device, which looks just like a normal hard disk.

btrfs is also a volume manager with RAID-like functionality of its own. But you can also use btrfs on an md device, just like you would with any other filesystem.

Software RAID is something I have only used for personal use, and with the enormous consumer hard drives available now, striping them seems less necessary than before.

Ahhhhh I thought it was some sort of btrfs specific thing -- didn't recognize it as the 'md' in 'mdadm' (I've only dealt with software RAID before).

This is actually pretty interesting because one of the 'features' of the hosting provider I'm using is that they will software RAID your drives by default. Maybe btrfs is a better choice in that kind of environment if I don't have to undo the software raid on every machine and btrfs will interop well without too much abstraction.

I use btrfs, and the main reason I don't consider zfs instead is that zfs doesn't use the regular linux fs page cache. That and the fact that I use the latest mainline kernel, so I don't want to have to deal with kernel updates breaking zfs or tanking its performance, as when it lost access to the simd functions. The main feature I would want from zfs would be tiering, which can be gotten from btrfs on bcache, which is what I will likely use in the future. I think there was some issue with adding disks in zfs too? Don't exactly remember about that.

The fact that it doesn't use the page cache is one of the top reasons why I love zfs - it can do a lot smarter decisions (MRU+LRU instead of just LRU) and integrate it with L2ARC.

Of course it depends a lot on your read/write patterns what impact this has, though.

On adding disks, I don't know if that's what you're referring to, but you can't add new disks to an existing RAIDZ.

Personally I've only used mirrors and single-device vdevs so far, haven't seen any need for RAIDZ.

Thanks for sharing -- looks like zfs still doesn't support tiering and just has the ARC and using SLOG for similar functionality...

And yeah, ZFS can't be expanded willy-nilly, found a good blog post with someone's adventures that was illuminating[0].

[0]: https://markmcb.com/2020/01/07/five-years-of-btrfs/

It sounds like you're mixing up SLOG and L2ARC. How is ARC/L2ARC/base dataset not tiered?

ah I did mix this up -- thanks for pointing that out

I’ve been using btrfs for about two years now on my primary machine. Single disk, btrfs-on-luks, on a 970 evo plus. I have never had any stability issues, and as far as I can tell performance is excellent.

I’m not taking any chances, though, seeing as it is “unstable” – I’ve got snapshots and local and remote backups set up and working flawlessly. It’s sooo awesome to be able to pluck older versions of any file on my system whenever I want, and it’s saved me numerous times.

I haven’t tried zfs but I really have no need to. Btrfs does the job for me.

I have a similar setup. It all works perfectly until the disk starts filling up. At 95% performance falls off a cliff. You have to clean the disk and defrag. I'm running a 4-year-old version, so I would guess things might have improved since.

This is true. I’ve had it happen a few times too. I try to keep an eye on disk usage and purge things if it gets too high.

I guess the difference for me is that BTRFS is dead easy (being mainline, as you said). The knock is reliability, but BTRFS has been rock solid every time I've used it, so from personal experience, it's a no-brainer.

Plus, I love snapper and the way subvolumes and snapshots are handled.

EDIT: I will add that I don't use any RAID. I use BTRFS for snapshots, cow and data integrity on my single drive desktops. I think that's where it really shines. I don't think anyone should be using EXT4 anymore.

Thank you for this input -- yeah it's just so weird that BTRFS is in the mainline kernel, but no one wants to stand up for it, and people are evidently using it with great success (Facebook for example).

I'm also not necessarily going to do any RAID5/6 stuff -- I'm probably just going to keep it safe (for my level of understanding) and do a RAID1/mirror setup and call it a day. The snapshots/cow/data integrity bit is definitely what I'm interested in as well. It feels to me like as long as I run ZFS under my servers I am much safer than anything else (and it's easier/possible to go back in time and undo mistakes).

Unfortunately, there's that whole thing about it being hard to boot to.. Is that still a thing?

BTRFS hard to boot to? You mean, to have it as your boot partition? I guess that could be true. I still use vfat (I think) for boot, cus I just don't care. :D

But for the rest of it, I just have everything in my fstab and it works like anything else. Super easy.

OpenSUSE uses it as the default FS.

As does Fedora as of Fedora 33!

I don't have much experience with btrfs but my understanding is whilst they're generally equivalent the multi-disk config of btrfs isn't really considered stable (think raid5/6 vs raidz1/2). Most production use is single disk.

Regarding performance, I'm guessing you won't be happy with either if you're not happy with performance. You want to be looking at ARC / SLOG if you want higher performance.


Using the built-in RAID of btrfs is unstable in some cases, but you can still use btrfs without using its RAID on top of mdadm with that providing the RAID, like ReadyNAS and Synology NAS devices have been using by default for years.

Thanks, I learned a thing. :)

thanks for the input -- so yeah I saw the RAID5/6 thing. I'm wondering if it's a bit of a moot point, because the servers basically all come with 2 identical drives of various speed.

Moving writes to ARC or SLOG-on-a-faster-thing would also definitely help, but I'm dealing with SSDs for the most part.

Also, talking of faster storage, NVMe looks really bad for ZFS (and probably btrfs), based on this reddit post[0](graphs[1]). It's not terrible of course, and some recommended that maybe actually turning ARC off would be better, since it might have been actually getting in the way of the NVMe drive.

[0]: https://www.reddit.com/r/zfs/comments/jmdxxx/openzfs_benchma...

[1]: https://64.media.tumblr.com/0d141001aa951a44063c2cac9d2b9cb7...

I've used ZFS for 4 years and really enjoyed it. It was simple to set up, and send/recv was very useful for making backups of the data on a separate machine on the network. I have now been using btrfs since I switched to OpenSuse, and frankly it also just works, and it's been easier to set up as I didn't need to do module installs etc. However, I've had to switch to rsync for my backups, but that wasn't hard. The other good thing is that I can grow the raid and rebalance the disks, which isn't possible on ZFS.

I love ZFS and I want to migrate all my data storage to it. The one thing that is holding me back is the inability to grow raidz with demand. There has been a pull request open for a while now, but it is hard to tell when and if this feature will be available [1]. Does anyone have more insight into whether there is an ETA for this feature?

[1] - https://github.com/openzfs/zfs/pull/8853

I don't expect this to land any time soon, if ever. It's a very large change, and very complicated. It's been in progress ever since I started using zfs, and that was years ago :)

It's worth noting you can expand a RAIDZ through replacing disks - if you start off with a pool of 4x2TB for instance say giving 6TB usable, you can expand it by replacing those disks one by one with 4TB disks - in which case you eventually end up with 12TB usable, once all disks are replaced.
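A quick way to see why the extra space only shows up at the very end: in a raidz vdev every member contributes only as much as the smallest disk. Here is a minimal sketch of the 4x2TB to 4x4TB example, assuming raidz1 and ignoring overhead (the function name is made up; as far as I know the pool's autoexpand property or a manual expand step is also involved in practice).

```python
def raidz_usable_by_smallest(disk_sizes_tb, parity=1):
    """Each member contributes only as much as the smallest disk in the vdev."""
    return (len(disk_sizes_tb) - parity) * min(disk_sizes_tb)


disks = [2, 2, 2, 2]
history = [raidz_usable_by_smallest(disks)]
for i in range(len(disks)):
    disks[i] = 4  # swap one 2TB disk for a 4TB disk, then wait for the resilver
    history.append(raidz_usable_by_smallest(disks))

print(history)  # [6, 6, 6, 6, 12]
```

Usable capacity stays at 6TB until the last 2TB disk is gone, then jumps to 12TB.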

Alternatively, you can add another RAIDZ to the same pool with extra disks (but you will lose more capacity this way).

Otherwise, recreate the pool, and restore from your backup (which you definitely have, right?). Assuming both your live and backup are zfs, this is easy with zfs send | zfs receive.

Yeah. The backing reason is that ZFS has an axiom that data, once written to disk, never changes. That includes moving data to a different location. Changing that either means a massive relocation table, or some other yet-to-be-coded tool.

As a home user of ZFS I would give it an ETA of never. We've been waiting for probably 10 years for this feature, but it's just never been enough of a priority from more serious (paying) users to get worked on.

There's a hacky workaround, but it depends on how many drives you have and what your tolerance for downtime is. I had 2x8TB drives in RAIDZ1 configuration (sda, sdb), and I added an extra 8TB drive (sdc).

My solution was:

offlining one drive (sdb), degrading vpool1

creating a vpool2 in RAIDZ1, with sdb, sdc, and an 8TB sparse file on a flash drive

offlining the sparse file, degrading vpool2

sending a ZFS snapshot from vpool1 to vpool2

destroying vpool1, and adding sda to vpool2 to replace the sparse file
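Roughly, those steps map onto commands like the ones below. This is a sketch only: the Python just builds and prints the command strings and executes nothing. The sparse file path and snapshot name are placeholders I made up, and the comment does not say which extra flags or cleanup steps were needed (for example, reusing sdb while it still carries the old pool's label may need additional work), so treat the exact invocations as assumptions and test on scratch disks first.

```python
def migration_plan(old_pool="vpool1", new_pool="vpool2",
                   sparse_file="/mnt/flash/placeholder.img", snap="migrate"):
    """Return the shell commands for each step as strings; nothing is executed."""
    return [
        # 1. Take one disk out of the old pool, leaving it degraded.
        f"zpool offline {old_pool} sdb",
        # 2. Create a sparse placeholder file at least as large as the real disks.
        f"truncate -s 8T {sparse_file}",
        # 3. Build the new 3-wide raidz1 from two real disks plus the placeholder.
        f"zpool create {new_pool} raidz1 sdb sdc {sparse_file}",
        # 4. Offline the placeholder right away so no data lands on the flash drive.
        f"zpool offline {new_pool} {sparse_file}",
        # 5. Snapshot everything and copy it across.
        f"zfs snapshot -r {old_pool}@{snap}",
        f"zfs send -R {old_pool}@{snap} | zfs receive -F {new_pool}",
        # 6. Retire the old pool and hand its last disk to the new one.
        f"zpool destroy {old_pool}",
        f"zpool replace {new_pool} {sparse_file} sda",
    ]


for command in migration_plan():
    print(command)
```

Note that between the offline in step 1 and the final resilver there is no redundancy anywhere, so a single disk failure during the copy loses data.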

This is unnecessarily complex, you can just offline a drive, remove the disk, insert the new disk, and replace the offline drive with the new disk. ZFS will resilver the new disk.

This is to switch from 2 disk raidz1 to 3 disk raidz1 without needing additional drives to hold the data while moving.

That's pretty much why I went with LVM + ext4 for my home server. Every once in a while, when I need more space, I toss another SSD in and expand the volume. Easy peasy. Though my storage needs aren't huge; I'm only up to 4TB of total SSD space.

You can grow the pool by adding pairs of disks, if you can stomach the cost.

Just add another VDev. For example, let's say you have 32 drives. It makes more sense to structure it as 4 x 8-drive raidz2 instances rather than 2 x 16-drive raidz3, because then you will only have to add 8 drives vs 16 to expand the pool. You could incrementally expand your ZPool by adding new vdevs of the same or larger size. You could technically make your vdevs 4 drives in raidz1, but I'd recommend at least 6 drives per vdev in raidz2 or greater.

> 256,000,000,000 / 128,000 * 70 = 140,000,000 bytes

> This would be a pretty common configuration choice for a lower-end VM storage box. If you only had 16GB or of RAM in your system, all of your ARC space would be wasted with L2ARC mappings and you would only have 2GB of the entire rest of your system.

Am I misunderstanding this? 140,000,000 bytes is 140MB (salesman MB, not 2^20 bytes). It looks like they're saying it's 14GB.
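Plugging the quoted numbers into a few lines of Python gives the same result as the question above. The constants are taken straight from the quoted formula; the variable names are mine.

```python
L2ARC_BYTES = 256_000_000_000  # 256 GB cache device (decimal, as in the quote)
BLOCK_BYTES = 128_000          # block size used in the quoted formula
HEADER_BYTES = 70              # header cost per cached block, per the quote

blocks = L2ARC_BYTES // BLOCK_BYTES
overhead = blocks * HEADER_BYTES

print(blocks)                        # 2000000
print(overhead)                      # 140000000
print(overhead / 1000**2)            # 140.0 (MB, powers of ten)
print(round(overhead / 1024**2, 1))  # 133.5 (MiB, powers of two)
```

So the formula as quoted gives about 140 MB of ARC used for L2ARC mappings, not 14 GB.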

> salesman MB

There is a de-facto term distinction: MB megabyte (1000^2) vs MiB Mebibyte (1024^2).


I have been a FreeNAS (ZFS) user for years. While I do follow a proper rule of 3 for backups, my main FreeNAS volume (100TB) is where I will randomly dump stuff until I sort/backup it up later. It hosts a wide range of stuff from media, to small files (db). The only thing that will eat that data is a hacker, massive sw bug, or my house burning down.

I've broken a lot of systems and data over the years. With that in mind, I like my data storage (NAS/filer) to be boring and predictable. FreeNAS is exactly that for me.

What are you using for H/W and what drives comprise your 100TB volume? Do you know what the idle power draw is?

I'm running FreeNAS on an ancient Lenovo minitower with just four drives as two mirrored volumes and it's about time I upgrade the thing. But it just works and only draws about 60W.

30x 4TB HGST drives is the array. Running on multiple LSI HBAs with a NVMe drive for write cache. Idle power draw is 250watts-ish including the host hardware (cpu/mem/net). I think I will go cry now that I am thinking of my power bill.

How should one best protect a home NAS/fileserver from being hacked? I'm in the process of putting one together, but this one makes me nervous. Is it simply having an up-to-date OS, setting up, say, an SSH bastion/wireguard for remote access, and calling it a day?

Sorry I missed this one - my bad.

I do a few things including A) keep it patched B) minimize the number of exposed services C) utilize pfsense to control [network-level] access above and beyond what freenas itself provides D) stream all host/service logs to an ELK stack to review for any funny business that may occur.

Not bulletproof but I haven't had an incident yet.

Hope this helps!

P.S. All of this assumes proper backups. I can restore most of my stuff from backups, it just will take forever.

ZFS is an awesome filesystem, this article gives a good overview. Very interesting for home users. I run it on my NAS.

It is very important to realize that you can’t expand VDEVS. You can only add VDEVS.

This makes expanding storage less flexible than regular MDADM RAID. Going for Mirrors is the most flexible but you lose 50% of capacity.

You also get the random IOPs performance of a single drive per VDEV. Sequential performance does scale within a VDEV.

You scale random I/O performance by adding VDEVS.
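As a back-of-the-envelope illustration of that rule of thumb, here is a sketch that treats each vdev as one drive's worth of random IOPS. The per-drive figure of 150 is a hypothetical number for a spinning disk, not a measurement, and the model is crude: it fits raidz and writes best, while mirrors can do better on reads.

```python
def pool_random_iops(vdev_count, drive_iops):
    """Rule of thumb: each vdev delivers roughly one drive's random IOPS."""
    return vdev_count * drive_iops


DRIVE_IOPS = 150  # hypothetical figure for one spinning disk (an assumption)

# Three ways to arrange the same 32 drives.
layouts = [
    ("2 x 16-drive raidz3", 2),
    ("4 x 8-drive raidz2", 4),
    ("16 x 2-way mirror", 16),
]
for label, vdevs in layouts:
    print(label, pool_random_iops(vdevs, DRIVE_IOPS))
```

Same drive count, but the number of vdevs is what moves the estimate.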

For home usage, as a NAS, you don’t need to use SSDs for a SLOG, unless you have write-intensive random I/O workloads.

Strictly speaking you can expand a VDEV. If you replace all the leaf drives in a VDEV with larger devices, you can extend the VDEV to take the extra space.

There is work ongoing to allow adding drives to a RAIDZ VDEV[1]. And VDEV Removal gives some flexibility (at the cost of indirection on the read path). But yeah, ZFS is not super flexible for adding and removing random drives.

[1]: https://github.com/openzfs/zfs/pull/8853

Yes that's totally true, but it is very time consuming and cumbersome in my view.

I've detailed this also here. https://louwrentius.com/the-hidden-cost-of-using-zfs-for-you...

You may also realise from the top of my article that expanding VDEVS has been a topic since 2017 but is still a no-show.

I just did this operation on moving from 12x 4TB drives to 12x 16TB drives. I did it over 12 days (each resilver was ~24hours) and it wasn't too cumbersome. Considering how often you're going to grow your pool, it doesn't seem like a big drawback.

Fair enough

Which OS are you using for your NAS? I ruled out Unraid and I'm looking at TrueNAS Core (FreeNAS). I want it mostly for scalable backup storage where I can easily add/remove drives from the array as I need more storage. Any others that I should consider?

I am using Debian with ZoL, but it's now ancient. Just ZFS + NFS & SMB.


I would now go for Ubuntu + ZoL myself. I bought all capacity up front and paid the ZFS tax. So I don't need to expand as I go.

But if you want to expand as you go, Linux + MDADM are still fine in my opinion. It's a tradeoff. Do you want to 'pay' the ZFS tax and expand at a cost, or do you want a bit more risk (on paper) but more flexibility?

I ran a RAID6 of 20 drives before that using Linux + MDADM and that worked fine. And MDADM allows you to expand as you go, exactly as you want.

I think the risks ZFS protect against are very small.


Unraid is the only one that allows you to freely add/remove drives. And only if you're OK with taking cluster down for a minute as you add it.

If you want true freedom you'd need to go Ceph(FS) based, but there's no GUI to manage a Ceph cluster that I'm aware of, and they're so focused on cloud usage that you'll find little guidance on single-node clusters.

edit: Forgot to read Louwrentius' comment, of course MDADM with raid 6 allows you to add disks in pairs of two, so that's definitely also an option. Don't forget to use LVM, it can get real awkward without it.

> Unraid is the only one that allows you to freely add/remove drives.

I think technically you can do something similar to Synology's Hybrid RAID[1] with ZFS, using partitions and creating mirror VDEVs from the partitions.

But I haven't gone through all the details so could be there are some risks I overlooked and in any case you'd have to manage it from the command line so would be a bit tedious. But I did set up a proof of concept in a VM.

[1]: https://nascompares.com/guide/shr-synology-hybrid-raid/

Don't forget GlusterFS. For example: every single disk is its own ZFS pool, and on top of that you run your GlusterFS cluster. GlusterFS then looks after the redundancy, and underneath you can export/import your pools, i.e. disks and/or vdevs (if you need more performance).

>but there's no GUI to manage a Ceph cluster that I'm aware of

There are two, Open Attic and Calamari, but Open Attic is in maintenance mode, so all the work now goes into Ceph Dashboard.


> You also get the random IOPs performance of a single drive per VDEV

This isn't strictly true. It is true for RAIDZ-n and for writing to a mirror, but ZFS can accelerate random reads with a mirror by distributing the read requests to the underlying disks.

While planning my ZFS setup back then, I had good fun with this ZFS capacity calculator. The numbers turned out to be a little different but I still think it's worth a try.


Uhh, wow. I can barely imagine a beginner fully grasping this. I can easily imagine a professional learning from this, and it contains specialist-level insight and remarks. Great content. First time visiting STH and it's already earned my only bookmark of 2020! Thank you, Nick.

They have lots of great content on their news sections. As well as beginner articles, they have reviews of new upcoming enterprise cpus and hardware.

Their forum is also great for information about running ex-enterprise gear at home.

Could a ZFS server farm be a good alternative to a smallish Ceph installation ? If I wanted to build a storage layer for a small private DC - is Ceph the way to go? ZFS? Or maybe there's something else? I'd like to have storage details hidden away from storage layer so that applications can write to a mount or something similar - what is the best solution for such problem these days, if one needs resiliency and redundancy baked in ?

I use Ceph in my homelab. Small cluster of five all-in-one nodes with ten OSDs each. I went this way since it gives me a lot more expandability and fault tolerance than ZFS can. I use min_size 2, size 3.

My use case is both CephFS and RBD for a small two node OpenStack cluster. I have found CephFS to be rather performant for my use cases, enough that I had to get a 10Gbps switch. I am not maxing out the bandwidth on that switch, but my individual clients use more than 1Gbps.

OpenStack and Ceph tie together wonderfully. I have my VMs backed by NVMe drives and my VMs are snappy. Recovery is quick too. I am using crappy first gen xeon-d boards and even with those I hit 8Gbps recovery on those drives.

Ceph shines when you have a lot of parallel access. It is recommended to have at least ten nodes for a production cluster so recoveries do not take too long. If you have a lot of clients Ceph is king.

I used ZFS in the past as a simple Fileserver before using Ceph. It worked well, I could saturate a 1Gbps link, however I found the vdev resize limitation too restricting at times when I wanted to expand by a little bit. It is pretty easy to manage, though I find Ceph very easy to manage as well.

For my backup server which is a target for BorgBackup I went with btrfs for the better flexibility it offers with resizing arrays.

How many computers do you have at home, in your homelab? Do they also heat your house? Where do you store them?

I have 5 Ceph nodes, 2 OpenStack nodes, 1 backup target, 3 switches, 1 pfsense router, 1 freeipa server and my desktop. The homelab is kept in the furnace room. It is almost completely on atom or xeon-d boards, so the power usage hard drives aside isn't too bad. The room does get a bit warm though, even when the furnace is off. They are kept in a 36U rack.

> Could a ZFS server farm be a good alternative to a smallish Ceph installation ?

ZFS and Ceph work at different 'layers'.

ZFS provides redundancy within a server, so if drive(s) die then the service on that server can continue to run without interruption. Ceph provides redundancy between servers, so if drive(s), servers, or even entire racks/ToR switches die then things keep going. Ceph is generally for much larger scales (e.g., OpenStack, HPC) than ZFS, which is usually done on NFS or SMB servers.

Until relatively recently, Ceph was also only accessible at a block layer, so you'd have to put a file system on top of it (i.e., Ceph gave you a /dev/sdX), but somewhat recently CephFS has been declared 'stable'.

Regardless, you still need three machines for quorum for Ceph. If you just want a file share, then it's probably unnecessarily complex.

I have a glusterfs cluster where the bricks consist of zfs datasets. You can do it similarly with Ceph and zvols.

Zfs sounds like a good solution although I'm not experienced with Ceph. Btrfs is an option too but it has some pitfalls.

> "Use CMR with ZFS, not SMR"...

Well, yeah, good point to make but it is becoming harder and harder to find CMR drives, especially in 2.5".

ZFS should really adapt to this. Perhaps using bigger block sizes or something. Because SMR is not going away.

Upping the block size to more than whatever SMR overlaps on the drives works for workflows that don't care about consistent random rewrite performance. Can get 120 MB/s sequential writes on the cheap seagate SMR drives in my big slow pool which is enough for clients on gigabit or less networks.

Although frankly at this point if you care about performance you're on NVME anyway.

That depends what is your use case.

I use two 5TB 2.5 SMR drives in ZFS mirror:


These drives can slow down to 30-40 MB/s when filled to 80% or more but I use that storage over WiFi which is at most 11-12MB/s which means the SMR problem does not exist for me.

If I would be using that storage over LAN the 30-40 MB/s in 'WORST CASE' is also not bad considering that maximum real life LAN speed over gigabit network is about 80-90 MB/s.

It's also not possible to get non-SMR large 2.5 drives. I use 2.5 drives as they are silent and they need a very small amount of power compared to 3.5 drives.

"No SMR with ZFS" is probably a bit overblown. It simply depends on your use case.

Rebuilds certainly are a problem but beyond that I think it's simply "be ok with slow drives". I've been operating a decently sized SMR pool for 3 years now with no major issues, including surviving two drive failures.

If you try to put random write workloads onto SMR you're gonna have a bad time no matter what you do. In my use case it's great, since this is effectively WORM storage writing giant files to disk all at once so I have effectively 0% fragmentation and all subsequent reads of the file tend to be sequential.

That said, this is for my personal use and lab projects. Not sure I'd go ZFS+SMR for a production workload.

I find it very unfortunate that file metadata is not encrypted : if you need this to be encrypted, you need to stack LUKS on top of ZFS, and you lose many of the benefits of ZFS (per-dataset encryption, healing ability, RAIDz, etc) while doing so. Running ZFS->LUKS->ZFS to recover some of these benefits is also not feasible at all (ZFS doesn't like to self-host, even through a virtual machine).

ZFS encrypts most metadata.

Metadata not encrypted: Dataset / snapshot names, Dataset properties, Pool layout, ZFS Structure, Dedup tables

ZFS encrypts: File data and metadata, ACLs, names, permissions, attrs, Directory listings, All Zvol data, FUID Mappings, Master encryption keys, All of the above in the L2ARC, All of the above in the ZIL

For most uses and use cases this is a net increase in security. You can do some operations on data without needing the keys.

Oh it seems I was mistaken about that. ZFS does encrypt enough metadata indeed. Sorry for the noise.

You don't need to stack ZFS-on-LUKS-on-ZFS, you just need to put ZFS onto LUKS rather than onto the raw disks.

The downside is that you have a choice between encrypting all file data twice, or losing the benefits of ZFS's encryption (mainly the ability to send snapshots to another pool without decrypting them). It would be nice if you could specify a pool key to be used to encrypt all blocks not covered by ZFS native encryption, which would eliminate the need for LUKS.

Yes, and there's another downside to that: it makes me lose the other ZFS benefits (raidz, mirroring, some level of performance).

Use one LUKS volume per raw disk. You'll still get raidz/mirroring.

RAIDZ on LUKS should work similarly to RAIDZ on GELI, no?

Encrypt each hard disk and add the decrypted block devices to the vdev in whatever RAID configuration desired.
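The one-LUKS-volume-per-disk layout described above could be scripted roughly as follows. This is only a sketch that prints the commands instead of running them; the pool name `tank`, the mapper names `crypt0`, `crypt1`, ... and the device paths are placeholders, and `cryptsetup luksFormat` destroys whatever is on the target disk.

```python
def luks_raidz_commands(disks, pool="tank", level="raidz"):
    """Return shell commands for one LUKS volume per raw disk plus a pool on top.

    Nothing is executed here; the pool name and mapper names are hypothetical.
    """
    commands = []
    mapped = []
    for index, disk in enumerate(disks):
        name = f"crypt{index}"
        # Encrypt the raw disk, then open it as /dev/mapper/<name>.
        commands.append(f"cryptsetup luksFormat {disk}")
        commands.append(f"cryptsetup open {disk} {name}")
        mapped.append(f"/dev/mapper/{name}")
    # ZFS only ever sees the decrypted block devices.
    commands.append(f"zpool create {pool} {level} " + " ".join(mapped))
    return commands


for line in luks_raidz_commands(["/dev/sdb", "/dev/sdc", "/dev/sdd"]):
    print(line)
```

Passing `level="mirror"` instead would give a mirror over the same decrypted devices.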

What file metadata is unencrypted?

I was mistaken, file metadata is indeed encrypted.

Wouldn't just LUKS->ZFS be enough?

Why does metadata encryption matter?

I saw this deck on ZFS and btrfs while researching them some years ago [1]. It led me to think ZFS would be fine for me to run in my closet/personal cloud, but I’d probably want to avoid it at work.

I know it’s an old deck... has anything substantial changed since it was put together?

Specifically around the licensing [2], inclusion in the kernel, etc.

It looks like FB in particular had invested a lot in btrfs [3] over ZFS.

[1]: http://marc.merlins.org/linux/talks/Btrfs-LCA2015/Btrfs.pdf

[2]: https://arstechnica.com/gadgets/2009/10/apple-abandons-zfs-o...

[3]: https://btrfs.wiki.kernel.org/index.php/Contributors

ZFS is useful if you don't have any higher-level abstractions for bulk storage. Databases and object storage (including cloud storage) generally provide integrity, reliability, and snapshot features at least as well as ZFS. If you need robust posix/samba file servers then ZFS is a good choice. I've never worked on a site with a large fleet on ZFS; it always had its own niche.

Maybe a good rule of thumb is that if you've ever needed to track down a missing file, fix file corruption, or have contingency plans for such an event in production then ZFS can help. If the default action is to wipe a host and reinstall if the local filesystem looks fishy then ZFS won't provide many benefits.

Btrfs has the advantage of living in the kernel tree (but I use ZFS on Freebsd where the same is true), but still carries an aura of not being quite done. There are several "things seem to work well at version X, and watch out for Y" statements in https://btrfs.wiki.kernel.org/index.php/Incremental_Backup for example. This is a year old at least so maybe all bugs and caveats are fixed. How could I be sure? No idea. zfs send and receive just work.

Btrfs seems to be the filesystem to use if you can dedicate enough time to testing your specific use case thoroughly and keeping up to date with changes and improvements as opposed to ext4 or ZFS which change relatively infrequently. ZFS enables/tracks new features in the zpool metadata so there's a modicum of forward- and backward-compatibility.

To be honest I am even a bit leery of ZFS-on-Linux but even Freebsd is moving there soon. Abandoning a working, trusted codebase is always scary.

It's the other way around. FB put a lot of investment into btrfs, but only for the features they needed. Single drive works great, because they needed that, but RAID-equivalents don't. On the flip-side, Sun->Oracle have spent decades investing in ZFS at all levels, and it's really solid now.

Licensing is a minor issue with ZFS, but if you aren't distributing the code to others, those issues don't affect you.

Sure, I can help. That powerpoint is horribly out of date and I'd argue full of FUD (intentional or not).

>• Raid 0, 1, 5, and 6 are also built in the filesystem

BTRFS RAID has been a nightmare from day 1 including complete data loss. Parity-based RAID has been "just around the corner" for almost a DECADE. Sure you can do RAID-1 but in 2020 I'm just not interested in losing half of my capacity.

Yes you can layer it on top of MDRAID but that eliminates half the elegance of what ZFS brought to the table.

>ZFS is fairly memory hungry, it's recommended to have 16GB of RAM and give 8GB or more to ZFS (it wasn't designed to use the linux memory filesystem, so it uses its own memory that can't be shared with the rest of linux).

This is just flat wrong. You need 2GB of memory for a happy filesystem, 8GB+ is if you're doing deduplication which is unnecessary overhead in most environments. I'm also not sure where the "memory that can't be shared with the rest of linux" is coming from - ARC will use and free memory as needed by the system.

>Due to the CDDL being incompatible with GPLv2, a linux vendor or hardware vendor will never be able to ship a linux distribution or hardware device using ZFS

Except Ubuntu already is. I believe SLES does as well.

>As a result, you shouldn't plan on using ZFS for any product that you might ever want to ship one day.

Delphix ships a product today based on ZFS. 0 issues.

>Oracle may have stopped further work on ZFS as a result. Or it could be another reason entirely...

Oracle absolutely didn't stop work on ZFS, I'm not even sure where he came up with that nonsense. Oracle continued to update and release new versions of ZFS long after the lawsuit was settled.

You know what's far more telling? 13 YEARS after starting BTRFS: Oracle uses ext4 as their default filesystem, not BTRFS. Redhat has dropped support for BTRFS entirely.


Thank you very much for the play-by-play. This slide deck is from 2015, so I'm willing to give the author the benefit of the doubt, that things have probably changed a lot since then.

It seems that the ZFS story has gotten much better since 2015 (FreeBSD & Ubuntu support, memory usage improvements, license problem proving to be not a big problem).

Correspondingly, it seems that the btrfs story has regressed (redhat drops support, many issues in this greater thread raised around reliability and use cases, suspect development practices).

My guess is that they were closer to parity in 2015. This deck was making the case for why btrfs would win out. The author seemingly turned out to be wrong, as we're all liable to be from time to time.

My guess is that the licensing might be an issue for some big corporations today, but it appears to be generally benign, like you say.

Thank you so much for the information! :)

ZFS sucks do not use it .

I’m in the process right now of setting up a raidz2 with 6 4TB disks. So far I’ve really liked zfs, but I have had some weird issues, like deletes taking a very long time.

My plan is to upgrade the host to Ubuntu 20.04 to see if the newer ZoL version helps. Also looking forward to root volume on zfs so I can do snapshotting to the raidz2 and just let the drive fail eventually.

Any remarks/experiences about the "record size" of ZFS, maybe especially in relation to RAIDZx? I don't fully understand it.

I have in a NAS a RAIDZ1 of 4HDDs on which I set a recordsize of 1MB, currently full at 50% and so far performance has been good with both big and small files... .

I'll probably create in a future a RAIDZ2/3 by using ~8 HDDs and I'll test various record sizes but I just wanted to know if anybody had already any positive/negative experiences with some combination of record size and RAIDZx... .


The record size setting is the max; ZFS can use less in some cases. The best one depends on the data you're writing and the ashift of the pool, so testing is best. Large record sizes are helpful for large files (less overhead losses).

Some good beginner info https://arstechnica.com/gadgets/2020/05/zfs-versus-raid-eigh...

Some advanced info https://www.joyent.com/blog/bruning-questions-zfs-record-siz...

Thank you!

I did read Arstechnica's article in the past but I did not feel comfortable with their results... (I'm not challenging them, I'm just not sure if they're relevant for me or not).

So, I just did a test (ashift 12, RAIDZ1 with 4 8TB HDDs) and I got better performance in both cases with a 1MB recordsize vs. 128KB (all sequential).

Recordsize 1MB:

  reading one 10GB file: 21 seconds
  reading 10000 1MB files: 83 seconds

Recordsize 128KB:

  reading one 10GB file: 31 seconds
  reading 10000 1MB files: 116 seconds

Maybe a small recordsize can have some benefits when overwriting parts of the files...mmmhhh...?

Ok, it seems complicated => I'll just have to test different variants :)
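Turning those timings into throughput makes the gap easier to compare. A small sketch, assuming 1 GB = 1000 MB and that the small-file test also read about 10000 MB in total:

```python
def throughput_mb_s(total_mb, seconds):
    """Average throughput in MB/s for a timed read."""
    return total_mb / seconds


# (recordsize, test) -> seconds, taken from the measurements above
timings = {
    ("1MB", "one 10GB file"): 21,
    ("1MB", "10000 1MB files"): 83,
    ("128KB", "one 10GB file"): 31,
    ("128KB", "10000 1MB files"): 116,
}
TOTAL_MB = 10_000  # assumption: both tests read about 10 GB in total

for (recordsize, test), seconds in timings.items():
    rate = throughput_mb_s(TOTAL_MB, seconds)
    print(f"recordsize {recordsize:>5}, {test}: {rate:.0f} MB/s")
```

On these numbers the 1MB recordsize comes out roughly 1.4 to 1.5 times faster in both tests.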

> Maybe a small recordsize can have some benefits when overwriting parts of the files...mmmhhh...?

Right. People who've done more testing than me reckon on 16KB being a good record size for transaction-processing database work, where tables are seeing lots of small inserts and updates. (You might think matching the database's block size would be ideal, e.g. Postgres writes 8KB at a time, but the rationale here is that you tend to get better compression at 16KB recordsize than 8KB, and the benefit from this outweighs the write-amplification.)

But if database update performance isn't a big deal for you then you can probably just ignore this.

I've not done any testing of my own at the 1MB size, but I don't think I'd be inclined to try it unless I was fairly confident that there weren't going to be many small writes to big files.

In short: use the large recordsize where you think you've got a good case for it, and likewise with a small record size. Otherwise, just stick with the default.
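The write-amplification trade-off mentioned above can be put in rough numbers. This sketch assumes a simplified model in which every small update rewrites whole records, and it ignores compression, caching, and the batching of writes; the sizes are just examples.

```python
KIB = 1024


def write_amplification(recordsize, write_size):
    """Bytes rewritten per byte changed, assuming each update touches whole records."""
    records_touched = -(-write_size // recordsize)  # ceiling division
    return records_touched * recordsize / write_size


for recordsize in (8 * KIB, 16 * KIB, 128 * KIB, 1024 * KIB):
    factor = write_amplification(recordsize, 8 * KIB)
    print(f"recordsize {recordsize // KIB:>4}K, 8K page update: {factor:g}x")
```

Under this model an 8K database page update costs 2x at a 16K recordsize but 128x at 1M, which is why large records suit write-once files better than update-heavy tables.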

Thank you.

Yeah, in my case the DBs "Clickhouse" and "MariaDB+MyRocks" might fit well the 1MB-case (as they both never "update" existing files but keep writing new files not just for "inserts" but as well for "updates", Clickhouse anyway not supporting "update/delete" almost at all, he).

On the other hand "PostgreSQL" and (maybe) as well "MariaDB+TokuDB" might need a small recordsize -> I'll have to test it, and anyway, splitting each single DB to use different datasets seems to be a great idea :)

In some cases -- when the file is smaller. For small files ZFS uses the smallest possible block size that can accommodate the file. Once the file grows beyond the recordsize (maximum block size), it uses recordsize-sized blocks.
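As a rough illustration of that rule (a simplified model, not ZFS's actual allocator): a file no larger than recordsize gets one block sized to fit it, here rounded up to a 512-byte multiple as an assumption, and a larger file is split into recordsize-sized blocks.

```python
def block_layout(file_size, recordsize=128 * 1024, granule=512):
    """Return (block_count, block_size) under the simplified rule above.

    `granule` is an assumed rounding unit for small single-block files.
    """
    if file_size <= recordsize:
        # One block, just big enough for the file.
        rounded = max(granule, -(-file_size // granule) * granule)
        return 1, rounded
    # Larger files use full recordsize blocks; the last one may be partly empty.
    return -(-file_size // recordsize), recordsize


print(block_layout(3_000))       # small file: one small block
print(block_layout(300 * 1024))  # large file: several 128K blocks
```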

Related article from arstechnica:


You can issue arbitrary 'zfs send' to rsync.net, over SSH.

Has anyone tried to do a home NAS server with ZFS on Raspberry Pi 4?

Why would you choose a system with a single Gen 2 PCIe lane xor bandwidth-constrained USB 3.0 as the only high-bandwidth I/O to build a storage server?

I ask as an owner and regular user of several Pi models, including a 4.

With that said, assuming you choose PCIe, acquire a decent HBA, and build it into a decent enclosure, I'm sure it'd work as well as typical low-end off-the-shelf NAS boxes, and wouldn't cost that much more in time and materials to set up.

For the uninitiated among us (including myself) could you explain what an HBA is? Or share any more details on how you’d build a low-cost NAS that’s not a Raspberry Pi?

I have a Raspberry Pi 4 that mounts a USB hard disk and serves files over SMB and Nextcloud. I have been considering reformatting the drive to ZFS or btrfs and booting the Pi directly from that so that I can start taking snapshots. Is this a bad idea?

I’ve looked at buying dedicated NAS hardware before (mostly Synology products) but I’m always deterred by the cost. A low end Synology NAS with drives runs around $500 or $600, which is a huge jump from my little Pi.

HBA is basically a PCIe to SAS/SATA card (could be other variants but this is the typical one).

For a low-cost NAS you could use a spare computer. My first NAS was my old desktop computer, using the motherboard SATA ports.

That said, ZFS should work on the Raspberry Pi 4 at least in 64bit mode. You probably will want to use Ubuntu as it has ZFS support.

If you have a spare drive you can test with, give it a whirl. Just keep this[1] in mind if you get poor performance from the USB-SATA adapter.

[1]: https://www.raspberrypi.org/forums/viewtopic.php?t=245931

It's possible but you're limiting yourself to 2 or 4 USB drive mirrors. If you just want a zfs storage pool and aren't too picky about anything else, it works, but if your list of expectations goes beyond a place to dump data, you're setting yourself up to be frustrated later.

My home net has several pi3 and pi4 systems that boot off a small SD card and then mounts working storage via iSCSI, which works great for how I use them. There's no reason I couldn't serve that off a pi, but their limited IO and memory makes them a poor choice.

The only thing preventing me from adopting zfs is that it is not part of the linux kernel.

ZFS is so awesome that for my file server I pick the OS based on ZFS support, not the other way around.

Currently still running OpenSolaris on my ZFS server, probably will be FreeBSD next time I upgrade the hardware, but time will tell. Regardless of the OS, it will be ZFS!

Ditto, I run OpenIndiana but FreeBSD would be my next choice.

Can't live without boot environments on physical hardware.

OpenSolaris is dead, isn't it? Presumably you started when it was released/still alive, but hasn't it not had any security updates for years? Why not OpenIndiana/Other IllumOS variant?

Well yes, it's actually OpenIndiana.

Oh, OK. Thanks for clarifying.

ZFS was one of the reasons I switched to FreeBSD, and it turned out to have a lot of other advantages too. Might be worth a look?

I did the same, and no regrets. ZFS and FreeBSD are a great combination. Wish I'd made the move several years earlier.

Could you or someone else explain why it's able to be part of the FreeBSD kernel but not the Linux kernel? I'm guessing this is a licensing and/or legal issue?

License compatibility. CDDL is a weak, file-based license similar to the MPL. It is generally (though not universally) believed to be incompatible with the GPL. (Which may or may not have been intentional. Again, accounts differ.) In any case, the Linux kernel along with Red Hat for its distributions won't incorporate ZFS code for this reason. The fact that it's Oracle that owns the copyright doubtless doesn't help.

FreeBSD is BSD licensed of course and that's considered compatible with CDDL given that CDDL is file-based.

Thanks ok yes that sounds familiar. For anyone else interested, this is a good read:


Too bad FreeBSD is a crappy operating system.

When ZFS came out I tried OpenSolaris on a 48 disk machine. Regretted it immensely, that was the last sun hardware we bought

It simply didn’t fit with the hundreds of other linux boxes we had.

We ended up scrapping the zfs idea and bought 1000+ disks (about 2PB) worth of linux storage over the next decade, on xfs and ext4.

Had Sun’s x4500 platform worked with a Debian based linux we’d have bought that instead of supermicro. Sun lost because of their choice to exclude zfs from linux.

From what I can tell, zfs is/was great, but wasn’t good enough to change our standard OS, and since then people moved to object storage

> When ZFS came out I tried OpenSolaris

Historical context: ZFS predates OpenSolaris by quite a few years, so the above statement can't be technically true.

Yes you’re correct it was normal Solaris 10, this was about 12 years ago. On my evaluation I was comparing with a satabeast and hp320s at the time, and I said that “zfs probably outweighs the problems of running Solaris”

The box was still in use in 2012, we were having issues and a “zfs upgrade” was suggested, but at that point we were adding new storage on linux.

Loving how zfs is so easy once you get a hang of it.

I'm trying to test it out on an i3a.large instance with a 1.2TB NVMe SSD and benchmarking postgres on it. I'm trying to move off RDS since from time to time I run heavy-IOPS scripts.

Only thing left is doing zfs snapshots next for my backups. zfs snapshots or pg_dump snapshots? I wonder what's better.

I've had a FreeNAS VM set up with a couple mirrored ZFS disks (PCI passthrough) for the past few years. It's worked fairly well, but overall it's a pain to always have to make sure the VM is on, mount the NFS, etc. My original motivation was preventing bitrot, but I've since seen that rationale called into question[0]. Is there a current consensus on whether ZFS is worth the hassle? I've never used the snapshotting, though I've heard good things about it.

I'd much rather manage a simple local disk with offsite backups.

[0]: https://www.jodybruchon.com/2017/03/07/zfs-wont-save-you-fan...

> My original motivation was preventing bitrot, but I've since seen that rationale called into question[0].

The article is interesting in its contrarian view; however, when it comes to bitrot, it counters anecdata with other anecdata:

> One bit flip will easily be detected and corrected, so we’re talking about a scenario where multiple bit flips happen in close proximity and in such a manner that it is still mathematically valid. While it is a possible scenario, it is also very unlikely. A drive that has this many bit errors in close proximity is likely to be failing

I detected bitrot once or twice, and in neither case was the drive failing. This is anecdata though - is it valid? Who knows.

I'm personally skeptical about blanket statements (which the author makes) without seriously backing data.

I have a ZFS setup, and it's arguable whether it's a hassle in itself. At least for RAID-1 setups (I have two), once installed, it's not inherently harder to maintain than other FSs. Installation is manual, and that's definitely a hassle, but users are definitely intended to be advanced ones.

Regarding SMART: it's not as easy as the article's author states. I have a laptop that periodically pops up with new instances of a certain error, but the SMART guides say that this is not an error one needs to consider, so I'm confused. Additionally, the smart-notifier of Ubuntu (at least up to 18.04) is broken. I agree that SMART is important to consider, but it's not as straightforward as it seems.

I do understand the point the article's author is trying to make, but honestly, it is ridden with misconceptions and sounds just like the fanboys he is trying to enlighten. While CRC algorithms are used everywhere (and some provide data recovery to some extent), they do have their limitations.

It is perfectly possible to read corrupted data from a disk. I know this because I've seen it happen several times over the years. If your system is making decisions (i.e. generating new data) based on read information, this can actually be quite harmful. Like it or not, transient errors on soon-to-fail disks may cause data corruption. It is easy to say "hey, restore from backups!", but weeks may go by before an actual failure happens. By then you don't know if your backups are tainted, or how long your storage was misbehaving. ZFS actually helps with this, because it can tell you explicitly that your file/block is tainted, even if readable. This provides a level of confidence in the system, based on observability. And snapshots can actually refer to different blocks, so it is often possible to recover a previous version of a given file without firing up the backup system.

Also, the idea you need RAID for healing is nonsense - ZFS can keep multiple copies of your block even on a single-disk system.
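The detection-and-healing idea in the comments above can be shown with a toy model. This is not how ZFS is implemented: SHA-256 stands in for ZFS's block checksums, two in-memory byte strings stand in for redundant copies (a mirror, or `copies=2` on a single disk), and all function names are made up for illustration.

```python
import hashlib


def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def read_with_healing(copies, expected):
    """Return the first copy matching `expected`, repairing bad copies in place."""
    good = next((c for c in copies if checksum(c) == expected), None)
    if good is None:
        raise IOError("all copies are corrupt; restore from a snapshot or backup")
    for i, c in enumerate(copies):
        if checksum(c) != expected:
            copies[i] = good  # overwrite the bad copy with verified data
    return good


block = b"important data"
stored_checksum = checksum(block)       # kept separately from the data itself
copies = [b"important dat\x00", block]  # first copy silently corrupted
assert read_with_healing(copies, stored_checksum) == block
assert copies[0] == block               # the bad copy was healed
```

The point of the model is that the read itself succeeds on the corrupted copy; only the separately stored checksum reveals that the data is wrong.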

To finalize on the "bitrot" topic, keep in mind network communications, as most serial protocols, have varying degrees of CRC and checksum checks at different levels of the stack, following the end-to-end principle. We even use compression and encryption on top of that, that also provides multiple verification methods. Yet most relevant files such as iso images have a checksum file to verify your download - and sometimes it doesn't match. ZFS provides you the same functionality, but for your storage.

I'd argue the main advantage that ZFS gives is one that this article tries to dismiss - the article seems to assume that a lot of ZFS deployments are not multi-disk, or not RAIDZ?

I can say that at both work and home, I have only ever seen groups of mirrors or RAIDZ in use - I've never seen just striped pools or single disk ZFS.

I know it's anecdotal, but I have seen ZFS recover data flawlessly with drives returning incorrect data for some sectors with no I/O errors, or from total and sudden drive failure with no SMART warning. I personally think drive hardware is rather more fallible than this article assumes.

With that said, of course ZFS is not a magic bullet, and there's no substitute for backups - but ZFS does make that easier too, because snapshotting is trivial, and zfs send | zfs receive is very useful for transferring the snapshots to another pool for backup. And it does require an amount of reading and understanding before you set it up.
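A backup step along those lines might look like the sketch below. It only builds the command strings; the dataset names, snapshot names, and SSH host are placeholders, and the incremental `-i` form assumes the previous snapshot already exists on both sides.

```python
def backup_commands(dataset, target, new_snap, prev_snap=None, ssh_host=None):
    """Build 'zfs snapshot' and 'zfs send | zfs receive' command strings."""
    snapshot = f"zfs snapshot {dataset}@{new_snap}"
    send = "zfs send "
    if prev_snap:
        # Incremental stream: only blocks changed since prev_snap.
        send += f"-i {dataset}@{prev_snap} "
    send += f"{dataset}@{new_snap}"
    receive = f"zfs receive {target}"
    if ssh_host:
        # Pipe the stream to a remote pool over SSH instead of a local one.
        receive = f"ssh {ssh_host} {receive}"
    return [snapshot, f"{send} | {receive}"]


for cmd in backup_commands("tank/docs", "backup/docs", "2020-06-02",
                           prev_snap="2020-06-01"):
    print(cmd)
```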

That essay seems to be essentially “hardware already does it”. If you go and read the original ZFS paper, much of the reason it exists is they found hardware lies its ass off. Hardware does not lie less since then, it lies more (witness WD recently caught essentially lying about Reds).

And SMART tells you that a disk is dying, it doesn’t tell you that a disk is not dying.

Furthermore the disk health tools can’t catch e.g. a dying cable.

SMART does report transfer errors between a disk and the motherboard/HBA. Failing cables should be detected.

That's not correct.

Some SMART implementations can report E2E errors.

Why rely on some consensus? Instead, look at what exactly you need and how to get it. ZFS is very nice, not because of some checksums or whatever but because it offers holistic storage management including incremental backups and snapshots and compression.

However, you could also use MDRaid and LVM2 Thin Volumes with any file system you like to get almost the same, without additional kernel modules.

My main gripe with ZFS is that I can't dynamically size up an array once it's created. A raidz2 can't go from 6 disks to 7 or 8 disks without completely destroying and recreating the array.

What are the limits of ZFS, can it scale up to the petabyte range like Ceph?

ZFS can scale a lot further than Ceph. The limits were designed to be large enough to be never encountered in practice. Not just large enough to be never encountered by the people working on it, but large enough that it is not possible to fit a filesystem that needs more on earth, no matter how good your technology is.

The only limits that can be reached are that the maximum size of a single file is 2^64 bytes, or 16 exabytes, and the maximum number of files in a single directory is 2^48, or 281 trillion. The other limits are large enough that to reach them would generally require more energy than it would take to literally boil the oceans.
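Those two figures are easy to check; the sketch below just does the arithmetic, reading "exabytes" as binary exbibytes (2^60 bytes), which is an assumption about the unit.

```python
max_file_bytes = 2 ** 64
max_dir_entries = 2 ** 48

print(max_file_bytes // 2 ** 60, "EiB per file")                     # 16
print(f"{max_dir_entries / 1e12:.0f} trillion files per directory")  # 281
```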

I mean practically speaking. Lets say I actually were to buy 100 x 10TB drives. I know that I can span a ceph cluster across those drives (e.g. CERN has created some 30PB cluster in the past). Can I use 100 drives in a RAID-Z?

I'm asking because the main advantage I can see for a zettabyte file system (ZFS) is at larger scale, thus I'm interested in how to actually use it e.g. for a server setup of a company that needs to save data on many disks.

> Can I use 100 drives in a RAID-Z

AFAIK yes you could, but you wouldn't want to put them in a single RAID-Z(1,2 or 3) VDEV, but rather multiple VDEVs. This is because I/O operations scale with the number of VDEVs rather than number of disks[1]. So there's a tradeoff between space efficiency and IOPS. But AFAIK you absolutely could put 100 disks in a single RAID-Z VDEV.

On the mailing lists, people frequently post about pools with that many disks or more in a single pool (split across multiple VDEVs).

[1]: https://www.delphix.com/blog/delphix-engineering/zfs-raidz-s...
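To make the space-versus-IOPS trade-off concrete, here is a rough sketch for 100 disks. It assumes, as a simplification, that random IOPS scale with the number of VDEVs and that usable space is simply the share of non-parity disks; the layouts are arbitrary examples.

```python
def layout_summary(total_disks, disks_per_vdev, parity):
    """Return (vdev_count, usable_fraction) for equal-sized RAID-Z VDEVs."""
    vdevs = total_disks // disks_per_vdev
    data_disks = vdevs * (disks_per_vdev - parity)
    return vdevs, data_disks / total_disks


for width in (100, 20, 10, 5):
    vdevs, usable = layout_summary(100, width, parity=2)
    print(f"{vdevs:>2} x {width}-disk raidz2: "
          f"~{vdevs}x the IOPS of one VDEV, {usable:.0%} usable")
```

A single 100-disk raidz2 keeps 98% of the raw space but behaves like one VDEV for IOPS, while twenty 5-disk raidz2 VDEVs give roughly twenty times that at 60% usable.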

Let's say I build such a system and in 2 years from now, I want to add another 20 disks, is that possible too?

That's the easy thing with ZFS, just add the disks as new VDEVs.

So just to clarify. A ZFS pool stores data in VDEVs. A single VDEV can be:

- a single disk (no redundancy, but same error checking)

- N disks in a N-way mirror

- N disks in a RAID-Z (single-disk parity, min 3 disks)

- RAID-Z2 with 2 disk parity

- RAID-Z3 with 3 disk parity (much slower than other two RAID-Z's due to complex math)

You can always _add_ VDEVs to a pool, and ZFS will start using it. It will try to be clever about it, so for example if existing VDEVs are quite full, it will redirect most of the writes to the new VDEV.

What you cannot do (yet) is _remove_ a VDEV.

It's also possible to replace _all_ the disks in a single VDEV with larger disks, one at a time (important!), and once the final disk is replaced the VDEV will suddenly appear to have the added capacity.
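The VDEV types and the replace-every-disk trick can be summarised in a small capacity model. It is a sketch only: it ignores metadata overhead and padding, and it assumes a VDEV's usable size is governed by its smallest member disk.

```python
PARITY = {"raidz1": 1, "raidz2": 2, "raidz3": 3}


def vdev_capacity(kind, disk_sizes_tb):
    """Approximate usable TB of one VDEV; the smallest disk sets the per-disk size."""
    smallest = min(disk_sizes_tb)
    if kind in ("single", "mirror"):
        return smallest  # every mirror member holds the same data
    return smallest * (len(disk_sizes_tb) - PARITY[kind])


def pool_capacity(vdevs):
    """A pool's capacity is the sum of its VDEVs."""
    return sum(vdev_capacity(kind, sizes) for kind, sizes in vdevs)


print(vdev_capacity("raidz2", [4, 4, 4, 4, 4, 4]))  # 16
print(vdev_capacity("raidz2", [8, 8, 8, 8, 8, 4]))  # still 16: one old disk remains
print(vdev_capacity("raidz2", [8] * 6))             # 32 once the last disk is replaced
```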

> That's the easy thing with ZFS, just add the disks as new VDEVs.

Careful though, depending on how you're setting up the VDEVs this could introduce fragility to the pool. For example, if you have a raidz2 partition and then add a single disk vdev, that single disk going down would make the entire pool unavailable.
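That fragility follows from the fact that a pool is lost if any one of its VDEVs is lost. A minimal sketch of that rule, with made-up VDEV descriptions:

```python
TOLERATED_FAILURES = {"single": 0, "raidz1": 1, "raidz2": 2, "raidz3": 3}


def pool_survives(vdevs):
    """vdevs: list of (kind, disk_count, failed_disks). Every VDEV must survive."""
    for kind, disk_count, failed in vdevs:
        # An N-way mirror survives as long as one member is left.
        tolerated = disk_count - 1 if kind == "mirror" else TOLERATED_FAILURES[kind]
        if failed > tolerated:
            return False
    return True


print(pool_survives([("raidz2", 6, 2)]))                    # True
print(pool_survives([("raidz2", 6, 0), ("single", 1, 1)]))  # False: the lone disk died
```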

Good point!

Only way to recover from this would be to either add another disk turning the single-disk VDEV to a mirror, or send/receive the entire pool.

It is actually now possible to remove a vdev. It's a new feature, not sure if it works for all RAID types, but I just removed a mirror from my pool without any issue. It migrated data off that vdev and into the rest of the pool just like a resilver.

> That's the easy thing with ZFS, just add the disks as new VDEVs

What are the drawbacks of this approach and why can't I just add additional drives to an existing VDEV?

Well, you can add drives to _some_ VDEV types. For a single drive VDEV you can add a drive and it will become a mirror. With a 2-way mirror you can add a drive and it will become a 3-way mirror.

However for the RAID-Z variants it's more difficult, because of how the parity calculations are done and how the parity is stored. If you have a 5 drive RAID-Z, you can store up to 4 blocks of data for each parity block. With a 6 drive RAID-Z you can store up to 5 blocks of data for each parity block.

So if you add a drive to a RAID-Z vdev, the system suddenly needs to keep track of the stripe width that was used when the data was written.
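To see why the stripe width matters, compare what the same data would cost in raw space when written before and after a hypothetical expansion. This is a simplified model: full-width stripes only, no padding, single parity by default.

```python
def raw_blocks_needed(data_blocks, stripe_width, parity=1):
    """Raw blocks (data + parity) for full-width stripes of the given width."""
    data_per_stripe = stripe_width - parity
    stripes = -(-data_blocks // data_per_stripe)  # ceiling division
    return data_blocks + stripes * parity


print(raw_blocks_needed(20, stripe_width=5))  # 25: written on the 5-drive layout
print(raw_blocks_needed(20, stripe_width=6))  # 24: written on the 6-drive layout
```

Data already on disk keeps the ratio it was written with, so the filesystem has to remember which width applies to which block.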

In addition there's the issue of _where_ the data is stored. If your RAID-Z VDEV is mostly full and you were to add a new drive to it (not that you currently can), then you might not be able to use all of the capacity without further ado.

This is because you can only store one block per stripe on the new drive for redundancy to work: a RAID-Z1 can only recover from the loss of one block per stripe. The rest of the blocks in the stripe have to be distributed on the other drives.

However, there is work in progress[1] to support RAID-Z expansion.

That said, RAID-Z expansion is for expanding one drive at a time. If you're adding 20 disks in one go, do it as a new VDEV.

[1]: https://github.com/openzfs/zfs/pull/8853

Is there any official statement from the ZFS maintainers on when we can expect the Pull Request (RAID-Z expansion) to be merged?

Not really, but as you can see from the PR[1] there is work being done.

The ZFS devs don't like to push unstable features, so I expect we're still quite a way off.

[1]: https://github.com/openzfs/zfs/pull/8853

Years. It'll be years.

For the RaidZ cases, you need to rebalance all the stripes, which is non-trivial to do online (you have to make sure that you don't lose data if the power is cut during the rebalancing). Without rebalancing, I think you can end up with weird corner cases (for example, with RaidZ2, if the existing 5 disks are full and you add 1 more disk, you cannot write data into the pool because there is no place for parity; this is not a problem if the disks are balanced).
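That corner case can be sketched with a toy allocator. It assumes, as a simplification, that every write needs free space on at least parity + 1 different disks (one data sector plus the parity sectors); the free-space figures are invented.

```python
def can_write(free_per_disk, parity):
    """True if enough disks have free space for one data sector plus parity."""
    disks_with_space = sum(1 for free in free_per_disk if free > 0)
    return disks_with_space >= parity + 1


# RAID-Z2: five full disks plus one freshly added empty disk.
print(can_write([0, 0, 0, 0, 0, 1000], parity=2))           # False
# Roughly the same free space spread across all six disks.
print(can_write([166, 166, 166, 166, 166, 170], parity=2))  # True
```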

ZFS can use that many drives, but it's not a distributed filesystem, so it's intended for use within a single server. For 100 disks (or some arbitrarily large number) I'd assume you want something distributed, in which case ZFS could serve as the underlying FS on each node with e.g. Gluster or Ceph. ZFS as the backing store for Gluster is fairly common, for example.

I have two systems with 140 drives in a single ZFS pool. (14 VDEVS of 10 drives each). They work great, and we haven't had any issues with them.

What type of connection do these system have (e.g. 10GbE) and how much latency? Would it be possible to span a ZFS Pool across multiple regions (us, europe, ...)?

The two systems are in different buildings on campus. Each system (with 140 drives) is a single unit, with 60-drive JBODs running on SAS-external connections.

If you want to span a pool across multiple regions, you'll want a distributed file system on top of your ZFS to manage that. Something like Lustre or Ceph. It would still be very challenging, though.

"What are the limits of ZFS, can it scale up to the petabyte range like Ceph?"

We (rsync.net) have scaled zpools to petabyte range.

A current, example configuration would be:

- 60-drive JBODs

- 15 drive raidz3 vdevs, four per JBOD

- 16TB SAS drives

That ends up being ~192TB per vdev, 768 TB per JBOD ... and if you span a pool across two JBODs, you have ~1.5 PB.
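The arithmetic behind those figures, as a quick sketch (raw decimal TB, ignoring filesystem overhead and spares):

```python
DRIVE_TB = 16
DRIVES_PER_VDEV = 15
PARITY = 3            # raidz3
VDEVS_PER_JBOD = 4

vdev_tb = (DRIVES_PER_VDEV - PARITY) * DRIVE_TB  # 12 data drives per vdev
jbod_tb = vdev_tb * VDEVS_PER_JBOD
pool_tb = jbod_tb * 2                            # pool spanning two JBODs

print(vdev_tb, jbod_tb, pool_tb)  # 192 768 1536
```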

I should note that what makes it possible to sleep at night with such a configuration is the fact that raidz3 exists. If not for that, I would not configure 15 drive vdevs with just raidz2 ("raid6") protection.

Has rsync.net ever considered Ceph and if yes, why decided against it in favour of ZFS?

No, we did not ever consider ceph.

rsync.net architecture is purely FreeBSD and there was just a very good fit - and roadmap - from UFS2 which we used from 2001-2012 and ZFS which we have used since.

Also, for better or worse, rsync.net is about UNIX and ZFS is very unixy. We think in filesystems and files and directories and ZFS lets us keep that set of abstractions.

Ceph is a distributed storage system whereas ZFS is not so you can't really compare the two by the capacity people run them at.

Okay, if I'm not allowed to compare them directly, what are their different use cases?

I believe the z in zfs is for zettabyte

Interesting technical comparison table https://www.snapraid.it/compare

I tried FreeNAS but even without data disks it ate 2.5GB ram. I went with snapraid, mergerfs and OMV combo.
