... for me using ZFS has changed the way I look at files, filesystems, data, and backups for general computing. I've been a linux user for 13 years but never felt the need to have a fileserver. Now being able to plug a drive in and take a snapshot without rsyncing or thinking about what I'm snapshotting, having it be inherent to the filesystem, was a game changer.
Not to mention being able to snapshot important folders to the native drive in case I need to recover a file from a previous state. I run datasets for categories of data and I can choose categories that I want regular local snapshots of (zvol/crypt/Documents, zvol/crypt/scripts, zvol/crypt/Papers)
Essentially, ZFS manages my files for me. And it all comes with things I didn't know I needed, like filesystem compression. I know BTRFS also attempts to provide this, and there are the licensing issues with ZFS, but I also wanted macOS compatibility. Although that was an adventure on its own.
Yes! Managing our files is the whole point of file systems! It's amazing how bad at it most of them are. Linux is still catching up with btrfs...
It's extremely aggravating how most file systems can't create a pool of storage out of many drives. We end up having to manually keep track of which drives have which sets of files, something that the file system itself should be doing. Expanding storage capacity quickly results in a mess of many drives with many file systems...
Unlike traditional RAID and ZFS, btrfs allows adding arbitrary drives of any make, model and capacity to the pool and it's awesome... But there's still no proper RAID 5/6 style parity support.
Some would argue that is the job of the volume manager, not the filesystem. On Linux it's LVM2, FreeBSD has vinum, for example.
And for my own needs, I typically favor the safety of not throwing too much complexity on top of my storage system (if recovering something would require a ton of magic I can't fully understand, I'd rather not use it in production).
ZFS is both at the same time.
That is why there is the zpool command and the zfs command. While tight-coupling is sometimes bad, in this instance it is useful:
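For example, a rough sketch of the split (pool and device names are made up):

    # zpool works at the pool/device layer
    zpool create tank mirror /dev/sda /dev/sdb
    zpool status tank

    # zfs works at the dataset/filesystem layer
    zfs create tank/home
    zfs snapshot tank/home@before-upgrade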
I would have thought this was a long solved problem. Kind of like chat applications.
ZFS is available on Linux if you want it, in fact FreeBSD is basing its support for ZFS on it.
For a mainline option, I'm personally holding out hope for bcachefs rather than btrfs.
The only thing that holds it back is the lack of proper parity support. Does anyone have an update on that? The kernel wiki says it still has problems...
> Not much has changed for raid5/6 since 2014, other than the introduction
> of raid1c3 for metadata in 2019 to make filesystem with raid6 data usable.
> Almost all of the bugs from 2014 still exist today. Developers have
> been fixing more severe and less avoidable bugs in the meantime.
Also, worth mentioning another email from the same author with guidelines for users running btrfs raid5 arrays:
As @matheusmoreira mentioned - you can use hard disks of unequal size. I.e. three disks of 4+3+3 TB in RAID1 will happily give you 5 TB of usable space. AFAIK, no other filesystem can do that.
Quoting @matheusmoreira further:
> I can slowly buy drives of different capacities and add them to the file system one by one. I don't have to plan ahead of time like traditional RAID and ZFS setups. This is excellent for home data storage!
I use zfs now and have many, many snapshots - more than 20 - without issue.
I wasn't using an NVMe drive with btrfs, it was a SATA SSD. So perhaps the higher IOPS or lower latency also helps.
ZFS is great. What today's developers will make out of it... is worrisome.
> electron snapshots on each run
...on a raspberry pi, for maximum trendiness, deployed in a checkout aisle where you have to wait for it to painfully work through its issues after every interaction. Kill me now.
This all changed when I started using ZFS. Not only does it have support for raidz and mirroring (which I could get with LVM too), but it is easily tweakable and tunable. Plus commands like _zpool status_ will give you a great overview of the health of the array in no time. It might seem like nothing but it makes all the difference for me.
I can recommend it to everybody running a (home) server. It will save you lots of time.
I am asking because this sounds really short to me and I am wondering if I am extremely lucky never to have had a failed disk in more than 25 years, mostly with more than 3 computers at a time, with only consumer grade disks of random brands. I only switch them for better performance/capacity, but some of them are almost 10 years old.
I've yet to experience a failure of HGST drives, even after 10 years of operation in an LVM or ZFS array.
Now I have a backup of everything and whenever I have to think about how much storage I need for something I always take that number and multiply it by three -- if it's only stored once or twice, it really can't be more important than /tmp.
I had five new Seagate 7200.11s which had an easy life but didn't make it past 22 to 25k hours before they started failing en masse (and that's not even counting their firmware bug that got me). ZFS RAIDZ1 (RAID5) pool, which survived a second drive starting to fail while rebuilding from the first failure. It was that event that made me love ZFS forever.
Contrast to my WD Reds which are 52k+ hours without errors or issues (no failures in a set of 8). And some HGST refurb drives that are at 70k+ hours (some failures in a batch of 14, but they were refurbs with wiped SMART data so the failures weren't unexpected).
Out of the ~30 disks I've had over ~30 years I can count at least 3 3.5" HDD failures and 5 or 6 2.5" HDD failures. Laptop drives seem to fail at a much higher rate, especially for kids.
One simple command each to:
* create a zpool
* create a dataset (persistently mounted across reboots, with optional encryption and compression etc)
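Roughly like this (a sketch - names and devices are placeholders, and native encryption assumes OpenZFS 0.8+):

    # one command for the pool
    zpool create tank raidz /dev/sda /dev/sdb /dev/sdc

    # one command for a dataset: persistently mounted, compressed, encrypted
    zfs create -o compression=lz4 -o encryption=on -o keyformat=passphrase tank/documents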
While, yes, properly tweaking ZFS for performance and making proper decisions on things like recordsizes, L2ARC and SLOG requires a bit of deeper understanding of how it works, the CLI is very approachable and the man pages are straight-forward and easy to understand for a beginner.
Compared to the alternatives of cryptsetup/mdadm/lvm/lvcache, it's such a breath of fresh air and a lot easier to work with. It's unified and intuitive and things just make sense.
The big game-changer for me is its caching mechanisms. For workloads that already have efficient caching (mature databases, for example), it might not make a big difference, but for things doing redundant IO it can improve the performance by an order of magnitude in a way the Linux page cache just can't.
Put a fast NVMe as L2ARC in front of a mirror of slow but inexpensive disks and you can eat the cake and have it.
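Attaching the cache device is a one-liner (pool and device names are hypothetical):

    # add an NVMe as L2ARC read cache for the pool
    zpool add tank cache /dev/nvme0n1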
Oh, and native compression (currently lz4 but zstd is coming) that barely impacts performance (and sometimes improves it).
I'm one of those weird guys now and even my nerd friends don't get me anymore :')
Also, 18 months sounds kind of short, I usually get around 5 years of use from hard drives. What brand are you using? I highly recommend checking out the quarterly Backblaze HDD failure reports.
Source: been running a server (read: consumer headless desktop) for years without issues.
You're not wrong. And if there's a mechanical issue with a drive and it dies / stops spinning, then you'll probably get an alert and can swap it.
But if there's any kind of bit rot or data corruption, and it only happens on one side of the (traditional) RAID1 mirror, how will you (a) know that it actually occurred, or (b) know which side is good bits and which side has bad bits?
With ZFS and checksums, you can be confident that your data is still healthy. Depending on the data, this may be an important consideration.
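A scrub makes this concrete (pool name is a placeholder):

    # read every block and verify checksums; bad copies are repaired from the good mirror side
    zpool scrub tank

    # report checksum errors and list any files affected
    zpool status -v tank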
zfs send-recv is also very handy for doing incremental backups.
 - https://docs.oracle.com/cd/E36784_01/html/E36835/gkkqz.html
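The incremental flow looks roughly like this (host, pool, and dataset names are made up):

    # initial full replication to the backup machine
    zfs snapshot tank/data@monday
    zfs send tank/data@monday | ssh backuphost zfs receive backuppool/data

    # later: send only the blocks that changed since @monday
    zfs snapshot tank/data@tuesday
    zfs send -i tank/data@monday tank/data@tuesday | ssh backuphost zfs receive backuppool/data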
As others have mentioned, would "zfs diff ..." be useful?
As the name suggests, "snapshots" are read-only and so cannot be altered. You could either copy/rsync the modified file(s) to the live location, or do a rollback to a particular snapshot:
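For example (dataset and snapshot names are placeholders):

    # copy a single file back out of the read-only snapshot directory
    cp /tank/data/.zfs/snapshot/mysnap/some/file /tank/data/some/file

    # or roll the whole dataset back, discarding everything written after the snapshot
    zfs rollback tank/data@mysnap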
If the machine is compromised in some way, you could reinstall and do a "zfs send-recv" of a pool from a remote system.
Filesystems don't read SMART data. You might have a separate daemon which monitors SMART.
ZFS checksumming is amazing for this. You know, without doubt, which file(s) are bad. You can still use failing or unknown quality drives because the checksumming will protect you from silent data corruption.
Of course, that's great for a server farm; in home use monitoring is a lot less structured.
as long as it writes and reads back (most of the time): ZFS will deal with it
That seems unusually bad. Up until earlier this month my old raid array has been mostly desktop drives (just upgraded it to IronWolf NAS drives though). I did finally do the upgrade to new drives because one died, but that drive was manufactured in 2011, and the rest of the drives are from 2010, 2013, and 2015. To be fair I've not gone five years without a failure, I think the 2015 drive was added about 3 years ago (being my actual desktop drive before that), but otherwise have had good luck. In fact, I think my most recent failure may have been accelerated by my server being put into a temporary case with poor air flow.
Might be power supply issue.
The first one is ZSTD compression, which will work great together with MySQL and PostgreSQL.
The second major feature is persistent L2ARC; I'm using LARGE SSDs as cache, and the warmup takes weeks. So rebooting has a major performance impact.
For the last 10 years, I have been using FreeBSD with ZFS. This has been working perfectly with good performance. But now I want to take advantage of even faster network speeds, with RoCE/RDMA. And FreeBSD support for iSER and NVMe-oF is non-existent, while Linux has excellent support for these technologies.
Highly recommended as something to learn and use, it's obviously the best choice for a home built filer, but is also an excellent choice even for general purpose server use. My colo setup is running FreeBSD w/ ZFS which is very stable for backending VPN servers, web servers and app servers of all stripes, etc.
I’ve got a couple of projects I’d like to do that require maybe 100TB of storage: some scientometrics against the sci-hub collection, as well as building a bajillion scala projects from github. I don’t really care about data redundancy.
The cheapest way I’ve been able to figure out to do this is just buy a case with 15 HDD bays, eg the Anidees AI crystal case, and just get a ridiculously beefy processor + 256gb RAM and do all the computation on a single box.
Does this sound right? All of the purchasable NAS cases seem more expensive, but I’m out of my element.
I expect I’m going to want to figure out ZFS to make a single logical drive.
Does this seem right? Building eg a backblaze pod is outside of my budget, and my eyes glaze over whenever I try to read about NAS controllers.
This is 1080 TB (in 12 TB disks) in one server under ZFS:
For 100 TB you would get a smaller server, like a 2U with 12 slots filled with 10/12/14 TB disks for your 'usable' 100 TB space.
For more ZFS and/or FreeBSD storage options check this:
Hope that helps.
I'm surprised you went with one single HDD model. Don't people normally say "Go for different HDDs from different manufacturers, so that in case of manufacturing defects not all HDDs will break at the same time"?
You can slice and dice the drives however you like. I found this guy's blog posts to be useful for running a homegrown NAS:
"It seems though that the one URE in 10^14^ bits (an error every 12.5 TB of data read) is a worst-case specification. In real life, drives are way more reliable than this specification."
I don't believe that statement without data backing it up.
Disk drives are right at the engineering edge. There is a reason why consumer class drives have one MTBF and enterprise drives have a better MTBF. If a disk drive manufacturer could somehow cite a better MTBF, they absolutely would (see the whole GB vs GiB marketing stupidity, for example).
If you plan on using ZFS, make sure to account for the overhead and parity drives. For example, I have 12x 16TB (192TB raw) drives in raidz2; including overhead and parity I only have ~125TiB usable.
Basically the reasons I'm drawn to zfs are:
- checksumming & self-healing
- ergonomics & flexibility of managing pools with zfs
- copy on write for cheap local copy/experimentation (i.e. just clone your DB folder and you have a new DB - see the sketch below)
- zfs send/recv for very efficient incremental backups
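The clone trick from the list above looks roughly like this (dataset names are made up):

    # snapshot the live DB dataset, then clone it into an independent writable copy
    zfs snapshot tank/db@experiment
    zfs clone tank/db@experiment tank/db-experiment
    # the clone shares blocks with the snapshot, so it costs almost nothing until it diverges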
From what I can find it seems like btrfs does all that, and faster. In addition to being faster, it also is in-kernel, and more flexible for the user in various ways, for example allowing resizing. Looking around, btrfs may not be blessed as stable, but there are a lot of big orgs using it.
All that said, there are articles like this one which are somewhat dated but paint ZFS really positively from a maintenance point of view. Very hard to pick between these two.
I'd really like to use ZFS -- the community seems very welcoming and amazing but I'm a little worried about picking the wrong tool for the job.
[EDIT] - there's also this old comparison from phoronix which is confusing. I'm still leaning towards ZFS but sure would like to hear some strong opinions if anyone has em.
- I've corrupted btrfs filesystems with compression just through normal use, on hardware that is fine. This may have been fixed, but it was on relatively recent (post 5.x) kernels
- zfs's logical volume layer is rather more flexible than btrfs's - you can make a multi-device set of mixed disks. For example, my backup pool is 9x4TB and 7x3TB. These are individual raidz sets that are combined together into one pool. In ZFS this is all in one place and it means the fs is aware of the logical disk layout and where data is stored. To do this on btrfs I'd need to use lvm and I'd in theory lose a bunch of the self-healing ability
- btrfs's snapshotting seems excessively complicated - it requires you to create a non-trivial logical layout in the fs, and it's very easy to accidentally expose these snapshots to the system's view of the fs. It's more flexible, but more annoying for my usecase. zfs's on the other hand is really very simple, and much much easier to use
With that said, ZFS on linux is slightly awkward as it's out of tree, and most distros build the module with DKMS. I don't entirely trust this for using for / (so my / is just a md raid1) - I use zfs for bulk data instead. btrfs is in-kernel, so there's no real disadvantage with using it for /.
- Unlike btrfs subvolumes, zfs datasets can be mounted with different zfs-specific mount options, such as compression algorithms or recordsizes.
- zfs can take atomic recursive snapshots of nested datasets, whereas btrfs snapshots of a subvolume do not include nested subvolumes.
Overall, zfs treats datasets as first-class citizens, whereas the only purpose of btrfs subvolumes seems to be to exclude folders from snapshots.
Is that true? RHEL/CentOS have both DKMS and kmod versions; in Ubuntu ZFS is a supported package. That's not most distros, certainly, but it is the ones most people use.
Not used Ubuntu ZFS, admittedly. Haven't used Ubuntu since 2010 or earlier.
You may wish to checkout ZFS dRAID, which recently got committed:
Can you please elaborate on this? Working on mixed disks was always a killer feature of btrfs for me!
As far as I could work out from the btrfs documentation, this isn't currently possible? Plus, RAID5/6 in btrfs is still of questionable stability?
Indeed, btrfs doesn't allow such configuration. However, it allows using disks of unequal sizes within single filesystem - like, 3+3+4T in RAID1 mode gives you 5T of usable space (imagine 4T disk being split in half, and each half duplicated to a different 3T disk; remaining space of 3T disks duplicated to each other). But as I understand, it's possible to achieve the same with manual partitioning and vdev allocation on ZFS, too.
RAID5/6 is indeed a danger zone - in another thread I've already mentioned a nice write-up of "guidelines for users running btrfs raid5 arrays to survive single-disk failures without losing all the data" by Zygo Blaxell: https://lore.kernel.org/linux-btrfs/20200627032414.GX10769@h...
ZFS needs serious tweaking if you have performance-critical workloads. I experienced this on databases, when comparing against ext4 - at the very least, one needs to move the ZIL onto a separate disk. Also, until a short time ago, ZFS had performance problems with encrypted volumes, due to (let's call them) formal issues with the kernel - as a matter of fact, it was slow on a laptop of mine.
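Moving the ZIL onto a dedicated SLOG device is simple enough (pool and device names are hypothetical):

    # give sync writes their own fast log device instead of the data disks
    zpool add tank log /dev/nvme0n1p1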
All in all, I don't use BTRFS because of trust issues. BTRFS is in a sort of "never stable" camp, which is not a good indicator of engineering practices. The nail in the coffin for me was that in the official FAQ, at least up to some time ago, there was the tragicomic cop-out statement that the concept of stability in software is just a matter of labeling, because all software has bugs.
Performance has been great so far even on my underpowered machine, even with just 4GB. I don't use deduplication though which makes a huge difference.
I chose Ubuntu 20.04 as OS since they are pushing ZFS support. Did consider FreeBSD as well where I had a positive experience in the past but since they are switching to OpenZFS anyway I stuck with Ubuntu since I'm running Debian derivatives on all my servers.
> Performance has been great so far even on my underpowered machine, even with just 4GB. I don't use deduplication though which makes a huge difference.
So almost every OpenZFS community video/talk I've seen recently has included a line like "friends don't let friends dedup" or something to that effect... I think dedup is considered unnecessary/dangerous these days with how good compression is. Not sure exactly what the dangers were, but I know that I wouldn't even turn it on.
> I chose Ubuntu 20.04 as OS since they are pushing ZFS support. Did consider FreeBSD as well where I had a positive experience in the past but since they are switching to OpenZFS anyway I stuck with Ubuntu since I'm running Debian derivatives on all my servers.
Same, I used to run lots of different OSes but I've settled down on Ubuntu for everything now, and OpenZFS having good support is what makes it possible there.
No expert either but from what I've gathered the main issue is that the dedup tables require a _lot_ of memory that scales with the size of the pool, and if they can't fit in RAM performance tanks. However the real issue is that any blocks written while dedup was enabled will demand this dedup overhead, even if the dedup option is turned off. That is, it's a "sticky feature".
So if you use dedup, notice performance tanks because dedup tables are too big, well you're screwed because turning it off won't give you back the performance. Only way to recover is to send/receive the whole shebang to a separate pool.
In addition, they don't actually bring that much space saving on common workloads. Most people don't have a lot of truly identical blocks of data.
VM backing storage seems to be the biggest worthwhile use case, but that depends on whether snapshots and clones are used extensively. Installing 100 copies of Debian on empty VMs will likely get deduped quite a bit. But it's faster and provides almost the same benefits to install one VM, snapshot it, and produce the rest of the VMs by cloning from the snapshot.
The only other case I could imagine dedup being good for is storing a lot of genomic data: https://techtransfer.universityofcalifornia.edu/NCD/25080.ht...
But if the use-case is narrow enough for custom deduplication then it will probably be much more efficient than ZFS's block-based dedup.
Things to consider:
While a volume can span multiple devices (physical), historically it hasn't been the most stable and many users just stick with md, so real world testing is probably limited. Details are in the wiki.
When a volume starts getting full (let's say above 98% or something), performance suffers. This is also documented behaviour. Take monitoring seriously, even more than usual.
Pools and subvolumes can be a bit confusing at first, even if you've had experience with other volume managers. Read the documentation and make sure you know what you're doing.
btrfs is also a volume manager with RAID-like functionality of its own. But you can also use btrfs on an md device, just like you would with any other filesystem.
Software RAID is something I have only used for personal use, and with the enormous consumer hard drives available now, striping them seems less necessary than before.
This is actually pretty interesting because one of the 'features' of the hosting provider I'm using is that they will software RAID your drives by default. Maybe btrfs is a better choice in that kind of environment if I don't have to undo the software raid on every machine and btrfs will interop well without too much abstraction.
Of course it depends a lot on your read/write patterns what impact this has, though.
On adding disks, I don't know if that's what you're referring to, but you can't add new disks to an existing RAIDZ.
Personally I've only used mirrors and single-device vdevs so far, haven't seen any need for RAIDZ.
And yeah, ZFS can't be expanded willy-nilly, found a good blog post with someone's adventures that was illuminating.
I’m not taking any chances, though, seeing as it is “unstable” – I’ve got snapshots and local and remote backups set up and working flawlessly. It’s sooo awesome to be able to pluck older versions of any file on my system whenever I want, and it’s saved me numerous times.
I haven’t tried zfs but I really have no need to. Btrfs does the job for me.
Plus, I love snapper and the way subvolumes and snapshots are handled.
EDIT: I will add that I don't use any RAID. I use BTRFS for snapshots, cow and data integrity on my single drive desktops. I think that's where it really shines. I don't think anyone should be using EXT4 anymore.
I'm also not necessarily going to do any RAID5/6 stuff -- I'm probably just going to keep it safe (for my level of understanding) and do a RAID1/mirror setup and call it a day. The snapshots/cow/data integrity bit is definitely what I'm interested in as well. It feels to me like as long as I run ZFS under my servers I am much safer than anything else (and it's easier/possible to go back in time and undo mistakes).
Unfortunately, there's that whole thing about it being hard to boot to.. Is that still a thing?
But for the rest of it, I just have everything in my fstab and it works like anything else. Super easy.
Regarding performance, I'm guessing you won't be happy with either if you're not happy with performance. You want to be looking at ARC / SLOG if you want higher performance.
Moving writes to ARC or SLOG-on-a-faster-thing would also definitely help, but I'm dealing with SSDs for the most part.
Also, talking of faster storage, NVMe looks really bad for ZFS (and probably btrfs), based on this reddit post (graphs). It's not terrible of course, and some recommended that maybe actually turning ARC off would be better, since it might have been actually getting in the way of the NVMe drive.
 - https://github.com/openzfs/zfs/pull/8853
It's worth noting you can expand a RAIDZ through replacing disks - if you start off with a pool of 4x2TB for instance say giving 6TB usable, you can expand it by replacing those disks one by one with 4TB disks - in which case you eventually end up with 12TB usable, once all disks are replaced.
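The replace-one-by-one route in commands (pool and device names are hypothetical):

    # let the pool grow once all members of the vdev are bigger
    zpool set autoexpand=on tank
    # repeat for each disk, waiting for the resilver to finish in between
    zpool replace tank /dev/sda /dev/sde
    zpool status tank   # shows resilver progress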
Alternatively, you can add another RAIDZ to the same pool with extra disks (but you will lose more capacity this way).
Otherwise, recreate the pool, and restore from your backup (which you definitely have, right?). Assuming both your live and backup are zfs, this is easy with zfs send | zfs receive.
My solution was:
offlining one drive (sdb), degrading vpool1
creating a vpool2 in RAIDZ1, with sdb, sdc, and an 8TB sparse file on a flash drive
offlining the sparse file, degrading vpool2
sending a ZFS snapshot from vpool1 to vpool2
destroying vpool1, and adding sda to vpool2 to replace the sparse file
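In rough command form (a sketch from memory, so double-check before trying it; pool and device names as above):

    zpool offline vpool1 sdb                      # degrade vpool1
    truncate -s 8T /flash/sparse.img              # sparse 8TB file on the flash drive
    zpool create vpool2 raidz1 sdb sdc /flash/sparse.img
    zpool offline vpool2 /flash/sparse.img        # degrade vpool2 before real data hits the stick
    zfs snapshot -r vpool1@migrate
    zfs send -R vpool1@migrate | zfs receive -F vpool2
    zpool destroy vpool1
    zpool replace vpool2 /flash/sparse.img sda    # rebuild vpool2 to full redundancy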
> This would be a pretty common configuration choice for a lower-end VM storage box. If you only had 16GB or of RAM in your system, all of your ARC space would be wasted with L2ARC mappings and you would only have 2GB of the entire rest of your system.
Am I misunderstanding this? 140,000,000 bytes is 140MB (salesman MB, not 2^20 bytes). It looks like they're saying it's 14GB.
There is a de-facto term distinction: MB megabyte (1000^2) vs MiB Mebibyte (1024^2).
I've broken a lot of systems and data over the years. With that in mind, I like my data storage (NAS/filer) to be boring and predictable. FreeNAS is exactly that for me.
I'm running FreeNAS on an ancient Lenovo minitower with just four drives as two mirrored volumes and it's about time I upgrade the thing. But it just works and only draws about 60W.
I do a few things including A) keep it patched B) minimize the number of exposed services C) utilize pfsense to control [network-level] access above and beyond what freenas itself provides D) stream all host/service logs to an ELK stack to review for any funny business that may occur.
Not bulletproof but I haven't had an incident yet.
Hope this helps!
P.S. All of this assumes proper backups. I can restore most of my stuff from backups, it just will take forever.
It is very important to realize that you can’t expand VDEVS. You can only add VDEVS.
This makes expanding storage less flexible than regular MDADM RAID. Going for Mirrors is the most flexible but you lose 50% of capacity.
You also get the random IOPs performance of a single drive per VDEV. Sequential performance does scale within a VDEV.
You scale random I/O performance by adding VDEVS.
For home usage, as a NAS, you don’t need to use SSDs for a SLOG, unless you have write-intensive random I/O workloads.
There is work ongoing to allow adding drives to a RAIDZ VDEV. And VDEV removal gives some flexibility (at the cost of indirection on the read path). But yeah, ZFS is not super flexible for adding and removing random drives.
I've detailed this also here.
You may also realise from the top of my article that expanding VDEVS has been a topic since 2017 but is still a no-show.
I would now go for Ubuntu + ZoL myself. I bought all capacity up front and paid the ZFS tax. So I don't need to expand as I go.
But if you want to expand as you go, Linux + MDADM are still fine in my opinion. It's a tradeoff. Do you want to 'pay' the ZFS tax and expand at a cost, or do you want a bit more risk (on paper) but more flexibility?
I ran a RAID6 of 20 drives before that using Linux + MDADM and that worked fine. And MDADM allows you to expand as you go, exactly as you want.
I think the risks ZFS protect against are very small.
If you want true freedom you'd need to go Ceph(FS) based, but there's no GUI to manage a Ceph cluster that I'm aware of, and they're so focused on cloud usage that you'll find little guidance on single-node clusters.
edit: Forgot to read Louwrentius' comment, of course MDADM with raid 6 allows you to add disks in pairs of two, so that's definitely also an option. Don't forget to use LVM, it can get real awkward without it.
I think technically you can do something similar to Synology's Hybrid RAID with ZFS, using partitions and creating mirror VDEVs from the partitions.
But I haven't gone through all the details so could be there are some risks I overlooked and in any case you'd have to manage it from the command line so would be a bit tedious. But I did set up a proof of concept in a VM.
>but there's no GUI to manage a Ceph cluster that I'm aware of
There are two, Open Attic and Calamari, but Open Attic is in maintenance mode, so all the work now goes into Ceph Dashboard:
This isn't strictly true. It is true for RAIDZ-n and for writing to a mirror, but ZFS can accelerate random reads with a mirror by distributing the read requests to the underlying disks.
Their forum is also great for information about running ex-enterprise gear at home.
My use case is both CephFS and RBD for a small two node OpenStack cluster. I have found CephFS to be rather performant for my use cases, enough that I had to get a 10Gbps switch. I am not maxing out the machine or the bandwidth on that switch, but my individual clients use more than 1Gbps.
OpenStack and Ceph tie together wonderfully. I have my VMs backed by NVMe drives and my VMs are snappy. Recovery is quick too. I am using crappy first gen xeon-d boards and even with those I hit 8Gbps recovery on those drives.
Ceph shines when you have a lot of parallel access. It is recommended to have at least ten nodes for a production cluster so recoveries do not take too long. If you have a lot of clients Ceph is king.
I used ZFS in the past as a simple Fileserver before using Ceph. It worked well, I could saturate a 1Gbps link, however I found the vdev resize limitation too restricting at times when I wanted to expand by a little bit. It is pretty easy to manage, though I find Ceph very easy to manage as well.
For my backup server which is a target for BorgBackup I went with btrfs for the better flexibility it offers with resizing arrays.
ZFS and Ceph work at different 'layers'.
ZFS provides redundancy within a server, so if drive(s) die then the services on that server can continue to run without interruption. Ceph provides redundancy between servers, so if drive(s), servers, or even entire racks/ToR switches die then things keep going. Ceph is generally for much larger scales (e.g., OpenStack, HPC) than ZFS, which is usually done on NFS or SMB servers.
Until relatively recently, Ceph was also only accessible at a block layer, so you'd have to put a file system on top of it (i.e, Ceph gave you a /dev/sdX), but somewhat recently CephFS has become/declared 'stable'.
Regardless, you still need three machines for quorum for Ceph. If you just want a file share, then it's probably unnecessarily complex.
Well, yeah, good point to make but it is becoming harder and harder to find CMR drives, especially in 2.5".
ZFS should really adapt to this. Perhaps using bigger block sizes or something. Because SMR is not going away.
Although frankly, at this point if you care about performance you're on NVMe anyway.
I use two 5TB 2.5" SMR drives in a ZFS mirror:
These drives can slow down to 30-40 MB/s when filled to 80% or more but I use that storage over WiFi which is at most 11-12MB/s which means the SMR problem does not exist for me.
If I would be using that storage over LAN the 30-40 MB/s in 'WORST CASE' is also not bad considering that maximum real life LAN speed over gigabit network is about 80-90 MB/s.
It's also not possible to get large non-SMR 2.5" drives. I use 2.5" drives as they are silent and need a very small amount of power compared to 3.5" drives.
Rebuilds certainly are a problem but beyond that I think it's simply "be ok with slow drives". I've been operating a decently sized SMR pool for 3 years now with no major issues, including surviving two drive failures.
If you try to put random write workloads onto SMR you're gonna have a bad time no matter what you do. In my use case it's great, since this is effectively WORM storage writing giant files to disk all at once so I have effectively 0% fragmentation and all subsequent reads of the file tend to be sequential.
That said, this is for my personal use and lab projects. Not sure I'd go ZFS+SMR for a production workload.
Metadata not encrypted: dataset/snapshot names, dataset properties, pool layout, ZFS structure, dedup tables.

ZFS encrypts: file data and metadata, ACLs, names, permissions, attrs, directory listings, all zvol data, FUID mappings, master encryption keys, and all of the above in the L2ARC and the ZIL.
For most uses and use cases this is a net increase in security. You can do some operations on data without needing the keys.
The downside is that you have a choice between encrypting all file data twice, or losing the benefits of ZFS's encryption (mainly the ability to send snapshots to another pool without decrypting them). It would be nice if you could specify a pool key to be used to encrypt all blocks not covered by ZFS native encryption, which would eliminate the need for LUKS.
Encrypt each hard disk and add the decrypted block devices to the vdev in whatever RAID configuration desired.
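The layering looks something like this (device and mapper names are placeholders):

    # unlock each disk with LUKS first
    cryptsetup open /dev/sda crypt-a
    cryptsetup open /dev/sdb crypt-b

    # then build the pool out of the decrypted block devices
    zpool create tank mirror /dev/mapper/crypt-a /dev/mapper/crypt-b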
I know it’s an old deck... has anything substantial changed since it was put together?
Specifically around the licensing , inclusion in the kernel, etc.
It looks like FB in particular had invested a lot in btrfs  over ZFS.
Maybe a good rule of thumb is that if you've ever needed to track down a missing file, fix file corruption, or have contingency plans for such an event in production then ZFS can help. If the default action is to wipe a host and reinstall if the local filesystem looks fishy then ZFS won't provide many benefits.
Btrfs has the advantage of living in the kernel tree (but I use ZFS on Freebsd where the same is true), but still carries an aura of not being quite done. There are several "things seem to work well at version X, and watch out for Y" statements in https://btrfs.wiki.kernel.org/index.php/Incremental_Backup for example. This is a year old at least so maybe all bugs and caveats are fixed. How could I be sure? No idea. zfs send and receive just work.
Btrfs seems to be the filesystem to use if you can dedicate enough time to testing your specific use case thoroughly and keeping up to date with changes and improvements as opposed to ext4 or ZFS which change relatively infrequently.
ZFS enables/tracks new features in the zpool metadata so there's a modicum of forward- and backward-compatibility.
To be honest I am even a bit leery of ZFS-on-Linux but even Freebsd is moving there soon. Abandoning a working, trusted codebase is always scary.
Licensing is a minor issue with ZFS, but if you aren't distributing the code to others, those issues don't affect you.
>• Raid 0, 1, 5, and 6 are also built in the filesystem
BTRFS RAID has been a nightmare from day 1 including complete data loss. Parity-based RAID has been "just around the corner" for almost a DECADE. Sure you can do RAID-1 but in 2020 I'm just not interested in losing half of my capacity.
Yes you can layer it on top of MDRAID but that eliminates half the elegance of what ZFS brought to the table.
>ZFS is fairly memory hungry, it's recommended to have 16GB of RAM and give 8GB or more to ZFS (it wasn't designed to use the linux memory filesystem, so it uses its own memory that can't be shared with the rest of linux).
This is just flat wrong. You need 2GB of memory for a happy filesystem, 8GB+ is if you're doing deduplication which is unnecessary overhead in most environments. I'm also not sure where the "memory that can't be shared with the rest of linux" is coming from - ARC will use and free memory as needed by the system.
>Due to the CDDL being incompatible with GPLv2, a linux vendor or hardware vendor will never be able to ship a linux distribution or hardware device using ZFS
Except Ubuntu already is. I believe SLES does as well.
>As a result, you shouldn't plan on using ZFS for any product that you might ever want to ship one day.
Delphix ships a product today based on ZFS. 0 issues.
>Oracle may have stopped further work on ZFS as a result.
Or it could be another reason entirely...
Oracle absolutely didn't stop work on ZFS, I'm not even sure where he came up with that nonsense. Oracle continued to update and release new versions of ZFS long after the lawsuit was settled.
You know what's far more telling? 13 YEARS after starting BTRFS: Oracle uses ext4 as their default filesystem, not BTRFS. Redhat has dropped support for BTRFS entirely.
It seems that the ZFS story has gotten much better since 2015 (FreeBSD & Ubuntu support, memory usage improvements, license problem proving to be not a big problem).
Correspondingly, it seems that the btrfs story has regressed (redhat drops support, many issues in this greater thread raised around reliability and use cases, suspect development practices).
My guess is that they were closer to parity in 2015. This deck was making the case for why btrfs would win out. The author seemingly turned out to be wrong, as we're all liable to be from time to time.
My guess is that the licensing might be an issue for some big corporations today, but it appears to be generally benign, like you say.
Thank you so much for the information! :)
My plan is to upgrade the host to Ubuntu 20.04 to see if the newer ZoL version helps. Also looking forward to root volume on zfs so I can do snapshotting to the raidz2 and just let the drive fail eventually.
I have in a NAS a RAIDZ1 of 4 HDDs on which I set a recordsize of 1MB, currently 50% full, and so far performance has been good with both big and small files...
I'll probably create in the future a RAIDZ2/3 using ~8 HDDs and I'll test various recordsizes, but I just wanted to know if anybody already had any positive/negative experiences with some combination of recordsize and RAIDZx...
Some good beginner info
Some advanced info
I did read Arstechnica's article in the past but I did not feel comfortable with their results... (I'm not challenging them, I'm just not sure if they're relevant for me or not).
So, I just did a test (ashift 12, RAIDZ1 with 4 8TB HDDs) and I got better performance in both cases with a 1MB recordsize vs. 128KB (all sequential).
1MB recordsize:
- reading one 10GB file: 21 seconds
- reading 10000 1MB files: 83 seconds

128KB recordsize:
- reading one 10GB file: 31 seconds
- reading 10000 1MB files: 116 seconds
Ok, it seems complicated => I'll just have to test different variants :)
Right. People who've done more testing than me reckon on 16KB being a good record size for transaction-processing database work, where tables are seeing lots of small inserts and updates. (You might think matching the database's block size would be ideal, e.g. Postgres writes 8KB at a time, but the rationale here is that you tend to get better compression at 16KB recordsize than 8KB, and the benefit from this outweighs the write-amplification.)
But if database update performance isn't a big deal for you then you can probably just ignore this.
I've not done any testing of my own at the 1MB size, but I don't think I'd be inclined to try it unless I was fairly confident that there weren't going to be many small writes to big files.
In short: use the large recordsize where you think you've got a good case for it, and likewise with a small record size. Otherwise, just stick with the default.
Yeah, in my case the DBs "Clickhouse" and "MariaDB+MyRocks" might fit the 1MB case well (as they both never "update" existing files but keep writing new files, not just for "inserts" but for "updates" as well; Clickhouse barely supports "update/delete" at all, heh).
On the other hand "PostgreSQL" and (maybe) as well "MariaDB+TokuDB" might need a small recordsize -> I'll have to test it, and anyway, splitting each single DB to use different datasets seems to be a great idea :)
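Per-dataset recordsizes make that split trivial (dataset names are made up; 1M records need the large_blocks pool feature, which is enabled by default on current pools):

    zfs create -o recordsize=16K tank/postgres     # lots of small in-place updates
    zfs create -o recordsize=1M  tank/clickhouse   # large, append-only files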
You can issue arbitrary 'zfs send' to rsync.net, over SSH.
I ask as an owner and regular user of several Pi models, including a 4.
With that said, assuming you choose PCIe, acquire a decent HBA, and build it into a decent enclosure, I'm sure it'd work as well as typical low-end off-the-shelf NAS boxes, and wouldn't cost that much more in time and materials to set up.
I have a Raspberry Pi 4 that mounts a USB hard disk and serves files over SMB and Nextcloud. I have been considering reformatting the drive to ZFS or btrfs and booting the Pi directly from that so that I can start taking snapshots. Is this a bad idea?
I’ve looked at buying dedicated NAS hardware before (mostly Synology products) but I’m always deterred by the cost. A low end Synology NAS with drives runs around $500 or $600, which is a huge jump from my little Pi.
For a low-cost NAS you could use a spare computer. My first NAS was my old desktop computer, using the motherboard SATA ports.
That said, ZFS should work on the Raspberry Pi 4 at least in 64bit mode. You probably will want to use Ubuntu as it has ZFS support.
If you have a spare drive you can test with, give it a whirl. Just keep this in mind if you see poor performance from the USB-SATA bridge.
My home net has several pi3 and pi4 systems that boot off a small SD card and then mount working storage via iSCSI, which works great for how I use them. There's no reason I couldn't serve that off a pi, but their limited IO and memory make them a poor choice.
Currently still running OpenSolaris on my ZFS server, probably will be FreeBSD next time I upgrade the hardware, but time will tell. Regardless of the OS, it will be ZFS!
Can't live without boot environments on physical hardware.
FreeBSD is BSD licensed of course and that's considered compatible with CDDL given that CDDL is file-based.
It simply didn’t fit with the hundreds of other linux boxes we had.
We ended up scrapping the zfs idea and bought 1000+ disks (about 2PB) worth of linux storage over the next decade, on xfs and ext4.
Had Sun’s x4500 platform worked with a Debian based linux we’d have bought that instead of supermicro. Sun lost because of their choice to exclude zfs from linux.
From what I can tell, zfs is/was great, but wasn’t good enough to change our standard OS, and since then people moved to object storage
Historical context: ZFS predates OpenSolaris by quite a few years, so the above statement can't be technically true.
The box was still in use in 2012, we were having issues and a “zfs upgrade” was suggested, but at that point we were adding new storage on linux.
I'm trying to test it out on an i3a.large instance with a 1.2TB NVMe SSD and benchmarking postgres on it. Trying to move out of RDS since from time to time I have heavy IOPS scripts.
Only thing left is doing zfs snapshots next for my backups.
zfs snapshots or pg_dump snapshots? I wonder what's better.
I'd much rather manage a simple local disk with offset backups.
The article is interesting in its contrarian view; however, when it comes to bitrot, it counters anecdata with other anecdata:
> One bit flip will easily be detected and corrected, so we’re talking about a scenario where multiple bit flips happen in close proximity and in such a manner that it is still mathematically valid. While it is a possible scenario, it is also very unlikely. A drive that has this many bit errors in close proximity is likely to be failing
I detected bitrot once or twice, and in neither case the drive was failing. This is anecdata though - is it valid? Who knows.
I'm personally skeptical about blanket statements (which the author makes) without seriously backing data.
I have a ZFS setup, and it's arguable whether it's a hassle in itself. At least for RAID-1 setups (I have two), once installed, it's not inherently harder to maintain than other FSs. Installation is manual, and that's definitely a hassle, but users are definitely intended to be advanced ones.
Regarding SMART: it's not as easy as the article author states. I have a laptop that periodically pops up with new instances of a certain error, but the SMART guide says this is not an error one needs to consider, so I'm confused. Additionally, the smart-notifier of Ubuntu (at least up to 18.04) is broken. I agree that SMART is important to consider, but it's not as straightforward as it seems.
It is perfectly possible to read corrupted data from a disk. I know this because I've seen it happen several times over the years. If your system is making decisions (i.e. generating new data) based on read information, this can actually be quite harmful. Like it or not, transitional errors on to-be-failing disks may cause data corruption. It is easy to say "hey, restore from backups!", but weeks may go by before an actual failure happens. By then you don't know if your backups are tainted, or how long your storage was misbehaving. ZFS actually helps with this, because it can tell you explicitly that your file/block is tainted, even if readable. This provides a level of confidence in the system, based on observability. And snapshots can actually refer to different blocks, so it is often possible to recover a previous version of a given file without firing up the backup system.
Also, the idea you need RAID for healing is nonsense - ZFS can keep multiple copies of your block even on a single-disk system.
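That's the copies property (dataset name is a placeholder):

    # keep two copies of every block, even on a single disk;
    # guards against bad sectors, not against losing the whole drive
    zfs set copies=2 tank/important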
To finalize on the "bitrot" topic, keep in mind network communications, as most serial protocols, have varying degrees of CRC and checksum checks at different levels of the stack, following the end-to-end principle. We even use compression and encryption on top of that, that also provides multiple verification methods. Yet most relevant files such as iso images have a checksum file to verify your download - and sometimes it doesn't match. ZFS provides you the same functionality, but for your storage.
I can say that at both work and home, I have only ever seen groups of mirrors or RAIDZ in use - I've never seen just striped pools or single disk ZFS.
I know it's anecdotal, but I have seen ZFS recover data flawlessly with drives returning incorrect data for some sectors with no I/O errors, or from total and sudden drive failure with no SMART warning. I personally think drive hardware is rather more fallible than this article assumes.
With that said, of course ZFS is not a magic bullet, and there's no substitute for backups - but ZFS does make that easier too, because snapshotting is trivial, and zfs send | zfs receive is very useful for transferring the snapshots to another pool for backup. And it does require an amount of reading and understanding before you set it up.
And SMART tells you that a disk is dying, it doesn’t tell you that a disk is not dying.
Furthermore the disk health tools can’t catch e.g. a dying cable.
Some SMART implementation can report E2E errors.
However, you could also use MDRaid and LVM2 Thin Volumes with any file system you like to get almost the same, without additional kernel modules.
The only limits that could conceivably be reached are the maximum size of a single file, 2^64 bytes or 16 exabytes, and the maximum number of files in a single directory, 2^48 or 281 trillion. The other limits are large enough that reaching them would generally require more energy than it would take to literally boil the oceans.
I'm asking because the main advantage I can see for a zettabyte file system (ZFS) is at larger scale, thus I'm interested in how to actually use it e.g. for a server setup of a company that needs to save data on many disks.
AFAIK yes you could, but you wouldn't want to put them in a single RAID-Z(1,2 or 3) VDEV, but rather multiple VDEVs. This is because I/O operations scale with the number of VDEVs rather than number of disks. So there's a tradeoff between space efficiency and IOPS. But AFAIK you absolutely could put 100 disks in a single RAID-Z VDEV.
On the mailing lists there's frequently people posting with that or more disks in a single pool (split across multiple VDEVs).
So just to clarify. A ZFS pool stores data in VDEVs. A single VDEV can be:
- a single disk (no redundancy, but same error checking)
- N disks in a N-way mirror
- N disks in a RAID-Z (single disk parity, min 3 disks)
- RAID-Z2 with 2 disk parity
- RAID-Z3 with 3 disk parity (much slower than the other two RAID-Z's due to complex math)
You can always _add_ VDEVs to a pool, and ZFS will start using it. It will try to be clever about it, so for example if existing VDEVs are quite full, it will redirect most of the writes to the new VDEV.
What you cannot do (yet) is _remove_ a VDEV.
It's also possible to replace _all_ the disks in a single VDEV to larger disks in turn (important!), and once the final disk is replaced the VDEV will suddenly appear to have the added capacity.
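Adding a VDEV is a single command (pool and devices are placeholders):

    # grow the pool with a second raidz2 vdev alongside the first
    zpool add tank raidz2 /dev/sde /dev/sdf /dev/sdg /dev/sdh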
Careful though, depending on how you're setting up the VDEVs this could introduce fragility to the pool. For example, if you have a raidz2 vdev and then add a single-disk vdev, that single disk going down would make the entire pool unavailable.
Only way to recover from this would be to either add another disk turning the single-disk VDEV to a mirror, or send/receive the entire pool.
What are the drawbacks of this approach and why can't I just add additional drives to an existing VDEV?
However for the RAID-Z variants it's more difficult, because of how the parity calculations are done and how the parity is stored. If you have a 5 drive RAID-Z, you can store up to 4 blocks of data for each parity block. With a 6 drive RAID-Z you can store up to 5 blocks of data for each parity block.
So if you add a drive to a RAID-Z vdev, the system suddenly needs to keep track of the stripe width that was used when the data was written.
In addition it's the issue of _where_ the data is stored. If your RAID-Z VDEV is mostly full and you were to add a new drive to it (not that you currently can), then you might not be able to use all of the capacity without further ado.
This is because you can only store one block per stripe on the new drive for redundancy to work - a RAID-Z1 can only recover from the loss of one block per stripe. The rest of the blocks in the stripe have to be distributed on the other drives.
However, there is work in progress to support RAID-Z expansion.
That said, RAID-Z expansion is for expanding one drive at a time. If you're adding 20 disks in one go, do it as a new VDEV.
The ZFS devs don't like to push unstable features, so I expect we're still quite a way off.
If you want to span a pool across multiple regions, you'll want a distributed file system on top of your ZFS to manage that. Something like Lustre or Ceph. It would still be very challenging, though.
We (rsync.net) have scaled zpools to petabyte range.
A current, example configuration would be:
- 60-drive JBODs
- 15 drive raidz3 vdevs, four per JBOD
- 16TB SAS drives
That ends up being ~192TB per vdev, 768 TB per JBOD ... and if you span a pool across two JBODs, you have ~1.5 PB.
I should note that what makes it possible to sleep at night with such a configuration is the fact that raidz3 exists. If not for that, I would not configure 15 drive vdevs with just raidz2 ("raid6") protection.
rsync.net architecture is purely FreeBSD and there was just a very good fit - and roadmap - from UFS2, which we used from 2001-2012, to ZFS, which we have used since.
Also, for better or worse, rsync.net is about UNIX and ZFS is very unixy. We think in filesystems and files and directories and ZFS lets us keep that set of abstractions.