Then I found out it fragments badly, and nobody can figure out how to write a defragmenter. So, uh, keep the FS below 60-80% full apparently.
Confirmed. Not FUD.
Our experience is that things go to hell around 90%, and even if you bring it back below 90% there is a permanent performance degradation to the pool. To be safe, we try to keep things below 80%. That's probably a bit conservative, though.
ZFS needs defrag. It is not reasonable to give up 3 drives' worth of capacity for parity (raidz3, for instance) and then, on top of that, set aside another 10-20% as the "angels' share".
It doesn't fragment; it actually turns all random writes into sequential ones, provided there is enough space, because ZFS uses copy-on-write atomic writes:
Now, for those of us in the Solaris / illumos / SmartOS world, this is well known and well understood. We either keep 20% free in the pool or we turn off the defrag search algorithm. But now, with the Linux crowd missing out on 11 years of experience, I expect there will be lots of misunderstanding of what is actually going on and, consequently, lots of misinformation, which is unfortunate.
* I'm not sure how experienced, but they have Sun hardware running that's older than ZFS.
Anyway, it would be nice if you could provide actual numbers and metaslab statistics from zdb. The worst-case fragmentation that has been reported, and that I can confirm from data provided to me, is a factor-of-2 reduction in sequential read bandwidth on a pool of spinning disks after it had reached ~90% capacity. All files on it had been created out of sequence by BitTorrent.
A factor of 2 might be horrible to some people, though I can certainly imagine a filesystem performing many times worse. I would be interested to hear from someone who managed to do worse than that in a manner that cannot be ascribed to the best-fit allocator protecting the pool from gang block formation.
1) Having a ZIL helps with this, and in general.
2) ZFS changes strategy depending on how full it is, it spends more time avoiding further fragmentation rather than grabbing the first empty slot. This hit would go away if you get the free space back up.
3) Finally, there is a way to have ZFS keep all the info it needs in RAM to greatly alleviate the times when it starts hunting harder to prevent more fragmentation. It looks like the RAM requirements are 32GB/1PB... so not too bad IMO.
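For the curious, here is a hedged sketch of what 3) can look like on ZFS on Linux (assuming the metaslab_debug_load and metaslab_debug_unload module parameters; names and availability may vary by version and platform):
$ # load all metaslab space maps at pool import...
$ echo 1 | sudo tee /sys/module/zfs/parameters/metaslab_debug_load
$ # ...and keep them from being evicted afterward
$ echo 1 | sudo tee /sys/module/zfs/parameters/metaslab_debug_unload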
Look, I'll admit that we haven't done a lot of scientific comparisons between healthy pools and presumed-wrecked-but-back-below-80-percent pools ... but I know what I saw.
I think if you break the 90% barrier and either: a) get back below it quickly, or b) don't do much on the filesystem while it's above 90%, you'll probably be just fine once you get back below 90%. However, if you've got a busy, busy, churning filesystem, and you grow above 90% and keep on churning while above it, your performance problems will continue once you go back below, presuming the workload stays constant.
Which makes sense ... and, anecdotally, it is the same behavior we saw with UFS2 when we tunefs'd minfree down to 0% and ran on that for a while ... freeing up space and setting minfree back to 5-6% didn't make things go back to normal ...
I am receptive to the idea that a ZIL solves this. I don't know if it does or not.
That being said, since rsync.net makes heavy use of snapshots, the snapshots would naturally keep the allocations in metaslabs toward the front of the disks pinned. That would make it a pain to get the metaslabs back below the 96% threshold. If you are okay with diminished bandwidth when the pool is empty (assuming spinning disks are used), turn off LBA weighting and the problem should become more manageable.
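For what it's worth, on ZFS on Linux turning off LBA weighting is presumably just a module parameter flip (assuming metaslab_lba_weighting_enabled is the right knob on your version; other platforms spell it differently):
$ echo 0 | sudo tee /sys/module/zfs/parameters/metaslab_lba_weighting_enabled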
That said, getting data on the metaslabs from `zdb -mmm tank` would be helpful in diagnosing this.
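For example, something along these lines (pool name "tank" assumed; the pool-wide fragmentation property requires the spacemap_histogram feature flag):
$ zpool list -o name,capacity,fragmentation tank
$ zdb -mmm tank > /tmp/tank-metaslabs.txt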
What I believe, and what I think others have also concluded, is that it shouldn't be fatal. That is, when the dust has settled and you trim down usage and have a decent maintenance outage, you should be able to defrag the filesystem and get back to normal.
That's not possible with ZFS because there is no defrag utility ... and I have had it explained to me in other HN threads (although not convincingly) that it might not be possible to build a proper defrag utility.
Doing so requires that you have enough zfs filesystems in your pool (or enough independent pools) that you have the free space to temporarily have two copies of the filesystem.
Yes, and that is why I did not mention recreating the pool as a solution. If your pool is big enough or expensive enough, that's still "fatal".
(2) If your pool is big enough/expensive enough, surely you've also budgeted for backups.
(2) This has nothing to do with backups or data security in any way - it's about data availability (given a specific performance requirement).
You're not going to restore your backups to an unusable pool - you're going to build or buy a new pool and that's not something people expect to have to do just because they hit 90% and churned on it for a while.
I agree it's not ideal to have filesystems do this, but it also simplifies a lot of engineering. And I think direct user exposure to a filesystem with a POSIX-like interface is a paradigm mostly on the way out anyway, meaning it's increasingly feasible to design systems to not exceed a safe utilization threshold.
`tar` and `zfs send | zfs recv`.
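For anyone who hasn't done the latter, a minimal sketch (pool names hypothetical; -R replicates all descendant datasets and their snapshots):
$ zfs snapshot -r oldpool@migrate
$ zfs send -R oldpool@migrate | zfs recv -F newpool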
So it'd be nice if there were something like a "remove/hide" feature for containers, separate from delete or clean up, with a command that makes it easier to select many containers for deletion. At least on Btrfs this should be quite fast, and the background cleaner process that does the actual work of updating the ref counts and freeing extents should run at a priority that doesn't unduly impact other processes.
Some of this behavior may change on Btrfs as the free space tracking has been recently rewritten. Right now the default is the original space cache implementation, while the new free space b-tree implementation is a mount time option intended only for testing.
Anyone using ZFS in a serious capacity would have both dedicated ARC and ZIL.
That said, all file systems degrade in performance as they fill. I do not think there is anything notable about how ZFS degrades. The most that I have heard happen is a factor of 2 sequential read performance decrease on a system where all files were written by bit torrent and the pool had reached 90% full. That used mechanical disks. A factor of 2 in a nightmare scenario is not that terrible.
Ignoring logbias=throughput, when you have a slog you save on writing intents for small synchronous writes into the ordinary vdevs in the pool. If you do a lot of little synchronous writes, you can save a lot of IOPS writing their intents to the log vdev instead of the other vdevs. Log vdevs are write-only except at import (and at the end phases of scrubs and exports).
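For reference, a sketch of adding one (pool and device names hypothetical; mirrored, since losing an unmirrored slog with outstanding intents is painful):
$ zpool add tank log mirror /dev/nvme0n1 /dev/nvme1n1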
Here's the killer thing on an IOPS-constrained pool not dominated by large numbers of small synchronous writes: the reads get in the way of the writes. ZFS is so good at aggregating writes that, unless you are doing lots of small synchronous random writes, the write IOPS tend to vanish.
Reads are also dealt with very well, especially if they are either prefetchable or cacheable. Random small reads are what kill ZFS performance.
Unfortunately, systems dominated by lots of rsync or git or other filesystem walks tend to produce large numbers of essentially random small reads (in particular for all the ZFS metadata at various layers, needed to reach the "metadata" one thinks of at the POSIX layer). This is readily seen with Brendan Gregg's various dtrace tools for ZFS.
The answer is, firstly, an ARC that is allowed to grow large, and secondly, high-IOPS cache vdevs (L2ARC). L2 hit rates tend to be low compared to ARC hits, but every L2 hit is approximately one less seek on the regular vdevs, and seeks are ZFS's true performance killer.
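Adding a cache vdev is a one-liner (pool and device names hypothetical); unlike log vdevs it needs no redundancy, since a dead l2arc device just means reads fall back to the pool:
$ zpool add tank cache /dev/nvme2n1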
Persistent L2ARC is amazing, but has been languishing at https://reviews.csiden.org/r/267/
It has several virtues that are quickly obvious in production. Firstly, you get bursts of l2arc hits near import time, and if you have frequently traversed ZFS metadata (which is likely if you have containers of some sort running on the pool shortly after import) the performance improvement is obvious. Secondly, you get better data safety; l2arc corruption, although rare in the real world, can really ruin your day, and the checksumming in persistent l2arc is much more sound. Thirdly, it can take a very long time for large l2arcs to become hot, which makes system downtime (or pool import/export) more traumatic without persistence. Fourthly, rebuilds of full ~128GiB l2arc vdevs take a couple of seconds or so on all realistic devices; even USB3 thumb drives (e.g. Patriot Supersonic or Hyper-X DataTravellers, both of which I've used on busy pools) are fast and give an IOPS uptick early after a reboot or import, and of course you can have several of those on a pool; "real" SSDs give greater IOPS still. Fifthly, the persistent l2arc being available at import time means that early writes are not stuck waiting for ZFS metadata to be read in from the ordinary vdevs; that data again is mostly randomly placed LBA-wise, and small, so there will be many seeks compared to the amount of data needed. Persistent l2arc is a huge win here, especially if for some reason you insist on having datasets or zvols that require DDT lookups (small synchronous high-priority reads if not in ARC or L2ARC!) at write time.
Maybe you could consider integrating it into ZoL since you guys have been busy exploring new features lately.
Finally, if you are doing bittorrent or some other system which produces temp files that are scattered somewhat randomly, there are two things you can do which will help: firstly, recordsize=1M (really; it's great for reducing write IOPS and subsequent read IOPS, and reduces pressure on the metadata in ARC), and secondly, particularly if your receives take a long time (i.e., many txgs), tell your bittorrent client to move the file to a different dataset when the file has been fully received and checked -- that will almost certainly coalesce scattered records.
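A sketch of that setup (dataset names hypothetical):
$ zfs create -o recordsize=1M tank/torrents/incoming
$ zfs create -o recordsize=1M tank/torrents/done
$ # point the client's download dir at incoming and its completed dir at done;
$ # the cross-dataset move rewrites the file, coalescing the scattered records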
As for persistent L2ARC, it was developed for Illumos and will be ported after Illumos adopts a final version of it.
Further research showed that the ZIL is only needed for synchronous writes, which my workload didn't have any of.
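If you would rather measure than guess, ZFS on Linux exposes ZIL counters (assuming the zil kstat is present on your version; commit counters that stay at zero under load suggest no synchronous writes):
$ grep zil_commit /proc/spl/kstat/zfs/zil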
Mmap'ing into ARC would probably come after that as it would make mapping easier.
Why? If there's a legitimate reason, please expand.
It's been a while since I looked at using ZFS for anything meaningful, but at the time (~6 years ago), while losing L2ARC was no big deal, losing dedicated ZIL was catastrophic. I think that's still true today.
So you need at least two ZIL devices in a mirror. On top of that, you really need something faster and lower latency for your ZIL vs. the ARC or main pool; people were trying to use SSDs but most commonly-available drives at the time would either degrade or fail in a hurry under load. So the options were RAM-based, e.g. STEC ZeusRAM on the high end, or some sort of PCI-X/PCIe RAM device. The former was not easy or cheap to acquire for testing stuff, and the latter made failover configs impossible.
I think that the ZIL is also not soaking up all writes, just writes meeting certain criteria. Some just stream through to the pool. So I always thought of it as a protection device that also converted random writes to sequential ones. Some people don't think they need that.
I remember the fragmentation issue being a problem at the time, but also thinking it was probably going to get solved soon because there was so much interest and a whole company behind it. Then Oracle happened. My guess is that if it were still Sun and all the key people were still there, this would be a solved problem right now. As it is, Oracle probably wants you to buy all the extra storage anyway, and would love to offer professional services to get you out of the fragmentation bind you're in.
Here is some info on ZIL vs SLOG:
You can test these things for yourself.
I contend that most people using ZFS in a serious capacity do not have a dedicated ZIL.
One of the goals of such systems is that the copy of the file system on disk is always consistent: turn the power off at any point and you can come right back up with a valid file system. This is accomplished by only writing to blocks on the 'free block list'. You construct updated inodes, from the changed file all the way up to the root inode, out of new blocks, and then to "step" forward you write a new root block. This is really neat, and it means that once you've taken that step you still have the old inodes and data blocks around; they just aren't linked. But you can link them to another "holder" inode attached to the name ".snapshot" and it will show you the file system just before the change. Write the old root block back into the real root block and "poof!" you have reverted the file system to the previous snapshot.
Ok, so that is pretty sweet and really awesome in a lot of ways, but it has a couple of problems. The first, as noted, is that it pretty much guarantees fragmentation, as it's always reaching for free blocks and they can be anywhere. On NetApp boxes of old, that wasn't too big a deal because everything was done "per RAID stripe", so you were fragmented, but you were also reading/writing full stripes in RAID, so you had the bandwidth you needed and fragmentation was absorbed by the efficiencies of full-stripe reads/writes. But the second issue arises when you start getting close to full: managing the free block list gets harder and harder. You are constantly under low-block pressure, so you are constantly trying to reclaim old blocks (on unused or expired snapshots), and that leads to a big drop in performance. The math is that you can't change more of the data between snapshot steps than the amount of space you have free. That is why NetApp filers would get cranky in build environments where automated builds would delete scads of intermediate files, only to rebuild them and then relink them: a big percentage change in the overall storage.
On the positive side, storage is pretty darn cheap these days, so swapping in 3TB drives instead of 2TB drives means you could use all the storage you "planned" to use and keep the drives at 66% occupancy. Hard on users, though, who will yell at you: "It says it has 10TB of storage and is only using 6TB but you won't expand my quota?" At such times it would be useful for the tools to lie, but that doesn't happen.
 Disclosure 1, I worked for 5 years at NetApp with systems that worked this way. Disclosure 2, an intern with NetApp (we'll call him Matt) was very impressed with this and went on to work at Sun for Jeff and similar solutions appeared in ZFS.
Goal yes, implementation no. WAFL does in fact have consistency problems, and filers do ship with a consistency checker called "wack"; if you ever need that tool, you'll probably have better luck throwing the filer in the trash and restoring from backups than waiting a month for it to complete.
If you'd like to see some information on a feature that's been added while working around that particular constraint (or, rather, mitigating the impact of it), check out .
 - http://open-zfs.org/w/images/b/b4/Device_Removal-Alex_Reece_...
The benefits in terms of defragmentation are also limited because ZFS does a fair job of resisting fragmentation related performance penalties. The most that I would expect to see on a pool where poor performance is not caused by the best fit allocator would be a factor of two on sequential reads.
The zpool command line utility tries hard to help you not shoot yourself in the foot, but "zpool add -f pool diskname" sometimes happens when "zpool add -f pool cache diskname" was meant. Everyone's done it once. Think of a system melting down because the l2arc has died: you're trying to replace it in a hurry, and you fat-finger the attempt to get rid of the "-n" and end up getting rid of "log" instead.
Without this device removal, that essentially dooms your pool -- there is no way to back out, and the best you can do is throw hardware at the pool: attach another device fast to mirror the single-device vdev, then try to grow the vdev to something temporarily useful, where "temporarily" almost always means "as long as it takes to get everything properly backed up", with the goal being the destruction and re-creation of the pool (plus restoration from backups).
With this device removal, you do not have to destroy your pool; you have simply leaked a small amount of space (possibly permanently) and will carry a seek penalty on some blocks (possibly permanently, but that's rarer) that get written to that vdev before the replacement.
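To make the scenario concrete, a sketch (pool and device names hypothetical, and assuming the feature lands as a zpool remove of a top-level vdev):
$ zpool add -f tank /dev/sdx    # oops: meant "zpool add -f tank cache /dev/sdx"
$ zpool remove tank /dev/sdx    # the escape hatch: remap the vdev's blocks elsewhere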
As noted further in the slide deck (and in Alex's blog entries), this only works for single device vdevs -- you cannot remove anything else, like a raidz vdev, and you have to detach devices from mirror vdevs before removal.
Also, note the overheads: although you can remove a single-device vdev with a large amount of data on it, doing so is a wrecking ball to resources, particularly memory. You won't want to do something like:
disk0 2tb-used 3tb-disk-size
disk1 2tb-used 3tb-disk-size
disk2 2tb-used 3tb-disk-size
disk3 2tb-used 3tb-disk-size
do an expand dance, so you have
disk0 2tb-used 6tb-disk-size
disk1 2tb-used 6tb-disk-size
disk2 2tb-used 3tb-disk-size
disk3 2tb-used 3tb-disk-size
then detach disk3, then device-removal remove disk2, except in extremely special circumstances, and only where you are well aware of the time it will take, of the danger to the unsafe data in the pool during the removal (i.e., everything in former mirror-1), of the fact that your pool will be trashed beyond hope in the presence of crashes or errors during the removal, and of the permanent, expensive overhead the pool will carry after the removal is done.
It would almost certainly be much faster and vastly safer to make a new pool with the 6tb disks and zfs send data from the old one to the new one.
Certainly, it would be much less exciting to send|recv from poolA to poolB, and it would require no code changes and no GB-per-TB-of-data indirection overhead.
But this was intended as an example of how many caveats and problems are involved in even a "simple" feature involving shuffling data on-disk, and thus, why "defrag" is a horrendously hard problem in this environment.
I have yet to see evidence that 60%-80% causes issues unless the system is so overloaded that a small performance drop is noticeable. On spinning disks, such a thing is only natural because there is not much space left on the faster outer tracks of the platters.
That said, older versions years ago would enter best fit behavior at 80%, which is where the 80% talk originated.
Not sure if your container tool supports btrfs snapshots of course, but it's conceptually simple, right?
RH/Fedora are very centered on dm/LVM thin-provisioning snapshots with XFS for backing their containers. I think what you're seeing is distros doing something different with their container backing approaches in order to differentiate from other distros. Maybe it's a stab in the dark or a spaghetti-on-the-wall approach, but in the end all of these storage backends are going to mature a lot in the interim, so ultimately it'll be good for everyone.
ZFS has been protecting data in enterprise production environments since 2006 (Solaris 10 update 2).
ZFS is an excellent filesystem, btrfs is an excellent filesystem. There is room for both excellent options for users.
Anything with copy-on-write is going to fragment.
are you still running on spinning rust?
I've not seen any real performance hits until 90% full, but then any file system with large images suffers at that point.
Also, removing files uses up an insane amount of CPU and can block all FS operations if you get into the 80%+ full situation.
I'll also note that the interplay between the ARC and Linux MM is... "interesting".
This means that no binary kernel modules are shipped, just the cli tools.
I am obviously glad that this happened, but afraid of an Oraclocalypse.
The CDDL doesn't have any problems being linked into GPL'ed code, it is the GPL that has a problem accepting CDDL code.
The problem would go away if you convinced Oracle to change the license or give explicit permission to link ZFS with the Linux kernel under the GPL conditions.
Or if you convinced the Linux kernel team to change their license or give explicit permission to link the Linux kernel with ZFS under the CDDL conditions.
Or you can blame "GPL", because it obviously was created after CDDL...
It's the other way around.
That's a strange way to say relicensing CDDL code under GPL
> A license incompatibility happens when two licenses each have conditions which are incompatible with each other,
I think it's a bit of a stretch to say that both licenses are to blame when one license has such a far-reaching condition that other code needs to be relicensable under that license.
Which license requires that other code be relicensable under its license? "This license" really means this license. Modified code is required to be under, and I will repeat myself here, this license. If source code is not under this license, then it is infringing on this license's conditions, and this license will no longer permit redistribution.
This license has a condition, and this license condition makes it incompatible with every other license that has an identical condition.
"The appeals court reversed the district court on the central issue, holding that the "structure, sequence and organization" of an API was copyrightable."
Here's another post about GPL violations related to combining ZFS and Linux:
Quote from that:
"Is The Analysis Different With Source-Only Distribution?
We cannot close discussion without considering one final unique aspect to this situation. CDDLv1 does allow for free redistribution of ZFS source code. We can also therefore consider the requirements when distributing Linux and ZFS in source code form only.
Pure distribution of source with no binaries is undeniably different. When distributing source code and no binaries, requirements in those sections of GPLv2 and CDDLv1 that cover modification and/or binary (or “Executable”, as CDDLv1 calls it) distribution do not activate. Therefore, the analysis is simpler, and we find no specific clause in either license that prohibits source-only redistribution of Linux and ZFS, even on the same distribution media.
Nevertheless, there may be arguments for contributory and/or indirect copyright infringement in many jurisdictions. We present no specific analysis ourselves on the efficacy of a contributory infringement claim regarding source-only distributions of ZFS and Linux. However, in our GPL litigation experience, we have noticed that judges are savvy at sniffing out attempts to circumvent legal requirements, and they are skeptical about attempts to exploit loopholes. Furthermore, we cannot predict Oracle's view — given its past willingness to enforce copyleft licenses, and Oracle's recent attempts to adjudicate the limits of copyright in Court. Downstream users should consider carefully before engaging in even source-only distribution.
We note that Debian's decision to place source-only ZFS in a relegated area of their archive called contrib, is an innovative solution. Debian fortunately had a long-standing policy that contrib was specifically designed for source code that, while licensed under an acceptable license for Debian's Free Software Guidelines, also has a default use that can cause licensing problems for downstream Debian users. Therefore, Debian communicates clearly to their users that this code is problematic by keeping it out of their main archive. Furthermore, Debian does not distribute any binary form of zfs.ko.
(Full disclosure: Conservancy has a services agreement with Debian in which Conservancy occasionally gives its opinions, in a non-legal capacity, to Debian on topics of Free Software licensing, and gave Debian advice on this matter under that agreement. Conservancy is not Debian's legal counsel.)"
They would only bite a golden hand anyway.
> We are also concerned that it may infringe Oracle's copyrights in ZFS.
The Software Freedom Conservancy saying that is a bit scary. I am somewhat less afraid of Linux copyright holders suing.
In the United States there are two kinds of indirect infringement: contributory infringement and vicarious infringement.
Contributory infringement can occur when you know that someone else is or will directly infringe, and you substantially aid that by inducing, causing, or materially aiding their direct infringement. That can include providing the tools and equipment they use to infringe.
Vicarious infringement can occur when someone who is a direct infringer is your agent or under your control.
A very important aspect of both of these types of indirect infringement is that they make you liable for the direct infringement of someone else. If there is no someone else who is a direct infringer, then you cannot possibly be a contributory or vicarious infringer.
Why would Free/open-source distros have to distribute ZFS source code? Couldn't they simply provide a method for downloading the source from the already existing ZFS repos and then compile the source? Wouldn't that be enough?
$ sudo apt install zfs-dkms zfsutils-linux
From an operational perspective, this is insane, I need reliability. Of course in my organisation I could create a binary package and use that, but that's more work and then the new Debian package doesn't help me anyway.
When I need linux I just run ZFS on a better supported system and either virtualise Linux or expose an iSCSI target from ZFS for Linux.
IANAL so I have no clue what is technically correct, but the fact is that Ubuntu is distributing zfs.ko in 16.04.
Yes, and that is called system engineering. It is the job of a system engineering department (3rd level support) to deliver such components and stable operating system builds to the 2nd level support (operations, system administrators / database administrators).
Technical solutions to legal problems don't work, just like GPL wrappers don't work (at least, that's what some lawyers say). If Oracle decides to make a stink about this, they still can.
edit: Huh, apparently last year Debian actually got advice from SFLC about this:
However it is unlikely that Fedora will ship ZFS unless the license changes or is clarified by Oracle. Unlike Canonical, Red Hat is a US company making serious amounts of money ($2bn revenue last financial year).
We have recently switched all our servers to LXC containers so that we can take full advantage of btrfs features.
I doubt I need to explain the advantages of containers to anyone here... but in short we've broken out all the network services (file service, LDAP, DNS, etc.) to separate containers.
Each container is in a separate btrfs subvolume. This allows us to take snapshots of the running systems every 10 minutes, and using btrfs send/receive, cheaply back up those snapshots to alternate container hosts. The send/receive stuff works better with the btrfs v4.4 tools that ship with Ubuntu 16.04.
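For anyone curious, a minimal sketch of that flow (paths and host hypothetical; send requires a read-only snapshot, and later runs can pass -p <parent-snapshot> for incremental sends):
$ btrfs subvolume snapshot -r /containers/web /containers/.snaps/web-1030
$ btrfs send /containers/.snaps/web-1030 | ssh backup-host btrfs receive /containers/.snaps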
Since the network interfaces for all the containers are bridged with the container host, we can configure each container with its own static IP address. So if a container host fails, those containers can be booted up on the alternate host, and keep their IP and MAC addresses. So that's convenient, and causes minimal disruption.
The main improvement I'd like to see with btrfs is a configurable RAID redundancy level. Currently, RAID-1 means that there are two copies of each piece of data / metadata. So in a 3-drive RAID-1 system that gives you extra capacity, but two drives failing at the same time will cause data loss.
Right now a gotcha with Docker + Btrfs is that SELinux contexts for each container can be different, but the context= mount option is currently applied once per superblock (thus per fs volume, rather than per fs tree or subvolume). So Docker's workaround in 1.10.x is to take a snapshot, relabel it with the new SELinux context, and then start the container. For my containers (very basic) this adds an almost imperceptible one-time delay for that container. I doubt it's even 1 second, which in container start times might seem massive to some.
Containers are long-lived and mutable. I treat them like old school servers, just not tied to physical hardware.
It's great to see ZFS on Linux get a more stable footing. It's an excellent filesystem. As others have said I think the use case differs slightly from btrfs (though they are very similar in capabilities).
ZFS, to my eyes, seems more resilient. It has more levels of data checksums, the RAIDZ model allows for more redundancy, and it just feels like a stronger enterprise offering (meaning stable and built for large systems and disk quantities).
btrfs brings many of the ZFS features to Linux in a GPL wrapping. What it lacks in resiliency, it makes up for with flexibility. Raid in btrfs, for instance, occurs within data chunks across disks, not at the disk level, meaning mixed disk capacities, and on the fly raid changes. I also appreciate the way it divides namespace across subvolumes while maintaining block awareness within the pool (cp --reflink across subvolumes, snapshots across subvolumes). It also doesn't have the ram requirements of ZFS (which aren't much of a data center concern, but are definitely a client level concern for workstations).
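To illustrate the reflink point, a cross-subvolume clone that shares extents instead of copying data (paths hypothetical):
$ cp --reflink=always /pool/subvolA/huge.img /pool/subvolB/huge.img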
Either way it's a win; both are great filesystems for Linux. With bcache supporting btrfs properly now, I personally don't have much of a reason for ZFS anymore. Two years ago I would have jumped to it easily. Your workloads and needs may differ; it's great to have choices!
among others: the ability to resize/change raid layout in place, reflink copies, on-demand deduplication.
The problem with failures during rebuilds is overblown
I've got roughly 30 arrays in production, between 4 and 12 disks in each. All are RAID5 + hotspare. If you believe the maths people keep quoting, the odds of seeing a total failure in a given year are close to 100%. I started using this configuration, across varying hardware, over 15 years ago, and the number of arrays has been growing since.
I'm not pretending one example proves the rule, or that it's totally safe and I would run a highly critical environment this way (before anyone comments: these environments do not meet that definition), but people have tried to show maths that there's a six-nines likelihood of failure, and I just don't for a second believe I'm that lucky.
The misconfiguration is the drive's SCT ERC timeout is greater than the kernel's SCSI command timer. So what happens on a URE is, the drive does "deep recovery" if it's a consumer drive, and keeps trying to recover that bad sector well beyond the default command timer of the kernel, which is 30 seconds. At 30 seconds the kernel assumes something's wrong and does a link reset. On SATA drives this obliterates the command queue and any other state in the drive. The drive doesn't report a read error, doesn't report what sector had the problem, and so RAID can't do its job and fix the problem by reconstructing the missing data from parity and writing the data back to that bad sector.
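A sketch of checking and aligning the two timeouts (device name hypothetical; scterc values are in units of 100ms):
$ smartctl -l scterc /dev/sda                  # show the drive's current SCT ERC setting
$ smartctl -l scterc,70,70 /dev/sda            # cap read/write recovery at 7 seconds
$ echo 180 | sudo tee /sys/block/sda/device/timeout    # or raise the kernel's command timer instead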
So it's inevitable these bad sectors pop up here and there, and then if there's a single drive failure, in effect you get one or more full stripes with two or more missing strips, and now those whole stripes are lost just as if it were a 2-disk failure. It is possible to recover from this but it's really tedious and as far as I know there are no user space tools to make such recovery easy.
I wouldn't be surprised if lots of NAS's using Linux were configured this way, and the user didn't use recommended drives because, FU vendor those drives are expensive, etc.
(My personal favorite was when I discovered a certain model of "consumer" drives we had thousands of in production claimed to not support SCT ERC configuration, but if you patched smartctl to ignore the response to "do you support this", the drives would happily configure and honor it.)
So yeah, I agree, the whole idea in some circles that you will get a URE every ~12TB of data read is obviously b.s. We don't know what the real-world rate is because of that little less-than sign that appears in all of these specs. We only know there won't be more errors than that, and not for a specific drive, but rather across a (virtual) sample size for that make/model of drive.
No degraded state ever happens. This way Gluster knows not to even make requests from that brick. If the brick were raid56 and the cluster fs isn't aware, requests happen with degraded read/writes which suck performance wise.
Plus Gluster (or even Ceph for that matter) might use some logic and say, well that data doesn't even need to be replicated on that brick, it's better for the network/use loading if the new copy is on this other brick over here.
I guess in your case you have a more stable raid5/6 opportunity, but you're losing many of the raid benefits present in btrfs natively. I'd also imagine it could be slower or introduce IO issues others haven't tested. Though I really have no idea; I've never seen anyone do that before.
We've been running btrfs in production since Ubuntu 14.04 with excellent results. The feature set vastly outweighs the few risks that remain.
See this for the ZFS example: http://serverfault.com/a/556892/79238
In both cases the exact amount of free space you desire is a mix of workload, fragmentation, and snapshots.
However, I disagree that they waste a lot of space. I think both ZFS and btrfs more than make up for the overhead of free space commitments through their space-saving features. cp --reflink, block suballocation, compression, and efficient snapshots outweigh the overhead, in my case. Your mileage may vary.
As far as I know, there is no way to implement online deduplication with constant RAM usage without performing poorly as things scale or playing Schrödinger's cat with whether data that should deduplicate is actually subject to deduplication. Offline data deduplication might work, but implementing it would mean crippling ZFS's data integrity guarantees.
If HAMMER has online data deduplication that is performant with constant RAM, they likely made a sacrifice elsewhere to get it. My guess is that it misses cases, such that while you would expect unique records to be written only once, they can be written multiple times.
Have a look;
Reading more for my own interest, it seems ZFS uses the ZIL to help convert random writes to sequential ones, which helps with fragmentation when space is low. I am curious whether bcache can operate similarly. In addition, I should also point out that this is less of an issue on SSDs, due to the nature of how random reads/writes work there anyway (btrfs does a good job of being SSD-aware).
ENOSPC is a very different condition from lower performance. If you tested filesystems at 90% full, you should find that all of them have lower performance than when they were empty. You might also find performance varies based on how you filled them.
Hadoop is a completely different workload, and maybe not something for ZFS or btrfs. Our Hadoop nodes are not raid, just JBOD ext4 disks. We have been considering btrfs with the nodatacow mount option and lzo compression, however. We haven't decided if it's better to compress within Hadoop or at the fs layer yet. I would be curious about your findings.
Also, nodatacow is a hack. If you take a snapshot on btrfs with nodatacow, it must use CoW for each thing in the snapshot that is overwritten, and until that is done, whatever horrible performance nodatacow prevented will manifest temporarily. ZFS is designed to make things as asynchronous as possible (with the exception of partial record writes to uncached records, which needs to be fixed), so it lacks an equivalent to nodatacow and does not need one.
To the point of lzo performance though, btrfs is a tad smart with compression: it tries to compress an initial 128KiB, and if the compressed segment is not smaller than the uncompressed one, it adds the file to a list of no-compress files and will not try compression on that file again (unless of course you force it).
This was for our Hadoop use case, comparing to ext4, so nodatacow would work because we have no desire for snapshots in that environment. It still seems like we're better off compressing within the Hadoop framework (as jamesblonde is doing) and sticking with ext4 JBOD, for now at least.
There are links there to mailing list emails explaining the reasoning behind that. The reasoning behind ZFS adopting LZ4 can be found here:
Contrary to what the btrfs developers claimed about LZ4 versus LZJB in ZFS, LZ4's compression performance on incompressible data alone would have been enough to adopt it, even had ZFS already had LZO support. LZ4 also has the benefit of extremely quick decompression. It also has the peculiar property that running LZ4 repeatedly on low-entropy files outperforms "superior" compression algorithms such as gzip. Someone on the LZ4 mailing list discovered this when compressing log files: running it twice on a 3.5GB log file yielded a 9.5MB file with regular LZ4 compression and a 2MB file with LZ4HC compression, and he was able to compress it to ~750KB after running LZ4HC roughly 5 times.
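That curiosity is easy to reproduce with the lz4 command line tool (file name hypothetical; -9 selects the LZ4HC mode):
$ lz4 -9 huge.log huge.log.1
$ lz4 -9 huge.log.1 huge.log.2    # a low-entropy input often shrinks again on the second pass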
As for btrfs being smart with lzo by compressing only the first 128KB as a heuristic, LZ4 uses a hash table for that and is able to give up much faster. I would expect to see LZ4 significantly outperform LZO. The following site has numbers that appear to confirm that:
On a JPEG on their Intel® Core i7-2630QM, LZ4 level 1 compression runs at 608.24MB/s while LZO level 3 compression runs at 68.13MB/s. Also of possible interest is that Snappy compresses at 559MB/s here. There are a couple of caveats, though. While I picked the correct variant of LZ4 for ZFS (and also the Linux kernel), I assumed that btrfs uses the default compression level for LZO, like ZFS does for LZ4, and I knew nothing about the different revisions measured, so I took the best reported number for any of them at the default compression level. That happened to be lzo1b. I also assumed that JPEGs are incompressible, which the data strongly supports; there are several exceptions, but LZO, Snappy and LZ4 all consider the file to be incompressible.
As for the question of whether to compress in hadoop or in the filesystem, the properties of LZ4 mean that you can do both. If your data is incompressible, LZ4 will give up very quickly (both times). If your data is incompressible after 1 round of LZ4 compression, then the LZ4 compression in ZFS will give up quickly. If it is compressible by two rounds of LZ4, then both will run and you will use less storage space because of it.
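Enabling it on the ZFS side is a one-liner (dataset name hypothetical; requires the lz4_compress feature flag):
$ zfs set compression=lz4 tank/hadoop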
As for fragmentation, there are two kinds: fragmentation of files into more than one extent, which is fixed with 'btrfs filesystem defrag' and also autodefrag; and fragmentation of free space as a result of deleting files, which is fixed with 'btrfs balance', which consolidates extents, writes them into new chunks, and then frees up large regions of contiguous space on the drives. This is best used with filters.
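Concretely, something like this (mount point hypothetical):
$ btrfs filesystem defragment -r /data     # fix file-extent fragmentation
$ btrfs balance start -dusage=50 /data     # rewrite data chunks at most 50% full, freeing contiguous space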
I do think these methods are increasingly approachable for all users, however. A lot of that is actually enabled by the feature sets of ZFS and btrfs. The default Ubuntu installer, for example, will create snapshots during OS upgrade for rollback if the upgrade fails. The ZFS and btrfs send/receive feature allows for efficient DR clones (not to mention seed images and snapshots). LXD leverages ZFS, if you choose, for rapid containerization and snapshots.
These intrinsic abilities of these two filesystems allow for smaller users and organizations to improve those workflows to be more risk averse in general (even while assuming some risk in a newer FS).
Just, if possible, be up to date on the known issues and run more recent kernels and userland utilities.
In development since 2007, stable since 2014. ZFS: in development since 2001 (correction: I erroneously listed 2005 earlier), stable since 2006 (or at least included in Solaris since then). Do you count the Solaris years or just the Linux years? And then do you count the Debian/Ubuntu-sanctioned years or the ZoL years?
I'm fine with mature since included in the default installer on Ubuntu/Redhat/Oracle/SUSE/etc... for my definition of btrfs maturity.
As you can tell I'm not that strong on ZFS other than what I've learned in some small tinkering and discussion with others.
btrfs can handle database workloads but you have to disable cow for them (which you can do at the file, directory, or subvolume level in btrfs). You would specify the nodatacow mount option, or chattr +C (file/directory).
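A sketch for a database directory (path hypothetical; note that +C only affects files created after it is set, so apply it to an empty directory):
$ mkdir -p /srv/db
$ chattr +C /srv/db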
The btrfs autodefrag is still rather new and needs some work. I expect that could be the long-term fix (manual defrag is fine now, but you wouldn't want to call it frequently on a db file). I'm not sure how ZFS handles the fragmentation, but I do know that in the past ZFS observed similar issues (they seem to have been mostly resolved). I should also point out that disabling cow doesn't really fully disable it; snapshots can still function, etc. However, I'm sure that once you start to use the other cow functions you might observe slipping performance due to fragmentation of these types of files.
There is no magical point in time where a file system is suddenly "mature". If anything I'd argue once OpenSUSE started defaulting to it one should consider it stable. I just wonder why Ubuntu / Fedora are not prioritizing making it default since snapshotted upgrades are such a huge usability gain and are trivially easy with btrfs.
I'm not necessarily trying to make a point here, just adding some related facts.
ZFS doesn't support different sized hard drives. I could also be wrong, but ZFS doesn't expand as easily, and you can't remove drives from it.
Synology NAS just added support for Btrfs in their software
It does not support different-sized drives in a raidz (equivalent to an mdadm array), as far as I know, but you can mix raidz vdevs and single drives in a pool, though that is a bit silly. It also does not support adding disks to a raidz, which would be equivalent to growing a raid5 or raid6 array; that is admittedly a pain in the ass.
% mkfile -v 64m /var/tmp/d0
/var/tmp/d0 67108864 bytes
% mkfile -v 96m /var/tmp/d1
/var/tmp/d1 100663296 bytes
% mkfile -v 128m /var/tmp/d2
/var/tmp/d2 134217728 bytes
% sudo zpool create -f pool0 raidz /var/tmp/d0 /var/tmp/d1 /var/tmp/d2
% zpool status
  pool: pool0
 state: ONLINE
  scan: none requested
config:

        NAME             STATE     READ WRITE CKSUM
        pool0            ONLINE       0     0     0
          raidz1-0       ONLINE       0     0     0
            /var/tmp/d0  ONLINE       0     0     0
            /var/tmp/d1  ONLINE       0     0     0
            /var/tmp/d2  ONLINE       0     0     0

errors: No known data errors
The smallest device capacity applies to RAID1+0, which is mirroring, but it does not apply to RAIDZ.
Of course you can remove drives in ZFS. How else could you replace disks?
I downvoted your post because every statement you made about ZFS is wrong.
Yes, you can add drives of arbitrary size to a ZFS pool, but when put into the same array, they all get treated as the size of the smallest one. Yes, you can expand a ZFS pool, but only by adding a new array or by incrementally upgrading all the drives in an existing array. Yes, you can remove drives, but only for replacement purposes; you can't do an in-place rebuild to make the reduced number of drives the new normal instead of degraded mode.
Can you have an array that contains a single disk? And if you can, then
* Can you add as many single drive arrays to a pool as you like?
* Can you migrate data off of a given single-drive array on to other single-drive arrays in the pool and then permanently remove the array from the pool?