ZFS lands in Debian contrib (debian.org)
278 points by turrini on May 14, 2016 | 184 comments

So I just started experimenting with ZFS, because it seemed required for container snapshots.

Then I found out it fragments badly, and nobody can figure out how to write a defragmenter. So, uh, keep the FS below 60-80% full apparently.


"Then I found out it fragments badly, and nobody can figure out how to write a defragmenter. So, uh, keep the FS below 60-80% full apparently."

Confirmed. Not FUD.

Our experience[1] is that things go to hell around 90% and even if you bring it below 90% there is a permanent performance degradation to the pool. To be safe, we try to keep things below 80%. That's probably a bit conservative, though.

ZFS needs defrag. It is not reasonable to give up 3 drives worth of capacity for the parity (raidz3, for instance) and then on top of that set aside another 10-20% as the "angel's share".
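To put a number on that complaint, here is a rough sketch of the arithmetic. The 12-drive vdev width is a made-up example, and the function ignores raidz padding and metadata overhead, so treat the result as a ballpark only.

```python
def usable_fraction(total_drives, parity_drives, reserve_fraction):
    """Fraction of raw capacity left after parity and a free-space reserve.

    Simplified model: parity costs whole drives, and the reserve is taken
    from what remains. Real raidz accounting is more involved.
    """
    data_fraction = (total_drives - parity_drives) / total_drives
    return data_fraction * (1.0 - reserve_fraction)


# Hypothetical 12-drive raidz3 vdev that is kept below 80% full.
print(round(usable_fraction(12, 3, 0.20), 2))  # 0.6
```

Under those assumptions only about 60% of the raw capacity ends up holding data.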

[1] rsync.net

ZFS has defragmentation built into the very design of it!

It doesn't fragment, it actually turns all random writes into sequential ones, provided there is enough space because ZFS uses copy-on-write atomic writes:



now, for those of us in the Solaris / illumos / SmartOS world, this is well known and well understood. We either keep 20% free in the pool, or we turn off the defrag search algorithm. But now with the Linux crowd missing out on 11 years of experience, I see there will be lots of misunderstanding of what is actually going on, and consequently, lots of misinformation, which is unfortunate.

Experienced* SunOS admins are aware of that and can still end up -- accidentally I think -- with ZFS filesystems with unacceptable performance in a state that Oracle apparently didn't understand. There was a ticket open for months but I don't know whether it ever got resolved.

* I'm not sure how experienced, but they have Sun hardware running that's older than ZFS.

The performance degradation is likely from full meta slabs and maybe from gang blocks, although ZFS does a fair job at preventing gang blocks by using best fit behavior to minimize the external fragmentation that necessitates them. The magic threshold for best fit behavior is 96% full at the meta slab level. This tends to be where slowdowns occur. On spinning disks, being near full also means that basically all of the outermost tracks have been used, so you are limited to the innermost tracks, which can halve bandwidth.

Anyway, it would be nice if you could provide actual numbers and meta slab statistics from zdb. The worst case fragmentation that has been reported and that I can confirm from data provided to me is a factor of 2 reduction in sequential read bandwidth on a pool consisting of spinning disk after it had reached ~90% capacity. All files on it had been created out of sequence by bit torrent.

A factor of 2 might be horrible to some people. I can certainly imagine a filesystem performing many times worse though. I would be interested to hear from someone who managed to do worse than that in a manner that cannot be ascribed to the best fit allocator protecting the pool from gang block formation.

I've seen this repeated a lot, but have not had quite the same experience with "permanent" performance degradation. Especially if I eventually expand the pool with another vdev. Not sure about ZFSonLinux, but:

1) Having a ZIL helps with this, and in general. 2) ZFS changes strategy depending on how full it is, it spends more time avoiding further fragmentation rather than grabbing the first empty slot. This hit would go away if you get the free space back up. 3) Finally, there is a way[1] to have ZFS keep all the info it needs in RAM to greatly alleviate the times when it starts hunting harder to prevent more fragmentation. It looks like the RAM requirements are 32GB/1PB... so not too bad IMO.

[1] https://blogs.oracle.com/bonwick/entry/space_maps
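As a quick sanity check on the 32GB/1PB figure, here is the scaling worked out for smaller pools. This assumes the requirement scales linearly with pool size and that 1 PB means 1024 TB; both are my assumptions, not something the linked post is quoted as saying.

```python
def space_map_ram_gb(pool_tb, gb_per_pb=32.0):
    """RAM estimate for keeping space map info in memory.

    Assumes linear scaling from the 32GB-per-1PB figure quoted above,
    with 1 PB taken as 1024 TB.
    """
    return pool_tb / 1024.0 * gb_per_pb


print(space_map_ram_gb(100))   # 3.125
print(space_map_ram_gb(1024))  # 32.0
```

So a hypothetical 100 TB pool would need roughly 3 GB under this model.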

"I've seen this repeated a lot, but have not had quite the same experience with "permanent" performance degradation. Especially if I eventually expand the pool with another vdev. Not sure about ZFSonLinux"

Look, I'll admit that we haven't done a lot of scientific comparisons between healthy pools and presumed-wrecked-but-back-below-80-percent pools ... but I know what I saw.

I think if you break the 90% barrier and either: a) get back below it quickly, or b) don't do much on the filesystem while it's above 90%, you'll probably be just fine once you get back below 90%. However, if you've got a busy, busy, churning filesystem, and you grow above 90% and you keep on churning it while above 90%, your performance problems will continue once you go back below, presuming the workload is constant.

Which makes sense ... and, anecdotally, is the same behavior we saw with UFS2 when we tunefs'd minfree down to 0% and ran on that for a while ... freeing up space and setting minfree back to 5-6% didn't make things go back to normal ...

I am receptive to the idea that a ZIL solves this. I don't know if it does or not.

The magic threshold is 96% per meta slab. LBA weighting (which can be disabled with a kernel module parameter or its equivalent on your platform) causes metaslabs toward the front of the disk to hit this earlier. LBA weighting is great for getting maximum bandwidth out of spinning disks. It is not so great once the pool is near full. I wrote a patch that is in ZoL that disables it on solid state disk based vdevs by default where it has no benefit.

That being said, since rsync.net makes heavy use of snapshots, the snapshots would naturally keep the allocations in metaslabs toward the front of the disks pinned. That would make it a pain to get the metaslabs back below the 96% threshold. If you are okay with diminished bandwidth when the pool is empty (assuming spinning disks are used), turn off LBA weighting and the problem should become more manageable.

That said, getting data on the metaslabs from `zdb -mmm tank` would be helpful in diagnosing this.
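For anyone trying to picture the threshold behavior, here is a toy model of a per-metaslab allocator that uses first fit until the metaslab is 96% full and best fit after that. This is only an illustration of the idea described above; the function name and data layout are made up, and the real allocator is considerably more involved.

```python
BEST_FIT_THRESHOLD = 0.96  # per-metaslab figure quoted in this thread


def pick_segment(free_segments, size, used_fraction):
    """Return the index of the free segment to allocate from, or None.

    free_segments: list of (offset, length) tuples sorted by offset.
    used_fraction: how full this (hypothetical) metaslab is, 0.0 to 1.0.
    """
    candidates = [(i, seg) for i, seg in enumerate(free_segments)
                  if seg[1] >= size]
    if not candidates:
        return None
    if used_fraction < BEST_FIT_THRESHOLD:
        # First fit: cheap, take the lowest offset that is big enough.
        return candidates[0][0]
    # Best fit: search for the smallest segment that still fits,
    # which costs more time but leaves large free runs intact.
    return min(candidates, key=lambda c: c[1][1])[0]


segments = [(0, 8), (100, 4), (200, 16)]
print(pick_segment(segments, 4, 0.50))  # 0
print(pick_segment(segments, 4, 0.97))  # 1
```

The extra searching in the best fit branch is the part that shows up as a slowdown once many metaslabs cross the threshold.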

You really shouldn't run non-CoW file systems above 90%, to include UFS and ext

Agreed. I don't think anyone is arguing that you shouldn't do it.

What I believe, and what I think others have also concluded, is that it shouldn't be fatal. That is, when the dust has settled and you trim down usage and have a decent maintenance outage, you should be able to defrag the filesystem and get back to normal.

That's not possible with ZFS because there is no defrag utility ... and I have had it explained to me in other HN threads (although not convincingly) that it might not be possible to build a proper defrag utility.

My understanding is that the way to defrag ZFS is to do a send and receive. Combined with incremental snapshotting, this should actually be realistic with almost no downtime for most environments.

Doing so requires that you have enough zfs filesystems in your pool (or enough independent pools) that you have the free space to temporarily have two copies of the filesystem.

"Doing so requires that you have enough zfs filesystems in your pool (or enough independent pools) that you have the free space to temporarily have two copies of the filesystem."

Yes, and that is why I did not mention recreating the pool as a solution. If your pool is big enough or expensive enough, that's still "fatal".

You ought to define what is fatal here. The worst that I have seen reported at 90% full is a factor of 2 on sequential reads off mechanical disks, which is acceptable to most people. Around that point, sequential writes should also suffer similarly from writes going to the innermost tracks.

(1) I'm not proposing recreating the pool - I'm proposing an approach to incrementally fixing the pool in an entirely online manner.

(2) If your pool is big enough/expensive enough, surely you've also budgeted for backups.

(1) Regardless of what you call it, it means having enough zpool somewhere else to zfs send the entire (90% full) affected zpool off to ... that might be impossible or prohibitively expensive depending on the size of the zpool.

(2) This has nothing to do with backups or data security in any way - it's about data availability (given a specific performance requirement).

You're not going to restore your backups to an unusable pool - you're going to build or buy a new pool and that's not something people expect to have to do just because they hit 90% and churned on it for a while.

You can send/receive to the same zpool and still defrag. With careful thought, this can be done incrementally and with very minimal availability implications.

I agree it's not ideal to have filesystems do this, but it also simplifies a lot of engineering. And I think direct user exposure to a filesystem with a POSIX-like interface is a paradigm mostly on the way out anyway, meaning it's increasingly feasible to design systems to not exceed a safe utilization threshold.

This does work.

On UNIX, there are two defragmentation utilities:

`tar` and `zfs send | zfs recv`.
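A rough outline of the same-pool send/receive rewrite, written as a dry-run planner that only prints the commands. The dataset names and snapshot names are placeholders, and the ordering is just one plausible way to do it; check the steps against your own platform before running anything.

```python
def plan_rewrite(dataset, copy, snap1="defrag1", snap2="defrag2"):
    """Return shell command strings that rewrite `dataset` into `copy`.

    Dry run only: nothing is executed. All names are placeholders.
    """
    return [
        # Bulk copy while the original stays in use.
        f"zfs snapshot {dataset}@{snap1}",
        f"zfs send {dataset}@{snap1} | zfs recv {copy}",
        # Stop writers here, then send only what changed since snap1.
        f"zfs snapshot {dataset}@{snap2}",
        f"zfs send -i {dataset}@{snap1} {dataset}@{snap2} | zfs recv {copy}",
        # Swap the rewritten copy into place.
        f"zfs rename {dataset} {dataset}_old",
        f"zfs rename {copy} {dataset}",
    ]


for command in plan_rewrite("tank/data", "tank/data_new"):
    print(command)
```

The incremental receive may need `zfs recv -F` if the copy was touched after the first receive, and the old dataset still has to be destroyed afterwards to get the space back. As noted upthread, the pool needs enough free space to hold both copies in the meantime.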

What I think would help any COW file system is delay snapshot (and clone) deletion longer, and delete in groups, which would result in larger contiguous regions being freed. When one container is deleted, a small amount of space is freed, and then may be used for writes, quickly filling up, and thus increasing localized fragmentation - or more problematically for spinning drives it's increasing seek times by causing recent writes to be scattered farther apart. To reduce read and write seeks, it's better to have larger free areas for COW to write to sequentially.

So it'd be nice if there were something like a "remove/hide" feature for containers, separate from delete or clean up, with a command that makes it easier to select many containers for deletion. At least on Btrfs this should be quite fast, and the background cleaner process that does the actual work of updating the ref count and freeing extents should have a priority such that it doesn't overly negatively impact other processes.

Some of this behavior may change on Btrfs as the free space tracking has been recently rewritten. Right now the default is the original space cache implementation, while the new free space b-tree implementation is a mount time option intended only for testing.

Using dedicated ZIL significantly reduces fragmentation:


Anyone using ZFS in a serious capacity would have both dedicated ARC and ZIL.

ZIL is only used on synchronous IO. Moving it to a dedicated SLOG device would have no impact on non-synchronous IO. A SLOG device does help on synchronous IO though.

That said, all file systems degrade in performance as they fill. I do not think there is anything notable about how ZFS degrades. The most that I have heard happen is a factor of 2 sequential read performance decrease on a system where all files were written by bit torrent and the pool had reached 90% full. That used mechanical disks. A factor of 2 in a nightmare scenario is not that terrible.

A log vdev is a log vdev (or a SLOG). ZIL is a badly overloaded term.

Ignoring logbias=throughput, when you have a slog you save on writing intents for small synchronous writes into the ordinary vdevs in the pool. If you do a lot of little synchronous writes, you can save a lot of IOPS writing their intents to the log vdev instead of the other vdevs. Log vdevs are write-only except at import (and at the end phases of scrubs and exports).

Here's the killer thing on an IOPS-constrained pool not dominated by large numbers of small synchronous writes: the reads get in the way of writes. ZFS is so good at aggregating writes that unless you are doing lots of small synchronous random writes, the write IOPS tend to vanish.

Reads are dealt with very well as well, especially if they are either prefetchable or cacheable. Random small reads are what kill ZFS performance.

Unfortunately systems dominated by lots of rsync or git or other walks of filesystems tend to produce large numbers of essentially random small reads (in particular, for all the ZFS metadata at various layers, to reach the "metadata" one thinks of at the POSIX layer). This is readily seen with Brendan Gregg's various dtrace tools for zfs.

The answer is, firstly, an ARC that is allowed to grow large, and secondly high-IOPS cache vdevs (L2ARC). l2 hit rates tend to be low compared to ARC hits, but every l2 hit is approximately one less seek on the regular vdevs, and seeks are zfs's true performance killers.
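To make the "every l2 hit is one less seek" point concrete, here is a back-of-the-envelope model. The hit rates below are invented for illustration, and the model assumes each read that misses both caches costs exactly one seek, which is a simplification.

```python
def disk_seeks(reads, arc_hit_rate, l2_hit_rate):
    """Estimate seeks on the regular vdevs for a batch of small reads.

    l2_hit_rate applies only to reads that already missed ARC.
    Assumes one seek per read that misses both caches.
    """
    arc_misses = reads * (1.0 - arc_hit_rate)
    return arc_misses * (1.0 - l2_hit_rate)


# Hypothetical workload: one million small random reads, 90% ARC hit rate.
without_l2 = disk_seeks(1_000_000, 0.90, 0.0)
with_l2 = disk_seeks(1_000_000, 0.90, 0.20)
print(round(without_l2), round(with_l2))
```

Even a modest l2 hit rate removes a fifth of the seeks in this example, which matters a lot when seeks are the bottleneck.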

Persistent L2ARC is amazing, but has been languishing at https://reviews.csiden.org/r/267/

It has several virtues that are quickly obvious in production. Firstly, you get bursts of l2arc hits near import time, and if you have frequently traversed zfs metadata (which is likely if you have containers of some sort running on the pool shortly after import) the performance improvement is obvious. Secondly, you get better data-safety; l2arc corruption, although rare in the real world, can really ruin your day, and the checksumming in persistent l2arc is much more sound. Thirdly, it can take a very long time for large l2arcs to become hot, which makes system downtime (or pool import/export) more traumatic than with persistent l2arc. Rebuilds of full ~128GiB l2arc vdevs take a couple of seconds or so on all realistic devices; even USB3 thumb drives (e.g. Patriot Supersonic or Hyper-X DataTravellers, both of which I've used on busy pools) are fast and give an IOPS uptick early on after a reboot or import, and of course you can have several of those on a pool. "Real" ssds give greater IOPS still. Fourthly, the persistent l2arc being available at import time means that early writes are not stuck waiting for zfs metadata to be read in from the ordinary vdevs; that data again is mostly randomly placed LBA-wise, and small, so there will be many seeks compared to the amount of data needed. Persistent l2arc is a huge win here, especially if for some reason you insist on having datasets or zvols that require DDT lookups (small synchronous high-priority reads if not in ARC or L2ARC!) at write time.

Maybe you could consider integrating it into ZoL since you guys have been busy exploring new features lately.

Finally, if you are doing bittorrent or some other system which produces temp files that are scattered somewhat randomly, there are two things you can do which will help: firstly, recordsize=1M (really; it's great for reducing write IOPS and subsequent read IOPS, and reduces pressure on the metadata in ARC), and secondly, particularly if your receives take a long time (i.e., many txgs), tell your bittorrent client to move the file to a different dataset when the file has been fully received and checked -- that will almost certainly coalesce scattered records.
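The recordsize effect is easy to see with some arithmetic. This sketch just counts how many records a file occupies at the usual 128K default versus 1M; the 4 GiB file size is an arbitrary example, and compression and partial tail records are ignored.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def records_needed(file_bytes, recordsize_bytes):
    """Number of records a file occupies, rounding up (ceiling division)."""
    return -(-file_bytes // recordsize_bytes)


# Hypothetical 4 GiB download.
print(records_needed(4 * GIB, 128 * KIB))  # 32768
print(records_needed(4 * GIB, 1 * MIB))    # 4096
```

Eight times fewer records means eight times fewer block pointers to track, which is where the reduced metadata pressure comes from.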

The term ZIL is not overloaded. Unfortunately, users tend to misuse it because the ZIL's existence is hard to discover until it is moved into a SLOG device.

As for persistent L2ARC, it was developed for Illumos and will be ported after Illumos adopts a final version of it.

I'm using a somewhat older version of ZFS, but I tried having an SLOG (a dedicated ZIL disk) and it went essentially unused, so instead I moved the disk over to a second L2ARC, which helped a lot, as it doubled the throughput.

Further research showed that the ZIL is only needed for synchronous writes, which my workload didn't have any of.

ARC when I looked at ZOL is separate from the linux page cache and thus you get double buffering.

Only with mmap'ed files.

Why can't ZoL just not cache into ARC when mmaping then?

There is no reason why the driver cannot be patched to mmap into ARC. There are just many higher priority things to do at the moment. In terms of performance, the value of eliminating double caching of mmap'ed data is rather small compared to other things in development. Later this year, ZoL will replace kernel virtual memory backed SLAB buffers with lists of pages (the ABD patches). That will improve performance under memory pressure by making memory reclaim faster and more effective versus the current code that will excessively evict due to SLAB fragmentation. It should also bypass the crippled kernel virtual memory allocator on 32-bit Linux that prevents ZoL from operating reliably there. Additionally, workloads that cause the kernel to frequently count all of the kernel virtual memory allocations would improve tremendously.

Mmap'ing into ARC would probably come after that as it would make mapping easier.

ARC yes, ZIL no.

In a thread that is about the perils of ZFS fragmentation, you are replying to a link saying that a ZIL seriously reduces the risk of fragmentation, and saying that someone worried about fragmentation does not need to use a ZIL.

Why? If there's a legitimate reason, please expand.

I think he meant that they might not have one.

It's been a while since I looked at using ZFS for anything meaningful, but at the time (~6 years ago), while losing L2ARC was no big deal, losing dedicated ZIL was catastrophic. I think that's still true today.

So you need at least two ZIL devices in a mirror. On top of that, you really need something faster and lower latency for your ZIL vs. the ARC or main pool; people were trying to use SSDs but most commonly-available drives at the time would either degrade or fail in a hurry under load. So the options were RAM-based, e.g. STEC ZeusRAM on the high end, or some sort of PCI-X/PCIe RAM device. The former was not easy or cheap to acquire for testing stuff, and the latter made failover configs impossible.

I think that ZIL is also not soaking up all writes, just most writes meeting a certain criteria. Some just stream through to the pool. So I was always thinking of it as a protection device that also converted random writes to sequential. Some people don't think they need that.

I remember the fragmentation issue being a problem at the time, but also thinking it was probably going to get solved soon because there was so much interest and a whole company behind it. Then Oracle happened. My guess is that if it were still Sun and all the key people were still there, this would be a solved problem right now. As it is, Oracle probably wants you to buy all the extra storage anyway, and would love to offer professional services to get you out of the fragmentation bind you're in.

A lot has changed. Well - one thing actually: you no longer lose your ZFS pool if your dedicated ZIL log (called a SLOG) dies.

Here is some info on ZIL vs SLOG: http://www.freenas.org/blog/zfs-zil-and-slog-demystified/

Your information is out of date. Losing a SLOG device while the system is running is fine. As far as I know, it has always been fine (unless someone goofed on the initial implementation long before I became involved). All data in ZIL is kept in memory, regardless of whether it is written to the main pool or to a SLOG device. The data is written to the main pool in a permanent fashion with the transaction group commit. If a SLOG device dies, that write out still happens and the pool harmlessly stops using it. If the SLOG device dies on an exported pool, you need to set the zil_replay_disable kernel module parameter to allow the pool to be imported. The same might be true if you reboot (although I doubt it, but need to check).

You can test these things for yourself.

> Anyone using ZFS in a serious capacity would have both dedicated ARC and ZIL.

I contend that most people using ZFS in a serious capacity do not have a dedicated ZIL.

So to understand why this is you have to appreciate the goals behind write-anywhere-file-layout (aka WAFL) file systems. [1]

One of the goals of such systems is that copy of the file system on disk is always consistent, turn power off at any point and you can come right back up with a valid file system. This is accomplished by only writing to the 'free block list'. You construct updated inodes from the file change all the way up to the root inode out of new blocks and then to "step" forward you write a new root block. This is really neat and it means that when you've done that step, you still have the old inodes and datablocks around, they just aren't linked but you can link them to another "holder" inode attached to the name ".snapshot" and it will show you the file system just before the change. Write the old root block back into the real root block and "poof!" you have reverted the file system back to the previous snapshot.
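The root-pointer trick described above can be sketched with a tiny in-memory tree. This is a toy illustration of the general copy-on-write idea, not how WAFL or ZFS lay things out on disk; the class and function names are made up.

```python
class Node:
    """Tree node that is never edited in place once it has been built."""

    def __init__(self, data=None, children=None):
        self.data = data
        self.children = dict(children or {})


def cow_write(root, path, data):
    """Return a NEW root reflecting the write; the old root is untouched.

    Only the nodes along `path` are copied. Everything else is shared.
    """
    if not path:
        return Node(data=data)
    name, rest = path[0], path[1:]
    old_child = root.children.get(name, Node())
    new_children = dict(root.children)
    new_children[name] = cow_write(old_child, rest, data)
    return Node(data=root.data, children=new_children)


def read(root, path):
    """Follow `path` down from `root` and return the data stored there."""
    node = root
    for name in path:
        node = node.children[name]
    return node.data


root_v1 = cow_write(Node(), ["etc", "motd"], "hello")
snapshot = root_v1  # a snapshot is just a root pointer that was kept
root_v2 = cow_write(root_v1, ["etc", "motd"], "goodbye")
print(read(root_v2, ["etc", "motd"]))   # goodbye
print(read(snapshot, ["etc", "motd"]))  # hello
```

Reverting is just making `snapshot` the live root again. The cost is also visible here: every write allocates fresh nodes somewhere else, which is the fragmentation problem discussed in the next paragraph.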

Ok, so that is pretty sweet and really awesome in a lot of ways, but it has a couple of problems. The first, as noted, is that it pretty much guarantees fragmentation as it's always reaching for free blocks and they can be anywhere. On NetApp boxes of old, that wasn't too much of a big deal because everything was done "per RAID stripe" so you were fragmented, but you were also reading/writing full stripes in RAID so you had the bandwidth you needed and fragmentation was absorbed by the efficiencies of full stripe reads/writes. But the second issue arises when you start getting close to full: managing the free block list gets harder and harder. You are constantly getting low block pressure, so you are constantly trying to reclaim old blocks (on unused snapshots, or expired ones) and that leads to a big drop in performance. The math is you can't change more of the data between snapshot steps than the amount of space you have free. That is why NetApp filers would get cranky using them in build environments where automated builds would delete scads of intermediate files, only to rebuild them and then relink them. Big percentage change in the overall storage.

On the positive side, storage is pretty darn cheap these days, so swapping in 3TB drives instead of 2TB drives means you could use all the storage you "planned" to use and keep the drives at 66% occupancy. Hard on users though who will yell at you "It says it has 10TB of storage and is only using 6TB but you won't expand my quota?" At such times it would be useful for the tools to lie but that doesn't happen.

[1] Disclosure 1, I worked for 5 years at NetApp with systems that worked this way. Disclosure 2, an intern with NetApp (we'll call him Matt) was very impressed with this and went on to work at Sun for Jeff and similar solutions appeared in ZFS.

"One of the goals of such systems is that copy of the file system on disk is always consistent."

Goal yes, implementation no. WAFL does in fact have consistency problems and filers do ship with a consistency checker called "wack" which if you ever need this tool you'll probably have better luck throwing the filer in the trash and restoring from backups rather than waiting a month for it to complete.

Why not 'defrag' the free list during low io so this issue is somewhat mitigated?

At least on ZFS, the whole reason "defrag" is impractical is that a bunch of places in the FS structure assume the logical address of a block is immutable for the lifetime of the block, which makes a number of properties really easy and inexpensive, but also means that your life is suffering if you want to try to modify that particular constraint.

If you'd like to see some information on a feature that's been added while working around that particular constraint (or, rather, mitigating the impact of it), check out [1].

[1] - http://open-zfs.org/w/images/b/b4/Device_Removal-Alex_Reece_...

Defragmenting a merkle tree required BPR, which temporarily breaks the structure intended to keep data safe. The only code known to have achieved it performed poorly and is behind closed doors at Oracle.

The benefits in terms of defragmentation are also limited because ZFS does a fair job of resisting fragmentation related performance penalties. The most that I would expect to see on a pool where poor performance is not caused by the best fit allocator would be a factor of two on sequential reads.

As it says in that slide deck's first slide (after the title slide), second bullet, this particular device removal technique is to deal with an "oops" where one accidentally adds a storage vdev to an existing pool.

The zpool command line utility tries hard to help you not shoot yourself in the foot, but "zpool add -f pool diskname" sometimes happens when "zpool add -f pool cache diskname" was meant. Everyone's done it once. Think of a system melting down because the l2arc has died, and you're trying to replace it in a hurry, and you fat-finger the attempt to get rid of the "-n" and end up getting rid of "log" instead.

Without this device removal, that essentially dooms your pool -- there is no way to back out, and the best you can do is throw hardware at the pool (attach another device fast to mirror the single device vdev, then try to grow the vdev to something temporarily useful, where "temporarily" almost always means "as long as it takes to get everything properly backed up"), with the goal being the destruction and re-creation of the pool (plus restoral from backups).

With this device removal, you do not have to destroy your pool; you have simply leaked a small amount of space (possibly permanently) and will carry a seek penalty on some blocks (possibly permanently, but that's rarer) that get written to that vdev before the replacement.

As noted further in the slide deck (and in Alex's blog entries), this only works for single device vdevs -- you cannot remove anything else, like a raidz vdev, and you have to detach devices from mirror vdevs before removal.

Also, note the overheads: although you can remove a single-device vdev with a large amount of data on it, doing so is a wrecking ball to resources, particularly memory. You won't want to do something like:


mirror-0
  disk0 2tb-used 3tb-disk-size
  disk1 2tb-used 3tb-disk-size
mirror-1
  disk2 2tb-used 3tb-disk-size
  disk3 2tb-used 3tb-disk-size

do an expand dance, so you have

mirror-0
  disk0 2tb-used 6tb-disk-size
  disk1 2tb-used 6tb-disk-size
mirror-1
  disk2 2tb-used 3tb-disk-size
  disk3 2tb-used 3tb-disk-size

then detach disk3, then device-removal remove disk2, except in extremely special circumstances, and where you are well aware of the time it will take, the danger to the unsafe data in the pool during the removal (i.e., everything in former mirror-1), that your pool will be trashed beyond hope in the presence of crashes or errors during the removal, and that you will have a permanent expensive overhead in the pool after the removal is done.

It would almost certainly be much faster and vastly safer to make a new pool with the 6tb disks and zfs send data from the old one to the new one.

I think we're basically agreeing loudly over everything except the example being a demonstration of mitigating the impact of BPs being immutable while adding a feature that requires that statement be less than true - and I agree, the permanent overhead of a mini-DDT is a non-starter for anything other than the example case of "oops I added a device, time to evac it before $TONS_OF_DATA gets landed".

Certainly, it would be much less exciting to send|recv from poolA to poolB, and require no code changes and no GB per TB of data indirection overhead.

But this was intended as an example of how many caveats and problems are involved in even a "simple" feature involving shuffling data on-disk, and thus, why "defrag" is a horrendously hard problem in this environment.

On my SSDs, I can go to 96% full without issue using ZoL. ZFSOnLinux is patched to disable ZFS' LBA weighting on solid state storage though. Non-solid state storage tends to reach 96% in metaslabs early due to LBA weighting, although I would be fine with filling a pool to 90% with the recent code. Going much higher than that is probably not a good idea though.

I have yet to see evidence that 60%-80% causes issues unless the system is so overloaded that having performance drop a small amount is noticeable. On spinning disks, such a thing is only natural because there is not much space left in the outer platters.

That said, older versions years ago would enter best fit behavior at 80%, which is where the 80% talk originated.

Have you tried btrfs? It also has support for super-cheap copy-on-write operations which should make container images and the like a snap.

Not sure if your container tool supports btrfs snapshots of course, but it's conceptually simple, right?

I think the main issue here is Btrfs is still developing. Its kernel doc file still says it's for benchmarking and review. [1] CoreOS devs decided to switch from Btrfs to overlay(fs) about 17 months ago. That's a long, long time in Btrfs development "years", given how much development happens on Btrfs. But I can't say whether CoreOS would have made a different decision had today's Btrfs been what they were using in 2014.

RH/Fedora are very dm/LVM thinp snapshots with XFS centric for backing their containers. I think what you're seeing is distros are doing something different with their container backing approach in order to differentiate from other distros. Maybe it's a stab in the dark or spaghetti on the wall approach but in the end all of these storage backends are going to mature a lot in the interim, so ultimately it'll be good for everyone.

[1] https://git.kernel.org/cgit/linux/kernel/git/stable/linux-st...

Btrfs has been in beta for how many years now?

ZFS has been protecting data in enterprise production environments since 2006 (Solaris 10 update 2).

Btrfs is not beta, see the discussion re: maturity below.

ZFS is an excellent filesystem, btrfs is an excellent filesystem. There is room for both excellent options for users.

For development containers, systemd-nspawn has had support for btrfs snapshots since 2014 or 2015. Simple to get going if you're on a Linux box with systemd, no other daemons or tools required.

Isn't this issue inherent to all COW systems (ZFS, WAFL, btrfs)?

have you actually benchmarked it?

anything with copy on write is going to fragment.

are you still running on spinning rust?

I've not seen any real performance hits until 90% full, but then any file system with large images suffers at that point.

We got bitten by this back in the Solaris days in 2009, on a TV broadcasting production box with quite stringent uptime requirements: what happens is the defragmenter gets itself tied in knots and starts thrashing, and the symptom is 50% system CPU with no apparent cause. Got a Sun kernel engineer on call and all. SPOILER: it did in fact require a reboot to unfuck the system. Then we kept the disk in question being wasted.

A reboot implies a bug had to be fixed. That can be assumed to be fixed everywhere by now.

... no, it doesn't imply anything of the sort. As far as I know, ZFS still has this issue - it can get itself tied in knots, and only a reboot will stop this from happening.

We used ZFS in production for a year and it was the worst decision we'd ever made, precisely for this reason.

Also, removing files uses up an insane amount of CPU and can block all FS operations if you get into the 80%+ full situation.

I'll also note that the interplay between the ARC and Linux MM is... "interesting".

It appears that the kernel-level code is shipped as source to be built, automatically by dkms, by the end user. Check out the list of binaries on the bottom left of that page.

This means that no binary kernel modules are shipped, just the cli tools.

Reminds me of the open-vm-tools package that hooks up Ubuntu guests to VMware. VMware ships with its own copies that it can mount as ISOs, but having the alternate FOSS version be managed and versioned and regression-tested by the distro maintainers makes everything a lot smoother (and means you don't have to wait for a VMware update to ship before you can update your kernel.)

Which is roughly the same way as closed-source drivers are handled.

Not really. Closed source drivers don't come with source code and aren't compiled by dkms.

The Nvidia driver does require DKMS, but it's still a binary blob - It ships with source for a wrapper, compiles that, and that loads the blob.

DKMS is for out of tree drivers. Closed source drivers being out of tree can of course use it too.

Have there been any updates on the legal situation since http://blog.halon.org.uk/2016/01/on-zfs-in-debian/?

I am obviously glad that this happened, but afraid of an Oraclocalypse.

It's just that the thing described in this article finally happened: "TLDR: It’s going in contrib, as a source only dkms module."

> Oraclocalypse

The CDDL doesn't have any problems being linked into GPL'ed code, it is the GPL that has a problem accepting CDDL code.

If we could ignore the CDDL conditions, then the problems would go away. A license incompatibility happens when two licenses has each some conditions which are incompatible with each other, and the authors of software under those licenses do not give additional permissions.

The problem would go away if you convinced oracle to change the license or give explicit permission to link ZFS with the linux kernel under the GPL conditions.


Convinced the linux kernel team to change their license or give explicit permission to link the linux kernel with ZFS under the CDDL conditions.

Or you can blame "GPL", because it obviously was created after CDDL...

> Or you can blame "GPL", because it obviously was created after CDDL...

It's the other way around.

Sarcasm was implied.

Wait, there is sarcasm on the internet? What's the markdown for that?

> "What's the markdown for that?"


The sarcasm was that the great-great-grandparent didn't use it.

It's not a requirement, you don't have to use it every time you employ sarcasm online, it just helps if it's likely to be misunderstood.

> If we could ignore the CDDL conditions, then the problems would go away.

That's a strange way to say relicensing CDDL code under GPL

> A license incompatibility happens when two licenses has each some conditions which are incompatible with each other,

I think it's a bit of a far stretch to say that both licenses are to blame when one license has a so far reaching condition that other code needs to be relicensable under that license.

Any Covered Software that You distribute or otherwise make available in Executable form must also be made available in Source Code form and that Source Code form must be distributed only under the terms of this License. The Modifications that You create or to which You contribute are governed by the terms of this License.

Which license requires that other code be relicensable under its license? "This License" really means this license. Modified code is required to be under, and I will repeat myself here, this license. If source code is not under this license, then it is infringing on this license condition and this license will no longer permit redistribution.

This license has a condition, and this license condition makes it incompatible with every other license that has an identical condition.

Oracle has nothing whatsoever to do with this code: this is OpenZFS, out of the illumos' source tree.

If there's money to be made, Oracle will make defending ZFS their business, regardless of the source of the implementation. Look at what's happening over Java with the Oracle vs. Google court cases:


Did you not read what I said? It doesn't matter whether it's a fork or not. If Oracle wins the current lawsuit over Java, it's arguable they could go after OpenZFS too as the lawsuit is based on whether the design of software (not the implementation, the design) can be owned by a company.


"The appeals court reversed the district court on the central issue, holding that the "structure, sequence and organization" of an API was copyrightable."

Java and illumos are licensed under two completely different licenses, and neither illumos code base, nor OpenZFS are an application programming interface.

If you're afraid of Oracle suing Debian for doing less than Oracle themselves do for profit (with linux+dtrace), you could get the distribution from zfsonlinux.org. Then the risk is transferred to .gov, who are also significant end users for enormous high-performance parallel filesystems. They have also been distributing binaries for RHEL-ish kernels for a while.

+1 to this. I wouldn't touch ZFS or DTrace or anything else from Sun/Oracle with a 30-million-foot pole.

Oraclepocalypse? Aren't their lawyers busy with Google?: http://fortune.com/2016/05/13/google-oracle-java-email/

Here's another post about GPL violations related to combining ZFS and Linux: https://sfconservancy.org/blog/2016/feb/25/zfs-and-linux/

Quote from that:

"Is The Analysis Different With Source-Only Distribution?

We cannot close discussion without considering one final unique aspect to this situation. CDDLv1 does allow for free redistribution of ZFS source code. We can also therefore consider the requirements when distributing Linux and ZFS in source code form only.

Pure distribution of source with no binaries is undeniably different. When distributing source code and no binaries, requirements in those sections of GPLv2 and CDDLv1 that cover modification and/or binary (or “Executable”, as CDDLv1 calls it) distribution do not activate. Therefore, the analysis is simpler, and we find no specific clause in either license that prohibits source-only redistribution of Linux and ZFS, even on the same distribution media.

Nevertheless, there may be arguments for contributory and/or indirect copyright infringement in many jurisdictions. We present no specific analysis ourselves on the efficacy of a contributory infringement claim regarding source-only distributions of ZFS and Linux. However, in our GPL litigation experience, we have noticed that judges are savvy at sniffing out attempts to circumvent legal requirements, and they are skeptical about attempts to exploit loopholes. Furthermore, we cannot predict Oracle's view — given its past willingness to enforce copyleft licenses, and Oracle's recent attempts to adjudicate the limits of copyright in Court. Downstream users should consider carefully before engaging in even source-only distribution.

We note that Debian's decision to place source-only ZFS in a relegated area of their archive called contrib, is an innovative solution. Debian fortunately had a long-standing policy that contrib was specifically designed for source code that, while licensed under an acceptable license for Debian's Free Software Guidelines, also has a default use that can cause licensing problems for downstream Debian users. Therefore, Debian communicates clearly to their users that this code is problematic by keeping it out of their main archive. Furthermore, Debian does not distribute any binary form of zfs.ko.

(Full disclosure: Conservancy has a services agreement with Debian in which Conservancy occasionally gives its opinions, in a non-legal capacity, to Debian on topics of Free Software licensing, and gave Debian advice on this matter under that agreement. Conservancy is not Debian's legal counsel.)"

> Aren't their lawyers busy with Google?

They would only bite a golden hand anyway.

> We are also concerned that it may infringe Oracle's copyrights in ZFS.

The Software Freedom Conservancy saying that is a bit scary. I am somewhat less afraid of Linux copyright holders suing.

I wouldn't be surprised if Canonical is taking a calculated risk here that Oracle doesn't actually care, or would be actively helpful if it meant spiting Red Hat.

Of course Oracle also owns part of the copyright to the kernel. They still employ some kernel developers. And oracle could certainly sue distributors for violating their copyright (not on zfs, but on Linux)

> Nevertheless, there may be arguments for contributory and/or indirect copyright infringement in many jurisdictions

In the United States there are two kinds of indirect infringement: contributory infringement and vicarious infringement.

Contributory infringement can occur when you know that someone else is or will directly infringe, and you substantially aid that by inducing, causing, or materially aiding their direct infringement. That can include providing the tools and equipment they use to infringe.

Vicarious infringement can occur when someone who is a direct infringer is your agent or under your control.

A very important aspect of both of these types of indirect infringement is that they make you liable for the direct infringement of someone else. If there is no someone else who is a direct infringer, then you cannot possibly be a contributory or vicarious infringer.

In Sweden, the Pirate Bay tried and failed using a similar argument. The court instead applied a law that targeted biker bars, which had been created to make it easier to shut down such facilities and prosecute their owners for contributory crimes. The prosecutor only needs to convince the court that the average use is primarily of a criminal nature, and the evidence for that in the Pirate Bay case consisted of a screenshot of the top 100 list. There didn't need to be someone who was found guilty of an actual infringement.

"Downstream users should consider carefully before engaging in even source-only distribution."

Why would Free/open-source distros have to distribute ZFS source code? Couldn't they simply provide a method for downloading the source from the already existing ZFS repos and then compile the source? Wouldn't that be enough?

DKMS seems sufficient to make even the handful of guys with extreme legal positions happy. So far, none of them have come up with a way of saying it is not okay in their statements to the rest of us that think binaries are fine.

Excellent, have been running with some ubuntu ppa stuff for a while now, and that's great but things occasionally break. Looks like soon I can ditch it for pure debian again.

ZFS is in Ubuntu 16.04 without a PPA, by the way.

For the lazy:

    $ sudo apt install zfs-dkms zfsutils-linux

Does anyone know if I can switch my previously-used ZFS-on-Linux with this and it will just see my pool and work as before?

You can.

Unfortunately, it's a dkms, which means it gets compiled on the user machine on update.

From an operational perspective, this is insane, I need reliability. Of course in my organisation I could create a binary package and use that, but that's more work and then the new Debian package doesn't help me anyway.

When I need linux I just run ZFS on a better supported system and either virtualise Linux or expose an iSCSI target from ZFS for Linux.

It isn't legal to distribute the binary of zfs.ko


Canonical is doing so anyway and it seems they have a legal disagreement on the interpretation of the law with the Software Freedom Conservancy.

IANAL so I have no clue what is technically correct, but the fact is that Ubuntu is distributing zfs.ko in 16.04.

> Of course in my organisation I could create a binary package and use that, but that's more work and then the new Debian package doesn't help me anyway.

Yes, and that is called system engineering. It is the job of a system engineering department (3rd level support) to deliver such components and stable operating system builds to the 2nd level support (operations, system administrators / database administrators).

I believe on Ubuntu 16.04 ZFS is not using dkms. You'd want to check, because I'm not currently running it, but that's what I have been told.

contrib is a funny place for it. Normally contrib means free software that depends on non-free software. In this case, it seems to have acquired the meaning of free software that has a license incompatibility with other free software. I wonder if we have heard the last of CDDL vs GPL.

Technical solutions to legal problems don't work, just like GPL wrappers don't work (at least, that's what some lawyers say). If Oracle decides to make a stink about this, they still can.

edit: Huh, apparently last year Debian actually got advice from SFLC about this:


Yes, zfs in contrib appears inconsistent with openafs in main. Has that been explained somewhere?

Great news. So this means Ubuntu, Debian and the new Redox OS now have ZFS. I would love to see it officially supported in Fedora too.

There is a DKMS package similar to the Debian one which requires just a single command to install: https://github.com/zfsonlinux/zfs/wiki/Fedora

However it is unlikely that Fedora will ship ZFS unless the license changes or is clarified by Oracle. Unlike Canonical, Red Hat is a US company making serious amounts of money ($2bn revenue last financial year).

I imagine it ends up in EPEL? Or is EPEL Red Hat controlled?

So is btrfs dead?

No, I don't think so. I'm a small-time admin, and I think that btrfs is working pretty good these days.

We have recently switched all our servers to LXC containers so that we can take full advantage of btrfs features.

I doubt I need to explain the advantages of containers to anyone here... but in short we've broken out all the network services (file service, LDAP, DNS, etc.) to separate containers.

Each container is in a separate btrfs subvolume. This allows us to take snapshots of the running systems every 10 minutes, and using btrfs send/receive, cheaply back up those snapshots to alternate container hosts. The send/receive stuff works better with the btrfs v4.4 tools that ship with Ubuntu 16.04.

Since the network interfaces for all the containers are bridged with the container host, we can configure each container with its own static IP address. So if a container host fails, those containers can be booted up on the alternate host, and keep their IP and MAC addresses. So that's convenient, and causes minimal disruption.

The main improvement I'd like to see with btrfs is a configurable RAID redundancy level. Currently, RAID-1 means that there are two copies of each piece of data / metadata. So in a 3-drive RAID-1 system that gives you extra capacity, but two drives failing at the same time will cause data loss.
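To make the two-copy behaviour concrete, here is a rough sketch of how much usable space a RAID-1 profile gives when every chunk is stored on exactly two different devices. The function name and the `min(total / 2, total - largest)` rule are my own illustration under that assumption, not something taken from the btrfs source.

```python
def raid1_usable_capacity(device_sizes):
    """Estimate usable capacity when each chunk lives on exactly two devices.

    device_sizes: iterable of device capacities, all in the same unit.
    Assumption: the allocator can always pair chunks across devices, so the
    only limits are "half of everything" and "the largest device needs a
    partner for every chunk it holds".
    """
    sizes = list(device_sizes)
    if len(sizes) < 2:
        raise ValueError("two-copy RAID-1 needs at least two devices")
    total = sum(sizes)
    largest = max(sizes)
    return min(total / 2, total - largest)


three_equal = raid1_usable_capacity([4, 4, 4])   # three 4 TB drives
one_big = raid1_usable_capacity([8, 2, 2])       # one large, two small
```

Three equal drives give more capacity than two, but there are still only two copies of anything, so two simultaneous failures can still lose data.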

Being lazy, I'm going to ask if you know of some LXC + Btrfs pros in contrast to Docker + Btrfs?

Right now a gotcha with Docker + Btrfs is that SELinux contexts for each container can be different, but the context= mount option currently is once per superblock (thus per fs volume, rather than per fs tree or subvolume). So Docker's work around in 1.10.x is they do a snapshot and then relabel it with the new selinux context then start the container. For my containers (very basic) this adds an almost imperceptible one time delay for that container. I doubt it's even 1 second, which in container start times might seem massive to some.

If you have already adapted to Docker, then I'm sure you don't want to use LXC.

Containers are long-lived and mutable. I treat them like old school servers, just not tied to physical hardware.

btrfs is most certainly not dead. I actually think it's really gaining traction presently. It only became stable near Ubuntu 14.04 or so, and it takes people awhile (understandably) to warm up to a new filesystem.

It's great to see ZFS on Linux get a more stable footing. It's an excellent filesystem. As others have said I think the use case differs slightly from btrfs (though they are very similar in capabilities).

ZFS, to my eyes, seems more resilient. It has more levels of data checksums, the RAIDZ model allows for more redundancy, and it just feels like a stronger enterprise offering (meaning stable and built for large systems and disk quantities).

btrfs brings many of the ZFS features to Linux in a GPL wrapping. What it lacks in resiliency, it makes up for with flexibility. Raid in btrfs, for instance, occurs within data chunks across disks, not at the disk level, meaning mixed disk capacities, and on the fly raid changes. I also appreciate the way it divides namespace across subvolumes while maintaining block awareness within the pool (cp --reflink across subvolumes, snapshots across subvolumes). It also doesn't have the ram requirements of ZFS (which aren't much of a data center concern, but are definitely a client level concern for workstations).

Either way it's a win, both great filesystems for Linux. With bcache supporting btrfs properly now, I personally don't have much of a reason for ZFS now. Two years ago I would have jumped easily to it. Your workloads and needs may differ, it's great to have choices!

Not even close. Each merge window sees many dozens of bug fixes and enhancements, thousands of line additions and deletions. Development is very active. There are even a couple developers who are getting ants in the pants about focusing more on stabilization. The drawback of stabilization is that it increases the burden of adding new features, because any new features could reduce stability. So it's a balancing act.

btrfs supports a bunch of features that ZoL does not which are useful especially on smaller systems.

among others: the ability to resize/change raid layout in place, reflink copies, on-demand deduplication.

But btrfs RAID-5 still suffers from the 'write hole' problem where data can be permanently lost. We looked at RAID-5 performance for both ZFS and btrfs, and btrfs is about twice the throughput. One of the reasons for the performance difference is the following: when ZFS writes a stripe, it waits for all blocks and the parity block to confirm they have been written before atomically updating the metadata for that stripe; when btrfs writes a stripe, it just updates the metadata and all blocks in parallel. Hence if btrfs crashes, metadata may say a stripe is written but not all blocks are written. It will be interesting to see the performance hit on btrfs when they finally fix the write hole problem. My prediction is that it will perform similarly to ZFS for RAID5. Currently, for writes, we can get about 400MB/s for btrfs RAID5, but only 220MB/s for ZFS with five 5400 rpm 2.5" disks.
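For anyone who hasn't seen the write hole in action, below is a toy sketch of the general problem with XOR parity. It is not btrfs or ZFS code; the block contents and helper names are made up for illustration. It shows that if a data block is updated but the crash happens before parity is, a later single-disk loss reconstructs garbage.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together (RAID5-style parity)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def reconstruct(stripe, parity, lost_index):
    """Rebuild the block at lost_index from the surviving blocks plus parity."""
    survivors = [blk for i, blk in enumerate(stripe) if i != lost_index]
    return xor_blocks(survivors + [parity])


# A consistent stripe: three data blocks and their parity.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)
recovered_ok = reconstruct(data, parity, 1)

# Simulated crash: block 0 was rewritten, parity was not updated yet.
torn = [b"ZZZZ", b"BBBB", b"CCCC"]
stale_parity = parity

# Now disk 1 dies. Rebuilding it from the torn stripe gives the wrong bytes.
recovered_torn = reconstruct(torn, stale_parity, 1)
```

Waiting for every block and the parity to land before committing the metadata avoids this, at the cost of the extra latency described above.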

Interesting findings. I'm curious also to see how fixed width stripes change the performance profile in btrfs (presently not supported). To be clear though, the raid5/6 write hole only applies to power failures with data in flight. It is still a concern, of course, but I think, depending on your environment, it is acceptable to some (redundant PSUs, well engineered PDUs and UPS systems). Personally, I'm increasingly of the opinion that parity rebuilds aren't worth it anyway. I'd rather raid10, raid1, or raid0 depending on use. If I have to take a system out of production during parity rebuild (because IO activity is too intense for performant use), might as well not parity rebuild and simply reload the system on failure and rely on other cluster nodes.

Raid5 is not dead yet (https://www.cafaro.net/2014/05/26/why-raid-5-is-not-dead-yet...). The problem with failures during rebuilds is overblown, IMO. Manufacturer-quoted URE rates (probability of failure to read) are overstated - instead of 1×10^14, they are most likely 1×10^15 or higher. Full disclosure: we're actually doing erasure coding in HDFS over Raid5 on servers (double insurance - if the raid array goes down, we can recover from other servers in HDFS). But our expectation for 6x4TB arrays is not for a 70%+ chance of a URE during a rebuild, rather a couple of percent. With ZFS or btrfs, it won't actually matter for us, as we'll only lose a block on a URE - that we can recover from the rest of the cluster.
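The arithmetic behind those percentages can be sketched as follows. This assumes a rebuild reads all five surviving 4 TB disks in full, treats the spec figure as an independent per-bit error probability, and uses decimal terabytes; the function name is made up for the example.

```python
import math


def p_ure_during_rebuild(surviving_disks, disk_bytes, ure_per_bit):
    """Probability of at least one unrecoverable read error during a rebuild.

    Computes 1 - (1 - p)^n, where n is the number of bits read, using
    log1p/expm1 so tiny per-bit probabilities do not get rounded away.
    """
    bits_read = surviving_disks * disk_bytes * 8
    return -math.expm1(bits_read * math.log1p(-ure_per_bit))


TB = 10**12
for rate in (1e-14, 1e-15, 1e-16):
    p = p_ure_during_rebuild(5, 4 * TB, rate)
    print(f"1 error per {1 / rate:.0e} bits: {p:.1%} chance of a URE")
```

With these assumptions the 10^14 figure gives roughly 80%, 10^15 gives roughly 15%, and 10^16 gives under 2%, so "a couple of percent" corresponds to drives that are a good deal better than the spec sheet cap.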

    The problem with failures during rebuilds is overblown
I thought I was the only one who believed that. I've said this on reddit before and ended on like -20 votes with people blatantly arguing I'm falsifying an "impossibility".

I've got roughly 30 arrays in production, between 4 and 12 disks in each. All are RAID5 + hotspare. If you believe the maths people keep quoting, the odds of seeing a total failure in a given year is close to 100%. I started using this configuration, across varying hardware, over 15 years ago and I've been growing in number since.

I'm not pretending one example proves the rule, or that it's totally safe and I would run a highly critical environment this way (before anyone comments: these environments do not meet that definition), but people have tried to show maths that there's a six nine likelihood of failure, and I just don't for a second believe I'm that lucky.

Well at least on Linux, by default almost everyone (using consumer drives) has their array in a very common misconfiguration. And this leads to raid5 collapse much sooner than it should.

The misconfiguration is the drive's SCT ERC timeout is greater than the kernel's SCSI command timer. So what happens on a URE is, the drive does "deep recovery" if it's a consumer drive, and keeps trying to recover that bad sector well beyond the default command timer of the kernel, which is 30 seconds. At 30 seconds the kernel assumes something's wrong and does a link reset. On SATA drives this obliterates the command queue and any other state in the drive. The drive doesn't report a read error, doesn't report what sector had the problem, and so RAID can't do its job and fix the problem by reconstructing the missing data from parity and writing the data back to that bad sector.
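The check itself is just a comparison of two timeouts. Here is a minimal sketch, assuming the drive's SCT ERC limit is expressed in units of 100 ms (which is how `smartctl -l scterc` reports it) and the kernel command timer is in seconds (the value found in `/sys/block/<dev>/device/timeout`). The function name is hypothetical, and it does not read either value from the system.

```python
def erc_misconfigured(sct_erc_deciseconds, kernel_timeout_seconds=30):
    """Return True if the drive may still be retrying when the kernel gives up.

    sct_erc_deciseconds: the drive's SCT ERC limit in units of 100 ms, or
        None if ERC is disabled or unsupported (the drive may then attempt
        "deep recovery" for an unbounded time; treating that as always
        misconfigured is a simplification).
    kernel_timeout_seconds: the kernel's SCSI command timer for the device.
    """
    if sct_erc_deciseconds is None:
        return True
    return sct_erc_deciseconds / 10 >= kernel_timeout_seconds


consumer_default = erc_misconfigured(None)   # ERC off, 30 s kernel timer
nas_drive = erc_misconfigured(70)            # 7.0 s ERC, 30 s kernel timer
```

The fix is either to lower the drive's ERC limit below the kernel timer, or, for drives that refuse, to raise the kernel timer well above the drive's worst-case recovery time.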

So it's inevitable these bad sectors pop up here and there, and then if there's a single drive failure, in effect you get one or more full stripes with two or more missing strips, and now those whole stripes are lost just as if it were a 2-disk failure. It is possible to recover from this but it's really tedious and as far as I know there are no user space tools to make such recovery easy.

I wouldn't be surprised if lots of NAS's using Linux were configured this way, and the user didn't use recommended drives because, FU vendor those drives are expensive, etc.

Don't forget the part where many consumer drives won't let you play with the SCT ERC settings, and some of them just completely crap out on URE and won't come back.

(My personal favorite was when I discovered a certain model of "consumer" drives we had thousands of in production claimed to not support SCT ERC configuration, but if you patched smartctl to ignore the response to "do you support this", the drives would happily configure and honor it.)

Most enterprise-class drives are just consumer drives packaged with a bit more software, but I guess you know that.

Yeah, I was just entertained by how lazily the removal was implemented in the consumer drive FW.

Follow the money: who is selling the "raid5 is dead" story? The main worry is correlated failures if you have the same types of drives in arrays and they reach their end of life.

Note that all the manufacturers aren't actually saying the URE is X. They are saying it's less than X, it's a cap. Therefore it isn't a rate. The actual rate for two drives could be very different, maybe even more than an order of magnitude different, but so long as it's below the spec's cap for such errors, it's considered normal operation.

So yeah, I agree, the whole idea in some circles that you will get a URE every ~12TB of data read is obviously b.s. We don't know what the real world rate is because of that little less than sign that appears in all of these specs. We only know there won't be more errors than that, and not for a specific drive, but rather across a (virtual) sample size for that make/model of drive.

For scalable storage, get rid of conventional RAID for data. I'd like to see n-way (definable) copies of metadata, and single copies of data. On top is a cluster file system like GlusterFS. When a device dies, the file system merely rebuilds metadata, and then informs GlusterFS of the missing data due to the failed drive(s). And then the file system deletes the reference to all missing/damaged files from that missing drive.

No degraded state ever happens. This way Gluster knows not to even make requests from that brick. If the brick were raid56 and the cluster fs isn't aware, requests happen with degraded read/writes which suck performance wise.

Plus Gluster (or even Ceph for that matter) might use some logic and say, well that data doesn't even need to be replicated on that brick, it's better for the network/use loading if the new copy is on this other brick over here.

Sounds like hdfs....

That should be exposed as an option so the user can choose whether they need performance or reliability.

Have you tried btrfs on an md RAID device (for interest)? We've run it for a while, but we don't care too much about performance.

That seems ill advised, one of the benefits of btrfs is it obviates the need for lvm and mdadm.

I guess in your case you have a more stable raid5/6 opportunity, but you're losing many of the raid benefits present in btrfs natively. I'd also imagine it could be slower or introduce IO issues others haven't tested. Though I really have no idea, never seen anyone do that before.

It's been great for our use case - we've been using it for about 5 years like this. We use it for rsyncing data onto as a backup (not the only one!), making daily snapshots. md is more flexible than zfs's raid, but less flexible than native btrfs raid.

No, but a lot of people on the btrfs mailing list recommend this due to the lack of stability for raid5/6 on btrfs.

And it does the reflink copies and on-demand deduplication with minimal overhead. ZFS handles snapshots fine but its deduplication can cripple speeds, even when reading. Maybe some day it'll get block pointer rewrite, but for now it can be a big problem, depending on the type of data you have.

btrfs is immature.

btrfs is stable as of kernel 3.14 or so. The features in btrfs that are unstable at this point are raid56 (write hole on power failure risk), autodefrag (and mainly just with high transaction files like vm images and databases), and ext3/4 in-place filesystem conversion.

We've been running btrfs in production since Ubuntu 14.04 with excellent results. The feature set vastly outweighs the few risks that remain.

You forgot about how sensitive it is to low disk space. Don't go above 80% in production is what I've been told - that's a lot of wasted disk space.

You are correct that btrfs and ZFS are sensitive to low disk space. This is also related to how cow filesystems function, they need that free space to commit writes and for snapshots because they, by nature, don't overwrite blocks.

See this for the ZFS example; http://serverfault.com/a/556892/79238

In both cases the exact amount of free space you desire is a mix of workload, fragmentation, and snapshots.

However, I disagree they waste a lot of space. I think both ZFS and btrfs more than make up for the overhead of free space commits through their space saving features. cp --reflink, block suballocation, compression, and efficient snapshots outweigh the overhead, in my case. Your mileage may vary.

So, isn't there a way to set aside a buffer space so that you don't run into ENOSPC problems? Maybe like ext's 5% reserve.

As I understand it, in the btrfs case there are two problems: 1) Metadata in btrfs can use lots of space, especially when you convert from ext4. It might happen that you have gigabytes of free space reserved for metadata, so it can not be used anymore. This can be solved with rebalancing, but that can take ages, which is actually one of the reasons zfs doesn't have the bpr (block pointer rewrite) feature. 2) btrfs can have mixed raid levels and in that scenario calculating free space is tricky, but people still rely on common tools that simply give some estimates in that case. Change the way it estimates free space and you'll have fewer clueless people complaining about the fs running out of it, but more will say btrfs shows too little.

FWIW, Hammer on DragonflyBSD can rebalance and dedup with little memory and doesn't take long, but details matter and the comparison may not be fair. What's rebalance in ZFS might be something much more trivial and less effective in Hammer, but I've deduped Hammer filesystems on machines with little memory compared to what ZFS requires for its data structures in memory.

ZFS' data deduplication requires very little memory. However, it will check every new record written under it against every other record written. The only way to do this in a performant way is to lean on cache. Without sufficient cache, you degrade to performing 3 random sequential IOs, which performs terribly. The system will continue to run, but it would be slow.
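To illustrate why the lookup cost dominates, here is a toy in-memory dedup table. It is only a sketch: the class and its fields are invented for the example, it uses SHA-256 from the standard library as the checksum, and ZFS's real dedup table lives on disk and is far more involved.

```python
import hashlib


class DedupTable:
    """Toy dedup table mapping record checksum -> reference count."""

    def __init__(self):
        self.refcounts = {}
        self.blocks_stored = 0

    def write(self, record: bytes) -> bool:
        """Record a write. Return True if the record was new and stored."""
        key = hashlib.sha256(record).digest()
        # Every single write has to look up its checksum among all the
        # checksums seen so far. Here that is a dict hit in RAM; if the
        # table did not fit in cache, each lookup would turn into disk IO.
        if key in self.refcounts:
            self.refcounts[key] += 1
            return False
        self.refcounts[key] = 1
        self.blocks_stored += 1
        return True


table = DedupTable()
first = table.write(b"x" * 4096)
second = table.write(b"x" * 4096)   # same contents, only the refcount grows
third = table.write(b"y" * 4096)
```

The table grows with the number of unique records, which is why the amount of cache needed scales with pool contents rather than staying constant.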

As far as I know, there is no way to implement online deduplication with constant RAM usage without performing poorly as things scale or playing Schrödinger's cat with whether data that should deduplicate is subject to deduplication. Offline data deduplication might work, but it would be performance crippling ZFS' data integrity guarantees.

If HAMMER has online data deduplication that is performant with constant ram, they likely made a sacrifice elsewhere to get it. My guess is that it misses cases, such that while you would expect unique records to be written once, they can be written multiple times.

I believe you're misunderstanding the problems that occur on a cow filesystem. In fact btrfs already has an overcommit disk buffer, and is already doing many ENOSPC handling tricks.

Have a look; https://btrfs.wiki.kernel.org/index.php/ENOSPC

Reading more for my own interest it seems ZFS uses ZIL to help convert random writes to sequential, which helps with fragmentation under low space. I am curious if bcache can operate similarly. In addition, I should also point out that this is less of an issue on SSD, due to the nature of how random reads/writes work there anyway (btrfs does a good job of being SSD aware).

The last I checked, ZFS had no open ENOSPC bugs. The trick that it uses is to reserve a small amount of space. I forget if it is 1.6% or 3.3%, but whichever that is, it combined with other tricks is considered to be enough.

ENOSPC is a very different condition from lower performance. If you tested filesystems at 90% full, you should find that all of them have lower performance than when they were empty. You might also find performance varies based on how you filled them.

You're right that the free space overhead is workload dependent. However, compression is orthogonal to the FS and for us, in the Hadoop world, we win nothing with efficient snapshots or other features. The problem we have is estimating how much overhead is 'safe', so we are inherently conservative. The lost disk capacity is a big deal on 1k+ hadoop clusters.

I'm going to disagree with you re: compression. Compression at the FS layer brings a lot of benefits, and generally most workloads are IO constrained not CPU constrained. Compression at the FS improves IO at the expense of CPU.

Hadoop is a completely different workload, and maybe not something for ZFS or btrfs. Our Hadoop nodes are not raid, just JBOD ext4 disks. We have been considering btrfs with nodatacow mount option and lz4 compression, however. We haven't decided if it's better to compress within Hadoop or at the fs layer yet. I would be curious on your findings.

In Hadoop, people are mostly using formats like Parquet or ORC, and if not, compression libs like lzo or snappy. If you believe the Berkeley people (I don't, but the sheeple do), most Spark workloads are CPU bound not IO bound. But irrespective of that, if most of your data is in a columnar data storage format, there's no gain (only cost) in having your FS also try and compress it. JBOD is considered best practice for Hadoop. That's why we're looking at RAID0 and RAID5 - we're researchers :) Actually, MapR recommend using 3 disks in RAID0 as volumes.

btrfs does not support lz4 compression. Unless you meant lzo, which performs terribly on incompressible data, you will want to use ZFS for lz4.

Also, nodatacow is a hack. If you take a snapshot on btrfs with nodatacow, it must use CoW on each thing in the snapshot that is overwritten. Until then, whatever horrible performance nodatacow prevented will manifest temporarily. ZFS is designed to make things asynchronous as much as possible (with the exception of partial record writes to uncached records, which needs to be fixed), so it lacks an equivalent to nodatacow and does not need it.

Yes, apologies, I did mean lzo. zlib seemed perhaps too much on the CPU side of the IO/CPU calculation. I am excited to see how btrfs snappy and lz4 support compares when they are added, however.

To the point of lzo performance though, btrfs is a tad smart with compression: it tries to compress an initial 128KiB, and if the compressed segment is not smaller than the uncompressed one, it adds the file to a list of no-compress files and will not try compression on that file again (unless of course you force it).
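For what it's worth, the heuristic as described in that comment can be sketched in a few lines. This is only an illustration of the description, not btrfs code: `FileState` and `store` are made-up names, and zlib from Python's standard library stands in for LZO.

```python
import os
import zlib

PROBE_SIZE = 128 * 1024  # the "initial 128KiB" from the comment


class FileState:
    """Hypothetical per-file state: remembers a failed compression probe."""

    def __init__(self):
        self.no_compress = False


def store(state, data, force=False):
    """Return (bytes_to_store, was_compressed) for one write to a file."""
    if force:
        # Forced compression ignores the per-file flag entirely.
        return zlib.compress(data), True
    if state.no_compress:
        # An earlier probe failed, so do not try again on this file.
        return data, False
    probe = data[:PROBE_SIZE]
    if len(zlib.compress(probe)) >= len(probe):
        # The probe did not shrink: mark the file and store the data raw.
        state.no_compress = True
        return data, False
    return zlib.compress(data), True


jpeg_like = FileState()
_, compressed = store(jpeg_like, os.urandom(256 * 1024))
print(compressed, jpeg_like.no_compress)
```

Random bytes stand in for an already-compressed file here. Once the probe fails, later writes to the same file skip compression unless `force=True` is passed.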

This was for our Hadoop use case, comparing to ext4, so nodatacow would work because we have no desire of snapshots in that environment. It still seems like we're better off compressing within the Hadoop framework (as jamesblonde is doing) and sticking with ext4 jbod, for now at least.

Btrfs will likely never add support for lz4 or snappy:


There are links there to mailing list emails explaining the reasoning behind that. The reasoning behind ZFS adopting LZ4 can be found here:


Contrary to what the btrfs developers claimed about LZ4 versus LZJB in ZFS, LZ4's compression performance on incompressible data alone would have been enough to adopt it had ZFS already had LZO support. LZ4 also has the benefit of extremely quick decompression speeds. It also has the peculiar property where running LZ4 repeatedly on low entropy files outperforms "superior" compression algorithms such as gzip. Someone on the LZ4 mailing list discovered this when compressing log files. Running it twice on a 3.5GB log file yielded a 9.5MB file with regular LZ4 compression and a 2MB file with LZ4HC compression. He was able to compress it to ~750KB by running LZ4HC roughly 5 times.
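The general effect (a second pass still finding redundancy in the output of the first) is easy to see on a toy input. A rough sketch, with two loud caveats: it uses zlib from Python's standard library as a stand-in because LZ4 is not in the standard library, and the log line is invented. It does not reproduce the LZ4 versus gzip comparison from the mailing list.

```python
import zlib

# A made-up, very low entropy "log file": one line repeated many times.
line = b"2016-05-14 12:00:00 INFO request served in 3ms\n"
data = line * 20000

once = zlib.compress(data)   # first pass over the raw log
twice = zlib.compress(once)  # second pass over the first pass's output

print(len(data), len(once), len(twice))

# Both passes are lossless, so two decompressions recover the input.
assert zlib.decompress(zlib.decompress(twice)) == data
```

The first pass emits a long run of near-identical back-references, which is itself repetitive, so the second pass has something to work with. On data with real entropy the second pass gains nothing.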


As for btrfs being smart with lzo by compressing only the first 128KB as a heuristic, LZ4 uses a hash table for that and is able to give up much faster. I would expect to see LZ4 significantly outperform LZO. The following site has numbers that appear to confirm that:


On a JPEG on their Intel® Core i7-2630QM, LZ4 level 1 compression runs at 608.24MB/sec while LZO level3 compression runs at 68.13MB/sec. Also of possible interest is that Snappy compresses at 559MB/s here. There are a couple of caveats though. While I picked the correct variant of LZ4 for ZFS (and also the Linux kernel), I assumed that btrfs is using the default compression level on LZO like ZFS does on LZ4 and knew nothing about the different revisions measured, so I took the best reported number for any of them at the default compression level. That happened to be lzo1b. I also assumed that JPEGs are incompressible, which the data strongly supports. There are several exceptions, but LZO, Snappy and LZ4 all consider the file to be incompressible.

As for the question of whether to compress in hadoop or in the filesystem, the properties of LZ4 mean that you can do both. If your data is incompressible, LZ4 will give up very quickly (both times). If your data is incompressible after 1 round of LZ4 compression, then the LZ4 compression in ZFS will give up quickly. If it is compressible by two rounds of LZ4, then both will run and you will use less storage space because of it.

Thank you, this was a great reply. Reading the discussions has been interesting.

I haven't heard this and I've been on the btrfs list for years. The biggest issue you're going to have at 80+% usage, that applies to all file systems getting slow, is transfer rate drops due to inner tracks having fewer sectors; and depending on the workload, seeks are increasing simply because there's more stuff on the disk to go looking for.

As for fragmentation there are two kinds: fragmentation of files into more than one extent, fixed with 'btrfs filesystem defrag' and also autodefrag. And the other is fragmentation of free space, as a result of deleting files and that's fixed with 'btrfs balance' which consolidates extents and writes them into new chunks then frees up large regions of contiguous space on the drives. This is best used with filters.

The problem is "in production" you can do a lot of things which ironically don't work for smaller users, because "in production" you specifically optimize and assume hardware and applications will crash in catastrophic ways pretty much all the time and you code quite differently (i.e. it is totally reasonable to ask applications to be distributed and deal with it).

Agreed, my workload is probably not representative of a small organization or user. I absolutely leverage resiliency in software, distributed computing, and mitigate single points of failure, etc... This is a good point.

Now, I do think these methods are increasingly approachable for all users, however. A lot of that is actually enabled by the feature sets of ZFS and btrfs. The default Ubuntu installer, for example, will create snapshots during OS upgrade for rollback if the upgrade fails. ZFS and btrfs send/receive feature allows for efficient DR clones (not to mention seed images and snapshots). LXD leverages ZFS if you choose to for rapid containerization and snapshots.

These intrinsic abilities of these two filesystems allow for smaller users and organizations to improve those workflows to be more risk averse in general (even while assuming some risk in a newer FS).

Just, if possible, be up to date on the known issues and run more recent kernels and userland utilities.


"stable" != "mature"

That's a fair point. I guess we can debate what 'mature' means.

In development since 2007, stable since 2014. ZFS, in development since 2001 (correction I erroneously listed 2005 earlier), stable since 2006 (or at least Solaris included since then). Do you consider the Solaris years or just the Linux years and then do you consider the Linux Debian/Ubuntu sanctioned years or the ZoL years?

I'm fine with mature since included in the default installer on Ubuntu/Redhat/Oracle/SUSE/etc... for my definition of btrfs maturity.

I think you've confused your dates. ZFS has been in development since 2001. It was first introduced in 2005.

Yes, thanks for catching that, I've edited the parent.

As you can tell I'm not that strong on ZFS other than what I've learned in some small tinkering and discussion with others.

It seems odd that after all this time btrfs wouldn't be able to handle database workloads. If it can't handle high I/O it doesn't have much use except for a bootdisk. Do you have any reading you could suggest on this issue?

As zanny said, cow presents an issue for transactional workloads (databases, vm images, etc...) as each write fragments the file, by nature of the process of cow.

btrfs can handle database workloads but you have to disable cow for them (which you can do at the file, directory, or subvolume level in btrfs). You would specify the nodatacow mount option, or chattr +C (file/directory).

The btrfs autodefrag is still rather new, and needs some work. I expect that could be the long term fix (manual defrag is fine now, but you wouldn't want to call it frequently on a db file). I'm not sure how ZFS handles the fragmentation, but I do know in the past ZFS observed similar issues (seems to have been mostly resolved). I should also point out that disabling cow doesn't really fully disable it: snapshots can still function, etc. However, I'm sure that once you start to use the other cow functions you might observe slipping performance due to fragmentation of these types of files.

https://blog.pgaddict.com/posts/friends-dont-let-friends-use... https://bartsjerps.wordpress.com/2013/02/26/zfs-ora-database...

btrfs is only particularly slow on database files if they are not marked for inplace writing because the filesystem is default copy on write, which is horribly slow for huge constantly changing files (they end up highly fragmented across data blocks if you don't turn off cow).

btrfs is not stable!, xfs is stable

ext4 = started in 2006, stable in 2008, universal on all distros by 2010 btrfs = started in 2007, stable in 2014, mature never?

There is no magical point in time where a file system is suddenly "mature". If anything I'd argue once OpenSUSE started defaulting to it one should consider it stable. I just wonder why Ubuntu / Fedora are not prioritizing making it default since snapshotted upgrades are such a huge usability gain and are trivially easy with btrfs.

Meanwhile, just for reference, ZFS started in 2001. Introduced in Solaris in 2005 and FreeBSD in 2007.

I'm not necessarily trying to make a point here, just adding some related facts.

I don't know if it's fair to say that ext4 "started" in 2006, given that it was an evolution of ext3, which was an evolution of ext2 (which I assume was an evolution of ext?), to the point that you can still mount an ext2 filesystem as a very feature-limited ext4 filesystem.

Why the 7 year gap for stabilization? Especially after ext4 only took 2. Judging from the timelines, btrfs is much more complicated and I'd guess more Linux users exposed more corner cases, but a 3.5x lengthening is substantial.

AFAICT, btrfs is a whole new system, while ext4 is a set of changes to an already tested coffee base.

*code base. Friggin' autocomplete.

ext4 isn't a rewrite and is orders of magnitude smaller in code size and complexity, so the comparison isn't really fair.

It's the default file system in OpenSUSE.

BTRFS has much lower memory requirements than ZFS which needs at least 16GB RAM for anything performant. Also for NAS boxes expecting the user to compile the ZFS libs from source wouldn't be optimal.

Btrfs is a different use-case.

ZFS doesn't support different sized hard drives. I could also be wrong, but ZFS doesn't expand as easily, and you can't remove drives from it.

Synology NAS just added support for Btrfs in their software

ZFS does support different sized drives, and you can add disks to a pool easily.

It does not support different sized drives in a raidz (equivalent to an mdadm array), as far as I know, but you can mix raidz and singular drives in a pool, though that is a bit silly. It also does not support adding disks to a raidz, which is equivalent to growing a raid5 or raid6 array, which is admittedly a pain in the ass.

You can also upgrade an entire raidz vdev if you replace each drive in the vdev with a larger one (one at a time) and enable "autoexpand" on the zpool. It's a manual and time consuming process, waiting for each drive to resilver. Still a pain, but at least possible.
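To illustrate why the extra space only shows up at the very end of that process, here is a back-of-the-envelope sketch. It assumes a raidz vdev sizes every member to its smallest drive, and it ignores metadata, padding, and resilver time; the function name and drive sizes are made up for the example.

```python
def usable_after_each_swap(old_sizes, new_size, parity=1):
    """Approximate usable capacity of one raidz vdev after each drive swap.

    Assumes the vdev treats every member as the size of its smallest drive.
    """
    sizes = list(old_sizes)
    history = []
    for i in range(len(sizes)):
        sizes[i] = new_size  # replace one drive, then wait for the resilver
        history.append((len(sizes) - parity) * min(sizes))
    return history


# Three 2 TB drives in raidz1, swapped one at a time for 4 TB drives.
print(usable_after_each_swap([2, 2, 2], 4))  # [4, 4, 8]
```

Under that assumption nothing grows until the last small drive is gone, which is why the whole vdev has to be cycled before the new capacity appears.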

> It does not support different sized drives in a raidz

  % mkfile -v 64m /var/tmp/d0
  /var/tmp/d0 67108864 bytes
  % mkfile -v 96m /var/tmp/d1
  /var/tmp/d1 100663296 bytes
  % mkfile -v 128m /var/tmp/d2
  /var/tmp/d2 134217728 bytes
  % sudo zpool create -f pool0 raidz /var/tmp/d0 /var/tmp/d1 /var/tmp/d2
  % zpool status
    pool: pool0
   state: ONLINE
    scan: none requested

        NAME             STATE     READ WRITE CKSUM
        pool0            ONLINE       0     0     0
          raidz1-0       ONLINE       0     0     0
            /var/tmp/d0  ONLINE       0     0     0
            /var/tmp/d1  ONLINE       0     0     0
            /var/tmp/d2  ONLINE       0     0     0

  errors: No known data errors

You can build it, but it is limited to the smallest device in the vdev, so the array will appear to only have three devices of 64m available. Though you can replace all smaller disks and grow it.

That is not how RAIDZ works. Because every write is a dynamic width stripe, the capacity is an aggregate minus parity, not linear.

The smallest device capacity applies to RAID1+0, which is mirroring, but it does not apply to RAIDZ.

Of course ZFS supports any sized drives in a pool. ZFS expands very easily, in fact, the user doesn't have to do anything, it automatically expands.

Of course you can remove drives in ZFS. How else could you replace disks?

I downvoted your post because every statement you made about ZFS is wrong.

You're describing operations that can be done on a pool, while conveniently ignoring that they can not be done to a single RAID array (vdev), which is the more useful comparison against btrfs.

Yes, you can add drives of arbitrary size to a ZFS pool, but when put into the same array, they all get treated as the size of the smallest one. Yes, you can expand a ZFS pool but only by adding a new array or by incrementally upgrading all the drives in the new array. Yes, you can remove drives, but only for replacement purposes; you can't do an in-place rebuild to make the reduced number of drives the new normal instead of degraded mode.
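As a rough illustration of the "size of the smallest one" rule, here is the arithmetic for the 64m/96m/128m file-backed devices from the transcript earlier in the thread. This is an approximation that ignores labels, metadata, and padding, and the helper name is made up.

```python
MIB = 1024 * 1024


def raidz_usable_bytes(device_sizes, parity=1):
    """Approximate usable bytes of one raidz vdev.

    Every member counts as the smallest device; parity devices are subtracted.
    """
    if parity not in (1, 2, 3):
        raise ValueError("raidz parity must be 1, 2, or 3")
    if len(device_sizes) <= parity:
        raise ValueError("need more devices than parity drives")
    return (len(device_sizes) - parity) * min(device_sizes)


sizes = [64 * MIB, 96 * MIB, 128 * MIB]
print(raidz_usable_bytes(sizes) // MIB)  # 128
```

So roughly 128 MiB usable out of 288 MiB of raw devices, with the extra 32 MiB and 64 MiB on the two larger files left unused until the smallest member is replaced.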

I know nothing about ZFS, so sorry if these are dumb questions.

Can you have an array that contains a single disk? And if you can, then

* Can you add as many single drive arrays to a pool as you like?

* Can you migrate data off of a given single-drive array on to other single-drive arrays in the pool and then permanently remove the array from the pool?

Yes, yes, no. Plus a single disk array has no redundancy.

Minor nit: depending on your definition of redundancy, you may set copies=n, on a single vdev pool. You won't get hardware redundancy but will get data redundancy.

This is interesting to see at a time when so many key packages are missing or badly outdated in Debian core.

finally, hopefully it makes its way to the installer (unlike the ubuntu installer...)

Don't get your hopes up.

massive opportunity now for Oracle to generate some goodwill

Oracle has nothing whatsoever to do with OpenZFS. As in zero, nada, zilch. And because of CDDL, Oracle cannot take back the source code from illumos (which contains OpenZFS code) without open sourcing the code to Solaris again:


Oracle isn't Goodwill

so... what's missing?

Legal clarity.

But it didn't land in Debian. It only landed in contrib. Title of this link is wrong, so maybe someone should fix it?


We added 'contrib'. Will that do?
