
The State of ZFS on Linux - ferrantim
https://clusterhq.com/blog/state-zfs-on-linux/
======
ownedthx
At a previous job, we built a proof-of-concept Sinatra service (i.e.,
HTTP/RESTful service) that would, on a certain API call, clone from a
specified snapshot, and also create an iscsi target to that new clone. This
was on OpenIndiana initially, then some other variant of that OS as a second
attempt.

The client making the HTTP request was iPXE; so, every time the machine
booted, you'd get yourself a fresh clone + iSCSI target. We'd then mount that
iSCSI target in iPXE, which would then hand it off to the OS and away you'd
go.
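Roughly, the per-boot flow looked like this (a sketch with made-up dataset
names; the iSCSI side on OpenIndiana went through COMSTAR):

    zfs snapshot tank/golden@release                  # golden zvol image, snapshotted once
    zfs clone tank/golden@release tank/clones/node42  # per API call: fresh clone for the booting node
    # then expose the clone as an iSCSI LUN via COMSTAR, e.g.
    # sbdadm create-lu /dev/zvol/rdsk/tank/clones/node42 (plus target/view setup)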

The fundamental problem we hit was that there was a linear delay for every new
clone; the delay seemed to be 'number of clones * .05 second' or so. This was
on extremely fast hardware. It was the ZFS clone command itself that was
running too slowly.

Around 500 clones, we'd notice these 10/20 second delays. The reason that hurt
so bad is that, to our understanding, it wasn't safe to do ZFS commands or
ISCSI commands in a parallel manner; the Sinatra service was responsible for
serializing all ZFS/ISCSI commands.

So my question to the author:

1) Does this 'delay per clone' ring familiar to you? Does ZFS on Linux have
the same issue? It was a killer for us, and I found a thread eventually that
implied it would not ever get fixed in Solaris-land.

2) Can you execute concurrent ZFS CLI commands on the OS? Or is that dangerous
like we found it to be on Solaris?

~~~
ryao
1\. I am not aware of this specific issue. However, I am aware of an issue
involving slow pool import with large numbers of zvols. Delphix has developed
a fix for it that implements prefetching. It should be merged into various
Open ZFS platforms soon. It could resolve the problem that you describe.

2\. Matthew Ahrens' synctask rewrite fixed this in Open ZFS. It took a while
for the fix to propagate to tagged releases, but all Open ZFS platforms should
now have it. ZoL gained it with the 0.6.3 release. Here is a link to a page
with links to the commits that added this to each platform as well as the
months in which they were added:

[http://open-zfs.org/wiki/Features#synctask_rewrite](http://open-
zfs.org/wiki/Features#synctask_rewrite)

~~~
ownedthx
Thanks for the reply.

Regarding #2: On OpenIndiana, we first started with concurrent zfs commands
and ruined, I think, the whole pool (maybe it wasn't that drastic, but it was
still a disaster scenario where key data would be lost). I couldn't believe it.

I was asking anyone who knew anything... 'so if two admins were logged in at
the same time and made two zvols, they could basically ruin their filesystem'?
No one knew for sure. Crazy stuff.

Anyway, I'm quite glad that's safe now.

------
ryao
I am the author. Feel free to respond with questions. I will be watching for
questions throughout the day.

~~~
Nursie
I don't really have a question, but as a ZoL user I'd just like to say thanks
for all the hard work.

It makes management of my disk arrays pretty painless and has some fantastic
migration/recovery stuff going on. All of which I'm sure you know!

~~~
IgorPartola
Seconded. ZFS is the only filesystem I trust with my children's baby pictures,
as well as to store the git repos for my personal projects (stuff I don't
want on GitHub for a variety of reasons).

~~~
ryao
I am happy to hear that. While I certainly think ZFS is the best filesystem
available for storing this kind of data, I would like to add a word of caution
that ZFS is not a replacement for backups. I elaborated on this in one of the
supplementary blog posts:

[https://clusterhq.com/blog/file-systems-data-loss-
zfs/#disk-...](https://clusterhq.com/blog/file-systems-data-loss-zfs/#disk-
failure)

~~~
IgorPartola
Absolutely. I have a regular backup strategy that also backs up to a ZFS
system :).

------
astral303
Tried using ZFS in earnest and got spooked; felt it was not production
ready. Wanted to use ZFS for MongoDB on Amazon Linux (primarily for
compression, but also for snapshot functionality for backups). Tried 0.6.2.

Ended up running into a situation where a snapshot delete hung and none of my
ZFS commands were returning. The snapshot delete was not killable with kill
-9.
[https://github.com/zfsonlinux/zfs/issues/1283](https://github.com/zfsonlinux/zfs/issues/1283)

Also, under load I encountered a kernel panic or a hang (I forget which); it
turns out that's because the Amazon Linux kernel comes compiled with no
preemption. It seems that "voluntary preemption" is the only setting that's
reliable.
[https://github.com/zfsonlinux/zfs/issues/1620](https://github.com/zfsonlinux/zfs/issues/1620)

That left a bad taste in my mouth. Might be worth trying out 0.6.3 again.

I am still leafing through the issues closed in 0.6.3, but based on what I
see, 0.6.2 did not seem production-ready-enough for me:

[https://github.com/zfsonlinux/zfs/issues?page=2&q=is%3Aissue...](https://github.com/zfsonlinux/zfs/issues?page=2&q=is%3Aissue+milestone%3A0.6.3)

~~~
ryao
Your deadlock was likely caused by the sole regression to get by us in the
0.6.2 release:

[https://github.com/zfsonlinux/zfs/commit/a117a6d66e5cf1e9d4f...](https://github.com/zfsonlinux/zfs/commit/a117a6d66e5cf1e9d4f173bccc786a169e9a8e04)

This occurred because it was rare enough that neither we nor the buildbots
caught it back in February. George Wilson wrote a fix for it in Illumos rather
promptly. However, the Illumos and ZoL projects had different formats for the
commit titles of regression fixes. Specifically, the Illumos developers would
reuse the same exact title while the ZoL developers would generally expect a
different title, so we missed it when merging work done in Illumos. I caught
it in November when I was certain that George had made a mistake and noticed
that our code and the Illumos code were different. It is fixed in 0.6.3. The
fix was backported to a few distribution repositories, but not to all of them.

The 0.6.3 release was notable for having a very long development cycle. As I
described in the blog post, the project will begin doing official bug fix
releases when 1.0 is tagged. That should ensure that these fixes become
available to all distributions much sooner. In the mean time, future releases
are planned to have much shorter development cycles than 0.6.3 had, so fixes
like this will become available more quickly.

That being said, I was at the MongoDB office in NYC earlier this year to
troubleshoot poor performance on MongoDB. I will refrain from naming the
MongoDB developer with whom I worked lest he become flooded with emails, but
my general understanding is that 0.6.3 resolved the performance issues that
the MongoDB developers had observed. Future releases should further increase
performance.

~~~
astral303
Thank you so much for the information! This is very encouraging. I will
definitely give 0.6.3 a whirl!

------
WestCoastJustin
For anyone interested, over the past couple weeks I have heavily researched
ZFS, and created a couple screencasts about my findings [1, 2].

[1] [https://sysadmincasts.com/episodes/35-zfs-on-linux-
part-1-of...](https://sysadmincasts.com/episodes/35-zfs-on-linux-part-1-of-2)

[2] [https://sysadmincasts.com/episodes/37-zfs-on-linux-
part-2-of...](https://sysadmincasts.com/episodes/37-zfs-on-linux-part-2-of-2)

------
agapon
Great blog post! Something from personal experience. OpenZFS on FreeBSD feels
mostly like a port of illumos ZFS where most of the non-FreeBSD-specific
changes happen in illumos and then get ported downstream. On the other hand,
OpenZFS on Linux feels like a fork. There is certainly a stream of changes
from illumos, but there's a rather non-trivial amount of changes to the core
code that happen in ZoL.

~~~
ryao
This is because Martin Matuška of FreeBSD has been focused on upstreaming
changes made in FreeBSD's ZFS port into Illumos. At present, the ZFSOnLinux
project has had no one dedicated to that task and code changes mostly flow
from Illumos to Linux. This is starting to change. A small change went
upstream to Illumos earlier this year and more should follow in the future.

That being said, there are commonalities between Illumos and FreeBSD that make
it easier for the FreeBSD ZFS developers to collaborate with their Illumos
counterparts:

1\. FreeBSD and Illumos have large kernel stacks (4 pages and 6 pages
respectively) while Linux's kernel stacks are limited to 2 pages.

2\. In-kernel virtual memory is well supported in FreeBSD and Illumos while
Linux's in-kernel virtual memory is crippled for philosophical reasons.

3\. FreeBSD and Illumos have both the kernel and userland in the same tree.
FreeBSD even maintained Illumos' directory structure in its import of the code
while ZoL's project lead decided to refactor it to be more consistent with
Linux.

Difficulties caused by these differences should go away as changes made in ZoL
to improve code portability are sent back to Illumos.

~~~
gnu8
I'm interested in point 2, can you clarify how and why Linux in-kernel virtual
memory is crippled or provide a link?

~~~
ryao
There are two issues:

1\. Page table allocations use GFP_KERNEL, even when done for an allocation
that used GFP_ATOMIC. This means that allocations that are needed to do
pageout and other things to free memory can deadlock. There is a workaround in
the SPL that will switch it to kmalloc when this issue occurs. There is also a
new mechanism in Linux 3.9 and later that should render this unnecessary.

2\. The kernel deals with kernel virtual address space exhaustion in vmalloc()
by spinning until memory becomes available. This is not a problem on current
64-bit systems where the virtual address space used by vmalloc is larger than
physical memory, but it is a problem on most 32-bit systems.

------
bussiere
I may have read the article too fast, but what about cryptography in ZoL? Is
there a way to encrypt data on ZoL? Regards and thanks for the article.

~~~
ryao
At present, you need to either encrypt the block devices beneath ZFS via LUKS
or the filesystem on top of ZFS via ecryptfs. There are some guides on how to
do this for each distribution.
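A rough sketch of the LUKS route, with hypothetical device and pool names (the
distribution-specific guides cover the details):

    cryptsetup luksFormat /dev/sdb           # set up encryption on the raw device
    cryptsetup luksOpen /dev/sdb cryptdisk   # map it to /dev/mapper/cryptdisk
    zpool create tank /dev/mapper/cryptdisk  # build the pool on top of the mapping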

There is an open issue for integrating encryption into ZoL itself:

[https://github.com/zfsonlinux/zfs/issues/494](https://github.com/zfsonlinux/zfs/issues/494)

This will likely be added to ZoL in the future, but no one is actively working
on it at this time.

~~~
bussiere
thks

------
a2743906
I'm using ZFS right now because I need something that cares for data
integrity, but the fact that it will never be included in Linux is a very big
issue for me. Every time you upgrade your kernel, you have to upgrade the
separate modules as well - this is the point where bad things can happen. I
will definitely be looking into Btrfs once it is more reliable. For now I'm
having a bit of a problem with SSD caching and performance, but I don't care
about it enough for it to be relevant; I just use the filesystem to store data
safely and ZFS does an OK job.

~~~
sp332
From what I understand, aside from certain RAID levels, btrfs is production-
ready. RAID5 and RAID6 don't have recovery code finished yet, but RAID0,
RAID1, "dup" which just keeps 2 copies of each chunk, and "single" mode all
work fine.

~~~
xioxox
I've set up btrfs on software mdraid (raid 6) as a backup system (not the only
backup!). You still get the checksums and snapshotting, but not the
flexibility of the btrfs raid system. It has the advantage of being easy to
grow, unlike zfs, which can't be resized once created. We've encountered no
problems, even though it has been running for around three years of rsyncing
and snapshotting.

~~~
barrkel
zfs can grow - new vdevs can be added, and existing vdevs can have their disks
replaced one at a time with larger disks.
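For example (hypothetical pool and device names):

    # grow by adding another mirror vdev
    zpool add tank mirror /dev/sdc /dev/sdd

    # or grow in place by replacing each disk with a larger one, one at a time
    zpool set autoexpand=on tank
    zpool replace tank /dev/sda /dev/sde   # wait for the resilver, then replace the next disk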

A bigger downside of ZFS, IMO, is lack of defragmentation and similar larger
scale pool management. If you ever push a ZFS pool close to its space limit,
you can end up with fragmentation that never really goes away, even if you
delete lots of files. The recommended solution is to recreate the pool and
restore from backup, or create a new pool and stream a snapshot across with
zfs send | zfs receive. Not terribly practical for most home users.
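For reference, the stream-to-a-new-pool approach is roughly (hypothetical pool
names):

    zfs snapshot -r oldpool@migrate
    zfs send -R oldpool@migrate | zfs receive -F newpool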

~~~
xioxox
zfs can sort of grow by adding vdevs, as you say; however, it's pretty wasteful
due to the new parity drives. It was much more efficient to expand the mdraid
raid 6 and expand the btrfs onto that. The other backup server does use zfs
(albeit the user-mode fuse version). I set up that system in a similar way,
putting zfs-fuse on mdraid 6.

------
Andys
I've used ZFSonLinux on my laptop and workstation for a couple of years now,
with Ubuntu, without any major problems. When I tried to use it in production,
I didn't get data loss but I hit problems:

* Upgrading is a crapshoot: Twice, it failed to remount the pool after rebooting, and needed manual intervention.

* Complete pool lockup: in an earlier version, the pool hung and I had to reboot to get access to it again. If you look through the issues on github, you'll see weird lockups or kernel whoopsies are not uncommon.

* Performance problems with NFS: This is partially due to the Linux NFS server sucking, but ZFS made it worse. It used a lot of CPU compared to Solaris or FreeBSD, and was slow. It's even slow looping back to localhost.

* Slower on SSDs: ZFS does more work than other filesystems, so I found that it used more CPU time and had more latency on pure SSD-backed pools.

* There are alternatives to L2ARC/ZIL on Linux, such as "flashcache" on Ubuntu, that are built in and work with any filesystem.

For these reasons, I think ZoL is good for "near line" and backup storage,
where you have a large RAID of HDDs and need stable and checksummed data
storage, but not for mission-critical stuff like fileservers or DBs.

~~~
ryao
I mentioned most of these issues in the supplementary blog posts. Here is
where each stands:

* There are issues when upgrading because the initramfs can store an old copy of the kernel module and the /dev/zfs interface is not stabilized. This will be addressed in the next 6 months by a combination of two things. The first is /dev/zfs stabilization. The second is bootloader support for dynamic generation of initramfs archives. syslinux does this, but it does not at this time support ZFS. I will be sending Peter Anvin patches to add ZFS support to syslinux later this year. Systems using the patched syslinux will be immune to this problem while systems using GRUB2 will likely need to rely on the /dev/zfs stabilization.

* There are many people who do not have problems, but this is certainly possible. Much of the weirdness should be fixed in 0.6.4. In particular, I seem to have fixed a major cause of rare weirdness in the following pull requests, which had the side benefit of dramatically increasing performance in certain workloads:

[https://github.com/zfsonlinux/spl/pull/369](https://github.com/zfsonlinux/spl/pull/369)
[https://github.com/zfsonlinux/zfs/pull/2411](https://github.com/zfsonlinux/zfs/pull/2411)

* The above pull requests have a fairly dramatic impact on NFS performance. Benchmarks shown to me by SoftNAS indicate that all performance metrics have increased anywhere from 1.5 to 3 times. Those patches have not yet been merged as I need to address a few minor concerns from the project lead, but those will be rectified in time for 0.6.4. Additional benchmarks by SoftNAS have shown that the following patch that was recently merged increases performance another 5% to 10% and has a fairly dramatic effect on CPU utilization:

[https://github.com/zfsonlinux/zfs/commit/cd3939c5f06945a3883...](https://github.com/zfsonlinux/zfs/commit/cd3939c5f06945a3883a362379d0c12e57f31a4d)

* There is opportunity for improvement in this area, but it is hard for me to tell what you mean. In particular, I am not certain if you mean minimum latency, maximum latency, average latency or the distribution of latency. In the latter case, the following might be relevant:

[https://twitter.com/lmarsden/status/383938538104184832/photo...](https://twitter.com/lmarsden/status/383938538104184832/photo/1)

That said, I believe that the kmem patches that I linked above will also have
a positive impact on SSDs. They reduce contention in critical code paths that
affect low latency devices.

Additionally, there is at least one opportunity to improve our latencies. In
particular, ZIL could be modified to use Force Unit Access instead of flushes.
The problem with this is that not all devices honor Force Unit Access, so
making this change could result in data loss. It might be possible to safely
make it on SLOG devices as I am not aware of any flash devices that disobey
Force Unit Access. However, data integrity takes priority. You can test
whether a SLOG device would make a difference in latencies by setting
sync=disabled temporarily for the duration of your test. All improvements in
the area of SLOG devices will converge toward the performance of
sync=disabled. If sync=disabled does not improve things, the bottleneck is
somewhere else.
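A quick way to run that test (hypothetical dataset name; this drops sync write
guarantees, so only do it with disposable test data):

    zfs set sync=disabled tank/test    # run the latency benchmark now
    zfs set sync=standard tank/test    # restore the default afterwards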

* These alternatives operate on the block device level and add opportunity for bugs to cause cache coherence problems that are damaging to a filesystem on top. They are also unaware of what is being stored, so they cannot attain the same level of performance as a solution that operates on internal objects.

~~~
Andys
Great reply :)

Your last two points are a little weak. Ultimately ZFS does more stuff, so on
an SSD it's going to be slower than a filesystem which doesn't have all these
extra features, if you aren't using them. I think there's a trade-off to be
had.

All software has bugs, especially ZFS. You could argue it is easier to
develop, test, and maintain individual device-mapper building blocks.

------
ryao
I have been inundated with feedback from a wide number of channels. If I did
not reply to a comment today, I will try to address it tomorrow.

------
ashayh
ZFS, and most* other file systems, are all about _one_ computer system.

While ZFS data integrity features may be useful, they don't prevent the wide
variety of things that can go wrong on a _single_ computer. You still need
site redundancy, multiple physical copies, recovery from user errors etc.

Large, modern enterprises are better off keeping data on application layer
"filesystems" or databases, since they can more easily aggregate the storage
of hundreds or thousands of physical nodes. ZFS doesn't help with anything
special here.

For the average home user, ZoL modules are a hassle to maintain. You are
better off setting up FreeNAS on a 2nd computer if you really want to use ZFS.
Otherwise there is not much on offer beyond what XFS, EXT4 or btrfs provide.

The 'ssm' set of tools to manage LVM, and other built-in file systems, is
easier for home users with regular needs.

GlusterFS and others are distributed file systems, but they suffer from
additional complexity at the OS and management layers.

~~~
ryao
The Lustre filesystem is able to use ZFSOnLinux for its OSDs. This gives it
end-to-end checksum capabilities that I am told enabled the Lustre developers
to catch buggy NIC drivers that were silently corrupting data.

Alternatively, there is a commercial Linux distribution called SoftNAS that
implements a proprietary feature called snap replicate on top of ZFS
send/recv. This allows it to maintain backups across availability zones and
is achieved by its custom management software running the zfs send/recv
commands per user requests.

In the interest of full disclosure, my 2014 income tax filing will include
income from consulting fees that SoftNAS paid me to prioritize fixes for bugs
that affected them. I received no money for such services in prior tax years.

------
mbreese
I love ZFS, and I love working with Linux, but I can't help but worry about
using ZFS on Linux. Without the needed support from the kernel side, I don't
see how it can be useful for production. I can see using it on personal
workstations, but for any situation where avoiding data loss is critical, you
just won't see any uptake. Because of the licensing, ZFS can never be anything
more than a second-class citizen on Linux.

That said, I run a FreeBSD ZFS file server just to host NFS that is exported
over to a Linux cluster. At least on FreeBSD, there is first-class integration
of ZFS into the OS. (I used to also maintain a Sun cluster that had a Solaris
ZFS storage server that exported NFS over to Linux nodes, which is where I
first got a taste for ZFS).

So, I guess my main question is: In what use cases is ZFS on Linux so useful
when native FreeBSD/ZFS support exists?

I'm not saying it can't be done - I just don't understand _why_.

~~~
michael_h

      Without the needed support from the kernel side
    

Can you clarify what you mean by that?

~~~
mbreese
Basically I mean integration with the kernel's code base and all of the
testing that entails. After their initial development, file systems all end up
migrating to the kernel's code base.

So, I'm not thinking in terms of technical API support, but more
development/testing/integration support.

~~~
ryao
That is not a requirement. ZoL has the most sophisticated build system of any
Linux kernel module in the world to enable it to live outside of the main
tree. ZoL relies on autotools' API checks to do this. In addition, the project
has an automated buildbot that helps us to detect regressions in pull requests
before they are merged. It is similar in principle to how Lustre filesystem
development is done. Lustre is also (primarily) outside of the tree and lived
entirely outside of the tree for years.

That being said, being inside the kernel source tree is not necessarily a good
thing. As I wrote in the blog post, other filesystems on Linux generally do
not provide the latest code to older kernels, but ZoL provides the latest code
to all supported kernels and distributions. This ranges from Linux 2.6.26.y to
the 3.16.y in 0.6.3 and will include 3.17.y in 0.6.4. Something as important
as a filesystem should be updated to fix bugs, even if the kernel proper
cannot be. The inability of all Linux systems to update to the latest kernel
is an issue that Linus Torvalds mentioned at LinuxCon North America 2014 and
ZoL is one of the few filesystems that can deal with it on systems where it is
deployed.

~~~
mbreese
You know... I hadn't given much thought to Lustre. And that is a very good
comparison. Lustre is completely out of kernel (in terms of development), so
it's not like you're the first to try this.

I guess it really depends on where you want ZoL to be deployed. Lustre is
typically deployed on HPC clusters and is managed by people who understand it,
how to set it up, and how to manage it. It's not a trivial system to get
working, and it requires a pretty large budget. It just isn't set up on your
typical server. Lustre also has a few big names behind it to provide
development resources and support (Whamcloud, now Intel).

What is the target market for ZoL? Do you want it to work on servers?
Clusters? Personal workstations? They are very different markets.

I'm using ZFS right now for a single storage server that doesn't need to
support a large cluster, so ZFS/NFS works great. But I'm using FreeBSD. For my
Linux servers, I wouldn't think of using a filesystem that wasn't natively
supported by Red Hat. I just don't want the management headache. I'm okay
running a single FreeBSD box for my storage needs, but I'm fearful of what
happens when I'd need to scale things out to multiple servers.

I wish you luck, I really do. It seems like you're not going into this blind
and you're making good decisions. But it will be an uphill battle and you'll
always have that specter of Oracle looming over you. It doesn't matter if
you're technically/legally right on the licensing front - you'll still have it
looming over you until Larry signs off on it (like you said below).

If you can get that sign off, then all bets are off, and you'd be golden!

------
leonroy
I've used ZFS (FreeNAS) for quite a few years and find it pretty flawless.
Trust it's not too dumb a question, but what advantage is there to running ZFS
on Linux when you can run it on variants of Solaris or BSD just fine?

------
DiabloD3
I've used ZoL since it was created, and zfs-fuse before that. I ran it on my
workstation for a few years (managing a 4x750GB RAID-Z (= ZFS's RAID-5 impl),
with ext3 on a 2x400GB mdadm RAID 1 root), and then swapped to BTRFS for 2x2TB
BTRFS native RAID 1 (BTRFS being Oracle's ZFS competitor, which seems to be
largely abandoned, although I see commits in the kernel changelog
periodically). I'm now back to ZFS on a dedicated file server using 2x128GB
Crucial M550 SSDs + 2x2TB HDDs, set up as mdadm RAID 1 + XFS on the first 16GB
of the SSDs for root[2], 256MB on each for ZIL[1], the rest as L2ARC[3], and
the 2x2TB as a ZFS mirror. I honestly see no reason to use any other FS for a
storage pool, and if I could reliably use ZFS as root on Debian, I wouldn't
even need that XFS root in there.
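Roughly, a pool laid out like that is created along these lines (hypothetical
partition names):

    zpool create tank mirror /dev/sdc /dev/sdd \
        log /dev/sda2 /dev/sdb2 \
        cache /dev/sda3 /dev/sdb3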

All of this said, I get RAID 0'ed SSD-like performance with very high data
reliability and without having to shell out the money for 2TB of SSD. And
before someone says "what about bcache/flashcache/etc", ZFS had SSD caching
before those existed, and ZFS imo does it better due to all the strict data
reliability features.

[1]: ZFS treats multiple ZIL devs as round robin (RAID 0 speed without
increased device failure taking down all your RAID 0'ed devices). You need to
write multiple files concurrently to get the full RAID 0-like performance out
of it, because it blocks on writing consecutive inodes, allowing no more than
one in flight per file at a time. ZIL is only used for O_SYNC writes, and it
writes concurrently to both the ZIL and the storage pool, i.e., ZIL is not a
write-through cache but a true journal.

The failure of a ZIL device is only "fatal" if the machine also dies before
ZFS can write to the storage pool, and the mode of the failure cannot leave
the filesystem in an inconsistent state. ZFS does not currently support RAID
for ZIL devices internally, nor is it recommended to hijack this and use mdadm
to force it. It only exists to make O_SYNC work at SSD speeds.

[2]: /tank and /home are on ZFS, the rest of the OS takes up about 2GB of that
16GB. I oversized it a tad, I think. If I ever rebuild the system, I'm going
for 4GB.

[3]: L2ARC is a second-level store for ZFS's in-memory cache, called ARC.
ARC is a highly advanced caching system that is designed to increase
performance by caching often-used data obsessively instead of being just a
blind inode cache like the OS's usual cache is, and it is independent of the
OS's disk cache. L2ARC is sort of like a write-through cache, but is more
advanced in that it makes a persistent version of ARC that survives reboots
and is much larger than system memory. L2ARC is implicitly round robin (like
how I described ZIL above), and it survives the loss of any L2ARC dev with
zero issues (it just disables the device; no unwritten data is stored here).
L2ARC does not suffer from the non-concurrent writing issue that ZIL "suffers"
(by design) from.

~~~
foobarqux
Can you speak more about why ZFS is better than BTRFS?

~~~
ryao
A btrfs versus ZFS comparison probably deserves a blog post of its own, but I
will try to address your question. I wrote the following on this topic last
year:

[https://groups.google.com/d/msg/funtoo-
dev/g9OY_vqVpCM/VTKF8...](https://groups.google.com/d/msg/funtoo-
dev/g9OY_vqVpCM/VTKF8Ef9ab4J)

However, significant time has passed and it requires some corrections to be
current:

1\. I have not heard of any recent data corruption issues in btrfs, although I
have not looked into them lately.

2\. btrfs now has experimental RAID 5/6 support, but is neither production
ready nor as refined as ZFS' raidz.

3\. I should have said "inline block-based data deduplication". You can
(ab)use reflinks to achieve file-level data deduplication in btrfs, but it
is not quite the same. btrfs now has a bedup tool that makes using reflinks
somewhat easier:

[https://btrfs.wiki.kernel.org/index.php/Deduplication](https://btrfs.wiki.kernel.org/index.php/Deduplication)

4\. btrfs now has some kind of incremental send/recv operation. However, it is
not clear to me how it handles consistency issues from having "write-able
snapshots":

[https://btrfs.wiki.kernel.org/index.php/Incremental_Backup](https://btrfs.wiki.kernel.org/index.php/Incremental_Backup)

5\. Illumos' ZFS implementation is now able to store small files in the dnode,
which improves its efficiency when storing small files in a manner similar
to btrfs' block suballocation. This feature will likely be in ZoL 0.6.4.

Aside from those corrections, what I wrote in that mailing list email should
still be relevant today. However, there are a few advantages that ZFS has over
btrfs that I recall offhand that I do not see there or in nisa's reply:

0\. ZFS uses 256-bit checksums with algorithms that are still considered to be
good today. btrfs uses checksum algorithms that are known to be weak.
Specifically, btrfs uses CRC32 on 32-bit processors and CRC64 on 64-bit
processors. CRC32 is the same algorithm used by TCP/IP. Its deficiencies are
well documented:

[http://noahdavids.org/self_published/CRC_and_checksum.html](http://noahdavids.org/self_published/CRC_and_checksum.html)

I have not examined CRC64, but I am not particularly confident in it. btrfs
should have room in its on-disk data structures that would allow it to
implement 256-bit checksums in a future disk format extension, but until then,
its checksum implementations are vastly inferior.

1\. The ztest utility that I described in the blog allows ZFS developers to
catch issues that would have otherwise gone into production and to debug them
from userland. No other filesystem has anything quite like it.

2\. ZFSOnLinux is the only kernel filesystem driver that is kernel version-
independent, so if you are unable to upgrade your kernel, you can still get
fixes. The inability of people to always update their kernels is an issue
Linus mentioned at LinuxCon North America 2014.

3\. The CDDL gives ZFSOnLinux a patent grant for the ZFS patent portfolio.
This is something that btrfs does not have and will likely never have unless
Oracle decides to provide one. Consequently, Oracle is the only company in the
world that I know of that is able to ship products incorporating the btrfs
source code without being at risk should btrfs infringe on one of the dozens,
if not hundreds, of patents in the ZFS patent portfolio. A small subset of
them can be accessed from the Jeff Bonwick Wikipedia page:

[https://en.wikipedia.org/wiki/Jeff_Bonwick](https://en.wikipedia.org/wiki/Jeff_Bonwick)

~~~
ScottBurson
I got the impression somewhere that btrfs RAID 5/6 support allows, or will
allow, new devices to be added to an existing RAID group.

That's an important feature for home and small business users. Having to
replace every drive in a RAID group to grow it, as ZFS requires, is painful
and expensive.

Fortunately, drives are cheap enough these days that you can just way oversize
your pool to begin with. But anyone switching to ZFS should be aware of the
need to do this.

~~~
cmurf
You can add/delete devices from raid5/6 volumes now. The raid5/6 code is still
experimental; in particular, while detected problems are fixed on the fly in
the data returned to userspace, the fixes aren't written back to the drives.
That limitation applies to normal usage and scrubbing. A balance detects and
fixes these.
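In practice that means a scrub to detect problems and a balance to rewrite the
affected data so the fixes land on disk (hypothetical mount point):

    btrfs scrub start /mnt/data
    btrfs balance start /mnt/data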

Also the determination of a drive being "faulty" (in the md/mdadm sense) and
how this gets communicated to userspace isn't in place. If it's in place for
ZoL (?) that'd be a considerable difference, a bigger one than checksum
algorithms in my opinion.

------
turrini
I created the script below a while (a year) ago. It debootstraps a working
Debian Wheezy with ZFS on root (rpool) using only 3 partitions: /boot (128M),
swap (calculated automatically), and rpool (according to the number of your
disks, mirrored or raidz'ed).

All comments are in Brazilian Portuguese. I didn't have time to translate
them to English. Someone could do it and file a pull request.

[https://github.com/turrini/scripts/blob/master/debian-
zol.sh](https://github.com/turrini/scripts/blob/master/debian-zol.sh)

Hope you like it.

~~~
ryao
Thanks for sharing. I will let Debian users interested in / on ZFS know that
this is available as they ask me about this sort of thing.

------
andikleen
When swap doesn't work, mmap is unlikely to work correctly either.

Figuring out why that is so is left as an exercise for the poster.

------
nailer
Putting production data on a driver maintained outside the mainline Linux
kernel is a bad idea.

That isn't a licensing argument - I'm happy to use a proprietary nvidia.ko for
gaming tasks, for example, because I won't be screwing up anyone's data if it
breaks.

~~~
ryao
You could be "screwing up" someone's data if an in-tree filesystem breaks. If
you read the supplementary blog posts, you would have seen the following:

[http://lwn.net/Articles/437284/](http://lwn.net/Articles/437284/)

Nearly all in-tree filesystems can fail in the same way described there. ZFS
cannot. That being said, no filesystem is a replacement for backups. This
applies whether you use ZFS or not. If you care about your data, you should
have backups.

~~~
mbreese
Well, to be honest, all filesystems that are currently in the kernel tree
started out as being maintained outside the tree. Inclusion into the kernel is
normally one of the end goals. It's part of the standard progression - 1)
rapid development outside of the tree, 2) once the filesystem is stable, it
negotiates for inclusion, 3) inclusion into the kernel, 4) maintenance /
updates as part of the main kernel development process.

ZoL isn't even trying to get included into the kernel, so it's a bit of an odd
duck here.

------
mrmondo
While I like most parts of ZFS, these days BTRFS is both stable and performs
well with a decent feature set. We moved from ZFS and EXT4 to BTRFS for a good
portion of our production servers last year - and we haven't looked back.

~~~
thijsb
Do you run RAID5/6? I had that running for half a year, and it crashed often.

Now on ZFS (raidz) and it works flawlessly.

~~~
mrmondo
We run it on iSCSI LUNs straight from our SAN so this hasn't been an issue for
us.

------
seoguru
I have a laptop running Ubuntu with a single SSD. Does it make sense to run it
with ZFS to get compression and snapshots? If I add a hard drive, again does
it make sense (perhaps using the SSD as a cache (ARC?))?

~~~
fsckin
I've never heard of someone using ZFS with a single disk. You're probably
better off with ext4.

The compression and deduplication features of ZFS are terrific on network
filers. Compression could possibly improve performance slightly on a single
disk system.

With two disks, I'd say you'd probably be better off with running RAID0 (or no
RAID at all) and having a great backup plan. Using another SSD to cache writes
to another SSD doesn't make a whole lot of sense to me.

~~~
DiabloD3
Honestly, I recommend XFS over ext4. It seems to be a much more mature file
system, and Redhat-family distros (RHEL, Centos, Scientific, etc) have
switched to XFS as the default filesystem (instead of moving to ext4 from
6.x's default of ext3; 6.x does not support XFS or ext4 for root).

XFS performs better out of the box on a wide range of hardware, while
seemingly giving stronger data reliability guarantees (but not anywhere near
ZFS's).

ZFS on a single disk, however, will still give you data checksumming, so you
can detect silent data corruption. XFS's sole missing feature as a basic
filesystem, imo, is data checksumming.
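A single-disk setup like that is only a few commands (a minimal sketch with
hypothetical device and dataset names); compression and snapshots are just a
property and a command on top:

    zpool create tank /dev/sda3
    zfs set compression=lz4 tank
    zfs create tank/home
    zfs snapshot tank/home@pre-upgrade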

~~~
lmz
RHEL6 uses ext4 as default[1] filesystem. It certainly does support ext4 as
root.

[1]: [https://access.redhat.com/documentation/en-
US/Red_Hat_Enterp...](https://access.redhat.com/documentation/en-
US/Red_Hat_Enterprise_Linux/6/html/6.0_Release_Notes/filesystems.html#idp10337936)

~~~
DiabloD3
Weird, I had to install RHEL6 for a customer, and it defaulted to ext3 and
ext4 was not selectable.

------
awonga
I've looked into ZFS before for distributions like FreeNAS. Is there any
solution on the horizon for the massive memory requirements?

For example, needing 8-16GB of RAM for something like an xTB home NAS is high.

~~~
ryao
The "massive memory requirements" only exist if you use data deduplication and
care about write performance. Otherwise, ZFS does not require very much memory
to run. It has a reputation to the contrary because ARC's memory usage is not
shown as cache in the kernel's memory accounting, even though it is cache.
This is in integration issue that would need to be addressed in Linus' tree.

