
ZFS lands in Debian contrib - turrini
https://tracker.debian.org/pkg/zfs-linux
======
dakami
So I just started experimenting with ZFS, because it seemed required for
container snapshots.

Then I found out it fragments badly, and nobody can figure out how to write a
defragmenter. So, uh, keep the FS below 60-80% full apparently.

Yeah.

~~~
hhw
Using a dedicated ZIL significantly reduces fragmentation:

[http://www.racktopsystems.com/dedicated-zfs-intent-log-aka-s...](http://www.racktopsystems.com/dedicated-zfs-intent-log-aka-slogzil-and-data-fragmentation/)

Anyone using ZFS in a serious capacity would have both dedicated ARC and ZIL.

~~~
ryao
ZIL is only used on synchronous IO. Moving it to a dedicated SLOG device would
have no impact on non-synchronous IO. A SLOG device does help on synchronous
IO though.
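A minimal sketch of what moving the ZIL onto a dedicated SLOG device looks like in practice; the pool name `tank` and the device paths are hypothetical placeholders, and `zpool add ... log` is the standard syntax for attaching a log vdev:

```shell
# Attach a dedicated log (SLOG) vdev to an existing pool named "tank".
# The device path is a placeholder for a low-latency SSD.
sudo zpool add tank log /dev/disk/by-id/nvme-slog0

# A mirrored SLOG avoids losing in-flight synchronous writes if a
# single log device dies:
#   sudo zpool add tank log mirror /dev/disk/by-id/nvme-slog0 /dev/disk/by-id/nvme-slog1

# Confirm the log vdev shows up under the pool:
zpool status tank
```

As the comment above notes, this only helps synchronous writes; asynchronous writes never touch the ZIL on disk unless the system crashes mid-transaction.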

That said, all file systems degrade in performance as they fill. I do not
think there is anything notable about how ZFS degrades. The most that I have
heard happen is a factor of 2 sequential read performance decrease on a system
where all files were written by bit torrent and the pool had reached 90% full.
That used mechanical disks. A factor of 2 in a nightmare scenario is not that
terrible.

~~~
raattgift
A log vdev is a log vdev (or a SLOG). ZIL is a badly overloaded term.

Ignoring logbias=throughput, when you have a slog you save on writing intents
for _small_ synchronous writes into the ordinary vdevs in the pool. If you do
a lot of little synchronous writes, you can save a lot of IOPS writing their
intents to the log vdev instead of the other vdevs. Log vdevs are write-only
except at import (and at the end phases of scrubs and exports).

Here's the killer thing on an IOPS-constrained pool not dominated by large
numbers of small synchronous writes: the _reads_ get in the way of writes. ZFS
is so good at aggregating writes that unless you are doing lots of small
synchronous random writes, the write IOPS tend to vanish.

Reads are dealt with very well as well, especially if they are either
prefetchable or cacheable. Random small reads are what kill ZFS performance.

Unfortunately, systems dominated by lots of rsync or git or other walks of
filesystems tend to produce large numbers of essentially random small reads
(in particular, for all the ZFS metadata at various layers, to reach the
"metadata" one thinks of at the POSIX layer). This is readily seen with
Brendan Gregg's various dtrace tools for zfs.

The answer is, firstly, an ARC that is allowed to grow large, and secondly
high-IOPS cache vdevs (L2ARC). l2 hit rates tend to be low compared to ARC
hits, but every l2 hit is approximately one less seek on the regular vdevs,
and seeks are zfs's true performance killers.
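A hedged sketch of putting this advice into practice on ZFS on Linux; the pool name and device path are hypothetical, and the `arcstats` location is where ZoL exposes ARC/L2ARC counters:

```shell
# Attach a high-IOPS cache (L2ARC) vdev to pool "tank";
# the device path is a placeholder.
sudo zpool add tank cache /dev/disk/by-id/nvme-cache0

# Watch per-vdev IOPS, including the cache device, every 5 seconds:
zpool iostat -v tank 5

# ARC and L2ARC hit/miss counters on ZFS on Linux:
grep -E '^(hits|misses|l2_hits|l2_misses)' /proc/spl/kstat/zfs/arcstats
```

Comparing `l2_hits` against `l2_misses` over time gives a rough sense of whether the cache vdev is saving seeks on the regular vdevs.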

Persistent L2ARC is amazing, but has been languishing at
[https://reviews.csiden.org/r/267/](https://reviews.csiden.org/r/267/)

It has several virtues that are quickly obvious in production. Firstly, you
get bursts of l2arc hits near import time, and if you have frequently
traversed zfs metadata (which is likely if you have containers of some sort
running on the pool shortly after import) the performance improvement is
obvious. Secondly, you get better data-safety; l2arc corruption, although rare
in the real world, can really ruin your day, and the checksumming in
persistent l2arc is much more sound. Thirdly, it can take a very long time for
large l2arcs to become hot, which makes system downtime (or pool import/export)
more traumatic than with persistent l2arc (rebuilds of full ~128GiB l2arc
vdevs take a couple of seconds or so on all realistic devices; even USB3 thumb
drives (e.g. Patriot Supersonic or Hyper-X DataTravellers, both of which I've
used on busy pools) are fast and give an IOPS uptick early on after a reboot
or import, and of course you can have several of those on a pool). "Real" SSDs
give greater IOPS still. Fourthly, the persistent l2arc being available at
import time means that early writes are not stuck waiting for zfs metadata to
be read in from the ordinary vdevs; that data again is mostly randomly placed
LBA-wise, and small, so there will be many seeks compared to the amount of
data needed.
Persistent l2arc is a _huge_ win here, especially if for some reason you
insist on having datasets or zvols that require DDT lookups (small
_synchronous_ high-priority reads if not in ARC or L2ARC!) at write time.

Maybe you could consider integrating it into ZoL since you guys have been busy
exploring new features lately.

Finally, if you are doing bittorrent or some other system which produces temp
files that are scattered somewhat randomly, there are two things you can do
which will help: firstly, recordsize=1M (really; it's great for reducing write
IOPS and subsequent read IOPS, and reduces pressure on the metadata in ARC),
and secondly, particularly if your receives take a long time (i.e., many
txgs), tell your bittorrent client to move the file to a different dataset
when the file has been fully received and checked -- that will almost
certainly coalesce scattered records.
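The two suggestions above can be sketched as follows; the dataset and file names are hypothetical, and `zfs set recordsize=1M` is the standard way to set the property (it affects newly written files only):

```shell
# Use a 1M recordsize on the dataset receiving torrent downloads;
# this only applies to files written after the property is set.
sudo zfs set recordsize=1M tank/torrents-incoming

# Moving a completed file to a *different* dataset forces an actual
# copy, which rewrites the scattered records contiguously:
mv /tank/torrents-incoming/file.iso /tank/media/
```

A `mv` within the same dataset is just a rename and would not coalesce anything, which is why the destination must be a separate dataset.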

~~~
ryao
The term ZIL is not overloaded. Unfortunately, users tend to misuse it because
the ZIL's existence is hard to discover until it is moved into a SLOG device.

As for persistent L2ARC, it was developed for Illumos and will be ported after
Illumos adopts a final version of it.

------
dmm
It appears that the kernel-level code is shipped as source to be built,
automatically by dkms, by the end user. Check out the list of binaries on the
bottom left of that page.

This means that no binary kernel modules are shipped, just the cli tools.
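A hedged sketch of what this looks like on an installed system; `dkms status` and `modinfo` are the standard tools, though the exact module version strings shown will of course differ per system:

```shell
# DKMS rebuilds the zfs module from the shipped source for each
# installed kernel; list what it has built:
dkms status

# The kind of output dkms prints (versions are illustrative only):
#   zfs, 0.6.5, 4.5.0-2-amd64, x86_64: installed

# Inspect the locally built module:
modinfo zfs | head -n 3
```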

~~~
creshal
Which is roughly the same way as closed-source drivers are handled.

~~~
jcoffland
Not really. Closed source drivers don't come with source code and aren't
compiled by dkms.

~~~
GauntletWizard
The Nvidia driver does require DKMS, but it's still a binary blob - It ships
with source for a wrapper, compiles that, and that loads the blob.

------
espadrine
Have there been any updates on the legal situation since
[http://blog.halon.org.uk/2016/01/on-zfs-in-debian/](http://blog.halon.org.uk/2016/01/on-zfs-in-debian/)?

I am obviously glad that this happened, but afraid of an Oraclocalypse.

~~~
kangar00
Oraclepocalypse? Aren't their lawyers busy with Google?
[http://fortune.com/2016/05/13/google-oracle-java-email/](http://fortune.com/2016/05/13/google-oracle-java-email/)

Here's another post about GPL violations related to combining ZFS and Linux:
[https://sfconservancy.org/blog/2016/feb/25/zfs-and-linux/](https://sfconservancy.org/blog/2016/feb/25/zfs-and-linux/)

Quote from that:

"Is The Analysis Different With Source-Only Distribution?

We cannot close discussion without considering one final unique aspect to this
situation. CDDLv1 does allow for free redistribution of ZFS source code. We
can also therefore consider the requirements when distributing Linux and ZFS
in source code form only.

Pure distribution of source with no binaries is undeniably different. When
distributing source code and no binaries, requirements in those sections of
GPLv2 and CDDLv1 that cover modification and/or binary (or “Executable”, as
CDDLv1 calls it) distribution do not activate. Therefore, the analysis is
simpler, and we find no specific clause in either license that prohibits
source-only redistribution of Linux and ZFS, even on the same distribution
media.

Nevertheless, there may be arguments for contributory and/or indirect
copyright infringement in many jurisdictions. We present no specific analysis
ourselves on the efficacy of a contributory infringement claim regarding
source-only distributions of ZFS and Linux. However, in our GPL litigation
experience, we have noticed that judges are savvy at sniffing out attempts to
circumvent legal requirements, and they are skeptical about attempts to
exploit loopholes. Furthermore, we cannot predict Oracle's view — given its
past willingness to enforce copyleft licenses, and Oracle's recent attempts to
adjudicate the limits of copyright in Court. Downstream users should consider
carefully before engaging in even source-only distribution.

We note that Debian's decision to place source-only ZFS in a relegated area of
their archive called contrib, is an innovative solution. Debian fortunately
had a long-standing policy that contrib was specifically designed for source
code that, while licensed under an acceptable license for Debian's Free
Software Guidelines, also has a default use that can cause licensing problems
for downstream Debian users. Therefore, Debian communicates clearly to their
users that this code is problematic by keeping it out of their main archive.
Furthermore, Debian does not distribute any binary form of zfs.ko.

(Full disclosure: Conservancy has a services agreement with Debian in which
Conservancy occasionally gives its opinions, in a non-legal capacity, to
Debian on topics of Free Software licensing, and gave Debian advice on this
matter under that agreement. Conservancy is not Debian's legal counsel.)"

~~~
espadrine
> _Aren't their lawyers busy with Google?_

They would only bite a golden hand anyway.

> _We are also concerned that it may infringe Oracle's copyrights in ZFS._

The Software Freedom Conservancy saying that is a bit scary. I am somewhat
less afraid of Linux copyright holders suing.

~~~
jldugger
I wouldn't be surprised if Canonical is taking a calculated risk here that
Oracle doesn't actually care, or would be actively helpful if it meant spiting
Red Hat.

------
Nursie
Excellent. I've been running some Ubuntu PPA stuff for a while now, which is
great, but things occasionally break. Looks like soon I can ditch it for pure
Debian again.

~~~
krondor
ZFS is in Ubuntu 16.04 without a PPA, by the way.

~~~
kdeldycke
For the lazy:

    $ sudo apt install zfs-dkms zfsutils-linux

~~~
StavrosK
Does anyone know if I can replace my previously-installed ZFS-on-Linux with
this and it will just see my pool and work as before?

~~~
ryao
You can.
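For anyone in the same situation, the usual sequence is roughly the following; the pool name `tank` is a hypothetical placeholder:

```shell
# If the old install auto-imported the pool, export it cleanly first:
sudo zpool export tank

# After installing the Debian packages, list importable pools, then import:
sudo zpool import
sudo zpool import tank

# If the hostid changed (e.g. after a reinstall), a force import may be
# needed:
#   sudo zpool import -f tank
```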

------
4ad
Unfortunately, it's a DKMS package, which means it gets compiled on the user's
machine on every update.

From an operational perspective this is insane; I need reliability. Of course,
in my organisation I could create a binary package and use that, but that's
more work, and then the new Debian package doesn't help me anyway.

When I need linux I just run ZFS on a better supported system and either
virtualise Linux or expose an iSCSI target from ZFS for Linux.

~~~
ashitlerferad
It isn't legal to distribute a binary of zfs.ko:

[https://sfconservancy.org/blog/2016/feb/25/zfs-and-linux/](https://sfconservancy.org/blog/2016/feb/25/zfs-and-linux/)

~~~
krondor
Canonical is doing so anyway and it seems they have a legal disagreement on
the interpretation of the law with the Software Freedom Conservancy.

IANAL so I have no clue what is technically correct, but the fact is that
Ubuntu is distributing zfs.ko in 16.04.

------
jordigh
contrib is a funny place for it. Normally contrib means free software that
depends on non-free software. In this case, it seems to have acquired the
meaning of free software that has a license incompatibility with other free
software. I wonder if we have heard the last of CDDL vs. GPL.

Technical solutions to legal problems don't work, just like GPL wrappers don't
work (at least, that's what some lawyers say). If Oracle decides to make a
stink about this, they still can.

edit: Huh, apparently last year Debian actually got advice from SFLC about
this:

[https://lists.debian.org/debian-devel-announce/2015/04/msg00...](https://lists.debian.org/debian-devel-announce/2015/04/msg00006.html)

~~~
gnufx
Yes, zfs in contrib appears inconsistent with openafs in main. Has that been
explained somewhere?

------
l1ambda
Great news. So this means Ubuntu, Debian and the new Redox OS now have ZFS. I
would love to see it officially supported in Fedora too.

~~~
rwmj
There is a DKMS package similar to the Debian one which requires just a single
command to install:
[https://github.com/zfsonlinux/zfs/wiki/Fedora](https://github.com/zfsonlinux/zfs/wiki/Fedora)

However it is unlikely that Fedora will ship ZFS unless the license changes or
is clarified by Oracle. Unlike Canonical, Red Hat is a US company making
serious amounts of money ($2bn revenue last financial year).

~~~
meddlepal
I imagine it ends up in EPEL? Or is EPEL Red Hat controlled?

------
grigio
So is btrfs dead?

~~~
ansible
No, I don't think so. I'm a small-time admin, and I think btrfs works pretty
well these days.

We have recently switched all our servers to LXC containers so that we can
take full advantage of btrfs features.

I doubt I need to explain the advantages of containers to anyone here... but
in short we've broken out all the network services (file service, LDAP, DNS,
etc.) to separate containers.

Each container is in a separate btrfs subvolume. This allows us to take
snapshots of the running systems every 10 minutes, and using btrfs
send/receive, cheaply back up those snapshots to alternate container hosts.
The send/receive stuff works better with the btrfs v4.4 tools that ship with
Ubuntu 16.04.
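The snapshot-and-replicate workflow described above can be sketched roughly as follows; the subvolume paths, snapshot naming scheme, and `backup-host` are all hypothetical, but `btrfs subvolume snapshot -r`, `btrfs send`, and `btrfs receive` are the standard tools:

```shell
# Read-only snapshot of a container's subvolume (send requires read-only):
sudo btrfs subvolume snapshot -r /srv/lxc/dns /srv/snaps/dns-1200

# Full send of the snapshot to an alternate container host:
sudo btrfs send /srv/snaps/dns-1200 | \
    ssh backup-host sudo btrfs receive /srv/snaps/

# Subsequent sends can be incremental against the previous snapshot,
# transferring only the changed extents:
sudo btrfs send -p /srv/snaps/dns-1150 /srv/snaps/dns-1200 | \
    ssh backup-host sudo btrfs receive /srv/snaps/
```

The `-p` (parent) form is what makes the every-10-minutes schedule cheap: only the delta since the last snapshot crosses the wire.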

Since the network interfaces for all the containers are bridged with the
container host, we can configure each container with its own static IP
address. So if a container host fails, those containers can be booted up on
the alternate host, and keep their IP and MAC addresses. So that's convenient,
and causes minimal disruption.

The main improvement I'd like to see with btrfs is a configurable RAID
redundancy level. Currently, RAID-1 means that there are two copies of each
piece of data / metadata. So in a 3-drive RAID-1 system that gives you extra
capacity, but two drives failing at the same time will cause data loss.

~~~
cmurf
Being lazy, I'm going to ask if you know of some LXC + Btrfs pros in contrast
to Docker + Btrfs?

Right now a gotcha with Docker + Btrfs is that SELinux contexts for each
container can be different, but the context= mount option currently applies
once per superblock (thus per fs volume, rather than per fs tree or subvolume). So
Docker's work around in 1.10.x is they do a snapshot and then relabel it with
the new selinux context then start the container. For my containers (very
basic) this adds an almost imperceptible one time delay for that container. I
doubt it's even 1 second, which in container start times might seem massive to
some.

~~~
ansible
If you have already adapted to Docker, then I'm sure you don't want to use
LXC.

Containers are long-lived and mutable. I treat them like old school servers,
just not tied to physical hardware.

------
mrmondo
This is interesting to see at a time when so many key packages are missing or
badly outdated in Debian core.

------
Eun
Finally! Hopefully it makes its way into the installer (unlike the Ubuntu
installer...)

~~~
jlgaddis
Don't get your hopes up.

------
gjvc
massive opportunity now for Oracle to generate some goodwill

~~~
Annatar
Oracle has nothing whatsoever to do with OpenZFS. As in zero, nada, zilch. And
because of CDDL, Oracle cannot take back the source code from illumos (which
contains OpenZFS code) without open sourcing the code to Solaris again:

[https://youtu.be/-zRN7XLCRhc?t=2732](https://youtu.be/-zRN7XLCRhc?t=2732)

------
yc-kraln
so... what's missing?

~~~
nailer
Legal clarity.

------
sobkas
But it didn't land in Debian main; it only landed in contrib. The title of
this link is wrong, so maybe someone should fix it?

[https://www.debian.org/doc/debian-policy/ch-archive.html#s-c...](https://www.debian.org/doc/debian-policy/ch-archive.html#s-contrib)

~~~
dang
We added 'contrib'. Will that do?

