Hacker News
Native encryption added to ZFS on Linux (github.com/zfsonlinux)
280 points by turrini on July 20, 2016 | 140 comments

Slightly off-topic but if anyone has any resources on performing a clean (preferably, Ubuntu or Arch) Linux "root on ZFS" installation, please share.

I followed the instructions for Ubuntu 16.04 on the github.com/zfsonlinux wiki [0] a while back and (encountered a few little issues along the way but) got it working, although I experienced some MAJOR performance problems so something wasn't quite right (exact same hardware is blazing fast when running FreeBSD). I can't imagine it was just "how things are" with regard to the current state of ZFS on Linux (or Ubuntu specifically) -- it was like someone hit the laptop's "pause" button a couple of times per minute.

[0]: https://github.com/zfsonlinux/zfs/wiki/Ubuntu-16.04-Root-on-...

This sounds crazy, and is, but following those instructions verbatim with a 16.04 install I managed to brick a new laptop. Yeah...

Define "brick". You flashed the wrong BIOS?

Possibly deleted the efi partition

Or "rm -Rf /" with a writable /sys/firmware/efi/efivars/ mount.
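For anyone worried about this failure mode: a common belt-and-braces mitigation (independent of whatever bug bit the parent) is to remount efivarfs read-only. A sketch, assuming the standard mount path and root privileges:

```shell
# Remount efivarfs read-only so stray writes/deletes can't touch EFI variables.
mount -o remount,ro /sys/firmware/efi/efivars

# To make it permanent, an fstab entry along these lines works:
# efivarfs  /sys/firmware/efi/efivars  efivarfs  ro,nosuid,nodev,noexec  0 0
```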


This problem was fixed half a year ago, surely Ubuntu 16.04 should have a backport of the relevant commits?


This works wonderfully for me: https://gist.github.com/binwiederhier/9ba0983b392b6468504e9b.... I know it's not a tutorial or anything, but I hope it helps.

Funtoo has a ZFS install guide that worked for me. http://www.funtoo.org/ZFS_Install_Guide

Related pull request with more details & discussion


tcaputi works with me at Datto. We use ZFS as our main file system for some 200+ PB of storage. It is a truly amazing file system, and this change, once merged, will add another layer of security for our customers. Great stuff!

At this scale, why ZoL and not a more 'native' ZFS with FreeBSD (or even Illumos)? Which also perform better and have more features?

Is it the lack of drivers? Better availability of other 3rd party software for Linux? The package manager? Or just "more brains in Linux"/"our CTO grew up with Linux"?

Is there a meaningful performance difference between ZFS in FreeBSD or Illumos and ZoL?

Edit: this is a legitimate question. I know previously Linux-related ZFS efforts used FUSE, but ZoL is native. I assume performance should be roughly equivalent between Linux and FreeBSD, certainly.

Better observability with DTrace, mdb, iostat, and vmstat on illumos-based systems for sure. Also, simple logic dictates that if something is made on something for that certain something, it's going to run best on that something. Linux has ZFS, but it's grafted on, and the illumos POSIX layer is emulated in that sense. Further, Linux's version of OpenZFS will always lag behind fixes and features in the illumos-based systems; even FreeBSD usually picks up newer versions of OpenZFS sooner than ZFS on Linux does. Linux is just "the last hole on the porting flute".

Also, experience tells me that illumos- and FreeBSD-based systems will always perform faster than Linux with regards to ZFS, but I'd have to publish a full line of benchmarks, and I bet even those would get vehemently disputed because Linux is all the rage right now, so that's a lost cause: you have to try it for yourself and make up your own mind.

I thought that ZoL and ZFS+FreeBSD were largely feature comparable?

Other than orders of magnitude more testing and experience and usage on FreeBSD. Much code is the same, but the glue may have bugs.

Well Linux receives orders of magnitude more testing and usage than FreeBSD, so I don't see what your argument is

His argument refers to the current state of this specific component.

ZFS on Linux has been around for 3 years now and is integrated in Ubuntu 16.04 - I'd argue that by now it has received more usage than the BSD version.

You can argue that the earth is shaped like a torus. I can argue that the moon is violet. That is not the point, and has nothing to do with the merit of the argument.

Setting this aside, all usage hours are not equal. How do you compare an hour of enduser desktop usage on Ubuntu 23.57.something to a usage hour of a commercial NAS storage solution internally based on FreeBSD and ZFS? Do you think the enduser performed comparable product testing cycles?

I've seen ZOL systems where the zpool claimed it was online, but contained no vdevs. `zfs list` had no datasets, but datasets were mounted, and trying to read from them got your process stuck in-kernel. It just lost/forgot its devices.

Up until one of the latest releases on every boot you could roll the dice by which name your pool would import the devices. Behaving differently on identical machines and setups. Personally, I am still not trusting that problem to not reappear again.
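For what it's worth, the standard workaround for the shifting-device-name problem is to import the pool by its stable /dev/disk/by-id paths rather than /dev/sdX names. A sketch (the pool name "tank" is made up):

```shell
# Re-import the pool using persistent by-id device paths.
zpool export tank
zpool import -d /dev/disk/by-id tank

# The vdevs should now be recorded under ata-*/wwn-* names that survive reboots.
zpool status tank
```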

ZOL is bolted on. With a large nailgun. Simple as that. At times, it feels about as integrated as pjd's original patches distributed on the FreeBSD mailing lists. And since this division is not technical but legal, based on the license choice made for Linux decades ago with regard to where the code could be exported to and what could be imported, this situation will not resolve itself.

ZFS on Linux will always lag in fixes and features because OpenZFS is developed on and for illumos first and foremost.

No; run

  zpool upgrade -v; zfs upgrade -v
on illumos and on Linux and you will discover the truth. (upgrade -v lists the versions and features instead of performing an actual upgrade.)

This is not native encryption that was committed. It is just the kernel cryptography framework required for native encryption. Native encryption comes next.

What's the stability of ZFS on linux like these days? Anyone have any positive / negative experiences to share?

We run ZFS at my company, Datto, across hundreds of machines, including devices out in the field and servers in our data centers. It has a few warts, but overall very stable and the feature set has allowed us to deliver world-class features to our customers.

What are the few warts you encounter? I'm a sysadmin/engineer at a VFX company. We have hundreds of machines, render nodes, and servers all running Linux, but we're typically on ext4, with the notable exception of our Isilon cluster.

What does your environment look like?

As a few others mentioned, the 80% issue is one. We try to keep all machines below that so performance doesn't degrade. With the number of drives we have (200+ PB total space) we also run into issues where there are "holes" in the ZFS array after replacing a disk. We've always had enough redundancy to work around any issues we find. Overall, I cannot speak highly enough of ZFS!

One that I've heard often repeated (paging user rsync) is that you should never fill your filesystems past 80% (or so). Things get slow otherwise.

> you should never fill your filesystems past 80% (or so). Things get slow otherwise.

That's a ZFS-in-general complaint, not a ZFS-on-Linux complaint.

However, it's worth mentioning that most (all?) filesystems drop sharply in performance as they near capacity.

This is also true of other filesystems, though. I don't have experience with high IO on ZFS at those rumored 80%+ capacities, but once a core filer fills up to 90% you start seeing the tickets roll in and application performance tank. 80% is usually fine, however. I wonder to what degree the performance degradation applies to OpenZFS.

The problem is much more severe on ZFS and other CoW filesystems, hence the common warning. Filling an ext4 filesystem up to 80%+ and then removing files might increase fragmentation a bit, but it'll not have that much effect on performance. On ZFS, filling the filesystem up once can kill performance on that pool permanently.

"One that I've heard often repeated (paging user rsync) is that you should never fill your filesystems past 80% (or so). Things get slow otherwise."

Hi. I've said a lot about this in HN comments, so I encourage you, future reader, to search for that. It is indeed the case.

Recently it has been suggested that the presence of a fast write cache (a SLOG, in ZFS parlance) minimizes this problem and allows you to run up to, and around, 90% without breaking the filesystem.
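For reference, attaching a SLOG to an existing pool is a one-liner. A sketch (pool and device names are made up; mirroring the log device is common practice so a failed SLOG can't take in-flight sync writes with it):

```shell
# Attach a mirrored separate intent log (SLOG) to the pool.
zpool add tank log mirror /dev/disk/by-id/nvme-slog0 /dev/disk/by-id/nvme-slog1

# The new mirror appears under a "logs" section in the status output.
zpool status tank
```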

We haven't tested this, intentionally or otherwise, but it sort of makes sense ...

Make no mistake, however - if you fill up a ZFS filesystem and run it for a while in that state, it will be permanently broken, at least in terms of performance.

A ZFS defrag utility would solve this problem, or at least provide a way out if you fall into this trap, but it has been related to me that ZFS defrag would be extremely complicated to implement.

Yes, we do indeed see this on other filesystems ... even a UFS2 filesystem with NO snapshots enabled can be effectively "broken" if you set minfree down to 0%, fill it all the way up, and then run it like that for a while. Freeing up the space and resetting minfree back to 6 or 8%, etc., doesn't fix it.

ZFS certainly slows down as it fills up, but only enough to affect high-IO, write-intensive loads like databases. A light-weight application like a home or office file server can still saturate gigabit ethernet even while 99% full.

Yes, because after 80% ZFS will consume CPU(s) in an attempt to find enough contiguous space for the new branch of blocks in order to prevent fragmentation. The fix is to either add more vdevs or to change the kernel tunable which controls the free contiguous space allocation algorithm. The latter approach is very likely to cause fragmentation where there would otherwise be none. As I don't remember the name of the tunable, you can search my comments and in one of them on the subject there will be a link to the procedure. However, I strongly recommend adding more logical units or physical disks as vdevs to the zpool instead.
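The capacity and fragmentation figures in question are visible directly from zpool, and growing the pool is just a matter of adding a vdev. A sketch with made-up pool and device names:

```shell
# Check how full and fragmented the pool is.
zpool list -o name,size,alloc,free,frag,cap tank

# Add another mirror vdev; ZFS stripes new writes across all vdevs,
# which gives the allocator fresh contiguous free space to work with.
zpool add tank mirror /dev/disk/by-id/ata-diskC /dev/disk/by-id/ata-diskD
```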

But you are only using the zpool part (zvol), not the zfs part. Correct?

I have been using it on Ubuntu 16.04, and before that on Ubuntu 15.10, for about a year. Good news is it's a first-class citizen on Ubuntu now. I have had great success with it on my home server and have plans to try it for some prod systems I'm building out.

https://insights.ubuntu.com/2016/02/16/zfs-is-the-fs-for-con... https://wiki.ubuntu.com/ZFS

We've evaluated it for our cloud database service, but ended up going back to ext4 due to memory management bugs.

In particular this one: https://github.com/zfsonlinux/zfs/issues/3645

I ran into a bug [1] that appeared to be a deadlock triggered by rsync of a relatively modest directory with tons of small files. The only thing that fixed it was `spl_taskq_thread_dynamic=0` to stop the dynamic spawning of ZFS-related kernel threads. Rather than a memory management issue, it'd stall the copy and peg my CPUs indefinitely.
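If anyone wants to replicate this, the parameter belongs to the spl module and can be set at runtime or pinned across reboots via modprobe.d. A sketch:

```shell
# Runtime:
echo 0 > /sys/module/spl/parameters/spl_taskq_thread_dynamic

# Persistent, in /etc/modprobe.d/spl.conf:
# options spl spl_taskq_thread_dynamic=0
```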

I suspect it's fixed in 0.7.0 based on some of the other related bugs I've run into since, but I've been reluctant to upgrade as of yet. Otherwise, ZFS on a home file server has been great.

[1] https://github.com/zfsonlinux/zfs/issues/3808

The Ubuntu ZFS module sets spl_taskq_thread_dynamic to zero by default.

My problem was that the large number of threads with spl_taskq_thread_dynamic=0 caused it to OOM; at one point it had an OOM at mount (with 32GB RAM). When I set spl_taskq_thread_dynamic=1, I had the same deadlock issue (that's where I stopped trying to use it and went back to FreeBSD).

The fix is not to work around the out of memory killer in Linux, but to turn off memory overcommit altogether: the system should not lie to applications that it has more virtual memory than is actually available, because that is extremely detrimental to correctness of operation. illumos based systems, for instance, never lie about such things.

ZFSOnLinux is a kernel module, not a user-space application (and you cannot OOM-kill kernel threads), and the overcommit probably isn't direct: e.g. there are 64 worker threads that all need some memory to do their work and tell the kernel that the memory allocation must not fail. If the memory isn't available, Linux either starts killing userspace processes or gets stuck. The same thing can happen on Solaris, except there ZFS doesn't use a separate memory mechanism that fails to react properly to memory pressure.

This kind of thing is what's better on platforms like freebsd

I've been running it on a machine I built in 2011, and it seems to work quite well. It started out as whatever Ubuntu was available in August of 2011, then 12.04, then 14.04.

I have 3x 3TB Hitachi Deskstars in a raidz, with a Crucial M4-CT128M4 as a combined rootfs and L2ARC (the SSD was an upgrade from an older SSD when I moved from 12.04 to 14.04).

The machine has a Xeon e3-1270 CPU, and 16GB of ECC RAM.

I used it as a workstation (now my wife does), and it has been as a home media / tv server / NAS for its entire lifetime. 0 problems so far (knock on wood), and it has made a move from the east coast to California, and then back to the east coast again.

What chassis are you using? And do you like it? I'm in the market. Thanks.

I like the chassis, but the build was ~5 years ago, and when I built a new machine ~1 year ago, it was not available.

My current machine uses a Fractal Design Define R5 Black Silent ATX Midtower case. That one is quite nice. The one gotcha that I had is that it is a bit wider than I was used to, which means that it just barely fits in the computer holder accessory for my ikea desk.

I've been running it on my laptop (Chromebook Pixel 2) for several months so far and it's great. I was waiting for this to be merged for a long time. Right now I have /home as a zpool that runs from an SD card in my computer. The SD card is a 64GB card (the most expensive/fastest I could find) and for my workload it works just amazingly: this setup leaves me with 64GB (a bit less) for / and 64GB for /home.

The ability to cd into snapshots is just great and a time saver. So far, nothing wrong/bad with this.
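For readers who haven't seen this: every mounted dataset exposes a hidden .zfs directory with read-only views of its snapshots. A sketch (dataset and snapshot names are made up, and /home is assumed to be the mountpoint):

```shell
# Take a snapshot, then browse it like a normal directory tree.
zfs snapshot tank/home@before-upgrade
cd /home/.zfs/snapshot/before-upgrade    # read-only image of /home at snapshot time
cp some-lost-file ~/                     # restore a single file with a plain copy

# Optional: make .zfs show up in directory listings.
zfs set snapdir=visible tank/home
```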

I've been using it casually as my /home for ~2 years. The most stress it gets is torrenting movies to it. But, there's been no problems. Setting it up on 3 drives was dead simple.

I've been running two very-lightly-used sets of pools on two old desktops for ... about a year and a half now actually and everything's gone swimmingly.

Nor have I read of anything bad happening to anyone that didn't involve not having backups.

All in all, lots of people are running it, lots of people seem to be having almost entirely positive experiences with it, so it seems like it's very stable.

As long as you don't backup with send/receive. ("Silently corrupted file in snapshots after send/receive" https://github.com/zfsonlinux/zfs/issues/4809)

That's a bug in OpenZFS, to be fair, not ZoL-specific, though 0.6.5 not having a fix merged for a while is slightly surprising.

Just grab https://github.com/zfsonlinux/zfs/pull/4833 and send/recv in safety. ;)

> Nor have I read of anything bad happening to anyone that didn't involve not having backups.

This should be true of any storage technology, no matter how unreliable (and thereby undermines the anecdote).

The trouble is it depends how you define 'backup'.

E.g. if you are using zfs send/receive to do backups, and there's a bug in zfs send/receive....

Some people advise making sure you back up using different techniques/strategies/software, but in practice that gets quite difficult to manage, and you quickly end up going down a very deep rabbit hole on your quest for independent backup strategies.

Consider this example: I'm aware of at least one instance where a faulty tape drive damaged the tapes used by that drive, in a way that caused those tapes to damage other drives in the same way. The damage spread like a virus. Unless you have "blue" tapes, and "yellow" tapes, and "blue" drives, and "yellow" drives, an issue like that won't be contained.

But you do the combinatorics on all the kinds of issues like that which need to be addressed, and quickly you'll end up spending your whole life backing up data. Which will sort of solve the problem, because you won't have time to create any data which needs to be backed up.

Could you give machine specs ?

    $ head -1 /proc/meminfo
    MemTotal:        4046880 kB

    $ grep model.name /proc/cpuinfo | head -1
    model name      : Intel(R) Core(TM)2 Quad CPU    Q6600  @ 2.40GHz

    $ sudo zpool list
    NAME  SIZE   ALLOC  FREE   EXPANDSZ  FRAG  CAP  DEDUP  HEALTH  ALTROOT
    tank  36.2T  27.2T  9.09T         -   10%  74%  1.00x  ONLINE  -
That's 10x 4TB WD Red drives in raidz2 configuration.

I've been running ZFS for the past 6 years, a little over 2 years on ZoL, and 4 years before that on Nexenta (Ubuntu userland with OpenSolaris kernel).

I've had 5 HDD failures over the past few years, never lost any data.

I've found ZoL to have less jitter when streaming video files in particular over SMB than the Solaris implementation did. ZoL is tighter on memory though.

raidz2 isn't particularly fast for random I/O (since it acts like it has a single spindle), but I don't do much random I/O. It's mostly video and local backup cache.

I'll be thinking about expanding my current setup soon enough. I'll probably be aiming for a raid10 setup for better random access, and of course the new system will have more RAM, and ECC.

Thanks. Beefier CPU than the one I've got here, but the RAM is a great hint that for personal needs ZFS can run on normal machines.

Interesting that you had no issues without ECC in the first place. Some ZFS boards are pretty against trying ZFS without.

> Interesting that you had no issues without ECC in the first place. Some ZFS boards are pretty against trying ZFS without.

ZFS without ECC is no better or worse than any other filesystem without ECC, in that silent data corruption is possible. For that reason alone it is a lot worse than ZFS with ECC, since it introduces a fault condition normally not seen on ZFS.

But that fault condition does not go away because you chose to skip ZFS due to no ECC. And you will get a lot of other problems on top that ZFS would have helped against, even without ECC.

Lastly, you'll end up back in the horrible land of static partitioning schemes with questionable tooling. Why suffer through that if you do not have to?

ZFS boards really encourage you to use ECC because one of the nominal features of ZFS, "end-to-end checksumming", is not quite as useful if you have a point between ends where it won't notice silent data corruption (e.g. a bad stick of non-ECC RAM).

ZFS will notice, depending on _when_ the munging happened (if it was mangled after it got to RAM but before ZFS checksummed it, it obviously can't), but that's (primarily) why ZFS boards often tell you not to do that.

Not only that, but I believe a bad checksum can also cause a cascade to "fix the data", causing further corruption.

For that to happen, you would need a bad checksum and the corresponding data block to be corrupted in such a way that the corrupted data block checksums to the corrupted checksum. Given that the checksum is 256-bit, the scariest I'd call that is "possible in theory".

ZFS doesn't really need ECC that much; it's just FUD.

Most likely someone got bitten by faulty RAM and got corrupted data, which stopped after they got ECC RAM. Hence the "ZFS wants ECC" myth, when in reality you just need stable hardware.

At the same time, if you are running like 30-40 TB pool, you probably want that ECC RAM, it will cost only small fraction of total storage box price and will save you from rebuilding a pool if shit happens.

I don't know about anyone else, but when I'm seeking advice about software from people on the internet whom I don't know personally, I tend to weight the words of the authors of the software in question more highly than that of people who didn't write it. There are probably exceptions to that I'm currently forgetting.

And the authors of ZFS, from the Sun days on, have been consistently and repeatedly saying that if you care about your data, use ECC.

I personally don't get why people seem so resistant to using ECC. It isn't that much more expensive and we know these errors happen in around 8% of DIMMS[1]. It reminds me of people I know who refuse to buy a decent power supply and then complain about hardware failures because of shoddy power.

Personally, paying a slight premium to avoid an almost one in 10 likelihood of silent data corruption seems like a no-brainer to me. But then, I care about my data.

Edit: to be clear, ZFS will perform less dangerously with non-ECC ram than a filesystem that doesn't checksum with non-ECC, because it will detect Bad Things happening and tell you about it. ECC helps avoid the problems in the first place.

[1] http://research.google.com/pubs/pub35162.html

ECC isn't that much more expensive, if you already have a processor and motherboard that give you that option. If you're trying to use ZFS on surplus consumer Intel equipment, ECC is a major expense.

And aside from the cost issue, I don't think people are resistant to using ECC. You have probably just misinterpreted people who are justifiably shooting down ZFS/ECC scaremongering: Advising people to never use ZFS without ECC RAM is bad advice, because the advice should simply be to use ECC RAM if you want enterprise grade reliability, whether or not ZFS is part of the picture. Some people aren't in need of that level of reliability but can still benefit from ZFS, and they shouldn't be misled into thinking that ZFS has some particular need for ECC RAM.

I agree on the last part, and I think the particular weirdness here is a result of ZFS historically mentioning the value of ECC a lot more than other FSes. I think this probably makes people think that ZFS in some way depends on it more than others, rather than simply pointing out that there's a disturbingly high chance of encountering a problem without it that applies to everything.

And sure, if you're building a frankenbox, ECC is probably not an option, or the sort of expense that takes it out of the frankenbox category. I do hope that folks wouldn't store important data on dodgy hardware, but that is about more than RAM, and also none of my business.

I don't believe I'm misinterpreting people's reactions - I've witnessed people assert that ECC is a scam, waste of money and similar. Even after I point them to that link I posted above, they still seem to believe that It Won't Happen To Them. ("I don't have millions of machines.")

In any case, I still find it frankly bizarre that people run the risk with important data. If you asked people if they wanted to buy a CPU that had an ~8% chance of undetectably lying to them, I'm pretty sure the vast majority would at least want to spring for the premium one that allows at the least detecting the lie.

ZFS does depend on ECC more than other filesystems, in that it does data checksumming, which other filesystems (mostly) do not do. Those checksums can and will lie without ECC, in worst case rendering a data protective measure into a data losing measure.

> I've witnessed people assert that ECC is a scam, waste of money and similar. Even after I point them to that link I posted above, they still seem to believe that It Won't Happen To Them. ("I don't have millions of machines.")

If by your own citation 92% of DIMMs operate with zero errors per year, and a consumer machine has at most four DIMMs, then it is actually pretty likely that any given consumer machine will operate without RAM errors even without ECC. And when you multiply by the low probability that a DRAM error will cause catastrophic data loss, then it is very easy to come to a reasonable conclusion that ECC is not worth the expense.

>And when you multiply by the low probability that a DRAM error will cause catastrophic data loss, then it is very easy to come to a reasonable conclusion that ECC is not worth the expense.

If your data is worth nothing, then ECC isn't worth the expense.

That's part of the difference between more enterprise level and consumer level equipment. I do some work on my home computer, but I don't have ECC in it. I probably push a few terabytes of work related information over it a year. The rest is many more terabytes of movies and music, and other things that will never notice a bit error. At work where I move 10s of terabytes of information a day, and that information may have cost many hundreds of manhours to create, I use enterprise level memory, disks, and other parts.

I've seen both servers and desktops develop bad ram. You want to know what the difference is when it happens? I get MCE logs from the server and we replace the equipment before anything bad happens. When it happens on the desktop you can end up with crashing programs, reboots, and even worse, corrupt data written to disk.

Hey, it is your data. I personally don't like gambling with mine, but yours is none of my business.

Again, I don't know how many people out there buy other products with a nearly 1 in 10 chance of undetectably not performing the function they are supposed to perform, but it seems nutty to me.

Although that line of thought does go some way towards explaining the vitamin business...

It's disingenuous of you to keep putting the error probabilities in terms like "nearly 1 in 10" without acknowledging the context that you're talking about the probability of a transient error occurring at any time over the course of a full year of continuous operation. Most people actually are comfortable with the idea that their equipment will have occasional downtime or faults, but you're trying to paint a very different picture.

Buddy, I would heartily encourage you to believe whatever you makes you happy. Your insult is false and petty, and I think you're pretty wrong about the rest of that.

I personally consider it pretty disingenuous to call a fault that damages data on disk 'transient'. The root cause may have been transient, but the damage doesn't go away if/when a stuck bit functions normally again. Or do you consider a stroke leading to paralysis a transient injury?

I'd also like to see a cite that "most people actually are comfortable with the idea that..." a 'transient' fault that scrambles random data they've chosen to keep. Where, exactly, are you getting this?

Finally, assuming you have some actual basis for that claim, how many of those people run ZFS? You wouldn't conflate nontechnical people who buy the cheapest box at Best Buy with folks who take the time to configure software RAID across several disks using a nonstandard filesystem, would you?

"if you care about your data, use ECC."

This is great advice, but it has nothing to do with ZFS. It's true no matter what filesystem you use.

It's a good sense thing. Probably means "Don't take steps to use ZFS for data integrity if you don't use simpler, lower level precautions".

It's not so much that ZFS needs ECC anymore than any other filesystem, but rather when you do have RAM errors, you are going to find out very quickly.

Other filesystems will happily let you write corrupt data blocks...

See my other comment a bit up in the thread. ZFS does data checksumming which relies on the ram not lying to you.

But at least with ZFS checksums you have a statistical chance of detecting bad RAM because it will sometimes manifest as checksum errors, whereas with non-checksumming filesystems you just get silent data corruption.

I don't think I've seen ZFS use much more than a single core. Apart from it being left over from upgrading desktop machines, I put that CPU in in there for video transcoding.

It's well worth while creating a bunch of sub-filesystems (something ZFS makes really easy) with different settings: block sizes, compression, etc. - and copying sample data to it, to tune things in. The defaults are not necessarily a good fit for everybody. There's no reason not to turn on lz4 compression, for example.
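A sketch of the kind of per-dataset tuning described above (pool and dataset names are made up; children inherit settings from their parent unless overridden):

```shell
# Different settings per dataset, all under one pool.
zfs create -o compression=lz4 tank/media
zfs create -o compression=lz4 -o recordsize=16K tank/db   # smaller records suit database IO

# Verify what each dataset ended up with.
zfs get compression,recordsize tank/db
```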

If you create zvols for use with different filesystems or sharing out over iSCSI, watch out for 100+% space consumption (e.g. 500G volume takes up 1+T space out of the pool, which is more than you would expect even with 8+2 raid overhead). I understand it relates to block vs stripe mismatch. This makes it a little less handy for things like virtual machine backing store than I had originally hoped.
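To see this effect, compare a zvol's nominal size against what it actually consumes. A sketch with made-up names; since the block-vs-stripe mismatch mentioned above is the usual suspect, recreating the volume with a larger block size is one commonly suggested mitigation:

```shell
# Compare nominal size vs actual pool consumption for a zvol.
zfs get volsize,volblocksize,used,referenced,refreservation tank/vm-disk

# Possible mitigation: recreate with a larger volblocksize so each block spans
# the raidz stripe more efficiently (at the cost of more read-modify-write
# overhead for small IO).
zfs create -V 500G -o volblocksize=64K tank/vm-disk-new
```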

I think the ECC myth has long ago been debunked no?

The ECC myth?

Issues related to not having ECC RAM are rare but if they happen, you might not notice the corrupted data, unless you try to access it.

Wow! That is a lot of porn!

ZFS can run on pretty low specs unless it's being pushed hard. I use a low-power AMD Sempron to keep things quiet. These specs are enough for all my backup and video streaming needs:

    $ head -1 /proc/meminfo
    MemTotal:        1967672 kB

    $ grep model.name /proc/cpuinfo | head -1
    model name: AMD Sempron(tm) 145 Processor

The in-development version of pfSense allows installs to ZFS, and I've installed onto a 2-core Atom with 2GB RAM and a small SSD.

But it's not as performant as UFS on the same hardware.

I've been running it for over 8 years on Arch Linux using the AUR packages, from FUSE through to the kernel driver. I run a RAIDZ2 pool that is now upgraded to ~4TB capacity from 4 2TB disks. I have had no known data loss. I did once lose my (non-ZFS) boot drive to hardware failure and the zpool mounted right up with a fresh installation.

It's quite good. I've been running ZFS on my Ubuntu server for about a year now. It's been quite solid. It detected and corrected some corruption resulting from overheating due to fans caked with dust. I've recently switched over to root-on-ZFS and it's continued to run well.

Why would each file system need "native encryption"? What gains are there from this?

Encryption seems it would be more cleanly implemented transparently underneath the file system level.

There's a few things.

First off, ZFS takes advantage of doing its own volume management on bare metal. Some aspects of data recovery/resilience work better when it can interact directly with the device, particularly for resistance to bit rot (which is one of ZFS's biggest advantages). Putting something like LUKS in between isn't as bad as LVM or hardware RAID, but is not great.

Second, it enables the use of different ways of using a particular block cipher like AES (namely GCM instead of say XTS) that have some advantages such as authentication of data. I'm not sure if it's an option for this particular implementation, but nothing about the way ZFS encryption works would preclude using XTS, while GCM doesn't really map well to block device encryption (there is no good way to store the extra IV and authentication code, while ZFS can put it in the metadata).

There are of course disadvantages. Some information about the structure of the data on your drive is accessible that would not be if you used block encryption. Also, unlike LUKS or LUKS w/ LVM you can't easily mix filesystem types on the same drive set.
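For contrast, the block-level approach mentioned above looks like this with LUKS: AES-XTS under the filesystem, with the pool created on the mapper device and unaware of the crypto. A sketch (device and mapping names are made up):

```shell
# Classic block-level encryption below the filesystem.
cryptsetup luksFormat --cipher aes-xts-plain64 --key-size 512 /dev/sdb
cryptsetup open /dev/sdb cryptdisk

# ZFS (or any filesystem) then sits on top of the plaintext mapping:
zpool create tank /dev/mapper/cryptdisk
```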

I wonder what exactly is better with direct hard-drive access here. If there is bit rot under LUKS, I'd expect ZFS to fix it too, so what's different? ("Some aspects of data recovery/resilience work better when it can interact directly with the device, particularly for resistance to bit rot (which is one of ZFS's biggest advantages).")

Encryption is easier to implement at the block level, but it's also much less effective.


Thomas, have you looked at this code? Can you confirm if it's actually doing seemingly-sensible authenticated filesystem-level encryption?

I know some previous efforts at "native ZFS encryption" essentially re-implemented AES-XTS block-level encryption below the FS, which seems unlikely to offer any advantages.

Here's one example where it would be helpful: say you have a simple mirrored pool built from 2x LUKS-encrypted drives in a low-powered NAS device.

With the encryption underneath ZFS, the encryption during a write necessarily happens twice, once for each LUKS mapping, which increases CPU load, reduces throughput, or both.

With the encryption in the ZFS layer, data only needs to be encrypted once during a write, after that the data can be written to as many drives as necessary without any additional overhead.
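A sketch of both layouts (device names are made up, and the `encryption`/`keyformat` properties reflect the interface this pull request works toward, so treat it as illustrative, not a recipe):

```shell
# LUKS under ZFS: every write to the mirror is encrypted twice,
# once per dm-crypt mapping.
cryptsetup luksFormat /dev/sda2 && cryptsetup open /dev/sda2 crypt0
cryptsetup luksFormat /dev/sdb2 && cryptsetup open /dev/sdb2 crypt1
zpool create tank mirror /dev/mapper/crypt0 /dev/mapper/crypt1

# Native ZFS encryption: data is encrypted once, above the vdev layer,
# and the same ciphertext is written to both sides of the mirror.
zpool create tank mirror /dev/sda2 /dev/sdb2
zfs create -o encryption=on -o keyformat=passphrase tank/secure
```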

I see.

Encryption is one possible filter one could apply to data before it goes to the block device. Another is compression. I guess ZFS has that one also covered. Could ZFS provide some generic way to apply different such filters to each "mirror"?

No, datasets do not map to vdevs (mirrors). Since it came up in another comment as well: a dataset is a child object of a pool, but it is also the root-object of a subtree that consumes storage in that pool. To the user this dataset is presented as either a filesystem or a blockdevice to name the two most common options.

Apart from internal accounting structures like the spacemap, all storage space consumed in a zpool belongs to a dataset in some way. It might be a data block for a file in that dataset, or an old block still referenced by a snapshot of a dataset.

A lot of zfs commands work on datasets (send/receive, snapshot, clone, ...). They are also the point where settings such as compression, deduplication etc can be enabled/disabled as well as traditional filesystem mount options like noatime or noexec.

All the datasets consume storage from the pool, which dynamically stripes over all configured vdevs. If you enable an option for a dataset, you enable it for all storage of that dataset which ends up on all vdevs. You can not delegate a dataset to a specific vdev and then enable some option on that vdev.
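For the unfamiliar, a few of the dataset-level commands mentioned above (pool, dataset, and host names are invented for the example):

```shell
zfs create -o compression=lz4 tank/projects   # property applies to this dataset...
zfs set atime=off tank/projects               # ...wherever its blocks land across vdevs
zfs snapshot tank/projects@friday
zfs send tank/projects@friday | ssh backuphost zfs recv backup/projects
```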

Also sorry to everyone who knows enough about ZFS internals to realize that I just took their design, pulled it behind a shed and hit it with a blunt, heavy object.

Because in the unified storage model that ZFS (and btrfs and friends) have, being forced to set up specifically encrypted zvols and zpools is a bunch of arse.

As noted in the PR, in this case it's a licensing issue.

No, I mean why would ZFS have anything to do with encryption at all?

It doesn't seem to be a filesystem issue, rather something you put underneath it (encrypting the block device, no matter what file system is on top) or on top of it (encrypting separate files).

Encrypting on top of ZFS negates ZFS' compression and de-duplication. My understanding is that this implementation preserves both of those. For example, ZFS can compress before encrypting.

In order to do filesystem encryption properly, it needs to be done at the file layer, not the block layer. Block-level encryption is not authenticated because there are no extra bytes to add the authentication tag.

If a ciphertext is not authenticated, it can be trivially tampered with. This means that someone with access to the encrypted drive could add backdoors or otherwise tamper with the executables and data even though it is encrypted.

> If a ciphertext is not authenticated, it can be trivially tampered with.

Shouldn't it be impossible to forge a plaintext without the key for a good encryption algorithm?

I imagine a good algorithm not to be just key -> pseudorandom stuff that is XORed with data, but something that cascades: change a bit anywhere, and a whole block changes unpredictably. Include the physical position of the block in the key, so that it is impossible to copy blocks around to duplicate data.

You can perfectly well divide the block device into slightly smaller chunks so that a MAC or similar fits at the end.

This is how FreeBSD's GELI (which has authenticated encryption for block devices) did it. For every 4k data block it presented up the stack, it consumed nine 512-byte sectors on disk. Each of them carried 480 bytes of data; the rest held the MAC.

With 4k-native drives, this became completely impractical. To keep the ratios similar, you would have to present 32k blocks up the stack, which filesystems have trouble with. Or use one MAC sector per data sector or similar, cutting your storage in half.
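The arithmetic behind those numbers (the 4k figures assume the same 480-of-512 payload ratio scaled up, which is what yields the 32k logical block):

```shell
# 512-byte sectors, 480 payload bytes each: sectors needed per 4k logical block
payload=480; logical=4096
echo "$(( (logical + payload - 1) / payload )) sectors"    # ceil(4096/480) = 9
echo "$(( 9 * 512 )) bytes on disk per 4096-byte block"    # 12.5% overhead

# 4k-native sectors at the same ratio (3840 payload bytes of 4096):
payload4k=3840; logical32k=32768
echo "$(( (logical32k + payload4k - 1) / payload4k )) sectors per 32k block"
```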

My understanding is that in practice this is quite hard to do. As far as I know, none of the mainstream block-layer encryption systems (BitLocker, FileVault, dm-crypt) provide authenticated encryption.

Here's a short thread about the problems with adding it to dm-crypt: http://comments.gmane.org/gmane.linux.kernel.device-mapper.d...

I'm sure you can find more threads if you look around.

Thanks! That’s good to know. Indeed, encryption without authentication is poor security.

With this you can encrypt one dataset and leave another unencrypted, on the same block device, paying the cost only for the encrypted one.

Which means you can also potentially use different keys for each encrypted filesystem: one for each user, or various other scenarios.

That’s interesting, but it again seems like something which can be implemented independently of the particular choice of file system.

Only sort of?

ZFS is both a filesystem and a logical volume abstraction, so to implement different keys on different "filesystems" on ZFS, you'd need to expose them as block devices, not filesystems, then use your encryption du jour on them, then your filesystem atop that. That also kills most of your compression and deduplication benefits, since you're encrypting before ZFS sees the data, so to speak.

I guess I was thinking more like a 9P[0] approach where a file system can be given to a module (or any program) and based on it the module can export a related file system. For example the module would encrypt the files before storing it to the given file system.

However, I realise this does not quite fit into the current kernel architectures.

[0] https://en.wikipedia.org/wiki/9P_(protocol)

What is a dataset in this setting?

Think flexible-sized partitions. Then forget it again because it is too wrong.

The one really nice scenario this opens up that is discussed in the pull request linked in another comment:

  1. Server A has an encrypted dataset in pool foo, currently not decrypted
  2. A can send full or incremental streams from that dataset to server B
     without decrypting
  3. B can receive those streams and import them as encrypted
     datasets into the pool without decrypting or really ever having
     even seen the keys
  4. Server A can restore from B as required
  5. The owner of the key material can log into B and unlock
     the dataset as if it were on A
This is a really nice, accessible way for encrypted remote backups.
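In the syntax OpenZFS eventually settled on, this is a "raw" send; a hypothetical version of the scenario above (host and dataset names invented):

```shell
# on server A: send the snapshot raw (-w), without ever loading the key
zfs send -w tank/secret@snap | ssh serverB zfs recv backup/secret

# on server B the dataset sits fully encrypted; only the key owner can
zfs load-key backup/secret
zfs mount backup/secret
```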

Cool. As I mentioned in some other comment, I would really have to look into the details before I could use and trust such an encrypted system, to understand the guarantees provided.

It's a feature that trades off layering purity (which ZFS has never adhered to) for increased usability.

I see. When crypto is concerned, this kind of purity seems attractive though… Easier to reason about when using it.

If I were to use ZFS’s native crypto I would have to look into the details of their implementation to understand what guarantees are provided, and where in the stack the crypto is applied etc…

In fact, it's significantly harder to reason about encryption at the block layer, and not just because it's difficult to reason about the most popular construction (XTS). Among other things, you have to come up with a semantic for handling authentication failures at a sector level. What does it mean for one sector of a file to fail authentication? Have you lost the whole file? Do you fill with zeroes? How do you signal the precise error to the kernel or users?

The semantics seem pretty straightforward. That sector is untrusted. You do not fill with zeroes(!), but fail as hard as you can.

I understand that the kernel currently may not have any semantics for files or sectors failing authentication, but that issue doesn't seem to depend on the file/sector dilemma and would have to be solved anyhow.

To me, this is very simple:

File-level encryption has total format flexibility. A filesystem can store arbitrary metadata. It turns out that strong encryption schemes want internal metadata. You can go through contortions --- contortions no block encryption scheme currently goes through --- to get some sort of space for metadata at the block layer, at significant expense. Or you can get it for free with files.

File-level encryption is also message-aware. Encryption cares about the boundaries between messages (just for instance: because it leaves no ambiguity about the set of disk sectors that makes up any particular file at any particular moment in time). Block-level crypto can't provide this; it's by definition oblivious to messages.

We have block-level crypto because it's a cheap, easy win, for the minimal security it provides. If the FBI is interdicting you, block crypto isn't going to help: they incurred weeks' worth of planning before anyone with handcuffs approached you, and the extra planning it takes to make sure they grab you while your key is resident is a rounding error. But if you leave your laptop full of financial secrets in a taxi cab, block crypto will probably keep you from having to roll all your keys, change your account numbers, notify your clients, etc.

Is there a way to do file-level encryption without leaking the number of files and their sizes? Because that seems much worse than any known problem with block-level encryption.

But won't the FBI also wait to make the grab when using filesystem encryption? We're not talking about some filesystem that uses unique passwords for every file.

Unlike block encryption, filesystem encryption doesn't have to be all-or-none; you can have lots of different keys, unlock on a directory-by-directory basis, whatever. There's room for arbitrary metadata. If you're using your computer at all, block encryption is unlikely to help you.

I'm not saying any particular extant system does those things; I'm just saying they're a possibility for filesystems and foreclosed for device-level encryption.

This is why ninjas partition their disk into lots of devices. :)

Except that partitioning a single device completely falls apart when using ZFS, as zpool is designed to manage entire drives, and managing multiple partitions as vdevs on a single device would not only kill performance, but severely imperil data redundancy. Once the drive kicked the bucket, all the vdevs would be gone too, causing an outage.

So now thanks to this there are two implementations of AES in the linux kernel. Who's responsible for ensuring they are both correct?

The other AES implementation is exported only to GPL-licensed modules and thus can not be used by ZFS.

(And no, it's not the USA crypto export restrictions circa 1994, it's two pieces of freely available code with different notions of freedom coded in.)

This is only part of the ZFS on Linux port, not part of the kernel proper (i.e. mainline). The kernel team is responsible for their part, and the ZFS on Linux team, since they have to import their own implementation due to licensing concerns, must maintain the library they ship.

Note that the commit linked here is a port of the Illumos Crypto Framework (ICF), which is a dependency but is not the change that actually brings native encryption.

That is a terrifying amount of crypto code. Has anyone audited this, or plan to?

It was Signed-off-by two devs @llnl.gov so there's no need to worry, the U.S. Government is looking out for us. /s

Is there really a need to make zfs your root volume? You can reinstall your root volume in a few minutes from a flash drive. What you really want is your home directory to be zfs, and just do all your work in your home directory. Saves all the grief of trying to make it your boot volume and it works just as well.

This is basically what I do. A copy of the root volume can be kept in a ZFS filesystem since the zpool can be easily mounted during any recovery.

I think this is most useful for cross-compatibility. Right now if a client uses FreeBSD ZFS, and you need to mount it to access project files on your Linux desktop, you can't if they used encryption. But after this is standard, you should be able to mount the same ZFS filesystem anywhere.

Do you create zpools on memory sticks? Or how do you export the pool and move the device with the exported pool on it to your linux desktop?

Yes, you can use ZFS on memory sticks. You can also "zfs send | zfs recv" to copy a single project's dataset snapshot. You don't need to use actual disks, though: if someone wants to send you project files, they can either send you a file from "zfs send > snapshot.zfs" or send it over SSH, "ssh friend zfs send | zfs recv".

Encryption could be an issue if, for example, someone uses a FreeBSD-based NAS for large data files and you want to skip the network and just access them directly from your Linux box. You can "zpool export; zpool import", but not if they used encryption. That's where I think this will be useful, because then we have one standard filesystem we can use everywhere.

I am well aware of all of that. But all the zfs send/recv options can by their very definition not have the full disk encryption problem hinted at in the comment I replied to.

Also, if it is easier to take your NAS offline and apart to chuck the disks into your desktop (compared to exporting it over the network), then your NAS is too small. What you were looking for is a laptop.

Which leaves abusing a zpool on a memory stick as a data interchange format. Most likely with copies=1, so you have to add some par2 files anyway, at which point you could simply put them on figuratively any other filesystem out there, and encrypt them with gpg/openssl etc. That way I would also not have to run a potentially maliciously crafted filesystem within my kernel.

> We cannot use the Linux kernel's built in crypto api because it is only exported to GPL-licensed modules.

What a strange choice by the Linux kernel.

The crypto code is all very internal to the kernel, so of course it will be GPL. Besides, I personally think that all kernel drivers should be GPL because they are essentially all derivative works of the kernel. The ability of module loading to get around the GPL is one of the more worrying decisions made by Torvalds.

Why is this a strange choice? Isn't most of the kernel GPL? (and thus, a non-GPL component can't link to the GPL part, as that would be a violation of the GPL, would it not?)

As someone who just last night set up an Ubuntu 16.04 server with the intention of using ZFS, should I wait until this hits the Ubuntu repos? Is it possible to enable this encryption on an existing filesystem?

Similar questions here. It's unclear to me how I can integrate this into my system and how to go about it. I'd hate to have to blow away my zpool to do this.

My zpool isn't very full. Maybe I can make a new dataset that's encrypted and move everything from the unencrypted dataset into it. Of course, being able to encrypt in place would be ideal, but I'll settle for a painless operation that doesn't require blowing away the zpool.
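As far as the PR describes it, encryption is a create-time dataset property, so encrypting in place isn't on offer; the move-to-a-new-dataset route would look roughly like this (dataset names invented, and the property names reflect the interface the PR works toward):

```shell
zfs create -o encryption=on -o keyformat=passphrase tank/data-enc
rsync -a /tank/data/ /tank/data-enc/   # or cp -a, or per-dataset zfs send
zfs destroy -r tank/data               # once you've verified the copy
```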

So what's the relation of this to OpenZFS? Is this currently just for the Linux port, and not yet pulled into OpenZFS for other platforms?

This work is being done in ZFS on Linux but will make its way back into OpenZFS/Illumos and FreeBSD.

So does ZFS on FreeBSD support native encryption? Can I switch my existing pool?

Native encryption with AES-NI support sounds awesome. When is it likely to make it into the Ubuntu repos?
