Slightly off-topic but if anyone has any resources on performing a clean (preferably, Ubuntu or Arch) Linux "root on ZFS" installation, please share.
I followed the instructions for Ubuntu 16.04 on the github.com/zfsonlinux wiki [0] a while back and (encountered a few little issues along the way but) got it working, although I experienced some MAJOR performance problems so something wasn't quite right (exact same hardware is blazing fast when running FreeBSD). I can't imagine it was just "how things are" with regard to the current state of ZFS on Linux (or Ubuntu specifically) -- it was like someone hit the laptop's "pause" button a couple of times per minute.
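If anyone else hits the same symptom, my plan for the next attempt is to rule out the usual suspects before blaming ZoL itself: a wrong ashift on the pool and ARC memory pressure. A rough sketch of what I'd check (pool name is whatever yours is called; these are stock ZoL tools as far as I know):
$ zpool iostat -v 5                          # watch per-vdev latency/throughput while it stalls
$ zdb -C rpool | grep ashift                 # ashift=9 on 4K-sector drives would explain a lot
$ head -20 /proc/spl/kstat/zfs/arcstats      # ARC size vs. target, to see if it's being squeezed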
tcaputi works with me at Datto. We use ZFS as our main file system for some 200+ PB of storage. It is a truly amazing file system, and this change, once merged, will add another layer of security for our customers. Great stuff!
At this scale, why ZoL and not a more 'native' ZFS with FreeBSD (or even Illumos)? Which also perform better and have more features?
Is it the lack of drivers? Better availability of other 3rd party software for Linux? The package manager? Or just "more brains in Linux"/"our CTO grew up with Linux"?
Is there a meaningful performance difference between ZFS in FreeBSD or Illumos and ZoL?
Edit: this is a legitimate question. I know previously Linux-related ZFS efforts used FUSE, but ZoL is native. I assume performance should be roughly equivalent between Linux and FreeBSD, certainly.
Better observability with DTrace, mdb, iostat, and vmstat on illumos-based systems for sure. Also, simple logic dictates that if something is made on something for that certain something, it's going to run best on that something. Linux has ZFS, but it's grafted on and the illumos POSIX layer is emulated in that sense. Further, Linux's version of OpenZFS will always lag behind fixes and features in the illumos-based systems; even FreeBSD usually gets newer versions of OpenZFS sooner than ZFS on Linux does. Linux is just "the last hole on the porting flute".
Also, experience tells me that illumos and FreeBSD based systems will always perform faster than Linux with regard to ZFS, but I'd have to publish a full line of benchmarks, and I bet even those would get vehemently disputed because Linux is all the rage right now, so that's a lost cause: you have to try it for yourself and make up your own mind.
ZFS on Linux has been around for 3 years now and is integrated in Ubuntu 16.04 - I'd argue that by now it has received more usage than the BSD version.
You can argue that the earth is shaped like a torus. I can argue that the moon is violet. That is not the point, and has nothing to do with the merit of the argument.
Setting this aside, not all usage hours are equal. How do you compare an hour of end-user desktop usage on Ubuntu 23.57.something to a usage hour of a commercial NAS storage solution internally based on FreeBSD and ZFS?
Do you think the enduser performed comparable product testing cycles?
I've seen ZOL systems where the zpool claimed it was online but contained no vdevs. `zfs list` had no datasets, yet datasets were mounted, and trying to read from them got your process stuck in-kernel.
It just lost/forgot its devices.
Up until one of the latest releases, every boot was a roll of the dice as to which device names your pool would import with, behaving differently on identical machines and setups. Personally, I still don't trust that problem not to reappear.
ZOL is bolted on. With a large nailgun. Simple as that. At times, it feels about as integrated as pjd's original patches distributed on the FreeBSD mailing lists.
And since this division is not technical but legal, rooted in the license chosen for Linux decades ago and the restrictions it places on what code can be brought in or shipped out, this situation will not resolve itself.
This is not native encryption that was committed. It is just the kernel cryptography framework required for native encryption. Native encryption comes next.
We run ZFS at my company, Datto, across hundreds of machines, including devices out in the field and servers in our data centers. It has a few warts, but overall it's very stable, and the feature set has allowed us to deliver world-class features to our customers.
What are the few warts you encounter? I'm a sysadmin/engineer at a VFX company. We have hundreds of machines, render nodes, and servers all running Linux, but we're typically on ext4, with the notable exception of our Isilon cluster.
As a few others mentioned, the 80% issue is one. We try to keep all machines below that so performance doesn't degrade. With the number of drives we have (200+ PB total space) we also run into issues where there are "holes" in the ZFS array after replacing a disk. We've always had enough redundancy to work around any issues we find. Overall, I cannot speak highly enough of ZFS!
This is also true of other filesystems, however. I don't have experience with high IO on ZFS at those rumored 80%+ capacities, but once a core filer fills up to 90% you start seeing the tickets roll in and application performance tank. 80% is usually fine, though. I wonder to what degree the performance degradation occurs on OpenZFS.
The problem is much more severe on ZFS and other CoW filesystems, hence the common warning. Filling an ext4 filesystem up to 80%+ and then removing files might increase fragmentation a bit, but it'll not have that much effect on performance. On ZFS, filling the filesystem up once can kill performance permanently on that pool.
"One that I've heard often repeated (paging user rsync) is that you should never fill your filesystems past 80% (or so). Things get slow otherwise."
Hi. I've said a lot about this in HN comments, so I encourage you, future reader, to search for that. It is indeed the case.
Recently it has been suggested that the presence of a fast write cache (a SLOG, in ZFS parlance) minimizes this problem and allows you to run up to, and around, 90% without breaking the filesystem.
We haven't tested this, intentionally or otherwise, but it sort of makes sense ...
Make no mistake, however - if you fill up a ZFS filesystem and run it for a while in that state, it will be permanently broken, at least in terms of performance.
A ZFS defrag utility would solve this problem, or at least provide a way out if you fall into this trap, but it has been related to me that ZFS defrag would be extremely complicated to implement.
Yes, we do indeed see this on other filesystems ... even a UFS2 filesystem with NO snapshots enabled can be effectively "broken" if you set minfree down to 0%, fill it all the way up, and then run it like that for a while. Freeing up the space and resetting minfree back to 6 or 8%, etc., doesn't fix it.
ZFS certainly slows down as it fills up, but only enough to affect high-IO, write-intensive loads like databases. A light-weight application like a home or office file server can still saturate gigabit ethernet even while 99% full.
Yes, because after 80% ZFS will consume CPU(s) in an attempt to find enough contiguous space for the new branch of blocks in order to prevent fragmentation. The fix is to either add more vdevs or to change the kernel tunable which controls the free contiguous space allocation algorithm. The latter approach is very likely to cause fragmentation where there would otherwise be none. As I don't remember the name of the tunable, you can search my comments and in one of them on the subject there will be a link to the procedure. However, I strongly recommend adding more logical units or physical disks as vdevs to the zpool instead.
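For reference, the add-more-vdevs route is just a zpool add once you've confirmed you're past the threshold; a minimal sketch with hypothetical device names (use whatever geometry your pool already has, raidz2 here just as an example):
$ zpool list tank                # the CAP column shows how full the pool is
$ zpool add tank raidz2 /dev/disk/by-id/ata-disk11 /dev/disk/by-id/ata-disk12 /dev/disk/by-id/ata-disk13 /dev/disk/by-id/ata-disk14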
I have been using it on Ubuntu 16.04, and before that on Ubuntu 15.10, for about a year. The good news is it's a first-class citizen on Ubuntu now. I have had great success with it on my home server and have plans to try it for some prod systems I'm building out.
I ran into a bug [1] that appeared to be a deadlock triggered by rsync of a relatively modest directory with tons of small files. The only thing that fixed it was `spl_taskq_thread_dynamic=0` to stop the dynamic spawning of ZFS-related kernel threads. Rather than a memory management issue, it'd stall the copy and peg my CPUs indefinitely.
I suspect it's fixed in 0.7.0 based on some of the other related bugs I've run into since, but I've been reluctant to upgrade as of yet. Otherwise, ZFS on a home file server has been great.
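For anyone wanting to try the same workaround, this is roughly how I pinned it (the conf file name is arbitrary; on my version the parameter lives in the spl module):
$ echo 'options spl spl_taskq_thread_dynamic=0' | sudo tee /etc/modprobe.d/spl.conf
$ cat /sys/module/spl/parameters/spl_taskq_thread_dynamic    # verify after the module reloads / next boot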
The Ubuntu ZFS module sets spl_taskq_thread_dynamic to zero by default.
My problem was that the large number of threads with spl_taskq_thread_dynamic=0 caused it to OOM; at one point it OOMed at mount (with 32GB RAM).
When I set spl_taskq_thread_dynamic=1 I had the same deadlock issue (that's where I stopped trying to use it and went back to FreeBSD).
The fix is not to work around the out of memory killer in Linux, but to turn off memory overcommit altogether: the system should not lie to applications that it has more virtual memory than is actually available, because that is extremely detrimental to correctness of operation. illumos based systems, for instance, never lie about such things.
ZFSOnLinux is a kernel module, not a userspace application (and you cannot OOM-kill kernel threads), and the overcommit probably isn't direct: e.g. there are 64 worker threads that all need some memory to do their work and tell the kernel that the allocation must not fail. If the memory isn't available, Linux either starts killing userspace processes or gets stuck.
The same thing can happen on Solaris, except there ZFS doesn't use a separate memory mechanism that fails to react properly to memory pressure.
I've been running it on a machine I built in 2011, and it seems to work quite well. It started out as whatever Ubuntu was available in August of 2011, then 12.04, then 14.04.
I have 3x 3TB Hitachi Deskstars in a raidz, with a Crucial M4-CT128M4 as a combined rootfs and L2ARC (the SSD was an upgrade from an older SSD when I moved from 12.04 to 14.04).
The machine has a Xeon e3-1270 CPU, and 16GB of ECC RAM.
I used it as a workstation (now my wife does), and it has also been a home media / TV server / NAS for its entire lifetime. 0 problems so far (knock on wood), and it has made a move from the east coast to California, and then back to the east coast again.
I like the chassis, but the build was ~5 years ago, and when I built a new machine ~1 year ago, it was not available.
My current machine uses a Fractal Design Define R5 Black Silent ATX Midtower case. That one is quite nice. The one gotcha that I had is that it is a bit wider than I was used to, which means that it just barely fits in the computer holder accessory for my ikea desk.
I've been running it on my laptop (Chromebook Pixel 2) for several months so far and it's great. I was waiting for this to be merged for a long time. Right now I have /home as a zpool that runs from an SD card on my computer. The SD card is a 64GB card (the most expensive/fastest I could find) and for my workload it works just amazingly: this setup leaves me with 64GB (a bit less) for / and 64GB for /home.
The ability to cd into snapshots is just great and a time saver. So far, nothing wrong/bad with this.
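For anyone who hasn't tried it: snapshots show up under the hidden .zfs directory of each dataset, so browsing old state is just a cd away (pool/dataset names here are made up; mine is mounted at /home):
$ zfs snapshot homepool/home@before-cleanup
$ ls /home/.zfs/snapshot/
$ cd /home/.zfs/snapshot/before-cleanup
$ zfs set snapdir=visible homepool/home    # optional: makes .zfs show up in plain ls output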
I've been using it casually as my /home for ~2 years. The most stress it gets is torrenting movies to it. But, there's been no problems. Setting it up on 3 drives was dead simple.
I've been running two very-lightly-used sets of pools on two old desktops for ... about a year and a half now actually and everything's gone swimmingly.
Nor have I read of anything bad happening to anyone that didn't involve not having backups.
All in all, lots of people are running it, lots of people seem to be having almost entirely positive experiences with it, so it seems like it's very stable.
The trouble is it depends how you define 'backup'.
E.g. if you are using zfs send/receive to do backups, and there's a bug in zfs send/receive....
Some people advise making sure you back up using different techniques/strategies/software, but in practice that gets quite difficult to manage, and you quickly end up going down a very deep rabbit hole on your quest for independent backup strategies.
Consider this example: I'm aware of at least one instance where a faulty tape drive damaged the tapes used by that drive - in a way that caused those tapes to damage other drives in the same way. The damage spread like a virus. Unless you have "blue" tapes, and "yellow" tapes, and "blue" drives, and "yellow" drives, an issue like that won't be contained.
But you do the combinatorics on all the kinds of issues like that which need to be addressed, and quickly you'll end up spending your whole life backing up data. Which will sort of solve the problem, because you won't have time to create any data which needs to be backed up.
$ head -1 /proc/meminfo
MemTotal: 4046880 kB
$ grep model.name /proc/cpuinfo | head -1
model name : Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz
$ sudo zpool list
NAME SIZE ALLOC FREE EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
tank 36.2T 27.2T 9.09T - 10% 74% 1.00x ONLINE -
That's 10x 4TB WD Red drives in raidz2 configuration.
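In case it helps anyone planning a similar build, the pool creation itself is a one-liner; a sketch scaled down to four hypothetical by-id names (ashift=12 forces 4K alignment, which you want on these Reds):
$ zpool create -o ashift=12 tank raidz2 /dev/disk/by-id/ata-WD40EFRX-disk01 /dev/disk/by-id/ata-WD40EFRX-disk02 /dev/disk/by-id/ata-WD40EFRX-disk03 /dev/disk/by-id/ata-WD40EFRX-disk04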
I've been running ZFS for the past 6 years, a little over 2 years on ZoL, and 4 years before that on Nexenta (Ubuntu userland with OpenSolaris kernel).
I've had 5 HDD failures over the past few years, never lost any data.
I've found ZoL to have less jitter when streaming video files in particular over SMB than the Solaris implementation did. ZoL is tighter on memory though.
raidz2 isn't particularly fast for random I/O (since it acts like it has a single spindle), but I don't do much random I/O. It's mostly video and local backup cache.
I'll be thinking about expanding my current setup soon enough. I'll probably be aiming for a raid10 setup for better random access, and of course the new system will have more RAM, and ECC.
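For the curious, the "raid10" layout in ZFS terms is just a pool of mirror vdevs, which also makes growing it two disks at a time trivial; a sketch with made-up device names:
$ zpool create tank mirror /dev/disk/by-id/ata-diskA /dev/disk/by-id/ata-diskB mirror /dev/disk/by-id/ata-diskC /dev/disk/by-id/ata-diskD
$ zpool add tank mirror /dev/disk/by-id/ata-diskE /dev/disk/by-id/ata-diskF    # expand later, pair by pair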
> Interesting that you had no issues without ECC in the first place. Some ZFS boards are pretty against trying ZFS without.
ZFS without ECC is no better or worse than any other filesystem without ECC, in that silent data corruption is possible. For that reason it is simply a lot worse than ZFS with ECC, since it introduces a fault condition normally not seen on ZFS.
But that fault condition does not go away because you chose to skip ZFS due to no ECC. And you will get a lot of other problems on top that ZFS would have helped against, even without ECC.
Lastly, you'll end up back in the horrible land of static partitioning schemes with questionable tooling. Why suffer through that if you do not have to?
ZFS boards really encourage you to use ECC because one of the nominal features of ZFS, "end-to-end checksumming", is not quite as useful if you have a point between ends where it won't notice silent data corruption (e.g. a bad stick of non-ECC RAM).
ZFS will notice, depending on _when_ the munging happened (if it was mangled after it got to RAM but before ZFS checksummed it, it obviously can't), but that's (primarily) why ZFS boards often tell you not to do that.
For that to happen, you would need a bad checksum and the corresponding data block to be corrupted in such a way that the corrupted data block checksums to the corrupted checksum. Given that the checksum is 256-bit, the scariest I'd call that is "possible in theory".
Most likely someone got bitten by faulty RAM, got corrupted data, which stopped after getting ECC RAM. => "ZFS wants ECC" myth, but in reality you just need stable hardware.
At the same time, if you are running like 30-40 TB pool, you probably want that ECC RAM, it will cost only small fraction of total storage box price and will save you from rebuilding a pool if shit happens.
I don't know about anyone else, but when I'm seeking advice about software from people on the internet whom I don't know personally, I tend to weight the words of the authors of the software in question more highly than that of people who didn't write it. There are probably exceptions to that I'm currently forgetting.
And the authors of ZFS, from the Sun days on, have been consistently and repeatedly saying that if you care about your data, use ECC.
I personally don't get why people seem so resistant to using ECC. It isn't that much more expensive, and we know these errors happen in around 8% of DIMMs[1]. It reminds me of people I know who refuse to buy a decent power supply and then complain about hardware failures because of shoddy power.
Personally, paying a slight premium to avoid an almost one in 10 likelihood of silent data corruption seems like a no-brainer to me. But then, I care about my data.
Edit: to be clear, ZFS will perform less dangerously with non-ECC ram than a filesystem that doesn't checksum with non-ECC, because it will detect Bad Things happening and tell you about it. ECC helps avoid the problems in the first place.
ECC isn't that much more expensive, if you already have a processor and motherboard that give you that option. If you're trying to use ZFS on surplus consumer Intel equipment, ECC is a major expense.
And aside from the cost issue, I don't think people are resistant to using ECC. You have probably just misinterpreted people who are justifiably shooting down ZFS/ECC scaremongering: Advising people to never use ZFS without ECC RAM is bad advice, because the advice should simply be to use ECC RAM if you want enterprise grade reliability, whether or not ZFS is part of the picture. Some people aren't in need of that level of reliability but can still benefit from ZFS, and they shouldn't be misled into thinking that ZFS has some particular need for ECC RAM.
I agree on the last part, and I think the particular weirdness here is a result of ZFS historically mentioning the value of ECC a lot more than other FSes. I think this probably makes people think that ZFS in some way depends on it more than others, rather than simply pointing out that there's a disturbingly high chance of encountering a problem without it that applies to everything.
And sure, if you're building a frankenbox, ECC is probably not an option, or the sort of expense that takes it out of the frankenbox category. I do hope that folks wouldn't store important data on dodgy hardware, but that is about more than RAM, and also none of my business.
I don't believe I'm misinterpreting people's reactions - I've witnessed people assert that ECC is a scam, waste of money and similar. Even after I point them to that link I posted above, they still seem to believe that It Won't Happen To Them. ("I don't have millions of machines.")
In any case, I still find it frankly bizarre that people run the risk with important data. If you asked people if they wanted to buy a CPU that had an ~8% chance of undetectably lying to them, I'm pretty sure the vast majority would at least want to spring for the premium one that allows at the least detecting the lie.
ZFS does depend on ECC more than other filesystems, in that it does data checksumming, which other filesystems (mostly) do not do. Those checksums can and will lie without ECC, in the worst case turning a data-protection measure into a data-losing one.
> I've witnessed people assert that ECC is a scam, waste of money and similar. Even after I point them to that link I posted above, they still seem to believe that It Won't Happen To Them. ("I don't have millions of machines.")
If by your own citation 92% of DIMMs operate with zero errors per year, and a consumer machine has at most four DIMMs, then it is actually pretty likely that any given consumer machine will operate without RAM errors even without ECC. And when you multiply by the low probability that a DRAM error will cause catastrophic data loss, then it is very easy to come to a reasonable conclusion that ECC is not worth the expense.
>And when you multiply by the low probability that a DRAM error will cause catastrophic data loss, then it is very easy to come to a reasonable conclusion that ECC is not worth the expense.
If your data is worth nothing, then ECC isn't worth the expense.
That's part of the difference between more enterprise level and consumer level equipment. I do some work on my home computer, but I don't have ECC in it. I probably push a few terabytes of work related information over it a year. The rest is many more terabytes of movies and music, and other things that will never notice a bit error. At work where I move 10s of terabytes of information a day, and that information may have cost many hundreds of manhours to create, I use enterprise level memory, disks, and other parts.
I've seen both servers and desktops develop bad ram. You want to know what the difference is when it happens? I get MCE logs from the server and we replace the equipment before anything bad happens. When it happens on the desktop you can end up with crashing programs, reboots, and even worse, corrupt data written to disk.
Hey, it is your data. I personally don't like gambling with mine, but yours is none of my business.
Again, I don't know how many people out there buy other products with a nearly 1 in 10 chance of undetectably not performing the function they are supposed to perform, but it seems nutty to me.
Although that line of thought does go some way towards explaining the vitamin business...
It's disingenuous of you to keep putting the error probabilities in terms like "nearly 1 in 10" without acknowledging the context that you're talking about the probability of a transient error occurring at any time over the course of a full year of continuous operation. Most people actually are comfortable with the idea that their equipment will have occasional downtime or faults, but you're trying to paint a very different picture.
Buddy, I would heartily encourage you to believe whatever makes you happy. Your insult is false and petty, and I think you're pretty wrong about the rest of that.
I personally consider it pretty disingenuous to call a fault that damages data on disk 'transient'. The root cause may have been transient, but the damage doesn't go away if/when a stuck bit functions normally again. Or do you consider a stroke leading to paralysis a transient injury?
I'd also like to see a citation for the claim that "most people actually are comfortable with" a 'transient' fault that scrambles random data they've chosen to keep. Where, exactly, are you getting this?
Finally, assuming you have some actual basis for that claim, how many of those people run ZFS? You wouldn't conflate nontechnical people who buy the cheapest box at Best Buy with folks who take the time to configure software RAID across several disks using a nonstandard filesystem, would you?
But at least with ZFS checksums you have a statistical chance of detecting bad RAM because it will sometimes manifest as checksum errors, whereas with non-checksumming filesystems you just get silent data corruption.
I don't think I've seen ZFS use much more than a single core. Apart from it being left over from upgrading desktop machines, I put that CPU in there for video transcoding.
It's well worth while creating a bunch of sub-filesystems (something ZFS makes really easy) with different settings: block sizes, compression, etc. - and copying sample data to it, to tune things in. The defaults are not necessarily a good fit for everybody. There's no reason not to turn on lz4 compression, for example.
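As a concrete sketch (dataset names made up; recordsize=1M needs the large_blocks feature flag on reasonably recent ZoL):
$ zfs set compression=lz4 tank                        # inherited by everything below
$ zfs create -o recordsize=1M tank/media              # big sequential files
$ zfs create -o recordsize=16K -o atime=off tank/db   # small random I/O, e.g. a database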
If you create zvols for use with different filesystems or sharing out over iSCSI, watch out for 100+% space consumption (e.g. 500G volume takes up 1+T space out of the pool, which is more than you would expect even with 8+2 raid overhead). I understand it relates to block vs stripe mismatch. This makes it a little less handy for things like virtual machine backing store than I had originally hoped.
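If you do go the zvol route anyway, it's worth experimenting with volblocksize up front (it can't be changed after creation) and watching the actual consumption; a sketch with a made-up name:
$ zfs create -s -V 500G -o volblocksize=64K tank/vm1
$ zfs get volsize,volblocksize,used,referenced tank/vm1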
ZFS can run on pretty low specs unless it's being pushed hard. I use a low-power AMD Sempron to keep things quiet. These specs are enough for all my backup and video streaming needs:
$ head -1 /proc/meminfo
MemTotal: 1967672 kB
$ grep model.name /proc/cpuinfo | head -1
model name: AMD Sempron(tm) 145 Processor
I've been running it for over 8 years on Arch Linux using the AUR packages, from FUSE through to the kernel driver. I run a RAIDZ2 pool that is now upgraded to ~4TB capacity from 4 2TB disks. I have had no known data loss. I did once lose my (non-ZFS) boot drive to hardware failure and the zpool mounted right up with a fresh installation.
It's quite good. I've been running ZFS on my Ubuntu server for about a year now. It's been quite solid. It detected and corrected some corruption resulting from overheating due to fans caked with dust. I've recently switched over to root-on-ZFS and it's continued to run well.
First off, ZFS takes advantage of doing its own volume management on bare metal. Some aspects of data recovery/resilience work better when it can interact directly with the device, particularly for resistance to bit rot (which is one of ZFS's biggest advantages). Putting something like LUKS in between isn't as bad as LVM or hardware RAID, but is not great.
Second, it enables the use of different ways of using a particular block cipher like AES (namely GCM instead of say XTS) that have some advantages such as authentication of data. I'm not sure if it's an option for this particular implementation, but nothing about the way ZFS encryption works would preclude using XTS, while GCM doesn't really map well to block device encryption (there is no good way to store the extra IV and authentication code, while ZFS can put it in the metadata).
There are of course disadvantages. Some information about the structure of the data on your drive is accessible that would not be if you used block encryption. Also, unlike LUKS or LUKS w/ LVM you can't easily mix filesystem types on the same drive set.
I wonder how/what is better with direct hard drive access here: if there is bit rot under LUKS, I expect ZFS also fixes it, so what should be different? "Some aspects of data recovery/resilience work better when it can interact directly with the device, particularly for resistance to bit rot (which is one of ZFS's biggest advantages)."
Thomas, have you looked at this code? Can you confirm if it's actually doing seemingly-sensible authenticated filesystem-level encryption?
I know some previous efforts at "native ZFS encryption" essentially re-implemented AES-XTS block-level encryption below the FS, which seems unlikely to offer any advantages.
Here's one example where it would be helpful, say you have a simple mirrored pool built from 2x LUKS encrypted drives in a low powered NAS device.
With the encryption underneath ZFS, the encryption during a write necessarily happens twice, once for each LUKS mapping, which increases CPU load, reduces throughput, or both.
With the encryption in the ZFS layer, data only needs to be encrypted once during a write, after that the data can be written to as many drives as necessary without any additional overhead.
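To make that concrete, once the native encryption work on top of this lands, the setup should collapse to something like the following (property names as they appear in the proposed pull request, so treat this as a sketch that may change before merge; pool/dataset names made up):
$ zpool create tank mirror /dev/disk/by-id/ata-diskA /dev/disk/by-id/ata-diskB
$ zfs create -o encryption=aes-256-gcm -o keyformat=passphrase tank/secure
The data then gets encrypted exactly once, no matter how many drives back the mirror.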
Encryption is one possible filter one could apply to data before it goes to the block device. Another is compression. I guess ZFS has that one also covered. Could ZFS provide some generic way to apply different such filters, which would be applied to each "mirror"?
No, datasets do not map to vdevs (mirrors). Since it came up in another comment as well: a dataset is a child object of a pool, but it is also the root-object of a subtree that consumes storage in that pool. To the user this dataset is presented as either a filesystem or a blockdevice to name the two most common options.
Apart from internal accounting things like the spacemap, every consumed storage space in a zpool belongs to a dataset in some way. It might be a data block for a file in that dataset, or it might be an old storage block still referenced by a snapshot of a dataset.
A lot of zfs commands work on datasets (send/receive, snapshot, clone, ...). They are also the point where settings such as compression, deduplication etc can be enabled/disabled as well as traditional filesystem mount options like noatime or noexec.
All the datasets consume storage from the pool, which dynamically stripes over all configured vdevs. If you enable an option for a dataset, you enable it for all storage of that dataset which ends up on all vdevs. You can not delegate a dataset to a specific vdev and then enable some option on that vdev.
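A tiny sketch of the property side of that (names made up): settings apply to a dataset subtree, never to a vdev:
$ zfs create tank/projects
$ zfs set compression=lz4 tank/projects    # applies to tank/projects and its children
$ zfs get -r compression tank              # shows where each value is set vs. inherited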
Also sorry to everyone who knows enough about ZFS internals to realize that I just took their design, pulled it behind a shed and hit it with a blunt, heavy object.
Because in the unified storage model that ZFS (and btrfs and friends) have, being forced to set up specifically encrypted zvols and zpools is a bunch of arse.
No, I mean why would ZFS have anything to do with encryption at all?
It doesn't seem to be a filesystem issue, rather something you put underneath it (encrypting the block device, no matter what file system is on top) or on top of it (encrypting separate files).
Encrypting on top of ZFS negates ZFS' compression and de-duplication. My understanding is that this implementation preserves both of those. For example, ZFS can compress before encrypting.
In order to do filesystem encryption properly, it needs to be done at the file layer, not the block layer. Block-level encryption is not authenticated because there are no extra bytes to add the authentication tag.
If a ciphertext is not authenticated, it can be trivially tampered with. This means that someone with access to the encrypted drive could add backdoors or otherwise tamper with the executables and data even though it is encrypted.
> If a ciphertext is not authenticated, it can be trivially tampered with.
Shouldn't it be impossible to forge a plaintext without the key for a good encryption algorithm?
I imagine a good algorithm not to be just key -> pseudorandom stuff that is XORed with data, but something that has cascading. Change a bit anywhere, and a whole block changes unpredictably. Include the physical position of the block in the key, so that it is impossible to copy blocks around to duplicate data.
This is how FreeBSD's GELI (which has authenticated encryption for block devices) did it. For every 4k data block it presented up the stack, it consumed nine 512-byte sectors on disk. Each of them contained 480 bytes of data, the rest going to the MAC.
With 4k-native drives, this became completely impractical. To keep the ratios similar, you would have to present 32k-byte blocks up the stack, which filesystems have trouble with. Or have one MAC sector per data sector or similar, cutting your storage in half.
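If I'm reading that layout right, the arithmetic is: ceil(4096 / 480) = 9 sectors of 512 bytes, so 4608 bytes on disk for every 4096 bytes presented, roughly 12.5% overhead even before 4k-native sectors make the bookkeeping fall apart.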
My understanding is that in practice this is quite hard to do. As far as I know, none of the mainstream block-layer encryption systems (BitLocker, FileVault, dm-crypt) provide authenticated encryption.
ZFS is both filesystem and logical volume abstraction, so to implement different keys on different "filesystems" on ZFS, you'd need to expose them as block devices, not filesystems, and then use your encryption du jour on them, then your filesystem atop that - which also kills most of your compression and deduplication benefits, since you're doing it before ZFS sees the data, so to speak.
I guess I was thinking more like a 9P[0] approach where a file system can be given to a module (or any program) and based on it the module can export a related file system. For example the module would encrypt the files before storing it to the given file system.
However, I realise this does not quite fit into the current kernel architectures.
Think flexible-sized partitions. Then forget it again because it is too wrong.
The one really nice scenario this opens up that is discussed in the pull request linked in another comment:
1. Server A has an encrypted dataset in pool foo, currently not decrypted.
2. A can send full or incremental streams from that dataset to server B without decrypting.
3. B can receive those streams and import them as encrypted datasets into its pool without decrypting or really ever having even seen the keys.
4. Server A can restore from B as required.
5. The owner of the key material can log into B and unlock the dataset as if it were on A.
This is a really nice, accessible way for encrypted remote backups.
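If the send/receive flags land as they look in the pull request (so treat this as a sketch, not gospel), the whole flow is basically one pipe per snapshot, with B never seeing plaintext or keys:
$ ssh serverA zfs send -w tank/data@monday | zfs receive backup/data                # full, raw/encrypted stream
$ ssh serverA zfs send -w -i @monday tank/data@tuesday | zfs receive backup/data    # incremental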
Cool. As I mentioned in some other comment, I would really have to look into the details before I could use and trust such an encrypted system, to understand the guarantees provided.
I see. When crypto is concerned, this kind of purity seems attractive though… Easier to reason about when using it.
If I were to use ZFS’s native crypto I would have to look into the details of their implementation to understand what guarantees are provided, and where in the stack the crypto is applied etc…
In fact, it's significantly harder to reason about encryption at the block layer, and not just because it's difficult to reason about the most popular construction (XTS). Among other things, you have to come up with a semantic for handling authentication failures at a sector level. What does it mean for one sector of a file to fail authentication? Have you lost the whole file? Do you fill with zeroes? How do you signal the precise error to the kernel or users?
The semantics seem pretty straightforward. That sector is untrusted. You do not fill with zeroes(!), but fail as hard as you can.
I understand that the kernel currently may not have any semantics for files or sectors failing authentication, but that issue doesn't seem to depend on the file/sector dilemma, and would have to be solved anyhow.
File-level encryption has total format flexibility. A filesystem can store arbitrary metadata. It turns out that strong encryption schemes want internal metadata. You can go through contortions --- contortions no block encryption scheme currently goes through --- to get some sort of space for metadata at the block layer, at significant expense. Or you can get it for free with files.
File-level encryption is also message-aware. Encryption cares about the boundaries between messages (just for instance: because it leaves no ambiguity about the set of disk sectors that makes up any particular file at any particular moment in time). Block-level crypto can't provide this; it's by definition oblivious to messages.
We have block-level crypto because it's a cheap, easy win, for the minimal security it provides. If the FBI is interdicting you, block crypto isn't going to help you: they put in weeks' worth of planning before anyone with handcuffs approached you, and the extra planning it takes to make sure they grab you when your key is resident is a rounding error. But if you leave your laptop full of financial secrets in a taxi cab, block crypto will probably keep you from having to roll all your keys, change your account numbers, notify your clients, etc.
Is there a way to do file-level encryption without leaking the number of files and their sizes? Because that seems much worse than any known problem with block-level encryption.
But won't the FBI also wait to make the grab when using filesystem encryption? We're not talking about some filesystem that uses unique passwords for every file.
Unlike block encryption, filesystem encryption doesn't have to be all-or-none; you can have lots of different keys, unlock on a directory-by-directory basis, whatever. There's room for arbitrary metadata. If you're using your computer at all, block encryption is unlikely to help you.
I'm not saying any particular extant system does those things; I'm just saying they're a possibility for filesystems and foreclosed for device-level encryption.
Except that partitioning a single device completely falls apart when using ZFS, as zpool is designed to manage entire drives, and managing multiple partitions as vdevs on a single device would not only kill performance, but severely imperil data redundancy. Once the drive kicked the bucket, all the vdevs would be gone too, causing an outage.
This is only part of the ZFS on Linux port, not part of the kernel proper (i.e. mainline). The kernel team is responsible for their part, and the ZFS on Linux team, since they have to import their own crypto code due to licensing concerns, must maintain the library they ship.
Note that the commit linked here is a port of the Illumos Crypto Framework (ICF), which is a dependency but is not the change that actually brings native encryption.
Is there really a need to make zfs your root volume? You can reinstall your root volume in a few minutes from a flash drive. What you really want is your home directory to be zfs, and just do all your work in your home directory.
Saves all the grief of trying to make it your boot volume and it works just as well.
I think this is most useful for cross-compatibility. Right now if a client uses FreeBSD ZFS, and you need to mount it to access project files on your Linux desktop, you can't if they used encryption. But after this is standard, you should be able to mount the same ZFS filesystem anywhere.
Yes, you can use ZFS on memory sticks. You can also "zfs send | zfs recv" to copy a single project's dataset snapshot. You don't need to use actual disks, though: if someone wants to send you project files, you can have them either send you a file from "zfs send > snapshot.zfs" or send it over SSH, "ssh friend zfs send | zfs recv".
Encryption could be an issue if, for example, someone uses a FreeBSD-based NAS for large data files and you want to skip the network and just access them directly from your Linux box. You can "zpool export; zpool import", but not if they used encryption. That's where I think this will be useful, because then we have one standard filesystem we can use everywhere.
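Concretely, that workflow is just (pool name hypothetical):
$ zpool export tank                        # on the FreeBSD NAS
$ zpool import -d /dev/disk/by-id tank     # on the Linux box, after moving the disks over
Once both sides speak the same native encryption, an encrypted pool should move across just as easily, with the keys loaded separately.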
I am well aware of all of that. But all the zfs send/recv options can by their very definition not have the full disk encryption problem hinted at in the comment I replied to.
Also, if it is easier to take your NAS offline and apart to chuck the disks into your desktop (compared to exporting it over the network), then your NAS is too small. What you were looking for is a laptop.
Which leaves abusing the zpool on a memory stick as data interchange format. Most likely with copies=1, so you have to add some par2 files anyway, at which point you could simply put them on figuratively any other filesystem out there. And encrypt them with gpg/openssl etc. That way I would also not have to run a potentially maliciously crafted filesystem within my kernel.
The crypto code is all very internal to the kernel, so of course it will be GPL. Besides, I personally think that all kernel drivers should be GPL because they are essentially all derivative works of the kernel. The ability of module loading to get around the GPL is one of the more worrying decisions made by Torvalds.
Why is this a strange choice? Isn't most of the kernel GPL? (and thus, a non-GPL component can't link to the GPL part, as that would be a violation of the GPL, would it not?)
As someone who just last night set up an Ubuntu 16.04 server with the intention of using ZFS, should I wait until this hits the Ubuntu repos? Is it possible to enable this encryption on an existing filesystem?
Similar questions here. It's unclear to me how I can integrate this into my system and how to go about it. I'd hate to have to blow away my zpool to do this.
My zpool isn't very full. Maybe I can make a new dataset that's encrypted and move everything from the unencrypted dataset to the encrypted dataset. Of course, being able to encrypt in place would be most ideal, but I'll settle for it being a painless operation that doesn't require blowing away a zpool.
[0]: https://github.com/zfsonlinux/zfs/wiki/Ubuntu-16.04-Root-on-...