I followed the instructions for Ubuntu 16.04 on the github.com/zfsonlinux wiki a while back and (encountered a few little issues along the way but) got it working, although I experienced some MAJOR performance problems, so something wasn't quite right (the exact same hardware is blazing fast when running FreeBSD). I can't imagine it was just "how things are" with regard to the current state of ZFS on Linux (or Ubuntu specifically) -- it was like someone hit the laptop's "pause" button a couple of times per minute.
Is it the lack of drivers? Better availability of other 3rd party software for Linux? The package manager? Or just "more brains in Linux"/"our CTO grew up with Linux"?
Edit: this is a legitimate question. I know previous Linux-related ZFS efforts used FUSE, but ZoL is native. I'd assume performance should be roughly equivalent between Linux and FreeBSD.
Also, experience tells me that illumos- and FreeBSD-based systems will always perform faster than Linux with regards to ZFS, but I'd have to publish a full line of benchmarks, and I bet even those would get vehemently disputed because Linux is all the rage right now, so that's a lost cause: you have to try it for yourself and make up your own mind.
Setting this aside, all usage hours are not equal. How do you compare an hour of enduser desktop usage on Ubuntu 23.57.something to a usage hour of a commercial NAS storage solution internally based on FreeBSD and ZFS?
Do you think the enduser performed comparable product testing cycles?
I've seen ZoL systems where the zpool claimed it was online but contained no vdevs. `zfs list` had no datasets, but datasets were mounted, and trying to read from them got your process stuck in-kernel.
It just lost/forgot its devices.
Up until one of the latest releases, on every boot you rolled the dice on which device names your pool would use to import. It behaved differently on identical machines and setups. Personally, I still don't trust that problem not to reappear.
ZOL is bolted on. With a large nailgun. Simple as that. At times, it feels about as integrated as pjd's original patches distributed on the FreeBSD mailinglists.
And since this division is not technical but legal, based on the license choice made decades ago for Linux with regards to where the code could be exported to and what could be imported into, this situation will not resolve itself.
zpool upgrade -v; zfs upgrade -v
What does your environment look like?
That's a ZFS-in-general complaint, not a ZFS-on-Linux complaint.
However, it's worth mentioning that most (all?) filesystems drop sharply in performance as they near capacity.
Hi. I've said a lot about this in HN comments, so I encourage you, future reader, to search for that. It is indeed the case.
Recently it has been suggested that the presence of a fast write cache (a SLOG, in ZFS parlance) minimizes this problem and allows you to run up to, and around, 90% without breaking the filesystem.
We haven't tested this, intentionally or otherwise, but it sort of makes sense ...
Make no mistake, however - if you fill up a ZFS filesystem and run it for a while in that state, it will be permanently broken, at least in terms of performance.
A ZFS defrag utility would solve this problem, or at least provide a way out if you fall into this trap, but it has been related to me that ZFS defrag would be extremely complicated to implement.
Yes, we do indeed see this on other filesystems ... even a UFS2 filesystem with NO snapshots enabled can be effectively "broken" if you set minfree down to 0%, fill it all the way up, and then run it like that for a while. Freeing up the space and resetting minfree back to 6 or 8%, etc., doesn't fix it.
In particular this one: https://github.com/zfsonlinux/zfs/issues/3645
I suspect it's fixed in 0.7.0 based on some of the other related bugs I've run into since, but I've been reluctant to upgrade as of yet. Otherwise, ZFS on a home file server has been great.
My problem was that the large number of threads with spl_taskq_thread_dynamic=0 caused it to OOM; at one point it OOMed at mount (with 32GB of RAM).
When I set spl_taskq_thread_dynamic=1 I had the same deadlock issue (that's where I stopped trying to use it and went back to FreeBSD).
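For readers unfamiliar with the tunable mentioned above: it is a parameter of the spl kernel module, typically set via modprobe configuration (a sketch; the exact path and default may vary by release):

```shell
# /etc/modprobe.d/spl.conf -- persist the setting across reboots
options spl spl_taskq_thread_dynamic=1

# or change it on a running system (reverts at reboot):
echo 1 | sudo tee /sys/module/spl/parameters/spl_taskq_thread_dynamic
```
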
I have 3x 3TB Hitachi Deskstars in a raidz, with a Crucial M4-CT128M4 as a combined rootfs and L2ARC
(the ssd was an upgrade from an older ssd when I moved from 12.04 to 14.04).
The machine has a Xeon e3-1270 CPU, and 16GB of ECC RAM.
I used it as a workstation (now my wife does), and it has served as a home media / TV server / NAS for its entire lifetime. 0 problems so far (knock on wood), and it has made a move from the east coast to California, and then back to the east coast again.
My current machine uses a Fractal Design Define R5 Black Silent ATX Midtower case. That one is quite nice. The one gotcha that I had is that it is a bit wider than I was used to, which means that it just barely fits in the computer holder accessory for my ikea desk.
The ability to cd into snapshots is just great and a time saver. So far, nothing wrong/bad with this.
Nor have I read of anything bad happening to anyone that didn't involve not having backups.
All in all, lots of people are running it, lots of people seem to be having almost entirely positive experiences with it, so it seems like it's very stable.
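For anyone who hasn't tried the snapshot browsing mentioned above: every mounted dataset exposes a hidden `.zfs/snapshot` directory, so a snapshot can be read like any other directory. A sketch with hypothetical dataset and file names:

```shell
# take a snapshot of a (hypothetical) dataset
sudo zfs snapshot tank/home@before-upgrade

# browse it read-only through the hidden .zfs directory
cd /tank/home/.zfs/snapshot/before-upgrade
ls

# pull a single file back out without a full rollback
cp ./important.conf /tank/home/important.conf
```
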
Just grab https://github.com/zfsonlinux/zfs/pull/4833 and send/recv in safety. ;)
This should be true of any storage technology, no matter how unreliable (and thereby undermines the anecdote).
E.g. if you are using zfs send/receive to do backups, and there's a bug in zfs send/receive....
Some people advise making sure you back up using different techniques/strategies/software, but in practice that gets quite difficult to manage, and you quickly end up going down a very deep rabbit hole on your quest for independent backup strategies.
Consider this example: I'm aware of at least one instance where a faulty tape drive damaged the tapes used by that drive - in a way that caused those tapes to damage other drives in the same way. The damage spread like a virus. Unless you have "blue" tapes, and "yellow" tapes, and "blue" drives, and "yellow" drives, an issue like that won't be contained.
But you do the combinatorics on all the kinds of issues like that which need to be addressed, and quickly you'll end up spending your whole life backing up data. Which will sort of solve the problem, because you won't have time to create any data which needs to be backed up.
$ head -1 /proc/meminfo
MemTotal: 4046880 kB
$ grep model.name /proc/cpuinfo | head -1
model name : Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz
$ sudo zpool list
NAME   SIZE  ALLOC   FREE  EXPANDSZ  FRAG    CAP  DEDUP  HEALTH  ALTROOT
tank  36.2T  27.2T  9.09T         -   10%    74%  1.00x  ONLINE  -
I've been running ZFS for the past 6 years, a little over 2 years on ZoL, and 4 years before that on Nexenta (Ubuntu userland with OpenSolaris kernel).
I've had 5 HDD failures over the past few years, never lost any data.
I've found ZoL to have less jitter when streaming video files in particular over SMB than the Solaris implementation did. ZoL is tighter on memory though.
raidz2 isn't particularly fast for random I/O (since it acts like it has a single spindle), but I don't do much random I/O. It's mostly video and local backup cache.
I'll be thinking about expanding my current setup soon enough. I'll probably be aiming for a raid10 setup for better random access, and of course the new system will have more RAM, and ECC.
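The "raid10" layout described above is, in ZFS terms, a pool of striped mirrors. A sketch with hypothetical device names (using /dev/disk/by-id paths, which also sidesteps the device-renaming issue mentioned elsewhere in this thread):

```shell
# two 2-way mirrors, striped: better random IOPS than a single raidz2 vdev,
# at the cost of 50% usable capacity
sudo zpool create tank \
    mirror /dev/disk/by-id/ata-diskA /dev/disk/by-id/ata-diskB \
    mirror /dev/disk/by-id/ata-diskC /dev/disk/by-id/ata-diskD

# extend later by adding another mirror pair
sudo zpool add tank mirror /dev/disk/by-id/ata-diskE /dev/disk/by-id/ata-diskF
```
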
Interesting that you had no issues without ECC in the first place. Some ZFS boards are pretty adamantly against trying ZFS without it.
ZFS without ECC is no better or worse than any other filesystem without ECC, in that silent data corruption is possible. For that very reason it is a lot worse than ZFS with ECC, since it introduces a fault condition normally not seen on ZFS.
But that fault condition does not go away because you chose to skip ZFS due to no ECC. And you will get a lot of other problems on top that ZFS would have helped against, even without ECC.
Lastly, you'll end up back in the horrible land of static partitioning schemes with questionable tooling. Why suffer through that if you do not have to?
ZFS will notice, depending on _when_ the munging happened (if it was mangled after it got to RAM but before ZFS checksummed it, it obviously can't), but that's (primarily) why ZFS boards often tell you not to do that.
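That detection (as opposed to prevention) is what a scrub surfaces: it re-reads every block, verifies checksums, and reports anything it couldn't repair:

```shell
sudo zpool scrub tank       # re-read and verify every checksum in the pool
sudo zpool status -v tank   # shows per-device CKSUM error counts and lists
                            # any files with unrecoverable corruption
```
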
Most likely someone got bitten by faulty RAM, got corrupted data, which stopped after getting ECC RAM. => "ZFS wants ECC" myth, but in reality you just need stable hardware.
At the same time, if you are running like 30-40 TB pool, you probably want that ECC RAM, it will cost only small fraction of total storage box price and will save you from rebuilding a pool if shit happens.
And the authors of ZFS, from the Sun days on, have been consistently and repeatedly saying that if you care about your data, use ECC.
I personally don't get why people seem so resistant to using ECC. It isn't that much more expensive, and we know these errors happen in around 8% of DIMMs. It reminds me of people I know who refuse to buy a decent power supply and then complain about hardware failures because of shoddy power.
Personally, paying a slight premium to avoid an almost one in 10 likelihood of silent data corruption seems like a no-brainer to me. But then, I care about my data.
Edit: to be clear, ZFS will perform less dangerously with non-ECC ram than a filesystem that doesn't checksum with non-ECC, because it will detect Bad Things happening and tell you about it. ECC helps avoid the problems in the first place.
And aside from the cost issue, I don't think people are resistant to using ECC. You have probably just misinterpreted people who are justifiably shooting down ZFS/ECC scaremongering: Advising people to never use ZFS without ECC RAM is bad advice, because the advice should simply be to use ECC RAM if you want enterprise grade reliability, whether or not ZFS is part of the picture. Some people aren't in need of that level of reliability but can still benefit from ZFS, and they shouldn't be misled into thinking that ZFS has some particular need for ECC RAM.
And sure, if you're building a frankenbox, ECC is probably not an option, or the sort of expense that takes it out of the frankenbox category. I do hope that folks wouldn't store important data on dodgy hardware, but that is about more than RAM, and also none of my business.
I don't believe I'm misinterpreting people's reactions - I've witnessed people assert that ECC is a scam, waste of money and similar. Even after I point them to that link I posted above, they still seem to believe that It Won't Happen To Them. ("I don't have millions of machines.")
In any case, I still find it frankly bizarre that people run the risk with important data. If you asked people if they wanted to buy a CPU that had an ~8% chance of undetectably lying to them, I'm pretty sure the vast majority would at least want to spring for the premium one that allows at the least detecting the lie.
If by your own citation 92% of DIMMs operate with zero errors per year, and a consumer machine has at most four DIMMs, then it is actually pretty likely that any given consumer machine will operate without RAM errors even without ECC. And when you multiply by the low probability that a DRAM error will cause catastrophic data loss, then it is very easy to come to a reasonable conclusion that ECC is not worth the expense.
If your data is worth nothing, then ECC isn't worth the expense.
That's part of the difference between more enterprise level and consumer level equipment. I do some work on my home computer, but I don't have ECC in it. I probably push a few terabytes of work related information over it a year. The rest is many more terabytes of movies and music, and other things that will never notice a bit error. At work where I move 10s of terabytes of information a day, and that information may have cost many hundreds of manhours to create, I use enterprise level memory, disks, and other parts.
I've seen both servers and desktops develop bad ram. You want to know what the difference is when it happens? I get MCE logs from the server and we replace the equipment before anything bad happens. When it happens on the desktop you can end up with crashing programs, reboots, and even worse, corrupt data written to disk.
Again, I don't know how many people out there buy other products with a nearly 1 in 10 chance of undetectably not performing the function they are supposed to perform, but it seems nutty to me.
Although that line of thought does go some way towards explaining the vitamin business...
I personally consider it pretty disingenuous to call a fault that damages data on disk 'transient'. The root cause may have been transient, but the damage doesn't go away if/when a stuck bit functions normally again. Or do you consider a stroke leading to paralysis a transient injury?
I'd also like to see a cite that "most people actually are comfortable with the idea that..." a 'transient' fault that scrambles random data they've chosen to keep. Where, exactly, are you getting this?
Finally, assuming you have some actual basis for that claim, how many of those people run ZFS? You wouldn't conflate nontechnical people who buy the cheapest box at Best Buy with folks who take the time to configure software RAID across several disks using a nonstandard filesystem, would you?
This is great advice, but it has nothing to do with ZFS. It's true no matter what filesystem you use.
Other filesystems will happily let you write corrupt data blocks...
It's well worth while creating a bunch of sub-filesystems (something ZFS makes really easy) with different settings: block sizes, compression, etc. - and copying sample data to it, to tune things in. The defaults are not necessarily a good fit for everybody. There's no reason not to turn on lz4 compression, for example.
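A sketch of what that tuning pass might look like; the dataset names and property values are illustrative, not recommendations (note that recordsize=1M requires the large_blocks pool feature on newer releases):

```shell
# cheap, effectively free compression for general data
sudo zfs create -o compression=lz4 tank/data

# smaller records for a database-style random-write workload
sudo zfs create -o compression=lz4 -o recordsize=16K tank/db

# larger records for big sequential media files
sudo zfs create -o recordsize=1M tank/media

# check what compression is actually buying you
zfs get compressratio tank/data tank/db tank/media
```
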
If you create zvols for use with different filesystems or sharing out over iSCSI, watch out for 100+% space consumption (e.g. 500G volume takes up 1+T space out of the pool, which is more than you would expect even with 8+2 raid overhead). I understand it relates to block vs stripe mismatch. This makes it a little less handy for things like virtual machine backing store than I had originally hoped.
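The mismatch described above is commonly attributed to the zvol's `volblocksize` interacting badly with raidz parity and padding. A hedged sketch of creating a zvol with an explicit block size (values are illustrative; the right choice depends on vdev width and workload):

```shell
# the default volblocksize (8K on older releases) can waste a lot of space
# on wide raidz vdevs; a larger value reduces parity+padding overhead
sudo zfs create -V 500G -o volblocksize=64K tank/vm-disk

# compare logical size vs actually-consumed pool space
zfs get volsize,used,referenced tank/vm-disk
```
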
$ head -1 /proc/meminfo
MemTotal: 1967672 kB
$ grep model.name /proc/cpuinfo | head -1
model name: AMD Sempron(tm) 145 Processor
But it's not as performant as UFS on the same hardware.
Encryption seems it would be more cleanly implemented transparently underneath the file system level.
First off, ZFS takes advantage of doing its own volume management on bare metal. Some aspects of data recovery/resilience work better when it can interact directly with the device, particularly for resistance to bit rot (which is one of ZFS's biggest advantages). Putting something like LUKS in between isn't as bad as LVM or hardware RAID, but is not great.
Second, it enables the use of different ways of using a particular block cipher like AES (namely GCM instead of say XTS) that have some advantages such as authentication of data. I'm not sure if it's an option for this particular implementation, but nothing about the way ZFS encryption works would preclude using XTS, while GCM doesn't really map well to block device encryption (there is no good way to store the extra IV and authentication code, while ZFS can put it in the metadata).
There are of course disadvantages. Some information about the structure of the data on your drive is accessible that would not be if you used block encryption. Also, unlike LUKS or LUKS w/ LVM you can't easily mix filesystem types on the same drive set.
I know some previous efforts at "native ZFS encryption" essentially re-implemented AES-XTS block-level encryption below the FS, which seems unlikely to offer any advantages.
With the encryption underneath ZFS, the encryption during a write necessarily happens twice, once for each LUKS mapping, which increases CPU load, reduces throughput, or both.
With the encryption in the ZFS layer, data only needs to be encrypted once during a write, after that the data can be written to as many drives as necessary without any additional overhead.
Encryption is one possible filter one could apply to data before it goes to the block device. Another is compression. I guess ZFS has that one also covered. Could ZFS provide some generic way to apply different such filters, which would be applied to each "mirror"?
Apart from internal accounting things like the spacemap, every consumed storage space in a zpool belongs to a dataset in some way. It might be a data block for a file in that dataset, or it might be an old storage block still referenced by a snapshot of a dataset.
A lot of zfs commands work on datasets (send/receive, snapshot, clone, ...). They are also the point where settings such as compression, deduplication etc can be enabled/disabled as well as traditional filesystem mount options like noatime or noexec.
All the datasets consume storage from the pool, which dynamically stripes over all configured vdevs. If you enable an option for a dataset, you enable it for all storage of that dataset which ends up on all vdevs. You can not delegate a dataset to a specific vdev and then enable some option on that vdev.
Also sorry to everyone who knows enough about ZFS internals to realize that I just took their design, pulled it behind a shed and hit it with a blunt, heavy object.
It doesn't seem to be a filesystem issue, rather something you put underneath it (encrypting the block device, no matter what file system is on top) or on top of it (encrypting separate files).
If a ciphertext is not authenticated, it can be trivially tampered with. This means that someone with access to the encrypted drive could add backdoors or otherwise tamper with the executables and data even though it is encrypted.
Shouldn't it be impossible to forge a plaintext without the key for a good encryption algorithm?
I imagine a good algorithm not to be just key -> pseudorandom stuff that is XORed with data, but something that has cascading. Change a bit anywhere, and a whole block changes unpredictably. Include the physical position of the block in the key, so that it is impossible to copy blocks around to duplicate data.
With 4k native drives, this became completely impractical. To keep ratios similar, you would have to present 32k byte devices up the stack, which filesystems have troubles with. Or have 1 MAC sector per data sector or similar, cutting your storage in half.
Here's a short thread about the problems with adding it to dm-crypt: http://comments.gmane.org/gmane.linux.kernel.device-mapper.d...
I'm sure you can find more threads if you look around.
ZFS is both filesystem and logical volume abstraction, so to implement different keys on different "filesystems" on ZFS, you'd need to expose them as block devices, not filesystems, and then use your encryption du jour on them, then your filesystem atop that - which also kills most of your compression or encryption properties, since you're doing it before ZFS sees the data, so to speak.
However, I realise this does not quite fit into the current kernel architectures.
The one really nice scenario this opens up, discussed in the pull request linked in another comment:
1. Server A has an encrypted dataset in pool foo, currently not decrypted
2. A can send full or incremental streams from that dataset to server B
3. B can receive those streams and import them as encrypted datasets into the pool without decrypting or really ever having seen the keys
4. Server A can restore from B as required
5. The owner of the key material can log into B and unlock the dataset as if it were on A
If I were to use ZFS’s native crypto I would have to look into the details of their implementation to understand what guarantees are provided, and where in the stack the crypto is applied etc…
I understand that the kernel currently may not have any semantics for files or sectors failing authentication, but that issue doesn't seem to depend on the file/sector dilemma, and would have to be solved anyhow.
File-level encryption has total format flexibility. A filesystem can store arbitrary metadata. It turns out that strong encryption schemes want internal metadata. You can go through contortions --- contortions no block encryption scheme currently goes through --- to get some sort of space for metadata at the block layer, at significant expense. Or you can get it for free with files.
File-level encryption is also message-aware. Encryption cares about the boundaries between messages (just for instance: because it leaves no ambiguity about the set of disk sectors that makes up any particular file at any particular moment in time). Block-level crypto can't provide this; it's by definition oblivious to messages.
We have block-level crypto because it's a cheap easy win, for the minimal security it provides. If the FBI is interdicting you, block crypto isn't going to help you: they incurred weeks worth of planning before anyone with handcuffs approached you, and the extra planning it takes to make sure they grab you when your key is resident is a rounding error. But if you leave your laptop full of financial secrets in a taxi cab, block crypto will probably keep you from having to roll all your keys, change your account numbers, notifying your clients, etc.
I'm not saying any particular extant system does those things; I'm just saying they're a possibility for filesystems and foreclosed for device-level encryption.
(And no, it's not the USA crypto export restrictions circa 1994, it's two pieces of freely available code with different notions of freedom coded in.)
Encryption could be an issue if, for example, someone uses a FreeBSD-based NAS for large data files, and you want to skip the network and just access them directly from your Linux box. You can "zpool export; zpool import", but not if they used encryption. That's where I think this will be useful, because then we have one standard filesystem we can use everywhere.
Also, if it is easier to take your NAS offline and apart to chuck the disks into your desktop (compared to exporting it over the network), then your NAS is too small. What you were looking for was a laptop.
Which leaves abusing a zpool on a memory stick as a data interchange format. Most likely with copies=1, so you have to add some par2 files anyway, at which point you could simply put them on practically any other filesystem out there. And encrypt them with gpg/openssl etc. That way I would also not have to run a potentially maliciously crafted filesystem within my kernel.
What a strange choice by the Linux kernel.
My zpool isn't very full. Maybe I can make a new dataset that's encrypted and move everything from the unencrypted dataset to the encrypted dataset. Of course, being able to encrypt in place would be most ideal, but I'll settle for it being a painless operation that doesn't require blowing away a zpool.
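Since encryption can't be enabled on an existing dataset in place, that migrate-then-destroy approach would look roughly like this (OpenZFS 0.8+ syntax; dataset names are hypothetical):

```shell
# create the encrypted destination dataset
sudo zfs create -o encryption=aes-256-gcm -o keyformat=passphrase tank/secure

# copy data over, preserving attributes (zfs send | zfs recv also works)
sudo rsync -aHAX /tank/plain/ /tank/secure/

# once verified, destroy the unencrypted original (snapshots go with it)
sudo zfs destroy -r tank/plain
```
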