(...sure, you could call zfs or btrfs "mainstream", I suppose, but when I say "mainstream" I mean something along the lines of "officially supported by RedHat". zfs isn't, and RH considers btrfs to still be "experimental".)
I recently had a server be crippled by running snapper on default settings for a few months. And after a couple days of balancing (which desperately needs a throttle control) it wasn't much better, so I gave up on having it run btrfs.
I think the record I managed was slightly over two minutes of btrfs blocking all disk I/O. Something is deeply wrong with how it organizes transactions.
Turns out at the time, re-balancing still had to be run manually. I'm not sure if that still holds true.
I've got a pretty good setup now with a fairly complex fstab, multiple SSDs, backup drives, and everything fully encrypted and auto mounting at boot. I'd really love to move this to a file system more resistant to data corruption.
There was a presentation on this by Tom Caputi at the most recent OpenZFS Developer Summit.
Performance-wise it does fairly well; our benchmarks show a ~10-15% decrease on random 8kB IO (14.04).
We are definitely looking forward to ZFS native encryption!
First decrypt LUKS (we are doing this in GRUB)
Then mount zpool(s)
When calling anything that ultimately calls grub-probe (e.g. apt-get upgrade), you have to symlink the decrypted device-mapper volume up a layer into /dev, because grub-probe can't seem to find the ZFS vdev(s) otherwise, i.e.: "ln -s /dev/mapper/encrypted-zfs-vdev /dev".
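A minimal sketch of that workaround (the mapping name encrypted-zfs-vdev is whatever your crypttab/cryptsetup created, shown here as a hypothetical):

    # grub-probe only looks for vdevs directly under /dev, so expose the
    # decrypted mapping there before running anything that invokes it
    ln -s /dev/mapper/encrypted-zfs-vdev /dev
    apt-get upgrade                  # grub-probe can now resolve the ZFS vdev
    rm /dev/encrypted-zfs-vdev       # optionally remove the symlink afterwards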
This is in fact the case on every Linux distribution I've run ZFS over dm-crypt on.
EDIT: IOW, it's a GRUB bug, not an Ubuntu bug.
I've had it on my backlog to at some point go in and sort out my initramfs's insanity when it comes to handling crypt'd disks in general - it should be a lot less brittle than it is.
- is the crypto module still present in GRUB config?
- is grub running in text only mode? Otherwise we cannot see the LUKS prompt to decrypt the devices.
- does initramfs know about all mirror devices (instead of returning early after the first)?
(summary: I don't know how to make it super easy, but what you want should totally work)
(I'm not Kent but I love the work)
While ZFS might not be supported out of the box on "your-choice-distro", I would point out that Ansible or other automation tools should make it pretty easy to end up with a repeatably installed machine or VM image with ZFS.
SUSE (the other main enterprise Linux distribution vendor) has been supporting Btrfs in the enterprise (as the default filesystem) for several years.
[Disclosure: I work for SUSE.]
I recall they switched to reiserfs as default at one point. reiserfs was never a good choice for data consistency - the fact that storing a reiserfs image in a file on the host reiserfs filesystem and then doing an fsck on the host FS would corrupt the FS should be a clear signal that there are fundamental problems remaining to be solved.
That said, I'm playing with btrfs on some of my machines, and it seems quite nice. But no way would I risk using it on a production server at this time.
Of course, someone has to go first and filesystems never truly get battle-hardened until distros start pushing them. I appreciate that SuSE does this from that perspective. It means when I switch over there will be fewer bugs. :)
I'm using btrfs as a daily driver on my workstations so I get some experience with the tooling, and also because features like consistent snapshots are really nice to have. Still haven't taken the plunge on the server side, I expect I'll give it a few years until it's considered "boring".
That's already far ahead of ext, ufs, xfs, jfs, btrfs etc. The only ones offhand which are possibly more portable are fat, hfs, udf and ntfs, and you're not exactly going to want to use them for any serious purpose on a Unix system. ZFS is the most portable and featureful Unix filesystem around at present IMHO.
Not only is a cryptographic hash unnecessary, under certain circumstances it will actually do a worse job.
Cryptographic hashes operate under a different set of constraints than error-detecting codes. With an error-detecting code, it's desirable to guarantee a different checksum in the event of a bitflip.
With a cryptographic random oracle, this is not the case: we want all outcomes to have equal probability, even potentially producing the same digest in the event of a bitflip. As an example of a system which failed in this way: Enigma was specifically designed so the ciphertext of a given letter was always different from its plaintext. This left a statistical signature on the ciphertext which in turn was exploited by cryptanalysts to recover plaintexts. (Note: a comparison to block ciphers is apt as most hash functions are, at their core, similar to block ciphers)
Though the digests of cryptographic hash functions are so large it's statistically improbable for a single bitflip to result in the same digest as the original, it is not a guarantee the same way it is with the CRC family.
Cryptographic hash functions are not designed to be error detecting codes. They are designed to be random oracles. Outside a security context, using a CRC family function will not only be faster, but will actually provide guarantees cryptographic hash functions can't.
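As a toy illustration of the detection side (a sketch; it uses the POSIX cksum CRC rather than CRC-32C, and data.bin is a stand-in for any file):

    # flip the lowest bit of the first byte in a copy of the file
    cp data.bin flipped.bin
    b=$(od -An -tu1 -N1 data.bin | tr -d ' ')
    printf "\\$(printf '%03o' $(( b ^ 1 )))" \
        | dd of=flipped.bin bs=1 count=1 conv=notrunc
    # any single-bit error changes a CRC, so the two CRCs are guaranteed
    # to differ; a cryptographic hash differs only with high probability
    cksum data.bin flipped.bin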
A cryptographic hash is unnecessary, but as you point out in the next to last paragraph, it is statistically improbable that it will do a worse job. Because collisions are statistically improbable.
On the other hand, any CRC will detect all such errors; a well-chosen one such as CRC32C will detect any error of up to 5 flipped bits at this message size.
This is quite appropriate for the error model in data transmission, of uncorrelated bit errors. https://users.ece.cmu.edu/~koopman/networks/dsn02/dsn02_koop... is a pretty good paper, though there are probably better ones for readers without an existing background in how CRC works.
Unlike a plain checksum, CRC-32C is hardened against bias, which means its distribution is not far from that of an ideal checksum. This means if your bitrot is random and you're using 16KB blocks, you will need to see on the order of (2^32 * 16KB) = 64TB of corrupted data to get a random failure. Modern hard drives corrupt data at a rate of once every few terabytes. TCP without SSL (because of its highly imperfect checksum) corrupts data at a rate of once every few gigabytes. Assuming an extremely bad scenario of a corrupt packet once every 1 GB, in theory you'd need to read on the order of exabytes of data to get a random CRC-32C failure. I'm not doubting that real world performance could be much worse, but I'd like to understand how.
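A quick check of that arithmetic (a sketch, using 64-bit shell arithmetic):

    # 2^32 checksum values, 16 KiB per block: corrupt data needed for one
    # expected undetected error
    echo $(( (1 << 32) * 16 * 1024 ))    # 70368744177664 bytes, i.e. 64 TiB
    # at one corrupt 16 KiB block per 1 GiB read, the total volume read is
    echo $(( (1 << 32) * (1 << 30) ))    # 2^62 bytes, i.e. 4 EiB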
No, that's what you need to generate one guaranteed failure, when enumerating different random corruption possibilities. Simply because a 32-bit number can at most represent 2^32 different states.
In practice, you'd have 50% probability to have a collision for every 32 TB... assuming perfect distribution.
By the way, 32 TB takes just 4-15 hours to read from a modern SSD. A terabyte is just not that much data nowadays.
I do agree that tens of TBs are not too much data, but mind that this probability means you need to feed your checksum 64TB worth of 16KB blocks, every one of them being corrupt, to let at least one of them go through unnoticed with 63% probability. So you don't only need to calculate with the throughput of your SSD, but the throughput multiplied by the corruption rate.
When people started using CRC-32 it was because with the technology of the time it was virtually impossible to see collisions. Nowadays we are discussing whether it's reasonable to expect a data volume that gives you a 40% or 60% collision chance.
CRC32's end is way overdue. We should standardize on a CRC64 algorithm soon, or we'll have our hands forced and probably be stuck with a bad choice.
If what you're saying is it's not portable to Windows then sure, but compared to most filesystems it's extremely portable.
"ZFS export" on the host system, remove drives, insert drives in new server, "ZFS import" on the new server, It really is that simple.
And for others it's a constraint on which distribution they can move to, etc. It just reduces the number of situations where it can be used, even if you find yourself in one where it can.
I know you have the SPL anyway, so maybe with the addition of the Linux POSIX-ish layer in there this could be the case...
Longer answer: The Linux subsystem in Windows 10 only deals with userspace. It doesn't support kernel modules nor changes anything about making Windows drivers. Porting ZFS to Windows is certainly possible, but it will take quite a lot of effort, and the Linux subsystem is irrelevant in that situation.
Obviously, I've yet to actually use it myself.
That doesn't get you access from Windows programs, but there are some other ways to do FUSE or FUSE-like things on Windows.
The only choices, imo, are multiple disks, or automatic backup to offsite (dropbox, box, one drive, etc).
Modern HDDs present a consistent interface to the rest of the computer, no matter what data encoding scheme is used internally. Typically a DSP in the electronics inside the HDD takes the raw analog voltages from the read head and uses PRML and Reed–Solomon error correction to decode the sector boundaries and sector data, then sends that data out the standard interface. That DSP also watches the error rate detected by error detection and correction, and performs bad sector remapping, data collection for Self-Monitoring, Analysis, and Reporting Technology, and other internal tasks.
With the answer I gave above, and the fact that it's already being done in hardware, I don't think adding another layer of EC will be fruitful.
Also, EC at the block level would probably spread the blocks around the drive. This means any read would need to seek all over the dang place trying to reassemble files and that's a bad access pattern for spinning disks. Real bad. Like, the worst. It might even reduce the effective lifetime of the drive. So it would be not only correlated with device failure, it could precipitate it.
Maybe it would be ok on an SSD.
Error detection allows you to:
* Discover and replace bad hardware (like the author of the article),
* Avoid unknowingly using invalid data for calculations or program code,
* Stop using the system and corrupting more data over time,
* Restore files from backup while still in the retention period of your backup service.
I once had a computer with a bad disk that somehow corrupted 1 bit out of about 150MB written, and probably only in some regions. I only found out after the GZip checksum of a very large downloaded file mismatched, and it took a while to figure out the real cause. By that time I had been using it for months, so it's unclear to this day what other files might have been corrupted.
Nope I'd take a system that can tell me "Hey sorry I don't have the right data" over "Here's some data, looks good to me" any day of the week.
Also as ProblemFactory points out, ZFS will self-heal whenever possible.
Nope, it doesn't help with anything. I think you are making false assumptions about the safety of your data. Because to keep consistency, such a system must automatically shut down on data corruption and only wake up once the problem is fixed. To keep availability, it must automatically fix the problem. There is no third option, because filesystems are not aware of how applications use them. Returning EIO would also introduce inconsistency.
But if you actually care about your data, you shouldn't trust a single box, ever, no matter how fancy the filesystem is. There are just so many things that could go wrong, it's hard to believe someone would even take that risk.
Just a few weeks ago I saw data loss on one of the object storage nodes because of a typical SATA cable issue. It caused multiple kernel panics, affecting data on multiple disks, before I was able to figure it out. Not a problem for distributed storage, but you wouldn't survive that with local storage. This is also one of the reasons I consider ZFS an over-engineered toy that doesn't target real-world problems.
The purpose of data block checksumming isn't to make your data more resilient (at least directly), it's to make sure you know you have a problem. Once you know you have a problem, then you can go read from your alternate datacenter or whatever.
A super-ZFS that automatically did that using remote mirrors would be interesting, but would also stretch the definition of "transparent" a bit.
Data centers do have lightning protection. Lightning protection is very low-tech.
Unfortunately, you still can't get a root zfs partition without hassles.
Filesystems are HARD. And as bcantrill notes, if you can't get simple things like epoll right, you're dead in the water on harder things like kqueue and zfs, DTrace, jails, etc.
This is proving true looking at the fact Linux land can't get a filesystem right and they've been trying for decades.
Many will argue and try and justify that simple fact but as the cold harsh hammer of reality slams into their skull eventually those folks will be forced to face the reality.
Linux honestly just needs to adopt zfs. Period. But because of the Linux license that may be all but impossible now. Which leaves Linux in this untenable position of being stuck in a tar pit: unable to adopt the clear choice to move forward, and without the technical ability to implement their own solution equal to zfs. So whither Linux now?
> Linux honestly just needs to adopt zfs. Period.
> But because of the Linux license that may be all but impossible now.
If your only copy of your children's baby pictures is on a solo 4TB drive in your house --- then you may be more than willing to pay that cost. But what if your house burns down? It may be that the right answer is to do data backups in the cloud, and there you will want erasure coding, and perhaps more than 100% overhead --- you might want an erasure coding scheme that has a 150% to 300% blowup in space so you have better data availability, not just data loss avoidance.
I do agree that file systems are hard, but at the same time you need to have a business case for doing that level of investment. This is true whether it is for a proprietary or open source code base. Many years ago, before ZFS was announced, I participated in a company wide study at my employer at the time about whether it was worth it to invest in file systems. Many distinguished engineers and fellows, as well as business people and product managers participated. The question was studied not just in terms of the technical questions, but also from the business perspective --- "was the ROI on this positive; would customers actually pay more such that the company would actually gain market share, or otherwise profit from making this investment". And the answer was "No". This made me sad, because it meant that my company wasn't going to invest in further file system technologies. But from a stock holder's perspective, the company's capital was better spent investing in middleware and database products, because there the ROI was actually positive.
From everything that I've read and from listening to Bryan's presentations, my understanding is that at Sun they did _not_ do a careful business investigation before deciding to invest in ZFS. And so, "Good for Open Solaris!" Maybe it kinda sucked if you were a Sun shareholder, but maybe Sun was going to go belly-up anyway, so might as well get some cool technology developed before the ship went under. :-)
As far as Linux is concerned, at some level, the amount of volunteer manpower and corporate investment in btrfs speaks to, I suspect, a similar business decision being made across the Linux ecosystem about whether or not the ROI of investing in btrfs makes sense. The companies that have invested in ext4's no-journal mode have done so because, if what you want is a back-end local-disk file system on top of which you put a cluster file system like HDFS or GFS or Colossus, where data integrity is handled end-to-end at the cluster file system level and not at the local disk file system, you'll want the lowest-overhead file system layer you can get. That doesn't mean you don't care about data integrity; you do! But you've made an architectural decision about where to place that functionality, and a business decision about where to invest your proprietary and open source development work.
Each company and each user should make their own decisions. If you don't need the software compatibility and other advantages of Linux, and if data integrity is hugely important, and you don't want to go down the path of using some userspace solution like Camlistore, or some cloud service like AWS S3 or Google Cloud Storage, then perhaps switching to FreeBSD is the right solution for you. Or you can choose to contribute to make btrfs more stable. And in some cases that may mean being willing to accept a certain risk for data loss at one level, because you handle your integrity requirements at a different level. (e.g., why worry about your source tree on your laptop when it's backed up regularly via "git push" to servers around the internet?)
I spent a day chasing what turned out to be a bad bit in the cache of a disk drive; bits would get set to zero in random sectors, but always at a specific sector offset. The drive firmware didn't bother doing any kind of memory test; even a simple stuck-at test would have found this and preserved the customer's data.
In another case, we had Merkle-tree integrity checking in a file system, to prevent attackers from tampering with data. The unasked-for feature was that it was a memory test, too, and we found a bunch of systems with bad RAM. ECC would have made this a non-issue, but this was consumer-level hardware with very small cost margins.
It's fun (well maybe "fun" isn't the right word) to watch the different ways that large populations of systems fail. Crash reports from 50M machines will shake your trust in anything more powerful than a pocket calculator.
They even spread the metadata across the disk by default. I'm running on some old WD Greens with 1500+ bad sectors and it's cruising along with RAIDZ just fine.
There is also failmode=continue, where ZFS doesn't hang when it can't read something. If you have a distributed layer above ZFS that also checksums (like HDFS) you can go pretty far even without RAID and quite broken disks. There is also copies=n. When ZFS broke, the disk usually stopped talking or died a few days later. btrfs and ext4 just choke and remount read-only quite fast (probably the best and correct course of action), but you can tell ZFS to just carry on! Great piece of engineering!
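Both knobs are ordinary properties, for reference (a sketch; pool and dataset names are hypothetical):

    zpool set failmode=continue tank    # return EIO on errors instead of
                                        # suspending all I/O to the pool
    zfs set copies=2 tank/scratch       # keep two copies of each block,
                                        # even on a single disk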
We had a cluster for Hadoop experiments at uni and no resources to replace all the faulty disks at that time (20-30% were faulty to some degree going by the SMART values - more than 150 disks). So this was kind of an experiment. All data in use was available and backed up outside of that cluster. The problem was that with ext4, after running a job certain disks always switched to read-only, and this was a major hassle as the node had to be touched by hand. HDFS is 3x replicated and checksummed, and the disks usually worked fine for quite a time after the first bad sector. So we switched to ZFS, ran weekly scrubs - only replacing disks that didn't survive the scrub in reasonable time or with reasonable failure rates - and bumped up the HDFS checksum reads so that everything is read and verified once a week. The working directory for the layer above (MapReduce and stuff like that) got a dataset with copies=2 so that intermediate data stays intact within reasonable limits. This was for learning or research purposes where top speed or 100% integrity didn't matter and uptime and usability were more important. Basically the metadata on disk had to be sound, and the data on a single disk didn't matter that much. This was quite a ride and it's long been replaced since then.
Just thought it's interesting how far you can push that. In the end it worked but turned out there is no magic, disks die sooner or later and sometimes take the whole node with them.
Don't go to eBay and buy broken disks believing that with ZFS they will work. Some survive a while, most die fast, some exhibit strange behavior.
That RAIDZ is more or less for "let's see where this goes" purposes; backups are in place and it's not a production system.
It seems that limited resources often lead to some interesting solutions (and learning new things). A factor that is not very common in VC backed companies.
Therefore this is of limited relevance to Apple.
In both sense of the word.
Many moons ago, in one of my first professional assignments, I was tasked with what was, for the organisation, myself, and the provisioned equipment, a stupidly large data processing task. One of the problems encountered was a failure of a critical hard drive -- this on a system with no concept of a filesystem integrity check (think a particularly culpable damned operating system, and yes, I said that everything about this was stupid). The process of both tracking down, and then demonstrating convincingly to management (I said ...) the nature of the problem was infuriating.
And that was with hardware which was reliably and replicably bad. Transient data corruption ... because cosmic rays ... gets to be one of those particularly annoying failure modes.
Yes, checksums and redundancy, please.
My assumption is the read will fail and the error will be logged, but there is no redundancy, so it will stay unreadable.
Will ZFS attempt to read the file again, in case the error is transient? If not, can I make ZFS retry reading? Can I "unlock" the file and read it even though it is corrupted, or get a copy of the file? If I restore the file from backup, can ZFS make sure the backup is good using the checksum it expects the file to have?
Single-disk users seem to be unusual, so it's not obvious how to do this; all the documentation assumes a highly available installation rather than a laptop. But I think there's value in ZFS even with a single disk - if only I understood exactly how it fails and how to scavenge for pieces when it does.
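For what it's worth, the relevant commands look roughly like this (a sketch; tank is a hypothetical pool name, and with no redundancy and copies=1 the damaged files can only come back from backup):

    zpool status -v tank    # lists files with permanent (unrecoverable) errors
    zpool scrub tank        # re-reads everything; a transient error may clear
    # restore the listed files from backup, then:
    zpool clear tank        # reset the pool's error counters
    zpool scrub tank        # confirm the restored data now checksums clean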
So I'm definitely going to keep using my solution. All the next-generation FS stuff is cool (btrfs too, indeed), but for the simplest use case people just need their data kept safe and fixed if the disk goes bad.
But I am interested in your question as well.
I scrub on a weekly basis. One day ZFS started reporting silent errors on disk ada3, just 4kB:
status: One or more devices has experienced an unrecoverable error. An
attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
scan: scrub repaired 4K in 21h05m with 0 errors on Mon Aug 29 20:52:45 2016
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
ada3 ONLINE 0 0 2 <---
ada4 ONLINE 0 0 0
ada6 ONLINE 0 0 0
ada1 ONLINE 0 0 0
ada2 ONLINE 0 0 0
ada5 ONLINE 0 0 0
2016-09-05: 1.7MB silently corrupted on ada3 (ST5000DM000-1FK178)
2016-09-12: 5.2MB silently corrupted on ada3 (ST5000DM000-1FK178)
2016-09-19: 300kB silently corrupted on ada3 (ST5000DM000-1FK178)
2016-09-26: 1.8MB silently corrupted on ada3 (ST5000DM000-1FK178)
2016-10-03: 3.1MB silently corrupted on ada3 (ST5000DM000-1FK178)
2016-10-10: 84kB silently corrupted on ada3 (ST5000DM000-1FK178)
2016-10-17: 204kB silently corrupted on ada3 (ST5000DM000-1FK178)
2016-10-24: 388kB silently corrupted on ada3 (ST5000DM000-1FK178)
2016-11-07: 3.9MB silently corrupted on ada3 (ST5000DM000-1FK178)
The next week the server again became unreachable during a scrub. I could access the console over IPMI but the network seemed non-working even though the link was up. I checked the IPMI event logs and saw multiple correctable memory ECC errors:
Correctable Memory ECC @ DIMM1A(CPU1) - Asserted
MCA: Bank 4, Status 0xdc00400080080813
MCA: Global Cap 0x0000000000000106, Status 0x0000000000000000
MCA: Vendor "AuthenticAMD", ID 0x100f80, APIC ID 0
MCA: CPU 0 COR OVER BUSLG Source RD Memory
MCA: Address 0x5462930
MCA: Misc 0xe00c0f2b01000000
The reason these memory errors always happened on ada3 is not because of a bad drive or bad cables, but likely due to the way FreeBSD allocates buffer memory to cache drive data: the data for ada3 was probably located right on defective physical memory page(s), and the kernel never moves that data around. So it's always ada3 data that seems corrupted.
PS: the really nice combinatorial property of raidz2 with 6 drives is that when silent corruption occurs, the kernel has 15 different ways to attempt to rebuild the data ("6 choose 4 = 15").
I wonder what 'really' happened.
It's kind of (literally?) like immutability. If you allow even a little mutability, it ruins it.
I think all filesystems should be able to add error-correction data to ensure data integrity.
Realistically though, I don't know if MySQL has this. You'd probably be better off using a filesystem that gives these kind of guarantees and running your database on that.
HP/Dell/SM servers are all ECC memory top to bottom, are properly wired (snark) and SANs have also ECC everywhere.
Even on the disk caches.
And in this particular instance, it was basically some private server build being messed up.
On a slight tangent, ZFS will checksum data and store that checksum in the block pointer (i.e. not with the data itself), so it can tell which of the copies is correct. The same extends to RAID 5 and RAID 6, although with RAID 6 you can intelligently work out which block might be bad. However, that is assuming the block devices are returning consistent data and you are the one talking to the block devices. If the disks were sat behind a hardware RAID controller, and the controller was the one talking to them, you'd be hard pressed to identify the source of the data corruption. The checksumming in ZFS comes to the rescue here again.
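You can watch this happen with a throwaway file-backed mirror (a sketch; needs root, and the pool name demo and file paths are arbitrary):

    # build a mirror from two sparse files and put some data in it
    truncate -s 256M /tmp/d0 /tmp/d1
    zpool create demo mirror /tmp/d0 /tmp/d1
    dd if=/dev/urandom of=/demo/file bs=1M count=100
    # clobber the middle of one side behind ZFS's back (avoiding the
    # vdev labels at the start and end of the file)
    dd if=/dev/urandom of=/tmp/d1 bs=1M count=64 seek=64 conv=notrunc
    zpool scrub demo
    zpool status -v demo    # CKSUM errors on /tmp/d1, repaired from /tmp/d0
    zpool destroy demo      # clean up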
I recommend checking out this video  from Bryan Cantrill. It's about Joyent's object store Manta but features a fair bit of ZFS history. Also it features the usual rant level that one can come to expect from a Bryan Cantrill talk which I quite enjoy. There are plenty of other videos available on ZFS.
I can't remember if it was merged upstream, but some folks from Facebook worked on a write-back cache for MD-RAID (4/5/6 personalities) in Linux which essentially closes the write-hole too. It allows one to stage dirty RAID stripe data in a non-volatile medium (NVDIMMs/flash) before submitting block requests to the underlying array. On recovery, the cache is scanned for dirty stripes, which are restored before actually using the P/Q parity to rebuild user data. I worked on something similar in a prior project where we cached dirty stripes in an NVDIMM, and also mirrored it on its controller-pair (in a dual-controller server architecture) using NTB. It was a fun project, when neither the PMEM nor the NTB driver subsystems were in the mainline kernel.
Haven't tried it, but it seems to already be merged, at least write-through. They're now working on implementing write-back, and IIRC this isn't merged yet.
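For anyone wanting to try it, the journal is specified at array creation time (a sketch; device names are hypothetical, and the journal device should itself be fast and non-volatile):

    # RAID5 whose stripe updates are staged in a journal first, closing
    # the write hole across crashes
    mdadm --create /dev/md0 --level=5 --raid-devices=3 \
        /dev/sda1 /dev/sdb1 /dev/sdc1 \
        --write-journal /dev/nvme0n1p1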
Last I checked (maybe a year or 2 ago), I read that btrfs also suffered from the write hole. Is that still the case?
So ZFS protects against end-user mistakes.
I was really hoping for a story on some large-scale study of silent data corruption, but no, just an anecdote.
I have, incidentally, seen this in corporate environments on traditionally-engineered server-class hardware as well. This is just a much more easily-discussed case.
I am in the minority group that gets very frustrated and paranoid when my videos or photos get corrupted.
Synology has Btrfs on some range of their NAS. But most of them are expensive.
I really want a consumer NAS, or preferably even a Time Capsule (with two 2.5" HDDs instead of one drive), with built-in ZFS and ECC memory, that scrubs the drives weekly by default and alerts you when there is a problem.
And lastly, do any of the consumer cloud storage services (OneDrive, Dropbox, Amazon, iCloud) have these protections in place? Because I would much rather data corruption be someone else's problem than complexity at my end.
Unfortunately, btrfs is not stable, and ZFS needs a "super computer", or at least as many GBs of ECC RAM as you can buy. This solution is designed for any machine and any FS.
ECC is nice to have, but ZFS does not have special requirements over, say, a regular page cache. The only difference is that ZFS will discover bit-flips instead of just ignoring them as ext4 or xfs would do.
Actually, it seems ECC is important for ZFS filesystems, see:
Thus, if one is using ZFS for data reliability, one ought to use ECC memory as well.
The implication of the previous comment tends to lead people to think ECC RAM is needed for ZFS specifically. As the blog post you link to points out, it's equally applicable to all filesystems.
That's exactly the kind of hardware I was referring to: 1 GB of plain RAM.
Honestly, I haven't tested ZFS yet for that very reason: I've always read that ZFS has big requirements, so I refrained from trying it. It seems I should give it a try. ;)
Btrfs is another story: I've used it for years, and I'd prefer not to have to use it anymore until it becomes "stable" and "performant". :)
My main issue is being able to repair "silent" data corruption on a single-drive machine. Can I use x% of my "partition" for data repair, or do I need another partition/drive to mirror/RAID it?
If I understand right, ZFS can detect bitrot ("not really" a big deal) but without any local copy it can't self-heal.
My use case is an ARM A20 SoC (Lime2) used to store local backups among other things, so I need something that detects and repairs silent data corruption at rest by itself (using a single drive).
A poor man's NAS/server. ;)
Ideally, though, please add an additional disk and set it up as a mirror.
ZFS can detect the silent data corruption during data access or during a zpool scrub (which can be run on a live production server). If there happen to be multiple copies, then ZFS can use one of the working copies to repair the corrupted copy.
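On a single drive, the copies property is the knob for that (a sketch; pool/dataset names are hypothetical, and note it costs the corresponding disk space and does nothing against whole-drive failure):

    # keep two copies of every block of this dataset on the one disk
    zfs create -o copies=2 tank/backups
    # corruption found during normal reads or a scrub is then repaired
    # from the surviving copy
    zpool scrub tank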
Anyway, I will try to use it for my main PC, which has several disks, and continue to use my solution for single-disk machines (laptop, VPS, SoC...). :)
In order to achieve the same with ZFS you have to run RAID-Z2 on sparse files.
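Presumably along these lines (a sketch; file-backed vdevs on one physical disk trade space for redundancy and are usually treated as an experiment rather than production practice):

    # six sparse files on the single physical disk as raidz2 vdevs
    for i in 0 1 2 3 4 5; do truncate -s 64G /backing/f$i; done
    zpool create tank raidz2 \
        /backing/f0 /backing/f1 /backing/f2 \
        /backing/f3 /backing/f4 /backing/f5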