ZFS: Apple’s New Filesystem That Wasn’t (dtrace.org)
301 points by swills on June 15, 2016 | 128 comments

It's kind of a shame that ZFS hasn't seen better adoption, and that btrfs has been pretty stagnant, and that in general next generation file systems are mostly stillborn. It would be nice to see some improvements in this area as 10TB HDDs are imminent, and data management is important.

However, it's worth noting that RAID, especially in software on systems without ECC RAM, is less than ideal. Beyond this is the overhead of managing larger filesystems with ZFS. The DIY RAID servers that support it have had some troublesome issues that I've experienced firsthand.

It's likely a lot of these advantages have been displaced by the discontinuation of Apple's server projects as well as other fears. On a similar note, I've always been somewhat surprised that NTFS hasn't been more popular for external media, as it's a pretty good fit there.

In the end, software at this level has been held back significantly by the patent hornet's nest that exists in the world today. I truly hope that we can see some reform in this space, but with the likes of TPP and similar treaty negotiations around the world today, that is pretty unlikely. Eventually some countries will have to strike out, break these bad treaties and rein in IP law. The vast majority of software is undeserving of patent protection. Just as software really shouldn't have copyright law that lasts for decades. It's a shame all around.

"It's kind of a shame that ZFS hasn't seen better adoption"

We[1] are full steam ahead!

We transitioned all of rsync.net, globally, to ZFS starting in late 2012. After that, we began offering zfs send and receive (over ssh, of course) into rsync.net filesystems[2].

It's working wonderfully. We have many, many customers that do extremely efficient backups of very large zpools/snapshots into the rsync.net location of their choice. It's been tested and reviewed favorably in the press.[3]

[1] rsync.net

[2] http://www.rsync.net/products/zfsintro.html

[3] https://0x.co/G36YPW

Yep, used an "Oh By Code" as a shortener for that arstechnica link ... URL shortening is a secondary use-case:


I ran into a bug just today that I'm still struggling to believe. A large Oracle server fell over, just reporting "IO errors". The issue was ultimately traced to NTFS's handling of fragmentation[0].

I've gone about ten years thinking running a defrag - particularly on a virtual disk sitting on a SAN in a RAID array - made no sense. Today's downtime says otherwise.

It's one thing to perform poorly. But seeing services just die with disk write errors on a disk that's 30% full has left me with a very low opinion of NTFS. Our Oracle person sounds like a desktop maintenance person from the 90's making an issue out of "the importance of weekly defrags".

[0] https://support.microsoft.com/en-au/kb/967351

In my DBA days in the 90s/2000s, this sort of issue was usually a lousy DBA that had extent sizes set too small.

Windows can clearly handle big databases. The people getting paid to run the platform need to have a clue about the platforms they are running.

XFS is subject to fragmentation as well. Allegedly you can minimize it with some careful planning and mount options, but it seems easier just to enable the scheduled task that can be preempted.

btrfs went through the same thing (though maybe only over a year or two) before realizing they needed an automatically triggered rebalance, which is essentially a very similar operation.

There is nothing special about storage that makes the absence of ECC in a computer any more of a bad idea than it already is. Building machines without ECC is a bad idea.

> Building machines without ECC is a bad idea.

I'd say it depends on the use case. For business storage servers, especially "mission-critical", sure. (Then again, this data was created somewhere; presumably mainly on laptops without ECC. Why is that not a bad idea?)

But for a home storage server, ECC is no more critical than buying a UPS, shielded SATA cables, enterprise-grade drives, a high quality power supply, etc. etc. Together these upgrades will easily double the cost of the storage server. Since you'll want an off-site backup solution anyway in case of disaster, why bother paying for an expensive server that only reduces the chance of needing to restore from backup from 0.001% to 0.00001%?

> But for a home storage server, ECC is no more critical than buying a UPS, shielded SATA cables, enterprise-grade drives, a high quality power supply, etc. etc. Together these upgrades will easily double the cost of the storage server. Since you'll want an off-site backup solution anyway in case of disaster, why bother paying for an expensive server that only reduces the chance of needing to restore from backup from 0.001% to 0.00001%?

Flipping it around: what's the point of backing up to a remote server if the data you're backing up is itself corrupted?

I don't think most of the upgrades you mentioned are really in the same bucket as ECC DRAM because only the ECC DRAM seems likely to prevent data corruption. Failing drives and cables and power loss, even failures that corrupt data, should be survivable events assuming drive redundancy, particularly given a filesystem like ZFS that will detect and repair the corruption.

Any work we do to improve the reliability of consumer software is limited by the reliability of the underlying machine.

If everyone used ECC RAM, it would be marginally more expensive than regular RAM, but it would let us (theoretically) cut the number of real-life crashes by a huge degree. I think if every computer system were like this, it would have non-obvious positive effects on society (like the ability to safely give computers increased responsibility).

> it would let us (theoretically) cut the number of real-life crashes by a huge degree

Do you have a source for that claim? I know people have done tests running large molecular simulations on hundreds of GPUs with/without ECC for days on end, and have come to the conclusion that ECC probably isn't worth it [1].

And this makes sense. Most of a program's memory usage is data. Thus the rare event of a bitflip is most likely to occur in data. Such a bitflip doesn't cause a crash in a consumer system; maybe it changes the color of one pixel in one frame of a video file, or adds some other tiny noise somewhere.

[1] http://dl.acm.org/citation.cfm?id=2484774

Interesting, thanks. From what I can tell, the probability of one bit flip per year is about 25% for a machine that's always on and that's under constant load. Reducing the load about halves the probability. If we consider then a work laptop, average life of three years, used 8-5 five days a week, you get a probability of ~10% for a single bitflip happening during the machine's lifetime.
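For anyone who wants to check that arithmetic, here is a rough sketch. It assumes bit flips arrive as a Poisson process, which is an assumption of this sketch rather than something the paper or the comment states. The 25% per year figure, the halving under light load, and the 8-5 schedule come from the comment above; the function and parameter names are made up for illustration.

```python
import math

def lifetime_flip_probability(p_year_full_load, load_factor, hours_per_week, years):
    """Chance of at least one bit flip over a machine's lifetime.

    Assumes flips follow a Poisson process, so a yearly probability converts
    to a rate via -ln(1 - p). That model is an assumption of this sketch.
    """
    # Yearly event rate for an always-on machine under constant load.
    rate_full = -math.log(1.0 - p_year_full_load)
    # Scale by the lighter load and by the fraction of the week the machine is on.
    duty_cycle = hours_per_week / (7 * 24)
    expected_flips = rate_full * load_factor * duty_cycle * years
    return 1.0 - math.exp(-expected_flips)

# 25% per year at full load, halved for light load, 9 h/day for 5 days, 3 years.
p = lifetime_flip_probability(0.25, 0.5, 9 * 5, 3)
print(f"{p:.1%}")
```

Under those assumptions the result lands close to the ~10% quoted above.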

The question is then: would reducing one crash per lifetime of the machine on every ten machines be a huge improvement?

I'm not sure if I'm interpreting the paper correctly, but I don't think those numbers quite match what the authors found. I find their presentation confusing, though, so possibly I'm the one that's wrong.

I think you understand this but in case it's not clear to others (I wasn't sure until I got to the end of the paper) all of the machines in their study are using ECC memory. This allows them to measure both the number of Correctable Errors (single bit) and Uncorrectable Errors (more than single).

I think you are right that the chances of a single bit flip for a randomly chosen machine are about 25%. In Table 1, they say CE incidence for a randomly chosen machine is 32% per year. But they also show that a machine that has one correctable memory error is much more likely to exhibit many more.

I think that "CE Median Affct" (also in Table 1) says that among the machines that have been affected by at least one correctable error, the median number of errors is 277. Thus rather than one crash per 10 machines over 3 years, it's likely that you'll get hundreds of crashes, but with one (or zero, or two) of those 10 machines responsible for all of them.

The Uncorrectable Error rate they found was about 1/20 of the correctable rate, a little over 1% per year. Assuming the same "use time" factor applies, this would mean that with ECC none of the 10 machines would be likely to ever crash due to bit flips, reducing the number of crashes by hundreds.

But this doesn't tell the full story either. Maybe it's best to view it as 1000 employees, each of whom is issued a laptop. If you buy them machines without ECC, the initial cost is less but 100 of them will experience frequent crashes during the lifetime of the laptop, leading to frustration, possible lost work, and early machine replacement.

If you pay more up front for ECC, the number of affected employees will be reduced from 100 to 5. How much extra one should be willing to pay for this reduction depends on circumstances, but I think the impact is quite different than merely avoiding 100 equally distributed random crashes over the course of 3 years.

I agree this point is confusing. However, I think the "10% of employees will experience frequent crashes and possible data loss" sounds way too extreme.

I mean, how many companies do you know that give all their workers Xeon-powered laptops? My impression is that 99% of people have i3/5/7 powered machines at work.

Are companies really experiencing frequent failures caused by RAM on 10% of their employees' machines? Can anyone with such knowledge weigh in?

The problem of bitflips on consumer workstations is greatly overstated by some. The single biggest cause of crashes is, and always will be, the end user themselves. Or PEBKAC[1] as some like to call it.

While reducing bitflip on all devices would be a nice step forward for consumer electronics, pragmatically I have to agree with you that it would have negligible impact in the real world. Those that need it generally already pay for it, and those that don't already find countless other and more frequent ways of breaking their machines.

[1] https://en.wikipedia.org/wiki/User_error#Acronyms_and_other_...

IIRC, most crashes are due to poor driver implementations, and other software bugs. I can't think of a user interaction that would crash a system, unless they physically beat the hell out of their system causing damage to spinning rust.

I get what you're saying, but often the bad drivers are because users are installing non-recommended drivers to begin with. Or installing known buggy software that experienced users like us would have avoided.

I've seen users place PC and Macs next to radiators or surround their device with folders and reference books, oblivious to the fact that computers can overheat and need ventilation.

I've seen users trying to use Excel as a database, then get impatient when Excel starts hanging or Windows starts running slow, and retaliate by pressing more buttons more rapidly thus locking the entire OS up.

I've seen users who would just switch the power off at the wall every night. Over time NTFS would corrupt to a point where random crashes would happen.

I've also seen people physically attack - kick and thump - their computer when it runs slowly, an application crashes, or just through bad temper. This can damage the HDD and lead to more crashes.

There are also users who think they understand computers, so they like to tinker with lower level settings (IRQs, paging, CMOS settings, driver options, etc) then get surprised when things behave unexpectedly.

Most recently though, I had to explain to one guy who installed his own motherboard and accidentally cut a deep scratch into it, why his random crashes are potentially his own fault.

So I definitely blame users for most PC crashes. Even with the small number of kernel panics I've had over the last 15 or so years, I can attribute most of them to myself. eg I was experimenting with undocumented Windows APIs to change low level behaviors. Or I was experimenting with running non-supported file systems as my main OS drive.

You say based on what information? ZFS in particular, with the safety features (like checksumming) it tries to guarantee, relies on RAM not lying to you. http://research.cs.wisc.edu/adsl/Publications/zfs-corruption...

Cost-benefit analysis. These things are relatively expensive. My home NAS basically stores photos, mp3, and some videos. It's backed up offsite. The vast majority of the time, it's in hibernate. It uses ext4, because I didn't get ECC RAM. Why not? I picked the board on price, and didn't realize it until I tried to install FreeNAS[], which uses ZFS, so I switched to OpenMediaVault, and never looked back.

What's the worst that's going to happen? Some photo is going to get a single bit error? So what? The data doesn't need super high fidelity. What about a file system crash? Well that's what backups are for.

[] I would also like to take the time to say that I found the FreeNAS community completely horrible. They're focused solely on people setting up like HIPAA compliant scientific data, or something. It's ZFS for everything. Backups to USB drives are stupid, and you're stupid for having that as a backup strategy, and instead should drop another $1000 on a second NAS for your house, and then backup across a network.

OpenMediaVault's community on the other hand is much more helpful and open to both professional and personal use cases.

"What's the worst that's going to happen? Some photo is going to get a single bit error? So what? The data doesn't need super high fidelity."

A fact that complicates this argument is the prevalence of compressed data formats. A wrong bit in a photo could corrupt the metadata, or a block of DCT coefficients, such that a whole image block or scan line does not decode, effectively ruining the photo. A wrong bit in a .tar.gz could well corrupt the entire archive. You never know where that bad bit will land.

(Yes, this has happened to me -- a single bit flip, probably from RAM error, corrupted a large .tar.gz -- I tracked it down to the bit that changed because I had another copy.)
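A small sketch of that failure mode, for the curious. It uses Python's zlib module rather than an actual .tar.gz (gzip wraps the same DEFLATE stream but with a CRC-32 instead of zlib's Adler-32), and the "file contents" are just a deterministic stand-in made of hex digests. The helper names are invented for the example.

```python
import hashlib
import zlib

def flip_bit(blob, byte_index, bit_index):
    """Return a copy of blob with exactly one bit inverted."""
    damaged = bytearray(blob)
    damaged[byte_index] ^= 1 << bit_index
    return bytes(damaged)

def try_decompress(blob, original):
    """Report what happens when a (possibly damaged) stream is decompressed."""
    try:
        recovered = zlib.decompress(blob)
    except zlib.error:
        return "decompression failed"
    return "intact" if recovered == original else "silently different"

# Deterministic stand-in for file contents: 2000 hex digests (128,000 bytes).
original = "".join(
    hashlib.sha256(str(i).encode()).hexdigest() for i in range(2000)
).encode()
compressed = zlib.compress(original)

# Flip a single bit in the middle of the compressed stream.
damaged = flip_bit(compressed, len(compressed) // 2, 0)

print("untouched:", try_decompress(compressed, original))
print("one bit flipped:", try_decompress(damaged, original))
```

The point is that everything after the damaged position depends on it, so one bad bit takes out far more than one byte of the original data.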

> "I would also like to take the time to say that I found the FreeNAS community completely horrible. They're focused solely on people setting up like HIPAA compliant scientific data, or something. It's ZFS for everything. Backups to USB drives are stupid...."

In fairness to the FreeNAS community, USB drives often aren't a good backup solution:

-> USB thumb drives break all the time. They're a terrible choice for any long term storage plan. So you're better off with a HDD / SSD

-> Keeping a USB device inserted all the time means the backup is permanently accessible. This is great for convenience but terrible for a backup as malware (eg ransomware) could access and write to your backup. And even if your backup isn't mounted, it could still be subject to any other physical disaster that knocks out your main archive of data (eg flooding). Backups should always be kept separate - ideally offsite.

-> So now you've got a removable HDD that you need to connect and disconnect frequently. You better buy something reasonably rugged because cheaper drives might fail due to frequent manhandling. Which means you're no longer looking at a budget device.

-> Finally, USB backup solutions cannot be automated. So there's always the risk that someone would let their backups grow out of date. If not out of forgetfulness then just out of plain laziness (eg "I'm too busy to do a backup right now, I'll do it 'tomorrow'").

So while USB storage can be workable as a backup solution, there are a lot of caveats that would invalidate the whole point of a backup if one isn't careful.

Correct me if I'm wrong, but other than the connection itself, aren't most modern high speed thumb drives very similar to SSDs, and other USB storage effectively an HDD in an enclosure with a USB interface?

I mean if you have an 8TB NAS, and you want to use an 8TB USB drive to backup said NAS locally, I'm not sure I see that as a problem. Backup doesn't HAVE to be significantly more durable than the source material, which is why multiple backups and at least one off site are recommended.

After my time doing support for iomega (so long ago), I don't consider anything really backed up unless there are at least 3 copies, and at least 1 of them is offsite.

> "Correct me if I'm wrong, but other than the connection itself, aren't most modern high speed thumb drives very similar to SSDs"

The mid to high end stuff is. The lower end is less so. But you pay a premium for decent storage capacity on a decent thumb drive, so either way, you're back to an external HDD / SSD drive.

> "and other USB storage effectively an HDD in an enclosure with a USB interface?"

Indeed. Hence why I said "So you're better off with a HDD / SSD" when referring to other USB storage devices.

> "I mean if you have an 8TB NAS, and you want to use an 8TB USB drive to backup said NAS locally, I'm not sure I see that as a problem. Backup doesn't HAVE to be significantly more durable than the source material, which is why multiple backups and at least one off site are recommended."

I don't have an issue with USB storage per se; I was just saying there are caveats to consider. And while you're right that backups don't have to be more durable than the source, you have to bear in mind that the kind of people who would be looking for advice about using USB devices as a backup solution are unlikely to be the kind of people who would have multiple USB devices, test their backups to ensure the medium hasn't degraded, or keep their backups offsite, as these kinds of checks and additions can add significant costs.

Or to put it another way, if one is unwilling to spend $1000 on a second NAS then they are unlikely to want to spend $1000 on a few decent external drives. So that person is going to start scaling back their requirements (cheaper drives, fewer drives, etc) and quickly end up in a situation where their backup solution is total garbage.

Bear in mind, the kind of people who would wander onto FreeNAS's forums looking for backup advice are unlikely to be people like you and me who understand how to implement these things correctly. So it doesn't surprise me that many members of the FreeNAS community veto recommending USB drives, knowing how easy it would be for someone inexperienced to get things wrong (eg leaving their USB device connected, forgetting that some malware would just write to the USB device as well).

> It uses ext4, because I didn't get ECC RAM. Why not? I picked the board on price, and didn't realize it until I tried to install FreeNAS[], which uses ZFS, so I switched to OpenMediaVault, and never looked back.

I don't follow the reasoning here? It's not like ZFS gets worse if you're not using ECC RAM. Rather it's: ZFS is reliable enough that using ECC RAM becomes worthwhile. ZFS is still more reliable than ext4 even if you're not using ECC RAM.

Any relaxed requirements are fine until they bite you in the ass, and you lose 11TB of DVD/BR rips that you took the time to rip to watch at your whim/convenience. I can't imagine how upset I'd be if I didn't have the really important stuff redundantly backed up.

That doesn't seem to have any connection to what I said?

What happens if you have a corrupted file (or more) that then gets replicated to your offsite? Are you keeping multiple versions offsite?

This of course depends on your data, but for the home user who generally stores photos, videos and music, 99.9% of the changes consist of adding new files. Thus any modifications of an existing file are somewhat suspicious, and also keeping the old version in your offsite backup is a very good idea and essentially won't increase the storage requirement.

E.g. with rsync, you use the --backup and --suffix flags.
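As a rough illustration of those flags, here is a sketch that only builds the command line. The paths, the host name, and the dated suffix scheme are hypothetical; `-a`, `--backup` and `--suffix` are standard rsync options. With `--backup`, a destination file that would be overwritten is renamed with the suffix instead, so the previous version survives.

```python
import datetime
import shlex

def rsync_backup_command(source, destination, today=None):
    """Build an rsync argv that keeps the old copy of any file that changed."""
    today = today or datetime.date.today()
    suffix = f".{today.isoformat()}.bak"  # hypothetical naming scheme
    return [
        "rsync",
        "-a",                  # archive mode: recurse, preserve times and permissions
        "--backup",            # rename a destination file instead of overwriting it
        f"--suffix={suffix}",  # appended to the renamed (previous) version
        source,
        destination,
    ]

# Hypothetical source directory and backup host.
cmd = rsync_backup_command(
    "/srv/photos/",
    "backup-host:/backups/photos/",
    today=datetime.date(2016, 6, 15),
)
print(shlex.join(cmd))
```

To actually run it you would hand `cmd` to something like `subprocess.run(cmd, check=True)`; a dated suffix means a file that changes on several different days leaves one old copy per day.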

Yeah, I made the same mistake wrt non-ecc... If I'd known, I'd have gotten an Asus board (was an amd processor), and used ECC... I lost a lot of DVD/BD rips, and a lot of time ripping them.

If a bit flips and the checksum is wrong, then the file is marked corrupted, and if you have redundancy or backups, the error is reported and you can recover. Too easy.

If a bit flips after checksumming, whether in the data or checksum, the file can either be fixed automatically or marked corrupted for manual restoration.

But if the problem is the RAM or CPU, the bit could also flip before checksumming. Then the incorrectly written data will have a matching checksum and the corruption can't be detected.

That's why ECC is important.
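The ordering argument above can be sketched in a few lines. This is a toy model, not how ZFS is implemented: SHA-256 stands in for whatever checksum the filesystem uses, and the "write" and "scrub" helpers are invented for the example.

```python
import hashlib

def checksum(block):
    return hashlib.sha256(block).digest()

def flip_bit(block, byte_index=0, bit_index=0):
    """Return a copy of block with exactly one bit inverted."""
    damaged = bytearray(block)
    damaged[byte_index] ^= 1 << bit_index
    return bytes(damaged)

def write_block(block):
    """Toy 'filesystem write': store the data alongside its checksum."""
    return {"data": block, "sum": checksum(block)}

def verify(stored):
    """Toy 'scrub': does the stored data still match its checksum?"""
    return checksum(stored["data"]) == stored["sum"]

intended = b"important family photo"

# Case 1: the bit flips after checksumming (e.g. on disk). The scrub catches it.
on_disk = write_block(intended)
on_disk["data"] = flip_bit(on_disk["data"])
flip_after_detected = not verify(on_disk)

# Case 2: the bit flips in RAM before checksumming. The checksum is computed
# over the already-damaged bytes, so verification passes on bad data.
in_ram = flip_bit(intended)
on_disk = write_block(in_ram)
flip_before_detected = not verify(on_disk)

print(flip_after_detected, flip_before_detected)
```

This prints `True False`: the checksum can only vouch for the bytes it was given, which is the gap ECC is meant to close.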

Fully agree.

But notice that a bit-error in a text file in plain storage has some consequences. That same bit-error in a text file in a RAID (or encrypted storage) has very different consequences.

Some bad ideas are just worse than others.

When writing redundant data (parity) across multiple storage devices, you're increasing the chance of error without ECC. While there's still a chance/risk with a single drive, the effects of such an error are not as significant.

The effects are exactly as significant either way: corrupt data is written to disk. Whether the corrupt data is written one time or several times is immaterial.

May I introduce you to error correction?

Having an error on one disk, versus multiple disks in a RAID totally matters.

How do you figure? If the single disk has bad data written to it, it contains bad data. If the RAID has bad data written to it, it contains bad data. Either way, the bad data is bad, because it was undetectably corrupted in memory before being written out.


Serious question: Do they not teach Hamming codes in undergrad anymore?

Do you not understand the problem here? The problem isn't detecting data corruption in storage, which parity checking also helps with. The problem is detecting data corruption in memory. ECC only matters after irrecoverable data loss _in memory_ has occurred. Without ECC the system will continue to operate; with ECC the system will halt without writing data at all, or correct it before storage even comes into play.

Writing corrupt data once or writing it a thousand times redundantly with parity data looks exactly the same, if it's allowed to happen. It's a valid write of invalid data. Either way, the data you have on your cheap laptop drive or your six-figure storage array is corrupt. This is why the ancestor commenter is correct about the storage technology being irrelevant to the choice to use or not use ECC.

I haven't been in school for many years. They taught Hamming codes, but they also taught garbage-in, garbage-out.
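To make the garbage-in, garbage-out point concrete, here is a minimal sketch using single XOR parity (RAID-5 style) over three tiny "disks". It is a toy, with invented helper names, and it says nothing about how any particular RAID implementation lays out stripes.

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks (single parity, RAID-5 style)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def write_stripe(data_blocks):
    """Toy array write: store the data blocks plus one parity block."""
    return list(data_blocks) + [xor_blocks(data_blocks)]

def rebuild(stripe, lost_index):
    """Reconstruct one lost block from the surviving blocks."""
    survivors = [b for i, b in enumerate(stripe) if i != lost_index]
    return xor_blocks(survivors)

intended = [b"AAAA", b"BBBB", b"CCCC"]

# One bit flips in RAM before the write: the last 'B' (0x42) becomes 'C' (0x43).
corrupted = [intended[0], b"BBBC", intended[2]]
stripe = write_stripe(corrupted)

# The parity is perfectly consistent with the corrupt data...
parity_consistent = xor_blocks(stripe[:-1]) == stripe[-1]
# ...so losing that disk and rebuilding faithfully restores the corruption.
rebuilt = rebuild(stripe, lost_index=1)

print(parity_consistent, rebuilt)
```

The array does its job exactly as designed, and what it protects is the wrong data.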

> I've always been somewhat surprised that NTFS hasn't been more popular for external media, as it's a pretty good fit there.

Isn't that because NTFS is Microsoft-proprietary? AFAIK all the existing third-party implementations are reverse-engineered and hence not 100% safe. Meanwhile even HFS+ has a public specification published by Apple.

Yeah, it is... that said, I do wish MS (NTFS) and Oracle (ZFS) were more permissive wrt patents and would do patent releases. Beyond that, it would be great if MS would even release open documentation on NTFS. I still think it's a pretty great option for portable storage.

The CDDL license explicitly provides a patent grant and since the ZFS code was released under the CDDL, I'm not sure what you're asking for.

I'd be curious to hear about the issues you've had with zfs. I'm running it on a fileserver in my house, and it's been solid for a few years, but my use isn't very demanding. I'd like to know what problems I should be keeping an eye out for (aside from the usual stuff like drive failures, etc.).

ZFS pool IO performance drops like a rock off a very tall cliff once the pool is in the 80%+ full range. So for every TB of usable pool space, you need to over-provision by 200-300GB. This is with current-gen enterprise SSDs, mind you. Even worse things happen when using spinning rust -- like ZFS dropping drives from the pool because IO has degraded so much it thinks there's a bad drive.
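A quick sanity check on those numbers, assuming the only goal is to keep the pool under a fill threshold. The 80% threshold is from the comment above; the function names and the 1 TB = 1000 GB convention are choices made for this sketch.

```python
def raw_capacity_needed(data_tb, max_fill_fraction=0.80):
    """Pool size needed so that data_tb of data stays under the fill threshold."""
    return data_tb / max_fill_fraction

def headroom_gb(data_tb, max_fill_fraction=0.80):
    """Extra space (in GB, using 1 TB = 1000 GB) left empty on top of the data."""
    return (raw_capacity_needed(data_tb, max_fill_fraction) - data_tb) * 1000

for data_tb in (1, 10, 100):
    print(data_tb, "TB of data ->", round(headroom_gb(data_tb)), "GB kept free")
```

At an 80% ceiling that works out to 250 GB of headroom per TB of data, which sits inside the 200-300GB range quoted above.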

It turns out that some raid controllers (Including the very popular LSI variety) + certain kernel versions result in pools and drives being dropped from the array for no discernible reason. Nothing [useful] to logs, just reboot and pray

Non-ECC RAM is just asking for trouble in any circumstance where you care about both data integrity and IOPS. Bitflips can and will get you. Now, they (most likely) won't wind up hosing your entire pool like some apocryphal stories suggest, but you'll get bad data replicated and the possibility of borking some other sector during the 'recovery' process from the bitflip.

When you're in a large enough pool (PB-scale) this becomes even more painful...

100% true about the performance cliff, though at Delphix we worked a bunch to raise performance before the cliff and to push out that cliff. All COW filesystems (and I'll throw SSD FTLs into there as well) require over-provisioning for consistent performance.

And it's also a fair point that for enterprise storage running at 80%-90% of capacity is a reasonable restriction whereas the drive on my laptop is always basically 99% full. It will be interesting to see how APFS addresses this since I'd guess that most of their users look more like my laptop than my file server.

We always over-provision anyway, but with ZFS the overage is a requirement for any reasonable performance. With XFS, you can get away with 90-95% full; yes, there's degraded performance but nothing like the 10 IOPS you get from ZFS.

Don't get me wrong, I like ZFS/btrfs; I adore snapshot send/receive. At times, though, it really handicaps itself.

I did a brief write-up about this effect on the ZoL issue tracker:


In short, if you want more consistent performance as the pool fills, disable metaslab_lba_weighting_enabled, but be prepared to lose some sequential performance when the pool is empty.

It's my understanding that you shouldn't use a RAID controller with ZFS...

LSI cards can be run in both RAID mode and passthrough / standard SAS/SATA controller mode for XFS/ZFS/btrfs -- which works wonderfully most of the time and cuts down on hw heterogeneity. You will have to use some disk controller or another; your motherboard is not going to have 12/24/36/48/80 SATA ports.

Nor a battery-backed write cache.

I believe adding a ZIL device alleviates the low free space performance issue.

Adding a ZIL device is equivalent to over-provisioning, but with performance impacts in high performance situations

From what I understand a ZIL helps with workloads which need to fsync a lot (say databases) and also NFS (due to COMMIT commands issued by clients).

I've been running ZFS in prod since 2008 (on Solaris). Bear in mind that was not long after the "silent data corruption" problem that bit Joyent, but I hit more quotidian problems such as:

* no reserved space on the disk means you can fill it up and then not be able to delete files or extend it because there was no space for COW operations. Sun's answer? Delete the filesystem, recreate and restore from backup.

* zpool management is a pain in the arse generally, and the zpool/zfs distinction is pretty clearly designed on the assumption you'll have sysadmins paying a lot of attention to the low-level FS and pool management.

Imagine Steve Jobs being told "your disk filled up and now you need to reformat it". Apple didn't reject ZFS because (whatever some embittered Sun blogger says) of NIH. They rejected it because it wasn't fit for purpose as a consumer desktop filesystem.

But surely, it would have been a simpler task to modify ZFS to support these things. I mean, the full disk issue is relatively straightforward to fix. And I have a hard time imagining that other issues would be harder than writing your own from scratch.

> But surely, it would have been a simpler task to modify ZFS to support these things.

The first time I encountered the disk full problem was 2009. There were emails from Sun engineers saying "this is a dumb situation, we should have a reserved block of storage to avoid this" from 2006 and nothing had been done. Why would anyone in Apple believe it was going to get better if Sun's response to paying customers was "reformat and rebuild" for at least three years?

There was no way of fixing the limitations around vdevs, like not being able to grow existing arrays, not mixing block sizes without a big performance hit, and so on, without a huge overhaul of some of the fundamentals.

ZFS was, like a lot of Sun technologies, built on assumptions from a world where machines are scarce, have a high ratio of admins:machines, and many users per machine. It's excellent at meeting its design goals, but there is no way I can imagine Apple shipping it when they'd be telling people, "want to grow your Mac Pro's array? Rebuild it. Want to reclaim space on a full system? Rebuild it. Want to avoid dropping off a performance cliff? Never drop below 20% free space."

(As an aside, I've been told by Sun engineers around the same 2008-2009 era that dependency resolution in package managers for distros proved Linux was a shit toy for morons because "real" OSes assumed skilled engineers who would be inspecting patches package-by-package so they knew exactly what would be on the system. Even recently, with Solaris 11 shipping with a package manager, I've had Sun veterans complain that "Solaris is dead" because it's "too much like Linux". Sun appears to have fostered, at least in areas I've dealt with, a culture that's completely out-of-step with how computing works this century. I doubt it would have taken too many of those sorts of conversations to convince people at Apple that Sun just didn't get Apple's "just works" vision.)

I know all too well how details can prevent otherwise simple solutions. But for the full disk issue, a simple solution is to not allocate all the available space. Surely it cannot be hard to reserve a little bit such that you will never encounter the full volume issue.
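A toy model of the idea, to show why a reserve helps on a copy-on-write system. This is emphatically not how ZFS allocates space; the class, the one-metadata-block-per-change rule, and the block counts are all invented for illustration.

```python
class ToyCowPool:
    """Toy copy-on-write pool: every change, including a delete, must be able
    to allocate one new metadata block before any old blocks are freed."""

    def __init__(self, total_blocks, reserved_blocks=0):
        self.free = total_blocks
        self.reserved = reserved_blocks  # held back; ordinary writes can't use it
        self.files = {}                  # name -> number of data blocks

    def write(self, name, blocks):
        needed = blocks + 1  # data plus one metadata block
        if self.free - self.reserved < needed:
            return False     # ordinary writes may not dip into the reserve
        self.free -= needed
        self.files[name] = blocks
        return True

    def delete(self, name):
        if self.free < 1:
            return False     # no room for the transient metadata block: stuck
        # Release the file's data blocks and its metadata block.
        self.free += self.files.pop(name) + 1
        return True

def fill_then_delete(pool):
    """Write 3-block files until the pool refuses, then try to delete one."""
    i = 0
    while pool.write(f"file{i}", blocks=3):
        i += 1
    return pool.delete("file0")

# 100 blocks and 4 blocks per file were chosen so the pool fills exactly.
print(fill_then_delete(ToyCowPool(100)))
print(fill_then_delete(ToyCowPool(100, reserved_blocks=4)))
```

Without a reserve the full pool cannot even process a delete; with a few blocks held back, the delete goes through and space comes back.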

You mention it briefly, and I believe that ZFS was designed for a world where growing existing arrays, reaching 80+% utilization, or using different block sizes are all things that never happen anyway. ZFS could do with a rewrite to allow this, but it would not be ZFS compatible anymore.

But I will not deny that what you describe could be a telltale sign of a development team Apple simply cannot work with. But nothing prevents Apple from forking OpenZFS and working on it in-house (perhaps apart from patent issues).

There's also the problem that you can't grow a raidz vdev by adding disks. I can just imagine Apple support trying to explain that one.

In Solaris ZFS you can now AIUI - it's just that it hasn't made it into the open-source version.

That's true, but you can grow a pool by progressively replacing smaller disks with larger disks.

It wasn't ZFS per se... the software I was using wouldn't properly enable the drives configured as hot spares... that and a few errors (I didn't use ECC), and a couple bad drives, and everything (11TB of data) lost. All the really important stuff was well backed up, as was my audio library. I never did re-rip all my DVDs & BluRays again since that happened.

Good info, thanks. I'll admit, I'm pretty terrified of having to replace one of the drives in my array. I've replaced dozens or maybe even hundreds of drives in Sun, SGI and DEC storage arrays over the years, but I have a lot less confidence in my homebrew setup. Fortunately, I have regular complete backups of everything there, so even if I lost the whole array, all I'd really lose is the time involved in fixing it, but that's still painful.

I felt the same way, which was why I went Raid-Z2 (dual parity) with two hot spares configured... even had a spare controller next to the box. Still, lost it all.

Damn, how did that happen? When the first drive failed, why didn't the hot spare get activated? What software was it that didn't work properly?

This worries me because Raid-Z2 with two hot spares is exactly the system that I was planning to set up.

The FreeNAS version I was using had the bug wrt hot spares... and I didn't use ECC... The errors also weren't clear which drive was failing... was 0 in some places 1 in others, and it turns out it was both. The LSI board I was using apparently had some buggy issues as well.

That combined with using Seagate 7200 rpm drives in a year that was particularly bad (an offline storage company mentioned them in a blog article iirc). Just a bad chain of events. When I pulled the drives out and tested them individually after the entire array crashed, about 7 out of 12 had significant issues, and of the other 5, 2 more developed errors shortly after (I only used them as scratch/project drives).

Honestly, today, I'd probably get another Synology box, a bit more expensive, but been so much less hassle. I had upgraded my 4-drive (2010 plus model iirc) synology box to 4TB wd red drives, and been running that ever since. I don't think I'd ever do a homebrew NAS on a single machine ever again. If I had to do something in a company, would probably lean towards distributed file systems for that level of redundancy.

Were you not scrubbing at all? (Blaming FreeNAS rather than you if so - but if one's doing it by hand following the ZFS HOWTOs it's definitely something that's impressed on one to set up).

When I use NAS software that's meant to be a NAS, I expect it to do the maintenance that's required. I don't expect to have to be an expert in ZFS. By the same token, I don't expect to have to know what the storage and growth configurations of AWS RDS instances are either.

My Synology NAS device has been in service for around 6 years now... I upgraded the drives from 4x1TB to 4x4TB when the FreeNAS server bit the dust... and it's still going strong... I don't have to think about it, it just sits there and runs.... Single-purpose software like this should do its job. The end.

Agreed. I just think it's not really fair to say that ZFS lost your data, if the main cause was that FreeNAS uses/configures ZFS in a way directly contrary to the ZFS documentation.

Were those Barracuda 7200.11? That firmware version was notorious for corruption.

IIRC, yes... horrible drives.

I've lost my movie collection to a different issue, and I never got around to re-ripping it. But I have promised myself that if I have to rip it again, I will do it script based, so that with the script backed up, re-ripping will be a matter of installing the same software and running the script.

true enough... At least today I'd get the benefit of h.265 and use roughly 1/4 the space, and probably go 1080p for the BR rips instead of 720p.

I haven't really gotten a chance lately to dig into it as much as I want, but I have heard that BTRFS is actually doing pretty well, with lots of updates in the last year that fix the middle-ground bugs, with a few big bugs still out there (such as raid6 mode in general).

The main big data file system I am keeping my eye on is HAMMER2... but development is slow. I really like a lot of things Matthew Dillon is doing there though, especially the network stack. If I were doing a new isp or wisp I would probably be basing it on dflybsd.

I mainly mean that it'll be a while before btrfs stabilizes and even then, it's been how many next-gen file systems in floss land each with critical issues, usually non-technical in nature in even the past 5 years?

Kind of weird how Linux filesystems always seem to have problems... it seems like that's the sort of problem open source would do a reasonably good job at.

Do they have more problems or more users in many more different situations? And then they are vocal...

ZFS has this sterling reputation but a fair number of people have actually lost data to it. Some of it is bad hardware, or bad choices in hardware, some of it is bad administration, it's still happened.

Linus isn't very happy: http://www.linux-magazine.com/Online/News/Linus-Torvalds-Ups...

I don't expect any filesystem to eliminate the need for actual backups, but

10TB? 15TB SSDs are already shipping with much larger not too far out: http://arstechnica.com/information-technology/2016/03/samsun...

Okay, wasn't aware of that option... mainly going by 8TB WD Reds on Amazon, and figuring 10-12TB should be around soon.

eminent != imminent != immanent

ZFS isn't catching on because it solves a 1990s-era problem that people just don't care that much about right now. People are focused on cloud for the moment. Cloud storage doesn't require the application developer or user to care about the filesystem, and clouds are not implemented with fancy filesystems on the individual hosts, but rather using fancy facility-wide filesystems that have better management and durability features than ZFS offers.

ZFS would have been totally brilliant on a 90s-style multiuser enterprise or higher education deployment, though. It just missed its window.

Distributed filesystems are backed by local filesystems. The local filesystem can be ZFS with all of its benefits even if you also have a fancy facility-wide filesystem running on top of it.

In general, is it just me or has ZFS become more popular lately? Saw Ubuntu get behind it, even in light of Btrfs being available for many years now. https://github.com/zfsonlinux/zfs/commits/master is pretty active...

It seems everyone at some point expected Btrfs to shoot ahead and leave other file systems in the dust, so there was no point in bothering with ZFS, "just wait a bit and Btrfs will be the default everywhere". And besides ZFS has all the legal issues with it.

But it seems Btrfs progress was rather slow, so even in spite of legal issue interest in ZFS is still growing.

The thing about ZFS is that it was an actual selling/functioning product before btrfs was in development (links at end). There always were people using it - whether through the main Sun/Oracle Solaris product or via one of the OpenSolaris / illumos / FreeBSD / BSD continuations.

Btrfs was not an fs in wide production use the way xfs or ext3 / ext4 were, so it's understandable that btrfs progress is slow. The two pieces of software are just really far apart in timelines and maturity. People are paying attention to ZFS mainly because it's had a history of working and functioning in production environments and in heavy load, for a decade now.

In my opinion, in order to have parity with ZFS, some big distribution (bigger than Suse, maybe Debian or Ubuntu) needs to set btrfs as its default fs, and have banks, stock exchanges, dns providers, virtual hosting companies, etc hammer the hell out of it for another 6 or 7 years and get all the bugs that come out of those experiences fixed. Then people will start paying attention to it.

afaik btrfs development started 2007 (https://en.wikipedia.org/wiki/Btrfs)

ZFS was introduced as a part of Solaris in 2005, released as openZFS in 2006 (https://en.wikipedia.org/wiki/ZFS)

Honestly... I don't trust btrfs at the moment. It has had some nasty data-losing bugs in the past, and admittedly it's been a while since I heard of one, but it's also not often that you'll hear of someone actually using it in production. Even their wiki page is woefully lacking in examples with which to sell it: https://btrfs.wiki.kernel.org/index.php/Production_Users.

I want to like it, we really need a nice, stable next-generation file system in Linux land (that doesn't have the licensing issues associated with ZFS)

I do like it. However I am very wary of using the advanced features (beyond the de-fragmentation/scrubbing/balance kind) without a /good/ backup of the data on the system and a good retention policy.

It has been my experience with using BTRFS for my own personal data that it's the less tested code paths which had / might have the issues you're describing. Things like sparse files (I got bit by this bug), snapshots (regular scrubbing and backups are things you're doing already, right?), and other features that are less frequently used.

A small production cluster where I work uses XFS over BTRFS though, because of the very FUD that's mentioned and the type of storage happening on it matching the sparse pattern (even though that bug should be fixed).

IIRC, Linus Media Group uses btrfs for a few of their servers. But, as shown by some of their videos, they aren't the best when it comes to good practice...

Seems like running on SSD/NVRAM may call for some new thinking. Running on watches may be even secondary to that.

Not that ZFS won’t run well on SSD, but it feels like there’s a gulf between filesystems designed with SSDs in mind and those designed with spindles in mind.

It seems that COW filesystems in general are better for use with SSDs... iirc, SSDs themselves do similar things as part of their wear leveling algorithms anyway. It seems to me that it's time to break some of this down. With solid state storage reaching levels close to RAM in terms of latency, I think eventually they'll come together, and that's where it will get really interesting.

On the flip side, with spinning storage approaching 10TB drives, enhanced filesystems for RAID arrays of those beasts are needed as well.

In general, we really need some patent releases from those companies strangling filesystem advancements (namely Oracle and MS).

Out of curiosity, how do you feel that Oracle is strangling filesystem advancement?

Encryption is a big one that's been annoying everyone. Oracle baked encryption into their own ZFS implementation ages ago, but not in a way that works with the open source variety. Years later, we're only just now starting to see some solutions for this.

I'd like to see something comparable to btrfs/zfs but designed for plain nand flash. Less opaque layers to guess your way around. UBIFS?

If I remember one talk correctly ZFS was designed with persistent memory in mind, so I don't think SSDs are far out of their design thoughts.

[edit: it was a BSD Now episode]

Definitely not. ZFS was primarily focused on spinning disks. Take for example the fact that the project discarded block indirection early on: the resulting random reads would be basically fine on SSDs and debilitating on HDDs. Ditto TRIM which is still not (IMO) fully baked in ZFS.

Meanwhile, this works, is under active development, and is essentially at HEAD of both openzfs and zfsonlinux:


Not sure what your point is. Are you saying Apple should use this? AFAICT it's still under the CDDL, which is the main reason Apple abandoned the ZFS plans in the first place.

Except for the fact DTrace is under the exact same license and ships with OS X.

Pulling DTrace out in the event of legal action, vs switching filesystems is a whole different beast.

DTrace is not a core technology on which the rest of the platform is built. If you read the actual article, this is discussed.

Agreed with all of this, but it's hard to imagine the licensing was truly insurmountable.

More or less everyone I've ever talked to or seen discuss it agrees (including several people in the comments on this blog post) -- Apple Legal felt ZFS was a no-go without patent indemnification from Sun/Oracle. Sun was willing, but Oracle said no as soon as they purchased.

At that point the word came down from Apple Legal to Engineering that it wasn't happening. And there's nothing more insurmountable than your own legal department.

"At that point the word came down from Apple Legal to Engineering that it wasn't happening. And there's nothing more insurmountable than your own legal department. "

Just to note: It doesn't really work this way, even at Apple.

It really means "legal was better at arguing their side than whoever argued against them to the SVP/CEO who made the decision"

While Apple is worse than most (from talking to friends and counterparts), no company, even Apple, is so silly that the legal department can't be overruled (any general purpose software company that works like that goes bankrupt pretty quickly).

It just takes a really good business case for doing so, and none existed here.

That is, if it had been considered super-mission-critical, it just would have happened anyway; they would have taken the risk.

(I kind of just hate when these stories get portrayed as legal ruling with an iron fist with nobody having any say over them.)

Just scrub out some J's :p

People forget that Microsoft has perennially had their Cairo/Object/Whatever FileSystem in development.

WinFS, that was an interesting concept when it was first announced. I'm still curious if it has got wings or not. Maybe some day :)

It even worked: https://www.youtube.com/watch?v=VW2H4sgakIA at some point anyway. Now they're just "well, we put the ideas in SQL Server" but I still hope it comes back around. _That_ is the future of storage. ZFS/btrfs/HAMMER are all very, very cool, but WinFS was a level above that.

True but NTFS is also far superior to HPFS, so arguably MSFT doesn't need a replacement for NTFS nearly as badly as Apple needs a replacement for HPFS.

Minor nit, but HPFS was OS/2's filesystem. Apple's is HFS (and HFS+).

And NTFS was based on HPFS

Also, Microsoft has been developing ReFS as a replacement for NTFS.

So, according to this article, Apple was making a deal with Sun to use ZFS, and later finally dropped it after an alleged discussion between Steve Jobs and Larry Ellison when Oracle owned it.

My question is why did Apple think they needed to make a deal to use ZFS in the first place, and if so has Canonical (who says they'll ship openZFS with Ubuntu) made a deal with Oracle ?

It's true that OpenZFS is more than Oracle's ZFS, but unless I'm mistaken, the vast majority of code in that project is still owned by Oracle.

This article makes me uneasy.

Apple started in 2006 with no deal or coordination with Sun. Later when the FUD started, they were looking for protection from patent lawsuits (i.e. from NetApp), for support and expertise, and, possibly, reassurances around patent clauses within the CDDL with which I'm not an expert.

Maybe the deal was to make ZFS ready for personal computer / Apple requirements that may have interested Sun but not Oracle ? In the middle of an acquisition with a guaranteed few years of uncertainty, no strong backing from the new boss meant that Jobs simply bailed out.

Rumors in any case.

Maybe I'm being irrational, but I would never touch ZFS because Oracle.

> this is the moral equivalent of using newspaper as insulation: it’s fine until the completely anticipated calamity destroys everything you hold dear.

How... unnecessarily inflammatory. Hardlinked backups work remarkably well, and are incredibly simple to implement and understand. Of course they can be corrupted, but then again, so can every other form of backup in existence (that said, there are no protections built-in to a hardlinked backup).

I work in backup (this has become a core part of my business: http://macdaddy.io/mac-backup-software/) and part of the reason I wrote that is because of the continual reports of Time Machine backups failing to restore when people needed them. The numbers are huge. I wouldn't touch Time Machine in any conditions. I would manually backup before using that. This may be anecdotal evidence, but if you want to see some of the issues people run into with Time Machine just read through some of the posts people make about it: https://www.google.com/search?q=time+machine+backup+failure&...

Most of this is indeed directly related to directory backups. I also do Snapshots of sorts, but it only makes hardlinks to files, not directories. That proves to be remarkably more reliable.

Hey, I'm interested in trying that. What's your business model? Nagware or limited time trial or what? The price is reasonable but I'd definitely like a long trial.

Anyone else here using it?

It gives a month's free trial where it fully functions. Then it says it requires payment when it starts up after that. Though if you don't pay anything it actually continues working. So you have plenty of time to try it out.

Hard links to directories are weird enough that many modern filesystems (including APFS) don't support them. Yes the language is perhaps unnecessarily colorful; just trying to keep things interesting.

Hardlinked backups do not work well at all with large files that change frequently, such as video and image editing projects or VM images.
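To make that concrete, here is a minimal Python sketch of a hardlinked snapshot scheme. It is an illustration under assumptions, not how Time Machine or any particular product works: the `snapshot` function, the file names, and the sizes are all made up. Unchanged files become hard links into the previous snapshot, while any file whose contents differ is copied in full, so a one-byte change to a large VM image costs the whole image again.

```python
import filecmp
import os
import shutil
import tempfile


def snapshot(source_dir, previous_snapshot, new_snapshot):
    """Build new_snapshot from source_dir (hypothetical helper).

    Files identical to their copy in previous_snapshot are hard-linked;
    everything else is copied. Returns the number of bytes copied.
    """
    copied = 0
    for root, _dirs, files in os.walk(source_dir):
        rel = os.path.relpath(root, source_dir)
        target_dir = os.path.normpath(os.path.join(new_snapshot, rel))
        os.makedirs(target_dir, exist_ok=True)
        for name in files:
            src = os.path.join(root, name)
            dst = os.path.join(target_dir, name)
            old = None
            if previous_snapshot is not None:
                old = os.path.normpath(os.path.join(previous_snapshot, rel, name))
            if old and os.path.isfile(old) and filecmp.cmp(src, old, shallow=False):
                os.link(old, dst)  # unchanged: no extra data stored
            else:
                shutil.copy2(src, dst)  # changed or new: full copy
                copied += os.path.getsize(dst)
    return copied


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "src")
    os.makedirs(src)
    with open(os.path.join(src, "vm.img"), "wb") as f:
        f.write(b"\0" * 1_000_000)  # stand-in for a large VM image
    with open(os.path.join(src, "notes.txt"), "wb") as f:
        f.write(b"hello")

    first = snapshot(src, None, os.path.join(tmp, "snap1"))

    # Flip a single byte in the middle of the large file.
    with open(os.path.join(src, "vm.img"), "r+b") as f:
        f.seek(500_000)
        f.write(b"\x01")

    second = snapshot(src, os.path.join(tmp, "snap1"), os.path.join(tmp, "snap2"))
    print(first, second)
```

The second snapshot links `notes.txt` for free but stores the entire image again. Block-level snapshots (as in ZFS) would only need to store the changed blocks.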

It's a real shame, because HFS is an absolute abomination. Apple will remain without a reliable, full-featured filesystem for another decade, it seems.

ZFS is fantastic until you try to grow a volume. Emphasis on fantastic.

If you create a zpool with '-o autoexpand=on' you can replace disks with larger ones one at a time, waiting for each one to resilver. After replacing the last disk the pool will jump to the new size.

What difficulty did you encounter? I've had a pool running since ~2009, originally as 4x750GB, then 4x3TB in 2013, and I recall that the transition was excellent. As oofabz said, the trick is to enable "autoexpand", which can be done even after creation. In fact, I recall that I had replaced each drive, initially seeing no increase in capacity, but after enabling autoexpand the new space was immediately available.
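For anyone wondering why the space only shows up at the end, here is a rough Python sketch of the arithmetic. It assumes a single RAID-Z1 vdev (I don't know the actual layout of the pool above), ignores metadata overhead, and the function names are made up for illustration. The usable size of a RAID-Z vdev is bounded by its smallest member, so nothing changes until the last small disk is gone.

```python
def raidz_usable_tb(disk_sizes_tb, parity=1):
    """Approximate usable capacity of one RAID-Z vdev, ignoring overhead."""
    if len(disk_sizes_tb) <= parity:
        raise ValueError("need more disks than parity devices")
    # Every member is treated as if it were the size of the smallest disk.
    return min(disk_sizes_tb) * (len(disk_sizes_tb) - parity)


def replace_one_at_a_time(disk_sizes_tb, new_size_tb, parity=1):
    """Usable capacity after each successive disk replacement (hypothetical helper)."""
    disks = list(disk_sizes_tb)
    history = []
    for i in range(len(disks)):
        disks[i] = new_size_tb  # swap in a bigger disk and wait for resilver
        history.append(raidz_usable_tb(disks, parity))
    return history


# 4x750GB upgraded to 4x3TB, one disk at a time.
history = replace_one_at_a_time([0.75] * 4, 3.0)
print(history)
```

Under these assumptions the capacity stays at 2.25 TB for the first three swaps and only jumps to 9 TB after the fourth.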

You can grow a RAID-Z “vertically” like this, but can't grow it horizontally by adding disks… which is frustrating.

(I'm also frustrated that it's not possible to change redundancy within a vdev — e.g. go from RAID-Z1 to RAID-Z2.)

You can grow it horizontally by adding a new vdev. ZFS doesn't allow you to do anything you may ever possibly want to do, but it does allow you to do more than any other RAID solution. It's not perfect but it's still pretty impressive.

I've been using the same ZFS array since 2009...

At first, I assumed that RAIDZ was the obvious way to go. But I switched to a concatenation of mirrored pairs; I grow the array by adding another pair to it.
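As a rough sketch of how the mirrored-pairs layout grows (Python, with hypothetical disk sizes and filesystem overhead ignored): each mirror contributes the size of its smaller disk, and adding a pair increases the pool right away.

```python
def striped_mirrors_usable_tb(pairs):
    """Approximate usable capacity of a pool built from mirrored pairs."""
    # A mirror can only hold as much as its smaller member.
    return sum(min(pair) for pair in pairs)


pool = [(1.0, 1.0), (2.0, 2.0)]  # hypothetical starting pool, sizes in TB
before = striped_mirrors_usable_tb(pool)

pool.append((4.0, 4.0))  # grow by adding one more mirrored pair
after = striped_mirrors_usable_tb(pool)
print(before, after)
```

The trade-off is that half of the raw capacity goes to redundancy, whereas RAID-Z spends only the parity disks.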

I've actually had six disks fail in six months, without data loss. Was scary. But wow.
