File Systems, Data Loss and ZFS (clusterhq.com)
128 points by ferrantim on Sept 19, 2014 | 55 comments

I switched to FreeBSD a couple of years ago, partly for the sake of ZFS, which is a first-class filesystem on that platform. FreeBSD was much more similar to Linux than I expected, and where there were differences, the FreeBSD way was usually simpler. My system has been more stable ever since, and I no longer fear hitting the "update" button.

This. ZFS, MAC, LLVM, good docs, good manual pages, awesome stability, up to date ports and packages. What is there not to like?

On the subject of the article, FreeBSD supports DMAR (VT-d) as well, but I'm not sure how this protects data in the context of ZFS.

I have two ZFS root machines on FreeBSD, one server and one laptop. ZFS is nothing short of wonderful.

The main issue with FreeBSD is that desktop-y things tend to be a PITA - anything involving browser plugins, java, anything like that.

A big part of that is the state of the Linux emulation. While it's solid technically, the Linux userland it's based on is ancient... the NEW Linux base that is being phased in is based on CentOS 6, which is already over 3 years old. The CURRENT Linux base is Fedora 10, which is 6 years old. Both of these are so old that compatibility with proprietary software is limited.

I don't use either Java or any browser plugins (other than portable Firefox plugins), so I haven't come across that yet. I used FreeBSD as a desktop for 6 years, replacing Solaris, and not once did I need the Linux compat layer either.

It got dropped in favour of Windows because, TBH, that pays the bills better at the moment.

FreeBSD however runs a lot of the hidden glue at our company.

That appears to just be for virtualization. That is not quite the same as using it to protect the kernel from malfunctioning devices because only devices that are being passed through to a guest have any kind of restrictions. Please correct me if I am wrong.

How's package management? I switched away from *BSD a decade ago because the sysadmin side was so far behind Debian (which, to be fair, was true of almost everything else) with update reliability, speed, etc. What's it like in the modern era?

Much, much improved with the introduction of pkgng. Now you can use the stock binary packages without going through the ports tree. The workflow is "pkg update", "pkg install ...", "pkg upgrade", "pkg delete ..." (includes automatic dependency removal). It's basically the same user experience as APT or yum. Then, there's whatever PC-BSD uses, which is also supposed to be very user friendly. (I wouldn't know. PC-BSD stopped supporting 32-bit PCs a few releases back, so I gave up on it.)
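For anyone curious, here is that workflow spelled out (a sketch using the pkg subcommands named above; the package name is just an example):

```shell
# Refresh the remote repository catalogue
pkg update

# Install a package (dependencies are resolved automatically)
pkg install nginx

# Upgrade everything that has a newer version in the repository
pkg upgrade

# Remove a package, then sweep dependencies nothing else needs
pkg delete nginx
pkg autoremove
```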

That said, you can still build and distribute custom packages---the biggest strength of the FreeBSD Ports tree compared to other distributions, in my opinion. This too has improved with the introduction of the OPTIONSNG framework and the Poudriere build system. It used to be that you had to manually run "make config" per port or to somehow shoehorn per-port options into /etc/make.conf. All of that's been replaced with a much more sensible way to configure the port build process. For example, the following sets the default versions of Perl and Apache to pull in as dependencies (if needed by the port), and then it configures several ports:

  DEFAULT_VERSIONS= perl5=5.18 apache=2.4
  ca_root_nss_SET=  ETCSYMLINK
  moreutils_UNSET=  MANPAGES
Poudriere handles en masse package building. If you want to build and distribute your own custom package repository, you'll want to use Poudriere to manage the process for you. For more information see http://www.bsdnow.tv/tutorials/poudriere first, then RTFM. It's super easy to use. Also, you can custom build a few packages with Poudriere, but fall back on the main FreeBSD package repository. It's pretty stellar.
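To give a flavour of how little there is to it, a minimal Poudriere session might look like the following (the jail name, release version, and pkglist file are all illustrative; check poudriere(8) and the tutorial above for specifics):

```shell
# Create a build jail for the release you target
poudriere jail -c -j 101amd64 -v 10.1-RELEASE -a amd64

# Fetch a ports tree for Poudriere to build from
poudriere ports -c

# Build every port origin listed in pkglist (e.g. www/nginx, lang/perl5.18)
poudriere bulk -j 101amd64 -f pkglist
```

The resulting repository can then be served over HTTP and pointed at from pkg's repository configuration on the client machines.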

Nowadays, I only need to use portsnap/portupgrade when bootstrapping my Poudriere deployment.

Er, I mean portmaster, not portupgrade. I haven't used portupgrade for some time now.

FreeBSD has recently included a new binary package manager (pkg) which seems to work well enough, but in all honesty it still feels a few years behind apt / pacman / yum.

The nice thing is how well you can mix and match ports (source based management, in case you weren't aware) with the binary packages. And FreeBSD ports are surprisingly easy to manage too.

It's also worth mentioning that FreeBSD runs circles around a lot of Linux distributions when it comes to upgrading. There's a tool, literally called freebsd-update, which is the equivalent of apt-get dist-upgrade and makes the whole process so painless you could be forgiven for thinking you were running a rolling release OS.
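For reference, a typical session looks something like this (the release number is illustrative):

```shell
# Fetch and apply binary security/errata updates for the running release
freebsd-update fetch
freebsd-update install

# Or move to a new release outright
freebsd-update upgrade -r 10.1-RELEASE
freebsd-update install   # re-run after each reboot, as the tool prompts
```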

> And FreeBSD ports are surprisingly easy to manage too.

It's been ages since I've used FreeBSD, but I have fond memories of using portupgrade (and portinstall, which comes with it) to manage source-based ports. It's probably faster to do make && make install instead, but the automation from portupgrade is useful and the -c/-C options can be helpful, as is its control over when to clean the package build dir.
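From memory, the common invocations were along these lines (flag semantics from recollection; verify against portupgrade(1) before relying on them):

```shell
# Upgrade all outdated ports
portupgrade -a

# The -c / -C flags control when each port's "make config" dialog runs,
# so you can answer all the option screens up front instead of mid-build
portupgrade -ac
```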

I sort of grew up with the BSDs as my first introduction to a Unix-like environment, and in all honesty it was the reason I first migrated to Gentoo when I started using Linux (the Gentoo way was more familiar). I now use Arch for personal use (and Ubuntu for services), but I think the FreeBSD philosophy is in many ways superior. Clunky, perhaps, but usually void of unnecessary complexity.

> FreeBSD has recently included a new binary package manager (pkg) which seems to work well enough, but in all honesty it's still feels a few years behind apt / pacman / yum.

You don't expect someone to be able to write a full-featured, bug-free package manager equivalent to those that have matured over a decade in under two years, do you? What they've done so far is beyond impressive, and at this pace it will be improved beyond the competition within 3 years.

I think the FreeBSD devs have done a cracking job with pkg and I did comment about how it's a new package manager to give context that it's likely to be refined and improved upon with time.

All I was trying to do was give a balanced opinion, but it seems you can't post anything on here these days without someone finding criticism. {sigh}

> You don't expect someone to be able to write a full-featured, bug-free package manager equivalent to those that have matured over a decade in under two years, do you?

Well, you kind of would, wouldn't you? It seems that this should be a well-understood problem these days, at least unless you're going to do something wildly experimental like Nix. Maybe this says more about the archaic languages and systems we're still using, though.

(Not that I know much about 'pkg' specifically. What does it do that's different from every other mainstream package manager out there? Why not use APT, for example? Since Debian has a FreeBSD flavour, presumably APT should be able to run without too much tweaking.)

As others stated, binary package management is state-of-the-art now, thanks to the `pkg` command (aka the pkgng system). In fact I can never remember all the sub commands for apt+dpkg but pkg is a dream by comparison. I particularly like `pkg audit` - https://www.freebsd.org/doc/handbook/pkgng-intro.html#pkgng-...
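For example (the -F flag fetches a fresh copy of the vulnerability database before checking installed packages):

```shell
# Download the known-vulnerabilities database and audit what's installed
pkg audit -F
```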

Btw, I would completely disagree that the sysadmin side is or was behind Debian. FreeBSD is way easier to sysadmin than most Linux, even without pkg.

> Btw, I would completely disagree that the sysadmin side is or was behind Debian. FreeBSD is way easier to sysadmin than most Linux, even without pkg.

This may be true now but it certainly wasn't the case in the early 2000s. When we switched, our number of systems went up by an order of magnitude (100s) but the amount of sysadmin time required actually went down. Consistent package management and debconf were most of the reason.

Interestingly, I switched from Linux to FreeBSD around 2002, primarily because I was frustrated with Linux sysadmin. Perhaps I was more of a newbie, but the inconsistencies in Linux system organization, the confusing, low-quality, half-obsolete docs, and the unbounded time required for troubleshooting were some of the reasons. When I switched to FreeBSD and started using ports, my sysadmin time dropped significantly. But I was only using a single system.

You can configure ports the way that you like on one system. Then tarball /usr/local and bring the tarball to every other system. Just wipe out the existing /usr/local and replace it. This is supposed to be a feature that reduces time spent.
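The mechanics, demonstrated on a scratch tree so it's safe to run anywhere (substitute /usr and local for a real deployment):

```shell
# Stand-in for a configured /usr/local on the build machine
mkdir -p demo/usr/local/etc
echo 'tuned=yes' > demo/usr/local/etc/app.conf

# Snapshot the configured tree
tar -czf local.tgz -C demo/usr local

# "Target" machine: wipe any existing tree and drop the snapshot in place
mkdir -p target/usr
rm -rf target/usr/local
tar -xzf local.tgz -C target/usr

cat target/usr/local/etc/app.conf   # prints: tuned=yes
```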

The ports stay very much up to date and I've never had an update break my system (in contrast to my experience on Ubuntu and even Slackware - must admit I haven't used Debian that much). I like the way it embraces the fact that updates will sometimes require manual intervention and so there's a standard file that you read before updating and occasionally there are things you have to do, but as long as you follow that you'll be fine. I also think the split between ports and base system works well.

At an implementation level the packaging system is still made of twine - Ruby scripts that call Makefiles are not my idea of robust engineering. But I've gradually come to accept that, just like the project's continued use of CVS, they've managed to make it work.

The Ruby scripts (assuming you're referring to portupgrade?) are optional.

The FreeBSD Project might still maintain a CVS repository, but all development's been moved to Subversion. Or if Git is your thing, they have some kind of synchronization set up with https://github.com/freebsd/freebsd.

Isn't Ruby scripts that call Makefiles exactly how Homebrew works on the Mac? :-D

Yeah I love FreeBSD. I wish more people gave it a chance before rushing to ZFS FUSE. Granted things are a little different now that ZoL is around and has proven to be stable, but it always struck me as a little odd that some would point blank refuse to even try FreeBSD yet welcome the lesser tested and poorer performing solution of running ZFS in FUSE. But each to their own I guess.

I was one of the people who rushed to do a ZFS setup on Ubuntu when those capabilities first started appearing ~5 or so years ago. There were some strange bugs that pushed me onto BSD, and the entire time since then I've been so stable it almost makes me nervous to try again despite the positive reception of modern ZoL (if it ain't broke, etc.). Seriously, impressively stable.

If you haven't seen it already, this post by the same author about the State of ZFS on Linux might interest you: https://clusterhq.com/blog/state-zfs-on-linux/

I've read it, but thanks for the breadcrumb. It definitely seeded thoughts in my head about giving it another try the next time I do a clean reformat on my fileserver.

Is ZFS in FUSE even a thing anymore, with the "native" port from zfsonlinux.org?

Development of ZFS-FUSE ceased a few years ago when ZFSOnLinux surpassed it. That said, ZFSOnLinux will likely implement an option to build a FUSE driver in the future, but it is not a priority:


I'm not sure to be honest. But their site (zfs-fuse.net) seems to be down for me.

Thanks for the explanation of misdirected writes. I've heard the term before, but didn't know exactly what caused it. Reading this post was like watching one of those How Things are Made shows on the Discovery Channel. Very interesting to see how some things I take for granted actually work.

Misdirected writes are not as well known as they should be. I am happy to increase awareness of them.

> ZFS is operating on a system without an IOMMU (Input Output Memory Management Unit) and a malfunctioning or malicious device modifies its memory.

If a Linux system possessing an IOMMU was booted with iommu=pt as a kernel command line option, does the IOMMU still protect from this type of failure? This option puts the IOMMU into passthrough mode which is required to successfully use peripherals on some motherboards.

No. This mode was introduced specifically for virtualization so that the IOMMU will only restrict access to a guest machine's memory, such as when KVM is in use:


The only case in which this would help is when ZFS is on the host and a device passed through to a guest malfunctions.

TLDR: "its data integrity capabilities far exceed any other production filesystem available on Linux today"

"In the case that we have two mirrored disks and accept the performance penalty of the controller reading both, the controller will be able to detect differences, but has no way to determine which copy is the correct copy."

If you 'seed' the checksum algorithm for a block with the block number being written, a subsequent read of a different block that produces the same data will have a checksum failure. That would make it possible to choose which block has the right data.

So, if you are willing to eat the performance, you can detect single misdirected writes.
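The seeding idea is easy to demonstrate: fold the intended block number into the checksum input, and identical data stored at two different locations produces two different checksums (sha256 and the block numbers 7 and 42 are just stand-ins for whatever checksum and addressing the array uses):

```shell
data="same payload"

# Checksum of the data alone is identical wherever the block lands
printf '%s' "$data" | sha256sum

# Seeding with the intended block number ties the checksum to a location,
# so a misdirected write (right data, wrong block) fails verification
sum_at_7=$(printf '7:%s' "$data" | sha256sum)
sum_at_42=$(printf '42:%s' "$data" | sha256sum)
test "$sum_at_7" != "$sum_at_42" && echo "misdirection detectable"
```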

When I wrote that, I was talking about hardware RAID 1, which has no checksums.

But the disks have their own checksums, don't they?

The low level formatting has ECC, which never leaves the drive. That said, there are two cases to consider for misdirected writes. One is that the write clobbers multiple sectors in which case you would get uncorrectable sectors. The second is that it perfectly replaces another sector. In that case, the ECC is a perfect match as the ECC is stored with the sector. Neither drive would report a problem, but the data would not match. This is what I described as being a problem and traditional RAID is incapable of dealing with it.

And that is where I stated that drives can report a problem. If they 'seed' their ECC algorithm with the sector number (XOR-ing the result with it would be sufficient), they can (statistically) detect that, when they read sector #X, what they got wasn't what they ever wrote as sector #X.

In fact, I guess they already do. If they didn't, there would be misdirected reads, too.

The low level formatting does include a sector number, but it is not part of ECC. I am not sure what your point is. Your theoretical description of how hard drives could work does not reflect reality. Research by CERN and others has confirmed the existence of misdirected writes. Deployed ZFS installations are detecting corruption in situations where the drives report everything is fine. Even if the storage hardware improves, having end to end checksums in the filesystem will continue to make sense.

That said, I think you are fixating on one way that things can go wrong. Another way that misdirected writes can occur is a bit-flip in the micro-controller's memory. This also allows for misdirected reads as well as reading/writing data that has a single bit flipped. These devices' micro-controllers do not have ECC memory. Even if it were added, you would still need to prove that there are no programming bugs via formal verification, but given that these devices are black boxes that cannot be inspected, you cannot rely on the claim of a proof even if one is done, and there would still be the possibility for errata in the micro-controller. It is far easier to just use end-to-end checksums in the filesystem. Even if you think the device is trustworthy, end-to-end checksums give you the ability to check that it is doing what it is supposed to do. You simply do not have that with traditional RAID.

I found the Reordering Across Flushes section really interesting. So one rule of thumb is that you should not use hardware RAID with battery backup? Are there other types of hardware that would give you the same problems?

I usually advise people to avoid hardware RAID controllers on the basis that they introduce unnecessary risk. If you want speed, it is best to use a device like the ZeusRAM as a SLOG device:


As for other devices, it is possible for bugs in software block devices to cause reordering across flushes. Finding out requires reviewing the code for any way that an IO before a flush can occur after it. This is a better situation than that with hardware RAID controllers, whose firmware is closed source and cannot be inspected.
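For completeness, attaching a dedicated log device to a pool is a one-liner (the pool and device names here are made up):

```shell
# Add a fast device as a separate intent log (SLOG) for pool "tank"
zpool add tank log da1

# Or mirror the log device to survive its failure
zpool add tank log mirror da1 da2
```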

Well, with ZFS you want to avoid hardware RAID controllers completely. The protections from ZFS only work if the filesystem doesn't have anything in between it and the actual disks. Depending on your vendor, it can actually be difficult to get a card that lets you have JBOD access to a large disk array.

The only exception that I can think of is encryption. You could wrap a disk with an encryption layer in software, but then you could still make a separate virtual device for each disk.

A hardware RAID controller could limit ZFS' ability to provide integrity, but not enough that I would say another filesystem does a better job there. All filesystems are compromised by the failures that traditional RAID can introduce.

That said, I would never recommend a hardware RAID controller for use on any system. They add additional cost and additional failure modes, but give little in terms of benefits.

As with everything, It Depends.

Battery backup with a hardware RAID controller is fantastic for increasing write performance of a RAID array. You can literally call flush() on a file, and it returns almost instantly, so if you are doing this a lot it makes sense to either use one of these or a solid state drive. If the system is well-maintained, then the chances of it failing are "small". However, the whole ethos of ZFS is that it wants to manage individual discs itself.

With regard to other hardware, there are certainly devices out there that will ignore flush requests, in the name of performance. These are usually consumer-grade devices. If you're paying extra for an enterprise-grade drive from a reputable manufacturer, you should be fine.
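The effect is visible from userland: GNU dd's conv=fsync forces the data to stable storage before dd exits, which is exactly the path a battery-backed cache (or a flush-ignoring drive) speeds up. Timing this on different hardware is revealing:

```shell
# Write 4 MiB and require it to reach stable storage before dd returns
dd if=/dev/zero of=flushtest.bin bs=1M count=4 conv=fsync

# A battery-backed cache acknowledges the flush from controller RAM
# almost instantly; plain spinning disks make you wait on the platters
ls -l flushtest.bin
```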

Genuine non-volatile RAM would avoid the issue of battery failures. About two years ago LSI announced a partnership to use MRAM in their RAID controllers but I haven't seen a product materialize out of that.

Does anyone have a good up-to-date comparison with btrfs on this topic?

Check out the previous HN thread on ZoL:


Thanks that was very helpful. Seems btrfs is still a bit behind, some of it by design.

Valerie Aurora, who worked both on ZFS and btrfs seems to think the btrfs architecture is better in a few ways:


Thanks for the link. That was a very interesting article. Btrfs sounds very interesting.

Does ZFS on Linux support ARM? I'd like to give it a spin in Arch Linux ARM.

I'd like to see that too!

I worked on an ARM Linux NAS system for an SoC company for a while. I think ZFS on ARM is a good idea, mainly because of the low-cost, low-power CPUs.

Does anyone else see such need? If so, fill out this survey:


I am trying the lean startup method. :-) If fewer than 20 people show interest in this concept, I won't spend more time on it.

Have you thought about FreeBSD for the NAS? It's a pretty common base for home-brew NAS systems. I'm not sure how the ARM support is though.

In theory, yes, but in practice, 32-bit support is a work in progress. You could try it, but you would want to make certain that you boot the kernel with vmalloc set to something larger than the amount of RAM that the system has, yet smaller than the 2GB of kernel address space available on 32-bit (e.g. 1G). Otherwise, you could run into problems where kernel virtual memory allocations hang because of virtual address space exhaustion. This is due to a design decision in Linux to cripple kernel virtual memory, although it does not affect 64-bit systems because the kernel virtual address space is much larger than system memory at this time. This should be fixed in the next 6 months, but until then, you will need to be careful with it.
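Concretely, that means appending something like this to the kernel boot arguments (1G is the example figure from above; pick a value between your RAM size and the 32-bit kernel address space limit):

```
# append to the kernel command line (u-boot bootargs, extlinux APPEND, ...)
vmalloc=1G
```

After boot, the option should be visible in /proc/cmdline.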

ZFS on CoreOS anyone? [CoreOS does have btrfs support]


