
File Systems, Data Loss and ZFS - ferrantim
https://clusterhq.com/blog/file-systems-data-loss-zfs/
======
lmm
I switched to FreeBSD a couple of years ago, partly for the sake of ZFS which
is a first-class filesystem on that platform. FreeBSD was much more similar to
Linux than I expected, and where there were differences, the FreeBSD way was
usually simpler. My system has been more stable ever since, and I no longer
fear hitting the "update" button.

~~~
acdha
How's package management? I switched away from *BSD a decade ago because the
sysadmin side was so far behind Debian (which, to be fair, was true of almost
everything else) with update reliability, speed, etc. What's it like in the
modern era?

~~~
laumars
FreeBSD has recently included a new binary package manager (pkg) which seems
to work well enough, but in all honesty it still feels a few years behind
apt / pacman / yum.

The nice thing is how well you can mix and match ports (source based
management, in case you weren't aware) with the binary packages. And FreeBSD
ports are surprisingly easy to manage too.

It's also worth mentioning that FreeBSD runs circles around a lot of Linux
distributions when it comes to upgrading. There's a tool, literally called
_freebsd-update_ , which is the equivalent of _apt-get dist-upgrade_ and makes
the whole process so painless you could be forgiven for thinking you were
running a rolling release OS.
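
As a rough illustration (the release version below is just an example), the
patch-and-upgrade workflow with FreeBSD's own tooling looks something like
this:

```shell
# Fetch and apply security/errata patches for the running release
freebsd-update fetch
freebsd-update install

# Upgrade the base system to a newer release (version is illustrative)
freebsd-update upgrade -r 10.1-RELEASE
freebsd-update install
# Reboot, then run `freebsd-update install` again to finish the userland

# Third-party software is handled separately by the binary package manager
pkg update && pkg upgrade
```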

~~~
feld
> FreeBSD has recently included a new binary package manager (pkg) which seems
> to work well enough, but in all honesty it still feels a few years behind
> apt / pacman / yum.

You don't expect someone to be able to write a full-featured, bug-free package
manager equivalent to those that have matured over a decade in under two
years, do you? What they've done so far is beyond impressive, and at this pace
it will be improved beyond the competition within 3 years.

~~~
laumars
I think the FreeBSD devs have done a cracking job with _pkg_ and I did comment
about how it's a new package manager to give context that it's likely to be
refined and improved upon with time.

All I was trying to do was give a balanced opinion, but it seems you can't
post anything on here these days without someone finding something to
criticise. { _sigh_ }

------
ferrantim
Thanks for the explanation of misdirected writes. I've heard the term before,
but didn't know exactly what caused it. Reading this post was like watching
one of those How Things are Made shows on the Discovery Channel. Very
interesting to see how some things I take for granted actually work.

~~~
ryao
Misdirected writes are not as well known as they should be. I am happy to
increase awareness of them.

------
oakwhiz
>ZFS is operating on a system without an IOMMU (Input Output Memory Management
Unit) and a malfunctioning or malicious device modifies its memory.

If a Linux system possessing an IOMMU was booted with iommu=pt as a kernel
command line option, does the IOMMU still protect from this type of failure?
This option puts the IOMMU into passthrough mode which is required to
successfully use peripherals on some motherboards.

~~~
ryao
No. This mode was introduced specifically for virtualization so that the IOMMU
will only restrict access to a guest machine's memory, such as when KVM is in
use:

[http://lwn.net/Articles/329174/](http://lwn.net/Articles/329174/)

The only case in which this would help is when ZFS is on the host and a device
passed through to a guest malfunctions.

------
contingencies
TLDR: "its data integrity capabilities far exceed any other production
filesystem available on Linux today"

------
Someone
_" In the case that we have two mirrored disks and accept the performance
penalty of the controller reading both, the controller will be able to detect
differences, but has no way to determine which copy is the correct copy."_

If you 'seed' the checksum algorithm for a block with the block number being
written, a subsequent read of a different block that produces the same data
will have a checksum failure. That would make it possible to choose which
block has the right data.

So, if you are willing to eat the performance, you can detect single
misdirected writes.
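
A minimal sketch of the seeding idea (pure Python, using CRC32 as a stand-in
for whatever checksum a controller might actually use; the function name is
made up for illustration):

```python
import zlib

def seeded_checksum(block_no, data):
    # Seed the CRC with the block number so identical data stored at
    # different block addresses yields different checksums.
    return zlib.crc32(data, block_no)

# At write time, block 7's checksum is computed with 7 as the seed.
payload = b"some sector payload"
stored_checksum = seeded_checksum(7, payload)

# Suppose the write was misdirected and landed where block 9 lives.
# A later read of block 9 returns block 7's payload, but the checksum,
# recomputed with 9 as the seed, no longer matches.
assert seeded_checksum(9, payload) != stored_checksum

# A correctly directed read of block 7 still verifies.
assert seeded_checksum(7, payload) == stored_checksum
```

Because the seed is folded into the check value, data that is valid at one
address fails verification when it turns up at another, which is exactly the
single-misdirected-write case described above.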

~~~
ryao
When I wrote that, I was talking about hardware RAID 1, which has no
checksums.

~~~
Someone
But the disks have their own checksums, don't they?

~~~
ryao
The low level formatting has ECC, which never leaves the drive. That said,
there are two cases to consider for misdirected writes. One is that the write
clobbers multiple sectors in which case you would get uncorrectable sectors.
The second is that it perfectly replaces another sector. In that case, the ECC
is a perfect match, as the ECC is stored with the sector. Neither drive would
report a problem, but the data would not match. This is what I described as
being a problem, and traditional RAID is incapable of dealing with it.

~~~
Someone
And that is where I stated that drives can report a problem. If they 'seed'
their ECC algorithm with the sector number (XOR-ing the result with it would
be sufficient), they can (statistically) detect that, when they read sector
#X, what they got wasn't what they ever wrote as sector #X.

In fact, I guess they already do. If they didn't, there would be misdirected
reads, too.

~~~
ryao
The low level formatting does include a sector number, but it is not part of
the ECC. I am not sure what your point is. Your theoretical description of how
hard drives could work does not reflect reality. Research by CERN and others
has confirmed the existence of misdirected writes, and deployed ZFS
installations are detecting corruption in situations where the drives report
everything is fine. Even if the storage hardware improves, having end-to-end
checksums in the filesystem will continue to make sense.

That said, I think you are fixating on one way that things can go wrong.
Another way that misdirected writes can occur is a bit flip in the micro-
controller's memory, which also allows for misdirected reads as well as reads
and writes of data with a single bit flipped. These devices' micro-controllers
do not have ECC memory. Even if it were added, you would still need to prove
the firmware free of programming bugs via formal verification; given that
these devices are black boxes that cannot be inspected, you cannot rely on the
claim of a proof even if one is done, and there would still be the possibility
of errata in the micro-controller. It is far easier to just use end-to-end
checksums in the filesystem. Even if you think the device is trustworthy,
end-to-end checksums give you the ability to verify that it is doing what it
is supposed to do. You simply do not have that with traditional RAID.
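
To make the end-to-end argument concrete, here is a toy sketch (not ZFS code;
real ZFS keeps each block's checksum in its parent block pointer, which is the
property being imitated here):

```python
import hashlib

def checksum(data):
    return hashlib.sha256(data).digest()

# Simplified ZFS-style idea: the check value is stored apart from the data
# it protects, so a device cannot corrupt both consistently.
good = b"block contents"
ptr_checksum = checksum(good)  # kept in the parent block pointer

# Two mirrored copies; one was silently corrupted somewhere in the I/O path,
# with the drive reporting no error.
mirror = [b"block contents", b"block cont3nts"]

# On read, verify each copy against the independently stored checksum and
# keep only the copies that match.
valid = [c for c in mirror if checksum(c) == ptr_checksum]
assert valid == [good]  # the good copy is identified; the bad one can be rewritten
```

This is what lets a checksumming filesystem answer the "which mirror copy is
correct?" question that plain RAID 1 cannot.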

------
IgorPartola
I found the Reordering Across Flushes section really interesting. So one rule
of thumb is that you should not use hardware RAID with battery backup? Are
there other types of hardware that would give you the same problems?

~~~
mbreese
Well, with ZFS you want to avoid hardware RAID controllers completely. The
protections from ZFS only work if the filesystem doesn't have anything in
between it and the actual disks. Depending on your vendor, it can actually be
difficult to get a card that lets you have JBOD access to a large disk array.

The only exception that I can think of is encryption. You could wrap a disk
with an encryption layer in software, but then you could still make a
separate virtual device for each disk.

~~~
ryao
A hardware RAID controller could limit ZFS' ability to provide integrity, but
not enough that I would say another filesystem does a better job there. All
filesystems are compromised by the failures that traditional RAID can
introduce.

That said, I would never recommend a hardware RAID controller for use on any
system. They add additional cost and additional failure modes, but give little
in terms of benefits.
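
For what it's worth, the usual alternative is to hand ZFS the raw disks and
let it provide the redundancy itself (device names here are illustrative):

```shell
# Create a mirrored pool directly on whole disks, with no RAID card in between
zpool create tank mirror /dev/ada0 /dev/ada1

# Periodically verify every checksum, repairing from the good mirror copy
zpool scrub tank
zpool status -v tank
```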

------
pedrocr
Does anyone have a good up-to-date comparison with btrfs on this topic?

~~~
xenophonf
Check out the previous HN thread on ZoL:

[https://news.ycombinator.com/item?id=8303333](https://news.ycombinator.com/item?id=8303333)

~~~
pedrocr
Thanks, that was very helpful. It seems btrfs is still a bit behind, some of
it by design.

Valerie Aurora, who worked on both ZFS and btrfs, seems to think the btrfs
architecture is better in a few ways:

[https://lwn.net/Articles/342892/](https://lwn.net/Articles/342892/)

~~~
xenophonf
Thanks for the link. That was a very interesting read, and btrfs sounds
promising.

------
phireph0x
Does ZFS on Linux support ARM? I'd like to give it a spin in Arch Linux ARM.

~~~
tkinom
I'd like to see that too!

I worked on an ARM Linux NAS system for an SoC company for a while. I think
ZFS on ARM is a good idea, mainly because of the low-cost, low-power CPUs.

Does anyone else see such need? If so, fill out this survey:

[https://docs.google.com/forms/d/1HRt_aYmuGkyQvBp9p2Tr3-9ygXR...](https://docs.google.com/forms/d/1HRt_aYmuGkyQvBp9p2Tr3-9ygXR0YyTe2u7rtjTirOo/viewform?usp=send_form)

I am trying the lean startup method. :-) If fewer than 20 people show interest
in this concept, I won't spend more time on it.

~~~
mbreese
Have you thought about FreeBSD for the NAS? It's a pretty common base for
home-brew NAS systems. I'm not sure how good the ARM support is, though.

------
lsllc
ZFS on CoreOS anyone? [CoreOS does have btrfs support]

~~~
ferrantim
Yes: [https://github.com/ClusterHQ/flocker/blob/zfs-on-coreos-
tuto...](https://github.com/ClusterHQ/flocker/blob/zfs-on-coreos-
tutorial-667/docs/experimental/zfs-on-coreos.rst)

