
ZFS Gotchas - pmoriarty
http://nex7.blogspot.ch/2013/03/readme1st.html
======
IgorPartola
Non-ZFS Gotchas:

Other file systems and RAID setups do not checksum your data. If you have a
mirror of two disks (RAID1) and during a read two blocks differ from one
another, most (all?) RAID controllers (hardware and software) will simply
choose the lower numbered disk as canonical and silently "repair" the bad
block. This leads to loads of silent data corruption on what we might consider
a reliable storage solution.

By contrast, ZFS stores a checksum for every block, so when the copies in a
mirror disagree it knows which one is correct, uses it as canonical, and
repairs the bad copy.
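
To see this in action, a scrub walks every block and verifies it against its
checksum, repairing anything that fails from a good copy (a minimal sketch;
"tank" is a placeholder pool name):

    zpool scrub tank        # read and verify every block in the pool
    zpool status -v tank    # the CKSUM column counts bad blocks found and repaired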

In other words, if you care about your data at all, use ZFS. I am frankly
surprised it's not the standard file system for most situations as it is the
only production filesystem that can actually be trusted with data.

P.S.: I have been told that, at least on Linux with more than two drives, the
md software RAID layer will try to choose the version of the block that the
majority of drives agree on, if that's possible. This is no guarantee, but
it's better than picking one version arbitrarily.
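
(A hedged example of how you'd even notice this with md: a check pass counts
mismatched sectors, assuming your array is md0.)

    echo check > /sys/block/md0/md/sync_action   # read-only consistency check
    cat /sys/block/md0/md/mismatch_cnt           # sectors whose copies disagreed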

P.P.S.: BTRFS and friends do not yet seem to be as production-ready as ZFS.
Conversely, ZFS works beautifully on Linux thanks to the ZFS on Linux project.

~~~
voltagex_
Sorry about the number of posts, but
[https://groups.google.com/a/zfsonlinux.org/forum/#!topic/zfs...](https://groups.google.com/a/zfsonlinux.org/forum/#!topic/zfs-devel/RUBQHIaUugE)
makes me very nervous to use ZFS on Linux.

~~~
IgorPartola
I can only speak from my own limited experience, but for me ZFS on Ubuntu
works as advertised. I was trying to put together a long term storage solution
without becoming a storage expert. ZFS fit the bill. If you are nervous, go
with a BSD: they are great choices.

~~~
tjoff
I can only speak from my own limited experience, but for me RAID6 on Linux
works as advertised...

I'll gladly take silent data corruption over a hard total failure of the
entire volume any day. At least I have a chance at fixing the former, and for
my use cases that's preferable.

Every time ZFS on Linux is brought up it's met with skepticism. As a lot of
people trying out ZFS in virtual machines discovered the hard way (lost
volumes), there is a lot that can go wrong (and it does seem to go wrong), and
any version running on Linux has not been tested nearly as much, which makes
it a risk. And risk is exactly the thing you are trying to minimize by using
ZFS in the first place...

Now BTRFS isn't ready for production yet so that leaves, well, nothing if you
want checksums and cheap snapshots.

I guess I'll have to wait another 5 years and see if BTRFS is ready, until
then it seems the only thing I can hope for is luck.

------
mhw
> Examples of things you don't want to do if you want to keep your data intact
> include using non-ECC RAM

I'm planning to build a (Linux) development workstation soon and was intending
to run ZFS on at least a proportion of the disks in the system, if not the
whole thing. But building a workstation that supports ECC appears to be
ridiculously hard and expensive - most of the Haswell CPUs don't support ECC.
Can anyone point me at a sensible CPU/motherboard combo that's comparable to
something like an i7-4770k/Gigabyte GA-Z87MX-D3H?

(As an indication of the state of the component market, pcpartpicker.com
doesn't support searching CPUs or motherboards by ECC support, but it does
allow you to search by what colour the motherboard is!)

~~~
Nanzikambe
YMMV, but I've used ZFS on all my workstations and servers for over 2 years
now. Many of them are ZFS on top of LUKS/cryptsetup for FDE. None of the
machines use ECC RAM, and due to lots of things (i.e. me doing dumb shit, bad
cutting-edge kernel builds, fubar GFX drivers causing X to boot to a black
screen, etc.) I frequently either hard reset them or Alt + PrtScreen + REISUB
them. I've never had a single issue with ZFS.

By comparison, I've had issues with virtually every other filesystem that
exists for Linux: ext(2,3,tho not 4), reiserfs, and especially the much hated
xfs, death to xfs etc :)

I run this hardware, picked because... well, cheap:

    
    
       GIGABYTE GA-990FXA-UD5 ( AMD 990FX/SB950 - Socket AM3+ - FSB5200 ) 
       4 x DDR3 4GB DDR1866 (PC3-15000) - KINGSTON HyperX [KHX1866C9D3/4G] 
       or 
       4 x 16gb of the same type as above
       AMD FX 8-Core FX-8320 [Socket AM3+ - 1000Kb - 3.5 GHz - 32nm ]
    

These setups all include root + swap running on ZFS with FDE; this sort of
thing for workstations:

    
    
       NAME                    USED  AVAIL  REFER  MOUNTPOINT
       rpool                  13.2G  25.9G    30K  none
       rpool/root             11.1G  25.9G  5.10G  /
       rpool/root/home-users  6.02G  25.9G  6.02G  /home
       rpool/swap             2.06G  27.4G   646M  -
    

servers:

    
    
       NAME                         USED  AVAIL  REFER  MOUNTPOINT
       rpool0/ROOT/ROOT               6G  1.42T   198K  /
       rpool0/ROOT/swap              32G  1.42T   700M  -
       rpool0/ROOT/BACKUP           676G  1.42T   676G  /nfs/backup
       rpool0/ROOT/HOME            14.1G  1.42T  14.1G  /home
       rpool0/ROOT/MUSIC            358G  1.42T   358G  /nfs/music
       rpool0/ROOT/portage         14.9G  1.42T  14.9G  /nfs/portage
       rpool0/ROOT/vmware          45.9G  1.42T  45.9G  /vmware
    
    

The server hosts several virtual machines, many of which have their own root
over NFS, the point being that the ZFS gets plenty of use.

~~~
dTal
You ever try JFS? I standardised on it for all my machines a few years ago and
it's served me very well - fast fsck and no corruption so far. Bit long in the
tooth now but that's no bad thing for a filesystem.

~~~
Nanzikambe
I have not. The only reason for not trying it is that, with my increasing use
of RAID in hardware and VMs, I wanted something completely agnostic to disks,
which is something ZFS delivers but JFS didn't seem to when I checked.

Basically I wanted something that doesn't intertwine the concept of
filesystem with the concept of disks or partitions, but treats them simply as
space. So changes to
the underlying pool of disks (or partitions, or loopback files, etc) don't
mean so much painful rebuilding for me, something I spent many hundreds of
hours on in the past, often at stupid o'clock in the morning. Did I mention
how much I hate XFS? That's why :)

------
javajosh
What I would like to see is a Feynman-style "Introduction to Filesystems"
geared toward programmers. And when I say Feynman-style, I mean lucid, devoid
of jargon, and dense with meaningful concepts.

Granted, most application programmers don't deal with the file system in any
meaningful way - most interaction is deferred to other processes (e.g. the
datastore). But I for one would be interested in certain silly things like,
oh, writing a program that can (empirically) tell the difference between a
spinning disk and a solid state disk, or what file system it's running on,
just from performance characteristics. Other fun things would be to determine
just how fast you can write data to disk, and what parameters make this rate
faster or slower.
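
For instance, a first stab at the spinning-vs-solid-state question might just
compare sequential throughput against 4k random-read latency (a sketch using
fio; the filename and sizes are arbitrary):

    # sequential reads: both kinds of disk do reasonably well here
    fio --name=seq --rw=read --bs=1M --size=512m --direct=1 --filename=testfile
    # random 4k reads: a spinning disk shows several ms per op, an SSD well under 1 ms
    fio --name=rand --rw=randread --bs=4k --size=512m --direct=1 --filename=testfile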

------
voltagex_
All this article tells me is that I'm not smart enough to run a ZFS NAS. But
what are my alternatives?

~~~
dap
I wouldn't take that lesson from it.

Many of these considerations don't have anything to do with ZFS per se, but
come up in designing any non-trivial storage system. These include most of the
comments about IOPS capacity, the comments about characterizing your workload
to understand whether it would benefit from separate intent log devices and
SSD read caches, the notes about quality hardware components (like ECC RAM),
and most of the notes about pool design, which come largely from the physics
of disks and how most RAID-like storage systems use them.

Several of the points are just notes about good things that you could ignore
if you want to, but are also easy to understand, like "compression is good."
"Snapshots are not backups" is true, but that's missing the point that
constant-time snapshots are incredibly useful even if they don't _also_ solve
the backup problem.
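
For example, a snapshot is effectively instant regardless of dataset size, and
pulling an old file out of one (or rolling the whole dataset back) is trivial
(hypothetical dataset name):

    zfs snapshot tank/home@before-upgrade        # constant-time, any data size
    zfs list -t snapshot                         # space used starts out near zero
    ls /tank/home/.zfs/snapshot/before-upgrade   # browse old versions read-only
    zfs rollback tank/home@before-upgrade        # or revert the whole dataset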

Many of the caveats are highly configuration specific: hot spares are good
choices in many configurations; the 128GB DRAM limit is completely bogus in my
experience; and the "zfs destroy" problem has been largely fixed for a long
time now.

~~~
voltagex_
There was a cheesy "CSI" video with some Sun engineers building a zfs pool
from a crapton of USB sticks. I think I might do that to have a play with it
before I spend ~$1500AUD building a NAS.

~~~
enneff
Or you could just use multiple partitions on a single scratch disk (or even a
ramdisk). A great way to get familiar with the workflow and tools without
going to a lot of effort. This is how I first experimented with ZFS.

~~~
danudey
You can also create files on disk and add them to pools.

There's actually a trick you can use to create a deliberately degraded ZFS
array (e.g. if you want to create a 5+1 array with only five disks): use dd to
create a sparse file of the appropriate size, which lets you make a '1 TB'
file while only writing 1 byte to disk. Add it to your zpool along with the
rest of your disks, then fail it out, remove it, and replace it with a
physical disk.

It's a good way of taking a drive with 2 TB of data on it and adding it to a
ZFS filesystem which includes that disk while also keeping the data, but
without using a separate disk (or copying the data twice).
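
Roughly, the trick looks like this (a sketch with hypothetical device names;
double-check before trying it on real data):

    dd if=/dev/zero of=/tmp/fake.img bs=1 count=1 seek=1T   # ~1 TB sparse file, 1 byte written
    zpool create tank raidz1 sdb sdc sdd sde sdf /tmp/fake.img
    zpool offline tank /tmp/fake.img                        # "fail out" the placeholder
    # ...copy the data off the old drive into the pool, then hand that drive over:
    zpool replace tank /tmp/fake.img sdg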

~~~
voltagex_
Crap, I should have done that. Oh well, 4 USB sticks was only $24.

------
brongondwana
I don't see any mention of "for god's sake don't let your pool get more than
90% full or the molasses gremlins will ooze out of your disk controllers and
gum up the platters" in that list...
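
(Worth adding, though. The CAP column of "zpool list" is the thing to keep an
eye on:)

    zpool list tank    # SIZE / ALLOC / FREE / CAP -- keep CAP comfortably under ~90%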

------
fiatmoney
I'd add one more: it is impossible to remove a vdev from a pool. That means
in a small server / home scenario you have less flexibility to throw in an
extra mirror vdev of old disks you had lying around, or to gradually change
your storage architecture if HDD sizes increase faster than your data
consumption.

However, none of these diminish what I think is the main use case of ZFS: very
robustly protecting against corruption below the total-disk-failure level.

~~~
voltagex_
Does this mean I couldn't start with 4x3TB drives and move to 4TB drives as
they become cost-effective?

This might kill my dreams of a ZFS NAS...

~~~
herf
You can remove each disk and replace with a bigger one (wait for rebuild each
time), and if you set autoexpand, the pool will resize afterwards.
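
Roughly, one disk at a time (hypothetical pool and device names):

    zpool set autoexpand=on tank
    zpool replace tank old_disk new_disk   # resilvers onto the new disk
    zpool status tank                      # wait for the resilver before the next swap
    # once the last disk is replaced, the pool grows to the new size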

~~~
voltagex_
Doing some more reading, this sounds like it "degrades" the array each time
and could be risky. Other replies to my comment seem to suggest it can't be
done.

~~~
takeda
I believe there are no problems with your scenario; there's a zpool replace
command which does this:

    
    
         zpool replace [-f] pool device [new_device]
    
             Replaces old_device with new_device.  This is equivalent to attaching
             new_device, waiting for it to resilver, and then detaching
             old_device.
    
             The size of new_device must be greater than or equal to the minimum
             size of all the devices in a mirror or raidz configuration.
    
             new_device is required if the pool is not redundant. If new_device is
             not specified, it defaults to old_device.  This form of replacement
             is useful after an existing disk has failed and has been physically
             replaced. In this case, the new disk may have the same /dev path as
             the old device, even though it is actually a different disk.  ZFS
             recognizes this.
    
             -f      Forces use of new_device, even if it appears to be in use.
                     Not all devices can be overridden in this manner.

~~~
mosselman
Thanks for pointing this out. I still don't know how to see all of this in a
home NAS context, however. It would be really great if someone could explain.

Let's say I have 6 SATA ports. I have 4 drives that I collected from various
computers and now want to unify in a home-built NAS:

    
    
       A. 1TB
       B. 2TB
       C. 4TB
       D. 1TB
       E. -empty-
       F. -empty-
    

Now all my drives are full and I want to either add a disk or replace a disk.
How do I:

    
    
       1. replace disk A. with a 4TB disk
       2. add a 4TB disk on slot E.

~~~
enneff

        1. connect 4T disk to F, "zpool replace pool A F"
        2. connect 4T disk to E, "zpool add pool E"
    

Where A, E, and F stand for the device names your OS gives those disks. On my
FreeBSD system they are da0, da1, etc.

~~~
mosselman
Thanks. There are a lot of complicated articles spread around which make ZFS
seem very hard for some reason. This seems fine.

------
flyinghamster
Another gotcha: If you intend to physically move a pool of disks from one
system to another, or even move them to different SATA or SAS ports on the
same system (or different drive bays in a JBOD), you should EXPORT THE POOL
FIRST. I came within a hair's breadth of losing a pool when I didn't do that,
and import attempts resulted in "corrupted data" messages. I managed to
straighten that out and recover the pool, but it was a close call.

When it is not exported, the pool stores the device names of its component
disks, and if the wrong disks end up on the wrong devices, you get the
"corrupted data" problem even though the data really isn't corrupted.

------
rdtsc
Article mentions clustered file systems.

Is there any reason why one might choose that for a small office / home
office setup instead of ZFS? Would it make setup/expansion easier?

Taking it in another direction.

I've been following the LeoFS project, which basically replicates Amazon's S3
storage API (so any S3 client can work with it).

[http://leo-project.net/leofs/](http://leo-project.net/leofs/)

Its claim is that one can add nodes to the cluster to expand its storage
capacity. Given enough physical space, it might be cheaper to find 3-4
machines (towers) with 4 drive bays each than to build one new server with
12+ bays. Then use Linux's FUSE client to access it.

~~~
lmm
I've never seen a clustered filesystem that was friendly or simple enough for
the small office / home use case. With ZFS I pile a bunch of disks in a
server, run samba on it, and there you go. Still, best of luck if you try it.
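
The whole setup really is about three commands plus an smb.conf entry (device
names made up, untested as written):

    zpool create tank raidz1 da1 da2 da3 da4
    zfs create tank/share
    # then point a [share] section in smb.conf at /tank/share (or use the
    # sharesmb property where the platform supports it)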

~~~
e12e
The only problem with ZFS for budget/home use, as I see it, is the inability
to grow an existing raidz vdev (add a single disk to get more space on an
existing filesystem). Basically the design is more "buy what you need (e.g.
4x2TB), then replace (e.g. with 4x4TB as prices come down) -- and throw away
your old disks".

~~~
rdtsc
Wouldn't "add more disks to get more space" also be a problem due to physical
limitations on home computer style machine (even if ZFS allowed that).

At some point 2 older towers or hardware with only 4 drive bays might be
cheaper than one motherboard and server tower that takes 8 or 12 drives.

That is why I was wondering about the clustered/distributed FS option.

~~~
e12e
If you're building something today and get an Avoton board, you'll have 10+
SATA ports on the board. I don't know what kind of midi-tower only supports
4x 3.5" drives -- maybe you're thinking hot-swap, front-accessible?

If you get a full tower, you should be able to fit a couple of (or
equivalent):
[http://www.newegg.com/Product/Product.aspx?Item=N82E16817198...](http://www.newegg.com/Product/Product.aspx?Item=N82E16817198058)

(3.5" & 5.25" Black Tray-less 5 x 3.5" HDD in 3 x 5.25" Bay SATA Cage) in
front -- if yo feel you need front-accessible drives.

I'd say you normally want fewer boxes to take care of (even if a failure will
be more catastrophic) on such a small scale. Depends on what you need, of
course.

------
brunoqc
Is there a similar article but about BTRFS?

~~~
pronoiac
I'll say that reading about the ext3 to btrfs conversion made me laugh - it's
a mad scientist thing that builds all the btrfs pointers in the ext3 free
space, pointing to the data blocks as stored by ext3, and it keeps the
filesystem readable as ext3 until it's completed and mounted as btrfs.

~~~
seunosewa
What's funny about it?

------
dang
Changed the document title to avoid the linkbait in "Things Nobody Told You
About ZFS".

~~~
rosser
While I don't disagree that the original title is perhaps mildly link-baity, I
don't think "Read me first" is very apt, either; this article assumes a
nontrivial degree of familiarity with ZFS and its concepts.

~~~
dang
Fair enough; if you or anyone would like to suggest a neutral, accurate title,
we'd be happy to change it.

~~~
rosser
Something along the lines of "ZFS gotchas" or "ZFS caveats", perhaps?

Unrelated: I really don't think dang's post here warranted downvoting. I
expect that it's being done by folks who _really_ want HN's moderators to know
they're unhappy about titles being changed as capriciously as they sometimes
are. This post is an actual moderator soliciting actual input on how to make
at least one instance of that problem better, however, and I think that should
be encouraged, not dinged.

~~~
dang
Ok, we changed it to "ZFS Gotchas" because the body of the article more or
less uses that word.

It's ok for dang to get dinged. But "capricious"? No. I realize it sometimes
seems that way to people paying sporadic attention, but every change we make
is in keeping with the HN guidelines, and those are hardly secret or unclear.

We make mistakes, of course, and are always open to improvement. The helpful
way to criticize an HN title is to suggest a better one.

~~~
lmm
It may be the case that every change is in keeping with the guidelines, but
it's certainly not the case that the guidelines are applied consistently.
Looking at stories where the submitter added some useful, non-linkbaity detail
that wasn't in the original title, it seems entirely arbitrary whether that
detail will be left in place or cut from the title. And belittling your users'
experience like that ("sporadic attention"?) will make you few friends.

Better policy a): Don't change the titles users submit. Trust the community to
flag linkbait titles.

Better policy b): Don't allow users to submit a title. Automatically scrape
the title from the linked page. Have moderators change the titles when they're
unhelpful or misleading (this is much less likely to annoy users than the
current system, because the moderator wouldn't be replacing the submitter's
title, it would be the HN system replacing another part of the HN system).

edit: Suggest a better _title_ rather than suggest a better policy? How is one
supposed to suggest a better title? There is no form for that, and it's not
something to clutter the comments with.

------
kevin1024
> Do not use raidz1 for disks 1TB or greater in size.

Oops, I'm doing that on my home NAS. Does anyone know why this is bad?

~~~
edude03
I am as well (5x 3TB in raidz1). I'm pretty sure it's because the likelihood
of hitting an unreadable bit/byte/sector on one of the non-failed disks during
a rebuild gets higher as capacity increases, and thus there is a good chance
that you'll lose some data. This article discusses the theory:
[http://www.zdnet.com/blog/storage/why-raid-5-stops-working-i...](http://www.zdnet.com/blog/storage/why-raid-5-stops-working-in-2009/16)
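
A back-of-the-envelope version of that argument, assuming the commonly quoted
consumer-drive spec of one unrecoverable read error (URE) per 10^14 bits:

    # rebuilding a 5x3TB raidz1 after one failure reads the 4 surviving disks, ~12 TB
    awk 'BEGIN { bits = 12e12 * 8; rate = 1e-14;
                 printf "expected UREs: %.2f, P(at least one): %.0f%%\n",
                        bits * rate, 100 * (1 - exp(-bits * rate)) }'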

~~~
voltagex_
Is there a way to check statistics of failed read/writes?

~~~
dap
"zpool status" will show if there have been errors reading data from
individual devices. If a drive experiences enough failures, at least on
illumos and Solaris-based systems, it will be marked degraded or faulted and
removed from service. You can view individual failures on these systems with
"fmdump -e". Here's a made-up worked example:
[https://blogs.oracle.com/bobn/entry/zfs_and_fma_two_great](https://blogs.oracle.com/bobn/entry/zfs_and_fma_two_great)
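
The READ/WRITE/CKSUM columns in "zpool status" are per-device error counters;
a made-up example of what a flaky disk looks like:

    zpool status tank
    ...
            NAME        STATE     READ WRITE CKSUM
            tank        ONLINE       0     0     0
              mirror-0  ONLINE       0     0     0
                c4t0d0  ONLINE       0     0     0
                c4t1d0  ONLINE       0     0    12   <-- checksum errors repaired from the good side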

------
flyinghamster
Also note, if you're using a SAS HBA: the LSI 1068E-R based HBAs, when
flashed with IT firmware, have a persistent drive mapping setting. When it is
active, it will always give a specific physical disk the same device name,
even if it gets moved to a different physical port. Let's say you have a drive
with serial number ABC123 on port 0 as c4t0d0 and one with serial number
XYZ987 on port 1 as c4t1d0. You could swap the two drives and ABC123 would
still be c4t0d0 and XYZ987 would still be c4t1d0.

If all ports on the controller were in use, and you yanked c4t7d0 and slid in
a new drive, it would become c4t8d0 (unless you used lsiutil to remove the
persistent mapping).

Other HBAs may have similar features.

------
namewithhe1d
Ok, say you enabled dedup accidentally and now your dedup table is huge. I'm
sure it is a slog to go back, but HOW DO YOU DO IT? I've tried the method
here:
[https://groups.google.com/a/zfsonlinux.org/forum/#!topic/zfs...](https://groups.google.com/a/zfsonlinux.org/forum/#!topic/zfs-discuss/at3Cb08d21A)
and didn't get it fixed. Any more thoughts? Please don't recommend I make a
new pool and send all the data over; I don't have the space, at 70% capacity.
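
What I understand so far (hedged, corrections welcome): turning dedup off only
affects new writes, so the dedup table only shrinks as already-deduped blocks
are freed or rewritten. Something like:

    zfs set dedup=off tank    # stop adding DDT entries for new writes
    zdb -DD tank              # print dedup table (DDT) statistics to see its size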

------
colechristensen
Overall, this is quite good. Some of it is oversold as "things nobody told you
about" and some of it could really benefit from real data. A few things are a
little too close to seemingly unsupported folk-wisdom rather than sound
advice.

------
Touche
Something I never see discussed: what is ZFS's effect on battery life? I have
not put it on laptops because of this concern.

