
Replacing a silently failing disk in a ZFS pool - rodrigo975
https://imil.net/blog/2019/07/02/Replacing-a-silently-failing-disk-in-a-ZFS-pool/
======
js2
> What? there’s a shitton of docs on this topic! Are you stupid?

I've been using one form of Unix or another since 1994. I've been employed as
a sysadmin for many of those years. I think the git CLI is just dandy. I
love iptables. lsof? No problem. No CLI has made me feel more dumb than ZFS's.
I have more "how to" notes for it than any other Unix CLI that I use. So no,
you aren't stupid. (The Linux "ip" CLI isn't my favorite either: when you
need a Rosetta stone for your CLI, it's not a great sign
[https://access.redhat.com/sites/default/files/attachments/rh...](https://access.redhat.com/sites/default/files/attachments/rh_ip_command_cheatsheet_1214_jcs_print.pdf))

------
dusted
Before I put my first ZFS system in production, I prepared for the following
scenarios, using the actual hardware: corrupted bits on three disks; cut power
to a disk; pulled a disk out while running; decided I hated a healthy disk and
replaced it with a different disk; pulled a disk from the zfs and the root
mirror and installed a new root mirror without backups. So I now know that it
has the ability to recover from most things I could come up with. If only I
had kept notes on how to do it, but I deferred that with "I'll do the google
and maybe set up a play-system in the same state to test stuff before doing it
on the real system".. not a good admin, this one (me, myself, not the author;
kudos to them for writing about their experience and making us all smarter)..

~~~
gravypod
I think every single person in charge of a production storage system has
pulled their hair out for a while dry running these things. If only we built
playbooks for our software that showed exactly what to do in these simple
failure modes.

This post describes a very strange failure mode, so it'd probably not be in
that guide.

~~~
dusted
Yeah, it's somewhat strange. Then again, disks and drivers both trying their
best to serve probably result in many silent retries that just appear as
slight performance jitter.. It's not like the FS would know the first time a
disk re-reads a sector; the sector might not even be relocated the first time
it happens.. Maybe we need some mechanism for reporting "yeah, we got your
bits, but it wasn't as smooth sailing as usual" to filesystems such as zfs.

~~~
toast0
As marcosdumay says, SMART should handle this, but you do have to look.

Iffy sectors get marked as Pending; reallocated sectors are also counted.
There are some SMART stats for seek times too, I believe (but those weren't as
clearly associated with failure as the sector counts).

There is a danger that manufacturers will avoid acknowledging problems in
SMART, but so far I haven't heard about that happening too often. There are
certainly some drives that are dead or dying with fine SMART values, but it
seems rare, and there's a non-obvious dependency: a drive that is very
unhealthy may not be able to record new SMART values.
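
For anyone who wants to eyeball those counters by hand, smartmontools will
print them (device name is a placeholder; exact attribute names can vary a
little by vendor):

    $ smartctl -A /dev/sdX | egrep 'Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable'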

~~~
ianhowson
In my experience, SMART doesn't detect anything of value. 95% of my disk
failures are:

- Seek time gets very high (>80%).

- Drive drops dead and won't respond.

SMART doesn't detect either of these.

Setting an alert on seek times > 30ms has been by far the best predictor of
drive failure. If seek times go over ~400ms, other components start
complaining about slow/missing I/O, but by that point you've already gone
crazy because your array is unusably slow.

If a drive develops bad sectors, SMART will report it, but that's rare enough
that I don't bother watching for it. It's a non-event with ZFS because of the
checksumming. These drives never have just one random error; they start
raising hundreds in the space of minutes, and that's the moment you replace
the drive.
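
(A rough way to approximate that alert on Linux is to watch iostat's await
column. A sketch only: the column layout differs between sysstat versions, so
verify the field number against your own `iostat -dx` header first:)

    $ iostat -dx 60 | awk '/^sd/ && $10+0 > 30 { print $1, "await", $10 "ms" }'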

------
dvdgsng
Big kudos to ZFS (on Linux); it's just amazing how sane and stable it is. It
has saved my ass multiple times in the past years. I've also just finished
upgrading a RAIDZ1 vdev by replacing one disk at a time with bigger ones.
Resilvering took 15h for each disk, and there was some trouble with an
(already replaced) disk failing during that. Panic mode set in, but ZFS
provided the means to fix it quite easily - all good. Best decision ever to
pick ZFS.
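
(For reference, the one-at-a-time upgrade is roughly the following, with
hypothetical pool/device names; with autoexpand on, the extra capacity
appears once the last disk has been replaced:)

    # Let the pool grow once every member disk is bigger:
    zpool set autoexpand=on tank
    # Then, for each disk in turn:
    zpool replace tank sdb sdf    # old disk, new bigger disk
    zpool status tank             # wait for the resilver to finish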

~~~
the8472
You can replace multiple disks at a time, assuming you have enough drive
slots.

~~~
radiowave
And assuming the array has sufficient redundancy to allow multiple disks to be
offline, which a RAIDZ1 vdev doesn't.

~~~
kogir
If you have space you can “zfs attach” the new drives to make mirror vdevs
with the disks they’re replacing, and then “zfs detach” the old drives to
break the mirror when done.

This is preferable because you’re never exposed to any additional risk of data
loss and can replace more disks at once than original pool redundancy would
allow.
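
A sketch of that dance, with hypothetical pool/device names:

    zpool attach tank sdb sdf    # sdf resilvers as a mirror of sdb
    zpool status tank            # wait until the resilver completes
    zpool detach tank sdb        # break the mirror, keep the new disk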

~~~
rincebrain
Not on a raidz vdev you can't.

You can't make mirrors out of anything other than existing mirrors or single
disk vdevs.

(You _can_ run zpool replace on more disks than the pool redundancy has,
assuming you don't need to disconnect the old disks to put the new ones
in...running zpool replace on two disks in a four-disk raidz1 is perfectly
legal, as long as the old disks are still there.)
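
E.g., with the old disks still connected (hypothetical names):

    zpool replace tank sdb sdf
    zpool replace tank sdc sdg    # both replacements resilver together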

------
pmlnr
Interestingly enough, I had the reverse scenario a few weeks ago: ZFS alerted
on read/write errors on a drive, while smartctl and the kernel were perfectly
fine with it. The drive was in fact failing, as the physical noises (!) made
clear, but other than ZFS, nothing was reporting it.
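
(For reference, what surfaces this are the per-device READ/WRITE/CKSUM
counters that zpool status prints; a quick check is:)

    $ zpool status -x    # only reports pools with errors or degraded devices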

~~~
masklinn
Hardware failing is a noisier (literally in your case) version of silent data
corruption, which is one of the core original use cases and justifications for
ZFS:

> BILL MOORE We had several design goals, which we’ll break down by category.
> The first one that we focused on quite heavily is data integrity. If you
> look at the trend of storage devices over the past decade, you’ll see that
> while disk capacities have been doubling every 12 to 18 months, one thing
> that’s remaining relatively constant is the bit-error rate on the disk
> drives, which is about one uncorrectable error every 10 to 20 terabytes. The
> other interesting thing to note is that at least in a server environment,
> the number of disk drives per deployment is increasing, so the amount of
> data people have is actually growing at a super-exponential rate. That means
> with the bit-error rate being relatively constant, you have essentially an
> ever-decreasing amount of time until you notice some form of uncorrectable
> data error. That’s not really cool because before, say, about 20 terabytes
> or so, you would see either a silent or a noisy data error.

> JEFF BONWICK In retrospect, it isn’t surprising either because the error
> rates we’re observing are in fact in line with the error rates the drive
> manufacturers advertise. So it’s not like the drives are performing out of
> spec or that people have got a bad batch of hardware. This is just the
> nature of the beast at this point in time.

> BM So, one of the design principles we set for ZFS was: never, ever trust
> the underlying hardware.

> […]

> Greenplum has created a data-warehousing appliance consisting of a rack of
> 10 Thumpers (SunFire x4500s). They can scan data at a rate of one terabyte
> per minute. That’s a whole different deal. Now if you’re getting an
> uncorrectable error occurring once every 10 to 20 terabytes, that’s once
> every 10 to 20 minutes—which is pretty bad, actually.

------
the8472
_> But… at less than 40K/s! Turns out that very logically the failing disk and
its timeouts was slowing down the silvering, so I learned that to avoid this
kind of situation, you should offline the failing disk from the zpool:_

Yeah, I ran into this too. I wish raidz would route around the slow drive by
rebalancing reads to the more performant drives based on I/O queue depth
(reconstructing data from parity if needed). ZFS does this for mirrors, but
not for raidz.
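
(The offlining step the article recommends is a one-liner, hypothetical
names:)

    zpool offline tank sdb    # stop issuing I/O to the failing disk;
                              # the resilver no longer waits on its timeouts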

------
xattt
I just spent about a week trying to get a raidz pool going for a home server
under Ubuntu. It was one of the most frustrating experiences in recent
memory. My goal was to run the root system off an SSD with the heavily used
folders offloaded onto a ZFS raidz pool.

I followed the Ubuntu Root on ZFS guide with changes I thought would be
appropriate.

I gave up eventually and came to the disgruntled conclusion that it works for
beginner users who follow guides to the T, and for advanced users who know
from institutional knowledge which changes need to be made, but not for people
in between.

~~~
Filligree
That depends on your OS. Ubuntu is, frankly, very bad at it -- though few
other distributions are much better.

I'd recommend NixOS. ZFS is a proper first-class filesystem there, and you can
use it almost however you like.

~~~
tracker1
Thanks... have a used server I'm going to play with in a couple weekends...
won't have the final drives for the new shared drive setup for a couple months
(moving away from an old nas box) and going to try a few scenarios with an
nvme as the boot drive via adapter, and most storage/backup to the RAID array.

Was planning on trying both unraid and windows server as host OSes, and maybe
even using a NAS distro in a VM managing the individual drives, etc. Would
prefer something that supported docker, full vms and a friendly NAS UI, but
not sure such a beast exists.

~~~
dflock
FreeNAS is a Linux distro with a friendly NAS UI:
[https://freenas.org/](https://freenas.org/) - but still Linux, so you get
all the usual Linux stuff - docker, full VMs, etc...

~~~
chousuke
FreeNAS is based on FreeBSD, not the Linux kernel. It's from a completely
different lineage of operating systems.

~~~
dflock
Ooops, thanks for the correction!

------
tomxor
Note that his zpool is in "raidz" mode, not "mirror" mode. The latter is going
to be far better and faster at handling replacement of dead/dying drives,
because you can pretty much just remove the old one, plug the new one in, and
then update the zpool (in that order). It's also faster in normal use, since
it avoids the performance overhead of calculating and storing parity that
raidz incurs.
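
For comparison, the two layouts at pool creation time (hypothetical devices):

    # Striped mirrors: replacement is a cheap whole-disk copy.
    zpool create tank mirror sda sdb mirror sdc sdd
    # raidz1: same disks, more usable space, parity math on every write.
    zpool create tank raidz1 sda sdb sdc sdd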

------
linsomniac
I've got a RAID-Z2 array that is being kind of weird, and I'm not entirely
sure where the problem is, but it has never lost data. At first I thought it
was a marginal disk. Now I'm thinking it might be the controller.

It's amazing how easy it is to work with ZFS, though. I was able to take each
drive out of the pool, run badblocks on it (read/write, to exercise it and
look for errors), and then add it back in once it passed the burn-in test.
Easy peasy. Even with each drive being a dm-crypt device.
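
The cycle looks roughly like this (hypothetical names, and ignoring the
dm-crypt layer; badblocks -w is destructive, which is why the disk has to be
resilvered back in afterwards):

    zpool offline tank sdb     # take the drive out of the pool
    badblocks -wsv /dev/sdb    # destructive read/write burn-in
    zpool replace tank sdb     # resilver it back in once it passes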

I've been using ZFS for a very long time, and never had data loss on it. Even
back in the early days of ZFS+FUSE under Linux. At one time I had 5 big
systems running ZFS for backups of ~150 machines.

------
linsomniac
Has anyone investigated Stratis? I only recently heard about it, and it sounds
like it has a ways to go to reach feature parity, but it is targeting ZFS and
Btrfs, seems to have made a lot of progress in the last 2 years, and has
Red Hat behind it (?). It's not clear to me how closely it will match ZFS
features, but it seems like it might be on a path to maturity before Btrfs.

[https://stratis-storage.github.io/](https://stratis-storage.github.io/)

I really wish we had HAMMER on Linux.

~~~
vermaden
It's nowhere near ZFS/HAMMER/BTRFS ...

Check their FAQ here:
[https://stratis-storage.github.io/faq/](https://stratis-storage.github.io/faq/)

In short:

 _In terms of its design, Stratis is very different from ZFS/BTRFS, since
they are both in-kernel filesystems. Stratis is a userspace daemon that
configures and monitors existing components from Linux’s device-mapper
subsystem, as well as the XFS filesystem._

Red Hat's move seems very strange here, because it would be better to just
hire several BTRFS developers and join the BTRFS Linux ecosystem instead of
writing something new ... I still do not know what for ...

~~~
InvaderFizz
> it would be better just to hire several BTRFS developers and join the BTRFS
> Linux ecosystem instead of writing something new

I found this previous discussion rather enlightening on the subject:

[https://news.ycombinator.com/item?id=14907771](https://news.ycombinator.com/item?id=14907771)

------
pedrocr
If you already have a spare disk installed, doesn't it make more sense to just
add it to the pool anyway and bump up the redundancy to Z2 or Z3?

~~~
gvb
That is one of those Great Debates.

If your spare disk is a "hot spare" where it is plugged in and powered up,
yes.

If your spare disk philosophy is "cold spare" where the drive is in a box on a
shelf, probably not.

Hot spares are faster and easier to turn into replacement drives but the "hot
spare" is subject to usage-based (electrical, thermal, and mechanical) failure
modes. For a given drive, the probability of failure is largely proportional
to power-on hours[1].

Cold spares are not accumulating power-on hours so they likely won't fail
while waiting to be used and presumably won't fail for quite some time after
being installed as a replacement.

[1] Intuitively, but hard to prove because of many confounding factors.

Backblaze has the best drive failure rate data that I'm aware of:
[https://www.backblaze.com/b2/hard-drive-test-data.html](https://www.backblaze.com/b2/hard-drive-test-data.html)
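
(For the hot-spare case, ZFS has first-class support; hypothetical names, and
note that automatic spare activation is handled by the ZFS fault agent, zed
on Linux:)

    zpool add tank spare sdf    # sdf sits idle until a drive faults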

------
IronWolve
I have a few ZFS pools, mostly for temp space for tape backup. I tried using
an old attached Dell storage array for ZoL with LUNs, but ZoL had locking
issues. But we've pretty much replaced all our cheap Solaris ZFS storage with
a mix of flash/consumer-drive Nimble arrays.

------
cmurf

        Jul  2 12:51:02 <kern.crit> newcoruscant kernel: ahcich1: AHCI reset: device not ready after 31000ms (tfd = 00000080)
    

Looks like the FreeBSD kernel has a similar configuration to Linux, which is a
command timer of ~30 seconds. If the drive hangs longer, the whole link is
reset. And the reason why the drive was hanging is lost in that reset.

The likely reason why the drive is hanging, if it's a consumer drive, is that
it has a very long bad sector recovery time, which can approach 3 minutes.
That's pretty crazy.

Anyway, it's central to any RAID to be able to get a discrete read error from
the drive. That read error will include the LBA of the bad sector, and that
information is needed to know what data is affected and where to get a good
copy (from a mirror, or from reconstruction using parity). This is the same
for md RAID on Linux, ZFS, and Btrfs, and even hardware RAID depends on it.
There are too many commands in the queue to just assume which one of those
commands, and therefore which one of the requested sectors (which could be
thousands), hit the bad sector. A discrete read error with the LBA is
necessary.

And the link reset prevents that.

On Linux, this timeout is set per drive (it's the Linux SCSI command timer,
which applies to all PATA and SATA drives too; it's a kernel setting, not a
drive setting):

    $ cat /sys/block/sda/device/timeout
    30

Ideally the drive supports SCT ERC (that's for SATA, there's a SCSI/SAS
equivalent) and you use 'smartctl -l scterc' to change it to a value less than
the kernel command timer. That way the drive itself gives up faster, and
issues a discrete read error, and now the kernel (and ZFS) can figure out how
to repair the problem.
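
For example, to cap recovery at 7 seconds, comfortably under the 30-second
kernel timer (the values are in tenths of a second; device name is a
placeholder):

    # Set read and write recovery limits, then verify:
    smartctl -l scterc,70,70 /dev/sdX
    smartctl -l scterc /dev/sdX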

If the drive doesn't support configurable SCT ERC, you have to raise the
kernel command timer to an obscene level:

    # echo 180 > /sys/block/sdN/device/timeout

And now the kernel will just wait and wait and wait and wait until finally the
drive does give up on recovery and issues a discrete read error, and the
kernel and ZFS can fix it.

The late tl;dr is: this is not a ZFS problem. It is the result of kernel
command timers that were set long ago, and a refusal by kernel developers to
update them for the incredibly long (some might say crazy) deep recoveries
that consumer drives use. But here's the thing: macOS and Windows know about
these long recoveries and tolerate them without doing link resets. And that's
why they eventually recover bad sectors, though the user notices them as
performance slowdowns.

And what fixes them? A clean install. That's because a sector write that fails
due to a bad sector causes a remap, which is the same mechanism that ZFS,
Btrfs, md RAID, and hardware RAID all depend on. The read error results in
getting good data from a copy (a mirror, or reconstruction from parity); the
good copy is both sent to the application layer and used in an overwrite
command to the sector that had the read error. If that write fails, the drive
firmware itself remaps that LBA to a spare sector.

Anyway, this is a misconfiguration; the question is who is to blame for it.
And even that's complicated, because the drive doesn't announce its SCT ERC
support or value: you have to poll it. Could the kernel poll for this? I
guess? But should it? Probably not its domain. Is it a distribution question
rather than an upstream kernel question? Perhaps, yes: the distros should use
high command timers by default, and expect sysadmins to reduce them if the
use case requires it.
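
A sketch of what that distro-level policy could look like at boot, assuming
smartctl's exit status reflects whether the SCT ERC command was accepted
(worth verifying against your smartmontools version):

    #!/bin/sh
    # Prefer a 7s SCT ERC limit; where the drive doesn't support it,
    # raise the kernel command timer above worst-case recovery instead.
    for dev in /dev/sd[a-z]; do
        name=${dev#/dev/}
        if smartctl -l scterc,70,70 "$dev" >/dev/null 2>&1; then
            echo "$name: SCT ERC capped at 7s"
        else
            echo 180 > "/sys/block/$name/device/timeout"
            echo "$name: no SCT ERC, kernel timer raised to 180s"
        fi
    done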

